Issue with XmlTextWriter

Topics: Developer Forum
May 19, 2011 at 3:16 PM
Edited Jun 21, 2011 at 8:28 AM

Hello,

If I save my HtmlDocument to an XML file and then load that file into an XmlDocument, it works fine.

But when I use an XmlTextWriter, I get the error below when I load the result into the XmlDocument:

The ':' character, hexadecimal value 0x3A, cannot be included in a name. Line 1, position 369.

Do you have any idea why?

Here is the code:

 

        private void uploadData()
        {
            //create xml document to store all data
            XmlDocument xmlNewTask;
            XmlDocument XML2Html;
            XmlElement xmlNodeRoot = null;           
           
            HtmlAgilityPack.HtmlDocument Htmldoc = new HtmlAgilityPack.HtmlDocument();
            Htmldoc.OptionFixNestedTags = true;
            Htmldoc.OptionOutputAsXml = true;           

            xmlNewTask = new XmlDocument();
            XML2Html = new XmlDocument();

            xmlNodeRoot = xmlNewTask.CreateElement("NewTask");   

            try
            {
                System.IO.MemoryStream stream = new System.IO.MemoryStream();
                XmlTextWriter xtw = new XmlTextWriter(stream, null);

                Htmldoc.LoadHtml(webEditorControlForm1.EditedText.Replace("\r\n", ""));                

                // save the content into the writer
                Htmldoc.Save(xtw);
                //Htmldoc.Save(@"D:\Dev\test.xml");               

                // Rewind the memory stream
                stream.Position = 0;

                //load the html into the XML document
                XML2Html.LoadXml(new System.IO.StreamReader(stream).ReadToEnd());
                //XML2Html.Load(@"D:\Dev\test.xml");
            }
            catch (Exception ex)
            {
                System.Windows.Forms.MessageBox.Show(
                    "Process cannot convert HTML to XML and report the following error:\n" + ex.Message,
                    "failed to convert HTML to XML!", MessageBoxButtons.OK, MessageBoxIcon.Warning);
            }

            xmlNewTask.AppendChild(xmlNodeRoot);
        }

 

And here is the HTML string:

<!DOCTYPE HTML PUBLIC "-/W3C/DTD HTML 4.0 Transitional/EN"><HTML><HEAD><META content="text/html; charset=utf-8" 
http-equiv=Content-Type><META name=GENERATOR content="MSHTML 8.00.6001.19046"></HEAD><BODY><P style="MARGIN: 0in 0in 
0pt"><SPAN style="COLOR: #1f497d"><FONT face=Calibri>Hello Again,<?xml:namespace prefix = o ns = 
"urn:schemas-microsoft-com:office:office" /><o:p></o:p></FONT></SPAN></P><P style="MARGIN: 0in 0in 0pt" 
class=MsoNormal><SPAN style="COLOR: #1f497d"><o:p><FONT face=Calibri>&nbsp;</FONT></o:p></SPAN></P><P style="MARGIN: 0in 
0in 0pt"><SPAN style="COLOR: #1f497d"><FONT face=Calibri>Please see my comments 
below<o:p></o:p></FONT></SPAN></P><P style="MARGIN: 0in 0in 0pt"><SPAN style="COLOR: #1f497d"><o:p><FONT 
face=Calibri>&nbsp;</FONT></o:p></SPAN></P><P style="MARGIN: 0in 0in 0pt"><B><SPAN style="COLOR: 
red"><FONT face=Calibri>: Three tasks below will take 2.5 days only, could you pls 
check?<o:p></o:p></FONT></SPAN></B></P><P style="MARGIN: 0in 0in 0pt"><SPAN style="COLOR: #17365d"><FONT 
face=Calibri>: I already completed several actions (new build creation, deployment ) that is why it remains only 2.5 
days<o:p></o:p></FONT></SPAN></P><P style="MARGIN: 0in 0in 0pt"><SPAN style="COLOR: #17365d"><FONT 
face=Calibri>For the first <B>activate Email notification</B> most of the job will be on your side that is to says fill 
for each task and each task status who should be notified.<o:p></o:p></FONT></SPAN></P><P style="MARGIN: 0in 0in 0pt" 
class=MsoNormal><SPAN style="COLOR: #17365d"><FONT face=Calibri>For the training I put 2*2 hours due users time zone 
differences<o:p></o:p></FONT></SPAN></P><P style="MARGIN: 0in 0in 0pt"><o:p><FONT 
face=Calibri>&nbsp;</FONT></o:p></P><P style="MARGIN: 0in 0in 0pt"><FONT face=Calibri><SPAN 
style="BACKGROUND: yellow; mso-highlight: yellow">[AX] got it. Thanks!</SPAN><o:p></o:p></FONT></P><P style="MARGIN: 0in 
0in 0pt"><B><SPAN style="COLOR: red"><o:p><FONT face=Calibri>&nbsp;</FONT></o:p></SPAN></B></P><P 
style="MARGIN: 0in 0in 0pt"><FONT face=Calibri><B><SPAN style="COLOR: red">:</SPAN></B><SPAN 
style="COLOR: red"> Total 7.5 hours for all tasks below exclude FAQ?<o:p></o:p></SPAN></FONT></P><P style="MARGIN: 0in 
0in 0pt"><FONT face=Calibri><SPAN style="COLOR: #17365d">: Here again that will be N and A</SPAN><SPAN 
style="COLOR: #1f497d; mso-themecolor: dark2"> </SPAN><SPAN style="COLOR: #17365d">that will have to produce the biggest 
effort by creating the documentation that is why 1 day for these task seems correct.<o:p></o:p></SPAN></FONT></P><P 
style="MARGIN: 0in 0in 0pt"><SPAN style="COLOR: #1f497d"><o:p><FONT 
face=Calibri>&nbsp;</FONT></o:p></SPAN></P><P style="MARGIN: 0in 0in 0pt"><FONT face=Calibri><SPAN 
style="BACKGROUND: yellow; COLOR: #1f497d; mso-highlight: yellow">[AX] Regarding the big effort from N and A, my 
understand is they need to read carefully on those documentation and raise question?</SPAN><SPAN style="COLOR: #1f497d"> 
<o:p></o:p></SPAN></FONT></P><P style="MARGIN: 0in 0in 0pt"><B><SPAN style="COLOR: red"><o:p><FONT 
face=Calibri>&nbsp;</FONT></o:p></SPAN></B></P><P style="MARGIN: 0in 0in 0pt"><SPAN style="COLOR: 
#17365d"><FONT face=Calibri>Regards<o:p></o:p></FONT></SPAN></P><P style="MARGIN: 0in 0in 0pt"><SPAN 
style="COLOR: #1f497d"><o:p><FONT face=Calibri>&nbsp;</FONT></o:p></SPAN></P><P style="MARGIN: 0in 0in 0pt" 
class=MsoNormal><SPAN style="FONT-FAMILY: 'Arial','sans-serif'; COLOR: #003399; mso-ansi-language: FR" 
lang=FR>________________________________</SPAN><SPAN style="COLOR: #1f497d; mso-ansi-language: FR" lang=FR><BR 
style="mso-special-character: line-break"><BR style="mso-special-character: line-break"></SPAN><SPAN style="FONT-FAMILY: 
'Arial','sans-serif'; COLOR: #262626; FONT-SIZE: 10pt; mso-ansi-language: FR" lang=FR><o:p></o:p></SPAN></P><P 
style="MARGIN: 0in 0in 0pt"><SPAN style="COLOR: #1f497d; mso-ansi-language: FR" lang=FR><FONT 
face=Calibri>&nbsp;&nbsp;</FONT></SPAN><B><SPAN style="FONT-FAMILY: Webdings; COLOR: green; FONT-SIZE: 26pt; 
mso-ansi-language: FR" lang=FR>P</SPAN></B><SPAN style="FONT-FAMILY: 'Tahoma','sans-serif'; COLOR: green; FONT-SIZE: 
7.5pt; mso-ansi-language: FR" lang=FR> </SPAN><SPAN style="FONT-FAMILY: 'Arial','sans-serif'; COLOR: green; FONT-SIZE: 
8pt; mso-ansi-language: FR" lang=FR>Avant d'imprimer ce mail, assurez-vous que&nbsp;cela est 
nécessaire.&nbsp;</SPAN><SPAN style="FONT-FAMILY: 'Arial','sans-serif'; COLOR: black; FONT-SIZE: 10pt; 
mso-ansi-language: FR" lang=FR><o:p></o:p></SPAN></P><P style="MARGIN: 0in 0in 0pt"><I><SPAN 
style="FONT-FAMILY: 'Arial','sans-serif'; COLOR: green; FONT-SIZE: 8pt">Before printing this email, assess if it is 
really needed.<o:p></o:p></SPAN></I></P><P>&nbsp;</P></BODY></HTML>
May 19, 2011 at 5:34 PM

Html Agility Pack doesn't support XML namespaces.  When you try to reinterpret the HTML tree as XML, it sees elements like "o:p", but that's not a valid element name (it isn't in HTML either, but browsers will probably just ignore it, or ignore the prefix).  What "should" happen is that an element "p" is created in some namespace with namespace prefix "o", but Html Agility Pack simply doesn't support that kind of processing.
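
The point about prefixes can be seen with plain System.Xml, without Html Agility Pack at all. This is only a minimal sketch; the names `PrefixDemo` and `TryLoad` are invented for the example. An "o:p" element is rejected while the prefix is undeclared and accepted once the prefix is bound to a namespace.

```csharp
using System;
using System.Xml;

public static class PrefixDemo
{
    // Returns null when the XML loads, otherwise the parser's error message.
    public static string TryLoad(string xml)
    {
        try
        {
            new XmlDocument().LoadXml(xml);
            return null;
        }
        catch (XmlException ex)
        {
            return ex.Message;
        }
    }

    public static void Main()
    {
        // The prefix "o" is never declared, so the namespace-aware parser refuses it.
        Console.WriteLine(TryLoad("<body><o:p /></body>") ?? "loaded");

        // Declaring the prefix makes the very same element legal.
        var doc = new XmlDocument();
        doc.LoadXml("<body xmlns:o=\"urn:schemas-microsoft-com:office:office\"><o:p /></body>");
        var node = doc.DocumentElement.FirstChild;
        Console.WriteLine(node.Prefix + " / " + node.LocalName + " / " + node.NamespaceURI);
    }
}
```

The second call is what a namespace-aware HTML-to-XML conversion would have to produce: local name "p", prefix "o", and a namespace URI taken from the declaration.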

There are 3 choices:

 - Clean up the HTML before parsing it, for example with HTML Tidy.  Make sure to select the setting that cleans up MS Office gunk.

 - Fix Html Agility Pack.  This isn't as daunting as it sounds, since you can just change HtmlNodeNavigator to preprocess any element names by removing any namespace prefixes.

 - (simplest) Preprocess with Html Agility Pack to rename elements called "o:p" to "p", or to replace the elements by their content (depends on what you want to do with those nonsensical elements: are they paragraphs with a weird Office prefix, or are they just nonsense?).
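
A rough sketch of the renaming variant of the third choice might look like the following. `PrefixStripper` and `StripPrefixes` are hypothetical names, and the sketch assumes that Html Agility Pack keeps the prefix as part of the element name (so "o:p" shows up as a node named "o:p") and that `HtmlNode.Name` is settable; treat it as a starting point, not a tested fix.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class PrefixStripper
{
    // Hypothetical helper: renames "o:p" to "p" by dropping everything up to the first colon.
    // Names starting with '?' (processing instructions) are deliberately left alone.
    public static void StripPrefixes(HtmlDocument doc)
    {
        var prefixed = doc.DocumentNode.DescendantNodesAndSelf()
            .Where(n => n.NodeType == HtmlNodeType.Element
                        && !n.Name.StartsWith("?")
                        && n.Name.IndexOf(':') >= 0
                        && n.Name.IndexOf(':') < n.Name.Length - 1)
            .ToArray(); // take a snapshot before changing anything

        foreach (var node in prefixed)
        {
            node.Name = node.Name.Substring(node.Name.IndexOf(':') + 1);
        }
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<p>Hello<o:p></o:p></p>");
        StripPrefixes(doc);

        // List the element names that remain after renaming.
        foreach (var n in doc.DocumentNode.DescendantNodesAndSelf()
                             .Where(n => n.NodeType == HtmlNodeType.Element))
        {
            Console.WriteLine(n.Name);
        }
    }
}
```

Renaming turns an Office "o:p" into an ordinary "p" nested inside another paragraph, which may or may not be what you want; replacing the element by its content avoids that.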

May 20, 2011 at 9:07 AM

Thanks for your solutions.

In fact, I have Office gunk in the markup because users copy parts of emails from Outlook and paste them into my web form.

I have tried to remove those <o:p> nodes, but I get the same error.

I will try to use Tidy to clean up the HTML.

May 20, 2011 at 9:22 AM

In that case (though you've probably already found it), look at http://sourceforge.net/projects/tidynet/ - it's a .NET port of Tidy, which may be easier for you to use.  It's probably not quite as good as the original, and I've never used it, but it's probably worth a shot.  Good luck!

Jun 2, 2011 at 3:29 PM

The patch to issue 29218 includes a fix for this issue: it removes all xml namespace declarations and namespace prefixes: http://htmlagilitypack.codeplex.com/workitem/29218

Jun 17, 2011 at 3:17 PM

I've tried tidynet.dll, but I cannot manage to clean up the stream.

When you talk about cleaning up the MS Office gunk, do you mean the CleanWork2000 option? It is the only one I see for Office.

Another point:

To use the patch, do I need to manually update the files mentioned, or are there files somewhere that already include this patch?

Regards

 

Jun 20, 2011 at 3:45 PM

You'll need to apply the patch yourself.  This isn't that hard if you've used TortoiseSVN before: simply check out HtmlAgilityPack (the URL is https://htmlagilitypack.svn.codeplex.com/svn), right-click the repository and select "Apply Patch".  That's it!  You can then compile the project and do whatever you want with the DLLs.

Jun 20, 2011 at 4:02 PM

Thanks for the tips.

I will try to follow your instructions.

Regards

Jun 20, 2011 at 4:36 PM
Edited Jun 21, 2011 at 8:32 AM

Okay, I took a look at your sample, and it contains even more gibberish & some unsupported stuff.

It's using elements with namespace prefixes (which HtmlAgilityPack doesn't support).  I've patched that too, and attached a third version of this patch set which strips namespace prefixes from the XPathNavigator (rather than crash).

Secondly, there's a weirdly malformed XML processing instruction in your sample.  Best to remove that in code, e.g.:

 

using System;
using System.IO;
using System.Linq;
using System.Xml;
using HtmlAgilityPack;

var doc = new HtmlDocument();
doc.Load(@"C:\Users\Eamon\Desktop\htmlMSO.html");

//doc.OptionOutputAsXml = true; //unnecessary now.
using (var sw = new StringWriter())
using (var xtw = XmlWriter.Create(sw, new XmlWriterSettings { Indent = true }))
{
    // You should remove elements starting with '?' - these are processing instructions (unsupported).
    var bad = doc.DocumentNode.DescendantNodesAndSelf().Where(
        node => node.NodeType == HtmlNodeType.Element && node.Name.StartsWith("?"));
    foreach (var node in bad.ToArray()) node.Remove();

    //doc.Save(xtw); //DON'T use Save to generate XML; Save doesn't use the new XML code and will thus still crash.
    // Instead, write through the XPathNavigator, which has been fixed.
    // This is basically equivalent to Save.
    doc.CreateNavigator().WriteSubtree(xtw);
    xtw.Close();
    Console.WriteLine(sw.ToString());
}

The above code works on your example document, producing (I've added a few linebreaks manually):

 

<?xml version="1.0" encoding="utf-16"?>
<!--<!DOCTYPE HTML PUBLIC "-/W3C/DTD HTML 4.0 Transitional/EN">-->
<html>
  <head>
    <meta content="text/html; charset=utf-8" http-equiv="Content-Type" />
    <meta name="GENERATOR" content="MSHTML 8.00.6001.19046" />
  </head>
  <body>
    <p style="MARGIN: 0in 0in 0pt">
      <span style="COLOR: #1f497d">
        <font face="Calibri">Hello Again,<p /></font>
      </span>
    </p>
    <p style="MARGIN: 0in 0in 0pt">
      <span style="COLOR: #1f497d">
        <p>
          <font face="Calibri">&amp;nbsp;</font>
        </p>
      </span>
    </p>
    <p style="MARGIN: 0in 0in 0pt">
      <span style="COLOR: #1f497d">
        <font face="Calibri">Please see my comments below<p /></font>
      </span>
    </p>
    <p style="MARGIN: 0in 0in 0pt">
      <span style="COLOR: #1f497d">
        <p>
          <font face="Calibri">&amp;nbsp;</font>
        </p>
      </span>
    </p>
    <p style="MARGIN: 0in 0in 0pt">
      <b>
        <span style="COLOR: red">
          <font face="Calibri">: Three tasks below will take 2.5 days only, could you pls check?<p /></font>
        </span>
      </b>
    </p>
    <p style="MARGIN: 0in 0in 0pt">
      <span style="COLOR: #17365d">
        <font face="Calibri">: I already completed several actions (new build creation, deployment ) that is
 why it remains only 2.5 days<p /></font> </span> </p> <p style="MARGIN: 0in 0in 0pt"> <span style="COLOR: #17365d"> <font face="Calibri">For the first <b>activate Email notification</b> most of the job will be on your
side that is to says fill for each task and each task status who should be notified.<p /></font> </span> </p> <p style="MARGIN: 0in 0in 0pt"> <span style="COLOR: #17365d"> <font face="Calibri">For the training I put 2*2 hours due users time zone differences<p /></font> </span> </p> <p style="MARGIN: 0in 0in 0pt"> <p> <font face="Calibri">&amp;nbsp;</font> </p> </p> <p style="MARGIN: 0in 0in 0pt"> <font face="Calibri"> <span style="BACKGROUND: yellow; mso-highlight: yellow">[AX] got it. Thanks!</span> <p /> </font> </p> <p style="MARGIN: 0in 0in 0pt"> <b> <span style="COLOR: red"> <p> <font face="Calibri">&amp;nbsp;</font> </p> </span> </b> </p> <p style="MARGIN: 0in 0in 0pt"> <font face="Calibri"> <b> <span style="COLOR: red">:</span> </b> <span style="COLOR: red"> Total 7.5 hours for all tasks below exclude FAQ?<p /></span> </font> </p> <p style="MARGIN: 0in 0in 0pt"> <font face="Calibri"> <span style="COLOR: #17365d">: Here again that will be N and A</span> <span style="COLOR: #1f497d; mso-themecolor: dark2"> </span> <span style="COLOR: #17365d">that will have to produce the biggest effort by creating the documentation
 that is why 1 day for these task seems correct.<p /></span> </font> </p> <p style="MARGIN: 0in 0in 0pt"> <span style="COLOR: #1f497d"> <p> <font face="Calibri">&amp;nbsp;</font> </p> </span> </p> <p style="MARGIN: 0in 0in 0pt"> <font face="Calibri"> <span style="BACKGROUND: yellow; COLOR: #1f497d; mso-highlight: yellow">[AX] Regarding the big effort
 from N and A, my understand is they need to read carefully on those documentation and raise question?</span> <span style="COLOR: #1f497d"> <p /></span> </font> </p> <p style="MARGIN: 0in 0in 0pt"> <b> <span style="COLOR: red"> <p> <font face="Calibri">&amp;nbsp;</font> </p> </span> </b> </p> <p style="MARGIN: 0in 0in 0pt"> <span style="COLOR: #17365d"> <font face="Calibri">Regards<p /></font> </span> </p> <p style="MARGIN: 0in 0in 0pt"> <span style="COLOR: #1f497d"> <p> <font face="Calibri">&amp;nbsp;</font> </p> </span> </p> <p style="MARGIN: 0in 0in 0pt"> <span style="FONT-FAMILY: 'Arial','sans-serif'; COLOR: #003399; mso-ansi-language: FR" lang="FR">
________________________________</span> <span style="COLOR: #1f497d; mso-ansi-language: FR" lang="FR"> <br style="mso-special-character: line-break" /> <br style="mso-special-character: line-break" /> </span> <span style="FONT-FAMILY: 'Arial','sans-serif'; COLOR: #262626; FONT-SIZE: 10pt; mso-ansi-language: FR"
 lang="FR"> <p /> </span> </p> <p style="MARGIN: 0in 0in 0pt"> <span style="COLOR: #1f497d; mso-ansi-language: FR" lang="FR"> <font face="Calibri">&amp;nbsp;&amp;nbsp;</font> </span> <b> <span style="FONT-FAMILY: Webdings; COLOR: green; FONT-SIZE: 26pt; mso-ansi-language: FR"
 lang="FR">P</span> </b> <span style="FONT-FAMILY: 'Tahoma','sans-serif'; COLOR: green; FONT-SIZE: 7.5pt; mso-ansi-language: FR"
 lang="FR"> </span> <span style="FONT-FAMILY: 'Arial','sans-serif'; COLOR: green; FONT-SIZE: 8pt; mso-ansi-language: FR"
 lang="FR">Avant d'imprimer ce mail, assurez-vous que&amp;nbsp;cela est nécessaire.&amp;nbsp;</span> <span style="FONT-FAMILY: 'Arial','sans-serif'; COLOR: black; FONT-SIZE: 10pt; mso-ansi-language: FR"
 lang="FR"> <p /> </span> </p> <p style="MARGIN: 0in 0in 0pt"> <i> <span style="FONT-FAMILY: 'Arial','sans-serif'; COLOR: green; FONT-SIZE: 8pt">Before printing this email,
 assess if it is really needed.<p /></span> </i> </p> <p>&amp;nbsp;</p> </body> </html>
Jun 20, 2011 at 4:38 PM
Edited Jun 20, 2011 at 4:40 PM

By the way, could you edit your first post and introduce line breaks?  This thread is very hard to read as-is.  Finally, your document contains \" wherever " should be; I'm assuming that's some kind of encoding error.

The edit link is in the top-right corner.

Jun 21, 2011 at 8:26 AM

Thanks for your help and your time.

I will try your new code today or tomorrow and keep you informed of my results.

I have modified the HTML code as suggested.

Regards

Jun 27, 2011 at 3:04 PM

Hello,

I did some tests using your patches, and my generated XML is nearly empty.

This is the HTML I get from my TopEditor WebBrowser form input:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"><HTML><HEAD><META content="text/html; charset=utf-8" 
http-equiv=Content-Type><META name=GENERATOR content="MSHTML 8.00.6001.19088"></HEAD><BODY><P style="MARGIN: 0in 0in 
0pt" class=MsoNormal><SPAN style="COLOR: #1f497d"><FONT face=Calibri>Hi Olivier, <?xml:namespace prefix = o ns = 
"urn:schemas-microsoft-com:office:office" /><o:p></o:p></FONT></SPAN></P><P style="MARGIN: 0in 0in 0pt" 
class=MsoNormal><SPAN style="COLOR: #1f497d"><o:p><FONT face=Calibri>&nbsp;</FONT></o:p></SPAN></P><P style="MARGIN: 0in 
0in 0pt" class=MsoNormal><SPAN style="COLOR: #1f497d"><FONT face=Calibri>Thanks! I’ll tell her to create a job 
first.&nbsp; Pls see further comments below.<o:p></o:p></FONT></SPAN></P><P style="MARGIN: 0in 0in 0pt" 
class=MsoNormal><SPAN style="COLOR: #1f497d"><o:p><FONT face=Calibri>&nbsp;</FONT></o:p></SPAN></P><P 
style="TEXT-INDENT: -0.25in; MARGIN: 0in 0in 0pt 0.5in; mso-list: l0 level1 lfo1" class=MsoListParagraph><SPAN 
style="BACKGROUND: yellow; COLOR: #1f497d; mso-fareast-font-family: Calibri; mso-bidi-font-family: Calibri; 
mso-highlight: yellow"><SPAN style="mso-list: Ignore"><FONT face=Calibri>1.</FONT><SPAN style="FONT: 7pt 'Times New 
Roman'">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </SPAN></SPAN></SPAN><SPAN style="COLOR: #1f497d"><FONT face=Calibri>do you 
have news from Ted about pending developments ? <SPAN style="BACKGROUND: yellow; mso-highlight: yellow">[AX] Not yet. 
I’ll follow up. I think</SPAN> <SPAN style="BACKGROUND: yellow; mso-highlight: yellow">the Remaining Actions from the 7 
days features already approved, just need him approval on new action, right?<o:p></o:p></SPAN></FONT></SPAN></P><P 
style="TEXT-INDENT: -0.25in; MARGIN: 0in 0in 0pt 0.5in; mso-list: l0 level1 lfo1" class=MsoListParagraph><SPAN 
style="BACKGROUND: yellow; COLOR: #1f497d; mso-fareast-font-family: Calibri; mso-bidi-font-family: Calibri; 
mso-highlight: yellow"><SPAN style="mso-list: Ignore"><FONT face=Calibri>2.</FONT><SPAN style="FONT: 7pt 'Times New 
Roman'">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; </SPAN></SPAN></SPAN><SPAN style="COLOR: #1f497d"><FONT face=Calibri>Do you 
need help to complete the Email Template ? <SPAN style="BACKGROUND: yellow; mso-highlight: yellow">[AX] sorry to keep 
you wait. I’ll try to reply to your email no later than June 22.<o:p></o:p></SPAN></FONT></SPAN></P><P style="MARGIN: 
0in 0in 0pt" class=MsoNormal><SPAN style="COLOR: #1f497d"><o:p><FONT face=Calibri>&nbsp;</FONT></o:p></SPAN></P><P 
style="MARGIN: 0in 0in 0pt" class=MsoNormal><SPAN style="COLOR: #1f497d"><FONT face=Calibri>Best Regards, 
</FONT></SPAN></P></BODY></HTML>

This is my code:

 

        //convert
        private void button1_Click(object sender, EventArgs e)
        {
            htmlDoc = new HtmlAgilityPack.HtmlDocument();
            htmlDoc.LoadHtml(this.TopEditor.EditedText.Replace("\r\n",""));

            sw = new StringWriter();
            xtw = XmlWriter.Create(sw, new XmlWriterSettings { Indent = true, });
            
            //you should remove elements starting with '?' - these are processing instructions (unsupported).
            var bad = htmlDoc.DocumentNode.DescendantNodesAndSelf().Where(
                node => node.NodeType == HtmlNodeType.Element && node.Name.StartsWith("?"));

            foreach (var node in bad.ToArray())
            {
                node.Remove();
            }

            htmlDoc.CreateNavigator().WriteSubtree(xtw);
            xtw.Close();

            //This is basically equivalent to save but using the XPathNavigator which has been fixed. 
            this.textBox1.Text = sw.ToString();
            Console.WriteLine(sw.ToString()); 
        }


 

This is my output:

<?xml version="1.0" encoding="utf-16"?>
<!--<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">-->
<html>
  <head>
    <meta content="text/html; charset=utf-8" http-equiv="Content-Type" />
  </head>
</html>

Do you have any idea why all the content is removed?

After the foreach loop, is there any way to check whether the ? node has actually been removed?

I have checked several options (InnerHtml, OuterHtml) without any success.

Jul 15, 2011 at 9:22 AM

I'm rather busy ATM, sorry for the late reply:

Your code sample mostly works in my case.  Apparently Html Agility Pack also doesn't unescape entities in the XPathNavigator; but that's easily fixed (I'll update the patch and add that...)

But in any case, if I run the code you pasted (modified to run without the references to the GUI) I get a reasonable reproduction of the document.

Jul 15, 2011 at 9:57 AM

OK, patch updated.  A few things you may want to try:

Indenting adds spaces, and some of these are significant to the layout; you'll get very slightly different rendering if indenting is on (unfortunately).  Including the XML declaration is probably incorrect in your case since you're writing to a StringWriter: the encoding listed in the declaration will not match the file's real encoding.  So:

 

var xtw = XmlWriter.Create(sw, new XmlWriterSettings { Indent = false, OmitXmlDeclaration = true });
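
To see the effect of OmitXmlDeclaration in isolation, here is a small sketch that uses only System.Xml; `DeclarationDemo` and `Write` are names invented for this example. With a StringWriter as the target, the declaration reports utf-16 because that is the encoding StringWriter advertises, whatever encoding the string is later saved in.

```csharp
using System;
using System.IO;
using System.Xml;

public static class DeclarationDemo
{
    // Writes a tiny document to a string, with or without the XML declaration.
    public static string Write(bool omitDeclaration)
    {
        var sw = new StringWriter();
        var settings = new XmlWriterSettings { Indent = false, OmitXmlDeclaration = omitDeclaration };
        using (var xw = XmlWriter.Create(sw, settings))
        {
            xw.WriteStartElement("p");
            xw.WriteString("hi");
            xw.WriteEndElement();
        }
        return sw.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(Write(false)); // starts with <?xml version="1.0" encoding="utf-16"?>
        Console.WriteLine(Write(true));  // <p>hi</p>
    }
}
```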

 

Furthermore, to render perfectly, you want to strip Microsoft-specific elements.  Unfortunately, there's no easy way of identifying these, but a reasonable heuristic is that the element name starts with "o:".  You could then replace these elements by their children as follows:

var unknown = htmlDoc.DocumentNode.DescendantNodesAndSelf().Where(
	node => node.NodeType == HtmlNodeType.Element && node.Name.StartsWith("o:"));

foreach (var node in unknown.ToArray())
{
	var parent = node.ParentNode;
	var siblings = parent.ChildNodes.ToArray();
	var children = node.ChildNodes.ToArray();
	node.RemoveAllChildren();
	parent.RemoveAllChildren();
	var newSiblings = siblings.SelectMany(sibling=>sibling==node?children:new[]{sibling});
	foreach(var newSibling in newSiblings)
		parent.AppendChild(newSibling);
}

With these two changes (and the new patch) your sample renders exactly identically to the source after conversion to XHtml.