translation to xml

Dec 26, 2006 at 9:40 PM
Hi,

When translating html to xml, HtmlAgilityPack translates the html entity &nbsp;
Dec 26, 2006 at 9:47 PM
My text was mangled, or at least my post is not displaying correctly in the browser, so I'm going to try this again:

Hi,

When translating html to xml, HtmlAgilityPack translates the html entity &nbsp; to &amp;nbsp;. I believe the accurate translation of a non-breaking space character to xml should be &#160;.

The code that demonstrates this problem is simple, something like the following:

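HtmlDocument doc = new HtmlDocument();
doc.OptionOutputAsXml = true;
// Load an html file that has an &nbsp; entity in it
doc.Load("input.html");   // file name is a placeholder
// Save it as xml; the &nbsp; comes out as &amp;nbsp;
doc.Save("output.xml");   // file name is a placeholder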
Dec 28, 2006 at 2:06 AM
Here's some more information on my previous post. I am trying to use HtmlAgilityPack to read html that I then want to process with xslt. I load an HtmlDocument with the html, call CreateNavigator() and use the returned XPathNavigator to select the html element (which is under a span element that HtmlAgilityPack adds as the root node for reasons that are unclear to me). I then pass the html element to the xslt processor. The flow from HtmlDocument.CreateNavigator() to the xslt processor uses the standard .NET xml APIs.

The problem that I am running into is that character entities like &nbsp; in the original html are getting translated to &amp;nbsp; in the html that the xslt processor produces. This problem is true for character entities in general - &lt; is translated to &amp;lt; and so on. You can also see this problem in the string that is returned by HtmlDocument.CreateNavigator().OuterXml.

I tracked the problem down to HtmlNodeNavigator.Value. HtmlNodeNavigator overrides the base class implementation from XPathNavigator. Evidently there's something in the MS xml/xpath classes that will replace every & in the string that Value returns with &amp;.
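
One plausible explanation for the double escaping, sketched with only the standard .NET xml classes: an XmlWriter treats a text value as literal characters, so an ampersand that is still part of an unresolved entity reference gets escaped again. The class and method names below (EscapeDemo, WriteText) are made up for illustration and are not part of HtmlAgilityPack.

```csharp
using System;
using System.IO;
using System.Xml;

public static class EscapeDemo
{
    // Writes a single <p> element whose text content is `text`
    // and returns the resulting markup.
    public static string WriteText(string text)
    {
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        var output = new StringWriter();
        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            writer.WriteStartElement("p");
            writer.WriteString(text);   // text is treated as literal characters
            writer.WriteEndElement();
        }
        return output.ToString();
    }

    public static void Main()
    {
        // Entity reference still in the string: the ampersand is escaped again.
        Console.WriteLine(WriteText("a&nbsp;b"));   // <p>a&amp;nbsp;b</p>

        // Entity already resolved to U+00A0: there is no ampersand left to escape.
        Console.WriteLine(WriteText("a\u00A0b").Contains("&amp;"));   // False
    }
}
```

If that is what is going on, the writer is behaving correctly and the fix belongs on the side that supplies the text value.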

If I change HtmlNodeNavigator.Value to replace the character entities with the actual character they represent, then the html that the xslt processor produces is fine.

The change I tried as an experiment was this:

case HtmlNodeType.Text:
InternalTrace(">" + ((HtmlTextNode)_currentnode).Text);
return ((HtmlTextNode)_currentnode).Text.Replace("&nbsp;", "\u00A0").Replace("&lt;", "<");


The effect was that the html produced by the xslt transformation contained &nbsp; rather than &amp;nbsp; and &lt; rather than &amp;lt;. Incidentally, the change has no effect on the output from HtmlDocument.Save, which was what I was originally reporting on, so that method goes through some other pathway.
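
The two Replace calls above only cover &nbsp; and &lt;. A more general version of the same experiment could decode every entity in one call. The sketch below uses System.Net.WebUtility.HtmlDecode from the .NET base class library so that it runs on its own; DecodeText is a hypothetical helper name, and whether HtmlNodeNavigator.Value should call something like it is exactly the open question in this thread. HtmlAgilityPack's own HtmlEntity.DeEntitize may serve the same purpose if your version includes it.

```csharp
using System;
using System.Net;

public static class EntityDecodeDemo
{
    // Hypothetical helper: resolves named and numeric character
    // references in a text node's raw text to the characters they stand for.
    public static string DecodeText(string rawText)
    {
        return WebUtility.HtmlDecode(rawText);
    }

    public static void Main()
    {
        string decoded = DecodeText("a&nbsp;b &lt; c &#160;");

        // No ampersands remain, so nothing can be double-escaped downstream.
        Console.WriteLine(decoded.IndexOf('&') < 0);      // True
        // &nbsp; and &#160; both became the U+00A0 character.
        Console.WriteLine(decoded.Contains("\u00A0"));    // True
    }
}
```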
Coordinator
Jan 1, 2007 at 1:37 PM
Hi,

All these entity issues are a real nightmare :-) mostly because everybody really wants a different output at the end of the day.

I know the current html to xml implementation satisfies many people, but the source is there to be modified :-)

Concerning the span root element added, the reason is simple: every Xml document needs a root element (unlike HTML), so the library adds one when there is not a default one (usually the HTML element).
Jan 3, 2007 at 7:55 PM
Yes, it is a pain.

One thing I forgot to mention that might convince you how this should be handled:

I used the .NET class XmlDocument to load an xml file with character entities and then created an XPathNavigator from that. When I called XPathNavigator.Value, I could see that entities like &#160; in the original xml were translated to the corresponding Unicode character. Based on that example of how the base class XPathNavigator is handling character entities in xml, it seems like the subclass HtmlNodeNavigator should do the same for html character entities.
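
A minimal way to reproduce that observation, using an inline xml string instead of a file (the element names root and p, and the names NavigatorValueDemo and ReadValue, are just for this example):

```csharp
using System;
using System.Xml;
using System.Xml.XPath;

public static class NavigatorValueDemo
{
    // Loads the xml, navigates to /root/p and returns the navigator's Value.
    public static string ReadValue(string xml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        XPathNavigator nav = doc.CreateNavigator();
        XPathNavigator p = nav.SelectSingleNode("/root/p");
        return p.Value;
    }

    public static void Main()
    {
        string value = ReadValue("<root><p>a&#160;b &lt; c</p></root>");

        // Value holds the resolved characters, not the references.
        Console.WriteLine(value == "a\u00A0b < c");   // True
        Console.WriteLine(value.Contains("&"));       // False
    }
}
```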

I'll send you the changes if I end up making any.

As for the span element that is being added to the root - since I am loading a single html file, why isn't the html element from that being used as the root? What I am seeing is that there is a span element at the root, with a single html element underneath that. It seems to me that the html element could be used as the root.