
InnerText returns HTML entities rather than their raw characters

description

I use the HtmlAgilityPack 1.4.6 NuGet package. When text containing HTML entities is parsed into an HtmlDocument, or set through HtmlNode.InnerHtml, and later retrieved via HtmlNode.InnerText, the entities are returned verbatim (for example, &amp; rather than &). I would expect them to be resolved to the characters they represent.

To reproduce, use the attached test case: create an empty C# class library project, install NUnit and HtmlAgilityPack from NuGet, then paste the attached code in. Two of the three tests fail because of the behavior described above.

file attachments

comments

dandreica wrote Feb 5, 2013 at 5:40 PM

The reverse is also broken: reading InnerHtml from an HtmlTextNode also returns an incorrect result. See the updated test case.

a_h wrote Mar 13, 2014 at 3:15 PM

A quick workaround for some cases is to use the System.Web.HttpUtility class:

e.g.: HttpUtility.HtmlDecode(node.InnerText)
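
For projects that do not reference System.Web, the same idea can be sketched with System.Net.WebUtility (available since .NET 4.0). This is only an illustration of the workaround, not a fix in HtmlAgilityPack itself. The EntityText class and DecodeInnerText method are hypothetical names, and the string argument stands in for node.InnerText.

```csharp
using System;
using System.Net;

// Hypothetical helper (not part of HtmlAgilityPack): decodes HTML entities
// in text that was read from HtmlNode.InnerText.
public static class EntityText
{
    public static string DecodeInnerText(string innerText)
    {
        // Treat a missing value as empty text so callers never get null back.
        if (innerText == null)
        {
            return string.Empty;
        }

        // WebUtility.HtmlDecode resolves named and numeric entities,
        // e.g. "&amp;" becomes "&" and "&#233;" becomes "é".
        return WebUtility.HtmlDecode(innerText);
    }
}
```

Usage would look like EntityText.DecodeInnerText(node.InnerText). Note that decoding is applied to the whole string, so text that was deliberately double-encoded (such as &amp;amp;) is decoded only one level.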
