Html Entities get html encoded?

Apr 26, 2011 at 3:20 PM

Hello there,

Why does HAP "double"-encode HTML entities from HTML input, and how do I prevent it? Basically, what I am seeing is this:

var htmlDocument = new HtmlDocument { OptionOutputAsXml = true };
htmlDocument.LoadHtml(Text);

return htmlDocument.DocumentNode.InnerHtml;

 

Using this with 'Text' being:

<HTML><HEAD></HEAD><body>&nbsp;</BODY></HTML>

I get this back from htmlDocument.DocumentNode.InnerHtml:

<html><head></head><body>&amp;nbsp;</body></html>

The same goes for all other entities. So basically, whenever I pass in perfectly valid HTML, I get modified, wrong HTML back. Is there any way to prevent this?

 

-J

May 2, 2011 at 11:17 PM

I haven't looked too deeply into this, but it looks like it has something to do with the 'OptionOutputAsXml' option you are setting: '&nbsp;' is not one of the entities that XML predefines, so it is not valid there. A possible workaround is to do this:

htmlDocument.LoadHtml(System.Web.HttpUtility.HtmlDecode(Text));

Passing in already-decoded HTML results in the following output:

<HTML><HEAD></HEAD><body> </BODY></HTML>
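
For anyone trying the decode workaround without a reference to System.Web, here is a minimal sketch using System.Net.WebUtility.HtmlDecode instead. The EntityDemo class and its Decode helper are made-up names for illustration only, and HAP itself is left out so the snippet runs on its own. It also shows one caveat of decoding the whole document before loading it: escaped markup such as &lt;b&gt; gets decoded as well, which can change the structure of the HTML.

```csharp
using System;
using System.Net;

// Hypothetical helper class, not part of HAP.
static class EntityDemo
{
    // Decodes HTML entities using the framework's built-in decoder.
    public static string Decode(string html) => WebUtility.HtmlDecode(html);

    static void Main()
    {
        string text = "<HTML><HEAD></HEAD><body>&nbsp;</BODY></HTML>";
        string decoded = Decode(text);

        // &nbsp; becomes the single character U+00A0 (non-breaking space).
        Console.WriteLine(decoded.Contains("\u00A0"));   // True
        Console.WriteLine(decoded.Contains("&nbsp;"));   // False

        // Caveat: escaped markup is decoded too, so text turns into a real tag.
        Console.WriteLine(Decode("<p>&lt;b&gt;</p>"));   // <p><b></p>
    }
}
```

If the input can contain escaped markup like that, decoding everything up front is probably not safe, and avoiding the XML output option may be the simpler route.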

May 11, 2011 at 8:17 PM

This is by design: XML doesn't allow a bare '&' character, so it gets "escaped" as &amp;, which destroys the entities in your HTML. You should not use output as XML.

Jan 10, 2015 at 1:30 AM

string noNbsp = Regex.Replace(inputHTML, @"&nbsp;", "").Trim();