HAP Swallows HTML Attributes with Un-Encoded GT/LT Chars

Topics: Developer Forum
Nov 30, 2009 at 7:52 AM
Edited Nov 30, 2009 at 7:52 AM

I use HAP to parse HTML output on the fly from an ASP.NET filter I developed called Secure Parameter Filter (SPF). An SPF user pointed out an odd scenario that I figured I would post here to get an official response. Essentially, the text of an asp:LinkButton was being swallowed when it contained an un-encoded < or > character.

For example, the following code:

<asp:LinkButton ID="Foo" runat="server" Text="< BAR" Font-Size="8" Font-Bold="true"></asp:LinkButton>

normally produces the following HTML: 

<a id="Foo" href="javascript:__doPostBack('Foo','')" style="font-size:8pt;font-weight:bold;"><BAR</a>

However, if the page is loaded into HAP and then rendered back out, it produces the following: 

<a id="Foo" href="javascript:__doPostBack('Foo','')" style="font-size:8pt;font-weight:bold;"><></a>

So HAP appears to be treating the '<' in the string "< BAR" as the start of a tag and then incorrectly "fixing" the HTML by replacing "BAR" with a '>'.

The workaround I suggested was to HTML-encode the '<' so that the string renders as "&lt; BAR" instead. This worked perfectly and is arguably how the value should have been represented in the first place. However, I am curious to hear whether this is a scenario that HAP should be able to handle. You can find the original thread here for reference.
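
For anyone setting the text from code rather than markup, here is a minimal sketch of that encoding step. It uses System.Net.WebUtility.HtmlEncode from the .NET base class library (available in .NET 4.0 and later). The EncodeDemo class name is made up for illustration. In a real page the encoded string would be assigned to the LinkButton's Text property.

```csharp
using System;
using System.Net;

static class EncodeDemo // hypothetical name, for illustration only
{
    static void Main()
    {
        string raw = "< BAR";

        // Replace markup-significant characters with HTML entities
        // so a parser cannot mistake the text for the start of a tag.
        string encoded = WebUtility.HtmlEncode(raw);

        Console.WriteLine(encoded); // prints "&lt; BAR"
    }
}
```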

Dec 4, 2009 at 2:42 AM

Thanks for bringing this up -- I've run into the same issue while scraping a site that uses unencoded GT/LT in a textarea.

Very simply:

<textarea> 1 < 2 </textarea>

is parsed as:

<textarea> 1 < 2=""></textarea>

Consequently, the inner text ends up as "1". DarthObiwan recently mentioned plans to make the parser more configurable. An XML file to specify parsing options for specific node types (such as "textarea - don't parse descendants") might be useful.

Dec 4, 2009 at 10:13 AM

OK, so the way XML handles < and > in its data is by using CDATA sections. Basically, whenever text is put in, it should be enclosed in a CDATA section. Doing that might be cheaper than having configuration options; it's just an idea.

Dec 4, 2009 at 11:14 AM

Thanks, kurtnelle, that was extremely helpful -- I should have figured there's already a built-in way to indicate parsing should be skipped.

Searching for "CData" in the project clarified DarthObiwan's comments on the "default list of tags and their parsing options", as well as some of the other questions I skimmed over. In my case, all I needed to do was:

HtmlDocument htmlDocument = new HtmlDocument(); // initialization of the static HtmlNode.ElementsFlags occurs
HtmlNode.ElementsFlags.Add("textarea", HtmlElementFlag.CData);

and then load my document/stream. Worked perfectly.
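
Put together, the whole sequence might look like the sketch below. It assumes the HtmlAgilityPack assembly is referenced. The sample markup and the TextareaDemo class name are illustrative only. It sets the flag through the dictionary indexer rather than Add, on the assumption that some HAP builds may already contain a "textarea" entry, in which case Add would throw an ArgumentException.

```csharp
using System;
using HtmlAgilityPack; // third-party library, assumed to be referenced

static class TextareaDemo // hypothetical name, for illustration only
{
    static void Main()
    {
        // Treat <textarea> content as raw text instead of parsing it as markup.
        // The indexer adds or overwrites the entry, so it works either way.
        HtmlNode.ElementsFlags["textarea"] = HtmlElementFlag.CData;

        var doc = new HtmlDocument();
        doc.LoadHtml("<textarea> 1 < 2 </textarea>"); // sample markup from this thread

        // SelectSingleNode returns null when nothing matches the XPath.
        HtmlNode textarea = doc.DocumentNode.SelectSingleNode("//textarea");
        if (textarea != null)
        {
            Console.WriteLine(textarea.InnerText);
        }
    }
}
```

Note that ElementsFlags is static, so the setting applies to every document parsed afterwards in the same process.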

So, bholyfield, presumably if you call:
HtmlNode.ElementsFlags.Add("a", HtmlElementFlag.CData);
before loading the HTML, "BAR" won't be replaced by ">".

Dec 23, 2009 at 5:57 PM

Hi guys,

Using unencoded < and > leads to ... unpredictable results! Although you *could* think it's predictable because you've tested it with all modern browsers :) If you want to use < and >, use HTML entities (&lt; &gt;). That's the canonical form.