12
Vote

Incorrect parsing of HTML4 optional end tags

description

Parsing valid HTML 4.01 that omits optional end tags results in an incorrect parse tree.
 
For example,
input: <ul><li>a<li>b</ul>
is parsed as: <ul><li>a<li>b</li></li></ul>
but should be: <ul><li>a</li><li>b</li></ul>
 
In practice, this means that quite commonly html documents are parsed incorrectly. The bug does not apply to all optional end tags; e.g. <p> tags are correctly auto-closed.
 
Example document:
<html><body>
<ul><li>TestElem1
    <li>TestElem2
    <li>TestElem3 List:
        <ul><li>Nested1
            <li>Nested2</li>
            <li>Nested3
        </ul>
    <li>TestElem4
</ul>
<p>paragraph 1
<p>paragraph 2
<p>paragraph 3
</body></html>

file attachments

comments

DarthObiwan wrote Oct 20, 2010 at 3:00 PM

which Html Agility Pack handles. Just modify the HtmlNode.ElementsFlags dictionary. There is an enumeration of HtmlElementFlag that sets the parsing and closing tag options for a particular tag. By default <p> is set to HtmlElementFlag.Empty | HtmlElementFlag.Closed
The closed tells HAP to auto close the tag if no end tag was found. You can easily add <li> to the list as well (this has been covered many times in the Discussions forum)

emn13 wrote Oct 20, 2010 at 3:29 PM

It's great news that there's a workaround, but the behavior should still be fixed - as is, this is a nasty gotcha that reflects poorly on the library, particularly since the library has been around for years and fails to parse valid, simple documents - and that's just a shame. How is a user of the library to know that it can only parse normal HTML given particular flags? Are there any more of these flags required to parse vanilla HTML?

If the fix is trivial, hopefully it will be included swiftly.

emn13 wrote Jan 11, 2011 at 2:13 PM

Is there a fix in the pipeline? According to DarthObiwan, the fix is simple...

emn13 wrote May 14, 2011 at 1:36 PM

Hmm, I'm doing a bit of weekend hacking, and came across this again.

Turns out, just setting the HtmlElementFlag.Empty | HtmlElementFlag.Closed flags doesn't work: this produces incorrect parse trees in which the auto-closed element is left empty, followed by the nodes which should have been its children as siblings.

I'll see if I can patch it; the HAP model isn't entirely clear to me, however.

emn13 wrote May 15, 2011 at 12:07 AM

This turns out to be trickier that I thought - supporting optional end tags as in html doesn't quite fit into how HAP normally processes elements. There is a method to fix element nesting, but this isn't quite sufficient. What this method does:
  • only looks at start tags
  • only considers start tags where there's a previous open identical start tag
  • if these tags aren't separated by a "resetter" the previous tag is closed.
  • meanwhile, it's necessary to jiggle a virtual stack of previous tags per tag name, this to ensure if a element is close we know where to look. This part isn't written to be able to generally fix up just anything, only two such nested identical tags. It's complicated, in any case.
So to be able to patch this I needed to make that last point manageable, and that means a slight refactoring. In the patched version, HtmlDocument's Lastnodes no longer manually updated, but rather, when an node is closed or opened, the node is responsible for registering itself. That makes rearranging the tree a little easier since you no longer need to patch up lastparentnode or Lastnodes in as many places.

The rest is kind of simple; it'll correctly parse the above example, empty tags, and also stuff like in <dl><dt><dd><dt><dd></dl> - which is particularly tricky since the "previous" element need not have the same name.

The patch is definitely still in need of review, and if you like the idea, it'd need some polish to get rid of a few remaining debug statements and the like.

Finally, I also added two helper methods to the XPathNavigator to make it more xml compliant (the previous implementation was barfing on xhtml) In particular, it skips raw text nodes outside of the root element, and it skips xmlns and xmlns:... attributes. That's not particularly nice, but at least it can generate syntactically valid xml now.

Hoping for some feedback,

--Eamon Nerbonne

emn13 wrote May 15, 2011 at 12:17 PM

I removed some of the debugging gunk; new version of the patch attached.

bigjoesmith wrote May 27, 2011 at 5:34 AM

Tremendous!
I had some rather complex HTML that I was trying to scape. It wasn't being parsed correctly by the HTML Agility Pack. I suspected it had something to do with some unclosed <br> and/or <img> tags, or other tags without an end tag.

I played around with the HtmlNode.ElementsFlags table, but I couldn't find a combination that produced a correct parse.

But your patch (version 2) worked like a charm. I haven't determined the exact problem area in the original code's parse, but yours is working and I'm going with it.

emn13 wrote Jun 20, 2011 at 4:29 PM

Fixed XPathNavigable implementation to also skip namespace prefixes in patch v3.

emn13 wrote Jun 20, 2011 at 4:30 PM

added compiled binaries for those that need em.

emn13 wrote Jul 15, 2011 at 9:51 AM

Patch updated to version 4. Sum total changes:
  • Optional end tags properly supported (v1 & v2)
  • XPathNavigator doesn't choke on namespace declarations; strips namespace prefixes (v3)
  • XPathNavigator appropriately no longer causes double-encoded entities. (v4)
Notes:
  • Doctypes are not correctly processed; this is impossible with an XPathNavigator. Instead, doctypes render as comments. To fix this would require a different approach that avoids XPathNavigator for creating output in the first place.
  • Using HtmlDocument.Save(XmlWriter) does NOT use the new code; this is an entirely different code path - so Save will still crash on documents with namespaces and will still double-encode entities!

emn13 wrote Jul 15, 2011 at 9:58 AM

Binaries attached for v4 patch.

ARLindsay wrote Jul 15, 2011 at 6:11 PM

The patch seems to fix a problem I was having with nested dt/dd, but now OptionFixNestedTags throws an error. Is this intentional?
Error 1 'HtmlAgilityPack.HtmlDocument' does not contain a definition for 'OptionFixNestedTags' and no extension method 'OptionFixNestedTags' accepting a first argument of type 'HtmlAgilityPack.HtmlDocument' could be found (are you missing a using directive or an assembly reference?)

emn13 wrote Jul 15, 2011 at 6:33 PM

Yeah, that's intentional - the old adhoc fixes aren't in the code anymore and in any case aren't necessary; the new code simply follows the spec (with a small omission concerning divs and the like in p tags). If you want to suppress the automatic inference of optional end tags there's a new option OptionSupportOptionalEndTags, which defaults to true and you can set to false (I can't think of a reason why, however).

gmacgregor wrote May 4, 2012 at 4:44 PM

Hi Eamon,

I made one update to your patch for this fix. I have a page which needs the optional end tags closed but the tags in the page are all caps. So on line 1148 in the function FixNestedTags I changed:

string currentNodeName = CurrentNodeName();

To:

string currentNodeName = CurrentNodeName().ToLower();

Is this something you will incorporate into your patch and/or next release?

Thanks

Glenn

DarthObiwan wrote Jun 5, 2012 at 11:39 PM

I have a fix which I'm going to include in svn soon. The patches provided here go way overboard for the actual issue. The bug happened when converting to support Silverlight and the actual fix is 2 lines. Sorry it has taken so long to do this.. expect many more updates coming soon

emn13 wrote Jun 17, 2012 at 12:03 PM

@gmacgregor if I update the patch I'll take a look at the capitalization issue. Thanks for the heads up! However, it sounds like DarthObiwan has resumed development, so maybe this'll get fixed in the mainline.

@DarthObiwan Yeah, I imagine the patch does change more than necessary. HAP's fairly complex, so to ensure the patch works, I needed some invariants I could understand, and likely those weren't precisely the same as the ones in the old code.

However, the bug is certainly not fixable in just 2 lines; the requisite infrastructure just isn't there. The approach GetResetters/FixNestedTag takes differs from the spec which is in this case also how browsers actually work.

Part of the reason the patch is so large is that it also tries to fix the XPathNavigator implementation, which might be best split into a seperate path.