Closed

Parsing problem

description

Hi, I found a parsing problem with the following HTML code:
 
<html id="thisisTest">
<head>
    <title>TT</title>
</head>
<body id="test">
    <form>
        <b>test</b>
    </form>
    <br />
    This is plain text
    <span>
        <b>BOLD</b>
        whee
    </span>wheeagain
</body>
</html>
 
Everything goes well for the HTML, HEAD, TITLE, and BODY tags: they all appear in the proper tree (e.g. HEAD is a child of HTML, TITLE is a child of HEAD, etc.). But once you get to the BODY tag, all elements inside BODY appear as its direct children, no matter which tag they are actually nested in. The FORM tag is a good example.
Closed Aug 23, 2012 at 1:57 PM by DarthObiwan
This bug has been fixed as of 1.4.5. The fix is available on NuGet, here on CodePlex, and in the source code.

comments

DarthObiwan wrote Oct 3, 2009 at 5:02 AM

In the HtmlNode class there is a list of elements and what the parser should do with those elements.
Comment out the following line and you will get the behavior you are expecting:
ElementsFlags.Add("form", HtmlElementFlag.CanOverlap | HtmlElementFlag.Empty);

This is both the power and the drawback of how HAP parses HTML. This line helps it deal with malformed code, but unfortunately it also makes the parser get correct code wrong. I'm thinking of making an XML format to define these flags so they are easier to tweak. Also, the list shouldn't be in HtmlNode; it should be its own separate, strongly typed collection.

Sarabe wrote Apr 21, 2010 at 6:16 PM

This seems to be a recurring issue for a lot of people. At least 6 have voted to fix it in various tracked issues. My team is currently just using the DLLs. We could get the source, but that seems like overkill for such an easy fix... We have been using the pack for about a year now, and so far this is the only issue that would make us want the source. Is it possible to fix it for everyone?

DarthObiwan wrote Apr 22, 2010 at 2:13 AM

You can change this without recompiling. The ElementsFlags list is a static property on the HtmlNode class.
The "form" entry can be removed with
        HtmlNode.ElementsFlags.Remove("form");
before loading the document.
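
For anyone who wants to see the workaround end to end, here is a minimal sketch. It assumes the HtmlAgilityPack assembly is referenced; the class name FormFlagWorkaround and the helper FormChildElementNames are made up for illustration and are not part of HAP. It removes the "form" entry, loads a trimmed-down version of the markup from the description, and lists the element children that the parser puts under FORM.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack assembly is referenced

public static class FormFlagWorkaround
{
    // Hypothetical helper (not part of HAP): returns the names of the
    // element children of the first <form> found in the given HTML.
    public static List<string> FormChildElementNames(string html)
    {
        // Remove the special-case entry for <form> so it is parsed like an
        // ordinary container. ElementsFlags is static, so this affects every
        // document loaded afterwards in the same process.
        HtmlNode.ElementsFlags.Remove("form");

        var doc = new HtmlDocument();
        doc.LoadHtml(html); // the entry must be removed before this call

        var names = new List<string>();
        HtmlNode form = doc.DocumentNode.SelectSingleNode("//form");
        if (form == null)
        {
            return names;
        }

        foreach (HtmlNode child in form.ChildNodes)
        {
            // Skip whitespace text nodes; keep only real elements.
            if (child.NodeType == HtmlNodeType.Element)
            {
                names.Add(child.Name);
            }
        }
        return names;
    }

    public static void Main()
    {
        const string html =
            "<html><body id=\"test\"><form><b>test</b></form><br />text</body></html>";

        // Print whatever the parser placed directly under <form>.
        Console.WriteLine(string.Join(", ", FormChildElementNames(html)));
    }
}
```

Since the flag list is shared process-wide, it is probably simplest to remove the entry once at application startup rather than before each load.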

emn13 wrote Aug 23, 2012 at 1:11 PM

This bug is also fixed by the patch to http://htmlagilitypack.codeplex.com/workitem/29218.