Malformed HTML parsing problem - unclosed li element within a form

Topics: Developer Forum
Sep 12, 2012 at 1:01 PM

Hi Community,

I am working on an HTML parsing utility, and HTML Agility Pack has been a great help.

I have a problem parsing some malformed HTML content. I want to get all the forms in the HTML and process them one by one. But one of my forms has an unclosed <li> tag, and because of it the HTML Agility Pack parser pulls all the HTML that follows the parent form into that form's node.

For example:

<form1></form1>
<form2>
    <li>
</form2>
<form3></form3>
<form4></form4>

Now, when I do something like this:

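var _document = new HtmlDocument();
_document.OptionAutoCloseOnEnd = true;

// Remove the default flags so <form> and <option> are parsed as regular container elements.
HtmlAgilityPack.HtmlNode.ElementsFlags.Remove("form");
HtmlAgilityPack.HtmlNode.ElementsFlags.Remove("option");

_document.Load(@"C:\HTMLPage1.htm");

// SelectNodes may return null when nothing matches, so guard before iterating.
var formNodes = _document.DocumentNode.SelectNodes("//form");
if (formNodes != null)
{
    foreach (var node in formNodes)
    {
        Console.WriteLine(node.OuterHtml);
    }
}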

For the second form node, the output also includes the HTML of form3 and form4.
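
For anyone who wants to reproduce this without the original file, here is a minimal sketch that loads the markup from a string instead of from disk. It assumes the HtmlAgilityPack package is referenced; the class name FormDump, the method GetFormHtml, and the ids f1 to f4 are hypothetical and only stand in for the forms in the example above. It prints whatever the parser returns for each form, so the nesting can be inspected directly.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class FormDump
{
    // Returns the OuterHtml of every <form> node found in the given markup.
    public static List<string> GetFormHtml(string html)
    {
        // Same setup as in the post: drop the default flags for <form>.
        HtmlNode.ElementsFlags.Remove("form");

        var document = new HtmlDocument();
        document.OptionAutoCloseOnEnd = true;
        document.LoadHtml(html);

        var result = new List<string>();
        var formNodes = document.DocumentNode.SelectNodes("//form");
        if (formNodes == null)
        {
            return result; // no <form> matched
        }

        foreach (var node in formNodes)
        {
            result.Add(node.OuterHtml);
        }
        return result;
    }

    public static void Main()
    {
        // Hypothetical inline markup mirroring the example:
        // the second form contains an unclosed <li>.
        const string html =
            "<form id=\"f1\"></form>" +
            "<form id=\"f2\"><li></form>" +
            "<form id=\"f3\"></form>" +
            "<form id=\"f4\"></form>";

        foreach (var formHtml in GetFormHtml(html))
        {
            Console.WriteLine(formHtml);
            Console.WriteLine("-----");
        }
    }
}
```

If the output for f2 also contains f3 and f4, the problem is reproduced independently of the original page.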

Any help will be highly appreciated.

Thanks,

Jan 25 at 8:16 AM
Edited Jan 25 at 8:16 AM
Hej there,

I know this discussion is quite old, but I just encountered the same problem with the unclosed <li> tag. I searched for hours because I did not believe the parser could be the problem, rather than my inability to understand the complex form structure.

I'm using version 1.4.9 of HTML Agility Pack.

It would be great if HTML Agility Pack were tolerant enough to parse such malformed HTML documents, as documents on the web are quite often malformed...


Greetings
Mexallon