Incorrect processing of <p> node

description

HtmlAgilityPack incorrectly processes the following <p> node in the attached HTML document (please find the HTML file attached to this report). The closing tag is removed from the document after processing.


<p style="margin-top:20px; margin-bottom:0em; border-top:1px solid #AAAAAA;" title="Ähnlich geschriebene oder gleich klingende Wörter"></p>


I'm using HtmlAgilityPack v. 1.4.9.0 in VS 2013 (v. 12.0.31101.00).
HtmlAgilityPack works well with thousands of other HTML files, but fails on this one.
I'm loading the document from a string this way:

HtmlDocument d = new HtmlDocument();
d.LoadHtml(HTML);

After loading, d.ParseErrors reports 0 errors.
The initial document seems valid to me.
After loading, d.DocumentNode.OuterHtml reveals that HtmlAgilityPack removed the closing tag from the <p> node on line 94. I found this node in the DOM, and its outer HTML is the following:


<p style="margin-top:20px; margin-bottom:0em; border-top:1px solid #AAAAAA;" title="Ähnlich geschriebene oder gleich klingende Wörter">


You can see that the closing </p> tag is missing. The result is the same if you get the OuterHtml of the entire document.

As a result, the entire HTML document becomes invalid.

I assume there's something specific in this document that causes the incorrect behaviour of the Agility Pack (once again, it works well with thousands of similar documents), but I haven't had time to investigate further. I worked around this error by removing all empty <p> tags before further processing. I'm reporting this bug in the hope that my discovery may help make the product better.
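
For anyone who hits the same problem, here is a minimal sketch of that kind of workaround. It assumes the empty <p> tags can be stripped from the raw HTML string with a regular expression before the string is passed to LoadHtml. The class and method names (EmptyParagraphFilter, RemoveEmptyParagraphs) are hypothetical and are not part of HtmlAgilityPack, and the actual workaround code may differ.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class EmptyParagraphFilter
{
    // Matches an opening <p ...> tag followed only by optional whitespace
    // and a closing </p> tag. Assumption: attribute values never contain
    // a literal '>' character; such tags would not be matched correctly.
    private static readonly Regex EmptyParagraph = new Regex(
        @"<p\b[^>]*>\s*</p\s*>",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Returns the input with every empty <p> element removed.
    public static string RemoveEmptyParagraphs(string html)
    {
        if (html == null) throw new ArgumentNullException("html");
        return EmptyParagraph.Replace(html, string.Empty);
    }
}
```

The cleaned string would then be loaded as before, e.g. d.LoadHtml(EmptyParagraphFilter.RemoveEmptyParagraphs(HTML)). Note that this drops the empty paragraphs entirely, including any styling they carried (such as the border-top in the node above), so it only fits cases where those nodes are not needed.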

Thank you for the great job of developing and improving this exciting project!
Thank you,
Mike.
