OutOfMemoryException when Parsing malformed HTML

description

If you have a malformed HTML document in which a start tag doesn't match its end tag, iterating through the HTML tree via SelectNodes causes an endless loop that terminates with an OutOfMemoryException:

https://gist.github.com/anonymous/a09a66b9e2138af180ab
  • The HtmlDocument loads successfully
  • HAP handles the malformed HTML as well as it reasonably can (it ignores the wrong end tag, but does not add a </div>)
  • SelectNodes() works normally at first
  • After DOM manipulation via AppendChild etc., the next call to SelectNodes() throws an OutOfMemoryException
Workaround: after the DOM manipulation, reload the HTML via html.LoadHtml(html.DocumentNode.WriteTo());

comments

bergie wrote Aug 26, 2015 at 6:45 AM

Does this OutOfMemoryException happen only when you try to select nodes, or already when the HTML document is loaded?

lutz_rosema wrote Aug 26, 2015 at 11:32 AM

I investigated further. It wasn't easy to reconstruct this issue.

The HTML loads correctly and HAP copes with the malformed markup: html.DocumentNode.WriteTo() returns the markup without the section end tag. Calling SelectNodes() also works normally at first.

Problems occur only when you manipulate the HTML document and then call SelectNodes() again.

Here's a small example:
https://gist.github.com/anonymous/a09a66b9e2138af180ab

There are two divs in the body, each with a "data-group" attribute. The first one contains malformed HTML.
The two divs are removed from the body element and re-appended in reverse order.
After that, the SelectNodes() call throws the OutOfMemoryException.
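
For readers who cannot open the gist, the steps described above might look roughly like the following sketch. The markup, the attribute values "a" and "b", and the names ReorderRepro and LoadAndReverseGroups are assumptions made for illustration and are not taken from the gist; whether the final SelectNodes() call actually fails will depend on the HAP version and on the exact malformed markup.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ReorderRepro
{
    // Hypothetical markup (not the gist's): the first group contains a
    // </section> end tag that has no matching start tag.
    public const string Markup =
        "<html><body>" +
        "<div data-group=\"a\"><p>first</p></section></div>" +
        "<div data-group=\"b\"><p>second</p></div>" +
        "</body></html>";

    public static HtmlDocument LoadAndReverseGroups()
    {
        var html = new HtmlDocument();
        html.LoadHtml(Markup);

        HtmlNode body = html.DocumentNode.SelectSingleNode("//body");

        // SelectNodes returns null when nothing matches, so guard against that.
        List<HtmlNode> groups =
            body.SelectNodes(".//div[@data-group]")?.ToList() ?? new List<HtmlNode>();

        // Remove everything from the body, then re-append the groups in reverse order.
        body.ChildNodes.Clear();
        foreach (HtmlNode group in Enumerable.Reverse(groups))
        {
            body.AppendChild(group);
        }

        return html;
    }

    public static void Main()
    {
        HtmlDocument html = LoadAndReverseGroups();

        // This is the kind of call the report says ends in an OutOfMemoryException.
        HtmlNodeCollection after = html.DocumentNode.SelectNodes("//div[@data-group]");
        Console.WriteLine(after == null ? 0 : after.Count);
    }
}
```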

lutz_rosema wrote Aug 26, 2015 at 12:34 PM

Workaround: just reload the HtmlDocument via html.LoadHtml(html.DocumentNode.WriteTo()); after the DOM manipulation with body.ChildNodes.Clear and body.AppendChild.
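
A minimal sketch of that workaround as a reusable helper, assuming it is called right after the DOM manipulation and before the next SelectNodes() call. The class and method names (HapReload, ReloadFromOwnMarkup) are hypothetical; only LoadHtml, DocumentNode and WriteTo come from HAP itself.

```csharp
using HtmlAgilityPack;

public static class HapReload
{
    // Hypothetical helper: serializes the current DOM and parses it again,
    // so that later queries run against a freshly built tree.
    public static void ReloadFromOwnMarkup(HtmlDocument html)
    {
        string serialized = html.DocumentNode.WriteTo();
        html.LoadHtml(serialized);
    }
}
```

Usage would be HapReload.ReloadFromOwnMarkup(html); followed by the SelectNodes() call. Because the document is parsed again, HtmlNode references obtained before the reload should be treated as stale and looked up again.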
