Do not load HTML comments

Topics: Developer Forum
Jan 12, 2010 at 2:23 PM

Is there an option to skip comments when loading HTML?

Jan 12, 2010 at 2:31 PM

There are currently no options to do that. You can, however, remove them after loading. They have a NodeType of HtmlNodeType.Comment.
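
For anyone landing here later, removing the comments after loading could look roughly like the sketch below. It is only an illustration of the NodeType approach: `CommentStripper` and `StripComments` are made-up names, not part of HAP, and it assumes a HAP version that has `Descendants()` and `RemoveChild()`.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class CommentStripper
{
    // Hypothetical helper (not part of HAP): parses the markup, removes
    // every node whose NodeType is HtmlNodeType.Comment, and returns the
    // remaining HTML.
    public static string StripComments(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Copy the matches into a list first so the tree is not modified
        // while it is still being enumerated.
        var comments = doc.DocumentNode
            .Descendants()
            .Where(n => n.NodeType == HtmlNodeType.Comment)
            .ToList();

        foreach (var comment in comments)
        {
            comment.ParentNode.RemoveChild(comment);
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Usage would be something like `CommentStripper.StripComments(rawHtml)`. Note that this still parses the comments and only drops them afterwards, so it does not save anything during loading. Also, HAP reportedly exposes the DOCTYPE declaration as a comment node in some versions, so check the output if the doctype matters to you.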

Jan 12, 2010 at 8:59 PM
DarthObiwan wrote:

There are no options currently to do that. You can however remove them. They have a NodeType of HtmlNodeType.Comment

That's right. I was thinking of writing a patch that skips comments during loading, to reduce the memory footprint and improve performance in large-scale, multithreaded parsing.

Jan 12, 2010 at 9:17 PM

I'd be interested in seeing the patch and getting some performance metrics. The memory footprint reduction will vary widely, since HAP keeps a reference to the original document and takes substrings of it to get the tag name, inner text, and inner HTML. Comments are handled a little differently: the comment text is copied. So it might help if you had a document with a large number of comments.

As for parsing speed, that might not see much improvement, since the parser still needs to keep scanning until it finds the end of the comment. Though the parser is pretty complex, I've found it to be pretty darn efficient. Every time I think I've found a way to "fix" something that looks ugly, I find that it is ugly because it is avoiding costly operations.

With that said, I'm always open to other ideas.