Latest version

Topics: Developer Forum
Oct 12, 2009 at 1:06 PM

I noticed you released a new version of this. I'd like to use it, but I've customized the original version I downloaded too heavily to upgrade easily. Some of my changes were a bit of a kludge for my own fairly specific needs (like having each node remember which attribute was last accessed), but one change I think you should seriously consider is adding a System.Xml.Xsl.XsltContext parameter to the SelectNodes and SelectSingleNode functions, which can then be passed to the underlying HtmlNodeNavigator.Select() function. This makes it possible to add custom XPath functions, among other things. I also made sure that attribute value matches are reliably case-insensitive, so that something like

doc.DocumentNode.SelectSingleNode("//input[@type='text']")

works even if the HTML is all uppercase.
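
For anyone who needs case-insensitive attribute value matching without patching the library, one possible workaround is the standard XPath 1.0 translate() function. This is not the change described above, just a rough sketch of an alternative. It uses System.Xml's XmlDocument so that it runs on its own; the CaseInsensitiveMatch class and its Query helper are hypothetical names made up for illustration.

```csharp
using System;
using System.Xml;

// Hypothetical helper, not part of any library.
static class CaseInsensitiveMatch
{
    const string Upper = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    const string Lower = "abcdefghijklmnopqrstuvwxyz";

    // Builds an XPath that lower-cases the attribute value (ASCII letters only)
    // before comparing it with the lower-cased expected value.
    public static string Query(string element, string attribute, string value)
    {
        return "//" + element + "[translate(@" + attribute + ",'" + Upper + "','" + Lower + "')='"
            + value.ToLowerInvariant() + "']";
    }

    static void Main()
    {
        var doc = new XmlDocument();
        doc.LoadXml("<form><input type=\"TEXT\" name=\"q\"/></form>");

        // The plain comparison is case-sensitive, so it finds nothing here.
        Console.WriteLine(doc.SelectSingleNode("//input[@type='text']") == null);

        // The translate() version matches regardless of the value's case.
        Console.WriteLine(doc.SelectSingleNode(Query("input", "type", "text")) != null);
    }
}
```

The same XPath string should be usable with the library's SelectSingleNode as well, since translate() is a core XPath 1.0 function. It only folds the characters listed in the two alphabets, though, so non-ASCII values would need a longer mapping or a custom function.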

 

Oh, and I added a public accessor for HtmlDocument._text.

 

I also JUST managed to fix a bug. I had started describing it in this message as a long-standing problem that I could only work around by setting OptionAutoCloseOnEnd on the documents that needed it, but while trying to distill a better example for you I finally worked out how to fix it properly.

Even with the latest version, the following code throws an exception:

 

HtmlDocument doc = new HtmlDocument();
// LoadHtml parses an HTML string; Load expects a file path or stream
doc.LoadHtml("<table><tr></table>");
Console.WriteLine(doc.DocumentNode.SelectSingleNode("//tr").InnerHtml); // <- crashes with "Length cannot be less than 0"

 

This happens because, when Node.CloseNode() automatically closes unclosed child nodes, it creates a "fake" closer node with an _outerstartindex of -1. That makes the child node's _innerlength invalid (negative, in fact), because it is calculated by subtracting the current _innerstartindex from the fake node's _outerstartindex of -1. The _outerlength property is similarly negative, and this can cause both the InnerHtml and OuterHtml accessors to crash, unless of course _innerchanged or _outerchanged is true, which is exactly what my fix to Node.CloseNode() ensures. Before calculating the new _innerlength and _outerlength, it now checks whether the endnode's _outerstartindex is negative (the fake node's -1) and, if so, simply sets _innerchanged and _outerchanged to true:

 


                if (endnode._outerstartindex < 0)
                {
                    _innerchanged = _outerchanged = true;
                    return;
                }
                // create an inner section

 

This seems to be the right solution, because if there are child nodes that are unclosed when the parent is being closed, the HTML should be regenerated anyway.
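
To make the arithmetic concrete, here is a small standalone sketch of the failure and of the guard. It does not use the library at all: SketchNode and its members are hypothetical stand-ins that only mimic the index bookkeeping described above, and the "regenerated" HTML is simply an empty string because the node in this example has no children.

```csharp
using System;

// Hypothetical, simplified stand-in for a parsed node; not the library's real class.
class SketchNode
{
    public int InnerStartIndex;   // offset just past the opening tag
    public int InnerLength;       // length of the inner span in the source text
    public bool InnerChanged;     // true => rebuild the HTML instead of slicing the source

    // endOuterStartIndex is -1 when the closer is a "fake" node with no source position.
    public void Close(int endOuterStartIndex, bool useGuard)
    {
        if (useGuard && endOuterStartIndex < 0)
        {
            InnerChanged = true;  // no valid span exists, so force regeneration
            return;
        }
        InnerLength = endOuterStartIndex - InnerStartIndex;
    }

    public string InnerHtml(string source)
    {
        if (InnerChanged)
            return string.Empty;  // this sketch's node has no children to serialize
        return source.Substring(InnerStartIndex, InnerLength);
    }
}

static class Program
{
    static void Main()
    {
        const string html = "<table><tr></table>";
        int innerStart = html.IndexOf("<tr>", StringComparison.Ordinal) + "<tr>".Length;

        // Without the guard: -1 minus the inner start index gives a negative length.
        var unguarded = new SketchNode { InnerStartIndex = innerStart };
        unguarded.Close(-1, useGuard: false);
        Console.WriteLine(unguarded.InnerLength);
        try
        {
            unguarded.InnerHtml(html);
        }
        catch (ArgumentOutOfRangeException)
        {
            Console.WriteLine("Substring rejected the negative length");
        }

        // With the guard: the node is flagged as changed and never slices the source.
        var guarded = new SketchNode { InnerStartIndex = innerStart };
        guarded.Close(-1, useGuard: true);
        Console.WriteLine(guarded.InnerChanged);
    }
}
```

The useGuard flag exists only so that both behaviours can be shown side by side; in the actual fix the check is unconditional.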

 

Oct 27, 2009 at 1:58 AM

Any chance you could provide your modified codebase so I can compare it against 1.3.0 (and then 1.4.0, which changed quite a bit: ReSharper cleanups and rearranged fields, properties, and methods)? I'll look at the last change you described and see if it has any other repercussions.

Oct 27, 2009 at 4:37 AM
Edited Oct 27, 2009 at 5:40 AM

Tried to reply with a zip attachment, but I guess that's not supported. Will email you directly.