
problem parsing form

Sep 20, 2009 at 7:50 AM

I'm using HtmlAgilityPack to parse forms. I commented out the flag that makes the form element have no InnerHtml, because I use that property in my code.


ElementsFlags.Add("form", HtmlElementFlag.CanOverlap /* | HtmlElementFlag.Empty */);

// they sometimes contain, and sometimes they don't...
//ElementsFlags.Add("option", HtmlElementFlag.Empty);
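For anyone who would rather not edit the library source, here is a minimal sketch of making the same change at runtime. It assumes HtmlNode.ElementsFlags is publicly accessible in the version in use and that flags are set before the document is loaded; the inline markup is a hypothetical stand-in for the real page.

```csharp
using System;
using HtmlAgilityPack;

class FormFlagsSketch
{
    static void Main()
    {
        // Keep CanOverlap but drop Empty, so <form> keeps its child nodes.
        // Assumption: this must run before any document is parsed.
        HtmlNode.ElementsFlags["form"] = HtmlElementFlag.CanOverlap;

        var doc = new HtmlDocument();
        // Hypothetical markup, standing in for the saved survey page.
        doc.LoadHtml("<form><input name='a'><input name='b'></form>");

        HtmlNode form = doc.DocumentNode.SelectSingleNode("//form");
        if (form != null)
        {
            // With the Empty flag removed, the inputs should be children of the form.
            Console.WriteLine(form.InnerHtml);
        }
    }
}
```

The flags dictionary is static, so the change affects every document parsed afterwards in the same process.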


My problem is with one specific form. I'm not getting all of its inner HTML: some options and inputs are missing, as if the content got cut off. I'm not sure whether this is an issue with Microsoft's XPath implementation or with HtmlAgilityPack.


The code is:

HtmlDocument doc = new HtmlDocument();
doc.Load("testparseform.html"); // the saved page described below

HtmlNodeCollection coll = doc.DocumentNode.SelectNodes("//form");

// coll[0].InnerHtml is missing some of the form's contents


testparseform.html is a random survey form whose source I saved. The link is:



Sep 20, 2009 at 3:22 PM

I figure it has something to do with how HtmlDocument.Parse() handles this particular page. It seems to stop looking for child nodes (I assume) once it finds a new table tag inside this form.

<input id="gwProxy" type="hidden" /><input id="jsProxy" onclick="jsCall();" type="hidden" />

Sep 22, 2009 at 12:50 AM

Actually, what I've been wanting to do is modify the code so that FORM, INPUT, OPTION, TEXTAREA, etc. are placed in a separate tree from the rest of the HTML body. Can someone point me to links or documentation that would help with the learning curve of altering the HtmlDocument class?
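As a stopgap that does not touch the parser, the form-related elements can be gathered into a separate flat list after parsing. This is only a sketch of that idea, not the parser change described above; the class name FormFieldIndex, the Collect helper, and the sample markup are all hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

class FormFieldIndex
{
    // Element names treated as form-related; extend as needed.
    static readonly HashSet<string> FieldNames = new HashSet<string>(
        new[] { "form", "input", "select", "option", "textarea" },
        StringComparer.OrdinalIgnoreCase);

    // Returns every form-related element in document order,
    // regardless of how deeply it is nested in tables or other markup.
    public static List<HtmlNode> Collect(HtmlDocument doc)
    {
        return doc.DocumentNode.Descendants()
                  .Where(n => FieldNames.Contains(n.Name))
                  .ToList();
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        // Hypothetical markup with an input nested inside a table.
        doc.LoadHtml("<form><table><tr><td><input name='q1'></td></tr></table>"
                   + "<textarea name='c'></textarea></form>");

        foreach (HtmlNode field in Collect(doc))
        {
            Console.WriteLine(field.Name + " " + field.GetAttributeValue("name", ""));
        }
    }
}
```

Because the list is built by walking all descendants, it does not depend on the fields ending up as children of the form node.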

<input id="gwProxy" type="hidden" /><input id="jsProxy" onclick="jsCall();" type="hidden" />

Oct 3, 2009 at 5:02 AM

I'm not going to be much help with modifying the parsing engine; I'm still learning it myself. There's no documentation on it that I'm aware of. Simon, the developer who created this, is probably the only person who knows it well, and he's a rather busy man.

I did hit up that URL and was able to get all 407 input elements from it using

var count = _html.DocumentNode.Descendants("input").Count();

(with the latest 1.4.0 beta 1)

I also counted every "<input" string in the page source, and that came to 407 as well.

I think the XPathNavigator may be the culprit.
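To narrow it down, the three counts can be compared side by side. This is a rough sketch, assuming the page is saved locally as testparseform.html; the CountOccurrences helper is hypothetical and only does a plain substring count, so it would also match tags such as "<inputx".

```csharp
using System;
using System.IO;
using System.Linq;
using HtmlAgilityPack;

class InputCountCheck
{
    // Counts non-overlapping, case-insensitive occurrences of token in text.
    public static int CountOccurrences(string text, string token)
    {
        int count = 0;
        int index = 0;
        while ((index = text.IndexOf(token, index, StringComparison.OrdinalIgnoreCase)) >= 0)
        {
            count++;
            index += token.Length;
        }
        return count;
    }

    static void Main()
    {
        string html = File.ReadAllText("testparseform.html");

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // LINQ-style traversal, which does not go through XPath.
        int viaDescendants = doc.DocumentNode.Descendants("input").Count();

        // XPath query; SelectNodes returns null when nothing matches.
        HtmlNodeCollection xpathNodes = doc.DocumentNode.SelectNodes("//input");
        int viaXPath = xpathNodes == null ? 0 : xpathNodes.Count;

        // Raw count straight from the source text.
        int viaText = CountOccurrences(html, "<input");

        Console.WriteLine("Descendants: " + viaDescendants);
        Console.WriteLine("XPath:       " + viaXPath);
        Console.WriteLine("Raw text:    " + viaText);
    }
}
```

If the Descendants and raw-text counts agree but the XPath count is lower, that would point at the XPath layer rather than the parser.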

Nov 6, 2009 at 9:37 AM

The Mozilla Firefox highlighter add-on causes the problem. Try disabling the plugin and testing again.