parsing form tags

Nov 16, 2007 at 8:47 AM
Hi guys,

I'm running into a weird issue with form tags. This is my stripped-down test case:

[Test]
public void ParseFormTest()
{
    string html = "<body> <form><table></table></form> <table></table> </body>";

    HtmlDocument htmlDocument = new HtmlDocument();
    htmlDocument.Load(new StringReader(html));

    // Get the tables inside the form, then all tables in the document
    HtmlNode formNode = htmlDocument.DocumentNode.SelectSingleNode("//form");
    HtmlNodeCollection tablesInFormCollection = formNode.SelectNodes(".//table"); // <-- returns null
    HtmlNodeCollection tableInDocumentCollection = htmlDocument.DocumentNode.SelectNodes("//table");

    Assert.AreEqual(2, tableInDocumentCollection.Count);
    Assert.AreEqual(1, tablesInFormCollection.Count);
}

If I change the HTML to:

string html = "<body> <div><table></table></div> <table></table> </body>";

and use div instead of form in my XPath query, it works fine.

Any suggestions about this?

thanks!

Ernst.
Nov 16, 2007 at 9:57 PM
Hi Ernst,

This happens because, by default, forms are parsed as empty nodes, since forms are allowed to overlap other elements in the HTML spec.

In other words, the following is technically legal HTML, even though it gives us developers hives:

<table>
<form>
<some input elements>
</table>
</form>

Here, the form overlaps the closing of the table and, when properly rendered, will be contained inside the table. Because HtmlDocument tries to accept this as valid without automatically correcting the HTML, by default it makes no attempt to populate the form's child nodes.

Ok. All that is merely an introduction. You can get around this default behavior by adding the following line:

HtmlNode.ElementsFlags.Remove("form");

before you make ANY use of HtmlDocument. This will allow it to parse the nodes of the form, but it sacrifices the ability of the form to overlap other nodes. It will force the form to be closed properly.
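
For what it's worth, here is a minimal sketch of how that might look against Ernst's test case. The class and method names (FormParsingSketch, CountTablesInFirstForm) are made up for illustration and are not part of the library, and I'm using LoadHtml instead of Load(new StringReader(...)) purely for brevity. With the "form" entry removed, the table inside the form should show up as a descendant of the form node, so the second assertion in the original test ought to pass.

```csharp
using System;
using HtmlAgilityPack;

public static class FormParsingSketch
{
    // Hypothetical helper (not part of Html Agility Pack): counts the <table>
    // elements found underneath the first <form> in the given markup.
    public static int CountTablesInFirstForm(string html)
    {
        // Drop the default flags for "form" so that its children are attached to it.
        // ElementsFlags is static, so do this before any document is loaded.
        HtmlNode.ElementsFlags.Remove("form");

        var document = new HtmlDocument();
        document.LoadHtml(html);

        HtmlNode formNode = document.DocumentNode.SelectSingleNode("//form");
        if (formNode == null)
        {
            return 0;
        }

        // SelectNodes returns null, not an empty collection, when nothing matches.
        HtmlNodeCollection tables = formNode.SelectNodes(".//table");
        return tables == null ? 0 : tables.Count;
    }

    public static void Main()
    {
        string html = "<body> <form><table></table></form> <table></table> </body>";
        Console.WriteLine(CountTablesInFirstForm(html));
    }
}
```

Note the null checks: as Ernst's test shows, SelectNodes hands back null when there is no match, so it is safer to guard before touching Count.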

If you have reasonable assurance that your HTML will be well-formed, this is probably not a problem. For commercial web sites, this is probably a good bet, as most of them use page builders nowadays instead of hand-coding the HTML. I wouldn't have said that 2 or 3 years ago, but there's been a lot of emphasis on standards recently since the browser wars are heating up again.
Coordinator
Nov 17, 2007 at 4:37 PM
Wow, I could not have said it better. I see some people really start to understand the Html Agility Pack :-)
Nov 18, 2007 at 6:10 PM

simonm wrote:
Wow, I could not have said it better. I see some people really start to understand the Html Agility Pack :-)


Thanks Simon! Finding this library was, for me, like seeing the light after centuries of darkness. It's the only elegant, simple approach to HTML parsing I have ever seen - all the others attempt to do too much and be too smart. Why create a litany of HTML node types when you know you're going to need to look for a "td", for example, anyway? It just complicates things.

Off topic: Have you given any thought to incorporating the FormProcessor into the core codebase? I've also gone through all the comments on every blog/board you've used to post this project, and incorporated every bug fix or feature addition I found that I thought would be useful. The project could certainly use a refresh :)
Nov 19, 2007 at 8:01 AM
hi guys,

thanks for the fix and detailed explanation!
Very cool and handy library indeed :)

cheers,
ernst.
Jan 10, 2008 at 7:43 AM
Thanks for the excellent explanation.

I was struggling with workarounds for the form issue until I finally decided to file a bug and found this.
Jun 26, 2009 at 10:52 PM

I have a question on this.  I'm wondering why the form nodes cannot contain the inputs as child nodes since the inputs are always going to be children, even if the form overlaps other html.

I didn't understand what was meant here, with regard to parsing out the form element:

[QUOTE]It will force the form to be closed properly.[/QUOTE]

My problem is that I want to process pages with more than one form and need to differentiate between inputs as to which form they belong to.  What is my best course of action?

Feb 16, 2010 at 9:13 PM
simonm wrote:
Wow, I could not have said it better. I see some people really start to understand the Html Agility Pack :-)

No thanks to the documentation.