empty "form" elements in SelectNodes

Topics: Developer Forum
Jan 22, 2009 at 10:04 AM

When scraping an HTML page that contains multiple form elements, with form specified in the XPath, the Agility Pack returns form elements that contain nothing at all, even though the forms in the HTML have valid content inside.

For example, for xPath = "//table[@class='pageBotBlock']//form", SelectNodes returns a form element with no children.

For xPath = "//table[@class='pageBotBlock']//form/table" it finds no elements at all.

It is interesting to note that if I exclude "form" from the XPath, SelectNodes returns elements that are actually children of "form"; e.g. xPath = "//table[@class='pageBotBlock']//table" returns the expected results.
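The three queries above can be compared side by side with a small console program. This is only a sketch: the markup is a hypothetical stand-in for the scraped page, CountMatches is a helper name made up for illustration, and what it prints depends on the Agility Pack version and its default element flags.

```csharp
using System;
using HtmlAgilityPack;

static class FormXPathRepro
{
    // Hypothetical helper: number of nodes matched by an XPath (0 when SelectNodes returns null).
    public static int CountMatches(HtmlDocument doc, string xpath)
    {
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    static void Main()
    {
        // Made-up markup standing in for the real page.
        string html =
            "<table class='pageBotBlock'><tr><td>" +
            "<form action='/search'><table><tr><td>inside</td></tr></table></form>" +
            "</td></tr></table>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        string[] xpaths =
        {
            "//table[@class='pageBotBlock']//form",
            "//table[@class='pageBotBlock']//form/table",
            "//table[@class='pageBotBlock']//table"
        };

        foreach (string xpath in xpaths)
        {
            Console.WriteLine("{0} -> {1} match(es)", xpath, CountMatches(doc, xpath));

            HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
            if (nodes == null) continue;

            // Report how many children each matched node actually holds.
            foreach (HtmlNode node in nodes)
                Console.WriteLine("  <{0}> with {1} child node(s)", node.Name, node.ChildNodes.Count);
        }
    }
}
```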

In the debugger I found that in HtmlNode.SelectNodes the form element loses its children when XPathNodeIterator.Current is cast to HtmlNodeNavigator and HtmlNodeNavigator.CurrentNode is then accessed:

 

 

public HtmlNodeCollection SelectNodes(string xpath)
{
    HtmlNodeCollection list = new HtmlNodeCollection(null);
    HtmlNodeNavigator nav = new HtmlNodeNavigator(_ownerdocument, this);
    XPathNodeIterator it = nav.Select(xpath);
    while (it.MoveNext())
    {
        HtmlNodeNavigator n = (HtmlNodeNavigator)it.Current; // form with children
        list.Add(n.CurrentNode); // form has no children??
    }
    return list;
}

 

I've asked Tommi Laukkanen, who posted about the same problem at http://www.codeplex.com/htmlagilitypack/wiki/comments/archive/view?title=Home&page=1, but he was not able to use the HTML Agility Pack because of malformed HTML tags.

Fortunately for me, I was able to specify the XPath without mentioning "form".
However, it would be good if someone could explain or resolve this behavior.

Mar 17, 2010 at 9:27 PM

I think I can explain this a bit, if anyone is still wondering about this. I was searching the internet for why I couldn't seem to remove form elements from a page.

I first came across http://www.fremus.co.za/blog/2009/12/interesting-code-with-htmlagilitypack/ which explains about text nodes, and then I found this.

I had a similar issue in that doc.DocumentNode.SelectNodes("//body//form//input") was turning up null. Using this small script I was able to determine that while the form element shows up as an element, it is not a container.

foreach (HtmlNode node in bodynode.ChildNodes)
{
    Console.WriteLine(node.Name);
    foreach (HtmlNode childNode in node.ChildNodes)
    {
        Console.WriteLine("\t" + childNode.Name);
        if (childNode.Name == "#text") Console.WriteLine("\t\t" + childNode.InnerText);
        foreach (HtmlNode grandChildNode in childNode.ChildNodes)
        {
            Console.WriteLine("\t\t\t" + grandChildNode.Name);
            if (grandChildNode.Name == "#text") Console.WriteLine("\t\t\t\t" + grandChildNode.InnerText);
        }
    }
}

Results in this output:

#text
#text
#text
#text
#text
#text
#text
#text
#text
div
        #text


        div
                        #text

 

 

 


                        form
                        #text


                        #text

 


                        #text

 

 

 

                        h1
                        #text

 

                        div
                        #text

 


                        #text
                                </form>
                        #text

 

 


        #text

 


#text

The HTML that was used to generate this output started with:

<div id="container">
  <div id="content">
    <form name="aspnetForm" ...>
      <h1>

You can see that the <form> entry IS AN ELEMENT, but not a container. I am currently looking for an XPath way of finding #text nodes, but I haven't gotten there yet; I keep getting an error when trying the XPath. I'll probably just have to enumerate ChildNodes to find a #text node whose InnerHtml contains </form>.
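The enumeration idea mentioned above could look roughly like this. It is a sketch, not something taken from the library: StrayFormTags and FindTextNodesContaining are hypothetical names, and it assumes the stray closing tag ends up inside a text node exactly as the output above suggests.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class StrayFormTags
{
    // Hypothetical helper: walks the tree through ChildNodes and collects
    // every text node whose markup contains the given marker.
    public static List<HtmlNode> FindTextNodesContaining(HtmlNode root, string marker)
    {
        var hits = new List<HtmlNode>();
        Collect(root, marker, hits);
        return hits;
    }

    private static void Collect(HtmlNode node, string marker, List<HtmlNode> hits)
    {
        if (node.NodeType == HtmlNodeType.Text)
        {
            // Text nodes have no children, so check the content and stop here.
            if (node.InnerHtml.IndexOf(marker, StringComparison.OrdinalIgnoreCase) >= 0)
                hits.Add(node);
            return;
        }

        foreach (HtmlNode child in node.ChildNodes)
            Collect(child, marker, hits);
    }
}
```

Calling StrayFormTags.FindTextNodesContaining(doc.DocumentNode, "</form>") returns a separate list, so each hit can then be inspected or removed with node.Remove() without disturbing the traversal.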

 

-J

Mar 17, 2010 at 9:47 PM

This has been reiterated many times on the discussion forum and in the issue tracker. By default the form tag is not treated as a container, because HTML 3 allows it to be placed both inside and outside of other containers. There is an option to change this behavior.

In the HtmlNode constructor there is a list of defaults. If you remove this line

ElementsFlags.Add("form", HtmlElementFlag.CanOverlap | HtmlElementFlag.Empty);
it will behave as expected. The plan for the next release is to bring these defaults up to date and to provide easier ways of overriding them, such as passing in your own collection when creating the HTML document.
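If rebuilding the library is not convenient, the same default can usually be dropped at run time. The sketch below assumes your version exposes the static HtmlNode.ElementsFlags collection publicly; FormAsContainer and CountFormInputs are made-up names, and the markup is illustrative only. Note that the change is static, so it affects every document parsed afterwards in the same process.

```csharp
using System;
using HtmlAgilityPack;

static class FormAsContainer
{
    // Hypothetical helper: parses the markup with the default "form" flags removed
    // and returns how many <input> elements are found inside forms.
    public static int CountFormInputs(string html)
    {
        // Must run before the document is loaded; the flags are consulted while parsing.
        HtmlNode.ElementsFlags.Remove("form");

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection inputs = doc.DocumentNode.SelectNodes("//body//form//input");
        return inputs == null ? 0 : inputs.Count;
    }

    static void Main()
    {
        // Made-up markup, loosely modeled on the page discussed above.
        string html = "<html><body><form name='aspnetForm'><h1>Title</h1><input name='q' /></form></body></html>";
        Console.WriteLine("{0} input(s) found inside forms", CountFormInputs(html));
    }
}
```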