Wrong XPath if an element is part of a form

description

Hi, I found this package helpful for parsing HTML, but it gives the wrong XPath when the element is part of a form. Any idea how to fix this?

comments

kurtnelle wrote Dec 21, 2010 at 2:29 PM

Can you post an example?

rsturley wrote May 15, 2011 at 12:34 AM

I have a similar problem that is probably related: the elements inside a form are not shown as child nodes of the form. I have attached a C# program that uses the library to parse the attached HTML file and display the nesting level of each tag.

This is the test HTML file:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head> <meta content="text/html; charset=windows-1252" http-equiv="Content-Type" /> </head> <body>
<form method="post" action=""> <input name="Text1" type="text" /> <input name="Submit1" type="submit" value="submit" /> </form> </body>

</html>

Here is the C# program:

using HtmlAgilityPack;
using System;
using System.IO;

namespace AgilityTest
{
    class Program
    {
        static void Main(string[] args)
        {
            new Program().Run();
        }

        void Run()
        {
            string docString = ReadTestPage();
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(docString);
            // This document has an html tag which has a head and a body. Display the tag hierarchy.
            DisplayNode(doc.DocumentNode, 0);
        }

        // Prints the node's nesting level and name, indented two spaces per level,
        // then recurses into its child nodes.
        void DisplayNode(HtmlNode node, int level)
        {
            for (int i = 0; i < level; i++)
            {
                Console.Write("  ");
            }
            Console.Write("<{0}> <{1}>\n", level, node.OriginalName);
            foreach (HtmlNode child in node.ChildNodes)
                DisplayNode(child, level + 1);
        }

        // Reads the test HTML file; the path is relative to the working directory.
        // (Renamed from HtmlDocument() so it no longer shares a name with the HtmlDocument class.)
        string ReadTestPage()
        {
            return File.ReadAllText("../../test_web_page.html");
        }
    }
}

Here is the output. Note that the input tags are shown as siblings of the form tag rather than as children.

<0> <#document>
  <1> <#comment>
  <1> <#text>
  <1> <html>
    <2> <#text>
    <2> <head>
      <3> <#text>
      <3> <meta>
      <3> <#text>
    <2> <#text>
    <2> <body>
      <3> <#text>
      <3> <form>
      <3> <#text>
      <3> <input>
      <3> <#text>
      <3> <input>
      <3> <#text>
      <3> <#text>
      <3> <#text>
    <2> <#text>
  <1> <#text>
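
One workaround that may help with the output above, offered as a sketch rather than a confirmed fix for this report: Html Agility Pack keeps per-tag parsing flags in the static HtmlNode.ElementsFlags dictionary, and in the versions I am aware of "form" is registered there in a way that stops it from holding child elements. Removing that entry before loading the document should let the input tags nest under the form. The class name FormNestingSketch and the helper CountInputsInsideForms are hypothetical names used only for this illustration, and the default flags may differ between library versions.

```csharp
using System;
using HtmlAgilityPack;

static class FormNestingSketch
{
    // Hypothetical helper: counts <input> elements that are direct children of a <form>.
    public static int CountInputsInsideForms(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        // SelectNodes can return null (rather than an empty collection) when nothing matches.
        HtmlNodeCollection inputs = doc.DocumentNode.SelectNodes("//form/input");
        return inputs == null ? 0 : inputs.Count;
    }

    static void Main()
    {
        const string html =
            "<html><body><form method=\"post\" action=\"\">" +
            "<input name=\"Text1\" type=\"text\" />" +
            "<input name=\"Submit1\" type=\"submit\" value=\"submit\" />" +
            "</form></body></html>";

        // With the default flags the inputs may be parsed as siblings of the form.
        Console.WriteLine("Default flags: {0}", CountInputsInsideForms(html));

        // Assumption: "form" has a special entry in ElementsFlags in this library version.
        // Remove() is harmless if the entry is absent. The change is global (static),
        // so it affects every document parsed afterwards in the same process.
        HtmlNode.ElementsFlags.Remove("form");

        Console.WriteLine("After removing the form flag: {0}", CountInputsInsideForms(html));
    }
}
```

Because ElementsFlags is static, the Remove call belongs once at startup, before any document is loaded. Once the inputs are real children of the form, the XPath property of those nodes should include the form step as well.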
