Accessing node by variable reference

Topics: Developer Forum, Project Management Forum, User Forum
Jul 31, 2009 at 9:52 AM

Hello.

I am writing a piece of code in VB to capture some data from some websites. I have started with relatively easy websites but now things got a bit more complicated and I got stuck.


I need to access a node whose position comes from the value of a variable. Consider this example:

x = /...../tr[1]

y = /..../span[1]

z = /..../text()[2]

x, y and z are the strings I need for a single entry in my program.

The problem is that I need to iterate over this 20+ times.

So the general pattern would be

for i = 1 to 20 do

x = /.../tr[i]

y = /.../span[i]

z = /.../text()[2*i]

 

But I don't know how to access a node by variable reference.

I've tried:

/.../tr[position()=i]

/.../tr[position()= $i]

/.../tr[$i]

None seem to be working.


Some help would be greatly appreciated.

 

Thank you,

Robert
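
An XPath expression is just a string handed to the parser, so a loop counter from the host language is not visible inside it: a bare i is read as a child element named i, and $i is an XPath variable that nothing has bound. One common workaround is to build the expression anew on each iteration. The following is a minimal C# sketch of that idea, assuming Html Agility Pack. The elided /.../ prefixes from the question are not known, so the sketch uses a made-up sample document and the made-up prefix //table; XPathIndexDemo, IndexedPath and ReadRows are hypothetical names. XPath positions start at 1, so the loop starts at 1 as well.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class XPathIndexDemo
{
    // Hypothetical markup standing in for the real page.
    public const string SampleHtml =
        "<table><tr><td>first</td></tr><tr><td>second</td></tr></table>";

    // Builds e.g. "//table/tr[3]" from prefix "//table", step "tr", index 3.
    public static string IndexedPath(string prefix, string step, int index)
    {
        return prefix + "/" + step + "[" + index + "]";
    }

    // Reads the text of rows 1..count; a missing row yields null.
    public static List<string> ReadRows(string html, int count)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var texts = new List<string>();
        for (int i = 1; i <= count; i++) // XPath positions are 1-based
        {
            string xpath = IndexedPath("//table", "tr", i);
            HtmlNode row = doc.DocumentNode.SelectSingleNode(xpath);
            texts.Add(row == null ? null : row.InnerText);
        }
        return texts;
    }
}
```

For the third pattern in the question, the doubled index would be computed in the host language before it goes into the string, for example IndexedPath(prefix, "text()", 2 * i).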

May 28, 2010 at 11:15 PM
Edited May 28, 2010 at 11:18 PM

Hi Robert,

Your message is pretty dated... So I hope someone bothers to look at this.  :-)

I'm working in C# and have the same problem.  I'm trying to get the first row from a table (the column names/headers).  It works fine if I know ahead of time how many columns there are, but if I don't I'm totally out of luck.  The following works if I have four columns:

// Create a new HtmlDocument object:
HtmlDocument doc = new HtmlDocument();
// Load the HtmlDocument object with the contents of an HTML file:
doc.Load(FileName);

// Get all tables in the document
HtmlNodeCollection tables = doc.DocumentNode.SelectNodes("//table");

Console.WriteLine("There were {0} tables found in this file.", tables.Count);

for (int i = 1; i <= tables.Count; i++)
{
    Console.WriteLine("Reading Table {0}", i);

    Console.WriteLine("Xpath: /table[1]/tr[1]/td[1]");
    HtmlNode MyTest = doc.DocumentNode.SelectSingleNode(@"/table[1]/tr[1]/td[1]");
    Console.WriteLine("{0}", MyTest.InnerText);
    MyTest = doc.DocumentNode.SelectSingleNode(@"/table[1]/tr[1]/td[2]");
    Console.WriteLine("{0}", MyTest.InnerText);
    MyTest = doc.DocumentNode.SelectSingleNode(@"/table[1]/tr[1]/td[3]");
    Console.WriteLine("{0}", MyTest.InnerText);
    MyTest = doc.DocumentNode.SelectSingleNode(@"/table[1]/tr[1]/td[4]");
    Console.WriteLine("{0}", MyTest.InnerText);

}

How can I:
1.  Identify the number of columns?
2.  Specify the column number I want so I can access the columns dynamically at runtime? (Something like doc.DocumentNode.SelectSingleNode(@"/table[1]/tr[1]/td[VARIABLENAME]"))

Thanks,
Russell Schutte

May 28, 2010 at 11:35 PM

Hi Everyone,

Dang it, I try not to ask for help until I'm really stuck...  And as sure as I posted, I found half of my solution:

2.  The answer is easier than would be expected: 

MyTest = doc.DocumentNode.SelectSingleNode(@"/table[1]/tr[1]/td[" + j + "]");
Console.WriteLine("{0}", MyTest.InnerText);

My other question remains, hopefully someone can help:

1.  How can I identify the number of columns in an HTML table?

Thanks,

Russell Schutte

 


May 29, 2010 at 12:21 AM
Edited May 29, 2010 at 2:02 AM

Have you tried using some of the new LINQ-compatible methods? Select the tr node and call node.Descendants("td").Count() on it:

doc.DocumentNode.SelectSingleNode(@"/table[1]/tr[1]").Descendants("td").Count()

Version 1.4.0 added a number of new methods that you can use with LINQ.

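// Requires: using System; using System.Linq;
// For every table, take its first tr (if any) and collect the text of each td in it.
var firstTableRows = doc.DocumentNode.Descendants("table")
                                     .Select(table => table.Descendants("tr").FirstOrDefault())
                                     .Where(tr => tr != null)
                                     .Select(tr => tr.Descendants("td").Select(td => td.InnerText).ToList());
firstTableRows.ToList()
              .ForEach(cells =>
              {
                  // cells holds the cell texts of one table's first row
                  Console.WriteLine("Count {0}:", cells.Count);
                  cells.ForEach(text => Console.WriteLine(text));
              });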


May 29, 2010 at 4:12 AM

Hi DarthObiwan,

I'm brand new to C# (a couple of weeks new), and while I'm comfortable with SQL, I have no experience with LINQ.  I guess I'll get there someday.  :-)

I don't know if this is the best method, but I was able to determine the number of columns (this may not work with complex HTML tables, like those that include COLSPAN, for example - it's untested):

                // Create a new HtmlDocument object:
                HtmlDocument doc = new HtmlDocument();
                // Load the HtmlDocument object with the contents of an HTML file:
                doc.Load(FileName);

                // Get all tables in the document
                HtmlNodeCollection tables = doc.DocumentNode.SelectNodes("//table");

                Console.WriteLine("There were {0} tables found in this file.", tables.Count);

                for (int i = 1; i <= tables.Count; i++)
                {
                    Console.WriteLine("Reading Table {0}", i);

                    HtmlNodeCollection columns = doc.DocumentNode.SelectNodes("//td");
                    HtmlNodeCollection rows = doc.DocumentNode.SelectNodes("//tr");

                    Console.WriteLine("Rows: {0}", rows.Count);
                    Console.WriteLine("Columns: {0}", (columns.Count/rows.Count));
                    int NumberofColumns = (columns.Count / rows.Count);

                    HtmlNode MyTest = null;

                    for (int j = 1; j <= NumberofColumns; j++)
                    {
                        MyTest = doc.DocumentNode.SelectSingleNode(@"/table[1]/tr[1]/td[" + j + "]");
                        Console.WriteLine("{0}", MyTest.InnerText);
                    }
                }

If there's a better way, please let me know.  (LINQ looks like total Greek to me, but it looks like you've done more or less the same thing to get the count.)

Thanks,
Russell Schutte

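
Dividing the total number of td cells by the total number of tr rows gives the wrong answer as soon as the document holds more than one table or a row uses COLSPAN. A minimal sketch of an alternative, assuming Html Agility Pack: look only at the first row of a table and add up each cell's colspan value, treating a missing attribute as 1. ColumnCounter and its method names are hypothetical, the sample only looks at the first row found in the document, and ROWSPAN is not handled.

```csharp
using System;
using HtmlAgilityPack;

public static class ColumnCounter
{
    // Counts the logical columns of one row: every direct th/td child
    // contributes its colspan, or 1 when the attribute is absent or invalid.
    public static int CountColumns(HtmlNode row)
    {
        if (row == null) return 0; // no row found, so no columns

        int total = 0;
        foreach (HtmlNode cell in row.ChildNodes)
        {
            if (cell.Name == "td" || cell.Name == "th")
            {
                total += Math.Max(1, cell.GetAttributeValue("colspan", 1));
            }
        }
        return total;
    }

    // Convenience wrapper: counts the columns of the first tr inside any table.
    public static int CountColumnsInFirstRow(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return CountColumns(doc.DocumentNode.SelectSingleNode("//table//tr"));
    }
}
```

Iterating over ChildNodes rather than Descendants("td") keeps cells of any nested table out of the count.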
Jun 1, 2010 at 3:43 PM
Edited Jun 1, 2010 at 3:54 PM

Hi Everyone,

I worked a bit on this over the weekend and I'm still a bit stuck.

First, I haven't figured out how to read each table - or nested tables.

Secondly, sometimes I get the error:

"Object reference not set to an instance of an object."

referring to the line

Console.WriteLine("{0}", MyTest.InnerText);

I suspect it has something to do with the tables I'm reading - if they don't have a TR or TD?  How can I handle this correctly?  Testing for MyTest.InnerText == null gives me the error as well.

Thanks for any help you can provide.  (In the meantime, I'm looking into LINQ - as it seems this might be my best answer?)

Russell Schutte

Jun 1, 2010 at 4:01 PM
That happens because the XPath methods return null when nothing is found. It works this way because that is how System.Xml works, and Html Agility Pack was written to mimic that API. So SelectSingleNode may return null, and MyTest itself has to be checked for null before you read MyTest.InnerText.
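
A minimal sketch of such a null check, assuming Html Agility Pack; SafeSelect, TextOrDefault and CountMatches are hypothetical helper names. The second helper guards SelectNodes the same way, since it can also come back as null rather than as an empty collection when nothing matches.

```csharp
using HtmlAgilityPack;

public static class SafeSelect
{
    // Returns the inner text of the first match, or the fallback when
    // the XPath expression matches nothing.
    public static string TextOrDefault(HtmlNode root, string xpath, string fallback)
    {
        HtmlNode node = root.SelectSingleNode(xpath);
        return node == null ? fallback : node.InnerText;
    }

    // Returns the number of matches, treating a null result as zero.
    public static int CountMatches(HtmlNode root, string xpath)
    {
        HtmlNodeCollection nodes = root.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }
}
```

With these, a loop such as the one in the earlier post could print SafeSelect.TextOrDefault(doc.DocumentNode, xpath, "(missing)") instead of dereferencing the result directly.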
Jun 1, 2010 at 4:09 PM

Hi Darth,

Thank you for such a cool tool.  I hope I can figure out how to make it work for me, eventually.  :-)

I have to read table data from a variety of websites - I'm trying to read the headers (top row) for each column and then I can parse the data from the HTML tables, knowing what each column contains.  Often these tables will be nested for formatting purposes.

Seems to me that this should be a lot easier than I'm making it.

What's the best way to do this?  (I will work to figure it out - just give me some pointers).

Thanks,

Russell Schutte
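
One possible approach to the nested-table case, sketched under assumptions: visit every table in the document, including nested ones, and for each take only the first row that belongs to that table itself, so that rows of an inner table are not mistaken for the outer table's header. The sketch assumes Html Agility Pack; TableHeaders, FirstOwnRow and ReadHeaders are hypothetical names, and it assumes rows sit either directly under table or under a thead/tbody wrapper.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class TableHeaders
{
    // Finds the first tr that is a direct child of the table, or of its
    // thead/tbody, so rows of nested tables are never returned.
    private static HtmlNode FirstOwnRow(HtmlNode table)
    {
        HtmlNode row = table.Element("tr");
        if (row != null) return row;

        foreach (string section in new[] { "thead", "tbody" })
        {
            HtmlNode wrapper = table.Element(section);
            if (wrapper != null && wrapper.Element("tr") != null)
            {
                return wrapper.Element("tr");
            }
        }
        return null; // table without any row of its own
    }

    // Returns one list of header texts per table, in document order.
    public static List<List<string>> ReadHeaders(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var result = new List<List<string>>();
        foreach (HtmlNode table in doc.DocumentNode.Descendants("table"))
        {
            var headers = new List<string>();
            HtmlNode row = FirstOwnRow(table);
            if (row != null)
            {
                foreach (HtmlNode cell in row.ChildNodes)
                {
                    if (cell.Name == "th" || cell.Name == "td")
                    {
                        headers.Add(cell.InnerText.Trim());
                    }
                }
            }
            result.Add(headers);
        }
        return result;
    }
}
```

A table used purely for layout will still show up in the result; if a header cell itself contains a nested table, its InnerText includes the nested text, so such tables would need to be filtered out by the caller.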