Parsing HTML Table Data

May 17, 2007 at 2:29 AM
I have been searching Google for the last day or so and cannot find any examples of how to pull data from an HTML table on a remote web server.

I have a project where I need to parse data from a specific table on a webpage with multiple tables.

The issue is that I am trying to use HtmlWeb.Load to pull the remote page, but I guess I need to turn that into a stream before I can parse it with HtmlDocument? Would it be easier to just save the file to the local disk and parse it with HtmlDocument from there? If so, are there any issues with overwriting an existing document? (This task needs to be done daily.)

This looks like an excellent tool to use but the lack of examples in the documentation makes it a little hard to get started.

Any help would be appreciated.
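
For readers landing here with the same question: as far as the snippets later in this thread show, HtmlWeb.Load already returns a parsed HtmlDocument, so no manual stream handling or temporary file should be needed. The sketch below picks one table by its position on the page. The URL, the table index, and the helper name GetTable are placeholders invented for illustration, not part of the library.

```csharp
using System;
using HtmlAgilityPack;

public static class RemoteTableDemo
{
    // Returns the table at the given zero-based position in document order,
    // or null if the page has no such table. (Hypothetical helper.)
    public static HtmlNode GetTable(HtmlDocument doc, int index)
    {
        // SelectNodes returns null when nothing matches
        HtmlNodeCollection tables = doc.DocumentNode.SelectNodes("//table");
        if (tables == null || index < 0 || index >= tables.Count)
        {
            return null;
        }
        return tables[index];
    }

    public static void Main()
    {
        // HtmlWeb.Load downloads and parses the page in one step
        HtmlWeb web = new HtmlWeb();
        HtmlDocument doc = web.Load("http://www.example.com/report.html"); // placeholder URL

        HtmlNode table = GetTable(doc, 2); // assumption: the third table is the one wanted
        Console.WriteLine(table == null ? "Table not found" : table.InnerText);
    }
}
```

Selecting by position is fragile if the page layout changes; an attribute such as id or class, when the table has one, is usually a steadier handle.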
May 29, 2007 at 2:26 PM
Hi Pikoro,

Same here :-)

I also have a web page from which I am trying to get the contents of a specific <table>.

// - - - bof

string sUrlHtml = @"http://www.myweb.com";

HtmlAgilityPack.HtmlWeb oHtmlWeb = new HtmlAgilityPack.HtmlWeb();

HtmlDocument oHtmlDocument = oHtmlWeb.Load(sUrlHtml);

HtmlNode oRootNode = oHtmlDocument.DocumentNode;

// - - - eof

This gives me the root, so I can step through oRootNode.ChildNodes like this:


// - - - bof

// n stands for the index of the child node to follow at each level
HtmlNodeCollection oNC1 = oRootNode.ChildNodes;
HtmlNode oHN1 = oNC1[n];

HtmlNodeCollection oNC2 = oHN1.ChildNodes;
HtmlNode oHN2 = oNC2[n];

// ...and so on

// - - - eof

But for more complex HTML pages it is very complicated to do it that way, especially as I already know which table I am looking for. I am still looking for an easier way, for example stepping recursively through the nodes and their collections, looking for a node with known attributes.

Any Idea?
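
One possible way to do this without walking ChildNodes by hand is sketched below. It assumes a reasonably recent HtmlAgilityPack release that offers HtmlNode.Descendants, and it assumes the wanted table carries a class attribute; the class name "results" and the helper name FindTableByClass are made up for the example.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class FindTableDemo
{
    // Searches the whole tree for the first <table> whose class attribute
    // equals cssClass; returns null when there is none. (Hypothetical helper.)
    public static HtmlNode FindTableByClass(HtmlDocument doc, string cssClass)
    {
        return doc.DocumentNode
                  .Descendants("table")
                  .FirstOrDefault(t => t.GetAttributeValue("class", "") == cssClass);
    }

    public static void Main()
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(
            "<table class='menu'><tr><td>Home</td></tr></table>" +
            "<table class='results'><tr><td>42</td></tr></table>");

        HtmlNode table = FindTableByClass(doc, "results");
        Console.WriteLine(table == null ? "not found" : table.InnerText);
    }
}
```

The same lookup can be written as an XPath query, for example SelectSingleNode("//table[@class='results']"), which also returns null when nothing matches.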

Regards
Jan Waiz

You can also contact me directly by mail: hamburg@icomedv.de



Jun 5, 2007 at 6:53 AM
Hi Jan,
I have just read your message.

I have been trying to solve the same problem these past few days.

I'm trying to find a simple way to write code that achieves the same
result as the standalone programs that extract data from the web.

I'm using "WatiN", which makes it easy to find some tables, but
I'm still looking for something like:
"in this page there are some repeated tables with 10x5 fields; grab
this, this, this and this field of each table".

If anybody is interested in solving the same problem, please contact me.

Regards
Stefano

stefano2212 @ gmail.com


Jun 14, 2007 at 9:55 AM
Hi,

This is how I solved the problem of extracting info from an HTML table:

HTML
<BODY>
<TABLE>
<TR>
<TD>Row 0, Col 0</TD>
<TD>Row 0, Col 1</TD>
</TR>
<TR>
<TD>Row 1, Col 0</TD>
<TD>Row 1, Col 1</TD>
</TR>
</TABLE>
</BODY>

Code
// Load the html document
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://myServer/myTable.htm");

// Get all tables in the document
HtmlNodeCollection tables = doc.DocumentNode.SelectNodes("//TABLE");

// Iterate all rows in the first table
HtmlNodeCollection rows = tables[0].SelectNodes(".//TR");
for (int i = 0; i < rows.Count; ++i) {

// Iterate all columns in this row
HtmlNodeCollections cols = rows[i].SelectNodes(".//TD");
for (int j = 0; j < cols.Count; ++j) {

// Get the value of the column and print it
string value = cols[j].InnerText;
Console.WriteLine(value);
}
}

Result
Row 0, Col 0
Row 0, Col 1
Row 1, Col 0
Row 1, Col 1

Hope this helps!

Cheers!
Johan Olsson

Aug 3, 2007 at 6:32 PM
"HtmlNodeCollection tables = doc.DocumentNode.SelectNodes("//TABLE");"

It appears that the parameter to SelectNodes() is case-sensitive. Is there a way to switch it to case-insensitive?
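
A note on this, based on how HtmlAgilityPack appears to behave: the parser exposes element names in lowercase through HtmlNode.Name regardless of how they are written in the source, so a lowercase XPath such as //table should match <TABLE> as well, while //TABLE matches nothing. I am not aware of a switch that makes XPath itself case-insensitive. For attribute values, the standard XPath 1.0 translate() function can fold case. The sketch below illustrates both; CountMatches is a hypothetical helper and the markup is invented.

```csharp
using System;
using HtmlAgilityPack;

public static class CaseDemo
{
    // Parses the given HTML and returns how many nodes the XPath selects.
    // SelectNodes returns null (not an empty collection) when nothing matches.
    public static int CountMatches(string html, string xpath)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    public static void Main()
    {
        string html = "<TABLE><TR><TD class='Details'>x</TD></TR></TABLE>";

        // Element names: write the XPath in lowercase
        Console.WriteLine(CountMatches(html, "//table"));
        Console.WriteLine(CountMatches(html, "//TABLE"));

        // Attribute values: fold case with translate() before comparing
        string xpath = "//td[translate(@class, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', " +
                       "'abcdefghijklmnopqrstuvwxyz') = 'details']";
        Console.WriteLine(CountMatches(html, xpath));
    }
}
```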
Aug 7, 2007 at 9:41 PM
Edited Aug 7, 2007 at 9:46 PM
Hello Jan,
To me it sounds like an XPath query will work for you. For example, the following gives you all the cell nodes in your HTML document whose class attribute equals "details".

public void xSearch(string url)
{
    // Load the HTML document
    HtmlWeb hw = new HtmlWeb();
    HtmlDocument doc = hw.Load(url);

    // Set the XPath query
    string path = "//td[@class='details']";

    // Query the document for all matching nodes
    HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(path);

    // SelectNodes returns null when nothing matches
    if (nodes == null) return;

    // Print the inner HTML of every matching node
    foreach (HtmlNode n in nodes)
    {
        System.Console.WriteLine(n.InnerHtml);
    }
}
...
Aug 7, 2007 at 9:54 PM
Edited Aug 7, 2007 at 9:55 PM
Hi Johan,

You could simplify your code a great deal. XPath lets you use //td to select all matching elements, so there is no need to iterate over tables and rows. Also, if you have multiple tables and want to select all cells from only the first table, use the XPath query //table[1]//td. Try it out. (Also, I think all element names should be in lowercase, no?)

Code
// Load the html document
HtmlWeb web = new HtmlWeb();
HtmlDocument doc = web.Load("http://myServer/myTable.htm");

// Get all cells in the document (element names in lowercase)
HtmlNodeCollection cols = doc.DocumentNode.SelectNodes("//td");

// Get the value of each cell and print it
foreach (HtmlNode col in cols)
{
    Console.WriteLine(col.InnerText);
}
}
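
One caveat on the //table[1]//td query, following standard XPath 1.0 semantics: //table[1] selects every table that is the first table child of its own parent, which is not always the same as the first table in the document. When tables sit inside different containers, wrapping the path in parentheses, (//table)[1]//td, picks the first table in document order. A small sketch with invented markup (CountCells is a hypothetical helper):

```csharp
using System;
using HtmlAgilityPack;

public static class FirstTableDemo
{
    // Returns the number of nodes the XPath selects in the given HTML.
    public static int CountCells(string html, string xpath)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNodeCollection cells = doc.DocumentNode.SelectNodes(xpath);
        return cells == null ? 0 : cells.Count;
    }

    public static void Main()
    {
        // Two tables, each the first (and only) table inside its own <div>
        string html =
            "<div><table><tr><td>A</td></tr></table></div>" +
            "<div><table><tr><td>B</td></tr></table></div>";

        // Matches the cells of both tables, since each is table[1] of its parent
        Console.WriteLine(CountCells(html, "//table[1]//td"));

        // Matches only the cells of the first table in the document
        Console.WriteLine(CountCells(html, "(//table)[1]//td"));
    }
}
```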





May 28, 2010 at 8:12 PM

Hi Everyone,

Thanks for posting, this was really helpful for me.  There are a couple of corrections to Johan's code:

1.  TABLE, TR, and TD should all be lowercase.

2.  It's HtmlNodeCollection, not HtmlNodeCollections, in the declaration under the "Iterate all columns in this row" comment.

With these corrections, and loading from a LOCAL file, my code looks like this:

string FileName = @"C:\mydirectory\myfile.html";

// Load from a local file instead of using HtmlWeb
HtmlDocument doc = new HtmlDocument();
doc.Load(FileName);

// Get all tables in the document
HtmlNodeCollection tables = doc.DocumentNode.SelectNodes("//table");

// Iterate all rows in the first table
HtmlNodeCollection rows = tables[0].SelectNodes(".//tr");
for (int i = 0; i < rows.Count; ++i)
{
    // Iterate all columns in this row
    HtmlNodeCollection cols = rows[i].SelectNodes(".//td");
    for (int j = 0; j < cols.Count; ++j)
    {
        // Get the value of the column and print it
        string value = cols[j].InnerText;
        Console.WriteLine(value);
    }
}
I hope this helps other "newbies."
Thanks,
Russell Schutte
Dec 6, 2010 at 10:28 AM

What if I have more than one table, with tables nested inside tables, and I need only the top-level tables, i.e. the outermost tables in the document?

For example:

<table><tr><td> Top table 1 </td></tr></table>

<table>

        <tr><td> Top table 2 </td></tr>

         <tr><td> <table><tr><td>Inner table</td></tr></table> </td></tr>

</table>

 

When I use SelectNodes("//table"), all the tables are selected, including the inner table. If I need only the top-level tables (so that the count is just two), how can I achieve this?

 

any help will be greatly appreciated
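
One approach that should work here, sketched under the assumption that HtmlAgilityPack's XPath support covers the standard ancestor axis: filter out any table that has another table among its ancestors, using //table[not(ancestor::table)]. The helper name CountTables is hypothetical; the markup mirrors the example above.

```csharp
using System;
using HtmlAgilityPack;

public static class TopLevelTablesDemo
{
    // Returns how many tables the XPath selects in the given HTML.
    public static int CountTables(string html, string xpath)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);
        HtmlNodeCollection tables = doc.DocumentNode.SelectNodes(xpath);
        return tables == null ? 0 : tables.Count;
    }

    public static void Main()
    {
        string html =
            "<table><tr><td> Top table 1 </td></tr></table>" +
            "<table>" +
            "<tr><td> Top table 2 </td></tr>" +
            "<tr><td> <table><tr><td>Inner table</td></tr></table> </td></tr>" +
            "</table>";

        // Every table, nested ones included
        Console.WriteLine(CountTables(html, "//table"));

        // Only tables that are not contained in another table
        Console.WriteLine(CountTables(html, "//table[not(ancestor::table)]"));
    }
}
```

Within each top-level table, a relative query such as ./tr/td (rather than .//td) keeps the inner table's cells out of the results.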

 

Jan 30, 2013 at 9:13 AM

Hi eosjack,

Did you find the solution for case sensitivity?