extract table from HTML and loop through its rows and cells

Topics: Developer Forum, User Forum
Aug 27, 2009 at 9:14 PM

I have an HTML page that I fetch from another URL. It contains one table whose data I need to extract.
Basically, I need to loop through all the rows in the table and, for each cell in a row, get the data in it.

The table I need starts with: <table width="700" cellspacing="1" cellpadding="1" border="1" bgcolor="#555555">

So I need to get the entire table from this point up to the </table>

How can I get this table from the HTML AND how can I loop through the table rows and cells?

Can anyone provide a good example using HTML Agility Pack? By the way, it looks cool and is exactly what I need for this! :)

Thanks!

Aug 28, 2009 at 8:16 AM

Use SelectSingleNode with the relevant XPath, for example "/html/body/table".

Aug 28, 2009 at 9:36 AM

OK, I think I can get the table with:

SelectSingleNode("/html/body/div/table[5]/tbody/tr/td[2]/table[2]")

But how do I loop through the rows and cells of the table? I'm looking for an example :)

Thanks!

Aug 28, 2009 at 6:57 PM

I am looking for examples too. Please keep me posted. Thanks.

Aug 29, 2009 at 4:29 PM

There are 2 key classes to know, besides HtmlDocument of course.

HtmlNode
HtmlNodeCollection

To "read" the nodes of an HtmlDocument, use its DocumentNode property. DocumentNode is an HtmlNode object.

Another important thing to know is that SelectSingleNode and SelectNodes return null when no nodes are found, so you should test for null in code like the sample here. Suppose there are 3 tables in the HTML document and table.SelectNodes("./tr") returns null for the first one: without a null check, the foreach throws a NullReferenceException and the code never gets past the first table.

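HtmlDocument doc = new HtmlDocument();
doc.Load("sample.html"); // local file; for a URL use new HtmlWeb().Load(url) instead
HtmlNodeCollection tables = doc.DocumentNode.SelectNodes("//table");
if (tables != null)
{
    foreach (HtmlNode table in tables)
    {
        // SelectNodes returns null, not an empty collection, when nothing matches
        HtmlNodeCollection rows = table.SelectNodes("./tr");
        if (rows == null)
            continue;
        foreach (HtmlNode tr in rows)
        {
            HtmlNodeCollection tds = tr.SelectNodes("./td");
            if (tds == null)
                continue;
            // work with the cells in tds here
        }
    }
}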

I am writing this from memory, so it might not compile as is. It is only meant as a reference sample.
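For the table described in the first post, a minimal sketch could look like the following. It assumes the table can be picked out by its width and bgcolor attributes. The TableReader class, the ReadTable helper and the sample markup are made up for illustration and are not part of HAP. Note that ".//tr" also matches rows wrapped in a tbody, but it would pick up rows of nested tables too, so adjust it to the real page.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class TableReader
{
    // Hypothetical helper: returns the cell texts, grouped by row, of the first
    // table matching tableXPath. Returns an empty list when nothing matches.
    public static List<List<string>> ReadTable(string html, string tableXPath)
    {
        var result = new List<List<string>>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectSingleNode returns null when no node matches.
        HtmlNode table = doc.DocumentNode.SelectSingleNode(tableXPath);
        if (table == null)
            return result;

        // Relative to this table; also matches rows inside <tbody>.
        HtmlNodeCollection rows = table.SelectNodes(".//tr");
        if (rows == null)
            return result;

        foreach (HtmlNode tr in rows)
        {
            // Only the direct <td> children of this row (header cells are skipped).
            HtmlNodeCollection cells = tr.SelectNodes("./td");
            if (cells == null)
                continue;

            var row = new List<string>();
            foreach (HtmlNode cell in cells)
                row.Add(HtmlEntity.DeEntitize(cell.InnerText).Trim());
            result.Add(row);
        }
        return result;
    }

    public static void Main()
    {
        // Sample markup invented for this sketch.
        string html =
            "<html><body>" +
            "<table width=\"700\" cellspacing=\"1\" cellpadding=\"1\" border=\"1\" bgcolor=\"#555555\">" +
            "<tr><td>Name</td><td>Price</td></tr>" +
            "<tr><td>Apple</td><td>1.20</td></tr>" +
            "</table></body></html>";

        string xpath = "//table[@width='700' and @bgcolor='#555555']";
        foreach (List<string> row in ReadTable(html, xpath))
            Console.WriteLine(string.Join(" | ", row.ToArray()));
    }
}
```

For a live page, the HTML string would come from the downloaded document instead, for example by calling new HtmlWeb().Load(url) and running the same XPath on its DocumentNode.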

If you add the HAP project to your solution, you can more easily figure out how HAP works than using only the HtmlAgilityPack.dll.

Good luck!

Mar 1, 2010 at 9:33 PM

I was using the sample above as an example and came up with:

HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load("http://xxxxxxxxxx/GeneralContent/Active/PrintPage/PrintPage.aspx?PageId=3270");
// Get all tables in the document
HtmlNodeCollection tables = doc.DocumentNode.SelectNodes("//table");
foreach (HtmlNode table in tables)
{
    // Select the rows (intended: the rows of this table)
    HtmlNodeCollection rows = table.SelectNodes("//tr");
    foreach (HtmlNode row in rows)
    {
        // Select the cells (intended: the cells of this row) and print their text
        HtmlNodeCollection td = row.SelectNodes("//td");
        foreach (HtmlNode cell in td)
        {
            Response.Write(cell.InnerText);
        }
    }
}

My question is: why does my outer loop over the TR tags only get hit once? Basically, all the content on the page turns up in the TD nodes. This could be down to how the markup is structured; I just wanted to verify whether that is true.
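One likely explanation: an XPath that starts with "//" is evaluated from the document root, not from the node you call SelectNodes on. So row.SelectNodes("//td") returns every td in the whole document on the first pass, which would make it look as if all the content sits in one row. A relative path such as "./td" (direct children) or ".//td" (any descendants) stays inside the current node. The small sketch below shows the difference. The XPathScopeDemo class, its helper and the sample markup are invented for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class XPathScopeDemo
{
    // Hypothetical helper: counts the nodes an XPath selects when it is
    // evaluated from the first <tr> of the document.
    public static int CountFromFirstRow(string html, string xpath)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNode firstRow = doc.DocumentNode.SelectSingleNode("//tr");
        if (firstRow == null)
            return 0;

        // SelectNodes returns null when nothing matches.
        HtmlNodeCollection nodes = firstRow.SelectNodes(xpath);
        return nodes == null ? 0 : nodes.Count;
    }

    public static void Main()
    {
        // Sample markup invented for this sketch: two rows with two cells each.
        string html =
            "<table>" +
            "<tr><td>a</td><td>b</td></tr>" +
            "<tr><td>c</td><td>d</td></tr>" +
            "</table>";

        // Absolute path: counts every <td> in the document.
        Console.WriteLine(CountFromFirstRow(html, "//td"));
        // Relative path: counts only the cells of the first row.
        Console.WriteLine(CountFromFirstRow(html, "./td"));
    }
}
```

In the posted loop, changing "//tr" to ".//tr" and "//td" to "./td" should keep each inner selection within the current table and row.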

Apr 2, 2012 at 6:42 PM

I want to know what Line and LinePosition mean in HAP. How are they related to XPath?
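As far as I can tell, Line and LinePosition on an HtmlNode describe where that node starts in the source text (the line and the column within that line), while the XPath property describes where the node sits in the parsed tree. They are separate pieces of information: one is about the raw text, the other about the structure. A small sketch to print all three side by side follows. The NodePositionDemo class, the Describe helper and the sample markup are made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class NodePositionDemo
{
    // Hypothetical helper: formats the source position and the XPath of a node.
    public static string Describe(HtmlNode node)
    {
        return node.Name +
               " line=" + node.Line +
               " position=" + node.LinePosition +
               " xpath=" + node.XPath;
    }

    public static void Main()
    {
        // Sample markup invented for this sketch; both cells sit on the same source line.
        string html =
            "<html>\n<body>\n<table>\n" +
            "<tr><td>a</td><td>b</td></tr>\n" +
            "</table>\n</body>\n</html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection cells = doc.DocumentNode.SelectNodes("//td");
        if (cells == null)
            return;

        // Same Line for both cells, different LinePosition, different XPath index.
        foreach (HtmlNode td in cells)
            Console.WriteLine(Describe(td));
    }
}
```

The line and position values are mostly useful for reporting where in the downloaded source something was found, for example in error messages.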