
extracting information from unsteady html source

Topics: Developer Forum
Aug 3, 2009 at 8:38 AM

Hi. I am a student and I have a project about extracting some required information from a messy HTML source.


Firstly, in order to get the texts which are bold, I tried to write many regular expressions, but I couldn't come up with a useful one. The HTML source is also very long. Later, I thought of extracting only the tables from the source with "<table>(.*?)</table>". However, there were many tables and only 3 of them were required. Then I noticed comment lines in the source; the required tables are between these comment lines:

"<!-- big table start -->" and "<!-- big table end -->". After that, I decided to extract the lines between these comments:

                try
                {
                    WebClient wClient = new WebClient();
                    string source = wClient.DownloadString("link");

                    RegexOptions options = RegexOptions.IgnoreCase | RegexOptions.Singleline;
                    Regex regX = new Regex("<!-- big table start -->(?<theBody>.*?)<!-- big table end -->", options);

                    Match match = regX.Match(source);
                    string theBody = "";
                    if (match.Success)
                        theBody = match.Groups["theBody"].Value;

                    if (theBody != null && theBody != "")
                        textBox1.Text = theBody;

                    using (StreamWriter sw = new StreamWriter("source.html"))
                        sw.Write(theBody);
                }
                catch (Exception ex)
                {
                    MessageBox.Show(ex.Message);
                }

After that, I needed to extract the first and third cells of the tables. While I was searching on the internet for a way to solve this, I found Html Agility Pack. I immediately began to try a few things with it. It is easier than everything I had tried for days; Html Agility Pack can easily extract tables by selecting nodes. While surfing your forums, I saw the code below and put it into my project. However, the program was crashing in the loops because of the long source. At the moment, I don't know what I should do. I also tried to take the required tables out of the source and saved them as text and HTML, but I couldn't convert them back into an HtmlDocument. Do you have any idea or suggestion for me?

                string result = "";

                HtmlWeb web = new HtmlWeb();
                HtmlAgilityPack.HtmlDocument doc = web.Load("link");
                HtmlNodeCollection tables = doc.DocumentNode.SelectNodes("//table");
                HtmlNodeCollection rows = tables[0].SelectNodes("//tr");

                StringCollection strCll = new StringCollection();

                for (int i = 0; i < rows.Count; i++)
                {
                    HtmlNodeCollection cols = rows[i].SelectNodes("//td");

                    for (int j = 0; j < cols.Count; j++)
                    {
                        result = cols[j].InnerText;
                        strCll.Add(result);
                    }
                }

                string str = "";
                for (int i = 0; i < strCll.Count; i++)
                    str += strCll[i];

                textBox1.Text = str;
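On the last point — converting saved HTML text back into a document — Html Agility Pack's HtmlDocument can parse a string directly with LoadHtml, so no web request is needed. A minimal sketch (the sample markup and file name are placeholders, not from the original project):

```csharp
using System;
using System.IO;
using HtmlAgilityPack;

// Suppose the extracted fragment was saved earlier (e.g. to "source.html").
// HtmlDocument.LoadHtml parses any string -- no WebClient/HtmlWeb needed.
string savedHtml = "<table><tr><td>x</td><td>y</td><td>z</td></tr></table>";
// string savedHtml = File.ReadAllText("source.html");   // the file-based variant

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(savedHtml);

// The parsed fragment can then be queried like any other document,
// e.g. to pick the first and third cells of a row.
HtmlNodeCollection cells = doc.DocumentNode.SelectNodes("//td");
if (cells != null && cells.Count >= 3)
    Console.WriteLine(cells[0].InnerText + " / " + cells[2].InnerText);
```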

 Kind regards,

Aug 3, 2009 at 12:08 PM

Hi and welcome to the club!

First I must advise you to always check for null when calling SelectNodes or SelectSingleNode. If the XPath fails or doesn't find a match, these methods return null!
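A small self-contained illustration of that pitfall (the sample markup is made up for the demo):

```csharp
using System;
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml("<html><body><p>no tables here</p></body></html>");

// SelectNodes returns null -- not an empty collection -- when nothing matches,
// so iterating the result without a check throws a NullReferenceException.
HtmlNodeCollection tables = doc.DocumentNode.SelectNodes("//table");
if (tables == null)
{
    Console.WriteLine("No tables found; bail out instead of dereferencing null.");
}
```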

Secondly, if you need to find the XPath to any given element, you can use one or both of the Firefox addons "XPath Checker" and "XPather". Note that Firefox for some reason always adds tbody after table, so you have to check manually in the source whether there actually is a tbody; otherwise you must remove it from the XPath. For simple XPath these addons work well.

If my understanding of XPath is correct, in your loop the code rows[i].SelectNodes("//td") is actually asking for *every* TD element in the document, not the TD elements that are children of rows[i]. If you want children only, your XPath should look like ".//td" — note the dot before //.
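The difference is easy to see on a small document — a sketch with made-up markup, not the original page:

```csharp
using System;
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(
    "<table><tr><td>a1</td><td>a2</td></tr>" +
    "<tr><td>b1</td><td>b2</td></tr></table>");

HtmlNodeCollection rows = doc.DocumentNode.SelectNodes("//tr");

// "//td" is an absolute path: it selects every TD in the whole document.
// ".//td" is relative: it selects only the TDs under the current row.
foreach (HtmlNode row in rows)
{
    HtmlNodeCollection cells = row.SelectNodes(".//td");
    if (cells != null)
        Console.WriteLine(cells.Count + " cell(s) in this row");
}
```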

Assuming that SelectNodes and SelectSingleNode will return the expected node is what has always caused trouble for me in the past.

What exceptions do you get? The stack trace is also key here.