Parse/screen scrape web page

Aug 4, 2010 at 3:39 PM

I have several HTML documents that I need to extract data from. Currently I am using MS Excel to extract the tables from the HTML pages, but my company would like to get away from this practice. I'm still relatively new to programming, so bear with me.

First question: how do I parse a web page from an external (not my own) website, or from an HTML document saved to my hard drive? The examples that I have seen are very vague.

I have used HAPExplorer to get to the table that I need in one of my HTML documents; however, the XPath looks like this: /html[1]/body[1]/table[1]/tr[4]/td[1]/table[1]/tr[2]/td[3]/placeholder[1]/table[1]/tr[1]/td[1]/table[1]/tr[2]/td[2]/table[1]/tr[1]/td[1]/table[1]/tr[2]/td[1]/div[1]/table[3]/tr[1]/td[1]. How in the world do I write this in code?

Any assistance would be greatly welcomed.

Aug 5, 2010 at 2:59 PM

Relatively new to programming? Hmm...

Just to give you an idea of what you've walked into, these links will help you learn XPath:

An XmlDocument object has a method called "SelectNodes" which, given the XPath above, will allow you to select data from the structure of the XML.


using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;

namespace ConsoleApplication1
{
    class Program
    {
        static void Main(string[] args)
        {
            XmlDocument _xmlDocument = new XmlDocument();

            // Load takes a file name to load from; LoadXml takes string data.
            _xmlDocument.Load("");

            // Place the XPath query here. _nodeList is populated with the
            // result of the query (an empty list if nothing matches).
            XmlNodeList _nodeList = _xmlDocument.SelectNodes("");
        }
    }
}

So if you have the XML data saved as files, you can load them with the ".Load" method.
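
To make the skeleton above concrete, here is a minimal sketch that runs a positional XPath of the kind HAPExplorer reports against a small document. The class name TableReader, the method SelectText, and the sample markup are all made up for illustration; LoadXml is used so the example needs no file on disk. One caveat: XmlDocument only accepts well-formed XML, so a real-world HTML page that is not well-formed will most likely throw an XmlException when loaded this way.

```csharp
using System;
using System.Xml;

namespace XPathSketch
{
    public static class TableReader
    {
        // Hypothetical helper: returns the trimmed text of every node
        // matched by the XPath query, in document order.
        public static string[] SelectText(string xml, string xpath)
        {
            XmlDocument document = new XmlDocument();

            // LoadXml parses string data; use document.Load(fileName)
            // instead to read a saved file from disk.
            document.LoadXml(xml);

            XmlNodeList nodes = document.SelectNodes(xpath);
            if (nodes == null)
            {
                return new string[0];
            }

            string[] result = new string[nodes.Count];
            for (int i = 0; i < nodes.Count; i++)
            {
                result[i] = nodes[i].InnerText.Trim();
            }
            return result;
        }
    }

    public static class Program
    {
        public static void Main()
        {
            // Made-up, well-formed markup standing in for a saved page.
            string xml =
                "<html><body><table>" +
                "<tr><td>Name</td><td>Price</td></tr>" +
                "<tr><td>Widget</td><td>9.99</td></tr>" +
                "</table></body></html>";

            // XPath positions are 1-based: second row, second cell.
            string xpath = "/html[1]/body[1]/table[1]/tr[2]/td[2]";

            foreach (string text in TableReader.SelectText(xml, xpath))
            {
                Console.WriteLine(text);
            }
        }
    }
}
```

With the sample markup above this prints 9.99. The long path from HAPExplorer can be passed in the same way as the short one here; it is just a longer chain of the same element[position] steps.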

(NB: This is just a first step, not a complete solution)