Problem scraping several tags at once (VB.NET)

Topics: Developer Forum, User Forum
Mar 17, 2011 at 6:30 PM

Hi guys,
I'm a pretty new HTML Agility Pack user and have only played around with it for the last few days.
I'm trying to scrape some business info, but I can't get it to work the way I want.

Rather than trying to explain it in writing, I'll show you my code below.

This is the result I get: a different company name on each line, but all with the same phone and address:
   Le Soleil Spa, 403-766-9231, 2359 Banff Trail NW, Calgary, AB, T2M4L2
   Spa Ritual The, 403-766-9231, 2359 Banff Trail NW, Calgary, AB, T2M4L2
   Sante Spa, 403-766-9231, 2359 Banff Trail NW, Calgary, AB, T2M4L2

And this is what I am expecting to get:
   Le Soleil Spa, 403-766-9231, 2359 Banff Trail NW, Calgary, AB, T2M4L2
   Spa Ritual The, 403-766-9231, 106 Crowfoot Terr NW, Calgary, AB, T3G4J8
   Sante Spa, 403-766-9231, 240-508 24 Ave SW, Calgary, AB, T2S0K4

I hope someone can show me what I've done wrong, since I've almost pulled my hair out trying to figure it out ...
If I'm on totally the wrong track here, please show me some sample (VB) code that will point me in the right direction ...

  ' Decodes HTML entities left in the scraped text.
  ' (Entity strings assumed to be &amp; and &#39;.)
  Private Function CleanItem(ByVal sRecord As String) As String
    Dim replaceUnwanted As String = sRecord.Replace("&amp;", "&")
    replaceUnwanted = replaceUnwanted.Replace("&#39;", "'")
    Return replaceUnwanted
  End Function


  Private Sub do_scrape()
    ' This is the URL I'm testing to scrape info from
    '  http://www.yellowpages.ca/search/si/1/Estheticians/Calgary+AB

    Dim content As String = ""
    Dim address As HtmlAgilityPack.HtmlNode
    Dim phone As HtmlAgilityPack.HtmlNode

    Dim doc As New HtmlAgilityPack.HtmlDocument()
    doc.Load(WB1.DocumentStream)
    Dim hnc As HtmlAgilityPack.HtmlNodeCollection = doc.DocumentNode.SelectNodes("//span[@class='listingTitle']")

    For Each link As HtmlAgilityPack.HtmlNode In hnc
      phone = link.SelectSingleNode("//h4[@class='phoneLink']/a")
      address = link.SelectSingleNode("//div[@class='address']")

      content &= CleanItem(link.InnerText) & ", " & Trim(phone.InnerText) & ", " & Trim(address.InnerText) & vbNewLine
    Next

    txtResult.Text = content

  End Sub

Mar 18, 2011 at 8:39 PM

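Try this instead (C#). Select each listing's container first, then look up the phone and address relative to that node: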

HtmlNodeCollection nodes = htmlDocument.DocumentNode.SelectNodes("//div[@class='listingDetail']");
foreach(HtmlNode node in nodes)
{
    HtmlNode phone = node.SelectSingleNode("./div/h4[@class='phoneLink']/a");
    HtmlNode address = node.SelectSingleNode("./div[@class='listingDetailLHS']/div[@class='address']");
}
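
For the VB side, here is a rough translation as a minimal sketch. The original loop repeated one phone and address because an XPath that starts with "//" is evaluated from the document root even when it is called on a child node, so SelectSingleNode kept returning the first match on the page. A leading "./" or ".//" keeps the search inside the current node. The sample markup, the module name, and the TextOrEmpty helper are made up for illustration; the real page's structure may differ.

```vbnet
Imports System
Imports System.Text
Imports HtmlAgilityPack

Module RelativeXPathSketch

    ' Hypothetical markup that only borrows the class names used in this thread.
    Private Const SampleHtml As String = _
        "<div class='listingDetail'>" & _
        "<span class='listingTitle'>Spa A &amp; Co</span>" & _
        "<div class='listingDetailLHS'><div class='address'>1 First St</div></div>" & _
        "<div><h4 class='phoneLink'><a>111-1111</a></h4></div>" & _
        "</div>" & _
        "<div class='listingDetail'>" & _
        "<span class='listingTitle'>Spa B</span>" & _
        "<div class='listingDetailLHS'><div class='address'>2 Second Ave</div></div>" & _
        "<div><h4 class='phoneLink'><a>222-2222</a></h4></div>" & _
        "</div>"

    ' Hypothetical helper: returns the decoded, trimmed text of a node,
    ' or an empty string when the node was not found.
    Private Function TextOrEmpty(ByVal node As HtmlNode) As String
        If node Is Nothing Then Return ""
        Return HtmlEntity.DeEntitize(node.InnerText).Trim()
    End Function

    Sub Main()
        Dim doc As New HtmlDocument()
        doc.LoadHtml(SampleHtml)

        ' One node per listing; SelectNodes returns Nothing when nothing matches.
        Dim listings As HtmlNodeCollection = _
            doc.DocumentNode.SelectNodes("//div[@class='listingDetail']")
        If listings Is Nothing Then Return

        Dim content As New StringBuilder()
        For Each listing As HtmlNode In listings
            ' ".//" searches only the descendants of the current listing.
            Dim title As HtmlNode = listing.SelectSingleNode(".//span[@class='listingTitle']")
            Dim phone As HtmlNode = listing.SelectSingleNode(".//h4[@class='phoneLink']/a")
            Dim address As HtmlNode = listing.SelectSingleNode(".//div[@class='address']")

            content.AppendLine(TextOrEmpty(title) & ", " & TextOrEmpty(phone) & ", " & TextOrEmpty(address))
        Next

        Console.Write(content.ToString())
    End Sub

End Module
```

With this markup, each output line should carry the phone and address of its own listing. In the form from the first post, the same loop body would append to content and assign it to txtResult.Text instead of writing to the console.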

Mar 20, 2011 at 9:52 AM

Thanks, roux!

I got it working like a charm now :)