Help with HtmlAgilityPack, match 2 different nodes with each other.

Topics: Developer Forum, User Forum
Sep 11, 2010 at 10:21 PM
My app retrieves "Place names" and their "Addresses" from a website. 
Here is the code so far:
    Dim content As String = ""
    Dim doc As New HtmlAgilityPack.HtmlDocument()
    doc.Load(WebBrowser1.DocumentStream)

    ' Get the place names
    Dim hnc As HtmlAgilityPack.HtmlNodeCollection = doc.DocumentNode.SelectNodes("//span[@class='listingTitle']")
    For Each link As HtmlAgilityPack.HtmlNode In hnc
        ' Decode HTML entities left in the text, then strip unwanted text
        Dim replaceUnwanted As String = link.InnerText.Replace("&amp;", "&")
        replaceUnwanted = replaceUnwanted.Replace("&#39;", "'")
        replaceUnwanted = replaceUnwanted.Replace("See full business details", "")
        replaceUnwanted = replaceUnwanted.ToLower().Replace(vbCrLf, "")

        content &= replaceUnwanted & vbNewLine
    Next
    RichTextBox1.Text = content
    ' Drop the empty lines
    Me.RichTextBox1.Lines = Me.RichTextBox1.Text.Split(New Char() {ControlChars.Lf}, _
                                                       StringSplitOptions.RemoveEmptyEntries)

    Dim content2 As String = ""
    Dim doc2 As New HtmlAgilityPack.HtmlDocument()
    doc2.Load(WebBrowser1.DocumentStream)

    ' Get the addresses (non-empty text nodes directly inside the address div)
    Dim hnc2 As HtmlAgilityPack.HtmlNodeCollection = doc2.DocumentNode.SelectNodes("//div[@class='address']/text()[normalize-space(.)]")
    For Each link As HtmlAgilityPack.HtmlNode In hnc2
        Dim replaceUnwanted As String = link.InnerText.Replace("&amp;", "&")
        replaceUnwanted = replaceUnwanted.Replace("&#39;", "'")
        ' Continue from the previous result; restarting from link.InnerText would undo the replacements above
        replaceUnwanted = replaceUnwanted.Replace("Map", "")
        content2 &= replaceUnwanted & vbNewLine
    Next
    RichTextBox2.Text = content2.Replace(ControlChars.Tab, "")
    ' Drop the empty lines
    Me.RichTextBox2.Lines = Me.RichTextBox2.Text.Split(New Char() {ControlChars.Lf}, _
                                                       StringSplitOptions.RemoveEmptyEntries)

 

So my app gets all the place names and puts them in RichTextBox1, and gets all the addresses and puts them in RichTextBox2. This would be perfect if Yellow Pages didn't have flaws, but it does: some of their "PlaceNames" don't have "Addresses".
Eg:
  1. JH Ryder Machinery Limited
  2. Convenience Storage Ltd 3344 Rideau Rd, Gloucester, ON, K1G3N4 Map
  3. Regional Physiotherapy Clinic 1443 Woodroffe Ave, Nepean, ON, K2G1W1 Map
So now there are more place names than addresses, and they don't line up with each other. How can I make sure they always match up? Or is there another workaround, like deleting/skipping "PlaceNames" without addresses?
Here is the URL if someone wants to take a look at the HTML: http://www.yellowpages.ca/search/?stype=si&what=sh&where=Ottawa,+ON&x=0&y=0

 

Sep 15, 2010 at 1:23 PM

 

1. Select the block of HTML that belongs to ONE company, then work only within that block (for example, you can load it into a new HtmlDocument):

a) extract the company name

b) extract the address

That way you won't need to work out which address matches which company name.
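
The per-company approach above could be sketched roughly as follows. This is only a sketch under assumptions: the wrapper XPath `//div[@class='listing']` is a guess at the element that encloses one company (check the real page source and adjust it), and `ListingScraper` / `ExtractListings` are made-up names. The `listingTitle` and `address` classes come from the question. Relative XPath (`.//`) keeps each lookup inside the current listing, so a missing address shows up as an empty string instead of shifting the rest of the list.

```vbnet
Imports System.Collections.Generic
Imports HtmlAgilityPack

Module ListingScraper

    ' Returns one (name, address) pair per listing.
    ' The address is "" when the listing has no address block.
    Function ExtractListings(ByVal html As String) As List(Of KeyValuePair(Of String, String))
        Dim results As New List(Of KeyValuePair(Of String, String))
        Dim doc As New HtmlDocument()
        doc.LoadHtml(html)

        ' ASSUMPTION: every company sits inside <div class="listing">.
        Dim listings As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//div[@class='listing']")
        If listings Is Nothing Then Return results ' SelectNodes returns Nothing when nothing matches

        For Each listing As HtmlNode In listings
            ' ".//" searches only inside this listing, not the whole page.
            Dim nameNode As HtmlNode = listing.SelectSingleNode(".//span[@class='listingTitle']")
            If nameNode Is Nothing Then Continue For

            Dim name As String = HtmlEntity.DeEntitize(nameNode.InnerText).Trim()

            Dim address As String = ""
            Dim addressNode As HtmlNode = listing.SelectSingleNode(".//div[@class='address']")
            If addressNode IsNot Nothing Then
                ' Keep only the div's own text nodes, which skips the "Map" link text.
                For Each child As HtmlNode In addressNode.ChildNodes
                    If child.NodeType = HtmlNodeType.Text Then
                        address &= child.InnerText
                    End If
                Next
                address = HtmlEntity.DeEntitize(address).Trim()
            End If

            results.Add(New KeyValuePair(Of String, String)(name, address))
        Next

        Return results
    End Function

End Module
```

You could call it with `ExtractListings(WebBrowser1.DocumentText)` and then either write the name and address of each pair to the two RichTextBoxes line by line, or skip pairs whose address is empty if you only want companies that have one.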