Extract Forum Thread Content

Jul 21, 2011 at 5:15 PM

Hi Guys,

I wonder if anybody here could help me out.

I am building a forum crawler to extract thread content, including the user messages, which I then plan to store in the DB for analysis.

Has anybody done this in the past using the HTML Agility Pack? If so, please point me in the direction where I can get some help.

Thanks

Jul 21, 2011 at 9:52 PM
Edited Jul 21, 2011 at 9:54 PM

Hey

Using HTML Agility Pack, it should be pretty easy to select the container element that holds the thread content with XPath.

So you might want to register your different forums/websites in the database and, for each one, the XPath rules that extract the forum content the way you want it. Then use HTML Agility Pack's SelectNodes with each of those XPath rules.
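
As a rough sketch of that idea in C#: the host names, the XPath rules, and the ForumRules / ExtractPosts names below are all made up for illustration, and the in-memory dictionary only stands in for whatever table you keep in your database. The only HAP calls used are HtmlDocument.LoadHtml, DocumentNode.SelectNodes, InnerText and HtmlEntity.DeEntitize.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class ForumRules
{
    // Hypothetical rule table: one entry per registered forum host.
    // In a real crawler these rows would be loaded from the database.
    private static readonly Dictionary<string, string[]> RulesByHost =
        new Dictionary<string, string[]>(StringComparer.OrdinalIgnoreCase)
        {
            { "forum.example.com", new[] { "//div[@class='post-body']" } },
            { "board.example.org", new[] { "//td[contains(@id, 'post_message_')]" } },
        };

    // Returns the text of every node matched by the rules registered
    // for the page's host, or an empty list if the host is unknown.
    public static List<string> ExtractPosts(Uri pageUrl, string html)
    {
        var posts = new List<string>();

        string[] rules;
        if (!RulesByHost.TryGetValue(pageUrl.Host, out rules))
            return posts; // no rules registered for this site

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        foreach (var xpath in rules)
        {
            // SelectNodes returns null (not an empty collection) when nothing matches.
            var nodes = doc.DocumentNode.SelectNodes(xpath);
            if (nodes == null)
                continue;

            foreach (var node in nodes)
                posts.Add(HtmlEntity.DeEntitize(node.InnerText).Trim());
        }

        return posts;
    }
}
```

Supporting a new forum then only means adding a row of XPath rules for it; the extraction code itself does not change.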

Here are some examples of selecting elements from HTML with XPath using HAP:

//table[contains(@id, '_tblProperty')]  selects all TABLE elements whose ID contains '_tblProperty'
//div[@class='col3']/p[2]  selects the second P whose parent is a DIV with class='col3'
//span[contains(@id, '_lblPrice')]/a[1]  selects the first A element that is a child of a SPAN whose ID contains '_lblPrice'
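
A minimal sketch of running those three expressions through HAP in C#. The HTML snippet and the XPathDemo class name are invented for the example; real forum markup will need its own expressions.

```csharp
using System;
using HtmlAgilityPack;

class XPathDemo
{
    static void Main()
    {
        // Made-up markup that happens to match the three XPath examples.
        const string html = @"
<table id='ctl00_tblProperty'><tr><td>3 bed house</td></tr></table>
<div class='col3'><p>First paragraph</p><p>Second paragraph</p></div>
<span id='ctl00_lblPrice'><a href='/item/1'>100</a></span>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // All TABLE elements whose id contains '_tblProperty'.
        // SelectNodes returns null when there is no match, so check before looping.
        var tables = doc.DocumentNode.SelectNodes("//table[contains(@id, '_tblProperty')]");
        if (tables != null)
        {
            foreach (var table in tables)
                Console.WriteLine(table.Id);
        }

        // The second P inside a DIV with class='col3'.
        var second = doc.DocumentNode.SelectSingleNode("//div[@class='col3']/p[2]");
        if (second != null)
            Console.WriteLine(second.InnerText);

        // The first A child of a SPAN whose id contains '_lblPrice'.
        var link = doc.DocumentNode.SelectSingleNode("//span[contains(@id, '_lblPrice')]/a[1]");
        if (link != null)
            Console.WriteLine(link.GetAttributeValue("href", ""));
    }
}
```

SelectSingleNode also returns null when nothing matches, hence the null checks before each use.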

The rest of the work is done by HTML Agility Pack.

Hope it helps. 

Jul 21, 2011 at 10:05 PM
Edited Jul 21, 2011 at 10:08 PM

dherbe, thanks for your excellent reply. I really appreciate it.

The problem is the way my application is designed. Basically, it is a GUI application in which the user browses to 'ANY' web forum. I have one textbox into which the page's HTML contents are loaded, and a second textbox which receives the contents of textbox1 after they have been cleaned with regex, ready for transfer to the database along with the web URL.

Regex only works up to a point: it struggles to cope if the right patterns aren't written into the code.

Hence, why I have turned towards htmlagilitypack :D

Given the above, is there anything HAP can help me with, bearing in mind that it again has to work for 'ANY' forum?

Please advise.

Thanks