iterate through like items

Jul 22, 2013 at 8:31 PM
The HTML I'm scraping is below. It contains an original post and two replies:
<div class="share_buttons noprint">...</div>

<strong>Dan</strong> Says:<br/>
<span class="small soft"><time datetime="2009-10-05T02:27:38Z">Sun, Oct 04 '09, 7:27 PM</time></span>
<div class="quote_top">&nbsp;</div>
<div class="quote_item">Hello all, this is my original post.<br/></div>

<form class="action_heading noprint">

<div class="post_number" id="r_140626">1</div>
<strong>AnnieMae</strong> Says:<br/>
<span class="small soft"><time datetime="2009-10-05T02:30:27Z">Sun, Oct 04 '09, 7:30 PM</time></span>
<div class="quote_top clear_float">&nbsp;</div>
<div class="quote_item">What do you think of it?<br/></div>

<div class="post_number" id="r_140627">2</div>
<strong>Thomas77</strong> Says:<br/>
<span class="small soft"><time datetime="2009-10-05T02:32:32Z">Sun, Oct 04 '09, 7:32 PM</time></span>
<div class="quote_top clear_float">&nbsp;</div>
<div class="quote_item">Not really sure, can't see this pic?<br/>
So I've already figured out how to get the original post...
'get AUTHOR and DATE of original post
Dim divOriginalPostAuthor As HtmlNode = threadDoc.DocumentNode.SelectSingleNode("//div[@class='share_buttons noprint']/following-sibling::strong")
Dim divOriginalPostDate As HtmlNode = threadDoc.DocumentNode.SelectSingleNode("//div[@class='share_buttons noprint']/following-sibling::span/time")

Dim strDate As String = divOriginalPostDate.InnerText.Trim
strDate = strDate.Remove(0, InStr(strDate, ", ")).Trim
strDate = Replace(strDate, "'", "20") 'turn '09 into 2009
Dim strAuthor As String = (divOriginalPostAuthor.InnerText).Trim
dtPosted = CDate(strDate)
divOriginalPostText = threadDoc.DocumentNode.SelectSingleNode("//div[@class='share_buttons noprint']/following-sibling::div[@class='quote_item']")
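As an aside on the date handling above: the `time` element in the sample HTML also carries an ISO 8601 `datetime` attribute (for example `2009-10-05T02:27:38Z`), which may be easier to parse than the display text. The following is only a sketch under that assumption; `PostDateSketch` and `TryParsePostedUtc` are made-up names, not part of HtmlAgilityPack or of the code in this thread.

```vbnet
Imports System
Imports System.Globalization

Public Module PostDateSketch

    ' Hypothetical helper: parses the value of a <time datetime="..."> attribute.
    ' Returns Nothing when the value is missing or not a recognizable date.
    Public Function TryParsePostedUtc(value As String) As DateTimeOffset?
        Dim parsed As DateTimeOffset
        ' AssumeUniversal only matters when the string has no offset or "Z" suffix.
        If DateTimeOffset.TryParse(value, CultureInfo.InvariantCulture,
                                   DateTimeStyles.AssumeUniversal, parsed) Then
            Return parsed
        End If
        Return Nothing
    End Function

End Module
```

With the nodes from the snippet above, the attribute value could be read with `divOriginalPostDate.GetAttributeValue("datetime", "")` and passed to the helper. Note that this yields the UTC timestamp, not the local time shown in the display text.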
Now I'm just trying to figure out how to get the replies. I was thinking of getting the current line position like this:
Dim currentNodePosition As Integer = threadDoc.DocumentNode.SelectSingleNode("//form[@class='action_heading noprint']").Line
and then using that to iterate through the replies, incrementing the line position as I go. The thing that makes this tricky for me is that the replies don't have a "container" HTML element that I could collect all at once. Any ideas?
Jul 23, 2013 at 6:07 PM
Just for the record, I figured this out and wanted to post the answer for anyone who needs it in the future.
'then get thread replies
Dim nodesPostNumber As HtmlNodeCollection = threadDoc.DocumentNode.SelectNodes("//form[@class='action_heading noprint']/following-sibling::div[contains(@id, 'r_')]")
Dim replies As New List(Of ThreadReply)

If nodesPostNumber IsNot Nothing Then

    Dim intNumberOfReplies As Integer = nodesPostNumber.Count
    For i = 1 To intNumberOfReplies
        'the i-th date span after the form belongs to the i-th reply
        Dim nodeReplyDate As HtmlNode = threadDoc.DocumentNode.SelectSingleNode("//form[@class='action_heading noprint']/following-sibling::span[@class='small soft' and position()=" & i.ToString & "]")
        Dim strXPathForDate As String = nodeReplyDate.XPath
        '[1] picks the nearest quote_item after the date span
        Dim strReplyText As String = threadDoc.DocumentNode.SelectSingleNode(strXPathForDate & "/following-sibling::div[@class='quote_item'][1]").InnerHtml
        'cut off a trailing noprint block, if there is one
        Dim intNoPrint As Integer = InStr(strReplyText, "<div class=""noprint""")
        If intNoPrint > 0 Then strReplyText = Left(strReplyText, intNoPrint - 1)
        '[1] on a reverse axis picks the nearest preceding <strong>, i.e. this reply's author
        Dim strReplyAuthor As String = threadDoc.DocumentNode.SelectSingleNode(strXPathForDate & "/preceding-sibling::strong[1]").InnerText
        Dim strReplyDate As String = nodeReplyDate.InnerText.Trim
        strReplyDate = strReplyDate.Remove(0, InStr(strReplyDate, ", ")).Trim
        strReplyDate = Replace(strReplyDate, "'", "20")
        strReplyDate = Replace(strReplyDate, "via mobile", "").Trim
        Dim thisReply As New ThreadReply With {.Author = strReplyAuthor, .DatePosted = strReplyDate, .ThreadID = thisThread.ThreadID, .Text = strReplyText}
        replies.Add(thisReply)
    Next

End If
So the trick is to "grab" the node that belongs to one reply and reuse it in another XPath query, so that you only match elements that come AFTER the node you grabbed. I did this by using HtmlNode.XPath, which gives you the XPath string for any given HtmlAgilityPack.HtmlNode, and then appending "/following-sibling" to it.
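The same idea can also be sketched without building XPath strings: `HtmlNode.SelectSingleNode` accepts a relative expression, so each `post_number` div can serve as the starting point for its own reply. This is a minimal sketch, not the code used above. It assumes the markup matches the sample in the question (every reply begins with a `post_number` div and the pieces are flat siblings), and `ReplySketch`, `ReplyInfo` and `ReadReplies` are hypothetical names standing in for the poster's own types.

```vbnet
Imports System.Collections.Generic
Imports HtmlAgilityPack

Public Module ReplySketch

    ' Hypothetical stand-in for the ThreadReply class used in the answer.
    Public Class ReplyInfo
        Public Property Author As String
        Public Property DateText As String
        Public Property Html As String
    End Class

    Public Function ReadReplies(doc As HtmlDocument) As List(Of ReplyInfo)
        Dim result As New List(Of ReplyInfo)

        ' Assumption: every reply starts with <div class="post_number" id="r_...">.
        Dim markers As HtmlNodeCollection =
            doc.DocumentNode.SelectNodes("//div[@class='post_number']")
        If markers Is Nothing Then Return result ' no replies found

        For Each marker As HtmlNode In markers
            ' Relative XPath: [1] takes the nearest matching sibling after this marker.
            Dim author As HtmlNode = marker.SelectSingleNode("following-sibling::strong[1]")
            Dim posted As HtmlNode = marker.SelectSingleNode("following-sibling::span[@class='small soft'][1]/time")
            Dim body As HtmlNode = marker.SelectSingleNode("following-sibling::div[@class='quote_item'][1]")

            ' Skip incomplete replies instead of failing on a missing node.
            If author Is Nothing OrElse posted Is Nothing OrElse body Is Nothing Then Continue For

            result.Add(New ReplyInfo With {
                .Author = author.InnerText.Trim(),
                .DateText = posted.InnerText.Trim(),
                .Html = body.InnerHtml})
        Next

        Return result
    End Function

End Module
```

Calling `ReplySketch.ReadReplies(threadDoc)` would return one entry per reply. The date cleanup and the trimming of the trailing noprint block from the answer would still have to be applied to `DateText` and `Html`.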