Suggestions for LINQ expression?

Topics: User Forum
Jan 13, 2011 at 2:59 AM
Edited Jan 13, 2011 at 3:00 AM

I'm trying to get the following information from each div with class "dtc":

Href (attribute: http://www.someurl.com/etc etc etc)
Img src (attribute: http://address.forimages.com/etc etc etc)
Title (from the img title attribute: header=[Title] body=[Author])

Sample HTML looks like this:

<div class="dtc" style="width:4%;border-top:2px solid black;padding-top:5px;">
<em style="font-size:19px;font-weight:bold;">1</em>
</div>

<div class="dtc" style="width:12%;border-top:2px solid black;padding-top:5px;">
<a href="http://www.someurl.com/information/book/123456/">
 <img src="http://address.forimages.com/m/87/4387/9781565124387.jpg" class="book_image" title="header=[Title] body=[Author]"></a>
</div>

I've looked at http://blogs.msdn.com/b/saveenr/archive/2010/10/08/scraping-the-nhl-2010-2011-schedule-with-c-linq-and-the-html-agility-pack.aspx which is similar to, but not quite the same as, what I'm trying to do.

Any suggestions? I've also tried the example from http://www.4guysfromrolla.com/articles/011211-1.aspx - it gets hrefs, but not the ones within the div (I'm sure it's because I'm not getting into the div class="dtc" node).

Thank you!

Thor

Jan 13, 2011 at 11:02 PM

What you want to use is HtmlNodeCollection. Select the "div" tags rather than filtering on the class attribute; once you have that collection, loop through it and get the contained nodes for any div whose class is "dtc".

hth, tom

Jan 14, 2011 at 1:16 AM
Edited Jan 14, 2011 at 3:13 AM

Tom,

Thanks for the suggestion - I can run with this!

EDIT:

Well, I can almost run with it. I'm working with HAPLight, and when using the HtmlNodeCollection there is no SelectNodes method, as used in this example:

HtmlNodeCollection nc = doc.DocumentNode.SelectNodes("//div");

EDIT 2:

I added HtmlNode.Xpath.cs to the HAPLight directory and built it - this seems to work.

Of course, with one step forward, I'm still two steps back... Not sure if I'm on the right track with the following, as I get a null reference:

pbbooklists.LoadHtml(e.Result);
HtmlNodeCollection nc = pbbooklists.DocumentNode.SelectNodes("//div");
if (nc != null)
{
    foreach (HtmlNode node in nc)
    {
        if (node.Attributes["class"].Value == "dtc")
            MessageBox.Show("Working...");
    }
}
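
A likely cause of the null reference above: node.Attributes["class"] returns null for any div that has no class attribute, so reading .Value on it throws. Below is a minimal sketch of a null-safe check. It assumes the standard HtmlAgilityPack GetAttributeValue(name, default) method is also available in the HAPLight build; the DivFilter class and IsDtc method are names made up for illustration.

```csharp
using HtmlAgilityPack;

// Hypothetical helper, not part of HtmlAgilityPack.
public static class DivFilter
{
    // Returns true only when the node carries class="dtc".
    // GetAttributeValue falls back to "" when the attribute is missing,
    // so nodes without a class attribute no longer cause a null reference.
    public static bool IsDtc(HtmlNode node)
    {
        return node != null && node.GetAttributeValue("class", "") == "dtc";
    }
}
```

With this in place, the test inside the loop would read: if (DivFilter.IsDtc(node)) MessageBox.Show("Working...");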

Jan 15, 2011 at 1:57 AM

Ok,

Making some progress now with the following code:

var pbbooklists = new HtmlAgilityPack.HtmlDocument();
pbbooklists.LoadHtml(e.Result);
HtmlNodeCollection nc = pbbooklists.DocumentNode.SelectNodes("//div[@class='dtc']");

if (nc != null)
{
    foreach (HtmlNode node in nc)
    {
        try
        {
            MessageBox.Show(node.Element("a").Attributes["href"].Value.ToString());  // Gets href attribute
        }
        catch (System.NullReferenceException)
        {
            // Skip "dtc" divs that have no <a> child or no href attribute
        }
    }
}

Not having any success, though, in trying to get the img src attribute. I thought I would use node.Element("img").Attributes["src"].Value.ToString(), but that's coming back as null.

Any suggestions?  (Is the above code looking at the information as tags instead of nodes?)

Thor

Jan 15, 2011 at 11:46 AM

This works to get the img element, which is nested inside the a element rather than directly under the div:

 MessageBox.Show(node.Element("a").Element("img").GetAttributeValue("src", null).ToString());
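
For the LINQ expression asked about at the top of the thread, the pieces above can be combined into one query. This is only a sketch under some assumptions: it uses HtmlNode.Descendants, Element and GetAttributeValue from the standard HtmlAgilityPack (I have not checked every one of them against the HAPLight build), and the BookLink and DtcScraper names are invented for illustration. It avoids XPath entirely, so it does not need SelectNodes.

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical container for the three values wanted from each "dtc" div.
public class BookLink
{
    public string Href { get; set; }
    public string ImageSrc { get; set; }
    public string Title { get; set; }
}

// Hypothetical helper, not part of HtmlAgilityPack.
public static class DtcScraper
{
    public static List<BookLink> Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var query =
            from div in doc.DocumentNode.Descendants("div")
            // A missing class attribute yields "" here instead of a null reference.
            where div.GetAttributeValue("class", "") == "dtc"
            // Element only looks at direct children, so go div -> a -> img.
            let a = div.Element("a")
            where a != null
            let img = a.Element("img")
            where img != null
            select new BookLink
            {
                Href = a.GetAttributeValue("href", ""),
                ImageSrc = img.GetAttributeValue("src", ""),
                Title = img.GetAttributeValue("title", "")
            };

        return query.ToList();
    }
}
```

Calling DtcScraper.Extract(e.Result) would then return one BookLink per "dtc" div that contains a linked image; divs like the first one in the sample HTML, which hold only an em element, are skipped by the null checks instead of by a try/catch.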