
Why can't I reliably extract links from nodes?

Dec 4, 2012 at 9:44 PM

I'm looking at nodes that definitely contain links. Sometimes I get the proper link from Node.Attributes["href"], but sometimes that's null, and I can't see why it works in some cases and not in others. A brute-force approach of searching for href=" and taking everything up to the next quote works, but I would like to understand what's going on here.

On the real page the links do work, and the pages display correctly as my routine crawls them.

Jan 6, 2013 at 1:07 AM


I cannot give you a definitive solution because there is no example of the HTML. I suspect the XPath expression is returning nodes you don't expect in the node set. Also, an a tag sometimes has no href attribute at all, for example when the link is an Ajax navigation element. Try running your XPath on the page with something like XPath Checker for Firefox and look at the output.
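
To illustrate the missing-href case, here is a minimal sketch. It assumes the library in use is HtmlAgilityPack (the question does not name it), and the sample markup, the loadNext() handler and the class name are made up for the example. It is not a diagnosis of your page, only one way the null can arise: an expression like //a also matches anchors without an href, while //a[@href] restricts the node set to anchors that have one.

```csharp
using System;
using HtmlAgilityPack;

class LinkExtractionSketch
{
    static void Main()
    {
        // Hypothetical sample markup: one normal link, one script-driven
        // anchor with no href, and one named anchor with no href.
        const string html =
            "<div>" +
            "<a href=\"/page1.html\">Page 1</a>" +
            "<a onclick=\"loadNext()\">Next</a>" +
            "<a name=\"top\"></a>" +
            "</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // "//a" matches every anchor, including the two without an href.
        var allAnchors = doc.DocumentNode.SelectNodes("//a");
        foreach (var node in allAnchors)
        {
            // node.Attributes["href"] would be null for the last two anchors;
            // GetAttributeValue returns the supplied default instead.
            Console.WriteLine(node.GetAttributeValue("href", "(no href)"));
        }

        // "//a[@href]" keeps only anchors that carry an href attribute.
        // SelectNodes returns null, not an empty collection, when nothing matches.
        var links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null)
        {
            foreach (var node in links)
            {
                Console.WriteLine(node.Attributes["href"].Value);
            }
        }
    }
}
```

If the second loop gives you the links you expect, the nulls were most likely coming from anchors that never had an href in the first place.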

If you can provide an example page where your app malfunctions and the code you use for link extraction, I can be more specific.