Getting absolute links instead of relative ones

Topics: Developer Forum, User Forum
May 25, 2011 at 6:42 PM
Edited May 25, 2011 at 6:43 PM

I get all the "a href" links on a web page using:

var linksOnPage = from lnks in doc.DocumentNode.Descendants()
                  where lnks.Name == "a" &&
                        lnks.Attributes["href"] != null &&
                        lnks.InnerText.Trim().Length > 0
                  select new
                  {
                      Url = lnks.Attributes["href"].Value
                  };

foreach (var link in linksOnPage) .......

It works very well; the only problem is relative paths. For example, if I crawl http://newyorktimes
and a link is hardcoded as http://newyorktimes/pdf/document.pdf, everything is fine,
but if it is hardcoded as /pdf/document.pdf, it is a problem.

How can I solve this?

May 26, 2011 at 11:11 PM

What's wrong with prepending the URL you are crawling (http://newyorktimes) to the link URL in the page when that link is a relative path?

Something like:

// pageUrl is the URL of the page being crawled
string url = lnks.Attributes["href"].Value;
if (!url.StartsWith("http"))
{
    url = pageUrl + url;
}

May 30, 2011 at 11:14 AM
Edited May 30, 2011 at 11:16 AM

System.Uri has built-in support for resolving relative URIs. To use it, you first need to determine the base URI of the page. That is the URI specified in the base tag if the page has one (in the document's head, XPath: /html/head/base/@href or simply //base/@href), and otherwise the URI of the page itself.

The following code will then resolve a URI:

Uri baseUri = new Uri(stringOfBaseHrefOrPage, UriKind.Absolute);
Uri resolvedUri = new Uri(baseUri, stringOfRelativeOrAbsoluteHref); 

You should use this method rather than string concatenation because it is more robust: it supports other protocols, not just http; it deals cleanly with .. segments and with URIs that carry a query string or a fragment (hash); and it throws an exception on malformed input. In general, using string concatenation for this kind of operation is a potential security risk; avoid it where possible.
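
To make the difference concrete, here is a minimal sketch of the two-step resolution wrapped in a helper. The class name LinkResolver, the method names, and the example.com addresses are made up for illustration; only the Uri constructors and Uri.TryCreate come from the framework.

```csharp
using System;

public static class LinkResolver
{
    // Hypothetical helper: resolves an href (relative or absolute) against the base URI.
    // Throws UriFormatException if either string cannot be parsed.
    public static Uri Resolve(string baseHref, string href)
    {
        Uri baseUri = new Uri(baseHref, UriKind.Absolute);
        return new Uri(baseUri, href);
    }

    // Non-throwing variant for crawlers that prefer to skip bad links:
    // returns false (and leaves resolved null) instead of throwing.
    public static bool TryResolve(string baseHref, string href, out Uri resolved)
    {
        resolved = null;
        Uri baseUri;
        return Uri.TryCreate(baseHref, UriKind.Absolute, out baseUri)
            && Uri.TryCreate(baseUri, href, out resolved);
    }
}
```

With a base of http://example.com/news/index.html, Resolve turns both /pdf/document.pdf and ../pdf/document.pdf into http://example.com/pdf/document.pdf, turns story.html?page=2#top into http://example.com/news/story.html?page=2#top, and returns an already absolute href such as https://other.example.org/a unchanged. Simple concatenation would get the second and third of these wrong.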

May 30, 2012 at 9:03 AM

Can someone write out the complete code, showing what the final version would look like?

Aug 24, 2012 at 4:19 AM

This works smoothly:

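// Assumes doc (the loaded HtmlDocument) and urlBase (the page's base URL as a string)
// are members of the enclosing class.
public List<Uri> getLinks()
{
    var linksOnPage = from lnks in doc.DocumentNode.Descendants()
                      where lnks.Name == "a" &&
                            lnks.Attributes["href"] != null &&
                            lnks.InnerText.Trim().Length > 0
                      select new
                      {
                          Url = lnks.Attributes["href"].Value
                      };

    // The base URI is the same for every link, so build it once.
    Uri baseUri = new Uri(urlBase, UriKind.Absolute);
    List<Uri> uris = new List<Uri>();

    foreach (var link in linksOnPage)
    {
        // Relative hrefs are resolved against baseUri; absolute hrefs pass through unchanged.
        uris.Add(new Uri(baseUri, link.Url));
    }

    return uris;
}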
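
The method above uses doc and urlBase without declaring them, so both have to exist as members of the enclosing class. A minimal sketch of one way to set that up, assuming the HtmlAgilityPack NuGet package is referenced. The class name LinkExtractor and its constructor are hypothetical, and getLinks here is a condensed variant of the method above rather than a copy of it:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Hypothetical wrapper class that supplies the doc and urlBase members.
public class LinkExtractor
{
    private readonly HtmlDocument doc = new HtmlDocument();
    private readonly string urlBase;

    // pageUrl: absolute URL the HTML was downloaded from; html: the page source.
    public LinkExtractor(string pageUrl, string html)
    {
        doc.LoadHtml(html);

        // Prefer <base href="..."> when the page declares one, else use the page URL.
        HtmlNode baseNode = doc.DocumentNode.SelectSingleNode("//base[@href]");
        urlBase = baseNode != null
            ? new Uri(new Uri(pageUrl, UriKind.Absolute),
                      baseNode.GetAttributeValue("href", "")).AbsoluteUri
            : pageUrl;
    }

    public List<Uri> getLinks()
    {
        Uri baseUri = new Uri(urlBase, UriKind.Absolute);
        return doc.DocumentNode.Descendants("a")
            .Where(a => a.Attributes["href"] != null && a.InnerText.Trim().Length > 0)
            .Select(a => new Uri(baseUri, a.Attributes["href"].Value))
            .ToList();
    }
}
```

Since the method returns a List<Uri>, the first result can be read with getLinks().FirstOrDefault() (which needs using System.Linq and gives null when no links were found) or with an index such as links[0] after checking links.Count.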

See you guys next time!

Apr 21, 2013 at 2:41 PM
Edited Apr 21, 2013 at 2:47 PM

So how could I get the first record from the results returned by @foxmulder82's answer above?

I'm also getting a 'does not exist in context' error on 'doc' in the second line.
Any idea what I need to do to solve this?

Thanks.