Code example for saving an entire web page, including images, CSS, etc.?

Topics: User Forum
Sep 16, 2009 at 9:56 PM

Hi,

I've been struggling to find an example of some C# code (I'm using C# with Visual Studio 2008 Express) that can programmatically save an entire web page (given a URL), including the images and formatting (e.g. CSS). The intention is that in a subsequent step I'd ship this off (not sure how yet) so it could be viewed later in a browser. Someone put me onto this library.

Does anyone have some sample code they could post or refer me to that implements a method that saves an entire web page (given a URL) to a file, with the images, CSS, etc. in a subdirectory, so that the links in the initial HTML file reference the copies in the subdirectory? Effectively, I just want to emulate the Firefox "save entire page" function, but programmatically in C#.

thanks

 

Sep 16, 2009 at 10:32 PM

I'm not sure HAP would be the ideal solution for this. It would take quite a bit of code to go through the document, find all the tags that can have URLs, download the items, and change the URLs.

I would highly recommend http://www.httrack.com/ for doing this. There's a Firefox extension, SpiderZilla, and a Windows front end for it. It is meant for downloading entire pages/sites for offline use. You could just write some C# code to kick off an httrack process to do the heavy lifting.
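Kicking off httrack from C# could look roughly like the sketch below. It assumes the httrack command-line executable is installed and on the PATH. The class and method names are made up for illustration, and the only httrack option used is -O, which sets the output folder.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper, not part of HAP or HTTrack.
static class HttrackLauncher
{
    // Builds the command line: the URL to mirror, then -O <output folder>.
    public static string BuildArguments(string url, string outputDir)
    {
        return "\"" + url + "\" -O \"" + outputDir + "\"";
    }

    // Starts httrack, waits for it to finish and returns its exit code.
    // Assumption: "httrack" can be found on the PATH.
    public static int Mirror(string url, string outputDir)
    {
        var info = new ProcessStartInfo("httrack", BuildArguments(url, outputDir));
        info.UseShellExecute = false;

        using (Process process = Process.Start(info))
        {
            process.WaitForExit();
            return process.ExitCode;
        }
    }
}
```

A call such as HttrackLauncher.Mirror("http://example.com/", "C:\\mirror") would then block until the download is done. By default httrack follows links from the starting URL, so check its help output for the options that limit how far it goes if only a single page is wanted.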

If you really need to do it in C#, I'd suggest using the new 1.4.0 branch; you can use LINQ against HAP with that one.

Here's just some pseudo C# code. You can accomplish the same thing with the current trunk version using the navigator and XPath.

 

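var images = doc.DocumentNode.Descendants("img");

foreach (HtmlNode image in images)
{
    // Skip <img> tags that have no src attribute
    string src = image.GetAttributeValue("src", null);
    if (src == null) continue;
    // Do cleanup on the image URL (relative, external)
    // Do an HttpWebRequest to download the image
    // Place it in a directory based on the original path
    string newPath = src; // placeholder: replace with the local path of the saved copy
    image.SetAttributeValue("src", newPath);
}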
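The "cleanup" and "place it in a directory" steps could be sketched with System.Uri alone, along these lines. UrlCleanup, Resolve and LocalPath are hypothetical names, and the local naming scheme (file name only, inside one subdirectory) is a simplifying assumption rather than a path that mirrors the original one.

```csharp
using System;
using System.IO;

// Hypothetical helper for turning src/href values into downloadable URLs and local paths.
static class UrlCleanup
{
    // Resolves a raw attribute value against the page URL.
    // Returns null for empty values and for anything that is not http/https
    // (for example data: or javascript: URLs).
    public static Uri Resolve(Uri pageUri, string rawValue)
    {
        if (string.IsNullOrEmpty(rawValue)) return null;

        Uri absolute;
        if (!Uri.TryCreate(pageUri, rawValue.Trim(), out absolute)) return null;

        if (absolute.Scheme != Uri.UriSchemeHttp && absolute.Scheme != Uri.UriSchemeHttps)
            return null;

        return absolute;
    }

    // Builds a relative local path such as "page_files/logo.png".
    // Assumption: only the file name is kept, so two resources with the
    // same file name in different folders would overwrite each other.
    public static string LocalPath(Uri resource, string subdirectory)
    {
        string fileName = Path.GetFileName(resource.AbsolutePath);
        if (string.IsNullOrEmpty(fileName)) fileName = "index";
        return subdirectory + "/" + fileName;
    }
}
```

The download itself can then be done with an HttpWebRequest as mentioned above, or with WebClient.DownloadFile, using the resolved Uri as the source and the local path as the target.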

Basically you'd want to set up a list of tags that have things to download and run through them.

// Source is a small helper class you would define yourself
var sources = new List<Source>
{
    new Source { tag = "a", attribute = "href" },
    new Source { tag = "img", attribute = "src" } // and so on
};
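Fleshed out a little, the list-driven approach might look like the sketch below. The Source class, SourceRewriter and the saveAndMap callback are all hypothetical names. The tag list here (img, script, link) is an assumption about what a single-page save needs, since "a" tags normally point at other pages rather than at files the page depends on.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper type: one tag/attribute pair that may point at a resource.
class Source
{
    public string tag { get; set; }
    public string attribute { get; set; }
}

static class SourceRewriter
{
    static readonly List<Source> sources = new List<Source>
    {
        new Source { tag = "img", attribute = "src" },
        new Source { tag = "script", attribute = "src" },
        new Source { tag = "link", attribute = "href" }
    };

    // Runs through every tag in the list and replaces each attribute value
    // with whatever saveAndMap returns. The callback is expected to download
    // the resource and return its new local path, or null to leave the value alone.
    // Returns the number of attributes that were changed.
    public static int Rewrite(HtmlDocument doc, Func<string, string> saveAndMap)
    {
        int changed = 0;
        foreach (Source source in sources)
        {
            // Take a snapshot of the matching nodes before modifying them.
            var nodes = new List<HtmlNode>(doc.DocumentNode.Descendants(source.tag));
            foreach (HtmlNode node in nodes)
            {
                string oldValue = node.GetAttributeValue(source.attribute, null);
                if (oldValue == null) continue;

                string newValue = saveAndMap(oldValue);
                if (newValue == null) continue;

                node.SetAttributeValue(source.attribute, newValue);
                changed++;
            }
        }
        return changed;
    }
}
```

After the rewrite, doc.Save with a file name writes the modified HTML to disk next to the subdirectory that holds the downloaded files.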

 

Sep 17, 2009 at 1:51 AM

thanks Darth,

As I'm new to C#/.NET I might go for the option "current trunk version using the navigator and XPATH" - might be a bit easier than having to get across LINQ for me...

If it's easy, do you have any sample code for using this approach to tackle the problem, or pointers to an example?

thanks again

Sep 17, 2009 at 2:26 AM

See http://www.w3schools.com/XPath/xpath_syntax.asp for examples of XPath syntax. For using it with HAP, you can see Simon's original post about it: http://smourier.blogspot.com/2005/05/net-html-agility-pack-how-to-use.html

LINQ works more like SQL:
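As a rough idea of what the XPath route looks like with HAP, here is a minimal sketch that collects the src of every image. XPathExample and ImageSources are made-up names; SelectNodes and GetAttributeValue are the HAP calls doing the work.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical example class, not part of HAP.
static class XPathExample
{
    // Returns the src value of every <img> element that has one.
    public static List<string> ImageSources(HtmlDocument doc)
    {
        var result = new List<string>();

        // "//img[@src]" means: every img element, anywhere, that has a src attribute.
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//img[@src]");

        // SelectNodes returns null, not an empty collection, when nothing matches.
        if (nodes == null) return result;

        foreach (HtmlNode node in nodes)
        {
            result.Add(node.GetAttributeValue("src", ""));
        }
        return result;
    }
}
```

The document can come from new HtmlWeb().Load with the page URL, or from doc.LoadHtml with a string you have already downloaded.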

var results = from node in doc.DocumentNode.Descendants()
              where node.Name == "a" || node.Name == "img"
              select node;

That will get you a list of all the a and img nodes on the page.