Html Agility Pack Examples

For example, here is how you would fix all hrefs in an HTML file:
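 HtmlDocument doc = new HtmlDocument();
 doc.Load("file.htm");
 // SelectNodes returns null when no node matches, so check before iterating.
 HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
 if (links != null)
 {
    foreach (HtmlNode link in links)
    {
       HtmlAttribute att = link.Attributes["href"];
       // FixLink is a helper you write yourself; it is not part of the library.
       att.Value = FixLink(att);
    }
 }
 doc.Save("file.htm");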
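
FixLink is not defined by Html Agility Pack; the example treats it as a helper you supply yourself. As a minimal sketch, assuming "fixing" a link means resolving a relative href against a base URL, it could look like the program below. The class name LinkFixer, the base URL http://example.com/docs/ and the sample HTML are all made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

static class LinkFixer
{
    // Assumed base URL for this sketch; replace it with the address your pages live at.
    static readonly Uri BaseUri = new Uri("http://example.com/docs/");

    // Hypothetical helper: returns an absolute URL for the given href attribute.
    public static string FixLink(HtmlAttribute att)
    {
        Uri absolute;
        // If the value cannot be combined with the base URL, leave it unchanged.
        return Uri.TryCreate(BaseUri, att.Value, out absolute)
            ? absolute.AbsoluteUri
            : att.Value;
    }

    static void Main()
    {
        HtmlDocument doc = new HtmlDocument();
        // LoadHtml parses a string, so the sketch does not need a file on disk.
        doc.LoadHtml("<p><a href=\"page.htm\">relative</a> <a href=\"http://other.example/x\">absolute</a></p>");

        // SelectNodes returns null when no node matches the XPath expression.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
        if (links != null)
        {
            foreach (HtmlNode link in links)
            {
                HtmlAttribute att = link.Attributes["href"];
                att.Value = FixLink(att);
            }
        }

        // Print the modified markup instead of saving it.
        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }
}
```

Any other rewrite rule (changing hosts, stripping tracking parameters, and so on) would go in the body of FixLink; the loop that visits the links stays the same.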

Last edited Oct 1, 2009 at 9:38 PM by DarthObiwan, version 1

Comments

LeeC Feb 7 at 8:18 PM 
That same line of code gives me this error:

'DocumentElement' is not a member of 'HtmlAgilityPack.HtmlDocument'.

foreach(HtmlNode link in doc.DocumentElement.SelectNodes("//a[@href]"))

LeeC Feb 7 at 8:13 PM 
Hello. I just wanted to point out that there are two syntax errors in the example code at the top of this page.

foreach(HtmlNode link in doc.DocumentElement.SelectNodes("//a[@href"])
foreach(HtmlNode link in doc.DocumentElement.SelectNodes("//a[@href]"))

sherihan Jun 7, 2013 at 6:51 AM 
Can we use HTML Agility Pack to extract data from ASP.NET web items like web pages, style sheets, etc.?

ParitoshSoni Jun 1, 2013 at 2:03 PM 
I am not able to load pages with different extensions, like .cms.
What could be the problem? Or is it not supported?

eljoker2k Jan 29, 2013 at 11:53 AM 
Is there documentation, or do I have to figure it out myself?

MikePanter Oct 30, 2012 at 2:35 AM 
After fixing the bug in the example (as per rblaettler's comment), I have had no trouble using the Html Agility Pack at all. So far it seems to do what it says on the tin, which is to allow you to parse HTML documents using XPath expressions, or just navigate the HTML DOM directly. Nice, thanks!

Dheeraj2012 Sep 12, 2012 at 1:40 PM 
I am using HAP and want to know why I am not able to get the child nodes of a selected node. For example, if I select a form tag using .SelectSingleNode(), it successfully returns the correct form (checked by ID) but with no child nodes. Please help me find a way to select the form tag together with its child nodes.
Thank you

valgussev Aug 24, 2012 at 12:47 PM 
I figured it out... I had been using VS 2010 C++, but now I'm using VB 2010 and everything works great. Which one do you use?

valgussev Aug 24, 2012 at 10:06 AM 
Hi, can you help me with a problem of adding Html Agility Pack?

I've tried to use it via Visual Studio and Qt Creator, but every time I add these lines
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;
I get errors like 'HtmlAgilityPack' : symbol cannot be used in a using-declaration.
Which program do you usually use to add Html Agility Pack?

mcap Aug 9, 2012 at 2:45 AM 
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;

namespace HtmlAgility
{
class Program
{
static void Main(string[] args)
{
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
HtmlAttribute att = link.Attributes["href"];
att.Value = "http://www.google.com";
}
doc.Save("file.htm");
}
}
}

justin_romaine Aug 1, 2012 at 8:57 AM 
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm");
foreach(HtmlNode link in doc.DocumentElement.SelectNodes("//a[@href"])
{
HtmlAttribute att = link["href"];
att.Value = FixLink(att);
}
doc.Save("file.htm");
This does not compile, and DocumentElement is not recognized either. What am I missing?

VigneshPT May 27, 2012 at 5:22 AM 
Is there no version of the library that supports Windows Phone?

wronskiano40 Apr 30, 2012 at 12:58 PM 
Hello, how would I extract the image URL (.jpg) from within this code?

code:
<TABLE id=uezszu_24 class="uiGrid fbPhotosGrid" cellSpacing=0 cellPadding=0>
<TBODY>
<TR>
<TD class="vTop">
<DIV class=Wrapper><A
class="uiMediaThumb uiScrollableThumb uiMediaThumbHuge"
href="www.cccc.com/index.php"
name=43563463 rel=theater aria-label="photo"
ajaxify="dsgdgbdfgr45y6ghd"><I
style="BACKGROUND-IMAGE: url(http://www.fressdgf.com/image.jpg)"></I></A></DIV></TD>
</TR>
</TBODY>
</TABLE>

maudish Apr 1, 2012 at 2:14 AM 
How do I find the ID value for a particular entry? For example, below is the string I obtain from a website. When the user selects a particular school from a list of schools, the ID of that school has to be captured for further processing. The part of the HTML I need to look at is given below:


<td class="Miles">&nbsp;</td><td CLASS="InstDesc"><A href="JavaScript:MoreInfo('sch_info_popup.asp'+'?Type=Public'+'&#38;ID=181134001822');">BATTLE GROUND MIDDLE SCHOOL</a><br />6100 N 50 W, WEST LAFAYETTE, IN 47906<br />(765) 269-8140&nbsp;&nbsp;-&nbsp;&nbsp;<i>TIPPECANOE COUNTY COUNTY</i></td><td class="InstDetail">grades: <strong>6</strong> - <strong>8</strong></td>


So when the user selects Battle Ground as the school, my program should be able to figure out the ID 181134001822 from the HTML above.

renegade809 Dec 8, 2011 at 9:14 PM 
This looks like a great tool, but I REALLY wish that you had a tutorial in vb.net that would help teach us how to properly use it!

uppadhyayraj Oct 10, 2011 at 7:34 AM 
Also, it is not working with the following code:

static void Main(string[] args)
{
HtmlAgilityPack.HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("http://www.google.com");
//doc.DocumentNode.SelectSingleNode("//*[@id=\"lst-ib\"]");//("/html/body/div[2]/form/div/div[2]/table/tbody/tr/td/table/tbody/tr/td/div/table/tbody/tr/td/table/tbody/tr/td[2]/div/input");
//System.Console.WriteLine(doc.DocumentNode.SelectSingleNode("//*[@id=\"lst-ib\"]").Id);
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("/html/body/div[2]/form/div"))
{
HtmlAttribute att = link.Attributes["id"];

System.Console.Write(att.Value);

}
System.Console.ReadKey();

}

uppadhyayraj Oct 10, 2011 at 7:27 AM 
I am not able to get a result from an XPath containing a second element or an attribute value. Please show how to do this. The following is my code:
static void Main(string[] args)
{
HtmlAgilityPack.HtmlWeb web = new HtmlWeb();
HtmlAgilityPack.HtmlDocument doc = web.Load("http://www.google.com");
//System.Console.WriteLine(doc.DocumentNode.SelectSingleNode("//*[@id=\"lst-ib\"]").Id); //This is not working as well
foreach (HtmlNode link in doc.DocumentNode.SelectNodes("//*[@id=\"gbw\"]"))
{
HtmlAttribute att = link.Attributes["id"];

System.Console.Write(att.Value);

}
System.Console.ReadKey();


}

teljj001 Apr 14, 2011 at 11:49 PM 
@lennymail, XPath doesn't need a selector on each level. What you're after is the descendant axis, which is represented by two slashes ("//"). For example, to get all form tags below body, you could use: /html/body//form - or, equally: //form.

XPath is actually well suited to the task once you understand it. That said, the Sizzle engine is better adapted to HTML specifically rather than being a general XML solution.

JoeO Mar 24, 2011 at 6:01 PM 
The reason the documentation does not display in HtmlAgilityPack.Documentation.chm is that the CHM file contains JavaScript code. Windows 7 blocks code that has been downloaded from the web.

To make the CHM file function correctly you need to right-click on the file and choose the "Properties" command. In the "Properties" dialog box, at the bottom of the "General" tab, click the "Unblock" button.
That should fix the problem.

lennymail Mar 23, 2011 at 2:35 PM 
I got excited at first, but it does not help me. First of all, the chm file does not work on Windows 7. I get:

Navigation to the webpage was canceled What you can try: Retype the address

That is what my reader shows. I have abcChm reader on another machine, but I'm not going to bother. Why can't you post the docs online in HTML? Disk space is cheap in 2011! I am familiar with XPath, but it's not a good choice for complex HTML that may even be dynamic. Not worth your time if you're getting paid.

Also, I have a site where the table is buried in a few divs. I can't find any way to access tables directly; my XPath may have 10 levels (and the HTML changes between pages), and I don't have the time to dig it out. The table does not even show up in the sample tree app, so I can't find it there (the tree only seems to pick up the first 3 levels of tags). Seriously, my 'chopstring' function (break a string at two landmarks) and a regex to pull links, tds, etc. are much more efficient than this. Ever wonder why CSS notation was invented? Imagine if they used XPath.

PHP has PHP simpleDOM; take an example from there, where you can use CSS/jQuery selectors: http://simplehtmldom.sourceforge.net/. And even though it uses CSS, they have TONS of examples.
I'm almost motivated to convert this project to that. I ended up using PHP for this project, and I did it in 3 lines!

kashifkhanin Mar 2, 2011 at 7:01 AM 
Hi everyone, HTML Agility Pack does not load the HTML document source contained in this page: http://goo.gl/Ej2Gd. Please help. I am writing code that will download all the Channel9.msdn.com videos so that I can watch them at my leisure and avoid waiting for downloads while surfing.

poornachander Dec 31, 2010 at 8:22 AM 
Really cool library. Great work.

Raged Nov 23, 2010 at 11:01 PM 
rmp251

If you have the sources, you can fix this!
Line 105 of HtmlNode.cs: change ElementsFlags.Add("form", HtmlElementFlag.CanOverlap | HtmlElementFlag.Empty); to ElementsFlags.Add("form", HtmlElementFlag.CanOverlap);

rmp251 Oct 13, 2010 at 4:48 AM 
This looked very promising but I can't get it to give me what I need... I just want the form node with all the HTML contents!

doc.DocumentNode.SelectSingleNode("/html/body") // returns body with entire contents :)
doc.DocumentNode.SelectSingleNode("/html/body/form") // returns only the form element, inner content is empty - go figure :(

MC9000 Sep 3, 2010 at 8:18 PM 
I only see 1 example on this page. Are there any more?

samdanae Aug 26, 2010 at 12:46 AM 
This library is awesome :)

Aaronontheweb's comment pretty much sums up the documentation issue - familiarise yourself with XPath querying and System.Xml, and you're away.

http://www.w3schools.com/xpath/xpath_syntax.asp is perfect for beginners or as a reference.

LeePavlou Aug 21, 2010 at 7:15 PM 
Great library!! Quick question though: how can I replace a node with another node, e.g. replace a div with a paragraph, assuming I have already retrieved my div from the DOM?

krisrajz Jul 8, 2010 at 6:50 AM 
How can I search for elements that have a prefix, like b:bookmark?

sumothecat Jul 2, 2010 at 3:25 PM 
HtmlNode.Replace(HtmlNode) ....? :-)

sumothecat Jul 2, 2010 at 3:21 PM 
Thanks! LINQ to HTML - just what I needed!

Aaronontheweb Apr 30, 2010 at 1:49 AM 
The real key to figuring out how to get the most out of the HTML Agility Pack is understanding how to use XPath properly in combination with Agility Pack's collections.

I highly recommend checking this out as a refresher / reference on XPath if you're having trouble getting Agility Pack to do exactly what you want:

http://www.w3schools.com/xpath/xpath_syntax.asp

bogolese Apr 28, 2010 at 7:19 PM 
What is "FixLink" and what does it do? Is this something that is supposed to be part of the product? Its samples?

pawanm Apr 13, 2010 at 12:44 PM 
Agree with Maxbl4, but I still think there is a lack of examples, at least for starters. It would be great to have some tricky examples of DOM manipulation & parsing :)

Maxbl4 Mar 26, 2010 at 4:20 PM 
This library is very good! You don't need any docs if you have ever worked with the XML DOM. If you haven't, go to the System.Xml docs and examples; the API is almost the same.

andychops Mar 4, 2010 at 7:46 PM 
To the guy complaining about the documentation, all you need to do is spend about 15 minutes going through the object model and reading the descriptions in the VS Object Viewer and you'll have it figured out.

mahyar_dodo2 Feb 19, 2010 at 10:58 PM 
Lack of good examples is the biggest problem of this project. I want to get some text values from specific tags in an HTML document. How can I achieve this?

droidi Jan 15, 2010 at 3:02 AM 
There are some examples with the source download. The html to rss one is a nice starting point for filtering crap out of html.

Just remove the " and @target='_new'" from HtmlNodeCollection hrefs = doc.DocumentNode.SelectNodes .... line and try it with some sample.html you got from any web site. It will produce all the links only from the html.

Kdodman Dec 16, 2009 at 12:38 AM 
I agree, the library looks great. But the lack of examples is really a pain in the butt.

cooperpx Dec 12, 2009 at 7:21 PM 
Really cool implementation. Downloaded the binaries only. Didn't look at any docs at all. Within minutes, I had sanitized incoming HTML comments thanks to my familiarity with System.Xml (e.g. removing script & object tags, converting "font tags" to spans w/ style attributes, etc.).

It took longer for me to create this account and post my feedback. ;)

bugmenot2 Dec 11, 2009 at 2:51 PM 
The lib looks nice, but these examples are all wrong, and no examples were included in any of the downloads, well, none that I downloaded.. (I got bins, and docs, don't need src.. I hope you wouldn't put them with the source, that just goes against reason, obviously, bins, or docs is the place for these..)

While you did include a rather extensive help file, it is, well, rather extensive, and you are insane if you think anyone NOT being paid $50+ an hour is going to want to try to decipher the lib's usage from that alone. (This is for a small personal project; I need one little function to clean up some excess html data. Certainly not worth that kind of investment in time. I can write this myself using regex, string manipulation, etc.)

Really, I am just trying to help you out here. I'm planning on creating my own function, as I am in a hurry to get this done, so it won't benefit me in any way: I will have long been done before you can rectify this, if you even decide to do so, that is.

Btw, sorry bout the bugmenot account, I'll eventually get around to creating a codeplex account..

rblaettler Nov 30, 2009 at 2:21 PM 
I think this might work a little better.

HtmlWeb hw = new HtmlWeb();

HtmlDocument doc = hw.Load(txtLink.Text);

foreach(HtmlNode link in doc.DocumentNode.SelectNodes("//a[@href]"))
{
HtmlAttribute att = link.Attributes["href"];

Response.Write(att.Value);
}