Html Agility Pack to LINQ to XML Converter

Topics: Developer Forum
May 17, 2008 at 1:41 PM
Edited May 17, 2008 at 1:47 PM
I submitted a patch (ID=1258) that upgrades the pack to Visual Studio 2008 and .NET 3.5, and adds an extension method to convert an HtmlDocument to a LINQ to XML XDocument.
Now that LINQ to XML is around, I can trash all that XPath syntax from my brain (yay - another DSL is hidden!).
I called the extension method ToXDocument, but perhaps there's a better name that is closer to the Save semantics.


May 28, 2008 at 10:07 PM
+1

How can I get a copy of this patch?

I've noticed that a lot of web sites won't validate as XHTML. Does this patch fix that? If not, would setting up the HTML Agility Pack to directly support LINQ be a better idea?
May 29, 2008 at 12:56 AM
You can get the patch from the Source Code tab, in the Patches area.

A lot of websites don't validate as XHTML because they probably aren't XHTML. They're HTML, which isn't exactly XML. On top of that, browser parsers have evolved to be very lenient toward malformed (X)HTML. This lets browsers open a wider range of sketchy files, but it makes scraping harder without a browser's parser.

Enter the Html Agility Pack.

This is a gem, IMHO, in the C# OSS world. It brings a very lenient HTML parser and offers a set of external format converters, XML being one of them.

My patch simply uses the XML converter to stream data into the LINQ to XML XDocument parser. Very simple.
Check out my post at:
http://vijay.screamingpens.com/archive/2008/05/26/linq-amp-lambda-part-3-html-agility-pack-to-linq.aspx
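
For anyone who just wants the gist without grabbing the patch, here is a minimal sketch of the idea described above: save the HtmlDocument as XML, then parse that text into an XDocument. This is not the code from the patch itself. The class name HtmlDocumentExtensions and the choice to buffer through a StringWriter are assumptions made for illustration; it relies only on the pack's OptionOutputAsXml flag and Save(TextWriter) method.

```csharp
using System.IO;
using System.Xml.Linq;
using HtmlAgilityPack;

// Hypothetical helper class; only the method name ToXDocument comes from the thread.
public static class HtmlDocumentExtensions
{
    // Converts a parsed HtmlDocument into a LINQ to XML XDocument by
    // round-tripping through the pack's XML output.
    public static XDocument ToXDocument(this HtmlDocument document)
    {
        using (var writer = new StringWriter())
        {
            // Ask the pack to emit well-formed XML instead of raw HTML.
            // Note: this changes the option on the caller's document.
            document.OptionOutputAsXml = true;
            document.Save(writer);

            // Parse the XML text into an XDocument.
            return XDocument.Parse(writer.ToString());
        }
    }
}
```

Usage would then look something like: load the page with doc.LoadHtml(html), call doc.ToXDocument(), and query the result with Descendants("a") and friends. Buffering the whole page as a string is the simplest thing that works; a streaming variant may be preferable for very large pages.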

I chose LINQ because my team needed to do some scraping work and weren't very proficient in XPath, but had enough LINQ skills to parse XML. I prefer LINQ to XPath because, in my opinion, it's easier to read. It may be a bit slower, but perf is rarely something I care much about until we stress test and benchmark our apps. That said, sometimes we know from the start that performance is going to be an issue (like on embedded devices), and then we take care to design for performance early and give ourselves enough time for optimization. But I digress.
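
To make the readability point concrete, here is a small side-by-side sketch of the same query written both ways against an XDocument. The class name LinkQueries, the method names, and the "external" CSS class are all made up for illustration; the XPath version uses the standard XPathSelectElements extension from System.Xml.XPath.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;
using System.Xml.XPath;

// Hypothetical example class, not part of the patch or the pack.
public static class LinkQueries
{
    // XPath: compact, but the query is an opaque string checked only at run time.
    public static List<string> HrefsWithXPath(XDocument page)
    {
        return page.XPathSelectElements("//a[@class='external']")
                   .Select(a => (string)a.Attribute("href"))
                   .ToList();
    }

    // LINQ to XML: the same filter as ordinary C#, checked by the compiler.
    public static List<string> HrefsWithLinq(XDocument page)
    {
        var hrefs =
            from a in page.Descendants("a")
            where (string)a.Attribute("class") == "external"
            select (string)a.Attribute("href");
        return hrefs.ToList();
    }
}
```

Both methods return the href values in document order, with null for a matching link that has no href attribute. Which one reads better is a matter of taste, but the LINQ version needs no second query language.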

-CV