Scraping Javascript

Topics: User Forum
Aug 26, 2006 at 5:59 PM
Hi Simon,
I am using HtmlAgilityPack to the hilt, thanks to you.
I have been very successful at scraping HTML using your toolkit and XPath.
However, how can I scrape JavaScript without resorting to regular expressions?
As you might have noticed, a lot of sites have recently started using a "dynamic scripting" technique to pass data from the server as script instead of traditional HTML.

<script type="text/javascript" language="javascript">
var myArr = new Array();
var myRt=false;
myArr0={Wh:'1715',Nm:'Tommy'} ;
myArr1={Wh:'1715',Nm:'Tommy'} ;
myArr2={Wh:'1715',Nm:'Tommy'} ;
myArr3={Wh:'1715',Nm:'Tommy'} ;
myArr4={Wh:'1715',Nm:'Tommy'} ;
var myDArr= new Array();
var myDRt=false;
myDArr0={Wh:'1615',Nm:'Greg'} ;
myDArr1={Wh:'1615',Nm:'Greg'} ;
myDArr2={Wh:'1615',Nm:'Greg'} ;
myDArr3={Wh:'1615',Nm:'Greg'} ;

Any help would be appreciated.


Aug 26, 2006 at 6:42 PM
Hi Simon,
My question above will come in handy when scraping AJAX/Atlas-based websites, where a lot of JavaScript is used. Please note that I am talking about JavaScript and not JSON, for which many deserializers are available.

Any hints would be greatly appreciated.

Aug 31, 2006 at 7:34 PM
Hi, Simon.
Love the tool - I have the same question, and am wondering if your tool supports fn:substring-before(fn:substring-after...?

Sep 1, 2006 at 9:12 PM
Hi guys,

Unfortunately, SCRIPT and STYLE tags are completely opaque to the Html Agility Pack. It just treats their content as a blob of text (CData, actually). If you think about it, the Internet Explorer internal parser (same with other browsers) does much the same: it extracts the content of the SCRIPT tag and hands it off directly to the script engine in question (JScript.dll, VBScript.dll, ...), together with a living object model (the HTML DOM). So the whole thing works, but not statically, and not without a script engine. It works because some piece of code actually runs (including hacker code!).

So there are no out-of-the-box solutions to your problems... Sorry :P
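
To illustrate the point about needing a script engine, here is a minimal sketch in JavaScript. It assumes Node.js and its built-in vm module, not anything in the Html Agility Pack. It also assumes the text of the SCRIPT node has already been pulled out by the HTML parser; the scriptText literal below is a shortened copy of the sample from the first post, and extractGlobals is a hypothetical helper name. Note that this really executes the page's code, and Node's vm module is not a security mechanism, so it is only reasonable for scripts you trust.

```javascript
// Sketch: run a page's inline script in a throwaway context and read back
// the variables it defines. Assumes Node.js (built-in "vm" module).
const vm = require('vm');

// Stand-in for the text of the SCRIPT node, as returned by the HTML parser.
// Shortened copy of the sample from the first post.
const scriptText = `
var myArr = new Array();
var myRt=false;
myArr0={Wh:'1715',Nm:'Tommy'} ;
myArr1={Wh:'1715',Nm:'Tommy'} ;
var myDArr= new Array();
var myDRt=false;
myDArr0={Wh:'1615',Nm:'Greg'} ;
`;

// Hypothetical helper, not part of any library: executes `code` and returns
// the object that served as its global scope.
function extractGlobals(code) {
  const sandbox = {};          // becomes the script's global object
  vm.createContext(sandbox);
  // The timeout only guards against runaway loops; it does not make
  // untrusted code safe to run.
  vm.runInContext(code, sandbox, { timeout: 1000 });
  return sandbox;
}

const globals = extractGlobals(scriptText);
console.log(globals.myArr0.Nm);   // prints "Tommy"
console.log(globals.myDArr0.Wh);  // prints "1615"
```

Because the script is executed rather than analysed statically, the values come back as ordinary objects and no regular expressions are needed. The trade-off is the one described above: some piece of foreign code actually runs.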