How to evaluate JavaScript within C#? (need to get all links for a web page, including JavaScript-generated ones)

Topics: Developer Forum
Feb 14, 2010 at 8:47 PM

Hi,

I'm wondering whether the Html Agility Pack could help me out with this.

Background: I have to download web pages along with their resources for offline viewing. As part of this I have to "rewrite" the URLs of links within the HTML page so that they still work offline. This is fine for the standard types of links, but I'm now realizing that some links are created dynamically by JavaScript.

Question: What approach (or existing library) could I use to convert a web page with dynamically generated links (from JavaScript) into a page with normal, static links? I could then do the URL rewriting I need.

Notes:

  • It's almost as if I need a JavaScript interpreter library that I pass the page HTML to and that then spits out the generated HTML. Then I can rewrite the links as I wish (the result would no longer use the dynamic JavaScript approach).
  • Context is a C# WinForms (3.5) application.

Thanks

Feb 14, 2010 at 9:40 PM

HAP does not provide any mechanism for executing JavaScript. Think of it as a specialized XmlDocument.

Since you are in a WinForms app, you could use the WebBrowser control to load the HTML. Or check out http://groups.google.com/group/csexwb; there is an HtmlParser class in there that will load the HTML using MSHTML without the need to display the browser. BE WARNED, however: when you start loading HTML and executing script, you take on all the headaches of a full page load, such as JavaScript errors that stop execution, popups, unresponsive loads, downloading all externally linked resources (images, CSS, JavaScript, iframes), and more. It can be a tough battle to fight.

Consider this: you say the JavaScript builds some of the links on the page. Where does the JavaScript get the info to build those links? Can you just grab that info and build them yourself?

 

Hope this helps.

-William

Feb 15, 2010 at 12:11 AM

Thanks. Re your last point, here's an example:


<script type="text/javascript">
        <!--
            document.write("<a href=\"/home.asp\" onMouseOver=\"MM_swapImage('tab_home','','/_includes/images/tab_home_.gif',1)\" onMouseOut=\"MM_swapImgRestore()\"><img src=\"/includes/images/tab_home.gif\" alt=\"Home\" name=\"tab_home\" width=\"45\" height=\"18\" border=\"0\" id=\"tab_home\"><\/a>");

            if (window.document.location.pathname.indexOf("mysite.asp") != "-1") {
                document.write("<a href=\"/mysite.asp\" onMouseOver=\"MM_swapImage('tab_my_site','','/_includes/images/tab_my_site_.gif',1)\" onMouseOut=\"MM_swapImgRestore()\"><img src=\"/_includes/images/tab_my_site_.gif\" alt=\"My Site\" name=\"tab_my_site\" width=\"76\" height=\"18\" border=\"0\" id=\"tab_my_site\"><\/a>");
            }
            else {
                document.write("<a href=\"/mysite.asp\" onMouseOver=\"MM_swapImage('tab_my_site','','/_includes/images/tab_my_site_.gif',1)\" onMouseOut=\"MM_swapImgRestore()\"><img src=\"/_includes/images/tab_my_site.gif\" alt=\"My Site\" name=\"tab_my_site\" width=\"76\" height=\"18\" border=\"0\" id=\"tab_my_site\"><\/a>");
            }
        // -->
</script>

Feb 15, 2010 at 12:44 AM

Ouch. Yep, worst-case scenario.

HAP can get you the script tag, and then you can do some creative parsing on the script text. Otherwise you'll need to render it as I mentioned above.
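A minimal sketch of that kind of parsing, assuming the script only contains plain double-quoted document.write("...") calls like the ones in the example above. The function name extractWrittenLinks is hypothetical, and the sketch is in JavaScript to match the snippets in this thread; the same two regular expressions should port to .NET's Regex class without much change.

```javascript
// Hypothetical helper: finds each document.write("...") call in the script
// text and returns the href values inside the written HTML, in order.
function extractWrittenLinks(scriptText) {
  // A double-quoted literal that may contain backslash escapes (\" or \/).
  const writeCall = /document\.write\(\s*"((?:[^"\\]|\\.)*)"\s*\)/g;
  const hrefAttr = /href\s*=\s*"([^"]*)"/gi;
  const links = [];
  for (const call of scriptText.matchAll(writeCall)) {
    // Undo simple escaping only: \" becomes " and \/ becomes /.
    // Escapes such as \n or \u0041 are NOT handled by this sketch.
    const html = call[1].replace(/\\(.)/g, "$1");
    for (const attr of html.matchAll(hrefAttr)) {
      links.push(attr[1]);
    }
  }
  return links;
}
```

Nothing is evaluated here, so an if/else like the one on location.pathname yields the links from both branches, and anything built by string concatenation is missed.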

You could also try a hybrid approach: get the script from HAP and inject it into a browser to get it to render those document.write calls. But I have a feeling that your situation is more complex than the code sample you show above, so that probably won't really work for you.
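To illustrate the idea of letting the script run and collecting what it writes, here is a rough sketch that swaps the real browser for a hand-written stub. That is an assumption on my part, not what the Browser control does: the helper name captureDocumentWrite and the stub objects are hypothetical, and the stub exposes only the members the example script touches.

```javascript
// Hypothetical helper: runs a script against a stub document and returns
// everything the script passed to document.write as one HTML string.
// Pass the script body without the surrounding <!-- --> comment markers.
function captureDocumentWrite(scriptText, pathname) {
  const written = [];
  // Only location.pathname and write() are stubbed; a real page is likely
  // to call other members (MM_swapImage, FlashObject, ...) and fail here.
  const fakeDocument = {
    location: { pathname: pathname },
    write: function (html) { written.push(String(html)); },
  };
  const fakeWindow = { document: fakeDocument };
  // Caution: this executes the script, so only use it on trusted input.
  const run = new Function("document", "window", scriptText);
  run(fakeDocument, fakeWindow);
  return written.join("");
}
```

Because the script is actually executed, the pathname check picks a single branch, so the result depends on which page path you pass in. The returned HTML can then be parsed for links like any other markup.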

Best of luck!
-William 

Feb 15, 2010 at 12:54 AM

Thanks :)

I tried a Firefox "download as webfile" and noted that it didn't cover this scenario either.

Feb 15, 2010 at 6:49 AM

Another example I found:


<script type="text/javascript">
var fo = new FlashObject("/homepage/ia/flash/hero/banner.swf?q=1", "hero", "642", "250", "8", "#ffffff");
fo.addParam("wmode", "transparent");
fo.addParam("allowScriptAccess", "always");
fo.addParam("base", "/homepage/ia/flash/hero/");
fo.write("flashContent");
</script>