
Can HAP access or preserve the original source HTML?

Topics: Developer Forum, User Forum
Dec 27, 2010 at 11:09 PM

Can I access the original/preserved source HTML through HAP? It seems that in order to build its own DOM it needs to make at least some modifications or fixups. For a specific example, say I have a source HTML fragment with empty elements like:

<td  width="15px"/>

I then set the HtmlElementFlag to keep HAP from adding a closing </td>:

if (HtmlNode.ElementsFlags.ContainsKey("td"))
    HtmlNode.ElementsFlags["td"] = HtmlElementFlag.Empty | HtmlElementFlag.Closed;
else
    HtmlNode.ElementsFlags.Add("td", HtmlElementFlag.Empty | HtmlElementFlag.Closed);

However, I can't get at the original source at all, and accessing the child.OuterHtml through HAP returns a fragment with the trailing "/" missing:

<td  width="15px">

Is there any way I can get at the original HTML source?

Dec 28, 2010 at 12:38 PM

Once the data has been parsed, the original source isn't available anymore.

Dec 28, 2010 at 12:46 PM

Personally I prefer to use HttpWebRequest/HttpWebResponse. That way you have the HTML page in a string variable and only then parse it with HtmlAgilityPack. The internal web client of HtmlAgilityPack doesn't support the POST method. This way the original page can be saved for later debugging or for other purposes.
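
A minimal sketch of that fetch-then-parse approach, in case it helps. The URL, the output file name, the class and method names, and the UTF-8 assumption are all placeholders of mine, not anything from the posts above:

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;
using HtmlAgilityPack;

static class FetchThenParse
{
    // Downloads the page body as a string using HttpWebRequest/HttpWebResponse.
    public static string Download(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET"; // for a POST, set "POST" and write the body to GetRequestStream()
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var stream = response.GetResponseStream())
        using (var reader = new StreamReader(stream, Encoding.UTF8)) // assumes the page is UTF-8
        {
            return reader.ReadToEnd();
        }
    }

    // Parses a copy of the markup; the caller keeps the original string untouched.
    public static HtmlDocument Parse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc;
    }

    static void Main()
    {
        string html = Download("http://example.com/"); // placeholder URL
        File.WriteAllText("page.html", html);          // keep the raw source for later debugging
        HtmlDocument doc = Parse(html);
        Console.WriteLine("Parsed {0} characters of source.", html.Length);
    }
}
```

Since the raw string stays in your own variable, nothing HAP does to its DOM can change it.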

Dec 28, 2010 at 6:57 PM

I would be fine using the original request/response object, which I do have access to. But I'd need HAP to give me an offset into it, which I don't think it does. Otherwise I'm back to using regex to find the position in the original source code. My scenario is reporting security issues/vulnerabilities in the HTML code, which I then need to display to the user in a report.
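
One thing that may be worth trying for the offset: as far as I know, HtmlNode exposes Line, LinePosition and StreamPosition properties. The sketch below assumes those properties are present in the build you are using and that StreamPosition is an index into the parsed text; please verify both against your version before relying on it. The class name and the Snippet helper are hypothetical.

```csharp
using System;
using HtmlAgilityPack;

static class NodeOffsets
{
    // Returns up to 'length' characters of 'source' starting at 'offset',
    // or an empty string when the offset falls outside the source.
    public static string Snippet(string source, int offset, int length)
    {
        if (offset < 0 || offset >= source.Length) return string.Empty;
        return source.Substring(offset, Math.Min(length, source.Length - offset));
    }

    static void Main()
    {
        string source = "<table><tr><td  width=\"15px\"/></tr></table>";
        var doc = new HtmlDocument();
        doc.LoadHtml(source);

        // SelectNodes returns null when nothing matches, so guard against that.
        HtmlNodeCollection cells = doc.DocumentNode.SelectNodes("//td");
        if (cells == null) return;

        foreach (HtmlNode cell in cells)
        {
            // Assumption: StreamPosition indexes into the text that was parsed.
            Console.WriteLine("line {0}, column {1}, offset {2}",
                cell.Line, cell.LinePosition, cell.StreamPosition);
            Console.WriteLine(Snippet(source, cell.StreamPosition, 20));
        }
    }
}
```

If the reported position lines up with your copy of the response body, you can slice the original markup yourself for the report instead of relying on OuterHtml.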

Jan 11, 2011 at 9:49 PM

HAP does keep a reference to the original string, but it is an internal field. Right now you could use Reflection to access it: it is the Text field on the HtmlDocument. I had actually been contemplating clearing this out, though, since it is the source of some memory issues when loading large documents. The HtmlDocument object ends up using more than 2x the memory of the original document, so loading a 4 MB HTML file will result in an 8 MB+ variable. Clearing out this field would help reduce that.
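
A rough sketch of the Reflection route. The member name "Text" comes from the post above; the class and method names are hypothetical. Because the member is internal and may be changed or cleared in a later build, the helper looks for either a field or a property and returns null when it finds neither:

```csharp
using System;
using System.Reflection;
using HtmlAgilityPack;

static class OriginalSource
{
    // Reads the member named "Text" from an HtmlDocument via Reflection.
    // Returns null if this build of the library has no such member.
    public static string GetOriginalText(HtmlDocument doc)
    {
        const BindingFlags flags =
            BindingFlags.Instance | BindingFlags.Public | BindingFlags.NonPublic;
        Type type = typeof(HtmlDocument);

        FieldInfo field = type.GetField("Text", flags);
        if (field != null)
            return field.GetValue(doc) as string;

        PropertyInfo property = type.GetProperty("Text", flags);
        if (property != null)
            return property.GetValue(doc, null) as string;

        return null;
    }

    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<table><tr><td  width=\"15px\"/></tr></table>");

        string original = GetOriginalText(doc);
        Console.WriteLine(original ?? "(Text member not found in this build)");
    }
}
```

Keep in mind this depends on an implementation detail, so keeping your own copy of the source string is the safer option where you can.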