Agility Pack changes "&" to "&amp;" ??

May 27, 2009 at 2:03 AM
Edited May 27, 2009 at 10:09 AM

I load a page via HtmlWeb.Load().

Sometimes the loaded page contains a link to a PHP page, which my browser displays as follows:

e.g. href="http://www.test.com/hello.php?option=bla?param=more"

But when I read this page via Agility Pack's HtmlWeb.Load() method, the HtmlDocument's text property (including the link) becomes:

href="http://www.test.com/hello.php?amp;option=bla?amp;param=more"

so each "?" gets replaced by "?amp;"!

1.) How can I avoid this?

2.) Does Agility Pack make any other similar substitutions when extracting links that I should know about?

3.) Why does Agility Pack do this?

 

Regards R4DIUM

 

Jun 11, 2009 at 4:04 PM
Edited Jun 11, 2009 at 4:07 PM

I believe it is a bug that was never patched (or hasn't been patched yet).

In the HtmlDocument source file change the following line in the HtmlEncode method from:

Regex rx = new Regex("&(?!(amp;)|(lt;)|(gt;)|(quot;))", RegexOptions.IgnoreCase);

To

Regex rx = new Regex(@"&(?!([a-z]+)|(\#[0-9]+)|(\#(x|X)[0-9a-fA-F]+);)", RegexOptions.IgnoreCase);

And it should resolve that issue.
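
To see what the two patterns actually match, here is a small sketch that runs both against a made-up string. The input string, the class name AmpersandDemo and the Encode helper are my own assumptions for illustration; I am also assuming HtmlEncode uses the regex to replace each matched "&" with "&amp;". One thing worth noting: as written, the trailing ";" in the new pattern belongs only to the last alternative, so any "&" followed by a letter is left untouched, whether or not it is a real entity.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AmpersandDemo
{
    // Pattern quoted above as the current one in HtmlEncode.
    public static readonly Regex Original =
        new Regex("&(?!(amp;)|(lt;)|(gt;)|(quot;))", RegexOptions.IgnoreCase);

    // Pattern proposed above as the replacement.
    public static readonly Regex Proposed =
        new Regex(@"&(?!([a-z]+)|(\#[0-9]+)|(\#(x|X)[0-9a-fA-F]+);)", RegexOptions.IgnoreCase);

    // Hypothetical helper: replace every "&" the pattern matches with "&amp;".
    public static string Encode(Regex rx, string text)
    {
        return rx.Replace(text, "&amp;");
    }

    public static void Main()
    {
        // Hypothetical input: a raw "&", an existing "&amp;" and a lone "& ".
        string input = "hello.php?option=bla&param=more&amp;x=1& y";

        Console.WriteLine(Encode(Original, input));
        // hello.php?option=bla&amp;param=more&amp;x=1&amp; y

        Console.WriteLine(Encode(Proposed, input));
        // hello.php?option=bla&param=more&amp;x=1&amp; y
    }
}
```

So with the new pattern "&param" is no longer rewritten, while a "&" that is not followed by a letter (or "#" plus digits) still becomes "&amp;".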
