FormProcessor addon: handling redirects

Topics: User Forum
Nov 11, 2007 at 6:32 PM
Edited Nov 11, 2007 at 6:39 PM
For an application I am working on, I had previously written a ton of manual redirect-handling code (yes, I know HttpWebRequest will follow redirects for you, but I needed to extract additional information from the redirections).

I've been able to avoid that need by making an extremely simple modification to the FormProcessor addon. I know it hasn't been officially integrated into the HtmlAgilityPack core (and I think it should be!), but it is an absolutely indispensable addon that can be downloaded here:

http://dotnetjunkies.com/WebLog/joshuagough/archive/2006/1/20.aspx

The modification I made is quite simple. Change the FormProcessor.GetForm(string, string, FormQueryModeEnum) method to the following:



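private string _GetFormUrl = ""; // URL the form was finally served from
public Form GetForm(string url, string xpath, FormQueryModeEnum queryMode)
{
    _GetFormUrl = url;
    _web.PostResponse += new HtmlWeb.PostResponseHandler(postResponseHandler);
    HtmlDocument doc;
    try
    {
        doc = _web.Load(url);
    }
    finally
    {
        // Detach even if Load throws, so handlers don't pile up across calls.
        _web.PostResponse -= new HtmlWeb.PostResponseHandler(postResponseHandler);
    }
    // _GetFormUrl now holds the post-redirect URL (or the original if none occurred).
    return GetForm(doc, _GetFormUrl, xpath, queryMode);
}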



And add the following method for the PostResponse handler:



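private void postResponseHandler(HttpWebRequest req, HttpWebResponse resp)
{
    // ResponseUri is the URI that actually answered the request; it differs
    // from the requested URL when a redirect was followed.
    if (_GetFormUrl != resp.ResponseUri.ToString())
    {
        _GetFormUrl = resp.ResponseUri.ToString();
    }
}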



This simple modification ensures that when the form's "action" attribute is a relative URL rather than an absolute one, it is resolved against the final location of any redirects. The AttributeReferenceAbsolutizer checks whether the action URL is relative; if so, it creates an absolute URL from the original request URL and the relative action URL. If no redirect occurred, nothing changes and the original code operates as expected.

That's why I was originally handling the redirects myself: I needed to know the final base URL against which the form's action attribute is resolved.

In the case of a redirect, that original request URL will be wrong. This simple change "fixes" the issue. It's arguable whether this is a design flaw; it's more like I've added a feature.
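
To make the problem concrete, here is a minimal sketch of how a relative action URL resolves differently depending on which base URL is used. It relies only on System.Uri; the ActionUrlDemo class, the ResolveAction helper, and the example.com URLs are all hypothetical and are not part of FormProcessor or HtmlAgilityPack. It only illustrates the kind of resolution the AttributeReferenceAbsolutizer performs, not its actual code.

```csharp
using System;

public static class ActionUrlDemo
{
    // Hypothetical helper: resolves a form's action attribute against the
    // URL of the page that contains the form.
    public static string ResolveAction(string pageUrl, string action)
    {
        // Uri(Uri, string) resolves a relative reference against a base URI
        // and leaves an already-absolute reference untouched.
        return new Uri(new Uri(pageUrl), action).AbsoluteUri;
    }

    public static void Main()
    {
        // Made-up URLs: the page was requested at one address, but the
        // server redirected to another before serving the form.
        string requested = "http://example.com/login";
        string redirectedTo = "http://example.com/secure/account/login.aspx";
        string action = "submit.aspx"; // relative action attribute

        // Base = original request URL: the form would post to the wrong place.
        Console.WriteLine(ResolveAction(requested, action));
        // prints http://example.com/submit.aspx

        // Base = final (post-redirect) URL: this is what a browser would use.
        Console.WriteLine(ResolveAction(redirectedTo, action));
        // prints http://example.com/secure/account/submit.aspx
    }
}
```

With the original request URL as the base, the submission would go to a page that likely doesn't exist; using resp.ResponseUri as the base matches what a browser does after following the redirect.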

Hope this helps someone.