gzip encoding

Topics: Developer Forum
Feb 25, 2008 at 11:08 AM
Using the HtmlAgilityPack, I came across an HTML page that caused an exception in the Get function in HtmlWeb.cs. When the line respenc = System.Text.Encoding.GetEncoding(resp.ContentEncoding); executes and resp.ContentEncoding is "gzip", an exception is thrown saying that gzip is not a supported encoding name.

This is how I solved the problem:

try
{
    respenc = System.Text.Encoding.GetEncoding(resp.ContentEncoding);
}
catch (Exception)
{
    respenc = null;
}

and when getting the responsestream:

if (resp.ContentEncoding.ToLower().Contains("gzip"))
    s = new GZipStream(s, CompressionMode.Decompress);

Now the code also works with gzip compressed pages.

Would it be possible to solve this in the official code?
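
The exception described above can be reproduced without HtmlAgilityPack, because the Content-Encoding header names a compression scheme rather than a character set (the charset, when present, travels in Content-Type). The following is a minimal sketch, not the library's code: the class EncodingLookup and the method TryGetTextEncoding are hypothetical names chosen for illustration. It catches only ArgumentException, the exception type reported later in this thread, instead of every exception.

```csharp
using System;
using System.Text;

public static class EncodingLookup
{
    // Hypothetical helper: returns the text encoding for a charset name,
    // or null when the name is empty or is not a charset (for example "gzip").
    public static Encoding TryGetTextEncoding(string name)
    {
        if (string.IsNullOrEmpty(name))
        {
            return null;
        }
        try
        {
            return Encoding.GetEncoding(name);
        }
        catch (ArgumentException)
        {
            // "gzip" lands here: it is a compression scheme, not a charset.
            return null;
        }
    }

    public static void Main()
    {
        Console.WriteLine(TryGetTextEncoding("gzip") == null);   // True
        Console.WriteLine(TryGetTextEncoding("utf-8").WebName);  // utf-8
    }
}
```

Returning null only hides the exception; the response body is still compressed and has to be decompressed before parsing.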
Jun 8, 2010 at 9:47 PM

OK, found it.

Just change the following lines in HtmlWeb.cs (line 433):

            if ((resp.ContentEncoding != null) && (resp.ContentEncoding.Length>0))
            {
                try
                {
                    respenc = System.Text.Encoding.GetEncoding(resp.ContentEncoding);
                }
                catch
                {
                    respenc = null;
                }
            }
            else
            {
                respenc = null;
            }

I hope this will be useful.

Jun 22, 2010 at 12:21 PM
Edited Jun 22, 2010 at 12:29 PM
abutun, just ignoring the content exception did not solve the problem for me using the 1.4 source. Like topomania, I had to wrap the stream in a GZipStream for decompression.

Unlike either of you, I created a method under IsHtmlContent to test explicitly for the gzip encoding:

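        private bool IsGZipEncoding(string contentEncoding)
        {
            // Ordinal, case-insensitive comparison avoids culture-specific lowercasing.
            return contentEncoding.StartsWith("gzip", StringComparison.OrdinalIgnoreCase);
        }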

I then modified the HtmlWeb.Get method from (original starting line 1466)

        Encoding respenc = !string.IsNullOrEmpty(resp.ContentEncoding)
                ? Encoding.GetEncoding(resp.ContentEncoding)
                : null;

to

        Encoding respenc = null;
        var isGZipEncoding = false;
        if (!string.IsNullOrEmpty(resp.ContentEncoding))
        {
            isGZipEncoding = IsGZipEncoding(resp.ContentEncoding);
            if (!isGZipEncoding)
            {
                respenc = Encoding.GetEncoding(resp.ContentEncoding);
            }
        }

Since the exception thrown is a general ArgumentException, I did not want to leave a catch that ate the exception in place.

Finally, I further modified HtmlWeb.Get from (original line 1486)

        Stream s = resp.GetResponseStream();

to

        Stream s;
        if (isGZipEncoding)
        {
            s = new GZipStream(resp.GetResponseStream(), CompressionMode.Decompress);
        }
        else
        {
            s = resp.GetResponseStream();
        }
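
The two changes above can be tried in isolation on in-memory streams. This is a sketch under assumptions rather than the HtmlWeb code itself: the class GzipBody and the helpers OpenBody, Compress and ReadAll are hypothetical names, and the body is assumed to be UTF-8 text.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class GzipBody
{
    // True when the Content-Encoding value starts with "gzip" (any casing).
    public static bool IsGZipEncoding(string contentEncoding)
    {
        return !string.IsNullOrEmpty(contentEncoding)
            && contentEncoding.Trim().StartsWith("gzip", StringComparison.OrdinalIgnoreCase);
    }

    // Wraps the raw body in a GZipStream only when the header says gzip.
    public static Stream OpenBody(Stream raw, string contentEncoding)
    {
        return IsGZipEncoding(contentEncoding)
            ? new GZipStream(raw, CompressionMode.Decompress)
            : raw;
    }

    // Stand-in for a server: gzip-compresses a string into a byte array.
    public static byte[] Compress(string text)
    {
        using (var buffer = new MemoryStream())
        {
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress))
            {
                byte[] bytes = Encoding.UTF8.GetBytes(text);
                gzip.Write(bytes, 0, bytes.Length);
            }
            // ToArray still works after the GZipStream has closed the buffer.
            return buffer.ToArray();
        }
    }

    // Reads the whole body as text, decompressing first if needed.
    public static string ReadAll(Stream raw, string contentEncoding)
    {
        using (var reader = new StreamReader(OpenBody(raw, contentEncoding), Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }

    public static void Main()
    {
        string html = "<html><body>hello</body></html>";
        string decoded = ReadAll(new MemoryStream(Compress(html)), "gzip");
        Console.WriteLine(decoded == html);  // True
    }
}
```

Because the gzip case is detected explicitly, no catch block is needed, and any other unsupported encoding name would still surface as an exception.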
Mar 8, 2012 at 9:22 AM

I can confirm that RunO2's solution works perfectly. 

The only small addition I would make is that one needs to add using System.IO.Compression; to the top of HtmlWeb.cs.

Dave A

Jul 12, 2012 at 7:03 PM

Are there plans to incorporate a solution to this in the NuGet package? I am not in a position to use and modify the source code. I am using the Html Agility Pack NuGet package and, more often than not, I get the same gzip encoding problem: 'gzip' is not a supported encoding name.

My code opens various URLs. Lots of pages are handled fine, and then randomly I get the error. Note that the error may come up on a URL that loaded successfully on a previous run, or, if I just rerun the code, that URL loads fine but the error appears on another URL.

I do not know why it sometimes works and sometimes does not.

So I am really just wondering whether a fix for this will be added to the package, or whether I need to somehow figure out how and what code to modify.

Thanks for any help.

Linda

Jul 12, 2012 at 7:28 PM
Edited Jul 12, 2012 at 7:28 PM

Hi Linda,

You don't have to figure it out; it is written in the posts above. My solution and Ron02's solution will both do the job.

good luck

Jul 12, 2012 at 7:35 PM

Thanks, but I definitely saw your posts above and tried to make them work before posting my reply. I am not a trained developer and can usually hack my way through based on a review of solutions like yours, but in this case I just have no idea where I am supposed to add or modify the code. I will make another attempt and see if I can hack my way through...

In the meantime, do you know if these fixes will be incorporated into the NuGet package so I won't have to modify the code?

Thanks again,

Linda

Jul 12, 2012 at 7:44 PM

Update: I do believe I just got Ron02's solution to work - THANKS!!! :)

But still the question: do you know if these fixes will be incorporated into the NuGet package so I won't have to modify the code?
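
For readers who cannot modify the library source, one possible workaround is to download and decompress the page with the standard .NET classes and only then hand the HTML to the parser. This is a sketch under assumptions, not something proposed in the thread: GzipAwareFetch, CreateRequest and DownloadHtml are hypothetical names, the page is assumed to be UTF-8, and the behavior against any particular site is untested. It relies on HttpWebRequest.AutomaticDecompression, which asks the framework to undo gzip and deflate compression itself.

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

public static class GzipAwareFetch
{
    // Builds a request that lets .NET decompress gzip/deflate responses.
    public static HttpWebRequest CreateRequest(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.AutomaticDecompression =
            DecompressionMethods.GZip | DecompressionMethods.Deflate;
        return request;
    }

    // Downloads the page and returns its text, already decompressed.
    // Assumption: the body is UTF-8; a real caller may need the charset
    // from the Content-Type header instead.
    public static string DownloadHtml(string url)
    {
        HttpWebRequest request = CreateRequest(url);
        using (var response = (HttpWebResponse)request.GetResponse())
        using (Stream body = response.GetResponseStream())
        using (var reader = new StreamReader(body, Encoding.UTF8))
        {
            return reader.ReadToEnd();
        }
    }
}
```

The returned string could then be passed to HtmlDocument.LoadHtml from the NuGet package, which bypasses HtmlWeb.Get and its encoding lookup entirely.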

Jul 12, 2012 at 7:56 PM
Edited Jul 12, 2012 at 7:57 PM

The solutions are more or less the same. Ron02 did a little more work; my solution is simpler and just as effective. I am still using this code daily and it works all the time.

I really don't know about this being fixed in the source. I am not involved in the development of the HtmlAgilityPack and I don't even know what the NuGet package is ;)

Sorry about that.

good luck

Developer
Jul 12, 2012 at 10:55 PM

I'll look into adding this for the next release. It won't work for all versions, since GZipStream isn't supported across all .NET platforms, most notably anything based on Silverlight (SL, Windows Phone, Windows Metro).

Jul 12, 2012 at 11:18 PM

Very cool - thanks so much.