charset/unicode conversion

Jan 1, 2010 at 9:43 PM

I'm processing a number of non-English HTML docs with HAP, and realized that a number of them are not UTF-8 but use a specific charset instead. I'd like to detect the charset, whether through the meta tag or any other method, and auto-convert the document into UTF-8 to simplify my further processing. Does HAP have any enhanced capabilities for detecting the source charset and/or helping with this conversion? I'd prefer not to have to hack together a bunch of code to figure this out.

Thank you

Jan 1, 2010 at 11:06 PM

HAP supports that fully. First, HAP reads the charset declared in the document and tries to use it. If it is not a supported encoding, it throws an ArgumentException. You can, however, tell it which encoding to use by passing your own Encoding to the Load method. HtmlDocument.DetectEncoding returns an Encoding object based on the charset declared in the HTML.

var doc = new HtmlDocument();
doc.Load(somePath, Encoding.GetEncoding("utf-8")); // somePath is a local file; this forces UTF-8
var encoding = doc.DetectEncoding(somePath);       // get the detected encoding

You can also use the overload of Load that has HAP detect the encoding from the byte order mark (BOM) of the file/stream. This is a binary detection.
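
For example, a minimal sketch of that overload (the boolean mirrors StreamReader's detectEncodingFromByteOrderMarks parameter; "page.html" is a placeholder file):

var doc = new HtmlDocument();
using (var stream = File.OpenRead("page.html"))
{
    // true = detect the encoding from the byte order mark, if one is present
    doc.Load(stream, true);
}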

Jan 1, 2010 at 11:12 PM

Thanks Darth, I'm off to work on this some more and will post my progress here.

Jan 2, 2010 at 12:56 PM

If you find a solution, can you post it? I'm also stuck with the encoding.

Jan 2, 2010 at 8:35 PM

OK, I've made a few discoveries.

I'm using www.sina.com.cn as an example: a Chinese website that is *not* using UTF-8, but uses codepage (charset) gb2312 instead.

It seems that HAP makes some assumptions about encoding, so the Chinese text on the page was being corrupted.

If I use:

var web = new HtmlAgilityPack.HtmlWeb();
web.AutoDetectEncoding = true;
var doc = web.Load("http://www.sina.com.cn");

This fails: apparently it loads the page first and then detects the encoding, but by then the content has already been converted to in-memory Unicode without knowing it was gb2312 to begin with (and is thus corrupted).

The way I found to make this work is as follows:

Jan 2, 2010 at 8:38 PM

For some reason my post was cut off above; this is the rest:

// request the page once, only to detect the encoding used
var doc = new HtmlDocument();
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.sina.com.cn");
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
Encoding enc = doc.DetectEncoding(response.GetResponseStream());
// now request it again and load with the detected encoding
request = (HttpWebRequest)WebRequest.Create("http://www.sina.com.cn");
response = (HttpWebResponse)request.GetResponse();
doc.Load(response.GetResponseStream(), enc);

What I'd like to find is a method that lets this all be done in one request, as this version effectively requests the same page twice.

Jan 2, 2010 at 9:38 PM

Try the DetectEncodingAndLoad method ;)

If you want control over the authentication or user agent, use the HtmlWeb class. It has an AutoDetectEncoding property and will use DetectEncodingAndLoad when it makes the web request.
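
A quick sketch of both routes (assuming a local file "page.html" for the first):

var doc = new HtmlDocument();
doc.DetectEncodingAndLoad("page.html"); // detects the encoding first, then loads with it

// or, when fetching over HTTP:
var web = new HtmlWeb { AutoDetectEncoding = true };
var doc2 = web.Load("http://www.sina.com.cn");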

Jan 3, 2010 at 1:10 AM

Darth, look above and you will see I already tried your approach, and even specifically set AutoDetectEncoding = true on the HtmlWeb class (to be sure), but it fails to work correctly, at least in this instance. It will have the correct encoding set once it completes, but the text was converted incorrectly and has been corrupted. Let me know if I should log this as a bug.

I think everything works fine when dealing with UTF-8 encoded sites, but a site that uses a foreign-language codepage (perhaps just multi-byte codepages?) fails with this approach.

Jan 3, 2010 at 2:22 AM

I see now, looking into it further. The end of your one post got changed into tiny type, so I missed that. It does seem that it is messing up on the multi-byte chars.

Jan 3, 2010 at 3:53 AM

Excellent, let me know how I can help.

Jan 3, 2010 at 6:18 AM

Darth, in case this helps, this seems to work (and doesn't require re-downloading the HTML):

var doc = new HtmlDocument();
HttpWebRequest request = (HttpWebRequest)WebRequest.Create("http://www.sina.com.cn");
HttpWebResponse response = (HttpWebResponse)request.GetResponse();
// buffer the full response (a single Read call is not guaranteed to fill the array)
byte[] barr;
using (var ms = new MemoryStream())
{
    response.GetResponseStream().CopyTo(ms);
    barr = ms.ToArray();
}
response.Close();
// decode tentatively as UTF-8 just so DetectEncodingHtml can find the meta charset
string data = Encoding.UTF8.GetString(barr);
Encoding encod = doc.DetectEncodingHtml(data);
// re-decode the raw bytes with the detected encoding and load the result
string convstr = Encoding.Unicode.GetString(Encoding.Convert(encod, Encoding.Unicode, barr));
doc.LoadHtml(convstr);

Jan 3, 2010 at 9:01 PM

I've already found several other cases that must be addressed differently, so now I think the only case that is not converting correctly is where there is no indication of the charset at all (HTTP headers, meta tag, etc.). I believe IE handles these by using heuristics to guess at the charset, but I won't worry about those for now.

Darth, if you are interested, I can address this and submit a patch to try to handle all the various methods of determining encoding. Feel free to add me to the dev team if you like, thanks.

I may also be adding the capability to redirect to alternative pages via the meta refresh tag (server redirects work OK for now, in theory).

One question while I'm here: are there capabilities built in to decode Unicode entity tags? i.e., &#x3032; --> convert it into a real Unicode char.

Brady

Jan 3, 2010 at 10:07 PM

I just submitted a change to support Unicode HTML entities a few days ago. It was a patch submitted by tsai. It detects the 'x' and then does a base-16 conversion instead of base-10.
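
Roughly, the decoding described above works like this (a sketch of the idea, not the actual patch; DecodeNumericEntity is a hypothetical helper):

// e.g. "&#x3032;" (hex, note the 'x') vs. "&#12338;" (decimal)
static string DecodeNumericEntity(string entity)
{
    string body = entity.Substring(2, entity.Length - 3); // strip "&#" and ";"
    int code = (body[0] == 'x' || body[0] == 'X')
        ? Convert.ToInt32(body.Substring(1), 16)          // base-16 convert
        : Convert.ToInt32(body, 10);                      // base-10 convert
    return char.ConvertFromUtf32(code);
}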

I'm interested in seeing your patches. I did some delving last night and arrived at code similar to what you posted. One of the difficulties will be adding more encoding support without adding more complexity to the library, while maintaining backwards compatibility; that is one thing that will probably break in the 2.0 series. I also discovered some things in the encoding detection that did not work the way I thought, and places where it was counter-intuitive. For example, when one passes in a particular encoding, it is not used when saving the document; the encoding detected from the charset is still used. I think it should use whatever was passed in to read the document.

I do not have the power to add new developers, that is up to simonm.

Jan 4, 2010 at 1:52 AM

Thanks for the update.

Actually, I just tested running HtmlDecode() on the text and it seems to convert the hex entities without problem, so this might already be a solution. Not to stray too far off topic.
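
For reference, assuming the HtmlDecode in question is System.Web's HttpUtility.HtmlDecode:

string s = System.Web.HttpUtility.HtmlDecode("&#x3032;"); // yields the actual Unicode character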

Regarding saving the documents: I think saving with the original encoding makes the most sense; perhaps these rules are best (sketched in code below):

- If an encoding was detected, use it.
- If a specific encoding is assigned for saving, convert to that.
- If no encoding was detected and none was assigned, use a default; UTF-8 is the likely candidate.
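
In code, those rules might look something like this (a hypothetical helper, not anything in HAP today):

static Encoding ChooseSaveEncoding(Encoding assignedForSaving, Encoding detected)
{
    if (assignedForSaving != null)
        return assignedForSaving; // an explicitly assigned save encoding wins: convert to it
    if (detected != null)
        return detected;          // otherwise keep the encoding that was detected
    return Encoding.UTF8;         // nothing detected, nothing assigned: default to UTF-8
}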

Jan 5, 2010 at 8:38 PM
Edited Jan 5, 2010 at 9:13 PM

HtmlAgilityPack/HtmlWeb.cs
                        if (respenc != null)
                        {
                            doc.Load(s, respenc);
                        }
                        else
                        {
      -                     // doc.Load(s, true);
      +                     doc.Load(s);
                        }

HtmlAgilityPack/HtmlDocument.cs
        public void Load(Stream stream)
        {
      -     // Load(new StreamReader(stream, OptionDefaultStreamEncoding));
            // modified for documents that have no declared encoding but contain special characters
      +     Load(new StreamReader(stream, System.Text.Encoding.GetEncoding(1252)));
        }

HtmlAgilityPack/HtmlWeb.cs
            _requestDuration = Environment.TickCount - tc;
            _responseUri = resp.ResponseUri;
      -     bool html = IsHtmlContent(resp.ContentType);
      +     string contentType = resp.ContentType;
      +     bool html = IsHtmlContent(contentType);

            Encoding respenc = !string.IsNullOrEmpty(resp.ContentEncoding)
                                   ? Encoding.GetEncoding(resp.ContentEncoding)
                                   : null;
      +     if (respenc == null)
      +     {
      +         string encode = Charset(contentType);
      +         if (encode != null)
      +         {
      +             respenc = Encoding.GetEncoding(encode);
      +         }
      +     }

HtmlAgilityPack/HtmlWeb.cs
      +     private string Charset(string contentType)
      +     {
      +         // fields like "text/html; charset=utf-8; ..." arrive here;
      +         // the goal is to pick out the charset attribute and return its value
      +         try
      +         {
      +             string[] filtrar = contentType.Split(new string[] { "charset=" }, StringSplitOptions.None);
      +             string charset = filtrar[1].Split(';')[0];
      +             return charset;
      +         }
      +         catch (Exception)
      +         {
      +             return null;
      +         }
      +     }

I downloaded the latest version of HAP and modified the HtmlWeb and HtmlDocument classes as above. This code worked excellently for me.
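
If the patch behaves as described, a page whose charset appears only in the Content-Type header should now load correctly in a single request, e.g.:

var web = new HtmlWeb();
var doc = web.Load("http://www.sina.com.cn"); // respenc now falls back to the header's charset= value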

Jan 5, 2010 at 9:36 PM

This is very similar to what I did, looks good!

Jan 5, 2010 at 9:38 PM

Except I think this solution will still fail with double-byte/multibyte character sets like gb2312: reading the stream as Windows-1252 decodes each byte as one character, which mangles multi-byte sequences before any detection or conversion can happen.
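
To see why, consider what happens to gb2312 bytes when they are decoded as a single-byte codepage (this assumes the gb2312 codepage is available on the machine):

byte[] raw = Encoding.GetEncoding("gb2312").GetBytes("新浪网"); // multi-byte Chinese text
string wrong = Encoding.GetEncoding(1252).GetString(raw);      // each byte decoded as one character
// "wrong" is mojibake: the gb2312 byte pairs are split into unrelated single-byte characters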

Mar 24, 2010 at 2:58 PM

I ran into this issue recently and found the solution in this years-old patch:

http://htmlagilitypack.codeplex.com/WorkItem/View.aspx?WorkItemId=15535