Chinese text becomes garbled after using HTML Agility

Topics: Developer Forum, Project Management Forum
Jul 20, 2010 at 10:41 AM

Hi All,

I've recently been using HTML Agility to convert HTML to text. It has always worked well, but when I convert a Chinese HTML page, the text comes out garbled.

 

Does anyone have any suggestions? Thanks in advance!

Jess 

Aug 12, 2010 at 9:29 AM

Set the right Encoding.

Aug 16, 2010 at 11:45 PM

The "Encoding" and "DeclaredEncoding" properties of the Document class already hold the web page's correct encoding, such as GB2312; you can read that encoding and use it to decode the page text.

But personally, I think HAP should decode the HTML text automatically.
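
To illustrate the point about decoding with the declared encoding, here is a minimal sketch in Python rather than HAP code. The sample page is made up, and the bytes stand in for a file or HTTP response body. It only shows why the same bytes read as garbage under the wrong encoding and as Chinese under the right one; it says nothing about how HAP or .NET behaves.

```python
# Hypothetical sample page; the bytes stand in for a file or HTTP response body.
html = '<html><head><meta charset="gb2312"></head><body>中文测试</body></html>'
raw = html.encode("gb2312")

# Decoding with the wrong encoding produces garbled text.
wrong = raw.decode("latin-1")

# Decoding with the encoding the page declares recovers the original text.
right = raw.decode("gb2312")
```

The markup itself survives either way because it is ASCII; only the Chinese characters in the body are corrupted by the wrong choice.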

Aug 16, 2010 at 11:59 PM
This has been covered extensively before. While I agree it should be more automatic, the problem is that by the time the encoding can be detected from the HTML header, the stream is already being read and would need to be restarted. That is possible and has been looked into in the past, but I found that GB2312 doesn't decode correctly: even after applying the encoding, the characters were still corrupted. Switching to UTF-8 or UTF-16, on the other hand, allowed the document to be parsed correctly. Personally, I think most of the HtmlWeb class needs to be refactored and possibly dropped. It recreates or hides a lot of functionality that was added to .NET after it was written.
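
One way around the restart problem is to buffer the raw bytes first, look for the declared charset, and only then decode. The following is a rough sketch in Python, not HAP code. The function names `sniff_declared_charset` and `decode_html` are hypothetical, the 4096-byte scan window is an arbitrary assumption, and byte order marks and UTF-16 pages are not handled.

```python
import re


def sniff_declared_charset(raw: bytes, default: str = "utf-8") -> str:
    """Return the charset named in a <meta> tag, or `default` if none is found."""
    # Charset declarations are ASCII, so the head can be scanned safely as latin-1.
    head = raw[:4096].decode("latin-1")
    match = re.search(r'charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)', head, re.IGNORECASE)
    return match.group(1).lower() if match else default


def decode_html(raw: bytes) -> str:
    """Decode buffered page bytes using the declared charset, with a fallback."""
    charset = sniff_declared_charset(raw)
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Unknown or mismatched declaration: fall back rather than fail.
        return raw.decode("utf-8", errors="replace")
```

Because the whole page is held in memory as bytes, nothing needs to be re-read from the network; the cost is that the page must be fully buffered before parsing starts.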