
Detecting the encoding before loading HTML

description

The HtmlWeb class throws an exception when it cannot detect the document encoding.
After some investigation I decided not to use this class at all and wrote my own charset
detector based on the Mozilla charset detector and the ICU library detector.
The latter requires C++/CLI to build a C#-compatible assembly. For the best accuracy I
delete all non-letter symbols except spaces, \n, \t and \r. If the text is not in English, I also delete English characters
(they share the same code points in most encodings). Extracting the text only requires a regex, but you cannot use the .NET Regex class, because it does not support byte arrays. You can use either the STL regex class or the Boost library. The latter is more convenient and supports Perl regex syntax.
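
The clean-up described above could look roughly like the following. This is only a minimal sketch using the STL regex class, not the original detector code. The function names `strip_tags` and `keep_letter_bytes` are hypothetical, and it assumes a single-byte or ASCII-compatible encoding in which every byte >= 0x80 can be treated as part of a letter (this would not hold for UTF-16, for example).

```cpp
#include <regex>
#include <string>

// Hypothetical helper: replaces anything that looks like an HTML tag with a
// single space. Works on raw bytes stored in a std::string, so no decoding
// is needed before the encoding is known.
std::string strip_tags(const std::string& html) {
    static const std::regex tag("<[^>]*>");
    return std::regex_replace(html, tag, " ");
}

// Hypothetical helper: keeps whitespace (space, \n, \t, \r) and bytes >= 0x80.
// ASCII letters are kept only when drop_ascii_letters is false; digits and
// ASCII punctuation are always removed.
std::string keep_letter_bytes(const std::string& text, bool drop_ascii_letters) {
    std::string out;
    out.reserve(text.size());
    for (unsigned char c : text) {
        const bool whitespace = c == ' ' || c == '\n' || c == '\t' || c == '\r';
        const bool ascii_letter = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
        const bool high_byte = c >= 0x80;  // assumption: non-ASCII byte == letter data
        if (whitespace || high_byte || (ascii_letter && !drop_ascii_letters)) {
            out.push_back(static_cast<char>(c));
        }
    }
    return out;
}
```

For a page that is not in English, `keep_letter_bytes(strip_tags(bytes), true)` would leave only whitespace and the non-ASCII bytes for the detectors to look at. Note that some std::regex implementations recurse deeply on very long inputs, which may be one more reason to prefer Boost for large documents.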

Detection is split into 4 steps:
  1. Download the byte array representing the HTML document, using the System.Net.WebClient.DownloadData() method.
  2. Mozilla detector. Unfortunately, in some cases it cannot detect the encoding, especially for
    non-English languages.
  3. ICU detector, if Mozilla detection fails.
  4. <meta> charset declared in the HTML header, if ICU detection also fails.
    The <meta> charset can in some cases contain the wrong encoding, therefore it is used at the end of
    the detection chain.
Now you can load the document with the HtmlDocument.Load(Stream, Encoding) method.
I've tested it on over 1000 HTML pages, almost all of them in Russian, and it worked fine for me.
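
The fallback order of the detection steps can be sketched as a simple chain of detectors. This is an illustration under assumptions, not the actual implementation. `Detector`, `detect_charset` and `meta_charset` are hypothetical names, an empty string stands for "could not detect", and the real Mozilla and ICU detectors would have to be wrapped as callables and placed ahead of the <meta> lookup.

```cpp
#include <functional>
#include <regex>
#include <string>
#include <vector>

// Hypothetical detector signature: takes the raw bytes of the page and
// returns a charset name, or an empty string when detection fails.
using Detector = std::function<std::string(const std::string&)>;

// Last-resort detector: reads the charset from a <meta> declaration.
// Only the first 4096 bytes are scanned (an arbitrary limit for this sketch).
std::string meta_charset(const std::string& html) {
    static const std::regex re(
        R"re(<meta[^>]*charset\s*=\s*["']?([^\s"'>;/]+))re",
        std::regex::icase);
    const std::string head = html.substr(0, 4096);
    std::smatch m;
    if (std::regex_search(head, m, re)) {
        return m[1].str();
    }
    return std::string();
}

// Runs the detectors in order and returns the first non-empty answer.
std::string detect_charset(const std::string& bytes,
                           const std::vector<Detector>& chain) {
    for (const auto& detector : chain) {
        std::string name = detector(bytes);
        if (!name.empty()) {
            return name;
        }
    }
    return std::string();
}
```

With wrappers for the two statistical detectors in place, the chain would be built as `{mozilla, icu, meta_charset}`, so the declared charset is consulted only when both detectors give up. The resulting name can then be mapped to a .NET Encoding on the C# side before calling HtmlDocument.Load(Stream, Encoding).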

comments