

Jun 20, 2009 at 9:02 PM

Hello everyone,

When I save a website (HtmlWeb class) that I loaded before, the special characters (German umlauts) are destroyed. The encoding of the website is Latin-1 (I use the DetectEncodingHtml function).

Does anyone have an idea how to solve my problem?


Thank you very much.

Best regards,




Jun 23, 2009 at 4:29 PM

Have you tried using the detect encoding overload?

public void Load(string path, bool detectEncodingFromByteOrderMarks)

If that doesn't work, you'll need to pass your own encoding:

Load(string path, Encoding encoding, bool detectEncodingFromByteOrderMarks)


Both of these are just small wrappers around the StreamReader class. It might be good to look into how to accomplish what you need with that class.
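Since both overloads are thin wrappers around StreamReader, the effect of passing (or not passing) an encoding can be seen with StreamReader alone. Here is a small self-contained sketch (the temp file and the sample word "Müller" are just placeholders) that reads a Latin-1 file first with the default UTF-8 reader and then with an explicit Latin-1 encoding:

```csharp
using System;
using System.IO;
using System.Text;

class EncodingDemo
{
    static void Main()
    {
        // Write "Müller" to a temp file as Latin-1 (ISO-8859-1) bytes; 'ü' becomes 0xFC.
        var latin1 = Encoding.GetEncoding("iso-8859-1");
        string path = Path.GetTempFileName();
        File.WriteAllBytes(path, latin1.GetBytes("Müller"));

        // The default reader assumes UTF-8: 0xFC is not valid UTF-8, so the
        // character is replaced (typically shown as ? or the U+FFFD replacement char).
        using (var wrong = new StreamReader(path))
            Console.WriteLine(wrong.ReadToEnd());

        // Passing the correct encoding preserves the character.
        using (var right = new StreamReader(path, latin1))
            Console.WriteLine(right.ReadToEnd());   // prints "Müller"

        File.Delete(path);
    }
}
```

The same applies to the Load overloads above: once the wrong decoder has run, the original byte is gone, so supplying the right encoding up front is the only safe option.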


From looking at the code it does look like HAP will try to read the encoding from the meta tag too. The ReadDocumentEncoding method on line 1498 of HtmlDocument.cs is where you'd want to look to see if the encoding is being detected properly from your html.



Jul 16, 2009 at 10:09 PM
Edited Jul 16, 2009 at 10:14 PM

Sorry for the formatting, but this forum doesn't set a maximum width, so my text went on and on and on.

I had the same problem, but loading from a URL rather than from a file.

Normally HAP (HtmlAgilityPack) will try to detect the encoding from the HTTP response headers,
specifically from Content-Type. But when that header is missing, the default encoding is UTF8Encoding.

Loading then goes through this method:

public void Load(Stream stream, bool detectEncodingFromByteOrderMarks)

which in turn calls

public void Load(TextReader reader)

If you break after ReadToEnd():

_text = reader.ReadToEnd();
_documentnode = CreateNode(HtmlNodeType.Document, 0);
and inspect the reader's CurrentEncoding, you will see it is UTF8Encoding by default (don't take my word for it).
It is at this point that the problem becomes irreversible: ReadToEnd converts any special characters, such as
ø, æ, å, ß etc., to ? (ASCII 0x3F, decimal 63).
For details on why special characters become ?, read Joel on Software's "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".
He also mentions how IE handles this problem. It's worth the read.
If you now try to convert any InnerText with Encoding.Convert and the various encodings, it will
never work, because the character *IS* a question mark; its original byte is no more.
This is the case with the current release: because you don't have the actual bytes that were received from the server,
only the ones "translated" (into question marks) by ReadToEnd(), you can't reparse or fix this problem.
If you're lucky, your web page contains a meta tag, e.g. <meta http-equiv="content-type" content="text/html; charset=iso-8859-1" />.
HAP does have a method that detects this encoding; it's called ReadDocumentEncoding,
and it's in HtmlDocument.cs around line 1500. As you can see when reading it, it actually adds an HtmlParseError object
to the ParseErrors list, so the author clearly knew about this issue. It's nice to know when the content
encodings mismatch, but the problem still remains: how do you get your beloved characters back?
In this method:

public void Load(Stream stream, bool detectEncodingFromByteOrderMarks)

I use a MemoryStream, which stores raw bytes, so there are no conversion problems. You could also use a FileStream with
Path.GetRandomFileName() or Path.GetTempFileName(); that way you can inspect the HTML exactly as it was sent from the server, before the parser gets it.
I read from the stream parameter into the MemoryStream. Once that's done, I call Seek to put the read pointer back at the beginning.
Then I call Load, but with the MemoryStream instead of the stream parameter.
The code runs and soon hits the Parse() method. What I wanted was, immediately when any meta tag is parsed,
to check whether its content-type matches _streamencoding. In the author's method, he compares the
WindowsCodePage of the Encoding objects; I do the same.
I create a method for this -- you could actually modify the author's ReadDocumentEncoding instead. I want to detect
the meta tag as early as possible, since I don't want to waste CPU cycles parsing HTML; therefore I call my own
method at the very end of CloseCurrentNode().
My method throws an exception, sets a new bool _IncorrectStreamEncoding property to true, and stores the
encoding from the meta tag in either _streamencoding or _declaredencoding.
The exception is caught in the Load method, which then checks whether _IncorrectStreamEncoding is true. If it is, I
Seek the MemoryStream back to the beginning and call Load again with my MemoryStream object and the
encoding I got from the meta tag. The stream parameter is closed by the caller, and the MemoryStream is disposed
when the Load method goes out of scope, thanks to the using statement. The HTML content is stored in _text.
You may also want to handle the various FormatExceptions that Encoding.GetEncoding() can throw, by try/catching them.
When I wrote my method to determine whether the meta tag contained the content type, I didn't know about the author's existing
method. Take note: when testing the meta tag, do check out the author's method; it contains some optimizations that
will save a lot of time during parsing. Testing ONLY _currentnode.Name.Equals("meta"), for example, is bad.
You can also use the System.Net.Mime.ContentType class to parse the "content" attribute of the meta tag, and read
its CharSet property.

This problem is the exact reason the meta tag can be used to set the content type and encoding in the first place,
as Joel's post also mentions.
The reason I didn't provide any code here is that I don't consider it a very hard challenge to fix this
problem given what I've written -- basically every main step needed to get it working. But if you're reaaally
desperate... you might convince me.
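For anyone who wants a starting point without modifying HAP internals, the buffer-and-reparse idea above can be roughly sketched with HAP's public DetectEncoding method. This is a hedged sketch, not the poster's actual code: the sample HTML and the UTF-8 fallback are assumptions, and in a real scenario the byte array would come from the response stream rather than a string literal.

```csharp
using System;
using System.IO;
using System.Text;
using HtmlAgilityPack;

class BufferAndReparse
{
    static void Main()
    {
        // Stand-in for the raw bytes received from the server: a Latin-1 page
        // that declares its own charset in a meta tag.
        var latin1 = Encoding.GetEncoding("iso-8859-1");
        byte[] raw = latin1.GetBytes(
            "<html><head><meta http-equiv=\"content-type\" " +
            "content=\"text/html; charset=iso-8859-1\" /></head>" +
            "<body>Grüße</body></html>");

        var doc = new HtmlDocument();
        using (var ms = new MemoryStream(raw))
        {
            // First pass: sniff the declared encoding from the raw bytes.
            Encoding detected = doc.DetectEncoding(ms) ?? Encoding.UTF8;

            // Second pass: rewind and decode with the detected encoding,
            // so no byte is ever run through the wrong decoder.
            ms.Seek(0, SeekOrigin.Begin);
            doc.Load(ms, detected);
        }

        Console.WriteLine(doc.DocumentNode.InnerText); // special chars survive
    }
}
```

Because the MemoryStream holds the untouched bytes, rewinding and reloading is always possible, which is exactly what is lost when ReadToEnd runs with the wrong encoding first.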


Nov 24, 2009 at 2:21 PM
Edited Nov 25, 2009 at 7:43 AM

I ran into the same problem. I'm loading an HTML page and the encoding is wrong:

HtmlAgilityPack.HtmlWeb hw = new HtmlAgilityPack.HtmlWeb();
HtmlAgilityPack.HtmlDocument Doc = hw.Load(address);

and I'm getting wrong characters.

I also detect the encoding:

System.Text.Encoding EncDoc = Doc.Encoding;

How can I fix the wrong characters?

The parse error says:

Encoding mismatch between StreamEncoding and DeclaredEncoding


Nov 27, 2009 at 3:44 PM

The HttpWebResponse has a CharacterSet property that you can pass to Encoding.GetEncoding() to get the encoding; then you can use that encoding in the Load. Just an idea.
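As an untested sketch of that idea (assuming the server actually sends a charset in its Content-Type header, which HttpWebResponse exposes via its CharacterSet property; the URL is a placeholder):

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;
using HtmlAgilityPack;

class HeaderCharsetLoad
{
    static void Main()
    {
        var request = (HttpWebRequest)WebRequest.Create("http://example.com/");
        using (var response = (HttpWebResponse)request.GetResponse())
        using (var stream = response.GetResponseStream())
        {
            // Use the charset from the response headers, falling back to UTF-8
            // when the server didn't declare one.
            Encoding encoding = string.IsNullOrEmpty(response.CharacterSet)
                ? Encoding.UTF8
                : Encoding.GetEncoding(response.CharacterSet);

            var doc = new HtmlDocument();
            doc.Load(stream, encoding);
        }
    }
}
```

Note this only helps when the header is present and correct; as discussed above, a page whose encoding is declared only in a meta tag still needs the buffer-and-reparse approach.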

Jul 15, 2011 at 5:59 PM

Unfortunately HAP doesn't seem to properly detect stream encoding before parsing the stream as text, but it is what it is. This workaround requires creating a local buffer to hold the stream data:

var doc = new HtmlDocument();
var request = (HttpWebRequest)HttpWebRequest.Create(_requestUri);
using (var response = request.GetResponse())
using (var responseStream = response.GetResponseStream())
{
    // To determine the page encoding the stream has to be read;
    // since the stream is forward-only, that means copying it to a local buffer
    // so it can be re-read by HtmlAgilityPack.
    var responseData = new byte[response.ContentLength];
    responseStream.Read(responseData, 0, (int)response.ContentLength);

    using (var ms = new MemoryStream(responseData))
    {
        // Default encoding to UTF8 if it isn't detected
        Encoding encoding = doc.DetectEncoding(ms) ?? Encoding.UTF8;
        ms.Seek(0, SeekOrigin.Begin);
        using (var sr = new StreamReader(ms, encoding))
        {
            doc.Load(sr);
        }
    }
}


Sep 16, 2011 at 8:01 PM

Hi hemp!
I just tried your example and it seems to work great except for one thing: it fails to read the entire page. I get the first part of the page, with correct encoding, but a bit into the page the rest is not read for some reason. If I use the built-in method in HtmlAgilityPack to fetch the page, I do get the entire page.

Any ideas are greatly appreciated!

Sep 16, 2011 at 8:30 PM
gardebring wrote:

I get the first part of the page, with correct encoding, but a bit into the page it seems that the rest is not read for some reason.

The issue may be that the value for response.ContentLength is shorter than the data in the response stream. I'm using it as a shortcut for determining the buffer size, but ContentLength isn't guaranteed to be correct. I would guess you're testing against a URL for which it is not.

Try reading the response stream in chunks (in a loop) while increasing the size of the buffer as necessary; it will require a pretty big rewrite of my example, which is why I didn't do it that way. The pages I'm hitting do report ContentLength as the size of the response in bytes.
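A chunked read might look something like the helper below (a sketch only; the name `ReadAllBytes` and the 8 KB chunk size are arbitrary choices, not part of any API mentioned above):

```csharp
using System.IO;

static class StreamUtil
{
    // Read a forward-only stream to the end in fixed-size chunks,
    // growing a MemoryStream instead of trusting ContentLength.
    public static byte[] ReadAllBytes(Stream stream)
    {
        using (var buffer = new MemoryStream())
        {
            var chunk = new byte[8192];
            int read;
            while ((read = stream.Read(chunk, 0, chunk.Length)) > 0)
                buffer.Write(chunk, 0, read);
            return buffer.ToArray();
        }
    }
}
```

The resulting byte array could then take the place of the ContentLength-sized buffer in hemp's example, so a short or missing ContentLength no longer truncates the page.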

Aug 16, 2012 at 12:18 PM

Hi all!

I'm having the same problem as described by rmoritz above. I tried the workaround hemp suggested, but special characters like æ, ø, å etc. were still being converted to ?. After some searching through the System.Net namespace for something that would read my URL correctly, I ended up using the following:

HtmlDocument htmlDoc = new HtmlDocument();
using (System.Net.WebClient client = new System.Net.WebClient())
{
    var html = client.DownloadString(url);
    htmlDoc.LoadHtml(html);
}

I have not tested this on other machines than my own...