Garbled Output?

Topics: Developer Forum, User Forum
Apr 13, 2010 at 3:36 PM

Hello all, I've been reading everything I can find about this error but can't seem to locate a solution
(or at least one that I can comprehend well enough to implement). My problem is this: I am reading
and changing some attributes inside a series of local HTML pages. That part of the application works fine. However,
when the file is saved, a lot of random garbage text is inserted (or replaces existing characters). The content
pages that I am parsing say that they are utf-8, which is what I understand HAP uses internally? I cannot, however,
confirm that these pages are correctly implementing the format. The problem was very noticeable on pages with
some Japanese characters which were written in &#xxxxx; form - but it was strange because many characters of the
same type were unaffected. Also, on other pages that had no special codes, it converted a quotation mark to â€
and an apostrophe to â€™. Random spaces seem to turn into Â.

I found one related discussion on this site from a while ago, but I couldn't understand how they fixed it.

Is there a way to turn off encoding altogether and just read it like single characters in a text file? I just need a way
to get it to stop eating my text and leave everything alone besides what I've told it explicitly to touch. It seems like
I remembered having a similar problem when using HTML Tidy for something else, but there was some kind of
option flag I was able to turn off in the settings to make it stop.

Thank you for any help you can offer!

Apr 13, 2010 at 3:38 PM
Edited Apr 13, 2010 at 3:38 PM

Here is a link to the related discussion: http://htmlagilitypack.codeplex.com/Thread/View.aspx?ThreadId=60174

Note that in my case I do not use any http requests, just a local file.

Apr 13, 2010 at 11:41 PM

Ah - after rereading things a few hundred times, Darth O's info about using an overload of the Load method was just what I needed.

For those having a similar problem, try the following code:

doc.Load(fileName, System.Text.Encoding.UTF8, false);

I believe that what this does is prevent HAP from trying to guess the encoding from the Byte Order Mark (BOM), which apparently overrides the type declared in the meta tag. The third argument (false) is the detectEncodingFromByteOrderMarks flag. Forcing it to use UTF-8 in this manner seems to have solved all my present issues. :)
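
For anyone who wants the whole round trip in one place, here is a minimal sketch of how the load, edit and save steps might fit together. It assumes the HtmlAgilityPack NuGet package is referenced and that the pages really are UTF-8. The class name, the method name and the "target" attribute are made up for illustration, and passing the encoding to Save as well is my own assumption rather than something confirmed in this thread.

```csharp
using System.Text;
using HtmlAgilityPack; // assumes the HtmlAgilityPack package is referenced

public static class EncodingSafeEdit
{
    // Hypothetical helper: loads a local HTML file as UTF-8, sets an attribute
    // on every <a> element, and writes the result back out as UTF-8.
    public static void AddTargetToLinks(string inputPath, string outputPath)
    {
        var doc = new HtmlDocument();

        // The third argument is detectEncodingFromByteOrderMarks. Passing false
        // tells HAP to use the encoding given here instead of sniffing for a BOM.
        doc.Load(inputPath, Encoding.UTF8, false);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var links = doc.DocumentNode.SelectNodes("//a");
        if (links != null)
        {
            foreach (var link in links)
            {
                link.SetAttributeValue("target", "_blank");
            }
        }

        // Assumption: state the output encoding explicitly too, so that reading
        // and writing agree. Encoding.UTF8 normally writes a BOM at the start of
        // the file; new UTF8Encoding(false) could be used if that is unwanted.
        doc.Save(outputPath, Encoding.UTF8);
    }
}
```

Calling EncodingSafeEdit.AddTargetToLinks("page.html", "page.out.html") on one of the affected pages, then opening the output in an editor set to UTF-8, should show whether the curly quotes and non-breaking spaces now survive.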