Wrong encoding, so XPath query fails

Topics: Developer Forum, Project Management Forum, User Forum
Dec 5, 2007 at 9:46 PM
Edited Dec 5, 2007 at 9:49 PM

Is there a bug in HtmlWeb? Yesterday I downloaded an eBay page
to scan with HtmlAgilityPack, like this:

The content is in "iso-8859-1" and contains the text "Verkäufer", for instance.
HtmlWeb reports "utf-8" for "StreamEncoding" and "iso-8859-1" for
"Encoding" and "DeclaredEncoding". But XPath sees the data as decoded
with "utf-8", so the query fails.

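To illustrate the mismatch described above, here is a minimal sketch using only System.Text. It does not touch HtmlWeb; the class and method names are made up for the example. It only shows what happens when ISO-8859-1 bytes are decoded with a different encoding.

```csharp
using System;
using System.Text;

public static class EncodingMismatchDemo
{
    // Hypothetical helper for illustration only.
    // Encodes the text as ISO-8859-1 bytes (as the server would send them),
    // then decodes those bytes with the encoding passed in.
    public static string DecodeLatin1BytesWith(string text, Encoding decoder)
    {
        byte[] raw = Encoding.GetEncoding("iso-8859-1").GetBytes(text);
        return decoder.GetString(raw);
    }
}
```

Called as `EncodingMismatchDemo.DecodeLatin1BytesWith("Verk\u00E4ufer", Encoding.UTF8)`, the single byte 0xE4 is not a valid UTF-8 sequence, so the decoder substitutes the replacement character U+FFFD and a comparison against "Verkäufer" no longer matches. Passing `Encoding.GetEncoding("iso-8859-1")` instead returns the original word.
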
My workaround is to download with System.Net.WebClient.DownloadString
and load with HtmlAgilityPack.HtmlDocument.LoadHtml; then it works correctly. So I can
make queries like: "td[starts-with(text(), 'Verkäufer')]"
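
A rough sketch of the decoding step behind this kind of workaround, in case the charset has to be chosen by hand. The helper name and the UTF-8 fallback are assumptions for the example, not part of HtmlAgilityPack. The raw bytes could come from something like WebClient.DownloadData, and the resulting string could then be handed to HtmlDocument.LoadHtml as in the post.

```csharp
using System;
using System.Text;

public static class HtmlDecoding
{
    // Hypothetical helper: decodes raw response bytes with the charset
    // the page declares (e.g. "iso-8859-1").
    // Assumption: fall back to UTF-8 when no charset is supplied.
    // Encoding.GetEncoding throws ArgumentException for unknown names.
    public static string DecodeHtml(byte[] raw, string declaredCharset)
    {
        Encoding encoding = string.IsNullOrEmpty(declaredCharset)
            ? Encoding.UTF8
            : Encoding.GetEncoding(declaredCharset);
        return encoding.GetString(raw);
    }
}
```

Because the string is already decoded correctly before any parser sees it, the XPath literal 'Verkäufer' compares against the same characters the page contains.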
Dec 18, 2007 at 8:17 PM
This really is a problem. As soon as I try to get the content as Unicode, I get an exception:

public static XmlDocument GetHtmlAsXml(string spyURL)
{
    MemoryStream stream = new MemoryStream();
    XmlTextWriter writer = new XmlTextWriter(stream, System.Text.Encoding.Unicode);
    HtmlWeb web = new HtmlWeb();
    web.LoadHtmlAsXml(spyURL, writer);
    XmlDocument xml = LoadFromStream(stream);
    return xml;
}

..
    XmlReader reader = XmlReader.Create(stream);
    xml.Load(reader);
..

throws this error:

    '.', hex Value 0x00, is an invalid char. Line 2, Position 1.

When I use utf-8, I can't read text containing German umlauts (e.g. äöü). Does anyone know how to fix this with XPath?
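
For comparison, here is a small sketch that uses only System.Xml and leaves HtmlWeb out entirely. It writes a cell containing an umlaut to a MemoryStream as UTF-16, flushes the writer, rewinds the stream, and reloads it. The snippet above does not show a flush or a rewind, and it is not clear from the post whether that is related to the error; the class and method names here are made up for the example.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class XmlRoundTripDemo
{
    // Hypothetical helper: writes <td>cellText</td> as UTF-16 XML into a
    // MemoryStream and loads it back into an XmlDocument.
    public static XmlDocument WriteAndReload(string cellText)
    {
        var stream = new MemoryStream();
        var settings = new XmlWriterSettings { Encoding = Encoding.Unicode };

        // Disposing the writer flushes it; XmlWriter.Create leaves the
        // underlying stream open by default (CloseOutput is false).
        using (XmlWriter writer = XmlWriter.Create(stream, settings))
        {
            writer.WriteStartDocument();
            writer.WriteElementString("td", cellText);
            writer.WriteEndDocument();
        }

        // Rewind so the reader starts at the byte order mark.
        stream.Position = 0;

        var doc = new XmlDocument();
        using (XmlReader reader = XmlReader.Create(stream))
        {
            doc.Load(reader);
        }
        return doc;
    }
}
```

With the document loaded this way, a query such as `doc.SelectSingleNode("//td[starts-with(text(), 'Verkäufer')]")` should find the cell, since the writer's encoding, the XML declaration, and the bytes in the stream all agree.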

Dec 19, 2007 at 10:38 AM
OK, I just read the other thread about this issue: we just shouldn't use HtmlWeb. Thanks :)