Resolved

Throw ArgumentException If Charset Is Invalid

description

For an HTML page like the one below, where the charset is empty or an invalid string:
<html>
...
<head> <meta http-equiv="Content-Type" content="text/html; charset="> </head> ...
</html>
 
Parsing the document throws an ArgumentException. This is not user-friendly. It would be preferable for HtmlAgility to ignore an invalid charset value.
 
The root cause is in file HtmlDocument.cs, function ReadDocumentEncoding: Encoding.GetEncoding(charset) throws an ArgumentException if the argument is not a valid charset name.
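
One way to avoid the exception is to wrap the lookup and fall back to a default encoding. The following is only a minimal sketch of that idea, not the library's actual fix; CharsetHelper and GetEncodingOrDefault are hypothetical names that do not exist in HtmlDocument.cs, and string.IsNullOrWhiteSpace assumes .NET 4.0 or later.

```csharp
using System;
using System.Text;

// Hypothetical helper for illustration; not part of the library.
public static class CharsetHelper
{
    // Returns the encoding named by charset, or fallback when the
    // name is missing, empty, or not recognized by the platform.
    public static Encoding GetEncodingOrDefault(string charset, Encoding fallback)
    {
        if (string.IsNullOrWhiteSpace(charset))
        {
            return fallback;
        }

        try
        {
            return Encoding.GetEncoding(charset);
        }
        catch (ArgumentException)
        {
            // Encoding.GetEncoding throws ArgumentException for an
            // invalid or unsupported charset name.
            return fallback;
        }
    }
}
```

With this in place, a call such as CharsetHelper.GetEncodingOrDefault("", Encoding.UTF8) would return the fallback instead of throwing, so parsing could continue with a reasonable default.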

comments

alekz wrote Sep 15, 2009 at 3:38 PM

Successfully parses encoding in:
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">

but fails to parse
<meta http-equiv="Content-Type" content="text/html; charset= ISO-8859-1">
(with the whitespace preceding the encoding name)
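
The whitespace case could be handled by trimming the value before it is passed to Encoding.GetEncoding. The sketch below assumes the charset is read from the meta tag's content attribute; ContentTypeParser and ExtractCharset are hypothetical names, and this is not how the library itself extracts the value.

```csharp
using System;

// Hypothetical helper for illustration; not part of the library.
public static class ContentTypeParser
{
    // Extracts the charset name from a Content-Type value such as
    // "text/html; charset= ISO-8859-1". Returns null when no charset
    // is present or when it is empty.
    public static string ExtractCharset(string content)
    {
        if (content == null)
        {
            return null;
        }

        const string key = "charset=";
        int index = content.IndexOf(key, StringComparison.OrdinalIgnoreCase);
        if (index < 0)
        {
            return null;
        }

        string value = content.Substring(index + key.Length);

        // Drop any parameters that follow the charset.
        int end = value.IndexOf(';');
        if (end >= 0)
        {
            value = value.Substring(0, end);
        }

        // Remove surrounding whitespace and optional quotes.
        value = value.Trim().Trim('"', '\'');
        return value.Length == 0 ? null : value;
    }
}
```

Returning null for an empty value lets the caller treat "charset=" and a missing charset the same way.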

DarthObiwan wrote Jan 1, 2010 at 3:37 AM

This appears to be partially fixed in the current build. However, I found that if the charset is set to a string that is not a valid encoding, it still throws the exception. I have added a default of utf-8 and a property that allows the default to be changed if needed.
Usage would be:
HtmlDocument.DefaultEncodingCharSet = "ISO-8859-1";
var doc = new HtmlDocument();
doc.Load("somefile.html");