HtmlEntity.DeEntitize throwing KeyNotFoundException

description

We are using HtmlAgilityPack version 1.4.9 from NuGet and are hitting an issue when calling HtmlEntity.DeEntitize on text containing the '&' character.

We implemented a method that strips all HTML tags from an input string and returns plain text only. The method uses HtmlAgilityPack to obtain the text of every node encountered in the input HTML string; for example, on text nodes we read the Node.Text string and then call HtmlEntity.DeEntitize to decode the entities in it. This works perfectly except for HTML strings containing the '&' character.

For example, consider the HTML input below:
<p class="full">Red was the theme colour of the day; 31st August, 2012 as L'Oréal West Africa’s Human resources Department  launched one of the group’s great tools  L’Oréal & Me tool; My learning.</p>

When calling HtmlEntity.DeEntitize with the string "Red was the theme colour of the day; 31st August, 2012 as L'Oréal West Africa’s Human resources Department launched one of the group’s great tools L’Oréal & Me tool; My learning." the exception below is thrown every time:

at System.Collections.Generic.Dictionary`2[System.String,System.Int32].get_Item (System.String key) [0x000a2] in /Developer/MonoTouch/Source/mono/mcs/class/corlib/System.Collections.Generic/Dictionary.cs:148
at HtmlAgilityPack.HtmlEntity.DeEntitize (System.String text) [0x00000] in <filename unknown>:0
at LOreal.Mynews.Portable.Tools.HtmlHelper.ConvertTo (HtmlAgilityPack.HtmlNode node, System.IO.TextWriter outText) [0x00096] in /Users/Kamel/Documents/Mynews/Sources/Mobile/Portable/Tools/HtmlHelper.cs:196

But if we remove the '&' character (using string.Replace("&", "")), HtmlEntity.DeEntitize works perfectly.

We are using HtmlAgilityPack in a PCL targeting Xamarin, .NET 4.5, W8, WP8, etc.

Any help, please?

Samir.

comments

MoCoJohn wrote Feb 19, 2015 at 10:10 PM

I believe this is caused not simply by the '&' but by its proximity to the ';' that follows it, as in "& Me tool;".

We have run into a similar issue with "Peel & stick;".

aloker wrote Apr 20, 2015 at 11:40 AM

As far as I can see, this is due to incorrect use of the dictionaries inside HtmlEntity. The code uses the indexer and checks the result for null, but Dictionary<K,V> does not return null when the key does not exist; it throws a KeyNotFoundException. TryGetValue should be used instead to check for optional values. A non-generic dictionary was probably used in earlier versions of HtmlEntity.

HtmlEntity.cs, lines 645ff
int code;
object o = _entityValue[entity.ToString()];
if (o == null)
{
    // nope
    sb.Append("&" + entity + ";");
}
else
{
    // we found one
    code = (int) o;
    sb.Append(Convert.ToChar(code));
}
Should be:
int code;
if(!_entityValue.TryGetValue(entity.ToString(), out code))
{
    // nope
    sb.Append("&" + entity + ";");
}
else
{
    // we found one
    sb.Append(Convert.ToChar(code));
}
HtmlEntity.cs, lines 774ff
string entity = _entityName[code] as string;
if ((entity == null) || (!useNames))
{
    sb.Append("&#" + code + ";");
}
else
{
    sb.Append("&" + entity + ";");
}
Should be:
string entity;
if(!useNames || !_entityName.TryGetValue(code, out entity))
{
    sb.Append("&#" + code + ";");
}
else
{
    sb.Append("&" + entity + ";");
}
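
The difference in lookup behaviour described above can be reproduced without HtmlAgilityPack. The following is only a minimal sketch: the class name LookupDemo, its method names, and the one-entry tables are made up for illustration and are not taken from HtmlEntity.cs. It assumes only the standard Hashtable and Dictionary<TKey,TValue> types.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical demo class, not part of HtmlAgilityPack.
public static class LookupDemo
{
    // Non-generic Hashtable: the indexer returns null for a missing key,
    // so a "check for null" pattern works.
    public static bool HashtableReturnsNull()
    {
        var table = new Hashtable { { "amp", 38 } };
        object o = table["apos"];
        return o == null;
    }

    // Generic Dictionary: the indexer throws for a missing key,
    // so the null check is never reached.
    public static bool DictionaryIndexerThrows()
    {
        var dict = new Dictionary<string, int> { { "amp", 38 } };
        try
        {
            object o = dict["apos"];
            return false;
        }
        catch (KeyNotFoundException)
        {
            return true;
        }
    }

    // TryGetValue reports a missing key through its return value.
    public static bool TryGetValueReportsMissing()
    {
        var dict = new Dictionary<string, int> { { "amp", 38 } };
        int code;
        return !dict.TryGetValue("apos", out code);
    }

    public static void Main()
    {
        Console.WriteLine(HashtableReturnsNull());
        Console.WriteLine(DictionaryIndexerThrows());
        Console.WriteLine(TryGetValueReportsMissing());
    }
}
```

If the tables in HtmlEntity were at some point switched from a Hashtable to a generic Dictionary without changing the lookup code, this is the behaviour change one would expect to see.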

RoryPS wrote Nov 11, 2015 at 11:29 AM

Minimal replication:

HtmlAgilityPack.HtmlEntity.DeEntitize("&apos;")

throws KeyNotFoundException

taj707 wrote Feb 2 at 4:40 PM

Are there any plans to fix this in a future release? This issue causes an exception when parsing some valid HTML entities such as &apos; (which was invalid in HTML 4 but is now valid in HTML 5).

This is a significant issue since it causes failures on valid HTML. At the very least it shouldn't throw an exception, and it should be a quick fix, as others have noted.

Thank you

taj707 wrote Feb 2 at 4:42 PM

The entity I tried to mention in my previous post is "apos".