
[patch] HtmlEntity.DeEntitize fails on numeric entities > 0xFFFF


HtmlEntity.DeEntitize fails to decode numeric HTML escapes whose value is greater than 65535 (0xFFFF), which causes trouble with code points in the supplementary planes of Unicode. For example, the escape &#128264; is decoded as the string "&##128264;" instead of the character U+1F508 (which is encoded as the surrogate pair 0xD83D 0xDD08 in UTF-16).

The problem is the use of System.Convert.ToChar(Int32): for values above 65535, it throws an OverflowException. Since .NET 2.0, the static method System.Char.ConvertFromUtf32(Int32) has been available; it takes a numeric Unicode code point and returns a string holding the correct UTF-16 representation (either one or two chars long). Replacing the calls to Convert.ToChar with Char.ConvertFromUtf32 fixes the problem, since StringBuilder.Append accepts both chars and strings.
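
To illustrate the difference outside of HtmlAgilityPack, here is a minimal sketch of decoding the text between "&#" and ";" with Char.ConvertFromUtf32. It is not the attached patch; the class NumericEntitySketch and the method TryDecode are hypothetical names made up for this example, and the sketch assumes the input is non-null.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for illustration only; not part of HtmlAgilityPack.
public static class NumericEntitySketch
{
    // body is the text between "&#" and ";", e.g. "128264" or "x1F508".
    // Returns false if body is not a number or not a valid Unicode code point.
    public static bool TryDecode(string body, out string result)
    {
        result = string.Empty;
        int code;
        bool parsed;

        if (body.StartsWith("x", StringComparison.OrdinalIgnoreCase))
        {
            // Hexadecimal form: parse the digits after the leading 'x'.
            parsed = int.TryParse(body.Substring(1), NumberStyles.AllowHexSpecifier,
                                  CultureInfo.InvariantCulture, out code);
        }
        else
        {
            // Decimal form: digits only.
            parsed = int.TryParse(body, NumberStyles.None,
                                  CultureInfo.InvariantCulture, out code);
        }

        if (!parsed)
        {
            return false;
        }

        try
        {
            // One char for code points up to 0xFFFF, a surrogate pair above that.
            result = char.ConvertFromUtf32(code);
            return true;
        }
        catch (ArgumentOutOfRangeException)
        {
            // Thrown for surrogate code points (0xD800 to 0xDFFF) and for
            // values outside the range 0 to 0x10FFFF.
            return false;
        }
    }
}
```

With this sketch, TryDecode("128264", out s) yields a two-char string containing 0xD83D 0xDD08, whereas Convert.ToChar(128264) throws an OverflowException. Note that Char.ConvertFromUtf32 can itself throw for invalid code points, so a caller still needs a fallback path for malformed escapes.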

A patch is attached.



RavuAlHemio wrote Dec 26, 2014 at 6:00 PM

I apologize for the duplicate issue -- it appears my browser has submitted it twice...

wrote Jan 1, 2015 at 2:47 PM