Problems with HTML Character References (e.g. '&#49;') with proposed fix.

Topics: Developer Forum, User Forum
Sep 19, 2013 at 3:27 AM
I downloaded the source code and wrote a unit test (appended below) that fails on HAP 1.4.6. The problem is that HTML character references (e.g. '&#49;') have their ampersands encoded, so they come out looking like this: '&amp;#49;'.

The code that does this is HtmlDocument.HtmlEncode. The Regex in this method does not ignore HTML character references. Modified source code for this method is shown below. Does this look correct, or am I misunderstanding something?
    public static string HtmlEncode(string html)
    {
        if (html == null)
        {
            throw new ArgumentNullException("html");
        }
        // replace & by &amp; but only once!
        // Bugfix: add '(#)' to the regex so that HTML character references are not corrupted.
        //  Example: '&#49;' should NOT be converted to '&amp;#49;'
        //Regex rx = new Regex("&(?!(amp;)|(lt;)|(gt;)|(quot;))", RegexOptions.IgnoreCase);
        Regex rx = new Regex("&(?!(amp;)|(#)|(lt;)|(gt;)|(quot;))", RegexOptions.IgnoreCase);
        return rx.Replace(html, "&amp;").Replace("<", "&lt;").Replace(">", "&gt;").Replace("\"", "&quot;");
    }
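
To make the difference between the two patterns concrete, here is a minimal standalone sketch that runs the same input through both. The class name EncodeDemo, the Encode helper and the sample string are made up for illustration and are not part of HAP; only the two regex patterns and the Replace chain are taken from the method above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of HtmlAgilityPack.
public static class EncodeDemo
{
    // Original lookahead: only the four named entities are protected.
    public static readonly Regex Original =
        new Regex("&(?!(amp;)|(lt;)|(gt;)|(quot;))", RegexOptions.IgnoreCase);

    // Patched lookahead: an ampersand followed by '#' is also left alone.
    public static readonly Regex Patched =
        new Regex("&(?!(amp;)|(#)|(lt;)|(gt;)|(quot;))", RegexOptions.IgnoreCase);

    // Same replacement chain as HtmlEncode, with the regex passed in.
    public static string Encode(Regex rx, string html)
    {
        return rx.Replace(html, "&amp;")
                 .Replace("<", "&lt;")
                 .Replace(">", "&gt;")
                 .Replace("\"", "&quot;");
    }

    public static void Main()
    {
        Console.WriteLine(Encode(Original, "&#49; & <b>")); // &amp;#49; &amp; &lt;b&gt;
        Console.WriteLine(Encode(Patched, "&#49; & <b>"));  // &#49; &amp; &lt;b&gt;
    }
}
```

Note that the patched pattern only protects numeric references. A named entity outside the four listed ones, such as &nbsp;, would still come out as &amp;nbsp; with either pattern.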

// THE UNIT TEST
    [Test]
    public void HtmlCharacterEntities()
    {
        string html = "<html><body>"
            + "<h1>&#65298;&#65296;&#65297;&#65299;&#12469;&#12510;&#12540;&#12461;&#12515;&#12531;&#12506;&#12540;&#12531;&#38283;&#20652;&#20013;&#65281;</h1>"
            + "<p>My first paragraph.</p>"
            + "</body></html>";

        HtmlDocument hdoc = new HtmlDocument();
        hdoc.LoadHtml(html);
        hdoc.OptionOutputAsXml = true;
        hdoc.OptionCheckSyntax = true;
        hdoc.OptionFixNestedTags = true;

        HtmlAgilityPack.HtmlNode htmlNode = hdoc.DocumentNode.SelectSingleNode("html");

        string main = htmlNode.OuterHtml;
        Assert.AreEqual(html, main);
    }
Jan 17, 2014 at 7:09 PM
A better fix would be to consider all possible HTML entities, like this:
        private const string HtmlEntitiesPattern = @"&amp;([a-z]{2,10}|#\d{1,10}|#x[0-9a-f]{1,8});";
        private static readonly Regex HtmlEntitiesPatternRegex = new Regex(HtmlEntitiesPattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);

        public static string FixDoublEntityEncoding(string document)
        {
            return HtmlEntitiesPatternRegex.Replace(document, "&$1;");
        }
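
For anyone who wants to try it, here is a small self-contained sketch of how that method might be used as a post-processing step on already-encoded output. The wrapper class DoubleEncodingDemo and the sample string are assumptions for illustration only; the pattern and the method body are copied from the snippet above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical wrapper class, not part of HtmlAgilityPack.
public static class DoubleEncodingDemo
{
    private const string HtmlEntitiesPattern = @"&amp;([a-z]{2,10}|#\d{1,10}|#x[0-9a-f]{1,8});";
    private static readonly Regex HtmlEntitiesPatternRegex =
        new Regex(HtmlEntitiesPattern, RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // Turns "&amp;name;" / "&amp;#123;" / "&amp;#x1F;" back into a single reference.
    public static string FixDoublEntityEncoding(string document)
    {
        return HtmlEntitiesPatternRegex.Replace(document, "&$1;");
    }

    public static void Main()
    {
        string encoded = "&amp;#65298; &amp;nbsp; &amp;#x4E2D; Q&amp;A";
        Console.WriteLine(FixDoublEntityEncoding(encoded));
        // &#65298; &nbsp; &#x4E2D; Q&amp;A
    }
}
```

One caveat with this approach: it cannot tell an accidental double encoding from an intentional one. Text that was deliberately escaped to display a literal entity (for example &amp;lt; meant to show "&lt;" on the page) would also be collapsed to &lt;.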