Stripping harmful HTML from user input, but allowing other HTML?

Topics: Developer Forum, Project Management Forum, User Forum
Mar 19, 2008 at 7:15 PM
Has anyone used this tool for such a task?

I'd like to allow users (on a comment form, for example) to put some HTML, like links, some formatting, or images. But I naturally don't want to allow anything else, and I want to make sure I don't open myself to XSS. Has anyone used this tool to pull off such a feat?

Thanks in advance for any advice.
May 29, 2008 at 4:42 PM
I needed to do the same thing, but couldn't find any example code, so here's mine - it's not perfect, but it works well enough for my purposes...




public string ScrubHTML(string html)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(html);

    //Remove potentially harmful elements
    HtmlNodeCollection nc = doc.DocumentNode.SelectNodes("//script|//link|//iframe|//frameset|//frame|//applet|//object");
    if (nc != null)
    {
        foreach (HtmlNode node in nc)
        {
            node.ParentNode.RemoveChild(node, false);

        }
    }           

    //remove hrefs to java/j/vbscript URLs
    nc = doc.DocumentNode.SelectNodes("//a[starts-with(@href, 'javascript')]|//a[starts-with(@href, 'jscript')]|//a[starts-with(@href, 'vbscript')]");
    if (nc != null)
    {

        foreach (HtmlNode node in nc)
        {
            node.SetAttributeValue("href", "protected");
        }
    }



    //remove img with refs to java/j/vbscript URLs
    nc = doc.DocumentNode.SelectNodes("//img[starts-with(@src, 'javascript')]|//img[starts-with(@src, 'jscript')]|//img[starts-with(@src, 'vbscript')]");
    if (nc != null)
    {
        foreach (HtmlNode node in nc)
        {
            node.SetAttributeValue("src", "protected");
        }
    }

    //remove on<Event> handlers from all tags
    //(the double-click attribute is spelled "ondblclick" in HTML)
    nc = doc.DocumentNode.SelectNodes("//*[@onclick or @onmouseover or @onfocus or @onblur or @onmouseout or @ondblclick or @onload or @onunload]");
    if (nc != null)
    {
        foreach (HtmlNode node in nc)
        {
            node.Attributes.Remove("onFocus");
            node.Attributes.Remove("onBlur");
            node.Attributes.Remove("onClick");
            node.Attributes.Remove("onMouseOver");
            node.Attributes.Remove("onMouseOut");
            node.Attributes.Remove("onDblClick");
            node.Attributes.Remove("onLoad");
            node.Attributes.Remove("onUnload");
        }
    }


   return doc.DocumentNode.WriteTo();
}

May 29, 2008 at 9:55 PM
Cool - thanks for that.
Jun 30, 2008 at 2:44 PM
Edited Jun 30, 2008 at 4:00 PM
This piece of code works excellently.
Jun 30, 2008 at 3:57 PM
Have a look into this thread... it might help you
http://www.codeplex.com/htmlagilitypack/Thread/View.aspx?ThreadId=16092
Oct 14, 2009 at 4:43 AM

Great example (almost), Thanks!  A few ways to make it stronger that I saw, though:

1) Use case-insensitive search when looking for links with "javascript:", "vbscript:", "jscript:".  For example, the original example would not remove the HTML:

<a href="JAVAscRipt:alert('hi')">click me</a>

2) Remove any style attributes that contain an expression rule.  Internet Explorer evaluates the CSS "expression" rule as script.  For example, the following would produce a message box:

<div style="width: expression(alert('hi'));">bad code</div>

3) Also remove <embed> tags

I honestly have no idea why "expression" has not been removed from IE - major flaw in my opinion. (Try the div example in Internet Explorer and you'll see why - even IE8.)  I just wish there was an easier/standard way to clean up HTML input from a user.

 

Here's the code updated with these improvements.  Let me know if you see anything wrong:

 

    public string ScrubHTML(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        //Remove potentially harmful elements
        HtmlNodeCollection nc = doc.DocumentNode.SelectNodes("//script|//link|//iframe|//frameset|//frame|//applet|//object|//embed");
        if (nc != null)
        {
            foreach (HtmlNode node in nc)
            {
                node.ParentNode.RemoveChild(node, false);

            }
        }

        //remove hrefs to java/j/vbscript URLs
        nc = doc.DocumentNode.SelectNodes("//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'javascript')]|//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'jscript')]|//a[starts-with(translate(@href, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'vbscript')]");
        if (nc != null)
        {

            foreach (HtmlNode node in nc)
            {
                node.SetAttributeValue("href", "#");
            }
        }


        //remove img with refs to java/j/vbscript URLs
        nc = doc.DocumentNode.SelectNodes("//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'javascript')]|//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'jscript')]|//img[starts-with(translate(@src, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'vbscript')]");
        if (nc != null)
        {
            foreach (HtmlNode node in nc)
            {
                node.SetAttributeValue("src", "#");
            }
        }

        //remove on<Event> handlers from all tags
        //(the double-click attribute is spelled "ondblclick" in HTML)
        nc = doc.DocumentNode.SelectNodes("//*[@onclick or @onmouseover or @onfocus or @onblur or @onmouseout or @ondblclick or @onload or @onunload]");
        if (nc != null)
        {
            foreach (HtmlNode node in nc)
            {
                node.Attributes.Remove("onFocus");
                node.Attributes.Remove("onBlur");
                node.Attributes.Remove("onClick");
                node.Attributes.Remove("onMouseOver");
                node.Attributes.Remove("onMouseOut");
                node.Attributes.Remove("onDblClick");
                node.Attributes.Remove("onLoad");
                node.Attributes.Remove("onUnload");
            }
        }

        // remove any style attributes that contain the word expression (IE evaluates this as script)
        nc = doc.DocumentNode.SelectNodes("//*[contains(translate(@style, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'), 'expression')]");
        if (nc != null)
        {
            foreach (HtmlNode node in nc)
            {
                node.Attributes.Remove("style");
            }
        }

        return doc.DocumentNode.WriteTo();
    } 

 

 

 

Jun 29, 2010 at 10:44 PM

Because this thread comes up first in the Google search results for HtmlAgilityPack sanitize html, I figured I would add another, more restrictive version. This version is based on a white list and only allows through the elements and attributes that you specifically allow.

 

It also uses the new LINQ syntax (to contrast with the XPath methods above).

 

Any comments or suggestions for improvement are welcome:

        // Requires: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
        private static string SanitizeHtml(string html)
        {
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(html);


            string[] elementWhitelist = {
                                            "a", "u", "b", "i", "br", "h1", "h2", "h3", "h4", "h5", "h6", "span",
                                            "div", "blockquote", "em", "sub", "sup", "s", "font", "ul", "li", "ol", "p", "#text"
                                        };

            string[] attributeWhiteList = { "class", "style", "src", "href", "color", "size" };

            IList<HtmlNode> hnc = doc.DocumentNode.DescendantNodes().ToList();



            //remove non-white list nodes
            for (int i = hnc.Count - 1; i >= 0; i--)
            {
                HtmlNode htmlNode = hnc[i];
                if (!elementWhitelist.Contains(htmlNode.Name.ToLower()))
                {
                    htmlNode.Remove();
                    continue;
                }

                for (int att = htmlNode.Attributes.Count - 1; att >= 0; att--)
                {
                    HtmlAttribute attribute = htmlNode.Attributes[att];
                    string attributeName = attribute.Name.ToLower();

                    //remove any attribute that is not in the white list (such as event handlers)
                    if (!attributeWhiteList.Contains(attributeName))
                    {
                        attribute.Remove();
                        //nothing more to check on an attribute that is already gone
                        continue;
                    }

                    //strip any "style" attributes that contain the word "expression"
                    if (attributeName == "style" && attribute.Value.ToLower().Contains("expression"))
                    {
                        attribute.Value = string.Empty;
                    }

                    if (attributeName == "src" || attributeName == "href")
                    {
                        //strip if the link starts with anything other than http (such as jscript, javascript, vbscript, mailto, ftp, etc...)
                        if (!attribute.Value.StartsWith("http")) attribute.Value = "#";
                    }
                }
            }
            return doc.DocumentNode.WriteTo();
        }
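
One caveat on the src/href check above: StartsWith("http") is case-sensitive, and it also lets through any value that merely begins with those four letters. A stricter variant, sketched here on the assumption that only absolute http and https URLs should survive, could parse the value with System.Uri and compare the scheme. The UrlCheck class and IsHttpUrl method are names made up for this illustration, not part of HtmlAgilityPack.

```csharp
using System;

public static class UrlCheck
{
    // Hypothetical helper: true only for absolute http or https URLs.
    public static bool IsHttpUrl(string value)
    {
        if (string.IsNullOrEmpty(value)) return false;

        Uri uri;
        if (!Uri.TryCreate(value.Trim(), UriKind.Absolute, out uri)) return false;

        // Uri.Scheme is normalized to lower case, so "HTTP://..." also passes.
        return uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps;
    }

    public static void Main()
    {
        Console.WriteLine(IsHttpUrl("http://example.com/a.png"));   // True
        Console.WriteLine(IsHttpUrl("JAVAscRipt:alert('hi')"));     // False
        Console.WriteLine(IsHttpUrl("mailto:me@aaa.com"));          // False
    }
}
```

With a helper like this, the test inside the attribute loop would read: if (!UrlCheck.IsHttpUrl(attribute.Value)) attribute.Value = "#";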

Nov 16, 2010 at 4:22 PM

Would anyone know how to modify the script above so it will work with .NET 2?  It doesn't have the ToList method :/
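
One possible way around it, as an untested sketch: .NET 2.0 has no LINQ, so both ToList() and the Contains() calls on the whitelist arrays are unavailable. The List<T> constructor and Array.IndexOf can stand in for them. This assumes the HtmlAgilityPack build in use exposes DescendantNodes() as an IEnumerable<HtmlNode>; the SanitizerNet2 class and its method names are invented for this example.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class SanitizerNet2
{
    // Replaces doc.DocumentNode.DescendantNodes().ToList():
    // the List<T> constructor copies any IEnumerable<T> without LINQ.
    public static List<HtmlNode> SnapshotDescendants(HtmlDocument doc)
    {
        return new List<HtmlNode>(doc.DocumentNode.DescendantNodes());
    }

    // Replaces whitelist.Contains(name.ToLower()):
    // Array.IndexOf returns -1 when the name is not in the array.
    public static bool IsListed(string[] whitelist, string name)
    {
        return Array.IndexOf(whitelist, name.ToLower()) >= 0;
    }
}
```

The rest of SanitizeHtml should not need changes: the backwards for loops only rely on Count and the indexer, which List<HtmlNode> provides.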

Sep 6, 2011 at 11:08 PM

I have a problem where the input text contains email addresses in brackets:  me<me@aaa.com>

(Yahoo mail will put this in replies.)

The browser treats <me@aaa.com> as an element and does not display it.  I'm trying to figure out how to encode unknown HTML tags, such as this, so that they are displayed.

For example, given this input string:

 @"me<me@aaa.com>;<br>To: <bbb@yahoo.com>;<br>
    a -> b <table style='font-family:times new roman;font-size:14pt;color:blue'><tr><td>hello world</td></tr></table>
    <table style='font-family:times new roman'><tr><td>hello world2</td></tr></table>
    <table ><tr><td>hello world3</td></tr></table>"

The PatrickBurrows solution doesn't work because it removes everything after the first aaa.com; apparently it thinks < me@aaa.com > is an opening tag.  (And it also removes tables.)

The joelthedrummer solution seems better in general (less risk of removing text that you'd want to see), but the < me@aaa.com > still does not display.  (Unmatched tags, such as the ->, display OK.)

Does anyone know how I can detect non-html < > "elements" and encode them?

Thanks!

Sep 7, 2011 at 2:10 PM

I'd be curious to see how you tried to modify SanitizeHtml to meet your needs. For instance, did you add the table elements to the white list? Did you try to match an element that was really an email address?

Here is a version of SanitizeHtml which meets your needs. I added the table tags to the white list and checked for an @ sign in the element name before stripping it. It would be even better to replace the @ sign check with your favorite email address matching RegEx (as the comment says):

        private static string SanitizeHtml(string html)
        {
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(html);


            string[] elementWhitelist = {
                                            "a", "u", "b", "i", "br", "h1", "h2", "h3", "h4", "h5", "h6", "span",
                                            "div", "blockquote", "em", "sub", "sup", "s", "font", "ul", "li", "ol", "p", "#text",
                                            "table", "tr", "td", "th"
                                        };

            string[] attributeWhiteList = { "class", "style", "src", "href", "color", "size" };

            IList<HtmlNode> hnc = doc.DocumentNode.DescendantNodes().ToList();



            //remove non-white list nodes
            for (int i = hnc.Count - 1; i >= 0; i--)
            {
                HtmlNode htmlNode = hnc[i];
                if (!elementWhitelist.Contains(htmlNode.Name.ToLower()))
                {
                    //note: replace this with your favorite email address matching regex
                    if(!htmlNode.Name.Contains("@")) htmlNode.Remove();
                    continue;
                }

                for (int att = htmlNode.Attributes.Count - 1; att >= 0; att--)
                {
                    HtmlAttribute attribute = htmlNode.Attributes[att];
                    string attributeName = attribute.Name.ToLower();

                    //remove any attribute that is not in the white list (such as event handlers)
                    if (!attributeWhiteList.Contains(attributeName))
                    {
                        attribute.Remove();
                        //nothing more to check on an attribute that is already gone
                        continue;
                    }

                    //strip any "style" attributes that contain the word "expression"
                    if (attributeName == "style" && attribute.Value.ToLower().Contains("expression"))
                    {
                        attribute.Value = string.Empty;
                    }

                    if (attributeName == "src" || attributeName == "href")
                    {
                        //strip if the link starts with anything other than http (such as jscript, javascript, vbscript, mailto, ftp, etc...)
                        if (!attribute.Value.StartsWith("http")) attribute.Value = "#";
                    }
                }
            }
            return doc.DocumentNode.WriteTo();
        }

Sep 7, 2011 at 5:59 PM
Edited Sep 7, 2011 at 6:00 PM

Thanks PatrickBurrows!  (I just mentioned the tables in case others wanted to add it - not that it was a reason to dismiss your solution at all.)   This is a very good solution.  
It just depends on the situation and whether a person wants to approach it from the blacklist or the whitelist angle.  I agree that your solution is definitely the safest! 
I'm parsing emails and just got worried that I might strip something unintentional if there's some bizarre tag.  I should've clarified where my opinion was coming from. :)

That looks like a good fix for the email address!  However, the angle brackets would also need to be encoded for them to be visible in the browser.  Is that possible?

The other problem I came across with the agility pack is if the text had a legitimate < in it; for example, inside a table cell.  The agility pack seemed to get confused. 
For example, a cell containing < hello world appeared as < hello=""

I didn't have a lot of time to spend, but I got concerned about this and couldn't see how to fix it.   Thanks again!

Sep 7, 2011 at 7:10 PM

Yeah, the problem here is that the email addresses aren't actually HTML. They just happen to use angle brackets too. HtmlAgilityPack tries to parse them using the rules of HTML (expecting closing tags and the like). It does a good job, but when it comes to replacing a malformed node like that, it just isn't going to work.

Pre-parsing the Email Addresses using a RegEx (and changing the angle brackets to &lt; and &gt; respectively) and then parsing with HtmlAgilityPack would be safest.
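
A rough sketch of that pre-parsing step, using only System.Text.RegularExpressions. The pattern is deliberately loose and is an assumption for illustration, not a full email matcher: it looks for "<", something@something.something with no whitespace or angle brackets inside, then ">". The EmailBracketEncoder class and its method name are made up here.

```csharp
using System;
using System.Text.RegularExpressions;

public static class EmailBracketEncoder
{
    // Loose, illustrative pattern: <local@domain.tld> with no whitespace,
    // angle brackets or extra "@" inside. Swap in a stricter regex if needed.
    private static readonly Regex BracketedEmail =
        new Regex(@"<([^\s<>@]+@[^\s<>@]+\.[^\s<>@]+)>", RegexOptions.Compiled);

    // Encodes only the brackets around things that look like email addresses,
    // leaving real tags such as <br> untouched for the HTML parser.
    public static string EncodeBracketedEmails(string input)
    {
        return BracketedEmail.Replace(input, "&lt;$1&gt;");
    }

    public static void Main()
    {
        string raw = "me<me@aaa.com>;<br>To: <bbb@yahoo.com>;<br>";
        Console.WriteLine(EncodeBracketedEmails(raw));
        // prints: me&lt;me@aaa.com&gt;;<br>To: &lt;bbb@yahoo.com&gt;;<br>
    }
}
```

The output of EncodeBracketedEmails would then be passed to SanitizeHtml in place of the raw input. A tag written without whitespace that happens to contain an @ and a dot could still be matched by this pattern, so it is worth testing against real mail samples first.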

Sep 20, 2011 at 6:39 AM

The code is working absolutely fine. Thank you for posting!

 


Nov 8, 2013 at 10:04 AM
Thanks PatrickBurrows, your code is helpful and it works. Thanks again.

