How to make a whitelist with html agility pack?

Jun 10, 2010 at 11:24 PM

Hi

 

How would I make a whitelist and use Html Agility Pack to remove all elements not in the whitelist?

Thanks

Jun 15, 2010 at 1:05 PM

Something like this.

void Main()
{
    HtmlDocument doc = new HtmlDocument();
    doc.Load("someFile.html"); // load html
    var whiteList = new List<string> { "a", "b", "img" }; // fill whitelist tags
    RemoveNotInWhiteList(doc.DocumentNode, whiteList);
}

public void RemoveNotInWhiteList(HtmlNode pNode, List<string> pWhiteList)
{
    if (!pWhiteList.Contains(pNode.Name))
    {
        pNode.Remove();
        return;
    }

    if (pNode.ChildNodes != null && pNode.ChildNodes.Count > 0)
    {
        var children = pNode.ChildNodes.ToList();
        foreach (var child in children)
        {
            RemoveNotInWhiteList(child, pWhiteList);
        }
    }
}

Jun 15, 2010 at 3:53 PM
Edited Jun 15, 2010 at 3:54 PM

How about attributes, like "href"? How would I deal with those? Especially since people can stick <script> tags in the link.

Jun 16, 2010 at 5:51 AM

?

Same idea. Attributes don't nest, so you don't need to recurse.

 

// ToList() (System.Linq) takes a snapshot so attributes can be removed while looping
foreach (HtmlAttribute att in pNode.Attributes.ToList())
{
   if(!pWhiteList.Contains(att.Name))
       att.Remove();
}

 

 

Jun 16, 2010 at 6:48 PM

I am not following. Should I have two loops, one for tags and one for their attributes?

Jun 18, 2010 at 7:54 AM

void Main()
{
    HtmlDocument doc = new HtmlDocument();
    doc.Load("someFile.html"); // load html
    var whiteList = new List<string> { "a", "b", "img" }; // fill whitelist tags
    RemoveNotInWhiteList(doc.DocumentNode, whiteList);
}

public void RemoveNotInWhiteList(HtmlNode pNode, List<string> pWhiteList)
{
    // drop the node itself (and everything inside it) if its tag is not allowed
    if (!pWhiteList.Contains(pNode.Name))
    {
        pNode.Remove();
        return;
    }

    if (pNode.Attributes != null && pNode.Attributes.Count > 0)
    {
        // ToList() (System.Linq) takes a snapshot so attributes can be removed while looping
        foreach (HtmlAttribute att in pNode.Attributes.ToList())
        {
            if (!pWhiteList.Contains(att.Name))
                att.Remove();
        }
    }

    if (pNode.ChildNodes != null && pNode.ChildNodes.Count > 0)
    {
        var children = pNode.ChildNodes.ToList();
        foreach (var child in children)
        {
            RemoveNotInWhiteList(child, pWhiteList);
        }
    }
}




Developer
Jun 18, 2010 at 12:18 PM

.NET 3.5 way with LINQ ;)

public void RemoveNotInWhiteList(HtmlNode pNode, IEnumerable<string> pWhiteList)
{
    if (!pWhiteList.Contains(pNode.Name))
    {
        pNode.Remove();
        return;
    }

    pNode.Attributes
         .Where(att => !pWhiteList.Contains(att.Name))
         .ToList()
         .ForEach(att => att.Remove());            

    pNode.ChildNodes
         .ToList()
         .ForEach(att => RemoveNotInWhiteList(att, pWhiteList));
}

In HAP, both Attributes and ChildNodes are never null, so it is fine to operate on them without null checks.

Jun 18, 2010 at 5:23 PM
DarthObiwan wrote:

.NET 3.5 way with LINQ ;)

 

public void RemoveNotInWhiteList(HtmlNode pNode, IEnumerable<string> pWhiteList)
{
    if (!pWhiteList.Contains(pNode.Name))
    {
        pNode.Remove();
        return;
    }

    pNode.Attributes
         .Where(att => !pWhiteList.Contains(att.Name))
         .ToList()
         .ForEach(att => att.Remove());            

    pNode.ChildNodes
         .ToList()
         .ForEach(att => RemoveNotInWhiteList(att, pWhiteList));
}

 

In HAP both Attributes and Childnodes are never null so it is fine to do operations on them without checking.

Hi what is HAP?

 

I like the LINQ way :) I don't get what is going on, though. I am not sure what the if statement is doing or what it actually removes. It looks like this method calls itself for each node. Is that correct?

 

Developer
Jun 18, 2010 at 5:30 PM
HAP = Html Agility Pack.

Yes, the if statement removes the node together with all of its descendants. Any node that passes the check then sends its child nodes through the same method.
Jun 18, 2010 at 10:04 PM
chobo2 wrote:

I still don't get what the first if statement does.

 

I have this html

 

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
	<head>
		<title></title>
	</head>
	<body>
	 <script type="text/jscript">bad code here</script>
     <p>Hello I am not on white list yet</p>
	</body>
</html>

I have this code.

 

static void Main(string[] args)
{
    HtmlDocument doc = new HtmlDocument();
    string dir = @"c:\ConsoleApplication1\ConsoleApplication1\HTMLPage1.htm";
    doc.Load(dir); // load html
    var whiteList = new List<string> { "a", "b", "img" }; // fill whitelist tags
    RemoveNotInWhiteList(doc.DocumentNode, whiteList);
}

public static void RemoveNotInWhiteList(HtmlNode pNode, IEnumerable<string> pWhiteList)
{
    if (!pWhiteList.Contains(pNode.Name))
    {
        pNode.Remove();
        return;
    }

    pNode.Attributes
         .Where(att => !pWhiteList.Contains(att.Name))
         .ToList()
         .ForEach(att => att.Remove());

    pNode.ChildNodes
         .ToList()
         .ForEach(att => RemoveNotInWhiteList(att, pWhiteList));
}

So I tried it, and it only goes into RemoveNotInWhiteList once and that's it.

pNode.Name = #document, so it goes into that node, and then I guess it removes it and calls it a day.

Jun 18, 2010 at 11:11 PM
Edited Jun 19, 2010 at 1:20 AM

Ok this is what I have so far.

static void Main(string[] args)
{
    HtmlDocument doc = new HtmlDocument();
    string dir = @"Path";
    doc.Load(dir); // load html
    var whiteList = new List<string> { "a", "img", "p", "#text" }; // fill whitelist tags
    var attrWhiteList = new List<string> { "name", "herf" };
    RemoveNotInWhiteList(doc.DocumentNode, whiteList, attrWhiteList);
}

public static void RemoveNotInWhiteList(HtmlNode pNode, List<string> pWhiteList, List<string> attrWhiteList)
{
    // remove all attributes not on white list
    foreach (var item in pNode.ChildNodes)
    {
        item.Attributes.Where(u => attrWhiteList.Contains(u.Name) == false).ToList().ForEach(u => Test(u));
    }

    // remove all html and their innerText and attributes if not on whitelist.
    pNode.ChildNodes.Where(u => pWhiteList.Contains(u.Name) == false).ToList().ForEach(u => u.Remove());

    Console.WriteLine(pNode.OuterHtml);
}

private static void Test(HtmlAttribute u)
{
    u.Value = u.Value.ToLower().Replace("javascript", "");
    u.Remove();
}

So, a couple of things. First, I put #text in the whitelist because it always seems to be "\r\n". Can it ever be something different (i.e. can it be something bad)?

 

Now the problems I am facing

 

Also how would you stop this?

<a href="javascript:(function(){
alert('hello');
})()">Hello</a>


I want to allow hrefs, but a person can do the above and it will work. How can you use Html Agility Pack to stop something like that?

 

 

Say I have nested tags like this, and bold is not on my whitelist:

<p>hi <b>I am </b> bold</p>

It will not remove the "b" tag, so nested tags get through. How can I stop this?

Entire html with comments

 

<script type="text/jscript">bad code here</script>  // removes
<p id="Hello">Hello I am not on white list yet</p>  //allowed but id is removed
<a href="javascript:alert('hi');">Bad</a>   //removes javascript rendering this useless
<p>Hi I am <b>Bold</b></p>  // fails can't handle nested bold 
<p>Hi I am <b><big>Bold</big></b></p> // fails can't handle nested bold nest with big

 

Jun 19, 2010 at 1:34 AM

Ahh, now I get why you guys were using recursion. It all makes sense!

So my only question is: what is #document? (This confused me, as it was not in the whitelist, so the root node never made it past the first if statement.)

Is it OK to put #document and #text in the whitelist? I am not sure what they are, so it is hard to say.

How would you stop the javascript part? You can see what I came up with. Finally, would you remove the whole node or try to keep its inner text?

So if you have <b> not on white list</b>, would you remove it entirely or just keep "not on white list"?

class Program
{
    static void Main(string[] args)
    {
        HtmlDocument doc = new HtmlDocument();
        string dir = @"Path";
        doc.Load(dir); // load html
        var whiteList = new List<string> { "a", "img", "p", "#text", "#document" }; // fill whitelist tags
        var attrWhiteList = new List<string> { "name", "href" };
        RemoveNotInWhiteList(doc.DocumentNode, whiteList, attrWhiteList);

        Console.WriteLine(doc.DocumentNode.OuterHtml);
    }

    public static void RemoveNotInWhiteList(HtmlNode pNode, List<string> pWhiteList, List<string> attrWhiteList)
    {
        // remove the node and everything inside it if its tag is not on the white list
        if (pWhiteList.Contains(pNode.Name) == false)
        {
            pNode.Remove();
            return;
        }

        // remove all attributes of this node that are not on the white list
        pNode.Attributes.Where(u => attrWhiteList.Contains(u.Name) == false).ToList().ForEach(u => Test(u));

        // recurse into every child (not only the disallowed ones) so nested tags are checked too
        pNode.ChildNodes.ToList().ForEach(u => RemoveNotInWhiteList(u, pWhiteList, attrWhiteList));
    }

    private static void Test(HtmlAttribute attr)
    {
        // the attribute is removed right after, so the Replace below has no lasting effect
        attr.Value = attr.Value.ToLower().Replace("javascript", "");
        attr.Remove();
    }
}
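
On the #document and #text question: one way to avoid whitelisting them is to check the node type and apply the tag whitelist only to element nodes, so the document root and text nodes pass straight through. A minimal sketch, assuming separate tag and attribute whitelists; the class name WhiteListSketch and the method Clean are made up for illustration and are not part of Html Agility Pack:

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class WhiteListSketch
{
    // Hypothetical helper: removes disallowed elements (with their content)
    // and disallowed attributes, starting at the given node.
    public static void Clean(HtmlNode node, ICollection<string> tags, ICollection<string> attributes)
    {
        // Only element nodes are checked against the tag whitelist;
        // #document, #text and comment nodes fall through to the recursion.
        if (node.NodeType == HtmlNodeType.Element)
        {
            if (!tags.Contains(node.Name))
            {
                node.Remove(); // drops the element together with its descendants
                return;
            }

            // Snapshot the attributes so they can be removed while looping.
            foreach (var attribute in node.Attributes.ToList())
            {
                if (!attributes.Contains(attribute.Name))
                    attribute.Remove();
            }
        }

        // Snapshot the children because the recursion may remove some of them.
        foreach (var child in node.ChildNodes.ToList())
            Clean(child, tags, attributes);
    }
}
```

It would be called as WhiteListSketch.Clean(doc.DocumentNode, tags, attributes). Comment nodes are left untouched in this sketch, and attribute values are not inspected.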

 

Jun 19, 2010 at 4:36 AM

I found one more hole.

 

If "p", "a" and href are on the whitelist, then

<p><a href="javascript:alert('hi');">Bad</a></p>

will get through. So I want to go into all of these valid attributes and check for bad things like "javascript:alert('hi');".

 

Jun 19, 2010 at 7:55 AM

 

That hole is a separate question ;-) You want to validate not only nodes (1st question) and attributes (2nd question), but also attribute values (3rd question).

If you also need to validate attribute values, you need to check the href value as a string:

var href = pNode.Attributes["href"]; // null when the node has no href
if (href != null && href.Value.Trim().StartsWith("javascript:", StringComparison.OrdinalIgnoreCase))
{
    //then clean attribute
}

 

Jun 19, 2010 at 10:27 PM
VikciaR wrote:

That hole is a separate question ;-) You want to validate not only nodes (1st question) and attributes (2nd question), but also attribute values (3rd question).

If you also need to validate attribute values, you need to check the href value as a string:

var href = pNode.Attributes["href"]; // null when the node has no href
if (href != null && href.Value.Trim().StartsWith("javascript:", StringComparison.OrdinalIgnoreCase))
{
    //then clean attribute
}

Yeah, a lot of questions (but there are a lot of ways for hackers to get through). That won't really work, because the javascript can also be written in hex or in other encodings. I found that Microsoft has a great library to deal with that. The only thing I am still trying to figure out is how to find all URLs, since they can also carry dangerous query strings.

so I have to find

<a href="http://somesite.com/view.aspx?q="badCode"> hi </a>

<a href="somesite.com/view.aspx?q="badCode"> hi </a>

<img src="somesite.com/view.aspx?q="badCode" />

<img src="http://somesite.com/view.aspx?q="badCode" />

http://somesite.com/view.aspx?q="badCode"

somesite.com/view.aspx?q="badCode"

So links can appear in all of these forms: some have http:// in front of them, some do not, and some are not even inside HTML tags.
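
Since the scheme can be hidden behind entities (for example javascript&#58;alert(1)) or broken up with tabs and newlines, a check for the literal text "javascript:" is easy to bypass. One hedged alternative is to allow only known-good schemes. The sketch below uses only the .NET base class library; UrlCheckSketch and IsAllowedUrl are hypothetical names, and the rule it applies (relative URLs plus http/https) is an assumption, not something Html Agility Pack provides:

```csharp
using System;
using System.Linq;
using System.Net;

public static class UrlCheckSketch
{
    // Hypothetical helper: returns true for relative URLs and for absolute
    // URLs whose scheme is http or https; everything else is rejected.
    public static bool IsAllowedUrl(string rawValue)
    {
        if (string.IsNullOrWhiteSpace(rawValue))
            return false;

        // Decode entities first, so "javascript&#58;" is seen as "javascript:".
        string decoded = WebUtility.HtmlDecode(rawValue);

        // Drop whitespace and control characters, which can be used to
        // split a scheme name (e.g. "java<TAB>script:").
        string cleaned = new string(
            decoded.Where(c => !char.IsControl(c) && !char.IsWhiteSpace(c)).ToArray());

        // A colon that appears before the first '/' or '?' is treated as
        // the end of a scheme name.
        int colon = cleaned.IndexOf(':');
        int delimiter = cleaned.IndexOfAny(new[] { '/', '?' });
        bool hasScheme = colon >= 0 && (delimiter < 0 || colon < delimiter);

        if (!hasScheme)
            return true; // relative URL such as "view.aspx?q=1"

        string scheme = cleaned.Substring(0, colon);
        return scheme.Equals("http", StringComparison.OrdinalIgnoreCase)
            || scheme.Equals("https", StringComparison.OrdinalIgnoreCase);
    }
}
```

It could be applied to every href and src value that survives the attribute whitelist, removing the attribute when the check fails. The check deliberately errs on the side of rejecting, so a relative URL with a colon before its first slash or question mark is refused. It does nothing about bare URLs in text or about the content of query strings.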

Jun 25, 2010 at 7:50 AM

How would you make this remove the tags but not the contents of the tags?

I picked up this code from this blog: http://thomasjo.com/blog/archive/a-pessimistic-html-sanitizer/ , but I don't know how to make it remove just the tags and keep their inner content.

Here is the class:

using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

namespace Wayloop.Blog.Core.Markup
{
    public static class HtmlSanitizer
    {
        private static readonly IDictionary<string, string[]> Whitelist;

        static HtmlSanitizer()
        {
            Whitelist = new Dictionary<string, string[]> {
                { "a", new[] { "href" } },
                { "strong", null },
                { "em", null },
                { "blockquote", null },
                };
        }

        public static string Sanitize(string input)
        {
            var htmlDocument = new HtmlDocument();

            htmlDocument.LoadHtml(input);
            SanitizeNode(htmlDocument.DocumentNode);

            return htmlDocument.DocumentNode.WriteTo().Trim();
        }

        private static void SanitizeChildren(HtmlNode parentNode)
        {
            for (int i = parentNode.ChildNodes.Count - 1; i >= 0; i--) {
                SanitizeNode(parentNode.ChildNodes[i]);
            }
        }

        private static void SanitizeNode(HtmlNode node)
        {
            if (node.NodeType == HtmlNodeType.Element) {
                if (!Whitelist.ContainsKey(node.Name)) {
                    node.ParentNode.RemoveChild(node);
                    return;
                }

                if (node.HasAttributes) {
                    for (int i = node.Attributes.Count - 1; i >= 0; i--) {
                        HtmlAttribute currentAttribute = node.Attributes[i];
                        string[] allowedAttributes = Whitelist[node.Name];
                        if (!allowedAttributes.Contains(currentAttribute.Name)) {
                            node.Attributes.Remove(currentAttribute);
                        }
                    }
                }
            }

            if (node.HasChildNodes) {
                SanitizeChildren(node);
            }
        }
    }
}
Jun 28, 2010 at 12:19 PM

Here is the answer to my question. Not the fanciest code, but hey... it does its job.

The reason for renaming the node is that the <?xml:namespace /> tags cause XPath parsing errors.

Here is the code.

 

public static class HtmlSanitizer
    {
        private static readonly IDictionary<string, string[]> Whitelist;
        private static List<string> DeletableNodesXpath = new List<string>();

        static HtmlSanitizer()
        {
            Whitelist = new Dictionary<string, string[]> {
                { "a", new[] { "href" } },
                { "strong", null },
                { "em", null },
                { "blockquote", null },
                { "b", null},
                { "p", null},
                { "ul", null},
                { "ol", null},
                { "li", null},
                { "div", new[] { "align" } },
                { "strike", null},
                { "u", null},                
                { "sub", null},
                { "sup", null},
                { "table", null },
                { "tr", null },
                { "td", null },
                { "th", null }
                };
        }

        public static string Sanitize(string input)
        {
            if (input.Trim().Length < 1)
                return string.Empty;
            var htmlDocument = new HtmlDocument();

            htmlDocument.LoadHtml(input);            
            SanitizeNode(htmlDocument.DocumentNode);
            string xPath = HtmlSanitizer.CreateXPath();

            return StripHtml(htmlDocument.DocumentNode.WriteTo().Trim(), xPath);
        }

        private static void SanitizeChildren(HtmlNode parentNode)
        {
            for (int i = parentNode.ChildNodes.Count - 1; i >= 0; i--)
            {
                SanitizeNode(parentNode.ChildNodes[i]);
            }
        }

        private static void SanitizeNode(HtmlNode node)
        {
            if (node.NodeType == HtmlNodeType.Element)
            {
                if (!Whitelist.ContainsKey(node.Name))
                {
                    if (!DeletableNodesXpath.Contains(node.Name))
                    {                       
                        //DeletableNodesXpath.Add(node.Name.Replace("?",""));
                        node.Name = "removeableNode";
                        DeletableNodesXpath.Add(node.Name);
                    }
                    if (node.HasChildNodes)
                    {
                        SanitizeChildren(node);
                    }                  

                    return;
                }

                if (node.HasAttributes)
                {
                    for (int i = node.Attributes.Count - 1; i >= 0; i--)
                    {
                        HtmlAttribute currentAttribute = node.Attributes[i];
                        string[] allowedAttributes = Whitelist[node.Name];
                        if (allowedAttributes != null)
                        {
                            if (!allowedAttributes.Contains(currentAttribute.Name))
                            {
                                node.Attributes.Remove(currentAttribute);
                            }
                        }
                        else
                        {
                            node.Attributes.Remove(currentAttribute);
                        }
                    }
                }
            }

            if (node.HasChildNodes)
            {
                SanitizeChildren(node);
            }
        }

        private static string StripHtml(string html, string xPath)
        {
            HtmlDocument htmlDoc = new HtmlDocument();
            htmlDoc.LoadHtml(html);
            if (xPath.Length > 0)
            {
                HtmlNodeCollection invalidNodes = htmlDoc.DocumentNode.SelectNodes(xPath);
                // SelectNodes returns null when nothing matches
                if (invalidNodes != null)
                {
                    foreach (HtmlNode node in invalidNodes)
                    {
                        node.ParentNode.RemoveChild(node, true);
                    }
                }
            }
            return htmlDoc.DocumentNode.WriteContentTo();
        }

        private static string CreateXPath()
        {
            string _xPath = string.Empty;
            for (int i = 0; i < DeletableNodesXpath.Count; i++)
            {
                if (i != DeletableNodesXpath.Count - 1)
                {
                    _xPath += string.Format("//{0}|", DeletableNodesXpath[i].ToString());
                }
                else _xPath += string.Format("//{0}", DeletableNodesXpath[i].ToString());
            }
            return _xPath;
        }
    }

It does its job so far. Anyone up for refactoring the code?
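
As one possible refactoring, the rename-then-XPath pass can be avoided by unwrapping disallowed elements directly with RemoveChild(node, true), the same call the code above already uses. This is only a sketch: the class name and the shortened whitelist are illustrative, an empty array stands in for null, and it has not been tried against <?xml:namespace /> input:

```csharp
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class UnwrappingSanitizerSketch
{
    // Tag name -> allowed attribute names (an empty array means no attributes).
    private static readonly Dictionary<string, string[]> Whitelist =
        new Dictionary<string, string[]>
        {
            { "a", new[] { "href" } },
            { "strong", new string[0] },
            { "em", new string[0] },
            { "p", new string[0] },
        };

    public static string Sanitize(string input)
    {
        if (string.IsNullOrWhiteSpace(input))
            return string.Empty;

        var document = new HtmlDocument();
        document.LoadHtml(input);
        SanitizeChildren(document.DocumentNode);
        return document.DocumentNode.WriteContentTo().Trim();
    }

    private static void SanitizeChildren(HtmlNode parent)
    {
        // Snapshot the children because the loop modifies parent.ChildNodes.
        foreach (var child in parent.ChildNodes.ToList())
        {
            // Clean the subtree first, so anything kept below is already sanitized.
            SanitizeChildren(child);

            if (child.NodeType != HtmlNodeType.Element)
                continue;

            string[] allowed;
            if (!Whitelist.TryGetValue(child.Name, out allowed))
            {
                // Drop the tag itself but keep its children in the parent.
                parent.RemoveChild(child, true);
                continue;
            }

            // Snapshot the attributes so they can be removed while looping.
            foreach (var attribute in child.Attributes.ToList())
            {
                if (!allowed.Contains(attribute.Name))
                    attribute.Remove();
            }
        }
    }
}
```

Like the version above, it keeps the text inside every removed tag, including the text inside <script>, so script bodies end up as visible text. No state is kept between calls, so repeated calls to Sanitize do not affect each other.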