How to validate that nodes in HtmlDocument are actual HTML nodes/elements

Jan 9, 2014 at 11:26 PM
Edited Jan 9, 2014 at 11:38 PM
I have built an HTML sanitizer using a white list, but I have run into a problem: text like "< 1" is treated as a valid HtmlNodeType.Element. Because of this, it gets removed by the white list processing, even though it is valid text entered by a user and not an actual node of any kind.

The following is an excerpt of my sanitizer, showing the entry point to processing:
public string RetainWhiteListedItems(string HTMLToScrub) {
    // Namespaces used: System.Collections.Generic, System.Linq, System.Web, HtmlAgilityPack
    if (string.IsNullOrWhiteSpace(HTMLToScrub)) return HTMLToScrub;

    HtmlDocument HTMLDoc = new HtmlDocument();
    HTMLDoc.OptionWriteEmptyNodes = true;
    HTMLDoc.LoadHtml(HttpUtility.HtmlDecode(HTMLToScrub));

    /* THIS CHECK LETS A DOCUMENT THAT IS JUST "< 1" CONTINUE, AS IT SEES IT AS
       A VALID ELEMENT WITH A NAME OF "1" */
    if (HTMLDoc.DocumentNode.ChildNodes.Any(node => node.NodeType == HtmlNodeType.Element)) {
        // snapshot the descendants so the list stays stable while nodes are removed
        IList<HtmlNode> hnc = HTMLDoc.DocumentNode.Descendants().ToList();
        if (hnc.Count == 0) {
            return HTMLToScrub;
        }

        // remove non-white list nodes, iterating from the last node to the first
        for (int i = hnc.Count - 1; i >= 0; i--) {
            HtmlNode htmlNode = hnc[i];
            // if the htmlNode is not in the whitelist, turf it
            //...all other processing for attributes, scripting etc....
        }
    }
    // ...remainder of the method omitted from this excerpt...
}
Do HtmlNodeType.Element and/or HtmlNode not validate that the element is in fact an HTML element? Is there something else I need to do to get this functionality? If a user enters text containing less-than/greater-than symbols that are not tags/HTML elements, I'd like that text to simply bypass processing by the sanitizer.
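
One possible way to get that bypass, as a minimal sketch and not something HtmlAgilityPack provides: check the decoded text for anything that could start markup before handing it to the parser. The class name MarkupProbe and the method LooksLikeMarkup are hypothetical names made up for this illustration. The pattern assumes the HTML tokenizer rule that a "<" only opens markup when it is immediately followed by a letter, "/", "!" or "?", so "< 1" and "<1" count as plain text.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of HtmlAgilityPack): a cheap pre-check that
// decides whether a string contains anything that could begin HTML markup.
public static class MarkupProbe
{
    // "<" immediately followed by a letter (start tag), "/" (end tag),
    // "!" (comment/doctype) or "?" (processing-instruction style markup).
    // Deliberately does not require a closing ">", so an unterminated
    // "<script" is still treated as markup and sent to the sanitizer.
    private static readonly Regex TagStart =
        new Regex(@"<[A-Za-z/!?]", RegexOptions.Compiled);

    public static bool LooksLikeMarkup(string text)
    {
        return !string.IsNullOrEmpty(text) && TagStart.IsMatch(text);
    }
}
```

In the excerpt above, this could be called on the result of HttpUtility.HtmlDecode(HTMLToScrub) before LoadHtml, returning the input unchanged when LooksLikeMarkup is false. The check is intentionally conservative: text such as "a<b" is still flagged and goes through the white list. Text that skips the sanitizer would still need to be HTML-encoded when it is rendered.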

thx