Parse HTML with whitelist

Topics: Developer Forum
Nov 26, 2013 at 11:59 PM
Edited Dec 23, 2013 at 9:30 PM
This code is adapted from a previous discussion by DarthObiwan. The main change is that a node is now removed at the end of its recursion step, after its attributes and children have been processed. I also call RemoveChild(node, true) so that the removed node's children (the grandchildren of its parent) are kept. Make sure you patch the HtmlNode.RemoveChild() method first: it has a bug that leaves the kept children out of order. This is issue #28756, "RemoveChild(node, true) reverses the order of the grandchildren it keeps".
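
If patching the library is not an option, the reordering can probably be sidestepped by moving the children yourself. The following is only a sketch: RemoveKeepingChildren and HtmlNodeWorkaround are hypothetical names, not part of HtmlAgilityPack, and the approach assumes InsertBefore accepts a node that is still attached to the node being removed. Check whether your HtmlAgilityPack version already has the fix before relying on it.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class HtmlNodeWorkaround
{
    // Hypothetical helper (not part of HtmlAgilityPack): removes `node`
    // and leaves its children, in their original order, in its place.
    public static void RemoveKeepingChildren(HtmlNode node)
    {
        HtmlNode parent = node.ParentNode;
        if (parent == null)
        {
            return; // root or detached node: nothing to remove it from
        }

        // Snapshot the children first, because inserting them elsewhere
        // changes the sibling links we would otherwise be iterating over.
        foreach (HtmlNode child in node.ChildNodes.ToList())
        {
            // Each child lands directly before `node`, so the original
            // left-to-right order is kept.
            parent.InsertBefore(child, node);
        }

        // Single-argument overload: removes the node without touching
        // the children that were just moved.
        parent.RemoveChild(node);
    }
}
```

With this helper, the last step of the whitelist method could call HtmlNodeWorkaround.RemoveKeepingChildren(pNode) instead of RemoveChild(pNode, true).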

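using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

// Strips every attribute and every element whose name is not in pWhiteList.
// Node names include "#document", "#text" and "#comment", so list "#text"
// if the text content should survive.
public void RemoveNotInWhiteList(HtmlNode pNode, IEnumerable<string> pWhiteList)
{
    // Remove attributes that are not whitelisted. ToList() takes a snapshot
    // so the collection can be modified while it is being walked.
    pNode.Attributes
         .Where(att => !pWhiteList.Contains(att.Name))
         .ToList()
         .ForEach(att => att.Remove());

    // Recurse into the children first.
    pNode.ChildNodes
         .ToList()
         .ForEach(node => RemoveNotInWhiteList(node, pWhiteList));

    // Remove this node last, once its whole subtree has been processed.
    // The root has no parent, so skip it to avoid a NullReferenceException.
    if (pNode.ParentNode != null && !pWhiteList.Contains(pNode.Name))
    {
        pNode.ParentNode.RemoveChild(pNode, true); // preserve children
    }
}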
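
For anyone wondering how to call it, here is a minimal usage sketch. WhiteListDemo and Sanitize are hypothetical names used only for illustration, and the method is repeated in static form (with a guard for the parentless root) so the example compiles on its own. Note that one list covers both tag names and attribute names, and that "#document" and "#text" are listed so the root and the text nodes are kept.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class WhiteListDemo
{
    // Same logic as the method above, made static for the demo.
    public static void RemoveNotInWhiteList(HtmlNode pNode, IEnumerable<string> pWhiteList)
    {
        pNode.Attributes
             .Where(att => !pWhiteList.Contains(att.Name))
             .ToList()
             .ForEach(att => att.Remove());

        pNode.ChildNodes
             .ToList()
             .ForEach(node => RemoveNotInWhiteList(node, pWhiteList));

        // Guard against the root node, which has no parent.
        if (pNode.ParentNode != null && !pWhiteList.Contains(pNode.Name))
        {
            pNode.ParentNode.RemoveChild(pNode, true); // preserve children
        }
    }

    // Hypothetical convenience wrapper: parse, filter, serialize.
    public static string Sanitize(string html, IEnumerable<string> whiteList)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        RemoveNotInWhiteList(doc.DocumentNode, whiteList);
        return doc.DocumentNode.OuterHtml;
    }

    public static void Main()
    {
        // Tags and attributes share one list; "#document" and "#text"
        // keep the root and the text nodes.
        var whiteList = new HashSet<string> { "#document", "#text", "div", "b", "class" };

        string input = "<div class=\"note\" onclick=\"x()\"><span>Hello <b>world</b></span></div>";
        Console.WriteLine(Sanitize(input, whiteList));
    }
}
```

Here the span wrapper and the onclick attribute are dropped while the div, its class attribute, and the b element stay. With an unpatched RemoveChild, the two children promoted out of the span may come out in reversed order. Also keep in mind that removing an element this way keeps its text children, so the contents of something like a script element would remain as plain text.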