Parse HTML with whitelist

Topics: Developer Forum
Nov 27, 2013 at 12:59 AM
Edited Dec 23, 2013 at 10:30 PM
This code was taken and revised from a previous discussion authored by DarthObiwan. The main change is that I moved the removal step to the end of the recursion cycle, so a node is only removed after all of its children have been processed. I also call RemoveChild(pNode, true), that is, with keepGrandChildren set, so the children of a removed node are kept. Make sure you update the HtmlNode.RemoveChild() method first: it has a bug that leaves the kept children out of order. This is issue item # 28756, "RemoveChild(node, true) reverses the order of the grandchildren it keeps".

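// Requires: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
public void RemoveNotInWhiteList(HtmlNode pNode, IEnumerable<string> pWhiteList)
{
    // Remove every attribute whose name is not on the white list.
    // ToList() takes a snapshot so the collection can be modified safely.
    pNode.Attributes
     .Where(att => !pWhiteList.Contains(att.Name))
     .ToList()
     .ForEach(att => att.Remove());

    // Recurse into the children first. Text nodes are named "#text",
    // so that name must be on the white list for text to survive.
    pNode.ChildNodes
     .ToList()
     .ForEach(node => RemoveNotInWhiteList(node, pWhiteList));

    // this operation should be performed at the termination of all stack frames.
    if (!pWhiteList.Contains(pNode.Name) && pNode.ParentNode != null)
        pNode.ParentNode.RemoveChild(pNode, true); // preserve children
}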
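
If patching HtmlNode.RemoveChild() in the library source is not an option, one possible workaround is to move the children up by hand, in order, and then remove the emptied node. The following is only a sketch under that assumption: the class NodeUnwrapper and the methods RemoveKeepingChildOrder and Demo are hypothetical names that are not part of the library, and the sketch relies only on the one-argument RemoveChild(node) and on InsertBefore(newChild, refChild).

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class NodeUnwrapper
{
    // Hypothetical helper: removes 'node' from its parent while keeping
    // its children in their original order, without calling
    // RemoveChild(node, true).
    public static void RemoveKeepingChildOrder(HtmlNode node)
    {
        HtmlNode parent = node.ParentNode;
        if (parent == null)
            return; // nothing to detach from (e.g. the document node)

        // Iterate over a snapshot, because the loop modifies ChildNodes.
        foreach (HtmlNode child in node.ChildNodes.ToList())
        {
            node.RemoveChild(child);          // detach from the old parent
            parent.InsertBefore(child, node); // re-attach just before 'node'
        }

        parent.RemoveChild(node); // 'node' is now empty
    }

    // Small demonstration: unwraps the <span> and returns the resulting markup.
    public static string Demo()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<div><span><b>1</b><i>2</i></span></div>");
        HtmlNode span = doc.DocumentNode.SelectSingleNode("//span");
        RemoveKeepingChildOrder(span);
        return doc.DocumentNode.OuterHtml;
    }
}
```

With a helper like this, the last line of RemoveNotInWhiteList could call NodeUnwrapper.RemoveKeepingChildOrder(pNode) instead of pNode.ParentNode.RemoveChild(pNode, true).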