19
Vote

Ordering problem with RemoveChild

description

When I remove an item from a parent node, the ordering of the children changes:
 
The following example replaces the less-than and greater-than signs with square brackets. If you comment out the RemoveChild() line the ordering works ok:
 
        String testhtml = "<levelone>\r\n"
            + "   <leveltwo>\r\n"
            + "      <h1><levelthreepartone></levelthreepartone></h1>\r\n"
            + "      <p><levelthreeparttwo></levelthreeparttwo></p>\r\n"
            + "   </leveltwo>\r\n"
            + "</levelone>";
 
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(testhtml);
 
        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes("//*");
 
        foreach (HtmlNode node in nodes)
        {
            HtmlNode parentnode = node.ParentNode;
 
            HtmlTextNode textnode1 = node.OwnerDocument.CreateTextNode("["+node.Name+"]");
            parentnode.InsertBefore(textnode1, node);
 
            HtmlTextNode textnode2 = node.OwnerDocument.CreateTextNode("[/" + node.Name + "]");
            parentnode.InsertAfter(textnode2, node);
 
            parentnode.RemoveChild(node, true);  // comment this out!
 
        }
        StringBuilder sb = new StringBuilder();
        StringWriter sw = new StringWriter(sb);
        HtmlTextWriter writer = new HtmlTextWriter(sw);
        doc.Save(writer);
        System.Console.Write(sb.ToString());
 
Note that the lines with "levelthreepartone" and "levelthreeparttwo" get reversed.

comments

mikebridge wrote Mar 23, 2007 at 5:55 PM

I think the problem is in HtmlNode.RemoveChild(HtmlNode oldChild, bool keepGrandChildren). It looks like the previous sibling is assigned only once, then everything gets inserted at that point. This causes the order of the nodes to be reversed.

The current code is this:
        if ((oldChild._childnodes != null) && keepGrandChildren)            {               // get prev sibling             HtmlNode prev = oldChild.PreviousSibling;

            // reroute grand children to ourselves              foreach(HtmlNode grandchild in oldChild._childnodes)                {                   InsertAfter(grandchild, prev);              }           }           RemoveChild(oldChild);
I think it should be this:
  // get prev sibling      HtmlNode prev = oldChild.PreviousSibling;      if ((oldChild._childnodes != null) && keepGrandChildren)      {          // reroute grand children to ourselves          foreach (HtmlNode grandchild in oldChild.ChildNodes)          {            prev = InsertAfter(grandchild, prev);          }

  }

wrote Jul 6, 2008 at 11:06 PM

wrote Feb 18, 2009 at 8:57 PM

wrote Jul 20, 2009 at 8:33 PM

adamface wrote Jul 20, 2009 at 8:34 PM

Mike - You are a lifesaver! I ran into the exact same problem (grandchildren nodes being reversed). I can verify that your code fixes the issue.

Thank you!

Adam

wrote Aug 18, 2010 at 10:30 AM

0tto wrote Aug 18, 2010 at 10:32 AM

Thanks mike, your fix worked for me as well.

I've put up fixed binaries here if somebody needs them:
http://www.mediafire.com/?t5eadu70nc79z9w

wrote Nov 18, 2010 at 7:42 PM

tbone47 wrote Nov 18, 2010 at 7:42 PM

wrote Nov 18, 2010 at 10:48 PM

sysmo wrote Jan 17, 2011 at 12:30 AM

The "prev" reference is unnecessary. Looping backwards through the children and inserting after the parent works. Here is the updated RemoveChild method.
    public HtmlNode RemoveChild(HtmlNode oldChild, bool keepGrandChildren)
    {
        if (oldChild == null)
        {
            throw new ArgumentNullException("oldChild");
        }

        if ((oldChild._childnodes != null) && keepGrandChildren)
        {
            // get prev sibling
            //HtmlNode prev = oldChild.PreviousSibling;
            int count = oldChild._childnodes.Count;

            // reroute grand children to ourselves
            for (var i = count - 1; i >= 0; i--)
            {
                //looping through children backwards allows us to insert after the parent 
                //and end up with the same node order
                HtmlNode grandchild = oldChild._childnodes[i];
                //InsertAfter(grandchild, prev);
                InsertAfter(grandchild, oldChild);
            }
        }
        RemoveChild(oldChild);
        _outerchanged = true;
        _innerchanged = true;
        return oldChild;
    }

wrote Feb 4, 2011 at 12:45 AM

wrote May 3, 2011 at 6:11 AM

tednyberg wrote May 3, 2011 at 6:40 AM

Here's an example I used in a project to remove span tags without compiling my own modified version of HtmlAgilityPack:
    public static void RemoveSpanTags(this HtmlDocument html)
    {
        var spanTags = html.DocumentNode.SelectNodes("//span");

        if (spanTags!=null)
        {
            foreach (var span in spanTags)
            {
                if (!span.HasChildNodes)
                {
                    span.ParentNode.RemoveChild(span);

                    continue;
                }

                int count = span.ChildNodes.Count;

                for (var i = count - 1; i >= 0; i--)
                {
                    var child = span.ChildNodes[i];

                    span.ParentNode.InsertAfter(child, span);
                }

                span.ParentNode.RemoveChild(span);
            }
        }
    }

jongalloway wrote Sep 10, 2011 at 10:38 PM

Any reason not to apply the patch for 28756?

wrote Sep 15, 2011 at 6:00 PM

wrote Feb 20, 2012 at 2:23 PM

wrote Sep 5, 2012 at 3:05 AM

wrote Feb 22, 2013 at 1:47 AM

wrote Jun 3, 2013 at 4:27 PM

wrote Aug 10, 2013 at 10:44 AM

wrote Oct 29, 2013 at 9:55 PM

wrote Mar 19, 2014 at 6:12 PM

wrote Dec 3, 2014 at 7:36 PM

wrote Dec 4, 2014 at 9:25 AM

JeffBlankenbiller wrote Jan 14 at 2:17 PM

Sysmo posted code to fix this problem 4 years ago.

Why has this not been corrected yet?

Why is this also a Low impact? Reordering the child nodes incorrect is bad. When used to strip unwanted HTML from paragraphs and sentences, it reorders the sentences or words which totally messes up the original content.

wrote Jan 14 at 2:18 PM