Selecting a specific bolded text

Jun 29, 2009 at 3:35 PM

I have an application that needs to read an HTML page and store the third, fourth, and fifth occurrences of bolded text in a string. Does anyone have a good idea how to do this?

Jun 29, 2009 at 3:50 PM

Build a flattened list of all the nodes (a recursive foreach that builds a List<HtmlNode>).

After that do a list.Where(x=>x.Name.Equals("b")).Skip(2).Take(3); (note code may not be exact but damn close).

Or if you're interested in performance over maintainability you can keep track of the bold tags as you find them while recursing over the nodes.


If the text is bold via CSS, there isn't a way to do it without writing a CSS rendering engine. If both strong and b tags are used, you'll have to query for both.
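
To make the strong/b point concrete, here is a minimal sketch of matching both tag names in a single pass. FakeNode is a hypothetical stand-in for HtmlNode (it only has Name and InnerText) so the snippet runs without the library, and BoldDemo and IsBold are names made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for HtmlNode, exposing only the two members used here.
class FakeNode
{
    public string Name;
    public string InnerText;

    public FakeNode(string name, string innerText)
    {
        Name = name;
        InnerText = innerText;
    }
}

static class BoldDemo
{
    // True for <b> and <strong>; the comparison ignores case as a precaution.
    public static bool IsBold(FakeNode node)
    {
        return string.Equals(node.Name, "b", StringComparison.OrdinalIgnoreCase)
            || string.Equals(node.Name, "strong", StringComparison.OrdinalIgnoreCase);
    }

    static void Main()
    {
        var nodes = new List<FakeNode>
        {
            new FakeNode("p", "intro"),
            new FakeNode("b", "A"),
            new FakeNode("strong", "B"),
            new FakeNode("i", "aside"),
            new FakeNode("b", "C")
        };

        // One pass keeps both tag names in document order.
        string[] boldTexts = nodes.Where(n => IsBold(n)).Select(n => n.InnerText).ToArray();
        Console.WriteLine(string.Join(", ", boldTexts)); // A, B, C
    }
}
```

Filtering once keeps the b and strong nodes in document order, which matters when you later count to the third, fourth, and fifth match.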

 

Flattening the list is a bit difficult in the current version of Html Agility Pack. Here's the code I use after converting HtmlNodeCollection to implement IList<T>. It is a method on the HtmlNode class:

        public IEnumerable<HtmlNode> Descendants()
        {
            List<HtmlNode> list = new List<HtmlNode>();

            if (HasChildNodes)
                list.AddRange(ChildNodes);

            foreach (HtmlNode node in ChildNodes)
                if (node.HasChildNodes)
                    list.AddRange(node.Descendants());

            return list;
        }

Jun 29, 2009 at 7:49 PM

I'm sorry to bother you Darth, but I'm a beginner at the whole HTML Agility Pack thing.  I was wondering if you could be a little more specific on how to do the second step of your explanation. 

"After that do a list.Where(x=>x.Name.Equals("b")).Skip(2).Take(3); (note code may not be exact but damn close)."

Thanks

Jun 29, 2009 at 8:00 PM

list would be

IEnumerable<HtmlNode> list = document.DocumentNode.Descendants();

That LINQ statement grabs every node whose tag name is b, skips the first two, and then takes the next three. You'll need to compile against .NET 3.5 and add a using directive for System.Linq.

Below is basically what that one line does

int count = 0;
List<HtmlNode> boldTags = new List<HtmlNode>();
foreach (HtmlNode node in list)
{
   if (node.Name.Equals("b"))
   {
      count++;
      if (count > 5)
        break; // past the fifth b tag, nothing more to collect

      if (count >= 3)
        boldTags.Add(node); // keep the third, fourth, and fifth b tags
   }
}
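
Since the original question was about ending up with a string, here is a small sketch of the Skip/Take step followed by a join. It assumes you have already pulled the text of each bold node into a sequence of strings (for example from InnerText). SkipTakeDemo, ThirdToFifth, and the sample data are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class SkipTakeDemo
{
    // Returns the third, fourth, and fifth items joined by the separator.
    // With fewer than five items, whatever exists past the second is returned.
    public static string ThirdToFifth(IEnumerable<string> boldTexts, string separator)
    {
        string[] picked = boldTexts.Skip(2).Take(3).ToArray();
        return string.Join(separator, picked);
    }

    static void Main()
    {
        // Stand-in data: in real code these would come from the bold nodes.
        var texts = new List<string> { "one", "two", "three", "four", "five", "six" };
        Console.WriteLine(ThirdToFifth(texts, " | "));     // three | four | five

        var shortList = new List<string> { "one", "two", "three" };
        Console.WriteLine(ThirdToFifth(shortList, " | ")); // three
    }
}
```

Skip and Take do not throw when the page has fewer bold tags than expected. They just return fewer items, so check the result if you need exactly three.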

Jun 29, 2009 at 8:34 PM

I'm getting errors on HasChildNodes, ChildNodes, and the call to the Descendants method. Any help on how to solve these errors would be greatly appreciated.

  public IEnumerable<HtmlNode> Descendants()
        {
            List<HtmlNode> list = new List<HtmlNode>();

            if (HasChildNodes)
                list.AddRange(ChildNodes);

            foreach (HtmlNode node in ChildNodes)
                if (node.HasChildNodes)
                    list.AddRange(node.Descendants());

            return list;
        }

Jun 29, 2009 at 8:42 PM
Edited Jun 29, 2009 at 8:43 PM

The Descendants() method should be added to the HtmlNode.cs file.

If you don't want to edit the Html Agility Pack source code, you can add an extension method instead.

 

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace HtmlAgilityPack
{
    public static class Extensions
    {
        public static IEnumerable<HtmlNode> Descendants(this HtmlNode theNode)
        {
            List<HtmlNode> list = new List<HtmlNode>();

            if (theNode.HasChildNodes)
                list.AddRange(theNode.ChildNodes);

            foreach (var node in theNode.ChildNodes)
                if (node.HasChildNodes)
                    list.AddRange(node.Descendants());

            return list;
        }
    }
}

 

 

Though all of this may need HtmlNodeCollection (HtmlCollection.cs) to be modified, since List<HtmlNode>.AddRange needs ChildNodes to be an IEnumerable<HtmlNode>. Here's my copy (I'm hoping to get these changes added to Html Agility Pack once I get in touch with simonm):

 

using System;
using System.Collections;
using System.Collections.Generic;
using System.Linq;

namespace HtmlAgilityPack
{
    /// <summary>
    /// Represents a combined list and collection of HTML nodes.
    /// </summary>
    public class HtmlNodeCollection : IList<HtmlNode>
    {
        private List<HtmlNode> items = new List<HtmlNode>();
        private HtmlNode _parentnode;

        public HtmlNodeCollection(HtmlNode parentnode)
        {
            _parentnode = parentnode; // may be null
        }

        /// <summary>
        /// Gets the number of elements actually contained in the list.
        /// </summary>
        public int Count
        {
            get
            {
                return items.Count;
            }
        }

        public void Clear()
        {
            foreach (HtmlNode node in items)
            {
                node._parentnode = null;
                node._nextnode = null;
                node._prevnode = null;
            }
            items.Clear();
        }

        public void RemoveAt(int index)
        {
            HtmlNode next = null;
            HtmlNode prev = null;
            HtmlNode oldnode = (HtmlNode)items[index];

            if (index > 0)
            {
                prev = (HtmlNode)items[index - 1];
            }

            if (index < (items.Count - 1))
            {
                next = (HtmlNode)items[index + 1];
            }

            items.RemoveAt(index);

            if (prev != null)
            {
                if (next == prev)
                {
                    throw new InvalidProgramException("Unexpected error.");
                }
                prev._nextnode = next;
            }

            if (next != null)
            {
                next._prevnode = prev;
            }

            oldnode._prevnode = null;
            oldnode._nextnode = null;
            oldnode._parentnode = null;
            
        }

        public void Replace(int index, HtmlNode node)
        {
            HtmlNode next = null;
            HtmlNode prev = null;
            HtmlNode oldnode = (HtmlNode)items[index];

            if (index > 0)
            {
                prev = (HtmlNode)items[index - 1];
            }

            if (index < (items.Count - 1))
            {
                next = (HtmlNode)items[index + 1];
            }

            items[index] = node;

            if (prev != null)
            {
                if (node == prev)
                {
                    throw new InvalidProgramException("Unexpected error.");
                }
                prev._nextnode = node;
            }

            if (next != null)
            {
                next._prevnode = node;
            }

            node._prevnode = prev;
            if (next == node)
            {
                throw new InvalidProgramException("Unexpected error.");
            }
            node._nextnode = next;
            node._parentnode = _parentnode;

            oldnode._prevnode = null;
            oldnode._nextnode = null;
            oldnode._parentnode = null;
        }

        public void Insert(int index, HtmlNode node)
        {
            HtmlNode next = null;
            HtmlNode prev = null;

            if (index > 0)
            {
                prev = (HtmlNode)items[index - 1];
            }

            if (index < items.Count)
            {
                next = (HtmlNode)items[index];
            }

            items.Insert(index, node);

            if (prev != null)
            {
                if (node == prev)
                {
                    throw new InvalidProgramException("Unexpected error.");
                }
                prev._nextnode = node;
            }

            if (next != null)
            {
                next._prevnode = node;
            }

            node._prevnode = prev;

            if (next == node)
            {
                throw new InvalidProgramException("Unexpected error.");
            }

            node._nextnode = next;
            node._parentnode = _parentnode;
        }

        public void Append(HtmlNode node)
        {
            HtmlNode last = null;
            if (items.Count > 0)
            {
                last = (HtmlNode)items[items.Count - 1];
            }

            items.Add(node);
            node._prevnode = last;
            node._nextnode = null;
            node._parentnode = _parentnode;
            if (last != null)
            {
                if (last == node)
                {
                    throw new InvalidProgramException("Unexpected error.");
                }
                last._nextnode = node;
            }
        }

        public void Prepend(HtmlNode node)
        {
            HtmlNode first = null;
            if (items.Count > 0)
            {
                first = (HtmlNode)items[0];
            }

            items.Insert(0, node);

            if (node == first)
            {
                throw new InvalidProgramException("Unexpected error.");
            }
            node._nextnode = first;
            node._prevnode = null;
            node._parentnode = _parentnode;
            if (first != null)
            {
                first._prevnode = node;
            }
        }

        public void Add(HtmlNode node)
        {
            items.Add(node);
        }

        /// <summary>
        /// Gets the node at the specified index.
        /// </summary>
        public HtmlNode this[int index]
        {
            get
            {
                return items[index] as HtmlNode;
            }
            set
            {
                items[index] = value;
            }
        }
      
        public int GetNodeIndex(HtmlNode node)
        {
            // TODO: should we rewrite this? what would be the key of a node?
            for (int i = 0; i < items.Count; i++)
            {
                if (node == ((HtmlNode)items[i]))
                {
                    return i;
                }
            }
            return -1;
        }

        /// <summary>
        /// Gets a given node from the list.
        /// </summary>
        public int this[HtmlNode node]
        {
            get
            {
                int index = GetNodeIndex(node);
                if (index == -1)
                {
                    throw new ArgumentOutOfRangeException("node", "Node \"" + node.CloneNode(false).OuterHtml + "\" was not found in the collection");
                }
                return index;
            }
           
        }
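        public HtmlNode this[string name]
        {
            get
            {
                // FirstOrDefault rather than SingleOrDefault: sibling nodes often share
                // a tag name, and SingleOrDefault throws when more than one matches.
                string lowered = name.ToLower();
                return items.FirstOrDefault(x => x.Name.Equals(lowered));
            }
        }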
      
        public int IndexOf(HtmlNode item)
        {
            return items.IndexOf(item);
        }

 
        public bool Contains(HtmlNode item)
        {
           return items.Contains(item);
        }

        public void CopyTo(HtmlNode[] array, int arrayIndex)
        {
            items.CopyTo(array, arrayIndex);
        }

        public bool IsReadOnly
        {
            get { return false; }
        }

        public bool Remove(HtmlNode item)
        {
            int i = items.IndexOf(item);
            if (i < 0)
                return false; // not in the collection; RemoveAt(-1) would throw

            RemoveAt(i);
            return true;
        }
        public bool Remove(int i)
        {
            RemoveAt(i);
            return true;
        }
        IEnumerator<HtmlNode> IEnumerable<HtmlNode>.GetEnumerator()
        {
            return items.GetEnumerator();
        }


        IEnumerator IEnumerable.GetEnumerator()
        {
            return items.GetEnumerator();
        }

        public HtmlNode FindFirst(string name)
        {
            return FindFirst(this, name);
        }

        public static HtmlNode FindFirst(HtmlNodeCollection items, string name)
        {
            foreach (var node in items)
            {
                // Substring match: a name of "b" also matches "body" and "tbody".
                if (node.Name.ToLower().Contains(name))
                    return node;
                if (node.HasChildNodes)
                {
                    var returnNode = FindFirst(node.ChildNodes, name);
                    if (returnNode != null)
                        return returnNode;
                }
            }
            return null;
        }

    }

}