Need help with SelectBodes

Topics: Developer Forum, User Forum
Jul 30, 2014 at 10:00 PM
Hello,

I am a complete novice to XPath and the HtmlAgilityPack. Please help me understand how to gather HTML content. Here is my code:
    static void Main(string[] args) 
    { 
      string input = @" 
<span style=""font-style: italic"">This is the title</span>. 
This is the introductory text: 
<ol> 
  <li>List Item One</li> 
  <li>List Item Two</li> 
  <li>List Item Three</li> 
  <li>This list item is nested: 
      <ol> 
        <li>List Item Four A.</li> 
        <li>List Item Four B.</li> 
      </ol> 
      Yes it is. 
  </li> 
  <li>List Item Five</li> 
</ol> 
This is the footer text. Last updated: July 20, 2014 

"; 

      HtmlDocument doc = new HtmlDocument(); 

      try 
      { 
        doc.LoadHtml(input); 
      } 
      catch (Exception e) 
      { 
        LogIt("ERROR: " + e.Message); 
        return; 
      } 

      HtmlNode get_title = doc.DocumentNode.SelectSingleNode("//span"); 
      if (get_title != null) 
      { 
        LogIt("Title: '" + get_title.InnerHtml + "'"); 
      } 

      HtmlNodeCollection get_outer_lists = doc.DocumentNode.SelectNodes("//ol//li"); 

      if (get_outer_lists != null) 
      { 
        foreach (HtmlNode hn_outer in get_outer_lists)  
        { 
          LogIt("Begin outer for"); 
          LogIt("outer HTML: '" + hn_outer.OuterHtml + "'"); 

          // Now fetch inner list, the text above the inner list, and the  
          // text below the inner list. 

          HtmlNodeCollection get_inner_lists = doc.DocumentNode.SelectNodes("//ol//li//ol//li"); 

          if (get_inner_lists != null) 
          { 
            foreach (HtmlNode hn_inner in get_inner_lists) 
            { 
              LogIt("\tinner HTML: '" + hn_inner.OuterHtml + "'"); 
            } 
          } 
          else 
          { 
            LogIt("ERROR: Could not get inner list"); 
          } 
        } 
      } 
      else 
      { 
        LogIt("ERROR: Could not select //ol//li"); 
        Console.Read(); 
        return; 
      } 

      Console.Read(); 
      return; 
    } 

    private static void LogIt(string str) 
    { 
      Console.WriteLine(str); 
       
      return; 
    } 
...and here is my output:
Title: 'This is the title' 
Begin outer for 
outer HTML: '<li>List Item One</li>' 
        inner HTML: '<li>List Item Four A.</li>' 
        inner HTML: '<li>List Item Four B.</li>' 
Begin outer for 
outer HTML: '<li>List Item Two</li>' 
        inner HTML: '<li>List Item Four A.</li>' 
        inner HTML: '<li>List Item Four B.</li>' 
Begin outer for 
outer HTML: '<li>List Item Three</li>' 
        inner HTML: '<li>List Item Four A.</li>' 
        inner HTML: '<li>List Item Four B.</li>' 
Begin outer for 
outer HTML: '<li>This list item is nested: 
      <ol> 
        <li>List Item Four A.</li> 
        <li>List Item Four B.</li> 
      </ol> 
      Yes it is. 
  </li>' 
        inner HTML: '<li>List Item Four A.</li>' 
        inner HTML: '<li>List Item Four B.</li>' 
Begin outer for 
outer HTML: '<li>List Item Four A.</li>' 
        inner HTML: '<li>List Item Four A.</li>' 
        inner HTML: '<li>List Item Four B.</li>' 
Begin outer for 
outer HTML: '<li>List Item Four B.</li>' 
        inner HTML: '<li>List Item Four A.</li>' 
        inner HTML: '<li>List Item Four B.</li>' 
Begin outer for 
outer HTML: '<li>List Item Five</li>' 
        inner HTML: '<li>List Item Four A.</li>' 
        inner HTML: '<li>List Item Four B.</li>' 
Questions:
  1. I can get the title text just fine, but how do I get the introductory text or the footer? They don't belong to an HTML element I can select.
  2. The outer foreach loop iterates through both the outer and the inner ordered lists. How do I change the XPath string so that the outer for loop only iterates through the outer list? The inner for loop should take care of the inner list.
Jul 30, 2014 at 10:01 PM
The subject should say "Need help with SelectNodes", not "Need help with SelectBodes".
Nov 13, 2014 at 3:16 PM
Not sure if you need help with this any more, but I figured I'd answer since it should be easy enough to fix.
  1. Text that is not wrapped in its own HTML tag, like the introductory text or the footer, is still parsed into nodes. Look for nodes whose Name is "#text" (their NodeType is HtmlNodeType.Text). In your sample they are direct children of doc.DocumentNode, sitting beside the <span> and the <ol>.
  2. The "//" XPath operator selects descendants at any depth. Therefore, "//ol//li" matches every <li> that sits anywhere beneath any <ol> in the document, including the items of the nested list, which is why your outer loop visits them too. I would use something more like "/ol/li", which matches only the <li> elements directly beneath an <ol> at the top level of the document. (That works for your sample because the <ol> is not wrapped in <html> or <body>; for a full page the path would have to start from the list's actual parent.) Also, rather than running the inner XPath from the top of the document, as you are doing for the inner lists, use the outer list node as the starting point. Your code should be something more like this (only the selection code is shown):
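HtmlNodeCollection get_outer_lists = doc.DocumentNode.SelectNodes("/ol/li");

if (get_outer_lists != null)
{
  foreach (HtmlNode hn_outer in get_outer_lists)
  {
    // Relative path: only the <li> items of an <ol> directly inside this item.
    // SelectNodes returns null when this item has no nested list.
    HtmlNodeCollection get_inner_lists = hn_outer.SelectNodes("./ol/li");
  }
}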
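For question 1, here is a minimal sketch of reading the "#text" nodes. It assumes the loose text sits directly under the node you pass in, as it does in your sample. The class name TextNodeSketch and the helper DirectText are made up for illustration and are not part of HtmlAgilityPack. The markup is a shortened version of yours.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class TextNodeSketch
{
    // Hypothetical helper: returns the trimmed text of every non-blank
    // text node that is a direct child of the given node.
    public static List<string> DirectText(HtmlNode parent)
    {
        var result = new List<string>();
        foreach (HtmlNode child in parent.ChildNodes)
        {
            // Loose text is parsed into nodes of type Text (Name "#text").
            if (child.NodeType != HtmlNodeType.Text) continue;

            string text = child.InnerText.Trim();
            if (text.Length > 0) result.Add(text); // skip whitespace-only nodes
        }
        return result;
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<span>This is the title</span>. This is the introductory text: " +
                     "<ol><li>List Item One</li></ol> This is the footer text.");

        // The text between </span> and <ol> is a single node, so it
        // starts with the "." that follows the title.
        foreach (string text in DirectText(doc.DocumentNode))
        {
            Console.WriteLine("Text: '" + text + "'");
        }
    }
}
```

The same helper should also work on hn_outer inside the outer loop to pick up the text above and below a nested list ("This list item is nested:" and "Yes it is."), since those are direct children of that <li>.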