nodes and child nodes

Oct 15, 2009 at 1:10 PM

I am new to this library and chose it for its LINQ support. Thanks for making it open source!

I did some initial tests, and what puzzles me is that some nodes, like <TITLE> and <STYLE>, return their inner text as child nodes.

My code:


    // Dump the outer HTML of every descendant node to a text file.
    using( StreamWriter sw = File.CreateText( "c:\\out.txt" ) )
    {
        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        doc.Load( @"C:\Temp\index.html" );
        var results = from node in doc.DocumentNode.Descendants()
                      // where node.HasChildNodes == false
                      select node;

        foreach( HtmlNode node in results )
        {
            // if( !node.HasChildNodes )
            {
                sw.WriteLine( node.OuterHtml );
                sw.WriteLine( "++++++++++++++++++++++++++++++" );
            }
        }
    }

As you can see, I simply take all descendants of the document node and write them to a text file.

However, a node like

<TITLE>M$$</TITLE>

actually yields two nodes: the element above, and M$$ as its child node.

If I uncomment "if( !node.HasChildNodes )", I only see M$$. The same goes for the <STYLE> node in the sample mshome.htm.

This seems wrong to me, or am I missing something?

Oct 15, 2009 at 2:03 PM

HAP creates a node for any text that is not a tag. You'll see it has a node type of #text. It does this to do its best to preserve formatting. Say you have a newline after M$$: that newline is kept as part of the text node.

I've been on the fence about this behavior myself. For nodes like title, it really should just have the value in InnerHtml/InnerText. But that would make the way you access what is inside a node inconsistent. I've been trying to think of ways to make this a little more like LINQ to XML, but in the long run I would have to break backwards compatibility, so work on this would be slated for a major release.
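
For the original question, one way to skip the #text nodes with the current behavior is to filter the descendants on NodeType. The following is only a minimal sketch, not an official recommendation: the class name NodeFilterSketch and the helper ElementsOnly are hypothetical names made up for illustration, and the HTML string is a stand-in for the file loaded in the first post.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class NodeFilterSketch
{
    // Hypothetical helper: returns only element nodes,
    // leaving out #text and comment nodes.
    public static List<HtmlNode> ElementsOnly( string html )
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml( html );

        return doc.DocumentNode.Descendants()
                  .Where( n => n.NodeType == HtmlNodeType.Element )
                  .ToList();
    }

    public static void Main()
    {
        string html = "<html><head><title>M$$</title></head></html>";

        foreach( HtmlNode node in ElementsOnly( html ) )
        {
            // InnerText still exposes the text even though the
            // #text child node itself has been filtered out.
            Console.WriteLine( node.Name + ": " + node.InnerText );
        }
    }
}
```

With this filter the title element is reported once, and its text stays reachable through InnerText, so the separate #text entry no longer shows up in the output.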