Problem with HtmlNode.Descendants()

Nov 30, 2009 at 11:36 AM
Edited Nov 30, 2009 at 11:37 AM

I'm parsing some pages that contain a <UL> list, and I'm interested in one specific <LI> item. On some pages that <LI> contains three <SPAN> tags; on others it contains two <SPAN> tags and an <A> in place of the third <SPAN>.

So I figured I'd just call HtmlNode.Descendants().ToList() without any string parameter and take the third item from the list. The problem is that this returns 10 items! The extra items are the \n and \t characters from the raw HTML:

<li>
<span>Release:</span>
<span>
<span>Nov 3, 2009</span> </span>
</li>

Picture that markup, but with messier whitespace than a human would write. So my question is: is this by design, or is it a bug? And how can I work around it?

Nov 30, 2009 at 9:57 PM

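This is by design. HtmlAgilityPack creates nodes for all of the text in between tags so that it can preserve the original formatting. You can filter those nodes out with a LINQ Where on each node's NodeType property (an HtmlNodeType value), for example by keeping only the nodes whose NodeType is HtmlNodeType.Element.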
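A minimal sketch of that filter, in case it helps. The class name ElementOnlyDescendants and the helper ElementsUnder are made up for this illustration and are not part of HtmlAgilityPack; the markup is the snippet from the question, wrapped in a <ul> as an assumption about the surrounding page.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class ElementOnlyDescendants
{
    // Hypothetical helper (not a library method): returns only the element
    // descendants of a node, skipping text and comment nodes.
    public static List<HtmlNode> ElementsUnder(HtmlNode node)
    {
        return node.Descendants()
                   .Where(n => n.NodeType == HtmlNodeType.Element)
                   .ToList();
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        // The <li> from the question; the enclosing <ul> is assumed.
        doc.LoadHtml("<ul><li>\n<span>Release:</span>\n<span>\n" +
                     "<span>Nov 3, 2009</span> </span>\n</li></ul>");

        HtmlNode li = doc.DocumentNode.SelectSingleNode("//li");
        List<HtmlNode> elements = ElementsUnder(li);

        // Unfiltered count includes the whitespace text nodes.
        Console.WriteLine(li.Descendants().Count());
        // Filtered count covers the tags only.
        Console.WriteLine(elements.Count);
        // Third element descendant, whether it is a <span> or an <a>.
        Console.WriteLine(elements[2].Name + ": " + elements[2].InnerText.Trim());
    }
}
```

Note that Descendants() walks the whole subtree, so a nested <span> is counted as well. If only the direct children of the <LI> matter, the same Where filter can be applied to li.ChildNodes instead.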

I'll look into changing Descendants() so that it skips those nodes by default, and adding a new method that returns everything. The catch is that the extra data may be exactly what someone else is looking for.

Nov 30, 2009 at 11:25 PM

Thanks for the reply. I never looked at the NodeType property, so my bad.