How to parse these tags?

Topics: Developer Forum, User Forum
Mar 30, 2011 at 3:21 PM

I'm having a problem parsing an HTML file.
Here is an HTML snippet:


 <dt>header 1</dt>
 <dd>body text 1</dd>

 <dt>header 2</dt>
 <dd>body text 2</dd>

 <dt>header 3</dt>
 <dd>body text 3</dd>


I need to get the text of each dd tag into a separate variable. I can find a dt tag with this pattern: ".//dt[.='header 1']"
If the dd were a child of the dt, there would be no problem: I would just add /dd to the pattern. But it isn't, so I have a problem ;)

Also, the number of dt and dd tags is not fixed; it can change.
Does anyone have any ideas? A VB .NET sample, please, if possible :)


Mar 31, 2011 at 9:41 PM

You should be able to do the following to select all the dd tags:

var ddNodes = htmlDocument.DocumentNode.SelectNodes("//dd");

Hope this helps.

Mar 31, 2011 at 10:49 PM

Thanks Roux. That will get all the dd tags, but how do I know which is which?

Here is another sample so you can see what I mean.

Say sometimes there is an address and other times there isn't, or there is no last name.
How do I figure out which is which so I can save each value into the correct field in my db?


 <dt>First Name</dt>

 <dt>Last Name</dt>

 <dd>Elm Street</dd>



Mar 31, 2011 at 11:25 PM

Well, then you could do it the way you were already doing it. Look for the dt node with the correct value, then get its next sibling and grab the value out of that. For example:

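// Look for the "First Name" dt, then skip any whitespace text nodes to reach the next element.
string firstName = null;
var node = document.DocumentNode.SelectSingleNode("//dt[.='First Name']");
var next = node != null ? node.NextSibling : null;
while (next != null && next.NodeType != HtmlNodeType.Element)
    next = next.NextSibling;
if (next != null && next.Name == "dd")
    firstName = next.InnerText;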

You would have to do this for each field you want. I would recommend putting this in a function that takes the value you are looking for and returns the InnerText of the dd node that follows the matching dt.
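
A minimal sketch of such a function, assuming HtmlAgilityPack is referenced. The names DefinitionList and GetValue are made up for illustration and are not part of the library. It compares the dt text in C# rather than building an XPath string from the label, and it steps over the whitespace text nodes that usually sit between the closing dt and the opening dd. If the next element is another dt, the field is treated as having no value.

```csharp
using System;
using HtmlAgilityPack;

static class DefinitionList
{
    // Hypothetical helper (not an HtmlAgilityPack API): returns the text of the
    // <dd> that directly follows the <dt> whose text equals `label`, or null.
    public static string GetValue(HtmlDocument document, string label)
    {
        var terms = document.DocumentNode.SelectNodes("//dt");
        if (terms == null) return null; // SelectNodes returns null when nothing matches

        foreach (var dt in terms)
        {
            var text = HtmlEntity.DeEntitize(dt.InnerText).Trim();
            if (!string.Equals(text, label, StringComparison.OrdinalIgnoreCase))
                continue;

            // Skip text and comment nodes between </dt> and the next element.
            var next = dt.NextSibling;
            while (next != null && next.NodeType != HtmlNodeType.Element)
                next = next.NextSibling;

            // Accept only a <dd>; another <dt> means this field has no value.
            if (next != null && next.Name == "dd")
                return HtmlEntity.DeEntitize(next.InnerText).Trim();
            return null;
        }
        return null;
    }
}
```

With a document loaded through document.LoadHtml(...), a call would look like DefinitionList.GetValue(document, "Address"). A null result means the field is missing, so the caller can store NULL or a default in the db.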

Hope this helps.
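
Since the number of dt and dd tags is not fixed, another option is to read every pair in one pass and look fields up by name afterwards. This is only a sketch under the same assumptions: HtmlAgilityPack is referenced, and DefinitionListReader and ReadPairs are illustrative names, not library members. It starts from each dd and walks back to the nearest preceding element, so a dt without a dd simply does not appear in the result.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class DefinitionListReader
{
    // Hypothetical helper: maps each <dt> text to the text of the <dd> that
    // immediately follows it. Labels without a <dd> are left out.
    public static Dictionary<string, string> ReadPairs(HtmlDocument document)
    {
        var pairs = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);

        var values = document.DocumentNode.SelectNodes("//dd");
        if (values == null) return pairs; // no <dd> elements at all

        foreach (var dd in values)
        {
            // Walk back over whitespace and comments to the previous element.
            var previous = dd.PreviousSibling;
            while (previous != null && previous.NodeType != HtmlNodeType.Element)
                previous = previous.PreviousSibling;

            // Ignore a <dd> that is not directly preceded by a <dt>.
            if (previous == null || previous.Name != "dt")
                continue;

            var key = HtmlEntity.DeEntitize(previous.InnerText).Trim();
            pairs[key] = HtmlEntity.DeEntitize(dd.InnerText).Trim();
        }
        return pairs;
    }
}
```

A lookup then becomes pairs.TryGetValue("Address", out address), which returns false when the page has no address, so each db field can be filled or left empty without a separate XPath query per field.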