Sloppy HTML doesn't parse

May 3, 2010 at 2:01 PM
Edited May 4, 2010 at 1:42 PM

I have something like this

<div name='locations';>
<p>&nbsp;</p>
<span>some text</span>
<p>
    <select>
        <option></option>    <=== many option items here
    </select>
</div>

The second <p> does not have a closing tag because of sloppy coding. I'm trying to get the <option> values inside the <select>. I'm using

doc.DocumentNode.SelectNodes( "//div[@name='locations']/p/select");

However, this doesn't work and I get a null back. I can do:

doc.DocumentNode.SelectNodes("//div[@name='locations']/p");

and get back two nodes, but the second one, which should contain the <select>, has an empty InnerHtml and an empty InnerText.

So it seems that Html Agility Pack isn't cleaning up the HTML before running the XPath query. IIRC the documentation said that Html Agility Pack is 'tolerant' of malformed HTML, but I'm not sure whether it corrects malformed markup in the DOM. Do I need to pass the HTML through Tidy.NET first before handing it to Html Agility Pack?

Or am I doing something wrong? Can someone set me straight?

May 4, 2010 at 1:38 PM
Edited May 4, 2010 at 1:44 PM

I've now run the HTML through Tidy.NET before handing it over to Html Agility Pack, and I keep getting the same result: it can't find the nodes I specify in the XPath query.

I can do this:

doc.DocumentNode.SelectNodes( "//div[@name='locations']");

and get a single node whose inner HTML contains the two <p> elements and the single <span>.

But as soon as I do this:

doc.DocumentNode.SelectNodes( "//div[@name='locations']/p");

I get two <p> nodes: one contains the &nbsp; and the other has an *empty* inner HTML!

May 5, 2010 at 2:36 PM

Hello BigPilot,

<div name='locations';>
<p>&nbsp;</p>            <- p node #1
<span>some text</span>
<p>                      <- p node #2
    <select>
        <option></option>    <=== many option items here
    </select>
</div>

//div[@name='locations']/p/select <- won't run because p node #1 doesn't have a select child node.

Perhaps it's better to use:

doc.DocumentNode.SelectNodes("//div[@name='locations']//select");
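
As a minimal sketch of that descendant-axis query, assuming the HtmlAgilityPack package is referenced. The sample markup (with the stray semicolon dropped), the city names, and the CountSelects helper are made up for illustration:

```csharp
using System;
using HtmlAgilityPack;

static class DescendantSelectDemo
{
    // Hypothetical helper: counts <select> elements anywhere below the div.
    public static int CountSelects(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // "//select" after the div step matches at any depth, so it does not
        // matter whether the parser put <select> inside the <p> or next to it.
        var nodes = doc.DocumentNode.SelectNodes("//div[@name='locations']//select");

        // SelectNodes returns null (not an empty collection) when nothing matches.
        return nodes == null ? 0 : nodes.Count;
    }

    static void Main()
    {
        // Made-up markup modelled on the question; the second <p> is never closed.
        const string html =
            "<div name='locations'>" +
            "<p>&nbsp;</p>" +
            "<span>some text</span>" +
            "<p>" +
            "<select><option>Amsterdam</option><option>Berlin</option></select>" +
            "</div>";

        Console.WriteLine(CountSelects(html));
    }
}
```

The null check matters here: a query that matches nothing comes back as null, which is what the original question ran into.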

Let us know how it goes... once you find the solution

Coordinator
May 6, 2010 at 8:36 AM

Hi everyone (yes, I'm still alive :)


bigpilot, the Html Agility Pack *is* cleaning the HTML you gave.


But... by default, it's tailored for HTML 3.x, and in HTML 3.x you *don't always have* to close tags. That means a <p> on its own is perfectly valid, so it's closed automatically when no corresponding </p> is found. If you try the same HTML in a browser, you will see that browsers behave exactly like this (unless you set a DOCTYPE that asks for stricter parsing).


So the parsed tree is like this:


+div
  +p
  +span
  +p
  +select
    +option


Here, the <select> is not a child of the <p> but its next sibling. You can get the <select> with this XPath: //div[@name='locations']/select, or with what kurtnelle suggested.

Now, you can tweak the Html Agility Pack to better suit what you expect using the HtmlNode.ElementsFlags static property (please search for this in this forum for more information, or have a look into HtmlNode.cs). What you can do is tell it you don't want to support unclosed <p> tags:

            HtmlNode.ElementsFlags.Remove("p"); // remove the Empty and Closed flags
            HtmlDocument doc = new HtmlDocument();
            doc.Load(...);

And bingo, the pack has closed the malformed <p> because it's not valid anymore, and your original xpath works, because now the parsed tree is:

+div
  +p
  +span
  +p
    +select
      +option
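
A minimal end-to-end sketch of that tweak, assuming the HtmlAgilityPack package is referenced. The sample markup and the CountSelectsUnderP helper are made up for illustration. Because ElementsFlags is static, the removal has to happen before the document is parsed, and it affects every document parsed afterwards in the same process. Default flags may also differ between versions of the pack, so it is worth checking the result against the version you use:

```csharp
using System;
using HtmlAgilityPack;

static class StrictParagraphDemo
{
    // Hypothetical helper: runs the original XPath after removing the <p> flags.
    public static int CountSelectsUnderP(string html)
    {
        // Stop treating <p> as a tag that may be left unclosed.
        // Remove is harmless if the key is not present.
        HtmlNode.ElementsFlags.Remove("p");

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // The original query from the question: <select> as a direct child of <p>.
        var nodes = doc.DocumentNode.SelectNodes("//div[@name='locations']/p/select");

        // SelectNodes returns null when nothing matches.
        return nodes == null ? 0 : nodes.Count;
    }

    static void Main()
    {
        // Made-up markup modelled on the question; the second <p> is never closed.
        const string html =
            "<div name='locations'>" +
            "<p>&nbsp;</p>" +
            "<span>some text</span>" +
            "<p>" +
            "<select><option></option></select>" +
            "</div>";

        Console.WriteLine(CountSelectsUnderP(html));
    }
}
```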


Cheers!

May 10, 2010 at 12:56 PM

@kurtnelle: thanks for your advice. My XPath query was indeed incorrect. Got it working now.

@simonm: thanks for the advice. I later learned that a paragraph closing tag is indeed not mandatory, so it seems that Html Agility Pack was not to blame.