Html Agility Pack mis-parses unclosed <li> and <p> elements

Apr 27, 2010 at 10:21 AM
Edited Apr 27, 2010 at 10:25 AM
While trying to reconstruct simple HTML content, I ran into the problem that HAP mis-parses certain commonly unclosed elements.



For example:

<html><body>
<ul>
  <li>TestElem1
  <li>TestElem2
  <li>TestElem3 List:
      <ul>
          <li>Nested1
          <li>Nested2</li>
          <li>Nested3
      </ul>
  <li>TestElem4
</ul>
<p>paragraph 1
<p>paragraph 2
<p>paragraph 3
</body></html>

is mis-parsed as (using XML-style formatting with manually edited whitespace to highlight the problematic nesting):

<?xml version="1.0" encoding="iso-8859-1"?><html><body>
<ul>
    <li>TestElem1
        <li>TestElem2
            <li>TestElem3 List:
                <ul>
                    <li>Nested1
                        <li>Nested2</li>
                        <li>Nested3</li>
                    </li>
                </ul>
                <li>TestElem4</li>
            </li>
        </li>
    </li>
</ul>
<p />paragraph 1
<p />paragraph 2
<p />paragraph 3
</body></html>

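This of course isn't correct. An unclosed li tag should be closed when the next li tag in the same ul is opened, or at the end of that ul, so the items end up as siblings rather than nested inside each other. Likewise, a p tag should stay open until the next p (or other block-level?) tag, so that the paragraph text ends up inside the p element instead of after an empty <p />.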
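
To make the expected behaviour concrete, here is a minimal sketch of the closing rule described above, written in Python with the standard library's html.parser rather than HAP. It is not how HAP works internally. The names ImpliedEndTagNormalizer, normalize, P_CLOSERS, LIST_TAGS and VOID_TAGS are made up for this illustration, and the tag sets are a deliberately small assumption, not the full list from the HTML specification.

```python
import html
from html.parser import HTMLParser

# Assumed, simplified rule tables for this sketch (not the complete HTML spec lists).
P_CLOSERS = {"p", "ul", "ol", "li", "div", "table", "pre", "blockquote",
             "h1", "h2", "h3", "h4", "h5", "h6"}   # start tags that end an open <p>
LIST_TAGS = {"ul", "ol"}                           # an <li> only closes an <li> of the same list
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}  # never have an end tag


class ImpliedEndTagNormalizer(HTMLParser):
    """Re-emits markup with explicit end tags for unclosed <li> and <p>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []  # names of currently open elements
        self.out = []    # output fragments

    def _pop_through(self, index):
        # Close every open element from the top of the stack down to `index`.
        while len(self.stack) > index:
            self.out.append(f"</{self.stack.pop()}>")

    def _close_in_scope(self, target, boundaries):
        # Search from the innermost element outwards; stop at a boundary tag.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i] == target:
                self._pop_through(i)
                return
            if self.stack[i] in boundaries:
                return

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            # A new <li> closes the previous <li> of the same list only.
            self._close_in_scope("li", LIST_TAGS)
        if tag in P_CLOSERS:
            # A new block-level element closes any open <p>.
            self._close_in_scope("p", ())
        rendered = "".join(
            f' {name}' if value is None else f' {name}="{html.escape(value)}"'
            for name, value in attrs
        )
        self.out.append(f"<{tag}{rendered}>")
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # An explicit end tag also closes anything still open inside it;
        # a stray end tag with no matching open element is ignored.
        self._close_in_scope(tag, ())

    def handle_data(self, data):
        self.out.append(html.escape(data, quote=False))


def normalize(markup):
    parser = ImpliedEndTagNormalizer()
    parser.feed(markup)
    parser.close()          # flush any buffered trailing text
    parser._pop_through(0)  # close whatever is still open at end of input
    return "".join(parser.out)


if __name__ == "__main__":
    print(normalize("<ul><li>A<li>B</ul><p>one<p>two"))
```

With this rule, the short input in the last line comes out as <ul><li>A</li><li>B</li></ul><p>one</p><p>two</p>: the li elements are siblings and each paragraph's text sits inside its p. That is the shape I would expect HAP to produce for the example above.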