
Strange behaviour when encountering unclosed non-HTML tags

Topics: Developer Forum, User Forum
Dec 20, 2009 at 7:07 PM

Given this example HTML:
    <p><strong>Elem_A</strong>String_A1_1 String_A1_2</p>
    <p><strong>Elem_B</strong>String_B1_1 String_B1_2</p>
    <p><strong>Elem_A</strong>String_A2_1 <String_A2_2> String_A2_3</p>
    <p><strong>Elem_B</strong>String_B2_1 String_B2_2</p>

When I try to get the p nodes using the XPath "//p", the result is as follows:

    lststrText[0] = "<strong>Elem_A</strong>String_A1_1 String_A1_2"
    lststrText[1] = "<strong>Elem_B</strong>String_B1_1 String_B1_2"
    lststrText[2] = ""
    lststrText[3] = ""
    lststrText[4] = "<strong>Elem_B</strong>String_B2_1 String_B2_2"

But I think this is wrong: <String_A2_2> is not a known HTML tag, so it should be treated as if it does not need a closing tag.
I tried playing with all the option flags, but I couldn't change this behaviour.
What I figured out is that the Html Agility Pack decides that the <String_A2_2> tag actually encloses "</p><p><strong>Elem_B</strong>String_B2_1 String_B2_2</p>".

I tried SgmlReader, and it did fix <String_A2_2> to <String_A2_2></String_A2_2>.

So the question is: is there a way to make the Html Agility Pack treat non-HTML elements as not requiring a closing tag?


Below is the code that I used to get this result:


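HtmlAgilityPack.HtmlDocument objHtmlDocument = new HtmlAgilityPack.HtmlDocument();
objHtmlDocument.LoadHtml(strHtml); // assumed: strHtml holds the example HTML above
HtmlAgilityPack.HtmlNodeCollection colnodePs = objHtmlDocument.DocumentNode.SelectNodes("//p");
List<string> lststrText = new List<string>();
foreach (HtmlAgilityPack.HtmlNode nodeP in colnodePs)
    lststrText.Add(nodeP.InnerHtml); // assumed loop body, inferred from the output above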
Dec 23, 2009 at 4:51 PM

Don't play with the Options flags; play with the elements flags instead (HtmlNode.ElementsFlags, defined in HtmlNode.cs). HTML is not XML: it allows overlapping tags, unclosed tags, and so on. See the other posts about the elements flags.
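
For illustration, here is a minimal sketch of what that could look like for the example in the question. It rests on two assumptions that are not confirmed in this thread: that a new entry can be registered in HtmlNode.ElementsFlags under the lowercase tag name, and that HtmlElementFlag.Empty (the flag used for tags such as br or img) makes the parser treat <String_A2_2> as an element with no children. The class name ElementsFlagsSketch is made up for the example, and the behaviour may differ between versions of the library.

```csharp
using System;
using HtmlAgilityPack;

class ElementsFlagsSketch
{
    static void Main()
    {
        // Assumption: flagging the unknown tag as Empty tells the parser it never
        // has children, so it should no longer swallow the markup that follows it.
        // The key is lowercase because the library works with lowercased element names.
        HtmlNode.ElementsFlags["string_a2_2"] = HtmlElementFlag.Empty;

        // The two problematic paragraphs from the question.
        string html =
            "<p><strong>Elem_A</strong>String_A2_1 <String_A2_2> String_A2_3</p>" +
            "<p><strong>Elem_B</strong>String_B2_1 String_B2_2</p>";

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches, so guard before looping.
        HtmlNodeCollection paragraphs = doc.DocumentNode.SelectNodes("//p");
        if (paragraphs == null)
            return;

        // Print each paragraph's inner HTML to check how the tree was built.
        foreach (HtmlNode p in paragraphs)
            Console.WriteLine(p.InnerHtml);
    }
}
```

Note that ElementsFlags is static, so the entry has to be set before LoadHtml is called and it then applies to every document parsed in the process.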