Closed

HtmlAgilityPack v1.4.3 parses tables wrong

description

I installed HtmlAgilityPack via NuGet, and it pulled in version 1.4.3.
 
This version has an error when handling tables!
 
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>    
    <title>hap test table</title> 
</head> 
<body>    
    <table>      
        <tr>        
            <td>foo</td>        
            <td>bar</td>     
        </tr>    
    </table>  
</body>
</html>
 
becomes
 
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 3.2//EN">
<html>
<head>    
    <title>hap test table</title>  
</head>  
<body>    
    <table>      
        <tr>        
            <td>foo                
            <td>bar
          </td></td></tr></table> 
</body>
</html>
 
If I go back to version 1.4.0 then it works like it should...
Closed Aug 23, 2012 at 2:56 PM by DarthObiwan

comments

owenberry wrote Feb 22, 2012 at 4:41 AM

I'm using 1.4.0 (judging by the Reference properties; installed from NuGet on 22/02/2012), but I'm still having a problem with a table not being read properly.

Viewing the source in Chrome or Firefox shows the table to be properly formatted, but the InnerHtml of a node further up shows HtmlAgilityPack removing the closing tags of the TH, TR and TD elements, so I'm having problems getting nodes out properly.

Ksu wrote Feb 22, 2012 at 11:30 AM

Oh my god, and I thought I was going mad... I'd been trying for hours to parse a table until I noticed the wrong closing td and tr tags in InnerHtml. This totally screwed up my XPath selectors and returned nodes that I didn't want at all.

I have tried versions 1.4.3 from NuGet, 1.4.0 and 1.4.0 Beta 2 from Codeplex. They all have this issue.

I hope this will get fixed soon. At first I was so excited about how easy it is to parse HTML with HtmlAgilityPack, until I tried to parse a table...

Ksu wrote Feb 22, 2012 at 3:31 PM

I have to correct myself: version 1.4.0 from Codeplex works correctly. The problem was that I had saved the loaded HTML files to disk with version 1.4.3 so I didn't have to fetch them over HTTP every time, so my local copies all had the messed-up td/tr closing tags. I fetched them all again with 1.4.0 and now it works.

majochaa wrote Mar 21, 2012 at 11:23 PM

I can confirm this with 1.4.3 from NuGet. I lost a whole day scratching my head until I noticed the mangled closing tags in InnerHtml.

Ksu wrote Mar 23, 2012 at 1:43 PM

@owenberry It removes the ending TH, TR and TD tags but then appends them at the end of InnerHtml!
I took a look at the source code but this is way over my head. Unfortunately I'm not that skilled a coder yet. I really hope the main devs will take care of this problem, since it makes the current version unusable for table parsing. So, go devs, go!!

DarthObiwan wrote Apr 27, 2012 at 1:56 AM

So I'm finally able to get back to working on HAP (it's been a few years of long hours and busy life). I am trying to repro this issue, and so far with 1.3, 1.4 and 1.4.3 I haven't been able to find any difference in either the InnerHtml or the WriteTo HTML output. Does anyone have a working example in code they could share that I could look at?

DarthObiwan wrote Apr 27, 2012 at 1:59 AM

Note that I'm using the information provided in this post, i.e. the original HTML that is supposed to demonstrate the problem. I wrote a program that saves the output to a text file, made 3 projects in my solution, each referencing a different version of the dll, and then compared the results. I've tried WriteTo() and InnerHtml so far.

tom103 wrote May 18, 2012 at 5:04 PM

It seems this problem is related to HtmlWeb; it occurs if you load the document with HtmlWeb.Load, but NOT if you load it with WebClient.OpenRead + HtmlDocument.Load...
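
The loading path that reportedly avoids the problem can be sketched roughly like this. It is only a minimal sketch, not an official workaround: the `LoadWithoutHtmlWeb` helper, the class name and the URL are placeholders made up for illustration, and only `WebClient.OpenRead` and `HtmlDocument.Load` are taken from the comment above.

```csharp
using System;
using System.IO;
using System.Net;
using HtmlAgilityPack;

static class TableLoadSketch
{
    // Hypothetical helper: parses a stream with a plain HtmlDocument,
    // so HtmlWeb is not involved in the load at all.
    public static HtmlDocument LoadWithoutHtmlWeb(Stream stream)
    {
        var doc = new HtmlDocument();
        doc.Load(stream);
        return doc;
    }

    static void Main(string[] args)
    {
        // Placeholder URL (an assumption); pass the real page as the first argument.
        string url = args.Length > 0 ? args[0] : "http://example.com/table.html";

        using (var client = new WebClient())
        using (Stream stream = client.OpenRead(url))
        {
            HtmlDocument doc = LoadWithoutHtmlWeb(stream);

            // Inspect the serialized markup for misplaced </td> and </tr> tags.
            Console.WriteLine(doc.DocumentNode.InnerHtml);
        }
    }
}
```

Keeping the parsing in a helper that takes a `Stream` means the same code can also be pointed at a file that was saved to disk earlier, which makes it easier to compare results between library versions.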

tom103 wrote May 18, 2012 at 5:31 PM

Further investigation in the code shows that HtmlWeb forces OptionFixNestedTags to true, which causes the issue. There should be a way to control these flags (PreHandleDocument occurs too late, so it's not a viable option).
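
For anyone who wants to check the effect of the flag without going through HtmlWeb, something along these lines might serve as a repro. This is a sketch under assumptions: the `Roundtrip` helper and the class name are invented for illustration, the markup is a cut-down version of the table from the description, and whether the two outputs actually differ will depend on which version of the dll is referenced.

```csharp
using System;
using HtmlAgilityPack;

static class FixNestedTagsSketch
{
    // Hypothetical helper: parses the markup with OptionFixNestedTags set
    // as requested and returns the re-serialized document.
    public static string Roundtrip(string html, bool fixNestedTags)
    {
        var doc = new HtmlDocument();
        doc.OptionFixNestedTags = fixNestedTags; // set before parsing so it applies to this load
        doc.LoadHtml(html);
        return doc.DocumentNode.OuterHtml;
    }

    static void Main()
    {
        // Table from the original report, reduced to the relevant part.
        const string html =
            "<table>\n  <tr>\n    <td>foo</td>\n    <td>bar</td>\n  </tr>\n</table>";

        string flagOff = Roundtrip(html, false);
        string flagOn = Roundtrip(html, true);

        Console.WriteLine("--- OptionFixNestedTags = false ---");
        Console.WriteLine(flagOff);
        Console.WriteLine("--- OptionFixNestedTags = true ---");
        Console.WriteLine(flagOn);
        Console.WriteLine(flagOff == flagOn ? "outputs are identical" : "outputs differ");
    }
}
```

Running the same program against each referenced version and diffing the printed output should show whether the flag alone is enough to trigger the mangled closing tags.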

DarthObiwan wrote Jun 5, 2012 at 11:48 PM

This will be fixed in the next release, and in svn soon.

emn13 wrote Aug 23, 2012 at 2:11 PM

This bug is also fixed by the patch to http://htmlagilitypack.codeplex.com/workitem/29218.

DarthObiwan wrote Aug 23, 2012 at 2:56 PM

This was fixed in 1.4.5, which is available via NuGet, CodePlex and the source code.