Attribute Values Containing Quotes

Topics: Developer Forum, User Forum
Jul 31, 2009 at 3:50 PM

I am trying to parse an HTML document that contains double quote characters (") inside quoted attribute values.  This is of course causing the HtmlAgilityPack to truncate the attribute value.  Does anyone have any ideas on how to handle this?  I don't mind if I have to modify the source code for HtmlAgilityPack and run a modified version of it.  Here is a snippet of the HTML that I am dealing with:

<meta name="description" content="Hooden sweatshirt with 6" drawstrings.  Buy now at half price!" />

Thanks for any help with this.

Jul 31, 2009 at 4:10 PM
Edited Jul 31, 2009 at 4:10 PM

That is because one of the most basic rules of HTML is that quotes delimit attribute values: http://www.w3schools.com/HTML/html_attributes.asp A browser that hits an unescaped quote inside a quoted value falls back on error recovery instead of treating the quote as part of the value.

When parsing, Html Agility Pack treats quotes as the start and end of a quoted attribute value, so the first inner quote ends the value. You should use the quote HTML entity, &quot;, in your string instead.

HAP and any browser will see this as

tag = Meta

attribute["name"] = "description";
attribute["content"] = "Hooden sweatshirt with 6";
attribute["drawstrings."] = "";
attribute["Buy"] = "";
attribute["now"] = ""; .. you get the picture
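
For whoever generates the markup, the &quot; advice above could look roughly like the sketch below. It is not part of HAP: it only uses WebUtility.HtmlEncode from the .NET base library, and the class and method names (AttributeEscapingSketch, BuildDescriptionMeta) are made up for illustration.

```csharp
using System;
using System.Net;

public static class AttributeEscapingSketch
{
    // Hypothetical helper: builds a <meta> tag whose content value is entity-encoded.
    public static string BuildDescriptionMeta(string description)
    {
        // WebUtility.HtmlEncode turns " into &quot; (and also encodes <, > and &),
        // so the value can no longer end the attribute early.
        string encoded = WebUtility.HtmlEncode(description);
        return "<meta name=\"description\" content=\"" + encoded + "\" />";
    }

    public static void Main()
    {
        Console.WriteLine(BuildDescriptionMeta(
            "Hooden sweatshirt with 6\" drawstrings.  Buy now at half price!"));
    }
}
```

With the inch mark written as 6&quot;, a parser sees a single content attribute, and decoding the value gives the original text back.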

Jul 31, 2009 at 4:30 PM

Unfortunately, this document was given to me by a client, and it is approximately 70MB in size.  Editing the document is not really an option.

Jul 31, 2009 at 4:36 PM

70MB html document? ouch

Not sure what you can do other than editing it, or writing a small program that reads it as a stream and fixes that one line. Loading all of that into memory is going to be expensive. I've never used HAP on a document that large, so I'm not sure how it will handle it.
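
A minimal sketch of that small program, assuming the file can be processed one line at a time and that the bad quotes all look like inch marks (a digit, then ", then more plain text). The regex is a heuristic of my own, not anything HAP does, and the names InchMarkFixer, EscapeInchMarks and FixFile are hypothetical. It will misfire on a numeric value followed by a valueless attribute (for example colspan="2" nowrap class="x"), so check the output before trusting it.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

public static class InchMarkFixer
{
    // Heuristic: a quote right after a digit is treated as an inch mark when the
    // next word is NOT an attribute name, i.e. it is not followed by '=' (or by
    // the end of the tag).
    private static readonly Regex StrayInchMark = new Regex(
        @"(?<=\d)""(?=\s+[\w:-]+(?![\w:-])\s*[^=\s>/])",
        RegexOptions.Compiled);

    // Replaces suspected inch marks in one line with the &quot; entity.
    public static string EscapeInchMarks(string line)
    {
        return StrayInchMark.Replace(line, "&quot;");
    }

    // Copies inputPath to outputPath line by line, so the whole 70MB file is
    // never held in memory. Line endings are rewritten to the platform default.
    public static void FixFile(string inputPath, string outputPath)
    {
        using (var reader = new StreamReader(inputPath))
        using (var writer = new StreamWriter(outputPath))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                writer.WriteLine(EscapeInchMarks(line));
            }
        }
    }

    public static void Main(string[] args)
    {
        if (args.Length != 2)
        {
            Console.Error.WriteLine("usage: InchMarkFixer <input.html> <output.html>");
            return;
        }
        FixFile(args[0], args[1]);
    }
}
```

Writing to a second file rather than editing in place keeps the client's original around for comparison.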

Another thing you can do is add some code to HAP to look for that 6" as it's parsing and ignore it/change it.

WordPad does a decent job of editing large files. It might take a few minutes to load, but I've had it load multi-hundred-megabyte files before.

But if that's just the head, I'd be afraid of what the rest of it looks like

Jul 31, 2009 at 4:44 PM

I think I've got it.

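// Start from the truncated value HAP reports for content.
string description = descNode.Attributes["content"].Value;

// More than the two real attributes (name, content) means the value was split
// at an inner quote, so put that quote back.
if (descNode.Attributes.Count > 2)
{
    description += "\"";
}

// Every other "attribute" is really a word from the rest of the value.
foreach (HtmlAttribute attrib in descNode.Attributes)
{
    if (attrib.Name != "name" && attrib.Name != "content")
    {
        // The last name carries the value's closing quote, so trim it.
        description += " " + attrib.Name.TrimEnd('"');
    }
}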

This works because the words after the inch sign are read by HAP as attributes with no values.

The only thing I had to do to get this to work is remove the ToLower() from line 81 of HtmlAttribute.cs from HAP so that the attribute names don't get set to lower case.

_name = _ownerdocument._text.Substring(_namestartindex, _namelength).ToLower();

Jul 31, 2009 at 4:47 PM

The other thing I forgot to mention is that I trim the quote character from the end because HAP thinks it is part of the attribute name of the last attribute.

Jul 31, 2009 at 4:49 PM

Yep, I removed the ToLower in the patch I've submitted. The main point of my patch was to add LINQ compatibility to the library, but that's one of the other things I fixed. You can get the original-case name through the OriginalName property.

What's even cooler is that you could change that last part to do something like String.Join(" ", tag.Attributes.Skip(2).Select(x=>x.OriginalName).ToArray())
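
Put together, the whole rebuild might look like the sketch below. To keep it runnable without HAP it takes plain strings. Rebuild and SplitValueSketch are hypothetical names, and the sample names in Main are assumed to be what HAP hands back for the meta tag in this thread, based on the description above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SplitValueSketch
{
    // Hypothetical helper: truncatedValue is what the parser reported for the
    // attribute, strayNames are the names of the bogus attributes after it.
    public static string Rebuild(string truncatedValue, IEnumerable<string> strayNames)
    {
        string[] names = strayNames.ToArray();
        if (names.Length == 0)
        {
            return truncatedValue; // nothing was split off
        }

        // Put back the quote that ended the value early, join the words, and
        // trim the closing quote that is glued to the last name.
        return truncatedValue + "\" " + string.Join(" ", names).TrimEnd('"');
    }

    public static void Main()
    {
        var stray = new[] { "drawstrings.", "Buy", "now", "at", "half", "price!\"" };
        Console.WriteLine(Rebuild("Hooden sweatshirt with 6", stray));
    }
}
```

With HAP, the call would be something like Rebuild(tag.Attributes["content"].Value, tag.Attributes.Skip(2).Select(x => x.OriginalName)). Note that runs of whitespace in the original value collapse to single spaces, since the parser drops them between attribute names.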