Reading from MSWord-generated HTML

Topics: User Forum
Sep 5, 2006 at 8:57 PM
I'm trying to use HtmlAgilityPack to scan Word documents saved as HTML. There are a couple of things Word does that confound HtmlAgilityPack, and before I invest too much time in a solution, I thought I'd see if you knew of other folks who had solved it already.

When you use Word to create a document and "Save As" HTML, Word puts blocks like this into the <head>:

<!--[if gte mso 9]><xml>
 <o:CustomDocumentProperties>
  <o:Author>Steve Benz</o:Author>
  <o:TestText dt:dt="string">Yowsa.</o:TestText>
  <o:TestBool dt:dt="boolean">1</o:TestBool>
  <o:TestNum dt:dt="float">56</o:TestNum>
 </o:CustomDocumentProperties>
</xml><![endif]-->

I'd like to use the HtmlAgilityPack to get at some of that data, but I'm having a couple of problems. First, it's hidden from HtmlAgilityPack's scanner by the surrounding <!-- ... --> comment markers. With those comments there, I'm unable to get at the <xml> tag or anything underneath it.

What I considered doing to thwart that was to create a translation StreamReader, which would read my HTML file and strip out all those markers so that HtmlAgilityPack could actually see <xml.../>. For test purposes, I manually removed the comments from my test file and tried again.
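If I do go that route, it might not even need a custom StreamReader; a simple regex pass over the file could strip the markers before parsing. This is just a sketch I haven't actually run, and the regex only targets the conditional-comment markers:

// Hypothetical preprocessor: remove just the conditional-comment markers so
// the parser can see the <xml> element inside.
// (using System.IO, System.Text.RegularExpressions, and HtmlAgilityPack)
string html = File.ReadAllText( "test.htm" );
html = Regex.Replace( html, @"<!--\[if [^\]]*\]>|<!\[endif\]-->", "" );
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml( html );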

At this point, I got an exception when I tried to do SelectNodes( "//o:CustomDocumentProperties/*" ). That's because the XPath processor underlying the Agility Pack demands that the "o" namespace be resolvable. I'm no expert on XPath, but it seemed that the way to fix it was to alter HtmlNode's SelectNodes method so that it could take an IXmlNamespaceResolver argument to pass to XPathNavigator's Select method. With that in place, it no longer throws the exception, but it doesn't find the nodes either.

I could use a preprocessor to clean those namespace qualifiers from the stream so that the agility pack could process it. But surely there's a more elegant solution than that...

Has anybody seen this problem before or otherwise know of a nice way to solve this?
Sep 7, 2006 at 8:18 PM

hmmm. many questions!

First of all, the Html Agility Pack parses comments (<!-- --> markup) and creates HtmlCommentNode objects for them. So you can get all the comment nodes, read their text, and load an XmlDocument on each. (Note the XML does not seem valid; you may have to add the namespace URI for the o: prefix manually.)
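Something along these lines, for instance. This is a rough sketch I haven't verified; the namespace URIs are the ones Word typically emits, so check them against your file:

using System;
using System.Xml;
using HtmlAgilityPack;

// Rough sketch: find the comment nodes, pull the <xml> payload out of each,
// and parse it with an XmlDocument.
HtmlDocument doc = new HtmlDocument();
doc.Load( "test.htm" );

HtmlNodeCollection comments = doc.DocumentNode.SelectNodes( "//comment()" );
if (comments != null)
{
    foreach (HtmlNode node in comments)
    {
        string text = ((HtmlCommentNode)node).Comment;
        int start = text.IndexOf( "<xml>" );
        int end = text.LastIndexOf( "</xml>" );
        if (start < 0 || end <= start)
            continue;

        // Wrap the fragment in an element that declares the prefixes;
        // otherwise LoadXml rejects the undeclared o: and dt: prefixes.
        // (These URIs are what Word usually emits - check your file.)
        string xml = "<root xmlns:o=\"urn:schemas-microsoft-com:office:office\""
            + " xmlns:dt=\"uuid:C2F41010-65B3-11d1-A29F-00AA00C14882\">"
            + text.Substring( start, end + 6 - start ) + "</root>";

        XmlDocument xmldoc = new XmlDocument();
        xmldoc.LoadXml( xml );

        XmlNamespaceManager nsMgr = new XmlNamespaceManager( xmldoc.NameTable );
        nsMgr.AddNamespace( "o", "urn:schemas-microsoft-com:office:office" );
        foreach (XmlNode prop in xmldoc.SelectNodes( "//o:CustomDocumentProperties/*", nsMgr ))
            Console.WriteLine( prop.LocalName + " = " + prop.InnerText );
    }
}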

Concerning HTML namespaces and prefixes, the Html Agility Pack has very limited support for them. Basically, the : character is seen just like any other character (not like in XML). Have you tried a syntax along these lines?
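This is just a guess at the shape of such a query: since the pack keeps the colon as part of the node name (and lowercases names), matching the name literally might work, assuming the comment markers have already been stripped:

// Hypothetical: treat "o:customdocumentproperties" as a literal node name,
// colon and all, via the XPath name() function.
HtmlNodeCollection props =
    doc.DocumentNode.SelectNodes( "//*[name()='o:customdocumentproperties']/*" );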


It may work, but I prefer the first version (getting the comment text and loading an XmlDocument on it).

Sep 12, 2006 at 6:13 PM
I agree with your notion of parsing out the comments and feeding them to the native XML parser. After all, the contents are supposed to be XML... Whether they are or not is another matter, but yeah, that's the idea.

I'm posting the solution that I came up with in case anybody else needs it and maybe others can improve on what I've got.

My first problem was parsing out all the comments. I really thought I could do:

HtmlDocument doc = ...;
doc.DocumentNode.SelectNodes( "//#comment" );

but I guess that just shows that my understanding of XPath is far from complete. The only way I could figure to do it was to walk the tree myself. I crafted this function to help me with that:

delegate void _ForeachComment( HtmlCommentNode comment );

static void ForeachComment( HtmlNode rootNode, _ForeachComment doWhat )
{
    if (rootNode.NodeType == HtmlNodeType.Comment)
        doWhat( (HtmlCommentNode)rootNode );
    foreach (HtmlNode n in rootNode.ChildNodes)
        ForeachComment( n, doWhat );
}

If this really is the only way to find comments and text nodes, then I'd recommend adding some walker APIs to HtmlNode like the one shown above. They facilitate some nice-looking code:

ForeachComment( doc.DocumentNode, delegate( HtmlCommentNode n )
{
    int xmlStart = n.Comment.IndexOf( "<xml>" );
    int xmlEnd = n.Comment.LastIndexOf( "</xml>" );
    if (xmlStart >= 0 && xmlEnd > xmlStart)
    {
        XmlDocument xmldoc = new XmlDocument();
        // xmlEnd + "</xml>".Length marks the end of the closing tag
        string rawXml = n.Comment.Substring( xmlStart, xmlEnd + 6 - xmlStart );

        xmldoc.LoadXml( DeNamespaceifyXML( rawXml ) );
        // No namespace manager needed: DeNamespaceifyXML strips the prefixes
        XmlNodeList nl = xmldoc.SelectNodes( "//CustomDocumentProperties/*" );
        if (nl != null)
        {
            foreach (XmlNode customDocNode in nl)
            {
                XmlAttribute dtAttr = customDocNode.Attributes["dt"];
                if (dtAttr != null)
                {
                    string name = customDocNode.Name;
                    string dt = dtAttr.Value;
                    string text = customDocNode.InnerText;
                }
            }
        }
    }
} );

You're right that the XML scanner was not pleased with all the undeclared namespaces in the <xml> block. I think the "correct" way to handle it would be to use the information from the <html> tag, which, for Word-saved HTML files, looks like this:

<html xmlns:v="urn:schemas-microsoft-com:vml"
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:w="urn:schemas-microsoft-com:office:word"
xmlns="http://www.w3.org/TR/REC-html40">

You'd use that xmlns information to reconstruct the <xml> tag with appropriate namespace references.
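A hypothetical sketch of that route (the wrapper element and variable names are mine, and rawXml would be the <xml>...</xml> text pulled out of the comment):

// Hypothetical: copy the xmlns:* declarations from the <html> tag onto a
// wrapper element so the prefixes in the fragment resolve when parsed.
// (StringBuilder needs using System.Text.)
HtmlNode htmlTag = doc.DocumentNode.SelectSingleNode( "//html" );
StringBuilder wrapper = new StringBuilder( "<root" );
foreach (HtmlAttribute attr in htmlTag.Attributes)
{
    if (attr.Name.StartsWith( "xmlns:" ))
        wrapper.AppendFormat( " {0}=\"{1}\"", attr.Name, attr.Value );
}
wrapper.Append( ">" ).Append( rawXml ).Append( "</root>" );

XmlDocument xmldoc = new XmlDocument();
xmldoc.LoadXml( wrapper.ToString() );
// Note: any prefix the <html> tag doesn't declare (dt:, say) would still
// have to be added by hand.

But I skipped that in favor of a simpler solution. I just wrote a regex to strip out the namespaces: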

internal static string DeNamespaceifyXML( string xml )
{
    // Turn <o:Author> into <Author> (and </o:Author> into </Author>)
    Regex entities = new Regex( @"(?<before></?)\w+:(?<after>\w+)" );
    string fixedEntities = entities.Replace( xml, delegate( Match m )
    {
        return m.Groups["before"].Value + m.Groups["after"].Value;
    } );

    // Turn dt:dt="..." into dt="..."
    Regex attrs = new Regex( @" \w+:(\w+)=" );
    return attrs.Replace( fixedEntities, delegate( Match m )
    {
        return " " + m.Groups[1].Value + "=";
    } );
}