
Clean Word HTML

Topics: Developer Forum, User Forum
Jun 18, 2010 at 2:36 AM

I've found that while Word 2007 does a decent job with Save As -> Web Page, Filtered, the output isn't ready to be dropped into a web application where the content will be embedded in another page. The reason is that many of the styles are stored in the header rather than inline, which makes it hard to load only the body. There are also numerous references to fonts and the like that won't work in most browsers. What I'm looking for is a library or code that takes Word Filtered HTML and turns it into something ready to be put into a CMS. Is anyone aware of such a library or code that uses the HtmlAgilityPack?
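
To make the kind of cleanup being asked for concrete, here is a minimal sketch. It is written in Python with only the standard library, not with the HtmlAgilityPack, and it is not an existing library. The lists of tags and attributes treated as Word noise, and the names `WordBodyCleaner`, `clean_style`, and `clean_word_html`, are assumptions made for illustration. It keeps only what is inside the body, unwraps font-related wrappers, and drops class attributes and Word-specific style declarations.

```python
from html import escape
from html.parser import HTMLParser

# Assumptions (not from the thread): what counts as Word noise.
DROP_TAGS = {"font", "span", "o:p"}   # tag removed, inner text kept
DROP_ATTRS = {"class", "lang"}        # removed from every remaining tag
VOID_TAGS = {"br", "hr", "img"}       # tags that never get a closing tag


def clean_style(value):
    """Drop mso-* and font-family declarations from a style attribute."""
    decls = [d.strip() for d in value.split(";") if d.strip()]
    kept = [d for d in decls
            if not d.lower().startswith(("mso-", "font-family"))]
    return "; ".join(kept)


class WordBodyCleaner(HTMLParser):
    """Collects cleaned markup for everything inside <body>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_body = False
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
            return
        if not self.in_body or tag in DROP_TAGS:
            return
        parts = []
        for name, value in attrs:
            if name in DROP_ATTRS:
                continue
            value = value or ""
            if name == "style":
                value = clean_style(value)
                if not value:
                    continue
            value = escape(value, quote=True)
            parts.append(f' {name}="{value}"')
        self.out.append(f"<{tag}{''.join(parts)}>")

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
            return
        if self.in_body and tag not in DROP_TAGS and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Header content (for example the <style> text) is skipped here.
        if self.in_body:
            self.out.append(escape(data, quote=False))


def clean_word_html(html):
    parser = WordBodyCleaner()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    sample = (
        "<html><head><style>p.MsoNormal {margin:0in;}</style></head>"
        "<body lang=EN-US><p class=MsoNormal "
        "style='margin-left:.5in;mso-list:l0 level1 lfo1'>"
        "<span style='font-family:Arial'>Hello</span> <b>world</b>"
        "<o:p></o:p></p></body></html>"
    )
    print(clean_word_html(sample))
```

Because this sketch simply discards the header styles, anything that relied on a class rule loses its formatting. That is the trade-off behind the complaint about styles living in the header.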


Jun 18, 2010 at 3:45 AM

I know Sitecore CMS uses Html Agility Pack to do something like this. I'm not aware of any open code to do it.

I usually use Dreamweaver for this; it has a great Word cleaner built in.

Jun 18, 2010 at 3:49 AM

Thanks. I'm looking for a (hopefully) open solution, or at least a library. This is for a web application backend, so integrating with another CMS doesn't really make sense, but it's worth taking a look at.


Jun 18, 2010 at 3:52 AM

This is a little old but it might help.



Jun 18, 2010 at 11:12 PM

I know TinyMCE (a text editor) can get rid of the junk styling from MS Word. It does this when a user pastes content in, but I'm not sure whether it also does it for content that comes from the server. You'd have to try it out; you might need to make some call after the content is loaded into TinyMCE.



Jun 18, 2010 at 11:54 PM

I'll check out that CodeProject code more closely, but at first glance it appears to only strip tags, not do any refactoring.
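
As an illustration of what "refactoring" rather than stripping might mean here, the sketch below copies class rules from the header's style block into inline style attributes, so the body can be lifted out on its own. It is Python standard library only, not HtmlAgilityPack code. It assumes the header uses only simple `tag.class { ... }` or `.class { ... }` rules, and the names `parse_class_rules`, `StyleInliner`, and `inline_header_styles` are hypothetical.

```python
import re
from html import escape
from html.parser import HTMLParser

STYLE_BLOCK = re.compile(r"<style[^>]*>(.*?)</style>", re.I | re.S)
# Assumption: only simple class selectors are handled; in a selector
# list such as "p.A, li.A {...}" only the last selector is picked up.
CLASS_RULE = re.compile(r"\.([\w-]+)\s*\{([^}]*)\}")
VOID_TAGS = {"br", "hr", "img"}


def parse_class_rules(html):
    """Return {class name: "prop:value; ..."} from the <style> blocks."""
    rules = {}
    for css in STYLE_BLOCK.findall(html):
        for name, body in CLASS_RULE.findall(css):
            decls = [d.strip() for d in body.split(";") if d.strip()]
            if decls:
                rules[name] = "; ".join(decls)
    return rules


class StyleInliner(HTMLParser):
    """Re-emits the body with class rules folded into style attributes."""

    def __init__(self, rules):
        super().__init__(convert_charrefs=True)
        self.rules = rules
        self.in_body = False
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
            return
        if not self.in_body:
            return
        attrs = dict(attrs)
        classes = (attrs.pop("class", None) or "").split()
        # Header rules go first so an existing inline style still wins.
        styles = [self.rules[c] for c in classes if c in self.rules]
        inline = (attrs.pop("style", None) or "").strip().rstrip(";")
        if inline:
            styles.append(inline)
        if styles:
            attrs["style"] = "; ".join(styles)
        rendered = "".join(
            ' {}="{}"'.format(name, escape(value or "", quote=True))
            for name, value in attrs.items()
        )
        self.out.append("<{}{}>".format(tag, rendered))

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
            return
        if self.in_body and tag not in VOID_TAGS:
            self.out.append("</{}>".format(tag))

    def handle_data(self, data):
        if self.in_body:
            self.out.append(escape(data, quote=False))


def inline_header_styles(html):
    parser = StyleInliner(parse_class_rules(html))
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

A real Word export also uses element selectors and cascading rules that this sketch ignores, so it should be read as a starting point rather than a complete converter.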

I've been using another editor, CKEditor, which does a decent job of cleaning a paste from Word. It doesn't support every feature (e.g., multi-level lists), but it gets about 80% of the way there. I tried TinyMCE and found it does about the same. Because both work from a paste, I assume they grab the plain text (fairly easy to do) and add HTML based on patterns. At least, that is how it appears.

Thanks for the great suggestions - I'm going to keep digging and see if there isn't someone out there with a better solution. I'll add to this thread if I find anything.