Why don't input elements have InnerText?

Topics: User Forum
Apr 14, 2011 at 7:41 PM

I'm in a philosophical argument with someone who is suggesting using an invisible WebBrowser control to parse HTML because it's "easier to understand" than Html Agility Pack. So I proposed this simple HTML file as an example of how cumbersome his approach would be (ignoring the other problems with that approach):




        <input type="checkbox">1!!!</input>
        <input type="checkbox">2!!!</input>



I'm an HTML novice (I can read it, but I'm unfamiliar with forms), so perhaps my HTML is wrong. It renders in a web browser; what a guarantee!

The goal is to get the text displayed by each checkbox. So I wrote this code:


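string source = "..."; // blah blah (the HTML above)
var doc = new HtmlDocument();
doc.LoadHtml(source); // parse the markup before querying it

var allCheckboxes = doc.DocumentNode.SelectNodes("//input");
var innerTexts = new List<string>();
foreach (var checkbox in allCheckboxes)
{
    innerTexts.Add(checkbox.InnerText); // collect the text of each checkbox
}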


I was surprised to find there's nothing in my list of strings or, more accurately, there's one empty string per checkbox. I poked around and found out that NextSibling is a #text node that contains the text I want, but that seems weird; that node should be a child of the input element, not a sibling, right?
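For anyone who wants to reproduce the sibling observation, here is a minimal sketch. It assumes the default ElementsFlags are still in place and that the HtmlAgilityPack package is referenced; the class and method names (CheckboxLabels, FromSiblings) are made up for illustration and are not part of the library.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class CheckboxLabels
{
    // Hypothetical helper: with the default flags, <input> is parsed as an
    // empty element, so the text after it is its next sibling, not a child.
    public static List<string> FromSiblings(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var labels = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var inputs = doc.DocumentNode.SelectNodes("//input");
        if (inputs == null)
        {
            return labels;
        }

        foreach (var input in inputs)
        {
            var next = input.NextSibling;
            // Only take the sibling if it really is a #text node.
            if (next != null && next.NodeType == HtmlNodeType.Text)
            {
                labels.Add(next.InnerText.Trim());
            }
        }
        return labels;
    }
}
```

Calling CheckboxLabels.FromSiblings(source) on the HTML above should collect the text that InnerText misses. It is a workaround for the default parsing rules rather than a change to them.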

Which is it, an issue with the HtmlAgilityPack or some ignorance of HTML on my part?

Apr 14, 2011 at 7:52 PM

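This is configurable, provided you change the setting before parsing: remove the "input" entry from the static HtmlNode.ElementsFlags table and the parser will treat input like any other element that can have children.

By default input is flagged as an empty element, which gives you the behavior you are seeing. This is the default registration:

HtmlNode.ElementsFlags.Add("input", HtmlElementFlag.Empty);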
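One way to make input keep its text as a child is sketched below. It assumes that removing the "input" entry from HtmlNode.ElementsFlags before parsing is the intended change; the class and method names (CheckboxInnerText, Read) are hypothetical.

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

public static class CheckboxInnerText
{
    // Hypothetical helper: drops the Empty flag for <input>, then reads InnerText.
    public static List<string> Read(string html)
    {
        // ElementsFlags is a static table shared by every document parsed in
        // this process, so the change must happen before LoadHtml and it
        // stays in effect afterwards.
        HtmlNode.ElementsFlags.Remove("input");

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var texts = new List<string>();

        // SelectNodes returns null when nothing matches.
        var inputs = doc.DocumentNode.SelectNodes("//input");
        if (inputs == null)
        {
            return texts;
        }

        foreach (var input in inputs)
        {
            texts.Add(input.InnerText);
        }
        return texts;
    }
}
```

Because the table is global, code elsewhere that relies on the default behavior is affected too. Assigning HtmlNode.ElementsFlags["input"] = HtmlElementFlag.Empty; afterwards puts the default back.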

Here's the full list of these special cases:

            ElementsFlags = new Dictionary<string, HtmlElementFlag>();
            ElementsFlags.Add("script", HtmlElementFlag.CData);
            ElementsFlags.Add("style", HtmlElementFlag.CData);
            ElementsFlags.Add("noxhtml", HtmlElementFlag.CData);

            // tags that can not contain other tags
            ElementsFlags.Add("base", HtmlElementFlag.Empty);
            ElementsFlags.Add("link", HtmlElementFlag.Empty);
            ElementsFlags.Add("meta", HtmlElementFlag.Empty);
            ElementsFlags.Add("isindex", HtmlElementFlag.Empty);
            ElementsFlags.Add("hr", HtmlElementFlag.Empty);
            ElementsFlags.Add("col", HtmlElementFlag.Empty);
            ElementsFlags.Add("img", HtmlElementFlag.Empty);
            ElementsFlags.Add("param", HtmlElementFlag.Empty);
            ElementsFlags.Add("embed", HtmlElementFlag.Empty);
            ElementsFlags.Add("frame", HtmlElementFlag.Empty);
            ElementsFlags.Add("wbr", HtmlElementFlag.Empty);
            ElementsFlags.Add("bgsound", HtmlElementFlag.Empty);
            ElementsFlags.Add("spacer", HtmlElementFlag.Empty);
            ElementsFlags.Add("keygen", HtmlElementFlag.Empty);
            ElementsFlags.Add("area", HtmlElementFlag.Empty);
            ElementsFlags.Add("input", HtmlElementFlag.Empty);
            ElementsFlags.Add("basefont", HtmlElementFlag.Empty);

            ElementsFlags.Add("form", HtmlElementFlag.CanOverlap | HtmlElementFlag.Empty);

            // they sometimes contain, and sometimes they don't...
            ElementsFlags.Add("option", HtmlElementFlag.Empty);

            // tag whose closing tag is equivalent to open tag:
            // <p>bla</p>bla will be transformed into <p>bla</p>bla
            // <p>bla<p>bla will be transformed into <p>bla<p>bla and not <p>bla></p><p>bla</p> or <p>bla<p>bla</p></p>
            //<br> see above
            ElementsFlags.Add("br", HtmlElementFlag.Empty | HtmlElementFlag.Closed);
            ElementsFlags.Add("p", HtmlElementFlag.Empty | HtmlElementFlag.Closed);
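To check how your installed version treats a given tag, you can query the table directly. This is only a sketch: FlagInspector and Has are hypothetical names, and the defaults may differ between releases, so treat the list above as a snapshot rather than a guarantee.

```csharp
using HtmlAgilityPack;

public static class FlagInspector
{
    // Hypothetical helper: true if tagName is registered with the given flag.
    // Tags that are missing from the table get no special handling.
    public static bool Has(string tagName, HtmlElementFlag flag)
    {
        HtmlElementFlag flags;
        if (!HtmlNode.ElementsFlags.TryGetValue(tagName, out flags))
        {
            return false;
        }

        // Flags can be combined (for example Empty | Closed), so test the bit
        // instead of comparing for equality.
        return (flags & flag) == flag;
    }
}
```

For example, FlagInspector.Has("input", HtmlElementFlag.Empty) tells you whether input is currently parsed as an element that cannot have children.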

Apr 15, 2011 at 12:57 PM

That invisible control approach does not work: the control needs to be visible on screen for the parser to work. That is why parsers like this one were written. They also use less memory and execute faster.

Apr 17, 2011 at 7:58 PM

@darthobiwan: Thanks! That worked just fine.

@kurtnelle: Preaching to the choir here. I don't even like using a visible web browser to parse HTML. But since it's the only built-in way to parse HTML in .NET, it's what a lot of newbies stumble upon. No sooner had I pointed out that you can parse HTML without a control than someone else popped in and complained that using HTML Agility Pack was "more complicated". Lots of fun discussion after that.