Extraction of Text from HTML

Topics: Developer Forum, User Forum
Dec 28, 2016 at 5:50 AM
I need to extract clean, readable text from HTML.

Consider, for example:
<p>Almost a year ago, <a href="http://scienceblogs.com/aetiology/2014/09/26/hiv-denial-live-and-well-in-2014/" target="_blank">I wrote about a terrible article</a> that was published in the journal Frontiers in Public Health. FiPH is a legitimate, peer-reviewed journal, and they had just published a manuscript that was straight-up HIV denial</p>

Output:
After extraction, the HTML above gives me output on three separate lines, like this:

Almost a year ago,
I wrote about a terrible article
that was published in the journal Frontiers in Public Health. FiPH is a legitimate, peer-reviewed journal, and they had just published a manuscript that was straight-up HIV denial


Expected Output:
Almost a year ago, I wrote about a terrible article that was published in the journal Frontiers in Public Health. FiPH is a legitimate, peer-reviewed journal, and they had just published a manuscript that was straight-up HIV denial

Is there any way to get the expected output?

I currently use the approach shown in the source code:


case HtmlNodeType.Text:
    // script and style must not be output
    string parentName = node.ParentNode.Name;
    if ((parentName == "script") || (parentName == "style"))
        break;

    html = ((HtmlTextNode)node).Text;

    // is it in fact a special closing node output as text?
    if (HtmlNode.IsOverlappedClosingElement(html))
        break;

    // check the text is meaningful and not a bunch of whitespaces
    if (html.Trim().Length > 0)
    {
        string replaceWith = "";
        string replacedLine = html.Replace("\r\n", replaceWith).Replace("\n", replaceWith).Replace("\r", replaceWith);
        outText.AppendLine(WebUtility.HtmlDecode(replacedLine));
    }
    break;
Please also suggest how to handle HTML that contains <br> tags.
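
One possible direction, offered as an untested sketch and not as the library's own method: the three lines appear because AppendLine is called once per text node, so the text inside the <a> element ends up on its own line. The sketch below appends text nodes with Append and only starts a new line at elements treated as line boundaries, including br. The class name InlineAwareTextExtractor, the BlockTags list, and the EnsureNewLine helper are hypothetical names introduced for illustration. It assumes the HtmlAgilityPack package is referenced.

```csharp
using System.Collections.Generic;
using System.Net;
using System.Text;
using HtmlAgilityPack;

// Hypothetical helper class, not part of HtmlAgilityPack.
static class InlineAwareTextExtractor
{
    // Assumed set of tags that should start a new line; adjust to your documents.
    static readonly HashSet<string> BlockTags = new HashSet<string>
    {
        "p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"
    };

    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var outText = new StringBuilder();
        Walk(doc.DocumentNode, outText);
        return outText.ToString().Trim();
    }

    static void Walk(HtmlNode node, StringBuilder outText)
    {
        switch (node.NodeType)
        {
            case HtmlNodeType.Text:
                // script and style must not be output
                string parentName = node.ParentNode.Name;
                if (parentName == "script" || parentName == "style")
                    break;

                string text = ((HtmlTextNode)node).Text;
                if (HtmlNode.IsOverlappedClosingElement(text))
                    break;

                if (text.Trim().Length > 0)
                {
                    // Replace line breaks inside the node with spaces and
                    // append WITHOUT a trailing newline, so inline text stays together.
                    string oneLine = text.Replace("\r\n", " ").Replace("\n", " ").Replace("\r", " ");
                    outText.Append(WebUtility.HtmlDecode(oneLine));
                }
                break;

            case HtmlNodeType.Document:
            case HtmlNodeType.Element:
                bool isBlock = BlockTags.Contains(node.Name);
                if (isBlock) EnsureNewLine(outText);   // break before a block element or br
                foreach (HtmlNode child in node.ChildNodes)
                    Walk(child, outText);
                if (isBlock) EnsureNewLine(outText);   // break after it
                break;
        }
    }

    // Adds a newline only if the buffer does not already end with one.
    static void EnsureNewLine(StringBuilder sb)
    {
        if (sb.Length > 0 && sb[sb.Length - 1] != '\n')
            sb.Append('\n');
    }
}
```

Calling InlineAwareTextExtractor.Extract on the sample paragraph above should keep the link text on the same line as the surrounding sentence. Two known limitations of this sketch: whitespace-only text nodes between inline elements are skipped, so a space that exists only between two tags is lost, and consecutive br tags collapse into a single line break.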