Readability using HtmlAgilityPack

May 18, 2010 at 6:48 AM

Hi all. I need a C# program that does what "Readability" does: after you download a page, it picks out the part of the page where the main content is. If you are interested, you can visit http://lab.arc90.com/experiments/readability/, add the bookmarklet to your browser and see how it works. That script is written in JavaScript, and I need the same thing written in C#. Could I ask for some help? I'm totally new to C#. Maybe you can give me some ideas. The script is at http://lab.arc90.com/experiments/readability/js/readability.js so you can look at it. One thing I wrote today removes some unwanted elements (sidebar, menu, ...):

 

using System;
using HtmlAgilityPack;

namespace ConsoleApplication5
{
    class ee
    {
        // Downloads the page at the given URL and returns its HTML
        // with scripts, styles and common layout blocks removed.
        public string ScrubHTML(string url)
        {
            HtmlWeb htmlWeb = new HtmlWeb();
            HtmlDocument doc = htmlWeb.Load(url);
            // To scrub an HTML string instead of a URL:
            //HtmlDocument doc = new HtmlDocument();
            //doc.LoadHtml(html);

            // Remove elements that are not part of the main content
            HtmlNodeCollection nc = doc.DocumentNode.SelectNodes("//script|//noscript|//style|//div[@id='footer']|//div[@id='feed-link']|//div[@class='spacer']|//div[@id='header']|//div[@id='custom-header']|//div[@id='sidebar']|//div[@id='post-form']|//div[@id='menu']|//div[@id='navigation']|//div[@id='p-lang']");

            // SelectNodes returns null when nothing matches
            if (nc != null)
            {
                foreach (HtmlNode node in nc)
                {
                    // false = drop the node together with its children
                    node.ParentNode.RemoveChild(node, false);
                }
            }

            return doc.DocumentNode.WriteTo();
        }
    }

    class Program
    {
        static void Main(string[] args)
        {
            ee bb = new ee();
            Console.WriteLine(bb.ScrubHTML("http://ru.wikipedia.org/wiki/%D0%9A%D0%BE%D0%BC%D0%BF%D1%8C%D1%8E%D1%82%D0%B5%D1%80"));
            Console.ReadKey();
        }
    }
}
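
For the next step, picking the main content block, here is a minimal sketch of the general idea as I understand it from readability.js: give each paragraph a score based on its text, add that score to the paragraph's parent, and discount parents that are mostly links. The ContentFinder class, the FindMainContent and LinkDensity names, the 25-character cutoff and the scoring weights are all my own assumptions for illustration, not a port of the real script.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

static class ContentFinder
{
    // Hypothetical helper: returns the element whose direct <p> children
    // score highest, or null if no candidate is found.
    public static HtmlNode FindMainContent(HtmlDocument doc)
    {
        HtmlNodeCollection paragraphs = doc.DocumentNode.SelectNodes("//p");
        if (paragraphs == null) return null; // no <p> elements at all

        var scores = new Dictionary<HtmlNode, double>();
        foreach (HtmlNode p in paragraphs)
        {
            string text = HtmlEntity.DeEntitize(p.InnerText).Trim();
            if (text.Length < 25) continue; // assumed cutoff for "too short"
            if (p.ParentNode == null) continue;

            // Assumed weights: 1 base point, 1 per comma,
            // and up to 3 more for every 100 characters of text.
            int commas = text.Split(',').Length - 1;
            double score = 1 + commas + Math.Min(text.Length / 100, 3);

            double current;
            scores.TryGetValue(p.ParentNode, out current);
            scores[p.ParentNode] = current + score;
        }

        HtmlNode best = null;
        double bestScore = 0;
        foreach (KeyValuePair<HtmlNode, double> pair in scores)
        {
            // Blocks made up mostly of links (menus, footers) are scaled down.
            double adjusted = pair.Value * (1 - LinkDensity(pair.Key));
            if (adjusted > bestScore)
            {
                best = pair.Key;
                bestScore = adjusted;
            }
        }
        return best;
    }

    // Fraction of a node's text that sits inside <a> elements.
    static double LinkDensity(HtmlNode node)
    {
        int total = node.InnerText.Length;
        if (total == 0) return 0;

        int linked = 0;
        HtmlNodeCollection links = node.SelectNodes(".//a");
        if (links != null)
        {
            foreach (HtmlNode a in links) linked += a.InnerText.Length;
        }
        return (double)linked / total;
    }
}
```

This could be combined with ScrubHTML by loading the scrubbed string into a new HtmlDocument with LoadHtml, passing it to FindMainContent, and printing the InnerHtml of the result if it is not null.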

Thanks :)