
Multithread OutOfMemoryException when calling LoadHtml

description

Hi

I'm getting two different OutOfMemoryExceptions thrown from the LoadHtml method of the HtmlDocument class.

The first is:
System.OutOfMemoryException was caught
HResult=-2147024882
Message=Exception of type 'System.OutOfMemoryException' was thrown.
Source=mscorlib
StackTrace:
   at System.String.InternalSubString(Int32 startIndex, Int32 length)
   at System.String.Substring(Int32 startIndex, Int32 length)
   at HtmlAgilityPack.HtmlDocument.Parse() in d:\SVN_CHECKOUT\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlDocument.cs:line 1467
   at HtmlAgilityPack.HtmlDocument.Load(TextReader reader) in d:\SVN_CHECKOUT\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlDocument.cs:line 614
   at HtmlAgilityPack.HtmlDocument.LoadHtml(String html) in d:\SVN_CHECKOUT\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlDocument.cs:line 660
The second is:
System.OutOfMemoryException was caught
HResult=-2147024882
Message=Exception of type 'System.OutOfMemoryException' was thrown.
Source=mscorlib
StackTrace:
   at System.Collections.Generic.Dictionary`2.Resize(Int32 newSize, Boolean forceNewHashCodes)
   at System.Collections.Generic.Dictionary`2.Insert(TKey key, TValue value, Boolean add)
   at System.Collections.Generic.Dictionary`2.Add(TKey key, TValue value)
   at HtmlAgilityPack.HtmlNode..ctor(HtmlNodeType type, HtmlDocument ownerdocument, Int32 index) in d:\SVN_CHECKOUT\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlNode.cs:line 156
   at HtmlAgilityPack.HtmlDocument.CreateNode(HtmlNodeType type, Int32 index) in d:\SVN_CHECKOUT\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlDocument.cs:line 752
   at HtmlAgilityPack.HtmlDocument.PushNodeStart(HtmlNodeType type, Int32 index) in d:\SVN_CHECKOUT\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlDocument.cs:line 1624
   at HtmlAgilityPack.HtmlDocument.NewCheck() in d:\SVN_CHECKOUT\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlDocument.cs:line 1115
   at HtmlAgilityPack.HtmlDocument.Parse() in d:\SVN_CHECKOUT\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlDocument.cs:line 1155
   at HtmlAgilityPack.HtmlDocument.Load(TextReader reader) in d:\SVN_CHECKOUT\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlDocument.cs:line 614
   at HtmlAgilityPack.HtmlDocument.LoadHtml(String html) in d:\SVN_CHECKOUT\htmlagilitypack\Trunk\HtmlAgilityPack\HtmlDocument.cs:line 660
The code that loads the HTML is:
HtmlDocument htmlDocument = new HtmlDocument();
htmlDocument.LoadHtml(html);
HtmlNodeCollection htmlNodes = htmlDocument.DocumentNode.SelectNodes("//a[@href]");
Here, html is a string containing the markup of a downloaded HTML document.

It may be relevant that the document loading is called from a multithreaded environment:
for (int i = 0; i < _threadCount; i++)
{
    asyncPool.Add(Task.Run(async () =>
    {
        while (!_cancellationToken.IsCancellationRequested)
        {
            List<Task<WebPageResult>> asyncDownloadTasks = DownloadNextPageGroup();
            await ProcessPagesOnCompletion(asyncDownloadTasks);
        }
    }, _cancellationToken));
}
The HTML document loading happens inside the DownloadNextPageGroup method.


The exception doesn't occur immediately; the application has to run for about 5 minutes before it is thrown.
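
To check whether memory grows steadily during those minutes or jumps suddenly, a small logger can run alongside the worker tasks. This is only a diagnostic sketch and not part of the original application: the class name MemoryLog and its methods are hypothetical, and it uses only standard .NET calls (GC.GetTotalMemory, Process.PrivateMemorySize64, Environment.Is64BitProcess).

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical diagnostic helper; not part of Html Agility Pack or the original code.
public static class MemoryLog
{
    // Builds one snapshot line: managed heap size, process private bytes, and process bitness.
    public static string Snapshot()
    {
        long managedBytes = GC.GetTotalMemory(false); // false: do not force a collection first
        long privateBytes;
        using (Process process = Process.GetCurrentProcess())
        {
            privateBytes = process.PrivateMemorySize64;
        }
        return string.Format(
            "managed={0} MB, private={1} MB, 64bit={2}",
            managedBytes / (1024 * 1024),
            privateBytes / (1024 * 1024),
            Environment.Is64BitProcess);
    }

    // Writes a snapshot to the console every 'interval' until the token is cancelled.
    public static async Task RunAsync(TimeSpan interval, CancellationToken token)
    {
        while (!token.IsCancellationRequested)
        {
            Console.WriteLine("{0:HH:mm:ss} {1}", DateTime.Now, Snapshot());
            try
            {
                await Task.Delay(interval, token);
            }
            catch (OperationCanceledException)
            {
                break; // cancellation requested during the delay
            }
        }
    }
}
```

It could be started next to the workers with something like MemoryLog.RunAsync(TimeSpan.FromSeconds(10), _cancellationToken). The 64bit flag is included because a 32-bit process has a much smaller address space, which may matter when large strings are involved.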

What I've tried:
  • Loading the HTML string into a stream before loading it into an HtmlDocument (with and without an encoding specified).
  • Setting the htmlDocument instance to null after each use.
  • Limiting the size of loaded pages to 10 megabytes.
My Html Agility Pack version is 1.4.9.0.
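
For reference, the 10 MB page-size limit listed above can be sketched roughly as follows. This is an illustration rather than the exact code in use: the PageSizeGuard class and MaxPageBytes constant are hypothetical names, and it assumes the limit is measured in UTF-8 encoded bytes.

```csharp
using System;
using System.Text;

// Hypothetical helper illustrating the page-size limit; names are not from the original code.
public static class PageSizeGuard
{
    // 10 MB limit, assumed here to be counted in UTF-8 encoded bytes.
    public const int MaxPageBytes = 10 * 1024 * 1024;

    // Returns true when the page is non-null and its UTF-8 size is within the limit.
    public static bool IsWithinLimit(string html)
    {
        if (html == null)
        {
            return false;
        }

        // Every char encodes to at least one UTF-8 byte, so a string with more
        // chars than the limit cannot fit; this avoids scanning very large strings.
        if (html.Length > MaxPageBytes)
        {
            return false;
        }

        return Encoding.UTF8.GetByteCount(html) <= MaxPageBytes;
    }
}
```

A page would then be passed to LoadHtml only when PageSizeGuard.IsWithinLimit(html) returns true. Note that a .NET string stores two bytes per char, so a page near the limit can occupy roughly twice that size in memory before any parsing starts.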

Any solution or temporary workaround would be appreciated.

Thanks
