When extracting text from documents using Aspose.PDF's TextAbsorber or TextFragmentAbsorber classes, developers frequently encounter a persistent problem: memory is not released after the extraction completes. Even with proper disposal patterns, using blocks, and explicit garbage collection calls, memory usage continues to climb. This issue has been documented across 18 separate forum reports spanning over 12 years, from version 8.0 through the latest 25.11 release.
The Problem
The Aspose.PDF library provides TextAbsorber and TextFragmentAbsorber classes for extracting text content from documents. While these classes work functionally, they exhibit problematic memory behavior that becomes critical in production environments. When processing documents, internal Dictionary and List objects are created but not properly released when the Document object is disposed.
This manifests in several ways. Memory allocation grows continuously during batch processing. Applications crash with OutOfMemoryException after processing hundreds or thousands of files. Server restarts become necessary to reclaim memory. The problem is particularly severe when processing large documents (thousands of pages), vector-based documents, or documents with non-Latin character sets.
In at least one reported case, Microsoft's analysis of a memory dump from an affected application confirmed the root cause: large numbers of Dictionary and List objects consuming gigabytes of memory without being garbage collected.
Error Messages and Symptoms
```
System.OutOfMemoryException: Exception of type 'System.OutOfMemoryException' was thrown.
   at Aspose.Pdf.Text.TextAbsorber.Visit(Page page)
   at Aspose.Pdf.Text.TextFragmentAbsorber.Visit(Page page)

// Memory profiler output
Large Object Heap: 5.1 GB (Dictionary<TKey, TValue>)
Large Object Heap: 3.8 GB (List<T>)
```
Symptoms include:
- Memory usage increasing linearly with each document processed
- RAM consumption reaching 12-24GB before application failure
- TextAbsorber entering infinite loops on certain documents
- StackOverflowException after processing 10,000+ files
Who Is Affected
This memory behavior impacts developers in several scenarios.
High-volume document processing: Applications that process hundreds or thousands of documents in batch operations. After processing approximately 12,000 files, one developer reported encountering StackOverflowException as memory exhaustion cascaded into stack corruption.
Large document extraction: Documents with 1,000+ pages cause disproportionate memory growth. A 1,514-page document was reported to cause "high memory consumption" during extraction. A 50MB vector-based document consumed 24GB of RAM before failing.
Long-running services: Web servers, Windows services, and background workers that cannot restart between operations. Memory accumulates over time until the process fails or the server becomes unresponsive.
Multi-threaded extraction: The issue is amplified in concurrent scenarios. One developer reported consuming all 24GB of a test server's memory within 7 minutes using multi-threaded text extraction.
Non-Latin character sets: Documents containing Hebrew, Arabic, or other non-Latin scripts exhibit worse memory behavior according to forum reports.
Document Types That Trigger the Issue
Certain document characteristics correlate with more severe memory consumption when using TextAbsorber and TextFragmentAbsorber:
Vector-based documents: Documents created by CAD software or containing complex vector graphics cause disproportionate memory growth. One report documented a 50MB vector document consuming 24GB of RAM.
Scanned documents with OCR layers: Documents that have been scanned and processed with OCR tools contain both image layers and text layers, doubling the extraction complexity.
Documents with embedded fonts: Custom or embedded font subsets require additional memory for font data storage during extraction.
Multi-column layouts: Documents with complex layouts such as newspapers, magazines, or academic papers with sidebars require more processing to determine text flow order.
Documents with form fields: Interactive form elements add complexity to text extraction as the library must distinguish between static and dynamic content.
Evidence from the Developer Community
This issue has been reported consistently across 12 years of Aspose.PDF versions.
Timeline
| Date | Event | Version |
|---|---|---|
| 2013-05-22 | First memory leak report during text extraction | 8.0 |
| 2014-12-01 | OOM on 29MB files reported | 8.3.1.0 |
| 2018-05-01 | Microsoft confirms Dictionary/List objects not disposed | Not specified |
| 2019-07-01 | 24GB RAM consumed by 50MB vector document | 19.6, 19.7 |
| 2022-04-01 | Production memory leak confirmed | 22.3.0 |
| 2024-11-01 | Memory leak introduced in v23.11.0+ confirmed | 23.11.0 |
| 2025-11-01 | Issue persists in latest version | 25.11 |
Community Reports
"Aspose.PDF leaks memory in production. It is impossible to debug due to the obfuscation when viewing memory profiler snapshots."
— Developer, Aspose Forums, July 2025

"Running it on a 50 MB PDF uses 24 GB of RAM before failing. Serious memory leaks observed."
— Developer, Aspose Forums, July 2019

"Microsoft reviewed the memory dump and found large numbers of Dictionary and List objects being created and not being disposed of. Consuming approximately 3.8GB and 5.1GB respectively."
— Developer, Aspose Forums, May 2018

"TextAbsorber becomes stuck in an infinite loop. It tries to read past the end of the stream, catches the exception and tries again - forever. Since the exception is handled internally and never thrown, users can't handle it themselves either."
— Developer, Aspose Forums, August 2024
Additional reports document memory growing to 16GB during word searches, extraction operations that never complete as memory climbs to 12GB, and patterns where memory "behaves like a leak" when processing pages sequentially.
Root Cause Analysis
The memory retention stems from Aspose.PDF's internal architecture. The TextAbsorber and TextFragmentAbsorber classes maintain internal references to extracted data that persist beyond the visible object lifecycle. When processing a page, these classes create Dictionary and List collections to store text fragments, font information, and positional data.
The problem is that these internal collections are not cleared when:
- The `Document` object is disposed
- `using` blocks complete
- `GC.Collect()` is called
- `MemoryCleaner.Clear()` is invoked
The Document class appears to hold strong references to page content that prevents garbage collection. Even when developers follow proper disposal patterns, the internal state remains allocated on the managed heap.
The issue compounds in multi-threaded scenarios because each thread creates its own set of internal collections, multiplying memory consumption. For vector-based documents, the text extraction process generates additional geometric data that inflates memory requirements further.
One developer noted that the code is obfuscated in released binaries, making it impossible to identify the exact retention path through memory profiling tools.
Attempted Workarounds
The developer community has tried several approaches to mitigate this issue, none of which fully resolve it.
Using MemoryCleaner.Clear()
Approach: Call Aspose.Pdf.MemoryCleaner.Clear() after processing each document or batch.
```csharp
using (var document = new Document("input.pdf"))
{
    var absorber = new TextAbsorber();
    document.Pages.Accept(absorber);
    string text = absorber.Text;
}

Aspose.Pdf.MemoryCleaner.Clear();
GC.Collect();
```
Limitations: Users report this does not fully release memory in production. Memory still accumulates over time. Adds complexity and performance overhead to calling code.
Creating New Document Objects Per Page
Approach: Instead of reusing a single Document object, create and dispose a new instance for each page.
```csharp
for (int i = 1; i <= pageCount; i++)
{
    using (var doc = new Document("input.pdf"))
    {
        var absorber = new TextAbsorber();
        doc.Pages[i].Accept(absorber);
        // Process text
    }
}
```
Limitations: Significant performance penalty from repeatedly parsing the document. Still does not prevent memory accumulation, only slows it. Increases code complexity.
Forcing Garbage Collection
Approach: Explicitly call GC.Collect() with maximum generation collection.
```csharp
GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced, true);
GC.WaitForPendingFinalizers();
```
Limitations: Internal Dictionary and List objects are not collected because they remain rooted. Microsoft's analysis confirmed these objects persist despite garbage collection. Forced GC is generally discouraged for performance reasons.
Periodic Application Restart
Approach: Restart the service or application on a schedule to reclaim all memory.
Limitations: Unacceptable for systems requiring high availability. Causes service interruption. Does not solve the underlying issue. Becomes more frequent as document volume grows.
A Different Approach: IronPDF
For developers who need text extraction without memory accumulation, switching libraries is worth considering. IronPDF handles text extraction through a different architecture that releases memory after operations complete.
Why IronPDF Handles This Differently
IronPDF uses an embedded Chromium rendering engine rather than a managed C# implementation of parsing logic. Text extraction operates on the rendered document representation, which allows the engine to release resources after extraction completes. The PdfDocument class implements IDisposable with proper cleanup of both managed and unmanaged resources.
Code Example
```csharp
using IronPdf;
using System;
using System.IO;

/// <summary>
/// Demonstrates text extraction from documents without memory accumulation.
/// Memory is properly released after each document is processed.
/// </summary>
public class TextExtractionExample
{
    public void ExtractTextFromMultipleDocuments(string[] filePaths)
    {
        foreach (string filePath in filePaths)
        {
            // PdfDocument implements IDisposable - resources released when disposed
            using (var pdf = PdfDocument.FromFile(filePath))
            {
                // Extract all text from the document
                string allText = pdf.ExtractAllText();
                Console.WriteLine($"Extracted {allText.Length} characters from {Path.GetFileName(filePath)}");

                // Process the extracted text as needed
                ProcessText(allText, filePath);
            }
            // Memory released here - no accumulation between documents
        }
    }

    public void ExtractTextPageByPage(string filePath)
    {
        using (var pdf = PdfDocument.FromFile(filePath))
        {
            // Process each page individually for large documents
            for (int i = 0; i < pdf.PageCount; i++)
            {
                // Extract text from a specific page (0-indexed)
                string pageText = pdf.ExtractTextFromPage(i);
                Console.WriteLine($"Page {i + 1}: {pageText.Length} characters");

                // Process page text
                ProcessPageText(pageText, i);
            }
        }
    }

    private void ProcessText(string text, string sourceFile)
    {
        // Application-specific text processing logic
    }

    private void ProcessPageText(string text, int pageIndex)
    {
        // Application-specific page processing logic
    }
}
```
Key characteristics of this approach:
- Memory releases when the using block exits
- No accumulation across multiple documents
- No need for MemoryCleaner calls or forced GC
- Works consistently with large documents and batch processing
- Thread-safe for concurrent extraction operations
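Since concurrent extraction is one of the scenarios where the Aspose.PDF issue bites hardest, the batch pattern above can also be parallelized. The sketch below is illustrative rather than an official IronPDF recipe: the degree of parallelism is an assumption to tune for your hardware, and each task opens and disposes its own document so no state is shared between threads.

```csharp
using System;
using System.Threading.Tasks;
using IronPdf;

public static class ParallelExtraction
{
    // Extracts text from many PDFs concurrently. Each iteration owns its
    // own PdfDocument, disposed at the end of the using block.
    public static void ExtractAll(string[] filePaths)
    {
        // MaxDegreeOfParallelism = 4 is an illustrative choice, not a requirement.
        var options = new ParallelOptions { MaxDegreeOfParallelism = 4 };

        Parallel.ForEach(filePaths, options, filePath =>
        {
            using (var pdf = PdfDocument.FromFile(filePath))
            {
                string text = pdf.ExtractAllText();
                Console.WriteLine($"{filePath}: {text.Length} characters");
            }
            // Per-file resources released here, on the worker thread that owned them.
        });
    }
}
```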
Migration Considerations
Before switching libraries, evaluate these factors.
Licensing
IronPDF requires a commercial license for production use. A free trial is available for evaluation. Pricing is based on deployment model and developer count.
API Differences
| Operation | Aspose.PDF | IronPDF |
|---|---|---|
| Load document | `new Document(path)` | `PdfDocument.FromFile(path)` |
| Extract all text | `TextAbsorber.Text` | `pdf.ExtractAllText()` |
| Extract page text | `page.Accept(absorber)` | `pdf.ExtractTextFromPage(index)` |
Migration requires updating extraction code but follows similar patterns.
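As a rough before-and-after sketch of that mapping (using only the calls shown in the table above; error handling omitted, and note the shift from Aspose's visitor pattern to a direct method call):

```csharp
// Before: Aspose.PDF - visitor pattern, absorber holds the result
using (var document = new Aspose.Pdf.Document("input.pdf"))
{
    var absorber = new Aspose.Pdf.Text.TextAbsorber();
    document.Pages.Accept(absorber);
    string text = absorber.Text;
}

// After: IronPDF - direct extraction call on the document
using (var pdf = IronPdf.PdfDocument.FromFile("input.pdf"))
{
    string text = pdf.ExtractAllText();
}
```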
Evaluation Checklist
- Test extraction quality on your specific document types
- Verify character encoding handling for non-Latin text
- Benchmark extraction speed against current implementation
- Confirm licensing model fits your deployment
Conclusion
The Aspose.PDF text extraction memory issue has persisted across 12 years and 17+ major versions, affecting developers processing documents at scale. The workarounds provided by the vendor do not fully resolve the problem. For applications where memory stability is critical, IronPDF provides an alternative architecture that properly releases resources after extraction operations complete.
Jacob Mellor built IronPDF and leads technical development at Iron Software.
References
- Memory leaks in Aspose PDF NET - production memory leak report, April 2022
- Problems with text extraction on vector-based PDFs - 24GB RAM consumption documented, July 2019
- Memory leak - Microsoft memory dump analysis confirming Dictionary/List retention, May 2018
- TextAbsorber infinite loop bug - TextAbsorber infinite loop consuming all memory, August 2024
- Question regarding unreleased memory - latest report confirming issue in v25.11, November 2025
- Aspose.PDF does not release memory - production leak with obfuscation preventing debugging, July 2025
- Aspose PDF memory leak v23.11 and later - memory leak regression in v23.11.0, November 2024
For the latest IronPDF documentation and tutorials, visit ironpdf.com.