
How to Parse and Extract Text from PDF in C#

Extracting text from PDF files is harder than it should be. PDFs are designed for consistent visual presentation, not data extraction. They store text as positioned glyphs on a page — individual characters placed at specific coordinates. There's no inherent structure defining paragraphs, sentences, or even words. Reconstructing readable text from this low-level representation requires sophisticated parsing.

I've built document processing systems that extract text from thousands of PDFs daily — invoices, contracts, reports, scanned documents. The challenge isn't just pulling text out; it's getting it in a usable format. Bad PDF parsers give you scrambled output where words run together, spaces appear in random places, or reading order is completely wrong. You spend more time cleaning the extracted text than you would have typing it manually.

The reason PDF text extraction is difficult stems from how PDFs encode content. Text isn't stored as strings or paragraphs. It's a series of individual character rendering commands. The PDF specifies "place glyph 'H' at position (100,200), place glyph 'e' at position (110,200), place glyph 'l' at position (120,200)" and so on. The parser must reconstruct words by detecting which glyphs are close enough to form a word, which words align vertically to form lines, and which lines group logically into paragraphs.

Further complicating extraction is that PDF has no semantic structure. A heading looks like a heading because it's larger and bold, not because it's tagged as a heading. A table looks like a table because text is positioned in rows and columns, not because there's a table element. Multi-column layouts are just text positioned in different regions. The parser must infer structure from visual positioning, which is error-prone.

I've encountered PDFs with text in bizarre reading orders. Visual layout suggested left-to-right, top-to-bottom. Actual glyph order in the PDF was right column first, then left column, with individual sentences interleaved. The PDF viewer rendered it correctly, but simple text extraction produced nonsensical output. We had to use spatial analysis to reconstruct the intended reading order.

Despite these challenges, modern PDF libraries handle common cases well. IronPDF uses spatial analysis to group glyphs into words, words into lines, and lines into paragraphs. It handles multi-column layouts, tables, headers, and footers reasonably well. For standard business documents — invoices, reports, form letters — extraction is reliable and accurate. Edge cases exist, but they're the exception.

The workflow I use is straightforward: load a PDF, extract text (all at once or page by page), process the text with regular expressions or NLP libraries. For structured documents like invoices, I extract text then parse for known patterns: "Invoice Number: 12345", "Total: $199.99". For unstructured documents like contracts, I extract text and feed it to search indexing or ML models.

using IronPdf;
// Install via NuGet: Install-Package IronPdf

var pdf = PdfDocument.FromFile("document.pdf");
string allText = pdf.ExtractAllText();
Console.WriteLine(allText);

That's the minimal case — load a PDF, extract all text, use it. The extracted text preserves reading order and includes line breaks between paragraphs. For most documents, this is sufficient. You get text you can search, index, or process further.

How Do I Extract Text from Specific Pages?

PDFs often have hundreds of pages. Extracting all text wastes time and memory if you only need specific pages. Use ExtractTextFromPage() with zero-based indexing (first page is index 0).

Extracting text from the first page:

var pdf = PdfDocument.FromFile("report.pdf");
string firstPageText = pdf.ExtractTextFromPage(0);

Extracting from multiple specific pages:

var pdf = PdfDocument.FromFile("document.pdf");

for (int i = 0; i < 5; i++)  // First 5 pages
{
    string pageText = pdf.ExtractTextFromPage(i);
    Console.WriteLine($"Page {i + 1}:\n{pageText}\n");
}

This is useful for documents where important information is always on specific pages. Invoices typically have summaries on the first page. Contracts have signatures on the last page. I extract just those pages instead of processing the entire document.
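
For example, grabbing just the last page (PdfDocument exposes a PageCount property for this):

var pdf = PdfDocument.FromFile("contract.pdf");
string lastPageText = pdf.ExtractTextFromPage(pdf.PageCount - 1);  // last page, zero-based index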

Page-by-page extraction also helps with memory management. Large PDFs with hundreds of high-resolution images consume gigabytes when fully loaded. Extracting text page by page keeps memory usage constant — load one page, extract text, dispose, move to next page. I've processed 5000-page PDF archives this way without running out of memory.

For parallel processing, extract pages concurrently. Load the PDF once, then spawn tasks that each extract from different pages. This scales well on multi-core servers. I've processed 100-page PDFs in under a second using 8-core parallelization with IronPDF.
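
Here's a minimal sketch of that pattern, assuming page-level extraction on a single loaded document is safe to call concurrently (worth verifying for your library version):

using System.Threading.Tasks;
using IronPdf;

var pdf = PdfDocument.FromFile("document.pdf");
var pages = new string[pdf.PageCount];

// Each task writes to its own slot in a pre-sized array, so no locking is needed.
Parallel.For(0, pdf.PageCount, i =>
{
    pages[i] = pdf.ExtractTextFromPage(i);
});

string allText = string.Join("\n", pages);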

What If the Extracted Text Is Garbled or Incomplete?

Text extraction quality depends on how the PDF was created. PDFs generated from HTML, Word, or databases extract cleanly. Scanned PDFs (images of pages) have no text to extract — you need OCR. Some PDFs use custom fonts or encoding that extraction libraries misinterpret.

For scanned PDFs, text extraction returns nothing or garbage. The PDF contains images of text, not actual text data. You must use OCR (Optical Character Recognition) to recognize text in the images. IronPDF integrates with IronOCR for this:

using IronPdf;
using IronOcr;

var pdf = PdfDocument.FromFile("scanned.pdf");
var ocr = new IronTesseract();

// Rasterize PDF page to image, then OCR it
var pageImage = pdf.ToBitmap(0);  // First page
var ocrResult = ocr.Read(pageImage);
string extractedText = ocrResult.Text;

OCR is slower and less accurate than text extraction, but it's the only option for image-based PDFs. Accuracy depends on image quality and font characteristics. Clean scans with standard fonts achieve 95%+ accuracy. Poor scans or unusual fonts drop to 70-80%. I always validate OCR results when using them for automated processing.
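
One lightweight way to validate: sanity-check the output for fields you know should be present before trusting it. A sketch, where the pattern and the review step are placeholders for your own logic:

using System.Text.RegularExpressions;

// Hypothetical check for an invoice scan: the output should contain an invoice number.
bool looksValid = Regex.IsMatch(extractedText, @"Invoice Number:\s*\d+");

if (!looksValid)
{
    // Don't feed this result into automated processing; queue it for manual review.
    Console.WriteLine("OCR output failed validation - flagging for review.");
}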

For PDFs with selectable text but poor extraction, the issue is often custom fonts or non-standard encoding. Some PDF creators embed fonts in ways that prevent extraction libraries from mapping glyphs to Unicode characters. The PDF displays correctly, but extraction fails because the library can't determine what character each glyph represents.

There's no universal fix for encoding issues. Try different PDF libraries — what fails in one might work in another. For critical documents, I've extracted using multiple libraries and compared results, using the output that looks most correct. This is tedious but sometimes necessary for problematic PDFs.

If you control PDF creation, ensure you're using standard fonts and encodings. PDFs generated from HTML using IronPDF extract perfectly because the text encoding is straightforward. Avoid PDFs created with specialized tools that use custom encoding schemes.
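
As a quick illustration, a round trip through IronPDF's ChromePdfRenderer (the HTML-to-PDF renderer in recent versions) produces a PDF whose text extracts cleanly:

using IronPdf;

// Render HTML to PDF, then extract the text back out.
var renderer = new ChromePdfRenderer();
var pdf = renderer.RenderHtmlAsPdf("<h1>Invoice 12345</h1><p>Total: $199.99</p>");

string text = pdf.ExtractAllText();
Console.WriteLine(text);  // the rendered strings come back out as plain text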

How Do I Parse Structured Data from PDFs?

Extracting text is the first step. Parsing structured data requires analyzing the text for patterns. Invoices have key-value pairs ("Invoice Number: 12345"), tables have columns, forms have labeled fields. Regular expressions and string parsing extract this structure.

Parsing an invoice number:

var pdf = PdfDocument.FromFile("invoice.pdf");
string text = pdf.ExtractAllText();

var match = Regex.Match(text, @"Invoice Number:\s*(\d+)");
if (match.Success)
{
    string invoiceNumber = match.Groups[1].Value;
    Console.WriteLine($"Found invoice number: {invoiceNumber}");
}

The regex looks for "Invoice Number:" followed by optional whitespace and one or more digits. This works if invoices follow a consistent format. Real-world documents have variations — "Invoice #:", "Inv No:", "Invoice ID:" — requiring more complex patterns or multiple regexes.
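
One way to absorb those variations is a single pattern with alternation; extend the label list to whatever your documents actually use:

using System.Text.RegularExpressions;

// Covers "Invoice Number: 12345", "Invoice #: 12345", "Inv No: 12345", "Invoice ID: 12345"
var match = Regex.Match(text,
    @"(?:Invoice\s*(?:Number|#|ID)|Inv\s*No\.?)\s*[:#]?\s*(\d+)",
    RegexOptions.IgnoreCase);

if (match.Success)
{
    string invoiceNumber = match.Groups[1].Value;
}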

For tables, text extraction doesn't preserve table structure explicitly. You get lines of text. Reconstructing table rows and columns requires analyzing spacing and alignment. I use heuristics: lines with consistent tab positions likely form table columns, lines with similar content patterns (numbers, dates) likely form data rows.
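
A bare-bones version of that heuristic: split each line on runs of whitespace and treat lines that yield several cells as candidate rows. A sketch, not production code:

using System.Text.RegularExpressions;

foreach (var line in text.Split('\n'))
{
    // Runs of two or more spaces (or a tab) usually mark column gaps.
    var cells = Regex.Split(line.Trim(), @"\s{2,}|\t");

    // Lines that split into several cells are candidate table rows.
    if (cells.Length >= 3)
    {
        Console.WriteLine(string.Join(" | ", cells));
    }
}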

Some PDF libraries offer table extraction APIs that parse table structure for you. These work well for simple tables with clear borders and consistent formatting. Complex tables with merged cells or nested tables still require manual parsing.

I've found that documents created from templates (like invoice generators) parse reliably with regex patterns. Hand-created PDFs or scanned documents are much harder because formatting isn't consistent. For those, I sometimes extract text, then manually review and correct the parsing output to create training data for ML models.

Can I Extract Text from Password-Protected PDFs?

Not without the password. PDFs can be encrypted with user passwords (prevents opening) and owner passwords (prevents editing/printing). Text extraction requires opening the PDF, which requires the user password if one is set.

To extract from password-protected PDFs, provide the password when loading:

var pdf = PdfDocument.FromFile("protected.pdf", "password123");
string text = pdf.ExtractAllText();

If the password is wrong or missing, loading throws an exception. Wrap it in try-catch to handle gracefully:

try
{
    var pdf = PdfDocument.FromFile("protected.pdf", "password123");
    string text = pdf.ExtractAllText();
}
catch (Exception ex)
{
    Console.WriteLine($"Failed to open PDF: {ex.Message}");
}

For owner-password-protected PDFs (no user password, but editing/printing restricted), text extraction usually works without the password. The owner password protects modification, not viewing. But some PDFs restrict text extraction with permissions — attempting extraction throws an exception or returns empty text.
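
Since you can't know in advance which behavior you'll get, handle both cases: catch the exception and check for empty output:

try
{
    // No user password needed to open; owner restrictions may still apply.
    var pdf = PdfDocument.FromFile("restricted.pdf");
    string text = pdf.ExtractAllText();

    if (string.IsNullOrWhiteSpace(text))
    {
        Console.WriteLine("Extraction returned nothing - permissions may block it.");
    }
}
catch (Exception ex)
{
    Console.WriteLine($"Extraction not permitted: {ex.Message}");
}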

There's no legitimate shortcut around PDF encryption without the password. Password recovery tools exist, but they're slow and often ineffective against strong passwords. If you legitimately need access to an encrypted PDF, contact the document creator for the password.

How Do I Extract Text While Maintaining Formatting?

Plain text extraction strips all formatting — you lose bold, italics, font sizes, colors. For most use cases (search indexing, data extraction), formatting doesn't matter. But sometimes you need to preserve structure.

IronPDF's text extraction returns plain text with line breaks and spacing approximately preserved. Paragraphs have newlines between them. Tables have columns separated by tabs or spaces. This basic structure is often sufficient for parsing.

For richer formatting preservation, convert the PDF to HTML instead of plain text:

var pdf = PdfDocument.FromFile("document.pdf");
string html = pdf.ToHtml();

The HTML output includes text, images, and approximate styling. You can process the HTML with an HTML parser to extract text while understanding structure — headings, lists, tables, links. This is more complex than plain text extraction but preserves semantic information.
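
For example, with a third-party HTML parser such as HtmlAgilityPack (a separate NuGet package, and assuming the converted HTML uses standard heading tags), you can recover the document's outline:

using HtmlAgilityPack;
// Install via NuGet: Install-Package HtmlAgilityPack

var doc = new HtmlDocument();
doc.LoadHtml(html);

// Select heading elements to pull out the document structure.
var headings = doc.DocumentNode.SelectNodes("//h1|//h2|//h3");
if (headings != null)
{
    foreach (var h in headings)
    {
        Console.WriteLine($"{h.Name}: {h.InnerText.Trim()}");
    }
}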

I use HTML conversion when I need to display extracted content with formatting intact. For example, converting PDF reports to web pages preserves the visual layout. For pure data extraction, plain text is simpler and faster.

Quick Reference

  • Extract all text: pdf.ExtractAllText()
  • Extract from a page: pdf.ExtractTextFromPage(pageIndex)
  • Extract from pages 1-5: loop ExtractTextFromPage(i) for i = 0 to 4
  • Load password-protected: PdfDocument.FromFile("file.pdf", "password")
  • Parse specific data: Regex.Match() on extracted text
  • Handle scanned PDFs: OCR (IronOCR)
  • Preserve formatting: convert to HTML with pdf.ToHtml()

Key Principles:

  • PDF text extraction quality depends on how the PDF was created
  • Scanned PDFs require OCR, not text extraction
  • Use page-by-page extraction for large documents to manage memory
  • Parse extracted text with regex for structured data
  • Always handle exceptions when loading or extracting from PDFs

The complete PDF parsing guide covers advanced scenarios like custom text extraction strategies and handling malformed PDFs.


Written by Jacob Mellor, CTO at Iron Software. Jacob created IronPDF and leads a team of 50+ engineers building .NET document processing libraries.
