IronSoftware

Extract Text from PDF in C# (.NET Guide)

Our accounts payable team was manually typing invoice data from 500+ PDFs per week into our ERP system. Hours of repetitive data entry, constant typos, and mounting frustration.

I built a C# tool that extracts text from invoices automatically, parses the data, and imports it directly into our database. No more manual entry. Here's how.

How Do I Extract All Text from a PDF?

Use ExtractAllText to get the entire document as a string:

using IronPdf;
// Install via NuGet: Install-Package IronPdf

var pdf = PdfDocument.FromFile("invoice.pdf");
string text = pdf.ExtractAllText();

Console.WriteLine(text);

This returns all text from every page. Pages are separated by four newline characters (\n\n\n\n).
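
If the four-newline separator described above holds for your IronPDF version (worth verifying on a sample file), the combined string can be split back into per-page chunks. This is a minimal sketch using only the .NET standard library; `SplitPages` is a hypothetical helper, not part of IronPDF:

```csharp
using System;

// Hypothetical helper: splits combined text on the four-newline page separator
// described above. Verify the separator against your IronPDF version first.
static string[] SplitPages(string allText) =>
    allText.Replace("\r\n", "\n")   // normalize Windows line endings
           .Split("\n\n\n\n", StringSplitOptions.None);

string sample = "Page one text\n\n\n\nPage two text\n\n\n\nPage three text";
string[] pages = SplitPages(sample);

for (int i = 0; i < pages.Length; i++)
{
    Console.WriteLine($"Page {i + 1}: {pages[i]}");
}
```

A page whose own text contains four consecutive newlines would be split incorrectly, so per-page extraction is the safer route when page boundaries matter.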

Can I Extract Text from Specific Pages?

Yes. Access individual pages:

using IronPdf;
using System.Linq; // Skip, Take, Select

var pdf = PdfDocument.FromFile("report.pdf");

// Extract from first page only
string page1 = pdf.Pages[0].ExtractText();

// Extract from pages 3-5
string pages3to5 = string.Join("\n",
    pdf.Pages.Skip(2).Take(3).Select(p => p.ExtractText()));

Console.WriteLine(page1);

This is useful when data appears in consistent locations, like invoice totals always on page 1.

How Do I Parse Extracted Text?

Use string manipulation or regex:

using IronPdf;
using System.Text.RegularExpressions;

var pdf = PdfDocument.FromFile("invoice.pdf");
string text = pdf.ExtractAllText();

// Find invoice number (e.g., "Invoice #12345")
var invoiceMatch = Regex.Match(text, @"Invoice #(\d+)");
if (invoiceMatch.Success)
{
    string invoiceNumber = invoiceMatch.Groups[1].Value;
    Console.WriteLine($"Invoice: {invoiceNumber}");
}

// Find total amount (e.g., "Total: $1,234.56")
var totalMatch = Regex.Match(text, @"Total:\s*\$?([\d,]+\.\d{2})");
if (totalMatch.Success)
{
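    // Use the captured group, not the whole match ("Total: $1,234.56")
    string total = totalMatch.Groups[1].Value.Replace(",", "");
    decimal amount = decimal.Parse(total, System.Globalization.CultureInfo.InvariantCulture);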
    Console.WriteLine($"Amount: {amount:C}");
}

This extracts structured data from unstructured PDF text.
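
For production use, the parsing step can be wrapped in a single function that reports failure instead of throwing. The sketch below uses only the .NET standard library; `TryParseInvoice` and the sample text are hypothetical, and the patterns assume the "Invoice #12345" and "Total: $1,234.56" formats shown above:

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper: pulls the invoice number and total out of extracted text.
// Returns false when either field is missing or the amount is not a valid number.
static bool TryParseInvoice(string text, out string invoiceNumber, out decimal total)
{
    invoiceNumber = "";
    total = 0m;

    var numberMatch = Regex.Match(text, @"Invoice\s*#\s*(?<number>\d+)", RegexOptions.IgnoreCase);
    // \b stops the case-insensitive pattern from matching inside "Subtotal:".
    var totalMatch = Regex.Match(text, @"\bTotal:\s*\$?(?<amount>[\d,]+\.\d{2})", RegexOptions.IgnoreCase);
    if (!numberMatch.Success || !totalMatch.Success) return false;

    invoiceNumber = numberMatch.Groups["number"].Value;

    // AllowThousands accepts "1,234.56"; InvariantCulture fixes "." as the
    // decimal point regardless of the machine's regional settings.
    return decimal.TryParse(
        totalMatch.Groups["amount"].Value,
        NumberStyles.AllowThousands | NumberStyles.AllowDecimalPoint,
        CultureInfo.InvariantCulture,
        out total);
}

string sampleText = "ACME Corp\nInvoice #12345\nSubtotal: $1,100.00\nTotal: $1,234.56";
if (TryParseInvoice(sampleText, out var number, out var amount))
{
    // Prints: Invoice 12345: 1234.56
    Console.WriteLine($"Invoice {number}: {amount.ToString("F2", CultureInfo.InvariantCulture)}");
}
```

Parsing with an explicit culture matters on servers whose regional settings use a comma as the decimal separator.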

Can I Get Text Line by Line?

Yes. Use the Lines property:

using IronPdf;

var pdf = PdfDocument.FromFile("document.pdf");

foreach (var line in pdf.Pages[0].Lines)
{
    Console.WriteLine($"Line at Y={line.BoundingBox.Bottom}: {line.Contents}");
}

Each line includes its text content and bounding box coordinates. This is useful for table extraction where data aligns vertically.

How Do I Extract Text by Coordinates?

Access the Characters property to get individual character positions:

using IronPdf;

var pdf = PdfDocument.FromFile("form.pdf");

foreach (var character in pdf.Pages[0].Characters)
{
    Console.WriteLine($"'{character.Contents}' at ({character.BoundingBox.Left}, {character.BoundingBox.Bottom})");
}

I used this to extract data from forms where field values always appear at specific X/Y coordinates.
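
One way to turn character positions into a field value is to keep only the characters that fall inside a known rectangle. The sketch below works on plain tuples standing in for the `Contents` and `BoundingBox` values above, so it runs without IronPDF; `TextInRegion` is a hypothetical helper, and it assumes the coordinate origin is at the bottom-left of the page:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: joins the characters whose lower-left corner falls inside
// a rectangle. Each tuple stands in for character.Contents and its BoundingBox.
static string TextInRegion(
    IEnumerable<(string Text, double Left, double Bottom)> chars,
    double minX, double maxX, double minY, double maxY)
{
    var inside = chars
        .Where(c => c.Left >= minX && c.Left <= maxX && c.Bottom >= minY && c.Bottom <= maxY)
        .OrderByDescending(c => c.Bottom)   // higher on the page first (origin assumed bottom-left)
        .ThenBy(c => c.Left);               // then left to right
    return string.Concat(inside.Select(c => c.Text));
}

var sampleChars = new List<(string Text, double Left, double Bottom)>
{
    ("N", 50.0, 700.0), ("o", 58.0, 700.0), (":", 66.0, 700.0),   // label "No:"
    ("4", 120.0, 700.0), ("2", 128.0, 700.0),                     // field value "42"
    ("X", 120.0, 500.0),                                          // unrelated text lower on the page
};

Console.WriteLine(TextInRegion(sampleChars, 100, 200, 690, 710)); // 42
```

The sort assumes characters on one line share the same Bottom value; real documents may need a small tolerance.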

Can I Extract Tables from PDFs?

PDFs don't have table structures—just text positioned on a page. But you can reconstruct tables from coordinates:

using IronPdf;
using System.Linq;

var pdf = PdfDocument.FromFile("table.pdf");
var lines = pdf.Pages[0].Lines;

// Group lines by similar Y coordinates (same row)
var rows = lines.GroupBy(l => Math.Round(l.BoundingBox.Bottom / 10) * 10)
                .OrderByDescending(g => g.Key);

foreach (var row in rows)
{
    var cells = row.OrderBy(l => l.BoundingBox.Left).Select(l => l.Contents);
    Console.WriteLine(string.Join(" | ", cells));
}

This works for simple tables. Complex tables with merged cells require more sophisticated parsing.
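
One weakness of rounding to fixed buckets is that two cells in the same row can land in different buckets: Bottom values of 704 and 706 round to 700 and 710. A tolerance-based grouping avoids that. The sketch below is one possible approach, not IronPDF's own method; it runs on plain tuples, and `BuildRows` and the sample data are hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: groups text fragments into rows when their Bottom values lie
// within `tolerance` of the first fragment in the row, then orders each row left to right.
static List<List<string>> BuildRows(
    IEnumerable<(string Text, double Left, double Bottom)> fragments, double tolerance)
{
    var rows = new List<List<(string Text, double Left, double Bottom)>>();
    foreach (var f in fragments.OrderByDescending(x => x.Bottom))
    {
        var current = rows.LastOrDefault();
        if (current == null || current[0].Bottom - f.Bottom > tolerance)
        {
            // Gap is larger than the tolerance: start a new row.
            current = new List<(string Text, double Left, double Bottom)>();
            rows.Add(current);
        }
        current.Add(f);
    }
    return rows.Select(r => r.OrderBy(c => c.Left).Select(c => c.Text).ToList()).ToList();
}

var sampleFragments = new List<(string Text, double Left, double Bottom)>
{
    ("Qty", 200.0, 704.0), ("Item", 50.0, 706.0), ("Price", 300.0, 705.0),
    ("2", 200.0, 684.0), ("Widget", 50.0, 686.0), ("9.99", 300.0, 685.0),
};

foreach (var row in BuildRows(sampleFragments, tolerance: 5.0))
{
    Console.WriteLine(string.Join(" | ", row));
}
// Item | Qty | Price
// Widget | 2 | 9.99
```

The tolerance should be smaller than the table's row spacing; roughly half the line height is a reasonable starting point.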

How Do I Handle Scanned PDFs?

Scanned PDFs contain page images rather than a text layer, so ExtractAllText returns an empty or whitespace-only string:

var pdf = PdfDocument.FromFile("scanned.pdf");
string text = pdf.ExtractAllText();

if (string.IsNullOrWhiteSpace(text))
{
    Console.WriteLine("This is a scanned PDF. OCR required.");
}

You need OCR (Optical Character Recognition) to extract text from scanned documents. IronOCR integrates with IronPDF for this.
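
Some documents mix digital and scanned pages, so a whole-document check can miss the scanned ones. The sketch below applies the same idea per page; it takes already-extracted page strings as input so it runs without IronPDF. `PagesNeedingOcr` and the 10-character threshold are hypothetical choices to tune for your documents:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: returns the 1-based numbers of pages whose extracted text has
// fewer than `minChars` non-whitespace characters, i.e. pages that probably need OCR.
static List<int> PagesNeedingOcr(IReadOnlyList<string> pageTexts, int minChars = 10)
{
    var result = new List<int>();
    for (int i = 0; i < pageTexts.Count; i++)
    {
        int visible = (pageTexts[i] ?? "").Count(ch => !char.IsWhiteSpace(ch));
        if (visible < minChars) result.Add(i + 1);
    }
    return result;
}

string[] samplePageTexts = { "Invoice #12345\nTotal: $1,234.56", "   \n", "" };
Console.WriteLine(string.Join(", ", PagesNeedingOcr(samplePageTexts))); // 2, 3
```

The page strings would come from calling ExtractText on each page, as in the page-specific example earlier.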

How Do I Extract Images from PDFs?

Use ExtractAllImages:

using IronPdf;
using System.IO;

var pdf = PdfDocument.FromFile("brochure.pdf");
var images = pdf.ExtractAllImages();

for (int i = 0; i < images.Length; i++)
{
    var image = images[i];
    File.WriteAllBytes($"image-{i}.png", image.BinaryData);
}

This saves all embedded images as PNG files.

Can I Convert Images to Different Formats?

Use the SaveAs method with a format parameter:

using IronPdf;

var pdf = PdfDocument.FromFile("catalog.pdf");
var images = pdf.ExtractAllImages();

foreach (var image in images)
{
    image.SaveAs($"image-{Guid.NewGuid()}.jpg", IronSoftware.Drawing.AnyBitmap.ImageFormat.Jpeg);
}

Supported formats: PNG, JPEG, BMP, GIF, TIFF.

How Do I Extract Raw Image Data?

Use ExtractAllRawImages to get images in their original format:

using IronPdf;
using System.IO;

var pdf = PdfDocument.FromFile("photos.pdf");
var rawImages = pdf.ExtractAllRawImages();

for (int i = 0; i < rawImages.Length; i++)
{
    File.WriteAllBytes($"raw-image-{i}.dat", rawImages[i].Data.BinaryData);
}

This preserves the original encoding (JPEG, PNG, etc.) without re-encoding.
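
Rather than saving everything as .dat, the extension can sometimes be guessed from the leading bytes of the data. Whether raw image bytes start with a standard file signature depends on how the image was embedded in the PDF, so the sketch below falls back to .dat when nothing matches. `GuessExtension` is a hypothetical helper built on well-known file signatures:

```csharp
using System;

// Hypothetical helper: guesses a file extension from the leading "magic" bytes.
static string GuessExtension(byte[] data)
{
    static bool StartsWith(byte[] source, params byte[] prefix)
    {
        if (source.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (source[i] != prefix[i]) return false;
        }
        return true;
    }

    if (StartsWith(data, 0xFF, 0xD8, 0xFF)) return ".jpg";
    if (StartsWith(data, 0x89, 0x50, 0x4E, 0x47)) return ".png";   // "\x89PNG"
    if (StartsWith(data, 0x47, 0x49, 0x46, 0x38)) return ".gif";   // "GIF8"
    if (StartsWith(data, 0x42, 0x4D)) return ".bmp";               // "BM"
    if (StartsWith(data, 0x49, 0x49, 0x2A, 0x00) ||
        StartsWith(data, 0x4D, 0x4D, 0x00, 0x2A)) return ".tif";   // little/big-endian TIFF
    return ".dat";   // unknown: keep the generic extension
}

byte[] jpegHeader = { 0xFF, 0xD8, 0xFF, 0xE0 };
Console.WriteLine(GuessExtension(jpegHeader)); // .jpg
```

In the loop above, the file name could then be built as `$"raw-image-{i}{GuessExtension(bytes)}"`, where `bytes` is the byte array being written.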

Can I Search for Specific Text?

Extract all text and use Contains or regex:

using IronPdf;

var pdf = PdfDocument.FromFile("contract.pdf");
string text = pdf.ExtractAllText();

if (text.Contains("Non-Disclosure Agreement", StringComparison.OrdinalIgnoreCase))
{
    Console.WriteLine("This is an NDA.");
}

For page-specific searches:

for (int i = 0; i < pdf.PageCount; i++)
{
    string pageText = pdf.Pages[i].ExtractText();
    if (pageText.Contains("Confidential"))
    {
        Console.WriteLine($"'Confidential' found on page {i + 1}");
    }
}
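
When reviewing matches, it helps to see the surrounding text and to find every occurrence, not just the first. The sketch below takes per-page strings as input so it runs without IronPDF; `FindTerm` and the sample pages are hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: finds every case-insensitive occurrence of a term and
// returns the 1-based page number with a short snippet of surrounding text.
static List<(int Page, string Snippet)> FindTerm(
    IReadOnlyList<string> pageTexts, string term, int context = 20)
{
    var hits = new List<(int Page, string Snippet)>();
    var pattern = new Regex(Regex.Escape(term), RegexOptions.IgnoreCase);

    for (int i = 0; i < pageTexts.Count; i++)
    {
        foreach (Match m in pattern.Matches(pageTexts[i]))
        {
            // Clamp the snippet window to the bounds of the page text.
            int start = Math.Max(0, m.Index - context);
            int end = Math.Min(pageTexts[i].Length, m.Index + m.Length + context);
            hits.Add((i + 1, pageTexts[i].Substring(start, end - start).Replace('\n', ' ')));
        }
    }
    return hits;
}

string[] samplePages =
{
    "Quarterly summary for the board.",
    "This section is CONFIDENTIAL and must not be shared.",
};

foreach (var (page, snippet) in FindTerm(samplePages, "confidential"))
{
    Console.WriteLine($"Page {page}: ...{snippet}...");
}
```

`Regex.Escape` keeps characters such as "." or "(" in the search term from being interpreted as pattern syntax.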

How Do I Extract Text from Password-Protected PDFs?

Open the PDF with the password first:

using IronPdf;

var pdf = PdfDocument.FromFile("secured.pdf", "password123");
string text = pdf.ExtractAllText();

Console.WriteLine(text);

If the password is wrong, IronPDF throws an exception.

What About Extraction Performance?

Large PDFs (100+ pages) can take time. Process pages in parallel:

using IronPdf;
using System.Threading.Tasks;

var pdf = PdfDocument.FromFile("large-document.pdf");

// One slot per page keeps the output in document order;
// an unordered collection such as ConcurrentBag would not.
var results = new string[pdf.PageCount];

Parallel.For(0, pdf.PageCount, i =>
{
    results[i] = pdf.Pages[i].ExtractText();
});

string allText = string.Join("\n", results);

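This can speed up extraction on multi-core systems. Measure it on your own documents, and confirm that concurrent page access is safe in the IronPDF version you use.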

Can I Extract Text and Preserve Layout?

PDFs don't have inherent layout—text is positioned at coordinates. ExtractAllText returns text in reading order, but layout isn't always perfect.

For better layout preservation, use line-by-line extraction and rebuild structure based on coordinates:

var page = pdf.Pages[0];
var lines = page.Lines.OrderByDescending(l => l.BoundingBox.Bottom)
                       .ThenBy(l => l.BoundingBox.Left);

foreach (var line in lines)
{
    Console.WriteLine(line.Contents);
}

This sorts text top-to-bottom, left-to-right.

How Do I Batch Extract from Multiple PDFs?

Loop through a directory:

using IronPdf;
using System.IO;

var files = Directory.GetFiles(@"C:\invoices", "*.pdf");

foreach (var file in files)
{
    // Dispose each document so resources are released before the next file
    using var pdf = PdfDocument.FromFile(file);
    string text = pdf.ExtractAllText();

    var outputPath = Path.ChangeExtension(file, ".txt");
    File.WriteAllText(outputPath, text);
}

We process thousands of invoices nightly this way, extracting data for import into our accounting system.
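
In a nightly job, one corrupt file should not stop the whole batch. The sketch below isolates each file in its own try/catch and collects failures for later review. The extraction step is passed in as a delegate so the example runs without IronPDF; `ExtractFolder`, the temp folder, and the placeholder files are all hypothetical:

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper: runs `extract` on each PDF in a folder, writes a .txt file
// next to it, and collects failures instead of stopping at the first bad file.
static List<(string FileName, string Error)> ExtractFolder(string folder, Func<string, string> extract)
{
    var failures = new List<(string FileName, string Error)>();
    foreach (var file in Directory.EnumerateFiles(folder, "*.pdf"))
    {
        try
        {
            File.WriteAllText(Path.ChangeExtension(file, ".txt"), extract(file));
        }
        catch (Exception ex)
        {
            failures.Add((file, ex.Message));
        }
    }
    return failures;
}

// Demo with placeholder files and a stand-in extractor that fails on "bad.pdf".
string demoFolder = Path.Combine(Path.GetTempPath(), "pdf-batch-demo-" + Guid.NewGuid().ToString("N"));
Directory.CreateDirectory(demoFolder);
File.WriteAllText(Path.Combine(demoFolder, "good.pdf"), "placeholder");
File.WriteAllText(Path.Combine(demoFolder, "bad.pdf"), "placeholder");

var failed = ExtractFolder(demoFolder, path =>
    path.EndsWith("bad.pdf") ? throw new InvalidDataException("corrupt file") : "extracted text");

Console.WriteLine($"{failed.Count} file(s) failed");   // 1 file(s) failed
Directory.Delete(demoFolder, recursive: true);
```

With IronPDF, the delegate would wrap the FromFile and ExtractAllText calls from the loop above.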

What Are Extraction Limitations?

Complex layouts: Multi-column text or unusual positioning may extract in unexpected order.

Embedded fonts: Some PDFs use custom font encodings that don't map to standard characters. Text may appear garbled.

Form fields: ExtractAllText gets static text, not form field values. Use the form API to read fields.

Annotations: Comments and markup aren't included in extracted text.

How Do IronPDF's Alternatives Compare?

iTextSharp: Can extract text via PdfTextExtractor.GetTextFromPage(). API is more verbose than IronPDF.

PdfPig (open source): Excellent for text extraction with detailed positioning data. Good alternative if you need free/open source.

Aspose.PDF: Similar functionality to IronPDF but typically higher licensing costs.

I chose IronPDF because ExtractAllText is simple for basic needs, and Lines/Characters properties give me coordinate-level control when needed.

Can I Extract Metadata?

Yes, access the MetaData property:

using IronPdf;

var pdf = PdfDocument.FromFile("document.pdf");

Console.WriteLine($"Title: {pdf.MetaData.Title}");
Console.WriteLine($"Author: {pdf.MetaData.Author}");
Console.WriteLine($"Created: {pdf.MetaData.CreatedDate}");
Console.WriteLine($"Modified: {pdf.MetaData.ModifiedDate}");

This retrieves document properties without extracting content.

How Do I Test Extraction Accuracy?

Create test PDFs with known content:

[Test]
public void ExtractsInvoiceNumber()
{
    var pdf = PdfDocument.FromFile("test-invoice.pdf");
    string text = pdf.ExtractAllText();

    Assert.IsTrue(text.Contains("INV-12345"));
}

I maintain a suite of test PDFs covering various layouts, fonts, and structures to ensure extraction logic works correctly across different document types.


Written by Jacob Mellor, CTO at Iron Software. Jacob created IronPDF and leads a team of 50+ engineers building .NET document processing libraries.
