Our accounts payable team was manually typing invoice data from 500+ PDFs per week into our ERP system. Hours of repetitive data entry, constant typos, and mounting frustration.
I built a C# tool that extracts text from invoices automatically, parses the data, and imports it directly into our database. No more manual entry. Here's how.
How Do I Extract All Text from a PDF?
Use ExtractAllText to get the entire document as a string:
using IronPdf;
// Install via NuGet: Install-Package IronPdf
var pdf = PdfDocument.FromFile("invoice.pdf");
string text = pdf.ExtractAllText();
Console.WriteLine(text);
This returns all text from every page. Pages are separated by four consecutive newlines (IronPDF uses Environment.NewLine, so on Windows each one is \r\n rather than \n).
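If you need the pages back as separate strings, that separator can be split on. The following is a minimal sketch, assuming the four-newline separator described above. SplitPages is a hypothetical helper, not part of IronPDF, and it normalizes Windows line endings first so it works with either style:

```csharp
using System;

// Hypothetical helper (not an IronPDF API): splits ExtractAllText-style output
// into per-page strings, assuming pages are separated by four newlines.
static string[] SplitPages(string allText)
{
    // Normalize \r\n to \n so the separator matches on any platform.
    string normalized = allText.Replace("\r\n", "\n");
    return normalized.Split("\n\n\n\n", StringSplitOptions.None);
}

string sample = "Page one text\n\n\n\nPage two text";
string[] pages = SplitPages(sample);
Console.WriteLine(pages.Length); // 2
```

A page whose own text contains that many blank lines would be split as well, so per-page extraction is the safer choice when page boundaries matter.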
Can I Extract Text from Specific Pages?
Yes. Access individual pages:
using IronPdf;
using System.Linq;
var pdf = PdfDocument.FromFile("report.pdf");
// Extract from first page only
string page1 = pdf.Pages[0].ExtractText();
// Extract from pages 3-5 (zero-based indexes 2 through 4)
string pages3to5 = string.Join("\n",
    pdf.Pages.Skip(2).Take(3).Select(p => p.ExtractText()));
Console.WriteLine(page1);
This is useful when data appears in consistent locations, like invoice totals always on page 1.
How Do I Parse Extracted Text?
Use string manipulation or regex:
using IronPdf;
using System.Globalization;
using System.Text.RegularExpressions;
var pdf = PdfDocument.FromFile("invoice.pdf");
string text = pdf.ExtractAllText();
// Find invoice number (e.g., "Invoice #12345")
var invoiceMatch = Regex.Match(text, @"Invoice #(\d+)");
if (invoiceMatch.Success)
{
    string invoiceNumber = invoiceMatch.Groups[1].Value;
    Console.WriteLine($"Invoice: {invoiceNumber}");
}
// Find total amount (e.g., "Total: $1,234.56")
var totalMatch = Regex.Match(text, @"Total:\s*\$?([\d,]+\.\d{2})");
if (totalMatch.Success)
{
    // Use the captured group; the whole match still contains "Total: $"
    string total = totalMatch.Groups[1].Value.Replace(",", "");
    // InvariantCulture makes "." the decimal point regardless of machine locale
    decimal amount = decimal.Parse(total, CultureInfo.InvariantCulture);
    Console.WriteLine($"Amount: {amount:C}");
}
This extracts structured data from unstructured PDF text.
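For an import job it helps to know when a field is missing rather than silently skipping it. Here is a minimal sketch that wraps the same two patterns in a single method and runs on a plain string, so it can be tried without a PDF. TryParseInvoice is a hypothetical name, and the field formats are assumed to match the examples in the comments above:

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper: pulls the invoice number and total out of extracted text.
// Returns false if either field is missing or the amount is not a valid number.
static bool TryParseInvoice(string text, out string invoiceNumber, out decimal total)
{
    invoiceNumber = "";
    total = 0m;

    Match numberMatch = Regex.Match(text, @"Invoice #(\d+)");
    Match totalMatch = Regex.Match(text, @"Total:\s*\$?([\d,]+\.\d{2})");
    if (!numberMatch.Success || !totalMatch.Success)
        return false;

    invoiceNumber = numberMatch.Groups[1].Value;

    // AllowThousands accepts "1,234.56"; InvariantCulture fixes "." as the decimal point.
    return decimal.TryParse(
        totalMatch.Groups[1].Value,
        NumberStyles.AllowThousands | NumberStyles.AllowDecimalPoint,
        CultureInfo.InvariantCulture,
        out total);
}

string sampleText = "Invoice #12345\nTotal: $1,234.56";
if (TryParseInvoice(sampleText, out string number, out decimal amount))
{
    Console.WriteLine($"{number}: {amount.ToString(CultureInfo.InvariantCulture)}"); // 12345: 1234.56
}
```

Returning false instead of throwing lets a batch job route unparseable invoices to a manual review queue.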
Can I Get Text Line by Line?
Yes. Use the Lines property:
using IronPdf;
var pdf = PdfDocument.FromFile("document.pdf");
foreach (var line in pdf.Pages[0].Lines)
{
    Console.WriteLine($"Line at Y={line.BoundingBox.Bottom}: {line.Contents}");
}
Each line includes its text content and bounding box coordinates. This is useful for table extraction where data aligns vertically.
How Do I Extract Text by Coordinates?
Access the Characters property to get individual character positions:
using IronPdf;
var pdf = PdfDocument.FromFile("form.pdf");
foreach (var character in pdf.Pages[0].Characters)
{
    Console.WriteLine($"'{character.Contents}' at ({character.BoundingBox.Left}, {character.BoundingBox.Bottom})");
}
I used this to extract data from forms where field values always appear at specific X/Y coordinates.
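Reading a form field by position comes down to keeping the characters that fall inside a rectangle. The sketch below shows that filter on plain tuples so it runs without a PDF. TextInRegion is a hypothetical helper, the tuples stand in for the character objects, and it assumes the field fits on a single line:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: concatenates the characters whose anchor point (Left, Bottom)
// falls inside a rectangle. Tuples stand in for the library's character objects.
static string TextInRegion(
    IEnumerable<(string Contents, double Left, double Bottom)> characters,
    double minX, double minY, double maxX, double maxY)
{
    var inside = characters
        .Where(c => c.Left >= minX && c.Left <= maxX && c.Bottom >= minY && c.Bottom <= maxY)
        .OrderBy(c => c.Left); // left to right; assumes the field is a single line
    return string.Concat(inside.Select(c => c.Contents));
}

var sampleChars = new List<(string Contents, double Left, double Bottom)>
{
    ("4", 120.0, 700.0), ("2", 126.0, 700.0), ("X", 300.0, 700.0), ("Y", 120.0, 500.0),
};
Console.WriteLine(TextInRegion(sampleChars, 100, 690, 200, 710)); // 42
```

With a real document, each character's Contents, BoundingBox.Left and BoundingBox.Bottom from the loop above would be projected into those tuples first.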
Can I Extract Tables from PDFs?
PDFs don't have table structures—just text positioned on a page. But you can reconstruct tables from coordinates:
using IronPdf;
using System.Linq;
var pdf = PdfDocument.FromFile("table.pdf");
var lines = pdf.Pages[0].Lines;
// Group lines by similar Y coordinates (same row)
var rows = lines.GroupBy(l => Math.Round(l.BoundingBox.Bottom / 10) * 10)
    .OrderByDescending(g => g.Key);
foreach (var row in rows)
{
    var cells = row.OrderBy(l => l.BoundingBox.Left).Select(l => l.Contents);
    Console.WriteLine(string.Join(" | ", cells));
}
This works for simple tables. Complex tables with merged cells require more sophisticated parsing.
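One weakness of fixed 10-unit buckets is that two cells in the same row can land in different buckets: Y values of 694 and 696 round to 690 and 700. A possible alternative is to start a new row only when the gap to the previous fragment exceeds a tolerance. This is a sketch on plain tuples, not IronPDF types. BuildRows and the tolerance value are assumptions for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: groups text fragments into rows when their Y positions are
// within `tolerance` of the previous fragment, then orders each row left to right.
static List<string> BuildRows(
    IEnumerable<(string Text, double Left, double Bottom)> fragments,
    double tolerance)
{
    var rows = new List<List<(string Text, double Left, double Bottom)>>();
    double lastBottom = double.NaN;
    // Higher Bottom values come first, assuming Y grows upward from the page bottom.
    foreach (var fragment in fragments.OrderByDescending(f => f.Bottom))
    {
        if (rows.Count == 0 || Math.Abs(lastBottom - fragment.Bottom) > tolerance)
            rows.Add(new List<(string Text, double Left, double Bottom)>());
        rows[rows.Count - 1].Add(fragment);
        lastBottom = fragment.Bottom;
    }

    return rows
        .Select(r => string.Join(" | ", r.OrderBy(f => f.Left).Select(f => f.Text)))
        .ToList();
}

var sampleFragments = new[]
{
    (Text: "Widget", Left: 50.0, Bottom: 699.0),
    (Text: "Item", Left: 50.0, Bottom: 720.0),
    (Text: "Price", Left: 200.0, Bottom: 721.0),
    (Text: "9.99", Left: 200.0, Bottom: 701.0),
};
foreach (string row in BuildRows(sampleFragments, 3.0))
    Console.WriteLine(row);
// Item | Price
// Widget | 9.99
```

The tolerance should be smaller than the row spacing of the table, otherwise adjacent rows chain together into one.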
How Do I Handle Scanned PDFs?
Scanned PDFs are images, not searchable text, so ExtractAllText returns an empty or whitespace-only string:
using IronPdf;
var pdf = PdfDocument.FromFile("scanned.pdf");
string text = pdf.ExtractAllText();
if (string.IsNullOrWhiteSpace(text))
{
    Console.WriteLine("This is a scanned PDF. OCR required.");
}
You need OCR (Optical Character Recognition) to extract text from scanned documents. IronOCR integrates with IronPDF for this.
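A whole-document check misses mixed files where only some pages are scans. One way to handle that, sketched here on a plain list of per-page strings, is to flag each page that has almost no visible text. PagesNeedingOcr is a hypothetical helper and the threshold of 5 characters is an arbitrary assumption to tune against your own documents:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: returns the 1-based numbers of pages whose extracted text has
// fewer than `minChars` non-whitespace characters, i.e. pages that probably need OCR.
static List<int> PagesNeedingOcr(IReadOnlyList<string> pageTexts, int minChars)
{
    var result = new List<int>();
    for (int i = 0; i < pageTexts.Count; i++)
    {
        int visible = (pageTexts[i] ?? "").Count(ch => !char.IsWhiteSpace(ch));
        if (visible < minChars)
            result.Add(i + 1);
    }
    return result;
}

var samplePages = new[] { "Invoice #12345 Total: $10.00", "   \n", "" };
Console.WriteLine(string.Join(", ", PagesNeedingOcr(samplePages, 5))); // 2, 3
```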
How Do I Extract Images from PDFs?
Use ExtractAllImages:
using IronPdf;
var pdf = PdfDocument.FromFile("brochure.pdf");
var images = pdf.ExtractAllImages();
int i = 0;
foreach (var image in images)
{
    // Encode explicitly as PNG; a .png file name alone does not change the bytes written
    image.SaveAs($"image-{i++}.png", IronSoftware.Drawing.AnyBitmap.ImageFormat.Png);
}
This saves each embedded image as a PNG file.
Can I Convert Images to Different Formats?
Use the SaveAs method with a format parameter:
using IronPdf;
var pdf = PdfDocument.FromFile("catalog.pdf");
var images = pdf.ExtractAllImages();
foreach (var image in images)
{
    image.SaveAs($"image-{Guid.NewGuid()}.jpg", IronSoftware.Drawing.AnyBitmap.ImageFormat.Jpeg);
}
Supported formats: PNG, JPEG, BMP, GIF, TIFF.
How Do I Extract Raw Image Data?
Use ExtractAllRawImages to get images in their original format:
using IronPdf;
using System.IO;
var pdf = PdfDocument.FromFile("photos.pdf");
var rawImages = pdf.ExtractAllRawImages();
int i = 0;
// Assumes each item is the raw byte[] of one embedded image
foreach (byte[] rawImage in rawImages)
{
    File.WriteAllBytes($"raw-image-{i++}.dat", rawImage);
}
This preserves the original encoding (JPEG, PNG, etc.) without re-encoding.
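Because the bytes keep their original encoding, the file signature can often tell you which extension to use instead of .dat. This sketch checks the well-known leading bytes of common formats. GuessImageExtension is a hypothetical helper, and some PDF image streams carry no container header at all, which is why it falls back to .dat:

```csharp
using System;

// Hypothetical helper: guesses a file extension from the first bytes of an image.
// Falls back to ".dat" when the signature is not recognized.
static string GuessImageExtension(byte[] data)
{
    static bool StartsWith(byte[] bytes, params byte[] signature)
    {
        if (bytes.Length < signature.Length) return false;
        for (int i = 0; i < signature.Length; i++)
            if (bytes[i] != signature[i]) return false;
        return true;
    }

    if (StartsWith(data, 0xFF, 0xD8, 0xFF)) return ".jpg";
    if (StartsWith(data, 0x89, 0x50, 0x4E, 0x47)) return ".png";   // "\x89PNG"
    if (StartsWith(data, 0x47, 0x49, 0x46, 0x38)) return ".gif";   // "GIF8"
    if (StartsWith(data, 0x42, 0x4D)) return ".bmp";               // "BM"
    if (StartsWith(data, 0x49, 0x49, 0x2A, 0x00) ||                // little-endian TIFF
        StartsWith(data, 0x4D, 0x4D, 0x00, 0x2A)) return ".tif";   // big-endian TIFF
    return ".dat";
}

Console.WriteLine(GuessImageExtension(new byte[] { 0xFF, 0xD8, 0xFF, 0xE0 })); // .jpg
```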
Can I Search for Specific Text?
Extract all text and use Contains or regex:
using IronPdf;
var pdf = PdfDocument.FromFile("contract.pdf");
string text = pdf.ExtractAllText();
if (text.Contains("Non-Disclosure Agreement", StringComparison.OrdinalIgnoreCase))
{
    Console.WriteLine("This is an NDA.");
}
For page-specific searches:
for (int i = 0; i < pdf.PageCount; i++)
{
    string pageText = pdf.Pages[i].ExtractText();
    if (pageText.Contains("Confidential"))
    {
        Console.WriteLine($"'Confidential' found on page {i + 1}");
    }
}
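Contains only tells you whether a term is present. When reviewing contracts it can be more useful to see each hit with a little surrounding text. This is a minimal sketch on a plain string. FindWithContext is a hypothetical helper and the radius is an arbitrary choice:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: finds every case-insensitive occurrence of `term` and returns
// it with up to `radius` characters of surrounding text on each side.
static List<string> FindWithContext(string text, string term, int radius)
{
    var snippets = new List<string>();
    if (string.IsNullOrEmpty(term)) return snippets;

    int index = text.IndexOf(term, StringComparison.OrdinalIgnoreCase);
    while (index >= 0)
    {
        int start = Math.Max(0, index - radius);
        int end = Math.Min(text.Length, index + term.Length + radius);
        snippets.Add(text.Substring(start, end - start));
        // Continue searching after the current match.
        index = text.IndexOf(term, index + term.Length, StringComparison.OrdinalIgnoreCase);
    }
    return snippets;
}

string sampleContract = "This document is CONFIDENTIAL. Do not share confidential data.";
foreach (string snippet in FindWithContext(sampleContract, "confidential", 8))
    Console.WriteLine($"...{snippet}...");
```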
How Do I Extract Text from Password-Protected PDFs?
Open the PDF with the password first:
using IronPdf;
var pdf = PdfDocument.FromFile("secured.pdf", "password123");
string text = pdf.ExtractAllText();
Console.WriteLine(text);
If the password is wrong, IronPDF throws an exception.
What About Extraction Performance?
Large PDFs (100+ pages) can take time. Process pages in parallel:
using IronPdf;
using System.Threading.Tasks;
var pdf = PdfDocument.FromFile("large-document.pdf");
// One slot per page keeps the output in page order;
// an unordered collection such as ConcurrentBag would scramble the pages.
var pageTexts = new string[pdf.PageCount];
Parallel.For(0, pdf.PageCount, i =>
{
    pageTexts[i] = pdf.Pages[i].ExtractText();
});
string allText = string.Join("\n", pageTexts);
This can speed up extraction on multi-core systems. Measure it against sequential extraction on your own documents before relying on it.
Can I Extract Text and Preserve Layout?
PDFs don't have inherent layout—text is positioned at coordinates. ExtractAllText returns text in reading order, but layout isn't always perfect.
For better layout preservation, use line-by-line extraction and rebuild structure based on coordinates:
using IronPdf;
using System.Linq;
var pdf = PdfDocument.FromFile("document.pdf");
var page = pdf.Pages[0];
var lines = page.Lines.OrderByDescending(l => l.BoundingBox.Bottom)
    .ThenBy(l => l.BoundingBox.Left);
foreach (var line in lines)
{
    Console.WriteLine(line.Contents);
}
This sorts text top-to-bottom, left-to-right.
How Do I Batch Extract from Multiple PDFs?
Loop through a directory:
using IronPdf;
using System.IO;
var files = Directory.GetFiles(@"C:\invoices", "*.pdf");
foreach (var file in files)
{
    // Dispose each document so its memory is released before the next file loads
    using var pdf = PdfDocument.FromFile(file);
    string text = pdf.ExtractAllText();
    var outputPath = Path.ChangeExtension(file, ".txt");
    File.WriteAllText(outputPath, text);
}
We process thousands of invoices nightly this way, extracting data for import into our accounting system.
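In an unattended nightly run, one corrupt file should not stop the whole batch. The sketch below shows one way to collect failures and keep going. ExtractAll is a hypothetical helper, and the extract delegate stands in for the load-and-extract step from the loop above so the example runs without any PDFs:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: runs `extract` on every file and collects failures instead of
// stopping at the first bad PDF. `extract` stands in for the load-and-extract step.
static Dictionary<string, string> ExtractAll(
    IEnumerable<string> files,
    Func<string, string> extract,
    List<string> failures)
{
    var results = new Dictionary<string, string>();
    foreach (string file in files)
    {
        try
        {
            results[file] = extract(file);
        }
        catch (Exception ex)
        {
            // Record the failure and keep going; the job can report these at the end.
            failures.Add($"{file}: {ex.Message}");
        }
    }
    return results;
}

var failed = new List<string>();
var texts = ExtractAll(
    new[] { "a.pdf", "broken.pdf" },
    path => path == "broken.pdf" ? throw new InvalidOperationException("cannot open") : "text of " + path,
    failed);
Console.WriteLine($"{texts.Count} extracted, {failed.Count} failed"); // 1 extracted, 1 failed
```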
What Are Extraction Limitations?
Complex layouts: Multi-column text or unusual positioning may extract in unexpected order.
Embedded fonts: Some PDFs use custom font encodings that don't map to standard characters. Text may appear garbled.
Form fields: ExtractAllText gets static text, not form field values. Use the form API to read fields.
Annotations: Comments and markup aren't included in extracted text.
How Do IronPDF's Alternatives Compare?
iTextSharp: Can extract text via PdfTextExtractor.GetTextFromPage(). API is more verbose than IronPDF.
PdfPig (open source): Excellent for text extraction with detailed positioning data. Good alternative if you need free/open source.
Aspose.PDF: Similar functionality to IronPDF but typically higher licensing costs.
I chose IronPDF because ExtractAllText is simple for basic needs, and Lines/Characters properties give me coordinate-level control when needed.
Can I Extract Metadata?
Yes, access the MetaData property:
using IronPdf;
var pdf = PdfDocument.FromFile("document.pdf");
Console.WriteLine($"Title: {pdf.MetaData.Title}");
Console.WriteLine($"Author: {pdf.MetaData.Author}");
Console.WriteLine($"Created: {pdf.MetaData.CreationDate}");
Console.WriteLine($"Modified: {pdf.MetaData.ModifiedDate}");
This retrieves document properties without extracting content.
How Do I Test Extraction Accuracy?
Create test PDFs with known content:
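using IronPdf;
using NUnit.Framework;
[Test]
public void ExtractsInvoiceNumber()
{
    using var pdf = PdfDocument.FromFile("test-invoice.pdf");
    string text = pdf.ExtractAllText();
    // Does.Contain reports the actual text when the assertion fails
    Assert.That(text, Does.Contain("INV-12345"));
}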
I maintain a suite of test PDFs covering various layouts, fonts, and structures to ensure extraction logic works correctly across different document types.
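Exact string comparisons in such tests tend to break on line-break and spacing differences rather than on real extraction errors. One option is to normalize whitespace on both sides before comparing. NormalizeWhitespace here is a hypothetical helper, shown on a hard-coded string:

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: collapses runs of whitespace to single spaces and trims the ends,
// so assertions do not depend on exact line breaks in the extracted text.
static string NormalizeWhitespace(string text)
{
    return Regex.Replace(text, @"\s+", " ").Trim();
}

string extracted = "Invoice   #12345\r\n\r\nTotal:\t$1,234.56\n";
Console.WriteLine(NormalizeWhitespace(extracted)); // Invoice #12345 Total: $1,234.56
```

This trades away the ability to test layout, so keep a few coordinate-based tests for documents where position matters.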
Written by Jacob Mellor, CTO at Iron Software. Jacob created IronPDF and leads a team of 50+ engineers building .NET document processing libraries.