Jeremy K.

Posted on Mar 2

Extract Text from PDF Files Using C#

#csharp #programming

Extracting text from PDF files is a common requirement in office workflows and software development. Manual copy-pasting is slow and does not scale to large document volumes. Traditional automation approaches often rely on external components such as Adobe Reader; these are cumbersome to deploy and cannot handle encrypted PDFs.

This guide shows how to use Free Spire.PDF for .NET, a free standalone library, to extract text from PDFs without installing a PDF reader. It starts with a comparison of extraction methods, then covers environment setup, the core implementation, and advanced techniques, with code examples for each step.

PDF Text Extraction: Solution Comparison

Comparison Metric	Traditional Methods	Free Spire.PDF for .NET
Dependency Requirements	Requires third-party software (e.g., Adobe Reader)	Fully self-contained kernel with zero external dependencies
Encrypted PDF Support	Cannot process password-protected PDFs	Natively supports encrypted PDFs (user/owner password)
Development Complexity	Requires COM component integration (verbose code ★★★★☆)	Clean, intuitive pure .NET APIs (minimal code ★★☆☆☆)
Documentation & Support	Disjointed, incomplete official documentation	Comprehensive, well-structured API documentation

Step-by-Step Guide: Extract PDF Text in 3 Simple Steps

1. Environment Setup

First, create a .NET Console Application (compatible with .NET Framework 4.6.1+ or .NET Core 3.1+). Install the Free Spire.PDF library via NuGet—this is the only dependency you’ll need.

Open the Package Manager Console in Visual Studio and run:

Install-Package FreeSpire.PDF

Important Note: The free version has a page limit (max 10 pages per PDF), making it ideal for personal use or small-scale projects.

2. Core Code: Extract Text from a Single PDF Page

The code below loads a PDF, extracts text from a specific page, and saves the output to a TXT file.

using System.IO;
using Spire.Pdf;
using Spire.Pdf.Texts;

namespace PdfTextExtraction
{
    class Program
    {
        static void Main(string[] args)
        {
            // Initialize PDF document object
            PdfDocument pdfDoc = new PdfDocument();

            // Load target PDF file (replace with your file path)
            pdfDoc.LoadFromFile("Sample.pdf");

            // Get the 2nd page (0-indexed Pages collection)
            PdfPageBase targetPage = pdfDoc.Pages[1];

            // Create text extractor for the target page
            PdfTextExtractor textExtractor = new PdfTextExtractor(targetPage);

            // Configure extraction options (extract all text on the page)
            PdfTextExtractOptions extractOptions = new PdfTextExtractOptions
            {
                IsExtractAllText = true // Extract full page text (disable for partial extraction)
            };

            // Execute text extraction
            string extractedText = textExtractor.ExtractText(extractOptions);

            // Save extracted text to a TXT file
            File.WriteAllText("Extracted_Page_Text.txt", extractedText);

            // Clean up resources (critical for memory management)
            pdfDoc.Close();
        }
    }
}

Key Component Explanations

PdfTextExtractor: Binds to a specific PDF page and handles text extraction logic.
PdfTextExtractOptions: Defines extraction scope (e.g., full page, rectangular region).
ExtractText(): Executes extraction and returns the page’s text as a string (core method).
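
The three components above can be wrapped in a small reusable helper. The following is a minimal sketch that uses only the calls from the example in step 2, plus Pages.Count for a bounds check. The names PdfTextHelper and ExtractPageText are hypothetical and not part of the library.

```csharp
using System;
using Spire.Pdf;
using Spire.Pdf.Texts;

public static class PdfTextHelper
{
    // Hypothetical helper: returns the text of one page (0-based index).
    public static string ExtractPageText(string pdfPath, int pageIndex)
    {
        PdfDocument pdfDoc = new PdfDocument();
        try
        {
            pdfDoc.LoadFromFile(pdfPath);

            // Fail with a clear message when the requested page does not exist
            if (pageIndex < 0 || pageIndex >= pdfDoc.Pages.Count)
            {
                throw new ArgumentOutOfRangeException(nameof(pageIndex),
                    $"Document has {pdfDoc.Pages.Count} page(s).");
            }

            PdfTextExtractor extractor = new PdfTextExtractor(pdfDoc.Pages[pageIndex]);
            PdfTextExtractOptions options = new PdfTextExtractOptions { IsExtractAllText = true };
            return extractor.ExtractText(options);
        }
        finally
        {
            // Release the document even if loading or extraction throws
            pdfDoc.Close();
        }
    }
}
```

With this in place, the body of Main could shrink to a single call such as File.WriteAllText("Extracted_Page_Text.txt", PdfTextHelper.ExtractPageText("Sample.pdf", 1)).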

3. Advanced Extraction Techniques

1. Process Password-Protected PDFs

To load an encrypted PDF, pass the password (user or owner password) directly to the LoadFromFile method:

// Load encrypted PDF with password
pdfDoc.LoadFromFile("EncryptedDocument.pdf", "your_password_here");
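
Hardcoding a password in source code is rarely a good idea outside of a demo. One possible approach, sketched below, is to read it from an environment variable at runtime. The helper class PdfPassword and the variable name PDF_PASSWORD are assumptions made for illustration; only LoadFromFile comes from the library.

```csharp
using System;

public static class PdfPassword
{
    // Hypothetical helper: reads the PDF password from an environment variable
    // and fails early with a clear message when the variable is missing.
    public static string FromEnvironment(string variableName)
    {
        string password = Environment.GetEnvironmentVariable(variableName);
        if (string.IsNullOrEmpty(password))
        {
            throw new InvalidOperationException(
                $"Environment variable '{variableName}' is not set.");
        }
        return password;
    }
}

// Usage with the pdfDoc object from step 2 (PDF_PASSWORD is an assumed name):
// pdfDoc.LoadFromFile("EncryptedDocument.pdf", PdfPassword.FromEnvironment("PDF_PASSWORD"));
```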

2. Extract Text from All PDF Pages

Iterate through all pages in the document and merge extracted text into a single file:

using System.Text;

// Initialize StringBuilder to store text from all pages
StringBuilder fullDocumentText = new StringBuilder();

// Reuse one options object for every page
PdfTextExtractOptions options = new PdfTextExtractOptions { IsExtractAllText = true };

// Loop through every page in the PDF
foreach (PdfPageBase page in pdfDoc.Pages)
{
    PdfTextExtractor pageExtractor = new PdfTextExtractor(page);

    // Append text from current page (AppendLine adds a line break between pages)
    fullDocumentText.AppendLine(pageExtractor.ExtractText(options));
}

// Save merged text to file
File.WriteAllText("Full_Document_Text.txt", fullDocumentText.ToString());
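
In a merged file it can be hard to tell where one page ends and the next begins. As a small optional variation, the sketch below joins per-page strings with a visible page marker. PageTextJoiner is a hypothetical helper written in plain C#; it does not call the library, so the per-page strings are assumed to come from ExtractText as in the loop above.

```csharp
using System.Collections.Generic;
using System.Text;

public static class PageTextJoiner
{
    // Hypothetical helper: joins per-page text and prefixes each page
    // with a "===== Page N =====" marker (N starts at 1).
    public static string Join(IEnumerable<string> pageTexts)
    {
        StringBuilder merged = new StringBuilder();
        int pageNumber = 1;

        foreach (string pageText in pageTexts)
        {
            merged.AppendLine($"===== Page {pageNumber} =====");
            // Treat a null page as empty and drop trailing whitespace
            merged.AppendLine((pageText ?? string.Empty).TrimEnd());
            merged.AppendLine();
            pageNumber++;
        }

        return merged.ToString();
    }
}
```

To use it, collect each pageExtractor.ExtractText(options) result into a List<string> inside the loop, then write PageTextJoiner.Join(list) to the output file.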

3. Extract Text from a Specific Rectangular Region

To extract only the text inside a defined area, set ExtractArea on the options object (units: points; 1 point = 1/72 inch). The snippet reuses the textExtractor created in step 2:

PdfTextExtractOptions areaOptions = new PdfTextExtractOptions();
// Define region: Left = 50pt, Top = 100pt, Width = 400pt, Height = 300pt
areaOptions.ExtractArea = new System.Drawing.RectangleF(50, 100, 400, 300);

// Extract text from the defined area
string areaText = textExtractor.ExtractText(areaOptions);
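
Region coordinates are often measured in inches or millimeters rather than points. The sketch below converts them using the 1 point = 1/72 inch relationship stated above. PdfUnits and its method names are hypothetical helpers, not library members.

```csharp
using System.Drawing;

public static class PdfUnits
{
    private const float PointsPerInch = 72f;
    private const float MillimetersPerInch = 25.4f;

    // 1 inch = 72 points
    public static float InchesToPoints(float inches) => inches * PointsPerInch;

    // Convert millimeters to inches first, then to points
    public static float MillimetersToPoints(float millimeters) =>
        millimeters / MillimetersPerInch * PointsPerInch;

    // Builds a region in points from millimeter measurements
    public static RectangleF RegionFromMillimeters(float left, float top, float width, float height) =>
        new RectangleF(
            MillimetersToPoints(left),
            MillimetersToPoints(top),
            MillimetersToPoints(width),
            MillimetersToPoints(height));
}
```

For example, areaOptions.ExtractArea = PdfUnits.RegionFromMillimeters(20f, 35f, 140f, 100f) would describe a 140 mm by 100 mm region whose top-left corner is 20 mm from the left and 35 mm from the top.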

Recommended Workflow Extensions

PDF text extraction is rarely the final step—integrate these features to build end-to-end data processing pipelines:

Styled Text Extraction: Use PdfTextFinder to locate text by style (font, color, size) for extracting titles, keywords, or highlighted content.
Table Data Extraction: Extract structured table data with PdfTableExtractor and copy the cell values into a DataTable or 2D array (avoids manual parsing of tabular text).
OCR for Scanned PDFs: Pair Free Spire.PDF with Spire.OCR to extract text from image-based (scanned) PDFs via Optical Character Recognition (OCR).
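
As an illustration of the table extraction item, here is a rough sketch that prints every detected table as tab-separated rows. It assumes the PdfTableExtractor and PdfTable types from the Spire.Pdf.Utilities namespace, with ExtractTable(pageIndex), GetRowCount(), GetColumnCount(), and GetText(row, column) behaving as in the library's published examples; check these against the version you have installed. Sample.pdf is a placeholder path.

```csharp
using System;
using Spire.Pdf;
using Spire.Pdf.Utilities;

class TableExtractionSketch
{
    static void Main()
    {
        PdfDocument pdfDoc = new PdfDocument();
        try
        {
            pdfDoc.LoadFromFile("Sample.pdf");

            // Assumed API: the extractor is bound to the whole document
            PdfTableExtractor tableExtractor = new PdfTableExtractor(pdfDoc);

            for (int pageIndex = 0; pageIndex < pdfDoc.Pages.Count; pageIndex++)
            {
                // Assumed to return null when the page contains no tables
                PdfTable[] tables = tableExtractor.ExtractTable(pageIndex);
                if (tables == null) continue;

                foreach (PdfTable table in tables)
                {
                    for (int row = 0; row < table.GetRowCount(); row++)
                    {
                        string[] cells = new string[table.GetColumnCount()];
                        for (int col = 0; col < cells.Length; col++)
                        {
                            cells[col] = table.GetText(row, col);
                        }
                        // One table row per output line, cells separated by tabs
                        Console.WriteLine(string.Join("\t", cells));
                    }
                    Console.WriteLine(); // blank line between tables
                }
            }
        }
        finally
        {
            pdfDoc.Close();
        }
    }
}
```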

Summary

Free Spire.PDF for .NET eliminates the limitations of traditional PDF text extraction methods in the .NET ecosystem:

It is lightweight and dependency-free (no Adobe Reader or external tools required).
Its intuitive .NET APIs reduce development complexity and speed up implementation.
It supports advanced use cases (encrypted PDFs, region extraction, table parsing) to meet diverse business needs.

Whether you're building small automation tools or prototyping a larger document processing system, Free Spire.PDF for .NET offers a cost-effective way to extract PDF text, provided your documents fit within the free version's 10-page limit.

DEV Community