Santiago Fernández

Implementing Parallel PDF Batch Processing in Rust

Processing large batches of PDFs is a common requirement in document management systems, data extraction pipelines, and archive digitization projects. The challenge is doing it efficiently while handling the inevitable edge cases: corrupted files, encryption, or malformed structures.
I've been working on oxidize-pdf, a native Rust library for PDF processing. Recently, I implemented a parallel batch processing feature to handle production-scale document processing.

Why Another PDF Library?

There are existing PDF libraries in the Rust ecosystem, notably lopdf. However, oxidize-pdf addresses a different use case:
lopdf is a general-purpose PDF manipulation library. It provides low-level access to PDF structures and is excellent for creating, modifying, and inspecting PDFs.
oxidize-pdf is optimized for document content extraction and processing. It focuses on:

  • High-performance text extraction
  • OCR integration with Tesseract for scanned documents
  • Structured data extraction (tables, forms, metadata)
  • Batch processing at scale
  • Production-ready error handling and recovery

The library is designed for systems that need to process thousands of documents daily: invoice processing, contract analysis, document classification, and RAG (Retrieval-Augmented Generation) pipelines.

Requirements

The batch processing implementation needed to satisfy several constraints:

  1. Performance: Sequential processing is impractical for batches of 500+ files
  2. Reliability: Individual file failures must not halt the entire batch
  3. Observability: Clear progress indication and detailed error reporting
  4. Integration: Machine-readable output for automation
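
To make these constraints concrete, here is a hypothetical sketch of a configuration struct that could carry them. The field names are assumptions mirroring the CLI flags shown later, not the library's actual BatchConfig:

rust
use std::path::PathBuf;

// Hypothetical configuration; the example's real BatchConfig may differ.
pub struct BatchConfig {
    pub dir: PathBuf,      // --dir: directory to scan for PDFs
    pub workers: usize,    // --workers: Rayon thread count
    pub json_output: bool, // --json: machine-readable output
}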

Core Implementation

The batch processor uses Rayon for parallelism. Here's the main processing function:

rust
use std::path::PathBuf;
use std::sync::{Arc, Mutex};
use std::time::Instant;

use indicatif::ProgressBar;
use rayon::prelude::*;

pub fn process_batch(files: &[PathBuf], config: &BatchConfig) -> BatchResult {
    let start = Instant::now();
    let progress = ProgressBar::new(files.len() as u64);
    let results = Arc::new(Mutex::new(Vec::new()));

    // Parallel processing with Rayon
    files.par_iter().for_each(|path| {
        let file_start = Instant::now();

        let result = match process_single_pdf(path) {
            Ok(data) => ProcessingResult {
                filename: path.file_name().unwrap().to_string_lossy().to_string(),
                success: true,
                pages: Some(data.page_count),
                text_chars: Some(data.text.len()),
                duration_ms: file_start.elapsed().as_millis() as u64,
                error: None,
            },
            Err(e) => ProcessingResult {
                filename: path.file_name().unwrap().to_string_lossy().to_string(),
                success: false,
                pages: None,
                text_chars: None,
                duration_ms: file_start.elapsed().as_millis() as u64,
                error: Some(e.to_string()),
            },
        };

        results.lock().unwrap().push(result);
        progress.inc(1);
    });

    progress.finish();

    let all_results = results.lock().unwrap();
    aggregate_results(&all_results, start.elapsed())
}


The key aspects:

  • par_iter() enables Rayon's parallel iteration
  • Error isolation through individual match on each file
  • Thread-safe result collection using Arc<Mutex<Vec>> (a lock-free alternative is sketched after this list)
  • Progress tracking with indicatif crate
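
As an aside, the Arc<Mutex<Vec>> pattern works, but Rayon can also collect results directly, which avoids the lock and preserves input order. A minimal sketch, assuming a hypothetical process_one helper that wraps the match block above:

rust
use std::path::PathBuf;

use indicatif::ProgressBar;
use rayon::prelude::*;

// Lock-free alternative: map each file to its result and let Rayon collect.
// `process_one` is a hypothetical helper wrapping the match block above.
fn collect_results(files: &[PathBuf], progress: &ProgressBar) -> Vec<ProcessingResult> {
    files
        .par_iter()
        .map(|path| {
            let result = process_one(path);
            progress.inc(1);
            result
        })
        .collect()
}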

Processing Individual PDFs

Each PDF is processed independently:

rust
use std::path::Path;

fn process_single_pdf(path: &Path) -> Result<DocumentData, PdfError> {
    let document = Document::load(path)?;
    let text = document.extract_text()?;

    Ok(DocumentData {
        page_count: document.get_pages().len(),
        text,
    })
}

The simplicity here is intentional. If load() or extract_text() fails, the ? operator propagates the error up, the match in the batch loop records it, and processing continues with the next file.
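
The library's actual PdfError type isn't shown in this excerpt; conceptually it's an enum covering the failure modes that appear in the report below. A hypothetical sketch using the thiserror crate:

rust
use thiserror::Error;

// Hypothetical error type; oxidize-pdf's real PdfError may differ.
#[derive(Debug, Error)]
pub enum PdfError {
    #[error("I/O error: {0}")]
    Io(#[from] std::io::Error),
    #[error("Invalid PDF structure")]
    InvalidStructure,
    #[error("Encryption not supported")]
    EncryptionNotSupported,
}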

Usage

bash
# Process directory with default settings
cargo run --example batch_processing --features rayon -- --dir ./pdfs

# Control parallelism
cargo run --example batch_processing --features rayon -- --dir ./pdfs --workers 8

# JSON output for automation
cargo run --example batch_processing --features rayon -- --dir ./pdfs --json
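
The --workers flag presumably maps onto Rayon's global thread pool. A minimal sketch of that wiring (CLI parsing omitted):

rust
// Configure Rayon's global pool before the first par_iter() call.
// `workers` would come from the --workers flag; by default Rayon
// uses one thread per logical core.
fn configure_workers(workers: usize) -> Result<(), rayon::ThreadPoolBuildError> {
    rayon::ThreadPoolBuilder::new()
        .num_threads(workers)
        .build_global()
}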

Performance Results

Testing with 772 PDFs on an Intel i9 MacBook Pro:

  • Sequential: ~10 minutes
  • Parallel: ~1 minute
  • Speedup: ~10x

Throughput varies with file complexity (15+ docs/sec for simple files, 1-2 docs/sec for complex ones), but the parallelization benefit is consistent.

Error Handling

The central design decision is file independence. Each PDF is processed in isolation:

rust
// This continues even if some files fail
files.par_iter().for_each(|path| {
    match process_single_pdf(path) {
        Ok(data) => { /* record success */ },
        Err(e) => { /* record error, continue */ },
    }
});

At completion, you receive a full report:

╔═══════════════════════════════════════╗
         BATCH SUMMARY REPORT
╚═══════════════════════════════════════╝

📊 Statistics:
   Total files:     772
   ✅ Successful:   749 (97.0%)
   ❌ Failed:       23 (3.0%)

⏱️  Performance:
   Total time:      62.4s
   Throughput:      12.4 docs/sec

❌ Failed files:
   • corrupted.pdf - Invalid PDF structure
   • locked.pdf - Permission denied
   • encrypted.pdf - Encryption not supported

This approach is essential for production systems where restarting a 1-hour batch job because of file #437 is unacceptable.
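
The aggregate_results function called at the end of process_batch isn't shown above. A plausible sketch, assuming the BatchResult and ProcessingResult structs from the next section (plus a Clone derive on ProcessingResult):

rust
use std::time::Duration;

// Plausible aggregation; assumes ProcessingResult also derives Clone.
fn aggregate_results(results: &[ProcessingResult], elapsed: Duration) -> BatchResult {
    let successful = results.iter().filter(|r| r.success).count();
    BatchResult {
        total: results.len(),
        successful,
        failed: results.len() - successful,
        total_duration_ms: elapsed.as_millis(),
        throughput_docs_per_sec: results.len() as f64 / elapsed.as_secs_f64(),
        results: results.to_vec(),
    }
}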

JSON Output for Automation

For pipeline integration:

rust
#[derive(Serialize)]
struct BatchResult {
    total: usize,
    successful: usize,
    failed: usize,
    total_duration_ms: u128,
    throughput_docs_per_sec: f64,
    results: Vec<ProcessingResult>,
}

#[derive(Serialize)]
struct ProcessingResult {
    filename: String,
    success: bool,
    pages: Option<usize>,
    text_chars: Option<usize>,
    duration_ms: u64,
    error: Option<String>,
}
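
With the serde derives in place, emitting the JSON is a one-liner with serde_json:

rust
// Serialize the aggregated result for the --json output path.
fn print_json(result: &BatchResult) -> serde_json::Result<()> {
    println!("{}", serde_json::to_string_pretty(result)?);
    Ok(())
}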

Output example:

json
{
  "total": 772,
  "successful": 749,
  "failed": 23,
  "throughput_docs_per_sec": 12.4,
  "results": [
    {
      "filename": "document1.pdf",
      "success": true,
      "pages": 25,
      "text_chars": 15234,
      "duration_ms": 145,
      "error": null
    },
    {
      "filename": "corrupted.pdf",
      "success": false,
      "pages": null,
      "text_chars": null,
      "duration_ms": 23,
      "error": "Invalid PDF structure"
    }
  ]
}

This integrates easily with jq, Python scripts, or monitoring systems:

bash
# Extract failed files
cat results.json | jq -r '.results[] | select(.success == false) | .filename'

# Calculate success rate
cat results.json | jq '.successful / .total * 100'

Why Rust Over Python

The decision to implement this in Rust rather than Python was driven by practical considerations:
Memory efficiency: Python PDF libraries typically consume 2-3 GB when processing large batches. Rust keeps memory usage predictable and significantly lower.
True parallelism: Python's GIL prevents CPU-bound threads from running in parallel. multiprocessing works around it, but process spawning and inter-process serialization add substantial overhead. Rayon provides efficient work-stealing parallelism within a single process.
Deployment simplicity: A single compiled binary eliminates dependency management issues in production environments.
Python remains excellent for exploratory work. But for daily production pipelines processing thousands of documents, the operational benefits of Rust are substantial.

Current Limitations

The current implementation:

  • Processes a single directory (non-recursive)
  • Loads complete PDFs into memory
  • Text extraction only (no images/metadata in this example)

For very large files (1GB+), a streaming approach would be more appropriate.
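
Recursive traversal would be straightforward to add with the walkdir crate (not currently used by the example); a sketch:

rust
use std::path::{Path, PathBuf};

use walkdir::WalkDir;

// Recursively collect every .pdf file under a root directory.
fn collect_pdfs_recursive(root: &Path) -> Vec<PathBuf> {
    WalkDir::new(root)
        .into_iter()
        .filter_map(|entry| entry.ok())
        .filter(|entry| entry.file_type().is_file())
        .map(|entry| entry.into_path())
        .filter(|path| {
            path.extension()
                .map_or(false, |ext| ext.eq_ignore_ascii_case("pdf"))
        })
        .collect()
}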

What's Next

The library is under active development. Planned features include:

  • Advanced structured data extraction with templates
  • Enhanced OCR quality detection and preprocessing
  • Memory-efficient streaming for large documents

Try It

The code is available at github.com/bzsanti/oxidizePdf. The batch processing example is in examples/batch_processing.rs.
If you're processing PDFs at scale, I'd be interested in hearing about your use cases and any edge cases this implementation doesn't handle.
