Processing large batches of PDFs is a common requirement in document management systems, data extraction pipelines, and archive digitization projects. The challenge is doing it efficiently while handling the inevitable edge cases: corrupted files, encryption, or malformed structures.
I've been working on oxidize-pdf, a native Rust library for PDF processing. Recently, I implemented a parallel batch processing feature to handle production-scale document workloads.
## Why Another PDF Library?
There are existing PDF libraries in the Rust ecosystem, notably lopdf. However, oxidize-pdf addresses a different use case:
lopdf is a general-purpose PDF manipulation library. It provides low-level access to PDF structures and is excellent for creating, modifying, and inspecting PDFs.
oxidize-pdf is optimized for document content extraction and processing. It focuses on:
- High-performance text extraction
- OCR integration with Tesseract for scanned documents
- Structured data extraction (tables, forms, metadata)
- Batch processing at scale
- Production-ready error handling and recovery
The library is designed for systems that need to process thousands of documents daily: invoice processing, contract analysis, document classification, and RAG (Retrieval-Augmented Generation) pipelines.
## Requirements
The batch processing implementation needed to satisfy several constraints:
- Performance: Sequential processing is impractical for batches of 500+ files
- Reliability: Individual file failures must not halt the entire batch
- Observability: Clear progress indication and detailed error reporting
- Integration: Machine-readable output for automation
## Core Implementation
The batch processor uses Rayon for parallelism. Here's the main processing function:
```rust
use std::path::PathBuf;
use std::sync::{Arc, Mutex};
use std::time::Instant;

use indicatif::ProgressBar;
use rayon::prelude::*;

pub fn process_batch(files: &[PathBuf], config: &BatchConfig) -> BatchResult {
    let start = Instant::now();
    let progress = ProgressBar::new(files.len() as u64);
    let results = Arc::new(Mutex::new(Vec::new()));

    // Parallel processing with Rayon
    files.par_iter().for_each(|path| {
        let file_start = Instant::now();
        let result = match process_single_pdf(path) {
            Ok(data) => ProcessingResult {
                filename: path.file_name().unwrap().to_string_lossy().to_string(),
                success: true,
                pages: Some(data.page_count),
                text_chars: Some(data.text.len()),
                duration_ms: file_start.elapsed().as_millis() as u64,
                error: None,
            },
            Err(e) => ProcessingResult {
                filename: path.file_name().unwrap().to_string_lossy().to_string(),
                success: false,
                pages: None,
                text_chars: None,
                duration_ms: file_start.elapsed().as_millis() as u64,
                error: Some(e.to_string()),
            },
        };
        // Collect the per-file result and bump the progress bar
        results.lock().unwrap().push(result);
        progress.inc(1);
    });

    progress.finish();
    let all_results = results.lock().unwrap();
    aggregate_results(&all_results, start.elapsed())
}
```
The key aspects:

- `par_iter()` enables Rayon's parallel iteration
- Error isolation through an individual `match` on each file
- Thread-safe result collection using `Arc<Mutex<Vec<ProcessingResult>>>`
- Progress tracking with the `indicatif` crate
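The `aggregate_results` helper isn't shown above. As a rough illustration, here's a minimal sketch of what it could look like, assuming the `BatchResult` and `ProcessingResult` structs defined later in this post (and that `ProcessingResult` also derives `Clone`):

```rust
use std::time::Duration;

// Minimal sketch -- the actual helper in oxidize-pdf may differ.
fn aggregate_results(results: &[ProcessingResult], elapsed: Duration) -> BatchResult {
    let total = results.len();
    let successful = results.iter().filter(|r| r.success).count();
    let secs = elapsed.as_secs_f64();

    BatchResult {
        total,
        successful,
        failed: total - successful,
        total_duration_ms: elapsed.as_millis(),
        // Guard against dividing by zero on an empty batch
        throughput_docs_per_sec: if secs > 0.0 { total as f64 / secs } else { 0.0 },
        results: results.to_vec(), // requires ProcessingResult: Clone
    }
}
```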
## Processing Individual PDFs
Each PDF is processed independently:
```rust
fn process_single_pdf(path: &Path) -> Result<DocumentData, PdfError> {
    let document = Document::load(path)?;
    let text = document.extract_text()?;

    Ok(DocumentData {
        page_count: document.get_pages().len(),
        text,
    })
}
```
The simplicity here is intentional. If `load()` or `extract_text()` fails, the error propagates up, gets caught in the main loop, and processing continues with the next file.
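The snippets don't define `PdfError`. For illustration, here's a minimal sketch of how such an error type could be written with the `thiserror` crate; the variants are assumptions based on the failure messages shown later, not the library's real definition:

```rust
use thiserror::Error;

// Hypothetical error type for illustration; oxidize-pdf's
// actual PdfError covers more cases.
#[derive(Debug, Error)]
pub enum PdfError {
    #[error("I/O error: {0}")]
    Io(#[from] std::io::Error),
    #[error("Invalid PDF structure")]
    InvalidStructure,
    #[error("Encryption not supported")]
    EncryptionNotSupported,
}
```

With `#[from]` on the I/O variant, the `?` operator in `process_single_pdf` converts underlying errors automatically.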
## Usage
```bash
# Process directory with default settings
cargo run --example batch_processing --features rayon -- --dir ./pdfs

# Control parallelism
cargo run --example batch_processing --features rayon -- --dir ./pdfs --workers 8

# JSON output for automation
cargo run --example batch_processing --features rayon -- --dir ./pdfs --json
```
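A `--workers` flag like the one above would typically be wired to Rayon's global thread pool. A sketch of that wiring (the example's actual flag handling is an assumption here):

```rust
use rayon::ThreadPoolBuilder;

// Configure Rayon's global pool before the first par_iter() call.
// If never configured, Rayon defaults to one thread per logical core.
fn configure_workers(workers: Option<usize>) {
    if let Some(n) = workers {
        ThreadPoolBuilder::new()
            .num_threads(n)
            .build_global()
            .expect("thread pool already initialized");
    }
}
```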
## Performance Results
Testing with 772 PDFs on an Intel i9 MacBook Pro:
- Sequential: ~10 minutes
- Parallel: ~1 minute
- Speedup: ~10x
Throughput varies with file complexity (15+ docs/sec for simple files, 1-2 docs/sec for complex ones), but the parallelization benefit is consistent.
## Error Handling

The central design decision is file independence: each PDF is processed in isolation.
```rust
// This continues even if some files fail
files.par_iter().for_each(|path| {
    match process_single_pdf(path) {
        Ok(data) => { /* record success */ },
        Err(e) => { /* record error, continue */ },
    }
});
```
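A design note: since Rayon's parallel iterators support `map` and `collect`, the shared `Arc<Mutex<Vec<_>>>` could be dropped entirely. A sketch of that alternative, with the progress-bar updates omitted for brevity:

```rust
// Alternative: let Rayon collect per-file results directly,
// avoiding the shared mutex-guarded vector.
let results: Vec<ProcessingResult> = files
    .par_iter()
    .map(|path| {
        let file_start = Instant::now();
        let filename = path.file_name().unwrap().to_string_lossy().to_string();
        match process_single_pdf(path) {
            Ok(data) => ProcessingResult {
                filename,
                success: true,
                pages: Some(data.page_count),
                text_chars: Some(data.text.len()),
                duration_ms: file_start.elapsed().as_millis() as u64,
                error: None,
            },
            Err(e) => ProcessingResult {
                filename,
                success: false,
                pages: None,
                text_chars: None,
                duration_ms: file_start.elapsed().as_millis() as u64,
                error: Some(e.to_string()),
            },
        }
    })
    .collect();
```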
At completion, you receive a full report:
```
╔═══════════════════════════════════════╗
║         BATCH SUMMARY REPORT          ║
╚═══════════════════════════════════════╝

📊 Statistics:
   Total files:   772
   ✅ Successful: 749 (97.0%)
   ❌ Failed:     23 (3.0%)

⏱️ Performance:
   Total time:    62.4s
   Throughput:    12.4 docs/sec

❌ Failed files:
   • corrupted.pdf - Invalid PDF structure
   • locked.pdf - Permission denied
   • encrypted.pdf - Encryption not supported
```
This approach is essential for production systems where restarting a 1-hour batch job because of file #437 is unacceptable.
## JSON Output for Automation
For pipeline integration:
```rust
use serde::Serialize;

#[derive(Serialize)]
struct BatchResult {
    total: usize,
    successful: usize,
    failed: usize,
    total_duration_ms: u128,
    throughput_docs_per_sec: f64,
    results: Vec<ProcessingResult>,
}

#[derive(Serialize)]
struct ProcessingResult {
    filename: String,
    success: bool,
    pages: Option<usize>,
    text_chars: Option<usize>,
    duration_ms: u64,
    error: Option<String>,
}
```
Output example:
```json
{
  "total": 772,
  "successful": 749,
  "failed": 23,
  "throughput_docs_per_sec": 12.4,
  "results": [
    {
      "filename": "document1.pdf",
      "success": true,
      "pages": 25,
      "text_chars": 15234,
      "duration_ms": 145,
      "error": null
    },
    {
      "filename": "corrupted.pdf",
      "success": false,
      "pages": null,
      "text_chars": null,
      "duration_ms": 23,
      "error": "Invalid PDF structure"
    }
  ]
}
```
This integrates easily with `jq`, Python scripts, or monitoring systems:
```bash
# Extract failed files
cat results.json | jq -r '.results[] | select(.success == false) | .filename'

# Calculate success rate
cat results.json | jq '.successful / .total * 100'
```
## Why Rust Over Python
The decision to implement this in Rust rather than Python was driven by practical considerations:
Memory efficiency: Python PDF libraries typically consume 2-3GB when processing large batches. Rust keeps memory usage predictable and significantly lower.
True parallelism: Python's GIL limits parallel processing. While `multiprocessing` works, the overhead is substantial. Rayon provides efficient work-stealing parallelism without process-spawning costs.
Deployment simplicity: A single compiled binary eliminates dependency management issues in production environments.
Python remains excellent for exploratory work. But for daily production pipelines processing thousands of documents, the operational benefits of Rust are substantial.
## Current Limitations
The current implementation:
- Processes a single directory (non-recursive; see the sketch below)
- Loads complete PDFs into memory
- Text extraction only (no images/metadata in this example)
For very large files (1GB+), a streaming approach would be more appropriate.
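For reference, gathering the inputs for that single-directory, non-recursive scan could look like the sketch below (an illustration of the described behavior, not the example's exact code):

```rust
use std::fs;
use std::path::{Path, PathBuf};

// Collect *.pdf files from one directory, non-recursively,
// matching the current implementation's behavior.
fn collect_pdfs(dir: &Path) -> std::io::Result<Vec<PathBuf>> {
    let mut files: Vec<PathBuf> = fs::read_dir(dir)?
        .filter_map(|entry| entry.ok().map(|e| e.path()))
        .filter(|p| {
            p.extension()
                .map(|ext| ext.eq_ignore_ascii_case("pdf"))
                .unwrap_or(false)
        })
        .collect();
    files.sort(); // deterministic ordering for reproducible reports
    Ok(files)
}
```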
## What's Next
The library is under active development. Planned features include:
- Advanced structured data extraction with templates
- Enhanced OCR quality detection and preprocessing
- Memory-efficient streaming for large documents
## Try It

The code is available at [github.com/bzsanti/oxidizePdf](https://github.com/bzsanti/oxidizePdf). The batch processing example is in `examples/batch_processing.rs`.
If you're processing PDFs at scale, I'd be interested in hearing about your use cases and any edge cases this implementation doesn't handle.