Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

Document AI Toolkit

Turn unstructured documents into structured, queryable data. This toolkit provides complete pipelines for parsing PDFs, extracting tables, running OCR on scanned documents, summarizing long-form content, and pulling structured fields from invoices, contracts, and reports. Built as composable pipeline stages so you can mix, match, and extend for your specific document types.

Key Features

  • PDF Parsing Engine — Extract text, metadata, and layout information from PDFs with support for multi-column layouts and embedded images
  • OCR Integration — Process scanned documents and images with configurable OCR backends (Tesseract, cloud APIs) and pre-processing for skew correction
  • Table Extraction — Detect and extract tables from PDFs and images into pandas DataFrames or CSV, handling merged cells and spanning headers
  • Summarization Chains — Multi-stage summarization for long documents: chunk → summarize → merge, with configurable compression ratios
  • Structured Data Extraction — Define extraction schemas and pull typed fields (dates, amounts, names, addresses) from any document
  • Document Classification — Automatically categorize incoming documents by type (invoice, contract, report, letter) before routing to specialized extractors
  • Batch Processing — Process thousands of documents in parallel with progress tracking, retry logic, and partial-failure handling
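
The "composable pipeline stages" idea above can be sketched in plain Python (the names here are illustrative stand-ins, not the toolkit's actual API): each stage is a callable that enriches a shared result dict, so stages can be mixed, reordered, or swapped freely.

```python
from typing import Callable

# A stage is any callable that takes and returns the accumulating result dict.
Stage = Callable[[dict], dict]

def run_pipeline(stages: list[Stage], document: str) -> dict:
    """Run each stage in order over a shared result dict."""
    result = {"source": document}
    for stage in stages:
        result = stage(result)
    return result

# Two toy stages standing in for PDFParser and Summarizer.
def parse(result: dict) -> dict:
    result["text"] = f"contents of {result['source']}"
    return result

def summarize(result: dict) -> dict:
    result["summary"] = result["text"][:20]
    return result

result = run_pipeline([parse, summarize], "invoice.pdf")
```

Because every stage shares one interface, adding a new extractor for a custom document type means writing a single function, not modifying the pipeline runner.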

Quick Start

from document_ai import DocumentPipeline, stages

# 1. Build a pipeline
pipeline = DocumentPipeline([
    stages.PDFParser(extract_images=True),
    stages.OCR(engine="tesseract", language="eng"),
    stages.TableExtractor(output_format="dataframe"),
    stages.Summarizer(model="gpt-4o-mini", max_summary_length=200),
    stages.StructuredExtractor(schema="schemas/invoice.yaml"),
])

# 2. Process a document
result = pipeline.process("documents/invoice_2025_Q3.pdf")

print(result.text[:500])           # Full extracted text
print(result.tables[0].to_csv())   # First table as CSV
print(result.summary)              # LLM-generated summary
print(result.structured_data)      # {"vendor": "Acme Corp", "total": 4250.00, ...}

Architecture

Input Document (PDF / Image / Scan)
         │
         ▼
┌─────────────────┐
│   PDF Parser    │──── Extract text + layout + embedded images
└────────┬────────┘
         │
         ├── Has text ──────────────▶ Text output
         │
         └── Image/Scan ──▶ ┌───────────┐
                            │    OCR    │──── Text from images
                            └─────┬─────┘
                                  │
         ┌────────────────────────┘
         ▼
┌─────────────────┐
│ Table Extractor │──── Detect tables → DataFrames
└────────┬────────┘
         ▼
┌─────────────────┐
│  Summarizer     │──── Chunk → Summarize → Merge
└────────┬────────┘
         ▼
┌─────────────────┐
│ Schema Extractor│──── Extract typed fields per schema
└────────┬────────┘
         ▼
    DocumentResult (text, tables, summary, structured_data, metadata)
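
The text-vs-scan branch in the diagram can be sketched as a simple per-page dispatch (with a stubbed OCR backend; the real pipeline would call Tesseract or a cloud API here):

```python
def ocr(image_bytes: bytes) -> str:
    # Stand-in for a real OCR backend such as Tesseract.
    return "<ocr text>"

def extract_text(pages: list[dict]) -> list[str]:
    """Pages with an embedded text layer skip OCR; image-only pages are OCR'd."""
    out = []
    for page in pages:
        if page.get("text"):            # born-digital page: text layer present
            out.append(page["text"])
        else:                           # scanned page: fall back to OCR
            out.append(ocr(page["image"]))
    return out

pages = [{"text": "Hello"}, {"image": b"\x89PNG..."}]
texts = extract_text(pages)
```

Routing this way keeps OCR off the hot path for born-digital PDFs, where the embedded text layer is both faster and more accurate than re-recognizing rendered pages.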

Usage Examples

Define Custom Extraction Schemas

from document_ai.extraction import ExtractionSchema, Field

invoice_schema = ExtractionSchema(
    name="invoice",
    fields=[
        Field("vendor_name", type="string", description="Company that issued the invoice"),
        Field("invoice_number", type="string", pattern=r"INV-\d{4}-\d{4}"),
        Field("date", type="date", formats=["%Y-%m-%d", "%m/%d/%Y"]),
        Field("line_items", type="list", item_schema={
            "description": "string",
            "quantity": "integer",
            "unit_price": "float",
        }),
        Field("total_amount", type="float", description="Total amount due"),
    ],
)

result = pipeline.process("invoice.pdf", schema=invoice_schema)
print(result.structured_data)
# {"vendor_name": "Acme Corp", "invoice_number": "INV-2025-0042",
#  "date": "2025-03-15", "total_amount": 12750.00, "line_items": [...]}

Batch Processing with Progress Tracking

from document_ai import BatchProcessor
from pathlib import Path

processor = BatchProcessor(
    pipeline=pipeline,
    max_workers=4,
    retry_on_failure=True,
    max_retries=2,
)

results = processor.process_directory(
    Path("documents/inbox/"),
    glob_pattern="*.pdf",
    output_dir=Path("documents/processed/"),
)

print(f"Processed: {results.success_count}/{results.total_count}")
print(f"Failed: {[f.filename for f in results.failures]}")
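
The retry and partial-failure behavior can be sketched with `concurrent.futures` (illustrative; the toolkit's `BatchProcessor` internals may differ):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_with_retries(process, path, max_retries=2):
    """Call process(path), retrying up to max_retries times on any exception."""
    for attempt in range(max_retries + 1):
        try:
            return path, process(path), None
        except Exception as exc:
            if attempt == max_retries:
                return path, None, exc

def run_batch(process, paths, max_workers=4, max_retries=2):
    successes, failures = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(process_with_retries, process, p, max_retries)
                   for p in paths]
        for fut in as_completed(futures):
            path, result, error = fut.result()
            if error:
                failures.append((path, error))   # exhausted retries
            else:
                successes.append((path, result))
    return successes, failures

# Toy processor: fails permanently on one file.
def toy(path):
    if path == "bad.pdf":
        raise ValueError("unreadable")
    return f"parsed {path}"

ok, bad = run_batch(toy, ["a.pdf", "bad.pdf", "c.pdf"], max_workers=2)
```

The key design point is that one unreadable document never aborts the run: failures are collected alongside successes, which is what makes `results.failures` reporting possible.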

Multi-Stage Summarization for Long Documents

from document_ai.summarization import MapReduceSummarizer

summarizer = MapReduceSummarizer(
    chunk_size=2000,         # Tokens per chunk
    chunk_overlap=200,       # Overlap between chunks
    map_model="gpt-4o-mini", # Cheap model for individual chunks
    reduce_model="gpt-4o",   # Better model for final merge
    final_length=500,        # Target summary length in tokens
)

summary = summarizer.summarize(long_document_text)
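
The chunk → summarize → merge flow rests on overlapping chunking. A word-based sketch of the splitting step (the real splitter counts tokens, not words):

```python
def chunk_text(text: str, chunk_size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into chunks where each chunk repeats the last `overlap`
    words of its predecessor, so context isn't lost at chunk boundaries."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

chunks = chunk_text("word " * 5000, chunk_size=2000, overlap=200)
```

Each chunk is then summarized independently with the cheap map model, and the per-chunk summaries are concatenated and merged by the stronger reduce model into the final `final_length`-token summary.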

Configuration

# document_ai_config.yaml
pdf_parser:
  extract_images: true
  image_dpi: 300               # DPI for image extraction
  layout_analysis: true        # Detect columns, headers, footers
  password: null               # For encrypted PDFs

ocr:
  engine: "tesseract"          # tesseract | google_vision | aws_textract
  language: "eng"
  preprocessing:
    deskew: true               # Correct page rotation
    denoise: true              # Remove noise from scanned docs
    binarize: true             # Convert to black/white
  confidence_threshold: 0.6    # Below this, flag for manual review

table_extraction:
  detection_method: "lattice"  # lattice | stream | hybrid
  output_format: "dataframe"   # dataframe | csv | json
  merge_adjacent: true         # Merge tables split across pages

summarization:
  model: "gpt-4o-mini"
  strategy: "map_reduce"       # map_reduce | refine | stuff
  chunk_size: 2000
  chunk_overlap: 200
  max_summary_length: 300

extraction:
  model: "gpt-4o"
  confidence_threshold: 0.8   # Flag low-confidence extractions
  validate_types: true         # Enforce field type constraints
  schema_dir: "schemas/"

batch:
  max_workers: 4
  max_retries: 2
  output_format: "json"        # json | csv | parquet
  save_intermediate: false     # Save per-stage outputs for debugging
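
One common way to apply a file like this is to deep-merge it over built-in defaults, so users only specify the keys they want to change. A sketch, assuming the YAML has already been parsed into a dict (e.g. with PyYAML's `yaml.safe_load`):

```python
def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Recursively overlay user config on defaults; user values win."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

defaults = {"ocr": {"engine": "tesseract", "language": "eng",
                    "confidence_threshold": 0.6}}
user = {"ocr": {"language": "deu"}}   # override just the OCR language
config = deep_merge(defaults, user)
```

Recursing into nested dicts matters here: a shallow `defaults | user` would replace the entire `ocr` section, silently dropping `engine` and `confidence_threshold`.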

Best Practices

  1. Pre-process scanned documents — Deskew, denoise, and binarize before OCR to dramatically improve text quality.
  2. Use the cheapest model for summarization chunks — Only use GPT-4o for the final merge step; gpt-4o-mini handles individual chunks well.
  3. Define schemas per document type — Generic extraction is weak. Dedicated schemas for invoices, contracts, and reports yield much higher accuracy.
  4. Validate extraction results — Always check confidence_threshold on extracted fields and route low-confidence items for human review.
  5. Process in batches, not one-by-one — The BatchProcessor handles parallelism, retries, and partial failures automatically.
  6. Keep OCR language packs minimal — Only install language packs you actually need to keep the deployment lightweight.
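
Practice 4 can be enforced with a small gate that splits extracted fields by confidence (field names here are hypothetical):

```python
REVIEW_THRESHOLD = 0.8

def triage(extractions: list[dict], threshold: float = REVIEW_THRESHOLD):
    """Route fields below the confidence threshold to a human-review queue."""
    accepted, review = [], []
    for field in extractions:
        (accepted if field["confidence"] >= threshold else review).append(field)
    return accepted, review

fields = [
    {"name": "vendor_name", "value": "Acme Corp", "confidence": 0.97},
    {"name": "total_amount", "value": 4250.0, "confidence": 0.55},
]
accepted, review = triage(fields)
```

Anything landing in the review queue can be written to a separate output directory or surfaced in a review UI rather than flowing silently into downstream systems.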

Troubleshooting

| Problem | Cause | Fix |
| --- | --- | --- |
| OCR produces garbled text | Poor scan quality or wrong language | Enable preprocessing (`deskew`, `denoise`) and verify the `language` setting |
| Table extraction misses tables | Tables use borderless/minimal styling | Switch `detection_method` to `stream` or `hybrid` |
| Summarization loses critical details | Chunk size too small; important info split across chunks | Increase `chunk_overlap` to 300+ and use the `refine` strategy |
| Structured extraction returns null fields | Schema field descriptions too vague | Add specific descriptions and example values to each `Field` |

This is 1 of 11 resources in the AI Builder Pro toolkit. Get the complete [Document AI Toolkit] with all files, templates, and documentation for $49.

Get the Full Kit →

Or grab the entire AI Builder Pro bundle (11 products) for $169 — save 30%.

Get the Complete Bundle →

