Combining Docling with Surya OCR
What is Surya OCR?
Surya is a high-performance, multilingual document OCR toolkit designed to provide “universal vision” through accurate text detection and recognition across more than 90 languages. Named after the Hindu sun god, the project — available via its GitHub repository and PyPI — offers a comprehensive suite of features including line-level text detection, layout analysis (identifying tables, images, and headers), reading order detection, and LaTeX OCR. Developed by Vik Paruchuri, Surya stands out for its specialized focus on complex document structures, benchmarking favorably against traditional tools like Tesseract by leveraging modern deep learning architectures to deliver superior accuracy and sophisticated visual understanding for a wide variety of document types.
Surya Project description
Surya is a document OCR toolkit that does:
- Accurate OCR in 90+ languages
- Line-level text detection in any language
- Table and chart detection (coming soon)

Surya is named for the Hindu sun god, who has universal vision.
From the Surya PyPI page
The Power of Two: Surya and Docling
Combining the precision of Surya-OCR with the structural intelligence of Docling creates a powerful pipeline for transforming static documents into machine-readable data. While Surya provides “universal vision” through its state-of-the-art text detection, layout analysis, and LaTeX recognition across 90+ languages, Docling acts as the sophisticated orchestration layer that interprets these raw visual signals into a coherent document schema. By using Surya as the OCR engine within the Docling framework, users benefit from more than just raw text extraction; they gain the ability to accurately reconstruct complex elements like nested tables, hierarchical headers, and mathematical formulas while maintaining the correct reading order. This synergy is particularly advantageous for RAG (Retrieval-Augmented Generation) applications and LLM training, as it ensures that the structural context — the difference between a footnote and a main heading — is preserved with high fidelity, even in visually dense or multilingual PDFs.
Sample Implementation
From the curated Docling samples, we can immediately explore the synergy of Surya-OCR and Docling’s layout engine through the ready-to-use script provided below. This integration showcases how Surya’s high-performance multilingual detection pairs with Docling’s document-to-markdown conversion, transforming complex PDFs into structured data with minimal configuration.
```python
# Requires `pip install docling-surya`
# See https://pypi.org/project/docling-surya/
from docling_surya import SuryaOcrOptions

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption


def main():
    source = "https://19january2021snapshot.epa.gov/sites/static/files/2016-02/documents/epa_sample_letter_sent_to_commissioners_dated_february_29_2015.pdf"
    pipeline_options = PdfPipelineOptions(
        do_ocr=True,
        ocr_model="suryaocr",
        allow_external_plugins=True,
        ocr_options=SuryaOcrOptions(lang=["en"]),
    )
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options),
            InputFormat.IMAGE: PdfFormatOption(pipeline_options=pipeline_options),
        }
    )
    result = converter.convert(source)
    print(result.document.export_to_markdown())


if __name__ == "__main__":
    main()
```
Following my usual batch-processing habits, I changed the script to traverse the ./input directory recursively, so that every document within the subfolder hierarchy is captured. The processed results are then routed to a dedicated ./output folder, which the script creates automatically if missing, keeping a clean separation between raw source files and their converted Markdown counterparts.
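Before looking at the full script, the traversal rule itself can be sketched in isolation. This is a minimal sketch, assuming the same ./input layout and extension set used below; `find_documents` is a hypothetical helper name:

```python
from pathlib import Path

# Extensions the batch script accepts
SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp"}

def find_documents(root: Path) -> list[Path]:
    """Recursively collect supported documents under `root`."""
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in SUPPORTED_EXTENSIONS
    )
```

`Path.rglob("*")` walks every subfolder, so nested documents are picked up without any extra code.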
The source document for the input is the one provided in the sample application.
- So we prepare the environment as usual;
```shell
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install docling-surya
```
- The sample code 👇
```python
from pathlib import Path
from datetime import datetime

from docling_surya import SuryaOcrOptions

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption


def main():
    # Define and create folders
    input_dir = Path("./input")
    output_dir = Path("./output")
    output_dir.mkdir(parents=True, exist_ok=True)

    # Configure Surya OCR options
    pipeline_options = PdfPipelineOptions(
        do_ocr=True,
        ocr_model="suryaocr",
        allow_external_plugins=True,
        ocr_options=SuryaOcrOptions(lang=["en"]),
    )
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options),
            InputFormat.IMAGE: PdfFormatOption(pipeline_options=pipeline_options),
        }
    )

    # Define supported extensions
    supported_extensions = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".bmp"}

    # Recursively find files
    print(f"Scanning {input_dir} for documents...")
    files = [
        f for f in input_dir.rglob("*")
        if f.suffix.lower() in supported_extensions
    ]

    if not files:
        print("No supported documents found in the input folder.")
        return

    for file_path in files:
        try:
            print(f"Processing: {file_path.name}...")

            # Convert the document
            result = converter.convert(str(file_path))
            markdown_content = result.document.export_to_markdown()

            # Generate a timestamped filename
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            output_filename = f"{file_path.stem}_{timestamp}.md"
            output_path = output_dir / output_filename

            # Write to file
            with open(output_path, "w", encoding="utf-8") as f:
                f.write(markdown_content)

            print(f"Successfully saved to: {output_path}")

        except Exception as e:
            print(f"Error processing {file_path.name}: {e}")


if __name__ == "__main__":
    main()
```
The execution runs flawlessly, and the outputs are provided below.
- Console output;
python app_V2.py
/Users/alainairom/Devs/Docling-SuryaOCR/app_V2.py:23: DeprecationWarning: Using DoclingParseV4DocumentBackend for InputFormat.IMAGE is deprecated. Images should use ImageDocumentBackend via ImageFormatOption. Automatically correcting the backend, please update your code to avoid this warning.
converter = DocumentConverter(
Scanning input for documents...
Processing: epa_sample_letter_sent_to_commissioners_dated_february_29_2015.pdf...
Downloading manifest.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 262/262 [00:00<00:00, 326kB/s]
Downloading special_tokens_map.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 278/278 [00:00<00:00, 815kB/s]
Downloading preprocessor_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 419/419 [00:00<00:00, 1.66MB/s]
Downloading tokenizer_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 694/694 [00:00<00:00, 584kB/s]
Downloading config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 50.2k/50.2k [00:00<00:00, 3.33MB/s]
Downloading training_args.bin: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7.45k/7.45k [00:00<00:00, 3.47MB/s]
Downloading README.md: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5.05k/5.05k [00:00<00:00, 560kB/s]
Downloading specials_dict.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 43.5k/43.5k [00:00<00:00, 31.1MB/s]
Downloading .gitattributes: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.48k/1.48k [00:00<00:00, 1.50MB/s]
Downloading vocab_math.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 20.1k/20.1k [00:00<00:00, 21.1MB/s]
Downloading specials.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19.6k/19.6k [00:00<00:00, 45.6MB/s]
Downloading processor_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 411/411 [00:00<00:00, 1.09MB/s]
Downloading model.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.34G/1.34G [00:50<00:00, 28.4MB/s]
Downloading text_recognition model to /Users/alainairom/.cache/docling/models/SuryaOcr/text_recognition/2025_09_23: 100%|███████████████████████████████████████| 12/12 [00:52<00:00, 4.34s/it]
Downloading manifest.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 127/127 [00:00<00:00, 467kB/s]
Downloading README.md: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 393/393 [00:00<00:00, 1.02MB/s]
Downloading training_args.bin: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5.49k/5.49k [00:00<00:00, 13.7MB/s]
Downloading .gitattributes: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1.48k/1.48k [00:00<00:00, 4.83MB/s]
Downloading preprocessor_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 373/373 [00:00<00:00, 1.66MB/s]
Downloading config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 858/858 [00:00<00:00, 1.82MB/s]
Downloading model.safetensors: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 73.4M/73.4M [00:03<00:00, 25.3MB/s]
Downloading text_detection model to /Users/alainairom/.cache/docling/models/SuryaOcr/text_detection/2025_05_07: 100%|█████████████████████████████████████████████| 6/6 [00:03<00:00, 1.75it/s]
Detecting bboxes: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:02<00:00, 2.35s/it]
Recognizing Text: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 40/40 [00:26<00:00, 1.52it/s]
Detecting bboxes: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.33it/s]
Recognizing Text: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 44/44 [00:17<00:00, 2.53it/s]
Detecting bboxes: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 3.71it/s]
Recognizing Text: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:07<00:00, 1.08it/s]
Successfully saved to: output/epa_sample_letter_sent_to_commissioners_dated_february_29_2015_20260121_191039.md
- Markdown file in “./output” folder;
NVIRONMENT STATES
<!-- image -->
## Dear Commissioner:
There is no higher priority for the U.S. Environmental Protection Agency than protecting public health and ensuring the safety of our nation's drinking water. Under the Safe Drinking Water Act (SDWA), «State» and other states have the primary responsibility for the implementation and enforcement of drinking water regulations, while the EPA is tasked with oversight of state efforts. Recent events in Flint, Michigan, and other U.S. cities, have led to important discussions about the safety of our nation's drinking water supplies. I am writing today to ask you to join in taking action to strengthen our safe drinking water programs, consistent with our shared recognition of the critical importance of safe drinking water for the health of all Americans.
First, with most states having primacy under SDWA, we need to work together to ensure that states are taking action to demonstrate that the Lead and Copper Rule (LCR) is being properly implemented. To this end, the EPA's Office of Water is increasing oversight of state programs to identify and address any deficiencies in current implementation of the Lead and Copper Rule. EPA staff are meeting with every state drinking water program across the country to ensure that states are taking appropriate actions to address lead action level exceedances, including optimizing corrosion control, providing effective public health communication and outreach to residents on steps to reduce exposures to lead, and removing lead service lines where required by the LCR. I ask you to join us in giving these efforts the highest priority.
Second, to assure the public of our shared commitment to addressing lead risks, I ask for your leadership in taking near-term actions to assure the public that we are doing everything we can to work together to address risks from lead in drinking water. Specifically, I urge you to take near-term action in the following areas:
- (1) Confirm that the state's protocols and procedures for implementing the LCR are fully consistent with the LCR and applicable EPA guidance;
- (2) Use relevant EPA guidance on LCR sampling protocols and procedures for optimizing corrosion control;
- (3) Post on your agency's public website all state LCR sampling protocols and guidance for identification of Tier 1 sites (at which LCR sampling is required to be conducted);
- (4) Work with public water systems - with a priority emphasis on large systems - to increase transparency in implementation of the LCR by posting on their public website and/or on your agency's website:
## UNITED STATES ENVIRONMENTAL PROTECTION AGENCY
WASHINGTON, D.C. 20460
## SAMPLE LETTER
FEB = 9 2016
OFFICE OF WATER
- the materials inventory that systems were required to complete under the LCR, including 0 the locations of lead service lines, together with any more updated inventory or map of lead service lines and lead plumbing in the system; and
- LCR compliance sampling results collected by the system, as well as justifications for 0 invalidation of LCR samples; and
- (5) Enhance efforts to ensure that residents promptly receive lead sampling results from their homes, together with clear information on lead risks and how to abate them, and that the general public receives prompt information on high lead levels in drinking water systems.
These actions are essential to restoring public confidence in our shared work to ensure safe drinking water for the American people. I ask you for your leadership and partnership in this effort and request that you respond in writing, within the next 30 days, to provide information on your activities in these areas.
To support state efforts to properly implement the LCR, the EPA will be providing information to assist states in understanding steps needed to ensure optimal corrosion control treatment and on appropriate sampling techniques. I am attaching to this letter a memorandum from the EPA's Office of Ground Water and Drinking Water summarizing EPA recommendations on sampling techniques. We will also be conducting training for state and public water systems staff to ensure that all water systems understand how to carry out the requirements of the LCR properly. Finally, we are working to revise and strengthen the LCR, but those revisions will take time to propose and finalize; our current expectation is that proposed revisions will be issued in 2017. The actions outlined above are not a substitute for needed revisions to the rule, but we can and should work together to take immediate steps to strengthen implementation of the existing rule.
While we have an immediate focus on lead in drinking water, we recognize that protection of the nation's drinking water involves both legacy and emerging contaminants, and a much broader set of scientific, technical and resource challenges as well as opportunities. This is a shared responsibility involving state, tribal, local and federal governments, system owners and operators, consumers and other stakeholders. Accordingly, in the coming weeks and months, we will be working with states and other stakeholders to identify strategies and actions to improve the safety and sustainability of our drinking water systems, including:
- ensuring adequate and sustained investment in, and attention to, regulatory oversight at all levels of government;
- using information technology to enhance transparency and accountability with regard to . reporting and public availability of drinking water compliance data;
- leveraging funding sources to finance maintenance, upgrading and replacement of aging infrastructure, especially for poor and overburdened communities; and
- identifying technology and infrastructure to address both existing and emerging contaminants. .
As always, the EPA appreciates your leadership and engagement as a partner in our efforts to protect public health and the environment. Please do not hesitate to contact me, or your staff may contact Peter Grevatt, Director of the Office of Ground Water and Drinking Water at grevatt.peter@epa.gov or (202) 564-8954.
Thank you in advance for your support to ensure that we are fulfilling our joint responsibility for the protection of public health and to restore public confidence in our shared work to ensure safe drinking water for the American people.
Sincerely,
Joel Beauvais Deputy Assistant Administrator
Enclosure
Bonus - GPU Processing (if you have one)
To enable GPU acceleration, we need to ensure the environment is configured to let PyTorch utilize the underlying hardware. Since both Surya and Docling rely on PyTorch for their deep learning models, moving the workload from the CPU to a GPU (like an NVIDIA card with CUDA or an Apple Silicon Mac with MPS) can result in a 10x to 20x increase in processing speed.
Hardware-Specific Requirements
Before running the code, ensure the correct version of PyTorch is installed on the system:
- NVIDIA GPU: the CUDA toolkit must be installed. We can verify with nvidia-smi in the terminal.
- Apple Silicon (M1/M2/M3): Uses Metal Performance Shaders (MPS).
- Linux/Windows (CPU only): Default behavior, significantly slower for OCR tasks.
Updated Code for GPU Support
We can explicitly tell the PdfPipelineOptions to use a specific device. Below is the modification to the initialization logic:
```python
import torch  # <------- new: used to detect the available hardware

from docling.datamodel.pipeline_options import AcceleratorOptions


def main():
    # Detect the best available hardware
    if torch.cuda.is_available():
        device = "cuda"
        print("🚀 Using NVIDIA GPU (CUDA)")
    elif torch.backends.mps.is_available():
        device = "mps"
        print("🚀 Using Apple Silicon GPU (MPS)")
    else:
        device = "cpu"
        print("🐢 Using CPU (Hardware acceleration not found)")

    # Pass the device to Docling via AcceleratorOptions
    # (accepts "cpu", "cuda", "mps", or "auto")
    pipeline_options = PdfPipelineOptions(
        do_ocr=True,
        ocr_model="suryaocr",
        allow_external_plugins=True,
        accelerator_options=AcceleratorOptions(device=device),
        ocr_options=SuryaOcrOptions(lang=["en"]),
    )
    # ...
```
- To make the application more user-friendly, we can also add a progress bar and process logging, for cleaner feedback during and after execution.
```python
import torch
import gc
import logging
from pathlib import Path
from datetime import datetime

from tqdm import tqdm

from docling_surya import SuryaOcrOptions

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import AcceleratorOptions, PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Configure logging
logging.basicConfig(
    filename='process_log.txt',
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)


def get_accelerator():
    if torch.cuda.is_available():
        return "cuda"
    elif torch.backends.mps.is_available():
        return "mps"
    return "cpu"


def main():
    input_dir = Path("./input")
    output_dir = Path("./output")
    output_dir.mkdir(parents=True, exist_ok=True)

    device = get_accelerator()
    logging.info(f"Starting pipeline on device: {device}")

    pipeline_options = PdfPipelineOptions(
        do_ocr=True,
        ocr_model="suryaocr",
        allow_external_plugins=True,
        accelerator_options=AcceleratorOptions(device=device),
        ocr_options=SuryaOcrOptions(lang=["en"]),
    )
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options),
            InputFormat.IMAGE: PdfFormatOption(pipeline_options=pipeline_options),
        }
    )

    supported = {".pdf", ".png", ".jpg", ".jpeg", ".tiff"}
    files = [f for f in input_dir.rglob("*") if f.suffix.lower() in supported]

    # Progress bar initialization
    pbar = tqdm(files, desc="Converting Documents", unit="file")

    for file_path in pbar:
        try:
            pbar.set_postfix({"current": file_path.name[:20]})
            result = converter.convert(str(file_path))

            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            output_path = output_dir / f"{file_path.stem}_{timestamp}.md"

            with open(output_path, "w", encoding="utf-8") as f:
                f.write(result.document.export_to_markdown())

            logging.info(f"SUCCESS: {file_path.name} -> {output_path.name}")

            # Memory cleanup
            del result
            if device == "cuda":
                torch.cuda.empty_cache()
            elif device == "mps":
                torch.mps.empty_cache()
            gc.collect()

        except Exception as e:
            logging.error(f"FAILED: {file_path.name} | Error: {str(e)}")
            continue

    print("\nProcessing complete. Check 'process_log.txt' for details.")


if __name__ == "__main__":
    main()
```
Performance Optimization: Batching
When using a GPU, we can further optimize performance by increasing the batch size. Surya processes multiple lines of text simultaneously; on a card with 8GB+ VRAM, it is possible to double the throughput by setting the environment variable: export RECOGNITION_BATCH_SIZE=64 (on Linux/Mac) or set RECOGNITION_BATCH_SIZE=64 (on Windows).
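If you prefer not to rely on shell configuration, the same setting can be applied from inside the script, provided it runs before the DocumentConverter is created, since Surya reads the variable at model-load time. A minimal sketch; the helper name and the value 64 (for an 8GB+ card) are assumptions:

```python
import os

def configure_surya_batch(size: int = 64) -> None:
    # Surya reads RECOGNITION_BATCH_SIZE from the environment when its
    # models load, so this must run before the converter is built.
    os.environ["RECOGNITION_BATCH_SIZE"] = str(size)

configure_surya_batch(64)
print(os.environ["RECOGNITION_BATCH_SIZE"])  # prints "64"
```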
- Depending on the underlying infrastructure, one of the following "requirements.txt" files can be used;
```text
# requirements_cuda.txt
--extra-index-url https://download.pytorch.org/whl/cu121
torch>=2.2.0
torchvision
docling
docling-surya
tqdm
```

```text
# requirements_apple.txt
torch>=2.2.0
torchvision
docling
docling-surya
tqdm
```
Verifying the Setup
Before running our full batch script, we can run this quick “Sanity Check” script to confirm that the code actually “sees” the underlying hardware.
```python
import torch


def check_hardware():
    print(f"PyTorch Version: {torch.__version__}")
    if torch.cuda.is_available():
        print(f"✅ CUDA Detected: {torch.cuda.get_device_name(0)}")
        return "cuda"
    elif torch.backends.mps.is_available():
        print("✅ Apple MPS Detected")
        return "mps"
    else:
        print("⚠️ No GPU found. Falling back to CPU.")
        return "cpu"


device = check_hardware()
```
Performance Note: VRAM Management
OCR is memory-intensive. If we hit an out-of-memory (OOM) error while processing large PDFs on a GPU:

- Reduce the batch size: set the environment variable `RECOGNITION_BATCH_SIZE=8`.
- Clear the cache: add `torch.cuda.empty_cache()` inside the file loop to release memory after each document is finished.
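The second tip can go one step further: catch the OOM, clear the cache, and retry the document once before giving up. Below is a hedged sketch; `convert_one` stands in for `converter.convert` and `clear_cache` for `torch.cuda.empty_cache`, both injected so the helper can be exercised without a GPU:

```python
def convert_with_retry(convert_one, path, clear_cache):
    """Try a conversion once more after releasing GPU memory on OOM."""
    try:
        return convert_one(path)
    except RuntimeError as err:
        # PyTorch surfaces CUDA OOM as a RuntimeError mentioning "out of memory"
        if "out of memory" not in str(err).lower():
            raise  # not an OOM error: re-raise unchanged
        clear_cache()  # e.g. torch.cuda.empty_cache()
        return convert_one(path)
```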
So our GPU application would be;
```python
# batch_ocr.py
import torch
import gc
import logging
from pathlib import Path
from datetime import datetime

from docling_surya import SuryaOcrOptions

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import AcceleratorOptions, PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Write a log file so a later step can audit the batch
logging.basicConfig(
    filename='process_log.txt',
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)


def get_accelerator():
    if torch.cuda.is_available():
        print(f"🚀 Hardware: NVIDIA GPU ({torch.cuda.get_device_name(0)})")
        return "cuda"
    elif torch.backends.mps.is_available():
        print("🚀 Hardware: Apple Silicon (MPS)")
        return "mps"
    else:
        print("🐢 Hardware: CPU (No acceleration found)")
        return "cpu"


def main():
    # Set up directories
    input_dir = Path("./input")
    output_dir = Path("./output")
    output_dir.mkdir(parents=True, exist_ok=True)

    device = get_accelerator()

    # Optimized pipeline configuration
    pipeline_options = PdfPipelineOptions(
        do_ocr=True,
        ocr_model="suryaocr",
        allow_external_plugins=True,
        accelerator_options=AcceleratorOptions(device=device),  # dynamically assigned
        ocr_options=SuryaOcrOptions(lang=["en"]),
    )
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options),
            InputFormat.IMAGE: PdfFormatOption(pipeline_options=pipeline_options),
        }
    )

    # File discovery
    supported = {".pdf", ".png", ".jpg", ".jpeg", ".tiff"}
    files = [f for f in input_dir.rglob("*") if f.suffix.lower() in supported]
    print(f"Starting batch process for {len(files)} files...\n")

    for file_path in files:
        try:
            start_time = datetime.now()
            print(f"Processing: {file_path.name}")
            result = converter.convert(str(file_path))

            # Save logic
            timestamp = start_time.strftime("%Y%m%d_%H%M%S")
            output_path = output_dir / f"{file_path.stem}_{timestamp}.md"
            with open(output_path, "w", encoding="utf-8") as f:
                f.write(result.document.export_to_markdown())

            logging.info(f"SUCCESS: {file_path.name} -> {output_path.name}")

            # Memory cleanup: crucial for GPU batch processing
            del result
            if device == "cuda":
                torch.cuda.empty_cache()
            elif device == "mps":
                torch.mps.empty_cache()
            gc.collect()

            elapsed = (datetime.now() - start_time).total_seconds()
            print(f"✅ Success! Saved to {output_path.name} ({elapsed:.2f}s)")

        except Exception as e:
            logging.error(f"FAILED: {file_path.name} | Error: {str(e)}")
            print(f"❌ Error processing {file_path.name}: {e}")


if __name__ == "__main__":
    main()
```
Bonus 2 — Building an image for the application
We can also build a container image in order to deploy the application on a cluster (UI to be enhanced, for sure) 🫠
```dockerfile
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

ENV PYTHONUNBUFFERED=1 \
    PYTHONDONTWRITEBYTECODE=1 \
    DEBIAN_FRONTEND=noninteractive

RUN apt-get update && apt-get install -y \
    python3.10 \
    python3-pip \
    libgl1-mesa-glx \
    libglib2.0-0 \
    && rm -rf /var/lib/apt/lists/*

# Set the working directory
WORKDIR /app

COPY requirements_cuda.txt .
RUN pip3 install --upgrade pip && \
    pip3 install -r requirements_cuda.txt

COPY batch_ocr.py .
RUN mkdir input output

# Default command
CMD ["python3", "batch_ocr.py"]
```
```shell
docker build -t surya-docling-app .

docker run --gpus all \
  -v $(pwd)/input:/app/input \
  -v $(pwd)/output:/app/output \
  surya-docling-app
```
Bonus 3 — Summary of the Execution to verify ‘Data Integrity’ for Batch Processing
In high-stakes environments, such as feeding data into a vector database for an LLM, verifying data integrity is vital.
- Quality Control: If the failure rate is high, we might need to adjust DETECTOR_TEXT_THRESHOLD.
- Audit Trail: The process_log.txt serves as a permanent record of data ingestion history.
- Efficiency: We can re-run the pipeline only for the specific files listed in the “Detailed Failure List” after fixing the source issues.
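For the last point, the failed file names can be pulled out of process_log.txt programmatically. A small sketch that parses the same `FAILED: <name> | Error:` format the logging calls write (`failed_files` is a hypothetical helper):

```python
import re

# Matches the "FAILED: <file> | Error: <reason>" lines produced by the
# batch script's error logging.
_FAILED = re.compile(r"FAILED: (.*?) \| Error:")

def failed_files(log_lines):
    """Return the file names that failed, ready for a targeted re-run."""
    return [m.group(1) for line in log_lines if (m := _FAILED.search(line))]

# Example against two synthetic log lines (the real ones come from process_log.txt):
sample = [
    "2026-01-21 19:10:39 - INFO - SUCCESS: a.pdf -> a_20260121.md",
    "2026-01-21 19:11:02 - ERROR - FAILED: b.pdf | Error: corrupt xref table",
]
print(failed_files(sample))  # ['b.pdf']
```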
```python
# summarize_results.py
import re
from collections import Counter
from pathlib import Path


def generate_report(log_file='process_log.txt'):
    if not Path(log_file).exists():
        print(f"Error: {log_file} not found.")
        return

    with open(log_file, 'r') as f:
        logs = f.readlines()

    stats = Counter()
    failures = []

    for line in logs:
        if "SUCCESS" in line:
            stats['Success'] += 1
        elif "FAILED" in line:
            stats['Failed'] += 1
            # Extract the filename and error for the failure list
            match = re.search(r"FAILED: (.*?) \| Error: (.*)", line)
            if match:
                failures.append(f"File: {match.group(1)} | Reason: {match.group(2)}")

    # Print the summary report
    print("=" * 30)
    print("Batch Processing Summary")
    print("=" * 30)
    print(f"Total Files Processed: {sum(stats.values())}")
    print(f"✅ Successes: {stats['Success']}")
    print(f"❌ Failures: {stats['Failed']}")

    if failures:
        print("\nDetailed Failure List:")
        for fail in failures:
            print(f"  - {fail}")
    print("=" * 30)


if __name__ == "__main__":
    generate_report()
```
To run it alongside the container image above;
```dockerfile
CMD python3 batch_ocr.py && python3 summarize_results.py
```
Output as JSON and Markdown (for Vector Databases)
Going further than the provided sample, we can export the output in two formats as part of a modern AI architecture:
- Markdown: Ideal for LLM Context Windows and RAG chunking because it preserves headers and list structures cleanly in text.
- JSON: Ideal for Metadata Filters in Vector Databases. It allows us to query documents by page count, language, or specific table data without having to parse the raw text.
```python
import torch
import gc
import json
import logging
from pathlib import Path
from datetime import datetime

from tqdm import tqdm

from docling_surya import SuryaOcrOptions

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import AcceleratorOptions, PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Configure logging
logging.basicConfig(filename='process_log.txt', level=logging.INFO,
                    format='%(asctime)s - %(levelname)s - %(message)s')


def get_accelerator():
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"


def main():
    input_dir, output_dir = Path("./input"), Path("./output")
    output_dir.mkdir(parents=True, exist_ok=True)
    device = get_accelerator()

    # Enable multilingual OCR and JSON export
    pipeline_options = PdfPipelineOptions(
        do_ocr=True,
        ocr_model="suryaocr",
        allow_external_plugins=True,
        accelerator_options=AcceleratorOptions(device=device),
        # Listing several language tags enables multilingual recognition
        ocr_options=SuryaOcrOptions(lang=["en", "hi", "ja", "zh", "fr"]),
    )
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options),
            InputFormat.IMAGE: PdfFormatOption(pipeline_options=pipeline_options),
        }
    )

    files = [f for f in input_dir.rglob("*") if f.suffix.lower() in {".pdf", ".png", ".jpg"}]
    pbar = tqdm(files, desc="Processing Batch")

    for file_path in pbar:
        try:
            pbar.set_postfix({"file": file_path.name[:15]})
            result = converter.convert(str(file_path))

            timestamp = datetime.now().strftime("%H%M%S")
            base_name = f"{file_path.stem}_{timestamp}"

            # 1. Save Markdown (for LLMs/RAG)
            with open(output_dir / f"{base_name}.md", "w", encoding="utf-8") as f:
                f.write(result.document.export_to_markdown())

            # 2. Save JSON (for databases/analytics)
            # This exports the full document structure, including tables and metadata
            with open(output_dir / f"{base_name}.json", "w", encoding="utf-8") as f:
                json.dump(result.document.export_to_dict(), f, indent=2, ensure_ascii=False)

            # Use the SUCCESS/FAILED format that summarize_results.py parses
            logging.info(f"SUCCESS: {file_path.name} -> {base_name}.md")

            # Memory cleanup
            del result
            if device == "cuda":
                torch.cuda.empty_cache()
            elif device == "mps":
                torch.mps.empty_cache()
            gc.collect()

        except Exception as e:
            logging.error(f"FAILED: {file_path.name} | Error: {str(e)}")


if __name__ == "__main__":
    main()
```
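Once the JSON files exist, the metadata filters can be built by reading them back. The sketch below assumes `export_to_dict()` produces top-level `name`, `tables`, and `pages` entries; check the exact schema against the Docling version in use before relying on the key names:

```python
import json
from pathlib import Path

def document_metadata(doc: dict) -> dict:
    """Derive simple filterable metadata from an exported Docling dict."""
    return {
        "name": doc.get("name", "unknown"),
        "n_tables": len(doc.get("tables", [])),
        "n_pages": len(doc.get("pages", {})),
    }

def collect_metadata(output_dir: Path) -> list[dict]:
    """Summarize every exported JSON file, e.g. for vector-DB filters."""
    records = []
    for json_path in sorted(output_dir.glob("*.json")):
        with open(json_path, encoding="utf-8") as f:
            records.append(document_metadata(json.load(f)))
    return records
```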
Conclusion
The integration of Surya-OCR and Docling represents a significant leap forward in document intelligence, bridging the gap between raw visual data and structured digital knowledge. By anchoring Surya’s state-of-the-art multilingual detection and layout analysis within Docling’s robust conversion framework, users gain a professional-grade pipeline capable of untangling the most complex document formats. This combination doesn’t just “read” text — it understands the spatial and logical relationships of headers, tables, and formulas, delivering a high-fidelity Markdown output that is tailor-made for LLM ingestion and RAG architectures. Ultimately, this pairing empowers developers to transform massive, disorganized archives into a searchable, structured, and actionable data lake with unprecedented speed and accuracy.
Thanks for reading!
Links
- Sample Used for this post: https://docling-project.github.io/docling/examples/suryaocr_with_custom_models/
- Docling Project: https://github.com/docling-project/docling
- PyPI page of surya-ocr: https://pypi.org/project/surya-ocr/0.2.0/
- Surya GitHub Page: https://github.com/datalab-to/surya


