Alain Airom (Ayrom)

Posted on Jan 29

From Raw Scans to Structured Data: How Bob Built a Powerhouse OCR App in Just 60 Minutes

#bob #ocr #docling #gutenocr

Incredible! I just discovered GutenOCR, and the results are mind-blowing. To put it to the test, Bob (our resident automation wizard) managed to whip up a fully functional application in under an hour by combining the precision of Docling with the raw power of GutenOCR.

Introduction

While scrolling through LinkedIn this morning, I caught a “liked” post from a highly-regarded connection that introduced me to GutenOCR. My curiosity led me to their GitHub and live demo site, where I was — quite frankly — blown away by the software’s capabilities. Naturally, I couldn’t resist putting Bob to the test once again. I crafted a comprehensive prompt, feeding him the GutenOCR repositories and Hugging Face links, paired with the powerhouse Docling repo. The result? In literally less than an hour, Bob synthesized the two into a working application. The speed and precision were nothing short of remarkable, and after a successful test run, I’m ready to walk you through the inner workings of the code below.

TL;DR: What is GutenOCR?

Image from Hugging Face

GutenOCR is described by its authors as a state-of-the-art grounded Vision-Language Model (VLM) for document understanding. Developed by Roots Automation and fine-tuned from the Qwen2.5-VL architecture, it offers a unified, prompt-based interface that goes beyond plain text extraction. Whether you use the 3B model for efficiency or the 7B model for more complex documents, the same toolkit is available: full-text reading with layout preservation, word-level detection, and localized reading within specific bounding boxes. It is aimed at business workflows, where it serves as a “front-end” that turns raw pixels into structured data.
GutenOCR-3B is a grounded OCR front-end obtained by fine-tuning Qwen2.5-VL-3B. The resulting single-checkpoint vision-language model exposes reading, detection, and grounding through a unified, prompt-based interface.
GutenOCR-7B is a grounded OCR front-end obtained by fine-tuning Qwen2.5-VL-7B. The resulting single-checkpoint vision-language model exposes reading, detection, and grounding through a unified, prompt-based interface.

Image from GutenOCR GitHub Repository

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from PIL import Image

# 1. Load model and processor
model_id = "rootsautomation/GutenOCR-3B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, 
    torch_dtype=torch.bfloat16, 
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# 2. Prepare inputs
image = Image.open("document.png")

# Example: Read all text
prompt = "Read all text in {image} and return a single TEXT string, linearized left-to-right/top-to-bottom."

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt},
        ],
    }
]

# 3. Process and Generate
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
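# Follow the model's device instead of hard-coding "cuda",
# so the snippet also works where device_map="auto" picks the CPU.
inputs = inputs.to(model.device)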

generated_ids = model.generate(**inputs, max_new_tokens=4096)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

print(output_text[0])
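
The slicing in step 3 is easy to misread. `generate` returns each sequence with the prompt tokens first and the newly generated tokens after them, so the snippet drops the first `len(in_ids)` entries before decoding. Here is a toy illustration with plain Python lists; the token IDs are made up for the example, and `trim_prompt_tokens` is a hypothetical helper name, not something from the GutenOCR repository:

```python
from typing import List


def trim_prompt_tokens(
    input_ids: List[List[int]], generated_ids: List[List[int]]
) -> List[List[int]]:
    """Drop the echoed prompt from the start of each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Hypothetical token IDs, for illustration only.
prompt_batch = [[101, 7, 8, 9]]
generated_batch = [[101, 7, 8, 9, 42, 43, 44]]

print(trim_prompt_tokens(prompt_batch, generated_batch))  # [[42, 43, 44]]
```

Without this step, the decoded string would start with the chat template and the prompt text rather than the OCR result.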

Implementation

I’ll skip the long introductions for Docling and Bob — at this point, their reputations for high-performance parsing and AI wizardry precede them. 😅

Instead, let’s cut straight to the chase and dive into exactly what was achieved!

The Build: A Comprehensive GutenOCR & Docling Ecosystem

What started as a challenge turned into a full-scale document processing powerhouse. Bob didn’t just build a simple script; he engineered a multi-layered application that bridges the gap between raw vision and structured intelligence.

🌟 Key Features
This integration, now hosted at GutenOCR-Test, provides a robust toolkit for modern OCR workflows:

Intelligence at Scale: Support for both GutenOCR-3B (optimized for speed) and GutenOCR-7B (optimized for high-fidelity accuracy).
Hardware Agnostic: Seamlessly switches between CPU and GPU environments, making it accessible for local testing or high-performance production.
The Best of Both Worlds:
* Standard GutenOCR UI: Focused, high-speed OCR tasks.
* Docling + GutenOCR UI: A dedicated interface for advanced document processing that preserves layouts, tables, and complex structures.
Versatile Task Handling: From standard full-text reading and LaTeX conversion to localized reading within specific bounding boxes and conditional detection.
Enterprise-Ready Deployment: The project is fully “container-native,” featuring Docker support for both CPU/GPU and complete Kubernetes (K8s) deployment configurations.
End-to-End Automation: Includes recursive batch processing for entire directories and automated maintenance scripts (start, stop, and GitHub sync).
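
To make the "recursive batch processing" item concrete, here is a minimal sketch of how image discovery over a directory tree could work. The project's own `FileProcessor` is not reproduced in this post, so the function name, signature, and extension list below are assumptions for illustration only:

```python
from pathlib import Path
from typing import List

# Assumed extension list; the real project may support a different set.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".gif", ".webp"}


def discover_images(input_dir: str, recursive: bool = True) -> List[Path]:
    """Return image files under input_dir, sorted for reproducible runs."""
    root = Path(input_dir)
    if not root.is_dir():
        return []
    # rglob walks sub-directories as well; glob stays at the top level.
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(
        p for p in candidates
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )
```

Calling `discover_images("./input")` would then give the list of paths to hand to a batch loop, one OCR call per file.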

Code Samples

Below are the two main application modules Bob produced. For Docling there are two ways to run it: in batch from the console, or through the GUI.

# gutenocr_engine.py
"""
GutenOCR Engine - Core OCR processing with CPU/GPU support
"""
import os
import torch
from typing import Optional, Dict, Any, List
from PIL import Image
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class GutenOCREngine:
    """
    GutenOCR Engine for OCR processing with CPU/GPU support
    """

    def __init__(
        self,
        model_id: str = "rootsautomation/GutenOCR-3B",
        device: str = "auto",
        use_cpu: bool = False,
        torch_dtype: Optional[torch.dtype] = None
    ):
        """
        Initialize GutenOCR Engine

        Args:
            model_id: HuggingFace model ID (GutenOCR-3B or GutenOCR-7B)
            device: Device to use ('auto', 'cuda', 'cpu')
            use_cpu: Force CPU usage even if GPU is available
            torch_dtype: Torch data type (default: bfloat16 for GPU, float32 for CPU)
        """
        self.model_id = model_id
        self.use_cpu = use_cpu

        # Determine device and dtype
        if use_cpu or not torch.cuda.is_available():
            self.device = "cpu"
            self.torch_dtype = torch_dtype or torch.float32
            logger.info("Using CPU for inference")
        else:
            self.device = device
            self.torch_dtype = torch_dtype or torch.bfloat16
            logger.info(f"Using GPU for inference with dtype {self.torch_dtype}")

        # Load model and processor
        logger.info(f"Loading model: {model_id}")
        self.model = self._load_model()
        self.processor = AutoProcessor.from_pretrained(model_id)
        logger.info("Model loaded successfully")

    def _load_model(self) -> Qwen2_5_VLForConditionalGeneration:
        """Load the model with appropriate settings"""
        try:
            if self.use_cpu:
                # CPU-specific loading
                model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
                    self.model_id,
                    torch_dtype=self.torch_dtype,
                    device_map="cpu",
                    low_cpu_mem_usage=True
                )
            else:
                # GPU loading
                model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
                    self.model_id,
                    torch_dtype=self.torch_dtype,
                    device_map=self.device
                )
            return model
        except Exception as e:
            logger.error(f"Error loading model: {e}")
            raise

    def process_image(
        self,
        image_path: str,
        task_type: str = "reading",
        output_format: str = "TEXT",
        max_new_tokens: int = 4096,
        custom_prompt: Optional[str] = None
    ) -> Dict[str, Any]:
        """
        Process an image with OCR

        Args:
            image_path: Path to the image file
            task_type: Type of task ('reading', 'detection', 'localized_reading', 'conditional_detection')
            output_format: Output format ('TEXT', 'TEXT2D', 'LINES', 'WORDS', 'PARAGRAPHS', 'LATEX', 'BOX')
            max_new_tokens: Maximum number of tokens to generate
            custom_prompt: Custom prompt (overrides default)

        Returns:
            Dictionary with OCR results
        """
        try:
            # Load image
            image = Image.open(image_path).convert("RGB")

            # Generate prompt
            if custom_prompt:
                prompt = custom_prompt
            else:
                prompt = self._generate_prompt(task_type, output_format)

            # Prepare messages
            messages = [
                {
                    "role": "user",
                    "content": [
                        {"type": "image", "image": image},
                        {"type": "text", "text": prompt},
                    ],
                }
            ]

            # Process and generate
            text = self.processor.apply_chat_template(
                messages, tokenize=False, add_generation_prompt=True
            )
            image_inputs, video_inputs = process_vision_info(messages)
            inputs = self.processor(
                text=[text],
                images=image_inputs,
                videos=video_inputs,
                padding=True,
                return_tensors="pt",
            )

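            # Move inputs to the device the model actually lives on.
            # self.device may be "auto", which is not a valid tensor device,
            # and rebuilding the batch as a plain dict would break the
            # inputs.input_ids access used when trimming the output below.
            inputs = inputs.to(self.model.device)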

            # Generate
            logger.info(f"Processing image: {image_path}")
            with torch.no_grad():
                generated_ids = self.model.generate(
                    **inputs,
                    max_new_tokens=max_new_tokens
                )

            # Decode output
            generated_ids_trimmed = [
                out_ids[len(in_ids):]
                for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
            ]
            output_text = self.processor.batch_decode(
                generated_ids_trimmed,
                skip_special_tokens=True,
                clean_up_tokenization_spaces=False
            )

            return {
                "success": True,
                "image_path": image_path,
                "task_type": task_type,
                "output_format": output_format,
                "text": output_text[0],
                "prompt": prompt
            }

        except Exception as e:
            logger.error(f"Error processing image {image_path}: {e}")
            return {
                "success": False,
                "image_path": image_path,
                "error": str(e)
            }

    def _generate_prompt(self, task_type: str, output_format: str) -> str:
        """Generate appropriate prompt based on task type and output format"""
        prompts = {
            "reading": {
                "TEXT": "Read all text in the image and return a single TEXT string, linearized left-to-right/top-to-bottom.",
                "TEXT2D": "Return a layout-sensitive TEXT2D representation of the image.",
                "LINES": "Return line-by-line OCR as LINES with bounding boxes.",
                "WORDS": "Return word-by-word OCR as WORDS with bounding boxes.",
                "PARAGRAPHS": "Return paragraph-wise OCR as PARAGRAPHS with bounding boxes.",
                "LATEX": "Extract all LaTeX expressions with bounding boxes."
            },
            "detection": {
                "BOX": "Highlight all text regions in the image by returning their bounding boxes as a JSON array."
            },
            "localized_reading": {
                "TEXT": "What does it say in the specified region of the image?"
            },
            "conditional_detection": {
                "BOX": "Find and return bounding boxes for the specified text query."
            }
        }

        return prompts.get(task_type, {}).get(
            output_format,
            "Read all text in the image."
        )

    def batch_process(
        self,
        image_paths: List[str],
        task_type: str = "reading",
        output_format: str = "TEXT",
        max_new_tokens: int = 4096
    ) -> List[Dict[str, Any]]:
        """
        Process multiple images in batch

        Args:
            image_paths: List of image paths
            task_type: Type of task
            output_format: Output format
            max_new_tokens: Maximum tokens to generate

        Returns:
            List of results for each image
        """
        results = []
        for image_path in image_paths:
            result = self.process_image(
                image_path,
                task_type=task_type,
                output_format=output_format,
                max_new_tokens=max_new_tokens
            )
            results.append(result)
        return results

    def get_device_info(self) -> Dict[str, Any]:
        """Get information about the device being used"""
        return {
            "device": self.device,
            "use_cpu": self.use_cpu,
            "torch_dtype": str(self.torch_dtype),
            "cuda_available": torch.cuda.is_available(),
            "cuda_device_count": torch.cuda.device_count() if torch.cuda.is_available() else 0,
            "model_id": self.model_id
        }

# Made with Bob
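
One thing worth noting in `_generate_prompt`: an unsupported pairing, such as `task_type="detection"` with the default `output_format="TEXT"`, silently falls back to the generic "Read all text in the image." prompt. If you would rather fail loudly, a stricter lookup could look like the sketch below. `resolve_prompt` and the trimmed `PROMPTS` table are hypothetical names for illustration and are not part of Bob's code:

```python
from typing import Dict

# Trimmed copy of the prompt table from the engine above.
PROMPTS: Dict[str, Dict[str, str]] = {
    "reading": {
        "TEXT": "Read all text in the image and return a single TEXT string, "
                "linearized left-to-right/top-to-bottom.",
        "TEXT2D": "Return a layout-sensitive TEXT2D representation of the image.",
    },
    "detection": {
        "BOX": "Highlight all text regions in the image by returning their "
               "bounding boxes as a JSON array.",
    },
}


def resolve_prompt(task_type: str, output_format: str) -> str:
    """Return the prompt for a task/format pair, or raise on a bad pairing."""
    if task_type not in PROMPTS:
        raise ValueError(
            f"Unknown task_type {task_type!r}; expected one of {sorted(PROMPTS)}"
        )
    formats = PROMPTS[task_type]
    if output_format not in formats:
        raise ValueError(
            f"task_type {task_type!r} does not support output_format "
            f"{output_format!r}; expected one of {sorted(formats)}"
        )
    return formats[output_format]


print(resolve_prompt("detection", "BOX"))
```

The trade-off is that a typo in a CLI flag or UI dropdown now stops the run immediately instead of quietly producing plain-text output.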

./scripts/start.sh --mode gradio
==========================================
GutenOCR Application Startup
==========================================
[INFO] Creating directories...
[INFO] Starting GutenOCR Gradio UI...
[INFO] Installing dependencies...
[INFO] Running on GPU (if available)
[INFO] Starting Gradio UI on http://localhost:7860
/Users/alainairom/Devs/GutenOCR-Test/src/gradio_ui.py:162: UserWarning: The parameters have been moved from the Blocks constructor to the launch() method in Gradio 6.0: theme. Please pass these parameters to launch() instead.
  with gr.Blocks(title="GutenOCR Application", theme=gr.themes.Soft()) as interface:
* Running on local URL:  http://0.0.0.0:7860
INFO:httpx:HTTP Request: GET http://localhost:7860/gradio_api/startup-events "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: HEAD http://localhost:7860/ "HTTP/1.1 200 OK"
* To create a public link, set `share=True` in `launch()`.
INFO:httpx:HTTP Request: GET https://api.gradio.app/pkg-version "HTTP/1.1 200 OK"
INFO:__main__:Initializing engine with model: rootsautomation/GutenOCR-3B, CPU: False
INFO:gutenocr_engine:Using CPU for inference
INFO:gutenocr_engine:Loading model: rootsautomation/GutenOCR-3B
`torch_dtype` is deprecated! Use `dtype` instead!
config.json: 3.35kB [00:00, 6.93MB/s]
model.safetensors.index.json: 65.5kB [00:00, 329MB/s]
model-00002-of-00002.safetensors: 100%|██████████████████████████████████████████████████| 2.51G/2.51G [02:54<00:00, 14.4MB/s]
model-00001-of-00002.safetensors: 100%|██████████████████████████████████████████████████| 5.00G/5.00G [03:20<00:00, 25.0MB/s]
Fetching 2 files: 100%|████████████████████████████████████████████████████████████████████████| 2/2 [03:20<00:00, 100.37s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████| 2/2 [00:11<00:00,  5.82s/it]
generation_config.json: 100%|████████████████████████████████████████████████████████████████| 188/188 [00:00<00:00, 2.85MB/s]
preprocessor_config.json: 100%|██████████████████████████████████████████████████████████████| 829/829 [00:00<00:00, 6.64MB/s]
tokenizer_config.json: 4.92kB [00:00, 10.4MB/s]
vocab.json: 2.78MB [00:00, 65.3MB/s]
merges.txt: 1.67MB [00:00, 38.4MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████| 11.4M/11.4M [00:01<00:00, 6.92MB/s]
added_tokens.json: 100%|█████████████████████████████████████████████████████████████████████| 605/605 [00:00<00:00, 2.54MB/s]
special_tokens_map.json: 100%|███████████████████████████████████████████████████████████████| 613/613 [00:00<00:00, 2.92MB/s]
chat_template.jinja: 4.25kB [00:00, 7.97MB/s]
video_preprocessor_config.json: 100%|████████████████████████████████████████████████████████| 913/913 [00:00<00:00, 3.36MB/s]
INFO:gutenocr_engine:Model loaded successfully
INFO:file_processor:Discovered 2 image files
/Users/alainairom/Devs/GutenOCR-Test/venv/lib/python3.14/site-packages/transformers/tokenization_utils_base.py:2919: UserWarning: `max_length` is ignored when `padding`=`True` and there is no truncation strategy. To pad to max length, use `padding='max_length'`.
  warnings.warn(
INFO:gutenocr_engine:Processing image: input/1768221419048.jpeg

# docling_gutenocr_combined.py
"""
Combined Docling + GutenOCR Application
Integrates Docling's document processing with GutenOCR's OCR capabilities
"""
import os
import json
from pathlib import Path
from datetime import datetime
from typing import List, Dict, Any, Optional
import logging

try:
    from docling.document_converter import DocumentConverter
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
    DOCLING_AVAILABLE = True
except ImportError:
    DOCLING_AVAILABLE = False
    logging.warning("Docling not available. Install with: pip install docling")

from gutenocr_engine import GutenOCREngine
from file_processor import FileProcessor

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class DoclingGutenOCRProcessor:
    """
    Combined processor using Docling for document structure and GutenOCR for OCR
    """

    def __init__(
        self,
        gutenocr_model: str = "rootsautomation/GutenOCR-3B",
        use_cpu: bool = False,
        use_docling: bool = True
    ):
        """
        Initialize combined processor

        Args:
            gutenocr_model: GutenOCR model to use
            use_cpu: Force CPU usage
            use_docling: Whether to use Docling (if available)
        """
        # Initialize GutenOCR
        self.gutenocr = GutenOCREngine(
            model_id=gutenocr_model,
            use_cpu=use_cpu
        )

        # Initialize Docling if available and requested
        self.use_docling = use_docling and DOCLING_AVAILABLE
        if self.use_docling:
            try:
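                # PdfFormatOption carries the PDF-specific settings; imported
                # here because it is only needed when Docling is enabled.
                from docling.document_converter import PdfFormatOption

                pipeline_options = PdfPipelineOptions()
                pipeline_options.do_ocr = False  # We'll use GutenOCR for OCR
                pipeline_options.do_table_structure = True

                self.docling_converter = DocumentConverter(
                    allowed_formats=[
                        InputFormat.PDF,
                        InputFormat.DOCX,
                        InputFormat.PPTX,
                        InputFormat.IMAGE,
                        InputFormat.HTML,
                        InputFormat.MD
                    ],
                    # DocumentConverter takes per-format options rather than
                    # top-level pdf_backend / pipeline_options arguments.
                    format_options={
                        InputFormat.PDF: PdfFormatOption(
                            pipeline_options=pipeline_options,
                            backend=PyPdfiumDocumentBackend
                        )
                    }
                )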
                logger.info("Docling initialized successfully")
            except Exception as e:
                logger.warning(f"Could not initialize Docling: {e}")
                self.use_docling = False
        else:
            self.docling_converter = None
            if not DOCLING_AVAILABLE:
                logger.warning("Docling not available - using GutenOCR only")

        self.file_processor = FileProcessor()

    def process_document(
        self,
        file_path: str,
        extract_structure: bool = True,
        extract_tables: bool = True,
        ocr_images: bool = True
    ) -> Dict[str, Any]:
        """
        Process a document with combined Docling + GutenOCR

        Args:
            file_path: Path to document
            extract_structure: Extract document structure with Docling
            extract_tables: Extract tables with Docling
            ocr_images: Perform OCR on images with GutenOCR

        Returns:
            Combined processing results
        """
        result = {
            "file_path": file_path,
            "timestamp": datetime.now().isoformat(),
            "docling_structure": None,
            "gutenocr_ocr": None,
            "combined_text": "",
            "metadata": {}
        }

        try:
            # Step 1: Process with Docling if available
            if self.use_docling and extract_structure:
                logger.info(f"Processing with Docling: {file_path}")
                docling_result = self._process_with_docling(
                    file_path,
                    extract_tables=extract_tables
                )
                result["docling_structure"] = docling_result
                result["metadata"]["docling_processed"] = True

            # Step 2: Process with GutenOCR
            if ocr_images:
                logger.info(f"Processing with GutenOCR: {file_path}")
                ocr_result = self.gutenocr.process_image(
                    image_path=file_path,
                    task_type="reading",
                    output_format="TEXT2D"
                )
                result["gutenocr_ocr"] = ocr_result
                result["metadata"]["gutenocr_processed"] = True

                if ocr_result.get("success"):
                    result["combined_text"] = ocr_result.get("text", "")

            # Step 3: Combine results
            if result["docling_structure"] and result["gutenocr_ocr"]:
                result["combined_text"] = self._merge_results(
                    result["docling_structure"],
                    result["gutenocr_ocr"]
                )
                result["metadata"]["processing_mode"] = "combined"
            elif result["docling_structure"]:
                result["combined_text"] = result["docling_structure"].get("text", "")
                result["metadata"]["processing_mode"] = "docling_only"
            elif result["gutenocr_ocr"]:
                result["combined_text"] = result["gutenocr_ocr"].get("text", "")
                result["metadata"]["processing_mode"] = "gutenocr_only"

            result["success"] = True

        except Exception as e:
            logger.error(f"Error processing document {file_path}: {e}")
            result["success"] = False
            result["error"] = str(e)

        return result

    def _process_with_docling(
        self,
        file_path: str,
        extract_tables: bool = True
    ) -> Dict[str, Any]:
        """Process document with Docling"""
        try:
            conv_result = self.docling_converter.convert(file_path)

            # Extract document structure
            doc_result = {
                "text": conv_result.document.export_to_markdown(),
                "structure": {
                    "pages": len(conv_result.document.pages) if hasattr(conv_result.document, 'pages') else 0,
                    "elements": []
                },
                "tables": [],
                "metadata": conv_result.document.metadata if hasattr(conv_result.document, 'metadata') else {}
            }

            # Extract tables if requested
            if extract_tables and hasattr(conv_result.document, 'tables'):
                for table in conv_result.document.tables:
                    doc_result["tables"].append({
                        "data": table.export_to_dataframe().to_dict() if hasattr(table, 'export_to_dataframe') else {},
                        "caption": getattr(table, 'caption', '')
                    })

            return doc_result

        except Exception as e:
            logger.error(f"Docling processing error: {e}")
            return {"error": str(e)}

    def _merge_results(
        self,
        docling_result: Dict[str, Any],
        gutenocr_result: Dict[str, Any]
    ) -> str:
        """
        Merge Docling structure with GutenOCR OCR results

        Args:
            docling_result: Docling processing result
            gutenocr_result: GutenOCR processing result

        Returns:
            Merged text content
        """
        merged_text = []

        # Add Docling structured content
        if docling_result.get("text"):
            merged_text.append("=== DOCUMENT STRUCTURE (Docling) ===\n")
            merged_text.append(docling_result["text"])
            merged_text.append("\n")

        # Add tables if present
        if docling_result.get("tables"):
            merged_text.append("\n=== EXTRACTED TABLES ===\n")
            for idx, table in enumerate(docling_result["tables"], 1):
                merged_text.append(f"\nTable {idx}:")
                if table.get("caption"):
                    merged_text.append(f"Caption: {table['caption']}")
                merged_text.append(str(table.get("data", {})))
                merged_text.append("\n")

        # Add GutenOCR OCR content
        if gutenocr_result.get("success") and gutenocr_result.get("text"):
            merged_text.append("\n=== OCR CONTENT (GutenOCR) ===\n")
            merged_text.append(gutenocr_result["text"])

        return "\n".join(merged_text)

    def batch_process(
        self,
        input_dir: str = "./input",
        output_dir: str = "./output",
        extract_structure: bool = True,
        extract_tables: bool = True,
        ocr_images: bool = True
    ) -> List[Dict[str, Any]]:
        """
        Batch process documents

        Args:
            input_dir: Input directory
            output_dir: Output directory
            extract_structure: Extract structure with Docling
            extract_tables: Extract tables
            ocr_images: Perform OCR

        Returns:
            List of processing results
        """
        # Update file processor directories
        self.file_processor.input_dir = Path(input_dir)
        self.file_processor.output_dir = Path(output_dir)

        # Discover files
        files = self.file_processor.discover_images(recursive=True)

        results = []
        for file_path in files:
            logger.info(f"Processing: {file_path}")
            result = self.process_document(
                file_path,
                extract_structure=extract_structure,
                extract_tables=extract_tables,
                ocr_images=ocr_images
            )
            results.append(result)

        # Save results
        timestamp_str = datetime.now().strftime("%Y%m%d_%H%M%S")
        output_path = Path(output_dir) / f"combined_results_{timestamp_str}.json"

        with open(output_path, 'w', encoding='utf-8') as f:
            json.dump(results, f, indent=2, ensure_ascii=False)

        logger.info(f"Results saved to: {output_path}")

        return results

    def get_capabilities(self) -> Dict[str, Any]:
        """Get information about available capabilities"""
        return {
            "docling_available": self.use_docling,
            "gutenocr_model": self.gutenocr.model_id,
            "device_info": self.gutenocr.get_device_info(),
            "supported_formats": [
                "PDF", "DOCX", "PPTX", "PNG", "JPG", "JPEG",
                "TIFF", "BMP", "GIF", "WEBP", "HTML", "MD"
            ] if self.use_docling else [
                "PNG", "JPG", "JPEG", "TIFF", "BMP", "GIF", "WEBP", "PDF"
            ]
        }


def main():
    """Main entry point for combined processor"""
    import argparse

    parser = argparse.ArgumentParser(description="Combined Docling + GutenOCR Processor")
    parser.add_argument("--input", default="./input", help="Input directory")
    parser.add_argument("--output", default="./output", help="Output directory")
    parser.add_argument("--model", default="rootsautomation/GutenOCR-3B", help="GutenOCR model")
    parser.add_argument("--cpu", action="store_true", help="Force CPU usage")
    parser.add_argument("--no-docling", action="store_true", help="Disable Docling")
    parser.add_argument("--no-structure", action="store_true", help="Skip structure extraction")
    parser.add_argument("--no-tables", action="store_true", help="Skip table extraction")
    parser.add_argument("--no-ocr", action="store_true", help="Skip OCR")

    args = parser.parse_args()

    # Initialize processor
    processor = DoclingGutenOCRProcessor(
        gutenocr_model=args.model,
        use_cpu=args.cpu,
        use_docling=not args.no_docling
    )

    # Print capabilities
    capabilities = processor.get_capabilities()
    print("\n=== Processor Capabilities ===")
    for key, value in capabilities.items():
        print(f"{key}: {value}")
    print()

    # Process documents
    results = processor.batch_process(
        input_dir=args.input,
        output_dir=args.output,
        extract_structure=not args.no_structure,
        extract_tables=not args.no_tables,
        ocr_images=not args.no_ocr
    )

    # Print summary
    successful = sum(1 for r in results if r.get("success"))
    print(f"\n=== Processing Complete ===")
    print(f"Total files: {len(results)}")
    print(f"Successful: {successful}")
    print(f"Failed: {len(results) - successful}")


if __name__ == "__main__":
    main()

# Made with Bob
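
Once `batch_process` has written its `combined_results_<timestamp>.json` file, a quick summary helps when there are more than a handful of documents. The sketch below relies only on keys that appear in the result dictionaries above (`success`, `file_path`, `metadata.processing_mode`); the helper names `load_results` and `summarize_results` are my own and not part of the generated project:

```python
import json
from collections import Counter
from pathlib import Path
from typing import Any, Dict, List


def load_results(path: str) -> List[Dict[str, Any]]:
    """Read a combined_results_*.json file written by batch_process."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def summarize_results(results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Count results per processing mode and list the files that failed."""
    modes = Counter(
        r.get("metadata", {}).get("processing_mode", "unknown") for r in results
    )
    failed = [r.get("file_path") for r in results if not r.get("success")]
    return {"total": len(results), "by_mode": dict(modes), "failed": failed}


# Hypothetical sample data shaped like the real output.
sample = [
    {"file_path": "a.jpeg", "success": True,
     "metadata": {"processing_mode": "gutenocr_only"}},
    {"file_path": "b.pdf", "success": False, "error": "boom", "metadata": {}},
]
print(summarize_results(sample))
```

A `by_mode` count that shows only `gutenocr_only` is a useful hint that the Docling step did not run for those files.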

Tests

The application Bob built includes a batch processing mode, which I tried on the two images shown below. Running on a CPU takes noticeably more patience than a high-end GPU would, but the wait was worth it: the results, shared below, are remarkably accurate for such a simple setup.

The Test Run: From Pixels to Precision

Here is a look at the input images and the resulting structured data. Even under CPU constraints, the fidelity of the extraction is truly impressive.

Input ⬇️

Output(s), combined or separate ⬇️

[
  {
    "file_path": "input/1768221419048.jpeg",
    "timestamp": "2026-01-29T10:36:20.309686",
    "docling_structure": null,
    "gutenocr_ocr": {
      "success": true,
      "image_path": "input/1768221419048.jpeg",
      "task_type": "reading",
      "output_format": "TEXT2D",
      "text": "HOW TO EXPLAIN AI TERMS\n          A PROFESSOR'S VISUAL GUIDE\n\n                Artificial Intelligence (AI)\n                     Machine Learning (ML)             Learning from\n                                                                 data\n                     Deep Learning (DL)                 Complex patterns\n\n                    Neural Networks (NN)              Mimics brain structure\n\n                  Attention                          Sequence\n                       Transformers                   processing\n                  Attention\n                      Generative AI (GenAI)         Create content\n                        Creates new content\n               Large Language Models (LLMs)\n                         Vast text data\n   Generative Pre-Trained Transformers (GPT)\n                        Specific application\n       pre-trained\n                                          ?     Answer ChatGPT\n                                ChatGPT\n                                                                 ?\n\n           Follow Luis Rodrigues for insights\n\n                            EXPLO EXPLO EXPLO EXPLO",
      "prompt": "Return a layout-sensitive TEXT2D representation of the image."
    },
    "combined_text": "HOW TO EXPLAIN AI TERMS\n          A PROFESSOR'S VISUAL GUIDE\n\n                Artificial Intelligence (AI)\n                     Machine Learning (ML)             Learning from\n                                                                 data\n                     Deep Learning (DL)                 Complex patterns\n\n                    Neural Networks (NN)              Mimics brain structure\n\n                  Attention                          Sequence\n                       Transformers                   processing\n                  Attention\n                      Generative AI (GenAI)         Create content\n                        Creates new content\n               Large Language Models (LLMs)\n                         Vast text data\n   Generative Pre-Trained Transformers (GPT)\n                        Specific application\n       pre-trained\n                                          ?     Answer ChatGPT\n                                ChatGPT\n                                                                 ?\n\n           Follow Luis Rodrigues for insights\n\n                            EXPLO EXPLO EXPLO EXPLO",
    "metadata": {
      "gutenocr_processed": true,
      "processing_mode": "gutenocr_only"
    },
    "success": true
  },
  {
    "file_path": "input/597919025_1354062503415836_5851124507328073158_n.jpg",
    "timestamp": "2026-01-29T10:37:06.931793",
    "docling_structure": null,
    "gutenocr_ocr": {
      "success": true,
      "image_path": "input/597919025_1354062503415836_5851124507328073158_n.jpg",
      "task_type": "reading",
      "output_format": "TEXT2D",
      "text": "LANG FOCUS                   Indo-European vocabulary\n\n   mādar     مادر       mother      barādar     برادر         brother\n   pedar     پدر       father      dokhtar     دختر         daughter\n\n   dar در     door          nām       نام           name\n   dandān دندان     tooth      gāw        كاو          cow\n related to \"dental\"; \"dent\" in French",
      "prompt": "Return a layout-sensitive TEXT2D representation of the image."
    },
    "combined_text": "LANG FOCUS                   Indo-European vocabulary\n\n   mādar     مادر       mother      barādar     برادر         brother\n   pedar     پدر       father      dokhtar     دختر         daughter\n\n   dar در     door          nām       نام           name\n   dandān دندان     tooth      gāw        كاو          cow\n related to \"dental\"; \"dent\" in French",
    "metadata": {
      "gutenocr_processed": true,
      "processing_mode": "gutenocr_only"
    },
    "success": true
  }
]

[
  {
    "success": true,
    "image_path": "input/1768221419048.jpeg",
    "task_type": "reading",
    "output_format": "TEXT",
    "text": "HOW TO EXPLAIN AI TERMS A PROFESSOR'S VISUAL GUIDE Artificial Intelligence (AI) Learning from Machine Learning (ML) data Complex Deep Learning (DL) patterns Mimics brain Neural Networks (NN) structure Attention Sequence Transformers processing Attention Create Generative AI (GenAI) content Creates new content Large Language Models (LLMs) Vast text data Generative Pre-Trained Transformers (GPT) Specific application pre-trained ? Answer ChatGPT ChatGPT ? Follow Luis Rodrigues for insights EXPLO EXPLO EXPLO EXPLO",
    "prompt": "Read all text in the image and return a single TEXT string, linearized left-to-right/top-to-bottom."
  },
  {
    "success": true,
    "image_path": "input/597919025_1354062503415836_5851124507328073158_n.jpg",
    "task_type": "reading",
    "output_format": "TEXT",
    "text": "Indo-European vocabulary LANG FOCUS mādar mother barādar brother مادر برادر pedar father dokhtar دختر daughter پدر daughter dar door nām name نام در cow dandān tooth gāw دندان related to \"dental\"; \"dent\" in French",
    "prompt": "Read all text in the image and return a single TEXT string, linearized left-to-right/top-to-bottom."
  }
]

Source: input/597919025_1354062503415836_5851124507328073158_n.jpg
Processed: 2026-01-29T10:32:18.971569
Task: reading
Format: TEXT

================================================================================

Indo-European vocabulary LANG FOCUS mādar mother barādar brother مادر برادر pedar father dokhtar دختر daughter پدر daughter dar door nām name نام در cow dandān tooth gāw دندان related to "dental"; "dent" in French
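
If you want to post-process the combined batch JSON shown above, a minimal sketch could look like the following. It assumes only the keys visible in that output (`file_path`, `docling_structure`, `combined_text`, `metadata.processing_mode`, `success`); the `summarize` helper and the trimmed `SAMPLE` string are illustrative and not part of Bob's app.

```python
import json
from pathlib import Path

# Trimmed copy of one record from the combined batch output above.
# The text field is shortened for brevity; the keys are unchanged.
SAMPLE = r"""
[
  {
    "file_path": "input/1768221419048.jpeg",
    "docling_structure": null,
    "gutenocr_ocr": {"success": true, "task_type": "reading", "output_format": "TEXT2D"},
    "combined_text": "HOW TO EXPLAIN AI TERMS\n          A PROFESSOR'S VISUAL GUIDE",
    "metadata": {"gutenocr_processed": true, "processing_mode": "gutenocr_only"},
    "success": true
  }
]
"""


def summarize(records):
    """Return one summary dict per successfully processed file (hypothetical helper)."""
    rows = []
    for rec in records:
        if not rec.get("success"):
            continue  # skip failed items rather than raising
        text = rec.get("combined_text") or ""
        rows.append({
            "file": Path(rec["file_path"]).name,
            "mode": rec.get("metadata", {}).get("processing_mode"),
            "has_docling": rec.get("docling_structure") is not None,
            "lines": len(text.splitlines()),
        })
    return rows


records = json.loads(SAMPLE)
summary = summarize(records)
for row in summary:
    print(row)
```

Note that `docling_structure` is `null` in both records above, which matches the `gutenocr_only` processing mode reported in the metadata.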

A Bit of a Digression-CPU vs. GPU: Optimizing the Workflow

Running Vision Language Models (VLMs) like GutenOCR is a heavy lift for any system. While Bob’s application is designed to be flexible, your choice of hardware — and how you tune it — will define your experience.

⚡ Performance Comparison

| Feature         | **CPU (Standard)**            | **GPU (NVIDIA RTX)**                |
| --------------- | ----------------------------- | ----------------------------------- |
| **Speed**       | 🐢 Slow (Good for single docs) | 🚀 Fast (6x - 10x speedup)           |
| **Concurrency** | Sequential processing         | Parallel batching (up to 128 pages) |
| **Best For**    | Testing & Light Automation    | Large-scale Batch Processing        |
| **Fidelity**    | Identical to GPU              | Identical to CPU                    |
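
To put the speed row into numbers, here is a rough back-of-the-envelope sketch. It uses the two timestamps from the batch output earlier in this post and the 6x to 10x range from the table. The big assumption is that each timestamp marks when a record was written, so the gap between them approximates the CPU time spent on the second image; the `gpu_estimate` helper is hypothetical and the projection is not a measurement.

```python
from datetime import datetime

# Timestamps copied from the two records in the batch output.
first = datetime.fromisoformat("2026-01-29T10:36:20.309686")
second = datetime.fromisoformat("2026-01-29T10:37:06.931793")

# Assumption: the gap between consecutive records approximates the
# CPU processing time of the second image.
cpu_seconds = (second - first).total_seconds()


def gpu_estimate(cpu_s, low=6.0, high=10.0):
    """Projected (best, worst) GPU seconds for a speedup between low and high."""
    return cpu_s / high, cpu_s / low


best, worst = gpu_estimate(cpu_seconds)
print(f"CPU: {cpu_seconds:.1f}s, projected GPU: {best:.1f}s to {worst:.1f}s")
```

Treat the result as an order-of-magnitude guide only: model loading, image size, and batching all shift the real numbers.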

💡 Ideas for Maximum Efficiency

If you’re finding the process a bit slow on your machine, here is how you can “tune the engine” just like Bob did:

Choose the Right Model: Use GutenOCR-3B for daily tasks. It requires significantly less RAM (~8GB) and is much snappier than the 7B version without sacrificing too much accuracy for standard fonts.
Scale Your Images: In the configuration, keep image_scale at 1.0. Increasing it to 2.0 improves quality for tiny text but can double your processing time and memory usage.
Leverage Batch Processing: If you have an NVIDIA GPU, Bob has enabled vLLM support. This allows the app to “prefill” multiple pages at once, which is a total game-changer for 100+ page PDFs.
Toggle OCR Wisely: For digital-native PDFs (where you can already select text), you can disable the “Full OCR” mode in the UI to let Docling handle the structure extraction solo — it’s roughly 10x faster!
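
The four tips above can be condensed into a small decision helper. This is only a sketch: `image_scale` is the setting named in the tip, while `OcrSettings`, `suggest_settings`, and the other field names are hypothetical and do not necessarily match the configuration keys in the actual repository.

```python
from dataclasses import dataclass


@dataclass
class OcrSettings:
    model: str          # "GutenOCR-3B" or "GutenOCR-7B"
    image_scale: float  # 1.0 by default, 2.0 for tiny text
    use_vllm: bool      # hypothetical flag: batch through vLLM on NVIDIA GPUs
    full_ocr: bool      # hypothetical flag: run GutenOCR on top of Docling


def suggest_settings(has_nvidia_gpu, digital_native_pdf,
                     tiny_text=False, complex_docs=False):
    """Map the tuning tips onto one settings object (illustrative only)."""
    return OcrSettings(
        # 3B for daily tasks, 7B only for high-complexity documents.
        model="GutenOCR-7B" if complex_docs else "GutenOCR-3B",
        # Upscaling helps tiny text but costs time and memory.
        image_scale=2.0 if tiny_text else 1.0,
        # vLLM batching only pays off with an NVIDIA GPU.
        use_vllm=has_nvidia_gpu,
        # Digital-native PDFs already carry selectable text, so skip full OCR.
        full_ocr=not digital_native_pdf,
    )


laptop = suggest_settings(has_nvidia_gpu=False, digital_native_pdf=True)
print(laptop)
```

For a CPU-only laptop and a digital-native PDF, this picks the 3B model, leaves scaling at 1.0, and lets Docling handle the document on its own.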

Final Thoughts: The Future of Document AI is Here

This experiment proved one thing: the barrier between “raw document” and “structured data” has officially collapsed. By combining the structural intelligence of Docling with the high-fidelity vision of GutenOCR, we’ve moved past simple text extraction into the realm of true document understanding.

What’s most impressive isn’t just the accuracy — it’s the accessibility. The fact that Bob could bridge these complex technologies into a functional, containerized app in under an hour shows how powerful the modern AI ecosystem has become. We are no longer waiting for “enterprise solutions” to catch up; the tools are in our hands right now.

🚀 Get Involved

I’ve made the repository public so you can take it for a spin yourself. Whether you’re processing a single invoice or a thousand-page archive, this stack is ready to work.

Check out the Repo: aairom/GutenOCR-Test
Try the Models: With thanks to the Roots Automation team, who publish them on Hugging Face. 🌞

What are you planning to automate next? If you have a specific document challenge or want to see Bob tackle a different integration, let me know in the comments below!