Incredible! I just discovered GutenOCR, and the results are mind-blowing. To put it to the test, Bob (our resident automation wizard) managed to whip up a fully functional application in under an hour by combining the precision of Docling with the raw power of GutenOCR.
Introduction
While scrolling through LinkedIn this morning, I caught a “liked” post from a highly-regarded connection that introduced me to GutenOCR. My curiosity led me to their GitHub and live demo site, where I was — quite frankly — blown away by the software’s capabilities. Naturally, I couldn’t resist putting Bob to the test once again. I crafted a comprehensive prompt, feeding him the GutenOCR repositories and Hugging Face links, paired with the powerhouse Docling repo. The result? In literally less than an hour, Bob synthesized the two into a working application. The speed and precision were nothing short of remarkable, and after a successful test run, I’m ready to walk you through the inner workings of the code below.
TL;DR: What is GutenOCR?
Image from Hugging Face
GutenOCR is described by its authors as a state-of-the-art grounded vision-language model (VLM), developed by Roots Automation for document understanding. Fine-tuned from the Qwen2.5-VL architecture, it offers a unified, prompt-based interface that goes well beyond plain text extraction. Whether you use the lighter 3B model for efficiency or the 7B model for more complex documents, it covers full-text reading with layout preservation, word-level detection, and localized reading within specific bounding boxes. It is aimed at business workflows, where it serves as a high-fidelity “front-end” that turns raw pixels into structured, actionable data.
GutenOCR-3B is a grounded OCR front-end obtained by fine-tuning Qwen2.5-VL-3B. The resulting single-checkpoint vision-language model exposes reading, detection, and grounding through a unified, prompt-based interface.
GutenOCR-7B is a grounded OCR front-end obtained by fine-tuning Qwen2.5-VL-7B. The resulting single-checkpoint vision-language model exposes reading, detection, and grounding through a unified, prompt-based interface.
Image from GutenOCR GitHub Repository
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from PIL import Image

# 1. Load model and processor
model_id = "rootsautomation/GutenOCR-3B"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# 2. Prepare inputs
image = Image.open("document.png")

# Example: Read all text
prompt = "Read all text in {image} and return a single TEXT string, linearized left-to-right/top-to-bottom."
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt},
        ],
    }
]

# 3. Process and generate
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Follow the device chosen by device_map="auto" instead of hard-coding "cuda",
# so the same snippet also runs on machines without an NVIDIA GPU
inputs = inputs.to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=4096)

# Drop the prompt tokens so only the newly generated tokens are decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text[0])
Implementation
I’ll skip the long introductions for Docling and Bob — at this point, their reputations for high-performance parsing and AI wizardry precede them. 😅
Instead, let’s cut straight to the chase and dive into exactly what was achieved!
The Build: A Comprehensive GutenOCR & Docling Ecosystem
What started as a challenge turned into a full-scale document processing powerhouse. Bob didn’t just build a simple script; he engineered a multi-layered application that bridges the gap between raw vision and structured intelligence.
🌟 Key Features
This integration, now hosted at GutenOCR-Test, provides a robust toolkit for modern OCR workflows:
- Intelligence at Scale: Support for both GutenOCR-3B (optimized for speed) and GutenOCR-7B (optimized for high-fidelity accuracy).
- Hardware Agnostic: Seamlessly switches between CPU and GPU environments, making it accessible for local testing or high-performance production.
- The Best of Both Worlds:
  - Standard GutenOCR UI: Focused, high-speed OCR tasks.
  - Docling + GutenOCR UI: A dedicated interface for advanced document processing that preserves layouts, tables, and complex structures.
- Versatile Task Handling: From standard full-text reading and LaTeX conversion to localized reading within specific bounding boxes and conditional detection.
- Enterprise-Ready Deployment: The project is fully “container-native,” featuring Docker support for both CPU/GPU and complete Kubernetes (K8s) deployment configurations.
- End-to-End Automation: Includes recursive batch processing for entire directories and automated maintenance scripts (start, stop, and GitHub sync).
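The recursive batch step in the last bullet can be pictured roughly as follows. This is a minimal sketch, not the repository's actual `FileProcessor`: the name `discover_images` mirrors a call that appears later in this post, but the signature and the extension set used here are assumptions for illustration.

```python
from pathlib import Path
from typing import List

# Assumed extension set; the real project may accept a different list.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp", ".gif", ".webp"}


def discover_images(input_dir: str, recursive: bool = True) -> List[Path]:
    """Return image files under input_dir, sorted for reproducible runs."""
    root = Path(input_dir)
    if not root.is_dir():
        return []
    # rglob walks every sub-directory, glob only looks at the top level
    candidates = root.rglob("*") if recursive else root.glob("*")
    return sorted(
        p for p in candidates
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )
```

Sorting the paths keeps the order of results stable between runs, which makes the JSON outputs easier to compare.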
Code Samples
Below are the two main application modules that Bob produced. For Docling, there are two entry points: batch processing from the console, or through the GUI.
# gutenocr_engine.py
"""
GutenOCR Engine - Core OCR processing with CPU/GPU support
"""
import os
import torch
from typing import Optional, Dict, Any, List
from PIL import Image
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class GutenOCREngine:
    """
    GutenOCR Engine for OCR processing with CPU/GPU support
    """

    def __init__(
        self,
        model_id: str = "rootsautomation/GutenOCR-3B",
        device: str = "auto",
        use_cpu: bool = False,
        torch_dtype: Optional[torch.dtype] = None
    ):
        """
        Initialize GutenOCR Engine

        Args:
            model_id: HuggingFace model ID (GutenOCR-3B or GutenOCR-7B)
            device: Device to use ('auto', 'cuda', 'cpu')
            use_cpu: Force CPU usage even if GPU is available
            torch_dtype: Torch data type (default: bfloat16 for GPU, float32 for CPU)
        """
        self.model_id = model_id
        self.use_cpu = use_cpu

        # Determine device and dtype
        if use_cpu or not torch.cuda.is_available():
            self.device = "cpu"
            self.torch_dtype = torch_dtype or torch.float32
            logger.info("Using CPU for inference")
        else:
            self.device = device
            self.torch_dtype = torch_dtype or torch.bfloat16
            logger.info(f"Using GPU for inference with dtype {self.torch_dtype}")

        # Load model and processor
        logger.info(f"Loading model: {model_id}")
        self.model = self._load_model()
        self.processor = AutoProcessor.from_pretrained(model_id)
        logger.info("Model loaded successfully")

    def _load_model(self) -> Qwen2_5_VLForConditionalGeneration:
        """Load the model with appropriate settings"""
        try:
            if self.use_cpu:
                # CPU-specific loading
                model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
                    self.model_id,
                    torch_dtype=self.torch_dtype,
                    device_map="cpu",
                    low_cpu_mem_usage=True
                )
            else:
                # GPU loading
                model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
                    self.model_id,
                    torch_dtype=self.torch_dtype,
                    device_map=self.device
                )
            return model
        except Exception as e:
            logger.error(f"Error loading model: {e}")
            raise

    def process_image(
        self,
        image_path: str,
        task_type: str = "reading",
        output_format: str = "TEXT",
        max_new_tokens: int = 4096,
        custom_prompt: Optional[str] = None
    ) -> Dict[str, Any]:
        """
        Process an image with OCR

        Args:
            image_path: Path to the image file
            task_type: Type of task ('reading', 'detection', 'localized_reading', 'conditional_detection')
            output_format: Output format ('TEXT', 'TEXT2D', 'LINES', 'WORDS', 'PARAGRAPHS', 'LATEX', 'BOX')
            max_new_tokens: Maximum number of tokens to generate
            custom_prompt: Custom prompt (overrides default)

        Returns:
            Dictionary with OCR results
        """
        try:
            # Load image
            image = Image.open(image_path).convert("RGB")

            # Generate prompt
            if custom_prompt:
                prompt = custom_prompt
            else:
                prompt = self._generate_prompt(task_type, output_format)

            # Prepare messages
            messages = [
                {
                    "role": "user",
                    "content": [
                        {"type": "image", "image": image},
                        {"type": "text", "text": prompt},
                    ],
                }
            ]

            # Process and generate
            text = self.processor.apply_chat_template(
                messages, tokenize=False, add_generation_prompt=True
            )
            image_inputs, video_inputs = process_vision_info(messages)
            inputs = self.processor(
                text=[text],
                images=image_inputs,
                videos=video_inputs,
                padding=True,
                return_tensors="pt",
            )

            # Move tensors to the device the model was actually loaded on.
            # Keeping the BatchFeature (instead of rebuilding a plain dict)
            # preserves attribute access such as inputs.input_ids below, and
            # avoids passing the "auto" placeholder to .to().
            inputs = inputs.to(self.model.device)

            # Generate
            logger.info(f"Processing image: {image_path}")
            with torch.no_grad():
                generated_ids = self.model.generate(
                    **inputs,
                    max_new_tokens=max_new_tokens
                )

            # Decode output, skipping the prompt tokens
            generated_ids_trimmed = [
                out_ids[len(in_ids):]
                for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
            ]
            output_text = self.processor.batch_decode(
                generated_ids_trimmed,
                skip_special_tokens=True,
                clean_up_tokenization_spaces=False
            )

            return {
                "success": True,
                "image_path": image_path,
                "task_type": task_type,
                "output_format": output_format,
                "text": output_text[0],
                "prompt": prompt
            }
        except Exception as e:
            logger.error(f"Error processing image {image_path}: {e}")
            return {
                "success": False,
                "image_path": image_path,
                "error": str(e)
            }

    def _generate_prompt(self, task_type: str, output_format: str) -> str:
        """Generate appropriate prompt based on task type and output format"""
        prompts = {
            "reading": {
                "TEXT": "Read all text in the image and return a single TEXT string, linearized left-to-right/top-to-bottom.",
                "TEXT2D": "Return a layout-sensitive TEXT2D representation of the image.",
                "LINES": "Return line-by-line OCR as LINES with bounding boxes.",
                "WORDS": "Return word-by-word OCR as WORDS with bounding boxes.",
                "PARAGRAPHS": "Return paragraph-wise OCR as PARAGRAPHS with bounding boxes.",
                "LATEX": "Extract all LaTeX expressions with bounding boxes."
            },
            "detection": {
                "BOX": "Highlight all text regions in the image by returning their bounding boxes as a JSON array."
            },
            "localized_reading": {
                "TEXT": "What does it say in the specified region of the image?"
            },
            "conditional_detection": {
                "BOX": "Find and return bounding boxes for the specified text query."
            }
        }
        # Unknown task/format combinations fall back to a generic reading prompt
        return prompts.get(task_type, {}).get(
            output_format,
            "Read all text in the image."
        )

    def batch_process(
        self,
        image_paths: List[str],
        task_type: str = "reading",
        output_format: str = "TEXT",
        max_new_tokens: int = 4096
    ) -> List[Dict[str, Any]]:
        """
        Process multiple images in batch

        Args:
            image_paths: List of image paths
            task_type: Type of task
            output_format: Output format
            max_new_tokens: Maximum tokens to generate

        Returns:
            List of results for each image
        """
        results = []
        # Images are processed one after another; a failure on one image is
        # reported in its result dict and does not stop the loop
        for image_path in image_paths:
            result = self.process_image(
                image_path,
                task_type=task_type,
                output_format=output_format,
                max_new_tokens=max_new_tokens
            )
            results.append(result)
        return results

    def get_device_info(self) -> Dict[str, Any]:
        """Get information about the device being used"""
        return {
            "device": self.device,
            "use_cpu": self.use_cpu,
            "torch_dtype": str(self.torch_dtype),
            "cuda_available": torch.cuda.is_available(),
            "cuda_device_count": torch.cuda.device_count() if torch.cuda.is_available() else 0,
            "model_id": self.model_id
        }

# Made with Bob
./scripts/start.sh --mode gradio
==========================================
GutenOCR Application Startup
==========================================
[INFO] Creating directories...
[INFO] Starting GutenOCR Gradio UI...
[INFO] Installing dependencies...
[INFO] Running on GPU (if available)
[INFO] Starting Gradio UI on http://localhost:7860
/Users/alainairom/Devs/GutenOCR-Test/src/gradio_ui.py:162: UserWarning: The parameters have been moved from the Blocks constructor to the launch() method in Gradio 6.0: theme. Please pass these parameters to launch() instead.
with gr.Blocks(title="GutenOCR Application", theme=gr.themes.Soft()) as interface:
* Running on local URL: http://0.0.0.0:7860
INFO:httpx:HTTP Request: GET http://localhost:7860/gradio_api/startup-events "HTTP/1.1 200 OK"
INFO:httpx:HTTP Request: HEAD http://localhost:7860/ "HTTP/1.1 200 OK"
* To create a public link, set `share=True` in `launch()`.
INFO:httpx:HTTP Request: GET https://api.gradio.app/pkg-version "HTTP/1.1 200 OK"
INFO:__main__:Initializing engine with model: rootsautomation/GutenOCR-3B, CPU: False
INFO:gutenocr_engine:Using CPU for inference
INFO:gutenocr_engine:Loading model: rootsautomation/GutenOCR-3B
`torch_dtype` is deprecated! Use `dtype` instead!
config.json: 3.35kB [00:00, 6.93MB/s]
model.safetensors.index.json: 65.5kB [00:00, 329MB/s]
model-00002-of-00002.safetensors: 100%|██████████████████████████████████████████████████| 2.51G/2.51G [02:54<00:00, 14.4MB/s]
model-00001-of-00002.safetensors: 100%|██████████████████████████████████████████████████| 5.00G/5.00G [03:20<00:00, 25.0MB/s]
Fetching 2 files: 100%|████████████████████████████████████████████████████████████████████████| 2/2 [03:20<00:00, 100.37s/it]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████| 2/2 [00:11<00:00, 5.82s/it]
generation_config.json: 100%|████████████████████████████████████████████████████████████████| 188/188 [00:00<00:00, 2.85MB/s]
preprocessor_config.json: 100%|██████████████████████████████████████████████████████████████| 829/829 [00:00<00:00, 6.64MB/s]
tokenizer_config.json: 4.92kB [00:00, 10.4MB/s]
vocab.json: 2.78MB [00:00, 65.3MB/s]
merges.txt: 1.67MB [00:00, 38.4MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████| 11.4M/11.4M [00:01<00:00, 6.92MB/s]
added_tokens.json: 100%|█████████████████████████████████████████████████████████████████████| 605/605 [00:00<00:00, 2.54MB/s]
special_tokens_map.json: 100%|███████████████████████████████████████████████████████████████| 613/613 [00:00<00:00, 2.92MB/s]
chat_template.jinja: 4.25kB [00:00, 7.97MB/s]
video_preprocessor_config.json: 100%|████████████████████████████████████████████████████████| 913/913 [00:00<00:00, 3.36MB/s]
INFO:gutenocr_engine:Model loaded successfully
INFO:file_processor:Discovered 2 image files
/Users/alainairom/Devs/GutenOCR-Test/venv/lib/python3.14/site-packages/transformers/tokenization_utils_base.py:2919: UserWarning: `max_length` is ignored when `padding`=`True` and there is no truncation strategy. To pad to max length, use `padding='max_length'`.
warnings.warn(
INFO:gutenocr_engine:Processing image: input/1768221419048.jpeg
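Note that the log above reports `CPU: False` and then "Using CPU for inference": the engine only checks for CUDA, so a machine without an NVIDIA GPU silently falls back to CPU. Below is a minimal sketch of a selection order that also considers Apple's MPS backend. The helper `pick_device` is hypothetical and not part of the repository, and whether GutenOCR produces correct results on MPS is an assumption that has not been tested here.

```python
def pick_device(force_cpu: bool, cuda_available: bool, mps_available: bool) -> str:
    """Choose a torch device string: CUDA first, then MPS, then CPU."""
    if force_cpu:
        return "cpu"
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


# Example: a Mac with Apple Silicon and no NVIDIA GPU
print(pick_device(force_cpu=False, cuda_available=False, mps_available=True))
```

With PyTorch installed, the two flags would come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`. Keeping the decision in a plain function makes it easy to test without any GPU present.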
# docling_gutenocr_combined.py
"""
Combined Docling + GutenOCR Application
Integrates Docling's document processing with GutenOCR's OCR capabilities
"""
import os
import json
from pathlib import Path
from datetime import datetime
from typing import List, Dict, Any, Optional
import logging

try:
    from docling.document_converter import DocumentConverter, PdfFormatOption
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
    DOCLING_AVAILABLE = True
except ImportError:
    DOCLING_AVAILABLE = False
    logging.warning("Docling not available. Install with: pip install docling")

from gutenocr_engine import GutenOCREngine
from file_processor import FileProcessor

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class DoclingGutenOCRProcessor:
    """
    Combined processor using Docling for document structure and GutenOCR for OCR
    """

    def __init__(
        self,
        gutenocr_model: str = "rootsautomation/GutenOCR-3B",
        use_cpu: bool = False,
        use_docling: bool = True
    ):
        """
        Initialize combined processor

        Args:
            gutenocr_model: GutenOCR model to use
            use_cpu: Force CPU usage
            use_docling: Whether to use Docling (if available)
        """
        # Initialize GutenOCR
        self.gutenocr = GutenOCREngine(
            model_id=gutenocr_model,
            use_cpu=use_cpu
        )

        # Initialize Docling if available and requested
        self.docling_converter = None
        self.use_docling = use_docling and DOCLING_AVAILABLE
        if self.use_docling:
            try:
                pipeline_options = PdfPipelineOptions()
                pipeline_options.do_ocr = False  # We'll use GutenOCR for OCR
                pipeline_options.do_table_structure = True
                # Docling v2 takes per-format options: the PDF backend and the
                # pipeline options are passed through PdfFormatOption rather
                # than as top-level keyword arguments. These options apply to
                # PDF inputs only; other formats keep Docling's defaults.
                self.docling_converter = DocumentConverter(
                    allowed_formats=[
                        InputFormat.PDF,
                        InputFormat.DOCX,
                        InputFormat.PPTX,
                        InputFormat.IMAGE,
                        InputFormat.HTML,
                        InputFormat.MD
                    ],
                    format_options={
                        InputFormat.PDF: PdfFormatOption(
                            pipeline_options=pipeline_options,
                            backend=PyPdfiumDocumentBackend
                        )
                    }
                )
                logger.info("Docling initialized successfully")
            except Exception as e:
                logger.warning(f"Could not initialize Docling: {e}")
                self.use_docling = False
        else:
            if not DOCLING_AVAILABLE:
                logger.warning("Docling not available - using GutenOCR only")

        self.file_processor = FileProcessor()

    def process_document(
        self,
        file_path: str,
        extract_structure: bool = True,
        extract_tables: bool = True,
        ocr_images: bool = True
    ) -> Dict[str, Any]:
        """
        Process a document with combined Docling + GutenOCR

        Args:
            file_path: Path to document
            extract_structure: Extract document structure with Docling
            extract_tables: Extract tables with Docling
            ocr_images: Perform OCR on images with GutenOCR

        Returns:
            Combined processing results
        """
        result = {
            "file_path": file_path,
            "timestamp": datetime.now().isoformat(),
            "docling_structure": None,
            "gutenocr_ocr": None,
            "combined_text": "",
            "metadata": {}
        }
        try:
            # Step 1: Process with Docling if available
            if self.use_docling and extract_structure:
                logger.info(f"Processing with Docling: {file_path}")
                docling_result = self._process_with_docling(
                    file_path,
                    extract_tables=extract_tables
                )
                result["docling_structure"] = docling_result
                result["metadata"]["docling_processed"] = True

            # Step 2: Process with GutenOCR
            if ocr_images:
                logger.info(f"Processing with GutenOCR: {file_path}")
                ocr_result = self.gutenocr.process_image(
                    image_path=file_path,
                    task_type="reading",
                    output_format="TEXT2D"
                )
                result["gutenocr_ocr"] = ocr_result
                result["metadata"]["gutenocr_processed"] = True
                if ocr_result.get("success"):
                    result["combined_text"] = ocr_result.get("text", "")

            # Step 3: Combine results
            if result["docling_structure"] and result["gutenocr_ocr"]:
                result["combined_text"] = self._merge_results(
                    result["docling_structure"],
                    result["gutenocr_ocr"]
                )
                result["metadata"]["processing_mode"] = "combined"
            elif result["docling_structure"]:
                result["combined_text"] = result["docling_structure"].get("text", "")
                result["metadata"]["processing_mode"] = "docling_only"
            elif result["gutenocr_ocr"]:
                result["combined_text"] = result["gutenocr_ocr"].get("text", "")
                result["metadata"]["processing_mode"] = "gutenocr_only"

            result["success"] = True
        except Exception as e:
            logger.error(f"Error processing document {file_path}: {e}")
            result["success"] = False
            result["error"] = str(e)
        return result

    def _process_with_docling(
        self,
        file_path: str,
        extract_tables: bool = True
    ) -> Dict[str, Any]:
        """Process document with Docling"""
        try:
            conv_result = self.docling_converter.convert(file_path)

            # Extract document structure
            doc_result = {
                "text": conv_result.document.export_to_markdown(),
                "structure": {
                    "pages": len(conv_result.document.pages) if hasattr(conv_result.document, 'pages') else 0,
                    "elements": []
                },
                "tables": [],
                "metadata": conv_result.document.metadata if hasattr(conv_result.document, 'metadata') else {}
            }

            # Extract tables if requested
            if extract_tables and hasattr(conv_result.document, 'tables'):
                for table in conv_result.document.tables:
                    doc_result["tables"].append({
                        "data": table.export_to_dataframe().to_dict() if hasattr(table, 'export_to_dataframe') else {},
                        "caption": getattr(table, 'caption', '')
                    })
            return doc_result
        except Exception as e:
            logger.error(f"Docling processing error: {e}")
            return {"error": str(e)}

    def _merge_results(
        self,
        docling_result: Dict[str, Any],
        gutenocr_result: Dict[str, Any]
    ) -> str:
        """
        Merge Docling structure with GutenOCR OCR results

        Args:
            docling_result: Docling processing result
            gutenocr_result: GutenOCR processing result

        Returns:
            Merged text content
        """
        merged_text = []

        # Add Docling structured content
        if docling_result.get("text"):
            merged_text.append("=== DOCUMENT STRUCTURE (Docling) ===\n")
            merged_text.append(docling_result["text"])
            merged_text.append("\n")

        # Add tables if present
        if docling_result.get("tables"):
            merged_text.append("\n=== EXTRACTED TABLES ===\n")
            for idx, table in enumerate(docling_result["tables"], 1):
                merged_text.append(f"\nTable {idx}:")
                if table.get("caption"):
                    merged_text.append(f"Caption: {table['caption']}")
                merged_text.append(str(table.get("data", {})))
                merged_text.append("\n")

        # Add GutenOCR OCR content
        if gutenocr_result.get("success") and gutenocr_result.get("text"):
            merged_text.append("\n=== OCR CONTENT (GutenOCR) ===\n")
            merged_text.append(gutenocr_result["text"])

        return "\n".join(merged_text)

    def batch_process(
        self,
        input_dir: str = "./input",
        output_dir: str = "./output",
        extract_structure: bool = True,
        extract_tables: bool = True,
        ocr_images: bool = True
    ) -> List[Dict[str, Any]]:
        """
        Batch process documents

        Args:
            input_dir: Input directory
            output_dir: Output directory
            extract_structure: Extract structure with Docling
            extract_tables: Extract tables
            ocr_images: Perform OCR

        Returns:
            List of processing results
        """
        # Update file processor directories
        self.file_processor.input_dir = Path(input_dir)
        self.file_processor.output_dir = Path(output_dir)

        # Discover files
        files = self.file_processor.discover_images(recursive=True)

        results = []
        for file_path in files:
            logger.info(f"Processing: {file_path}")
            result = self.process_document(
                file_path,
                extract_structure=extract_structure,
                extract_tables=extract_tables,
                ocr_images=ocr_images
            )
            results.append(result)

        # Save results
        timestamp_str = datetime.now().strftime("%Y%m%d_%H%M%S")
        output_path = Path(output_dir) / f"combined_results_{timestamp_str}.json"
        with open(output_path, 'w', encoding='utf-8') as f:
            json.dump(results, f, indent=2, ensure_ascii=False)
        logger.info(f"Results saved to: {output_path}")
        return results

    def get_capabilities(self) -> Dict[str, Any]:
        """Get information about available capabilities"""
        return {
            "docling_available": self.use_docling,
            "gutenocr_model": self.gutenocr.model_id,
            "device_info": self.gutenocr.get_device_info(),
            "supported_formats": [
                "PDF", "DOCX", "PPTX", "PNG", "JPG", "JPEG",
                "TIFF", "BMP", "GIF", "WEBP", "HTML", "MD"
            ] if self.use_docling else [
                "PNG", "JPG", "JPEG", "TIFF", "BMP", "GIF", "WEBP", "PDF"
            ]
        }


def main():
    """Main entry point for combined processor"""
    import argparse

    parser = argparse.ArgumentParser(description="Combined Docling + GutenOCR Processor")
    parser.add_argument("--input", default="./input", help="Input directory")
    parser.add_argument("--output", default="./output", help="Output directory")
    parser.add_argument("--model", default="rootsautomation/GutenOCR-3B", help="GutenOCR model")
    parser.add_argument("--cpu", action="store_true", help="Force CPU usage")
    parser.add_argument("--no-docling", action="store_true", help="Disable Docling")
    parser.add_argument("--no-structure", action="store_true", help="Skip structure extraction")
    parser.add_argument("--no-tables", action="store_true", help="Skip table extraction")
    parser.add_argument("--no-ocr", action="store_true", help="Skip OCR")
    args = parser.parse_args()

    # Initialize processor
    processor = DoclingGutenOCRProcessor(
        gutenocr_model=args.model,
        use_cpu=args.cpu,
        use_docling=not args.no_docling
    )

    # Print capabilities
    capabilities = processor.get_capabilities()
    print("\n=== Processor Capabilities ===")
    for key, value in capabilities.items():
        print(f"{key}: {value}")
    print()

    # Process documents
    results = processor.batch_process(
        input_dir=args.input,
        output_dir=args.output,
        extract_structure=not args.no_structure,
        extract_tables=not args.no_tables,
        ocr_images=not args.no_ocr
    )

    # Print summary
    successful = sum(1 for r in results if r.get("success"))
    print("\n=== Processing Complete ===")
    print(f"Total files: {len(results)}")
    print(f"Successful: {successful}")
    print(f"Failed: {len(results) - successful}")


if __name__ == "__main__":
    main()

# Made with Bob
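One thing to watch in the combined flow: `process_document` hands the same path to both Docling and `GutenOCREngine.process_image`, and the latter opens it with `Image.open`, which cannot read PDF, DOCX, or PPTX files. A possible guard is to decide per file which stages make sense. The sketch below is only an illustration: `plan_stages` is a hypothetical helper, and the two extension sets are assumptions rather than the project's actual configuration.

```python
from pathlib import Path
from typing import List

# Assumed groupings, for illustration only
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp", ".gif", ".webp"}
DOCUMENT_SUFFIXES = {".pdf", ".docx", ".pptx", ".html", ".md"}


def plan_stages(file_path: str, docling_enabled: bool = True) -> List[str]:
    """Return the processing stages that apply to a file, in order."""
    suffix = Path(file_path).suffix.lower()
    stages: List[str] = []
    # Docling can take both office-style documents and images
    if docling_enabled and suffix in (IMAGE_SUFFIXES | DOCUMENT_SUFFIXES):
        stages.append("docling")
    # The OCR engine loads files through PIL, so only images qualify
    if suffix in IMAGE_SUFFIXES:
        stages.append("gutenocr")
    return stages


print(plan_stages("report.pdf"))  # ['docling']
print(plan_stages("scan.JPG"))    # ['docling', 'gutenocr']
```

To run GutenOCR on PDF pages as well, each page would first have to be rendered to an image, which this sketch does not attempt.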
Tests
True to form, the application Bob built comes with a batch processing engine. I put it to the test using the two images shown below. Running this on a CPU admittedly requires a bit of patience compared to a high-end GPU, but the wait was well worth it. The results, which I’ve shared below, are largely accurate: the layout-aware TEXT2D output keeps each word next to its translation, while the linearized TEXT output shuffles the order of the Persian-script words in the second image and repeats “daughter” once.
The Test Run: From Pixels to Precision
Here is a look at the input images and the resulting structured data. Even under CPU constraints, the fidelity of the extraction is truly impressive.
- Input ⬇️
- Output(s) (combined or separately) ⬆️
[
{
"file_path": "input/1768221419048.jpeg",
"timestamp": "2026-01-29T10:36:20.309686",
"docling_structure": null,
"gutenocr_ocr": {
"success": true,
"image_path": "input/1768221419048.jpeg",
"task_type": "reading",
"output_format": "TEXT2D",
"text": "HOW TO EXPLAIN AI TERMS\n A PROFESSOR'S VISUAL GUIDE\n\n Artificial Intelligence (AI)\n Machine Learning (ML) Learning from\n data\n Deep Learning (DL) Complex patterns\n\n Neural Networks (NN) Mimics brain structure\n\n Attention Sequence\n Transformers processing\n Attention\n Generative AI (GenAI) Create content\n Creates new content\n Large Language Models (LLMs)\n Vast text data\n Generative Pre-Trained Transformers (GPT)\n Specific application\n pre-trained\n ? Answer ChatGPT\n ChatGPT\n ?\n\n Follow Luis Rodrigues for insights\n\n EXPLO EXPLO EXPLO EXPLO",
"prompt": "Return a layout-sensitive TEXT2D representation of the image."
},
"combined_text": "HOW TO EXPLAIN AI TERMS\n A PROFESSOR'S VISUAL GUIDE\n\n Artificial Intelligence (AI)\n Machine Learning (ML) Learning from\n data\n Deep Learning (DL) Complex patterns\n\n Neural Networks (NN) Mimics brain structure\n\n Attention Sequence\n Transformers processing\n Attention\n Generative AI (GenAI) Create content\n Creates new content\n Large Language Models (LLMs)\n Vast text data\n Generative Pre-Trained Transformers (GPT)\n Specific application\n pre-trained\n ? Answer ChatGPT\n ChatGPT\n ?\n\n Follow Luis Rodrigues for insights\n\n EXPLO EXPLO EXPLO EXPLO",
"metadata": {
"gutenocr_processed": true,
"processing_mode": "gutenocr_only"
},
"success": true
},
{
"file_path": "input/597919025_1354062503415836_5851124507328073158_n.jpg",
"timestamp": "2026-01-29T10:37:06.931793",
"docling_structure": null,
"gutenocr_ocr": {
"success": true,
"image_path": "input/597919025_1354062503415836_5851124507328073158_n.jpg",
"task_type": "reading",
"output_format": "TEXT2D",
"text": "LANG FOCUS Indo-European vocabulary\n\n mādar مادر mother barādar برادر brother\n pedar پدر father dokhtar دختر daughter\n\n dar در door nām نام name\n dandān دندان tooth gāw كاو cow\n related to \"dental\"; \"dent\" in French",
"prompt": "Return a layout-sensitive TEXT2D representation of the image."
},
"combined_text": "LANG FOCUS Indo-European vocabulary\n\n mādar مادر mother barādar برادر brother\n pedar پدر father dokhtar دختر daughter\n\n dar در door nām نام name\n dandān دندان tooth gāw كاو cow\n related to \"dental\"; \"dent\" in French",
"metadata": {
"gutenocr_processed": true,
"processing_mode": "gutenocr_only"
},
"success": true
}
]
[
{
"success": true,
"image_path": "input/1768221419048.jpeg",
"task_type": "reading",
"output_format": "TEXT",
"text": "HOW TO EXPLAIN AI TERMS A PROFESSOR'S VISUAL GUIDE Artificial Intelligence (AI) Learning from Machine Learning (ML) data Complex Deep Learning (DL) patterns Mimics brain Neural Networks (NN) structure Attention Sequence Transformers processing Attention Create Generative AI (GenAI) content Creates new content Large Language Models (LLMs) Vast text data Generative Pre-Trained Transformers (GPT) Specific application pre-trained ? Answer ChatGPT ChatGPT ? Follow Luis Rodrigues for insights EXPLO EXPLO EXPLO EXPLO",
"prompt": "Read all text in the image and return a single TEXT string, linearized left-to-right/top-to-bottom."
},
{
"success": true,
"image_path": "input/597919025_1354062503415836_5851124507328073158_n.jpg",
"task_type": "reading",
"output_format": "TEXT",
"text": "Indo-European vocabulary LANG FOCUS mādar mother barādar brother مادر برادر pedar father dokhtar دختر daughter پدر daughter dar door nām name نام در cow dandān tooth gāw دندان related to \"dental\"; \"dent\" in French",
"prompt": "Read all text in the image and return a single TEXT string, linearized left-to-right/top-to-bottom."
}
]
Source: input/597919025_1354062503415836_5851124507328073158_n.jpg
Processed: 2026-01-29T10:32:18.971569
Task: reading
Format: TEXT
================================================================================
Indo-European vocabulary LANG FOCUS mādar mother barādar brother مادر برادر pedar father dokhtar دختر daughter پدر daughter dar door nām name نام در cow dandān tooth gāw دندان related to "dental"; "dent" in French
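A quick way to sanity-check two output formats against each other is to compare their word counts. The sketch below uses `collections.Counter`; `token_diff` is a hypothetical helper, and the two strings are short excerpts copied from the second image's TEXT and TEXT2D results above.

```python
from collections import Counter


def token_diff(text_a: str, text_b: str) -> Counter:
    """Words that occur more often in text_a than in text_b."""
    return Counter(text_a.split()) - Counter(text_b.split())


# Excerpts from the outputs above (second test image)
text_linear = "pedar father dokhtar دختر daughter پدر daughter"
text_2d = "pedar پدر father dokhtar دختر daughter"

print(token_diff(text_linear, text_2d))  # Counter({'daughter': 1})
```

On these excerpts the only surplus word is the repeated "daughter" in the linearized output. A check like this ignores word order, so it flags duplicated or missing words but not reordering.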
A Bit of a Digression: CPU vs. GPU, Optimizing the Workflow
Running Vision Language Models (VLMs) like GutenOCR is a heavy lift for any system. While Bob’s application is designed to be flexible, your choice of hardware — and how you tune it — will define your experience.
⚡ Performance Comparison
| Feature | **CPU (Standard)** | **GPU (NVIDIA RTX)** |
| --------------- | ----------------------------- | ----------------------------------- |
| **Speed** | 🐢 Slow (Good for single docs) | 🚀 Fast (6x - 10x speedup) |
| **Concurrency** | Sequential processing | Parallel batching (up to 128 pages) |
| **Best For** | Testing & Light Automation | Large-scale Batch Processing |
| **Fidelity**    | Identical to GPU              | Identical to CPU                    |
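The speedup figures in the table will vary with hardware, model size, and image resolution, so it is worth measuring on your own machine. Here is a minimal timing sketch, assuming nothing beyond the standard library: `time_per_item` is a hypothetical helper that accepts any callable, for example a bound `engine.process_image`.

```python
import time
from statistics import mean
from typing import Callable, Dict, Iterable


def time_per_item(process: Callable[[str], object], items: Iterable[str]) -> Dict[str, float]:
    """Run process() on each item and report wall-clock timings in seconds."""
    durations = []
    for item in items:
        start = time.perf_counter()
        process(item)
        durations.append(time.perf_counter() - start)
    if not durations:
        return {"count": 0, "total_s": 0.0, "mean_s": 0.0}
    return {
        "count": len(durations),
        "total_s": sum(durations),
        "mean_s": mean(durations),
    }


# Stand-in workload; replace the lambda with the real OCR call
stats = time_per_item(lambda path: sum(range(10_000)), ["a.png", "b.png"])
print(stats["count"], "items timed")
```

Running the same image set once on CPU and once on GPU and comparing `mean_s` gives a speedup number specific to your setup. The first call usually includes one-off warm-up cost, so it can be worth discarding.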
💡 Ideas for Maximum Efficiency
If you’re finding the process a bit slow on your machine, here is how you can “tune the engine” just like Bob did:
- Choose the Right Model: Use GutenOCR-3B for daily tasks. It requires significantly less RAM (~8GB) and is much snappier than the 7B version without sacrificing too much accuracy for standard fonts.
- Scale Your Images: In the configuration, keep image_scale at 1.0. Increasing it to 2.0 improves quality for tiny text but can double your processing time and memory usage.
- Leverage Batch Processing: If you have an NVIDIA GPU, Bob has enabled vLLM support. This allows the app to “prefill” multiple pages at once, which is a total game-changer for 100+ page PDFs.
- Toggle OCR Wisely: For digital-native PDFs (where you can already select text), you can disable the “Full OCR” mode in the UI to let Docling handle the structure extraction solo — it’s roughly 10x faster!
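For the model-size tip, a back-of-the-envelope estimate helps. The download log earlier shows two weight shards of 5.00 GB and 2.51 GB for GutenOCR-3B. Assuming those weights are stored at 2 bytes per parameter (bfloat16, which is an assumption and not confirmed in this post), loading them as float32, the engine's CPU default, would need roughly twice that for the weights alone. The helper name below is hypothetical.

```python
def weight_memory_gb(checkpoint_gb: float, stored_bytes: int, target_bytes: int) -> float:
    """Scale a checkpoint size by the ratio of target to stored precision."""
    return checkpoint_gb * target_bytes / stored_bytes


checkpoint_gb = 5.00 + 2.51  # shard sizes taken from the download log

print(f"bfloat16: ~{weight_memory_gb(checkpoint_gb, 2, 2):.1f} GB")
print(f"float32:  ~{weight_memory_gb(checkpoint_gb, 2, 4):.1f} GB")
```

This covers weights only; activations and the image tokens add to it, so treat the numbers as a lower bound when deciding between the 3B and 7B models.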
Final Thoughts: The Future of Document AI is Here
This experiment proved one thing: the barrier between “raw document” and “structured data” has officially collapsed. By combining the structural intelligence of Docling with the high-fidelity vision of GutenOCR, we’ve moved past simple text extraction into the realm of true document understanding.
What’s most impressive isn’t just the accuracy — it’s the accessibility. The fact that Bob could bridge these complex technologies into a functional, containerized app in under an hour shows how powerful the modern AI ecosystem has become. We are no longer waiting for “enterprise solutions” to catch up; the tools are in our hands right now.
🚀 Get Involved
I’ve made the repository public so you can take it for a spin yourself. Whether you’re processing a single invoice or a thousand-page archive, this stack is ready to work.
- Check out the Repo: aairom/GutenOCR-Test
- Try the Models: You can find them on the Roots Automation page on Hugging Face, with thanks to the team. 🌞
What are you planning to automate next? If you have a specific document challenge or want to see Bob tackle a different integration, let me know in the comments below!
Links
GutenOCR Online Demonstration: https://ocr.roots.ai/
GutenOCR Repository: https://github.com/Roots-Automation/GutenOCR
Roots Automation on Hugging Face: https://huggingface.co/rootsautomation
GutenOCR 3B on Hugging Face: https://huggingface.co/rootsautomation/GutenOCR-3B
GutenOCR 7B on Hugging Face: https://huggingface.co/rootsautomation/GutenOCR-7B
GutenOCR Paper: https://arxiv.org/abs/2601.14490
Code Repository of this post: https://github.com/aairom/GutenOCR-Test
Docling Repository: https://github.com/docling-project/docling
My buddy Bob: https://www.ibm.com/products/bob 😉








