The v0.1a release of a project I’ve been working on for a while…📍
Intro / TL;DR
We’ve all been there: staring at a pile of messy PDFs, complex XBRL financial reports, and scattered CSVs, wondering how to turn that “dark data” into something an LLM can actually understand. Today, I’m pulling back the curtain on Docling Factory, a project I’ve been refining to solve exactly that.
The purpose of Docling Factory is to be a multimodal document processing pipeline. It transforms unstructured documents (PDFs, DOCX, etc.) into structured formats and provides an interactive chat interface to query those documents using a local or remote LLM.
The aim is a set of tools that is more than just a parser: a full-stack document intelligence pipeline that bridges the gap between raw files and meaningful conversation.
🔩 Key Components
- Web Interface: Built with Gradio, providing a multi-tabbed UI for uploading, parsing, and chatting.
- Parsing Engine (DoclingParser): Leverages the docling library for layout-aware document conversion.
- RAG Engine (RAGEngine): Orchestrates semantic search using OpenSearch as a vector database and Ollama or LiteLLM for embeddings and text generation.
- Observability: Integrated with OpenLLMetry and a custom MetricsCollector to track latency, token usage, and system health.
- LLM Engine: The LLM orchestration layer follows a hybrid approach: local inference is handled by Ollama, while a LiteLLM gateway provides a unified API interface for external model integration. This setup provides the best of both worlds — local execution for standard tasks and an easy ‘escape hatch’ to commercial LLMs via a standardized gateway when higher reasoning capabilities are required.
The Core Engine: Docling & Layout Awareness

At the heart of the system is the DoclingParser. Unlike traditional parsers that treat a page as a flat bag of words, this engine is layout-aware. It identifies headers, tables, and even those tricky images. Using the docling library, the user can convert a complex PDF into clean Markdown while keeping the structure intact.
I’ve baked in support for RapidOCR, EasyOCR, and even macOS Vision. Whether you have a digital-native DOCX or a grainy scan of a 1990s invoice, the parser adapts. It even handles XBRL and CSV files natively, transforming structured data into LLM-friendly formats.
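Before diving into the full module, here is the core idea in miniature. This is only a sketch (the input file name is a placeholder, and it assumes the standard docling API), but it shows how little code the basic conversion needs:

# minimal_conversion_sketch.py (illustrative only)
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("sample.pdf")  # placeholder input file
print(result.document.export_to_markdown())  # layout-aware Markdown out

The DoclingParser module below wraps this basic call with batch processing, OCR engine selection, figure export, and output format options: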
# docling_parser.py
"""
Docling Parser Module - Enhanced Version
Handles document parsing using the Docling library with support for:
- Batch and individual processing
- Multiple output formats (Markdown, HTML, JSON, Multimodal)
- Figure extraction
- Full page OCR with multiple OCR engines
- XBRL document conversion
- CSV file conversion
"""
import os
import logging
from pathlib import Path
from datetime import datetime
from typing import List, Dict, Optional, Union, Callable

from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    EasyOcrOptions,
    RapidOcrOptions,
    TesseractOcrOptions,
    OcrMacOptions
)
from docling_core.types.doc.base import ImageRefMode
from docling_core.types.doc.document import PictureItem, TableItem
import json

#####
#####
#####

def main():
    """Example usage of the enhanced DoclingParser class."""
    parser = DoclingParser(use_gpu=False, output_dir="output")

    # Example: Parse with OCR and figure extraction
    def progress_update(message):
        print(f"Progress: {message}")

    # Batch processing with all features
    def batch_progress(message, current, total):
        percentage = (current / total * 100) if total > 0 else 0
        print(f"[{percentage:.1f}%] {message}")

    results = parser.parse_batch(
        "input",
        output_formats=['markdown', 'html', 'json'],
        export_figures=True,
        export_multimodal=False,
        ocr_engine='easyocr',
        force_ocr=False,
        progress_callback=batch_progress
    )

    # Print summary
    print("\n" + "="*50)
    print("BATCH PROCESSING SUMMARY")
    print("="*50)
    for result in results:
        status_icon = "✓" if result["status"] == "success" else "✗"
        file_name = Path(result['input_file']).name
        if result["status"] == "success":
            formats = ", ".join(result.get('formats', []))
            figures = result.get('figure_count', 0)
            print(f"{status_icon} {file_name}: {formats} ({figures} figures)")
        else:
            print(f"{status_icon} {file_name}: {result.get('error', 'Unknown error')}")
    print("="*50)


if __name__ == "__main__":
    main()
# Made with Bob
Why Multimodal Matters: Documents aren’t just text. They have charts, figures, and diagrams. Docling Factory detects PictureItem elements and extracts them as separate PNGs, or embeds them directly into the Markdown. This means your RAG system doesn't just "read"—it "sees."
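A rough sketch of that figure handling, assuming the standard docling pipeline options (file names here are placeholders, and the loop mirrors docling's documented figure-export pattern rather than the project's exact code):

# figure_extraction_sketch.py (illustrative only)
from pathlib import Path

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling_core.types.doc.document import PictureItem

pipeline_options = PdfPipelineOptions()
pipeline_options.generate_picture_images = True  # keep rendered figures alongside the text

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
result = converter.convert("report.pdf")  # placeholder input

# Save every detected figure as its own PNG
Path("output/figures").mkdir(parents=True, exist_ok=True)
for idx, (item, _level) in enumerate(result.document.iterate_items()):
    if isinstance(item, PictureItem):
        item.get_image(result.document).save(f"output/figures/figure_{idx}.png", "PNG")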
RAG & Vector Storage (rag_engine.py)
How do users actually chat with these documents? I built the RAGEngine to be as flexible as possible. It uses OpenSearch with k-NN (k-Nearest Neighbors) search for lightning-fast vector retrieval. The RAG system enables “Chat with Documents” by indexing parsed text into a vector space.
- Vector Database: Uses OpenSearch with k-NN (k-Nearest Neighbors) search enabled. It uses the lucene engine and cosinesimil space type for vector comparisons.
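For illustration, an index with that configuration could be created roughly like this; the index name, field names, and 384-dimensional size are assumptions based on the embedding models mentioned later, not the project's exact schema:

# knn_index_sketch.py (illustrative only)
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

index_body = {
    "settings": {"index": {"knn": True}},  # enable k-NN search on this index
    "mappings": {
        "properties": {
            "content": {"type": "text"},
            "file_path": {"type": "keyword"},
            "embedding": {
                "type": "knn_vector",
                "dimension": 384,  # e.g. granite-embedding / all-minilm
                "method": {
                    "name": "hnsw",
                    "engine": "lucene",
                    "space_type": "cosinesimil",
                },
            },
        }
    },
}
client.indices.create(index="documents", body=index_body)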
Hybrid Backend
The real magic is the Dual Backend Support. You can run everything locally using Ollama (perfect for privacy-sensitive data) or flip a switch to LiteLLM. LiteLLM acts as a gateway to over 100 providers, letting you use GPT-4, Claude, or Gemini without changing a single line of business logic.
- Ollama: For local execution of models like llama3.2 and granite-embedding.
- LiteLLM: Acts as an AI Gateway to connect to remote providers (OpenAI, Anthropic) or proxied local models.
# LiteLLM Configuration File
# This file configures the LiteLLM AI Gateway for unified access to multiple LLM providers
# Documentation: https://docs.litellm.ai/docs/proxy/configs

model_list:
  # OpenAI Models (requires OPENAI_API_KEY environment variable)
  - model_name: gpt-4
    litellm_params:
      model: gpt-4
      api_key: os.environ/OPENAI_API_KEY # Never hardcode API keys
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: gpt-3.5-turbo
      api_key: os.environ/OPENAI_API_KEY # Never hardcode API keys

  # OpenAI Embeddings
  - model_name: text-embedding-ada-002
    litellm_params:
      model: text-embedding-ada-002
      api_key: os.environ/OPENAI_API_KEY # Never hardcode API keys

  # Anthropic Claude (requires ANTHROPIC_API_KEY environment variable)
  - model_name: claude-3-sonnet
    litellm_params:
      model: claude-3-sonnet-20240229
      api_key: os.environ/ANTHROPIC_API_KEY # Never hardcode API keys
  - model_name: claude-3-opus
    litellm_params:
      model: claude-3-opus-20240229
      api_key: os.environ/ANTHROPIC_API_KEY # Never hardcode API keys

  # Local Ollama models (connect to host Ollama)
  - model_name: ollama/llama3.2
    litellm_params:
      model: ollama/llama3.2:latest
      api_base: http://host.docker.internal:11434
  - model_name: ollama/granite-embedding
    litellm_params:
      model: ollama/granite-embedding:30m
      api_base: http://host.docker.internal:11434

  # Azure OpenAI (requires AZURE_API_KEY, AZURE_API_BASE, AZURE_API_VERSION)
  # - model_name: azure-gpt-4
  #   litellm_params:
  #     model: azure/gpt-4
  #     api_key: os.environ/AZURE_API_KEY # Never hardcode API keys
  #     api_base: os.environ/AZURE_API_BASE
  #     api_version: os.environ/AZURE_API_VERSION

  # Google Vertex AI (requires GOOGLE_APPLICATION_CREDENTIALS)
  # - model_name: gemini-pro
  #   litellm_params:
  #     model: vertex_ai/gemini-pro
  #     vertex_project: os.environ/VERTEX_PROJECT
  #     vertex_location: os.environ/VERTEX_LOCATION

  # AWS Bedrock (requires AWS credentials)
  # - model_name: bedrock-claude
  #   litellm_params:
  #     model: bedrock/anthropic.claude-v2
  #     aws_region_name: us-east-1

# General settings
general_settings:
  master_key: os.environ/LITELLM_MASTER_KEY # REQUIRED: Set via environment variable, never hardcode
  database_url: os.environ/DATABASE_URL # PostgreSQL connection string

  # Logging
  success_callback: ["langfuse"] # Optional: integrate with Langfuse for observability

  # Rate limiting (optional)
  # max_parallel_requests: 100
  # max_budget: 100 # USD

  # Caching (optional)
  # cache: true
  # cache_params:
  #   type: "redis"
  #   host: "redis"
  #   port: 6379

# Router settings for load balancing and fallbacks
router_settings:
  routing_strategy: simple-shuffle # Options: simple-shuffle, least-busy, usage-based-routing
  model_group_alias:
    gpt-4-group:
      - gpt-4
      - azure-gpt-4 # Fallback to Azure if OpenAI fails

  # Retry settings
  num_retries: 2
  timeout: 600 # seconds

  # Fallback models
  fallbacks:
    - gpt-4: ["gpt-3.5-turbo"]
    - claude-3-opus: ["claude-3-sonnet"]

# Environment variables required:
# - LITELLM_MASTER_KEY: Master key for API authentication
# - DATABASE_URL: PostgreSQL connection string
# - OPENAI_API_KEY: For OpenAI models (optional)
# - ANTHROPIC_API_KEY: For Claude models (optional)
# - AZURE_API_KEY, AZURE_API_BASE, AZURE_API_VERSION: For Azure OpenAI (optional)
# - GOOGLE_APPLICATION_CREDENTIALS: For Google Vertex AI (optional)
# - AWS credentials: For AWS Bedrock (optional)

# Made with Bob
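Because the gateway speaks the OpenAI-compatible API, swapping a local Ollama model for a commercial one is just a model-name change. A hedged sketch of that unified call (the endpoint and key below are illustrative defaults, not values you should ship):

# gateway_call_sketch.py (illustrative only)
import os

from openai import OpenAI  # the LiteLLM proxy exposes an OpenAI-compatible endpoint

client = OpenAI(
    base_url=os.getenv("LITELLM_API_BASE", "http://localhost:4000"),
    api_key=os.getenv("LITELLM_API_KEY", "sk-1234"),  # never hardcode real keys
)

for model in ("ollama/llama3.2", "gpt-4"):  # local vs. remote, same client, same call
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Summarize the indexed report."}],
    )
    print(model, "->", reply.choices[0].message.content)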
- Chunking: Employs RecursiveCharacterTextSplitter with a chunk size of 500 characters and a 100-character overlap to ensure context is preserved without exceeding model limits.
# rag_engine.py
"""
RAG Engine Module
Implements Retrieval-Augmented Generation using OpenSearch and Ollama.
Supports document indexing, semantic search, and chat with documents.
"""
import os
import logging
from pathlib import Path
from typing import List, Dict, Optional, Tuple
import json
from datetime import datetime

# OpenSearch
from opensearchpy import OpenSearch, helpers
from opensearchpy.exceptions import NotFoundError

# LLM Clients
import ollama
from litellm import completion, embedding
import litellm

# LangChain for RAG orchestration
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

# OpenLLMetry for observability
from traceloop.sdk import Traceloop
from traceloop.sdk.decorators import workflow, task

# OpenTelemetry for metrics
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Metrics collector
from metrics_collector import MetricsCollector, initialize_metrics_collector, get_metrics_collector

#####
#####
#####

        # Initialize LLM components based on configuration
        if use_litellm:
            logger.info("Using LiteLLM for embeddings and LLM")
            self.embeddings = LiteLLMEmbeddings(
                model=embedding_model,
                api_base=litellm_api_base,
                api_key=litellm_api_key
            )
            self.llm = LiteLLMLLM(
                model=llm_model,
                api_base=litellm_api_base,
                api_key=litellm_api_key
            )
        else:
            logger.info("Using Ollama for embeddings and LLM")
            self.embeddings = OllamaEmbeddings(model=embedding_model, base_url=ollama_base_url)
            self.llm = OllamaLLM(model=llm_model, base_url=ollama_base_url)

        # Text splitter for chunking
        # Reduced chunk size to avoid exceeding embedding model context length
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=500,  # Reduced from 1000 to fit within model context
            chunk_overlap=100,  # Reduced overlap proportionally
            length_function=len,
        )
A Word on The RAG Database: OpenSearch RAG
The OpenSearch implementation serves as the high-performance backbone of the Docling Factory’s RAG engine, specifically configured as a vector database to handle complex semantic retrieval. Within the RAGEngine class, the system initializes an OpenSearch client that utilizes k-NN (k-Nearest Neighbors) search, leveraging the lucene engine and cosinesimil space type to measure the similarity between document chunks and user queries. During the indexing phase, the engine converts Markdown chunks into vectors—typically using a 384-dimensional model—and performs bulk uploads into a dedicated index where index.knn is set to true. For retrieval, the search method transforms the user’s prompt into an embedding and queries OpenSearch to return the top k most relevant context fragments based on their vector score, which are then used to augment the LLM's final response.
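In query form, that retrieval step boils down to a k-NN search along these lines; the index and field names here are assumptions rather than the project's exact schema, and the vector would normally be the embedding of the user's prompt:

# knn_query_sketch.py (illustrative only)
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
query_vector = [0.1] * 384  # in practice: the embedded user prompt

response = client.search(
    index="documents",  # assumed index name
    body={
        "size": 5,  # top_k context fragments
        "query": {"knn": {"embedding": {"vector": query_vector, "k": 5}}},
        "_source": ["content", "file_path"],
    },
)
contexts = [hit["_source"]["content"] for hit in response["hits"]["hits"]]
print(contexts)  # fed into the LLM prompt as context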
Observability: Seeing the Unseen
“Black box” AI is a no-go for production. That’s why I integrated OpenLLMetry. Through our MetricsCollector, the application tracks everything: latency, token usage, and even cost estimation. If a model starts acting up or a prompt is getting too expensive, you’ll see it on the Plotly-powered dashboard before it becomes a problem.
The MetricsCollector uses OpenTelemetry to intercept spans from LLM calls. The standalone_dashboard.py script can then generate a visual representation of this data using Plotly, showing error rates, token costs, and P95/P99 latency percentiles.
# metrics_collector.py
"""
OpenLLMetry Metrics Collector
Collects and aggregates metrics from OpenTelemetry traces for dashboard display.
"""
import logging
from typing import Dict, List, Optional
from datetime import datetime, timedelta
from collections import defaultdict
import threading
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider, ReadableSpan
from opentelemetry.sdk.trace.export import SpanExporter, SpanExportResult
from opentelemetry.sdk.resources import Resource

logger = logging.getLogger(__name__)


class MetricsCollector(SpanExporter):
    """
    Custom span exporter that collects metrics from OpenTelemetry spans.
    Stores metrics in memory for dashboard display.
    """

    def __init__(self, max_history: int = 1000):
        """
        Initialize metrics collector.

        Args:
            max_history: Maximum number of spans to keep in history
        """
        self.max_history = max_history
        self.spans: List[Dict] = []
        self.metrics: Dict = {
            "total_requests": 0,
            "total_tokens": 0,
            "total_latency_ms": 0,
            "error_count": 0,
            "operations": defaultdict(int),
            "models_used": defaultdict(int),
            "hourly_requests": defaultdict(int),
            "latency_by_operation": defaultdict(list),
        }
        self.lock = threading.Lock()
        logger.info("MetricsCollector initialized")

######
######
######
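To make the P95/P99 numbers concrete: the collector keeps raw latency samples per operation (the latency_by_operation lists above), and the dashboard only needs a small aggregation step over them. A minimal sketch of that step, not the project's exact implementation:

# latency_percentiles_sketch.py (illustrative only)
import math
from typing import Dict, List

def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a list of latency samples (in ms)."""
    if not values:
        return 0.0
    ordered = sorted(values)
    rank = max(0, math.ceil(pct / 100 * len(ordered)) - 1)
    return ordered[rank]

def summarize(latency_by_operation: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Collapse raw latency samples into the P95/P99 figures a dashboard plots."""
    return {
        op: {"p95": percentile(samples, 95), "p99": percentile(samples, 99)}
        for op, samples in latency_by_operation.items()
    }

print(summarize({"rag.search": [120.0, 95.0, 310.0, 180.0], "llm.generate": [900.0, 1500.0]}))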
Ensuring Reliability: The Docling Factory Test Suite
Building a complex document processing pipeline requires more than just functional code; it requires a rigorous testing framework to ensure that every component — from OCR to vector search — performs predictably under pressure. I’ve implemented a comprehensive suite of unit tests to validate the core logic of the factory.
Document Parsing Validation
The reliability of our ingestion starts with test_docling_parser.py. This test suite ensures that the DoclingParser correctly identifies and processes various file formats and OCR requirements:
- Initialization & Formats: Verifies that the parser correctly initializes with or without GPU support and correctly identifies supported input extensions like .pdf, .docx, and .txt.
- OCR Logic: Tests the validation of different OCR engines (RapidOCR, EasyOCR, etc.) and ensures that the system fails over to a default engine if an invalid one is specified.
- Functional Success: Uses mocking of the DocumentConverter to simulate successful document exports to Markdown and JSON without requiring a full local setup (a sketch of this approach follows the list).
- Error Handling: Validates that the system gracefully handles “file not found” scenarios and unsupported file types.
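As an illustration of that mocking approach (a sketch in the spirit of the suite, not the actual test file; the parse_file method name and result shape are assumptions based on the main() example shown earlier):

# test_docling_parser_sketch.py (illustrative only)
import unittest
from unittest.mock import MagicMock, patch

class TestDoclingParserSketch(unittest.TestCase):

    @patch("docling_parser.DocumentConverter")
    def test_markdown_export_without_full_setup(self, mock_converter):
        """Simulate a successful Markdown export by mocking the heavy converter."""
        from docling_parser import DoclingParser

        mock_result = MagicMock()
        mock_result.document.export_to_markdown.return_value = "# Mocked document"
        mock_converter.return_value.convert.return_value = mock_result

        parser = DoclingParser(use_gpu=False, output_dir="output")
        # Hypothetical per-file method name; the real suite exercises the actual API
        result = parser.parse_file("sample.pdf", output_formats=["markdown"])

        self.assertEqual(result["status"], "success")

if __name__ == "__main__":
    unittest.main()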
RAG & Embedding Accuracy
To ensure our “chat with documents” feature doesn’t hallucinate or fail, test_rag_engine.py provides targeted testing for the retrieval pipeline:
- Connectivity Checks: Includes health checks to verify that both the OpenSearch cluster and the Ollama model server are reachable.
- Indexing Pipeline: Tests the workflow of splitting text into chunks and generating 384-dimensional embeddings (standard for models like all-minilm).
- Semantic Search: Simulates search queries to verify that OpenSearch returns the expected hits and scores.
- Embedding & LLM Classes: Provides isolated unit tests for OllamaEmbeddings and OllamaLLM to confirm they correctly format prompts and return responses with the right temperature settings.
# test_rag_engine.py
"""
Unit tests for rag_engine.py module
"""
import unittest
import os
import tempfile
from unittest.mock import Mock, patch, MagicMock
import sys

# Add parent directory to path
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))


class TestRAGEngine(unittest.TestCase):
    """Test cases for RAGEngine class"""

    def setUp(self):
        """Set up test fixtures"""
        self.test_dir = tempfile.mkdtemp()

    def tearDown(self):
        """Clean up test fixtures"""
        import shutil
        if os.path.exists(self.test_dir):
            shutil.rmtree(self.test_dir)

    @patch('rag_engine.OpenSearch')
    @patch('rag_engine.ollama')
    def test_initialization(self, mock_ollama, mock_opensearch):
        """Test RAG engine initialization"""
        from rag_engine import RAGEngine

        rag = RAGEngine(
            opensearch_host="localhost",
            opensearch_port=9200,
            embedding_model="test-embedding",
            llm_model="test-llm",
            enable_tracing=False
        )

        self.assertIsNotNone(rag)
        self.assertEqual(rag.embedding_model, "test-embedding")
        self.assertEqual(rag.llm_model, "test-llm")

    @patch('rag_engine.OpenSearch')
    @patch('rag_engine.ollama')
    def test_health_check(self, mock_ollama, mock_opensearch):
        """Test health check functionality"""
        from rag_engine import RAGEngine

        # Mock OpenSearch client
        mock_client = MagicMock()
        mock_client.ping.return_value = True
        mock_opensearch.return_value = mock_client

        # Mock Ollama client
        mock_ollama_client = MagicMock()
        mock_response = MagicMock()
        mock_response.models = []
        mock_ollama_client.list.return_value = mock_response
        mock_ollama.Client.return_value = mock_ollama_client

        rag = RAGEngine(enable_tracing=False)
        health = rag.health_check()

        self.assertIsInstance(health, dict)
        self.assertIn('opensearch', health)

    @patch('rag_engine.OpenSearch')
    @patch('rag_engine.ollama')
    def test_index_document(self, mock_ollama, mock_opensearch):
        """Test document indexing"""
        from rag_engine import RAGEngine

        # Mock OpenSearch
        mock_client = MagicMock()
        mock_client.indices.exists.return_value = True
        mock_opensearch.return_value = mock_client

        # Mock Ollama embeddings
        mock_ollama.embeddings.return_value = {
            'embedding': [0.1] * 384
        }

        rag = RAGEngine(enable_tracing=False)
        result = rag.index_document(
            file_path="test.pdf",
            content="Test content for indexing",
            metadata={"source": "test"}
        )

        self.assertIsInstance(result, dict)
        self.assertIn('chunks_indexed', result)

    @patch('rag_engine.OpenSearch')
    @patch('rag_engine.ollama')
    def test_search(self, mock_ollama, mock_opensearch):
        """Test semantic search"""
        from rag_engine import RAGEngine

        # Mock OpenSearch search response
        mock_client = MagicMock()
        mock_client.search.return_value = {
            'hits': {
                'hits': [
                    {
                        '_source': {
                            'content': 'Test result',
                            'file_path': 'test.pdf'
                        },
                        '_score': 0.9
                    }
                ]
            }
        }
        mock_opensearch.return_value = mock_client

        # Mock Ollama embeddings
        mock_ollama.embeddings.return_value = {
            'embedding': [0.1] * 384
        }

        rag = RAGEngine(enable_tracing=False)
        results = rag.search("test query", top_k=5)

        self.assertIsInstance(results, list)

    @patch('rag_engine.OpenSearch')
    @patch('rag_engine.ollama')
    def test_get_stats(self, mock_ollama, mock_opensearch):
        """Test getting index statistics"""
        from rag_engine import RAGEngine

        # Mock OpenSearch stats
        mock_client = MagicMock()
        mock_client.count.return_value = {'count': 100}
        mock_client.indices.stats.return_value = {
            'indices': {
                'documents': {
                    'primaries': {
                        'store': {'size_in_bytes': 1024000}
                    }
                }
            }
        }
        mock_opensearch.return_value = mock_client

        rag = RAGEngine(enable_tracing=False)
        stats = rag.get_stats()

        self.assertIsInstance(stats, dict)
        self.assertIn('total_chunks', stats)


class TestOllamaEmbeddings(unittest.TestCase):
    """Test cases for OllamaEmbeddings class"""

    @patch('rag_engine.ollama')
    def test_embed_documents(self, mock_ollama):
        """Test embedding multiple documents"""
        from rag_engine import OllamaEmbeddings

        mock_ollama.embeddings.return_value = {
            'embedding': [0.1] * 384
        }

        embedder = OllamaEmbeddings(model="test-model")
        texts = ["text1", "text2", "text3"]
        embeddings = embedder.embed_documents(texts)

        self.assertIsInstance(embeddings, list)
        self.assertEqual(len(embeddings), 3)

    @patch('rag_engine.ollama')
    def test_embed_query(self, mock_ollama):
        """Test embedding a single query"""
        from rag_engine import OllamaEmbeddings

        mock_ollama.embeddings.return_value = {
            'embedding': [0.1] * 384
        }

        embedder = OllamaEmbeddings(model="test-model")
        embedding = embedder.embed_query("test query")

        self.assertIsInstance(embedding, list)
        self.assertEqual(len(embedding), 384)


class TestOllamaLLM(unittest.TestCase):
    """Test cases for OllamaLLM class"""

    @patch('rag_engine.ollama')
    def test_generate(self, mock_ollama):
        """Test LLM text generation"""
        from rag_engine import OllamaLLM

        mock_ollama.generate.return_value = {
            'response': 'Generated response'
        }

        llm = OllamaLLM(model="test-model")
        response = llm.generate("test prompt")

        self.assertIsInstance(response, str)
        self.assertEqual(response, 'Generated response')

    @patch('rag_engine.ollama')
    def test_generate_with_temperature(self, mock_ollama):
        """Test LLM generation with temperature parameter"""
        from rag_engine import OllamaLLM

        mock_ollama.generate.return_value = {
            'response': 'Generated response'
        }

        llm = OllamaLLM(model="test-model", temperature=0.7)
        response = llm.generate("test prompt")

        self.assertIsInstance(response, str)


if __name__ == '__main__':
    unittest.main()
# Made with Bob
Metrics and Observability
Finally, the test_metrics_collector.py suite ensures that our observability dashboard isn't just showing pretty numbers, but accurate data:
- Span Management: Validates the export of OpenTelemetry spans, ensuring that operation names, start times, and end times are captured correctly.
- Aggregation Logic: Tests the collector’s ability to sum up token usage (prompt vs. completion) and calculate error counts across different operations.
- Data Structures: Verifies that time-series data for hourly requests and latency percentiles are generated in the correct format for the Plotly dashboard.
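A flavor of what those aggregation checks look like, sketched against the MetricsCollector fields shown earlier (the counter updates below are hand-rolled for illustration; the real suite feeds actual spans through the exporter):

# test_metrics_collector_sketch.py (illustrative only)
import unittest

from metrics_collector import MetricsCollector

class TestMetricsAggregationSketch(unittest.TestCase):

    def test_token_and_error_counters(self):
        """Verify that token, error, and per-operation counters add up."""
        collector = MetricsCollector(max_history=10)

        # Counters taken from the metrics dict shown in the class above
        collector.metrics["total_tokens"] += 150  # prompt + completion tokens
        collector.metrics["error_count"] += 1
        collector.metrics["operations"]["rag.search"] += 1

        self.assertEqual(collector.metrics["total_tokens"], 150)
        self.assertEqual(collector.metrics["error_count"], 1)
        self.assertEqual(collector.metrics["operations"]["rag.search"], 1)

if __name__ == "__main__":
    unittest.main()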
Maintaining these tests ensures that as new features are added (like new OCR backends or remote LLM providers via LiteLLM), the core foundation of the Docling Factory remains rock solid.
Flexible Processing: From Single Files to Batch Operations
The Docling Factory is engineered for maximum versatility, allowing users to choose between processing individual documents or executing high-volume batch operations. For targeted tasks, the parse_single_file function handles specific uploads via the Gradio interface, while the parse_batch method in the DoclingParser class enables the system to traverse entire directories and convert multiple files in a single pass. Whether you are indexing a single technical manual or a massive archive of financial reports, the architecture scales to meet the demand, ensuring consistent output across all supported formats like Markdown, JSON, and HTML.
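In code, the two entry points differ only in what you hand them. A hedged sketch (paths are placeholders; parse_batch matches the main() example earlier, while the per-file method name is an assumption for illustration, since parse_single_file lives in the Gradio layer):

# processing_modes_sketch.py (illustrative only)
from docling_parser import DoclingParser

parser = DoclingParser(use_gpu=False, output_dir="output")

# Batch mode: walk a whole directory of mixed documents
batch_results = parser.parse_batch("input", output_formats=["markdown", "json"])

# Single-file mode: one upload from the Gradio tab funnels into the same parser
single_result = parser.parse_file("input/annual_report.pdf", output_formats=["markdown"])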
CPU or GPU?
To accommodate varying hardware environments, the application is designed to operate seamlessly with or without GPU acceleration. The DoclingParser class includes a use_gpu initialization parameter that toggles the processing backend, and the deployment infrastructure provides dedicated Docker images for both CPU-only and GPU-enabled configurations. This flexibility ensures that the factory can be deployed on standard local machines using CPU-based OCR engines like RapidOCR, or on high-performance servers leveraging NVIDIA GPUs to significantly accelerate the layout analysis and document conversion pipeline.
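A small sketch of how that toggle can be wired up at call time; the torch check here illustrates the idea rather than necessarily matching how the project detects hardware:

# device_toggle_sketch.py (illustrative only)
from docling_parser import DoclingParser

try:
    import torch
    gpu_available = torch.cuda.is_available()
except ImportError:
    gpu_available = False

# use_gpu selects the accelerated pipeline when a CUDA device is present
# and falls back to CPU-friendly OCR engines such as RapidOCR otherwise.
parser = DoclingParser(use_gpu=gpu_available, output_dir="output")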
Cloud Deployment
The Docling Factory features a production-ready cloud deployment architecture built on Kubernetes, supporting highly available and scalable environments. The infrastructure is designed with a multi-container strategy, utilizing Docker for both CPU-optimized and GPU-accelerated versions of the application. To ensure operational stability, the deployment includes:
- Orchestration and Scaling: Kubernetes deployments manage multiple replicas of the application, with specific resource limits (up to 8Gi memory and 4000m CPU for GPU nodes) and health checks for both liveness and readiness.
- Infrastructure Components: Integrated services include OpenSearch for vector storage, a dedicated LiteLLM gateway with a PostgreSQL backend for managed LLM access, and an Ingress controller for secure external traffic management via TLS.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: opensearch
  namespace: docling-factory
  labels:
    app: opensearch
spec:
  replicas: 1
  selector:
    matchLabels:
      app: opensearch
  template:
    metadata:
      labels:
        app: opensearch
    spec:
      containers:
        - name: opensearch
          image: opensearchproject/opensearch:2.11.0
          ports:
            - containerPort: 9200
              name: http
            - containerPort: 9600
              name: performance
          env:
            - name: discovery.type
              value: "single-node"
            - name: OPENSEARCH_JAVA_OPTS
              value: "-Xms512m -Xmx512m"
            - name: DISABLE_SECURITY_PLUGIN
              value: "true"
            - name: DISABLE_INSTALL_DEMO_CONFIG
              value: "true"
          resources:
            requests:
              memory: "1Gi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "1000m"
          volumeMounts:
            - name: opensearch-data
              mountPath: /usr/share/opensearch/data
          livenessProbe:
            httpGet:
              path: /_cluster/health
              port: 9200
            initialDelaySeconds: 60
            periodSeconds: 30
            timeoutSeconds: 10
          readinessProbe:
            httpGet:
              path: /_cluster/health
              port: 9200
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
      volumes:
        - name: opensearch-data
          persistentVolumeClaim:
            claimName: opensearch-data-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: opensearch-service
  namespace: docling-factory
  labels:
    app: opensearch
spec:
  type: ClusterIP
  ports:
    - port: 9200
      targetPort: 9200
      protocol: TCP
      name: http
    - port: 9600
      targetPort: 9600
      protocol: TCP
      name: performance
  selector:
    app: opensearch
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: opensearch-data-pvc
  namespace: docling-factory
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: standard
# Made with Bob
- Persistent Storage: Robust data management is handled through several Persistent Volume Claims (PVCs), providing up to 50Gi for document output and separate storage for input and logs.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: docling-factory-input-pvc
  namespace: docling-factory
  labels:
    app: docling-factory
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
  storageClassName: standard
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: docling-factory-output-pvc
  namespace: docling-factory
  labels:
    app: docling-factory
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 50Gi
  storageClassName: standard
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: docling-factory-logs-pvc
  namespace: docling-factory
  labels:
    app: docling-factory
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  storageClassName: standard
# Made with Bob
- GPU Integration: Specialized Kubernetes manifests utilize nodeSelector and nvidia.com/gpu resource requests to target nodes with NVIDIA hardware, ensuring high-performance document parsing and layout analysis in cloud environments.
- CPU Deployment
# Docling Factory - CPU Version (Optimized Build)
# Multi-stage build with better caching and faster builds
FROM python:3.11-slim as builder
# Set working directory
WORKDIR /app
# Install system dependencies in one layer
RUN apt-get update && apt-get install -y --no-install-recommends \
build-essential \
git \
curl \
&& rm -rf /var/lib/apt/lists/*
# Copy only requirements first for better caching
COPY requirements.txt .
# Install dependencies in stages for better caching
# Stage 1: Install lightweight dependencies first
RUN pip install --no-cache-dir --user \
    "numpy>=1.24.0,<2.0.0" \
    "pandas>=2.0.0" \
    "pillow>=10.0.0" \
    "python-docx>=1.1.0" \
    "PyPDF2>=3.0.0" \
    "python-dateutil>=2.8.2" \
    "pathlib>=1.0.1" \
    "lxml>=4.9.0" \
    "openpyxl>=3.1.0"
# Stage 2: Install medium-weight dependencies
RUN pip install --no-cache-dir --user \
    "gradio>=4.0.0" \
    "streamlit>=1.31.0" \
    "plotly>=5.18.0" \
    "httpx>=0.24.0" \
    "tiktoken>=0.5.0"
# Stage 3: Install OpenSearch and LangChain
RUN pip install --no-cache-dir --user \
    "opensearch-py>=2.4.0" \
    "langchain>=0.1.0" \
    "langchain-community>=0.0.20"
# Stage 4: Install LLM clients (lightweight)
RUN pip install --no-cache-dir --user \
    "ollama>=0.1.0" \
    "litellm>=1.0.0"
# Stage 5: Install OpenTelemetry (lightweight)
RUN pip install --no-cache-dir --user \
    "opentelemetry-api>=1.20.0" \
    "opentelemetry-sdk>=1.20.0" \
    "opentelemetry-instrumentation>=0.41b0" \
    "traceloop-sdk>=0.30.0"
# Stage 6: Install Docling (can be slow)
RUN pip install --no-cache-dir --user \
    "docling>=2.0.0" \
    "docling-core>=2.0.0" \
    "docling-parse>=2.0.0" \
    "python-xbrl>=1.1.1"
# Stage 7: Install CV dependencies (slowest - do last)
RUN pip install --no-cache-dir --user \
    "opencv-python-headless>=4.8.0,<5.0.0" \
    "scikit-image>=0.21.0" \
    "pytesseract>=0.3.10"
# Stage 8: Install PyTorch (very slow - separate for better error handling)
RUN pip install --no-cache-dir --user \
    "torch>=2.0.0" \
    "torchvision>=0.15.0" \
    || echo "PyTorch installation failed, continuing..."
# Stage 9: Install ML dependencies (slow)
RUN pip install --no-cache-dir --user \
    "sentence-transformers>=2.2.0" \
    "faiss-cpu>=1.7.4" \
    "chromadb>=0.4.0" \
    || echo "ML dependencies installation failed, continuing..."
# Stage 10: Install EasyOCR last (slowest)
RUN pip install --no-cache-dir --user \
    "easyocr>=1.7.0" \
    || echo "EasyOCR installation failed, continuing..."
# Final stage
FROM python:3.11-slim
# Set working directory
WORKDIR /app
# Install runtime dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
libgomp1 \
libglib2.0-0 \
libsm6 \
libxext6 \
libxrender-dev \
libgl1-mesa-glx \
&& rm -rf /var/lib/apt/lists/*
# Copy Python packages from builder
COPY --from=builder /root/.local /root/.local
# Make sure scripts in .local are usable
ENV PATH=/root/.local/bin:$PATH
# Copy application files
COPY docling_parser.py .
COPY app_enhanced.py .
COPY rag_engine.py .
COPY metrics_collector.py .
COPY standalone_dashboard.py .
COPY metrics_dashboard.py .
# Create necessary directories
RUN mkdir -p input output output/figures logs
# Expose Gradio port
EXPOSE 7860
# Set environment variables
ENV GRADIO_SERVER_NAME=0.0.0.0
ENV GRADIO_SERVER_PORT=7860
ENV PYTHONUNBUFFERED=1
ENV OPENSEARCH_HOST=opensearch
ENV OPENSEARCH_PORT=9200
ENV OLLAMA_BASE_URL=http://host.docker.internal:11434
ENV LITELLM_API_BASE=http://litellm:4000
ENV LITELLM_API_KEY=sk-1234
# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=40s --retries=3 \
CMD python -c "import requests; requests.get('http://localhost:7860')" || exit 1
# Run the application
CMD ["python", "app_enhanced.py"]
- GPU Deployment
# Docling Factory - GPU Version with LiteLLM Support
# Multi-stage build for optimized image size with CUDA support
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04 as builder
# Set working directory
WORKDIR /app
# Install Python and system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
python3.11 \
python3.11-dev \
python3-pip \
build-essential \
git \
&& rm -rf /var/lib/apt/lists/*
# Create symbolic links for python
RUN ln -sf /usr/bin/python3.11 /usr/bin/python && \
ln -sf /usr/bin/python3.11 /usr/bin/python3
# Copy requirements file
COPY requirements-gpu.txt .
# Install Python dependencies
RUN pip install --no-cache-dir --user -r requirements-gpu.txt
# Final stage
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04
# Set working directory
WORKDIR /app
# Install Python runtime
RUN apt-get update && apt-get install -y --no-install-recommends \
python3.11 \
python3-pip \
libgomp1 \
&& rm -rf /var/lib/apt/lists/*
# Create symbolic links for python
RUN ln -sf /usr/bin/python3.11 /usr/bin/python && \
ln -sf /usr/bin/python3.11 /usr/bin/python3
# Copy Python packages from builder
COPY --from=builder /root/.local /root/.local
# Make sure scripts in .local are usable
ENV PATH=/root/.local/bin:$PATH
# Copy application files
COPY docling_parser.py .
COPY app_enhanced.py .
COPY rag_engine.py .
COPY metrics_collector.py .
COPY standalone_dashboard.py .
COPY metrics_dashboard.py .
# Create necessary directories
RUN mkdir -p input output output/figures logs
# Expose Gradio port
EXPOSE 7860
# Set environment variables
ENV GRADIO_SERVER_NAME=0.0.0.0
ENV GRADIO_SERVER_PORT=7860
ENV PYTHONUNBUFFERED=1
ENV CUDA_VISIBLE_DEVICES=0
ENV OPENSEARCH_HOST=opensearch
ENV OPENSEARCH_PORT=9200
ENV OLLAMA_BASE_URL=http://host.docker.internal:11434
ENV LITELLM_API_BASE=http://litellm:4000
ENV LITELLM_API_KEY=sk-1234
# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=40s --retries=3 \
CMD python -c "import requests; requests.get('http://localhost:7860')" || exit 1
# Run the application
CMD ["python", "app_enhanced.py"]
apiVersion: apps/v1
kind: Deployment
metadata:
  name: docling-factory-gpu
  namespace: docling-factory
  labels:
    app: docling-factory
    version: gpu
spec:
  replicas: 1
  selector:
    matchLabels:
      app: docling-factory
      version: gpu
  template:
    metadata:
      labels:
        app: docling-factory
        version: gpu
    spec:
      containers:
        - name: docling-factory
          image: docling-factory:gpu-latest
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 7860
              name: http
              protocol: TCP
          envFrom:
            - configMapRef:
                name: docling-factory-config
          env:
            - name: CUDA_VISIBLE_DEVICES
              value: "0"
          resources:
            requests:
              memory: "4Gi"
              cpu: "2000m"
              nvidia.com/gpu: 1
            limits:
              memory: "8Gi"
              cpu: "4000m"
              nvidia.com/gpu: 1
          volumeMounts:
            - name: input-storage
              mountPath: /app/input
            - name: output-storage
              mountPath: /app/output
            - name: logs-storage
              mountPath: /app/logs
          livenessProbe:
            httpGet:
              path: /
              port: 7860
            initialDelaySeconds: 60
            periodSeconds: 30
            timeoutSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /
              port: 7860
            initialDelaySeconds: 30
            periodSeconds: 10
            timeoutSeconds: 5
            failureThreshold: 3
      volumes:
        - name: input-storage
          persistentVolumeClaim:
            claimName: docling-factory-input-pvc
        - name: output-storage
          persistentVolumeClaim:
            claimName: docling-factory-output-pvc
        - name: logs-storage
          persistentVolumeClaim:
            claimName: docling-factory-logs-pvc
      nodeSelector:
        accelerator: nvidia-gpu
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
# Made with Bob
The “End of Level 1” Log
So, here I am — about two months into this rabbit hole, and I’m still standing, mostly thanks to the digital wizardry of IBM Bob. Is the Docling Factory “finished”? In the words of every dev ever: Hell no. I’m currently in that glorious “it works on my machine” phase, but the quest log is still overflowing. I’ve got to stress-test the GPU until the fans scream, feed it files so large they have their own gravity, and ensure the unit tests cover more than just the “happy path”. I’m also dreaming of a universal Cloud Object Storage (COS) implementation to make file repo handling truly elite. This is just a fragment of the “to-do” list currently living rent-free in my brain. The code is out there on GitHub, and frankly, I’d love some co-op players for this raid — carrying this entire multimodal stack on my own two shoulders is getting a bit “Atlas-tier” heavy. Pull requests are open, the coffee is brewing, and the evolution has only just begun!
Thanks for reading 🤠
Links
- GitHub Code Repository: https://github.com/aairom/docling-factory
- Docling Project: https://docling-project.github.io/docling/
- OpenSearch: https://opensearch.org/
- OpenLLMetry: https://github.com/traceloop/openllmetry
- LiteLLM: https://www.litellm.ai/
- Ollama: https://ollama.com/
- IBM Bob SDLC: https://bob.ibm.com/










