Beyond the Text: Wiring Up My First Docling-Graph Application with Bob
Introducing Docling-Graph

Docling-Graph turns documents into validated Pydantic objects, then builds a directed knowledge graph with explicit semantic relationships.
This transformation enables high-precision use cases in chemistry, finance, and legal domains, where AI must capture exact entity connections (compounds and reactions, instruments and dependencies, properties and measurements) rather than rely on approximate text embeddings.
This toolkit supports two extraction paths: local VLM extraction via Docling, and LLM-based extraction routed through LiteLLM for local runtimes (vLLM, Ollama) and API providers (Mistral, OpenAI, Gemini, IBM watsonx), all orchestrated through a flexible, config-driven pipeline.
Key Capabilities
- Input formats: Docling's supported inputs: PDF, images, markdown, Office, HTML, and more.
- Extraction: LLM or VLM backends, with chunking and processing modes.
- Graphs: Pydantic → NetworkX directed graphs with stable IDs and edge metadata.
- Export: CSV, Cypher, and other KG-friendly formats.
- Visualization: Interactive HTML and Markdown reports.
- Multi-pass extraction: Delta and staged contracts (experimental).
- Structured extraction: LLM output is schema-enforced by default; see CLI and API to disable.
- LiteLLM: Single interface for vLLM, OpenAI, Mistral, WatsonX, and more.
- Trace capture: Debug exports for extraction and fallback diagnostics.
And Coming Soon…
- Interactive Template Builder: Guided workflows for building Pydantic templates.
- Ontology-Based Templates: Match content to the best Pydantic template using semantic similarity.
- Graph Database Integration: Export data straight into Neo4j, ArangoDB, and similar databases.
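To make the "Pydantic → NetworkX" idea concrete, here is a minimal, library-free sketch of the shape such a graph takes: plain dicts stand in for validated Pydantic objects, and node/edge records stand in for a NetworkX DiGraph. All names and IDs are illustrative, not the library's actual output format.

```python
# Hypothetical extracted objects (plain dicts standing in for Pydantic models)
research = {"id": "research:vibrated-granular", "title": "Vibrated granular media"}
experiment = {"id": "experiment:exp-1", "objective": "Measure effective viscosity"}

# Directed graph as node and edge records: stable string IDs for nodes,
# and a metadata dict on each edge carrying its semantic label
nodes = {research["id"]: research, experiment["id"]: experiment}
edges = [(research["id"], experiment["id"], {"label": "HAS_EXPERIMENT"})]
```

The point is that relationships are explicit, labeled, and traversable, rather than implied by proximity in text.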
My Implementation of docling-graph: First Step
To build a comprehensive proof of concept, I used Bob to synthesize the Docling-Graph documentation and sample code into a working implementation. This application isn't a production-ready solution yet; rather, it's a foundational prototype designed to explore the library's technical capabilities. My goal was to see how graph-based document parsing holds up in a real-world environment, laying the groundwork for future business applications.
Out of the various examples provided in the official repository, I selected the following implementation as the foundation for my application. It serves as the perfect blueprint for demonstrating how Docling-Graph maps document structures into a navigable, programmatic format.
"""
Example 02: Quickstart - LLM Extraction from PDF
Description:
Basic LLM extraction from a multi-page rheology research PDF using a remote API.
Demonstrates the standard workflow for text-heavy documents with automatic chunking.
Use Cases:
- Rheology researchs and academic documents
- Technical reports and whitepapers
- Multi-page business documents
- Any text-heavy PDF content
Prerequisites:
- Installation: uv sync
- Environment: export MISTRAL_API_KEY="your-api-key"
- Data: Sample rheology research included in repository
Key Concepts:
- LLM Backend: Processes text extracted from PDFs
- Many-to-One Mode: All pages merged into single output
- Chunking: Automatically splits large documents for LLM context limits
- Remote Inference: Uses Mistral API for extraction
- Programmatic Merge: Combines chunk results without additional LLM call
Expected Output:
- nodes.csv: Extracted research data (authors, experiments, results)
- edges.csv: Relationships between research entities
- graph.html: Interactive knowledge graph visualization
- document.md: Markdown version of the PDF
- report.md: Extraction statistics and summary
Related Examples:
- Example 01: VLM extraction from images
- Example 07: Local LLM inference
- Example 08: Advanced chunking strategies
- Documentation: https://ibm.github.io/docling-graph/usage/examples/research-paper/
"""
import sys
from pathlib import Path
from rich import print as rich_print
from rich.console import Console
from rich.panel import Panel
# Setup project path
project_root = Path(__file__).parent.parent.parent
sys.path.append(str(project_root))
try:
from examples.templates.rheology_research import ScholarlyRheologyPaper
from docling_graph import PipelineConfig, run_pipeline
except ImportError:
rich_print("[red]Error:[/red] Could not import required modules.")
rich_print("Please run this script from the project root directory.")
sys.exit(1)
# Configuration
SOURCE_FILE = "docs/examples/data/research_paper/rheology.pdf"
TEMPLATE_CLASS = ScholarlyRheologyPaper
console = Console()
def main() -> None:
"""Execute LLM extraction from rheology research PDF."""
console.print(
Panel.fit(
"[bold blue]Example 02: Quickstart - LLM from PDF[/bold blue]\n"
"[dim]Extract structured data from a rheology research using Large Language Model[/dim]",
border_style="blue",
)
)
console.print("\n[yellow]π Configuration:[/yellow]")
console.print(f" β’ Source: [cyan]{SOURCE_FILE}[/cyan]")
console.print(f" β’ Template: [cyan]{TEMPLATE_CLASS.__name__}[/cyan]")
console.print(" β’ Backend: [cyan]LLM (Large Language Model)[/cyan]")
console.print(" β’ Provider: [cyan]Mistral AI[/cyan]")
console.print(" β’ Mode: [cyan]many-to-one[/cyan]")
console.print("\n[yellow]β οΈ Prerequisites:[/yellow]")
console.print(" β’ Mistral API key must be set: [cyan]export MISTRAL_API_KEY='...'[/cyan]")
console.print(" β’ Install dependencies: [cyan]uv sync[/cyan]")
try:
# Configure the pipeline
config = PipelineConfig(
source=SOURCE_FILE,
template=TEMPLATE_CLASS,
# LLM backend for text-based extraction
backend="llm",
# Remote inference using API
inference="remote",
# Use Mistral AI provider
provider_override="mistral",
# Use a capable model for complex extraction
model_override="mistral-large-latest",
# Many-to-one: merge all pages into single result
processing_mode="many-to-one",
# extraction_contract="direct" (default); use "staged" for complex nested templates (see Example 11)
use_chunking=True,
)
# Execute the pipeline
console.print("\n[yellow]βοΈ Processing (this may take 1-2 minutes)...[/yellow]")
console.print(" β’ Converting PDF to markdown")
console.print(" β’ Chunking document for LLM context")
console.print(" β’ Extracting data from each chunk")
console.print(" β’ Merging results programmatically")
console.print(" β’ Building knowledge graph")
context = run_pipeline(config)
# Success message
console.print("\n[green]β Success![/green]")
graph = context.knowledge_graph
console.print(
f"\n[bold]Extracted:[/bold] [cyan]{graph.number_of_nodes()} nodes[/cyan] "
f"and [cyan]{graph.number_of_edges()} edges[/cyan]"
)
console.print("\n[bold]π‘ What Happened:[/bold]")
console.print(" β’ PDF converted to markdown using Docling")
console.print(" β’ Document split into chunks respecting context limits")
console.print(" β’ Each chunk processed by Mistral LLM")
console.print(" β’ Results merged programmatically (no LLM consolidation)")
console.print(" β’ Knowledge graph built from extracted entities")
console.print("\n[bold]π― Key Differences from Example 01:[/bold]")
console.print(" β’ LLM vs VLM: Text-based vs vision-based extraction")
console.print(" β’ Remote vs Local: API call vs local model")
console.print(" β’ Many-to-one vs One-to-one: Merged vs separate outputs")
console.print(" β’ Chunking: Enabled for large documents")
except FileNotFoundError:
console.print(f"\n[red]Error:[/red] Source file not found: {SOURCE_FILE}")
console.print("\n[yellow]Troubleshooting:[/yellow]")
console.print(" β’ Ensure you're running from the project root directory")
console.print(" β’ Check that the sample data exists in docs/examples/data/")
sys.exit(1)
except Exception as e:
error_msg = str(e).lower()
console.print(f"\n[red]Error:[/red] {e}")
console.print("\n[yellow]Troubleshooting:[/yellow]")
if "api" in error_msg or "key" in error_msg or "auth" in error_msg:
console.print(
" β’ Set your Mistral API key: [cyan]export MISTRAL_API_KEY='your-key'[/cyan]"
)
console.print(" β’ Get a key at: https://console.mistral.ai/")
console.print(" β’ Or use local inference: see Example 07")
else:
console.print(" β’ Ensure dependencies installed: [cyan]uv sync[/cyan]")
console.print(" β’ Check your internet connection")
console.print(" β’ Verify the template class is correctly defined")
console.print(" β’ Try with a smaller document first")
sys.exit(1)
if __name__ == "__main__":
main()
Rheology Research Extraction
Overview
Extract complex research data from scientific papers including experiments, measurements, materials, and results.
**Document Type:** Rheology Research (PDF)
**Time:** 30 minutes
**Backend:** LLM with chunking
---
Prerequisites
```bash
# Install with remote API support
pip install docling-graph

# Set API key
export MISTRAL_API_KEY="your-key"
```
---
Template Overview
The rheology research template (`rheology_research.py`) includes:
- **Measurements** - Flexible value/unit pairs
- **Materials** - Granular material properties
- **Geometry** - Experimental setup
- **Vibration** - Vibration parameters
- **Simulation** - DEM simulation details
- **Results** - Rheological measurements
- **Experiments** - Complete experiment instances
- **Research** - Root document model
Key Components
```python
# 1. Measurement Model
class Measurement(BaseModel):
    """Flexible measurement with value and unit."""
    name: str
    numeric_value: float | None = None
    text_value: str | None = None
    unit: str | None = None

# 2. Enum Types
class GeometryType(str, Enum):
    VANE_RHEOMETER = "Vane Rheometer"
    DOUBLE_PLATE = "Double Plate"
    CYLINDRICAL_CONTAINER = "Cylindrical Container"

# 3. Experiment Entity
class Experiment(BaseModel):
    experiment_id: str
    objective: str
    granular_material: GranularMaterial = edge("USES_MATERIAL")
    vibration_conditions: VibrationConditions = edge("HAS_VIBRATION")
    rheological_results: List[RheologicalResult] = edge("HAS_RESULT")

# 4. Root Model
class Research(BaseModel):
    title: str
    authors: List[str]
    experiments: List[Experiment] = edge("HAS_EXPERIMENT")
```
Processing
Using CLI
```bash
# Process rheology research with chunking
uv run docling-graph convert research.pdf \
  --template "docs.examples.templates.rheology_research.ScholarlyRheologyPaper" \
  --backend llm \
  --inference remote \
  --provider mistral \
  --model mistral-large-latest \
  --processing-mode many-to-one \
  --use-chunking \
  --docling-pipeline vision \
  --output-dir "outputs/research"
```
Using Python API
```python
# Process rheology research.
import os

from docling_graph import run_pipeline, PipelineConfig

os.environ["MISTRAL_API_KEY"] = "your-key"

config = PipelineConfig(
    source="research.pdf",
    template="docs.examples.templates.rheology_research.ScholarlyRheologyPaper",
    backend="llm",
    inference="remote",
    provider_override="mistral",
    model_override="mistral-large-latest",
    processing_mode="many-to-one",
    use_chunking=True,
    docling_config="vision",  # Better for complex layouts
)

print("Processing rheology research (may take several minutes)...")
run_pipeline(config)
print("✅ Complete!")
```
Expected Results
Graph Structure
```text
Research (Title)
├── HAS_EXPERIMENT → Experiment 1
│   ├── USES_MATERIAL → GranularMaterial
│   │   └── properties: [Measurement, Measurement]
│   ├── HAS_GEOMETRY → SystemGeometry
│   │   └── dimensions: [Measurement, Measurement]
│   ├── HAS_VIBRATION → VibrationConditions
│   │   ├── amplitude: Measurement
│   │   ├── frequency: Measurement
│   │   └── confining_pressure: Measurement
│   ├── HAS_SIMULATION → SimulationSetup
│   │   └── parameters: [Measurement, Measurement]
│   └── HAS_RESULT → RheologicalResult
│       └── measurement: Measurement
├── HAS_EXPERIMENT → Experiment 2
└── ...
```
Statistics
```json
{
  "node_count": 45,
  "edge_count": 38,
  "density": 0.019,
  "node_types": {
    "Research": 1,
    "Experiment": 3,
    "GranularMaterial": 3,
    "SystemGeometry": 3,
    "VibrationConditions": 3,
    "RheologicalResult": 12,
    "Measurement": 20
  }
}
```
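The density figure follows the standard definition for directed graphs: edges divided by the n·(n−1) possible ordered node pairs, which is also how NetworkX computes density for a DiGraph. A quick check against the statistics above:

```python
def directed_density(num_nodes: int, num_edges: int) -> float:
    """Graph density for a directed graph: edges / (n * (n - 1))."""
    possible = num_nodes * (num_nodes - 1)
    return num_edges / possible if possible else 0.0

# 38 edges over 45 * 44 = 1980 possible directed edges
print(round(directed_density(45, 38), 3))  # 0.019
```

A low density like this is typical of document graphs, which are mostly tree-shaped.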
Key Features
1. Enum Normalization
```python
class GeometryType(str, Enum):
    VANE_RHEOMETER = "Vane Rheometer"
    CYLINDRICAL_CONTAINER = "Cylindrical Container"

# Validator accepts multiple formats
@field_validator("geometry_type", mode="before")
@classmethod
def normalize_enum(cls, v):
    # Accepts: "Vane Rheometer", "vane_rheometer", "VANE_RHEOMETER"
    return _normalize_enum(GeometryType, v)
```
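The `_normalize_enum` helper itself is not shown in the excerpt. A stdlib-only sketch of the matching logic it implies (illustrative, not the library's actual code) could look like this:

```python
from enum import Enum


class GeometryType(str, Enum):
    VANE_RHEOMETER = "Vane Rheometer"
    CYLINDRICAL_CONTAINER = "Cylindrical Container"


def normalize_enum(enum_cls, value):
    """Map 'vane_rheometer', 'VANE_RHEOMETER', and 'Vane Rheometer' to one member."""
    if isinstance(value, enum_cls):
        return value
    # Canonical form: lowercase, underscores treated as spaces
    key = str(value).strip().lower().replace("_", " ")
    for member in enum_cls:
        if member.value.lower() == key or member.name.lower().replace("_", " ") == key:
            return member
    return value  # let Pydantic raise the validation error downstream
```

Returning the raw value on a miss keeps the validator permissive: Pydantic still reports a clear error if nothing matched.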
2. Measurement Parsing
python
# Parses strings like "1.6 mPa.s", "2 mm", "80-90 Β°C"
def _parse_measurement_string(s: str):
# Single value: "1.6 mPa.s" β {numeric_value: 1.6, unit: "mPa.s"}
# Range: "80-90 Β°C" β {numeric_value_min: 80, numeric_value_max: 90, unit: "Β°C"}
...
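The body above is elided in the docs. As a rough reconstruction of the parsing rules the comments describe (the regexes and fallback behavior are my assumption, not the library's code):

```python
import re


def parse_measurement_string(s: str) -> dict:
    """Parse '1.6 mPa.s' or '80-90 °C' into value/unit fields (illustrative sketch)."""
    # Range like "80-90 °C": two numbers joined by a hyphen, then an optional unit
    m = re.match(r"^\s*(-?\d+(?:\.\d+)?)\s*-\s*(-?\d+(?:\.\d+)?)\s*(.*)$", s)
    if m:
        return {"numeric_value_min": float(m.group(1)),
                "numeric_value_max": float(m.group(2)),
                "unit": m.group(3).strip() or None}
    # Single value like "1.6 mPa.s": one number, then an optional unit
    m = re.match(r"^\s*(-?\d+(?:\.\d+)?)\s*(.*)$", s)
    if m:
        return {"numeric_value": float(m.group(1)),
                "unit": m.group(2).strip() or None}
    # Qualitative fallback, e.g. "turbid"
    return {"text_value": s.strip()}
```

This mirrors the Measurement model's split between single values, ranges, and qualitative text.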
3. Flexible Measurements
```python
class Measurement(BaseModel):
    name: str
    numeric_value: float | None = None       # Single value
    numeric_value_min: float | None = None   # Range min
    numeric_value_max: float | None = None   # Range max
    text_value: str | None = None            # Qualitative
    unit: str | None = None
```
4. Nested Relationships
```python
class Experiment(BaseModel):
    # Direct edges
    granular_material: GranularMaterial = edge("USES_MATERIAL")
    # Nested properties (not separate nodes)
    key_findings: List[str] = Field(default_factory=list)
```
Configuration Tips
For Long Documents
```bash
# Enable chunking and consolidation
uv run docling-graph convert research.pdf \
  --template "templates.ScholarlyRheologyPaper" \
  --use-chunking \
  --processing-mode many-to-one
```
For Complex Layouts
```bash
# Use vision pipeline for better table/figure handling
uv run docling-graph convert research.pdf \
  --template "templates.ScholarlyRheologyPaper" \
  --docling-pipeline vision
```
For Cost Optimization
```bash
# Use smaller model without consolidation
uv run docling-graph convert research.pdf \
  --template "templates.ScholarlyRheologyPaper" \
  --model mistral-small-latest
```
Customization
Simplify for Your Domain
```python
"""Simplified research template."""
from typing import List

from pydantic import BaseModel, Field


def edge(label: str, **kwargs):
    return Field(..., json_schema_extra={"edge_label": label}, **kwargs)


class Measurement(BaseModel):
    """Simple measurement."""
    name: str
    value: str  # Keep as string for simplicity
    unit: str | None = None


class Experiment(BaseModel):
    """Simplified experiment."""
    title: str
    objective: str
    methods: str
    results: str
    measurements: List[Measurement] = Field(default_factory=list)


class Research(BaseModel):
    """Simplified rheology research (for demonstration).

    Note: For production use, see the full ScholarlyRheologyPaper template at:
    docs/examples/templates/rheology_research.py

    The full template includes:
    - Comprehensive scholarly metadata (authors, affiliations, identifiers)
    - Detailed formulation specifications (materials, components, amounts)
    - Batch preparation history (mixing steps, equipment, conditions)
    - Complete rheometry setup (instruments, geometries, protocols)
    - Test runs and datasets (curves, measurements, model fits)
    """
    title: str
    authors: List[str]
    abstract: str
    experiments: List[Experiment] = edge("HAS_EXPERIMENT")
```
Troubleshooting
**Extraction Takes Too Long**

**Solution:**
```bash
# Disable consolidation for faster processing, or use a smaller model
uv run docling-graph convert research.pdf \
  --template "templates.ScholarlyRheologyPaper" \
  --model mistral-small-latest
```
**Missing Measurements**

**Solution:**
```python
# Make measurements optional
measurements: List[Measurement] = Field(
    default_factory=list,
    description="List of measurements (optional)"
)
```
**Enum Validation Errors**

**Solution:**
```python
# Add OTHER option to enums
class GeometryType(str, Enum):
    VANE_RHEOMETER = "Vane Rheometer"
    OTHER = "Other"  # Fallback

# Or make the enum optional
geometry_type: GeometryType | None = Field(default=None)
```
Best Practices
**Start Simple, Add Complexity**
```python
# Phase 1: Basic structure
class Research(BaseModel):
    title: str
    authors: List[str]
    abstract: str

# Phase 2: Add experiments
class Research(BaseModel):
    title: str
    authors: List[str]
    abstract: str
    experiments: List[Experiment]

# Phase 3: Add measurements, validations, etc.
```
**Use Appropriate Chunking**
```python
# For papers > 10 pages
config = PipelineConfig(
    source="long_paper.pdf",
    template="templates.ScholarlyRheologyPaper",
    use_chunking=True,  # Essential
)
```
**Provide Clear Examples**
```python
# Good: domain-specific examples
viscosity: Measurement = Field(
    description="Effective viscosity measurement",
    examples=[
        {"name": "Effective Viscosity", "numeric_value": 1.6, "unit": "mPa.s"}
    ]
)
```
Next Steps
1. **[ID Card →](id-card.md)** - Vision-based extraction
2. **[Advanced Patterns →](../../fundamentals/schema-definition/advanced-patterns.md)** - Complex templates
3. **[Performance Tuning →](../advanced/performance-tuning.md)** - Optimization
Core Project Implementation Synthesis

The *Docling-Graph Showcase Application* is a local-first solution designed to transform unstructured documents into validated, structured knowledge graphs. At its core, the implementation bridges the gap between raw document parsing (via the Docling-Graph library) and accessible user interaction (via a Gradio web interface).
The system architecture follows a modular pipeline: it ingests various file formats (PDFs, Office docs, images), processes them through a Document Converter, and uses local LLM inference (specifically Ollama with the Granite 3.1 model) to perform entity and relationship extraction. This extraction is governed by a Template Engine that ensures the output conforms to strict Pydantic schemas. The final result is a timestamped knowledge graph stored in an organized output directory. The project is fully "container-ready" with Docker and Kubernetes manifests, supported by automation scripts for launching and lifecycle management, ensuring it can scale from a simple local test to a deployed environment.
The Template Guide: Defining the "Brain" of Extraction
Templates serve as the foundational blueprints for the application's intelligence, defining exactly what should be extracted and how it should be structured. Built as Pydantic models, these templates act as a bridge between unstructured text and formal data; they specify "nodes" (entities like parties, dates, or products) and "edges" (relationships like "buyer of" or "tax applied to").
What makes these templates unique is their use of *validation rules and natural language descriptions* to guide the LLM. For instance, a template might include a validator to normalize currency formats, or specific "hints" that tell the LLM where to look for data. By switching between templates (billing documents, scientific research, identity cards, and so on), the application can pivot its entire extraction logic to suit different industries without changing the underlying code. Essentially, the Template Guide provides the schema that transforms a generic LLM into a specialized document expert.
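As an example of the currency-normalizing validator idea, here is a stdlib-only sketch of what such a validator's core logic might do; the function name and parsing rules are illustrative assumptions, not the showcase app's actual code:

```python
import re


def normalize_currency(raw: str) -> float:
    """Normalize common currency strings ('€1,234.56', '1.234,56 EUR') to a float."""
    # Drop currency symbols, letters, and spaces; keep digits and separators
    s = re.sub(r"[^\d.,-]", "", raw)
    if "," in s and "." in s:
        # Assume the rightmost separator is the decimal mark
        if s.rfind(",") > s.rfind("."):
            s = s.replace(".", "").replace(",", ".")   # European style
        else:
            s = s.replace(",", "")                     # US style
    elif "," in s:
        s = s.replace(",", ".")                        # lone comma as decimal mark
    return float(s)
```

Wrapped in a Pydantic `field_validator(..., mode="before")`, a helper like this lets the LLM emit whatever format appears in the document while the model still validates to a clean number.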
LLM Implementation
While the official repository showcases various hosted models, I've tailored my implementation to run on a local Ollama setup for maximum privacy and control. That said, the application's architecture is intentionally provider-agnostic; users can pivot to watsonx, OpenAI, Mistral, or Gemini simply by adjusting the environment configuration. A dedicated LLM configuration layer handles the specific nuances of each provider, ensuring the extraction logic remains consistent regardless of the backend.
If you have several local models in Ollama, the application lets you choose the one you prefer (or benchmark them!).
The output of processed documents
As is my habit, I've configured the system to store everything in timestamped files within the output folder for easy version control. Just a heads-up: because I was putting the CPU through its paces with these complex document-to-graph transformations, the heavy lifting can take a little while. Grab a coffee while Bob's docling-graph works through the more data-dense files!
The code…

Now that you know the "why" and the "how," here is the code Bob and I put together to bring the Gradio interface to life. This project is a starting point, and I've made it fully available on GitHub for anyone to fork, test, and improve. If you have ideas for better configurations or want to use the UI as a springboard for your own business case, I'd love to see what you build!
"""
Docling-Graph Showcase Application
A Gradio-based UI for document processing using docling-graph with Ollama/Granite4
"""
import os
import sys
from pathlib import Path
from datetime import datetime
from typing import List, Tuple, Optional, Dict, Any, Type
import json
import traceback
import requests
import importlib.util
from dotenv import load_dotenv
import gradio as gr
from rich.console import Console
from rich.panel import Panel
from pydantic import BaseModel
# Load environment variables from .env file
load_dotenv()
# Add project root to path
project_root = Path(__file__).parent
sys.path.append(str(project_root))
try:
    from docling_graph import PipelineConfig, run_pipeline
except ImportError:
    print("Error: docling-graph not installed. Run: pip install docling-graph")
    sys.exit(1)
console = Console()
# Configuration
INPUT_DIR = project_root / "input"
OUTPUT_DIR = project_root / "output"
SAMPLES_DIR = project_root / "_samples"
TEMPLATES_DIR = project_root / "templates"
# Ensure directories exist
INPUT_DIR.mkdir(exist_ok=True)
OUTPUT_DIR.mkdir(exist_ok=True)
TEMPLATES_DIR.mkdir(exist_ok=True)
# Load configuration from environment variables
OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "granite4")
# watsonx Orchestrate configuration
WO_DEVELOPER_EDITION_SOURCE = os.getenv("WO_DEVELOPER_EDITION_SOURCE", "orchestrate")
WO_INSTANCE = os.getenv("WO_INSTANCE", "")
WO_API_KEY = os.getenv("WO_API_KEY", "")
# Remote API keys
MISTRAL_API_KEY = os.getenv("MISTRAL_API_KEY", "")
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "")
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY", "")
# Application settings
GRADIO_SERVER_PORT = int(os.getenv("GRADIO_SERVER_PORT", "7860"))
GRADIO_SERVER_NAME = os.getenv("GRADIO_SERVER_NAME", "0.0.0.0")
def get_ollama_models() -> List[str]:
    """Fetch available Ollama models from the local Ollama instance."""
    try:
        response = requests.get(f"{OLLAMA_BASE_URL}/api/tags", timeout=5)
        if response.status_code == 200:
            data = response.json()
            models = [model["name"] for model in data.get("models", [])]
            return sorted(models) if models else [OLLAMA_MODEL]
        else:
            console.print(f"[yellow]Warning: Could not fetch Ollama models (status {response.status_code})[/yellow]")
            return [OLLAMA_MODEL]
    except requests.exceptions.RequestException as e:
        console.print(f"[yellow]Warning: Ollama not available - {str(e)}[/yellow]")
        return [OLLAMA_MODEL]


def check_ollama_status() -> Tuple[bool, str]:
    """Check if Ollama is running and return status."""
    try:
        response = requests.get(f"{OLLAMA_BASE_URL}/api/tags", timeout=5)
        if response.status_code == 200:
            models = get_ollama_models()
            return True, f"🟢 Ollama Running ({len(models)} models available)"
        return False, "🟡 Ollama responding but no models found"
    except requests.exceptions.RequestException:
        return False, "🔴 Ollama Not Running"


def get_timestamp() -> str:
    """Generate timestamp for output files."""
    return datetime.now().strftime("%Y%m%d_%H%M%S")
def load_template_from_file(template_path: Path) -> Optional[Type[BaseModel]]:
    """
    Dynamically load a Pydantic template class from a Python file.

    Args:
        template_path: Path to the template Python file

    Returns:
        The root template class (BaseModel subclass) or None if not found
    """
    try:
        # Load the module
        spec = importlib.util.spec_from_file_location(template_path.stem, template_path)
        if spec is None or spec.loader is None:
            console.print(f"[yellow]Warning: Could not load spec for {template_path.name}[/yellow]")
            return None
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)

        # Find the root template class: look for BaseModel subclasses with
        # graph_id_fields in model_config (indicates a root entity)
        template_classes = []
        for name in dir(module):
            obj = getattr(module, name)
            if (isinstance(obj, type) and
                    issubclass(obj, BaseModel) and
                    obj is not BaseModel):
                if hasattr(obj, 'model_config'):
                    config = obj.model_config
                    if isinstance(config, dict) and 'graph_id_fields' in config:
                        template_classes.append(obj)

        # Return the last one found (usually the root document class)
        if template_classes:
            return template_classes[-1]

        console.print(f"[yellow]Warning: No root template class found in {template_path.name}[/yellow]")
        return None
    except Exception as e:
        console.print(f"[red]Error loading template {template_path.name}: {str(e)}[/red]")
        return None
def get_available_templates() -> Dict[str, Dict[str, Any]]:
    """
    Get all available templates from templates directory and _samples.

    Returns:
        Dictionary mapping template names to their info (path, class, description)
    """
    templates = {}

    # Load from templates directory
    if TEMPLATES_DIR.exists():
        for template_file in TEMPLATES_DIR.glob("*.py"):
            if template_file.name.startswith("_"):
                continue
            template_class = load_template_from_file(template_file)
            if template_class:
                # Extract description from docstring
                description = template_class.__doc__ or "No description available"
                description = description.strip().split('\n')[0]  # First line only
                templates[template_file.stem] = {
                    "name": template_file.stem.replace("_", " ").title(),
                    "path": template_file,
                    "class": template_class,
                    "description": description,
                    "source": "templates"
                }

    # Load from _samples directory
    if SAMPLES_DIR.exists():
        for template_file in SAMPLES_DIR.glob("*_template.py"):
            template_class = load_template_from_file(template_file)
            if template_class:
                description = template_class.__doc__ or "No description available"
                description = description.strip().split('\n')[0]
                templates[f"sample_{template_file.stem}"] = {
                    "name": f"Sample: {template_file.stem.replace('_', ' ').title()}",
                    "path": template_file,
                    "class": template_class,
                    "description": description,
                    "source": "_samples"
                }

    return templates
def list_input_files() -> List[str]:
    """List all files in the input directory."""
    if not INPUT_DIR.exists():
        return []
    files = []
    for file_path in INPUT_DIR.iterdir():
        if file_path.is_file():
            files.append(file_path.name)
    return sorted(files)
def process_document(
    file_path: str,
    backend: str,
    processing_mode: str,
    use_chunking: bool,
    provider: str,
    model: str,
    template_key: str,
    progress=gr.Progress()
) -> Tuple[str, Optional[str], Optional[str], Optional[str]]:
    """
    Process a single document using docling-graph.

    Args:
        file_path: Path to the document
        backend: Extraction backend (llm or vlm)
        processing_mode: Processing mode (one-to-one or many-to-one)
        use_chunking: Whether to use chunking
        provider: LLM provider (ollama, mistral, openai, etc.)
        model: Model name
        template_key: Key of the template to use
        progress: Gradio progress tracker

    Returns:
        Tuple of (status_message, graph_html_path, nodes_csv_path, edges_csv_path).
        File paths may be None if files weren't generated.
    """
    try:
        progress(0.0, desc="Initializing...")

        # Create timestamped output directory
        timestamp = get_timestamp()
        output_subdir = OUTPUT_DIR / f"run_{timestamp}"
        output_subdir.mkdir(exist_ok=True)

        # Prepare source path
        source_path = INPUT_DIR / file_path
        if not source_path.exists():
            return f"Error: File not found: {file_path}", None, None, None

        progress(0.1, desc="Loading template...")

        # Load the selected template
        available_templates = get_available_templates()
        if template_key not in available_templates:
            return f"Error: Template '{template_key}' not found", None, None, None

        template_info = available_templates[template_key]
        template_class = template_info["class"]
        template_name = template_info["name"]

        progress(0.15, desc="Configuring pipeline...")

        # Configure pipeline. Ollama is also routed through LiteLLM,
        # so "inference" stays "remote" with api_base pointing at the local server.
        config_dict = {
            "source": str(source_path),
            "template": template_class,
            "backend": backend,
            "inference": "remote",
            "processing_mode": processing_mode,
            "use_chunking": use_chunking,
            "output_dir": str(output_subdir),
        }

        # Add provider-specific configuration
        if provider == "ollama":
            config_dict["provider_override"] = "ollama"
            config_dict["model_override"] = f"ollama/{model}"
            config_dict["api_base"] = OLLAMA_BASE_URL
        else:
            config_dict["provider_override"] = provider
            config_dict["model_override"] = model

        config = PipelineConfig(**config_dict)
        progress(0.2, desc="Processing document...")
        console.print("  • Converting document to markdown")
        progress(0.3, desc="Converting to markdown...")
        progress(0.5, desc="Extracting data with LLM...")
        console.print("  • Extracting structured data")
        progress(0.7, desc="Building knowledge graph...")

        # Run pipeline
        context = run_pipeline(config)

        progress(0.85, desc="Exporting results...")
        console.print("  • Exporting to CSV and HTML")

        # Get results
        graph = context.knowledge_graph
        models = context.extracted_models

        # Export results manually
        from docling_graph.core import CSVExporter, JSONExporter, InteractiveVisualizer

        # Export nodes and edges as CSV
        csv_exporter = CSVExporter()
        csv_output_path = output_subdir / f"graph_{timestamp}"
        csv_exporter.export(graph=graph, output_path=csv_output_path)

        # Export as JSON
        json_exporter = JSONExporter()
        json_output_path = output_subdir / f"graph_{timestamp}.json"
        json_exporter.export(graph=graph, output_path=json_output_path)

        # Generate HTML visualization
        visualizer = InteractiveVisualizer()
        html_output_path = output_subdir / f"graph_{timestamp}.html"
        visualizer.save_cytoscape_graph(graph=graph, output_path=html_output_path)

        progress(0.95, desc="Generating outputs...")
        console.print("  • Saving results")

        # Save results with timestamp
        timestamp_str = timestamp

        # Save report in the style of 02_quickstart_llm_pdf.py
        report_path = output_subdir / f"report_{timestamp_str}.md"
        with open(report_path, "w") as f:
            f.write("# Document Processing Report\n\n")
            f.write("## Configuration\n\n")
            f.write(f"- **Source:** {file_path}\n")
            f.write(f"- **Template:** {template_name}\n")
            f.write(f"- **Backend:** {backend.upper()} ({'Large Language Model' if backend == 'llm' else 'Vision Language Model'})\n")
            f.write(f"- **Provider:** {provider}\n")
            f.write(f"- **Model:** {model}\n")
            f.write(f"- **Mode:** {processing_mode}\n")
            f.write(f"- **Chunking:** {'Enabled' if use_chunking else 'Disabled'}\n\n")
            f.write("## Results\n\n")
            f.write(f"**Extracted:** {graph.number_of_nodes()} nodes and {graph.number_of_edges()} edges\n\n")
            f.write("## What Happened\n\n")
            f.write("- Document converted to markdown using Docling\n")
            if use_chunking:
                f.write("- Document split into chunks respecting context limits\n")
                f.write(f"- Each chunk processed by {provider} {backend.upper()}\n")
                f.write("- Results merged programmatically\n")
            else:
                f.write(f"- Document processed by {provider} {backend.upper()}\n")
            f.write("- Knowledge graph built from extracted entities\n\n")
            f.write("## Output Files\n\n")
            f.write("- **nodes.csv:** Extracted entities\n")
            f.write("- **edges.csv:** Relationships between entities\n")
            f.write("- **graph.html:** Interactive knowledge graph visualization\n")
            f.write("- **document.md:** Markdown version of the document\n")
            f.write("- **report.md:** This extraction report\n")

        # Find generated files (CSV files are in subdirectory)
        graph_html = list(output_subdir.glob("*.html"))
        nodes_csv = list(output_subdir.glob("**/nodes.csv"))
        edges_csv = list(output_subdir.glob("**/edges.csv"))

        # Return None instead of empty string if files don't exist (Gradio handles None properly)
        graph_html_path = str(graph_html[0]) if graph_html else None
        nodes_csv_path = str(nodes_csv[0]) if nodes_csv else None
        edges_csv_path = str(edges_csv[0]) if edges_csv else None
progress(1.0, desc="β
Complete!")
status = f"""## β
Success!
**Extracted:** {graph.number_of_nodes()} nodes and {graph.number_of_edges()} edges
### π Configuration
- **Source:** {file_path}
- **Template:** {template_name}
- **Backend:** {backend.upper()} ({'Large Language Model' if backend == 'llm' else 'Vision Language Model'})
- **Provider:** {provider}
- **Model:** {model}
- **Mode:** {processing_mode}
### 💡 What Happened
- Document converted to markdown using Docling
{f'- Document split into chunks respecting context limits' if use_chunking else ''}
{f'- Each chunk processed by {provider} {backend.upper()}' if use_chunking else f'- Document processed by {provider} {backend.upper()}'}
{f'- Results merged programmatically' if use_chunking else ''}
- Knowledge graph built from extracted entities
### 📁 Output Files
**Directory:** `{output_subdir.name}`
- **report_{timestamp_str}.md** - Extraction report and statistics
- **{Path(graph_html_path).name if graph_html_path else 'graph.html'}** - Interactive visualization
- **{Path(nodes_csv_path).name if nodes_csv_path else 'nodes.csv'}** - Extracted entities
- **{Path(edges_csv_path).name if edges_csv_path else 'edges.csv'}** - Relationships
"""
return status, graph_html_path, nodes_csv_path, edges_csv_path
except Exception as e:
error_msg = f"""❌ Error Processing Document
**Error:** {str(e)}
**Traceback:**
{traceback.format_exc()}
**Troubleshooting:**
- Ensure Ollama is running: `ollama serve`
- Check if model is available: `ollama list`
- Verify input file exists in ./input directory
- Check API keys if using remote providers
- For large documents, processing may take 30-60 minutes
- Check logs: `tail -f logs/docling-graph-app.log`
"""
return error_msg, None, None, None
def batch_process_documents(
backend: str,
processing_mode: str,
use_chunking: bool,
provider: str,
model: str,
template_key: str,
progress=gr.Progress()
) -> str:
"""
Process all documents in the input directory.
Returns:
Status message with results
"""
try:
files = list_input_files()
if not files:
return "❌ No files found in input directory"
results = []
total_files = len(files)
for idx, file_name in enumerate(files):
progress((idx + 1) / total_files, desc=f"Processing {file_name}...")
status, _, _, _ = process_document(
file_name,
backend,
processing_mode,
use_chunking,
provider,
model,
template_key,
progress=gr.Progress()
)
results.append(f"### {file_name}\n{status}\n")
summary = f"""# Batch Processing Complete
**Total Files:** {total_files}
**Output Directory:** {OUTPUT_DIR}
---
{"".join(results)}
"""
return summary
except Exception as e:
return f"β Batch Processing Error: {str(e)}\n\n{traceback.format_exc()}"
# Create Gradio Interface
with gr.Blocks(title="Docling-Graph Showcase") as app:
# Check Ollama status at startup
ollama_running, ollama_status = check_ollama_status()
available_models = get_ollama_models() if ollama_running else [OLLAMA_MODEL]
# Load available templates
available_templates = get_available_templates()
template_choices = {info["name"]: key for key, info in available_templates.items()}
template_descriptions = {info["name"]: info["description"] for key, info in available_templates.items()}
gr.Markdown(f"""
# 📊 Docling-Graph Showcase
Transform documents into validated knowledge graphs using docling-graph with local or remote LLMs.
**Status:** {ollama_status}
**Features:**
- 📄 Individual or batch document processing
- 🧠 Local LLM inference with Ollama or remote providers
- 📊 Interactive graph visualization
- 💾 CSV export for nodes and edges
- 📋 Multiple domain-specific templates
---
""")
with gr.Tabs():
# Individual Processing Tab
with gr.Tab("📄 Individual Processing"):
gr.Markdown("### Process a single document")
with gr.Row():
with gr.Column(scale=1):
file_dropdown = gr.Dropdown(
choices=list_input_files(),
label="Select Document",
info="Files from ./input directory"
)
refresh_btn = gr.Button("🔄 Refresh File List", size="sm")
# Template selection
template_dropdown = gr.Dropdown(
choices=list(template_choices.keys()),
value=list(template_choices.keys())[0] if template_choices else None,
label="📋 Extraction Template",
info="Choose a domain-specific template for structured extraction"
)
template_info = gr.Markdown(
value=f"**Description:** {list(template_descriptions.values())[0] if template_descriptions else 'No templates available'}",
visible=True
)
backend_radio = gr.Radio(
choices=["llm", "vlm"],
value="llm",
label="Extraction Backend",
info="LLM for text, VLM for images"
)
mode_radio = gr.Radio(
choices=["one-to-one", "many-to-one"],
value="many-to-one",
label="Processing Mode",
info="one-to-one: separate outputs per page, many-to-one: merged output"
)
chunking_check = gr.Checkbox(
value=True,
label="Use Chunking",
info="Split large documents for LLM context limits"
)
provider_dropdown = gr.Dropdown(
choices=["ollama", "watsonx", "mistral", "openai", "gemini"],
value="ollama",
label="Provider",
info="LLM provider (ollama for local, watsonx for IBM watsonx)"
)
# Dynamic model selection based on provider
model_dropdown = gr.Dropdown(
choices=available_models,
value=available_models[0] if available_models else OLLAMA_MODEL,
label="Ollama Model",
info="Select from available Ollama models",
visible=True,
allow_custom_value=True
)
model_text = gr.Textbox(
value="",
label="Model Name (for non-Ollama providers)",
info="e.g., gpt-4, mistral-large, gemini-pro",
visible=False
)
refresh_models_btn = gr.Button("🔄 Refresh Ollama Models", size="sm")
# API Key fields for remote providers
with gr.Accordion("🔑 API Configuration (for remote providers)", open=False):
api_key_text = gr.Textbox(
value="",
label="API Key",
type="password",
info="Required for watsonx, OpenAI, Mistral, or Gemini. Leave empty to use .env values.",
placeholder="Enter API key or leave empty to use .env"
)
api_base_text = gr.Textbox(
value="",
label="API Base URL (optional)",
info="Custom API endpoint if needed. Leave empty to use defaults.",
placeholder="Optional: Custom API endpoint"
)
process_btn = gr.Button("🚀 Process Document", variant="primary")
with gr.Column(scale=2):
status_output = gr.Markdown(label="Status")
with gr.Accordion("📊 Outputs", open=False):
graph_file = gr.File(label="Graph HTML")
nodes_file = gr.File(label="Nodes CSV")
edges_file = gr.File(label="Edges CSV")
# Function to handle provider change
def update_model_inputs(provider):
"""Update model input fields based on selected provider."""
if provider == "ollama":
models = get_ollama_models()
return (
gr.Dropdown(visible=True, choices=models, value=models[0] if models else OLLAMA_MODEL),
gr.Textbox(visible=False)
)
else:
# For remote providers, show text input for model name
default_models = {
"watsonx": "ibm/granite-13b-chat-v2",
"openai": "gpt-4",
"mistral": "mistral-large-latest",
"gemini": "gemini-pro"
}
return (
gr.Dropdown(visible=False),
gr.Textbox(visible=True, value=default_models.get(provider, ""))
)
def refresh_ollama_models():
"""Refresh the list of available Ollama models."""
models = get_ollama_models()
return gr.Dropdown(choices=models, value=models[0] if models else OLLAMA_MODEL)
def get_model_value(provider, model_dropdown_value, model_text_value):
"""Get the appropriate model value based on provider."""
return model_dropdown_value if provider == "ollama" else model_text_value
# Wire up individual processing
refresh_btn.click(
fn=lambda: gr.Dropdown(choices=list_input_files()),
outputs=file_dropdown
)
provider_dropdown.change(
fn=update_model_inputs,
inputs=[provider_dropdown],
outputs=[model_dropdown, model_text]
)
refresh_models_btn.click(
fn=refresh_ollama_models,
outputs=model_dropdown
)
# Function to update template description
def update_template_info(template_name):
"""Update template description when selection changes."""
if template_name and template_name in template_descriptions:
return f"**Description:** {template_descriptions[template_name]}"
return "**Description:** No description available"
template_dropdown.change(
fn=update_template_info,
inputs=[template_dropdown],
outputs=[template_info]
)
# Modified process function to handle both model inputs and template
def process_with_model_selection(file_path, template_name, backend, mode, chunking, provider,
model_dropdown_val, model_text_val, progress=gr.Progress()):
model = model_dropdown_val if provider == "ollama" else model_text_val
template_key = template_choices.get(template_name, list(template_choices.values())[0])
return process_document(file_path, backend, mode, chunking, provider, model, template_key, progress)
process_btn.click(
fn=process_with_model_selection,
inputs=[
file_dropdown,
template_dropdown,
backend_radio,
mode_radio,
chunking_check,
provider_dropdown,
model_dropdown,
model_text
],
outputs=[status_output, graph_file, nodes_file, edges_file]
)
# Batch Processing Tab
with gr.Tab("📚 Batch Processing"):
gr.Markdown("### Process all documents in the input directory")
with gr.Row():
with gr.Column(scale=1):
# Template selection for batch
batch_template_dropdown = gr.Dropdown(
choices=list(template_choices.keys()),
value=list(template_choices.keys())[0] if template_choices else None,
label="📋 Extraction Template",
info="Choose a domain-specific template for structured extraction"
)
batch_template_info = gr.Markdown(
value=f"**Description:** {list(template_descriptions.values())[0] if template_descriptions else 'No templates available'}",
visible=True
)
batch_backend = gr.Radio(
choices=["llm", "vlm"],
value="llm",
label="Extraction Backend"
)
batch_mode = gr.Radio(
choices=["one-to-one", "many-to-one"],
value="many-to-one",
label="Processing Mode"
)
batch_chunking = gr.Checkbox(
value=True,
label="Use Chunking"
)
batch_provider = gr.Dropdown(
choices=["ollama", "watsonx", "mistral", "openai", "gemini"],
value="ollama",
label="Provider"
)
# Dynamic model selection for batch processing
batch_model_dropdown = gr.Dropdown(
choices=available_models,
value=available_models[0] if available_models else OLLAMA_MODEL,
label="Ollama Model",
info="Select from available Ollama models",
visible=True,
allow_custom_value=True
)
batch_model_text = gr.Textbox(
value="",
label="Model Name (for non-Ollama providers)",
info="e.g., gpt-4, mistral-large, gemini-pro",
visible=False
)
batch_refresh_models_btn = gr.Button("🔄 Refresh Ollama Models", size="sm")
with gr.Accordion("🔑 API Configuration (for remote providers)", open=False):
batch_api_key_text = gr.Textbox(
value="",
label="API Key",
type="password",
info="Required for watsonx, OpenAI, Mistral, or Gemini. Leave empty to use .env values.",
placeholder="Enter API key or leave empty to use .env"
)
batch_api_base_text = gr.Textbox(
value="",
label="API Base URL (optional)",
info="Custom API endpoint if needed. Leave empty to use defaults.",
placeholder="Optional: Custom API endpoint"
)
batch_btn = gr.Button("🚀 Process All Documents", variant="primary")
with gr.Column(scale=2):
batch_status = gr.Markdown(label="Batch Status")
# Wire up batch processing provider change
batch_provider.change(
fn=update_model_inputs,
inputs=[batch_provider],
outputs=[batch_model_dropdown, batch_model_text]
)
batch_refresh_models_btn.click(
fn=refresh_ollama_models,
outputs=batch_model_dropdown
)
# Function to update batch template description
batch_template_dropdown.change(
fn=update_template_info,
inputs=[batch_template_dropdown],
outputs=[batch_template_info]
)
# Modified batch process function
def batch_process_with_model_selection(template_name, backend, mode, chunking, provider,
model_dropdown_val, model_text_val, progress=gr.Progress()):
model = model_dropdown_val if provider == "ollama" else model_text_val
template_key = template_choices.get(template_name, list(template_choices.values())[0])
return batch_process_documents(backend, mode, chunking, provider, model, template_key, progress)
# Wire up batch processing
batch_btn.click(
fn=batch_process_with_model_selection,
inputs=[
batch_template_dropdown,
batch_backend,
batch_mode,
batch_chunking,
batch_provider,
batch_model_dropdown,
batch_model_text
],
outputs=batch_status
)
# Help Tab
with gr.Tab("ℹ️ Help"):
gr.Markdown("""
## Getting Started
### 1. Setup Ollama (for local inference)
```bash
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Start Ollama service
ollama serve
# Pull a model (examples)
ollama pull granite4
ollama pull llama3
ollama pull mistral
```
### 2. Add Documents
Place your documents (PDF, images, markdown, etc.) in the `./input` directory.
### 3. Select Provider & Model
- **Ollama (Local):** Select from your installed models using the dropdown
- **Remote Providers:** Choose OpenAI, Mistral, or Gemini and enter your API key
### 4. Process Documents
- **Individual:** Select a file and click "Process Document"
- **Batch:** Click "Process All Documents" to process everything
### 5. View Results
Results are saved in `./output` with timestamps:
- `report_TIMESTAMP.md` - Processing summary
- `graph_TIMESTAMP.html` - Interactive visualization
- `nodes.csv` - Extracted entities
- `edges.csv` - Relationships
## Configuration
### Templates
Templates define the structure of data to extract from documents. Each template is a Pydantic model that:
- Defines entities (nodes) and relationships (edges)
- Provides field descriptions to guide the LLM
- Validates extracted data
- Generates a knowledge graph
**Available Templates:**
- Located in `./templates/` directory
- Can be customized or extended
- Support complex nested structures
- Include validation and normalization
**Creating Custom Templates:**
1. Create a new `.py` file in `./templates/`
2. Define Pydantic models with `graph_id_fields`
3. Use the `edge()` helper for relationships
4. Add field descriptions to guide extraction
5. Restart the app to load new templates
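A minimal template sketch in that spirit might look like this. The class and field names are purely illustrative, and the docling-graph-specific hooks (`graph_id_fields` and the `edge()` helper) are referenced only in comments; consult the docling-graph documentation for their exact usage:

```python
from typing import List
from pydantic import BaseModel, Field

class Author(BaseModel):
    # graph_id_fields would typically mark 'name' as this node's stable ID
    name: str = Field(description="Full name of the author")

class Paper(BaseModel):
    # the edge() helper would declare the Paper -> Author relationship here
    title: str = Field(description="Title of the paper")
    authors: List[Author] = Field(
        default_factory=list,
        description="Authors credited on the paper",
    )
```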
### Backends
- **LLM:** Text-based extraction (best for PDFs, documents)
- **VLM:** Vision-based extraction (best for images, forms)
### Processing Modes
- **one-to-one:** Each page becomes a separate output
- **many-to-one:** All pages merged into single output
### Providers
#### Ollama (Local - Recommended)
- **Advantages:** Privacy, no API costs, works offline
- **Models:** Any model you've pulled (granite4, llama3, mistral, etc.)
- **Setup:** Just install Ollama and pull models
- **Refresh:** Click "🔄 Refresh Ollama Models" to update the list
#### Remote Providers
- **watsonx:** IBM watsonx models (requires WO_INSTANCE and WO_API_KEY in .env)
- **OpenAI:** GPT-4, GPT-3.5-turbo (requires API key)
- **Mistral:** mistral-large-latest, mistral-medium (requires API key)
- **Gemini:** gemini-pro (requires API key)
## Troubleshooting
### Ollama Connection Error
```bash
# Check if Ollama is running
curl http://localhost:11434/api/tags
# Restart Ollama
ollama serve
```
### No Models Available
```bash
# List available models
ollama list
# Pull a model
ollama pull granite4
ollama pull llama3
# Refresh the model list in the UI
# Click the "🔄 Refresh Ollama Models" button
```
### Remote Provider Errors
- **watsonx:** Verify WO_INSTANCE and WO_API_KEY in .env file
- **Other providers:** Verify your API key is correct
- Check your API quota/credits
- Ensure you have network connectivity
### Environment Configuration
```bash
# Copy the template and configure
cp .env.template .env
# Edit .env with your settings
# For watsonx: Set WO_INSTANCE and WO_API_KEY
# For other providers: Set respective API keys
```
### Out of Memory
- Enable chunking (recommended for large documents)
- Use a smaller model
- Process fewer documents at once
## Documentation
For detailed documentation, see the `./Docs` directory or visit:
https://docling-project.github.io/docling-graph/
""")
if __name__ == "__main__":
console.print(
Panel.fit(
"[bold blue]Docling-Graph Showcase[/bold blue]\n"
"[dim]Starting Gradio application...[/dim]",
border_style="blue",
)
)
# Try to find an available port starting from 7861
import socket
def find_free_port(start_port=7861, max_attempts=10):
"""Find a free port starting from start_port."""
for port in range(start_port, start_port + max_attempts):
try:
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
s.bind(("", port))
return port
except OSError:
continue
return start_port # Fallback to original port
port = find_free_port()
console.print(f"[green]Starting on port {port}[/green]")
app.launch(
server_name="0.0.0.0",
server_port=port,
share=False,
show_error=True
)
# Made with Bob
True to my standard "production-first" approach, the application comes fully containerized with a dedicated Dockerfile. To simplify the transition from local testing to cloud-scale environments, I've also included a comprehensive set of Kubernetes YAML manifests. This ensures that whether you are deploying to a private cluster or a public cloud provider, the infrastructure is defined as code and ready to scale.
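As a rough illustration, the Dockerfile for a Gradio app like this one typically follows a simple pattern. This is a minimal sketch only; file names such as `app.py` and `requirements.txt` are assumptions here, so check the post's repo for the actual files:

```dockerfile
# Minimal sketch -- assumes the app lives in app.py with deps in requirements.txt
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
# The app binds 0.0.0.0 and picks a free port starting at 7861
EXPOSE 7861
CMD ["python", "app.py"]
```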
Conclusion
The true power of Docling-Graph lies in its ability to move beyond the limitations of "fuzzy" text searching and approximate embeddings. By transforming unstructured documents into validated Pydantic objects, it enforces a strict data contract that ensures every extracted entity (be it a chemical compound in a lab report, a tax clause in a financial statement, or a dependency in a legal contract) is captured with clinical precision. This isn't just data extraction; it is the automated creation of a semantic knowledge graph where the relationships between entities are as explicit and reliable as the data itself.
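The core idea can be sketched in a few lines. This is a conceptual illustration only, not docling-graph's actual internals: Pydantic validates the extracted data before anything enters the graph, and NetworkX stores the relationships as explicit, typed edges rather than embedding similarities (the `Compound` model and `precursor_of` relation are invented for the example):

```python
# Conceptual sketch only -- not docling-graph's actual internals.
import networkx as nx
from pydantic import BaseModel

class Compound(BaseModel):
    name: str
    formula: str

# Validation enforces the data contract before anything enters the graph
aspirin = Compound(name="aspirin", formula="C9H8O4")
salicylic = Compound(name="salicylic acid", formula="C7H6O3")

g = nx.DiGraph()
for c in (aspirin, salicylic):
    g.add_node(c.name, formula=c.formula)
# The relationship is explicit edge metadata, not an embedding similarity
g.add_edge("salicylic acid", "aspirin", relation="precursor_of")

print(g.number_of_nodes(), g.number_of_edges())  # 2 nodes, 1 edge
```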
Through this collaboration with Bob, the implementation delivers more than a code snippet: a complete ecosystem that is ready to deploy and extend. What has been built is a robust bridge between high-level document intelligence and practical deployment:
- Universal Flexibility: A modular configuration that effortlessly toggles between local-first privacy (via Ollama) and high-performance cloud intelligence (IBM watsonx, OpenAI, Mistral, Gemini).
- Architectural Integrity: A dual-path extraction pipeline that leverages both local VLM capabilities and LiteLLM routing, ensuring the system adapts to the complexity of the document at hand.
- Operational Readiness: Beyond the logic, Bob has provided the "last mile" of software engineering: Gradio UIs for user interaction, a Dockerfile and Kubernetes manifests for scaling, and timestamped automation for data lifecycle management.
Ultimately, what Bob has delivered is a starter kit that doesn't just "test" Docling-Graph: it lays a foundation for mission-critical AI applications where accuracy is non-negotiable and the relationship is the message.
>>> Thanks for reading <<<
Links
- Docling Project: https://docling-project.github.io/docling/
- Docling-graph: https://github.com/docling-project/docling-graph
- The Postβs Code Repo: https://github.com/aairom/docling-graph-test/
- IBM Project Bob: https://www.ibm.com/products/bob