Alain Airom

Docling-Agent is Coming: Bob’s Blueprint for Next-Gen Automation

Bob + Docling-Agent: Automation, Simplified


IBM Bob has a new set of power tools in his belt! Docling-agent is the latest addition to the Docling ecosystem, designed to simplify how you handle complex document tasks through “agentic” operations. Whether you are starting from scratch or refining existing work, Bob is here to show you how these new agents can automate your heavy lifting.


🧰 What’s in the New Toolbox?

Docling-agent isn’t just a parser; it’s a collaborator: it lets you interact with documents in natural language to perform high-level tasks. Here is what Bob is building with it:

  • Document Writing: Generate structured reports (JSON, Markdown, or HTML) simply by giving the agent a prompt.
  • Targeted Editing: Load an existing Docling JSON and tell the agent exactly what to change — like adding a column to a table or updating a summary.
  • Schema-Guided Extraction: Hand the agent a pile of PDFs or images along with a specific schema (like “invoice number” or “total”), and it will extract the data into a clean report.
  • Model Agnostic: You can plug in different backends, from OpenAI’s GPT models to IBM Granite, via the Mellea framework.

> Excerpt from GitHub repo: This package is still immature and work-in-progress. We are happy to get comments, suggestions, code contributions, etc!


Implementation and Test

The project’s GitHub repository provides sample code and source data for testing docling-agent ⬇️

from mellea.backends import model_ids
from docling_agent.agents import DoclingWritingAgent
agent = DoclingWritingAgent(model_id=model_ids.OPENAI_GPT_OSS_20B)
doc = agent.run("Write a brief report on polymers in food packaging with a small comparison table.")
doc.save_as_html("./scratch/report.html")

Use natural-language tasks to update a Docling JSON. You can run multiple tasks to iteratively refine content, structure, or formatting.

from pathlib import Path
from mellea.backends import model_ids
from docling_core.types.doc.document import DoclingDocument
from docling_agent.agents import DoclingEditingAgent

ipath = Path("./examples/example_02_edit_resources/20250815_125216.json")
doc = DoclingDocument.load_from_json(ipath)

agent = DoclingEditingAgent(model_id=model_ids.OPENAI_GPT_OSS_20B)
updated = agent.run(task="Put polymer abbreviations in a separate column in the first table.", document=doc)
updated.save_as_html("./scratch/updated_table.html")

Define a simple schema and provide a list of files (PDFs/images). The agent produces an HTML report with extracted fields.

from pathlib import Path
from mellea.backends import model_ids
from docling_agent.agents import DoclingExtractingAgent

schema = {"invoice-number": "string", "total": "float", "currency": "string"}
sources = sorted([p for p in Path("./examples/example_03_extract/invoices").rglob("*.*") if p.suffix.lower() in {".pdf", ".png", ".jpg", ".jpeg"}])

agent = DoclingExtractingAgent(model_id=model_ids.OPENAI_GPT_OSS_20B)
report = agent.run(task=str(schema), sources=sources)
report.save_as_html("./scratch/invoices_extraction_report.html")
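The schema here is just a plain dict mapping field names to type hints. As a hypothetical illustration (the helper below is my own, not part of docling-agent), records extracted with such a schema could be sanity-checked like this:

```python
# Hypothetical helper -- not part of docling-agent -- showing how records
# extracted with a {"field": "type-name"} schema could be sanity-checked.
TYPE_MAP = {"string": str, "float": float, "int": int}

def validate_record(record: dict, schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the record matches the schema."""
    problems = []
    for field, type_name in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], TYPE_MAP[type_name]):
            problems.append(f"wrong type for {field}: expected {type_name}")
    return problems

schema = {"invoice-number": "string", "total": "float", "currency": "string"}
good = {"invoice-number": "INV-001", "total": 99.5, "currency": "EUR"}
bad = {"invoice-number": "INV-002", "total": "99.5"}

print(validate_record(good, schema))  # []
print(validate_record(bad, schema))
```

A check like this is a cheap safety net when the LLM backend occasionally returns a field as text instead of a number.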

🛑 Although the project’s GitHub page still says “Installation / Coming Soon…”, with Bob we can automate the installation and much more… 💪😁


So how did Bob help?

🏗️ A Complete Automation Scaffold


Bob didn’t just wait for a pip install; he realized that because docling-agent is open-source on GitHub, he could automate the entire setup. Bob has engineered a workflow that bridges the gap between raw research and production-ready automation. By synthesizing the latest tools, he’s created a “one-click” environment for high-fidelity data extraction.

  • Automatic Dependency Management: Bob’s start.sh script automatically creates a virtual environment and pulls the latest docling-agent, mellea, and docling-core directly from GitHub.
  • GitHub-Direct Installation: Since docling-agent is in its early stages and not yet on PyPI, Bob’s requirements.txt is configured to pull the latest source code directly from GitHub.
# Core dependencies
flask>=3.0.0
python-dotenv>=1.0.0

# Image processing dependencies (required for docling to process images)
qwen-vl-utils

# Docling dependencies - install from GitHub
# Install in specific order to avoid conflicts
# docling-agent will pull in its own dependencies including docling-core and mellea
git+https://github.com/docling-project/docling-agent.git

  • Zero-Config Start: He realized that users just want to “place and process,” so he built an input/ and output/ directory system where documents (PDFs, PNGs, JPGs) are automatically picked up.
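Bob’s actual start.sh isn’t reproduced in full in this post, but a minimal sketch of what such a bootstrap script could look like is shown below (the `.venv` path and file names are illustrative assumptions, not the script itself):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Illustrative bootstrap sketch (paths and names are assumptions).
# 1. Create a virtual environment on first run.
if [ ! -d ".venv" ]; then
    python3 -m venv .venv
fi
source .venv/bin/activate

# 2. Install dependencies, pulling docling-agent straight from GitHub
#    via the git+https line in requirements.txt.
pip install --upgrade pip
pip install -r requirements.txt

# 3. Make sure the "place and process" directories exist, then launch the web UI.
mkdir -p input output
python app.py
```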

🖥️ From CLI to a Polished Web UI

Bob knew that not everyone wants to live in the terminal, so he wrapped the agent in a Flask-based web interface.
# app.py
#!/usr/bin/env python3
"""
Docling Extract Schema Agent Application
Batch processing application for extracting structured data from documents
"""

import os
import logging
from pathlib import Path
from datetime import datetime
from typing import Dict, List, Optional
from flask import Flask, render_template, request, jsonify, send_file
from mellea.backends import model_ids
from docling_agent.agents import DoclingExtractingAgent

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

# Initialize Flask app
app = Flask(__name__)
app.config['SECRET_KEY'] = os.environ.get('SECRET_KEY', 'dev-secret-key-change-in-production')

# Configuration
INPUT_DIR = Path("./input")
OUTPUT_DIR = Path("./output")
SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg"}

# Default schema for extraction
DEFAULT_SCHEMA = {
    "document-type": "string",
    "date": "string",
    "title": "string",
    "key-information": "string"
}

# Ensure directories exist
INPUT_DIR.mkdir(exist_ok=True)
OUTPUT_DIR.mkdir(exist_ok=True)

class DocumentProcessor:
    """Handles document processing with docling-agent"""

    def __init__(self, model_id = model_ids.OPENAI_GPT_OSS_20B):
        self.model_id = model_id
        self.agent = None

    def initialize_agent(self):
        """Initialize the DoclingExtractingAgent"""
        try:
            # Initialize agent with empty tools list (can be extended later)
            self.agent = DoclingExtractingAgent(model_id=self.model_id, tools=[])
            logger.info(f"Agent initialized with model: {self.model_id}")
        except Exception as e:
            logger.error(f"Failed to initialize agent: {e}")
            raise

    def get_source_files(self, directory: Path) -> List[Path]:
        """Get all supported document files from directory"""
        sources = sorted([
            p for p in directory.rglob("*.*") 
            if p.suffix.lower() in SUPPORTED_EXTENSIONS
        ])
        logger.info(f"Found {len(sources)} source files in {directory}")
        return sources

    def process_documents(self, schema: Dict, sources: List[Path]) -> tuple:
        """Process documents and return report path and timestamp"""
        if not self.agent:
            self.initialize_agent()

        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        output_filename = f"extraction_report_{timestamp}.html"
        output_path = OUTPUT_DIR / output_filename

        try:
            logger.info(f"Processing {len(sources)} documents with schema: {schema}")
            report = self.agent.run(task=str(schema), sources=sources)
            report.save_as_html(str(output_path))
            logger.info(f"Report saved to: {output_path}")
            return output_path, timestamp
        except Exception as e:
            logger.error(f"Error processing documents: {e}")
            raise

# Global processor instance
processor = DocumentProcessor()

@app.route('/')
def index():
    """Main page"""
    return render_template('index.html')

@app.route('/api/status')
def status():
    """Get application status"""
    input_files = processor.get_source_files(INPUT_DIR)
    output_files = sorted(OUTPUT_DIR.glob("*.html"), reverse=True)

    return jsonify({
        'status': 'running',
        'input_dir': str(INPUT_DIR),
        'output_dir': str(OUTPUT_DIR),
        'input_files_count': len(input_files),
        'input_files': [f.name for f in input_files],
        'output_files_count': len(output_files),
        'output_files': [f.name for f in output_files[:10]]  # Last 10 reports
    })

@app.route('/api/process', methods=['POST'])
def process():
    """Process documents with custom schema"""
    try:
        data = request.get_json()
        schema = data.get('schema', DEFAULT_SCHEMA)

        # Validate schema
        if not isinstance(schema, dict):
            return jsonify({'error': 'Schema must be a dictionary'}), 400

        # Get source files
        sources = processor.get_source_files(INPUT_DIR)

        if not sources:
            return jsonify({'error': 'No documents found in input directory'}), 400

        # Process documents
        output_path, timestamp = processor.process_documents(schema, sources)

        return jsonify({
            'success': True,
            'message': f'Processed {len(sources)} documents',
            'output_file': output_path.name,
            'timestamp': timestamp,
            'sources_processed': len(sources)
        })

    except Exception as e:
        logger.error(f"Processing error: {e}")
        return jsonify({'error': str(e)}), 500

@app.route('/api/reports')
def list_reports():
    """List all generated reports"""
    output_files = sorted(OUTPUT_DIR.glob("*.html"), reverse=True)

    reports = []
    for f in output_files:
        stat = f.stat()
        reports.append({
            'name': f.name,
            'size': stat.st_size,
            'created': datetime.fromtimestamp(stat.st_ctime).isoformat(),
            'modified': datetime.fromtimestamp(stat.st_mtime).isoformat()
        })

    return jsonify({'reports': reports})

@app.route('/api/reports/<filename>')
def download_report(filename):
    """Download a specific report"""
    file_path = OUTPUT_DIR / filename

    if not file_path.exists() or not file_path.is_file():
        return jsonify({'error': 'Report not found'}), 404

    return send_file(file_path, as_attachment=True)

@app.route('/api/schema/default')
def get_default_schema():
    """Get the default extraction schema"""
    return jsonify({'schema': DEFAULT_SCHEMA})

if __name__ == '__main__':
    # Initialize agent on startup
    try:
        processor.initialize_agent()
    except Exception as e:
        logger.error(f"Failed to initialize agent on startup: {e}")
        logger.warning("Agent will be initialized on first request")

    # Run Flask app
    # Default to port 8080 to avoid conflict with macOS AirPlay on port 5000
    port = int(os.environ.get('PORT', 8080))
    debug = os.environ.get('DEBUG', 'False').lower() == 'true'

    logger.info(f"Starting Docling Extract Schema Agent Application on port {port}")
    logger.info(f"Input directory: {INPUT_DIR.absolute()}")
    logger.info(f"Output directory: {OUTPUT_DIR.absolute()}")

    app.run(host='0.0.0.0', port=port, debug=debug)

# Made with Bob
  • Real-Time Monitoring: Bob realized he could provide a live status dashboard so you can watch the agent work.
  • Schema Editor: He built a tool where you can define exactly what data you want (like invoice numbers or totals) in a simple JSON format, and the agent obeys.
  • Model Agnostic: He realized the system could use various AI models (like OpenAI GPT models or the open-weight GPT-OSS 20B) via the Mellea backend.
  • Structured Output: Instead of just raw text, Bob’s setup forces the agent to produce timestamped HTML reports and structured data that is actually useful for business.
  • Self-Healing Dependencies: Essential building blocks like Mellea (the model backend) and docling-core are automatically installed through the same requirement file, ensuring version compatibility.
  • Ollama Integration: To power the “brain” of the agent, Bob utilizes Ollama to pull and manage the heavy-duty LLM models locally, keeping your data secure and your processing fast.

📂 The “Ready-to-Build” Workflow

  • Stage Your Materials: Place your PDFs or images into the input/ directory.
  • Ignite the Engine: Run ./scripts/start.sh to let Bob handle the virtual environment and dependency installation automatically.
  • Extract Value: Navigate to http://localhost:8080, define your schema, and watch the agent transform documents into structured reports.
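Once the server is up, the same workflow can also be driven from the command line. These routes are the ones defined in app.py above (port 8080 is the default; the report filename in the last command is only an example of the timestamped pattern the app generates):

```shell
# Check what's waiting in input/ and which reports already exist
curl http://localhost:8080/api/status

# Kick off an extraction run with a custom schema
curl -X POST http://localhost:8080/api/process \
     -H "Content-Type: application/json" \
     -d '{"schema": {"invoice-number": "string", "total": "float", "currency": "string"}}'

# List generated reports, then download one (example filename)
curl http://localhost:8080/api/reports
curl -O http://localhost:8080/api/reports/extraction_report_20250101_120000.html
```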

Et voilà 🏅


Conclusion

By synthesizing these insights, Bob has realized that the “Coming Soon” status of a project is no barrier to a builder with the right tools. He has successfully transformed the raw potential of the docling-agent GitHub repository into a production-ready, web-based automation engine. By automating the installation of non-PyPI dependencies and integrating Ollama for local model execution, Bob has created a sturdy bridge between complex agentic code and an intuitive user experience. Ultimately, this project proves that with a clear schema and a bit of automation, transforming messy documents into structured, timestamped reports is no longer a future promise — it’s a task Bob can help you finish today.

>>> Thanks for reading <<<
