
Alain Airom (Ayrom)


Rapid Document Conversion Using Docling (Locally and Remotely) with an Apify Actor

Recent work on implementing and using Docling and Apify

Introduction

If you have been following my posts, you already know I am a massive advocate for Docling. It is my go-to recommendation whenever business partners bring up document processing challenges. This post reflects a recent workshop I ran with a partner where we tackled scaling document ingestion. I designed two applications for the session. The first is a batch processing application that converts MS Office (.docx) documents into the standard structured outputs provided by Docling (Markdown, JSON, and plain text). The second, covered in Part 2, runs the same conversion remotely as an Apify Actor. While document conversion is a fairly standard pattern, I’ve included the full implementation anyway. What makes it noteworthy is that by leveraging IBM Bob alongside Docling’s samples, I was able to industrialize the solution, complete with full documentation and automated scripts, in under 15 minutes.


Part 1-Docling MS Office “Docx” Converter Implementation

Below is the implementation of the Docling converter application, starting with the project layout.

docling-docx/
├── input/                  # Input documents (*.docx, *.pdf, etc.)
├── output/                 # Converted documents with timestamps
├── scripts/                # Utility scripts
│   └── setup_venv.sh      # Virtual environment setup
├── Docs/                   # Documentation
│   └── Architecture.md    # This file
├── convert_documents.py   # Main application
├── requirements.txt       # Python dependencies
├── .gitignore            # Git ignore rules
└── README.md             # Project documentation

  • Preparing the environment (requirements.txt):
# Docling - Document conversion library
docling>=2.0.0

# Additional dependencies for document processing
python-dotenv>=1.0.0
  • The script that prepares the environment for the Python application (scripts/setup_venv.sh):
#!/bin/bash

# Setup Virtual Environment for Docling Document Converter
# This script creates a Python virtual environment and installs dependencies

set -e  # Exit on error

echo "======================================"
echo "Setting up Virtual Environment"
echo "======================================"
echo ""

# Check if Python 3 is installed
if ! command -v python3 &> /dev/null; then
    echo "Error: Python 3 is not installed. Please install Python 3.8 or higher."
    exit 1
fi

# Display Python version
PYTHON_VERSION=$(python3 --version)
echo "Found: $PYTHON_VERSION"
echo ""

# Create virtual environment
echo "Creating virtual environment..."
python3 -m venv venv

# Activate virtual environment
echo "Activating virtual environment..."
source venv/bin/activate

# Upgrade pip
echo "Upgrading pip..."
pip install --upgrade pip

# Install requirements
echo "Installing dependencies from requirements.txt..."
pip install -r requirements.txt

echo ""
echo "======================================"
echo "Setup Complete!"
echo "======================================"
echo ""
echo "To activate the virtual environment, run:"
echo "  source venv/bin/activate"
echo ""
echo "To run the document converter:"
echo "  python convert_documents.py"
echo ""
echo "To deactivate the virtual environment:"
echo "  deactivate"
echo ""

# Made with Bob
  • And the code 👨‍💻
#!/usr/bin/env python3
"""
Document Converter using Docling
Converts documents from the input folder to various formats and saves them in the output folder.
"""

import os
import sys
from pathlib import Path
from datetime import datetime
from docling.document_converter import DocumentConverter


def create_output_directory():
    """Create output directory if it doesn't exist."""
    output_dir = Path("output")
    output_dir.mkdir(exist_ok=True)
    return output_dir


def get_timestamp():
    """Generate timestamp for output files."""
    return datetime.now().strftime("%Y%m%d_%H%M%S")


def process_documents(input_dir="input", output_dir="output"):
    """
    Process all documents in the input directory recursively.

    Args:
        input_dir: Directory containing input documents
        output_dir: Directory to save converted documents
    """
    input_path = Path(input_dir)
    output_path = Path(output_dir)

    if not input_path.exists():
        print(f"Error: Input directory '{input_dir}' does not exist.")
        sys.exit(1)

    # Create output directory
    output_path.mkdir(exist_ok=True)

    # Initialize Docling converter
    print("Initializing Docling DocumentConverter...")
    converter = DocumentConverter()

    # Supported document extensions (legacy .doc is left out: Docling reads .docx, not .doc)
    supported_extensions = ['.docx', '.pdf', '.pptx', '.xlsx', '.html']

    # Find all documents recursively
    documents = []
    for ext in supported_extensions:
        documents.extend(input_path.rglob(f"*{ext}"))

    if not documents:
        print(f"No supported documents found in '{input_dir}'")
        print(f"Supported formats: {', '.join(supported_extensions)}")
        return

    print(f"\nFound {len(documents)} document(s) to process:")
    for doc in documents:
        print(f"  - {doc.relative_to(input_path)}")

    # Process each document
    timestamp = get_timestamp()
    successful = 0
    failed = 0

    print("\n" + "="*60)
    print("Starting document conversion...")
    print("="*60 + "\n")

    for doc_path in documents:
        try:
            print(f"Processing: {doc_path.name}")

            # Convert document
            result = converter.convert(str(doc_path))

            # Create output filename with timestamp
            base_name = doc_path.stem
            relative_path = doc_path.relative_to(input_path).parent

            # Create subdirectory structure in output if needed
            output_subdir = output_path / relative_path
            output_subdir.mkdir(parents=True, exist_ok=True)

            # Export to Markdown
            md_output = output_subdir / f"{base_name}_{timestamp}.md"
            try:
                with open(md_output, "w", encoding="utf-8") as f:
                    f.write(result.document.export_to_markdown())
                print(f"  ✓ Markdown: {md_output.relative_to(output_path)}")
            except Exception as e:
                print(f"  ⚠ Markdown export failed: {str(e)}")

            # Export to JSON
            json_output = output_subdir / f"{base_name}_{timestamp}.json"
            try:
                import json
                with open(json_output, "w", encoding="utf-8") as f:
                    json.dump(result.document.export_to_dict(), f, indent=2, ensure_ascii=False)
                print(f"  ✓ JSON: {json_output.relative_to(output_path)}")
            except Exception as e:
                print(f"  ⚠ JSON export failed: {str(e)}")

            # Export to plain text
            txt_output = output_subdir / f"{base_name}_{timestamp}.txt"
            try:
                with open(txt_output, "w", encoding="utf-8") as f:
                    f.write(result.document.export_to_text())
                print(f"  ✓ Text: {txt_output.relative_to(output_path)}")
            except Exception as e:
                print(f"  ⚠ Text export failed: {str(e)}")

            successful += 1
            print(f"  ✓ Successfully converted: {doc_path.name}\n")

        except Exception as e:
            failed += 1
            print(f"  ✗ Error converting {doc_path.name}: {str(e)}\n")

    # Summary
    print("="*60)
    print("Conversion Summary")
    print("="*60)
    print(f"Total documents: {len(documents)}")
    print(f"Successfully converted: {successful}")
    print(f"Failed: {failed}")
    print(f"\nOutput directory: {output_path.absolute()}")
    print("="*60)


def main():
    """Main entry point."""
    print("="*60)
    print("Docling Document Converter")
    print("="*60)
    print()

    try:
        process_documents()
    except KeyboardInterrupt:
        print("\n\nConversion interrupted by user.")
        sys.exit(1)
    except Exception as e:
        print(f"\n\nFatal error: {str(e)}")
        sys.exit(1)


if __name__ == "__main__":
    main()

# Made with Bob
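The output naming scheme used in the script above (mirror the input subfolder, then append the shared run timestamp to the file stem) can be isolated into a small helper. This is a minimal sketch rather than part of the original script; `build_output_path` is a hypothetical name, and it only computes the path without touching the file system.

```python
from pathlib import Path


def build_output_path(doc_path, input_root, output_root, timestamp, suffix):
    """Mirror doc_path's location under output_root and timestamp the file name.

    Hypothetical helper: it reproduces the naming logic of process_documents()
    without creating any directories or files.
    """
    doc_path = Path(doc_path)
    # Sub-folder of the document relative to the input root ("." if top-level)
    relative_parent = doc_path.relative_to(input_root).parent
    return Path(output_root) / relative_parent / f"{doc_path.stem}_{timestamp}{suffix}"


if __name__ == "__main__":
    path = build_output_path(
        "input/contracts/2024/nda.docx", "input", "output", "20250101_120000", ".md"
    )
    print(path.as_posix())
```

Because every file in a run shares one timestamp, the Markdown, JSON, and text exports of the same document end up side by side and are easy to group.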

Part 2-Docling MS Office “Docx” Converter Implementation with “apify” 👀

Things got a bit interesting — and by interesting, I mean “the architectural scope expanded while I nodded along knowingly” — when we started discussing integration with Apify. Full disclosure? I had absolutely no idea what Apify was. (Pro tip: it’s actually right there in the Docling documentation; don’t look at me like that 🤣).

Naturally, I did what any seasoned professional does: I quietly opened a dozen tabs, chased down links to Apify’s website and GitHub, leveled up my project skills in record time, and then asked my pal IBM Bob to do the heavy lifting and build the app.

But before we dig into the codebase Bob spun up, let’s answer the burning question: what on earth is Apify anyway?


Apify

Image from Apify site

So, what exactly is Apify? Think of it as a cloud-hosted playground and execution engine designed to turn automation scripts, web scrapers, and data parsers into production-grade, scalable cloud services called Actors. Instead of worrying about setting up servers, managing proxies, containerizing your code, or scheduling cron jobs, Apify handles the entire infrastructure lifecycle. Whether you are running a Python script, an asynchronous SDK, or a specialized parser like Docling, you can package it into a headless Docker container that runs on Apify’s infrastructure. It provides built-in storage features — like specialized Key-Value stores for document artifacts and structured Datasets for metadata — and exposes everything through a clean CLI and REST API, making it incredibly easy to industrialize your workflows or feed cleaned data directly into generative AI pipelines.


Docling and Apify Together: The Application

Once I wrapped my head around Apify’s architecture — and realized just how powerful it is for cloudifying automation workflows — I handed the blueprint over to my digital co-pilot, IBM Bob. The goal? Take our initial batch processing tool and make it fully “Apify-Ready.”

Naturally, the road to production-grade glory isn’t paved without a few spilled coffees. It took us exactly three rounds of intense back-and-forth debugging to pin down the strict repository folder structure Apify demands (shown in the layout below).


docling-docx-apify/
├── .actor/                      # Apify Actor metadata
│   └── actor.json              # Actor configuration & input schema
├── .bob/                        # Bob AI rules
│   └── rules/
│       └── project-development.md
├── Docs/                        # Documentation
│   ├── Apify-Console.md        # Apify deployment guide
│   ├── Architecture.md         # System architecture
│   └── PROJECT_SUMMARY.md      # This file
├── input/                       # Input documents (local mode)
│   ├── test_document.txt       # Test file
│   ├── docx_checkboxes.docx    # Sample DOCX
│   └── docx_external_image.docx # Sample DOCX
├── output/                      # Output directory (auto-created)
├── scripts/                     # Utility scripts
│   ├── github-push.sh          # Git push automation
│   ├── setup_venv.sh           # Setup virtual environment
│   └── test_setup.sh           # Validate setup
├── .gitignore                   # Git ignore rules
├── convert_documents.py         # Local conversion script
├── main.py                      # Apify Actor entry point
├── Dockerfile                   # Container configuration for Apify
├── requirements.txt             # Python dependencies
└── README.md                    # Main documentation

After two dramatic build failures that sent me digging through the logs, the third time was the absolute charm. The configuration clicked, the input schema validated, and we got that beautiful, glowing green checkmark: Build Successful.


How to Set Up Apify

Step 1: Connect your Repository to Apify Console

Once your Docling application code (including your main.py, requirements.txt, Dockerfile, and .actor/actor.json) is safely pushed to GitHub, you need to point Apify to it so it knows where to pull the source code.

  1. Log into Apify Console: Go to console.apify.com and navigate to the Actors tab in the left-hand menu.
  2. Create a New Actor: Click the Create new button in the top right corner.
  3. Select Git Provider: Instead of starting from a blank template, look at the Source code options and select GitHub (or Git repository).
  4. Link your GitHub Account: If you haven’t done so already, authorize Apify to access your GitHub repositories. You can choose to grant access to all repositories or only the specific repository containing your Docling code.
  5. Configure the Repository Details:
  • Repository URL: Select or paste the URL of your GitHub repository (e.g., https://github.com/your-username/your-docling-actor).
  • Branch/Tag/Commit: Specify the branch you want Apify to track (usually main or master).
  • Source Directory (Optional): If your Actor code lives in a subdirectory of your repository rather than the root, specify the folder path here.
  6. Save/Create: Click Create Actor. Apify will now establish a link to your repository. By default, it will also set up a webhook so that whenever you push new changes to your designated GitHub branch, Apify automatically knows it needs to rebuild.

Actors overview
Actors are serverless programs to automate workflows and extract data. Each Actor takes a structured JSON input, performs a task (like web scraping, browser automation, or data processing), and can optionally produce a structured output. Actors are easy to run manually, via API, or on a schedule, and you can combine them into larger automations.

Step 2: Build the Actor

Building the Actor is the process where Apify takes your source code, pulls the base environment (like Python), installs your dependencies (such as docling and the apify SDK), and packages everything into an executable, containerized image hosted on their platform.

  1. Navigate to the Builds Tab: Inside your newly created Actor’s dashboard in the Apify Console, click on the Builds tab.
  2. Trigger a Manual Build: Click the Build button (often labeled Start build or accompanied by a play icon).
  3. Monitor the Build Log: A live console log will appear. You will see Apify:
  • Cloning your GitHub repository.
  • Parsing your environment setup from the Dockerfile in the repository.
  • Running pip install for Docling and your other dependencies. (Note: The first build might take a couple of minutes as it downloads and caches the heavier machine learning packages used by Docling).
  4. Successful Build Status: Once the log finishes scrolling, the build status will change to a green Ready or Succeeded checkmark.

Your Actor is now fully built and containerized! You can immediately test it right from the Console UI by providing an input that matches the input schema (like an array of .docx file URLs), or trigger it programmatically using the Apify API, Python SDK, or CLI.
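To make the "trigger it programmatically" option more concrete, here is a minimal sketch that assembles the JSON input read by main.py (`urls`, `outputFormats`, `maxDocuments`) and prepares, without sending, a request to Apify's run-Actor REST endpoint. The Actor ID `your-username~your-docling-actor` and the token are placeholders, and `build_run_input` and `build_run_request` are hypothetical helper names; only the Python standard library is used.

```python
import json
import urllib.request

ALLOWED_FORMATS = {"markdown", "json", "text"}  # formats handled by main.py


def build_run_input(urls, output_formats=("markdown", "json", "text"), max_documents=100):
    """Assemble the Actor input using the same keys main.py reads."""
    unknown = set(output_formats) - ALLOWED_FORMATS
    if unknown:
        raise ValueError(f"Unsupported output formats: {sorted(unknown)}")
    if not urls:
        raise ValueError("At least one document URL is required")
    return {
        "urls": list(urls),
        "outputFormats": list(output_formats),
        "maxDocuments": max_documents,
    }


def build_run_request(actor_id, token, run_input):
    """Prepare (but do not send) a POST to the Apify run-Actor endpoint."""
    return urllib.request.Request(
        url=f"https://api.apify.com/v2/acts/{actor_id}/runs",
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


if __name__ == "__main__":
    run_input = build_run_input(["https://example.com/report.docx"], ["markdown"])
    request = build_run_request("your-username~your-docling-actor", "<APIFY_TOKEN>", run_input)
    print(request.get_method(), request.full_url)
    # Sending it is one call away: urllib.request.urlopen(request)
```

Validating the input on the client side is optional, since the Actor's input schema performs its own checks, but it surfaces typos before a run is started.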


As you’ll discover in the GitHub repository, the code implementing Docling to run on Apify is almost the same!

#!/usr/bin/env python3
"""
Docling Document Converter - Apify Actor
Converts documents using Docling and stores results in Apify Dataset.
"""

import os
from pathlib import Path
from datetime import datetime
from apify import Actor
from docling.document_converter import DocumentConverter


async def main():
    """
    Main entry point for the Apify Actor.
    Processes documents from input and stores results in Apify Dataset.
    """
    async with Actor:
        # Get input from Apify
        actor_input = await Actor.get_input() or {}

        # Configuration
        input_urls = actor_input.get('urls', [])
        output_formats = actor_input.get('outputFormats', ['markdown', 'json', 'text'])
        max_documents = actor_input.get('maxDocuments', 100)

        Actor.log.info(f'Starting Docling Document Converter Actor')
        Actor.log.info(f'Input URLs: {len(input_urls)}')
        Actor.log.info(f'Output formats: {output_formats}')
        Actor.log.info(f'Max documents: {max_documents}')

        # Initialize Docling converter
        Actor.log.info('Initializing Docling DocumentConverter...')
        converter = DocumentConverter()

        # Process documents
        processed_count = 0
        failed_count = 0

        for idx, url in enumerate(input_urls[:max_documents]):
            try:
                Actor.log.info(f'Processing document {idx + 1}/{len(input_urls[:max_documents])}: {url}')

                # Download document to temporary location
                temp_path = await Actor.download_file(url)

                if not temp_path:
                    Actor.log.warning(f'Failed to download: {url}')
                    failed_count += 1
                    continue

                # Convert document
                result = converter.convert(str(temp_path))

                # Prepare output data
                output_data = {
                    'url': url,
                    'timestamp': datetime.now().isoformat(),
                    'filename': Path(url).name,
                    'formats': {}
                }

                # Export to requested formats
                if 'markdown' in output_formats:
                    try:
                        output_data['formats']['markdown'] = result.document.export_to_markdown()
                        Actor.log.info(f'  ✓ Markdown export successful')
                    except Exception as e:
                        Actor.log.warning(f'  ⚠ Markdown export failed: {str(e)}')

                if 'json' in output_formats:
                    try:
                        output_data['formats']['json'] = result.document.export_to_dict()
                        Actor.log.info(f'  ✓ JSON export successful')
                    except Exception as e:
                        Actor.log.warning(f'  ⚠ JSON export failed: {str(e)}')

                if 'text' in output_formats:
                    try:
                        output_data['formats']['text'] = result.document.export_to_text()
                        Actor.log.info(f'  ✓ Text export successful')
                    except Exception as e:
                        Actor.log.warning(f'  ⚠ Text export failed: {str(e)}')

                # Push data to Apify Dataset
                await Actor.push_data(output_data)

                processed_count += 1
                Actor.log.info(f'  ✓ Successfully processed: {url}')

                # Clean up temporary file
                if os.path.exists(temp_path):
                    os.remove(temp_path)

            except Exception as e:
                failed_count += 1
                Actor.log.error(f'  ✗ Error processing {url}: {str(e)}')

        # Summary
        Actor.log.info('='*60)
        Actor.log.info('Conversion Summary')
        Actor.log.info('='*60)
        Actor.log.info(f'Total documents: {len(input_urls[:max_documents])}')
        Actor.log.info(f'Successfully processed: {processed_count}')
        Actor.log.info(f'Failed: {failed_count}')
        Actor.log.info('='*60)

        # Set output
        await Actor.set_value('OUTPUT', {
            'total': len(input_urls[:max_documents]),
            'processed': processed_count,
            'failed': failed_count,
            'timestamp': datetime.now().isoformat()
        })


if __name__ == '__main__':
    import asyncio
    asyncio.run(main())

# Made with Bob
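The listing above depends on `Actor.download_file(url)` to fetch each document. That method may not be available in every version of the Apify SDK for Python, so if the call fails in your environment, a small standard-library helper like the one below could stand in. This is a sketch under that assumption: `download_to_temp` and `filename_from_url` are hypothetical names, and the helper is synchronous, so inside the Actor you might call it through `asyncio.to_thread`. Docling's `DocumentConverter.convert()` also accepts a URL as its source, which may remove the need to download at all.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path
from urllib.parse import unquote, urlparse


def filename_from_url(url):
    """Return the last path segment of a URL, ignoring query string and fragment."""
    return Path(unquote(urlparse(url).path)).name


def download_to_temp(url, timeout=60):
    """Download url into a temporary file and return its Path.

    The original file extension is preserved so the converter can use it
    as a format hint. The caller is responsible for deleting the file.
    """
    suffix = Path(filename_from_url(url)).suffix
    with urllib.request.urlopen(url, timeout=timeout) as response:
        with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
            shutil.copyfileobj(response, tmp)
            return Path(tmp.name)


if __name__ == "__main__":
    # Offline demo: a file:// URL stands in for a remote document.
    source = Path(tempfile.mkdtemp()) / "sample.docx"
    source.write_bytes(b"placeholder bytes")
    local_copy = download_to_temp(source.as_uri())
    print(local_copy.suffix, local_copy.read_bytes() == source.read_bytes())
    local_copy.unlink()
```

`filename_from_url` is also a safer source for the `filename` field than `Path(url).name`, which keeps any query string attached to the name.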

The requirements file changes slightly: it now includes the Apify SDK for Python.

# Docling - Document conversion library
docling>=2.0.0

# Apify SDK for Python
apify>=2.0.0

# Additional dependencies for document processing
python-dotenv>=1.0.0

# Made with Bob

Structural Architecture: Local vs. Apify

The primary architectural shift between the two implementations lies in how files and results are routed. The local script (convert_documents.py) relies on direct file system access: it scans a local input/ folder recursively via Path.rglob() and writes files to mirrored subdirectories on your hard drive using standard Python context managers (with open(...)). Conversely, the cloud version (main.py) operates in a stateless, serverless environment, fetching remote files from web endpoints into a transient workspace using await Actor.download_file(url). Rather than writing to local directories, it pushes the extracted content as structured records into Apify’s cloud storage using await Actor.push_data(). The entire pipeline is wrapped in an asynchronous runtime (asyncio.run(main())), because the Apify SDK exposes an async API.

  • Core Code Comparison
# --- LOCAL FILE SYSTEM EXTRACTION (convert_documents.py) ---
# Direct I/O operations checking physical directory structures
input_path = Path("input")
for doc_path in input_path.rglob(f"*{ext}"):
    result = converter.convert(str(doc_path))
    md_output = output_subdir / f"{base_name}_{timestamp}.md"
    with open(md_output, "w", encoding="utf-8") as f:
        f.write(result.document.export_to_markdown())

# --- APIFY CLOUD INGESTION (main.py) ---
# Asynchronous execution downloading remote blobs and pushing to structured datasets
async with Actor:
    actor_input = await Actor.get_input() or {}
    for url in actor_input.get('urls', []):
        temp_path = await Actor.download_file(url)
        result = converter.convert(str(temp_path))
        output_data['formats']['markdown'] = result.document.export_to_markdown()
        await Actor.push_data(output_data)  # Persists seamlessly to Apify Dataset
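Because both variants call the same three Docling export methods, that shared core could be factored into one function and reused by convert_documents.py and main.py. The sketch below is an illustration, not code from the repository: `export_document` is a hypothetical helper, and `FakeDocument` merely stands in for the `result.document` object so the example runs without Docling installed.

```python
# Maps the format names used in the Actor input to Docling export methods
EXPORTERS = {
    "markdown": "export_to_markdown",
    "json": "export_to_dict",
    "text": "export_to_text",
}


def export_document(document, formats=("markdown", "json", "text")):
    """Run the requested exports; return (outputs, errors) keyed by format."""
    outputs, errors = {}, {}
    for fmt in formats:
        method_name = EXPORTERS.get(fmt)
        if method_name is None:
            errors[fmt] = "unknown format"
            continue
        try:
            outputs[fmt] = getattr(document, method_name)()
        except Exception as exc:  # one failed export must not block the others
            errors[fmt] = str(exc)
    return outputs, errors


class FakeDocument:
    """Stand-in for Docling's document object (assumption, for the demo only)."""

    def export_to_markdown(self):
        return "# Title"

    def export_to_dict(self):
        return {"name": "demo"}

    def export_to_text(self):
        raise RuntimeError("text export unavailable")


if __name__ == "__main__":
    outputs, errors = export_document(FakeDocument())
    print(sorted(outputs), errors)
```

With a helper like this, the local script would write each entry of `outputs` to a file, while the Actor would place the same dictionary under `formats` before calling `Actor.push_data()`.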

Industrialization Synthesis

This evolution represents a classic industrialization pattern for enterprise AI pipelines. While the local version is perfect for immediate developer testing, offline script automation, and handling on-premise documents, it does not scale beyond the single machine it runs on. By offloading the baseline engine designed by IBM Bob to an Apify Actor, the implementation gains an instant API wrapper, built-in rate-limiting, and an operational dashboard. Instead of worrying about host hardware requirements, runtime dependencies, or background thread pools, you can scale document ingestion horizontally across multiple concurrent Actor runs, providing a robust front end for Generative AI applications to parse unstructured data on demand.
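One way to picture that horizontal scaling is to split a large URL list into batches and start one Actor run per batch. The sketch below covers only the batching step; `chunk_urls` and the batch size of 100 (chosen to match the default `maxDocuments` in main.py) are hypothetical choices, and starting the runs is left to whichever Apify client you use.

```python
def chunk_urls(urls, batch_size=100):
    """Split urls into consecutive batches of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]


if __name__ == "__main__":
    urls = [f"https://example.com/doc-{n}.docx" for n in range(250)]
    batches = chunk_urls(urls, batch_size=100)
    # Each batch would become the "urls" field of one Actor run's input.
    run_inputs = [{"urls": batch, "maxDocuments": len(batch)} for batch in batches]
    print([len(item["urls"]) for item in run_inputs])
```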

  • Local architecture

  • Local deployment

  • Apify architecture

  • Apify (cloud) deployment


Conclusion: The Trifecta of Intelligent Document Automation

The successful transition of this project from a local prototype to a production-grade cloud solution demonstrates how modern AI parsing tools, elastic cloud infrastructure, and context-aware generative development assistants can converge to streamline software industrialization. Below is a synthesis of the key takeaways from this implementation:

1. The Architectural Flexibility of Docling

Docling has proven itself to be a remarkably flexible, ecosystem-agnostic parsing framework. Whether it is scanning a nested, on-premise physical directory recursively using standard synchronous Python pathing (Path.rglob) or executing inside a stateless, ephemeral container fetching remote assets asynchronously (await Actor.download_file), the underlying parsing engine remains entirely stable. Its ability to seamlessly isolate structural text, tables, and document metadata allows developers to treat document ingestion as a highly portable layer. Docling can be embedded just as easily into a lightweight local command-line script as it can into a heavily orchestrated enterprise workflow, making it an ideal choice for unified document parsing across varied platform architectures.

2. Apify as a Paradigm of Modern Serverless Infrastructure

Apify serves as an exceptional execution playground that mirrors the core values of serverless container platforms like IBM Cloud Code Engine. It bridges the gap between raw script logic and scalable microservices by shielding the developer from infrastructure overhead — such as managing proxies, scaling runtimes, provisioning compute resources, or manually building multi-tenant queuing mechanisms. By exposing an intake layout through a strict JSON input schema and providing standardized storage blocks like Datasets and Key-Value stores, it establishes an elastic ecosystem where open-source runtimes can immediately connect with third-party web scrapers, downstream LLM orchestration frameworks, and external SaaS webhooks. It exemplifies how decoupling your processing code from static compute hosts allows workflows to scale horizontally on demand.

3. Amplifying Industrialization via IBM Bob

This journey underscores that a generative development assistant like IBM Bob — when provided with the precise architectural boundaries, target platform constraints, and package definitions — can dramatically accelerate software delivery. Writing custom boilerplate for specific platform SDKs, debugging runtime container boundaries, and fine-tuning API input validations often consume hours of trial-and-error. IBM Bob eliminates this friction by rapidly translating functional specifications into modular, enterprise-ready code, automating configuration overhead, and generating operational scripts. Armed with the correct engineering context, Bob acts as a massive force multiplier, transforming raw prototypes into production-grade solutions optimized for any deployment target in minutes rather than days.

Thanks for reading 👍

