Alain Airom

⚡️ Supercharge Your Document Workflows: Docling Now Unleashes the Power of NVIDIA RTX!

Combining Docling's document processing capabilities with the power of NVIDIA GPUs!

Are you tired of waiting for massive PDFs to process? Whether you are an AI researcher, a developer building the next generation of RAG (Retrieval-Augmented Generation) apps, or an enthusiast handling heavy document loads, the wait is over.

IBM Research is thrilled to announce that Docling now fully supports NVIDIA RTX GPU acceleration, bringing unprecedented speed and efficiency to your document processing pipelines.


TL;DR: What is NVIDIA RTX?

NVIDIA RTX (Ray Tracing Texel eXtreme) is a professional visual computing platform and high-end graphics brand that revolutionized digital rendering by introducing specialized hardware for real-time ray tracing and artificial intelligence. Built on modern architectures like Blackwell, Ada Lovelace, and Ampere, RTX GPUs feature dedicated RT Cores that simulate the physical behavior of light — calculating how rays bounce, reflect, and cast shadows — alongside Tensor Cores that accelerate AI tasks such as DLSS (Deep Learning Super Sampling) for frame-rate boosting. Beyond cinematic gaming, the platform provides a massive performance leap for creators and researchers, enabling “neural rendering” and high-throughput data processing that is up to six times faster than traditional CPU-based workflows.


🚀 Why This is a Game-Changer

By shifting the heavy lifting from your CPU to your NVIDIA RTX GPU, you can experience up to a 6x speedup in processing times. This isn’t just a minor tweak; it’s a performance leap that transforms how you handle:

  • Large Batches: Process thousands of pages in a fraction of the time.
  • High-Throughput Workflows: Keep your production pipelines moving at lightning speed.
  • Advanced Models: Experiment with complex document understanding models without the lag.

🛠 Getting Started in Minutes

Docling is designed to be “plug and play.” Once you have your NVIDIA drivers, CUDA Toolkit, and cuDNN installed, Docling will automatically detect and use your RTX GPU.

The 3-Step Quick Start:

  • Verify your hardware: Run `nvidia-smi` to ensure your drivers are ready.
  • Install PyTorch with CUDA support: Use the dedicated index URL to get the GPU-enabled version:

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
```

  • Run Docling:

```python
from docling.document_converter import DocumentConverter

converter = DocumentConverter()  # Automatically detects the GPU
result = converter.convert("document.pdf")
```

🏎 Advanced Performance Tuning

For those looking to squeeze every drop of power out of their hardware, Docling offers deep customization. You can explicitly set your accelerator and scale your batch sizes based on your specific RTX card:

  • RTX 5090 (32GB): High-octane processing with batch sizes of 64–128.
  • RTX 4090 (24GB): Smooth performance with batch sizes of 32–64.
  • RTX 5070 (12GB): Efficient handling with batch sizes of 16–32.

Using the vLLM pipeline on Linux can deliver approximately 4x better performance for Vision Language Models (VLM) compared to standard server setups!
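
As a minimal sketch of this kind of explicit tuning (the full auto-detecting template appears later in this post), you can pass the CUDA accelerator and batch sizes through the threaded PDF pipeline options; the batch values below assume an RTX 4090-class card:

```python
from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import ThreadedPdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.threaded_standard_pdf_pipeline import ThreadedStandardPdfPipeline

# Explicit CUDA accelerator with batch sizes sized for ~24GB of VRAM (RTX 4090 tier)
pipeline_options = ThreadedPdfPipelineOptions(
    accelerator_options=AcceleratorOptions(device=AcceleratorDevice.CUDA),
    layout_batch_size=64,
    ocr_batch_size=32,
    table_batch_size=4,
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=ThreadedStandardPdfPipeline,
            pipeline_options=pipeline_options,
        )
    }
)

result = converter.convert("document.pdf")
```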


🛠 OS-Specific Setup Comparison

While the core Docling code remains the same, the underlying environment setup differs slightly depending on your operating system.


| Feature             | Windows 10/11                                    | Linux (Ubuntu/Debian/etc.)                   |
| ------------------- | ------------------------------------------------ | -------------------------------------------- |
| **Driver Install**  | Manual download from NVIDIA website.             | Package manager (apt/dnf) or NVIDIA website. |
| **Verification**    | `nvidia-smi` in PowerShell/CMD.                  | `nvidia-smi` in Terminal.                    |
| **VLM Inference**   | `llama-server` (llama.cpp) recommended.          | `vLLM` (High performance) recommended.       |
| **Max Performance** | Possible via WSL2 (Windows Subsystem for Linux). | Native performance.                          |


💻 Installation Commands

PyTorch with CUDA Support

The command for PyTorch is generally the same across platforms, but make sure you match the CUDA toolkit version installed on your machine.

For CUDA 12.8:

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu128
```

For CUDA 13.0:

```bash
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu130
```
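
Once installed, a quick way to confirm the wheel you pulled is actually a CUDA build (and matches your toolkit) is to inspect PyTorch's own build metadata:

```python
import torch

# Confirm the installed wheel is a CUDA build and matches your toolkit
print(torch.__version__)          # e.g. a "+cu128" suffix indicates a CUDA 12.8 build
print(torch.version.cuda)         # CUDA version PyTorch was compiled against
print(torch.cuda.is_available())  # True if the driver and runtime can see the GPU
```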

GPU-Accelerated VLM Serving

This is where the platforms diverge most in terms of performance and tooling.

  • Linux (vLLM): Offers ~4x better performance than llama-server.

```bash
vllm serve ibm-granite/granite-docling-258M --host 127.0.0.1 --port 8000 --gpu-memory-utilization 0.9
```

  • Windows (llama-server):

```powershell
.\llama-server.exe --hf-repo ibm-granite/granite-docling-258M-GGUF -ngl -1 --port 8000
```
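
To sanity-check that either server is up before pointing Docling at it, you can list the models on its OpenAI-compatible endpoint (assuming the default host and port from the commands above):

```python
import json
import urllib.request

# List the models exposed by the local inference server
# (assumes the default 127.0.0.1:8000 from the serve commands above)
with urllib.request.urlopen("http://127.0.0.1:8000/v1/models") as resp:
    print(json.dumps(json.load(resp), indent=2))
```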

💡 Quick Troubleshooting Tip
If you’ve followed these steps and aren’t seeing a speedup, run this snippet in your Python environment to verify Docling can “see” your hardware:

```python
import torch

print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0)}")
```

🚀 Optimized GPU Tuning Template

This script detects your available VRAM and applies the recommended batch sizes for maximum throughput.

```python
import torch

from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import ThreadedPdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.threaded_standard_pdf_pipeline import ThreadedStandardPdfPipeline


def get_optimal_settings():
    if not torch.cuda.is_available():
        print("CUDA not found. Falling back to CPU defaults.")
        return None

    # Determine VRAM to pick the best batch size
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    print(f"Detected GPU: {torch.cuda.get_device_name(0)} ({vram_gb:.2f} GB VRAM)")

    # Tuning logic based on hardware tiers
    if vram_gb > 24:    # e.g., RTX 5090 (32GB)
        b_size = 128
    elif vram_gb >= 20: # e.g., RTX 4090 (24GB)
        b_size = 64
    else:               # e.g., RTX 5070 (12GB) or lower
        b_size = 16

    return ThreadedPdfPipelineOptions(
        accelerator_options=AcceleratorOptions(device=AcceleratorDevice.CUDA),
        ocr_batch_size=b_size,
        layout_batch_size=b_size,
        table_batch_size=4,  # Tables are memory intensive
    )


# Initialize the converter with the optimized pipeline options
pipe_opts = get_optimal_settings()

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=ThreadedStandardPdfPipeline,
            pipeline_options=pipe_opts,
        )
    }
)

# Convert your document
result = converter.convert("large_document.pdf")
print("Conversion complete!")
```

Or this simpler example (without hardware detection):

```python
import logging
import time
from pathlib import Path

from docling.datamodel.accelerator_options import AcceleratorDevice, AcceleratorOptions
from docling.datamodel.base_models import ConversionStatus, InputFormat
from docling.datamodel.pipeline_options import ThreadedPdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.threaded_standard_pdf_pipeline import ThreadedStandardPdfPipeline

_log = logging.getLogger(__name__)


def main():
    logging.basicConfig(level=logging.INFO)
    logging.getLogger("docling").setLevel(logging.WARNING)
    _log.setLevel(logging.INFO)

    data_folder = Path(__file__).parent / "../../tests/data"
    # input_doc_path = data_folder / "pdf" / "2305.03393v1.pdf"  # 14 pages
    input_doc_path = data_folder / "pdf" / "redp5110_sampled.pdf"  # 18 pages

    pipeline_options = ThreadedPdfPipelineOptions(
        accelerator_options=AcceleratorOptions(
            device=AcceleratorDevice.CUDA,
        ),
        ocr_batch_size=4,
        layout_batch_size=64,
        table_batch_size=4,
    )
    pipeline_options.do_ocr = False

    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_cls=ThreadedStandardPdfPipeline,
                pipeline_options=pipeline_options,
            )
        }
    )

    start_time = time.time()
    doc_converter.initialize_pipeline(InputFormat.PDF)
    init_runtime = time.time() - start_time
    _log.info(f"Pipeline initialized in {init_runtime:.2f} seconds.")

    start_time = time.time()
    conv_result = doc_converter.convert(input_doc_path)
    pipeline_runtime = time.time() - start_time
    assert conv_result.status == ConversionStatus.SUCCESS

    num_pages = len(conv_result.pages)
    _log.info(f"Document converted in {pipeline_runtime:.2f} seconds.")
    _log.info(f"  {num_pages / pipeline_runtime:.2f} pages/second.")


if __name__ == "__main__":
    main()
```

💡 Performance Checklist

  • Memory Monitoring: Use nvidia-smi -l 1 in your terminal while running this script to see if you can push the batch size even higher.
  • vLLM for Linux: If you are on Linux, remember that the vLLM pipeline offers roughly 4x better performance for Vision Language Models than the Windows equivalent.
  • Clear Cache: If you process massive folders, call torch.cuda.empty_cache() between files to prevent "Out of Memory" errors.
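
As a minimal sketch of that last point (the folder path is illustrative), clearing the CUDA cache between files when batch-converting a directory looks like this:

```python
from pathlib import Path

import torch
from docling.document_converter import DocumentConverter

converter = DocumentConverter()

# Illustrative input folder; point this at your own documents
for pdf_path in sorted(Path("./documents").glob("*.pdf")):
    result = converter.convert(pdf_path)
    print(f"{pdf_path.name}: {result.status}")

    # Release cached GPU memory between files to avoid "Out of Memory" errors
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
```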

To maximize the potential of the RTX 5090 (with its massive 32GB of GDDR7 VRAM), you should look beyond basic local scripts and use a dedicated vLLM server on Linux (or WSL2). This setup can provide up to 4x better performance for Vision Language Models (VLM) like granite-docling-258M.

  • Launch the vLLM server (optimized for RTX 5090): run this command in your terminal. We have adjusted the memory utilization and token limits to take advantage of the 5090's capacity.

```bash
# Optimized for 32GB VRAM
vllm serve ibm-granite/granite-docling-258M \
  --revision untied \
  --host 127.0.0.1 \
  --port 8000 \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 1024 \
  --max-num-batched-tokens 16384 \
  --enable-chunked-prefill
```
  • --revision untied: Required for compatibility with current vLLM versions and the granite-docling architecture.
  • --gpu-memory-utilization 0.9: Allots 90% of your 32GB VRAM to the model and KV cache.
  • --max-num-seqs 1024: Leverages the 5090’s massive core count to process more sequences in parallel.
  • Connect Docling to the server: once the server is running, use this Python script to route your document conversions through it. This moves the heavy "vision" tasks to the highly optimized vLLM engine.

```python
# Note: the option classes below follow the remote-VLM (OpenAI-compatible API)
# pattern from the Docling examples; exact names may vary between Docling versions.
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.datamodel.pipeline_options_vlm_model import ApiVlmOptions, ResponseFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

# 1. Point the VLM options at the local vLLM server (OpenAI-compatible endpoint)
vlm_options = ApiVlmOptions(
    url="http://127.0.0.1:8000/v1/chat/completions",
    params={"model": "ibm-granite/granite-docling-258M"},
    prompt="Convert this page to docling.",
    timeout=90,
    response_format=ResponseFormat.DOCTAGS,
)

# 2. Set the pipeline to use the server-based VLM
pipeline_options = VlmPipelineOptions(
    enable_remote_services=True,
    vlm_options=vlm_options,
)

# 3. Initialize the converter with the VLM pipeline
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        )
    }
)

# 4. Run the high-speed conversion
result = converter.convert("massive_report.pdf")
print(result.document.export_to_markdown())
```

🚀 Why this setup wins on the RTX 5090:

  • Massive Batching: Unlike standard inference, vLLM uses “PagedAttention,” which allows the RTX 5090 to handle much larger batches of pages simultaneously without crashing.
  • GDDR7 Speed: The higher memory bandwidth of the 5090 means the “prefill” stage (where the GPU reads the document page) is significantly faster.
  • Blackwell Architecture: This setup utilizes the latest CUDA 12.8 optimizations specific to the 50-series cards, ensuring you aren’t running in “legacy” mode.

If you see “Out of Memory” errors when processing extremely complex documents with many tables, simply lower the --gpu-memory-utilization to 0.8 to give the system more breathing room for workspace activations.


🛑 Stop Waiting, Start Converting

The combination of NVIDIA’s parallel computing power and Docling’s sophisticated document parsing means your data is ready for AI faster than ever before.

Ready to level up? Check out the Docling GPU Support Guide for more examples and troubleshooting tips.
