Peng Qian

Posted on • Originally published at dataleadsfuture.com

How To Use DeepSeek-OCR And Docling For PDF Parsing

This is a hands-on guide that explores how to use DeepSeek-OCR with docling for PDF parsing inside an agent application.

By reading this, you will learn how to use the DeepSeek-OCR model in real code. I will also compare DeepSeek-OCR's results with a traditional OCR model so you can see how good it actually is.

You will find the full source code for this project at the end.


Introduction

DeepSeek-OCR has been making waves lately — articles and videos are everywhere. The idea of using Contexts Optical Compression instead of text tokens sounds amazing.

But once you read those posts carefully, you find they're basically reposts of DeepSeek's official blog charts, padded with praise like "optical compression will change how LLMs understand context" and little else.

They rarely explain how to use this model in a real project.

So, is this one of those “nice in the lab but useless in the real world” things? Not for us.

Today I’ll walk you through how to use DeepSeek-OCR inside docling’s VlmPipeline to parse PDFs.

For our test file, I picked NVIDIA's FY2026 Q2 financial report. Here's a screenshot of the parsed result:

Compare the original PDF text (top) with the parsed Markdown content (bottom). Image by Author

I’ll also add cognee to run Agentic RAG on the parsed text.

Even though DeepSeek-OCR focuses on the optical compression concept, I believe accurate OCR for PDFs is the basic requirement for any VLM.

The DeepSeek team said in their official blog that they used PaddleOCR for data labeling. So later in this post, I’ll compare PDF parsing results from DeepSeek-OCR and PaddleOCR so you can see for yourself if DeepSeek-OCR’s performance lives up to the hype.

Buckle up, let’s go.


Environment Setup

Install dependencies

Before coding, let’s install the dependencies we need for docling and DeepSeek-OCR.

I’m deploying the DeepSeek-OCR model on a vLLM server, so I don’t need vLLM-related packages locally.
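If you host it yourself, the launch looks roughly like this (a sketch, not a verified command line; check vLLM's docs for the exact flags and the minimum version that supports DeepSeek-OCR):

```bash
# Serve DeepSeek-OCR behind an OpenAI-compatible API on port 8000.
# --trust-remote-code lets vLLM load the model's custom code from the Hub.
vllm serve deepseek-ai/DeepSeek-OCR --trust-remote-code --port 8000
```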

You only need to install docling, docling[vlm], and hf-xet.

For Agentic RAG with cognee, install cognee>=0.3.6 and starlette>=0.48. Watch the version number — starlette versions below 0.48 will make cognee throw errors.

We also need PaddleOCR for comparison. To make it easier, I’m using the onnxruntime version of RapidOCR. So add those two dependencies too.

Don’t worry — my project’s pyproject.toml already lists them all. Just run:

```bash
pip install --upgrade -e .
```

and you’re set.
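For reference, the dependency list described above boils down to something like this (a sketch; the package name rapidocr-onnxruntime is my assumption, and the exact pins live in the project's real pyproject.toml):

```toml
[project]
dependencies = [
    "docling[vlm]",          # docling plus its VLM pipeline extras
    "hf-xet",                # faster Hugging Face model downloads
    "cognee>=0.3.6",         # Agentic RAG layer
    "starlette>=0.48",       # older versions make cognee throw errors
    "rapidocr-onnxruntime",  # PaddleOCR models via onnxruntime
    "onnxruntime",
]
```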

Configure environment variables

I’m calling DeepSeek-OCR through an OpenAI-API-compatible endpoint (you could also use a provider that hosts the model). So in the .env file, I set up:

```bash
OCR_MODEL="deepseek-ai/DeepSeek-OCR"
OCR_API_KEY=<your api key>
OCR_BASE_URL=<your api base url>
```

I use OCR_API_KEY and OCR_BASE_URL to keep the OCR settings separate from the regular LLM client’s, but you can rename them in code.

To use cognee, you also need LLM and embedding model configs. This isn’t the focus here, so check their docs for details.

```bash
# Cognee LLM Provider setup
LLM_PROVIDER="openai"
LLM_MODEL="openai/Qwen/Qwen3-30B-A3B-Instruct-2507"
LLM_API_KEY=<your llm's api key>
LLM_ENDPOINT=<your llm's base url>
LLM_RATE_LIMIT_ENABLED="true"
LLM_RATE_LIMIT_REQUESTS="600"
LLM_RATE_LIMIT_INTERVAL="60"

# Cognee Embedding model setup
EMBEDDING_PROVIDER="custom"
EMBEDDING_MODEL="openai/BAAI/bge-m3"
EMBEDDING_DIMENSIONS="1024"
EMBEDDING_API_KEY=<your embedding model's api key>
EMBEDDING_ENDPOINT=<your embedding model's base url>
EMBEDDING_RATE_LIMIT_ENABLED="true"
EMBEDDING_RATE_LIMIT_REQUESTS="2000"
EMBEDDING_RATE_LIMIT_INTERVAL="60"
```

Integrating DeepSeek-OCR And Docling

Module design

Sure, you can find some PDF parsing examples with DeepSeek-OCR online, but most are just notebook experiments.

I want our parser to be production-ready for enterprise use, so I’ll write it with a clean, modular design.

For enterprise use, rewriting everything from scratch (splitting PDFs, converting them to images, passing them to a vision model) is a waste. We should use what we already have. In this case, that’s docling.

Here’s the flow chart for our module:

How the OCR module works. Image by Author

Cognee is optional — swap in any RAG system you like.
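In short, the module flows like this (my plain-text rendering of the chart):

```text
PDF file(s)
  -> docling VlmPipeline -> DeepSeek-OCR (OpenAI-compatible API)
  -> Markdown files in a temp folder
  -> cognee.add() + cognee.cognify()   # optional Agentic RAG indexing
  -> cognee.search(query)
```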

Start coding

We’ll organize the code with OOP, inside ocr_agentic_rag.py.

First, define an OCRAgenticRAG class. All parsing happens here.

```python
import os
import asyncio
from pathlib import Path
from tempfile import gettempdir

from dotenv import load_dotenv
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    VlmPipelineOptions
)
from docling.datamodel.pipeline_options_vlm_model import (
    ApiVlmOptions, ResponseFormat
)
from docling.document_converter import (
    DocumentConverter, PdfFormatOption
)
from docling.pipeline.vlm_pipeline import VlmPipeline
import cognee
from cognee.infrastructure.databases.vector.embeddings.config import EmbeddingConfig

from common.utils.project_path import get_project_root, get_current_directory

load_dotenv(get_project_root() / ".env")


class OCRAgenticRAG:
    ...
```

The two key methods are _openai_compatible_vlm_options and _get_docling_converter.

Since we let docling’s pipeline talk to DeepSeek-OCR itself instead of calling the API by hand with requests, _openai_compatible_vlm_options sets up the VLM client:

```python
class OCRAgenticRAG:
    ...

    @staticmethod
    def _openai_compatible_vlm_options(
            model: str = "",
            prompt: str = "Convert these pdf pages to markdown.",
            response_format: ResponseFormat = ResponseFormat.MARKDOWN,
            base_url: str = "",
            temperature: float = 0.7,
            max_tokens: int = 4096,
            api_key: str = "",
            skip_special_token: bool = False,
    ):
        ocr_model = model or os.getenv("OCR_MODEL")
        headers = {}
        if api_key:
            headers["Authorization"] = f"Bearer {api_key}"
            headers["Content-Type"] = "application/json"

        options = ApiVlmOptions(
            url=f"{base_url}/chat/completions",
            params=dict(
                model=ocr_model,
                max_tokens=max_tokens,
                skip_special_token=skip_special_token,
            ),
            headers=headers,
            prompt=prompt,
            timeout=90,
            scale=1.0,
            temperature=temperature,
            response_format=response_format,
        )
        return options
```

Next, _get_docling_converter configures pipeline_options so docling uses that VLM setup for PDF processing.

```python
class OCRAgenticRAG:
    ...

    def _get_docling_converter(
        self,
        api_key: str = "",
        base_url: str = "",
    ) -> DocumentConverter:
        pipeline_options = VlmPipelineOptions(
            enable_remote_services=True
        )
        pipeline_options.vlm_options = self._openai_compatible_vlm_options(
            api_key=api_key,
            base_url=base_url
        )
        doc_converter = DocumentConverter(
            format_options={
                InputFormat.PDF: PdfFormatOption(
                    pipeline_options=pipeline_options,
                    pipeline_cls=VlmPipeline,
                )
            }
        )
        return doc_converter
```
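One thing these snippets skip is the class constructor. A minimal sketch of what it could look like, assuming it only stores the temp directory and builds the converter from the OCR_API_KEY and OCR_BASE_URL variables (the parameter names here are my guess, not the project's real signature):

```python
class OCRAgenticRAG:
    def __init__(self, temp_dir: Path, api_key: str = "", base_url: str = ""):
        # Where the parsed Markdown files will be written.
        self.temp_dir = Path(temp_dir)
        # Build the docling converter once and reuse it for every file.
        self.converter = self._get_docling_converter(
            api_key=api_key or os.getenv("OCR_API_KEY", ""),
            base_url=base_url or os.getenv("OCR_BASE_URL", ""),
        )
```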

Once the docling PDF converter is set up, _ocr_pdf parses PDFs into Markdown files. It supports multiple files at once. To keep track of results, I save those Markdown files in a temp folder.

```python
class OCRAgenticRAG:
    ...

    def _ocr_pdf(self, source_data: str | list[str]) -> list[str]:
        if not isinstance(source_data, list):
            source_data = [source_data]

        output_files = []
        for source_file in source_data:
            result = self.converter.convert(source_file)
            markdown_str = result.document.export_to_markdown()
            source_filename = result.input.file.stem
            out_file = self._write_to_file(self.temp_dir, source_filename, markdown_str)
            output_files.append(out_file)

        return output_files
```
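The _write_to_file helper isn't shown in the post either. A minimal sketch, assuming it simply writes the Markdown into the temp folder and returns the output path (the real implementation may differ):

```python
class OCRAgenticRAG:
    ...

    @staticmethod
    def _write_to_file(temp_dir: Path, filename: str, content: str) -> str:
        # Persist the parsed Markdown so we can inspect it later.
        temp_dir.mkdir(parents=True, exist_ok=True)
        out_file = temp_dir / f"{filename}.md"
        out_file.write_text(content, encoding="utf8")
        return str(out_file)
```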

Finally, add the entry method for handling files.

```python
class OCRAgenticRAG:
    ...

    async def add(self, files: str | list[str]):
        temp_files = self._ocr_pdf(files)
        print("All the PDF files have been successfully parsed.")
```

Now docling and DeepSeek-OCR can work together. If you also want Agentic RAG with cognee, add this code:

```python
class OCRAgenticRAG:
    ...

    @staticmethod
    async def clear():
        await cognee.prune.prune_data()
        await cognee.prune.prune_system(metadata=True)

    async def add(self, files: str | list[str]):
        temp_files = self._ocr_pdf(files)
        print("All the PDF files have been successfully parsed.")

        await self.clear()
        await cognee.add(temp_files)
        await cognee.cognify()

    @staticmethod
    async def search(query: str) -> str:
        results = await cognee.search(
            query_text=query
        )
        return "\n".join([str(result) for result in results])
```

Test the code

Next, let’s write a simple main method to check how this parsing module works.

```python
if __name__ == "__main__":
    ocr_rag = OCRAgenticRAG(temp_dir=get_current_directory() / "temp")

    async def main():
        source_dir = get_current_directory() / "temp"
        pdf_files = [
            source_dir / "NVIDIAAn.pdf"
        ]
        await ocr_rag.add(pdf_files)
        result = await ocr_rag.search(query="How much was Nvidia’s revenue in Q2 Fiscal 2025?")
        print(result)

    asyncio.run(main())
```

We’ll parse the NVIDIA FY2026 Q2 Financial Report since reading financial reports is a classic AI agent use case.

We’ll search with a question to see if cognee indexed the right text:

I asked for the data for the 2025 fiscal year, but it gave me the results for the 2026 fiscal year. Image by Author

Wait — the PDF definitely has NVIDIA’s FY2025 Q2 revenue info. Why can’t cognee find it?


Evaluating DeepSeek-OCR’s PDF parsing

Since RAG missed something, I wanted to check the quality of DeepSeek-OCR’s Markdown output.

Comparing with the original PDF, I found many mistakes in tables. Multi-header tables had missing rows and columns.

DeepSeek-OCR made many mistakes and lost data when handling multi-header and cross-page tables. Image by Author

But hold on: DeepSeek’s paper reports roughly 97% OCR decoding precision (at compression ratios under 10x). It shouldn’t be this bad.

Maybe this PDF is just hard to parse? I decided to make a control group.

Since the DeepSeek team used PaddleOCR for labeling, I’ll parse the file with PaddleOCR to see if the document itself is tricky.

Docling doesn’t support PaddleOCR directly, but RapidOCR in docling is basically PaddleOCR with onnxruntime. Accuracy is a bit lower, but it’s fine for testing.

Docling’s official docs already show how to set this up. My code is in paddle_ocr_docling.py.

```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    PdfPipelineOptions,
    RapidOcrOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption

from common.utils.project_path import get_current_directory


def main():
    pdf_file = get_current_directory() / "temp/NVIDIAAn.pdf"

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    pipeline_options.do_table_structure = True
    pipeline_options.table_structure_options.do_cell_matching = True

    ocr_options = RapidOcrOptions(
        force_full_page_ocr=False,
    )
    pipeline_options.ocr_options = ocr_options

    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
            )
        }
    )

    doc = converter.convert(pdf_file).document
    md = doc.export_to_markdown()

    with open(get_current_directory() / "temp/NVIDIAAn_rapid.md", "w", encoding="utf8") as f:
        f.write(md)


if __name__ == "__main__":
    main()
```

Let’s check the Markdown from RapidOCR, especially the tables.

RapidOCR’s table data parsing is super accurate. Image by Author

Looks like RapidOCR’s tables are much more accurate, with no major missing or misplaced data.


Conclusion

DeepSeek’s popularity means that whenever the team drops something new, the internet goes wild.

The Contexts Optical Compression idea in DeepSeek-OCR, if proven viable, would spark another wave of innovation for LLMs.

But most online content is just theory talk. Few actually show you how to use DeepSeek-OCR in a real project.

This tutorial aimed to show how to integrate DeepSeek-OCR with docling and other open-source tools to parse PDFs.

Still, as an OCR model, DeepSeek-OCR’s table handling needs work for both text-based and scanned PDFs. This limits its use in data science.

I hope to see more practical experiments on where DeepSeek-OCR shines. If you have thoughts, drop me a comment.
