Gao Dalie (Ilyass)

DeepSeek-V3.2 + DocLing + Agentic RAG: Parse Any Document with Ease

If you’ve been following open-source language modelling, you know it has become a highly competitive field. Every few months a new model comes out claiming to break old limits, and some of them truly do.

Just two days ago, after quietly finishing my exams, I was scrolling online late at night when DeepSeek, as always, sent a shockwave through the AI community.

DeepSeek launched its latest model built for agents, “DeepSeek-V3.2,” along with its high-performance version, Speciale.

These models significantly improve reasoning capabilities by combining technological innovations such as efficient sparse attention and large-scale reinforcement learning.

DeepSeek-V3.2 can go head-to-head with GPT-5, while Speciale, which combines long-term thinking and theorem-proving capabilities, performs comparably to Gemini-3.0-Pro. One reader commented, “This model shouldn’t be called V3.2; it should be called V4.”

In particular, the Speciale version achieved “gold-medal performance” at the 2025 IMO, IOI, and ICPC World Championships, placing in the top two at the ICPC World Championships and in the top ten at the IOI.

As part of my research and development, I needed to extract text data from PDFs as accurately as possible. In the past, I have extracted text from PDFs using PyMuPDF or the OCR engine Tesseract.

These are powerful tools that have been used in many projects for many years. However, with the PDFs I was working with, I kept running into the same problem: the extracted output was flat, unstructured text that lost headings, tables, and layout.

Docling, an open-source library developed by IBM Research, is an effective solution to these challenges: it parses documents such as PDFs and Word files, preserves their structure, and converts them into Markdown.

So, let me give you a quick demo of a live chatbot to show you what I mean.

Link

I’ll upload an Ocean AI PDF and ask the chatbot a question: “What is Ocean AI, and why is Ocean AI different from OpenAI?”

If you look at how the chatbot generates the output, you’ll see that the agent first runs a relevance check to determine whether the question is actually related to your uploaded documents. If it’s not relevant, the agent immediately rejects the question instead of generating a hallucinated answer.

For relevant questions, the agent parses the documents into structured formats such as Markdown or JSON, then performs hybrid retrieval using both BM25 keyword search and vector embeddings to find the most relevant sections, even across multiple documents.

The Research Agent uses this retrieved content to generate an answer, and then the Verification Agent cross-checks the response against the original documents to confirm factual accuracy and catch unsupported claims or contradictions.

If verification fails, a self-correction loop automatically re-runs retrieval and research with adjusted parameters until the answer passes all checks. Once the answer is fully verified, the agent returns it. If at any point the question is found to be unrelated to the uploaded content, the agent clearly tells you instead of hallucinating.

What makes DeepSeek-V3.2 Unique?

Most powerful AI models face a common problem: as the input gets longer, execution speed drops significantly and costs rise dramatically. This is because traditional models compare every word with every other word to understand the context.

DeepSeek-V3.2 addresses this problem by introducing a new method called DeepSeek Sparse Attention (DSA). You can think of it as a researcher conducting research in a library:

  • Traditional method (dense attention): Researchers read every book on the shelf, page by page, just to answer one question. While comprehensive, this method is extremely slow and requires immense effort.
  • The new method (DeepSeek-V3.2): Researchers use a digital index (Lightning Indexer) to find key pages and read only those pages quickly. This method is just as accurate, but much faster.
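
To make the pattern concrete, here is a toy NumPy sketch of the idea, not DeepSeek's actual DSA implementation: a lightweight indexer scores every position cheaply, and exact attention is computed only over the top-k positions it selects.

# Toy illustration of the sparse-attention idea, NOT DeepSeek's actual DSA code.
# A cheap "indexer" scores all positions, keeps only the top-k, and exact
# attention runs over that small subset instead of every token.
import numpy as np

def dense_attention(q, K, V):
    # Traditional approach: score the query against every key.
    scores = K @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

def sparse_attention(q, K, V, q_idx, K_idx, k=64):
    # Step 1: a lightweight indexer (tiny projections) ranks all positions.
    rough_scores = K_idx @ q_idx
    top = np.argsort(rough_scores)[-k:]      # keep only the top-k "pages worth reading"
    # Step 2: run exact attention over the selected subset only.
    return dense_attention(q, K[top], V[top])

n, d, d_idx = 8192, 128, 16                  # long context, small indexer dimension
rng = np.random.default_rng(0)
q, K, V = rng.normal(size=d), rng.normal(size=(n, d)), rng.normal(size=(n, d))
q_idx, K_idx = rng.normal(size=d_idx), rng.normal(size=(n, d_idx))

out = sparse_attention(q, K, V, q_idx, K_idx, k=64)
print(out.shape)                             # (128,) -- same output size, far less work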

What makes Docling Unique?

The biggest reason why Docling stands out from existing tools is that its design concept is based on collaboration with generative AI, particularly RAG (Retrieval Augmented Generation).

Modern AI applications require more than just extracting text. For AI to deeply understand the content of a document and generate accurate answers, it needs to know its meaning, including:

  • Is this sentence the “abstract” or the “conclusion” of the paper?
  • This string of numbers is not just text but a “table,” so what does each cell mean?
  • What “caption” accompanies this image?

While PyMuPDF and Tesseract extract text as plain “strings,” Docling uses the power of a vision-language model (VLM) to analyse these structures and relationships and output them as a “DoclingDocument” object with rich information.

This structured data is the key to dramatically improving RAG’s retrieval and answer generation quality.
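
As a quick sketch of what that looks like in code (the file name is a placeholder), Docling’s DocumentConverter turns a PDF into structured Markdown in a few lines, which is exactly what the DocumentProcessor class later in this post relies on:

# Minimal Docling sketch: convert a document and export structured Markdown.
# "report.pdf" is a placeholder path used only for illustration.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("report.pdf")   # parse the PDF with Docling
doc = result.document                      # the structured DoclingDocument
markdown = doc.export_to_markdown()        # headings, tables and captions preserved

print(markdown[:500])                      # preview the first 500 characters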

Let’s Start Coding :

Let us now explore, step by step, how to combine DeepSeek-V3.2 + DocLing + Agentic RAG. First, we install the libraries the project depends on from its requirements file:

pip install -r requirements.txt
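
The requirements file itself is not shown in this post; a plausible version, inferred purely from the imports used below (pin versions as you see fit), would be:

docling
langchain
langchain-core
langchain-community
langchain-openai
langchain-deepseek
langchain-text-splitters
rank_bm25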

The next step is the usual one: We will import the relevant libraries, the significance of which will become evident as we proceed.

DocumentConverter: A high-level Python class designed for converting documents into a structured DoclingDocument format.

EnsembleRetriever: Ensemble retriever that aggregates and orders the results of multiple retrievers by using weighted Reciprocal Rank Fusion.
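
EnsembleRetriever is what powers the hybrid retrieval mentioned in the demo. As a hedged sketch (the FAISS store, the sentence-transformers embedding model, and the 0.4/0.6 weights are illustrative choices, not necessarily what the project ships with), combining BM25 with a vector retriever looks roughly like this:

# Hybrid retrieval sketch: BM25 keyword search + dense vector search,
# fused with weighted Reciprocal Rank Fusion via EnsembleRetriever.
# FAISS, the embedding model and the weights below are assumptions for
# illustration (they also require faiss-cpu, sentence-transformers and
# langchain-huggingface to be installed).
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_huggingface import HuggingFaceEmbeddings

def build_hybrid_retriever(chunks):
    bm25 = BM25Retriever.from_documents(chunks)
    bm25.k = 5                               # top-5 keyword matches

    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    vector = FAISS.from_documents(chunks, embeddings).as_retriever(search_kwargs={"k": 5})

    # Weighted Reciprocal Rank Fusion over both result lists
    return EnsembleRetriever(retrievers=[bm25, vector], weights=[0.4, 0.6])
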
DocLing: DocumentProcessor

I created a DocumentProcessor class that turns uploaded files into deduplicated, cached Markdown chunks. In __init__, I define the Markdown header levels to split on ("#" and "##") and create a cache directory on disk.

In process(), I first validate that the total upload size stays under the configured limit, then loop over the files. For each file I compute a SHA-256 hash of its contents; if a fresh cache entry exists for that hash, I load the chunks from a pickle file, otherwise I convert the document with Docling's DocumentConverter, export it to Markdown, split it with MarkdownHeaderTextSplitter, and save the result to the cache.

Finally, I hash each chunk's text to deduplicate content across files, log the total number of unique chunks, and return them. The whole point is that repeated queries over the same documents skip the expensive Docling conversion entirely, while cache entries automatically expire after a configurable number of days.

import os
import hashlib
import pickle
from datetime import datetime, timedelta
from pathlib import Path
from typing import List
from docling.document_converter import DocumentConverter
from langchain_text_splitters import MarkdownHeaderTextSplitter
from config import constants
from config.settings import settings
from utils.logging import logger

class DocumentProcessor:
    def __init__(self):
        self.headers = [("#", "Header 1"), ("##", "Header 2")]
        self.cache_dir = Path(settings.CACHE_DIR)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def validate_files(self, files: List) -> None:
        """Validate the total size of the uploaded files."""
        total_size = sum(os.path.getsize(f.name) for f in files)
        if total_size > constants.MAX_TOTAL_SIZE:
            raise ValueError(f"Total size exceeds {constants.MAX_TOTAL_SIZE//1024//1024}MB limit")

    def process(self, files: List) -> List:
        """Process files with caching for subsequent queries"""
        self.validate_files(files)
        all_chunks = []
        seen_hashes = set()

        for file in files:
            try:
                # Generate content-based hash for caching
                with open(file.name, "rb") as f:
                    file_hash = self._generate_hash(f.read())

                cache_path = self.cache_dir / f"{file_hash}.pkl"

                if self._is_cache_valid(cache_path):
                    logger.info(f"Loading from cache: {file.name}")
                    chunks = self._load_from_cache(cache_path)
                else:
                    logger.info(f"Processing and caching: {file.name}")
                    chunks = self._process_file(file)
                    self._save_to_cache(chunks, cache_path)

                # Deduplicate chunks across files
                for chunk in chunks:
                    chunk_hash = self._generate_hash(chunk.page_content.encode())
                    if chunk_hash not in seen_hashes:
                        all_chunks.append(chunk)
                        seen_hashes.add(chunk_hash)

            except Exception as e:
                logger.error(f"Failed to process {file.name}: {str(e)}")
                continue

        logger.info(f"Total unique chunks: {len(all_chunks)}")
        return all_chunks

    def _process_file(self, file) -> List:
        """Original processing logic with Docling"""
        if not file.name.endswith(('.pdf', '.docx', '.txt', '.md')):
            logger.warning(f"Skipping unsupported file type: {file.name}")
            return []

        converter = DocumentConverter()
        markdown = converter.convert(file.name).document.export_to_markdown()
        splitter = MarkdownHeaderTextSplitter(self.headers)
        return splitter.split_text(markdown)

    def _generate_hash(self, content: bytes) -> str:
        return hashlib.sha256(content).hexdigest()

    def _save_to_cache(self, chunks: List, cache_path: Path):
        with open(cache_path, "wb") as f:
            pickle.dump({
                "timestamp": datetime.now().timestamp(),
                "chunks": chunks
            }, f)

    def _load_from_cache(self, cache_path: Path) -> List:
        with open(cache_path, "rb") as f:
            data = pickle.load(f)
        return data["chunks"]

    def _is_cache_valid(self, cache_path: Path) -> bool:
        if not cache_path.exists():
            return False

        cache_age = datetime.now() - datetime.fromtimestamp(cache_path.stat().st_mtime)
        return cache_age < timedelta(days=settings.CACHE_EXPIRE_DAYS)

RelevanceChecker

I created a RelevanceChecker class that determines whether retrieved documents can answer a user's question by classifying them into three categories.

In __init__, I initialize a DeepSeek-V3.2 model (ChatDeepSeek with the deepseek-chat endpoint) using my API key and create a prompt template that instructs the LLM to classify passages as "CAN_ANSWER" (fully answers the question), "PARTIAL" (mentions the topic but is incomplete), or "NO_MATCH" (doesn't discuss the topic at all), with the emphasis that any mention of the topic should be "PARTIAL" rather than "NO_MATCH". I then build a LangChain chain by piping prompt → LLM → string parser.

In the check() method, I take a question, a retriever object, and a k parameter (default 3) for how many top documents to analyse. I invoke the retriever with the question to get relevant chunks, returning "NO_MATCH" immediately if nothing comes back.

I print debug info showing document count and 200-character previews of the top k chunks for visibility. I combine the top k document texts into one string with double newlines, invoke the LLM chain with the question and combined content, and get back a classification string.

I validate the response is one of the three valid labels by converting to uppercase and checking against valid options, forcing "NO_MATCH" if the LLM returns something unexpected.
Finally, I return the validated classification, giving me a clear signal about whether my retriever found usable documents or if I need to fall back to alternative methods like web search.

# agents/relevance_checker.py
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_deepseek import ChatDeepSeek
from config.settings import settings

class RelevanceChecker:
    def __init__(self):
        # self.llm = ChatOpenAI(api_key=settings.OPENAI_API_KEY, model="gpt-4o")
        self.llm = ChatDeepSeek(api_key=settings.DEEPSEEK_API_KEY, model="deepseek-chat")

        self.prompt = ChatPromptTemplate.from_template(
            """
            You are given a user question and some passages from uploaded documents.

            Classify how well these passages address the user's question. 
            Choose exactly one of the following responses (respond ONLY with that label):

            1) "CAN_ANSWER": The passages contain enough explicit info to fully answer the question.
            2) "PARTIAL": The passages mention or discuss the question's topic (e.g., relevant years, facility names)
            but do not provide all the data or details needed for a complete answer.
            3) "NO_MATCH": The passages do not discuss or mention the question's topic at all.

            Important: If the passages mention or reference the topic or timeframe of the question in ANY way,
            even if incomplete, you should respond "PARTIAL", not "NO_MATCH".

            Question: {question}
            Passages: {document_content}

            Respond ONLY with "CAN_ANSWER", "PARTIAL", or "NO_MATCH".
            """
        )

        self.chain = self.prompt | self.llm | StrOutputParser()

    def check(self, question: str, retriever, k=3) -> str:
        """
        1. Retrieve the top-k document chunks from the global retriever.
        2. Combine them into a single text string.
        3. Pass that text + question to the LLM chain for classification.

        Returns: "CAN_ANSWER" or "PARTIAL" or "NO_MATCH".
        """

        print(f"[DEBUG] RelevanceChecker.check called with question='{question}' and k={k}")

        # Retrieve doc chunks from the retriever
        top_docs = retriever.invoke(question)[:k]  # Only use top k docs
        if not top_docs:
            print("[DEBUG] No documents returned from retriever.invoke(). Classifying as NO_MATCH.")
            return "NO_MATCH"

        print(f"[DEBUG] Retriever returned {len(top_docs)} docs.")

        # Show a quick snippet of each chunk for debugging
        for i, doc in enumerate(top_docs):
            snippet = doc.page_content[:200].replace("\n", "\\n")
            print(f"[DEBUG] Chunk #{i+1} preview (first 200 chars): {snippet}...")

        # Combine the top k chunk texts into one string
        document_content = "\n\n".join(doc.page_content for doc in top_docs)
        print(f"[DEBUG] Combined text length for top {k} chunks: {len(document_content)} chars.")

        # Call the LLM
        response = self.chain.invoke({
            "question": question, 
            "document_content": document_content
        }).strip()

        print(f"[DEBUG] LLM raw classification response: '{response}'")

        # Convert to uppercase, check if it's one of our valid labels
        classification = response.upper()
        valid_labels = {"CAN_ANSWER", "PARTIAL", "NO_MATCH"}
        if classification not in valid_labels:
            print("[DEBUG] LLM did not respond with a valid label. Forcing 'NO_MATCH'.")
            classification = "NO_MATCH"
        else:
            print(f"[DEBUG] Classification recognized as '{classification}'.")

        return classification

ResearchAgent

I created a ResearchAgent class that generates answers to questions using retrieved documents as context.

In __init__, I initialize the DeepSeek model (ChatDeepSeek with temperature 0.3) and create a prompt template that asks the LLM to answer questions based on the provided context, being precise and factual, with an instruction to explicitly say "I cannot answer this question based on the provided documents" if the context is insufficient.

In the generate() method, I take a question string and a list of Document objects, then extract and concatenate all document text into one context string using double newlines as separators.

I invoke the chain with the question and context, which substitutes them into the template, sends the request to DeepSeek, and returns the generated answer as a string. I wrap this in try-except to log both the answer and full context for debugging, and re-raise any exceptions that occur.

Finally, I return a dictionary containing the draft answer and the context used, giving me both the generated response and traceability of what source material was used to create it.

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from typing import Dict, List
from langchain_core.documents import Document
from langchain_deepseek import ChatDeepSeek
from config.settings import settings
import logging

logger = logging.getLogger(__name__)

class ResearchAgent:
    def __init__(self):
        """Initialize the research agent with the OpenAI model."""
        # self.llm = ChatOpenAI(
        #     model="gpt-4-turbo",
        #     temperature=0.3,
        #     api_key=settings.OPENAI_API_KEY  # Pass the API key here
        # )
        self.llm = ChatDeepSeek(
            model="deepseek-chat",
            temperature=0.3,
            api_key=settings.DEEPSEEK_API_KEY  # Pass the API key here
        )
        self.prompt = ChatPromptTemplate.from_template(
            """Answer the following question based on the provided context. Be precise and factual.

            Question: {question}

            Context:
            {context}

            If the context is insufficient, respond with: "I cannot answer this question based on the provided documents."
            """
        )

    def generate(self, question: str, documents: List[Document]) -> Dict:
        """Generate an initial answer using the provided documents."""
        context = "\n\n".join([doc.page_content for doc in documents])

        chain = self.prompt | self.llm | StrOutputParser()
        try:
            answer = chain.invoke({
                "question": question,
                "context": context
            })
            logger.info(f"Generated answer: {answer}")
            logger.info(f"Context used: {context}")
        except Exception as e:
            logger.error(f"Error generating answer: {e}")
            raise

        return {
            "draft_answer": answer,
            "context_used": context
        }

Verification Agent

I created a VerificationAgent class that fact-checks AI-generated answers against source documents to catch hallucinations. In __init__, I initialise a DeepSeek-V3.2 model with temperature 0 (fully deterministic), create a prompt template that instructs the LLM to verify four aspects (direct factual support, unsupported claims, contradictions, and relevance to the question) in a structured response format, and then build a LangChain chain.

In check(), I take an answer string and a list of Document objects, concatenate all the document text into one context string separated by double newlines, invoke the chain with the answer and context to get a verification report, log both the report and the context for debugging inside a try-except block, and return a dictionary with the verification report and the context used, for traceability.

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from typing import Dict, List
from langchain_core.documents import Document
from langchain_deepseek import ChatDeepSeek
from config.settings import settings
import logging

logger = logging.getLogger(__name__)

class VerificationAgent:
    def __init__(self):
        # self.llm = ChatOpenAI(
        #     model="gpt-4-turbo",
        #     temperature=0,
        #     api_key=settings.OPENAI_API_KEY  # Pass the API key here
        # )
        self.llm = ChatDeepSeek(
            model="deepseek-chat",
            temperature=0,
            api_key=settings.DEEPSEEK_API_KEY  # Pass the API key here
        )
        self.prompt = ChatPromptTemplate.from_template(
            """Verify the following answer against the provided context. Check for:
            1. Direct factual support (YES/NO)
            2. Unsupported claims (list)
            3. Contradictions (list)
            4. Relevance to the question (YES/NO)

            Respond in this format:
            Supported: YES/NO
            Unsupported Claims: [items]
            Contradictions: [items]
            Relevant: YES/NO

            Answer: {answer}
            Context: {context}
            """
        )

    def check(self, answer: str, documents: List[Document]) -> Dict:
        """Verify the answer against the provided documents."""
        context = "\n\n".join([doc.page_content for doc in documents])

        chain = self.prompt | self.llm | StrOutputParser()
        try:
            verification = chain.invoke({
                "answer": answer,
                "context": context
            })
            logger.info(f"Verification report: {verification}")
            logger.info(f"Context used: {context}")
        except Exception as e:
            logger.error(f"Error verifying answer: {e}")
            raise

        return {
            "verification_report": verification,
            "context_used": context
        }
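
To close the loop described at the start of this post, here is a minimal sketch of how these pieces might be wired together. The class names and result keys come from the snippets above, but the control flow, the MAX_RETRIES constant, and build_hybrid_retriever (from the earlier hybrid-retrieval sketch) are assumptions made for illustration, not the project's actual pipeline code:

# Illustrative orchestration of the agentic RAG loop described earlier.
# DocumentProcessor, RelevanceChecker, ResearchAgent and VerificationAgent
# are the classes defined above; MAX_RETRIES and build_hybrid_retriever
# (see the hybrid-retrieval sketch) are assumptions made for this sketch.

MAX_RETRIES = 2

def answer_question(question, files):
    chunks = DocumentProcessor().process(files)        # Docling parsing + caching
    retriever = build_hybrid_retriever(chunks)         # BM25 + vector retrieval

    # 1. Relevance gate: reject questions the documents cannot address.
    if RelevanceChecker().check(question, retriever) == "NO_MATCH":
        return "This question is not related to the uploaded documents."

    researcher, verifier = ResearchAgent(), VerificationAgent()
    draft = ""

    # 2. Research + verification, with a small self-correction loop.
    #    (A real loop might also adjust retrieval parameters between attempts.)
    for attempt in range(MAX_RETRIES + 1):
        docs = retriever.invoke(question)
        draft = researcher.generate(question, docs)["draft_answer"]
        report = verifier.check(draft, docs)["verification_report"]
        if "Supported: YES" in report:                  # matches the report format above
            return draft
    return draft                                        # last draft if checks keep failing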

Conclusion :

DeepSeek V3.2 doesn't win by scale, but by smarter thinking. With its sparse attention mechanism, lower cost, stronger long-context awareness, and superior tool-use inference capabilities, it demonstrates how open-source models can remain competitive without massive hardware budgets.

While it may not top every benchmark, it significantly improves how users interact with AI today. And that's precisely why it stands out in a highly competitive market.

I would highly appreciate it if you
