NAEEM HADIQ

Originally published at Medium

Building an On-Premise RFP Response Creation Stack Utilising Docling, Ollama, DeepSeek R1 | ExtractThinker - Part 1

In today’s fast-paced business landscape, responding to Requests for Proposals (RFPs) efficiently can make or break an organization’s success. RFPs often demand the rapid gathering of information from large, diverse datasets — spanning scanned contracts, lengthy financial statements, regulatory documents, and more.

This is where a well-constructed, on-premise Document Intelligence stack becomes invaluable. By deploying such a system, you gain the ability to quickly extract, classify, and summarize critical information directly on your own infrastructure. This approach not only ensures data privacy but also dramatically reduces the time needed to prepare thorough, accurate proposals.

At the core of this workflow is the ability to classify and extract the exact information needed from each document. For example, if a proposal requires precise cost figures, project timelines, or legal compliance clauses, you can configure the system to target these specific fields.

This article showcases how you can build a fully on-premise Document Intelligence solution by combining:

  • ExtractThinker — an open-source framework orchestrating OCR, classification, and data extraction pipelines for LLMs
  • Ollama — a local deployment solution for language models like DeepSeek R1
  • Docling or MarkItDown — flexible libraries to handle document loading, OCR, and layout parsing

Whether you’re operating under strict confidentiality rules, dealing with scanned PDFs, or simply want advanced vision-based extraction, this end-to-end stack provides a secure, high-performance pipeline fully within your own infrastructure.

These Knowledge Sources are created from information extracted from documents using tools like Docling or MarkItDown.

For scenarios like RFPs, multimodal capabilities are highly advised, but depending on the complexity, a text-only model can be utilised.

In some scenarios, you can pair different models for different stages. For instance, a smaller moondream model (0.5B parameters) might handle splitting, while the DeepSeek R1 model manages classification and extraction. Many large institutions prefer deploying a single, more powerful model (e.g., Llama 3.3 or Qwen 2.5 in the 70B range) to cover all use cases. If you only need English-centric IDP, you could simply use R1 for most tasks and keep a lightweight moondream model on standby for edge-case splitting. It all depends on your specific requirements and available infrastructure.
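
As a rough illustration, that pairing can be expressed with the same ExtractThinker building blocks shown in full later in this article; the model tags below are examples and depend on what you have pulled locally:

from extract_thinker import Extractor, Process, ImageSplitter

# A small vision model handles the page-splitting step...
process = Process()
process.load_splitter(ImageSplitter(model="ollama/moondream:v2"))

# ...while a larger reasoning model handles classification and extraction.
extractor = Extractor()
extractor.load_llm("ollama/deepseek-r1")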

2. Processing Documents: MarkItDown vs. Docling

For document parsing, two popular libraries stand out:

MarkItDown

  • Simpler and more lightweight; developed and maintained by Microsoft
  • Perfect for direct text-based tasks where you don’t require multiple OCR engines
  • Easy to install and integrate

Docling

  • More advanced, with support for multiple OCR engines (e.g., EasyOCR and Tesseract)
  • Excellent for scanning workflows or robust extraction from image PDFs
  • Detailed documentation, flexible for complex layouts

ExtractThinker lets you swap in either DocumentLoaderMarkItDown or DocumentLoaderDocling depending on your needs—simple digital PDFs or multi-engine OCR.
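
Outside of ExtractThinker, it can be useful to sanity-check parsing quality with each library on its own. The sketch below follows the basic usage patterns published by the markitdown and docling packages; treat the file path and package versions as assumptions for your environment:

# MarkItDown: convert a document straight to markdown/text
from markitdown import MarkItDown

md = MarkItDown()
md_result = md.convert("sample_rfp.pdf")   # hypothetical file path
print(md_result.text_content)

# Docling: richer, layout-aware conversion
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
docling_result = converter.convert("sample_rfp.pdf")
print(docling_result.document.export_to_markdown())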

3. Running Local Models

Although Ollama is a popular tool for hosting LLMs locally, there are now several solutions for on-prem deployments that can integrate seamlessly with ExtractThinker:

  • LocalAI — An open-source platform that mimics OpenAI’s API locally. It can run LLMs such as Llama 2 or Mistral on consumer-grade hardware (even CPU-only) and provides a simple endpoint to connect to.
  • OpenLLM — A project by BentoML that exposes LLMs via an OpenAI-compatible API. It’s optimized for throughput and low latency, suitable for both on-prem and cloud, and supports a wide range of open-source LLMs.
  • Llama.cpp — A lower-level approach for running Llama-family models with advanced custom configurations. Great for granular control or HPC setups, albeit with more complexity to manage.

Ollama is often the first choice, and it is my preferred option for this tutorial, thanks to its ease of setup and simple CLI.
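
If you go the Ollama route, serving DeepSeek R1 locally typically comes down to a couple of commands (the exact model tag and size variant depend on what is available in your Ollama library):

ollama pull deepseek-r1     # download the model (pick a size tag if needed, e.g. deepseek-r1:8b)
ollama run deepseek-r1      # quick interactive sanity check
# The Ollama API is then available at http://localhost:11434 by default.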

Quick Link to Running Ollama on Mac: https://blog.nhadiq.me/running-deep-seek-models-on-mac-quickly-e6ca9a9bc439

4. Tackling small context windows

When working with local models that have limited context windows (e.g., ~8K tokens or less), it becomes critical to manage two things: how you split documents and how you handle the model’s responses.

Splitting Documents

To avoid exceeding the model’s input capacity, Lazy Splitting is ideal. Rather than ingesting the entire document at once:

  • You incrementally compare pages (e.g., pages 1–2, then 2–3), deciding if they belong to the same sub-document.
  • If they do, you keep them together for the next step. If not, you start a new segment.
  • This approach is memory-friendly and scales to very large PDFs by only loading and analyzing a couple of pages at a time (see the sketch below).
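
The idea can be pictured with a small, purely conceptual sketch; this is not ExtractThinker’s internal implementation, and belongs_together stands in for the LLM call that judges whether two adjacent pages continue the same document:

# Conceptual sketch of lazy (incremental) splitting -- illustration only.
def lazy_split(pages, belongs_together):
    """pages: list of page texts; belongs_together(prev, curr) -> bool."""
    if not pages:
        return []
    segments = [[pages[0]]]
    for prev_page, curr_page in zip(pages, pages[1:]):
        if belongs_together(prev_page, curr_page):
            segments[-1].append(curr_page)   # same sub-document, keep accumulating
        else:
            segments.append([curr_page])     # boundary detected, start a new segment
    return segments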


Handling Partial Responses

For smaller local models, each response also risks truncation if the prompt is large. PaginationHandler elegantly addresses this by:

  • Splitting the document’s pages for separate requests (one page per request).
  • Merging page-level results at the end, with optional conflict resolution if pages disagree on certain fields.

Note: Concatenate is ideal when you have a higher token allowance; Paginate is preferred for limited windows.
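
The page-merging step described above can also be pictured conceptually; again, this is an illustration of the idea rather than ExtractThinker’s PaginationHandler code:

# Conceptual sketch of merging page-level results -- illustration only.
def merge_page_results(pages):
    """pages: list of per-page dicts of extracted fields."""
    merged, conflicts = {}, {}
    for page in pages:
        for field, value in page.items():
            if value in (None, ""):
                continue                      # skip empty values
            if field in merged and merged[field] != value:
                conflicts.setdefault(field, {merged[field]}).add(value)
            else:
                merged.setdefault(field, value)
    return merged, conflicts                  # conflicts can be re-asked or resolved by rules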

Quick Example Flow

  1. Lazy Split the PDF so each chunk/page remains below your model’s limit.
  2. Paginate across pages: each chunk’s result is returned separately.
  3. Merge the partial page results into the final structured data.

This minimal approach ensures you never exceed the local model’s context window both in how you feed the PDF and in how you handle multi-page responses.

5. ExtractThinker: Building the stack

Below is a minimal code snippet showing how to integrate these components. First, install ExtractThinker:

pip install extract-thinker

Document Loader

As discussed above, we can use MarkItDown or Docling.

from extract_thinker import DocumentLoaderMarkItDown, DocumentLoaderDocling

# DocumentLoaderDocling or DocumentLoaderMarkItDown
document_loader = DocumentLoaderDocling()

Defining Contracts

We use Pydantic-based Contracts to specify the structure of data we want to extract. For example, Previous RFPs and Capability Statements:

from extract_thinker.models.contract import Contract
from pydantic import Field

class PreviousRFP(Contract):
    rfp_id: str = Field(description="Unique RFP ID")
    rfp_name: str = Field(description="RFP name including identifiers")
    rfp_summary: str = Field(description="Summary of the submitted RFP")

class CapabilityStatements(Contract):
    name: str = Field(description="Capability name")
    experience: str = Field(description="Details on prior experience")
    no_of_years: int = Field(description="Years of experience")
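
Since Contracts are Pydantic models under the hood, you can sanity-check one with sample data before wiring it into the pipeline. The values below are invented for illustration, and the example assumes a Pydantic v2 environment (use .dict() instead of .model_dump() on v1):

# Quick sanity check with made-up values -- a Contract validates like any Pydantic model.
sample = PreviousRFP(
    rfp_id="RFP-2024-017",
    rfp_name="Data Platform Modernisation RFP-2024-017",
    rfp_summary="Bid covering migration of legacy reporting to a managed data platform.",
)
print(sample.model_dump())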

Classification

If you have multiple document types, define Classification objects. You can specify:

  • The name of each classification (e.g., “PreviousRFP”).
  • A description.
  • The contract it maps to.

from extract_thinker import Classification

TEST_CLASSIFICATIONS = [
    Classification(
        name="PreviousRFP",
        description="This is an RFP document",
        contract=PreviousRFP
    ),
    Classification(
        name="Capabilities",
        description="This is a capability document",
        contract=CapabilityStatements
    )
]

Putting It All Together: Local Extraction Process

Below, we create an Extractor that uses our chosen document_loader and a local model (Ollama, LocalAI, etc.). Then we build a Process to load, classify, split, and extract in a single pipeline.

import os
from dotenv import load_dotenv

from extract_thinker import (
    Extractor,
    Process,
    Classification,
    SplittingStrategy,
    CompletionStrategy,
    ImageSplitter,
    TextSplitter
)

# Load environment variables (if you store LLM endpoints/API_BASE, etc. in .env)
load_dotenv()

# Example path to a multi-page document
MULTI_PAGE_DOC_PATH = "path/to/your/doc.pdf"

def setup_local_process():
    """
    Helper function to set up an ExtractThinker process
    using local LLM endpoints (e.g., Ollama, LocalAI, OnPrem.LLM, etc.)
    """

    # 1) Create an Extractor
    extractor = Extractor()

    # 2) Attach our chosen DocumentLoader (Docling or MarkItDown)
    extractor.load_document_loader(document_loader)

    # 3) Configure your local LLM
    # For Ollama, you might do:
    os.environ["API_BASE"] = "http://localhost:11434" # Replace with your local endpoint
    extractor.load_llm("ollama/deepseek-r1") # or "ollama/llama3.3" or your local model

    # 4) Attach extractor to each classification
    TEST_CLASSIFICATIONS[0].extractor = extractor
    TEST_CLASSIFICATIONS[1].extractor = extractor

    # 5) Build the Process
    process = Process()
    process.load_document_loader(document_loader)
    return process

def run_local_idp_workflow():
    """
    Demonstrates loading, classifying, splitting, and extracting
    a multi-page document with a local LLM.
    """
    # Initialize the process
    process = setup_local_process()

    # (Optional) You can use ImageSplitter(model="ollama/moondream:v2") for the split
    process.load_splitter(TextSplitter(model="ollama/deepseek-r1"))

    # 1) Load the file
    # 2) Split into sub-documents using the LAZY strategy
    # 3) Classify each sub-document with our TEST_CLASSIFICATIONS
    # 4) Extract fields based on the matched contract (PreviousRFP or CapabilityStatements)
    result = (
        process
        .load_file(MULTI_PAGE_DOC_PATH)
        .split(TEST_CLASSIFICATIONS, strategy=SplittingStrategy.LAZY)
        .extract(vision=False, completion_strategy=CompletionStrategy.PAGINATE)
    )

    # 'result' is a list of extracted objects (PreviousRFP or CapabilityStatements)
    for item in result:
        # Print or store each extracted data model
        if isinstance(item, PreviousRFP):
            print("[Extracted RFP]")
            print(f"ID: {item.rfp_id}")
            print(f"Name: {item.rfp_name}")
            print(f"Summary: {item.rfp_summary}")
        elif isinstance(item, CapabilityStatements):
            print("[Extracted Capabilities]")
            print(f"Capability: {item.name}")
            print(f"Prior experience: {item.experience}")
            print(f"Years of experience: {item.no_of_years}")

# For a quick test, just call run_local_idp_workflow()
if __name__ == "__main__":
    run_local_idp_workflow()

6. Privacy and PII: LLMs in the Cloud

Not every organization can — or wants to — run local hardware. Some prefer advanced cloud-based LLMs. If so, keep in mind:

  • Data Privacy Risks: Sending sensitive data to the cloud raises potential compliance issues.
  • GDPR/HIPAA: Regulations may restrict data from leaving your premises at all.
  • VPC + Firewalls: You can isolate cloud resources in private networks, but this adds complexity.

Note: Many LLM APIs (e.g., OpenAI) do provide GDPR compliance. But if you’re heavily regulated or want the freedom to switch providers easily, consider local or masked-cloud approaches.

PII Masking

A robust approach is building a PII masking pipeline. Tools like Presidio can automatically detect and redact personal identifiers before the text is sent to the LLM. This way, you remain model-agnostic while maintaining compliance. Alternatively, you could set up filters over specific models.
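
A minimal masking sketch with Presidio might look like the following; it assumes the presidio-analyzer and presidio-anonymizer packages (plus a spaCy language model) are installed, and uses an invented sample string:

from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

text = "Contact Jane Doe at jane.doe@example.com or +1 202-555-0134."

# Detect PII entities (names, email addresses, phone numbers, ...)
results = analyzer.analyze(text=text, language="en")

# Replace each detected span with a placeholder such as <PERSON> or <EMAIL_ADDRESS>
masked = anonymizer.anonymize(text=text, analyzer_results=results)
print(masked.text)  # the masked text is what gets sent to the LLM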

7. Conclusion

By combining ExtractThinker with a local LLM (such as Ollama, LocalAI, or OnPrem.LLM) and a flexible DocumentLoader (Docling or MarkItDown), you can build a secure, on-premise Document Intelligence workflow from the ground up. If regulatory requirements demand total privacy or minimal external calls, this stack keeps your data in-house without sacrificing the power of modern LLMs. As a next step, this can be extended further to build the storage and retrieval systems.
