Using Qdrant for the first time!
What is Qdrant?
Qdrant is an open-source, high-performance vector database and similarity search engine. It is specifically designed to store, index, and search high-dimensional vectors, which are crucial for modern AI applications like Retrieval-Augmented Generation (RAG) systems, recommendation engines, and image recognition.
Qdrant is built in Rust for speed and memory efficiency, offering solutions for both cloud-based and self-hosted on-premises deployments.
Motivation for this test
Having conducted different evaluations across various Retrieval-Augmented Generation (RAG) solutions, I’ve found that each vector database offers a unique set of architectural and performance trade-offs. To systematically assess Qdrant’s capabilities — particularly its renowned advanced metadata filtering and hybrid search features, which are critical for production RAG systems — I decided on a focused approach. I’m now leveraging the practical RAG pipeline example provided in the Docling documentation to deploy a local instance, allowing for a deep-dive, hands-on test of Qdrant’s performance.
TL;DR — What is Docling?
For those who might not be familiar with Docling (seriously? Do such people even exist? 😂)… Docling is an open-source package: a powerful library that simplifies document processing, parsing diverse formats — including advanced PDF understanding — and providing seamless integrations with the gen AI ecosystem. Below are the existing Docling features (a minimal conversion example follows the list):
- 🗂️ Parsing of multiple document formats incl. PDF, DOCX, PPTX, XLSX, HTML, WAV, MP3, VTT, images (PNG, TIFF, JPEG, …), and more
- 📑 Advanced PDF understanding incl. page layout, reading order, table structure, code, formulas, image classification, and more
- 🧬 Unified, expressive DoclingDocument representation format
- ↪️ Various export formats and options, including Markdown, HTML, DocTags and lossless JSON
- 🔒 Local execution capabilities for sensitive data and air-gapped environments
- 🤖 Plug-and-play integrations including LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
- 🔍 Extensive OCR support for scanned PDFs and images
- 👓 Support of several Visual Language Models (GraniteDocling)
- 🎙️ Audio support with Automatic Speech Recognition (ASR) models
- 🔌 Connect to any agent using the MCP server
- 💻 Simple and convenient CLI
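To give a feel for the library, here is a minimal sketch of a Docling conversion (illustrative only, assuming docling is installed): it converts a PDF fetched from a URL and exports it to Markdown, using the same arXiv paper referenced later in this post.
# minimal_docling_demo.py -- small illustrative sketch, not part of the original sample
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
# Convert a PDF straight from a URL (the Docling technical report used later in this post)
result = converter.convert("https://arxiv.org/pdf/2408.09869")
# Export the unified DoclingDocument representation to Markdown (first 500 characters)
print(result.document.export_to_markdown()[:500])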
Tests and Implementation
With the motivation established, and in keeping with my personal preference to build and test everything on my local laptop, the deployment phase was a pleasant surprise. Setting up Qdrant, typically a key friction point with dedicated vector databases, was remarkably smooth and straightforward (achievable in minutes with a simple Docker command). This rapid deployment meant I could skip the infrastructure headaches and move straight to integrating the actual Docling RAG pipeline logic.
- The starting point is the following page: https://qdrant.tech/documentation/quickstart/
- All you have to do is run the following ‘docker’ commands (which work perfectly with my local ‘Podman’ setup).
docker pull qdrant/qdrant
###
mkdir -p qdrant_storage
###
docker run -p 6333:6333 -p 6334:6334 \
-v "$(pwd)/qdrant_storage:/qdrant/storage:z" \
qdrant/qdrant
- Once started, Qdrant prints the URLs through which it is accessible:
docker run -p 6333:6333 -p 6334:6334 \
-v "$(pwd)/qdrant_storage:/qdrant/storage:z" \
qdrant/qdrant
           _                 _
  __ _  __| |_ __ __ _ _ __ | |_
 / _` |/ _` | '__/ _` | '_ \| __|
| (_| | (_| | | | (_| | | | | |_
 \__, |\__,_|_|  \__,_|_| |_|\__|
    |_|
Version: 1.16.2, build: d2834de0
Access web UI at http://localhost:6333/dashboard
REST API: localhost:6333
Web UI: localhost:6333/dashboard
GRPC API: localhost:6334
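- As a quick sanity check that the instance is reachable before writing any real code, a few lines with the Python client are enough (a minimal sketch, assuming the qdrant-client package is installed and Qdrant is running on the default port 6333):
# check_qdrant.py -- tiny connectivity check (illustrative, not from the quickstart)
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")
# Lists the collections currently stored in the instance (an empty list on a fresh install)
print(client.get_collections())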
- Following the above steps, I copied and pasted (and adapted a bit) the sample Python code given on the same page to build a basic application that demonstrates everything is working!
# qdrant_app.py
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct, Filter, FieldCondition, MatchValue, UpdateStatus
import time

# --- 1. Client Initialization ---
print("1. Initializing Qdrant client...")
client = QdrantClient(url="http://localhost:6333")

# --- 2. Collection Setup ---
COLLECTION_NAME = "city_vectors"
VECTOR_SIZE = 4
DISTANCE_METRIC = Distance.DOT

# Delete collection if it exists to ensure a clean run
try:
    client.delete_collection(collection_name=COLLECTION_NAME)
    print(f" Collection '{COLLECTION_NAME}' deleted (if it existed).")
except Exception:
    pass  # Ignore error if collection did not exist

# Create a new collection
print(f" Creating new collection '{COLLECTION_NAME}'...")
client.create_collection(
    collection_name=COLLECTION_NAME,
    vectors_config=VectorParams(size=VECTOR_SIZE, distance=DISTANCE_METRIC),
)
print(" Collection created successfully.")

# --- 3. Upsert (Insert) Points ---
print("\n2. Upserting data points...")
points_to_insert = [
    PointStruct(id=1, vector=[0.05, 0.61, 0.76, 0.74], payload={"city": "Berlin", "population": 3.7}),
    PointStruct(id=2, vector=[0.19, 0.81, 0.75, 0.11], payload={"city": "London", "population": 8.9}),
    PointStruct(id=3, vector=[0.36, 0.55, 0.47, 0.94], payload={"city": "Moscow", "population": 12.6}),
    PointStruct(id=4, vector=[0.18, 0.01, 0.85, 0.80], payload={"city": "New York", "population": 8.4}),
    PointStruct(id=5, vector=[0.24, 0.18, 0.22, 0.44], payload={"city": "Beijing", "population": 21.5}),
    PointStruct(id=6, vector=[0.35, 0.08, 0.11, 0.44], payload={"city": "Mumbai", "population": 20.4}),
]

operation_info = client.upsert(
    collection_name=COLLECTION_NAME,
    wait=True,  # Wait for the operation to complete
    points=points_to_insert,
)
print(f" Upsert operation status: {operation_info.status.name}")

if operation_info.status == UpdateStatus.COMPLETED:
    # A small delay ensures the index is fully ready, though `wait=True` should handle it
    time.sleep(1)
    print(f" Successfully inserted {len(points_to_insert)} points.")

# --- 4. Query 1: Nearest Neighbors Search ---
QUERY_VECTOR_1 = [0.2, 0.1, 0.9, 0.7]
print(f"\n3. Performing nearest neighbors search for query vector: {QUERY_VECTOR_1}")
search_result_1 = client.query_points(
    collection_name=COLLECTION_NAME,
    query=QUERY_VECTOR_1,
    with_payload=True,  # Return the payload (city/population)
    limit=3
).points

print(" Top 3 closest points:")
for result in search_result_1:
    print(f" - Score: {result.score:.4f}, City: {result.payload.get('city')}, Vector ID: {result.id}")

# --- 5. Query 2: Nearest Neighbors Search with Filter ---
# Filter to find the closest point *only* from London
QUERY_VECTOR_2 = [0.2, 0.1, 0.9, 0.7]
print(f"\n4. Performing filtered search (only city='London') for query vector: {QUERY_VECTOR_2}")
search_result_2 = client.query_points(
    collection_name=COLLECTION_NAME,
    query=QUERY_VECTOR_2,
    query_filter=Filter(
        must=[FieldCondition(key="city", match=MatchValue(value="London"))]
    ),
    with_payload=True,
    limit=3,
).points

print(" Closest point filtered by city='London':")
for result in search_result_2:
    print(f" - Score: {result.score:.4f}, City: {result.payload.get('city')}, Vector ID: {result.id}")
- The output is the following:
> python qdrant_app.py
1. Initializing Qdrant client...
Collection 'city_vectors' deleted (if it existed).
Creating new collection 'city_vectors'...
Collection created successfully.
2. Upserting data points...
Upsert operation status: COMPLETED
Successfully inserted 6 points.
3. Performing nearest neighbors search for query vector: [0.2, 0.1, 0.9, 0.7]
Top 3 closest points:
- Score: 1.3620, City: New York, Vector ID: 4
- Score: 1.2730, City: Berlin, Vector ID: 1
- Score: 1.2080, City: Moscow, Vector ID: 3
4. Performing filtered search (only city='London') for query vector: [0.2, 0.1, 0.9, 0.7]
Closest point filtered by city='London':
- Score: 0.8710, City: London, Vector ID: 2
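- A quick way to convince yourself the scores make sense: the collection was created with the DOT distance, so each score is simply the dot product between the query vector and the stored vector. A few lines of arithmetic (illustrative only, not part of the quickstart) reproduce the numbers above:
# score_check.py -- verifying the DOT-distance scores by hand (illustrative)
query = [0.2, 0.1, 0.9, 0.7]
new_york = [0.18, 0.01, 0.85, 0.80]
london = [0.19, 0.81, 0.75, 0.11]

def dot(a, b):
    # Plain dot product, which is exactly what Distance.DOT scores
    return sum(x * y for x, y in zip(a, b))

print(round(dot(query, new_york), 4))  # 1.362 -> matches the top unfiltered result
print(round(dot(query, london), 4))    # 0.871 -> matches the filtered 'London' result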
- The next step is to get the original sample code, provided as a Python notebook in the Docling repository, to work ⬇️
Implementation and test with Docling
As I mentioned, I used the sample provided in Docling's repository, with some slight changes:
- I use a local Ollama instance with the granite4 model.
- The source document referenced in the original sample code no longer exists, so I replaced it with a reference to an existing document.
# from
result = doc_converter.convert(
"https://www.sagacify.com/news/a-guide-to-chunking-strategies-for-retrieval-augmented-generation-rag"
)
# to
SAMPLE_PDF_URL = "https://arxiv.org/pdf/2408.09869"
- As is my usual habit, I create a dedicated “./output” folder in which the result is written in Markdown format 😎. But first of all, we need to prepare the environment.
# I figured out that my Python 3.14 was not fit for this test,
# so I made a 3.11 venv
python3.11 -m venv qdrant_env
source qdrant_env/bin/activate
- Install the required packages 📥
# requirements.txt
qdrant-client[fastembed]
docling
requests
pip install -r requirements.txt
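- Before running the full pipeline, it is worth checking that Ollama is up and that granite4 answers. Here is a minimal sketch (it assumes Ollama runs on its default port 11434 and that the model has already been pulled with ollama pull granite4):
# ollama_check.py -- quick test that the local granite4 model responds (illustrative)
import requests

payload = {
    "model": "granite4:latest",
    "prompt": "Reply with the single word: ready",
    "stream": False,
}
# /api/generate is the standard Ollama endpoint for one-shot completions
response = requests.post("http://localhost:11434/api/generate", json=payload, timeout=60)
response.raise_for_status()
print(response.json().get("response"))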
- And here is the actual code 👨💻
# docling_qdrant_app_v3.py
import os
import json
import logging
from typing import List, Dict, Any
from datetime import datetime

import requests
from qdrant_client import QdrantClient, models
from docling.chunking import HybridChunker
from docling.datamodel.base_models import InputFormat
from docling_core.types.doc import TextItem
from docling.document_converter import DocumentConverter

logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')
_log = logging.getLogger(__name__)


# --- 1. Configuration ---
class Config:
    """Central configuration for the RAG pipeline."""
    COLLECTION_NAME = "docling_rag_collection"
    # FastEmbed default model used by Qdrant
    EMBEDDING_MODEL = "all-MiniLM-L6-v2"
    # --- Ollama Configuration ---
    OLLAMA_MODEL = "granite4:latest"
    OLLAMA_API_URL = "http://localhost:11434/api/generate"
    MAX_TOKENS = 512
    RETRIEVAL_LIMIT = 3
    # A sample PDF URL for demonstration
    SAMPLE_PDF_URL = "https://arxiv.org/pdf/2408.09869"


# --- 2. Document Processor ---
class DocumentProcessor:
    def __init__(self):
        self.chunker = HybridChunker()
        self.converter = DocumentConverter(
            allowed_formats=[
                InputFormat.PDF,
                InputFormat.DOCX,
                InputFormat.HTML
            ]
        )
        _log.info(f"DocumentConverter initialized. Allowed formats: {[f.name for f in self.converter.allowed_formats]}")

    def process_document(self, file_path: str) -> tuple[List[str], List[Dict[str, Any]]]:
        """Converts document to chunks and extracts metadata."""
        _log.info(f"Converting and processing document: {file_path}")
        try:
            result = self.converter.convert(source=file_path)
            document = result.document
        except Exception as e:
            _log.error(f"Conversion failed. Check if the file is accessible and the format is allowed. Error: {e}")
            raise

        text_chunks, metadatas = [], []
        for chunk in self.chunker.chunk(dl_doc=document, max_tokens=Config.MAX_TOKENS):
            text_chunks.append(chunk.text)
            metadatas.append(chunk.meta.export_json_dict())

        _log.info(f"Document processed into {len(text_chunks)} chunks.")
        return text_chunks, metadatas


# --- 3. Vector Store (Qdrant) ---
class VectorStore:
    def __init__(self):
        self.client = QdrantClient(location=":memory:")
        self.collection_name = Config.COLLECTION_NAME
        self.client.set_model(f"sentence-transformers/{Config.EMBEDDING_MODEL}")
        self.client.set_sparse_model("Qdrant/bm25")
        self._ensure_collection()

    def _ensure_collection(self):
        """Creates the collection if it doesn't exist."""
        if not self.client.collection_exists(collection_name=self.collection_name):
            sparse_vectors_config = self.client.get_fastembed_sparse_vector_params()
            self.client.recreate_collection(
                collection_name=self.collection_name,
                vectors_config=self.client.get_fastembed_vector_params(
                    on_disk=False
                ),
                sparse_vectors_config=sparse_vectors_config
            )
            _log.info(f"Created new collection: {self.collection_name}")

    def add_documents(self, text_chunks: List[str], metadatas: List[Dict[str, Any]]):
        """Adds text chunks and their metadata to Qdrant."""
        _ = self.client.add(
            collection_name=self.collection_name,
            documents=text_chunks,
            metadata=metadatas,
            batch_size=64,
        )
        _log.info(f"Added {len(text_chunks)} documents to Qdrant collection.")

    def search(self, query: str, limit: int = None) -> List[Dict[str, Any]]:
        """Searches the vector store for relevant documents using hybrid search."""
        limit = limit or Config.RETRIEVAL_LIMIT
        search_results = self.client.query(
            collection_name=self.collection_name,
            query_text=query,
            limit=limit
        )
        relevant_chunks = []
        for result in search_results:
            relevant_chunks.append({
                "text": result.document,
                "score": result.score,
                "metadata": result.metadata
            })
        return relevant_chunks


# --- 4. RAG Chatbot ---
class RAGChatbot:
    def __init__(self):
        self.document_processor = DocumentProcessor()
        self.vector_store = VectorStore()
        self.llm_initialized = False
        try:
            response = requests.get(Config.OLLAMA_API_URL.replace("/api/generate", "/"))
            response.raise_for_status()
            self.llm_initialized = True
            _log.info(f"Successfully connected to Ollama at {Config.OLLAMA_API_URL.replace('/api/generate', '')}")
        except requests.exceptions.RequestException as e:
            _log.error(f"FATAL: Could not connect to Ollama server at http://localhost:11434. Please ensure Ollama is running and the model {Config.OLLAMA_MODEL} is pulled. Error: {e}")
            raise RuntimeError("Ollama connection failed. Terminating to avoid using mock responses.")

    def load_documents(self, file_path: str):
        """Processes a document and loads it into the vector store."""
        text_chunks, metadatas = self.document_processor.process_document(file_path)
        if text_chunks:
            self.vector_store.add_documents(text_chunks, metadatas)
        else:
            _log.warning("No chunks were generated to load into Qdrant.")

    def generate_response(self, context: str, query: str) -> str:
        """Generates a response using the retrieved context from Ollama."""
        if not self.llm_initialized:
            raise RuntimeError(
                f"Cannot generate response. Ollama connection was not initialized successfully "
                f"for model {Config.OLLAMA_MODEL}."
            )

        prompt = f"""
You are a helpful assistant. Answer the user's question based ONLY on the provided context.
CONTEXT:
{context}
QUESTION: {query}
If the answer is not in the context, state that you cannot find the answer.
"""
        try:
            payload = {
                "model": Config.OLLAMA_MODEL,
                "prompt": prompt,
                "stream": False,
                "options": {
                    "temperature": 0.0,
                    "num_ctx": 4096
                }
            }
            headers = {'Content-Type': 'application/json'}
            response = requests.post(
                Config.OLLAMA_API_URL,
                data=json.dumps(payload),
                headers=headers,
                timeout=120
            )
            response.raise_for_status()
            result = response.json()
            # The 'response' field contains the generated text for /api/generate
            return result.get('response', 'Ollama model returned an empty response.')
        except requests.exceptions.RequestException as e:
            _log.error(f"Ollama API call failed. Error: {e}")
            return f"Error: Failed to get response from Ollama. Check server status and model ({Config.OLLAMA_MODEL}) availability."

    def chat_pipeline(self, query: str) -> str:
        """Performs the RAG search and generation cycle."""
        # 1. Retrieval
        relevant_chunks = self.vector_store.search(query)
        if not relevant_chunks:
            return "I could not find any relevant information in the documents loaded."

        # 2. Context Aggregation
        context_parts = []
        for i, chunk in enumerate(relevant_chunks):
            context_parts.append(f"--- CONTEXT CHUNK {i+1} ---\n{chunk['text']}")
        context = "\n\n".join(context_parts)

        # 3. Generation
        return self.generate_response(context=context, query=query)


# --- 5. Main Execution ---
if __name__ == "__main__":
    _log.info("Starting Docling + Qdrant RAG Chatbot...")
    try:
        chatbot = RAGChatbot()

        # Step 1: Load a PDF document
        _log.info(f"\n--- Loading Document: {Config.SAMPLE_PDF_URL} ---")
        chatbot.load_documents(Config.SAMPLE_PDF_URL)
        _log.info("Document successfully processed and loaded into Qdrant.")

        # Step 2: Ask a question and get a RAG response
        test_query = "How does Processing pipeline work?"

        # 1. Run the RAG pipeline
        response = chatbot.chat_pipeline(test_query)

        # 2. Setup file output directory and filename (Saves output to file)
        output_dir = "./output"
        os.makedirs(output_dir, exist_ok=True)
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
        output_filename = os.path.join(output_dir, f"rag_output_{timestamp}.md")

        # 3. Format the content to be written
        file_content = (
            f"=============================================\n"
            f"QUERY: {test_query}\n"
            f"=============================================\n"
            f"{response}\n"
            f"=============================================\n"
        )

        # 4. Write to the file
        with open(output_filename, 'w') as f:
            f.write(file_content)

        # 5. Print the output and a confirmation message
        print(f"\n=============================================")
        print(f"QUERY: {test_query}")
        print("=============================================")
        print(response)
        print("=============================================")
        _log.info(f"RAG pipeline finished. Output saved to: {output_filename}")

    except Exception as e:
        _log.error(f"Execution failed. Check your dependencies, Ollama server, and file path/URL. Error: {e}")
- Your local Qdrant volume/storage (the qdrant_storage folder created earlier).
- And through the UI, the collection info looks like this:
{"status":"green","optimizer_status":"ok","indexed_vectors_count":0,"points_count":0,"segments_count":3,"config":{"params":{"vectors":{},"shard_number":1,"replication_factor":1,"write_consistency_factor":1,"on_disk_payload":true},"hnsw_config":{"m":16,"ef_construct":100,"full_scan_threshold":10000,"max_indexing_threads":0,"on_disk":false},"optimizer_config":{"deleted_threshold":0.2,"vacuum_min_vector_number":1000,"default_segment_number":0,"max_segment_size":null,"memmap_threshold":null,"indexing_threshold":10000,"flush_interval_sec":5,"max_optimization_threads":null},"wal_config":{"wal_capacity_mb":32,"wal_segments_ahead":0,"wal_retain_closed":1},"quantization_config":null},"payload_schema":{}}
- And the output of the test is below:
=============================================
QUERY: How does Processing pipeline work?
=============================================
Based on the provided context, here's how the processing pipeline works:
1. Initial document parsing: The document is first parsed to extract its content.
2. Model pipeline: Following initial parsing, the model pipeline processes the document. This central part of the processing involves a chain of models that can be fully customized by sub-classing from an abstract baseclass (BaseModelPipeline) or cloning the default model pipeline. Key points:
- The custom pipeline class can be provided as an argument to the main document conversion methods.
- Implementations of model classes must satisfy the python Callable interface and accept an iterator over page objects, producing another iterator with additional features predicted by the models.
- Models can be added or replaced in the pipeline, and additional configuration parameters can be introduced.
3. Final output assembly: After the model pipeline processing, all prediction results are assembled into a well-defined datatype that encapsulates the converted document (defined in the docling-core auxiliary package).
4. Post-processing: The generated document object is passed through a post-processing model which:
- Detects the document language
- Corrects the reading order
- Matches figures with captions
- Labels metadata such as title, authors, and references
5. Serialization/Transformation: The final output can be serialized to JSON or transformed into a Markdown representation at the user's request.
In summary, the processing pipeline involves parsing the document, running it through a customizable model pipeline that adds features predicted by various models, assembling the results into a structured datatype, applying post-processing algorithms, and finally serializing or transforming the output as needed.
=============================================
Conclusion
In conclusion, this hands-on test successfully validated the robustness of the RAG pipeline built with Qdrant (and Ollama for local testing). More critically, it reaffirmed the undeniable power of Docling as the foundational layer. While the selection of a vector database — be it for high-volume, scalable, high-availability, or light, ad-hoc use cases — will always depend on the project's specific requirements, data quality and meticulous document preparation remain the essential stepping stone for the entire RAG stack. By efficiently handling conversion, advanced layout understanding, and intelligent hybrid chunking across a vast variety of document types, Docling establishes itself as the best-in-class tool for ensuring the semantic quality needed to drive accurate and reliable AI responses.
Thanks for reading.
Links
- Qdrant web site: https://qdrant.tech/
- Qdrant documentation: https://qdrant.tech/documentation
- Qdrant’s tutorials: https://qdrant.tech/documentation/database-tutorials
- Qdrant Repository: https://github.com/qdrant/qdrant
- Docling’s repository: https://github.com/docling-project/docling
- Docling documentation: https://docling-project.github.io/docling/
- Docling / Qdrant Sample Code: https://docling-project.github.io/docling/examples/retrieval_qdrant