Some months ago I was working on a custom solution and needed to add RAG to it. The requirements were simple but not flexible: everything had to run locally, and it had to be deployable in Docker alongside the rest of the services. After looking at some options, I chose Qdrant, and after experimenting with it I can say it was a good decision.
I know there are more complete solutions to add RAG to a local LLM setup. Frameworks like LangChain or LlamaIndex already abstract most of what I will describe here. But my requirements were not complex, and I did not want to add more dependencies and abstractions on top of a stack I already understand. Keeping things explicit made more sense for this project.
This article explains what I learned. It is not a deep technical guide; it is more a conceptual explanation for developers who want to understand how Qdrant and Ollama work together before they start coding.
Why Run Everything Local?
My client did not want documents leaving their network, so I did not have much to think about. But even before I started the project, I was already curious about local LLMs. I wanted to understand how far you can go without depending on external services.
The answer is: pretty far. The models available through Ollama are good enough for most practical use cases, and tools like Qdrant make the infrastructure side simple. The cost of "running local" is much lower than I expected, both in setup time and in hardware requirements.
The tradeoff is real though. A local 7B model is not going to perform like GPT-4. For this project that was fine, because the task is retrieval and summarization, not complex reasoning. The model just needs to read some context and write a coherent answer, and for that, smaller models work well.
What is RAG?
RAG is not a new idea; it has been a well-known pattern for a while now. But it is very useful for this type of use case, and it is worth understanding how it works before you start connecting the tools.
A standard LLM only knows what it learned during training, and it can only answer questions from that knowledge. If you ask it about your internal documents, your company wiki, or a PDF you have, it has no idea about that content.
RAG solves this by adding a retrieval step before the model generates an answer: it searches your documents, finds the relevant parts, and gives them to the model as context. The model then uses that context to write the answer, so the response is based on your real data and not just what the model learned before, which reduces hallucinations a lot.
The steps are:
- Index your documents - split them into small pieces, convert each piece into a vector (a numerical representation of its meaning), and store those vectors in a vector database.
- Receive a question - convert the question into a vector using the same embedding model.
- Search - find the stored pieces whose vectors are most similar to the question vector.
- Build the prompt - put the found pieces as context before the question and pass everything to the LLM.
- Generate the answer - the model reads the context and responds based on it.
```
User Question
     │
     ▼
[Embedding Model] ──► Question Vector
                           │
                           ▼
[Qdrant] ── similarity search ──► Top-k Chunks
                                       │
     ┌─────────────────────────────────┘
     ▼
Prompt = Context + Question
     │
     ▼
[Ollama LLM]
     │
     ▼
  Answer
```
The Stack
Qdrant - The Vector Database
Qdrant is an open source vector database built for storing and searching vectors efficiently. In a RAG pipeline it works as the memory of the system: you push the document pieces into it during indexing, and when a question comes it finds the most relevant ones in milliseconds.
What I liked about it is how little friction there is to start. It runs as a single Docker container with no extra configuration, and its REST API is clean enough to use directly without a framework on top. Each stored item can also carry metadata alongside the vector, so you can filter results by things like document type, date, or source, which is useful when you have documents from different contexts in the same collection.
```shell
docker run -d -p 6333:6333 -p 6334:6334 qdrant/qdrant
```
It also comes with a web dashboard at http://localhost:6333/dashboard where you can browse your collections and inspect stored points, which is very useful when you are debugging why a particular chunk is or is not being retrieved.
Ollama - Local LLM Runtime
Ollama is the runtime that makes running language models locally feel simple. It handles model downloads, quantization, and serving, and you interact with it through a CLI or a local HTTP API that has a similar format to the OpenAI API, so most existing tools work with minimal changes.
For this RAG setup, Ollama does two things: it runs the embedding model that converts text into vectors, and the generation model that synthesizes the final answer. Having both in the same runtime keeps the stack simple: one service, one API, and no separate embedding server to manage.
Install it from ollama.com and pull the models:
```shell
ollama pull llama3.2          # generation model
ollama pull nomic-embed-text  # embedding model
```
How They Work Together
The indexing phase happens once, or when your documents change. You start by loading your documents. In my case this was PDFs, text files, and also some MP4 files whose audio I transcribed to text before indexing. Once you have plain text, Qdrant does not care about the original format. You then split the text into overlapping chunks, typically around 512 tokens with some overlap so context is not lost at the boundaries. For each chunk, you call Ollama's embedding API to get a vector (for example 768 dimensions with nomic-embed-text) and save that vector together with the original text and any metadata into a Qdrant collection.
The query phase runs for every user question. You convert the question to a vector using the same Ollama embedding model, pass that vector to Qdrant's search API, and get back the most similar chunks. You then build a prompt by putting those chunks as context before the question, send it to the Ollama generation model, and return the answer to the user.
One important thing to understand: you must use the same embedding model for indexing and for queries, because the vector space it creates only makes sense if both document chunks and questions are embedded in the same space. If you change the model, you need to re-index everything.
Key Things to Know
Chunking
How you split the documents affects the quality of the results more than most people expect. Chunks that are too big bring too much irrelevant text and reduce retrieval precision. Chunks that are too small lose the context needed to answer the question well.
A good starting point is chunks of 512 tokens with 64 tokens of overlap. The overlap makes sure that a sentence split across a boundary is not lost entirely. For structured documents like FAQs or product specs, splitting by logical section usually works better than splitting by character count.
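To make the sliding-window idea concrete, here is a minimal chunker along those lines. It counts whitespace-separated words as a rough stand-in for model tokens, which is an approximation; a real tokenizer would be more precise:

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into overlapping chunks.

    Uses whitespace-separated words as a rough proxy for tokens;
    swap in a real tokenizer for accurate counts.
    """
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        piece = words[start:start + chunk_size]
        if piece:
            chunks.append(" ".join(piece))
        # Stop once the window has covered the end of the text
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each chunk shares its last `overlap` words with the start of the next one, so a sentence cut at a boundary still appears whole in at least one chunk.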
Embedding Model
For a local setup with Ollama, these are the common options:
| Model | Dimensions | Notes |
|---|---|---|
| nomic-embed-text | 768 | Fast, good for general English |
| mxbai-embed-large | 1024 | Better quality, needs more resources |
| nomic-embed-text-v1.5 | 768 | Supports flexible dimension reduction |
I used nomic-embed-text, not because I ran a detailed comparison, but because I had already used it some months earlier while learning RAG from a tutorial. It worked well then, and there was no reason to change. Sometimes the familiar option is good enough.
Collections in Qdrant
A collection in Qdrant is similar to a table in a relational database. When you create one you declare the vector size and the distance metric (cosine similarity is the standard for text embeddings):
```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

client = QdrantClient("http://localhost:6333")

client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)
```
Filtering by Metadata
One of the most useful features of Qdrant for RAG is the ability to filter search results by the metadata you attach to each vector. If you are building a system where different users have their own documents, you can tag each vector with a user_id and filter the search so users only retrieve their own content, without needing a separate collection for each user:
```python
from qdrant_client.models import Filter, FieldCondition, MatchValue

results = client.search(
    collection_name="docs",
    query_vector=question_embedding,
    query_filter=Filter(
        must=[FieldCondition(key="user_id", match=MatchValue(value="alice"))]
    ),
    limit=5,
)
```
A Simple Example
Here is the basic flow in Python, no framework, just the minimum to make it work end to end:
```python
import requests
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

OLLAMA_BASE = "http://localhost:11434"
EMBED_MODEL = "nomic-embed-text"
CHAT_MODEL = "llama3.2"

qdrant = QdrantClient("http://localhost:6333")

# 1. Create collection
qdrant.recreate_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

# 2. Helper to embed text
def embed(text: str) -> list[float]:
    resp = requests.post(
        f"{OLLAMA_BASE}/api/embeddings",
        json={"model": EMBED_MODEL, "prompt": text},
    )
    return resp.json()["embedding"]

# 3. Index documents
documents = [
    "Qdrant is a vector database written in Rust, designed for fast nearest-neighbor search.",
    "Ollama lets you run large language models locally with a simple CLI and REST API.",
    "RAG combines information retrieval with text generation to ground LLM answers in real data.",
]

points = [
    PointStruct(id=i, vector=embed(doc), payload={"text": doc})
    for i, doc in enumerate(documents)
]
qdrant.upsert(collection_name="docs", points=points)

# 4. Search
question = "What database should I use for semantic search?"
hits = qdrant.search(
    collection_name="docs",
    query_vector=embed(question),
    limit=2,
)

context = "\n\n".join(hit.payload["text"] for hit in hits)
prompt = f"""Answer the question using only the context below.

Context:
{context}

Question: {question}
"""

# 5. Generate answer
response = requests.post(
    f"{OLLAMA_BASE}/api/generate",
    json={"model": CHAT_MODEL, "prompt": prompt, "stream": False},
)
print(response.json()["response"])
```
In a real project you would add proper document loading (PyMuPDF for PDFs, python-docx for Word files), better chunking logic, error handling, and a web API layer, but the core logic is exactly this.
Things to Be Careful About
The most important thing is not to change the embedding model after you have already indexed your documents. Vectors from different models are not compatible, so if you switch models everything stored in Qdrant becomes useless and you need to re-index from scratch. It is a good habit to keep the model name in your configuration and treat it as part of your data schema.
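One way to make that habit concrete is a small startup check that compares the configured model against whatever the collection was indexed with. This is just a sketch; where the stored values come from (a config file, or metadata you save alongside the collection) is up to you:

```python
# Assumed configuration values; in a real project these would live in
# your app config and be written down when you first index.
EMBED_MODEL = "nomic-embed-text"
EMBED_DIM = 768

def check_embedding_config(stored_model: str, stored_dim: int) -> None:
    """Fail fast if the collection was indexed with a different
    embedding model than the one currently configured."""
    if stored_model != EMBED_MODEL or stored_dim != EMBED_DIM:
        raise RuntimeError(
            f"Collection was indexed with {stored_model} ({stored_dim}d); "
            f"re-index before querying with {EMBED_MODEL}."
        )
```

Failing loudly at startup is much cheaper than debugging silently bad retrieval results later.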
If the answers are not good, the problem is usually in the chunking. Chunks that are too big bring too much irrelevant text and the model gets confused. Chunks that are too small lose context and the answer is incomplete. Try smaller chunks with more overlap, or split by paragraph instead of by character count. This depends a lot on the type of documents you have.
The context window is also something to watch. You are passing retrieved chunks plus the question into the LLM, and if you include too many large chunks you can go over the limit. A safe approach is to retrieve 3 to 5 chunks and keep each one under 400 tokens. Also note that Ollama applies its own default context limit (configurable through the num_ctx parameter), which can be smaller than what the model itself supports, so it is worth checking what your runtime is actually using.
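A simple way to guard the budget is to stop adding chunks once an estimated token count is reached. The four-characters-per-token ratio below is a rough heuristic for English text, not a real tokenizer:

```python
def select_context(chunks: list[str], max_tokens: int = 2000) -> list[str]:
    """Keep retrieved chunks (already sorted by similarity) until the
    estimated token budget is spent.

    len(chunk) // 4 is a crude chars-per-token estimate; use a real
    tokenizer if you need accuracy.
    """
    selected, used = [], 0
    for chunk in chunks:
        estimated = len(chunk) // 4
        if used + estimated > max_tokens:
            break
        selected.append(chunk)
        used += estimated
    return selected
```

Because the chunks arrive sorted by similarity, cutting off the tail drops the least relevant material first.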
On the hardware side, a 7B model in 4-bit quantization needs around 5 to 6 GB of RAM. Adding Qdrant, which is very lightweight, and the application, the total is around 8 to 10 GB. On a 16 GB machine this is comfortable. If you have less RAM, a smaller model like phi3.5 at 3.8B parameters is a good alternative that still gives useful results.
What I Found in My Experiments
Qdrant was very simple to start with. Just run the Docker image and it works with no configuration needed. For persistent storage you only need to add a volume mount, and in a docker-compose.yml alongside the rest of the services it integrates cleanly without any special networking configuration:
```shell
docker run -d -p 6333:6333 -p 6334:6334 \
  -v $(pwd)/qdrant_storage:/qdrant/storage \
  qdrant/qdrant
```
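The docker-compose equivalent is just as short. This fragment is illustrative; service names and the surrounding stack are placeholders for your own setup:

```yaml
# Hypothetical docker-compose.yml fragment
services:
  qdrant:
    image: qdrant/qdrant
    ports:
      - "6333:6333"   # REST API and dashboard
      - "6334:6334"   # gRPC
    volumes:
      - ./qdrant_storage:/qdrant/storage
```

Other services in the same compose file can then reach it at `http://qdrant:6333` through the default network.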
The embeddings from Ollama worked well from the first test. I did not need to tune anything. nomic-embed-text already gave useful retrieval results for domain-specific documents without any changes.
Chunk size made a real difference in quality. I tested with 256, 512, and 1024 tokens. With 1024 the results had too much irrelevant surrounding text that diluted the retrieval signal, and with 256 some answers were missing important context. 512 was the best balance for the type of documents I was working with.
The Qdrant dashboard at http://localhost:6333/dashboard was also more useful than I expected. When a retrieval is not working as expected, you can open it and see exactly what is stored and what is being returned for a query. It saves a lot of time compared to adding print statements to the code.
When to Use This Stack
This setup works well for internal knowledge bases, documentation search, or any project where documents cannot leave the company network. It is also good for simple Q&A over a set of documents, or for prototyping when you do not want to pay for API calls while you are still experimenting.
It is not the best choice when you need complex reasoning; smaller local models are not as capable as GPT-4-class models for that. If your document collection is very large, with millions of vectors, Qdrant supports distributed mode, but that is a different and more complex setup. And if your project needs support for many languages, it is worth checking the embedding model benchmarks carefully before choosing one, because quality varies a lot between models.
Conclusion
When I needed to add RAG to my project, I wanted something that runs local, works in Docker, and is not too complex to set up. Qdrant was the right choice for that. Together with Ollama, the stack is straightforward: Ollama handles the models for both embedding and generation, and Qdrant handles the storage and search.
It is not the most powerful setup you can build, and I know there are more complete frameworks available. But for requirements like mine, it works very well, the setup time is short, and the result is a RAG system with no external dependencies, no token costs, and no data leaving the infrastructure.
If you are thinking about adding local RAG to a project, this is a good place to start.
References
- Qdrant Documentation
- Qdrant GitHub - 27,000+ stars as of 2025
- Ollama Official Site
- Qdrant Python Client
- nomic-embed-text on Ollama
- mxbai-embed-large on Ollama
- Qdrant 2025 Recap: Powering the Agentic Era