Harish Kotra (he/him)

How I Built a Lightning-Fast Local RAG with Cosdata and Ollama

Retrieval-Augmented Generation (RAG) is the backbone of modern AI applications. But most tutorials send your data to the cloud (Pinecone, OpenAI) or require heavy setups.

I wanted something Private, Free, and Fast.

So I built LocalVaultQA - a 100% local RAG system using Cosdata (a new high-performance vector DB) and Ollama. Here is how I did it, and why this stack is a game-changer.

LocalVaultQA in action

The Stack

  • Ollama: The standard runtime for local LLMs. I used gemma3:12b for generation and nomic-embed-text for embeddings (a quick smoke test follows this list).
  • Cosdata: An open-source, Rust-based vector database. It's incredibly lightweight and supports Hybrid Search out of the box.
  • Streamlit: For a clean, interactive UI.
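
Before wiring anything together, it's worth a quick check that both Ollama models actually respond locally. A minimal sketch using the official `ollama` Python package (the prompts here are just illustrative):

```python
import ollama

# Embedding model: nomic-embed-text returns a 768-dimensional vector
emb = ollama.embeddings(model="nomic-embed-text", prompt="hello world")["embedding"]
print(f"embedding dims: {len(emb)}")

# Generation model: a one-shot chat round-trip
reply = ollama.chat(
    model="gemma3:12b",
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
)
print(reply["message"]["content"])
```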

Key Challenge: Authentication & Search

Cosdata is secure by default, which can be a pain for local dev. It requires an Admin Key for every request.

The Fix: I wrote a custom start-cosdata.sh setup script and override the container's command in docker-compose.yml, so the server starts with the admin key already set:

```yaml
# docker-compose.yml snippet
command: /bin/sh -c "cd /opt/cosdata && /root/cosdata-v0.1.2-beta/bin/cosdata --admin-key my_secure_password"
```

This bypasses the interactive prompt, so the database spins up effectively "headless" and is ready for connections immediately.
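
For reference, the app-side connection then looks roughly like this (a hedged sketch: the import path, constructor parameters, and collection name are my assumptions about the beta SDK, so defer to the Cosdata docs for the exact interface):

```python
# NOTE: the names below are assumptions about the beta Cosdata Python SDK
from cosdata import Client

client = Client(
    host="http://127.0.0.1:8443",   # assumed host/port exposed by the compose file
    password="my_secure_password",  # must match the --admin-key passed above
)
collection = client.get_collection("localvaultqa_docs")  # illustrative collection name
```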

The Secret Sauce: Client-Side Hybrid Search

Semantic search (dense vectors) is great for concepts, but fails on exact keywords (like part numbers or acronyms). Keyword search (BM25) is the opposite: precise on terms, blind to meaning.

Hybrid Search combines them.

Cosdata supports hybrid search, but since the Python SDK is still in beta, I implemented Reciprocal Rank Fusion (RRF) directly in the client:

```python
def query(self, question, top_k=5, mode="hybrid"):
    # (dense-only / text-only modes elided for brevity)
    # 1. Dense results: embed the question, then search the vector index
    embedding = ollama.embeddings(model="nomic-embed-text", prompt=question)["embedding"]
    dense_res = self.collection.search.dense(query_vector=embedding, top_k=top_k * 2)

    # 2. Text results: BM25-style keyword search
    text_res = self.collection.search.text(query_text=question, top_k=top_k * 2)

    # 3. Fuse with RRF: each list contributes 1 / (60 + rank) per chunk
    scores = {}
    for rank, hit in enumerate(dense_res):
        scores[hit['text']] = scores.get(hit['text'], 0) + 1.0 / (60 + rank)
    for rank, hit in enumerate(text_res):
        scores[hit['text']] = scores.get(hit['text'], 0) + 1.0 / (60 + rank)

    # 4. Sort by fused score, highest first, and return the top_k chunks
    sorted_results = sorted(scores, key=scores.get, reverse=True)
    return sorted_results[:top_k]
```

This gives us the best of both worlds, capturing "meaning" AND "precision". With the constant set to 60, a chunk ranked first in the dense list and third in the text list scores 1/60 + 1/62 ≈ 0.033, roughly double a chunk that tops only one list (1/60 ≈ 0.017).
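To close the loop, the fused chunks become the context for generation. A sketch of how that looks (here `rag` is hypothetical shorthand for whatever object exposes the `query()` method above, and the question is made up):

```python
question = "What is the warranty period for part X-200?"

# `rag` stands in for the class instance that defines query() above
contexts = rag.query(question, top_k=5)

# Ground the model: answer only from the retrieved chunks
prompt = (
    "Answer the question using ONLY the context below.\n\n"
    "Context:\n" + "\n---\n".join(contexts) +
    f"\n\nQuestion: {question}"
)

reply = ollama.chat(
    model="gemma3:12b",
    messages=[{"role": "user", "content": prompt}],
)
print(reply["message"]["content"])
```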

Why Use This?

  1. Privacy: Your financial docs, legal papers, or personal journals never leave your laptop.
  2. Speed: Cosdata is written in Rust. Indexing is near-instant.
  3. Cost: $0. No tokens, no monthly fees.

Try It Yourself

Clone the repo, run `docker-compose up -d`, and you'll have a full local RAG pipeline running on your Mac.

Check out the code on GitHub! 👇

https://github.com/harishkotra/LocalVaultQA

Tools Used:

  1. Ollama
  2. Cosdata
  3. Streamlit
