Harish Kotra (he/him)

How I Built a Lightning-Fast Local RAG with Cosdata and Ollama

Retrieval-Augmented Generation (RAG) is the backbone of modern AI applications. But most tutorials send your data to the cloud (Pinecone, OpenAI) or require heavy setups.

I wanted something Private, Free, and Fast.

So I built LocalVaultQA - a 100% local RAG system using Cosdata (a new high-performance vector DB) and Ollama. Here is how I did it, and why this stack is a game-changer.

LocalVaultQA in action

The Stack

  • Ollama: The standard runtime for local LLMs. I used gemma3:12b for generation and nomic-embed-text for embeddings (a quick smoke test follows this list).
  • Cosdata: An open-source, Rust-based vector database. It's incredibly lightweight and supports Hybrid Search out of the box.
  • Streamlit: For a clean, interactive UI.
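
Before wiring anything together, it's worth a quick check that both Ollama models actually respond locally. A minimal sketch using the official `ollama` Python package (the prompts here are just illustrative):

```python
import ollama

# Embedding model: nomic-embed-text returns a 768-dimensional vector
emb = ollama.embeddings(model="nomic-embed-text", prompt="hello world")["embedding"]
print(f"embedding dims: {len(emb)}")

# Generation model: a one-shot chat round-trip
reply = ollama.chat(
    model="gemma3:12b",
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
)
print(reply["message"]["content"])
```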

Key Challenge: Authentication & Search

Cosdata is secure by default, which can be a pain for local dev. It requires an Admin Key for every request.

The Fix: I wrote a custom start-cosdata.sh setup script and override the container's command in docker-compose.yml, so the server starts with the admin key already set:

```yaml
# docker-compose.yml snippet
command: /bin/sh -c "cd /opt/cosdata && /root/cosdata-v0.1.2-beta/bin/cosdata --admin-key my_secure_password"
```

This bypasses the interactive prompt, so the database spins up effectively "headless" and is ready for connections immediately.
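
For reference, the app-side connection then looks roughly like this (a hedged sketch: the import path, constructor parameters, and collection name are my assumptions about the beta SDK, so defer to the Cosdata docs for the exact interface):

```python
# NOTE: the names below are assumptions about the beta Cosdata Python SDK
from cosdata import Client

client = Client(
    host="http://127.0.0.1:8443",   # assumed host/port exposed by the compose file
    password="my_secure_password",  # must match the --admin-key passed above
)
collection = client.get_collection("localvaultqa_docs")  # illustrative collection name
```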

The Secret Sauce: Client-Side Hybrid Search

Semantic search (dense vectors) is great for concepts, but fails on exact keywords (like part numbers or acronyms). Keyword search (BM25) is the opposite: precise on terms, blind to meaning.

Hybrid Search combines them.

Cosdata supports hybrid search, but since the Python SDK is still in beta, I implemented Reciprocal Rank Fusion (RRF) directly in the client:

```python
def query(self, question, top_k=5, mode="hybrid"):
    # (dense-only / text-only modes elided for brevity)
    # 1. Dense results: embed the question, then search the vector index
    embedding = ollama.embeddings(model="nomic-embed-text", prompt=question)["embedding"]
    dense_res = self.collection.search.dense(query_vector=embedding, top_k=top_k * 2)

    # 2. Text results: BM25-style keyword search
    text_res = self.collection.search.text(query_text=question, top_k=top_k * 2)

    # 3. Fuse with RRF: each list contributes 1 / (60 + rank) per chunk
    scores = {}
    for rank, hit in enumerate(dense_res):
        scores[hit['text']] = scores.get(hit['text'], 0) + 1.0 / (60 + rank)
    for rank, hit in enumerate(text_res):
        scores[hit['text']] = scores.get(hit['text'], 0) + 1.0 / (60 + rank)

    # 4. Sort by fused score, highest first, and return the top_k chunks
    sorted_results = sorted(scores, key=scores.get, reverse=True)
    return sorted_results[:top_k]
```

This gives us the best of both worlds, capturing "meaning" AND "precision". With the constant set to 60, a chunk ranked first in the dense list and third in the text list scores 1/60 + 1/62 ≈ 0.033, roughly double a chunk that tops only one list (1/60 ≈ 0.017).
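To close the loop, the fused chunks become the context for generation. A sketch of how that looks (here `rag` is hypothetical shorthand for whatever object exposes the `query()` method above, and the question is made up):

```python
question = "What is the warranty period for part X-200?"

# `rag` stands in for the class instance that defines query() above
contexts = rag.query(question, top_k=5)

# Ground the model: answer only from the retrieved chunks
prompt = (
    "Answer the question using ONLY the context below.\n\n"
    "Context:\n" + "\n---\n".join(contexts) +
    f"\n\nQuestion: {question}"
)

reply = ollama.chat(
    model="gemma3:12b",
    messages=[{"role": "user", "content": prompt}],
)
print(reply["message"]["content"])
```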

Why Use This?

  1. Privacy: Your financial docs, legal papers, or personal journals never leave your laptop.
  2. Speed: Cosdata is written in Rust. Indexing is near-instant.
  3. Cost: $0. No tokens, no monthly fees.

Try It Yourself

Clone the repo, run `docker-compose up -d`, and you'll have a full local RAG pipeline running on your Mac.

Check out the code on GitHub! 👇

https://github.com/harishkotra/LocalVaultQA

Tools Used:

  1. Ollama
  2. Cosdata
  3. Streamlit
