DEV Community

Julien L for WiScale


Build a local RAG pipeline in 30 lines of Python (no Docker, no API keys)

Most RAG tutorials start with "spin up Docker" and "get your API key." This one starts with pip install.

The problem

Retrieval-Augmented Generation (RAG) is the standard way to ground LLM answers in your own data. But the typical setup looks like this:

  1. Spin up a Docker container for your vector database
  2. Sign up for an API and grab your keys
  3. Configure connection strings, authentication, ports
  4. Write 100+ lines of glue code

That is a lot of infrastructure for what is conceptually simple: embed text, store vectors, search by similarity.

What if you could skip all of that?
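Conceptually, that is the whole job. Here is a back-of-the-envelope sketch in plain Python, with toy 3-dimensional vectors standing in for real embeddings and a brute-force scan standing in for a proper index like HNSW:

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product divided by the product of magnitudes
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# "Store": a list of (id, vector) pairs; toy vectors stand in for real embeddings
store = [
    (0, [1.0, 0.0, 0.0]),
    (1, [0.9, 0.1, 0.0]),
    (2, [0.0, 1.0, 0.0]),
]

# "Search": rank every stored vector by similarity to the query vector
q = [1.0, 0.05, 0.0]
ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
print([doc_id for doc_id, _ in ranked])  # -> [0, 1, 2]
```

A vector database earns its keep by doing this at scale: approximate indexes instead of a linear scan, persistence instead of a Python list. But the mental model stays this small.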

The 30-line local RAG pipeline

Here is a complete RAG pipeline. No Docker. No API keys. No cloud. Just Python.

pip install velesdb sentence-transformers
from sentence_transformers import SentenceTransformer
from velesdb import Database

# Load embedding model (runs locally, no API key)
model = SentenceTransformer("all-MiniLM-L6-v2")

# Open a local database (just a folder on disk)
db = Database("./rag_data")
collection = db.get_or_create_collection("documents", dimension=384)

# Your documents (from a file, a PDF, a web scrape, anything)
documents = [
    "VelesDB combines vector, graph, and columnar storage in one binary.",
    "HNSW indexing enables fast approximate nearest neighbor search.",
    "The Python SDK supports both vector search and graph traversal.",
    "RAG pipelines retrieve relevant context before generating answers.",
    "Local-first means no network dependency and full data privacy.",
    "Embeddings capture semantic meaning as high-dimensional vectors.",
]

# Embed and store each chunk
for i, doc in enumerate(documents):
    embedding = model.encode(doc).tolist()
    collection.upsert(i, vector=embedding, payload={"text": doc})

# Search with natural language
query = "How does vector search work?"
query_vec = model.encode(query).tolist()
results = collection.search(vector=query_vec, top_k=3)

for r in results:
    print(f"[{r['score']:.4f}] {r['payload']['text']}")

That is it. 30 lines, and you have a working RAG retrieval engine running entirely on your machine.

What the results look like

When you run the query "How does vector search work?", you get ranked results with cosine similarity scores:

[0.6821] HNSW indexing enables fast approximate nearest neighbor search.
[0.5765] Embeddings capture semantic meaning as high-dimensional vectors.
[0.4938] VelesDB combines vector, graph, and columnar storage in one binary.

The model understands that "vector search" is semantically close to "nearest neighbor search" and "embeddings," even though those exact words do not appear in the query. That is the power of semantic search over keyword matching.
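You can see why keyword matching would fail here with a quick experiment. After dropping stopwords, the query and the top result share almost no words (the tiny stopword set below is just for illustration):

```python
query = "How does vector search work?"
top_result = "HNSW indexing enables fast approximate nearest neighbor search."

def content_words(text):
    # Lowercase, strip trailing punctuation, drop a few stopwords
    stop = {"how", "does", "the", "a", "an", "work"}
    words = text.lower().replace("?", "").replace(".", "").split()
    return {w for w in words if w not in stop}

overlap = content_words(query) & content_words(top_result)
print(overlap)  # -> {'search'}: a keyword matcher would rank this result low
```

One shared word out of eight, yet the embedding model still ranks it first, because "vector search" and "approximate nearest neighbor search" live close together in embedding space.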

From retrieval to generation

The retrieval step above gives you ranked chunks. To turn this into a full RAG pipeline, you need one more function that formats the results into an LLM prompt:

def build_rag_prompt(collection, model, question, top_k=3):
    query_vec = model.encode(question).tolist()
    results = collection.search(vector=query_vec, top_k=top_k)
    context = "\n".join(r["payload"]["text"] for r in results)
    return f"""Answer the question using only the context below.

Context:
{context}

Question: {question}
Answer:"""

prompt = build_rag_prompt(collection, model, "How does vector search work?")
# Pass 'prompt' to any LLM: OpenAI, Anthropic, Ollama, llama.cpp...

You can feed this prompt to any LLM you want: OpenAI, Anthropic, a local Ollama instance. It does not matter, because the retrieval layer is completely decoupled from the generation layer.
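One way to make that decoupling concrete is to treat both layers as plain callables. This is a sketch, not a prescribed API: the stub backends below are hypothetical stand-ins you would replace with real retrieval (the collection search above) and a real LLM client.

```python
from typing import Callable

def answer(question: str,
           retrieve: Callable[[str], str],
           generate: Callable[[str], str]) -> str:
    # Retrieval and generation stay decoupled: each side is just a callable
    context = retrieve(question)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return generate(prompt)

# Hypothetical stub backends for illustration only
fake_retrieve = lambda q: "HNSW indexing enables fast approximate nearest neighbor search."
fake_generate = lambda prompt: "It uses approximate nearest neighbor search."

print(answer("How does vector search work?", fake_retrieve, fake_generate))
```

Swapping OpenAI for Ollama then means changing one function, not rewiring the pipeline.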

The comparison

|                  | Typical RAG stack                    | This approach         |
|------------------|--------------------------------------|-----------------------|
| Vector store     | Chroma (Docker) or Pinecone (cloud)  | VelesDB (pip install) |
| Setup            | Docker compose + API keys + env vars | pip install velesdb   |
| Lines of code    | 80-150                               | 30                    |
| Network required | Yes (cloud) or Docker daemon         | No                    |
| Data privacy     | Data leaves your machine             | Everything stays local |
| Binary size      | 500MB+ Docker image                  | ~6MB                  |

Scaling it up

The example above uses a small list of strings, but the pattern scales to real documents. Here is how you would chunk a text file:

def chunk_file(filepath, max_length=500):
    with open(filepath) as f:
        text = f.read()
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    for p in paragraphs:
        if len(p) <= max_length:
            chunks.append(p)
        else:
            sentences = p.split(". ")
            current = ""
            for s in sentences:
                # Only flush a non-empty chunk; otherwise a single sentence
                # longer than max_length would append an empty string
                if current and len(current) + len(s) > max_length:
                    chunks.append(current.strip())
                    current = s
                else:
                    current += ". " + s if current else s
            if current:
                chunks.append(current.strip())
    return chunks

Load, chunk, embed, store. The same pattern every time.
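If you want a quick sanity check of the splitting logic without touching disk, the first stage can be exercised on an in-memory string. Here `chunk_text` is a hypothetical string-based variant of `chunk_file`'s paragraph stage, shown only to make the behavior visible:

```python
def chunk_text(text):
    # Same first stage as chunk_file: split on blank lines, drop empties
    return [p.strip() for p in text.split("\n\n") if p.strip()]

sample = "First paragraph.\n\nSecond paragraph.\n\n\n\nThird."
print(chunk_text(sample))  # -> ['First paragraph.', 'Second paragraph.', 'Third.']
```

Note that runs of blank lines collapse cleanly because empty fragments are filtered out.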

Limitations (being honest)

This approach is not for everyone:

  • Single machine only. VelesDB is an embedded database. If you need distributed vector search across a cluster, look at Milvus or Qdrant.
  • Bring your own embeddings. VelesDB stores and searches vectors but does not generate them. You need a model like sentence-transformers or an API.
  • No built-in reranking. For production RAG, you might want a cross-encoder reranking step. That is outside VelesDB's scope.

For local development, prototyping, edge deployments, privacy-sensitive applications, and anywhere you want RAG without infrastructure, this works.

Get started

pip install velesdb sentence-transformers

Previous articles in this series:

  1. I replaced my 500MB vector database Docker stack with a 3MB embedded engine
  2. Give your AI agent a real memory in 50 lines of Python
  3. Build an MCP server that gives any LLM long-term memory

What is your current RAG stack, and how many moving parts does it have? I am curious whether "just pip install" would work for your use case.
