Most RAG tutorials start with "spin up Docker" and "get your API key." This one starts with pip install.
## The problem
Retrieval-Augmented Generation (RAG) is the standard way to ground LLM answers in your own data. But the typical setup looks like this:
- Spin up a Docker container for your vector database
- Sign up for an API and grab your keys
- Configure connection strings, authentication, ports
- Write 100+ lines of glue code
That is a lot of infrastructure for what is conceptually simple: embed text, store vectors, search by similarity.
What if you could skip all of that?
## The 30-line local RAG pipeline
Here is a complete RAG pipeline. No Docker. No API keys. No cloud. Just Python.
```bash
pip install velesdb sentence-transformers
```
```python
from sentence_transformers import SentenceTransformer
from velesdb import Database

# Load embedding model (runs locally, no API key)
model = SentenceTransformer("all-MiniLM-L6-v2")

# Open a local database (just a folder on disk)
db = Database("./rag_data")
collection = db.get_or_create_collection("documents", dimension=384)

# Your documents (from a file, a PDF, a web scrape, anything)
documents = [
    "VelesDB combines vector, graph, and columnar storage in one binary.",
    "HNSW indexing enables fast approximate nearest neighbor search.",
    "The Python SDK supports both vector search and graph traversal.",
    "RAG pipelines retrieve relevant context before generating answers.",
    "Local-first means no network dependency and full data privacy.",
    "Embeddings capture semantic meaning as high-dimensional vectors.",
]

# Embed and store each chunk
for i, doc in enumerate(documents):
    embedding = model.encode(doc).tolist()
    collection.upsert(i, vector=embedding, payload={"text": doc})

# Search with natural language
query = "How does vector search work?"
query_vec = model.encode(query).tolist()
results = collection.search(vector=query_vec, top_k=3)

for r in results:
    print(f"[{r['score']:.4f}] {r['payload']['text']}")
```
That is it. 30 lines, and you have a working RAG retrieval engine running entirely on your machine.
## What the results look like
When you run the query "How does vector search work?", you get ranked results with cosine similarity scores:
```
[0.6821] HNSW indexing enables fast approximate nearest neighbor search.
[0.5765] Embeddings capture semantic meaning as high-dimensional vectors.
[0.4938] VelesDB combines vector, graph, and columnar storage in one binary.
```
The model understands that "vector search" is semantically close to "nearest neighbor search" and "embeddings," even though those exact words do not appear in the query. That is the power of semantic search over keyword matching.
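The scores themselves are just cosine similarities between the query embedding and each document embedding. As a sketch of the underlying math, with toy 3-dimensional vectors standing in for real 384-dimensional embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d vectors standing in for real 384-d embeddings
query = [0.9, 0.1, 0.0]
doc_close = [0.8, 0.2, 0.1]  # similar direction -> high score
doc_far = [0.0, 0.1, 0.9]    # nearly orthogonal -> low score

print(cosine_similarity(query, doc_close))
print(cosine_similarity(query, doc_far))
```

A score near 1.0 means the vectors point in almost the same direction; near 0.0 means they are unrelated. This is why semantically similar sentences score high even with no shared keywords.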
## From retrieval to generation
The retrieval step above gives you ranked chunks. To turn this into a full RAG pipeline, you need one more function that formats the results into an LLM prompt:
```python
def build_rag_prompt(collection, model, question, top_k=3):
    query_vec = model.encode(question).tolist()
    results = collection.search(vector=query_vec, top_k=top_k)
    context = "\n".join(r["payload"]["text"] for r in results)
    return f"""Answer the question using only the context below.
Context:
{context}
Question: {question}
Answer:"""

prompt = build_rag_prompt(collection, model, "How does vector search work?")
# Pass 'prompt' to any LLM: OpenAI, Anthropic, Ollama, llama.cpp...
You can feed this prompt to any LLM you want: OpenAI, Anthropic, or a local Ollama instance. The retrieval layer is completely decoupled from the generation layer.
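Because the prompt is just a string, that decoupling can be made explicit: the generation step is any callable that maps a prompt to an answer. A minimal sketch of that seam (the `answer` helper and the fake backends are illustrative, not part of VelesDB):

```python
from typing import Callable

def answer(question: str,
           retrieve: Callable[[str], str],
           llm: Callable[[str], str]) -> str:
    """Glue retrieval and generation together through a plain string prompt."""
    prompt = retrieve(question)  # e.g. build_rag_prompt(...)
    return llm(prompt)           # OpenAI, Anthropic, Ollama... any backend

# Dummy backends to show the seam; swap in real ones later.
fake_retrieve = lambda q: f"Context: ...\nQuestion: {q}\nAnswer:"
fake_llm = lambda p: "Vector search ranks items by embedding similarity."

print(answer("How does vector search work?", fake_retrieve, fake_llm))
```

Swapping providers then means changing one callable, with no changes to the retrieval code.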
## The comparison
| | Typical RAG stack | This approach |
|---|---|---|
| Vector store | Chroma (Docker) or Pinecone (cloud) | VelesDB (pip install) |
| Setup | Docker compose + API keys + env vars | pip install velesdb |
| Lines of code | 80-150 | 30 |
| Network required | Yes (cloud) or Docker daemon | No |
| Data privacy | Data leaves your machine | Everything stays local |
| Binary size | 500MB+ Docker image | ~6MB |
## Scaling it up
The example above uses a small list of strings, but the pattern scales to real documents. Here is how you would chunk a text file:
```python
def chunk_file(filepath, max_length=500):
    with open(filepath) as f:
        text = f.read()
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    for p in paragraphs:
        if len(p) <= max_length:
            chunks.append(p)
        else:
            # Paragraph too long: split on sentences and re-pack
            sentences = p.split(". ")
            current = ""
            for s in sentences:
                if current and len(current) + len(s) > max_length:
                    chunks.append(current.strip())
                    current = s
                else:
                    current += ". " + s if current else s
            if current:
                chunks.append(current.strip())
    return chunks
```
Load, chunk, embed, store. The same pattern every time.
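Putting it together, the ingest loop is the same as in the toy example, just driven by `chunk_file`. A sketch with the embed function and collection passed in so the logic is testable without VelesDB or a model (`ingest` is an illustrative helper, not a VelesDB API):

```python
def ingest(chunks, embed, collection, start_id=0):
    """Embed each chunk and upsert it with its text as payload."""
    for i, chunk in enumerate(chunks, start=start_id):
        collection.upsert(i, vector=embed(chunk), payload={"text": chunk})
    return start_id + len(chunks)  # next free id, for incremental loads

# Stand-ins for model.encode(...).tolist() and a VelesDB collection
fake_embed = lambda text: [float(len(text)), 0.0]

class FakeCollection:
    def __init__(self):
        self.rows = {}
    def upsert(self, id, vector, payload):
        self.rows[id] = (vector, payload)

col = FakeCollection()
next_id = ingest(["first chunk", "second chunk"], fake_embed, col)
print(next_id)         # 2
print(col.rows[0][1])  # {'text': 'first chunk'}
```

In real use, `embed` would be `lambda t: model.encode(t).tolist()` and `collection` the VelesDB collection from earlier; returning the next free id makes incremental ingestion of many files straightforward.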
## Limitations (being honest)
This approach is not for everyone:
- **Single machine only.** VelesDB is an embedded database. If you need distributed vector search across a cluster, look at Milvus or Qdrant.
- **Bring your own embeddings.** VelesDB stores and searches vectors but does not generate them. You need a model like `sentence-transformers` or an API.
- **No built-in reranking.** For production RAG, you might want a cross-encoder reranking step. That is outside VelesDB's scope.
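If you do add reranking later, it slots in between search and prompt building: over-fetch candidates, rescore them with a stronger model, keep the top few. A sketch with the scorer passed in (in production `score_fn` would be a cross-encoder; the word-overlap scorer below is only a placeholder):

```python
def rerank(question, results, score_fn, keep=3):
    """Rescore retrieved chunks with a stronger scorer and keep the best."""
    scored = [(score_fn(question, r["payload"]["text"]), r) for r in results]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in scored[:keep]]

# Placeholder scorer: counts shared words (a cross-encoder would go here)
def overlap_score(question, text):
    return len(set(question.lower().split()) & set(text.lower().split()))

results = [
    {"payload": {"text": "HNSW enables fast vector search."}},
    {"payload": {"text": "Cats sleep most of the day."}},
]
top = rerank("How does vector search work?", results, overlap_score, keep=1)
print(top[0]["payload"]["text"])
```

The point of the shape is that reranking is a pure post-processing step over the search results, so it bolts onto the pipeline above without touching the storage layer.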
For local development, prototyping, edge deployments, privacy-sensitive applications, and anywhere you want RAG without infrastructure, this works.
## Get started
```bash
pip install velesdb sentence-transformers
```
- VelesDB on GitHub (source-available, Elastic License 2.0)
- VelesDB on PyPI
- sentence-transformers docs
Previous articles in this series:
- I replaced my 500MB vector database Docker stack with a 3MB embedded engine
- Give your AI agent a real memory in 50 lines of Python
- Build an MCP server that gives any LLM long-term memory
What is your current RAG stack, and how many moving parts does it have? I am curious whether "just pip install" would work for your use case.