Most RAG tutorials start with "spin up Docker" and "get your API key." This one starts with pip install.
## The problem
Retrieval-Augmented Generation (RAG) is the standard way to ground LLM answers in your own data. But the typical setup looks like this:
- Spin up a Docker container for your vector database
- Sign up for an API and grab your keys
- Configure connection strings, authentication, ports
- Write 100+ lines of glue code
That is a lot of infrastructure for what is conceptually simple: embed text, store vectors, search by similarity.
What if you could skip all of that?
## The 30-line local RAG pipeline
Here is a complete RAG pipeline. No Docker. No API keys. No cloud. Just Python.
```bash
pip install velesdb sentence-transformers
```
```python
from sentence_transformers import SentenceTransformer
from velesdb import Database

# Load embedding model (runs locally, no API key)
model = SentenceTransformer("all-MiniLM-L6-v2")

# Open a local database (just a folder on disk)
db = Database("./rag_data")
collection = db.get_or_create_collection("documents", dimension=384)

# Your documents (from a file, a PDF, a web scrape, anything)
documents = [
    "VelesDB combines vector, graph, and columnar storage in one binary.",
    "HNSW indexing enables fast approximate nearest neighbor search.",
    "The Python SDK supports both vector search and graph traversal.",
    "RAG pipelines retrieve relevant context before generating answers.",
    "Local-first means no network dependency and full data privacy.",
    "Embeddings capture semantic meaning as high-dimensional vectors.",
]

# Embed and store each chunk
for i, doc in enumerate(documents):
    embedding = model.encode(doc).tolist()
    collection.upsert(i, vector=embedding, payload={"text": doc})

# Search with natural language
query = "How does vector search work?"
query_vec = model.encode(query).tolist()
results = collection.search(vector=query_vec, top_k=3)

for r in results:
    print(f"[{r['score']:.4f}] {r['payload']['text']}")
```
That is it. 30 lines, and you have a working RAG retrieval engine running entirely on your machine.
## What the results look like
When you run the query "How does vector search work?", you get ranked results with cosine similarity scores:
```
[0.6821] HNSW indexing enables fast approximate nearest neighbor search.
[0.5765] Embeddings capture semantic meaning as high-dimensional vectors.
[0.4938] VelesDB combines vector, graph, and columnar storage in one binary.
```
The model understands that "vector search" is semantically close to "nearest neighbor search" and "embeddings," even though those exact words do not appear in the query. That is the power of semantic search over keyword matching.
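The scores themselves are just cosine similarities between the query embedding and each document embedding. As a sketch of the underlying math, with toy 3-dimensional vectors standing in for real 384-dimensional embeddings:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d vectors standing in for real 384-d embeddings
query = [0.9, 0.1, 0.0]
doc_close = [0.8, 0.2, 0.1]  # similar direction -> high score
doc_far = [0.0, 0.1, 0.9]    # nearly orthogonal -> low score

print(cosine_similarity(query, doc_close))
print(cosine_similarity(query, doc_far))
```

A score near 1.0 means the vectors point in almost the same direction; near 0.0 means they are unrelated. This is why semantically similar sentences score high even with no shared keywords.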
## From retrieval to generation
The retrieval step above gives you ranked chunks. To turn this into a full RAG pipeline, you need one more function that formats the results into an LLM prompt:
```python
def build_rag_prompt(collection, model, question, top_k=3):
    query_vec = model.encode(question).tolist()
    results = collection.search(vector=query_vec, top_k=top_k)
    context = "\n".join(r["payload"]["text"] for r in results)
    return f"""Answer the question using only the context below.
Context:
{context}
Question: {question}
Answer:"""

prompt = build_rag_prompt(collection, model, "How does vector search work?")
# Pass 'prompt' to any LLM: OpenAI, Anthropic, Ollama, llama.cpp...
You can feed this prompt to any LLM you want: OpenAI, Anthropic, or a local Ollama instance. The retrieval layer is completely decoupled from the generation layer.
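Because the prompt is just a string, that decoupling can be made explicit: the generation step is any callable that maps a prompt to an answer. A minimal sketch of that seam (the `answer` helper and the fake backends are illustrative, not part of VelesDB):

```python
from typing import Callable

def answer(question: str,
           retrieve: Callable[[str], str],
           llm: Callable[[str], str]) -> str:
    """Glue retrieval and generation together through a plain string prompt."""
    prompt = retrieve(question)  # e.g. build_rag_prompt(...)
    return llm(prompt)           # OpenAI, Anthropic, Ollama... any backend

# Dummy backends to show the seam; swap in real ones later.
fake_retrieve = lambda q: f"Context: ...\nQuestion: {q}\nAnswer:"
fake_llm = lambda p: "Vector search ranks items by embedding similarity."

print(answer("How does vector search work?", fake_retrieve, fake_llm))
```

Swapping providers then means changing one callable, with no changes to the retrieval code.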
## The comparison
| | Typical RAG stack | This approach |
|---|---|---|
| Vector store | Chroma (Docker) or Pinecone (cloud) | VelesDB (pip install) |
| Setup | Docker compose + API keys + env vars | pip install velesdb |
| Lines of code | 80-150 | 30 |
| Network required | Yes (cloud) or Docker daemon | No |
| Data privacy | Data leaves your machine | Everything stays local |
| Binary size | 500MB+ Docker image | ~6MB |
## Scaling it up
The example above uses a small list of strings, but the pattern scales to real documents. Here is how you would chunk a text file:
```python
def chunk_file(filepath, max_length=500):
    with open(filepath) as f:
        text = f.read()
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    for p in paragraphs:
        if len(p) <= max_length:
            chunks.append(p)
        else:
            # Paragraph too long: split on sentences and re-pack
            sentences = p.split(". ")
            current = ""
            for s in sentences:
                if current and len(current) + len(s) > max_length:
                    chunks.append(current.strip())
                    current = s
                else:
                    current += ". " + s if current else s
            if current:
                chunks.append(current.strip())
    return chunks
```
Load, chunk, embed, store. The same pattern every time.
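Putting it together, the ingest loop is the same as in the toy example, just driven by `chunk_file`. A sketch with the embed function and collection passed in so the logic is testable without VelesDB or a model (`ingest` is an illustrative helper, not a VelesDB API):

```python
def ingest(chunks, embed, collection, start_id=0):
    """Embed each chunk and upsert it with its text as payload."""
    for i, chunk in enumerate(chunks, start=start_id):
        collection.upsert(i, vector=embed(chunk), payload={"text": chunk})
    return start_id + len(chunks)  # next free id, for incremental loads

# Stand-ins for model.encode(...).tolist() and a VelesDB collection
fake_embed = lambda text: [float(len(text)), 0.0]

class FakeCollection:
    def __init__(self):
        self.rows = {}
    def upsert(self, id, vector, payload):
        self.rows[id] = (vector, payload)

col = FakeCollection()
next_id = ingest(["first chunk", "second chunk"], fake_embed, col)
print(next_id)         # 2
print(col.rows[0][1])  # {'text': 'first chunk'}
```

In real use, `embed` would be `lambda t: model.encode(t).tolist()` and `collection` the VelesDB collection from earlier; returning the next free id makes incremental ingestion of many files straightforward.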
## Limitations (being honest)
This approach is not for everyone:
- **Single machine only.** VelesDB is an embedded database. If you need distributed vector search across a cluster, look at Milvus or Qdrant.
- **Bring your own embeddings.** VelesDB stores and searches vectors but does not generate them. You need a model like `sentence-transformers` or an API.
- **No built-in reranking.** For production RAG, you might want a cross-encoder reranking step. That is outside VelesDB's scope.
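If you do add reranking later, it slots in between search and prompt building: over-fetch candidates, rescore them with a stronger model, keep the top few. A sketch with the scorer passed in (in production `score_fn` would be a cross-encoder; the word-overlap scorer below is only a placeholder):

```python
def rerank(question, results, score_fn, keep=3):
    """Rescore retrieved chunks with a stronger scorer and keep the best."""
    scored = [(score_fn(question, r["payload"]["text"]), r) for r in results]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r for _, r in scored[:keep]]

# Placeholder scorer: counts shared words (a cross-encoder would go here)
def overlap_score(question, text):
    return len(set(question.lower().split()) & set(text.lower().split()))

results = [
    {"payload": {"text": "HNSW enables fast vector search."}},
    {"payload": {"text": "Cats sleep most of the day."}},
]
top = rerank("How does vector search work?", results, overlap_score, keep=1)
print(top[0]["payload"]["text"])
```

The point of the shape is that reranking is a pure post-processing step over the search results, so it bolts onto the pipeline above without touching the storage layer.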
For local development, prototyping, edge deployments, privacy-sensitive applications, and anywhere you want RAG without infrastructure, this works.
## Get started
```bash
pip install velesdb sentence-transformers
```
- VelesDB on GitHub (source-available, Elastic License 2.0)
- VelesDB on PyPI
- sentence-transformers docs
Previous articles in this series:
- I replaced my 500MB vector database Docker stack with a 3MB embedded engine
- Give your AI agent a real memory in 50 lines of Python
- Build an MCP server that gives any LLM long-term memory
What is your current RAG stack, and how many moving parts does it have? I am curious whether "just pip install" would work for your use case.