We Added Full RAG to Our Open-Source AI Template: 4 Vector Stores, Hybrid Search, and Reranking
One template, every RAG decision already made — from vector store to reranking strategy.
You know the drill. You want to add RAG to your AI app. So you start: pick a vector database, write an embedding pipeline, figure out chunking, wire up retrieval, add it to your agent as a tool, build a frontend to manage documents...
Three weeks later you have a working prototype. Then someone asks "can we try Qdrant instead of Milvus?" and you realize your vector store is hardcoded in 14 places.
We just shipped v0.2.2 of our open-source full-stack AI template, and RAG was the biggest addition. Not a toy demo — a production pipeline with 4 vector stores, 4 embedding providers, hybrid search, reranking, document versioning, and a management dashboard. All configurable. All swappable.
Here's what we built and why.
I'm Kacper, AI Engineer at Vstorm — an Applied Agentic AI Engineering Consultancy. We've shipped 30+ production AI agent implementations and open-source our tooling at github.com/vstorm-co. Connect with me on LinkedIn.
The Architecture: 5 Steps, Every One Configurable
Every RAG system does the same thing: parse → chunk → embed → store → search. The difference is how many decisions you have to make at each step.
In our template, each step is a pluggable abstraction:
```
Document Upload
│
├── Parse: PyMuPDF (default) | LlamaParse (130+ formats) | python-docx
│
├── Chunk: recursive (default) | markdown | fixed
│     └── chunk_size=512, overlap=50 (configurable via env vars)
│
├── Embed: OpenAI | Voyage | Gemini (multimodal) | SentenceTransformers (local)
│     └── dimensions auto-derived from model name
│
├── Store: Milvus | Qdrant | ChromaDB | pgvector
│
└── Search: vector | hybrid (BM25 + vector + RRF) | + reranking (Cohere | CrossEncoder)
```
You pick your stack during project generation. The template wires everything up. No glue code.
4 Vector Stores, 1 Interface
The biggest design decision was making vector stores swappable. We implemented BaseVectorStore with four backends:
```python
class BaseVectorStore(ABC):
    async def insert_document(self, collection_name: str, document: Document) -> None: ...
    async def search(self, collection_name: str, query: str, limit: int = 4) -> list[SearchResult]: ...
    async def delete_document(self, collection_name: str, document_id: str) -> None: ...
    async def get_collection_info(self, collection_name: str) -> CollectionInfo: ...
```
Milvus — production-grade, runs as 3 Docker services (etcd + MinIO + Milvus). Best for large-scale deployments. Cosine similarity with IVF_FLAT indexing.
Qdrant — single Docker service, great balance of performance and simplicity. Our default recommendation for most teams.
ChromaDB — embedded mode, zero Docker required. Perfect for prototyping and local development. Just pip install chromadb.
pgvector — uses your existing PostgreSQL. No new infrastructure. HNSW indexing. If you already have Postgres, this is the lowest-friction option.
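Behind the single env var sits a small factory. The sketch below is illustrative, not the template's actual code — the class names are stand-ins for the real adapters:

```python
import os

# Stub classes standing in for the real Milvus/Qdrant/Chroma/pgvector adapters.
class QdrantStore: ...
class MilvusStore: ...
class ChromaStore: ...
class PgVectorStore: ...

# Registry: env value -> backend class
_BACKENDS = {
    "qdrant": QdrantStore,
    "milvus": MilvusStore,
    "chromadb": ChromaStore,
    "pgvector": PgVectorStore,
}

def create_vector_store():
    """Instantiate the backend named by VECTOR_STORE (default: qdrant)."""
    name = os.getenv("VECTOR_STORE", "qdrant").lower()
    try:
        return _BACKENDS[name]()
    except KeyError:
        raise ValueError(f"Unknown VECTOR_STORE={name!r}; pick one of {sorted(_BACKENDS)}")
```

Because every backend implements `BaseVectorStore`, the rest of the pipeline never needs to know which one it got.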
Switching between them? One environment variable:
```
# In your .env:
VECTOR_STORE=qdrant   # or: milvus, chromadb, pgvector
```
The template handles connection strings, Docker services, schema creation, and index configuration automatically.
Hybrid Search: Why Vector-Only Isn't Enough
Pure vector search works well for semantic queries ("documents about building safety"). It fails on exact matches ("find contract #2024-0847") because embeddings don't preserve exact strings.
Our hybrid search combines both:
```python
async def retrieve(self, query: str, collection_name: str, limit: int = 5):
    # Step 1: Vector search (semantic)
    raw_results = await self.store.search(
        collection_name, query, limit=limit * fetch_multiplier
    )

    # Step 2: BM25 keyword search
    if self._hybrid_enabled:
        bm25_results = await self._bm25_search(query, collection_name, limit * fetch_multiplier)
        if bm25_results:
            raw_results = self._rrf_fuse(raw_results, bm25_results)

    # Step 3: Rerank (optional)
    results = raw_results
    if should_rerank and self.rerank_service:
        results = await self.rerank_service.rerank(query=query, results=raw_results, top_k=limit * 2)

    return results[:limit]
```
The fusion uses Reciprocal Rank Fusion (RRF) — a simple but effective algorithm that combines rankings from multiple sources:
```python
@staticmethod
def _rrf_fuse(vector_results, bm25_results, k=60):
    scores: dict[str, float] = {}
    for rank, r in enumerate(vector_results):
        key = r.content[:100]  # dedup key: first 100 chars of the chunk
        scores[key] = scores.get(key, 0) + 1.0 / (k + rank + 1)
    for rank, r in enumerate(bm25_results):
        key = r.content[:100]
        scores[key] = scores.get(key, 0) + 1.0 / (k + rank + 1)
    # Highest fused score first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```
Enable it with one env var: RAG_HYBRID_SEARCH=true.
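To see what RRF actually does, here's a standalone toy version (not the template's code) fusing two rankings. A document ranked well in either list — or decently in both — floats to the top:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each doc scores sum of 1 / (k + rank + 1) per list."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["A", "B", "C"]  # semantic ranking
bm25_hits = ["C", "A", "D"]    # keyword ranking
fused = rrf_fuse([vector_hits, bm25_hits])  # → ["A", "C", "B", "D"]
```

"A" wins because it ranks in both lists; "C" beats "B" for the same reason, even though "B" ranked higher in the vector list. The constant k=60 dampens the influence of top ranks, which is the standard choice from the original RRF paper.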
Reranking: The Quality Multiplier
Initial retrieval casts a wide net. Reranking narrows it down. We support two options:
Cohere Reranker (API) — the fastest way to improve retrieval quality. Send your results + query, get them re-scored by a model trained specifically for relevance ranking:
```python
response = await self.client.rerank(
    query=query,
    documents=[result.content for result in results],
    model="rerank-v3.5",
    top_n=top_k,
)
```
CrossEncoder (local) — runs a SentenceTransformers cross-encoder model locally. No API calls, no data leaves your infrastructure:
```python
pairs = [[query, result.content] for result in results]
scores = self.model.predict(pairs)  # Runs locally on CPU/GPU
```
The pipeline is: retrieve 3× more results than needed → rerank → return top-k. This consistently improves precision without touching your embeddings or vector store.
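The over-retrieve-then-rerank pattern itself is framework-agnostic. Here's a minimal sketch, where `score_fn` stands in for a real cross-encoder's `predict()` and `search_fn` for your vector store:

```python
def rerank(query: str, candidates: list[str], score_fn, top_k: int) -> list[str]:
    """Score each (query, doc) pair, keep the top_k highest-scoring docs."""
    ranked = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return ranked[:top_k]

def retrieve_with_rerank(query, search_fn, score_fn, limit=5, fetch_multiplier=3):
    # Cast a wide net (3x the results we need), then let the reranker narrow it.
    candidates = search_fn(query, limit * fetch_multiplier)
    return rerank(query, candidates, score_fn, top_k=limit)
```

The key property: the reranker only ever sees `limit * fetch_multiplier` candidates, so a relatively slow cross-encoder stays cheap regardless of corpus size.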
Document Versioning: SHA256 Dedup
Re-ingesting a document shouldn't create duplicates. Our pipeline uses content hashing:
```python
async def ingest_file(self, filepath, collection_name, replace=True):
    document = await self.processor.process_file(filepath)

    # Check for existing version by source path or content hash
    existing_id = await self._find_existing_by_source(collection_name, str(filepath))
    if not existing_id:
        existing_id = await self._find_existing_by_hash(
            collection_name, document.metadata.content_hash
        )

    # Replace old chunks with new ones
    if existing_id and replace:
        await self.store.delete_document(collection_name, existing_id)
    await self.store.insert_document(collection_name, document)
```
Google Drive sync? Same logic — changed files get re-embedded, unchanged files are skipped.
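The content hash itself is a one-liner with the standard library. A minimal sketch of how such a hash can be computed (the template's exact scheme may differ, e.g. hashing parsed text instead of raw bytes):

```python
import hashlib

def content_hash(data: bytes) -> str:
    """SHA256 hex digest of the file's bytes: same content -> same hash."""
    return hashlib.sha256(data).hexdigest()

# Identical content dedups regardless of filename; any change produces a new hash.
assert content_hash(b"hello") == content_hash(b"hello")
assert content_hash(b"hello") != content_hash(b"hello!")
```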
4 Embedding Providers
| Provider | Model | Dimensions | API Key? |
|---|---|---|---|
| OpenAI | text-embedding-3-small | 1536 | Yes |
| Voyage | voyage-3 | 1024 | Yes |
| Gemini | gemini-embedding-exp-03-07 | 3072 | Yes |
| SentenceTransformers | all-MiniLM-L6-v2 | 384 | No (local) |
Dimensions are auto-derived from the model name — no manual configuration:
```python
EMBEDDING_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "voyage-3": 1024,
    "gemini-embedding-exp-03-07": 3072,
    "all-MiniLM-L6-v2": 384,
}
```
Gemini is the interesting one — it supports multimodal embeddings. Text and images in the same vector space. We use it for image description extraction from PDFs.
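The lookup itself can then fail loudly rather than silently creating a mis-sized collection — a sketch assuming the `EMBEDDING_DIMENSIONS` table above (the helper name is illustrative):

```python
EMBEDDING_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "voyage-3": 1024,
    "gemini-embedding-exp-03-07": 3072,
    "all-MiniLM-L6-v2": 384,
}

def dimensions_for(model: str) -> int:
    """Resolve vector dimensions from the model name; fail fast if unknown."""
    try:
        return EMBEDDING_DIMENSIONS[model]
    except KeyError:
        raise ValueError(
            f"Unknown embedding model {model!r}; add it to EMBEDDING_DIMENSIONS"
        )
```

Failing fast matters here: a vector store collection created with the wrong dimensions has to be dropped and re-embedded from scratch.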
The Agent Integration
RAG becomes an agent tool — search_knowledge_base — available to all 5 AI frameworks (Pydantic AI, LangChain, LangGraph, CrewAI, DeepAgents):
```python
async def search_knowledge_base(
    query: str,
    collection: str = "documents",
    collections: list[str] | None = None,  # Multi-collection search
    top_k: int = 5,
) -> str:
    """Search with automatic reranking & hybrid search if enabled."""
```
Results include source attribution: filename, page number, chunk number, and similarity score. The agent's system prompt instructs it to cite sources with [1], [2] references.
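Rendering those hits into a tool response the agent can cite might look like the following (an illustrative formatter, not the template's exact code):

```python
def format_results(results: list[dict]) -> str:
    """Render search hits as numbered sources the agent can cite as [1], [2]."""
    lines = []
    for i, r in enumerate(results, start=1):
        lines.append(
            f"[{i}] {r['filename']} (page {r['page']}, chunk {r['chunk']}, "
            f"score {r['score']:.2f})\n{r['content']}"
        )
    return "\n\n".join(lines)
```

Numbering the sources in the tool output is what lets the system prompt's "cite with [1], [2]" instruction work: the agent just echoes the indices it was given.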
Key Takeaways
- RAG is a pipeline of 5 decisions (parse, chunk, embed, store, search) — our template makes each one configurable without code changes
- Vector-only search misses exact matches — hybrid (BM25 + vector + RRF) catches both semantic and keyword queries
- Reranking is the cheapest quality improvement — 3× over-retrieve + rerank consistently beats tuning embeddings
- Document versioning prevents duplicate chunks — SHA256 content hash + source path tracking
- One env var switches each layer — VECTOR_STORE=pgvector, RAG_HYBRID_SEARCH=true, EMBEDDING_MODEL=voyage-3
Try it yourself
full-stack-ai-agent-template — generates production-ready FastAPI + Next.js AI apps with full RAG pipeline
```
pip install fastapi-fullstack
```
Related:
- AI Agent Configurator — configure 75+ options visually, download as ZIP
- Step-by-step guides — 50 tutorials across 5 frameworks
More from Vstorm's open-source ecosystem:
- All our open-source projects — 13 packages for the Pydantic AI ecosystem
- awesome-pydantic-ai — curated list of Pydantic AI resources and tools
- vstorm.co — our consultancy (30+ AI agent implementations)
If this was useful, follow me on LinkedIn for daily AI agent insights.