Kacper Włodarczyk

Posted on • Originally published at oss.vstorm.co

Full RAG Pipeline: 4 Vector Stores, Hybrid Search, and Reranking in One Template

One template, every RAG decision already made — from vector store to reranking strategy.


You know the drill. You want to add RAG to your AI app. So you start: pick a vector database, write an embedding pipeline, figure out chunking, wire up retrieval, add it to your agent as a tool, build a frontend to manage documents...

Three weeks later you have a working prototype. Then someone asks "can we try Qdrant instead of Milvus?" and you realize your vector store is hardcoded in 14 places.

We just shipped v0.2.2 of our open-source full-stack AI template, and RAG was the biggest addition. Not a toy demo — a production pipeline with 4 vector stores, 4 embedding providers, hybrid search, reranking, document versioning, and a management dashboard. All configurable. All swappable.

Here's what we built and why.


I'm Kacper, AI Engineer at Vstorm — an Applied Agentic AI Engineering Consultancy. We've shipped 30+ production AI agent implementations and open-source our tooling at github.com/vstorm-co. Connect with me on LinkedIn.


The Architecture: 5 Steps, Every One Configurable

Every RAG system does the same thing: parse → chunk → embed → store → search. The difference is how many decisions you have to make at each step.

In our template, each step is a pluggable abstraction:

Document Upload
  │
  ├── Parse: PyMuPDF (default) | LlamaParse (130+ formats) | python-docx
  │
  ├── Chunk: recursive (default) | markdown | fixed
  │     └── chunk_size=512, overlap=50 (configurable via env vars)
  │
  ├── Embed: OpenAI | Voyage | Gemini (multimodal) | SentenceTransformers (local)
  │     └── dimensions auto-derived from model name
  │
  ├── Store: Milvus | Qdrant | ChromaDB | pgvector
  │
  └── Search: vector | hybrid (BM25 + vector + RRF) | + reranking (Cohere | CrossEncoder)

You pick your stack during project generation. The template wires everything up. No glue code.
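To make the chunking step concrete, here is a minimal fixed-size chunker with overlap — a simplified stand-in for the template's recursive splitter, not its actual code. The function name `chunk_text` and its signature are illustrative:

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks; consecutive chunks share `overlap` characters.

    Overlap keeps a sentence that straddles a boundary retrievable from either chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

The real recursive strategy prefers to break on paragraph and sentence boundaries before falling back to hard character cuts, but the size/overlap arithmetic is the same.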

4 Vector Stores, 1 Interface

The biggest design decision was making vector stores swappable. We implemented BaseVectorStore with four backends:

class BaseVectorStore(ABC):
    @abstractmethod
    async def insert_document(self, collection_name: str, document: Document) -> None: ...
    @abstractmethod
    async def search(self, collection_name: str, query: str, limit: int = 4) -> list[SearchResult]: ...
    @abstractmethod
    async def delete_document(self, collection_name: str, document_id: str) -> None: ...
    @abstractmethod
    async def get_collection_info(self, collection_name: str) -> CollectionInfo: ...
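To show what any backend has to provide, here is a toy in-memory store against a simplified two-method version of that interface — purely illustrative, with pre-computed vectors standing in for real embeddings:

```python
import math
from dataclasses import dataclass, field


@dataclass
class SearchResult:
    content: str
    score: float


@dataclass
class InMemoryVectorStore:
    # collection name -> list of (content, embedding) pairs
    collections: dict[str, list[tuple[str, list[float]]]] = field(default_factory=dict)

    async def insert_document(self, collection_name: str, content: str, vector: list[float]) -> None:
        self.collections.setdefault(collection_name, []).append((content, vector))

    async def search(self, collection_name: str, query_vector: list[float], limit: int = 4) -> list[SearchResult]:
        def cosine(a: list[float], b: list[float]) -> float:
            dot = sum(x * y for x, y in zip(a, b))
            norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
            return dot / norm if norm else 0.0

        docs = self.collections.get(collection_name, [])
        scored = [SearchResult(content, cosine(query_vector, vec)) for content, vec in docs]
        return sorted(scored, key=lambda r: r.score, reverse=True)[:limit]
```

The production backends do the same thing with real indexes (IVF_FLAT, HNSW) instead of a brute-force scan, which is exactly why the interface can stay this small.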

Milvus — production-grade, runs as 3 Docker services (etcd + MinIO + Milvus). Best for large-scale deployments. Cosine similarity with IVF_FLAT indexing.

Qdrant — single Docker service, great balance of performance and simplicity. Our default recommendation for most teams.

ChromaDB — embedded mode, zero Docker required. Perfect for prototyping and local development. Just pip install chromadb.

pgvector — uses your existing PostgreSQL. No new infrastructure. HNSW indexing. If you already have Postgres, this is the lowest-friction option.

Switching between them? One environment variable:

# In your .env:
VECTOR_STORE=qdrant    # or: milvus, chromadb, pgvector

The template handles connection strings, Docker services, schema creation, and index configuration automatically.
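Selection like this usually boils down to a small registry keyed on the env var. A hedged sketch of the pattern — the class names here are illustrative stubs, not the template's actual module layout:

```python
import os

_REGISTRY: dict[str, type] = {}


def register(name: str):
    """Class decorator: make a backend selectable via the VECTOR_STORE env var."""
    def wrap(cls: type) -> type:
        _REGISTRY[name] = cls
        return cls
    return wrap


@register("qdrant")
class QdrantStore:  # illustrative stub
    pass


@register("chromadb")
class ChromaStore:  # illustrative stub
    pass


def create_vector_store() -> object:
    """Instantiate the backend named by VECTOR_STORE (default: qdrant)."""
    name = os.getenv("VECTOR_STORE", "qdrant")
    try:
        return _REGISTRY[name]()
    except KeyError:
        raise ValueError(f"Unknown VECTOR_STORE={name!r}; known: {sorted(_REGISTRY)}") from None
```

Failing loudly on an unknown backend name is the important part: a typo in `.env` should be a startup error, not a silent fallback.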

Hybrid Search: Why Vector-Only Isn't Enough

Pure vector search works well for semantic queries ("documents about building safety"). It fails on exact matches ("find contract #2024-0847") because embeddings don't preserve exact strings.

Our hybrid search combines both:

async def retrieve(self, query: str, collection_name: str, limit: int = 5, fetch_multiplier: int = 3):
    # Step 1: Vector search (semantic) — over-fetch so fusion and reranking have candidates
    results = await self.store.search(collection_name, query, limit=limit * fetch_multiplier)

    # Step 2: BM25 keyword search, fused with the vector results via RRF
    if self._hybrid_enabled:
        bm25_results = await self._bm25_search(query, collection_name, limit * fetch_multiplier)
        if bm25_results:
            results = self._rrf_fuse(results, bm25_results)

    # Step 3: Rerank (optional)
    if self.rerank_service:
        results = await self.rerank_service.rerank(query=query, results=results, top_k=limit * 2)

    return results[:limit]

The fusion uses Reciprocal Rank Fusion (RRF) — a simple but effective algorithm that combines rankings from multiple sources:

@staticmethod
def _rrf_fuse(vector_results, bm25_results, k=60):
    scores: dict[str, float] = {}
    by_key = {}
    for ranking in (vector_results, bm25_results):
        for rank, r in enumerate(ranking):
            key = r.content[:100]  # dedupe near-identical chunks by content prefix
            scores[key] = scores.get(key, 0.0) + 1.0 / (k + rank + 1)
            by_key.setdefault(key, r)
    return [by_key[key] for key in sorted(scores, key=scores.get, reverse=True)]
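A self-contained run of the fusion step shows why RRF works: a chunk ranked mid-list by both retrievers beats a chunk only one retriever found. This toy version fuses plain strings instead of result objects:

```python
from collections import defaultdict


def rrf_fuse(vector_results: list[str], bm25_results: list[str], k: int = 60) -> list[str]:
    """Fuse two ranked lists with Reciprocal Rank Fusion: score = sum of 1/(k + rank + 1)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in (vector_results, bm25_results):
        for rank, chunk in enumerate(ranking):
            scores[chunk] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)


# "shared" is rank 2 in both lists, yet it wins: 2/62 > 1/61
fused = rrf_fuse(
    vector_results=["semantic-only", "shared"],
    bm25_results=["keyword-only", "shared"],
)
```

The constant `k=60` (the value commonly used in the RRF literature) damps the advantage of top-ranked items so that broad agreement across retrievers outweighs a single high rank.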

Enable it with one env var: RAG_HYBRID_SEARCH=true.

Reranking: The Quality Multiplier

Initial retrieval casts a wide net. Reranking narrows it down. We support two options:

Cohere Reranker (API) — the fastest way to improve retrieval quality. Send your results + query, get them re-scored by a model trained specifically for relevance ranking:

response = await self.client.rerank(
    query=query,
    documents=[result.content for result in results],
    model="rerank-v3.5",
    top_n=top_k,
)

CrossEncoder (local) — runs a SentenceTransformers cross-encoder model locally. No API calls, no data leaves your infrastructure:

pairs = [[query, result.content] for result in results]
scores = self.model.predict(pairs)  # Runs locally on CPU/GPU

The pipeline is: retrieve 3× more results than needed → rerank → return top-k. This consistently improves precision without touching your embeddings or vector store.
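The over-retrieve-then-rerank flow can be sketched end to end. Here `word_overlap` is a deliberately crude stand-in for a cross-encoder score — the point is the shape of the pipeline, not the scoring function:

```python
def rerank_pipeline(query: str, candidates: list[str], score_fn, top_k: int = 5) -> list[str]:
    """Re-score every candidate with score_fn(query, doc) and keep the best top_k."""
    return sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)[:top_k]


def word_overlap(query: str, doc: str) -> int:
    """Toy relevance score: number of shared words (a stand-in for a cross-encoder)."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


docs = ["fire safety rules", "fire exit plan for the building", "cafeteria menu"]
top = rerank_pipeline("building fire safety", docs, word_overlap, top_k=2)
```

In the real pipeline, `candidates` would be the 3× over-fetched retrieval results and `score_fn` the Cohere or CrossEncoder call; everything else is the same sort-and-truncate.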

Document Versioning: SHA256 Dedup

Re-ingesting a document shouldn't create duplicates. Our pipeline uses content hashing:

async def ingest_file(self, filepath, collection_name, replace=True):
    document = await self.processor.process_file(filepath)

    # Check for an existing version by source path first, then by content hash
    existing_id = await self._find_existing_by_source(collection_name, str(filepath))
    if not existing_id:
        existing_id = await self._find_existing_by_hash(collection_name, document.metadata.content_hash)

    # Replace the old chunks with the new ones
    if existing_id and replace:
        await self.store.delete_document(collection_name, existing_id)

    await self.store.insert_document(collection_name, document)

Google Drive sync? Same logic — changed files get re-embedded, unchanged files skip.
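The hashing itself is small enough to show whole. A minimal sketch of the dedup check — `should_reingest` is a hypothetical helper name, not the template's API:

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def should_reingest(new_text: str, stored_hashes: set[str]) -> bool:
    """Skip re-embedding when byte-identical content was already ingested."""
    return content_hash(new_text) not in stored_hashes
```

Because the hash is over content rather than filename, a renamed-but-unchanged file is still skipped, while a one-character edit triggers a full re-embed of that document.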

4 Embedding Providers

| Provider | Model | Dimensions | API Key? |
| --- | --- | --- | --- |
| OpenAI | text-embedding-3-small | 1536 | Yes |
| Voyage | voyage-3 | 1024 | Yes |
| Gemini | gemini-embedding-exp-03-07 | 3072 | Yes |
| SentenceTransformers | all-MiniLM-L6-v2 | 384 | No (local) |

Dimensions are auto-derived from the model name — no manual configuration:

EMBEDDING_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "voyage-3": 1024,
    "gemini-embedding-exp-03-07": 3072,
    "all-MiniLM-L6-v2": 384,
}

Gemini is the interesting one — it supports multimodal embeddings. Text and images in the same vector space. We use it for image description extraction from PDFs.

The Agent Integration

RAG becomes an agent tool — search_knowledge_base — available to all 5 AI frameworks (Pydantic AI, LangChain, LangGraph, CrewAI, DeepAgents):

async def search_knowledge_base(
    query: str,
    collection: str = "documents",
    collections: list[str] | None = None,  # Multi-collection search
    top_k: int = 5,
) -> str:
    """Search with automatic reranking & hybrid search if enabled."""

Results include source attribution: filename, page number, chunk number, and similarity score. The agent's system prompt instructs it to cite sources with [1], [2] references.
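A sketch of how retrieval results might be rendered into the numbered references the agent cites — the `SearchResult` fields and `format_for_agent` helper here are illustrative, not the template's exact schema:

```python
from dataclasses import dataclass


@dataclass
class SearchResult:
    content: str
    filename: str
    page: int
    score: float


def format_for_agent(results: list[SearchResult]) -> str:
    """Render results with [n] markers the agent can cite in its answer."""
    return "\n".join(
        f"[{i}] ({r.filename}, p.{r.page}, score={r.score:.2f}) {r.content}"
        for i, r in enumerate(results, start=1)
    )
```

Giving the model pre-numbered sources is what makes the "cite with [1], [2]" system-prompt instruction reliable: the agent only has to repeat a marker, not invent one.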

Key Takeaways

  • RAG is a pipeline of 5 decisions (parse, chunk, embed, store, search) — our template makes each one configurable without code changes
  • Vector-only search misses exact matches — hybrid (BM25 + vector + RRF) catches both semantic and keyword queries
  • Reranking is the cheapest quality improvement — 3× over-retrieve + rerank consistently beats tuning embeddings
  • Document versioning prevents duplicate chunks — SHA256 content hash + source path tracking
  • One env var switches everything: VECTOR_STORE=pgvector, RAG_HYBRID_SEARCH=true, EMBEDDING_MODEL=voyage-3

Try it yourself

full-stack-ai-agent-template — generates production-ready FastAPI + Next.js AI apps with full RAG pipeline

pip install fastapi-fullstack

If this was useful, follow me on LinkedIn for daily AI agent insights.
