Most RAG systems use a single retriever to search a vector database. It works — until your knowledge base has code, images, tables, and text all mixed together. One retriever can't specialize in all of them.
So I built rag-swarm — a multimodal RAG system where specialized swarm agents search in parallel, and an LLM-powered oracle evaluates every result before it reaches the user.
The architecture is inspired by Karpathy's LLM Wiki three-layer design (ingestion → retrieval → generation), adapted for swarm-based vector retrieval with enterprise-grade evaluation. Where Karpathy's wiki describes a clean separation of concerns for LLM-augmented knowledge systems, rag-swarm takes the retrieval layer and replaces the single search path with a coordinated swarm of specialized agents.
The Problem with Single-Retriever RAG
Traditional RAG does this:
- Embed the query
- Search the vector DB
- Return top-K results
- Feed them to an LLM
It treats all documents the same. A Python function, a data table, and a paragraph of text all get the same embedding strategy, the same search, the same ranking. That's leaving relevance on the table.
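The single-path pipeline above can be sketched in a few lines. This is a toy stand-in, not rag-swarm's code: `embed` is a placeholder for a real embedding model, and the point is that every document, whatever its modality, flows through the same embed-and-rank path.

```python
import math

def embed(text: str) -> list[float]:
    # Toy embedder: byte counts hashed into 8 buckets, L2-normalized.
    # A real system would call an embedding model here.
    vec = [0.0] * 8
    for b in text.encode():
        vec[b % 8] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    # One embedding strategy for every document, regardless of modality.
    q = embed(query)
    scored = [(sum(a * b for a, b in zip(q, embed(d))), d) for d in docs]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [d for _, d in scored[:k]]
```

A Python function and a prose paragraph get identical treatment here, which is exactly the weakness the swarm design addresses.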
How Swarm RAG Works
Instead of one retriever, rag-swarm dispatches your query to 4 specialized agents running in parallel:
- TextAgent — optimized for prose and documentation
- CodeAgent — understands function signatures, docstrings, imports
- ImageAgent — works with captions and CLIP embeddings
- TableAgent — handles structured/tabular data
Each agent searches the same ChromaDB vector store but with modality-aware strategies. The results are deduplicated, re-ranked with a cross-encoder, and then sent to the Oracle.
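The dispatch-dedup-rerank flow can be sketched with `asyncio`. The agent functions below are dummies returning canned `(doc_id, score)` pairs, and a plain score sort stands in for the cross-encoder; in rag-swarm each agent queries ChromaDB with its own modality-aware strategy.

```python
import asyncio

# Hypothetical agents: each returns (doc_id, score) hits from the same store
# using a modality-aware strategy. Canned results for illustration only.
async def text_agent(query):  return [("doc1", 0.8), ("doc2", 0.6)]
async def code_agent(query):  return [("doc2", 0.9), ("doc3", 0.5)]
async def image_agent(query): return [("img1", 0.7)]
async def table_agent(query): return [("tbl1", 0.4)]

async def swarm_search(query: str) -> list[tuple[str, float]]:
    agents = [text_agent, code_agent, image_agent, table_agent]
    # Dispatch all four agents in parallel.
    results = await asyncio.gather(*(agent(query) for agent in agents))
    # Deduplicate by doc id, keeping the best score any agent assigned.
    best: dict[str, float] = {}
    for hits in results:
        for doc_id, score in hits:
            best[doc_id] = max(best.get(doc_id, 0.0), score)
    # Stand-in for the cross-encoder re-rank: sort by best score.
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)
```

Note that `doc2` appears in two agents' results but survives deduplication only once, with its higher score.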
The Oracle — The Quality Gate
The oracle is the part I'm most proud of. It's a two-stage evaluator:
- Fast pass — embedding similarity between the query and each result
- Deep pass — LLM reasoning that explains why each result is relevant or not
Every result comes back with a human-readable verdict:
```json
{
  "relevance_score": 0.9572,
  "reasoning": "This chunk is RELEVANT as it directly addresses the query by explaining the functionality and evaluation process of the Oracle Agent.",
  "passed": true
}
```
No black box. The user sees the oracle's reasoning for every single result.
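A minimal sketch of the two-stage gate, assuming a callable `llm_judge` that stands in for the deep-pass LLM (the thresholds and the `(score, reasoning)` return shape are illustrative, not rag-swarm's actual API):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def oracle_evaluate(query_vec, chunk_vec, chunk_text, llm_judge,
                    fast_threshold=0.3, pass_threshold=0.5):
    # Stage 1 (fast pass): cheap embedding-similarity gate.
    sim = cosine(query_vec, chunk_vec)
    if sim < fast_threshold:
        return {"relevance_score": sim,
                "reasoning": "Rejected by fast pass (low embedding similarity).",
                "passed": False}
    # Stage 2 (deep pass): LLM reasoning. `llm_judge` is a stand-in callable
    # returning (score, reasoning); rag-swarm calls an LLM here.
    score, reasoning = llm_judge(chunk_text)
    return {"relevance_score": score, "reasoning": reasoning,
            "passed": score >= pass_threshold}
```

The fast pass keeps cheap rejections from ever reaching the LLM, so the expensive reasoning call only runs on plausible candidates.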
Semantic Query Cache
Every query gets embedded once. If a similar query was asked before (cosine similarity ≥ 0.95), the cached response returns instantly — skipping the entire swarm + oracle pipeline. Near-duplicate queries hit cache too, not just exact matches.
MCP Server — Plug Into Any AI Host
The whole system is exposed as an MCP server (Model Context Protocol, spec 2025-11-25). That means Claude Desktop, VS Code Copilot, or any MCP-compatible host can use it as a tool:
```json
{
  "mcpServers": {
    "rag-swarm": {
      "command": "uv",
      "args": ["--directory", "./backend", "run", "python", "-m", "app.mcp_server"]
    }
  }
}
```
7 tools, 2 resources, 2 prompts — all discoverable by the host.
The Stack
- Backend: Python + FastAPI + ChromaDB
- Inference: Cloudflare Workers AI (embeddings, LLM, VLM, re-ranker) — no local GPU needed
- Frontend: React + Vite + D3.js for vector space visualization
- MCP: FastMCP with stdio and streamable HTTP transports
Results
I built a comparison mode that runs both approaches side-by-side with evaluation metrics (Precision, Recall, NDCG, MRR). The swarm consistently surfaces results across modalities that a single retriever misses — at the cost of slightly lower average relevance because it casts a wider net.
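For reference, two of those metrics are small enough to sketch inline. This is a generic binary-relevance implementation, not rag-swarm's evaluation code:

```python
import math

def mrr(relevant: set[str], ranked: list[str]) -> float:
    # Reciprocal rank of the first relevant hit for one query.
    for position, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / position
    return 0.0

def ndcg(relevant: set[str], ranked: list[str], k: int) -> float:
    # NDCG@k with binary relevance: discount each hit by log2(position + 1).
    dcg = sum(1.0 / math.log2(pos + 1)
              for pos, doc in enumerate(ranked[:k], start=1) if doc in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 1) for pos in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

The "wider net" trade-off shows up directly in these numbers: the swarm's extra cross-modal hits lift recall and NDCG even when a few lower-scored results drag down average relevance.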
Try It
The project is open source under MIT:
GitHub: github.com/arananet/rag-swarm
```bash
git clone https://github.com/arananet/rag-swarm && cd rag-swarm
cd backend && pip install -r requirements.txt
uvicorn app.main:app --reload --port 8000
# in a second terminal:
cd frontend && npm install && npm run dev
# Open http://localhost:5173
```
I'd love feedback — especially on the oracle evaluation approach and whether the swarm architecture makes sense for your use cases.