Edu Arana
I Built a Swarm Agent RAG System Inspired by Karpathy's LLM Wiki

#ai

Most RAG systems use a single retriever to search a vector database. It works — until your knowledge base has code, images, tables, and text all mixed together. One retriever can't specialize in all of them.

So I built rag-swarm — a multimodal RAG system where specialized swarm agents search in parallel, and an LLM-powered oracle evaluates every result before it reaches the user.

The architecture is inspired by Karpathy's LLM Wiki three-layer design (ingestion → retrieval → generation), adapted for swarm-based vector retrieval with enterprise-grade evaluation. Where Karpathy's wiki describes a clean separation of concerns for LLM-augmented knowledge systems, rag-swarm takes the retrieval layer and replaces the single search path with a coordinated swarm of specialized agents.

The Problem with Single-Retriever RAG

Traditional RAG does this:

  1. Embed the query
  2. Search the vector DB
  3. Return top-K results
  4. Feed them to an LLM

It treats all documents the same. A Python function, a data table, and a paragraph of text all get the same embedding strategy, the same search, the same ranking. That's leaving relevance on the table.
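As a baseline, the single-retriever loop above can be sketched in a few lines. This is a minimal sketch, not rag-swarm's code: `collection`, `embed`, and `llm` are hypothetical stand-ins for a ChromaDB collection, an embedding function, and a chat-completion call.

```python
# Minimal single-retriever RAG loop. `collection`, `embed`, and `llm`
# are hypothetical stand-ins, not rag-swarm's actual interfaces.
def answer(query, collection, embed, llm, k=5):
    query_vec = embed(query)                        # 1. embed the query
    hits = collection.query(                        # 2. search the vector DB
        query_embeddings=[query_vec], n_results=k   # 3. keep top-K results
    )
    context = "\n\n".join(hits["documents"][0])     # flatten retrieved chunks
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return llm(prompt)                              # 4. feed them to an LLM
```

Every document type flows through this one path, which is exactly the limitation the swarm design addresses.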

How Swarm RAG Works

Instead of one retriever, rag-swarm dispatches your query to 4 specialized agents running in parallel:

  • TextAgent — optimized for prose and documentation
  • CodeAgent — understands function signatures, docstrings, imports
  • ImageAgent — works with captions and CLIP embeddings
  • TableAgent — handles structured/tabular data

Each agent searches the same ChromaDB vector store but with modality-aware strategies. The results are deduplicated, re-ranked with a cross-encoder, and then sent to the Oracle.
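The fan-out and deduplication step can be sketched like this. It is a simplified illustration, assuming each agent exposes a `search(query)` method returning hits with unique `id` fields; the real agents also apply their modality-aware strategies, and re-ranking happens afterward.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch of the swarm dispatch: each agent exposes a
# search(query) -> list[dict] method; results carry a unique chunk "id".
def swarm_search(query, agents):
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        batches = pool.map(lambda agent: agent.search(query), agents)
    seen, merged = set(), []
    for batch in batches:
        for hit in batch:
            if hit["id"] not in seen:   # deduplicate across agents
                seen.add(hit["id"])
                merged.append(hit)
    return merged                       # next: cross-encoder re-ranking
```

Running the agents concurrently means the wall-clock cost of four searches is roughly that of the slowest one, not the sum.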

The Oracle — The Quality Gate

The oracle is the part I'm most proud of. It's a two-stage evaluator:

  1. Fast pass — embedding similarity between the query and each result
  2. Deep pass — LLM reasoning that explains why each result is relevant or not

Every result comes back with a human-readable verdict:

```json
{
  "relevance_score": 0.9572,
  "reasoning": "This chunk is RELEVANT as it directly addresses the query by explaining the functionality and evaluation process of the Oracle Agent.",
  "passed": true
}
```

No black box. The user sees the oracle's reasoning for every single result.
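The two-stage flow can be sketched as below. This is a hypothetical simplification: `llm_judge` stands in for the Workers AI call, and the fast-pass threshold is an assumed value, not the project's.

```python
# Hypothetical two-stage oracle: a cheap cosine filter, then an LLM verdict.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

def oracle(query_vec, result, llm_judge, fast_threshold=0.5):
    score = cosine(query_vec, result["embedding"])   # stage 1: fast pass
    if score < fast_threshold:
        return {"relevance_score": score,
                "reasoning": "Rejected by fast embedding pass.",
                "passed": False}
    verdict = llm_judge(result["text"])              # stage 2: deep pass
    return {"relevance_score": score,
            "reasoning": verdict["reasoning"],
            "passed": verdict["relevant"]}
```

The fast pass keeps the expensive LLM call off results that are obviously off-topic, so the deep reasoning budget is spent only on plausible candidates.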

Semantic Query Cache

Every query is embedded once. If a similar query was asked before (cosine similarity ≥ 0.95), the cached response returns instantly, skipping the entire swarm + oracle pipeline. Near-duplicate queries hit the cache too, not just exact matches.
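The idea fits in a small class. This is a generic sketch rather than rag-swarm's implementation; a linear scan is shown for clarity, and the 0.95 threshold matches the one described above.

```python
# Hypothetical semantic query cache keyed on query embeddings.
def _cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((sum(x * x for x in a) ** 0.5) *
                  (sum(y * y for y in b) ** 0.5))

class SemanticCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []                  # (embedding, response) pairs

    def get(self, query_vec):
        for vec, response in self.entries:
            if _cos(query_vec, vec) >= self.threshold:
                return response            # near-duplicate hit
        return None                        # miss: run swarm + oracle

    def put(self, query_vec, response):
        self.entries.append((query_vec, response))
```

In production you would replace the linear scan with an ANN index, but the contract is the same: a hit short-circuits the whole pipeline.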

MCP Server — Plug Into Any AI Host

The whole system is exposed as an MCP server (Model Context Protocol, spec 2025-11-25). That means Claude Desktop, VS Code Copilot, or any MCP-compatible host can use it as a tool:

```json
{
  "mcpServers": {
    "rag-swarm": {
      "command": "uv",
      "args": ["--directory", "./backend", "run", "python", "-m", "app.mcp_server"]
    }
  }
}
```

7 tools, 2 resources, 2 prompts — all discoverable by the host.

The Stack

  • Backend: Python + FastAPI + ChromaDB
  • Inference: Cloudflare Workers AI (embeddings, LLM, VLM, re-ranker) — no local GPU needed
  • Frontend: React + Vite + D3.js for vector space visualization
  • MCP: FastMCP with stdio and streamable HTTP transports

Results

I built a comparison mode that runs both approaches side-by-side with evaluation metrics (Precision, Recall, NDCG, MRR). The swarm consistently surfaces results across modalities that a single retriever misses — at the cost of slightly lower average relevance because it casts a wider net.
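For reference, two of those metrics are simple to state in code. This is a generic sketch of the standard definitions, not the project's evaluation harness: `ranked` is a list of result ids in rank order and `relevant` is the set of ids judged relevant.

```python
# Standard ranking metrics (generic definitions, names hypothetical).
def precision_at_k(ranked, relevant, k):
    top = ranked[:k]
    return sum(1 for r in top if r in relevant) / k

def mrr(ranked, relevant):
    # Reciprocal rank of the first relevant result, 0 if none appears.
    for i, r in enumerate(ranked, start=1):
        if r in relevant:
            return 1.0 / i
    return 0.0
```

The wider-net trade-off shows up directly in these numbers: recall-oriented metrics favor the swarm, while average precision can dip slightly.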

Try It

The project is open source under MIT:

GitHub: github.com/arananet/rag-swarm

```bash
cd backend && pip install -r requirements.txt
uvicorn app.main:app --reload --port 8000

cd ../frontend && npm install && npm run dev
# Open http://localhost:5173
```

I'd love feedback — especially on the oracle evaluation approach and whether the swarm architecture makes sense for your use cases.
