Daniel Nwaneri

My $5/month RAG System Now Beats $200 Solutions - Hybrid Search, Reranking & Dashboard

Last month, I shared how I built a production RAG system for $5/month. The response was incredible, but the feedback was clear: vector search alone isn't enough.

So I rebuilt it. Here's what changed.

The Problem With V1

My original system used pure vector similarity search. It worked great for semantic queries like "how does edge computing work?" but failed miserably when users searched for:

  • Exact IDs: FPL-2026-X
  • Specific names: John Smith contract
  • Technical terms: bge-small-en-v1.5

Vector search finds meaning, not exact matches. I needed both.

The Upgrade: 5 Features That Changed Everything

1. Hybrid Search (Vector + BM25)

Instead of choosing between semantic and keyword search, I combined them using Reciprocal Rank Fusion (RRF).

Query
  │
  ├──► Vector Search (Vectorize) ──┐
  │    finds: meaning, concepts    │
  │                                ├──► RRF Fusion ──► Results
  └──► Keyword Search (D1 BM25) ───┘
       finds: exact terms, IDs

The magic is in the fusion. RRF doesn't care about raw scores - it cares about rank position. If a document ranks #1 in both searches, it's definitely relevant.

function reciprocalRankFusion(vectorResults, keywordResults) {
  const rrfK = 60; // Standard constant from the original RRF paper

  // Fused score per document ID: sum of 1 / (k + rank) across both lists
  const scores = new Map();

  vectorResults.forEach((result, rank) => {
    scores.set(result.id, 1 / (rrfK + rank + 1));
  });

  keywordResults.forEach((result, rank) => {
    const existing = scores.get(result.id) || 0;
    scores.set(result.id, existing + 1 / (rrfK + rank + 1));
  });

  // Highest fused score first
  return [...scores.entries()].sort((a, b) => b[1] - a[1]);
}
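
The keyword leg runs on D1, which is SQLite under the hood, so FTS5's built-in bm25() ranking works natively. Here's a minimal sketch, assuming a hypothetical FTS5 table named chunks_fts with a content column (the real schema lives in schema.sql, and the DB binding name is whatever your wrangler.toml declares):

// Keyword leg: FTS5 BM25 over D1. Note bm25() returns lower-is-better
// scores, so ascending order puts the best matches first.
const { results: keywordResults } = await env.DB.prepare(
  `SELECT rowid AS id, content, bm25(chunks_fts) AS score
     FROM chunks_fts
    WHERE chunks_fts MATCH ?
    ORDER BY score
    LIMIT 20`
).bind(query).all();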

2. Cross-Encoder Reranking

Vector search finds candidates. But are they actually correct?

I added @cf/baai/bge-reranker-base as a "judge" that evaluates the top 10 results against the original query:

const reranked = await env.AI.run('@cf/baai/bge-reranker-base', {
  query: "cloudflare workers performance",
  contexts: top10Results.map(r => ({ text: r.content }))
});
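
The call returns a relevance score per context; you then reorder the candidates with it. A small sketch, assuming the Workers AI reranker response shape of { id, score } pairs, where id indexes into the contexts array:

// Sort candidates by reranker score, highest relevance first
const ordered = [...reranked.response]
  .sort((a, b) => b.score - a.score)
  .map(({ id }) => top10Results[id]);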

The reranker doesn't just check similarity - it checks relevance. Game changer.

3. Smart Chunking (15% Overlap)

My V1 chunking was naive - split every 500 characters. This created chunks like:

Bad chunk:

"...the performance was excellent. The system hand"

Now I use recursive chunking that respects semantic boundaries:

Good chunk:

"The performance was excellent. The system handled 
500K requests daily with 99.9% uptime."

Plus 15% overlap ensures context isn't lost between chunks.
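
The repo uses a recursive splitter, but the core idea fits in a few lines. A simplified sketch that splits on sentence boundaries and carries a 15% tail into the next chunk:

// Simplified boundary-aware chunking with 15% overlap (a sketch, not
// the exact recursive splitter from the repo)
function chunkWithOverlap(text, chunkSize = 500, overlapRatio = 0.15) {
  const sentences = text.split(/(?<=[.!?])\s+/);
  const chunks = [];
  let current = '';

  for (const sentence of sentences) {
    if (current && (current + ' ' + sentence).length > chunkSize) {
      chunks.push(current.trim());
      // Start the next chunk with the tail of the previous one
      const overlap = current.slice(-Math.floor(chunkSize * overlapRatio));
      current = overlap + ' ' + sentence;
    } else {
      current = current ? current + ' ' + sentence : sentence;
    }
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}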

4. Interactive Dashboard

No more explaining curl commands to clients.

Dashboard screenshot: hybrid search visualization showing vector embeddings and keyword results merging through RRF fusion into unified search results.

Built as a single HTML file embedded in the Worker - no separate frontend deployment needed:

  • 📊 Real-time stats
  • 📥 Document ingestion form
  • 🔎 Search with latency monitor
  • 🔑 API key management

Access it at /dashboard on any deployment.
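
The pattern is worth stealing: the entire UI lives in a template string and gets served from one route. A stripped-down sketch (the real Worker inlines the full HTML):

// Serve the dashboard straight from the Worker - no separate frontend
const DASHBOARD_HTML = `<!doctype html><html><!-- full UI here --></html>`;

export default {
  async fetch(request, env) {
    const { pathname } = new URL(request.url);
    if (pathname === '/dashboard') {
      return new Response(DASHBOARD_HTML, {
        headers: { 'content-type': 'text/html;charset=UTF-8' },
      });
    }
    // ...API and MCP routes...
    return new Response('Not found', { status: 404 });
  },
};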

5. MCP Integration (AI Agent Ready)

The Model Context Protocol lets AI assistants use your API as a native tool. I added:

GET  /mcp/tools  # List available tools
POST /mcp/call   # Execute a tool

Now Claude Desktop or any MCP-compatible agent can search my knowledge base directly.
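
For example, an agent (or plain fetch) can invoke the search tool like this; the tool name and argument shape here are illustrative, so check /mcp/tools for the actual schema:

// Illustrative tool call - verify names against GET /mcp/tools
const res = await fetch('https://your-worker.workers.dev/mcp/call', {
  method: 'POST',
  headers: { 'content-type': 'application/json' },
  body: JSON.stringify({
    tool: 'search',
    arguments: { query: 'cloudflare workers performance', topK: 5 },
  }),
});
const data = await res.json();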

Performance: Before vs After

| Metric | V1 | V2 |
| --- | --- | --- |
| Search Type | Vector only | Hybrid (Vector + BM25) |
| Reranking | ❌ None | ✅ bge-reranker-base |
| Chunking | Fixed 500 char | Semantic + 15% overlap |
| Dashboard | ❌ None | ✅ Built-in |
| MCP Support | ❌ None | ✅ Native |
| Latency | ~360ms | ~900ms |
| Accuracy | Good | Significantly better |
| Cost | $5/month | $5/month |

Yes, latency increased - but that's the reranker adding precision. You can disable it for speed-critical queries:

{ "query": "fast search", "rerank": false }
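
Inside the Worker, that flag just gates the reranking pass. A sketch of the handler logic, with hybridSearch and rerankResults as hypothetical stand-ins for the real helpers:

// Rerank only when the client hasn't opted out
const { query, rerank = true, topK = 5 } = await request.json();
let results = await hybridSearch(env, query); // vector + BM25 + RRF
if (rerank) {
  results = await rerankResults(env, query, results.slice(0, 10));
}
return Response.json({ results: results.slice(0, topK) });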

How This Compares to Pinecone

| Feature | Pinecone | This Project |
| --- | --- | --- |
| Monthly Cost | $50+ minimum | ~$5/month |
| Edge Deployment | ❌ Cloud-only | ✅ Cloudflare Edge |
| Hybrid Search | Requires workarounds | ✅ Native Vector + BM25 |
| Cross-Encoder Reranking | Basic | ✅ bge-reranker-base |
| MCP Integration | ❌ None | ✅ Native |

Pinecone costs 10-30x more at scale. And according to VentureBeat, they're "struggling with customer churn largely driven by cost concerns."

The accuracy difference? Hybrid search with reranking achieves 66.43% MRR vs 56.72% for semantic-only - a 9.7 percentage point improvement.

Real-World Test: "The Firm Brain" Proof of Concept

I ran four tests simulating high-stakes research scenarios:

1. Semantic Intent

Query: "Can I take a vacation in July?"

Data: HR policy mentioning "paid leave" (not "vacation")

Result: ✅ Matched concept, not just keywords

2. Needle in a Haystack

Query: "What is the limit for the large BGE model?"

Data: Dense technical doc with multiple similar numbers

Result: ✅ Found correct value (1500) among distractors

3. Contextual Logic

Query: "Which team has the unbeaten home record?"

Data: Paragraph mentioning Arsenal (49-game run) and Chelsea (86-game home run)

Result: ✅ Reranker correctly prioritized Chelsea based on "home" qualifier

4. Security

Action: Search with invalid API key

Result: ✅ Hard stop - "Invalid API key"
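
That hard stop comes from a guard at the top of the request handler. A minimal sketch - the header name and isValidKey helper here are hypothetical:

// Reject unauthenticated requests before touching the index
const apiKey = request.headers.get('x-api-key');
if (!apiKey || !(await isValidKey(env, apiKey))) {
  return Response.json({ error: 'Invalid API key' }, { status: 401 });
}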

Performance Summary

| Metric | Result |
| --- | --- |
| Search Time | 341-642ms |
| Cost | ~$5/month |
| Reranking | Adds precision |

The Stack

All on Cloudflare's edge:

  • Workers - Runtime
  • Vectorize - Vector database
  • D1 - SQL for BM25 keywords
  • Workers AI - Embeddings + Reranking

No external services. No data leaving your account.
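
For reference, the embedding leg: bge-small-en-v1.5 produces 384-dimensional vectors, which is why the Vectorize index below is created with --dimensions=384. A sketch, with the VECTORIZE binding name and the chunk object standing in for whatever your wrangler.toml and ingestion code actually use:

// Embed a chunk and upsert it into the Vectorize index
const { data } = await env.AI.run('@cf/baai/bge-small-en-v1.5', {
  text: [chunk.content],
});
await env.VECTORIZE.upsert([
  { id: chunk.id, values: data[0], metadata: { content: chunk.content } },
]);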

Try It Yourself

Live Dashboard: vectorize-mcp-worker.fpl-test.workers.dev/dashboard

GitHub: github.com/dannwaneri/vectorize-mcp-worker

Deploy your own in 10 minutes:

git clone https://github.com/dannwaneri/vectorize-mcp-worker.git
cd vectorize-mcp-worker
npm install
wrangler vectorize create mcp-knowledge-base --dimensions=384 --metric=cosine
wrangler d1 create mcp-knowledge-db
# Update wrangler.toml with database_id
wrangler d1 execute mcp-knowledge-db --remote --file=./schema.sql
wrangler deploy

Works with mcp-cli

This server is compatible with mcp-cli for efficient tool discovery:

# Add to mcp_servers.json
{
  "mcpServers": {
    "vectorize": {
      "url": "https://your-worker.workers.dev/mcp"
    }
  }
}

# Discover tools
mcp-cli vectorize

# Search your knowledge base
mcp-cli vectorize/search '{"query": "cloudflare workers", "topK": 5}'

What's Next?

Considering:

  • PDF ingestion (client-side parsing)
  • Usage analytics dashboard
  • Batch document upload

But honestly? This covers 90% of RAG use cases. Sometimes "done" is a feature.


Need help deploying this for your team? I offer full-service setup - hire me on Upwork.

⭐ Star the repo if this helped!

Questions? Drop them in the comments.

Top comments (2)

Daniel Nwaneri

The real test was finding specific IDs vs general docs. V1 vector search kept mixing up results with similar keywords. Adding the reranker finally fixed it. It actually catches the nuance between a 'similar' topic and the 'right' answer.

Daniel Nwaneri

@leob checking back in.

Not sure if you started your MCP journey yet, but I just dropped V2 of that RAG stack we talked about.

Since you were looking at the TypeScript path, the repo for this one is way more modular and still runs natively on Workers. It adds Hybrid Search to handle some of the retrieval gaps I found in V1. Might be a good reference if you're still exploring the TS side.