We Rebuilt Our RAG Pipeline 4 Times — Here's the Architecture That Finally Served 50K Daily Queries Under 800ms
Our first RAG system hit 91% user satisfaction in demos and 34% in production. This is the brutal post-mortem of 4 rebuilds, 3 fired vendors, and the architecture that actually scaled.
Here's the dirty secret nobody talks about at AI conferences: most published RAG architectures have never served 1K daily queries, let alone 50K. The failure modes don't show up until real users — with their typos, ambiguous questions, and zero patience — start hammering your system under latency constraints.
Our stakes were concrete. We were building an internal knowledge base serving 50K queries/day from support agents and customers. Every wrong answer cost $14 in average escalation time — an agent escalating to a senior, a customer calling back, a ticket reopened. Bad latency? Users closed the tab within 3 seconds. We measured it.
What I'm about to walk through is a progression of architectural mistakes that compound. Each fix exposed the next bottleneck. RAG systems fail in sequence, not in isolation. And by the end, I'll tie the final architecture's accuracy improvements back to a concrete daily cost reduction that made leadership actually care.

Source: 5 Reasons Why AI Agents and RAG Pipelines Fail in Production
Rebuild 1→2: How Fixed 512-Token Chunking Destroyed Our Retrieval Precision
The v1 architecture was textbook. LangChain RecursiveCharacterTextSplitter at 512 tokens, OpenAI ada-002 embeddings, Pinecone cosine similarity top-5, GPT-3.5-turbo for generation. It looked great on curated demo queries because our demo docs were short, self-contained, and written by the same person who built the system. Classic demo-ware.
Production corpora are heterogeneous. A 512-token chunk from a legal FAQ splits a clause mid-sentence. A product spec table gets bisected, losing row-column relationships entirely. A troubleshooting guide's "if X then Y" logic gets separated across chunks. Retrieval precision dropped to 0.23 on multi-step procedural queries — meaning fewer than 1 in 4 retrieved chunks actually contained the answer.
We manually reviewed 200 failed queries and categorized chunk-level failures into four types:
- Mid-sentence splits: 31% — the chunk boundary fell in the middle of a critical sentence
- Table fragmentation: 22% — structured data lost its structure
- Context orphaning: 28% — a chunk references "the above" or "as mentioned" with no antecedent
- Topic contamination: 19% — unrelated sections merged into a single chunk
That taxonomy changed how we thought about chunking. This wasn't a tuning problem — it was a fundamental mismatch between fixed-size windowing and variable-structure documents.
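The first failure type is easy to reproduce with naive fixed-size windowing. This is a toy, character-based stand-in for our 512-token splitter (the chunk size and sample document are illustrative, not from our corpus):

```python
def fixed_chunks(text: str, size: int) -> list[str]:
    """Naive fixed-size windowing: cut every `size` characters, structure be damned."""
    return [text[i:i + size] for i in range(0, len(text), size)]

doc = ("If the pump fails, close valve A before opening valve B. "
       "Otherwise pressure spikes.")
chunks = fixed_chunks(doc, 40)
# chunks[0] ends mid-sentence, so the "if X then Y" logic is split across
# chunks[0] and chunks[1]; neither chunk alone can answer a procedural query.
```

A retriever can only return one of those fragments, and neither contains the full conditional. That is the mid-sentence split and context-orphaning failure in miniature.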
The Semantic Chunking Solution
The fix: LangChain's SemanticChunker with sentence-transformers for breakpoint detection. Instead of chopping at arbitrary token counts, it identifies semantic boundaries where the topic actually shifts.
```python
from langchain_experimental.text_splitter import SemanticChunker
from langchain_community.embeddings import HuggingFaceEmbeddings

# Sentence-level embeddings used to detect topic shifts between sentences
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

semantic_chunker = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",  # split where inter-sentence distance is extreme
    breakpoint_threshold_amount=95,          # top 5% of distances become chunk boundaries
    add_start_index=True,                    # keep character offsets for source attribution
)
chunks = semantic_chunker.create_documents([document_text])
```
Chunk relevance scores improved from 0.38 to 0.54 — a 42% relative lift.
Rebuild 2→3: The Re-Ranking Latency Trap and the Async Pre-Fetch Pattern That Saved Us
Semantic chunking improved chunk quality, but embedding-based retrieval still returned topically related but non-answering chunks. We added Cohere Rerank v2 as a cross-encoder re-ranker. RAGAS faithfulness jumped from 0.61 to 0.82. Then p95 latency exploded from 400ms to 2.1 seconds.
Latency Breakdown
| Component | Latency | % of Total |
|---|---|---|
| Pinecone query | ~45ms | 2% |
| Cohere Rerank API (20 candidates) | ~1,200ms | 57% |
| GPT-4o generation | ~600ms | 29% |
| Overhead | ~255ms | 12% |
Async Pre-Fetch with Tiered Caching
The solution was an async pre-fetch + tiered cache pattern:
- Redis cache for re-ranked results — ~38% hit rate
- Speculative generation — fire GPT-4o with top-3 embedding results while re-ranking runs in parallel
- Cancellation check — if re-ranker changes top-3 (Jaccard < 0.67), cancel and restart
Net result: p95 dropped to 780ms with quality preserved.
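The speculate-and-cancel step can be sketched with asyncio. This is a minimal sketch, not our production code: `retrieve`, `rerank`, and `generate` are stand-ins for the Pinecone, re-ranker, and GPT-4o calls, and chunks are represented by their IDs:

```python
import asyncio

def jaccard(a, b) -> float:
    """Overlap between two candidate sets (1.0 when identical)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

async def answer(query, retrieve, rerank, generate, threshold=0.67):
    candidates = await retrieve(query)              # fast dense retrieval
    speculative = asyncio.create_task(
        generate(query, candidates[:3])             # fire generation early
    )
    reranked = await rerank(query, candidates)      # cross-encoder runs meanwhile
    if jaccard(candidates[:3], reranked[:3]) >= threshold:
        return await speculative                    # re-ranker agreed: keep the head start
    speculative.cancel()                            # top-3 changed: restart generation
    return await generate(query, reranked[:3])
```

When the re-ranker confirms the embedding top-3 (which it does for most queries), the generation latency is fully hidden behind the re-rank call; you only pay the restart penalty on disagreement.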
Rebuild 3→4: Context Window Mismanagement and Dynamic Top-K
GPT-4o's 128K context window felt like a cheat code. We stuffed the top-20 chunks (~15K tokens) into every prompt. Then the failure reports started.
Liu et al. (2023) — "Lost in the Middle" — showed that LLMs reliably miss information buried in the middle of long contexts. We saw the same thing: RAGAS answer relevancy dropped 18% for queries where the gold chunk landed in positions 7–14. The contradictory answer rate hit 12% — 6,000 queries/day at $14 per escalation.
Dynamic Top-K Strategy
```python
# Retrieval budget per query type, selected by the query classifier
TOP_K_BUDGET = {
    "simple": 3,       # factual lookups: one good chunk usually suffices
    "procedural": 5,   # multi-step instructions span several chunks
    "comparative": 8,  # comparisons need evidence for every side
}
MAX_CONTEXT_TOKENS = 4096  # hard ceiling regardless of query type
```
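Context assembly then looks roughly like this. A sketch under assumptions: `count_tokens` stands in for whatever tokenizer you use (e.g. tiktoken), and the default budgets mirror `TOP_K_BUDGET` above:

```python
def assemble_context(query_type, ranked_chunks, count_tokens,
                     budgets=None, max_tokens=4096):
    """Pick at most k re-ranked chunks for the query type, capped by a token budget."""
    budgets = budgets or {"simple": 3, "procedural": 5, "comparative": 8}
    k = budgets.get(query_type, 5)          # unknown types fall back to the middle budget
    selected, used = [], 0
    for chunk in ranked_chunks[:k]:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:        # stop before blowing the budget
            break
        selected.append(chunk)
        used += cost
    return "\n---\n".join(selected)         # delimiter injection between chunks
```

Because chunks arrive re-ranked, truncation drops the weakest evidence first, and the hard token cap keeps the gold chunk out of the "lost in the middle" positions.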
| Metric | Before | After |
|---|---|---|
| Answer relevancy (RAGAS) | 0.71 | 0.86 |
| Contradictory answer rate | 12% | 2.3% |
| Mean tokens per query | ~15K | ~5K |
| Monthly API cost | ~$4,200 | ~$2,800 |
The Final Architecture
The v4 stack that serves 50K queries/day under 800ms p95:
- Ingestion: SemanticChunker → metadata enrichment → Pinecone upsert
- Retrieval: Hybrid search (BM25 + dense) → Cohere Rerank v2 (self-hosted)
- Context Assembly: DistilBERT query classifier → dynamic top-k → delimiter injection
- Generation: GPT-4o with structured prompts + source attribution
- Caching: Redis (15-min TTL, cosine-distance key matching)
- Observability: RAGAS online eval on 5% sample, Prometheus latency histograms
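The cosine-distance key matching in the caching layer can be sketched as follows. This is a simplified in-memory stand-in for the Redis layer (a real deployment would use a Redis vector index and attach the 15-minute TTL per entry):

```python
import numpy as np

class SemanticCache:
    """Toy cache that matches queries by cosine similarity of their embeddings."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold  # minimum cosine similarity to count as a hit
        self.entries = []           # list of (unit-norm query embedding, answer)

    def get(self, query_vec):
        q = np.asarray(query_vec, dtype=float)
        q = q / np.linalg.norm(q)
        for vec, answer in self.entries:
            if float(vec @ q) >= self.threshold:  # cosine sim of unit vectors
                return answer
        return None                 # miss: caller runs the full pipeline

    def put(self, query_vec, answer):
        v = np.asarray(query_vec, dtype=float)
        self.entries.append((v / np.linalg.norm(v), answer))
```

Matching on embedding distance rather than exact query strings is what lets paraphrases ("reset my password" vs. "how do I reset a password") hit the same cached answer, which is where the ~38% hit rate comes from.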
What I'd Tell My Past Self
- Benchmark on production queries, not demo queries. Your demo corpus is a lie.
- Chunking strategy is architecture, not config. Fixed-size chunking is a leaky abstraction.
- Re-ranking is a quality multiplier, but synchronous API re-ranking is a latency trap. Self-host or build async compensation.
- Context stuffing is not a strategy. Dynamic top-k with query classification beats brute-force context every time.
- Measure escalation cost, not just accuracy. The number that got leadership to fund rebuild 4 wasn't RAGAS — it was $14 × 6,000 queries/day.
If you're building RAG in production and want to compare notes, I'm always up for it. Drop a comment or connect — the failure modes are more interesting than the success stories.