Md Ayan Arshad
5 Critical Failures We Hit Shipping a Multi-Tenant RAG Chatbot to 500+ Enterprises

Our first enterprise tenant onboarded on a Monday.

By Wednesday, 30% of their documents had been silently indexed as empty strings. No error. No exception. The chatbot just said "I don't have enough information", confidently, every time.

That was Failure #1. There were four more.

Here's the honest account of shipping a multi-tenant RAG chatbot to 500+ enterprise clients — what broke, in what order, and what we should have caught earlier.

The System We Built

Before the failures, the context.

We built a RAG chatbot for enterprise warehouse management. Each tenant had their own isolated knowledge base — SOPs, compliance documents, operational guides. Users queried only their tenant's data. Scale target: ~25,000 queries per day at full rollout.

Indexing pipeline:
Document Upload → Type Detection → Preprocessing → Chunking → Embedding → Pinecone

Query pipeline:
User Query → Cache Check → Query Rewrite → Hybrid Search (BM25 + Vector) → RRF Fusion → Reranker → LLM → Response

Two pipelines in the design. One EC2 fleet in reality, which became Failure #4.

Indexing consumed from SQS. Query API sat behind an ALB. One Pinecone namespace per tenant, every query scoped to the authenticated tenant's namespace before touching the vector DB.
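
Of those query stages, RRF fusion is compact enough to sketch. It merges the BM25 and vector rankings using only rank positions, so the two scoring scales never need to be comparable. A minimal version (function and argument names are illustrative; k=60 is the conventional constant from the RRF paper):

```python
def rrf_fuse(bm25_ids, vector_ids, k=60):
    """Reciprocal Rank Fusion: each doc scores the sum of 1/(k + rank)
    over every ranked list it appears in."""
    scores = {}
    for ranked in (bm25_ids, vector_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

A document ranked well by both retrievers beats one ranked highly by only one, which is exactly the behavior you want before the reranker sees anything.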

The architecture decisions were mostly right.

What broke was the assumptions underneath them.

Failure #1: The PDF Preprocessing Assumption (Week 1)

We assumed all enterprise documents were text-based PDFs.

They weren't.

About 30% of what tenants uploaded were scanned PDFs: images of physical pages with no text layer. When PyMuPDF opened these files, it returned empty strings. We embedded empty strings. We indexed empty chunks. No error. No exception. Just silent failure.

Users asked questions. Retrieval returned nothing relevant. The LLM said "I don't have enough information." Users assumed the chatbot was broken. They were right, just not for the reason they thought.

The fix: A preprocessing gate that checks average characters per page. If avg_chars_per_page < 100, we assume no text layer exists and trigger OCR via AWS Textract before chunking. We also added an admin-facing flag marking documents as "pending OCR" so tenants know their document is processing, not lost.
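
A minimal sketch of that gate, assuming page text has already been extracted (e.g. via PyMuPDF's page.get_text()); the helper name and threshold are illustrative:

```python
def needs_ocr(page_texts, min_avg_chars=100):
    """Flag likely scanned PDFs: a missing or sparse text layer
    means the document needs OCR before chunking.

    page_texts: extracted text per page, one string per page.
    """
    if not page_texts:
        return True  # nothing extractable at all
    avg = sum(len(text) for text in page_texts) / len(page_texts)
    return avg < min_avg_chars
```

Documents that trip the gate go to the Textract path instead of straight to chunking; everything else proceeds as before.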

The lesson: Never assume your input format. Garbage input produces zero output in RAG. Preprocessing is the most boring part of the pipeline and the most catastrophic to skip.

Failure #2: Headers, Footers, and the Chunk Contamination Problem

Even for text-based PDFs, every chunk was contaminated.

Enterprise documents have headers and footers on every page. "Softeon WMS User Guide — Confidential — Page 14 of 203." When you chunk a 200-page document into 512-token pieces, that text bleeds into hundreds of chunks.

The retrieval impact was subtle but real. Queries about "confidential" topics surfaced chunks with "Confidential" in the footer, not because the content was relevant, but because BM25 was matching on that exact term. Relevance scores were quietly polluted.

The fix: A stripping step before chunking. Text appearing in the top 5% and bottom 5% of every page gets flagged and removed. We also converted tables to markdown before chunking: a raw table extracted as "Product Price Refund Laptop 999 30 days" is useless for retrieval. The same table as structured markdown is self-contained and semantically meaningful.
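
The positional part of that stripping step is small. A sketch assuming each text block comes with its top y-coordinate (PyMuPDF's page.get_text("blocks") returns tuples carrying these coordinates); names here are illustrative:

```python
def strip_headers_footers(blocks, page_height, band=0.05):
    """Drop text blocks that sit in the top or bottom 5% of the page.

    blocks: list of (y0, text) tuples, y0 = block's top coordinate.
    Returns the surviving block texts in original order.
    """
    top = page_height * band
    bottom = page_height * (1 - band)
    return [text for y0, text in blocks if top <= y0 <= bottom]
```

The 5% band is a heuristic; documents with unusually deep headers need a wider band or a repetition check across pages.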

The lesson: Most RAG tutorials skip directly to chunking size debates — 256 vs 512 tokens. They assume clean input. Real enterprise documents are not clean.

Failure #3: The Parallel Pipeline Was Actually Sequential

We ran BM25 and vector search in what we thought was parallel.

It wasn't.

The original implementation called BM25, waited for the result, then called Pinecone. "Parallel" on the architecture diagram. Sequential in the code. At p50 this cost us ~200ms we couldn't afford.

The fix is one line:

import asyncio

# Wrong — sequential: the second call doesn't start until the first finishes
bm25_results = await bm25_search(query)
vector_results = await pinecone_search(query)

# Right — parallel: both searches run concurrently
bm25_results, vector_results = await asyncio.gather(
    bm25_search(query),
    pinecone_search(query)
)

Latency becomes the max of the two, not the sum. Dropped our p95 by ~180ms.

The lesson: "Parallel" on a diagram and "parallel" in code are different things. Profile your pipeline stage by stage. The bottleneck is always somewhere surprising.

Failure #4: One Tenant's Upload Degraded Everyone's Query Latency

This one took a week to diagnose.

We noticed periodic p99 spikes, not consistent, not tied to query volume. Random, unpredictable.

The cause: our indexing pipeline and query pipeline were on the same EC2 instances.

When a large tenant uploaded 500 documents, the embedding loop hammered the instance CPU. Live users querying on the same instance saw response time jump from 800ms to 6+ seconds. The indexer and query service were invisible to each other in code — but very visible to each other on the metal.

The fix: Complete infrastructure separation. Indexing workers on a dedicated EC2 fleet, completely outside the ALB. The query fleet has no knowledge that indexing is happening. A document upload spike now has zero effect on query latency for any tenant. The SQS queue buffers upload bursts and feeds indexing workers at a controlled pace.
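
The consumer side of that queue is a plain long-polling loop. A sketch with the SQS client injected (in production this would be a boto3 client; handle_document stands in for the chunk-and-embed step, and max_batches exists only so the loop can terminate):

```python
import json

def poll_indexing_queue(sqs, queue_url, handle_document, max_batches=None):
    """Long-poll SQS and index documents at a controlled pace.

    Runs only on the dedicated indexing fleet; the query fleet
    never executes this loop, so upload bursts cannot touch
    query latency.
    """
    batches = 0
    while max_batches is None or batches < max_batches:
        resp = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,  # long polling: fewer empty receives
        )
        for msg in resp.get("Messages", []):
            handle_document(json.loads(msg["Body"]))
            # Delete only after successful indexing, so a crash
            # mid-document lets SQS redeliver it
            sqs.delete_message(
                QueueUrl=queue_url,
                ReceiptHandle=msg["ReceiptHandle"],
            )
        batches += 1
```

Deleting after processing (not before) is what makes the pipeline at-least-once rather than at-most-once: a worker dying mid-embed costs a retry, not a lost document.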

The lesson: Load isolation is not just an architectural principle. It's a user experience decision. Enterprise tenants don't care about your architecture, they care that the chatbot was slow when they needed it.

Failure #5: The Namespace Isolation Gap We Almost Missed

Multi-tenant isolation in Pinecone is handled by namespaces. One namespace per tenant. Every write tags it. Every read is scoped to it.

What we almost shipped: namespace scoped at the request body level.

A bad actor passing a forged tenant_id in the request body could scope the query to a different tenant's namespace. Subtle. Critical.

# Wrong — trusting request body
namespace = request.body.tenant_id

# Right — trusting validated token only
namespace = token_context.tenant_id  # resolved from JWT at API layer

The fix: Namespace resolved exclusively from the validated JWT token at the API layer. The request body's tenant_id is ignored entirely. By the time a request reaches the vector DB call, the namespace has already been locked to the authenticated tenant — it cannot be overridden.
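
A sketch of that resolution, with one addition beyond what's described above: a mismatched body tenant_id raises instead of being silently ignored, so forgery attempts surface in logs. Names are illustrative, and token_claims is assumed to be an already-verified JWT payload:

```python
class TenantMismatchError(Exception):
    """Raised when the request body tries to override the token's tenant."""

def resolve_namespace(token_claims, request_body):
    """Resolve the Pinecone namespace from the validated JWT only.

    token_claims: payload of a JWT already verified at the API layer.
    request_body: parsed request dict; its tenant_id is never trusted.
    """
    namespace = token_claims["tenant_id"]
    body_tenant = request_body.get("tenant_id")
    if body_tenant is not None and body_tenant != namespace:
        # A forged tenant_id is a signal worth alerting on,
        # not just ignoring
        raise TenantMismatchError(
            f"body tenant {body_tenant!r} does not match token tenant"
        )
    return namespace
```

Whether you ignore the forged field or reject the request is a judgment call; rejecting gives you an audit trail of who tried.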

If we had shipped the original version, any authenticated user who knew another tenant's ID could have queried their private documents. In a WMS context serving enterprise clients, that's not a security incident, that's a contract termination and a legal conversation.

The lesson: Namespace isolation is not the same as security. Enforce tenant identity at the authentication layer, never from data the client controls.

What We Still Haven't Built

We don't have automated RAG evaluation in production.

No RAGAS running continuously. No Precision@5 after every deployment. Instead, an internal QA team reviews representative queries and rates answer quality by hand. It works at current scale. It won't at full rollout.

What I'd build next with two weeks:
→ A golden evaluation set: 200 curated question-to-chunk pairs from real tenant queries. That's your retrieval quality baseline.
→ RAGAS Faithfulness in CI/CD: runs on every deployment and blocks the release if faithfulness drops more than 5% from baseline.
→ Context Precision tracking: tells you whether your reranker is actually earning its latency cost.
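
The Precision@5 half of that plan needs nothing beyond the golden set and the retriever. A minimal sketch (names are illustrative):

```python
def precision_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the top-k retrieved chunks that are in the golden set."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / len(top_k)

def evaluate_golden_set(golden_set, retrieve, k=5):
    """Mean Precision@k over the golden set.

    golden_set: list of (query, set_of_relevant_chunk_ids) pairs.
    retrieve: callable mapping a query to a ranked list of chunk ids.
    """
    scores = [precision_at_k(retrieve(query), relevant, k)
              for query, relevant in golden_set]
    return sum(scores) / len(scores)
```

Run it in CI against the same index a deployment would serve, record the mean, and a retrieval regression becomes a failed build instead of a tenant complaint.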

The One Thing That Mattered Most

RAG systems fail at the edges of the pipeline, not the center.

Most engineering effort goes into the center: embedding models, reranking algorithms, chunk sizes. The real production failures happen at the edges: what goes into the indexer, what happens when two workloads compete for the same compute, and where tenant identity gets resolved.

What broke first in your RAG pipeline? Drop it in the comments. The failures nobody writes about are always the most useful. I'll compile the best ones into a follow-up post.
