I spent several months building Ask360, a multi-tenant RAG platform that lets anyone upload documents and deploy an AI assistant with source citations and confidence badges. The UI and CRUD took a few weeks. The search quality, failover, and production hardening? That's where the real engineering happened.
Here are the problems I hit that tutorials don't cover, and the solutions that actually work in production.
1. You need two databases, and that's actually fine
Every RAG tutorial shows one database. In production, I use two:
- MariaDB for user accounts, projects, documents, billing, and usage tracking. Everything the admin panel touches.
- PostgreSQL + pgvector for vector embeddings, semantic cache, and chunk storage. Everything the search engine touches.
Why not one? MariaDB is excellent for the CRUD-heavy admin layer. Yii2's ActiveRecord, migrations, fixtures, and auth all work out of the box. But MariaDB has no vector similarity search. PostgreSQL with pgvector gives me HNSW indexes for fast approximate nearest neighbor search, ts_rank_cd for BM25 keyword scoring, and trigram indexes for fuzzy matching, all in SQL.
The two databases communicate through the documents table. PHP extracts text and stores it in MariaDB. The Node.js engine reads it, chunks it, embeds it, and stores vectors in PostgreSQL. The project config syncs to PostgreSQL via a simple UPSERT so the engine can join against it during search.
| PHP (Yii2) — MariaDB | Node.js (Express) — PostgreSQL + pgvector |
|---|---|
| user, company, project | project_ref (synced) |
| rag_source (text, status) | rag_chunk (content, embedding) |
| billing, usage, alerts | semantic_cache |
The admin layer doesn't touch vectors. The engine doesn't touch users. Clean separation. Either can be replaced independently.
The tradeoff: You need to keep project config in sync between the two databases. A 50-byte UPSERT on project save handles this. Worth it for the architectural simplicity of letting each database do what it's best at.
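The sync itself can be a single idempotent statement. A minimal sketch — `project_ref` is the table named above, but the `config` column and the function shape are my assumptions, not the exact schema:

```typescript
// Hypothetical sketch of the project-config sync statement.
// ON CONFLICT makes repeated saves idempotent: insert on first save,
// overwrite the config on every save after that.
function projectSyncSql(): string {
  return `INSERT INTO project_ref (project_id, config)
          VALUES ($1, $2)
          ON CONFLICT (project_id)
          DO UPDATE SET config = EXCLUDED.config`;
}
```

PHP fires this against PostgreSQL on every project save; because the statement is an UPSERT, there is no insert-vs-update branching to get wrong on either side.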
2. Local embeddings are a trap. Pay for API embeddings instead.
I started with local ONNX embeddings: first Xenova/all-MiniLM-L6-v2 (384-dim), then upgraded to Snowflake/snowflake-arctic-embed-m-v1.5 (256-dim, Matryoshka). The appeal was obvious: no API costs, no rate limits, data stays on your server.
The reality on a production server:
- Each ONNX inference ties up a CPU core for ~15ms per chunk
- On a 6-core server, 4 embed instances process ~260 chunks/second. Sounds fast until a user uploads a 600-chunk document and the server goes to 100% CPU for 2+ minutes
- ONNX Runtime's spin-wait burns CPU even when idle. Multiple threads per instance waste cycles on synchronization instead of computation
- Model loading takes 3-5 seconds and ~105MB RAM per instance
- Tuning OMP_NUM_THREADS, instance counts, and batch sizes is a combinatorial nightmare that changes with every server
Then I tried Gemini's embedding API (gemini-embedding-001, 256-dim):
| Metric | Local ONNX | Gemini API |
|---|---|---|
| Throughput | ~260 chunks/sec (6-core) | ~1000+ chunks/sec |
| CPU load during embed | 100% | ~5% |
| Cost per 1000 chunks | $0 (but CPU cost) | ~$0.001 |
| Cold start | 3-5 sec model load | None |
| Tuning required | Extensive | None |
The API is faster, cheaper than the CPU time it saves, and needs zero tuning. I kept local ONNX as a fallback (configured per-project via EMBED_PROVIDER), but production runs on Gemini embeddings.
The lesson: "Free" local inference isn't free when you account for CPU contention, tuning time, and the ops burden of managing ONNX threads across PM2 instances. API embeddings are a solved problem. Pay the fraction of a cent and focus on search quality instead.
3. Your cross-encoder will lie to you (and return 0.731 for everything)
I added cross-encoder reranking using Xenova/ms-marco-MiniLM-L-6-v2 to improve search quality. Every query returned a confidence score of exactly 0.731. Completely broken.
The trap: I used pipeline("text-classification", ...) from @huggingface/transformers. This applies softmax internally. For single-label models (num_labels=1), softmax over a single logit always produces 1.0. Always.
The fix is to use AutoModelForSequenceClassification directly and apply sigmoid yourself:
const output = await model(inputs); // model from AutoModelForSequenceClassification.from_pretrained(...)
const logit = output.logits.data[0]; // the single raw logit
const score = 1 / (1 + Math.exp(-logit)); // sigmoid, not softmax
This is documented in sentence-transformers issue #2874, but you'd never find it from the model card. If you're building RAG with JavaScript and using cross-encoders, skip pipeline() and use the model directly.
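You can see the failure mode with plain arithmetic — softmax over a one-element logit vector is constant no matter what the model predicted, while sigmoid preserves the signal:

```typescript
// Softmax normalizes a vector of logits so they sum to 1.
function softmax(logits: number[]): number[] {
  const exps = logits.map((x) => Math.exp(x));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// Sigmoid maps one logit to a probability-like score.
function sigmoid(logit: number): number {
  return 1 / (1 + Math.exp(-logit));
}

// A single-label cross-encoder emits ONE logit per query-passage pair:
console.log(softmax([2.5]));  // [1] — always, regardless of the logit
console.log(softmax([-4.0])); // [1] — the relevance signal is destroyed
console.log(sigmoid(2.5));    // ~0.924 — relevant
console.log(sigmoid(-4.0));   // ~0.018 — irrelevant
```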
4. MS MARCO has a hidden bias that breaks topical queries
After fixing the sigmoid issue, factoid queries worked perfectly:
| Query | CE Score | Correct? |
|---|---|---|
| "How many PTO days do employees get per year?" | 0.9994 | Yes |
| "What health insurance does the company provide?" | 0.9891 | Yes |
| "What is the vacation policy?" | 0.0002 | No |
That last query retrieved the right documents (vector similarity 0.38), but the cross-encoder scored them near zero.
Why: MS MARCO's training data marks passages as relevant only if a human used them to write a specific answer. The model learned "Does this passage ANSWER this question?" rather than "Is this passage ABOUT this topic?" Topical queries like "What is the vacation policy?" have no direct Q&A match, so the cross-encoder rejects them.
The fix: multi-signal confidence. Combine two signals, where either one can promote the tier:
if (topCEScore >= 0.5) // verified
else if (topCEScore >= 0.02 || vecSim >= 0.35) // grounded
else if (topCEScore >= 0.001 || vecSim >= 0.25) // mixed
else // ungrounded
Now "What is the vacation policy?" scores as grounded (vector catches it), while "How do quantum computers work?" correctly scores as ungrounded (both signals low).
The lesson: Never trust a single neural signal. Let multiple weak signals vote.
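The tier logic above fits in one small function. A sketch with the post's thresholds (the tier names and function shape are illustrative):

```typescript
type Tier = 'verified' | 'grounded' | 'mixed' | 'ungrounded';

// Either signal can promote the tier; neither can veto alone.
function confidenceTier(topCEScore: number, vecSim: number): Tier {
  if (topCEScore >= 0.5) return 'verified';
  if (topCEScore >= 0.02 || vecSim >= 0.35) return 'grounded';
  if (topCEScore >= 0.001 || vecSim >= 0.25) return 'mixed';
  return 'ungrounded';
}

// "What is the vacation policy?" — CE near zero, but vector similarity 0.38:
console.log(confidenceTier(0.0002, 0.38)); // grounded
// "How do quantum computers work?" — both signals low:
console.log(confidenceTier(0.0002, 0.1)); // ungrounded
```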
5. RRF beats hand-tuned score mixing every time
The first version of hybrid search added vector similarity and keyword scores together:
SELECT *, (cosine_score + LEAST(ts_rank_cd * 0.25, 0.20)) AS combined
FROM rag_chunk ORDER BY combined DESC
This is the score-mixing anti-pattern. Vector similarity (0-1 range) and keyword rank scores (completely different scale) distort each other when added. Tuning the weights is a moving target.
Reciprocal Rank Fusion (RRF) solves this elegantly:
rrf_score(doc) = 1/(60 + rank_vector) + 1/(60 + rank_keyword)
Each document gets a score based purely on its rank position in each list, not its raw score. Documents appearing in both lists naturally score higher. The k=60 constant (from the original 2009 paper) dampens outliers.
No weight tuning. No score normalization. Works across any two ranking signals. Total fusion time: <1ms.
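The whole fusion step is a few lines. A minimal sketch — each input list is already sorted best-first, so the index encodes the rank:

```typescript
// Reciprocal Rank Fusion: score by rank position only, never by raw score.
function rrfFuse(
  vectorRanked: string[],
  keywordRanked: string[],
  k = 60, // dampening constant from the original 2009 paper
): [string, number][] {
  const scores = new Map<string, number>();
  const addRanks = (list: string[]) =>
    list.forEach((id, i) => scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1)));
  addRanks(vectorRanked);
  addRanks(keywordRanked);
  return [...scores.entries()].sort((a, b) => b[1] - a[1]);
}

// "c" appears in both lists, so it naturally outranks single-list winners:
console.log(rrfFuse(['a', 'c', 'b'], ['c', 'd'])[0][0]); // "c"
```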
6. Neighbor expansion fixes the "lost in the middle" problem
Top-K retrieval picks the 8 best-scoring chunks. But if a policy section spans chunks 45-48 and only chunk 46 scored high enough, the LLM sees one fragment instead of a complete section. The answer goes from comprehensive to "there is no standard policy."
The fix: After reranking selects the top 8, grab the 2 chunks before and after each result:
const TOP_K = 8; // scored winners
const NEIGHBOR_WINDOW = 2; // context on each side
One additional SQL query (~2ms). The LLM now sees full document sections in reading order, not disconnected fragments.
Why post-rerank, not pre? Expanding before reranking means the cross-encoder scores 20+ extra passages at ~6ms each. Post-rerank expansion is nearly free.
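The expansion itself is pure index math. A sketch that assumes chunk indices are sequential within a document — deduplicate the windows, then sort back into reading order:

```typescript
const NEIGHBOR_WINDOW = 2; // context chunks on each side

// Given winning chunk indices, return the full deduplicated window,
// sorted so the LLM sees sections in reading order.
function expandNeighbors(winners: number[], window = NEIGHBOR_WINDOW): number[] {
  const ids = new Set<number>();
  for (const idx of winners) {
    for (let d = -window; d <= window; d++) {
      if (idx + d >= 0) ids.add(idx + d); // clamp at document start
    }
  }
  return [...ids].sort((a, b) => a - b);
}

console.log(expandNeighbors([46])); // [44, 45, 46, 47, 48]
```

The resulting index list feeds the one additional SQL query mentioned above; overlapping windows from adjacent winners collapse into a single contiguous run.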
7. Node.js fetch() hides network errors in a cause chain
When Gemini's API times out, Node.js fetch() throws TypeError("fetch failed"). The real error, ETIMEDOUT, is buried in the .cause property, not in .message.
My error classifier checked .message, didn't find ETIMEDOUT, classified it as unknown_error, and didn't trigger failover. The user saw a silent failure.
// Walk the cause chain to find the real error
const NETWORK_ERROR_CODES = ['ETIMEDOUT', 'ECONNREFUSED', 'ECONNRESET', 'ENOTFOUND', 'EAI_AGAIN'];

export function hasNetworkErrorCode(error: unknown): boolean {
  let current: unknown = error;
  while (current && typeof current === 'object') {
    const e = current as any;
    if (NETWORK_ERROR_CODES.includes(e.code)) return true;
    if (NETWORK_ERROR_CODES.some(c => e.message?.includes(c))) return true;
    current = e.cause ?? null; // Walk up the cause chain
  }
  return false;
}
Now the full chain works: ETIMEDOUT -> retry 3x -> still failing -> failover to Claude -> retry 3x -> success. Or if both providers are down, the circuit breaker opens.
8. Circuit breakers need escalating cooldowns
When your primary LLM provider (Gemini) goes down, you fail over to Claude. But you also need to stop hammering the dead provider and know when to try again.
A fixed cooldown doesn't work. A 5-minute cooldown retries too aggressively during a major outage. A 1-hour cooldown recovers too slowly from a brief hiccup.
Escalating cooldown:
const COOLDOWN_SCHEDULE_MINS = [5, 15, 60, 180, 600];
- 2 consecutive failures -> circuit opens, 5 min cooldown
- Probe fails after cooldown -> escalate to 15 min
- Still down -> 60 min, then 3 hours, then 10 hours (capped)
The state machine: CLOSED -> OPEN -> HALF_OPEN -> (success) -> CLOSED or -> (failure) -> OPEN with escalated cooldown.
Key decision: Circuit state is in-memory only. Restarting the engine resets all circuits to CLOSED. This is intentional because if the engine restarts, the provider may have recovered. No stale state.
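The escalation itself is just a capped table lookup. A minimal sketch using the schedule above (the indexing and capping are my reading of the behavior described):

```typescript
const COOLDOWN_SCHEDULE_MINS = [5, 15, 60, 180, 600];

// nth consecutive opening without an intervening success -> cooldown length.
// Escalates one step per failed probe, capped at the last entry (10 hours).
function nextCooldownMins(openings: number): number {
  const i = Math.min(Math.max(openings, 1) - 1, COOLDOWN_SCHEDULE_MINS.length - 1);
  return COOLDOWN_SCHEDULE_MINS[i];
}

console.log(nextCooldownMins(1)); // 5  — first opening
console.log(nextCooldownMins(3)); // 60 — still down after two probes
console.log(nextCooldownMins(9)); // 600 — capped
```

A successful HALF_OPEN probe resets the opening counter to zero, so a recovered provider starts the next incident back at the 5-minute step.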
9. HTTP was never designed for embedding jobs
Our architecture has PHP (Yii2) orchestrating document processing and Node.js doing the computation. They communicate over HTTP. The problem:
| Operation | Duration | HTTP's comfort zone? |
|---|---|---|
| Embed 20 chunks | 1-3 sec | Borderline |
| Embed 200+ chunks | 30-60 sec | No |
| LLM during outage + retries | 60+ sec | Definitely no |
PHP's file_get_contents() silently returns false on timeout. No partial result. No error message. Documents get stuck in "processing" forever.
Three-layer solution:
- Break jobs into HTTP-sized pieces. Batch of 20 chunks per call, each completes in 1-3 seconds.
- Parallelize across PM2 cluster. Large documents split into N parts, each acquiring a MySQL advisory lock for concurrency control.
- IDs-only messaging. PHP sends {sourceId: 123}; the engine fetches the text from the DB, chunks, embeds, stores, and calls back. Only ~50 bytes cross the wire.
Advisory locks auto-release on process death. If an embed worker crashes, the slot is freed immediately. No cleanup needed.
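The first layer is just a chunking helper: split the work into batches that each fit comfortably inside an HTTP timeout. A sketch using the batch size from the list above:

```typescript
// Split chunk IDs into HTTP-sized batches; each batch becomes one engine call
// that completes in 1-3 seconds instead of one 30-60 second request.
function toBatches<T>(items: T[], size = 20): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// A 203-chunk document becomes 11 independent calls:
console.log(toBatches(Array.from({ length: 203 }, (_, i) => i)).length); // 11
```

Because each batch is an independent call, a single failed batch retries on its own instead of restarting the whole document.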
10. Split your RAG engine into two PM2 apps
Running embedding and chat in the same process is a mistake. Embedding is CPU-hungry and blocks the event loop. Chat needs low-latency SSE streaming.
We run two PM2 apps from the same codebase, differentiated by an ENGINE_ROLE env var:
// ecosystem.config.cjs
{
name: "rag-engine-prd-chat",
env: { ENGINE_ROLE: "chat", PORT: 3001 },
instances: 2,
},
{
name: "rag-engine-prd-embed",
env: { ENGINE_ROLE: "embed", PORT: 3005 },
instances: 2,
}
The chat app handles streaming, project config, reranking, and health checks. The embed app handles document embedding only. Scale independently.
Instance counts auto-detect based on provider:
- API providers (Gemini/OpenAI): 1 chat, 1 embed (API does the heavy lifting)
- Local ONNX: CPU-based (chat = max(2, CPUs*0.25), embed = CPUs*0.5)
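The formulas above can be sketched as a small sizing function — the formulas are from the post, but flooring to whole instances is my assumption:

```typescript
// Provider-aware PM2 instance sizing (sketch).
function instanceCounts(provider: 'api' | 'local', cpus: number): { chat: number; embed: number } {
  if (provider === 'api') {
    // API does the heavy lifting; one instance of each suffices.
    return { chat: 1, embed: 1 };
  }
  // Local ONNX is CPU-bound: reserve most cores for embedding,
  // but always keep at least 2 chat instances for streaming latency.
  return {
    chat: Math.max(2, Math.floor(cpus * 0.25)),
    embed: Math.max(1, Math.floor(cpus * 0.5)),
  };
}

console.log(instanceCounts('local', 6)); // { chat: 2, embed: 3 }
```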
11. Stale processing recovery is a production necessity
When an embed process dies mid-flight (OOM kill, server restart, unhandled exception), the document stays in status = 'processing' forever. No HTTP callback comes back. No heartbeat protocol exists.
Pragmatic fix: A cron job runs every 5 minutes, checks for documents stuck in "processing" for more than the threshold, and resets them to "pending":
STALE_THRESHOLD_API = 300 // 5 min for API providers (fast)
STALE_THRESHOLD_LOCAL = 1800 // 30 min for local ONNX (slow)
The recovery uses an atomic UPDATE WHERE status='processing' to prevent overwriting a legitimate callback that arrives at the same moment.
This is a workaround for HTTP's lack of connection lifecycle awareness. A WebSocket would know immediately when the other end dies. But for a system where embeds happen in the background and take seconds to minutes, a periodic recovery sweep is simpler and just as effective.
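A hedged sketch of the recovery statement — `rag_source` and its status column appear earlier in the post, but the `updated_at` column and interval syntax are assumptions about the schema:

```typescript
// Stale-recovery sweep (sketch). The WHERE status = 'processing' guard is
// what makes it safe: a completion callback that already flipped the row
// to 'completed' matches zero rows here, so it can never be clobbered.
function staleResetSql(thresholdSecs: number): string {
  return `UPDATE rag_source
             SET status = 'pending'
           WHERE status = 'processing'
             AND updated_at < NOW() - INTERVAL ${thresholdSecs} SECOND`;
}

console.log(staleResetSql(300)); // cron runs this every 5 minutes
```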
12. The guardrails pipeline runs in two directions
Input protection catches problems before they reach the LLM. Output protection catches problems in the response. Both run as ordered rule pipelines:
Inbound (before LLM):
- Rate limiting (per-IP, per-project, per-minute)
- Content safety filter (offensive/harmful queries)
- Prompt injection detection (attempts to override the system prompt)
Outbound (after LLM):
- PII redaction (SSNs, credit cards, phone numbers) runs in real-time as tokens stream
- Hallucination flagging via confidence badges warns when document support is weak
- Response length limits prevent runaway generation
The key design: PII redaction runs on each streaming chunk in real time, so sensitive data never reaches the browser. Other checks (hallucination, length) evaluate the complete response for accuracy.
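A minimal sketch of the per-chunk redaction step — the two patterns here are illustrative, not the production rule set:

```typescript
// Illustrative PII patterns; the real pipeline carries a larger rule set.
const SSN = /\b\d{3}-\d{2}-\d{4}\b/g;
const PHONE = /\b\d{3}[-.]\d{3}[-.]\d{4}\b/g;

// Applied to every SSE chunk before it is forwarded to the browser.
function redactChunk(chunk: string): string {
  return chunk.replace(SSN, '[REDACTED-SSN]').replace(PHONE, '[REDACTED-PHONE]');
}

console.log(redactChunk('Call 555-123-4567 or SSN 123-45-6789'));
// "Call [REDACTED-PHONE] or SSN [REDACTED-SSN]"
```

One wrinkle this sketch ignores: a pattern can straddle two streaming chunks, so in practice you hold back a small tail of each chunk until the next one arrives before emitting it.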
The numbers that matter
After all this tuning, here's where the system landed:
| Metric | Value |
|---|---|
| Total retrieval + rerank | 50-80ms |
| RRF fusion | <1ms |
| Cross-encoder (5 passages) | ~30ms |
| Neighbor expansion | ~2ms |
| Semantic cache hit | 1-2ms |
| Embedding (20 chunks, API) | 1-3 sec |
| Circuit breaker initial cooldown | 5 min |
| Circuit breaker max cooldown | 10 hours |
| Stale recovery check | every 5 min |
What I'd do differently
- Use API embeddings from day one. Local ONNX sounds appealing ("no API costs!") but the CPU contention, thread tuning, and cold starts aren't worth it. Gemini embeddings cost fractions of a cent and need zero ops.
- Start with RRF + multi-signal from day one. The single-score approach wasted a week of debugging before I understood the MS MARCO bias.
- Split chat and embed processes immediately. Running them together caused latency spikes I spent days chasing.
- Build the circuit breaker before you need it. The first Gemini outage hit at 2am. I was manually switching providers until the circuit breaker was ready.
- Use advisory locks instead of queue tables. They auto-release on crash, which eliminates an entire class of stuck-job bugs.
Try it
Ask360 is live with a free tier: 1 project, 2 documents, 100 queries/month, forever free. Upload some docs and see the confidence badges and source citations in action.
If you're building something similar, I hope this saves you some of the debugging I went through. The RAG tutorials show you how to wire up vector search in 50 lines. The other 10,000 lines are what this post is about.
Built with PHP (Yii2), Node.js, PostgreSQL + pgvector, Gemini, Claude, and a lot of production debugging.