Disclosure: I am a frontend developer transitioning into AI engineering, sharing real experiments and learnings from building production-style RAG systems.
Your RAG pipeline works perfectly on Friday. Then Monday hits. 1,000 users query at once. Suddenly everything breaks: 502 errors, ECONNRESET, OpenAI 429 rate limits, Pinecone timeouts. The demo wasn't wrong—it just wasn't built for production concurrency.
Video: https://youtu.be/-2aS3Yl5-5M
Code: https://github.com/gauravthorath/rag-scale-demo
Full article: https://gauravthorat-portfolio.vercel.app/blog/rag-production-architecture
The Monday morning problem
Locally: chunk docs → embed → upsert to Pinecone → query → LLM. Simple.
Under load: socket exhaustion, connection pool saturation, API 429s, token costs exploding.
Naive RAG (what most people build first)
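const vectors: { id: string; values: number[]; metadata: { text: string } }[] = [];
for (let i = 0; i < SAMPLE_CHUNKS.length; i++) {
  const text = SAMPLE_CHUNKS[i];
  // One OpenAI request per chunk
  const values = await embedOne(openai, embedModel, text);
  vectors.push({ id: `demo-naive-${i}`, values, metadata: { text } });
}
// A fresh client on every run (indexName is assumed to hold your index name)
const pinecone = new Pinecone({ apiKey: pineconeKey });
const index = pinecone.index(indexName);
for (const v of vectors) {
  // One Pinecone request per vector
  await index.namespace(DEMO_NAMESPACE).upsert([v]);
}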
Why it breaks at scale:
- One embedding call per chunk
- One upsert per vector
- No batching, no connection reuse, no retries
- New client instances created repeatedly
With 3 chunks, each user triggers 3 embedding calls and 3 upserts. Across 1,000 users that is 6,000 outbound API calls before a single retry. Sockets and rate limits run out fast.
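To put rough numbers on that multiplication, here is a small sketch that counts outbound requests for one ingest run per user. `countCalls` and the batch sizes (64 and 100) are hypothetical and purely illustrative; retries are left out, so real traffic would be higher.

```typescript
// Hypothetical helper: counts outbound API requests, ignoring retries.
interface CallCount {
  embedCalls: number;
  upsertCalls: number;
  total: number;
}

function countCalls(
  chunks: number,
  users: number,
  embedBatchSize: number,
  upsertBatchSize: number,
): CallCount {
  // Each user needs ceil(chunks / batchSize) requests per stage.
  const embedCalls = users * Math.ceil(chunks / embedBatchSize);
  const upsertCalls = users * Math.ceil(chunks / upsertBatchSize);
  return { embedCalls, upsertCalls, total: embedCalls + upsertCalls };
}

const naive = countCalls(3, 1000, 1, 1); // one request per chunk and per vector
const batched = countCalls(3, 1000, 64, 100); // illustrative batch sizes

console.log(naive.total, batched.total); // prints: 6000 2000
```

Even with only 3 chunks, batching removes two thirds of the requests; the gap widens as documents get longer.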
Production pattern
Same RAG logic. Better infrastructure.
Singleton Pinecone client:
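import { Pinecone, type Index } from "@pinecone-database/pinecone";

let client: Pinecone | undefined;
const indexCache = new Map<string, Index>();

// Create the client once per process and reuse it.
// PINECONE_API_KEY is an assumed env name; getEnv() is the validated env loader.
const getPineconeClient = (): Pinecone => {
  client ??= new Pinecone({ apiKey: getEnv().PINECONE_API_KEY });
  return client;
};

export const getPineconeIndex = (indexName?: string): Index => {
  const name = indexName ?? getEnv().PINECONE_INDEX_NAME;
  let idx = indexCache.get(name);
  if (!idx) {
    idx = getPineconeClient().index(name);
    indexCache.set(name, idx);
  }
  return idx;
};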
Embedding batching:
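// inputs: string[] holding a whole batch of chunk texts
const res = await openai.embeddings.create({
  model,
  input: inputs,
});
// One result per input; sort by index so the order matches inputs
const embeddings = res.data.sort((a, b) => a.index - b.index).map((d) => d.embedding);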
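64 texts → 1 API call instead of 64. Big win on latency and request rate limits. Token usage stays the same, since embeddings are billed per token, so the savings come from fewer round trips and fewer retried calls.
This batching is in-process only. For multiple servers, add Redis caching and a task queue.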
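A minimal sketch of the in-process batching described above. The names `chunkArray`, `embedInBatches`, and the `EmbedBatch` callback type are hypothetical, not taken from the repo; the callback stands in for whatever function sends one batch to the embeddings API, and the default of 64 simply mirrors the number used above.

```typescript
// Assumed shape: sends one batch of texts, returns one vector per text, in order.
type EmbedBatch = (inputs: string[]) => Promise<number[][]>;

// Split an array into consecutive slices of at most `size` items.
function chunkArray<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new Error("size must be a positive integer");
  }
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Embed all texts with one request per batch, preserving input order.
async function embedInBatches(
  texts: string[],
  embedBatch: EmbedBatch,
  batchSize = 64,
): Promise<number[][]> {
  const embeddings: number[][] = [];
  for (const batch of chunkArray(texts, batchSize)) {
    const vectors = await embedBatch(batch);
    if (vectors.length !== batch.length) {
      throw new Error("embedding count does not match batch size");
    }
    embeddings.push(...vectors);
  }
  return embeddings;
}
```

The same `chunkArray` helper can group vectors before a bulk upsert. Batches run sequentially here to keep the request rate predictable; running them in parallel is possible but shifts the pressure back onto rate limits.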
Naive vs production
| Naive | Production |
|---|---|
| New Pinecone client per call | Singleton client |
| One embedding per chunk | Batched embeddings |
| One upsert per vector | Bulk upsert |
| Raw env vars | Zod validation |
| No retries | Backoff + retry |
| No metrics | Tracing + metrics |
Before real scale
- Exponential backoff + jitter on OpenAI and Pinecone
- Top-K + reranking (don't dump every chunk into the prompt)
- Distributed rate limiting across instances
- Metrics: embed latency, retrieval quality, token usage
- Stable vector IDs for safe retries
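As one way to approach the first and last items in the checklist, here is a hedged sketch of retry with exponential backoff and full jitter, plus a deterministic vector ID. `withRetry`, `backoffDelay`, `stableVectorId`, the option names, and the defaults are all hypothetical; which errors count as retryable (for example 429s and timeouts) is left to the `isRetryable` callback you supply.

```typescript
// Hypothetical options; names and defaults are illustrative only.
interface RetryOptions {
  maxAttempts?: number; // total tries, including the first
  baseDelayMs?: number; // backoff ceiling before the first retry
  maxDelayMs?: number; // upper bound on any single wait
  isRetryable?: (err: unknown) => boolean;
  sleep?: (ms: number) => Promise<void>; // injectable for tests
  random?: () => number; // injectable for tests
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Full jitter: wait a random time in [0, min(maxDelayMs, baseDelayMs * 2^attempt)).
function backoffDelay(
  attempt: number,
  baseDelayMs: number,
  maxDelayMs: number,
  random: () => number = Math.random,
): number {
  const cap = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
  return Math.floor(random() * cap);
}

async function withRetry<T>(fn: () => Promise<T>, opts: RetryOptions = {}): Promise<T> {
  const {
    maxAttempts = 5,
    baseDelayMs = 200,
    maxDelayMs = 10_000,
    isRetryable = () => true,
    sleep = defaultSleep,
    random = Math.random,
  } = opts;
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const lastAttempt = attempt + 1 >= maxAttempts;
      if (lastAttempt || !isRetryable(err)) throw err;
      await sleep(backoffDelay(attempt, baseDelayMs, maxDelayMs, random));
    }
  }
}

// Deterministic ID: the same chunk of the same document always gets the same ID,
// so a retried upsert overwrites the record instead of adding a duplicate.
const stableVectorId = (docId: string, chunkIndex: number): string =>
  `${docId}#chunk-${chunkIndex}`;
```

Usage would look like `await withRetry(() => upsertBatch(records), { isRetryable })`, where `upsertBatch` is your own function. The jitter matters because without it every instance that hit the same 429 retries at the same moment.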
Try it
git clone https://github.com/gauravthorath/rag-scale-demo
cd rag-scale-demo
cp .env.example .env
npm install
npm run naive
npm run production
Use separate Pinecone namespaces so runs don't overwrite each other.
Final thoughts
Most RAG tutorials stop at "it answers my PDF." Production is about surviving concurrency, retries, rate limits, and cost pressure.
Questions or repo fixes? Drop a comment. I reply here and on YouTube.
Originally published on my portfolio: https://gauravthorat-portfolio.vercel.app/blog/rag-production-architecture