DEV Community

Anurag Srivastava

I built a production RAG pipeline. Here's what most tutorials skip.

I wanted a RAG system that was fast to run and fast to set up for clients. Upload a PDF, ask questions, get answers with citations. Pretty standard stuff for anyone freelancing in the AI space.

The problem is that every tutorial I found stops at a Jupyter notebook. Working demo, zero production readiness. No auth, no caching, no way to handle more than one user. The happy path, and nothing else.

So I built the whole thing. Deployed, running, something I can actually show to clients.

Here's what that looked like.

Live Demo | Backend Repo | Frontend Repo

AdvChat conversation showing a streamed answer about the backwards law with five source citation chips expanded below


Following a question through the pipeline

The easiest way to explain this system is to trace what happens when someone asks a question.

Architecture

Say a user uploaded a 150-page book and types: "What's the main argument of chapter 3?"

The fingerprint problem

First thing I need to figure out: have I seen this question before?

Not this exact string. That's easy. But what about "what is the main argument of chapter 3" without the contraction? Or "What's the main argument of chapter 3??" with extra punctuation? Different casing?

Same question, four different strings. A naive string comparison treats each one as unique, and each one costs an OpenAI embedding call. That adds up.

The fix I landed on: normalize the query first. Expand contractions, strip punctuation, remove stopwords, collapse whitespace. Then SHA-256 the cleaned string.

The normalization + fingerprinting code

```typescript
import crypto from 'crypto';

// CONTRACTIONS maps "what's" → "what is", etc.; STOPWORDS is a Set of
// common words ("the", "of", ...). Both are lookup tables defined elsewhere.
function normalizeQuery(input: string): string {
    let text = input.toLowerCase().trim();

    // "don't" → "do not", "what's" → "what is"
    for (const [k, v] of Object.entries(CONTRACTIONS)) {
        text = text.replace(new RegExp(`\\b${k}\\b`, 'g'), v);
    }

    text = text.replace(/[^a-z0-9\s]/g, ' '); // strip punctuation
    text = text.replace(/\s+/g, ' ');         // collapse spaces

    const words = text.split(' ').filter(w => !STOPWORDS.has(w));
    return words.join(' ').trim();
}

// `query` is the raw user input
const fingerprint = crypto.createHash('sha256')
    .update(normalizeQuery(query))
    .digest('hex');
```

All four variations produce the same hash. One fingerprint, one cache key. The embedding cache and response cache both use this, so every performance shortcut downstream starts here.

Cache check

The pipeline looks in Redis for a cached embedding vector at emb:{fingerprint}. If it finds a 1536-dimensional array there, it skips the OpenAI embedding call. About 200ms saved before we even talk to the vector database.

Embedding cache logic

```typescript
let queryEmbedding: number[];
const cached = await cacheService.getCache(`emb:${fingerprint}`);

if (cached) {
    queryEmbedding = JSON.parse(cached);
} else {
    queryEmbedding = await embeddingService.getEmbedding(query);
    await cacheService.setCache(
        `emb:${fingerprint}`,
        JSON.stringify(queryEmbedding),
    );
}
```

There's also a response cache (resp:{fingerprint}) that stores complete LLM answers. I found out during testing that this breaks in conversation mode, though. The same question asked as a follow-up has different context because of chat history, so the cached answer would be wrong. The response cache only kicks in for standalone queries. The embedding cache works everywhere.

I added an X-Cache-Embed response header so the frontend knows which path was taken. Helps when debugging latency issues in production.
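A minimal sketch of that gating logic. The `CacheService` interface and `tryResponseCache` name are illustrative, not the project's actual API; the point is that conversation mode bypasses the response cache entirely:

```typescript
// Illustrative sketch of the response-cache gate, assuming a Redis-backed
// cache service. Names are mine, not the project's actual API.
interface CacheService {
  getCache(key: string): Promise<string | null>;
  setCache(key: string, value: string): Promise<void>;
}

// Follow-ups carry chat history, so the same fingerprint can mean a
// different question; only standalone queries may serve a cached answer.
async function tryResponseCache(
  cache: CacheService,
  fingerprint: string,
  isFollowUp: boolean,
): Promise<string | null> {
  if (isFollowUp) return null; // conversation mode: always regenerate
  return cache.getCache(`resp:${fingerprint}`);
}
```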

Vector search

The embedding goes to Pinecone for similarity search. Multi-tenancy is where things get interesting.

Most tutorials scope queries with metadata filters: where: { userId: 'abc' }. That works until someone forgets the filter and suddenly users can see each other's documents.

I used Pinecone namespaces instead. Each user's vectors live in a separate namespace. It's not a filter you can forget to add. The data is physically separated.

Namespace-scoped operations

```typescript
// Every operation is scoped to the user's namespace
await this.index()
    .namespace(userId)
    .upsert({ records: vectors.slice(i, i + BATCH) });

const result = await this.index()
    .namespace(userId)
    .query({ vector: queryEmbedding, topK, includeMetadata: true });

await this.index()
    .namespace(userId)
    .deleteMany({ ids });
```

Redacting context before it reaches the LLM

Pinecone returns the top-K matching chunks. Before they go to the LLM, there's a step most tutorials ignore completely.

If someone uploads a security doc with phrases like "bypass authentication" or "disable firewall rules," and asks the right question, the LLM will repeat those instructions. It's in the prompt context. The model doesn't know it shouldn't say that.

So I scan the retrieved chunks for suspicious patterns and replace them with [REDACTED_REASON] before the LLM ever sees the context. The response includes a policy field (allow or partial) so the client knows whether redaction happened.

Pre-LLM redaction filter

```typescript
const preFilterDocs = (text: string) => {
    const suspicious =
        /\b(bypass|disable|ignore rules|unrestricted|open firewall|run arbitrary)\b/gi;
    // Check for matches before replacing, and reset lastIndex explicitly:
    // a /g regex is stateful, so reusing it out of order is fragile.
    const found = suspicious.test(text);
    suspicious.lastIndex = 0;
    const redacted = text.replace(suspicious, '[REDACTED_REASON]');
    return { redacted, found };
};

const policyCheck = () => {
    const hasSuspicious = preFilterResults.some(r => r.prefilter.found);
    if (hasSuspicious)
        return { decision: 'partial', reason: 'context_redacted' };
    return { decision: 'allow', reason: 'ok' };
};
```

Streaming the answer

The redacted context hits GPT-4o-mini. Total response time is about 5.9 seconds, which sounds slow, but the user sees text after 3.5 seconds because of streaming. They're already reading before the model finishes.

AdvChat mid-conversation with a streaming response being generated in real time while the user asks about entitlement

The tricky part: source citations need to come after the stream ends. I split it into two SSE event types.

The streaming protocol

```typescript
// Stream LLM chunks as they arrive
await llmService.streamAnswer(query, redactedContext, (chunk) => {
    fullAnswer += chunk;
    res.write(`data: ${JSON.stringify({ type: 'chunk', data: chunk })}\n\n`);
});

// Final event: source citations + policy decision
res.write(`data: ${JSON.stringify({
    type: 'done',
    provenance,
    policy: { decision: policyResult.decision, reason: policyResult.reason },
})}\n\n`);
res.end();
```

chunk events build the message in real time. done triggers the source citation chips below the answer. Two event types, one connection.
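On the client, consuming the two event types can be sketched like this. The event shapes mirror the protocol above; the helper names (`parseSSEFrame`, `applyEvent`) are illustrative, not the actual frontend code:

```typescript
// Illustrative client-side handling of the two-event SSE protocol.
type StreamEvent =
  | { type: 'chunk'; data: string }
  | { type: 'done'; provenance?: unknown; policy?: { decision: string; reason: string } };

// Parse one SSE frame ("data: {...}\n\n") into a typed event.
function parseSSEFrame(frame: string): StreamEvent | null {
  const line = frame.trim();
  if (!line.startsWith('data: ')) return null;
  return JSON.parse(line.slice('data: '.length)) as StreamEvent;
}

// chunk events append text in real time; done marks the message complete
// so the UI can render the citation chips.
function applyEvent(
  state: { answer: string; complete: boolean },
  ev: StreamEvent,
): void {
  if (ev.type === 'chunk') state.answer += ev.data;
  else state.complete = true;
}
```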


Frontend problems I didn't expect

The backend was the interesting part. The frontend is where things broke in ways I didn't anticipate.

Streaming and navigation don't mix in Next.js. When a user sends their first message, the app creates a conversation and updates the URL from /chat to /chat/abc123. I used router.replace(). It unmounted the component. The SSE connection died. The user saw half an answer and got bounced to an error page.

Fix: window.history.replaceState(). Updates the URL without triggering React navigation. Component stays mounted, stream keeps going.
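A sketch of what that fix looks like. `chatPath` and `replaceUrlInPlace` are hypothetical helper names; the route shape (`/chat/<id>`) follows the post. In the component you'd call it with `window.history` after the conversation is created:

```typescript
// Illustrative: swap the URL without a React navigation. The caller passes
// window.history; accepting it as a parameter keeps this testable.
type HistoryLike = {
  replaceState(data: unknown, unused: string, url: string): void;
};

function chatPath(conversationId: string): string {
  return `/chat/${conversationId}`;
}

function replaceUrlInPlace(conversationId: string, history: HistoryLike): void {
  // replaceState updates the address bar but never unmounts the page
  // component, so the in-flight SSE stream survives.
  history.replaceState(null, '', chatPath(conversationId));
}
```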

There's also this: the backend always returns source citations, even when the answer is "I don't know." Retrieval found context; the LLM just couldn't answer from it. Showing citations under a non-answer confuses people, so the frontend checks the response text against a regex and hides the source chips when the answer looks like a "no knowledge" response. Took me longer than it should have.
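The post doesn't show the actual regex, so this pattern is illustrative, but the shape of the check is simple:

```typescript
// Illustrative heuristic: hide the citation chips when the answer reads
// like a refusal. The real pattern would be tuned to the LLM's phrasing.
const NO_KNOWLEDGE =
  /\b(i don'?t know|couldn'?t find|no relevant (information|context))\b/i;

function shouldShowSources(answer: string, sourceCount: number): boolean {
  return sourceCount > 0 && !NO_KNOWLEDGE.test(answer);
}
```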

AdvChat showing the thinking indicator dots while waiting for the LLM to respond to a follow-up question


Going to production

Getting this running locally was one thing. Deploying it was a different exercise.

The first Dockerfile was 900MB and ran the dev server. After a few rounds, I landed on a multi-stage build: TypeScript compiles in one stage, the production image only gets the compiled JavaScript and prod dependencies. About 250MB.

First Railway deployment crashed because uploads/ didn't exist in the container. I'd put it in .dockerignore (you don't want local PDFs in your image), but multer needs that directory to write temp files. One-line fix: fs.mkdirSync('uploads', { recursive: true }) at startup. Then I realized the files should be deleted after indexing anyway. No persistent storage, no S3, no disk space to worry about.
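The fix plus the cleanup, sketched together. Function names here are mine, but the mechanics (mkdir at startup, unlink after indexing) follow the post:

```typescript
import fs from 'fs';
import fsp from 'fs/promises';

// uploads/ is .dockerignore'd, so it never exists in the image.
// Recreate it at startup, before multer tries to write temp files.
function ensureUploadDir(dir: string = 'uploads'): void {
  fs.mkdirSync(dir, { recursive: true });
}

// The PDF is only needed until its chunks are embedded; deleting it
// means no persistent storage, no S3, no disk space to worry about.
async function cleanupAfterIndexing(filePath: string): Promise<void> {
  await fsp.unlink(filePath).catch(() => {}); // ignore if already gone
}
```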

The health endpoint hits Postgres, Redis, and Pinecone in parallel with individual timeouts. Railway restarts the container if it fails. Graceful shutdown drains connections with a 10-second limit. CORS is wide open locally, locked to the Vercel domain in production.
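A sketch of that health check, assuming each dependency exposes some ping function; the service names and helper signatures are illustrative:

```typescript
// Race each dependency's ping against its own timeout.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    p,
    new Promise<T>((_, reject) =>
      setTimeout(() => reject(new Error('timeout')), ms),
    ),
  ]);
}

async function healthCheck(
  deps: Record<string, () => Promise<void>>,
  timeoutMs = 2000,
) {
  const entries = await Promise.all(
    Object.entries(deps).map(async ([name, ping]) => {
      try {
        await withTimeout(ping(), timeoutMs);
        return [name, 'ok'] as const;
      } catch {
        return [name, 'down'] as const;
      }
    }),
  );
  // Any failed dependency marks the whole service unhealthy; the platform
  // restarts the container when this endpoint reports a failure.
  const status = Object.fromEntries(entries);
  return { healthy: entries.every(([, s]) => s === 'ok'), status };
}
```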

None of this is interesting, but skip any of it and the deploy falls over.


Numbers

From the production deployment:

| What | How fast |
| --- | --- |
| Index a 150-page PDF (840 chunks) | ~27s |
| Index a 42-page PDF (122 chunks) | ~7s |
| First token appears | ~3.5s |
| Full streaming response | ~5.9s |
| Monthly cost | ~$5-8 |

Most of the latency is OpenAI. Embedding generation and LLM inference, not server compute. Streaming covers the gap: 3.5 seconds to first visible text, and people start reading.

Tech stack

| Layer | Technology | Why |
| --- | --- | --- |
| Frontend | Next.js 15, Tailwind CSS, React Query | App Router, SSE streaming, server-side auth |
| Backend | Express, TypeScript | Full control over SSE and middleware |
| Auth | Clerk | OAuth + webhook user sync |
| Vector DB | Pinecone | Managed, namespace isolation |
| LLM | OpenAI GPT-4o-mini | Fast, cheap ($0.15/1M tokens) |
| Embeddings | text-embedding-3-small | 1536 dims, $0.02/1M tokens |
| Cache | Redis | Embedding + response caching |
| Database | PostgreSQL + Prisma | Users, conversations, documents |
| Hosting | Railway + Vercel | ~$5-8/month total |

What I'd change

My chunks are fixed at 500 tokens with 100 overlap. They cut mid-sentence sometimes, and the LLM struggles with the partial context. I'd switch to semantic chunking, splitting on paragraph boundaries instead.
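What that might look like, as a rough sketch. This uses a crude 4-characters-per-token estimate rather than a real tokenizer, and `chunkByParagraph` is a hypothetical replacement, not code from the repo:

```typescript
// Sketch of paragraph-boundary chunking. Greedily packs whole paragraphs
// into each chunk; a single paragraph longer than the budget stays whole.
function chunkByParagraph(text: string, maxTokens = 500): string[] {
  const approxTokens = (s: string) => Math.ceil(s.length / 4); // rough estimate
  const chunks: string[] = [];
  let current = '';

  for (const para of text.split(/\n{2,}/)) {
    const candidate = current ? `${current}\n\n${para}` : para;
    if (current && approxTokens(candidate) > maxTokens) {
      chunks.push(current); // close the chunk on a paragraph boundary
      current = para;
    } else {
      current = candidate;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```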

I wrote a reranking service and then never connected it. A cross-encoder between vector search and the LLM would drop the marginal chunks before they eat up context window. It's sitting there, just not plugged in.

The other thing: if the LLM errors mid-stream, the user sees a half-finished message and nothing else. No error state, no retry button. I need a type: 'error' event in the SSE protocol. It's on my list.
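What the missing error path could look like: a third event type in the same SSE protocol, so the frontend can render an error state with a retry button. This is a sketch of the planned change, not shipped code:

```typescript
// The protocol's event union, extended with the proposed error variant.
type OutEvent =
  | { type: 'chunk'; data: string }
  | { type: 'done'; provenance?: unknown }
  | { type: 'error'; message: string };

function sseFrame(event: OutEvent): string {
  return `data: ${JSON.stringify(event)}\n\n`;
}

// In the handler, wrap streaming in try/catch and emit the error frame
// instead of silently ending the response mid-answer:
//
// try { /* stream chunks */ } catch {
//   res.write(sseFrame({ type: 'error', message: 'generation failed' }));
// } finally { res.end(); }
```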


TypeScript, Express, Next.js, OpenAI, Pinecone, Redis, PostgreSQL, Clerk. Railway + Vercel. About $5-8/month.

If you have questions about any of the architecture decisions, I'm happy to talk about them in the comments.

anuragmerndev / adv-rag

RAG-powered document Q&A API with streaming, dual-layer caching, and multi-tenant vector search

AdvChat — Backend API

TypeScript Node.js Express Prisma OpenAI Pinecone Redis PostgreSQL

RAG-powered document Q&A API with streaming responses, dual-layer caching, multi-tenant vector search, and pre-LLM content redaction.

Live Demo | Frontend Repo | Case Study


Architecture

Upload path: PDF → parse (LangChain) → chunk (500 tokens, 100 overlap) → embed (OpenAI text-embedding-3-small) → store (Pinecone, namespaced per user)

Query path: Question → normalize → SHA-256 fingerprint → check embedding cache (Redis) → embed if miss → similarity search (Pinecone) → redact suspicious context → stream answer (GPT-4o-mini) → persist (Postgres)


Features

  • RAG Pipeline — PDF upload, chunking, embedding, vector search, LLM answer generation
  • SSE Streaming — real-time response streaming with a two-event protocol (chunks + done with provenance)
  • Dual-Layer Caching — embedding cache (emb:) saves OpenAI calls; response cache (resp:) for standalone queries
  • Query Fingerprinting — normalizes queries (contractions, stopwords, punctuation) then SHA-256 hashes for cache deduplication
  • Pre-LLM Redaction — scans retrieved context…

anuragmerndev / adv-rag-ui

Chat interface for RAG-powered document Q&A with real-time streaming and source citations

AdvChat — Frontend

Next.js React TypeScript Tailwind CSS Clerk

Chat interface for RAG-powered document Q&A with real-time streaming responses, source citations, conversation history, and drag-and-drop PDF upload.

Live Demo | Backend Repo | Case Study


Architecture

The frontend connects to the Express backend via REST and SSE. Clerk handles authentication client-side, and the backend verifies tokens on every request.


Features

  • Streaming Chat — SSE-based real-time response streaming with chunk-by-chunk rendering
  • Source Citations — expandable chips showing document name and quoted content per answer
  • Conversation History — persistent conversations with titles, message counts, sidebar navigation
  • PDF Upload — drag-and-drop zone with file type validation (PDF only, 25MB limit) and progress feedback
  • Document Management — searchable list with multi-select checkboxes to scope queries to specific documents
  • Smart Input — arrow-key history navigation (up/down), Shift+Enter for multiline, auto-growing textarea
  • Skeleton Loaders — animated message-shaped placeholders while loading old conversations
  • Thinking Indicator — pulsing dots while waiting for first token

Top comments (1)

John Giftakis

Any ideas on CAG or other methods of "RAG"-ing?