TutorialQ

Posted on • Originally published at tutorialq.com


System Design Deep Dive — #3 of 20 | This is part of a 20-post series covering the most critical system design topics. Follow to get the next one.

RAG Architecture: Building AI Apps That Know Your Data

Perplexity AI processes 100M+ queries per month. GitHub Copilot references your entire codebase to suggest context-aware code. Both are powered by Retrieval-Augmented Generation. RAG has become the dominant pattern for building AI applications that need domain-specific knowledge -- and for good reason: it's 10-100x cheaper than fine-tuning, works with any LLM, and updates instantly when your data changes.

RAG Architecture

TL;DR: RAG gives your LLM access to fresh, domain-specific data at query time without retraining. The architecture involves chunking documents, generating embeddings, storing them in a vector database, retrieving relevant context via hybrid search, and feeding it to the LLM. The difference between a mediocre RAG system and a great one comes down to chunking strategy, hybrid retrieval, and re-ranking.

The Problem

LLMs are trained on public data with a knowledge cutoff. They don't know about your internal docs, your latest product updates, or your proprietary data. Fine-tuning can help, but it's expensive to repeat every time data changes, and it bakes knowledge into weights rather than allowing dynamic updates.

| Approach | Cost to Update | Knowledge Freshness | Domain Accuracy | Setup Time |
|---|---|---|---|---|
| Prompt engineering | Free | Static | Low | Minutes |
| RAG | Low ($) | Real-time | High | Days |
| Fine-tuning | High ($$$) | Stale until retrained | Very high | Weeks |
| Pre-training | Extreme | Stale until retrained | Highest | Months |

RAG takes a different approach: retrieve relevant context from your data at query time and feed it alongside the user's question. Simple concept. Tricky to get right.

RAG Pipeline Flow

Building a RAG Pipeline

Step 1: Document Ingestion and Chunking

Your knowledge base needs to be broken into chunks that are small enough to fit in context windows but large enough to carry meaningful information.

Chunk size is one of the most impactful decisions in a RAG system:

```python
def chunk_document(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split a document into overlapping word-based chunks.

    Note: this splits on words as a rough proxy for tokens; a production
    pipeline would use the model's tokenizer for exact budgets.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    chunks = []
    for i in range(0, len(words), chunk_size - overlap):
        chunk = " ".join(words[i:i + chunk_size])
        if chunk:
            chunks.append(chunk)
    return chunks
```
  • Too small: chunks lose surrounding context and retrieval misses relevant information
  • Too large: chunks contain noise that dilutes the useful signal
  • Overlapping chunks: improve retrieval at chunk boundaries where important information often spans two chunks

Step 2: Embedding Pipeline

Convert each chunk into a vector embedding -- a dense numerical representation that captures semantic meaning.

Consistency matters here. If you generate embeddings with one model for indexing and a different model for queries, similarity search breaks. Pick a model and commit to it.

| Embedding Model | Dimensions | Quality (MTEB) | Speed | Best For |
|---|---|---|---|---|
| OpenAI text-embedding-3-large | 3072 | High | Fast (API) | Production, accuracy priority |
| OpenAI text-embedding-3-small | 1536 | Good | Fast (API) | Cost-sensitive production |
| Cohere embed-v3 | 1024 | High | Fast (API) | Multilingual support |
| BGE-large (HuggingFace) | 1024 | High | Medium | Self-hosted, no API dependency |
| E5-mistral-7b | 4096 | Highest | Slow | Maximum quality, self-hosted |
| all-MiniLM-L6-v2 | 384 | Moderate | Very fast | Development, prototyping |

Step 3: Vector Store

Embeddings need to be stored and indexed for fast similarity search. When a user asks a question, you embed the query and find the most similar document chunks.

| Vector Store | Type | Max Scale | Filtering | Best For |
|---|---|---|---|---|
| Pinecone | Managed SaaS | Billions | Rich metadata | Production, managed ops |
| Weaviate | Managed/Self-hosted | Billions | GraphQL + filters | Hybrid search native |
| Qdrant | Self-hosted/Cloud | Billions | Payload filtering | Performance-critical |
| pgvector | PostgreSQL extension | Millions | Full SQL | Already on PostgreSQL |
| Chroma | Embedded | Millions | Basic | Prototyping, local dev |
| Milvus | Self-hosted/Cloud | Billions | Rich | Large-scale production |

The choice depends on your scale, operational budget, and whether you want to manage infrastructure. Pro tip: Start with pgvector or Chroma for prototyping, migrate to a managed solution when you hit scale.
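Under the hood, every vector store is doing some optimized version of the same operation: score the query vector against stored vectors and return the nearest neighbors. A minimal brute-force sketch (real stores use approximate indexes like HNSW to avoid scanning everything):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def similarity_search(query_vec: list[float], index: list[tuple], k: int = 3) -> list[str]:
    """index: list of (chunk_text, embedding). Returns top-k chunks by similarity."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

# Toy 3-dimensional embeddings for illustration only
index = [
    ("refund policy", [0.9, 0.1, 0.0]),
    ("shipping times", [0.1, 0.9, 0.0]),
    ("privacy notice", [0.0, 0.1, 0.9]),
]
print(similarity_search([0.8, 0.2, 0.0], index, k=1))  # ['refund policy']
```

This linear scan is O(n) per query, which is exactly why the stores above exist: approximate nearest-neighbor indexes trade a little recall for sub-linear query time.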

Step 4: Retrieval Strategy

Pure vector search retrieves semantically similar chunks, but it's not always enough. Hybrid search -- combining semantic similarity with keyword matching -- outperforms pure vector search in most production scenarios.

```python
def hybrid_search(query: str, vector_store, keyword_index, k: int = 10):
    """Combine semantic and keyword search results."""
    semantic_results = vector_store.similarity_search(query, k=k)
    keyword_results = keyword_index.search(query, k=k)

    # Merge the two ranked lists with reciprocal rank fusion
    # (reciprocal_rank_fusion is assumed to be defined elsewhere)
    combined = reciprocal_rank_fusion(semantic_results, keyword_results)
    return combined[:k]
```

Adding a re-ranking step on top -- retrieve 20 candidates, re-rank to the top 5 -- further improves answer quality. Cohere's reranker and cross-encoder models from the sentence-transformers library are popular choices. The latency cost is typically 50-100ms, which is almost always worth the quality improvement.
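The `reciprocal_rank_fusion` helper called in the snippet above is straightforward to implement: each item earns a score of 1/(k + rank) per list it appears in, so items ranked highly by both search methods float to the top. A minimal sketch:

```python
def reciprocal_rank_fusion(*ranked_lists: list, k: int = 60) -> list:
    """Merge ranked result lists: each item scores the sum of 1/(k + rank)
    across every list it appears in. k=60 is the commonly used constant."""
    scores: dict = {}
    for results in ranked_lists:
        for rank, item in enumerate(results, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["doc_a", "doc_b", "doc_c"]
keyword = ["doc_b", "doc_d", "doc_a"]
print(reciprocal_rank_fusion(semantic, keyword))
# ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

`doc_b` wins because it ranked in the top two of both lists; appearing in multiple rankings matters more than topping a single one.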

Step 5: Context Assembly

Once you have your retrieved chunks, you need to assemble them into a coherent prompt that fits within the model's context window.

This involves deduplicating overlapping chunks, ordering them by relevance, respecting token limits, and adding source attribution metadata so the model (and the user) can trace answers back to specific documents.
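Those assembly steps can be sketched in a few lines. This is a simplified illustration: it uses word count as a stand-in for a real tokenizer, and the `score`/`source` field names are assumptions, not a standard schema:

```python
def assemble_context(chunks: list[dict], token_budget: int = 2000) -> str:
    """Deduplicate, order by relevance, respect a token budget, and
    tag each chunk with its source for attribution.

    chunks: [{'text': ..., 'score': ..., 'source': ...}]
    """
    seen: set = set()
    parts: list[str] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        if chunk["text"] in seen:
            continue  # drop exact duplicates from overlapping retrieval
        tokens = len(chunk["text"].split())  # crude proxy for a real tokenizer
        if used + tokens > token_budget:
            break
        seen.add(chunk["text"])
        parts.append(f"[Source: {chunk['source']}]\n{chunk['text']}")
        used += tokens
    return "\n\n".join(parts)

chunks = [
    {"text": "Refunds within 30 days.", "score": 0.9, "source": "policy.md"},
    {"text": "Refunds within 30 days.", "score": 0.7, "source": "faq.md"},
    {"text": "Shipping takes 5 days.", "score": 0.5, "source": "shipping.md"},
]
print(assemble_context(chunks, token_budget=50))
```

The duplicate chunk is dropped even though it came from a different source; a production version might instead merge the source attributions.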

Step 6: Generation with Grounding

The LLM generates its answer grounded in the retrieved context. Adding citation tracking -- "Based on [Document X, Section Y]" -- improves trust and allows users to verify claims.

What Goes Wrong

The most common RAG failure mode is poor retrieval. If the retrieval layer returns irrelevant chunks, the model either hallucinates or refuses to answer. Typical causes include:

  • Chunk sizes that don't match query patterns
  • Missing metadata filtering (searching all docs when only a subset is relevant)
  • Embedding model mismatch between indexing and querying
  • No re-ranking, so marginally relevant chunks outrank highly relevant ones

Always evaluate retrieval quality independently from generation quality. You can't fix bad retrieval with a better prompt.

Advanced RAG Patterns

| Pattern | When to Use | Complexity |
|---|---|---|
| Naive RAG | Simple Q&A, prototypes | Low |
| Hybrid search (semantic + keyword) | Most production systems | Medium |
| Re-ranking (retrieve 20, re-rank to top 5) | When precision matters | Medium |
| Agentic RAG | Multi-step reasoning, tool use | High |
| Graph RAG | Entity-relationship-heavy domains | High |
| Corrective RAG (CRAG) | Self-correcting retrieval | High |
| Multi-index RAG | Multiple document types/sources | Medium |

5 Hidden Gotchas That Will Bite You in Production

RAG Architecture Gotchas

Eugene Yan documented RAG as one of seven essential patterns for production LLM systems. Pinecone's research shows 88% of LLM application builders consider retrieval a key component. But the gap between "RAG demo" and "RAG product" is vast — here's what falls into that gap:

1. Chunking Boundary Problems

Your recursive text splitter cuts a document at 512 tokens. One chunk ends with: "The maximum retry count is". The next chunk starts with: "3, with exponential backoff." A user asks "What's the max retry count?" Neither chunk answers the question alone, and semantic search might retrieve only one of them. This is the most common RAG failure mode — and it's a data engineering problem, not an LLM problem.

Fix: Use semantic chunking (split by paragraph/topic boundaries) instead of fixed-size splits. Add 10-20% token overlap between chunks so boundary information is duplicated. Use a chunking strategy that respects document structure (Markdown headers, HTML tags, paragraph breaks). Test chunking quality by running your eval queries against individual chunks — if a chunk can't answer a question by itself, your chunking is too aggressive.
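A structure-aware splitter is a small amount of code. This sketch splits on Markdown headers so each chunk stays inside one topical section (a production version would further subdivide oversized sections and add overlap):

```python
import re

def chunk_by_headers(markdown: str) -> list[str]:
    """Split a Markdown document at header boundaries (levels 1-3) so
    each chunk covers one topical section, header included."""
    sections = re.split(r"(?m)^(?=#{1,3} )", markdown)
    return [s.strip() for s in sections if s.strip()]

doc = (
    "# Retries\nThe maximum retry count is 3, with exponential backoff.\n\n"
    "# Timeouts\nDefault timeout is 30s."
)
for chunk in chunk_by_headers(doc):
    print(chunk, "\n---")
```

Because the retry sentence stays intact inside the "Retries" chunk, the boundary problem from the example above never occurs: the chunk can answer "What's the max retry count?" on its own.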

2. Embedding Model Mismatch

You indexed 100,000 documents with text-embedding-ada-002 (1536 dimensions). Six months later, OpenAI releases a better model. You update your query encoder to text-embedding-3-small (1536 dimensions, same size — but different vector space). Search results become nearly random because the query vectors and document vectors occupy incompatible spaces. The dimensions match, so no error is thrown — just silently terrible retrieval.

Fix: Store the embedding model version alongside your vector index. When you change the embedding model, re-index ALL documents — there is no shortcut. Include the model version in your index name (docs_ada002_v1, docs_3small_v1). Create the new index in parallel, validate retrieval quality, then switch atomically. This is why embedding model selection should be deliberate, not casual.

3. Context Window Stuffing

You retrieve 20 relevant chunks and stuff them all into the LLM's context window. Research from Liu et al. (Stanford, 2023) demonstrated the "lost-in-the-middle" effect: LLMs pay strong attention to the beginning and end of the context but significantly less attention to information in the middle. Your most relevant chunk, ranked #4 of 20, lands in the middle of the context and is effectively ignored.

Fix: Retrieve broadly, then re-rank and truncate. Use a cross-encoder reranker (Cohere Rerank, or open-source models like bge-reranker) to score chunk relevance. Keep only the top 3-5 chunks. Quality over quantity: 3 highly relevant chunks outperform 20 marginally relevant ones. Place the most relevant chunk at the beginning of the context, not in the middle.
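The retrieve-broadly-then-truncate pattern is simple to express. This sketch uses a toy word-overlap score as a stand-in for a real cross-encoder, purely for illustration:

```python
def overlap_score(query: str, chunk: str) -> int:
    """Toy relevance score: count of shared lowercase words.
    A real system would use a cross-encoder here instead."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def rerank_and_truncate(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    """Keep only the top_k candidates, best first, so the strongest
    chunk leads the context rather than landing in the middle."""
    return sorted(
        candidates, key=lambda c: overlap_score(query, c), reverse=True
    )[:top_k]

chunks = ["billing FAQ", "refund policy details", "refund request form"]
print(rerank_and_truncate("refund policy", chunks, top_k=2))
# ['refund policy details', 'refund request form']
```

With a real reranker, you would swap `overlap_score` for something like a sentence-transformers cross-encoder score while keeping the same sort-and-truncate shape.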

4. Stale Vector Index

Your knowledge base was updated last week with a new refund policy. A customer asks about the refund process. The vector index still has the old policy document. The LLM generates a confident, well-cited answer based on the outdated policy — citing the exact section and page number. The customer follows the outdated instructions, gets rejected, and files a complaint. This is worse than no answer — it's a convincingly wrong answer.

Fix: Implement incremental re-indexing triggered by document changes (via a webhook or file watcher). Set TTL on index entries for frequently updated content. Maintain a version field in your document metadata and display to users: "Based on policy version 2.3, last updated March 15." For critical documents, add a freshness check: if the source document has changed since the chunk was indexed, re-embed before answering.
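The freshness check at the end is easy to implement with a content hash stored alongside each chunk at indexing time. A minimal sketch:

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a source document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def is_stale(indexed_hash: str, current_source_text: str) -> bool:
    """True if the source document changed since the chunk was embedded."""
    return indexed_hash != content_hash(current_source_text)

policy_v1 = "Refunds are accepted within 14 days."
policy_v2 = "Refunds are accepted within 30 days."

indexed = content_hash(policy_v1)  # stored with the chunk at index time
print(is_stale(indexed, policy_v1))  # False — safe to answer from the index
print(is_stale(indexed, policy_v2))  # True — re-embed before answering
```

Hashing the whole document is coarse; a finer-grained version would hash per section so an unrelated edit doesn't invalidate every chunk.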

5. Citation Hallucination

The LLM cites "Section 4.2 of the Engineering Handbook" and attributes a specific policy to it. The citation is real — Section 4.2 exists. But the content the LLM attributes to it is fabricated. The LLM blended information from two different chunks and attributed the synthesis to one source. The user sees a real citation and trusts the fabricated content. This is the most dangerous RAG failure: real footnotes pointing to invented facts.

Fix: Use extractive citation: instead of letting the LLM paraphrase sources, require it to include the literal extracted quote from the chunk. Display the actual chunk text alongside the LLM's answer so users can verify. Implement citation verification: programmatically check that the LLM's attributed claims actually appear in the cited chunk (using string matching or semantic similarity with a high threshold).
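The string-matching flavor of citation verification can be sketched in a few lines. This is deliberately crude — it checks that every content word of the claim appears in the cited chunk, where a production system would use semantic similarity with a high threshold:

```python
def claim_supported(claim: str, cited_chunk: str) -> bool:
    """Crude extractive check: every content word of the claim must
    literally appear in the cited chunk."""
    stopwords = {"the", "a", "an", "is", "are", "of", "to", "in"}
    content_words = [w for w in claim.lower().split() if w not in stopwords]
    chunk_lower = cited_chunk.lower()
    return all(word in chunk_lower for word in content_words)

chunk = "The maximum retry count is 3, with exponential backoff."
print(claim_supported("retry count is 3", chunk))  # True
print(claim_supported("retry count is 5", chunk))  # False — flag for review
```

When the check fails, the answer can be regenerated, shown with a warning, or routed to a stricter extractive-quote mode rather than shipped as-is.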

Common Design-Time Mistakes

Those gotchas emerge at query time. These design mistakes happen during RAG system architecture — decisions about indexing, retrieval, and evaluation that determine whether your RAG system is reliable or randomly wrong.

No evaluation framework

You ship a RAG system without measuring retrieval precision, answer accuracy, or hallucination rate. A prompt change degrades retrieval quality by 15%, but nobody notices because there are no metrics. Build an eval suite: 100+ query-answer pairs with expected source documents. Measure retrieval recall (did you find the right chunks?) and answer accuracy (did the LLM use them correctly?). Run evals on every prompt or retrieval parameter change.
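Retrieval recall, the first of those metrics, takes only a few lines to compute once you have query-to-expected-chunk pairs. The `fake_retrieve` stub below is a placeholder for your real retriever:

```python
def retrieval_recall(eval_set: list[tuple], retrieve, k: int = 5) -> float:
    """eval_set: [(query, expected_chunk_id)]; retrieve(query, k) returns
    a list of chunk ids. Returns the fraction of queries whose expected
    chunk appears in the top-k results."""
    hits = sum(
        1 for query, expected in eval_set
        if expected in retrieve(query, k)
    )
    return hits / len(eval_set)

# Hypothetical retriever stub, purely for illustration
def fake_retrieve(query: str, k: int) -> list[str]:
    return ["c1", "c2"] if "refund" in query else ["c9"]

evals = [("refund policy?", "c1"), ("shipping time?", "c3")]
print(retrieval_recall(evals, fake_retrieve))  # 0.5
```

Tracking this number on every prompt or retrieval-parameter change is what catches the silent 15% regressions the paragraph above describes.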

No metadata filtering

All documents are embedded and treated equally. A user asks about "2026 refund policy" and gets results from the 2023 policy, the 2024 policy, AND the 2026 policy — because embedding similarity doesn't understand time. Add metadata (date, version, department, document type) as filterable attributes in your vector store. Filter before similarity search: WHERE year = 2026 AND type = 'policy'.
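The filter-before-search pattern can be sketched without any particular vector store — the field names and `search_fn` callback here are illustrative, not a real store's API:

```python
def filtered_search(query_vec, index: list[dict], filters: dict, search_fn, k: int = 5):
    """Apply exact-match metadata filters BEFORE similarity search, so
    e.g. year=2026 documents never compete with 2023 ones on embedding
    similarity alone."""
    candidates = [
        entry for entry in index
        if all(entry["meta"].get(key) == value for key, value in filters.items())
    ]
    return search_fn(query_vec, candidates, k)

index = [
    {"text": "2023 refund policy", "meta": {"year": 2023, "type": "policy"}},
    {"text": "2026 refund policy", "meta": {"year": 2026, "type": "policy"}},
    {"text": "2026 blog post", "meta": {"year": 2026, "type": "blog"}},
]

# Stand-in search_fn that just returns the filtered candidates
hits = filtered_search(
    None, index, {"year": 2026, "type": "policy"},
    search_fn=lambda q, cands, k: [c["text"] for c in cands[:k]],
)
print(hits)  # ['2026 refund policy']
```

Managed stores express the same idea natively (Pinecone metadata filters, Qdrant payload filters, a SQL `WHERE` clause with pgvector); the key is that filtering happens before, not after, the similarity ranking.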

Single embedding model for everything

You use the same embedding model for code documentation, legal contracts, and customer support articles. Each domain has different semantic structures — what's "similar" in legal text is different from what's "similar" in code. Fine-tune or select domain-appropriate embedding models. Test retrieval quality per domain; if recall drops below 70% for a specific content type, that content may need a specialized encoder.

No hybrid search

Pure vector search misses exact matches. A user searches for "error code ERR-4021" and gets semantically similar but wrong error codes. Pure keyword search misses semantic matches. Combine both: BM25 for keyword precision + vector search for semantic recall. Re-rank the merged results. Eugene Yan's guide on RAG confirms that hybrid retrieval consistently outperforms either approach alone.

Ignoring document freshness

Your indexing pipeline runs weekly. A critical policy document was updated on Monday. Users asking about the policy on Tuesday get week-old answers. The system is confidently wrong with proper citations to the outdated document. Implement near-real-time indexing triggered by document changes. Add a "last indexed" timestamp visible to users so they can assess freshness.

Key Takeaways

  • RAG lets you ground LLM responses in your own data without retraining -- 10-100x cheaper than fine-tuning
  • Chunk size and overlap are critical tuning parameters -- experiment extensively, and use different strategies for different document types
  • Hybrid search (semantic + keyword) outperforms pure vector search in production
  • Re-ranking retrieved results improves answer quality by 15-25%
  • Evaluate retrieval and generation quality separately -- bad retrieval can't be fixed with better prompts
  • Start simple (naive RAG), measure, then add complexity (hybrid search → re-ranking → agentic)

🎯 Real-World Decision: What Would You Do?

You're building RAG for a legal firm with 50,000 contracts (average 40 pages each). Lawyers ask questions like "What are the termination clauses in contracts with Company X?" and "Show me all NDAs that expire in Q2 2026."

Option A: Chunk at 512 tokens, embed everything, pure vector search
Option B: Chunk at 1024 tokens with 100-token overlap, hybrid search (semantic + keyword), re-rank top 20 to top 5
Option C: Structured extraction first (party names, dates, clause types as metadata), then filtered vector search on relevant clauses only

Option C wins here. Legal queries are almost always filtered by entity or date first. Pure vector search across 2M+ chunks returns too much noise. Metadata-filtered retrieval + re-ranking achieves 90%+ relevance. Share your approach in the comments.

Quick Reference Card

Bookmark this — RAG architecture decisions at a glance.

| Decision | Default Choice | When to Change |
|---|---|---|
| Chunk size | 512 tokens | Larger for code, smaller for FAQ |
| Overlap | 50-100 tokens | Increase for dense technical docs |
| Embedding model | text-embedding-3-small | Upgrade to large for production accuracy |
| Vector store | pgvector (prototype) → Pinecone/Qdrant (prod) | Chroma for local dev only |
| Search strategy | Hybrid (semantic + keyword) | Pure semantic only for short-form content |
| Re-ranking | Yes, cross-encoder on top 20 | Skip only in latency-critical flows |
| Top-K | 5 chunks in context | More for complex questions, fewer for simple |
| Evaluation | Retrieval recall + answer faithfulness | Always evaluate retrieval separately |

Golden rule: If your answers are bad, fix retrieval first. Better prompts can't fix irrelevant context.

What's Next?

For more complex use cases, consider agentic RAG — where an AI agent decides when and how to retrieve information, potentially making multiple retrieval passes or using different strategies based on the question type. That leads us directly into AI agent architecture.


📚 System Design Deep Dive Series

This is post #3 of 20 in the System Design Deep Dive series.

Previously: LLM Application Architecture ← | Up next: AI Agent Architecture → | Full series index →

If you found this useful, follow and share it with your team. Building these deep dives takes serious effort — your support keeps the series going.
