Chudi Nnorukam
Your LLM Hallucinating Facts Is Wrong — Here's How RAG Fixes It

Originally published at chudi.dev


TL;DR

RAG (Retrieval-Augmented Generation) combines language models with real-time data retrieval to provide accurate, up-to-date responses. Key benefit: Reduces hallucination by grounding responses in actual documents.

What is RAG?

RAG is a technique that gives LLMs access to external knowledge at inference time. Instead of relying solely on what the model learned during training (which could be months or years old), RAG pulls in relevant documents before generating a response.

Without me realizing it, I had been using a form of RAG every time I asked Claude to help me understand a codebase. Feeding it context before asking questions? That's the RAG pattern in action.

How RAG Works

  1. Query Processing: User question is received
  2. Retrieval: Relevant documents are fetched from a knowledge base
  3. Augmentation: Retrieved context is added to the prompt
  4. Generation: LLM generates a response using both its training and the retrieved context

I used to think RAG was only for enterprise systems. In reality, the pattern shows up anywhere we add context to an AI conversation.

The Three Core Components

Every RAG system has three parts that need to work together:

Embedding model: Converts text into vectors—lists of numbers representing semantic meaning. "How do I authenticate users?" becomes a numerical vector where similar questions produce similar vectors. Common choices include OpenAI's text-embedding-ada-002, Cohere Embed, or open-source models from sentence-transformers. The embedding model determines how well your system understands semantic similarity. Two sentences that mean the same thing should produce vectors that are close together in vector space, even if they use different words.
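To make "close together in vector space" concrete, here is a toy sketch of cosine similarity, the standard closeness measure, using hand-made three-dimensional vectors. The vectors and their associated sentences are invented for illustration; a real embedding model emits hundreds of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (real models emit hundreds of dimensions).
login_question = [0.9, 0.1, 0.2]       # "How do I authenticate users?"
login_paraphrase = [0.85, 0.15, 0.25]  # "What's the user login flow?"
unrelated = [0.1, 0.9, 0.8]            # "What port does the database use?"

print(cosine_similarity(login_question, login_paraphrase))  # ≈ 0.996
print(cosine_similarity(login_question, unrelated))         # ≈ 0.303
```

The paraphrase scores near 1.0 despite sharing no words with the original question, which is exactly the property retrieval relies on.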

Vector database: Stores those embeddings and enables fast similarity search. When a query arrives, the database finds documents whose embeddings are most similar to the query embedding. Options range from Pinecone (managed cloud) to Chroma (local dev) to pgvector (PostgreSQL extension). For most small-to-medium projects, Chroma locally or pgvector in Supabase is more than sufficient—no need for a dedicated vector database service until you're storing millions of documents.

Retrieval strategy: Decides which documents to include and how many. Naive retrieval takes the top-k most similar chunks. Hybrid search combines vector similarity with keyword matching (BM25). Reranking uses a second model to re-score results for relevance. Getting retrieval right often matters more than which database or embedding model you choose.
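The hybrid idea can be sketched in a few lines. This is an illustrative stand-in, not a real BM25 implementation: keyword overlap substitutes for BM25, and the vector scores are supplied by hand rather than computed by an embedding model.

```python
def keyword_score(query: str, doc: str) -> float:
    """Fraction of query words found in the document (crude keyword stand-in for BM25)."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().replace(".", " ").split())
    return len(q_words & d_words) / len(q_words)

def hybrid_rank(query, docs, vector_scores, alpha=0.7):
    """Blend vector similarity with keyword overlap; alpha weights the vector side."""
    scored = [
        (alpha * v + (1 - alpha) * keyword_score(query, d), d)
        for d, v in zip(docs, vector_scores)
    ]
    return [d for _, d in sorted(scored, reverse=True)]

docs = ["JWT tokens handle user authentication.",
        "The API rate limit is 100 requests per minute."]
# Pretend the embedding model scored both documents as equally similar:
# the keyword signal breaks the tie in favor of the authentication doc.
print(hybrid_rank("user authentication", docs, [0.5, 0.5]))
```

Tuning `alpha` is itself a retrieval-strategy decision: heavier on vectors for paraphrase-style questions, heavier on keywords for exact identifiers like error codes or function names.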

Why This Matters for Builders

I hated the feeling of asking an AI a question and getting confidently wrong information. But I love being able to trust responses when they're grounded in actual sources.

That specific relief of knowing where information comes from changes how you build with AI entirely.

RAG vs. Fine-Tuning: When to Use Which

Both approaches customize LLM behavior, but they solve different problems.

| | RAG | Fine-Tuning |
|---|---|---|
| Best for | Factual queries, current info | Style, tone, task format |
| Knowledge updates | Instant (update the DB) | Requires retraining |
| Hallucination risk | Lower (grounded in docs) | Higher (model memorizes) |
| Source transparency | Can cite sources | Black box |
| Setup cost | Medium (need vector DB) | High (need training data) |

I tried fine-tuning a model on internal documentation once. Updates required retraining. Months later, the model confidently cited outdated API versions. RAG would have been simpler and more reliable for that use case—the knowledge was too dynamic for fine-tuning.

Use RAG when:

  • Your knowledge changes frequently (documentation, product updates, news)
  • You need to cite sources and show your work
  • You're querying factual or domain-specific information
  • You want explicit control over what the model can reference

Use fine-tuning when:

  • You need a specific response format or output structure
  • You're adjusting tone or communication style
  • Task-specific patterns need to be baked in at training time
  • You have thousands of high-quality labeled examples

Most production systems end up using both: fine-tuned models for consistent format and tone, RAG for factual grounding.

Getting Started with RAG

Here's a working minimal implementation with ChromaDB and Claude:

```python
import chromadb
import anthropic

# 1. Set up vector store and add documents
client = chromadb.Client()
collection = client.create_collection("docs")
collection.add(
    documents=[
        "Users authenticate via JWT tokens stored in httpOnly cookies.",
        "Database connection uses PostgreSQL on port 5432.",
        "API rate limit is 100 requests per minute per IP."
    ],
    ids=["auth", "db", "rate-limit"]
)

# 2. Retrieve relevant context
def retrieve(query: str, n: int = 3) -> list[str]:
    results = collection.query(query_texts=[query], n_results=n)
    return results["documents"][0]

# 3. Generate with context
anthropic_client = anthropic.Anthropic()

def rag_query(question: str) -> str:
    context = retrieve(question)
    context_str = "\n".join(f"- {doc}" for doc in context)
    response = anthropic_client.messages.create(
        model="claude-opus-4-6",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Use this context to answer the question.\n\nContext:\n{context_str}\n\nQuestion: {question}"
        }]
    )
    return response.content[0].text

print(rag_query("How do users log in?"))
# Example output: "Users authenticate via JWT tokens stored in httpOnly cookies."
```

ChromaDB handles embedding automatically using its default model. For production, swap in a dedicated embedding service and a persistent vector database. The key is step 3—context gets prepended to the prompt before the LLM generates. That's it. That's RAG.

Common RAG Mistakes

I've made most of these. Some took months to diagnose.

Chunk size too large

Retrieving entire documents when only one paragraph is relevant. LLMs have context windows—stuffing in irrelevant text wastes tokens and dilutes useful content. Worse, the LLM might confabulate based on unrelated content that got retrieved alongside the correct answer.

Fix: Chunk documents into 256–512 token segments with 50-token overlap to preserve context at boundaries.

Chunk size too small

Splitting so aggressively that context is lost. A 50-word chunk might contain the answer but not the surrounding context that makes it meaningful. "Use bcrypt" without "when storing passwords" misses the point.

Fix: Experiment per content type. For prose, 512 tokens with overlap. For code, split by function or class boundary.
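A sliding-window chunker with overlap is only a few lines. This sketch uses whitespace-split words as a stand-in for real tokenizer tokens, and small `size`/`overlap` values so the overlap is easy to see; the 512/50 numbers from the fix above drop in directly.

```python
def chunk(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Split a token list into fixed-size windows, each sharing `overlap` tokens
    with the previous window so context at chunk boundaries is preserved."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

tokens = [f"w{i}" for i in range(20)]       # stand-in for tokenizer output
chunks = chunk(tokens, size=8, overlap=2)
print(len(chunks))                           # 3
print(chunks[0][-2:], chunks[1][:2])         # the shared 2-token overlap
```

For code, you would replace the fixed window with splits at function or class boundaries, as suggested above, since an arbitrary window can cut a function in half.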

Skipping reranking

Taking top-k vector search results at face value. Vector similarity finds semantically related text but not always the most relevant text. A passage about "fast running" can score well for "quick deployment" when you wanted DevOps documentation.

Fix: Add a cross-encoder reranker as a second pass. It re-scores results by relevance to the specific query, not just semantic proximity. Slower but significantly more accurate for precision-sensitive use cases.
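Structurally, reranking is just a second scoring pass over the first-stage candidates. In this sketch the scorer is a hypothetical keyword-overlap function standing in for a real cross-encoder model (which would score each query-document pair jointly); only the re-sort structure carries over to production.

```python
def rerank(query: str, candidates: list[str], scorer) -> list[str]:
    """Second pass: re-score each candidate against the query, best first."""
    return sorted(candidates, key=lambda doc: scorer(query, doc), reverse=True)

def overlap_scorer(query: str, doc: str) -> float:
    """Stand-in scorer; a production system would call a cross-encoder here."""
    q = set(query.lower().split())
    return len(q & set(doc.lower().split())) / len(q)

# First-stage vector search surfaced both; the reranker sorts out relevance.
candidates = ["Fast running improves cardio fitness",
              "Quick deployment with our CI pipeline"]
print(rerank("quick deployment guide", candidates, overlap_scorer))
```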

Missing metadata filtering

Treating all documents equally. If you have docs for v1 and v2 of an API, you want v2 docs for v2 questions. Vector similarity doesn't understand versioning or recency.

Fix: Store metadata (version, date, author, section) alongside embeddings. Filter before or after retrieval based on query context or user-supplied parameters.
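The filtering logic itself is trivial once metadata is stored next to each document. This pure-Python sketch (with invented documents) shows a pre-retrieval filter; in practice you would push the same condition down to the store, e.g. Chroma's `where` argument or a SQL `WHERE` clause with pgvector.

```python
docs = [
    {"text": "Auth uses API keys.",  "meta": {"version": "v1"}},
    {"text": "Auth uses OAuth 2.0.", "meta": {"version": "v2"}},
]

def filter_by_meta(docs: list[dict], **required) -> list[dict]:
    """Keep only documents whose metadata matches every required key/value pair."""
    return [d for d in docs if all(d["meta"].get(k) == v for k, v in required.items())]

# A v2 question should only ever see v2 documentation.
v2_docs = filter_by_meta(docs, version="v2")
print([d["text"] for d in v2_docs])  # ['Auth uses OAuth 2.0.']
```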

No evaluation

Building RAG and measuring success by feel. "It seems better" is not a metric.

Fix: Create a test set of 20–50 question-answer pairs. Measure retrieval precision (did the right document get retrieved?) and answer faithfulness (did the LLM actually use the retrieved content?). Tools like RAGAS automate this evaluation loop.
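Retrieval precision in particular is easy to compute yourself. In this sketch, `fake_retrieve` is a hypothetical retriever returning document ids (hard-coded for illustration); in a real system it would wrap your vector store query, like the `retrieve` function from the earlier example.

```python
def retrieval_precision(test_set: list[tuple[str, str]], retrieve, k: int = 3) -> float:
    """Fraction of test questions whose expected document id lands in the top-k results."""
    hits = sum(1 for question, expected_id in test_set
               if expected_id in retrieve(question)[:k])
    return hits / len(test_set)

def fake_retrieve(query: str) -> list[str]:
    """Hypothetical retriever; hard-coded here so the example is self-contained."""
    return ["auth", "db"] if "log in" in query else ["rate-limit"]

test_set = [("How do users log in?", "auth"),
            ("What is the rate limit?", "rate-limit")]
print(retrieval_precision(test_set, fake_retrieve))  # 1.0
```

Answer faithfulness needs an LLM-as-judge or a tool like RAGAS, but this retrieval half of the loop runs for free on every change to your chunking or embedding setup.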

Production RAG Considerations

The working example above gets you a prototype. Production systems need a few more pieces.

Chunking strategy matters more than the database. The most common reason RAG systems underperform isn't the LLM or the vector store—it's that chunks are either too large (stuffing in irrelevant context) or too small (losing surrounding meaning). Start with 512-token chunks with 50-token overlap and test against your actual question set before committing to a strategy.

Evaluate retrieval separately from generation. Two distinct failure modes exist: the right document wasn't retrieved, or the right document was retrieved but the LLM generated a bad answer anyway. Evaluate each independently. If retrieval precision is low (wrong documents retrieved), improve embedding quality or add metadata filtering. If answer faithfulness is low (documents retrieved but ignored in the response), improve the system prompt to be more directive about using context.

Monitor costs in staging. Embedding thousands of documents at 10,000 tokens each adds up. Audit your embedding cost before scaling to production. For most developer knowledge bases, the one-time embedding cost stays under $5—but knowing that number before you scale matters. OpenAI's text-embedding-3-small costs $0.02 per million tokens; for a 500-document knowledge base, that's effectively free at initialization.
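The audit is one multiplication, so it is worth scripting before you scale. This sketch uses the $0.02-per-million-token price quoted above; substitute your provider's current rate.

```python
def embedding_cost(num_docs: int, tokens_per_doc: int,
                   price_per_million: float = 0.02) -> float:
    """One-time cost, in dollars, to embed a corpus at a per-million-token price."""
    total_tokens = num_docs * tokens_per_doc
    return total_tokens / 1_000_000 * price_per_million

# 500 documents at 10,000 tokens each with text-embedding-3-small pricing:
print(f"${embedding_cost(500, 10_000):.2f}")  # $0.10
```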


Since I no longer need to second-guess every AI response, I can focus on what I actually want to build. I see it as a comparative advantage: understanding RAG means building more reliable AI applications.


Related Reading

This is part of the Complete Claude Code Guide.
