Leo Han

Posted on Jun 9


#ai #llm #rag #tutorial

RAG Explained: How Retrieval-Augmented Generation Actually Works

What Is RAG?

RAG (Retrieval-Augmented Generation) is one of the most widely adopted architectural patterns in LLM applications during 2024–2025. The core idea is simple: before the LLM generates an answer, retrieve relevant information from an external knowledge base, inject the retrieval results into the context, and then have the model generate an answer based on that information.

Why is RAG needed? Large language models have three inherent limitations: knowledge cutoff dates (the temporal boundary of training data), hallucination (fabricating non-existent facts), and insufficient domain expertise (lacking enterprise-internal or specialized data). RAG works around the limits of the model's internal knowledge with a "retrieve first, generate later" approach, letting the LLM ground its answers in up-to-date and private data. This reduces hallucination but does not eliminate it, since the model can still misread or ignore the retrieved context.

The Core RAG Workflow

A standard RAG system follows this pipeline:

Document Library → Chunking → Embedding → Vector Database Storage
                                                    ↓
User Query → Query Embedding → Similarity Search → Retrieve Relevant Chunks
                                            ↓
                        LLM Generates Answer (based on retrieved results + original query)

This pipeline can be decomposed into two phases: the offline indexing phase (document processing and storage) and the online query phase (real-time retrieval and generation).
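
The "inject into the context" step of the online phase is, in practice, plain string assembly. Here is a minimal sketch of it; the `build_prompt` helper and the prompt wording are assumptions chosen for illustration, and the resulting string would be handed to whatever LLM client you use:

```python
def build_prompt(query: str, chunks: list[str]) -> str:
    """Combine retrieved chunks and the user query into one prompt string."""
    # Number each chunk so the answer can point back to its sources.
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


prompt = build_prompt(
    "How tall is the Empire State Building?",
    ["The Empire State Building is a skyscraper in Midtown Manhattan.",
     "Its roof height is 1,250 feet."],
)
print(prompt)
```

Telling the model to admit when the context is insufficient is one common way to discourage it from falling back on guesses.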

The Offline Indexing Phase

1. Document Parsing & Chunking

Raw documents (PDFs, web pages, database records, etc.) are typically too long for direct vector retrieval. They need to be split into appropriately sized chunks.

Chunking strategy directly impacts retrieval quality. Common approaches include:

Fixed-size chunking: split by token count or character count (e.g., 512 tokens per chunk)
Semantic chunking: split along natural boundaries like paragraphs and sections
Recursive chunking: start with coarse separators (chapter headers), then progressively refine
Overlapping chunking: maintain overlap between adjacent chunks (e.g., 10–20%) to prevent key information from being severed at boundaries

Chunk size involves a trade-off: too small and the semantics are incomplete; too large and retrieval precision degrades.
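
Fixed-size chunking with overlap can be sketched in a few lines. This version splits on whitespace as a crude stand-in for a real tokenizer, and the `chunk_text` name and default sizes are illustrative assumptions (64 of 512 is a 12.5% overlap):

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into chunks of `chunk_size` words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    words = text.split()  # whitespace split approximates tokenization
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
print(chunk_text(sample, chunk_size=4, overlap=1))
# ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

Each chunk repeats the last word of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.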

2. Embedding

After chunking, an embedding model converts each text chunk into a fixed-dimensional vector. These vectors are points in high-dimensional space — semantically similar texts are closer together in vector space.

Choosing the right embedding model is critical. Currently mainstream models include:

Model	Dimensions	Max Tokens	Characteristics
text-embedding-3-large	3072	8191	OpenAI's largest embedding model; highest quality, higher cost
text-embedding-3-small	1536	8191	Lightweight, fast, low cost
multilingual-e5-large	1024	512	Strong multilingual support
GTE-Qwen2-7B-instruct	3584	32768	Open-source SOTA, long text support
BGE-M3	1024	8192	Multilingual + sparse-dense hybrid

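For example, a user query like "How tall is the Empire State Building?" gets converted through embedding into a dense vector. A toy 5-dimensional illustration would be [1.0, 2.5, 3.7, 5.8, 2.8]; real embeddings have hundreds to thousands of dimensions, as the table above shows.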
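
Real embedding models are learned neural networks, so the following is only a toy stand-in that makes the "text in, fixed-length vector out" contract concrete. It uses hashed word counts, which capture word overlap but not meaning; the `toy_embed` name and the 8-dimension size are assumptions for illustration:

```python
import hashlib
import math


def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Map text to a unit-length vector of `dim` floats via hashed word counts."""
    vec = [0.0] * dim
    for word in text.lower().split():
        # A stable hash (unlike the built-in hash()) keeps results reproducible.
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


v = toy_embed("How tall is the Empire State Building?")
print(len(v))  # always `dim`, regardless of input length
```

The point to take away is the shape of the interface: any text, short or long, maps to a vector of the same length, which is what makes the vectors comparable.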

3. Vector Database Storage

The embedded document chunks are stored in a vector database (Pinecone, Weaviate, Milvus, Qdrant, Chroma, etc.). The core capability of a vector database is Approximate Nearest Neighbor (ANN) search, which can find the K most similar results from millions or even billions of vectors in milliseconds.
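
For intuition, the exact search that ANN indexes approximate is a linear scan: score every stored vector against the query and keep the best K. The sketch below uses an in-memory dictionary and a hypothetical `top_k` helper rather than any real vector database client:

```python
import heapq
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query: list[float], index: dict[str, list[float]], k: int = 3) -> list[tuple[str, float]]:
    """Exact nearest-neighbor search: score every vector, return the k best."""
    scored = ((doc_id, cosine(query, vec)) for doc_id, vec in index.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


index = {
    "doc-a": [1.0, 0.0],
    "doc-b": [0.7, 0.7],
    "doc-c": [0.0, 1.0],
}
print(top_k([1.0, 0.1], index, k=2))  # doc-a first, then doc-b
```

This scan costs O(N) per query, which is why vector databases trade a small amount of accuracy for index structures that avoid visiting every vector.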

The Online Query Phase

1. Query Embedding

The user's question is first converted into a vector using the same embedding model. Queries and documents must use the same embedding model — otherwise, they lie in different vector spaces and similarity calculations become meaningless.

2. Similarity Search

The query vector performs a similarity search against the vector database. Common similarity measures include:

Cosine Similarity: measures how close two vectors are in direction; range [-1, 1], unaffected by vector magnitude
Euclidean Distance: the straight-line distance in space; smaller values indicate greater similarity
Dot Product: sensitive to both direction and magnitude; equivalent to cosine similarity when vectors are normalized to unit length

For instance, suppose document chunk A has the vector [1.3, 1.5, 3.3, 5.7, 4.9] and the query vector is [1.0, 2.5, 3.7, 5.8, 2.8]. Their cosine similarity is approximately 0.96. Meanwhile, document chunk B with vector [4.8, 3.7, 1.5, 5.2, 6.0] has a similarity of about 0.83 to the same query, indicating that A is semantically closer and should rank higher.
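
The three measures can be computed directly for the example vectors with standard-library Python, which is a handy way to check the arithmetic yourself. The helper names here are illustrative:

```python
import math

query = [1.0, 2.5, 3.7, 5.8, 2.8]
chunk_a = [1.3, 1.5, 3.3, 5.7, 4.9]
chunk_b = [4.8, 3.7, 1.5, 5.2, 6.0]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector lengths.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean_distance(a, b):
    return math.dist(a, b)  # available in Python 3.8+


for name, vec in [("A", chunk_a), ("B", chunk_b)]:
    print(name,
          f"cosine={cosine_similarity(query, vec):.2f}",
          f"euclidean={euclidean_distance(query, vec):.2f}",
          f"dot={dot(query, vec):.2f}")
```

Note that the measures need not agree on ranking when vectors have different lengths, which is one reason many pipelines normalize embeddings before storing them.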

3. Re-ranking

The results from the initial retrieval (typically top 10–50) are not always optimal. The re-ranking stage uses a more precise (but slower) model to re-sort the candidates.

Cross-Encoders are the standard method for re-ranking. Unlike Bi-Encoders that encode the query and document independently, Cross-Encoders concatenate the query and document together before feeding them into the model, capturing finer-grained interaction patterns between them. The ranking accuracy is significantly higher, though at greater computational cost.

# Bi-Encoder (initial retrieval): fast but lower precision
# Query and document are encoded independently
query_vec = embed(query)
doc_vec = embed(document)
similarity = cosine(query_vec, doc_vec)

# Cross-Encoder (re-ranking): slow but high precision
# Query and document are concatenated and jointly encoded
score = cross_encoder(query, document)

In production, a two-stage retrieval approach is standard: first use a Bi-Encoder to quickly recall the top-N from massive candidates, then use a Cross-Encoder to precisely re-rank the top-N and select the top-K to feed into the LLM.
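
The two-stage flow can be sketched independently of any particular model. In the version below the scorers are passed in as functions; `word_overlap` and `overlap_ratio` are toy stand-ins (not real Bi-Encoder or Cross-Encoder models) so that the example runs without dependencies:

```python
from typing import Callable

Scorer = Callable[[str, str], float]


def two_stage_retrieve(query: str, docs: list[str], fast_score: Scorer,
                       precise_score: Scorer, top_n: int = 50, top_k: int = 5) -> list[str]:
    """Recall top_n with the cheap scorer, then re-rank them with the costly one."""
    candidates = sorted(docs, key=lambda d: fast_score(query, d), reverse=True)[:top_n]
    reranked = sorted(candidates, key=lambda d: precise_score(query, d), reverse=True)
    return reranked[:top_k]


# Toy stand-ins; a real system would call embedding and re-ranking models here.
def word_overlap(query: str, doc: str) -> float:
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def overlap_ratio(query: str, doc: str) -> float:
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0


docs = [
    "the empire state building height is a popular trivia question in new york",
    "empire state building height",
    "history of the roman empire",
]
print(two_stage_retrieve("empire state building height", docs,
                         word_overlap, overlap_ratio, top_n=2, top_k=1))
# ['empire state building height']
```

The first stage ties the two building documents and drops the unrelated one; the second stage then promotes the tighter match. The expensive scorer only ever sees `top_n` candidates, which is what keeps the overall cost bounded.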

Evaluating RAG Systems

RAG system quality can be measured across multiple dimensions:

Retrieval quality metrics:

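Recall@K: the fraction of all relevant documents that appear in the top-K results (with a single relevant document per query, this reduces to whether it was retrieved)
MRR (Mean Reciprocal Rank): the average, across queries, of 1/rank of the first relevant result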
NDCG (Normalized Discounted Cumulative Gain): a weighted score accounting for ranking position
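
These retrieval metrics are short enough to implement by hand. The sketch below computes Recall@K as the fraction of relevant documents found in the top K, the reciprocal rank for one query (MRR is its mean over many queries), and NDCG with the common log2 discount; the function names and document IDs are illustrative:

```python
import math


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant items that appear in the top-k results."""
    return len(set(ranked[:k]) & relevant) / len(relevant) if relevant else 0.0


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant item, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: list[str], gains: dict[str, float], k: int) -> float:
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    def dcg(values):
        return sum(g / math.log2(i + 2) for i, g in enumerate(values))
    actual = dcg(gains.get(d, 0.0) for d in ranked[:k])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal else 0.0


ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(recall_at_k(ranked, relevant, k=3))   # 0.5
print(reciprocal_rank(ranked, relevant))    # 0.5
print(ndcg_at_k(ranked, {"d1": 1.0, "d2": 1.0}, k=4))
```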

Generation quality metrics:

Faithfulness: whether the generated content faithfully reflects the retrieved context
Answer Relevance: whether the answer addresses the question
Context Relevance: whether the retrieved content is relevant to the question

The RAGAS (RAG Assessment) framework is widely used for automated evaluation of RAG pipelines, scoring dimensions such as faithfulness and answer relevance. Embedding models themselves are usually compared on separate benchmarks like MTEB (Massive Text Embedding Benchmark).

Advanced RAG Patterns

Query Rewriting

Raw user questions are often imprecise. Rewriting queries before retrieval — expanding synonyms, supplementing context, decomposing complex questions — can significantly boost recall.

Hybrid Search

Fuse results from dense retrieval (vector similarity) and sparse retrieval (keyword matching, e.g., BM25). Dense retrieval excels at semantic matching; sparse retrieval excels at exact matching. The two are complementary.
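
One common way to fuse the two result lists is Reciprocal Rank Fusion (RRF), which needs only ranks and so sidesteps the problem that BM25 scores and cosine similarities live on different scales. RRF is one option among several and is not prescribed by the source; the constant `k=60` is the conventional default, and the document IDs are made up:

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each doc scores sum(1 / (k + rank)) over the lists."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


dense = ["d2", "d1", "d4"]    # e.g. from vector similarity
sparse = ["d1", "d3", "d2"]   # e.g. from BM25
print(reciprocal_rank_fusion([dense, sparse]))
# ['d1', 'd2', 'd3', 'd4']
```

Documents that rank well in both lists (d1, d2) rise to the top, while documents seen by only one retriever are kept but ranked lower.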

Multi-hop Retrieval

For questions requiring multi-step reasoning, the first round of retrieval results can generate new queries for a second (or more) round of retrieval, progressively approaching the answer.
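
The control flow of multi-hop retrieval is a loop in which each round's evidence shapes the next query. In the sketch below, `retrieve` and `next_query` are hypothetical callables; in a real system the first would hit the vector database and the second would typically be an LLM call. The dictionary lookup and hard-coded follow-up are toy stand-ins so the example runs:

```python
from typing import Callable, Optional


def multi_hop_retrieve(question: str,
                       retrieve: Callable[[str], list[str]],
                       next_query: Callable[[str, list[str]], Optional[str]],
                       max_hops: int = 3) -> list[str]:
    """Retrieve iteratively, letting earlier results shape the next query."""
    evidence: list[str] = []
    query: Optional[str] = question
    for _ in range(max_hops):
        if query is None:
            break  # the follow-up step decided enough evidence was gathered
        for chunk in retrieve(query):
            if chunk not in evidence:
                evidence.append(chunk)
        query = next_query(question, evidence)
    return evidence


# Toy knowledge base keyed by exact query string (a stand-in for vector search).
kb = {
    "Where was the director of Inception born?": ["Inception was directed by Christopher Nolan."],
    "Where was Christopher Nolan born?": ["Christopher Nolan was born in London."],
}


def toy_next_query(question: str, evidence: list[str]) -> Optional[str]:
    # Stand-in for an LLM that reads the evidence and proposes a follow-up.
    return "Where was Christopher Nolan born?" if len(evidence) == 1 else None


hops = multi_hop_retrieve("Where was the director of Inception born?",
                          lambda q: kb.get(q, []), toy_next_query)
print(hops)
```

The `max_hops` cap matters in practice: without it, a follow-up generator that never returns `None` would loop indefinitely.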

Self-RAG

Allow the LLM to self-assess during generation whether retrieval is needed, whether the retrieved results are relevant, and whether the generated content is grounded — achieving "on-demand retrieval" rather than indiscriminate retrieval.

Conclusion

Through its "retrieve → augment → generate" architecture, RAG addresses the three LLM limitations introduced at the start: knowledge staleness, hallucination, and limited domain expertise. A production-grade RAG system involves multiple critical decisions: chunking strategy selection, embedding model choice, vector database configuration, similarity measure design, and re-ranking mechanism integration.

As long-context models advance, some might ask: "Why not just stuff all the documents into the context window?" But practice shows that RAG's value lies not in "how much you can fit," but in precisely finding the most relevant pieces of information — retrieval quality determines the ceiling of answer quality.

This article is adapted from the video "How RAG Works," covering the definition of RAG, the two-phase indexing and query pipeline, embedding and vector retrieval principles, chunking strategies, re-ranking mechanisms, and advanced patterns.

DEV Community