DEV Community

WonderLab

RAG Series (3): Tuning These 4 Parameters to Go From 'It Works' to 'It Works Well'

Why Does Your RAG Give Wrong Answers When Someone Else's Doesn't?

In the first two articles, we built a RAG pipeline that runs. But many people find that while the code works, answer quality is inconsistent — sometimes spot-on, sometimes missing information that's clearly in the document, sometimes drifting off-topic even when the right chunks were retrieved.

The problem is usually not the code. It's the parameters.

RAG has four core parameters, like four knobs on a radio:

  • Chunk Size: How long is each text chunk?
  • Chunk Overlap: How much do adjacent chunks overlap?
  • Top-K: How many chunks does the retriever return?
  • Embedding Model: How is text converted into vectors?

The combination of these four parameters directly determines whether the system can find relevant information and whether that information is enough to answer the question. In this article, we'll use a controlled-variable experiment so you can see the effect of different parameters with your own eyes.


Parameter 1: Chunk Size — How Long Is Each Chunk?

What Is Chunk Size?

Imagine you're organizing a 500-page technical manual. Chunk Size is how many pages you read at a time — 1 page, 5 pages, or 50 pages?

In RAG, Chunk Size is the maximum number of characters (or tokens) in each text chunk. The document is cut into many chunks, each no longer than this limit.

Why Does It Matter?

Chunk Size directly impacts two metrics:

| Chunk Size | Retrieval Precision | Context Completeness | Analogy |
| --- | --- | --- | --- |
| Small (128) | High | Poor | Like reading dictionary entries — precise but isolated |
| Medium (512) | Balanced | Balanced | Like reading a paragraph — enough context without bloat |
| Large (2048) | Low | Good | Like reading an entire chapter — complete but noisy |

What's wrong with too small? Suppose the document says: "The system uses Redis for caching with a default TTL of 3600 seconds. If this timeout is exceeded, data is automatically purged." If Chunk Size=128, this sentence might be split into two chunks: "The system uses Redis for caching with a default TTL of 3600 seconds." and "If this timeout is exceeded, data is automatically purged." When the user asks "What happens when Redis cache expires?", the retriever might only return the first chunk. The LLM sees "3600 seconds" but doesn't know about "automatically purged" — the answer is incomplete.

What's wrong with too large? Suppose Chunk Size=2048, and one chunk contains five unrelated topics. When the user asks a specific question, this chunk gets retrieved, but the LLM's attention is scattered by irrelevant content — like trying to hear one person speak in a noisy marketplace.

How to Choose?

There's no silver bullet, but there's a rule of thumb:

Chunk Size ≈ 1.5 ~ 2 × the expected answer length
| Document Type | Recommended Chunk Size | Reasoning |
| --- | --- | --- |
| FAQ / Q&A pairs | 256 ~ 384 | Short answers, precision matters more |
| Technical docs / API manuals | 512 ~ 768 | Medium-length answers, need some context |
| Papers / book chapters | 1024 ~ 1536 | Argument-heavy, need large context |
| Legal contracts / medical records | 768 ~ 1024 | Dense terminology, need inference from context |

Heuristic: Start with 512, then observe retrieval results. If you notice "answers are cut off", increase it. If you notice "retrieved chunks contain too much irrelevant content", decrease it.
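To get a feel for the numbers, here is a minimal pure-Python sketch. It is deliberately simpler than LangChain's RecursiveCharacterTextSplitter (which also respects separators like newlines), but it shows how chunk count scales with Chunk Size:

```python
def chunk_text(text: str, chunk_size: int, overlap: int = 0) -> list[str]:
    """Minimal fixed-size chunker: each chunk starts
    (chunk_size - overlap) characters after the previous one."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc_text = "x" * 5000  # stand-in for a 5,000-character document
for size in (128, 512, 2048):
    print(f"chunk_size={size}: {len(chunk_text(doc_text, size))} chunks")
# chunk_size=128 -> 40 chunks, 512 -> 10, 2048 -> 3:
# smaller chunks mean more, finer-grained index entries (and more embedding calls)
```

Smaller chunks mean a larger index and more embedding requests; that cost is part of the trade-off too.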


Parameter 2: Chunk Overlap — How Much Should Adjacent Chunks Overlap?

What Is Chunk Overlap?

Back to that technical manual. If you read 5 pages at a time, Overlap is how many pages from the previous chunk you keep when starting the next one. For example, Overlap=1 means: first read pages 1-5, then read pages 5-9 (page 5 appears twice).

Why Is Overlap Needed?

Without overlap, critical information can get "cut at the seam":

Chunk A: "The system uses Redis for caching with a default TTL of 3600 seconds."
Chunk B: "If this timeout is exceeded, data is automatically purged."

If the user asks "What happens when Redis cache expires?", the embedding model might judge Chunk B more relevant ("this timeout is exceeded" is semantically closest to "expires") and return only Chunk B. But Chunk B starts with "If this timeout is exceeded" — without Chunk A, the LLM doesn't know what "this timeout" refers to.

With Overlap=50, Chunk B starts with the last 50 characters of Chunk A:

Chunk B (with overlap): "...default TTL of 3600 seconds. If this timeout is exceeded, data is automatically purged."

Now even if only Chunk B is retrieved, the LLM can infer "this timeout = 3600 seconds".
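The seam effect is easy to reproduce with a toy sliding-window splitter (a simplification of what the real splitter does; the sizes below are chosen just so the seam falls mid-sentence):

```python
def split_with_overlap(text: str, size: int, overlap: int) -> list[str]:
    """Slide a window of `size` characters, advancing by size - overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

text = ("The system uses Redis for caching with a default TTL of 3600 seconds. "
        "If this timeout is exceeded, data is automatically purged.")

no_overlap = split_with_overlap(text, 90, 0)
with_overlap = split_with_overlap(text, 90, 50)

# Without overlap, no single chunk contains both the TTL value and the
# purge behaviour; with overlap, at least one chunk does.
print(any("3600" in c and "purged" in c for c in no_overlap))    # False
print(any("3600" in c and "purged" in c for c in with_overlap))  # True
```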

How Much Overlap?

Generally set to 10% ~ 20% of Chunk Size:

| Chunk Size | Recommended Overlap | Notes |
| --- | --- | --- |
| 256 | 25 ~ 50 | Short text, small overlap preserves context |
| 512 | 50 ~ 100 | The sweet spot for general use |
| 1024 | 100 ~ 200 | Long text needs more overlap to preserve continuity |

Note: More overlap is not always better. Too much overlap leads to storing massive amounts of duplicate content in the vector database, increasing storage cost and deduplication burden during retrieval.


Parameter 3: Top-K — How Many Chunks to Retrieve?

What Is Top-K?

Top-K is the number of text chunks the retriever returns each time. K=4 means "give me the 4 most relevant chunks", K=10 means "give me the 10 most relevant chunks".
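Mechanically, Top-K is nothing more than a nearest-neighbour cutoff. A self-contained sketch with hand-made 2-D "embeddings" (real embedding vectors have hundreds or thousands of dimensions, but the ranking logic is the same):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunks, k):
    """Rank every (text, vector) chunk by similarity to the query, keep the k best."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

query = [1.0, 0.0]  # pretend a question about caching embeds here
chunks = [
    ("chunk about caching",  [0.9, 0.1]),
    ("chunk about logging",  [0.1, 0.9]),
    ("chunk about timeouts", [0.8, 0.3]),
]
print(top_k(query, chunks, k=2))  # caching and timeouts rank highest
```

K simply decides where you cut this ranked list — it never changes the ranking itself.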

Why Does It Matter?

K too small = missing information. K too large = introducing noise.

Scenario A: K=2, missing critical information

User asks: "How do I configure the database connection pool and log level?" This question involves two topics. If K=2, the retriever might only return two chunks about "database connection pool" and completely miss "log level" — the LLM can only answer half the question.

Scenario B: K=20, noise drowning out the answer

User asks: "What's the default timeout?" The document has a clear answer. But K=20 retrieves 20 chunks, 19 of which discuss unrelated topics. The LLM's context window is filled with irrelevant content, and it can't find that simple number.

How to Choose?

Top-K = Number of topics the answer is expected to cover × 2 ~ 3
| Query Type | Recommended K | Reasoning |
| --- | --- | --- |
| Single-point fact ("What's the default port?") | 3 ~ 5 | Focused answer, fewer is better |
| Multi-condition ("How do I configure A and B?") | 5 ~ 8 | Might involve multiple topics |
| Comprehensive summary ("Summarize Chapter 3") | 8 ~ 12 | Need to cover multiple points |

Heuristic: Start with K=4. If you notice "the answer is missing a part", increase it. If you notice "the answer contains irrelevant content", decrease it.


Parameter 4: Embedding Model — Who Does the "Semantic Translation"?

Embedding Is RAG's "Translator"

What an embedding model does is simple: convert text into a sequence of numbers (a vector). Semantically similar texts have vectors that are close together; semantically dissimilar texts have vectors that are far apart.

The retriever relies on this — it converts the user's question into a vector, then finds the nearest vectors in the database.

How Big Is the Difference Between Models?

Very big. For the same question, different models can return completely different results.

| Model | Strong Language | Dimensions | Positioning | Best For |
| --- | --- | --- | --- | --- |
| text-embedding-3-small | English | 1536 | Cheap & fast | English docs, budget-sensitive |
| text-embedding-3-large | English | 3072 | High precision | English docs, precision-first |
| BAAI/bge-large-zh-v1.5 | Chinese | 1024 | Best for Chinese | Chinese docs, China-first choice |
| BAAI/bge-m3 | Multilingual | 1024 | Multilingual | Mixed Chinese-English, cross-lingual |

A Real Comparison Experiment

We use the same Chinese technical document (Automotive SPICE PAM v4.0), the same question, and compare retrieval results between text-embedding-3-small and BAAI/bge-large-zh-v1.5:

Question: "What is process capability level 1?"

| Model | 1st Retrieved Result | 2nd Retrieved Result | Assessment |
| --- | --- | --- | --- |
| text-embedding-3-small | Page 12: paragraph about project management | Page 89: paragraph about risk assessment | ❌ Neither mentions "process capability level" |
| BAAI/bge-large-zh-v1.5 | Page 45: Definition of process capability level 1 | Page 46: Example practices for level 1 | ✅ Direct hit |

Reason: OpenAI's models are primarily trained on English corpora. Their understanding of Chinese technical terminology is not as strong as BGE, which is specifically fine-tuned on Chinese corpora.

How to Choose an Embedding Model?

Decision tree:

What language is your document?
    ├─ Pure English → text-embedding-3-small (best value)
    │                  or text-embedding-3-large (best precision)
    ├─ Pure Chinese → BAAI/bge-large-zh-v1.5 (China-first choice)
    │                  or BAAI/bge-m3 (if mixed Chinese-English)
    └─ Mixed Chinese-English → BAAI/bge-m3 (best multilingual support)

Switching models is a one-line change: Just change model="..." in the build_embeddings() function. Everything else stays the same — that's the beauty of LangChain.
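If you want that one-line change to live in a single place, one option (my own convention, not part of the series' code) is a small preset table holding the kwargs you would hand to OpenAIEmbeddings:

```python
# Hypothetical presets: model/endpoint pairs from the table above,
# keyed by document language. The keys are my own invention.
EMBEDDING_PRESETS = {
    "english": {"model": "text-embedding-3-small"},
    "chinese": {"model": "BAAI/bge-large-zh-v1.5",
                "base_url": "https://api.siliconflow.cn/v1"},
    "mixed":   {"model": "BAAI/bge-m3",
                "base_url": "https://api.siliconflow.cn/v1"},
}

def embedding_kwargs(language: str) -> dict:
    """Return the keyword arguments to splat into OpenAIEmbeddings(**kwargs)."""
    return EMBEDDING_PRESETS[language]

print(embedding_kwargs("chinese")["model"])
```

Then `build_embeddings()` can take a language key instead of a hardcoded model name, and switching documents never means hunting through the code.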


Hands-On: Controlled-Variable Experiment

Let's run an experiment: same document, same question, only changing Chunk Size, and observe how answer quality changes.

Experimental Design

"""
RAG Parameter Controlled-Variable Experiment
Fixed: document, question, embedding model, Top-K, LLM
Variable: Chunk Size
"""

import os
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Load document
doc = PyPDFLoader("./data/Automotive-SPICE-PAM-v40.pdf").load()

# Embedding (fixed)
embeddings = OpenAIEmbeddings(
    model="BAAI/bge-large-zh-v1.5",
    api_key=os.getenv("EMBEDDING_API_KEY"),
    base_url="https://api.siliconflow.cn/v1",
    chunk_size=32,  # embedding request batch size, NOT the text Chunk Size
)

# LLM (fixed)
llm = ChatOpenAI(
    model="glm-4-flash",
    api_key=os.getenv("LLM_API_KEY"),
    base_url="https://open.bigmodel.cn/api/paas/v4",
    temperature=0,
)

# Test different Chunk Sizes
def test_chunk_size(chunk_size, overlap):
    print(f"\n{'='*50}")
    print(f"Chunk Size={chunk_size}, Overlap={overlap}")
    print(f"{'='*50}")

    # Split
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        length_function=len,
    )
    chunks = splitter.split_documents(doc)
    print(f"Generated {len(chunks)} chunks")

    # Build vector store
    persist_dir = f"./chroma_db_{chunk_size}"
    if os.path.exists(persist_dir):
        import shutil
        shutil.rmtree(persist_dir)

    vector_store = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=persist_dir,
    )

    # Build RAG Chain (LCEL style)
    retriever = vector_store.as_retriever(search_kwargs={"k": 4})

    prompt = ChatPromptTemplate.from_messages([
        ("system", "Answer based on reference content. Reference:\n{context}"),
        ("human", "{question}")
    ])

    rag_chain = (
        {"context": retriever | (lambda docs: "\n\n".join(d.page_content for d in docs)),
         "question": RunnablePassthrough()}
        | prompt | llm | StrOutputParser()
    )

    # Ask question
    question = "What is process capability level 1?"
    answer = rag_chain.invoke(question)
    print(f"\nAnswer: {answer[:200]}...")

    # Print retrieved sources
    sources = retriever.invoke(question)
    print(f"\nRetrieved {len(sources)} sources:")
    for i, s in enumerate(sources[:3], 1):
        print(f"  [{i}] Page {s.metadata.get('page', '?')}: {s.page_content[:80]}...")

# Run three experiments
test_chunk_size(chunk_size=128, overlap=20)
test_chunk_size(chunk_size=512, overlap=50)
test_chunk_size(chunk_size=1024, overlap=100)

Expected Results

| Chunk Size | Number of Chunks | Retrieval Quality | Typical Observation |
| --- | --- | --- | --- |
| 128 | Many (~4000) | High precision but broken context | Retrieved chunks have the keyword "process capability level" but lack sufficient context; LLM answers are fragmented |
| 512 | Medium (~1000) | Best balance | Retrieved chunks contain complete definitions + examples; LLM answers are coherent and accurate |
| 1024 | Few (~500) | Complete context but low precision | Retrieved chunks contain lots of irrelevant content (e.g., descriptions of other levels); LLM answers are verbose |

Key insight: Chunk Size is not "bigger is better" nor "smaller is better". 512 characters is a safe starting point for most Chinese technical documents.


The 5 Most Common Pitfalls

Pitfall 1: Setting Chunk Size by Token Count, but length_function Uses Character Count

# ❌ Wrong: You think chunk_size=512 means 512 tokens
splitter = RecursiveCharacterTextSplitter(chunk_size=512)

# Actually the default length_function=len counts characters!
# The character-to-token ratio varies by language and tokenizer, so your
# chunks end up a very different token length than you planned

Fix: If you want to count by tokens, explicitly specify a tokenizer:

import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")  # load the tokenizer once

def token_length(text):
    return len(encoding.encode(text))

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    length_function=token_length,  # ✅ Count by tokens
)

Pitfall 2: Overlap Too Large, Causing 30% Duplicate Content in the Vector Store

Overlap is not free. Every overlapping character requires an embedding computation and takes up storage space in the vector database. Overlap=100 with Chunk Size=200 means 50% of your storage is redundant.

Fix: Set Overlap to 10%~15% of Chunk Size, never exceed 20%.
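The arithmetic behind that rule: chunks advance by (chunk_size - overlap), so the total stored text grows by roughly chunk_size / (chunk_size - overlap) compared to the original document:

```python
def storage_blowup(chunk_size: int, overlap: int) -> float:
    """Approximate ratio of stored characters to original characters:
    each document position is stored about chunk_size / (chunk_size - overlap)
    times (ignoring edge effects at the start and end)."""
    return chunk_size / (chunk_size - overlap)

print(storage_blowup(200, 100))           # 2.0  -> half the store is duplicates
print(round(storage_blowup(512, 50), 2))  # 1.11 -> roughly 10% duplicates
```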

Pitfall 3: Swapped Embedding Model Without Clearing the Old Vector Store

# ❌ Wrong: Yesterday you built the index with BGE, today you switch to OpenAI
# and reuse the same chroma_db/ directory
vector_store = Chroma.from_documents(documents=chunks, embedding=new_embeddings)
# Result: Query vectors and index vectors come from different models — completely mismatched

Fix: When switching embedding models, always delete the old vector store and re-index:

if os.path.exists(persist_directory):
    shutil.rmtree(persist_directory)  # ✅ Clear old data

Pitfall 4: Hardcoding Top-K Without Adjusting for Question Complexity

Using K=4 for all questions, but "What's the default port?" (simple fact) and "Summarize all key points from Chapter 3" (comprehensive overview) require vastly different amounts of information.

Fix: Simple questions use K=3~4, complex questions use K=8~10. A more advanced approach is to use an LLM to judge question complexity first, then dynamically decide K (covered in a later article).
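As a placeholder until the LLM-based judge arrives, even a crude keyword heuristic beats a hardcoded K. The signal words below are my own guess, not a tested list:

```python
def choose_k(question: str) -> int:
    """Naive stand-in for an LLM-based complexity judge: treat topic
    connectors and summary verbs as a sign of a multi-topic question."""
    q = question.lower()
    multi_topic = (" and " in q) or q.startswith(("summarize", "compare", "list"))
    return 8 if multi_topic else 4

print(choose_k("What's the default port?"))                    # 4
print(choose_k("How do I configure the pool and log level?"))  # 8
```

The retriever then becomes `vector_store.as_retriever(search_kwargs={"k": choose_k(question)})` instead of a fixed number.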

Pitfall 5: Not Monitoring "Zero Retrieval"

Sometimes the retriever returns 0 relevant chunks (e.g., the user asks about something completely absent from the document), but you don't know. The LLM has no choice but to hallucinate from memory.

Fix: Add a threshold filter after retrieval — if the similarity score of the most relevant chunk is below a threshold (e.g., 0.6), directly tell the user "The document doesn't contain relevant information" instead of feeding irrelevant chunks to the LLM:

# Add a filter layer after retrieval; Chroma's relevance scores are
# normalized to [0, 1], higher = more similar
docs_and_scores = vector_store.similarity_search_with_relevance_scores(question, k=4)
docs_and_scores = [(doc, score) for doc, score in docs_and_scores if score >= 0.6]
if not docs_and_scores:
    return "Sorry, I cannot answer this question based on the available documents."
docs = [doc for doc, _ in docs_and_scores]

Parameter Selection Cheat Sheet

Here is everything above condensed into one table; tape it next to your monitor:

| Parameter | Default for Beginners | When to Increase | When to Decrease |
| --- | --- | --- | --- |
| Chunk Size | 512 | Answer needs large context (books/papers) | Answer is short (FAQ/config items) |
| Chunk Overlap | 50 (~10%) | Sentences often span pages/paragraphs | Document is highly structured with clear boundaries |
| Top-K | 4 | Question involves multiple topics | Question is very specific with a unique answer |
| Embedding | BGE (Chinese) / OpenAI (English) | Chinese professional documents | English general-purpose documents |

Summary

In this article, we covered the four core parameters of RAG:

  1. Chunk Size: Determines how long each text chunk is. Default 512. Use 256 for short answers, 1024 for long arguments.
  2. Chunk Overlap: Determines how much adjacent chunks overlap. Default 10% of Chunk Size. Prevents cross-chunk information from being severed.
  3. Top-K: Determines how many chunks to retrieve. Default 4. Increase to 8 for complex questions, decrease to 3 for simple facts.
  4. Embedding Model: Chinese documents use BGE, English documents use OpenAI. Remember to clear the vector store when switching models.

Through the controlled-variable experiment, we demonstrated that parameters are not "bigger is better" nor "smaller is better" — the key is finding the balance that suits your document type and query patterns.

Next up, we enter Part 2: Core Components — a deep dive into 4 chunking strategies (Fixed Size, Recursive Character, Semantic Chunking, Document Structure), thoroughly unpacking the "how to cut" problem.

