Why Does Your RAG Give Wrong Answers When Someone Else's Doesn't?
In the first two articles, we built a RAG pipeline that runs. But many people find that while the code works, answer quality is inconsistent — sometimes spot-on, sometimes missing information that's clearly in the document, sometimes drifting off-topic even when the right chunks were retrieved.
The problem is usually not the code. It's the parameters.
RAG has four core parameters, like four knobs on a radio:
- Chunk Size: How long is each text chunk?
- Chunk Overlap: How much do adjacent chunks overlap?
- Top-K: How many chunks does the retriever return?
- Embedding Model: How is text converted into vectors?
The combination of these four parameters directly determines whether the system can find relevant information and whether that information is enough to answer the question. In this article, we'll use a controlled-variable experiment so you can see the effect of different parameters with your own eyes.
Parameter 1: Chunk Size — How Long Is Each Chunk?
What Is Chunk Size?
Imagine you're organizing a 500-page technical manual. Chunk Size is how many pages you read at a time — 1 page, 5 pages, or 50 pages?
In RAG, Chunk Size is the maximum number of characters (or tokens) in each text chunk. The document is cut into many chunks, each no longer than this limit.
Why Does It Matter?
Chunk Size directly impacts two metrics:
| Chunk Size | Retrieval Precision | Context Completeness | Analogy |
|---|---|---|---|
| Small (128) | High | Poor | Like reading dictionary entries — precise but isolated |
| Medium (512) | Balanced | Balanced | Like reading a paragraph — enough context without bloat |
| Large (2048) | Low | Good | Like reading an entire chapter — complete but noisy |
What's wrong with too small? Suppose the document says: "The system uses Redis for caching with a default TTL of 3600 seconds. If this timeout is exceeded, data is automatically purged." If Chunk Size=128, this sentence might be split into two chunks: "The system uses Redis for caching with a default TTL of 3600 seconds." and "If this timeout is exceeded, data is automatically purged." When the user asks "What happens when Redis cache expires?", the retriever might only return the first chunk. The LLM sees "3600 seconds" but doesn't know about "automatically purged" — the answer is incomplete.
What's wrong with too large? Suppose Chunk Size=2048, and one chunk contains five unrelated topics. When the user asks a specific question, this chunk gets retrieved, but the LLM's attention is scattered by irrelevant content — like trying to hear one person speak in a noisy marketplace.
How to Choose?
There's no silver bullet, but there's a rule of thumb:
Chunk Size ≈ 1.5 ~ 2 × the expected answer length
| Document Type | Recommended Chunk Size | Reasoning |
|---|---|---|
| FAQ / Q&A pairs | 256 ~ 384 | Short answers, precision matters more |
| Technical docs / API manuals | 512 ~ 768 | Medium-length answers, need some context |
| Papers / book chapters | 1024 ~ 1536 | Argument-heavy, need large context |
| Legal contracts / medical records | 768 ~ 1024 | Dense terminology, need inference from context |
Heuristic: Start with 512, then observe retrieval results. If you notice "answers are cut off", increase it. If you notice "retrieved chunks contain too much irrelevant content", decrease it.
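Before committing to a value, you can split the document at a few candidate sizes and compare chunk-count and average-length statistics. This is a minimal sketch; it assumes the document has already been loaded into doc with PyPDFLoader, as in the hands-on section later in this article.
from langchain_text_splitters import RecursiveCharacterTextSplitter

def chunk_stats(doc, chunk_size):
    # Split at the candidate size and report how many chunks result and how long they are
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=int(chunk_size * 0.1),  # overlap is covered in the next section
    )
    chunks = splitter.split_documents(doc)
    avg_len = sum(len(c.page_content) for c in chunks) / len(chunks)
    print(f"chunk_size={chunk_size}: {len(chunks)} chunks, avg {avg_len:.0f} chars")

for size in (256, 512, 1024):
    chunk_stats(doc, size)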
Parameter 2: Chunk Overlap — How Much Should Adjacent Chunks Overlap?
What Is Chunk Overlap?
Back to that technical manual. If you read 5 pages at a time, Overlap is how many pages from the previous chunk you keep when starting the next one. For example, Overlap=1 means: first read pages 1-5, then read pages 5-9 (page 5 appears twice).
Why Is Overlap Needed?
Without overlap, critical information can get "cut at the seam":
Chunk A: "The system uses Redis for caching with a default TTL of 3600 seconds."
Chunk B: "If this timeout is exceeded, data is automatically purged."
If the user asks "What happens when Redis cache expires?", the embedding model might rank Chunk B as the most relevant (its talk of the timeout being exceeded is semantically closest to "expires") and return only Chunk B. But Chunk B starts with "If this timeout is exceeded" — without Chunk A, the LLM doesn't know what "this timeout" refers to.
With Overlap=50, Chunk B starts with the last 50 characters of Chunk A:
Chunk B (with overlap): "...default TTL of 3600 seconds. If this timeout is exceeded, data is automatically purged."
Now even if only Chunk B is retrieved, the LLM can infer "this timeout = 3600 seconds".
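To see this mechanism concretely, here is a minimal sketch that splits a made-up two-sentence string (not the article's PDF) with and without overlap; with overlap, the second chunk repeats the tail of the first:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = (
    "The system uses Redis for caching with a default TTL of 3600 seconds. "
    "If this timeout is exceeded, data is automatically purged."
)

for overlap in (0, 50):
    splitter = RecursiveCharacterTextSplitter(chunk_size=80, chunk_overlap=overlap)
    chunks = splitter.split_text(text)
    print(f"\noverlap={overlap}")
    for c in chunks:
        print(f"  -> {c}")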
How Much Overlap?
Generally set to 10% ~ 20% of Chunk Size:
| Chunk Size | Recommended Overlap | Notes |
|---|---|---|
| 256 | 25 ~ 50 | Short text, small overlap preserves context |
| 512 | 50 ~ 100 | The sweet spot for general use |
| 1024 | 100 ~ 200 | Long text needs more overlap to preserve continuity |
Note: More overlap is not always better. Too much overlap leads to storing massive amounts of duplicate content in the vector database, increasing storage cost and deduplication burden during retrieval.
Parameter 3: Top-K — How Many Chunks to Retrieve?
What Is Top-K?
Top-K is the number of text chunks the retriever returns each time. K=4 means "give me the 4 most relevant chunks", K=10 means "give me the 10 most relevant chunks".
Why Does It Matter?
K too small = missing information. K too large = introducing noise.
Scenario A: K=2, missing critical information
User asks: "How do I configure the database connection pool and log level?" This question involves two topics. If K=2, the retriever might only return two chunks about "database connection pool" and completely miss "log level" — the LLM can only answer half the question.
Scenario B: K=20, noise drowning out the answer
User asks: "What's the default timeout?" The document has a clear answer. But K=20 retrieves 20 chunks, 19 of which discuss unrelated topics. The LLM's context window is filled with irrelevant content, and it can't find that simple number.
How to Choose?
Top-K = Number of topics the answer is expected to cover × 2 ~ 3
| Query Type | Recommended K | Reasoning |
|---|---|---|
| Single-point fact ("What's the default port?") | 3 ~ 5 | Focused answer, fewer is better |
| Multi-condition ("How do I configure A and B?") | 5 ~ 8 | Might involve multiple topics |
| Comprehensive summary ("Summarize Chapter 3") | 8 ~ 12 | Need to cover multiple points |
Heuristic: Start with K=4. If you notice "the answer is missing a part", increase it. If you notice "the answer contains irrelevant content", decrease it.
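In LangChain, K is simply the k value passed to the retriever. A minimal sketch, assuming a vector_store has already been built as in the hands-on section below; the two-retriever split is just illustrative:
# Different retrievers for different query types
fact_retriever = vector_store.as_retriever(search_kwargs={"k": 4})      # single-point facts
summary_retriever = vector_store.as_retriever(search_kwargs={"k": 10})  # comprehensive summaries

docs = fact_retriever.invoke("What's the default port?")
print(f"Retrieved {len(docs)} chunks")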
Parameter 4: Embedding Model — Who Does the "Semantic Translation"?
Embedding Is RAG's "Translator"
What an embedding model does is simple: convert text into a sequence of numbers (a vector). Semantically similar texts have vectors that are close together; semantically dissimilar texts have vectors that are far apart.
The retriever relies on this — it converts the user's question into a vector, then finds the nearest vectors in the database.
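Here is a minimal sketch of what "close together" means in practice: embed a question and a candidate sentence, then compute their cosine similarity. It assumes an embeddings object like the one built in the hands-on section below; the helper function is just illustrative.
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: values near 1.0 mean very similar, values near 0 mean unrelated
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

q_vec = embeddings.embed_query("What happens when Redis cache expires?")
d_vec = embeddings.embed_documents(["If this timeout is exceeded, data is automatically purged."])[0]
print(f"similarity: {cosine_similarity(q_vec, d_vec):.3f}")  # higher = semantically closer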
How Big Is the Difference Between Models?
Very big. For the same question, different models can return completely different results.
| Model | Strong Language | Dimensions | Positioning | Best For |
|---|---|---|---|---|
| text-embedding-3-small | English | 1536 | Cheap & fast | English docs, budget-sensitive |
| text-embedding-3-large | English | 3072 | High precision | English docs, precision-first |
| BAAI/bge-large-zh-v1.5 | Chinese | 1024 | Best for Chinese | Chinese docs, China-first choice |
| BAAI/bge-m3 | Multilingual | 1024 | Multilingual | Mixed Chinese-English, cross-lingual |
A Real Comparison Experiment
We use the same Chinese technical document (Automotive SPICE PAM v4.0), the same question, and compare retrieval results between text-embedding-3-small and BAAI/bge-large-zh-v1.5:
Question: "What is process capability level 1?"
| Model | 1st Retrieved Result | 2nd Retrieved Result | Assessment |
|---|---|---|---|
| text-embedding-3-small | Page 12: paragraph about project management | Page 89: paragraph about risk assessment | ❌ Neither mentions "process capability level" |
| BAAI/bge-large-zh-v1.5 | Page 45: Definition of process capability level 1 | Page 46: Example practices for level 1 | ✅ Direct hit |
Reason: OpenAI's models are primarily trained on English corpora. Their understanding of Chinese technical terminology is not as strong as BGE, which is specifically fine-tuned on Chinese corpora.
How to Choose an Embedding Model?
Decision tree:
What language is your document?
├─ Pure English → text-embedding-3-small (best value)
│ or text-embedding-3-large (best precision)
├─ Pure Chinese → BAAI/bge-large-zh-v1.5 (China-first choice)
│ or BAAI/bge-m3 (if mixed Chinese-English)
└─ Mixed Chinese-English → BAAI/bge-m3 (best multilingual support)
Switching models is a one-line change: just update model="..." in the build_embeddings() function. Everything else stays the same — that's the beauty of LangChain.
Hands-On: Controlled-Variable Experiment
Let's run an experiment: same document, same question, only changing Chunk Size, and observe how answer quality changes.
Experimental Design
"""
RAG Parameter Controlled-Variable Experiment
Fixed: document, question, embedding model, Top-K, LLM
Variable: Chunk Size
"""
import os
from pathlib import Path
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
# Load document
doc = PyPDFLoader("./data/Automotive-SPICE-PAM-v40.pdf").load()
# Embedding (fixed)
embeddings = OpenAIEmbeddings(
    model="BAAI/bge-large-zh-v1.5",
    api_key=os.getenv("EMBEDDING_API_KEY"),
    base_url="https://api.siliconflow.cn/v1",
    chunk_size=32,  # batch size per embedding request, not the text chunk size
)

# LLM (fixed)
llm = ChatOpenAI(
    model="glm-4-flash",
    api_key=os.getenv("LLM_API_KEY"),
    base_url="https://open.bigmodel.cn/api/paas/v4",
    temperature=0,
)
# Test different Chunk Sizes
def test_chunk_size(chunk_size, overlap):
    print(f"\n{'='*50}")
    print(f"Chunk Size={chunk_size}, Overlap={overlap}")
    print(f"{'='*50}")

    # Split
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        length_function=len,
    )
    chunks = splitter.split_documents(doc)
    print(f"Generated {len(chunks)} chunks")

    # Build vector store
    persist_dir = f"./chroma_db_{chunk_size}"
    if os.path.exists(persist_dir):
        import shutil
        shutil.rmtree(persist_dir)
    vector_store = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=persist_dir,
    )

    # Build RAG Chain (LCEL style)
    retriever = vector_store.as_retriever(search_kwargs={"k": 4})
    prompt = ChatPromptTemplate.from_messages([
        ("system", "Answer based on reference content. Reference:\n{context}"),
        ("human", "{question}"),
    ])
    rag_chain = (
        {"context": retriever | (lambda docs: "\n\n".join(d.page_content for d in docs)),
         "question": RunnablePassthrough()}
        | prompt | llm | StrOutputParser()
    )

    # Ask question
    question = "What is process capability level 1?"
    answer = rag_chain.invoke(question)
    print(f"\nAnswer: {answer[:200]}...")

    # Print retrieved sources
    sources = retriever.invoke(question)
    print(f"\nRetrieved {len(sources)} sources:")
    for i, s in enumerate(sources[:3], 1):
        print(f"  [{i}] Page {s.metadata.get('page', '?')}: {s.page_content[:80]}...")
# Run three experiments
test_chunk_size(chunk_size=128, overlap=20)
test_chunk_size(chunk_size=512, overlap=50)
test_chunk_size(chunk_size=1024, overlap=100)
Expected Results
| Chunk Size | Number of Chunks | Retrieval Quality | Typical Observation |
|---|---|---|---|
| 128 | Many (~4000) | High precision but broken context | Retrieved chunks have the keyword "process capability level" but lack sufficient context; LLM answers are fragmented |
| 512 | Medium (~1000) | Best balance | Retrieved chunks contain complete definitions + examples; LLM answers are coherent and accurate |
| 1024 | Few (~500) | Complete context but low precision | Retrieved chunks contain lots of irrelevant content (e.g., descriptions of other levels); LLM answers are verbose |
Key insight: Chunk Size is not "bigger is better" nor "smaller is better". 512 characters is a safe starting point for most Chinese technical documents.
The 5 Most Common Pitfalls
Pitfall 1: Setting Chunk Size by Token Count, but length_function Uses Character Count
# ❌ Wrong: You think chunk_size=512 means 512 tokens
splitter = RecursiveCharacterTextSplitter(chunk_size=512)
# Actually the default length_function=len counts characters!
# Depending on the tokenizer and the language, 512 characters can be far more or far fewer than 512 tokens, so chunks end up a different size than you expected
Fix: If you want to count by tokens, explicitly specify a tokenizer:
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")  # load the tokenizer once, not on every call

def token_length(text):
    return len(encoding.encode(text))

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    length_function=token_length,  # ✅ Count by tokens
)
Pitfall 2: Overlap Too Large, Causing 30% Duplicate Content in the Vector Store
Overlap is not free. Every overlapping character requires an embedding computation and takes up storage space in the vector database. Overlap=100 with Chunk Size=200 means 50% of your storage is redundant.
Fix: Set Overlap to 10%~15% of Chunk Size, never exceed 20%.
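The redundancy ratio is simple arithmetic you can check before indexing (a back-of-the-envelope sketch, not a library call):
def overlap_redundancy(chunk_size, overlap):
    # Rough fraction of stored characters duplicated from the previous chunk
    return overlap / chunk_size

print(overlap_redundancy(200, 100))  # 0.50 -> half the store is duplicates, far too much
print(overlap_redundancy(512, 50))   # ~0.10 -> about 10%, within the recommended range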
Pitfall 3: Swapped Embedding Model Without Clearing the Old Vector Store
# ❌ Wrong: Yesterday you built the index with BGE, today you switch to OpenAI
# and reuse the same chroma_db/ directory
vector_store = Chroma.from_documents(documents=chunks, embedding=new_embeddings)
# Result: Query vectors and index vectors come from different models — completely mismatched
Fix: When switching embedding models, always delete the old vector store and re-index:
if os.path.exists(persist_directory):
    shutil.rmtree(persist_directory)  # ✅ Clear old data
Pitfall 4: Hardcoding Top-K Without Adjusting for Question Complexity
Using K=4 for all questions, but "What's the default port?" (simple fact) and "Summarize all key points from Chapter 3" (comprehensive overview) require vastly different amounts of information.
Fix: Simple questions use K=3~4, complex questions use K=8~10. A more advanced approach is to use an LLM to judge question complexity first, then dynamically decide K (covered in a later article); a crude rule-based version is sketched below.
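The sketch assumes a vector_store built as earlier in this article and a question string; the marker words and k values are hypothetical and should be tuned to your data.
def choose_k(question: str) -> int:
    # Crude heuristic: summary-style or multi-topic questions get a larger k
    broad_markers = ["summarize", "overview", "compare", " and "]
    if any(marker in question.lower() for marker in broad_markers):
        return 8
    return 4

retriever = vector_store.as_retriever(search_kwargs={"k": choose_k(question)})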
Pitfall 5: Not Monitoring "Zero Retrieval"
Sometimes the retriever returns 0 relevant chunks (e.g., the user asks about something completely absent from the document), but you don't know. The LLM has no choice but to hallucinate from memory.
Fix: Add a threshold filter after retrieval — if the similarity score of the most relevant chunk is below a threshold (e.g., 0.6), directly tell the user "The document doesn't contain relevant information" instead of feeding irrelevant chunks to the LLM:
# Add a filter layer after retrieval; relevance scores come back alongside the documents
results = vector_store.similarity_search_with_relevance_scores(question, k=4)
if not results or results[0][1] < 0.6:  # even the best chunk is below the threshold
    answer = "Sorry, I cannot answer this question based on the available documents."
else:
    answer = rag_chain.invoke(question)
Parameter Selection Cheat Sheet
Everything above, condensed into one table. Tape it next to your monitor:
| Parameter | Default for Beginners | When to Increase | When to Decrease |
|---|---|---|---|
| Chunk Size | 512 | Answer needs large context (books/papers) | Answer is short (FAQ/config items) |
| Chunk Overlap | 50 (~10%) | Sentences often span pages/paragraphs | Document is highly structured with clear boundaries |
| Top-K | 4 | Question involves multiple topics | Question is very specific with a unique answer |
| Embedding | BGE (Chinese) / OpenAI (English) | Chinese professional documents | English general-purpose documents |
Summary
In this article, we covered the four core parameters of RAG:
- Chunk Size: Determines how long each text chunk is. Default 512. Use 256 for short answers, 1024 for long arguments.
- Chunk Overlap: Determines how much adjacent chunks overlap. Default 10% of Chunk Size. Prevents cross-chunk information from being severed.
- Top-K: Determines how many chunks to retrieve. Default 4. Increase to 8 for complex questions, decrease to 3 for simple facts.
- Embedding Model: Chinese documents use BGE, English documents use OpenAI. Remember to clear the vector store when switching models.
Through the controlled-variable experiment, we demonstrated that parameters are not "bigger is better" nor "smaller is better" — the key is finding the balance that suits your document type and query patterns.
Next up, we enter Part 2: Core Components — a deep dive into 4 chunking strategies (Fixed Size, Recursive Character, Semantic Chunking, Document Structure), thoroughly unpacking the "how to cut" problem.
References
- LangChain Text Splitters Documentation — Official chunking strategy guide
- BGE Embedding Models GitHub — Best practices for Chinese embeddings
- MTEB Leaderboard — Authoritative embedding model ranking
- ChromaDB Distance Metrics — Cosine similarity vs Euclidean distance