Why Does Your RAG Give Wrong Answers When Someone Else's Doesn't?
In the first two articles, we built a RAG pipeline that runs. But many people find that while the code works, answer quality is inconsistent — sometimes spot-on, sometimes missing information that's clearly in the document, sometimes drifting off-topic even when the right chunks were retrieved.
The problem is usually not the code. It's the parameters.
RAG has four core parameters, like four knobs on a radio:
- Chunk Size: How long is each text chunk?
- Chunk Overlap: How much do adjacent chunks overlap?
- Top-K: How many chunks does the retriever return?
- Embedding Model: How is text converted into vectors?
The combination of these four parameters directly determines whether the system can find relevant information and whether that information is enough to answer the question. In this article, we'll use a controlled-variable experiment so you can see the effect of different parameters with your own eyes.
Parameter 1: Chunk Size — How Long Is Each Chunk?
What Is Chunk Size?
Imagine you're organizing a 500-page technical manual. Chunk Size is how many pages you read at a time — 1 page, 5 pages, or 50 pages?
In RAG, Chunk Size is the maximum number of characters (or tokens) in each text chunk. The document is cut into many chunks, each no longer than this limit.
Why Does It Matter?
Chunk Size directly impacts two metrics:
| Chunk Size | Retrieval Precision | Context Completeness | Analogy |
|---|---|---|---|
| Small (128) | High | Poor | Like reading dictionary entries — precise but isolated |
| Medium (512) | Balanced | Balanced | Like reading a paragraph — enough context without bloat |
| Large (2048) | Low | Good | Like reading an entire chapter — complete but noisy |
What's wrong with too small? Suppose the document says: "The system uses Redis for caching with a default TTL of 3600 seconds. If this timeout is exceeded, data is automatically purged." If Chunk Size=128, this sentence might be split into two chunks: "The system uses Redis for caching with a default TTL of 3600 seconds." and "If this timeout is exceeded, data is automatically purged." When the user asks "What happens when Redis cache expires?", the retriever might only return the first chunk. The LLM sees "3600 seconds" but doesn't know about "automatically purged" — the answer is incomplete.
What's wrong with too large? Suppose Chunk Size=2048, and one chunk contains five unrelated topics. When the user asks a specific question, this chunk gets retrieved, but the LLM's attention is scattered by irrelevant content — like trying to hear one person speak in a noisy marketplace.
How to Choose?
There's no silver bullet, but there's a rule of thumb:
Chunk Size ≈ 1.5 ~ 2 × the expected answer length
| Document Type | Recommended Chunk Size | Reasoning |
|---|---|---|
| FAQ / Q&A pairs | 256 ~ 384 | Short answers, precision matters more |
| Technical docs / API manuals | 512 ~ 768 | Medium-length answers, need some context |
| Papers / book chapters | 1024 ~ 1536 | Argument-heavy, need large context |
| Legal contracts / medical records | 768 ~ 1024 | Dense terminology, need inference from context |
Heuristic: Start with 512, then observe retrieval results. If you notice "answers are cut off", increase it. If you notice "retrieved chunks contain too much irrelevant content", decrease it.
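Before committing to a value, you can split the document at a few candidate sizes and compare chunk-count and average-length statistics. This is a minimal sketch; it assumes the document has already been loaded into doc with PyPDFLoader, as in the hands-on section later in this article.
from langchain_text_splitters import RecursiveCharacterTextSplitter

def chunk_stats(doc, chunk_size):
    # Split at the candidate size and report how many chunks result and how long they are
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=int(chunk_size * 0.1),  # overlap is covered in the next section
    )
    chunks = splitter.split_documents(doc)
    avg_len = sum(len(c.page_content) for c in chunks) / len(chunks)
    print(f"chunk_size={chunk_size}: {len(chunks)} chunks, avg {avg_len:.0f} chars")

for size in (256, 512, 1024):
    chunk_stats(doc, size)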
Parameter 2: Chunk Overlap — How Much Should Adjacent Chunks Overlap?
What Is Chunk Overlap?
Back to that technical manual. If you read 5 pages at a time, Overlap is how many pages from the previous chunk you keep when starting the next one. For example, Overlap=1 means: first read pages 1-5, then read pages 5-9 (page 5 appears twice).
Why Is Overlap Needed?
Without overlap, critical information can get "cut at the seam":
Chunk A: "The system uses Redis for caching with a default TTL of 3600 seconds."
Chunk B: "If this timeout is exceeded, data is automatically purged."
If the user asks "What happens when Redis cache expires?", the embedding model might rank Chunk B as the most relevant (its talk of the timeout being exceeded is semantically closest to "expires") and return only Chunk B. But Chunk B starts with "If this timeout is exceeded" — without Chunk A, the LLM doesn't know what "this timeout" refers to.
With Overlap=50, Chunk B starts with the last 50 characters of Chunk A:
Chunk B (with overlap): "...default TTL of 3600 seconds. If this timeout is exceeded, data is automatically purged."
Now even if only Chunk B is retrieved, the LLM can infer "this timeout = 3600 seconds".
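To see this mechanism concretely, here is a minimal sketch that splits a made-up two-sentence string (not the article's PDF) with and without overlap; with overlap, the second chunk repeats the tail of the first:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = (
    "The system uses Redis for caching with a default TTL of 3600 seconds. "
    "If this timeout is exceeded, data is automatically purged."
)

for overlap in (0, 50):
    splitter = RecursiveCharacterTextSplitter(chunk_size=80, chunk_overlap=overlap)
    chunks = splitter.split_text(text)
    print(f"\noverlap={overlap}")
    for c in chunks:
        print(f"  -> {c}")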
How Much Overlap?
Generally set to 10% ~ 20% of Chunk Size:
| Chunk Size | Recommended Overlap | Notes |
|---|---|---|
| 256 | 25 ~ 50 | Short text, small overlap preserves context |
| 512 | 50 ~ 100 | The sweet spot for general use |
| 1024 | 100 ~ 200 | Long text needs more overlap to preserve continuity |
Note: More overlap is not always better. Too much overlap leads to storing massive amounts of duplicate content in the vector database, increasing storage cost and deduplication burden during retrieval.
Parameter 3: Top-K — How Many Chunks to Retrieve?
What Is Top-K?
Top-K is the number of text chunks the retriever returns each time. K=4 means "give me the 4 most relevant chunks", K=10 means "give me the 10 most relevant chunks".
Why Does It Matter?
K too small = missing information. K too large = introducing noise.
Scenario A: K=2, missing critical information
User asks: "How do I configure the database connection pool and log level?" This question involves two topics. If K=2, the retriever might only return two chunks about "database connection pool" and completely miss "log level" — the LLM can only answer half the question.
Scenario B: K=20, noise drowning out the answer
User asks: "What's the default timeout?" The document has a clear answer. But K=20 retrieves 20 chunks, 19 of which discuss unrelated topics. The LLM's context window is filled with irrelevant content, and it can't find that simple number.
How to Choose?
Top-K = Number of topics the answer is expected to cover × 2 ~ 3
| Query Type | Recommended K | Reasoning |
|---|---|---|
| Single-point fact ("What's the default port?") | 3 ~ 5 | Focused answer, fewer is better |
| Multi-condition ("How do I configure A and B?") | 5 ~ 8 | Might involve multiple topics |
| Comprehensive summary ("Summarize Chapter 3") | 8 ~ 12 | Need to cover multiple points |
Heuristic: Start with K=4. If you notice "the answer is missing a part", increase it. If you notice "the answer contains irrelevant content", decrease it.
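In LangChain, K is simply the k value passed to the retriever. A minimal sketch, assuming a vector_store has already been built as in the hands-on section below; the two-retriever split is just illustrative:
# Different retrievers for different query types
fact_retriever = vector_store.as_retriever(search_kwargs={"k": 4})      # single-point facts
summary_retriever = vector_store.as_retriever(search_kwargs={"k": 10})  # comprehensive summaries

docs = fact_retriever.invoke("What's the default port?")
print(f"Retrieved {len(docs)} chunks")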
Parameter 4: Embedding Model — Who Does the "Semantic Translation"?
Embedding Is RAG's "Translator"
What an embedding model does is simple: convert text into a sequence of numbers (a vector). Semantically similar texts have vectors that are close together; semantically dissimilar texts have vectors that are far apart.
The retriever relies on this — it converts the user's question into a vector, then finds the nearest vectors in the database.
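Here is a minimal sketch of what "close together" means in practice: embed a question and a candidate sentence, then compute their cosine similarity. It assumes an embeddings object like the one built in the hands-on section below; the helper function is just illustrative.
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: values near 1.0 mean very similar, values near 0 mean unrelated
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

q_vec = embeddings.embed_query("What happens when Redis cache expires?")
d_vec = embeddings.embed_documents(["If this timeout is exceeded, data is automatically purged."])[0]
print(f"similarity: {cosine_similarity(q_vec, d_vec):.3f}")  # higher = semantically closer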
How Big Is the Difference Between Models?
Very big. For the same question, different models can return completely different results.
| Model | Strong Language | Dimensions | Positioning | Best For |
|---|---|---|---|---|
| text-embedding-3-small | English | 1536 | Cheap & fast | English docs, budget-sensitive |
| text-embedding-3-large | English | 3072 | High precision | English docs, precision-first |
| BAAI/bge-large-zh-v1.5 | Chinese | 1024 | Best for Chinese | Chinese docs, China-first choice |
| BAAI/bge-m3 | Multilingual | 1024 | Multilingual | Mixed Chinese-English, cross-lingual |
A Real Comparison Experiment
We use the same Chinese technical document (Automotive SPICE PAM v4.0), the same question, and compare retrieval results between text-embedding-3-small and BAAI/bge-large-zh-v1.5:
Question: "What is process capability level 1?"
| Model | 1st Retrieved Result | 2nd Retrieved Result | Assessment |
|---|---|---|---|
| text-embedding-3-small | Page 12: paragraph about project management | Page 89: paragraph about risk assessment | ❌ Neither mentions "process capability level" |
| BAAI/bge-large-zh-v1.5 | Page 45: Definition of process capability level 1 | Page 46: Example practices for level 1 | ✅ Direct hit |
Reason: OpenAI's models are primarily trained on English corpora. Their understanding of Chinese technical terminology is not as strong as BGE, which is specifically fine-tuned on Chinese corpora.
How to Choose an Embedding Model?
Decision tree:
What language is your document?
├─ Pure English → text-embedding-3-small (best value)
│ or text-embedding-3-large (best precision)
├─ Pure Chinese → BAAI/bge-large-zh-v1.5 (China-first choice)
│ or BAAI/bge-m3 (if mixed Chinese-English)
└─ Mixed Chinese-English → BAAI/bge-m3 (best multilingual support)
Switching models is a one-line change: just update model="..." in the build_embeddings() function. Everything else stays the same — that's the beauty of LangChain.
Hands-On: Controlled-Variable Experiment
Let's run an experiment: same document, same question, only changing Chunk Size, and observe how answer quality changes.
Experimental Design
"""
RAG Parameter Controlled-Variable Experiment
Fixed: document, question, embedding model, Top-K, LLM
Variable: Chunk Size
"""
import os
from pathlib import Path
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_chroma import Chroma
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
# Load document
doc = PyPDFLoader("./data/Automotive-SPICE-PAM-v40.pdf").load()
# Embedding (fixed)
embeddings = OpenAIEmbeddings(
    model="BAAI/bge-large-zh-v1.5",
    api_key=os.getenv("EMBEDDING_API_KEY"),
    base_url="https://api.siliconflow.cn/v1",
    chunk_size=32,  # batch size per embedding request, not the text chunk size
)

# LLM (fixed)
llm = ChatOpenAI(
    model="glm-4-flash",
    api_key=os.getenv("LLM_API_KEY"),
    base_url="https://open.bigmodel.cn/api/paas/v4",
    temperature=0,
)
# Test different Chunk Sizes
def test_chunk_size(chunk_size, overlap):
    print(f"\n{'='*50}")
    print(f"Chunk Size={chunk_size}, Overlap={overlap}")
    print(f"{'='*50}")

    # Split
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        length_function=len,
    )
    chunks = splitter.split_documents(doc)
    print(f"Generated {len(chunks)} chunks")

    # Build vector store
    persist_dir = f"./chroma_db_{chunk_size}"
    if os.path.exists(persist_dir):
        import shutil
        shutil.rmtree(persist_dir)
    vector_store = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory=persist_dir,
    )

    # Build RAG Chain (LCEL style)
    retriever = vector_store.as_retriever(search_kwargs={"k": 4})
    prompt = ChatPromptTemplate.from_messages([
        ("system", "Answer based on reference content. Reference:\n{context}"),
        ("human", "{question}"),
    ])
    rag_chain = (
        {"context": retriever | (lambda docs: "\n\n".join(d.page_content for d in docs)),
         "question": RunnablePassthrough()}
        | prompt | llm | StrOutputParser()
    )

    # Ask question
    question = "What is process capability level 1?"
    answer = rag_chain.invoke(question)
    print(f"\nAnswer: {answer[:200]}...")

    # Print retrieved sources
    sources = retriever.invoke(question)
    print(f"\nRetrieved {len(sources)} sources:")
    for i, s in enumerate(sources[:3], 1):
        print(f"  [{i}] Page {s.metadata.get('page', '?')}: {s.page_content[:80]}...")
# Run three experiments
test_chunk_size(chunk_size=128, overlap=20)
test_chunk_size(chunk_size=512, overlap=50)
test_chunk_size(chunk_size=1024, overlap=100)
Expected Results
| Chunk Size | Number of Chunks | Retrieval Quality | Typical Observation |
|---|---|---|---|
| 128 | Many (~4000) | High precision but broken context | Retrieved chunks have the keyword "process capability level" but lack sufficient context; LLM answers are fragmented |
| 512 | Medium (~1000) | Best balance | Retrieved chunks contain complete definitions + examples; LLM answers are coherent and accurate |
| 1024 | Few (~500) | Complete context but low precision | Retrieved chunks contain lots of irrelevant content (e.g., descriptions of other levels); LLM answers are verbose |
Key insight: Chunk Size is not "bigger is better" nor "smaller is better". 512 characters is a safe starting point for most Chinese technical documents.
The 5 Most Common Pitfalls
Pitfall 1: Setting Chunk Size by Token Count, but length_function Uses Character Count
# ❌ Wrong: You think chunk_size=512 means 512 tokens
splitter = RecursiveCharacterTextSplitter(chunk_size=512)
# Actually the default length_function=len counts characters!
# Depending on the tokenizer and the language, 512 characters can be far more or far fewer than 512 tokens, so chunks end up a different size than you expected
Fix: If you want to count by tokens, explicitly specify a tokenizer:
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")  # load the tokenizer once, not on every call

def token_length(text):
    return len(encoding.encode(text))

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    length_function=token_length,  # ✅ Count by tokens
)
Pitfall 2: Overlap Too Large, Causing 30% Duplicate Content in the Vector Store
Overlap is not free. Every overlapping character requires an embedding computation and takes up storage space in the vector database. Overlap=100 with Chunk Size=200 means 50% of your storage is redundant.
Fix: Set Overlap to 10%~15% of Chunk Size, never exceed 20%.
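The redundancy ratio is simple arithmetic you can check before indexing (a back-of-the-envelope sketch, not a library call):
def overlap_redundancy(chunk_size, overlap):
    # Rough fraction of stored characters duplicated from the previous chunk
    return overlap / chunk_size

print(overlap_redundancy(200, 100))  # 0.50 -> half the store is duplicates, far too much
print(overlap_redundancy(512, 50))   # ~0.10 -> about 10%, within the recommended range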
Pitfall 3: Swapped Embedding Model Without Clearing the Old Vector Store
# ❌ Wrong: Yesterday you built the index with BGE, today you switch to OpenAI
# and reuse the same chroma_db/ directory
vector_store = Chroma.from_documents(documents=chunks, embedding=new_embeddings)
# Result: Query vectors and index vectors come from different models — completely mismatched
Fix: When switching embedding models, always delete the old vector store and re-index:
if os.path.exists(persist_directory):
    shutil.rmtree(persist_directory)  # ✅ Clear old data
Pitfall 4: Hardcoding Top-K Without Adjusting for Question Complexity
Using K=4 for all questions, but "What's the default port?" (simple fact) and "Summarize all key points from Chapter 3" (comprehensive overview) require vastly different amounts of information.
Fix: Simple questions use K=3~4, complex questions use K=8~10. A more advanced approach is to use an LLM to judge question complexity first, then dynamically decide K (covered in a later article); a crude rule-based version is sketched below.
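The sketch assumes a vector_store built as earlier in this article and a question string; the marker words and k values are hypothetical and should be tuned to your data.
def choose_k(question: str) -> int:
    # Crude heuristic: summary-style or multi-topic questions get a larger k
    broad_markers = ["summarize", "overview", "compare", " and "]
    if any(marker in question.lower() for marker in broad_markers):
        return 8
    return 4

retriever = vector_store.as_retriever(search_kwargs={"k": choose_k(question)})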
Pitfall 5: Not Monitoring "Zero Retrieval"
Sometimes the retriever returns 0 relevant chunks (e.g., the user asks about something completely absent from the document), but you don't know. The LLM has no choice but to hallucinate from memory.
Fix: Add a threshold filter after retrieval — if the similarity score of the most relevant chunk is below a threshold (e.g., 0.6), directly tell the user "The document doesn't contain relevant information" instead of feeding irrelevant chunks to the LLM:
# Add a filter layer after retrieval; relevance scores come back alongside the documents
results = vector_store.similarity_search_with_relevance_scores(question, k=4)
if not results or results[0][1] < 0.6:  # even the best chunk is below the threshold
    answer = "Sorry, I cannot answer this question based on the available documents."
else:
    answer = rag_chain.invoke(question)
Parameter Selection Cheat Sheet
Everything above, condensed into one table. Tape it next to your monitor:
| Parameter | Default for Beginners | When to Increase | When to Decrease |
|---|---|---|---|
| Chunk Size | 512 | Answer needs large context (books/papers) | Answer is short (FAQ/config items) |
| Chunk Overlap | 50 (~10%) | Sentences often span pages/paragraphs | Document is highly structured with clear boundaries |
| Top-K | 4 | Question involves multiple topics | Question is very specific with a unique answer |
| Embedding | BGE (Chinese) / OpenAI (English) | Chinese professional documents | English general-purpose documents |
Summary
In this article, we covered the four core parameters of RAG:
- Chunk Size: Determines how long each text chunk is. Default 512. Use 256 for short answers, 1024 for long arguments.
- Chunk Overlap: Determines how much adjacent chunks overlap. Default 10% of Chunk Size. Prevents cross-chunk information from being severed.
- Top-K: Determines how many chunks to retrieve. Default 4. Increase to 8 for complex questions, decrease to 3 for simple facts.
- Embedding Model: Chinese documents use BGE, English documents use OpenAI. Remember to clear the vector store when switching models.
Through the controlled-variable experiment, we demonstrated that parameters are not "bigger is better" nor "smaller is better" — the key is finding the balance that suits your document type and query patterns.
Next up, we enter Part 2: Core Components — a deep dive into 4 chunking strategies (Fixed Size, Recursive Character, Semantic Chunking, Document Structure), thoroughly unpacking the "how to cut" problem.
References
- LangChain Text Splitters Documentation — Official chunking strategy guide
- BGE Embedding Models GitHub — Best practices for Chinese embeddings
- MTEB Leaderboard — Authoritative embedding model ranking
- ChromaDB Distance Metrics — Cosine similarity vs Euclidean distance