Introduction
When building RAG (Retrieval-Augmented Generation) systems, you often hit a wall where accuracy just won't improve no matter what you try.
In this article, I'll share how I improved answer accuracy from 73.3% to 100% in a RAG system for internal company policy documents. Among the various chunking strategies tested, surprisingly, the simplest solution turned out to be the most effective.
I also discovered that adding Re-ranking, which I expected to improve accuracy, actually made it worse. I'll explain why this happened.
Project Architecture
Tech Stack
| Layer | Technology |
|---|---|
| Frontend | Next.js 16 + React 19 + TypeScript |
| Backend | FastAPI + Python 3.12 |
| Embedding Model | intfloat/multilingual-e5-large |
| Vector DB | Chroma |
| LLM | Google Gemini 2.0 Flash |
| Deployment | Vercel (Frontend) + Hugging Face Spaces (Backend) |
System Architecture
User Query
│
▼
┌─────────────┐
│ Next.js │ Frontend (SSE streaming support)
└─────────────┘
│
▼
┌─────────────┐
│ FastAPI │ Backend
└─────────────┘
│
├─── Embedding ─── multilingual-e5-large
│
├─── Vector Search ─── Chroma DB
│
└─── LLM ─── Gemini 2.0 Flash
│
▼
Streaming Response
User queries are first vectorized by the embedding model, then similar chunks are retrieved from Chroma DB. These chunks are passed as context to the LLM, which generates the answer.
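For concreteness, here is a minimal sketch of that flow, assuming the sentence-transformers and chromadb packages plus the google-generativeai client. The collection name, path, API key, and prompt are illustrative placeholders, not the actual backend code.

```python
# Minimal sketch of the retrieval flow (illustrative; not the production backend).
from sentence_transformers import SentenceTransformer
import chromadb
import google.generativeai as genai

embedder = SentenceTransformer("intfloat/multilingual-e5-large")
client = chromadb.PersistentClient(path="./chroma_db")             # illustrative path
collection = client.get_or_create_collection("company_policies")   # illustrative name
genai.configure(api_key="YOUR_API_KEY")                            # placeholder
llm = genai.GenerativeModel("gemini-2.0-flash")

def answer(query: str, k: int = 4) -> str:
    # e5-family models expect a "query: " prefix on the search side
    query_vec = embedder.encode(f"query: {query}").tolist()
    hits = collection.query(query_embeddings=[query_vec], n_results=k)
    context = "\n\n".join(hits["documents"][0])
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
    return llm.generate_content(prompt).text
```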
Common RAG Failure Patterns
When RAG accuracy is low, the problem is usually not the LLM's capability, but the retrieval step. Here are the specific patterns that caused issues in this project.
1. Implicit Reference Problem
This occurs when terminology in documents doesn't match the terms users actually use.
Example:
- Document: "Employees defined in Article 2-2 shall have a maximum allowance of 20,000 yen per month"
- User query: "What's the commuting allowance limit for part-time workers?"
The fact that "employees defined in Article 2-2" refers to "part-time workers" isn't clear unless you read the entire document. Vector search prioritizes chunks containing "part-time workers," missing this critical exception clause.
2. Multi-hop Reasoning
This happens when answering requires information from multiple sources.
Example:
- User query: "What's the difference between full-time and part-time employee wedding bonuses?"
To answer this, you need both the full-time bonus (50,000 yen) and the part-time bonus (10,000 yen). If these are in different parts of the document, only one may be retrieved, making it impossible to calculate the difference.
3. Buried Exception Clauses
Japanese company regulations typically follow this structure:
Articles 1-10: General rules (for all employees, high keyword density)
↓
Articles 11+: Special provisions (for specific employee types, lower keyword density)
↓
Supplementary Provisions: Additional rules (exception cases, lowest keyword density)
Vector search is influenced by keyword frequency, so general rules are retrieved while exception clauses get buried.
4. Negative/Exclusion Queries
Queries with negations like "cannot" or "not eligible for" are challenging for vector search.
Example:
- "What benefits are part-time workers not eligible for?"
Vector search cannot directly handle logical operations (NOT, AND, OR), making it difficult to retrieve appropriate information for such queries.
Designing a Dataset That Intentionally Fails
To validate RAG improvements, I created a dataset that intentionally includes the problems described above.
Design Principles
- Mimicked actual Japanese company regulation structure
  - General rules → Special provisions → Supplementary provisions pattern
  - 6 documents total (commuting allowance, leave, expenses, remote work, conduct, benefits)
- Multiple terms for the same concept (alias problem)

  | User's term | Document terms |
  |-------------|----------------|
  | Part-timer | Short-term employee, Person defined in Article 2-2 |
  | Part-time | Part-time employee, Person working 4+ days per week |

- Exception clauses intentionally placed far from general rules
Concrete Example: Commuting Allowance Policy
# Commuting Allowance Policy
## Article 3 (Payment of Commuting Allowance)
The maximum commuting allowance shall be 50,000 yen per month.
← General rule (keyword "commuting allowance" appears frequently)
...(about 50 lines later)...
## Article 12 (Special Provisions for Payment)
For persons defined in Article 2-2, the maximum shall be 20,000 yen per month,
and commuter passes shall not be provided; payment shall be prorated based on actual work days.
← Exception clause (no "part-timer" keyword)
When a user asks "What's the commuting allowance limit for part-timers?":
- Vector search retrieves chunks around Article 3 containing "commuting allowance"
- The exception in Article 12 (actual answer: 20,000 yen) is not retrieved
- LLM gives the wrong answer: "50,000 yen"
Evaluation Queries
I created 15 evaluation queries, each with "required keywords" and "prohibited keywords."
{
"question": "What is the maximum commuting allowance for part-timers?",
"required_keywords": ["20,000 yen"],
"prohibited_keywords": ["50,000 yen"]
}
If "50,000 yen" (full-time limit) appears in the answer, it's incorrect. If "20,000 yen" (part-time limit) appears, it's correct. This enables automated evaluation.
Chunking Strategies Tested
1. Standard Chunking (Baseline)
The most basic chunking strategy.
# Configuration
chunk_size = 1000 # characters
overlap = 200 # characters
Result: 73.3% (11/15 correct)
With 1000-character chunks, general rules and exception clauses ended up in separate chunks, causing failures on exception-related questions.
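For reference, here is a hand-rolled sketch of this kind of fixed-size splitter. The project may well use a library splitter; this version just makes the chunk_size/overlap mechanics concrete.

```python
# Illustrative fixed-size chunker with character overlap.
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    chunks = []
    start = 0
    step = chunk_size - overlap  # each new chunk starts 800 characters after the previous one
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks
```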
2. Large Chunking
A simple improvement by increasing chunk size.
# Configuration
chunk_size = 2000 # characters
overlap = 500 # characters
Result: 100% (15/15 correct)
By using 2000-character chunks, general rules and exception clauses more often ended up in the same chunk, achieving perfect accuracy.
3. Parent-Child Chunking
A strategy that separates small chunks for search (children) from large chunks for LLM context (parents).
# Configuration
child_chunk_size = 400 # for search
parent_chunk_size = 2000 # for LLM context
How it works:
- Perform vector search with small child chunks (400 chars)
- Pass the parent chunk (2000 chars) associated with the hit to the LLM
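A minimal sketch of the retrieval side, assuming each 400-character child chunk was indexed with a `parent_id` metadata field and that the 2000-character parents live in a plain dict. Both are assumptions about the index layout, not the actual code.

```python
# Illustrative parent-child retrieval: search over small child chunks, return their parents.
def retrieve_parents(collection, parent_store: dict, query_vec: list[float], k: int = 4) -> list[str]:
    hits = collection.query(query_embeddings=[query_vec], n_results=k)
    parent_ids = []
    for meta in hits["metadatas"][0]:
        pid = meta["parent_id"]
        if pid not in parent_ids:   # deduplicate: sibling children share the same parent
            parent_ids.append(pid)
    return [parent_store[pid] for pid in parent_ids]  # parent texts are passed to the LLM as context
```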
Result: 93.3% (14/15 correct)
This enabled more detailed and comprehensive answers, but failed on one comparison query ("difference between full-time and part-time"). This was because the two amounts were in different parent chunks.
4. Hypothetical Questions
A strategy that pre-generates expected questions for each chunk using an LLM.
# During index creation: ask the LLM for questions each chunk could answer
for chunk in chunks:
    prompt = f"Generate 3 questions a user might ask about this content:\n{chunk}"
    questions = llm.generate_content(prompt).text.splitlines()  # simplistic parsing of the LLM output
    # Vectorize and index the generated questions (see the sketch below)
Example:
- Original chunk: "For persons defined in Article 2-2, the maximum shall be 20,000 yen per month"
- Generated question: "What is the maximum commuting allowance for part-timers?"
Result: 93.3% (14/15 correct)
This was effective for solving the alias problem ("persons defined in Article 2-2" ≈ "part-timers"), but the downside is increased LLM API costs.
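For completeness, here is a hedged sketch of the indexing side, reusing the `embedder` and `collection` from the pipeline sketch above. The idea is that a hit on a generated question returns the original chunk as context; `questions_per_chunk`, the ID scheme, and the metadata field are illustrative assumptions.

```python
# Illustrative indexing of the generated questions ("passage: " is the e5 document-side prefix).
for i, (chunk, questions) in enumerate(zip(chunks, questions_per_chunk)):
    collection.add(
        ids=[f"chunk{i}-q{j}" for j in range(len(questions))],
        embeddings=[embedder.encode(f"passage: {q}").tolist() for q in questions],
        documents=[chunk] * len(questions),   # the context returned is the source chunk, not the question
        metadatas=[{"source_chunk": i}] * len(questions),
    )
```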
5. Re-ranking (Surprising Result)
A strategy that re-ranks initial search results using a Cross-Encoder.
# Configuration
initial_k = 10 # retrieve 10 in initial search
final_k = 4 # narrow to top 4 with Cross-Encoder
model = "cross-encoder/ms-marco-MiniLM-L-6-v2"
Result: 60.0% (9/15 correct) ← Worse than Standard alone
This was unexpected. Adding Re-ranking actually decreased accuracy.
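For reference, here is roughly what the re-ranking step looks like with sentence-transformers' CrossEncoder (a minimal sketch; variable names are illustrative). Note that it can only reorder candidates the initial search already returned.

```python
# Illustrative re-ranking: score (query, chunk) pairs with a Cross-Encoder, keep the top final_k.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], final_k: int = 4) -> list[str]:
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:final_k]]
```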
Analysis of Results
Results Summary
| Strategy | Accuracy | Implementation Complexity |
|---|---|---|
| Standard (1000/200) | 73.3% | Low |
| Large (2000/500) | 100% | Low |
| Parent-Child | 93.3% | Medium |
| Hypothetical Questions | 93.3% | High |
| Standard + Re-ranking | 60.0% | Medium |
Why Large Chunk Was the Winner
In this dataset, exception clauses were located within 300-500 characters of general rules. With 2000-character chunks, both could often be included in the same chunk.
This is a "simple solution wins" result, but there are caveats:
- This result is dataset-dependent
- If exception clauses are more than 1000 characters away, Large Chunk will also fail
- Chunks that are too large may reduce search precision
Why Re-ranking Failed
Re-ranking decreased accuracy because of the difference between Precision and Recall.
Re-ranking's role:
Select the most relevant 4 from 10 retrieved results
→ A tool for improving Precision
The actual problem:
Exception clauses weren't in the 10 results to begin with
→ A Recall problem
Conclusion:
Information not retrieved in the initial search
cannot be rescued by Re-ranking
Re-ranking is a tool for "removing noise," not for "finding missed information." Since the core problem was retrieval misses (insufficient Recall), Re-ranking had no effect.
In fact, relevant chunks that barely made it into the lower ranks of the initial search were likely eliminated by Re-ranking, causing accuracy to decrease.
Future Improvements
Improvements I Want to Implement
1. Header-based Chunking
Instead of fixed-length chunks, split by Markdown headings (##, ###).
## Article 3 (Payment of Commuting Allowance)
...content...
## Article 4 (Application Method) ← Split here
...content...
This creates chunks that preserve semantic coherence.
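A minimal sketch of what this could look like with a regular expression. A library splitter such as LangChain's MarkdownHeaderTextSplitter would also work; this hand-rolled version just shows the idea.

```python
import re

# Illustrative heading-based splitter: each "## ..." / "### ..." section becomes one chunk.
def split_by_headings(markdown_text: str) -> list[str]:
    # Split immediately before level-2/3 headings, keeping each heading together with its body.
    sections = re.split(r"\n(?=#{2,3} )", markdown_text)
    return [s.strip() for s in sections if s.strip()]
```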
2. Query Expansion
Convert user queries to match document terminology before searching.
User: "What's the commuting allowance for part-timers?"
↓ Expand with LLM
Expanded: "part-timer short-term employee Article 2-2 commuting allowance transportation"
3. Increase Retrieval Count
Simply increasing k raises the probability of including exception clauses.
# Before
k = 4
# After
k = 10
However, increasing k too much lengthens the context and increases LLM processing costs.
Data Quality Improvements
There are limits to improving chunking strategies. A more fundamental solution is to preprocess the documents themselves.
Before: commuting_allowance_policy.md (for all employees, exceptions scattered)
After:
- commuting_allowance_policy_fulltime.md
- commuting_allowance_policy_parttime.md
- commuting_allowance_policy_temporary.md
By splitting documents by employee type and rewriting from each perspective, we can fundamentally eliminate the exception clause problem. This approach achieved 93.3% accuracy.
Things I Want to Test
- Semantic Chunking: Automatically detecting meaning boundaries for chunking
- GraphRAG: RAG using knowledge graphs
- Fine-tuned Embedding Models: Domain-specific embedding models
Conclusion
This article presented validation results for RAG accuracy improvement, measured on a dataset designed to reproduce realistic retrieval failures in Japanese company regulations.
Key Findings
- The simplest solution was the most effective
  - Large Chunk (2000 chars) achieved 100%
  - More effective than complex Parent-Child or Hypothetical Questions
- Re-ranking is not a silver bullet
  - Has no effect on Recall (retrieval coverage) problems
  - Can actually decrease accuracy in some cases
- Data quality matters most
  - There are limits to chunking strategy improvements
  - Document preprocessing can be a fundamental solution
Key Takeaway for RAG Accuracy Improvement
"The key to RAG accuracy improvement is not LLM optimization, but data quality and retrieval accuracy."
First, analyze what kinds of retrieval failures are occurring with your dataset. Then, start with the simplest solutions.
References
Technical Stack Details
- Embedding Model: intfloat/multilingual-e5-large
- Re-ranking: cross-encoder/ms-marco-MiniLM-L-6-v2
- Vector DB: Chroma
- LLM: Gemini 2.0 Flash