Introduction
When building RAG (Retrieval-Augmented Generation) systems, you often hit a wall where accuracy just won't improve no matter what you try.
In this article, I'll share how I improved answer accuracy from 73.3% to 100% in a RAG system for internal company policy documents. Among the various chunking strategies tested, surprisingly, the simplest solution turned out to be the most effective.
I also discovered that adding Re-ranking, which I expected to improve accuracy, actually made it worse. I'll explain why this happened.
Project Architecture
Tech Stack
| Layer | Technology |
|---|---|
| Frontend | Next.js 16 + React 19 + TypeScript |
| Backend | FastAPI + Python 3.12 |
| Embedding Model | intfloat/multilingual-e5-large |
| Vector DB | Chroma |
| LLM | Google Gemini 2.0 Flash |
| Deployment | Vercel (Frontend) + Hugging Face Spaces (Backend) |
System Architecture
User Query
│
▼
┌─────────────┐
│ Next.js │ Frontend (SSE streaming support)
└─────────────┘
│
▼
┌─────────────┐
│ FastAPI │ Backend
└─────────────┘
│
├─── Embedding ─── multilingual-e5-large
│
├─── Vector Search ─── Chroma DB
│
└─── LLM ─── Gemini 2.0 Flash
│
▼
Streaming Response
User queries are first vectorized by the embedding model, then similar chunks are retrieved from Chroma DB. These chunks are passed as context to the LLM, which generates the answer.
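For concreteness, here is a minimal sketch of that flow, assuming the sentence-transformers and chromadb packages plus the google-generativeai client. The collection name, path, API key, and prompt are illustrative placeholders, not the actual backend code.

```python
# Minimal sketch of the retrieval flow (illustrative; not the production backend).
from sentence_transformers import SentenceTransformer
import chromadb
import google.generativeai as genai

embedder = SentenceTransformer("intfloat/multilingual-e5-large")
client = chromadb.PersistentClient(path="./chroma_db")             # illustrative path
collection = client.get_or_create_collection("company_policies")   # illustrative name
genai.configure(api_key="YOUR_API_KEY")                            # placeholder
llm = genai.GenerativeModel("gemini-2.0-flash")

def answer(query: str, k: int = 4) -> str:
    # e5-family models expect a "query: " prefix on the search side
    query_vec = embedder.encode(f"query: {query}").tolist()
    hits = collection.query(query_embeddings=[query_vec], n_results=k)
    context = "\n\n".join(hits["documents"][0])
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
    return llm.generate_content(prompt).text
```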
Common RAG Failure Patterns
When RAG accuracy is low, the problem is usually not the LLM's capability, but the retrieval step. Here are the specific patterns that caused issues in this project.
1. Implicit Reference Problem
This occurs when terminology in documents doesn't match the terms users actually use.
Example:
- Document: "Employees defined in Article 2-2 shall have a maximum allowance of 20,000 yen per month"
- User query: "What's the commuting allowance limit for part-time workers?"
The fact that "employees defined in Article 2-2" refers to "part-time workers" isn't clear unless you read the entire document. Vector search prioritizes chunks containing "part-time workers," missing this critical exception clause.
2. Multi-hop Reasoning
This happens when answering requires information from multiple sources.
Example:
- User query: "What's the difference between full-time and part-time employee wedding bonuses?"
To answer this, you need both the full-time bonus (50,000 yen) and the part-time bonus (10,000 yen). If these are in different parts of the document, only one may be retrieved, making it impossible to calculate the difference.
3. Buried Exception Clauses
Japanese company regulations typically follow this structure:
Articles 1-10: General rules (for all employees, high keyword density)
↓
Articles 11+: Special provisions (for specific employee types, lower keyword density)
↓
Supplementary Provisions: Additional rules (exception cases, lowest keyword density)
Vector search is influenced by keyword frequency, so general rules are retrieved while exception clauses get buried.
4. Negative/Exclusion Queries
Queries with negations like "cannot" or "not eligible for" are challenging for vector search.
Example:
- "What benefits are part-time workers not eligible for?"
Vector search cannot directly handle logical operations (NOT, AND, OR), making it difficult to retrieve appropriate information for such queries.
Designing a Dataset That Intentionally Fails
To validate RAG improvements, I created a dataset that intentionally includes the problems described above.
Design Principles
- Mimicked actual Japanese company regulation structure
  - General rules → Special provisions → Supplementary provisions pattern
  - 6 documents total (commuting allowance, leave, expenses, remote work, conduct, benefits)
- Multiple terms for the same concept (alias problem)

  | User's term | Document terms |
  |-------------|----------------|
  | Part-timer | Short-term employee, Person defined in Article 2-2 |
  | Part-time | Part-time employee, Person working 4+ days per week |

- Exception clauses intentionally placed far from general rules
Concrete Example: Commuting Allowance Policy
# Commuting Allowance Policy
## Article 3 (Payment of Commuting Allowance)
The maximum commuting allowance shall be 50,000 yen per month.
← General rule (keyword "commuting allowance" appears frequently)
...(about 50 lines later)...
## Article 12 (Special Provisions for Payment)
For persons defined in Article 2-2, the maximum shall be 20,000 yen per month,
and commuter passes shall not be provided; payment shall be prorated based on actual work days.
← Exception clause (no "part-timer" keyword)
When a user asks "What's the commuting allowance limit for part-timers?":
- Vector search retrieves chunks around Article 3 containing "commuting allowance"
- The exception in Article 12 (actual answer: 20,000 yen) is not retrieved
- LLM gives the wrong answer: "50,000 yen"
Evaluation Queries
I created 15 evaluation queries, each with "required keywords" and "prohibited keywords."
{
"question": "What is the maximum commuting allowance for part-timers?",
"required_keywords": ["20,000 yen"],
"prohibited_keywords": ["50,000 yen"]
}
If "50,000 yen" (full-time limit) appears in the answer, it's incorrect. If "20,000 yen" (part-time limit) appears, it's correct. This enables automated evaluation.
Chunking Strategies Tested
1. Standard Chunking (Baseline)
The most basic chunking strategy.
# Configuration
chunk_size = 1000 # characters
overlap = 200 # characters
Result: 73.3% (11/15 correct)
With 1000-character chunks, general rules and exception clauses ended up in separate chunks, causing failures on exception-related questions.
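For reference, here is a hand-rolled sketch of this kind of fixed-size splitter. The project may well use a library splitter; this version just makes the chunk_size/overlap mechanics concrete.

```python
# Illustrative fixed-size chunker with character overlap.
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    chunks = []
    start = 0
    step = chunk_size - overlap  # each new chunk starts 800 characters after the previous one
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks
```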
2. Large Chunking
A simple improvement by increasing chunk size.
# Configuration
chunk_size = 2000 # characters
overlap = 500 # characters
Result: 100% (15/15 correct)
By using 2000-character chunks, general rules and exception clauses more often ended up in the same chunk, achieving perfect accuracy.
3. Parent-Child Chunking
A strategy that separates small chunks for search (children) from large chunks for LLM context (parents).
# Configuration
child_chunk_size = 400 # for search
parent_chunk_size = 2000 # for LLM context
How it works:
- Perform vector search with small child chunks (400 chars)
- Pass the parent chunk (2000 chars) associated with the hit to the LLM
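A minimal sketch of the retrieval side, assuming each 400-character child chunk was indexed with a `parent_id` metadata field and that the 2000-character parents live in a plain dict. Both are assumptions about the index layout, not the actual code.

```python
# Illustrative parent-child retrieval: search over small child chunks, return their parents.
def retrieve_parents(collection, parent_store: dict, query_vec: list[float], k: int = 4) -> list[str]:
    hits = collection.query(query_embeddings=[query_vec], n_results=k)
    parent_ids = []
    for meta in hits["metadatas"][0]:
        pid = meta["parent_id"]
        if pid not in parent_ids:   # deduplicate: sibling children share the same parent
            parent_ids.append(pid)
    return [parent_store[pid] for pid in parent_ids]  # parent texts are passed to the LLM as context
```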
Result: 93.3% (14/15 correct)
This enabled more detailed and comprehensive answers, but failed on one comparison query ("difference between full-time and part-time"). This was because the two amounts were in different parent chunks.
4. Hypothetical Questions
A strategy that pre-generates expected questions for each chunk using an LLM.
# During index creation: ask the LLM for questions each chunk could answer
for chunk in chunks:
    prompt = f"Generate 3 questions a user might ask about this content:\n{chunk}"
    questions = llm.generate_content(prompt).text.splitlines()  # simplistic parsing of the LLM output
    # Vectorize and index the generated questions (see the sketch below)
Example:
- Original chunk: "For persons defined in Article 2-2, the maximum shall be 20,000 yen per month"
- Generated question: "What is the maximum commuting allowance for part-timers?"
Result: 93.3% (14/15 correct)
This was effective for solving the alias problem ("persons defined in Article 2-2" ≈ "part-timers"), but the downside is increased LLM API costs.
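For completeness, here is a hedged sketch of the indexing side, reusing the `embedder` and `collection` from the pipeline sketch above. The idea is that a hit on a generated question returns the original chunk as context; `questions_per_chunk`, the ID scheme, and the metadata field are illustrative assumptions.

```python
# Illustrative indexing of the generated questions ("passage: " is the e5 document-side prefix).
for i, (chunk, questions) in enumerate(zip(chunks, questions_per_chunk)):
    collection.add(
        ids=[f"chunk{i}-q{j}" for j in range(len(questions))],
        embeddings=[embedder.encode(f"passage: {q}").tolist() for q in questions],
        documents=[chunk] * len(questions),   # the context returned is the source chunk, not the question
        metadatas=[{"source_chunk": i}] * len(questions),
    )
```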
5. Re-ranking (Surprising Result)
A strategy that re-ranks initial search results using a Cross-Encoder.
# Configuration
initial_k = 10 # retrieve 10 in initial search
final_k = 4 # narrow to top 4 with Cross-Encoder
model = "cross-encoder/ms-marco-MiniLM-L-6-v2"
Result: 60.0% (9/15 correct) ← Worse than Standard alone
This was unexpected. Adding Re-ranking actually decreased accuracy.
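For reference, here is roughly what the re-ranking step looks like with sentence-transformers' CrossEncoder (a minimal sketch; variable names are illustrative). Note that it can only reorder candidates the initial search already returned.

```python
# Illustrative re-ranking: score (query, chunk) pairs with a Cross-Encoder, keep the top final_k.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], final_k: int = 4) -> list[str]:
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:final_k]]
```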
Analysis of Results
Results Summary
| Strategy | Accuracy | Implementation Complexity |
|---|---|---|
| Standard (1000/200) | 73.3% | Low |
| Large (2000/500) | 100% | Low |
| Parent-Child | 93.3% | Medium |
| Hypothetical Questions | 93.3% | High |
| Standard + Re-ranking | 60.0% | Medium |
Why Large Chunk Was the Winner
In this dataset, exception clauses were located within 300-500 characters of general rules. With 2000-character chunks, both could often be included in the same chunk.
This is a "simple solution wins" result, but there are caveats:
- This result is dataset-dependent
- If exception clauses are more than 1000 characters away, Large Chunk will also fail
- Chunks that are too large may reduce search precision
Why Re-ranking Failed
Re-ranking decreased accuracy because of the difference between Precision and Recall.
Re-ranking's role:
Select the most relevant 4 from 10 retrieved results
→ A tool for improving Precision
The actual problem:
Exception clauses weren't in the 10 results to begin with
→ A Recall problem
Conclusion:
Information not retrieved in the initial search
cannot be rescued by Re-ranking
Re-ranking is a tool for "removing noise," not for "finding missed information." Since the core problem was retrieval misses (insufficient Recall), Re-ranking had no effect.
In fact, relevant chunks that barely made it into the lower ranks of the initial search were likely eliminated by Re-ranking, causing accuracy to decrease.
Future Improvements
Improvements I Want to Implement
1. Header-based Chunking
Instead of fixed-length chunks, split by Markdown headings (##, ###).
## Article 3 (Payment of Commuting Allowance)
...content...
## Article 4 (Application Method) ← Split here
...content...
This creates chunks that preserve semantic coherence.
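A minimal sketch of what this could look like with a regular expression. A library splitter such as LangChain's MarkdownHeaderTextSplitter would also work; this hand-rolled version just shows the idea.

```python
import re

# Illustrative heading-based splitter: each "## ..." / "### ..." section becomes one chunk.
def split_by_headings(markdown_text: str) -> list[str]:
    # Split immediately before level-2/3 headings, keeping each heading together with its body.
    sections = re.split(r"\n(?=#{2,3} )", markdown_text)
    return [s.strip() for s in sections if s.strip()]
```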
2. Query Expansion
Convert user queries to match document terminology before searching.
User: "What's the commuting allowance for part-timers?"
↓ Expand with LLM
Expanded: "part-timer short-term employee Article 2-2 commuting allowance transportation"
3. Increase Retrieval Count
Simply increasing k raises the probability of including exception clauses.
# Before
k = 4
# After
k = 10
However, increasing k too much lengthens the context and increases LLM processing costs.
Data Quality Improvements
There are limits to improving chunking strategies. A more fundamental solution is to preprocess the documents themselves.
Before: commuting_allowance_policy.md (for all employees, exceptions scattered)
After:
- commuting_allowance_policy_fulltime.md
- commuting_allowance_policy_parttime.md
- commuting_allowance_policy_temporary.md
By splitting documents by employee type and rewriting from each perspective, we can fundamentally eliminate the exception clause problem. This approach achieved 93.3% accuracy.
Things I Want to Test
- Semantic Chunking: Automatically detecting meaning boundaries for chunking
- GraphRAG: RAG using knowledge graphs
- Fine-tuned Embedding Models: Domain-specific embedding models
Conclusion
This article presented validation results for RAG accuracy improvement, measured on a dataset designed to reproduce realistic retrieval failures in Japanese company regulations.
Key Findings
- The simplest solution was the most effective
  - Large Chunk (2000 chars) achieved 100%
  - More effective than complex Parent-Child or Hypothetical Questions
- Re-ranking is not a silver bullet
  - Has no effect on Recall (retrieval coverage) problems
  - Can actually decrease accuracy in some cases
- Data quality matters most
  - There are limits to chunking strategy improvements
  - Document preprocessing can be a fundamental solution
Key Takeaway for RAG Accuracy Improvement
"The key to RAG accuracy improvement is not LLM optimization, but data quality and retrieval accuracy."
First, analyze what kinds of retrieval failures are occurring with your dataset. Then, start with the simplest solutions.
References
Technical Stack Details
- Embedding Model: intfloat/multilingual-e5-large
- Re-ranking: cross-encoder/ms-marco-MiniLM-L-6-v2
- Vector DB: Chroma
- LLM: Gemini 2.0 Flash