Started with a simple RAG for MCP—the kind of thing you build in a weekend. Ended up implementing semantic chunking (Max-Min algorithm) and rethinking hybrid search entirely. This article is written for people who have already built RAG systems and started hitting quality limits. If you've hit walls with fixed-size chunks and top-K retrieval, this might be useful.
Table of Contents
- Context: RAG for Agentic Coding
- The Invisible Problem: What Does the LLM Actually Receive?
- Semantic Chunking: Why Fixed Chunks Break Down
- When Semantic Chunks Broke Hybrid Search
- Results: What Actually Changed
- Architecture Summary
- The Other Side: Query Quality
- Tradeoffs and Limitations
- Conclusion
1. Context: RAG for Agentic Coding
Problem statement
The request was straightforward: load domain knowledge from PDFs for a specialized agent. Framework best practices, project principles (rules), and specifications (PRDs)—the kind of documents you'd want an AI coding assistant to reference while working.
The constraints made it interesting:
- Personal use → No external APIs, privacy matters
- MCP ecosystem → Integration with Cursor, Claude Code, Codex
- "Agentic Coding support" as the use case
Initial implementation
The first version was textbook RAG:
Document → Fixed-size chunks (500 chars) → Embeddings → LanceDB
Query → Vector search → Top-K results → LLM
Standard fixed-size chunking. Vector search with top-K retrieval. Local embedding model via Transformers.js. LanceDB for vector storage—file-based, no server process required.
It worked... sort of.
2. The Invisible Problem: What Does the LLM Actually Receive?
Discovery
Here's the thing about MCP: search results go directly to the LLM. The user never sees them.
User → LLM → MCP(RAG) → LLM → Response
                      ↑
            Results hidden from user
When the RAG returns garbage, you don't see it. You just notice the LLM behaving strangely—making additional searches, reading files directly, or giving incomplete answers.
To debug this, I forced the LLM to output the raw JSON search results. The prompt was simple: "Show me the exact JSON you received from the RAG search."
What I found: lots of irrelevant chunks polluting the context. Page markers, decoration lines, fragments cut mid-sentence.
Why top-K fails
The standard approach is "return the top 10 closest vectors." But closeness in vector space doesn't equal usefulness.
- Increasing K just adds more noise
- No quality signal—just "top 10 closest vectors"
- A chunk with distance 0.1 and another with distance 0.9 both make the cut if they're in the top K
First fix: Quality filtering
Three mechanisms, each addressing a different problem:
1. Distance-based threshold (RAG_MAX_DISTANCE)
// src/vectordb/index.ts
if (this.config.maxDistance !== undefined) {
  query = query.distanceRange(undefined, this.config.maxDistance)
}
Only return results below a certain distance. If nothing is close enough, return nothing—better than returning garbage.
2. Relevance gap grouping (RAG_GROUPING)
Instead of arbitrary K, detect natural "quality groups" in the results:
// src/vectordb/index.ts
// Calculate statistical threshold: mean + 1.5 * std
const threshold = mean + GROUPING_BOUNDARY_STD_MULTIPLIER * std
// Find significant gaps (group boundaries)
const boundaries = gaps.filter((g) => g.gap > threshold)
// 'similar' mode: first group only
// 'related' mode: top 2 groups
Results cluster naturally—there's usually a gap between "highly relevant" and "somewhat related." This detects that gap statistically.
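For reference, here is a self-contained sketch of that grouping step. It assumes results arrive sorted by ascending distance; GROUPING_BOUNDARY_STD_MULTIPLIER matches the snippet above, while the SearchResult shape and the function body are illustrative rather than the exact implementation.
// Illustrative sketch of relevance-gap grouping
interface SearchResult { text: string; distance: number }

const GROUPING_BOUNDARY_STD_MULTIPLIER = 1.5

function applyGrouping(results: SearchResult[], mode: 'similar' | 'related'): SearchResult[] {
  if (results.length < 2) return results

  // Gaps between consecutive distances (results sorted by ascending distance)
  const gaps = results.slice(1).map((r, i) => ({ index: i + 1, gap: r.distance - results[i].distance }))

  // Statistical threshold: mean + 1.5 * std of the gaps
  const mean = gaps.reduce((sum, g) => sum + g.gap, 0) / gaps.length
  const variance = gaps.reduce((sum, g) => sum + (g.gap - mean) ** 2, 0) / gaps.length
  const threshold = mean + GROUPING_BOUNDARY_STD_MULTIPLIER * Math.sqrt(variance)

  // Boundaries: positions where the gap is significantly larger than usual
  const boundaries = gaps.filter((g) => g.gap > threshold).map((g) => g.index)
  if (boundaries.length === 0) return results

  // 'similar': keep the first group only; 'related': keep the top 2 groups
  const cutoff = mode === 'similar' ? boundaries[0] : (boundaries[1] ?? results.length)
  return results.slice(0, cutoff)
}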
3. Garbage chunk removal
// src/chunker/semantic-chunker.ts
export function isGarbageChunk(text: string): boolean {
  const trimmed = text.trim()
  if (trimmed.length === 0) return true
  // Decoration line patterns (----, ====, ****, etc.)
  if (/^[\-=_.*#|~`@!%^&*()\[\]{}\\/<>:+\s]+$/.test(trimmed)) return true
  // Excessive repetition of a single character (>80%)
  const charCounts = new Map<string, number>()
  for (const ch of trimmed) charCounts.set(ch, (charCounts.get(ch) ?? 0) + 1)
  const maxCount = Math.max(...charCounts.values())
  if (maxCount / trimmed.length > 0.8) return true
  return false
}
Page markers, separator lines, repeated characters—filter them before they ever reach the index.
New problem emerged
Technical terms like useEffect or ERR_CONNECTION_REFUSED were getting filtered out. They're semantically distant from natural language queries but keyword-relevant.
The fix: hybrid search (semantic + keyword blend). But implementing it properly required rethinking the chunking strategy first.
3. Semantic Chunking: Why Fixed Chunks Break Down
Trigger
I read about "semantic center of gravity" in chunks—the idea that a chunk should have a coherent meaning, not just a coherent length.
Then I observed the LLM's behavior: after RAG search, it would often search again with different terms, or just read the file directly. The chunks weren't trustworthy—they lacked sufficient context for the LLM to act on them.
The waste
If a chunk doesn't contain enough meaning:
- LLM makes additional tool calls to compensate
- Context gets polluted with redundant searches
- Latency increases
- Tokens get wasted
The LLM was doing work that good chunking should prevent.
Solution: Max-Min Algorithm
The Max-Min semantic chunking paper (Kiss et al., Springer 2025) provided the foundation. This implementation is a pragmatic adaptation of the Max–Min idea, not a faithful reproduction of the paper's algorithm.
The core idea: group consecutive sentences based on semantic similarity, not character count.
// src/chunker/semantic-chunker.ts
// Should we add this sentence to the current chunk?
private shouldAddToChunk(maxSim: number, threshold: number): boolean {
  return maxSim > threshold
}

// Dynamic threshold based on chunk coherence
private calculateThreshold(minSim: number, chunkSize: number): number {
  // threshold = max(c * minSim * sigmoid(|C|), hardThreshold)
  const sigmoid = 1 / (1 + Math.exp(-chunkSize))
  return Math.max(this.config.c * minSim * sigmoid, this.config.hardThreshold)
}
The algorithm:
- Split text into sentences
- Generate embeddings for all sentences
- For each sentence, decide: add to current chunk or start new?
- Decision based on comparing max similarity with new sentence vs. min similarity within chunk
When the new sentence's similarity drops below the threshold, it signals a topic boundary.
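Putting those pieces together, here is a condensed sketch of the loop as a method on the chunker class, reusing shouldAddToChunk and calculateThreshold from above. embed() and cosineSimilarity() are assumed helpers, and the real implementation also applies the window and sentence cap described under Performance tuning below.
// Illustrative sketch, not the actual implementation
private async chunkSentences(sentences: string[]): Promise<string[]> {
  const embeddings = await Promise.all(sentences.map((s) => this.embed(s)))
  const chunks: string[] = []
  let current = [0] // indices of sentences in the current chunk

  for (let i = 1; i < sentences.length; i++) {
    // Max similarity between the candidate sentence and the current chunk
    const maxSim = Math.max(...current.map((j) => cosineSimilarity(embeddings[i], embeddings[j])))

    // Min similarity within the current chunk measures its internal coherence
    let minSim = 1
    for (let a = 0; a < current.length; a++) {
      for (let b = a + 1; b < current.length; b++) {
        minSim = Math.min(minSim, cosineSimilarity(embeddings[current[a]], embeddings[current[b]]))
      }
    }

    const threshold = this.calculateThreshold(minSim, current.length)
    if (this.shouldAddToChunk(maxSim, threshold)) {
      current.push(i)
    } else {
      // Similarity dropped below the threshold: topic boundary, start a new chunk
      chunks.push(current.map((j) => sentences[j]).join(' '))
      current = [i]
    }
  }
  chunks.push(current.map((j) => sentences[j]).join(' '))
  return chunks
}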
Implementation details
Sentence detection: Intl.Segmenter
// src/chunker/sentence-splitter.ts
const segmenter = new Intl.Segmenter('und', { granularity: 'sentence' })
No external dependencies. Multilingual support via the Unicode text segmentation standard (UAX #29). The 'und' (undetermined) locale applies the default, language-agnostic segmentation rules.
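Usage is nearly a one-liner; this sketch uses only the standard ECMA-402 API, nothing project-specific:
const segmenter = new Intl.Segmenter('und', { granularity: 'sentence' })

// Split text into trimmed, non-empty sentences
function splitSentences(text: string): string[] {
  return Array.from(segmenter.segment(text), (s) => s.segment.trim()).filter((s) => s.length > 0)
}

splitSentences('Fixed chunks break mid-sentence. Semantic chunks should not! Right?')
// → ['Fixed chunks break mid-sentence.', 'Semantic chunks should not!', 'Right?']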
Code block preservation
// src/chunker/sentence-splitter.ts
const CODE_BLOCK_PLACEHOLDER = '\u0000CODE_BLOCK\u0000'
// Extract before sentence splitting
const codeBlockRegex = /```[\s\S]*?```/g
// ... replace with placeholders ...
// Restore after chunking
Markdown code blocks stay intact—never split mid-block. Critical for technical documentation where copy-pastable code is the point.
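A sketch of the placeholder round-trip; the two helper functions are illustrative, and only the placeholder constant and the regex come from the snippet above. Restoration assumes the chunks preserve document order.
const CODE_BLOCK_PLACEHOLDER = '\u0000CODE_BLOCK\u0000'
const codeBlockRegex = /```[\s\S]*?```/g

// Swap each fenced block for a placeholder before sentence splitting
function extractCodeBlocks(text: string): { cleaned: string; blocks: string[] } {
  const blocks: string[] = []
  const cleaned = text.replace(codeBlockRegex, (block) => {
    blocks.push(block)
    return CODE_BLOCK_PLACEHOLDER
  })
  return { cleaned, blocks }
}

// Swap placeholders back in after chunk boundaries are decided
function restoreCodeBlocks(chunks: string[], blocks: string[]): string[] {
  let i = 0
  return chunks.map((chunk) => chunk.replaceAll(CODE_BLOCK_PLACEHOLDER, () => blocks[i++] ?? ''))
}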
Performance tuning
The paper uses O(k²) comparisons within each chunk. For long homogeneous documents, this explodes.
// src/chunker/semantic-chunker.ts
const WINDOW_SIZE = 5 // Compare only recent 5 sentences: O(k²) → O(25)
const MAX_SENTENCES = 15 // Force split at 15 sentences (3x paper's median)
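As a sketch of what the window buys: the within-chunk coherence check only looks at the last WINDOW_SIZE sentences, so the pairwise comparisons are capped at 25 regardless of chunk length (cosineSimilarity is an assumed helper).
// Illustrative: cap the O(k²) within-chunk comparison at WINDOW_SIZE² = 25
function windowedMinSimilarity(chunkEmbeddings: number[][]): number {
  const window = chunkEmbeddings.slice(-WINDOW_SIZE)
  let minSim = 1
  for (let a = 0; a < window.length; a++) {
    for (let b = a + 1; b < window.length; b++) {
      minSim = Math.min(minSim, cosineSimilarity(window[a], window[b]))
    }
  }
  return minSim
}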
PDF parsing: pdfjs-dist
Switched from pdf-parse to pdfjs-dist for access to position information (x, y coordinates, font size). This enables header/footer detection: variable content like "Page 7 of 75", which pdf-parse would pass through as regular text, can be stripped before it pollutes the index.
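A rough sketch of what position-based filtering looks like with pdfjs-dist. The import path varies by pdfjs-dist version and runtime, and the 8% margin bands are an assumption, not the project's actual heuristic.
import { getDocument } from 'pdfjs-dist'

// Drop text items that sit in the top/bottom margin bands of each page,
// where running headers and "Page 7 of 75" footers usually live.
async function extractBodyText(data: Uint8Array, marginRatio = 0.08): Promise<string> {
  const doc = await getDocument({ data }).promise
  const pages: string[] = []

  for (let p = 1; p <= doc.numPages; p++) {
    const page = await doc.getPage(p)
    const { height } = page.getViewport({ scale: 1 })
    const content = await page.getTextContent()

    const body = content.items
      .flatMap((item) => ('str' in item ? [item] : []))
      // transform[5] is the item's y position (origin at the bottom-left of the page)
      .filter((item) => {
        const y = item.transform[5]
        return y > height * marginRatio && y < height * (1 - marginRatio)
      })
      .map((item) => item.str)

    pages.push(body.join(' '))
  }
  return pages.join('\n\n')
}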
4. When Semantic Chunks Broke Hybrid Search
The problem
Semantic chunks are richer—more content per chunk, more coherent meaning. But this broke the original keyword matching.
The issue: scores became unreliable. A keyword match in a dense, high-quality chunk meant something different than a match in a sparse, fragmented one.
Attempted: RRF (Reciprocal Rank Fusion)
RRF is the standard approach for merging BM25 and vector results:
RRF_score = Σ 1/(k + rank_i)
Combine rankings by position, not by score. Elegant, widely used, no tuning required.
But there's a fundamental problem: distance information is lost.
Original distances: 0.1, 0.2, 0.9 → Ranks: 1, 2, 3
Original distances: 0.1, 0.15, 0.18 → Ranks: 1, 2, 3
# Same ranks, completely different quality gaps
RRF outputs ranks, not distances. Our quality filters—distance threshold, relevance gap grouping—need actual distances to work.
As noted in Microsoft's hybrid search documentation: "RRF aggregates rankings rather than scores." This is by design—it avoids the problem of incompatible score scales. But it means downstream quality filtering can't distinguish "barely made top-10" from "clearly the best match."
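A toy computation makes the loss concrete (standard RRF contribution with k = 60, nothing project-specific):
// RRF keeps only the rank: contribution = 1 / (k + rank)
const rrfContribution = (rank: number, k = 60) => 1 / (k + rank)

// Two very different result sets (ascending vector distances)...
const tightCluster = [0.1, 0.15, 0.18] // three strong matches
const oneGoodHit = [0.1, 0.2, 0.9]     // only the first is strong

// ...produce identical contributions, because only the rank survives
console.log(tightCluster.map((_, i) => rrfContribution(i + 1))) // ≈ [0.0164, 0.0161, 0.0159]
console.log(oneGoodHit.map((_, i) => rrfContribution(i + 1)))   // ≈ [0.0164, 0.0161, 0.0159]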
Solution: Semantic-first with keyword boost
Keep vector search as the primary signal. Use keywords to adjust distances, not replace them.
// src/vectordb/index.ts
// Multiplicative boost: distance / (1 + keyword_score * weight)
const boostedDistance = result.score / (1 + keywordScore * weight)
The formula:
- No keyword match (score = 0): distance / 1 → distance unchanged
- Perfect match with weight = 0.6: distance / 1.6 → reduced by 37.5%
- Perfect match with weight = 1.0: distance / 2 → halved
This preserves the distance for quality filtering while boosting exact matches.
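A sketch of the boost step as a standalone function; the result shape and the assumption that FTS scores are pre-normalized to [0, 1] are illustrative.
interface RankedResult { id: string; text: string; distance: number }

// Multiplicative boost: divide the vector distance by (1 + keywordScore * weight).
// No keyword hit means the distance passes through unchanged.
function applyKeywordBoost(
  vectorResults: RankedResult[],
  ftsScores: Map<string, number>, // id → keyword score normalized to [0, 1]
  weight: number,                 // e.g. RAG_HYBRID_WEIGHT = 0.6
): RankedResult[] {
  return vectorResults
    .map((r) => ({ ...r, distance: r.distance / (1 + (ftsScores.get(r.id) ?? 0) * weight) }))
    .sort((a, b) => a.distance - b.distance) // smaller boosted distance ranks higher
}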
Architecture
// src/vectordb/index.ts
// 1. Vector search with 2x candidate pool
const candidateLimit = limit * HYBRID_SEARCH_CANDIDATE_MULTIPLIER
// 2. Apply distance filter
if (this.config.maxDistance !== undefined) {
  query = query.distanceRange(undefined, this.config.maxDistance)
}

// 3. Apply grouping
if (this.config.grouping && results.length > 1) {
  results = this.applyGrouping(results, this.config.grouping)
}

// 4. Keyword boost via FTS
const ftsResults = await this.table
  .search(queryText, 'fts', 'text')
// ...
results = this.applyKeywordBoost(results, ftsResults, hybridWeight)
Quality filters apply to meaningful vector distances. Keyword matching acts as a boost, not a replacement.
Multilingual challenge
Japanese keyword matching broke with richer chunks. The default tokenizer couldn't handle CJK characters properly.
Solution: LanceDB FTS with n-gram indexing.
// src/vectordb/index.ts
await this.table.createIndex('text', {
  config: Index.fts({
    baseTokenizer: 'ngram',
    ngramMinLength: 2,  // Capture Japanese bi-grams (東京, 設計)
    ngramMaxLength: 3,  // Balance precision vs index size
    prefixOnly: false,  // All positions for proper CJK support
    stem: false,        // Preserve exact terms
  }),
})
N-grams at min=2, max=3 capture both English terms and Japanese compound words without language-specific tokenization.
5. Results: What Actually Changed
Observed behavior (real usage)
My setup: framework best practices (official PDFs), project principles (rules), specifications (PRDs) stored in RAG. Before each task, the agent analyzes requirements and searches RAG for relevant context.
Before (fixed chunks + top-K):
- Agent couldn't find relevant information on first search
- Multiple search attempts with different query formulations
- Eventually gave up and read rule files directly
- PDFs were too large to read, so that context was effectively lost
After (semantic chunks + boost + filtering):
- Single search usually provides sufficient context
- Additional searches happen for depth, not compensation
- Agent stopped reading files directly—RAG results were trustworthy
LLM evaluation (before/after comparison)
I had an LLM evaluate search results with project context—not a formal LLM-as-Judge setup, but structured comparison.
Old version:
- Garbage chunks (outliers) and fragmented information in ~2/10 results for some queries
- Results required additional verification
Updated version:
- No garbage chunks
- 8/10 results directly relevant to the query
- 2/10 results tangentially related (still useful context)
- Evaluator noted: "Search results alone provide necessary and sufficient information"
Examining the raw JSON confirmed the qualitative assessment—chunks contained coherent, dense information rather than fragments.
No benchmarks
This is qualitative observation from real usage, not controlled experiments. But the behavioral change is clear: the LLM stopped compensating for bad RAG results.
6. Architecture Summary
Document → Semantic Chunking (Max-Min) → Embeddings → LanceDB
Query → Vector Search → Distance Filter → Grouping → Keyword Boost → Results
Key decisions
| Choice | Reason |
|---|---|
| Semantic chunking over fixed | Meaning-preserving units reduce LLM compensation |
| Keyword boost over RRF | Preserves distance for quality filtering |
| Distance-based grouping | Quality signal, not arbitrary K |
| N-gram FTS | Multilingual support without tokenizer complexity |
| Local-only | Privacy, cost, offline capability |
Configuration
# Environment variables
RAG_HYBRID_WEIGHT=0.6 # Keyword boost factor (0=semantic, 1=BM25-dominant)
RAG_GROUPING=related # 'similar' (top group) or 'related' (top 2 groups)
RAG_MAX_DISTANCE=0.5 # Filter low-relevance results
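One way these might be read into the search config at startup (a sketch; parsing and fallbacks are assumptions, only the variable names come from above):
const searchConfig = {
  hybridWeight: Number(process.env.RAG_HYBRID_WEIGHT ?? '0.6'),
  grouping: process.env.RAG_GROUPING as 'similar' | 'related' | undefined,
  maxDistance: process.env.RAG_MAX_DISTANCE ? Number(process.env.RAG_MAX_DISTANCE) : undefined,
}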
7. The Other Side: Query Quality
RAG accuracy depends on two things:
- Search quality (what we've discussed)
- Query quality (what the LLM sends)
MCP's dual invisibility
User → LLM → MCP(RAG) → LLM → Response
           ↑          ↑
     Query hidden   Results hidden
Even perfect RAG fails with bad queries. And users can't see either side.
Solution: Agent Skills
Agent Skills is an open format for extending AI agent capabilities with specialized knowledge. Skills are portable, version-controlled packages of procedural knowledge that agents load on-demand.
For this RAG, skills teach the LLM:
Query formulation
# Query patterns by intent
| Intent | Pattern |
|--------|---------|
| Definition/Concept | "[term] definition concept" |
| How-To/Procedure | "[action] steps example usage" |
| API/Function | "[function] API arguments return" |
| Troubleshooting | "[error] fix solution cause" |
Score interpretation
# Score thresholds
< 0.3 : Use directly (high confidence)
0.3-0.5: Include if mentions same concept/entity
> 0.5 : Skip unless no better results
Skills can be installed via the mcp-local-rag-skills CLI.
This completes the optimization loop:
- RAG side: semantic chunks + distance filters + keyword boost
- LLM side: query formulation + result interpretation
Both sides matter. Optimizing only one leaves performance on the table.
8. Tradeoffs and Limitations
What this approach gives up
- BM25-only hits don't surface: Must appear in semantic results first to get boosted
- No reranker: Would improve accuracy but adds complexity/latency
- No formal benchmarks: Qualitative evaluation only
Where heavier approaches win
- RRF + Reranker: Broader candidate pool, reranker compensates for RRF's rank-only output
- LLM-as-reranker: Best accuracy, but slow and expensive
Position on the spectrum
Light & Fast ←————————————————————→ Heavy & Accurate
semantic-only
  └─ semantic + boost (here)
      └─ RRF + Cross-Encoder
          └─ RRF + LLM Rerank
The goal was: maximum quality within zero-setup, local-only constraints.
9. Conclusion
- Standard RAG (fixed chunks + top-K) breaks down for agentic coding use cases
- Semantic chunking + quality filtering + keyword boost is a viable middle ground
- RRF looks elegant but loses distance information critical for filtering
- Query quality matters as much as search quality—Agent Skills address this
- The real test: does the LLM stop making compensatory tool calls?
Code: github.com/shinpr/mcp-local-rag
References
- Kiss, C., Nagy, M. & Szilágyi, P. (2025). Max–Min semantic chunking of documents for RAG application. Discover Computing 28, 117. https://doi.org/10.1007/s10791-025-09638-7
- LanceDB Full-Text Search: https://lancedb.github.io/lancedb/fts/
- MCP Specification: https://modelcontextprotocol.io
- Agent Skills: https://agentskills.io
- Reciprocal Rank Fusion (OpenSearch): https://opensearch.org/blog/introducing-reciprocal-rank-fusion-hybrid-search/
- Hybrid Search Scoring (Microsoft): https://learn.microsoft.com/en-us/azure/search/hybrid-search-ranking