Started with a simple RAG for MCP—the kind of thing you build in a weekend. Ended up implementing semantic chunking (Max-Min algorithm) and rethinking hybrid search entirely. This article is written for people who have already built RAG systems and started hitting quality limits. If you've hit walls with fixed-size chunks and top-K retrieval, this might be useful.
Table of Contents
- Context: RAG for Agentic Coding
- The Invisible Problem: What Does the LLM Actually Receive?
- Semantic Chunking: Why Fixed Chunks Break Down
- When Semantic Chunks Broke Hybrid Search
- Results: What Actually Changed
- Architecture Summary
- The Other Side: Query Quality
- Tradeoffs and Limitations
- Conclusion
1. Context: RAG for Agentic Coding
Problem statement
The request was straightforward: load domain knowledge from PDFs for a specialized agent. Framework best practices, project principles (rules), and specifications (PRDs)—the kind of documents you'd want an AI coding assistant to reference while working.
The constraints made it interesting:
- Personal use → No external APIs, privacy matters
- MCP ecosystem → Integration with Cursor, Claude Code, Codex
- "Agentic Coding support" as the use case
Initial implementation
The first version was textbook RAG:
Document → Fixed-size chunks (500 chars) → Embeddings → LanceDB
Query → Vector search → Top-K results → LLM
Standard fixed-size chunking. Vector search with top-K retrieval. Local embedding model via Transformers.js. LanceDB for vector storage—file-based, no server process required.
It worked... sort of.
2. The Invisible Problem: What Does the LLM Actually Receive?
Discovery
Here's the thing about MCP: search results go directly to the LLM. The user never sees them.
User → LLM → MCP(RAG) → LLM → Response
                      ↑
            Results hidden from user
When the RAG returns garbage, you don't see it. You just notice the LLM behaving strangely—making additional searches, reading files directly, or giving incomplete answers.
To debug this, I forced the LLM to output the raw JSON search results. The prompt was simple: "Show me the exact JSON you received from the RAG search."
What I found: lots of irrelevant chunks polluting the context. Page markers, decoration lines, fragments cut mid-sentence.
Why top-K fails
The standard approach is "return the top 10 closest vectors." But closeness in vector space doesn't equal usefulness.
- Increasing K just adds more noise
- No quality signal—just "top 10 closest vectors"
- A chunk with distance 0.1 and another with distance 0.9 both make the cut if they're in the top K
First fix: Quality filtering
Three mechanisms, each addressing a different problem:
1. Distance-based threshold (RAG_MAX_DISTANCE)
// src/vectordb/index.ts
if (this.config.maxDistance !== undefined) {
  query = query.distanceRange(undefined, this.config.maxDistance)
}
Only return results below a certain distance. If nothing is close enough, return nothing—better than returning garbage.
2. Relevance gap grouping (RAG_GROUPING)
Instead of arbitrary K, detect natural "quality groups" in the results:
// src/vectordb/index.ts
// Calculate statistical threshold: mean + 1.5 * std
const threshold = mean + GROUPING_BOUNDARY_STD_MULTIPLIER * std
// Find significant gaps (group boundaries)
const boundaries = gaps.filter((g) => g.gap > threshold)
// 'similar' mode: first group only
// 'related' mode: top 2 groups
Results cluster naturally—there's usually a gap between "highly relevant" and "somewhat related." This detects that gap statistically.
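For reference, here is a self-contained sketch of that grouping step. It assumes results arrive sorted by ascending distance; GROUPING_BOUNDARY_STD_MULTIPLIER matches the snippet above, while the SearchResult shape and the function body are illustrative rather than the exact implementation.
// Illustrative sketch of relevance-gap grouping
interface SearchResult { text: string; distance: number }

const GROUPING_BOUNDARY_STD_MULTIPLIER = 1.5

function applyGrouping(results: SearchResult[], mode: 'similar' | 'related'): SearchResult[] {
  if (results.length < 2) return results

  // Gaps between consecutive distances (results sorted by ascending distance)
  const gaps = results.slice(1).map((r, i) => ({ index: i + 1, gap: r.distance - results[i].distance }))

  // Statistical threshold: mean + 1.5 * std of the gaps
  const mean = gaps.reduce((sum, g) => sum + g.gap, 0) / gaps.length
  const variance = gaps.reduce((sum, g) => sum + (g.gap - mean) ** 2, 0) / gaps.length
  const threshold = mean + GROUPING_BOUNDARY_STD_MULTIPLIER * Math.sqrt(variance)

  // Boundaries: positions where the gap is significantly larger than usual
  const boundaries = gaps.filter((g) => g.gap > threshold).map((g) => g.index)
  if (boundaries.length === 0) return results

  // 'similar': keep the first group only; 'related': keep the top 2 groups
  const cutoff = mode === 'similar' ? boundaries[0] : (boundaries[1] ?? results.length)
  return results.slice(0, cutoff)
}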
3. Garbage chunk removal
// src/chunker/semantic-chunker.ts
export function isGarbageChunk(text: string): boolean {
  const trimmed = text.trim()
  if (trimmed.length === 0) return true
  // Decoration line patterns (----, ====, ****, etc.)
  if (/^[\-=_.*#|~`@!%^&*()\[\]{}\\/<>:+\s]+$/.test(trimmed)) return true
  // Excessive repetition of a single character (>80%)
  const charCounts = new Map<string, number>()
  for (const ch of trimmed) charCounts.set(ch, (charCounts.get(ch) ?? 0) + 1)
  const maxCount = Math.max(...charCounts.values())
  if (maxCount / trimmed.length > 0.8) return true
  return false
}
Page markers, separator lines, repeated characters—filter them before they ever reach the index.
New problem emerged
Technical terms like useEffect or ERR_CONNECTION_REFUSED were getting filtered out. They're semantically distant from natural language queries but keyword-relevant.
The fix: hybrid search (semantic + keyword blend). But implementing it properly required rethinking the chunking strategy first.
3. Semantic Chunking: Why Fixed Chunks Break Down
Trigger
I read about "semantic center of gravity" in chunks—the idea that a chunk should have a coherent meaning, not just a coherent length.
Then I observed the LLM's behavior: after RAG search, it would often search again with different terms, or just read the file directly. The chunks weren't trustworthy—they lacked sufficient context for the LLM to act on them.
The waste
If a chunk doesn't contain enough meaning:
- LLM makes additional tool calls to compensate
- Context gets polluted with redundant searches
- Latency increases
- Tokens get wasted
The LLM was doing work that good chunking should prevent.
Solution: Max-Min Algorithm
The Max-Min semantic chunking paper (Kiss et al., Springer 2025) provided the foundation. This implementation is a pragmatic adaptation of the Max–Min idea, not a faithful reproduction of the paper's algorithm.
The core idea: group consecutive sentences based on semantic similarity, not character count.
// src/chunker/semantic-chunker.ts
// Should we add this sentence to the current chunk?
private shouldAddToChunk(maxSim: number, threshold: number): boolean {
  return maxSim > threshold
}

// Dynamic threshold based on chunk coherence
private calculateThreshold(minSim: number, chunkSize: number): number {
  // threshold = max(c * minSim * sigmoid(|C|), hardThreshold)
  const sigmoid = 1 / (1 + Math.exp(-chunkSize))
  return Math.max(this.config.c * minSim * sigmoid, this.config.hardThreshold)
}
The algorithm:
- Split text into sentences
- Generate embeddings for all sentences
- For each sentence, decide: add to current chunk or start new?
- Decision based on comparing max similarity with new sentence vs. min similarity within chunk
When the new sentence's similarity drops below the threshold, it signals a topic boundary.
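Putting those pieces together, here is a condensed sketch of the loop as a method on the chunker class, reusing shouldAddToChunk and calculateThreshold from above. embed() and cosineSimilarity() are assumed helpers, and the real implementation also applies the window and sentence cap described under Performance tuning below.
// Illustrative sketch, not the actual implementation
private async chunkSentences(sentences: string[]): Promise<string[]> {
  const embeddings = await Promise.all(sentences.map((s) => this.embed(s)))
  const chunks: string[] = []
  let current = [0] // indices of sentences in the current chunk

  for (let i = 1; i < sentences.length; i++) {
    // Max similarity between the candidate sentence and the current chunk
    const maxSim = Math.max(...current.map((j) => cosineSimilarity(embeddings[i], embeddings[j])))

    // Min similarity within the current chunk measures its internal coherence
    let minSim = 1
    for (let a = 0; a < current.length; a++) {
      for (let b = a + 1; b < current.length; b++) {
        minSim = Math.min(minSim, cosineSimilarity(embeddings[current[a]], embeddings[current[b]]))
      }
    }

    const threshold = this.calculateThreshold(minSim, current.length)
    if (this.shouldAddToChunk(maxSim, threshold)) {
      current.push(i)
    } else {
      // Similarity dropped below the threshold: topic boundary, start a new chunk
      chunks.push(current.map((j) => sentences[j]).join(' '))
      current = [i]
    }
  }
  chunks.push(current.map((j) => sentences[j]).join(' '))
  return chunks
}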
Implementation details
Sentence detection: Intl.Segmenter
// src/chunker/sentence-splitter.ts
const segmenter = new Intl.Segmenter('und', { granularity: 'sentence' })
No external dependencies. Multilingual support via the Unicode text segmentation standard (UAX #29). The 'und' (undetermined) locale applies the default, language-agnostic segmentation rules.
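Usage is nearly a one-liner; this sketch uses only the standard ECMA-402 API, nothing project-specific:
const segmenter = new Intl.Segmenter('und', { granularity: 'sentence' })

// Split text into trimmed, non-empty sentences
function splitSentences(text: string): string[] {
  return Array.from(segmenter.segment(text), (s) => s.segment.trim()).filter((s) => s.length > 0)
}

splitSentences('Fixed chunks break mid-sentence. Semantic chunks should not! Right?')
// → ['Fixed chunks break mid-sentence.', 'Semantic chunks should not!', 'Right?']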
Code block preservation
// src/chunker/sentence-splitter.ts
const CODE_BLOCK_PLACEHOLDER = '\u0000CODE_BLOCK\u0000'
// Extract before sentence splitting
const codeBlockRegex = /```[\s\S]*?```/g
// ... replace with placeholders ...
// Restore after chunking
Markdown code blocks stay intact—never split mid-block. Critical for technical documentation where copy-pastable code is the point.
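A sketch of the placeholder round-trip; the two helper functions are illustrative, and only the placeholder constant and the regex come from the snippet above. Restoration assumes the chunks preserve document order.
const CODE_BLOCK_PLACEHOLDER = '\u0000CODE_BLOCK\u0000'
const codeBlockRegex = /```[\s\S]*?```/g

// Swap each fenced block for a placeholder before sentence splitting
function extractCodeBlocks(text: string): { cleaned: string; blocks: string[] } {
  const blocks: string[] = []
  const cleaned = text.replace(codeBlockRegex, (block) => {
    blocks.push(block)
    return CODE_BLOCK_PLACEHOLDER
  })
  return { cleaned, blocks }
}

// Swap placeholders back in after chunk boundaries are decided
function restoreCodeBlocks(chunks: string[], blocks: string[]): string[] {
  let i = 0
  return chunks.map((chunk) => chunk.replaceAll(CODE_BLOCK_PLACEHOLDER, () => blocks[i++] ?? ''))
}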
Performance tuning
The paper uses O(k²) comparisons within each chunk. For long homogeneous documents, this explodes.
// src/chunker/semantic-chunker.ts
const WINDOW_SIZE = 5 // Compare only recent 5 sentences: O(k²) → O(25)
const MAX_SENTENCES = 15 // Force split at 15 sentences (3x paper's median)
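As a sketch of what the window buys: the within-chunk coherence check only looks at the last WINDOW_SIZE sentences, so the pairwise comparisons are capped at 25 regardless of chunk length (cosineSimilarity is an assumed helper).
// Illustrative: cap the O(k²) within-chunk comparison at WINDOW_SIZE² = 25
function windowedMinSimilarity(chunkEmbeddings: number[][]): number {
  const window = chunkEmbeddings.slice(-WINDOW_SIZE)
  let minSim = 1
  for (let a = 0; a < window.length; a++) {
    for (let b = a + 1; b < window.length; b++) {
      minSim = Math.min(minSim, cosineSimilarity(window[a], window[b]))
    }
  }
  return minSim
}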
PDF parsing: pdfjs-dist
Switched from pdf-parse to pdfjs-dist for access to position information (x, y coordinates, font size). This enables header/footer detection: variable content like "Page 7 of 75", which pdf-parse would pass through as regular text, can be stripped before it pollutes the index.
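A rough sketch of what position-based filtering looks like with pdfjs-dist. The import path varies by pdfjs-dist version and runtime, and the 8% margin bands are an assumption, not the project's actual heuristic.
import { getDocument } from 'pdfjs-dist'

// Drop text items that sit in the top/bottom margin bands of each page,
// where running headers and "Page 7 of 75" footers usually live.
async function extractBodyText(data: Uint8Array, marginRatio = 0.08): Promise<string> {
  const doc = await getDocument({ data }).promise
  const pages: string[] = []

  for (let p = 1; p <= doc.numPages; p++) {
    const page = await doc.getPage(p)
    const { height } = page.getViewport({ scale: 1 })
    const content = await page.getTextContent()

    const body = content.items
      .flatMap((item) => ('str' in item ? [item] : []))
      // transform[5] is the item's y position (origin at the bottom-left of the page)
      .filter((item) => {
        const y = item.transform[5]
        return y > height * marginRatio && y < height * (1 - marginRatio)
      })
      .map((item) => item.str)

    pages.push(body.join(' '))
  }
  return pages.join('\n\n')
}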
4. When Semantic Chunks Broke Hybrid Search
The problem
Semantic chunks are richer—more content per chunk, more coherent meaning. But this broke the original keyword matching.
The issue: scores became unreliable. A keyword match in a dense, high-quality chunk meant something different than a match in a sparse, fragmented one.
Attempted: RRF (Reciprocal Rank Fusion)
RRF is the standard approach for merging BM25 and vector results:
RRF_score = Σ 1/(k + rank_i)
Combine rankings by position, not by score. Elegant, widely used, no tuning required.
But there's a fundamental problem: distance information is lost.
Original distances: 0.1, 0.2, 0.9 → Ranks: 1, 2, 3
Original distances: 0.1, 0.15, 0.18 → Ranks: 1, 2, 3
# Same ranks, completely different quality gaps
RRF outputs ranks, not distances. Our quality filters—distance threshold, relevance gap grouping—need actual distances to work.
As noted in Microsoft's hybrid search documentation: "RRF aggregates rankings rather than scores." This is by design—it avoids the problem of incompatible score scales. But it means downstream quality filtering can't distinguish "barely made top-10" from "clearly the best match."
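A toy computation makes the loss concrete (standard RRF contribution with k = 60, nothing project-specific):
// RRF keeps only the rank: contribution = 1 / (k + rank)
const rrfContribution = (rank: number, k = 60) => 1 / (k + rank)

// Two very different result sets (ascending vector distances)...
const tightCluster = [0.1, 0.15, 0.18] // three strong matches
const oneGoodHit = [0.1, 0.2, 0.9]     // only the first is strong

// ...produce identical contributions, because only the rank survives
console.log(tightCluster.map((_, i) => rrfContribution(i + 1))) // ≈ [0.0164, 0.0161, 0.0159]
console.log(oneGoodHit.map((_, i) => rrfContribution(i + 1)))   // ≈ [0.0164, 0.0161, 0.0159]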
Solution: Semantic-first with keyword boost
Keep vector search as the primary signal. Use keywords to adjust distances, not replace them.
// src/vectordb/index.ts
// Multiplicative boost: distance / (1 + keyword_score * weight)
const boostedDistance = result.score / (1 + keywordScore * weight)
The formula:
- No keyword match (score = 0): distance / 1 → distance unchanged
- Perfect match with weight = 0.6: distance / 1.6 → reduced by 37.5%
- Perfect match with weight = 1.0: distance / 2 → halved
This preserves the distance for quality filtering while boosting exact matches.
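A sketch of the boost step as a standalone function; the result shape and the assumption that FTS scores are pre-normalized to [0, 1] are illustrative.
interface RankedResult { id: string; text: string; distance: number }

// Multiplicative boost: divide the vector distance by (1 + keywordScore * weight).
// No keyword hit means the distance passes through unchanged.
function applyKeywordBoost(
  vectorResults: RankedResult[],
  ftsScores: Map<string, number>, // id → keyword score normalized to [0, 1]
  weight: number,                 // e.g. RAG_HYBRID_WEIGHT = 0.6
): RankedResult[] {
  return vectorResults
    .map((r) => ({ ...r, distance: r.distance / (1 + (ftsScores.get(r.id) ?? 0) * weight) }))
    .sort((a, b) => a.distance - b.distance) // smaller boosted distance ranks higher
}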
Architecture
// src/vectordb/index.ts
// 1. Vector search with 2x candidate pool
const candidateLimit = limit * HYBRID_SEARCH_CANDIDATE_MULTIPLIER
// 2. Apply distance filter
if (this.config.maxDistance !== undefined) {
  query = query.distanceRange(undefined, this.config.maxDistance)
}

// 3. Apply grouping
if (this.config.grouping && results.length > 1) {
  results = this.applyGrouping(results, this.config.grouping)
}

// 4. Keyword boost via FTS
const ftsResults = await this.table
  .search(queryText, 'fts', 'text')
// ...
results = this.applyKeywordBoost(results, ftsResults, hybridWeight)
Quality filters apply to meaningful vector distances. Keyword matching acts as a boost, not a replacement.
Multilingual challenge
Japanese keyword matching broke with richer chunks. The default tokenizer couldn't handle CJK characters properly.
Solution: LanceDB FTS with n-gram indexing.
// src/vectordb/index.ts
await this.table.createIndex('text', {
  config: Index.fts({
    baseTokenizer: 'ngram',
    ngramMinLength: 2,  // Capture Japanese bi-grams (東京, 設計)
    ngramMaxLength: 3,  // Balance precision vs index size
    prefixOnly: false,  // All positions for proper CJK support
    stem: false,        // Preserve exact terms
  }),
})
N-grams at min=2, max=3 capture both English terms and Japanese compound words without language-specific tokenization.
5. Results: What Actually Changed
Observed behavior (real usage)
My setup: framework best practices (official PDFs), project principles (rules), specifications (PRDs) stored in RAG. Before each task, the agent analyzes requirements and searches RAG for relevant context.
Before (fixed chunks + top-K):
- Agent couldn't find relevant information on first search
- Multiple search attempts with different query formulations
- Eventually gave up and read rule files directly
- PDFs were too large to read, so that context was effectively lost
After (semantic chunks + boost + filtering):
- Single search usually provides sufficient context
- Additional searches happen for depth, not compensation
- Agent stopped reading files directly—RAG results were trustworthy
LLM evaluation (before/after comparison)
I had an LLM evaluate search results with project context—not a formal LLM-as-Judge setup, but structured comparison.
Old version:
- Garbage chunks (outliers) and fragmented information in ~2/10 results for some queries
- Results required additional verification
Updated version:
- No garbage chunks
- 8/10 results directly relevant to the query
- 2/10 results tangentially related (still useful context)
- Evaluator noted: "Search results alone provide necessary and sufficient information"
Examining the raw JSON confirmed the qualitative assessment—chunks contained coherent, dense information rather than fragments.
No benchmarks
This is qualitative observation from real usage, not controlled experiments. But the behavioral change is clear: the LLM stopped compensating for bad RAG results.
6. Architecture Summary
Document → Semantic Chunking (Max-Min) → Embeddings → LanceDB
Query → Vector Search → Distance Filter → Grouping → Keyword Boost → Results
Key decisions
| Choice | Reason |
|---|---|
| Semantic chunking over fixed | Meaning-preserving units reduce LLM compensation |
| Keyword boost over RRF | Preserves distance for quality filtering |
| Distance-based grouping | Quality signal, not arbitrary K |
| N-gram FTS | Multilingual support without tokenizer complexity |
| Local-only | Privacy, cost, offline capability |
Configuration
# Environment variables
RAG_HYBRID_WEIGHT=0.6 # Keyword boost factor (0=semantic, 1=BM25-dominant)
RAG_GROUPING=related # 'similar' (top group) or 'related' (top 2 groups)
RAG_MAX_DISTANCE=0.5 # Filter low-relevance results
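One way these might be read into the search config at startup (a sketch; parsing and fallbacks are assumptions, only the variable names come from above):
const searchConfig = {
  hybridWeight: Number(process.env.RAG_HYBRID_WEIGHT ?? '0.6'),
  grouping: process.env.RAG_GROUPING as 'similar' | 'related' | undefined,
  maxDistance: process.env.RAG_MAX_DISTANCE ? Number(process.env.RAG_MAX_DISTANCE) : undefined,
}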
7. The Other Side: Query Quality
RAG accuracy depends on two things:
- Search quality (what we've discussed)
- Query quality (what the LLM sends)
MCP's dual invisibility
User → LLM → MCP(RAG) → LLM → Response
           ↑          ↑
     Query hidden   Results hidden
Even perfect RAG fails with bad queries. And users can't see either side.
Solution: Agent Skills
Agent Skills is an open format for extending AI agent capabilities with specialized knowledge. Skills are portable, version-controlled packages of procedural knowledge that agents load on-demand.
For this RAG, skills teach the LLM:
Query formulation
# Query patterns by intent
| Intent | Pattern |
|--------|---------|
| Definition/Concept | "[term] definition concept" |
| How-To/Procedure | "[action] steps example usage" |
| API/Function | "[function] API arguments return" |
| Troubleshooting | "[error] fix solution cause" |
Score interpretation
# Score thresholds
< 0.3 : Use directly (high confidence)
0.3-0.5: Include if mentions same concept/entity
> 0.5 : Skip unless no better results
Skills can be installed via the mcp-local-rag-skills CLI.
This completes the optimization loop:
- RAG side: semantic chunks + distance filters + keyword boost
- LLM side: query formulation + result interpretation
Both sides matter. Optimizing only one leaves performance on the table.
8. Tradeoffs and Limitations
What this approach gives up
- BM25-only hits don't surface: Must appear in semantic results first to get boosted
- No reranker: Would improve accuracy but adds complexity/latency
- No formal benchmarks: Qualitative evaluation only
Where heavier approaches win
- RRF + Reranker: Broader candidate pool, reranker compensates for RRF's rank-only output
- LLM-as-reranker: Best accuracy, but slow and expensive
Position on the spectrum
Light & Fast ←————————————————————→ Heavy & Accurate
semantic-only
  └─ semantic + boost (here)
      └─ RRF + Cross-Encoder
          └─ RRF + LLM Rerank
The goal was: maximum quality within zero-setup, local-only constraints.
9. Conclusion
- Standard RAG (fixed chunks + top-K) breaks down for agentic coding use cases
- Semantic chunking + quality filtering + keyword boost is a viable middle ground
- RRF looks elegant but loses distance information critical for filtering
- Query quality matters as much as search quality—Agent Skills address this
- The real test: does the LLM stop making compensatory tool calls?
Code: github.com/shinpr/mcp-local-rag
References
- Kiss, C., Nagy, M. & Szilágyi, P. (2025). Max–Min semantic chunking of documents for RAG application. Discover Computing 28, 117. https://doi.org/10.1007/s10791-025-09638-7
- LanceDB Full-Text Search: https://lancedb.github.io/lancedb/fts/
- MCP Specification: https://modelcontextprotocol.io
- Agent Skills: https://agentskills.io
- Reciprocal Rank Fusion (OpenSearch): https://opensearch.org/blog/introducing-reciprocal-rank-fusion-hybrid-search/
- Hybrid Search Scoring (Microsoft): https://learn.microsoft.com/en-us/azure/search/hybrid-search-ranking