Shinsuke KAGAWA

Building a Local RAG for Agentic Coding: From Fixed Chunks to Semantic Search with Keyword Boost

Started with a simple RAG for MCP—the kind of thing you build in a weekend. Ended up implementing semantic chunking (Max-Min algorithm) and rethinking hybrid search entirely. This article is written for people who have already built RAG systems and started hitting quality limits. If you've hit walls with fixed-size chunks and top-K retrieval, this might be useful.

Table of Contents

  1. Context: RAG for Agentic Coding
  2. The Invisible Problem: What Does the LLM Actually Receive?
  3. Semantic Chunking: Why Fixed Chunks Break Down
  4. When Semantic Chunks Broke Hybrid Search
  5. Results: What Actually Changed
  6. Architecture Summary
  7. The Other Side: Query Quality
  8. Tradeoffs and Limitations
  9. Conclusion

1. Context: RAG for Agentic Coding

Problem statement

The request was straightforward: load domain knowledge from PDFs for a specialized agent. Framework best practices, project principles (rules), and specifications (PRDs)—the kind of documents you'd want an AI coding assistant to reference while working.

The constraints made it interesting:

  • Personal use → No external APIs, privacy matters
  • MCP ecosystem → Integration with Cursor, Claude Code, Codex
  • "Agentic Coding support" as the use case

Initial implementation

The first version was textbook RAG:

Document → Fixed-size chunks (500 chars) → Embeddings → LanceDB
Query → Vector search → Top-K results → LLM

Standard fixed-size chunking. Vector search with top-K retrieval. Local embedding model via Transformers.js. LanceDB for vector storage—file-based, no server process required.
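
For reference, the chunking half of that first version amounted to slicing on a character budget. A minimal sketch of that kind of chunker (the 500-character size matches the pipeline above; the overlap value is an illustrative assumption, not the project's actual setting):

// Hypothetical sketch of a fixed-size chunker
function chunkFixed(text: string, size = 500, overlap = 50): string[] {
  const chunks: string[] = []
  for (let start = 0; start < text.length; start += size - overlap) {
    chunks.push(text.slice(start, start + size))
  }
  return chunks
}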

It worked... sort of.

2. The Invisible Problem: What Does the LLM Actually Receive?

Discovery

Here's the thing about MCP: search results go directly to the LLM. The user never sees them.

User → LLM → MCP(RAG) → LLM → Response
               ↑
         Results hidden from user

When the RAG returns garbage, you don't see it. You just notice the LLM behaving strangely—making additional searches, reading files directly, or giving incomplete answers.

To debug this, I forced the LLM to output the raw JSON search results. The prompt was simple: "Show me the exact JSON you received from the RAG search."

What I found: lots of irrelevant chunks polluting the context. Page markers, decoration lines, fragments cut mid-sentence.

Why top-K fails

The standard approach is "return the top 10 closest vectors." But closeness in vector space doesn't equal usefulness.

  • Increasing K just adds more noise
  • No quality signal—just "top 10 closest vectors"
  • A chunk with distance 0.1 and another with distance 0.9 both make the cut if they're in the top K

First fix: Quality filtering

Three mechanisms, each addressing a different problem:

1. Distance-based threshold (RAG_MAX_DISTANCE)

// src/vectordb/index.ts
if (this.config.maxDistance !== undefined) {
  query = query.distanceRange(undefined, this.config.maxDistance)
}

Only return results below a certain distance. If nothing is close enough, return nothing—better than returning garbage.

2. Relevance gap grouping (RAG_GROUPING)

Instead of arbitrary K, detect natural "quality groups" in the results:

// src/vectordb/index.ts
// Calculate statistical threshold: mean + 1.5 * std
const threshold = mean + GROUPING_BOUNDARY_STD_MULTIPLIER * std

// Find significant gaps (group boundaries)
const boundaries = gaps.filter((g) => g.gap > threshold)

// 'similar' mode: first group only
// 'related' mode: top 2 groups

Results cluster naturally—there's usually a gap between "highly relevant" and "somewhat related." This detects that gap statistically.
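
A self-contained sketch of that gap detection, assuming results are already sorted by ascending distance (the statistics mirror the snippet above; the function name and signature are illustrative, not the project's actual code):

// Sketch: split sorted distances into quality groups at statistically large gaps
function groupByRelevanceGap(distances: number[], stdMultiplier = 1.5): number[][] {
  if (distances.length < 2) return [distances]

  // Gaps between consecutive results
  const gaps = distances.slice(1).map((d, i) => d - distances[i])
  const mean = gaps.reduce((a, b) => a + b, 0) / gaps.length
  const variance = gaps.reduce((a, g) => a + (g - mean) ** 2, 0) / gaps.length
  const threshold = mean + stdMultiplier * Math.sqrt(variance)

  // Start a new group wherever the gap exceeds the threshold
  const groups: number[][] = [[distances[0]]]
  gaps.forEach((gap, i) => {
    if (gap > threshold) groups.push([])
    groups[groups.length - 1].push(distances[i + 1])
  })
  return groups // 'similar' → groups[0], 'related' → groups.slice(0, 2).flat()
}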

3. Garbage chunk removal

// src/chunker/semantic-chunker.ts
export function isGarbageChunk(text: string): boolean {
  const trimmed = text.trim()
  if (trimmed.length === 0) return true

  // Decoration line patterns (----, ====, ****, etc.)
  if (/^[\-=_.*#|~`@!%^&*()\[\]{}\\/<>:+\s]+$/.test(trimmed)) return true

  // Excessive repetition of a single character (>80%)
  const charCounts = new Map<string, number>()
  for (const ch of trimmed) charCounts.set(ch, (charCounts.get(ch) ?? 0) + 1)
  const maxCount = Math.max(...charCounts.values())
  if (maxCount / trimmed.length > 0.8) return true

  return false
}

Page markers, separator lines, repeated characters—filter them before they ever reach the index.

New problem emerged

Technical terms like useEffect or ERR_CONNECTION_REFUSED were getting filtered out. They're semantically distant from natural language queries but keyword-relevant.

The fix: hybrid search (semantic + keyword blend). But implementing it properly required rethinking the chunking strategy first.

3. Semantic Chunking: Why Fixed Chunks Break Down

Trigger

I read about "semantic center of gravity" in chunks—the idea that a chunk should have a coherent meaning, not just a coherent length.

Then I observed the LLM's behavior: after RAG search, it would often search again with different terms, or just read the file directly. The chunks weren't trustworthy—they lacked sufficient context for the LLM to act on them.

The waste

If a chunk doesn't contain enough meaning:

  1. LLM makes additional tool calls to compensate
  2. Context gets polluted with redundant searches
  3. Latency increases
  4. Tokens get wasted

The LLM was doing work that good chunking should prevent.

Solution: Max-Min Algorithm

The Max-Min semantic chunking paper (Kiss et al., Springer 2025) provided the foundation. This implementation is a pragmatic adaptation of the Max–Min idea, not a faithful reproduction of the paper's algorithm.

The core idea: group consecutive sentences based on semantic similarity, not character count.

// src/chunker/semantic-chunker.ts

// Should we add this sentence to the current chunk?
private shouldAddToChunk(maxSim: number, threshold: number): boolean {
  return maxSim > threshold
}

// Dynamic threshold based on chunk coherence
private calculateThreshold(minSim: number, chunkSize: number): number {
  // threshold = max(c * minSim * sigmoid(|C|), hardThreshold)
  const sigmoid = 1 / (1 + Math.exp(-chunkSize))
  return Math.max(this.config.c * minSim * sigmoid, this.config.hardThreshold)
}

The algorithm:

  1. Split text into sentences
  2. Generate embeddings for all sentences
  3. For each sentence, decide: add to current chunk or start new?
  4. Decision based on comparing max similarity with new sentence vs. min similarity within chunk

When the new sentence's similarity drops below the threshold, it signals a topic boundary.
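
A simplified sketch of that loop (the similarity helper and the threshold function are passed in; the window and length caps described under Performance tuning below are omitted for brevity):

// Sketch: group sentences by semantic similarity instead of character count
function chunkBySimilarity(
  sentences: string[],
  embeddings: number[][],
  sim: (a: number[], b: number[]) => number,
  calcThreshold: (minSim: number, chunkSize: number) => number
): string[][] {
  const chunks: string[][] = []
  let current = [0]    // sentence indices in the current chunk
  let minSimWithin = 1 // weakest similarity observed inside the chunk

  for (let i = 1; i < sentences.length; i++) {
    // Strongest link between the candidate sentence and the current chunk
    const simsToChunk = current.map((j) => sim(embeddings[i], embeddings[j]))
    const maxSim = Math.max(...simsToChunk)

    if (maxSim > calcThreshold(minSimWithin, current.length)) {
      current.push(i) // same topic: extend the chunk
      minSimWithin = Math.min(minSimWithin, ...simsToChunk)
    } else {
      chunks.push(current.map((j) => sentences[j]))
      current = [i]   // topic boundary: start a new chunk
      minSimWithin = 1
    }
  }
  chunks.push(current.map((j) => sentences[j]))
  return chunks
}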

Implementation details

Sentence detection: Intl.Segmenter

// src/chunker/sentence-splitter.ts
const segmenter = new Intl.Segmenter('und', { granularity: 'sentence' })

No external dependencies. Multilingual support via Unicode standard (UAX #29). The 'und' (undetermined) locale provides general Unicode support.
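
Iterating the segments yields the sentence list directly; a minimal usage example:

// Each segment carries the sentence text plus its offset in the input
const segments = new Intl.Segmenter('und', { granularity: 'sentence' })
  .segment('First sentence. 二つ目の文です。Third one!')
const sentences = Array.from(segments, (s) => s.segment.trim()).filter(Boolean)
// → ['First sentence.', '二つ目の文です。', 'Third one!']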

Code block preservation

// src/chunker/sentence-splitter.ts
const CODE_BLOCK_PLACEHOLDER = '\u0000CODE_BLOCK\u0000'

// Extract before sentence splitting
const codeBlockRegex = /```[\s\S]*?```/g
// ... replace with placeholders ...

// Restore after chunking

Markdown code blocks stay intact—never split mid-block. Critical for technical documentation where copy-pastable code is the point.
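
The mechanics are a simple swap-out before splitting and swap-in after chunking; a sketch of the idea (the indexed placeholders and helper names here are illustrative; the project uses a single placeholder constant):

// Sketch: protect fenced code blocks from the sentence splitter
const CODE_BLOCK_REGEX = /```[\s\S]*?```/g

function extractCodeBlocks(text: string): { masked: string; blocks: string[] } {
  const blocks: string[] = []
  const masked = text.replace(CODE_BLOCK_REGEX, (block) => {
    blocks.push(block)
    return `\u0000CODE_BLOCK_${blocks.length - 1}\u0000` // placeholder survives splitting
  })
  return { masked, blocks }
}

function restoreCodeBlocks(chunk: string, blocks: string[]): string {
  return chunk.replace(/\u0000CODE_BLOCK_(\d+)\u0000/g, (_, i) => blocks[Number(i)])
}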

Performance tuning

The paper uses O(k²) comparisons within each chunk. For long homogeneous documents, this explodes.

// src/chunker/semantic-chunker.ts
const WINDOW_SIZE = 5      // Compare only recent 5 sentences: O(k²) → O(25)
const MAX_SENTENCES = 15   // Force split at 15 sentences (3x paper's median)
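
In practice that means the candidate sentence is only compared against the tail of the current chunk, and chunks are closed once they hit the sentence cap. A small self-contained sketch of that check (names and defaults mirror the constants above; the similarity helper is assumed):

// Sketch: windowed similarity check plus a hard cap on chunk length
function shouldExtendChunk(
  candidate: number[],         // embedding of the new sentence
  chunkEmbeddings: number[][], // embeddings of sentences already in the chunk
  threshold: number,
  sim: (a: number[], b: number[]) => number,
  windowSize = 5,
  maxSentences = 15
): boolean {
  if (chunkEmbeddings.length >= maxSentences) return false // force split
  const window = chunkEmbeddings.slice(-windowSize)         // compare only the recent tail
  const maxSim = Math.max(...window.map((e) => sim(candidate, e)))
  return maxSim > threshold
}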

PDF parsing: pdfjs-dist

Switched from pdf-parse to pdfjs-dist for access to position information (x, y coordinates, font size). This enables semantic header/footer detection—variable content like "Page 7 of 75" that pdf-parse would include as regular text.
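
A rough sketch of how the position data can be used (the margin-band heuristic and its threshold are illustrative assumptions, not the project's actual detection logic; in Node the legacy build of pdfjs-dist may be needed):

import { getDocument } from 'pdfjs-dist'

// Sketch: drop text items that fall into the top/bottom margin bands of each page
async function extractBodyText(data: Uint8Array, marginRatio = 0.07): Promise<string> {
  const doc = await getDocument({ data }).promise
  const lines: string[] = []
  for (let p = 1; p <= doc.numPages; p++) {
    const page = await doc.getPage(p)
    const { height } = page.getViewport({ scale: 1 })
    const { items } = await page.getTextContent()
    for (const item of items) {
      if (!('str' in item)) continue // skip marked-content entries
      const y = item.transform[5]    // y position, measured from the page bottom
      const inHeaderBand = y > height * (1 - marginRatio)
      const inFooterBand = y < height * marginRatio
      if (!inHeaderBand && !inFooterBand) lines.push(item.str)
    }
  }
  return lines.join(' ')
}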

4. When Semantic Chunks Broke Hybrid Search

The problem

Semantic chunks are richer—more content per chunk, more coherent meaning. But this broke the original keyword matching.

The issue: scores became unreliable. A keyword match in a dense, high-quality chunk meant something different than a match in a sparse, fragmented one.

Attempted: RRF (Reciprocal Rank Fusion)

RRF is the standard approach for merging BM25 and vector results:

RRF_score = Σ 1/(k + rank_i)

Combine rankings by position, not by score. Elegant, widely used, no tuning required.
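
For reference, RRF itself is only a few lines (k = 60 is the conventional constant; this sketch fuses two or more ranked lists of document ids):

// Sketch: Reciprocal Rank Fusion over ranked result lists
function rrf(rankings: string[][], k = 60): Map<string, number> {
  const scores = new Map<string, number>()
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank))
    })
  }
  return scores // ranks in, fused scores out; the original distances never make it through
}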

But there's a fundamental problem: distance information is lost.

Original distances: 0.1, 0.2, 0.9  →  Ranks: 1, 2, 3
Original distances: 0.1, 0.15, 0.18  →  Ranks: 1, 2, 3
# Same ranks, completely different quality gaps

RRF outputs ranks, not distances. Our quality filters—distance threshold, relevance gap grouping—need actual distances to work.

As noted in Microsoft's hybrid search documentation: "RRF aggregates rankings rather than scores." This is by design—it avoids the problem of incompatible score scales. But it means downstream quality filtering can't distinguish "barely made top-10" from "clearly the best match."

Solution: Semantic-first with keyword boost

Keep vector search as the primary signal. Use keywords to adjust distances, not replace them.

// src/vectordb/index.ts
// Multiplicative boost: distance / (1 + keyword_score * weight)
const boostedDistance = result.score / (1 + keywordScore * weight)

The formula:

  • No keyword match (score=0): distance / 1 = distance (unchanged)
  • Perfect match with weight=0.6: distance / 1.6 (reduced by 37.5%)
  • Perfect match with weight=1.0: distance / 2 (halved)

This preserves the distance for quality filtering while boosting exact matches.
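
A self-contained sketch of the boost step (the Hit shape and the normalized keyword score map are illustrative; the real implementation lives in src/vectordb/index.ts):

interface Hit {
  id: string
  text: string
  distance: number // lower is better
}

// Sketch: divide the vector distance by (1 + keywordScore * weight), then re-rank
function applyKeywordBoost(
  vectorHits: Hit[],
  keywordScores: Map<string, number>, // id → FTS score normalized to [0, 1]
  weight: number
): Hit[] {
  return vectorHits
    .map((hit) => {
      const kw = keywordScores.get(hit.id) ?? 0
      return { ...hit, distance: hit.distance / (1 + kw * weight) }
    })
    .sort((a, b) => a.distance - b.distance)
}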

Architecture

// src/vectordb/index.ts
// 1. Vector search with 2x candidate pool
const candidateLimit = limit * HYBRID_SEARCH_CANDIDATE_MULTIPLIER

// 2. Apply distance filter
if (this.config.maxDistance !== undefined) {
  query = query.distanceRange(undefined, this.config.maxDistance)
}

// 3. Apply grouping
if (this.config.grouping && results.length > 1) {
  results = this.applyGrouping(results, this.config.grouping)
}

// 4. Keyword boost via FTS
const ftsResults = await this.table
  .search(queryText, 'fts', 'text')
  // ...
results = this.applyKeywordBoost(results, ftsResults, hybridWeight)

Quality filters apply to meaningful vector distances. Keyword matching acts as a boost, not a replacement.

Multilingual challenge

Japanese keyword matching broke with richer chunks. The default tokenizer couldn't handle CJK characters properly.

Solution: LanceDB FTS with n-gram indexing.

// src/vectordb/index.ts
await this.table.createIndex('text', {
  config: Index.fts({
    baseTokenizer: 'ngram',
    ngramMinLength: 2,  // Capture Japanese bi-grams (東京, 設計)
    ngramMaxLength: 3,  // Balance precision vs index size
    prefixOnly: false,  // All positions for proper CJK support
    stem: false,        // Preserve exact terms
  }),
})

N-grams at min=2, max=3 capture both English terms and Japanese compound words without language-specific tokenization.
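
To make that concrete, a 2- to 3-gram tokenizer emits overlapping character windows, so the query term 設計 hits the same index entries that a chunk containing 設計原則 produced. A quick illustration of what gets indexed (a simplified n-gram generator, not LanceDB's actual tokenizer):

// Sketch: character n-grams for min=2, max=3
function ngrams(text: string, min = 2, max = 3): string[] {
  const chars = Array.from(text) // iterate code points so CJK is handled correctly
  const grams: string[] = []
  for (let n = min; n <= max; n++) {
    for (let i = 0; i + n <= chars.length; i++) {
      grams.push(chars.slice(i, i + n).join(''))
    }
  }
  return grams
}

ngrams('設計原則') // → ['設計', '計原', '原則', '設計原', '計原則']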

5. Results: What Actually Changed

Observed behavior (real usage)

My setup: framework best practices (official PDFs), project principles (rules), specifications (PRDs) stored in RAG. Before each task, the agent analyzes requirements and searches RAG for relevant context.

Before (fixed chunks + top-K):

  • Agent couldn't find relevant information on first search
  • Multiple search attempts with different query formulations
  • Eventually gave up and read rule files directly
  • PDFs were too large to read, so that context was effectively lost

After (semantic chunks + boost + filtering):

  • Single search usually provides sufficient context
  • Additional searches happen for depth, not compensation
  • Agent stopped reading files directly—RAG results were trustworthy

LLM evaluation (before/after comparison)

I had an LLM evaluate search results with project context—not a formal LLM-as-Judge setup, but structured comparison.

Old version:

  • Garbage chunks (outliers) and fragmented information in ~2/10 results for some queries
  • Results required additional verification

Updated version:

  • No garbage chunks
  • 8/10 results directly relevant to the query
  • 2/10 results tangentially related (still useful context)
  • Evaluator noted: "Search results alone provide necessary and sufficient information"

Examining the raw JSON confirmed the qualitative assessment—chunks contained coherent, dense information rather than fragments.

No benchmarks

This is qualitative observation from real usage, not controlled experiments. But the behavioral change is clear: the LLM stopped compensating for bad RAG results.

6. Architecture Summary

Document → Semantic Chunking (Max-Min) → Embeddings → LanceDB

Query → Vector Search → Distance Filter → Grouping → Keyword Boost → Results

Key decisions

| Choice | Reason |
|--------|--------|
| Semantic chunking over fixed | Meaning-preserving units reduce LLM compensation |
| Keyword boost over RRF | Preserves distance for quality filtering |
| Distance-based grouping | Quality signal, not arbitrary K |
| N-gram FTS | Multilingual support without tokenizer complexity |
| Local-only | Privacy, cost, offline capability |

Configuration

# Environment variables
RAG_HYBRID_WEIGHT=0.6    # Keyword boost factor (0=semantic, 1=BM25-dominant)
RAG_GROUPING=related     # 'similar' (top group) or 'related' (top 2 groups)
RAG_MAX_DISTANCE=0.5     # Filter low-relevance results
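
A minimal sketch of how these variables might map onto the search config used in the earlier snippets (the defaults shown here are illustrative assumptions):

// Sketch: read the tuning knobs from the environment
const config = {
  hybridWeight: Number(process.env.RAG_HYBRID_WEIGHT ?? '0.6'),
  grouping: (process.env.RAG_GROUPING ?? 'related') as 'similar' | 'related',
  maxDistance: process.env.RAG_MAX_DISTANCE
    ? Number(process.env.RAG_MAX_DISTANCE)
    : undefined, // unset → no distance filtering
}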

7. The Other Side: Query Quality

RAG accuracy depends on two things:

  1. Search quality (what we've discussed)
  2. Query quality (what the LLM sends)

MCP's dual invisibility

User → LLM → MCP(RAG) → LLM → Response
         ↑         ↑
     Query hidden  Results hidden
Enter fullscreen mode Exit fullscreen mode

Even perfect RAG fails with bad queries. And users can't see either side.

Solution: Agent Skills

Agent Skills is an open format for extending AI agent capabilities with specialized knowledge. Skills are portable, version-controlled packages of procedural knowledge that agents load on-demand.

For this RAG, skills teach the LLM:

Query formulation

# Query patterns by intent
| Intent | Pattern |
|--------|---------|
| Definition/Concept | "[term] definition concept" |
| How-To/Procedure | "[action] steps example usage" |
| API/Function | "[function] API arguments return" |
| Troubleshooting | "[error] fix solution cause" |

Score interpretation

# Score thresholds
< 0.3  : Use directly (high confidence)
0.3-0.5: Include if mentions same concept/entity
> 0.5  : Skip unless no better results

Skills can be installed via the mcp-local-rag-skills CLI.

This completes the optimization loop:

  • RAG side: semantic chunks + distance filters + keyword boost
  • LLM side: query formulation + result interpretation

Both sides matter. Optimizing only one leaves performance on the table.

8. Tradeoffs and Limitations

What this approach gives up

  • BM25-only hits don't surface: a chunk must already appear in the vector candidate pool to get boosted
  • No reranker: Would improve accuracy but adds complexity/latency
  • No formal benchmarks: Qualitative evaluation only

Where heavier approaches win

  • RRF + Reranker: Broader candidate pool, reranker compensates for RRF's rank-only output
  • LLM-as-reranker: Best accuracy, but slow and expensive

Position on the spectrum

Light & Fast ←————————————————————→ Heavy & Accurate
    semantic-only
        └─ semantic + boost (here)
               └─ RRF + Cross-Encoder
                      └─ RRF + LLM Rerank

The goal was: maximum quality within zero-setup, local-only constraints.

9. Conclusion

  • Standard RAG (fixed chunks + top-K) breaks down for agentic coding use cases
  • Semantic chunking + quality filtering + keyword boost is a viable middle ground
  • RRF looks elegant but loses distance information critical for filtering
  • Query quality matters as much as search quality—Agent Skills address this
  • The real test: does the LLM stop making compensatory tool calls?

Code: github.com/shinpr/mcp-local-rag
