Manju

Posted on Jul 3

How I Boosted RAG Code Search Accuracy From 55% to 95%

#ai #opensource #java #vectordatabase

How I built GitGrok, a Java codebase search tool using Spring Boot 3.4 and Spring AI, and got to 95% retrieval accuracy by throwing out standard RAG for something closer to a search engine that understands intent.

A practical breakdown of how "just embed and retrieve" falls apart on real code, and what I built instead.

If you’ve ever built a basic Retrieval-Augmented Generation (RAG) application, you know the general flow:

1. Chunk the text.
2. Generate vector embeddings.
3. Store those embeddings in a vector database.
4. Run a similarity search.

This works incredibly well for text-based documents like PDFs, articles, or wikis. However, it completely failed when I tried applying it to source code.
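For reference, the last step of that flow is little more than a nearest-neighbor lookup. Here is a minimal sketch, assuming a brute-force in-memory index in place of a real vector database. `NaiveRetriever` and its method names are hypothetical and only illustrate the idea.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical helper: brute-force cosine ranking over an in-memory index.
final class NaiveRetriever {

    // Cosine similarity between two equal-length vectors (0.0 if either is all zeros).
    static double cosine(double[] a, double[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0 || normB == 0) return 0.0;
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Returns the ids of the k stored chunks most similar to the query embedding.
    static List<String> topK(double[] query, Map<String, double[]> index, int k) {
        List<Map.Entry<String, double[]>> entries = new ArrayList<>(index.entrySet());
        // Negate the similarity so the most similar entries sort first.
        entries.sort(Comparator.comparingDouble(
                (Map.Entry<String, double[]> e) -> -cosine(query, e.getValue())));
        List<String> ids = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            ids.add(entries.get(i).getKey());
        }
        return ids;
    }
}
```

Nothing in this ranking knows whether a chunk is a controller, an entity, or a test. It only sees vector geometry.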

When I was building GitGrok, a tool that lets developers chat with their repositories in plain English, I wanted the user experience to be simple: ask a question about your codebase and get an instant answer. Sounds straightforward, right?

It wasn’t.

Initially, I pushed my Java source code into a Pinecone vector database and built a basic search endpoint. This default semantic search returned noisy, inconsistent, and often outright wrong results. Here is how I diagnosed the problems and how I fixed them.

Why Pure Semantic Search Fails on Codebases

Semantic search matches on meaning (context-aware searching) instead of exact keywords. That sounds powerful, but codebases demand deterministic precision.
When I first implemented GitGrok using default, file-level embeddings, two fatal flaws emerged:

1. The Test File Pollution Trap

The vector database consistently ranked test classes much higher than actual production code. This happened because test files are naturally more verbose and repetitive, whereas production source code files are compact and abstract.

2. Semantic Drift and Confusion

Consider a user querying: “Show me the controller responsible for handling owner registration.”

• The Drift: Instead of returning OwnerController.java (the actual business logic), semantic search retrieved Owner.java (the domain entity) or OwnerControllerTests.java (the test suite).

• The Cause: Both the entity and the test file shared a high cosine similarity with the query because they contained the terms "Owner" and "Controller" repeatedly. Pure semantic search proved far too fuzzy for this use case.

The Fix: Two Separate Phases

We cannot fix everything on the retrieval side alone; how we store code in a vector database is just as critical. You can't simply dump raw files into a database and expect them to yield high-quality embeddings.

To solve this, I split the architecture into two clean phases:

• Phase 1: Smart Ingestion – Shaping and filtering the data before feeding it into Pinecone.

• Phase 2: Smart Retrieval – Processing user queries intelligently before they touch the database.

Phase 1: Smarter Ingestion

I started my optimizations at the ingestion level. Initially, I had ingested entire files as single documents. This caused severe concept dilution because a 500-line Java class contains imports, annotations, comments, and various helper functions that muddy the waters. The first major pivot was changing how code is chunked before indexing.

1. Drop Test Classes Entirely
I excluded test directories and documentation entirely during the ingestion phase, focusing the index strictly on production files.
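A filter like this can be sketched in a few lines. The version below assumes the standard Maven/Gradle layout (`src/test/`) and the common `*Test.java` / `*Tests.java` naming convention. `IngestionFilter` is a hypothetical name, not GitGrok's actual class.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical ingestion filter; the path conventions are assumed Maven/Gradle defaults.
final class IngestionFilter {

    // True only for Java files that look like production source.
    static boolean isProductionSource(String path) {
        String p = path.replace('\\', '/');       // normalize Windows separators
        if (!p.endsWith(".java")) return false;   // skips docs such as README.md
        if (p.contains("/src/test/") || p.startsWith("src/test/")) return false;
        String fileName = p.substring(p.lastIndexOf('/') + 1);
        // Catches test classes that live outside src/test.
        return !(fileName.endsWith("Test.java") || fileName.endsWith("Tests.java"));
    }

    // Keeps only the paths that should be chunked and embedded.
    static List<String> filter(List<String> paths) {
        return paths.stream()
                .filter(IngestionFilter::isProductionSource)
                .collect(Collectors.toList());
    }
}
```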

2. Chunk by Method
I broke classes down into distinct functional blocks (methods) to preserve tight, localized semantic context.
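To make the idea concrete, here is a deliberately naive sketch that splits a class into method-sized chunks by tracking brace depth. It assumes one top-level class per file and ignores braces inside strings and comments, so a real pipeline would more likely use a proper Java parser. `MethodChunker` and `Chunk` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Naive brace-depth chunker (an illustration, not a full Java parser).
final class MethodChunker {

    // header = signature text before the first '{'; body = full member text.
    record Chunk(String header, String body) {}

    static List<Chunk> chunk(String source) {
        List<Chunk> chunks = new ArrayList<>();
        int depth = 0;          // 0 = outside the class, 1 = class body, 2+ = inside a member
        int memberStart = -1;   // where the current member's text begins
        int lastBoundary = 0;   // index just after the last ';', '{' or '}' seen at class level
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (c == '{') {
                if (depth == 1) memberStart = lastBoundary;  // a member body opens
                depth++;
                if (depth == 1) lastBoundary = i + 1;        // the class body opens
            } else if (c == '}') {
                depth--;
                if (depth == 1 && memberStart >= 0) {        // a member body closes
                    String text = source.substring(memberStart, i + 1).strip();
                    String header = text.substring(0, text.indexOf('{')).strip();
                    // Keep only members with a parameter list (methods, constructors).
                    if (header.contains("(")) chunks.add(new Chunk(header, text));
                    memberStart = -1;
                    lastBoundary = i + 1;
                }
            } else if (c == ';' && depth == 1) {
                lastBoundary = i + 1;                        // end of a field declaration
            }
        }
        return chunks;
    }
}
```

Each `Chunk` can then be embedded on its own, so imports, fields, and unrelated helpers no longer dilute the vector for a given method.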

3. Code-Aware Term Weighting
Because pure semantic search wasn’t enough, I needed to merge semantic meaning with exact syntax. I pivoted to hybrid search, which pairs dense vectors for context with sparse vectors for precise terminology.
In simple terms, that means two retrieval methods working together:

• Dense Vectors (Semantic Search): An embedding model converts code into high-dimensional vectors. This is great at understanding underlying concepts and intent.

• Sparse Vectors (Keyword Search): A mechanism that scores documents based on token frequency, acting like a traditional search engine. This is great for exact matches and precise variable/method names.

Solving the Scoring Problem
When you combine dense and sparse search methods in a hybrid pipeline, you run into a core mathematical hurdle: they score on completely different scales.

Dense similarity scores typically fall between 0.0 and 1.0, while sparse keyword scores are unbounded and can easily run from 1 to 100 or more. You cannot simply add them together (e.g., 0.85 + 45.0 = 45.85, where the dense score contributes under 2% of the total). If you do, raw keyword counts will completely drown out the semantic signal.

To tame this apples-to-oranges problem, I started at ingestion: I applied code-aware token multipliers to the sparse vectors before storing them, so that structurally important tokens carry the sparse score and noise tokens barely register:

Token Type                       Weight
Method signatures                 3.0x
Class names                       2.5x
Property names                    2.0x
Generic keywords                  0.3x
Stop words (if, for, return)      0.1x

This adjustment made the database inherently structure-aware before any queries were even executed against it.
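Here is a rough sketch of how those multipliers could be applied while building a sparse vector. It assumes tokens have already been classified upstream, and that tokens with no known type keep a neutral weight of 1.0 (my assumption, not something the table specifies). `SparseWeigher` and `TokenType` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: weighted term frequencies for a sparse vector.
final class SparseWeigher {

    // Multipliers taken from the weight table above.
    enum TokenType {
        METHOD_SIGNATURE(3.0), CLASS_NAME(2.5), PROPERTY_NAME(2.0),
        GENERIC_KEYWORD(0.3), STOP_WORD(0.1);

        final double multiplier;
        TokenType(double multiplier) { this.multiplier = multiplier; }
    }

    // Maps each token to (occurrence count x multiplier).
    // Tokens with no known type keep weight 1.0 per occurrence (an assumption).
    static Map<String, Double> weigh(List<String> tokens, Map<String, TokenType> types) {
        Map<String, Double> weights = new LinkedHashMap<>();
        for (String token : tokens) {
            TokenType type = types.get(token);
            double multiplier = (type == null) ? 1.0 : type.multiplier;
            weights.merge(token, multiplier, Double::sum);
        }
        return weights;
    }
}
```

With these weights, one method-signature token (3.0) outweighs thirty occurrences of `return` (30 × 0.1). A real sparse vector would also map each token to an integer index before upload; that step is omitted here.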

Phase 2: Smarter Retrieval

Next came query handling. Instead of passing a user's raw query directly to the database, I routed it through layers of intelligent filtering first.

1. Intent Detection
First, the system analyzes the query to understand exactly what type of resource the user wants. A helper method, detectQueryType(), uses pre-compiled regex patterns to isolate the specific structural intent.
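A minimal sketch of what such a helper might look like follows. Only the method name `detectQueryType()` comes from the description above. The `QueryType` categories and the regex patterns are illustrative assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical intent detector; categories and patterns are assumptions.
final class QueryIntent {

    enum QueryType { CONTROLLER, REPOSITORY, METHOD, GENERAL }

    // Compiled once at class load; insertion order decides which intent wins on overlap.
    private static final Map<QueryType, Pattern> PATTERNS = new LinkedHashMap<>();
    static {
        PATTERNS.put(QueryType.CONTROLLER,
                Pattern.compile("\\b(controller|endpoint)s?\\b", Pattern.CASE_INSENSITIVE));
        PATTERNS.put(QueryType.REPOSITORY,
                Pattern.compile("\\brepositor(y|ies)\\b", Pattern.CASE_INSENSITIVE));
        PATTERNS.put(QueryType.METHOD,
                Pattern.compile("\\b(method|function)s?\\b", Pattern.CASE_INSENSITIVE));
    }

    // Returns the first matching intent, or GENERAL when nothing matches.
    static QueryType detectQueryType(String query) {
        for (Map.Entry<QueryType, Pattern> entry : PATTERNS.entrySet()) {
            if (entry.getValue().matcher(query).find()) return entry.getKey();
        }
        return QueryType.GENERAL;
    }
}
```

For the earlier example query, "Show me the controller responsible for handling owner registration.", this sketch resolves to `CONTROLLER`.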

2. Filename Extraction and Metadata Filtering
Once the intent is identified, the system extracts target filenames from the query (if mentioned) and builds a metadata filter map. If the intent is a method lookup, the system restricts the database search space exclusively to chunks tagged with symbolType: method/class, completely bypassing validators, repositories, and factories. This prunes irrelevant chunks and slashes latency.
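The sketch below shows one way to build such a filter map. It assumes Pinecone's MongoDB-style operator syntax (`$in`) for metadata filters and a hypothetical `fileName` metadata key; `symbolType` is the tag mentioned above. `MetadataFilters` is an illustrative name.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical filter builder; the "fileName" key is an assumed metadata field.
final class MetadataFilters {

    // Matches tokens such as OwnerController.java inside free text.
    private static final Pattern JAVA_FILE =
            Pattern.compile("\\b[A-Za-z_][A-Za-z0-9_]*\\.java\\b");

    // Collects every Java filename mentioned in the query, in order of appearance.
    static List<String> extractFilenames(String query) {
        List<String> names = new ArrayList<>();
        Matcher matcher = JAVA_FILE.matcher(query);
        while (matcher.find()) names.add(matcher.group());
        return names;
    }

    // Builds a filter map; multiple top-level keys are combined as a logical AND.
    static Map<String, Object> buildFilter(String query, List<String> symbolTypes) {
        Map<String, Object> filter = new LinkedHashMap<>();
        if (!symbolTypes.isEmpty()) {
            filter.put("symbolType", Map.of("$in", symbolTypes));
        }
        List<String> files = extractFilenames(query);
        if (!files.isEmpty()) {
            filter.put("fileName", Map.of("$in", files));
        }
        return filter;
    }
}
```

A query that names no file and has no recognized intent produces an empty map, which leaves the search unfiltered.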

3. Alpha-Scaled Hybrid Search
While hybrid search combines sparse and dense retrieval, a blind combination usually results in one side overpowering the other. To fix this, I optimized the pipeline by tuning the α (alpha) parameter to control the exact balance.

After extensive experimentation, I landed on α = 0.6 as the sweet spot:

• Dense gets 60% weight: Raw embedding coordinates are multiplied by α (0.6), capturing what the developer means.

• Sparse gets 40% weight: Query token frequency scores are scaled by (1 - α) (0.4), keeping results rigidly tied to exact code syntax.

The scaling happens inside the application layer before the payload leaves the app. Pinecone receives both the dense and sparse vectors together in a single payload with the balance pre-calibrated.

Note: α scales the query vectors dynamically on every request, not the stored document vectors. The stored vectors remain fixed by our ingestion-time weights.
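The query-side scaling described above can be sketched as follows. This is a minimal illustration that assumes the sparse query vector is held as an index-to-value map and that the index scores with a dot product, so scaling the query scales each side's contribution linearly. `HybridScaler` and `ScaledQuery` are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: convex weighting of dense and sparse query vectors.
final class HybridScaler {

    // Holds the scaled query vectors that would be sent in one payload.
    record ScaledQuery(float[] dense, Map<Integer, Float> sparse) {}

    // alpha = 1.0 means pure dense search; alpha = 0.0 means pure sparse search.
    static ScaledQuery scale(float[] dense, Map<Integer, Float> sparse, float alpha) {
        if (alpha < 0f || alpha > 1f) {
            throw new IllegalArgumentException("alpha must be in [0, 1]");
        }
        // Dense side: every coordinate is multiplied by alpha.
        float[] scaledDense = new float[dense.length];
        for (int i = 0; i < dense.length; i++) {
            scaledDense[i] = dense[i] * alpha;
        }
        // Sparse side: every value is multiplied by (1 - alpha); the inputs are not mutated.
        Map<Integer, Float> scaledSparse = new LinkedHashMap<>();
        sparse.forEach((index, value) -> scaledSparse.put(index, value * (1f - alpha)));
        return new ScaledQuery(scaledDense, scaledSparse);
    }
}
```

With α = 0.6, a dense coordinate of 0.5 becomes 0.3 and a sparse value of 2.0 becomes 0.8. The resulting score is then 0.6 × (dense match) + 0.4 × (sparse match), with no change to what is stored.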

The Result: 95% Accuracy, Far Fewer Hallucinations

Two shifts made all the difference: breaking code into clean, method-sized chunks, and pairing that with a tightly tuned hybrid search funnel. Together, they pushed retrieval accuracy to 95%.
With such clean context finally reaching the LLM, I felt confident enough to enforce one hard rule in the prompt engineering layer:

"If the snippet isn't explicitly in the provided context, say 'Not Found'. Do not guess."

It worked. Hallucinations plummeted, not because the LLM magically got smarter, but because the retrieval engine became reliable enough to back up that strict constraint. I learned that a system that honestly admits "Not Found" is far more valuable than one that confidently invents code.
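As a rough illustration of how that rule might be wired into a prompt, here is a small sketch. Only the rule text is quoted from above. The prompt layout, the `StrictPrompt` name, and the snippet numbering are assumptions.

```java
import java.util.List;

// Hypothetical prompt assembler; the layout is an assumption, the rule is quoted.
final class StrictPrompt {

    static final String RULE =
            "If the snippet isn't explicitly in the provided context, say 'Not Found'. Do not guess.";

    // Puts the rule first, then the numbered retrieved snippets, then the question.
    static String build(String question, List<String> snippets) {
        StringBuilder prompt = new StringBuilder();
        prompt.append(RULE).append("\n\nContext:\n");
        if (snippets.isEmpty()) {
            prompt.append("(no snippets retrieved)\n");
        }
        for (int i = 0; i < snippets.size(); i++) {
            prompt.append("[").append(i + 1).append("] ").append(snippets.get(i)).append("\n");
        }
        prompt.append("\nQuestion: ").append(question);
        return prompt.toString();
    }
}
```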

Performance Metrics

Metric                     Before       After     Change
Retrieval Accuracy          55%          95%       +73%
Hallucinations              40%          <5%       −87%
Query Latency               45s          12s       3.75x faster
Wrong File Types Returned   60%          <10%      −83%

The Full Architecture

[Figure: GitGrok architecture diagram]

Key Takeaways

• Meaning isn’t structure: Building GitGrok taught me that standard semantic search only looks for the "vibe" of text. Code doesn't work that way. To cut through repository noise, you must lock things down with metadata filters and strict hybrid (vector + keyword) constraints.

• From fragments to graphs: Right now, GitGrok fetches isolated code snippets, but it doesn't fully map how they connect. The next step is building AST-driven (Abstract Syntax Tree) graphs to trace deep dependencies, like tracking an API request all the way from a controller down to the database layer.

• The "Why" matters: If you ask GitGrok why a configuration value is set to 0.6, it can't tell you, because that context lives outside the source code. Future iterations will pull in Git commit histories and PR comments to surface the human decisions behind the lines.

Over to You

Have you run into similar retrieval traps when building RAG systems for highly technical or structured datasets? Let's talk about your chunking and hybrid optimization strategies in the comments below!

DEV Community