DEV Community

Ishaan Mavinkurve


Building KernelMind Part 2: Hybrid Retrieval, Reranking, and Actually Retrieving Useful Code

By the end of the first phase of KernelMind, the repository had stopped behaving like disconnected text. Functions now had identities and relationships attached to them, and the graph architecture was finally stable enough to represent execution flow across the repository.

The next challenge was obvious:

How do I retrieve the right parts of this graph efficiently?

That was where retrieval engineering began.

Initially, I shifted the retrieval pipeline to operate directly on chunks retrieved from FAISS instead of querying raw documents from MongoDB. The idea was fairly simple:

  • use embeddings to retrieve likely entry points
  • then use the graph to reconstruct surrounding execution context

That combination became the foundation of KernelMind’s retrieval pipeline.

The First Retrieval Pipeline

The naive version of retrieval looked roughly like this:

all-MiniLM-L6-v2 + FAISS

I intentionally started lightweight because I wanted fast local experimentation while debugging retrieval behavior. At this stage, I was not trying to build the perfect retriever. I just wanted something fast enough to:

  • retrieve semantically relevant chunks
  • test graph expansion
  • debug execution flow reconstruction
  • and iterate quickly without destroying my laptop
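As a rough illustration of the embedding side of that first pipeline, here is a minimal sketch of dense top-k retrieval. It uses only the standard library and hand-written toy vectors. In the real setup the vectors would come from all-MiniLM-L6-v2 and the search would run inside FAISS. The chunk IDs, vectors, and function names below are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dense_top_k(query_vec, chunk_vecs, k=3):
    """Exhaustively compare the query vector with every chunk vector.

    An exact (flat) vector index does the same comparison, just much faster.
    """
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in chunk_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy 3-dimensional "embeddings" (assumed values, not real model output).
chunk_vecs = {
    "auth/login.py::login": [0.9, 0.1, 0.0],
    "http/request.py::parse_request": [0.2, 0.9, 0.1],
    "http/cookies.py::set_cookie": [0.1, 0.8, 0.3],
}
query_vec = [0.85, 0.2, 0.05]  # pretend embedding of "How does authentication work?"

top = dense_top_k(query_vec, chunk_vecs, k=2)
print(top)
```

The returned chunk IDs are what later stages would treat as entry points into the graph.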

And honestly, embeddings worked reasonably well at first.

Questions like:

How does authentication work?

usually surfaced relevant code. But implementation-heavy queries struggled badly.

For example:

query: cookies

might retrieve semantically similar request-handling logic instead of the actual cookie implementation.

That was the first moment I realized something important:

semantic similarity alone is not enough for repositories.

Because repositories rely heavily on exact operational language, like:

  • imports
  • function names
  • config values
  • error strings
  • middleware identifiers

These are exactly the things embeddings sometimes blur together.

BM25 vs Embeddings

This was where BM25 entered the system. After reading more about BM25, my rough mental model became:

embeddings understand meaning, BM25 understands exact language.

BM25 is a lexical retrieval algorithm that ranks documents using exact token overlap, token rarity, and frequency instead of semantic similarity.
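To make that definition concrete, here is a small, self-contained sketch of Okapi BM25 scoring. The tokenizer, the `k1` and `b` values, and the three toy chunks are assumptions for illustration. The post does not say which BM25 implementation or parameters KernelMind uses.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split on non-alphanumerics, so create_user -> ["create", "user"].
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.doc_freqs = [Counter(tokens) for tokens in self.doc_tokens]
        self.avg_len = sum(len(t) for t in self.doc_tokens) / len(docs)
        n = len(docs)
        # Document frequency: in how many chunks each term appears.
        df = Counter(term for tokens in self.doc_tokens for term in set(tokens))
        # Rare terms get a higher weight (token rarity).
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, index):
        freqs = self.doc_freqs[index]
        length = len(self.doc_tokens[index])
        total = 0.0
        for term in tokenize(query):
            tf = freqs.get(term, 0)
            if tf == 0:
                continue  # no exact token overlap, no contribution
            # Term frequency, saturated by k1 and normalized by chunk length.
            norm = tf + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += self.idf[term] * tf * (self.k1 + 1) / norm
        return total

    def rank(self, query):
        scores = [(i, self.score(query, i)) for i in range(len(self.doc_tokens))]
        return sorted(scores, key=lambda s: s[1], reverse=True)


docs = [
    "def create_user(db, payload): return db.insert(payload)",
    "def update_user(db, user_id, payload): return db.update(user_id, payload)",
    "def delete_user(db, user_id): return db.delete(user_id)",
]
bm25 = BM25(docs)
ranking = bm25.rank("delete_user")
print(ranking)
```

In this toy corpus the token `user` appears in every chunk and so carries little weight, while `delete` appears in only one chunk and dominates the score.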

That turned out to be extremely useful for repositories.

For example:

create_user()
update_user()
delete_user()

all belong to the same semantic neighborhood. But operationally, they are completely different. Embeddings handled such conceptual understanding well.

BM25 handled lexical precision much better.

Neither alone was enough, so KernelMind evolved into hybrid retrieval. Instead of replacing embeddings entirely, I started combining both retrieval signals using Reciprocal Rank Fusion (a fancy term for merging two ranked lists by rank position rather than by raw score).

Reciprocal Rank Fusion (RRF) helped combine both retrieval systems by rewarding chunks that consistently appeared near the top across both FAISS and BM25 results. That gave KernelMind a much more stable retrieval signal than relying on either retriever independently.

The retrieval pipeline slowly evolved into:

Embedding Retrieval + BM25 Retrieval + Reciprocal Rank Fusion

This improved retrieval quality almost immediately. The embedding retriever surfaced semantically relevant chunks. BM25 reinforced exact implementation-level details.

And the fusion layer combined both into a much stronger retrieval baseline.
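The fusion step itself is small. Below is a minimal sketch of Reciprocal Rank Fusion over two ranked lists of chunk IDs. The constant `k=60` is a commonly used default and an assumption here, since the post does not state the value KernelMind uses. The chunk names are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one ordering.

    Each list contributes 1 / (k + rank) per chunk, so chunks that sit near
    the top of several lists accumulate the highest fused score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical results for the query "cookies".
faiss_ranking = ["parse_request", "set_cookie", "get_session"]
bm25_ranking = ["set_cookie", "delete_cookie", "parse_request"]

fused = reciprocal_rank_fusion([faiss_ranking, bm25_ranking])
print(fused)
```

Because only rank positions are used, the cosine similarities and BM25 scores never need to be put on a common scale.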

Graph Expansion Over Retrieved Chunks

Once hybrid retrieval stabilized, I started layering the graph architecture over the retrieved results themselves. This was one of the biggest shifts in the system.

Initially, retrieval still operated mostly on isolated chunks returned from FAISS and BM25.

But repositories rarely store logic in one place.

Authentication systems, for example, are spread across routes, middleware, services, validators, token handlers, configuration, and dependency layers.

Retrieving one isolated chunk was often not enough to reconstruct execution flow.

So instead of treating retrieval results as final answers, I started treating them as entry points into the graph.

The pipeline became:

Retrieve relevant chunks
↓
Expand neighboring execution context
↓
Rank expanded graph nodes

This improved workflow reconstruction dramatically.
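A minimal sketch of the expansion step, assuming the graph is available as a plain adjacency mapping from a function to the functions it calls. The node names, the `max_depth` parameter, and ranking by hop distance are illustrative assumptions, not KernelMind's actual graph schema or ranking logic.

```python
from collections import deque


def expand_seeds(graph, seeds, max_depth=1):
    """Breadth-first expansion from retrieved seed chunks.

    Returns a mapping of node -> hop distance from the nearest seed.
    """
    depth_of = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if depth_of[node] >= max_depth:
            continue  # do not expand beyond the depth budget
        for neighbor in graph.get(node, []):
            if neighbor not in depth_of:
                depth_of[neighbor] = depth_of[node] + 1
                queue.append(neighbor)
    return depth_of


# Hypothetical call graph: caller -> callees.
call_graph = {
    "login_route": ["authenticate", "create_access_token"],
    "authenticate": ["get_user_by_email", "verify_password"],
    "create_access_token": ["jwt_encode"],
}

expanded = expand_seeds(call_graph, ["login_route"], max_depth=1)
# Simplest possible ranking: seeds first, then nodes by hop distance.
ranked_nodes = sorted(expanded, key=expanded.get)
print(ranked_nodes)
```

Raising `max_depth` pulls in more surrounding context at the cost of more noise for later stages to filter out.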

Questions like:

How does login create the access token?

no longer returned disconnected helper functions. The graph expansion layer started surfacing:

  • login routes
  • auth middleware
  • token creation
  • validation flows
  • session handling

as connected execution context. This was the first time I saw genuinely repository-aware chunks coming out of the pipeline.

Integrating the Cross-Encoder

Even hybrid retrieval and the graph architecture from Part 1 still produced noisy candidates. Sibling-operation pollution became a recurring issue:

create_user()
update_user()
delete_user()
read_user()

would cluster together semantically even when only one of them actually answered the question. That was where cross-encoder reranking entered the system. I started using:

cross-encoder/ms-marco-MiniLM-L-6-v2

Initially, I didn't really know how a cross-encoder worked or whether it would be useful, so I researched it. Basically, BM25 scores a chunk against the query by literal lexical overlap (great for exact matches), whereas the cross-encoder would take both:

(query + chunk)

together as a single input and directly predict a relevance score for the pair. That distinction mattered a lot. The reranker became really good at cleaning up semantically adjacent but incorrect retrievals, especially after graph expansion widened the context.

Questions like:

How does login create the access token?

started consistently surfacing the right chunks instead of unrelated utility code nearby in semantic space.

The reranker essentially became a way to restore precision after graph expansion.
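The reranking step can be sketched as a function that builds (query, chunk) pairs and sorts the chunks by whatever score comes back. To keep the example runnable without downloading a model, `toy_score_pairs` is a crude stand-in that only counts word overlap and is not a neural scorer. With sentence-transformers installed, the `predict` method of a `CrossEncoder` loaded with `cross-encoder/ms-marco-MiniLM-L-6-v2` could be passed as `score_pairs` instead. All function names and chunks below are hypothetical.

```python
def rerank(query, chunks, score_pairs, top_n=3):
    """Score every (query, chunk) pair jointly and keep the best top_n."""
    pairs = [(query, chunk) for chunk in chunks]
    scores = score_pairs(pairs)  # one relevance score per pair
    ranked = sorted(zip(chunks, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


def toy_score_pairs(pairs):
    """Stand-in scorer (assumption): fraction of query words found in the chunk."""
    scores = []
    for query, chunk in pairs:
        words = query.lower().split()
        hits = sum(1 for word in words if word in chunk.lower())
        scores.append(hits / max(len(words), 1))
    return scores


# Sibling-style candidates that tend to sit close together in embedding space.
chunks = [
    "def create_user(db, data): ...",
    "def delete_user(db, user_id): ...",
    "def create_access_token(subject, expires_delta): ...",
]

best = rerank("create access token", chunks, toy_score_pairs, top_n=1)
print(best)
```

Keeping the scorer pluggable also makes it easy to compare pipeline output with and without the reranker.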

Choosing The Generation Model

Once retrieval quality became stable enough, I finally started experimenting more seriously with answer generation. I had all these chunks and their metadata, but for a human to make sense of them, they had to be presented in a readable format. This is where LLMs came in.

I tested several local and hosted models during development:

  • GPT-4o-mini
  • GPT-5-nano
  • Qwen2.5-Coder
  • and Sarvam’s absurdly generous free 105B model, which occasionally spoke enough sweet architectural encouragement into my ears for me to add another retrieval layer at 2 AM.

Eventually, Sarvam's 105B-parameter model became the primary generation model because it gave me very good results FOR FREE and did not try to fry my GPU like the local models.

How the Architecture Changed

Originally, KernelMind looked something like this:

Embeddings → Retrieval → Answer

Eventually, it evolved into:

Query
↓
BM25 Retrieval + Embedding Retrieval
↓
Reciprocal Rank Fusion (RRF)
↓
Query-Aware Seed Reranking
↓
Graph Expansion + Graph-Aware Ranking
↓
Cross-Encoder Reranking
↓
Context Building
↓
Answer Generation

But none of this architecture was pre-planned. Almost every layer was built because I observed failures in the layer before it:

  • embeddings missed identifiers
  • retrieval lost workflow context
  • graph expansion introduced noise
  • re-ranking restored precision
  • orchestration improved grounding
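One way to make per-layer failures like these visible is to run the stages as an ordered list and record how many candidates survive each one. The sketch below assumes every stage is a function of `(query, candidates)`. The stage functions are toy placeholders with hypothetical names, not KernelMind's actual code.

```python
def run_pipeline(query, stages):
    """Run stages in order, recording the candidate count after each one."""
    candidates = []
    trace = []
    for name, stage in stages:
        candidates = stage(query, candidates)
        trace.append((name, len(candidates)))
    return candidates, trace


def toy_retrieve(query, _candidates):
    # Placeholder for hybrid retrieval plus fusion.
    return ["login_route", "parse_request"]


def toy_expand(query, candidates):
    # Placeholder for graph expansion: add direct neighbors of each seed.
    neighbors = {"login_route": ["create_access_token"]}
    expanded = list(candidates)
    for node in candidates:
        for neighbor in neighbors.get(node, []):
            if neighbor not in expanded:
                expanded.append(neighbor)
    return expanded


def toy_rerank(query, candidates):
    # Placeholder for reranking: keep only candidates that look relevant.
    return [c for c in candidates if "login" in c or "token" in c]


stages = [("retrieve", toy_retrieve), ("expand", toy_expand), ("rerank", toy_rerank)]
results, trace = run_pipeline("How does login create the access token?", stages)
print(results)
print(trace)
```

A trace like this shows at a glance where expansion widens the candidate set and where reranking narrows it again.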

After a bit of fine-tuning and prompt engineering, the final answers started looking like this:

Q. How is login handled in the fastapi library?
A. The login flow begins in `login_access_token()` inside `backend/app/api/routes/login.py`.

When a POST request is sent to the login endpoint, FastAPI injects the submitted credentials through `OAuth2PasswordRequestForm`. The route then calls `crud.authenticate()` to validate the username and password against the database.

If authentication fails or the user is inactive, the API raises an HTTP 400 error. If authentication succeeds, the system generates a JWT access token using `security.create_access_token()`. The token includes the user ID and an expiration time configured through `ACCESS_TOKEN_EXPIRE_MINUTES`.

Finally, the endpoint returns a `Token` response containing the generated access token.

The retrieved workflow also shows that authenticated endpoints like `test_token()` depend on the validity of this token through FastAPI dependency injection, linking token generation directly to downstream protected routes.


My project evolved incrementally through debugging and experimentation rather than some giant architectural master plan. And once answer generation stabilized, a much harder question appeared:

How do I actually KNOW whether the system is improving?

Because retrieval systems are easy to overestimate when you only test them manually. That eventually led into the next phase of the project:

  • evaluation
  • RAGAS benchmarking
  • retrieval ablations

and figuring out whether the architecture changes were genuinely improving the system or just looking impressive during demos.

Top comments (1)

Max Quimby

The BM25 + dense + RRF + cross-encoder stack is the architecture I keep returning to for code retrieval — it's almost the default at this point, and your post is one of the clearest write-ups I've seen of why each piece is load-bearing rather than just "we threw rerankers at it."

A few things worth pulling out for anyone following along:

  • The "exact operational language" point is huge. Pure embeddings lose on identifier-heavy queries because getUserById and fetch_user are semantically identical to a sentence-trained encoder. BM25 grounds you in the actual symbols the user typed.
  • Chunking strategy matters as much as retrieval. Splitting by AST node (function/class/method) rather than fixed token windows changed our recall@10 by ~15 points on a code-search benchmark. Otherwise you get half-functions in the same chunk as half of an unrelated import block.
  • Graph expansion has a recall/precision knife-edge. Expand too aggressively and you swamp the cross-encoder with noise; expand too little and you lose call-site context. We ended up scoring expansion edges separately so we could tune that depth per query type.

Looking forward to Part 3 — are you planning to address cross-repo retrieval (where the call graph crosses package boundaries), or staying within single-repo scope?