By the end of the first phase of KernelMind, the repository had stopped behaving like disconnected text. Functions now had identities and relationships attached to them. The graph architecture was finally stable enough to represent execution flow across the repository.
The next challenge was obvious:
How do I retrieve the right parts of this graph efficiently?
That was where retrieval engineering began.
Initially, I shifted the retrieval pipeline to operate directly on chunks retrieved from FAISS instead of querying raw documents from MongoDB. The idea was fairly simple:
- use embeddings to retrieve likely entry points
- then use the graph to reconstruct surrounding execution context
That combination became the foundation of KernelMind’s retrieval pipeline.
The First Retrieval Pipeline
The naive version of retrieval looked roughly like this:
all-MiniLM-L6-v2 + FAISS
I intentionally started lightweight because I wanted fast local experimentation while debugging retrieval behavior. At this stage, I was not trying to build the perfect retriever. I just wanted something fast enough to:
- retrieve semantically relevant chunks
- test graph expansion
- debug execution flow reconstruction
- and iterate quickly without destroying my laptop
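The shape of that first pipeline can be sketched without the real libraries. This is a minimal sketch, not KernelMind's code: `embed` is a toy bag-of-words stand-in for all-MiniLM-L6-v2 (so it is not actually semantic), `search` is a brute-force cosine search standing in for a FAISS index, and every name and sample chunk is hypothetical.

```python
import math
from collections import Counter

def embed(text: str, vocab: list[str]) -> list[float]:
    """Toy stand-in for a sentence encoder: L2-normalised term counts."""
    counts = Counter(text.lower().split())
    vec = [float(counts[term]) for term in vocab]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid dividing by zero
    return [v / norm for v in vec]

def search(query_vec: list[float], chunk_vecs: list[list[float]], k: int) -> list[int]:
    """Brute-force cosine search (vectors are already normalised)."""
    scored = [
        (sum(q * c for q, c in zip(query_vec, vec)), idx)
        for idx, vec in enumerate(chunk_vecs)
    ]
    # Highest similarity first; ties broken by chunk index for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]

# Hypothetical chunk texts, one per indexed function.
chunks = [
    "def login route checks password and returns token",
    "def set cookie on response with session id",
    "def parse request headers and body",
]
vocab = sorted({word for chunk in chunks for word in chunk.split()})
chunk_vecs = [embed(chunk, vocab) for chunk in chunks]

top = search(embed("set session cookie", vocab), chunk_vecs, k=2)
```

The returned indices are only entry points; in the design described here, the graph is what reconstructs the surrounding context.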
And honestly, embeddings worked reasonably well at first.
Questions like:
How does authentication work?
usually surfaced relevant code. But implementation-heavy queries struggled badly.
For example:
query: cookies
might retrieve semantically similar request-handling logic instead of the actual cookie implementation.
That was the first moment I realized something important:
semantic similarity alone is not enough for repositories.
Because repositories rely heavily on exact operational language, like:
* imports
* function names
* config values
* error strings
* middleware identifiers
Things embeddings sometimes blur together semantically.
BM25 vs Embeddings
This was where BM25 entered the system. After reading more about BM25, my rough mental model became:
embeddings understand meaning, BM25 understands exact language.
BM25 is a lexical retrieval algorithm that ranks documents using exact token overlap, token rarity, and frequency instead of semantic similarity.
That turned out to be extremely useful for repositories.
For example:
create_user()
update_user()
delete_user()
all belong to the same semantic neighborhood. But operationally, they are completely different. Embeddings handled the conceptual grouping well.
BM25 handled lexical precision much better.
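To make the lexical side concrete, here is a minimal pure-Python sketch of Okapi BM25 scoring. It is an illustration under assumptions, not the implementation KernelMind uses: the tokenizer, the `k1` and `b` values (common defaults), and the three sample chunks are all hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    # Keep identifiers such as delete_user intact as a single token.
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text.lower())

def bm25_scores(query: str, docs: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many chunks does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Rare terms get a larger inverse-document-frequency weight.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalised by chunk length.
            denom = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

docs = [
    "def create_user(db, payload): return db.add(User(**payload))",
    "def update_user(db, user_id, payload): return db.merge(user_id, payload)",
    "def delete_user(db, user_id): return db.remove(user_id)",
]
scores = bm25_scores("delete_user", docs)
best = scores.index(max(scores))
```

Because `delete_user` is an exact token that appears in only one chunk, the two sibling functions score zero, which is the kind of separation an embedding model tends not to give.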
Neither alone was enough, so KernelMind evolved into hybrid retrieval. Instead of replacing embeddings entirely, I started combining both retrieval signals using Reciprocal Rank Fusion (RRF), which merges ranked lists using each chunk's rank position rather than its raw score.
RRF rewards chunks that consistently appear near the top of both the FAISS and BM25 results.
That gave KernelMind a much more stable retrieval signal than relying on either retriever independently.
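The fusion step itself is small. In RRF, each chunk's fused score is the sum over retrievers of 1 / (k + rank), where rank is the chunk's 1-based position in that retriever's list. The sketch below assumes k = 60, a commonly used default rather than a value confirmed for KernelMind, and the chunk IDs are hypothetical.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of chunk IDs into one scored ranking."""
    fused: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            # A chunk missing from a list simply contributes nothing for it.
            fused[chunk_id] = fused.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))

faiss_ranking = ["auth_route", "request_utils", "cookie_impl"]
bm25_ranking = ["cookie_impl", "auth_route", "settings"]

fused = reciprocal_rank_fusion([faiss_ranking, bm25_ranking])
fused_ids = [chunk_id for chunk_id, _ in fused]
```

Chunks that appear in both lists (`auth_route`, `cookie_impl`) end up ahead of chunks that only one retriever found, and no score normalisation between FAISS distances and BM25 scores is needed.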
The retrieval pipeline slowly evolved into:
Embedding Retrieval + BM25 Retrieval + Reciprocal Rank Fusion
This improved retrieval quality almost immediately. The embedding retriever surfaced semantically relevant chunks. BM25 reinforced exact implementation-level details.
And the fusion layer combined both into a much stronger retrieval baseline.
Graph Expansion Over Retrieved Chunks
Once hybrid retrieval stabilized, I started layering the graph architecture over the retrieved results themselves. This was one of the biggest shifts in the system.
Initially, retrieval still operated mostly on isolated chunks returned from FAISS and BM25.
But repositories rarely store logic in one place.
Authentication systems, for example, are spread across routes, middleware, services, validators, token handlers, configuration, and dependency layers.
Retrieving one isolated chunk was often not enough to reconstruct execution flow.
So instead of treating retrieval results as final answers, I started treating them as entry points into the graph.
The pipeline became:
Retrieve relevant chunks
↓
Expand neighboring execution context
↓
Rank expanded graph nodes
This improved workflow reconstruction dramatically.
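The expansion step can be sketched as a bounded breadth-first walk from the retrieved seeds. This is a minimal sketch under assumptions: the graph is a plain adjacency dict of caller to callees, the two-hop limit is arbitrary, ranking by hop distance is a placeholder for whatever graph-aware ranking KernelMind really uses, and the function names are hypothetical.

```python
from collections import deque

def expand_from_seeds(
    graph: dict[str, list[str]], seeds: list[str], max_hops: int = 2
) -> dict[str, int]:
    """Return each reachable node with its hop distance from the nearest seed."""
    distance = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if distance[node] >= max_hops:
            continue  # do not expand past the hop budget
        for neighbor in graph.get(node, []):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance

# Hypothetical call graph: caller -> callees.
call_graph = {
    "login_route": ["authenticate", "create_access_token"],
    "authenticate": ["get_user_by_email", "verify_password"],
    "create_access_token": ["jwt_encode"],
    "get_user_by_email": ["db_session"],
}

expanded = expand_from_seeds(call_graph, ["login_route"], max_hops=2)
# Placeholder ranking: closer nodes first, then alphabetical.
ranked = sorted(expanded, key=lambda name: (expanded[name], name))
```

The hop limit is what keeps expansion from pulling in the whole repository: `db_session` sits three hops from the seed and is left out.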
Questions like:
How does login create the access token?
no longer returned disconnected helper functions. The graph expansion layer started surfacing:
* login routes
* auth middleware
* token creation
* validation flows
* session handling
as connected execution context. This was the first time I saw genuinely repository-aware chunks coming out of the pipeline.
Integrating the Cross Encoder
Even hybrid retrieval combined with my powerful graph architecture (from the first blog post) still produced noisy candidates. Sibling-operation pollution became a recurring issue:
create_user()
update_user()
delete_user()
read_user()
would cluster together semantically even when only one of them actually answered the question. That was where cross-encoder reranking entered the system. I started using:
cross-encoder/ms-marco-MiniLM-L-6-v2
Initially, I didn't really know how a cross-encoder worked or whether it would be useful. After some research, the distinction became clear: BM25 scores a chunk by its literal lexical overlap with the query (great for exact matches), and the embedding retriever compares vectors that were computed for the query and the chunk separately. A cross-encoder instead feeds both texts through the model together as a single input:
(query + chunk)
and directly predicts a relevance score for the pair. That distinction mattered a lot. The reranker became really good at cleaning up semantically adjacent but incorrect retrievals, especially after graph expansion widened the context.
Questions like:
How does login create the access token?
started consistently surfacing the right chunks instead of unrelated utility code nearby in semantic space.
The reranker essentially became a way to restore precision after graph expansion.
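The reranking step can be sketched independently of the model. In this minimal sketch the scorer is injected as a function that takes (query, chunk) pairs and returns one score per pair. `toy_scorer` is a hypothetical word-overlap stand-in so the example runs without downloading a model; with sentence-transformers installed, the `predict` method of `CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")` should fit the same slot, since it accepts a list of text pairs and returns scores.

```python
from typing import Callable, Sequence

# Any function that maps (query, chunk) pairs to one relevance score each.
PairScorer = Callable[[Sequence[tuple[str, str]]], Sequence[float]]

def rerank(query: str, chunks: list[str], score_pairs: PairScorer, top_k: int = 3) -> list[str]:
    """Score every (query, chunk) pair jointly and keep the best top_k chunks."""
    pairs = [(query, chunk) for chunk in chunks]
    scores = score_pairs(pairs)
    # Highest score first; ties keep the original candidate order.
    order = sorted(range(len(chunks)), key=lambda i: (-scores[i], i))
    return [chunks[i] for i in order[:top_k]]

def toy_scorer(pairs: Sequence[tuple[str, str]]) -> list[float]:
    """Hypothetical stand-in for a neural model: share of query words in the chunk."""
    scores = []
    for query, chunk in pairs:
        query_words = set(query.lower().split())
        cleaned = chunk.lower().replace("_", " ").replace("(", " ").replace(")", " ")
        scores.append(len(query_words & set(cleaned.split())) / len(query_words))
    return scores

# Sibling operations that might cluster together after graph expansion.
candidates = [
    "delete_user(user_id)",
    "create_access_token(user_id)",
    "read_user(user_id)",
]
top = rerank("create access token", candidates, toy_scorer, top_k=1)
```

Because every candidate costs one model call, running this only on the fused and expanded shortlist, rather than on the whole index, keeps the step affordable.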
Choosing The Generation Model
Once retrieval quality became stable enough, I finally started experimenting more seriously with answer generation. I had all these chunks and their metadata, but for a human to make sense of them, they had to be presented in a readable format. This is where LLMs came in.
I tested several local and hosted models during development:
- GPT-4o-mini
- GPT-5-nano
- Qwen2.5-Coder
- and Sarvam’s absurdly generous free 105B model, which occasionally spoke enough sweet architectural encouragement into my ears for me to add another retrieval layer at 2 AM.
Eventually, Sarvam's 105B-parameter model became the primary generation model because it gave me very good results FOR FREE and did not try to fry my GPU like the local models.
How the Architecture Changed
Originally, KernelMind looked something like this:
Embeddings → Retrieval → Answer
Eventually, it evolved into:
Query
↓
BM25 Retrieval + Embedding Retrieval
↓
Reciprocal Rank Fusion (RRF)
↓
Query-Aware Seed Reranking
↓
Graph Expansion + Graph-Aware Ranking
↓
Cross-Encoder Reranking
↓
Context Building
↓
Answer Generation
But none of this architecture was pre-planned. Almost every layer was built because I observed a failure in the previous one:
- embeddings missed identifiers
- retrieval lost workflow context
- graph expansion introduced noise
- re-ranking restored precision
- orchestration improved grounding
After a bit of tuning and prompt engineering, the final answers started looking like this:
Q. How is login handled in the fastapi library?
A. The login flow begins in `login_access_token()`
inside `backend/app/api/routes/login.py`.
When a POST request is sent to the login endpoint,
FastAPI injects the submitted credentials through
`OAuth2PasswordRequestForm`. The route then calls
`crud.authenticate()` to validate the username and
password against the database.
If authentication fails or the user is inactive, the
API raises an HTTP 400 error. If authentication
succeeds, the system generates a JWT access token
using `security.create_access_token()`. The token
includes the user ID and an expiration time
configured through `ACCESS_TOKEN_EXPIRE_MINUTES`.
Finally, the endpoint returns a `Token` response
containing the generated access token.
The retrieved workflow also shows that authenticated
endpoints like `test_token()` depend on the
validity of this token through FastAPI dependency
injection, linking token generation directly to
downstream protected routes.
My project evolved incrementally through debugging and experimentation rather than some giant architectural master plan. And once answer generation stabilized, a much harder question appeared:
How do I actually KNOW whether the system is improving?
Because retrieval systems are easy to overestimate when you only test them manually. That eventually led into the next phase of the project:
- evaluation
- RAGAS benchmarking
- retrieval ablations
and figuring out whether the architecture changes were genuinely improving the system or just looking impressive during demos.
Top comments (1)
The BM25 + dense + RRF + cross-encoder stack is the architecture I keep returning to for code retrieval — it's almost the default at this point, and your post is one of the clearest write-ups I've seen of why each piece is load-bearing rather than just "we threw rerankers at it."
A few things worth pulling out for anyone following along:
`getUserById` and `fetch_user` are semantically identical to a sentence-trained encoder. BM25 grounds you in the actual symbols the user typed.
Looking forward to Part 3. Are you planning to address cross-repo retrieval (where the call graph crosses package boundaries), or staying within single-repo scope?