Ishaan Mavinkurve

Posted on May 20

Building KernelMind Part 3: Evaluation, Retrieval Ablations, RAGAS, and Turning The Project Into Something Measurable

#rag #showdev #llm #performance

By this point, KernelMind had already evolved far beyond the original “embeddings over code” idea.

The system now had:

AST-aware chunking
fully qualified symbol identities
graph-aware retrieval
hybrid BM25 + embedding search
query-aware graph expansion
cross-encoder reranking
workflow reconstruction
grounded answer synthesis

And honestly, the demos looked pretty convincing, which was kinda scary... because I knew from experience that retrieval systems are extremely easy to overestimate when you only test them manually.

If I asked:

How does login work?

The answer sounded smart enough and my brain immediately started cooperating with the system.

The issue was:

“sounds correct” is not an evaluation strategy.

At some point, I realized I had absolutely no reliable way to answer the question:

Is KernelMind actually improving?

I needed the following:

evaluation
↓
benchmarking
↓
retrieval ablations
↓
RAGAS scoring
↓
precision / recall analysis

Building A Retrieval Benchmark

The first thing I needed was a benchmark suite grounded in the actual repository.

Initially, I made the classic mistake:

"yeah I'll just manually write expected answers"

Terrible idea.

Very quickly I realized that retrieval evaluation only works if the benchmark references:

real indexed chunks
real graph nodes
real repository symbols
real workflows

Otherwise you end up evaluating benchmark inaccuracies instead of retrieval quality. So I started inspecting the actual indexed graph and rebuilding benchmark questions around real repository functions.
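A minimal sketch of that kind of sanity check, assuming the index can be reduced to a set of fully qualified symbol IDs. `BenchmarkCase`, `validate_cases`, and every symbol name here are hypothetical stand-ins, not KernelMind's actual code:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkCase:
    """One benchmark question plus the symbols a good retrieval must return."""
    question: str
    expected_symbols: frozenset


def validate_cases(cases, indexed_symbols):
    """Return {question: missing symbols} for cases that cite unknown symbols."""
    problems = {}
    for case in cases:
        missing = case.expected_symbols - indexed_symbols
        if missing:
            problems[case.question] = sorted(missing)
    return problems


# Hypothetical symbol IDs standing in for the real index.
indexed_symbols = {
    "app.api.routes.login.login_access_token",
    "app.crud.authenticate",
    "app.core.security.create_access_token",
}

cases = [
    BenchmarkCase(
        question="How does login work?",
        expected_symbols=frozenset({
            "app.api.routes.login.login_access_token",
            "app.crud.authenticate",
        }),
    ),
    BenchmarkCase(
        question="How are passwords reset?",
        # Written from memory, so it may not match anything in the index.
        expected_symbols=frozenset({"app.api.routes.login.reset_pwd"}),
    ),
]

problems = validate_cases(cases, indexed_symbols)
print(problems)
```

Any case that shows up in `problems` is a benchmark bug, not a retrieval failure, and should be fixed before scores are trusted.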

The benchmark suite eventually covered things like:

authentication workflows
password reset flows
CRUD operations
dependency injection
database initialization
middleware chains
token generation
API → CRUD traversal

At that point, I finally had something measurable.

Precision vs Recall

Once the benchmark suite existed, the retrieval behavior became much clearer to reason about.

And almost immediately, I noticed a pattern:

KernelMind was actually very good at:

workflow reconstruction
semantic neighborhoods
execution flow retrieval

But precision was messy.

Recall - Is my retriever actually getting all the required chunks for this answer?
Precision - How many of the retrieved chunks are relevant, and which ones are noise?

For example:

Query:
How are users updated?

might retrieve:

create_user()
update_user()
delete_user()
read_users()

Which sounds bad initially.

But interestingly:
the retriever clearly understood the domain correctly.

The remaining problem was: operation specificity. That distinction became really important later.
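To make the two metrics concrete, here is a small sketch that scores the example above. It assumes the benchmark expects only `update_user` for this question, which is my assumption for illustration:

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for a single question."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = retrieved_set & relevant_set
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


retrieved = ["create_user", "update_user", "delete_user", "read_users"]
relevant = ["update_user"]  # assumed ground truth for this question

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
# precision=0.25 recall=1.00
```

Recall is perfect because the right function was found; precision is low because three sibling CRUD functions came along with it.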

The First Ablation Test

This was where I started learning about ablation testing.

An ablation test basically means removing one system component and observing what changes.

The goal is to isolate whether a specific architectural layer is actually contributing measurable value or just making the pipeline look more complicated.

So I started removing pieces of KernelMind individually and rerunning the evaluation benchmarks.
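The shape of such a harness can be sketched with a toy retriever that has a single toggle. Everything below (the call graph, the seed hits, the expected symbols) is hypothetical and only illustrates the mechanics, with one-hop expansion standing in for the real graph traversal:

```python
from statistics import mean

# Hypothetical call graph: caller -> callees.
CALL_GRAPH = {
    "login_access_token": ["authenticate", "create_access_token", "get_settings"],
    "reset_password": ["verify_password_reset_token", "get_password_hash"],
}

# Hypothetical first-stage hits, standing in for BM25 + embedding search.
SEED_HITS = {
    "How does login work?": ["login_access_token"],
    "How are passwords reset?": ["reset_password", "get_password_hash"],
}

# Hypothetical ground truth for each benchmark question.
EXPECTED = {
    "How does login work?": {
        "login_access_token", "authenticate", "create_access_token",
    },
    "How are passwords reset?": {
        "reset_password", "verify_password_reset_token", "get_password_hash",
    },
}


def retrieve(question, use_graph_expansion):
    """Return seed hits, optionally extended by one hop in the call graph."""
    results = list(SEED_HITS[question])
    if use_graph_expansion:
        for symbol in SEED_HITS[question]:
            for callee in CALL_GRAPH.get(symbol, []):
                if callee not in results:
                    results.append(callee)
    return results


def run_benchmark(use_graph_expansion):
    """Average precision and recall over every benchmark question."""
    precisions, recalls = [], []
    for question, expected in EXPECTED.items():
        retrieved = set(retrieve(question, use_graph_expansion))
        hits = retrieved & expected
        precisions.append(len(hits) / len(retrieved))
        recalls.append(len(hits) / len(expected))
    return mean(precisions), mean(recalls)


for flag in (False, True):
    precision, recall = run_benchmark(use_graph_expansion=flag)
    print(f"graph_expansion={flag}: precision={precision:.3f} recall={recall:.3f}")
```

The toy numbers this prints are not KernelMind's results. The point is only that the same benchmark runs twice with exactly one component switched off, so any difference can be attributed to that component.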

The first major test:

graph expansion.

I disabled graph expansion entirely.

WITHOUT Graph Expansion

KernelMind produced:

Precision: 0.267
Recall:    0.722

The retrieval became cleaner.
Less noisy.
More focused.

But:
important workflow nodes started disappearing.

Authentication flows became incomplete.
Password reset chains broke apart.
Execution flow reconstruction weakened significantly.

Then I re-enabled graph expansion.

WITH Graph Expansion

KernelMind produced:

Precision: 0.243
Recall:    1.000

That result gave me measurable evidence that graph traversal was actually improving workflow recovery.

The graph architecture was not decorative complexity anymore.

It was contributing real retrieval value.

And interestingly, the precision drop (0.267 → 0.243) was small compared to the recall improvement (0.722 → 1.000).

System	Precision	Recall
No Graph Expansion	0.267	0.722
Graph Expansion	0.243	1.000

That tradeoff actually makes sense for repository reasoning systems.

Missing workflow-critical chunks is usually worse than retrieving a few extra neighboring functions.
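One way to put a number on that preference is an F-beta score, which weights recall beta times as heavily as precision. This metric is my own illustration rather than something KernelMind reports; the inputs are the figures from the table above:

```python
def f_beta(precision, recall, beta):
    """F-beta score: beta > 1 favors recall, beta < 1 favors precision."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


results = {
    "No Graph Expansion": (0.267, 0.722),
    "Graph Expansion": (0.243, 1.000),
}

for name, (p, r) in results.items():
    print(f"{name}: F1={f_beta(p, r, 1):.3f} F2={f_beta(p, r, 2):.3f}")
```

With equal weighting (F1) the two configurations are nearly tied at about 0.390 and 0.391. Once recall counts double (F2), graph expansion pulls ahead at about 0.616 versus 0.538, which matches the intuition that missing a workflow step costs more than an extra neighbor.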

Cross Encoder Reranking

The next ablation targeted the reranker.

At this point, graph expansion was improving recall significantly, but it also widened the semantic neighborhood too aggressively.

Authentication questions started retrieving:

password reset helpers
email token utilities
related middleware
adjacent auth flows

So I disabled the cross-encoder reranker to isolate its effect.

Almost immediately:
precision degraded further.

The reranker turned out to be extremely good at:

suppressing graph noise
cleaning semantic drift
removing unrelated neighboring chunks
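The reranking step itself can be sketched as "score every (query, chunk) pair jointly, then keep the best few". The real system uses a cross-encoder model for the scoring; `toy_score` below is only a token-overlap stand-in so the example runs without downloading a model, and all names and chunks are hypothetical:

```python
import re
from typing import Callable


def rerank(query: str, chunks: list, score: Callable[[str, str], float], top_k: int) -> list:
    """Score each (query, chunk) pair and keep the top_k highest-scoring chunks."""
    scored = [(score(query, chunk), chunk) for chunk in chunks]
    # Sort on the score only; ties keep their original (pre-rerank) order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


def tokens(text: str) -> set:
    """Lowercased alphabetic tokens; underscores split identifiers apart."""
    return set(re.findall(r"[a-z]+", text.lower()))


def toy_score(query: str, chunk: str) -> float:
    """Stand-in scorer: fraction of query tokens that appear in the chunk."""
    q = tokens(query)
    return len(q & tokens(chunk)) / len(q) if q else 0.0


# Hypothetical candidates after graph expansion, including neighbor noise.
candidates = [
    "def reset_password(token, new_password)",
    "def login_access_token(form_data)",
    "def send_email(to, subject)",
    "def authenticate(email, password)",
]

top = rerank("How does login authenticate a user?", candidates, toy_score, top_k=2)
print(top)
```

Swapping `toy_score` for a real cross-encoder only changes the `score` argument; the password reset and email helpers still fall out of the top results because they score lowest against the query.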

That clarified something important for me. Each retrieval stage now had a very distinct responsibility:

Stage	Responsibility
BM25	lexical precision
embeddings	semantic discovery
graph expansion	workflow recovery
reranking	precision cleanup

That was the point where KernelMind stopped feeling like:

"random retrieval layers stacked together"

and started feeling like an actual retrieval architecture.

Retrieval Window Tuning

Another interesting discovery appeared while evaluating precision - my retrieval window was too large. Initially, KernelMind retrieved around:

8–10 chunks

for many questions.

That improved recall, but precision became diluted because the benchmarks usually expected only:

1–4 relevant chunks

So I started experimenting with smaller retrieval windows.

K = 10

Average Precision: ~0.175
Average Recall:    ~0.824

K = 5

Average Precision: 0.276
Average Recall:    0.720

K = 4

Average Precision: 0.339
Average Recall:    0.711

This was one of the clearest retrieval tradeoffs in the entire project:

Retrieval Size	Precision	Recall
larger K	lower precision	higher recall
smaller K	higher precision	lower recall

And honestly, seeing these tradeoffs emerge experimentally was incredibly satisfying because now retrieval tuning stopped being "vibes-based engineering"

and became measurable system behavior.
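The sweep itself is easy to reproduce on any ranked list. A minimal sketch, where the ranking and the relevant set are hypothetical and chosen only to show the tradeoff:

```python
def precision_recall_at_k(ranked, relevant, k):
    """Precision and recall computed over only the top k ranked results."""
    top = ranked[:k]
    hits = sum(1 for item in top if item in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ranked retrieval output for one question (best first).
ranked = [
    "update_user", "get_user_by_email", "create_user", "get_password_hash",
    "delete_user", "read_users", "get_db", "authenticate",
    "read_user_by_id", "send_email",
]
# Hypothetical ground truth: four chunks are actually needed.
relevant = {"update_user", "get_user_by_email", "get_password_hash", "get_db"}

for k in (10, 5, 4):
    precision, recall = precision_recall_at_k(ranked, relevant, k)
    print(f"K={k}: precision={precision:.2f} recall={recall:.2f}")
```

On this toy ranking, shrinking K from 10 to 4 raises precision while the relevant chunk ranked seventh is lost, which is the same shape as the measured numbers above.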

Integrating RAGAS

Once retrieval stabilized, I finally moved into answer evaluation using RAGAS.

This was another huge shift in mindset.

Because retrieval quality alone does not necessarily guarantee:

grounded explanations
coherent synthesis
faithful generation

So I started evaluating:

faithfulness
answer relevancy
context precision
context recall
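Each of these metrics needs different inputs, so it helps to check the evaluation records before spending judge tokens. The sketch below uses plain dictionaries and does not import RAGAS. The field names follow the single-turn schema of recent RAGAS releases as far as I know, and the per-metric requirements are my reading of its documentation, so treat both as assumptions and check them against the installed version:

```python
# Assumed input fields per metric (verify against your RAGAS version).
REQUIRED_FIELDS = {
    "faithfulness": {"user_input", "response", "retrieved_contexts"},
    "answer_relevancy": {"user_input", "response"},
    "llm_context_precision_without_reference": {
        "user_input", "response", "retrieved_contexts",
    },
    "context_recall": {"user_input", "retrieved_contexts", "reference"},
}


def build_record(question, contexts, answer, reference=None):
    """Package one benchmark question into a single-turn evaluation record."""
    record = {
        "user_input": question,
        "retrieved_contexts": list(contexts),
        "response": answer,
    }
    if reference is not None:
        record["reference"] = reference
    return record


def runnable_metrics(record):
    """Names of the metrics whose required fields are all present."""
    present = set(record)
    return sorted(name for name, fields in REQUIRED_FIELDS.items() if fields <= present)


# Hypothetical record; the context and answer strings are placeholders.
record = build_record(
    question="How does login work?",
    contexts=["def login_access_token(form_data): ..."],
    answer="The login flow begins in login_access_token().",
)
print(runnable_metrics(record))
```

Without a `reference` answer, context recall cannot be scored, which is one more reason the benchmark needs real ground truth per question.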

I made a RAGAS evaluator file, but now I had a dilemma - RAGAS actually uses LLMs to evaluate other LLMs (crazy, I know!).
So I had to give it an API key - but which LLM should I evaluate with? I was on a budget with this side project, so I couldn't jump straight to gpt-5.5, although it is considered the most precise evaluator.

I also could not use Sarvam AI, because that was the LLM generating my answers, and I didn't want a model grading its own output (I don't know for sure if that actually introduces bias, but I didn't want to take my chances!). So I decided to add:
an OpenAI judge with gpt-5-nano
a local Ollama model, qwen2.5:7b

When testing with Ollama, I got my best results, though probably in part because the small 7B-parameter model choked while evaluating my large retrieved code contexts!

Finally, KernelMind produced:

{
    "faithfulness": 0.6080,
    "answer_relevancy": 0.7697,
    "llm_context_precision_without_reference": 0.5962,
    "context_recall": 0.5357
}

Honestly, I was pretty happy with these results considering:

most things, except answer synthesis with Sarvam AI, ran locally
the retrieval pipeline was graph-aware
the system reconstructed workflows instead of isolated chunks
the generation was grounded entirely in retrieved repository context

More importantly:
The generated answers read like grounded, non-hallucinated workflow explanations rather than generic RAG output. For example:

The login flow begins in login_access_token().
The route authenticates the user through crud.authenticate(),
then generates a JWT token using create_access_token(),
which downstream authenticated routes depend on through
FastAPI dependency injection.

That was the moment KernelMind genuinely started feeling like: a repository reasoning assistant instead of vector search over code.

The TUI Phase

And finally:
once the retrieval and generation pipeline stabilized, I wanted a proper interface for interacting with the system.

Could I have built a web app?

Probably.

Did I instead build a terminal UI because I use Linux and enjoy turning every side project into a cyberpunk terminal application?

Absolutely.

KernelMind now runs through a TUI built using:

textual
rich

The interface supports:

conversational repository querying
retrieval visualization
grounded answer display
live workflow exploration
repository loading
indexing feedback

And honestly, interacting with the system through the terminal felt surprisingly natural for this kind of project.

There is something extremely satisfying about asking How does authentication work?

and watching a graph-aware retrieval engine reconstruct repository workflows directly inside the terminal.

Final Thoughts

KernelMind started as:

Repository → Embeddings → Search

It eventually evolved into:

Query
↓
BM25 + Embedding Retrieval
↓
Hybrid Fusion
↓
Graph Expansion
↓
Graph-Aware Ranking
↓
Cross-Encoder Reranking
↓
Context Building
↓
Grounded Answer Generation
↓
Evaluation + RAGAS Benchmarking
↓
Conversational TUI Interface

But honestly, I had never really planned any of these steps. Almost every architectural layer emerged because the previous one failed in some interesting way. And that was probably the most fun part of the project - exploring, engineering my way around problems and learning some new stuff along the way!

GitHub Repository:

https://github.com/IdiotCoffee/kernel-mind

Top comments (1)

Tanvee Deshmukh • May 20

Love this write-up. Every single layer of this architecture was born because the previous layer failed in an interesting way. That's pure engineering.

Moving from raw vector search to an AST-aware, graph-expanded pipeline with strict RAGAS benchmarking is a massive leap. It’s the difference between a generic LLM wrapper and a genuine repository reasoning engine. Bonus points for building a cyberpunk TUI instead of a web app!