By this point, KernelMind had already evolved far beyond the original “embeddings over code” idea.
The system now had:
- AST-aware chunking
- fully qualified symbol identities
- graph-aware retrieval
- hybrid BM25 + embedding search
- query-aware graph expansion
- cross-encoder reranking
- workflow reconstruction
- grounded answer synthesis
And honestly, the demos looked pretty convincing, which was kinda scary... because I knew from experience that retrieval systems are extremely easy to overestimate when you only test them manually.
If I asked:
How does login work?
The answer sounded smart enough and my brain immediately started cooperating with the system.
The issue was:
“sounds correct” is not an evaluation strategy.
At some point, I realized I had absolutely no reliable way to answer the question:
Is KernelMind actually improving?
I needed the following:
evaluation
↓
benchmarking
↓
retrieval ablations
↓
RAGAS scoring
↓
precision / recall analysis
Building A Retrieval Benchmark
The first thing I needed was a benchmark suite grounded in the actual repository.
Initially, I made the classic mistake:
"yeah I'll just manually write expected answers"
Terrible idea.
Very quickly I realized that retrieval evaluation only works if the benchmark references:
- real indexed chunks
- real graph nodes
- real repository symbols
- real workflows
Otherwise you end up evaluating benchmark inaccuracies instead of retrieval quality. So I started inspecting the actual indexed graph and rebuilding benchmark questions around real repository functions.
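A minimal sketch of that grounding check, assuming each benchmark case lists the fully qualified symbols it expects and the index can be exported as a set of symbol names. `BenchmarkCase`, `find_unknown_symbols`, the module paths, and the `refresh_token` case are all hypothetical names for illustration, not KernelMind's actual code:

```python
from dataclasses import dataclass, field


@dataclass
class BenchmarkCase:
    """One benchmark question plus the symbols a good retrieval must return."""
    question: str
    expected_symbols: set[str] = field(default_factory=set)


def find_unknown_symbols(cases, indexed_symbols):
    """Return {question: expected symbols that are missing from the index}."""
    problems = {}
    for case in cases:
        missing = case.expected_symbols - indexed_symbols
        if missing:
            problems[case.question] = missing
    return problems


# Hypothetical index contents and benchmark cases, for illustration only.
indexed_symbols = {
    "app.api.routes.login.login_access_token",
    "app.crud.authenticate",
    "app.core.security.create_access_token",
}
cases = [
    BenchmarkCase(
        "How does login work?",
        {"app.api.routes.login.login_access_token", "app.crud.authenticate"},
    ),
    # This case references a symbol the index does not contain.
    BenchmarkCase("How are tokens refreshed?", {"app.core.security.refresh_token"}),
]

problems = find_unknown_symbols(cases, indexed_symbols)
print(problems)
```

Any case that shows up in `problems` is measuring a benchmark mistake rather than retrieval quality, so it should be fixed or dropped before scoring.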
The benchmark suite eventually covered things like:
- authentication workflows
- password reset flows
- CRUD operations
- dependency injection
- database initialization
- middleware chains
- token generation
- API → CRUD traversal
At that point, I finally had something measurable.
Precision vs Recall
Once the benchmark suite existed, the retrieval behavior became much clearer to reason about.
And almost immediately, I noticed a pattern:
KernelMind was actually very good at:
- workflow reconstruction
- semantic neighborhoods
- execution flow retrieval
But precision was messy.
Recall - Is my retriever actually getting all the chunks this answer requires?
Precision - What fraction of the retrieved chunks are actually relevant rather than noise?
For example:
Query:
How are users updated?
might retrieve:
create_user()
update_user()
delete_user()
read_users()
Which sounds bad initially.
But interestingly:
the retriever clearly understood the domain correctly.
The remaining problem was: operation specificity. That distinction became really important later.
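The two metrics can be sketched for the example above. This assumes the benchmark marks only `update_user` as relevant for that query, which is an illustrative guess at the ground truth rather than the real benchmark entry:

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall over chunk (or symbol) identifiers."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = retrieved_set & relevant_set
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


retrieved = ["create_user", "update_user", "delete_user", "read_users"]
relevant = ["update_user"]  # assumed ground truth for "How are users updated?"

precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)  # 0.25 1.0
```

Recall is perfect because the one required chunk was found. Precision is low because three of the four results belong to the right domain but the wrong operation.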
The First Ablation Test
This was where I started learning about ablation testing.
An ablation test basically means removing one system component and observing what changes.
The goal is to isolate whether a specific architectural layer is actually contributing measurable value or just making the pipeline look more complicated.
So I started removing pieces of KernelMind individually and rerunning the evaluation benchmarks.
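A rough sketch of what such an ablation harness could look like, assuming the pipeline exposes on/off flags per stage. `PipelineConfig`, `run_ablation`, and `toy_retrieve` are hypothetical stand-ins, and the toy numbers are not KernelMind's results:

```python
from dataclasses import dataclass, replace
from statistics import mean


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical feature flags, one per retrieval stage."""
    use_graph_expansion: bool = True
    use_reranker: bool = True


def run_ablation(retrieve, benchmark, baseline, flag):
    """Score the baseline and a variant that has `flag` switched off."""
    variants = {
        "baseline": baseline,
        f"without_{flag}": replace(baseline, **{flag: False}),
    }
    results = {}
    for label, config in variants.items():
        precisions, recalls = [], []
        for question, relevant in benchmark:
            retrieved = set(retrieve(question, config))
            hits = retrieved & relevant
            precisions.append(len(hits) / len(retrieved) if retrieved else 0.0)
            recalls.append(len(hits) / len(relevant) if relevant else 0.0)
        results[label] = (mean(precisions), mean(recalls))
    return results


def toy_retrieve(question, config):
    """Stand-in retriever: graph expansion adds neighboring functions."""
    chunks = ["login_access_token"]
    if config.use_graph_expansion:
        chunks += ["authenticate", "create_access_token", "get_current_user"]
    return chunks


benchmark = [
    ("How does login work?",
     {"login_access_token", "authenticate", "create_access_token"}),
]
results = run_ablation(toy_retrieve, benchmark, PipelineConfig(), "use_graph_expansion")
for label, (p, r) in results.items():
    print(f"{label}: precision={p:.3f} recall={r:.3f}")
```

Keeping every other flag fixed while flipping one is what makes the comparison attributable to that single component.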
The first major test:
graph expansion.
I disabled graph expansion entirely.
WITHOUT Graph Expansion
KernelMind produced:
Precision: 0.267
Recall: 0.722
The retrieval became cleaner.
Less noisy.
More focused.
But:
important workflow nodes started disappearing.
Authentication flows became incomplete.
Password reset chains broke apart.
Execution flow reconstruction weakened significantly.
Then I re-enabled graph expansion.
WITH Graph Expansion
KernelMind produced:
Precision: 0.243
Recall: 1.000
That result gave me measurable evidence that graph traversal was actually improving workflow recovery.
The graph architecture was not decorative complexity anymore.
It was contributing real retrieval value.
And interestingly, the precision drop was relatively small compared to the recall improvement.
| System | Precision | Recall |
|---|---|---|
| No Graph Expansion | 0.267 | 0.722 |
| Graph Expansion | 0.243 | 1.000 |
That tradeoff actually makes sense for repository reasoning systems.
Missing workflow-critical chunks is usually worse than retrieving a few extra neighboring functions.
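One way to make "recall matters more here" concrete is an F-beta score with beta = 2, which weights recall more heavily than precision. This is an illustrative add-on rather than a metric from the original evaluation, and the choice of beta is an assumption:

```python
def f_beta(precision, recall, beta=2.0):
    """F-beta score; beta > 1 favors recall, beta < 1 favors precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision / recall pairs taken from the ablation table above.
without_graph = f_beta(0.267, 0.722)
with_graph = f_beta(0.243, 1.000)

print(f"{without_graph:.3f} {with_graph:.3f}")  # 0.538 0.616
```

Under this recall-weighted view, graph expansion still comes out ahead despite the lower precision.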
Cross Encoder Reranking
The next ablation targeted the reranker.
At this point, graph expansion was improving recall significantly, but it also widened the semantic neighborhood too aggressively.
Authentication questions started retrieving:
- password reset helpers
- email token utilities
- related middleware
- adjacent auth flows
So I disabled the cross-encoder reranker to isolate its effect.
Almost immediately:
precision degraded further.
The reranker turned out to be extremely good at:
- suppressing graph noise
- cleaning semantic drift
- removing unrelated neighboring chunks
That clarified something important for me. Each retrieval stage now had a very distinct responsibility:
| Stage | Responsibility |
|---|---|
| BM25 | lexical precision |
| embeddings | semantic discovery |
| graph expansion | workflow recovery |
| reranking | precision cleanup |
That was the point where KernelMind stopped feeling like:
"random retrieval layers stacked together"
and started feeling like an actual retrieval architecture.
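The reranking stage can be sketched as "score every (query, chunk) pair, keep the top few". In the sketch below the scorer is a plain token-overlap function standing in for a real cross-encoder, so the example runs without downloading a model. The `rerank` helper and the `reset_password` / `send_reset_email` chunks are hypothetical:

```python
import re
from typing import Callable


def rerank(query: str, chunks: list[str],
           score_pair: Callable[[str, str], float], top_k: int) -> list[str]:
    """Score each (query, chunk) pair and keep the top_k highest-scoring chunks."""
    scored = [(score_pair(query, chunk), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


def token_overlap(query: str, chunk: str) -> float:
    """Toy scorer (NOT a cross-encoder): share of query words found in the chunk."""
    query_tokens = set(re.findall(r"[a-z]+", query.lower()))
    chunk_tokens = set(re.findall(r"[a-z]+", chunk.lower()))
    return len(query_tokens & chunk_tokens) / len(query_tokens) if query_tokens else 0.0


# Candidates as they might look after graph expansion widened the neighborhood.
candidates = [
    "def reset_password(token, new_password): ...",
    "def login_access_token(form_data): ...",
    "def send_reset_email(email_to, token): ...",
]

top = rerank("How does login work?", candidates, token_overlap, top_k=1)
print(top)
```

In a real pipeline, `score_pair` would wrap the cross-encoder's pairwise relevance score; the surrounding sort-and-truncate logic stays the same.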
Retrieval Window Tuning
Another interesting discovery appeared while evaluating precision - my retrieval window was too large. Initially, KernelMind retrieved around:
8–10 chunks
for many questions.
That improved recall, but precision became diluted because the benchmarks usually expected only:
1–4 relevant chunks
So I started experimenting with smaller retrieval windows.
K = 10
Average Precision: ~0.175
Average Recall: ~0.824
K = 5
Average Precision: 0.276
Average Recall: 0.720
K = 4
Average Precision: 0.339
Average Recall: 0.711
This was one of the clearest retrieval tradeoffs in the entire project:
| Retrieval Size | Precision | Recall |
|---|---|---|
| larger K | lower precision | higher recall |
| smaller K | higher precision | lower recall |
And honestly, seeing these tradeoffs emerge experimentally was incredibly satisfying because now retrieval tuning stopped being "vibes-based engineering"
and became measurable system behavior.
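The K sweep itself is simple to sketch: cut the ranked list at K and recompute both metrics. The ranked list and ground truth below are invented for illustration and are not benchmark data:

```python
def precision_recall_at_k(ranked, relevant, k):
    """Precision and recall computed over only the first k ranked chunks."""
    top_k = ranked[:k]
    hits = sum(1 for chunk in top_k if chunk in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ranked output for one question, best match first.
ranked = [
    "update_user", "read_users", "get_user_by_email", "create_user",
    "delete_user", "authenticate", "get_password_hash", "update_password",
    "read_user_by_id", "register_user",
]
relevant = {"update_user", "get_password_hash"}  # assumed ground truth

sweep = {k: precision_recall_at_k(ranked, relevant, k) for k in (10, 5, 4)}
for k, (p, r) in sweep.items():
    print(f"K={k}: precision={p:.3f} recall={r:.3f}")
```

Averaging these per-question values across the whole benchmark gives the kind of per-K summary shown above.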
Integrating RAGAS
Once retrieval stabilized, I finally moved into answer evaluation using RAGAS.
This was another huge shift in mindset.
Because retrieval quality alone does not necessarily guarantee:
- grounded explanations
- coherent synthesis
- faithful generation
So I started evaluating:
- faithfulness
- answer relevancy
- context precision
- context recall
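Of these, faithfulness is the easiest to reason about by hand. Going by the RAGAS documentation, it is the number of claims in the answer that the retrieved context supports, divided by the total number of claims. The sketch below only shows that final ratio: in RAGAS an LLM judge extracts the claims and produces the verdicts, whereas here the verdicts are hard-coded and the third claim is a made-up unsupported statement:

```python
def faithfulness(verdicts):
    """Fraction of answer claims that the judge marked as supported by context."""
    if not verdicts:
        return 0.0
    return sum(verdicts.values()) / len(verdicts)


# Hypothetical judge verdicts: claim -> supported by retrieved context?
verdicts = {
    "The login flow begins in login_access_token().": True,
    "The route authenticates the user through crud.authenticate().": True,
    "Tokens are stored in Redis.": False,  # invented, unsupported claim
}

score = faithfulness(verdicts)
print(round(score, 3))
```

Because the verdicts come from a judge model, the score depends on how well that model reads long code contexts, which is why the choice of judge matters.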
I made a RAGAS evaluator file, but now I had a dilemma - RAGAS actually uses LLMs to evaluate other LLMs (crazy, I know!)
So, I had to give it an API key - but which LLM should I evaluate with? I was on a budget here with my side project, so I couldn't move directly to gpt-5.5, although it is considered the most precise evaluator.
I also could not use Sarvam AI - because that was the LLM generating my answers, and I didn't really want any bias here (I don't know for sure if that's how it works, but I didn't want to take my chances!). So I decided to add:
an OpenAI judge with gpt-5-nano
and a local Ollama model - Qwen2.5 7B
When testing with Ollama, I got my best scores, though probably in part because a small 7B-parameter model struggles to properly judge my large retrieved code contexts!
Finally, KernelMind produced:
{
  "faithfulness": 0.6080,
  "answer_relevancy": 0.7697,
  "llm_context_precision_without_reference": 0.5962,
  "context_recall": 0.5357
}
Honestly, I was pretty happy with these results considering:
- almost everything, except answer synthesis with Sarvam AI, ran locally
- the retrieval pipeline was graph-aware
- the system reconstructed workflows instead of isolated chunks
- the generation was grounded entirely in retrieved repository context
More importantly:
The generated answers read like grounded, workflow-level explanations without hallucinations, rather than generic RAG output:
The login flow begins in login_access_token().
The route authenticates the user through crud.authenticate(),
then generates a JWT token using create_access_token(),
which downstream authenticated routes depend on through
FastAPI dependency injection.
That was the moment KernelMind genuinely started feeling like: a repository reasoning assistant instead of vector search over code.
The TUI Phase
And finally:
once the retrieval and generation pipeline stabilized, I wanted a proper interface for interacting with the system.
Could I have built a web app?
Probably.
Did I instead build a terminal UI because I use Linux and enjoy turning every side project into a cyberpunk terminal application?
Absolutely.
KernelMind now runs through a TUI built using:
- textual
- rich
The interface supports:
- conversational repository querying
- retrieval visualization
- grounded answer display
- live workflow exploration
- repository loading
- indexing feedback
And honestly, interacting with the system through the terminal felt surprisingly natural for this kind of project.
There is something extremely satisfying about asking "How does authentication work?"
and watching a graph-aware retrieval engine reconstruct repository workflows directly inside the terminal.
Final Thoughts
KernelMind started as:
Repository → Embeddings → Search
It eventually evolved into:
Query
↓
BM25 + Embedding Retrieval
↓
Hybrid Fusion
↓
Graph Expansion
↓
Graph-Aware Ranking
↓
Cross-Encoder Reranking
↓
Context Building
↓
Grounded Answer Generation
↓
Evaluation + RAGAS Benchmarking
↓
Conversational TUI Interface
But honestly, I had never really planned any of these steps. Almost every architectural layer emerged because the previous one failed in some interesting way. And that was probably the most fun part of the project - exploring, engineering my way around problems and learning some new stuff along the way!
GitHub Repository:
https://github.com/IdiotCoffee/kernel-mind
