DEV Community: Ishaan Mavinkurve

Building Dhrishti - Part 3: Testing on a Production Grade System

Ishaan Mavinkurve — Sat, 13 Jun 2026 06:24:26 +0000

I was now done with the basic setup. However, during my time working at my startup, I have learnt to think about a project while wearing multiple hats. One such concern was overhead: with Dhrishti running on a server that was already loaded, I did NOT want the tracking application itself to be heavy. I had to set some benchmarks to ensure that Dhrishti did not consume a tonne of resources while tracking the metrics. I also had a problem with unresolved requests. In my mock_services, I had a client that was continuously hitting the API Gateway service. I had to fine-tune all the requests so that I could run tests under different loads, but the advantage was that my project could easily discern where each client request was coming from. In a production scenario, however, you cannot know in advance where a request is coming from, and we obviously cannot resolve arbitrary customer IPs to their respective customer names.

This was the first problem. I had to specify what a customer was, and what an unknown request was. I came up with the following solution -

Any unresolved IP is added to a table in the UI called the unresolved IP table. This would help me with debugging later. Now, any unresolved IP that also made requests to an ENTRY-POINT into my application could be treated as a customer. For this, I simply had to filter the unknown IPs and keep a configurable list of entry-points in dhrishti.json (in the case of my mock micro-service architecture, only 1).

Now, I could differentiate between 2 types of unknown IPs - one which was potentially a customer, one which was a background network call, not important to the working system.
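The rule above can be sketched in Go roughly like this. It is a minimal sketch and not Dhrishti's actual code: the `Classifier` type, its fields and the example IPs are hypothetical, and it assumes the entry-points from dhrishti.json have already been loaded into a slice of service names.

```go
package main

// Origin is the label given to the source of a connection.
type Origin string

const (
    OriginService    Origin = "service"    // IP resolved to a known container
    OriginCustomer   Origin = "customer"   // unresolved IP that hit an entry-point
    OriginBackground Origin = "background" // unresolved IP that hit anything else
)

// Classifier holds the lookup tables needed to label a connection.
// Both fields are assumptions made for this sketch.
type Classifier struct {
    knownIPs    map[string]string // IP -> container name
    entryPoints map[string]bool   // services configured as entry-points
}

// NewClassifier builds a Classifier from resolver output and config.
func NewClassifier(knownIPs map[string]string, entryPoints []string) *Classifier {
    eps := make(map[string]bool, len(entryPoints))
    for _, ep := range entryPoints {
        eps[ep] = true
    }
    return &Classifier{knownIPs: knownIPs, entryPoints: eps}
}

// Classify labels the source of a connection from srcIP to dstService.
func (c *Classifier) Classify(srcIP, dstService string) Origin {
    if _, ok := c.knownIPs[srcIP]; ok {
        return OriginService
    }
    if c.entryPoints[dstService] {
        return OriginCustomer
    }
    return OriginBackground
}
```

With "gateway" as the only entry-point, an unknown IP calling the gateway is labelled a customer, while the same IP calling any other service is labelled background traffic. Both still belong in the unresolved IP table for debugging.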

The next problem was with the client service itself. It was difficult to simulate, say - a million users in my system. I had essentially built a service which was only being used by 1 customer, but how would Dhrishti behave if I added multiple client IPs?

Using K6

k6 is an open-source load testing tool from Grafana Labs that helps developers simulate real-world load on servers. All I have to do for this is make a file load.js and define a bunch of parameters -

Which entry point to hit?
Do I ramp up traffic at a certain point? Do I ramp it down? Keep it constant?
Is there a potential sleep between 2 subsequent requests?
How many virtual users do I want to simulate? How long do I run the simulation for?
I can also simulate different scenarios - many purchases being made, a flash sale service, people simply browsing the website, etc.

All in a JSON-like format! Simple and easy to implement. Then, I could simulate the load by passing this file to k6.

Side-note: This, by the way, was a really interesting experience - I could model my own sandbox and even wreak havoc inside it. I really want to try out more simulations with k6 in the future!

After this, I built a module to benchmark the overhead for Dhrishti. I had to write a bunch of shell scripts for this, which fortunately I was familiar with as I have been using Linux for quite some time now.

First, I had to write some code in Go to get CPU information - how many threads, how many cores, which OS I was running, etc.
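A minimal sketch of that kind of host probe, using only Go's standard runtime package. The `HostInfo` type and `CollectHostInfo` function are hypothetical names for illustration. Note that `runtime.NumCPU` reports logical CPUs usable by the process, so counting physical cores would need another source (on Linux, something like /proc/cpuinfo), which this sketch does not attempt.

```go
package main

import "runtime"

// HostInfo captures the machine details stored next to a benchmark run.
type HostInfo struct {
    OS          string // operating system, e.g. "linux"
    Arch        string // CPU architecture, e.g. "amd64"
    LogicalCPUs int    // logical CPUs usable by this process
    GoMaxProcs  int    // OS threads Go will run Go code on at once
}

// CollectHostInfo reads the values exposed by the Go runtime.
func CollectHostInfo() HostInfo {
    return HostInfo{
        OS:          runtime.GOOS,
        Arch:        runtime.GOARCH,
        LogicalCPUs: runtime.NumCPU(),
        GoMaxProcs:  runtime.GOMAXPROCS(0), // 0 queries without changing it
    }
}
```

Recording this next to every benchmark result makes it possible to tell later whether two runs came from comparable machines.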

Then, I ran k6 without Dhrishti to establish a baseline. I wrote a baseline.sh script, which would run k6 and give me baseline stats:

Then, I wrote a script that would benchmark Dhrishti itself - so, instead of using the results given by k6, I would print out the results reported by Dhrishti. I also had a small compare.sh script to compare CPU performance with and without Dhrishti.

The benchmark with Dhrishti came up as follows:

AWESOME - the engine used up barely any RAM. Dhrishti was pretty lightweight.

Then, I wrote the compare script, which would basically normalize events per second into requests per second - because Dhrishti was capturing a request as TCP Connect, TCP accept, TCP close events. After tuning it all together, I had the following:

Dhrishti was doing pretty well! I got a delta of barely 0.01 req/s while running a simulation with 50,000 concurrent users! The more problematic numbers were the p95 latency numbers - where I had a delta of around 14% - but that was to be expected.

Why?

Because Dhrishti was not always able to detect a connection close event. In Part 2, I had discussed how very short-lived connections were either never detected by Dhrishti at all, OR detected only as a TCP_OPEN event with no corresponding TCP_CLOSE event. This was expected because the inference engine was lossy by nature, but I had added a cleanup service in Go to clean up such connections after 30 seconds of inactivity. Hence, for those requests, Dhrishti recorded a much longer lifetime, which inflated its p95 / p99 latency compared to k6, which times every request itself from the client side.
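The normalization done by the compare step can be sketched as below. This is a hedged illustration written in Go rather than the shell script itself. It assumes every request shows up as exactly three events (connect, accept, close), and the function names are hypothetical.

```go
package main

// eventsPerRequest is an assumption of this sketch: one request is
// observed as one connect, one accept and one close event.
const eventsPerRequest = 3.0

// EventsToRequests converts an event rate into a request rate.
func EventsToRequests(eventsPerSec float64) float64 {
    return eventsPerSec / eventsPerRequest
}

// Delta compares an observed value against a baseline and returns the
// absolute difference and the difference as a percentage of the baseline.
func Delta(baseline, observed float64) (abs, pct float64) {
    abs = observed - baseline
    if baseline != 0 {
        pct = abs * 100 / baseline
    }
    return abs, pct
}
```

With this, an event rate from Dhrishti and a request rate from k6 end up in the same unit before the delta is taken. Lost close events break the three-events assumption, which is one more reason to expect small discrepancies.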

Now, I was ready for a bigger challenge. All this while I was using mock micro-services which were… naive. There were no hidden surprises, no production grade code. Dhrishti was cruising along while observing this architecture. I needed to challenge my project and stress test it to the fullest.

So, I deleted my mock micro-service architecture, and made a production grade architecture simulating an e-commerce website. It was complete with 15 interconnected services: gateway, catalog, pricing, inventory, payments, notifications, etc., designed to mimic a real production topology rather than a toy demo. There were lots of hidden dependencies and this architecture had a much more complex topology. This is what Dhrishti would typically see in a production grade e-commerce store. I also removed the client service completely as k6 would be handling requests to this architecture internally.

Another small aspect I feel I should cover - in the last Part about building Dhrishti, I designed a very simple, completely vibe-coded UI. Now, I learnt a little more about cytoscape.js, and modelled a better, cleaner UI which looked like this:

I also wanted to play around with a new technology that I had recently learnt: Redis. Redis, by the way, is an in-memory data store, most often used as a cache, that has a LOT of pretty cool quirks. I had recently studied Redis Sorted sets, Geo-analysis, and TimeSeries while going through a course on Redis University. They have, in my opinion, one of the best communities in the tech world. However, having now learnt about it, I wanted to use it in a project to really get a feel for it. But, what was the best way to do so?

I decided to use RedisTimeSeries. TimeSeries allows us to capture time-stamped data in a dedicated time-series data type (a common alternative is a sorted set scored by timestamp). I thought it was one of the coolest features Redis had to offer (apart from their Geo-spatial offerings, of course!), and I wanted to use it to capture all events occurring in the last 24 hours. I would then store them so that I could view the events in a clean timeline view. This would allow users to compare things like - how did incoming requests change over time for Service A as compared to Service B? When exactly did Service X crash? When did users peak on my e-commerce store?

I also added the capability for users to REPLAY the exact events as they occurred in the last 24 hours, again using TimeSeries, to allow users to see exactly how it all went down before their servers crashed.

With this, the end-to-end application was complete. Dhrishti could now watch live traffic in a production-grade, 15-service topology, infer the dependency graph with sub-0.1% CPU overhead, and let users replay the last 24 hours of activity through Redis TimeSeries, all through a UI that no longer looked vibe-coded.

Looking back at this part specifically, I think the hardest part was learning to treat my own project as something that had to survive contact with reality. Benchmarking showed me exactly where Dhrishti was reliable, where it wasn’t, and how much overhead Dhrishti itself produced. And k6 forced me to stop testing against a system I'd built to be easy to test.

Of course, there's always more to add, more probes, more metrics, deeper inference. But the core loop is done: observe, infer, benchmark, replay. I think that is a complete story.

And along the way, I learnt a lot more about kernel-space programming, taming Go's garbage collector, fighting Docker internals, and getting genuinely useful results out of k6 and Redis, which are two tools I'd love to explore more in future projects.

This wraps up the Dhrishti series (for now). Thanks for following along through all three parts!

Github Link: https://github.com/IdiotCoffee/dhrishti/tree/master

Building Dhrishti Part 2: Go-Lang Quirks

Ishaan Mavinkurve — Sun, 31 May 2026 03:59:14 +0000

— written by a human!

Now, my thinking about Dhrishti had evolved - I wanted to decouple the different steps of actually receiving telemetry which were originally bunched together into one single loader.go file.

I made the following architecture:

events.go - When my eBPF code ran, it would produce data as raw binary structs. Hence, my Go code, while going through the ring buffer, would get RAW BYTES. In Go, I needed structs that would EXACTLY match the memory layout of the structs written in my bpf.c code (field order, sizes and padding). This shared layout is effectively the Application Binary Interface, or ABI, between the probe and the receiver. It allows my Go code to decode the bytes exactly and get the actual data in a readable format.
receiver.go - This was the layer that would ingest my raw data by reading it continuously from the ring buffer. It called for an event-driven design, and this was actually the first time I had tried one out.
normalize.go - Now, I had data in a machine-friendly form… my timestamps were in nanoseconds, my enums were numeric, my IP addresses were uint32 - this was useful to the machine, not so useful for me or other humans. I now needed to normalize the data and convert it to a human-readable form.
pipeline.go - This was the orchestrator, where different goroutines ran in parallel to receive the data emitted by my probes, then normalize and log it.
attach.go - I needed this file to attach the probes and make a connection so I could start reading the events. It would load the object files, create the ring-buffer readers and attach the probes to the kernel.
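To make the events.go and normalize.go steps concrete, here is a minimal sketch using only the standard library. It is not the project's code. It assumes a little-endian host, a hypothetical C struct laid out as pid (u32), comm (16 bytes), daddr (u32), dport (u16) plus 2 bytes of trailing padding, and an address and port left in network byte order. `RawEvent`, `Decode` and `Normalize` are made-up names.

```go
package main

import (
    "bytes"
    "encoding/binary"
    "net"
    "strconv"
)

// RawEvent mirrors the assumed C struct byte for byte.
type RawEvent struct {
    Pid   uint32
    Comm  [16]byte
    Daddr uint32
    Dport uint16
    _     [2]byte // trailing padding the C compiler adds for alignment
}

// Event is the human-readable form produced by normalization.
type Event struct {
    Pid  uint32
    Comm string
    Dest string // "ip:port"
}

// Decode turns one ring-buffer record into a RawEvent.
func Decode(record []byte) (RawEvent, error) {
    var ev RawEvent
    err := binary.Read(bytes.NewReader(record), binary.LittleEndian, &ev)
    return ev, err
}

// Normalize converts numeric fields into readable strings.
func Normalize(raw RawEvent) Event {
    // Writing the value back little-endian restores the original
    // in-memory (network order) bytes of the IPv4 address.
    ip := make(net.IP, net.IPv4len)
    binary.LittleEndian.PutUint32(ip, raw.Daddr)

    // Swap the two bytes to go from network to host order.
    port := raw.Dport>>8 | raw.Dport<<8

    // comm is NUL-padded, so cut it at the first zero byte.
    n := bytes.IndexByte(raw.Comm[:], 0)
    if n < 0 {
        n = len(raw.Comm)
    }
    return Event{
        Pid:  raw.Pid,
        Comm: string(raw.Comm[:n]),
        Dest: net.JoinHostPort(ip.String(), strconv.Itoa(int(port))),
    }
}
```

The blank `_` field is the important detail: without it the Go struct would be 26 bytes while the C struct is 28, and a layout mismatch like that can shift every field that comes after it.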

I thought this was clean enough architecture. Now, when I ran my basic server in docker, and ran the main.go program, I got:

Beautiful. This did not look like much, but I was actually processing quite a few events. Now, I had to resolve the names of the docker containers, so I knew the actual connections rather than the IPs. I already had the functions to do this, and I just had to add them into the updated flow to get:

Now, it was time to take a bigger step. Until now, I was using a simple client-server architecture. This was good. However, I now wanted a real challenge for my project.
So I made the following architecture:

I built a micro-services architecture that was using this design. This would be a more complex, more real world test for Dhrishti. I dockerized the services, ran the containers, started Dhrishti.

And the result?

Beautiful. All connections were seen correctly.

Now, the next step was to actually make sense of all of these arrows. The raw telemetry I was getting was stateless. That meant, it could only understand:

connect happened
close happened
accept happened

But… who connected to whom? How long was the connection? How many connection attempts succeeded?

To answer this, I decided to build a connection state. This would track a connection from open to close, and also track failed connections.

I also had a separate problem - sometimes, I saw

gateway -> auth-service
auth-service -> gateway

This was essentially 1 request response cycle. I had to track it as such. So, I decided to construct a flow correlation engine.
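One way to think about such a correlation step is sketched below. It is only a sketch under an assumption of mine: that the two arrows come from the same connection being reported once by the side that connected and once by the side that accepted. All type and function names are hypothetical.

```go
package main

// Kind says which side of a connection reported the event.
type Kind int

const (
    Connect Kind = iota // reported by the side that dialed out
    Accept              // reported by the side that accepted
)

// SocketEvent is a simplified, already-resolved telemetry event.
type SocketEvent struct {
    Kind   Kind
    Local  string // service that reported the event
    Remote string // peer on the other end
}

// Flow identifies one client -> server relationship.
type Flow struct{ Client, Server string }

// FlowOf maps an event onto its flow, flipping accept events so that
// both sides of a connection produce the same key.
func FlowOf(ev SocketEvent) Flow {
    if ev.Kind == Accept {
        return Flow{Client: ev.Remote, Server: ev.Local}
    }
    return Flow{Client: ev.Local, Server: ev.Remote}
}

// Correlate counts how many events were observed per flow.
func Correlate(events []SocketEvent) map[Flow]int {
    flows := make(map[Flow]int)
    for _, ev := range events {
        flows[FlowOf(ev)]++
    }
    return flows
}
```

Here a connect reported by gateway and an accept reported by auth-service collapse into a single gateway -> auth-service flow. A fuller engine would key on the full address and port tuple so that parallel connections between the same two services stay separate.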

The next problem I had to tackle was failed connections: if I saw closed=True with accept=False, the connection was never accepted by the server, and I had to track these as well. I also had a problem with short-lived connections - connections that were made and closed so fast that either I missed the connection entirely (which was okay, because I think telemetry services are lossy to some extent anyway) or I recorded the connection open, but not the connection close - which was a problem. Some graph edges remained open forever, which was not right.

Hence, I added a cleaner - it would track connections that were open for more than 30 seconds (later reduced), close them and clean up memory.
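A minimal sketch of such a cleaner, assuming connections are keyed by a string id and timestamps are nanosecond counters like the ones the probes emit. The `Tracker` type and its methods are hypothetical, and locking is left out for brevity.

```go
package main

// Second is one second in nanoseconds.
const Second uint64 = 1_000_000_000

// Conn is the tracked state of one open connection.
type Conn struct {
    OpenedNs   uint64 // timestamp of the first event
    LastSeenNs uint64 // timestamp of the most recent event
}

// Tracker remembers open connections and evicts the ones that went quiet.
type Tracker struct {
    open      map[string]Conn
    maxIdleNs uint64
}

// NewTracker creates a tracker with the given idle limit.
func NewTracker(maxIdleNs uint64) *Tracker {
    return &Tracker{open: make(map[string]Conn), maxIdleNs: maxIdleNs}
}

// Touch records an event for a connection, opening it if it is new.
func (t *Tracker) Touch(id string, nowNs uint64) {
    c, ok := t.open[id]
    if !ok {
        c.OpenedNs = nowNs
    }
    c.LastSeenNs = nowNs
    t.open[id] = c
}

// Close removes a connection whose close event was actually observed.
func (t *Tracker) Close(id string) { delete(t.open, id) }

// Len reports how many connections are still considered open.
func (t *Tracker) Len() int { return len(t.open) }

// Sweep force-closes connections idle for longer than the limit and
// returns their ids so the caller can close the matching graph edges.
func (t *Tracker) Sweep(nowNs uint64) []string {
    var evicted []string
    for id, c := range t.open {
        if nowNs > c.LastSeenNs && nowNs-c.LastSeenNs > t.maxIdleNs {
            evicted = append(evicted, id)
            delete(t.open, id)
        }
    }
    return evicted
}
```

In a running service, `Sweep` would be called from a goroutine on a timer, with a mutex guarding the map. A connection evicted this way has no real close timestamp, so any lifetime derived for it is an upper bound rather than a measurement.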

I also needed something that looked like real time metrics. Currently, I was calculating Average latency between connections, for example. But when I observed my results, I saw that after a point, new connections did not change the average latency as much. I wanted to ensure that if something was failing, I knew it immediately - so I added calculation for

- rolling window temporal calculations
- p95 latency (the latency that 95% of requests come in under)
- rolling averages (over a sliding window)
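These metrics can be sketched with a small fixed-size window. This is a minimal sketch, assuming a count-based window (the last N samples) and the nearest-rank definition of a percentile. The project may well use a time-based window instead, and the `Window` type is a hypothetical name.

```go
package main

import (
    "math"
    "sort"
)

// Window keeps the most recent `size` latency samples in a ring buffer.
type Window struct {
    buf  []float64
    size int
    next int // slot that the next sample overwrites once the buffer is full
}

// NewWindow creates a window holding at most size samples.
func NewWindow(size int) *Window {
    if size < 1 {
        size = 1
    }
    return &Window{buf: make([]float64, 0, size), size: size}
}

// Add records a sample, dropping the oldest one when the window is full.
func (w *Window) Add(v float64) {
    if len(w.buf) < w.size {
        w.buf = append(w.buf, v)
        return
    }
    w.buf[w.next] = v
    w.next = (w.next + 1) % w.size
}

// Mean returns the rolling average, or 0 for an empty window.
func (w *Window) Mean() float64 {
    if len(w.buf) == 0 {
        return 0
    }
    sum := 0.0
    for _, v := range w.buf {
        sum += v
    }
    return sum / float64(len(w.buf))
}

// Percentile returns the nearest-rank percentile (p in 0..100).
func (w *Window) Percentile(p float64) float64 {
    n := len(w.buf)
    if n == 0 {
        return 0
    }
    sorted := append([]float64(nil), w.buf...)
    sort.Float64s(sorted)
    rank := int(math.Ceil(p / 100 * float64(n)))
    if rank < 1 {
        rank = 1
    }
    if rank > n {
        rank = n
    }
    return sorted[rank-1]
}
```

Because old samples fall out of the window, a sudden latency spike moves the rolling mean and the p95 right away, instead of being averaged away by the whole history.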

After adding these components, my metrics started to look like this:

If you're thinking, "This is a LOT of information!" - yes, so was I. At this point, the client in my mock service was REALLY RAPIDLY sending requests to my API gateway, and it was becoming difficult to actually analyze my results.

I even tried to add some time gaps between requests sent by the client in my mock service, and added a keep-alive time for my requests themselves… but the terminal logs were still going by too fast for me to understand anything.

So, I decided to load up Cursor, and vibe-coded the entire front-end for my application. I just wanted a UI to view my metrics correctly. I was not concerned with UI polish for now. After a little bit of prompting, I decided to implement a cytoscape.js Graph (which would give me an interactive graph with a legend) to render the front-end, fed over a web-socket from my Go backend.

Okayy, this was looking pretty good! The connections that were active would be dotted lines, the colors of the connections represented the latencies, and hovering on a connection even gave me all the extra information - like connection life, p99 and p95 latencies, etc.

It also exposed some Go-Lang related issues. This was the part where it got interesting. I had never worked with Go so heavily until now. I knew the concepts I was using and the documentation was VERY comprehensive, but I still made some very interesting mistakes:

I was using mutexes in one part of Dhrishti. Basically, a Go listener would hold a lock until it heard a probe emit an event.
This was directly messing with my server stats, because it caused deadlocks: one goroutine would wait for another to release a lock, while that one was waiting on the first. So I had to do some refactoring to prevent it.
The next, more subtle issue was with Go’s own Garbage Collector (GC). This is a part of the runtime that periodically looks for memory the program can no longer reach and frees it. This bug took me SO LONG to resolve, but when I finally had it, I was probably the happiest man alive for about 3 minutes.

My app had 4 “listeners” plugged into Linux kernel events (like satellite dishes listening for TCP connect/accept/close activity from kernel space). Those listeners were created at startup and used to feed data into my Go pipelines. However, the GC only saw that these listeners were created ONCE and then never referenced again - so it decided to clean them up, breaking my graph after around 20 to 30 seconds. I had to force these listener objects to stay alive for the full life of the app by storing them in the context of a forever-running goroutine.

In simple terms: I gave Go a permanent “don’t throw this away” reference. This was the first time I had run into problems with Go-Lang’s quirks.
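The idea of a permanent reference can be sketched as a small owner object. This is only a sketch of the pattern and not the fix used in the project. It assumes each listener is a handle with a Close method whose kernel resources are released once the handle becomes unreachable. `ProbeSet` and `CloserFunc` are hypothetical names.

```go
package main

import "io"

// CloserFunc adapts a plain function to io.Closer, which is handy in tests.
type CloserFunc func() error

// Close calls the wrapped function.
func (f CloserFunc) Close() error { return f() }

// ProbeSet owns every probe handle for the lifetime of the process.
// While the ProbeSet is reachable, so is every handle stored in it,
// and the garbage collector will not reclaim them.
type ProbeSet struct {
    handles []io.Closer
}

// Track stores a handle so that it stays referenced.
func (p *ProbeSet) Track(h io.Closer) {
    p.handles = append(p.handles, h)
}

// Len reports how many handles are being kept alive.
func (p *ProbeSet) Len() int { return len(p.handles) }

// Close releases all handles in reverse order and returns the first error.
func (p *ProbeSet) Close() error {
    var first error
    for i := len(p.handles) - 1; i >= 0; i-- {
        if err := p.handles[i].Close(); err != nil && first == nil {
            first = err
        }
    }
    p.handles = nil
    return first
}
```

The caller would create one `ProbeSet` at startup, `Track` each listener as it is attached, and keep the set in scope until shutdown, where a deferred `Close` cleans everything up on purpose rather than by accident. The standard library's `runtime.KeepAlive` is another tool for pinning a single value until a given point in the code.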

Now, I had a working UI, a good amount of information from my Probes, some GREAT lessons by building the project in Go, and it was time to test out my project on something…. bigger.

The next step was to set up and use a real, actual GitHub repo that replicated an application. I had options like Google's Online Boutique, for example - which simulates a real e-commerce website with a lot of micro-services. I also wanted to experiment with tools like hey and k6 to simulate production behaviour. But I am still building this phase out, and I will document it as I move forward. Let me know if you have some tips for this phase, please!

Check out Dhrishti here: https://github.com/IdiotCoffee/dhrishti

Dhrishti Part 1 - Building Runtime Observability for Distributed Systems

Ishaan Mavinkurve — Thu, 28 May 2026 10:06:59 +0000

— written by a human!

Recently at work, I worked on a major project - Multitenancy.

Initially, we used to provide one virtual machine to every customer that we acquired. This meant a lot of manual configuration, multiple deployments for a small hot-fix, and more importantly, a lot of time spent connecting to a remote SSH session and debugging network issues. Multitenancy would fix this by basically allotting all customers to a single machine. This didn’t sound bad, but now think about the legacy code - all the MongoDB connections, for example, or my .env files - everything was customized to an individual instance, and I had to make it so that the application for each customer worked within the scope of their own organization. In short, I did not want data from one organization to be visible in another.

The code itself was difficult to conceptualize, but not impossible. What felt harder were the migrations themselves. My team and I spent countless hours poring over connection errors, debugging Docker containerization issues, pointing our code to the correct env files - we almost gave up on this massive undertaking multiple times!

Once we pulled through and this project was done, I began to wonder - what if there was some way to make this process easier?

What if, through some coding magic, I could ACTUALLY make a graph to visualize all the network connections in an application? I could simply point my program to a docker container, and it would dive into the Kernel and reverse engineer its own architecture from system-calls to network events.

I began doing some research, and I found the main character in this story - eBPF.

What is eBPF?

eBPF is a technology that would allow me to run sandboxed programs inside the Linux KERNEL. It does so without modifying kernel source code or loading kernel modules, which can be unsafe.

The Kernel in Linux handles all the cool stuff - TCP connections, when a process starts, how much memory is allocated, etc.

eBPF would allow me to send a small “probe” into this Linux Kernel Space, and observe what happens around it. Then, any important or significant information would be emitted back to me.

I like to think of it like Voyager 1. (I love reading about space exploration!) This is a space probe that happens to be the FARTHEST human-made object from us - and we can still communicate with it!

So, all I had to do was create a probe, send it out on an adventure into Kernel space, and have it emit events back to me. Simple. How would I capture the events it sent? Well, Claude suggested using a receiver, which I would write in Go, to collect these events.

So I started. I opened up Zed and made 2 files - a server.py, and then a client.py. The client would simply send a request to the server every 3 seconds, and the server would return a Hello, world! response.

Next, I put both of them into their own docker containers, with the client being dependent on the server container.

After that, I ran

docker compose up --build

And boom, I had just created a sandbox environment wherein TCP connections were being made, and a real application was running.

Now, I had to build a probe to venture out into the vast expanse of (Linux Kernel) space and emit discoveries! For this, I used the help of ChatGPT. I asked it to make me a probe that would run and collect TCP events. It made a probe using C, and also said:

I always knew that space exploration could be dangerous, and I would never understand everything fully. But, at a high level, the code did the following:

My probe would hook onto the Kernel, look for tcp_connect events, extract the metadata and emit it out.

Also, to make sure I followed CO-RE principles (Compile Once - Run Everywhere), I had to make a vmlinux.h file with my kernel’s actual type definitions, extracted from BTF metadata, specifically for BPF programs.

BTF (BPF Type Format) - this is the type metadata that the kernel exposes about itself at runtime.

For those like me who didn’t understand a word of the above, basically, I knew my probes would run on MY kernel, but I could not guarantee that they would run on another Linux kernel version, or that they would not break if the libraries I was using got updated. So, I had to make a file holding the kernel's type metadata, which lets my probes be adjusted to run in every (known) situation.

So I ran this:

bpftool btf dump file /sys/kernel/btf/vmlinux format c > vmlinux.h

I compiled the probe and loaded the resulting probe.o object file. Now, I had a probe sent into the docker sandbox application, and it was already emitting events. I now needed to make a receiver that would receive these events.

For this, I wanted to collect the telemetry in a language that was fast, efficient and easily compiled, so that my activity of listening to the probe did not slow down the application that I was supposed to observe.
Hence, I selected Go.
Go is truly a beautiful language, and I really wanted to use it in a project after having learnt it a little while ago. I also came across some really cool quirks of Go which I had to work around (stay tuned, this is for Part 2!)

In Go, I built a struct to collect the events that were being sent by the probe:

type Event struct {
    Pid   uint32
    Comm  [16]byte
    Daddr uint32
    Dport uint16
}

I also built a resolver for container names. Its constructor creates a Docker client and wraps it in the resolver:

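import "github.com/docker/docker/client"

// DockerResolver wraps the Docker API client used to look up containers.
type DockerResolver struct {
    cli *client.Client
}

// NewDockerResolver builds a client from the environment (DOCKER_HOST etc.).
func NewDockerResolver() (*DockerResolver, error) {
    cli, err := client.NewClientWithOpts(
        client.FromEnv,
        client.WithAPIVersionNegotiation(), // match the daemon's API version
    )
    if err != nil {
        return nil, err
    }

    return &DockerResolver{
        cli: cli,
    }, nil
}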

Now, I was ready to connect to my probe and get some data! The concept was as follows:

I would attach the satellite to the tcp_connect probe, meaning, when the Linux Kernel made a new TCP connection, my code would run and gather telemetry
Resolve the container name from the PID
Based on the tcp_connect info, make a graph with 2 vertices to denote client and server (resolved from PID) and an edge denoting the dependency.
Check every 5 seconds, collect telemetry, and also calculate number of requests that came in.
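The graph step in this list can be sketched as follows. It is a minimal sketch, assuming the client and server names have already been resolved from the PID and destination address. The `Graph` and `Edge` types are hypothetical names.

```go
package main

import "sort"

// Edge is a directed dependency: From called To.
type Edge struct{ From, To string }

// Graph stores the services seen so far and a request count per edge.
type Graph struct {
    nodes map[string]bool
    edges map[Edge]int
}

// NewGraph creates an empty dependency graph.
func NewGraph() *Graph {
    return &Graph{nodes: make(map[string]bool), edges: make(map[Edge]int)}
}

// Observe records one tcp_connect from client to server.
func (g *Graph) Observe(client, server string) {
    g.nodes[client] = true
    g.nodes[server] = true
    g.edges[Edge{From: client, To: server}]++
}

// Requests returns how many connects were seen on an edge.
func (g *Graph) Requests(client, server string) int {
    return g.edges[Edge{From: client, To: server}]
}

// Nodes returns the known services in sorted order.
func (g *Graph) Nodes() []string {
    names := make([]string, 0, len(g.nodes))
    for n := range g.nodes {
        names = append(names, n)
    }
    sort.Strings(names)
    return names
}
```

A periodic reporter, for example one that wakes up every 5 seconds, could then read `Requests` for each edge and print the difference since the previous tick as the request count for that interval.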

After a lot of experimentation, referring to docs and to ChatGPT, I managed to code out these four steps. My code was being orchestrated by a file called loader.go, so I compiled it into an executable.

Then, I ran my docker service and also my executable.

Oh. My. God. I could talk to my PROBE!! I sat there for a good 15 minutes just looking at my telemetry logs. This was beautiful.

It was also insufficient. This did not tell me EVERYTHING I wanted to know about my containers. But now, the basic idea was built. All I had to do was send out multiple probes that specialized in multiple types of data gathering, and make sure I collected ALL of that data.

When this was done, all I had to do was make a beautiful (AI Generated) front-end to show this graph by polling an API repeatedly.

With the help of ChatGPT, I built the following probes:

tcp_connect - allows me to find out which processes initiated an outbound TCP Connection. This is the birth of a dependency graph. It was the core of my project
tcp_close - tells me when a connection gets terminated. This would allow me to compute the lifetime of 1 connection.
tcp_accept - triggers when a server actually ACCEPTS a connection. This gives me server side visibility, whereas tcp_connect gave me client side visibility. This would help decide failed connections, queue saturation, etc.
tcp_state - would tell me when the STATE of a connection changed. States like - established, fin_wait, time_wait, etc.

I wanted to start with this, but first, I needed to improve my coding architecture. I had a loader.go that was basically handling everything, and that would not be scalable as I added more probes.

So I had to come up with a better architecture for my code, but the project wasn’t just an idea anymore!

Stay tuned for the second part, or feel free to check out my full project here!

https://github.com/IdiotCoffee/dhrishti

Building KernelMind Part 3: Evaluation, Retrieval Ablations, RAGAS, and Turning The Project Into Something Measurable

Ishaan Mavinkurve — Wed, 20 May 2026 01:29:24 +0000

By this point, KernelMind had already evolved far beyond the original “embeddings over code” idea.

The system now had:

AST-aware chunking
fully qualified symbol identities
graph-aware retrieval
hybrid BM25 + embedding search
query-aware graph expansion
cross-encoder reranking
workflow reconstruction
grounded answer synthesis

And honestly, the demos looked pretty convincing, which was kinda scary... because I knew from experience that retrieval systems are extremely easy to overestimate when you only test them manually.

If I asked:

How does login work?

The answer sounded smart enough and my brain immediately started cooperating with the system.

The issue was:

“sounds correct” is not an evaluation strategy.

At some point, I realized I had absolutely no reliable way to answer the question:

Is KernelMind actually improving?

I needed the following:

evaluation
↓
benchmarking
↓
retrieval ablations
↓
RAGAS scoring
↓
precision / recall analysis

Building A Retrieval Benchmark

The first thing I needed was a benchmark suite grounded in the actual repository.

Initially, I made the classic mistake:

"yeah I'll just manually write expected answers"

Terrible idea.

Very quickly I realized that retrieval evaluation only works if the benchmark references:

real indexed chunks
real graph nodes
real repository symbols
real workflows

Otherwise you end up evaluating benchmark inaccuracies instead of retrieval quality. So I started inspecting the actual indexed graph and rebuilding benchmark questions around real repository functions.

The benchmark suite eventually covered things like:

authentication workflows
password reset flows
CRUD operations
dependency injection
database initialization
middleware chains
token generation
API → CRUD traversal

At that point, I finally had something measurable.

Precision vs Recall

Once the benchmark suite existed, the retrieval behavior became much clearer to reason about.

And almost immediately, I noticed a pattern:

KernelMind was actually very good at:

workflow reconstruction
semantic neighborhoods
execution flow retrieval

But precision was messy.

Recall - Is my retriever actually getting all the required chunks for this answer?
Precision - How many of the retrieved chunks are relevant, and which ones are noise?

For example:

Query:
How are users updated?

might retrieve:

create_user()
update_user()
delete_user()
read_users()

Which sounds bad initially.

But interestingly:
the retriever clearly understood the domain correctly.

The remaining problem was: operation specificity. That distinction became really important later.
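The two metrics are simple to compute once a benchmark lists the expected chunks for a question. Here is a minimal sketch, written in Go to match the code elsewhere on this page rather than KernelMind's own language. It assumes, purely for illustration, that update_user() is the only expected chunk for the example query. `PrecisionRecall` is a hypothetical helper.

```go
package main

// PrecisionRecall compares retrieved chunk ids against the relevant ones.
// Precision = hits / unique retrieved, recall = hits / unique relevant.
func PrecisionRecall(retrieved, relevant []string) (precision, recall float64) {
    want := make(map[string]bool, len(relevant))
    for _, id := range relevant {
        want[id] = true
    }

    seen := make(map[string]bool, len(retrieved))
    hits := 0
    for _, id := range retrieved {
        if seen[id] {
            continue // ignore duplicate retrievals
        }
        seen[id] = true
        if want[id] {
            hits++
        }
    }

    if len(seen) > 0 {
        precision = float64(hits) / float64(len(seen))
    }
    if len(want) > 0 {
        recall = float64(hits) / float64(len(want))
    }
    return precision, recall
}
```

For the query above, four retrieved functions with one relevant among them gives a precision of 0.25 and a recall of 1.0: the right chunk was found, but three quarters of the context is noise.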

The First Ablation Test

This was where I started learning about ablation testing.

An ablation test is basically this: remove one system component and observe what changes.

The goal is to isolate whether a specific architectural layer is actually contributing measurable value or just making the pipeline look more complicated.

So I started removing pieces of KernelMind individually and rerunning the evaluation benchmarks.

The first major test:

graph expansion.

I disabled graph expansion entirely.

WITHOUT Graph Expansion

KernelMind produced:

Precision: 0.267
Recall:    0.722

The retrieval became cleaner.
Less noisy.
More focused.

But:
important workflow nodes started disappearing.

Authentication flows became incomplete.
Password reset chains broke apart.
Execution flow reconstruction weakened significantly.

Then I re-enabled graph expansion.

WITH Graph Expansion

KernelMind produced:

Precision: 0.243
Recall:    1.000

That result gave me measurable evidence that graph traversal was actually improving workflow recovery.

The graph architecture was not decorative complexity anymore.

It was contributing real retrieval value.

And interestingly, the precision drop was relatively small compared to the recall improvement.

System	Precision	Recall
No Graph Expansion	0.267	0.722
Graph Expansion	0.243	1.000

That tradeoff actually makes sense for repository reasoning systems.

Missing workflow-critical chunks is usually worse than retrieving a few extra neighboring functions.

Cross Encoder Reranking

The next ablation targeted the reranker.

At this point, graph expansion was improving recall significantly, but it also widened the semantic neighborhood too aggressively.

Authentication questions started retrieving:

password reset helpers
email token utilities
related middleware
adjacent auth flows

So I disabled the cross-encoder reranker to isolate its effect.

Almost immediately:
precision degraded further.

The reranker turned out to be extremely good at:

suppressing graph noise
cleaning semantic drift
removing unrelated neighboring chunks

That clarified something important for me. Each retrieval stage now had a very distinct responsibility:

Stage	Responsibility
BM25	lexical precision
embeddings	semantic discovery
graph expansion	workflow recovery
reranking	precision cleanup

That was the point where KernelMind stopped feeling like:

"random retrieval layers stacked together"

and started feeling like an actual retrieval architecture.

Retrieval Window Tuning

Another interesting discovery appeared while evaluating precision - my retrieval window was too large. Initially, KernelMind retrieved around:

8–10 chunks

for many questions.

That improved recall, but precision became diluted because the benchmarks usually expected only:

1–4 relevant chunks

So I started experimenting with smaller retrieval windows.

K = 10

Average Precision: ~0.175
Average Recall:    ~0.824

K = 5

Average Precision: 0.276
Average Recall:    0.720

K = 4

Average Precision: 0.339
Average Recall:    0.711

This was one of the clearest retrieval tradeoffs in the entire project:

Retrieval Size	Precision	Recall
larger K	lower precision	higher recall
smaller K	higher precision	lower recall

And honestly, seeing these tradeoffs emerge experimentally was incredibly satisfying because now retrieval tuning stopped being "vibes-based engineering"

and became measurable system behavior.

Integrating RAGAS

Once retrieval stabilized, I finally moved into answer evaluation using RAGAS.

This was another huge shift in mindset.

Because retrieval quality alone does not necessarily guarantee:

grounded explanations
coherent synthesis
faithful generation

So I started evaluating:

faithfulness
answer relevancy
context precision
context recall

I made a RAGAS evaluator file, but now I had a dilemma - RAGAS actually uses LLMs to evaluate other LLMs (crazy, I know!)
So, I had to give it an API key - but which LLM should I evaluate with? I was on a budget here with my side project, so I couldn't move directly to gpt-5.5, although it is considered the most precise evaluator.

I also could not use Sarvam AI - because that was the LLM generating my answers, and I didn't really want any bias here (I don't know for sure if that's how it works, but I didn't want to take my chances!). So I decided to add:
an OpenAI judge with gpt-5-nano
and an Ollama Local model - Qwen2.5: 7b

When testing with Ollama, I got my best scores, though I suspect that was partly because the small 7B-parameter model struggled while evaluating my large retrieved code contexts!

Finally, KernelMind produced:

{
    "faithfulness": 0.6080,
    "answer_relevancy": 0.7697,
    "llm_context_precision_without_reference": 0.5962,
    "context_recall": 0.5357
}

Honestly, I was pretty happy with these results considering:

Most things, except the Synthesis using Sarvam AI, ran locally
the retrieval pipeline was graph-aware
the system reconstructed workflows instead of isolated chunks
the generation was grounded entirely in retrieved repository context

More importantly:
The generated answers read like grounded, non-hallucinated workflow explanations rather than generic RAG output.

The login flow begins in login_access_token().
The route authenticates the user through crud.authenticate(),
then generates a JWT token using create_access_token(),
which downstream authenticated routes depend on through
FastAPI dependency injection.

That was the moment KernelMind genuinely started feeling like: a repository reasoning assistant instead of vector search over code.

The TUI Phase

And finally:
once the retrieval and generation pipeline stabilized, I wanted a proper interface for interacting with the system.

Could I have built a web app?

Probably.

Did I instead build a terminal UI because I use Linux and enjoy turning every side project into a cyberpunk terminal application?

Absolutely.

KernelMind now runs through a TUI built using:

textual
rich

The interface supports:

conversational repository querying
retrieval visualization
grounded answer display
live workflow exploration
repository loading
indexing feedback

And honestly, interacting with the system through the terminal felt surprisingly natural for this kind of project.

There is something extremely satisfying about asking How does authentication work?

and watching a graph-aware retrieval engine reconstruct repository workflows directly inside the terminal.

Final Thoughts

KernelMind started as:

Repository → Embeddings → Search

It eventually evolved into:

Query
↓
BM25 + Embedding Retrieval
↓
Hybrid Fusion
↓
Graph Expansion
↓
Graph-Aware Ranking
↓
Cross-Encoder Reranking
↓
Context Building
↓
Grounded Answer Generation
↓
Evaluation + RAGAS Benchmarking
↓
Conversational TUI Interface

But honestly, I had never really planned any of these steps. Almost every architectural layer emerged because the previous one failed in some interesting way. And that was probably the most fun part of the project - exploring, engineering my way around problems and learning some new stuff along the way!

GitHub Repository:

https://github.com/IdiotCoffee/kernel-mind

Building KernelMind Part 2: Hybrid Retrieval, Reranking, and Actually Retrieving Useful Code

Ishaan Mavinkurve — Mon, 18 May 2026 14:00:00 +0000

By the end of the first phase of KernelMind, the repository had stopped behaving like disconnected text. Functions now had identities and relationships attached to them. The graph architecture was finally stable enough to represent execution flow across the repository.

The next challenge was obvious:

How do I retrieve the right parts of this graph efficiently?

That was where retrieval engineering began.

Initially, I shifted the retrieval pipeline to operate directly on chunks retrieved from FAISS instead of querying raw documents from MongoDB. The idea was fairly simple:

use embeddings to retrieve likely entry points
then use the graph to reconstruct surrounding execution context

That combination became the foundation of KernelMind’s retrieval pipeline.

The First Retrieval Pipeline

The naive version of retrieval looked roughly like this:

all-MiniLM-L6-v2 + FAISS

I intentionally started lightweight because I wanted fast local experimentation while debugging retrieval behavior. At this stage, I was not trying to build the perfect retriever. I just wanted something fast enough to:

retrieve semantically relevant chunks
test graph expansion
debug execution flow reconstruction
and iterate quickly without destroying my laptop

And honestly, embeddings worked reasonably well at first.

Questions like:

How does authentication work?

usually surfaced relevant code. But implementation-heavy queries struggled badly.

For example:

query: cookies

might retrieve semantically similar request-handling logic instead of the actual cookie implementation.

That was the first moment I realized something important:

semantic similarity alone is not enough for repositories.

Because repositories rely heavily on exact operational language, like:

* imports
* function names
* config values
* error strings
* middleware identifiers

Things embeddings sometimes blur together semantically.

BM25 vs Embeddings

This was where BM25 entered the system. After reading more about BM25, my rough mental model became:

embeddings understand meaning, BM25 understands exact language.

BM25 is a lexical retrieval algorithm that ranks documents using exact token overlap, token rarity, and frequency instead of semantic similarity.

That turned out to be extremely useful for repositories.

For example:

create_user()
update_user()
delete_user()

all belong to the same semantic neighborhood. But operationally, they are completely different. Embeddings handled such conceptual understanding well.

BM25 handled lexical precision much better.
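Here is a small, self-contained sketch of Okapi BM25 scoring to show why it separates these sibling functions so cleanly. It is not KernelMind's code: the tokenized chunks are invented, and `k1=1.5` and `b=0.75` are just commonly used defaults that I am assuming here.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score every tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs

    # Document frequency: in how many documents does each token occur?
    df = Counter()
    for doc in docs_tokens:
        df.update(set(doc))

    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue  # no exact token match, no contribution
            # Rare tokens get a larger inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical tokenized chunks (illustration only).
docs = [
    ["def", "create_user", "session", "user_in"],
    ["def", "update_user", "session", "user_in"],
    ["def", "delete_user", "session", "user_id"],
]
scores = bm25_scores(["delete_user"], docs)
best = max(range(len(docs)), key=lambda i: scores[i])
print(best, scores)
```

Only the chunk that literally contains `delete_user` gets a non-zero score, which is exactly the lexical precision that embeddings tend to blur.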

Neither alone was enough, so KernelMind evolved into hybrid retrieval. Instead of replacing embeddings entirely, I started combining both retrieval signals together using Reciprocal Rank Fusion (a fancy term for simply combining two results together).

Reciprocal Rank Fusion (RRF) helped combine both retrieval systems by
rewarding chunks that consistently appeared near the top across both FAISS
and BM25 results. 
That gave KernelMind a much more stable retrieval signal than relying on either retriever independently.
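A minimal sketch of Reciprocal Rank Fusion, assuming each retriever hands back a ranked list of chunk IDs. The constant `k=60` is the value commonly used with RRF, not necessarily what KernelMind uses, and the chunk IDs are invented for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one ranking.

    Each chunk earns 1 / (k + rank) from every list it appears in,
    so chunks ranked well by several retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda chunk_id: scores[chunk_id], reverse=True)


# Hypothetical results from the two retrievers (illustration only).
faiss_ranking = ["auth.login", "auth.middleware", "users.create_user"]
bm25_ranking = ["auth.create_access_token", "auth.login", "users.create_user"]

fused = reciprocal_rank_fusion([faiss_ranking, bm25_ranking])
print(fused)
```

Notice that RRF only looks at ranks, never at raw scores. That is convenient here because FAISS distances and BM25 scores live on completely different scales.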

The retrieval pipeline slowly evolved into:

Embedding Retrieval + BM25 Retrieval + Reciprocal Rank Fusion

This improved retrieval quality almost immediately. The embedding retriever surfaced semantically relevant chunks. BM25 reinforced exact implementation-level details.

And the fusion layer combined both into a much stronger retrieval baseline.

Graph Expansion Over Retrieved Chunks

Once hybrid retrieval stabilized, I started layering the graph architecture over the retrieved results themselves. This was one of the biggest shifts in the system.

Initially, retrieval still operated mostly on isolated chunks returned from FAISS and BM25.

But repositories rarely store logic in one place.

Authentication systems, for example, are spread across routes, middleware, services, validators, token handlers, configuration, and dependency layers.

Retrieving one isolated chunk was often not enough to reconstruct execution flow.

So instead of treating retrieval results as final answers, I started treating them as entry points into the graph.

The pipeline became:

Retrieve relevant chunks
↓
Expand neighboring execution context
↓
Rank expanded graph nodes

This improved workflow reconstruction dramatically.

Questions like:

How does login create the access token?

no longer returned disconnected helper functions. The graph expansion layer started surfacing:

* login routes
* auth middleware
* token creation
* validation flows
* session handling

as connected execution context. This was the first time I saw genuinely repository-aware chunks coming out of the pipeline.

Integrating the Cross Encoder

Even hybrid retrieval and my powerful graph architecture (from the first Blog) still produced noisy candidates. Sibling-operation pollution became a recurring issue:

create_user()
update_user()
delete_user()
read_user()

would cluster together semantically even when only one of these actually answered the question. That was where cross encoder reranking entered the system. I started using:

cross-encoder/ms-marco-MiniLM-L-6-v2

Initially, I didn't really know how a cross-encoder worked or whether it would be useful. So I researched it. BM25 scores each chunk by its literal lexical overlap with the query (great for exact matches), whereas my cross-encoder would take both:

(query + chunk)

together and directly predict relevance using neural relevance evaluations. That distinction mattered a lot. The reranker became really good at cleaning up semantically adjacent but incorrect retrievals, especially after graph expansion widened the context.
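The shape of the reranking step can be sketched like this. To keep the example runnable without downloading a model, the pair scorer below is a toy word-overlap function, which is an assumption made purely for illustration and is NOT a cross-encoder. The function names and sample chunks are hypothetical too.

```python
def rerank(query, chunks, score_pair, top_n=3):
    """Score every (query, chunk) pair jointly and keep the best top_n chunks."""
    scored = [(score_pair(query, chunk), chunk) for chunk in chunks]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]


def toy_pair_scorer(query, chunk):
    """Stand-in scorer: fraction of query words that occur in the chunk text.

    A real cross-encoder would read the query and chunk together and
    return a learned relevance score instead.
    """
    words = query.lower().split()
    return sum(word in chunk.lower() for word in words) / len(words)


# Hypothetical candidate chunks after graph expansion (illustration only).
candidates = [
    "def create_user(session, user_in): ...",
    "def create_access_token(subject, expires_delta): ...",
    "def login_access_token(form_data): return create_access_token(...)",
]

top = rerank("login access token", candidates, toy_pair_scorer, top_n=2)
print(top)
```

With sentence-transformers, `score_pair` would typically be replaced by a batched call to `CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2").predict(...)` on a list of (query, chunk) pairs. The surrounding sort-and-truncate logic stays the same.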

Questions like:

How does login create the access token?

started consistently surfacing the right chunks instead of unrelated utility code nearby in semantic space.

The reranker essentially became a way to restore precision after graph expansion.

Choosing The Generation Model

Once retrieval quality became stable enough, I finally started experimenting more seriously with answer generation. I had all these chunks and their metadata, but for a human to make sense of them, they had to be in a proper readable format. This is where LLMs came in.

I tested several local and hosted models during development:

GPT-4o-mini
GPT-5-nano
Qwen 2.5 Code
and Sarvam’s absurdly generous free 105B model, which occasionally spoke enough sweet architectural encouragement into my ears for me to add another retrieval layer at 2 AM.

Eventually, Sarvam's 105b parameter model became the primary generation model because it gave me very good quality results FOR FREE and did not try to fry my GPU like the local models.

How the Architecture Changed

Originally, KernelMind looked something like this:

Embeddings → Retrieval → Answer

Eventually, it evolved into:

Query
↓
BM25 Retrieval + Embedding Retrieval
↓
Reciprocal Rank Fusion (RRF)
↓
Query-Aware Seed Reranking
↓
Graph Expansion + Graph-Aware Ranking
↓
Cross-Encoder Reranking
↓
Context Building
↓
Answer Generation

But - none of this architecture was pre-planned. Almost every layer was built because I observed some failures in the previous layers:

embeddings missed identifiers
retrieval lost workflow context
graph expansion introduced noise
re-ranking restored precision
orchestration improved grounding

After a little bit of fine-tuning and prompt engineering, my final answer started coming up looking like this:

Q. How is login handled in the fastapi library?
A. The login flow begins in `login_access_token()`
inside `backend/app/api/routes/login.py`.

When a POST request is sent to the login endpoint,
 FastAPI injects the submitted credentials through 
`OAuth2PasswordRequestForm`. The route then calls 
`crud.authenticate()` to validate the username and 
password against the database.


If authentication fails or the user is inactive, the
 API raises an HTTP 400 error. If authentication 
succeeds, the system generates a JWT access token 
using `security.create_access_token()`. The token 
includes the user ID and an expiration time 
configured through `ACCESS_TOKEN_EXPIRE_MINUTES`.


Finally, the endpoint returns a `Token` response
 containing the generated access token.

The retrieved workflow also shows that authenticated
 endpoints like `test_token()` depend on the 
validity of this token through FastAPI dependency 
injection, linking token generation directly to 
downstream protected routes.

My project evolved incrementally through debugging and experimentation rather than some giant architectural master plan. And once answer generation stabilized, a much harder question appeared:

How do I actually KNOW whether the system is improving?

Because retrieval systems are easy to overestimate when you only test them manually. That eventually led into the next phase of the project:

evaluation
RAGAS benchmarking
retrieval ablations

and figuring out whether the architecture changes were genuinely improving the system or just looking impressive during demos.

Building KernelMind, A Code-Aware Github Companion

Ishaan Mavinkurve — Sun, 17 May 2026 14:00:00 +0000

I have always wanted to contribute to Open Source Projects on GitHub. If you check out my Profile, you will see that I have even tried to get into it. But once I went past the documentation changes and minor fixes, I realized that OSS Contributions were HARD.

So, I decided to code a RAG project that would help me out. Of course, I could just use the inbuilt coding agents in the IDE, but where's the fun in that?
The original version of KernelMind was pretty basic.
I just wanted a way to ask questions about large repositories without manually opening forty files and mentally reconstructing execution flow.

At the time, the plan looked straightforward:

Repository -> AST Parsing -> Chunk Extraction -> Embeddings -> Vector Search -> Answer Generation

That was it. No fancy business. Just embeddings over code. But it broke immediately.

The First Hurdles

The first step was parsing. I made a basic AST parser and ran it against a deliberately small repository, storing my chunks in MongoDB for now. I wanted something predictable so debugging would be easier. I decided to use full-stack-fastapi-template.

The indexing pipeline finished and printed:

Inserted 1258 chunks.
Checked 57 files.

That made absolutely no sense. There was no way a small repository like that should explode into that many chunks. So I started tracing the parser output manually.

The first issue was trivial. I was ingesting... everything. Tests, initializers, EVERYTHING. This was a small fix ... I added a simple IGNORE_LIST that would skip the garbage files and only download the relevant Python files.
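A filter like that can be sketched in a few lines. The directory and file names below are my guesses at what would be worth skipping, not the actual contents of KernelMind's IGNORE_LIST, and the sample paths are illustrative.

```python
from pathlib import PurePosixPath

# Hypothetical ignore rules (assumptions for illustration).
IGNORE_LIST = {"tests", "migrations", "__pycache__", ".venv"}
IGNORE_FILES = {"__init__.py", "conftest.py"}


def should_index(path):
    """Return True only for Python source files outside ignored locations."""
    p = PurePosixPath(path)
    if p.suffix != ".py":
        return False
    if p.name in IGNORE_FILES:
        return False
    # Skip the file if any parent directory is on the ignore list.
    return not any(part in IGNORE_LIST for part in p.parts[:-1])


paths = [
    "backend/app/api/routes/login.py",
    "backend/app/tests/test_login.py",
    "backend/app/__init__.py",
    "README.md",
]
print([p for p in paths if should_index(p)])
```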

The second issue was slightly more confusing: it turned out that methods inside classes were being extracted twice:

once correctly as methods
once incorrectly as standalone functions

This meant that no chunk in my system had a concept of unique identity.

Everything was just “chunks.” And chunks had repetitive content...

Another related problem:

Originally, the parser stored function names like this:

__init__

Which is technically valid. It is also practically useless.

There could be dozens of __init__ methods across the repository.

So I introduced this (totally cool and non ChatGPT researched) concept - Fully Qualified Names.

Instead of:

__init__

the system generated:

matplotlib.figure.Figure.__init__

That single architectural change completely shifted the project. FQNs were now the atomic elements in the data - an FQN would be unique across the entire repo. Now, while parsing, I only had to construct each FQN once - if I found another function with the same FQN, it had already been parsed, so I could ignore it.
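One way to build FQNs with Python's `ast` module is to keep a scope stack while walking the tree, so each function is seen exactly once, in its real scope. This is a sketch under my own assumptions (the class name `FQNCollector`, the module name passed in, and the sample source are all hypothetical), not the parser KernelMind actually uses.

```python
import ast


class FQNCollector(ast.NodeVisitor):
    """Collect a fully qualified name for every function and method."""

    def __init__(self, module_name):
        self.scope = [module_name]  # current nesting: module, class, function
        self.fqns = []
        self.seen = set()

    def _record(self, node):
        fqn = ".".join(self.scope + [node.name])
        if fqn not in self.seen:  # an FQN is only ever stored once
            self.seen.add(fqn)
            self.fqns.append(fqn)
        # Descend so nested functions get the enclosing function in their FQN.
        self.scope.append(node.name)
        self.generic_visit(node)
        self.scope.pop()

    visit_FunctionDef = _record
    visit_AsyncFunctionDef = _record

    def visit_ClassDef(self, node):
        self.scope.append(node.name)
        self.generic_visit(node)
        self.scope.pop()


# Simplified stand-in source (illustration only).
source = '''
class Figure:
    def __init__(self):
        pass

    def draw(self):
        pass

class SubFigure:
    def __init__(self):
        pass
'''

collector = FQNCollector("matplotlib.figure")
collector.visit(ast.parse(source))
print(collector.fqns)
```

Because the visitor reaches each method only through its class, a method is never also picked up as a standalone function, and two `__init__` methods end up with two different names.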

Now that symbols had stable identities:

imports could resolve properly
dependencies became traceable

The repository stopped behaving like disconnected text.

It started behaving like a connected system.

The “self” Problem

One of the MOST CONFUSING bugs came from method calls.

Initially, method relationships looked like this:

"calls": ["self.get_host"]

Which looks reasonable at first glance ... except self means nothing globally.

A graph cannot reason over:

self.get_host

because it has no stable reference. So I had to build resolution logic that converted local method calls into globally addressable symbols.

Eventually:

"calls": ["src.requests.cookies.MockRequest.get_host"]

started appearing in the graph output. That was a huge leap for me - my system was no longer parsing syntax alone. It was starting to reconstruct semantic relationships.
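A minimal sketch of that resolution step, assuming the simplest case where `self.method()` refers to a method on the enclosing class. Inherited methods would need a lookup through base classes, which this sketch does not attempt. The function name and the sample source are hypothetical stand-ins.

```python
import ast


def resolve_self_calls(source, module_name):
    """Map each method FQN to the FQNs of the self.<method>() calls it makes."""
    calls = {}
    tree = ast.parse(source)
    for cls in (n for n in tree.body if isinstance(n, ast.ClassDef)):
        class_fqn = f"{module_name}.{cls.name}"
        for method in cls.body:
            if not isinstance(method, (ast.FunctionDef, ast.AsyncFunctionDef)):
                continue
            resolved = []
            for node in ast.walk(method):
                # Match calls shaped like self.<name>(...)
                if (isinstance(node, ast.Call)
                        and isinstance(node.func, ast.Attribute)
                        and isinstance(node.func.value, ast.Name)
                        and node.func.value.id == "self"):
                    # Assumption: the target lives on the enclosing class.
                    resolved.append(f"{class_fqn}.{node.func.attr}")
            calls[f"{class_fqn}.{method.name}"] = resolved
    return calls


# Simplified stand-in for a class with an internal method call.
source = '''
class MockRequest:
    def get_host(self):
        return "example.com"

    def get_origin_req_host(self):
        return self.get_host()
'''

graph = resolve_self_calls(source, "src.requests.cookies")
print(graph)
```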

Once FQNs entered the system, something clicked for me almost immediately.

I realized I was no longer dealing with isolated chunks of text. Every function now had identity, relationships, callers, callees, imports, and dependencies. The repository was starting to look far less like a document collection and much more like a graph data structure describing execution flow.

Building The Graph

And once I saw the repository that way, a lot of the later architecture decisions suddenly started making sense.

The next obvious question became:

If functions are connected, could I retrieve them together?

That question basically led to the entire graph architecture.

Constructing Relationships

The first step was building explicit call relationships. Whenever the parser encountered a function call, I attempted to resolve it into an FQN and create a directed edge:

caller -> callee

So if:

login_user()

called:

create_access_token()

the graph stored that relationship directly.
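As a rough sketch, storing a directed edge means recording it twice: once on the caller's side and once on the callee's side, so it can later be followed in either direction. The `CallGraph` class and the FQNs below are hypothetical, chosen only to illustrate the idea.

```python
from collections import defaultdict


class CallGraph:
    """Directed call graph keyed by fully qualified names."""

    def __init__(self):
        self.calls = defaultdict(set)      # caller FQN -> callee FQNs
        self.called_by = defaultdict(set)  # callee FQN -> caller FQNs

    def add_call(self, caller, callee):
        """Record the edge caller -> callee in both lookup tables."""
        self.calls[caller].add(callee)
        self.called_by[callee].add(caller)


graph = CallGraph()
# Hypothetical FQNs (illustration only).
graph.add_call("app.api.login_user", "app.core.security.create_access_token")
graph.add_call("app.api.refresh", "app.core.security.create_access_token")

print(sorted(graph.called_by["app.core.security.create_access_token"]))
```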

Initially, the graph nodes were fairly simple. Each node stored:

- the FQN
- file path
- source code
- outgoing calls
- incoming calls

Something roughly like:

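class GraphNode:
    """One function or method in the call graph, keyed by its FQN."""
    def __init__(self, fqn, file_path, source):
        self.fqn = fqn              # globally unique identity
        self.file_path = file_path
        self.source = source
        self.calls = []             # outgoing edges: FQNs this node calls
        self.called_by = []         # incoming edges: FQNs that call this node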

At first, this mainly helped with debugging. Then I realized the graph could fundamentally improve retrieval itself. Because codebases are not isolated files. They are execution systems.

Forward And Reverse Traversal

Once the graph structure stabilized, I realized traversal needed to work in both directions. Forward traversal helped answer questions like:

“What does this function eventually call?”

which was useful for reconstructing execution flow and understanding downstream behavior. Reverse traversal was equally important because it answered:

 “Who depends on this logic?”

That became extremely useful for tracing middleware usage, validation chains, service dependencies, and understanding how deeply certain functionality was integrated into the repository.

I decided to implement naive BFS - semantic search (implemented later) would reveal the start node most similar to the query, and then BFS would reveal other function calls (and other "chunks") that were related to that node.

Together, forward and reverse traversal made the graph feel much less like static metadata and much more like a navigable execution map of the repository.
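Here is a small sketch of depth-limited BFS over a call graph, run once forward and once in reverse. The adjacency dictionary, the function names in it, and the depth limit of 2 are assumptions for illustration.

```python
from collections import deque


def bfs(start, edges, max_depth=2):
    """Return nodes reachable from start within max_depth hops, in BFS order."""
    visited = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for neighbor in edges.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                order.append(neighbor)
                queue.append((neighbor, depth + 1))
    return order


def reverse_edges(edges):
    """Flip caller -> callee edges into callee -> caller edges."""
    flipped = {}
    for caller, callees in edges.items():
        for callee in callees:
            flipped.setdefault(callee, []).append(caller)
    return flipped


# Hypothetical call graph (illustration only).
calls = {
    "login_access_token": ["authenticate", "create_access_token"],
    "authenticate": ["get_user_by_email", "verify_password"],
}

print(bfs("login_access_token", calls))             # what does login call?
print(bfs("verify_password", reverse_edges(calls)))  # who depends on this?
```

The same `bfs` function answers both questions. Only the edge direction changes.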

Once I switched traversal to BFS, retrieval immediately started feeling more coherent.

Query-Aware Expansion

The next problem was the naive BFS implementation. Naive graph expansion retrieves way too much context. If you blindly expand neighbors inside a large repository, the graph explodes into noise very quickly. Especially around highly connected framework code.

So graph expansion had to become query-aware.

Instead of expanding everything equally, the system started looking at:

- symbol overlap
- semantic similarity
- auth-related terminology
- file roles
- query keywords

before deciding what to expand.

For example:

query = authentication

should prioritize:

token middleware
JWT validation
auth decorators

and not:

generic request logging
unrelated utilities
serialization helpers

Once I managed to code this in, the graph was no longer purely structural. It was becoming semantic.
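A stripped-down sketch of query-aware expansion: a neighbor is only kept, and only expanded further, if its name shares a term with the query. The `RELATED_TERMS` table is a hypothetical stand-in for the auth-related terminology signal, and the symbol names are invented. The real system combines several more signals than this.

```python
import re

# Hypothetical related-terms table (assumption for illustration).
RELATED_TERMS = {"authentication": {"auth", "token", "jwt", "login"}}


def tokens(text):
    """Split text or a symbol name into lowercase word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def query_terms(query):
    """Query tokens plus any related terms from the table."""
    terms = tokens(query)
    for term in list(terms):
        terms |= RELATED_TERMS.get(term, set())
    return terms


def expand_query_aware(seeds, edges, query, depth=2):
    """Expand from seeds, following only neighbors that match the query."""
    wanted = query_terms(query)
    seen = set(seeds)
    frontier, kept = list(seeds), []
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            for neighbor in edges.get(node, []):
                if neighbor in seen:
                    continue
                seen.add(neighbor)
                if tokens(neighbor) & wanted:  # symbol overlaps the query
                    kept.append(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return kept


# Hypothetical call graph (illustration only).
edges = {
    "routes.login": ["middleware.verify_jwt_token", "utils.log_request",
                     "decorators.require_auth"],
    "middleware.verify_jwt_token": ["security.decode_token",
                                    "utils.serialize_response"],
}

print(expand_query_aware(["routes.login"], edges, "authentication"))
```

Pruning at expansion time matters more than filtering at the end: an irrelevant node that is never expanded cannot drag its own neighbors into the context.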

The Utility Node Problem

Another issue appeared during expansion. Highly connected utility functions started dominating retrieval.

Things like:

log_info()
handle_error()
serialize_response()

showed up everywhere. The graph accidentally rewarded centrality. Which sounds mathematically elegant until your retrieval system starts implying logging is the answer to everything, simply because that function appeared 1000 times...

So I introduced penalties for high-degree nodes. Highly connected utility-heavy functions received lower expansion priority. This is similar in spirit to how TF-IDF down-weights terms that appear in almost every document, except applied to function calls.

That cleanup improved retrieval quality far more than I expected ... because now the graph stopped constantly expanding into irrelevant framework plumbing.
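One possible form for such a penalty is an IDF-style weight on the number of callers. The exact formula below is my assumption, not necessarily the one KernelMind uses, and the degrees and node count are made-up numbers.

```python
import math


def degree_penalty(degree, total_nodes):
    """IDF-style weight in [0, 1].

    Returns 1.0 for a node nobody calls and 0.0 for a node that
    every node in the graph calls. Requires total_nodes >= 1.
    """
    return math.log((1 + total_nodes) / (1 + degree)) / math.log(1 + total_nodes)


def expansion_priority(relevance, degree, total_nodes):
    """Scale a node's query relevance by its degree penalty."""
    return relevance * degree_penalty(degree, total_nodes)


# Hypothetical in-degrees in a graph of 2000 nodes (illustration only).
total_nodes = 2000
in_degree = {"log_info": 1000, "create_access_token": 3}

for name, degree in in_degree.items():
    print(name, round(expansion_priority(1.0, degree, total_nodes), 3))
```

Even with identical relevance, the rarely called function now outranks the logging helper during expansion.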

Semantic Graph Expansion

This was where the architecture started becoming much more interesting. Originally, graph relationships were purely structural:

A calls B

Eventually, I started combining graph relationships with:

semantic similarity
symbol relevance
query intent

so the traversal could prioritize execution paths actually related to the user’s question instead of blindly expanding every connected node.

This made a huge difference for repository reasoning.
Queries about authentication naturally began surfacing middleware chains, token validation logic, and request lifecycle flows instead of drifting into unrelated utility code and framework plumbing.

The traversal pipeline slowly evolved into something closer to:

results = initial_retrieval(query)

expanded = bfs_expand(
    results,
    query_aware=True,
    semantic_weighting=True,
    depth=2
)

Now, my retrieval architecture started feeling execution-aware.

The Biggest Realization

This entire phase fundamentally changed how I thought about retrieval systems. Originally, I assumed retrieval quality depended mostly on embeddings.

Eventually I realized:

Retrieval quality depends heavily on structure.

The graph was improving retrieval not because the model became smarter, but because the context became more coherent. The system stopped retrieving isolated functions. It started retrieving workflows.

And finally, the graph structure stabilized:

- symbol identity existed
- traversal worked
- execution flow became traceable
- relationships became meaningful

All this time, I was working with MongoDB, and storing the "chunks" in a collection. This was excellent for debugging, but now that my repository structure had stabilized, and I was confident enough in my Graph architecture, I was ready to move into embeddings and retrieval ranking properly.

Part 2 is coming up soon! Until then, you can check out my code here