Ishaan Mavinkurve

Posted on May 31

Building Dhrishti Part 2: Go-Lang Quirks

#programming #devops #architecture #showdev

— written by a human!

Now, my thinking about Dhrishti had evolved - I wanted to decouple the different steps of receiving telemetry, which were originally bunched together in a single loader.go file.

I made the following architecture:

events.go - When my eBPF code ran, it would produce data as raw binary structs. Hence, my Go code, while going through the ring buffer, would get RAW BYTES. In Go, I needed structs that would EXACTLY match the structs written in my bpf.c code. This agreement on binary layout is part of what is called an Application Binary Interface, or ABI. It would allow my Go code to decode the bytes exactly and get the actual data in a readable format.
receiver.go - This was the layer that would ingest my raw data by reading it continuously from the ring buffer. This needed some beautiful event-driven architecture, and it was actually the first time I had tried it out.
normalize.go - Now, I had data in a machine-friendly format: my timestamps were in nanoseconds, my enums were numeric, my IP addresses were uint32. This was useful to the machine, not so useful for me or other humans. I now needed to normalize the data and convert it to a human-readable form.
pipeline.go - This was the orchestrator, where different goroutines ran in parallel to receive the data emitted by my probes, normalize it and log it.
attach.go - I needed this file to attach the probes to my receiver, and make a connection so I could start reading the events. It would load the object files, create the ring-buffer readers and attach the probes to the kernel programs.
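
To make the events.go and normalize.go steps concrete, here is a minimal sketch of decoding one ring-buffer record and turning it into readable values. It is not Dhrishti's actual code: the rawEvent layout, the enum values and the helper names (decode, kindName, ipString) are assumptions for illustration, and it assumes a little-endian host with IPv4 addresses stored in network byte order.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"net/netip"
	"time"
)

// rawEvent mirrors a hypothetical C struct from bpf.c. Field order, widths
// and padding are assumptions for this sketch; they must match the C side.
type rawEvent struct {
	TimestampNs uint64 // __u64: nanoseconds from a kernel clock
	PID         uint32 // __u32
	SrcIP       uint32 // __u32: IPv4 bytes as they sit in kernel memory
	DstIP       uint32 // __u32
	DstPort     uint16 // __u16: assumed to be in host byte order already
	Kind        uint8  // __u8: 1=connect, 2=accept, 3=close (assumed)
	_           uint8  // explicit padding so the total size is 24 bytes
}

// decode turns one ring-buffer record into a rawEvent.
func decode(record []byte) (rawEvent, error) {
	var ev rawEvent
	if want := binary.Size(ev); len(record) < want {
		return ev, fmt.Errorf("short record: got %d bytes, want %d", len(record), want)
	}
	// Assumes a little-endian host such as x86-64 or arm64.
	err := binary.Read(bytes.NewReader(record), binary.LittleEndian, &ev)
	return ev, err
}

// kindName converts the numeric enum into a readable label.
func kindName(k uint8) string {
	switch k {
	case 1:
		return "connect"
	case 2:
		return "accept"
	case 3:
		return "close"
	}
	return "unknown"
}

// ipString restores the in-memory byte order of the address and formats it.
func ipString(ip uint32) string {
	var b [4]byte
	binary.LittleEndian.PutUint32(b[:], ip)
	return netip.AddrFrom4(b).String()
}

func main() {
	record := []byte{
		0x00, 0x2f, 0x68, 0x59, 0, 0, 0, 0, // timestamp in ns
		42, 0, 0, 0, // pid
		10, 0, 0, 1, // source address
		10, 0, 0, 2, // destination address
		0x90, 0x1f, // destination port
		1, // kind
		0, // padding
	}
	ev, err := decode(record)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Printf("%s %s -> %s:%d pid=%d t=%s\n", kindName(ev.Kind),
		ipString(ev.SrcIP), ipString(ev.DstIP), ev.DstPort, ev.PID,
		time.Duration(ev.TimestampNs))
}
```

The explicit padding field is the important part: if the C compiler pads a struct and the Go side does not account for it, every field after the gap decodes to the wrong value without any error.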

I thought this was a clean enough architecture. Now, when I ran my basic server in Docker, and ran the main.go program, I got:

Beautiful. This did not look like much, but I was actually processing quite a few events. Now, I had to resolve the names of the docker containers, so I knew the actual connections rather than the IPs. I already had the functions to do this, and I just had to add them into the updated flow to get:

Now, it was time to take a bigger step. Until now, I was using a simple client-server architecture. This was good. However, I now wanted a real challenge for my project.
So I made the following architecture:

I built a micro-services architecture that was using this design. This would be a more complex, more real world test for Dhrishti. I dockerized the services, ran the containers, started Dhrishti.

And the result?

Beautiful. All connections were seen correctly.

Now, the next step was to actually make sense of all of these arrows. The raw telemetry I was getting was stateless. That meant it could only understand:

connect happened
close happened
accept happened

But… who connected to whom? How long was the connection? How many connection attempts succeeded?

To answer this, I decided to build a connection state. This would track a connection from open to close, and also track failed connections.
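
A connection state tracker along these lines could be sketched as below. This is an illustration under assumptions, not the project's implementation: the connKey fields, the string event kinds and the Tracker type are all hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// connKey identifies one connection (hypothetical: names plus ports).
type connKey struct {
	Src, Dst         string
	SrcPort, DstPort uint16
}

// connState is what we remember between the open and the close.
type connState struct {
	openedNs uint64
	accepted bool
}

// Tracker turns stateless events into per-connection state.
type Tracker struct {
	open      map[connKey]*connState
	Completed int // closed after being accepted
	Failed    int // closed without ever being accepted
}

func NewTracker() *Tracker {
	return &Tracker{open: make(map[connKey]*connState)}
}

// Observe applies one event. When the event closes a tracked connection,
// it returns the connection's lifetime and true.
func (t *Tracker) Observe(kind string, key connKey, tsNs uint64) (time.Duration, bool) {
	switch kind {
	case "connect":
		t.open[key] = &connState{openedNs: tsNs}
	case "accept":
		if st, ok := t.open[key]; ok {
			st.accepted = true
		}
	case "close":
		st, ok := t.open[key]
		if !ok {
			return 0, false // the open was missed, so there is nothing to pair with
		}
		delete(t.open, key)
		if st.accepted {
			t.Completed++
		} else {
			t.Failed++
		}
		return time.Duration(tsNs - st.openedNs), true
	}
	return 0, false
}

func main() {
	tr := NewTracker()
	key := connKey{Src: "gateway", Dst: "auth-service", SrcPort: 40000, DstPort: 8080}
	tr.Observe("connect", key, 1_000_000)
	tr.Observe("accept", key, 1_200_000)
	if life, ok := tr.Observe("close", key, 4_000_000); ok {
		fmt.Printf("%s -> %s lived %s (completed=%d failed=%d)\n",
			key.Src, key.Dst, life, tr.Completed, tr.Failed)
	}
}
```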

I also had a separate problem - sometimes, I saw:

gateway -> auth-service
auth-service -> gateway

This was essentially one request-response cycle, and I had to track it as such. So, I decided to construct a flow correlation engine.
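
One common way to fold both directions into a single flow is a canonical key that ignores which side reported the event. The sketch below shows that idea only; Endpoint, FlowKey and Correlator are hypothetical names, and treating the first-seen source as the initiator is an assumption.

```go
package main

import "fmt"

// Endpoint is one side of a connection (hypothetical shape).
type Endpoint struct {
	Name string
	Port uint16
}

// FlowKey holds the two endpoints in a fixed order.
type FlowKey struct{ A, B Endpoint }

func less(x, y Endpoint) bool {
	if x.Name != y.Name {
		return x.Name < y.Name
	}
	return x.Port < y.Port
}

// canonical returns the same key for (a, b) and (b, a).
func canonical(a, b Endpoint) FlowKey {
	if less(b, a) {
		a, b = b, a
	}
	return FlowKey{A: a, B: b}
}

// Flow aggregates everything seen for one pair of endpoints.
type Flow struct {
	Initiator string // assumed to be the source of the first event seen
	Events    int
}

type Correlator struct {
	flows map[FlowKey]*Flow
}

func NewCorrelator() *Correlator {
	return &Correlator{flows: make(map[FlowKey]*Flow)}
}

// Observe records one directed event and returns the flow it belongs to.
func (c *Correlator) Observe(src, dst Endpoint) *Flow {
	key := canonical(src, dst)
	f, ok := c.flows[key]
	if !ok {
		f = &Flow{Initiator: src.Name}
		c.flows[key] = f
	}
	f.Events++
	return f
}

func main() {
	gateway := Endpoint{Name: "gateway", Port: 40000}
	auth := Endpoint{Name: "auth-service", Port: 8080}

	c := NewCorrelator()
	c.Observe(gateway, auth)      // gateway -> auth-service
	f := c.Observe(auth, gateway) // auth-service -> gateway, same flow
	fmt.Printf("flows=%d events=%d initiator=%s\n", len(c.flows), f.Events, f.Initiator)
}
```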

The next problem I had to tackle was failed connections. If I saw a closed=True with an accept=False, I was looking at a connection that was never accepted by the server, and I had to track these as well. I also had a problem with short-lived connections - connections that were made and closed so fast that I either missed the connection entirely (which was okay, because I think telemetry services are lossy to some extent anyway) or recorded the connection open but not the connection close. The second case was a real problem: some graph edges remained open forever, which was not right.

Hence, I added a cleaner - it would track connections that were open for more than 30 seconds (later reduced), close them and clean up memory.
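
The core of such a cleaner is a sweep over the open-connection table. Here is a minimal sketch, assuming connections are keyed by a string ID and timestamps are nanoseconds from the same clock as the events; OpenConns and its methods are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// OpenConns holds connections that have been opened but not yet closed.
type OpenConns struct {
	mu       sync.Mutex
	openedNs map[string]uint64 // connection ID -> open timestamp in ns
}

func NewOpenConns() *OpenConns {
	return &OpenConns{openedNs: make(map[string]uint64)}
}

func (o *OpenConns) Open(id string, tsNs uint64) {
	o.mu.Lock()
	defer o.mu.Unlock()
	o.openedNs[id] = tsNs
}

func (o *OpenConns) Len() int {
	o.mu.Lock()
	defer o.mu.Unlock()
	return len(o.openedNs)
}

// Sweep drops every connection that has been open longer than maxAge and
// returns how many were evicted. Deleting during range is safe in Go.
func (o *OpenConns) Sweep(nowNs uint64, maxAge time.Duration) int {
	o.mu.Lock()
	defer o.mu.Unlock()
	evicted := 0
	for id, opened := range o.openedNs {
		if nowNs > opened && nowNs-opened > uint64(maxAge) {
			delete(o.openedNs, id)
			evicted++
		}
	}
	return evicted
}

func main() {
	conns := NewOpenConns()
	conns.Open("gateway->auth-service#1", 0)
	conns.Open("gateway->orders#7", uint64(35*time.Second))

	evicted := conns.Sweep(uint64(40*time.Second), 30*time.Second)
	fmt.Printf("evicted=%d still open=%d\n", evicted, conns.Len())
}
```

In a running service, Sweep would typically be called from a goroutine driven by a time.Ticker, which is why the table is guarded by a mutex.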

I also needed something that looked like real-time metrics. Until then, I was calculating the average latency between connections over the whole run, for example. But when I observed my results, I saw that after a point, new connections barely moved that average. I wanted to ensure that if something was failing, I knew it immediately - so I added calculations for:

- rolling window temporal calculations
- p95 latency (the latency that 95% of connections stay at or below)
- rolling averages (over a sliding window)
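
These three metrics can be sketched with one sliding window of samples. The version below is an assumption-laden illustration (the Window type, millisecond latencies and the nearest-rank percentile method are my choices, not necessarily what Dhrishti uses).

```go
package main

import (
	"fmt"
	"sort"
)

type sample struct {
	tsNs      uint64
	latencyMs float64
}

// Window keeps only the samples from the last spanNs nanoseconds.
type Window struct {
	spanNs  uint64
	samples []sample // oldest first
}

func NewWindow(spanNs uint64) *Window { return &Window{spanNs: spanNs} }

// Add records a sample and evicts everything older than the window.
func (w *Window) Add(tsNs uint64, latencyMs float64) {
	w.samples = append(w.samples, sample{tsNs: tsNs, latencyMs: latencyMs})
	i := 0
	for i < len(w.samples) && w.samples[i].tsNs+w.spanNs < tsNs {
		i++
	}
	w.samples = w.samples[i:]
}

// Avg is the rolling average over the samples still in the window.
func (w *Window) Avg() float64 {
	if len(w.samples) == 0 {
		return 0
	}
	sum := 0.0
	for _, s := range w.samples {
		sum += s.latencyMs
	}
	return sum / float64(len(w.samples))
}

// Percentile returns the nearest-rank p-th percentile, with p in 1..100.
func (w *Window) Percentile(p int) float64 {
	n := len(w.samples)
	if n == 0 {
		return 0
	}
	vals := make([]float64, n)
	for i, s := range w.samples {
		vals[i] = s.latencyMs
	}
	sort.Float64s(vals)
	rank := (p*n + 99) / 100 // integer form of ceil(p*n/100)
	if rank < 1 {
		rank = 1
	}
	if rank > n {
		rank = n
	}
	return vals[rank-1]
}

func main() {
	w := NewWindow(10_000_000_000) // a 10 second window, chosen arbitrarily
	for i := 1; i <= 100; i++ {
		w.Add(uint64(i)*1_000_000, float64(i))
	}
	fmt.Printf("avg=%.1fms p95=%.0fms\n", w.Avg(), w.Percentile(95))
}
```

Because old samples fall out of the window, a sudden latency spike shows up in the average and the p95 right away instead of being diluted by the whole history.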

After adding these components, my metrics started to look like this:

If you're thinking, "This is a LOT of information!" - yes, so was I. At this point, the client in my mock service was REALLY RAPIDLY sending requests to my API gateway, and it was becoming difficult to actually analyze my results.

I even tried to add some time gaps between the requests sent by the client in my mock service, and added a keep-alive time for the requests themselves… but the terminal logs were still going by too fast for me to understand anything.

So, I decided to load up Cursor, and vibe-coded the entire front-end for my application. I just wanted a UI to view my metrics correctly. I was not concerned with UI polish for now. After a little bit of prompting, I decided to implement a cytoscape.js graph (which would give me an interactive graph with a legend) as the front-end, fed over a web-socket from my Go backend.
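
For a rough idea of what the backend might push to the browser, the sketch below builds a list of elements in the JSON shape cytoscape.js accepts (a group plus a data object with id, source and target). The Edge type, the p95Ms and active fields, and the graphElements helper are hypothetical; the web-socket transport itself is left out because it depends on the library used.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Edge is one observed connection between two services (hypothetical shape).
type Edge struct {
	Source, Target string
	P95Ms          float64
	Active         bool
}

// elementData carries the fields cytoscape.js reads (id, source, target)
// plus extra fields a stylesheet or tooltip could use.
type elementData struct {
	ID     string  `json:"id"`
	Source string  `json:"source,omitempty"`
	Target string  `json:"target,omitempty"`
	P95Ms  float64 `json:"p95Ms,omitempty"`
	Active bool    `json:"active,omitempty"`
}

type element struct {
	Group string      `json:"group"` // "nodes" or "edges"
	Data  elementData `json:"data"`
}

// graphElements emits each node once, in first-seen order, and one
// element per edge.
func graphElements(edges []Edge) []element {
	var out []element
	seen := make(map[string]bool)
	addNode := func(id string) {
		if !seen[id] {
			seen[id] = true
			out = append(out, element{Group: "nodes", Data: elementData{ID: id}})
		}
	}
	for _, e := range edges {
		addNode(e.Source)
		addNode(e.Target)
		out = append(out, element{Group: "edges", Data: elementData{
			ID:     e.Source + "->" + e.Target,
			Source: e.Source,
			Target: e.Target,
			P95Ms:  e.P95Ms,
			Active: e.Active,
		}})
	}
	return out
}

func main() {
	msg, err := json.Marshal(graphElements([]Edge{
		{Source: "gateway", Target: "auth-service", P95Ms: 12.5, Active: true},
	}))
	if err != nil {
		fmt.Println("marshal error:", err)
		return
	}
	fmt.Println(string(msg)) // this payload would be written to the web-socket
}
```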

Okayy, this was looking pretty good! The connections that were active would be dotted lines, the colors of the connections represented the latencies, and hovering on a connection even gave me all the extra information - like connection life, p99 and p95 latencies, etc.

It also exposed some Go-Lang related issues. This was the part where it got interesting. I had never worked with Go so heavily until now. I knew the concepts I was using and the documentation was VERY comprehensive, but I still made some very interesting mistakes:

I was using mutexes for a certain part of Dhrishti. Basically, a Go listener would hold a thread until it heard a probe emit an event.
This was directly messing with my server stats, because it caused deadlocks, with one goroutine waiting on the other to release its lock, and the other one waiting on the first to release - so I had to do some refactoring to prevent it.
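
One general way to avoid this class of deadlock is to never hold a lock while waiting for an event, and to lock only around the shared-state update. The sketch below shows that pattern; it is not the actual refactor from Dhrishti, and Stats, consume and run are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// Stats is shared between the event listener and anything reading metrics.
type Stats struct {
	mu     sync.Mutex
	counts map[string]int
}

// consume waits for events WITHOUT holding the mutex, then takes the lock
// only for the short map update.
func (s *Stats) consume(events <-chan string) {
	for kind := range events { // the blocking wait happens with no lock held
		s.mu.Lock()
		s.counts[kind]++
		s.mu.Unlock() // released before waiting for the next event
	}
}

// Snapshot returns a copy so callers never touch the shared map directly.
func (s *Stats) Snapshot() map[string]int {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make(map[string]int, len(s.counts))
	for k, v := range s.counts {
		out[k] = v
	}
	return out
}

// run feeds the given event kinds through a listener goroutine while a
// reader keeps taking snapshots, then returns the final counts.
func run(kinds []string) map[string]int {
	s := &Stats{counts: make(map[string]int)}
	events := make(chan string)

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		s.consume(events)
	}()

	for _, k := range kinds {
		events <- k
		_ = s.Snapshot() // a reader can always get the lock between events
	}
	close(events)
	wg.Wait()
	return s.Snapshot()
}

func main() {
	fmt.Println(run([]string{"connect", "accept", "connect", "close"}))
}
```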
The next, more subtle issue was with Go’s own Garbage Collector. This is the part of the Go runtime that runs periodically and frees any memory the program can no longer reach. This bug took me SO LONG to resolve, but when I finally had it, I was probably the happiest man alive for about 3 minutes.

My app had 4 “listeners” plugged into Linux kernel events (like satellite dishes listening for TCP connect/accept/close activity from kernel space). Those listeners were created at startup and used to feed data into my Go pipelines. However, nothing in my code kept a reference to them after startup, so the GC only saw that these listeners were created ONCE and then unused - and it decided to clean them up, breaking my graph after around 20 to 30 seconds. I had to force these listener objects to stay alive for the full life of the app by storing them in a forever-running goroutine context.

In simple terms: I gave Go a permanent “don’t throw this away” reference. This was the first time I had run into problems with Go-Lang’s quirks.

Now, I had a working UI, a good amount of information from my probes, some GREAT lessons from building the project in Go, and it was time to test out my project on something... bigger.

The next step was to set up and use a real, actual GitHub repo that replicated an application. I had options like Google's Online Boutique, for example - which simulates a real e-commerce website with a lot of micro-services. I also wanted to experiment with tools like hey and k6 to simulate production behaviour. But I am still building this phase out, and I will document it as I move forward. Let me know if you have some tips for this phase, please!

Check out Dhrishti here: https://github.com/IdiotCoffee/dhrishti

Top comments (2)

Harjot Singh • May 31

The ABI struct-matching detail is the part that bites everyone doing eBPF-to-Go and almost nobody warns you about: your Go struct and your bpf.c struct have to agree byte-for-byte, and the failure mode is silent, no compile error, no panic, just subtly wrong decoded values because of padding or alignment differences between how C and Go lay out fields. It's the same class of bug as schema drift in any contract: two sides believe they agree, the bytes say otherwise, and you only find out when a field reads garbage. That's exactly why decoupling loader.go into events/receiver was the right move, separating "decode the raw bytes correctly" from "do something with them" means you can verify the boundary in isolation instead of debugging the whole pipeline at once. The discipline of verifying the contract at the boundary rather than trusting it is the same instinct I build into Moonshift. Did you end up generating the Go structs from the C definitions to keep them in sync, or are they hand-maintained and you just stay disciplined about matching them?

Ishaan Mavinkurve • Jun 1

Currently, I am building them up by hand, but I think I need a scalable way of doing it as I do want to keep adding more probes into my project with time....
I would love to check out your repo too!