Building Dhrishti - Part 3: Testing on a Production Grade System

#showdev #go #redis #architecture

I was now done with the basic setup. However, my time working at a startup has taught me to look at a project while wearing multiple hats. One concern was overhead: Dhrishti would be running on a server that was already loaded, so I did NOT want the tracking application itself to be heavy. I had to set some benchmarks to make sure Dhrishti did not consume a tonne of resources while tracking the metrics.

I also had a problem with unresolved requests. In my mock_services, I had a client that continuously hit the API Gateway service. I had to fine-tune its requests so that I could run tests under different loads, but the advantage was that my project could easily discern where each client request was coming from. In a production scenario, however, you can never know where a request is coming from: obviously, we cannot resolve arbitrary customer IPs to their respective customer names.

This was the first problem. I had to specify what a customer was, and what an unknown request was. I came up with the following solution:

Any unresolved IP is added to a table in the UI called the unresolved IP table, which would help me with debugging later. Any unresolved IP that also made requests to an ENTRY-POINT into my application could then be treated as a customer. For this, I simply had to filter the unknown IPs and keep a configurable list of entry-points in dhrishti.json (in the case of my mock micro-service architecture, only 1).

Now, I could differentiate between 2 types of unknown IPs - one which was potentially a customer, one which was a background network call, not important to the working system.
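A minimal sketch of that classification rule in Go. It assumes a hypothetical `entry_points` key in dhrishti.json and a lookup table of already-resolved service IPs; the names `Config`, `Classify`, and the sample addresses are illustrative and not taken from Dhrishti's actual code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Config mirrors a hypothetical slice of dhrishti.json. The key name
// "entry_points" is an assumption, not the project's real schema.
type Config struct {
	EntryPoints []string `json:"entry_points"`
}

// Kind labels where a connection came from.
type Kind string

const (
	KnownService Kind = "known-service"
	Customer     Kind = "customer"
	Background   Kind = "background"
)

// Classify applies the rule described above: a resolved IP is a known
// service, an unresolved IP that hits an entry-point is a potential
// customer, and any other unresolved IP is background traffic.
func Classify(srcIP, dst string, known map[string]string, entry map[string]bool) Kind {
	if _, ok := known[srcIP]; ok {
		return KnownService
	}
	if entry[dst] {
		return Customer
	}
	return Background
}

func main() {
	raw := []byte(`{"entry_points": ["api-gateway:8080"]}`)
	var cfg Config
	if err := json.Unmarshal(raw, &cfg); err != nil {
		panic(err)
	}
	entry := make(map[string]bool, len(cfg.EntryPoints))
	for _, e := range cfg.EntryPoints {
		entry[e] = true
	}
	known := map[string]string{"172.18.0.5": "orders"}

	fmt.Println(Classify("172.18.0.5", "inventory:9000", known, entry))    // known-service
	fmt.Println(Classify("203.0.113.7", "api-gateway:8080", known, entry)) // customer
	fmt.Println(Classify("203.0.113.9", "inventory:9000", known, entry))   // background
}
```

Both unresolved cases would still land in the unresolved IP table; the entry-point check only decides which of them get promoted to customers.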

The next problem was with the client service itself. It was difficult to simulate, say - a million users in my system. I had essentially built a service which was only being used by 1 customer, but how would Dhrishti behave if I added multiple client IPs?

Using k6

k6 is an open-source load-testing tool from Grafana Labs that helps developers simulate real-world load on servers. All I have to do is write a file, load.js, and define a bunch of parameters:

Which entry point to hit?
Do I ramp up traffic at a certain point? Do I ramp it down? Keep it constant?
Is there a potential sleep between 2 subsequent requests?
How many virtual users do I want to simulate? How long do I run the simulation for?
I can also simulate different scenarios - many purchases being made, a flash sale service, people simply browsing the website, etc.

All of it in a JSON-like options object inside load.js! Simple, easy to implement. Then, I could simulate the load by passing this script to k6.

Side-note: This, by the way, was a really interesting experience - I could model my own sandbox and even wreak havoc inside it. I really want to try out more simulations with k6 in the future!

After this, I built a module to benchmark the overhead for Dhrishti. I had to write a bunch of shell scripts for this, which fortunately I was familiar with as I have been using Linux for quite some time now.

First, I had to write some code in Go to get CPU information: how many threads, how many cores, which OS I was running, etc.
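A rough sketch of what such a helper can look like using only Go's standard library. The `HostInfo` struct and `CollectHostInfo` function are hypothetical names. Note that `runtime.NumCPU` reports logical CPUs available to the process, so getting physical core counts would need something extra (for example, parsing /proc/cpuinfo on Linux), which is left out here.

```go
package main

import (
	"fmt"
	"runtime"
)

// HostInfo holds the basic facts worth recording next to a benchmark run.
type HostInfo struct {
	OS          string
	Arch        string
	LogicalCPUs int // logical CPUs (hardware threads), not physical cores
	GoVersion   string
}

// CollectHostInfo reads everything from the Go runtime, with no
// external dependencies.
func CollectHostInfo() HostInfo {
	return HostInfo{
		OS:          runtime.GOOS,
		Arch:        runtime.GOARCH,
		LogicalCPUs: runtime.NumCPU(),
		GoVersion:   runtime.Version(),
	}
}

func main() {
	h := CollectHostInfo()
	fmt.Printf("os=%s arch=%s logical_cpus=%d go=%s\n",
		h.OS, h.Arch, h.LogicalCPUs, h.GoVersion)
}
```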

Then, I ran k6 without Dhrishti to establish a baseline. I wrote a baseline.sh script, which would run k6 and give me baseline stats:

Then, I wrote a script that would benchmark the same run as seen by Dhrishti: instead of using the results given by k6, I would print out the results given by Dhrishti. I also had a small compare.sh script to compare CPU performance with and without Dhrishti.

The benchmark with Dhrishti came up as follows:

AWESOME - the engine used up barely any RAM. Dhrishti was pretty lightweight.

Then, I wrote the compare script, which normalizes events per second into requests per second, because Dhrishti was capturing each request as separate TCP connect, TCP accept, and TCP close events. After putting it all together, I had the following:
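As an aside on the normalization step itself (this is not the benchmark output), the arithmetic can be sketched in Go as below. It assumes every request produces exactly three events (connect, accept, close); lost events would make the real ratio slightly different. The function and constant names are hypothetical.

```go
package main

import "fmt"

// eventsPerRequest assumes one connect, one accept and one close event
// per request. This is an illustrative assumption.
const eventsPerRequest = 3.0

// RequestsPerSecond converts a raw event count over a time window into
// an approximate request rate.
func RequestsPerSecond(events int, seconds float64) float64 {
	if seconds <= 0 {
		return 0 // avoid dividing by zero on an empty window
	}
	return float64(events) / eventsPerRequest / seconds
}

func main() {
	// 9000 events captured over 10 seconds is about 300 requests/second.
	fmt.Printf("%.2f req/s\n", RequestsPerSecond(9000, 10))
}
```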

Dhrishti was doing pretty well! I got a delta of barely 0.01 req/s while running a simulation with 50,000 concurrent users! The more problematic numbers were the p95 latency numbers - where I had a delta of around 14% - but that was to be expected.

Why?

Because Dhrishti was not always able to detect a connection close event. In Part 2, I discussed how very short-lived connections were either never detected by Dhrishti at all, or showed up as a TCP_OPEN event with no corresponding TCP_CLOSE event. This was expected because the inference engine is lossy by nature, so I had added a cleanup service in Go to remove such connections after 30 seconds of inactivity. For those requests, Dhrishti therefore recorded a much longer lifetime, which pushed its p95 / p99 latency above what k6 reported. k6, I believe, uses a different, more accurate method to keep track of the requests that it is making.
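To make that effect concrete, here is a minimal sketch of an idle-connection sweeper in Go. It is not Dhrishti's actual cleanup service: the `ConnTable` type, its methods, and the use of monotonic nanosecond timestamps are all assumptions for illustration. In a real service, `Sweep` would typically be driven by a `time.Ticker`.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// IdleTimeout matches the 30 seconds of inactivity mentioned above.
const IdleTimeout = 30 * time.Second

// conn tracks one open connection. Timestamps are monotonic nanoseconds,
// which is an assumption about what the probe reports.
type conn struct {
	openedNs   int64
	lastSeenNs int64
}

// Evicted describes a connection that the sweeper had to force-close.
type Evicted struct {
	ID       string
	Lifetime time.Duration // recorded lifetime, inflated by the idle wait
}

// ConnTable is a mutex-guarded table of currently open connections.
type ConnTable struct {
	mu    sync.Mutex
	conns map[string]*conn
}

func NewConnTable() *ConnTable {
	return &ConnTable{conns: make(map[string]*conn)}
}

// Touch records activity, creating the entry on the first (open) event.
func (t *ConnTable) Touch(id string, nowNs int64) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if c, ok := t.conns[id]; ok {
		c.lastSeenNs = nowNs
		return
	}
	t.conns[id] = &conn{openedNs: nowNs, lastSeenNs: nowNs}
}

// Close removes a connection whose close event was actually observed.
func (t *ConnTable) Close(id string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.conns, id)
}

// Sweep evicts every connection that has been idle for IdleTimeout or more.
func (t *ConnTable) Sweep(nowNs int64) []Evicted {
	t.mu.Lock()
	defer t.mu.Unlock()
	var out []Evicted
	for id, c := range t.conns {
		if time.Duration(nowNs-c.lastSeenNs) >= IdleTimeout {
			out = append(out, Evicted{ID: id, Lifetime: time.Duration(nowNs - c.openedNs)})
			delete(t.conns, id) // deleting during range is safe in Go
		}
	}
	return out
}

func main() {
	const sec = int64(time.Second)
	table := NewConnTable()
	table.Touch("10.0.0.7:51000->catalog:8080", 0)      // close event never arrives
	table.Touch("10.0.0.7:51001->pricing:8080", 20*sec) // still recently active
	for _, e := range table.Sweep(31 * sec) {
		fmt.Printf("evicted %s after %v\n", e.ID, e.Lifetime)
	}
}
```

A connection that really lasted a few milliseconds ends up with a recorded lifetime of 30 seconds or more, and a handful of those is enough to drag the tail percentiles upward.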

Now, I was ready for a bigger challenge. All this while I was using mock micro-services which were… naive. There were no hidden surprises, no production grade code. Dhrishti was cruising along while observing this architecture. I needed to challenge my project and stress test it to the fullest.

So, I deleted my mock micro-service architecture, and made a production grade architecture simulating an e-commerce website. It was complete with 15 interconnected services: gateway, catalog, pricing, inventory, payments, notifications, etc., designed to mimic a real production topology rather than a toy demo. There were lots of hidden dependencies and this architecture had a much more complex topology. This is what Dhrishti would typically see in a production grade e-commerce store. I also removed the client service completely as k6 would be handling requests to this architecture internally.

Another small aspect I feel I should cover - in the last Part about building Dhrishti, I designed a very simple, completely vibe-coded UI. Now, I learnt a little more about cytoscape.js, and modelled a better, cleaner UI which looked like this:

I also wanted to play around with a technology that I had recently learnt: Redis. Redis, by the way, is an in-memory data store, best known as a cache, that has a LOT of pretty cool features. I had recently studied Redis sorted sets, geospatial indexes, and TimeSeries while going through a course on Redis University. They have, in my opinion, one of the best communities in the tech world. However, having now learnt about it, I wanted to use it in a project to really get a feel for it. But what was the best way to do so?

I decided to use RedisTimeSeries. TimeSeries lets us capture time-stamped data in a dedicated, compressed time-series structure (a job that plain Redis users often approximate with sorted sets). I thought it was one of the coolest features Redis had to offer (apart from their geospatial offerings, of course!), and I wanted to use it to capture all events occurring in the last 24 hours. I would then store them so that I could view the events in a clean timeline view. This would allow users to compare things like: how did incoming requests change over time for Service A as compared to Service B? When exactly did Service X crash? When did users peak on my e-commerce store?
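As a rough illustration of what a 24-hour window looks like at the command level, the sketch below builds the arguments for the RedisTimeSeries commands `TS.ADD` and `TS.RANGE` in plain Go. The key name, the helper names, and the choice of `sum` aggregation are assumptions, and no Redis client is included; the argument slices would be handed to whatever generic command call your client offers (for example, a `Do`-style method).

```go
package main

import "fmt"

// retentionMs is 24 hours in milliseconds. RedisTimeSeries trims
// samples older than the retention period.
const retentionMs = int64(24 * 60 * 60 * 1000)

// TSAddArgs builds a TS.ADD command. RETENTION here only takes effect
// if this command is the one that creates the series.
func TSAddArgs(key string, tsMillis int64, value float64) []any {
	return []any{"TS.ADD", key, tsMillis, value, "RETENTION", retentionMs}
}

// TSRangeArgs builds a TS.RANGE query that sums samples into fixed-size
// buckets, which is the shape a timeline chart usually wants.
func TSRangeArgs(key string, fromMillis, toMillis, bucketMillis int64) []any {
	return []any{"TS.RANGE", key, fromMillis, toMillis,
		"AGGREGATION", "sum", bucketMillis}
}

func main() {
	// Hypothetical key: one series per service and metric.
	key := "dhrishti:requests:catalog"
	now := int64(1_700_000_000_000) // example timestamp in Unix milliseconds

	fmt.Println(TSAddArgs(key, now, 1)...)
	// One-minute buckets over the last 24 hours.
	fmt.Println(TSRangeArgs(key, now-retentionMs, now, 60_000)...)
}
```

Keeping one series per service makes the "Service A versus Service B" comparison a matter of running the same range query against two keys.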

I also added the capability for users to REPLAY the exact events as they occurred in the last 24 hours, again using TimeSeries, to allow users to see exactly how it all went down before their servers crashed.
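One way to think about replay is as a scheduling problem: sort the stored samples by timestamp and wait the original gap between consecutive events, optionally sped up. The sketch below shows only that scheduling logic; the `Event` type, the `ReplaySchedule` function, and the edge labels are hypothetical and not taken from Dhrishti.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Event is one stored sample. Edge is an illustrative label for the
// connection it describes.
type Event struct {
	AtMillis int64 // timestamp in Unix milliseconds
	Edge     string
}

// ReplaySchedule returns the events in time order together with how long
// to wait before emitting each one. Gaps are divided by speed, so
// speed=10 replays ten times faster than real time.
func ReplaySchedule(events []Event, speed float64) ([]Event, []time.Duration) {
	if speed <= 0 {
		speed = 1
	}
	sorted := append([]Event(nil), events...) // do not mutate the input
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].AtMillis < sorted[j].AtMillis
	})
	waits := make([]time.Duration, len(sorted))
	for i := 1; i < len(sorted); i++ {
		gap := time.Duration(sorted[i].AtMillis-sorted[i-1].AtMillis) * time.Millisecond
		waits[i] = time.Duration(float64(gap) / speed)
	}
	return sorted, waits
}

func main() {
	events := []Event{
		{AtMillis: 5000, Edge: "payments->notifications"},
		{AtMillis: 0, Edge: "gateway->catalog"},
		{AtMillis: 2000, Edge: "catalog->pricing"},
	}
	sorted, waits := ReplaySchedule(events, 10)
	for i, e := range sorted {
		// A real replay loop would call time.Sleep(waits[i]) before
		// pushing the event to the UI; here we only print the plan.
		fmt.Printf("wait %v, then show %s\n", waits[i], e.Edge)
	}
}
```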

With this, the end-to-end application was complete. Dhrishti could now watch live traffic in a production-grade, 15-service topology, infer the dependency graph with sub-0.1% CPU overhead, and let users replay the last 24 hours of activity through Redis TimeSeries, all through a UI that no longer looked vibe-coded.

Looking back at this part specifically, I think the hardest part was learning to treat my own project as something that had to survive contact with reality. Benchmarking showed me exactly where Dhrishti was reliable, where it wasn't, and how much overhead Dhrishti itself produced. And k6 forced me to stop testing against a system I'd built to be easy to test.

Of course, there's always more to add: more probes, more metrics, deeper inference. But the core loop is done: observe, infer, benchmark, replay. I think that is a complete story.

And along the way, I learnt a lot more about kernel-space programming, taming Go's garbage collector, fighting Docker internals, and getting genuinely useful results out of k6 and Redis, which are two tools I'd love to explore more in future projects.

This wraps up the Dhrishti series (for now). Thanks for following along through all three parts!

GitHub link: https://github.com/IdiotCoffee/dhrishti/tree/master