Ideas

#devops #docker #monitoring #sre

I have a lot of ideas and usually save them in a text file, since you never know when something might occur to you.

We've all heard of the "it works on my machine" problem, where code runs fine locally but breaks in production. As I noted in another post (that may or may not have been posted yet), I've always tried to get my dev environment to match production as closely as possible. I've played a bit with Docker environments to help with that: running Grafana, Loki, Prometheus and Redis is the simple approach, so that I have all the services that production has. But unlike Netflix's Chaos Monkey, this does not simulate the things that may go wrong, nor the sheer volume of HTTP calls and DB access. I know there is a way we can simulate all this on a local dev machine, or at least in a test environment, and it may already be a thing at some places.

But what would be really sweet is a way to spin up an environment that simulates the crushing load of peak usage. A quick search reveals that there are a lot of instructions out there, and maybe even some prepackaged setups. With the advent of AI agents, it may not be that hard to do now. The plus for me would be that this runs entirely locally, with no cloud costs, while looking as much like a cloud instance as possible. We do not want runaway costs for testing, but we do want runaway jobs and processes, and missing and slow connections.
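To make the idea a bit more concrete, here is a minimal sketch of a purely local load run with injected slowness and dropped connections. It uses only the Python standard library. The names `flaky_call` and `run_load`, the failure rate, and the delay range are all made up for illustration; a real setup would call actual HTTP and DB endpoints instead of sleeping.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor


def flaky_call(rng, fail_rate=0.1, max_delay=0.01):
    """Stand-in for one HTTP or DB call (hypothetical): slow, and sometimes it fails."""
    time.sleep(rng.uniform(0, max_delay))  # simulated slow connection
    if rng.random() < fail_rate:
        raise ConnectionError("simulated dropped connection")
    return "ok"


def run_load(total=200, workers=20, fail_rate=0.1, seed=None):
    """Fire `total` simulated calls across `workers` threads and summarize them."""
    rng = random.Random(seed)

    def one(_):
        start = time.perf_counter()
        try:
            flaky_call(rng, fail_rate)
            ok = True
        except ConnectionError:
            ok = False
        return ok, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(one, range(total)))

    failures = sum(1 for ok, _ in results if not ok)
    slowest = max(elapsed for _, elapsed in results)
    return {"total": total, "failures": failures, "slowest_s": slowest}
```

Calling `run_load(total=1000, workers=50, fail_rate=0.2)` gives a rough count of failed calls and the slowest one. Swapping the body of `flaky_call` for a real request against a local container would be the next step, and nothing here costs anything to run.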

I started down this route at my last job, for a specific service I was replacing: when an order came in, some customers had a webhook to get notified on order ingestion. I wrote a minimal API with 3 endpoints to simulate what we'd get back: a valid 200 response, an invalid response, and a randomized one. I was testing Hangfire to make sure it handled all four cases, the 4th being an invalid URL. My integration tests had multiple customers, and I assigned them the various URL endpoints. I learned a lot doing this, but it was only meant to be the 1st step. Sadly, the new CIO decided "to go in a different direction", so that project, as well as me, got shelved.

Along with the aforementioned Docker images, this gave me a lot of insight into what was happening with the service replacement, a lot more than with the original API, where I had to log on to the correct server and review log files. And oddly, I was really having fun doing all this, as I was learning a lot of new and interesting things, including a great appreciation for OpenTelemetry! My long-term plan was to recreate production as much as possible and throw more at it, as stress testing is the best testing if you can do it correctly.
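For anyone who wants to try something similar, the three-endpoint webhook stub described above could be sketched roughly like this in Python, using only the standard library. This is not the original code: the paths `/ok`, `/fail` and `/random`, the choice of 500 as the "invalid" response, and the port are all assumptions for illustration.

```python
import random
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical routes; the post does not say what the real paths were.
ROUTES = {
    "/ok": lambda: 200,                            # always succeeds
    "/fail": lambda: 500,                          # always fails (assumed status)
    "/random": lambda: random.choice([200, 500]),  # coin flip
}


def status_for(path):
    """Return the status code the stub sends for a path (404 if unknown)."""
    pick = ROUTES.get(path)
    return pick() if pick else 404


class WebhookStub(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and discard the notification body before answering.
        length = int(self.headers.get("Content-Length", 0))
        self.rfile.read(length)
        self.send_response(status_for(self.path))
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the console quiet during test runs


def serve(port=8080):
    """Run the stub on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), WebhookStub).serve_forever()
```

Calling `serve()` starts the stub, and each test customer can then be pointed at one of the three paths. The 4th case, the invalid URL, needs no endpoint at all: give one customer a host name that does not resolve and see how the job runner copes.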

And of course, the elephant in the room for this entire idea: I'm pretty sure this is what Docker is for (making it work on any machine), except for the stress testing, which I know a lot of companies do but which I've never done before.
