DEV Community

How I Built an Autonomous SRE (and made it into the OpenAI Cookbook!)

Zaynul Abedin Miah on May 08, 2026

Taming GPT-4o for EKS Workflows

Let’s be honest for a second: the idea of letting an LLM blindly run kubectl apply on an AWS EKS cluster...
 
Mykola Kondratiuk

spent 90% of our autonomous agent design time on permission scope, not prompt quality - that's basically this article's thesis. narrow kubectl access + approval gate beats trying to prompt-engineer good judgment into a model.
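
A minimal sketch of the "narrow access + approval gate" idea from this comment. It is an illustration only, not code from Kube-AutoFix: the `ALLOWED` allowlist, the `NEEDS_APPROVAL` set, and the `preflight` function are hypothetical names, and a check like this would sit in front of cluster-side RBAC rather than replace it.

```python
from dataclasses import dataclass

# Hypothetical allowlist of (verb, resource) pairs the agent may request.
ALLOWED = {
    ("get", "pods"),
    ("describe", "pods"),
    ("logs", "pods"),
    ("apply", "deployments"),
}

# Hypothetical set of mutating verbs that require a human sign-off.
NEEDS_APPROVAL = {"apply", "delete", "patch"}


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str


def preflight(verb: str, resource: str, approved: bool = False) -> Decision:
    """Deny by default: anything not explicitly allowlisted is rejected."""
    if (verb, resource) not in ALLOWED:
        return Decision(False, f"denied: {verb} {resource} is not on the allowlist")
    if verb in NEEDS_APPROVAL and not approved:
        return Decision(False, f"pending: {verb} {resource} needs human approval")
    return Decision(True, f"allowed: {verb} {resource}")
```

The point of the ordering is that approval can never widen the scope: `preflight("create", "rolebindings", approved=True)` is still denied, because the allowlist check runs before the approval check.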

Zaynul Abedin Miah AWS Community Builders

Spot on, Mykola. 100% agreed. You simply cannot prompt engineer deterministic safety into a probabilistic model. Restricting the blast radius with narrow kubectl permissions and hard approval gates is the only way to actually sleep at night when building these systems.

Building on that exact thesis, I realized that once you lock down those permissions, the next massive hurdle is observability. If the agent gets stuck in a loop inside its sandbox, you need to know exactly why.

I actually just shipped v0.1.0 of Kube-AutoFix today to tackle exactly this, integrating MLflow so we can track the validation latency, Pydantic schema passes, and YAML artifacts of every single agent run.

Just published the breakdown on how I built the observability layer here if you are curious: dev.to/azaynul10/how-i-made-an-aut...

Great to connect with another engineer focusing on robust architecture rather than just prompt tweaks!

Mykola Kondratiuk

observability is the right next layer - though I'd narrow it to tracing over metrics. knowing an agent made 14 kubectl calls in a session is fine; knowing which branch of the decision tree triggered call #14 is what helps you tighten the scope. metrics tell you something happened, traces tell you why the permission boundary you drew was in the wrong place.

Zaynul Abedin Miah AWS Community Builders

You've hit the nail on the head again, Mykola. Metrics tell you that you have a problem; traces tell you where the problem is.

That is exactly why I configured the MLflow integration to log iterations as nested runs (Parent Run = The Incident -> Child Runs = Observe, Think, Act). If an agent hits a hard stop on a permission boundary, a flat metric just says 'Failed: True'. But a nested trace allows you to visually inspect the state machine and see: 'Ah, the LLM reasoned it needed to create a RoleBinding, tried to output the YAML, and hit the pre-flight deny-by-default guardrail.'

It shifts the debugging process from guessing the LLM's logic to visually inspecting its exact decision tree. The distinction between tracing and metrics is spot on; that is the exact mindset needed for production AI Ops.

Mykola Kondratiuk

mlflow nested runs work well for sequential agents - the incident->obs->act structure makes permission failures readable. where I've hit friction is concurrent multi-agent runs: two parents overlapping and the hierarchy stops being meaningful. correlation IDs across runs tend to work better at that point than nesting

Zaynul Abedin Miah AWS Community Builders

That is a brilliant callout. You are entirely right: Kube-AutoFix is strictly a single-threaded, sequential loop (Observe -> Think -> Act), which is why the parent-child nesting maps so cleanly to it right now.

But you're spot on about the friction in multi-agent swarms. The moment you introduce concurrency, say spinning up a LogAnalyzerAgent and a ConfigScraperAgent asynchronously to triage the same incident, that rigid hierarchy collapses. Transitioning to distributed tracing concepts using correlation IDs (treating agent operations more like microservices with OpenTelemetry) is definitely the necessary architectural leap there.

That gives me a fantastic perspective on how to architect the observability layer if I ever upgrade this from a single SRE agent into a concurrent 'SRE team' swarm. Appreciate the insight!
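
A minimal sketch of the correlation-ID approach discussed above, using only the Python standard library (`asyncio` and `contextvars`). The agent names come from the comment; everything else (`record`, `triage`, the in-memory `EVENTS` list) is a hypothetical stand-in. A real system might carry the same ID through OpenTelemetry trace context instead, which is not shown here.

```python
import asyncio
import contextvars
import uuid

# Holds the correlation ID of the incident the current task is working on.
correlation_id: contextvars.ContextVar[str] = contextvars.ContextVar(
    "correlation_id", default="-"
)

# Stand-in for a real trace or log backend.
EVENTS: list[dict[str, str]] = []


def record(agent: str, message: str) -> None:
    """Stamp every event with the correlation ID from the current context."""
    EVENTS.append(
        {"correlation_id": correlation_id.get(), "agent": agent, "message": message}
    )


async def agent(name: str, steps: list[str]) -> None:
    for step in steps:
        await asyncio.sleep(0)  # yield control so the agents interleave
        record(name, step)


async def triage(incident: str) -> str:
    cid = f"{incident}-{uuid.uuid4().hex[:8]}"
    correlation_id.set(cid)
    # Tasks created by gather copy the current context, so both agents see cid.
    await asyncio.gather(
        agent("LogAnalyzerAgent", ["fetch logs", "find error"]),
        agent("ConfigScraperAgent", ["read configmap"]),
    )
    return cid


async def main() -> list[str]:
    # Two incidents overlap in time, each with its own correlation ID.
    return list(await asyncio.gather(triage("inc-1"), triage("inc-2")))
```

Running `asyncio.run(main())` interleaves events from both incidents in `EVENTS`, but filtering on `correlation_id` recovers each incident's full story regardless of which agent emitted the event or in what order, which is what the parent-child hierarchy cannot do once parents overlap.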

Jordan

Solid stuff, great job on the implementation.

Zaynul Abedin Miah AWS Community Builders

Thanks, Jordan! Glad the implementation details resonated with you. If you ever want to test it out or break things in a safe sandbox, feel free to fork the repo. ♥️