
Zaynul Abedin Miah for AWS Community Builders


How I Built an Autonomous SRE (and made it into the OpenAI Cookbook!)

Taming GPT-4o for EKS Workflows

Let’s be honest for a second: the idea of letting an LLM blindly run kubectl apply on an AWS EKS cluster is terrifying.

It is the stuff of late-night DevOps nightmares. One rogue hallucination, an accidental namespace change, or a sudden ClusterRoleBinding injection, and your infrastructure could be severely impacted.

As an AWS Community Builder and AWS Student Builder Group Leader working with developer communities in the Global South, I see more developers exploring how to integrate GenAI into their pipelines. But zero-shot LLM generation for infrastructure is not just risky — it lacks deterministic safety.

I call this the Infrastructure Hallucination problem.

To explore a solution, I built Kube-AutoFix: an experimental autonomous Kubernetes debugging agent prototype. It attempts to deploy, monitor, debug, and validate proposed fixes through strict schema enforcement and infrastructure guardrails.

My OpenAI Cookbook PR has been reviewed, improved through feedback, and is awaiting final maintainer approval:

👉 OpenAI Cookbook PR #2659

Here is how I built it, and the guardrails required when exploring agentic infrastructure workflows.


The Infrastructure Hallucination Problem

When standard LLMs attempt Infrastructure as Code (IaC), they often fail silently, which is the most dangerous kind of failure.

They will confidently generate YAML that looks perfect but contains fatal flaws:

  • Syntax hallucinations: Adding random markdown fences inside the execution pipeline.
  • Scope creep: Deciding your deployment needs a new ServiceAccount with elevated privileges because it saw a similar pattern in its training data.
  • Destructive state changes: Modifying core invariants like the Namespace or overriding replica counts during a hotfix.

You cannot pipe probabilistic text generation directly into a deterministic system like AWS EKS without a strict translation layer.


Enter Kube-AutoFix & Structured Outputs

To bridge this gap, I designed a closed-loop agentic workflow:

Deploy → Monitor → Debug → Fix
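
As a rough illustration of that loop, here is a minimal sketch. The names (`run_loop`, `deploy`, `monitor`, `debug`, `propose_fix`) and the `max_iterations` cap are hypothetical stand-ins, not Kube-AutoFix's actual API. Each phase is passed in as a callable so the control flow can be read without a cluster or an LLM.

```python
from typing import Callable


def run_loop(
    manifest: str,
    deploy: Callable[[str], None],
    monitor: Callable[[], bool],
    debug: Callable[[], str],
    propose_fix: Callable[[str, str], str],
    max_iterations: int = 3,
) -> tuple[bool, str]:
    """Deploy -> Monitor -> Debug -> Fix, bounded by max_iterations.

    Returns (healthy, last_manifest). All callables are hypothetical
    placeholders for the real cluster and LLM integrations.
    """
    for _ in range(max_iterations):
        deploy(manifest)                  # apply the current manifest
        if monitor():                     # True once the workload is healthy
            return True, manifest
        evidence = debug()                # collect logs/events for the model
        manifest = propose_fix(manifest, evidence)  # guardrails would run here
    return False, manifest                # stop looping and hand off to a human
```

Bounding the loop is a deliberate choice in this sketch: an agent that cannot converge should stop and escalate rather than keep generating fixes.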

The tech stack

  • Python 3.11 — the glue
  • Kubernetes Python Client — for cluster interaction
  • AWS EKS — for Kubernetes testing
  • OpenAI SDK / GPT-4o — the reasoning engine
  • Pydantic — the schema gatekeeper

The key idea is OpenAI Structured Outputs.

Wrapping the expected YAML fix in a Pydantic schema forces GPT-4o to follow a strict JSON schema at the API level. It stops acting like a creative writer and starts behaving more like a constrained function.

But even Structured Outputs are not enough for cluster operations.

We still need guardrails.


Building the Guardrails

Submitting a pattern to the official OpenAI Cookbook requires careful attention to system safety, reproducibility, and automated checks.

To prepare Kube-AutoFix for review, I designed three core guardrails.


1. Strict YAML Validation

LLMs love to wrap code in markdown fences.

For example, instead of returning clean YAML, they may return something like:

    ```yaml
    apiVersion: apps/v1
    kind: Deployment
    ```

That is useful for a blog post, but fatal if passed directly into an execution pipeline.

Kube-AutoFix intercepts the LLM response, strips hallucinated markdown formatting, and strictly parses the result through:

    yaml.safe_load_all()

If the output is not valid YAML, the script halts before the proposed fix can proceed.

The principle is simple:

If the agent cannot produce valid structured output, it should fail closed.
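
A minimal sketch of the fence-stripping step, using only the standard library. The function name `strip_markdown_fences` and the regular expression are illustrative assumptions rather than the project's actual code. The cleaned string would then be handed to yaml.safe_load_all() (from PyYAML), and any parse error would stop the run.

```python
import re

# Three backticks, built programmatically so this example stays easy to embed.
FENCE = "`" * 3

# Matches a response that consists of exactly one fenced block,
# with an optional language tag (such as "yaml") after the opening fence.
_FENCED_BLOCK = re.compile(
    r"^\s*" + FENCE + r"[A-Za-z0-9_-]*[ \t]*\n(.*?)\n?" + FENCE + r"\s*$",
    re.DOTALL,
)


def strip_markdown_fences(text: str) -> str:
    """Return the body of a single fenced block, or the stripped text if unfenced."""
    match = _FENCED_BLOCK.match(text)
    return match.group(1) if match else text.strip()
```

The sketch only handles the common case of one wrapping fence. Anything messier is left untouched on purpose, so the YAML parser gets the chance to reject it.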


2. Deny-by-Default Architecture

Kube-AutoFix follows a strict “deny-by-default” model.

Before a proposed fix can proceed, the agent checks the kind of each Kubernetes resource the LLM generated.

If the LLM attempts to introduce an unexpected resource type, such as:

  • Role
  • ClusterRoleBinding
  • DaemonSet
  • unexpected ServiceAccount

the payload is rejected.

This matters because an LLM should not be allowed to escalate infrastructure scope just because it found a pattern in its training data.

If the original task is to fix a Deployment, the agent should not suddenly introduce cluster-level permissions.
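
One way to express this check is an explicit allowlist: anything not named is rejected. The sketch below works on documents that have already been parsed into dictionaries (for example by yaml.safe_load_all()). The names `ALLOWED_KINDS` and `rejected_kinds`, and the particular kinds allowed, are assumptions for illustration, not the project's actual configuration.

```python
from typing import Any, Iterable

# Hypothetical allowlist: only the kinds the task is expected to touch.
ALLOWED_KINDS = frozenset({"Deployment", "Service"})


def rejected_kinds(docs: Iterable[Any]) -> list[str]:
    """Return the kind of every document that is not explicitly allowed.

    Documents that are not mappings, or that lack a string kind, are
    rejected too: deny-by-default means unknown input never passes.
    """
    rejected = []
    for doc in docs:
        kind = doc.get("kind") if isinstance(doc, dict) else None
        if not isinstance(kind, str) or kind not in ALLOWED_KINDS:
            rejected.append(str(kind))
    return rejected
```

A caller would refuse the whole payload whenever the returned list is non-empty, rather than silently dropping the offending documents.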


3. Structural Invariants

The next problem is architectural drift.

How do you make sure the LLM fixes a crashing pod without changing the broader system design?

You lock the state.

Kube-AutoFix extracts important invariants from the original deployment, including:

  • original Namespace
  • original replica count
  • deployment name
  • container ports

Then it forces those values back into the generated YAML.

So even if GPT-4o hallucinates a scale-up to 50 replicas, Kube-AutoFix restores the original replica count before proceeding.

The goal is not to let the model redesign the system.

The goal is to let it propose a narrow, validated remediation.
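
A minimal sketch of this "lock the state" step, operating on Deployment manifests already parsed into dictionaries. The function name `enforce_invariants` is hypothetical, and for brevity it restores only the name, namespace, and replica count; container ports could be handled the same way.

```python
import copy
from typing import Any


def enforce_invariants(original: dict[str, Any], proposed: dict[str, Any]) -> dict[str, Any]:
    """Copy the proposed Deployment and force locked fields back to the original values."""
    locked = copy.deepcopy(proposed)           # never mutate the caller's data
    meta = locked.setdefault("metadata", {})
    spec = locked.setdefault("spec", {})

    meta["name"] = original["metadata"]["name"]
    # Assumption: fall back to "default" when the original omits a namespace.
    meta["namespace"] = original["metadata"].get("namespace", "default")
    # Kubernetes defaults a Deployment to 1 replica when the field is omitted.
    spec["replicas"] = original.get("spec", {}).get("replicas", 1)
    return locked
```

Overwriting the fields, rather than rejecting the payload, matches the behavior described above: the model's remediation survives, but its changes to locked values do not.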


Why This Matters for AWS Builders

As an AWS Community Builder, I see this pattern as a stepping stone for safer cloud engineering with AI.

The concepts behind Kube-AutoFix are not limited to EKS or Kubernetes. The same closed-loop, validated agentic pattern can be explored across the AWS ecosystem.

For example:

Amazon Bedrock

Foundation models can be wrapped with schema validation to generate structured AWS CloudFormation suggestions instead of free-form infrastructure code.

AWS CDK

Agents can analyze failing CDK synthesizer states and propose structurally validated TypeScript fixes.

Incident Response Assistance

Agentic workflows can be connected to Amazon CloudWatch alarms to propose remediation steps for CPU throttling, memory pressure, or deployment failures — while keeping a human in the loop for final approval.

The shift is not simply:

AI that writes code.

The more interesting direction is:

AI that helps operate infrastructure safely, with validation and observability.


What I Learned

Building Kube-AutoFix taught me that prompt quality is only one part of the problem.

For infrastructure agents, the more important questions are:

  • What is the agent allowed to change?
  • What must remain invariant?
  • What happens when the model produces invalid output?
  • How do we prevent privilege or scope escalation?
  • Where should human approval remain mandatory?
  • How do we inspect what the agent did?

In other words, safe AI infrastructure workflows require more than a good prompt.

They require boundaries.


Current Status

Kube-AutoFix has been submitted as an OpenAI Cookbook pattern and is currently under review.

The project is experimental, but it demonstrates a practical pattern for combining:

  • structured outputs
  • Kubernetes debugging
  • Pydantic validation
  • YAML safety checks
  • deny-by-default resource control
  • structural invariants

You can check out the source code here:

🔗 Kube-AutoFix on GitHub

And the OpenAI Cookbook PR here:

🔗 OpenAI Cookbook PR #2659


Conclusion

Submitting Kube-AutoFix to the OpenAI Cookbook was a major milestone for me.

It demonstrates that with the right guardrails and strict validation, we can begin to explore AI integration within infrastructure workflows more safely.

The important lesson is not that an LLM should be trusted blindly with Kubernetes.

The lesson is the opposite:

If we want AI agents near infrastructure, we need constrained outputs, validation gates, scoped permissions, and human review where it matters.

If you are curious about Agentic AI, Kubernetes, or cloud operations, I’d love for you to dig into the code and share feedback.


Let’s Discuss

How are you currently exploring AI in your CI/CD or infrastructure workflows?

Are you experimenting with agents, or are you sticking strictly to code-generation assistants?

Let me know in the comments.

Top comments (21)

Mykola Kondratiuk

spent 90% of our autonomous agent design time on permission scope, not prompt quality - that's basically this article's thesis. narrow kubectl access + approval gate beats trying to prompt-engineer good judgment into a model.

Zaynul Abedin Miah AWS Community Builders

Spot on, Mykola. 100% agreed. You simply cannot prompt-engineer deterministic safety into a probabilistic model. Restricting the blast radius with narrow kubectl permissions and hard approval gates is the only way to actually sleep at night when building these systems.

Building on that exact thesis, I realized that once you lock down those permissions, the next massive hurdle is observability. If the agent gets stuck in a loop inside its sandbox, you need to know exactly why.

I actually just shipped v0.1.0 of Kube-AutoFix today to tackle exactly this, integrating MLflow so we can track the validation latency, Pydantic schema passes, and YAML artifacts of every single agent run.

Just published the breakdown on how I built the observability layer here if you are curious: dev.to/azaynul10/how-i-made-an-aut...

Great to connect with another engineer focusing on robust architecture rather than just prompt tweaks!

Mykola Kondratiuk

observability is the right next layer - though I'd narrow it to tracing over metrics. knowing an agent made 14 kubectl calls in a session is fine; knowing which branch of the decision tree triggered call #14 is what helps you tighten the scope. metrics tell you something happened, traces tell you why the permission boundary you drew was in the wrong place.

Zaynul Abedin Miah AWS Community Builders

You've hit the nail on the head again, Mykola. Metrics tell you that you have a problem; traces tell you where the problem is.

That is exactly why I configured the MLflow integration to log iterations as nested runs (Parent Run = The Incident -> Child Runs = Observe, Think, Act). If an agent hits a hard stop on a permission boundary, a flat metric just says 'Failed: True'. But a nested trace allows you to visually inspect the state machine and see: 'Ah, the LLM reasoned it needed to create a RoleBinding, tried to output the YAML, and hit the pre-flight deny-by-default guardrail.'

It shifts the debugging process from guessing the LLM's logic to visually inspecting its exact decision tree. Spot-on distinction between tracing and metrics; that is the exact mindset needed for production AI Ops.

Mykola Kondratiuk

mlflow nested runs work well for sequential agents - the incident->obs->act structure makes permission failures readable. where I've hit friction is concurrent multi-agent runs: two parents overlapping and the hierarchy stops being meaningful. correlation IDs across runs tend to work better at that point than nesting

Zaynul Abedin Miah AWS Community Builders

That is a brilliant callout. You are entirely right: Kube-AutoFix is strictly a single-threaded, sequential loop (Observe -> Think -> Act), which is why the parent-child nesting maps so cleanly to it right now.

But you're spot on about the friction in multi-agent swarms. The moment you introduce concurrency (say, spinning up a LogAnalyzerAgent and a ConfigScraperAgent asynchronously to triage the same incident), that rigid hierarchy collapses. Transitioning to distributed tracing concepts using correlation IDs (treating agent operations more like microservices with OpenTelemetry) is definitely the mandatory architectural leap there.

That gives me a fantastic perspective on how to architect the observability layer if I ever upgrade this from a single SRE agent into a concurrent 'SRE team' swarm. Appreciate the insight!

Mykola Kondratiuk

yeah, single-agent sequential makes sense for this. once you get concurrent branches on the same resource though, parent-child shows timing not causality.

Zaynul Abedin Miah AWS Community Builders

Exactly. Timing vs. causality is the exact trap with concurrent spans. If two agents touch the same deployment at the same time, a standard trace just becomes a timeline of race conditions rather than a logical decision tree.

Your previous point actually inspired me to push a new update to the repo today! I just shipped an enhancement to Kube-AutoFix using mlflow.start_span() to build out a strict hierarchical trace for the sequential loop (Deploy -> Monitor -> Debug -> LLM).

Since the agent is strictly single-threaded right now, the parent-child nesting perfectly captures the causal chain. But your warning on concurrency is noted for when I eventually explore multi-agent swarms. We would absolutely need OpenTelemetry-style correlation IDs to map true causality at that scale.

Really appreciate the technical sparring here, Mykola; it directly influenced the codebase today!

Mykola Kondratiuk

race-condition timeline is exactly what makes concurrent agent traces misleading — you get the what but lose the why. curious what the update adds — are you modeling agent intent as a separate node, or tightening the span boundaries?

Zaynul Abedin Miah AWS Community Builders

Great question! For this update, I went with tightening the span boundaries rather than adding a separate intent node.

Since Kube-AutoFix operates sequentially, I wrapped the core lifecycle phases (Deploy, Monitor, Debug, LLM Diagnosis) into strict, explicit MLflow spans. Previously, I just had a timeline of raw LLM API calls and metrics, which showed that something happened but left the context ambiguous.

Now, every MLflow trace clearly delineates the exact boundaries of the agent's actions. If the agent makes a deployment and it fails, the subsequent LLM Diagnosis span captures the exact DebugBundle it looked at, and logs the LLM's root_cause and confidence_score directly within that span. This effectively captures the 'intent' (why it chose a specific fix) inside the tight boundary of the diagnosis phase, before it moves on to the next Deploy span.

It keeps the causality perfectly clear for a single-agent loop without overcomplicating the trace with separate intent nodes.

Mykola Kondratiuk

span tightening over intent nodes makes sense for sequential workflows — the phase boundary is already the signal you need. did the Deploy-to-Monitor span surface any latency surprises?

Zaynul Abedin Miah AWS Community Builders

Spot on, Mykola! The phase boundary really is the perfect signal there. As for the Deploy-to-Monitor span, it definitely did. The biggest surprise was visualizing the variance in Kubernetes reconciliation delays. Applying the fix is near-instant, but the trace highlighted just how much the time to reach the desired state (and confirm it via readiness probes) fluctuated. Having that span separated in MLflow made it immediately obvious that we needed more adaptive backoff strategies in the monitoring phase, rather than just blaming the LLM overhead.

Mykola Kondratiuk

kubernetes reconciliation delay variance is the kind of thing that blindsides you. we had similar surprises when we first got real tracing in - thought the deploy fix was fast, turns out there was a tail hiding in the scheduling gap we had never seen before

Zaynul Abedin Miah AWS Community Builders

Exactly! That 'scheduling gap' is a silent killer for autonomous workflows. It is so easy for an agent to assume a fix failed just because the pod isn't instantly ready, when in reality it is just waiting on a heavy image pull or the cluster autoscaler to spin up a new node.

Without tracing to expose that tail latency, an AI agent will panic and try to hallucinate a second fix while the first one is still pending in the queue, creating a cascading failure. Visualizing 'LLM execution time' strictly separated from 'Infrastructure spin-up time' is the only way to build an agent with enough patience to actually work in prod.

It has been awesome diving into these architectural nuances with you, Mykola! Always great to cross paths with someone who knows where the actual friction lies in these distributed systems.

Mykola Kondratiuk

tracing shows you the gap after the fact; the real fix is encoding explicit wait-windows per operation type in your agent before it decides to retry. without that, tracing just makes the failure reproducible, not preventable.

Zaynul Abedin Miah AWS Community Builders

You are absolutely right. Observability is purely diagnostic, but the control flow needs to be preventative.

Tracing gave me the hard proof of that scheduling gap, but to actually prevent the cascading failure, the agent has to respect Kubernetes state transitions. Instead of relying on a naive timeout or blind retries, the next architectural step is encoding state-aware readiness gates. The agent needs to actively poll the K8s API for status.conditions (like waiting for DeploymentAvailable=True) with an operation-specific exponential backoff before it's even allowed to classify the deployment as a 'failure' and wake the LLM back up.

Tracing made the gap reproducible; as you said, operation-specific wait-windows are absolutely the cure. Brilliant way to frame it.
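
As a rough sketch of such a wait-window (not code from the repo): the helper below polls a caller-supplied readiness check with exponential backoff and only reports failure once the whole window has elapsed. The name `wait_until_ready`, its parameters, and the default timings are hypothetical. In practice `is_ready` would query the Kubernetes API for the relevant conditions, and `window_seconds` would be chosen per operation type.

```python
import time
from typing import Callable


def wait_until_ready(
    is_ready: Callable[[], bool],
    window_seconds: float = 120.0,
    initial_delay: float = 1.0,
    max_delay: float = 15.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll is_ready with exponential backoff until it passes or the window closes."""
    deadline = clock() + window_seconds
    delay = initial_delay
    while True:
        if is_ready():
            return True
        remaining = deadline - clock()
        if remaining <= 0:
            return False                   # only now may the caller call it a failure
        sleep(min(delay, remaining))       # never sleep past the deadline
        delay = min(delay * 2, max_delay)  # double the wait, up to a cap
```

Injecting `sleep` and `clock` keeps the helper testable without real waiting; the agent would only wake the LLM after this function returns False.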

Mykola Kondratiuk

polling K8s conditions still has a gap - DeploymentAvailable=True doesn't mean traffic can actually hit your pods yet. the wait-window needs to gate on your service's readiness probe, not just the deployment status

Zaynul Abedin Miah AWS Community Builders

That is a great catch! There is indeed a subtle gap between pod-level availability and end-to-end service traffic routing.

Under the hood, KubeMonitor actually evaluates the individual Pod statuses rather than the high-level Deployment condition. It loops over all pods and checks container_statuses[].ready:

all_ready = all(c.ready for c in cs_list) and len(cs_list) > 0


In Kubernetes, a container status only becomes ready=True when all its configured container probes (including the Readiness Probe) are passing. So, we are gating on individual container readiness probes before marking the run as DeploymentState.HEALTHY.

However, you're 100% correct that this still leaves a data-path gap:

  1. Even after a container's readiness probe passes, it takes time for the endpoints controller to sync that Pod IP to the matching Endpoints / EndpointSlice, and for kube-proxy (or a Service Mesh) to update the actual routing rules.
  2. Just because the pods are ready doesn't guarantee the Service is configured with the correct selector or port mappings.

To fully close this gap, your suggestion is spot on. If a Service is part of the applied manifest, the agent should:

  • Wait for the corresponding EndpointSlice to populate with matching ready IPs.
  • Run an active synthetic transaction probe (e.g., a simple curl) directly against the Service endpoint to guarantee traffic is actually hitting the application.

Adding an Endpoint/synthetic probe verification step is officially going on the roadmap for v2. Thanks for the sharp feedback!

Mykola Kondratiuk

pod-level status over the deployment condition is a better signal. the place i've seen traffic gaps persist is the service mesh layer - proxy not ready while the pod reports healthy. does KubeMonitor gate on endpoint slice readiness or just the pod status loop?

Jordan

Solid stuff great job and implementation

Zaynul Abedin Miah AWS Community Builders

Thanks, Jordan! Glad the implementation details resonated with you. If you ever want to test it out or break things in a safe sandbox, feel free to fork the repo. ♥️