Zaynul Abedin Miah

Posted on May 15

How I Made an Autonomous Kubernetes SRE Agent Observable with MLflow

#mlflow #aiagents #devops #kubernetes

AI infrastructure agents are exciting, but they are also difficult to trust.

A Kubernetes debugging agent can generate a remediation plan, but if we cannot inspect its iterations, validation results, latency, artifacts, and failure modes, the system becomes a black box. That is a problem, especially when the agent is working near infrastructure.

I built Kube-AutoFix as an autonomous Kubernetes SRE agent prototype that uses structured outputs, Pydantic validation, YAML safety checks, dry-run support, and namespace isolation to reduce risky behavior. Recently, I added an MLflow observability layer so each agent run can be tracked, inspected, and compared.

This article explains how I added optional MLflow tracking to Kube-AutoFix and why observability is essential for evaluating AI infrastructure agents.

Why You Should Learn This

Adding observability to AI agents moves you from "vibes-based" testing to rigorous, data-driven engineering. By learning how to track your agents, you will be able to build trustworthy AI systems, identify why an agent hallucinated or failed, and systematically improve reliability before deploying near any critical environment. Whether you are an industry veteran or a beginner exploring AI, mastering agent observability is a critical skill for the future of DevOps.

The problem: AI infrastructure agents are black boxes

When an AI agent attempts to debug a failing Kubernetes pod, it executes a series of thoughts, actions, and observations. Without a dedicated observability layer, these steps exist only as fleeting console logs. If the agent gets stuck in a loop or proposes a dangerous change, debugging the debugger becomes a nightmare. You are left asking: Was the prompt bad? Did the cluster return unexpected state? Did the validation fail? Without an audit trail, AI infrastructure agents cannot be trusted.

The dashboard above uses synthetic traces to demonstrate the observability interface. No live Kubernetes cluster, OpenAI call, or Databricks-hosted run was executed in this recording environment.

What Kube-AutoFix already did

Before adding MLflow, Kube-AutoFix was already focused on safe, controlled executions. It utilized:

Structured Outputs: Forcing the LLM to reply in a predictable JSON schema.
Pydantic Validation: Ensuring the LLM's output met strict data requirements before taking action.
YAML Safety Checks: Preventing the deployment of unauthorized resource types or privileged containers.
Dry-Run Support: Validating Kubernetes manifests against the API server without actually applying them.
Namespace Isolation: Restricting the agent's blast radius to a safe, designated namespace.
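
To make the YAML safety check and namespace isolation ideas concrete, here is a minimal sketch of what such a gate might look like. It is not the project's actual code: the function name, the allowlist contents, and the violation messages are all assumptions, and it works on an already-parsed manifest dictionary so it needs nothing beyond the standard library.

```python
from typing import Any

# Hypothetical allowlist; the real project may permit a different set of kinds.
ALLOWED_KINDS = {"Deployment", "Service", "ConfigMap", "Pod"}


def find_violations(manifest: dict[str, Any], namespace: str) -> list[str]:
    """Return human-readable safety violations for one parsed manifest."""
    violations: list[str] = []

    kind = manifest.get("kind")
    if kind not in ALLOWED_KINDS:
        violations.append(f"kind {kind!r} is not on the allowlist")

    # A manifest without an explicit namespace is assumed to land in the target one.
    target_ns = manifest.get("metadata", {}).get("namespace", namespace)
    if target_ns != namespace:
        violations.append(f"namespace {target_ns!r} is outside {namespace!r}")

    # Pods keep containers under spec; workloads nest them under spec.template.spec.
    spec = manifest.get("spec", {})
    pod_spec = spec.get("template", {}).get("spec", spec)
    for container in pod_spec.get("containers", []):
        security = container.get("securityContext") or {}
        if security.get("privileged"):
            violations.append(f"container {container.get('name')!r} is privileged")

    return violations
```

An empty list means the manifest passed these three checks. Anything else can be surfaced to the operator or logged before the dry-run step.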

What You Need

To follow along with this kind of implementation, you need:

Python 3.10+
MLflow: Installed locally (pip install mlflow) or hosted via Databricks.
A Local Kubernetes Cluster: Minikube, Kind, or Docker Desktop (do not use production).
An LLM API Key: e.g., OpenAI API key.

At a high level, the workflow looks like this:

Broken Kubernetes Manifest
        ↓
Deploy / Dry Run
        ↓
Monitor Pod Status
        ↓
Collect Debug Signals
        ↓
LLM Diagnosis with Structured Output
        ↓
Pydantic + YAML Validation
        ↓
Corrected Manifest Artifact
        ↓
MLflow Run Tracking

What I added with MLflow

I integrated MLflow to turn the black box into a glass box. Here is the step-by-step breakdown of how the observability layer was implemented:

1. Initialize the MLflow Experiment

The first step was setting up a dedicated MLflow experiment to group all Kube-AutoFix runs. This allows us to separate debugging tests from regular operations.

2. Log Agent Configuration and Parameters

At the start of every run, the script logs the configuration parameters: the model name being used, the target namespace, maximum allowed iterations, and whether dry-run mode is enabled.
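
As a rough illustration of this step, the sketch below flattens a run configuration into string parameters and logs them only when a tracker is supplied, which keeps tracking optional. The `AgentRunConfig` class, its field names, and the `log_config` helper are assumptions made for this example. The only MLflow behavior relied on is that `mlflow.log_params` accepts a dictionary.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AgentRunConfig:
    # Field names are illustrative; the article only lists what is logged.
    model_name: str
    namespace: str
    max_iterations: int
    dry_run: bool


def to_mlflow_params(config: AgentRunConfig) -> dict[str, str]:
    """Flatten the config into string values, which is how MLflow stores params."""
    return {key: str(value) for key, value in asdict(config).items()}


def log_config(config: AgentRunConfig, tracker=None) -> dict[str, str]:
    """Log params if a tracker is supplied; otherwise tracking stays off."""
    params = to_mlflow_params(config)
    if tracker is not None:
        tracker.log_params(params)  # same shape as mlflow.log_params(dict)
    return params
```

With MLflow installed, passing the `mlflow` module itself as `tracker` inside a `with mlflow.start_run():` block should work, after selecting the experiment with `mlflow.set_experiment(...)`. Leaving `tracker` as `None` runs the agent without any tracking dependency.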

3. Track Iterations as Nested Runs

Agents work in loops. I configured MLflow to log each iteration (Observe -> Think -> Act) as a step, allowing us to see exactly how many tries it took the agent to resolve the issue or hit the iteration cap.
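
A minimal sketch of per-iteration logging, assuming the loop body can be expressed as a single callable that reports whether the issue was resolved. The function and metric names here are hypothetical. `log_metric` is any callable shaped like `mlflow.log_metric(key, value, step=...)`, so the loop can run with or without MLflow.

```python
import time
from typing import Callable


def run_iterations(step_fn: Callable[[int], bool], max_iterations: int, log_metric) -> int:
    """Run step_fn until it reports success or the cap is hit.

    Returns the number of iterations used.
    """
    for iteration in range(1, max_iterations + 1):
        started = time.perf_counter()
        resolved = step_fn(iteration)  # one Observe -> Think -> Act pass
        elapsed = time.perf_counter() - started

        # The step index lets the tracking UI plot each metric per iteration.
        log_metric("iteration_duration_s", elapsed, step=iteration)
        log_metric("resolved", 1.0 if resolved else 0.0, step=iteration)
        if resolved:
            return iteration
    return max_iterations
```

If each iteration should appear as its own child run instead of a step on the parent, `mlflow.start_run(nested=True)` is the usual alternative. Which of the two the project uses is not shown in the article.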

4. Log Artifacts and Manifests

Instead of just logging text, the agent now saves the generated YAML manifests, the final remediation plans, and error tracebacks as MLflow artifacts. You can download and inspect the exact files the agent tried to apply.

5. Record Metrics

Finally, I added metrics tracking for execution latency, prompt token usage, completion token usage, and validation success rates. This makes it easy to compare the cost and speed of different models.
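
One way to derive the run-level numbers is to aggregate per-iteration records just before the run ends. The record fields and metric names below are assumptions chosen to mirror the metrics listed above, not the project's actual schema.

```python
from dataclasses import dataclass


@dataclass
class IterationRecord:
    latency_s: float
    prompt_tokens: int
    completion_tokens: int
    validation_passed: bool


def summarize(records: list[IterationRecord]) -> dict[str, float]:
    """Aggregate per-iteration records into run-level metrics."""
    if not records:
        return {}
    passed = sum(1 for r in records if r.validation_passed)
    return {
        "total_latency_s": sum(r.latency_s for r in records),
        "prompt_tokens": float(sum(r.prompt_tokens for r in records)),
        "completion_tokens": float(sum(r.completion_tokens for r in records)),
        "validation_success_rate": passed / len(records),
    }
```

The resulting dictionary has the shape `mlflow.log_metrics` expects, so a single call at the end of the run would record all four values.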

What gets logged

The MLflow integration captures useful telemetry for each agent run without exposing sensitive data by default:

Run metadata: target namespace, model name, dry-run status, max iteration setting, and project tags.
Metrics: iteration duration, LLM latency, validation status, confidence score, resource counts, and final success/failure state.
Artifacts: corrected YAML, root-cause summaries, change summaries, and redacted debug summaries.
Tags: agent type, safety model, and integration context.

What does not get logged by default

By default, the tracker does not log full prompts, raw Kubernetes logs, credentials, environment variables, or unredacted cluster data. Full prompt/debug logging is intentionally gated behind an explicit configuration flag for local experimentation only.
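
The sketch below shows one plausible shape for that gate: text is redacted unless a flag explicitly allows full logging. The patterns, function names, and flag name are illustrative assumptions. A real deployment would need a broader, reviewed pattern set.

```python
import re

# Illustrative patterns only; each entry is (compiled pattern, replacement).
_REDACTIONS = [
    (re.compile(r"(?i)(password|passwd|token|secret|api[_-]?key)(\s*[:=]\s*)\S+"),
     r"\1\2[REDACTED]"),
    (re.compile(r"(?i)(authorization:\s*bearer\s+)\S+"),
     r"\1[REDACTED]"),
]


def redact(text: str) -> str:
    """Mask values that follow common credential-looking keys."""
    for pattern, replacement in _REDACTIONS:
        text = pattern.sub(replacement, text)
    return text


def prepare_debug_artifact(raw: str, allow_full_logging: bool = False) -> str:
    """Return text that is safe to log; raw text only when explicitly enabled."""
    return raw if allow_full_logging else redact(raw)
```

Defaulting the flag to `False` means that forgetting to configure it fails safe: the redacted form is what reaches the artifact store.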

Synthetic demo traces disclaimer

Demo note: The MLflow dashboard demo in this article uses synthetic traces to show the observability interface. No live Kubernetes cluster, OpenAI call, or Databricks-hosted run was executed in the recording environment.

Why this matters for Databricks and MLflow

MLflow is widely known for tracking traditional machine learning experiments, but this project shows how the same tracking model can also help evaluate AI agent workflows. As Databricks continues to expand its AI capabilities, tools like MLflow can provide a practical observability layer for comparing runs, inspecting artifacts, and understanding how agent behavior changes over time. It provides the structured governance that AI engineering teams increasingly need.

Pro-Tips

Use Tags for Filtering: Tag your MLflow runs with the specific Kubernetes issue being tested (e.g., issue: CrashLoopBackOff) so you can easily filter the dashboard later.
Keep Artifacts Focused: Don't log the entire cluster state. Log only the specific manifests the agent interacted with to keep your artifact storage clean and cheap.
Leverage the UI: Use the MLflow UI's comparison feature to select two runs and instantly see a diff of their metrics and parameters.

What I learned

Building this taught me that an agent is only as good as your ability to debug it. Prompt engineering is important, but having a clear, visual dashboard of the agent's entire thought process changes how you develop. You stop guessing why an agent failed and start measuring its behavior systematically. Permission scope and approval gates matter significantly more than having the absolute perfect prompt.

Top comments (6)

Harjot Singh • May 31

Making an autonomous Kubernetes SRE agent observable is the right thing to obsess over, because autonomous plus production plus opaque is the scariest combination in this whole space. An agent acting on a live cluster where you can't see why it did what it did is one bad decision from an outage with no audit trail. Observability is what makes autonomy survivable: every decision, every kubectl-equivalent action, every input it reasoned from, captured so a human can reconstruct the chain after the fact. Using MLflow for it is a nice pragmatic choice: you get run tracking and comparison for free instead of building bespoke logging. The pairing I'd stress for an SRE agent specifically: observability tells you what it did, but the safety comes from bounding what it can do. So the diagnose-vs-act split matters. Let it read cluster state freely, but gate the mutating actions (scale, delete, rollout) behind approval or tight policy, because a confident wrong remediation on prod deepens the incident. Observable and bounded together is what lets you trust an autonomous agent near production. See everything it does, and gate the actions that can hurt. That make-it-observable-then-bound-the-irreversible instinct is core to how I think about Moonshift. Is the agent allowed to apply remediations autonomously, or does it propose and wait for a human on the mutating actions?

Zaynul Abedin Miah • Jun 3

You've hit on the exact core of the trust boundary. "Observable and bounded" is the only way an SRE agent survives contact with production systems.

To answer your question: currently, Kube-AutoFix supports two operational modes controlled by the dry_run flag:

Dry-Run Mode (Propose-only): The agent collects diagnostics and uses LLMEngine.diagnose to generate a corrected YAML manifest, but halts without applying it. This acts as a proposal engine.
Autonomous Mode (Auto-apply): The agent applies the changes directly to the cluster. To keep it "bounded," we lock it down to a hardcoded namespace target (autofix-agent-env) defined as KUBE_NAMESPACE.

However, for a true production workflow, relying strictly on namespace isolation isn't enough. We are exploring two approaches to bridge this gap:
GitOps PR Flow: Instead of granting the agent direct mutating access to the cluster via KubeDeployer.apply_manifest, the agent runs in a CI pipeline or event listener, proposes the fix, and automatically opens a Pull Request (containing the diagnosis and corrected YAML). A human SRE reviews the PR, and merging it triggers the existing GitOps deployment pipeline.
Slack/Webhook Interactive Gate: Inserting an interactive approval step inside the AgentLoop before it executes a mutation. The agent posts its reasoning and proposed manifest to a Slack channel with Approve / Reject interactive buttons.

I really like your "diagnose-vs-act" split philosophy. Bounding the write-path while keeping the read-path open is exactly how we can safely test and build trust in these agents.

Whatsonyourmind • May 19

The "glass box → trust" framing is exactly right, and adding MLflow turns the agent from anecdote to artifact. Two extensions that fit naturally on top of what you've built and would close a real production-trust gap I see in SRE-agent designs:

Multi-criteria safety scoring instead of binary YAML checks. Right now the agent runs a pass/fail validation pipeline (Pydantic → YAML safety → dry-run). In production, "this manifest changes a Deployment's replica count from 3 to 1 in the analytics namespace" and "this manifest changes the RBAC policy of a system-wide ClusterRole" both pass binary YAML safety, but one is near-zero-risk rollback and the other is a blast-radius nightmare. A composite risk score (blast_radius × resource_sensitivity × rollback_complexity × namespace_scope) emitted as an MLflow metric per remediation would let you set a numerical gate ("auto-apply if risk < 0.3, require human approval otherwise") rather than relying on namespace isolation alone.
Decision-aware iteration-cap recovery. "Hit max iterations" is currently a terminal failure. A graceful chain — (a) drop to a cheaper model for this remediation, (b) narrow the scope (single-resource fix vs. full manifest rewrite), (c) escalate to human with a structured summary of what was tried — is usually what production SRE workflows want. MLflow already captures the data needed to learn which fallback strategy works for which failure mode; the picker step is the missing piece.

The picker / composite-scoring layer is the kind of thing that ends up reinvented in every SRE-agent project. I packaged decision algorithms (composite scoring, weighted utility, Pareto pickers, UCB1 for fallback-strategy learning) as an MCP server in case it's useful as a building block: github.com/Whatsonyourmind/oraclaw — would slot in cleanly between the LLM diagnosis step and the YAML validation gate. Synthetic-traces caveat well-flagged — the harder evidence-of-trust comes from comparing real runs against synthetic baselines once you do go live.

Zaynul Abedin Miah • May 22 • Edited

Thanks for the detailed feedback! You've hit on the exact trust gaps that prevent autonomous agents from moving past localized sandboxes into real production environments.

A binary check is indeed too blunt. Restructuring our validation gate to compute a composite risk score (e.g., blast_radius × resource_sensitivity × rollback_complexity × namespace_scope) is a natural next step. For example, changing a stateless microservice's replica count should score far lower than modifying a system-wide RBAC ClusterRole or a cluster-scoped CRD. We can calculate this risk score by parsing the resource kind, checking if it touches sensitive APIs (RBAC, Networking, StatefulSets), and logging it as an MLflow metric. Introducing a gate (e.g., auto-apply if risk < 0.3, else block for human approval) would let SREs tune their risk tolerance dynamically.
Currently, hitting the iteration cap in AgentLoop.run is a hard terminal failure. A structured fallback chain would make the agent much more resilient:
Step A: Try a cheaper/smaller model (e.g., GPT-4o-mini) to reduce token costs and API latency during high-iteration retries.
Step B: Scope reduction (e.g., isolate and fix a single deployment/service spec rather than attempting a full manifest rewrite).
Step C: Human escalation via Webhook/Slack with a structured markdown post-mortem summary of the failure, what the agent tried, and what changes were made.
Your MCP server slots in perfectly here. Wrapping decision algorithms like Pareto pickers or UCB1 as MCP tools allows us to dynamically pick fallback strategies and calculate the composite risk score between LLMEngine.diagnose and the final deployment step.
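
A minimal sketch of the Step A to Step C chain described above, assuming each fallback can be wrapped as a callable that returns a corrected manifest or `None`. The function name, the strategy names, and the `"escalate_to_human"` marker are hypothetical.

```python
from typing import Callable, Optional

# (strategy name, callable returning a corrected manifest or None on failure)
Strategy = tuple[str, Callable[[], Optional[str]]]


def run_fallback_chain(strategies: list[Strategy]) -> tuple[str, Optional[str], list[str]]:
    """Try each strategy in order.

    Returns (winning strategy name, result, names attempted). If every
    strategy fails, the name is "escalate_to_human" and the result is None.
    """
    attempted: list[str] = []
    for name, attempt in strategies:
        attempted.append(name)
        result = attempt()
        if result is not None:
            return name, result, attempted
    return "escalate_to_human", None, attempted
```

The `attempted` list is the raw material for the structured escalation summary, and logging it as a run tag would make it possible to see later which fallback tends to resolve which failure mode.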

Your point about real runs vs. synthetic baselines is spot-on: evidence of trust must be backed by comparing actual production rollouts against our synthetic baselines.

Whatsonyourmind • May 27

The composite risk score works, but the multiplicative form collapses toward zero too aggressively as you add factors. The score stays in [0, 1], but fixed thresholds stop being interpretable above three components. Two small adjustments:

Geometric mean instead of product:

risk = (blast_radius * sensitivity * rollback * namespace) ** (1/4)

Keeps the score on the same scale as its inputs regardless of how many factors you add later (the geometric mean always lies between the smallest and largest factor), preserves the "near-zero drags the score down" semantics (a single near-zero factor still pulls it down sharply, and an exact zero zeroes it), and the 0.3 / 0.7 thresholds stay interpretable. Weighted variant if some factors should dominate: risk = ∏ factor[i]^w[i] with Σw = 1. Lets you say "namespace_scope is 50% of the decision" without changing the contract.
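
A small numeric sketch of the difference between the two forms, using made-up factor values. With every factor at 0.5, the plain product shrinks as factors are added, while the geometric mean stays at 0.5.

```python
import math


def product_score(factors: list[float]) -> float:
    """Plain product: shrinks toward zero as more factors are multiplied in."""
    return math.prod(factors)


def geometric_mean_score(factors: list[float]) -> float:
    """n-th root of the product: stays between the smallest and largest factor."""
    return math.prod(factors) ** (1 / len(factors))


four = [0.5] * 4
eight = [0.5] * 8
# product_score(four) is 0.0625 and product_score(eight) is about 0.0039,
# so a fixed 0.3 gate would mean something different for each factor count.
# geometric_mean_score gives 0.5 (up to float rounding) in both cases.
```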

Calibrate the threshold against MLflow logs.
Log every gate decision as (risk_score, was_actually_safe) once the gate is live. After ~200 decisions, an isotonic regression on those tuples surfaces what the threshold should be — e.g., "risk = 0.4 is actually 95% safe in practice, you're blocking too aggressively." Same isotonic pattern that calibrates ML probabilities (sklearn.isotonic.IsotonicRegression), fed by the metrics you're already storing. This is where MLflow earns its keep beyond the dashboards.
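
To illustrate the calibration idea without extra dependencies, here is a pure-Python pool-adjacent-violators sketch that fits a non-increasing "safe rate versus risk score" curve. It stands in for what `sklearn.isotonic.IsotonicRegression` would do in practice. The function names and the toy data in the usage note are assumptions.

```python
def fit_decreasing_isotonic(points: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Pool-adjacent-violators fit of a non-increasing step function.

    points are (risk_score, outcome) with outcome 1.0 for safe, 0.0 for unsafe.
    Returns (risk_score, fitted_safe_rate) pairs sorted by risk_score.
    """
    ordered = sorted(points)
    # Each block holds [sum_of_outcomes, count]; blocks merge on a violation.
    blocks: list[list[float]] = []
    for _, outcome in ordered:
        blocks.append([outcome, 1.0])
        # A later block with a higher mean breaks the non-increasing constraint.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] < blocks[-1][0] / blocks[-1][1]:
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted: list[float] = []
    for total, count in blocks:
        fitted.extend([total / count] * int(count))
    return [(x, y) for (x, _), y in zip(ordered, fitted)]


def highest_risk_meeting(fit: list[tuple[float, float]], target_safe_rate: float):
    """Largest risk score whose fitted safe rate still meets the target, else None."""
    eligible = [x for x, rate in fit if rate >= target_safe_rate]
    return max(eligible) if eligible else None
```

Fed with the logged tuples, `highest_risk_meeting(fit, 0.95)` would suggest where an "auto-apply" threshold could sit. With only a handful of decisions the curve is very coarse, which is why the suggestion above waits for a couple of hundred.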

Fallback chain — Pareto pick + lexicographic order beats raw bandits for first-of-its-kind failures.
For a given failure, the candidate set has costs (api_$, latency_s, blast_risk). Pareto-optimal extraction returns 2–3 non-dominated picks; scalarize lexicographically (risk → cost) for the final choice — easier to defend in a postmortem than a weighted sum. UCB1 only earns its keep once you have repeated draws on the same fallback path with reward signal (50th deployment of a recurring service spec, Step A succeeded 8/10 times, system learns "skip A → try B"). Worth wiring both into the MCP layer: Pareto picker for cold decisions, UCB1 for the warm ones.
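
A minimal sketch of the cold-path picker, assuming each fallback candidate carries three costs where lower is better. The `Candidate` fields and the example values in the usage note are hypothetical.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    blast_risk: float
    api_cost: float
    latency_s: float


def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is no worse on every cost and strictly better on one."""
    pairs = list(zip(a[1:], b[1:]))
    return all(x <= y for x, y in pairs) and any(x < y for x, y in pairs)


def pareto_front(candidates: list[Candidate]) -> list[Candidate]:
    """Keep only candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


def pick(candidates: list[Candidate]) -> Candidate:
    """Lexicographic choice on the front: risk first, then cost, then latency."""
    return min(pareto_front(candidates), key=lambda c: (c.blast_risk, c.api_cost, c.latency_s))
```

For example, given a cheaper-model option (risk 0.2, cost 0.01), a narrow-scope option (risk 0.1, cost 0.05), and a full rewrite that is worse than the cheaper model on every cost, the front keeps the first two and `pick` returns the narrow-scope option because risk is compared first. Logging the whole front alongside the pick gives a postmortem the alternatives that were considered.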

@oraclaw/decision-graph does Pareto frontier extraction in 3 lines, @oraclaw/anomaly has isotonic calibration in TS, and @oraclaw/bandit is UCB1/Thompson — all under decide_* in the MCP server so the agent can route between them without changing the call site. Happy to sketch the specific tool wrappers if useful.

Zaynul Abedin Miah • May 29 • Edited

This is a very strong framing of the decision-design problem. You are right that the product formulation can collapse too aggressively if we treat every factor as a hard multiplicative gate.

For the risk model, I like the idea of a weighted geometric mean:

risk = ∏ factor_i ^ w_i, where ∑w_i = 1

It keeps the score bounded in [0,1] and lets us express that some factors, such as namespace_scope or resource_sensitivity, should dominate the final decision. The only implementation detail I would handle carefully is zero-valued factors. In practice, I would probably clip each factor to a small ε floor unless a true zero is intended to mean “no risk at all.” Otherwise, one harmless dimension could accidentally erase a high blast-radius risk.
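
As a sketch of that idea, the function below computes the weighted geometric mean with each factor clipped to a small floor. The function name, the default floor of 0.01, and the example factor names are assumptions for illustration.

```python
import math


def weighted_geometric_risk(
    factors: dict[str, float],
    weights: dict[str, float],
    floor: float = 0.01,
) -> float:
    """Weighted geometric mean of factors in [0, 1], with weights summing to 1.

    Each factor is clipped to `floor` so one zero cannot erase the others.
    """
    if factors.keys() != weights.keys():
        raise ValueError("factors and weights must have the same keys")
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("weights must sum to 1")

    score = 1.0
    for name, value in factors.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"factor {name!r} must be in [0, 1]")
        score *= max(value, floor) ** weights[name]
    return score
```

With `blast_radius` at 0.9, `namespace_scope` at 0.0, and equal weights, the unclipped score would be 0, whereas the clipped version stays small but positive. Setting `floor=0.0` restores the behavior where a true zero means "no risk at all."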

The isotonic calibration idea is also very compelling. Logging (risk_score, outcome) tuples into MLflow would let us move from arbitrary thresholds to an empirical safety curve. The important part would be defining the outcome label carefully: for example, no rollback, no SLO regression, no failed rollout, and no human override. Then we could eventually say something like, “historically, this threshold corresponds to a 95% safe rollout rate,” which is much more convincing for platform teams.

I also agree with the cold-path vs warm-path distinction. For first-of-its-kind failures, UCB1 or Thompson sampling would be hard to justify without priors or safety constraints. A Pareto set followed by lexicographic selection (safety first, then blast radius, then cost/latency) would make the agent behavior much more predictable and auditable. Bandit-style strategies make more sense later, once we have repeated patterns and enough historical outcome data.

I would be interested to see how you would expose the Oraclaw decision-graph or anomaly tools as MCP tools, especially at the decision step between diagnosis and deployment.
