AI infrastructure agents are exciting, but they are also difficult to trust.
A Kubernetes debugging agent can generate a remediation plan, but if we cannot inspect its iterations, validation results, latency, artifacts, and failure modes, the system becomes a black box. That is a problem, especially when the agent is operating anywhere near real infrastructure.
I built Kube-AutoFix as an autonomous Kubernetes SRE agent prototype that uses structured outputs, Pydantic validation, YAML safety checks, dry-run support, and namespace isolation to reduce risky behavior. Recently, I added an MLflow observability layer so each agent run can be tracked, inspected, and compared.
This article explains how I added optional MLflow tracking to Kube-AutoFix and why observability is essential for evaluating AI infrastructure agents.
Why You Should Learn This
Adding observability to AI agents moves you from "vibes-based" testing to rigorous, data-driven engineering. By learning how to track your agents, you will be able to build trustworthy AI systems, quickly identify why an agent hallucinated or failed, and systematically improve reliability before deploying anywhere near a critical environment. Whether you are an industry veteran or a beginner exploring AI, mastering agent observability is a critical skill for the future of DevOps.
The problem: AI infrastructure agents are black boxes
When an AI agent attempts to debug a failing Kubernetes pod, it executes a series of thoughts, actions, and observations. Without a dedicated observability layer, these steps exist only as fleeting console logs. If the agent gets stuck in a loop or proposes a dangerous change, debugging the debugger becomes a nightmare. You are left asking: Was the prompt bad? Did the cluster return unexpected state? Did the validation fail? Without an audit trail, AI infrastructure agents cannot be trusted.
What Kube-AutoFix already did
Before adding MLflow, Kube-AutoFix was already focused on safe, controlled execution. It used the following safeguards (a validation sketch follows the list):
- Structured Outputs: Forcing the LLM to reply in a predictable JSON schema.
- Pydantic Validation: Ensuring the LLM's output met strict data requirements before taking action.
- YAML Safety Checks: Preventing the deployment of unauthorized resource types or privileged containers.
- Dry-Run Support: Validating Kubernetes manifests against the API server without actually applying them.
- Namespace Isolation: Restricting the agent's blast radius to a safe, designated namespace.
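To make the validation layer concrete, here is a minimal Pydantic sketch (assuming Pydantic v2). The schema fields (`root_cause`, `corrected_manifest`, `confidence`) are illustrative, not the exact models from the repository:

```python
from pydantic import BaseModel, Field, ValidationError

class RemediationPlan(BaseModel):
    """Schema the LLM's structured output must satisfy before the agent acts."""
    root_cause: str = Field(min_length=10)
    corrected_manifest: str                    # YAML the agent proposes to apply
    confidence: float = Field(ge=0.0, le=1.0)  # self-reported confidence score

def parse_llm_reply(raw_json: str) -> RemediationPlan | None:
    """Reject any LLM reply that does not match the schema."""
    try:
        return RemediationPlan.model_validate_json(raw_json)
    except ValidationError as err:
        print(f"LLM output rejected:\n{err}")
        return None
```

If the reply fails validation, the agent never touches the cluster.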
What You Need
To follow along with this kind of implementation, you need:
- Python 3.10+
- MLflow: Installed locally (`pip install mlflow`) or hosted via Databricks.
- A Local Kubernetes Cluster: Minikube, Kind, or Docker Desktop (do not use production).
- An LLM API Key: e.g., OpenAI API key.
At a high level, the workflow looks like this:
Broken Kubernetes Manifest
↓
Deploy / Dry Run
↓
Monitor Pod Status
↓
Collect Debug Signals
↓
LLM Diagnosis with Structured Output
↓
Pydantic + YAML Validation
↓
Corrected Manifest Artifact
↓
MLflow Run Tracking
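Compressed into code, that loop might look like the sketch below. The stage functions are hypothetical stand-ins for the real Kubernetes and LLM plumbing:

```python
# Hypothetical stage functions; the real agent calls the Kubernetes API and an LLM.
def deploy_dry_run(manifest: str) -> None: ...
def collect_debug_signals(manifest: str) -> dict: return {}
def diagnose_with_llm(signals: dict) -> dict | None: return None
def validate_plan(plan: dict) -> bool: return False

def debug_loop(manifest: str, max_iterations: int = 5) -> bool:
    """Run the pipeline until validation passes or the iteration cap is hit."""
    for _ in range(max_iterations):
        deploy_dry_run(manifest)
        plan = diagnose_with_llm(collect_debug_signals(manifest))
        if plan is not None and validate_plan(plan):
            return True  # corrected manifest produced and validated
    return False
```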
What I added with MLflow
I integrated MLflow to turn the black box into a glass box. Here is the step-by-step breakdown of how the observability layer was implemented:
1. Initialize the MLflow Experiment
The first step was setting up a dedicated MLflow experiment to group all Kube-AutoFix runs. This allows us to separate debugging tests from regular operations.
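In code, that is just a couple of MLflow calls; the tracking URI and experiment name below are examples:

```python
import mlflow

# Point at a local tracking server (or a Databricks workspace URI).
mlflow.set_tracking_uri("http://127.0.0.1:5000")

# Creates the experiment on first use and reuses it afterwards.
mlflow.set_experiment("kube-autofix")
```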
2. Log Agent Configuration and Parameters
At the start of every run, the script logs the configuration parameters: the model name being used, the target namespace, maximum allowed iterations, and whether dry-run mode is enabled.
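A sketch of that parameter logging; the values are placeholders:

```python
import mlflow

with mlflow.start_run(run_name="debug-crashloop-pod"):
    # Static configuration is logged once, at the start of the run.
    mlflow.log_params({
        "model_name": "gpt-4o-mini",      # example model name
        "namespace": "autofix-sandbox",   # the agent's isolated namespace
        "max_iterations": 5,
        "dry_run": True,
    })
```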
3. Track Iterations as Nested Runs
Agents work in loops. I configured MLflow to log each iteration (Observe → Think → Act) as a nested run, so you can see exactly how many tries the agent took to resolve the issue or hit the iteration cap.
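The nested-run pattern is a one-argument change: `nested=True` groups each child run under the parent session in the MLflow UI.

```python
import mlflow

max_iterations = 5
with mlflow.start_run(run_name="debug-session"):
    for i in range(1, max_iterations + 1):
        # Each Observe -> Think -> Act cycle becomes its own child run.
        with mlflow.start_run(run_name=f"iteration-{i}", nested=True):
            mlflow.log_metric("iteration_index", i)
            # ... observe cluster state, call the LLM, validate the plan ...
```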
4. Log Artifacts and Manifests
Instead of just logging text, the agent now saves the generated YAML manifests, the final remediation plans, and error tracebacks as MLflow artifacts. You can download and inspect the exact files the agent tried to apply.
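`mlflow.log_text` writes a string straight into the run's artifact store, which is convenient for generated YAML; the paths and contents here are illustrative:

```python
import mlflow

corrected_yaml = "apiVersion: v1\nkind: Pod\n# ...agent-generated manifest..."
plan_summary = "Root cause: bad image tag. Fix: pin to a valid tag."

with mlflow.start_run():
    # The exact files the agent tried to apply, downloadable from the UI.
    mlflow.log_text(corrected_yaml, "manifests/corrected.yaml")
    mlflow.log_text(plan_summary, "plans/remediation_plan.md")
```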
5. Record Metrics
Finally, I added metrics tracking for execution latency, prompt token usage, completion token usage, and validation success rates. This makes it easy to compare the cost and speed of different models.
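A sketch of the metric logging; the token counts are placeholders that would normally come from the LLM response's usage data:

```python
import mlflow
import time

with mlflow.start_run():
    start = time.perf_counter()
    # ... LLM call happens here ...
    mlflow.log_metrics({
        "llm_latency_s": time.perf_counter() - start,
        "prompt_tokens": 512,        # placeholder; read from the LLM usage field
        "completion_tokens": 128,    # placeholder
        "validation_passed": 1.0,    # 1.0 = passed, 0.0 = failed
    })
```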
What gets logged
The MLflow integration captures useful telemetry for each agent run without exposing sensitive data by default:
- Run metadata: target namespace, model name, dry-run status, max iteration setting, and project tags.
- Metrics: iteration duration, LLM latency, validation status, confidence score, resource counts, and final success/failure state.
- Artifacts: corrected YAML, root-cause summaries, change summaries, and redacted debug summaries.
- Tags: agent type, safety model, and integration context.
What does not get logged by default
By default, the tracker does not log full prompts, raw Kubernetes logs, credentials, environment variables, or unredacted cluster data. Full prompt/debug logging is intentionally gated behind an explicit configuration flag for local experimentation only.
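One simple way to gate this is an environment flag checked before anything sensitive is logged; the variable name here is illustrative, not the project's actual setting:

```python
import os
import mlflow

# Opt-in flag for local experimentation only (illustrative name).
LOG_FULL_PROMPTS = os.getenv("KUBE_AUTOFIX_LOG_PROMPTS", "0") == "1"

def log_prompt(prompt: str) -> None:
    if LOG_FULL_PROMPTS:
        # Raw prompts may contain cluster data; never enable this in production.
        mlflow.log_text(prompt, "debug/full_prompt.txt")
    else:
        # Default: record only redacted metadata about the prompt.
        mlflow.log_text(f"prompt length: {len(prompt)} chars", "debug/prompt_meta.txt")
```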
Synthetic demo traces disclaimer
Demo note: The MLflow dashboard demo in this article uses synthetic traces to show the observability interface. No live Kubernetes cluster, OpenAI call, or Databricks-hosted run was executed in the recording environment.
Why this matters for Databricks and MLflow
MLflow is widely known for tracking traditional machine learning experiments, but this project shows how the same tracking model can also help evaluate AI agent workflows. As Databricks continues to expand its AI capabilities, tools like MLflow can provide a practical observability layer for comparing runs, inspecting artifacts, and understanding how agent behavior changes over time. It provides the structured governance that AI engineering teams increasingly need.
Pro-Tips
- Use Tags for Filtering: Tag your MLflow runs with the specific Kubernetes issue being tested (e.g., `issue: CrashLoopBackOff`) so you can easily filter the dashboard later; a query snippet follows this list.
- Keep Artifacts Focused: Don't log the entire cluster state. Log only the specific manifests the agent interacted with to keep your artifact storage clean and cheap.
- Leverage the UI: Use the MLflow UI's comparison feature to select two runs and instantly see a diff of their metrics and parameters.
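As an example of the tag workflow, filtering can also be done programmatically; `mlflow.search_runs` returns a pandas DataFrame you can slice however you like (the experiment name matches the earlier example):

```python
import mlflow

mlflow.set_experiment("kube-autofix")  # same example experiment as above

# Tag the run when it starts...
with mlflow.start_run():
    mlflow.set_tag("issue", "CrashLoopBackOff")

# ...then filter on that tag later.
runs = mlflow.search_runs(
    experiment_names=["kube-autofix"],
    filter_string="tags.issue = 'CrashLoopBackOff'",
)
print(f"{len(runs)} matching run(s)")
```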
What I learned
Building this taught me that an agent is only as good as your ability to debug it. Prompt engineering is important, but having a clear, visual dashboard of the agent's entire thought process changes how you develop: you stop guessing why an agent failed and start measuring its behavior systematically. Permission scope and approval gates matter far more than a perfectly tuned prompt.
Links
- Kube-AutoFix GitHub Repository: github.com/azaynul10/kube-autofix
- v0.1.0 Release: MLflow Observability for Kube-AutoFix
- OpenAI Cookbook PR: PR #2659
