Ever wanted an AI that doesn't just explain Kubernetes errors but actually helps you fix them? This guide, originally published on devopsstart.com, walks through building autonomous K8s agents using MCP, Kagent, and K8sGPT.
Introduction
AI agents for Kubernetes deployments are autonomous systems that follow an "Observe → Reason → Act" loop to resolve cluster issues without manual intervention. While a standard LLM can explain what a CrashLoopBackOff is, a true agent can detect the error, pull the logs, analyze the stack trace, cross-reference it with recent Git commits, and propose a specific PR to fix the environment variable causing the crash.
Building these agents requires moving beyond simple prompting and into "tool use" or "function calling." You are essentially giving an LLM a set of specialized skills (API wrappers) that allow it to interact with your cluster, your GitOps pipeline, and your observability stack. In this guide, you will learn how to architect these skills using the Model Context Protocol (MCP) and frameworks like Kagent and K8sGPT to automate the most tedious parts of Kubernetes operations.
For a deep dive into the foundational concepts of managing the pods these agents will be monitoring, see the guide on Kubernetes for Beginners: Deploy Your First Application.
Prerequisites
Before starting this tutorial, you need a functioning Kubernetes environment and the necessary API access for the LLM. I recommend a development cluster (Kind or Minikube) or a staging namespace in a cloud provider like GKE or EKS to avoid accidental production outages.
You will need the following tools installed on your local machine:
- kubectl v1.30+: The standard Kubernetes CLI.
- Helm v3.14+: For managing the agent's dependencies.
- Python 3.11+: Most agent frameworks, including Kagent and LangChain, require modern Python.
- An OpenAI API Key (GPT-4o) or Anthropic API Key (Claude 3.5 Sonnet): Agents require high-reasoning models to avoid hallucinations during tool selection.
- K8sGPT v0.12+: For the diagnostic skill set implementation.
You should also have a basic understanding of Kubernetes RBAC. Agents operate as identities within the cluster, and giving them cluster-admin privileges is a security risk. You will need to be comfortable creating ServiceAccounts and RoleBindings to enforce the principle of least privilege.
Overview
In this tutorial, we are building a "Deployment Guardian" agent. This isn't a monolithic script, but a modular system capable of three specific skills:
- Automated Diagnostics: Using K8sGPT to scan for misconfigurations and interpreting those errors using an LLM.
- Resource Right-Sizing: Analyzing pod resource usage and suggesting updates to the Horizontal Pod Autoscaler (HPA).
- GitOps Sync Validation: Monitoring ArgoCD application health and triggering syncs when drifts are detected.
The core of this architecture relies on the Model Context Protocol (MCP). MCP is an open standard that decouples the LLM from the specific implementation of the tool. Instead of writing a custom wrapper for every single kubectl command, MCP allows you to expose a standardized "server" that tells the LLM exactly what tools are available, what arguments they take, and what the expected output format is.
By the end of this guide, you will have an agent that provides the root cause and the exact YAML change needed to fix a deployment, integrated directly into your operational workflow. For those managing the underlying infrastructure of these clusters, understanding how to Deploy an EKS Cluster with Terraform provides the necessary context for where these agents actually reside.
Step 1: Architecting the Agent Loop
Before writing code, you must understand how the agent thinks. A standard LLM request is a linear path: Prompt → Response. An agent loop is circular.
When you ask an agent to "Fix the failing deployment in the staging namespace," it performs the following sequence:
- Observation: The agent calls a tool (for example, `get_pod_status`) to see which pods are failing.
- Reasoning: It observes three pods in `CrashLoopBackOff` and reasons that it needs logs to understand the root cause.
- Action: It calls `get_pod_logs` for one of the failing pods.
- Observation: The logs show a `java.lang.NullPointerException` related to a missing database URL.
- Reasoning: It checks the ConfigMap to see if the environment variable is defined.
- Action: It calls `get_configmap`.
- Final Response: It concludes the environment variable is missing and suggests the specific `kubectl patch` command or Git PR.
To implement this, you can use a framework like Kagent, which is built on AutoGen. It treats the "DevOps Engineer" as one agent and the "Kubernetes Cluster" as a tool-providing environment.
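Framework aside, the loop itself is simple to sketch in plain Python. In the sketch below, the `reason` function is a scripted stub standing in for the LLM call that Kagent or AutoGen would make, and the tool names are illustrative, not part of any real API:

```python
# Minimal, framework-free sketch of the Observe -> Reason -> Act loop.
# `reason` is stubbed here; in a real agent it is an LLM function-calling
# request. Tool names and return values are illustrative.
from typing import Callable

def run_agent_loop(task: str, tools: dict[str, Callable[..., str]],
                   reason: Callable[[str, list[str]], dict],
                   max_turns: int = 10) -> str:
    """Ask the reasoner for the next tool call until it produces an answer."""
    history: list[str] = [f"Task: {task}"]
    for _ in range(max_turns):
        decision = reason(task, history)                 # Reason
        if decision["action"] == "finish":
            return decision["answer"]                    # Final Response
        observation = tools[decision["action"]](**decision.get("args", {}))  # Act
        history.append(f"{decision['action']} -> {observation}")             # Observe
    return "Stopped after max_turns without a conclusion; escalate to a human."

# Scripted reasoner: look at pod status once, then conclude.
def stub_reason(task: str, history: list[str]) -> dict:
    if len(history) == 1:
        return {"action": "get_pod_status", "args": {"namespace": "staging"}}
    return {"action": "finish", "answer": "auth-service is in CrashLoopBackOff"}

tools = {"get_pod_status": lambda namespace: "auth-service: CrashLoopBackOff"}
print(run_agent_loop("Fix the failing deployment in staging", tools, stub_reason))
# -> auth-service is in CrashLoopBackOff
```

The `max_turns` cap matters: without it, a confused model can cycle between the same two tools forever, which is the failure mode covered in the Troubleshooting section.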
Step 2: Implementing the Tooling Layer with MCP
The Model Context Protocol (MCP) is the primary mechanism for production-grade agents. Instead of hardcoding functions into your Python script, you run an MCP server that exposes your Kubernetes API.
First, install the MCP SDK for Python:
```shell
pip install mcp
```
Now, create a simple MCP server that provides a "skill" to get pod events. This is more efficient than giving the LLM raw kubectl access because you can filter the output to only include errors, which reduces token usage and hallucination risk.
```python
# k8s_mcp_server.py
from mcp.server.fastmcp import FastMCP
import subprocess

mcp = FastMCP("K8s-Guardian")

@mcp.tool()
def get_pod_errors(namespace: str) -> str:
    """Fetches only Warning events for pods in a specific namespace."""
    cmd = ["kubectl", "get", "events", "-n", namespace,
           "--field-selector", "type=Warning"]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        return f"Error fetching events: {result.stderr}"
    return result.stdout if result.stdout else "No warning events found."

if __name__ == "__main__":
    mcp.run()
```
To run this server:
```shell
python k8s_mcp_server.py
```
The LLM now sees get_pod_errors as a capability. When it encounters a deployment failure, it will autonomously decide to call this function rather than guessing. This architectural separation allows you to update the Python "skill" without changing the prompt of the LLM.
Step 3: Configuring Least-Privilege RBAC
Giving an AI agent a kubeconfig with cluster-admin is an unacceptable security risk. If the LLM hallucinates a command like kubectl delete ns --all, the agent will execute it.
You must create a dedicated ServiceAccount with a restricted Role. For our Deployment Guardian, the agent needs to read pods, events, and logs, but it should only be able to "patch" specific resources.
Create a file named agent-rbac.yaml:
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: k8s-ai-agent
  namespace: ai-ops
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: agent-read-write-role
  namespace: staging
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log", "events", "configmaps"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: agent-read-write-binding
  namespace: staging
subjects:
  - kind: ServiceAccount
    name: k8s-ai-agent
    namespace: ai-ops
roleRef:
  kind: Role
  name: agent-read-write-role
  apiGroup: rbac.authorization.k8s.io
```
Apply the configuration:
```shell
kubectl create namespace ai-ops
kubectl apply -f agent-rbac.yaml
```
To connect your agent to this identity, use a token-based approach or a projected volume if the agent runs inside the cluster. For local development, you can impersonate the ServiceAccount to verify permissions:
```shell
kubectl get pods -n staging --as=system:serviceaccount:ai-ops:k8s-ai-agent
```
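To make this habit cheap, the agent's tooling layer can append the impersonation flag automatically during local development, so every command is tested against the same RBAC the agent will have in-cluster. A small helper (the ServiceAccount defaults match the RBAC manifests above; the helper itself is an illustrative addition, not part of any framework):

```python
# Wrap kubectl argv with --as impersonation for the agent's ServiceAccount,
# so local testing exercises exactly the permissions granted by the Role.
def as_agent(kubectl_args: list[str],
             sa: str = "k8s-ai-agent",
             sa_namespace: str = "ai-ops") -> list[str]:
    """Returns the full kubectl argv, impersonating the agent identity."""
    return ["kubectl", *kubectl_args,
            f"--as=system:serviceaccount:{sa_namespace}:{sa}"]

# Builds the argv for the permission check shown above.
cmd = as_agent(["get", "pods", "-n", "staging"])
```

If a tool works with your admin kubeconfig but fails through `as_agent`, you have found an RBAC gap before the agent did.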
Step 4: Integrating K8sGPT for Diagnostic Skills
While custom MCP tools are great for specific tasks, K8sGPT provides a powerful set of pre-built diagnostic skills. It scans your cluster for common issues and uses an LLM to explain them.
First, install the K8sGPT CLI:
```shell
brew install k8sgpt
```
Now, authenticate it with your LLM provider:
```shell
k8sgpt auth add --backend openai --model gpt-4o
```
To integrate K8sGPT into your agent's skill set, wrap the k8sgpt analyze command into a tool. This allows the agent to trigger a full cluster scan and reason over the results.
```python
# Adding K8sGPT as a tool in our MCP server
@mcp.tool()
def analyze_cluster_health(namespace: str) -> str:
    """Runs a K8sGPT analysis on the namespace to find errors."""
    cmd = ["k8sgpt", "analyze", "--namespace", namespace]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout
```
When you run this, the output provides a detailed analysis:
```
$ k8sgpt analyze --namespace staging
[!] Pod 'auth-service-6f7d' is in CrashLoopBackOff
    Analysis: The pod is failing because the 'DB_PASSWORD' environment variable is missing.
    The application expects this variable to be provided via a Secret.
```
The agent can now combine this high-level analysis with its own get_configmap tool to find where the secret is missing. This creates a tiered diagnostic approach: K8sGPT finds the "what," and the custom MCP tools find the "how" and "where." If you see these errors frequently, check the Fix Kubernetes CrashLoopBackOff in Production guide for manual remediation steps.
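Rather than feeding the raw analysis text straight to the LLM, it is worth parsing it into structured findings first, so the agent can iterate over pods programmatically. The parser below is a sketch matched to the sample output above; K8sGPT's text format varies between versions, so treat the regex as an assumption to adjust:

```python
import re

# Parse human-readable k8sgpt output into structured findings.
# NOTE: assumes the "[!] Pod '<name>' is in <state>" shape shown in the
# sample output; k8sgpt's format differs between versions.
def parse_k8sgpt_output(text: str) -> list[dict]:
    results: list[dict] = []
    current = None
    for line in text.splitlines():
        m = re.match(r"\[!\] Pod '(?P<pod>[^']+)' is in (?P<state>\S+)", line.strip())
        if m:
            current = {"pod": m.group("pod"), "state": m.group("state"), "analysis": ""}
            results.append(current)
        elif current is not None and line.strip():
            # Indented lines following a finding are its analysis text.
            current["analysis"] += line.strip() + " "
    return results

sample = """[!] Pod 'auth-service-6f7d' is in CrashLoopBackOff
    Analysis: The pod is failing because the 'DB_PASSWORD' environment variable is missing."""
findings = parse_k8sgpt_output(sample)
```

Each finding can then be handed to a follow-up tool call (for example, checking the relevant ConfigMap or Secret) with the pod name already extracted.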
Step 5: Building the Resource Optimization Skill
Resource optimization requires the agent to observe metrics (via Prometheus or Metrics Server) and act on the Horizontal Pod Autoscaler (HPA).
To implement this, your agent needs a tool that can query the Metrics Server:
```python
@mcp.tool()
def get_pod_resource_usage(pod_name: str, namespace: str) -> str:
    """Retrieves CPU and Memory usage for a specific pod."""
    cmd = ["kubectl", "top", "pod", pod_name, "-n", namespace]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout
```
The agent's reasoning logic for optimization follows this pattern:
- Trigger: The agent is asked to "Optimize the checkout-service."
- Observation: It calls `get_pod_resource_usage` and sees the pod is consistently using 95% of its memory limit.
- Observation: It calls `kubectl get hpa` and sees the HPA is targeting 50% CPU, but the bottleneck is actually memory.
- Reasoning: The agent realizes the HPA should be updated to include memory metrics, or the memory limit should be increased.
- Action: It proposes a YAML change to the HPA definition.
For a detailed explanation of how HPA works to better tune your agent's prompts, read the Kubernetes HPA Deep Dive.
Step 6: Automating GitOps with ArgoCD Integration
An agent that runs kubectl patch directly creates "configuration drift." The source of truth must always be Git. Therefore, your agent's "Act" phase should target your GitOps tool.
If you are using ArgoCD, give your agent tools to interact with the ArgoCD API or the Git repository. First, ensure you have ArgoCD installed; if not, follow the How to Install Argo CD guide.
Now, create a tool that allows the agent to check the sync status of an application:
```python
@mcp.tool()
def get_argocd_app_status(app_name: str) -> str:
    """Checks if an ArgoCD application is Synced and Healthy."""
    cmd = ["argocd", "app", "get", app_name]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.stdout
```
The "GitOps Loop" for the agent is:
- Detect: The agent sees a pod failing in the cluster.
- Diagnose: It finds that the image tag `v1.2.0` has a bug.
- Resolve: It searches for the latest stable image tag in the registry.
- Act: Instead of running `kubectl set image`, it uses a GitHub API tool to create a Pull Request updating the image tag in the Git repository.
- Verify: It monitors ArgoCD until the app shows as `Synced` and `Healthy`.
This workflow ensures the AI agent remains a part of the governed pipeline. You can learn more about managing these sync policies in the Advanced Argo CD Sync Policies tutorial.
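For the "Act" step, the agent's PR tool ultimately builds a request body for GitHub's create-pull-request endpoint (`POST /repos/{owner}/{repo}/pulls`). A sketch of the payload builder; the branch name, base branch, and known-good tag below are hypothetical placeholders, not values from this guide:

```python
# Build the request body for GitHub's create-pull-request REST endpoint.
# The branch names and the "known-good" tag are hypothetical placeholders;
# the agent would discover the real values from the registry and repo.
def build_pr_payload(bad_tag: str, good_tag: str, branch: str) -> dict:
    return {
        "title": f"fix: roll back image from {bad_tag} to {good_tag}",
        "head": branch,          # branch containing the agent's commit
        "base": "main",
        "body": ("Automated proposal by the Deployment Guardian agent.\n"
                 f"Detected a failing rollout on tag {bad_tag}; "
                 f"reverting to last known-good tag {good_tag}."),
    }

payload = build_pr_payload("v1.2.0", "v1.1.9", "agent/rollback-v1.2.0")
```

Because the change lands as a PR, the existing review process, CI checks, and ArgoCD sync remain the path to production; the agent never bypasses them.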
Step 7: Implementing Safety Rails and Human-in-the-Loop (HITL)
To prevent "hallucination-driven outages," you must implement a safety layer between the agent's reasoning and the action.
1. The Dry-Run Constraint
Every tool that modifies the cluster must implement a --dry-run=server flag by default. The agent should first call the tool in dry-run mode and present the proposed change.
```python
import tempfile

@mcp.tool()
def propose_deployment_patch(deployment_name: str, namespace: str, patch_yaml: str) -> str:
    """Proposes a change to a deployment using a server-side dry run."""
    # Write the patch to a temp file so concurrent calls don't clobber each other.
    with tempfile.NamedTemporaryFile("w", suffix=".yaml", delete=False) as f:
        f.write(patch_yaml)
        patch_file = f.name
    cmd = ["kubectl", "patch", "deployment", deployment_name, "-n", namespace,
           "--patch-file", patch_file, "--dry-run=server"]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        return f"Dry-run rejected the patch: {result.stderr}"
    return f"Proposed Change: {result.stdout}"
```
2. The Approval Gate (HITL)
The agent must not execute a patch or delete command without manual approval from a human operator, typically via a Slack bot or CLI prompt.
```
Agent: "I've found that the auth-service is OOMKilled. I propose increasing
        the memory limit from 256Mi to 512Mi. Should I apply this change? [Yes/No]"
Human: "Yes"
Agent: (Executes the actual patch command)
```
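The gate itself can be a thin wrapper that every mutating tool passes through. A minimal sketch: `confirm` defaults to a CLI prompt, and swapping in a Slack interaction is an assumed integration, not shown here:

```python
# Approval gate: mutating actions run only after an explicit human "yes".
# `confirm` defaults to a CLI prompt; in production it could be replaced by
# a Slack interaction (that integration is assumed, not implemented here).
from typing import Callable

def gated_execute(description: str, execute: Callable[[], str],
                  confirm: Callable[[str], str] = input) -> str:
    answer = confirm(f"{description} Apply this change? [Yes/No] ").strip().lower()
    if answer in ("yes", "y"):
        return execute()
    return "Change rejected by operator; no action taken."

# Example with a scripted approval instead of a live prompt:
result = gated_execute(
    "auth-service is OOMKilled; increase memory limit from 256Mi to 512Mi.",
    execute=lambda: "patched",
    confirm=lambda _: "Yes",
)
```

Keeping the gate in the tooling layer, rather than in the prompt, means a hallucinating model cannot talk its way past it.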
3. Policy-as-Code (Kyverno/OPA)
A cluster-level policy engine like Kyverno or OPA Gatekeeper should be the final line of defense. For example, a policy that prevents any resource from being deleted in the production namespace, regardless of the requester's identity.
Step 8: Testing and Validating Agent Performance
Treat your agent's skills like production code.
Unit Testing Tools
Test each MCP tool independently. If your get_pod_errors tool fails to parse kubectl output, the LLM will receive garbage and hallucinate a solution.
```shell
# Example test for the tool
python -c "from k8s_mcp_server import get_pod_errors; print(get_pod_errors('staging'))"
```
Scenario-Based Validation (Chaos Engineering)
Test your agent by intentionally breaking things in a sandbox:
- Inject a Failure: Delete a Secret that a deployment needs.
- Trigger Agent: Ask, "Why is the deployment failing?"
- Evaluate:
- Did it find the missing secret? (Correctness)
- Did it suggest the right fix? (Accuracy)
- Did it try to delete the namespace? (Safety)
- How many tool calls did it take? (Efficiency)
Token Cost and Latency Tracking
Agents can be expensive. A complex diagnostic loop might call 10 different tools, sending significant context back to the LLM. Use tools like LangSmith or Arize Phoenix to trace the agent's thoughts. If the agent loops infinitely (calling the same tool repeatedly), refine the system prompt to include a "maximum tool call" limit.
Troubleshooting
The Agent "Loops" Infinitely
Symptom: The agent calls get_pod_status repeatedly for 20 turns.
Fix: Update the system prompt: "If a tool returns the same result twice, do not call it again. Instead, try a different diagnostic tool or ask the user for more information."
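Prompt instructions alone are soft guidance, so it helps to back them with a hard guard in the tooling layer. One sketch of such a backstop (the two-repeat limit is an illustrative choice):

```python
# Hard backstop for infinite loops: track (tool, args, result) signatures
# and refuse once the identical call has produced the identical result
# too many times. The max_repeats=2 limit is an illustrative choice.
class LoopGuard:
    def __init__(self, max_repeats: int = 2):
        self.max_repeats = max_repeats
        self.seen: dict[tuple, int] = {}

    def check(self, tool: str, args: tuple, result: str) -> bool:
        """Returns False when this exact call+result exceeds the repeat budget."""
        key = (tool, args, result)
        self.seen[key] = self.seen.get(key, 0) + 1
        return self.seen[key] <= self.max_repeats

guard = LoopGuard()
guard.check("get_pod_status", ("staging",), "CrashLoopBackOff")  # True
guard.check("get_pod_status", ("staging",), "CrashLoopBackOff")  # True (second try allowed)
guard.check("get_pod_status", ("staging",), "CrashLoopBackOff")  # False -> abort the loop
```

When `check` returns `False`, the agent loop should stop calling tools and either try a different diagnostic path or hand off to a human.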
RBAC "Forbidden" Errors
Symptom: Error from server (Forbidden): pods "my-pod" is forbidden: User "system:serviceaccount:ai-ops:k8s-ai-agent" cannot get resource "pods/log".
Fix: Check your Role definition. pods and pods/log are different resources in Kubernetes. You must explicitly list pods/log in the resources section of your RBAC YAML.
Hallucinated CLI Flags
Symptom: The agent tries to run kubectl get pods --show-all-errors, which is not a real flag.
Fix: Be explicit in your MCP tool description. Instead of "Fetch pods," say "Fetch pods using the exact command kubectl get pods -n {namespace}."
Context Window Overflow
Symptom: The agent "forgets" the initial error after calling several tools.
Fix: Implement "summarization" in your tools. Instead of returning raw kubectl output, filter for the top 5 most relevant errors before sending the text to the LLM.
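A minimal version of such tool-side summarization might look like this; the keyword list is an illustrative assumption you would tune to your workloads:

```python
# Tool-side summarization: return only the most relevant error lines
# instead of raw kubectl output, keeping the agent's context window small.
# The keyword list is an illustrative assumption.
ERROR_KEYWORDS = ("Error", "Failed", "CrashLoopBackOff", "OOMKilled", "Warning")

def summarize_output(raw: str, max_lines: int = 5) -> str:
    """Keeps at most max_lines lines that mention a known error keyword."""
    relevant = [ln for ln in raw.splitlines()
                if any(k in ln for k in ERROR_KEYWORDS)]
    return "\n".join(relevant[:max_lines]) or "No error lines found."

raw = "pod-a Running\npod-b CrashLoopBackOff\npod-c OOMKilled\n"
print(summarize_output(raw))
```

Applying this filter inside every tool, before the text ever reaches the LLM, both cuts token cost and keeps the original error visible late in long diagnostic sessions.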
Conclusion
Building AI agents for Kubernetes is a shift from "writing scripts" to "designing capabilities." By utilizing the Model Context Protocol (MCP), you decouple your agent's reasoning from the underlying API calls, allowing you to iterate on "skills" without breaking the agent's logic.
We have moved from the basic "Observe → Reason → Act" loop to a production-ready architecture featuring least-privilege RBAC, GitOps integration via ArgoCD, and strict human-in-the-loop safety rails.
Actionable Next Steps:
- Start Small: Implement one "read-only" skill (like the `get_pod_errors` tool) and run it in a local Kind cluster.
- Secure the Perimeter: Apply the RBAC constraints before moving the agent to a shared development environment.
- Implement the Gate: Add a manual approval step for any tool that uses `kubectl patch` or `kubectl delete`.
- Monitor and Refine: Use a tracing tool to see where your agent is hallucinating and refine your tool descriptions accordingly.