Samir Saqer

How I Built a K8S MCP Troubleshooting Agent on GKE

For the GKE Turns 10 Hackathon (gketurns10.devpost), I built a Kubernetes troubleshooting agent designed to resolve common cluster issues quickly on Google Kubernetes Engine (GKE).
The goal: to combine Kubernetes observability with AI-powered insights for faster, smarter operations.

Why I Built It

The Challenge: Give Microservices an AI Upgrade
The hackathon challenge was to supercharge an existing microservice application with agentic AI. Using one of the pre-built applications — either the Bank of Anthos or the Online Boutique — the task was to build on top of it.

The twist? You don’t touch the core application code. Instead, you build new, containerized components that interact with the existing APIs. Think of it as building a smart, external brain that adds a whole new layer of intelligence.

Running apps on Kubernetes is powerful, but when things go wrong, troubleshooting can be stressful. Pods crash, services fail, resources spike — and you often find yourself chasing logs across different tools.

So I asked myself: What if Kubernetes could monitor itself intelligently? What if it could not only alert me, but also suggest — or even take — the first step toward a fix?

That question became the starting point for my project.

What the Agent Can Do

The troubleshooting agent acts as an intelligent assistant for Kubernetes clusters. It continuously monitors pods, services, deployments, and overall resource utilization to keep track of cluster health.
When problems occur, it can gather information (list pods, grab logs, describe deployments, check service status), analyze issues (pod failures, image pull errors, network connectivity), and provide AI-powered troubleshooting suggestions.
It doesn’t stop at analysis — the agent can also take remediation actions such as restarting failed pods, scaling deployments, or cleaning up stuck resources. With support for YAML manifest management, dynamic scaling, and log inspection, it extends beyond monitoring into full lifecycle management.
Through integration with GKE, Prometheus, Cloud Monitoring, and custom MCP tools, the agent plugs into existing workflows while offering smarter automation and predictive insights.
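
The analysis step described above can be sketched as a small lookup from container state reasons to a suggested first step. This is only an illustration under assumptions, not the project's actual code: the `suggest_for_pod` function, the `HINTS` table, and the hint wording are hypothetical, and the input is a plain dict shaped like the pod status JSON that the Kubernetes API returns.

```python
from typing import Any

# Hypothetical hint table: maps well-known Kubernetes container state reasons
# to a first troubleshooting step. The hint wording is illustrative only.
HINTS: dict[str, str] = {
    "ImagePullBackOff": "Check the image name, tag, and registry credentials.",
    "ErrImagePull": "Check the image name, tag, and registry credentials.",
    "CrashLoopBackOff": "Inspect the previous container logs for the crash cause.",
    "OOMKilled": "Raise the memory limit or reduce the container's memory use.",
}


def suggest_for_pod(pod: dict[str, Any]) -> list[str]:
    """Return one suggestion per container whose state reason is recognized."""
    suggestions = []
    statuses = pod.get("status", {}).get("containerStatuses", [])
    for cs in statuses:
        state = cs.get("state", {})
        # Only the "waiting" and "terminated" states carry a reason field.
        detail = state.get("waiting") or state.get("terminated") or {}
        reason = detail.get("reason")
        if reason in HINTS:
            suggestions.append(f"{cs.get('name', '?')}: {reason}. {HINTS[reason]}")
    return suggestions


# Sample input (made up for this example), trimmed to the fields the function reads.
sample_pod = {
    "status": {
        "containerStatuses": [
            {"name": "web", "state": {"waiting": {"reason": "ImagePullBackOff"}}},
            {"name": "sidecar", "state": {"running": {}}},
        ]
    }
}
print(suggest_for_pod(sample_pod))
```

A fixed table like this only covers known failure reasons; in the agent described here, the LLM would take over where the table has no entry.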

Available tools include:

- get_cluster_info – basic cluster info, node status, health
- list_pods – pod list with status, resource usage, readiness
- get_pod_logs – retrieve logs for troubleshooting
- describe_pod – detailed pod info and events
- get_service_status – check service endpoints & networking
- get_deployment_status – monitor deployment health & replicas
- delete_resource – delete a K8s resource (pod, service, deployment, etc.)
- suggest_troubleshooting – AI-powered troubleshooting tips
- automate_remediation – remediation analysis (e.g., image pull errors)
- get_gke_cluster_metrics – GKE-specific performance metrics
- scale_deployment – scale deployments to N replicas
- exec_pod_command – run commands inside a pod container
- network_connectivity_test – test DNS and network connectivity
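
To make the idea of named tools concrete, here is a minimal plain-Python sketch of a tool registry with name-based dispatch. It is a stand-in under assumptions, not the project's implementation: the `TOOLS` registry, the `tool` decorator, `call_tool`, and the `FAKE_DEPLOYMENTS` sample data are all hypothetical. In the real server, registration would go through the MCP SDK (for example FastMCP's tool decorator) and the handlers would call the Kubernetes API instead of an in-memory dict.

```python
from typing import Any, Callable

# Hypothetical in-memory registry standing in for an MCP server's tool table.
TOOLS: dict[str, Callable[..., Any]] = {}


def tool(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function under its own name so it can be looked up by tool name."""
    TOOLS[fn.__name__] = fn
    return fn


# Sample data used instead of a live cluster (assumption for this sketch).
FAKE_DEPLOYMENTS: dict[str, int] = {"frontend": 2}


@tool
def get_deployment_status(name: str) -> dict[str, Any]:
    """Report the replica count for a deployment in the sample data."""
    if name not in FAKE_DEPLOYMENTS:
        return {"name": name, "found": False}
    return {"name": name, "found": True, "replicas": FAKE_DEPLOYMENTS[name]}


@tool
def scale_deployment(name: str, replicas: int) -> dict[str, Any]:
    """Set the replica count, rejecting negative values and unknown deployments."""
    if replicas < 0:
        raise ValueError("replicas must be >= 0")
    if name not in FAKE_DEPLOYMENTS:
        raise KeyError(name)
    FAKE_DEPLOYMENTS[name] = replicas
    return get_deployment_status(name)


def call_tool(tool_name: str, **kwargs: Any) -> Any:
    """Dispatch a call by tool name, as an agent would after choosing a tool."""
    return TOOLS[tool_name](**kwargs)


print(call_tool("scale_deployment", name="frontend", replicas=3))
```

Validating arguments inside the handler, as `scale_deployment` does here, keeps a mistaken agent request from reaching the cluster at all.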

Technologies Used

  • Google Kubernetes Engine (GKE)
  • Model Context Protocol (MCP) server via mcp.server.fastmcp
  • Google ADK (google.adk.agents.LlmAgent) for the conversational agent
  • Vertex AI Authentication
  • Python 3.11, kubernetes Python client, httpx, requests
  • Docker + Artifact Registry (or GCR) and Cloud Build for CI
  • kubectl manifests and RBAC for in-cluster deployment

Data Sources & External Services

  • Kubernetes API (via ServiceAccount or kubeconfig)
  • Google Cloud Project metadata (project ID, cluster name/zone)
  • metrics-server for node/pod metrics
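
The choice between ServiceAccount and kubeconfig access can be sketched with the standard library alone. The `detect_auth_mode` helper below is hypothetical and not taken from the project; it mirrors the usual convention that a pod has the `KUBERNETES_SERVICE_HOST` and `KUBERNETES_SERVICE_PORT` environment variables set and a ServiceAccount token mounted at a well-known path. With the kubernetes Python client, the two branches would typically map to `config.load_incluster_config()` and `config.load_kube_config()`.

```python
import os
from collections.abc import Mapping

# Default path where Kubernetes mounts the ServiceAccount token inside a pod.
TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"


def detect_auth_mode(
    env: Mapping[str, str] = os.environ,
    token_path: str = TOKEN_PATH,
) -> str:
    """Return "in-cluster" when running in a pod with a mounted token, else "kubeconfig"."""
    in_pod = "KUBERNETES_SERVICE_HOST" in env and "KUBERNETES_SERVICE_PORT" in env
    if in_pod and os.path.isfile(token_path):
        return "in-cluster"
    return "kubeconfig"


print(detect_auth_mode())
```

Passing the environment and token path as parameters keeps the check easy to exercise outside a cluster.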

Findings and Learnings

Building this project for the GKE Turns 10 Hackathon gave me several insights:

  1. GKE’s robust API and integration capabilities make it ideal for intelligent monitoring.
  2. Combining ADK with Kubernetes operations enables smarter automation and decision-making.
  3. MCP provides a flexible framework for extending cluster monitoring.
  4. Real-time monitoring + AI-driven insights = significantly better cluster management.

Future Enhancements

  1. Stronger predictive analytics for autoscaling
  2. ML models for anomaly detection
  3. Extended automation capabilities
  4. Tighter integration with Google Cloud services

Closing Thoughts

Hackathons are always a mix of excitement and chaos. I hit plenty of walls — wrong metrics, RBAC permission issues, weird errors at 2am. But each blocker forced me to learn more, and when it finally clicked, it felt like magic.

This project was my contribution to the GKE Turns 10 Hackathon, and I’m proud of what I built. More than just code, it’s a step toward making cloud-native systems less overwhelming and more human-friendly.

If you’re curious, the code and details are in the repo. And if you’ve ever wished Kubernetes could troubleshoot itself, I’d love to hear your thoughts.
github.com/k8s-mcp-and-adk-agent
Try the Agent on ADK

👉 Here’s to ten years of GKE — and to building the next ten smarter.
#GKETurns10 #GKEHackathon
