Why I Built This
Every DevOps incident I've dealt with follows the same pattern.
Something breaks. I open five terminal tabs. I run kubectl, check the AWS console, tail Docker logs, and manually piece together what went wrong — all while the clock is ticking.
The tools all exist. They're just completely disconnected from each other.
I wanted one interface where I could ask a question, have something reach into all of those systems simultaneously, and get an explanation of what it found. So I built an MCP server that gives Claude Desktop direct access to my real infrastructure.
How It Works
The server runs locally on my laptop. Claude Desktop launches it on startup and communicates over stdio (standard input/output).
When I type a question, Claude reads the 14 registered tool descriptions, decides which ones to call, and sends a JSON request to my server:
{"method": "call_tool", "params": {"name": "get_failing_pods", "arguments": {"namespace": "default"}}}
My server executes against real infrastructure and returns structured text. Claude synthesises everything into a plain English response.
The key thing: Claude never connects to AWS or Kubernetes directly. It only talks to my Python server. My server uses local credentials (~/.aws/credentials, ~/.kube/config) to reach the real systems.
14 tools across 4 categories:
- Kubernetes: pod status, failing pods, logs, restart deployment, describe pod
- AWS: cost report, EC2 instances, CloudWatch alarms, S3 buckets
- Docker: list containers, container logs, restart container
- Terraform: run plan, check state
Demo / Results
I typed: "Give me a health report of my infrastructure"
Claude called get_pod_status, get_aws_cost, get_cloudwatch_alarms, and list_containers simultaneously and returned a unified summary — Kubernetes pods, AWS costs, Docker containers, CloudWatch alarms — all in one response.
One pod was in CrashLoopBackOff with 14 restarts. I typed:
"Pull the logs from broken-app and tell me why it's crashing"
Claude diagnosed it instantly: the busybox image has no long-running process, so the container exits as soon as it starts, and Kubernetes keeps restarting it.
Before: 45 minutes of manual investigation across 5 tabs.
After: 10 seconds, one chat window.
What I Learned
1. The tool description is everything.
Claude picks tools based on their description, not their name. A vague description causes Claude to call the wrong tool or skip it entirely. My first version had get_failing_pods described as "Get pods". Claude never called it. I changed it to "List only pods in Failed, CrashLoopBackOff, OOMKilled, or Error state" and it worked perfectly every time. The description is the API contract between you and Claude.
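To make the contrast concrete, here is a sketch of the same tool advertised both ways. The dict shape mirrors what an MCP tool listing exposes (name, description, JSON input schema); the real server registers these through the MCP SDK rather than as raw dicts.

```python
# Hypothetical sketch: the vague description Claude ignored,
# versus the precise one it reliably picked.
vague_tool = {
    "name": "get_failing_pods",
    "description": "Get pods",  # too generic: indistinguishable from get_pod_status
    "inputSchema": {
        "type": "object",
        "properties": {"namespace": {"type": "string"}},
    },
}

precise_tool = {
    "name": "get_failing_pods",
    "description": (
        "List only pods in Failed, CrashLoopBackOff, "
        "OOMKilled, or Error state"
    ),
    "inputSchema": {
        "type": "object",
        "properties": {"namespace": {"type": "string"}},
    },
}
```

Same name, same schema, same code behind it. Only the description changed, and that was the difference between the tool never being called and being called correctly every time.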
2. Parallel tool calls are not guaranteed — they're emergent.
I didn't write any code to call tools in parallel. Claude decided to do that on its own when the question was broad enough ("health report"). For narrow questions ("which pods are failing?") Claude called only one tool. The parallelism comes from Claude's reasoning, not from your server. This means your tool descriptions need to be distinct enough that Claude can make that judgment call correctly.
3. Sync SDKs in an async server is the silent performance killer.
boto3, the Kubernetes Python client, and the Docker SDK are all blocking. If you call them directly inside an async MCP handler, you stall the entire event loop. Every tool call wraps the SDK call in asyncio.to_thread(). Without this, Claude would time out waiting for responses when multiple tools were called simultaneously.
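The wrapping pattern looks like this. To keep the sketch self-contained, `time.sleep` stands in for a blocking SDK call (boto3, the Kubernetes client, the Docker SDK); the function names are placeholders, not the real tool implementations.

```python
import asyncio
import time

def blocking_cost_report():
    # Stand-in for a blocking SDK call (e.g. boto3 Cost Explorer).
    time.sleep(0.2)
    return "cost: $1.23"

def blocking_pod_status():
    # Stand-in for the blocking Kubernetes Python client.
    time.sleep(0.2)
    return "3 pods running"

async def call_tool(fn):
    # Run the blocking call in a worker thread so the event loop
    # stays free to serve other tool calls at the same time.
    return await asyncio.to_thread(fn)

async def main():
    start = time.monotonic()
    results = await asyncio.gather(
        call_tool(blocking_cost_report),
        call_tool(blocking_pod_status),
    )
    return results, time.monotonic() - start

results, elapsed = asyncio.run(main())
# The two calls overlap in threads, so elapsed is roughly 0.2s, not 0.4s.
```

Call the blocking functions directly inside the async handler instead, and the two calls serialise: the event loop stalls on the first one while the second waits.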
4. Error handling scope matters more than you think.
My first version caught SDK import errors at module level. If boto3 failed to connect to AWS at startup, the entire server crashed — Kubernetes and Docker tools went down too. The fix: catch errors inside each tool call, not at import time. One broken service should never take down the whole server.
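A minimal sketch of the fixed scoping, with stub tool functions standing in for the real SDK-backed ones: the try/except lives inside the per-call dispatcher, so a dead AWS backend returns an error string while the Kubernetes tool keeps working.

```python
import asyncio

def get_pod_status(namespace="default"):
    # Stub standing in for the real Kubernetes-backed tool.
    return f"3 pods running in {namespace}"

def get_aws_cost():
    # Simulate boto3 failing at call time (bad creds, no network).
    raise ConnectionError("could not reach AWS")

TOOLS = {"get_pod_status": get_pod_status, "get_aws_cost": get_aws_cost}

async def call_tool(name, arguments):
    # Catch errors per call, not at import time: one broken backend
    # becomes an error message instead of a dead server.
    try:
        return await asyncio.to_thread(TOOLS[name], **arguments)
    except Exception as exc:
        return f"Error in {name}: {exc}"

print(asyncio.run(call_tool("get_pod_status", {"namespace": "default"})))
print(asyncio.run(call_tool("get_aws_cost", {})))
```

Claude also handles the error string gracefully: it reports which backend is down and carries on with the tools that still work.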
5. The MCP protocol is simpler than it looks.
There are really only two things you implement: list_tools() (return tool names + descriptions + input schema) and call_tool() (receive tool name + arguments, return text). Everything else is handled by the MCP SDK. Building a tool is 20 lines of Python.
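Stripped of the SDK plumbing, the two handlers reduce to this shape. In the Python MCP SDK they are registered with decorators on a `Server` instance; here they are plain functions so the sketch stands alone, and the single tool shown is a trimmed-down stand-in for the real kubeconfig-backed version.

```python
def list_tools():
    # Advertise each tool: name, description, JSON input schema.
    return [
        {
            "name": "get_failing_pods",
            "description": (
                "List only pods in Failed, CrashLoopBackOff, "
                "OOMKilled, or Error state"
            ),
            "inputSchema": {
                "type": "object",
                "properties": {"namespace": {"type": "string"}},
            },
        },
    ]

def call_tool(name, arguments):
    # Receive a tool name + arguments, return text.
    if name == "get_failing_pods":
        ns = arguments.get("namespace", "default")
        # Real version queries the cluster via ~/.kube/config.
        return f"No failing pods in namespace {ns}"
    raise ValueError(f"Unknown tool: {name}")
```

Everything else, the stdio transport, the JSON-RPC framing, the handshake with Claude Desktop, comes from the SDK for free.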
Limitations
Single EC2 instance. The demo infrastructure runs on one t3.small. In a real multi-node Kubernetes cluster, the pod information would be spread across namespaces and nodes. The tools work against any kubeconfig — but the demo is scoped to what's in one cluster.
Read-heavy, not write-heavy. Most tools read state — pod status, AWS costs, Docker containers. The only write operation is restart_deployment. There's no tool to edit a deployment YAML, push a config change, or create resources. Giving Claude write access to production infrastructure is a separate (and more serious) conversation about guardrails.
Claude restarts but can't fix root causes. If a pod is in CrashLoopBackOff because of a bad container image, Claude can restart it — but it'll crash again immediately. Claude diagnosed the issue correctly and warned me, but the actual fix (updating the deployment YAML and pushing a correct image) is outside the scope of the tools I gave it. The tools match what I was comfortable giving an AI access to.
No authentication on the MCP server. The server runs locally over stdio — there's no network exposure. But if you ever expose this over a network (e.g., run the server on EC2 and connect Claude Desktop remotely), you'd need to add authentication. Don't do that without thinking it through.
AWS credentials are local. The server uses whatever credentials are in ~/.aws/credentials. If you're running this on a shared machine, those credentials are accessible to anything running as your user. Use IAM roles with minimal permissions scoped to read-only for Cost Explorer, EC2, CloudWatch, and S3.
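As a starting point, a read-only IAM policy for this server might look like the fragment below. The four actions cover the read paths described above; treat the exact action list as an assumption to adjust against whichever tools you actually keep.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ce:GetCostAndUsage",
        "ec2:DescribeInstances",
        "cloudwatch:DescribeAlarms",
        "s3:ListAllMyBuckets"
      ],
      "Resource": "*"
    }
  ]
}
```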
Try It
GitHub: https://github.com/ThinkWithOps/ai-devops-systems-lab.git
Demo: https://youtu.be/2XsNcKSa28s
git clone https://github.com/ThinkWithOps/ai-devops-systems-lab.git
cd 03-ai-devops-mcp-server
pip install -r requirements.txt
# Test with mock data — no real infra needed
KUBE_MOCK_MODE=true python -m pytest tests/ -v
Add the MCP server to your Claude Desktop config and restart. Full setup guide in the README.
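For reference, a Claude Desktop entry follows the standard `mcpServers` shape in `claude_desktop_config.json`. The server name and command below are placeholders; the README has the exact values for this repo.

```json
{
  "mcpServers": {
    "devops": {
      "command": "python",
      "args": ["-m", "server"]
    }
  }
}
```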
This is Project 03 of the ThinkWithOps AI + DevOps series.
What would you control first if Claude had access to your infrastructure? Drop it in the comments.