Troubleshooting EKS with MCP: The Good, the Bad, and the Ugly (plus the Setup)

#ai #aws #kubernetes #learning

As part of our sessions to develop the skills of a one-person army with AI tools, we began exploring how to integrate their use into our daily tasks.

One task that is simple but time-consuming is troubleshooting issues during the deployment of our applications. In some of our projects, we use Kubernetes as the deployment platform. In dev environments, it's not uncommon for dev teams to struggle with deploying their applications: they may be new to the topic, lack some information, or have a typo in the code. We can have a flawless GitOps pipeline, but the human factor persists, and one way to help the teams is to give them tools that support troubleshooting.

The Setup

Since we cannot share an actual setup from one of the projects, we tested the efficiency of solving issues in a Kubernetes cluster as an example. The complete setup is the following:

EKS cluster with a sample web application
- Deployed via eksctl
- 3-tier web application deployed with hidden errors
Local environment with
- Q Developer, using Claude Sonnet 4
- EKS MCP server
- kubectl and eksctl

The references used for the setup will be at the end of the post, but as a summary, these steps were done:

# Prerequisites: Q Developer, MCP servers, uv, kubectl and AWS CLI installed 
# References at the end of the post, as summary for Q:
## [Download Amazon Q for command line for Linux AppImage](https://desktop-release.q.us-east-1.amazonaws.com/latest/amazon-q.appimage)
chmod +x amazon-q.appimage
./amazon-q.appimage
## Authenticate with Builder ID, or with IAM Identity Center using the start URL given to you by your account administrator

# MCP Servers configuration
pip install awslabs.aws-api-mcp-server

vim ~/.aws/amazonq/mcp.json
### Copy and paste content below
{
  "mcpServers": {
    "awslabs.aws-api-mcp-server": {
      "command": "python",
      "args": [
        "-m",
        "awslabs.aws_api_mcp_server.server"
      ],
      "env": {
        "AWS_REGION": "YOUR_REGION"
      },
      "disabled": false,
      "autoApprove": []
    },
    "awslabs.eks-mcp-server": {
      "command": "uvx",
      "args": [
        "awslabs.eks-mcp-server@latest",
        "--allow-write",
        "--allow-sensitive-data-access"
      ],
      "env": {
        "FASTMCP_LOG_LEVEL": "ERROR"
      },
      "autoApprove": [],
      "disabled": false
    }
  }
}


# AWS CLI configuration
aws configure

## To store credentials:
vim ~/.aws/credentials
### Copy paste the credentials
[default]
aws_access_key_id=SHALALA
aws_secret_access_key=shalalala
aws_session_token=token


# Install eksctl
cd ~
ARCH=amd64
PLATFORM=$(uname -s)_$ARCH
curl -sLO "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_$PLATFORM.tar.gz"
curl -sL "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_checksums.txt" | grep $PLATFORM | sha256sum --check
tar -xzf eksctl_$PLATFORM.tar.gz -C /tmp && rm eksctl_$PLATFORM.tar.gz
sudo install -m 0755 /tmp/eksctl /usr/local/bin && rm /tmp/eksctl


# Deploy K8s cluster
export AWS_REGION=eu-west-1
export EKS_CLUSTER_NAME=eks-workshop
cat cluster.yaml | envsubst | eksctl create cluster -f -

# Kubectl configuration
aws eks update-kubeconfig --name $EKS_CLUSTER_NAME

# Once EKS is ready, deploy the application. In this case, kubernetes_xD.yaml has some hidden issues in it.
kubectl apply -f kubernetes_xD.yaml
kubectl wait --for=condition=available deployments --all

# Validating application is working
## Get load balancer URL
kubectl get svc ui
## Access it via browser (port 80), example:
## http://a7a2821a812cb40daa48ab4cca3e4179-191623826.eu-west-1.elb.amazonaws.com:80

## Initially the above page will throw a 500 Error

# To redeploy the components via YAML
kubectl apply -f kubernetes_xD.yaml

# To delete the components
kubectl delete -f kubernetes_xD.yaml

# REMEMBER!
# Once you are finished, destroy the cluster to avoid unnecessary costs
eksctl delete cluster $EKS_CLUSTER_NAME --wait

After this, we had an amazing EKS cluster with several crashed pods. Great, now how do we fix this, while we have tons of meetings in parallel in the project? The answer is to delegate to the AI and check and approve the proposed fixes.

The Good

The summary provided by Q Developer regarding our environment's status is actually good. It took around 5 minutes to check all namespaces and detect what components were failing.

The prompt for this feels natural to write. Instead of asking to fix everything from the beginning, we started with a request for a summary, focusing on the fixes later.
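The same first pass can be approximated by hand, which is useful for checking the summary the tool returns. The following is a minimal sketch, not what Q Developer runs internally: `failing_pods` is a hypothetical helper name, and it assumes the default column layout of `kubectl get pods -A --no-headers` (namespace, name, ready, status, and so on).

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl get pods -A --no-headers` output on stdin
# and prints "namespace/pod status" for every pod that does not look healthy.
# Assumes the default columns: NAMESPACE NAME READY STATUS RESTARTS AGE.
failing_pods() {
  awk '{
    split($3, ready, "/")    # READY column looks like "1/2"
    healthy = ($4 == "Running" && ready[1] == ready[2]) || $4 == "Completed"
    if (!healthy) print $1 "/" $2, $4
  }'
}
```

Against a live cluster it would be used as `kubectl get pods -A --no-headers | failing_pods`. A pod that is `Running` but not fully ready is also reported, since that usually means a failing readiness probe.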

After that, given the list of issues provided by Q, we started fixing them one by one, and that's where the problems started.

The Bad

Assumptions are a risk! Without much context, we observed that the tools accept the metadata written in the descriptions as truth, which leads to their own assumptions, some of which are beneficial, while others are not. The trick is in the prompt we give to clarify those assumptions.

As an example:
One of the pods was failing because we had deliberately set an environment variable that references ActiveMQ, while the platform uses RabbitMQ.

Instead of pointing the application at RabbitMQ, the tool assumed that ActiveMQ needed to be deployed for the application. As mentioned above, we avoided this by providing the correct context in the prompt.
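A mismatch like this is cheap to spot by hand before delegating the fix. Here is a minimal sketch, assuming the container environment has been dumped as `KEY=VALUE` lines (for example with `kubectl set env deployment/<name> --list`). `flag_unexpected_broker` is a hypothetical helper, the list of broker names it looks for is an assumption, and the variable names in the usage note are invented for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads KEY=VALUE lines on stdin and prints those whose
# value mentions a message broker other than the expected one.
# $1 = broker the platform actually runs (e.g. rabbitmq).
# The set of broker names checked here is an assumption, not a complete list.
flag_unexpected_broker() {
  awk -F= -v expected="$1" '
    {
      # Everything after the first "=" is the value; lines without "=" yield "".
      value = tolower(substr($0, length($1) + 2))
      if (value ~ /activemq|rabbitmq|kafka/ && index(value, tolower(expected)) == 0)
        print
    }'
}
```

Given the lines `BROKER_URL=tcp://activemq:61616` and `QUEUE_HOST=rabbitmq`, calling `flag_unexpected_broker rabbitmq` prints only the first one. Pasting the flagged line into the prompt, together with the fact that the platform uses RabbitMQ, is the kind of context that steered the tool away from deploying ActiveMQ.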

The Ugly

While trying to fix the issues, the tool made several requests that might not have been necessary. A human would detect such cases faster, and we would expect the tool, once it has faced a similar issue, to learn from it and not repeat the same mistake in the next request.

As an example:
While solving the first issue, it detected that the pod was having a problem. It first checked the pod with a get command, only to conclude afterwards that a describe command was needed. The same happened with the second issue. The third time, I expected the describe command to be the first one tried, but the tool still followed the same sequence, even though the get command didn't provide anything useful.

This might be because it still needs more iterations and use on the tasks for the learning to happen, but still, it's something to consider: It will take time to adjust to the current context of the project.
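The shortcut we expected can be written down as a simple rule: when the pod status already signals a failure, skip the bare `get` and go straight to `describe` and the logs. Below is a sketch of that rule, with `triage_commands` as a hypothetical helper that only prints the commands instead of running them. The statuses it distinguishes are an assumption and not an exhaustive list.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the kubectl commands worth running first for a
# pod, based on the STATUS already shown in the pod listing.
# Usage: triage_commands <namespace> <pod> <status>
triage_commands() {
  local ns="$1" pod="$2" status="$3"
  case "$status" in
    Running|Completed)
      # Nothing obviously wrong: a wider get is enough as a first look.
      echo "kubectl get pod $pod -n $ns -o wide"
      ;;
    CrashLoopBackOff|Error)
      # Container started and died: events plus the previous container's logs.
      echo "kubectl describe pod $pod -n $ns"
      echo "kubectl logs $pod -n $ns --previous"
      ;;
    *)
      # Pending, ImagePullBackOff, CreateContainerConfigError and similar:
      # the reason shows up in the events section of describe.
      echo "kubectl describe pod $pod -n $ns"
      ;;
  esac
}
```

A rule like this could also be stated in the prompt itself (for example, asking the tool to start with describe for any pod that is not Running), which is one way to give it the project context it does not pick up on its own.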

Final Thoughts

Even if this sample scenario is quite basic, we hope it will help other teams to be onboarded on this new way of doing troubleshooting.

As AI continues to evolve, now is the perfect time to become familiar with it and explore how it can support our projects.

By being careful with the information we share with it, and by considering the kind of setup we use (some projects might prefer their own models rather than the ones provided by Bedrock, for example), it can give us many advantages, such as:

- Saving a lot of time that we can use for other tasks (like designing, supervising, and meetings)
- Automating tasks so we can focus on what really brings value to our project

One of our next steps builds on this second point: the use of AI agents will be the focus of our knowledge gathering, which we will share in a later post.