In today's cloud-native environments, detecting anomalies across containerized applications is becoming increasingly complex. This post explores how to implement deep anomaly detection and analysis on workloads running on Amazon Elastic Kubernetes Service (EKS) using Amazon Q and EKS Model Context Protocol (MCP) server capabilities.
The Power of Natural Language
Combining the Natural Language Processing (NLP) of Amazon Q with the EKS MCP server is a significant step forward for cloud-native observability. Amazon Q analyzes log patterns, system metrics, and container behavior, so DevOps teams can read complex EKS cluster events in an intuitive, human-readable form. Cryptic system signals become comprehensible narratives, which speeds up root cause analysis and supports proactive issue detection. Natural language bridges the gap between raw machine data and human understanding, letting teams identify and respond to anomalies through semantic analysis rather than purely statistical methods.
For example, you might type: "Check my bmn-rack-outposts-cluster EKS cluster in us-east-1, inspect all pods in that cluster, report any anomalies detected over the past 3 days".
In this post, we use Amazon Q for command line together with the EKS Model Context Protocol (MCP) server to do this.
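If you run the same check against several clusters, it can help to build the prompt from parameters instead of retyping it. The following is a minimal sketch; the function name `build_anomaly_prompt` is hypothetical and not part of Amazon Q, and the wording simply mirrors the example prompt above.

```python
def build_anomaly_prompt(cluster: str, region: str, days: int) -> str:
    """Build the natural language request shown above for a given cluster.

    Hypothetical helper for illustration; Amazon Q accepts free-form text,
    so the exact wording is only a starting point.
    """
    if days < 1:
        raise ValueError("days must be at least 1")
    unit = "day" if days == 1 else "days"
    return (
        f"Check my {cluster} EKS cluster in {region}, "
        "inspect all pods in that cluster, "
        f"report any anomalies detected over the past {days} {unit}"
    )


# Prints the same request used later in this post.
print(build_anomaly_prompt("bmn-rack-outposts-cluster", "us-east-1", 3))
```

The resulting string can be pasted at the Amazon Q prompt as-is.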
Benefits
- Proactive Issue Detection: Identify potential problems before they affect your applications
- Reduced False Positives: A machine learning-based approach learns from your environment
- Automated Response: Node debugger pods are launched automatically in your cluster, providing an automated way to detect anomalies in your workloads
- Natural Language Queries: Ask your EKS environment in plain language which anomalies occurred in the past X days
Getting Started
To begin using Amazon Q Developer for anomaly detection on EKS:
(The guide below is for Ubuntu; for other operating systems, you can follow this guide.)
- Install Amazon Q for command line
# Download Amazon Q for command line for Ubuntu.
ubuntu@cloudhost-sigitp:~$ wget https://desktop-release.q.us-east-1.amazonaws.com/latest/amazon-q.deb
# Install the package:
ubuntu@cloudhost-sigitp:~$ sudo apt-get install -f
ubuntu@cloudhost-sigitp:~$ sudo dpkg -i amazon-q.deb
# Launch Amazon Q for command line:
ubuntu@cloudhost-sigitp:~$ q
# Authenticate with Builder ID, or with IAM Identity Center using the start URL given to you by your account administrator.
# Quit Amazon Q to configure and load MCP servers:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
> /quit
A more detailed guide is available here.
- Load the required MCP servers for Amazon Q to use as tools from the mcp.json file. Note that I have included the EKS and AWS Diagram MCP servers here, but you can find more MCP servers at this link.
ubuntu@cloudhost-sigitp:~$ cat .aws/amazonq/mcp.json
{
"mcpServers": {
"awslabs.core-mcp-server": {
"command": "uvx",
"args": [
"awslabs.core-mcp-server@latest"
],
"env": {
"FASTMCP_LOG_LEVEL": "ERROR"
}
},
"awslabs.aws-documentation-mcp-server": {
"command": "uvx",
"args": [
"awslabs.aws-documentation-mcp-server@latest"
],
"env": {
"FASTMCP_LOG_LEVEL": "ERROR"
},
"disabled": false,
"autoApprove": []
},
"awslabs.eks-mcp-server": {
"autoApprove": [],
"disabled": false,
"command": "uvx",
"args": [
"awslabs.eks-mcp-server@latest",
"--allow-write"
],
"env": {
"FASTMCP_LOG_LEVEL": "ERROR"
},
"transportType": "stdio"
},
"awslabs.aws-diagram-mcp-server": {
"command": "uvx",
"args": [
"awslabs.aws-diagram-mcp-server"
],
"env": {
"FASTMCP_LOG_LEVEL": "ERROR"
},
"autoApprove": [],
"disabled": false
}
}
}
ubuntu@cloudhost-sigitp:~$
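A malformed mcp.json is an easy way to end up with servers that never load, so a quick sanity check before launching Amazon Q can save time. The sketch below is not an official validator: the function name `check_mcp_config` is hypothetical, and the rules it applies (each server needs a string `command` and a list `args`, and `disabled` servers are flagged) are assumptions inferred from the file shown above.

```python
import json
from pathlib import Path

# Location taken from the `cat` command above; adjust if yours differs.
MCP_CONFIG = Path.home() / ".aws" / "amazonq" / "mcp.json"


def check_mcp_config(config: dict) -> list[str]:
    """Return human-readable problems found in a parsed mcp.json (hypothetical helper)."""
    servers = config.get("mcpServers")
    if not isinstance(servers, dict) or not servers:
        return ["no 'mcpServers' object found"]
    problems = []
    for name, spec in servers.items():
        if not isinstance(spec, dict):
            problems.append(f"{name}: entry should be an object")
            continue
        if not isinstance(spec.get("command"), str):
            problems.append(f"{name}: missing 'command'")
        if not isinstance(spec.get("args"), list):
            problems.append(f"{name}: 'args' should be a list")
        if spec.get("disabled", False):
            problems.append(f"{name}: server is disabled")
    return problems


if __name__ == "__main__":
    if MCP_CONFIG.exists():
        # json.loads raises a ValueError if the file is not valid JSON.
        issues = check_mcp_config(json.loads(MCP_CONFIG.read_text()))
        print("\n".join(issues) if issues else "mcp.json looks well-formed")
    else:
        print(f"{MCP_CONFIG} not found")
```

An empty result only means the file has the expected shape; whether each server actually starts is confirmed in the next step.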
- Load Amazon Q with the MCP servers
ubuntu@cloudhost-sigitp:~$ q
✓ awslabsaws_diagram_mcp_server loaded in 2.46 s
⚠ 1 of 4 mcp servers initialized. Servers still loading:
- awslabseks_mcp_server
- awslabscore_mcp_server
- awslabsaws_documentation_mcp_server
⢠⣶⣶⣦⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣤⣶⣿⣿⣿⣶⣦⡀⠀
⠀⠀⠀⣾⡿⢻⣿⡆⠀⠀⠀⢀⣄⡄⢀⣠⣤⣤⡀⢀⣠⣤⣤⡀⠀⠀⢀⣠⣤⣤⣤⣄⠀⠀⢀⣤⣤⣤⣤⣤⣤⡀⠀⠀⣀⣤⣤⣤⣀⠀⠀⠀⢠⣤⡀⣀⣤⣤⣄⡀⠀⠀⠀⠀⠀⠀⢠⣿⣿⠋⠀⠀⠀⠙⣿⣿⡆
⠀⠀⣼⣿⠇⠀⣿⣿⡄⠀⠀⢸⣿⣿⠛⠉⠻⣿⣿⠛⠉⠛⣿⣿⠀⠀⠘⠛⠉⠉⠻⣿⣧⠀⠈⠛⠛⠛⣻⣿⡿⠀⢀⣾⣿⠛⠉⠻⣿⣷⡀⠀⢸⣿⡟⠛⠉⢻⣿⣷⠀⠀⠀⠀⠀⠀⣼⣿⡏⠀⠀⠀⠀⠀⢸⣿⣿
⠀⢰⣿⣿⣤⣤⣼⣿⣷⠀⠀⢸⣿⣿⠀⠀⠀⣿⣿⠀⠀⠀⣿⣿⠀⠀⢀⣴⣶⣶⣶⣿⣿⠀⠀⠀⣠⣾⡿⠋⠀⠀⢸⣿⣿⠀⠀⠀⣿⣿⡇⠀⢸⣿⡇⠀⠀⢸⣿⣿⠀⠀⠀⠀⠀⠀⢹⣿⣇⠀⠀⠀⠀⠀⢸⣿⡿
⢀⣿⣿⠋⠉⠉⠉⢻⣿⣇⠀⢸⣿⣿⠀⠀⠀⣿⣿⠀⠀⠀⣿⣿⠀⠀⣿⣿⡀⠀⣠⣿⣿⠀⢀⣴⣿⣋⣀⣀⣀⡀⠘⣿⣿⣄⣀⣠⣿⣿⠃⠀⢸⣿⡇⠀⠀⢸⣿⣿⠀⠀⠀⠀⠀⠀⠈⢿⣿⣦⣀⣀⣀⣴⣿⡿⠃
⠚⠛⠋⠀⠀⠀⠀⠘⠛⠛⠀⠘⠛⠛⠀⠀⠀⠛⠛⠀⠀⠀⠛⠛⠀⠀⠙⠻⠿⠟⠋⠛⠛⠀⠘⠛⠛⠛⠛⠛⠛⠃⠀⠈⠛⠿⠿⠿⠛⠁⠀⠀⠘⠛⠃⠀⠀⠘⠛⠛⠀⠀⠀⠀⠀⠀⠀⠀⠙⠛⠿⢿⣿⣿⣋⠀⠀
⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠛⠿⢿⡧
╭─────────────────────────────── Did you know? ────────────────────────────────╮
│ │
│ You can resume the last conversation from your current directory by │
│ launching with q chat --resume │
│ │
╰──────────────────────────────────────────────────────────────────────────────╯
/help all commands • ctrl + j new lines • ctrl + s fuzzy search
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
>
- Make sure that the tools are loaded in Amazon Q
> /tools
Tool Permission
▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔▔Built-in:
- execute_bash * trust read-only commands
- fs_read * trusted
- fs_write * not trusted
- report_issue * trusted
- use_aws * trust read-only commands
awslabsaws_diagram_mcp_server (MCP):
- awslabsaws_diagram_mcp_server___generate_diagram trusted
- awslabsaws_diagram_mcp_server___get_diagram_examples trusted
- awslabsaws_diagram_mcp_server___list_icons trusted
awslabsaws_documentation_mcp_server (MCP):
- awslabsaws_documentation_mcp_server___read_documentation * not trusted
- awslabsaws_documentation_mcp_server___recommend * not trusted
- awslabsaws_documentation_mcp_server___search_documentation * not trusted
awslabscore_mcp_server (MCP):
- awslabscore_mcp_server___prompt_understanding * not trusted
awslabseks_mcp_server (MCP):
- awslabseks_mcp_server___add_inline_policy * not trusted
- awslabseks_mcp_server___apply_yaml * not trusted
- awslabseks_mcp_server___generate_app_manifest * not trusted
- awslabseks_mcp_server___get_cloudwatch_logs * not trusted
- awslabseks_mcp_server___get_cloudwatch_metrics * not trusted
- awslabseks_mcp_server___get_eks_metrics_guidance * not trusted
- awslabseks_mcp_server___get_k8s_events * not trusted
- awslabseks_mcp_server___get_pod_logs * not trusted
- awslabseks_mcp_server___get_policies_for_role * not trusted
- awslabseks_mcp_server___list_api_versions * not trusted
- awslabseks_mcp_server___list_k8s_resources * not trusted
- awslabseks_mcp_server___manage_eks_stacks * not trusted
- awslabseks_mcp_server___manage_k8s_resource * not trusted
- awslabseks_mcp_server___search_eks_troubleshoot_guide * not trusted
Trusted tools will run without confirmation.
* Default settings
💡 Use /tools help to edit permissions.
>
- Ask Amazon Q to report anomalies in the past 3 days on a specific cluster
> Check my bmn-rack-outposts-cluster EKS cluster in us-east-1, inspect all pods in that cluster, report any anomalies detected over the past 3 days
=======================================================================================================================================
● Completed in 1.850s
Based on my inspection of your EKS cluster bmn-rack-outposts-cluster in the us-east-1 region, here's a comprehensive report of the current state and anomalies detected over the past 3 days:
## Cluster Overview
• Cluster Name: bmn-rack-outposts-cluster
• Kubernetes Version: 1.32
• Status: ACTIVE
• Created: May 7, 2025
• Node Count: 3 nodes (all in Ready state)
• Total Pods: 49 pods across multiple namespaces
## Detected Anomalies
1. Network Routing Issues on Node ip-10-0-58-137.ec2.internal:
• Warning events showing missing IP routes for pod IPs
• Most recent warnings (within the last 10 minutes):
• MissingIPRoutes: Pod IP 10.0.62.161 is missing secondary routes
• MissingIPRoutes: Pod IP 10.0.50.83 is missing secondary routes
• This could potentially cause network connectivity issues for pods on this node
2. Node Debugger Pods:
• Multiple node-debugger pods have been created and completed in the last 20 minutes:
• node-debugger-ip-10-0-58-137.ec2.internal-lhrgx
• node-debugger-ip-10-0-58-137.ec2.internal-pjpn2
• node-debugger-ip-10-0-58-137.ec2.internal-rsnn5
• node-debugger-ip-10-0-58-137.ec2.internal-twpht
• These pods suggest someone was actively debugging the node with the network routing issues
3. High Resource Usage Pods:
• Several pods with high CPU requests/limits:
• mlnx-dpdk-1001-1002-node1-rx and mlnx-dpdk-1001-1002-node1-tx: 33 CPU cores each (17% of node capacity)
• mlnx-dpdk-1001-1002-node1-tx2: 24 CPU cores (12% of node capacity)
• mlnx-dpdk-1001-1002-node2-rx and mlnx-dpdk-1001-1002-node2-tx: 33 CPU cores each (17% of node capacity)
• mlnx-dpdk-1001-1002-node3-rx and mlnx-dpdk-1001-1002-node3-tx: 33 CPU cores each (17% of node capacity)
• These DPDK pods also request 32Gi memory each
4. Recent Pod Additions:
• New DPDK pods added in the last few days:
• mlnx-dpdk-1001-1002-node1-rx2 and mlnx-dpdk-1001-1002-node1-tx2: 3 days 4 hours ago
• mlnx-dpdk-1001-1002-node2-rx and mlnx-dpdk-1001-1002-node2-tx: 5 days 16 hours ago
• mlnx-dpdk-1001-1002-node3-rx and mlnx-dpdk-1001-1002-node3-tx: 5 days 3 hours ago
## Recommendations
1. Address Network Routing Issues:
• Investigate the missing secondary routes on node ip-10-0-58-137.ec2.internal
• Check the AWS CNI plugin configuration and logs
• Consider restarting the aws-node pod on the affected node if issues persist
2. Monitor Resource Usage:
• Enable the Metrics API to better monitor resource usage (currently unavailable)
• Watch the high CPU/memory usage DPDK pods to ensure they're not causing resource contention
3. Review Node Debugger Activity:
• Check who initiated the node debugger pods and what issues they were investigating
• Ensure any debugging activities are properly documented
4. Consider Node Maintenance:
• The network routing issues might indicate a need for node maintenance
• Consider cordoning and draining the problematic node if issues persist
All pods are currently in Running or Completed state with no failed pods detected. The main concern is the network routing warnings on one of the nodes, which should be addressed to prevent potential connectivity issues.
!>
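Before acting on a report like this, it is worth cross-checking the warnings against the raw Kubernetes events. The sketch below assumes you have saved the output of `kubectl get events -A -o json` to a file; the function names `load_events` and `recent_warnings` are hypothetical. Note that the API server usually keeps events only for a short time (one hour by default in upstream Kubernetes), so older history would have to come from another source such as CloudWatch logs.

```python
import json
from datetime import datetime, timedelta, timezone


def load_events(path: str) -> dict:
    """Load the JSON written by `kubectl get events -A -o json > events.json`."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def recent_warnings(events: dict, days: int, now: datetime) -> list[tuple[str, str, str]]:
    """Return (timestamp, reason, message) for Warning events within `days` of `now`."""
    cutoff = now - timedelta(days=days)
    found = []
    for item in events.get("items", []):
        if item.get("type") != "Warning":
            continue
        # Some events carry only `eventTime`; skip those with no timestamp at all.
        stamp = item.get("lastTimestamp") or item.get("eventTime")
        if not stamp:
            continue
        when = datetime.fromisoformat(stamp.replace("Z", "+00:00"))
        if when >= cutoff:
            found.append((stamp, item.get("reason", ""), item.get("message", "")))
    return sorted(found)


# Example usage, assuming events.json exists in the current directory:
# for row in recent_warnings(load_events("events.json"), 3, datetime.now(timezone.utc)):
#     print(*row, sep="  ")
```

If reasons such as MissingIPRoutes show up here as well, the finding in the report is backed by the cluster's own event stream.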
Conclusion
Review the summary in the report, then dig deeper into the findings to decide on your next steps.
Remember to validate any implementation in a test environment before deploying to production.
Monitor your implementation and fine-tune the anomaly detection parameters based on your specific use case. Consider integrating with existing alerting systems and implementing automated remediation where appropriate.