Speaker: Shaoyi Li @ AWS Amarathon 2025
Summary by Amazon Nova
Kubernetes Operations Challenges
Large Volume of Operations Data, Time-Consuming Troubleshooting
Average mean time to recovery (MTTR) exceeds 4 hours, with manual analysis accounting for 65% of that time
Analysis data volume can reach TB levels
Multiple Resource Types, Complex Associations
- Large volume of cluster objects, events, and log data
Complex Switching Between Multiple Tools
- SREs switch between 8+ tools daily, with high context switching costs
Complex and Time-Consuming Troubleshooting in Response to Alerts
Limited automation capabilities
Only 30% of common failures can be automatically repaired, with complex scenarios relying on human decision-making
High Learning Cost and Threshold for K8s
Comparison of operational efficiency
Enterprises adopting AIOps have an average fault recovery time (MTTR) 90% shorter than those using traditional operations models, with operational costs reduced by 50%
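To make these figures concrete, the short calculation below applies the quoted percentages to a 4-hour incident. Treating 4 hours as the exact baseline is an assumption for illustration only: the text says the average MTTR exceeds 4 hours, and the two statistics are not necessarily from the same measurement.

```python
# Illustrative arithmetic only. The 4-hour baseline is an assumed round figure;
# the summary states only that average MTTR "exceeds 4 hours".
BASELINE_MTTR_HOURS = 4.0
MANUAL_ANALYSIS_SHARE = 0.65   # share of MTTR spent on manual analysis
AIOPS_MTTR_REDUCTION = 0.90    # quoted MTTR reduction with AIOps


def manual_analysis_hours(mttr_hours: float, share: float) -> float:
    """Hours of one incident that go to manual analysis."""
    return mttr_hours * share


def reduced_mttr_minutes(mttr_hours: float, reduction: float) -> float:
    """MTTR in minutes after applying a fractional reduction."""
    return mttr_hours * (1.0 - reduction) * 60.0


if __name__ == "__main__":
    manual = manual_analysis_hours(BASELINE_MTTR_HOURS, MANUAL_ANALYSIS_SHARE)
    reduced = reduced_mttr_minutes(BASELINE_MTTR_HOURS, AIOPS_MTTR_REDUCTION)
    print(f"manual analysis per incident: {manual:.1f} h")
    print(f"MTTR after a 90% reduction: {reduced:.0f} min")
```

Under that assumption, roughly 2.6 hours of each incident go to manual analysis, and a 90% reduction brings a 4-hour recovery down to about 24 minutes.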
Core Values
Self-Healing Failures: Achieve unattended repair of some failures through AI prediction and automation scripts.
Intelligent Monitoring: Precisely locate the root cause of problems from massive logs and metrics, saying goodbye to needle-in-a-haystack searches.
Free Up Human Resources: Liberate SRE teams from repetitive tasks, focusing on more valuable innovation tasks.
Kagent-Driven AIOps Solution
Kagent: Cloud-Native Agentic AI Framework
An open-source CNCF Sandbox project (2025): an agent framework built specifically for K8s cloud-native scenarios.
Builds an intelligent agent system based on K8s by integrating with multiple model platforms (Amazon Bedrock, Anthropic, OpenAI, etc.).
Core advantages:
K8s Cloud-Native: Natively integrated with the K8s ecosystem, naturally blending into existing clusters
Rich Use Cases: Applicable to any AI Agent use case
Rich Tool Integration: Supports custom MCP tools, built-in diverse K8s tools, and pre-configured Agents
Visualization Interface: The UI supports multi-agent workflow orchestration, making it more intuitive and efficient
Comprehensive Observability: Built-in tracing, logging, and monitoring capabilities, supporting integration of common observability tools
Use Cases:
- Cloud-native operations automation, multi-cluster management, any multi-agent collaborative system, AIOps practices, etc.
Amazon Nova Sonic: Driving Voice-Based Conversational AIOps
Amazon Nova Sonic is a voice conversation model provided on Amazon Bedrock.
It unifies the traditionally separate speech understanding and speech generation models into one, enabling natural, human-like voice conversations with support for multiple languages and tones, low latency, and high performance.
Use Cases:
AI Intelligent Customer Service: 24/7 response to customer inquiries
Enterprise Voice Assistant: Integrates knowledge base, intelligent agents, and external tools for customized services
Multilingual Learning Tools: Supports multiple languages
Multi-industry Applications: Fintech, healthcare, smart home, etc.
Core Value in Combining with Operations Scenarios:
- Reduces traditionally complex manual troubleshooting and repair to voice conversations, making fuller use of AIOps and reducing MTTR
K8sGPT: Open-Source K8s Failure Diagnosis Expert
CNCF open-source sandbox project, providing AI-driven observability and automated operations for Kubernetes maintenance
Supports CLI and Operator dual modes, enabling instant analysis and continuous monitoring
Scans cluster resources, events, logs, and metrics, and uses AI models on Amazon Bedrock to generate textual insights and explanations. It can also be integrated with Kiro's MCP support for natural-language observation and maintenance of clusters
Addresses the reactive nature of traditional operations by enabling proactive, AI-driven operations
Supports diverse custom analyzers and observability tools, integratable with Prometheus, Alertmanager, Grafana, etc.
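As a rough illustration of how an agent might consume K8sGPT findings, the sketch below parses a JSON report into simple records. The JSON shape (`results`, `kind`, `name`, `error[].Text`, `details`) is an assumption modeled on the output of `k8sgpt analyze --explain --output json` and may differ between versions; the sample report and the pod name `default/demo-app` are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass
class Finding:
    kind: str            # resource kind, e.g. "Pod"
    name: str            # namespaced name, e.g. "default/demo-app"
    messages: list[str]  # raw error texts reported for the resource
    details: str         # AI-generated explanation, if any


def parse_findings(raw: str) -> list[Finding]:
    """Turn a JSON report into Finding records.

    The field names used here are assumed, not taken from the talk.
    """
    report = json.loads(raw)
    findings = []
    for item in report.get("results") or []:
        messages = [e.get("Text", "") for e in item.get("error") or []]
        findings.append(
            Finding(
                kind=item.get("kind", ""),
                name=item.get("name", ""),
                messages=messages,
                details=item.get("details", ""),
            )
        )
    return findings


# Hypothetical sample report, written by hand for this example.
SAMPLE_REPORT = """
{
  "status": "ProblemDetected",
  "problems": 1,
  "results": [
    {
      "kind": "Pod",
      "name": "default/demo-app",
      "error": [{"Text": "the last termination reason is OOMKilled"}],
      "details": "Raise the container memory limit."
    }
  ]
}
"""

if __name__ == "__main__":
    for finding in parse_findings(SAMPLE_REPORT):
        print(finding.kind, finding.name, finding.messages, finding.details)
```

Keeping the parser tolerant of missing keys means a report with no problems (an empty or null `results`) simply yields an empty list.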
Demo Cluster:
EKS managed cluster deployed on Amazon Web Services, cluster name: eks-cluster
Cluster resource overview: An ArgoCD application reads K8s resources from GitHub and deploys them to the cluster. The resources include 2 pods, one Service, and one Deployment
Pod issue: The memory limit is set to 200Mi, but the container runs a process that needs 205Mi, so it is OOM-killed on every start and the pod enters CrashLoopBackOff
Experimental repair scenario:
K8sGPT identifies the Pod issue and provides explanations and repair suggestions.
Finally, the memory limit parameter of the Helm chart in the ArgoCD application is raised. ArgoCD then applies the change to the pod configuration, and the pod starts successfully.
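The check behind this scenario can be sketched in a few lines: compare the memory the process needs with the configured limit, and propose a higher limit if it does not fit. This is a minimal sketch, not the logic used in the demo; it handles only the binary suffixes Ki, Mi, and Gi, and the 25% headroom is an assumed value.

```python
import math

# Binary suffixes only; other Kubernetes quantity forms are out of scope here.
_BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '200Mi' to bytes."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: plain byte count


def exceeds_limit(usage: str, limit: str) -> bool:
    """True if the process needs more memory than the limit allows."""
    return parse_memory(usage) > parse_memory(limit)


def suggest_limit(usage: str, headroom: float = 0.25) -> str:
    """Propose a limit in Mi: usage plus headroom, rounded up.

    The default headroom of 25% is an assumption for illustration.
    """
    needed_bytes = parse_memory(usage) * (1 + headroom)
    return f"{math.ceil(needed_bytes / 1024**2)}Mi"


if __name__ == "__main__":
    usage, limit = "205Mi", "200Mi"  # values from the demo scenario
    if exceeds_limit(usage, limit):
        print(f"limit {limit} is too low for {usage}; try {suggest_limit(usage)}")
```

In the demo, the new value would be written to the Helm chart parameter tracked by ArgoCD rather than applied to the pod directly, so the change stays in the audited deployment path.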
Summary
The talk shows how to build a K8s intelligent operations solution from scratch on Amazon Bedrock AgentCore, powered by a multi-agent AI collaboration system.
With a single sentence, an operator can trigger the entire process from problem identification and diagnosis to fully automatic repair. This greatly simplifies the analysis of large volumes of operations data and manual repair work, and reduces the risk of manual error.
Compared to K8sGPT's original limited automatic repair capabilities, this solution adds more business-based automatic repair functions, making it more flexible and scalable.
For automated repair scenarios, we introduce HITL (Human-in-the-Loop) processes to ensure the reliability and controllability of automatic repairs.
Leveraging ArgoCD's native capabilities, all repair operations are auditable and can be rolled back, reducing maintenance risks.
Operations engineers can drive AIOps directly by voice, significantly reducing MTTI and MTTR.
Future plans: Integrate CloudWatch Anomaly Detection (AD) and DevOps Guru to predict potential K8s cluster failures based on historical data analysis.
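The human-in-the-loop step mentioned in the summary can be illustrated with a small approval gate: a proposed repair is applied only after an approver accepts it. This is a minimal sketch of the general pattern, not the implementation from the talk; `RepairProposal`, `run_with_approval`, and `console_approver` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RepairProposal:
    target: str       # resource the repair applies to
    description: str  # human-readable summary of the change


def run_with_approval(
    proposal: RepairProposal,
    approve: Callable[[RepairProposal], bool],
    apply: Callable[[RepairProposal], None],
) -> bool:
    """Apply the repair only if the approver accepts it.

    Returns True when the repair was applied, False when it was rejected.
    """
    if not approve(proposal):
        return False
    apply(proposal)
    return True


def console_approver(proposal: RepairProposal) -> bool:
    """Ask a human on the terminal; anything other than 'y' rejects."""
    answer = input(f"Apply '{proposal.description}' to {proposal.target}? [y/N] ")
    return answer.strip().lower() == "y"
```

Passing the approver and the apply step as callables keeps the gate independent of how approval is collected (terminal, chat, or voice) and of how the change is delivered (for example, a commit that ArgoCD then syncs).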