Transform Conversational Agentic AIOps for K8s Using CNCF Kagent, K8sGPT & Nova Sonic

#aws #cloud #beginners #productivity

Speaker: Shaoyi Li @ AWS Amarathon 2025

Summary by Amazon Nova

Kubernetes Operations Challenges

Large Volume of Operations Data, Time-Consuming Troubleshooting

Multiple Resource Types, Complex Associations

Complex Switching Between Multiple Tools

Complex and Time-Consuming Troubleshooting in Response to Alerts

Limited automation capabilities
Only 30% of common failures can be repaired automatically; complex scenarios still rely on human decision-making

High Learning Cost and Threshold for K8s

Comparison of operational efficiency
Enterprises that adopt AIOps have a mean time to recovery (MTTR) 90% shorter than under traditional operating models, and their operational costs are reduced by 50%
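To make these percentages concrete, here is a small worked calculation. Only the 90% and 50% reductions come from the talk; the baseline MTTR of 120 minutes and the baseline monthly cost of 10,000 are hypothetical numbers chosen for illustration.

```python
def apply_reduction(baseline: float, reduction_pct: float) -> float:
    """Return what is left of `baseline` after cutting it by `reduction_pct` percent."""
    if not 0 <= reduction_pct <= 100:
        raise ValueError("reduction_pct must be between 0 and 100")
    return baseline * (100 - reduction_pct) / 100


# Hypothetical baselines (not from the talk).
baseline_mttr_minutes = 120.0
baseline_monthly_cost = 10_000.0

# Reductions quoted in the talk: 90% for MTTR, 50% for operational cost.
mttr_with_aiops = apply_reduction(baseline_mttr_minutes, 90)
cost_with_aiops = apply_reduction(baseline_monthly_cost, 50)

print(f"MTTR: {baseline_mttr_minutes:.0f} min -> {mttr_with_aiops:.0f} min")
print(f"Cost: {baseline_monthly_cost:.0f} -> {cost_with_aiops:.0f}")
```

Under these assumed baselines, a two-hour recovery would shrink to 12 minutes and the cost would halve.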

Core Values

Self-Healing Failures: Achieve unattended repair of some failures through AI prediction and automation scripts.
Intelligent Monitoring: Precisely locate the root cause of problems from massive logs and metrics, saying goodbye to needle-in-a-haystack searches.
Free Up Human Resources: Liberate SRE teams from repetitive tasks, focusing on more valuable innovation tasks.

Kagent-Driven AIOps Solution

Kagent: Cloud-Native Agentic AI Framework

An open-source CNCF sandbox project (2025) and a specialized agent framework for K8s cloud-native scenarios.
Builds an intelligent agent system based on K8s by integrating with multiple model platforms (Amazon Bedrock, Anthropic, OpenAI, etc.).

Core advantages:

K8s Cloud-Native: Natively integrated with the K8s ecosystem, naturally blending into existing clusters
Rich Use Cases: Applicable to any AI Agent use case
Rich Tool Integration: Supports custom MCP tools, built-in diverse K8s tools, and pre-configured Agents
Visualization Interface: The UI supports multi-agent workflow orchestration, making it more intuitive and efficient
Comprehensive Observability: Built-in tracing, logging, and monitoring capabilities, supporting integration of common observability tools

Use Cases:

Cloud-native operations automation, multi-cluster management, any multi-agent collaborative system, AIOps practices, etc.

Amazon Nova Sonic: Driving Voice-Based Conversational AIOps

Amazon Nova Sonic is a voice conversation model provided on Amazon Bedrock.
It unifies the traditionally separate speech understanding and speech generation models into a single model. It is capable of natural, human-like voice conversations and supports multiple languages and tones, with low latency and high performance.

Use Cases:

AI Intelligent Customer Service: 24/7 response to customer inquiries
Enterprise Voice Assistant: Integrates knowledge base, intelligent agents, and external tools for customized services
Multilingual Learning Tools: Supports multiple languages
Multi-industry Applications: Fintech, healthcare, smart home, etc.

Core Value in Combining with Operations Scenarios:

Turns traditionally complex manual troubleshooting and repair into voice conversations, making the most of intelligent AIOps and reducing MTTR

K8sGPT: Open-Source K8s Failure Diagnosis Expert

CNCF open-source sandbox project, providing AI-driven observability and automated operations for Kubernetes maintenance
Supports CLI and Operator dual modes, enabling instant analysis and continuous monitoring
Scans cluster resources, events, logs, and metrics, and uses AI models on Amazon Bedrock to generate textual insights and explanations. It can also be integrated with Kiro through MCP for natural-language observation and maintenance of clusters
Addresses the reactive nature of traditional operations by enabling proactive, AI-driven operations
Supports diverse custom analyzers and observability tools, and can be integrated with Prometheus, Alertmanager, Grafana, etc.
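The CLI mode described above can be scripted. The following is a minimal sketch, assuming the `k8sgpt` CLI is installed, an AI backend is already configured, and the JSON report contains a `results` list whose entries carry `kind`, `name`, and an `error` list with `Text` fields. That shape, the helper names, and the sample payload (including the pod name `default/demo-app`) are assumptions for illustration, so check them against the output of your K8sGPT version.

```python
import json
import subprocess


def run_k8sgpt_analyze() -> dict:
    """Run `k8sgpt analyze --explain --output json` and parse the report.

    Needs the k8sgpt CLI on PATH, a configured AI backend and a reachable cluster.
    """
    completed = subprocess.run(
        ["k8sgpt", "analyze", "--explain", "--output", "json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(completed.stdout)


def summarize(report: dict) -> list[str]:
    """Return one line per affected resource (report shape is an assumption)."""
    lines = []
    for result in report.get("results") or []:
        errors = "; ".join(e.get("Text", "") for e in result.get("error") or [])
        lines.append(f'{result.get("kind", "?")} {result.get("name", "?")}: {errors}')
    return lines


# Illustrative payload, not real tool output.
sample_report = {
    "status": "ProblemDetected",
    "problems": 1,
    "results": [
        {
            "kind": "Pod",
            "name": "default/demo-app",
            "error": [{"Text": "container is in CrashLoopBackOff"}],
            "details": "",
        }
    ],
}

for line in summarize(sample_report):
    print(line)
```

Against a live cluster, `summarize(run_k8sgpt_analyze())` would produce the same kind of one-line summaries, which an agent could then pass on as text or speech.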

Demo Cluster:

EKS managed cluster deployed on Amazon Web Services, cluster name: eks-cluster
Cluster resource overview: An ArgoCD Application deploys K8s resources read from GitHub into the cluster. They include 2 pods, one Service, and one Deployment
Pod issue: The memory limit is set to 200Mi, but the process needs 205Mi, so the container is killed for exceeding its limit and the pod ends up in CrashLoopBackOff
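The mismatch in the demo is small but decisive: a container that uses more memory than its limit is OOM-killed, and the repeated restarts surface as CrashLoopBackOff. The sketch below compares the two quantities from the demo. The parser is a simplified helper written for this example; it handles only the binary suffixes (Ki, Mi, Gi, Ti) and plain byte counts, not the full Kubernetes quantity syntax.

```python
# Binary suffixes used by Kubernetes memory quantities (subset).
_BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '200Mi' to bytes (simplified parser)."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: already a byte count


def exceeds_limit(usage: str, limit: str) -> bool:
    """True when the observed usage is larger than the configured limit."""
    return parse_memory(usage) > parse_memory(limit)


limit, usage = "200Mi", "205Mi"  # values from the demo
over_by = parse_memory(usage) - parse_memory(limit)
print(exceeds_limit(usage, limit), over_by)  # prints: True 5242880
```

The process overshoots the limit by 5 MiB (5,242,880 bytes), which is enough for the container to be killed on every start.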

Experimental repair scenario:

K8sGPT identifies the Pod issue and provides explanations and repair suggestions.
Finally, the memory limit parameter of the Helm chart in the ArgoCD application is adjusted, which triggers ArgoCD to update the pod configuration so that the pod can start successfully.
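The repair step amounts to changing one value in the chart's configuration. Here is a minimal sketch of that change, assuming the chart exposes the limit under `resources.limits.memory` (a common Helm convention, but not confirmed by the talk) and adding a hypothetical 25% headroom over the observed usage. The talk does not state the new limit that was actually used. Committing the change to Git and letting ArgoCD sync it are outside this sketch.

```python
import copy
import math


def raise_memory_limit(values: dict, observed_mi: int, headroom: float = 0.25) -> dict:
    """Return a copy of the Helm values with a memory limit that covers
    the observed usage plus some headroom. The input is left untouched."""
    updated = copy.deepcopy(values)
    limits = updated.setdefault("resources", {}).setdefault("limits", {})
    new_limit_mi = math.ceil(observed_mi * (1 + headroom))
    limits["memory"] = f"{new_limit_mi}Mi"
    return updated


# Hypothetical values layout; 200Mi and 205Mi come from the demo.
values = {"resources": {"limits": {"memory": "200Mi"}}}
patched = raise_memory_limit(values, observed_mi=205)

print(values["resources"]["limits"]["memory"])
print(patched["resources"]["limits"]["memory"])
```

Returning a patched copy instead of mutating the input keeps the old and new values side by side, which makes it easy to show a reviewer the exact difference before anything is committed.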

Summary

Learn how to build an intelligent K8s operations solution from scratch, based on Amazon Bedrock AgentCore and powered by a multi-agent AI collaboration system.
With a single sentence, you can complete the entire process from problem identification and diagnosis to fully automatic repair. This greatly simplifies the analysis of large volumes of operations data and manual repair work, and reduces the risk of manual errors.
Compared with K8sGPT's own limited automatic repair capabilities, this solution adds more business-specific automatic repair functions, making it more flexible and scalable.
For automated repair scenarios, we introduce HITL (Human-in-the-Loop) processes to ensure the reliability and controllability of automatic repairs.
Because repairs go through ArgoCD's native capabilities, every repair operation is auditable and can be rolled back, reducing maintenance risks.
Operations engineers can drive intelligent AIOps operations directly by voice, significantly reducing MTTI and MTTR.
Future plans: Integrate CloudWatch Anomaly Detection (AD) and DevOps Guru to predict potential K8s cluster failures based on historical data analysis.
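The human-in-the-loop (HITL) step mentioned in the summary can be pictured as an approval gate in front of every repair. The sketch below is one way to express that idea, not the speaker's implementation: `RepairProposal`, `apply_with_approval`, and the pod name `default/demo-app` are hypothetical names introduced for illustration, and the approver is a stand-in for a real prompt to an operator.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RepairProposal:
    resource: str   # what would be changed
    diagnosis: str  # why the change is proposed
    change: str     # what the repair would do


def apply_with_approval(
    proposal: RepairProposal,
    approve: Callable[[RepairProposal], bool],
    apply: Callable[[RepairProposal], None],
) -> bool:
    """Run the repair only if the reviewer approves it; return whether it ran."""
    if not approve(proposal):
        return False
    apply(proposal)
    return True


applied = []  # records repairs that were actually carried out
proposal = RepairProposal(
    resource="Pod default/demo-app",
    diagnosis="memory limit 200Mi is below the 205Mi the process needs",
    change="raise resources.limits.memory in the Helm values",
)

# Stand-in approvers: a real one would ask an operator, for example by voice.
print(apply_with_approval(proposal, approve=lambda p: False, apply=applied.append))
print(apply_with_approval(proposal, approve=lambda p: True, apply=applied.append))
print(len(applied))
```

The rejected proposal leaves no trace in `applied`, while the approved one is recorded once, so nothing reaches the cluster without an explicit yes.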

Team: