Transform Conversational Agentic AIOps for K8s Using CNCF Kagent, K8sGPT & Nova Sonic

Speaker: Shaoyi Li @ AWS Amarathon 2025

Summary by Amazon Nova



Kubernetes Operations Challenges

Large Volume of Operations Data, Time-Consuming Troubleshooting

  • Average MTTR exceeds 4 hours, with manual analysis accounting for 65% of that time

  • Analysis data volume can reach TB levels

Multiple Resource Types, Complex Associations

  • Large volume of cluster objects, events, and log data

Complex Switching Between Multiple Tools

  • SREs switch between 8+ tools daily, with high context switching costs

Complex and Time-Consuming Troubleshooting in Response to Alerts

  • Limited automation capabilities

  • Only 30% of common failures can be automatically repaired, with complex scenarios relying on human decision-making

High Learning Cost and Threshold for K8s

  • Operational efficiency comparison: enterprises adopting AIOps have an average fault recovery time (MTTR) 90% shorter than under traditional operating models, with operational costs reduced by 50%

Core Values

  1. Self-Healing Failures: Achieve unattended repair of some failures through AI prediction and automation scripts.

  2. Intelligent Monitoring: Precisely locate the root cause of problems from massive logs and metrics, saying goodbye to needle-in-a-haystack searches.

  3. Free Up Human Resources: Liberate SRE teams from repetitive tasks so they can focus on higher-value innovation work.



Kagent-Driven AIOps Solution

Kagent: Cloud-Native Agentic AI Framework

  • An open-source CNCF Sandbox project (2025) and a specialized agent framework for K8s cloud-native scenarios.

  • Builds agent systems that run on K8s and integrate with multiple model platforms (Amazon Bedrock, Anthropic, OpenAI, etc.).

Core advantages:

  • K8s Cloud-Native: Natively integrated with the K8s ecosystem, naturally blending into existing clusters

  • Rich Use Cases: Applicable to any AI Agent use case

  • Rich Tool Integration: Supports custom MCP tools, built-in diverse K8s tools, and pre-configured Agents

  • Visual Interface: The UI supports multi-agent workflow orchestration, making it more intuitive and efficient

  • Comprehensive Observability: Built-in tracing, logging, and monitoring capabilities, with support for integrating common observability tools

Use Cases: 

  • Cloud-native operations automation, multi-cluster management, any multi-agent collaborative system, AIOps practices, etc.


Amazon Nova Sonic: Driving Voice-Based Conversational AIOps

  • Amazon Nova Sonic is a voice conversation model provided on Amazon Bedrock.

  • It unifies the traditionally separate speech understanding and speech generation models into a single model, enabling natural, human-like voice conversations with support for multiple languages and tones, low latency, and high performance.

Use Cases:

  • AI Intelligent Customer Service: 24/7 response to customer inquiries

  • Enterprise Voice Assistant: Integrates knowledge base, intelligent agents, and external tools for customized services

  • Multilingual Learning Tools: Supports multiple languages

  • Multi-industry Applications: Fintech, healthcare, smart home, etc.

Core Value in Combining with Operations Scenarios:

  • Reduces complex manual troubleshooting and repair to a voice conversation, making fuller use of AIOps and reducing MTTR


K8sGPT: Open-Source K8s Failure Diagnosis Expert

  • CNCF open-source sandbox project, providing AI-driven observability and automated operations for Kubernetes maintenance

  • Supports CLI and Operator dual modes, enabling instant analysis and continuous monitoring

  • Scans cluster resources, events, logs, and metrics, and uses AI models on Amazon Bedrock to generate textual insights and explanations. It can also be integrated with Kiro's MCP functions to observe and maintain clusters in natural language

  • Replaces the reactive response model of traditional operations with proactive, AI-driven operations

  • Supports diverse custom analyzers and observability tools, integratable with Prometheus, Alertmanager, Grafana, etc.
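
As a rough illustration of the CLI mode described above, the sketch below runs one analysis and flattens the report into a list of findings. It is not the speaker's implementation. The flags (`analyze`, `--explain`, `--filter`, `--namespace`, `--output json`) are the K8sGPT CLI flags as I understand them and should be checked against `k8sgpt analyze --help`. The JSON field names (`results`, `kind`, `name`, `error`, `Text`, `details`) and the `SAMPLE` report are assumptions made for illustration, as are the names `Finding`, `parse_k8sgpt_report`, and `run_k8sgpt`.

```python
import json
import subprocess
from dataclasses import dataclass
from typing import List


@dataclass
class Finding:
    kind: str
    name: str
    errors: List[str]
    details: str


def parse_k8sgpt_report(raw: str) -> List[Finding]:
    """Flatten a K8sGPT JSON report (assumed shape) into a list of findings."""
    report = json.loads(raw)
    findings = []
    # "results" may be null or missing when no problems are detected.
    for item in report.get("results") or []:
        errors = [e.get("Text", "") for e in item.get("error") or []]
        findings.append(
            Finding(item.get("kind", ""), item.get("name", ""), errors, item.get("details", ""))
        )
    return findings


def run_k8sgpt(namespace: str) -> List[Finding]:
    """Run the CLI once. Requires k8sgpt on PATH and a configured AI backend."""
    cmd = ["k8sgpt", "analyze", "--explain", "--filter", "Pod",
           "--namespace", namespace, "--output", "json"]
    completed = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return parse_k8sgpt_report(completed.stdout)


# Hypothetical report, hand-written for illustration only.
SAMPLE = """
{"status": "ProblemDetected", "problems": 1,
 "results": [{"kind": "Pod", "name": "default/memory-demo",
              "error": [{"Text": "back-off restarting failed container"}],
              "details": "The container exceeds its memory limit."}]}
"""

for finding in parse_k8sgpt_report(SAMPLE):
    print(finding.kind, finding.name, finding.errors)
```

Keeping the parser separate from the subprocess call means the parsing logic can be exercised without a cluster, which is useful when the findings are later handed to an agent.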



Demo Cluster:

  • Managed Amazon EKS cluster on Amazon Web Services, cluster name: eks-cluster

  • Cluster resource overview: An ArgoCD Application deploys K8s resources read from GitHub into the cluster, including 2 Pods, one Service, and one Deployment

  • Pod issue: The memory limit is set to 200Mi, but the container runs a process that needs 205Mi, so it is repeatedly killed for exceeding the limit and the Pod ends up in CrashLoopBackOff
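
The arithmetic behind the Pod issue can be checked with a few lines of Python. This is a minimal sketch that handles only a subset of the Kubernetes quantity syntax (the common binary and decimal suffixes, not milli-units or exponents). The helper names `parse_memory` and `exceeds_limit` are made up for this example.

```python
# Binary and decimal suffixes used by Kubernetes memory quantities (subset).
_SUFFIXES = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4,
    "k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4,
}


def parse_memory(quantity: str) -> int:
    """Convert a quantity such as '200Mi' into bytes."""
    # Try two-letter suffixes first so 'Mi' is not mistaken for 'M'.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * _SUFFIXES[suffix])
    return int(quantity)  # no suffix: plain byte count


def exceeds_limit(usage: str, limit: str) -> bool:
    """True when the requested usage is larger than the configured limit."""
    return parse_memory(usage) > parse_memory(limit)


overshoot = parse_memory("205Mi") - parse_memory("200Mi")
print(exceeds_limit("205Mi", "200Mi"), overshoot)  # True 5242880
```

The process overshoots the limit by only 5 MiB (5,242,880 bytes), which is still enough for the container to be killed each time it starts.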

Experimental repair scenario:

  • K8sGPT identifies the Pod issue and provides explanations and repair suggestions.

  • Finally, the memory limit parameter of the Helm Chart in the ArgoCD application is adjusted, which triggers ArgoCD to update the pod configuration so that the pod starts successfully.
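
The repair step amounts to changing one value in the Helm Chart's values and letting ArgoCD sync it. The sketch below computes such a change on an in-memory dictionary. The values layout (`resources.limits.memory`), the 25% headroom, and the function name `propose_memory_limit` are all assumptions for illustration; the talk does not state which new limit was chosen, and the demo's chart may be structured differently.

```python
import copy
import math

MI = 1024 ** 2


def mi_to_bytes(quantity: str) -> int:
    """Parse a quantity expressed in Mi, the only unit this sketch supports."""
    if not quantity.endswith("Mi"):
        raise ValueError(f"expected a quantity in Mi, got {quantity!r}")
    return int(quantity[:-2]) * MI


def propose_memory_limit(values: dict, observed: str, headroom: float = 0.25) -> dict:
    """Return a copy of `values` whose memory limit covers `observed` plus headroom."""
    needed_mi = math.ceil(mi_to_bytes(observed) * (1 + headroom) / MI)
    patched = copy.deepcopy(values)  # never mutate the caller's values
    limits = patched.setdefault("resources", {}).setdefault("limits", {})
    current_mi = mi_to_bytes(limits.get("memory", "0Mi")) // MI
    # Only ever raise the limit; keep the current one if it is already enough.
    limits["memory"] = f"{max(current_mi, needed_mi)}Mi"
    return patched


values = {"resources": {"limits": {"memory": "200Mi"}}}  # hypothetical chart values
patched = propose_memory_limit(values, "205Mi")
print(patched["resources"]["limits"]["memory"])  # 257Mi
```

In a GitOps setup the patched values would be committed to the Git repository that ArgoCD watches rather than applied to the cluster directly.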



Summary

  • The session shows how to build an intelligent K8s operations solution from scratch on Amazon Bedrock AgentCore, powered by a collaborating multi-agent AI system.

  • With a single sentence, an operator can trigger the entire process, from problem identification and diagnosis to fully automatic repair. This greatly simplifies the analysis of large volumes of operations data, replaces manual repair steps, and reduces the risk of human error.

  • Compared with K8sGPT's own limited automatic repair capabilities, this solution adds business-specific automatic repair functions, making it more flexible and extensible.

  • For automated repair scenarios, we introduce HITL (Human-in-the-Loop) processes to ensure the reliability and controllability of automatic repairs.

  • Because repairs go through ArgoCD's native capabilities, every repair operation is auditable and can be rolled back, reducing maintenance risk.

  • Operations engineers can drive AIOps directly by voice, significantly reducing MTTI and MTTR.

  • Future plans: Integrate CloudWatch Anomaly Detection (AD) and DevOps Guru to predict potential K8s cluster failures based on historical data analysis.
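
The human-in-the-loop step mentioned in the summary can be pictured as a gate that sits between an agent's proposed repair and its execution. The sketch below is one possible shape for such a gate, not the design used in the demo. `RepairProposal`, `HitlGate`, the `approver` callback, and the target name `deployment/demo-app` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RepairProposal:
    target: str                  # resource the repair applies to
    description: str             # human-readable summary shown to the approver
    apply: Callable[[], None]    # action to run once approved


@dataclass
class HitlGate:
    approver: Callable[[RepairProposal], bool]  # asks a human, returns the decision
    audit_log: List[str] = field(default_factory=list)

    def submit(self, proposal: RepairProposal) -> bool:
        """Record the decision and run the repair only if it was approved."""
        approved = self.approver(proposal)
        verdict = "APPROVED" if approved else "REJECTED"
        self.audit_log.append(f"{verdict} {proposal.target}: {proposal.description}")
        if approved:
            proposal.apply()
        return approved


applied = []
proposal = RepairProposal(
    target="deployment/demo-app",
    description="raise memory limit from 200Mi to 257Mi",
    apply=lambda: applied.append("memory-limit"),
)
gate = HitlGate(approver=lambda p: False)  # stand-in for a real approval prompt
gate.submit(proposal)
print(applied, gate.audit_log)
```

In a voice-driven setup the `approver` callback would be where the assistant reads the proposal aloud and waits for a spoken confirmation.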



Team:

AWS FSI Customer Acceleration Hong Kong

AWS Amarathon Fan Club

AWS Community Builder Hong Kong
