Gunnar Grosch for AWS

DEV Track Spotlight: Supercharge DevOps with AI-driven observability (DEV304)

Modern observability has evolved far beyond traditional dashboards and reactive alerts. In DEV304, Elizabeth Fuentes Leone (AWS Developer Advocate, GenAI) and Rossana Suarez (AWS Container Hero and Engineer at Naranjax) demonstrated how Generative AI is transforming DevOps and SRE practices through intelligent, proactive observability systems.

Elizabeth opened with Werner Vogels' most famous quote: "Everything fails all the time." The question is not whether something will fail, but when, and how fast we can detect and respond when it does. The key is anticipation, not reaction.

Watch the Full Session:

The Limits of Traditional Observability

Traditional observability systems face critical challenges that impact both business outcomes and engineering teams:

Reactive, Not Proactive - Dashboards alert you two minutes after users are already complaining on social media. By then, the damage is done.

Alert Fatigue - About 70% of DevOps engineers experience alert fatigue. When 90% of alerts in a five-minute window are noise, teams struggle to identify what actually matters.

Siloed Signals - Multiple dashboards across different tools with zero correlation between them. Teams drown in data but lack actionable insights.

Slow Decision Making - Incident rooms and Slack debates consume 40% of engineering time during incidents. Meanwhile, customers wait.

The real impact goes beyond the $50,000 to $500,000 per hour cost of downtime. Teams lose customer trust, engineers burn out from alert fatigue, and innovation stalls while everyone fights fires.

As Rossana put it: "We've all been there, right? Friday night, 11:00 PM. Someone said the magic word: 'It's quite a small change,' and someone just touched production."

AI-Powered Observability: From Reactive Chaos to Proactive Intelligence

The solution lies in AI-powered observability integrated directly into CI/CD pipelines. Instead of waiting for production failures, AI analyzes systems before, during, and after deployment.

The Results Are Dramatic:

  • Alert reduction: From 200 alerts per deploy down to just 5
  • MTTR improvement: From 2 hours to 15 minutes (8x faster)
  • Proactive prevention: AI stops incidents before they impact users

Three Critical Moments for AI Intervention

Pull Request Analysis - AI provides advice and shows risks before code merges. No blocking at this stage, just intelligent guidance to improve code quality and identify potential issues.

Pre-Deployment Health Check - This is the critical safety gate. AI can approve or block deployments based on system health. If the system looks unstable, AI stops the deployment automatically, protecting production.

Post-Deployment Validation - After deployment, AI checks everything again, generates reports, and alerts teams if something goes wrong.
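
For orientation, the three moments can be thought of as three small functions in a pipeline. The sketch below is illustrative only; the function names and thresholds are assumptions, not the session's actual code:

```python
# Illustrative sketch of the three intervention points; names and logic are
# assumptions for clarity, not the session's implementation.

def pull_request_analysis(diff: str) -> list[str]:
    # Advisory only: surface risks as review comments, never block the merge.
    return [f"Review {len(diff.splitlines())} changed lines for risky patterns"]

def pre_deployment_gate(health_score: int, threshold: int = 70) -> bool:
    # The safety gate: block the rollout when the cluster looks unstable.
    return health_score >= threshold

def post_deployment_validation(health_score: int) -> str:
    # Re-check after rollout, then report and alert if something regressed.
    return "healthy" if health_score >= 90 else "degraded: alert the team"

if pre_deployment_gate(health_score=68):
    print("deploying")
else:
    print("deployment blocked")  # the demo's blocked run scored 68/100
```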

Elizabeth explained the decision flow: "We have a prompt with a specialization, like a DevOps engineer, to understand everything that is happening there."

The Health Score System

The AI agent generates a health score from 0 to 100 based on comprehensive analysis:

  • 90-100: Excellent - Deploy with confidence
  • 75-89: Good - Approved with monitoring
  • 70-74: Caution - Approved with warnings and increased monitoring
  • Below 70: Critical - Deployment blocked

The AI does not just look at CPU metrics. It analyzes CPU usage, memory, restarts, error logs, CrashLoopBackOff patterns, and ImagePullBackOff failures. The agent catches the classic deployment failures every DevOps engineer knows, but catches them before they impact users.
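
As a rough sketch, the score-to-verdict mapping could look like this in Python. The bands come from the list above; the function itself is hypothetical:

```python
def verdict(score: int) -> str:
    """Map the agent's 0-100 health score to a deployment decision."""
    if score >= 90:
        return "Excellent: deploy with confidence"
    if score >= 75:
        return "Good: approved with monitoring"
    if score >= 70:
        return "Caution: approved with warnings and increased monitoring"
    return "Critical: deployment blocked"

print(verdict(68))  # "Critical: deployment blocked", matching Demo 2
```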

Real Production Implementation

Rossana and Elizabeth demonstrated their open source solution with two live demos, showing AI-powered observability in action with both local environments and GitHub Actions.

Architecture Components:

  • AI Models: Amazon Bedrock (supporting multiple models, including Claude) or OpenAI (easily switchable)
  • Agent Framework: Strands Agents SDK managing the agentic loop
  • Kubernetes: Running Prometheus and Grafana for metrics collection
  • Notifications: Telegram integration for real-time alerts
  • CI/CD: GitHub Actions triggering the entire flow

The system captures comprehensive metrics including CPU, memory, restarts, and pod status. Every push to GitHub triggers the AI analysis flow automatically.
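A minimal sketch of how these pieces could be wired together with the Strands Agents SDK and Amazon Bedrock. The tool body, system prompt, and model ID below are placeholders; see the repository for the real wiring:

```python
from strands import Agent, tool
from strands.models import BedrockModel

@tool
def get_cluster_metrics() -> dict:
    """Fetch current CPU, memory, restart, and pod-status metrics."""
    # Placeholder: the real implementation queries Prometheus on the EKS cluster.
    return {"cpu": 0.42, "memory": 0.61, "restarts": 0, "pods_ready": "10/10"}

# Any Bedrock model ID your account can invoke works here.
model = BedrockModel(model_id="anthropic.claude-3-5-sonnet-20240620-v1:0")

agent = Agent(
    model=model,
    tools=[get_cluster_metrics],
    system_prompt="You are a senior DevOps engineer. Analyze cluster health "
                  "and return a 0-100 health score with reasoning.",
)

result = agent("Is it safe to deploy right now?")
print(result)
```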

The complete open source implementation is available at: https://github.com/roxsross/aws-ai-driven-devops-actions

Demo 1: Local Observability with Claude

The first demo showed local observability using the Claude model through Amazon Bedrock. The AI agent connected to an Amazon EKS cluster, pulled real-time data from Prometheus, and analyzed system health.
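A minimal sketch of that metrics pull, assuming the prometheus-api-client package and illustrative queries (the repo's analyze module has the real ones):

```python
from prometheus_api_client import PrometheusConnect  # pip install prometheus-api-client

prom = PrometheusConnect(url="http://prometheus.monitoring:9090", disable_ssl=True)

# Container restarts and CPU rate for an illustrative "demo" namespace
restarts = prom.custom_query('kube_pod_container_status_restarts_total{namespace="demo"}')
cpu = prom.custom_query('rate(container_cpu_usage_seconds_total{namespace="demo"}[5m])')

for sample in restarts:
    pod = sample["metric"].get("pod", "unknown")
    print(f"{pod}: {sample['value'][1]} restarts")
```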

Healthy Scenario: The system showed a 100% health score with no anomalies. The AI automatically approved deployment and sent a Telegram notification with complete details including the model used, system status, analysis time, and confidence score.

Failure Scenario: When they intentionally triggered failures in the cluster, the AI detected problems immediately. The health score dropped, and the AI blocked the deployment automatically. The Grafana dashboard lit up in red while the AI provided detailed analysis of what was happening, why, and how to fix it.
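
Both scenarios end with a Telegram notification. That step needs nothing more exotic than the standard Telegram Bot API; here is a minimal sketch, assuming hypothetical environment variable names:

```python
import os
import requests

def send_telegram_report(text: str) -> None:
    """Push the AI's analysis summary to a Telegram chat via the Bot API."""
    token = os.environ["TELEGRAM_BOT_TOKEN"]  # hypothetical variable names
    chat_id = os.environ["TELEGRAM_CHAT_ID"]
    requests.post(
        f"https://api.telegram.org/bot{token}/sendMessage",
        json={"chat_id": chat_id, "text": text},
        timeout=10,
    )

send_telegram_report(
    "Deployment approved\nModel: Claude via Bedrock\nHealth: 100/100\nConfidence: high"
)
```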

Demo 2: GitHub Actions with Amazon Bedrock

The second demo integrated AI directly into GitHub Actions pipelines. The AI became part of every deployment step, checking everything at pull request time and during deployment.

Pull Request Validation: When a PR was created, the AI automatically triggered observability analysis. It connected to the cluster, analyzed metrics and logs, and provided a complete health review. With a 100% health score and no critical issues, the AI approved the PR automatically.

Blocked Deployment: When critical issues were detected, the AI blocked the deployment with a red message on the pull request. The workflow showed detailed reasons for the block, the health score (68 out of 100), and the main issues found. A Telegram notification provided the same report with safety recommendations.

The Docker-based GitHub Action is publicly available and can be added to any pipeline with just a few lines of configuration. Developers specify the AI model provider, Kubernetes namespace, app name, cluster name, and Telegram token. The action handles everything else automatically.
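
As a rough idea of what that configuration could look like (the action reference and input names below are assumptions; check the repository's README for the real ones):

```yaml
name: ai-observability-gate
on: [pull_request, push]

jobs:
  ai-health-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Illustrative action reference and inputs; the actual names may differ.
      - uses: roxsross/aws-ai-driven-devops-actions@main
        with:
          model-provider: bedrock
          namespace: demo
          app-name: my-app
          cluster-name: my-eks-cluster
          telegram-token: ${{ secrets.TELEGRAM_TOKEN }}
```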

Key Takeaways and Best Practices

AI Prevents Failures Before They Happen - Not after production breaks, but before code even deploys. This shift from reactive to proactive changes everything.

Model Flexibility Builds Confidence - Choose between models available through Amazon Bedrock or OpenAI. The open source architecture makes it easy to switch providers or add new ones.

Clear Explanations Build Trust - Teams ship faster when they understand why the AI made specific decisions. The system provides detailed reasoning, not just pass/fail verdicts.

DevOps Principles Apply to AI - As Rossana emphasized: "AI is a tool. It makes you stronger, it makes you faster, it makes you better. Don't be afraid of AI. Use it and you will be successful."

Elizabeth closed with this insight: "AI won't replace engineers, but engineers who use AI may replace those who don't. AI is a tool that makes you stronger, faster, and better."

The Future of DevOps

The choice is clear: continue fighting fires at 3:00 AM with traditional observability, or let AI protect deployments proactively. The technology exists today. The code is open source. The demos are ready to run.

Two companies. Company one uses traditional observability: deploy, wait, something breaks, fix. 3:00 AM calls and stressed teams. Company two uses AI-powered observability: analyze, predict, block bad deployments, approve good ones. No surprises and happy teams.

Which company do you want to be?

The repository at https://github.com/roxsross/aws-ai-driven-devops-actions includes everything needed to get started: Kubernetes and Prometheus logic in the analyze folder, AI provider management in the models folder, Telegram notifications, and observability scripts in the tools folder. Everything is documented, modular, and written in Python.


About This Series

This post is part of DEV Track Spotlight, a series highlighting the incredible sessions from the AWS re:Invent 2025 Developer Community (DEV) track.

The DEV track featured 60 unique sessions delivered by 93 speakers from the AWS Community - including AWS Heroes, AWS Community Builders, and AWS User Group Leaders - alongside speakers from AWS and Amazon. These sessions covered cutting-edge topics including:

  • πŸ€– GenAI & Agentic AI - Multi-agent systems, Strands Agents SDK, Amazon Bedrock
  • πŸ› οΈ Developer Tools - Kiro, Kiro CLI, Amazon Q Developer, AI-driven development
  • πŸ”’ Security - AI agent security, container security, automated remediation
  • πŸ—οΈ Infrastructure - Serverless, containers, edge computing, observability
  • ⚑ Modernization - Legacy app transformation, CI/CD, feature flags
  • πŸ“Š Data - Amazon Aurora DSQL, real-time processing, vector databases

Each post in this series dives deep into one session, sharing key insights, practical takeaways, and links to the full recordings. Whether you attended re:Invent or are catching up remotely, these sessions represent the best of our developer community sharing real code, real demos, and real learnings.

Follow along as we spotlight these amazing sessions and celebrate the speakers who made the DEV track what it was!
