maryam mairaj for SUDO Consultants
AWS DevOps Agent: Automated Incident Response and Root Cause Analysis on AWS

Stop Waking Up at 3 AM: How AWS DevOps Agent Automates Incident Response

Every on-call engineer knows the drill: a CloudWatch alarm fires at 3 AM, and you spend the next 30 minutes manually correlating logs, metrics, and service events across five browser tabs. This is not a problem you can solve by adding headcount; it is an automation gap that AWS DevOps Agent is designed to close.

AWS DevOps Agent, launched in preview in early 2026, is an Anthropic-powered AI embedded directly into the AWS console. It is built to behave like an experienced on-call engineer: it receives your alarm, investigates autonomously across your entire AWS environment, correlates signals, and delivers a diagnosis with recommended actions. No hints. No prompting. Just results.

This is not another AI chatbot where you paste log excerpts and ask questions. The agent has native read access to your AWS environment and performs its own investigation from start to finish.

Who Should Use AWS DevOps Agent

DevOps and Cloud Engineers managing on-call rotations who want an AI-powered second responder that continuously monitors the AWS environment and never misses a log correlation.

CTOs and Engineering Managers evaluating AI-driven cloud operations to reduce MTTR (mean time to resolution) and operational overhead without growing headcount.

Teams in e-commerce, SaaS, banking, and healthcare, where every minute of downtime has a direct dollar cost and a fast response at 3 AM is non-negotiable.

How AWS DevOps Agent Integrates with CloudWatch, EventBridge, and Your Existing AWS Stack

The agent does not require sidecar infrastructure or a separate observability platform. It integrates with your existing AWS setup and acts as an autonomous reasoning layer on top of it.

When a CloudWatch alarm fires, an EventBridge rule routes the event to the agent. The agent then independently queries CloudWatch Logs, EC2 metrics, SSM Run Command, the AWS Health API, and other data sources, without being told where to look. It delivers a structured incident report with findings and recommended actions.

The flow is: CloudWatch Alarm → EventBridge Rule → AWS DevOps Agent → Investigation → Findings and Recommendations.

Step-by-Step: Implementing AWS DevOps Agent for Automated EC2 Incident Response

The scenario below is a real walkthrough. An EC2 instance running a production PHP application spikes to 98% CPU utilization. No human investigates. The agent is triggered and given only the alarm event. Everything that follows is autonomous.

Step 1: Enable the agent and connect your alarm

Enable AWS DevOps Agent from the AWS console under the Operations category. Then create an EventBridge rule that routes your CloudWatch CPU alarm to the agent’s event bus.

```shell
aws events put-rule \
  --name "cpu-spike-to-devops-agent" \
  --event-pattern '{
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {
      "alarmName": ["EC2-CPU-High"],
      "state": {"value": ["ALARM"]}
    }
  }' \
  --state ENABLED

aws events put-targets \
  --rule "cpu-spike-to-devops-agent" \
  --targets '[{
    "Id": "devops-agent-target",
    "Arn": "arn:aws:devops-agent:ap-south-1:ACCOUNT_ID:agent/default"
  }]'
```
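Before relying on the rule in production, you can sanity-check the pattern against a synthetic alarm event with EventBridge's `test-event-pattern` API. The event body below is a minimal hand-built example, not a captured event:

```shell
# Does a matching alarm event actually satisfy the rule's pattern?
aws events test-event-pattern \
  --event-pattern '{
    "source": ["aws.cloudwatch"],
    "detail-type": ["CloudWatch Alarm State Change"],
    "detail": {"alarmName": ["EC2-CPU-High"], "state": {"value": ["ALARM"]}}
  }' \
  --event '{
    "id": "1", "account": "123456789012", "source": "aws.cloudwatch",
    "time": "2026-01-01T02:41:00Z", "region": "ap-south-1", "resources": [],
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {"alarmName": "EC2-CPU-High", "state": {"value": "ALARM"}}
  }'
```

A matching event returns `{"Result": true}`; if you see `false`, the alarm name or state filter in the pattern does not line up with what the alarm actually emits.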

Step 2: Define the CloudWatch alarm

```shell
# The alarm name must match the alarmName filter in the EventBridge rule.
aws cloudwatch put-metric-alarm \
  --alarm-name "EC2-CPU-High" \
  --alarm-description "CPU utilization above 85% for 5 minutes" \
  --metric-name CPUUtilization \
  --namespace AWS/EC2 \
  --statistic Average \
  --period 300 \
  --threshold 85 \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 1 \
  --dimensions '[{"Name":"InstanceId","Value":"i-0abc1234def567890"}]'
# No --alarm-actions needed: CloudWatch publishes alarm state changes to
# EventBridge automatically, and alarm actions cannot target a rule ARN.
```

Step 3: Simulate the CPU spike

To replicate this scenario in a test environment, stress the instance using SSM Run Command:

```shell
aws ssm send-command \
  --instance-ids "i-0abc1234def567890" \
  --document-name "AWS-RunShellScript" \
  --parameters '{"commands":["stress --cpu 4 --timeout 600"]}' \
  --comment "Simulate CPU spike for DevOps Agent demo"
# Assumes the 'stress' utility is installed on the instance
# (on Amazon Linux 2, e.g.: sudo amazon-linux-extras install epel -y && sudo yum install -y stress)
```

What the Agent Did: Step by Step

The following is the agent’s investigation trace. It received one input: the alarm state change event. Everything below is what it derived on its own.

T+00:04 (CloudWatch): Alarm EC2-CPU-High transitioned to ALARM state. CPUUtilization = 98.4% over a 5-minute average.

T+00:09 (Agent): Received alarm event via EventBridge. Resolved instance metadata: t3.medium, ap-south-1a, Amazon Linux 2, 2 vCPUs. Instance state: running. Initiating investigation.

T+00:22 (Agent): Queried CloudWatch Metrics for CPUUtilization over the past 2 hours. CPU was stable at 18–22% for approximately 115 minutes, then escalated sharply to 98.4% over a 7-minute window. Spike onset: 02:41 UTC.

T+00:38 (Agent): Cross-referenced the spike onset against CloudWatch Logs for /var/log/php-fpm/error.log. Found 847 repeated entries: "PHP Fatal error: Allowed memory size of 134217728 bytes exhausted." First occurrence: 02:40:58 UTC. Correlation: strong.

T+00:51 (Agent): Issued SSM Run Command: ps aux --sort=-%cpu | head -15. Top consumer: php-fpm: pool www, with 4 worker processes at roughly 24% CPU each (96.1% total). No other anomalous processes found.

T+01:02 (Agent): Examined ALB access logs in S3. Request volume at 02:40 UTC: 2,847 requests per minute, up from a 312 req/min baseline. Origin IPs concentrated in 3 CIDR ranges not previously seen. Pattern consistent with an external load spike or traffic anomaly.

T+01:18 (Agent): Checked the AWS Health API for ap-south-1 service events. No active incidents. The CPU contention is instance-specific, not regional.

T+01:29 (Agent): Queried the current PHP-FPM configuration via SSM: pm.max_children = 5, memory_limit = 128M. At 2,847 req/min with 5 workers on 2 vCPUs, the worker pool is severely underprovisioned for this traffic volume. Memory is not the bottleneck; CPU is.
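The underprovisioning is easy to quantify. A common sizing heuristic for pm.max_children (an illustration added here, not part of the agent's output) is to divide the memory you can spare for PHP-FPM by the per-worker footprint:

```shell
# Rough pm.max_children sizing for a t3.medium (4 GB RAM).
# RESERVED_MB is an assumed headroom figure for the OS and other services.
TOTAL_MB=4096        # t3.medium RAM
RESERVED_MB=1024     # headroom for the OS, web server, etc.
WORKER_MB=128        # matches the pool's memory_limit of 128M
echo $(( (TOTAL_MB - RESERVED_MB) / WORKER_MB ))   # prints 24
```

With these assumptions the instance could comfortably run about 24 workers, which is why a pool of 5 collapses into queuing and retry thrash at nine times baseline traffic.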

T+01:44 (Agent): Root cause identified. Generating incident summary and remediation recommendations.

What the Agent Concluded

Critical: PHP-FPM worker pool exhaustion under traffic surge

A 9x increase in inbound request volume at 02:40 UTC overwhelmed the PHP-FPM worker pool (pm.max_children = 5) on a 2-vCPU instance. Workers queued and retried, saturating both CPU cores. The PHP memory exhaustion errors in the logs are a symptom of worker thrashing, not the root cause.

Warning: No Auto Scaling configured on this instance

Instance i-0abc1234def567890 is not part of an Auto Scaling Group. The traffic spike had no horizontal scale-out path. A single t3.medium was absorbing all inbound load with no relief mechanism.

Observation: Traffic origin suggests a non-organic spike

ALB logs show 68% of the spike traffic originated from 3 CIDR blocks not previously observed during the baseline window. This may indicate a load test, bot activity, or a marketing campaign without prior capacity planning coordination.

What the Agent Recommended

  1. Increase pm.max_children in /etc/php-fpm.d/www.conf from 5 to 20–25 and restart PHP-FPM. This reduces CPU saturation by distributing load across more workers without requiring an instance resize.

  2. Move the instance behind an Auto Scaling Group with a target tracking policy on CPUUtilization at 60%. Configure scale-out to add t3.medium instances when the threshold is breached for 2 consecutive minutes.

  3. Investigate the three anomalous CIDR blocks in the ALB access logs. If confirmed as bot traffic, add a WAF rate-based rule capping requests to 100 per IP per 5-minute window from unknown CIDR ranges.

  4. Consider upgrading from t3.medium (2 vCPU, 4GB RAM) to t3.large or c6i.large if PHP-FPM worker tuning alone proves insufficient at sustained peak load.

  5. Add a CloudWatch alarm on php-fpm_active_processes via the CloudWatch Agent to detect worker pool exhaustion before it saturates CPU, giving you a leading indicator rather than a lagging one.
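Recommendation 2 maps onto a standard target tracking policy. A sketch, assuming the instance has already been moved into an Auto Scaling Group (the group name web-asg is a placeholder, not from the agent's output):

```shell
# Target tracking on average CPU at 60% for a hypothetical "web-asg" group
aws autoscaling put-scaling-policy \
  --auto-scaling-group-name "web-asg" \
  --policy-name "cpu-target-60" \
  --policy-type TargetTrackingScaling \
  --target-tracking-configuration '{
    "PredefinedMetricSpecification": {
      "PredefinedMetricType": "ASGAverageCPUUtilization"
    },
    "TargetValue": 60.0
  }'
```

Target tracking handles both scale-out and scale-in around the 60% setpoint, so no separate step policies are needed for this scenario.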

To apply the PHP-FPM fix immediately via SSM without SSH:
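A minimal sketch, assuming a stock Amazon Linux 2 php-fpm layout with the pool config at /etc/php-fpm.d/www.conf (adjust the path and service name for your distribution):

```shell
# Raise pm.max_children from 5 to 20 and restart PHP-FPM, all through SSM.
aws ssm send-command \
  --instance-ids "i-0abc1234def567890" \
  --document-name "AWS-RunShellScript" \
  --parameters '{"commands":[
    "sudo sed -i \"s/^pm.max_children = .*/pm.max_children = 20/\" /etc/php-fpm.d/www.conf",
    "sudo systemctl restart php-fpm"
  ]}' \
  --comment "Apply PHP-FPM worker pool fix"
```

Because this goes through Run Command rather than SSH, the change is logged in SSM command history and works even when the instance has no inbound access.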

AWS DevOps Agent vs. Manual Incident Response: Speed, Accuracy, and Scale

AWS DevOps Agent completed a full root cause analysis in under two minutes, autonomously correlating CloudWatch metrics, PHP-FPM logs, ALB access logs, and the AWS Health API. A human engineer performing the same investigation typically needs 15–40 minutes, assuming full familiarity with the environment.

The implications go beyond speed. The agent has no knowledge gaps about your environment’s history. It does not skip the ALB logs because it is tired. It does not miss the PHP-FPM configuration because it assumes the problem was infrastructure. It checks everything systematically.

For lean DevOps teams or those operating across time zones, AWS DevOps Agent delivers an always-on, AI-powered first response. The on-call rotation doesn’t disappear, but the first 20 minutes of every incident now happen without a human.
