Hung____ for AWS Community Builders

Posted on Dec 12, 2025 • Edited on Jan 15

AWS DevOps Agent Demo: Investigating ALB Health Check Failures

#aws #devops #agents

AWS DevOps Agent is a service that autonomously investigates incidents and identifies root causes.
In this demo, I simulate a scenario where EC2 instances behind an Application Load Balancer start failing health checks, and watch AWS DevOps Agent diagnose the problem.

Demo

In production environments, one of the most common incidents is ALB targets becoming unhealthy. This can happen for many reasons:

Application crashes
Database connection failures
Memory exhaustion
Dependency timeouts
Misconfigured health checks

In this demo, I'll deploy a Flask web application behind an ALB and simulate a database connection failure that causes the health endpoint to return 503 errors. The ALB will mark the targets as unhealthy, trigger CloudWatch alarms, and AWS DevOps Agent will investigate the root cause.
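To make the failure mode concrete, here is a minimal sketch of how such a health endpoint could behave. It is not the code from the CloudFormation template: it uses a plain WSGI callable instead of Flask so that it runs with only the standard library, and the `STATE` flag and the `call` helper are hypothetical names. The paths mirror the ones used in the curl commands later in this post.

```python
import json

# Hypothetical in-memory flag standing in for the real database connection.
STATE = {"db_connected": True}


def app(environ, start_response):
    """WSGI app with a health endpoint and two switches that flip its state."""
    path = environ.get("PATH_INFO", "/")
    if path == "/simulate/unhealthy":
        STATE["db_connected"] = False
        status, payload = "200 OK", {"simulating": "unhealthy"}
    elif path == "/simulate/healthy":
        STATE["db_connected"] = True
        status, payload = "200 OK", {"simulating": "healthy"}
    elif path == "/health" and STATE["db_connected"]:
        status, payload = "200 OK", {"status": "healthy"}
    elif path == "/health":
        # An ALB health check that expects 200 counts this as a failure.
        status, payload = "503 Service Unavailable", {"status": "unhealthy"}
    else:
        status, payload = "404 Not Found", {"error": "not found"}
    start_response(status, [("Content-Type", "application/json")])
    return [json.dumps(payload).encode("utf-8")]


def call(path):
    """Invoke the app in-process and return (status_code, parsed_body)."""
    captured = {}

    def start_response(status, headers):
        captured["code"] = int(status.split()[0])

    body = b"".join(app({"PATH_INFO": path}, start_response))
    return captured["code"], json.loads(body)
```

Calling `call("/simulate/unhealthy")` and then `call("/health")` yields a 503 status, which is the same transition the demo triggers on the real instances.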

Here is the diagram: just a simple ALB, two EC2 instances, and CloudWatch alarms. I recommend deploying this stack to us-east-1 because, at the time of writing, AWS DevOps Agent is only available there.

I have also included a Lambda function that automatically shuts down the instances after 2 hours, so you don't need to worry about unexpected costs.
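For readers curious how such a safety net can work, the sketch below shows one possible shape of an auto-shutdown Lambda. It is an illustration only, not the function from the template: the `Project=devops-agent-demo` tag, the `handler` name, and the helper `instances_to_stop` are assumptions, and pagination is ignored because the demo has only two instances.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=2)  # Mirrors the 2-hour limit described in this post.


def instances_to_stop(instances, now, max_age=MAX_AGE):
    """Return IDs of instances launched at least max_age before now."""
    return [
        inst["InstanceId"]
        for inst in instances
        if now - inst["LaunchTime"] >= max_age
    ]


def handler(event, context):
    """Hypothetical Lambda entry point; needs boto3 and AWS credentials."""
    import boto3  # Imported lazily so the helper above works without it.

    ec2 = boto3.client("ec2")
    # Assumption: the demo instances carry the tag Project=devops-agent-demo.
    response = ec2.describe_instances(
        Filters=[
            {"Name": "tag:Project", "Values": ["devops-agent-demo"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    instances = [
        inst
        for reservation in response["Reservations"]
        for inst in reservation["Instances"]
    ]
    ids = instances_to_stop(instances, datetime.now(timezone.utc))
    if ids:
        ec2.stop_instances(InstanceIds=ids)
    return {"stopped": ids}
```

Keeping the age check in a pure function makes it easy to verify locally without touching AWS.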

CloudFormation template: https://gist.github.com/Hung-00/e53f4c980baf13d9bb8902fd36a79a6b

After the stack is created successfully, check its outputs.
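If you prefer a script over the console, the stack outputs can also be read with boto3. The sketch below assumes boto3 is installed and credentials are configured; the stack name you pass in is whatever you chose at deploy time, and the function names are my own.

```python
def outputs_to_dict(describe_stacks_response):
    """Flatten the Outputs list of the first stack into {key: value}."""
    stack = describe_stacks_response["Stacks"][0]
    return {o["OutputKey"]: o["OutputValue"] for o in stack.get("Outputs", [])}


def fetch_outputs(stack_name, region="us-east-1"):
    """Query CloudFormation for a stack's outputs; needs AWS credentials."""
    import boto3  # Imported lazily so outputs_to_dict works without it.

    cfn = boto3.client("cloudformation", region_name=region)
    return outputs_to_dict(cfn.describe_stacks(StackName=stack_name))
```

`fetch_outputs` returns a plain dictionary, so the values the template exports can be looked up by their output key.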

Go ahead and connect to the two instances. You can connect through Session Manager.

Run this command to check the health status:

curl http://localhost/health

The server is healthy.

Now run one of these commands on both servers to interrupt them:

curl -s http://localhost/simulate/unhealthy

or

curl http://localhost/simulate/crash

The alarm has been triggered.
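The alarm does not fire the instant the health endpoint starts failing: an ALB only marks a target unhealthy after a configured number of consecutive failed checks, and healthy again after a number of consecutive successes. The sketch below replays that behaviour for a list of health check status codes. The threshold values are assumptions (the template may use different ones), and the real ALB has extra states such as `initial` that are left out here.

```python
def track_target_health(status_codes, unhealthy_threshold=2, healthy_threshold=2):
    """Replay health check results and return the target state after each one."""
    state = "healthy"
    streak = 0  # Consecutive results that disagree with the current state.
    history = []
    for code in status_codes:
        check_passed = code == 200
        if check_passed == (state == "healthy"):
            # The result agrees with the current state, so the streak resets.
            streak = 0
        else:
            streak += 1
            needed = unhealthy_threshold if state == "healthy" else healthy_threshold
            if streak >= needed:
                state = "unhealthy" if state == "healthy" else "healthy"
                streak = 0
        history.append(state)
    return history
```

With the assumed thresholds of 2, `track_target_health([200, 503, 503, 503, 200, 200])` returns `["healthy", "healthy", "unhealthy", "unhealthy", "unhealthy", "healthy"]`: the first 503 alone changes nothing, and recovery also takes two passing checks.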

Now let's head to AWS DevOps Agent.

If you don't have an Agent Space yet, create one first; the process is straightforward.

Open Operator Access.

Have a look at your system in the DevOps Center tab. You can see the stack's resources.

Switch to the Incident Response tab and investigate the latest alarm.

You can watch the Investigation Progress. The agent shows its reasoning as it investigates:

Interestingly, AWS DevOps Agent can tell that a user deliberately interrupted the server, as you can see below:

Each investigation has its own chat session, where you can ask the agent about it.

Go to the Prevention tab and run it. The agent analyzes past investigations and gives you recommendations for improvement.

AWS DevOps Agent cannot resolve incidents by itself. You need to fix the root cause and implement the recommendations on your own.

Finally, run this command on both servers to restore the healthy status:

curl -s http://localhost/simulate/healthy

Remember to delete the stack when you are done.

Conclusion

In this demo, we saw how AWS DevOps Agent cuts down resolution time, finds root causes quickly, and suggests ways to prevent similar issues.

The agent works best when it understands your full environment — AWS accounts, external tools, everything. Adding MCP servers for custom integrations could make it even more powerful.

It's free during preview, with some usage limits. Security is administrator-controlled through IAM permissions.

I think DevOps Agent is a tool, not a replacement. Engineers are still essential for implementing fixes, designing infrastructure improvements, and making critical decisions when rollbacks are needed.

This is just a simple scenario for a first look at AWS DevOps Agent. I will try to simulate more scenarios in the future.

Thank you.