DEV Community

AWS DevOps Agent Explained: Architecture, Setup, and Real Root-Cause Demo (CloudWatch + EKS)

Amazon Web Services (AWS) launched frontier agents at re:Invent 2025. These are autonomous systems that work toward goals across a range of use cases, scale to tackle many concurrent tasks, and run persistently for hours or days without intervention.

In this blog we will talk about one of the frontier agents: AWS DevOps Agent!

This blog explains what AWS DevOps Agent is, its architecture and components, its security model, and two demos: investigating an EC2 CPU spike (CloudWatch alarm) and an EKS pod error.

You can also jump straight to investigating CloudWatch alarms or EKS errors, and there is a Terraform repo for the EKS demo as well.

At the end I also share a DevOps engineer's perspective on the fear of being replaced, and it is a must-read.


  1. Architecture

  2. What is AWS DevOps Agent

  3. How to maximize Agent's Effectiveness

  4. AWS DevOps Agent working architecture

  5. Resource Discovery by AWS DevOps Agent

  6. Demo 1: Investigate CloudWatch Alarm

  7. Demo 2: Investigate EKS errors

  8. DevOps Engineer perspective


Architecture

architecture

What is AWS DevOps Agent

Think of AWS DevOps Agent as a 24/7, continuously learning, autonomous on-call engineer that has all the tools (more about them later) to investigate incidents, find root causes, provide a mitigation plan, and recommend preventive measures.

At the moment it cannot fix incidents on its own. A human is still needed to fix the root cause.

How it investigates and finds the root cause

To investigate root causes and give recommendations, the agent needs to understand the infrastructure and applications inside the AWS account and the relationships between them. This is called the topology.

This holistic understanding of the infrastructure becomes the CONTEXT for the DevOps Agent.
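
To make the idea of a topology more concrete, here is a minimal Python sketch that models resources and their relationships as a graph and walks outward from an alarming resource. The resource names and the `related_resources` helper are hypothetical illustrations, not the agent's actual data model.

```python
from collections import deque

# Hypothetical topology: each resource maps to the resources it relates to.
TOPOLOGY = {
    "cloudwatch-alarm/high-cpu": ["ec2/stress-instance"],
    "ec2/stress-instance": ["security-group/ssh-access", "key-pair/demo-key"],
    "security-group/ssh-access": [],
    "key-pair/demo-key": [],
}


def related_resources(topology, start):
    """Breadth-first walk returning every resource reachable from `start`."""
    seen = [start]
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in topology.get(current, []):
            if neighbor not in seen:
                seen.append(neighbor)
                queue.append(neighbor)
    return seen


if __name__ == "__main__":
    # Starting from the alarm, list everything an investigation might look at.
    for resource in related_resources(TOPOLOGY, "cloudwatch-alarm/high-cpu"):
        print(resource)
```

The point of the sketch is only that an investigation starting at one resource can follow relationships to the others, which is why richer relationship data gives the agent more context.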

How to maximize Agent's Effectiveness

While the topology provides important context during investigations, AWS DevOps Agent is not limited to investigating only the resources shown in the topology.

The agent may use additional data sources, such as AWS service APIs or connected observability tools, to investigate resources that are not in the application topology.

That is why AWS lets you add capabilities to maximize the agent's effectiveness:

  • Connect multiple AWS accounts
  • Connect CI/CD pipelines through repos such as GitHub/GitLab
  • MCP servers
  • Telemetry sources such as Datadog and New Relic
  • Ticketing and chat tools such as ServiceNow and Slack
  • Even EKS (see Demo 2)

Note: We can also provide runbooks as pre-loaded guidance and hints to improve investigation performance.

runbooks

IN THE END, IT IS ALL ABOUT UNDERSTANDING THE RELATIONSHIPS BETWEEN YOUR RESOURCES

AWS DevOps Agent working architecture [IMP]

  • Operates through a dual-console architecture.
  • Admins use the AWS Management Console to create and manage Agent Spaces, configure capabilities, and set up access controls.
  • Operations teams use the DevOps Agent web app to interact with the agent and start investigations.

DevOps Agent Spaces

A DevOps Agent Space is a logical container/boundary that defines which tools and infrastructure AWS DevOps Agent has access to.

When you create an Agent Space, you define which AWS accounts the agent can access, which external tools it can connect to, and which users in your organization can interact with the agent.

Admins configure the Agent Space through the AWS Management Console.

Security Aspect of Agent Spaces:

  • Each Agent Space uses dedicated IAM roles that grant access only to specific AWS accounts and resources.
  • You control which users or groups can access each Agent Space.
  • Information from one Agent Space is not visible or accessible from another Agent Space.

DevOps Agent Web App

The operations team uses the web app for daily incident response activities.

Security Aspect of Web Apps:

  • IAM Identity Center (user access): centrally manage user access to the DevOps Agent Space web apps, and even federate with external identity providers. MFA support is included.

  • IAM authentication link (admin access): direct access to the web app from the AWS Management Console using your existing console session.

Resource Discovery by AWS DevOps Agent

So far we have seen that the agent's context starts with resource discovery (the topology). It discovers resources in two ways:

  • CloudFormation stacks: By default the agent lists all CloudFormation stacks and their resources. Resources created by CDK are also supported.

  • Resource tags: Resources not deployed through CloudFormation (for example, created in the console or with Terraform) are discovered through the AWS tag key-value pairs you specify for inclusion in the topology.
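
The two discovery paths above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how the agent is implemented: the ARNs, tag values, and the `matches_tags` and `discover` helpers are all hypothetical. In a real script the inputs could come from boto3 calls such as CloudFormation's `list_stack_resources` and the Resource Groups Tagging API's `get_resources`.

```python
# Hypothetical inputs: resources owned by CloudFormation stacks, plus other
# resources (console or Terraform) keyed by ARN with their tags.
STACK_RESOURCES = ["arn:aws:ec2:us-east-1:111122223333:instance/i-0example"]
OTHER_RESOURCES = {
    "arn:aws:eks:us-east-1:111122223333:cluster/demo": {"Project": "agent-demo"},
    "arn:aws:s3:::unrelated-bucket": {"Project": "something-else"},
}
AGENT_SPACE_TAGS = {"Project": "agent-demo"}  # tags configured for discovery


def matches_tags(resource_tags, wanted_tags):
    """True if the resource carries every wanted key with the same value."""
    return all(resource_tags.get(k) == v for k, v in wanted_tags.items())


def discover(stack_resources, other_resources, wanted_tags):
    """Combine stack-managed resources with tag-matched ones, no duplicates."""
    discovered = list(stack_resources)
    for arn, tags in other_resources.items():
        if arn not in discovered and matches_tags(tags, wanted_tags):
            discovered.append(arn)
    return discovered


if __name__ == "__main__":
    for arn in discover(STACK_RESOURCES, OTHER_RESOURCES, AGENT_SPACE_TAGS):
        print(arn)
```

The untagged or differently tagged bucket is left out, which mirrors why Terraform-managed resources need matching tags before they show up in the topology.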

Demo 1: Investigate CloudWatch Alarm

Pre-requisite

  • For brevity I won't cover how to create an Agent Space. It is pretty straightforward, and you don't need AWS Organizations to do it.
  • CloudFormation basics
  • Access to the us-east-1 region. DevOps Agent is available only in this region.
  • I used a single standalone account.

create AS-1

create AS-2

screen after creating

1) Investigate CloudWatch Alarm

cfn stack deployed

  • This CFN template creates a security group, a key pair for SSH access, an EC2 instance with a startup script that runs a CPU stress test, a CloudWatch alarm on CPU utilization, and an automatic shutdown of the EC2 instance after 2 hours.
  • After the EC2 instance is created, wait 5-10 minutes for the CloudWatch alarm to trigger.
  • You can also SSH into the instance and run the stress test manually: ./cpu-stress-test.sh

Once the stack is deployed, DevOps Agent automatically identifies the new resources.
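
For readers who have not set up a CPU alarm before, here is a rough Python sketch of what such an alarm definition looks like. The threshold, period, and alarm name are my assumptions, not the values from the CFN template, and `would_alarm` is a simplified stand-in for CloudWatch's real evaluation logic. The parameter names follow boto3's `put_metric_alarm` call.

```python
def cpu_alarm_params(instance_id, threshold=80.0, period=300, evaluation_periods=1):
    """Build keyword arguments for CloudWatch put_metric_alarm (assumed values)."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": period,  # seconds per datapoint
        "EvaluationPeriods": evaluation_periods,
        "Threshold": threshold,  # percent CPU
        "ComparisonOperator": "GreaterThanThreshold",
    }


def would_alarm(datapoints, params):
    """Simplified check: the most recent N datapoints all exceed the threshold."""
    n = params["EvaluationPeriods"]
    recent = datapoints[-n:]
    return len(recent) == n and all(p > params["Threshold"] for p in recent)


if __name__ == "__main__":
    params = cpu_alarm_params("i-0123456789abcdef0")
    print(params["AlarmName"])
    print(would_alarm([12.0, 95.5], params))
```

With AWS credentials configured, the same dictionary could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`. With a 5-minute period, the stress test has to run for at least one full period before the alarm fires, which matches the 5-10 minute wait above.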

topology with 42 resources

topology-42 resources

Access Web App

After creating the Agent Space, click View Details.

Agent creation spaces

Click Web app, or use the Operator Access link directly.

web app access

Investigation Root Cause

cloudwatch trigger

Once your CloudWatch alarm is triggered, go to the web app and, under Incident Response, click the latest alarm and hit Start Investigation.

DevOps Agent is smart enough to fill in the complete prompt. The agent figures out the steps on its own and gives you the final ROOT CAUSE.

root cause

I also came across an interesting finding: when the time between two alarms was around 40 minutes, the agent was unable to find the root cause and I had to rerun the investigation.

cloudwatch-trigger-second

rootcause not found

Mitigation plan

Since the CPU spike was user-initiated, the agent was smart enough not to give a mitigation plan.

no mitigation plan

Prevention

Since this was a very straightforward demo with user-initiated errors, there wasn't enough to generate prevention recommendations.

Prevention recommendations

Investigation gaps

One of my favorite features, which even the docs do not cover, is Investigation gaps.

No investigation can be perfect, and that is what investigation gaps are for: they tell you when the agent could not cover certain details because something was missing at the infrastructure level. In this demo, for example, it reported the absence of an SSH agent and CloudWatch log groups.

Investigation gaps

Using Chat

You can ask more detailed questions in natural language using chat.

chat

Demo 2: Investigate EKS errors

Adding Capability: Give DevOps Agent access to EKS clusters

  • Go to Capabilities and click Edit.
  • As we learned earlier, the Agent Space IAM role controls the agent's access to AWS resources. Click View role permissions.

agent space role

  • Copy the IAM role ARN. You will need to add an access entry in the EKS cluster with AmazonEKSAdminViewPolicy (we have Terraform code for it).

agent space arn

  • Go to the Terraform code and replace the ARN in terraform.vars

  • After the Terraform run finishes, you can see that the nginx pod has an ImagePullBackOff error, which is intentional.

pod error

  • As we learned earlier, we are deploying resources with Terraform, so we need to add tags in the Agent Space so that DevOps Agent can find the resources.

Agent space tags

I am using the same tags in my Terraform code as well. Click Save. You will see that the Agent Space automatically finds the newly created resources.

eks cluster topology
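
If you prefer scripting the access entry step instead of using Terraform, the same idea can be sketched in Python. This is not the code from my Terraform repo; the helper names, cluster name, and role ARN are placeholders. The two method names and their parameters are boto3 EKS client calls, and the cluster-wide access scope is my assumption.

```python
ADMIN_VIEW_POLICY_ARN = (
    "arn:aws:eks::aws:cluster-access-policy/AmazonEKSAdminViewPolicy"
)


def access_entry_requests(cluster_name, role_arn):
    """Return the (method name, kwargs) pairs needed to grant read-only access."""
    return [
        (
            "create_access_entry",
            {"clusterName": cluster_name, "principalArn": role_arn},
        ),
        (
            "associate_access_policy",
            {
                "clusterName": cluster_name,
                "principalArn": role_arn,
                "policyArn": ADMIN_VIEW_POLICY_ARN,
                "accessScope": {"type": "cluster"},  # assumed: whole cluster
            },
        ),
    ]


def apply_requests(eks_client, requests):
    """Send each planned request through the given EKS client, in order."""
    for method_name, kwargs in requests:
        getattr(eks_client, method_name)(**kwargs)


# Usage (needs boto3 and AWS credentials; names below are placeholders):
#   import boto3
#   plan = access_entry_requests("my-cluster", "<agent-space-role-arn>")
#   apply_requests(boto3.client("eks", region_name="us-east-1"), plan)
```

Building the requests separately from sending them makes it easy to review exactly what will be granted before anything touches the cluster.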

Go to Web App

Now we will ask AWS DevOps Agent what the cause of this error is.

pods question

The agent successfully investigated the EKS cluster and found the root cause.

root cause of pod
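
To see what kind of signal an investigation like this works from, here is a small Python sketch that scans pod status data for image pull failures. The sample pod name and image tag are made up for illustration and are not taken from the demo cluster. The structure mirrors the JSON you get from `kubectl get pods -o json`.

```python
IMAGE_PULL_REASONS = {"ImagePullBackOff", "ErrImagePull"}

# Hypothetical sample shaped like the output of `kubectl get pods -o json`.
SAMPLE_PODS = {
    "items": [
        {
            "metadata": {"name": "nginx-demo"},
            "status": {
                "containerStatuses": [
                    {
                        "image": "nginx:no-such-tag",
                        "state": {"waiting": {"reason": "ImagePullBackOff"}},
                    }
                ]
            },
        },
        {
            "metadata": {"name": "healthy-pod"},
            "status": {
                "containerStatuses": [
                    {"image": "busybox:latest", "state": {"running": {}}}
                ]
            },
        },
    ]
}


def image_pull_failures(pod_list):
    """Return (pod name, image, reason) for containers stuck pulling an image."""
    failures = []
    for pod in pod_list.get("items", []):
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for status in statuses:
            waiting = status.get("state", {}).get("waiting") or {}
            if waiting.get("reason") in IMAGE_PULL_REASONS:
                failures.append(
                    (pod["metadata"]["name"], status.get("image"), waiting["reason"])
                )
    return failures


if __name__ == "__main__":
    for name, image, reason in image_pull_failures(SAMPLE_PODS):
        print(f"{name}: {reason} while pulling {image}")
```

This only finds the symptom. The value of the agent is that it goes further and explains why the pull fails and how to fix it.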

Not only did it find the root cause, it also gave mitigation steps, plus rollback steps in case the mitigation causes issues.

mitigation-1

mitigation-2

mitigation-3

DevOps Engineer perspective

We explored how the AWS DevOps Agent reduces MTTR, prevents future incidents through recommendations, and pinpoints root causes.

The Agent's effectiveness comes from its deep understanding of your infrastructure, both inside AWS accounts and across external systems. It would be interesting to connect an MCP server to enhance the context even further.

DevOps Agent is in preview and free to use, but it has some limits.

Is it secure? If configured properly, absolutely yes! Administrators control what DevOps Agent can access in the AWS account, and the agent's IAM permissions control access to its features and capabilities.

Will it replace you? Absolutely not! You still need an engineer to fix the issues, to build new features for the infrastructure, and to understand the infrastructure well enough to roll back if things go wrong.

Let me know what you think.

I share AWS updates like this on DevOps, Kubernetes, and GenAI daily on LinkedIn and X. Follow me there so that I can make your life easier.
