In a previous series of articles I described the design and implementation of a sample serverless e-commerce solution on AWS: https://dev.to/brycepc/building-the-better-store-an-agile-cloud-native-ecommerce-system-on-aws-part-1-introduction-to-27ii. The solution's source code is also available at: https://github.com/TheBetterStore.
In this article I explore the new AWS DevOps Agent service, which became generally available in March 2026, to demonstrate its AI capabilities for automated root cause analysis of alarm-notified incidents. The agent draws on all sources of information available to it, including (but not limited to) CloudTrail events, API Gateway, application and database logs and metrics, X-Ray, its discovered implementation topology, and integrated source code repositories. To exercise it, errors were deliberately introduced into The Better Store. As discussed in the conclusion, the results certainly look impressive: detailed analysis of errors that could otherwise take much manual effort!
Initial Setup
DevOps Agent may be considered a global service within AWS, in that it monitors resources across all regions; though note that at the time of writing the service's control plane for configuration is available in 6 regions only (us-east-1, us-west-2, ap-southeast-2, ap-northeast-1, eu-central-1, eu-west-1). Its initial setup is relatively simple; it primarily involves:
- Creating an AWS DevOps Agent Space to define an application boundary to monitor. AWS allows creation of multiple Agent Spaces within an account, primarily to support segregating investigation scope by team ownership: each on-call team has an Agent Space containing only the accounts (where cross-account access for an application is in play) and tools relevant to what they are responsible for.
- Configuring capabilities including:
- Secondary sources; e.g. other AWS accounts where cross-account resources are used, or Azure accounts where Azure resources may also be integrated for DevOps Agent monitoring.
- Telemetry - external sources including Dynatrace, Datadog, Grafana, New Relic and Splunk may be integrated.
- Pipeline; where source code repositories for applications being monitored may be configured to support DevOps Agent analyses.
- MCP Servers, to support agentic retrieval against additional sources of information which may be pertinent to an investigation. For example, integration of Atlassian MCP servers is supported, which can provide AWS DevOps Agent with additional information such as runbooks from an organization's Confluence pages.
- Webhooks; to allow 2-way access between an Agent Space and 3rd-party applications and services.
- Configuring access, including Operator access which support teams will use for interacting with DevOps Agent and its analyses. By default, operator access is provided to users as a short-term link generated from the AWS web console, which they therefore require access to. Alternatively, access can be integrated with IAM Identity Center or an external identity provider.
- Configuring log delivery.
For our demonstration with TheBetterStore we will be configuring the following:
- Pipeline - GitHub repositories used for TheBetterStore's services.
- Webhooks - A separate Lambda function, as defined at tbs-devops-aiagent, will be subscribed to error alarms configured for TheBetterStore resources; on being triggered it will invoke a webhook defined by AWS DevOps Agent, passing the alarm information to trigger an incident investigation.
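A minimal sketch of what such a forwarding Lambda might look like is shown below. The payload field names and the `DEVOPS_AGENT_WEBHOOK_URL` environment variable are illustrative assumptions, not the actual tbs-devops-aiagent implementation; the SNS-wrapped CloudWatch alarm fields (`AlarmName`, `NewStateValue`, etc.) are the standard alarm notification shape.

```typescript
// Hypothetical sketch of an alarm-forwarding Lambda: it receives a CloudWatch
// alarm notification via SNS and POSTs the relevant details to the DevOps
// Agent webhook URL. Payload shape and env var name are assumptions.

interface AlarmWebhookPayload {
  alarmName: string;
  newState: string;
  reason: string;
  region: string;
  timestamp: string;
}

// Pure transform: extract the fields the webhook needs from the SNS-wrapped
// CloudWatch alarm message (a JSON string).
export function buildWebhookPayload(snsMessage: string): AlarmWebhookPayload {
  const alarm = JSON.parse(snsMessage);
  return {
    alarmName: alarm.AlarmName,
    newState: alarm.NewStateValue,
    reason: alarm.NewStateReason,
    region: alarm.Region,
    timestamp: alarm.StateChangeTime,
  };
}

// Lambda entry point: forward the payload to the webhook configured for the
// Agent Space. Uses the global fetch available in Node.js 18+ runtimes.
export async function handler(event: { Records: { Sns: { Message: string } }[] }) {
  const payload = buildWebhookPayload(event.Records[0].Sns.Message);
  await (globalThis as any).fetch(process.env.DEVOPS_AGENT_WEBHOOK_URL, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  });
}
```

Keeping the message-to-payload transform as a pure function makes it easy to unit test without invoking the webhook.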
Demo: Incident Simulation and DevOps Agent Analysis
For the demonstration, a bug was deliberately introduced into a new code branch in the tbs-app-order GitHub repository, to invalidate the key used for inserting data into a DynamoDB table. This would cause an error to be thrown in the tbs-app-order-prod Lambda service when a user tries to purchase an item from TheBetterStore, triggering an alarm. The sequence of events is illustrated as:
1. Feature is deployed via GitHub Actions pipeline

2. TheBetterStore error is generated via a purchase:

3. CloudWatch alarm is triggered

4. DevOps Agent incident is created

DevOps Agent Incident Response Analysis
Incidents generated by DevOps Agent are presented within the Operators web console in three separate tabs: an investigation timeline, a root cause analysis (RCA), and a mitigation plan. Additionally, the Operators web console always provides a Chat window, whereby ad-hoc questions may be asked, with prompts for suggested questions also being provided (e.g. 'Summarise the investigation'). Each of these sections is described below.
Investigation timeline
Information within this tab describes how DevOps Agent queries underlying resources for its analysis. The first item displayed is a topology diagram of AWS-deployed resources and their relationships which may be in scope for the analysis. This information is automatically discovered by AWS DevOps Agent, and it may assist the analysis in the following ways:
- Identifies in-scope components and blast radius
- Traces impact paths through dependency chains
- Helps pinpoint root cause
- Shows what downstream services may be affected.
For our incident, the following is presented:

Next, a time series of analysis events is listed, detailing how DevOps Agent queries different sources of information to reach its findings. Only a subset of these is listed below, as an example.
10:30:26 Started
10:30:48 Fetching data
Now I have a clear picture:
Alarm: tbs-app-order-prod-CreateOrder-Errors triggered at 2026-04-10T22:29:04Z
Metric: AWS/Lambda Errors for function tbs-app-order-prod-CreateOrderFunction-BZMzl0lYouog
Datapoint: 1 error at 2026-04-10T22:24:00Z (5-minute period)
Account: 048116471576, region: ap-southeast-2
The Order Service has:
API GW: v6jpsjputa
Lambda: CreateOrderFunction-BZMzl0lYouog
DynamoDB: tbs-app-order-prod-OrderTable-QYMTFX1MFY1E
Repository: TheBetterStore/tbs-app-order
I need to start investigating by checking the Lambda logs for what went wrong, looking at recent deployments or code changes, reviewing the Lambda metrics, checking CloudTrail for any infrastructure modifications, and examining the dependencies like DynamoDB and Stripe integration.
10:31:40 Fetching data
Now let me run multiple CloudTrail searches in parallel. I'll search by resource name for the key resources, and also search by event source for CloudFormation and CodeDeploy.
10:31:55 Fetching data
The actual log group name is /aws/lambda/tbs-app-order-prod-CreateOrderFunction (without the suffix). Let me query it now:
10:32:00 Fetching data
Let me now check GitHub workflow runs and CodeDeploy deployments:
1. Batch get deployments
2. Get github workflow run logs
3. Git log
10:32:28 Fetching data
The most recent successful deployment has a commit message "Introduce bug" — that's highly suspicious! Let me dig deeper:
1. Batch get deployments
2. Get github workflow run logs
3. Git log
10:32:42 Fetching data
Now let me look at the actual code changes in the "Introduce bug" commit:
Git log
Get github file
10:32:54 Observation
DynamoDB ValidationException: Missing the key OrderId in the item during CreateOrder invocations
Two CreateOrder Lambda invocations at 22:27:24Z and 22:28:23Z both failed with the same DynamoDB ValidationException: "One or more parameter values were invalid: Missing the key OrderId in the item". This indicates the Lambda code is attempting a DynamoDB PutItem (or similar) operation without including the required partition key "OrderId". Both invocations followed successful Stripe PaymentIntent creation (with orderId metadata set), suggesting the error occurs between payment intent creation and DynamoDB persistence. The Lambda platform reports both invocations as "success" (status code), meaning the function caught the error rather than crashing—but the application-level ERROR log triggered the CloudWatch alarm. Notably, both invocations used 512MB of the 512MB allocated memory, and the baseline window had zero invocations, so this is not a pre-existing pattern.
DynamoDB ValidationException thrown twice: 'Missing the key OrderId in the item' — the Lambda function is writing to DynamoDB without the required partition key
Source: cloudwatch
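The observation above notes that the Lambda platform reported both invocations as successful even though the order write failed; a plausible (illustrative, not taken from the actual tbs-app-order source) handler pattern that produces exactly this behaviour is catching the DynamoDB error, logging it at ERROR level, and returning an error response:

```typescript
// Hypothetical sketch: the handler catches the DynamoDB failure rather than
// crashing, so the invocation itself "succeeds" at the platform level, while
// the application-level ERROR log line is what the alarm picks up.

type PutFn = (item: Record<string, unknown>) => Promise<void>;

export async function createOrder(putItem: PutFn, item: Record<string, unknown>) {
  try {
    await putItem(item);
    return { statusCode: 201 };
  } catch (err) {
    // This ERROR log, not a crashed invocation, is what surfaces in CloudWatch.
    console.error("ERROR persisting order", err);
    return { statusCode: 500 };
  }
}
```

Injecting the persistence call as a parameter here is purely for illustration; it makes the caught-versus-crashed distinction easy to demonstrate in isolation.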
10:34:14 Root cause
Deployment of commit 4ba4473 renamed OrderId to OrderIdX in DynamoDB DTO mapping, breaking all CreateOrder writes
At 2026-04-10T07:16-07:19Z, GitHub Actions workflow run #24231398661 deployed commit 4ba4473 ('Introduce bug') from branch feature/aiagenttest to the CreateOrder Lambda function via CodeDeploy Blue/Green (deployment d-Q063WA0TH). The commit modified src/infrastructure/persistence/order-repository.ts, renaming the DynamoDB partition key field from 'OrderId' to 'OrderIdX' in the static toDto() method. Since 'OrderId' is the required partition key of the OrderTable (tbs-app-order-prod-OrderTable-QYMTFX1MFY1E), every DynamoDB PutItem call now fails with ValidationException because the item is missing the partition key. This causes 100% of CreateOrder invocations to fail. The deployment occurred ~15 hours before the first error was observed at 22:27Z, indicating no CreateOrder requests were made in the intervening period.
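The offending change can be illustrated with a simplified sketch of the repository's toDto() mapping. Only the OrderId partition key is taken from the investigation findings; the other fields and types are illustrative, not the actual tbs-app-order source:

```typescript
// Simplified sketch of the DTO mapping in
// src/infrastructure/persistence/order-repository.ts.

interface Order {
  orderId: string;
  customerId: string;
  amount: number;
}

// Correct mapping: the item carries OrderTable's required partition key, OrderId.
function toDto(order: Order): Record<string, unknown> {
  return {
    OrderId: order.orderId, // required partition key of OrderTable
    CustomerId: order.customerId,
    Amount: order.amount,
  };
}

// The effect of commit 4ba4473: the key attribute is renamed, so the item no
// longer contains the partition key and every PutItem fails with
// ValidationException.
function toDtoBuggy(order: Order): Record<string, unknown> {
  return {
    OrderIdX: order.orderId, // wrong attribute name
    CustomerId: order.customerId,
    Amount: order.amount,
  };
}

// A cheap guard that would have caught this before deployment:
function hasPartitionKey(item: Record<string, unknown>): boolean {
  return "OrderId" in item;
}
```

A unit test asserting the presence of the partition key in the mapped item would have failed this commit in the pipeline, before CodeDeploy ever shipped it.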
Investigation completed
Root Cause
Information within this tab summarises the investigation timeline results, including the incident impact, root cause(s) and key findings.
Mitigation Plan
Information here focuses on steps to resolve the incident. For our example the root cause was identified as being due to a bug being introduced into our GitHub source code repository. Steps are provided on how to revert the change/rollback the errant commit and to redeploy the solution to AWS.
The above steps also included sample AWS CLI commands that could be run to validate deployments and final service health, though it did specify the incorrect region for our services in our test (as of April 2026; this is likely to be resolved in a future update).
Summarise The Investigation
Finally, the Chat feature was tried, specifically with the prompt 'Summarise the Investigation', which provided the following results:

This gave additional, helpful insights into the impact of the issue, including the number of failed orders and a note of a reconciliation gap with Stripe to be amended. The chat window also suggested other, deeper queries that we could consider.
Conclusion
This demonstration highlighted how a CloudWatch alarm-triggered incident invoked automated analysis by DevOps Agent, which used the sources of information available to it to generate accurate mitigation steps and insights in less than four minutes from its commencement. Its analysis included:
- Examining CloudTrail events, to check whether infrastructure for any resources had been updated (it hadn't).
- Examining the topology to understand dependencies and architecture (and potential downstream effects) of resources. This identified that there could be a reconciliation gap between our DynamoDB table and Stripe (card payments provider)!
- Examining recent deployments of the affected resource in CloudFormation, which is used for its deployment.
- Examining the impacted Lambda's application logs in CloudWatch.
- Examining GitHub deployments and associated code changes.
- Checking for errors prior to the change (to confirm the errors are new).
- Recognising that the causing git commit was highly suspicious (though noting this commit comment was entered for the demonstration).
- Concluding that the issue was caused by the code change, which inserted records into a DynamoDB table with the wrong key; but also noting that the Lambda was using all of its allocated memory in the process, as an additional issue to be addressed!
- Generating root cause analysis and mitigation steps.
Without this automation, it is envisaged that a competent support engineer who is familiar with the application would identify the issue by first examining the Lambda's application logs (as its errors are the source of the alarm) to see the error details. At this point, however, they may not be aware of the Lambda's recent deployment; if not, it is expected that they would need to liaise with the Lambda's development team to identify and resolve the issue. The turnaround from an engineer first receiving an alarm would typically range from 30 minutes to 1-3 hours, during which, in this example, the system could be unavailable to users! It is also unlikely that analysis at this time would identify the potential memory issue, and coverage of an impact assessment (for example, to include the reconciliation gap analysis with Stripe) would depend on the maturity of the support processes in place.
It is also anticipated that DevOps Agent would be well placed to identify more obscure issues, where it is able to quickly select and scan the required sources of information; for example, slow performance due to slow RDS SQL queries leading to Lambda or API Gateway timeouts, or a new Service Control Policy being applied to the AWS account which unexpectedly results in permission errors. Such problems typically require time and AWS specialists to resolve.
The benefits of AWS DevOps Agent for incident analysis hence look fantastic, and the ability to add MCP Servers can provide further information enrichment capabilities to organisations. However, there are a few factors that should be considered:
- SRE maturity: The described solution is dependent on Cloudwatch alarms being appropriately configured, to trigger incident investigations when there is a real problem.
- Cost: Charges are based on the time that the agent spends on operational tasks such as investigations and chat queries, billed by the second. For deployments to us-east-1 this is USD $0.0083 per agent-second. For an investigation taking 4 minutes to execute (similar to our demonstration), this would equate to USD $1.992. As can be imagined, charges could accumulate rapidly if alarms are frequent! However, AWS DevOps Agent is at this time included in the free tier plan for new customers, and a free trial is also provided for the first 2 months after first use of the service. Credits are also provided to customers on paid AWS Support plans. The AWS DevOps Agent pricing page should always be consulted for the latest reliable information.
- Security: It is good to know that AWS DevOps Agent performs read-only actions against its AWS accounts to support queries and investigations; it does not support update actions (for example, for remediation). There are, however, still important security considerations to be aware of, to ensure AWS-stored data is not leaked beyond intended audiences and an organisation's security policies are maintained. These include:
- Separate Agent Spaces should be created with their own IAM roles, to maintain isolation between application boundaries and to prevent unintended access across different environments or teams. Authorisation should be configured for each DevOps Agent Space via integration with IAM Identity Center or another external identity provider, to ensure only allowed users can access these.
- While AWS DevOps Agent provides prompt injection safeguards to ensure read-only capabilities (beyond the opening of tickets and support cases), integrated external MCP servers may not offer the same protection, and these should be carefully reviewed before enabling. AWS provides further security recommendations for AWS DevOps Agent, which can be reviewed here.
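As a sanity check on the cost figures quoted above (assuming the us-east-1 rate and per-second billing), the arithmetic can be expressed as:

```typescript
// Verify the quoted cost of a 4-minute investigation at the us-east-1 rate.
const ratePerAgentSecond = 0.0083; // USD per agent-second (us-east-1)
const investigationSeconds = 4 * 60; // a 4-minute investigation
const cost = ratePerAgentSecond * investigationSeconds;

console.log(cost.toFixed(3)); // "1.992"
```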
References
- What are DevOps Agent Spaces? - https://docs.aws.amazon.com/devopsagent/latest/userguide/about-aws-devops-agent-what-are-devops-agent-spaces.html
- Best Practices for Deploying AWS DevOps Agent in Production - https://aws.amazon.com/blogs/devops/best-practices-for-deploying-aws-devops-agent-in-production/
- What is a DevOps Agent Topology? - https://docs.aws.amazon.com/devopsagent/latest/userguide/about-aws-devops-agent-what-is-a-devops-agent-topology.html
- AWS DevOps Agent Security - https://docs.aws.amazon.com/devopsagent/latest/userguide/aws-devops-agent-security.html
Disclaimer: The views and opinions expressed in this article are those of the author only.