DEV Community

Nao San for AWS Community Builders

[AWS] Achieving AIOps with Frontier Agents [Frontier Agent]

This article is a machine translation of the contents of the following URL, which I wrote in Japanese:

https://qiita.com/Nana_777/items/22b87cd8d28e3675e5c2

Introduction

In the previous article, we explained the setup of AWS Security Agent and DevOps Agent. This article will explain more practical ways to use them.
Security Agent can perform security reviews of GitHub pull requests (PRs) and design document reviews.
DevOps Agent can use AI to investigate, analyze, and predict incidents related to GitHub repositories and resources deployed on AWS.
Used together, they bring AI into both development and operations (DevOps) and make each more efficient, which in effect achieves AIOps.

Previous Article

[AWS] AWS Security Agent & DevOps Agent Setup Guide [FrontierAgents]

https://qiita.com/Nana_777/items/b5edfacdb00c3e9f6d17

Prerequisites

This article assumes that the Security Agent and DevOps Agent have already been set up (Agent Space creation, GitHub integration, and code review activation).

Challenges of DevOps

Accelerating system development and operations while maintaining quality runs into the following challenges:

  • Security reviews fail to keep pace with development speed, becoming a release bottleneck.
  • Penetration testing is conducted only a few times a year, operating on a different timeline than the development cycle.
  • Incident response is dependent on individual employees, and preventative measures are not fed back into the next development cycle.
  • Security knowledge gained in the development phase is not carried over to the operations phase.

To address these challenges, combining two AWS Frontier Agents, "AWS Security Agent" and "AWS DevOps Agent," allows for the creation of a system where AI performs security reviews and incident monitoring more autonomously.

This article uses a simple TODO application as an example to introduce the overall structure of combining these two agents, and then walks through the setup procedure using a concrete scenario.

Two Frontier Agents

AWS Security Agent

The AWS Security Agent is a Frontier Agent that protects applications throughout the entire development lifecycle. It primarily provides three functions.

Design Security Review

Before writing code, the design documentation is validated against the organization's security requirements. Based on the requirements defined by the security team in the AWS console (approved authorization libraries, logging standards, data access policies, etc.), design flaws are identified early.

Code Security Review

Pull requests (PRs) on GitHub are automatically detected and analyzed against the organization's security requirements and common vulnerabilities (insufficient input validation, SQL injection, etc.). Findings are provided directly as PR comments on GitHub, allowing developers to receive security feedback within their normal workflow.

Penetration Testing

Multi-stage attack scenarios are executed on demand to identify vulnerabilities that cannot be detected by automated scanning tools (DAST/SAST). If vulnerabilities are found, a PR including impact analysis, reproducible attack paths, and corrective code is automatically generated on GitHub.

AWS DevOps Agent

The AWS DevOps Agent is a Frontier Agent that autonomously resolves and prevents incidents.

24/7 Autonomous Incident Response

Investigation begins the moment an alert or support ticket is generated. It correlates telemetry data from observability tools (Amazon CloudWatch, Datadog, New Relic, etc.), code change history from GitHub repositories, and deployment history from CI/CD pipelines (GitHub Actions) to identify the root cause.

Preventive Recommendations

It analyzes past incident patterns and proposes improvements in the following areas:

  • Observability: Enhanced monitoring, alerting, and logging
  • Infrastructure Optimization: Auto-scaling and capacity tuning
  • Deployment Pipeline Enhancement: Addition of tests and validations (including improvements to GitHub Actions workflows)

Application Topology

It automatically maps resources and their relationships and visualizes them as a topology graph. This helps you understand how an incident affects the overall architecture during an investigation.

GitHub-centric features

The Security Agent monitors, comments on, and generates corrective PRs for GitHub pull requests, while the DevOps Agent utilizes code changes and deployment history from GitHub repositories for analysis.

Integration Diagram of Each Service

By combining the two agents, you can build a security check and incident analysis configuration like the one shown below.

image.png

The key point is that GitHub functions as the hub for the integration of each service.

The Security Agent interacts with developers through PRs on GitHub, and the DevOps Agent utilizes GitHub's code change and deployment history for analysis.

Example Application: TODO App Configuration

This section introduces the configuration of the TODO app used in this scenario.

AWS Configuration (Serverless)

[Client]
↓ HTTPS
[API Gateway]
↓
[Lambda] ←→ [DynamoDB]
↑
[Cognito] (Authentication)
[CloudWatch] (Monitoring → DevOps Agent references)
[GitHub Actions] (CI/CD)
  • API Gateway + Lambda: CRUD API for TODOs
  • DynamoDB: Storing TODO data
  • Cognito: User authentication (JWT token)
  • CloudWatch: Metrics, logs, alarms
  • GitHub Actions: Deployment to staging and production environments (SAM or CDK)

*See the DevOps Agent topology screenshot below for the configuration diagram.

API Endpoints:

| Method | Path | Description |
| --- | --- | --- |
| POST | /todos | Create TODO |
| GET | /todos | Get your TODO list |
| GET | /todos/{id} | Get TODO details |
| PUT | /todos/{id} | Update TODO |
| DELETE | /todos/{id} | Delete TODO |
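
As a rough illustration of how these five endpoints map onto handlers, the sketch below resolves a method and path to a handler name. The handler names are taken from the Lambda file names shown later in this article; the `resolveRoute` function and its types are hypothetical and exist only for this example. In the actual application this routing is done by API Gateway, not by application code.

```typescript
type Method = "GET" | "POST" | "PUT" | "DELETE";

interface Route {
  method: Method;
  pattern: string; // path template, e.g. "/todos/{id}"
  handler: string; // name of the Lambda handler file (without extension)
}

interface RouteMatch {
  handler: string;
  id: string | undefined; // value of {id}, if the pattern has one
}

const ROUTES: Route[] = [
  { method: "POST", pattern: "/todos", handler: "createTodo" },
  { method: "GET", pattern: "/todos", handler: "getTodos" },
  { method: "GET", pattern: "/todos/{id}", handler: "getTodo" },
  { method: "PUT", pattern: "/todos/{id}", handler: "updateTodo" },
  { method: "DELETE", pattern: "/todos/{id}", handler: "deleteTodo" },
];

// Find the route whose method and path template match the request.
function resolveRoute(method: Method, path: string): RouteMatch | undefined {
  const segments = path.split("/").filter((s) => s.length > 0);
  for (const route of ROUTES) {
    if (route.method !== method) continue;
    const patternSegments = route.pattern.split("/").filter((s) => s.length > 0);
    if (patternSegments.length !== segments.length) continue;

    let id: string | undefined;
    let matched = true;
    for (let i = 0; i < patternSegments.length; i++) {
      if (patternSegments[i] === "{id}") {
        id = segments[i]; // capture the path parameter
      } else if (patternSegments[i] !== segments[i]) {
        matched = false;
        break;
      }
    }
    if (matched) return { handler: route.handler, id };
  }
  return undefined; // no matching endpoint
}
```

The three `/todos/{id}` routes are the ones that take a caller-supplied identifier, which is why they are the focus of the authorization findings later in the article.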

Although the configuration is simple, we intentionally include several vulnerabilities during development so that the Security Agent has something to detect.

Service Integration Procedure: Walking Through a Concrete Scenario

Step 1: Design Review (Security Agent × AWS Console)

Create a design document for the TODO application and upload it to the Security Agent. The design document should include the following:

  • API endpoint design
  • Authentication and authorization flow (Cognito + JWT)
  • Data model (DynamoDB table design)
  • Error Handling Policy

↓ Select Start with Web App

image.png

image.png

image.png

image.png

image.png

The Security Agent validates the design document against your organization's security requirements and returns findings.

After a short wait, the status will change to Completed.

image.png

The summary shows how many issues were found at each level of urgency.
image.png

Each review result is displayed in a list.
image.png

You can view the details of each review result.
image.png

Review results can also be downloaded as a CSV file.

The following review results were obtained this time:

| Item | Result | Points of Concern |
| --- | --- | --- |
| Authentication Best Practices | COMPLIANT | Cognito + JWT authentication is appropriate |
| Authorization Best Practices | NON_COMPLIANT | Owner checks that allow users to access only their own TODOs are not described in the design |
| Secret Protection | INSUFFICIENT_DATA | It is unclear whether Lambda uses IAM roles or how secrets are managed and rotated |
| Default Security Settings | INSUFFICIENT_DATA | Default settings such as DynamoDB encryption, API Gateway TLS enforcement, and Cognito password policy are not described |
| Log Protection | INSUFFICIENT_DATA | Masking of sensitive data, log retention period, and access control are not described |
| Information Protection | INSUFFICIENT_DATA | DynamoDB encryption and API Gateway TLS enforcement are not explicitly stated |
| Audit Log | INSUFFICIENT_DATA | Log entry contents and which events to log are not defined |
| Tenant Isolation | INSUFFICIENT_DATA | Multi-tenant configuration, but no owner verification code |
| Custom Requirement (Owner Verification) | INSUFFICIENT_DATA | Owner verification logic for GET/PUT/DELETE /todos/{id} is not described in the design |

The developer will revise the design based on these points and proceed with implementation.

Step 2: Code Review (Security Agent × GitHub PR)

Implement the code based on the revised design and create a PR on GitHub. Here, the PR will intentionally include several vulnerabilities to observe how the Security Agent works.

todo-api-security-demo/
├── bin/
│ └── todo-api.ts # CDK entry point
├── lib/
│ └── todo-api-stack.ts # CDK stack definition
├── lambda/
│ ├── createTodo.ts # No input validation (intentional)
│ ├── getTodos.ts
│ ├── getTodo.ts # No authorization check (intentional)
│ ├── updateTodo.ts # No authorization check or input validation (intentional)
│ └── deleteTodo.ts # No authorization check (intentional)
├── package.json
├── cdk.json
└── tsconfig.json

When you create a PR, the Security Agent automatically detects and analyzes the code. For this test, I submitted a pull request containing vulnerable code.
image.png

The Security Agent posts a message indicating that the PR is under review.

image.png

After the Security Agent's review is complete, the review results are returned.
image.png

Security Agent's findings clearly state three points: "What is the problem?", "Why is it important?", and "How should it be fixed?", allowing developers to take immediate action.

What is the problem? The catch block serializes both (error as Error).message and (error as Error).stack into the HTTP 500 response body. In a Lambda environment, .stack exposes internal file paths (e.g., /var/task/lambda/createTodo.js:NN), the DynamoDB table name, and the AWS SDK call chain.

Why is this important? If stack traces and error messages leak, attackers gain precise knowledge of internal implementation details such as table names and code structure, which directly reduces the effort required for a targeted attack.

What are the recommendations? Replace the response body in the catch block with a hardcoded static string only: body: JSON.stringify({ message: 'Internal server error' }). Log the full error only to CloudWatch: console.error(error).
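
The recommendation in this finding can be sketched as follows. This is a minimal illustration rather than the actual code from the demo repository: the `ApiResponse` type, `toErrorResponse`, and `runHandler` are hypothetical names defined locally so the example does not depend on any AWS type packages.

```typescript
// Minimal response shape, defined locally for this sketch.
interface ApiResponse {
  statusCode: number;
  body: string;
}

// Build the client-facing 500 response. The full error is logged
// (on Lambda, console.error output ends up in CloudWatch Logs), but
// nothing from the error object is copied into the response body.
function toErrorResponse(error: unknown): ApiResponse {
  console.error("Error:", error);
  return {
    statusCode: 500,
    body: JSON.stringify({ message: "Internal server error" }),
  };
}

// Hypothetical wrapper showing where the helper sits in a catch block.
function runHandler(work: () => ApiResponse): ApiResponse {
  try {
    return work();
  } catch (error) {
    return toErrorResponse(error);
  }
}
```

Keeping the static response in one helper means every handler returns the same body on failure, so a reviewer only has to check one place for leaks.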

Below are some of the issues discovered:

  • Information leakage (stack trace)
  • Missing owner check (IDOR)
  • XSS (Cross-site scripting)

Developers should review the reports on GitHub and commit fixes. If the Security Agent re-checks and finds no issues, merge the PR.
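
For the missing owner check (IDOR), a fix could look roughly like the sketch below. It assumes each TODO item stores the creator's Cognito user ID in an `ownerId` attribute and that the handler has already extracted the caller's `sub` claim from the JWT; both the attribute name and the `authorizeRead` function are assumptions made for this illustration, not details from the demo repository.

```typescript
interface TodoItem {
  id: string;
  ownerId: string; // assumed attribute: the creator's Cognito "sub"
  title: string;
}

type AccessResult =
  | { statusCode: 200; todo: TodoItem }
  | { statusCode: 404 };

// Decide whether the caller may read the item that was looked up by ID.
// A missing item and an item owned by someone else both yield 404, so the
// response does not reveal whether another user's TODO exists.
function authorizeRead(callerSub: string, item: TodoItem | undefined): AccessResult {
  if (item === undefined || item.ownerId !== callerSub) {
    return { statusCode: 404 };
  }
  return { statusCode: 200, todo: item };
}
```

Returning 404 instead of 403 for items owned by another user is a design choice in this sketch; a team may prefer 403 if its API conventions require it. The same comparison would apply to the update and delete handlers.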

Step 3: Incident Monitoring (DevOps Agent × Production Environment)

After correcting the code review findings, merging, and deploying, the DevOps Agent updates the application topology. For a TODO app, the dependency between API Gateway → Lambda → DynamoDB is visualized as the topology.

image.png
image.png

The DevOps Agent correlates the following data to monitor the impact of deployments:

  • CloudWatch metrics (Lambda error rate, API Gateway 5xx rate, DynamoDB throttling)

  • CloudWatch Logs (Lambda execution logs)

  • GitHub repository code change diffs

  • GitHub Actions Deployment History
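
To make the idea of correlating deployments with errors concrete, the sketch below picks the most recent deployment that happened shortly before an alarm fired. It is only a toy model of one correlation step and says nothing about how the DevOps Agent works internally; the `Deployment` type, `findSuspectDeployment`, and the lookback window are all hypothetical.

```typescript
interface Deployment {
  commitSha: string;  // hypothetical identifier of the deployed commit
  deployedAt: number; // deployment time in epoch milliseconds
}

// Return the latest deployment that finished at or before the alarm time
// and no earlier than `lookbackMs` before it, or undefined if there is none.
function findSuspectDeployment(
  deployments: Deployment[],
  alarmAt: number,
  lookbackMs: number,
): Deployment | undefined {
  let suspect: Deployment | undefined;
  for (const d of deployments) {
    const inWindow = d.deployedAt <= alarmAt && alarmAt - d.deployedAt <= lookbackMs;
    if (inWindow && (suspect === undefined || d.deployedAt > suspect.deployedAt)) {
      suspect = d;
    }
  }
  return suspect;
}
```

A deployment found this way is only a candidate; confirming it as the root cause still requires looking at the code diff and the error logs, which is the part the agent automates.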

Here, to observe the DevOps Agent's incident investigation behavior, we intentionally deploy code containing a bug to generate an error. We introduce a bug in the Lambda handler that causes a runtime error, resulting in a 500 error when a request is sent to the API. When a CloudWatch alarm (Lambda error rate threshold exceeded) is triggered, the DevOps Agent automatically begins an investigation.
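
The alarm condition described above (Lambda error rate exceeding a threshold) can be modeled as follows. This is a simplified sketch, assuming the error rate is errors divided by invocations per period and that the alarm fires when every one of the last N periods exceeds the threshold. Real CloudWatch alarms have additional settings that this model ignores, and the type and function names are hypothetical.

```typescript
interface MetricPeriod {
  invocations: number; // Lambda invocations in the period
  errors: number;      // Lambda errors in the period
}

// Error rate as a percentage; a period with no traffic counts as 0%.
function errorRatePercent(p: MetricPeriod): number {
  return p.invocations === 0 ? 0 : (p.errors / p.invocations) * 100;
}

// True when each of the most recent `evaluationPeriods` periods
// has an error rate strictly above `thresholdPercent`.
function isAlarming(
  periods: MetricPeriod[],
  thresholdPercent: number,
  evaluationPeriods: number,
): boolean {
  if (evaluationPeriods < 1 || periods.length < evaluationPeriods) return false;
  const recent = periods.slice(-evaluationPeriods);
  return recent.every((p) => errorRatePercent(p) > thresholdPercent);
}
```

With a bug that makes every request fail, the error rate jumps to 100% as soon as traffic arrives, so even a conservative threshold fires quickly.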

Confirm that the CloudWatch alarm has fired.
image.png

Investigate recent alarms with the DevOps Agent.
image.png

image.png

An alarm investigation is initiated.

image.png

The investigation status can be checked on the dashboard.
image.png

After the investigation is complete, the "Investigation Timeline" will show what was investigated and the underlying cause.

The DevOps Agent correlated GitHub pull request history, code change diffs, and CloudWatch error logs to accurately identify the root cause.
image.png
image.png

The "Root Cause" tab provides explanations of the root cause and its scope of impact.
image.png

Clicking the "Generate mitigation plan" button displayed in the Root Cause tab creates a mitigation plan.

The mitigation plan outlines specific actions for resolution.
image.png

Summary of Mitigation Plan:

  • Recommends creating a new PR to revert PR #4, reapplying the security fixes from PR #3, and redeploying with CDK.
  • Proposes the following requirement changes for agent-enabled specifications:
      • Implement input validation for the todoId parameter in getTodo.ts
      • Implement secure error handling in all Lambda handlers (do not include stack traces in the response)
      • Implement user ownership checks in getTodo.ts
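
The first proposed requirement, input validation for the todoId parameter, might be implemented along the lines of the sketch below. It assumes TODO IDs are UUIDs, which the article does not state; if the application uses a different ID format, the pattern would change accordingly. `validateTodoId` and `ValidationResult` are hypothetical names.

```typescript
// Assumption for this sketch: TODO IDs are UUIDs in canonical text form.
const UUID_PATTERN =
  /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

type ValidationResult =
  | { ok: true; todoId: string }
  | { ok: false; statusCode: 400; message: string };

// Validate the raw path parameter before it is used in a database lookup.
function validateTodoId(raw: string | undefined): ValidationResult {
  if (raw === undefined || raw.length === 0) {
    return { ok: false, statusCode: 400, message: "todoId is required" };
  }
  if (!UUID_PATTERN.test(raw)) {
    return { ok: false, statusCode: 400, message: "todoId must be a UUID" };
  }
  return { ok: true, todoId: raw.toLowerCase() }; // normalize case
}
```

Rejecting malformed IDs with a 400 response before touching DynamoDB turns what would otherwise be a runtime error (and a 500) into a predictable client error.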

Detected incidents and their remediation steps are registered as Issues on GitHub.

Because the DevOps Agent compiles the incident details and remediation steps, registering them as a GitHub Issue takes little effort.
A cycle can then be established in which registered Issues are addressed using Kiro, reviewed by the Security Agent, deployed, and then monitored again by the DevOps Agent.

Note: Extending DevOps Agent with MCP Servers
DevOps Agent supports MCP (Model Context Protocol), and its functionality can be extended by adding an MCP server to the Agent Space. For example, by hosting and connecting to a GitHub MCP server, it is possible to automatically create GitHub Issues based on the investigation results. However, currently, a hosted endpoint URL is not provided, so you will need to host your own MCP server.

Step 4: Preventive Recommendations (DevOps Agent)

DevOps Agent analyzes the application configuration and past incident patterns and presents preventive recommendations.
Select "Prevention" from the left menu of the web app.
image.png

Select "Run Now".

image.png

Please wait a moment for the results to appear.

image.png

↓ Summary of the content

Category: Governance
Recommendation: "Implement CI/CD automation and comprehensive quality gates using GitHub Actions to prevent code regressions from reaching production."
Background: In this incident, code without input validation was deployed to production without quality gates, causing 100% of GetTodoFunction requests to fail.
Specific Proposal: Implement a system that runs automated tests, code coverage verification (85% or higher), TypeScript type checking, and ESLint static analysis when PRs are merged, and only allows CDK deployment after all checks have passed.

Based on the root cause identified in the incident investigation in Step 3 (a change that reverted the security fix reached production without quality gates), specific actions to prevent recurrence are presented.
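
The gate logic in this recommendation boils down to "deploy only if every check passes". The sketch below expresses that decision as a function. In practice the checks would run as GitHub Actions jobs; the `CheckResults` shape and function names here are hypothetical, and only the 85% coverage threshold comes from the recommendation itself.

```typescript
interface CheckResults {
  testsPassed: boolean;     // automated tests
  coveragePercent: number;  // measured code coverage
  typeCheckPassed: boolean; // TypeScript type checking
  lintPassed: boolean;      // ESLint static analysis
}

// Coverage threshold taken from the recommendation (85% or higher).
const MIN_COVERAGE_PERCENT = 85;

// List the gates that failed, in a fixed order.
function failedGates(results: CheckResults): string[] {
  const failures: string[] = [];
  if (!results.testsPassed) failures.push("tests");
  if (results.coveragePercent < MIN_COVERAGE_PERCENT) failures.push("coverage");
  if (!results.typeCheckPassed) failures.push("typecheck");
  if (!results.lintPassed) failures.push("lint");
  return failures;
}

// CDK deployment is allowed only when no gate has failed.
function canDeploy(results: CheckResults): boolean {
  return failedGates(results).length === 0;
}
```

Returning the list of failed gates rather than a bare boolean makes it easier to report in the PR which check blocked the deployment.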

Step 5: Reflection in the Next Development Cycle

The DevOps Agent's investigation results, mitigation plan, and preventive recommendations are fed into the next development cycle. Here, we add the requirements presented by the DevOps Agent in Step 3 to the Security Agent's custom security requirements so that they are automatically verified in subsequent code reviews.

This time, we will add "secure handling of error responses," which was pointed out in the mitigation plan in Step 3. Also, the "CI/CD quality gate using GitHub Actions" suggested in the preventative recommendation in Step 4 will be addressed separately as a task for the development team.

Adding a custom requirement for the Security Agent

image.png
image.png
↓ Enter the following settings and select "Create and enable security requirement".

| Item | Content |
| --- | --- |
| Security Requirement Name | Secure Handling of Error Responses |
| Description | Do not include internal implementation details (stack trace, error message, file path) in Lambda handler error responses |
| Applicability | Applies to all Lambda handlers that process API endpoints, especially those that return error responses in catch blocks. |
| Compliance Requirements | Compliant: When an error is caught in a catch block, the response should only return a generic error message (e.g., "Internal server error"), and detailed error information should only be logged to CloudWatch Logs using console.error. Violation: Implementations that include (error as Error).message or (error as Error).stack in the response body of the catch block. |
| Corrective Guidance | Replace the catch block response with JSON.stringify({ message: "Internal server error" }) and log it using console.error("Error:", error). |

By adding this requirement, if a developer submits a pull request in the next development cycle with code that includes a stack trace in the response, the Security Agent will automatically detect it as a violation of the organization's requirements.
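
As a complement to the agent's review, a team could also assert in its own tests that error responses carry no implementation details. The sketch below is a naive runtime check, not a description of how the Security Agent detects violations: the two patterns (a stack-frame-like `at ...:line:column` fragment and the `/var/task/` path mentioned in the Step 2 finding) and the `findLeaks` function are assumptions chosen for illustration and will not catch every leak.

```typescript
// Illustrative patterns only; a real check would need a broader list.
const LEAK_PATTERNS: { name: string; pattern: RegExp }[] = [
  { name: "stack frame", pattern: /\bat .+:\d+:\d+\)?/ },
  { name: "Lambda file path", pattern: /\/var\/task\// },
];

// Return the names of all leak patterns found in a response body.
function findLeaks(body: string): string[] {
  return LEAK_PATTERNS
    .filter(({ pattern }) => pattern.test(body))
    .map(({ name }) => name);
}
```

A unit test for each handler can force an error, pass the resulting response body to `findLeaks`, and fail if the returned list is not empty.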

This completes the integration.

  • Steps 1-2: Security Agent detects design and code vulnerabilities
  • Step 3: DevOps Agent investigates the incident and presents the root cause and mitigation plan
  • Step 4: DevOps Agent presents preventative recommendations (implementation of CI/CD quality gates)
  • Step 5: Reflect the mitigation plan and preventative recommendations in the Security Agent's organizational requirements and development tasks

In the next development cycle, the Security Agent will perform a review with the newly added organizational requirements reflected, so the same types of problems will be detected at the design and code stage.

Supplement: Utilizing Kiro IDE when automatic remediation is insufficient

While many cases can be handled by the automatically generated fix PRs from the Security Agent within the service integration, not all cases are covered.

For example, recommendations requiring architectural-level changes, such as "implementation of CI/CD quality gates" in Step 4, need to be implemented by the developer. In such cases, Kiro IDE complements the developer's efforts.

  • Proceed with the implementation while discussing a request such as "We want to add a quality gate to GitHub Actions" in Kiro's chat.
  • By defining the team's coding conventions and architectural patterns in the Steering file (.kiro/steering/), we can generate code that conforms to those conventions.

While the Security Agent and DevOps Agent are the main players in this collaboration, Kiro becomes a reliable partner for "fixes that require human thought and implementation."

Key Points for Implementation

Step-by-Step Implementation Approach

There's no need to build the entire service integration at once. You can adopt it step by step as follows:

Step 1: Start with Security Agent Code Review

This is the easiest approach, as you can begin simply by connecting a GitHub repository. First, enable code review for an existing repository and see what kind of feedback you receive.

Step 2: Add Penetration Testing

Once you've seen the benefits of code review, add penetration testing. Run it on a staging environment and see if you can find vulnerabilities that automated scanning tools might miss.

Step 3: Deploy DevOps Agent to the Production Environment

Connect the observability tool (CloudWatch) and GitHub repository to the DevOps Agent and enable incident investigation and preventative recommendations.

Step 4: Inter-Service Integration Configuration

Establish an operational flow that reflects DevOps Agent's preventative recommendations in the Security Agent's organizational requirements. This completes the integration.

Considerations

  • Defining organizational security requirements is crucial for quality: Security Agent performs reviews based on defined requirements, so ambiguous requirements will reduce the accuracy of the review. It is recommended to start small and gradually enrich the requirements.
  • GitHub repository permission settings: Ensure that you properly configure which repositories Security Agent can access.
  • Integration with GitHub Actions pipelines: Decide at which point in the deployment pipeline penetration tests should run.
  • Division of roles between teams: Clearly defining roles, such as having the security team define organizational requirements, the development team handle code reviews, and the operations team evaluate preventative recommendations, will ensure smoother operation.

Summary

By combining AWS Security Agent and DevOps Agent, you can build a collaborative configuration of "Design → Code Review → Deployment → Incident Monitoring → Prevention → Next Development Cycle".

This time, we verified it with a simple TODO app. The Security Agent detected a missing authorization check as NON_COMPLIANT during the design review and pointed out IDOR, stack trace leaks, and XSS vulnerabilities at the line-of-code level during the code review. The DevOps Agent, upon incident occurrence, correlated GitHub PR history and code changes to accurately identify the root cause and presented mitigation plans and preventative recommendations (implementation of CI/CD quality gates).

This integration enables the following:

  • GitHub-centric workflow: Developers can receive security feedback without disrupting their usual GitHub workflow.
  • No human bottleneck: Security reviews and incident investigations are performed autonomously by AI agents.
  • Closed feedback loop: The DevOps Agent's findings and preventative recommendations are reflected in the Security Agent's organizational requirements and automatically verified in the next development cycle.

Start with GitHub PR reviews using the Security Agent. Simply connecting your GitHub repository is the first step towards accelerating DevOps.

Setup instructions are explained in the following article.

https://dev.to/aws-builders/aws-aws-security-agent-devops-agent-setup-guide-frontieragents-lbn
