Original blog post:
https://zenn.dev/mob_engineer/articles/aad142f848a241
AWS DevOps Agent Skills, which I've been experimenting with on my own, now has a new mode called Create skill with Chat. In this article I try it out and evaluate how well it works.
Target Audience
This article is written mainly for people who use AWS DevOps Agent. I hope it is especially useful if you:
- Want to understand the modes available in AWS DevOps Agent Skills
- Want to know the pros and cons of Create skill with Chat
This article is written with the assumption that readers have some knowledge of Skills, so detailed explanations of Skills are omitted.
About AWS DevOps Agent Skills
Skills are described in the official AWS DevOps Agent documentation.
Skills offer several benefits; I consider these the main ones:
- Specialize the agent: Focus on investigating specific environments
- Reduce investigation time: Define investigation procedures in Skills for efficient investigation
- Integrate with multiple features: Achieve third-party integrations such as GitHub Actions and Dynatrace
Additionally, Skills can be automatically generated based on investigation results from AWS DevOps Agent.
Challenges with Previous Skills Registration Methods
Before Create skill with Chat was introduced, there were two ways to register Skills: Create Skills, where you write the Skill you want yourself, and Upload Skills, where you register a Skill that was prepared in advance.
Create Skills
As the name suggests, this is a method to register Skills from scratch.
You can create highly accurate Skills by describing specifically when and how the AI agent should start its investigation, but this is a high barrier for beginners. For example, to create Skills for an AWS environment running multiple services, you first have to investigate the environment and work out what the architecture looks like.
In addition, Create Skills cannot draw on Skills registered in other agent spaces, so I find it very hard for a beginner to register Skills in an agent space that has none yet.
Upload Skills
This is a method to register Skills written in markdown files.
Compared to Create Skills, it has the advantage that you can register Skills already used in other environments, but it is harder to use in environments where GitHub is not available.
The environment investigation needed to write the Skills is still required, so this method also feels difficult for beginners.
With the newly introduced Create skill with Chat, you should be able to register Skills suited to your actual environment even without much knowledge of Skills.
Let's Try It
To test the Create skill with Chat feature, I build a new agent space. Creating an agent space takes only two or three steps in the management console, so I omit the procedure here.
I also build the AWS resources used to validate the Skills feature in advance. They are written in Terraform so they can be managed as code.
The environment is a simple website built from CloudFront, S3, ALB, and EC2. CloudWatch Logs is also enabled so that errors can be detected. I injected the following eight faults:
Category 1: Network & Connectivity Issues
| # | Fault | Impact | Detection Method |
|---|---|---|---|
| 1 | EC2 health check response port (ephemeral port) not allowed in ALB security group | Health check failure → Target unhealthy | ALB TargetResponseTime, UnHealthyHostCount |
| 2 | EC2 security group outbound only allows HTTPS (443) (HTTP 80 not allowed) | External API calls from EC2 fail, package updates impossible | Application log connection timeout |
Category 2: Application & Server Issues
| # | Fault | Impact | Detection Method |
|---|---|---|---|
| 3 | EC2 disk space shortage (/var/log grows due to cron job) | Log write failure → Application crash | CloudWatch DiskSpaceUtilization, App logs |
| 4 | ALB health check path is /health but app responds at /api/health | All targets unhealthy → 503 error | ALB UnHealthyHostCount, 5xx error rate |
Category 3: CDN & Static Content Issues
| # | Fault | Impact | Detection Method |
|---|---|---|---|
| 5 | S3 bucket policy denies access from CloudFront OAC/OAI | Static content 403 error | CloudFront 4xxErrorRate |
| 6 | CloudFront cache TTL is 86400 seconds (24 hours), serving old content after deployment | Deployment reflection delay | User reports (difficult to detect with metrics) |
Category 4: Monitoring & Operations Issues
| # | Fault | Impact | Detection Method |
|---|---|---|---|
| 7 | CloudWatch Agent configuration error (wrong log file path) | Logs not collected → Investigation impossible | CloudWatch Logs IncomingLogEvents = 0 |
| 8 | Auto Scaling cooldown period is 600 seconds (10 minutes), scale-out delay | Response degradation during sudden load increase | ALB TargetResponseTime, RequestCount |
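Faults 6 and 8 are both timing problems, so it can help to read them as worst-case windows. The following is a minimal sketch, not part of the original test setup: the function names and timestamps are hypothetical, and it assumes the cached object is never invalidated and that the TTL from the table is the one actually applied at the edge.

```python
from datetime import datetime, timedelta

# Values taken from the fault tables above.
CLOUDFRONT_TTL_SECONDS = 86400   # fault 6: cache TTL
SCALING_COOLDOWN_SECONDS = 600   # fault 8: Auto Scaling cooldown


def stale_until(cached_at: datetime, ttl_seconds: int) -> datetime:
    """Latest time a cache may keep serving an object cached at `cached_at`."""
    return cached_at + timedelta(seconds=ttl_seconds)


def max_staleness_after_deploy(
    cached_at: datetime, deployed_at: datetime, ttl_seconds: int
) -> timedelta:
    """How long after a deployment the old object can still be served."""
    remaining = stale_until(cached_at, ttl_seconds) - deployed_at
    return max(remaining, timedelta(0))


if __name__ == "__main__":
    # Hypothetical timestamps: the object was cached one minute before the deploy.
    cached = datetime(2025, 1, 1, 9, 59)
    deployed = datetime(2025, 1, 1, 10, 0)
    print(max_staleness_after_deploy(cached, deployed, CLOUDFRONT_TTL_SECONDS))  # 23:59:00
    # Under the assumption above, a follow-up scale-out can wait this long.
    print(timedelta(seconds=SCALING_COOLDOWN_SECONDS))  # 0:10:00
```

In other words, an object cached just before a deployment can stay stale for almost the full 24 hours, which is why fault 6 tends to surface through user reports rather than metrics.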
Case 1: Without Using Skills
First, I check how many of the faults the agent detects without Skills.
I used the following prompt:
Our web service (CloudFront + S3 + ALB + EC2 architecture) seems to be experiencing multiple issues. Please check the CloudWatch alarm states, ALB target health status, and CloudFront error rates, and evaluate the overall health of the current environment.
Note: When I ran an AWS DevOps Agent incident investigation while Terraform was still running, the agent reported the environment as healthy.
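For reference, two of the checks named in the prompt (CloudWatch alarm states and ALB target health) can also be done by hand. The sketch below is my own illustration and not a description of what the agent does internally. It uses the boto3 `describe_alarms` and `describe_target_health` calls; the helper names, the sample data, and the target group ARN parameter are hypothetical, and CloudFront error rates are left out for brevity.

```python
from collections import Counter


def summarize_alarms(response: dict) -> Counter:
    """Count metric alarms by state in a CloudWatch describe_alarms response."""
    return Counter(alarm["StateValue"] for alarm in response.get("MetricAlarms", []))


def unhealthy_targets(response: dict) -> list[tuple[str, str]]:
    """Return (target id, reason) for each target that is not healthy,
    given an ELBv2 describe_target_health response."""
    found = []
    for desc in response.get("TargetHealthDescriptions", []):
        health = desc["TargetHealth"]
        if health["State"] != "healthy":
            found.append((desc["Target"]["Id"], health.get("Reason", "unknown")))
    return found


def collect(target_group_arn: str) -> tuple[Counter, list[tuple[str, str]]]:
    """Fetch live data. Needs boto3 and AWS credentials; not called in the demo.
    Only the first page of alarms is read here."""
    import boto3  # imported lazily so the helpers above work without it

    alarms = boto3.client("cloudwatch").describe_alarms()
    health = boto3.client("elbv2").describe_target_health(
        TargetGroupArn=target_group_arn
    )
    return summarize_alarms(alarms), unhealthy_targets(health)


if __name__ == "__main__":
    # Hand-written sample responses with hypothetical names and IDs.
    sample_alarms = {"MetricAlarms": [
        {"AlarmName": "alb-5xx", "StateValue": "ALARM"},
        {"AlarmName": "disk-usage", "StateValue": "ALARM"},
        {"AlarmName": "cpu", "StateValue": "OK"},
    ]}
    sample_health = {"TargetHealthDescriptions": [
        {"Target": {"Id": "i-0123456789abcdef0", "Port": 80},
         "TargetHealth": {"State": "unhealthy", "Reason": "Target.Timeout"}},
    ]}
    print(summarize_alarms(sample_alarms))
    print(unhealthy_targets(sample_health))
```

Keeping the parsing separate from the API calls means the helpers can be tried on saved responses without touching a live account.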
The detection results were as follows (OK = detected, NG = not detected):
Category 1: Network & Connectivity Issues
| # | Fault | Impact | Result |
|---|---|---|---|
| 1 | EC2 health check response port (ephemeral port) not allowed in ALB security group | Health check failure → Target unhealthy | OK |
| 2 | EC2 security group outbound only allows HTTPS (443) (HTTP 80 not allowed) | External API calls from EC2 fail, package updates impossible | NG |
Category 2: Application & Server Issues
| # | Fault | Impact | Result |
|---|---|---|---|
| 3 | EC2 disk space shortage (/var/log grows due to cron job) | Log write failure → Application crash | OK |
| 4 | ALB health check path is /health but app responds at /api/health | All targets unhealthy → 503 error | NG |
Category 3: CDN & Static Content Issues
| # | Fault | Impact | Result |
|---|---|---|---|
| 5 | S3 bucket policy denies access from CloudFront OAC/OAI | Static content 403 error | NG |
| 6 | CloudFront cache TTL is 86400 seconds (24 hours), serving old content after deployment | Deployment reflection delay | NG |
Category 4: Monitoring & Operations Issues
| # | Fault | Impact | Result |
|---|---|---|---|
| 7 | CloudWatch Agent configuration error (wrong log file path) | Logs not collected → Investigation impossible | OK |
| 8 | Auto Scaling cooldown period is 600 seconds (10 minutes), scale-out delay | Response degradation during sudden load increase | NG |
The agent detected three of the eight faults (1, 3, and 7). Accuracy was weakest for the CDN and static content issues, where neither fault was detected.
The agent execution time was 2 minutes and 58 seconds.
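The Case 1 result can be tallied per category with a few lines of Python. This is only a bookkeeping sketch: the dictionaries are transcribed from the tables above, and the function names are my own.

```python
# Case 1 results transcribed from the tables above: fault number -> detected?
CASE1 = {1: True, 2: False, 3: True, 4: False, 5: False, 6: False, 7: True, 8: False}

# Fault numbers grouped by the categories used in this article.
CATEGORIES = {
    "Network & Connectivity": (1, 2),
    "Application & Server": (3, 4),
    "CDN & Static Content": (5, 6),
    "Monitoring & Operations": (7, 8),
}


def detection_rate(results: dict[int, bool]) -> float:
    """Fraction of injected faults that were detected."""
    return sum(results.values()) / len(results)


def per_category(results: dict[int, bool], categories: dict[str, tuple[int, ...]]) -> dict[str, str]:
    """Detected/total for each category, formatted as 'n/m'."""
    return {
        name: f"{sum(results[n] for n in numbers)}/{len(numbers)}"
        for name, numbers in categories.items()
    }


if __name__ == "__main__":
    print(f"{detection_rate(CASE1):.1%}")  # 37.5%
    for name, score in per_category(CASE1, CATEGORIES).items():
        print(f"{name}: {score}")
```

The per-category view makes the gap visible: every category scores 1/2 except CDN & Static Content, which scores 0/2.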
Next, I create Skills and run the investigation again.
Creating Skills
Clicking the Create skill with Chat button automatically triggers Skills generation.
I wonder whether agent runs during Skills creation are billed the same way as regular runs. (As far as I could see, the official documentation does not mention this.)
Skills are generated in about 5 to 10 minutes.
The contents of the generated Skills were as follows.
The generated Skills can be downloaded, so they can also be carried over to other agent spaces.
If you want to change the contents of the Skills, clicking the regenerate button regenerates them.
I then investigated the faults again using the same prompt as before.
Although the prompt had not changed, I could confirm from the logs that the agent consulted the newly created Skills. The results were as follows (OK = detected, NG = not detected):
Category 1: Network & Connectivity Issues
| # | Fault | Impact | Result |
|---|---|---|---|
| 1 | EC2 health check response port (ephemeral port) not allowed in ALB security group | Health check failure → Target unhealthy | OK |
| 2 | EC2 security group outbound only allows HTTPS (443) (HTTP 80 not allowed) | External API calls from EC2 fail, package updates impossible | OK |
Category 2: Application & Server Issues
| # | Fault | Impact | Result |
|---|---|---|---|
| 3 | EC2 disk space shortage (/var/log grows due to cron job) | Log write failure → Application crash | OK |
| 4 | ALB health check path is /health but app responds at /api/health | All targets unhealthy → 503 error | OK |
Category 3: CDN & Static Content Issues
| # | Fault | Impact | Result |
|---|---|---|---|
| 5 | S3 bucket policy denies access from CloudFront OAC/OAI | Static content 403 error | OK |
| 6 | CloudFront cache TTL is 86400 seconds (24 hours), serving old content after deployment | Deployment reflection delay | OK |
Category 4: Monitoring & Operations Issues
| # | Fault | Impact | Result |
|---|---|---|---|
| 7 | CloudWatch Agent configuration error (wrong log file path) | Logs not collected → Investigation impossible | NG |
| 8 | Auto Scaling cooldown period is 600 seconds (10 minutes), scale-out delay | Response degradation during sudden load increase | NG |
The fault detection rate improved from 3 of 8 (37.5%) without Skills to 6 of 8 (75%) with Skills. However, neither monitoring and operations issue was identified this time, including fault 7, which had been detected without Skills. On the other hand, the agent now went as far as proposing recovery measures, so it appears to investigate more deeply than it did without Skills.
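The difference between the two runs can be expressed with simple set operations. The sketch below only restates the result tables; the set contents are transcribed from them, and the function and key names are my own.

```python
# Fault numbers marked OK in each run, transcribed from the result tables.
WITHOUT_SKILLS = {1, 3, 7}
WITH_SKILLS = {1, 2, 3, 4, 5, 6}
ALL_FAULTS = set(range(1, 9))  # faults 1 through 8


def compare(before: set[int], after: set[int], universe: set[int]) -> dict:
    """Summarize how detection changed between two runs."""
    return {
        "newly_detected": sorted(after - before),
        "regressed": sorted(before - after),
        "still_missed": sorted(universe - before - after),
        "rate_before": len(before) / len(universe),
        "rate_after": len(after) / len(universe),
    }


if __name__ == "__main__":
    for key, value in compare(WITHOUT_SKILLS, WITH_SKILLS, ALL_FAULTS).items():
        print(f"{key}: {value}")
```

Looking at the difference rather than only the rate shows that the gain is not uniform: faults 2, 4, 5, and 6 are newly detected, fault 7 is lost, and fault 8 is missed in both runs.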
Summary
My impression is that the newly introduced Create skill with Chat makes Skills easier to manage. That said, creating Skills takes somewhat long (about 5 minutes even for this simple architecture).
AWS DevOps Agent seems to be gaining new features constantly, so I plan to keep testing it.







