Masaki Okuda for AWS Community Builders

Posted on May 5

Let's Create AWS DevOps Agent Skills via Chat

#devopsagent #aws #skills

Original blog post:
https://zenn.dev/mob_engineer/articles/aad142f848a241

AWS DevOps Agent Skills, which I've been experimenting with on my own, now has a new mode called Create skill with Chat, so I tried it out to see how well it works.

Target Audience

This article is written mainly for people who use AWS DevOps Agent. I hope it is especially useful for:

Those who want to understand the modes of AWS DevOps Agent
Those who want to know the pros and cons of Create skill with Chat

This article assumes that readers have some knowledge of Skills, so detailed explanations of Skills are omitted.

About AWS DevOps Agent Skills

Skills are described in the official documentation:

DevOps Agent Skills

There are several benefits to using Skills, but I believe the main ones are:

Specialize the agent: Focus on investigating specific environments
Reduce investigation time: Define investigation procedures in Skills for efficient investigation
Integrate with multiple features: Achieve third-party integrations such as GitHub Actions and Dynatrace

Additionally, Skills can be automatically generated based on investigation results from AWS DevOps Agent.

Learned Skills

Challenges with Previous Skills Registration Methods

Before Create skill with Chat was introduced, there were already two ways to register Skills.
Create Skills lets you write the Skills you want to implement yourself, and Upload Skills lets you register Skills that were created in advance.

Create Skills

As the name suggests, this method registers Skills from scratch.
You can create highly accurate Skills by describing specifically when and how the AI agent should start its investigation, but the barrier is high for beginners. For example, to create Skills for an AWS environment running multiple services, you first need to investigate the environment and work out what the architecture looks like.

Also, because Skills registered in other agent spaces are not shared, beginners have nothing to build on in an agent space where no Skills are registered yet, which makes the first registration especially difficult.

Upload Skills

This method registers Skills written as markdown files.
Compared to Create Skills, it has the advantage of letting you reuse Skills from other environments, but it is harder to use in environments where GitHub is not available.

In addition, you still need to investigate the environment in order to write the Skills, so this method also seems difficult for beginners.

With the newly introduced Create skill with Chat, it appears that you can register Skills suited to your actual environment even without much knowledge of Skills.

Let's Try It

To test the Create skill with Chat feature, I built a new agent space. Creating an agent space takes only two or three operations in the management console, so the procedure is omitted here.

I also built resources in the AWS environment in advance to validate the Skills feature. The resources are written in Terraform so they can be managed as code.

Source Code

The environment is a simple website built on CloudFront, S3, ALB, and EC2. CloudWatch Logs is also enabled so that errors can be detected. The following 8 faults are injected:

Category 1: Network & Connectivity Issues

#	Fault	Impact	Detection Method
1	EC2 health check response port (ephemeral port) not allowed in ALB security group	Health check failure → Target unhealthy	ALB TargetResponseTime, UnHealthyHostCount
2	EC2 security group outbound only allows HTTPS (443) (HTTP 80 not allowed)	External API calls from EC2 fail, package updates impossible	Application log connection timeout
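As an illustration of fault 2, the following minimal sketch checks whether a set of security group egress rules permits a given port. The rule dictionaries follow the `IpPermissionsEgress` shape returned by the EC2 DescribeSecurityGroups API; the function name `egress_allows` and the sample rule set are hypothetical and are not taken from the article's Terraform code. The sketch looks only at protocol and port range and ignores destination CIDRs.

```python
from typing import Dict, Iterable


def egress_allows(rules: Iterable[Dict], port: int, protocol: str = "tcp") -> bool:
    """Return True if any egress rule permits the given port and protocol."""
    for rule in rules:
        rule_protocol = rule.get("IpProtocol")
        if rule_protocol == "-1":
            # "-1" means all protocols and all ports.
            return True
        if rule_protocol != protocol:
            continue
        from_port = rule.get("FromPort")
        to_port = rule.get("ToPort")
        if from_port is None or to_port is None:
            continue
        if from_port <= port <= to_port:
            return True
    return False


# Hypothetical rule set mirroring fault 2: outbound HTTPS (443) only.
sample_egress = [
    {
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
    },
]

print(egress_allows(sample_egress, 443))  # HTTPS is allowed
print(egress_allows(sample_egress, 80))   # HTTP is blocked, as in fault 2
```

With a rule set like this one, outbound HTTP requests from the instance time out, which matches the "connection timeout" symptom listed as the detection method.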

Category 2: Application & Server Issues

#	Fault	Impact	Detection Method
3	EC2 disk space shortage (/var/log grows due to cron job)	Log write failure → Application crash	CloudWatch DiskSpaceUtilization, App logs
4	ALB health check path is /health but app responds at /api/health	All targets unhealthy → 503 error	ALB UnHealthyHostCount, 5xx error rate

Category 3: CDN & Static Content Issues

#	Fault	Impact	Detection Method
5	S3 bucket policy denies access from CloudFront OAC/OAI	Static content 403 error	CloudFront 4xxErrorRate
6	CloudFront cache TTL is 86400 seconds (24 hours), serving old content after deployment	Deployment reflection delay	User reports (difficult to detect with metrics)

Category 4: Monitoring & Operations Issues

#	Fault	Impact	Detection Method
7	CloudWatch Agent configuration error (wrong log file path)	Logs not collected → Investigation impossible	CloudWatch Logs IncomingLogEvents = 0
8	Auto Scaling cooldown period is 600 seconds (10 minutes), scale-out delay	Response degradation during sudden load increase	ALB TargetResponseTime, RequestCount
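The detection method for fault 7 (IncomingLogEvents = 0) can be sketched as a small check over metric datapoints. This is only an illustration: the datapoint dictionaries follow the `Datapoints` shape returned by the CloudWatch GetMetricStatistics API with the `Sum` statistic, while the function name `log_ingestion_stalled` and the sample values are hypothetical. An empty datapoint list is treated as stalled here, on the assumption that no datapoints are published when no log events arrive.

```python
from typing import Dict, List


def log_ingestion_stalled(datapoints: List[Dict]) -> bool:
    """Return True when no log events arrived in the sampled window."""
    total_events = sum(point.get("Sum", 0) for point in datapoints)
    return total_events == 0


# Hypothetical samples: a log group that receives events, and one that does not.
healthy_window = [{"Sum": 12.0, "Unit": "None"}, {"Sum": 8.0, "Unit": "None"}]
silent_window = []

print(log_ingestion_stalled(healthy_window))  # events are arriving
print(log_ingestion_stalled(silent_window))   # nothing collected, as in fault 7
```

A check like this only shows that logs stopped arriving. Finding the cause, such as the wrong log file path in the CloudWatch Agent configuration, still requires looking at the instance itself.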

Case 1: Without Using Skills

First, I checked how many of the faults are detected without using Skills.
I used the following prompt:

Our web service (CloudFront + S3 + ALB + EC2 architecture) seems to be experiencing multiple issues. Please check the CloudWatch alarm states, ALB target health status, and CloudFront error rates, and evaluate the overall health of the current environment.

Note: When I ran an AWS DevOps Agent incident investigation while Terraform was still running, the result showed a healthy state.

The detection results were as follows (OK = detected, NG = not detected):

Category 1: Network & Connectivity Issues

#	Fault	Impact	Result
1	EC2 health check response port (ephemeral port) not allowed in ALB security group	Health check failure → Target unhealthy	OK
2	EC2 security group outbound only allows HTTPS (443) (HTTP 80 not allowed)	External API calls from EC2 fail, package updates impossible	NG

Category 2: Application & Server Issues

#	Fault	Impact	Result
3	EC2 disk space shortage (/var/log grows due to cron job)	Log write failure → Application crash	OK
4	ALB health check path is /health but app responds at /api/health	All targets unhealthy → 503 error	NG

Category 3: CDN & Static Content Issues

#	Fault	Impact	Result
5	S3 bucket policy denies access from CloudFront OAC/OAI	Static content 403 error	NG
6	CloudFront cache TTL is 86400 seconds (24 hours), serving old content after deployment	Deployment reflection delay	NG

Category 4: Monitoring & Operations Issues

#	Fault	Impact	Result
7	CloudWatch Agent configuration error (wrong log file path)	Logs not collected → Investigation impossible	OK
8	Auto Scaling cooldown period is 600 seconds (10 minutes), scale-out delay	Response degradation during sudden load increase	NG

Only 3 of the 8 faults (1, 3, and 7) were detected, and neither of the CDN and static content faults was found.
The agent execution time was 2 minutes and 58 seconds.

Now, let's create Skills and run the investigation again.

Creating Skills

Clicking the Create skill with Chat button starts Skills generation automatically.
I wonder whether the agent's run time during Skills creation is billed in the same way as a normal investigation. (As far as I could see, the official documentation does not mention this.)

Skills are generated in about 5 to 10 minutes.

The contents of the generated Skills were as follows.
Since the Skills can be downloaded, they can also be copied to other agent spaces.

If you want to change the contents of the Skills, click the regenerate button to generate them again.

I then investigated the faults again using the same prompt as before.
Although the prompt had not changed, I could confirm from the logs that the agent consulted the created Skills.

Category 1: Network & Connectivity Issues

#	Fault	Impact	Result
1	EC2 health check response port (ephemeral port) not allowed in ALB security group	Health check failure → Target unhealthy	OK
2	EC2 security group outbound only allows HTTPS (443) (HTTP 80 not allowed)	External API calls from EC2 fail, package updates impossible	OK

Category 2: Application & Server Issues

#	Fault	Impact	Result
3	EC2 disk space shortage (/var/log grows due to cron job)	Log write failure → Application crash	OK
4	ALB health check path is /health but app responds at /api/health	All targets unhealthy → 503 error	OK

Category 3: CDN & Static Content Issues

#	Fault	Impact	Result
5	S3 bucket policy denies access from CloudFront OAC/OAI	Static content 403 error	OK
6	CloudFront cache TTL is 86400 seconds (24 hours), serving old content after deployment	Deployment reflection delay	OK

Category 4: Monitoring & Operations Issues

#	Fault	Impact	Result
7	CloudWatch Agent configuration error (wrong log file path)	Logs not collected → Investigation impossible	NG
8	Auto Scaling cooldown period is 600 seconds (10 minutes), scale-out delay	Response degradation during sudden load increase	NG

Compared to the run without Skills, the fault detection rate improved from 37.5% (3 of 8) to 75% (6 of 8), but neither of the monitoring and operations issues was identified. Fault 7, which was detected without Skills, was missed this time. However, since the agent went on to propose recovery measures, it appears to be investigating more deeply than it did without Skills.
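The comparison between the two runs can be reproduced with a few lines of arithmetic. The sets below are read directly from the OK rows of the two result tables; the variable and function names are only illustrative.

```python
FAULT_COUNT = 8

# Fault numbers marked OK in each run, taken from the result tables.
detected_without_skills = {1, 3, 7}
detected_with_skills = {1, 2, 3, 4, 5, 6}


def detection_rate(detected: set, total: int = FAULT_COUNT) -> float:
    """Fraction of injected faults that the agent reported."""
    return len(detected) / total


newly_detected = sorted(detected_with_skills - detected_without_skills)
no_longer_detected = sorted(detected_without_skills - detected_with_skills)

print(f"Without Skills: {detection_rate(detected_without_skills):.1%}")  # 37.5%
print(f"With Skills:    {detection_rate(detected_with_skills):.1%}")     # 75.0%
print("Newly detected:", newly_detected)          # [2, 4, 5, 6]
print("No longer detected:", no_longer_detected)  # [7]
```

The set difference shows that the improvement is not purely additive: four faults were newly detected with Skills, while fault 7 was found only in the run without Skills.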

Summary

My impression is that the newly introduced Create skill with Chat makes Skills management easier. That said, creating Skills takes a fairly long time (about 5 minutes even for this simple architecture).

AWS DevOps Agent seems to be adding new features constantly, so I plan to keep validating them.
