Resolve incidents faster with Skills in AWS DevOps Agent

Yeremy Turcios — Fri, 19 Jun 2026 06:23:12 +0000

Skills in AWS DevOps Agent allow you to define and reuse your team’s investigation procedures so the agent can follow them automatically during incident analysis. Over time, operations teams develop precise investigation procedures for their infrastructure. They know the exact sequence of checks to run when a database starts throttling or an AWS Lambda function starts erroring. The challenge is making that expertise available consistently, across every investigation.

We built AWS DevOps Agent to automate incident investigation, but we kept hearing the same feedback from customers: "The agent is good at general investigation, but it doesn't know our specific procedures." Teams had developed battle-tested investigation workflows over years of operating their infrastructure, and they wanted the agent to follow those same steps.

That's why we built skills, a way to teach AWS DevOps Agent your team's investigation procedures, operational knowledge, and troubleshooting patterns. In this post, we'll walk through what skills are, how to create them, and how they change the way the agent investigates issues in your environment.

The problem: institutional knowledge doesn't scale

Here's a scenario we see often. A team runs a microservices application on AWS. Over time, they've learned that when their Amazon RDS instance starts showing high latency, the right investigation sequence is:

Check Amazon CloudWatch alarms for DatabaseConnections exceeding 80% of max_connections
Look at ReadLatency and WriteLatency over the past hour
Pull slow queries from Performance Insights
Check if FreeStorageSpace dropped below 20%
Correlate with recent deployments

This procedure works. The team trusts it. But it's often implicit, known by experienced engineers and applied inconsistently across responders. As teams grow and operate across multiple regions and time zones, these procedures become harder to scale, leading to inconsistent investigations and longer mean time to resolution (MTTR). Without skills, the agent relies on general-purpose reasoning. It might get to the right answer, but it won't follow the specific sequence your team has validated.

What skills look like

A skill is a directory with a SKILL.md file containing the instructions you want the agent to follow. That's the only required file. Beyond that, you can add any supporting files in whatever directory structure makes sense for your team: reference docs, architecture diagrams, metric threshold tables, PDFs, images, data files.

Note: Skills containing executable scripts are not currently supported and will be rejected during upload. This includes script files anywhere in the skill directory, not just in a scripts/ folder.

Skills follow a subset of the Agent Skills specification, an open standard for packaging agent instructions. Here's what a simple skill directory looks like:

rds-performance-investigation/
├── SKILL.md
└── references/
    └── rds-metrics-reference.md

The SKILL.md file starts with frontmatter (name and description), followed by the actual instructions:

---
name: rds-performance-investigation
description: "Investigation procedures for RDS performance issues including
  connection exhaustion, slow queries, replication lag, and storage capacity.
  Use when investigating database latency, connection errors, or read/write
  performance degradation."
---
# RDS Performance Investigation

Use this skill when investigating database latency, connection errors,
query timeouts, or read/write performance degradation.

## Step 1: Check alarm status

Query CloudWatch for active alarms on the affected RDS instance. Look for:

- DatabaseConnections exceeding 80% of max_connections
- ReadLatency or WriteLatency above 20ms
- FreeStorageSpace below 20% of total storage
- ReplicaLag above 30 seconds (read replicas only)

## Step 2: Analyze connection metrics

Retrieve DatabaseConnections over the past hour. If connections are near
the max_connections limit, check for connection pool misconfiguration or
long-running idle connections.

## Step 3: Identify slow queries

Use Performance Insights (pi:GetResourceMetrics) to retrieve the top SQL
statements by average active sessions. Focus on queries with high db.load
contribution or frequent I/O waits.

## Step 4: Summarize findings

Refer to [references/rds-metrics-reference.md](references/rds-metrics-reference.md)
for normal ranges and investigation thresholds.

Provide a summary with:

1. Current performance status (healthy / degraded / critical)
2. Root cause hypothesis with supporting metrics
3. Recommended remediation steps ranked by priority

And the reference file gives the agent concrete thresholds to work with:

# RDS CloudWatch Metrics Reference

| Metric | Normal Range | Investigation Threshold |
|---|---|---|
| DatabaseConnections | < 70% max_connections | > 80% max_connections |
| ReadLatency | < 5ms | > 20ms |
| WriteLatency | < 5ms | > 20ms |
| FreeStorageSpace | > 30% total storage | < 20% total storage |
| ReplicaLag | < 5 seconds | > 30 seconds |
| CPUUtilization | < 70% | > 85% |
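
To make the table concrete, here is a minimal Python sketch of how its two columns could be applied to a metric reading. It is an illustration only, not how the agent evaluates thresholds internally. The `Threshold` class, the metric key names with unit suffixes, and the intermediate "watch" band between the two columns are our assumptions; only the numbers come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    normal: float        # boundary from the "Normal Range" column
    investigate: float   # boundary from the "Investigation Threshold" column
    higher_is_worse: bool = True


# Numbers copied from the reference table. Key names are hypothetical:
# percentages are 0-100, latencies are milliseconds, lag is seconds.
THRESHOLDS = {
    "DatabaseConnectionsPct": Threshold(70, 80),
    "ReadLatencyMs": Threshold(5, 20),
    "WriteLatencyMs": Threshold(5, 20),
    "FreeStorageSpacePct": Threshold(30, 20, higher_is_worse=False),
    "ReplicaLagSeconds": Threshold(5, 30),
    "CPUUtilizationPct": Threshold(70, 85),
}


def classify(metric: str, value: float) -> str:
    """Return 'normal', 'investigate', or 'watch' for a single reading."""
    t = THRESHOLDS[metric]
    if t.higher_is_worse:
        if value > t.investigate:
            return "investigate"
        if value < t.normal:
            return "normal"
    else:
        if value < t.investigate:
            return "investigate"
        if value > t.normal:
            return "normal"
    # Assumption: readings between the two columns are neither normal
    # nor over the investigation threshold, so we label them "watch".
    return "watch"
```

FreeStorageSpace is the one metric where a lower value is worse, which is why the sketch carries a direction flag per metric rather than a single comparison.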

How skills change an investigation

Figure 1. Skills lifecycle. Operators create skills once through the Operator Web App. During an incident, AWS DevOps Agent loads the skills that match the agent type and incident context, follows the skill's instructions to investigate using AWS APIs and tools, and records each step in the Investigation Timeline.

When an investigation starts, AWS DevOps Agent fetches the catalog of skills available in your Agent Space. The catalog is filtered to skills tagged for the current agent type, with Generic skills always included, so a triage agent doesn't see skills meant only for root cause analysis. At this point the agent has each skill's name and description, but not its full content.
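
The filtering step can be pictured with a short Python sketch. This is a simplified model under our own assumptions, not the service's implementation: the `SkillEntry` class, the `"generic"` tag value, and the agent type strings are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

GENERIC = "generic"  # hypothetical tag for skills available to every agent type


@dataclass(frozen=True)
class SkillEntry:
    name: str
    description: str
    agent_types: frozenset


def visible_catalog(skills, agent_type):
    """Return (name, description) pairs the given agent type would see.

    Only the name and description are exposed at this stage; the full
    skill content is loaded later, once a skill is selected.
    """
    return [
        (s.name, s.description)
        for s in skills
        if GENERIC in s.agent_types or agent_type in s.agent_types
    ]


catalog = [
    SkillEntry("rds-performance-investigation",
               "RDS latency and connection issues", frozenset({"incident-rca"})),
    SkillEntry("impact-assessment",
               "Who is affected by this incident", frozenset({"incident-triage"})),
    SkillEntry("team-glossary",
               "Service names and owners", frozenset({GENERIC})),
]

triage_view = visible_catalog(catalog, "incident-triage")
```

In this model the triage agent sees the impact-assessment and glossary skills but never the RCA-only RDS skill, which mirrors the behavior described above.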

The agent reads the descriptions and decides which skills are relevant to the current incident. This is why clear, specific descriptions matter: they're how the agent knows whether to use a skill. Multiple skills can be selected for a single investigation. For example, the agent might pull in an RDS performance skill alongside a deployment rollback skill when both apply.

When the agent loads a skill, its instructions become part of the agent's working context. The agent follows the steps, querying the AWS APIs the skill calls for and reading any reference files the skill points to. A skill can also extend the agent's toolset. For example, a metrics skill might unlock provider-specific query tools that aren't loaded by default. Each step the agent takes, including reading a skill, is recorded in the Investigation Timeline so you can audit exactly which skills were used and what they produced.

To see this in practice, let's compare how the agent handles the same RDS latency incident with and without this skill.

Without a skill, the agent starts from general knowledge. It knows RDS is a database service and that CloudWatch has relevant metrics, so it begins querying broadly. It might check CPU utilization first, then look at storage, then eventually get to connection metrics. It reaches a reasonable conclusion, but the investigation path is generic. It doesn't know that your team has learned to check DatabaseConnections first because that's been the root cause 80% of the time in your environment. It doesn't know your specific thresholds, and it doesn't consult your team's metrics reference table.

With the skill above, the investigation changes. The agent recognizes that a skill exists for RDS performance issues and loads it. Now it follows your team's exact procedure: it checks DatabaseConnections against your 80% threshold first, then moves to ReadLatency and WriteLatency, pulls slow queries from Performance Insights, and checks FreeStorageSpace. It references your metrics table to distinguish normal ranges from investigation thresholds. The investigation follows the same path your senior engineers would take, every time.

The difference isn't just about reaching the right answer. It's about reaching it through the right process, the one your team has validated through experience. And because skills are reusable, this happens automatically for every investigation that matches, whether it's triggered at 2 PM or 2 AM. The result is more consistent investigations across your team, faster identification of root causes, and reduced mean time to resolution (MTTR) because the agent no longer needs to explore broadly before finding the right path.

Agent types

AWS DevOps Agent runs as different agent types depending on the task. When you create or upload a skill, you choose which of these agent types can use it:

All agents (the default): Applies to all agent types.
Chat tasks: Ad-hoc questions and requests during chat sessions.
Incident Triage: Does the initial assessment when an incident arrives.
Incident RCA: Drives root cause analysis on incidents that pass triage.
Incident Mitigation: Suggests or runs remediation actions.
Evaluation: Produces proactive recommendations on your environment.
Release Readiness Review: Production-readiness change review for code and infrastructure changes.

Targeting a skill to a specific agent type keeps it from loading when it's not relevant, which reduces context consumption and improves agent focus.

How to create a skill

From a zip file

If your team already maintains investigation procedures in a repository or local directory, you can package them as a zip file and upload them directly. Here's a walkthrough:

Create a directory with a SKILL.md file and any supporting files:

rds-performance-investigation/
├── SKILL.md
└── references/
    └── rds-metrics-reference.md

Compress the directory into a zip file (maximum 6 MB).
In the Operator Web App, navigate to the Knowledge page, choose Skills, choose Add skill, and then choose Upload skill.
Drag and drop your zip file or click to browse.
Select which agent types can use this skill.
Choose Upload.

The system validates the zip file, extracts the SKILL.md frontmatter, and makes the skill available to the selected agent types.
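
If you want to catch obvious problems before uploading, a local pre-check along these lines may help. This is a rough sketch of the constraints mentioned in this post (a SKILL.md with name and description frontmatter, a 6 MB limit, no script files), not a reproduction of the service's validation. The list of script extensions and the helper names are our assumptions, and the frontmatter check is a simple line scan rather than a full YAML parse.

```python
import io
import zipfile
from pathlib import PurePosixPath

MAX_ZIP_BYTES = 6 * 1024 * 1024  # "maximum 6 MB" from the post, read here as MiB
# Assumption: the post does not list which file types count as scripts.
SCRIPT_SUFFIXES = {".sh", ".bash", ".py", ".js", ".rb", ".ps1", ".bat"}


def build_zip(files: dict) -> bytes:
    """Build an in-memory zip from {path: text}; stands in for zipping a directory."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for path, content in files.items():
            zf.writestr(path, content)
    return buf.getvalue()


def frontmatter_keys(text: str) -> set:
    """Return top-level keys found between the opening and closing '---' lines."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return set()
    keys = set()
    for line in lines[1:]:
        if line.strip() == "---":
            return keys
        # Indented lines are continuations of a multi-line value, not keys.
        if line and not line[0].isspace() and ":" in line:
            keys.add(line.split(":", 1)[0].strip())
    return set()  # frontmatter was never closed


def check_skill_zip(data: bytes) -> list:
    """Return a list of problems; an empty list means every local check passed."""
    problems = []
    if len(data) > MAX_ZIP_BYTES:
        problems.append("zip is larger than 6 MB")
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        names = [n for n in zf.namelist() if not n.endswith("/")]
        skill_md = [n for n in names if PurePosixPath(n).name == "SKILL.md"]
        if not skill_md:
            problems.append("missing SKILL.md")
        else:
            keys = frontmatter_keys(zf.read(skill_md[0]).decode("utf-8"))
            for required in ("name", "description"):
                if required not in keys:
                    problems.append(f"frontmatter is missing '{required}'")
        for n in names:
            if PurePosixPath(n).suffix.lower() in SCRIPT_SUFFIXES:
                problems.append(f"script file not allowed: {n}")
    return problems
```

Passing this check does not guarantee the upload will be accepted, since the service may apply rules this sketch does not know about. It is mainly useful as a quick guard in a CI job that packages skills from a repository.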

In the UI

For simpler skills that don't need reference files, you can write instructions directly in the Operator Web App. Navigate to Knowledge and Skills, then Add skill, then Create skill, and fill in the name, description, and instructions in Markdown.

With Chat

To create a skill with natural language, navigate to Knowledge and Skills, then Add skill, then Create skill with Chat. You can also create and manage skills directly from a chat session. Ask the agent in the chat to create, update, list, activate, or delete user skills without leaving the conversation.

From a GitHub Repository

To manage skills from a GitHub repository, navigate to Knowledge and Skills, then Add skill, then Import from Repository. Enter the repository URL, and we will import all skills in the repository.

From the AWS SDK

If you want to manage skills from scripts or automation instead of the Operator Web App, you can create them programmatically with the Asset API. Every skill is an asset you can create, read, update, and delete through the devops-agent client in the AWS CLI and AWS SDKs, using a CreateAsset call with assetType set to skill. This is useful for bulk-loading a starter set of skills into a new Agent Space or keeping skills in version control. For the full walkthrough, see Managing assets in the User Guide.

Managed skills

In addition to custom skills you create, AWS DevOps Agent can generate managed skills that capture knowledge about your environment and how the agent operates within it. Managed skills are produced by the agent itself, and can be updated by the agent or by you.

tool-use-best-practices: Learn from investigations so the agent picks the right tools faster. Eligible for generation after your Agent Space has accumulated enough completed investigations.
chat-tool-use-best-practices: Learn from your chat sessions so the agent picks the right tools faster in chat.
understanding-agent-space: Analyze all associations in your Agent Space, including cloud resources, code repositories, observability integrations, and custom MCP servers, to capture domain concepts, deployment environments, high-level architecture, critical code paths, and code-to-architecture mappings for increasing the effectiveness of incident investigations.
understanding-dependencies: A complete service-to-service and package dependency map. Use this skill to understand how repositories connect: which services call which, what events flow between them, which packages are shared, and where infrastructure boundaries lie. Useful for assessing the impact of changes, identifying upstream and downstream effects, and understanding deployment ordering.
understanding-pipeline-topology: Discover CI/CD pipeline configurations across all associated repositories, capturing pipeline stages, deployment flows, branch strategies, gates, and environment mappings for GitHub Actions, GitLab CI, Azure DevOps, Amazon Brazil pipelines, and more.

To generate a managed skill, navigate to the Skills page and go to the Managed skills section. Choose Generate for the skill you want. You can regenerate any managed skill at any time as your environment evolves, and the agent uses the latest version automatically. For more information, see Learned Skills.

Sample skills

The AWS DevOps Agent Skills GitHub page contains community-contributed skills you can use as-is or as a starting point for writing your own. Available samples include skills for AWS Health event investigation, AWS Support case analysis, EKS operational reviews, and RDS operational reviews.

To use a sample skill, import it from the GitHub repository. Alternatively, you can clone the repository, zip the skill directory, and upload it to your Agent Space. Each skill includes a README with prerequisites and usage instructions.

Tips for writing good skills

Write clear descriptions. The agent uses the skill's description to decide whether to load it during an investigation. Include the specific scenarios, services, and symptoms the skill covers.
Be specific in your instructions. Include concrete metric thresholds, specific API calls, and exact log group names. For example, "Query Amazon CloudWatch Logs Insights for error patterns in the last 2 hours" beats "check the logs."
Use descriptive names. Skill names should reflect the specific scenario they address, making it easier for your team to identify the right skill at a glance. For example, rds-throttling-investigation over database-skill.
Target agent types. Assign skills to only the agent types that need them to reduce context consumption and improve focus. For example, a triage skill doesn't need to load during root cause analysis.
Add reference files. Separate supporting content like metric thresholds and architecture docs into their own files. This keeps SKILL.md focused on the investigation workflow while giving the agent detailed reference material to consult.
Keep skills focused. Build single-purpose skills rather than one large skill that covers everything. The agent can compose multiple skills during complex incidents, so a skill for "RDS performance" and a separate skill for "deployment rollback" work better together than a single combined skill.
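
To show how specific the "last 2 hours" instruction from the tips above can get, here is a small Python sketch that builds the parameters for a CloudWatch Logs Insights query. The parameter names match the CloudWatch Logs `StartQuery` API. The log group name, the `ERROR` filter pattern, and the 5-minute bucket size are hypothetical choices for illustration, so substitute the values your team actually uses.

```python
from datetime import datetime, timedelta, timezone


def build_error_query(log_group: str, now: datetime, hours: int = 2) -> dict:
    """Build StartQuery parameters for an error-pattern search over a time window."""
    start = now - timedelta(hours=hours)
    return {
        "logGroupName": log_group,
        "startTime": int(start.timestamp()),  # StartQuery expects epoch seconds
        "endTime": int(now.timestamp()),
        "queryString": (
            "fields @timestamp, @message "
            "| filter @message like /ERROR/ "
            "| stats count(*) as errors by bin(5m)"
        ),
    }


# Hypothetical log group name; use the exact name from your environment.
params = build_error_query(
    "/aws/lambda/checkout-service",
    now=datetime(2026, 6, 19, 6, 0, tzinfo=timezone.utc),
)
```

With boto3 installed and credentials configured, these parameters could be passed to `boto3.client("logs").start_query(**params)`. Writing the same log group, window, and filter into a skill's instructions gives the agent the same level of precision.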

Get started

The fastest way to start is in chat. Open the chat in your Operator Web App and try one of these three skills first. The Skills page is where you'll go later to manage, edit, or deactivate them.

Convert an existing runbook into a skill. Paste a runbook your team already uses into the chat and ask the agent to turn it into a skill. Most teams already have written investigation procedures somewhere; skills meet you where you are. This is the lowest-effort first skill, and it usually surfaces the most issues you'd want to encode.
Build a skill for assessing incident impact. When an incident hits, the first question is usually "who's affected?" Capture the CloudWatch Logs Insights queries and metrics your team runs to answer that question into a skill. Impact-assessment skills are concrete, immediately reusable, and pay off on every incident.
Turn your steering into skills as you go. During investigations, you'll naturally steer the agent: "check the deployment timeline first," "look at the read replica before the writer." When you do, ask the chat to capture that guidance as a new skill or an update to an existing one. This is the habit that grows your skill library over time, without ever blocking on a writing session.

For the full documentation, see AWS DevOps Agent Skills, Learned Skills, and Managing Assets in the User Guide. We're excited to see how you use skills to make the agent work the way your team works. If you have feedback, leave a comment below.

Yeremy Turcios is a Software Development Engineer on the AWS DevOps Agent team, primarily focusing on agent development.

DEV Community: Yeremy Turcios