DEV Community

Ajit

I Used an AI "Skill" to Audit My CloudFormation Stack — Found a Broken Alarm Running for 3 Months

TL;DR

Kiro CLI has a feature called "skills" — domain-specific expertise packages you can load on demand. I loaded the aws-cloudformation skill to validate my production Spot Fleet template. It found 4 critical issues including a CloudWatch alarm that literally never worked.

Situation

I run a development environment on AWS Spot instances — Kiro CLI, VS Code Server, persistent EBS, the works. The CloudFormation stack has been deployed since February 2026. It's been updated once, runs daily, and I considered it production-ready.

I was about to upgrade the stack and wanted to validate my base template first.

Task

Verify whether my "security-hardened" template actually follows CloudFormation best practices before using it as the foundation for the upgrade.

Action

Step 1 — Discover the skill:
search_documentation("CloudFormation deployment", topics=["agent_skills"])

This returned 5 relevant skills. I picked aws-cloudformation.

Step 2 — Load the skill:
retrieve_skill("aws-cloudformation")

What I got back wasn't generic advice. It was:

  • A structured 10-step authoring checklist
  • 3-layer validation pipeline (syntax → security → pre-deploy)
  • Troubleshooting SOPs for failed stacks
  • Decision frameworks for template vs environment fixes

Step 3 — Load the reference SOP:
retrieve_skill("aws-cloudformation", file="references/author-cloudformation-best-practices.script.md")

This gave me the detailed checklist: resource naming, parameter design, security defaults, deletion policies, cross-stack references, conditions, outputs.

Step 4 — Apply it to my template:

I asked Kiro CLI to review my CloudFormation YAML against the skill's checklist.

Result

4 critical, 5 recommended, 1 strict finding.

Here's what hit hardest:

🔴 Critical #1: Broken CloudWatch Alarm (3 months undetected)

yaml
HighCPUAlarm:
  Type: AWS::CloudWatch::Alarm
  Properties:
    # (other alarm properties omitted)
    Dimensions:
      - Name: InstanceId
        Value: !Ref SpotFleetRequest  # ← This is a Fleet ID, not an Instance ID

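My "cryptomining detection" alarm was filtering on an InstanceId value that matches no instance: !Ref on the Spot Fleet resource returns the fleet request ID, so CloudWatch had no data points for that dimension. The alarm never evaluated and never fired. I had zero protection for 3 months while thinking I was covered.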
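This class of mistake is mechanical enough to check for. Below is a minimal sketch, assuming the template has already been parsed into a plain dict in CloudFormation's JSON form (so `!Ref X` appears as `{"Ref": "X"}`). The function name `find_bad_instance_dimensions` and the sample template are made up for illustration; this is not how the skill itself performs the check.

```python
def find_bad_instance_dimensions(template: dict) -> list[str]:
    """Flag alarms whose InstanceId dimension Refs a non-instance resource."""
    resources = template.get("Resources", {})
    findings = []
    for name, res in resources.items():
        if res.get("Type") != "AWS::CloudWatch::Alarm":
            continue
        for dim in res.get("Properties", {}).get("Dimensions", []):
            value = dim.get("Value")
            if dim.get("Name") != "InstanceId" or not isinstance(value, dict):
                continue
            target = value.get("Ref")
            target_type = resources.get(target, {}).get("Type")
            # Only a Ref to an AWS::EC2::Instance resolves to an instance ID.
            # Refs to parameters (not in Resources) are skipped, not flagged.
            if target_type is not None and target_type != "AWS::EC2::Instance":
                findings.append(
                    f"{name}: InstanceId dimension refs {target} ({target_type})"
                )
    return findings


# Hypothetical, trimmed-down template mirroring the bug above.
sample_template = {
    "Resources": {
        "SpotFleetRequest": {"Type": "AWS::EC2::SpotFleet", "Properties": {}},
        "HighCPUAlarm": {
            "Type": "AWS::CloudWatch::Alarm",
            "Properties": {
                "Dimensions": [
                    {"Name": "InstanceId", "Value": {"Ref": "SpotFleetRequest"}}
                ]
            },
        },
    }
}

for finding in find_bad_instance_dimensions(sample_template):
    print(finding)
```

On the sample it reports the `HighCPUAlarm` dimension, because the referenced resource is an `AWS::EC2::SpotFleet` rather than an instance.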

Fix: Reference the actual instance ID (requires a Lambda or instance self-registration pattern since Spot Fleet instances are dynamic).
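One possible reading of the "instance self-registration" option: each instance looks up its own ID at boot and creates (or overwrites) its own alarm. The sketch below is only that, a sketch. It assumes boto3 is installed with a region configured, the instance role allows `cloudwatch:PutMetricAlarm`, and IMDSv2 is reachable. The alarm name scheme, the 90% threshold, and the evaluation periods are placeholders, not values from the template in this post.

```python
import urllib.request

IMDS = "http://169.254.169.254/latest"


def get_instance_id(timeout: float = 2.0) -> str:
    """Fetch this instance's ID from the metadata service (IMDSv2)."""
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()
    id_req = urllib.request.Request(
        f"{IMDS}/meta-data/instance-id",
        headers={"X-aws-ec2-metadata-token": token},
    )
    with urllib.request.urlopen(id_req, timeout=timeout) as resp:
        return resp.read().decode()


def build_cpu_alarm(instance_id: str, threshold: float = 90.0) -> dict:
    """Build put_metric_alarm arguments scoped to one real instance ID."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",  # placeholder naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,            # seconds; placeholder
        "EvaluationPeriods": 3,   # placeholder
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
    }


def register_alarm() -> None:
    """Intended to run on the instance at boot, e.g. from user data."""
    import boto3  # imported lazily so build_cpu_alarm works without boto3

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(**build_cpu_alarm(get_instance_id()))
```

Calling `register_alarm()` from the instance's boot script would create one alarm per instance. Since Spot instances come and go, alarms left behind by terminated instances would need separate cleanup.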

What Are "Skills" Exactly?

They're not prompts. They're not RAG documents. They're structured domain expertise packages containing:

| Component | What it does |
| --- | --- |
| Workflows | Step-by-step procedures with decision points |
| Checklists | Deterministic validation (like the 10-step authoring review) |
| SOPs | Standard operating procedures for troubleshooting |
| Decision trees | "If X, do Y; if Z, do W" frameworks |
| Reference files | Architecture docs, schemas, examples |

The CloudFormation skill has SOPs for:

  • Authoring with secure defaults
  • 3-layer pre-deployment validation (cfn-lint → cfn-guard → change set)
  • Troubleshooting failed stacks (using describe-events --filters FailedEvents=true)
  • Resource property lookup
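The three validation layers can be chained so that a later layer only runs if the earlier one passes. Here is a minimal sketch, assuming `cfn-lint`, `cfn-guard`, and the AWS CLI are installed and on `PATH`. The file names, stack name, and change set name are placeholders, and the exact commands the skill's SOP runs may differ.

```python
import subprocess
from typing import Callable, Sequence


def build_validation_steps(template: str, rules: str, stack: str) -> list[list[str]]:
    """Return the three layers in order: syntax, security, pre-deploy."""
    return [
        # Layer 1: syntax and schema checks.
        ["cfn-lint", template],
        # Layer 2: policy/security rules written in Guard.
        ["cfn-guard", "validate", "--data", template, "--rules", rules],
        # Layer 3: create (but do not execute) a change set for review.
        # Templates with IAM resources may also need --capabilities.
        [
            "aws", "cloudformation", "create-change-set",
            "--stack-name", stack,
            "--change-set-name", "pre-deploy-review",  # placeholder name
            "--template-body", f"file://{template}",
        ],
    ]


def run_pipeline(
    steps: Sequence[Sequence[str]],
    runner: Callable[[Sequence[str]], int] = lambda cmd: subprocess.run(cmd).returncode,
) -> int:
    """Run steps in order and stop at the first failing layer.

    Returns the number of steps that passed.
    """
    passed = 0
    for cmd in steps:
        if runner(cmd) != 0:
            break
        passed += 1
    return passed
```

For example, `run_pipeline(build_validation_steps("template.yaml", "rules.guard", "dev-env"))` returns 3 only when all layers succeed. The `runner` parameter exists so the ordering logic can be exercised without the tools installed.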

Key insight: The skill told me to use describe-events (newer API with filter support) instead of describe-stack-events (legacy, no filters). I didn't know this API existed.
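For comparison, without a server-side filter the failed events have to be picked out on the client. A small sketch of that, assuming the JSON printed by `aws cloudformation describe-stack-events` has been captured as a string; the sample payload is made up and heavily trimmed.

```python
import json


def failed_events(describe_stack_events_output: str) -> list[dict]:
    """Keep only events whose ResourceStatus ends in _FAILED."""
    events = json.loads(describe_stack_events_output).get("StackEvents", [])
    return [e for e in events if e.get("ResourceStatus", "").endswith("_FAILED")]


# Made-up sample shaped like describe-stack-events output.
sample_output = json.dumps(
    {
        "StackEvents": [
            {"LogicalResourceId": "HighCPUAlarm", "ResourceStatus": "CREATE_COMPLETE"},
            {
                "LogicalResourceId": "SpotFleetRequest",
                "ResourceStatus": "UPDATE_FAILED",
                "ResourceStatusReason": "example reason",
            },
            {"LogicalResourceId": "dev-env", "ResourceStatus": "UPDATE_ROLLBACK_IN_PROGRESS"},
        ]
    }
)

for event in failed_events(sample_output):
    print(event["LogicalResourceId"], event["ResourceStatus"])
```

On the sample, only the `SpotFleetRequest` event with status `UPDATE_FAILED` survives the filter.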

How to Use Skills in Kiro CLI

bash

# 1. Search for relevant skills
# (you can't guess skill names; they must be discovered)
search_documentation("your task description", topics=["agent_skills"])

# 2. Load the skill
retrieve_skill("exact-skill-name-from-search")

# 3. Load reference files if the skill links to them
retrieve_skill("skill-name", file="references/some-sop.md")

# 4. Apply: ask the AI to use the loaded expertise
"Review my template against the authoring checklist"
"Troubleshoot my failed stack using the SOP"
"Validate before I deploy"

Available skills I found:

  • aws-cloudformation — authoring, validation, troubleshooting
  • launching-ec2-instance-with-best-practices — secure EC2 launches
  • creating-production-vpc-multi-az — VPC design
  • creating-ec2-image-builder-pipeline — AMI automation
  • aws-cdk — CDK patterns and deployment

Lessons Learned

  1. "It works" ≠ "It's correct." My stack ran fine for 3 months with a broken alarm. Functional doesn't mean secure.

  2. Validate infrastructure like you lint code. We wouldn't ship code without tests. Why do we ship CloudFormation without a structured review?

  3. AI skills > AI chat. A general "review my template" prompt gives generic advice. A loaded skill applies a deterministic, comprehensive checklist. The difference is like asking a random person vs. asking a CloudFormation specialist with a clipboard.

  4. The scariest bugs are silent ones. A broken alarm doesn't throw errors. It just... doesn't protect you.

The skill doesn't just find problems — it prevents them in new templates.
