Ajit

Posted on May 6

I Used an AI "Skill" to Audit My CloudFormation Stack — Found a Broken Alarm Running for 3 Months

#aws #cloudformation #devops #security

TL;DR

Kiro CLI has a feature called "skills" — domain-specific expertise packages you can load on demand. I loaded the aws-cloudformation skill to validate my production Spot Fleet template. It found 4 critical issues including a CloudWatch alarm that literally never worked.

Situation

I run a development environment on AWS Spot instances — Kiro CLI, VS Code Server, persistent EBS, the works. The CloudFormation stack has been deployed since February 2026. It's been updated once, runs daily, and I considered it production-ready.

I was about to upgrade the stack and wanted to validate my base template first.

Task

Verify whether my "security-hardened" template actually follows CloudFormation best practices before using it as the foundation for the upgrade.

Action

Step 1 — Discover the skill:
search_documentation("CloudFormation deployment", topics=["agent_skills"])

This returned 5 relevant skills. I picked aws-cloudformation.

Step 2 — Load the skill:
retrieve_skill("aws-cloudformation")

What I got back wasn't generic advice. It was:

A structured 10-step authoring checklist
3-layer validation pipeline (syntax → security → pre-deploy)
Troubleshooting SOPs for failed stacks
Decision frameworks for template vs environment fixes

Step 3 — Load the reference SOP:
retrieve_skill("aws-cloudformation", file="references/author-cloudformation-best-practices.script.md")

This gave me the detailed checklist: resource naming, parameter design, security defaults, deletion policies, cross-stack references, conditions, outputs.
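To make one checklist item concrete, here's a minimal sketch of a deletion-policy check. It's my own illustration, not the skill's implementation. It assumes the template has already been parsed into a plain dict (for example after converting it to JSON), and the set of "stateful" resource types and the sample logical IDs are hypothetical.

```python
# Resource types treated as "stateful" here are an illustrative assumption,
# not the skill's actual list.
STATEFUL_TYPES = {
    "AWS::EC2::Volume",
    "AWS::S3::Bucket",
    "AWS::RDS::DBInstance",
    "AWS::DynamoDB::Table",
}

# Hypothetical template fragment, already parsed into a dict.
SAMPLE_TEMPLATE = {
    "Resources": {
        "DataVolume": {"Type": "AWS::EC2::Volume", "Properties": {"Size": 100}},
        "LogBucket": {"Type": "AWS::S3::Bucket", "DeletionPolicy": "Retain"},
        "SpotFleetRequest": {"Type": "AWS::EC2::SpotFleet", "Properties": {}},
    }
}


def missing_deletion_policy(template: dict) -> list[str]:
    """Return logical IDs of stateful resources without an explicit DeletionPolicy."""
    findings = []
    for logical_id, resource in template.get("Resources", {}).items():
        if resource.get("Type") in STATEFUL_TYPES and "DeletionPolicy" not in resource:
            findings.append(logical_id)
    return sorted(findings)


if __name__ == "__main__":
    print(missing_deletion_policy(SAMPLE_TEMPLATE))
```

In this sample only the EBS volume is flagged: the bucket already declares a policy, and the Spot Fleet isn't in the stateful set.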

Step 4 — Apply it to my template:

I asked Kiro CLI to review my CloudFormation YAML against the skill's checklist.

Result

4 critical, 5 recommended, 1 strict finding.

Here's what hit hardest:

🔴 Critical #1: Broken CloudWatch Alarm (3 months undetected)

yaml
HighCPUAlarm:
  Dimensions:
    - Name: InstanceId
      Value: !Ref SpotFleetRequest  # ← This is a Fleet ID, not an Instance ID

My "cryptomining detection" alarm was watching an InstanceId dimension whose value was a Spot Fleet request ID, so it never matched any instance's metrics. It never evaluated real data. Never fired. I had zero protection for 3 months while thinking I was covered.

Fix: Reference the actual instance ID (requires a Lambda or instance self-registration pattern since Spot Fleet instances are dynamic).
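This class of mistake can also be caught mechanically. Below is a rough sketch, assuming the template is already parsed into a dict with intrinsics in long form (`{"Ref": "..."}` rather than `!Ref`). The function name and the sample template are mine, and it only looks at direct `Ref` values, so treat it as an illustration rather than a complete check.

```python
# Hypothetical template fragment mirroring the broken alarm above.
SAMPLE_TEMPLATE = {
    "Resources": {
        "SpotFleetRequest": {"Type": "AWS::EC2::SpotFleet", "Properties": {}},
        "HighCPUAlarm": {
            "Type": "AWS::CloudWatch::Alarm",
            "Properties": {
                "Dimensions": [
                    {"Name": "InstanceId", "Value": {"Ref": "SpotFleetRequest"}}
                ]
            },
        },
    }
}


def suspicious_instance_dimensions(template: dict) -> list[tuple[str, str]]:
    """Find alarms whose InstanceId dimension Refs a non-instance resource.

    Returns (alarm_logical_id, referenced_logical_id) pairs.
    """
    resources = template.get("Resources", {})
    findings = []
    for logical_id, resource in resources.items():
        if resource.get("Type") != "AWS::CloudWatch::Alarm":
            continue
        dimensions = resource.get("Properties", {}).get("Dimensions", [])
        for dim in dimensions:
            value = dim.get("Value")
            if dim.get("Name") != "InstanceId" or not isinstance(value, dict):
                continue
            target = value.get("Ref")
            target_type = resources.get(target, {}).get("Type")
            # Only flag Refs that resolve to a resource in this template;
            # a Ref to a parameter is left alone because its value is unknown.
            if target_type is not None and target_type != "AWS::EC2::Instance":
                findings.append((logical_id, target))
    return findings


if __name__ == "__main__":
    print(suspicious_instance_dimensions(SAMPLE_TEMPLATE))
```

The check relies on the fact that `Ref` on an `AWS::EC2::Instance` returns an instance ID, while `Ref` on a Spot Fleet returns the fleet request ID.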

What Are "Skills" Exactly?

They're not prompts. They're not RAG documents. They're structured domain expertise packages containing:

Component	What it does
Workflows	Step-by-step procedures with decision points
Checklists	Deterministic validation (like the 10-step authoring review)
SOPs	Standard operating procedures for troubleshooting
Decision trees	"If X, do Y; if Z, do W" frameworks
Reference files	Architecture docs, schemas, examples

The CloudFormation skill has SOPs for:

Authoring with secure defaults
3-layer pre-deployment validation (cfn-lint → cfn-guard → change set)
Troubleshooting failed stacks (using describe-events --filters FailedEvents=true)
Resource property lookup
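The 3-layer validation order can be sketched as a small wrapper. This is an assumption-heavy illustration: the rules file, stack name, and change set name are placeholders, and the skill's real SOP may pass different options to each tool.

```python
import subprocess


def validation_commands(template_path: str, rules_path: str, stack_name: str) -> list[list[str]]:
    """Build the three validation commands in cfn-lint, cfn-guard, change set order."""
    return [
        # Layer 1: syntax and schema checks
        ["cfn-lint", template_path],
        # Layer 2: policy checks against a Guard rules file
        ["cfn-guard", "validate", "--data", template_path, "--rules", rules_path],
        # Layer 3: preview the changes without applying them
        [
            "aws", "cloudformation", "create-change-set",
            "--stack-name", stack_name,
            "--change-set-name", "pre-deploy-review",  # placeholder name
            "--template-body", f"file://{template_path}",
        ],
    ]


def run_pipeline(commands: list[list[str]]) -> bool:
    """Run each command in order and stop at the first non-zero exit code."""
    for argv in commands:
        if subprocess.run(argv).returncode != 0:
            return False
    return True


if __name__ == "__main__":
    for argv in validation_commands("template.yaml", "rules.guard", "dev-env"):
        print(" ".join(argv))
```

Stopping at the first failure keeps the cheap checks in front: a template that fails linting never reaches the change set step. A template that creates IAM resources would also need `--capabilities` on the last command.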

Key insight: The skill told me to use describe-events (newer API with filter support) instead of describe-stack-events (legacy, no filters). I didn't know this API existed.
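For comparison, this is roughly the client-side filtering you end up writing with describe-stack-events output. It's a sketch only: the field names follow that command's JSON output, while the sample events and logical IDs are made up.

```python
# Made-up events shaped like the "StackEvents" list from describe-stack-events.
SAMPLE_EVENTS = [
    {"LogicalResourceId": "DataVolume", "ResourceStatus": "CREATE_COMPLETE",
     "Timestamp": "2026-02-01T10:00:00Z"},
    {"LogicalResourceId": "HighCPUAlarm", "ResourceStatus": "CREATE_FAILED",
     "ResourceStatusReason": "example reason", "Timestamp": "2026-02-01T10:02:00Z"},
    {"LogicalResourceId": "SpotFleetRequest", "ResourceStatus": "UPDATE_FAILED",
     "ResourceStatusReason": "example reason", "Timestamp": "2026-02-01T10:01:00Z"},
]


def failed_events(stack_events: list[dict]) -> list[dict]:
    """Keep only events whose ResourceStatus ends in _FAILED, oldest first."""
    failures = [
        event for event in stack_events
        if event.get("ResourceStatus", "").endswith("_FAILED")
    ]
    # ISO 8601 timestamps in the same timezone sort correctly as strings.
    return sorted(failures, key=lambda event: event["Timestamp"])


if __name__ == "__main__":
    for event in failed_events(SAMPLE_EVENTS):
        print(event["Timestamp"], event["LogicalResourceId"], event["ResourceStatus"])
```

Sorting oldest first puts the root cause at the top, since later failures are often just fallout from the first one.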

How to Use Skills in Kiro CLI

bash
# 1. Search for relevant skills
#    (you can't guess skill names; they must be discovered)
search_documentation("your task description", topics=["agent_skills"])

# 2. Load the skill
retrieve_skill("exact-skill-name-from-search")

# 3. Load reference files if the skill links to them
retrieve_skill("skill-name", file="references/some-sop.md")

# 4. Apply: ask the AI to use the loaded expertise
"Review my template against the authoring checklist"
"Troubleshoot my failed stack using the SOP"
"Validate before I deploy"

Available skills I found:

aws-cloudformation — authoring, validation, troubleshooting
launching-ec2-instance-with-best-practices — secure EC2 launches
creating-production-vpc-multi-az — VPC design
creating-ec2-image-builder-pipeline — AMI automation
aws-cdk — CDK patterns and deployment

Lessons Learned

"It works" ≠ "It's correct." My stack ran fine for 3 months with a broken alarm. Functional doesn't mean secure.
Validate infrastructure like you lint code. We wouldn't ship code without tests. Why do we ship CloudFormation without a structured review?
AI skills > AI chat. A general "review my template" prompt gives generic advice. A loaded skill applies a deterministic, comprehensive checklist. The difference is like asking a random person vs. asking a CloudFormation specialist with a clipboard.
The scariest bugs are silent ones. A broken alarm doesn't throw errors. It just... doesn't protect you.

The skill doesn't just find problems — it prevents them in new templates.

DEV Community