TL;DR
I misconfigured the TargetCapacity of a CloudFormation SpotFleet and accidentally launched 50 EC2 instances. Cost: $1.41 (thanks to Spot pricing). But it could have been much worse.
Why you should care: If you're using CloudFormation with SpotFleet or any auto-scaling resources, one wrong default value can cause unexpected launches. This post shows you what went wrong and how to build multiple checkpoints to prevent it.
What Happened
I was updating a CloudFormation stack to add multiple instance size patterns (t3.nano, t3.small, t3.medium, t3.large, c8i.xlarge) for Jenkins Agent SpotFleets.
After running the stack update, I moved on to other work. CloudFormation takes time, right? So I didn't check immediately.
40 minutes later, I opened the AWS console.
50 EC2 instances were running. 10 instances × 5 SpotFleets.
"Oh no. What did I do?"
Why It Happened
Here's the problem in the CloudFormation template:
Parameters:
  MaxTargetCapacity:
    Type: Number
    Default: 10  # Oops

Resources:
  JenkinsAgentSpotFleet:
    Type: AWS::EC2::SpotFleet
    Properties:
      SpotFleetRequestConfigData:
        TargetCapacity: !Ref MaxTargetCapacity  # This launches 10 instances immediately
The intended design: Jenkins EC2 Fleet Plugin should dynamically manage TargetCapacity. When there are jobs in the queue, it increases capacity. When there are no jobs, it scales back to 0.
What I should have written:
Resources:
  JenkinsAgentSpotFleet:
    Type: AWS::EC2::SpotFleet
    Properties:
      SpotFleetRequestConfigData:
        TargetCapacity: 0  # Let Jenkins manage capacity
I knew this. But I missed it during code review.
Multiple Checkpoints Failed
This wasn't a single mistake. Multiple checkpoints failed:
Checkpoint 1: Design Phase
I didn't set the default value to 0. I should have designed with "safe by default" in mind.
If I had set TargetCapacity: 0, nothing would have launched.
Checkpoint 2: Code Review
The diff was large. I got lazy and skipped careful review. "It'll probably be fine," I thought.
If I had reviewed the TargetCapacity parameter, I would have caught it.
Checkpoint 3: Notification Setup
I didn't set up notifications for stack updates. EventBridge + SNS could have alerted me when the update completed.
If I had notifications, I wouldn't have forgotten to check.
Checkpoint 4: Immediate Verification
I didn't check right after running the update. CloudFormation takes time, so I moved on to other work.
If I had checked immediately, I would have noticed sooner.
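One way to make that check automatic is to have the CLI wait for the update and report the final status. A minimal sketch (the stack name is a placeholder):

# Block until the update finishes, then print the final status (stack name is a placeholder)
aws cloudformation wait stack-update-complete --stack-name jenkins-agents
aws cloudformation describe-stacks \
  --stack-name jenkins-agents \
  --query 'Stacks[0].StackStatus' \
  --output text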
Any one of these checkpoints could have prevented the problem. But all four failed.
Small mistakes compound into big problems.
How I Fixed It (Fast, Thanks to AI)
When I saw 50 instances running, I had a few options:
- Update the CloudFormation template and re-run the stack update (slow)
- Use the AWS CLI to set TargetCapacity to 0 (fast)
- Manually terminate instances from the console (tedious)
I chose option 2.
I asked Claude Code for help: "50 instances launched by accident. Set SpotFleet TargetCapacity to 0 and terminate running instances."
Claude Code quickly generated the exact commands:
# Set TargetCapacity to 0 for each SpotFleet (run once per request ID)
aws ec2 modify-spot-fleet-request \
  --spot-fleet-request-id sfr-xxx \
  --target-capacity 0

# Terminate all running instances
aws ec2 terminate-instances \
  --instance-ids i-xxx i-yyy i-zzz ...
It laid out an execution plan, verified the current state before acting, and ran the commands accurately. Much faster than doing it manually.
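For reference, this is roughly the kind of verification it ran first. I'm not reproducing its exact commands, and the request ID is a placeholder:

# List active SpotFleet requests with their current state and target capacity
aws ec2 describe-spot-fleet-requests \
  --query 'SpotFleetRequestConfigs[].[SpotFleetRequestId,SpotFleetRequestState,SpotFleetRequestConfig.TargetCapacity]' \
  --output table

# List the instances a specific SpotFleet launched (placeholder request ID)
aws ec2 describe-spot-fleet-instances \
  --spot-fleet-request-id sfr-xxx \
  --query 'ActiveInstances[].InstanceId' \
  --output text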
Recovery time: about 10 minutes.
AI-driven development: It caused the problem (by generating code I didn't review carefully), and it also solved the problem (by executing recovery quickly and accurately).
Cost Impact
- Total cost: $1.41 USD
- Runtime: 50 minutes (40 minutes until I noticed + 10 minutes to fix)
- Instances: 50 (Spot pricing, us-west-2)
It could have been much worse if:
- I noticed hours later
- These were On-Demand instances instead of Spot
- This happened in production
$1.41 is cheap tuition. But the lesson is valuable.
Lessons Learned
1. Default to Safe Values
Always set capacity parameters to 0 or minimum by default:
TargetCapacity: 0
MinSize: 0
DesiredCapacity: 0
"Do nothing" should be the safe state.
2. Review Critical Parameters
Even when the diff is large, always review:
- TargetCapacity
- MinSize / MaxSize / DesiredCapacity
- Any parameter affecting instance count
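A low-tech way to force that check is to grep the diff before applying it. A sketch; adjust the pattern and file globs to your repo:

# Flag capacity-related changes in a template diff before running the stack update
git diff -- '*.yaml' '*.yml' '*.json' \
  | grep -nE 'TargetCapacity|MinSize|MaxSize|DesiredCapacity' \
  && echo "Capacity values changed: review before updating the stack"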
3. Set Up Notifications
Use EventBridge + SNS to notify when stack updates complete:
Resources:
  StackUpdateNotificationRule:
    Type: AWS::Events::Rule
    Properties:
      EventPattern:
        source: [aws.cloudformation]
        detail-type: ["CloudFormation Stack Status Change"]
        detail:
          status-details:
            status: [UPDATE_COMPLETE, UPDATE_ROLLBACK_COMPLETE]
      Targets:
        - Arn: !Ref NotificationTopic
          Id: StackUpdateNotificationTarget
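The rule references a NotificationTopic that isn't shown above. Roughly, the missing pieces would sit alongside the rule in the same Resources block: an SNS topic, an email subscription, and a topic policy that lets EventBridge publish. Names and the address are placeholders:

  NotificationTopic:
    Type: AWS::SNS::Topic

  NotificationSubscription:
    Type: AWS::SNS::Subscription
    Properties:
      TopicArn: !Ref NotificationTopic
      Protocol: email
      Endpoint: you@example.com  # placeholder address

  NotificationTopicPolicy:
    Type: AWS::SNS::TopicPolicy
    Properties:
      Topics:
        - !Ref NotificationTopic
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: events.amazonaws.com
            Action: sns:Publish
            Resource: !Ref NotificationTopic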
4. Build Multiple Checkpoints
One checkpoint might fail. With several in place, the odds that at least one catches the problem are much better.
This is the "Swiss Cheese Model" from safety engineering: each layer has holes (human error), but stacking layers makes it far less likely that a mistake slips through all of them.
Final Thoughts
I'm relieved it only cost $1.41. But will I remember this lesson?
Honestly, I might make the same mistake again when I forget about this incident.
Humans forget. Humans get lazy. "It'll probably be fine" is always tempting.
Whether I can apply these lessons depends on my future self.
Can I trust my future self?
I'm not sure yet.
That's why I need systems, not just willpower. Notifications. Safe defaults. Code review checklists.
Build checkpoints. Don't rely on memory.
What about you? Have you ever launched unexpected resources by accident? How did you handle it? Let me know in the comments.
If you're interested in how engineers think through problems and decisions, I write more about that on my blog:
https://tielec.blog/