Kachi

Posted on Apr 27

CloudFormation in Production: What Breaks and How to Fix It

#devops #aws #cloud #sre

Moving past YAML templates to failure handling, security, and real tradeoffs

Before we start

This is a follow-up to Infrastructure as Code with AWS CloudFormation: From Fundamentals to Production Patterns.

That article covered templates, stacks, nested stacks, CI/CD, and production best practices.

This article covers what happens when those best practices aren't enough. When things break in ways the documentation doesn't warn you about. When you're reading CloudFormation error messages at midnight and need answers.

Part 1: Stack deployment failures

Failure 1: "Resource handler returned message: 'Role does not exist'"

Symptoms:

IAM role creates successfully (status: CREATE_COMPLETE)
Lambda or EC2 resource fails immediately after
Error: "The role named 'xxx' does not exist or is not authorized"

Root cause:
IAM is eventually consistent. CloudFormation marks the role as complete as soon as the API call returns, but the role can take several seconds to propagate to every IAM endpoint and to the services that need to assume it.

Fix:

LambdaFunction:
  Type: AWS::Lambda::Function
  DependsOn: LambdaExecutionRole
  Properties:
    Role: !GetAtt LambdaExecutionRole.Arn

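The !GetAtt reference already makes CloudFormation create the role before the function, so the DependsOn shown here is redundant but harmless. DependsOn only orders resource creation; it does not wait for IAM propagation. It earns its place when the consumer also relies on a separately declared AWS::IAM::Policy or AWS::IAM::InstanceProfile that nothing references directly: list that resource in DependsOn so it exists before the consumer is created.

Prevention:
Add DependsOn for any IAM policy or instance profile that a resource relies on but does not reference. For the propagation delay itself, rely on the service's built-in retries, or retry in your own tooling and custom resources.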
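
For deploy scripts and custom resources that use a freshly created role, a retry with backoff is one way to ride out the propagation delay. This is a minimal sketch, not CloudFormation's own behavior; `retry_with_backoff` and `looks_like_iam_propagation` are hypothetical helper names, and the message matching is an assumption based on the errors quoted above.

```python
import time


def retry_with_backoff(operation, is_retryable, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call operation() until it succeeds or the attempts run out.

    is_retryable(exc) decides whether an exception looks like a propagation
    error worth retrying; anything else is re-raised immediately.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception as exc:
            last_attempt = attempt == attempts - 1
            if last_attempt or not is_retryable(exc):
                raise
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...


def looks_like_iam_propagation(exc):
    """Heuristic (assumption): match on the error text quoted in this section."""
    message = str(exc).lower()
    return "does not exist" in message or "cannot be assumed" in message
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays.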

Failure 2: Stack timeout without clear cause

Symptoms:

Stack creation or update times out after the configured limit
No obvious error in event log
Some resources show CREATE_IN_PROGRESS for hours

Root cause:
Resources with CreationPolicy or WaitCondition are waiting for signals that never arrive. Common causes:

EC2 instance user data script fails silently
Custom resource Lambda times out
Application code never calls cfn-signal

Diagnosis:

# List resources still in progress (for example, waiting on a CreationPolicy signal)
aws cloudformation describe-stack-resources --stack-name prod-stack \
  --query "StackResources[?ResourceStatus=='CREATE_IN_PROGRESS']"

# For EC2, check user data logs on the instance
cat /var/log/cloud-init-output.log

Fix:

For EC2 with user data:

#!/bin/bash
# Do your setup here

# Signal success or failure
/opt/aws/bin/cfn-signal --exit-code $? --stack ${AWS::StackName} \
  --resource WebServerInstance --region ${AWS::Region}

For custom resources, implement timeout handling:

def handler(event, context):
    try:
        # Do work
        send_response(event, context, "SUCCESS")
    except Exception as e:
        # CRITICAL: Always send a response
        send_response(event, context, "FAILED", reason=str(e))
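
The handler above calls `send_response` without defining it. A minimal sketch of that helper, using only the standard library, could look like the following. The field names follow the custom resource response format; `build_response_body` is a hypothetical helper name, and falling back to the log stream name as the physical ID is an assumption.

```python
import json
import urllib.request


def build_response_body(event, context, status, reason=None, data=None):
    """Build the JSON document CloudFormation expects from a custom resource."""
    return {
        "Status": status,  # "SUCCESS" or "FAILED"
        "Reason": reason or f"See CloudWatch log stream: {context.log_stream_name}",
        # Assumption: reuse the existing physical ID, else fall back to the log stream name
        "PhysicalResourceId": event.get("PhysicalResourceId", context.log_stream_name),
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
        "Data": data or {},
    }


def send_response(event, context, status, reason=None, data=None):
    """PUT the response to the pre-signed URL supplied in the event."""
    body = json.dumps(build_response_body(event, context, status, reason, data)).encode("utf-8")
    request = urllib.request.Request(
        event["ResponseURL"],
        data=body,
        method="PUT",
        headers={"Content-Type": "", "Content-Length": str(len(body))},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

If this PUT never happens, CloudFormation keeps waiting until the custom resource times out, which is the hang described in the symptoms.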

Prevention:
Always test CreationPolicy paths with --disable-rollback first so you can inspect failed resources without automatic cleanup.

Failure 3: Nested stack update fails, root cause invisible

Symptoms:

Parent stack update fails
Error message: "Nested stack failed to update"
No details about why the nested stack failed

Root cause:
CloudFormation does not bubble up nested stack failure details to the parent. You have to check each nested stack individually.

Diagnosis:

# List nested stacks from the parent
aws cloudformation list-stack-resources --stack-name parent-stack \
  --query "StackResourceSummaries[?ResourceType=='AWS::CloudFormation::Stack'].PhysicalResourceId"

# Check each nested stack's events
aws cloudformation describe-stack-events --stack-name nested-stack-1
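
Checking each nested stack by hand gets tedious, so the walk can be scripted. The sketch below assumes `cfn` is a boto3 CloudFormation client (or any object with the same two methods); the function names are hypothetical, and pagination is ignored for brevity. It drops "cancelled" events, which are side effects of the real failure rather than its cause.

```python
CANCELLED_MARKERS = ("Resource creation cancelled", "Resource update cancelled")


def root_cause_events(events):
    """Return failed events that carry a real reason, oldest first.

    `events` is the StackEvents list from describe-stack-events (newest first).
    """
    failures = [
        e for e in events
        if e.get("ResourceStatus", "").endswith("_FAILED")
        and not any(m in e.get("ResourceStatusReason", "") for m in CANCELLED_MARKERS)
    ]
    return list(reversed(failures))


def nested_stack_ids(resource_summaries):
    """Pick nested stack IDs out of list-stack-resources output."""
    return [
        r["PhysicalResourceId"]
        for r in resource_summaries
        if r["ResourceType"] == "AWS::CloudFormation::Stack" and r.get("PhysicalResourceId")
    ]


def collect_failures(cfn, stack_name):
    """Map each stack (parent and nested) to its root-cause failure events."""
    events = cfn.describe_stack_events(StackName=stack_name)["StackEvents"]
    found = {stack_name: root_cause_events(events)}
    summaries = cfn.list_stack_resources(StackName=stack_name)["StackResourceSummaries"]
    for child in nested_stack_ids(summaries):
        found.update(collect_failures(cfn, child))
    return found
```

With boto3 installed, `collect_failures(boto3.client("cloudformation"), "parent-stack")` would return one entry per stack in the tree.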

Fix:

Add explicit validation before updating parents:

# Validate nested template before updating parent
aws cloudformation validate-template --template-body file://nested.yaml

# Check nested stack for drift
aws cloudformation detect-stack-drift --stack-name nested-stack-1

Prevention:
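Minimize nested stack depth (2 levels maximum). For complex dependencies, split into separate stacks linked by exported outputs (Fn::ImportValue). StackSets solve a different problem: deploying one template across accounts and regions.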

Part 2: Drift and configuration mismatch

Failure 4: Production resource changed outside CloudFormation

Symptoms:

Security group rule allows unexpected traffic
S3 bucket becomes public
RDS backup retention period changes
No corresponding change in Git history

Root cause:
Someone modified a resource directly in the AWS console or via CLI, bypassing CloudFormation.

Diagnosis:

# Detect drift on a stack
aws cloudformation detect-stack-drift --stack-name prod-web

# Get detailed drift results
aws cloudformation describe-stack-drift-detection-status \
  --stack-drift-detection-id <id>

# List drifted resources
aws cloudformation describe-stack-resource-drifts --stack-name prod-web \
  --stack-resource-drift-status-filters MODIFIED DELETED

Fix — manual:

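# There is no one-step "re-sync" command. Either revert the manual change, or update
# the template to match and deploy it. To re-adopt a resource that was removed from
# the stack (DeletionPolicy: Retain), use an IMPORT change set:
aws cloudformation create-change-set --stack-name prod-web \
  --change-set-name reimport-data-bucket --change-set-type IMPORT \
  --template-body file://template.yaml \
  --resources-to-import '[{"ResourceType":"AWS::S3::Bucket","LogicalResourceId":"DataBucket","ResourceIdentifier":{"BucketName":"<bucket-name>"}}]'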

Fix — automated:

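# EventBridge (formerly CloudWatch Events) rule to run drift detection weekly
DriftDetectionRule:
  Type: AWS::Events::Rule
  Properties:
    ScheduleExpression: "cron(0 12 ? * MON *)"  # Every Monday at 12:00 UTC
    Targets:
      - Id: DriftLambdaTarget
        Arn: !GetAtt DriftLambda.Arn
        Input: '{"stackName": "prod-web"}'
# DriftLambda also needs an AWS::Lambda::Permission that lets events.amazonaws.com invoke it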
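
The rule targets a DriftLambda that the template does not show. A minimal sketch of what that function might do follows; `check_drift` is a hypothetical name, `cfn` is assumed to be a boto3 CloudFormation client, and the polling defaults are arbitrary.

```python
import time


def check_drift(cfn, stack_name, poll_seconds=5, max_polls=60, sleep=time.sleep):
    """Run drift detection and return (logical_id, drift_status) for drifted resources."""
    detection_id = cfn.detect_stack_drift(StackName=stack_name)["StackDriftDetectionId"]
    for _ in range(max_polls):
        status = cfn.describe_stack_drift_detection_status(StackDriftDetectionId=detection_id)
        if status["DetectionStatus"] != "DETECTION_IN_PROGRESS":
            break
        sleep(poll_seconds)
    else:
        raise TimeoutError(f"Drift detection for {stack_name} did not finish in time")

    if status["DetectionStatus"] == "DETECTION_FAILED":
        raise RuntimeError(status.get("DetectionStatusReason", "Drift detection failed"))
    if status.get("StackDriftStatus") != "DRIFTED":
        return []

    drifts = cfn.describe_stack_resource_drifts(
        StackName=stack_name,
        StackResourceDriftStatusFilters=["MODIFIED", "DELETED"],
    )["StackResourceDrifts"]
    return [(d["LogicalResourceId"], d["StackResourceDriftStatus"]) for d in drifts]


def handler(event, context):
    """Lambda entry point; receives the rule's Input, e.g. {"stackName": "prod-web"}."""
    import boto3  # imported here so check_drift stays testable without boto3

    drifted = check_drift(boto3.client("cloudformation"), event["stackName"])
    return {"stackName": event["stackName"], "drifted": [list(item) for item in drifted]}
```

A real deployment would publish the result somewhere visible, such as an SNS topic, instead of only returning it.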

Prevention:

Enforce IAM policies that prevent resource modification outside CloudFormation
Run drift detection on a schedule for all production stacks (it is an on-demand operation, not a setting)
Review drift reports weekly

Failure 5: Stack drift causes deletion protection to block cleanup

Symptoms:

Trying to delete a stack
Error: "Cannot delete stack because resource X has deletion protection"
That resource was not supposed to have deletion protection

Root cause:
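Someone enabled deletion protection directly on a resource such as an RDS instance, so neither the template nor CloudFormation knows about it. S3 buckets have no deletion-protection flag, but a bucket that still contains objects blocks stack deletion in the same way.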

Diagnosis:

# Find which resource is blocking deletion
aws cloudformation describe-stack-resources --stack-name prod-stack \
  --query "StackResources[?ResourceStatus=='DELETE_FAILED']"

Fix:

# Remove deletion protection from the resource directly
aws rds modify-db-instance --db-instance-identifier mydb \
  --no-deletion-protection

# Or for S3: there is no deletion-protection flag; the bucket must be empty
# (on a versioned bucket, old versions and delete markers must be removed too)
aws s3 rm s3://mybucket --recursive

# Retry stack deletion
aws cloudformation delete-stack --stack-name prod-stack

Prevention:
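Declare protection in the template instead of setting it by hand: DeletionPolicy: Retain (or Snapshot) for stateful resources, plus the resource's own DeletionProtection property where one exists (for example on AWS::RDS::DBInstance). Anything set outside the template is invisible to CloudFormation until a delete fails.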

Part 3: Rollback failures

Failure 6: Rollback fails because resource won't delete

Symptoms:

Stack update fails
Rollback starts
Rollback fails
Stack stuck in UPDATE_ROLLBACK_FAILED (ROLLBACK_FAILED if the initial create failed)

Root cause:
A resource created during the failed update cannot be deleted. Common reasons:

S3 bucket has versioning enabled and contains objects
RDS has deletion protection enabled
Network interface is still attached
Custom resource performed external actions

Diagnosis:

# Find which resource caused rollback failure
aws cloudformation describe-stack-events --stack-name prod-stack \
  --query "StackEvents[?ResourceStatus=='DELETE_FAILED']"

Fix - for S3:

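# Empty the bucket first (removes current objects only)
aws s3 rm s3://bucket-name --recursive

# If versioning is enabled, old versions and delete markers remain. Suspending
# versioning does not remove them; list them and delete each one explicitly
aws s3api list-object-versions --bucket bucket-name

# Resume the rollback (for a stack in UPDATE_ROLLBACK_FAILED)
aws cloudformation continue-update-rollback --stack-name prod-stack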
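
Emptying a versioned bucket means deleting every version and delete marker, which is awkward from the CLI. The following is a sketch only, assuming `s3` is a boto3 S3 client; `purge_all_versions` is a hypothetical name, and the operation is irreversible.

```python
def purge_all_versions(s3, bucket):
    """Delete every object version and delete marker in a bucket. Irreversible.

    Returns the number of versions and markers removed.
    """
    deleted = 0
    paginator = s3.get_paginator("list_object_versions")
    for page in paginator.paginate(Bucket=bucket):
        entries = page.get("Versions", []) + page.get("DeleteMarkers", [])
        objects = [{"Key": e["Key"], "VersionId": e["VersionId"]} for e in entries]
        if not objects:
            continue
        # One page holds at most 1000 entries, which is also the delete_objects limit
        result = s3.delete_objects(Bucket=bucket, Delete={"Objects": objects, "Quiet": True})
        if result.get("Errors"):
            raise RuntimeError(f"Failed to delete some objects: {result['Errors'][:3]}")
        deleted += len(objects)
    return deleted
```

For very large buckets, a lifecycle rule that expires noncurrent versions is usually the cheaper route, at the cost of waiting for it to run.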

Fix - for RDS:

# Disable deletion protection
aws rds modify-db-instance --db-instance-identifier mydb \
  --no-deletion-protection

# Skip final snapshot if you want fast cleanup
aws rds delete-db-instance --db-instance-identifier mydb \
  --skip-final-snapshot

Prevention:
Design stateful resources with DeletionPolicy: Retain in production. Accept that you will clean them manually. Do not let stateful resources block automated rollbacks.

Failure 7: Rollback takes too long, extending downtime

Symptoms:

Stack update fails at minute 15
Rollback takes another 20 minutes
Total downtime: 35+ minutes

Root cause:
Resources with DeletionPolicy: Snapshot take time to create snapshots during rollback. RDS snapshots can take 10-20 minutes. EBS snapshots add minutes per volume.

Diagnosis:

# Check which resource is taking time during rollback
aws cloudformation describe-stack-events --stack-name prod-stack \
  --query "StackEvents[?contains(ResourceStatus, 'DELETE')]"

Fix during incident:
You have limited options once rollback starts. The fastest path is often to let it finish, even if slow.

Prevention:
Separate stateful resources (databases, buckets) into their own stack. This stack changes rarely. Application stacks change frequently but contain no stateful resources.

# Stack 1: Data (deploys monthly, rollback takes time but happens rarely)
DatabaseStack:
  Type: AWS::RDS::DBInstance
  DeletionPolicy: Snapshot

# Stack 2: Application (deploys daily, rollback is fast)
AppStack:
  Type: AWS::AutoScaling::AutoScalingGroup
  DeletionPolicy: Delete  # No snapshot, instant deletion

When AppStack fails, rollback takes seconds, not minutes. Database is untouched.

Part 4: IAM and permission failures

Failure 8: "User is not authorized to perform cloudformation:CreateStack"

Symptoms:

CI/CD pipeline fails
Error message about missing permission
Same permissions worked yesterday

Root cause:
IAM policies changed. A condition was added. A permission was removed. The role used by CI/CD no longer has required access.

Diagnosis:

# Confirm which principal the pipeline is actually running as
# (create-stack has no dry-run mode, so use the policy simulator below)
aws sts get-caller-identity

# Check effective permissions for the role
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::123456789012:role/ci-cd-role \
  --action-names cloudformation:CreateStack \
  --resource-arns arn:aws:cloudformation:us-east-1:123456789012:stack/*
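
The same simulation can run as a pipeline pre-check that fails fast on a missing permission. This is a sketch under assumptions: `iam` is a boto3 IAM client, the function names are hypothetical, and truncated (paginated) results are not handled.

```python
def denied_actions(simulation_response):
    """Map each action that is not allowed to its decision (implicitDeny or explicitDeny)."""
    return {
        result["EvalActionName"]: result["EvalDecision"]
        for result in simulation_response["EvaluationResults"]
        if result["EvalDecision"] != "allowed"
    }


def check_pipeline_role(iam, role_arn, actions, resource_arns):
    """Simulate the role's policies and return the actions it cannot perform."""
    response = iam.simulate_principal_policy(
        PolicySourceArn=role_arn,
        ActionNames=actions,
        ResourceArns=resource_arns,
    )
    return denied_actions(response)
```

An `explicitDeny` points at a deny statement that was added somewhere; an `implicitDeny` means no policy grants the action at all.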

Fix:
Add the missing permission to the CI/CD role:

{
  "Effect": "Allow",
  "Action": "cloudformation:CreateStack",
  "Resource": "arn:aws:cloudformation:region:account:stack/*"
}

Prevention:
Use IAM permissions boundaries and permission guardrails. Test CI/CD role permissions in a staging account before deploying to production.

Failure 9: Cross-account stack operations fail

Symptoms:

Stack in Account A tries to create a resource in Account B
Error: "Access denied" or "Role does not exist"

Root cause:
CloudFormation does not natively support cross-account resource creation. You need IAM roles in both accounts with trust relationships.

Fix — setup cross-account role in target account:

# In Account B (target)
CrossAccountRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Statement:
        - Effect: Allow
          Principal:
            AWS: arn:aws:iam::AccountA:root
          Action: sts:AssumeRole
    ManagedPolicyArns:
      - arn:aws:iam::aws:policy/AdministratorAccess  # Scope down in production

Fix — assume role from source account:

# In Account A (source)
CustomResource:
  Type: Custom::CrossAccount
  Properties:
    ServiceToken: !GetAtt CrossAccountLambda.Arn
    TargetRoleArn: arn:aws:iam::AccountB:role/CrossAccountRole
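
Inside CrossAccountLambda, the first step is to exchange the function's own credentials for temporary ones in Account B. A minimal sketch of that step, assuming `sts` is a boto3 STS client; the function names and the session name are hypothetical.

```python
def credentials_for_role(sts, role_arn, session_name="cfn-cross-account"):
    """Assume the target role and return keyword arguments for a boto3 client."""
    creds = sts.assume_role(RoleArn=role_arn, RoleSessionName=session_name)["Credentials"]
    return {
        "aws_access_key_id": creds["AccessKeyId"],
        "aws_secret_access_key": creds["SecretAccessKey"],
        "aws_session_token": creds["SessionToken"],
    }


def target_client(service_name, role_arn):
    """Build a client that acts in the target account (requires boto3 at runtime)."""
    import boto3  # imported here so credentials_for_role stays testable without boto3

    return boto3.client(service_name, **credentials_for_role(boto3.client("sts"), role_arn))
```

The handler would read `TargetRoleArn` from `event["ResourceProperties"]`, do its work through `target_client(...)`, and still send a SUCCESS or FAILED response as any custom resource must.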

Prevention:
Design stacks to be account-specific. Use AWS Organizations and StackSets for multi-account deployments instead of cross-account resource references.

Part 5: Template validation failures that only appear at deploy time

Failure 10: Template validates but deployment fails

Symptoms:

aws cloudformation validate-template --template-body file://template.yaml
# Returns the template's parameters and description, with no error

But deployment fails with: "Encountered unsupported property" or "Resource handler returned invalid request"

Root cause:
validate-template only checks that the template is well-formed JSON or YAML with a valid overall structure. It does not check:

Resource property combinations that are invalid (e.g., certain combinations of SourceSecurityGroupId and CidrIp)
Region-specific limitations (some resources not available in all regions)
Service limits (e.g., requesting 2000 IOPS when limit is 1000)

Diagnosis:

Deploy with --disable-rollback to keep failed resources for inspection:

aws cloudformation create-stack --stack-name test-stack \
  --template-body file://template.yaml \
  --disable-rollback

Then examine the failed resource's status reason:

aws cloudformation describe-stack-resources --stack-name test-stack \
  --query "StackResources[?ResourceStatus=='CREATE_FAILED']"

Fix:
Correct the specific property combination. Check region availability. Request service limit increases before deployment.

Prevention:
Test in a staging region first. Use cfn-lint in CI/CD — it catches property combination errors that validate-template misses.

# Install cfn-lint
pip install cfn-lint

# Run locally before commit
cfn-lint template.yaml

Part 6: Change set failures

Failure 11: Change set shows replacement when you expected modification

Symptoms:

Change set indicates "Replacement" for a production resource
You expected an in-place modification
Replacement means downtime

Root cause:
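Certain property changes force replacement. For RDS, changing DBInstanceIdentifier, DBSubnetGroupName or StorageEncrypted replaces the instance, while EngineVersion and DBInstanceClass are normally updated in place with some interruption. The "Update requires" line for each property in the CloudFormation resource reference is the authoritative source.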

Diagnosis:

Check which property triggered replacement:

aws cloudformation describe-change-set --stack-name prod-stack \
  --change-set-name my-change-set \
  --query "Changes[?ResourceChange.Replacement=='True']"

Common properties that force replacement:

Resource	Properties that force replacement
AWS::RDS::DBInstance	`DBInstanceIdentifier`, `DBName`, `DBSubnetGroupName`, `StorageEncrypted`
AWS::EC2::Instance	`ImageId`, `SubnetId`, `InstanceType` (instance store-backed only)
AWS::S3::Bucket	`BucketName`
AWS::Lambda::Function	`FunctionName` (a `Code` change updates in place)
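
A pipeline can refuse to execute a change set that replaces anything until a human approves it. The sketch below works on the parsed output of describe-change-set; `replacements` is a hypothetical name, and "Conditional" is treated as a replacement to stay on the safe side.

```python
def replacements(change_set):
    """List (logical_id, type, replacement, properties) for resources that may be replaced."""
    findings = []
    for change in change_set.get("Changes", []):
        rc = change.get("ResourceChange", {})
        if rc.get("Action") != "Modify" or rc.get("Replacement") not in ("True", "Conditional"):
            continue
        # Collect the property names whose change requires recreation
        properties = sorted({
            detail["Target"]["Name"]
            for detail in rc.get("Details", [])
            if detail.get("Target", {}).get("Attribute") == "Properties"
            and detail["Target"].get("Name")
            and detail["Target"].get("RequiresRecreation") in ("Always", "Conditionally")
        })
        findings.append((rc["LogicalResourceId"], rc["ResourceType"], rc["Replacement"], properties))
    return findings
```

An empty list means no modified resource is flagged for replacement; anything else is worth a manual review before execute-change-set.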

Fix:

Accept the replacement and plan for downtime
Use blue/green deployment for zero-downtime replacement
Modify the resource directly in AWS console (not recommended for IaC)

Prevention:
Always review change sets in staging before production. Know which properties cause replacement for your critical resources.

Failure 12: Change set execution fails because of update conflicts

Symptoms:

Change set creates successfully
execute-change-set fails
Error: "Cannot update stack because another update is in progress"

Root cause:
Another process (CI/CD pipeline, another engineer, scheduled automation) started a stack update while your change set was waiting for execution.

Diagnosis:

# Check current stack status
aws cloudformation describe-stacks --stack-name prod-stack \
  --query "Stacks[0].StackStatus"

# Status like UPDATE_IN_PROGRESS or ROLLBACK_IN_PROGRESS means locked

Fix:
Wait for the other update to complete. Then create a new change set based on the latest stack state. Do not execute the old change set — it's now out of date.

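# Delete old change set
aws cloudformation delete-change-set --stack-name prod-stack \
  --change-set-name old-change-set

# Create new change set against current stack
aws cloudformation create-change-set --stack-name prod-stack \
  --change-set-name new-change-set --template-body file://template.yaml

# Wait until the change set is ready, then review it
aws cloudformation wait change-set-create-complete --stack-name prod-stack \
  --change-set-name new-change-set

# Execute fresh change set
aws cloudformation execute-change-set --stack-name prod-stack \
  --change-set-name new-change-set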

Prevention:

Implement stack-level locking, for example a pipeline concurrency limit or a lock object created with a conditional S3 write
Coordinate CI/CD pipelines to never deploy simultaneously to the same stack
Use separate stacks for separate environments
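
Short of a real lock, a pipeline can at least wait for the stack to be idle before creating its change set. This sketch narrows the race window but does not close it; `cfn` is assumed to be a boto3 CloudFormation client, and the names and polling defaults are hypothetical.

```python
import time


def is_busy(stack_status):
    """True while another operation holds the stack.

    REVIEW_IN_PROGRESS is excluded: it marks a stack that only has a pending
    change set, not a running operation.
    """
    return stack_status.endswith("_IN_PROGRESS") and stack_status != "REVIEW_IN_PROGRESS"


def wait_until_idle(cfn, stack_name, poll_seconds=15, max_polls=120, sleep=time.sleep):
    """Poll the stack until no operation is in progress; return the final status."""
    for _ in range(max_polls):
        status = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]["StackStatus"]
        if not is_busy(status):
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"{stack_name} was still busy after {max_polls} polls")
```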

Part 7: Performance and quota failures

Failure 13: Stack deployment times out due to API rate limiting

Symptoms:

Stack deployment slows dramatically after hundreds of resources
Error: "Rate exceeded" for various AWS APIs
Some resources take 5-10 retries before succeeding

Root cause:
CloudFormation makes many API calls to create resources. AWS APIs have rate limits. Large stacks hit these limits.

Diagnosis:

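# lookup-events cannot filter by error code, so match throttling errors in the raw event JSON (EC2 calls shown)
aws cloudtrail lookup-events --lookup-attributes AttributeKey=EventSource,AttributeValue=ec2.amazonaws.com \
  --query "Events[?contains(CloudTrailEvent, 'Throttling') || contains(CloudTrailEvent, 'RequestLimitExceeded')].[EventTime,EventName]" \
  --output text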

Fix — immediate:
Split the stack. The hard quota is 500 resources per stack, but deployments slow down well before that; treating roughly 200 resources as a working ceiling is a reasonable rule of thumb.

# List resources by type to see distribution
aws cloudformation list-stack-resources --stack-name large-stack \
  --query "StackResourceSummaries[*].[ResourceType]" --output text | sort | uniq -c

Fix — long term:
Design modular stacks:

network-stack.yaml     (VPC, subnets, route tables)
data-stack.yaml        (RDS, ElastiCache, S3)
compute-stack.yaml     (ASG, launch templates)
app-stack.yaml         (Lambda, API Gateway)

Prevention:
Monitor stack creation time. If it exceeds 15 minutes for non-stateful resources, split the stack.

Failure 14: Service quota exceeded during deployment

Symptoms:

Deployment fails
Error: "You have reached your limit of X resources"

Root cause:
AWS account has default service limits. You're trying to create more resources than allowed.

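Common default quotas (check Service Quotas for your account's actual values):

VPCs per region: 5
VPC security groups per region: 2,500
RDS DB instances per region: 40
Lambda concurrent executions per region: 1,000 (new accounts may start lower)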

Diagnosis:

# Get the current quota value (this does not show usage)
aws service-quotas get-service-quota \
  --service-code ec2 --quota-code L-12345678

# List all quotas for a service
aws service-quotas list-service-quotas --service-code rds

Fix — immediate:
Request quota increase from AWS Support or via Service Quotas API:

aws service-quotas request-service-quota-increase \
  --service-code ec2 --quota-code L-12345678 \
  --desired-value 100

Fix — tactical:
Reduce resource count in the current deployment. Use smaller instance sizes. Share resources across stacks.

Prevention:
Include quota checks in your CI/CD pipeline before deployment:

# Script to check quotas before deploying
python scripts/check_quotas.py --template template.yaml
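
The scripts/check_quotas.py above is your own script, not an AWS tool. Its core could be as small as the sketch below, which assumes the template is already parsed into a dict and that the caller supplies current usage and quota values per resource type; all names are hypothetical.

```python
from collections import Counter


def count_resources(template):
    """Count the resources a template declares, grouped by type."""
    return Counter(resource["Type"] for resource in template.get("Resources", {}).values())


def quota_shortfalls(template, usage, quotas):
    """Return {resource_type: how many over} for types that would exceed their quota.

    `usage` and `quotas` map resource type to a number. Types without a known
    quota are skipped, not treated as unlimited or as failures.
    """
    shortfalls = {}
    for resource_type, requested in count_resources(template).items():
        if resource_type not in quotas:
            continue
        headroom = quotas[resource_type] - usage.get(resource_type, 0)
        if requested > headroom:
            shortfalls[resource_type] = requested - headroom
    return shortfalls
```

This counts every resource as new, so it overestimates for stack updates; it is a coarse pre-flight check, not an exact prediction.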

Part 8: Troubleshooting workflow - where to start

When a CloudFormation deployment fails, follow this workflow:

Step 1: Get the raw error

aws cloudformation describe-stack-events --stack-name prod-stack \
  --max-items 20 --query "StackEvents[?ResourceStatus=='CREATE_FAILED' || ResourceStatus=='UPDATE_FAILED']"

Look for the ResourceStatusReason field. This is your primary clue.

Step 2: Identify the failed resource

The error message tells you which logical resource failed. Find its type and properties in your template.

Step 3: Check if it's a known failure pattern

Error message pattern	Likely cause	Fix section
"Role does not exist"	IAM eventual consistency	Part 1, Failure 1
"Rate exceeded"	API throttling	Part 7, Failure 13
"Limit exceeded"	Service quota	Part 7, Failure 14
"Deletion protection"	Rollback blocked	Part 3, Failure 6
"Another update in progress"	Concurrent update	Part 6, Failure 12

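The lookup in the table above can be automated so that a failed deployment prints the relevant section next to the raw error. This is a sketch: the substrings are taken from the table, real messages vary in wording, and `classify` is a hypothetical name.

```python
KNOWN_PATTERNS = [
    # (substring to look for, likely cause, section in this article)
    ("role does not exist", "IAM eventual consistency", "Part 1, Failure 1"),
    ("rate exceeded", "API throttling", "Part 7, Failure 13"),
    ("limit exceeded", "Service quota", "Part 7, Failure 14"),
    ("deletion protection", "Rollback blocked", "Part 3, Failure 6"),
    ("another update in progress", "Concurrent update", "Part 6, Failure 12"),
]


def classify(status_reason):
    """Return (likely cause, section) for a ResourceStatusReason, or None if unknown."""
    lowered = status_reason.lower()
    for pattern, cause, section in KNOWN_PATTERNS:
        if pattern in lowered:
            return cause, section
    return None
```

Extending the list with patterns from your own incidents turns it into a lightweight runbook index.
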
Step 4: Deploy with `--disable-rollback` for debugging

aws cloudformation create-stack --stack-name debug-stack \
  --template-body file://template.yaml \
  --disable-rollback

Failed resources remain so you can inspect them directly.

Step 5: Inspect the failed resource directly

For EC2:

aws ec2 describe-instances --instance-ids i-12345
ssh ec2-user@instance-ip # Check logs

For Lambda:

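aws logs describe-log-groups --log-group-name-prefix /aws/lambda/my-function
# Read the most recently active stream (streams are not returned newest-first by default)
aws logs get-log-events --log-group-name /aws/lambda/my-function --log-stream-name "$(aws logs describe-log-streams \
  --log-group-name /aws/lambda/my-function --order-by LastEventTime --descending --no-paginate --query "logStreams[0].logStreamName" --output text)"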

For RDS:

aws rds describe-db-instances --db-instance-identifier mydb
aws rds describe-events --source-identifier mydb --source-type db-instance

Step 6: Fix, then continue

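If the stack is in UPDATE_ROLLBACK_FAILED, fix the blocker and continue the rollback. If it is in ROLLBACK_FAILED (a failed create), the only way forward is to delete it.

Option A: Delete the failed stack and recreate (for ROLLBACK_FAILED; resources with DeletionPolicy: Delete are destroyed)

aws cloudformation delete-stack --stack-name prod-stack
aws cloudformation wait stack-delete-complete --stack-name prod-stack
aws cloudformation create-stack --stack-name prod-stack --template-body file://template.yaml

Option B: Continue rolling back after fixing the blocker (for UPDATE_ROLLBACK_FAILED)

# Fix the blocking resource (empty S3 bucket, disable deletion protection)
# Then resume; --resources-to-skip takes logical IDs of resources that cannot be rolled back
aws cloudformation continue-update-rollback --stack-name prod-stack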

Production CloudFormation checklist

Before deploying to production, verify:

Drift detection

[ ] Scheduled for all production stacks
[ ] Weekly automated drift check configured
[ ] Alerts configured for drift findings

Rollback strategy

[ ] Stateful resources have DeletionPolicy: Retain or Snapshot
[ ] Stateless resources have DeletionPolicy: Delete
[ ] Stateful and stateless resources in separate stacks

IAM and security

[ ] No "Action": "*" in policies
[ ] Secrets use {{resolve:secretsmanager:...}} not parameters
[ ] CI/CD role has minimal required permissions
[ ] cfn-guard or cfn-lint running in CI

Failure handling

[ ] CreationPolicy includes timeout and signal handling
[ ] Custom resources always send SUCCESS or FAILED responses
[ ] Nested stack depth ≤ 2

Performance

[ ] No stack exceeds 200 resources
[ ] No stack consistently deploys longer than 15 minutes
[ ] Service quotas checked before deployment

Troubleshooting readiness

[ ] describe-stack-events command documented in runbook
[ ] Access to failed resource logs (EC2, Lambda, RDS) available
[ ] --disable-rollback used in staging deployments

Written by Onyedikachi Obidiegwu | Cloud Security Engineer
