Moving past YAML templates to failure handling, security, and real tradeoffs
Before we start
This is a follow-up to Infrastructure as Code with AWS CloudFormation: From Fundamentals to Production Patterns.
That article covered templates, stacks, nested stacks, CI/CD, and production best practices.
This article covers what happens when those best practices aren't enough. When things break in ways the documentation doesn't warn you about. When you're reading CloudFormation error messages at midnight and need answers.
Part 1: Stack deployment failures
Failure 1: "Resource handler returned message: 'Role does not exist'"
Symptoms:
- IAM role creates successfully (status: CREATE_COMPLETE)
- Lambda or EC2 resource fails immediately after
- Error: "The role named 'xxx' does not exist or is not authorized"
Root cause:
IAM is eventually consistent. CloudFormation marks the role CREATE_COMPLETE as soon as the API call returns, but the role can take 5-10 seconds to propagate to every AWS endpoint that needs to see it.
Fix:
LambdaFunction:
  Type: AWS::Lambda::Function
  DependsOn: LambdaExecutionRole
  Properties:
    Role: !GetAtt LambdaExecutionRole.Arn
DependsOn forces CloudFormation to wait until the role reaches CREATE_COMPLETE before creating the Lambda function, which in practice gives IAM time to propagate. (!GetAtt already implies this dependency; the explicit DependsOn makes the ordering visible and intentional.)
Prevention:
Always add DependsOn when a resource consumes an IAM role created in the same stack.
Failure 2: Stack timeout without clear cause
Symptoms:
- Stack creation or update times out after the configured limit
- No obvious error in event log
- Some resources show CREATE_IN_PROGRESS for hours
Root cause:
Resources with CreationPolicy or WaitCondition are waiting for signals that never arrive. Common causes:
- EC2 instance user data script fails silently
- Custom resource Lambda times out
- Application code never calls cfn-signal
Diagnosis:
# Find resources stuck in CREATE_IN_PROGRESS (likely waiting on a signal)
aws cloudformation describe-stack-resources --stack-name prod-stack \
--query "StackResources[?ResourceStatus=='CREATE_IN_PROGRESS']"
# For EC2, check user data logs on the instance
cat /var/log/cloud-init-output.log
Fix:
For EC2 with user data:
#!/bin/bash
# (user data rendered through Fn::Sub so ${AWS::StackName} and ${AWS::Region} resolve)
# Do your setup here
# Signal the result of this script back to CloudFormation
/opt/aws/bin/cfn-signal --exit-code $? --stack ${AWS::StackName} \
  --resource WebServerInstance --region ${AWS::Region}
For custom resources, implement timeout handling:
def handler(event, context):
    try:
        # Do work
        send_response(event, context, "SUCCESS")
    except Exception as e:
        # CRITICAL: Always send a response
        send_response(event, context, "FAILED", reason=str(e))
Prevention:
Always test CreationPolicy paths with --disable-rollback first so you can inspect failed resources without automatic cleanup.
Failure 3: Nested stack update fails, root cause invisible
Symptoms:
- Parent stack update fails
- Error message: "Nested stack failed to update"
- No details about why the nested stack failed
Root cause:
CloudFormation does not bubble up nested stack failure details to the parent. You have to check each nested stack individually.
Diagnosis:
# List nested stacks from the parent
aws cloudformation list-stack-resources --stack-name parent-stack \
  --query "StackResourceSummaries[?ResourceType=='AWS::CloudFormation::Stack'].PhysicalResourceId"
# Check each nested stack's events
aws cloudformation describe-stack-events --stack-name nested-stack-1
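With several levels of nesting, checking each stack by hand gets tedious. A boto3 sketch that walks the tree and prints every failure reason (the stack name is a placeholder):

import boto3

cfn = boto3.client("cloudformation")

def print_nested_failures(stack_name):
    # Recursively visit nested stacks and surface their failed events.
    for res in cfn.describe_stack_resources(StackName=stack_name)["StackResources"]:
        if res["ResourceType"] != "AWS::CloudFormation::Stack":
            continue
        child = res["PhysicalResourceId"]  # the nested stack's ARN
        for event in cfn.describe_stack_events(StackName=child)["StackEvents"]:
            if event["ResourceStatus"].endswith("FAILED"):
                print(child, event["LogicalResourceId"],
                      event.get("ResourceStatusReason", ""))
        print_nested_failures(child)

print_nested_failures("parent-stack")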
Fix:
Add explicit validation before updating parents:
# Validate nested template before updating parent
aws cloudformation validate-template --template-body file://nested.yaml
# Check nested stack for drift
aws cloudformation detect-stack-drift --stack-name nested-stack-1
Prevention:
Minimize nested stack depth (2 levels maximum). For complex dependencies, use StackSets or split into separate parent stacks.
Part 2: Drift and configuration mismatch
Failure 4: Production resource changed outside CloudFormation
Symptoms:
- Security group rule allows unexpected traffic
- S3 bucket becomes public
- RDS backup retention period changes
- No corresponding change in Git history
Root cause:
Someone modified a resource directly in the AWS console or via CLI, bypassing CloudFormation.
Diagnosis:
# Detect drift on a stack
aws cloudformation detect-stack-drift --stack-name prod-web
# Get detailed drift results
aws cloudformation describe-stack-drift-detection-status \
--stack-drift-detection-id <id>
# List drifted resources once detection completes
aws cloudformation describe-stack-resource-drifts --stack-name prod-web \
  --stack-resource-drift-status-filters MODIFIED DELETED
Fix — manual:
If the live configuration is correct, update the template to match it and run a stack update. If the template is correct, revert the manual change. For a resource that exists but was never in the stack, bring it under management with an import change set:
# Bring an unmanaged resource into the stack (the template must include the
# resource with a DeletionPolicy set; the identifier below is a placeholder)
aws cloudformation create-change-set --stack-name prod-web \
  --change-set-name import-bucket --change-set-type IMPORT \
  --template-body file://template.yaml \
  --resources-to-import '[{"ResourceType":"AWS::S3::Bucket","LogicalResourceId":"DataBucket","ResourceIdentifier":{"BucketName":"my-bucket-name"}}]'
aws cloudformation execute-change-set --change-set-name import-bucket \
  --stack-name prod-web
Fix — automated:
# EventBridge rule to run drift detection weekly
DriftDetectionRule:
  Type: AWS::Events::Rule
  Properties:
    ScheduleExpression: "cron(0 12 ? * MON *)"  # every Monday at noon UTC
    Targets:
      - Arn: !GetAtt DriftLambda.Arn
        Id: drift-detection-lambda
        Input: '{"stackName": "prod-web"}'
Prevention:
- Enforce IAM policies that prevent resource modification outside CloudFormation
- Enable drift detection on all production stacks
- Review drift reports weekly
Failure 5: Stack drift causes deletion protection to block cleanup
Symptoms:
- Trying to delete a stack
- Error: "Cannot delete stack because resource X has deletion protection"
- That resource was not supposed to have deletion protection
Root cause:
Someone enabled deletion protection directly on an RDS database or S3 bucket. CloudFormation doesn't know about it.
Diagnosis:
# Find which resource is blocking deletion
aws cloudformation describe-stack-resources --stack-name prod-stack \
--query "StackResources[?ResourceStatus=='DELETE_FAILED']"
Fix:
# Remove deletion protection from the resource directly
aws rds modify-db-instance --db-instance-identifier mydb \
--no-deletion-protection
# For S3, the usual blocker is a bucket that still contains objects; empty it first
aws s3 rm s3://mybucket --recursive
# Retry stack deletion
aws cloudformation delete-stack --stack-name prod-stack
Prevention:
Prefer DeletionPolicy: Retain in your template for stateful resources instead of relying on service-side deletion protection. CloudFormation understands DeletionPolicy; deletion protection toggled in the console is invisible to it until drift detection runs.
Part 3: Rollback failures
Failure 6: Rollback fails because resource won't delete
Symptoms:
- Stack update fails
- Rollback starts
- Rollback fails
- Stack stuck in UPDATE_ROLLBACK_FAILED (or ROLLBACK_FAILED for a failed create)
Root cause:
A resource created during the failed update cannot be deleted. Common reasons:
- S3 bucket has versioning enabled and contains objects
- RDS has deletion protection enabled
- Network interface is still attached
- Custom resource performed external actions
Diagnosis:
# Find which resource caused rollback failure
aws cloudformation describe-stack-events --stack-name prod-stack \
--query "StackEvents[?ResourceStatus=='DELETE_FAILED']"
Fix - for S3:
# Empty the bucket first
aws s3 rm s3://bucket-name --recursive
# Disable versioning
aws s3api put-bucket-versioning --bucket bucket-name \
--versioning-configuration Status=Suspended
# Retry stack deletion
aws cloudformation delete-stack --stack-name prod-stack
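Caveat: on a versioned bucket, aws s3 rm deletes only the current versions; old versions and delete markers keep the bucket non-empty. A boto3 sketch that purges everything so the retried deletion can succeed (bucket name is a placeholder):

import boto3

# Remove every object version and delete marker; CloudFormation can then
# delete the now-empty bucket when you retry the stack deletion.
bucket = boto3.resource("s3").Bucket("bucket-name")
bucket.object_versions.delete()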
Fix - for RDS:
# Disable deletion protection
aws rds modify-db-instance --db-instance-identifier mydb \
--no-deletion-protection
# Skip final snapshot if you want fast cleanup
aws rds delete-db-instance --db-instance-identifier mydb \
--skip-final-snapshot
Prevention:
Design stateful resources with DeletionPolicy: Retain in production. Accept that you will clean them manually. Do not let stateful resources block automated rollbacks.
Failure 7: Rollback takes too long, extending downtime
Symptoms:
- Stack update fails at minute 15
- Rollback takes another 20 minutes
- Total downtime: 35+ minutes
Root cause:
Resources with DeletionPolicy: Snapshot take time to create snapshots during rollback. RDS snapshots can take 10-20 minutes. EBS snapshots add minutes per volume.
Diagnosis:
# Check which resource is taking time during rollback
aws cloudformation describe-stack-events --stack-name prod-stack \
--query "StackEvents[?contains(ResourceStatus, 'DELETE')]"
Fix during incident:
You have limited options once rollback starts. The fastest path is often to let it finish, even if slow.
Prevention:
Separate stateful resources (databases, buckets) into their own stack. This stack changes rarely. Application stacks change frequently but contain no stateful resources.
# data-stack.yaml: deploys monthly; rollback is slow but rare
Database:
  Type: AWS::RDS::DBInstance
  DeletionPolicy: Snapshot

# app-stack.yaml: deploys daily; rollback is fast
AutoScalingGroup:
  Type: AWS::AutoScaling::AutoScalingGroup
  DeletionPolicy: Delete  # no snapshot, deletion is quick
When the application stack fails, rollback takes seconds, not minutes, and the database is untouched.
Part 4: IAM and permission failures
Failure 8: "User is not authorized to perform cloudformation:CreateStack"
Symptoms:
- CI/CD pipeline fails
- Error message about missing permission
- Same permissions worked yesterday
Root cause:
IAM policies changed. A condition was added. A permission was removed. The role used by CI/CD no longer has required access.
Diagnosis:
# create-stack has no dry-run option; check the role's effective permissions instead
aws iam simulate-principal-policy \
--policy-source-arn arn:aws:iam::123456789012:role/ci-cd-role \
--action-names cloudformation:CreateStack \
--resource-arns arn:aws:cloudformation:us-east-1:123456789012:stack/*
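The same simulation works as a pipeline pre-flight check over every action your deploys need. A boto3 sketch (the role ARN, account id, and action list are placeholders):

import sys
import boto3

iam = boto3.client("iam")

ACTIONS = [
    "cloudformation:CreateStack",
    "cloudformation:UpdateStack",
    "cloudformation:CreateChangeSet",
    "cloudformation:ExecuteChangeSet",
]

results = iam.simulate_principal_policy(
    PolicySourceArn="arn:aws:iam::123456789012:role/ci-cd-role",
    ActionNames=ACTIONS,
    ResourceArns=["arn:aws:cloudformation:us-east-1:123456789012:stack/*"],
)["EvaluationResults"]

denied = [r["EvalActionName"] for r in results if r["EvalDecision"] != "allowed"]
if denied:
    print("Denied actions:", ", ".join(denied))
    sys.exit(1)  # fail the pipeline before a deploy can break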
Fix:
Add the missing permission to the CI/CD role:
{
  "Effect": "Allow",
  "Action": "cloudformation:CreateStack",
  "Resource": "arn:aws:cloudformation:region:account:stack/*"
}
Prevention:
Use IAM boundaries and permission guardrails. Test CI/CD role permissions in a staging account before deploying to production.
Failure 9: Cross-account stack operations fail
Symptoms:
- Stack in Account A tries to create a resource in Account B
- Error: "Access denied" or "Role does not exist"
Root cause:
CloudFormation does not natively support cross-account resource creation. You need IAM roles in both accounts with trust relationships.
Fix — setup cross-account role in target account:
# In Account B (target)
CrossAccountRole:
Type: AWS::IAM::Role
Properties:
AssumeRolePolicyDocument:
Statement:
- Effect: Allow
Principal:
AWS: arn:aws:iam::AccountA:root
Action: sts:AssumeRole
ManagedPolicyArns:
- arn:aws:iam::aws:policy/AdministratorAccess # Scope down in production
Fix — assume role from source account:
# In Account A (source)
CustomResource:
  Type: Custom::CrossAccount
  Properties:
    ServiceToken: !GetAtt CrossAccountLambda.Arn
    TargetRoleArn: arn:aws:iam::AccountB:role/CrossAccountRole
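The CrossAccountLambda behind that ServiceToken has to assume the target role itself. A minimal boto3 sketch of that step (names follow the snippets above):

import boto3

def assume_target_session(role_arn):
    # Exchange the Lambda's own credentials for temporary ones in Account B.
    creds = boto3.client("sts").assume_role(
        RoleArn=role_arn,
        RoleSessionName="cfn-cross-account",
    )["Credentials"]
    return boto3.Session(
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )

# Inside the custom resource handler:
# session = assume_target_session(event["ResourceProperties"]["TargetRoleArn"])
# s3_in_account_b = session.client("s3")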
Prevention:
Design stacks to be account-specific. Use AWS Organizations and StackSets for multi-account deployments instead of cross-account resource references.
Part 5: Template validation failures that only appear at deploy time
Failure 10: Template validates but deployment fails
Symptoms:
aws cloudformation validate-template --template-body file://template.yaml
# Returns: Template is valid
But deployment fails with: "Encountered unsupported property" or "Resource handler returned invalid request"
Root cause:
validate-template checks syntax and basic schema. It does not check:
- Resource property combinations that are invalid (e.g., specifying both `SourceSecurityGroupId` and `CidrIp` on the same ingress rule)
- Region-specific limitations (some resources are not available in all regions)
- Service limits (e.g., requesting 2000 IOPS when the limit is 1000)
Diagnosis:
Deploy with --disable-rollback to keep failed resources for inspection:
aws cloudformation create-stack --stack-name test-stack \
--template-body file://template.yaml \
--disable-rollback
Then examine the failed resource's status reason:
aws cloudformation describe-stack-resources --stack-name test-stack \
--query "StackResources[?ResourceStatus=='CREATE_FAILED']"
Fix:
Correct the specific property combination. Check region availability. Request service limit increases before deployment.
Prevention:
Test in a staging region first. Use cfn-lint in CI/CD — it catches property combination errors that validate-template misses.
# Install cfn-lint
pip install cfn-lint
# Run locally before commit
cfn-lint template.yaml
Part 6: Change set failures
Failure 11: Change set shows replacement when you expected modification
Symptoms:
- Change set indicates "Replacement" for a production resource
- You expected an in-place modification
- Replacement means downtime
Root cause:
Certain property changes force replacement. For RDS: changing EngineVersion or DBInstanceClass sometimes requires replacement depending on the version difference.
Diagnosis:
Check which property triggered replacement:
aws cloudformation describe-change-set --change-set-name my-change-set \
  --stack-name prod-stack \
  --query "Changes[?ResourceChange.Replacement=='True']"
Common properties that force replacement:

| Resource | Properties that force replacement |
|---|---|
| AWS::RDS::DBInstance | `Engine`, `EngineVersion` (major version), `DBSubnetGroupName` |
| AWS::EC2::Instance | `ImageId`, `InstanceType` (sometimes), `SubnetId` |
| AWS::S3::Bucket | `BucketName` (can't change), `AccessControl` (sometimes) |
| AWS::Lambda::Function | `FunctionName` |
Fix:
- Accept the replacement and plan for downtime
- Use blue/green deployment for zero-downtime replacement
- Modify the resource directly in AWS console (not recommended for IaC)
Prevention:
Always review change sets in staging before production. Know which properties cause replacement for your critical resources.
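That review can also be enforced mechanically. A boto3 sketch of a CI gate that fails the pipeline whenever a change set would replace anything (stack and change set names are placeholders):

import sys
import boto3

cfn = boto3.client("cloudformation")

changes = cfn.describe_change_set(
    ChangeSetName="my-change-set", StackName="prod-stack"
)["Changes"]

# Replacement is reported as "True", "False", or "Conditional".
replacements = [
    c["ResourceChange"]["LogicalResourceId"]
    for c in changes
    if c["ResourceChange"].get("Replacement") in ("True", "Conditional")
]
if replacements:
    print("Change set would replace:", ", ".join(replacements))
    sys.exit(1)  # a human must approve replacements explicitly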
Failure 12: Change set execution fails because of update conflicts
Symptoms:
- Change set creates successfully
- `execute-change-set` fails
- Error: "Cannot update stack because another update is in progress"
Root cause:
Another process (CI/CD pipeline, another engineer, scheduled automation) started a stack update while your change set was waiting for execution.
Diagnosis:
# Check current stack status
aws cloudformation describe-stacks --stack-name prod-stack \
--query "Stacks[0].StackStatus"
# Status like UPDATE_IN_PROGRESS or ROLLBACK_IN_PROGRESS means locked
Fix:
Wait for the other update to complete. Then create a new change set based on the latest stack state. Do not execute the old change set — it's now out of date.
# Delete the stale change set
aws cloudformation delete-change-set --change-set-name old-change-set \
  --stack-name prod-stack
# Create a new change set against the current stack state
aws cloudformation create-change-set --stack-name prod-stack \
  --change-set-name new-change-set --template-body file://template.yaml
# Execute the fresh change set
aws cloudformation execute-change-set --change-set-name new-change-set \
  --stack-name prod-stack
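If your pipeline regularly races other automation, have it wait for the stack to stabilize before creating the change set. A small boto3 sketch:

import time
import boto3

cfn = boto3.client("cloudformation")

def wait_until_stable(stack_name, poll_seconds=15):
    # Block while any create/update/rollback operation is in flight.
    while True:
        status = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]["StackStatus"]
        if not status.endswith("_IN_PROGRESS"):
            return status
        time.sleep(poll_seconds)

wait_until_stable("prod-stack")
# ...now create and execute a fresh change set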
Prevention:
- Implement stack-level locking in your pipeline (for example, a DynamoDB lock item or a custom resource)
- Coordinate CI/CD pipelines to never deploy simultaneously to the same stack
- Use separate stacks for separate environments
Part 7: Performance and quota failures
Failure 13: Stack deployment times out due to API rate limiting
Symptoms:
- Stack deployment slows dramatically after hundreds of resources
- Error: "Rate exceeded" for various AWS APIs
- Some resources take 5-10 retries before succeeding
Root cause:
CloudFormation makes many API calls to create resources. AWS APIs have rate limits. Large stacks hit these limits.
Diagnosis:
# Check CloudTrail for throttling errors; lookup-events can't filter on error
# code server-side, so filter client-side with --query
aws cloudtrail lookup-events --max-results 50 \
  --query "Events[?contains(CloudTrailEvent, 'ThrottlingException')].[EventName,EventTime]"
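When throttles span more than one page of events, a boto3 sketch that paginates and filters on the error code inside each event record:

import json
import boto3

cloudtrail = boto3.client("cloudtrail")

# The lookup API has no server-side error-code filter, so inspect each record.
for page in cloudtrail.get_paginator("lookup_events").paginate():
    for event in page["Events"]:
        record = json.loads(event["CloudTrailEvent"])
        if record.get("errorCode") in ("ThrottlingException", "Throttling"):
            print(record["eventSource"], record["eventName"], record["eventTime"])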
Fix — immediate:
Split the stack. The hard limit is 500 resources per stack, and deployments slow down well before that; around 200 resources per stack is a practical ceiling.
# List resources by type to see distribution
aws cloudformation list-stack-resources --stack-name large-stack \
--query "StackResources[*].[ResourceType]" --output text | sort | uniq -c
Fix — long term:
Design modular stacks:
network-stack.yaml (VPC, subnets, route tables)
data-stack.yaml (RDS, ElastiCache, S3)
compute-stack.yaml (ASG, launch templates)
app-stack.yaml (Lambda, API Gateway)
Prevention:
Monitor stack creation time. If it exceeds 15 minutes for non-stateful resources, split the stack.
Failure 14: Service quota exceeded during deployment
Symptoms:
- Deployment fails
- Error: "You have reached your limit of X resources"
Root cause:
AWS account has default service limits. You're trying to create more resources than allowed.
Common default quotas (all adjustable per account):
- VPCs per region: 5
- Security groups per VPC: 500
- RDS instances per region: 40
- Lambda concurrent executions: 1000
Diagnosis:
# Look up the applied value for a specific quota (codes are placeholders)
aws service-quotas get-service-quota \
--service-code ec2 --quota-code L-12345678
# List all quotas for a service
aws service-quotas list-service-quotas --service-code rds
Fix — immediate:
Request quota increase from AWS Support or via Service Quotas API:
aws service-quotas request-service-quota-increase \
--service-code ec2 --quota-code L-12345678 \
--desired-value 100
Fix — tactical:
Reduce resource count in the current deployment. Use smaller instance sizes. Share resources across stacks.
Prevention:
Include quota checks in your CI/CD pipeline before deployment:
# Script to check quotas before deploying
python scripts/check_quotas.py --template template.yaml
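The article doesn't spell out that script; one possible shape for it, with the quota codes and planned counts as assumptions you would replace with your own:

import sys
import boto3

quotas = boto3.client("service-quotas")

# (service code, quota code, planned count) entries; example codes only,
# look up real ones with `aws service-quotas list-service-quotas`
CHECKS = [
    ("ec2", "L-0263D0A3", 5),    # EC2-VPC Elastic IPs (assumed code)
    ("rds", "L-7B6409FD", 10),   # RDS DB instances (assumed code)
]

ok = True
for service, code, planned in CHECKS:
    value = quotas.get_service_quota(ServiceCode=service, QuotaCode=code)["Quota"]["Value"]
    if planned > value:
        print(f"{service}/{code}: need {planned}, quota is {value:.0f}")
        ok = False

sys.exit(0 if ok else 1)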
Part 8: Troubleshooting workflow - where to start
When a CloudFormation deployment fails, follow this workflow:
Step 1: Get the raw error
aws cloudformation describe-stack-events --stack-name prod-stack \
--max-items 20 --query "StackEvents[?ResourceStatus=='CREATE_FAILED' || ResourceStatus=='UPDATE_FAILED']"
Look for the ResourceStatusReason field. This is your primary clue.
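Events come back newest-first, so the oldest failure is usually the root cause and later failures are cascades. A boto3 sketch that digs it out directly:

import boto3

cfn = boto3.client("cloudformation")

events = cfn.describe_stack_events(StackName="prod-stack")["StackEvents"]
failures = [e for e in events if e["ResourceStatus"].endswith("_FAILED")]
if failures:
    root = failures[-1]  # oldest failed event = most likely root cause
    print(root["LogicalResourceId"], root["ResourceStatus"])
    print(root.get("ResourceStatusReason", "no reason recorded"))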
Step 2: Identify the failed resource
The error message tells you which logical resource failed. Find its type and properties in your template.
Step 3: Check if it's a known failure pattern
| Error message pattern | Likely cause | Fix section |
|---|---|---|
| "Role does not exist" | IAM eventual consistency | Part 1, Failure 1 |
| "Rate exceeded" | API throttling | Part 7, Failure 13 |
| "Limit exceeded" | Service quota | Part 7, Failure 14 |
| "Deletion protection" | Rollback blocked | Part 3, Failure 6 |
| "Another update in progress" | Concurrent update | Part 6, Failure 12 |
Step 4: Deploy with --disable-rollback for debugging
aws cloudformation create-stack --stack-name debug-stack \
--template-body file://template.yaml \
--disable-rollback
Failed resources remain so you can inspect them directly.
Step 5: Inspect the failed resource directly
For EC2:
aws ec2 describe-instances --instance-ids i-12345
ssh ec2-user@instance-ip # Check logs
For Lambda:
aws logs describe-log-groups --log-group-name-prefix /aws/lambda/my-function
aws logs get-log-events --log-group-name /aws/lambda/my-function --log-stream-name $(aws logs describe-log-streams --log-group-name /aws/lambda/my-function --query "logStreams[0].logStreamName" --output text)
For RDS:
aws rds describe-db-instances --db-instance-identifier mydb
aws rds describe-events --source-identifier mydb --source-type db-instance
Step 6: Fix, then continue
If the stack is stuck in ROLLBACK_FAILED (or UPDATE_ROLLBACK_FAILED), you have two options:
Option A: Delete the failed stack and recreate
aws cloudformation delete-stack --stack-name prod-stack
# Wait for deletion
aws cloudformation create-stack --stack-name prod-stack --template-body file://template.yaml
Option B: Continue rolling back after fixing the blocker
# Fix the blocking resource (empty the S3 bucket, disable deletion protection),
# then resume the rollback; for update rollbacks, --resources-to-skip can bypass
# resources that still won't clean up
aws cloudformation continue-update-rollback --stack-name prod-stack
Production CloudFormation checklist
Before deploying to production, verify:
Drift detection
- [ ] Enabled on all production stacks
- [ ] Weekly automated drift check configured
- [ ] Alerts configured for drift findings
Rollback strategy
- [ ] Stateful resources have `DeletionPolicy: Retain` or `Snapshot`
- [ ] Stateless resources have `DeletionPolicy: Delete`
- [ ] Stateful and stateless resources in separate stacks
IAM and security
- [ ] No `"Action": "*"` in policies
- [ ] Secrets use `{{resolve:secretsmanager:...}}`, not parameters
- [ ] CI/CD role has minimal required permissions
- [ ] `cfn-guard` or `cfn-lint` running in CI
Failure handling
- [ ] `CreationPolicy` includes timeout and signal handling
- [ ] Custom resources always send SUCCESS or FAILED responses
- [ ] Nested stack depth ≤ 2
Performance
- [ ] No stack exceeds 200 resources
- [ ] No stack consistently deploys longer than 15 minutes
- [ ] Service quotas checked before deployment
Troubleshooting readiness
- [ ] `describe-stack-events` command documented in runbook
- [ ] Access to failed resource logs (EC2, Lambda, RDS) available
- [ ] `--disable-rollback` used in staging deployments
Written by Onyedikachi Obidiegwu | Cloud Security Engineer