Kachi

CloudFormation in Production: What Breaks and How to Fix It

Moving past YAML templates to failure handling, security, and real tradeoffs


Before we start

This is a follow-up to Infrastructure as Code with AWS CloudFormation: From Fundamentals to Production Patterns.

That article covered templates, stacks, nested stacks, CI/CD, and production best practices.

This article covers what happens when those best practices aren't enough. When things break in ways the documentation doesn't warn you about. When you're reading CloudFormation error messages at midnight and need answers.


Part 1: Stack deployment failures

Failure 1: "Resource handler returned message: 'Role does not exist'"

Symptoms:

  • IAM role creates successfully (status: CREATE_COMPLETE)
  • Lambda or EC2 resource fails immediately after
  • Error: "The role named 'xxx' does not exist or is not authorized"

Root cause:
IAM is eventually consistent. CloudFormation marks the role as complete as soon as the API call returns, but the new role can take several seconds to become visible to the other services that need to assume it.

Fix:

LambdaFunction:
  Type: AWS::Lambda::Function
  DependsOn: LambdaExecutionRole
  Properties:
    Role: !GetAtt LambdaExecutionRole.Arn

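DependsOn makes the ordering explicit, but it has limits. !GetAtt LambdaExecutionRole.Arn already creates an implicit dependency on the role, and CloudFormation only waits for the role to reach CREATE_COMPLETE, not for IAM to finish propagating it. DependsOn earns its place when the function needs something it does not reference directly, such as a separate AWS::IAM::Policy or AWS::IAM::InstanceProfile attached to the role.

Prevention:
When a resource consumes an IAM role created in the same stack, add DependsOn for every separately defined policy or instance profile the role relies on. If the error still appears intermittently, retrying the deployment usually succeeds once the role has propagated.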
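
The same propagation delay shows up in deploy scripts and custom resources that create a role and use it right away. There, the usual answer is to retry with backoff. The following is a minimal sketch, not a definitive implementation: retry_until_consistent and looks_like_role_propagation are hypothetical helper names, and the matched error text is an assumption based on the symptom above, so adjust it to the message your service actually returns.

```python
import time


def retry_until_consistent(action, is_propagation_error, attempts=6,
                           base_delay=1.0, sleep=time.sleep):
    """Call action() until it succeeds or fails for an unrelated reason.

    Waits base_delay, 2*base_delay, 4*base_delay, ... between attempts.
    """
    for attempt in range(attempts):
        try:
            return action()
        except Exception as exc:
            last_attempt = attempt == attempts - 1
            if last_attempt or not is_propagation_error(exc):
                raise
            sleep(base_delay * (2 ** attempt))


def looks_like_role_propagation(exc):
    # Assumption: a not-yet-visible role is reported with text like this.
    text = str(exc).lower()
    return "role" in text and ("does not exist" in text or "cannot be assumed" in text)
```

Pass the call that consumes the role as action, for example a lambda that wraps a create-function request. Errors that do not look like propagation problems are raised immediately, so real misconfigurations are not hidden behind retries.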


Failure 2: Stack timeout without clear cause

Symptoms:

  • Stack creation or update times out after the configured limit
  • No obvious error in event log
  • Some resources show CREATE_IN_PROGRESS for hours

Root cause:
Resources with CreationPolicy or WaitCondition are waiting for signals that never arrive. Common causes:

  • EC2 instance user data script fails silently
  • Custom resource Lambda times out
  • Application code never calls cfn-signal

Diagnosis:

# List resources that are still waiting in CREATE_IN_PROGRESS
aws cloudformation describe-stack-resources --stack-name prod-stack \
  --query "StackResources[?ResourceStatus=='CREATE_IN_PROGRESS']"

# For EC2, check user data logs on the instance
cat /var/log/cloud-init-output.log

Fix:

For EC2 with user data:

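#!/bin/bash
# Assumes this script is embedded in UserData with Fn::Sub, which resolves
# ${AWS::StackName} and ${AWS::Region} before the instance boots.
# Do your setup here

# Signal success or failure. $? is the exit status of the last command only,
# so the final setup command should reflect overall success.
/opt/aws/bin/cfn-signal --exit-code $? --stack ${AWS::StackName} \
  --resource WebServerInstance --region ${AWS::Region}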

For custom resources, make sure a response is always sent:

def handler(event, context):
    # send_response is assumed to be your own helper that PUTs the result
    # to the pre-signed URL in event["ResponseURL"].
    try:
        # Do work
        send_response(event, context, "SUCCESS")
    except Exception as e:
        # CRITICAL: Always send a response, or the stack waits until it times out
        send_response(event, context, "FAILED", reason=str(e))
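
The handler above relies on a send_response helper that the article does not define. One possible version is sketched here using only the standard library. It follows the custom resource response format (Status, Reason, PhysicalResourceId, StackId, RequestId, LogicalResourceId, Data). The helper names, the 10 second HTTP timeout, and the 5 second safety margin are assumptions, not values taken from the original.

```python
import json
import urllib.request


def build_response(event, context, status, reason=None, data=None):
    """Build the JSON body CloudFormation expects from a custom resource."""
    return {
        "Status": status,  # "SUCCESS" or "FAILED"
        "Reason": reason or f"See CloudWatch log stream {context.log_stream_name}",
        # Reuse the existing ID on Update/Delete so the resource is not replaced.
        "PhysicalResourceId": event.get("PhysicalResourceId") or context.log_stream_name,
        "StackId": event["StackId"],
        "RequestId": event["RequestId"],
        "LogicalResourceId": event["LogicalResourceId"],
        "Data": data or {},
    }


def send_response(event, context, status, reason=None, data=None):
    """PUT the response to the pre-signed URL supplied in the event."""
    body = json.dumps(build_response(event, context, status, reason, data)).encode("utf-8")
    request = urllib.request.Request(
        event["ResponseURL"],
        data=body,
        method="PUT",
        headers={"Content-Type": ""},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


def running_out_of_time(context, margin_ms=5000):
    """True when the Lambda is close enough to its timeout to report FAILED now."""
    return context.get_remaining_time_in_millis() < margin_ms
```

Long-running work can check running_out_of_time(context) between steps and send FAILED before Lambda kills the invocation. Otherwise CloudFormation receives nothing and waits until its own timeout expires.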

Prevention:
Always test CreationPolicy paths with --disable-rollback first so you can inspect failed resources without automatic cleanup.


Failure 3: Nested stack update fails, root cause invisible

Symptoms:

  • Parent stack update fails
  • Error message: "Nested stack failed to update"
  • No details about why the nested stack failed

Root cause:
CloudFormation does not bubble up nested stack failure details to the parent. You have to check each nested stack individually.

Diagnosis:

# List nested stacks from the parent
aws cloudformation list-stack-resources --stack-name parent-stack \
  --query "StackResourceSummaries[?ResourceType=='AWS::CloudFormation::Stack'].PhysicalResourceId"

# Check each nested stack's events
aws cloudformation describe-stack-events --stack-name nested-stack-1

Fix:

Add explicit validation before updating parents:

# Validate nested template before updating parent
aws cloudformation validate-template --template-body file://nested.yaml

# Check nested stack for drift
aws cloudformation detect-stack-drift --stack-name nested-stack-1

Prevention:
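Minimize nested stack depth (2 levels maximum). For complex dependencies, split into separate stacks that share values through exported outputs. StackSets help when the same stack must be deployed to many accounts or Regions.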
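
Checking each nested stack by hand gets tedious, so the walk can be scripted. The sketch below assumes a boto3 CloudFormation client is passed in as cfn. The function name find_root_failures is hypothetical, and only the first page of events per stack is read, which may miss older failures on very busy stacks.

```python
def find_root_failures(cfn, stack_name, seen=None):
    """Return (stack, logical_id, reason) for failed resources,
    descending into nested stacks instead of reporting them."""
    seen = set() if seen is None else seen
    if stack_name in seen:
        return []
    seen.add(stack_name)

    failures = []
    events = cfn.describe_stack_events(StackName=stack_name)["StackEvents"]
    for event in events:
        if not event["ResourceStatus"].endswith("_FAILED"):
            continue
        physical_id = event.get("PhysicalResourceId", "")
        is_stack = event["ResourceType"] == "AWS::CloudFormation::Stack"
        if is_stack and physical_id == event["StackId"]:
            continue  # the stack's own status event, not a child resource
        if is_stack and physical_id:
            # A nested stack failed: look inside it for the real cause.
            failures.extend(find_root_failures(cfn, physical_id, seen))
        else:
            failures.append(
                (stack_name, event["LogicalResourceId"], event.get("ResourceStatusReason", ""))
            )
    return failures
```

A typical call would be find_root_failures(boto3.client("cloudformation"), "parent-stack"). The result lists the leaf resources that failed, with the nested stack each one lives in.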


Part 2: Drift and configuration mismatch

Failure 4: Production resource changed outside CloudFormation

Symptoms:

  • Security group rule allows unexpected traffic
  • S3 bucket becomes public
  • RDS backup retention period changes
  • No corresponding change in Git history

Root cause:
Someone modified a resource directly in the AWS console or via CLI, bypassing CloudFormation.

Diagnosis:

# Detect drift on a stack
aws cloudformation detect-stack-drift --stack-name prod-web

# Get detailed drift results
aws cloudformation describe-stack-drift-detection-status \
  --stack-drift-detection-id <id>

# List drifted resources with property-level differences
aws cloudformation describe-stack-resource-drifts --stack-name prod-web \
  --stack-resource-drift-status-filters MODIFIED DELETED

Fix — manual:

# No single command reverts drift. Either change the resource back by hand
# so it matches the template, or update the template to match the live
# resource and deploy it so the stack and reality agree again:
aws cloudformation deploy --stack-name prod-web \
  --template-file template.yaml

Fix — automated:

# EventBridge rule that triggers drift detection weekly
DriftDetectionRule:
  Type: AWS::Events::Rule
  Properties:
    ScheduleExpression: "cron(0 12 ? * MON *)"  # Every Monday at 12:00 UTC
    Targets:
      - Arn: !GetAtt DriftLambda.Arn
        Id: DriftLambdaTarget
        Input: '{"stackName": "prod-web"}'
# DriftLambda also needs an AWS::Lambda::Permission that lets events.amazonaws.com invoke it

Prevention:

  • Enforce IAM policies that prevent resource modification outside CloudFormation
  • Run drift detection on a schedule for all production stacks
  • Review drift reports weekly
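
The scheduled rule needs a function behind it. One way the DriftLambda handler could look is sketched below, assuming the rule passes {"stackName": ...} as input and the function's timeout is raised well above the default. The polling interval and poll count are assumptions, and what happens after drift is found (alerting, ticketing) is left as a placeholder.

```python
import time


def check_drift(cfn, stack_name, poll_seconds=5, max_polls=60, sleep=time.sleep):
    """Start drift detection and return (drift_status, drifted_resource_count)."""
    detection_id = cfn.detect_stack_drift(StackName=stack_name)["StackDriftDetectionId"]
    for _ in range(max_polls):
        status = cfn.describe_stack_drift_detection_status(
            StackDriftDetectionId=detection_id
        )
        if status["DetectionStatus"] != "DETECTION_IN_PROGRESS":
            if status["DetectionStatus"] == "DETECTION_FAILED":
                raise RuntimeError(status.get("DetectionStatusReason", "drift detection failed"))
            return status["StackDriftStatus"], status.get("DriftedStackResourceCount", 0)
        sleep(poll_seconds)
    raise TimeoutError(f"drift detection for {stack_name} did not finish")


def handler(event, context):
    import boto3  # available in the Lambda Python runtime

    drift_status, count = check_drift(boto3.client("cloudformation"), event["stackName"])
    # Placeholder: publish to SNS or raise here when drift_status == "DRIFTED".
    return {"stackName": event["stackName"], "driftStatus": drift_status, "driftedResources": count}
```

Keeping the polling logic in check_drift, separate from the handler, means it can be exercised locally with a stub client before it is deployed.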

Failure 5: Stack drift causes deletion protection to block cleanup

Symptoms:

  • Trying to delete a stack
  • Error: "Cannot delete stack because resource X has deletion protection"
  • That resource was not supposed to have deletion protection

Root cause:
Someone enabled deletion protection directly on an RDS database, or a bucket the stack owns now contains objects. CloudFormation doesn't know about either change.

Diagnosis:

# Find which resource is blocking deletion
aws cloudformation describe-stack-resources --stack-name prod-stack \
  --query "StackResources[?ResourceStatus=='DELETE_FAILED']"

Fix:

# Remove deletion protection from the resource directly
aws rds modify-db-instance --db-instance-identifier mydb \
  --no-deletion-protection

# Or for S3, where deletion is blocked by a non-empty bucket rather than a
# protection flag: empty it (versioned buckets also need old versions removed)
aws s3 rm s3://mybucket --recursive

# Retry stack deletion
aws cloudformation delete-stack --stack-name prod-stack

Prevention:
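Use DeletionPolicy: Retain in your template for stateful resources rather than deletion protection set by hand. CloudFormation understands DeletionPolicy; it has no record of protection enabled outside the template. If you do want deletion protection, declare it in the template (for example, the DeletionProtection property on AWS::RDS::DBInstance).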


Part 3: Rollback failures

Failure 6: Rollback fails because resource won't delete

Symptoms:

  • Stack update fails
  • Rollback starts
  • Rollback fails
  • Stack stuck in UPDATE_ROLLBACK_FAILED

Root cause:
A resource created during the failed update cannot be deleted. Common reasons:

  • S3 bucket has versioning enabled and contains objects
  • RDS has deletion protection enabled
  • Network interface is still attached
  • Custom resource performed external actions

Diagnosis:

# Find which resource caused rollback failure
aws cloudformation describe-stack-events --stack-name prod-stack \
  --query "StackEvents[?ResourceStatus=='DELETE_FAILED']"

Fix - for S3:

# Empty the bucket first. On a versioned bucket this only adds delete markers;
# old versions must also be removed before the bucket can be deleted.
aws s3 rm s3://bucket-name --recursive

# Suspend versioning so no new versions accumulate
aws s3api put-bucket-versioning --bucket bucket-name \
  --versioning-configuration Status=Suspended

# Resume the rollback
aws cloudformation continue-update-rollback --stack-name prod-stack

Fix - for RDS:

# Disable deletion protection
aws rds modify-db-instance --db-instance-identifier mydb \
  --no-deletion-protection

# Skip final snapshot if you want fast cleanup
aws rds delete-db-instance --db-instance-identifier mydb \
  --skip-final-snapshot

Prevention:
Design stateful resources with DeletionPolicy: Retain in production. Accept that you will clean them manually. Do not let stateful resources block automated rollbacks.
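
On a versioned bucket, a recursive remove leaves old versions and delete markers behind, and the bucket still cannot be deleted. The sketch below removes them, assuming a boto3 S3 client is passed in as s3. The function name is hypothetical, per-object errors returned by delete_objects are not inspected, and the operation is irreversible, so treat it as a starting point for cleanup of buckets you are certain you no longer need.

```python
def empty_versioned_bucket(s3, bucket):
    """Delete every object version and delete marker. Returns the number removed."""
    removed = 0
    kwargs = {"Bucket": bucket}
    while True:
        page = s3.list_object_versions(**kwargs)
        targets = [
            {"Key": item["Key"], "VersionId": item["VersionId"]}
            for item in page.get("Versions", []) + page.get("DeleteMarkers", [])
        ]
        if targets:
            # A page holds at most 1000 entries, which is also the delete_objects limit.
            s3.delete_objects(Bucket=bucket, Delete={"Objects": targets, "Quiet": True})
            removed += len(targets)
        if not page.get("IsTruncated"):
            return removed
        # Continue from where the previous page stopped.
        kwargs["KeyMarker"] = page["NextKeyMarker"]
        kwargs["VersionIdMarker"] = page["NextVersionIdMarker"]
```

Once this returns, the bucket is empty and CloudFormation can delete it when the rollback or stack deletion is retried.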


Failure 7: Rollback takes too long, extending downtime

Symptoms:

  • Stack update fails at minute 15
  • Rollback takes another 20 minutes
  • Total downtime: 35+ minutes

Root cause:
Resources with DeletionPolicy: Snapshot take time to create snapshots during rollback. RDS snapshots can take 10-20 minutes. EBS snapshots add minutes per volume.

Diagnosis:

# Check which resource is taking time during rollback
aws cloudformation describe-stack-events --stack-name prod-stack \
  --query "StackEvents[?contains(ResourceStatus, 'DELETE')]"

Fix during incident:
You have limited options once rollback starts. The fastest path is often to let it finish, even if slow.

Prevention:
Separate stateful resources (databases, buckets) into their own stack. This stack changes rarely. Application stacks change frequently but contain no stateful resources.

# Stack 1 (data stack): deploys monthly; rollback is slow but rare
Database:
  Type: AWS::RDS::DBInstance
  DeletionPolicy: Snapshot
  # Properties omitted

# Stack 2 (application stack): deploys daily; rollback is fast
AppAutoScalingGroup:
  Type: AWS::AutoScaling::AutoScalingGroup
  DeletionPolicy: Delete  # The default; no snapshot is taken
  # Properties omitted

When the application stack fails, its rollback does not wait on any snapshot, and the database is untouched.


Part 4: IAM and permission failures

Failure 8: "User is not authorized to perform cloudformation:CreateStack"

Symptoms:

  • CI/CD pipeline fails
  • Error message about missing permission
  • Same permissions worked yesterday

Root cause:
IAM policies changed. A condition was added. A permission was removed. The role used by CI/CD no longer has required access.

Diagnosis:

# Confirm which identity the pipeline is actually running as
aws sts get-caller-identity

# Check effective permissions for the role
aws iam simulate-principal-policy \
  --policy-source-arn arn:aws:iam::123456789012:role/ci-cd-role \
  --action-names cloudformation:CreateStack \
  --resource-arns arn:aws:cloudformation:us-east-1:123456789012:stack/*

Fix:
Add the missing permission to the CI/CD role:

{
  "Effect": "Allow",
  "Action": "cloudformation:CreateStack",
  "Resource": "arn:aws:cloudformation:region:account:stack/*"
}

Prevention:
Use IAM permissions boundaries and permission guardrails. Test CI/CD role permissions in a staging account before deploying to production.
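
Testing the role's permissions can be automated as a pipeline step that runs before any deployment. The sketch below wraps the same policy simulation shown in the diagnosis and assumes a boto3 IAM client is passed in as iam. The function name denied_actions is hypothetical, and pagination of results is ignored because the action list is expected to be short.

```python
def denied_actions(iam, role_arn, actions, resource_arn="*"):
    """Return {action: decision} for every action the role may not perform."""
    response = iam.simulate_principal_policy(
        PolicySourceArn=role_arn,
        ActionNames=list(actions),
        ResourceArns=[resource_arn],
    )
    return {
        result["EvalActionName"]: result["EvalDecision"]
        for result in response["EvaluationResults"]
        if result["EvalDecision"] != "allowed"
    }
```

A pipeline could call this with the actions it depends on (cloudformation:CreateStack, cloudformation:UpdateStack, and so on) and fail fast when the returned dictionary is not empty. The decision value distinguishes a missing permission (implicitDeny) from an explicit deny.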


Failure 9: Cross-account stack operations fail

Symptoms:

  • Stack in Account A tries to create a resource in Account B
  • Error: "Access denied" or "Role does not exist"

Root cause:
CloudFormation does not natively support cross-account resource creation. You need IAM roles in both accounts with trust relationships.

Fix — setup cross-account role in target account:

# In Account B (target)
CrossAccountRole:
  Type: AWS::IAM::Role
  Properties:
    AssumeRolePolicyDocument:
      Statement:
        - Effect: Allow
          Principal:
            AWS: arn:aws:iam::AccountA:root
          Action: sts:AssumeRole
    ManagedPolicyArns:
      - arn:aws:iam::aws:policy/AdministratorAccess  # Scope down in production

Fix — assume role from source account:

# In Account A (source)
CustomResource:
  Type: Custom::CrossAccount
  Properties:
    ServiceToken: !GetAtt CrossAccountLambda.Arn
    TargetRoleArn: arn:aws:iam::AccountB:role/CrossAccountRole

Prevention:
Design stacks to be account-specific. Use AWS Organizations and StackSets for multi-account deployments instead of cross-account resource references.
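
If you do go the custom resource route, the Lambda behind it has to assume the target role before touching anything in Account B. A minimal sketch of that step follows, assuming a boto3 STS client is passed in as sts. The function name and the default session name are hypothetical.

```python
def cross_account_client_kwargs(sts, target_role_arn, session_name="cfn-cross-account"):
    """Assume the target role and return keyword arguments for boto3.client(...)."""
    credentials = sts.assume_role(
        RoleArn=target_role_arn,
        RoleSessionName=session_name,
    )["Credentials"]
    return {
        "aws_access_key_id": credentials["AccessKeyId"],
        "aws_secret_access_key": credentials["SecretAccessKey"],
        "aws_session_token": credentials["SessionToken"],
    }
```

Inside the handler, the role ARN arrives in event["ResourceProperties"]["TargetRoleArn"], and a client for the target account can then be built with boto3.client("s3", **cross_account_client_kwargs(boto3.client("sts"), role_arn)). The temporary credentials expire, so request them per invocation instead of caching them.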


Part 5: Template validation failures that only appear at deploy time

Failure 10: Template validates but deployment fails

Symptoms:

aws cloudformation validate-template --template-body file://template.yaml
# Succeeds and prints the template's parameters, with no errors

But deployment fails with: "Encountered unsupported property" or "Resource handler returned invalid request"

Root cause:
validate-template checks template syntax only. It does not check:

  • Resource property combinations that are invalid (e.g., certain combinations of SourceSecurityGroupId and CidrIp)
  • Region-specific limitations (some resources not available in all regions)
  • Service limits (e.g., requesting 2000 IOPS when limit is 1000)

Diagnosis:

Deploy with --disable-rollback to keep failed resources for inspection:

aws cloudformation create-stack --stack-name test-stack \
  --template-body file://template.yaml \
  --disable-rollback

Then examine the failed resource's status reason:

aws cloudformation describe-stack-resources --stack-name test-stack \
  --query "StackResources[?ResourceStatus=='CREATE_FAILED']"

Fix:
Correct the specific property combination. Check region availability. Request service limit increases before deployment.

Prevention:
Test in a staging region first. Use cfn-lint in CI/CD — it catches property combination errors that validate-template misses.

# Install cfn-lint
pip install cfn-lint

# Run locally before commit
cfn-lint template.yaml

Part 6: Change set failures

Failure 11: Change set shows replacement when you expected modification

Symptoms:

  • Change set indicates "Replacement" for a production resource
  • You expected an in-place modification
  • Replacement means downtime

Root cause:
Certain property changes force replacement. For RDS, changing Engine or DBSubnetGroupName replaces the instance, while changing EngineVersion or DBInstanceClass is applied in place but still interrupts service.

Diagnosis:

Check which property triggered replacement:

aws cloudformation describe-change-set --stack-name prod-stack \
  --change-set-name my-change-set \
  --query "Changes[?ResourceChange.Replacement=='True']"

Common properties and how CloudFormation applies changes to them (confirm against the "Update requires" note in each resource's reference):

Resource | Forces replacement | Applied in place
AWS::RDS::DBInstance | Engine, DBSubnetGroupName | EngineVersion, DBInstanceClass (with interruption)
AWS::EC2::Instance | ImageId, SubnetId | InstanceType on EBS-backed instances (with interruption)
AWS::S3::Bucket | BucketName | AccessControl
AWS::Lambda::Function | FunctionName | Code

Fix:

  • Accept the replacement and plan for downtime
  • Use blue/green deployment for zero-downtime replacement
  • Modify the resource directly in AWS console (not recommended for IaC)

Prevention:
Always review change sets in staging before production. Know which properties cause replacement for your critical resources.
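
Change set review can be partly automated so that a replacement never slips through unnoticed. The sketch below scans the JSON returned by describe-change-set for changes that will, or might, replace a resource. The function name replacements is hypothetical; the field names follow the describe-change-set response.

```python
def replacements(change_set):
    """Return (logical_id, resource_type, replacement) for risky changes."""
    risky = []
    for change in change_set.get("Changes", []):
        resource = change.get("ResourceChange", {})
        # Replacement is "True", "False", or "Conditional"; it is absent for Add/Remove.
        if resource.get("Replacement") in ("True", "Conditional"):
            risky.append(
                (resource["LogicalResourceId"], resource["ResourceType"], resource["Replacement"])
            )
    return risky
```

A pipeline could load the describe-change-set output with json.load and stop for manual approval whenever this list is not empty. "Conditional" is included because CloudFormation only decides at execution time whether those resources are replaced.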


Failure 12: Change set execution fails because of update conflicts

Symptoms:

  • Change set creates successfully
  • execute-change-set fails
  • Error: "Cannot update stack because another update is in progress"

Root cause:
Another process (CI/CD pipeline, another engineer, scheduled automation) started a stack update while your change set was waiting for execution.

Diagnosis:

# Check current stack status
aws cloudformation describe-stacks --stack-name prod-stack \
  --query "Stacks[0].StackStatus"

# Status like UPDATE_IN_PROGRESS or ROLLBACK_IN_PROGRESS means locked

Fix:
Wait for the other update to complete. Then create a new change set based on the latest stack state. Do not execute the old change set — it's now out of date.

# Delete old change set
aws cloudformation delete-change-set --stack-name prod-stack \
  --change-set-name old-change-set

# Create new change set against current stack
aws cloudformation create-change-set --stack-name prod-stack \
  --change-set-name new-change-set --template-body file://template.yaml

# Execute fresh change set
aws cloudformation execute-change-set --stack-name prod-stack \
  --change-set-name new-change-set

Prevention:

  • Implement stack-level locking via S3 condition keys or custom resources
  • Coordinate CI/CD pipelines to never deploy simultaneously to the same stack
  • Use separate stacks for separate environments
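
A pipeline can also protect itself by waiting for the stack to settle before it creates a change set. This is a rough sketch, assuming a boto3 CloudFormation client is passed in as cfn. The function name, the 15 second interval, and the poll limit are assumptions.

```python
import time


def wait_until_stable(cfn, stack_name, poll_seconds=15, max_polls=120, sleep=time.sleep):
    """Poll until no operation is running on the stack, then return its status."""
    for _ in range(max_polls):
        status = cfn.describe_stacks(StackName=stack_name)["Stacks"][0]["StackStatus"]
        # REVIEW_IN_PROGRESS means a change set is awaiting execution, not a running update.
        if status == "REVIEW_IN_PROGRESS" or not status.endswith("_IN_PROGRESS"):
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"{stack_name} is still updating")
```

This narrows the race window but does not close it, since another deployment can still start between the check and the change set execution. Serializing deployments per stack in the pipeline remains the real fix.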

Part 7: Performance and quota failures

Failure 13: Stack deployment times out due to API rate limiting

Symptoms:

  • Stack deployment slows dramatically after hundreds of resources
  • Error: "Rate exceeded" for various AWS APIs
  • Some resources take 5-10 retries before succeeding

Root cause:
CloudFormation makes many API calls to create resources. AWS APIs have rate limits. Large stacks hit these limits.

Diagnosis:

# lookup-events cannot filter by error code, so search the raw event payload
aws cloudtrail lookup-events --max-items 200 \
  --query "Events[?contains(CloudTrailEvent, 'ThrottlingException')].[EventTime,EventName]"

Fix — immediate:
Split the stack. CloudFormation allows up to 500 resources per stack, but keeping each stack to roughly 200 resources or fewer keeps deployments faster and easier to debug.

# List resources by type to see distribution
aws cloudformation list-stack-resources --stack-name large-stack \
  --query "StackResourceSummaries[*].[ResourceType]" --output text | sort | uniq -c

Fix — long term:
Design modular stacks:

network-stack.yaml     (VPC, subnets, route tables)
data-stack.yaml        (RDS, ElastiCache, S3)
compute-stack.yaml     (ASG, launch templates)
app-stack.yaml         (Lambda, API Gateway)

Prevention:
Monitor stack creation time. If it exceeds 15 minutes for non-stateful resources, split the stack.


Failure 14: Service quota exceeded during deployment

Symptoms:

  • Deployment fails
  • Error: "You have reached your limit of X resources"

Root cause:
AWS account has default service limits. You're trying to create more resources than allowed.

Common default quotas (check Service Quotas for your account's current values):

  • VPCs per region: 5
  • Security groups per region: 2,500
  • RDS instances per region: 40
  • Lambda concurrent executions: 1000

Diagnosis:

# Look up the current quota value (this does not report usage)
aws service-quotas get-service-quota \
  --service-code ec2 --quota-code L-12345678

# List all quotas for a service
aws service-quotas list-service-quotas --service-code rds

Fix — immediate:
Request quota increase from AWS Support or via Service Quotas API:

aws service-quotas request-service-quota-increase \
  --service-code ec2 --quota-code L-12345678 \
  --desired-value 100

Fix — tactical:
Reduce resource count in the current deployment. Use smaller instance sizes. Share resources across stacks.

Prevention:
Include quota checks in your CI/CD pipeline before deployment:

# Script to check quotas before deploying
python scripts/check_quotas.py --template template.yaml
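
The article does not show check_quotas.py, so the following is only a guess at its core logic. It assumes the template has already been parsed into a dictionary and that current usage and quota values were collected separately (for example from Service Quotas and describe calls). It also treats every resource in the template as new, which overestimates when updating an existing stack.

```python
from collections import Counter


def quota_violations(template, usage, quotas):
    """Compare what a template would add against each quota.

    template: parsed template dict; usage and quotas: {resource_type: count}.
    Returns {resource_type: (projected_total, quota)} for every type over quota.
    """
    requested = Counter(
        resource["Type"] for resource in template.get("Resources", {}).values()
    )
    violations = {}
    for resource_type, quota in quotas.items():
        total = usage.get(resource_type, 0) + requested.get(resource_type, 0)
        if total > quota:
            violations[resource_type] = (total, quota)
    return violations
```

With 4 VPCs already in the Region, a quota of 5, and a template that declares 2 more, the result is {"AWS::EC2::VPC": (6, 5)}, and the pipeline can stop before CloudFormation creates anything.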

Part 8: Troubleshooting workflow - where to start

When a CloudFormation deployment fails, follow this workflow:

Step 1: Get the raw error

aws cloudformation describe-stack-events --stack-name prod-stack \
  --max-items 20 --query "StackEvents[?ResourceStatus=='CREATE_FAILED' || ResourceStatus=='UPDATE_FAILED']"

Look for the ResourceStatusReason field. This is your primary clue.
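
When several resources fail, the earliest failure is usually the real one; later entries often just say the resource was cancelled. The sketch below picks that earliest failure from the StackEvents list, which the API returns newest first. The function name first_failure is hypothetical, and the way it detects the start of the current operation (the stack's own event entering an in-progress state) is an assumption that fits ordinary create, update, and delete runs.

```python
OPERATION_STARTS = ("CREATE_IN_PROGRESS", "UPDATE_IN_PROGRESS", "DELETE_IN_PROGRESS")


def first_failure(events):
    """Return the earliest *_FAILED event of the most recent operation, or None.

    events: the StackEvents list from describe-stack-events, newest first.
    """
    current_operation = []
    for event in events:
        current_operation.append(event)
        is_stack_event = event.get("PhysicalResourceId") == event.get("StackId")
        # The stack's own event entering *_IN_PROGRESS marks where this operation began.
        if is_stack_event and event["ResourceStatus"] in OPERATION_STARTS:
            break
    failed = [e for e in current_operation if e["ResourceStatus"].endswith("_FAILED")]
    return failed[-1] if failed else None
```

The ResourceStatusReason of the returned event is the message to look up in the table of known failure patterns.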

Step 2: Identify the failed resource

The error message tells you which logical resource failed. Find its type and properties in your template.

Step 3: Check if it's a known failure pattern

Error message pattern | Likely cause | Fix section
"Role does not exist" | IAM eventual consistency | Part 1, Failure 1
"Rate exceeded" | API throttling | Part 7, Failure 13
"Limit exceeded" | Service quota | Part 7, Failure 14
"Deletion protection" | Rollback blocked | Part 3, Failure 6
"Another update in progress" | Concurrent update | Part 6, Failure 12

Step 4: Deploy with --disable-rollback for debugging

aws cloudformation create-stack --stack-name debug-stack \
  --template-body file://template.yaml \
  --disable-rollback

Failed resources remain so you can inspect them directly.

Step 5: Inspect the failed resource directly

For EC2:

aws ec2 describe-instances --instance-ids i-12345
ssh ec2-user@instance-ip # Check logs

For Lambda:

aws logs describe-log-groups --log-group-name-prefix /aws/lambda/my-function
aws logs get-log-events --log-group-name /aws/lambda/my-function \
  --log-stream-name "$(aws logs describe-log-streams --log-group-name /aws/lambda/my-function \
    --order-by LastEventTime --descending --limit 1 --query "logStreams[0].logStreamName" --output text)"

For RDS:

aws rds describe-db-instances --db-instance-identifier mydb
aws rds describe-events --source-identifier mydb --source-type db-instance

Step 6: Fix, then continue

If the stack is stuck after a failed rollback, the right option depends on its status:

Option A: For ROLLBACK_FAILED (a failed create), delete the stack and recreate it

aws cloudformation delete-stack --stack-name prod-stack
aws cloudformation wait stack-delete-complete --stack-name prod-stack
aws cloudformation create-stack --stack-name prod-stack --template-body file://template.yaml

Option B: For UPDATE_ROLLBACK_FAILED, continue rolling back after fixing the blocker

# Fix the blocking resource (empty S3 bucket, disable deletion protection)
# Then resume the rollback
aws cloudformation continue-update-rollback --stack-name prod-stack

Production CloudFormation checklist

Before deploying to production, verify:

Drift detection

  • [ ] All production stacks covered by drift detection
  • [ ] Weekly automated drift check configured
  • [ ] Alerts configured for drift findings

Rollback strategy

  • [ ] Stateful resources have DeletionPolicy: Retain or Snapshot
  • [ ] Stateless resources have DeletionPolicy: Delete
  • [ ] Stateful and stateless resources in separate stacks

IAM and security

  • [ ] No "Action": "*" in policies
  • [ ] Secrets use {{resolve:secretsmanager:...}} not parameters
  • [ ] CI/CD role has minimal required permissions
  • [ ] cfn-guard or cfn-lint running in CI

Failure handling

  • [ ] CreationPolicy includes timeout and signal handling
  • [ ] Custom resources always send SUCCESS or FAILED responses
  • [ ] Nested stack depth ≤ 2

Performance

  • [ ] No stack exceeds 200 resources
  • [ ] No stack consistently deploys longer than 15 minutes
  • [ ] Service quotas checked before deployment

Troubleshooting readiness

  • [ ] describe-stack-events command documented in runbook
  • [ ] Access to failed resource logs (EC2, Lambda, RDS) available
  • [ ] --disable-rollback used in staging deployments

Written by Onyedikachi Obidiegwu | Cloud Security Engineer
