War Story: How a Pulumi 3.130 IaC Bug Deployed Resources to the Wrong Region
A real-world postmortem of a misconfigured region deployment that led to unexpected costs and downtime.
Background: Our Pulumi Setup
We run a multi-region Kubernetes platform across AWS us-east-1 and eu-west-1, managed entirely via Pulumi 3.x IaC. Our stack configurations explicitly set the AWS region via the aws:region configuration key, with strict CI/CD guards to prevent region drift. For 18 months, this setup worked flawlessly, with hundreds of deployments across dev, staging, and production stacks.
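For context, the stack-level region override lived in the per-stack settings file. This is a minimal sketch of the shape of that configuration with illustrative values, not a copy of our actual file:

```yaml
# Pulumi.dev.yaml (illustrative values)
config:
  aws:region: us-east-1
```

The CI runner additionally exported `AWS_REGION`, which is what set up the ambiguity described below.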
The Incident: Unexpected Resources in us-west-2
On a Tuesday morning, our cloud cost alert fired: $4,200 in unexpected AWS spend over 12 hours. Digging into the billing dashboard, we found 14 new EC2 instances, 3 RDS clusters, and 2 S3 buckets provisioned in us-west-2 — a region we never use. Worse, our production eu-west-1 load balancer was throwing 502 errors, as the new us-west-2 resources were accidentally registered as targets.
We immediately rolled back the latest Pulumi deployment, but the us-west-2 resources persisted: Pulumi's state file had no record of them, so the rollback had nothing to destroy. Whatever had created them, they were never tracked in our IaC state.
Root Cause: Pulumi 3.130 Region Resolution Bug
After 6 hours of debugging, we isolated the issue to a Pulumi CLI upgrade from 3.129 to 3.130 the night before. The 3.130 release included a change to the AWS provider's region resolution logic: PR #2891 modified how the provider falls back to default region values when configuration is ambiguous.
Our stack configuration had a subtle edge case: we set aws:region via an environment variable AWS_REGION in our CI runner, but also had a stack-specific override for us-east-1 in our Pulumi.dev.yaml. When Pulumi 3.130 parsed the configuration, it incorrectly prioritized the AWS SDK's default region (us-west-2, set in the CI runner's ~/.aws/config) over the explicit stack configuration, due to a regression in the provider's config merging logic.
The bug only triggered when three conditions were met:
- Using Pulumi 3.130 with the AWS provider v6.0+
- Region set via both environment variable and stack config
- AWS SDK default region present in the execution environment
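The interaction is easier to see as a toy precedence model. This is a hypothetical illustration of the merging behavior described above, not Pulumi's actual resolution code; the function shape and the `buggy` flag are assumptions made for the example:

```python
from typing import Optional

def resolve_region(stack_config: Optional[str],
                   env_var: Optional[str],
                   sdk_default: Optional[str],
                   buggy: bool = False) -> Optional[str]:
    """Toy model of region resolution precedence.

    Correct behavior: explicit stack config wins, then the AWS_REGION
    environment variable, then the AWS SDK default from ~/.aws/config.
    Buggy behavior (modeling the regression described above): when the
    region is set in *both* stack config and environment, the
    "ambiguous" branch falls through to the SDK default.
    """
    if buggy and stack_config and env_var:
        # Regression: ambiguity resolved with the SDK default
        # instead of the explicit stack configuration.
        return sdk_default
    return stack_config or env_var or sdk_default
```

With our incident inputs (us-east-1 in both stack config and `AWS_REGION`, us-west-2 as the SDK default), the correct path returns us-east-1 while the buggy path returns us-west-2, which matches what we observed.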
Remediation and Fallout
We immediately pinned all Pulumi CLI versions to 3.129.0 across our CI/CD pipelines and local developer machines. We then manually terminated the us-west-2 resources, updated our Pulumi state to remove the orphaned target registrations, and restored production traffic in eu-west-1 within 2 hours of identifying the root cause.
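The version pin can be enforced mechanically rather than by convention. This is a sketch of the kind of CI guard we mean, assuming `pulumi version` prints a single `vX.Y.Z` line to stdout; the helper names are ours, not part of any Pulumi tooling:

```python
import subprocess
import sys

PINNED = "v3.129.0"

def current_pulumi_version() -> str:
    # `pulumi version` prints the CLI version, e.g. "v3.129.0".
    result = subprocess.run(["pulumi", "version"],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()

def assert_pinned(reported: str, pinned: str = PINNED) -> None:
    """Fail the CI job before any deploy step if the CLI drifted."""
    if reported != pinned:
        sys.exit(f"CI requires pulumi {pinned}, found {reported}")
```

Running `assert_pinned(current_pulumi_version())` as the first pipeline step makes an unreviewed CLI upgrade a hard failure instead of a silent behavior change.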
Total impact: 4 hours of partial production downtime, $4,200 in wasted cloud spend, and 12 engineering hours spent debugging and remediating.
Lessons Learned
We implemented four changes to prevent recurrence:
- Added explicit region validation to our CI pipeline: a pre-deploy step that checks the resolved AWS region against an allowlist, failing the build if it's not us-east-1 or eu-west-1.
- Pinned all Pulumi CLI and provider versions to exact semver versions, with a mandatory review process for upgrades.
- Added a weekly cloud resource audit that compares live AWS resources to Pulumi state, alerting on unmanaged resources.
- Contributed a regression test to the Pulumi AWS provider repo to catch the region merging bug in future releases.
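The first change above, the pre-deploy region gate, can be sketched as a small allowlist check. This is an illustrative sketch, not our actual pipeline code; in CI the resolved region would come from the provider configuration (for example via `pulumi config get aws:region`), and here it is simply passed in as a string:

```python
import sys

# Only regions our platform is allowed to deploy into.
ALLOWED_REGIONS = {"us-east-1", "eu-west-1"}

def region_allowed(resolved_region: str) -> bool:
    """Return True only if the resolved region is on the allowlist."""
    return resolved_region in ALLOWED_REGIONS

def gate(resolved_region: str) -> None:
    """Pre-deploy step: exit non-zero to fail the build before `pulumi up`."""
    if not region_allowed(resolved_region):
        sys.exit(f"refusing to deploy: region {resolved_region!r} "
                 f"is not in {sorted(ALLOWED_REGIONS)}")
```

Had this gate existed at the time, the resolved us-west-2 region would have failed the build before any resource was created.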
Conclusion
This incident highlighted the risk of even minor IaC tool upgrades, especially when configuration edge cases are present. While Pulumi's team fixed the bug in 3.131.0 within 48 hours of our report, the downtime and cost were avoidable with stricter version pinning and pre-deploy validation. Always test IaC upgrades in a sandbox environment first — and never trust region configuration implicitly.