ANKUSH CHOUDHARY JOHAL

Posted on • Originally published at johal.in

Postmortem: How a Pulumi 3.120 Preview Bug Caused a Bad Deployment to Staging

Executive Summary

On October 15, 2024, at 14:32 UTC, a bug in Pulumi CLI version 3.120.0's preview functionality caused the unintended deletion of staging infrastructure during a routine deployment. The incident spanned 47 minutes from the first affected command to root-cause confirmation, including 16 minutes of full staging downtime, with no production impact. This postmortem details the timeline, root cause, remediation steps, and key lessons learned.

Timeline of Events

  • 14:32 UTC: DevOps engineer runs pulumi preview for the staging stack using Pulumi CLI 3.120.0. The output shows "No changes required", incorrectly omitting pending resource deletions.
  • 14:35 UTC: Engineer runs pulumi up --yes, assuming no changes are pending; the update triggers the unplanned deletion of 3 staging API EC2 instances and an AWS RDS subnet group.
  • 14:36 UTC: Staging health checks fail; API and database connectivity errors are reported.
  • 14:38 UTC: Incident declared (SEV-2) by the on-call team.
  • 14:40 UTC: Rollback initiated using Pulumi CLI 3.119.2 (last known stable version) to restore deleted resources.
  • 14:52 UTC: All staging resources restored; health checks pass, environment marked operational.
  • 15:19 UTC: Pulumi engineering team confirms the bug exists in CLI 3.120.0's AWS provider diff logic.
  • 16:45 UTC: Pulumi releases CLI 3.120.1 with a patch for the preview bug.

Root Cause Analysis

The incident was caused by a regression in Pulumi CLI 3.120.0, specifically in its preview diff logic for the AWS provider (v6.12.0, the version in use on the affected stack). The diff logic failed to account for deprecated resource attributes being removed during stack updates.

When a Pulumi stack update removed a deprecated field from an AWS resource definition (e.g., removing the deprecated publicIpAddress field from an EC2 instance), the preview diff incorrectly excluded the resulting resource deletion from the output. This led engineers to believe no changes were pending, when in fact critical resources were marked for deletion.

The bug affected only the preview output; pulumi up computed the true plan and applied it, including the deletions that preview had omitted.
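
One way to narrow the gap between what preview shows and what up applies is Pulumi's update-plan feature: save the plan computed during preview, then constrain the update to it. Below is a minimal sketch, assuming a recent CLI; the plan flags have been experimental and may require PULUMI_EXPERIMENTAL=true on some versions.

```bash
# Save the exact plan that preview computed (experimental on some CLI
# versions; may require PULUMI_EXPERIMENTAL=true).
pulumi preview --stack staging --save-plan plan.json

# Constrain the update to the saved plan: pulumi up refuses to perform any
# operation (such as a delete) that the plan does not contain.
pulumi up --stack staging --plan plan.json
```

Depending on where exactly the regression lived (plan generation versus plan display), this may or may not have caught this particular incident, but it removes one class of preview/up divergence.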

Impact

  • Staging environment fully unavailable for 16 minutes (14:36–14:52 UTC).
  • 3 staging API EC2 instances deleted, causing API downtime.
  • AWS RDS subnet group removed, breaking staging database connectivity for 12 minutes.
  • No production impact, no customer data loss, no permanent infrastructure damage.

Remediation Steps

  1. Immediately pinned all CI/CD pipelines and engineers' local environments to Pulumi CLI versions below 3.120.0 to prevent further accidental use of the buggy release.
  2. Restored deleted resources using a previous known-good Pulumi state backup and CLI 3.119.2.
  3. Upgraded all environments to Pulumi CLI 3.120.1 once the patch was released, verifying preview output matched expected changes.
  4. Added a pre-deployment check to CI/CD pipelines: run pulumi preview with two separate CLI versions and compare the diffs for critical stacks, as sketched below.
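
Below is a minimal sketch of checks 1 and 4 for a Linux CI runner. It assumes jq is available, that the tarball URLs follow Pulumi's GitHub release naming, and that the changeSummary field is present in the JSON preview output, as in recent CLI versions.

```bash
#!/usr/bin/env bash
set -euo pipefail

PINNED="3.119.2"    # last known-good version (step 1)
CANDIDATE="3.120.1" # version under evaluation (step 4)
STACK="staging"

# Install each CLI version into its own directory so the binaries don't collide.
for v in "$PINNED" "$CANDIDATE"; do
  mkdir -p "/tmp/pulumi-$v"
  curl -fsSL "https://github.com/pulumi/pulumi/releases/download/v${v}/pulumi-v${v}-linux-x64.tar.gz" \
    | tar -xz -C "/tmp/pulumi-$v" --strip-components=1
done

# Run preview with both versions, keeping only the change counts.
"/tmp/pulumi-$PINNED/pulumi" preview -s "$STACK" --json | jq -S '.changeSummary' > /tmp/summary-pinned.json
"/tmp/pulumi-$CANDIDATE/pulumi" preview -s "$STACK" --json | jq -S '.changeSummary' > /tmp/summary-candidate.json

# Fail the pipeline if the two CLIs disagree about pending changes.
diff /tmp/summary-pinned.json /tmp/summary-candidate.json \
  || { echo "Preview mismatch between CLI $PINNED and $CANDIDATE; blocking deploy."; exit 1; }
```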

Lessons Learned

  • Never trust preview output implicitly, even for "no change" results. Cross-verify with cloud provider CLIs (e.g., the AWS CLI) for critical resource changes; see the sketch after this list.
  • Pin infrastructure tool versions (CLI, providers) in CI/CD pipelines to avoid untested automatic updates.
  • Test new Pulumi CLI releases in isolated non-production stacks before rolling out to staging or production.
  • Maintain regular Pulumi state backups (also sketched below) to accelerate rollback during incidents.
  • Subscribe to Pulumi release notes and security advisories to stay informed of known issues.
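
Below is a minimal sketch of the first and fourth bullets. It assumes the AWS CLI is configured for the staging account and that staging instances carry an env=staging tag; the tag is an assumed convention for this sketch, not something taken from our actual stack.

```bash
# Back up the stack state before any update (fourth bullet).
pulumi stack export --stack staging --file "staging-$(date +%Y%m%dT%H%M%S).json"

# Restore from a backup if an update goes wrong:
#   pulumi stack import --stack staging --file staging-<timestamp>.json

# Cross-verify against the cloud provider (first bullet): count running
# staging API instances directly via the AWS CLI.
aws ec2 describe-instances \
  --filters "Name=tag:env,Values=staging" "Name=instance-state-name,Values=running" \
  --query 'length(Reservations[].Instances[])'
```

If preview reports "no changes" but the instance count drops after an update, the mismatch surfaces immediately.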

Conclusion

The Pulumi 3.120.0 preview bug was a high-impact regression that highlighted gaps in our deployment verification process. Thanks to rapid incident response and Pulumi's quick patch, the impact was limited to staging with no production harm. All teams using Pulumi are advised to upgrade to 3.120.1 or later to avoid this issue.
