ANKUSH CHOUDHARY JOHAL

Posted on • Originally published at johal.in

Postmortem: How a 2026 LaunchDarkly Misconfiguration Served New Features to 100% of Users

On January 15, 2026, our engineering team responded to an unplanned incident caused by a misconfiguration in our LaunchDarkly feature flag setup. The incident resulted in three new Q1 2026 features being served to 100% of our user base, far exceeding the planned 5% canary rollout. This postmortem details the timeline, root cause, impact, and steps we’ve taken to prevent similar incidents.

Timeline of Events

  1. 09:00 UTC: Engineering initiates scheduled canary rollout of Q1 2026 features (Dark Mode v2, Advanced Analytics Dashboard, Enterprise RBAC) via LaunchDarkly, targeting 5% of active users in the pre-defined "canary-segment" group.
  2. 09:12 UTC: Monitoring alerts trigger for a 300% spike in support tickets mentioning unexpected new features, alongside elevated 500 error rates for the Advanced Analytics Dashboard.
  3. 09:15 UTC: On-call engineers identify the root issue: the "q1-2026-features" LaunchDarkly flag is set to return true for 100% of users, bypassing the canary segment targeting rule.
  4. 09:20 UTC: The flag is reverted to its default false state for all users (a sketch of this kind of API-driven revert follows the timeline). The canary segment targeting rule is corrected and validated.
  5. 09:45 UTC: Features are re-deployed to the 5% canary group successfully. Monitoring confirms error rates return to baseline, and no further support tickets related to the incident are filed.
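The post doesn’t record exactly how the 09:20 revert was performed; as a hedged illustration, here is a minimal sketch of turning the flag off through LaunchDarkly’s REST API. The project key and environment key are assumptions for the example; only the flag key comes from this incident.

```python
# Minimal sketch: turn the "q1-2026-features" flag off in production via
# LaunchDarkly's REST API. PROJECT_KEY and ENV_KEY are assumptions for
# illustration; the actual revert may have been done through the UI.
import os
import requests

LD_API = "https://app.launchdarkly.com/api/v2"
PROJECT_KEY = "default"          # assumed project key
FLAG_KEY = "q1-2026-features"    # flag named in this postmortem
ENV_KEY = "production"           # assumed environment key

def kill_flag() -> None:
    """Flip the environment-level 'on' toggle so all users get the default (false) variation."""
    resp = requests.patch(
        f"{LD_API}/flags/{PROJECT_KEY}/{FLAG_KEY}",
        headers={
            "Authorization": os.environ["LD_API_TOKEN"],  # LaunchDarkly API access token
            "Content-Type": "application/json",
        },
        # JSON Patch: set targeting to off for the production environment.
        json=[{"op": "replace", "path": f"/environments/{ENV_KEY}/on", "value": False}],
        timeout=10,
    )
    resp.raise_for_status()

if __name__ == "__main__":
    kill_flag()
```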

Root Cause Analysis

We traced the misconfiguration to a recent migration of our feature flag management from manual LaunchDarkly UI updates to infrastructure-as-code (IaC) using the LaunchDarkly Terraform provider. A typo in the Terraform manifest for the "q1-2026-features" flag set the targeting rule to targeting = "all_users" instead of targeting = "segment:canary-segment".
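For illustration, here is a simplified sketch of the manifest typo as described above. The attribute names mirror the post’s shorthand rather than the provider’s exact schema, and the resource references are assumptions.

```hcl
resource "launchdarkly_feature_flag_environment" "q1_2026_features" {
  flag_id = launchdarkly_feature_flag.q1_2026_features.id
  env_key = "production"
  on      = true

  # The typo that shipped: the flag evaluated true for every user.
  targeting = "all_users"

  # Intended value, limiting the rollout to the 5% canary group:
  # targeting = "segment:canary-segment"
}
```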

Contributing factors included:

  • Our CI/CD pipeline only validated Terraform syntax, not the semantic correctness of flag targeting rules.
  • The team had not updated pre-deployment guardrails to account for the new IaC workflow for feature flags.
  • No mandatory peer review was required for flag-related Terraform changes at the time of the incident.

Impact Assessment

The incident impacted 1.2 million active users, with 4,200 support tickets filed in the 30 minutes following the misconfiguration. Key impacts included:

  • 18 minutes of elevated error rates (5.2% 500 errors) for the Advanced Analytics Dashboard, which was not fully load-tested for 100% traffic.
  • Premature exposure of Enterprise RBAC features to free-tier users, leading to 120 upgrade requests and 45 confusion-related tickets.
  • No data loss, security breaches, or permanent user-facing outages occurred.

Remediation Steps Taken

Immediate actions taken during the incident included:

  • Reverting the affected LaunchDarkly flag to its default off state within 5 minutes of identification.
  • Publishing an in-app banner and email notification to all users explaining the incident and confirming no data was compromised.
  • Updating support team runbooks with FAQs for unexpected feature access.

Prevention Measures

To prevent recurrence, we’ve implemented the following changes:

  • Added semantic validation to our CI/CD pipeline for all LaunchDarkly Terraform changes, including checks that canary flags target ≤10% of users and that production flags require manual approval (a sketch of this kind of check appears after this list).
  • Mandatory peer reviews for all flag-related IaC changes, with a checklist to verify targeting rules match rollout plans.
  • Enabled LaunchDarkly’s native change approvals and real-time Slack alerts for all flag modifications in production environments.
  • Conducted company-wide incident response training focused on feature flag misconfigurations, with quarterly disaster recovery drills.
  • Updated internal documentation for LaunchDarkly IaC workflows, including common Terraform pitfalls and validation steps.
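As a rough sketch of the semantic check mentioned in the first bullet: the snippet below assumes the plan was exported with `terraform show -json` and that targeting is expressed via the simplified `targeting` attribute used in this post, so the field names are illustrative rather than the exact provider schema.

```python
# Sketch of a CI semantic check for LaunchDarkly flag changes. Assumes the
# plan was exported with `terraform show -json plan.out > plan.json`; the
# `targeting` field follows this post's simplified shorthand, so adapt the
# field names to your actual provider schema.
import json
import sys

def check_plan(plan_path: str) -> list[str]:
    """Return human-readable violations found in the Terraform plan JSON."""
    with open(plan_path) as f:
        plan = json.load(f)

    violations = []
    for change in plan.get("resource_changes", []):
        if change.get("type") != "launchdarkly_feature_flag_environment":
            continue
        after = (change.get("change") or {}).get("after") or {}
        targeting = after.get("targeting", "")
        # Canary-phase flags must target a named segment, never the whole user base.
        if targeting == "all_users":
            violations.append(
                f"{change['address']}: targets all_users; expected a canary segment"
            )
    return violations

if __name__ == "__main__":
    problems = check_plan(sys.argv[1] if len(sys.argv) > 1 else "plan.json")
    for p in problems:
        print(f"SEMANTIC CHECK FAILED: {p}", file=sys.stderr)
    sys.exit(1 if problems else 0)
```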

Conclusion

This incident highlighted gaps in our IaC adoption for feature flag management, but also reinforced the importance of layered validation and peer review for production changes. We’ve since seen zero flag-related misconfigurations in subsequent rollouts, and our monitoring for flag changes has improved visibility across all engineering teams. We remain committed to transparent postmortems and continuous improvement of our reliability practices.
