DEV Community

ANKUSH CHOUDHARY JOHAL

Posted on • Originally published at johal.in

War Story: We Merged a Bad PR to Main and Broke Production for 1000 Users – Here’s How We Improved Our PR Process


It was 4:30 PM on a Friday. Our 8-person engineering team at SaaS Co. (mid-sized project management tool) was wrapping up for the week. A PR from a junior engineer, labeled “Quick fix: Update payment gateway null handling,” had two approvals, passed all CI checks, and was merged to main. 12 minutes later, our production monitoring dashboard lit up red: 1000 active users couldn’t access their dashboards, and payment attempts were failing at a 100% rate.

The Incident: What Went Wrong

The PR in question modified a null check in our payment integration code. Previously, if a user’s billing address was null, the system would throw a handled error and prompt the user to update their info. The PR changed this to default to an empty string, with the logic: if (!billingAddress) billingAddress = "";. The problem? The payment gateway’s API rejected empty strings for required address fields, causing all payment flows and post-payment dashboard redirects to crash.
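For illustration, a safer version of that check would fail fast instead of defaulting to an empty string, preserving the original "prompt the user to fix their info" behavior. This is a sketch, not our actual code; the function and error names are hypothetical:

```javascript
// Illustrative sketch: reject missing or empty billing addresses before
// the request ever reaches the payment gateway, rather than silently
// substituting "" (which the gateway rejects for required fields).
function normalizeBillingAddress(billingAddress) {
  if (!billingAddress || billingAddress.trim() === "") {
    // Surface a handled error so the UI can prompt the user to update
    // their info, as the pre-PR behavior did.
    throw new Error("BILLING_ADDRESS_REQUIRED");
  }
  return billingAddress.trim();
}
```

The key design choice is that `null` and `""` are treated identically: both are invalid inputs to the gateway, so neither should pass validation.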

Our CI pipeline only ran unit tests, which mocked the payment gateway response and didn’t catch the empty-string rejection. The single PR reviewer, a fellow junior engineer, didn’t spot the logic flaw. The main branch auto-deployed to production within 5 minutes of merge, and we had no staging environment to validate PRs before merge. We didn’t notice the outage for 15 minutes: our user-error alerting was only configured to trigger at a 5% error rate, and although this incident hit 100% immediately, we had no real-time dashboard for critical flows to surface it sooner. Rollback took 45 minutes, as we had to revert the merge, re-run CI, and redeploy. Total downtime: 2 hours.

The Fallout

We lost 12 enterprise customers that week, had 47 support tickets filed, and our uptime SLA dropped to 99.1% for the month. A blameless post-mortem revealed the root cause wasn’t the engineer’s mistake, but our broken PR process that let a bad change reach production untested.

Root Causes of the Bad Merge

  • Single reviewer requirement: Only one peer review was required, often from engineers unfamiliar with the payment codebase.
  • Weak CI checks: No integration or end-to-end (e2e) tests for critical user flows, only unit tests with mocked dependencies.
  • No staging validation: PRs were never deployed to a production-like environment before merge.
  • Unrestricted auto-deploys: Main branch automatically deployed to production with no manual approval gate.
  • Rushed PRs: No mandatory test plan or risk assessment in PR descriptions, leading to missed edge cases.

How We Overhauled Our PR Process

We spent the next 2 weeks implementing mandatory process changes, all enforced via GitHub branch protection rules and CI pipeline updates:

  • Mandatory two reviewers: All PRs require two approvals: one from a senior engineer, and one from an engineer who has contributed to the affected codebase area in the last 6 months. No self-reviews, no exceptions.
  • Expanded CI pipeline: Added integration tests for all payment, login, and dashboard flows, plus e2e tests for critical paths using Cypress. All CI checks (unit, integration, e2e, lint, security scans) must pass before merge is allowed.
  • Staging validation requirement: Authors must deploy their PR to our staging environment (production-like, connected to sandbox payment gateways) and validate the change manually before requesting merge. Staging deploy status is required in the PR description.
  • Manual prod deploy approval: Main branch no longer auto-deploys to production. All prod deploys require manual approval from a team lead, and only from main after staging validation and all CI checks pass.
  • Standardized PR template: All PRs must use a template with mandatory sections: detailed description, test plan (unit, integration, manual), risk assessment (low/medium/high), and screenshots/GIFs for UI changes. Empty sections block merge.
  • Friday merge freeze: No PR merges to main after 2 PM ET on Fridays, unless tagged as an emergency fix. This prevents rushed end-of-week merges with less review coverage.
  • Blameless post-mortems: Every production incident triggers a post-mortem within 48 hours, with action items assigned to owners and tracked in our project management tool until closed.
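For reference, a minimal PR template along these lines might look like the following (an illustrative sketch; our actual template includes more project-specific fields):

```markdown
## Description
<!-- What is changing and why -->

## Test Plan
- [ ] Unit tests
- [ ] Integration tests
- [ ] Manual validation on staging (link the staging deploy)

## Risk Assessment
<!-- low / medium / high, with a one-line justification -->

## Screenshots / GIFs
<!-- Required for UI changes -->
```

GitHub picks up a template placed at `.github/PULL_REQUEST_TEMPLATE.md`; enforcing that sections are non-empty requires an extra CI check, since the template itself is only pre-filled text.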

The Results

Six months later, we’ve had zero production outages caused by bad PR merges. Our average PR merge time increased from 45 minutes to 1 hour 15 minutes, but the tradeoff is worth it: our uptime SLA is back to 99.99%, customer support tickets related to bugs are down 82%, and team confidence in our deployment process is at an all-time high.

The Friday outage was a painful lesson, but it forced us to fix a process that was bound to cause bigger issues down the line. If you’re running a team with production auto-deploys, take it from us: don’t skimp on PR reviews, testing, and staging validation. Your users will thank you.
