The Tuesday the database got smaller
A client of ours ran a loyalty program with about 120,000 members. On a Tuesday afternoon their agency pushed a "cleanup migration" to production. The intent was to merge duplicate accounts where the same email had signed up twice with different casing. The script ran, the dashboard was snappier than usual, and someone in the client's marketing team noticed by Wednesday morning that roughly 40,000 members had vanished from the list.
The migration had matched on normalized email, yes, but it had also silently deleted the "loser" row instead of merging the points balances first. There was no soft delete. There was no dry run log. The backup was 19 hours old, which meant a full day of new signups and point redemptions was gone by the time anyone restored it.
The agency's post-mortem used the word "oversight" four times. The real word is "untested". Nobody had run the script against a production-shaped dataset before Tuesday. The staging DB had 300 rows in it. I know because I asked.
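For the record, the safer version of that script is not exotic. Here is a rough sketch of the shape I would have wanted it to have, written in TypeScript with node-postgres purely for illustration; the table and column names (members, points, created_at, deleted_at) are my assumptions, not the client's actual schema. The order of operations is the point: pair each duplicate with the account that survives, move the balance, soft-delete, and default to a dry run that only logs.

```typescript
// Illustrative sketch only: not the client's script. Assumes a Postgres
// "members" table with "email", "points", "created_at" and a nullable "deleted_at".
import { Pool } from "pg";

const pool = new Pool(); // connection settings come from the usual PG* env vars
const DRY_RUN = process.env.DRY_RUN !== "false"; // default to a dry run

async function mergeDuplicateMembers(): Promise<void> {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");

    // Pair every duplicate ("loser") with the oldest surviving account for
    // the same normalized email ("winner"), instead of just deleting it.
    const { rows: pairs } = await client.query(`
      SELECT loser.id AS loser_id,
             winner.id AS winner_id,
             loser.points AS loser_points
      FROM members loser
      JOIN LATERAL (
        SELECT w.id
        FROM members w
        WHERE lower(w.email) = lower(loser.email)
          AND w.deleted_at IS NULL
        ORDER BY w.created_at ASC
        LIMIT 1
      ) winner ON winner.id <> loser.id
      WHERE loser.deleted_at IS NULL
    `);

    console.log(`${pairs.length} duplicate accounts would be merged`);

    if (DRY_RUN) {
      await client.query("ROLLBACK"); // log what would happen, change nothing
      return;
    }

    for (const p of pairs) {
      // Move the points balance first...
      await client.query(
        "UPDATE members SET points = points + $1 WHERE id = $2",
        [p.loser_points, p.winner_id]
      );
      // ...then soft-delete the loser so the row is still there to audit.
      await client.query(
        "UPDATE members SET deleted_at = now() WHERE id = $1",
        [p.loser_id]
      );
    }

    await client.query("COMMIT");
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}

mergeDuplicateMembers().catch((err) => {
  console.error(err);
  process.exit(1);
});
```

Run that dry run against a production-shaped copy of the data and the merge count is staring at you on Monday instead of surfacing in the marketing dashboard on Wednesday.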
The other one was a checkout
Different client, different year. An e-commerce build. Clean code, good developers, the sort of team you would hire again. They shipped a payment integration update to handle a new 3D Secure flow. It worked on desktop Chrome. It worked on Android. It worked in their test Stripe account.
What it did on mobile Safari was charge the card, fail to register the webhook response, show the user a generic error, and then charge again when the user tapped the retry button. For about six hours on a Saturday, roughly one in four mobile checkouts double-billed. Support found out from Twitter. The refund process took two weeks because the agency had to reconcile it manually against Stripe exports.
I want to be fair to the developers here. The bug was subtle. Safari's handling of redirect-based payment flows has edge cases that will age you a year per hour you spend debugging them. But the reason it reached production was not the subtlety. It was that the team's QA process was one developer clicking through on one phone before the PR merged. There was no mobile test matrix. There was no webhook replay in staging. There was no synthetic buyer running the full flow every 10 minutes after deploy.
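There is also a structural guard that makes the retry-double-charge class of bug much harder to ship, independent of how good your QA is: idempotency keys. Stripe supports them on request creation, and if the key is derived from your own order, a retry of the same checkout returns the original payment instead of creating a second one. A minimal sketch, assuming stripe-node and an orderId your checkout already generates; none of this is the client's actual code.

```typescript
// Sketch only: assumes stripe-node and an orderId created once per checkout.
import Stripe from "stripe";

const stripe = new Stripe(process.env.STRIPE_SECRET_KEY ?? "");

// If the user taps "retry", this gets called again with the SAME orderId.
// Stripe sees the same idempotency key and returns the original
// PaymentIntent instead of charging the card a second time.
async function createCheckoutPayment(orderId: string, amountCents: number) {
  return stripe.paymentIntents.create(
    {
      amount: amountCents,
      currency: "eur", // currency is an assumption for the example
      automatic_payment_methods: { enabled: true },
    },
    { idempotencyKey: `checkout-${orderId}` }
  );
}
```

That would not have fixed the webhook handling or the generic error message, but it would have turned a two-week manual refund reconciliation into a non-event.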
The part where I admit we missed one ourselves
BetterQA runs its own internal tools and we eat our own cooking, mostly. But in 2024 we shipped a change to our timesheet platform that quietly broke PTO accrual for contractors who had been hired mid-month. The math was off by a fraction of a day. It was the kind of thing you would only notice if you looked at your own balance closely and thought "that seems low".
One contractor did notice, a month later, and flagged it politely in Slack. When we dug in, we found the test coverage for the accrual logic had a gap exactly where new-hire proration lived. We had tested the happy path, the leaver path, and the anniversary path. We had not tested the "joined halfway through April" path. A QA team. Missing a date edge case. On our own product.
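If you want to see how small the gap was, the missing test looked more or less like this. The accrual function below is a stand-in I wrote for this post, with a made-up rate of 1.25 days per month; our real logic is more involved, but the shape of the hole is accurate.

```typescript
// Illustrative only: a stand-in for our real accrual logic, with an
// invented rate of 1.25 PTO days per month, prorated by calendar days
// for the month someone joins.
import assert from "node:assert/strict";

function hireMonthAccrual(hireDate: Date, monthlyRate = 1.25): number {
  const daysInMonth = new Date(
    hireDate.getFullYear(), hireDate.getMonth() + 1, 0
  ).getDate();
  const daysEmployed = daysInMonth - hireDate.getDate() + 1;
  return monthlyRate * (daysEmployed / daysInMonth);
}

// The path we had covered: joined on the 1st, the full month accrues.
assert.equal(hireMonthAccrual(new Date(2024, 3, 1)), 1.25);

// The path we had not: joined halfway through April (the 16th).
// 15 of 30 days employed, so 0.625 days accrue, not 1.25.
assert.equal(hireMonthAccrual(new Date(2024, 3, 16)), 0.625);
```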
I bring this up because the usual posture of a QA company writing a blog post is to sound like the wise adult in the room. We are not. We are the people who got burned enough times to build checklists, and we still occasionally miss things. The checklists are how we miss them less often than we used to.
What these stories have in common
None of the three disasters I just described were caused by bad engineers. They were caused by the same structural problem, which is that the team doing the work was also the team signing off on the work. Our founder has a line he uses in sales conversations, and it sounds glib until you have lived through one of these: the chef should not certify his own dish.
When a developer tests the feature they just wrote, they test the paths they imagined while writing it. They do not test the paths they did not imagine, because by definition they did not imagine them. An independent tester, even one who is less skilled technically, will wander into those unimagined paths because they are not carrying the mental model the developer built. This is why the migration script passed review, and why the mobile Safari checkout looked fine in the PR, and why the accrual bug sat in our own codebase for weeks.
Independent QA is not about hiring a second tier of people to do clicks. It is about introducing a viewpoint that is not contaminated by the assumptions of the build.
What I would actually tell a client who asked me how to avoid this
Stop asking agencies "do you do QA". Every agency says yes. Start asking: who on your team will physically run my feature on a device they did not use to build it, before it gets to my users? If the answer is "the same developer who wrote it", you already know what kind of outage you are buying.
Ask to see the last bug report they wrote for a previous client. Not a metric. An actual bug report, with steps and severity and what they did about it. If they cannot show you one because "we don't really write those", they are not doing QA, they are doing vibes.
Ask what happens after launch. Not "do you offer support", every agency offers support. Ask what the response time is for a production incident at 9pm on a Friday, and ask who specifically takes that call. The answer should be a name, not a policy.
Ask them about something they got wrong on a past project and how they caught it. If they cannot think of one, they either have a very bad memory or they are not looking closely enough at their own work. Both are disqualifying.
A short note on feature creep, because it keeps showing up in disasters
I did not want to turn this into a listicle but I have to mention one more thing, because it is the quiet cause of maybe half the production incidents I have seen in the last few years. Scope.
Every feature you ship is a surface area you have to test, monitor, and eventually rewrite. When an agency pitches you a fancy real-time dashboard and you do not actually need real-time, you are not getting a free gift. You are agreeing to debug a WebSocket connection at 2am six months from now when it starts dropping in AWS's us-east-1 for reasons nobody can reproduce. The cheapest feature is the one you did not build.
Most of the "digital transformation" budgets I see waste about 30% of their money on things that looked exciting in the pitch deck and got tested by nobody before launch. The honest agencies will talk you out of half of it. The dishonest ones will let you pay for all of it and then blame the complexity when it breaks.
The uncomfortable summary
If I had to give you one habit that would have prevented the loyalty migration disaster, the Safari double-charge, and our own accrual bug, it is this: before any change reaches users, a person who did not write the change has to try to break it on production-realistic data, on the real devices users use, with the real integrations in place. Not a dev clicking through staging. Not an automated test of the happy path. A human being, with fresh eyes, who gets paid to think nastily.
You can hire this role in-house. You can bring in a partner. We happen to think the partner model works better for most teams, which is why BetterQA exists, but I would rather you have bad independent QA than no independent QA. The point is not the logo on the invoice. The point is that no chef certifies their own dish.
If you are about to ship a migration, a payment change, or anything that touches money, authentication, or customer data, ask yourself one question before you merge the PR. Who is going to break this on purpose before my customers break it by accident? If the answer is nobody, you are the nobody, and the Tuesday is coming.