It started like any other normal deploy.
A version bump. Nothing fancy. The kind of change that slips into prod without fanfare.
We had tests. Dev looked good. It was reviewed. Merged. ✅
And for a while… everything was fine.
Then the PagerDuty alert went off.
Then Slack lit up with:
“Checkout’s not loading.”
“The screen is blank.”
“Apple Pay’s broken.”
And that’s when my heart sank.
Because this wasn’t just a bug.
This was the entire payment flow going down.
What made it worse?
It was my first real production outage.
I’d read about outages before. I’d seen postmortems.
But nothing prepares you for what it feels like when your code takes down checkout.
When you're on an incident bridge, adrenaline kicking in, trying to act calm while your mind is racing.
If you’ve ever worked on a checkout team, you know how high the stakes are.
If you haven’t, trust me: this is the kind of breakage that makes your stomach drop.
This isn’t a deep technical write-up. It’s a story.
About what happened, what I learned, and what I wish I knew going in.
So when your moment comes (and it will), maybe you’ll feel a little more ready than I did.
🧨 Part 1: What Changed (And What Went Wrong)
For readers who don’t know me personally or what I do to pay my bills: I work on the web side of payments, maintaining internal libraries that power checkout flows across different products.
It’s a space that lives quietly in the background when everything’s working… and turns into absolute chaos when something breaks.
Recently, we had to update the Apple Pay SDK.
The version we were using had been deprecated by the Apple team, and the new one introduced a few changes in the API contract, mostly around how inputs were passed during initialization.
We made the necessary updates, tested everything on dev, and it all looked solid. ✅
So we merged it and shipped it.
A while later, we got alerted about a spike in Apple Pay errors.
At first, it felt like a small fire.
But as we started digging in, it quickly escalated.
We discovered that the updated SDK relied on a function that wasn’t supported in certain older browsers. That resulted in a runtime error early in the flow, so early that it broke the entire checkout experience.
No Apple Pay.
No other payment methods.
Just a blank screen.
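To make that concrete, here’s a minimal sketch of the failure mode. None of this is the real SDK or our actual checkout code (the names are made up), but the shape is the same: one unguarded call at the top of the bootstrap, and everything after it never runs.

```typescript
// Hypothetical sketch of the failure mode - not the real SDK or our checkout code.

// Stand-in for a third-party SDK that quietly assumes a newer browser API.
function initApplePay(): string {
  // In an older browser the method the SDK depends on simply isn't there,
  // so this call throws a TypeError at runtime.
  const browserApi: { someNewerMethod?: () => void } = {};
  browserApi.someNewerMethod!();
  return "apple-pay";
}

// A perfectly healthy integration that never gets a chance to run.
function initCardPayments(): string {
  return "cards";
}

function bootstrapCheckout(): void {
  // The throw happens on the very first entry, so card payments (and anything
  // else in this list) never initialize and nothing renders at all.
  const methods = [initApplePay(), initCardPayments()];
  console.log("rendering checkout with:", methods);
}

bootstrapCheckout(); // 💥 uncaught TypeError -> no Apple Pay, no cards, just a blank screen
```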
⚠️ First Things First - Don’t Start with Debugging
Before we go into what we found, here’s one thing I learned that I’ll never forget:
When you're facing a production outage, your first instinct shouldn’t be to start debugging.
That’s the mistake I almost made.
Because when your code is involved, your brain goes:
“Okay, where’s the bug? What broke? Let’s fix it.”
But here’s the thing: debugging takes time.
And while you’re digging through logs and testing theories, real users are getting blocked.
Revenue is dropping. The company is losing money. And most importantly, people are having a bad experience.
Unless you know the fix and you're confident it's safe to ship fast, the best move is this:
👉 Revert the commit. Redeploy. Stabilize.
It might not feel heroic, but trust me - it's the most responsible thing you can do.
The goal isn’t to debug under pressure. The goal is to stop the bleeding.
And then, when the incident is over, you can figure out what actually went wrong.
And that’s exactly what we did.
We rolled back the commit.
The blank screen went away. Checkout came back. Users could pay again.
The incident bridge went quiet.
PagerDuty stopped yelling.
But the work wasn’t over.
Now it was on us to figure out exactly what broke and why.
🛠️ Part 2: The Debug Spiral - What We Found
Once things were stable again, it was time to figure out what had actually happened.
The incident bridge was done. The pressure had dropped.
But the ownership was still on us to make sure we knew exactly why checkout had broken, and how to prevent it from ever happening again.
So we started going through logs, testing browsers, and trying to reproduce the issue in a controlled way.
At first, it was confusing.
There was no visible crash.
No red banners. No clean stack trace on screen. Just… silence.
The kind that’s somehow more unsettling than an obvious error.
Eventually, we spotted it: a runtime error coming from deep inside the updated Apple Pay SDK.
It was trying to use a method that wasn’t supported in some older browser environments.
That’s when it clicked:
Of course it was happening on Safari.
Apple Pay is only supported on Safari, so if something in the SDK was going to fail, that’s exactly where it would show up.
And that’s what was happening here.
In certain versions of Safari, this one method wasn’t available.
And since the SDK tried to use it during initialization, it caused a hard crash.
Not just for Apple Pay.
But for everything.
Because the failure happened at the top level of our checkout bootup, no other payment methods loaded either.
No UI. No error message. Just a blank screen.
The kind of bug that slips past on dev.
And punches you right in prod.
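Looking back, the lesson is an old one: treat a third-party SDK as something that is allowed to fail, and keep that failure contained. Here’s a rough sketch of the idea, with illustrative names only (this isn’t our actual library code), assuming the standard ApplePaySession browser-global as the capability check:

```typescript
// Hypothetical sketch: isolating a third-party wallet SDK so one broken
// integration degrades gracefully instead of blanking the whole checkout.

type PaymentMethod = { id: string };

// Run a single integration's init inside a guard: any throw is reported
// and contained rather than killing the whole bootstrap.
function safeInit(name: string, init: () => PaymentMethod | null): PaymentMethod | null {
  try {
    return init();
  } catch (err) {
    console.error(`[checkout] ${name} failed to initialize, skipping`, err);
    return null;
  }
}

function bootstrapCheckout(): void {
  const methods = [
    safeInit("apple-pay", () => {
      // Feature-detect first: ApplePaySession only exists in browsers that
      // support Apple Pay on the web, so don't even try anywhere else.
      if (typeof (window as any).ApplePaySession === "undefined") return null;
      return { id: "apple-pay" }; // imagine the real SDK init happening here
    }),
    safeInit("cards", () => ({ id: "cards" })),
  ].filter((m): m is PaymentMethod => m !== null);

  // Even if the Apple Pay init throws, cards (and everything else) still render.
  console.log("rendering checkout with:", methods.map((m) => m.id));
}

bootstrapCheckout();
```

This wouldn’t have made Apple Pay work in those Safari versions, but it would have turned “checkout is down” into “one wallet is missing”, which is a very different incident.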
🧠 Part 3: It's Not Just About the Fix - It's About Ownership
The bug was found. The rollback was done. Users could pay again.
But I knew my job wasn’t over.
Because being a good engineer doesn’t stop at unblocking users.
It starts when you step back and ask:
- What did this break mean for the user?
- How much revenue did we potentially lose?
- What did this outage cost the business?
- Could this have been caught earlier?
- Will it happen again?
We’re not just here to respond to alerts and patch code like we’re on an assembly line.
We’re not coders you can prompt like ChatGPT and expect the problem to disappear.
We’re here to think deeply, own outcomes, and prevent future damage.
So I started writing.
I documented everything:
- What we changed
- Why we changed it
- What failed (and where)
- How it slipped past QA
- What we missed in our monitoring
- And what browser/version combos were affected
Not because someone told me to, but because future-me (or future teammates) might run into something similar and I want them to have answers faster than I did.
💡 What This Taught Me - and What I Want to Push For
This outage didn’t just teach me what went wrong.
It made me pause and ask: what would I do differently if this happened again?
So I started jotting down ideas.
Not all of them are done yet. Some are in progress.
But they’re now on my radar and they weren’t before this incident.
Here’s what I’m exploring post-outage:
🧪 Introducing browser-specific test cases, especially for critical SDK integrations (see the sketch after this list)
🛡️ Adding guard clauses around third-party SDKs to prevent one failure from taking everything down
📉 Revisiting alert thresholds and adding more granular monitors (especially for visual regressions and blank states)
📊 Surfacing “what changed in this release” summaries to make incident triage faster
📚 Starting a habit of writing lightweight incident summaries for internal sharing and future reference
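For the browser-specific test cases (the first item above), here’s roughly the shape I’m picturing, sketched as a Playwright config. I’m assuming Playwright for the end-to-end checkout tests, and the project names and test path are placeholders. It won’t reproduce genuinely old Safari releases, which still need a device cloud or a manual pass, but it at least forces every change through a WebKit engine before it ships:

```typescript
// playwright.config.ts - a minimal sketch, not our actual pipeline config.
import { defineConfig, devices } from "@playwright/test";

export default defineConfig({
  testDir: "./e2e/checkout", // placeholder path for the critical checkout flows
  projects: [
    // Chromium catches most regressions, but none of the WebKit-only ones.
    { name: "chromium", use: { ...devices["Desktop Chrome"] } },
    // WebKit is the closest stand-in for Safari, where Apple Pay actually runs.
    { name: "webkit", use: { ...devices["Desktop Safari"] } },
    // Emulated mobile Safari for the small-screen checkout layout.
    { name: "mobile-safari", use: { ...devices["iPhone 13"] } },
  ],
});
```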
These aren’t silver bullets.
But they’re the kinds of things I now know matter and they’re on my roadmap moving forward.
🧘‍♀️ Part 4: Wrapping It All Up
If you’ve made it this far, thank you.
This wasn’t the easiest thing to write, but I wanted to put it out there because… no one tells you what your first production fire feels like.
It’s messy.
It’s stressful.
And sometimes, yes, it’s your code.
But you learn.
You learn what it really means to take ownership.
You learn to zoom out, think beyond “what broke”, and start asking “who did this impact?”
You learn that being a good engineer isn’t about writing perfect code - it’s about how you show up when things aren’t perfect.
I’m still learning. But one thing’s for sure: this incident taught me more than weeks of “normal” engineering ever could.
✅ A Checklist for When (Not If) It Happens to You
Here’s a simple list I wish I had in front of me when it all went down:
- 🔄 Don’t start debugging immediately - if the issue is real and users are impacted, revert first, then investigate
- 📣 Communicate early - even just “Looking into it” goes a long way in incident channels
- 🧪 Try reproducing the issue in controlled environments (same browser, device, etc.)
- 🔍 Check logs, network calls, error boundaries - don’t assume it’ll crash loudly
- 📉 Review your monitoring thresholds - are they tuned to catch subtle failures early?
- 📚 Document what happened, in plain language - not just for your team, but for future you
- 🧠 Think about business impact - how many users were affected, and for how long?
- ✍️ If you find patterns or takeaways, write them down - they’re gold in the next incident
In the end, I’ll just say this: breaking prod might not be a badge of honour - but surviving it? Definitely is.
If this story helped you - or reminded you of your own - you can always reach out.
Would love to connect with others who’ve been through the fire. 🔥
📍Twitter / X: smileguptaaa
Top comments (4)
Nice write up.
This is the hardest thing frontend devs, or well, any devs, need to learn along the way.
I do disagree with step 1. I go in and debug immediately.
Sometimes rolling back just means you're undoing the problem, and tracking it down to fix it later is much more work.
After 20 years of doing this, I'm pretty fast at it, so nothing stays broken long.
[Unless I run into something I cannot fix with code, e.g. a broken package, a broken API connection, missing certificates, etc., which makes things take longer to fix.]
I do agree that if you can't apply a patch within 30 minutes to at least make things work, it might make sense to roll back, but most of the time the issues are quick to fix if you understand the changes you implemented well.
With JS frameworks, step 1 is always: look at the console, read the error, paste it into an AI [or Google if you must], and track down the cause [usually, as in your case, some package update broke things, which teaches us to rely less and less on packages when possible]. Most often it's a typo [a comma, parenthesis, or bracket too many or too few], and those are quick fixes you can patch on prod, then go back, fix in dev, and redeploy.
Communication is key. Be honest, be straight to the point. If you need to translate for non-technical folks, just give them an "I'm working on it, will write a breakdown for you when done."
Documentation is nice, but honestly... just build experience.
No one remembers where their notes/documentation is during that time anyway.
And half the time no one remembers "this has happened before", and most of the issues you run into will never happen again on the same project - so two years from now you won't know what to look for in your docs from 20 other projects anyway :)
And you'll find the majority of prod breakdowns are the same issues, just basic stuff people mess up all the time. Learn to recognize them.
Good luck on your dev journey!
Reminds me of my first RCA and incident.
It's nice to have an alternative to rolling back as well - for example, if your checkout process can be taken offline separately from the whole site. When building these things in the past, I've put a checkbox on the dashboard that lets you replace the checkout with a custom maintenance message.
It's quicker than a roll-back and sometimes safer.
For example, in some systems where there were schema changes in the broken release you might not be able to reverse the migration (if there even was one), and you'd be left with the choice of either running the code on an outdated database or reverting to a backup of the data, which could mean losing any changes clients have added in the meantime.
There are "better" ways to build a website, but most sites use off-the-shelf frameworks which don't care about data integrity, unfortunately.
This is a fantastic write up and I could feel the tension in my chest while reading your description of what you were dealing with.
I really like the suggestion to roll back, and the reality that it’s not what we want to do. For me, the desire to debug and fix comes from the idea that if I fix the problem quickly, I can tell myself that it was a “small” outage and no one will really care. But the reality is that a real outage, no matter how small, is always going to have an impact.
I've lived through quite a few production outages over the last 22 years and when it was my code that caused it I felt horrible, when it was someone else's code I felt horrible for them. So I’ll just mention that in addition to all these fantastic suggestions, if you find yourself on the event bridge and it’s not your fault, just offer support to the person who is owning it to help them know they aren’t dealing with it alone.