rokcso

Posted on Apr 15 • Originally published at shipstry.com

How I Ran a Live Production Upgrade in 24 Minutes Without Taking the Site Down

#database #devops #webdev #ai

I do not like touching production unless I have to.

Feature work is fun. Production upgrades are not. Feature work gives you screenshots. Production upgrades give you backups, validation queries, freeze switches, and a very direct answer to the question: do you actually trust your system?

I recently had to do one of those upgrades for Shipstry, a product launch platform I run on Cloudflare Workers.

The maintenance window took about 24 minutes.

During that window, I:

froze writes with a centralized maintenance guard
backed up production D1
migrated comment data into a cleaner split model
moved payments onto a new internal purchase domain
backfilled historical orders conservatively
verified that old entitlements and historical paid amounts still matched before reopening writes

Public pages stayed up the whole time. That was the main requirement from the start.

TL;DR

The release worked because I treated it like an operational change, not a schema change.

The migration script was only one part of it. The real release was:

one switch to freeze writes
a production backup taken before any destructive step
baseline counts recorded in advance
validation gates after every risky move
no pressure to reopen writes early

One SQL migration did fail on the first run because Cloudflare D1 remote execution rejected an explicit transaction wrapper in the file. That did not turn into an incident because the safety rails were already in place.

Why I changed the model

This upgrade was triggered by two areas that had started to outgrow the old schema.

The first was comments.

Product comments and blog comments had drifted enough that pretending they were the same thing was no longer buying simplicity. It was just deferring cleanup.

The second was payments.

The old order model was good enough when it handled a narrower set of paid actions. It became less convincing once the platform started supporting multiple payment-backed behaviors:

backlinks access
paid submissions
product upgrades
upgrade pricing that respects what a maker already paid before

At that point I needed the database to answer business questions directly, instead of forcing new features to keep carrying legacy assumptions around.

The core separation I wanted was:

the payment itself
what the user bought
what that purchase unlocks

That sounds abstract until you hit questions like:

Does an old backlinks purchase still grant access?
If someone already paid for an earlier tier, does that amount still count toward the next upgrade?
Can I evolve the payment system without special-casing old data everywhere?

Once those questions become common, "we will clean it up later" stops being practical.

The rule that mattered most

I did not want a full outage.

I also did not want live writes continuing while I changed production data underneath them.

So the operating rule was simple:

reads stay up, writes freeze first

That only works if you have a real maintenance switch. Not a plan to be careful. Not a checklist item. A switch that the app actually respects.

Before release day, I added a centralized MAINTENANCE_WRITE_FREEZE guard and wired it through the places that can mutate D1:

checkout creation
Stripe webhook mutations
comments and likes
votes
profile updates
admin write actions
draft save, submit, and delete flows
notification endpoints that mutate state

I was not optimizing for elegance. I was optimizing for control.

If I have to do this again, I want one switch that really means one switch.
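A centralized guard of this kind can be sketched in a few lines. This is a minimal illustration, not the actual Shipstry code: only the flag name `MAINTENANCE_WRITE_FREEZE` comes from the article, while `FreezeEnv`, `WriteDecision`, `isWriteFrozen`, `decideWrite`, the accepted flag values, and the 600-second retry hint are all assumptions. It also assumes read-only HTTP methods never mutate D1, which a real app would have to verify route by route.

```typescript
// Hypothetical environment shape; only the flag name comes from the article.
interface FreezeEnv {
  MAINTENANCE_WRITE_FREEZE?: string;
}

// Hypothetical result type: the caller turns this into an HTTP response.
interface WriteDecision {
  allowed: boolean;
  status: number; // 200 when allowed, 503 when frozen
  retryAfterSeconds?: number;
}

// Assumption: handlers for these methods never write to the database.
const READ_ONLY_METHODS = new Set(["GET", "HEAD", "OPTIONS"]);

// The freeze is on when the flag is set to "1" or "true" (case-insensitive).
function isWriteFrozen(env: FreezeEnv): boolean {
  const flag = (env.MAINTENANCE_WRITE_FREEZE ?? "").trim().toLowerCase();
  return flag === "1" || flag === "true";
}

// Single decision point: reads always pass, writes pass only when not frozen.
function decideWrite(method: string, env: FreezeEnv): WriteDecision {
  const isRead = READ_ONLY_METHODS.has(method.toUpperCase());
  if (isRead || !isWriteFrozen(env)) {
    return { allowed: true, status: 200 };
  }
  return { allowed: false, status: 503, retryAfterSeconds: 600 };
}
```

Every mutating handler would call `decideWrite(request.method, env)` before touching the database and return early when `allowed` is false. Keeping the decision in one function is what lets one switch mean one switch.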

The boring work that makes production survivable

Before touching a migration, I exported production D1.

That should feel boring. If backups feel exciting, you are already too late.

I also recorded baselines before the destructive parts:

legacy comment count
legacy comment-like count
top-level vs reply comment counts
soft-deleted comments
legacy order count
paid backlinks count

Those numbers became release gates later.

Not "it seems fine."

Not "the page loaded on my laptop."

Actual before and after checks.

Production had a surprise waiting

Before the main cutover, I checked for old pricing-tier values that should already have been normalized.

Production still had legacy expedition values in both:

product.pricing_tier
legacy order.tier

That was annoying, but useful.

It meant I could not trust migration history in the abstract. I had to trust the database I was actually about to operate on.

So I reran the tier-normalization migration before the comment and payment cutover. That cleaned up the remaining visible tier drift before final deployment.

It is a small example, but it captures something important: production work gets easier the moment you stop arguing with reality.

The comment migration

The comment migration looked straightforward on paper and high-risk in practice.

Old comment data had to be split into four tables:

product_comment
product_comment_like
blog_comment
blog_comment_like

The schema work itself was not the scary part. The real risk was silently dropping relationships or detaching data during the move.

So I checked what actually mattered:

comment counts matched
comment-like counts matched
orphan replies were zero
orphan likes were zero

The production dataset was still small. I treated it like a serious migration anyway. Row count does not change the standard.
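The orphan checks can be expressed as two small functions over exported rows. This is a sketch under assumptions: the row shapes `CommentRow` and `LikeRow` and the field names `parentId` and `commentId` are hypothetical, since the article does not show the real columns. The same pair of checks would run once for the product tables and once for the blog tables.

```typescript
// Hypothetical row shapes; the real column names are not given in the article.
interface CommentRow {
  id: string;
  parentId: string | null; // null for top-level comments
}

interface LikeRow {
  id: string;
  commentId: string;
}

// A reply is an orphan when its parent id does not exist in the same table.
function findOrphanReplies(comments: CommentRow[]): CommentRow[] {
  const ids = new Set(comments.map((c) => c.id));
  return comments.filter((c) => c.parentId !== null && !ids.has(c.parentId));
}

// A like is an orphan when the comment it points to does not exist.
function findOrphanLikes(comments: CommentRow[], likes: LikeRow[]): LikeRow[] {
  const ids = new Set(comments.map((c) => c.id));
  return likes.filter((l) => !ids.has(l.commentId));
}
```

Both functions should return empty arrays after a clean migration. Matching counts alone would not catch a reply that survived the move but lost its parent.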

The payment migration was the real reason for the window

The bigger job was payments.

Shipstry now reads runtime payment state through a cleaner domain:

payment_order
purchase
product_submission_purchase
product_upgrade_purchase

I wanted this because the old model was trying to do too much with too little structure.

Once the platform started supporting more than one kind of paid action, I needed the schema to preserve meaning, not just store rows.

The most important example was upgrade pricing.

If a maker already paid for an earlier tier, Shipstry should not pretend that money never existed. Their next upgrade should take prior payment into account. From the user's perspective that is obvious. From the schema's perspective it is only obvious if the model supports it.
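One simple reading of "take prior payment into account" is a straight credit: the amount due is the target tier price minus what has already been paid, never below zero. The article does not state the actual formula, so the function below is an assumption, as are the name `upgradeAmountDue` and the use of integer cents.

```typescript
// Hypothetical pricing rule: credit everything already paid toward the target tier.
// Amounts are integer cents to avoid floating-point rounding.
function upgradeAmountDue(targetTierPriceCents: number, alreadyPaidCents: number): number {
  if (!Number.isInteger(targetTierPriceCents) || !Number.isInteger(alreadyPaidCents)) {
    throw new RangeError("amounts must be integer cents");
  }
  if (targetTierPriceCents < 0 || alreadyPaidCents < 0) {
    throw new RangeError("amounts must not be negative");
  }
  // Never charge a negative amount if the maker already paid more than the target price.
  return Math.max(0, targetTierPriceCents - alreadyPaidCents);
}
```

For example, with a hypothetical target price of 9900 cents and 2900 cents already paid, the function returns 7000. The rule only works if the schema can answer "how much has this product already paid", which is the continuity the new model is meant to preserve.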

The backfill was where I had to stay conservative

Schema migration alone was not enough.

The new read paths needed old production data to make sense inside the new payment model. That meant historical orders had to be backfilled into the new tables in a way the current code could actually use.

There were two things I absolutely did not want to break:

historical backlinks access
historical paid-amount continuity for product upgrades

I tested the backfill locally against production snapshots before touching live production.

That helped answer the key question:

Could the old data safely reconstruct exact historical upgrade transitions?

Not reliably.

Legacy order rows could safely tell me:

this user bought backlinks access
this product had a paid submission
this product had this historical paid total

What they could not always tell me, with enough confidence, was the exact boundary between an original submission and a later upgrade in every case.

So I kept the backfill conservative.

I backfilled what I could defend. I refused to invent precision that the old data did not actually contain.

A clean new schema built on guessed history is still guessed history.
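As an illustration of what "conservative" can mean here, the sketch below backfills only the three facts the legacy rows could defend: backlinks access, a paid submission per product, and the historical paid total. It deliberately creates no upgrade records, because the submission/upgrade boundary is not recoverable. This is one possible policy, not the actual contents of the backfill migration; `LegacyOrder`, `BackfilledPurchase`, and every field name are hypothetical.

```typescript
// Hypothetical legacy row shape; the real columns are not shown in the article.
interface LegacyOrder {
  id: string;
  userId: string;
  productId: string | null;
  type: "backlinks" | "product";
  amountCents: number;
  paid: boolean;
}

// Hypothetical backfill output: no "product_upgrade" kind exists on purpose.
interface BackfilledPurchase {
  kind: "backlinks_access" | "product_submission";
  userId: string;
  productId: string | null;
  paidTotalCents: number;
  sourceOrderIds: string[];
}

interface BackfillResult {
  purchases: BackfilledPurchase[];
  unresolved: LegacyOrder[]; // rows left for manual review instead of guessed
}

function backfillConservatively(orders: LegacyOrder[]): BackfillResult {
  const purchases: BackfilledPurchase[] = [];
  const unresolved: LegacyOrder[] = [];
  const byProduct = new Map<string, BackfilledPurchase>();

  for (const order of orders) {
    if (!order.paid) continue; // unpaid rows grant nothing
    if (order.type === "backlinks") {
      purchases.push({
        kind: "backlinks_access",
        userId: order.userId,
        productId: null,
        paidTotalCents: order.amountCents,
        sourceOrderIds: [order.id],
      });
      continue;
    }
    if (order.productId === null) {
      unresolved.push(order); // cannot attribute to a product, so do not guess
      continue;
    }
    const existing = byProduct.get(order.productId);
    if (existing) {
      // Later payments raise the paid total but are not labeled as upgrades.
      existing.paidTotalCents += order.amountCents;
      existing.sourceOrderIds.push(order.id);
    } else {
      const purchase: BackfilledPurchase = {
        kind: "product_submission",
        userId: order.userId,
        productId: order.productId,
        paidTotalCents: order.amountCents,
        sourceOrderIds: [order.id],
      };
      byProduct.set(order.productId, purchase);
      purchases.push(purchase);
    }
  }
  return { purchases, unresolved };
}
```

Two paid orders for the same product collapse into one submission purchase whose total is their sum, so upgrade pricing still sees the right amount without the backfill claiming to know which payment was the upgrade.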

The one thing that broke

There was exactly one real hiccup during the window.

The first run of 0040_backfill_order_to_payment_domain.sql failed.

The data was fine. The logic was fine. The problem was operational: D1 remote execution rejected the explicit BEGIN TRANSACTION / COMMIT wrapper in the SQL file.

That was survivable because the release was already set up to absorb surprises:

write freeze was still on
the backup already existed
nothing was forcing me to reopen writes early

So I removed the explicit transaction wrapper, reran the migration, and it completed cleanly.

No rollback. No restore. Just a contained operational fix.

That is the kind of surprise I am willing to accept in production. Not "nothing went wrong", but "something went wrong and stayed small".
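A pre-flight check could catch this class of failure before the window opens. The sketch below is hypothetical tooling, not something the article describes: the fix during the release was a manual edit. The helper names are assumptions, and the pattern only recognizes `BEGIN;`, `BEGIN TRANSACTION;`, and `COMMIT;` when each sits alone on its own line.

```typescript
// Matches a standalone "BEGIN;", "BEGIN TRANSACTION;" or "COMMIT;" line.
// A trigger body's "BEGIN" has no trailing semicolon, so it is left alone.
const TRANSACTION_LINE = /^\s*(BEGIN(\s+TRANSACTION)?|COMMIT)\s*;\s*$/i;

// True when the migration file contains an explicit transaction wrapper.
function hasExplicitTransaction(sql: string): boolean {
  return sql.split("\n").some((line) => TRANSACTION_LINE.test(line));
}

// Returns the SQL with standalone transaction wrapper lines removed.
function stripTransactionWrapper(sql: string): string {
  return sql
    .split("\n")
    .filter((line) => !TRANSACTION_LINE.test(line))
    .join("\n");
}
```

Running `hasExplicitTransaction` over every migration file in CI would turn a mid-window surprise into a failed check days earlier. Stripping the wrapper also changes what happens on a partial failure, so the backup and the write freeze remain the real safety net.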

The database gate was the real release gate

I did not consider the release done because the migrations stopped erroring.

I considered it done only after the database proved that the migration had preserved what mattered.

The gate looked roughly like this:

comment counts still matched
no orphaned comment relationships existed
no rows with the legacy expedition or admiral tier values remained
payment_order count matched legacy order count
migrated purchase count matched expectations
active backlinks purchases matched the old paid backlinks count
a known historical backlinks buyer still resolved correctly
a known paid product still reported the exact same historical paid total under the new model

I also kept write freeze enabled while checking public routes:

homepage
product detail
explore
blog
RSS
sitemap

The site stayed readable the whole time, which was the goal from the start.

Only after the database gates passed did I flip write freeze back off and deploy the final version.
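The count-based part of that gate can be written as data plus one function. This is a sketch assuming the counts have already been queried; `Baseline`, `AfterCounts`, and `failedGates` are hypothetical names, and the spot checks on a known buyer and a known paid product are left out because they depend on specific production records.

```typescript
// Counts recorded before any destructive step (hypothetical field names).
interface Baseline {
  legacyComments: number;
  legacyCommentLikes: number;
  legacyOrders: number;
  paidBacklinks: number;
}

// Counts queried after the migrations ran (hypothetical field names).
interface AfterCounts {
  productComments: number;
  blogComments: number;
  productCommentLikes: number;
  blogCommentLikes: number;
  orphanReplies: number;
  orphanLikes: number;
  legacyTierRows: number;
  paymentOrders: number;
  activeBacklinksPurchases: number;
}

// Returns the names of every gate that failed; an empty array means all passed.
function failedGates(before: Baseline, after: AfterCounts): string[] {
  const gates: Array<[string, boolean]> = [
    ["comment counts match", after.productComments + after.blogComments === before.legacyComments],
    ["comment-like counts match", after.productCommentLikes + after.blogCommentLikes === before.legacyCommentLikes],
    ["no orphan replies", after.orphanReplies === 0],
    ["no orphan likes", after.orphanLikes === 0],
    ["no legacy tier rows", after.legacyTierRows === 0],
    ["payment_order count matches legacy orders", after.paymentOrders === before.legacyOrders],
    ["backlinks purchases match paid backlinks", after.activeBacklinksPurchases === before.paidBacklinks],
  ];
  return gates.filter(([, passed]) => !passed).map(([name]) => name);
}
```

Writes reopen only when `failedGates` returns an empty array. Returning names rather than a single boolean keeps the failure specific when something does not match.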

What users actually care about

From the outside, this could be described as infrastructure work.

That is true, but incomplete.

Users do not care that I created payment_order and purchase.

They care that:

paid access still works
previous payments still count
comments still load
upgrades make sense
the platform feels stable

The invisible part is what makes the visible part credible.

What I took away from it

The biggest lesson is the obvious one: if a release needs a maintenance switch, build the maintenance switch before release day.

The second lesson is that production always gets the last word. If the live database says an old tier is still there, then it is still there, no matter how tidy your migration history looks in Git.

The third lesson is that conservative backfills are underrated. When historical data cannot safely reconstruct perfect semantics, preserve runtime correctness first. Do not make up a cleaner story for the database than the source data can support.

And maybe the most important lesson is this:

The release is not the migration script.

The release is the gates.

Backups, baselines, validation queries, smoke tests, and the discipline not to reopen writes early just because you are tired.

That was the part that made a 24-minute maintenance window possible without turning it into a full outage.

Top comments (4)

Archit Mittal • Apr 18

24 minutes with zero downtime is a solid result. The part I'd dig into: how did you handle in-flight long-lived connections (websockets, SSE streams)? Those are usually what make "blue-green" not actually zero-downtime in practice — the old pods have to drain, and drain windows can stretch.

The pattern I've used is to emit a server-migration-pending event over the websocket ~30s before cutover, so clients reconnect gracefully to the new version. Curious if you did something similar or just let connections re-establish.

mote • Apr 16

"Treating it like an operational change, not a schema change" — this is exactly the mental model shift that separates painful migrations from clean ones.

The write freeze guard as a first-class application concept (not just "stop the server") is something I wish more teams did explicitly. Most outage post-mortems I've read could have been "we froze writes, migrated, validated, unfroze" but instead became "we took down the load balancer, ran a migration, something broke, we rolled back half of it."

The detail about Cloudflare D1 rejecting explicit transaction statements in remote execution is worth flagging more prominently — that's a sharp edge a lot of people will hit. Did you end up using implicit transactions or a different migration path for the failing step?

mote • Apr 20

The freeze-writes-then-verify approach is clean — it's essentially a write barrier pattern where you drain the pipeline before schema migration. Cloudflare Workers D1 makes this practical since you can route traffic based on request properties.

For the historical backfill, did you run into issues with write ordering? I'm thinking about cases where a new order is placed during backfill and the customer's entitlement calculation depends on data that hasn't been migrated yet.

Also — what was your rollback trigger? Did you have a time-based cutoff, or a metric threshold (error rate, latency p99) that would have aborted the migration?

Mykola Kondratiuk • Apr 17

the maintenance window always reveals whether you actually tested your rollback. I've had 24-min windows stretch to 2 hours because the freeze wasn't atomic. cloudflare workers helps but the discipline is the hard part