rokcso

Posted on • Originally published at shipstry.com

How I Ran a Live Production Upgrade in 24 Minutes Without Taking the Site Down

I do not like touching production unless I have to.

Feature work is fun. Production upgrades are not. Feature work gives you screenshots. Production upgrades give you backups, validation queries, freeze switches, and a very direct answer to the question: do you actually trust your system?

I recently had to do one of those upgrades for Shipstry, a product launch platform I run on Cloudflare Workers.

The maintenance window took about 24 minutes.

During that window, I:

  • froze writes with a centralized maintenance guard
  • backed up production D1
  • migrated comment data into a cleaner split model
  • moved payments onto a new internal purchase domain
  • backfilled historical orders conservatively
  • verified that old entitlements and historical paid amounts still matched before reopening writes

Public pages stayed up the whole time. That was the main requirement from the start.

TL;DR

The release worked because I treated it like an operational change, not a schema change.

The migration script was only one part of it. The real release was:

  • one switch to freeze writes
  • a production backup taken before any destructive step
  • baseline counts recorded in advance
  • validation gates after every risky move
  • no pressure to reopen writes early

One SQL migration did fail on the first run because Cloudflare D1 remote execution rejected an explicit transaction wrapper in the file. That did not turn into an incident because the safety rails were already in place.

Why I changed the model

This upgrade was triggered by two areas that had started to outgrow the old schema.

The first was comments.

Product comments and blog comments had drifted enough that pretending they were the same thing was no longer buying simplicity. It was just deferring cleanup.

The second was payments.

The old order model was good enough when it handled a narrower set of paid actions. It became less convincing once the platform started supporting multiple payment-backed behaviors:

  • backlinks access
  • paid submissions
  • product upgrades
  • upgrade pricing that respects what a maker already paid before

At that point I needed the database to answer business questions directly, instead of forcing new features to keep carrying legacy assumptions around.

The core separation I wanted was:

  • the payment itself
  • what the user bought
  • what that purchase unlocks

That sounds abstract until you hit questions like:

  • Does an old backlinks purchase still grant access?
  • If someone already paid for an earlier tier, does that amount still count toward the next upgrade?
  • Can I evolve the payment system without special-casing old data everywhere?

Once those questions become common, "we will clean it up later" stops being practical.

The rule that mattered most

I did not want a full outage.

I also did not want live writes continuing while I changed production data underneath them.

So the operating rule was simple:

reads stay up, writes freeze first

That only works if you have a real maintenance switch. Not a plan to be careful. Not a checklist item. A switch that the app actually respects.

Before release day, I added a centralized MAINTENANCE_WRITE_FREEZE guard and wired it through the places that can mutate D1:

  • checkout creation
  • Stripe webhook mutations
  • comments and likes
  • votes
  • profile updates
  • admin write actions
  • draft save, submit, and delete flows
  • notification endpoints that mutate state
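
A centralized guard of this kind can be sketched in a few lines of TypeScript. This is a minimal illustration, not the actual Shipstry code: only the MAINTENANCE_WRITE_FREEZE name comes from the setup described above, while the Env shape, the accepted flag values, the guardWrite helper, and the 503 status are assumptions.

```typescript
// Hypothetical environment shape; only the variable name is taken from the article.
interface Env {
  MAINTENANCE_WRITE_FREEZE?: string;
}

type WriteDecision =
  | { allowed: true }
  | { allowed: false; status: 503; message: string };

// Assumption: the flag is a string binding and "1" or "true" means frozen.
function isWriteFrozen(env: Env): boolean {
  const flag = (env.MAINTENANCE_WRITE_FREEZE ?? "").trim().toLowerCase();
  return flag === "1" || flag === "true";
}

// Single choke point: every mutating handler asks this before touching D1.
function guardWrite(env: Env): WriteDecision {
  if (isWriteFrozen(env)) {
    return {
      allowed: false,
      status: 503,
      message: "Writes are temporarily paused for maintenance.",
    };
  }
  return { allowed: true };
}
```

Each write path (checkout, webhooks, comments, votes, and so on) would call guardWrite first and return the refusal before running any D1 statement, so flipping one variable covers all of them.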

I was not optimizing for elegance. I was optimizing for control.

If I have to do this again, I want one switch that really means one switch.

The boring work that makes production survivable

Before touching a migration, I exported production D1.

That should feel boring. If backups feel exciting, you are already too late.

I also recorded baselines before the destructive parts:

  • legacy comment count
  • legacy comment-like count
  • top-level vs reply comment counts
  • soft-deleted comments
  • legacy order count
  • paid backlinks count

Those numbers became release gates later.

Not "it seems fine."

Not "the page loaded on my laptop."

Actual before and after checks.
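
One way to turn recorded baselines into a hard gate is to diff them mechanically instead of eyeballing them. The sketch below assumes the counts have already been collected into plain objects; the metric names and the diffCounts helper are hypothetical.

```typescript
// Metric name -> row count, for example { legacy_comments: 120 }.
type Counts = Record<string, number>;

interface Mismatch {
  metric: string;
  before: number;
  after: number | undefined; // undefined means the metric was never re-measured
}

// Returns every baseline metric whose post-migration value differs or is missing.
function diffCounts(before: Counts, after: Counts): Mismatch[] {
  const mismatches: Mismatch[] = [];
  for (const metric of Object.keys(before)) {
    const expected = before[metric] as number;
    const actual: number | undefined = after[metric];
    if (actual !== expected) {
      mismatches.push({ metric, before: expected, after: actual });
    }
  }
  return mismatches;
}
```

An empty result means the gate passes. Where one legacy table is split into several, the caller would sum the new tables into the same metric name before comparing.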

Production had a surprise waiting

Before the main cutover, I checked for old pricing-tier values that should already have been normalized.

Production still had legacy expedition values in both:

  • product.pricing_tier
  • legacy order.tier

That was annoying, but useful.

It meant I could not trust migration history in the abstract. I had to trust the database I was actually about to operate on.

So I reran the tier-normalization migration before the comment and payment cutover. That cleaned up the remaining visible tier drift before final deployment.

It is a small example, but it captures something important: production work gets easier the moment you stop arguing with reality.

The comment migration

The comment migration looked straightforward on paper and high-risk in practice.

Old comment data had to be split into four tables:

  • product_comment
  • product_comment_like
  • blog_comment
  • blog_comment_like

The schema work itself was not the scary part. The real risk was silently dropping relationships or detaching data during the move.

So I checked what actually mattered:

  • comment counts matched
  • comment-like counts matched
  • orphan replies were zero
  • orphan likes were zero

The production dataset was still small. I treated it like a serious migration anyway. Row count does not change the standard.
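
The orphan checks can be expressed as one reusable query shape. The sketch below only builds the SQL text. The four table names come from the list above, but the column names (id, parent_id, comment_id) are assumptions about the schema, and orphanCountSql is a hypothetical helper.

```typescript
// Builds a query that counts child rows whose foreign key points at a missing parent.
// Identifiers are interpolated directly, so only pass fixed, trusted names.
function orphanCountSql(childTable: string, fkColumn: string, parentTable: string): string {
  return [
    `SELECT COUNT(*) AS orphans`,
    `FROM ${childTable} AS c`,
    `WHERE c.${fkColumn} IS NOT NULL`,
    `  AND NOT EXISTS (SELECT 1 FROM ${parentTable} AS p WHERE p.id = c.${fkColumn})`,
  ].join("\n");
}

// Assumed columns: replies reference their parent via parent_id,
// likes reference their comment via comment_id.
const orphanChecks: string[] = [
  orphanCountSql("product_comment", "parent_id", "product_comment"),
  orphanCountSql("product_comment_like", "comment_id", "product_comment"),
  orphanCountSql("blog_comment", "parent_id", "blog_comment"),
  orphanCountSql("blog_comment_like", "comment_id", "blog_comment"),
];
```

Every query in orphanChecks should return zero before the migration is considered safe.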

The payment migration was the real reason for the window

The bigger job was payments.

Shipstry now reads runtime payment state through a cleaner domain:

  • payment_order
  • purchase
  • product_submission_purchase
  • product_upgrade_purchase

I wanted this because the old model was trying to do too much with too little structure.

Once the platform started supporting more than one kind of paid action, I needed the schema to preserve meaning, not just store rows.

The most important example was upgrade pricing.

If a maker already paid for an earlier tier, Shipstry should not pretend that money never existed. Their next upgrade should take prior payment into account. From the user's perspective that is obvious. From the schema's perspective it is only obvious if the model supports it.
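
As an illustration of what "take prior payment into account" can mean, here is one possible rule: charge the difference between the target tier price and what was already paid, never less than zero. The rule itself, the function name, and the use of integer cents are assumptions for this sketch, not Shipstry's actual pricing logic.

```typescript
// All amounts are integer cents to avoid floating-point drift.
function upgradePriceCents(targetTierPriceCents: number, alreadyPaidCents: number): number {
  if (!Number.isInteger(targetTierPriceCents) || !Number.isInteger(alreadyPaidCents)) {
    throw new RangeError("amounts must be integer cents");
  }
  if (targetTierPriceCents < 0 || alreadyPaidCents < 0) {
    throw new RangeError("amounts must not be negative");
  }
  // Prior payments reduce the charge but never produce a negative price.
  return Math.max(0, targetTierPriceCents - alreadyPaidCents);
}
```

A rule like this only works if the schema can report a trustworthy historical paid total per product, which is why that total has to survive the migration unchanged.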

The backfill was where I had to stay conservative

Schema migration alone was not enough.

The new read paths needed old production data to make sense inside the new payment model. That meant historical orders had to be backfilled into the new tables in a way the current code could actually use.

There were two things I absolutely did not want to break:

  1. historical backlinks access
  2. historical paid-amount continuity for product upgrades

I tested the backfill locally against production snapshots before touching live production.

That helped answer the key question:

Could the old data safely reconstruct exact historical upgrade transitions?

Not reliably.

Legacy order rows could safely tell me:

  • this user bought backlinks access
  • this product had a paid submission
  • this product had this historical paid total

What they could not always tell me, with enough confidence, was the exact boundary between an original submission and a later upgrade in every case.

So I kept the backfill conservative.

I backfilled what I could defend. I refused to invent precision that the old data did not actually contain.

A clean new schema built on guessed history is still guessed history.

The one thing that broke

There was exactly one real hiccup during the window.

The first run of 0040_backfill_order_to_payment_domain.sql failed.

The data was fine. The logic was fine. The problem was operational: D1 remote execution rejected the explicit BEGIN TRANSACTION / COMMIT wrapper in the SQL file.

That was survivable because the release was already set up to absorb surprises:

  • write freeze was still on
  • the backup already existed
  • nothing was forcing me to reopen writes early

So I removed the explicit transaction wrapper, reran the migration, and it completed cleanly.
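
For anyone who hits the same rejection with several migration files, the manual edit can be approximated by a small helper. This is a hypothetical, line-based sketch: it only drops BEGIN TRANSACTION; / BEGIN; / COMMIT; statements that sit alone on a line, and it deliberately leaves END; untouched so trigger bodies are not damaged.

```typescript
// Matches a standalone "BEGIN;", "BEGIN TRANSACTION;" or "COMMIT;" line, case-insensitively.
const TRANSACTION_WRAPPER = /^\s*(BEGIN(\s+TRANSACTION)?|COMMIT)\s*;\s*$/i;

// Removes explicit transaction wrapper lines and keeps every other line as-is.
function stripTransactionWrapper(sql: string): string {
  return sql
    .split("\n")
    .filter((line) => !TRANSACTION_WRAPPER.test(line))
    .join("\n");
}
```

Removing the wrapper changes what happens on a mid-file failure, so the result is worth rehearsing against a local copy of the data before running it remotely.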

No rollback. No restore. Just a contained operational fix.

That is the kind of surprise I am willing to accept in production. Not "nothing went wrong", but "something went wrong and stayed small".

The database gate was the real release gate

I did not consider the release done because the migrations stopped erroring.

I considered it done only after the database proved that the migration had preserved what mattered.

The gate looked roughly like this:

  • comment counts still matched
  • no orphaned comment relationships existed
  • no rows with the legacy expedition or admiral tier values remained
  • payment_order count matched legacy order count
  • migrated purchase count matched expectations
  • active backlinks purchases matched the old paid backlinks count
  • a known historical backlinks buyer still resolved correctly
  • a known paid product still reported the exact same historical paid total under the new model
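
The paid-total gate can be made exhaustive rather than checked on one known product. The sketch below compares per-product totals from legacy orders against totals from migrated purchases, using in-memory rows. The row shapes, field names, and the "paid" status value are assumptions, not the real schema.

```typescript
// Hypothetical row shapes; the real tables will differ.
interface LegacyOrderRow { productId: string; amountCents: number; status: string; }
interface PurchaseRow { productId: string; amountCents: number; }

// Sums amountCents per product.
function sumByProduct(rows: { productId: string; amountCents: number }[]): Record<string, number> {
  const totals: Record<string, number> = {};
  for (const row of rows) {
    totals[row.productId] = (totals[row.productId] ?? 0) + row.amountCents;
  }
  return totals;
}

// Returns the product ids whose historical paid total changed across the migration.
function paidTotalDrift(legacy: LegacyOrderRow[], migrated: PurchaseRow[]): string[] {
  // Assumption: only legacy rows with status "paid" count toward the total.
  const before = sumByProduct(legacy.filter((row) => row.status === "paid"));
  const after = sumByProduct(migrated);
  const productIds = Object.keys({ ...before, ...after });
  return productIds.filter((id) => (before[id] ?? 0) !== (after[id] ?? 0));
}
```

An empty array means every product reports the same paid total under both models. Any id in the result is a reason to keep writes frozen.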

I also kept write freeze enabled while checking public routes:

  • homepage
  • product detail
  • explore
  • blog
  • RSS
  • sitemap

The site stayed readable the whole time, which was the goal from the start.

Only after the database gates passed did I flip write freeze back off and deploy the final version.

What users actually care about

From the outside, this could be described as infrastructure work.

That is true, but incomplete.

Users do not care that I created payment_order and purchase.

They care that:

  • paid access still works
  • previous payments still count
  • comments still load
  • upgrades make sense
  • the platform feels stable

The invisible part is what makes the visible part credible.

What I took away from it

The biggest lesson is the obvious one: if a release needs a maintenance switch, build the maintenance switch before release day.

The second lesson is that production always gets the last word. If the live database says an old tier is still there, then it is still there, no matter how tidy your migration history looks in Git.

The third lesson is that conservative backfills are underrated. When historical data cannot safely reconstruct perfect semantics, preserve runtime correctness first. Do not make up a cleaner story for the database than the source data can support.

And maybe the most important lesson is this:

The release is not the migration script.

The release is the gates.

Backups, baselines, validation queries, smoke tests, and the discipline not to reopen writes early just because you are tired.

That was the part that made a 24-minute maintenance window possible without turning it into a full outage.
