Rutika Khaire

Posted on Jun 29

From a 13.5-Hour Mock to a 3-Hour Production Cutover

#softwareupgrade #riskmanagement #lessonslearned #infrastructure

We Planned for a Weekend Upgrade. A Mock Run Showed Us How Close We Were to a Weekend Disaster.

Every IT team has been here. A critical business system needs upgrading. The vendor says it's straightforward — run the installer, verify, go live. Leadership wants it done in one maintenance window. Your gut says it won't be that simple, but you can't explain why without sounding like you're stalling.

This is the story of how a single mock upgrade turned what could have been a stressful weekend into a calm, 3-hour production cutover — and why "just run a test first" might be the most undervalued practice in enterprise IT.

The Situation Most IT Teams Will Recognize

Our client is an association that runs its entire operation (membership, events, orders, finances, and reporting) on a single platform. The platform was two versions behind. Upgrading meant touching the system nearly every employee relied on daily.

The plan looked simple: upgrade from version A to B, then B to C, over a weekend. Users log off Friday evening, we upgrade Saturday and Sunday, everyone logs in Monday morning on the new version. The vendor's documentation suggested roughly two hours per version.

We insisted on running a mock upgrade first.

That decision saved the project.

What the Mock Revealed: The Installer Was the Easy Part

The installer finished in 40 minutes.

Everything looked perfect.

No errors.

No warnings.

If we'd been doing this in production, we probably would have celebrated and moved confidently to the next step.

Then we tried to launch the application.

It wouldn't start.

That was the moment we realized the upgrade wasn't going to be about installing software—it was going to be about understanding an entire application ecosystem.

The upgraded application expected newer runtime libraries, while several background services, supporting components, and custom modules still referenced older versions. Nothing failed during installation—the failures only appeared as each component started independently.

This wasn't one issue.

It was five separate dependency and configuration issues spread across different parts of the application stack. Each component failed differently, and the error messages often pointed us toward the wrong solution before we uncovered the real cause.

Over the next eight hours we documented every issue, refined every fix, updated our runbook, and noted exactly which configuration files needed changes across each application component.

What would have happened without the mock

We would have discovered these issues at around 2 PM on Saturday with users expecting the system back on Monday morning. The weekend would have turned into continuous troubleshooting, increasing the likelihood of extending downtime or making a difficult rollback decision under pressure.

What actually happened

When the real upgrade weekend arrived, every issue had already been solved.

The entire production cutover finished in under three hours.

The Five Things We Learned (That Apply to Any System Upgrade)

1. The Vendor's Upgrade Guide Is a Starting Point, Not a Playbook

Vendor documentation describes the happy path. It assumes clean environments, standard configurations, and minimal customization.

Real production systems rarely look like that.

Our client had custom code, integrations, scheduled jobs, and environment-specific configurations. The vendor's documentation wasn't wrong; it simply couldn't anticipate every customer's implementation.

What to do instead

Treat the vendor guide as your baseline. Then build your own runbook that reflects your environment, your integrations, your customizations, and your validation process.

2. Mock Upgrades Aren't Rehearsals — They're Discovery

We didn't perform the mock to practice.

We performed it to discover everything we didn't know.

That distinction changes the entire mindset.

A rehearsal assumes you already understand the process.

A discovery exercise assumes there are unknown problems waiting to be uncovered.

That means budgeting more time, assigning experienced troubleshooters, documenting every surprise, and expecting to update your runbook throughout the exercise.

Our mock took approximately 13.5 hours.

Our production cutover took less than three.

The difference wasn't faster execution.

It was the absence of surprises.

3. "Successful Installation" Is a Dangerous Milestone

The installer completing successfully creates a false sense of confidence.

In our project, the installer succeeded every single time.

The failures appeared later:

when the application started
when background services initialized
when scheduled jobs executed
when integrations connected
when users exercised real business workflows

The milestones that actually matter

Can users log in?
Can they perform their daily work?
Can they create transactions?
Can reports run successfully?
Do scheduled tasks execute?
Do integrations still communicate correctly?

If your validation ends with "the application launches," you've only tested a small part of what could fail.
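One way to keep those milestones honest is to write them down as checks that must all pass. The sketch below is only an illustration, not the tooling we used: the `Check` structure and the placeholder lambdas are assumptions, and real checks would log in, create a transaction, run a report, and so on.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Check:
    """One business-level validation step (hypothetical structure)."""
    name: str
    run: Callable[[], bool]  # returns True when the workflow works


def run_checks(checks: List[Check]) -> Dict[str, bool]:
    """Run every check, recording a failure instead of stopping on an exception."""
    results: Dict[str, bool] = {}
    for check in checks:
        try:
            results[check.name] = bool(check.run())
        except Exception:
            # A check that crashes counts as a failed milestone.
            results[check.name] = False
    return results


def all_passed(results: Dict[str, bool]) -> bool:
    """True only if at least one check ran and none failed."""
    return bool(results) and all(results.values())


if __name__ == "__main__":
    # Placeholder lambdas stand in for real login, report, and job checks.
    checks = [
        Check("users can log in", lambda: True),
        Check("reports run", lambda: True),
        Check("scheduled tasks execute", lambda: False),
    ]
    results = run_checks(checks)
    for name, ok in results.items():
        print(f"{'PASS' if ok else 'FAIL'}  {name}")
    print("ready for cutover:", all_passed(results))
```

Running every check instead of stopping at the first failure matters here: in a discovery exercise you want the full list of what is broken, not just the first item.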

4. Every Component Has Its Own Configuration Universe

Modern enterprise applications are rarely a single executable.

They're collections of web applications, Windows services, scheduled jobs, APIs, reporting engines, and supporting utilities that often maintain their own configuration files.

During our upgrade, fixing the main application wasn't enough.

Next, the order-processing module failed.

After that, a background task failed.

Then another service exposed a slightly different variation of the same dependency issue.

The lesson was simple:

Map your application's complete topology before the upgrade.

Know every executable, every service, every scheduled process, every configuration file, and every integration point.

Then verify your fixes across all of them—not just the first application users see.
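A small script can help with that verification. The sketch below assumes the stale dependency shows up as a literal string in text-based configuration files, which may not hold for every platform. The file patterns and the marker string are hypothetical and would need to match your own environment.

```python
from pathlib import Path
from typing import Iterable, List


def find_stale_references(
    root: str,
    old_marker: str,
    patterns: Iterable[str] = ("*.config", "*.xml"),  # assumed file types
) -> List[Path]:
    """Return every config file under root that still contains old_marker."""
    found = set()
    for pattern in patterns:
        for path in Path(root).rglob(pattern):
            if not path.is_file():
                continue
            # errors="replace" keeps the scan going on oddly encoded files.
            text = path.read_text(encoding="utf-8", errors="replace")
            if old_marker in text:
                found.add(path)
    return sorted(found)


if __name__ == "__main__":
    # Hypothetical install root and version marker, for illustration only.
    for path in find_stale_references(".", 'version="1.0.0.0"'):
        print("still references old version:", path)
```

An empty result after applying a fix is a reasonable signal that the same change reached every component, not just the first one that failed.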

5. Your Rollback Plan Should Require Almost No Action

Many rollback plans begin with restoring backups.

Ours began with not changing production until we were confident.

The existing production server remained untouched throughout the project.

The upgrade happened on a separate server.

DNS, connection strings, and user access continued pointing to the existing environment until every validation step had passed and the client approved the cutover.

If something unexpected had happened—even after hours of work—we simply wouldn't have switched users to the new environment.

No emergency restores.

No rushed database recovery.

Just another business day on the existing system.

That design dramatically reduced both operational risk and decision pressure during the maintenance window.
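The decision rule behind that design is simple enough to state as code. This is a minimal sketch with hypothetical names (`choose_active_environment`, the `"existing"` and `"upgraded"` labels); the actual switch of DNS and connection strings is outside its scope.

```python
from typing import Dict


def choose_active_environment(
    validations: Dict[str, bool],
    client_approved: bool,
    existing: str = "existing",
    upgraded: str = "upgraded",
) -> str:
    """Point users at the upgraded environment only when everything passed.

    Any failed validation, an empty validation set, or a missing client
    approval leaves users on the untouched existing environment.
    """
    ready = bool(validations) and all(validations.values())
    return upgraded if ready and client_approved else existing


if __name__ == "__main__":
    results = {"login": True, "orders": True, "reports": False}
    print(choose_active_environment(results, client_approved=True))
```

The default outcome is "stay where you are", so rolling back means doing nothing.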

The Uncomfortable Math

	Without Mock	With Mock
Planning effort	2 days	2 weeks
Production downtime	Unknown	Under 3 hours
Confidence during cutover	Low	High
Unknown issues	Many	Mostly eliminated
Monday morning	Fingers crossed	Boring (which is exactly what you want)

Yes, the mock added time to the project.

It also removed uncertainty.

Those 13.5 hours weren't extra work.

They were production troubleshooting completed safely before anyone depended on the system.

Our Final Production Runbook Looked Something Like This

Verify backups and rollback readiness.
Upgrade to Version B.
Apply documented dependency and configuration updates.
Validate application services.
Upgrade to Version C.
Reapply required configuration changes.
Execute technical smoke tests.
Execute client business test cases.
Approve cutover.
Switch users to the upgraded environment.
Monitor critical services after go-live.

By production weekend, none of these steps required improvisation.
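A runbook like this can also be expressed as an ordered list of steps that halts at the first failure. The sketch below is an assumption about how one might encode it, not the format we used; the step names are abbreviated from the list above and the lambdas are placeholders for real actions.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[], bool]]  # (step name, action returning success)


def run_runbook(steps: List[Step]) -> Tuple[List[str], Optional[str]]:
    """Run steps in order and stop at the first one that reports failure.

    Returns the completed step names and the name of the failed step,
    or None when every step succeeded.
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            return completed, name
        completed.append(name)
    return completed, None


if __name__ == "__main__":
    steps: List[Step] = [
        ("Verify backups and rollback readiness", lambda: True),
        ("Upgrade to Version B", lambda: True),
        ("Apply documented configuration updates", lambda: True),
        ("Validate application services", lambda: False),  # simulated failure
        ("Upgrade to Version C", lambda: True),
    ]
    done, failed = run_runbook(steps)
    print("completed:", done)
    print("stopped at:", failed)
```

Stopping at the first failed step keeps a later step, such as the second version upgrade, from running on top of an unvalidated state.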

What We'd Do Differently Next Time

No project is perfect.

Looking back, we'd improve two things.

We'd ask the client for business test cases much earlier. Our technical smoke tests were good, but only the client truly understood every business workflow that needed validation.
We'd automate the configuration updates. Applying XML and configuration changes manually worked, but scripting them would reduce both time and the possibility of human error.
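As an illustration of what that scripting could look like, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The `<add key="..." value="..."/>` layout and the `RuntimeVersion` key are assumptions made for the example, not the actual settings from this project.

```python
import xml.etree.ElementTree as ET
from typing import Tuple


def set_app_setting(xml_text: str, key: str, value: str) -> Tuple[str, bool]:
    """Set value on every <add key=...> element matching key.

    Returns the updated XML and whether anything changed, so a second run
    over an already-updated file reports no change.
    Note: the default ElementTree parser drops XML comments, so review the
    output before overwriting a hand-maintained file.
    """
    root = ET.fromstring(xml_text)
    changed = False
    for element in root.iter("add"):
        if element.get("key") == key and element.get("value") != value:
            element.set("value", value)
            changed = True
    return ET.tostring(root, encoding="unicode"), changed


if __name__ == "__main__":
    # Hypothetical config fragment, for illustration only.
    sample = (
        "<configuration><appSettings>"
        '<add key="RuntimeVersion" value="1.0" />'
        "</appSettings></configuration>"
    )
    updated, changed = set_app_setting(sample, "RuntimeVersion", "2.0")
    print(changed)
    print(updated)
```

Returning a changed flag makes the script safe to rerun, which is useful when the same fix has to be reapplied after each version upgrade.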

The Takeaway

The least exciting part of a successful upgrade should be the production cutover.

If your maintenance window is full of surprises, those surprises usually didn't begin that weekend—they began weeks earlier when the decision was made to skip discovery.

Run the mock.

Find the ugly.

Document everything.

Refine the runbook.

Make the production upgrade boring.

Because the best infrastructure projects aren't remembered for dramatic recoveries.

They're forgotten because nothing unusual happened at all.

That's exactly how successful upgrades should feel.
