Peter Kim Frank

Posted on Apr 23, 2018

Explain Required Downtime Like I'm Five

#explainlikeimfive

Most services seem to update continuously without the need for any formal downtime.

However, sometimes I'll get notice that a service is going offline for as much as several hours. What sorts of upgrade / maintenance events typically require downtime, as opposed to a seamless change?

Latest comments (8)

Nancy Deschenes • Jul 11 '18

I'll do you one better. I'll explain required downtime like you're 3.

You know how you have to take a nap every day? Even when you're not tired? Mommy always insists on the nap. And most days, after you get up from your nap, the house is cleaner and dinner is under way? That's because those are things Mommy can't do when you're up and in the way, either because they would be too dangerous for you (handling bleach), or because she wouldn't be able to pay attention to you while doing them (ex: mowing the lawn).

So, when you take a nap, Mommy gets to do things she can't do when you're in the way.

Required downtime is like forcing a toddler to nap. You get to do things you wouldn't be able to do otherwise, or that would be too risky. It can be painful to arrange (getting all the stakeholders to agree), but usually, there's a very good reason. The system can focus its attention on only one thing (your upgrade), avoid risky conflicts (access to shared resources), take shortcuts and lock whole resources (ex: databases or database tables, instead of row-level locking), etc. By reserving some resources for exclusive use, you can avoid complex (slow) locking mechanisms.
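To make the locking point a bit more concrete, here is a rough Python sketch. Everything in it (the `Table` class, `update_online`, `update_during_downtime`) is a made-up toy, not how any real database works: it only contrasts taking a lock per row while the system stays live with taking one big lock when nobody else is around.

```python
import threading


class Table:
    """Toy in-memory 'table' (hypothetical, for illustration only)."""

    def __init__(self, rows):
        self.rows = dict(rows)
        # One lock per row, for use while other users are still active.
        self.row_locks = {key: threading.Lock() for key in self.rows}
        # One lock for the whole table, for use during a maintenance window.
        self.table_lock = threading.Lock()

    def update_online(self, fn):
        # System is live: lock and release each row in turn so that
        # other users are only ever blocked on a single row.
        for key in self.rows:
            with self.row_locks[key]:
                self.rows[key] = fn(self.rows[key])

    def update_during_downtime(self, fn):
        # System is offline: nobody else is touching the data, so one
        # coarse lock and a bulk rewrite are enough.
        with self.table_lock:
            self.rows = {key: fn(value) for key, value in self.rows.items()}


table = Table({"a": 1, "b": 2})
table.update_online(lambda v: v + 1)            # many small lock operations
table.update_during_downtime(lambda v: v * 10)  # one lock, one pass
```

Both calls end up with the same kind of result; the difference is how much bookkeeping is needed to get there safely while other users are awake.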

You can also have required downtime when you know the performance of the system will be affected to such a high level that it won't be usable.

So, when I encounter required downtime, I just take the hint and go for a nap :-)

Nathan Grass • Nov 13 '19

Came here looking for the explanation of a nap. Very good!

Peter Kim Frank • Jul 11 '18

Incredible, thank you for this.

Scott Tadman • Apr 25 '18

The military has very expensive mid-flight refueling systems, planes full of jet fuel that act as gas stations for other planes. This means planes can keep flying continuously without having to periodically land to refuel.

This is similar to how large companies can seemingly do almost anything without downtime: huge engineering teams, specialized tools, plus lots of practice, training and discipline.

Smaller companies can’t afford teams like that. In a sense they do what ordinary people do: pull into the gas station to refuel, clean the windows, and check oil and tire pressures. Doing all of that while driving safely would be a lot harder and outrageously expensive for the average person.

Even a little downtime makes it a lot easier and cheaper to get things done.

Ant The Developer • Apr 25 '18

Rather than listing those events, I'll explain how so many services manage with no downtime. I'd say that if you can afford the downtime, take it; but if you can't, because you run a service that people use 24/7, there are ways to avoid it.

For example, if you're deploying new code, it's very common to start a temporary instance of your application which your load-balancer can switch to, and after the code deploy, switch back.
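A minimal sketch of that switch, in Python. The `LoadBalancer` class, `make_app`, and `deploy` are hypothetical names invented for this illustration; a real load balancer is a separate piece of infrastructure, and this toy only shows the order of the steps.

```python
class LoadBalancer:
    """Toy load balancer that forwards every request to one active backend."""

    def __init__(self, backend):
        self.active = backend

    def switch_to(self, backend):
        # A single reference swap: each request hits either the old
        # backend or the new one, and there is never a moment with none.
        self.active = backend

    def handle(self, request):
        return self.active(request)


def make_app(version):
    """Return a hypothetical app instance that tags replies with its version."""
    def app(request):
        return f"{version}: {request}"
    return app


def deploy(lb, old_version, new_version):
    # 1. Start a temporary instance and point traffic at it.
    temporary = make_app(old_version)
    lb.switch_to(temporary)
    # 2. Update the main instance while the temporary one serves users.
    updated_main = make_app(new_version)
    # 3. Switch back; the temporary instance can now be shut down.
    lb.switch_to(updated_main)


lb = LoadBalancer(make_app("v1"))
before = lb.handle("ping")
deploy(lb, "v1", "v2")
after = lb.handle("ping")
```

Here `before` is `"v1: ping"` and `after` is `"v2: ping"`, and at no point during `deploy` is `lb.active` left without a backend.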

Database migrations are a little trickier. I don't know the exact approaches companies use; however, I can imagine having a read-only database that serves read data while another system queues up the write requests.
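Here is one way that idea could look, as a hedged Python sketch. `MigratingStore` and its methods are hypothetical, and this is a guess at the general shape rather than what any particular company does: reads keep being served from the existing data while writes are queued and replayed once the migration is done.

```python
from collections import deque


class MigratingStore:
    """Toy key-value store that queues writes while it is being migrated."""

    def __init__(self, data):
        self.data = dict(data)
        self.read_only = False
        self.pending = deque()  # writes received during the migration

    def read(self, key):
        # Reads are always served, possibly from slightly stale data.
        return self.data.get(key)

    def write(self, key, value):
        if self.read_only:
            self.pending.append((key, value))
            return "queued"
        self.data[key] = value
        return "applied"

    def begin_migration(self):
        self.read_only = True

    def finish_migration(self, migrated_data):
        # Swap in the migrated data, then replay queued writes in order.
        self.data = dict(migrated_data)
        while self.pending:
            key, value = self.pending.popleft()
            self.data[key] = value
        self.read_only = False


store = MigratingStore({"name": "old"})
store.begin_migration()
status = store.write("name", "new")       # accepted, but not applied yet
during = store.read("name")               # still the old value
store.finish_migration(dict(store.data))  # pretend the migration copied the data
final = store.read("name")                # the queued write has been replayed
```

The trade-off is that users can briefly read stale data (`during` is still `"old"`), which is often more acceptable than the service being offline.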

Even server migrations can be done without downtime by using the same method as for seamless code deployment: just switch from one server to the other while both are running.

Dave Jacoby • Apr 24 '18 • Edited

So, you want to make a sandwich before the cartoons start.

You need to get the peanut butter and jelly and the bread, and find a knife. You thought you had it all ready, but the silverware drawer is empty and you have to wait until the dishwasher is done.

Or a slice of bread slips out of your hand and lands jelly-side down on the kitchen floor and you have to throw it out and start over.

Or your brother asks for one too, but wants the other bread, the creamy peanut butter and the crusts cut off.

Or, you expect all of those things could happen and you want to be sure you can get a clean knife and clean up after yourself and make a second or third well before Itchy and Scratchy start fighting.

The hours-long maintenance windows I see in my life go to 1) interactive network-backed mobile games, where you want the servers unavailable while the games are uploaded and accepted by two separate app stores, or 2) big changes to large compute clusters where jobs can take weeks on fast, GPU-laden high-memory nodes, because science. With the latter, it's Puppet or CFEngine changing several thousands of machines, and because downtime is so rare, you often get "while we're down, we'll change out the switches/cooling/power".

In either case, I'm sure it involves minimally parallel tasks. You can't spread the jelly and wash the knife at the same time, for example.

Alyss 💜 • Apr 23 '18 • Edited

One of the common instances I've seen is a deployment model that doesn't support an active code version. You'd likely have to be working with software 15+ years old to encounter that model. During downtime, the new code commits are incorporated into production, and then production is brought back online.

If the upgrade requires a lot of human intervention, that could be another reason for required downtime.

Downtime for a co-located team is at night or over lunch depending on how quick or integrated the process is.

A great example (imo) is Blizzard (World of Warcraft) vs ArenaNet (Guild Wars). My assumption is that Guild Wars incorporated a rollover model before World of Warcraft did. This allowed them to use an existing game dialog to ask players to move to a new realm instance that was more populated or force them to move. Source

David Ojeda • Apr 23 '18

Off the top of my head, I would say a database migration. It can be done without downtime, but that usually means having a replica and incurring more costs for a short period of time. So, if limited resources are available, you can opt for a maintenance window with downtime.