Explain Required Downtime Like I'm Five

Most services seem to update continuously without the need for any formal downtime.

However, sometimes I'll get a notice that a service is going offline for as long as several hours. What sorts of upgrade / maintenance events typically require downtime, as opposed to a seamless change?

The military has very expensive mid-flight refueling systems, planes full of jet fuel that act as gas stations for other planes. This means planes can keep flying continuously without having to periodically land to refuel.

This is similar to how large companies can seemingly do almost anything without downtime: huge engineering teams, specialized tools, plus lots of practice, training and discipline.

Smaller companies can't afford teams like that. In a sense they do what ordinary people do: pull into the gas station to refuel, clean the windows, and check oil and tire pressures. Doing all of that while driving safely would be a lot harder and outrageously expensive for the average person.

Even a little downtime makes it a lot easier and cheaper to get things done.

One common case I've seen is a deployment model that can't update code while a version is actively running. You'd likely have to be working with software 15+ years old to encounter that model. During the downtime, the new commits are incorporated into production and the service is brought back online.

If the upgrade requires a lot of human intervention, that could be another reason for required downtime.

For a co-located team, downtime is usually scheduled at night or over lunch, depending on how quick or integrated the process is.

A great example (imo) is Blizzard (World of Warcraft) vs ArenaNet (Guild Wars). My assumption is that Guild Wars incorporated a rollover model before World of Warcraft did. This allowed them to use an existing game dialog to ask players to move to a new realm instance that was more populated or force them to move. Source

Off the top of my head, I would say a database migration. It can be done without downtime, but that usually means running a replica and incurring extra costs for a short period. So, if resources are limited, you can opt for a maintenance window with downtime instead.

So, you want to make a sandwich before the cartoons start.

You need to get the peanut butter and jelly and the bread, and find a knife. You thought you had it all ready, but the silverware drawer is empty and you have to wait until the dishwasher is done.

Or a slice of bread slips out of your hand and lands jelly-side down on the kitchen floor and you have to throw it out and start over.

Or your brother asks for one too, but wants the other bread, the creamy peanut butter and the crusts cut off.

Or, you expect all of those things could happen and you want to be sure you can get a clean knife and clean up after yourself and make a second or third well before Itchy and Scratchy start fighting.

The hours-long maintenance windows I see in my life go to 1) interactive network-backed mobile games, where you want the servers unavailable while the games are uploaded to and accepted by two separate app stores, or 2) big changes to large compute clusters where jobs can take weeks on fast, GPU-laden high-memory nodes, because science. With the latter, it's Puppet or CFengine changing several thousand machines, and because downtime is so rare, you often get "while we're down, we'll change out the switches/cooling/power".

In either case, I'm sure it involves minimally parallel tasks. You can't spread the jelly and wash the knife at the same time, for example.

Rather than listing those events, I'll explain how so many services manage without downtime. I'd say that if you can afford the downtime, take it. But if you can't, because you run a service that people use 24/7, there are ways to avoid it.

For example, if you're deploying new code, it's very common to start a temporary instance of your application which your load-balancer can switch to, and after the code deploy, switch back.
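
To make that concrete, here is a minimal sketch of the switch-over in Python. The `Instance` and `LoadBalancer` classes and the `deploy` function are hypothetical stand-ins invented for illustration, not a real load balancer's API; a real setup would also need health checks and connection draining.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """A hypothetical running copy of the application."""
    name: str
    version: str

    def handle(self, request: str) -> str:
        return f"{self.name} (v{self.version}) handled {request}"


class LoadBalancer:
    """Toy load balancer that sends all traffic to one active instance."""

    def __init__(self, active: Instance) -> None:
        self.active = active

    def switch_to(self, instance: Instance) -> None:
        self.active = instance

    def route(self, request: str) -> str:
        return self.active.handle(request)


def deploy(lb: LoadBalancer, main: Instance, new_version: str) -> list[str]:
    """Update main to new_version while a temporary instance serves traffic."""
    served = []
    # Start a temporary instance running the current code.
    temp = Instance(name="temp", version=main.version)
    lb.switch_to(temp)
    # Requests that arrive mid-deploy are answered by the temporary instance.
    served.append(lb.route("request during deploy"))
    # Stand-in for the actual code deploy on the main instance.
    main.version = new_version
    # Switch back once the main instance is running the new code.
    lb.switch_to(main)
    served.append(lb.route("request after deploy"))
    return served
```

From the user's side, every request gets an answer; only the instance answering it changes.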

Database migrations are a little trickier. I don't know the exact approaches companies use; however, I can imagine having a read-only database that serves read data while another system queues up the write requests.
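
As a rough sketch of that idea (and only a guess at how it might look, not a description of what any company actually does), the Python below keeps serving reads from the old data while writes are queued, then replays the queue once the migration finishes. The `MigratingStore` class and its method names are hypothetical.

```python
from collections import deque


class MigratingStore:
    """Toy key-value store that stays readable during a migration."""

    def __init__(self, data: dict) -> None:
        self.data = dict(data)
        self.pending = deque()  # writes received while migrating
        self.migrating = False

    def read(self, key):
        # Reads are always served, even mid-migration (possibly stale).
        return self.data.get(key)

    def write(self, key, value) -> None:
        if self.migrating:
            # Queue the write instead of applying it to the read-only data.
            self.pending.append((key, value))
        else:
            self.data[key] = value

    def start_migration(self) -> None:
        self.migrating = True

    def finish_migration(self, migrated: dict) -> None:
        # Swap in the migrated data, then replay queued writes in order.
        self.data = dict(migrated)
        while self.pending:
            key, value = self.pending.popleft()
            self.data[key] = value
        self.migrating = False
```

The trade-off is that readers may see slightly stale data until the queued writes are replayed.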

Even server migrations can be done without downtime by using the same method as seamless code deployment: switch from one server to the other while both are running.

Peter Kim Frank
Working on dev.to. Previously: on-demand tutoring and textbooks. Before that, ran/sold an online community and worked for the company that built Tinder. 😎