Discussion on: How do you update backend web services without downtime?

View post

Going to describe it as simply as I can.

Generally for availability users of your web service will hit it via a load balancer (like nginx) which then routes requests to a pool of instances of your application.

This means if one instance falls over, you have availability because the load balancer will just send the traffic to the instances that are working.

For that reason, when you deploy you can take down one instance, upgrade, then another, etc until they are all updated. During this you will always have at least one running.

In addition this lets your application "horizontally scale" very easily as it's trivial to add more instances of your application behind the load balancer.

There are more involved ways of doing this but they all mainly work on the premise of some kind of load balancer / router managing traffic so that there is always a running app available

Meghan (she/her) • May 31 '18

Thanks for the quick helpful response! :D

Albrecht Scheidig • Jun 1 '18

What would you recommend, if those instances share a database, and the database schema depends on the version of the application?
Meaning: when updating the first instance, it updates the database schema, making it incompatible with the running, outdated instances (so they tend to run into errors).

Is your recommendation only applicable to architectures without a central database?

Adam Bullmer • Jun 2 '18

If you're asking these questions, it may mean that you are going down a path that you might not need to. I understand the desire for perfectly maintained and groomed DB schemas. But at the end of the day, you've got to take into account business requirements while solving the problems in software.

Does the business have an SLA on downtime? No? Great throw up a maintenance page, run a migration, and move on with your life.

Yes? Maybe think about how you can migrate data and deploy new features in 2 or more deploys/migrations.

Do you have slow traffic hours? Maybe it is acceptable to have downtime at 3am, when your usage is low to none.

The downtime plans have an advantage of not engineering a flawless plan, which takes time, and frees you to continue developing. But these are only viable if your application would suffer during any sort of outrage.

Lastly, the option I don't hear people advocating for, is taking on technical debt. Mind you, debt is bad and I would recommend seeking the above alternatives first. In a pinch this is a solution.

Tech debt the process of making an informed concession on architecture/design/schema I'm the interest of time, and all parties of the business agree that you will be given time later to do it the right way. It's called debt because almost always is more total work to do later then now, but is less work to short cut now than the right way.

So for you, you might just deal with this in your software: a column that isn't mapped to the right table, or make a new table, or whatever outrage free solution you come up with. Then later run a migration and code deploy. And things could have changed at this time to your benefit, like more coders, different SLAs, different patterns, different technology, or a rewrite even. There is no hard and fast rule on how you deploy, only that some solutions fit your particular case better.

Albrecht Scheidig • Jun 2 '18

Adam, thanks for taking the time.
We sell a product. We have hundreds of installations worldwide and support updates from 15 year old versions as well as 15 h old versions. So, tech. debt approach is not feasible because it does not solve the problem fundamentally.
"Updates without downtimes" would definitely add value we could sell or use as USPs. Or fulfill stricter SLAs and what not. But there seams no silver bullet to add this into our existing legacy code base with a centralized database.

Adam Bullmer • Jun 3 '18

Yikes, 15 year old software? That sounds rough. If the service can't be interrupted, I've seen the staged release work well:

forwards compatible DB alter
Ship code to read/write from both places (gross, but it's going away)
Any data migration necessary (no alters allowed here)
Update code to stop looking in the old place
Confirm everything is working as planned and blank out the old data, or alters to clean up your schema.

Step 2 is the hardest because there's a lot of options of how you can handle this. I've personally had my software write to 2 different places, ran a migration to move the old data to the new place, and then updated the service again to only read/write from the new place.

Hopefully you're sourcing see good ideas from the communuty suggestions, these have all been good ideas if they solve your problem effectively.

I also can't stress enough the importance of DB backups, testing that your theory of deploys works at every step in a non production environment, and having your plan written down and reviewed by your peers. DB alters in prod are hard, and carry a lot of risk if something unexpected happens.

Timur Zurbaev • Jun 1 '18

If you need to release some critical DB schema updates, try to roll updates in several parts. For example, if you need to move a column from one table to another (drop column in first table, create column in second table), consider this scenario:

Add column to the second table & update your code to read/write from new column;
Release first part - now all of your production instances are not touching old column at all;
Remove column from the first table and deploy changes - no matter how many instances you're running, they won't produce errors.

Pert Soomann • Jun 1 '18

Actually very good point.

Code updates are usually very trivial, with PHP it could be just pulling changes from GIT repo, with node you probably have to re-build on each instance?

But with DBs, once you get decent amount of data in tables, changing the table config could take very long time to re-build.

There are few ways you can work around, like Timur explained, you could try to implement backwards compatible approach (ie new column defaults to NULL, so old code can still insert new entries to DB without necessarily falling over).

Another option is to have graceful maintenance mode, something we're using at my current place. When updating the real users will see maintenance screen instead of half updated code, nor do we have to worry about concurrent legacy v new code running, depending on instance they end up on.

I know it's technically "downtime", but when built into project from ground up, much easier than trying to achieve the same thing with networking and re-pointing servers etc, and it's not bad user experience, IMHO.

Albrecht Scheidig • Jun 1 '18

We do "maintenance page"-like updates here, too, but I dream of having smart updates without downtime / maintenance page. And as things turn out, this is not possible in my scenario: shared DB, lots of schema changes in every new release.
Timurs approach is interesting, but seems to add a lot of complexity and testing efforts.

Pert Soomann • Jun 1 '18 • Edited

I think it's OK to find a reasonable solution that doesn't annoy your userbase or break your dev-team, even if it's not dream no-noticeable-downtime :)

Abel Wang • Jun 1 '18

Another thing you can do to have no down time w/your db is version your schemes somehow and store this value in your db. Then in your code wrap new features using feature flags. Part of the flag logic is the whether the switch is on or off and part of the switch logic is what version is the db at. And based on those values your code would route to new or old code hitting new or old sql calls. It does add complexity and debt as you are now using feature flags and you will need to religiously clean up your flags or things can spiral out of control. However, that’s a small price to pay in terms of the benefits you get like the speed at which you can deploy new changes, not worrying about what order to deploy your micro services and dbs and ease of rolling features in and out. And, zero down time. Another added benefit of using flags is now you can do trunk based development which simplified things tremendously over complex branching schemes but that’s a whole nother topic.