Warren Parad for Authress Engineering Blog

Breaking up the monolith: Zero downtime migrations

It’s pretty common in monolith architectures to have to handle migrations. But they aren’t the only place: microservices also frequently need database migrations. Unlike monoliths, where zero downtime is usually impossible, microservices make it possible to perform migrations with no downtime whatsoever.

Opponents of microservices, you know who you are, might say “oh, now you need to learn to do this”. You don’t; you can keep your app that goes offline on the weekends and displays a prominent banner reading “our development team has been outsourced, so we can’t do this simple activity without impacting our users.”

The fundamental need

Even in the simplest cases, you might find yourself in a situation where a database migration is important. You’ve named a critical column poorly, you are trying to reduce your DB size, or, most probably, your primary index is wrong. It is unfortunate, but it does happen. The recommended first step is to do nothing. That’s right: first evaluate how much of a problem this really is. Sure, it has been working okay until now and it will continue to get worse, but how much worse? Are we talking about 0.001% of calls being problematic, or 100% of calls suffering a performance degradation of an extra 1ms?

Before performing any work, always evaluate the value of doing so. Just because migrations are easy doesn’t mean they should be done.

The setup of a migration

So you’ve decided that the migration should be done. No problem, time to make it happen. Migrations always work the same way, and they are easy once you know what to do.

(Caveat: This works reliably for database sizes up to a few terabytes. Larger than that, it will start to take real time and put real load on your migration infrastructure. In those cases you might think about a complementary approach to port the data from one location to another in addition to this pattern.)

Here is the pattern:

  1. Deploy a new replica database — During the migration you will have two versions of your DB running at all times, no exceptions. You cannot do this in one step; it has to happen over time. Migrations are always a trade-off: either take downtime and do it quickly, or take no downtime and do it slowly. Zero downtime is the slow and careful approach. One important part is having a field called lastUpdated that indicates the most recent write. If you don’t already have something like this on your objects, it’s valuable in itself, but now is the time it becomes required.
  2. Duplicate the write logic — Change the write path so that every write goes to both the current location and the new location. There are a bunch of ways to do this: the easiest is to duplicate the code; the hardest is to set up post-processing of your DB that reads from the old location and writes to the new one. (If you were using AWS DynamoDB, this would be a DynamoDB Stream.) Make sure that the lastUpdated property is being set, and use it to ensure that a write only lands if its lastUpdated is later than the value already stored. (A sketch of this dual-write logic follows the list.)
  3. Replicate existing data — At this point all new data is going to be in both places; any row/object/item that gets written will be exactly the same in each. The only problem is old data, which lives in the legacy system but not the new one. The next step is to copy all the existing data to your new database. Each of these copy writes must check that its lastUpdated is greater than the one already stored; this check ensures that a new write isn’t overwritten by old data. If you skip it, it will be impossible to guarantee a complete and correct migration. (See the backfill sketch after the list.)
  4. Validate the migration — Now is the time to check that the migration worked correctly. Read every row from both tables and validate that they are the same; you can ignore any row that was updated after the validation started running (because we know it will be the same in both). The check is something like: read from table 1 and check when it was last updated (if possible); if it needs to be validated, read from table 2; if that row’s lastUpdated is after the migration start time, ignore it; if the two rows are the same, ignore them; if they are different, investigate. (A validation sketch also follows the list.)
  5. Start using the new table — Change all the read operations to use the new table. Since the data in both tables is now consistent, it doesn’t matter which one we read from, so switch to reading from table 2.
  6. Cleanup — Change the “write to table 1” logic to “write to table 2”, and delete the “duplicate data from table 1 to table 2” logic. We don’t need the duplication anymore, and since we don’t need table 1 anymore, we can remove the logic that writes to it.
  7. Delete the legacy database — Now we can delete table 1; it isn’t necessary anymore, and that’s the end of the migration.
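
To make step 2 concrete, here is a minimal sketch of the dual-write logic. The legacyTable and newTable clients, and their get/put methods, are hypothetical stand-ins for whatever database client you actually use; the point is that every write stamps lastUpdated and the copy into the new location refuses to overwrite newer data.

// Hypothetical clients for the two locations; substitute your real ones.
// Each is assumed to expose get(id) and put(item), both returning promises.

async function saveRecord(record) {
  const item = { ...record, lastUpdated: new Date().toISOString() };

  // Write to the current (legacy) location exactly as before.
  await legacyTable.put(item);

  // Duplicate the write into the new location, guarded by lastUpdated.
  await writeIfNewer(newTable, item);
  return item;
}

// Only write when the incoming item is newer than what is already stored,
// so the backfill in step 3 can never overwrite fresh data.
async function writeIfNewer(table, item) {
  const existing = await table.get(item.id);
  if (existing && existing.lastUpdated >= item.lastUpdated) {
    return; // the stored copy is already as new or newer
  }
  await table.put(item);
}

In a real database you would express the lastUpdated comparison as a conditional or transactional write (for example a ConditionExpression in DynamoDB) rather than the read-then-write shown here, which is simplified for illustration.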
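
Step 3 then becomes a sweep over the legacy data that reuses the same guard, so old rows fill in the gaps without ever clobbering a write the duplicated logic has already made. The scan() iterator here is another hypothetical helper that pages through every row, not a specific library call.

// Copy every existing legacy row into the new table. writeIfNewer (from the
// sketch above) skips any row that already has a later lastUpdated.
async function backfillLegacyData() {
  for await (const item of legacyTable.scan()) {
    await writeIfNewer(newTable, item);
  }
}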
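
Step 4 can be a comparison pass along the same lines: rows touched after the validation started are skipped, and everything else must match exactly. Again a sketch on the same hypothetical clients:

async function validateMigration() {
  const validationStart = new Date().toISOString();
  const mismatches = [];

  for await (const legacyItem of legacyTable.scan()) {
    const newItem = await newTable.get(legacyItem.id);

    // Updated after validation started: the dual-write path covers it, skip.
    if (newItem && newItem.lastUpdated >= validationStart) {
      continue;
    }

    // Otherwise both copies must be identical.
    if (JSON.stringify(newItem) !== JSON.stringify(legacyItem)) {
      mismatches.push(legacyItem.id);
    }
  }

  return mismatches; // anything listed here needs investigation
}

The JSON.stringify comparison assumes stable key ordering; in practice a field-by-field comparison is safer, but the shape of the check is the same.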

There is, however, a race condition here that you need to be aware of. When we are in step 5 (start using the new table), we are:

  • Writing to table 1
  • Duplicating to table 2
  • Reading from table 1 — About to switch this to table 2

If there is a delay between writing to table 1 and duplicating to table 2, you could be reading stale data. In 99% of cases this doesn’t matter, but in some areas it could be a problem, so you need to make sure that the duplication logic has completed before doing the switch:

// write to table 1 (legacy) first
await write1();
// wait for the duplication into table 2 to complete
await duplicateTo2();
// only now is it safe to read from table 2
await read2();

That’s all you need to ensure; this works even in distributed systems and when using transactions.

