Introduction
Migrating millions of users' data without downtime or loss is a monumental challenge. At Kleinanzeigen, we recently tackled this problem when we migrated our users' data from a legacy platform to a new one with the help of Change Data Capture (Debezium), Apache Kafka, and a custom synchronization algorithm with built-in cycle detection. In the process, we ended up creating a data synchronization system that mimics many of the key properties of a Distributed Database. This blog post describes the business case that started the migration, our thought process for defining the architecture and technology choices, the trade-offs we made when agreeing on a solution, and the final architecture.
Background
Kleinanzeigen (KA for short) is a leading classifieds platform in Germany. It is also the number one platform in re-commerce, with 33.6 million unique visitors per month and 714 million visits in total. Suffice it to say, it is a powerhouse in Germany.
KA recently migrated the whole platform to a new system. For this purpose, we created the new system, ran it in parallel with our legacy system, migrated all users’ data to the new system, and then incrementally switched users to the new one. We kept both platforms operational simultaneously to make incremental switching possible, which helped us avoid a high-risk big-bang migration. Also, if something went wrong with the new platform, we could always revert users to the old system.
In addition to migrating the data, we also implemented significant user data transformations. The transformations were necessary because KA has been an extremely successful company, so naturally, we had accrued technical debt over the years. As part of this migration, we wanted to eliminate at least some of it.
Figuring Out How to Orchestrate the Migration
We first had to figure out how to orchestrate the whole migration process. A user at KA has profile data, ads, transactions, messages, etc. The migration process is complicated by the strict dependencies among all the data. Suppose we tried migrating a user’s ads before their profile data had been migrated: the ad migration would fail because the ads could not be attached to a user. It was important to carefully coordinate the migration sequence to ensure all data dependencies were handled.
After numerous planning sessions, workshops, and collaborative discussions involving multiple teams, we decided on the following migration strategy:
- We would first migrate the core data of a user: their unique user IDs, their authentication information, and their profile data.
- Once we had migrated core user data, the new system would start to recognize the user IDs, which were needed before any other data (e.g., ads, transactions) could be migrated.
- After core user data migration, we would inform all the dependent systems to start subsequent data migrations.
Our team — Team Red at KA — was responsible for migrating the core user data and bootstrapping the entire migration process, which we will focus on in this post.
First High-Level Architecture
We initially developed the following architecture to migrate core user data -
We used a reverse proxy to intercept all incoming user requests, with just enough business logic to determine where to forward a user’s traffic. Existing users would continue to interact with our legacy system. However, once a user’s data had been migrated and the user had been switched to the new platform, the proxy would forward the user’s traffic to the new system.
We also decided to use an asynchronous streaming architecture rather than attempting to dual-write to both systems synchronously from a single place/service. The reasons for this decision are listed below:
- Even though the dual-write might seem simpler at first because we would not need any additional system between the legacy and the new system, and it would also appear to enforce strong consistency, our experience showed that trying to synchronize multiple systems with synchronous calls is operationally challenging. What happens when the write to the legacy system succeeds but the write to the new system fails? Or vice versa? As we all know, the network is the most unreliable part of any distributed architecture, and failure due to a network issue is very likely because synchronous calls introduce temporal coupling between the client and the server. In such cases, we would need to introduce additional infrastructure (e.g., implement a Saga) to handle such failures, and ultimately, our system would become eventually consistent anyway.
- Choosing between the CAP theorem’s C (consistency) and A (availability) is always a business decision, and our business was fine with a slight syncing delay of up to a minute between the two systems.
- As discussed before, we also needed a place to implement the data transformation logic to address technical debts. The dual-write approach would have required us to either modify the legacy system, which was a complex undertaking, or write the transformation in the new system, which was being developed by a whole separate team. As a result, we did not have much control over it (Conway’s Law).
Backsyncing Users’ Data from New to Legacy System
The high-level diagram simplified many elements. For example, the legacy system was not a simple “box”, as the picture showed; the box was an abstraction of numerous services with complex interconnections between them. Both the legacy system and the new system were complex. Since complex systems inherently possess multifaceted failure modes, we wanted to ensure that in case of any failures in the new system, we could switch all users back to the legacy system. This approach would require us to propagate all migrated users’ updates from the new system to the legacy system.
Also, as mentioned before, we wanted to incrementally switch our users to the new platform to avoid a big-bang migration (we called these users transitioned users). After both systems were prepped and ready, we planned to transition a few test users. Following that, we would transition a tiny percentage of users to beta-test the platform before steadily ramping up. While these users were using the new system — creating and updating their data (e.g., creating ads) — we also needed to send these updates to the legacy system because that’s where most users would remain. If we did not send these changes there, the transitioned users’ ads would not receive much visibility.
With those decisions in mind, we focused on our next challenge: How could we capture users’ updates in the legacy system and stream them to the new system?
Capturing User Updates with Transactional Outbox
Transactional Outbox is a commonly used pattern for capturing updates to an entity and sending them to a remote system. We considered using it to capture changes to the user entity -
As the above diagram shows, we considered creating a new table — UserChanged — to store the change events required for an outbox implementation. An event publisher would poll this event table and send the events to a Kafka topic, which would then be read by the mapping service and sent to the new system. However, we realized that this approach would not scale. KA has millions of users, and their core data is updated millions of times per day. Thus, a polling job would not be able to keep up with the update rate, especially if we wanted to ensure speedy delivery of updates to the new system.
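For illustration, the kind of polling publisher we had in mind might have looked roughly like this (a sketch only; the table, column, and topic names are hypothetical):

```java
import java.util.List;
import java.util.Map;

import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import org.springframework.transaction.annotation.Transactional;

@Component
public class UserChangedOutboxPoller {

    private final JdbcTemplate jdbcTemplate;
    private final KafkaTemplate<String, String> kafkaTemplate;

    public UserChangedOutboxPoller(JdbcTemplate jdbcTemplate,
                                   KafkaTemplate<String, String> kafkaTemplate) {
        this.jdbcTemplate = jdbcTemplate;
        this.kafkaTemplate = kafkaTemplate;
    }

    // Poll the UserChanged outbox table and publish unsent events to Kafka.
    @Scheduled(fixedDelay = 1000)
    @Transactional
    public void publishPendingEvents() {
        List<Map<String, Object>> events = jdbcTemplate.queryForList(
                "SELECT id, user_id, payload FROM UserChanged WHERE published = false ORDER BY id LIMIT 500");
        for (Map<String, Object> event : events) {
            // Key by user id so that all updates of one user land in the same partition and keep their order.
            kafkaTemplate.send("user-changed", event.get("user_id").toString(), event.get("payload").toString());
            jdbcTemplate.update("UPDATE UserChanged SET published = true WHERE id = ?", event.get("id"));
        }
    }
}
```

With millions of core-data updates per day, a loop like this quickly becomes the bottleneck: either the polling interval is too long to deliver updates quickly, or the poller hammers the database.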
We then considered leveraging Spring’s Transactional Event Listeners to publish the events to Kafka in real time. However, we quickly realized that there were edge cases where we might send a user’s updates in the wrong order, leading to inconsistencies between the two systems.
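The variant based on transactional event listeners might have looked roughly like this (again a sketch; the event type and topic name are hypothetical). The AFTER_COMMIT callback is exactly where the guarantees get fuzzy: the event is published outside the transaction, so a crash can drop it, and two commits on different threads can reach Kafka in a different order than they hit the database.

```java
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;
import org.springframework.transaction.event.TransactionPhase;
import org.springframework.transaction.event.TransactionalEventListener;

@Component
public class UserChangedEventListener {

    private final KafkaTemplate<String, String> kafkaTemplate;

    public UserChangedEventListener(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    // Runs only after the surrounding database transaction has committed.
    @TransactionalEventListener(phase = TransactionPhase.AFTER_COMMIT)
    public void onUserChanged(UserChangedEvent event) {
        // If the service crashes between the commit and this callback, the event is lost.
        kafkaTemplate.send("user-changed", event.userId(), event.payload());
    }

    // Hypothetical event type, published via ApplicationEventPublisher by the code that modifies the user.
    public record UserChangedEvent(String userId, String payload) {}
}
```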
Another problem with the outbox implementation was that every team needed to do the same repetitive work — create change events for entities like ads and transactions, store them in relevant tables, and then publish them to Kafka. It would be great to eliminate some of the duplicate work here.
Finally, and most importantly, to implement the outbox pattern we would have had to scan through our legacy system and identify every place where we modified the entities. Recalling our earlier point that KA has been a successful company which grew at a fantastic speed and thus accumulated technical debt, we could not rule out the possibility that we might miss a few places where database updates were taking place and, therefore, fail to capture those changes. If we did not send those changes to the new system, we would create inconsistencies that would be hard to detect and fix. At this point, we asked ourselves if there was a better way. Could we intercept the changes directly from the database?
Enter Change Data Capture with Debezium
Debezium is an open-source distributed platform capable of streaming changes directly from a database. It hooks into a database’s transaction log (such as MySQL’s Binlog), captures every change performed on the database, and then publishes the changes to a Kafka topic. For the use case that we had for this migration, it sounded like an excellent fit. However, our team was also aware that a new technology introduces a lot of unknown failure modes. Even though Debezium was already a few years old by then, KA had never used it before, and we did not have anyone with prior experience setting it up and using it.
After thinking about it for a while and determining that we had a few innovation tokens for our team that we could afford to spend researching a new technology that might also help other teams, we decided to try Debezium.
Challenges with Debezium
Our first job was to figure out how to set it up within the existing KA infrastructure. At that time, KA used a MySQL cluster with a single leader and multiple follower instances. The cluster was set up across two data centres (DC for short), with a standby leader in each DC. Only one of these leaders would act as the current leader at any given time, while the others would act as followers. Unfortunately, we discovered several issues with our cluster setup, as described below.
The first issue that we encountered was due to the transaction log format. MySQL’s transaction log (also known as the Binlog) stores all changes applied to the database in one of three formats -
- Statement-based log
- Row-based log
- Mixed-mode log
In a statement-based log, MySQL stores the actual SQL statements executed on the server in the log. Then, during replication, the leader sends these statements to the followers so they can execute the same statements and thus get the same set of changes. However, this type of log format is no longer recommended as, for certain types of queries, it can lead to non-deterministic outcomes and, as a result, create inconsistencies between a leader and a follower.
In a row-based format, MySQL stores the actual change applied to the database when an SQL statement is executed. This format is recommended nowadays as it does not have non-determinism issues like the statement-based format.
The third form, mixed mode, uses statement-based logging by default but switches to row-based logging for statements that are unsafe to replicate in statement-based form.
For Debezium to retrieve all changes from a MySQL database, the Binlog must be in row-based format. However, all instances in our MySQL cluster at KA used statement-based formats. We learned that it would take considerable time and effort to change the format from statement- to row-based for all instances, which our user migration process could not afford then. In addition, it would also require a substantial amount of effort from our site reliability engineers, which was also difficult to arrange.
Another challenge we ran into was with the cluster itself. Debezium documentation recommended enabling Global Transaction Identifier (GTID for short) for a single-leader cluster setup like ours. Enabling GTID ensures that in the event of a leader failure, when a follower gets promoted to be the new leader, Debezium can continue reading the Binlog. In such a case, Debezium uses a GTID sequence check to position itself correctly in the new leader’s Binlog. Unfortunately, our cluster was not GTID-enabled.
To resolve these challenges, we came up with some pragmatic solutions:
- We provisioned two new MySQL instances, one in each DC, with their Binlog format set to row. These instances would act like any other follower and receive their changes from the leader in statement-based format, except that their own Binlog would be written in row-based format. Debezium would follow one of these instances to read the database changes.
- We added a reverse proxy that would route traffic to one of these two row-based MySQL instances at a time while the other one was on standby. We then created a Debezium connector config using the proxy as a database host, effectively abstracting the two-instance small-cluster setup.
- We decided that in case of a failure in the database the proxy was pointing to, we would switch traffic to the other instance. At this point, the Debezium connector would fail to start, and a simple restart of the connector would not be enough for it to resume working. To resume the connector, we would drop the internal config topic where Debezium stores the connector config information, recreate the topic, and restart the Debezium connector. At that point, the connector would start working again. The alternative would have been to allow Debezium to perform a full snapshot of the database, which we did not want to do because that would mean sending millions of updates to downstream systems in an uncontrolled manner, which could impact system stability.
- Debezium would miss the updates in our database during the failure and the switching. To capture those changes, we relied on the snapshotting capability — we would issue snapshot commands for each table, which would replay all changes that had happened since a specific time in the past.
- These decisions allowed us to run Debezium without making it cluster-aware, thus avoiding the need to change every instance’s log format or enable GTID.
- We also increased MySQL’s Binlog retention period to at least seven days so that Debezium would not miss any changes while the connector was not running.
We tested every decision we discussed above locally by simulating our production infrastructure. We created the MySQL cluster, set up Kafka topics, and then configured Debezium to read from the instances — all using docker-compose and a local producer/consumer application. We also tested the failover scenarios and the process of switching Debezium to follow a different instance and resuming it after re-creating the config topic. We tested other failure scenarios as well. All of this was done to help us discover as many unknowns as possible: we wanted to eliminate as many failures as we could from the final architecture (or at least monitor them with the right metrics and alerts and update our operational runbooks with appropriate actions to take).
After testing everything and gaining more confidence, we installed Debezium in our infrastructure with support from our site reliability engineers.
Tracking Migration Phases
After we figured out how to capture user changes, our next focus was tracking migration status for each user. Based on the requirements, we needed:
- To track the number of users ready to be migrated, either because their data has been transformed or they do not need any transformation at all.
- To track how many users’ data have been migrated at any point in time.
- To figure out a way to selectively transition our first small batch of test users to the new platform so that they can test it.
- A way for other sub-systems to get notified whenever a user’s data migration has been completed, so that they can also trigger the migration of their relevant data.
- A way to expose this phase information to our reverse proxy, as it decides whether the new system or the legacy system receives a user’s traffic.
When we examined these requirements closely, we realized we needed to implement a state machine to track different phases of the migration process. But we still needed to figure out the exact states and where we would store them.
We also realized that user migration is similar to how a person moves from one country to another. For example, when one of the Team Red members wanted to move to Germany, they went through the following phases:
- They decided that they would like to work in Germany.
- They applied for a job and got a job offer.
- Once they received the job offer, they applied for a work visa at the German Embassy.
- Once the visa was approved, they moved to Germany.
- After living in Germany for a few years, they decided to stay for a long time and got a Permanent Settlement Permit, which allowed them to live in Germany indefinitely.
Applying this analogy to our user migration process, we came up with the following migration states -
- IDENTIFIED: A user that needs to be migrated. They may not be migrated immediately because their data might require some transformation, or they may not be one of the users selected for migration. But we know they will eventually migrate to the new system.
- ELIGIBLE: The user has fulfilled all the requirements for migration — their data has been transformed, and/or they have been chosen for the migration.
- MIGRATION_REQUESTED: A migration request has been issued to the new system, which is now setting up the account. The user account may take a while to be created in the new system. From this point on, any changes to the user’s core data in the legacy system will be propagated to the new system.
- MIGRATED: The new system has confirmed that the user account has been created.
- TRANSITIONED: The user has transitioned to the new system, and all user updates are now happening on the new system.
The following picture displays the relationship between the user migration states and the migration phases of a person -
We then started thinking about where to put this state information. Storing it with the user data would not make sense, as the two were not related, and the state was only needed during the migration. Also, once the migration was over, we would remove it once and for all. We applied a similar migration analogy to find a solution and came up with a new type - Emigrant. An emigrant is a person who has left their home country and moved to another. Since KA users are also leaving their old platform for a new one, they are all emigrants from the legacy platform’s perspective. We then created an emigrant entity in our legacy system and stored this state there.
We then mapped our entire migration process as transitions in this emigrant state machine. The following diagram shows all possible transitions -
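For readers who prefer code to diagrams, here is a rough sketch of how these states and transitions could be captured in Java. It follows our textual description rather than the diagram, and the exact modelling in our service differed in the details. (The MIGRATION_FAILED state shows up a little later in the text, when we cover failed migration requests.)

```java
import java.util.Map;
import java.util.Set;

public enum EmigrantState {
    IDENTIFIED, ELIGIBLE, MIGRATION_REQUESTED, MIGRATION_FAILED, MIGRATED, TRANSITIONED;

    // Allowed transitions, as described in the text.
    private static final Map<EmigrantState, Set<EmigrantState>> ALLOWED = Map.of(
            IDENTIFIED, Set.of(ELIGIBLE),
            ELIGIBLE, Set.of(MIGRATION_REQUESTED, MIGRATION_FAILED),
            MIGRATION_FAILED, Set.of(ELIGIBLE),        // retrying simply resets the state to ELIGIBLE
            MIGRATION_REQUESTED, Set.of(MIGRATED),
            MIGRATED, Set.of(TRANSITIONED),
            TRANSITIONED, Set.of()
    );

    public boolean canTransitionTo(EmigrantState next) {
        return ALLOWED.get(this).contains(next);
    }
}
```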
We also decided to use Debezium to capture all changes happening to an emigrant. This decision helped us organize the migration process as a series of small and consistent updates, as we will see shortly.
This is what the architecture looked like after plugging in the emigrant concept -
Migrating Emigrants
The following diagram shows the logic that we used to create emigrants -
As the diagram shows, Debezium sent all user updates to the syncing service via Kafka. The Kafka listener then checked if an emigrant existed in the legacy database and, if not, created one. As discussed earlier, the emigrant’s initial state would be IDENTIFIED.
We had a data transformation logic check within the same service to verify if an emigrant’s data needed to be transformed before starting their migration. Once verified, that check changed the emigrant’s state from IDENTIFIED to ELIGIBLE, and the change was committed to the database. Debezium would then pick up the change and publish it to the respective topic (emigrant). The listener, which was implemented within the same syncing service, would pick up this state change. Since Debezium CDC payloads contain two fields that include the state of the database record before the change and the state of the record after applying the change, the listener would then be able to determine that a state transition had taken place where the old state was IDENTIFIED and the new state was ELIGIBLE (in fact, it checked if the previous state was anything other than ELIGIBLE, which allowed us to reuse this flow for another case, as we will see in a bit). It would then treat it as a signal to start the migration and send an HTTP request to the new system. If the request was successful, it would change the emigrant state to MIGRATION_REQUESTED, and in case it failed, the migration state would be changed to MIGRATION_FAILED. It would then commit the change.
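A condensed sketch of that listener logic is shown below. The repository, client, endpoint, and payload types are hypothetical, and the Debezium payload handling is simplified to its before/after fields, but the shape of the decision is the one described above:

```java
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;
import org.springframework.transaction.annotation.Transactional;
import org.springframework.web.client.RestTemplate;

@Component
public class EmigrantChangeListener {

    private final EmigrantRepository emigrantRepository; // hypothetical JPA repository
    private final RestTemplate newSystemClient;          // HTTP client towards the new system

    public EmigrantChangeListener(EmigrantRepository emigrantRepository, RestTemplate newSystemClient) {
        this.emigrantRepository = emigrantRepository;
        this.newSystemClient = newSystemClient;
    }

    @KafkaListener(topics = "emigrant")
    @Transactional
    public void onEmigrantChanged(DebeziumEnvelope<EmigrantSnapshot> envelope) {
        EmigrantSnapshot before = envelope.before(); // record state before the change
        EmigrantSnapshot after = envelope.after();   // record state after the change

        // Start (or restart) a migration whenever the new state is ELIGIBLE and the old state was anything else.
        if (after != null && after.state() == EmigrantState.ELIGIBLE
                && (before == null || before.state() != EmigrantState.ELIGIBLE)) {
            Emigrant emigrant = emigrantRepository.findByUserId(after.userId()).orElseThrow();
            try {
                // Hypothetical endpoint; account creation on the new system is idempotent.
                newSystemClient.postForEntity("http://new-system/users/{id}/migrate", null, Void.class, after.userId());
                emigrant.setState(EmigrantState.MIGRATION_REQUESTED);
            } catch (Exception e) {
                emigrant.setState(EmigrantState.MIGRATION_FAILED);
            }
            // The state change commits with this transaction; Debezium picks it up and
            // publishes it to the emigrant topic again, driving the next step of the flow.
        }
    }

    public record DebeziumEnvelope<T>(T before, T after) {}
    public record EmigrantSnapshot(long userId, EmigrantState state) {}
}
```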
To retry failed migrations, we had a job to replay them by simply changing the state back to ELIGIBLE, and the above flow would start executing again. Since network failures are common when making HTTP requests, this design allowed us to handle transient network failures easily. We also ensured the account creation process was idempotent to avoid duplicate accounts. Also, when an emigrant moved to the MIGRATION_REQUESTED state, the syncing service started syncing every change to the respective core user data. The data changes would also continue syncing when emigrants moved to the MIGRATED state.
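The replay job itself could be as small as this (hypothetical repository method; the interesting part is that it only flips the state back and lets the CDC-driven flow above do the rest):

```java
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import org.springframework.transaction.annotation.Transactional;

@Component
public class FailedMigrationReplayJob {

    private final EmigrantRepository emigrantRepository; // hypothetical JPA repository

    public FailedMigrationReplayJob(EmigrantRepository emigrantRepository) {
        this.emigrantRepository = emigrantRepository;
    }

    @Scheduled(fixedDelay = 60_000)
    @Transactional
    public void replayFailedMigrations() {
        // Setting the state back to ELIGIBLE is enough: Debezium captures the change,
        // the emigrant listener sees "new state ELIGIBLE, old state not ELIGIBLE",
        // and the migration request is sent again.
        emigrantRepository.findByState(EmigrantState.MIGRATION_FAILED)
                .forEach(emigrant -> emigrant.setState(EmigrantState.ELIGIBLE));
    }
}
```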
Once the new system successfully created the account, an HTTP endpoint in the mapping service was called to confirm the migration. The controller in the mapping service would then change the emigrant state to MIGRATED and commit it. This state change would then be picked up by the emigrant listener, which, like before, would detect that a state change had taken place and send a confirmation message to a topic called user-migrated, thus notifying all downstream systems that an emigrant had just migrated.
Deleting Emigrants
The following diagram shows the deletion logic used to delete emigrants -
The deletion process would start whenever we received a user deletion event via Debezium. Even if an emigrant did not exist in the database, we sent a delete request to the new system. This would ensure we never leave a deleted user’s data in the new system, even if some unforeseen inconsistency caused missing emigrants in the database. We took this extra, strictly redundant (but not too costly) step to fully comply with GDPR. Since the delete operation on the new system was idempotent, it did not cause any issues.
However, if an emigrant existed for the user, then after sending the delete request to the new system, the process would delete the emigrant. Then, rather than publishing a tombstone to the user-migrated topic immediately, we relied on Debezium to capture the emigrant’s deletion and then published the deletion confirmation to the user-migrated topic. This allowed us to be more resilient in the face of failures. Instead of ensuring that three different operations — deleting the emigrant, sending the delete to the new system, and publishing the tombstone to the user-migrated topic — succeeded one after another when a user was deleted, we needed to focus on only two (deleting the emigrant and sending the delete to the new system). But most importantly, it was symmetrical to how we published migration events to this topic (after an emigrant was migrated) and conceptually clearer — we emitted tombstones only when the emigrant was deleted.
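In code, the deletion flow might look roughly as follows (hypothetical names and endpoint; in a Debezium delete event the after field is null and before carries the last state of the row):

```java
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;
import org.springframework.transaction.annotation.Transactional;
import org.springframework.web.client.RestTemplate;

@Component
public class UserDeletionListener {

    private final EmigrantRepository emigrantRepository; // hypothetical JPA repository
    private final RestTemplate newSystemClient;

    public UserDeletionListener(EmigrantRepository emigrantRepository, RestTemplate newSystemClient) {
        this.emigrantRepository = emigrantRepository;
        this.newSystemClient = newSystemClient;
    }

    @KafkaListener(topics = "user")
    @Transactional
    public void onUserChanged(DebeziumEnvelope<UserSnapshot> envelope) {
        if (envelope.after() != null) {
            return; // not a deletion
        }
        long userId = envelope.before().userId();

        // Always tell the new system to delete, even if no emigrant exists (GDPR safety net).
        // The delete on the new system is idempotent, so repeated calls are harmless.
        newSystemClient.delete("http://new-system/users/{id}", userId);

        // If an emigrant exists, delete it. Debezium captures that deletion, and only then does
        // the emigrant listener publish the tombstone to the user-migrated topic.
        emigrantRepository.findByUserId(userId).ifPresent(emigrantRepository::delete);
    }

    public record DebeziumEnvelope<T>(T before, T after) {}
    public record UserSnapshot(long userId) {}
}
```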
Full Forward-Sync Architecture
Connecting all the processes described above, this is what the complete forward-sync architecture looks like -
A few points about the architecture:
- Debezium’s ability to capture a record’s state before and after a committed transaction was tremendously helpful. It allowed the emigrant listener to accurately determine what changed in a particular transaction. This, in turn, allowed us to (re)start a migration whenever the new state was ELIGIBLE and the old state was anything but. As a result, we could reuse the same logic when replaying a failed migration.
- Because of Debezium, we could also break down the migration process into smaller atomic chunks that revolved around committed state changes of emigrants. For example, upon receiving the account confirmation request, the controller in the mapping service changed the emigrant state to MIGRATED (sketched after this list). It did not need to publish a confirmation message to the user-migrated topic at the same time. As a result, we did not need to deal with corner cases such as the state change being successful but the message publication failing. If the state change failed, the controller returned a 5xx to the client, which would retry. If the publication of the migration confirmation to the user-migrated topic failed, the emigrant listener would keep retrying until it was successful. We collected metrics to monitor every topic’s lag and triggered alerts if it exceeded a certain threshold. However, for some use cases, we also implemented a dead letter queue using a database table where we would push failing records after trying to process them a certain number of times.
- In addition to publishing confirmation messages to the user-migrated topic, we also exposed an HTTP endpoint for services that relied on polling/querying to determine an emigrant’s migration state.
- We did not create separate microservices for different operations. Instead, we kept them all within the same service. We used modular monolith practices and created cohesive modules that properly separated concerns within the same service. This decision allowed us to leverage the ACID properties of MySQL transactions to build our synchronization algorithm, as we will see in a bit. Also, we avoided a lot of overhead that typically occurs with a microservice-oriented architecture.
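Here is the confirmation endpoint sketched from the bullet point above (hypothetical path and repository; note that it only changes the state and relies on the CDC flow for publishing):

```java
import org.springframework.http.ResponseEntity;
import org.springframework.transaction.annotation.Transactional;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class MigrationConfirmationController {

    private final EmigrantRepository emigrantRepository; // hypothetical JPA repository

    public MigrationConfirmationController(EmigrantRepository emigrantRepository) {
        this.emigrantRepository = emigrantRepository;
    }

    @PostMapping("/emigrants/{userId}/migration-confirmation")
    @Transactional
    public ResponseEntity<Void> confirmMigration(@PathVariable long userId) {
        Emigrant emigrant = emigrantRepository.findByUserId(userId).orElseThrow();
        // Only the state change is committed here. The confirmation message on the
        // user-migrated topic is published later, when the emigrant listener receives
        // this change via Debezium. If the commit fails, the caller gets a 5xx and retries.
        emigrant.setState(EmigrantState.MIGRATED);
        return ResponseEntity.ok().build();
    }
}
```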
Transitioning Emigrants
Next, it was time to figure out how to transition emigrants to the new platform. A different team was working on a separate service that would decide when to transition an emigrant. Once decided, the transition process would start by sending a command to the mapping service. The following diagram shows the modified architecture after adding these components -
After receiving the transition request, the mapping service would change the emigrant’s state to TRANSITIONED. From then on, the reverse proxy would forward this user’s traffic to the new platform.
Backsyncing Transitioned Emigrants’ Data
We discussed the need to backsync a transitioned emigrant’s updates with the legacy platform. To do that, we added another HTTP endpoint in our mapping service. When a transitioned emigrant’s data got updated in the new system, it would send the update to the mapping service via that endpoint. The mapping service would then update the emigrant’s data in the legacy database. The following diagram shows the additional components needed for this job -
Allowing Bi-Directional Updates
Until now, the architecture implementation has made one implicit assumption — a user’s data would only be updated on one platform. Before a user transitioned, all their updates occurred on the legacy platform. Once they transitioned, all their updates happened on the new platform. This assumption helped us avoid scenarios like an infinite update loop between the two systems. Let us explain how.
Suppose our system allows user data updates on both platforms simultaneously, and a transitioned user’s data has just been updated on the legacy platform. This update will arrive at the mapping service via Debezium. The mapping service will forward the update to the new system. Seeing that this is a transitioned user, the new system will update its copy and send the update back to the mapping service via an HTTP call. Seeing that this is a transitioned user, the mapping service will update it in the database again, which will then be captured via Debezium and sent to the mapping service. Thus, an update loop would have been established.
One might assume that the later updates in this loop would not actually change the user’s data, and since the mapping service used JPA/Hibernate, the update might not even trigger any actual SQL updates. However, that assumption did not hold, because the mapping service relied on the new system for a transitioned user’s data modification time, and the new system always set the modification time for transitioned users to the current timestamp. As a result, the modification time would keep changing, resulting in actual SQL updates.
We initially considered restricting a user’s updates to a single platform at a time to avoid this loop. However, we realized that this restriction was not realistic. A transitioned user’s traffic would only be forwarded to the new system when the reverse proxy in front of our infrastructure identified the user as a transitioned user. To do that, it would at least need the user ID, and that user ID would only be available if the user had logged in. But what if the user had not logged in, yet their data still needed to be updated? This could happen when a user reset their password. In such a scenario, the password would be updated in the legacy system, which would then need to be synchronized with the new system. Another use case was a user account being deemed fraudulent. For such cases, even though we had an alternative moderation tool available in the new system, we still wanted to keep our old tool available in case of an emergency, or in case our customer support agents forgot to check whether the user had transitioned. Since providing a safe platform for our users is one of KA’s highest priorities, we wanted to ensure that operations like blocking a fraudulent user account could still be done on the legacy system and then synced with the new system. But to do that, we needed to figure out a way to break the update loop.
One proposed solution to break the loop was introducing a new field to track which platform an update originated from. The field could be named “update_source”. When an update took place on the legacy system, it would have “LEGACY” as the value; for new-system updates, it would contain “NEW”. But to use that field, we needed to find all the places where user updates were taking place on both the legacy and the new platform, a task that we had already judged too daunting and error-prone.
Another idea was to use the modification time to determine if an update was stale. Setting aside the fact that the new system always used the current time as the modification time (so it was constantly changing), using physical time to determine whether an update is fresh in a distributed system has other issues, too. For example, protocols like the Network Time Protocol that keep machine clocks in sync always have room for error, which can lead to clock skews between two machines in the range of a few milliseconds to seconds. Our migration system handled migrations for millions of users and synced millions of updates daily, so even the slightest deviation could result in thousands of updates looping endlessly, which would be hard to detect and stop.
Logical clocks are a popular alternative to physical clocks in distributed systems. An example of such a clock would be the partition offset assigned to each record in a Kafka topic. Consensus algorithms like Raft and Paxos also rely on logical clocks to determine which updates are recent. But where can we find a logical clock in our system?
While trying to figure out a solution, we noticed an “interesting” field in our user entity. As mentioned, we used JPA/Hibernate as our ORM to handle data persistence. With any JPA entity, it has become standard practice to include a @Version field, which helps JPA implement optimistic locking when updating the entity in the database.
When a numeric type like Long is used as the @Version field, due to the way Hibernate implements optimistic locking, its values become -
- Monotonically increasing: every successful update increments the value; it never decreases.
- Atomic: the version is incremented and persisted together with the update, or the whole update is rolled back.
- Unique for each successful update of an individual user: every committed update of a user is guaranteed a new version value.
We realized that these properties make the @Version field an ideal logical clock for our migration system. We therefore used it to develop our synchronization algorithm, which determines which updates must be propagated between the two systems.
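Put together, the two entities look roughly like this (a simplified sketch assuming a Jakarta Persistence setup; field names are illustrative and accessors are omitted):

```java
import jakarta.persistence.Entity;
import jakarta.persistence.EnumType;
import jakarta.persistence.Enumerated;
import jakarta.persistence.Id;
import jakarta.persistence.Version;

@Entity
public class User {

    @Id
    private Long id;

    // Hibernate increments this on every successful update and uses it for optimistic locking,
    // which is what makes it monotonic, atomic, and unique per committed update of a user.
    @Version
    private Long version;

    // ... profile and authentication fields omitted
}

@Entity
class Emigrant {

    @Id
    private Long userId;

    @Enumerated(EnumType.STRING)
    private EmigrantState state;

    // The latest user version the mapping service has already seen and synced.
    private Long userVersion;

    // getters and setters omitted
}
```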
Custom Synchronization Algorithm to Break Update Loop
We decided to store the `version` value of the user entity with the emigrant entity in a field called `user_version`. The mapping service kept updating this value whenever it received a user update via Debezium that had a `version` greater than `user_version`. However, whenever the service received an update whose `version` was less than or equal to `user_version`, it identified the update as one it had already seen.

For transitioned emigrants, whenever their data changed in the new system, the mapping service would start updating it in the legacy system by starting a transaction in MySQL. Once the user data had been updated, it would update the `user_version` to the latest `version`, in the same transaction. It would then commit.

This change would then arrive at the mapping service via Debezium. Since this update’s `version` value would be the same as `user_version`, the service would deduce that this was a user update it had already seen before and would ignore the change.
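The comparison itself is only a few lines; the important detail is that during backsync the user update and the `user_version` update are committed in the same MySQL transaction. A sketch (hypothetical repositories, client wrapper, and payload shape):

```java
import jakarta.persistence.EntityManager;
import jakarta.persistence.PersistenceContext;

import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;
import org.springframework.transaction.annotation.Transactional;

@Component
public class UserSyncService {

    private final EmigrantRepository emigrantRepository; // hypothetical JPA repositories
    private final UserRepository userRepository;
    private final NewSystemClient newSystemClient;        // hypothetical HTTP client wrapper

    @PersistenceContext
    private EntityManager entityManager;

    public UserSyncService(EmigrantRepository emigrantRepository,
                           UserRepository userRepository,
                           NewSystemClient newSystemClient) {
        this.emigrantRepository = emigrantRepository;
        this.userRepository = userRepository;
        this.newSystemClient = newSystemClient;
    }

    // Forward sync: a user change captured by Debezium from the legacy database
    // (payload reduced to the "after" state of the row for brevity).
    @KafkaListener(topics = "user")
    @Transactional
    public void onUserChanged(UserSnapshot update) {
        Emigrant emigrant = emigrantRepository.findByUserId(update.userId()).orElseThrow();
        Long alreadySeen = emigrant.getUserVersion();
        if (alreadySeen != null && update.version() <= alreadySeen) {
            return; // an update we have already seen (for example, one we wrote ourselves) -> ignore it
        }
        emigrant.setUserVersion(update.version());
        newSystemClient.syncUser(update);
    }

    // Back sync: an update for a transitioned emigrant arriving from the new system over HTTP.
    @Transactional
    public void applyBacksync(long userId, UserSnapshot updateFromNewSystem) {
        User user = userRepository.findById(userId).orElseThrow();
        user.applyChanges(updateFromNewSystem); // hypothetical mapping of the incoming fields
        entityManager.flush();                  // Hibernate increments the @Version field here
        Emigrant emigrant = emigrantRepository.findByUserId(userId).orElseThrow();
        // Recording the new version in the same transaction guarantees that the Debezium event
        // produced by this commit is recognised above as already seen, which breaks the loop.
        emigrant.setUserVersion(user.getVersion());
    }

    public record UserSnapshot(long userId, long version) {}
}
```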
Let’s see what happens when an update, such as a password reset, originates in the legacy system.
Suppose that a user had just reset their password. Then -
- Once the password had been reset, the update would arrive via Debezium at the mapping service. Suppose that the `version` value in this update is 8.
- Since the user update had just occurred and the mapping service had not seen it before, the `user_version` in emigrant must have a value of less than 8 (considering the property of this field/logical clock we discussed before). Let’s assume the value in the `user_version` field was 7.
- Since this is a user update the mapping service had not seen before, it would update the `user_version` in emigrant to 8 and sync this update with the new system.
- The new system would update its copy, and seeing that this was a transitioned user, it would send the update back to the mapping service via HTTP call.
- The mapping service would then start updating the user data again. It would begin another transaction in MySQL, update the user data, which would cause the `version` value in the user entity to increment to 9, store this updated value in the `user_version` field in the emigrant entity, and commit the transaction.
- This user update would then again arrive at the mapping service via Debezium. But this time, the mapping service would see that the `user_version` in emigrant was already 9. Hence, it would identify it as an already-seen/processed update and ignore it, breaking the loop.
We tested the flow with a few transitioned users and found that the algorithm worked as expected. We also noticed one interesting fact — sometimes our mapping service would receive user updates where the `version` field had a value of, let’s say, 5, while we had 7 stored in `user_version` in emigrant. It would then receive the update with version 6 and finally with version 7. We attributed it to possible network slowness that caused Debezium and/or Kafka brokers to deliver updates slowly to one topic while other updates had already occurred. These edge cases were fascinating because they once again demonstrated that in a distributed system there is no guaranteed order of execution unless it is explicitly enforced.
We only allowed password reset and account-blocking operations to be synced from the legacy system. We did not provide a way to sync any other updates (e.g., a name change), which helped us avoid many edge cases that were almost guaranteed to happen.
A Data-Sync System Exhibiting Distributed Database Properties
Based on the migration architecture we have seen, as long as a user was logged in and our reverse proxy could determine their migration status, it sent all their traffic to the new system. For all intents and purposes, the legacy system would appear as “unreachable” to the reverse proxy for this particular user. However, our migration process kept the legacy system up to date, similar to how a primary database instance would keep a standby-primary/replica instance up to date so that it could take over if something went wrong.
However, when users were no longer logged in and tried to reset their passwords, the reverse proxy would send all their updates to the legacy system (which was up to date due to backsync). From the reverse proxy’s point of view, it was as if a network partition had made the new system unreachable. After updating its own password copy, the legacy system would sync the update with the new system, ensuring it was up to date, but this would be invisible to the proxy.
Once the user reset their password and logged in again, the new system became the primary again. It was able to fulfil its role as the primary because the legacy system had kept it up to date in the background. From the reverse proxy’s point of view, it was as if the network partition had just recovered and the old primary was back as the primary, containing all the updates that occurred during the partition.
This example shows that the system we built to store our users’ data (shown in the above picture with a yellow dotted line) mimics several key behaviors of a Distributed Database:
- High availability under “partition”: even when the proxy can’t determine a user’s migration status, both reads and writes always succeed.
- Automatic fail-over and recovery: even if a transitioned user’s password reset ends up in the legacy system, it still gets synced to the new system.
- AP system: from the CAP Theorem perspective, it behaves like an AP system — always available and eventually consistent.
Final Thoughts
The entire migration process, from start to finish, had many technical and organizational obstacles. Ultimately, we overcame all obstacles and delivered a robust working solution to the business that did the job perfectly. Our research with Debezium also paid dividends — it helped other teams stream changes for their data migration and proved helpful in different use cases.
Our architecture has significantly improved since then. We now have a GTID-enabled MySQL cluster with row-based replication, simplifying the overall setup.
If you are interested in solving complex challenges like what we described here, we are hiring!
Acknowledgements
Many thanks to the following KA people who reviewed a draft of this post and provided helpful feedback: André Charton, Christiane Lemke, Donato Emma, Joeran Riess, Konrad Campowsky, Louisa Bieler, Max Tritschler, Pierre Du Bois, and Valeriia Platonova.
Special thanks to Sophie Asmus for assisting with the publication.
Next, kudos to our SREs/DevEx engineers Claudiu-Florin Vasadi, Peter Mucha, Soohyun Kim, Stephen Day, and Wolfgang Braun for supporting us with all things infrastructure.
Then, many thanks to Matt Goodwin and Robert Brodersen for helping us manage the initiative.
Finally, a huge thank you to the rest of the Team Red — Christiane Lemke, Franziska Schumann, Maria Sanchez Sierra, Michael Schwalbe, Niklas Lönn, and Valeriia Platonova — whose positive attitudes and hard work turned this challenging migration into a success.