API killed the transaction star

#api #database

Building applications from various services is common today. This can mean using a microservices architecture, which allows teams to develop and deploy features quickly. It may also involve integrating legacy systems with new ones or connecting external services to in-house systems.

However, distributed systems come with challenges. One major issue is that we can't depend on transactions like we did with centralized applications. Transactions provide consistency. They ensure that changes across multiple database tables either succeed together or fail together, keeping data consistent.

In distributed systems, achieving this level of consistency is difficult. Data may be stored across multiple databases linked by remote APIs. We can, however, aim for "eventual consistency." This means we accept temporary inconsistencies but expect them to resolve automatically.

If you need to combine a transactional resource (like a database) with one non-transactional resource (like a remote API), you might try these steps:

Start a transaction.
Apply changes to the database.
Call the remote API.
Commit the transaction if the API call is successful; otherwise, roll back.

This approach has drawbacks. Step 4 may fail, or the service might crash (e.g., due to Kubernetes) before reaching it. Additionally, the time spent calling the remote API adds to the transaction time. Long transactions lock objects for a long time, leading to performance issues or deadlocks.

If you have multiple non-transactional resources to call, this simple method won't work.

A more complex solution is using compensating transactions. Here, you commit transactions for the transactional resources right away (or don't use transactions at all). If a non-transactional operation fails, you must reverse the changes made so far.

Picture booking a summer holiday deal but needing manager approval. If your manager declines, you'll need to cancel your booking. This cancellation is your compensating transaction, but the holiday has to be cancellable.

If you're updating a large, complex resource, it can be tough to identify what parts you're changing and how to reverse those changes. If another update occurs simultaneously, conflicts can arise. For example, if you are raising a price from 10 to 15, but meanwhile another update sets it to 12, should you revert to 10 or 12?

The most resilient distributed systems tend to be event-based. Each state change in a service is an irreversible fact, communicated through an event message. Consumers of these events must respond accordingly to maintain consistency.

To ensure the data in the service database aligns with the events sent, we often use the Outbox pattern. First, you make changes to the service database within a transaction and then insert a record in an "outbox" table for the event. Transactions do remain useful in distributed architectures.

The outbox table is checked regularly, and messages are sent via messaging infrastructure (like Kafka). If a service restarts, the same event might be sent twice.

Thus, event consumers must be idempotent. If they receive a duplicate event, the outcome should be the same as if they received it just once.

What has been your experience with managing consistency in distributed systems?