Mustafa ERBAY

Posted on May 17 • Originally published at mustafaerbay.com.tr

Consistency in Distributed Systems: The Place of Eventual Consistency

#technology #distributedsystems #eventualconsistency #captheorem

One of the most fundamental issues I've encountered when building distributed systems has always been data consistency. When a piece of data is stored across different servers, in different geographies, or in different services, ensuring these pieces are identical at all times is often impossible. This is precisely where the concept of "Eventual Consistency" comes into play, offering practical solutions.

I've used this model in many projects to date, especially in architectures requiring high scalability and continuous availability. While immediate consistency was critical in a bank's internal platform, for financial calculators in a side product I developed, it was sufficient for the data to be correct even with a few seconds of delay. These decisions are entirely shaped by the nature of the project and its business requirements.

Why is Consistency in Distributed Systems Difficult? In the Shadow of the CAP Theorem

By their very nature, distributed systems require data to be stored in multiple locations, which is indispensable for high availability and performance. However, this situation brings significant challenges related to consistency. In my experience, the CAP Theorem lies at the heart of these challenges: the reality that we can only choose two out of three properties—Consistency, Availability, and Partition Tolerance—at any given time. Network partitions, meaning network outages, are an inevitable reality in distributed systems. Internet connections can drop, server inter-cables can break, or a switch can fail.

When faced with such a partition scenario, we either sacrifice consistency and ensure the system continues to operate (Availability), or we sacrifice availability and wait for the data in the entire system to remain fully consistent (Consistency). In most cases, especially in systems spread across vast geographies, giving up partition tolerance is not an option. This forces us to choose between Consistency and Availability. For example, in an ERP system for a manufacturing company, updating stock quantities can be critical for immediate and strong consistency; incorrect stock information can halt production. However, in certain reporting modules of the same ERP, a delay of a few minutes might be tolerable.

ℹ️ CAP Theorem Choice

When designing a distributed system, deciding which CAP property to sacrifice forms the foundation of the architecture. Generally, Partition Tolerance (P) is considered indispensable, and your choice is made between C (Consistency) and A (Availability). For me, this has always been a reflection of business requirements.

I once grappled extensively with this dilemma while designing the cart services for a large Turkish e-commerce site. When a user added an item to their cart, was it more important for the system to be one hundred percent consistent at that moment, or for the cart to be accessible under all circumstances? If we had enforced Strong Consistency, our cart service could have become completely inaccessible if a node crashed or a network interruption occurred. This would have meant thousands of users being unable to complete their purchases. Therefore, we opted for eventual consistency for cart data. The fact that an item added to the cart might not be immediately visible across all replicas was a small risk at that moment; availability was much more critical for the user experience. In situations like these, the criticality of the business and user experience have always been my priority.

What is Eventual Consistency and When Does It Occur?

Eventual Consistency is a model that guarantees data in a distributed system will eventually become consistent after some time. This means that when a data update is made, this update might not be propagated to all copies immediately; however, given sufficient time, it's assumed that all copies will eventually reach the same value. The word "eventually" here is key; how long this takes can vary depending on system load, network latency, and replication mechanisms. In a manufacturing ERP, when I update the stock quantity of a certain product, this information might not need to be seen identically by all other services instantly. For instance, when a warehouse operator enters a stock reduction into the system, a delay of a few seconds or minutes for this information to reflect in the accounting system might be acceptable.

Data replication processes typically run asynchronously. Data written to the primary database is later propagated to secondary replicas or other services. During this propagation, different copies might temporarily hold different values. This is often preferred to increase performance, especially in systems with high write loads or geographically distributed architectures. For example, while developing the backend for my own side product, I used the eventual consistency model for data that doesn't change very frequently, such as user profile updates or comments. This increased the system's performance and scalability, while users didn't necessarily see the most up-to-date data instantly, their overall experience was not negatively impacted.

Conflict Resolution Mechanisms

In the Eventual Consistency model, "conflicts" can arise when simultaneous write attempts occur to the same data from different points. How these conflicts are resolved is a critical design decision. One of the simplest methods is the "Last-Write-Wins" (LWW) principle; meaning, the last written data is considered valid. However, this may not always align with business logic. Sometimes, more sophisticated approaches like custom merge strategies or CRDTs (Conflict-free Replicated Data Types) might be necessary. In a customer project, on a platform where multiple users could make changes to the same document simultaneously, LWW proved insufficient. We had to develop a more complex merge algorithm using timestamps and version numbers. Situations like these clearly demonstrate the complexity introduced by eventual consistency and increase the developer's workload. Therefore, it's essential to thoroughly analyze from the outset which data can be eventually consistent and which data must be strongly consistent.

Advantages and Disadvantages of Eventual Consistency

The allure of Eventual Consistency stems from its ability to overcome certain fundamental limitations encountered in distributed systems. However, alongside the benefits this model brings, there are significant disadvantages that should not be overlooked.

Advantages: High Availability, Scalability, Performance

The biggest advantage of Eventual Consistency is its ability to offer high availability (Availability) even when network outages occur between system components. When a node becomes inaccessible or a network segment breaks, other parts of the system can continue to operate and perform read/write operations. This is vital, especially for applications serving globally or requiring continuous uptime. In the backend of my own side product, while managing financial transactions for over 100K active users daily, I can never afford for the system to completely stop. Therefore, by opting for Eventual Consistency even for some non-critical data like user balances, I've guaranteed the continuity of the service even in the event of an outage in any region.

The second major advantage is horizontal scalability. Since there's no requirement for immediate synchronization between data replicas, we can easily distribute the system's overall workload by adding more servers. This is a cost-effective solution, especially in situations where data volume or transaction count is rapidly increasing. For instance, in a manufacturing firm's ERP, the production planning module could perform hundreds of thousands of record operations daily. Managing this heavy write load with Strong Consistency would require expensive hardware and lead to performance bottlenecks. With Eventual Consistency, we were able to distribute write operations across different nodes and increase the system's overall performance by 30%.

Finally, performance is one of Eventual Consistency's most apparent benefits. Since write operations occur asynchronously, a client can receive a response immediately after sending a write request, without waiting for it to propagate to all copies. This significantly improves the user experience. In one of my mobile applications, updating user notification preferences was causing the application to slow down because waiting for this change to propagate instantly to all backend services was taking too long. When we switched to Eventual Consistency, users could get an immediate response as soon as they saved their preferences, reducing the application's overall response time from an average of 500ms to 100ms.

Disadvantages: Likelihood of Stale Data, Complex Error Handling, Increased Developer Burden

The most obvious disadvantage of Eventual Consistency is the possibility of reading data that is not immediately up-to-date (stale data). If data is read shortly after it's written, one might encounter an older copy. This can lead to serious problems in scenarios where strong consistency is an absolute requirement, such as financial transactions, security settings, or inventory management. In a customer project, we had to use Eventual Consistency in a system managing user credit limits. When a user requested a limit increase, the fact that this information wasn't updated instantly across the entire system caused the user to be able to make new transactions with their old limit for a short period. This situation posed both financial and legal risks. Ultimately, we had to revert to Strong Consistency for such critical data and redesign the system's architecture.

A second disadvantage is the complexity of error handling. In the Eventual Consistency model, situations like data conflicts and replication delays naturally occur. Dealing with these situations requires developing additional logic and mechanisms. For example, in a manufacturing ERP, if two operators reduce stock for the same product simultaneously, it must be predefined how the system will resolve this conflict (e.g., accepting the first reduction and rejecting the second, or processing both without going negative). Addressing such scenarios correctly is critical for system reliability and data integrity.

Finally, Eventual Consistency places a significant burden on the developer. Developers need to question the currency of data at every point in the application, develop mechanisms to protect against stale data, and manage potential data conflicts. This requires more thought and coding compared to systems that offer Strong Consistency. In a side product, while designing the data flow, I recall thinking "I wish it were simpler" several times due to the complexity introduced by Eventual Consistency. For instance, on a screen showing a user's payment history, updating the information indicating a successful payment with an Eventual Consistent message from the payment service required managing temporary states like "pending" or "processing" in the user interface. This extended the development process and increased the number of test cases.

Eventual Consistency Application Scenarios and Patterns

To manage the complexities introduced by Eventual Consistency, specific patterns and approaches have been developed in software architecture. These aim to help ensure data consistency while leveraging the advantages of distributed systems.

CQRS (Command Query Responsibility Segregation) and Event Sourcing

CQRS (Command Query Responsibility Segregation) is an architectural pattern that aligns very well with Eventual Consistency. In this pattern, "command" operations that modify data are separated from "query" operations that read data. Command operations are typically written to a database called the write-model, while query operations read data from a different, optimized database called the read-model. Data synchronization between these two models occurs with Eventual Consistency principles. In a manufacturing ERP, I used CQRS when designing operator screens. When an operator starts a production order or completes a product, this is processed as a "command" and saved to the write-model. This information is then asynchronously reflected in the read-model and used for reporting or dashboards. This allowed the operator's action to be completed instantly, while reports could be updated with a few seconds of delay.

Event Sourcing, on the other hand, is the principle of recording all state changes in a system as a sequence of events. The current state of the system is reconstructed from the accumulation of these events. These events are typically broadcast to other services via an Event Bus, and this broadcasting is subject to Eventual Consistency. When I used Event Sourcing and CQRS together, I found that I could manage stock movements in a manufacturing ERP much more flexibly. Each stock entry or exit was recorded as an event, and these events were processed by different modules like inventory management, accounting, and reporting according to their own needs. This approach increased both the system's auditability and provided a flexible architecture.

Transaction Outbox Pattern

In distributed systems, ensuring that a database transaction and an external message (e.g., a message sent to a RabbitMQ queue) occur atomically is challenging. The Transaction Outbox Pattern is an effective method I use to solve this problem. In this pattern, all external messages that need to be sent during a database transaction are recorded in a special "outbox" table within the same database transaction. When the database transaction is successfully committed, a separate service (the outbox relay) retrieves these messages from this table and sends them to the external message queue. This ensures an "at least once" guarantee between the database transaction and the message sending process, preventing inconsistent states.

In a customer project, we used this pattern when updating customer information in a bank's internal platform. When customer information was updated, within the same transaction, both the customers table was updated, and a CustomerUpdatedEvent record was added to the outbox table. Later, a background worker continuously scanned the outbox table to retrieve new events and send them to Kafka.

-- Example of an outbox table in PostgreSQL
CREATE TABLE outbox (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    aggregate_type VARCHAR(255) NOT NULL,
    aggregate_id UUID NOT NULL,
    event_type VARCHAR(255) NOT NULL,
    payload JSONB NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    processed_at TIMESTAMPTZ,
    status VARCHAR(50) DEFAULT 'PENDING'
);

-- Adding a record to the outbox during a transaction
BEGIN;
-- Update customer information
UPDATE customers SET name = 'New Name', updated_at = NOW() WHERE id = 'some-uuid';
-- Add event to outbox
INSERT INTO outbox (aggregate_type, aggregate_id, event_type, payload)
VALUES ('Customer', 'some-uuid', 'CustomerUpdated', '{"customer_id": "some-uuid", "new_name": "New Name"}');
COMMIT;

This approach allowed me to maintain consistency between the database and messaging systems, and in case of potential errors (e.g., failure to send a message to Kafka), I could easily implement retry mechanisms thanks to the records in the outbox table.

Idempotency

In distributed systems working with Eventual Consistency, it's common for messages or operations to be processed multiple times (e.g., retries due to network interruptions). Therefore, for every operation to be "idempotent," meaning that applying the same operation multiple times has the same effect on the system, is a critical requirement. Idempotency is essential for maintaining data integrity and preventing unintended side effects. For example, in a payment system, processing the same payment request twice could lead to the customer being charged twice.

To prevent this, a unique "idempotency key" is typically used. When a customer sends a transaction request, they also include this key in the request header or body. The system uses this key to check if the same request has been processed before. If it has, instead of executing the operation again, it returns the result of the previous operation. In my own side product, I actively use the X-Idempotency-Key header when processing payments.

POST /api/v1/payments HTTP/1.1
Host: api.example.com
Content-Type: application/json
X-Idempotency-Key: e9a2b7c1-f3d8-4a5e-b1c0-6d7e8f9a0b1c

{
    "amount": 100.00,
    "currency": "TRY",
    "customer_id": "user-123"
}

This way, even if the same payment request arrives twice due to network delays or client-side retries, the system processes it only once, preventing incorrect calculation of the customer's balance. Idempotency makes the system much more robust when working with guarantees like "at least once delivery" introduced by Eventual Consistency.

Operational Challenges in Managing Eventual Consistency

While Eventual Consistency offers architectural advantages, it requires careful management on the operational side. In my experience, the main challenges this model presents are monitoring data consistency, managing delays, and resolving conflicts.

Monitoring is one of the most critical operational requirements for Eventual Consistency. Although we know the data will eventually become consistent, how long does this "eventually" take? We need to establish robust monitoring mechanisms to measure this duration and detect anomalies. In a manufacturing firm's ERP, I developed custom metrics to track how long it took for stock updates to synchronize across different systems. For example, by measuring the difference between the timestamp when a stock_updated_event was published and the timestamp when this event was reflected in the relevant reporting database, I tracked the average synchronization latency and drift. When these delays exceeded 500ms during the day, the system automatically triggered an alert. Such an alert typically indicated a replication problem or a bottleneck in the message queue.

One of the tools I use to measure data synchronization delays is pg_replication_slots and WAL (Write-Ahead Log) latencies in PostgreSQL. In systems using physical replication, I tracked the delay between the primary server and replicas by monitoring metrics like write_lag, flush_lag, and replay_lag through the pg_stat_replication view.

-- Monitoring PostgreSQL replication lag
SELECT
    application_name,
    client_addr,
    state,
    sync_state,
    pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
    EXTRACT(EPOCH FROM (NOW() - pg_last_xact_replay_timestamp())) AS replay_lag_seconds
FROM
    pg_stat_replication;

By running this query regularly or collecting it with a monitoring tool (e.g., Prometheus), we could be instantly notified when replication delays exceeded a certain threshold. The WAL rotation alarm at 03:14 once indicated an issue with disk I/O, and thanks to this, we were able to intervene before a larger data inconsistency occurred.

Conflict Resolution is another operational challenge of Eventual Consistency. When data conflicts occur between systems, it's not always possible to resolve these conflicts automatically. In some cases, manual intervention or solutions based on custom business logic may be required. In a customer project, when users tried to update the same inventory record simultaneously, the automatic LWW (Last-Write-Wins) policy was not sufficient. Sometimes, a change made by one user would overwrite a more logical change made by another user. In such situations, we designed a process where conflicting changes were placed in a "conflict queue," allowing an operator to manually review and make the correct decision. While this incurred operational costs, it was necessary to preserve data integrity.

Test Strategies also change with Eventual Consistency. Traditional integration tests used in systems with strong consistency may be insufficient. It is very beneficial to use Chaos Engineering or simulations to test Eventual Consistency scenarios. By simulating situations like network latency, node failures, or reordering of messages, I observed the system's behavior under Eventual Consistency. For example, in a test environment, I made a database replica inaccessible for 10 minutes, then reopened it, and analyzed in detail how the system synchronized and what users experienced during this process. These tests helped minimize surprises we might encounter in the production environment.

My Preferences and Future Perspective

In my twenty years of experience, deciding when to choose Eventual Consistency and when to opt for stronger consistency models when designing distributed system architectures has always varied based on business requirements. My clear stance is this: if real-time data consistency is not an absolute necessity for a user experience or workflow, I prefer Eventual Consistency for its advantages in high availability, performance, and scalability. For example, Eventual Consistency is often sufficient for data like blog post view counts, a user's last login time, or product comments. Instantaneous delays in such data typically do not negatively impact user experience.

However, when it comes to critical data such as banking transactions, stock deductions, financial reporting, or security settings, I lean towards Strong Consistency or Transactional Consistency models. In these areas, data inconsistency can lead directly to financial losses or severe operational disruptions. Managing customer account balances in a bank's internal platform, I've seen that even the slightest inconsistency is unacceptable. There, we had to resort to more costly but reliable solutions like distributed transaction management and two-phase commit. [ilg

DEV Community