By Gaurav Singh & Luciano Sabença, Streaming Platform Team
Global data streaming requires careful tradeoffs between performance, compliance, and reliability. At Adyen, replication is essential: data must remain close to producers to reduce latency and meet regulations, while also being centralized for analytics.
In this blog, we share our experience optimizing MirrorMaker 2 for large-scale, cross-continent Kafka replication, the parameter tuning we applied, and the hidden behaviors in Kafka Connect that ultimately unlocked the required throughput.
Why Replicate?
There are many things to take into account when designing a data streaming platform for a company such as Adyen. One of the most important is data location and replication. For performance, reliability, and compliance reasons, keeping data close to producers is usually a good idea. After all, you don't want to wait in line at the cashier to confirm the payment for that fat burger you just bought on a nice beach in Australia while your payment makes a round trip to Europe. However, it's also common practice to keep all data in centralized clusters for analytical purposes. These two needs require a bridge, which is where mirroring comes in.
MirrorMaker
When it comes to copying data between Kafka clusters, MirrorMaker is the default open source tool. MirrorMaker 2 is the current version, and it's built on top of Kafka Connect, a platform designed to make it easier to integrate Kafka with other tools such as databases and distributed file systems, and - why not? - another Kafka cluster. Setting up MirrorMaker is fairly straightforward, but tuning it to the performance Adyen requires isn't. Let's go over the journey into the depths of Kafka Connect and parameter tuning to find a solution!
Playing the Volume Game
Kafka is a highly flexible tool and serves as the backbone for many different architectural patterns. It can be used for batch processing, real-time processing, or any combination of these two extremes. With default parameters, Kafka producers and consumers are optimized for low-latency use cases: batches are small, and records are sent to the server immediately (i.e., linger.ms = 0). Given that our volume is on the order of several hundred thousand messages per second, any optimization has a huge impact on replication latency. In our case, the default settings could not keep latency bounded within a few seconds; our replication latency was on the order of minutes. Since we transfer this data across continents, we don't expect very low replication latency, but latency that high indicated we had hit the maximum possible throughput for this configuration, so we needed to revisit those parameters.
We made the following adjustments on the producer side of MirrorMaker (a sketch of the resulting overrides follows the list):
- batch.size: increased from the default 16384 to 409600. Larger batches per request mean fewer requests to replicate the same number of messages.
- linger.ms: increased from 0 to 200. This parameter sets how long the producer may wait before dispatching a batch; 0 means send immediately. Giving the client more time lets it pack more messages into each batch.
- compression.type: changed from 'none' to 'zstd'. Compression means more messages fit into each batch.
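For illustration, here is a minimal sketch of what these overrides could look like in an mm2.properties file. The cluster aliases (source, target) are assumptions, and the exact per-cluster override prefix ({cluster}.producer.*) can vary with your Kafka version and deployment mode, so treat this as a sketch rather than a drop-in config:

```
# mm2.properties (excerpt) -- aliases "source" and "target" are illustrative
clusters = source, target
source->target.enabled = true

# Producer overrides for the flow into "target":
# bigger, compressed batches that wait up to 200 ms to fill
target.producer.batch.size = 409600
target.producer.linger.ms = 200
target.producer.compression.type = zstd
```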
These changes didn't yield the desired results. There was another bottleneck. We then turned our attention to the consumer parameters. After all, MirrorMaker is both a consumer from the source cluster and a producer to the destination.
We adjusted the following on the consumer side (see the sketch after this list):
- fetch.min.bytes: increased from 1 to 5000, so the broker waits until a bit more data is available before answering a fetch, reducing the number of consumer requests and improving throughput.
- max.poll.records: increased from 500 to 2000, with the objective of fetching more records per request and handing the producer larger batches to send on.
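The consumer side of the same sketch, with the same illustrative aliases and the same caveat about the per-cluster override prefix:

```
# mm2.properties (excerpt) -- consumer overrides for reads from "source"
# Wait for at least 5 KB per fetch and return up to 2000 records per poll
source.consumer.fetch.min.bytes = 5000
source.consumer.max.poll.records = 2000
```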
The goal here was clear: have the consumers fetch as much data as possible at once and send it to the destination cluster in very large batches, optimizing network traffic and amortizing the long cross-continent round trip over every replicated byte. As a rough back-of-the-envelope: with a round trip in the hundreds of milliseconds, a pipeline shipping 16 KB per request spends most of its time waiting on the network, while 400 KB compressed batches move roughly 25 times more data for the same per-request overhead.
Despite these efforts, the throughput remained stubbornly the same. Each time we restarted with new parameters, the record rate would spike briefly before settling back to its previous baseline, indicating that our changes were not having a lasting impact. This led us to believe there was a deeper, underlying issue we were missing.
The Missing Piece
There was, however, a missing piece of the puzzle: how Kafka Connect persists configuration and propagates it to the worker nodes. As we mentioned before, MM2 is built on top of Kafka Connect, which uses internal Kafka topics to store and persist its state and configuration. This gives us three possible layers of configuration (one way to peek at the third layer is sketched after the list):
- Kafka client (producer/consumer) parameters
- MirrorMaker configuration
- Internal Kafka Connect state
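For illustration, one way to inspect what Kafka Connect has actually persisted is to read the internal config topic directly with the console consumer that ships with Kafka. The topic name below is an assumption (MM2 in dedicated mode creates its own internal topics, while a plain Connect cluster uses whatever config.storage.topic points at, connect-configs by default), so substitute the one present in your cluster:

```
# Dump the connector configuration Kafka Connect has persisted.
# Bootstrap server and topic name are illustrative -- adjust to your deployment.
bin/kafka-console-consumer.sh \
  --bootstrap-server target-cluster:9092 \
  --topic mm2-configs.source.internal \
  --from-beginning \
  --property print.key=true
```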
We needed to be sure that whenever a configuration was changed in any of these three places, it would be reflected in the others. It soon became clear that this wasn't the case: while our MM2 consumer was pulling a healthy amount of data, the producer wasn't sustaining high throughput and wasn't creating batches anywhere near the size our use case needed.
After digging into the internal configuration topics and the MirrorMaker docs, we understood the root cause: connector configuration changes are not applied while any instance of the connector is up and running. Instead, the running connectors keep using the configuration stored in the internal MirrorMaker topics. That was exactly our situation: we run MM2 in distributed mode, so we had been restarting nodes without ever stopping all of them at once. The behaviour makes sense, since it prioritizes consistency while the application is running, but it isn't intuitive, and it almost sent us down some strange paths in our search for better throughput.
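The fix was operational rather than another parameter: bring every MM2 node down so the cluster is fully stopped, then start them again so the new properties are picked up and persisted to the internal topics. A minimal sketch, assuming a dedicated MM2 deployment launched with the connect-mirror-maker.sh script that ships with Kafka (paths are illustrative, and how you stop the nodes depends on your process manager):

```
# 1. Stop ALL MirrorMaker 2 nodes -- a rolling restart is not enough,
#    because surviving workers keep serving the old config from the internal topics.

# 2. Start each node again with the updated properties file:
bin/connect-mirror-maker.sh config/mm2.properties
```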
The Results: A Tale of Two Graphs
The results were dramatic once we understood the necessity of a complete restart. As seen in the graph below, the number of records per request skyrocketed. We were also able to validate, via the logs and via the internal configuration topic, that the parameters used by the producers and consumers were the ones we had originally intended.
As a consequence of those parameters, the total number of requests sent by the producer dropped significantly, and each request was now able to replicate many more messages.
This experience was a valuable lesson in the intricacies of managing a distributed system like Kafka Connect. Understanding the operational details of how configurations are managed was the key to unlocking the needed performance. This journey highlights that sometimes the most significant gains come not from simply tweaking parameters, but from a deeper understanding of the tools themselves.