Kafka: Let's talk (again) about replication

#kafka #replication #datastreaming #multidc

Like in many others distributed systems, Kafka leverages data replication to implement reliability and everyone that has tapped into Kafka knows about the replication factor configuration property that is required to create any topic. So why writing a paper in 2023 to talk about that? Well in fact things have evolved other time and data replication is now a broader topic than just the partition replication factor, so let me give you an overview of the current options.

Replication factor

Just a brief recap of this capability that is at the roots of Kafka itself. Data in a production Kafka cluster is replicated 3 times: once in the partition that is the leader in the replica set and two additional times in the followers. In fact, 3 is not enforced but when it comes to replication having 3 copies, it’s the commonly accepted minimum, it permits the data to still be available despite one failure aor a maintenance. Another instrumental configuration parameter is complementing this: min.insync.replicas. The in-sync replica set is, as its name says, the set of replicas made up of the leader and all other replicas in sync with the leader. So, considering a producer set with acks=all, a record write is considered successful if at least min.insync.replicas were able to acknowledge the write. So with a Replication Factor of 3, the usual min ISR value is set to 2, that way if one broker is unavailable for any reason, whether because of a failure or any planned operation like an upgrade, then a partition can still accept new records, giving the guarantee that they will be replicated even in this degraded configuration. This is why a minimal cluster requires 3 brokers, but 4 is recommended to keep the topic creation availability in case of a broker loss.

On the producer side, in the majority of the use cases, acks=all is the standard setting for all the reasons explained above. Note that even if min.insync.replicas=2, during nominal operations, most of the time the ISR set counts all 3 replicas. Hence, this makes this replication process synchronous.

External replication with MirrorMaker or Confluent Replicator

External replication is when records are replicated to another cluster. There are various reasons for doing that :

Sharing data between two locations that are too far to support synchronous replication: imagine applications hosted in the US east coast producing records and other consumers located in the west coast, implying too much network latency. Another reason is if you need to share data in real time with partners, and you want to copy the records from a set of topics to a foreign cluster managed by the partner.
DR scenarios: if your organization has 2 and only 2 DC, then you can't stretch the cluster, and you need to run two distinct clusters and replicate records from the primary to the DR one. But this also applies if you're hosting your cluster in the public cloud in three availability zones, and your business is so critical that you want to cover the risk of a complete region loss.
When you have on-prem applications like legacy core banking systems or mainframe applications and you need to stream data to new generation applications hosted in the public cloud, one good way to implement that kind of hybrid scenario is to replicate the on-prem cluster to a fully managed one in the cloud. It also drastically helps to streamline the network round trips as there's only one kind of flow to govern, and you can read multiple times the same data and pay the network cost between your DC and the cloud only once.
Data migration between two Kafka clusters

So tools like MirrorMaker and Confluent Replicator allow that kind of cross-cluster replication. You can see them like external applications consuming records from one side and producing on the other side, obviously the reality is a bit more complex as they're covering a wide range of edge cases. Both of them are implemented as Kafka Connect connectors, note that MirrorMaker version 1 is not a connector. So as they're replicating beyond the ISR set, this makes this kind of replication asynchronous by design, and you should also pay attention to the fact that you can't guarantee that all records are replicated at the moment of a complete disaster, so the guaranteed RPO can't be 0.

Asynchronous replication with Cluster Linking

We saw that asynchronous replication makes sense in some scenarios, especially to avoid any latency impact on the producer side. External replication is one option for doing so, but it also comes with a couple of challenges to care about:

those tools are external to the broker, which implies additional resource to manage, in a fault-tolerant manner and with the proper monitoring.
as consumer offsets are stored in a distinct topic, the consumer offsets can't be preserved, so offsets need to be translated from one cluster to another, this topic is extensively covered in the documentation.

This is where Cluster Linking comes into play. It's a feature offered by Confluent Server, which can be seen as a broker on steroids with a wide set of added capabilities, and Cluster Linking is one of them. Here the game changer is that as the replication is a feature internal to the broker, so it makes a byte-to-byte replication allowing to preserve the consumer offsets from the source cluster to the destination one. The other benefit is the reduction of the footprint on the infrastructure as there's no need to manage external components for that matter.
Cluster Linking is also available on Confluent Cloud.

Asynchronous intra-cluster replication

At that stage you should wonder how would it be possible to have asynchronous replication as the followers are expected to be part of the ISR set? This is the trick: Confluent introduced an additional kind of replica: the observer. It's another additional feature from the Confluent Server and it's different in the sense it's not part of the ISR set, which allows replicating asynchronously the leader.

Ok, so now let's talk about the use cases where this feature can fit. As formerly mentioned, if you need to share data with some applications that are hosted far away from the producer, implying a latency beyond the acceptable from a producer's perspective, then building a Multi Region Cluster spanned across those two places makes sense. It relies on the follower fetching feature that was introduced in Kafka 2.4.

Another very interesting scenario is when you combine observers and Automatic Observer Promotion because it unlocks the option to stretch the cluster across only 2 DC for the data plane. It's quite common in many organizations to have only 2 DC but remember that the control plane is implemented with Zookeeper, which is quorum based, so it needs an odd number of locations in order to maintain the quorum in case of a DC loss. So, if using the public cloud to host a Zookeeper tie-breaker is an option, which is usually accepted by Info Sec teams as no business data is managed by Zookeeper, then it's possible to overcome the 2 DC limitation mentioned previously. This is what we call 2.5 DC Architecture, to learn more, see this blog post: Automatic Observer Promotion Brings Fast and Safe Multi-Datacenter Failover with Confluent Platform 6.1. The main benefits of using a stretched cluster rather than replicated clusters are that you don't need to restart and reconfigure the client applications on the DR; as it's a single cluster, then you need fewer components and more importantly, you can guarantee that the RPO will be 0, meaning no data loss in the event of a unavailable DC.

I hope this gives clarification on all available options in terms of replication, however if you still need help to figure out what can be the appropriate setup for your use case, let's connect on LinkedIn and discuss about it!