A winning Kafka partition strategy

#kafka #partitions

When starting to develop applications using Kafka, you might wonder how many partitions you need to set up for each topic.

As is so often the case, the answer is: it depends.

This article will discuss everything you need to know about Kafka partitions and which partition strategy to use.

Kafka topics and partitions: a quick primer

Kafka organizes incoming events in topics. A topic is the primary unit of storage. It is a log of events belonging to a specific domain, such as website_user_activity or checkout_notification. More technically, you could think of a topic as an append-only application log file.

With log files, you might have your logging framework configured to create one log file per day, primarily for space reasons, but also to make it easier to find specific log entries later. Similarly, Kafka divides a topic into partitions, primarily for performance reasons. Partitions are Kafka's way of parallelizing data processing. A consumer always reads from specific partitions of a topic: you can either assign partitions to it explicitly or let it join a consumer group, in which case Kafka assigns partitions to the group's members automatically.

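Within a consumer group, a partition is consumed by only one consumer at a time, so the number of partitions caps how many consumers in a group can work in parallel. If you need to handle a high volume of data, you'll need to increase the number of partitions to scale up consumption.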
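To make the scaling limit concrete, here is a minimal sketch of how partitions might be spread over the consumers of one group. The function `assign_partitions` is hypothetical and uses a plain modulo scheme; Kafka's real assignors are more sophisticated, but the cap is the same: consumers beyond the partition count stay idle.

```rust
/// Hypothetical, simplified assignment: partition `p` goes to consumer
/// `p % num_consumers`. Returns one list of partition ids per consumer.
fn assign_partitions(num_partitions: usize, num_consumers: usize) -> Vec<Vec<usize>> {
    let mut assignment = vec![Vec::new(); num_consumers];
    if num_consumers == 0 {
        // No consumers in the group, so nothing can be assigned.
        return assignment;
    }
    for partition in 0..num_partitions {
        assignment[partition % num_consumers].push(partition);
    }
    assignment
}

fn main() {
    // Three partitions shared by two consumers: one consumer gets two partitions.
    println!("{:?}", assign_partitions(3, 2));
    // Three partitions and four consumers: the fourth consumer gets nothing.
    println!("{:?}", assign_partitions(3, 4));
}
```

In the second call, adding a fourth consumer does not increase throughput, because there are only three partitions to hand out.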

A partition example

Let's look at an example to make things more concrete. The first example shows a single broker and a topic consisting of multiple partitions. Each partition contains a subset of the overall data set.

In a more advanced scenario, several partitions are distributed across several brokers. While the former setup might be more suitable for a development environment, the approach described here is what you will likely find in production. With several brokers, one more topic setting becomes meaningful: the replication factor. The replication factor determines how many copies of each partition Kafka keeps, each on a different broker. For every partition, one replica acts as the leader and the others as followers. If the broker hosting the leader fails, a follower can take over, which helps in disaster-recovery scenarios and prevents data loss. By default, clients read from and write to the partition leader, and because the leaders of different partitions are spread across brokers, not every consumer needs to read from a single host.
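The failover idea can be sketched in a few lines. This is a toy model, not Kafka's actual election logic: the function `elect_leader` and the broker ids are made up for illustration, and it assumes every replica is fully caught up.

```rust
/// Toy model: pick the first replica whose broker is still alive as the leader.
/// `replicas` lists the broker ids holding a copy of the partition.
fn elect_leader(replicas: &[u32], failed_brokers: &[u32]) -> Option<u32> {
    replicas
        .iter()
        .copied()
        .find(|broker| !failed_brokers.contains(broker))
}

fn main() {
    // A partition with replication factor 3, stored on brokers 1, 2 and 3.
    let replicas = [1, 2, 3];
    // All brokers healthy: broker 1 leads.
    println!("{:?}", elect_leader(&replicas, &[]));
    // Broker 1 is down: a follower on broker 2 takes over.
    println!("{:?}", elect_leader(&replicas, &[1]));
    // Every broker holding a replica is down: the partition is unavailable.
    println!("{:?}", elect_leader(&replicas, &[1, 2, 3]));
}
```

With a replication factor of 3, the partition stays available until all three brokers that hold a copy have failed.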

Partition strategies

With an understanding of the mechanics of partitions, how should you distribute data? To make this decision, the main question is: is event ordering important?

Round robin partitioning

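Round robin partitioning is the strategy used when you don't provide a message key. Events get distributed evenly across all available partitions. Because Kafka only guarantees ordering within a single partition, this strategy provides no ordering guarantees across the topic as a whole. Note that newer versions of the Java client default to sticky partitioning for keyless records, which fills a batch for one partition before moving on to the next; over time, the spread is still even.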
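As a rough illustration of the idea, a round robin partitioner only needs a counter. The type `RoundRobinPartitioner` below is a hypothetical sketch, not the implementation of any particular Kafka client.

```rust
/// Hypothetical round robin partitioner: hands out partition ids in a cycle.
struct RoundRobinPartitioner {
    next: usize,
    num_partitions: usize,
}

impl RoundRobinPartitioner {
    fn new(num_partitions: usize) -> Self {
        assert!(num_partitions > 0, "a topic has at least one partition");
        RoundRobinPartitioner { next: 0, num_partitions }
    }

    /// Returns the partition for the next keyless event and advances the cycle.
    fn next_partition(&mut self) -> usize {
        let partition = self.next;
        self.next = (self.next + 1) % self.num_partitions;
        partition
    }
}

fn main() {
    let mut partitioner = RoundRobinPartitioner::new(3);
    // Seven events over three partitions.
    let targets: Vec<usize> = (0..7).map(|_| partitioner.next_partition()).collect();
    println!("{:?}", targets);
}
```

Two consecutive events end up in different partitions, which is why consumers may process them in a different order than they were produced.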

Message key partitioning

You should employ this strategy if event order is required or when events need to be grouped (e.g., in a multi-tenant environment where all data belonging to one user needs to stay together). To use this approach, you need to provide a message key when producing events, such as:



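    // `Record` here looks like the one from the Rust `kafka` crate (assumed from its fields).
    let mut producer = self.create_producer();

    let record = Record {
      topic: &self.to_kafka_topic_name(topic),
      // The message key: events with the same username go to the same partition.
      key: username,
      // A negative value leaves the choice of partition to the producer's partitioner.
      partition: -1,
      value: serialized_event.as_bytes(),
    };

    // `unwrap` panics on a failed send; handle the error properly in production code.
    producer.send(&record).unwrap();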

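Whatever value you provide as the key is run through a hash function. The resulting hash, typically taken modulo the number of partitions, determines the target partition. The same key therefore always maps to the same partition, as long as the number of partitions doesn't change.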
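The key-to-partition mapping can be sketched as follows. This is an illustration only: the function `partition_for_key` is hypothetical, and it uses Rust's standard `DefaultHasher` rather than the hash function any real Kafka client uses, so the partition numbers it prints will not match what a broker sees.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical helper: hash the key and map it onto one of the partitions.
fn partition_for_key(key: &str, num_partitions: u64) -> u64 {
    assert!(num_partitions > 0, "a topic has at least one partition");
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    // The modulo keeps the result in the range 0..num_partitions.
    hasher.finish() % num_partitions
}

fn main() {
    let num_partitions = 6;
    // "alice" appears twice and lands in the same partition both times.
    for key in ["alice", "bob", "alice"].iter() {
        println!("{} -> partition {}", key, partition_for_key(key, num_partitions));
    }
}
```

Because the partition count is part of the calculation, adding partitions to an existing topic can send new events for a key to a different partition than its earlier events.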

Summary

Kafka divides topics into partitions. A partition holds a subset of incoming event data. Depending on your use case, you can either let Kafka decide how to distribute data (using round robin partitioning) or be in charge by specifying message keys to predetermine which partition to use.

DEV Community
