When starting to develop applications using Kafka, you might wonder how many partitions you need to set up for each topic.
As is so often the case, the answer is: it depends.
This article will discuss everything you need to know about Kafka partitions and which partition strategy to use.
Kafka topics and partitions: a quick primer
Kafka organizes incoming events in topics. A topic is the primary unit of storage. It is a log of events belonging to a specific domain, such as website_user_activity or checkout_notification. More technically, you could think of a topic as an append-only application log file.
With log files, you might have your logging framework configured to create one log file per day, primarily for space reasons, but also to make it easier to find specific log entries later. Much like splitting log files, Kafka divides a topic into partitions, primarily for performance reasons. Partitions are Kafka's way of parallelizing data processing. When you connect a consumer to Kafka, it reads from one or more partitions of the topic: you can either assign partitions explicitly or join a consumer group and let Kafka assign them for you.
Because, within a consumer group, a partition is consumed by only one consumer at a time, you'll need to increase the number of partitions to scale up consumption when you have to handle a high volume of data.
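To see why the partition count limits parallel consumption, here is a minimal sketch that spreads partitions over the consumers of one group. The function name `assign_partitions` and the simple modulo distribution are assumptions made for illustration; Kafka's real assignors are more involved.

```rust
/// Distributes partition indices over the consumers of a single group.
/// Returns one list of partition indices per consumer.
/// Illustration only: this is not Kafka's actual assignment algorithm.
fn assign_partitions(partition_count: usize, consumer_count: usize) -> Vec<Vec<usize>> {
    let mut assignment = vec![Vec::new(); consumer_count];
    if consumer_count == 0 {
        return assignment;
    }
    for partition in 0..partition_count {
        // Each partition goes to exactly one consumer in the group.
        assignment[partition % consumer_count].push(partition);
    }
    assignment
}
```

With four partitions and two consumers, each consumer ends up with two partitions. With two partitions and three consumers, the third consumer receives nothing and sits idle, which is why adding consumers only helps up to the number of partitions.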
A partition example
Let's look at an example to make things more concrete. The first example shows a single broker and a topic consisting of multiple partitions. Each partition contains a subset of the overall data set.
In a more advanced scenario, several partitions are distributed across several brokers. While the former setup might be more suitable for a development environment, the approach described here is what you will likely find in production. With several brokers, an additional setting becomes relevant: the replication factor. Configured per topic, it determines how many copies of each partition Kafka maintains across brokers. For each partition, one replica acts as the leader and the others as followers. This helps in disaster-recovery scenarios and guards against data loss, because a follower can take over when the leader's broker fails. It can also help read performance: partition leaders are spread across brokers, so not every consumer needs to read from a single host.
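As a rough sketch of what the replication factor means for placement, the function below picks the brokers that hold the copies of one partition. The name `replica_brokers` and the "next broker in line" placement are hypothetical simplifications, not Kafka's real replica assignment.

```rust
/// Returns the ids of the brokers holding the replicas of one partition.
/// The first entry is treated as the leader, the rest as followers.
/// Simplified placement for illustration, not Kafka's real algorithm.
fn replica_brokers(partition: usize, replication_factor: usize, broker_count: usize) -> Vec<usize> {
    // Every copy must live on a different broker, so the replication
    // factor cannot exceed the number of brokers.
    assert!(
        replication_factor <= broker_count,
        "replication factor exceeds broker count"
    );
    (0..replication_factor)
        .map(|offset| (partition + offset) % broker_count)
        .collect()
}
```

With three brokers and a replication factor of two, partition 2 would be placed on brokers 2 and 0 in this sketch, so losing either broker still leaves one copy available.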
Partition strategies
With an understanding of the mechanics of partitions, how should you distribute data? To make this decision, the main question is: is event ordering important?
Round robin partitioning
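Round robin partitioning is the default strategy: Kafka uses it if you don't provide a message key. Events get distributed evenly across all available partitions. Because related events can land on different partitions, and Kafka only preserves order within a single partition, this strategy does not provide ordering guarantees. (The details vary by client; recent versions of the Java producer, for example, fill a batch for one partition before moving on to the next.)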
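The idea behind round robin distribution fits in a few lines. The following is a minimal sketch with a hypothetical `RoundRobinPartitioner` type; it is not taken from any Kafka client library.

```rust
/// Hypothetical round-robin partitioner: hands out partition indices in turn.
struct RoundRobinPartitioner {
    partition_count: usize,
    next: usize,
}

impl RoundRobinPartitioner {
    fn new(partition_count: usize) -> Self {
        assert!(partition_count > 0, "a topic needs at least one partition");
        RoundRobinPartitioner { partition_count, next: 0 }
    }

    /// Returns the partition for the next keyless event and advances the counter.
    fn next_partition(&mut self) -> usize {
        let partition = self.next;
        self.next = (self.next + 1) % self.partition_count;
        partition
    }
}
```

With three partitions, seven consecutive calls to `next_partition` return 0, 1, 2, 0, 1, 2, 0. Two events from the same user can therefore end up on different partitions, which is where the ordering guarantee is lost.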
Message key partitioning
You should employ this strategy if event order is required or when events require grouping (e.g., in a multi-tenant environment where user data needs to be coherent). To use this approach, you need to provide a message key when producing events, such as:
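    // Assumption: `Record` comes from the `kafka` crate (kafka::producer::Record).
    let mut producer = self.create_producer();
    let record = Record {
      topic: &self.to_kafka_topic_name(topic),
      // Events with the same key (here: the username) go to the same partition.
      key: username,
      // A negative partition leaves the choice to the producer's partitioner.
      partition: -1,
      value: serialized_event.as_bytes(),
    };
    // unwrap() panics if sending fails; handle the error in production code.
    producer.send(&record).unwrap();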
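Whatever value you provide as the key is passed through a hash function. The resulting hash value, typically taken modulo the number of partitions, determines the partition. As a result, events with the same key always land on the same partition, and are therefore read in order, as long as the number of partitions doesn't change.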
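The mapping from key to partition can be sketched as follows. This example uses the standard library's `DefaultHasher` as a stand-in; real Kafka clients use their own hash functions, so the partition numbers it produces will not match those of an actual producer. The function name `partition_for_key` is likewise an assumption.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Maps a message key to a partition index by hashing it.
/// DefaultHasher is a stand-in here, not the hash a Kafka client uses.
fn partition_for_key(key: &str, partition_count: u64) -> u64 {
    assert!(partition_count > 0, "a topic needs at least one partition");
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    // The modulo keeps the result within the range of valid partitions.
    hasher.finish() % partition_count
}
```

Calling `partition_for_key("alice", 6)` twice yields the same partition both times, which is what keeps all events of one user in order. Calling it with a different partition count may yield a different partition, so changing the number of partitions later can break that grouping for existing keys.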
Summary
Kafka divides topics into partitions. A partition holds a subset of incoming event data. Depending on your use case, you can either let Kafka decide how to distribute data (using round robin partitioning) or take control by specifying message keys that determine which partition is used.