Farhan Khan

Core Concepts of Kafka

1. Introduction

Apache Kafka is a distributed event streaming platform. It’s designed to handle large volumes of real-time data, enabling applications to publish, subscribe to, store, and process streams of events (messages). Think of it as a central nervous system for data in modern systems.

2. Core Concepts

Producer

A Producer is a client application that writes data into Kafka. It creates messages, prepares them for sending, and delivers them to Kafka brokers. Most of this behavior can be tuned to match your reliability, latency, and throughput requirements. Its main responsibilities are listed below, followed by a short code sketch.

  • Send messages to Kafka brokers.
  • Create messages as ProducerRecord objects (with topic, key, value, etc.).
  • Serialize keys and values into bytes.
  • Select partition (either user-specified or chosen by the partitioner).
  • Batch messages together for efficiency.
  • Handle retries if a send fails.
  • Receive acknowledgments from the broker (success or error).
  • Return metadata (topic, partition, offset) when a message is successfully written.
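
Here is a minimal sketch of those steps in Java, assuming a local broker at localhost:9092 and a hypothetical topic named orders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SimpleProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // A ProducerRecord carries the topic, an optional key, and the value
            ProducerRecord<String, String> record = new ProducerRecord<>("orders", "Alice", "Order123");

            // Asynchronous send; the callback receives metadata (topic, partition, offset) or an error
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                } else {
                    System.out.printf("Written to %s-%d at offset %d%n",
                            metadata.topic(), metadata.partition(), metadata.offset());
                }
            });
        } // close() flushes any messages still sitting in a batch
    }
}
```

send() is asynchronous: the record is added to a batch, and the callback fires once the broker acknowledges it or the send finally fails after retries.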

Consumer

A consumer reads messages from Kafka topics. If one consumer alone can’t keep up with the incoming data, Kafka lets you add more consumers in the form of a consumer group.

Within a consumer group, the topic’s partitions are split among the consumers:

  • If the group has one consumer, it reads from all partitions.
  • If there are as many consumers as partitions, each consumer gets one partition.
  • If there are more consumers than partitions, the extras stay idle.

This is how Kafka scales consumption — by letting multiple consumers share the load across partitions. It’s also why topics are often created with multiple partitions, so more consumers can be added later if traffic grows.

When multiple applications need the same data, each application should have its own consumer group. That way, each group gets a full copy of all the messages in the topic, independent of other groups.
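
As a rough sketch, every instance of the consumer below that uses the same group.id joins one consumer group and splits the partitions of the assumed orders topic among themselves; a second application with a different group.id would receive its own full copy of the data. The broker address and topic name are placeholders.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OrderConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processing");        // instances with this id share the partitions
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d key=%s value=%s%n",
                            record.partition(), record.key(), record.value());
                }
            }
        }
    }
}
```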

Partition

A partition in Kafka is a way of splitting a topic into smaller parts so messages can be stored and processed in parallel. When producing messages, you create a ProducerRecord with a topic, a value, and optionally a key. The key is important because, by default, Kafka will hash the key so that all messages with the same key always go to the same partition. This ensures ordering: for example, all transactions of a single bank account can be kept in sequence.

If you don't provide a key, Kafka uses round-robin (or "sticky round-robin" in newer versions) to spread messages evenly across partitions. In some cases the default isn't enough, so you can implement custom partitioning, for example sending all records for a very large customer to a dedicated partition so that no single partition gets overloaded.

The main advantage of partitions is that they let Kafka scale: data is distributed across brokers for higher throughput, while still keeping strict ordering for messages that share the same key.

For example, imagine a topic called CustomerOrders that has 4 partitions. If a producer sends a ProducerRecord("CustomerOrders", "Alice", "Order123"), Kafka will hash the key "Alice" and always send her orders to the same partition. Another customer, "Bob", might be mapped to a different partition, and if a record is sent without a key, Kafka will assign it round-robin across the available partitions. This way, the topic can handle thousands of customers at once by spreading the load across partitions, while still keeping each customer’s orders in order.
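
A short sketch of the three ways a record can be routed, assuming the CustomerOrders topic above and a KafkaProducer<String, String> named producer configured as in the earlier example:

```java
// Keyed record: Kafka hashes "Alice", so all of her orders land on the same partition
ProducerRecord<String, String> keyed =
        new ProducerRecord<>("CustomerOrders", "Alice", "Order123");

// Keyless record: the partitioner spreads these round-robin (sticky round-robin in newer clients)
ProducerRecord<String, String> keyless =
        new ProducerRecord<>("CustomerOrders", "Order456");

// Explicit partition: pin a record to partition 0 regardless of its key
ProducerRecord<String, String> pinned =
        new ProducerRecord<>("CustomerOrders", 0, "BigCustomer", "Order789");

producer.send(keyed);
producer.send(keyless);
producer.send(pinned);
```

The explicit-partition constructor is the simplest form of custom routing; for more complex rules you can implement Kafka's Partitioner interface and plug it in via the partitioner.class property.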

3. Configuration

Configuring Producer

The following are some of the most important properties for configuring a producer.

bootstrap.servers
This is a list of broker addresses (host:port) that the producer uses to connect to Kafka for the first time. You don’t need to list all brokers — the producer will discover the rest. It’s best to list at least two, so if one broker is down, the producer can still connect.

key.serializer
This setting tells the producer how to turn the record key into bytes, because Kafka only understands byte arrays. For example, StringSerializer converts a String into bytes, and IntegerSerializer converts an Integer into bytes. Kafka already includes serializers for many common types, so in most cases you don’t need to write your own. The setting is required even if you don’t plan to use keys, but in that case you can use the VoidSerializer.

value.serializer
Just as you set key.serializer to the name of a class that serializes the message key object to a byte array, you set value.serializer to a class that serializes the message value object.
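
Putting these three settings together, a minimal configuration sketch might look like this (the broker addresses are placeholders):

```java
Properties props = new Properties();
// Two brokers listed so the producer can still bootstrap if one of them is down
props.put("bootstrap.servers", "broker1:9092,broker2:9092");
// Keys and values are Strings here, so both serializers turn Strings into byte arrays
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

KafkaProducer<String, String> producer = new KafkaProducer<>(props);
```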

acks
The acks setting decides how many brokers must confirm a message before the producer considers it successfully sent. This affects both reliability and speed.

acks=0
The producer doesn’t wait for any reply. Messages can be lost, but sending is very fast.

acks=1
The producer gets confirmation once the leader broker has the message. Safer than 0, but messages can still be lost if the leader crashes before replication.

acks=all
The producer waits until all in-sync replicas (ISR) have the message. This is the safest option, since the data survives broker crashes, but it is also the slowest. How many replicas must be in sync is controlled by min.insync.replicas (a broker- or topic-level setting). If min.insync.replicas=2, then at least 2 replicas (the leader plus 1 follower) must have the message before it is considered successful.

retries
The retries setting controls how many times the producer will try again if sending a message fails due to a temporary error (like no leader for a partition). The wait time between retries is set by retry.backoff.ms (default 100 ms).
Retries can cause out-of-order messages, because later messages may succeed while earlier ones are still retrying. To avoid this, set max.in.flight.requests.per.connection=1, but this lowers throughput.
Enabling idempotence (enable.idempotence=true) makes sure retries don't create duplicate messages. Without it, a lost acknowledgment can cause the producer to resend the same record, and the broker will store it twice. With idempotence on, Kafka detects duplicates and only keeps one copy.

max.in.flight.requests.per.connection
max.in.flight.requests.per.connection controls how many batches a producer can send before getting replies from the broker. A higher value improves throughput but uses more memory.
By default it’s 5. But if retries are enabled and this is more than 1, messages can get delivered out of order — because a later batch might succeed while an earlier one is still retrying.
To keep both performance and ordering, it’s best to use enable.idempotence=true, which prevents duplicates and preserves order even with multiple in-flight requests.

enable.idempotence
When retries happen, the same message can sometimes be written more than once, creating duplicates. Setting enable.idempotence=true prevents this by adding a sequence number to each message so the broker can ignore duplicates. It also ensures that messages stay in the correct order, even when retries occur.
It requires acks=all, retries > 0, and max.in.flight.requests.per.connection ≤ 5.
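
Combining the settings above, a reliability-focused configuration sketch could look like this (the values are illustrative, not prescriptive):

```java
Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092,broker2:9092"); // placeholder addresses
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

props.put("acks", "all");                               // wait for all in-sync replicas
props.put("retries", Integer.MAX_VALUE);                // keep retrying transient failures
props.put("retry.backoff.ms", 100);                     // wait between retries (default shown)
props.put("max.in.flight.requests.per.connection", 5);  // must stay <= 5 for idempotence
props.put("enable.idempotence", true);                  // no duplicates, ordering preserved

KafkaProducer<String, String> producer = new KafkaProducer<>(props);
```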

Configuring Consumer

max.poll.records
This setting controls how many records a consumer can get in a single call to poll(). It lets you limit the number of messages your application processes at once, but it doesn’t control their total size.

max.poll.interval.ms
This sets the maximum time a consumer can go without calling poll(). If this time is exceeded, Kafka assumes the consumer is stuck and removes it from the group, triggering a rebalance. The default is 5 minutes. It acts as a safeguard to detect consumers that are not actually processing messages, even if they are still sending heartbeats in the background.
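
For instance, a consumer that does heavy per-record work might cap the batch size and allow more time between polls. A small sketch with illustrative values:

```java
Properties props = new Properties();
props.put("bootstrap.servers", "broker1:9092"); // placeholder address
props.put("group.id", "slow-processing-app");   // assumed group name
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

props.put("max.poll.records", 100);        // at most 100 records per poll()
props.put("max.poll.interval.ms", 600000); // allow up to 10 minutes between polls before a rebalance
```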

enable.auto.commit
This setting decides if Kafka should automatically save the consumer’s progress (offsets). By default, it’s true, so offsets are committed for you at regular intervals (controlled by auto.commit.interval.ms). The risk is that offsets might be committed before processing finishes, which can cause data loss if the app crashes.

If you set it to false, you commit offsets manually, usually after processing is complete. This avoids missing data, though it may cause duplicates if a message is reprocessed after a crash. To handle duplicates safely, many applications use idempotent writes (e.g., database operations that ignore or update existing records instead of inserting them twice).
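
A sketch of the manual-commit pattern, assuming a consumer subscribed as in the earlier example and a hypothetical process() method that does the actual work:

```java
props.put("enable.auto.commit", false); // take control of offset commits

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    for (ConsumerRecord<String, String> record : records) {
        process(record); // hypothetical application logic; ideally an idempotent write
    }
    // Commit only after processing succeeds, so a crash causes reprocessing rather than data loss
    if (!records.isEmpty()) {
        consumer.commitSync();
    }
}
```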

partition.assignment.strategy
This setting decides how Kafka will split topic partitions among consumers in the same group. Kafka uses a PartitionAssignor class to do this, and you can choose between several built-in strategies (or write your own).

Range – gives each consumer a block of consecutive partitions per topic. It’s simple, but if partitions don’t divide evenly, some consumers may get more partitions than others.

RoundRobin – spreads partitions one by one across consumers. This usually results in a more balanced distribution.

Sticky – aims to balance partitions like RoundRobin, but also tries to keep existing assignments during rebalances so consumers don’t constantly get shuffled around.

Cooperative Sticky – similar to Sticky, but supports cooperative rebalancing, where consumers can continue working on partitions that aren’t reassigned, making rebalances smoother with less interruption.

By default, Kafka uses the Range strategy. You can switch to RoundRobin, Sticky, or CooperativeSticky, or even plug in your own custom assignor class if you need a special partitioning rule.
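
Switching the strategy is a single consumer property. For example, to use the cooperative sticky assignor:

```java
props.put("partition.assignment.strategy",
          "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");
```

All members of a group need compatible assignors, so a change like this is normally rolled out to every consumer in the group.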

4. Exactly Once Semantics

Idempotent Producer

The first step toward achieving exactly-once semantics in Kafka is enabling the idempotent producer. Normally, retries can create duplicates if a broker crashes or acknowledges late, causing the producer to resend the same record. By setting enable.idempotence=true, the producer avoids this problem using three pieces of metadata: a producer ID (PID), a sequence number, and an epoch.

  • The producer ID uniquely identifies the producer instance.
  • Each record sent to a partition gets a sequence number that increases in order.
  • The epoch increments whenever the producer restarts, ensuring that old (zombie) producers can’t write again.

Brokers track the latest sequence numbers per producer and partition. If they see a duplicate (same PID + sequence number), they reject it; if they see an old epoch, they fence it off. This guarantees that messages are delivered once and in order per partition, even during retries or failures.

Transactions

While the idempotent producer removes duplicates within a single partition, many real-world applications process an input message and then produce results to multiple partitions or topics. To guarantee correctness in this consume–process–produce workflow, Kafka provides transactions. Transactions allow producers to group multiple writes and offset commits into a single atomic unit: either all succeed together or none are applied.

A transactional producer is enabled by setting a transactional.id, which persists across restarts and lets Kafka track producer identity over time. Under the hood, Kafka uses a transaction coordinator that records transaction state in an internal topic (__transaction_state). When committing, the coordinator writes commit markers to all involved partitions, ensuring consistency across them. If the producer crashes, a new coordinator can recover the intent from the log and finish the commit or abort, preventing partial results.
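
Here is a condensed sketch of the consume-process-produce loop with transactions. The topic names and transactional.id are assumptions, and the producer and consumer properties are assumed to be configured as in the earlier examples (with the consumer's auto-commit disabled):

```java
producerProps.put("transactional.id", "order-processor-1"); // assumed id, stable across restarts
producerProps.put("enable.idempotence", true);

KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps);
producer.initTransactions(); // registers with the transaction coordinator

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    if (records.isEmpty()) continue;

    producer.beginTransaction();
    try {
        Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
        for (ConsumerRecord<String, String> record : records) {
            // processing step omitted; write the result to the output topic
            producer.send(new ProducerRecord<>("processed-orders", record.key(), record.value()));
            offsets.put(new TopicPartition(record.topic(), record.partition()),
                        new OffsetAndMetadata(record.offset() + 1));
        }
        // Input offsets are committed inside the same transaction as the output records
        producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
        producer.commitTransaction(); // everything becomes visible together
    } catch (Exception e) {
        // Nothing from this batch is exposed to read_committed consumers;
        // fatal errors such as ProducerFencedException would require closing the producer instead
        producer.abortTransaction();
    }
}
```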

From the consumer’s perspective, Kafka provides two isolation levels:

read_uncommitted (default): consumers see all records, including those from aborted or still-open transactions.

read_committed: consumers only see records that are part of committed transactions, ensuring that partial or aborted writes are never exposed. This mode provides the exactly-once guarantee, but it may introduce slight lag since consumers must wait until a transaction is finalized.
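
On the consumer side this is a single property; a minimal sketch of the setting:

```java
consumerProps.put("isolation.level", "read_committed"); // hide records from open or aborted transactions
```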

Together, transactional producers and read_committed consumers provide true exactly-once semantics across partitions and topics.
