Apache Kafka is an open-source distributed event streaming platform used by thousands of companies to publish, store, and process streams of records in real time. It was originally developed by LinkedIn before moving into the open-source world in early 2011. It was built because LinkedIn needed a better way to handle massive amounts of data to track site events, like page views and user clicks, and to gather all their log data in one place.
Kafka is designed for distributed, high-throughput systems and works very well as a replacement for a more traditional message broker. Compared with other messaging systems, Kafka offers better throughput, built-in partitioning, replication, and inherent fault tolerance, which makes it a good fit for large-scale message-processing applications.
In this tutorial, we are going to dive into what Kafka actually is and walk through the most common definitions you'll run into when working with it.
Who is using Kafka?
- Twitter uses Kafka to power its mobile application performance management and analytics platform. In 2015, this system was already handling around five billion sessions per day.
- Netflix relies on Kafka as the messaging backbone for its Keystone pipeline, a unified platform for publishing, collecting, and routing events across both batch and stream processing. As of 2016, Keystone included more than 4,000 brokers running entirely in the cloud, handling over 700 billion events every single day.
- Tumblr uses Kafka as a core part of its event processing pipeline. In 2012, it was already capturing up to 500 million page views per day through this system.
- Square uses Kafka as its central data bus. It supports stream processing, website activity tracking, metrics collection, monitoring, log aggregation, real-time analytics, and complex event processing.
- Pinterest runs Kafka as part of its real-time advertising platform. The system includes over 100 clusters and more than 2,000 brokers deployed on AWS. It processes more than 800 billion events per day, with peaks reaching up to 15 million events per second.
- Uber is one of the most well-known Kafka adopters. The company processes over a trillion events per day using Kafka, mainly for data ingestion, event stream processing, database changelogs, log aggregation, and general-purpose publish-subscribe messaging.
Kafka architecture
Kafka is a publish/subscribe system built around event data. It involves four main actors: producers, consumers, brokers, and ZooKeeper nodes.
Broker nodes: These handle most of the I/O work and are responsible for durable data storage within the cluster. A Kafka broker receives messages from producers and writes them to disk, organizing them by partition and indexing them with a unique offset. Consumers can then fetch messages by topic, partition, and offset. Brokers also work together to form a Kafka cluster, sharing information either directly or indirectly through ZooKeeper.
ZooKeeper nodes: Kafka relies on ZooKeeper to manage the overall state of the cluster. It keeps track of broker health, maintains metadata about topics, and helps coordinate cluster operations.
Producers: Producers are responsible for sending messages to Kafka brokers, typically organized by topic.
Consumers: Consumers are client applications that read messages from Kafka topics.
In the diagram above, a topic is divided into three partitions:
- Partition 1 contains two offsets: 0 and 1
- Partition 2 contains four offsets: 0, 1, 2, and 3
- Partition 3 contains a single offset: 0
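The layout above can be sketched as a minimal in-memory model (an illustration only, not a Kafka client): each partition is an append-only list, and a message's offset is simply its index in that list.

```python
# Minimal in-memory model of a topic with three partitions (illustration only).
# Offsets are just list indices: appending a message assigns the next offset.

topic = {1: [], 2: [], 3: []}  # partition id -> append-only log

def append(partition, message):
    """Append a message to a partition and return its offset."""
    log = topic[partition]
    log.append(message)
    return len(log) - 1  # offset of the newly written message

# Reproduce the example: partition 1 holds offsets 0-1,
# partition 2 holds offsets 0-3, partition 3 holds offset 0.
for p, count in [(1, 2), (2, 4), (3, 1)]:
    for i in range(count):
        append(p, f"msg-{p}-{i}")

print({p: list(range(len(log))) for p, log in topic.items()})
# -> {1: [0, 1], 2: [0, 1, 2, 3], 3: [0]}
```

Notice that an offset identifies a message only within its partition; the pair (partition, offset) is what uniquely locates a record in a topic.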
A replica is simply a copy of a partition, including the same data, offsets, and partition ID.
Now, consider a scenario where the replication factor is set to 3. Kafka will create three identical copies of each partition and distribute them across the cluster to ensure availability and fault tolerance. In the example shown, the replication factor is 1, meaning each partition has only one copy.
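One simple way to picture replica placement is round-robin assignment of each partition's replicas to distinct brokers. This is a sketch of the idea only; Kafka's real assignment algorithm also accounts for racks and leader balancing.

```python
def assign_replicas(num_partitions, num_brokers, replication_factor):
    """Assign each partition's replicas to distinct brokers, round-robin.
    Illustrative sketch only, not Kafka's actual placement algorithm."""
    if replication_factor > num_brokers:
        raise ValueError("replication factor cannot exceed broker count")
    assignment = {}
    for p in range(num_partitions):
        # Start each partition on a different broker so copies spread evenly.
        assignment[p] = [(p + r) % num_brokers for r in range(replication_factor)]
    return assignment

print(assign_replicas(num_partitions=3, num_brokers=4, replication_factor=3))
# -> {0: [0, 1, 2], 1: [1, 2, 3], 2: [2, 3, 0]}
```

The key property the sketch preserves is that no broker holds two copies of the same partition, so losing a single broker never loses all replicas of any partition.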
To balance load across the cluster, partitions are distributed among brokers, and each broker can store one or more partitions. At the same time, multiple producers and consumers can publish and read messages concurrently, making Kafka highly scalable and efficient.
Kafka component concepts
Broker nodes
A broker acts as the middle layer between producers and consumers, helping move data across the system. It stores messages in partitions and ensures that data is safely persisted for a configurable amount of time. This design also helps Kafka handle failure scenarios without losing data.
In Kafka, a broker is a key unit of scalability. By increasing the number of brokers in a cluster, you can improve I/O throughput, availability, and overall durability. Brokers also coordinate with each other and communicate with ZooKeeper to maintain cluster state and consistency.
In most setups, each server runs a single broker. While it is technically possible to run multiple brokers on the same server, this is generally not recommended in production environments.
Topic
A Kafka Topic is where messages sent by producers are stored. Topics are split into partitions, and each topic has at least one partition. Each partition contains messages in an immutable and ordered sequence. Once a message is written, it cannot be changed. A partition is internally implemented as a set of segment files of equal sizes. So we can imagine a topic as a logical aggregation of partitions. By splitting a topic into multiple partitions, Kafka can process data in parallel, which improves performance and scalability.
When a producer publishes a record, it automatically selects a partition based on the record’s key: the producer digests the byte content of the key using a hash function (Kafka uses murmur2 for this purpose).
The process works like this:
- The key is hashed using murmur2.
- The highest-order bit is masked to ensure the result is a positive integer.
- The result is taken modulo the number of partitions. This determines the target partition for the record.
As a result, records with the same key will always be routed to the same partition, ensuring ordering for that key. However, if the number of partitions changes, the hash result will also change. This means Kafka may assign the same key to a different partition after the change.
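The three steps above can be sketched in Python. For simplicity this uses a stand-in 32-bit hash built from the standard library rather than Kafka's actual murmur2, but the mask-and-modulo logic is the same:

```python
import hashlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    """Sketch of Kafka's default keyed partitioning.
    We hash the key bytes (hashlib here as a stand-in for murmur2),
    mask the sign bit so the value is non-negative, then take it
    modulo the partition count."""
    h = int.from_bytes(hashlib.md5(key).digest()[:4], "big")  # 32-bit hash
    return (h & 0x7FFFFFFF) % num_partitions

# The same key always maps to the same partition...
assert choose_partition(b"user-42", 4) == choose_partition(b"user-42", 4)

# ...but changing the partition count can move a key to a different partition,
# because the modulo is taken against a different divisor.
print(choose_partition(b"user-42", 4), choose_partition(b"user-42", 5))
```

Because a real Kafka producer uses murmur2 rather than the stand-in hash here, the actual partition numbers will differ, but the determinism and the sensitivity to partition count behave the same way.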
The topic structure, contents, and how producers interact with partitions are depicted below.
This is the nature of hashing: records with different keys may still end up in the same partition, either because their hashes collide or because the modulo step maps different hash values to the same partition number.
Consumer groups and load balancing
Kafka’s producer-topic-consumer topology follows a flexible and highly scalable, multipoint-to-multipoint model: multiple producers and multiple consumers can interact with the same topic at the same time.
A consumer is a process (or thread) that attaches to a Kafka cluster using a client library. When multiple consumers subscribe to a topic and belong to the same consumer group, Kafka distributes the partitions among the consumers in the group. Each consumer reads from a different subset of partitions in the topic.
Let’s walk through an example.
Suppose we have a topic called T1 with four partitions. If we create a single consumer C1 in group G1 and subscribe it to topic T1, then C1 will receive messages from all four partitions.
But, if we add a second consumer C2 to the same group G1, Kafka will rebalance the partitions. Each consumer will now read from two partitions. For example:
- C1 reads from partitions 0 and 2
- C2 reads from partitions 1 and 3
If we keep adding more consumers to the same group G1, and the number of consumers exceeds the number of partitions, some consumers will remain idle. This happens because Kafka guarantees that each partition is assigned to only one consumer within a group.
Now suppose a new consumer group G2 is introduced with a single consumer. That consumer will receive all messages from topic T1, completely independently of what group G1 is doing.
If G2 has multiple consumers, Kafka will again split the partitions among them, just like it does for G1. However, the key point is that each consumer group receives the full stream of messages, regardless of other groups.
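The balancing behavior described above can be sketched with a simple round-robin assignment. Kafka's actual assignors (range, round-robin, sticky, among others) are more involved, but the sketch captures the guarantees: each partition goes to exactly one consumer per group, and surplus consumers sit idle.

```python
def assign_partitions(partitions, consumers):
    """Round-robin partitions across the consumers of one group.
    Illustrative sketch, not Kafka's real assignment protocol."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

partitions = [0, 1, 2, 3]  # topic T1 with four partitions

# One consumer in group G1 receives everything.
print(assign_partitions(partitions, ["C1"]))        # {'C1': [0, 1, 2, 3]}

# Two consumers split the partitions, matching the example above.
print(assign_partitions(partitions, ["C1", "C2"]))  # {'C1': [0, 2], 'C2': [1, 3]}

# More consumers than partitions: the extra consumer stays idle.
print(assign_partitions(partitions, ["C1", "C2", "C3", "C4", "C5"]))
# C5 ends up with an empty assignment
```

A second group, such as G2, would simply run the same assignment independently over the full partition list, which is why every group sees the complete stream.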
This works because of how Kafka manages offsets:
- Offsets are managed by consumers, but they are stored in a special Kafka topic called __consumer_offsets.
- Offsets are tracked per (consumer group, topic, partition) combination.
- This combination is used as the key when storing offsets, which allows Kafka to:
  - Keep a group's offsets organized in the same partition of the __consumer_offsets topic.
  - Use log compaction to remove outdated offset records.
  - Efficiently manage progress tracking for each consumer group.
By default, the __consumer_offsets topic is configured with 50 partitions.
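The offset bookkeeping can be pictured as a compacted key-value store: each commit for a (group, topic, partition) key overwrites the previous one, which is the effect log compaction produces on __consumer_offsets. A hypothetical sketch of those semantics:

```python
# Sketch of offset-tracking semantics (hypothetical, not Kafka internals).
# Keys are (group, topic, partition); compaction keeps only the latest value.

committed = {}  # behaves like the compacted view of __consumer_offsets

def commit(group, topic, partition, offset):
    """Record a group's progress; a later commit supersedes an earlier one."""
    committed[(group, topic, partition)] = offset

commit("G1", "T1", 0, 5)
commit("G1", "T1", 0, 9)   # overwrites the earlier commit for this key
commit("G2", "T1", 0, 2)   # a different group tracks its own progress

print(committed[("G1", "T1", 0)])  # -> 9
print(committed[("G2", "T1", 0)])  # -> 2
```

Because each group commits under its own key, groups never interfere with one another's progress, which is what makes the "every group gets the full stream" behavior possible.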
Th-th-that's all, folks!
Apache Kafka is much more than just a messaging system. It is a distributed event streaming platform designed to handle large-scale data with high throughput, reliability, and scalability.
The key takeaway is simple: Kafka is built for movement, durability, and scalability of data. Once you understand these core concepts, everything else, from stream processing to real-time analytics, becomes much easier to reason about.
If you're starting with Kafka, these fundamentals are the base for everything you will build on top of it.