Introduction
Imagine you are running a massive online store. Every second, hundreds of users are clicking items, adding them to carts and making purchases. Your inventory system needs to know about the purchases, your recommendation engine needs to know about the clicks and your security system needs to monitor for fraud.
If you connect every single system directly to each other, you get a tangled, unmanageable mess.
This is the exact problem Apache Kafka was built to solve. Instead of systems talking directly to each other, they all send their data to a central hub (Kafka) and any system that needs that data simply reads it from the hub. This creates a completely decoupled architecture; the system sending the data doesn't need to know anything about the systems receiving it.
This article covers everything you need to know to understand Apache Kafka, from its history to running your first commands.
What is Apache Kafka?
Apache Kafka is an open-source distributed event streaming platform.
Let's break that down.
• Event - A record of something that happened (e.g. User A clicked button B at 12:00 PM). Events, also known as messages or records, are immutable data structures consisting of a key, a value, a timestamp and optional headers, and they are continuously transmitted through Kafka.
• Streaming - The data flows continuously in real-time, rather than waiting to be processed in daily batches. Kafka allows you to publish (write) and subscribe to (read) streams of events, store them indefinitely, and process them as they occur.
• Distributed - It doesn't just run on one computer. It runs across many computers working together, which makes it fast and extremely hard to bring down.
Think of Kafka as a massive, high-speed, highly organized post office. Senders drop off packages (data) and the post office holds onto them until the receivers come to pick them up. This complete journey of data, from generation and publishing to storage, consumption, and eventual deletion, represents the Kafka lifecycle.
Kafka was originally created at LinkedIn by software engineers Jay Kreps, Neha Narkhede and Jun Rao, and open-sourced in 2011.
LinkedIn was generating billions of data points daily (profile views, messages, connections), and their existing databases and message queues couldn't keep up. They needed a system that could handle these massive amounts of data in real-time without slowing down.
It was named after the author Franz Kafka because the software is a system optimized for writing, and Franz Kafka was a writer. LinkedIn later donated it to the Apache Software Foundation, making it free and open-source.
Core Characteristics of Kafka
Thousands of companies, including Netflix, Uber and Airbnb, rely on Kafka because of four core characteristics.
1. High Throughput - Kafka can handle millions of messages per second.
2. Scalable - Kafka expands seamlessly without any downtime. If you need more power, you just add another computer (node) to the Kafka system.
3. Permanent (Durable) - Unlike traditional message queues that delete a message once it is read, Kafka writes data to a hard drive and keeps it for a set amount of time (days, weeks, or forever).
4. Fault-Tolerant - Kafka keeps copies (replicas) of your data on different computers. If one computer crashes or catches fire, the data is still safe on another, and the system automatically switches to the backup without missing a beat.
When to Use Kafka
• Real-time tracking - Tracking website activity (page views, clicks) as it happens.
• Log aggregation - Collecting logs from hundreds of different servers into one central place for monitoring, debugging, or auditing.
• Location tracking - Apps like Uber use Kafka to process the real-time GPS locations of drivers and riders.
• Stream processing - Transforming data on the fly, such as using the Kafka Streams API to convert currencies in real-time as transactions happen (the sketch after this list shows the idea).
• Event sourcing - Storing state-changing events. Instead of overwriting existing data to save the current state, you permanently record every individual action that led to that state in an append-only log (a database doesn't update a shopping cart to show its final contents of 1 Hat; it records the exact history: Added Shirt, Added Hat, Removed Shirt).
• Data Integration - Using Kafka Connect to continuously pull data from an old database and push it into a new cloud warehouse. Kafka Connect is a framework for scalably and reliably streaming data between Kafka and external systems without writing custom code.
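Kafka Streams itself is a Java library, but the heart of stream processing, consume, transform, produce, fits in a few lines of Python. Below is a minimal sketch using the third-party kafka-python client (pip install kafka-python); the topic names, the amount_usd field and the fixed exchange rate are made-up examples, not part of any real schema.

from kafka import KafkaConsumer, KafkaProducer
import json

# Read JSON events in, write transformed JSON events out
consumer = KafkaConsumer("transactions_usd",
                         bootstrap_servers="localhost:9092",
                         value_deserializer=lambda b: json.loads(b.decode("utf-8")))
producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         value_serializer=lambda v: json.dumps(v).encode("utf-8"))

for record in consumer:                            # process each transaction as it arrives
    txn = record.value
    txn["amount_eur"] = txn["amount_usd"] * 0.92   # hypothetical fixed rate, for illustration
    producer.send("transactions_eur", txn)         # publish the converted event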
When NOT to Use Kafka
• You just need a standard database to search for specific records (use a SQL or NoSQL database instead). Kafka is designed for sequential reading, not searching for a specific item.
• You only have a small amount of data (Kafka is complex; setting it up for low traffic is overkill).
• You need simple task routing between a handful of services (a traditional message queue is a simpler fit).
Key Kafka Concepts and Rules
To understand Kafka, you need to know its vocabulary and the strict rules that govern how data is managed.
• Event (Message/Record) - The actual piece of data: an immutable record of something that happened. An event consists of a Key, a Value, a Timestamp, and optional metadata headers. The Message Value is the core payload containing the business data, while the Message Key is an optional identifier used for routing the event to a specific partition. Before an event is sent over the network, it is translated into a binary format (bytes) in a process called serialization.
Serialization is the crucial step of converting these readable data objects into binary bytes for efficient network transmission and storage. Conversely, Deserialization is the reverse process used by consumers to convert those binary bytes back into readable data.
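Here is what that looks like in practice. A minimal sketch with the third-party kafka-python client: the producer's serializers turn a string key and a dict value into bytes on every send, and the consumer's deserializer reverses the process. The user_clicks topic and the event fields are just examples.

from kafka import KafkaProducer, KafkaConsumer
import json

# Serialization: key -> UTF-8 bytes, value -> JSON bytes, applied on every send
producer = KafkaProducer(bootstrap_servers="localhost:9092",
                         key_serializer=lambda k: k.encode("utf-8"),
                         value_serializer=lambda v: json.dumps(v).encode("utf-8"))
producer.send("user_clicks", key="user_a", value={"button": "B", "time": "12:00"})
producer.flush()

# Deserialization: JSON bytes -> dict, applied on every record read
consumer = KafkaConsumer("user_clicks",
                         bootstrap_servers="localhost:9092",
                         value_deserializer=lambda b: json.loads(b.decode("utf-8")))
for event in consumer:              # event.value is a readable dict again
    print(event.value["button"])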
• Producer - The application that sends data into Kafka (e.g. the website frontend sending click data). Producers send/write (publish) messages to topics and decide which partition the data should be sent to.
• Consumer - The application that reads data from Kafka (e.g. the analytics dashboard). Consumers receive/read (subscribe) messages from topics.
• Topic - A named stream of events: a logical category or channel where related events are continuously published and stored.
If you send user clicks to Kafka, you would send them to a topic named user_clicks. Consumers read from specific topics. Unlike traditional queues, topics have a retention policy: you can configure a topic to keep data for 24 hours, 7 days, or until the disk reaches a certain size. Once the limit is hit, the oldest data is automatically deleted.
• Partition - This is the secret to Kafka's parallelism. A single topic is split into multiple parts called Partitions. They spread the topic's data across multiple brokers/servers, enabling multiple consumers to read data in parallel. This is the mechanism that allows Kafka to scale and process massive amounts of data concurrently.
Imagine a grocery store with only one checkout lane: a long line forms. Open 10 checkout lanes (partitions) and 10 times as many people can check out at once.
- Rule 1 - Order is only guaranteed within a single partition. If you send messages to Partition 0 and Partition 1, you cannot guarantee which one gets read first. But messages inside Partition 0 are read in the exact order they arrived.
- Rule 2 - Keys determine the partition. If a Producer sends an event without a key, Kafka spreads events across partitions (round-robin style routing). If an event has a key (like customer_id_123), Kafka runs it through a hash function to ensure every event with that same key always lands on the exact same partition. This guarantees all purchases by customer_123 are processed in the correct order, as the sketch below demonstrates.
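You can watch Rule 2 happen by asking the broker which partition each event landed on. A short sketch with kafka-python, assuming the user_clicks topic from the earlier example exists and has more than one partition:

from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

# Every event with the same key is hashed to the same partition
for i in range(3):
    future = producer.send("user_clicks",
                           key=b"customer_id_123",
                           value=f"purchase {i}".encode("utf-8"))
    metadata = future.get(timeout=10)   # wait for the broker's acknowledgement
    print(f"partition {metadata.partition}, offset {metadata.offset}")
# Prints the same partition number three times, with increasing offsets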
• Offset - Inside a partition, every single message is assigned a unique, sequential ID number called an Offset (e.g., 0, 1, 2, 3...). These offsets act as unique, ever-increasing integers used to accurately maintain reading positions. Offsets only go up and are never reused. Consumers use offsets to keep a bookmark of their specific reading position so they can resume reading safely after a crash or restart.
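Offsets are what make restarts safe. In this minimal kafka-python sketch, setting a group_id makes the client periodically commit its position back to Kafka, so rerunning the script resumes from the last committed offset instead of re-reading everything:

from kafka import KafkaConsumer

consumer = KafkaConsumer("user_clicks",
                         bootstrap_servers="localhost:9092",
                         group_id="click_readers",      # committed offsets are stored per group
                         auto_offset_reset="earliest",  # where to begin if no offset is stored yet
                         enable_auto_commit=True)       # bookmark the position automatically

for msg in consumer:
    print(f"partition {msg.partition}, offset {msg.offset}: {msg.value}")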
• Consumer Groups - A team of consumers working together to read a topic.
- The Golden Rule - A single partition can only be read by one consumer within the same group. If a topic has 4 partitions, and your group has 4 consumers, each gets exactly one partition. If you have 5 consumers in the group, the 5th one sits idle. This is how Kafka scales reading perfectly without processing the same message twice.
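A quick way to see the Golden Rule for yourself, sketched with kafka-python: run the script below in two terminals. Both processes join the same group, so Kafka splits the topic's partitions between them. A third copy started with a different group_id would independently receive every message.

from kafka import KafkaConsumer

consumer = KafkaConsumer("user_clicks",
                         bootstrap_servers="localhost:9092",
                         group_id="click_readers")

consumer.poll(timeout_ms=5000)   # joining the group happens on the first poll
print("my partitions:", consumer.assignment())
for msg in consumer:
    print(f"partition {msg.partition}: {msg.value}")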
• Broker/Server - A single Kafka node. This individual node is responsible for receiving messages from producers, storing them on disk, and serving them to consumers upon request.
• Cluster - A group of Brokers (servers) working together. Brokers are linked together to operate seamlessly as a single distributed network, providing fault tolerance, high availability, and massive scale. Within a cluster, data is duplicated across brokers according to a Replication Factor: a setting that defines the exact number of copies of each partition that must be maintained across different brokers to ensure fault tolerance. If your Replication Factor is 3, three different brokers hold a copy of the data.
For each partition, one broker is assigned the Leader (primary broker) and exclusively handles all read and write requests for that specific partition to ensure strict data consistency.
The other brokers become Followers (backup brokers) and they passively replicate the data from the Leader (acting as In-Sync Replicas or ISR) so they can seamlessly and instantly take over if the Leader fails.
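Replication is set per topic when the topic is created. A minimal sketch using kafka-python's admin client; the topic name is hypothetical, and a replication factor of 3 only works on a cluster that actually has at least three brokers:

from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# 6 partitions for parallel reads, 3 copies of each partition for fault tolerance
admin.create_topics([NewTopic(name="orders",
                              num_partitions=6,
                              replication_factor=3)])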
• KRaft (Kafka Raft) - The internal manager of Kafka. KRaft is the modern, built-in consensus protocol that functions as the cluster manager: it keeps track of which brokers are online and which broker holds which partition, manages leader elections and cluster metadata, and handles recovery if a broker crashes.
Historically, an external coordination service called ZooKeeper acted as the cluster manager. Kafka deprecated it and removed it entirely in Kafka 4.0, replacing it with KRaft, which is built directly into Kafka to make managing the state of the cluster faster and simpler.
How Kafka Works
Here is a detailed flow of how data moves through Kafka, showing Partitions, Offsets, and Consumer Groups:

NB: Two different Consumer Groups can read the exact same messages without interfering with each other. Because Kafka stores the data on disk, Consumer Group 2 (Receipt System) can read the message hours after Consumer Group 1 (Inventory System) read it, simply by starting at an older Offset.
Kafka Architecture
To understand Apache Kafka’s architecture, we need to examine its internal design and core components.
Kafka's architecture explains how Kafka does what it does. It is designed to do three things flawlessly: never lose data, handle millions of messages a second, and scale up without turning the system off.
To understand how it works, we can break Kafka’s architecture into three main areas.
1. The Network Architecture (The Physical Components)
Kafka is a distributed system, meaning it is not just one big computer. It is a network of smaller machines working together as a single unit, made up of the Kafka Cluster, Brokers, the KRaft Controller (the manager), Producers and Consumers.
2. The Data Architecture/Storage (Logical Components)
Kafka does not store data in tables like a standard database. It stores data using a concept called an Append-Only Commit Log.
Imagine a physical logbook. When a new message arrives, Kafka writes it at the very bottom of the page. You cannot erase, edit, or insert a message in the middle. You can only append (add) to the end.
This architectural choice is the secret to Kafka's speed. Because it never wastes time searching for a record to update or delete, writing to Kafka is incredibly fast.
How Topics and Partitions fit into the log:
- A Topic is just a logical name for a group of these logbooks.
- A Partition is the actual physical logbook file sitting on a Broker's hard drive.
- An Offset is the line number in that logbook.
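The whole storage model fits in a toy data structure. This is not Kafka source code, just a plain-Python illustration of an append-only commit log: appends only go to the end and hand back an offset, reads start from any offset, and nothing is ever edited in place.

class CommitLog:
    """A toy append-only log: think of one of these per partition."""

    def __init__(self):
        self._records = []               # the 'logbook'; it only grows at the end

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1    # the new record's offset

    def read_from(self, offset):
        return self._records[offset:]    # sequential read, no searching

log = CommitLog()
log.append("Added Shirt")    # offset 0
log.append("Added Hat")      # offset 1
log.append("Removed Shirt")  # offset 2
print(log.read_from(1))      # ['Added Hat', 'Removed Shirt']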
3. High Availability Architecture (Fault Tolerance)
Because hardware fails, Kafka’s architecture assumes that brokers will eventually crash. It protects your data using Replication.
When you create a topic, you set a Replication Factor. If you set a replication factor of 3, Kafka guarantees that three different Brokers will have an exact copy of the data.
For every single partition, Kafka elects one Broker to be the Leader, and it is the Leader that Producers and Consumers talk to. The other Brokers become Followers. They do not talk to producers or consumers; they copy everything the Leader does in real time.
If the Leader broker crashes, the KRaft Controller notices immediately. It instantly promotes one of the Followers to become the new Leader. The Producers and Consumers automatically connect to the new Leader, and the system continues without dropping a single message.
4. The Philosophy (Core Design Rules)
Kafka’s architecture relies on a few specific design choices that make it different from almost every other messaging system.
A. Smart Consumers, Dumb Brokers/Servers
In traditional message queues, the server (the queue) is smart. It remembers which consumer read which message and deletes the message after it is read. This puts a heavy workload on the server.
Kafka flipped this architecture. The Kafka Broker is dumb while the Consumer is smart. The server just stores the data and deletes it after a certain time, while the consumer tracks its own Offset (its place in the logbook). This takes a massive load off the brokers, allowing them to handle millions of messages per second.
B. The Pull Model
Many systems push data to the consumers. If a sudden spike in traffic happens, the system pushes so much data that it crashes the consumer application.
Kafka uses a Pull architecture. Producers push data into Kafka, but Kafka never pushes data to Consumers. The Consumers pull data from Kafka only when they are ready for it. If the consumer gets overloaded, it just slows down its pulling. The data waits safely on Kafka’s hard drive until the consumer catches up.
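The pull model is visible right in the client API. In this kafka-python sketch, nothing arrives until the consumer explicitly asks, and the consumer caps how much it accepts per request:

from kafka import KafkaConsumer

consumer = KafkaConsumer("user_clicks",
                         bootstrap_servers="localhost:9092",
                         group_id="click_readers")

while True:
    # The consumer decides when to ask and how many records to take
    batch = consumer.poll(timeout_ms=1000, max_records=100)
    for partition, records in batch.items():
        for record in records:
            print(record.value)
    # If processing is slow, we simply poll less often; unread data
    # waits safely on the broker's disk until we catch up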
C. Using Disk instead of RAM (The OS Page Cache)
While most messaging systems try to keep data in RAM (memory) because it is faster, Kafka writes straight to the hard drive. It stays fast because it relies on the Operating System's Page Cache: the OS automatically uses free RAM to temporarily hold the most recently written data. When a consumer asks for the latest data, Kafka serves it straight from that OS memory without ever touching the physical disk, giving you memory-like speeds with hard-drive-like storage capacity.
Visualizing the Architecture
Here is how all these pieces connect in a real-world scenario.

How to Install Apache Kafka (Locally)
To run Kafka on your computer, you need Java installed (Kafka 4.x requires Java 17 or newer).
Step 1: Download Kafka
Go to the official Apache Kafka downloads page (https://kafka.apache.org/downloads) and download the latest .tgz file (binaries).
Step 2: Extract the file
Open your terminal and extract the archive:
tar -xzf kafka_2.13-4.2.0.tgz
# you can rename the folder
mv kafka_2.13-4.2.0/ kafka
# Navigate into the folder
cd kafka
Step 3: Generate a Cluster UUID
Since we are using modern Kafka (KRaft mode), first generate a unique ID for the cluster:
KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"
Step 4: Format the Storage
bin/kafka-storage.sh format --standalone -t $KAFKA_CLUSTER_ID -c config/server.properties
NB: In Kafka 4.x (which has no ZooKeeper), the KRaft configuration lives at config/server.properties, and the --standalone flag formats a single-node cluster. If you downloaded an older 3.x release, use config/kraft/server.properties and drop --standalone.
Step 5: Start the Kafka Server
bin/kafka-server-start.sh config/server.properties
Leave this terminal window open. Kafka is now running!
Key Kafka Commands for Beginners
Open a new terminal window (keep the server running in the first one) to run these commands.
1. Create a Topic
Before you can send data, you need to create a topic. Let's create one called first_topic.
bin/kafka-topics.sh --create --topic first_topic --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1
NB: localhost:9092 is the default address where your local Kafka broker is running. We set --partitions 3 to split the topic into three lanes for speed, and --replication-factor 1 because we only have one local broker running right now.
2. Start a Producer (Send Data)
This command opens a prompt where you can type messages.
bin/kafka-console-producer.sh --topic first_topic --bootstrap-server localhost:9092
Once it starts, type a few lines and hit Enter after each.
Hello Kafka!
This is my first message.
3. Start a Consumer (Read Data)
Open a third terminal window and run the command below to read the data.
bin/kafka-console-consumer.sh --topic first_topic --from-beginning --bootstrap-server localhost:9092
The --from-beginning flag tells the consumer to start reading from Offset 0. You will instantly see the messages you typed in the Producer terminal appear here. If you go back to the Producer terminal and type a new message, it will pop up in the Consumer terminal in real-time.
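The console scripts are thin wrappers around ordinary client libraries, so a natural next step is to do the same round trip in code. Here is a minimal sketch of producing to and consuming from first_topic using the third-party kafka-python client (pip install kafka-python):

from kafka import KafkaProducer, KafkaConsumer

# Same job as kafka-console-producer.sh
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("first_topic", b"Hello Kafka, from Python!")
producer.flush()

# Same job as kafka-console-consumer.sh --from-beginning
consumer = KafkaConsumer("first_topic",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest",
                         consumer_timeout_ms=5000)   # stop iterating after 5s of silence
for msg in consumer:
    print(f"offset {msg.offset}: {msg.value.decode('utf-8')}")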
Conclusion
Apache Kafka is the nervous system of modern data engineering. By sitting in the middle of your architecture, it decouples the applications that create data from the applications that need to use data.
While it can be complex to manage at massive scale, the core concept remains remarkably simple: Producers write events to Topics, those Topics are split into Partitions for speed, and Consumers use Offsets to read those events whenever they are ready. By leveraging Consumer Groups, Kafka ensures that data is processed efficiently, in parallel, and at enormous scale.