Introduction
Apache Kafka has emerged as a cornerstone technology for building scalable, real-time data pipelines and event-driven architectures. Originally developed at LinkedIn and open-sourced in 2011, Kafka is a distributed streaming platform designed to handle massive volumes of data with low latency and high throughput. This article explores Kafka’s core concepts, its applications in data engineering, and best practices for running Kafka in production, with insights into how companies like Netflix, LinkedIn, and Uber leverage it.
What is Apache Kafka?
Apache Kafka is an open-source distributed event streaming platform that serves as a robust system for building real-time data pipelines and streaming applications. It enables applications to publish and subscribe to streams of events, making it ideal for workloads that must process large volumes of data in real time, such as data ingestion, real-time analytics, and event-driven architectures. Kafka's key features include high throughput, scalability, fault tolerance through data replication, and durable, ordered message storage within topics.
Apache Kafka as an event streaming platform
Kafka combines three key capabilities so you can implement your use cases for event streaming end-to-end with a single battle-tested solution:
- To publish (write) and subscribe to (read) streams of events, including continuous import/export of your data from other systems.
- To store streams of events durably and reliably for as long as you want.
- To process streams of events as they occur or retrospectively.
And all this functionality is provided in a distributed, highly scalable, elastic, fault-tolerant, and secure manner. Kafka can be deployed on bare-metal hardware, virtual machines, and containers, and on-premises as well as in the cloud. You can choose between self-managing your Kafka environments and using fully managed services offered by a variety of vendors.
How does Kafka work?
Kafka is a distributed system consisting of servers and clients that communicate via a high-performance TCP network protocol.
Servers
Kafka is run as a cluster of one or more servers that can span multiple datacenters or cloud regions. Some of these servers form the storage layer, called the brokers. Other servers run Kafka Connect to continuously import and export data as event streams, integrating Kafka with your existing systems such as relational databases as well as other Kafka clusters. To let you implement mission-critical use cases, a Kafka cluster is highly scalable and fault-tolerant: if any of its servers fails, the others take over its work to ensure continuous operation without any data loss.
Clients
They allow you to write distributed applications and microservices that read, write, and process streams of events in parallel, at scale, and in a fault-tolerant manner even in the case of network problems or machine failures. Kafka ships with some such clients included, which are augmented by dozens of clients provided by the Kafka community: clients are available for Java and Scala including the higher-level Kafka Streams library, for Go, Python, C/C++, and many other programming languages as well as REST APIs.
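For instance, a minimal fault-tolerant write with the community kafka-python client might look like the sketch below. The library, the topic name, and a broker on localhost:9092 are assumptions for illustration, not part of Kafka itself:
from kafka import KafkaProducer

# A sketch, not the official client: kafka-python is a community library
# (pip install kafka-python); the broker address and topic name are assumed.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    acks="all",   # wait until all in-sync replicas have the record
    retries=5,    # retry transient failures such as brief network problems
)
future = producer.send("events", key=b"user-42", value=b"page_view")
metadata = future.get(timeout=10)  # block until the broker acknowledges
print(f"stored in partition {metadata.partition} at offset {metadata.offset}")
producer.close()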
Components of Kafka
- Producer - Producers are applications or services that publish (write) messages into Kafka topics. They decide which topic the message goes to, and which partition within it: either in round-robin fashion when no key is set, or by hashing the message key.
- ZooKeeper - ZooKeeper is an open-source distributed coordination service that helps manage and synchronize large clusters of distributed systems by providing a reliable place where services can keep configuration, naming, synchronization, and group information.
- Consumer - Consumers are applications that subscribe to (read) messages from Kafka topics. Consumers are organized into consumer groups to share the load of message consumption.
- Topic - A topic is like a logical channel or category where messages are stored. Producers write messages into topics, and consumers read from them.
- Partition - Topics are split into partitions to allow parallelism and scalability. Each partition is an ordered, immutable log of records. Messages inside partitions are identified by a unique offset.
- Broker - A broker is a Kafka server that stores and serves messages. Acting as a central hub, the broker accepts messages from producers, assigns them unique offsets, and stores them durably on disk.
- Cluster - A Kafka cluster is a group of brokers working together. It ensures data replication, fault tolerance, and high availability.
- Offset - An offset is a unique ID assigned to each message in a partition. It helps consumers keep track of which messages have been read.
- Each Kafka broker can host multiple topics, and each topic is divided into multiple partitions for scalability and fault tolerance.
Figure: Kafka topic partitions layout.
- Consumers use offsets to read messages sequentially from oldest to newest and, after recovering from a failure, resume from the last committed offset, as the sketch after this list illustrates.
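To make these components concrete, here is a minimal sketch using the community kafka-python client (an assumption: install it with pip install kafka-python, with a local broker on localhost:9092 as in the setup below; the topic and group names are made up). It produces keyed messages, then reads them back from a consumer group, printing each record’s partition and offset:
from kafka import KafkaProducer, KafkaConsumer

# Producer: records that share a key always land in the same partition.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
for i in range(3):
    producer.send("demo-topic", key=b"sensor-1", value=f"reading-{i}".encode())
producer.flush()
producer.close()

# Consumer: joins a consumer group; Kafka tracks its committed offset per partition.
consumer = KafkaConsumer(
    "demo-topic",
    bootstrap_servers="localhost:9092",
    group_id="demo-group",         # members of one group split the partitions
    auto_offset_reset="earliest",  # start from the oldest record on first run
    consumer_timeout_ms=5000,      # stop iterating after 5 s with no new records
)
for record in consumer:
    print(f"partition={record.partition} offset={record.offset} "
          f"key={record.key} value={record.value}")
consumer.close()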
Set up Kafka on the Terminal
Let’s dive into installing and running Kafka directly from the terminal.
1. Install Java
Kafka requires Java (JDK 11 or 17). Let’s install Java 11:
sudo apt update
sudo apt install openjdk-11-jdk -y
Confirm Java Installation
Verify that Java is installed correctly:
java -version
You should see output similar to:
openjdk version "11.0.xx" ...
2. Download and Extract Kafka
Let’s download and set up Kafka 3.9.1 with Scala 2.13.
Download Kafka
wget https://downloads.apache.org/kafka/3.9.1/kafka_2.13-3.9.1.tgz
Extract the downloaded file
tar -xvzf kafka_2.13-3.9.1.tgz
Delete the archive to free up space
rm kafka_2.13-3.9.1.tgz
Rename the extracted folder to something simpler
mv kafka_2.13-3.9.1 kafka
Change into the Kafka directory
cd kafka
3. Start ZooKeeper and Kafka
Start ZooKeeper (in one terminal):
bin/zookeeper-server-start.sh config/zookeeper.properties
ZooKeeper starts in the foreground and, by default, listens on port 2181.
Start Kafka (in another terminal)
Now open a second terminal, navigate to the Kafka folder, and run:
bin/kafka-server-start.sh config/server.properties
The broker starts in the foreground and, by default, listens on port 9092. Keep both terminals running.
4. Create a Kafka Topic
Let’s create a topic called exams.
bin/kafka-topics.sh --create \
--topic exams \
--bootstrap-server localhost:9092 \
--partitions 1 \
--replication-factor 1
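The same topic can also be created programmatically. Here is a hedged sketch with kafka-python’s admin client (the library is an assumption; the parameters mirror the command above):
from kafka.admin import KafkaAdminClient, NewTopic
from kafka.errors import TopicAlreadyExistsError

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
try:
    # Mirrors the CLI flags: --partitions 1 --replication-factor 1
    admin.create_topics([NewTopic(name="exams", num_partitions=1, replication_factor=1)])
    print("topic 'exams' created")
except TopicAlreadyExistsError:
    print("topic 'exams' already exists")
finally:
    admin.close()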
5. Start a Kafka Producer
Next, let’s send some messages to the exams topic using Kafka’s built-in console producer.
In a new terminal window, run:
bin/kafka-console-producer.sh --topic exams --bootstrap-server localhost:9092
Once it starts, you can type messages directly into the terminal and hit Enter to send each message to the Kafka topic.
Each line you type gets published to the exams topic.
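The console producer is convenient for experiments; an application would publish the same way from code. A rough kafka-python equivalent (library assumed installed; the sample messages are made up):
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
# Each send() publishes one record to the exams topic, like one line typed above.
for line in ["math: 78", "physics: 91", "chemistry: 84"]:
    producer.send("exams", value=line.encode("utf-8"))
producer.flush()  # ensure everything reaches the broker before exiting
producer.close()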
6. Start a Kafka Consumer
To read the messages you just sent, start a Kafka consumer that listens to the exams topic.
In another terminal window, run:
bin/kafka-console-consumer.sh --topic exams --from-beginning --bootstrap-server localhost:9092
This will display all messages from the beginning of the topic — including the ones you just produced.
You should see every message you typed into the producer echoed in this terminal.
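Programmatically, the same read looks like the sketch below with kafka-python (again an assumed community library); auto_offset_reset="earliest" plays the role of --from-beginning for a consumer with no committed offsets:
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "exams",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",  # read from the start, like --from-beginning
    consumer_timeout_ms=5000,      # leave the loop after 5 s of silence
)
for record in consumer:
    print(record.value.decode("utf-8"))
consumer.close()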
Use Cases
Netflix
Netflix needs no introduction. One of the world’s most innovative and robust OTT platforms, it uses Apache Kafka in its Keystone pipeline project to push and receive notifications.
Netflix runs two types of Kafka clusters: Fronting Kafka, used for data collection and buffering by producers, and Consumer Kafka, used for routing content to consumers.
All of us know that the amount of data processed at Netflix is huge: it uses 36 Kafka clusters (24 Fronting Kafka and 12 Consumer Kafka) to handle almost 700 billion messages a day.
Netflix has achieved a data loss rate of 0.01% through this Keystone pipeline project, and Apache Kafka is a key driver in keeping data loss that low.
Netflix also planned a move to Kafka version 0.9.0.1 to improve resource utilization and availability.
Uber
Across a lot of parameters, a giant in the travel industry like Uber needs a system that is fault-tolerant and uncompromising about errors.
Uber uses Apache Kafka to run their driver injury protection program in more than 200 cities.
Drivers registered on Uber pay a premium on every ride, and the program has run successfully thanks to the scalability and robustness of Apache Kafka.
It has achieved this success largely through an unblocked batch-processing method, which allows the Uber engineering team to sustain steady throughput.
Multiple retries have allowed the Uber team to segment messages and achieve real-time progress updates and flexibility.
Uber plans to introduce a framework around Kafka through which it can improve uptime and grow, scale, and operate the program without costing extra developer time.
LinkedIn
LinkedIn, one of the world’s most prominent B2B social media platforms, handles well over a trillion messages per day.
And we thought the number of messages handled by Netflix was huge. This figure is mind-blowing, and LinkedIn has seen its volume rise more than 1200x over the last few years.
LinkedIn uses different clusters for different applications, so that the failure of one application cannot harm the other applications in the cluster.
Broker Kafka clusters at LinkedIn help them differentiate and whitelist certain users, allowing those users higher bandwidth and ensuring a seamless user experience.
LinkedIn plans to achieve a lower data loss rate through MirrorMaker, which is used as the intermediary that mirrors data between Kafka clusters.
At present, there is a limit on message size of 1 MB, but through the Kafka ecosystem LinkedIn plans to enable publishers and users to send messages well over that limit in the future.
Messaging
Kafka works well as a replacement for a more traditional message broker. Message brokers are used for a variety of reasons (to decouple processing from data producers, to buffer unprocessed messages, etc.). In comparison to most messaging systems, Kafka has better throughput, built-in partitioning, replication, and fault tolerance, which makes it a good solution for large-scale message-processing applications.
In our experience messaging uses are often comparatively low-throughput, but may require low end-to-end latency and often depend on the strong durability guarantees Kafka provides.
(https://www.knowledgenile.com/blogs/apache-kafka-use-cases)
Resources
Apache Kafka - Fundamentals - https://www.tutorialspoint.com/apache_kafka/apache_kafka_fundamentals.htm
Apache Kafka Documentation - https://kafka.apache.org/documentation/#uses
Introduction to Apache Kafka - https://kafka.apache.org/intro
Best Apache Kafka Use Cases - https://www.knowledgenile.com/blogs/apache-kafka-use-cases