Brian Ouchoh

Apache Kafka Deep Dive: Core Concepts, Data Engineering Applications, and Real-World Production Practices

For real-time or near-real-time data processing, Apache Kafka is a critical tool for the data engineer. Apache Kafka is a distributed streaming platform (a server-side application) that provides real-time messaging and data streaming between systems. A key feature of Apache Kafka is that it can handle millions of events per second with millisecond-level latency.
Alternatives to Apache Kafka include RabbitMQ, Pulsar, AWS Kinesis, Google Pub/Sub, and Azure Event Hubs.

1. Kafka Architecture
The Kafka architecture consists of the following components:
Brokers (servers that store/serve data)
Topics & partitions (how data is organized)
Producers & consumers (who sends and reads data)
ZooKeeper (legacy) or KRaft mode (cluster coordination)
Replication, logs, offsets, serialization, connectors, streams, etc.

Brokers
Brokers are Kafka servers responsible for storing and serving data. Each broker can handle thousands of clients and manage multiple partitions across topics.
One broker can handle thousands of partitions, but production clusters usually have 3–5+ brokers for fault tolerance.

In production, several brokers work together. A group of brokers working together is called a Kafka cluster.

ZooKeeper vs KRaft Mode
Since a cluster is a group of brokers, and each broker holds related data, how does Kafka coordinate all of these brokers?
Traditionally, Kafka relied on Apache ZooKeeper for cluster coordination, managing broker metadata, and leader election.
More recently, Kafka has been transitioning to KRaft mode (Kafka Raft), a ZooKeeper-less architecture that simplifies cluster management and improves scalability.

Scaling an Apache Kafka cluster is achieved by adding brokers and redistributing partitions across the cluster.

Bonus:
In Apache Kafka, managing cluster metadata and leader election refers to how Kafka coordinates which broker is responsible for which data and ensures smooth operation when something changes or fails.

Cluster Metadata

Metadata = information about the Kafka cluster’s state.

Includes:

  • What topics and partitions exist.
  • Which brokers are online.
  • Which broker is the leader for each partition.
  • Configuration settings (replication factor, retention, etc.).

This metadata is essential because producers and consumers need to know where to send messages and where to read them from.

Leader Election
Each partition in Kafka has one leader (handles reads/writes) and zero or more followers (replicas).
Leader election is the process of choosing which broker is the leader for a partition.

Happens when:
A new partition is created.
A leader broker fails or goes offline.
A new broker joins, triggering rebalancing.

Example:
Partition 0 has 3 replicas on brokers 1, 2, 3.
Broker 1 is the leader.
If broker 1 fails, Kafka elects broker 2 as the new leader from the ISR (In-Sync Replicas).

2. Topics, Partitions, and Offsets
Topics, Partitions, and Offsets are critical concepts that determine how fast data is processed and enable a Kafka cluster to scale. I will illustrate these concepts using the example of an e-commerce platform.

Imagine you are running an e-commerce platform that records every order placed by customers.

Topic – Grouping Data
You create a topic called orders.
Every new order (order ID, product, quantity, price, customer ID) is sent to this topic.
Producers: Your checkout system sends these order events.
Consumers: Your accounting system, inventory service, and shipping system all read from this topic.
Think of a topic as a named folder or mailbox where all related messages go.

Partition – Breaking Down the Topic
As your business grows, thousands of orders come in every second.
A single topic may become a bottleneck.
You split the orders topic into 3 partitions: P0, P1, P2.
Each partition holds part of the topic’s data:
Orders with customer IDs starting A–G go to P0.
H–N go to P1.
O–Z go to P2.
This partitioning allows parallel processing. One consumer can read P0, another reads P1, and another reads P2 — speeding up order processing.
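
To make this concrete, here is a minimal sketch (not from the original setup) of creating the orders topic with three partitions using Kafka's Java AdminClient. The broker address localhost:9092 and the replication factor of 3 are assumptions for illustration.

import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateOrdersTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions (P0, P1, P2) for parallelism; replication factor 3 for fault tolerance
            NewTopic orders = new NewTopic("orders", 3, (short) 3);
            admin.createTopics(List.of(orders)).all().get();
        }
    }
}

Note that in a real cluster the default partitioner assigns messages by hashing the message key rather than by alphabetical ranges; the A–G/H–N/O–Z split above is purely illustrative.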

Offset – Tracking Each Message
Inside each partition, every order is assigned a unique offset (like a line number).

Example for partition P1:

Offset 0: Order #1001
Offset 1: Order #1002
Offset 2: Order #1003

Consumers use these offsets to know where they left off.
If a consumer stops at offset 1, it will resume from offset 2 next time.
Offsets ensure no orders are skipped or processed twice (unless you design it that way).

Producers
Producers are responsible for writing data into topics. To do this efficiently, producers follow two sets of instructions: one determines which partition a message is sent to, and the other confirms that messages were sent successfully. These are known as key-based partitioning and acknowledgment modes (acks). Let us continue with the e-commerce example to illustrate this; a short producer sketch follows the acknowledgment modes below.
Your checkout service acts as a Kafka producer. Each time a customer places an order, the producer sends that order event to the orders topic. With key-based partitioning, Kafka allows you to assign a key to each message (e.g., customer_id or order_id). If a key is provided, Kafka always routes messages with the same key to the same partition, e.g., orders for customer_id = CUST123 will always go to partition P1. This ensures ordering is preserved per customer, which is useful for billing or delivery tracking. (If no key is given, Kafka uses a round-robin strategy to balance data evenly across partitions.)

Acknowledgments control data durability and reliability when producers send messages to brokers:

  1. acks=0 (Fire-and-forget): The producer does not wait for any acknowledgment. Fastest but risk of data loss if a broker fails immediately after receiving a message. Example: Logging non-critical website clicks
  2. acks=1 (Leader acknowledgment):The producer waits for the leader broker of the partition to confirm receipt. Safer than acks=0 but still risks data loss if the leader fails before replicating to followers. Example: E-commerce orders when speed is important but occasional loss is acceptable (not common in production).
  3. acks=all (All in-sync replicas acknowledgment):The producer waits for all in-sync replicas (ISR) to acknowledge. Ensures maximum durability — no message is lost even if a broker crashes. Example: Financial transactions or high-value orders.
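
Here is a minimal producer sketch tying the two ideas together: the message key (customer_id) drives partition assignment, and acks controls durability. It assumes a broker at localhost:9092 and plain String serialization; the JSON payload is illustrative.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");            // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("acks", "all");                                     // wait for all in-sync replicas

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Key = customer_id, so every order for CUST123 lands on the same partition
            producer.send(new ProducerRecord<>("orders", "CUST123",
                    "{\"order_id\":\"ORD12345\",\"amount\":249.99,\"status\":\"CONFIRMED\"}"));
        }
    }
}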

Consumers
Consumers read data from topics. They can work individually or in groups.

  • Consumer groups: Distribute partitions among multiple consumers for scalability.
  • Offset management: Can be automatic (Kafka commits offsets) or manual (application commits).

Continuing with the e-commerce example:
Your order fulfillment service (e.g., warehouse system) acts as a Kafka consumer. It reads messages from the orders topic to start packing and shipping items.
A consumer group is a set of consumers that work together to read data from the same topic. Kafka assigns each partition in the topic to only one consumer within a group.
Example:
Topic: orders has 3 partitions (P0, P1, P2).
Consumer Group: order_fulfillment_group with 3 consumers (C1, C2, C3).
Partition assignment:
C1 → P0
C2 → P1
C3 → P2
If you add a 4th consumer, it will remain idle because there are only 3 partitions.

Offsets track where each consumer left off in the partition. Example: If the last committed offset is 20 and a consumer restarts, it will resume reading from offset 21.
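
Below is a minimal consumer sketch for the order_fulfillment_group, assuming a broker at localhost:9092 and String values; it commits offsets manually after processing, so a restart resumes from the last committed position.

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class FulfillmentConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");               // assumed broker address
        props.put("group.id", "order_fulfillment_group");               // consumer group from the example
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        props.put("enable.auto.commit", "false");                       // manual offset management

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // e.g., trigger packing and shipping for this order
                    System.out.printf("partition=%d offset=%d order=%s%n",
                            record.partition(), record.offset(), record.value());
                }
                consumer.commitSync(); // commit only after the batch has been processed
            }
        }
    }
}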

Message Delivery Semantics
Message delivery semantics define how reliably messages are delivered and processed between producers and consumers.

Kafka supports three delivery semantics:

  • At-most-once: Messages may be lost but are never redelivered. E.g., if an order message is sent and the consumer crashes before reading it, that order is lost; no retry occurs.
  • At-least-once: Messages are redelivered if acknowledgments are missing. E.g., an order message might be processed twice; one copy might trigger a duplicate warehouse request, but no order is lost.
  • Exactly-once: Guarantees a message is processed once (requires idempotent producers and transactions). E.g., each order is guaranteed to trigger only one invoice and one shipment, even if failures happen mid-process.

Retention Policies
Kafka retains data for a configurable period or size:

  • Time-based: Retain data for X days (e.g., 7 days).
  • Size-based: Retain up to a certain log size.
  • Log compaction: Keep only the latest value per key, useful for stateful topics.
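
As a rough sketch of how these policies map to configuration, the snippet below uses the AdminClient to set a 7-day, size-capped retention on the orders topic; the specific values and broker address are assumptions, and cleanup.policy=compact would be used instead for a compacted topic.

import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class OrdersRetentionConfig {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource ordersTopic = new ConfigResource(ConfigResource.Type.TOPIC, "orders");
            List<AlterConfigOp> ops = List.of(
                // Time-based: keep messages for 7 days (604,800,000 ms)
                new AlterConfigOp(new ConfigEntry("retention.ms", "604800000"), AlterConfigOp.OpType.SET),
                // Size-based: cap each partition's log at roughly 1 GB (illustrative value)
                new AlterConfigOp(new ConfigEntry("retention.bytes", "1073741824"), AlterConfigOp.OpType.SET)
            );
            admin.incrementalAlterConfigs(Map.of(ordersTopic, ops)).all().get();
        }
    }
}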

Back Pressure & Flow Control
Kafka ensures system stability even under heavy load:

  1. Slow consumers: Can create consumer lag. Example (e-commerce): Your order_fulfillment_group has three consumers, but during a holiday sale, thousands of orders are placed per minute. Producers publish to the orders topic at 10,000 messages/minute, while consumers can only process 7,000 messages/minute. This creates a lag of 3,000 messages/minute, meaning orders start queuing up in Kafka partitions.
  2. Monitoring: Tools like Prometheus and Kafka Manager help track lag and throughput; a small lag-check sketch follows the example below.
Example:
If Consumer C2 is lagging on Partition P1, you can:

  • Scale out by adding another consumer to the group.
  • Optimize processing speed (batch processing, better hardware, or asynchronous processing).
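
For the monitoring side, here is a hedged sketch that computes per-partition lag for order_fulfillment_group with the Java AdminClient (committed offset vs. latest end offset). In production you would normally export this metric rather than hand-roll it, and the broker address is an assumption.

import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ConsumerLagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // Offsets the group has committed so far
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("order_fulfillment_group")
                         .partitionsToOffsetAndMetadata().get();
            // Latest (end) offsets for the same partitions
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
                    admin.listOffsets(committed.keySet().stream()
                            .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest())))
                         .all().get();
            // Lag = end offset - committed offset, per partition
            committed.forEach((tp, meta) ->
                    System.out.println(tp + " lag=" + (latest.get(tp).offset() - meta.offset())));
        }
    }
}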

Serialization & Deserialization
Kafka transmits all data as raw bytes. To make this data meaningful for producers and consumers, serialization (writing data) and deserialization (reading data) are used.
Serialization formats include:

  • JSON: Human-readable but larger in size.
  • Avro: Compact, with schema evolution support.
  • Protobuf: Efficient and language-agnostic.

Schema evolution is often managed using the Confluent Schema Registry.
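
As one possible configuration (a sketch, not the article's setup), a producer writing Avro values with Confluent's Schema Registry would swap its value serializer and point at the registry; the addresses are placeholders, and the Confluent serializer comes from a separate dependency (kafka-avro-serializer).

// Sketch: producer properties for Avro values with the Confluent Schema Registry
Properties props = new Properties();
props.put("bootstrap.servers", "localhost:9092");                  // assumed broker address
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
props.put("value.serializer", "io.confluent.kafka.serializers.KafkaAvroSerializer");
props.put("schema.registry.url", "http://localhost:8081");          // assumed registry address
// Pass props to new KafkaProducer<>(...) as in the earlier producer sketch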

Replication & Fault Tolerance
Kafka ensures high availability through replication:

  • Each partition has one leader and multiple followers.
  • ISR (In-Sync Replicas): Follower replicas in sync with the leader.
  • If a leader fails, a new leader is elected from the ISR.

Kafka Connect
Kafka Connect simplifies integrating Kafka with external systems:

1. Source connectors: Import data from systems like MySQL, PostgreSQL, or cloud storage.
2. Sink connectors: Export data to systems like Elasticsearch, Snowflake, or Hadoop.

Example:
{
  "name": "mysql-source-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql",
    "database.user": "user",
    "database.password": "password",
    "database.server.id": "1",
    "database.server.name": "dbserver1",
    "table.include.list": "inventory.customers"
  }
}

Kafka Streams
Kafka Streams is a client library for building real-time stream processing applications.

  • Stateless operations: Filter, map, transform.
  • Stateful operations: Joins, aggregations, windowing.

Example:
StreamsBuilder builder = new StreamsBuilder();
KStream<String, String> source = builder.stream("orders");
KStream<String, String> filtered = source.filter((key, value) -> value.contains("paid"));
filtered.to("processed-orders");

ksqlDB
ksqlDB provides a SQL-like interface for stream processing, allowing developers to write queries without Java/Scala code.
Using our e-commerce example, here is a scenario where ksqlDB comes in:
You operate an online store with a Kafka topic named orders. It contains real-time events like:

{
  "order_id": "ORD12345",
  "customer_id": "CUST789",
  "status": "CONFIRMED",
  "amount": 249.99,
  "payment_status": "PENDING"
}

Your team wants to:

  1. Continuously track paid orders for warehouse dispatch.
  2. Avoid writing a new Java or Scala microservice.

ksqlDB enables you to create Kafka topics using SQL-like queries in two steps:
Step 1: Define the Input Stream (This creates a streaming table reading from the orders topic.)

CREATE STREAM orders_stream (
  order_id VARCHAR,
  customer_id VARCHAR,
  status VARCHAR,
  amount DOUBLE,
  payment_status VARCHAR
) WITH (
  KAFKA_TOPIC='orders',
  VALUE_FORMAT='JSON'
);

Step 2: Filter Paid Orders (ksqlDB creates a new topic paid_orders containing only orders that are fully paid and ready for fulfillment.)

CREATE STREAM paid_orders AS
SELECT order_id, customer_id, amount
FROM orders_stream
WHERE payment_status = 'PAID';

Transactions & Idempotence

  • Idempotence: Prevents duplicate messages when retries occur.
  • Transactions: Enable exactly-once semantics (EOS) across multiple topics.
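
A hedged sketch of what this looks like with the Java producer API: an idempotent, transactional producer writes an invoice event and a shipment event atomically. The invoices and shipments topic names and the transactional.id are hypothetical.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderTransactionProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");        // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        props.put("enable.idempotence", "true");                  // de-duplicates retried sends
        props.put("transactional.id", "order-billing-tx");        // hypothetical transactional id

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.initTransactions();
        try {
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("invoices", "ORD12345", "invoice-created"));      // hypothetical topic
            producer.send(new ProducerRecord<>("shipments", "ORD12345", "shipment-requested"));  // hypothetical topic
            producer.commitTransaction();  // both writes become visible atomically
        } catch (Exception e) {
            producer.abortTransaction();   // read_committed consumers never see either write
        } finally {
            producer.close();
        }
    }
}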

Security in Kafka
Kafka handles sensitive business data (e.g., payment info, customer orders), so security is critical. It involves three main layers: authentication, authorization, and encryption.

  1. Authentication verifies the identity of clients (producers, consumers, admin tools) connecting to the Kafka cluster. Common methods used for authentication include SASL, Kerberos, and OAuth.
  2. Authorization controls what authenticated clients can access. Kafka uses Access Control Lists (ACLs).
  3. Encryption ensures data cannot be intercepted or tampered with during transmission between clients and brokers. Kafka uses TLS (Transport Layer Security); a client-side configuration sketch follows this list.
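
To illustrate how these layers show up on the client side, here is a sketch of producer/consumer properties combining TLS encryption with SASL/SCRAM authentication; the mechanism, credentials, and truststore path are placeholders and depend entirely on how your cluster is secured.

// Sketch: client-side security settings (values are placeholders)
Properties props = new Properties();
props.put("security.protocol", "SASL_SSL");          // TLS encryption + SASL authentication
props.put("sasl.mechanism", "SCRAM-SHA-256");        // assumed mechanism; could be PLAIN, GSSAPI, OAUTHBEARER
props.put("sasl.jaas.config",
        "org.apache.kafka.common.security.scram.ScramLoginModule required "
        + "username=\"order-service\" password=\"<secret>\";");
props.put("ssl.truststore.location", "/path/to/truststore.jks");
props.put("ssl.truststore.password", "<secret>");
// Authorization (ACLs) is configured on the brokers, e.g. allowing this principal to write to orders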

Metrics to Monitor
Monitoring Kafka is essential to ensure high availability, fault tolerance, and smooth data flow.
I will use our e-commerce example to illustrate four monitoring metrics.

Consumer Lag
Tracks how far behind consumers are from the latest messages in a partition.
Example:
During a flash sale, producers send 10,000 orders/minute, but the warehouse fulfillment consumers only process 7,000 orders/minute.

Consumer lag = 3,000 orders/minute.

If lag grows too much, shipments may be delayed.

Under-Replicated Partitions
Indicates partitions where not all replicas are in sync with the leader.

Example:
Topic: orders with replication factor = 3.
If one broker goes down, some partitions may only have 1 or 2 replicas instead of 3.
Impact: Risk of data loss if another broker fails before replication recovers.

Broker Disk Usage
Each broker stores partition logs on its disk.

Example:
If orders topic retention is set to 7 days, but brokers are close to disk capacity, old orders might be deleted sooner than expected, or brokers may stop accepting new data.

End-to-End Latency
Measures the time between a message being produced and consumed.

Example:
Goal: Orders should be processed within 2 seconds after checkout.
If latency spikes to 10 seconds, customer experience suffers (delayed confirmations or fulfillment).

Tools that can be used for monitoring include Prometheus + Grafana (collect Kafka metrics and visualize lag, disk usage, and latency) and Confluent Control Center (provides an enterprise dashboard for brokers, topics, and consumer group health).

Scaling Kafka
By now you know I like using illustrative descriptions; read on to understand why you may need to scale your Kafka cluster and how to do it:

As your e-commerce platform grows, the volume of orders, payments, and shipment events increases. To ensure Kafka continues to handle this rising demand, scaling becomes essential.

One of the primary methods to scale Kafka is by increasing the number of partitions in your topics. Partitions are the units of parallelism in Kafka, meaning that more partitions allow more consumers to read and process data simultaneously. For example, if your orders topic initially had three partitions serving three consumers, and your business starts handling five times the traffic during a holiday sale, you can increase the partition count to distribute the workload across more consumers. This enables higher throughput without overwhelming individual services.
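
For example, here is a hedged sketch of growing the orders topic from three to six partitions with the Java AdminClient (the broker address and target count are assumptions). Keep in mind that existing messages stay where they are, and new messages with the same key may map to a different partition after the change.

import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewPartitions;

public class ExpandOrdersTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            // Grow orders from 3 to 6 partitions so more consumers can work in parallel
            admin.createPartitions(Map.of("orders", NewPartitions.increaseTo(6))).all().get();
        }
    }
}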

Another key strategy is to add more brokers to the Kafka cluster. Brokers are the servers that store partitions and manage data replication. By introducing additional brokers, you spread partitions across a larger number of servers, improving fault tolerance, reducing the risk of storage bottlenecks, and enhancing overall performance.

When partitions or brokers are added, Kafka requires a rebalance to redistribute partitions across the available brokers. Kafka provides built-in tools to manage this process, ensuring that the cluster automatically adjusts its load distribution with minimal disruption.

Performance Optimization
Here is another illustration on how you can achieve high performance with minimal resources:

As your e-commerce platform scales and Kafka processes increasing volumes of order, payment, and shipment events, optimizing performance becomes crucial to maintain low latency and high throughput.

One effective approach is to enable batching and compression. Instead of sending individual messages, Kafka producers can batch multiple messages together before sending them to brokers. This reduces network overhead and increases throughput. Compression techniques such as Snappy or LZ4 further optimize data transfer by reducing the size of these batches without significantly impacting processing speed. For example, during a flash sale, compressing batched order events can dramatically cut down network usage and storage requirements.
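
As a sketch, these are the producer settings involved; the exact numbers are illustrative and should be tuned against your own traffic.

// Sketch: producer batching and compression settings (illustrative values)
Properties props = new Properties();
props.put("compression.type", "lz4");   // or "snappy"; compresses each batch before sending
props.put("batch.size", "65536");       // up to 64 KB of records per partition batch
props.put("linger.ms", "10");           // wait up to 10 ms to fill a batch before sending
// Combine with the serializer and acks settings from the earlier producer sketch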

Another important aspect is to tune the Kafka broker configuration, specifically parameters such as num.io.threads and num.network.threads. These settings control the number of threads handling disk I/O operations and network requests, respectively. Proper tuning ensures that the broker can manage large volumes of incoming and outgoing messages without becoming a bottleneck.

Finally, ensuring sufficient disk I/O and network bandwidth is critical. Kafka relies heavily on disk writes and replication traffic between brokers. If your storage system is slow or your network is saturated, latency will spike, and consumer lag will grow. Upgrading to faster disks (e.g., SSDs) or scaling network infrastructure can significantly improve overall performance.

Real-World Use Case Example - The Case of Netflix
Now that we are here, let us look at a real company: Netflix. Netflix is a global streaming company that has managed to offer personalized experiences to its millions of subscribers. Let us look at how Netflix has harnessed the capabilities of Kafka to power its large-scale streaming services:

Kafka plays a critical role in Netflix's microservices architecture, enabling real-time data movement, personalized content delivery, and system resilience.

1. Personalized Recommendations

Netflix leverages Kafka to stream real-time user interaction events, such as:
Play, pause, fast-forward, and rewind actions.
Browsing history, search queries, and viewing patterns.
These events are sent to a Kafka topic (e.g., user_activity) and consumed by machine learning services that constantly update personalized recommendations.
Example: When a user starts watching a thriller, Kafka streams this event to the recommendation engine, which instantly adjusts the "Because You Watched…" carousel on their homepage.

2. Operational Monitoring and Alerting
Netflix uses Kafka to collect logs and operational events from thousands of microservices.
Topics aggregate metrics like streaming quality (bitrate changes), login errors, and regional performance stats.
These streams feed real-time dashboards and anomaly detection systems.
Example: If buffering spikes in a specific region, Kafka immediately triggers alerts and routes events to their incident management system, enabling engineers to respond before large-scale impact.

3. Secure Data Streaming
Security is paramount at Netflix, and Kafka supports it through:
Authentication via SASL/OAuth to ensure only trusted microservices can publish or consume sensitive topics (e.g., payment updates).
TLS encryption to protect user data (like subscription payments) in transit between microservices.
ACLs (Access Control Lists) to restrict access — for instance, only the billing service can write to the billing_updates topic, while the accounting service can read it.

4. Seamless Streaming Experience
Kafka helps Netflix achieve low-latency synchronization across devices.
When a user switches from their TV to their smartphone, Kafka streams the current playback position to the session_state topic.
The mobile app consumer picks it up instantly, resuming playback exactly where the user left off.
