<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Brian Ouchoh</title>
    <description>The latest articles on DEV Community by Brian Ouchoh (@brian_ouchoh_f28dd3377816).</description>
    <link>https://dev.to/brian_ouchoh_f28dd3377816</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3426347%2F3f3ec199-48d6-4be4-ab10-1bfd7e100d6e.png</url>
      <title>DEV Community: Brian Ouchoh</title>
      <link>https://dev.to/brian_ouchoh_f28dd3377816</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/brian_ouchoh_f28dd3377816"/>
    <language>en</language>
    <item>
      <title>Apache Kafka Deep Dive: Core Concepts, Data Engineering Applications, and Real-World Production Practices</title>
      <dc:creator>Brian Ouchoh</dc:creator>
      <pubDate>Tue, 09 Sep 2025 06:50:46 +0000</pubDate>
      <link>https://dev.to/brian_ouchoh_f28dd3377816/apache-kafka-deep-dive-core-concepts-data-engineering-applications-and-real-world-production-3bmn</link>
      <guid>https://dev.to/brian_ouchoh_f28dd3377816/apache-kafka-deep-dive-core-concepts-data-engineering-applications-and-real-world-production-3bmn</guid>
      <description>&lt;p&gt;For real-time or near-real-time data processing, Apache Kafka is a critical tool for the data engineer. Apache Kafka is a distributed streaming platform (a server-side application) that provides real-time messaging and data streaming capabilities between systems. A key feature of Apache Kafka is that it can handle millions of events per second with millisecond-level latency.&lt;br&gt;
Alternatives to Apache Kafka include RabbitMQ, Pulsar, AWS Kinesis, Google Pub/Sub, and Azure Event Hubs. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Kafka Architecture&lt;/strong&gt;&lt;br&gt;
The Kafka architecture consists of the following components:&lt;br&gt;
Brokers (servers that store/serve data)&lt;br&gt;
Topics &amp;amp; partitions (how data is organized)&lt;br&gt;
Producers &amp;amp; consumers (who sends and reads data)&lt;br&gt;
ZooKeeper (legacy) or KRaft mode (cluster coordination)&lt;br&gt;
Replication, logs, offsets, serialization, connectors, streams, etc.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Brokers&lt;/strong&gt;&lt;br&gt;
Brokers are Kafka servers responsible for storing and serving data. Each broker can handle thousands of clients and manage multiple partitions across topics.&lt;br&gt;
One broker can handle thousands of partitions, but production clusters usually have 3–5+ brokers for fault tolerance.&lt;/p&gt;

&lt;p&gt;In production, several brokers work together. A group of brokers working together is called a Kafka cluster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ZooKeeper vs KRaft Mode&lt;/strong&gt;&lt;br&gt;
Since a cluster is a group of brokers, and each broker holds related data, how does Kafka coordinate all these brokers?&lt;br&gt;
Traditionally, Kafka relied on Apache ZooKeeper for cluster coordination, managing broker metadata, and leader election.&lt;br&gt;
More recently, Kafka has been transitioning to KRaft mode (Kafka Raft), a ZooKeeper-less architecture that simplifies cluster management and improves scalability.&lt;/p&gt;

&lt;p&gt;Scaling an Apache Kafka cluster is achieved by adding brokers and redistributing partitions across the cluster.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bonus:&lt;/strong&gt;&lt;br&gt;
In Apache Kafka, managing cluster metadata and leader election refers to how Kafka coordinates which broker is responsible for which data and ensures smooth operation when something changes or fails.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cluster Metadata&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Metadata&lt;/strong&gt; = information about the Kafka cluster’s state.&lt;/p&gt;

&lt;p&gt;Includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What topics and partitions exist.&lt;/li&gt;
&lt;li&gt;Which brokers are online.&lt;/li&gt;
&lt;li&gt;Which broker is the leader for each partition.&lt;/li&gt;
&lt;li&gt;Configuration settings (replication factor, retention, etc.).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This metadata is essential because producers and consumers need to know where to send or read messages from.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Leader Election&lt;/strong&gt;&lt;br&gt;
Each partition in Kafka has one leader (handles reads/writes) and zero or more followers (replicas).&lt;br&gt;
Leader election is the process of choosing which broker is the leader for a partition.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Happens when:&lt;/strong&gt;&lt;br&gt;
A new partition is created.&lt;br&gt;
A leader broker fails or goes offline.&lt;br&gt;
A new broker joins, triggering rebalancing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;br&gt;
Partition 0 has 3 replicas on brokers 1, 2, 3.&lt;br&gt;
Broker 1 is the leader.&lt;br&gt;
If broker 1 fails, Kafka elects broker 2 as the new leader from the ISR (In-Sync Replicas).&lt;/p&gt;
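&lt;p&gt;The failover above can be sketched in a few lines of Python (a toy model for illustration, not Kafka's actual controller logic):&lt;/p&gt;

```python
# Toy sketch of leader election: when the leader's broker fails, pick the
# first surviving replica from the ISR (in-sync replica) list.
def elect_leader(replicas, isr, failed_broker):
    """Return the new leader: the first in-sync replica that is still alive."""
    candidates = [b for b in isr if b != failed_broker]
    if not candidates:
        raise RuntimeError("no in-sync replica available for election")
    return candidates[0]

# Partition 0 has replicas on brokers 1, 2, 3; all are in sync.
replicas, isr = [1, 2, 3], [1, 2, 3]
leader = 1
# Broker 1 fails, so a new leader is elected from the remaining ISR.
leader = elect_leader(replicas, isr, failed_broker=1)
print(leader)  # 2
```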

&lt;p&gt;&lt;strong&gt;2. Topics, Partitions, and Offsets&lt;/strong&gt;&lt;br&gt;
Topics, partitions, and offsets are critical concepts that determine how fast data is processed and enable the scalability of a Kafka cluster. I will illustrate these concepts using the example of an e-commerce platform.&lt;/p&gt;

&lt;p&gt;Imagine you are running an e-commerce platform that records every order placed by customers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Topic – Grouping Data&lt;/strong&gt;&lt;br&gt;
You create a topic called orders.&lt;br&gt;
Every new order (order ID, product, quantity, price, customer ID) is sent to this topic.&lt;br&gt;
Producers: Your checkout system sends these order events.&lt;br&gt;
Consumers: Your accounting system, inventory service, and shipping system all read from this topic.&lt;br&gt;
Think of a topic as a named folder or mailbox where all related messages go.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Partition – Breaking Down the Topic&lt;/strong&gt;&lt;br&gt;
As your business grows, thousands of orders come in every second.&lt;br&gt;
A single topic may become a bottleneck.&lt;br&gt;
You split the orders topic into 3 partitions: P0, P1, P2.&lt;br&gt;
Each partition holds part of the topic’s data:&lt;br&gt;
Orders with customer IDs starting A–G go to P0.&lt;br&gt;
H–N go to P1.&lt;br&gt;
O–Z go to P2.&lt;br&gt;
This partitioning allows parallel processing. One consumer can read P0, another reads P1, and another reads P2 — speeding up order processing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Offset – Tracking Each Message&lt;/strong&gt;&lt;br&gt;
Inside each partition, every order is assigned a unique offset (like a line number).&lt;/p&gt;

&lt;p&gt;Example for partition P1:&lt;/p&gt;

&lt;p&gt;Offset 0: Order #1001&lt;br&gt;
Offset 1: Order #1002&lt;br&gt;
Offset 2: Order #1003&lt;/p&gt;

&lt;p&gt;Consumers use these offsets to know where they left off.&lt;br&gt;
If a consumer stops at offset 1, it will resume from offset 2 next time.&lt;br&gt;
Offsets ensure no orders are skipped or processed twice (unless you design it that way).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Producers&lt;/strong&gt;&lt;br&gt;
Producers are responsible for writing data into topics. To do this efficiently, producers follow a set of rules that determine which partition to send messages to and how to confirm that messages were sent successfully. These two mechanisms are known as key-based partitioning and acknowledgment modes (acks). Let us continue with the e-commerce example to illustrate them:&lt;br&gt;
Your checkout service acts as a Kafka producer. Each time a customer places an order, the producer sends that order event to the orders topic. With key-based partitioning, Kafka allows you to assign a key to each message (e.g., customer_id or order_id). If a key is provided, Kafka always routes messages with the same key to the same partition; for example, orders for customer_id = CUST123 will always go to partition P1. This ensures ordering is preserved per customer, which is useful for billing or delivery tracking. (If no key is given, Kafka uses a round-robin strategy to balance data evenly across partitions.)&lt;/p&gt;
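&lt;p&gt;Here is a minimal Python sketch of key-based partitioning. Kafka's default partitioner actually hashes keys with murmur2; a CRC32 hash stands in here purely for illustration:&lt;/p&gt;

```python
# Sketch of key-based partitioning: a deterministic hash of the message key
# maps the same key to the same partition every time, preserving per-key order.
import zlib

NUM_PARTITIONS = 3

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # CRC32 stands in for Kafka's murmur2 hash in this illustration.
    return zlib.crc32(key.encode()) % num_partitions

# The same customer always lands on the same partition.
p1 = partition_for("CUST123")
p2 = partition_for("CUST123")
assert p1 == p2
```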

&lt;p&gt;Acknowledgments control data durability and reliability when producers send messages to brokers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;acks=0 (Fire-and-forget): The producer does not wait for any acknowledgment. Fastest but risk of data loss if a broker fails immediately after receiving a message. Example: Logging non-critical website clicks&lt;/li&gt;
&lt;li&gt;acks=1 (Leader acknowledgment):The producer waits for the leader broker of the partition to confirm receipt. Safer than acks=0 but still risks data loss if the leader fails before replicating to followers. Example: E-commerce orders when speed is important but occasional loss is acceptable (not common in production).&lt;/li&gt;
&lt;li&gt;acks=all (All in-sync replicas acknowledgment):The producer waits for all in-sync replicas (ISR) to acknowledge. Ensures maximum durability — no message is lost even if a broker crashes. Example: Financial transactions or high-value orders.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Consumers&lt;/strong&gt;&lt;br&gt;
Consumers read data from topics. They can work individually or in groups.&lt;/p&gt;

&lt;p&gt;-Consumer groups: Distribute partitions among multiple consumers for scalability.&lt;br&gt;
-Offset management: Can be automatic (Kafka commits offsets) or manual (application commits).&lt;/p&gt;

&lt;p&gt;Continuing with the e-commerce example:&lt;br&gt;
Your order fulfillment service (e.g., warehouse system) acts as a Kafka consumer. It reads messages from the orders topic to start packing and shipping items.&lt;br&gt;
A consumer group is a set of consumers that work together to read data from the same topic. Kafka assigns each partition in the topic to only one consumer within a group.&lt;br&gt;
Example:&lt;br&gt;
Topic: orders has 3 partitions (P0, P1, P2).&lt;br&gt;
Consumer Group: order_fulfillment_group with 3 consumers (C1, C2, C3).&lt;br&gt;
Partition assignment:&lt;br&gt;
C1 → P0&lt;br&gt;
C2 → P1&lt;br&gt;
C3 → P2&lt;br&gt;
If you add a 4th consumer, it will remain idle because there are only 3 partitions.&lt;/p&gt;

&lt;p&gt;Offsets track where each consumer left off in the partition. Example: If the last committed offset is 20 and a consumer restarts, it will resume reading from offset 21.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Message Delivery Semantics&lt;/strong&gt;&lt;br&gt;
Message delivery semantics define how reliably messages are delivered and processed between producers and consumers.&lt;/p&gt;

&lt;p&gt;Kafka supports three delivery semantics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;At-most-once: Messages may be lost but are never redelivered. E.g., if an order message is sent and the consumer crashes before reading it, that order is lost; no retry occurs.&lt;/li&gt;
&lt;li&gt;At-least-once: Messages are redelivered if acknowledgments are missing. E.g., an order message might be processed twice; one copy might trigger a duplicate warehouse request, but no order is lost.&lt;/li&gt;
&lt;li&gt;Exactly-once: Guarantees a message is processed once (requires idempotent producers and transactions). E.g., each order is guaranteed to trigger only one invoice and one shipment, even if failures happen mid-process.&lt;/li&gt;
&lt;/ul&gt;
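&lt;p&gt;A common way to get exactly-once behavior on top of at-least-once delivery is an idempotent consumer that deduplicates on a unique ID. A minimal Python sketch (in-memory state; a real service would persist the seen IDs):&lt;/p&gt;

```python
# Sketch of an idempotent consumer: deduplicate on order_id so a redelivered
# message does not trigger a second shipment.
processed_ids = set()
shipments = []

def handle(order):
    if order["order_id"] in processed_ids:
        return  # duplicate redelivery: ignore
    processed_ids.add(order["order_id"])
    shipments.append(order["order_id"])

msg = {"order_id": "ORD1001"}
handle(msg)
handle(msg)  # redelivered after a missed acknowledgment
print(shipments)  # ['ORD1001'] -- one shipment despite two deliveries
```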

&lt;p&gt;&lt;strong&gt;Retention Policies&lt;/strong&gt;&lt;br&gt;
Kafka retains data for a configurable period or size:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Time-based: Retain data for X days (e.g., 7 days).&lt;/li&gt;
&lt;li&gt;Size-based: Retain up to a certain log size.&lt;/li&gt;
&lt;li&gt;Log compaction: Keep only the latest value per key, useful for stateful topics.&lt;/li&gt;
&lt;/ul&gt;
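&lt;p&gt;Log compaction's keep-latest-per-key rule can be sketched in Python (the log entries here are made up):&lt;/p&gt;

```python
# Sketch of log compaction: from a keyed log, keep only the latest value per
# key (older values for the same key become eligible for cleanup).
log = [
    ("CUST123", "address=Nairobi"),
    ("CUST456", "address=Mombasa"),
    ("CUST123", "address=Kisumu"),   # newer value for the same key
]

compacted = {}
for key, value in log:  # later entries overwrite earlier ones
    compacted[key] = value

print(compacted)  # {'CUST123': 'address=Kisumu', 'CUST456': 'address=Mombasa'}
```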

&lt;p&gt;&lt;strong&gt;Back Pressure &amp;amp; Flow Control&lt;/strong&gt;&lt;br&gt;
Kafka ensures system stability even under heavy load:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Slow consumers: Can create consumer lag.
Example (e-commerce): Your order_fulfillment_group has three consumers, but during a holiday sale, thousands of orders are placed per minute. Producers publish to the orders topic at 10,000 messages/minute, while consumers can only process 7,000 messages/minute. This creates a lag of 3,000 messages/minute, meaning orders start queuing up in Kafka partitions.&lt;/li&gt;
&lt;li&gt;Monitoring: Tools like Prometheus and Kafka Manager help track lag and throughput.
Example: If consumer C2 is lagging on partition P1, you can:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Scale out by adding another consumer to the group.&lt;/li&gt;
&lt;li&gt;Optimize processing speed (batch processing, better hardware, or asynchronous processing).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Serialization &amp;amp; Deserialization&lt;/strong&gt;&lt;br&gt;
Kafka transmits all data as raw bytes. To make this data meaningful for producers and consumers, serialization (writing data) and deserialization (reading data) are used.&lt;br&gt;
Serialization formats include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JSON: Human-readable but larger in size.&lt;/li&gt;
&lt;li&gt;Avro: Compact, with schema evolution support.&lt;/li&gt;
&lt;li&gt;Protobuf: Efficient and language-agnostic.
Schema evolution is often managed using the Confluent Schema Registry.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Replication &amp;amp; Fault Tolerance&lt;/strong&gt;&lt;br&gt;
Kafka ensures high availability through replication:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each partition has one leader and multiple followers.&lt;/li&gt;
&lt;li&gt;ISR (In-Sync Replicas): Follower replicas in sync with the leader.&lt;/li&gt;
&lt;li&gt;If a leader fails, a new leader is elected from the ISR.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Kafka Connect&lt;/strong&gt;&lt;br&gt;
Kafka Connect simplifies integrating Kafka with external systems:&lt;/p&gt;

&lt;p&gt;1.&lt;em&gt;Source connectors&lt;/em&gt;: Import data from systems like MySQL, PostgreSQL, or cloud storage.&lt;br&gt;
2.&lt;em&gt;Sink connectors&lt;/em&gt;: Export data to systems like Elasticsearch, Snowflake, or Hadoop.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;code&gt;{&lt;br&gt;
  "name": "mysql-source-connector",&lt;br&gt;
  "config": {&lt;br&gt;
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",&lt;br&gt;
    "database.hostname": "mysql",&lt;br&gt;
    "database.user": "user",&lt;br&gt;
    "database.password": "password",&lt;br&gt;
    "database.server.id": "1",&lt;br&gt;
    "database.server.name": "dbserver1",&lt;br&gt;
    "table.include.list": "inventory.customers"&lt;br&gt;
  }&lt;br&gt;
}&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kafka Streams&lt;/strong&gt;&lt;br&gt;
Kafka Streams is a client library for building real-time stream processing applications.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stateless operations: Filter, map, transform.&lt;/li&gt;
&lt;li&gt;Stateful operations: Joins, aggregations, windowing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;code&gt;StreamsBuilder builder = new StreamsBuilder();&lt;br&gt;
KStream&amp;lt;String, String&amp;gt; source = builder.stream("orders");&lt;br&gt;
KStream&amp;lt;String, String&amp;gt; filtered = source.filter((key, value) -&amp;gt; value.contains("paid"));&lt;br&gt;
filtered.to("processed-orders");&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ksqlDB&lt;/strong&gt;&lt;br&gt;
ksqlDB provides a SQL-like interface for stream processing, allowing developers to write queries without Java/Scala code.&lt;br&gt;
Using our e-commerce example, this is a scenario where ksqlDB comes in:&lt;br&gt;
You operate an online store with a Kafka topic named orders. It contains real-time events like:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;{&lt;br&gt;
  "order_id": "ORD12345",&lt;br&gt;
  "customer_id": "CUST789",&lt;br&gt;
  "status": "CONFIRMED",&lt;br&gt;
  "amount": 249.99,&lt;br&gt;
  "payment_status": "PENDING"&lt;br&gt;
}&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Your team wants to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Continuously track paid orders for warehouse dispatch.&lt;/li&gt;
&lt;li&gt;Avoid writing a new Java or Scala microservice.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;ksqlDB enables you to create Kafka topics using SQL-like queries in two steps:&lt;br&gt;
Step 1: Define the Input Stream (This creates a streaming table reading from the orders topic.)&lt;/p&gt;

&lt;p&gt;&lt;code&gt;CREATE STREAM orders_stream (&lt;br&gt;
  order_id VARCHAR,&lt;br&gt;
  customer_id VARCHAR,&lt;br&gt;
  status VARCHAR,&lt;br&gt;
  amount DOUBLE,&lt;br&gt;
  payment_status VARCHAR&lt;br&gt;
) WITH (&lt;br&gt;
  KAFKA_TOPIC='orders',&lt;br&gt;
  VALUE_FORMAT='JSON'&lt;br&gt;
);&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Step 2: Filter Paid Orders (ksqlDB creates a new topic paid_orders containing only orders that are fully paid and ready for fulfillment.)&lt;/p&gt;

&lt;p&gt;&lt;code&gt;CREATE STREAM paid_orders AS&lt;br&gt;
  SELECT order_id, customer_id, amount&lt;br&gt;
  FROM orders_stream&lt;br&gt;
  WHERE payment_status = 'PAID';&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transactions &amp;amp; Idempotence&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Idempotence: Prevents duplicate messages when retries occur.&lt;/li&gt;
&lt;li&gt;Transactions: Enable exactly-once semantics (EOS) across multiple topics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Security in Kafka&lt;/strong&gt;&lt;br&gt;
Kafka handles sensitive business data (e.g., payment info, customer orders), so security is critical. It involves three main layers: authentication, authorization, and encryption.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;em&gt;Authentication verifies the identity of clients&lt;/em&gt; (producers, consumers, admin tools) connecting to the Kafka cluster. Common methods used for authentication include SASL, Kerberos, or OAuth.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Authorization controls what authenticated clients can access.&lt;/em&gt; Kafka uses Access Control Lists (ACLs).&lt;/li&gt;
&lt;li&gt;Encryption ensures data cannot be intercepted or tampered with during transmission between clients and brokers. Kafka uses TLS (Transport Layer Security).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Metrics to Monitor&lt;/strong&gt;&lt;br&gt;
Monitoring Kafka is essential to ensure high availability, fault tolerance, and smooth data flow.&lt;br&gt;
I will use our e-commerce example to illustrate four monitoring metrics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consumer Lag&lt;/strong&gt;&lt;br&gt;
Tracks how far behind consumers are from the latest messages in a partition.&lt;br&gt;
Example:&lt;br&gt;
During a flash sale, producers send 10,000 orders/minute, but the warehouse fulfillment consumers only process 7,000 orders/minute.&lt;/p&gt;

&lt;p&gt;Consumer lag = 3,000 orders/minute.&lt;/p&gt;

&lt;p&gt;If lag grows too much, shipments may be delayed.&lt;/p&gt;
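&lt;p&gt;Consumer lag per partition is simply the latest produced offset minus the consumer's committed offset. A Python sketch with made-up offsets:&lt;/p&gt;

```python
# Sketch: consumer lag per partition = latest produced offset minus the
# consumer's committed offset. The numbers below are illustrative.
latest_offsets = {"P0": 10_000, "P1": 10_000, "P2": 10_000}
committed_offsets = {"P0": 7_000, "P1": 7_000, "P2": 7_000}

lag = {p: latest_offsets[p] - committed_offsets[p] for p in latest_offsets}
total_lag = sum(lag.values())
print(total_lag)  # 9000 messages waiting across the topic
```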

&lt;p&gt;&lt;strong&gt;Under-Replicated Partitions&lt;/strong&gt;&lt;br&gt;
Indicates partitions where not all replicas are in sync with the leader.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
Topic: orders with replication factor = 3.&lt;br&gt;
If one broker goes down, some partitions may only have 1 or 2 replicas instead of 3.&lt;br&gt;
Impact: Risk of data loss if another broker fails before replication recovers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Broker Disk Usage&lt;/strong&gt;&lt;br&gt;
Each broker stores partition logs on its disk.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
If orders topic retention is set to 7 days, but brokers are close to disk capacity, old orders might be deleted sooner than expected, or brokers may stop accepting new data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;End-to-End Latency&lt;/strong&gt;&lt;br&gt;
Measures the time between a message being produced and consumed.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
Goal: Orders should be processed within 2 seconds after checkout.&lt;br&gt;
If latency spikes to 10 seconds, customer experience suffers (delayed confirmations or fulfillment).&lt;/p&gt;

&lt;p&gt;Tools that can be used for monitoring include &lt;strong&gt;Prometheus + Grafana&lt;/strong&gt; (Collects Kafka metrics and visualizes lag, disk usage, and latency) and &lt;strong&gt;Confluent Control Center&lt;/strong&gt; (Provides an enterprise dashboard for brokers, topics, and consumer group health.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scaling Kafka&lt;/strong&gt;&lt;br&gt;
By now you know I like using illustrative descriptions; read this to understand why you will need to scale your Kafka cluster and how to scale it:&lt;/p&gt;

&lt;p&gt;As your e-commerce platform grows, the volume of orders, payments, and shipment events increases. To ensure Kafka continues to handle this rising demand, scaling becomes essential.&lt;/p&gt;

&lt;p&gt;One of the primary methods to scale Kafka is by increasing the number of partitions in your topics. Partitions are the units of parallelism in Kafka, meaning that more partitions allow more consumers to read and process data simultaneously. For example, if your orders topic initially had three partitions serving three consumers, and your business starts handling five times the traffic during a holiday sale, you can increase the partition count to distribute the workload across more consumers. This enables higher throughput without overwhelming individual services.&lt;/p&gt;

&lt;p&gt;Another key strategy is to add more brokers to the Kafka cluster. Brokers are the servers that store partitions and manage data replication. By introducing additional brokers, you spread partitions across a larger number of servers, improving fault tolerance, reducing the risk of storage bottlenecks, and enhancing overall performance.&lt;/p&gt;

&lt;p&gt;When partitions or brokers are added, Kafka requires a rebalance to redistribute partitions across the available brokers. Kafka provides built-in tools to manage this process, ensuring that the cluster automatically adjusts its load distribution with minimal disruption.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance Optimization&lt;/strong&gt;&lt;br&gt;
Here is another illustration on how you can achieve high performance with minimal resources:&lt;/p&gt;

&lt;p&gt;As your e-commerce platform scales and Kafka processes increasing volumes of order, payment, and shipment events, optimizing performance becomes crucial to maintain low latency and high throughput.&lt;/p&gt;

&lt;p&gt;One effective approach is to enable batching and compression. Instead of sending individual messages, Kafka producers can batch multiple messages together before sending them to brokers. This reduces network overhead and increases throughput. Compression techniques such as Snappy or LZ4 further optimize data transfer by reducing the size of these batches without significantly impacting processing speed. For example, during a flash sale, compressing batched order events can dramatically cut down network usage and storage requirements.&lt;/p&gt;

&lt;p&gt;Another important aspect is to tune the Kafka broker configuration, specifically parameters such as num.io.threads and num.network.threads. These settings control the number of threads handling disk I/O operations and network requests, respectively. Proper tuning ensures that the broker can manage large volumes of incoming and outgoing messages without becoming a bottleneck.&lt;/p&gt;

&lt;p&gt;Finally, ensuring sufficient disk I/O and network bandwidth is critical. Kafka relies heavily on disk writes and replication traffic between brokers. If your storage system is slow or your network is saturated, latency will spike, and consumer lag will grow. Upgrading to faster disks (e.g., SSDs) or scaling network infrastructure can significantly improve overall performance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-World Use Case Example: Netflix&lt;/strong&gt;&lt;br&gt;
Now that we are here, let us look into a real company: Netflix. Netflix is a global movie streaming company that has managed to offer personalized experiences to its millions of subscribers. Let us look at how Netflix has harnessed the capabilities of Kafka to power its large-scale streaming services:&lt;/p&gt;

&lt;p&gt;Kafka plays a critical role in Netflix's microservices architecture, enabling real-time data movement, personalized content delivery, and system resilience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Personalized Recommendations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Netflix leverages Kafka to stream real-time user interaction events, such as:&lt;br&gt;
Play, pause, fast-forward, and rewind actions.&lt;br&gt;
Browsing history, search queries, and viewing patterns.&lt;br&gt;
These events are sent to a Kafka topic (e.g., user_activity) and consumed by machine learning services that constantly update personalized recommendations.&lt;br&gt;
Example: When a user starts watching a thriller, Kafka streams this event to the recommendation engine, which instantly adjusts the "Because You Watched…" carousel on their homepage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Operational Monitoring and Alerting&lt;/strong&gt;&lt;br&gt;
Netflix uses Kafka to collect logs and operational events from thousands of microservices.&lt;br&gt;
Topics aggregate metrics like streaming quality (bitrate changes), login errors, and regional performance stats.&lt;br&gt;
These streams feed real-time dashboards and anomaly detection systems.&lt;br&gt;
Example: If buffering spikes in a specific region, Kafka immediately triggers alerts and routes events to their incident management system, enabling engineers to respond before large-scale impact.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Secure Data Streaming&lt;/strong&gt;&lt;br&gt;
Security is paramount at Netflix, and Kafka supports it through:&lt;br&gt;
Authentication via SASL/OAuth to ensure only trusted microservices can publish or consume sensitive topics (e.g., payment updates).&lt;br&gt;
TLS encryption to protect user data (like subscription payments) in transit between microservices.&lt;br&gt;
ACLs (Access Control Lists) to restrict access — for instance, only the billing service can write to the billing_updates topic, while the accounting service can read it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Seamless Streaming Experience&lt;/strong&gt;&lt;br&gt;
Kafka helps Netflix achieve low-latency synchronization across devices.&lt;br&gt;
When a user switches from their TV to their smartphone, Kafka streams the current playback position to the session_state topic.&lt;br&gt;
The mobile app consumer picks it up instantly, resuming playback exactly where the user left off.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Foundational Concepts of Data Engineering</title>
      <dc:creator>Brian Ouchoh</dc:creator>
      <pubDate>Mon, 11 Aug 2025 07:17:00 +0000</pubDate>
      <link>https://dev.to/brian_ouchoh_f28dd3377816/foundational-concepts-of-data-engineering-5793</link>
      <guid>https://dev.to/brian_ouchoh_f28dd3377816/foundational-concepts-of-data-engineering-5793</guid>
      <description>&lt;p&gt;What happens when you want to report on an event in your organization? What happens when you want to get insights into your operations through data analysis? What happens when a data scientist wants to train a large language model? One common denominator for all these tasks is consuming data. Data engineering not only provides a way to collect, store, process, and access data reliably, but also tools to design and optimize data systems.&lt;/p&gt;

&lt;p&gt;Here are some core concepts that you need to understand as a data engineer:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1.Batch ingestion vs Streaming Ingestion&lt;/strong&gt;&lt;br&gt;
Batch ingestion is collecting data over a period of time and then processing it all at once, rather than dealing with each record as it arrives. &lt;/p&gt;

&lt;p&gt;The period can be hourly, daily, weekly, etc. An example is a restaurant collecting all point-of-sale transactions from all servers and loading them into a database for end-of-shift reporting.&lt;/p&gt;

&lt;p&gt;Unlike batch ingestion, streaming ingestion processes data as it arrives. An example is a point-of-sale system that updates the sales total as soon as a new sale is made.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2.Change Data Capture (CDC)&lt;/strong&gt;&lt;br&gt;
Change data capture is the process of identifying and recording changes (inserts, updates, and deletes) in a source database, then applying those changes downstream without having to reprocess the entire dataset.&lt;/p&gt;

&lt;p&gt;For example, you have an e-commerce platform with a table called “orders” which is updated constantly as purchase statuses change. Scenarios:&lt;/p&gt;

&lt;p&gt;A. Without CDC: instead of capturing the changes, the organization would periodically export the entire “orders” table from the database to the data warehouse. This results in high resource usage, increased latency, and complex deduplication.&lt;/p&gt;

&lt;p&gt;B. With CDC: the changes in purchases that affect the “orders” table are captured and applied downstream without having to reprocess the entire dataset.&lt;/p&gt;

&lt;p&gt;CDC is powered by tools such as Debezium, Oracle GoldenGate, and AWS Database Migration Service.&lt;/p&gt;
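&lt;p&gt;Applying CDC events downstream can be sketched in Python (event shapes are simplified; real Debezium payloads carry much more detail):&lt;/p&gt;

```python
# Sketch of applying CDC events downstream: only the changes are applied,
# never a full table re-export. The "warehouse" is just a dict here.
warehouse_orders = {"ORD1": {"status": "PENDING"}}

events = [
    {"op": "insert", "id": "ORD2", "row": {"status": "PENDING"}},
    {"op": "update", "id": "ORD1", "row": {"status": "SHIPPED"}},
    {"op": "delete", "id": "ORD2"},
]

for e in events:
    if e["op"] == "delete":
        warehouse_orders.pop(e["id"], None)
    else:  # insert or update
        warehouse_orders[e["id"]] = e["row"]

print(warehouse_orders)  # {'ORD1': {'status': 'SHIPPED'}}
```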

&lt;p&gt;&lt;strong&gt;3.Idempotency&lt;/strong&gt;&lt;br&gt;
Idempotency ensures that running the same operation multiple times, such as restarting an ingestion job after a failure, has the same effect as running it once, thus avoiding duplication.&lt;br&gt;
Idempotency uses techniques such as upserts and unique keys.&lt;/p&gt;
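&lt;p&gt;A minimal Python sketch of an idempotent load via upserts keyed on a unique ID (the target table is just a dict):&lt;/p&gt;

```python
# Sketch of an idempotent load: upserting by unique key means re-running the
# same job after a failure leaves the table in the same state as one run.
table = {}

def upsert(rows):
    for row in rows:
        table[row["id"]] = row  # insert or overwrite, never duplicate

batch = [{"id": 1, "total": 40}, {"id": 2, "total": 55}]
upsert(batch)
upsert(batch)  # job restarted and re-ran the same batch
print(len(table))  # 2 -- same effect as running once
```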

&lt;p&gt;&lt;strong&gt;4.OLTP vs OLAP&lt;/strong&gt;&lt;br&gt;
OLTP (Online Transaction Processing) systems prioritize speed, consistency, and concurrency to ensure that operational systems remain fast and reliable. Hence, OLTP systems are optimized for handling a large number of small, quick transactions, such as inserting or updating a single record.&lt;/p&gt;

&lt;p&gt;OLAP (Online Analytical Processing) systems are designed for aggregations, trend analysis, and multidimensional queries that may scan a large number of rows. Hence, OLAP systems are optimized for running complex analytical queries over large datasets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5.Partitioning&lt;/strong&gt;&lt;br&gt;
Partitioning is a technique of splitting large datasets into smaller, more manageable parts based on a key such as a date. The aim is to improve query performance and manageability.&lt;/p&gt;

&lt;p&gt;Common types of partitioning include:&lt;br&gt;
A. Range partitioning – Divides data based on a continuous range of values (e.g., dates or numeric IDs).&lt;br&gt;
B. List partitioning – Groups data based on a predefined list of values (e.g., regions: "US", "EU", "APAC").&lt;br&gt;
C. Hash partitioning – Uses a hash function on a key column to distribute rows evenly across partitions, improving load balancing.&lt;br&gt;
D. Composite partitioning – Combines two or more partitioning strategies (e.g., range + hash) for better control. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6.ETL vs ELT&lt;/strong&gt;&lt;br&gt;
ETL in full is Extract, Transform, Load and ELT in full is Extract, Load, Transform. Both terms refer to different strategies of a data pipeline in data engineering. &lt;/p&gt;

&lt;p&gt;In ETL, data is transformed before loading into the target system, while in ELT, data is loaded first and then transformed in the target system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7.CAP Theorem&lt;/strong&gt;&lt;br&gt;
Distributed systems aim to guarantee consistency, availability, and partition tolerance. The CAP theorem states that a distributed system can only provide two of these three guarantees at the same time:&lt;/p&gt;

&lt;p&gt;A. Consistency (all nodes see the same data at the same time)&lt;br&gt;
B. Availability (every request gets a response)&lt;br&gt;
C. Partition tolerance (system continues to operate despite network failures)&lt;/p&gt;

&lt;p&gt;Example: Apache Cassandra prioritizes Availability and Partition tolerance (AP), while traditional SQL databases often prioritize Consistency and Availability (CA).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;8.Windowing in Streaming&lt;/strong&gt;&lt;br&gt;
Streaming data never ends. A window can be used to group the data into finite chunks, e.g., data from the last 5 minutes. This makes processing easier. &lt;/p&gt;

&lt;p&gt;Common window types:&lt;br&gt;
A. Tumbling windows – Fixed-size, non-overlapping intervals (e.g., every 5 minutes).&lt;br&gt;
B. Sliding windows – Overlapping intervals that “slide” forward, useful for rolling metrics.&lt;br&gt;
C. Session windows – Group events that occur within a defined inactivity gap, useful for user activity sessions.&lt;/p&gt;
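&lt;p&gt;Tumbling windows are the easiest to sketch: each event timestamp maps to exactly one fixed-size bucket. A Python illustration with made-up timestamps:&lt;/p&gt;

```python
# Sketch of tumbling windows: fixed-size, non-overlapping 5-minute intervals.
# Each event timestamp maps to exactly one window.
WINDOW = 5 * 60  # window size in seconds

def window_start(ts: int) -> int:
    return ts - (ts % WINDOW)

events = [10, 290, 305, 610]  # seconds since some epoch
buckets = {}
for ts in events:
    buckets.setdefault(window_start(ts), []).append(ts)

print(buckets)  # {0: [10, 290], 300: [305], 600: [610]}
```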

&lt;p&gt;&lt;strong&gt;9.DAGs and Workflow Orchestration&lt;/strong&gt;&lt;br&gt;
A DAG is a Directed Acyclic Graph: a set of tasks linked by dependencies, with a clear order and no circular paths. Workflow orchestrators like Apache Airflow or Prefect use DAGs to define, schedule, and monitor data pipelines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;10.Retry Logic &amp;amp; Dead Letter Queues&lt;/strong&gt;&lt;br&gt;
Retry logic automatically attempts to reprocess failed tasks to handle transient errors: temporary failures that often resolve on their own when retried. &lt;/p&gt;

&lt;p&gt;Dead letter queues (DLQs) store messages that consistently fail processing for later inspection.&lt;/p&gt;

&lt;p&gt;Example: A Kafka consumer might retry processing an event three times before sending it to a DLQ for manual review.&lt;/p&gt;
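&lt;p&gt;The retry-then-DLQ pattern from the example can be sketched in Python (an in-memory toy, not a real Kafka consumer):&lt;/p&gt;

```python
# Sketch of retry logic with a dead letter queue: retry a failing handler up
# to three times, then park the message in the DLQ for manual review.
dead_letter_queue = []

def process_with_retries(message, handler, max_retries=3):
    for attempt in range(max_retries):
        try:
            return handler(message)
        except Exception:
            continue  # transient error: try again
    dead_letter_queue.append(message)  # gave up: park for inspection

def always_fails(msg):
    raise ValueError("malformed payload")

process_with_retries({"order_id": "ORD9"}, always_fails)
print(dead_letter_queue)  # [{'order_id': 'ORD9'}]
```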

&lt;p&gt;&lt;strong&gt;11.Back-filling &amp;amp; Reprocessing&lt;/strong&gt;&lt;br&gt;
Backfilling is the process of ingesting historical data that was missed or never processed initially. Gaps in historical data can occur due to a temporary outage, or because a new pipeline goes live and needs to populate past data. &lt;/p&gt;

&lt;p&gt;Reprocessing involves rerunning processing logic on existing historical data to correct errors, apply updated transformations, or accommodate schema changes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;12.Data Governance&lt;/strong&gt;&lt;br&gt;
Data governance refers to the framework of rules, procedures, and best practices that guide how data is managed to maintain its accuracy, protect it from unauthorized access, ensure confidentiality, and meet regulatory obligations. &lt;/p&gt;

&lt;p&gt;Examples of data governance frameworks are: Control Objectives for Information and Related Technologies (COBIT), the Data Management Capability Assessment Model (DCAM), and the NIST Privacy Framework.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;13.Time Travel &amp;amp; Data Versioning&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Time travel and data versioning are features in modern data warehouses and table formats (such as Snowflake, Delta Lake, and Apache Iceberg) that allow you to access and query historical versions of data. This means you can “look back in time” to see the state of your dataset at a specific moment, or maintain multiple dataset versions for auditing, debugging, or recovery purposes.&lt;/p&gt;

&lt;p&gt;Why it matters:&lt;/p&gt;

&lt;p&gt;A. Simplifies auditing and compliance reporting.&lt;br&gt;
B. Helps debug data issues by comparing historical states.&lt;br&gt;
C. Enables safe experimentation without risking permanent data loss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;14.Distributed Processing Concepts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Distributed processing splits a workload across multiple machines to handle large-scale data efficiently. Concepts include:&lt;br&gt;
Sharding: Splitting data across nodes.&lt;/p&gt;

&lt;p&gt;Replication: Keeping copies of data for fault tolerance.&lt;/p&gt;

&lt;p&gt;MapReduce: Dividing a task into smaller “map” tasks, then combining results in a “reduce” step.&lt;/p&gt;
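&lt;p&gt;The classic MapReduce illustration is word count; here is a single-machine Python sketch of the map and reduce steps:&lt;/p&gt;

```python
# Sketch of the MapReduce idea with the classic word count:
# "map" emits (word, 1) pairs, "reduce" sums the counts per word.
from collections import defaultdict

docs = ["to be or not to be", "to stream or to batch"]

# Map: each document -> list of (word, 1) pairs.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle + reduce: group by word and sum the counts.
counts = defaultdict(int)
for word, n in mapped:
    counts[word] += n

print(counts["to"])  # 4
```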

</description>
      <category>programming</category>
      <category>database</category>
      <category>kafka</category>
    </item>
  </channel>
</rss>
