Hilary Wambwa

Apache Kafka Deep Dive: Core Concepts, Data Engineering Applications, and Real-World Production Practices

Kafka Architecture

Kafka is a distributed streaming system in which producers publish data and consumers subscribe to it in real time.

The hierarchy is simple: topics are split into partitions, partitions live on brokers, and a group of brokers forms a cluster.

Brokers are servers that store and serve data. A cluster is made up of several brokers in communication, and each broker stores topic partitions.

A topic is a logical stream of data. Think of it as a table in a structured database. Like data in a table, events are written to a topic.

A partition is a unit of parallelism and scalability. Think of it as a subtopic. It is what makes Kafka distributed as partitions are distributed to brokers. A topic with 10 partitions can be processed by up to 10 consumers in parallel within a consumer group.

An offset is the position of a record in a partition. Each message appended to a partition's log is assigned a unique, automatically increasing offset that marks its position.

ZooKeeper is the historic external coordinator for Kafka. If Kafka is the roads and highways transporting data, then ZooKeeper is the traffic command center that assigns partitions, configures topics, and monitors the health of brokers. Since it is external, data engineers have to configure it separately, and if ZooKeeper fails, the entire Kafka cluster goes down.

Kafka Raft (KRaft) removes ZooKeeper and lets Kafka manage its own 'traffic': the brokers elect a controller from a quorum of brokers to configure topics and assign partitions. If the controller fails, another broker takes over.

Cluster setup and scaling

Scaling in Kafka means the cluster can accommodate extra load by adding more brokers. This happens in three steps:

  • A new broker registers with the cluster, updating the cluster metadata.
  • Existing partitions are moved to the new broker to balance the load/redistribute.
  • Consumers adjust to new partition assignments to maintain parallel processing.

Producers and Consumers

Producers
These produce, write or publish data into topics.

Each message can carry a key that determines its destination partition. Think of partitions as separate lines in a distribution center. If you use the same key (say, a user's ID), all messages for that user go to the same partition, keeping things in order. For example, if you're tracking orders for "User123," all their order events sit in one partition. The sketch below sends an order for "User123" to the "orders" topic, and Kafka figures out the partition from the key.
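A minimal sketch with the standard Java producer client (the serializer choices and the JSON payload are just for illustration):

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class OrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");   // see the acknowledgment modes below

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key ("User123") -> same partition, so this user's orders stay in order
            producer.send(new ProducerRecord<>("orders", "User123",
                    "{\"order_id\": \"12345\", \"amount\": 99.99}"));
        }
    }
}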

Acknowledgment modes
acks is the setting that controls how many brokers must acknowledge a write before the producer considers it successful.

  • acks=0: The producer sends a message and does not wait for a reply/acknowledgement. Fast but risky.
  • acks=1: The producer waits for the partition leader to acknowledge the message. However, if the leader crashes before copying the message to other brokers, data is lost.
  • acks=all: The producer waits for all in-sync replicas to acknowledge the message. Slow but reliable.

For critical stuff like bank transactions, you’d go with acks=all. Netflix, for example, uses this for their event streaming to make sure no data gets lost, as they handle billions of events daily.

Consumers
These consume, read, or subscribe to data from topics.

Consumer groups are like a team of workers. The consumers in one group are assigned to the partitions of a topic. If a consumer fails, Kafka reassigns its partitions to another consumer in the group.

There are two ways to keep track of the last message processed by a consumer, using offset commits:

  • Automatic Commits: Offsets are committed automatically every few seconds (set by auto.commit.interval.ms). Easy, but prone to skips and duplicates if anything crashes mid-process.
  • Manual Commits: The consumer decides when to save its progress, giving more control.
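A minimal sketch of a consumer using manual commits with the standard Java client (group and topic names are illustrative):

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class OrderConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-consumers");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");   // take control of offset commits

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Process the record first ...
                    System.out.printf("offset=%d key=%s value=%s%n",
                            record.offset(), record.key(), record.value());
                }
                consumer.commitSync();   // ... then commit, so a crash replays rather than skips messages
            }
        }
    }
}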

Message Delivery & Processing

Message delivery semantics are the guarantees about whether, and how many times, a message sent by a producer will be delivered to a consumer.

  • At-most-once: Message is delivered zero or once. Might get lost, but never duplicated.
  • At-least-once: message is delivered one or more times. Won’t get lost, but duplicates are possible.
  • Exactly-once: message is delivered exactly one time. No losses, no duplicates.

Kafka retains these messages according to three policies:
Time-based (e.g., keep data for 7 days)
Messages are deleted after a specified time. Perfect when data is only relevant for a limited period. Example: a building management system monitoring temperature readings from IoT sensors installed in a building only needs the last few minutes or hours of data to make decisions and can discard messages afterwards.

Size-based (e.g., max 1 GB).
This limits the space a topic partition can use. When the limit is exceeded, Kafka starts deleting the oldest messages. Used where control over storage costs is critical.

Log compaction (keep the latest value per key).
Only the most recent message for each key in a topic is kept. This maintains a single source of truth per key, ideal for tracking the latest version of an entity, e.g., user profile updates.
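These policies are set per topic. A rough sketch with the built-in CLI (topic names and values are illustrative):

# Keep sensor readings for 7 days
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name temperature_readings \
  --add-config retention.ms=604800000

# Cap a topic partition at roughly 1 GB
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name clickstream \
  --add-config retention.bytes=1073741824

# Keep only the latest value per key
kafka-configs.sh --bootstrap-server localhost:9092 --alter \
  --entity-type topics --entity-name user_profiles \
  --add-config cleanup.policy=compact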

Back pressure & Flow Control

Back pressure builds up when producers write messages faster than consumers can read them. This can lead to consumers crashing or performance degrading.

There are several ways to handle slow consumers:

  • Consumer group scalability: Add more consumers to share the available partitions, distributing load and reducing back pressure.
  • Pause and resume: Consumers can pause and resume reading from specific partitions to clear a backlog without being overwhelmed (see the sketch below).
  • Buffer management: Consumers use configurable fetch buffers that limit the amount of data read per request. Lowering the limit prevents a backlog of data building up in memory, reducing overload.
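Pause and resume, for instance, are plain methods on the Java consumer. A rough sketch of the idea, building on the consumer loop shown earlier (the backlog object and its threshold are made up for illustration):

// Inside the poll loop of a consumer like the one shown earlier
ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
if (backlog.size() > 10_000) {
    consumer.pause(consumer.assignment());    // stop fetching while the backlog drains
} else if (!consumer.paused().isEmpty()) {
    consumer.resume(consumer.paused());       // start fetching again once caught up
}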

How can we monitor the speed of consumers?
Consumer lag measures how far behind a consumer is relative to the latest message in a partition:
Lag = Latest Offset - Consumer Offset

Monitoring tools
The kafka-consumer-groups.sh tool displays lag for each partition in a consumer group.

Command line
kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group order-consumers --describe

Metrics APIs
Kafka exposes JMX metrics, such as the consumer's records-lag-max, for monitoring. Prometheus can scrape these metrics (via the JMX exporter) and Grafana can display them in real-time dashboards.

Third-Party Tools
Confluent Control Center or Burrow provide user-friendly interfaces for lag monitoring. According to a Confluent blog, monitoring lag is critical for ensuring SLAs in production systems.

Serialization & Deserialization

In Kafka, messages are stored and transmitted as raw bytes. Serialization is the process of converting data, like a JSON object, into a byte stream that Kafka can handle. Deserialization takes that byte stream and turns it back into something meaningful, like a Python dictionary or a JSON object, for the consumer to process.

JSON: It's human-readable, widely supported, and great for quick prototyping. Fine for getting started, but in production with high message volumes, its size and lack of schema management can become a hindrance.
Example:
{
"order_id": "12345",
"amount": 99.99
}

Avro: a binary format designed for efficiency. Utilizes a schema used by producers and consumers to maintain consistency.
The schema is stored separately in the Confluent Schema Registry. This way, one can add new fields to schema or make them optional without breaking consumers.
Here’s an example Avro schema for an order:
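A minimal sketch, matching the JSON example above:

{
  "type": "record",
  "name": "Order",
  "namespace": "com.example.orders",
  "fields": [
    {"name": "order_id", "type": "string"},
    {"name": "amount", "type": "double"}
  ]
}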

When serialized, this schema produces a compact binary representation, unlike JSON’s bulky text.

Protobuf: Like Avro, it's a binary format, but it's designed for cross-language compatibility (Java, Python, C++, and more) and high performance; its payloads are often even smaller than Avro's.

Advanced Kafka Concepts

Replication

To prevent chaos if something goes wrong, Kafka creates multiple copies (replicas) of each partition and spreads them across different brokers.
Leader & follower replicas
The leader replica is the primary copy of a partition; the broker hosting it handles all reads and writes.
Follower replicas are the backups. They don't serve client traffic directly but keep an 'eye' on, and a copy of, the leader's data.
ISR (In-Sync Replicas) is the list of replicas that are fully up-to-date with the leader's data. If a follower falls behind, it's kicked out of the ISR until it catches up.

How to configure replication:
replication.factor: How many total replicas (leader + followers) each partition should have. A replication.factor=3 means one leader and two followers.
min.insync.replicas: The minimum number of replicas that must be in sync for the leader to accept writes. Setting min.insync.replicas=2 ensures at least two replicas (including the leader) are up-to-date before acknowledging a write.
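For instance, both settings can be applied when a topic is created (the topic name and values below are purely illustrative):

kafka-topics.sh --bootstrap-server localhost:9092 --create \
  --topic payments --partitions 6 --replication-factor 3 \
  --config min.insync.replicas=2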

Fault tolerance

A brief look at how Kafka handles failures:
Normally: the leader replica for a partition, say on Broker 1, handles all reads and writes, while followers on Brokers 2 and 3 replicate the data in real time.
Failure: if Broker 1 crashes, Kafka's controller (part of cluster management, handled by ZooKeeper or KRaft) detects the failure and promotes a follower to become the new leader.
Recovery: when Broker 1 comes back online, it rejoins as a follower, catching up on any missed messages before re-entering the ISR.

Kafka Connect

A scalable, easy-to-use framework built into Kafka for streaming data in and out. It's plug and play: you only configure connectors, and Kafka Connect does the heavy lifting of integration. There are two types of connectors:

  • Source connectors: pull data into Kafka

  • Sink connectors: push data out of Kafka

Each connector breaks the data flow into tasks, which run in parallel for scalability.
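As a rough sketch, a source connector is usually just a JSON configuration submitted to the Kafka Connect REST API. The example below uses Debezium's Postgres connector class; the hostnames, credentials, and topic prefix are placeholders:

{
  "name": "postgres-orders-source",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "secret",
    "database.dbname": "shop",
    "topic.prefix": "shop",
    "tasks.max": "1"
  }
}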

An example is a CDC pipeline that pulls changes from a Postgres database and posts them to a Cassandra database, using a Debezium Postgres source connector (configured like the sketch above) and a Debezium Cassandra sink connector.
Similarly, consider a retail company using a MongoDB source connector to feed inventory updates into Kafka, then a sink connector to push alerts to a notification system.

Kafka Streams

A Java library for building real-time data processing applications. It leverages Kafka's scalability and fault tolerance, so you can process, analyze, and transform data on the fly without running a separate processing system.

Stateless operations: Each message (data record) is processed independently, with no context carried over.

  • Map: Transform data, like renaming.
  • Filter: Keep only certain records.
  • FlatMap: Split one record into multiple.

Stateful operations: These operations require maintaining “state” (context) across multiple messages.

  • Count: Track the number of orders per item.
  • Aggregate: Sum up or average values.
  • Join: Combine streams.

State is managed using state stores, which are backed by Kafka topics.
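A minimal Kafka Streams sketch combining a stateless and a stateful step (topic names and the filter condition are invented for illustration):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;
import java.util.Properties;

public class OrderCountsApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-counts-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders = builder.stream("orders");

        KTable<String, Long> ordersPerItem = orders
                .filter((itemId, order) -> order != null)   // stateless: drop bad records
                .groupByKey()                               // stateful work needs a grouping first
                .count();                                   // stateful: running count per key, kept in a state store

        ordersPerItem.toStream()
                .to("orders-per-item", Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}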

Windowing

This is grouping data into time-based “windows” for analysis, perfect for real-time trends.

  • Tumbling: Fixed, non-overlapping time intervals (e.g., every 5 minutes).
  • Hopping: Fixed intervals that overlap (e.g., 5-minute windows advancing every 1 minute). Like checking messages every minute but looking at the last 5 minutes.
  • Session: Dynamic windows based on user activity (e.g., group orders from the same customer if they’re within 10 minutes of each other).
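Building on the orders stream from the sketch above, a tumbling-window count might look roughly like this (window sizes are arbitrary; requires the TimeWindows, SessionWindows, Windowed, and java.time.Duration imports):

// Count orders per key in fixed, non-overlapping 5-minute windows
KTable<Windowed<String>, Long> windowedCounts = orders
        .groupByKey()
        .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
        .count();

// Hopping: same 5-minute size, but advancing every minute
// .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)).advanceBy(Duration.ofMinutes(1)))

// Session: close a window after 10 minutes of inactivity per key
// .windowedBy(SessionWindows.ofInactivityGapWithNoGrace(Duration.ofMinutes(10)))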

ksqlDB

SQL-like interface for streaming queries. It is like a streaming database built on top of Kafka that lets you query real-time data using simple SQL commands. It is built around three concepts:

Streams: an immutable, append-only sequence of events. Take a temperature monitoring system as an example: the stream is the series of temperature updates sent by the sensors.

This stream, stored in a Kafka topic called temperature_readings, keeps growing as new readings arrive.

To define a temperature stream:
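A sketch of what that might look like in ksqlDB (the column names and value format are assumptions):

CREATE STREAM temperature_readings (
    sensor_id VARCHAR KEY,
    temperature DOUBLE,
    reading_time BIGINT
) WITH (
    KAFKA_TOPIC = 'temperature_readings',
    VALUE_FORMAT = 'JSON'
);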

Tables: These are a snapshot of what’s happening now and represent the current state of data.
Think of a stream as a log of every temperature reading ever sent, while a table is like a dashboard showing the most recent reading for each sensor.
Let's create a table of the latest temperatures:
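Continuing the sketch above, a rough version might be (LATEST_BY_OFFSET keeps the most recent value seen for each key):

CREATE TABLE latest_temperatures AS
    SELECT sensor_id,
           LATEST_BY_OFFSET(temperature) AS latest_temperature
    FROM temperature_readings
    GROUP BY sensor_id
    EMIT CHANGES;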

Queries: analyze streams and tables using SQL.

Push Queries: Continuously deliver results as new data arrives, like a live feed of temperature alerts.

Pull Queries: Fetch the current state from a table.
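Sticking with the temperature example, rough sketches of both query types:

-- Push query: stream results continuously as new readings arrive
SELECT sensor_id, temperature
FROM temperature_readings
WHERE temperature > 30
EMIT CHANGES;

-- Pull query: look up the current state for one sensor
SELECT latest_temperature
FROM latest_temperatures
WHERE sensor_id = 'sensor-1';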

Transactions & Idempotence

Kafka's transactions and idempotence features ensure every message is processed exactly once (no duplicates, no misses) by allowing writes across multiple topics and partitions to be treated as a single, atomic unit.

This in turn enables exactly-once semantics (EOS), the holy grail for applications where precision is non-negotiable, like financial systems. EOS means a message is delivered and processed exactly once, even in the face of network failures or broker crashes.

Idempotence ensures that even if a producer retries sending a message after a broker or network failure, Kafka recognizes the duplicate and records it only once. This is achieved using a unique producer ID and per-message sequence numbers.
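A minimal sketch of an idempotent, transactional producer using the standard Java client (topic names, the transactional ID, and the payloads are invented for illustration):

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class TransferProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");       // dedupe retries via producer ID + sequence numbers
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "transfer-producer-1");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.initTransactions();
        try {
            producer.beginTransaction();
            producer.send(new ProducerRecord<>("debits", "acct-1", "-100.00"));
            producer.send(new ProducerRecord<>("credits", "acct-2", "100.00"));
            producer.commitTransaction();   // both records become visible together, exactly once
        } catch (Exception e) {
            producer.abortTransaction();    // neither record is exposed to read_committed consumers
        } finally {
            producer.close();
        }
    }
}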

Use Case: A bank uses Kafka to process transactions. Without EOS, a transfer could be recorded twice, overcharging a customer.

Security in Kafka

Authentication in Kafka prevents unauthorized access to messages and topics.
SASL (Simple Authentication and Security Layer): Think of it as a versatile ID checker. It supports mechanisms like PLAIN (username/password), GSSAPI (Kerberos for enterprise environments), and SCRAM (secure password-based authentication).
Kerberos is common in large organizations like banks, where strict identity verification is a must.

OAuth: This is like using a third-party app like Google to log in. It’s great for modern, cloud-native setups.

Authorization (Access Control Lists, ACLs)
Once users are verified, you need to control what they can do. Authorization in Kafka uses ACLs to define permissions. ACLs specify which actions (e.g., read, write, delete) a principal can perform on topics or consumer groups.
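For example, granting read access might look like this with the built-in CLI (the principal, topic, and group names are placeholders):

# Allow the analytics service to read the orders topic with its consumer group
kafka-acls.sh --bootstrap-server localhost:9092 --add \
  --allow-principal User:analytics-service \
  --operation Read \
  --topic orders --group order-consumers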

Encryption (TLS)
Encryption ensures messages stay confidential. Kafka uses Transport Layer Security (TLS) to encrypt data in transit between producers, consumers, and brokers, as well as between brokers.
Enabling TLS
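On the broker side, this roughly means adding an SSL listener plus keystore and truststore settings to server.properties; the paths and passwords below are placeholders:

listeners=SSL://broker1.example.com:9093
security.inter.broker.protocol=SSL
ssl.keystore.location=/etc/kafka/ssl/broker1.keystore.jks
ssl.keystore.password=<keystore-password>
ssl.key.password=<key-password>
ssl.truststore.location=/etc/kafka/ssl/broker1.truststore.jks
ssl.truststore.password=<truststore-password>

Clients then connect with security.protocol=SSL and their own truststore.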

Operations & Monitoring

Consumer lag
This measures how far behind a consumer is relative to the latest message in a partition.
Lag = Latest Offset - Consumer Offset
Lag shows whether consumers can keep up with the incoming data volume. If lag keeps growing, a real-time pipeline might not be so “real-time” anymore.
To monitor this, we can set alerts for lag thresholds e.g., >1000 messages using tools like Prometheus and Grafana.

Broker health & under-replicated partitions
Brokers are the heart of a Kafka cluster. Broker health metrics, especially under-replicated partitions (URPs), tell you whether brokers are keeping up with replication.

Kafka uses replication to ensure fault tolerance. Each partition has a leader replica and follower replicas. If followers can't keep up with the leader, the partition is reported as under-replicated, signaling a risk of data loss if the leader fails.

To monitor this, we check the UnderReplicatedPartitions metric via Kafka’s JMX metrics or tools like Prometheus. A non-zero value indicates trouble.

Throughput & latency
Throughput is a measure of how many messages a cluster processes per second. Latency tracks how long it takes for a message to go from producer to consumer.

Low throughput or high latency can choke a pipeline. For example, if you're streaming real-time analytics, slow throughput could delay updates reaching users, impacting their experience.

Monitoring is via JMX metrics like BytesInPerSec, BytesOutPerSec, and MessagesInPerSec for throughput, and RequestLatencyAvg for latency.
Latency can be reduced by optimizing network settings or reducing partition counts for low-traffic topics.

Scaling Kafka

Partition count tuning
More partitions mean more parallelism, letting consumer groups process messages faster. A topic with 10 partitions can be consumed by up to 10 consumers in a group, speeding up throughput. However, too many partitions can increase overhead, like having too many queues confusing your baristas.

A good rule of thumb, per the Kafka documentation, is to start with 1–2 partitions per broker and adjust based on throughput needs.
NOTE: A topic's partition count can be increased later but never reduced, and adding partitions changes how keys map to partitions, so it is wise to plan ahead.

Adding brokers
Adding brokers increases storage and throughput capacity; it is like opening a new counter to serve more customers.
New brokers are added to the cluster, and Kafka redistributes partitions to balance the load. The cluster metadata is updated via the Kafka controller.

Rebalancing partitions
After adding brokers or tuning partitions, rebalancing partitions is necessary to spread the load evenly, like reassigning baristas to quieter queues.

The kafka-reassign-partitions.sh tool moves partitions between brokers, ensuring no single broker is overwhelmed while others sit idle.
Tip: Rebalancing can be resource-intensive, so schedule it during low-traffic periods.
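A rough sketch of the reassignment workflow (the file names and broker IDs are illustrative):

# 1. Generate a proposed assignment for the topics listed in topics.json across brokers 1-4
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --topics-to-move-json-file topics.json --broker-list "1,2,3,4" --generate

# 2. Save the proposal as reassignment.json, then execute it
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file reassignment.json --execute

# 3. Check progress until the reassignment completes
kafka-reassign-partitions.sh --bootstrap-server localhost:9092 \
  --reassignment-json-file reassignment.json --verify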
Weaving it all together, tune partitions to match throughput needs, add brokers to handle more traffic, and rebalance to keep things smooth.

Performance Optimization

Batching and compression
Producers send messages to brokers, but sending each message individually is inefficient. Batching groups multiple messages into a single request, reducing network overhead and boosting throughput.
To configure this, we use linger.ms, which sets how long a producer waits to accumulate messages before sending a batch. Even a small delay (e.g., 5 milliseconds) can dramatically increase throughput by allowing more messages to be sent together. Another setting, batch.size, controls the maximum size of a batch (in bytes).
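In producer configuration terms, a sketch might look like this (the values are arbitrary starting points, not recommendations):

# producer.properties (illustrative values)
# Wait up to 5 ms so more messages can share one request
linger.ms=5
# Maximum batch size in bytes per partition
batch.size=32768
# Compress batches to cut network usage
compression.type=lz4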

Page cache usage
Kafka uses the operating system’s page cache, a portion of RAM used to cache disk data. When producers write messages to a partition, they’re stored on disk but also kept in the page cache. Consumers reading recent messages can often pull them directly from memory, avoiding slow disk I/O. This is especially powerful for Kafka’s log-based architecture, where messages are appended sequentially, making cache hits common.

Disk and network considerations
Disk considerations:
Kafka stores messages on disk in a log-based structure, making disk performance critical. To optimize this:
Use SSDs: Solid-state drives (SSDs) offer faster I/O than traditional HDDs, reducing write and read latency.
Separate log directories: Spread Kafka’s log directories (log.dirs) across multiple disks to parallelize I/O.
Avoid overloading disks: Monitor disk I/O using tools like iostat. If disk utilization nears 100%, add more disks or brokers to distribute load.

Network considerations:
High-Bandwidth NICs: Use network interface cards (NICs) with at least 1 Gbps (preferably 10 Gbps) to handle high message volumes.
This ensures Kafka can handle large bursts of traffic without bottlenecks.
Enable compression: Compression reduces network load. For network-bound clusters, lz4 compression offers a good balance of speed and efficiency.
Monitor network latency: Use metrics like network-io-rate to spot bottlenecks. If latency spikes, consider upgrading network hardware or optimizing producer/consumer configurations.

Use Cases

Uber's Real-Time Ad System
Imagine running Uber Eats ads where every click or impression is money on the line. Mess up, and you’re either losing revenue or overcharging advertisers. In 2021, Uber built a slick system using Apache Kafka, Flink, Pinot, and Hive to process ad events (clicks, impressions) in near real-time with exactly-once precision, no duplicates, no losses. It’s like a coffee shop where every order is tracked perfectly, even during a rush.

Kafka acts as the reliable message queue, with topics like “Mobile Ad Events” split into partitions for parallel processing. Producers (the app) send events with acks=all for reliability, while Flink jobs (consumers) aggregate, attribute, and load data across two regions for failover.

Exactly-once semantics: Flink’s “read_committed” mode and Kafka’s transactions, paired with unique record UUIDs, ensure no double-counting. Retention policies keep 3-day backups for recovery, and replication across brokers guarantees fault tolerance. Flink’s 1-minute windowing handles back pressure, while consumer lag monitoring keeps the pipeline smooth.

This setup powers ad auctions, billing, and analytics, processing millions of events weekly with sub-2-minute latency. Uber’s use of Kafka’s partitioning, transactions, and scalability shows how theory fuels real-world wins, delivering fast, accurate insights for advertisers.

Link to article : https://www.uber.com/en-KE/blog/real-time-exactly-once-ad-event-processing/

Pinterest
Pinterest handles massive data, with 459 million users pinning images non-stop. To manage 40 million messages per second and 50 GB/s of traffic, Pinterest leans on Apache Kafka, running 50+ clusters with 3,000 brokers.

Kafka’s topics and partitions are central, handling 3,000+ topics and 500K partitions for user events and database changelogs. Static “brokersets” ensure even partition distribution, boosting scalability. Producers like Singer (logging) and Maxwell (DB ingestion) publish compressed messages. Consumers, such as S3 Transporter for analytics and Flink/Kafka Streams for real-time spam detection or recommendations, use consumer groups to scale processing.

Serialization uses compact formats to ease CPU strain during upgrades. For performance optimization, SSDs replace magnetic disks, slashing I/O latency, while batching and compression boost throughput.

Pinterest powers real-time monetization and safety pipelines, processing petabytes for ML and metrics. Kafka’s partitioning, replication, and monitoring make this data chaos a seamless, scalable win.

Link to article: https://www.confluent.io/blog/running-kafka-at-scale-at-pinterest/
