Apache Kafka is an open-source distributed event streaming platform.
It was originally developed at LinkedIn and later open-sourced under the Apache Software Foundation.
Event
A record of something that happened in the system, e.g. a button click on a website or a row inserted into a database.
Streaming
This is the continuous generation, delivery and processing of data in real-time
Event streaming is the continuous capture, storage and processing of events as they happen, i.e. the ability to:
- capture data in real time in the form of streams of events
- store these streams of events for later retrieval
- process, react to the event streams in real time
- route the event streams to destination technologies as needed
Kafka Architecture
1. Brokers
Brokers are Kafka servers. They:
- store topics and partitions
- handle incoming and outgoing messages
- communicate with producers and consumers (clients)
Kafka clusters usually have multiple brokers for scalability and fault tolerance.
Topics are where producers write events to and where consumers read events from. Topics are logical, not physical: they're split into partitions for scalability.
An event is written into exactly one partition. Events inside a partition are stored in an ordered, immutable log (append-only).
2. ZooKeeper vs KRaft mode
ZooKeeper
A distributed coordination service.
It helps manage metadata, configuration and synchronization for distributed systems.
- Keeps track of all Kafka brokers
- Elects the controller broker responsible for partition assignments
- Detects and manages broker failures or restarts and decides which broker leads which partition
Suppose you have 3 Kafka brokers. If one broker fails, ZooKeeper notices the failure, a new leader is elected for the affected partitions, and metadata is updated so producers and consumers know where to connect.
Cons
- ZooKeeper is a separate system to manage, which adds operational complexity
KRaft Mode
In KRaft mode, Kafka removes ZooKeeper and manages everything internally:
- Brokers handle data, while controllers manage metadata using the Raft protocol, replacing ZooKeeper's coordination model.
Pros
- simpler deployment and stronger metadata consistency
- easier scaling to thousands of brokers
3. Cluster setup and scaling
A cluster is a group of Kafka brokers working together
A cluster has:
Multiple brokers (for load-balancing and fault-tolerance)
A controller broker (which manages partition leadership)
Scaling is done by adding or removing brokers, which requires partition rebalancing: handled manually with tools like kafka-reassign-partitions.sh or automated with solutions such as Confluent Auto Data Balancer.
Kafka scales out by adding brokers and rebalancing partitions, and scales in by removing brokers after reassigning their partitions.
Partitioning enables parallelism, replication ensures fault tolerance, and monitoring metrics helps decide when to adjust cluster size.
Setup and Scaling
Step i) Deploy more than one broker instance to increase storage and throughput.
Step ii) Create topics and subdivide them into partitions. Partitions spread data across brokers for parallel processing and load balancing.
Step iii) For fault tolerance, replicate each partition across multiple brokers to ensure data availability if a broker fails.
Step iv) Connect client applications, i.e. producers and consumers, to the cluster.
Producers auto-discover brokers via bootstrap servers. Consumers scale by adding more instances in a consumer group, each taking a share of the partitions.
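To make steps ii) and iii) concrete, here is a minimal sketch using the Java AdminClient to create a topic with several partitions and a replication factor. The broker addresses and the topic name "orders" are placeholder assumptions, not part of any real setup.

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Properties;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap servers; point these at your own brokers.
        props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");

        try (Admin admin = Admin.create(props)) {
            // 6 partitions spread load across brokers for parallelism;
            // replication factor 3 keeps a copy of each partition on three brokers.
            NewTopic ordersTopic = new NewTopic("orders", 6, (short) 3);
            admin.createTopics(List.of(ordersTopic)).all().get();
        }
    }
}
```

The same thing can be done with the kafka-topics.sh CLI; the AdminClient version is shown here only to keep all examples in one language.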
Topics, Partitions, Offsets
Topic
A topic is a logical category or feed name to which events (messages/records) are published.
All events of a similar type go into the same topic.
Producers write events to a topic (to one of a topic's partitions).
Consumers read events from those partitions, usually in the order they were appended.
Partition
A partition is a subdivision of a Kafka topic. It is the actual storage unit where events are stored in an ordered, immutable log. Each topic is split into partitions for scalability and parallelism, and these partitions physically reside on Kafka brokers.
Offset
An offset is the position of a record in a partition: a monotonically increasing integer assigned to each record within that partition.
The producer writes messages and Kafka assigns them sequential offsets per partition.
Example: In Partition 0, records may have offsets 0, 1, 2, 3…
Consumers read messages using offsets as pointers: a consumer remembers the last processed offset.
Offsets act as bookmarks: if a consumer crashes and restarts, Kafka knows where it left off.
Producers
Producers are client applications that publish (write) events to a Kafka topic. They can choose which partition an event goes to (using a key, or round-robin otherwise).
a) Writing Data into Topics
A producer sends records (messages) to a topic.
Kafka then decides which partition within that topic the record will go to.
b) Key-based Partitioning (Controlling Message Distribution)
Each record may include a key.
Kafka uses the key to determine which partition the record should go to:
If a key is provided, Kafka applies a hash function on the key and the record always goes to the same partition for the provided key.
If no key is provided, Kafka distributes records in a round-robin fashion across partitions for load balancing.
c) Acknowledgment Modes (acks)
Producers can choose how much confirmation they want from Kafka before considering a write successful:
acks=0 (Fire and Forget)
Producer does not wait for any acknowledgment.
Fastest, but messages may be lost if the broker fails.
acks=1 (Leader Acknowledgment)
Producer gets acknowledgment once the leader partition writes the record.
Safer, but if the leader crashes before followers replicate, data may be lost.
acks=all (or acks=-1) (All Replicas Acknowledge)
Producer waits until the leader and all in-sync replicas acknowledge the write.
Safest (strong durability guarantee), but slower due to waiting for replication
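A minimal producer sketch tying these ideas together: the record key drives partitioning, and acks=all requests the strongest acknowledgment. The broker address, topic name "orders" and key "customer-42" are illustrative assumptions.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class OrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all"); // wait for leader + in-sync replicas

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Records with the same key ("customer-42") always hash to the same partition,
            // so events for one customer stay in order.
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("orders", "customer-42", "order created");
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                } else {
                    System.out.printf("Written to partition %d at offset %d%n",
                            metadata.partition(), metadata.offset());
                }
            });
        } // closing the producer flushes any buffered records
    }
}
```

Leaving the key null would switch the producer back to round-robin (or sticky) distribution across partitions.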
Consumers
Consumers are client applications that subscribe to (read and process) events from a Kafka topic.
They can read from one or more partitions
a) Read data from topics
A consumer subscribes to one or more topics and pulls data from Kafka at its own pace.
b) Consumer groups (scaling and parallel consumption)
Consumers belong to a consumer group, identified by a group id. Kafka ensures that each partition is consumed by exactly one consumer in a group.
Benefits
Scalability: Multiple consumers in a group can read from different partitions in parallel.
Fault Tolerance: If one consumer fails, Kafka reassigns its partitions to other consumers in the group.
c) Offset management (automatic vs manual commits)
An offset is a number that marks a consumer’s position in a partition (like a bookmark).
When a consumer reads messages, it must commit the offset to Kafka so it can resume from the right place if restarted.
To commit offsets:
1. Automatic Commits
Kafka handles committing offsets on behalf of the consumer at regular intervals: the consumer reads messages and the last consumed offset is committed automatically. It's controlled by setting enable.auto.commit=true and auto.commit.interval.ms.
Pros:
Simple, low overhead.
Cons:
Risk of duplication if the consumer crashes after processing but before the commit, or data loss if the commit happens before processing finishes.
2. Manual Commits
The consumer explicitly commits offsets after processing is done: it reads messages and, after processing each batch/record, calls commit to save the offset. It's controlled by setting enable.auto.commit=false.
Pros:
More control; gives at-least-once semantics, and forms the basis for exactly-once processing when combined with transactions.
Cons:
More complex to implement.
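A minimal consumer sketch with manual commits, assuming a topic named "orders" and a group id "order-processors": the offset is committed only after the batch has been processed, which gives at-least-once behavior.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class OrderConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors");        // consumer group id
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");         // manual commits

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Process first...
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
                // ...then commit, so a crash mid-batch means reprocessing, not loss.
                consumer.commitSync();
            }
        }
    }
}
```

Running several copies of this program with the same group id spreads the topic's partitions across them, which is exactly the scaling behavior described above.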
Message Delivery Semantics
Message delivery semantics describe the guarantees a Kafka system provides when delivering messages to consumers, especially in the case of failures, i.e. how many times a consumer may receive a message when failures occur.
At-Most-Once (messages may be lost, never duplicated)
The consumer commits the offset before processing the message, so Kafka may consider that message "done" even if the consumer hasn't finished processing it.
If the consumer crashes before processing, Kafka starts from the next offset on restart, so the unprocessed message is skipped.
Guarantee: Messages are delivered 0 or 1 times.
At-Least-Once (messages are never lost, may be duplicated)
The consumer processes the message first, then commits the offset. This ensures a message is not acknowledged until it is fully handled.
If the consumer crashes after processing the message but before committing the offset, Kafka re-delivers that message on restart and the consumer processes it again, hence the duplicate.
Guarantee: Messages are delivered 1 or more times. No message is lost, but duplicates may occur.
Exactly-Once (each message is processed precisely once)
With enable.idempotence=true, producers avoid duplicate writes when retrying after network or broker issues.
Kafka supports transactional producers where:
- Messages sent by the producer + consumer offset commits are part of the same atomic transaction. This ensures that either both the message and offset are committed, or neither is.
The producer writes results and commits the consumer offsets in the same transaction.
If a crash happens mid-way, Kafka ensures nothing is partially committed.
On restart, the consumer retries safely without duplicates.
Guarantee: Each message is processed once and only once, even with retries and failures.
Retention Policies
Retention policies manage how long data is stored within Kafka topics.
They help manage disk usage while ensuring consumers have access to the required data
Time-Based Retention
Messages are kept for a specified duration, after which they can be deleted.
Configured using parameters like log.retention.hours, log.retention.minutes, or log.retention.ms.
When a message’s age exceeds the configured retention period, it becomes eligible for deletion.
Size-based retention
This policy limits how much disk space a topic or partition can use.
Configured using the log.retention.bytes parameter.
When the configured size threshold is reached, Kafka deletes the oldest log segments to make room for new data.
Log compaction
This policy focuses on retaining the latest value for each key within a topic.
Configured using parameter cleanup.policy=compact
If multiple records exist with the same key, Kafka removes the older versions and keeps only the most recent one.
Unlike time/size retention, compaction does not delete all old data — it only removes outdated records per key
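The log.retention.* settings above are broker-wide defaults; individual topics can override them with the topic-level configs retention.ms, retention.bytes and cleanup.policy. A small sketch, assuming hypothetical "clickstream" and "user-profiles" topics on a local broker:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.List;
import java.util.Map;
import java.util.Properties;

public class RetentionExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            // Time + size based retention: keep 7 days or ~1 GiB per partition,
            // whichever limit is hit first.
            NewTopic clickstream = new NewTopic("clickstream", 3, (short) 1)
                    .configs(Map.of(
                            "retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000),
                            "retention.bytes", String.valueOf(1024L * 1024 * 1024)));

            // Compacted topic: only the latest value per key is retained.
            NewTopic userProfiles = new NewTopic("user-profiles", 3, (short) 1)
                    .configs(Map.of("cleanup.policy", "compact"));

            admin.createTopics(List.of(clickstream, userProfiles)).all().get();
        }
    }
}
```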
Back pressure & Flow Control
Sometimes producers generate messages faster than consumers can process them. This can lead to consumer lag, memory buildup, or even crashes, so Kafka relies on back pressure and flow control mechanisms to handle the situation.
Back Pressure (Handling Slow Consumers)
With back pressure, the system signals that consumers (or brokers) are overloaded and cannot keep up with the data flow.
If a consumer is slow, messages keep accumulating in the topic partitions.
If lag exceeds retention limits, consumers may miss data permanently.
Flow Control
Flow control ensures that producers and consumers operate at a balanced pace without overwhelming brokers or clients.
If the producer's buffer fills up because brokers or consumers are too slow, the producer blocks for up to max.block.ms until space becomes available, and throws an exception if it does not.
Consumer Lag Monitoring
Consumer lag is the key metric for identifying slow consumers.
Kafka exposes lag metrics via tools like the consumer group CLI (kafka-consumer-groups.sh) and monitoring systems like Prometheus + Grafana.
Large or growing lag indicates consumers are falling behind.
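Lag can also be computed programmatically by comparing a group's committed offsets with the partitions' end offsets. A sketch using the Java AdminClient, with the group id "order-processors" and broker address as assumptions:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

public class LagChecker {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            // Offsets the group has committed so far.
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("order-processors")
                         .partitionsToOffsetAndMetadata().get();

            // Latest (end) offsets of the same partitions.
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> ends =
                    admin.listOffsets(committed.keySet().stream()
                            .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest())))
                         .all().get();

            // Lag = end offset - committed offset, per partition.
            committed.forEach((tp, meta) -> {
                long lag = ends.get(tp).offset() - meta.offset();
                System.out.printf("%s lag=%d%n", tp, lag);
            });
        }
    }
}
```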
Serialization & Deserialization
Kafka stores and transmits messages as byte arrays.
Producers must serialize data into bytes before sending it to Kafka, and consumers must deserialize those bytes back into a usable data structure.
Common Serialization Formats
JSON
Pros: Human-readable, language-agnostic and easy debugging and integration.
Cons: Larger message size, no built-in schema
Avro
Pros: Compact binary format; the schema is stored separately (typically in a Schema Registry).
Cons: Requires schema management (via Schema Registry).
Protobuf (Protocol Buffers)
Pros: Very compact and fast, Widely used in microservices
Cons: More complex tooling compared to JSON.
Confluent Schema Registry
One big challenge in Kafka is: how do producers and consumers agree on message structure over time?
Confluent Schema Registry provides a centralized repository for managing and validating schemas for Kafka messages.
Producers register schemas when publishing messages, and consumers fetch schemas to deserialize data correctly.
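In producer configuration this typically looks like the sketch below, which uses Confluent's Avro serializer. It assumes the kafka-avro-serializer dependency is on the classpath and a Schema Registry is running; the registry URL and broker address are placeholders.

```java
import io.confluent.kafka.serializers.KafkaAvroSerializer;
import org.apache.kafka.clients.producer.ProducerConfig;

import java.util.Properties;

public class AvroProducerConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class.getName());
        // The serializer registers and fetches schemas from this (assumed) registry URL.
        props.put("schema.registry.url", "http://localhost:8081");
        return props;
    }
}
```

Consumers use the matching KafkaAvroDeserializer with the same registry URL, so both sides always agree on the schema version of each record.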
Replication & Fault Tolerance
In Kafka, high availability and durability are achieved through replication, which ensures data is not lost if a broker fails, and fault tolerance, which allows the system to keep working even when components go down.
Replication
Remember a topic is divided into partitions and each partition can have replicas distributed across brokers.
Types of Replicas
Leader replica
Handles all reads/writes for a partition.
Follower replica
Continuously fetches data from the leader to keep its log synchronized. This ensures that if the leader broker fails, a follower can take over without data loss.
So the producer sends messages to the leader replica and follower replicas fetch data from the leader and stay in sync.
Fault Tolerance
Achieved by ensuring data survives and remains available even after broker failures
In-Sync Replicas (ISR):
In-sync replicas are the replicas considered "in sync" with the leader: they have successfully replicated all committed messages from the leader and are not significantly lagging.
Only replicas within the ISR are eligible to be elected as the new leader if the current leader fails.
High Availability
If a broker hosting a leader replica fails, Kafka automatically elects a new leader from the remaining followers in the ISR.
Failover
Switching to a new leader when the current leader becomes unavailable. This ensures continuous operation of the Kafka cluster.
Consumers and producers automatically detect the new leader via metadata updates.
If a follower replica falls behind, it is temporarily removed from ISR.
Once it catches up, it rejoins ISR.
Kafka Connect
Kafka Connect is a framework for scalably and reliably streaming data between Apache Kafka and other systems (databases, cloud storage, search indexes, file systems)
It eliminates the need for writing custom integration code by providing a standardized way to move large datasets in and out of Kafka with minimal latency.
Kafka Connect includes two types of connectors:
Source Connector
Reads data from external systems into Kafka.
Example
Stream database changes (CDC) from PostgreSQL, MySQL, etc.
Collect metrics from application servers and publish them to Kafka topics.
Sink Connector
Writes data from Kafka into external systems.
Example
Index records into Elasticsearch for search.
Load messages into HDFS, S3, or Hadoop for offline analytics.
Insert Kafka records into relational databases or cloud warehouses.
A connector instance is a job that copies data between Kafka and another system.
A connector plugin contains the implementation (classes, configs) that define how the integration works.
Benefits
Scalable: Runs as a distributed service across multiple worker nodes.
Reliable: Handles faults, retries, and exactly-once semantics
Kafka Streams
Kafka Streams is a library that lets you process and analyze data as it flows through Kafka in real-time.
Stateless vs Stateful Operations
Stateless operations don't need to remember anything from before. They process each record independently without retaining any information from previous records.
Example
Map: to transform value of a record
Stateful operations on the other hand need to remember past data to work. They require maintaining and updating state based on past records to produce results.
Example
Join: to combine records from two streams
Aggregate: compute a single aggregated value for records grouped by key
Windowing Concepts
Windowing defines how streaming records are grouped by time so you can run stateful operations like aggregations or joins.
Types of Windows
Tumbling Windows
Fixed-size, non-overlapping intervals.
Each record goes into exactly one window.
Hopping Windows
Fixed-size, but overlapping intervals that “hop” forward in smaller steps.
A record can belong to multiple windows.
Session Windows
Based on activity periods, not fixed time.
A session continues while events keep coming in, and ends after a defined inactivity gap.
Sliding Windows
Windows move continuously with event timestamps.
More advanced, often used with the lower-level Processor API.
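A compact Kafka Streams sketch that combines a stateless mapValues with a stateful windowed count over one-minute tumbling windows. The topic name "clicks" and the application id are assumptions, and TimeWindows.ofSizeWithNoGrace is the newer API (older versions use TimeWindows.of).

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.TimeWindows;

import java.time.Duration;
import java.util.Properties;

public class ClickCounter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "click-counter");      // assumed app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> clicks = builder.stream("clicks");

        clicks.mapValues(String::toLowerCase)                                   // stateless transform
              .groupByKey()
              .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(1))) // tumbling window
              .count()                                                          // stateful aggregation
              .toStream()
              .foreach((windowedKey, count) ->
                      System.out.printf("%s -> %d clicks%n", windowedKey, count));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

The mapValues step needs no memory of earlier records; the windowed count keeps per-key, per-window state in a local store backed by Kafka, which is what makes it a stateful operation.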
ksqlDB
ksqlDB provides a SQL-like interface for Apache Kafka therefore enabling real-time stream processing and analytics using familiar SQL syntax.
Features and benefits
SQL-like Interface: Write queries using familiar SQL syntax just like working with a relational database
Real-Time Stream Processing: Processes events the moment they arrive in Kafka topics
Simplified Development: Eliminates the need for complex Java/Scala code by providing a higher-level SQL abstraction over Kafka Streams.
Built on Kafka Streams: Inherits scalability, performance, and fault tolerance from the Kafka Streams library.
Use Cases include Real-time analytics & dashboards, Data transformation and enrichment, Fraud and anomaly detection, IoT event processing, Event-driven microservices
Transactions & Idempotence
Exactly-Once Semantics (EOS) guarantees that a message is delivered and processed by the consuming application exactly one time. This ensures that even if a system component fails, the message will not be lost, and its effect on the target system will not be duplicated.
EOS is achieved by combining Kafka transactions and the use of idempotency.
Kafka transactions enable atomic processing of multiple write operations by grouping them into a single, indivisible unit that either succeeds entirely or fails entirely, ensuring data consistency in the event of failures. They guarantee atomicity, meaning that all operations within the transaction are committed together, or none of them are.
Idempotency refers to the ability to perform an operation multiple times without causing unintended side effects or changes to the system's state beyond the initial application. It guarantees that messages sent by a producer are written to the Kafka log exactly once, even if the producer retries sending due to network issues or broker failures.
How it works
With enable.idempotence=true, the broker assigns a Producer ID (PID) and sequence numbers to each message. This lets the broker detect and drop duplicates caused by retries.
Guarantee: Messages are written once, in order, within a single producer session.
Limitation: Doesn’t prevent duplicates across producer restarts or across multiple partitions.
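A sketch of a consume-transform-produce loop using idempotence plus transactions. The input/output topic names, group id and transactional id are assumptions, downstream consumers are expected to read with isolation.level=read_committed, and error handling is trimmed to keep the example short.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class ExactlyOnceProcessor {
    public static void main(String[] args) {
        Properties cProps = new Properties();
        cProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder
        cProps.put(ConsumerConfig.GROUP_ID_CONFIG, "eos-processors");
        cProps.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");           // offsets go in the transaction
        cProps.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");
        cProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        cProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        Properties pProps = new Properties();
        pProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // placeholder
        pProps.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        pProps.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "order-processor-1"); // assumed id
        pProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        pProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(pProps)) {
            consumer.subscribe(List.of("in"));
            producer.initTransactions();

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                if (records.isEmpty()) continue;

                producer.beginTransaction();
                try {
                    Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                    for (ConsumerRecord<String, String> r : records) {
                        producer.send(new ProducerRecord<>("out", r.key(), r.value().toUpperCase()));
                        offsets.put(new TopicPartition(r.topic(), r.partition()),
                                new OffsetAndMetadata(r.offset() + 1));
                    }
                    // Output records and consumer offsets commit atomically, or not at all.
                    producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                    producer.commitTransaction();
                } catch (Exception e) {
                    producer.abortTransaction(); // reprocess the batch on the next poll
                }
            }
        }
    }
}
```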
Security in Kafka
Apache Kafka’s security model is built on several key components that work together to protect data and ensure data confidentiality, integrity, and controlled access.
Authentication:
Verifies the identity of clients (producers/consumers) and brokers attempting to connect to the Kafka cluster
authentication mechanisms:
SASL (Simple Authentication and Security Layer) is a framework for authentication used in Kafka to verify the identity of clients and brokers
Examples of SASL mechanisms in Kafka are:
PLAIN: username + password authentication
GSSAPI (Kerberos): Provides strong, centralized authentication using a Kerberos Key Distribution Center (KDC) for issuing tickets to clients and services
Authorization:
Authorization in Kafka is controlled by Access Control Lists (ACLs). Kafka checks authorization to determine what operations the client is allowed to perform.
Operations include:
Read/Write: On topics.
Describe/Create: On topics or consumer groups.
Alter/Delete: On topics or consumer groups.
Cluster-level operations: Such as managing brokers.
Encryption
Encryption ensures confidentiality and integrity of data in transit. It prevents unauthorized parties from reading or tampering with messages while they move between clients (producers/consumers) and brokers, and between brokers (inter-broker communication).
Kafka uses TLS (SSL) to encrypt connections. When TLS is enabled, data is encrypted before being sent over the network and the receiver decrypts it using trusted certificates.
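On the client side this combination (SASL/PLAIN over TLS) roughly looks like the sketch below; the truststore path, username and password are placeholders.

```java
import java.util.Properties;

public class SecureClientConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9093");                 // TLS listener (placeholder)
        props.put("security.protocol", "SASL_SSL");                     // encrypt + authenticate
        props.put("sasl.mechanism", "PLAIN");                           // username/password auth
        props.put("sasl.jaas.config",
                "org.apache.kafka.common.security.plain.PlainLoginModule required "
                + "username=\"app-user\" password=\"app-secret\";");    // placeholder credentials
        props.put("ssl.truststore.location", "/etc/kafka/client.truststore.jks"); // placeholder path
        props.put("ssl.truststore.password", "changeit");               // placeholder
        return props;
    }
}
```

The same properties are added to producer or consumer configuration alongside the serializer/deserializer settings shown earlier.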
Operations & Monitoring
Metrics to Monitor
To effectively monitor Kafka, you should track consumer lag, under-replicated partitions (URP) and throughput and latency.
Consumer Lag
Shows the difference between the last message produced to a partition and the last message consumed by a consumer group.
High lag means consumers are falling behind, which can cause delays in latency-sensitive applications.
Large values in records-lag-max indicate a significant delay in consumption.
Broker Health & Under-Replicated Partitions (URP)
URPs occur when a partition's follower replicas are not fully in sync with the leader, i.e. the partition has fewer in-sync replicas than its replication factor.
URPs signal a risk of data loss if a broker fails.
Throughput and Latency
Throughput
Measures rate of message processing, usually in records per second.
Latency
Measures time from when a message arrives until it is processed (end-to-end delay).
High throughput with low latency indicates an efficient system.
High latency points to potential bottlenecks (for example network congestion, slow consumers, or broker overload).
Scaling Kafka
Scaling Kafka means adjusting the cluster's capacity so it can handle more data, higher throughput, or more clients without performance degradation.
Partition Count Tuning
Increasing the number of partitions in a topic boosts parallelism, allowing more consumers in a consumer group to process data concurrently.
However, more partitions also add overhead in terms of open file handles, memory, and controller load.
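Partition counts can be raised (never lowered) with the AdminClient, as in the small sketch below; the topic name "orders" and the target count are assumptions, and note that adding partitions changes the key-to-partition mapping for keyed data.

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewPartitions;

import java.util.Map;
import java.util.Properties;

public class IncreasePartitions {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder

        try (Admin admin = Admin.create(props)) {
            // Grow "orders" from its current partition count to 12.
            admin.createPartitions(Map.of("orders", NewPartitions.increaseTo(12)))
                 .all().get();
        }
    }
}
```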
Adding Brokers
Adding brokers to the cluster distributes data and workload across more servers.
Benefits
- Increases overall storage capacity
- Improves throughput by balancing load
- Enhances fault tolerance
Rebalancing Partitions
Rebalancing redistributes partitions across brokers to ensure even load distribution.
Triggers
New brokers added
Brokers removed or fail
Uneven data distribution detected
Performance Optimization
Performance Optimization focuses on making message production, storage, and consumption faster and more reliable by reducing bottlenecks in the producer, broker, and consumer pipeline
Batching and Compression
Batching:
Kafka producers send messages in batches instead of individually, resulting in fewer network round trips.
Compression:
Compressing batches reduces network bandwidth and disk usage.
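A hedged sketch of producer settings that enable batching and compression; the values are illustrative starting points rather than recommendations, and serializers and other required settings are omitted.

```java
import org.apache.kafka.clients.producer.ProducerConfig;

import java.util.Properties;

public class TunedProducerConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.LINGER_MS_CONFIG, "10");         // wait up to 10 ms to fill a batch
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, "65536");     // 64 KiB per-partition batch size
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4"); // compress each batch
        return props;                                             // add serializers etc. before use
    }
}
```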
Page Cache Usage
Kafka relies on the OS page cache for fast disk reads/writes.
Keeping enough RAM ensures frequently accessed logs stay cached, avoiding expensive disk I/O.
Benefit: Minimizes random disk I/O, improving read and write performance.
Disk & Network Considerations
Disk:
Use fast disks (SSD preferred over HDD).
Separate Kafka log directories from OS and application logs to avoid I/O contention
Network:
Kafka is network-intensive; it benefits from high-bandwidth, low-latency connections (10 Gbps+ for large clusters), tuned socket buffers, and balanced partition leaders.
Conclusion
Apache Kafka is a powerful distributed event streaming platform designed for real-time data processing at scale. Its scalability, durability, and rich ecosystem make it a backbone for event-driven architectures and data pipelines. As organizations continue to generate and consume data at massive scale, Kafka provides the foundation to ensure that information flows reliably, securely, and with low latency.