Apache Kafka is a distributed streaming platform widely used for building real-time data pipelines and streaming applications.
It was originally developed at LinkedIn and open sourced in 2011.
Architecture
1. Brokers: the server components responsible for storing, replicating, and serving data streams to clients.
2. Kafka cluster: a group of brokers working together, with each broker independently managing partitions of topics. Clusters can be scaled horizontally by adding brokers.
Key Functions of a Kafka Broker:
• Data Storage:
Brokers store data as part of topics in partitioned logs.
• Data Serving:
They receive and respond to produce (write) and fetch (read) requests from client applications.
• Data Replication:
Brokers replicate data partitions to other brokers in the cluster, ensuring that data is not lost if a broker fails and providing high availability.
• Cluster Management (via KRaft or ZooKeeper):
Brokers work together in a cluster, coordinating their activities and managing the distribution of topic partitions.
• Fault Tolerance:
By distributing and replicating data across multiple brokers, the cluster remains operational even if some brokers fail.
• Dynamic Membership:
Brokers can be added to or removed from the cluster, with the cluster coordinating leadership changes as membership changes.
3. ZooKeeper: traditionally used to manage cluster metadata and coordinate brokers. It runs as a separate cluster, external to the Kafka brokers.
However, Kafka has been moving to KRaft, which uses an internal Raft consensus protocol to manage cluster metadata directly within the Kafka brokers, eliminating the need for a separate ZooKeeper cluster.
4. Topics and Partitions:
-- Topics – the named streams to which producers publish messages and from which consumers subscribe to retrieve records. They serve as a way to categorize different types of data streams within Kafka. For example, a "user_activity" topic might contain records related to user logins, clicks, and page views, while an "order_processing" topic might contain records about new orders, order updates, and order cancellations.
-- Partitions – Kafka topics are divided into one or more partitions. Each partition is an ordered, immutable sequence of records that is continually appended to. Partitions form the basic unit of parallelism and data distribution in Kafka. Each record within a partition is assigned a unique, incremental identifier called an offset, which represents its position within that partition.
Producers and Consumers
a) Producers
Role: Send data records (events) to specific Kafka topics.
Decoupling: Producers don't need to know about consumers; they just publish data to topics.
Batching & Compression: Producers can batch records and compress them for efficiency.
Message Partitioning: Producers can determine the partition a message goes into, often by using a key, which ensures related messages are sent to the same partition (see the producer sketch at the end of this section).
b) Consumers
Role:
Subscribe to one or more topics and read (consume) the event streams published to them.
Decoupling:
Consumers also don't need to know about producers; they subscribe to topics to receive messages.
Consumer Groups:
Consumers organize themselves into consumer groups to achieve parallel processing.
Parallel Processing:
Within a consumer group, each partition is assigned to only one consumer, allowing for distributed processing of messages.
Offset Management:
Consumers keep track of their progress by committing offsets, which are essentially pointers to the last message they processed in a partition, allowing them to resume from where they left off.
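To make the producer side concrete, here is a minimal kafka-python sketch of key-based partitioning; the broker address, topic name, and keys are illustrative assumptions rather than part of any particular setup.

from kafka import KafkaProducer
import json

# Minimal sketch: a keyed producer. Records with the same key hash to the
# same partition, so events for one user stay ordered relative to each other.
producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],  # assumed broker address
    key_serializer=lambda k: k.encode('utf-8'),
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
)

producer.send('user_activity', key='user-42', value={'event': 'login'})
producer.send('user_activity', key='user-42', value={'event': 'click'})
producer.flush()
producer.close()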
Message Delivery Semantics
- At-most-once Delivery
• Description: Messages are delivered zero or one time, meaning they can be lost if the system fails after they are sent but before they are processed.
• Behavior: In Kafka, this is achieved by automatically committing consumer offsets as soon as messages are received. If the consumer fails before processing, the messages are lost and won't be read again.
• Use Case: Suitable for applications where occasional message loss is acceptable and high throughput is prioritized.
- At-least-once Delivery
• Description: Guarantees that every message is delivered at least once, but it's possible for a message to be delivered multiple times.
• Behavior: Kafka achieves this when the producer waits for acknowledgment before considering a message committed, or when the consumer commits its offset after successfully processing a message. If a failure occurs after processing but before the offset is committed, the message will be re-delivered.
• Use Case: Ideal when message loss is unacceptable, such as in financial transactions, but duplicate processing can be handled by making the consumer idempotent.
- Exactly-once Delivery
• Description: Each message is delivered exactly once, without any duplication or loss.
• Behavior: This is the most complex to achieve and requires cooperation between the producer, Kafka, and the consumer.
  - Kafka to Kafka (Kafka Streams): The Kafka Streams API enables exactly-once semantics by leveraging Kafka's transaction API.
  - Kafka to External System (Sink): Requires an idempotent consumer that ensures it only processes each message once, even if it receives duplicates.
• Use Case: Essential for scenarios where data accuracy and non-duplication are critical, such as order processing or accounting.
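As a concrete illustration of at-least-once behavior, here is a minimal kafka-python sketch in which a consumer group commits offsets only after processing each record; the topic, group, and broker names are assumptions, and handle_order stands in for real (ideally idempotent) processing logic.

from kafka import KafkaConsumer
import json

def handle_order(order):
    # Hypothetical processing step; replace with real, ideally idempotent, logic.
    print("processing", order)

# At-least-once: offsets are committed only after processing, so a crash
# before the commit causes re-delivery (possible duplicates, no loss).
consumer = KafkaConsumer(
    'order_processing',
    bootstrap_servers=['localhost:9092'],
    group_id='order-processors',
    enable_auto_commit=False,          # commit manually after processing
    auto_offset_reset='earliest',
    value_deserializer=lambda b: json.loads(b.decode('utf-8')),
)

for record in consumer:
    handle_order(record.value)
    consumer.commit()                  # commit only once processing succeeded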
Retention Policy and Back Pressure Handling
- Kafka uses retention policies (time or size-based) to manage disk space by deleting old messages, while back pressure occurs when producers generate data faster than consumers can process it.
- Kafka handles back pressure by enabling consumer groups to scale horizontally, throttling producers, and providing consumer-level configurations to control fetch behavior.
Kafka Retention Policies:
• Time-Based Retention: Messages are deleted after a specified duration (e.g., 7 days by default).
• Size-Based Retention: Messages are deleted to keep the total size of a partition below a configured limit.
• Combined Policies: You can combine time and size-based retention for a customized strategy.
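As an illustrative sketch using kafka-python's admin client, a topic can be created with combined time- and size-based retention; the topic name, partition count, and limits below are assumptions, not recommendations.

from kafka.admin import KafkaAdminClient, NewTopic

# Minimal sketch: topic with both time- and size-based retention limits.
admin = KafkaAdminClient(bootstrap_servers=['localhost:9092'])
admin.create_topics([
    NewTopic(
        name='user_activity',
        num_partitions=6,
        replication_factor=3,
        topic_configs={
            'retention.ms': str(7 * 24 * 60 * 60 * 1000),  # keep messages for ~7 days
            'retention.bytes': str(1 * 1024 ** 3),          # ...or until ~1 GiB per partition
        },
    )
])
admin.close()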
Back Pressure in Kafka:
- Back pressure occurs when the rate of data production exceeds the rate of data consumption, creating bottlenecks and potential system instability.
- Indicators: an increase in consumer lag (the difference between a consumer's latest offset and the producer's head of the log) signals back pressure on the consumer side.
- Causes: consumers can't process messages fast enough due to network issues, disk I/O, or processing complexity.
Handling Back Pressure:
a) Scale consumers horizontally
b) Consumer-level optimizations – batching, message compression, and fetch/poll tuning (see the sketch after this list)
c) Rate limiting
d) Stream processing frameworks
e) Monitor consumer lag
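As a hedged illustration of the consumer-level tuning mentioned above, the sketch below sets fetch and poll limits with kafka-python; the values are placeholders to adjust per workload.

from kafka import KafkaConsumer

# Minimal sketch of consumer-side knobs that help under back pressure.
consumer = KafkaConsumer(
    'user_activity',
    bootstrap_servers=['localhost:9092'],
    group_id='activity-processors',
    max_poll_records=200,               # cap records returned per poll
    fetch_max_bytes=16 * 1024 * 1024,   # cap data fetched per request
    fetch_min_bytes=1024,               # let the broker accumulate small fetches
    fetch_max_wait_ms=500,              # ...but wait at most 500 ms for them
)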
Serialization and Schema Evolution
- Serialization in Kafka refers to the process of converting data objects into a byte array format suitable for transmission over the network and storage in Kafka topics.
- Deserialization is the reverse process, converting the byte array back into a usable data object. This is crucial because Kafka messages are essentially byte arrays, and applications need to understand the structure of the data within those bytes. Common serialization formats include Avro, JSON, and Protobuf (a minimal serializer/deserializer pair is sketched below).
- Schema evolution addresses the challenge of managing changes to the structure of data over time in Kafka topics. It ensures that changes to the data's schema can be managed without breaking compatibility between producers and consumers, allowing older and newer versions of data to coexist and be processed correctly. The Kafka Schema Registry plays a central role in facilitating both serialization and schema evolution, acting as a centralized repository for managing and validating schemas.
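A minimal sketch of a matching serializer/deserializer pair, here using JSON with kafka-python; the topic name and broker address are assumptions.

import json
from kafka import KafkaProducer, KafkaConsumer

# Producer side: object -> bytes.
producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
)

# Consumer side: bytes -> object, mirroring the producer's format.
consumer = KafkaConsumer(
    'user_activity',
    bootstrap_servers=['localhost:9092'],
    value_deserializer=lambda b: json.loads(b.decode('utf-8')),
)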
Kafka Schema Registry
- A centralized repository for managing and validating schemas for data exchanged within the Kafka ecosystem.
Key aspects of Kafka Schema Registry:
• Centralized Schema Management:
It provides a single location to store and manage schemas, ensuring all producers and consumers share a common understanding of message formats.
• Data Contract Enforcement:
Schemas act as a data contract, defining the structure and types of data within Kafka messages. This helps prevent breaking changes and ensures data quality.
• Schema Evolution:
Schema Registry supports schema evolution, allowing you to introduce changes to your data formats (e.g., adding new fields) while maintaining compatibility with older consumers. It handles compatibility checks (forward and backward compatibility) to prevent data corruption.
• Serialization and Deserialization:
It provides serializers and deserializers that integrate with Kafka clients, handling the process of converting data to and from a binary format (like Avro) based on the registered schemas.
• Data Governance:
Schema Registry plays a crucial role in data governance by providing visibility into data lineage, enabling audit capabilities, and facilitating collaboration among teams working with Kafka data.
• Underlying Storage:
Schema Registry uses Kafka itself as its durable backend, leveraging Kafka's log-based architecture for storing and managing schema metadata.
Replication and Fault Tolerance
Kafka achieves durability through replication:
• Each partition has multiple replicas across brokers.
• One replica is the leader, handling all read/write requests.
• Other replicas are followers, synchronizing data from the leader.
• The set of replicas in sync is called ISR (In-Sync Replicas).
If the leader fails, Kafka elects a new leader from followers, ensuring no data loss and continuous availability. The recommended replication factor is typically 3 for balance between fault tolerance and overhead.
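As a hedged sketch of how replication settings pair with producer acknowledgments, the snippet below creates a topic with a replication factor of 3 and min.insync.replicas=2, and a producer using acks='all' so a write succeeds only once enough in-sync replicas have it; the topic name and broker address are assumptions.

from kafka import KafkaProducer
from kafka.admin import KafkaAdminClient, NewTopic

# Topic with 3 replicas, requiring 2 in-sync replicas for acknowledged writes.
admin = KafkaAdminClient(bootstrap_servers=['localhost:9092'])
admin.create_topics([
    NewTopic(
        name='payments',
        num_partitions=3,
        replication_factor=3,
        topic_configs={'min.insync.replicas': '2'},
    )
])
admin.close()

# Producer that waits for all in-sync replicas before considering a write done.
producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    acks='all',
)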
Kafka Connect and Kafka Streams
A) Kafka Connect:
• Purpose:
Kafka Connect is a framework for reliably streaming data between Apache Kafka and other data systems. It focuses on simplifying the process of getting data into and out of Kafka.
• Functionality:
It uses pre-built or custom-developed "connectors" to interact with various data sources (e.g., databases, file systems, cloud storage) and sinks (e.g., data warehouses, search indexes).
• Use Cases:
Ideal for data integration, ETL (Extract, Transform, Load) operations where the primary goal is to move data between systems with minimal or no complex transformations.
• Key Feature:
Provides a scalable and fault-tolerant way to manage data pipelines without requiring extensive custom code for data movement.
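As a rough illustration of how connectors are managed before moving on to Kafka Streams, the sketch below registers Kafka Connect's built-in FileStreamSource connector through the Connect REST API (default port 8083) using Python's requests library; the connector name, file path, and topic are assumptions.

import requests  # third-party HTTP client, assumed available

# Minimal sketch: create a connector via the Kafka Connect REST API.
connector = {
    "name": "demo-file-source",
    "config": {
        "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
        "tasks.max": "1",
        "file": "/tmp/input.txt",     # illustrative source file
        "topic": "file_lines",        # illustrative target topic
    },
}
resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()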
B) Kafka Streams:
• Purpose:
Kafka Streams is a client library for building real-time stream processing applications directly on top of Apache Kafka. It focuses on processing and analyzing data within Kafka topics.
• Functionality:
It allows developers to write Java/Scala applications that consume data from Kafka topics, perform various transformations, aggregations, joins, and then produce results back to other Kafka topics or external systems.
• Use Cases:
Suited for real-time analytics, event-driven microservices, complex event processing, and applications requiring continuous data processing and analysis.
• Key Feature:
Offers powerful abstractions (KStream, KTable) for representing and manipulating streams and tables of data, enabling sophisticated stream processing logic with built-in fault tolerance and scalability.
ksqlDB: SQL for Streaming
ksqlDB offers a SQL-like interface for streaming data on top of Kafka:
• Enables real-time filtering, aggregation, joining, and enrichment.
• Simplifies stream processing without requiring Java/Scala coding.
• Supports creating persistent views/tables on streaming data.
• Used extensively in industries like healthcare for real-time transaction monitoring and anomaly detection.
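As an illustrative sketch, a ksqlDB statement can be submitted over ksqlDB's REST API (default port 8088); the stream definition and topic name below are assumptions.

import requests

# Minimal sketch: define a stream over an existing topic via the /ksql endpoint.
statement = """
  CREATE STREAM user_activity_stream (user_id VARCHAR, event VARCHAR)
  WITH (KAFKA_TOPIC='user_activity', VALUE_FORMAT='JSON');
"""
resp = requests.post(
    "http://localhost:8088/ksql",
    json={"ksql": statement, "streamsProperties": {}},
)
resp.raise_for_status()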
Transactions and Idempotence
Kafka supports exactly-once semantics (EOS) through:
• Idempotent Producers: Prevent duplicate message sends during retries by assigning sequence numbers.
from kafka import KafkaProducer
import json

producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],  # Replace with your Kafka broker addresses
    enable_idempotence=True,               # Enable idempotence
    acks='all',                            # Ensure all replicas acknowledge the write
    retries=10,                            # Number of retries for failed sends
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

try:
    future = producer.send('my_topic', {'message': 'This is an idempotent message'})
    record_metadata = future.get(timeout=10)
    print(f"Message sent successfully to topic: {record_metadata.topic}, "
          f"partition: {record_metadata.partition}, offset: {record_metadata.offset}")
except Exception as e:
    print(f"Error sending message: {e}")
finally:
    producer.close()
• Transactions: Enable grouped writes and offset commits to be atomic, preventing partial processing. This means either all messages in the transaction are committed and become visible to consumers, or none of them are.
Producers enable idempotence with enable.idempotence=true and transactions with transactional.id=xxx, improving accuracy in critical workflows.
from kafka import KafkaProducer
import json

producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    transactional_id='my_transactional_producer_id',  # Unique ID for the transactional producer
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

producer.init_transactions()
try:
    producer.begin_transaction()
    producer.send('topic_a', {'data': 'message from topic A'})
    producer.send('topic_b', {'data': 'message from topic B'})
    producer.commit_transaction()
    print("Transaction committed successfully.")
except Exception as e:
    producer.abort_transaction()
    print(f"Transaction aborted due to error: {e}")
finally:
    producer.close()
Security in Kafka
Kafka incorporates robust security features:
• Authentication: Supports SASL mechanisms (e.g., Kerberos/GSSAPI, OAuth, SCRAM, PLAIN) and TLS client certificates (mTLS).
• Authorization: Fine-grained access control via ACLs.
• Encryption: TLS encryption for data in transit.
These measures protect Kafka clusters from unauthorized access and data breaches.
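As a hedged example, a kafka-python client might be configured for SASL/PLAIN over TLS roughly as follows; the hostnames, credentials, and certificate path are placeholders.

from kafka import KafkaProducer

# Minimal sketch: client authenticating with SASL/PLAIN over an encrypted connection.
producer = KafkaProducer(
    bootstrap_servers=['broker1.example.com:9093'],  # placeholder broker address
    security_protocol='SASL_SSL',
    sasl_mechanism='PLAIN',
    sasl_plain_username='app-user',                  # placeholder credentials
    sasl_plain_password='app-password',
    ssl_cafile='/etc/kafka/certs/ca.pem',            # CA used to verify the broker's certificate
)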
Operations and Monitoring
Key metrics for Kafka health:
• Consumer Lag: Indicates delays in processing.
• Under-replicated Partitions: Signals insufficient replication.
• Broker Health: Disk, network, JVM metrics.
• Throughput and Latency: Evaluated for performance tuning.
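As a rough sketch of lag monitoring with kafka-python, the snippet below compares a consumer group's committed offsets against each partition's end offset; the group name and broker address are assumptions.

from kafka import KafkaConsumer
from kafka.admin import KafkaAdminClient

# Lag = latest offset in the partition minus the group's committed offset.
admin = KafkaAdminClient(bootstrap_servers=['localhost:9092'])
committed = admin.list_consumer_group_offsets('order-processors')  # assumed group

consumer = KafkaConsumer(bootstrap_servers=['localhost:9092'])
end_offsets = consumer.end_offsets(list(committed.keys()))

for tp, meta in committed.items():
    lag = end_offsets[tp] - meta.offset
    print(f"{tp.topic}[{tp.partition}] lag={lag}")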
Performance Optimization
a. Batching and Compression
• Batching - Kafka producers group messages into batches before sending them to brokers. This reduces network overhead and improves throughput.
• Compression – Messages can be compressed at the producer level, reducing the amount of data transferred over the network and stored on disk, saving network bandwidth and disk space (see the tuning sketch at the end of this section).
b. Page cache Usage
• By maximizing page cache utilization, Kafka minimizes direct disk I/O, leading to faster read and write operations.
c. Disk and Network Considerations
• Disk:
Fast Disks: Using fast disks (e.g., SSDs) for Kafka data directories is crucial for optimal performance, especially for write-heavy workloads.
RAID Configuration: Employing appropriate RAID configurations can improve disk I/O performance and provide data redundancy.
Separate Disks: Ideally, separate disks should be used for Kafka logs and operating system files to prevent contention.
• Network:
High-Bandwidth Network: Provide enough bandwidth for both client traffic and inter-broker replication.
Network Interface Cards: Use capable NICs; on busy brokers the network interface can become the bottleneck before disks do.
Proper Network Configuration: Tune settings such as socket buffer sizes and keep latency between brokers and clients low.
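The sketch below shows the producer-side batching and compression settings referenced under "Batching and Compression" above, using kafka-python; the specific values are illustrative starting points, not recommendations.

from kafka import KafkaProducer

# Minimal sketch: trade a little latency (linger) for larger, compressed batches.
producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    compression_type='gzip',   # gzip, snappy, lz4, or zstd, depending on client/broker support
    batch_size=64 * 1024,      # target batch size per partition, in bytes
    linger_ms=20,              # wait up to 20 ms to fill a batch before sending
)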
Scaling Kafka
This involves strategies for handling increased data loads and ensuring optimal performance.
1. Partition Count Tuning:
• Partitions and Parallelism:
Partitions are the unit of parallelism in Kafka. Increasing the number of partitions for a topic allows for greater parallelism in both producers (writing messages) and consumers (reading messages).
• Consumer Groups:
Within a consumer group, each partition can only be consumed by one consumer instance at a time. Therefore, the maximum parallelism for a consumer group is limited by the number of partitions in the topic. Adding more consumer instances than partitions will result in idle consumers.
• Overhead:
While increasing partitions can improve throughput, too many partitions can introduce overhead on brokers and consumers due to increased metadata management and potential for more frequent rebalances.
• Rule of Thumb:
A common starting point is 3-5 partitions per consumer instance in your consumer group, adjusting based on data volume and processing requirements.
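As a minimal sketch of adjusting the partition count with kafka-python's admin client (the topic name and target count are illustrative; partition counts can only be increased, and adding partitions changes key-to-partition mapping for keyed topics):

from kafka.admin import KafkaAdminClient, NewPartitions

# Raise the topic's partition count to 12 (illustrative target).
admin = KafkaAdminClient(bootstrap_servers=['localhost:9092'])
admin.create_partitions({'user_activity': NewPartitions(total_count=12)})
admin.close()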
2. Adding Brokers:
• Horizontal Scaling:
Adding new brokers to a Kafka cluster is a form of horizontal scaling, increasing the overall capacity of the cluster to handle higher data loads and improve fault tolerance.
• Uneven Distribution:
When new brokers are added, existing topic partitions are not automatically distributed to them, leading to an unbalanced cluster where new brokers remain idle while older ones carry the load.
• Replication and High Availability:
Adding brokers allows for more replicas of partitions to be stored, enhancing data durability and high availability in case of broker failures.
3. Rebalancing Partitions:
• Necessity:
Rebalancing is crucial after adding new brokers to distribute partitions evenly across all brokers (old and new), ensuring optimal resource utilization and preventing performance bottlenecks on overloaded brokers.
• Tools:
• kafka-reassign-partitions.sh: This command-line utility for self-managed Kafka clusters allows for manual partition reassignment. It requires creating a JSON file defining the desired partition distribution.
• Cruise Control: An open-source tool that automates partition rebalancing. It continuously monitors cluster performance and intelligently rebalances partitions to maintain an optimal distribution, reducing manual effort.
• Impact:
Rebalancing can temporarily impact performance as data is moved between brokers. Planning and monitoring during rebalancing operations are essential.
Real-World Use Cases
• Netflix: Employs Kafka for real-time event ingestion and stream processing in their data platform to personalize content and monitor services.
• LinkedIn: Originator of Kafka; uses it extensively for data integration, activity tracking, and operational metrics.
• Uber: Uses Kafka for event-driven microservices communication, real-time analytics, and surge pricing algorithms.
These companies benefit from Kafka's scalability, fault tolerance, and exactly-once semantics to deliver reliable, real-time, data-driven services.