Apache Kafka Deep Dive: Core Concepts, Data Engineering Applications, and Real-World Production Practices

Introduction

Apache Kafka is a crucial component in modern data engineering, and a good understanding of its core concepts, architecture, and applications is essential for any data engineer in today's data-driven world.

What is Apache Kafka?

According to the official Kafka website, Apache Kafka is a distributed event streaming platform. But what is event streaming? The Kafka documentation describes event streaming as the practice of capturing data in real time from sources such as databases, sensors, and cloud services.

Simply put, Apache Kafka provides a platform for handling and processing real-time data. The platform runs as a cluster of one or more nodes, making it scalable and fault-tolerant.

Apache Kafka Core Concepts

1. Kafka Architecture:

  • Brokers - The servers that run Kafka and form the storage layer. They store data and serve client requests.
  • ZooKeeper - A centralized service for managing and coordinating Kafka brokers. In a cluster, it tracks broker membership, topic configuration, leader election, and metadata related to topics, brokers, and consumers.
  • KRaft - A consensus protocol that replaces ZooKeeper-based metadata management. With KRaft, Kafka brokers manage metadata internally, without the need for an external system such as ZooKeeper.

2. Topics, Partitions and Offsets

  • Topics: Logical channels to which events are published, similar to tables in a database. Topics are multi-producer and multi-subscriber, and are split into partitions for scalability.
  • Partitions: When a Kafka topic is created, it is divided into one or more partitions, each functioning as an independent, ordered log that holds a subset of the topic's data, allowing efficient parallel processing.
  • Offsets: Each record in a partition is assigned a unique, sequential identifier called an offset. When writing, every new message appended to the log receives the next sequential offset; when reading, consumers can specify an offset to start from a particular message. (A topic-creation sketch follows this list.)
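
To make this concrete, here is a minimal sketch using the Java AdminClient to create a partitioned topic. It assumes a cluster reachable at localhost:9092 with at least three brokers; the topic name "orders" and the counts are illustrative.

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions allow up to 3 consumers in a group to read in parallel;
            // replication factor 3 keeps a copy of each partition on 3 brokers.
            NewTopic ordersTopic = new NewTopic("orders", 3, (short) 3);
            admin.createTopics(Collections.singletonList(ordersTopic)).all().get();
        }
    }
}
```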

3. Producers: Kafka producers are client applications that publish (write) data to Kafka topics. Data is sent to the appropriate topic and partition using either key-based partitioning (a consistent hash of the message key, which ensures that all messages with the same key go to the same partition) or round-robin partitioning (in which Kafka distributes messages across partitions in a rotating manner). A minimal producer sketch follows the acknowledgement modes below.

  • Acknowledgement Modes (acks) - A Kafka producer can request acknowledgement/confirmation of data writes in three modes, namely:
    • acks=0: The producer sends data without waiting for an acknowledgement, so data can be lost if it is sent to a broker that is down.
    • acks=1: The producer waits for the partition leader to confirm receipt. If no acknowledgement is received, the producer retries the send.
    • acks=all: Both the leader and all in-sync replicas must acknowledge receipt of the message, giving the strongest durability guarantee.
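
Below is a minimal producer sketch in Java illustrating key-based partitioning and acks=all. The broker address, topic name, key, and value are illustrative assumptions, not values from the article.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // acks=all: wait for the leader and all in-sync replicas to confirm the write.
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key "customer-42" is hashed to pick a partition, so all events
            // for this customer land in the same partition and keep their order.
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("orders", "customer-42", "{\"item\":\"book\",\"qty\":1}");
            producer.send(record, (metadata, exception) -> {
                if (exception != null) {
                    exception.printStackTrace();
                } else {
                    System.out.printf("Written to partition %d at offset %d%n",
                            metadata.partition(), metadata.offset());
                }
            });
        } // closing the producer flushes any buffered records
    }
}
```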

4. Consumers: Kafka consumers are client applications that subscribe to Kafka topics and process the data.

  • Consumer groups: Groups of client applications (consumers) that read messages from a topic across multiple partitions through polling, allowing events from a topic to be processed in parallel.
  • Consumer group offsets: For each topic and partition, the consumer group stores a number representing the latest consumed record, so a consumer can keep track of its position in a topic. Offsets can be managed with the following strategies:
    • Auto-commit - With enable.auto.commit = true, Kafka commits the consumer offset to the cluster periodically, at the interval defined by auto.commit.interval.ms.
    • Manual commit - Setting enable.auto.commit = false gives the application control over when offsets are committed (see the consumer sketch below).
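
The following sketch shows a consumer group member that polls a topic and commits offsets manually after processing. The group id, topic name, and broker address are placeholders.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OrderConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        // Disable auto-commit so offsets are only committed after processing succeeds.
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
                consumer.commitSync(); // manual commit after the batch has been processed
            }
        }
    }
}
```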

5. Kafka Message Delivery and Processing

  • Message delivery semantics: Delivery guarantees describe how the broker, producer, and consumer agree on handing over messages. According to the Kafka documentation, messages can be delivered in the following ways (the sketch after this list contrasts the first two):
    • At-most-once - Messages are delivered at most once; if there is a failure, they may be lost and are not re-delivered.
    • At-least-once - Messages are delivered one or more times; nothing is lost in case of a failure, but duplicates are possible.
    • Exactly-once - The preferred mode: each message is delivered exactly once, with no loss and no duplicate processing even if part of the system fails.
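
As a rough sketch of the first two guarantees, the difference comes down to whether the offset commit happens before or after processing. The consumer here is a standard KafkaConsumer as in the earlier sketch, and process() is a hypothetical placeholder for business logic.

```java
import java.time.Duration;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class DeliverySemantics {

    // At-most-once: commit BEFORE processing. A crash during processing means the
    // records are never re-read, so they can be lost but never duplicated.
    static void atMostOnce(KafkaConsumer<String, String> consumer) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        consumer.commitSync();
        process(records);
    }

    // At-least-once: commit AFTER processing. A crash before the commit means the
    // same records are re-delivered, so processing must tolerate duplicates.
    static void atLeastOnce(KafkaConsumer<String, String> consumer) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        process(records);
        consumer.commitSync();
    }

    // Hypothetical business-logic placeholder.
    static void process(ConsumerRecords<String, String> records) {
        records.forEach(r -> System.out.println(r.value()));
    }
}
```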

6. Retention Policies

In Kafka, retention refers to the policy that determines how long records are kept in a topic before they become eligible for deletion. Kafka provides the following retention policies (a topic-configuration sketch follows this list):

  • Time-based Retention: Records are retained for a specified duration (e.g. 7 days); once they expire, the oldest records are deleted.
  • Size-based Retention: Records are retained until the total size of the log segments reaches a specified limit (e.g. 1 GB), after which the oldest records are deleted.
  • Log Compaction: Kafka retains only the latest value for each key in a topic, discarding older updates. This is useful when a complete, up-to-date view of a dataset is needed, for example maintaining the latest state of a user profile.
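
A minimal sketch of configuring these policies when creating topics with the AdminClient follows; the topic names and retention values are illustrative.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class RetentionConfigExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Time/size-based retention: delete segments older than 7 days, or once
            // a partition grows past ~1 GB, whichever comes first.
            NewTopic clickstream = new NewTopic("clickstream", 3, (short) 1)
                    .configs(Map.of(
                            "cleanup.policy", "delete",
                            "retention.ms", "604800000",    // 7 days
                            "retention.bytes", "1073741824" // 1 GB per partition
                    ));

            // Log compaction: keep only the latest value per key (e.g. user profiles).
            NewTopic userProfiles = new NewTopic("user-profiles", 3, (short) 1)
                    .configs(Map.of("cleanup.policy", "compact"));

            admin.createTopics(Arrays.asList(clickstream, userProfiles)).all().get();
        }
    }
}
```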

7. Back Pressure and Flow Control

In distributed systems, back pressure is a mechanism for regulating data flow to prevent parts of the system from being overloaded. Simply put, back pressure acts like a traffic-control system, regulating the speed and volume of data flow to maintain optimal performance and reliability across the entire system.

During operation, a system sometimes experiences consumer lag (the gap between the latest message written to a partition and the last message the consumer has processed), so it is essential to monitor consumer lag so that slow consumers can be identified quickly and remedial action taken.

Several methods can be used to monitor consumer lag, among them the Offset Explorer tool and managed Kafka monitoring services such as Amazon MSK's. Lag can also be computed directly with the consumer API, as in the sketch below: for each partition, lag is the log-end offset minus the group's last committed offset.
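
In this sketch, the group id, topic, and partition numbers are assumptions.

```java
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ConsumerLagCheck {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            Set<TopicPartition> partitions = Set.of(
                    new TopicPartition("orders", 0),
                    new TopicPartition("orders", 1),
                    new TopicPartition("orders", 2));

            // Lag per partition = log-end offset - last committed offset of the group.
            Map<TopicPartition, Long> endOffsets = consumer.endOffsets(partitions);
            Map<TopicPartition, OffsetAndMetadata> committed = consumer.committed(partitions);

            for (TopicPartition tp : partitions) {
                long end = endOffsets.getOrDefault(tp, 0L);
                OffsetAndMetadata om = committed.get(tp);
                long committedOffset = (om == null) ? 0L : om.offset();
                System.out.printf("%s lag=%d%n", tp, end - committedOffset);
            }
        }
    }
}
```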

8. Serialization and Deserialization

Serialization converts data objects into a binary format suitable for transmission or storage; deserialization converts the binary data back into its original object form. The following schema formats are commonly used (an Avro producer sketch follows this list):

  • JSON Schema: The producer converts an application's data object into a JSON string and serializes it into bytes before sending. On the consumer side, the JSON schema is retrieved using the extracted schema ID and the data is deserialized back into the application's data object.

  • Avro Schema: The producer uses the KafkaAvroSerializer to convert data into a binary format. On the consumer side, the KafkaAvroDeserializer extracts the schema ID, queries the Schema Registry to retrieve the Avro schema, and uses it to convert the binary data back into the data object.

  • Protobuf Schema: The producer application first creates a Protobuf message, and the KafkaProtobufSerializer converts it into a binary format. On receiving the binary message, the consumer uses the KafkaProtobufDeserializer, which extracts the schema ID, retrieves the schema, and deserializes the binary data back into a structured Protobuf message object.
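
Here is a hedged sketch of an Avro producer. It assumes a Confluent Schema Registry running at localhost:8081 and the kafka-avro-serializer and Avro libraries on the classpath; the schema, topic, and record values are made up for illustration.

```java
import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroProducerExample {
    private static final String USER_SCHEMA =
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"string\"},"
            + "{\"name\":\"name\",\"type\":\"string\"}]}";

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        // KafkaAvroSerializer registers/looks up the schema in Schema Registry,
        // embeds the schema ID in each message, and serializes the record to bytes.
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://localhost:8081");

        Schema schema = new Schema.Parser().parse(USER_SCHEMA);
        GenericRecord user = new GenericData.Record(schema);
        user.put("id", "42");
        user.put("name", "Jane");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("users", "42", user));
        }
    }
}
```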

9. Replication and Fault Tolerance

Replication involves maintaining copies of every topic partition on multiple brokers, so that data remains accessible even in the event of a broker failure, thus ensuring fault tolerance.

  • Leader replica: The leader replica receives all write requests for a partition. All produce and consume requests go through the leader, for consistency.
  • Follower replicas: Followers replicate data from the leader by fetching log segments and applying them to their local logs. If the current leader fails, a follower can take over as leader.
  • In-Sync Replicas (ISR): The subset of replicas that are fully caught up with the leader. Kafka guarantees durability by only considering a message committed once it has been replicated to all in-sync replicas.

10. Kafka Connect

Kafka Connect serves as a centralized data hub for straightforward data integration between Kafka and databases, key-value stores, file systems, and other systems.

  • Source Connectors: Pull/ingest data from external sources such as databases and message queues into Kafka topics.
  • Sink Connectors: Deliver data from Kafka topics to external systems. (A sketch for registering a source connector over the Connect REST API follows below.)
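
A connector is typically registered by POSTing its configuration to the Connect REST API (port 8083 by default). The sketch below registers Kafka's built-in FileStreamSource connector using Java's HttpClient; the connector name, file path, and topic are illustrative, and a recent JDK is assumed for the text block.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterFileSourceConnector {
    public static void main(String[] args) throws Exception {
        // Connector definition: the built-in FileStreamSource connector tails a file
        // and publishes each new line to the "server-logs" topic.
        String connectorJson = """
                {
                  "name": "log-file-source",
                  "config": {
                    "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
                    "tasks.max": "1",
                    "file": "/var/log/app/server.log",
                    "topic": "server-logs"
                  }
                }
                """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(connectorJson))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```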

11. Kafka Streams

  • Kafka Streams API: This is a powerful, lightweight library for building real-time, scalable and fault-tolerant stream processing applications.
  • Stateless versus Stateful Operations: Stateless operations transform individual records in the input streams independently, whereas Stateful operations maintain and update state based on the records processed so far, enabling operations like aggregations and joins.
  • Windowing: Windowing groups and aggregates records from a data stream into pre-determined time frames, making windowed joins and aggregations possible. (A windowed-count sketch follows this list.)
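
The sketch below combines these ideas: a stateless selectKey transformation followed by a stateful, windowed count over 5-minute tumbling windows. It assumes a "page-views" topic with string keys and values, and Kafka Streams 3.x on the classpath.

```java
import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.TimeWindows;

public class PageViewWindowedCount {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "page-view-counter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> pageViews = builder.stream("page-views");

        pageViews
                // Stateless: transform each record independently (normalize the key).
                .selectKey((key, value) -> key == null ? "unknown" : key.toLowerCase())
                // Stateful + windowed: count views per page in 5-minute tumbling windows.
                .groupByKey()
                .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
                .count()
                .toStream()
                .foreach((windowedKey, count) ->
                        System.out.printf("%s -> %d views%n", windowedKey, count));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```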

12. ksqlDB

ksqlDB enables querying, reading, writing, and processing of data in real time and at scale using SQL-like syntax.
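
For illustration, a small ksqlDB example (shown in ksqlDB's SQL rather than Java, since SQL is the point of the tool); the stream name, columns, and topic are assumptions:

```sql
-- Declare a stream over an existing Kafka topic (assumed names and columns).
CREATE STREAM page_views (user_id VARCHAR, page VARCHAR, viewtime BIGINT)
  WITH (KAFKA_TOPIC = 'page-views', VALUE_FORMAT = 'JSON');

-- Continuously count views per page; results update as new events arrive.
SELECT page, COUNT(*) AS views
FROM page_views
GROUP BY page
EMIT CHANGES;
```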

13. Transactions and Idempotency

Kafka transactions ensure atomicity by allowing producers to group multiple write operations into a single transaction. If the transaction commits successfully, all of the writes become visible to consumers; if it is aborted, none of them are.

  • Exactly-Once Semantics: Ensure that each message is processed exactly once, even in the event of failure. This is achieved through idempotent, transactional producers together with consumers configured with the read_committed isolation level (see the sketch after this list).
  • Isolation Levels: read_uncommitted allows consumers to see all records, including those from aborted transactions, whereas read_committed allows consumers to see only records from committed transactions.
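
A minimal transactional producer sketch follows; the transactional id, topics, keys, and values are placeholders. Consumers that should only see committed writes would additionally set isolation.level=read_committed.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TransactionalProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Idempotence plus a transactional.id enable exactly-once writes.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "payments-producer-1");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            try {
                producer.beginTransaction();
                // Both writes become visible atomically, or not at all.
                producer.send(new ProducerRecord<>("payments", "order-42", "CHARGED"));
                producer.send(new ProducerRecord<>("notifications", "order-42", "RECEIPT_SENT"));
                producer.commitTransaction();
            } catch (Exception e) {
                producer.abortTransaction();
            }
        }
    }
}
```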

14. Security in Kafka

  • SSL/TLS: The Secure Sockets Layer (SSL) and Transport Layer Security (TLS) protocols provide encryption and authentication for data in transit.
  • SASL: The Simple Authentication and Security Layer (SASL) provides a framework for adding authentication and data-security services to connection-based protocols. (A client configuration sketch combining TLS and SASL follows below.)
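
As a sketch, a client secured with TLS encryption and SASL/PLAIN authentication might be configured as follows; the broker address, truststore path, and credentials are placeholders.

```java
import java.util.Properties;

public class SecureClientConfig {
    public static Properties secureProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1.example.com:9093");
        // Encrypt traffic with TLS and authenticate with SASL over that channel.
        props.put("security.protocol", "SASL_SSL");
        props.put("ssl.truststore.location", "/etc/kafka/client.truststore.jks");
        props.put("ssl.truststore.password", "changeit");
        // SASL/PLAIN is the simplest mechanism; SCRAM or OAUTHBEARER are also common.
        props.put("sasl.mechanism", "PLAIN");
        props.put("sasl.jaas.config",
                "org.apache.kafka.common.security.plain.PlainLoginModule required "
                + "username=\"app-user\" password=\"app-secret\";");
        return props;
    }
}
```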

15. Operations and Monitoring

  • Consumer Lag Monitoring: Consumer lag can be monitored by:

    • Consumer groups script - the kafka-consumer-groups.sh tool exposes key details about a consumer group's performance, listing each partition's current offset, log-end offset, and lag.
    • Using Burrow - Burrow is an open-source monitoring tool that tracks committed consumer offsets and generates lag reports.
  • Under-replicated Partitions: A broker metric that can reveal a range of problems, from broker failures to resource exhaustion. A common response to fluctuations in this metric is to run preferred replica elections.

  • Throughput and Latency: Throughput is the number of messages that can be processed in a given amount of time, whereas latency is the time it takes a single message to travel from producer to consumer.

16. Scaling Kafka

  • Partition Count Tuning: Determining and adjusting the optimal number of partitions for a topic, balancing throughput and parallelism against resource utilization and broker limitations (a rough estimation sketch follows this list).
  • Adding Brokers: Kafka's architecture allows broker nodes to be added to a cluster as data-processing needs grow. This horizontal scalability offers increased throughput, fault tolerance, and the ability to handle large data volumes; guides on setting up a multi-broker Kafka cluster cover the steps involved.
  • Rebalancing Partitions: When a consumer joins or leaves a consumer group, Kafka automatically reassigns partitions among the consumers in that group to keep the workload evenly distributed.
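
One common rule of thumb (an approximation, not an official formula) is to size the partition count from measured per-partition throughput: partitions ≈ max(target throughput ÷ producer throughput per partition, target throughput ÷ consumer throughput per partition). The numbers below are purely illustrative.

```java
public class PartitionCountEstimate {
    public static void main(String[] args) {
        // Illustrative numbers, not measurements: target 300 MB/s for the topic,
        // with each partition sustaining ~30 MB/s on the produce side and
        // ~20 MB/s per consumer on the consume side.
        double targetMBps = 300;
        double producerMBpsPerPartition = 30;
        double consumerMBpsPerPartition = 20;

        int partitions = (int) Math.ceil(Math.max(
                targetMBps / producerMBpsPerPartition,
                targetMBps / consumerMBpsPerPartition));

        System.out.println("Suggested partition count: " + partitions); // 15
    }
}
```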

17. Performance Optimization

  • Batching: The producer groups multiple messages destined for the same partition into a single batch. batch.size defines the maximum size of a batch in bytes, whereas linger.ms specifies the maximum time in milliseconds the producer will wait to accumulate messages before sending a batch, even if batch.size has not been reached. (A tuned-producer sketch follows this list.)

  • Compression: Before a batch of messages is sent to the Kafka brokers, it can be compressed by setting compression.type, which specifies the compression algorithm to use, e.g. gzip, zstd, or snappy.

  • Page Cache Usage: The page cache is a transparent buffer maintained by the OS that keeps recently accessed file data in memory. During writes, data goes to the page cache first and the OS flushes those pages to disk asynchronously. During reads, consumer requests are often served directly from the page cache, with the OS handling prefetching of data.

  • Disk and Network Considerations: SSDs and NVMe (non-volatile memory express) drives offer faster read/write speeds. On the network side, high-bandwidth, low-latency infrastructure and distributing partitions and topics across multiple brokers are essential.
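
The producer-side tuning knobs from this list can be combined in one configuration sketch; the values shown are illustrative starting points, not recommendations.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.StringSerializer;

public class TunedProducerConfig {
    public static Properties tunedProps() {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // Batching: send a batch once it reaches 64 KB, or after waiting up to 20 ms
        // for more records, whichever happens first.
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 65536);
        props.put(ProducerConfig.LINGER_MS_CONFIG, 20);
        // Compression: compress whole batches before sending them to the broker.
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "zstd");
        return props;
    }
}
```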

Apache Kafka Real-world Applications

1. Netflix

  • Real-time Data Streaming: Netflix leverages Apache Kafka to handle real-time data streams, including user interactions, content consumption, system logs, and operational metrics.
  • Microservices communication: In Netflix’s microservices architecture, Kafka is used to enable asynchronous communication between services.
  • Log Aggregation and Monitoring: Kafka is instrumental in aggregating logs from the different microservices and components within Netflix's architecture, enabling centralized logging and real-time monitoring.

2. Uber

  • Real-time Pricing: Uber uses Kafka in its pricing pipeline, which adjusts pricing models based on dynamic factors like driver availability, location, and weather to balance supply and demand.
