Kafka consumer lag—Measure and reduce

#devops #performance #dataengineering #monitoring

Kafka consumer lag

Apache Kafka® is an open-source distributed streaming platform that implements applications that rely on real-time event processing and durable event storage. Event processing-based application architecture helps architects to decouple application components and scale them independently. While decoupling helps in scalability and resilience, it also makes the architecture complex to debug.

Optimizing performance for distributed applications requires considerable engineering effort and Kafka is no different. Kafka consumer lag — which measures the delay between a Kafka producer and consumer — is a key Kafka performance indicator. This article explores Kafka consumer lag in detail, including causes, monitoring, and strategies to address it.

Summary of key Kafka consumer lag concepts

Kafka consumer lag is a key performance indicator for the popular Kafka streaming platform. All else equal, lower consumer lag means better Kafka performance. The table below summarizes common causes of Kafka consumer lag. We will explore these causes in more detail later in this article.

What is Kafka consumer lag?

Kafka consumer lag is the difference between the last message produced by the producer and the offset committed by the consumer group. It represents the consumer processing delay.

Understanding the Kafka working model
Kafka excels in acting as a foundation for building decoupled applications that rely on event processing. It does its magic through the concept of producers and consumers. Producers are applications that send events to the Kafka broker. The broker stores the messages durably and enables the client applications to process the events logically.

Data is logically separated inside brokers using Kafka topics. Topics are categories that logically separate data so it can be uniquely addressed. Topic names are unique across a Kafka cluster. Topics are further divided into partitions to facilitate scaling. Partitions keep a subset of data belonging to a topic.

When a producer writes a message to a topic, Kafka broker writes into a partition that belongs to the topic. Kafka maintains the progress of writing data to each partition by tracking the last position of each data write. This position is called long-end offset. It is a partition-specific offset.

Consumers contain application logic about how to process the data written to partitions. To facilitate scaling within consumers, Kafka uses the concept of consumer groups. A consumer group is a set of consumers collaborating to consume messages from the same topic. Kafka ensures consumers belonging to the same consumer group receive messages from different partitions.

When a new consumer joins the group, Kafka rebalances the members in that consumer group to ensure that the new consumer gets a fair share of assigned partitions. Every rebalance operation results in new group configurations. Group configuration here means the assignment of consumers to various partitions.

Kafka message processing can be scaled by adding more consumers to a consumer group. To enable resilience, Kafka consumers keep track of the last position in a partition from where it is read. This helps consumers to begin again from the position they left off in case of unfortunate situations like crashes. This is called consumer offset. Consumer offset is stored in a separate Kafka topic.

The difference between the last offset stored by the broker and the last committed offset for that partition is called consumer lag. It defines the gap between consumers and producers. Consumer lag provides information about the real-time performance of the processing system. A positive value of consumer lag often flags up a sudden spike in traffic, skewed data patterns, a scaling problem, or even a code-level issue.

Reasons for Kafka consumer lag

Consumer lag can occur because of several internal and external factors. Even a healthy Kafka cluster will have some consumer lag at times. As long as the lag goes down in a reasonable time, there is nothing to worry about. The lag becomes alarming when it does not decrease or show signs of a gradual increase.

Incoming traffic surge
Traffic patterns often vary through a wide range based on external patterns. For example, imagine an IoT sensor system that sends alerts based on specific external environment variables. A change in the external environment for a set of customers can flood the topic with sudden spikes. Consumers will have difficulty dealing with the sudden spike, and the lag can become alarmingly high. Manual scaling helps address Kafka consumer lag in these cases.

Data skew in partitions
Partitions bring parallelism to Kafka. Consumers within a consumer group are mapped to specific partitions. The idea is that each consumer has enough resources to handle messages coming to that partition. But data is often not uniformly distributed in partitions. Kafka provides multiple strategies to select partitions while writing data. The simplest is robin assignment, where data is uniformly distributed. But round robin is unsuitable for applications that maintain state or order. In such cases, an application-specific partition key is used.

If the partition key does not distribute data uniformly, some partitions can have more data than others. Imagine a unique customer identity is mapped to a partition key. If a specific customer sends more data than others, that partition will experience a skew leading to consumer lag.

Slow processing jobs
Consumers process the messages pulled from the partitions according to application logic. Application logic can contain tasks like complex data transformations, external microservice access, database writes, etc. Such processing mechanisms are time-consuming and can get stuck due to external factors. Imagine a consumer that accesses an external microservice to complete its task. If the response time of the external service increases because of other factors, Kafka will experience consumer lag.

Error in code and pipeline components
Kafka consumers often contain complex application logic. Like any code, that logic can have bugs. For example, a processing module can go into an infinite loop or use inefficient algorithms. Similarly, improper handling of an erroneous or unexpected input message can slow a particular consumer. Such instances will result in consumer lag.

Monitoring Kafka consumer lag

Monitoring Kafka consumer lag helps developers take corrective actions to stabilize the cluster and optimize performance. Typically, there will always be a lag because batching and lag values vary from partition to partition. Slight lag is not a significant problem if it is stable. But lag with a tendency to increase points to a problem. This section details how teams can monitor consumer lag to identify potential issues.

Monitoring Kafka consumer lag with the consumer group script

The Kafka consumer group script exposes key details about the consumer group performance. It details each partition’s ‘current offset’, ‘log end offset’, and lag. The ‘current offset’ of a partition is the last committed message of the consumer dealing with that partition. The ‘log end offset' is the highest offset in that partition. In other words, it represents the offset of the last message written to that partition. The difference between these two is the amount of consumer lag.

You can use the command below to get consumer lag details with the consumer group script.

$KAFKA_HOME/bin/kafka-consumer-groups.sh  --bootstrap-server <> --describe --group <group_name>

Executing this in a live server will result in the below output.

GROUP          TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG    OWNER
ub-kf          test-topic      0          15              17              2      ub-kf-1/127.0.0.1  
ub-kf          test-topic      1          14              15              1         ub-kf-2/127.0.0.1

Conclusion

Consumer lag is a key metric that provides information about the extent of catchup consumers must do to achieve near real-time operation. While a little bit of consumer lag is inevitable, an increasing consumer lag points to a problem in data distribution, code, and traffic patterns.