Introducing klag: The Kafka Lag Exporter I Always Wanted.

#kafka #monitoring

TL;DR: please welcome github.com/themoah/klag — a new Kafka lag exporter with multiple sink options and even more visibility into your Kafka consumers.

Kafka Lag Exporter was very cool. I couldn't imagine running a serious Kafka-based workload without it. And then the repo got archived. So I had no other choice: I had to build my own Kafka lag exporter. A better one. With all the features I was missing.

It's fast, lightweight and very extensible, built atop Java 21, the Vert.x library and Micrometer. It already supports OTLP, Datadog and Prometheus.

So what is Kafka lag?

Basically, it’s the number of records (messages) that haven’t been processed by a consumer group — the downstream service. In general, you expect it to be zero or close to it. If it’s growing fast — something is wrong. Decreasing lag is the best indicator that you’ve successfully identified and fixed the issue.
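
To make the definition concrete, here is a minimal sketch in Python (not klag's actual code). Per-partition lag is the partition's log end offset minus the group's committed offset, and the group's lag is the sum over partitions. The function names and the sample offsets are made up for illustration.

```python
def partition_lag(end_offset: int, committed_offset: int) -> int:
    """Records written to the partition that the group has not committed yet."""
    # Clamp at zero: the two offsets are read at slightly different moments,
    # so a naive subtraction can briefly go negative.
    return max(end_offset - committed_offset, 0)


def group_lag(end_offsets: dict[int, int], committed: dict[int, int]) -> int:
    """Total lag of one consumer group on one topic, keyed by partition id."""
    # Assumption of this sketch: a partition without a committed offset
    # is counted from offset 0.
    return sum(
        partition_lag(end, committed.get(partition, 0))
        for partition, end in end_offsets.items()
    )


# Hypothetical sample: three partitions of one topic.
end_offsets = {0: 1500, 1: 1480, 2: 1510}
committed = {0: 1500, 1: 1470, 2: 1310}
print(group_lag(end_offsets, committed))  # 0 + 10 + 200 = 210
```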

Simple, right?

But let’s not forget: scale and distributed systems can make even the simplest problem incredibly hard. Processing metrics in a timely manner on huge clusters with hundreds of brokers, thousands of consumer groups, and millions of partitions is no joke.

The missing features.

Lag velocity.

Measuring the speed of lag change (growth or decline) is the key. Finding the right threshold based on the number of records is a long story of trial and error. Lag has been growing without control for the last while? That really requires attention.
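
One simple way to think about velocity is the change in lag between two samples divided by the time between them. The sketch below assumes exactly that; the `LagSample` type and the numbers are hypothetical, and klag may compute it differently (for example, over a sliding window).

```python
from dataclasses import dataclass


@dataclass
class LagSample:
    timestamp: float  # seconds since epoch
    lag: int          # total lag observed at that moment


def lag_velocity(previous: LagSample, current: LagSample) -> float:
    """Records per second: positive means lag grows, negative means it shrinks."""
    elapsed = current.timestamp - previous.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in chronological order")
    return (current.lag - previous.lag) / elapsed


# Hypothetical samples taken one minute apart.
before = LagSample(timestamp=0.0, lag=1_000)
after = LagSample(timestamp=60.0, lag=4_000)
velocity = lag_velocity(before, after)
print(velocity)  # 3000 records in 60 seconds = 50.0 records/s
```

Alerting on a velocity that stays positive for several samples in a row avoids picking a fixed record-count threshold per topic.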

Hot partitions.

One partition can cause a lot of headaches: skew in processing, late data and other bad effects. A topic with 100+ partitions might have an average lag under the threshold, but a single outlier partition can break your data pipeline.
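
Here is a small sketch of why the average hides the problem and how an outlier can be flagged instead. Comparing each partition against a multiple of the median lag is my assumption for illustration, not necessarily the rule klag applies. The `factor` parameter and the sample data are hypothetical.

```python
from statistics import mean, median


def hot_partitions(lags: dict[int, int], factor: float = 10.0) -> list[int]:
    """Partition ids whose lag exceeds `factor` times the median lag."""
    # Use at least 1 as the baseline so a median of zero does not flag
    # every partition that has a single unprocessed record.
    baseline = max(median(list(lags.values())), 1)
    return sorted(p for p, lag in lags.items() if lag > factor * baseline)


# Hypothetical topic: 100 partitions, 99 healthy and one stuck.
lags = {partition: 10 for partition in range(100)}
lags[42] = 5_000

average = mean(list(lags.values()))
print(average)               # 59.9, comfortably under a threshold of 100
print(hot_partitions(lags))  # [42]
```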

Stale groups.

Reduce the noise. Inactive groups require cleanup, which can be very tricky.

Splitting the requests.

A huge cluster with thousands of topic-partitions and tens (or hundreds) of consumer groups produces an enormous amount of metrics, and fetching them all at once can overload your Kafka (and ZooKeeper, if you are still using it) cluster. Chunking the requests into small sub-groups reduces the load on the cluster.
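
The chunking idea can be sketched in a few lines. This is only an illustration: the `chunked` helper, the chunk size and the group names are hypothetical, and the real exporter would issue one offset request per chunk rather than just building lists.

```python
from typing import Iterator


def chunked(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical consumer groups, queried three at a time.
groups = [f"group-{i}" for i in range(7)]
batches = list(chunked(groups, 3))
print([len(batch) for batch in batches])  # [3, 3, 1]
```

Each batch becomes one smaller request, so no single call asks the cluster for every group at once.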

What's next?

Planned next features:

Advanced filtering: whitelist (or blacklist) topics and/or consumer groups.
Support running with multiple metrics reporters.
More metrics sinks - StatsD, CloudWatch and others.

Feature requests and feedback are always welcome.
(If you are running klag in production, I'd love to share your story or add a link to the README.)

DEV Community