DEV Community

closeup1202


Kafka Lag Is High — But Why? I Built a CLI to Answer That

klag — A CLI That Tells You Why Your Kafka Consumer Is Lagging

You've seen it before: your Kafka consumer lag is climbing. You open Grafana, stare at the offset graphs, and start guessing.

Is the producer sending too much? Did a rebalance fire? Is the consumer process even alive?

klag is a CLI tool I built to skip the guessing. It connects to your Kafka broker, snapshots lag per partition, samples produce/consume rates, and runs a set of detectors to tell you the root cause — directly in your terminal.


Install

npm install -g @closeup1202/klag

Basic Usage

# One-shot analysis
klag -b localhost:9092 -g my-consumer-group

# Watch mode — refreshes every 5s
klag -b localhost:9092 -g my-consumer-group --watch

# Omit --group to pick interactively from a live list
klag -b localhost:9092

What It Detects

klag runs up to five detectors after collecting a lag snapshot and a rate sample:

| Detector | What it catches |
| --- | --- |
| PRODUCER_BURST | Produce rate is 2x+ higher than consume rate — consumer can't keep up |
| SLOW_CONSUMER | Consumer has stalled — produce rate is active but consume rate is near zero |
| OFFSET_NOT_MOVING | Offsets haven't advanced despite active production — possible processing deadlock |
| REBALANCING | Group is in PreparingRebalance / CompletingRebalance — consumption is paused |
| HOT_PARTITION | One partition holds a disproportionate share of total lag |

Each result includes a description of what was observed and a suggestion for what to do next.
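To make the first two detectors concrete, here's a minimal sketch of how that kind of rule can be expressed. The types and thresholds below are illustrative assumptions, not klag's actual internals:

```typescript
// Hypothetical shape of one topic's rate sample (field names are
// illustrative, not klag's real types).
interface RateSample {
  topic: string;
  produceRate: number; // msg/s, derived from log-end offset movement
  consumeRate: number; // msg/s, derived from committed offset movement
}

interface Detection {
  type: "PRODUCER_BURST" | "SLOW_CONSUMER";
  topic: string;
  description: string;
}

// SLOW_CONSUMER covers the near-zero-consumption case; PRODUCER_BURST
// covers the "consuming, but 2x+ behind" case. Checking them in this
// order keeps the two mutually exclusive for a given topic.
function detect(sample: RateSample): Detection | null {
  const { topic, produceRate, consumeRate } = sample;
  if (produceRate > 0 && consumeRate < 0.1) {
    return { type: "SLOW_CONSUMER", topic, description: "consumer stalled" };
  }
  if (consumeRate > 0 && produceRate / consumeRate >= 2) {
    return {
      type: "PRODUCER_BURST",
      topic,
      description: `produce ${produceRate} msg/s vs consume ${consumeRate} msg/s`,
    };
  }
  return null;
}
```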


Sample Output

⚡ klag  v0.5.0

🔍 Consumer Group: my-consumer-group
   Broker:         localhost:9092
   Collected At:   2026-04-06 14:32:01 (Asia/Seoul)

   Group Status : 🚨 CRITICAL   Total Lag : 80,500   Drain : 58m15s

┌──────────────────┬───────────┬──────────────────┬────────────────┬────────┬────────────┬───────────┬──────────────┬──────────────┐
│ Topic            │ Partition │ Committed Offset │ Log-End Offset │ Lag    │ Status     │ Drain     │ Produce Rate │ Consume Rate │
├──────────────────┼───────────┼──────────────────┼────────────────┼────────┼────────────┼───────────┼──────────────┼──────────────┤
│ orders           │         0 │        1,200,000 │      1,242,000 │ 42,000 │ 🔴 HIGH    │ 58m20s    │ 120.0 msg/s  │  12.0 msg/s  │
│                  │         1 │        1,198,500 │      1,237,000 │ 38,500 │ 🔴 HIGH    │ 58m10s    │ 115.0 msg/s  │  11.0 msg/s  │
└──────────────────┴───────────┴──────────────────┴────────────────┴────────┴────────────┴───────────┴──────────────┴──────────────┘

🔎 Root Cause Analysis

   [PRODUCER_BURST] orders
   → produce rate 235.0 msg/s vs consume rate 23.0 msg/s (10.2x difference) — consumer is falling behind ingestion rate
   → Suggestion: Consider increasing consumer instances or partition count

SSL & SASL Support

For production clusters:

# SASL/SCRAM
klag -b broker:9092 -g my-group \
  --sasl-mechanism scram-sha-256 \
  --sasl-username user \
  --sasl-password pass   # or set KLAG_SASL_PASSWORD env var

# Mutual TLS
klag -b broker:9092 -g my-group \
  --ssl-ca ./ca.pem \
  --ssl-cert ./client.pem \
  --ssl-key ./client.key

.klagrc Config File

Tired of typing the same flags? Drop a .klagrc in your project root or home directory:

{
  "broker": "kafka.internal:9092",
  "group": "my-consumer-group",
  "interval": 5000,
  "timeout": 5000,
  "ssl": {
    "enabled": true,
    "caPath": "/etc/kafka/ca.pem", 
    "certPath": "/etc/kafka/client.crt",
    "keyPath": "/etc/kafka/client.key"
  },
  "sasl": {
    "mechanism": "plain",
    "username": "user"
  }
}

CLI flags always take precedence over the config file.
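That precedence rule amounts to "start from the file, let any flag the user actually passed win." A sketch of the idea, with an illustrative (not klag's actual) config shape:

```typescript
// Illustrative subset of the config; klag's real option set is larger.
interface Config {
  broker?: string;
  group?: string;
  interval?: number;
}

// Merge .klagrc values with CLI flags: only flags that were actually
// provided (i.e. not undefined) override the file.
function mergeConfig(file: Config, cli: Config): Config {
  const merged: Config = { ...file };
  for (const [key, value] of Object.entries(cli)) {
    if (value !== undefined) (merged as Record<string, unknown>)[key] = value;
  }
  return merged;
}
```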


JSON Output

For integrations or scripting:

klag -b localhost:9092 -g my-group --json

Returns a full JSON payload with partition-level lag, rate snapshots, and RCA results.
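For example, a cron job could shell out to klag and act on the payload. The field names below (`partitions`, `lag`) are assumptions for illustration — inspect the actual schema with `klag ... --json` before relying on them:

```typescript
// Hypothetical wrapper around `klag --json`; payload field names are assumed.
import { execFileSync } from "node:child_process";

interface PartitionLag {
  topic: string;
  partition: number;
  lag: number;
}

// Sum lag across all partitions in the report.
function totalLag(partitions: PartitionLag[]): number {
  return partitions.reduce((sum, p) => sum + p.lag, 0);
}

// Example: fail the job when total lag crosses a threshold.
function main(): void {
  const raw = execFileSync("klag", ["-b", "localhost:9092", "-g", "my-group", "--json"]);
  const report = JSON.parse(raw.toString());
  if (totalLag(report.partitions) > 50_000) process.exit(1);
}
```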


How It Works

  1. Lag collection — connects via KafkaJS, fetches committed offsets and log-end offsets for every partition in the group
  2. Rate sampling — waits one interval (default 5s), collects offsets again, computes per-partition produce/consume rates
  3. Analysis — runs the detector pipeline; each detector reads the same snapshot independently
  4. Report — renders a colored table with severity levels (OK / WARN / HIGH) and drain time estimates
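Step 2 boils down to simple offset arithmetic: producers advance the log-end offset, consumers advance the committed offset, and dividing each delta by the elapsed time gives the two rates. A minimal sketch (names are illustrative, not klag internals):

```typescript
// One offset observation for a single partition.
interface OffsetSnapshot {
  committedOffset: number; // consumer group's committed position
  logEndOffset: number;    // broker's latest offset
  timestampMs: number;
}

// Two snapshots taken one interval apart yield per-partition rates.
function rates(first: OffsetSnapshot, second: OffsetSnapshot) {
  const seconds = (second.timestampMs - first.timestampMs) / 1000;
  return {
    // producers move the log-end offset forward...
    produceRate: (second.logEndOffset - first.logEndOffset) / seconds,
    // ...consumers move the committed offset forward
    consumeRate: (second.committedOffset - first.committedOffset) / seconds,
  };
}
```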

Severity is based on time-to-drain when rate data is available:

| Level | Drain time at current consume rate |
| --- | --- |
| OK | Under 60 seconds |
| WARN | 60s – 5 minutes |
| HIGH | Over 5 minutes, or consume rate is zero |
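Those thresholds can be sketched in a few lines (exact boundary handling is my assumption, not klag's verified behavior):

```typescript
type Severity = "OK" | "WARN" | "HIGH";

// Classify a partition by its estimated time-to-drain: lag divided by
// the current consume rate. A zero consume rate never drains, so it is
// HIGH regardless of lag size (unless there is no lag at all).
function severity(lag: number, consumeRate: number): Severity {
  if (lag === 0) return "OK";
  if (consumeRate <= 0) return "HIGH";
  const drainSeconds = lag / consumeRate;
  if (drainSeconds < 60) return "OK";
  if (drainSeconds <= 300) return "WARN";
  return "HIGH";
}
```

Plugging in the sample output above: 42,000 messages at 12 msg/s is 3,500 seconds, or about 58m20s — well past the 5-minute line, hence 🔴 HIGH.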

The detectors are mutually exclusive by design — PRODUCER_BURST and SLOW_CONSUMER can't both fire for the same topic, which keeps the output clean.


What's Next

  • Alert mode — non-zero exit code when lag exceeds a threshold
  • Partition-level RCA breakdown
  • Prometheus metrics export

Links

If you work with Kafka and spend time staring at lag dashboards, give it a try. Feedback and PRs welcome.


Tags: kafka, devtools, cli, node, typescript
