klag — A CLI That Tells You Why Your Kafka Consumer Is Lagging
You've seen it before: your Kafka consumer lag is climbing. You open Grafana, stare at the offset graphs, and start guessing.
Is the producer sending too much? Did a rebalance fire? Is the consumer process even alive?
klag is a CLI tool I built to skip the guessing. It connects to your Kafka broker, snapshots lag per partition, samples produce/consume rates, and runs a set of detectors to tell you the root cause — directly in your terminal.
Install
npm install -g @closeup1202/klag
Basic Usage
# One-shot analysis
klag -b localhost:9092 -g my-consumer-group
# Watch mode — refreshes every 5s
klag -b localhost:9092 -g my-consumer-group --watch
# Omit --group to pick interactively from a live list
klag -b localhost:9092
What It Detects
klag runs up to five detectors after collecting a lag snapshot and a rate sample:
| Detector | What it catches |
|---|---|
| PRODUCER_BURST | Produce rate is 2x+ higher than consume rate — consumer can't keep up |
| SLOW_CONSUMER | Consumer has stalled — produce rate is active but consume rate is near zero |
| OFFSET_NOT_MOVING | Offsets haven't advanced despite active production — possible processing deadlock |
| REBALANCING | Group is in PreparingRebalance / CompletingRebalance — consumption paused |
| HOT_PARTITION | One partition holds a disproportionate share of total lag |
Each result includes a description of what was observed and a suggestion for what to do next.
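The burst-vs-stall distinction above can be sketched as two mutually exclusive checks. This is a simplified illustration, not klag's actual source — the 2x ratio comes from the table, but the "near zero" cutoff and function shape are my assumptions:

```typescript
// Simplified sketch of the PRODUCER_BURST / SLOW_CONSUMER distinction.
// The NEAR_ZERO cutoff is an illustrative assumption.
type Detection = "PRODUCER_BURST" | "SLOW_CONSUMER" | null;

function detect(produceRate: number, consumeRate: number): Detection {
  const NEAR_ZERO = 0.1; // msg/s below which the consumer counts as stalled

  // Stall check first: the producer is active but consumption has stopped.
  if (produceRate > NEAR_ZERO && consumeRate <= NEAR_ZERO) {
    return "SLOW_CONSUMER";
  }
  // Burst: consumption is happening, but production outpaces it 2x or more.
  if (consumeRate > NEAR_ZERO && produceRate >= 2 * consumeRate) {
    return "PRODUCER_BURST";
  }
  return null;
}

detect(235.0, 23.0); // → "PRODUCER_BURST" (the 10.2x gap from the sample below)
```

Because the stall check returns before the burst check runs, the two can never fire for the same rates.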
Sample Output
⚡ klag v0.5.0
🔍 Consumer Group: my-consumer-group
Broker: localhost:9092
Collected At: 2026-04-06 14:32:01 (Asia/Seoul)
Group Status : 🚨 CRITICAL Total Lag : 80,500 Drain : 58m15s
┌──────────────────┬───────────┬──────────────────┬────────────────┬────────┬────────────┬───────────┬──────────────┬──────────────┐
│ Topic │ Partition │ Committed Offset │ Log-End Offset │ Lag │ Status │ Drain │ Produce Rate │ Consume Rate │
├──────────────────┼───────────┼──────────────────┼────────────────┼────────┼────────────┼───────────┼──────────────┼──────────────┤
│ orders │ 0 │ 1,200,000 │ 1,242,000 │ 42,000 │ 🔴 HIGH │ 58m20s │ 120.0 msg/s │ 12.0 msg/s │
│ │ 1 │ 1,198,500 │ 1,237,000 │ 38,500 │ 🔴 HIGH │ 58m10s │ 115.0 msg/s │ 11.0 msg/s │
└──────────────────┴───────────┴──────────────────┴────────────────┴────────┴────────────┴───────────┴──────────────┴──────────────┘
🔎 Root Cause Analysis
[PRODUCER_BURST] orders
→ produce rate 235.0 msg/s vs consume rate 23.0 msg/s (10.2x difference) — consumer is falling behind ingestion rate
→ Suggestion: Consider increasing consumer instances or partition count
SSL & SASL Support
For production clusters:
# SASL/SCRAM
klag -b broker:9092 -g my-group \
--sasl-mechanism scram-sha-256 \
--sasl-username user \
--sasl-password pass # or set KLAG_SASL_PASSWORD env var
# Mutual TLS
klag -b broker:9092 -g my-group \
--ssl-ca ./ca.pem \
--ssl-cert ./client.pem \
--ssl-key ./client.key
.klagrc Config File
Tired of typing the same flags? Drop a .klagrc in your project root or home directory:
{
"broker": "kafka.internal:9092",
"group": "my-consumer-group",
"interval": 5000,
"timeout": 5000,
"ssl": {
"enabled": true,
"caPath": "/etc/kafka/ca.pem",
"certPath": "/etc/kafka/client.crt",
"keyPath": "/etc/kafka/client.key"
},
"sasl": {
"mechanism": "plain",
"username": "user"
}
}
CLI flags always take precedence over the config file.
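That precedence boils down to a shallow merge in which explicitly set flags win. A minimal sketch — the field names mirror the .klagrc keys above, but the helper itself is my assumption, not klag's internals:

```typescript
// Sketch of flag-over-config precedence: any flag explicitly set on the
// CLI overrides the same key from .klagrc; unset flags fall through.
interface Options {
  broker?: string;
  group?: string;
  interval?: number;
  timeout?: number;
}

function resolveOptions(config: Options, cliFlags: Options): Options {
  // Drop undefined CLI values so they don't clobber config entries.
  const explicit = Object.fromEntries(
    Object.entries(cliFlags).filter(([, v]) => v !== undefined)
  );
  return { ...config, ...explicit };
}
```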
JSON Output
For integrations or scripting:
klag -b localhost:9092 -g my-group --json
Returns a full JSON payload with partition-level lag, rate snapshots, and RCA results.
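As an example of scripting against that payload, here is a sketch that sums partition-level lag and compares it to a threshold. The field names (topic, partition, lag) are illustrative assumptions about the schema — check the actual --json output for the real shape:

```typescript
// Illustrative consumer of a klag --json payload.
// The PartitionLag shape is an assumption, not klag's documented schema.
interface PartitionLag {
  topic: string;
  partition: number;
  lag: number;
}

function totalLag(partitions: PartitionLag[]): number {
  return partitions.reduce((sum, p) => sum + p.lag, 0);
}

// e.g. parse `klag ... --json` output in a script and flag high lag:
const sample: PartitionLag[] = [
  { topic: "orders", partition: 0, lag: 42_000 },
  { topic: "orders", partition: 1, lag: 38_500 },
];
const overThreshold = totalLag(sample) > 50_000; // true for this sample
```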
How It Works
- Lag collection — connects via KafkaJS, fetches committed offsets and log-end offsets for every partition in the group
- Rate sampling — waits one interval (default 5s), collects offsets again, computes per-partition produce/consume rates
- Analysis — runs the detector pipeline; each detector reads the same snapshot independently
- Report — renders a colored table with severity levels (OK/WARN/HIGH) and drain time estimates
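The rate-sampling step above boils down to two offset snapshots taken one interval apart. A sketch under that description — variable names are mine, not klag's:

```typescript
// Two snapshots of the same partition, taken one interval apart.
interface OffsetSnapshot {
  committedOffset: number; // consumer group's committed position
  logEndOffset: number;    // broker's latest offset for the partition
}

function rates(a: OffsetSnapshot, b: OffsetSnapshot, intervalMs: number) {
  const seconds = intervalMs / 1000;
  return {
    // Producer appends move the log-end offset forward.
    produceRate: (b.logEndOffset - a.logEndOffset) / seconds,
    // Consumer commits move the committed offset forward.
    consumeRate: (b.committedOffset - a.committedOffset) / seconds,
  };
}
```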
Severity is based on time-to-drain when rate data is available:
| Level | Drain time at current consume rate |
|---|---|
| OK | Under 60 seconds |
| WARN | 60s – 5 minutes |
| HIGH | Over 5 minutes, or consume rate is zero |
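In code, that mapping is a drain-time lookup. A sketch of the rule exactly as stated in the table — klag's internals may differ:

```typescript
type Severity = "OK" | "WARN" | "HIGH";

// Drain time = how long the current backlog takes to clear at the
// current consume rate; a stalled consumer (rate 0) is always HIGH.
function classify(lag: number, consumeRate: number): Severity {
  if (consumeRate <= 0) return "HIGH";
  const drainSeconds = lag / consumeRate;
  if (drainSeconds < 60) return "OK";
  if (drainSeconds <= 5 * 60) return "WARN";
  return "HIGH";
}

classify(42_000, 12); // → "HIGH" (3500s ≈ 58m20s, matching the sample table)
```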
The detectors are mutually exclusive by design — PRODUCER_BURST and SLOW_CONSUMER can't both fire for the same topic, which keeps the output clean.
What's Next
- Alert mode — non-zero exit code when lag exceeds a threshold
- Partition-level RCA breakdown
- Prometheus metrics export
Links
- npm: npm install -g @closeup1202/klag
- GitHub: https://github.com/closeup1202/klag
If you work with Kafka and spend time staring at lag dashboards, give it a try. Feedback and PRs welcome.
Tags: kafka, devtools, cli, node, typescript