DEV Community

closeup1202


Kafka Lag Is High — But Why? I Built a CLI to Answer That

klag — A CLI That Tells You Why Your Kafka Consumer Is Lagging

You've seen it before: your Kafka consumer lag is climbing. You open Grafana, stare at the offset graphs, and start guessing.

Is the producer sending too much? Did a rebalance fire? Is the consumer process even alive?

klag is a CLI tool I built to skip the guessing. It connects to your Kafka broker, snapshots lag per partition, samples produce/consume rates, and runs a set of detectors to tell you the root cause — directly in your terminal.


Install

npm install -g @closeup1202/klag

Basic Usage

# One-shot analysis
klag -b localhost:9092 -g my-consumer-group

# Watch mode — refreshes every 5s
klag -b localhost:9092 -g my-consumer-group --watch

# Omit --group to pick interactively from a live list
klag -b localhost:9092

What It Detects

klag runs up to five detectors after collecting a lag snapshot and a rate sample:

| Detector | What it catches |
| --- | --- |
| PRODUCER_BURST | Produce rate is 2x+ higher than consume rate — consumer can't keep up |
| SLOW_CONSUMER | Consumer has stalled — produce rate is active but consume rate is near zero |
| OFFSET_NOT_MOVING | Offsets haven't advanced despite active production — possible processing deadlock |
| REBALANCING | Group is in PreparingRebalance / CompletingRebalance — consumption is paused |
| HOT_PARTITION | One partition holds a disproportionate share of total lag |

Each result includes a description of what was observed and a suggestion for what to do next.
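To make the first two detectors concrete, here's a minimal sketch of how that kind of rule can be expressed. The types and thresholds below are illustrative assumptions, not klag's actual internals:

```typescript
// Hypothetical shape of one topic's rate sample (field names are
// illustrative, not klag's real types).
interface RateSample {
  topic: string;
  produceRate: number; // msg/s, derived from log-end offset movement
  consumeRate: number; // msg/s, derived from committed offset movement
}

interface Detection {
  type: "PRODUCER_BURST" | "SLOW_CONSUMER";
  topic: string;
  description: string;
}

// SLOW_CONSUMER covers the near-zero-consumption case; PRODUCER_BURST
// covers the "consuming, but 2x+ behind" case. Checking them in this
// order keeps the two mutually exclusive for a given topic.
function detect(sample: RateSample): Detection | null {
  const { topic, produceRate, consumeRate } = sample;
  if (produceRate > 0 && consumeRate < 0.1) {
    return { type: "SLOW_CONSUMER", topic, description: "consumer stalled" };
  }
  if (consumeRate > 0 && produceRate / consumeRate >= 2) {
    return {
      type: "PRODUCER_BURST",
      topic,
      description: `produce ${produceRate} msg/s vs consume ${consumeRate} msg/s`,
    };
  }
  return null;
}
```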


Sample Output

⚡ klag  v0.5.0

🔍 Consumer Group: my-consumer-group
   Broker:         localhost:9092
   Collected At:   2026-04-06 14:32:01 (Asia/Seoul)

   Group Status : 🚨 CRITICAL   Total Lag : 80,500   Drain : 58m15s

┌──────────────────┬───────────┬──────────────────┬────────────────┬────────┬────────────┬───────────┬──────────────┬──────────────┐
│ Topic            │ Partition │ Committed Offset │ Log-End Offset │ Lag    │ Status     │ Drain     │ Produce Rate │ Consume Rate │
├──────────────────┼───────────┼──────────────────┼────────────────┼────────┼────────────┼───────────┼──────────────┼──────────────┤
│ orders           │         0 │        1,200,000 │      1,242,000 │ 42,000 │ 🔴 HIGH    │ 58m20s    │ 120.0 msg/s  │  12.0 msg/s  │
│                  │         1 │        1,198,500 │      1,237,000 │ 38,500 │ 🔴 HIGH    │ 58m10s    │ 115.0 msg/s  │  11.0 msg/s  │
└──────────────────┴───────────┴──────────────────┴────────────────┴────────┴────────────┴───────────┴──────────────┴──────────────┘

🔎 Root Cause Analysis

   [PRODUCER_BURST] orders
   → produce rate 235.0 msg/s vs consume rate 23.0 msg/s (10.2x difference) — consumer is falling behind ingestion rate
   → Suggestion: Consider increasing consumer instances or partition count

SSL & SASL Support

For production clusters:

# SASL/SCRAM
klag -b broker:9092 -g my-group \
  --sasl-mechanism scram-sha-256 \
  --sasl-username user \
  --sasl-password pass   # or set KLAG_SASL_PASSWORD env var

# Mutual TLS
klag -b broker:9092 -g my-group \
  --ssl-ca ./ca.pem \
  --ssl-cert ./client.pem \
  --ssl-key ./client.key

.klagrc Config File

Tired of typing the same flags? Drop a .klagrc in your project root or home directory:

{
  "broker": "kafka.internal:9092",
  "group": "my-consumer-group",
  "interval": 5000,
  "timeout": 5000,
  "ssl": {
    "enabled": true,
    "caPath": "/etc/kafka/ca.pem", 
    "certPath": "/etc/kafka/client.crt",
    "keyPath": "/etc/kafka/client.key"
  },
  "sasl": {
    "mechanism": "plain",
    "username": "user"
  }
}

CLI flags always take precedence over the config file.
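That precedence rule amounts to "start from the file, let any flag the user actually passed win." A sketch of the idea, with an illustrative (not klag's actual) config shape:

```typescript
// Illustrative subset of the config; klag's real option set is larger.
interface Config {
  broker?: string;
  group?: string;
  interval?: number;
}

// Merge .klagrc values with CLI flags: only flags that were actually
// provided (i.e. not undefined) override the file.
function mergeConfig(file: Config, cli: Config): Config {
  const merged: Config = { ...file };
  for (const [key, value] of Object.entries(cli)) {
    if (value !== undefined) (merged as Record<string, unknown>)[key] = value;
  }
  return merged;
}
```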


JSON Output

For integrations or scripting:

klag -b localhost:9092 -g my-group --json

Returns a full JSON payload with partition-level lag, rate snapshots, and RCA results.
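For example, a cron job could shell out to klag and act on the payload. The field names below (`partitions`, `lag`) are assumptions for illustration — inspect the actual schema with `klag ... --json` before relying on them:

```typescript
// Hypothetical wrapper around `klag --json`; payload field names are assumed.
import { execFileSync } from "node:child_process";

interface PartitionLag {
  topic: string;
  partition: number;
  lag: number;
}

// Sum lag across all partitions in the report.
function totalLag(partitions: PartitionLag[]): number {
  return partitions.reduce((sum, p) => sum + p.lag, 0);
}

// Example: fail the job when total lag crosses a threshold.
function main(): void {
  const raw = execFileSync("klag", ["-b", "localhost:9092", "-g", "my-group", "--json"]);
  const report = JSON.parse(raw.toString());
  if (totalLag(report.partitions) > 50_000) process.exit(1);
}
```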


How It Works

  1. Lag collection — connects via KafkaJS, fetches committed offsets and log-end offsets for every partition in the group
  2. Rate sampling — waits one interval (default 5s), collects offsets again, computes per-partition produce/consume rates
  3. Analysis — runs the detector pipeline; each detector reads the same snapshot independently
  4. Report — renders a colored table with severity levels (OK / WARN / HIGH) and drain time estimates
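Step 2 boils down to simple offset arithmetic: producers advance the log-end offset, consumers advance the committed offset, and dividing each delta by the elapsed time gives the two rates. A minimal sketch (names are illustrative, not klag internals):

```typescript
// One offset observation for a single partition.
interface OffsetSnapshot {
  committedOffset: number; // consumer group's committed position
  logEndOffset: number;    // broker's latest offset
  timestampMs: number;
}

// Two snapshots taken one interval apart yield per-partition rates.
function rates(first: OffsetSnapshot, second: OffsetSnapshot) {
  const seconds = (second.timestampMs - first.timestampMs) / 1000;
  return {
    // producers move the log-end offset forward...
    produceRate: (second.logEndOffset - first.logEndOffset) / seconds,
    // ...consumers move the committed offset forward
    consumeRate: (second.committedOffset - first.committedOffset) / seconds,
  };
}
```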

Severity is based on time-to-drain when rate data is available:

| Level | Drain time at current consume rate |
| --- | --- |
| OK | Under 60 seconds |
| WARN | 60s – 5 minutes |
| HIGH | Over 5 minutes, or consume rate is zero |
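Those thresholds can be sketched in a few lines (exact boundary handling is my assumption, not klag's verified behavior):

```typescript
type Severity = "OK" | "WARN" | "HIGH";

// Classify a partition by its estimated time-to-drain: lag divided by
// the current consume rate. A zero consume rate never drains, so it is
// HIGH regardless of lag size (unless there is no lag at all).
function severity(lag: number, consumeRate: number): Severity {
  if (lag === 0) return "OK";
  if (consumeRate <= 0) return "HIGH";
  const drainSeconds = lag / consumeRate;
  if (drainSeconds < 60) return "OK";
  if (drainSeconds <= 300) return "WARN";
  return "HIGH";
}
```

Plugging in the sample output above: 42,000 messages at 12 msg/s is 3,500 seconds, or about 58m20s — well past the 5-minute line, hence 🔴 HIGH.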

The detectors are mutually exclusive by design — PRODUCER_BURST and SLOW_CONSUMER can't both fire for the same topic, which keeps the output clean.


What's Next

  • Alert mode — non-zero exit code when lag exceeds a threshold
  • Partition-level RCA breakdown
  • Prometheus metrics export

Links

If you work with Kafka and spend time staring at lag dashboards, give it a try. Feedback and PRs welcome.


Tags: kafka, devtools, cli, node, typescript
