
Kafka Ingestion & Processing at Scale | Rajamohan Jabbala

Most Kafka failures don’t happen because Kafka can’t scale.
They happen because teams never did the math.

This post walks through a capacity-driven approach to designing Kafka pipelines that scale predictably—before traffic spikes expose weak assumptions.

1. What “Good” Looks Like (SLOs First)

A production Kafka pipeline should:

Handle N msgs/sec per topic with low latency

Scale linearly via partitions, consumers, brokers

Guarantee at-least-once (or exactly-once) semantics

Support fan-out via consumer groups

Stay within clear lag, throughput, durability SLOs

Example SLOs

Produce latency p99 ≤ X ms

Consumer lag ≤ Y sec (steady state)

Recovery ≤ Z min after 2× spike

Availability ≥ 99.9%

If you don’t define these, Kafka tuning becomes superstition.
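
A minimal sketch of treating those SLOs as data instead of tribal knowledge, so alerting and dashboards check against one definition. Every threshold and field name below is a hypothetical example.

```python
# Hypothetical SLO thresholds; replace with your own targets.
SLOS = {
    "produce_latency_p99_ms": 50,   # produce latency p99 ceiling
    "consumer_lag_sec": 30,         # steady-state consumer lag ceiling
    "spike_recovery_min": 10,       # time to drain a 2x spike
    "availability_pct": 99.9,       # availability floor
}

def slo_violations(measured: dict) -> list[str]:
    """Return the names of SLOs that the measured values violate."""
    higher_is_better = {"availability_pct"}
    violations = []
    for name, target in SLOS.items():
        value = measured[name]
        ok = value >= target if name in higher_is_better else value <= target
        if not ok:
            violations.append(name)
    return violations

print(slo_violations({
    "produce_latency_p99_ms": 72,   # breach
    "consumer_lag_sec": 12,
    "spike_recovery_min": 4,
    "availability_pct": 99.95,
}))  # ['produce_latency_p99_ms']
```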

2. Kafka Mechanics (The Parts That Actually Matter)

Topics scale via partitions

One partition → one consumer per group

Multiple consumer groups = independent re-reads (fan-out)

Rule:

Max useful consumers in one group = number of partitions.

More consumers ≠ more throughput.
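
To see the fan-out half of that in code: a sketch using confluent-kafka, assuming a broker at localhost:9092 and an existing orders topic (both stand-ins). The two consumers below are in different groups, so each re-reads the full stream; two consumers sharing one group.id would instead split the partitions, and any consumer beyond the partition count would sit idle.

```python
from confluent_kafka import Consumer

def make_consumer(group_id: str) -> Consumer:
    # Different group.id -> independent offsets -> independent re-read.
    return Consumer({
        "bootstrap.servers": "localhost:9092",  # hypothetical broker
        "group.id": group_id,
        "auto.offset.reset": "earliest",
    })

fraud = make_consumer("fraud")          # gets every message
analytics = make_consumer("analytics")  # also gets every message
for c in (fraud, analytics):
    c.subscribe(["orders"])
```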

3. Logical Architecture

Producers → orders topic (P partitions, RF=3)

Kafka cluster distributes:

One leader per partition

Replicas across brokers

Downstream:

Fraud consumer group

Analytics consumer group

ML feature consumer group

Each group owns its own offsets and scales independently.
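
Declared in code, that topology might look like the sketch below, using confluent-kafka's AdminClient. The broker address is a stand-in, and the partition count is a placeholder until the capacity math in the next section.

```python
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # hypothetical broker

P = 60  # placeholder; derived from the formulas in Section 4
futures = admin.create_topics([
    NewTopic("orders", num_partitions=P, replication_factor=3)
])
for topic, future in futures.items():
    future.result()  # blocks; raises if creation failed
    print(f"created {topic}")
```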

4. Capacity Planning (The Non-Negotiable Step)

Inputs

T = msgs/sec

S = avg msg size (bytes, compressed)

R = replication factor

C = msgs/sec per consumer (measured)

H = headroom (1.3–2×)

RetentionDays

Core formulas

Partitions

P = ceil((T / C) × H)

Ingress

Ingress = T × S × R

Egress (per group)

Egress = T × S

Storage/day (leaders)

T × S × 86,400

Multiply by R and retention.
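
Since these are plain arithmetic, they belong in a script next to the runbook rather than in someone's head. A direct transcription in Python:

```python
import math

def partitions(T: float, C: float, H: float) -> int:
    """Partitions needed: ceil((T / C) * H)."""
    return math.ceil((T / C) * H)

def ingress_bytes_per_sec(T: float, S: float, R: int) -> float:
    """Cluster write bandwidth, replicas included."""
    return T * S * R

def egress_bytes_per_sec(T: float, S: float) -> float:
    """Read bandwidth per consumer group."""
    return T * S

def storage_bytes(T: float, S: float, R: int, retention_days: float) -> float:
    """Disk footprint over retention, replicas included."""
    return T * S * 86_400 * R * retention_days
```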

5. Example (1M msgs/sec)

T = 1,000,000 msgs/sec

S = 200 bytes

C = 25k msgs/sec/consumer

H = 1.5

RF = 3

Results:

60 partitions

~572 MB/sec ingress

~191 MB/sec egress per consumer group

~155 TB for 3-day retention (with replicas)

This is why “just add brokers” isn’t a strategy.
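
Running the inputs through the formulas reproduces those numbers; note the bandwidth figures are in binary megabytes (MiB) while the storage figure is in decimal terabytes.

```python
import math

T, S, C, H, R, days = 1_000_000, 200, 25_000, 1.5, 3, 3

print(math.ceil((T / C) * H))            # 60 partitions
print(T * S * R / 2**20)                 # ~572 MiB/sec ingress (600 MB/sec decimal)
print(T * S / 2**20)                     # ~191 MiB/sec egress per group
print(T * S * 86_400 * R * days / 1e12)  # ~155.5 TB for 3-day retention
```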

6. Partitioning That Doesn’t Backfire

Use high-cardinality keys (order_id, not country); see the sketch after this list

Monitor skew aggressively

Slightly over-partition early to avoid re-sharding later
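
A keyed-produce sketch, again assuming a local broker and the orders topic: a high-cardinality order_id key spreads load across all partitions while keeping per-order ordering, whereas keying by something like country would funnel most traffic into a few hot partitions.

```python
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})  # hypothetical broker

def publish_order(order_id: str, payload: bytes) -> None:
    # Same key -> same partition, so events for one order stay ordered.
    producer.produce("orders", key=order_id, value=payload)

publish_order("ord-10393", b'{"amount": 4200}')  # hypothetical record
producer.flush()
```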

7. Consumer Group Scaling

Scale consumers up to P

Use separate groups for separate pipelines

Autoscale on lag growth, not raw lag (sketched below)
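
The growth-vs-raw-lag distinction fits in a few lines. The threshold below is a made-up number; the point is the sign of the trend, not the size of the backlog.

```python
def lag_growth_per_sec(lag_then: int, lag_now: int, interval_sec: float) -> float:
    return (lag_now - lag_then) / interval_sec

def should_scale_out(lag_then: int, lag_now: int, interval_sec: float,
                     growth_threshold: float = 100.0) -> bool:
    # A large but shrinking backlog drains on its own; a growing one never will.
    return lag_growth_per_sec(lag_then, lag_now, interval_sec) > growth_threshold

print(should_scale_out(500_000, 480_000, 60))  # False: big lag, but draining
print(should_scale_out(10_000, 40_000, 60))    # True: small lag, growing fast
```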

8. Reliability Defaults That Work

acks=all

min.insync.replicas=2 (with RF=3)

Idempotent producers

Disable unclean leader election

Rack/AZ-aware replicas

Exactly-once only where business semantics demand it.
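
As a confluent-kafka producer config, the client-side half of those defaults looks like this sketch (broker address is a stand-in); min.insync.replicas and unclean leader election are broker/topic settings, noted in the comment.

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",  # hypothetical broker
    "acks": "all",               # wait for the full in-sync replica set
    "enable.idempotence": True,  # retries cannot create duplicates
})

# Broker/topic side (set via kafka-configs or AdminClient, not here):
#   min.insync.replicas=2
#   unclean.leader.election.enable=false
```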

9. Observability > Tuning

Watch:

Lag growth per partition (see the lag sketch at the end of this section)

p95/p99 produce & consume latency

Under-replicated partitions

Disk, NIC, controller health

Scale:

Consumers → lag

Partitions → consumer saturation

Brokers → disk/NIC pressure
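
For the lag signal specifically, per-partition lag can be derived from committed offsets and high watermarks. A sketch with confluent-kafka, assuming the hypothetical broker and orders topic from earlier:

```python
from confluent_kafka import Consumer, TopicPartition

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # hypothetical broker
    "group.id": "fraud",
    "enable.auto.commit": False,
})

def partition_lags(topic: str, num_partitions: int) -> dict[int, int]:
    """Lag per partition = high watermark - committed offset."""
    parts = [TopicPartition(topic, p) for p in range(num_partitions)]
    lags = {}
    for tp in consumer.committed(parts, timeout=10):
        _, high = consumer.get_watermark_offsets(tp, timeout=10)
        committed = tp.offset if tp.offset >= 0 else 0  # no commit yet -> full lag
        lags[tp.partition] = high - committed
    return lags

print(partition_lags("orders", 60))
```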

Final Takeaway

Kafka doesn’t scale because it’s Kafka.
It scales because you designed it to.

Math beats hope.
Measurements beat myths.
