Most Kafka failures don’t happen because Kafka can’t scale.
They happen because teams never did the math.
This post walks through a capacity-driven approach to designing Kafka pipelines that scale predictably—before traffic spikes expose weak assumptions.
- What “Good” Looks Like (SLOs First)
A production Kafka pipeline should:
Handle N msgs/sec per topic with low latency
Scale linearly via partitions, consumers, brokers
Guarantee at-least-once (or exactly-once) semantics
Support fan-out via consumer groups
Stay within clear lag, throughput, durability SLOs
Example SLOs
Produce latency p99 ≤ X ms
Consumer lag ≤ Y sec (steady state)
Recovery ≤ Z min after 2× spike
Availability ≥ 99.9%
If you don’t define these, Kafka tuning becomes superstition.
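To make the targets actionable, pin them in code and alert on breaches. A minimal sketch, with placeholder values standing in for the X, Y, and Z above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PipelineSLOs:
    """Targets for one Kafka pipeline. Values are placeholders, not recommendations."""
    produce_p99_ms: float = 50.0      # X: p99 produce latency budget
    max_consumer_lag_s: float = 30.0  # Y: steady-state lag budget
    spike_recovery_min: float = 10.0  # Z: time to drain backlog after a 2x spike
    availability: float = 0.999       # three nines

def violations(slos: PipelineSLOs, produce_p99_ms: float, lag_s: float) -> list[str]:
    """Return which SLOs the current metrics sample breaches."""
    breaches = []
    if produce_p99_ms > slos.produce_p99_ms:
        breaches.append("produce latency")
    if lag_s > slos.max_consumer_lag_s:
        breaches.append("consumer lag")
    return breaches

print(violations(PipelineSLOs(), produce_p99_ms=80.0, lag_s=12.0))  # ['produce latency']
```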
- Kafka Mechanics (The Parts That Actually Matter)
Topics scale via partitions
One partition → one consumer per group
Multiple consumer groups = independent re-reads (fan-out)
Rule:
Max useful consumers in one group = number of partitions.
More consumers ≠ more throughput.
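The rule in one line of arithmetic: effective parallelism in a group is min(partitions, consumers).

```python
def effective_consumers(partitions: int, consumers: int) -> int:
    """Consumers actually doing work in one group; the rest sit idle."""
    return min(partitions, consumers)

# 12 partitions, 16 consumers: only 12 consume, 4 are idle standbys.
assert effective_consumers(12, 16) == 12
```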
- Logical Architecture
Producers → orders topic (P partitions, RF=3)
Kafka cluster distributes:
One leader per partition
Replicas across brokers
Downstream:
Fraud consumer group
Analytics consumer group
ML feature consumer group
Each group owns its own offsets and scales independently.
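A minimal sketch of creating that topic, assuming the confluent-kafka Python client and a local broker; the partition count comes from the capacity math in the next section:

```python
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "localhost:9092"})  # assumed address

# 60 partitions, RF=3 -- see the worked example below for where 60 comes from.
futures = admin.create_topics(
    [NewTopic("orders", num_partitions=60, replication_factor=3)]
)
for topic, future in futures.items():
    future.result()  # raises if creation failed
    print(f"created {topic}")
```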
- Capacity Planning (The Non-Negotiable Step)
Inputs
T = msgs/sec
S = avg msg size (bytes, compressed)
R = replication factor
C = msgs/sec per consumer (measured)
H = headroom (1.3–2×)
RetentionDays = retention window (days)
Core formulas
Partitions
P = ceil((T / C) × H)
Ingress
Ingress = T × S × R
Egress (per group)
Egress = T × S
Storage/day (leaders)
StoragePerDay = T × S × 86,400
Total = StoragePerDay × R × RetentionDays
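The formulas translate directly into a few lines of Python; a minimal helper for plugging in your own numbers:

```python
import math

def capacity(T: float, S: float, R: int, C: float, H: float, retention_days: float):
    """The core formulas above. Rates in msgs/sec, sizes in bytes."""
    partitions = math.ceil((T / C) * H)
    ingress_bps = T * S * R             # replicated write bandwidth into the cluster
    egress_bps = T * S                  # read bandwidth per consumer group
    storage_bytes = T * S * 86_400 * R * retention_days
    return partitions, ingress_bps, egress_bps, storage_bytes
```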
- Example (1M msgs/sec)
T = 1,000,000 msgs/sec
S = 200 bytes
C = 25k msgs/sec/consumer
H = 1.5
RF = 3
Results:
60 partitions
~572 MiB/sec ingress
~191 MiB/sec egress per consumer group
~155 TB for 3-day retention (with replicas)
This is why “just add brokers” isn’t a strategy.
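Plugging the example inputs into the capacity helper sketched above reproduces the numbers:

```python
p, ingress, egress, storage = capacity(
    T=1_000_000, S=200, R=3, C=25_000, H=1.5, retention_days=3
)
print(p)                # 60 partitions
print(ingress / 2**20)  # ~572 MiB/sec ingress
print(egress / 2**20)   # ~191 MiB/sec egress per group
print(storage / 1e12)   # ~155.5 TB with replicas
```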
- Partitioning That Doesn’t Backfire
Use high-cardinality keys (order_id, not country)
Monitor skew aggressively
Slightly over-partition early to avoid re-sharding later
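A rough way to test a key before committing to it: hash a sample of keys and compare the busiest partition to the average. Python's built-in hash stands in for Kafka's murmur2 partitioner here, so treat this as a sketch of the idea, not the exact distribution:

```python
from collections import Counter

def skew_ratio(keys: list[str], partitions: int) -> float:
    """Busiest partition vs. average; 1.0 is perfectly even."""
    counts = Counter(hash(k) % partitions for k in keys)
    avg = len(keys) / partitions
    return max(counts.values()) / avg

# High-cardinality key (order_id) stays near 1.0; a country key does not.
print(skew_ratio([f"order-{i}" for i in range(100_000)], 60))
```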
- Consumer Group Scaling
Scale consumers up to P
Use separate groups for separate pipelines
Autoscale on lag growth, not raw lag
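A sketch of the "lag growth, not raw lag" rule: scale on the derivative of total group lag over a window. A large but shrinking backlog is recovering; a small but growing one is falling behind. The threshold and sampling interval are assumptions to tune:

```python
def should_scale_out(lag_samples: list[int], interval_s: float,
                     growth_threshold: float = 0.0) -> bool:
    """lag_samples: total group lag (messages) at fixed intervals, oldest first."""
    if len(lag_samples) < 2:
        return False
    growth_rate = (lag_samples[-1] - lag_samples[0]) / (
        (len(lag_samples) - 1) * interval_s
    )  # msgs/sec of net backlog growth
    return growth_rate > growth_threshold

print(should_scale_out([10_000, 12_000, 15_000], interval_s=60))  # True: growing
print(should_scale_out([90_000, 70_000, 50_000], interval_s=60))  # False: draining
```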
- Reliability Defaults That Work
acks=all
min.insync.replicas=2 (with RF=3)
Idempotent producers
Disable unclean leader election
Rack/AZ-aware replicas
Exactly-once only where business semantics demand it.
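What these defaults look like on the producer side, sketched with the confluent-kafka client (broker address is a placeholder; min.insync.replicas and unclean leader election are broker/topic configs, set server-side):

```python
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",  # assumed address
    "acks": "all",               # wait for min.insync.replicas acknowledgments
    "enable.idempotence": True,  # no duplicates on producer retry
})

producer.produce("orders", key=b"order-123", value=b"...")
producer.flush()
```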
- Observability > Tuning
Watch:
Lag growth per partition
p95/p99 produce & consume latency
Under-replicated partitions
Disk, NIC, controller health
Scale:
Consumers → lag growth
Partitions → consumer saturation
Brokers → disk/NIC pressure
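Those three triggers condense into one decision table. A sketch with illustrative thresholds, not values from any particular monitoring stack:

```python
def scaling_action(lag_growing: bool, consumers: int, partitions: int,
                   disk_util: float, nic_util: float) -> str:
    if disk_util > 0.8 or nic_util > 0.8:
        return "add brokers: disk/NIC pressure"
    if lag_growing and consumers < partitions:
        return "add consumers"
    if lag_growing:  # already one consumer per partition: the group is saturated
        return "add partitions (then consumers)"
    return "no action"

print(scaling_action(lag_growing=True, consumers=60, partitions=60,
                     disk_util=0.4, nic_util=0.3))  # add partitions
```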
- Final Takeaway
Kafka doesn’t scale because it’s Kafka.
It scales because you designed it to.
Math beats hope.
Measurements beat myths.