Joud Awad

Posted on May 26

17/30 Days System Design Questions!

#distributedsystems #systemdesign #architecture #software

Your Kafka consumer is processing 800 events/sec.

The producer just hit 5,000 events/sec and it’s not slowing down.

Lag chart: 12 minutes behind and climbing. Consumer memory: 89% and rising. The on-call alert just fired. You have ~4 minutes before the JVM starts GC-thrashing and the pod gets OOM-killed.

Here’s the setup:

Producer → Kafka topic (5K events/sec, growing)

Consumer → spring-kafka @KafkaListener, batch=500, processes ~800 events/sec

Downstream → Postgres write + external HTTP call (the real bottleneck)

SLA → events must be processed, not silently dropped

The consumer can’t keep up. The producer doesn’t know it. What do you do?

A) Drop events on the floor — fail fast, return early, let the lag burn down. The system stays alive.

B) Block the producer — make the consumer signal “slow down,” apply backpressure upstream until it catches up.

C) Buffer harder — bigger in-memory queue, larger batch size, scale the consumer to absorb the spike.

D) Rate-limit + load-shed — cap consumption rate, route the overflow to a DLQ or secondary topic for later replay.

Three of these are real production patterns. Only one of them actually fits this stack and this SLA.

Pick one — A, B, C, or D — and tell me why. Full breakdown in the comments (including why two of the wrong answers will fool engineers who’ve shipped Kafka before).

If your team argues about backpressure in standups, share this with them. The right answer is platform-specific, and most posts get it wrong.

Drop your answer 👇

#30DaysOfSystemDesign #Day17 #SystemDesign #DistributedSystems

Top comments (4)

Joud Awad • May 26

Answer: D — Rate-limit + load-shed ✅

Why D wins:

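The SLA says events must be processed, not silently dropped. That kills A immediately, and B is not something Kafka lets a consumer do. Cap the primary consumer’s processing at a rate it can actually sustain (it tops out around 800/sec today, so a cap above that protects nothing), and route the overflow to a durable secondary topic (events.overflow) with its own consumer group that drains during off-peak. Forwarding an event is cheap next to the Postgres write and HTTP call, so the primary consumer can keep reading at topic speed while doing the expensive work only for the events it keeps. Producer keeps producing. Primary consumer keeps a steady heartbeat. Overflow gets processed later. That’s graceful degradation. Stripe and Shopify publish architecture posts on this exact pattern.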
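A minimal sketch of the shedding decision, in plain Java so it runs without a broker. Everything named here is hypothetical: `LoadShedder`, the token-bucket budget, and the two callbacks, where `overflow` stands in for a producer send to events.overflow and `primary` for the Postgres write plus HTTP call. Time is passed in explicitly so the behaviour is deterministic.

```java
import java.util.List;
import java.util.function.Consumer;

/** Token-bucket load shedder: process what fits the budget, divert the rest. */
class LoadShedder<E> {
    private final double ratePerSec;    // sustained processing budget, e.g. 800
    private final double burst;         // most tokens that can accumulate
    private final Consumer<E> primary;  // slow path (assumed: DB write + HTTP call)
    private final Consumer<E> overflow; // assumed: send to the events.overflow topic
    private double tokens;
    private long lastRefillNanos;

    LoadShedder(double ratePerSec, double burst, long nowNanos,
                Consumer<E> primary, Consumer<E> overflow) {
        this.ratePerSec = ratePerSec;
        this.burst = burst;
        this.primary = primary;
        this.overflow = overflow;
        this.tokens = burst;            // start with a full bucket
        this.lastRefillNanos = nowNanos;
    }

    /** Handles one polled batch; returns how many events were diverted. */
    int handleBatch(List<E> batch, long nowNanos) {
        refill(nowNanos);
        int shed = 0;
        for (E event : batch) {
            if (tokens >= 1.0) {
                tokens -= 1.0;
                primary.accept(event);  // within budget: do the expensive work
            } else {
                overflow.accept(event); // over budget: park it for later replay
                shed++;
            }
        }
        return shed;
    }

    /** Adds tokens for the time elapsed since the last call, capped at burst. */
    private void refill(long nowNanos) {
        double elapsedSec = (nowNanos - lastRefillNanos) / 1_000_000_000.0;
        tokens = Math.min(burst, tokens + elapsedSec * ratePerSec);
        lastRefillNanos = nowNanos;
    }
}
```

With a budget of 800/sec and a full bucket, a 5,000-event burst sends 800 events down the slow path and diverts 4,200. In a real listener you would presumably commit offsets only after the overflow sends are acknowledged, otherwise a crash could turn a diverted event into a dropped one. Diverting also gives up ordering between the two topics, so this only fits if the events tolerate that.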

Joud Awad • May 26

Why B is the trap answer:

This is what fools senior engineers. Backpressure-as-blocking works in Reactor, RxJava, and Akka because the protocol carries the demand signal from subscriber to publisher. Kafka is pull-based and decoupled: the broker sits between the two sides, and the producer doesn’t know your consumer exists. There’s no socket-level backpressure signal flowing upstream from a lagging consumer. Engineers from gRPC streaming or RxJava reach for B because it worked there. In a brokered, decoupled, log-based system, it doesn’t apply.
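For contrast, here is roughly what “the protocol carries the signal” looks like. This sketch uses the JDK’s `java.util.concurrent.Flow` interfaces (Java 9+) as a stand-in for Reactor or RxJava. The class names and the 4-slot buffer are assumptions made for illustration.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Flow;
import java.util.concurrent.SubmissionPublisher;
import java.util.concurrent.atomic.AtomicInteger;

class BackpressureDemo {

    /** Subscriber that asks for one item at a time: this is the demand signal. */
    static class OneAtATime implements Flow.Subscriber<Integer> {
        final AtomicInteger received = new AtomicInteger();
        final CountDownLatch done = new CountDownLatch(1);
        private Flow.Subscription subscription;

        @Override public void onSubscribe(Flow.Subscription s) {
            subscription = s;
            s.request(1);               // "I can take one item"
        }
        @Override public void onNext(Integer item) {
            received.incrementAndGet(); // stand-in for slow work
            subscription.request(1);    // ask for the next one only when ready
        }
        @Override public void onError(Throwable t) { done.countDown(); }
        @Override public void onComplete() { done.countDown(); }
    }

    /** Publishes `items` values through a small buffer; returns how many arrived. */
    static int run(int items) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        OneAtATime subscriber = new OneAtATime();
        try (SubmissionPublisher<Integer> publisher =
                 new SubmissionPublisher<>(executor, 4)) {
            publisher.subscribe(subscriber);
            for (int i = 0; i < items; i++) {
                publisher.submit(i);    // blocks while the subscriber's buffer is full
            }
        }                               // close() completes the stream after delivery
        try {
            subscriber.done.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            executor.shutdown();
        }
        return subscriber.received.get();
    }
}
```

Here `submit` stalls the publishing thread whenever the subscriber falls behind, because publisher and subscriber share one subscription. A Kafka producer and consumer share nothing of the kind. They share only the topic.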

Joud Awad • May 26

Why C is wrong:

Buffering is a delay tactic. A bigger buffer absorbs a 30-second spike, not a sustained 5K/sec input on an 800/sec consumer. The math: with 5,000 events in and 800 out, you fall 4,200 events further behind every second. A 1M-event buffer buys ~4 minutes. Then you’re back where you started, but now with a memory exhaustion problem. And the real bottleneck is the external HTTP call: scaling consumers 5x doesn’t help if downstream rate-limits you at request #51.
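The arithmetic, as a quick check. `BufferMath` is a hypothetical helper, and the 5x case below assumes consumers scale linearly, which the downstream rate limit makes optimistic.

```java
class BufferMath {
    /** Whole seconds until a buffer of `capacity` events fills; -1 if it never does. */
    static long secondsUntilFull(long capacity, long inPerSec, long outPerSec) {
        long deficit = inPerSec - outPerSec; // new backlog per second
        return deficit <= 0 ? -1 : capacity / deficit;
    }
}
```

`BufferMath.secondsUntilFull(1_000_000, 5_000, 800)` returns 238, just under 4 minutes. Even with 5x the consumers at a best-case 4,000/sec, `secondsUntilFull(1_000_000, 5_000, 4_000)` returns 1000, so the same buffer still fills in about 17 minutes.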

Joud Awad • May 26

Why A is wrong:

Drops events. SLA says “must be processed, not silently dropped.” Done.