Joud Awad

Posted on May 31

25/30 Days System Design Questions!

#architecture #backend #distributedsystems #systemdesign

Your order processing service runs on SQS.

Normal load: 200 orders/min. Consumers keep up fine.

Then Black Friday hits. Producers start pushing 4,000 orders/min. Queue depth climbs to nearly 80,000 messages in 20 minutes. Your downstream DB is at 95% CPU. Consumers are falling behind and you're watching the queue grow in real time.

You need to handle this backpressure. What do you do?

A) Scale consumers horizontally — add more Lambda functions / EC2 workers to chew through the backlog faster.

B) Set a visibility timeout and route failures to a dead-letter queue to protect against poison pills.

C) Rate-limit producers at the source — use a token bucket or sliding window to cap how fast messages enter the queue.

D) Switch to SQS delay queues — defer message visibility to spread out delivery and reduce consumer pressure.

Three of these are real patterns engineers reach for. Only one actually solves backpressure.

Pick one — A, B, C, or D — and tell me why. Full breakdown in the comments.

If this made you second-guess your instinct, share it — someone on your team is designing this right now.

#30DaysOfSystemDesign #SystemDesign #AWS #SoftwareArchitecture

Top comments (4)

Joud Awad • May 31

Why C wins (Producer-side rate limiting):

The queue is growing because producers are outrunning consumers: about 4,000 msg/min in against roughly 200 msg/min out. With the downstream DB already at 95% CPU, no amount of consumer scaling closes that gap sustainably.
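To put numbers on that gap, here is a small worked calculation using the figures from the post (4,000 msg/min in, 200 msg/min out, 20 minutes). The function name is illustrative only, and the sketch assumes both rates stay constant.

```python
def queue_depth(in_rate: int, out_rate: int, minutes: int, start: int = 0) -> int:
    """Backlog after `minutes`, assuming constant rates in messages per minute."""
    return max(0, start + (in_rate - out_rate) * minutes)

# Figures from the post: 4,000 msg/min in, 200 msg/min out, 20 minutes.
print(queue_depth(4_000, 200, 20))  # 76000
```

The net growth is 3,800 msg/min, which puts the backlog in the same ballpark as the figure quoted in the question. If the admitted rate is capped at or below the drain rate, the growth term drops to zero or goes negative.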

The fix lives upstream. Rate-limit the producer using a token bucket or sliding window. In AWS, that can mean API Gateway usage plans, reserved concurrency on the producer Lambda functions, or application-level throttle middleware. Once the admitted rate drops to what consumers can drain, queue depth stops growing and the backlog clears at the consumers' natural pace. No runaway DB load. No cost explosion.
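As a rough illustration of the token bucket idea, here is a minimal in-process sketch. It is not tied to any AWS service; the class name, parameters, and the injectable `clock` are assumptions made for the example, and what happens to a rejected order (return HTTP 429, retry later, and so on) is left to the caller.

```python
import time

class TokenBucket:
    """Token bucket: allows bursts up to `capacity`, refills at `rate` tokens/second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock          # injectable so the bucket can be tested
        self.tokens = capacity      # start full
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill for the elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# Hypothetical limit: about 200 messages per minute with a burst of 10.
bucket = TokenBucket(rate=200 / 60, capacity=10)
if bucket.allow():
    pass  # enqueue the order here
else:
    pass  # reject or defer the request here
```

A single in-process bucket only limits one producer instance. Capping a fleet of producers would need a shared limiter or a managed throttle in front of them.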

The mental model: slow the tap, don't just widen the drain.

This is what "backpressure" actually means — the downstream system signaling the upstream to slow down. The queue depth IS the signal. React upstream.
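One way to "react upstream" is to turn the observed queue depth into a producer rate limit. The sketch below is an assumption-laden example: the watermarks, rates, and function name are hypothetical, and the depth is passed in as a plain integer (in SQS it could be read from the `ApproximateNumberOfMessages` queue attribute).

```python
def allowed_rate(depth: int, max_rate: float, min_rate: float,
                 low_water: int, high_water: int) -> float:
    """Map queue depth to a producer rate limit in messages per minute."""
    if depth <= low_water:
        return max_rate          # queue is healthy: no throttling
    if depth >= high_water:
        return min_rate          # queue is saturated: throttle hard
    # Linear interpolation between the two watermarks.
    frac = (depth - low_water) / (high_water - low_water)
    return max_rate - frac * (max_rate - min_rate)

# Hypothetical watermarks: full speed below 1,000 messages, floor at 80,000.
print(allowed_rate(40_500, max_rate=4_000, min_rate=200,
                   low_water=1_000, high_water=80_000))  # 2100.0
```

The returned value could then feed the refill rate of whatever limiter sits in front of the producers.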

Joud Awad • May 31

Why A is the trap answer (Scale consumers):

This is the instinct answer, and it's close enough to fool senior engineers.

Yes, more consumers means more throughput. But you're treating a rate mismatch as a capacity problem. If your producer can always generate faster than your consumers can process — and on Black Friday it can — you're in an arms race you can't win. You spin up 10x Lambda functions, hit DynamoDB write throughput limits, hammer your RDS connection pool, and the queue still grows. Just slower, and now your infra bill doubled.

Consumer scaling works when your bottleneck is compute. It doesn't work when the problem is unbounded producer rate.
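A toy model makes the ceiling visible. The numbers here are hypothetical (50 msg/min per consumer, a database that tops out at 400 msg/min); the only point is that throughput is the smaller of consumer capacity and the shared downstream limit.

```python
def effective_throughput(consumers: int, per_consumer_rate: float,
                         db_capacity: float) -> float:
    """Messages/min actually processed: consumers cannot outrun the shared database."""
    return min(consumers * per_consumer_rate, db_capacity)

# Hypothetical: each consumer handles 50 msg/min, the DB tops out at 400 msg/min.
for n in (4, 8, 80):
    print(n, effective_throughput(n, 50, 400))
```

Under these assumptions throughput flattens at 400 msg/min whether there are 8 consumers or 80, while cost keeps scaling with the consumer count.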

Joud Awad • May 31

Why B is wrong (Visibility timeout + DLQ):

This is failure handling, not backpressure.

A DLQ catches poison pills: messages that fail processing repeatedly and would otherwise keep being redelivered. Visibility timeout controls how long a message stays hidden from other consumers after one picks it up, which limits duplicate processing while the work is in flight. Neither of these slows the producer or reduces queue depth growth.
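For reference, this is roughly what option B configures. The helper below only builds the attribute map (it makes no AWS calls); the function name, default values, and the ARN in the usage line are placeholders, while `VisibilityTimeout`, `RedrivePolicy`, `deadLetterTargetArn`, and `maxReceiveCount` are the SQS attribute names.

```python
import json

def queue_attributes(dlq_arn: str, visibility_timeout_s: int = 60,
                     max_receives: int = 5) -> dict:
    """Build SQS queue attributes for a visibility timeout plus DLQ redrive."""
    return {
        # Seconds a received message stays hidden from other consumers.
        "VisibilityTimeout": str(visibility_timeout_s),
        # After `max_receives` failed receives, SQS moves the message to the DLQ.
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": str(max_receives),
        }),
    }

# Placeholder ARN, for illustration only.
attrs = queue_attributes("arn:aws:sqs:us-east-1:123456789012:orders-dlq")
print(attrs["VisibilityTimeout"])  # 60
```

A map like this would typically be passed as the `Attributes` argument when creating or updating the queue. Nothing in it touches the rate at which producers send messages.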

Mixing up backpressure with retry/error handling is one of the most common architecture confusions I see. They solve completely different problems.

Joud Awad • May 31

Why D is wrong (SQS delay queues):

Delay queues defer message visibility, not message creation. Your producer still pushes 4,000 msg/min. Messages pile up invisibly, then become visible once the delay expires (SQS caps the delay at 15 minutes). You haven't reduced backpressure; you've deferred it, and the full load still lands on the same consumers and the same database.

This is like covering a flooding sink with a lid and calling it fixed. The water is still rising....
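The point that a delay defers load without removing it can be checked with a toy model. This sketch assumes a single fixed queue-level delay measured in whole minutes; the function name is illustrative.

```python
def visible_per_minute(sent_per_minute: list[int], delay_minutes: int) -> list[int]:
    """Messages becoming visible each minute under a fixed queue-level delay."""
    return [0] * delay_minutes + list(sent_per_minute)

sent = [4000] * 5
print(visible_per_minute(sent, 2))  # [0, 0, 4000, 4000, 4000, 4000, 4000]
print(sum(visible_per_minute(sent, 2)) == sum(sent))  # True
```

Consumers get a short quiet period and then face exactly the same volume, so the total work reaching the database is unchanged.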