🚨 Stop Picking the Wrong Queue: You’re Probably Killing Your System 🚨

You’re architecting a new system. You need asynchronous communication. Someone yells “Kafka!” Someone else says “RabbitMQ is easier.” Another person pipes up with “Just use SQS.”

They look similar on the surface. They are not interchangeable. They are not even in the same category. Picking the wrong one—and worse, misunderstanding the fundamental concepts—won’t just make your life harder. It will introduce cascading failures that take months to debug and fix.

Before you write a single line of code, there are four conceptual pillars you must master. If you get these wrong, your system will take itself down the first time a downstream service has a slight slowdown.

Let’s go.

Pillar 1: The Problem We’re Fixing (Decoupling)

Why are we even using queues? We’re trying to solve the problem of synchronous coupling.

Imagine your Checkout Service has to call your Inventory Service directly via an API.

Checkout calls Inventory directly. Inventory slows down (maybe it’s under high load, maybe the DB is lagging). It doesn’t crash; it just takes 5 seconds to respond instead of 50ms.

Because the connection is synchronous, Checkout’s threads pile up waiting. The thread pool exhausts, Checkout stops serving requests entirely, and now anything depending on Checkout is also down.

This is Cascading Failure. It spreads like wildfire. The real cause isn’t the slow inventory; it’s the synchronous coupling that forces overload to feed on itself (retries, timeouts, waiting).

Now Watch What Happens With a Queue 🛡️

A queue acts as a buffer between the two services.

When Inventory slows down, messages simply accumulate in the queue. The Checkout service writes its “Order Placed” message and immediately moves on.

This is the whole point of a queue: It decouples the producer from the consumer in three critical dimensions:

Time ⏳: They don’t need to run at the same moment.

Availability ✅: One can be down for maintenance without taking the other down.

Speed ⚡️: The fast one isn’t held hostage by the slow one.
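Here’s what the producer’s side of that buffer looks like. A minimal sketch in Python, assuming an existing SQS queue named order-events (the queue name and message shape are illustrative, not prescriptive):

```python
import json
import boto3

sqs = boto3.client("sqs")
queue_url = sqs.get_queue_url(QueueName="order-events")["QueueUrl"]

def checkout(order_id: str, items: list[str]) -> None:
    # Instead of calling Inventory synchronously and parking a thread,
    # write the fact down and return immediately.
    sqs.send_message(
        QueueUrl=queue_url,
        MessageBody=json.dumps(
            {"event": "OrderPlaced", "order_id": order_id, "items": items}
        ),
    )
    # Done. If Inventory slows down, messages pile up in the queue
    # instead of piling up as blocked threads inside Checkout.
```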

Pillar 2: Work vs. Events (Point-to-Point vs. Pub/Sub)

A team shipped a feature on a Friday. By Monday morning, every single user was getting 50 duplicate welcome emails. 📧🤯

The root cause was one simple conceptual mistake: They put a “Work Job” (Send This Email) onto a “Topic” (a fan-out mechanism). They had 50 worker processes. The topic did its job and fanned out one copy of the message to all 50 workers. One question asked in review would have saved their weekend:

“Should this message be handled ONCE, or should MULTIPLE services REACT to it?”

Handled Once (The Work Queue) 👷‍♂️

This is Point-to-Point. It’s a “Job” or “Work Queue.” Examples include: resize this image, charge this credit card, send this one email. Any worker in the pool can grab it, but exactly one processes it. The message disappears once handled.

Reacted to by Many (The Event) 📣

This is Pub/Sub (Publisher/Subscriber), often called Topics.

An Event is a statement of fact: An order got placed. The Email Service cares (needs to send a receipt). The Shipping Service cares (needs to label a box). The Analytics Service cares (needs to update dashboards). Three independent, parallel reactions to the same fact. Each gets its own copy of the message.

The tool the team used (perhaps RabbitMQ or Kafka) wasn’t broken. Their semantic understanding of the message intent was wrong. Don’t argue about words (“Kafka Queue” vs. “RabbitMQ Topic”). Ask what actually happens to a message when it arrives.
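You can see both semantics side by side in a few lines. A hedged sketch using RabbitMQ’s Python client, pika (queue and exchange names are mine, for illustration): the work queue delivers one copy total, while the fanout exchange delivers one copy per bound queue, which is exactly how one email job becomes 50 emails.

```python
import pika

conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
ch = conn.channel()

# Work queue: one queue, many competing workers. Each message is
# delivered to exactly ONE of them.
ch.queue_declare(queue="send-email")
ch.basic_publish(exchange="", routing_key="send-email",
                 body=b"welcome email for user 42")

# Pub/sub: a fanout exchange copies the message into EVERY bound queue.
ch.exchange_declare(exchange="order-events", exchange_type="fanout")
for svc in ("email", "shipping", "analytics"):
    ch.queue_declare(queue=f"{svc}-inbox")
    ch.queue_bind(queue=f"{svc}-inbox", exchange="order-events")
ch.basic_publish(exchange="order-events", routing_key="",
                 body=b"OrderPlaced #42")

conn.close()
```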

Pillar 3: Delivery Guarantees (The Exactly Once Myth)

You’ll see three delivery guarantees advertised. One is, quite literally, impossible in general distributed systems.

At Most Once: Fire and forget. We send the message and don’t check. If it drops, it drops. Fine for metrics where one lost data point doesn’t matter.

At Least Once (The Real Default): The producer retries until it gets an acknowledgement (ACK) from the broker. If the ACK gets lost, the message gets resent. You will not lose data, but you will get duplicates. This is what most systems use by default.

Exactly Once: This is where vendors get creative with marketing.

The “Exactly Once” Reality 🛡️⚔️🛡️

Exactly once delivery across an unreliable network is not achievable. It’s tied to a result called the Two Generals Problem. Picture it: A producer sends a message. The broker gets it and sends back an ACK. The ACK vanishes. Now the producer has two choices: Retry (and maybe cause a duplicate) or Give Up (and maybe lose data). The network never tells you which scenario occurred.

When a system advertises “Exactly Once Semantics,” what’s actually happening is At Least Once Delivery PLUS either:

Idempotent Processing: The consumer is smart enough to handle duplicates.

Transactional Writes: The write to the final storage (DB) is part of a distributed transaction.

The distinction to keep in your head: exactly-once delivery over a network? No. Exactly-once effect? Yes, but only through deliberate work on your side.

🚨 Actionable Takeaway: Build Idempotent Consumers 🛡️

My advice: Assume “At Least Once” and build Idempotent Consumers. Every message handler must check: “Have I seen this unique Event ID before?” and skip if it has. This single pattern prevents standard nightmare bugs: double charges, duplicate emails, and inventory drifting.

If you do one thing after this post, check your handlers. If they aren’t idempotent, make them idempotent.
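A minimal sketch of that check, assuming each message carries a unique event_id and using Redis as the dedup store (any store with an atomic “insert if absent” works; charge_card is a hypothetical side effect):

```python
import redis

r = redis.Redis()

def handle(message: dict) -> None:
    event_id = message["event_id"]
    # SET with nx=True is atomic: only the FIRST handler to claim this
    # event_id gets True. The TTL bounds how long old IDs are remembered.
    first_time = r.set(f"seen:{event_id}", 1, nx=True, ex=7 * 24 * 3600)
    if not first_time:
        return  # duplicate delivery: skip it, don't double-charge
    charge_card(message["customer_id"], message["amount"])  # hypothetical
```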

Pillar 4: When Things Fail (DLQs and Backpressure)

The Poison Message and the DLQ 🤮

Your consumer receives one malformed message. The code crashes. The message goes back to the queue, is retried immediately, crashes the consumer again... this creates a “retry storm” that consumes all CPU and blocks every message behind it.

The fix is the Dead Letter Queue (DLQ). After X failed attempts, the “poison message” is moved to a separate holding area, allowing the main pipeline to resume.
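One way to wire that up, sketched with SQS’s redrive policy (queue names and the retry count are illustrative):

```python
import json
import boto3

sqs = boto3.client("sqs")

dlq_url = sqs.create_queue(QueueName="orders-dlq")["QueueUrl"]
dlq_arn = sqs.get_queue_attributes(
    QueueUrl=dlq_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

sqs.create_queue(
    QueueName="orders",
    Attributes={
        # After 5 failed receives, SQS moves the message to the DLQ
        # instead of letting it block the pipeline forever.
        "RedrivePolicy": json.dumps(
            {"deadLetterTargetArn": dlq_arn, "maxReceiveCount": "5"}
        )
    },
)
```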

[Image suggestion: A standard, orderly queue line on one side. Below it, a large, dark graveyard pit with skeletons of messages.]

🚨 Crucial Check: A DLQ Without a Replay Path is a Graveyard 🪦

Most teams set up the DLQ, celebrate, and go home. Messages land in it, the team fixes the bug, and then... nothing. You must build tooling to replay messages back into the main queue. Without a replay path, your DLQ is just where problems go to be forgotten.
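That tooling can start as small as this. A bare-bones replay loop sketched with boto3, reusing the queue pair from the sketch above (re-enqueue first, delete second, so a crash mid-loop produces a duplicate that your idempotent consumers absorb):

```python
import boto3

sqs = boto3.client("sqs")

def replay(dlq_url: str, main_url: str) -> None:
    while True:
        resp = sqs.receive_message(
            QueueUrl=dlq_url, MaxNumberOfMessages=10, WaitTimeSeconds=2
        )
        messages = resp.get("Messages", [])
        if not messages:
            break  # DLQ drained
        for m in messages:
            # Re-enqueue into the main queue, THEN remove from the DLQ.
            sqs.send_message(QueueUrl=main_url, MessageBody=m["Body"])
            sqs.delete_message(
                QueueUrl=dlq_url, ReceiptHandle=m["ReceiptHandle"]
            )
```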

The Pager Goes Off: Memory Exhaustion and Backpressure 💥

It’s 3 a.m. The pager goes off. The broker is out of memory. Why? The producer was writing at 10,000 messages/second, but the consumer was only reading at 2,000/second—and had been for hours. The gap doesn’t close itself.

Backpressure is the umbrella term for how a slow consumer pushes back on a fast producer. You will reach for three techniques:

  • Bounded Queues (Cap it!): Set a max size. When full, the producer must fail or block. This is loud, fails early, and forces a resolution while you still have time.
  • Autoscale the Consumers: If the queue depth crosses a threshold, spin up more workers. (Works well for stateless consumers).
  • Credit-Based Flow Control: The consumer tells the producer: “I am ready for 5 messages.” The producer sends 5 and stops, waiting for the next request. This is the model behind reactive streams (Project Reactor, Akka).

The takeaway: Every queue has a limit. Either you pick it and plan, or the OS picks it for you by killing the process.
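The first technique fits in a few lines. Here it is in miniature with Python’s standard library; many brokers expose a similar knob (for example, a max queue length):

```python
import queue

q = queue.Queue(maxsize=1000)  # the limit YOU picked, not the one the OS picks

def produce(msg: str) -> None:
    try:
        q.put(msg, timeout=0.05)  # block briefly if the queue is full...
    except queue.Full:
        # ...then fail loudly and early, instead of eating memory for hours.
        raise RuntimeError("queue full: shed load or scale consumers")
```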

Decision Matrix: Kafka, RabbitMQ, or SQS?

These are not three flavors of the same tool. They are three different categories of technology.

RabbitMQ 🐇: The Broker

RabbitMQ’s superpower is Complex Routing 🛣️.

[Image suggestion: A complex transportation hub. An arrow points to an exchange (routing center). A dispatcher is configuring dials, directing messages via exact matches, broadcast patterns, or headers into specific destination queues.]

Messages don’t go to queues directly; they go to an Exchange, and the Exchange decides which queues they belong in based on complex, configurable rules. The broker does all the routing for you. And when a message is acknowledged, it is gone.

Reach for RabbitMQ when: Routing is the interesting part of your problem; you need per-message delivery control; raw throughput isn’t the primary bottleneck.
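A sketch of what “the broker does the routing” means, again with pika (exchange, queue, and key names are illustrative):

```python
import pika

ch = pika.BlockingConnection(pika.ConnectionParameters("localhost")).channel()

# The publisher never names a queue. It names an exchange and a routing key.
ch.exchange_declare(exchange="payments", exchange_type="direct")
ch.queue_declare(queue="high-value")
ch.queue_declare(queue="standard")
ch.queue_bind(queue="high-value", exchange="payments", routing_key="priority")
ch.queue_bind(queue="standard", exchange="payments", routing_key="normal")

# The broker routes this into "high-value" because the key matches that binding.
ch.basic_publish(exchange="payments", routing_key="priority", body=b"charge #991")
```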

Kafka 🪵: The Log

Kafka is fundamentally a Distributed, Append-Only Log 📜.

[Image suggestion: An endless reel of old, unrolling film tape (the log). A timeline shows consumers tracking their position (offsets). One new ‘fraud detection’ consumer is seen actively pulling the tape reel backward to re-read ‘last 30 days of data’ without disturbing other live readers.]

In Kafka, messages are not deleted when consumed; they stay in the log for whatever retention you configure: 7 days, 30 days, or forever. Consumers track their own position (offsets). This means any consumer can rewind history.

Reach for Kafka when: You need stream processing, event sourcing, or “time travel” (replaying history); raw throughput is critical (millions of events/sec). (Kafka now supports queue-style share groups, but the log is still its core reason for being).
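“Time travel” in a sketch, using the kafka-python client (topic, partition, group id, and broker address are all illustrative; process is a hypothetical handler):

```python
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",
    group_id="fraud-detection",
    enable_auto_commit=False,
)
tp = TopicPartition("orders", 0)
consumer.assign([tp])

# Rewind THIS consumer to the start of the retained log. Other consumer
# groups keep their own offsets and never notice.
consumer.seek_to_beginning(tp)

for record in consumer:
    process(record.value)  # re-reading history without disturbing live readers
```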

SQS ☁️: The Managed Queue

SQS is Zero Ops ☕. It is three API calls (Send, Receive, Delete) running inside AWS. Nothing to tune, nothing to patch.

[Image suggestion: A minimalist, clean conveyor belt stretching into a bright cloud (AWS). A small button is labeled ‘Just Send It.’ A single person sits relaxing with coffee, simply pushing ‘Receive.’]

It comes in two flavors: Standard (at least once, best-effort ordering, massive throughput) and FIFO (strict ordering, plus content deduplication within a five-minute window, which is how AWS backs its “exactly-once processing” claim).

Reach for SQS when: You want a queue, not a new operations commitment; you value fast time-to-market over complex features.
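The whole lifecycle really is three calls. A sketch with boto3 (queue name is illustrative; handle is a hypothetical handler):

```python
import boto3

sqs = boto3.client("sqs")
url = sqs.get_queue_url(QueueName="receipts")["QueueUrl"]

sqs.send_message(QueueUrl=url, MessageBody="receipt for order #42")  # 1. Send

resp = sqs.receive_message(QueueUrl=url, MaxNumberOfMessages=1)      # 2. Receive
for m in resp.get("Messages", []):
    handle(m["Body"])
    # 3. Delete: the explicit ACK. Skip it and SQS redelivers the message
    # after the visibility timeout. At-least-once delivery in action.
    sqs.delete_message(QueueUrl=url, ReceiptHandle=m["ReceiptHandle"])
```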

Pick the Simplest Tool

The biggest architectural mistake isn’t picking SQS when you could use Kafka. It’s picking the most massive, operationally heavy tool because you “might” need it in three years.

I have seen systems running full Kafka clusters to handle 40 messages an hour. 40. Every incident on that team took longer because the tool was exponentially more complicated than the problem it was solving.

Pick the simplest tool that meets your real requirements. Use SQS or managed RabbitMQ first. You can migrate to Kafka the day you have a real reason—and on that day, you’ll know.

Hands-on Architecture:

Building an app like Instagram 📸 or Uber 🚗 is a great way to see these tools in action. Since these platforms handle millions of users, they use “polyglot messaging”—using different tools for different jobs.

Let’s look at Uber as an example. When you request a ride, the system has to:

  • Find a Driver: Send the request to nearby drivers (Work Queue).
  • Update Analytics: Track demand in that neighborhood (Event/Log).
  • Notify Billing: Prepare the transaction (Reliable Task).

The Design Challenge

Imagine we are building the “Driver Dispatch” part of the app. 🚕

When a rider hits “Request,” we need to alert the closest 5 drivers. If a driver accepts, the request must disappear for the other 4 drivers immediately. We also need to ensure that even if our “Dispatch Service” crashes, we don’t “lose” the rider’s request.

Given what we’ve discussed, how should we handle the Rider Request?

  • Option A: SQS (Managed Queue). Easy to scale, ensures the request is handled, and handles retries if a driver’s app glitches.
  • Option B: Kafka (Distributed Log). Good for tracking where every driver has been for the last hour, but might be “overkill” for a simple one-to-one dispatch.
  • Option C: RabbitMQ (Message Broker). Excellent if we want to use “Geographic Routing” to send messages only to drivers in a specific “NYC-Brooklyn” exchange.

Which tool would you pick to ensure the request is routed correctly based on location and deleted the moment it’s accepted?

--
RabbitMQ 🐇 is the standout for this specific task because of its sophisticated routing capabilities.

While SQS is great for simple queues, Uber’s dispatching needs are more dynamic. Using RabbitMQ, you can leverage Exchanges 📂 to route rider requests to specific queues based on geographic metadata (like longitude/latitude or neighborhood IDs).

Why RabbitMQ fits the Dispatcher:

Selective Routing: You can create a “Topic Exchange” where the routing key is something like geo.us.nyc.brooklyn. Only drivers subscribed to that specific area will see the request.

Direct Interaction: Once a driver accepts the ride, the “Competing Consumer” pattern ensures that the message is acknowledged and removed from the queue so no one else can take it.

Low Latency: For real-time dispatching where every second counts, RabbitMQ’s push-based model is slightly snappier than the polling required by SQS.
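Here’s the geo-routing idea in a sketch with pika (exchange, queue, and key names are illustrative). One queue per area keeps the Competing Consumer pattern intact: drivers in Brooklyn compete on the same queue, so an accepted request is consumed exactly once.

```python
import pika

ch = pika.BlockingConnection(pika.ConnectionParameters("localhost")).channel()
ch.exchange_declare(exchange="dispatch", exchange_type="topic")

# "#" matches zero or more words, so this binding also catches
# sub-neighborhood keys like geo.us.nyc.brooklyn.dumbo.
ch.queue_declare(queue="area.nyc.brooklyn")
ch.queue_bind(
    queue="area.nyc.brooklyn",
    exchange="dispatch",
    routing_key="geo.us.nyc.brooklyn.#",
)

# The rider request is published once, keyed by its pickup location.
ch.basic_publish(
    exchange="dispatch",
    routing_key="geo.us.nyc.brooklyn",
    body=b"ride-request #555",
)
```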

The Architecture Final Check 🏗️

In a real system like Uber, we wouldn’t just use RabbitMQ. We’d likely use a combination of all three tools we discussed:

RabbitMQ for the “Hot” path: Finding and notifying the driver right now.

Kafka for the “Audit” path: Recording every location update and request for the data science team to analyze later.

SQS for the “Side” path: Sending the email receipt or a push notification after the ride is over—tasks that aren’t time-critical but must happen eventually.
