You're architecting a new system. You need asynchronous communication. Someone yells "Kafka!" Someone else says "RabbitMQ is easier." Another person pipes up with "Just use SQS."
They look similar on the surface. They are not interchangeable. They are not even in the same category. Picking the wrong one (or worse, misunderstanding the fundamental concepts) won't just make your life harder. It will introduce cascading failures that take months to debug and fix.
Before you write a single line of code, there are four conceptual pillars you must master. If you get these wrong, your system will take itself down the first time a downstream service has a slight slowdown.
Let's go.
Pillar 1: The Problem We're Fixing (Decoupling)
Why are we even using queues? We're trying to solve the problem of synchronous coupling.
Imagine your Checkout Service has to call your Inventory Service directly via an API.
Checkout calls Inventory directly. Inventory slows down (maybe it's under high load, maybe the DB is lagging). It doesn't crash; it just takes 5 seconds to respond instead of 50ms.
Because the connection is synchronous, Checkout's threads pile up waiting. The internal thread pool exhausts, Checkout runs out of threads, and now anything depending on Checkout is also down.
This is Cascading Failure. It spreads like wildfire. The real cause isn't the slow Inventory service; it's the synchronous coupling that forces overload to feed on itself through retries, timeouts, and waiting.
Now Watch What Happens With a Queue 🛡️
A queue acts as a buffer between the two services.
When Inventory slows down, messages simply accumulate in the queue. The Checkout service writes its "Order Placed" message and immediately moves on.
This is the whole point of a queue: It decouples the producer from the consumer in three critical dimensions:
Time ⏳: They don't need to run at the same moment.
Availability ✅: One can be down for maintenance without taking the other down.
Speed ⚡️: The fast one isn't held hostage by the slow one.
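A minimal sketch of this decoupling, using Python's standard library (all names like `checkout` and `inventory_worker` are hypothetical): the producer enqueues and returns immediately, even though the consumer is slow.

```python
import queue
import threading
import time

# A buffer between the two services: Checkout writes, Inventory drains.
order_queue = queue.Queue()

def checkout(order_id):
    """Producer: enqueue and return immediately, never waiting on Inventory."""
    order_queue.put({"order_id": order_id})

def inventory_worker():
    """Consumer: drains the queue at its own pace."""
    while True:
        order = order_queue.get()
        if order is None:        # stop sentinel
            break
        time.sleep(0.01)         # simulate a slow downstream call

worker = threading.Thread(target=inventory_worker)
worker.start()

start = time.monotonic()
for i in range(100):
    checkout(i)                  # returns instantly; messages buffer up
enqueue_time = time.monotonic() - start

order_queue.put(None)
worker.join()
print(f"Enqueued 100 orders in {enqueue_time:.3f}s despite the slow consumer")
```

The producer's latency is independent of the consumer's: 100 enqueues complete in milliseconds while the worker takes roughly a second to drain them.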
Pillar 2: Work vs. Events (Point-to-Point vs. Pub/Sub)
A team shipped a feature on a Friday. By Monday morning, every single user was getting 50 duplicate welcome emails. 📧🤯
The root cause was one simple conceptual mistake: they put a "Work Job" (Send This Email) onto a "Topic" (a fan-out mechanism). They had 50 worker processes. The topic did its job and fanned out one copy of the message to all 50 workers. One question asked in review would have saved their weekend:
"Should this message be handled ONCE, or should MULTIPLE services REACT to it?"
Handled Once (The Work Queue) 👷‍♀️
This is Point-to-Point. It's a "Job" or "Work Queue." Examples include: resize this image, charge this credit card, send this one email. Any worker in the pool can grab it, but only one must process it. The message disappears once handled.
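Competing consumers can be sketched in a few lines (worker names and counts are illustrative): many workers race for jobs, but each job is pulled off the queue by exactly one of them.

```python
import queue
import threading

jobs = queue.Queue()
results = []
results_lock = threading.Lock()

def worker(worker_id):
    # Each worker competes for jobs; a job taken by one never reaches another.
    while True:
        job = jobs.get()
        if job is None:        # stop sentinel
            break
        with results_lock:
            results.append((worker_id, job))

for job_id in range(20):
    jobs.put(job_id)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for _ in threads:
    jobs.put(None)             # one stop sentinel per worker
for t in threads:
    t.join()

processed = sorted(job for _, job in results)
print(processed)               # every job appears exactly once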
Reacted to by Many (The Event) 📣
This is Pub/Sub (Publisher/Subscriber), often called Topics.
An Event is a statement of fact: An order got placed. The Email Service cares (needs to send a receipt). The Shipping Service cares (needs to label a box). The Analytics Service cares (needs to update dashboards). Three independent, parallel reactions to the same fact. Each gets its own copy of the message.
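The fan-out can be sketched with a toy `Topic` class (hypothetical, not any broker's real API): one published event, one independent copy per subscriber.

```python
# A toy pub/sub topic: publish once, every subscriber reacts independently.
class Topic:
    def __init__(self):
        self.subscribers = []

    def subscribe(self, handler):
        self.subscribers.append(handler)

    def publish(self, event):
        for handler in self.subscribers:   # fan out to ALL subscribers
            handler(dict(event))           # each gets its own copy

order_placed = Topic()
receipts, labels, dashboards = [], [], []

order_placed.subscribe(lambda e: receipts.append(e["order_id"]))    # Email
order_placed.subscribe(lambda e: labels.append(e["order_id"]))      # Shipping
order_placed.subscribe(lambda e: dashboards.append(e["order_id"]))  # Analytics

order_placed.publish({"order_id": 42})
print(receipts, labels, dashboards)   # one copy each: [42] [42] [42]
```

Had the unlucky team above subscribed 50 email workers to a topic like this, each publish would have produced 50 sends, which is exactly the bug they shipped.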
The tool the team used (perhaps RabbitMQ or Kafka) wasn't broken. Their semantic understanding of the message intent was wrong. Don't argue about words ("Kafka Queue" vs. "RabbitMQ Topic"). Ask what actually happens to a message when it arrives.
Pillar 3: Delivery Guarantees (The Exactly Once Myth)
Youâll see three delivery guarantees advertised. One is, quite literally, impossible in general distributed systems.
At Most Once: Fire and forget. We send the message and donât check. If it drops, it drops. Fine for metrics where one lost data point doesnât matter.
At Least Once (The Real Default): The producer retries until it gets an acknowledgement (ACK) from the broker. If the ACK gets lost, the message gets resent. You will not lose data, but you will get duplicates. This is what most systems use by default.
Exactly Once: This is where vendors get creative with marketing.
The "Exactly Once" Reality 🛡️
Exactly-once delivery across an unreliable network is not achievable. It's tied to a result known as the Two Generals Problem. Picture it: a producer sends a message. The broker gets it and sends back an ACK. The ACK vanishes. Now the producer has two choices: retry (and maybe cause a duplicate) or give up (and maybe lose data). The network never tells you which scenario occurred.
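A deterministic toy simulation of that bind (all names hypothetical): the broker receives the message, but the first ACK is lost, so an at-least-once producer retries and the broker ends up with a duplicate.

```python
broker_inbox = []
ack_lost_outcomes = iter([True, False])   # first ACK vanishes, second arrives

def send_to_broker(message):
    broker_inbox.append(message)          # the broker DID receive it...
    return not next(ack_lost_outcomes)    # ...but the ACK may be lost in transit

def produce_at_least_once(message):
    while not send_to_broker(message):    # retry until an ACK comes back
        pass

produce_at_least_once({"order_id": 1})
print(len(broker_inbox))                  # 2: the retry created a duplicate
```

From the producer's side, both attempts looked identical; it had no way to know the first one actually landed.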
When a system advertises "Exactly Once Semantics," what's actually happening is At Least Once Delivery PLUS either:
Idempotent Processing: The consumer is smart enough to handle duplicates.
Transactional Writes: The write to the final storage (DB) is part of a distributed transaction.
The distinction to keep in your head: exactly-once DELIVERY over a network is impossible; exactly-once EFFECT is achievable, but only through deliberate work on the consumer side.
🚨 Actionable Takeaway: Build Idempotent Consumers 🛡️
My advice: assume "At Least Once" and build idempotent consumers. Every message handler must check "Have I seen this unique event ID before?" and skip if it has. This single pattern prevents the standard nightmare bugs: double charges, duplicate emails, and inventory drift.
If you do one thing after this post, check your handlers. If they aren't idempotent, make them idempotent.
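A minimal idempotent-handler sketch (names like `handle_payment` are illustrative; in production the seen-ID set lives in durable storage and is updated atomically with the side effect):

```python
processed_ids = set()   # in production: a durable store, not process memory
charges = []

def handle_payment(event):
    if event["event_id"] in processed_ids:
        return                       # duplicate delivery: skip silently
    processed_ids.add(event["event_id"])
    charges.append(event["amount"])  # the side effect happens exactly once

event = {"event_id": "evt-123", "amount": 999}
handle_payment(event)
handle_payment(event)                # at-least-once redelivery of the same message
print(charges)                       # [999]: charged once despite two deliveries
```

The broker delivered twice; the effect happened once. That is the "exactly-once effect" from Pillar 3, implemented in the consumer rather than promised by the network.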
Pillar 4: When Things Fail (DLQs and Backpressure)
The Poison Message and the DLQ 🤮
Your consumer receives one malformed message. The code crashes. The message goes back to the queue, is retried immediately, crashes the consumer again... This "retry storm" consumes all CPU and blocks every message behind it.
The fix is the Dead Letter Queue (DLQ). After X failed attempts, the poison message is moved to a separate holding area, allowing the main pipeline to resume.
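A sketch of the retry-then-park logic (the attempt counter on the message and `MAX_ATTEMPTS = 3` are illustrative choices; real brokers track delivery counts for you):

```python
import queue

MAX_ATTEMPTS = 3
main_queue = queue.Queue()
dead_letter_queue = []

def handler(message):
    if message.get("malformed"):
        raise ValueError("cannot parse message")

def consume_one():
    message = main_queue.get()
    try:
        handler(message)
    except Exception:
        message["attempts"] = message.get("attempts", 0) + 1
        if message["attempts"] >= MAX_ATTEMPTS:
            dead_letter_queue.append(message)   # give up: park the poison message
        else:
            main_queue.put(message)             # requeue for another try

main_queue.put({"id": "poison-1", "malformed": True})
while not main_queue.empty():
    consume_one()

print(len(dead_letter_queue), dead_letter_queue[0]["attempts"])   # 1 3
```

After three failures the message lands in the DLQ and the main queue drains normally, instead of looping forever on the same crash.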
[Image suggestion: A standard, orderly queue line on one side. Below it, a large, dark graveyard pit with skeletons of messages.]
🚨 Crucial Check: A DLQ Without a Replay Path Is a Graveyard 🪦
Most teams set up the DLQ, celebrate, and go home. Messages land in it, they fix the bug, and then... nothing. You must build tooling to replay messages back into the main queue. Without a replay path, your DLQ is just where problems go to be forgotten.
The Pager Goes Off: Memory Exhaustion and Backpressure 🔥
It's 3 a.m. The pager goes off. The broker is out of memory. Why? The producer was writing at 10,000 messages/second, but the consumer was only reading at 2,000/second, and had been for hours. The gap doesn't close itself.
Backpressure is the umbrella term for how a slow consumer pushes back on a fast producer. You will reach for three techniques:
- Bounded Queues (Cap it!): Set a max size. When full, the producer must fail or block. This is loud, fails early, and forces a resolution while you still have time.
- Autoscale the Consumers: If the queue depth crosses a threshold, spin up more workers. (Works well for stateless consumers).
- Credit-Based Flow Control: The consumer tells the producer: "I am ready for 5 messages." The producer sends 5 and stops, waiting for the next request. This is the model behind reactive streams (Project Reactor, Akka).
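The first technique, a bounded queue, fits in a few lines using the standard library (the sizes are arbitrary): when the buffer is full, the enqueue fails loudly instead of letting memory grow without bound.

```python
import queue

# Backpressure via a bounded buffer: put_nowait raises when the queue is full.
bounded = queue.Queue(maxsize=5)
accepted, rejected = 0, 0

for i in range(8):                 # producer is faster than the (absent) consumer
    try:
        bounded.put_nowait(i)
        accepted += 1
    except queue.Full:
        rejected += 1              # loud, early failure the producer must handle

print(accepted, rejected)          # 5 accepted, 3 rejected
```

The rejection is the feature: the producer finds out about the overload at write time, while you still have options, rather than at 3 a.m. when the broker dies.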
The takeaway: Every queue has a limit. Either you pick it and plan, or the OS picks it for you by killing the process.
Decision Matrix: Kafka, RabbitMQ, or SQS?
These are not three flavors of the same tool. They are three different categories of technology.
RabbitMQ 🐇: The Broker
RabbitMQ's superpower is Complex Routing 🛣️.
[Image suggestion: A complex transportation hub. An arrow points to an exchange (routing center). A dispatcher is configuring dials, directing messages via exact matches, broadcast patterns, or headers into specific destination queues.]
Messages don't go to queues directly; they go to an Exchange, and the Exchange decides which queues they belong in based on complex, configurable rules. The broker does all the routing for you. And when a message is acknowledged, it is gone.
Reach for RabbitMQ when: routing is the interesting part of your problem; you need per-message delivery control; raw throughput isn't the primary bottleneck.
Kafka 🪵: The Log
Kafka is fundamentally a Distributed, Append-Only Log 📜.
[Image suggestion: An endless reel of old, unrolling film tape (the log). A timeline shows consumers tracking their position (offsets). One new âfraud detectionâ consumer is seen actively pulling the tape reel backward to re-read âlast 30 days of dataâ without disturbing other live readers.]
In Kafka, consumed messages stay in the log for 7 days, 30 days, or forever. Consumers track their own position (offsets). This means any consumer can rewind history.
Reach for Kafka when: you need stream processing, event sourcing, or "time travel" (replaying history); raw throughput is critical (millions of events/sec). (Kafka now supports queue-style share groups, but the log is still its core reason for being.)
SQS ☁️: The Managed Queue
SQS is Zero Ops ✅. It is three API calls (Send, Receive, Delete) running inside AWS. Nothing to tune, nothing to patch.
[Image suggestion: A minimalist, clean conveyor belt stretching into a bright cloud (AWS). A small button is labeled âJust Send It.â A single person sits relaxing with coffee, simply pushing âReceive.â]
It comes in two flavors: Standard (at least once, best effort ordering, massive throughput) and FIFO (strict ordering, at least once).
Reach for SQS when: You want a queue, not a new operations commitment; you value fast time-to-market over complex features.
Pick the Simplest Tool
The biggest architectural mistake isn't picking SQS when you could use Kafka. It's picking the most massive, operations-heavy tool because you "might" need it in three years.
I have seen systems running full Kafka clusters to handle 40 messages an hour. 40. Every incident on that team took longer because the tool was exponentially more complicated than the problem it was solving.
Pick the simplest tool that meets your real requirements. Use SQS or managed RabbitMQ first. You can migrate to Kafka the day you have a real reason, and on that day, you'll know.
Hands-on Architecture:
Building an app like Instagram 📸 or Uber 🚗 is a great way to see these tools in action. Since these platforms handle millions of users, they use "polyglot messaging": different tools for different jobs.
Let's look at Uber as an example. When you request a ride, the system has to:
- Find a Driver: Send the request to nearby drivers (Work Queue).
- Update Analytics: Track demand in that neighborhood (Event/Log).
- Notify Billing: Prepare the transaction (Reliable Task).
The Design Challenge
Imagine we are building the "Driver Dispatch" part of the app.
When a rider hits "Request," we need to alert the closest 5 drivers. If a driver accepts, the request must disappear for the other 4 drivers immediately. We also need to ensure that even if our "Dispatch Service" crashes, we don't lose the rider's request.
Given what we've discussed, how should we handle the Rider Request?
- Option A: SQS (Managed Queue). Easy to scale, ensures the request is handled, and handles retries if a driver's app glitches.
- Option B: Kafka (Distributed Log). Good for tracking where every driver has been for the last hour, but likely overkill for a simple one-to-one dispatch.
- Option C: RabbitMQ (Message Broker). Excellent if we want to use "Geographic Routing" to send messages only to drivers subscribed to a specific "NYC-Brooklyn" routing key.
Which tool would you pick to ensure the request is routed correctly based on location and deleted the moment it's accepted?
--
RabbitMQ 🐇 is the standout for this specific task because of its sophisticated routing capabilities.
While SQS is great for simple queues, Uber's dispatching needs are more dynamic. Using RabbitMQ, you can leverage Exchanges to route rider requests to specific queues based on geographic metadata (like longitude/latitude or neighborhood IDs).
Why RabbitMQ fits the Dispatcher:
Selective Routing: You can create a "Topic Exchange" where the routing key is something like geo.us.nyc.brooklyn. Only drivers subscribed to that specific area will see the request.
Direct Interaction: Once a driver accepts the ride, the "Competing Consumer" pattern ensures the message is acknowledged and removed from the queue so no one else can take it.
Low Latency: For real-time dispatching where every second counts, RabbitMQ's push-based model is slightly snappier than the polling required by SQS.
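To make the routing-key idea concrete, here is a small pure-Python matcher for RabbitMQ's topic wildcard semantics (`*` matches exactly one dot-separated word, `#` matches zero or more). This only illustrates the semantics; a real deployment would bind queues to a Topic Exchange through a client library such as pika rather than matching keys by hand.

```python
def topic_match(binding, routing_key):
    """RabbitMQ-style topic match: '*' = exactly one word, '#' = zero or more."""
    def match(pattern, words):
        if not pattern:
            return not words
        head, rest = pattern[0], pattern[1:]
        if head == "#":
            # '#' can swallow zero or more words
            return any(match(rest, words[i:]) for i in range(len(words) + 1))
        if not words:
            return False
        return (head == "*" or head == words[0]) and match(rest, words[1:])
    return match(binding.split("."), routing_key.split("."))

# Drivers bound to Brooklyn see only Brooklyn requests:
print(topic_match("geo.us.nyc.brooklyn", "geo.us.nyc.brooklyn"))  # True
print(topic_match("geo.us.nyc.*", "geo.us.nyc.brooklyn"))         # True
print(topic_match("geo.us.#", "geo.us.nyc.brooklyn"))             # True
print(topic_match("geo.us.nyc.queens", "geo.us.nyc.brooklyn"))    # False
```

A queue bound with `geo.us.#` would receive every US request, while `geo.us.nyc.brooklyn` receives only that neighborhood, which is exactly the selective routing the dispatcher needs.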
The Architecture Final Check 🏗️
In a real system like Uber, we wouldn't just use RabbitMQ. We'd likely use a combination of all three tools we discussed:
RabbitMQ for the "Hot" path: finding and notifying the driver right now.
Kafka for the "Audit" path: recording every location update and request for the data science team to analyze later.
SQS for the "Side" path: sending the email receipt or a push notification after the ride is over. These tasks aren't time-critical but must happen eventually.