Message Queues Explained: Why LinkedIn Built Kafka and Changed Async Communication Forever
Covers: Point-to-Point vs Pub-Sub, Kafka Internals, Delivery Guarantees, Dead Letter Queues, Backpressure
The Upload That Broke Everything
Around 2010, LinkedIn's activity feed was choking. Every time a user updated their profile, viewed a connection, or clicked an article, the system needed to:
- Update the activity feed
- Recalculate recommendations
- Notify relevant connections
- Update search indexes
- Log the event for analytics
All synchronously. All in the same request. All blocking the user from getting a response until every downstream system confirmed success.
When traffic spiked, the whole chain collapsed. One slow downstream system stalled every user action behind it.
The engineering team asked a radical question: "Does the user really need to wait for all of this?"
The answer was no. The user needed to know their profile was saved. Everything else — feed updates, recommendations, notifications — could happen a few seconds later without the user caring.
That insight led to the creation of Apache Kafka. And it fundamentally changed how large-scale systems handle communication between services.
What Is a Message Queue?
A message queue is a component that allows services to communicate asynchronously — one service produces a message, the queue stores it, and another service consumes it later.
Without queue (synchronous):
[User Action] → Service A → Service B → Service C → Service D → Response
↑ if any service is slow, user waits
With queue (asynchronous):
[User Action] → Service A → [Queue] → Response (immediate)
                              ↓
            Service B   Service C   Service D
            (each processes later, independently)
The user gets an instant response. The downstream work happens in the background, decoupled from the user's request lifecycle.
This unlocks three fundamental capabilities:
Decoupling: Producer doesn't know or care who consumes its messages. Add a new consumer without changing the producer at all.
Load leveling: Traffic spikes fill the queue rather than overwhelming consumers. Consumers process at their own pace.
Resilience: If a consumer is down, messages accumulate in the queue and are processed when it recovers. As long as the queue stores messages durably, nothing is lost.
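To make the hand-off concrete, here is a minimal sketch that uses Python's standard-library `queue.Queue` as a stand-in for a real broker. The names `handle_profile_update` and `worker` are hypothetical, and an in-process queue is not durable, so treat this as an illustration of the flow rather than a production design.

```python
import queue
import threading

events = queue.Queue()          # in-process stand-in for a message broker
processed = []                  # records what the background worker handled


def handle_profile_update(user_id):
    """Enqueue the follow-up work and return immediately."""
    events.put({"type": "profile_updated", "user_id": user_id})
    return "saved"              # the user does not wait for downstream work


def worker():
    """Consume events until a None sentinel arrives."""
    while True:
        event = events.get()
        if event is None:
            break
        processed.append(event["user_id"])   # e.g. update feed, reindex search


thread = threading.Thread(target=worker)
thread.start()
responses = [handle_profile_update(uid) for uid in (1, 2, 3)]
events.put(None)                # tell the worker to stop
thread.join()
```

Every call returns `"saved"` right away, while the worker drains the queue at its own pace.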
Two Fundamental Models
Point-to-Point (Queue Model)
Each message is consumed by exactly one consumer. Once consumed, it's gone.
Producer → [Queue] → Consumer A
← (message removed after consumption)
If you have multiple consumers, each message goes to only one of them (competing consumers pattern):
Producer → [Queue] → Consumer A (takes message 1)
→ Consumer B (takes message 2)
→ Consumer C (takes message 3)
Use case: Task queues. You want each job done exactly once. Order processing, email sending, image resizing — each task should be handled by one worker, not three.
RabbitMQ is the canonical point-to-point queue. When you push a job to a RabbitMQ queue, exactly one worker picks it up and processes it.
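The competing consumers pattern can be sketched with the standard library alone. This is an in-process illustration, not RabbitMQ code: `jobs`, `handled`, and the worker names are hypothetical.

```python
import queue
import threading

jobs = queue.Queue()
handled = {}                    # job id -> list of workers that took it
lock = threading.Lock()


def worker(name):
    """Take jobs until the queue is empty; each job goes to one worker only."""
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            return
        with lock:
            handled.setdefault(job, []).append(name)


for job_id in range(9):
    jobs.put(job_id)

workers = [threading.Thread(target=worker, args=(f"worker-{i}",)) for i in range(3)]
for t in workers:
    t.start()
for t in workers:
    t.join()
```

Which worker gets which job varies from run to run, but every job is taken by exactly one of them.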
Publish-Subscribe (Pub-Sub Model)
Each message is delivered to all subscribers. Producers publish to a topic; every subscriber to that topic gets every message.
Producer publishes "user.signup" event
↓
[Topic: user.signup]
├──► Email Service (sends welcome email)
├──► Analytics Service (records signup event)
├──► Recommendations (initializes user model)
└──► Notification Service (sends push notification)
All four consumers get the same message. Adding a fifth consumer (say, a fraud detection service) requires zero changes to the producer or existing consumers.
Use case: Event broadcasting. One thing happened; many systems need to know. This is the architecture behind every event-driven system.
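A toy in-memory version of the fan-out above might look like the sketch below. The `Topic` class is hypothetical and delivers synchronously; a real broker would persist messages and deliver them asynchronously.

```python
from collections import defaultdict


class Topic:
    """In-memory topic that delivers every message to every subscriber."""

    def __init__(self, name):
        self.name = name
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def publish(self, message):
        for callback in self.subscribers:
            callback(message)


received = defaultdict(list)
signup = Topic("user.signup")
for service in ("email", "analytics", "recommendations", "notifications"):
    # Bind the service name as a default argument so each callback keeps its own.
    signup.subscribe(lambda msg, s=service: received[s].append(msg))

signup.publish({"user_id": 42})
```

Adding a fifth subscriber is one more `subscribe` call; the publisher code does not change.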
Apache Kafka is the dominant pub-sub system. Let's go deep on how it works.
Kafka Internals: The Engine Behind Modern Data Pipelines
Kafka is not just a message queue — it's a distributed commit log. Understanding its internals is what separates senior engineers from those who just know the vocabulary.
Topics and Partitions
A topic is a named stream of messages (e.g., user-signups, order-placed, payment-completed).
A topic is divided into partitions: ordered, append-only sequences of messages. Partitions are spread across brokers (servers), and a single broker can host more than one:
Topic: "order-placed" (4 partitions across 3 brokers)
Broker 1: [Partition 0] msg0, msg4, msg8...
Broker 2: [Partition 1] msg1, msg5, msg9...
Broker 3: [Partition 2] msg2, msg6, msg10...
Broker 1: [Partition 3] msg3, msg7, msg11...
Partitions enable parallel processing: multiple consumers can read from different partitions simultaneously, so throughput scales roughly with partition count until brokers, network, or consumers become the bottleneck.
Partition key: When producing a message, you can specify a key. Messages with the same key go to the same partition, as long as the topic's partition count does not change:
producer.send(
    topic="order-placed",
    key=b"user_12345",  # All orders from user 12345 → same partition
    value=order_json
)
This guarantees ordering per key — all events for a given user are processed in sequence. Critical for correctness (you don't want "order cancelled" processed before "order placed").
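The key-to-partition mapping boils down to a stable hash modulo the partition count. The sketch below uses CRC32 purely for simplicity (Kafka's Java client uses a murmur2 hash instead), and `choose_partition` is a hypothetical helper, not a Kafka API.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (CRC32 is an assumption here)."""
    return zlib.crc32(key) % num_partitions


partitions = {p: [] for p in range(4)}
events = [
    (b"user_12345", "order placed"),
    (b"user_67890", "order placed"),
    (b"user_12345", "order cancelled"),
]
for key, value in events:
    partitions[choose_partition(key, 4)].append((key, value))

user_partition = choose_partition(b"user_12345", 4)
user_events = [v for k, v in partitions[user_partition] if k == b"user_12345"]
```

Both events for `user_12345` land in the same partition in the order they were produced. Note that changing `num_partitions` changes the result of the modulo, which is why resizing a topic can move keys to different partitions.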
Offsets: The Bookmark System
Every message in a partition has a sequential offset — an integer that uniquely identifies its position.
Partition 0: [offset 0][offset 1][offset 2][offset 3][offset 4]...
              msg_A     msg_B     msg_C     msg_D     msg_E
Consumers track their offset — which message they've processed up to. This offset is stored in Kafka itself (in a special __consumer_offsets topic).
The replay superpower: Because messages are persisted on disk (not deleted after consumption), consumers can:
- Rewind to any retained offset and reprocess historical messages
- Bootstrap a new service by reading every event still inside the retention window
- Replay the last 24 hours of messages through the fixed code after a bug fix
This is something classic RabbitMQ queues cannot do, because messages are deleted once they are acknowledged. Kafka's log retention (configurable, default 7 days) makes it a time machine.
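The offset model is easy to mimic with a list. In this hypothetical `PartitionLog` sketch, reading never removes anything, so a consumer's position is just an integer it can move backwards.

```python
class PartitionLog:
    """Append-only log: reading never deletes, so any offset can be re-read."""

    def __init__(self):
        self._messages = []

    def append(self, message):
        self._messages.append(message)
        return len(self._messages) - 1      # offset of the new message

    def read(self, offset, max_messages=10):
        return self._messages[offset:offset + max_messages]


log = PartitionLog()
for msg in ("msg_A", "msg_B", "msg_C", "msg_D", "msg_E"):
    log.append(msg)

committed_offset = 0
batch = log.read(committed_offset, max_messages=3)
committed_offset += len(batch)              # offsets 0..2 are now processed

remaining = log.read(committed_offset)      # resume where the consumer left off
replayed = log.read(0)                      # rewind: reprocess from the start
```

A real log also enforces retention, so very old offsets eventually stop being readable; this sketch keeps everything.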
Consumer Groups
A consumer group is a set of consumers that collectively process a topic's partitions. Each partition is assigned to exactly one consumer in the group:
Topic: "order-placed" (4 partitions)
Consumer Group: "payment-service" (2 consumers)
Consumer 1 → Partition 0, Partition 1
Consumer 2 → Partition 2, Partition 3
Scaling: Add more consumers to the group → each handles fewer partitions → higher throughput. You can scale up to as many consumers as there are partitions; any consumers beyond that sit idle.
Multiple groups, same topic: Different services can each have their own consumer group, all reading the same topic independently:
Topic: "order-placed"
├── Consumer Group "payment-service" → processes all orders
├── Consumer Group "inventory-service" → processes all orders
└── Consumer Group "analytics-service" → processes all orders
All three groups get every message. None of them interfere with each other. This is the pub-sub model in action.
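Partition assignment within one group can be approximated as splitting a list into contiguous ranges. `assign_partitions` below is a hypothetical, simplified stand-in for a group assignor; it ignores rebalancing, which real clients also have to handle.

```python
def assign_partitions(partitions, consumers):
    """Spread partitions over consumers in contiguous ranges."""
    assignment = {}
    base, extra = divmod(len(partitions), len(consumers))
    start = 0
    for i, consumer in enumerate(consumers):
        count = base + (1 if i < extra else 0)   # first `extra` consumers get one more
        assignment[consumer] = partitions[start:start + count]
        start += count
    return assignment


two = assign_partitions([0, 1, 2, 3], ["consumer-1", "consumer-2"])
five = assign_partitions([0, 1, 2, 3], [f"consumer-{i}" for i in range(1, 6)])
```

With two consumers each one gets two partitions, matching the diagram above. With five consumers and four partitions, the fifth consumer ends up with an empty list.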
Delivery Guarantees: The Triangle of Trust
Every messaging system makes promises about delivery. Understanding these promises is critical for system design.
At-Most-Once
Message is delivered zero or one times. The consumer acknowledges the message (or commits its offset) before processing it, so if it crashes mid-processing, the message is lost.
Producer → Queue → Consumer ACKs, then starts processing
→ Consumer crashes mid-processing
→ Message NOT retried
→ Message lost forever
When to use: Metrics, logs, analytics events — where losing occasional messages is acceptable and duplicates are worse than losses. Very high throughput. Very low overhead.
At-Least-Once
Message is delivered one or more times. On failure, it's retried. Duplicates are possible.
Producer → Queue → Consumer processes message
→ Consumer sends ACK
→ Network drops the ACK
→ Queue doesn't receive ACK
→ Queue retries message
→ Consumer processes message AGAIN (duplicate!)
When to use: Most production systems. The standard default. Your consumers must be idempotent — processing the same message twice produces the same result as processing it once.
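# Idempotent consumer example (db and charge_card are placeholders
# for your datastore client and payment gateway call):
def process_payment(payment_id, amount):
    key = f"processed_payment:{payment_id}"
    # Check if already processed (idempotency key)
    if db.exists(key):
        return  # Already done, skip safely
    # Process payment, then record it. A crash between these two steps
    # would still allow a duplicate charge on retry, so also pass
    # payment_id to the gateway as its idempotency key where supported.
    charge_card(amount)
    db.set(key, True)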
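The lost-ACK scenario can be simulated end to end without a broker. In this sketch, `deliver_at_least_once` is a hypothetical stand-in for broker redelivery, and an in-memory set plays the role of the durable "already processed" store.

```python
processed_ids = set()    # stand-in for a durable "already processed" store
charges = []             # side effects actually performed


def process_payment(payment_id, amount):
    """Idempotent handler: a repeated payment_id is skipped."""
    if payment_id in processed_ids:
        return
    charges.append((payment_id, amount))
    processed_ids.add(payment_id)


def deliver_at_least_once(message, handler, ack_lost_on_attempts=(1,), max_attempts=5):
    """Redeliver until an ACK gets through; return the number of deliveries."""
    for attempt in range(1, max_attempts + 1):
        handler(**message)
        if attempt not in ack_lost_on_attempts:   # ACK reached the broker
            return attempt
    return max_attempts


deliveries = deliver_at_least_once(
    {"payment_id": "p-1", "amount": 50}, process_payment
)
```

The message is delivered twice because the first ACK is dropped, yet only one charge is recorded.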
Exactly-Once
Message is delivered and processed exactly once, even in the face of failures. No duplicates, no losses.
The hardest guarantee. Kafka achieves it through:
- Idempotent producers — Kafka deduplicates producer retries using sequence numbers
- Transactional API — write to multiple partitions atomically
- Transactional consumers: consumed offsets are committed in the same transaction as the output writes, and downstream consumers read only committed messages
# Assumes a kafka-python release with transactional producer support;
# payment_data and audit_data are bytes payloads defined elsewhere.
from kafka import KafkaProducer

producer = KafkaProducer(
    enable_idempotence=True,  # Dedup producer retries
    transactional_id="payment-producer-1"
)
producer.init_transactions()
producer.begin_transaction()
try:
    producer.send("payments", payment_data)
    producer.send("audit-log", audit_data)
    producer.commit_transaction()  # Both or neither
except Exception:
    producer.abort_transaction()
When to use: Financial transactions, payment processing, inventory deduction — anywhere duplicates cause real harm (double-charging a customer, overselling stock).
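The cost: Lower throughput than at-least-once. More complex implementation. The guarantee also covers only reads and writes inside Kafka; a side effect in an external system (such as the card charge itself) still needs an idempotency key.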
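The "both or neither" outcome of a transaction can be illustrated with a toy model. `ToyTransactionalProducer` is hypothetical and only mimics the visible result (staged writes appear on commit and vanish on abort); it says nothing about how Kafka implements transactions internally.

```python
class ToyTransactionalProducer:
    """Stages writes per transaction; readers only ever see committed data."""

    def __init__(self):
        self.topics = {}        # committed messages, topic -> list
        self._staged = None     # writes of the open transaction, if any

    def begin_transaction(self):
        self._staged = []

    def send(self, topic, value):
        if self._staged is None:
            raise RuntimeError("send() called outside a transaction")
        self._staged.append((topic, value))

    def commit_transaction(self):
        for topic, value in self._staged:
            self.topics.setdefault(topic, []).append(value)
        self._staged = None

    def abort_transaction(self):
        self._staged = None     # discard everything staged


producer = ToyTransactionalProducer()

producer.begin_transaction()
producer.send("payments", "pay-1")
producer.send("audit-log", "audit-1")
producer.commit_transaction()           # both writes become visible

producer.begin_transaction()
producer.send("payments", "pay-2")
producer.abort_transaction()            # the staged write is discarded
```

After the abort, `pay-2` never shows up in `producer.topics`, which is the property the payment and audit-log example relies on.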
Dead Letter Queue: The Safety Net
What happens to messages that consistently fail processing? Without a safety net, they can block the queue forever (a "poison pill" message).
A Dead Letter Queue (DLQ) is a special queue where failed messages are sent after N retry attempts:
Normal Queue → Consumer fails to process message
→ Retry (attempt 2)
→ Retry (attempt 3) ← max retries reached
→ Move to DLQ
DLQ: message sits here for manual inspection or automated alert
Why this matters:
- Main queue keeps flowing — the poison pill doesn't block other messages
- Failed messages aren't lost — they're in the DLQ for investigation
- Engineers get alerted to DLQ growth → investigate the root cause
- After fixing the bug, messages can be replayed from DLQ back to the main queue
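The retry-then-park flow fits in a few lines. This is a minimal in-memory sketch: `consume_with_dlq` and the `"poison"` message are hypothetical, and a real consumer would usually add a delay between retries.

```python
from collections import deque


def consume_with_dlq(messages, handler, max_attempts=3):
    """Process messages; park any that fail max_attempts times."""
    main_queue = deque((msg, 0) for msg in messages)
    done, dead_letters = [], []
    while main_queue:
        msg, attempts = main_queue.popleft()
        try:
            handler(msg)
            done.append(msg)
        except Exception:
            attempts += 1
            if attempts >= max_attempts:
                dead_letters.append(msg)              # give up: move to DLQ
            else:
                main_queue.append((msg, attempts))    # requeue for a retry
    return done, dead_letters


def handler(msg):
    if msg == "poison":
        raise ValueError("cannot parse message")


done, dead_letters = consume_with_dlq(["a", "poison", "b"], handler)
```

The poison message fails three times and ends up in `dead_letters`, while `"a"` and `"b"` are processed normally.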
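AWS SQS DLQ config (the source queue's RedrivePolicy attribute):
{
  "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123:my-dlq",
  "maxReceiveCount": 3
}
After 3 failed receives, the message moves to my-dlq. A CloudWatch alarm on the DLQ's depth can notify engineers. Messages stay in the DLQ for its retention period, which can be set as high as 14 days.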
Backpressure: When Consumers Can't Keep Up
If producers emit 10,000 messages/second but consumers can only process 1,000/second, the queue grows indefinitely. Eventually: out-of-memory, disk full, system collapse.
Backpressure is the mechanism by which a system signals upstream components to slow down.
Pull-based consumption (Kafka's model):
Consumers pull messages at their own pace. They never receive more than they can handle. If a consumer is slow, it simply reads fewer messages — the queue absorbs the backlog.
Fast producer → [Kafka topic, growing backlog]
Slow consumer → pulls 100 messages at a time, processes, pulls next 100
→ Consumer naturally throttles itself
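A toy version of pull-based consumption: the consumer asks for at most `max_records` per poll and tracks its own offset. `ToyTopic` is a hypothetical in-memory stand-in, loosely similar in spirit to capping batch size with Kafka's `max.poll.records` consumer setting.

```python
class ToyTopic:
    """Log that hands out messages only when a consumer asks for them."""

    def __init__(self):
        self._log = []

    def produce(self, message):
        self._log.append(message)

    def poll(self, offset, max_records):
        """Return up to max_records messages starting at offset."""
        return self._log[offset:offset + max_records]

    def lag(self, offset):
        """Number of messages the consumer has not read yet."""
        return len(self._log) - offset


topic = ToyTopic()
for i in range(250):            # fast producer builds a backlog
    topic.produce(i)

offset = 0
batch_sizes = []
while topic.lag(offset) > 0:
    batch = topic.poll(offset, max_records=100)   # consumer sets its own pace
    batch_sizes.append(len(batch))
    offset += len(batch)
```

The consumer never holds more than 100 messages at a time; the backlog simply waits in the log.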
Push-based queues (RabbitMQ):
The broker pushes messages to consumers. prefetch count limits how many unacknowledged messages a consumer can hold:
channel.basic_qos(prefetch_count=10)
# Consumer receives at most 10 messages before it must ACK some
# Prevents overwhelming a slow consumer
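The effect of a prefetch limit can be modeled as a window of unacknowledged messages. `ToyChannel` below is a hypothetical simulation of that behavior, not the pika or RabbitMQ API.

```python
class ToyChannel:
    """Pushes messages to a consumer but caps the number left unacknowledged."""

    def __init__(self, messages, prefetch_count):
        self._pending = list(messages)
        self._prefetch_count = prefetch_count
        self.unacked = []

    def push(self):
        """Deliver messages until the prefetch window is full."""
        delivered = []
        while self._pending and len(self.unacked) < self._prefetch_count:
            msg = self._pending.pop(0)
            self.unacked.append(msg)
            delivered.append(msg)
        return delivered

    def ack(self, msg):
        self.unacked.remove(msg)


channel = ToyChannel(range(25), prefetch_count=10)
first = channel.push()          # window fills: 10 messages delivered
stalled = channel.push()        # nothing more until the consumer ACKs
for msg in first[:3]:
    channel.ack(msg)
after_acks = channel.push()     # 3 ACKs free 3 slots
```

The broker stops pushing once 10 messages are outstanding and resumes only as ACKs arrive.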
Application-level backpressure:
When the queue depth exceeds a threshold, producers are asked to slow down:
Queue depth > 1 million messages → alert producer service
Producer service → reduce emission rate by 50%
Queue depth decreasing → producer returns to full rate
Streaming systems such as Spark Streaming and Flink build comparable feedback loops into the framework so they can handle varying load without crashing.
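The throttling rule above can be sketched as a small rate controller. The `low_water` threshold and the choice to halve the rate on every check above `high_water` are assumptions added for illustration; the text only specifies the 1 million trigger and the 50% reduction.

```python
def next_rate(current_rate, queue_depth, full_rate,
              high_water=1_000_000, low_water=200_000):
    """Halve the rate above high_water, restore it below low_water, else hold."""
    if queue_depth > high_water:
        return current_rate * 0.5
    if queue_depth < low_water:
        return full_rate
    return current_rate


rate = 10_000.0
history = []
for depth in (500_000, 1_200_000, 1_500_000, 900_000, 150_000):
    rate = next_rate(rate, depth, full_rate=10_000.0)
    history.append(rate)
```

Using two thresholds instead of one keeps the producer from flapping between full and reduced rate while the queue depth hovers near the limit.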
Kafka vs RabbitMQ: The Real Comparison
| Feature | Kafka | RabbitMQ |
|---|---|---|
| Model | Pub-Sub (log-based) | Point-to-point + Pub-Sub |
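| Message retention | Persisted on disk (days/weeks) | Deleted once acknowledged (classic queues) |
| Replay | Yes, rewind to any retained offset | No for classic queues (stream queues, added later, support it) |
| Throughput | Very high; can reach millions/sec on a tuned cluster | Lower; typically tens of thousands/sec per queue |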
| Consumer model | Pull (consumer controls pace) | Push (broker sends to consumer) |
| Ordering | Per partition | Per queue |
| Routing | Topic + partition key | Flexible exchange-based routing |
| Use case | Event streaming, data pipelines | Task queues, complex routing |
Choose Kafka when:
- High throughput (100K+ messages/second)
- You need replay / event sourcing
- Multiple independent consumers need the same events
- Building a data pipeline (Kafka → Spark/Flink → data warehouse)
Choose RabbitMQ when:
- Complex routing logic (route by message header, content, priority)
- Task queue semantics (each job done by exactly one worker)
- Lower throughput requirements
- Need per-message TTL, priority queues, or delayed delivery
Key Takeaways
- Message queues decouple producers from consumers, enabling async processing, load leveling, and resilience.
- Point-to-point (RabbitMQ): one consumer per message. For task queues.
- Pub-Sub (Kafka): every subscriber (in Kafka, every consumer group) gets every message. For event broadcasting.
- Kafka internals: topics → partitions → offsets. Consumer groups enable parallel processing. Partition keys guarantee per-key ordering.
- Delivery guarantees: At-most-once (lossy, fast), At-least-once (default, needs idempotency), Exactly-once (strong, costly).
- DLQ prevents poison pills from blocking queues — failed messages park here for investigation.
- Backpressure prevents fast producers from overwhelming slow consumers — Kafka's pull model handles this naturally.
What's Next
Topic 14 goes deeper into the architectural pattern that Kafka enables: Event-Driven Architecture — event sourcing, CQRS, the outbox pattern, and how to build systems where "something happened" is the fundamental primitive.
Tags: system-design kafka message-queues backend distributed-systems event-driven interview-prep