Rajkiran

Posted on Jun 11

System Design - 13. Message Queues Explained: Why LinkedIn Built Kafka and Changed Async Communication Forever

#architecture #backend #distributedsystems #systemdesign

Message Queues Explained: Why LinkedIn Built Kafka and Changed Async Communication Forever

Covers: Point-to-Point vs Pub-Sub, Kafka Internals, Delivery Guarantees, Dead Letter Queues, Backpressure

The Upload That Broke Everything

In 2011, LinkedIn's activity feed was choking. Every time a user updated their profile, viewed a connection, or clicked an article, the system needed to:

Update the activity feed
Recalculate recommendations
Notify relevant connections
Update search indexes
Log the event for analytics

All synchronously. All in the same request. All blocking the user from getting a response until every downstream system confirmed success.

When traffic spiked, the whole chain collapsed. One slow downstream system stalled every user action behind it.

The engineering team asked a radical question: "Does the user really need to wait for all of this?"

The answer was no. The user needed to know their profile was saved. Everything else — feed updates, recommendations, notifications — could happen a few seconds later without the user caring.

That insight led to the creation of Apache Kafka. And it fundamentally changed how large-scale systems handle communication between services.

What Is a Message Queue?

A message queue is a component that allows services to communicate asynchronously — one service produces a message, the queue stores it, and another service consumes it later.

Without queue (synchronous):
[User Action] → Service A → Service B → Service C → Service D → Response
                            ↑ if any service is slow, user waits

With queue (asynchronous):
[User Action] → Service A → [Queue] → Response (immediate)
                                ↓           ↓
                            Service B    Service C    Service D
                            (processes later, independently)

The user gets an instant response. The downstream work happens in the background, decoupled from the user's request lifecycle.

This unlocks three fundamental capabilities:

Decoupling: Producer doesn't know or care who consumes its messages. Add a new consumer without changing the producer at all.

Load leveling: Traffic spikes fill the queue rather than overwhelming consumers. Consumers process at their own pace.

Resilience: If a consumer is down, messages accumulate in the queue and are processed when it recovers. Nothing is lost.

Two Fundamental Models

Point-to-Point (Queue Model)

Each message is consumed by exactly one consumer. Once consumed, it's gone.

Producer → [Queue] → Consumer A
                  ← (message removed after consumption)

If you have multiple consumers, each message goes to only one of them (competing consumers pattern):

Producer → [Queue] → Consumer A  (takes message 1)
                  → Consumer B  (takes message 2)
                  → Consumer C  (takes message 3)

Use case: Task queues. You want each job done exactly once. Order processing, email sending, image resizing — each task should be handled by one worker, not three.

RabbitMQ is the canonical point-to-point queue. When you push a job to a RabbitMQ queue, exactly one worker picks it up and processes it.

Publish-Subscribe (Pub-Sub Model)

Each message is delivered to all subscribers. Producers publish to a topic; every subscriber to that topic gets every message.

Producer publishes "user.signup" event
    ↓
[Topic: user.signup]
    ├──► Email Service       (sends welcome email)
    ├──► Analytics Service   (records signup event)
    ├──► Recommendations     (initializes user model)
    └──► Notification Service (sends push notification)

All four consumers get the same message. Adding a fifth consumer (say, a fraud detection service) requires zero changes to the producer or existing consumers.

Use case: Event broadcasting. One thing happened; many systems need to know. This is the architecture behind every event-driven system.

Apache Kafka is the dominant pub-sub system. Let's go deep on how it works.

Kafka Internals: The Engine Behind Modern Data Pipelines

Kafka is not just a message queue — it's a distributed commit log. Understanding its internals is what separates senior engineers from those who just know the vocabulary.

Topics and Partitions

A topic is a named stream of messages (e.g., user-signups, order-placed, payment-completed).

A topic is divided into partitions — ordered, immutable sequences of messages. Each partition lives on a different broker (server):

Topic: "order-placed" (4 partitions across 3 brokers)

Broker 1: [Partition 0] msg1, msg4, msg7, msg10...
Broker 2: [Partition 1] msg2, msg5, msg8, msg11...
Broker 3: [Partition 2] msg3, msg6, msg9, msg12...
Broker 1: [Partition 3] msg0, msg3b, msg6b, msg9b...

Partitions enable parallel processing — multiple consumers can read from different partitions simultaneously, giving you throughput that scales linearly with partition count.

Partition key: When producing a message, you specify a key. Messages with the same key always go to the same partition:

producer.send(
    topic="order-placed",
    key=b"user_12345",      # All orders from user 12345 → same partition
    value=order_json
)

This guarantees ordering per key — all events for a given user are processed in sequence. Critical for correctness (you don't want "order cancelled" processed before "order placed").

Offsets: The Bookmark System

Every message in a partition has a sequential offset — an integer that uniquely identifies its position.

Partition 0: [offset 0][offset 1][offset 2][offset 3][offset 4]...
                msg_A    msg_B    msg_C    msg_D    msg_E

Consumers track their offset — which message they've processed up to. This offset is stored in Kafka itself (in a special __consumer_offsets topic).

The replay superpower: Because messages are persisted on disk (not deleted after consumption), consumers can:

Rewind to any offset and reprocess historical messages
A new service joining today can process all events from Day 1
After a bug fix, replay the last 24 hours of messages through the fixed code

This is something RabbitMQ cannot do — messages are deleted after consumption. Kafka's log retention (configurable, default 7 days) makes it a time machine.

Consumer Groups

A consumer group is a set of consumers that collectively process a topic's partitions. Each partition is assigned to exactly one consumer in the group:

Topic: "order-placed" (4 partitions)
Consumer Group: "payment-service" (2 consumers)

Consumer 1 → Partition 0, Partition 1
Consumer 2 → Partition 2, Partition 3

Scaling: Add more consumers to the group → each handles fewer partitions → higher throughput. You can scale up to as many consumers as there are partitions.

Multiple groups, same topic: Different services can each have their own consumer group, all reading the same topic independently:

Topic: "order-placed"
├── Consumer Group "payment-service"    → processes all orders
├── Consumer Group "inventory-service"  → processes all orders
└── Consumer Group "analytics-service"  → processes all orders

All three groups get every message. None of them interfere with each other. This is the pub-sub model in action.

Delivery Guarantees: The Triangle of Trust

Every messaging system makes promises about delivery. Understanding these promises is critical for system design.

At-Most-Once

Message is delivered zero or one times. If the consumer crashes before acknowledging, the message is lost.

Producer → Queue → Consumer starts processing
                → Consumer crashes mid-processing
                → Message NOT retried
                → Message lost forever

When to use: Metrics, logs, analytics events — where losing occasional messages is acceptable and duplicates are worse than losses. Very high throughput. Very low overhead.

At-Least-Once

Message is delivered one or more times. On failure, it's retried. Duplicates are possible.

Producer → Queue → Consumer processes message
                → Consumer sends ACK
                → Network drops the ACK
                → Queue doesn't receive ACK
                → Queue retries message
                → Consumer processes message AGAIN (duplicate!)

When to use: Most production systems. The standard default. Your consumers must be idempotent — processing the same message twice produces the same result as processing it once.

# Idempotent consumer example:
def process_payment(payment_id, amount):
    # Check if already processed (idempotency key)
    if db.exists(f"processed_payment:{payment_id}"):
        return  # Already done, skip safely

    # Process payment
    charge_card(amount)
    db.set(f"processed_payment:{payment_id}", True)

Exactly-Once

Message is delivered and processed exactly once, even in the face of failures. No duplicates, no losses.

The hardest guarantee. Kafka achieves it through:

Idempotent producers — Kafka deduplicates producer retries using sequence numbers
Transactional API — write to multiple partitions atomically
Transactional consumers — offset commit and business logic in the same transaction

producer = KafkaProducer(
    enable_idempotence=True,           # Dedup producer retries
    transactional_id="payment-producer-1"
)

producer.init_transactions()
producer.begin_transaction()
try:
    producer.send("payments", payment_data)
    producer.send("audit-log", audit_data)
    producer.commit_transaction()  # Both or neither
except Exception:
    producer.abort_transaction()

When to use: Financial transactions, payment processing, inventory deduction — anywhere duplicates cause real harm (double-charging a customer, overselling stock).

The cost: Lower throughput than at-least-once. More complex implementation.

Dead Letter Queue: The Safety Net

What happens to messages that consistently fail processing? Without a safety net, they can block the queue forever (a "poison pill" message).

A Dead Letter Queue (DLQ) is a special queue where failed messages are sent after N retry attempts:

Normal Queue → Consumer fails to process message
→ Retry (attempt 2)
→ Retry (attempt 3)  ← max retries reached
→ Move to DLQ

DLQ: message sits here for manual inspection or automated alert

Why this matters:

Main queue keeps flowing — the poison pill doesn't block other messages
Failed messages aren't lost — they're in the DLQ for investigation
Engineers get alerted to DLQ growth → investigate the root cause
After fixing the bug, messages can be replayed from DLQ back to the main queue

AWS SQS DLQ config:

{
  "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123:my-dlq",
  "maxReceiveCount": 3
}

After 3 failed attempts, message moves to my-dlq. Engineers receive CloudWatch alarm. Messages stay in DLQ for 14 days.

Backpressure: When Consumers Can't Keep Up

If producers emit 10,000 messages/second but consumers can only process 1,000/second, the queue grows indefinitely. Eventually: out-of-memory, disk full, system collapse.

Backpressure is the mechanism by which a system signals upstream components to slow down.

Pull-based consumption (Kafka's model):
Consumers pull messages at their own pace. They never receive more than they can handle. If a consumer is slow, it simply reads fewer messages — the queue absorbs the backlog.

Fast producer → [Kafka topic, growing backlog]
Slow consumer → pulls 100 messages at a time, processes, pulls next 100
→ Consumer naturally throttles itself

Push-based queues (RabbitMQ):
The broker pushes messages to consumers. prefetch count limits how many unacknowledged messages a consumer can hold:

channel.basic_qos(prefetch_count=10)
# Consumer receives at most 10 messages before it must ACK some
# Prevents overwhelming a slow consumer

Application-level backpressure:
When the queue depth exceeds a threshold, producers are asked to slow down:

Queue depth > 1 million messages → alert producer service
Producer service → reduce emission rate by 50%
Queue depth decreasing → producer returns to full rate

This is how streaming systems like Spark Streaming and Flink handle varying load without crashing.

Kafka vs RabbitMQ: The Real Comparison

Feature	Kafka	RabbitMQ
Model	Pub-Sub (log-based)	Point-to-point + Pub-Sub
Message retention	Persisted on disk (days/weeks)	Deleted after consumption
Replay	Yes — rewind to any offset	No
Throughput	Millions/sec per broker	~50K/sec per queue
Consumer model	Pull (consumer controls pace)	Push (broker sends to consumer)
Ordering	Per partition	Per queue
Routing	Topic + partition key	Flexible exchange-based routing
Use case	Event streaming, data pipelines	Task queues, complex routing

Choose Kafka when:

High throughput (100K+ messages/second)
You need replay / event sourcing
Multiple independent consumers need the same events
Building a data pipeline (Kafka → Spark/Flink → data warehouse)

Choose RabbitMQ when:

Complex routing logic (route by message header, content, priority)
Task queue semantics (each job done by exactly one worker)
Lower throughput requirements
Need per-message TTL, priority queues, or delayed delivery

Key Takeaways

Message queues decouple producers from consumers, enabling async processing, load leveling, and resilience.
Point-to-point (RabbitMQ): one consumer per message. For task queues.
Pub-Sub (Kafka): all consumers get every message. For event broadcasting.
Kafka internals: topics → partitions → offsets. Consumer groups enable parallel processing. Partition keys guarantee per-key ordering.
Delivery guarantees: At-most-once (lossy, fast), At-least-once (default, needs idempotency), Exactly-once (strong, costly).
DLQ prevents poison pills from blocking queues — failed messages park here for investigation.
Backpressure prevents fast producers from overwhelming slow consumers — Kafka's pull model handles this naturally.

What's Next

Topic 14 goes deeper into the architectural pattern that Kafka enables: Event-Driven Architecture — event sourcing, CQRS, the outbox pattern, and how to build systems where "something happened" is the fundamental primitive.

Tags: system-design kafka message-queues backend distributed-systems event-driven interview-prep

DEV Community