Rajkiran

System Design - 13. Message Queues Explained: Why LinkedIn Built Kafka and Changed Async Communication Forever

Covers: Point-to-Point vs Pub-Sub, Kafka Internals, Delivery Guarantees, Dead Letter Queues, Backpressure


The Upload That Broke Everything

Around 2010, shortly before Kafka was open-sourced in 2011, LinkedIn's activity feed was choking. Every time a user updated their profile, viewed a connection, or clicked an article, the system needed to:

  • Update the activity feed
  • Recalculate recommendations
  • Notify relevant connections
  • Update search indexes
  • Log the event for analytics

All synchronously. All in the same request. All blocking the user from getting a response until every downstream system confirmed success.

When traffic spiked, the whole chain collapsed. One slow downstream system stalled every user action behind it.

The engineering team asked a radical question: "Does the user really need to wait for all of this?"

The answer was no. The user needed to know their profile was saved. Everything else — feed updates, recommendations, notifications — could happen a few seconds later without the user caring.

That insight led to the creation of Apache Kafka. And it fundamentally changed how large-scale systems handle communication between services.


What Is a Message Queue?

A message queue is a component that allows services to communicate asynchronously — one service produces a message, the queue stores it, and another service consumes it later.

Without queue (synchronous):
[User Action] → Service A → Service B → Service C → Service D → Response
                            ↑ if any service is slow, user waits

With queue (asynchronous):
[User Action] → Service A → [Queue] → Response (immediate)
                                ↓           ↓
                            Service B    Service C    Service D
                            (processes later, independently)

The user gets an instant response. The downstream work happens in the background, decoupled from the user's request lifecycle.

This unlocks three fundamental capabilities:

Decoupling: Producer doesn't know or care who consumes its messages. Add a new consumer without changing the producer at all.

Load leveling: Traffic spikes fill the queue rather than overwhelming consumers. Consumers process at their own pace.

Resilience: If a consumer is down, messages accumulate in the queue and are processed when it recovers. As long as the queue is durable and its retention limits are not exceeded, nothing is lost.
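
To make the asynchronous handoff concrete, here is a minimal in-process sketch. It uses Python's standard `queue.Queue` and a thread as stand-ins for a real broker and a separate service; the names `save_profile`, `consumer`, and `processed` are illustrative assumptions, not part of any messaging library.

```python
import queue
import threading

events = queue.Queue()          # stands in for the message queue
processed = []                  # work completed by the background consumer

def save_profile(user_id):
    """Producer: enqueue the follow-up work and return immediately."""
    events.put({"type": "profile_updated", "user_id": user_id})
    return "saved"

def consumer():
    """Consumer: drain events at its own pace until the stop signal arrives."""
    while True:
        event = events.get()
        if event is None:       # sentinel used in this sketch to stop the worker
            break
        processed.append(event["user_id"])

worker = threading.Thread(target=consumer)
worker.start()
responses = [save_profile(uid) for uid in (1, 2, 3)]
events.put(None)                # tell the consumer to finish
worker.join()
print(responses, processed)
```

The producer never calls the consumer directly, so a second consumer could be added without touching `save_profile`.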


Two Fundamental Models

Point-to-Point (Queue Model)

Each message is consumed by exactly one consumer. Once consumed, it's gone.

Producer → [Queue] → Consumer A
                  ← (message removed after consumption)

If you have multiple consumers, each message goes to only one of them (competing consumers pattern):

Producer → [Queue] → Consumer A  (takes message 1)
                  → Consumer B  (takes message 2)
                  → Consumer C  (takes message 3)

Use case: Task queues. You want each job done exactly once. Order processing, email sending, image resizing — each task should be handled by one worker, not three.

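RabbitMQ is a popular choice for point-to-point queues, although it also supports pub-sub through its exchanges. When you push a job to a RabbitMQ work queue, the broker delivers it to one worker, and the job is removed once that worker acknowledges it.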
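
The competing consumers pattern can be sketched without a broker. The example below is a simulation that assumes threads in place of worker processes and `queue.Queue` in place of a RabbitMQ queue; `handled_by` is a hypothetical bookkeeping structure used only to show that each job is taken by a single worker.

```python
import queue
import threading

jobs = queue.Queue()
handled_by = {}                 # job id -> list of workers that handled it
lock = threading.Lock()

def worker(name):
    while True:
        job = jobs.get()        # each get() removes the job for everyone else
        if job is None:         # sentinel: no more work for this worker
            break
        with lock:
            handled_by.setdefault(job, []).append(name)

workers = [threading.Thread(target=worker, args=(f"worker-{i}",)) for i in range(3)]
for w in workers:
    w.start()
for job_id in range(9):
    jobs.put(job_id)
for _ in workers:
    jobs.put(None)              # one sentinel per worker
for w in workers:
    w.join()

print({job: names[0] for job, names in sorted(handled_by.items())})
```

Which worker takes which job varies from run to run, but no job is ever handled twice.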


Publish-Subscribe (Pub-Sub Model)

Each message is delivered to all subscribers. Producers publish to a topic; every subscriber to that topic gets every message.

Producer publishes "user.signup" event
    ↓
[Topic: user.signup]
    ├──► Email Service       (sends welcome email)
    ├──► Analytics Service   (records signup event)
    ├──► Recommendations     (initializes user model)
    └──► Notification Service (sends push notification)

All four consumers get the same message. Adding a fifth consumer (say, a fraud detection service) requires zero changes to the producer or existing consumers.

Use case: Event broadcasting. One thing happened; many systems need to know. This is the architecture behind most event-driven systems.
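
As a rough illustration of the fan-out, here is a tiny in-memory topic registry. It is a synchronous, single-process sketch; the `Broker` class and its `subscribe`/`publish` methods are invented for this example and do not correspond to a real client API.

```python
from collections import defaultdict

class Broker:
    def __init__(self):
        self.subscribers = defaultdict(list)   # topic -> list of callbacks

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, message):
        # Every subscriber of the topic receives its own copy of the message.
        for handler in self.subscribers[topic]:
            handler(message)

broker = Broker()
received = defaultdict(list)
for service in ("email", "analytics", "recommendations", "notifications"):
    broker.subscribe("user.signup", lambda msg, s=service: received[s].append(msg))

broker.publish("user.signup", {"user_id": 42})

# Adding a fifth consumer needs no change to the producer or other consumers.
broker.subscribe("user.signup", lambda msg: received["fraud"].append(msg))
broker.publish("user.signup", {"user_id": 43})
```

The new subscriber sees only messages published after it subscribed; seeing older ones requires a log-based system with replay.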

Apache Kafka is the dominant pub-sub system. Let's go deep on how it works.


Kafka Internals: The Engine Behind Modern Data Pipelines

Kafka is not just a message queue — it's a distributed commit log. Understanding its internals is what separates senior engineers from those who just know the vocabulary.

Topics and Partitions

A topic is a named stream of messages (e.g., user-signups, order-placed, payment-completed).

A topic is divided into partitions: ordered, append-only sequences of messages. Partitions are spread across brokers (servers), and one broker can host several partitions of the same topic:

Topic: "order-placed" (4 partitions across 3 brokers)

Broker 1: [Partition 0] msg0, msg4, msg8, msg12...
Broker 2: [Partition 1] msg1, msg5, msg9, msg13...
Broker 3: [Partition 2] msg2, msg6, msg10, msg14...
Broker 1: [Partition 3] msg3, msg7, msg11, msg15...

Partitions enable parallel processing: multiple consumers can read from different partitions simultaneously, so throughput scales roughly with partition count until the brokers, the network, or the consumers become the bottleneck.

Partition key: When producing a message, you can specify a key. Messages with the same key always go to the same partition, as long as the topic's partition count does not change:

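# Assumes the kafka-python client and a broker at localhost:9092
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send(
    topic="order-placed",
    key=b"user_12345",      # All orders from user 12345 → same partition
    value=order_json,       # Serialized order as bytes
)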

This guarantees ordering per key — all events for a given user are processed in sequence. Critical for correctness (you don't want "order cancelled" processed before "order placed").
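
The key-to-partition mapping can be sketched as a hash modulo the partition count. Kafka's default partitioner uses a murmur2 hash; the sketch below substitutes `zlib.crc32` from the standard library, so the partition numbers will not match a real cluster. `choose_partition` is a hypothetical helper.

```python
import zlib

def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition deterministically (crc32 stand-in for murmur2)."""
    return zlib.crc32(key) % num_partitions

events = [
    (b"user_1", "placed"),
    (b"user_2", "placed"),
    (b"user_1", "cancelled"),
]

partitions = {p: [] for p in range(4)}
for key, value in events:
    partitions[choose_partition(key, 4)].append((key, value))

# All of user_1's events sit in one partition, in the order they were produced.
target = choose_partition(b"user_1", 4)
user_1_events = [v for k, v in partitions[target] if k == b"user_1"]
print(user_1_events)
```

Because the mapping depends on `num_partitions`, adding partitions later can send a key to a different partition, which is why the ordering guarantee holds only while the count stays fixed.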


Offsets: The Bookmark System

Every message in a partition has a sequential offset — an integer that uniquely identifies its position.

Partition 0: [offset 0][offset 1][offset 2][offset 3][offset 4]...
                msg_A    msg_B    msg_C    msg_D    msg_E

Consumers track their offset — which message they've processed up to. This offset is stored in Kafka itself (in a special __consumer_offsets topic).

The replay superpower: Because messages are persisted on disk (not deleted after consumption), consumers can reread anything still within the retention window:

  • Rewind to any retained offset and reprocess historical messages
  • Start a new service today and have it process every retained event, from Day 1 if retention allows
  • After a bug fix, replay the last 24 hours of messages through the fixed code

This is something a classic RabbitMQ queue cannot do, because messages are deleted once they are acknowledged. Kafka's log retention (configurable, default 7 days) makes it a time machine.
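
Offsets and replay are easier to see in a toy model. The class below is an in-memory sketch of a single partition, assuming no retention limit and no persistence; `PartitionLog` and its methods are illustrative names, not Kafka APIs.

```python
class PartitionLog:
    """Append-only list of records; the list index plays the role of the offset."""

    def __init__(self):
        self._records = []

    def append(self, value):
        self._records.append(value)
        return len(self._records) - 1     # offset of the new record

    def read(self, offset, max_records=100):
        # Reading never removes anything, so any offset can be read again.
        return self._records[offset:offset + max_records]

log = PartitionLog()
for msg in ("msg_A", "msg_B", "msg_C", "msg_D", "msg_E"):
    log.append(msg)

committed = 0                    # the consumer's bookmark
batch = log.read(committed)
committed += len(batch)          # consumer has now processed offsets 0 to 4

replayed = log.read(2)           # rewind to offset 2 and reprocess
print(committed, replayed)
```

The consumer's position is just an integer it controls, which is what makes rewinding cheap.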


Consumer Groups

A consumer group is a set of consumers that collectively process a topic's partitions. Each partition is assigned to exactly one consumer in the group:

Topic: "order-placed" (4 partitions)
Consumer Group: "payment-service" (2 consumers)

Consumer 1 → Partition 0, Partition 1
Consumer 2 → Partition 2, Partition 3

Scaling: Add more consumers to the group → each handles fewer partitions → higher throughput. You can scale up to as many consumers as there are partitions; any consumers beyond that number sit idle.

Multiple groups, same topic: Different services can each have their own consumer group, all reading the same topic independently:

Topic: "order-placed"
├── Consumer Group "payment-service"    → processes all orders
├── Consumer Group "inventory-service"  → processes all orders
└── Consumer Group "analytics-service"  → processes all orders

All three groups get every message. None of them interfere with each other. This is the pub-sub model in action.
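
The assignment of partitions within one group can be approximated by splitting the partitions into contiguous ranges. This is a simplified sketch; Kafka's real assignors handle rebalancing and multiple topics, and `assign_partitions` is a hypothetical helper written for this example.

```python
def assign_partitions(num_partitions, consumers):
    """Give each consumer a contiguous range; earlier consumers get any extras."""
    assignment = {}
    base, extra = divmod(num_partitions, len(consumers))
    start = 0
    for i, consumer in enumerate(consumers):
        count = base + (1 if i < extra else 0)
        assignment[consumer] = list(range(start, start + count))
        start += count
    return assignment

# Two consumers share four partitions evenly.
print(assign_partitions(4, ["consumer-1", "consumer-2"]))

# With five consumers and four partitions, the fifth has nothing to read.
print(assign_partitions(4, [f"consumer-{i}" for i in range(1, 6)]))
```

Each consumer group would run this assignment independently, which is why separate groups do not compete for messages.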


Delivery Guarantees: The Triangle of Trust

Every messaging system makes promises about delivery. Understanding these promises is critical for system design.

At-Most-Once

Message is delivered zero or one times. The message is acknowledged as soon as it is handed to the consumer, so if the consumer crashes before it finishes processing, the message is lost.

Producer → Queue → Consumer starts processing
                → Consumer crashes mid-processing
                → Message NOT retried
                → Message lost forever

When to use: Metrics, logs, analytics events — where losing occasional messages is acceptable and duplicates are worse than losses. Very high throughput. Very low overhead.


At-Least-Once

Message is delivered one or more times. On failure, it's retried. Duplicates are possible.

Producer → Queue → Consumer processes message
                → Consumer sends ACK
                → Network drops the ACK
                → Queue doesn't receive ACK
                → Queue retries message
                → Consumer processes message AGAIN (duplicate!)

When to use: Most production systems. The standard default. Your consumers must be idempotent — processing the same message twice produces the same result as processing it once.

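# Idempotent consumer example (db and charge_card are placeholders):
def process_payment(payment_id, amount):
    # Check if already processed (payment_id is the idempotency key)
    if db.exists(f"processed_payment:{payment_id}"):
        return  # Already done, skip safely

    # Process payment, then record it. A crash between these two steps can
    # still cause a repeat charge, so production code should make them atomic
    # or pass payment_id to the payment provider as an idempotency key.
    charge_card(amount)
    db.set(f"processed_payment:{payment_id}", True)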
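
To see why idempotency matters under at-least-once delivery, the following self-contained simulation redelivers a message whenever its ACK is "lost". It assumes an in-memory set in place of a durable store, and `handle` and `deliver_at_least_once` are hypothetical names for this sketch.

```python
processed_ids = set()     # stands in for a durable store of handled message ids
charges = []              # side effects that must not be repeated

def handle(message):
    """Apply the side effect only the first time a message id is seen."""
    if message["id"] in processed_ids:
        return False                      # duplicate delivery, skipped
    charges.append(message["amount"])
    processed_ids.add(message["id"])
    return True

def deliver_at_least_once(message, ack_lost_times):
    """Redeliver until an ACK gets through; returns the number of deliveries."""
    deliveries = 0
    while True:
        handle(message)
        deliveries += 1
        if deliveries > ack_lost_times:   # this ACK finally reaches the queue
            return deliveries

deliveries = deliver_at_least_once({"id": "pay-1", "amount": 50}, ack_lost_times=2)
print(deliveries, charges)
```

The message is delivered three times, yet the charge is recorded once.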

Exactly-Once

The effect of each message is applied exactly once, even in the face of failures. No duplicates, no losses.

The hardest guarantee. Kafka achieves it through:

  1. Idempotent producers: Kafka deduplicates producer retries using sequence numbers
  2. Transactional API: write to multiple partitions atomically
  3. Transactional offset commits: the consumer's offsets are committed in the same transaction as the messages it produces, and downstream consumers read only committed data

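# Sketch in the style of kafka-python (assumes a release with transaction support);
# payment_data and audit_data are serialized bytes payloads.
from kafka import KafkaProducer

producer = KafkaProducer(
    enable_idempotence=True,           # Dedup producer retries
    transactional_id="payment-producer-1"
)

producer.init_transactions()
producer.begin_transaction()
try:
    producer.send("payments", payment_data)
    producer.send("audit-log", audit_data)
    producer.commit_transaction()  # Both or neither
except Exception:
    producer.abort_transaction()   # Roll back both writes
    raise                          # Surface the failure to the caller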

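When to use: Financial transactions, payment processing, inventory deduction: anywhere duplicates cause real harm (double-charging a customer, overselling stock). Note that Kafka's transactions cover only reads and writes inside Kafka; a side effect in an external system, such as charging a card, still needs its own idempotency check.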

The cost: Lower throughput than at-least-once. More complex implementation.


Dead Letter Queue: The Safety Net

What happens to messages that consistently fail processing? Without a safety net, they can block the queue forever (a "poison pill" message).

A Dead Letter Queue (DLQ) is a special queue where failed messages are sent after N retry attempts:

Normal Queue → Consumer fails to process message
→ Retry (attempt 2)
→ Retry (attempt 3)  ← max retries reached
→ Move to DLQ

DLQ: message sits here for manual inspection or automated alert

Why this matters:

  • Main queue keeps flowing — the poison pill doesn't block other messages
  • Failed messages aren't lost — they're in the DLQ for investigation
  • Engineers get alerted to DLQ growth → investigate the root cause
  • After fixing the bug, messages can be replayed from DLQ back to the main queue
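
The retry-then-park flow can be sketched with a plain deque. This is a simulation that assumes failed messages are requeued at the back of the main queue; `drain`, `MAX_ATTEMPTS`, and the example `handler` are names invented for this illustration.

```python
from collections import deque

MAX_ATTEMPTS = 3

def drain(main_queue, handler):
    """Process every message; park a message in the DLQ after MAX_ATTEMPTS failures."""
    done, dead_letters = [], []
    while main_queue:
        message, attempts = main_queue.popleft()
        try:
            handler(message)
            done.append(message)
        except Exception as exc:
            attempts += 1
            if attempts >= MAX_ATTEMPTS:
                dead_letters.append(
                    {"message": message, "error": str(exc), "attempts": attempts}
                )
            else:
                main_queue.append((message, attempts))   # requeue for another try
    return done, dead_letters

def handler(message):
    if message == "poison":
        raise ValueError("cannot parse payload")

main_queue = deque((m, 0) for m in ["a", "poison", "b"])
done, dead_letters = drain(main_queue, handler)
print(done, dead_letters)
```

Storing the error and attempt count alongside the message makes the later investigation much easier.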

AWS SQS DLQ config (the source queue's RedrivePolicy attribute):

{
  "deadLetterTargetArn": "arn:aws:sqs:us-east-1:123:my-dlq",
  "maxReceiveCount": 3
}

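After 3 failed receives, the message moves to my-dlq. Engineers can be notified by a CloudWatch alarm configured on the DLQ's depth. Messages stay in the DLQ for its retention period, which defaults to 4 days and can be raised to a maximum of 14 days.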


Backpressure: When Consumers Can't Keep Up

If producers emit 10,000 messages/second but consumers can only process 1,000/second, the queue grows indefinitely. Eventually: out-of-memory, disk full, system collapse.

Backpressure is the mechanism by which a system signals upstream components to slow down.

Pull-based consumption (Kafka's model):
Consumers pull messages at their own pace. They never receive more than they can handle. If a consumer is slow, it simply reads fewer messages, and the topic absorbs the backlog up to its retention limit.

Fast producer → [Kafka topic, growing backlog]
Slow consumer → pulls 100 messages at a time, processes, pulls next 100
→ Consumer naturally throttles itself
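
With a pull model the backlog shows up as consumer lag, the gap between the newest offset written and the offset the consumer has reached. The tick-based simulation below is a sketch with made-up rates; `simulate` is a hypothetical helper, not a Kafka API.

```python
def simulate(produce_per_tick, pull_batch_size, ticks):
    """Return the consumer lag after each tick."""
    end_offset = 0        # next offset the producer will write
    committed = 0         # next offset the consumer will read
    lag_history = []
    for _ in range(ticks):
        end_offset += produce_per_tick
        # The consumer pulls at most one batch per tick, never more than exists.
        committed += min(pull_batch_size, end_offset - committed)
        lag_history.append(end_offset - committed)
    return lag_history

growing = simulate(produce_per_tick=1000, pull_batch_size=100, ticks=3)
steady = simulate(produce_per_tick=100, pull_batch_size=100, ticks=3)
print(growing, steady)
```

The consumer is never overwhelmed, but steadily growing lag is the signal to add consumers or partitions before retention starts discarding unread messages.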

Push-based queues (RabbitMQ):
The broker pushes messages to consumers. The prefetch count limits how many unacknowledged messages a consumer can hold:

channel.basic_qos(prefetch_count=10)
# Consumer receives at most 10 messages before it must ACK some
# Prevents overwhelming a slow consumer

Application-level backpressure:
When the queue depth exceeds a threshold, producers are asked to slow down:

Queue depth > 1 million messages → alert producer service
Producer service → reduce emission rate by 50%
Queue depth decreasing → producer returns to full rate
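
The threshold rule above could be expressed as a small rate function that the producer evaluates each interval. This is a deliberately naive sketch: the function name `next_rate`, the 50% reduction, and the one-million threshold simply mirror the example, and a production controller would usually add smoothing to avoid oscillation.

```python
def next_rate(full_rate, depth, previous_depth, threshold=1_000_000):
    """Return the emission rate (messages/second) for the next interval."""
    if depth > threshold and depth >= previous_depth:
        return full_rate * 0.5      # backlog is high and not shrinking: slow down
    return full_rate                # backlog is low or draining: full speed

# Backlog above the threshold and still growing: halve the rate.
print(next_rate(10_000, depth=1_200_000, previous_depth=1_100_000))

# Backlog is draining again: return to the full rate.
print(next_rate(10_000, depth=1_050_000, previous_depth=1_200_000))
```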

Stream processors such as Spark Streaming and Flink apply the same idea through their own built-in backpressure mechanisms, which lets them handle varying load without crashing.


Kafka vs RabbitMQ: The Real Comparison

| Feature | Kafka | RabbitMQ |
|---|---|---|
| Model | Pub-Sub (log-based) | Point-to-point + Pub-Sub |
| Message retention | Persisted on disk (days/weeks) | Deleted after acknowledgement (classic queues) |
| Replay | Yes, rewind to any retained offset | Not with classic queues |
| Throughput | Up to millions/sec per cluster (hardware-dependent) | ~50K/sec per queue (workload-dependent) |
| Consumer model | Pull (consumer controls pace) | Push (broker sends to consumer) |
| Ordering | Per partition | Per queue |
| Routing | Topic + partition key | Flexible exchange-based routing |
| Use case | Event streaming, data pipelines | Task queues, complex routing |

Choose Kafka when:

  • High throughput (100K+ messages/second)
  • You need replay / event sourcing
  • Multiple independent consumers need the same events
  • Building a data pipeline (Kafka → Spark/Flink → data warehouse)

Choose RabbitMQ when:

  • Complex routing logic (route by message header, content, priority)
  • Task queue semantics (each job done by exactly one worker)
  • Lower throughput requirements
  • Need per-message TTL, priority queues, or delayed delivery

Key Takeaways

  • Message queues decouple producers from consumers, enabling async processing, load leveling, and resilience.
  • Point-to-point (e.g., RabbitMQ work queues): one consumer per message. For task queues.
  • Pub-Sub (e.g., Kafka): every subscriber (each consumer group) gets every message. For event broadcasting.
  • Kafka internals: topics → partitions → offsets. Consumer groups enable parallel processing. Partition keys guarantee per-key ordering.
  • Delivery guarantees: At-most-once (lossy, fast), At-least-once (default, needs idempotency), Exactly-once (strong, costly).
  • DLQ prevents poison pills from blocking queues — failed messages park here for investigation.
  • Backpressure prevents fast producers from overwhelming slow consumers — Kafka's pull model handles this naturally.

What's Next

Topic 14 goes deeper into the architectural pattern that Kafka enables: Event-Driven Architecture — event sourcing, CQRS, the outbox pattern, and how to build systems where "something happened" is the fundamental primitive.

Tags: system-design kafka message-queues backend distributed-systems event-driven interview-prep
