TechLogStack

Posted on May 20 • Originally published at techlogstack.com on May 19

LinkedIn Needed a Message Queue. They Built the One the Entire Internet Runs On.

#distributedsystems #backend #programming #webdev

1 billion events/day at LinkedIn launch in 2011 — immediate production scale from day one
7 trillion messages/day by 2019 — same core architecture, 7,000× growth
~50 MB/sec Kafka producer throughput vs ~2 MB/sec ActiveMQ — in the original 2011 benchmark
9 bytes per-message overhead vs 144 bytes in ActiveMQ — 16× storage efficiency
Stateless brokers — consumers track their own offset; broker memory doesn't scale with consumers
80%+ of Fortune 100 run Kafka today; Confluent IPO'd at $4.5B valuation in 2021

In 2010, LinkedIn was drowning in data it couldn't move. Every ML model, every recommendation engine, every real-time feature was starving because there was no reliable way to get activity data from the website into the systems that needed it. Jay Kreps, Jun Rao, and Neha Narkhede spent a year building a fix. They named it after Franz Kafka. The rest of the internet adopted it.

The Story

By 2010, LinkedIn had dozens of data source systems and dozens of consumer systems — ML models, analytics pipelines, search indexers, real-time features — all needing the same activity stream data. The solution was point-to-point custom pipelines: each one custom-built, each one brittle, none sharing infrastructure. Adding one new data source meant writing N new pipelines. Adding one new consumer meant updating M existing sources. Jay Kreps, leading data infrastructure engineering, described the root cause directly: "Everyone wanted to build fancy machine-learning algorithms, but without the data, the algorithms were useless. Getting the data from source systems and reliably moving it around was very difficult."

Kreps, alongside Jun Rao (from IBM's database group) and Neha Narkhede (from Oracle), evaluated every existing solution. ActiveMQ (an open-source message broker implementing JMS, designed for reliable ordered message delivery between enterprise applications) and RabbitMQ (a message broker built around AMQP, designed for flexible routing and delivery guarantees) were built for a different problem — reliable delivery of individual task messages, not high-throughput streaming of millions of activity events. Their per-message broker state tracking consumed memory proportional to outstanding messages. They couldn't support the scenario where a Hadoop job needed to replay yesterday's activity data. Most critically: ActiveMQ's message format carried 144 bytes of overhead per message. LinkedIn needed millions of messages per second.

The Founding Insight: Treat Data Movement Like a Log

The breakthrough was recognising that LinkedIn's data movement problem was not a messaging problem — it was a log problem. Databases have used append-only logs for decades: the write-ahead log (a sequential record of all changes, written before the changes are applied — used for crash recovery, replication, and point-in-time restoration) is how MySQL and Postgres achieve durability. Jay Kreps asked: what if the data pipeline itself was an append-only log? Producers append events. Consumers read at their own pace. The log retains messages for a configured period. Any consumer can replay from any point. The broker tracks no state. That simplicity unlocked everything.

Problem

LinkedIn's Data Was Locked in Silos

By 2010, LinkedIn had an N×M integration problem — every data source needed a custom pipeline to every data destination. Existing messaging systems (ActiveMQ, RabbitMQ) were designed for task queues, not event streams, and couldn't handle LinkedIn's throughput requirements or support replay.

Cause

No Tool Existed for High-Throughput Real-Time Event Streaming

Batch systems (Hadoop) could handle large volumes but only hours later. Traditional message queues could deliver in real-time but couldn't scale to LinkedIn's volume or support replay. No system simultaneously provided high throughput, low latency, durability, replayability, and horizontal scalability. The three engineers concluded that the tool they needed did not exist.

Solution

One Year Building Kafka: The Append-Only Distributed Log

Kreps, Rao, and Narkhede spent approximately one year building the first version of Kafka. The core architectural decision was treating the message store as an append-only log rather than a queue. This single choice enabled sequential disk I/O (orders of magnitude faster than random I/O), stateless brokers (consumers track their own position), arbitrary replay (consumers read from any offset), and horizontal partitioning (each partition is an independent log that scales independently).

Result

1 Billion Events Per Day at Launch, 7 Trillion by 2019

Kafka went into production at LinkedIn in 2011 and immediately processed over 1 billion events per day. LinkedIn open-sourced it in early 2011. It became an Apache Top-Level Project in October 2012. By 2015: 1 trillion messages per day. By 2019: 7 trillion. Kreps, Narkhede, and Rao left LinkedIn in November 2014 to found Confluent, building the commercial ecosystem around Kafka.

I thought that since Kafka was a system optimized for writing, using a writer's name would make sense. I had taken a lot of lit classes in college and liked Franz Kafka. Plus the name sounded cool for an open source project.

— Jay Kreps, on naming Kafka, via Quora

The Fix

Five Design Decisions That Made Kafka Fast

Kafka's performance advantage was not clever optimisation of a standard architecture — it was a fundamentally different architecture where every key decision reinforced the same goal: maximise throughput for streaming event data. Five decisions stand out as architecturally defining, and each was a deliberate rejection of how existing messaging systems had been built.

~50 MB/s — Kafka producer throughput in the original 2011 benchmark vs ~2 MB/s for ActiveMQ at 200-byte messages
9 bytes — per-message overhead in Kafka vs 144 bytes in ActiveMQ — 16× storage efficiency
Stateless — Kafka brokers; consumer offset tracking is done by the consumer, not the broker
Sequential — disk access pattern for both writes and reads; append-only means no random I/O

// The five key Kafka design decisions illustrated in code

// DECISION 1: Append-only log storage (not a queue)
// Each partition is a directory of sequential segment files
// /kafka-logs/my-topic-0/00000000000000000000.log
// /kafka-logs/my-topic-0/00000000000000100000.log
// → Sequential writes: disk seeks are expensive; sequential I/O is ~100x faster

// DECISION 2: Consumer tracks its own offset — broker holds no state
long consumerOffset = consumer.position(topicPartition); // consumer owns this
// → Brokers are stateless: no per-consumer memory, no ack tracking overhead
consumer.seek(topicPartition, 0); // replay from the beginning — any time

// DECISION 3: Topics partitioned for horizontal scale
ProducerRecord record = new ProducerRecord<>(
    "user-activity",
    userId,    // partition key: same user → same partition = ordered per user
    eventJson  // the message payload
);
// → N partitions = N consumers in parallel = linear throughput scaling

// DECISION 4: Batch I/O from client to broker
props.put("batch.size", 16384); // batch up to 16KB before sending
props.put("linger.ms", 5);      // or wait 5ms for the batch to fill
// Original paper: batch size 50 improved throughput ~10x vs batch size 1

// DECISION 5: Zero-copy transfer via OS sendfile()
// Consumer fetch path: disk page cache → network socket (no userspace copy)
// → No data enters JVM heap → no GC pressure → consistent low latency
// → Delivers data at near-network-hardware-limit throughput

The Stateless Broker: The Counterintuitive Masterstroke

In ActiveMQ and RabbitMQ, the broker maintains delivery state for every message: who acknowledged it, who hasn't, what needs to be retried. At scale, this per-message state tracking consumes enormous memory and creates a bottleneck. Kafka's solution was radical: let consumers track their own position (their offset in each partition). The broker stores bytes in a log. Consumers read at their own pace and can reset to any offset to replay. The broker's memory footprint is constant regardless of consumer count or message backlog — making horizontal scaling of consumers a configuration change, not an infrastructure problem.

Kafka vs traditional message queues — original 2011 benchmarks and design properties:

Property	ActiveMQ / RabbitMQ	Kafka
Storage model	Queue — messages deleted after ack	Append-only log — retained by time/size
Broker state	Tracks ack state per message per consumer	Stateless — consumers track own offset
Producer throughput	~2 MB/sec (ActiveMQ)	~50 MB/sec (batch size 50)
Message overhead	144 bytes (ActiveMQ JMS header)	9 bytes
Consumer replay	Not supported	Supported — seek to any offset
Horizontal scale	Limited (complex cluster configs)	Native — add partitions, add consumers
Use case fit	Task queues, guaranteed delivery, routing	Event streaming, log aggregation, activity tracking

Zero-copy: the OS kernel trick that doubled throughput

One of Kafka's most impactful performance optimisations is invisible to application code. In a traditional data transfer, data moves: disk → kernel buffer → userspace → socket buffer → network. In Kafka's consumer path, the OS sendfile() syscall transfers data directly from the page cache to the network socket, bypassing userspace entirely. No data is copied into the JVM heap — no GC pressure, no object allocation overhead. At LinkedIn's throughput rates, this optimisation alone accounts for significant throughput gains and, more importantly, consistent low latency even under high load.

The log/table duality: Jay Kreps' deeper insight

In his 2013 essay "The Log," Kreps articulated a concept beyond Kafka's implementation: the log/table duality — any database table can be derived by replaying a log of changes from the beginning, and any log can be materialised into a table by applying each event as a state update. Every database table is a log in disguise. Every stream of events can be materialised into a table. This duality means a Kafka topic is simultaneously a stream and a database — query it as a stream in motion (stream processing) or materialise it as a snapshot (a table). This insight became the foundation for Kafka Streams, ksqlDB, and the entire stream-processing ecosystem that followed.

Architecture

Kafka's architecture has three layers. The storage layer is a set of partitioned, replicated append-only log files on disk — each partition is an independent, totally ordered sequence of records. The broker layer is a cluster of server processes managing partition assignment, replication, and client connections — holding no consumer state. The client layer is producers writing to partitions and consumer groups reading from them, each group maintaining its own independent offset per partition.

Before Kafka: N×M Integration Spaghetti

View interactive diagram on TechLogStack →

Interactive diagram available on TechLogStack (link above).

After Kafka: The Centralised Log Hub

View interactive diagram on TechLogStack →

Interactive diagram available on TechLogStack (link above).

Inside Kafka: Topics, Partitions, Offsets, and Consumer Groups

View interactive diagram on TechLogStack →

Interactive diagram available on TechLogStack (link above).

LinkedIn's Kafka by 2019: The Scale Numbers

From 1 billion events per day at launch (2011) to 7 trillion messages per day by 2019 — a 7,000× growth in eight years on the same fundamental architecture. Spread across 100+ clusters, 4,000+ brokers, 100,000+ topics, and 7 million partitions. Each message consumed by approximately four consumer groups on average. The most remarkable fact: the append-only partitioned log described in the 2011 paper is still the architecture running at 7 trillion messages per day. Good architecture ages well.

Lessons

Before building, verify no existing tool solves your problem at your scale. The Kafka team evaluated ActiveMQ, RabbitMQ, and existing log aggregation systems before building. Their conclusion — existing tools were designed for the wrong problem — was evidence-based. The benchmark (50 MB/sec vs 2 MB/sec) made the decision concrete. Never rebuild what can be adopted; never adopt what demonstrably can't serve your workload.
The append-only log (a data structure where records are only ever added to the end, never modified in place — enabling sequential I/O, arbitrary consumer replay, and stateless brokers) is the universal data integration primitive. Any system moving data between producers and consumers is implementing a log, whether it knows it or not. Recognising and building explicitly on this pattern is what gave Kafka its performance advantage and its flexibility.
Stateless brokers make systems horizontally scalable in ways stateful brokers cannot match. When the broker tracks delivery state per consumer per message, broker memory scales with consumers × outstanding messages. When consumers track their own offsets, broker memory scales with partitions only. This single architectural choice is why Kafka can serve hundreds of consumer groups without broker degradation.
Sequential I/O is dramatically faster than random I/O on both HDDs and SSDs. An append-only log turns a bursty stream of writes into sequential disk operations, allowing Kafka to approach disk hardware throughput limits. Systems that update records in-place pay random I/O costs on every write. Kafka writes append-only and leverages the OS page cache for reads, achieving throughput that surprised the entire industry.
Open-sourcing infrastructure that solves a universal problem creates compounding returns. LinkedIn open-sourced Kafka in 2011 because the team recognised it solved a problem every data-intensive company had. Community contributions from Netflix, Uber, Twitter, and thousands of others built tooling LinkedIn could never have built alone: Kafka Streams, Kafka Connect, ksqlDB, MirrorMaker, Schema Registry. The return on open-sourcing infrastructure is measured in ecosystem, not just code.

Engineering Glossary

Append-only log — a data structure where records are only ever added to the end, never modified in place. Enables sequential disk I/O, arbitrary consumer replay, and stateless brokers. The core data structure underlying Kafka's architecture and the reason for its performance advantage over traditional message queues.

Consumer group — a set of Kafka consumers that collectively read from a topic, with each partition assigned to exactly one consumer in the group at a time. Enables parallel consumption: a topic with N partitions can be consumed by up to N consumers simultaneously.

Consumer offset — the position of a consumer within a partition, tracking which messages have been read. In Kafka, consumers (not brokers) own and commit their offsets — the key architectural decision that makes Kafka brokers stateless.

Log/table duality — the mathematical relationship where any database table can be derived by replaying a log of changes from the beginning, and any log can be materialised into a table by applying each event as a state update. The theoretical foundation for Kafka Streams and ksqlDB.

Partition — the unit of parallelism in Kafka. Each topic is divided into one or more partitions, each of which is an independent append-only log stored on a single broker. Producers write to partitions by key; consumers read from partitions independently.

Stateless broker — a broker that holds no per-consumer delivery state. Kafka brokers store bytes in partitioned logs; consumers own their own offset positions. Broker memory scales with partition count, not consumer count — the property that makes Kafka horizontally scalable to hundreds of consumer groups without broker degradation.

Write-ahead log (WAL) — a sequential record of all changes made to a database, written to disk before the changes are applied. Used for crash recovery and replication in MySQL, Postgres, and virtually every serious database. The inspiration for Kafka's append-only log architecture.

Zero-copy transfer — the use of the OS sendfile() syscall to transfer data directly from the kernel page cache to a network socket, bypassing userspace entirely. Used in Kafka's consumer fetch path to eliminate JVM heap copies, GC pressure, and the associated latency spikes at high throughput.

This case is a plain-English retelling of publicly available engineering material.

Read the full case on TechLogStack →

(Interactive diagrams, source links, and the full reader experience)

TechLogStack — built at scale, broken in public, rebuilt by engineers.

DEV Community

LinkedIn Needed a Message Queue. They Built the One the Entire Internet Runs On.

The Story

Problem

LinkedIn's Data Was Locked in Silos

Cause

No Tool Existed for High-Throughput Real-Time Event Streaming

Solution

One Year Building Kafka: The Append-Only Distributed Log

Result

1 Billion Events Per Day at Launch, 7 Trillion by 2019

The Fix

Five Design Decisions That Made Kafka Fast

Architecture

Before Kafka: N×M Integration Spaghetti

After Kafka: The Centralised Log Hub

Inside Kafka: Topics, Partitions, Offsets, and Consumer Groups

Lessons

Engineering Glossary

Top comments (0)