Tyler Tan

Posted on May 21

Building Kafka from Scratch: A Message Broker in 1800 Lines of C++23

#cpp #distributedsystems #systemdesign #tutorial

You wrote a web scraper. It crawls product pages and pipes the results downstream for processing. You wired it up with raw TCP, fire-and-forget style. Then the downstream service crashed. After the restart, the messages were gone, and your scraper had no idea what was sent and what wasn't.

You need something that holds onto messages until the consumer is ready to pick them up. In other words, you need a message queue.

Enter Kafka.

Kafka is the most widely deployed distributed messaging engine on the planet, powering data pipelines at LinkedIn, Uber, Netflix, and basically anywhere that moves serious volume. You give it a topic (say, crawler-results), producers push messages in, consumers pull them out. In between sits the Broker, handling connections, persisting data to disk, and routing traffic. Messages don't get lost, ordering is preserved within each partition, and scaling is mostly a matter of adding more machines.

But real Kafka clocks in at roughly half a million lines of Java and Scala. Even browsing the source tree is enough to make most people close the tab.

So I stripped it to the bone. The result is TinyKafka, written from scratch in C++23: ~1,800 lines of core code, zero external dependencies, just the standard library and POSIX sockets. It implements four essential APIs (Produce, Fetch, DescribeTopicPartitions, and ApiVersions), plus a hand-rolled Kafka binary protocol stack and disk-backed persistence. Over 3,200 lines of tests verify every byte on the wire and every write to disk.

Running it is trivial:

$ cd TinyKafka && ./build.sh && ./build/kafka
Waiting for clients to connect...

It binds port 9092 and sits there waiting for Kafka clients to show up.

So what actually happens to a message from the moment it arrives to the moment it leaves? Let's crack it open, layer by layer.

What Kafka Actually Is

Let's get the concepts out of the way in two paragraphs.

Kafka's core model has three pieces: a Producer writes messages into a logical channel called a Topic, a Consumer reads messages from that Topic, and the Broker in the middle stores and forwards everything. A Topic can be split into multiple Partitions, spreading data across them so throughput scales horizontally.

It solves three problems: decoupling (producers and consumers don't need to know about each other), buffering (messages pile up on disk and get consumed at the consumer's pace), and durability (messages hit the disk and survive restarts).

Alright, concepts done. Now let's see what a Broker looks like when you remove everything that isn't essential.

The While Loop Is the Whole Engine

Here's TinyKafka's entire flow in pseudocode:

1. Startup: read metadata from disk → now you know what topics and partitions exist
2. Bind port 9092, start accepting client connections
3. For each connected client, detach a thread:
   while (connection alive) {
       read 4 bytes → now you know the message length
       read the full message body
       parse_request()   → binary blob becomes typed struct
       Broker::handle()  → do the actual work
       serialize()       → response back to binary
       send_all()        → fire it back to the client
   }

That's main.cpp in its entirety, 88 lines. Here's the real thing, trimmed for the page with the error handling left out:

int main() {
    // 1. Read KRaft metadata from disk
    auto metadata = parse_cluster_metadata_file(
        "/tmp/kraft-combined-logs/__cluster_metadata-0/00000000000000000000.log");

    // 2. Start TCP server on port 9092
    auto server = Server::create(9092);

    // 3. Accept loop: accept → hand off to thread
    while (true) {
        auto client = server->accept();
        int client_fd = *client;

        std::thread([client_fd, &metadata] {
            Broker broker(metadata, "/tmp/kraft-combined-logs");

            while (true) {
                // Receive: first read 4-byte length prefix
                std::array<uint8_t, 4> len_buf{};
                recv_all(client_fd, len_buf);
                auto message_length = decode_int32_be(len_buf);

                // Read the message body
                std::vector<uint8_t> buf(message_length);
                recv_all(client_fd, buf);

                // Binary → typed request → handle → serialize → send back
                auto req  = parse_request(buf);
                auto resp = broker.handle(*req);
                auto bytes = serialize(resp);
                send_all(client_fd, bytes);
            }
        }).detach();
    }
}

Notice the detach(). Each client gets its own thread running its own receive-parse-handle-send loop. Multiple producers and consumers can connect simultaneously without stepping on each other. Simple, blunt, and effective.

Real Kafka is far less cowboy about this. It uses a Reactor-style design on top of Java NIO: a small set of network threads multiplexes non-blocking sockets, and a pool of request-handler threads does the work, so it never pays for a thread per connection. TinyKafka's one-thread-per-connection model is more of a proof of concept; it lets you see the concurrency model at a glance. The real thing adds enormous engineering (zero-copy sendfile, memory-mapped offset indexes, sparse per-segment indexes that turn an offset lookup into a binary search, and a dozen other things), but the skeleton loop (receive request, process, send response) is identical.

Speaking Kafka's Language: The Binary Protocol

So what's parse_request() actually doing? Turning raw network bytes into C++ structs, and that's where TinyKafka's grittiest code lives: a hand-built implementation of the Kafka binary wire protocol.

Every Kafka message is structured as three parts:

+------------------+------------------+------------------+
|  message_size    |     Header       |      Body        |
|  (4 bytes, BE)   |  (variable)      |  (API-dependent)  |
+------------------+------------------+------------------+

message_size: a 4-byte big-endian integer that tells the other side "here's how many more bytes to read." TCP is a stream protocol with no built-in message boundaries. This length prefix is a simple framing layer; without it you'd never know when one message ends and the next begins.

Header carries three critical fields:

api_key (2 bytes): what kind of message this is. 0 = Produce, 1 = Fetch, 18 = ApiVersions, 75 = DescribeTopicPartitions. Real Kafka defines dozens more API keys; we implemented four.
api_version (2 bytes): the version of this API. Kafka keeps multiple versions of the same API alive simultaneously. An old client may speak v0 while a newer one speaks v16; the client picks the highest version that both it and the broker support.
correlation_id (4 bytes): a sequence number for matching responses to requests. The client stamps it on the request, the broker echoes it back, and the client uses it to figure out "this response goes with that request I sent earlier."

Body: varies by api_key. A Produce body carries a topic name and a blob of record batch bytes. A Fetch body carries a topic UUID and a list of partitions. The structure is strictly defined by the Kafka protocol spec.

Why binary instead of a text format like JSON over HTTP? Because it's compact. Written as text, an int32 such as 2147483647 takes ten bytes. In binary it's always exactly four. Kafka moves trillions of messages a day; that difference is not academic. Fixed-width binary fields also skip the per-character scanning a JSON parser does: in the request header, bytes 0-1 are always api_key and bytes 2-3 are always api_version, so reading one is a couple of shifts and you're done.

Since network byte order is big-endian everywhere, TinyKafka has a small arsenal of hand-rolled encode/decode primitives. Reading an int32, for instance:

auto decode_int32_be(std::span<const uint8_t, 4> data) -> int32_t {
    return (static_cast<int32_t>(data[0]) << 24) |
           (static_cast<int32_t>(data[1]) << 16) |
           (static_cast<int32_t>(data[2]) << 8)  |
           static_cast<int32_t>(data[3]);
}

Four bytes. Most significant in data[0], least significant in data[3]. This function looks trivial, but the entire Kafka protocol stack is built out of hundreds of calls just like it.
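For completeness, here's what the matching encoder might look like, plus a decoder written against unsigned arithmetic so the shifts never touch a sign bit. Both function names carry a _sketch suffix because they are illustrations, not TinyKafka's own code, and they take std::array rather than the std::span used above to keep the example short.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical counterpart to decode_int32_be: int32 -> 4 big-endian bytes.
constexpr auto encode_int32_be_sketch(int32_t value) -> std::array<uint8_t, 4> {
    const auto u = static_cast<uint32_t>(value);   // work on the raw bit pattern
    return {
        static_cast<uint8_t>(u >> 24),             // most significant byte first
        static_cast<uint8_t>((u >> 16) & 0xFF),
        static_cast<uint8_t>((u >> 8) & 0xFF),
        static_cast<uint8_t>(u & 0xFF),
    };
}

// Same decoding as above, assembled in uint32_t and converted at the end.
constexpr auto decode_int32_be_sketch(const std::array<uint8_t, 4>& d) -> int32_t {
    const uint32_t u = (uint32_t{d[0]} << 24) | (uint32_t{d[1]} << 16) |
                       (uint32_t{d[2]} << 8)  |  uint32_t{d[3]};
    return static_cast<int32_t>(u);
}

// Round trip checked at compile time, negative values included.
static_assert(decode_int32_be_sketch(encode_int32_be_sketch(-2)) == -2);
```

Port 9092 is 0x2384, so encode_int32_be_sketch(9092) produces the bytes 00 00 23 84.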

On top of these primitives sit ByteReader and ByteWriter, two utility classes that read and write int16/int32/varints/compact strings sequentially over std::span. parser.cpp runs 290 lines, serializer.cpp 225, both standing on the shoulders of these two helpers.

The api_key in the header determines how the body gets parsed and how the request gets handled. We implemented four of them, 0, 1, 18, and 75. Let's take them one at a time.

Four APIs, Four Kinds of Work

Open broker.cpp and you'll find a single method, Broker::handle(), that does all the heavy lifting. But before we look at the dispatch mechanism, let's understand what each of the four APIs actually does.

ApiVersions (api_key = 18): The Handshake

The first thing a Kafka client typically does after connecting is ask the broker: "What APIs do you support, and which version ranges?"

The response is a table:

struct ApiVersionEntry {
    int16_t api_key;       // API number
    int16_t min_version;   // lowest supported version
    int16_t max_version;   // highest supported version
};

struct ApiVersionsResponse {
    int32_t correlation_id;
    int16_t error_code;
    std::vector<ApiVersionEntry> api_keys;  // our four entries
    int32_t throttle_time_ms;
};

TinyKafka's answer is this compile-time table:

inline constexpr std::array<ApiVersionEntry, 4> kSupportedApis{{
    {.api_key = 0,  .min_version = 0, .max_version = 11},  // Produce
    {.api_key = 1,  .min_version = 0, .max_version = 16},  // Fetch
    {.api_key = 18, .min_version = 0, .max_version = 4},   // ApiVersions
    {.api_key = 75, .min_version = 0, .max_version = 0},   // DescribeTopicPartitions
}};

If the client sends an ApiVersions request with a version outside [0, 4], the broker fires back error_code = 35 (UNSUPPORTED_VERSION) and that's the end of the conversation. That's Kafka version negotiation in its entirety, simpler than HTTP content negotiation by a mile.
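The range check itself fits in a few lines. Here's a rough sketch: check_version and the two error-code constants are hypothetical names, the table is copied from above so the example compiles on its own, and returning UNSUPPORTED_VERSION for an unknown api_key is this sketch's choice rather than documented TinyKafka behavior.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

struct ApiVersionEntry {
    int16_t api_key;
    int16_t min_version;
    int16_t max_version;
};

// Same table as above, repeated so this example stands alone.
inline constexpr std::array<ApiVersionEntry, 4> kSupportedApis{{
    {0, 0, 11},   // Produce
    {1, 0, 16},   // Fetch
    {18, 0, 4},   // ApiVersions
    {75, 0, 0},   // DescribeTopicPartitions
}};

inline constexpr int16_t kNoError = 0;
inline constexpr int16_t kUnsupportedVersion = 35;

// Hypothetical helper: 0 if (api_key, api_version) is inside the advertised
// range, 35 otherwise. Unknown keys are treated as unsupported (an assumption).
constexpr auto check_version(int16_t api_key, int16_t api_version) -> int16_t {
    for (const auto& entry : kSupportedApis) {
        if (entry.api_key == api_key) {
            const bool ok = api_version >= entry.min_version &&
                            api_version <= entry.max_version;
            return ok ? kNoError : kUnsupportedVersion;
        }
    }
    return kUnsupportedVersion;
}

// The table is constexpr, so the check can run at compile time too.
static_assert(check_version(18, 4) == kNoError);
static_assert(check_version(18, 5) == kUnsupportedVersion);
```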

DescribeTopicPartitions (api_key = 75): "What partitions does this topic have?"

A client wants to know about a topic's metadata. Does it exist? What partitions does it have? Who's the leader of each partition?

TinyKafka handles this by looking up the topic name in an in-memory ClusterMetadata structure. That structure gets built at startup by parsing a KRaft metadata log file, __cluster_metadata-0/00000000000000000000.log, which contains the canonical record of every topic and partition.

Found it? Here's the partition list. Didn't find it? error_code = 3 (UNKNOWN_TOPIC_OR_PARTITION). Results come back sorted alphabetically by topic name, because the Kafka spec demands it. You can see this sorting behavior verified byte-for-byte in the integration tests: send ["zebra", "apple"], get back ["apple", "zebra"].

Fetch (api_key = 1): Consumer Pulling Messages

A consumer says: "Give me the messages for partition 0 of the topic with UUID a1b2c3d4...."

A Fetch request comes with a pile of fields, max_wait_ms, min_bytes, max_bytes, isolation_level, session_id, session_epoch, all controlling fetch behavior. TinyKafka keeps only the two that matter for the minimal path: topic UUID and partition_index. Everything else gets skipped. Real Kafka uses those extra fields for long-polling, transactional isolation, and other advanced features, but our goal is just to get the bytes flowing.

struct FetchTopicRequest {
    std::array<uint8_t, 16> topic_id;           // 16-byte UUID
    std::vector<FetchPartitionRequest> partitions;
};

struct FetchPartitionResponse {
    int32_t partition_index;
    int16_t error_code;
    std::vector<uint8_t> records;  // the payload: raw record batch bytes
};

Once the broker finds the topic, it reads the entire partition log file from disk and stuffs it directly into the records field. The consumer gets raw record batch bytes and does its own decoding. This is Kafka's philosophy: the broker should touch message content as little as possible. It stores, it forwards. Decoding is the client's problem.

Produce (api_key = 0): Producer Sending Messages

This is where the scraper's data from the opening story finally enters Kafka:

struct ProduceTopicRequest {
    std::string topic_name;
    std::vector<ProducePartitionRequest> partitions;
    // each partition carries a blob of records (record batch bytes)
};

TinyKafka looks up the topic by name, verifies the partition exists, and appends the record batch bytes to a disk file. Kafka doesn't store messages one at a time. They're packed into record batches, each batch containing multiple records, with a magic byte tagging the format version. This batching is what gives Kafka its legendary throughput: it dramatically reduces the number of disk I/O operations.

Four APIs down. Now let's see how the broker routes a request to the right handler.

variant + visit + overloaded: The Compiler Won't Let You Forget

TinyKafka models the four request types as a std::variant:

using Request = std::variant<
    ApiVersionsRequest,
    DescribeTopicPartitionsRequest,
    FetchRequest,
    ProduceRequest
>;

Response works the same way. Then Broker::handle() dispatches every case in one shot using std::visit with the overloaded pattern:

auto Broker::handle(const Request& req) -> Response {
    return std::visit(overloaded{
        [](const ApiVersionsRequest& r) -> Response { /* version negotiation */ },
        [this](const DescribeTopicPartitionsRequest& r) -> Response { /* lookup */ },
        [this](const FetchRequest& r) -> Response { /* read from disk */ },
        [this](const ProduceRequest& r) -> Response { /* write to disk */ },
    }, req);
}

In the MoonieCode post we called std::variant a "paranoid envelope": it holds exactly one of the declared types, nothing else, and the compiler forces you to handle every single case. Forget to write the Produce handler? Your build breaks. No class hierarchy, no if-else chain, no missed cases. For a closed set of request types like this one, it's cleaner and safer than traditional OOP with virtual dispatch.
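The overloaded helper isn't part of the standard library; it's a two-line idiom you write yourself. Here's a self-contained sketch of the pattern using two toy request types (PingRequest, EchoRequest, and handle_sketch are invented for the example and have nothing to do with Kafka's APIs).

```cpp
#include <cassert>
#include <string>
#include <variant>

// The classic "overloaded" idiom: inherit from every lambda and pull all of
// their call operators into one overload set.
template <class... Ts>
struct overloaded : Ts... {
    using Ts::operator()...;
};
// Deduction guide; optional since C++20, kept for older compilers.
template <class... Ts>
overloaded(Ts...) -> overloaded<Ts...>;

// Toy request types, purely illustrative.
struct PingRequest {};
struct EchoRequest { std::string text; };
using RequestSketch = std::variant<PingRequest, EchoRequest>;

inline auto handle_sketch(const RequestSketch& req) -> std::string {
    // Delete either lambda and this stops compiling: std::visit requires
    // a callable overload for every alternative in the variant.
    return std::visit(overloaded{
        [](const PingRequest&) -> std::string { return "pong"; },
        [](const EchoRequest& r) -> std::string { return r.text; },
    }, req);
}
```

One caveat worth knowing: the exhaustiveness check only holds as long as you don't add a generic catch-all lambda such as [](const auto&), which would silently swallow any new alternative.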

Where Messages Live: The Disk Layout

DescribeTopicPartitions depends on metadata read from disk. Fetch reads from disk. Produce writes to disk. They all converge on TinyKafka's storage layer. So what exactly is sitting on that filesystem?

Directory Structure: Exactly Like Real Kafka

/tmp/kraft-combined-logs/
├── __cluster_metadata-0/
│   └── 00000000000000000000.log    ← KRaft metadata
├── orders-0/
│   └── 00000000000000000000.log    ← partition 0 of the "orders" topic
├── crawler-results-0/
│   └── 00000000000000000000.log    ← partition 0 of "crawler-results"
└── ...

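The naming convention is {topic}-{partition}/00000000000000000000.log, identical to real Kafka. The 20-digit file name is the offset of the first record in the segment, so it starts at zero. TinyKafka keeps a single segment file per partition and never rolls. Real Kafka starts a new segment once the active one hits a size or time threshold and names it after its base offset, for example 00000000000000000020.log if the next record has offset 20.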

The Starting Point: Metadata Files

That __cluster_metadata-0/00000000000000000000.log is the KRaft metadata log. KRaft mode, which arrived as early access in Kafka 2.8 and was declared production-ready in 3.3, lets you run without ZooKeeper: cluster metadata lives as record batches inside this file.

At startup, TinyKafka slurps it into memory and walks the record batch v2 format layer by layer: first identify record batch boundaries (magic byte = 2), then parse each record's value. The value itself is a compact varint-encoded frame where the critical field is type: type=2 means it's a topic record (name + UUID), type=3 means it's a partition record (partition ID + parent topic UUID). Topics and partitions get linked by UUID.

The parsed result becomes a ClusterMetadata struct:

struct ClusterMetadata {
    std::vector<TopicInfo> topics;                              // all topics
    std::unordered_map<std::string, size_t> name_to_topic;     // lookup by name
    std::unordered_map<std::array<uint8_t, 16>, size_t, UuidHash> uuid_to_topic;  // lookup by UUID
};

Three tables. Two lookup paths. O(1) to find any topic. With this map built, all routing falls into place:

Produce routing: topic name → name_to_topic → find partition list → verify partition exists → write to disk
Fetch routing: topic UUID → uuid_to_topic → find topic name → construct file path → read from disk

Notice that Produce uses the name and Fetch uses the UUID. This isn't arbitrary; it follows the Kafka protocol versions TinyKafka speaks: Produce requests identify topics by name, while Fetch requests (v13 and later) identify them by UUID, which the client learned from an earlier metadata lookup such as DescribeTopicPartitions.
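std::array has no std::hash specialization, which is why the UUID map needs a custom hasher. Here's one way the two lookup paths might be wired up. Everything with a Sketch suffix is an assumption made for illustration: the hasher uses 64-bit FNV-1a, which may or may not be what TinyKafka's UuidHash does, and TopicInfoSketch is pared down to a name, an id, and a partition list.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Uuid = std::array<uint8_t, 16>;

// Hypothetical hasher: 64-bit FNV-1a over the 16 UUID bytes.
struct UuidHashSketch {
    auto operator()(const Uuid& id) const noexcept -> std::size_t {
        uint64_t h = 14695981039346656037ull;   // FNV offset basis
        for (uint8_t b : id) {
            h ^= b;
            h *= 1099511628211ull;              // FNV prime
        }
        return static_cast<std::size_t>(h);
    }
};

struct TopicInfoSketch {
    std::string name;
    Uuid id;
    std::vector<int32_t> partitions;
};

struct ClusterMetadataSketch {
    std::vector<TopicInfoSketch> topics;
    std::unordered_map<std::string, std::size_t> name_to_topic;
    std::unordered_map<Uuid, std::size_t, UuidHashSketch> uuid_to_topic;

    // Both maps store an index into `topics`, so each topic is stored once.
    void add(TopicInfoSketch t) {
        const std::size_t idx = topics.size();
        name_to_topic[t.name] = idx;
        uuid_to_topic[t.id] = idx;
        topics.push_back(std::move(t));
    }

    // Produce path: look up by name. Returns nullptr if unknown.
    auto by_name(const std::string& name) const -> const TopicInfoSketch* {
        const auto it = name_to_topic.find(name);
        return it == name_to_topic.end() ? nullptr : &topics[it->second];
    }

    // Fetch path: look up by UUID. Returns nullptr if unknown.
    auto by_uuid(const Uuid& id) const -> const TopicInfoSketch* {
        const auto it = uuid_to_topic.find(id);
        return it == uuid_to_topic.end() ? nullptr : &topics[it->second];
    }
};
```

Storing indices instead of pointers in the maps keeps them valid when the topics vector reallocates, though the pointers handed back by by_name and by_uuid are only good until the next add.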

Writing and Reading

Writing (the Produce path) takes only a few lines:

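auto dir = std::format("{}/{}-{}", root_path, topic_name, partition);
std::error_code ec;                             // non-throwing overload reports failure here
std::filesystem::create_directories(dir, ec);   // auto-create directory tree
auto path = std::format("{}/00000000000000000000.log", dir);
std::ofstream file(path, std::ios::binary | std::ios::app);  // append mode
// ofstream::write takes const char*, so cast the uint8_t buffer
file.write(reinterpret_cast<const char*>(records.data()),
           static_cast<std::streamsize>(records.size()));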

create_directories handles the first write to a new partition by building the directory tree automatically. ios::app means every write appends to the end of the file, never overwriting existing data.

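Reading (for Fetch and metadata) is just as short: open the ifstream at the end, tellg for the size, seek back to the start, then one shot into a vector<uint8_t>:

std::ifstream file(path, std::ios::binary | std::ios::ate);  // open positioned at end
auto sz = file.tellg();                                       // position == file size
file.seekg(0, std::ios::beg);                                 // rewind, or read() gets nothing
std::vector<uint8_t> data(static_cast<std::size_t>(sz));
file.read(reinterpret_cast<char*>(data.data()), sz);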

Real Kafka would never read entire files into memory like this. It memory-maps its offset index files, uses them to find the byte range for a requested offset, and ships that range from the page cache to the socket with sendfile. But for an 1,800-line prototype, simple and direct is exactly the right call.

The Best Way to Understand Something Is to Build It

That's TinyKafka, every layer from network protocol to disk storage, peeled open. In 1,800 lines it packs a full binary protocol stack, handlers for four APIs, KRaft metadata parsing, and disk-backed persistence. Think of it as a miniature Kafka anatomy model.

I didn't build TinyKafka to create a production-grade broker. I built it to understand. Kafka's documentation and source code are intimidatingly large, but once you've built a minimal version yourself, you realize the core skeleton isn't that complicated: accept binary requests over TCP, dispatch by api_key, route through metadata to the right disk file, read or write. Everything else, zero-copy, segmented indices, replica synchronization, ISR management, transactional support, is engineering built on top of that skeleton.

There's a learning philosophy here: instead of letting half a million lines of source code intimidate you, spend a day building a minimal prototype. Afterwards, those massive codebases stop looking like alien artifacts. You recognize the bones.

Code is on GitHub. Stars, issues, and ruthless code review are all welcome.

DEV Community