Understanding why Kafka even exists!?
Introduction
Until the recent acquisition of Confluent by IBM, I was rather a stranger to Apache Kafka and data streaming. Since that major industry shift, however, I’ve become very interested in the mechanics of real-time data and have been looking for high-quality resources to master the subject. This synthesis covers one of my latest readings: one of the most respected modern guides for transitioning from traditional request-response architectures to event-driven ecosystems.
TL;DR: Kafka

Originally developed at LinkedIn and later open-sourced as an Apache Software Foundation project, Apache Kafka is a distributed event-streaming platform designed to handle high-volume, real-time data feeds. It functions as a distributed commit log, allowing for the seamless publishing, storing, and processing of streams of records as they occur, which effectively decouples data producers from data consumers in a scalable and fault-tolerant manner. In modern architecture, it is used for a wide variety of critical tasks, including real-time data integration, stream analytics, log aggregation, and building event-driven microservices that require high throughput and low latency. Due to its reliability and massive scalability, Kafka is utilized by thousands of global enterprises, including industry leaders like LinkedIn, Netflix, Uber, Airbnb, and Goldman Sachs, to power everything from recommendation engines and fraud detection to real-time payment processing and activity tracking.
Disclaimer: Before going further, please note that I am not affiliated, associated, authorized, endorsed by, or in any way officially connected with Manning Publications or the author, Katya Gorshkova. This document is a synthesis intended for educational and reference purposes based on the content of the book.
All images provided are either from the book or the GitHub repositories cited!
Overall view (Part 1): Architectural Foundations of Kafka
The book establishes Kafka not merely as a “message broker” (like RabbitMQ or ActiveMQ) but as a distributed streaming platform and a distributed commit log.
The Log-Centric View
At its core, Kafka is built on the concept of an append-only immutable log. This architectural choice ensures:
- Sequential I/O: High throughput by avoiding random disk seeks.
- Determinism: Events are stored in the order they arrived, allowing for “time-travel” debugging and state reconstruction.
- Decoupling: Producers don’t need to know who the consumers are; they simply append to the log.
Key Components
- Brokers and Clusters: Kafka runs as a cluster of servers (brokers). Architects must understand how data is distributed across these nodes to ensure high availability.
- Topics and Partitions: Topics are logical categories, while partitions are the physical unit of parallelism. The book emphasizes that partitioning strategy is the most critical decision an architect makes for scalability.
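To make the partitioning decision tangible, here is a minimal sketch (my own, not from the book) that declares partition count and replication factor at topic creation time, using Kafka’s standard Java AdminClient; the topic name, the counts, and the broker address are illustrative placeholders.

import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicSketch {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker

    try (AdminClient admin = AdminClient.create(props)) {
      // 6 partitions -> up to 6 consumers in one group can read in parallel;
      // replication factor 3 -> each partition is stored on 3 brokers.
      NewTopic orders = new NewTopic("orders", 6, (short) 3);
      admin.createTopics(List.of(orders)).all().get();
    }
  }
}

Note that raising the partition count later changes the key-to-partition mapping and silently breaks per-key ordering, which is why the book treats this as an up-front architectural decision.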
Event-Driven Architecture (EDA) Patterns
Gorshkova spends significant time detailing how Kafka fits into broader enterprise patterns:
- CQRS (Command Query Responsibility Segregation): Using Kafka to separate the “write” model (commands) from the “read” model (queries). Kafka acts as the bridge that updates read-optimized databases in real-time.
- Event Sourcing: Instead of storing the current state of an object, you store the sequence of events that led to that state. Kafka’s persistence makes it an ideal “Source of Truth” (see the replay sketch after this list).
- Saga Pattern: Managing distributed transactions in microservices without a central coordinator by using events to trigger compensating transactions.
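To make Event Sourcing concrete, below is a minimal sketch (mine, not the book’s) of a consumer that rebuilds state by replaying a log from offset 0; the account-events topic and its single partition are assumptions.

import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ReplayStateSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

    Map<String, String> state = new HashMap<>();
    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      TopicPartition tp = new TopicPartition("account-events", 0); // hypothetical topic
      consumer.assign(List.of(tp));
      consumer.seekToBeginning(List.of(tp)); // "time travel": start from offset 0
      long end = consumer.endOffsets(List.of(tp)).get(tp);

      // Fold every event in arrival order; the map ends up holding the latest
      // value per key, i.e. current state derived purely from the event log.
      while (consumer.position(tp) < end) {
        consumer.poll(Duration.ofMillis(500)).forEach(r -> state.put(r.key(), r.value()));
      }
    }
    System.out.println("Rebuilt entries: " + state.size());
  }
}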
Data Governance and Schema Registry
A major focus for architects is Data Contracts. The book argues that without schema management, a Kafka cluster becomes a “data swamp.”
- Schema Registry: A centralized repository for Avro, Protobuf, or JSON schemas.
- Evolution: How to handle “Forward,” “Backward,” and “Full” compatibility so that changing a producer doesn’t break dozens of downstream consumers.
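As a concrete illustration of these rules, the following sketch uses Apache Avro’s built-in SchemaCompatibility helper to check backward compatibility locally (a Schema Registry enforces the same rules centrally); the User schema is invented for the example.

import org.apache.avro.Schema;
import org.apache.avro.SchemaCompatibility;

public class CompatibilitySketch {
  public static void main(String[] args) {
    // v1: what existing producers write.
    Schema writerV1 = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"string\"}]}");

    // v2: adds a field WITH a default, so a v2 reader can still decode
    // v1 records -- the essence of backward compatibility.
    Schema readerV2 = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"id\",\"type\":\"string\"},"
            + "{\"name\":\"email\",\"type\":[\"null\",\"string\"],\"default\":null}]}");

    System.out.println(SchemaCompatibility
        .checkReaderWriterCompatibility(readerV2, writerV1)
        .getType()); // COMPATIBLE
  }
}

Because the new email field carries a default, a v2 consumer can still decode every record written with v1, so producers and consumers can be upgraded independently.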
Ecosystem and Tools (GitHub References)
The book cites several industry-standard tools to solve complex architectural challenges. Below are the key GitHub references mentioned and their specific roles:
Cruise Control
- Reference: https://github.com/linkedin/cruise-control
- Usage & Explanation: Cited as the gold standard for cluster load balancing. In a large Kafka deployment, some brokers inevitably become “hot” (overloaded). Cruise Control automates the redistribution of partitions based on CPU, disk, and network utilization, reducing the operational burden on architects.
/*
 * Copyright 2019 LinkedIn Corp. Licensed under the BSD 2-Clause License (the "License"). See License in the project root for license information.
 */
package com.linkedin.kafka.cruisecontrol;

import com.codahale.metrics.MetricRegistry;
import com.codahale.metrics.jmx.JmxReporter;
import com.linkedin.kafka.cruisecontrol.async.AsyncKafkaCruiseControl;
import com.linkedin.kafka.cruisecontrol.config.KafkaCruiseControlConfig;
import com.linkedin.kafka.cruisecontrol.config.constants.WebServerConfig;
import com.linkedin.kafka.cruisecontrol.metrics.LegacyObjectNameFactory;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public abstract class KafkaCruiseControlApp {
  protected static final Logger LOG = LoggerFactory.getLogger(KafkaCruiseControlApp.class);
  protected static final String METRIC_DOMAIN = "kafka.cruisecontrol";

  protected final KafkaCruiseControlConfig _config;
  protected final AsyncKafkaCruiseControl _kafkaCruiseControl;
  protected final JmxReporter _jmxReporter;
  protected MetricRegistry _metricRegistry;
  protected Integer _port;
  protected String _hostname;

  KafkaCruiseControlApp(KafkaCruiseControlConfig config, Integer port, String hostname) {
    this._config = config;

    _metricRegistry = new MetricRegistry();
    _jmxReporter = JmxReporter.forRegistry(_metricRegistry).inDomain(METRIC_DOMAIN)
        .createsObjectNamesWith(LegacyObjectNameFactory.getInstance()).build();
    _jmxReporter.start();

    _port = port;
    _hostname = hostname;

    _kafkaCruiseControl = new AsyncKafkaCruiseControl(config, _metricRegistry);
  }

  public String getHostname() {
    return _hostname;
  }

  public int getPort() {
    return _port;
  }

  public void start() throws Exception {
    _kafkaCruiseControl.startUp();
  }

  void registerShutdownHook() {
    Runtime.getRuntime().addShutdownHook(new Thread(this::stop));
  }

  /**
   * Stops Cruise Control
   */
  public void stop() {
    _kafkaCruiseControl.shutdown();
    _jmxReporter.close();
  }

  public abstract String serverUrl();

  protected void printStartupInfo() {
    boolean corsEnabled = _config.getBoolean(WebServerConfig.WEBSERVER_HTTP_CORS_ENABLED_CONFIG);
    boolean vertxEnabled = _config.getBoolean(WebServerConfig.VERTX_ENABLED_CONFIG);
    String webApiUrlPrefix = _config.getString(WebServerConfig.WEBSERVER_API_URLPREFIX_CONFIG);
    String uiUrlPrefix = _config.getString(WebServerConfig.WEBSERVER_UI_URLPREFIX_CONFIG);
    String webDir = _config.getString(WebServerConfig.WEBSERVER_UI_DISKPATH_CONFIG);
    String sessionPath = _config.getString(WebServerConfig.WEBSERVER_SESSION_PATH_CONFIG);

    System.out.println(">> ********************************************* <<");
    System.out.println(">> Application directory           : " + System.getProperty("user.dir"));
    System.out.println(">> REST API available on           : " + webApiUrlPrefix);
    System.out.println(">> Web UI available on             : " + uiUrlPrefix);
    System.out.println(">> Web UI Directory                : " + webDir);
    System.out.println(">> Cookie prefix path              : " + sessionPath);
    System.out.println(">> Kafka Cruise Control started on : " + serverUrl());
    System.out.println(">> CORS Enabled ?                  : " + corsEnabled);
    System.out.println(">> Vertx Enabled ?                 : " + vertxEnabled);
    System.out.println(">> ********************************************* <<");
  }
}
Excerpt from the GitHub repository
Cruise Control is a product that helps run Apache Kafka clusters at large scale. Due to the popularity of Apache Kafka, many companies have increasingly large Kafka clusters with hundreds of brokers. At LinkedIn, we have 10K+ Kafka brokers, which means broker deaths are an almost daily occurrence and balancing the workload of Kafka also becomes a big overhead.
Kafka Cruise Control is designed to address this operational scalability issue.
JMX Monitoring Stacks
- Reference: https://github.com/confluentinc/jmx-monitoring-stacks
- Usage & Explanation: Kafka exposes internal metrics via Java Management Extensions (JMX). This repository provides pre-configured dashboards (often using Prometheus and Grafana) to monitor broker health, consumer lag, and throughput. Architects use this to define Service Level Indicators (SLIs).
scrape_configs:
  - job_name: Confluent Cloud
    scrape_interval: 1m
    scrape_timeout: 1m
    honor_timestamps: true
    static_configs:
      - targets:
          - api.telemetry.confluent.cloud
    scheme: https
    basic_auth:
      username: $CONFLUENT_CLOUD_API_KEY
      password: $CONFLUENT_CLOUD_API_SECRET
    metrics_path: /v2/metrics/cloud/export
    params:
      resource.kafka.id: [${CCLOUD_KAFKA_LKC_IDS}]
      resource.connector.id: [${CCLOUD_CONNECT_LCC_IDS}]
      resource.ksql.id: [${CCLOUD_KSQL_LKSQLC_IDS}]
      resource.schema_registry.id: [${CCLOUD_SR_LSRC_IDS}]
  - job_name: Confluent Cost Exporter
    scrape_interval: 5m
    scrape_timeout: 30s
    honor_labels: true
    metrics_path: /probe
    static_configs:
      - targets: ['confluent_cost_exporter:7979']
Kafka Connectors
- Reference: Various repositories under https://github.com/confluentinc
- Usage & Explanation: Instead of writing custom code to move data from a database to Kafka, the book suggests using Kafka Connect. These references provide ready-made “Source” and “Sink” connectors (e.g., Debezium for CDC) to ensure reliable data ingestion and egress.
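For illustration, a connector is usually registered by POSTing its JSON configuration to the Kafka Connect REST API (port 8083 by default). The sketch below registers a hypothetical Debezium PostgreSQL source; the config keys are Debezium’s documented ones, but every connection value is a placeholder.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterConnectorSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical Debezium PostgreSQL source; all values are placeholders.
    String payload = "{"
        + "\"name\": \"inventory-source\","
        + "\"config\": {"
        + "  \"connector.class\": \"io.debezium.connector.postgresql.PostgresConnector\","
        + "  \"database.hostname\": \"postgres\","
        + "  \"database.port\": \"5432\","
        + "  \"database.user\": \"debezium\","
        + "  \"database.password\": \"secret\","
        + "  \"database.dbname\": \"inventory\","
        + "  \"topic.prefix\": \"inventory\""
        + "}}";

    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:8083/connectors")) // Connect's default REST port
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(payload))
        .build();

    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode() + " " + response.body());
  }
}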
Deployment and Reliability
Architects must design for failure. The book covers:
- Replication Factor: Ensuring data exists on multiple brokers.
- In-Sync Replicas (ISR): Understanding the trade-offs between “at-least-once,” “at-most-once,” and “exactly-once” delivery semantics.
- Multi-Region Deployment: Strategies like MirrorMaker 2 or Confluent Cluster Linking for disaster recovery and geo-replication.
Synthesis view of Book Chapters (Part 2)
> The chapter categorization below is based on my own PoV and does not follow the book’s actual organization!
A walkthrough of my takeaways from each of the book’s chapters.
Chapter 1 (up to chapter 3, Introductory sections): Introduction to Kafka

This chapter establishes why Kafka exists. Unlike traditional message brokers (like RabbitMQ) that delete messages once they are read, Kafka is a distributed commit log.
- The Problem: Traditional architectures often suffer from “spaghetti” integration, where every system needs a direct connection to every other system.
- The Solution: Kafka acts as a central nervous system, allowing data producers to send data once and multiple consumers to read it at their own pace.
Chapter 2: Essential Concepts
Before building, you must understand the terminology:
- Topics & Partitions: A Topic is a category or feed name. Topics are split into Partitions, which are the fundamental unit of parallelism.
- Offsets: Every message in a partition is assigned a unique, sequential ID called an offset. This allows consumers to track exactly where they left off.
- Brokers: These are the servers that form the Kafka cluster, storing the data and serving the clients.
Chapter 3: The Architecture of Kafka
This chapter dives into the “Brain” of the operation.
- Replication: To prevent data loss, Kafka replicates partitions across multiple brokers. One broker acts as the Leader, and others act as Followers.
- Cluster Management: It explains the transition from using ZooKeeper to KRaft, which allows Kafka to manage its own metadata internally, making it much easier to scale to millions of partitions.
Chapter 4 (up to chapter 6, Implementation and Data Flow): Producing Messages
Architecting the “write” side of the system:
- Load Balancing: Producers decide which partition to send a message to using a partitioner (often a hash of a key, like a UserID).
- Batching & Compression: To achieve high throughput, Kafka batches small messages together and compresses them before sending them over the wire.
- Acks (Acknowledgements): You can configure whether a producer waits for a single broker to acknowledge the data (acks=1) or for all in-sync replicas (acks=all) to ensure durability.
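A minimal producer sketch (not from the book) showing how keying, batching, compression, and acks appear in configuration; the broker address, topic, and tuning numbers are placeholders.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class DurableProducerSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    props.put(ProducerConfig.ACKS_CONFIG, "all");             // wait for all in-sync replicas
    props.put(ProducerConfig.LINGER_MS_CONFIG, 20);           // wait up to 20 ms to fill a batch
    props.put(ProducerConfig.BATCH_SIZE_CONFIG, 64 * 1024);   // up to 64 KB per partition batch
    props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4"); // compress each batch on the wire

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // The default partitioner hashes the key, so every event for "user-42"
      // lands in the same partition and stays in order.
      producer.send(new ProducerRecord<>("orders", "user-42", "order-created"));
    }
  }
}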
Chapter 5: Consuming Messages
Architecting the “read” side:
- Consumer Groups: This is Kafka’s primary scaling mechanism. If you have a topic with 4 partitions and a group of 4 consumers, each consumer takes one partition.
- Rebalancing: The chapter explains what happens when a consumer fails: Kafka automatically redistributes the workload among the remaining healthy consumers.
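A minimal group-consumer sketch, assuming a topic named orders and a group named billing: start several copies and the group coordinator splits the partitions among them, rebalancing automatically when a member dies.

import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class GroupConsumerSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(ConsumerConfig.GROUP_ID_CONFIG, "billing"); // all "billing" members share partitions
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

    try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
      consumer.subscribe(List.of("orders")); // the group coordinator assigns partitions
      while (true) {
        for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1))) {
          // partition() and offset() show exactly where this member is reading
          System.out.printf("p%d@%d %s=%s%n", r.partition(), r.offset(), r.key(), r.value());
        }
      }
    }
  }
}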
Chapter 6: Message Delivery Semantics
What is critical for data integrity:
- At-least-once: Messages are never lost but may be redelivered if a consumer crashes before committing its offset.
- Exactly-once (Transactions): Kafka’s “magic” feature that ensures even if a system fails, the end result is as if the message was processed exactly one time.
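Under the hood this relies on the transactional producer API. A hedged sketch, assuming topics payments and ledger exist; downstream consumers must set isolation.level=read_committed to see only committed transactions.

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.serialization.StringSerializer;

public class TransactionalProducerSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    // A stable transactional.id lets the broker fence a crashed instance's
    // "zombie" twin -- the mechanism behind exactly-once guarantees.
    props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "payments-app-1");

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      producer.initTransactions();
      producer.beginTransaction();
      try {
        producer.send(new ProducerRecord<>("payments", "user-42", "debit:10"));
        producer.send(new ProducerRecord<>("ledger", "user-42", "entry:10"));
        producer.commitTransaction(); // both writes become visible atomically
      } catch (KafkaException e) {
        producer.abortTransaction();  // neither write is ever seen by readers
        throw e;
      }
    }
  }
}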
Chapter 7 (up to the end, Advanced Ecosystem and Operations): Kafka Streams
Instead of just moving data, this chapter teaches you how to process it in real-time.
- Stateless vs. Stateful: Simple filtering (Stateless) vs. complex operations like “How many orders did this user place in the last hour?” (Stateful).
- Windowing: Processing data in time blocks (e.g., 5-minute windows) to identify trends as they happen.
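Here is a short Kafka Streams sketch of exactly that stateful, windowed question; the topic names are assumptions and error handling is omitted.

import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

public class OrdersPerUserSketch {
  public static void main(String[] args) {
    StreamsBuilder builder = new StreamsBuilder();
    // Stateful, windowed count: "how many orders per user in each 5-minute block?"
    builder.stream("orders", Consumed.with(Serdes.String(), Serdes.String()))
        .groupByKey()                                               // key = user id
        .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
        .count()                                                    // backed by a local state store
        .toStream((win, count) -> win.key() + "@" + win.window().startTime())
        .to("orders-per-user-5m", Produced.with(Serdes.String(), Serdes.Long()));

    Properties props = new Properties();
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-counter");
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    new KafkaStreams(builder.build(), props).start();
  }
}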
Chapter 8: Kafka Connect
Why write code when you can use a connector?
- Source Connectors: Automatically pull data from a database (like MongoDB or Postgres) into Kafka.
- Sink Connectors: Automatically push data from Kafka into a storage engine (like S3, Elasticsearch, or Snowflake).
Chapter 9: Security and Operations
The “Day 2” realities of running Kafka:
- Encryption: Using TLS to protect data in transit.
- Authentication: Using SASL or Kerberos to ensure only authorized apps can read or write data.
- Monitoring: Tracking metrics like “Consumer Lag” — the gap between the latest message and where the consumer currently is — which is the most vital health check for any Kafka architect.
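Consumer lag can also be computed programmatically by comparing a group’s committed offsets against each partition’s latest offset. A sketch with the Java AdminClient, assuming a group named billing:

import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ConsumerLagSketch {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

    try (AdminClient admin = AdminClient.create(props)) {
      // Where the group last committed, per partition...
      Map<TopicPartition, OffsetAndMetadata> committed =
          admin.listConsumerGroupOffsets("billing") // hypothetical group
               .partitionsToOffsetAndMetadata().get();

      // ...versus the newest offset in each of those partitions.
      Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> latest =
          admin.listOffsets(committed.keySet().stream()
                   .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest())))
               .all().get();

      committed.forEach((tp, meta) ->
          System.out.println(tp + " lag=" + (latest.get(tp).offset() - meta.offset())));
    }
  }
}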
Chapter 10: Kafka Projects (I think this is one of the most important chapters for me, and also a great starting point to build a project around Kafka!)
- Defining Project Requirements: The foundation of a Kafka project involves translating business logic into technical specifications.
- Workflow Identification: Architects must identify event-driven workflows and decompose business processes into discrete events.
- Functional Requirements: This includes defining topic names, selecting appropriate message keys for ordering, and determining data types.
- Nonfunctional Requirements: These are critical for cluster sizing and include (a topic-configuration sketch follows this list):
  - Throughput and Latency: Estimating the volume of messages per second.
  - Durability and Availability: Setting replication factors and min.insync.replicas.
  - Retention: Defining how long data stays in the cluster based on business needs and disk capacity.
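To show how these nonfunctional requirements become concrete settings, here is a sketch (same Java AdminClient as earlier; the payments topic and its numbers are invented) that encodes durability and retention choices at creation time.

import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class SizedTopicSketch {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

    try (AdminClient admin = AdminClient.create(props)) {
      NewTopic payments = new NewTopic("payments", 12, (short) 3) // sized for target throughput
          .configs(Map.of(
              "min.insync.replicas", "2", // with acks=all, a write must reach 2 replicas
              "retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000))); // keep 7 days
      admin.createTopics(List.of(payments)).all().get();
    }
  }
}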
Maintaining Cluster Structure: Once requirements are set, the “source of truth” for the infrastructure must be managed.
- Tooling: While CLI and UI tools are useful for exploration, they are insufficient for enterprise-grade management.
- Environment Management: Establishing consistent configurations across Development, Test, and Production environments is essential for stability.
Testing Kafka Applications: Testing in a distributed, asynchronous environment requires a multi-layered approach to ensure reliability and performance.
Chapter 11: Data Governance and Schema Management
As a Kafka cluster grows, the biggest risk isn’t technical failure, but “data rot” — where producers send data that consumers can’t understand.
- The Schema Registry: This chapter introduces the Confluent Schema Registry (or similar tools), which acts as a gatekeeper. It ensures that every message sent to a topic follows a predefined format (like Avro, Protobuf, or JSON Schema).
- Compatibility Rules: It explains how to manage “Evolution.” For example, if you add a new field to a database, how do you ensure the older consumers don’t crash? The chapter covers Forward, Backward, and Full compatibility strategies.
- Data Lineage: It discusses tracking where data comes from and where it goes, which is essential for regulatory compliance (like GDPR or HIPAA).
Chapter 12: Future Trends and Architecture Evolution
This concluding chapter looks at how Kafka is evolving from a simple “tool” into a “cloud-native data backbone.”
- Tiered Storage: This is a major architectural shift. Instead of keeping all data on expensive, fast disks on the Kafka brokers, older data is automatically moved to cheaper object storage (like Amazon S3 or Google Cloud Storage). This allows for nearly infinite data retention without ballooning costs.
- Serverless Kafka: The rise of “Kafka as a Service” (like Confluent Cloud or Amazon MSK), where architects don’t manage servers at all and instead focus purely on data streams and throughput.
- Event-Driven Evolution: The chapter explores how companies are moving toward “Data Mesh” architectures, where Kafka isn’t just a side-tool but the primary way different business departments share information.
Conclusion: Wrapping Up
“Kafka for Architects” serves as a bridge between high-level business requirements and low-level technical implementation. It moves beyond the “how-to” of coding and focuses on the “why” of system design.
The synthesis of the book reveals that successful Kafka implementation relies on three pillars:
- Correct Partitioning: The key to throughput and scaling.
- Strict Schema Governance: The key to long-term maintainability and preventing system-wide breakages.
- Observability: Using tools like Cruise Control and JMX Stacks to manage the inherent complexity of distributed systems.
For anyone entering the streaming space — especially in light of the IBM/Confluent era — understanding these architectural primitives is essential for building resilient, real-time data backbones that can scale with the enterprise.
Last but not least, I’ve gained so much insight from this book! If you’re like me and want to understand the core philosophy of Apache Kafka while also getting hands-on with implementation, this is a must-read. It bridges the gap between the basics and real-world application perfectly.
>>> Thanks for reading <<<
Links
- The Book on Manning Publishing Site: https://www.manning.com/books/kafka-for-architects
- Apache Kafka Source: https://kafka.apache.org/community/downloads/
- Confluent: https://www.confluent.io/
- LinkedIn’s Cruise Control: https://github.com/linkedin/cruise-control
- JMX Monitoring Stack: https://github.com/confluentinc/jmx-monitoring-stacks/
- Kafka GitOps from “Shawn Seymour”: https://github.com/devshawn/kafka-gitops
- Terraform Provider by Confluent: https://github.com/confluentinc/terraform-provider-confluent