1. Introduction
At its core, Apache Kafka is a distributed event streaming platform used to publish, store, and process streams of records in a fault-tolerant, horizontally scalable way. Kafka is widely used for activity tracking, real-time analytics, stream processing, and as a central data bus across microservices and data systems. The official documentation describes Kafka as a distributed system of brokers and clients with a high-performance TCP protocol, designed to serve as a durable, ordered commit-log.
2. Core concepts and architecture
An event records the fact that "something happened" in the world or in your business. It is also called a record or a message in the documentation. When you read or write data to Kafka, you do this in the form of events. Conceptually, an event has a key, a value, a timestamp, and optional metadata headers. Here is an example event:
Event key: "Alice"
Event value: "Made a payment of $200 to Bob"
Event timestamp: "Jun. 25, 2020 at 2:06 p.m."
Producers are those client applications that publish (write) events to Kafka, and consumers are those that subscribe to (read and process) these events. In Kafka, producers and consumers are fully decoupled and agnostic of each other, which is a key design element to achieve the high scalability that Kafka is known for. For example, producers never need to wait for consumers. Kafka provides various guarantees such as the ability to process events exactly-once.
Events are organized and durably stored in topics. Very simplified, a topic is similar to a folder in a filesystem, and the events are the files in that folder. An example topic name could be "payments". Topics in Kafka are always multi-producer and multi-subscriber: a topic can have zero, one, or many producers that write events to it, as well as zero, one, or many consumers that subscribe to these events. Events in a topic can be read as often as needed—unlike traditional messaging systems, events are not deleted after consumption. Instead, you define for how long Kafka should retain your events through a per-topic configuration setting, after which old events will be discarded. Kafka's performance is effectively constant with respect to data size, so storing data for a long time is perfectly fine.
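To make the per-topic retention setting concrete, here is a minimal sketch that creates the "payments" topic with the Java AdminClient. The broker address, partition count, replication factor, and seven-day retention are illustrative assumptions, not values from any particular deployment.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreatePaymentsTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Broker address is an assumption for this sketch.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // "payments" topic with 3 partitions and replication factor 2 (illustrative values).
            NewTopic payments = new NewTopic("payments", 3, (short) 2);
            // Per-topic retention: keep events for 7 days before discarding them.
            payments.configs(Map.of("retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000)));
            admin.createTopics(List.of(payments)).all().get();
        }
    }
}
```

Once the configured retention elapses, the broker discards old events whether or not they have been consumed.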
Topics are partitioned, meaning a topic is spread over a number of "buckets" located on different Kafka brokers. This distributed placement of your data is very important for scalability because it allows client applications to both read and write the data from/to many brokers at the same time. When a new event is published to a topic, it is actually appended to one of the topic's partitions. Events with the same event key (e.g., a customer or vehicle ID) are written to the same partition, and Kafka guarantees that any consumer of a given topic-partition will always read that partition's events in exactly the same order as they were written.
```
Topic: orders
Partitions: P0, P1

Broker A (leader of P0)         Broker B (leader of P1)
          |                               |
          v                               v
Replica on Broker C             Replica on Broker A
      (follower)                      (follower)
```
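To see the key-to-partition guarantee in code, here is a minimal producer sketch against the "orders" topic from the diagram above; the broker address, string serializers, and the key "customer-42" are assumptions for illustration. Because both records share the same key, they are hashed to the same partition and will be read back in exactly this order.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Both events use the key "customer-42", so they go to the same
            // partition of "orders" and preserve their relative order.
            producer.send(new ProducerRecord<>("orders", "customer-42", "order created"));
            producer.send(new ProducerRecord<>("orders", "customer-42", "order shipped"));
            producer.flush();
        }
    }
}
```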
Producers & consumers; consumer groups
Producers publish records to topics (optionally controlling partition choice). Consumers read from topics; consumers that belong to the same consumer group coordinate so each partition is consumed by exactly one group member — enabling both parallel processing and scalability. Offsets (per partition) track read progress and can be committed either automatically or manually for stronger processing guarantees.
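A matching consumer sketch, again with an assumed broker address and an illustrative group id "order-processors": every member of this group receives a disjoint subset of the topic's partitions. Disabling auto-commit and calling commitSync() only after the records have been processed yields the at-least-once behaviour described in the next subsection.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OrderConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");   // assumption
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors");           // illustrative group id
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");            // commit manually
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Partitions of "orders" are balanced across all members of this group.
            consumer.subscribe(List.of("orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d key=%s value=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
                }
                // Committing after processing gives at-least-once semantics.
                consumer.commitSync();
            }
        }
    }
}
```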
Delivery semantics, offsets and retention
Kafka supports configurable delivery semantics:
- At most once: offsets are committed before processing, so records are never reprocessed but can be lost if the consumer fails mid-processing.
- At least once: offsets are committed after processing (the common default pattern), so records are never lost but may be reprocessed after a failure.
- Exactly-once semantics (EOS): the transactional API and Kafka Streams provide end-to-end exactly-once processing across producers, brokers, and consumers.

Topics also have configurable retention, time-based or size-based, and log compaction can retain the latest value per key for compacted topics.

3. Stream processing and Kafka Streams / Connect
Kafka is more than messaging: it is a streaming platform. Two core components:
Kafka Connect: pluggable framework for moving data in/out of Kafka (sources and sinks) — e.g., JDBC, HDFS, S3, cloud connectors. Great for CDC/ETL.
Kafka Streams: a lightweight Java library for building stateful and stateless stream processing apps. It supports windowing, joins, state stores, and integrates with Kafka’s exactly-once semantics when configured.
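As a sketch of what a Streams application looks like, here is a minimal stateless topology that filters a "payments" topic into a "large-payments" topic. The topic names, application id, broker address, and the assumption that the record value is a numeric amount are all illustrative; the processing.guarantee setting shows how exactly-once processing is opted into (the exactly_once_v2 value requires Kafka 3.0 or newer).

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class LargePaymentsApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "large-payments-app"); // illustrative id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumption
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // Opt in to Kafka's exactly-once processing guarantee (Kafka 3.0+ naming).
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);

        StreamsBuilder builder = new StreamsBuilder();
        // Read payment amounts from "payments" and forward only the large ones.
        KStream<String, String> payments = builder.stream("payments");
        payments
                .filter((customer, amount) -> Double.parseDouble(amount) >= 100.0)
                .to("large-payments");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```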
4. Data-engineering patterns and architectures
Event sourcing / commit log
Kafka often serves as a durable commit-log: every change (event) is stored in order, enabling replay and recovery. This model simplifies replicating state between systems and decouples producers and consumers. LinkedIn’s original design rationale for Kafka emphasized using a single unified log for both online and offline consumers.
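To sketch the replay idea, here is a minimal consumer that rewinds to the beginning of an illustrative "account-events" topic and folds every retained event back into an in-memory map (last value per key wins); the topic name and broker address are assumptions.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class AccountStateRebuilder {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        Map<String, String> latestStatePerKey = new HashMap<>();
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Assign all partitions of the (illustrative) "account-events" topic explicitly,
            // then rewind to the beginning to replay everything still retained in the log.
            List<TopicPartition> partitions = consumer.partitionsFor("account-events").stream()
                    .map(info -> new TopicPartition(info.topic(), info.partition()))
                    .collect(Collectors.toList());
            consumer.assign(partitions);
            consumer.seekToBeginning(partitions);

            ConsumerRecords<String, String> records;
            do {
                records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Fold each event into the current state: last write per key wins.
                    latestStatePerKey.put(record.key(), record.value());
                }
            } while (!records.isEmpty()); // stop once the backlog is drained (sketch heuristic)
        }
        System.out.println("Rebuilt state for " + latestStatePerKey.size() + " keys");
    }
}
```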
Real-time ETL and CDC pipelines
Change Data Capture (CDC) tools (Debezium, Confluent connectors) stream changes from databases into Kafka topics; downstream consumers handle enrichment, analytics, or loading into data lakes/warehouses. Kafka's durability and retention allow late consumers and reprocessing without re-ingestion.
5. Use cases
LinkedIn
LinkedIn originally built Kafka to handle activity streams and log ingestion, unifying online systems and offline analytics on a single log abstraction. That origin story shaped Kafka’s design goals: high throughput, low latency, and cheap sequential writes/reads for many consumers. LinkedIn has published many lessons learned from scaling Kafka internally.
Netflix
Netflix uses Kafka heavily as the backbone for eventing, messaging, and stream processing across studio and product domains — powering real-time personalization, event propagation, and operational telemetry. Netflix runs Kafka as a platform (self-service) that supports multi-tenant workloads and integrates with their stream processing and storage systems. Confluent’s coverage and Netflix engineering posts discuss Kafka’s place in their architecture.
Uber
Uber treats Kafka as a cornerstone of its data and microservice architecture, used for pub/sub between hundreds of microservices, for real-time pipelines, and for tiered storage strategies to manage petabytes of events. Uber engineering has written at length on securing, auditing, and operating Kafka at massive scale (multi-region replication, consumer proxies, auditing tools).