Sarah Muriithi

Apache Kafka Deep Dive: Core Concepts, Data Engineering Applications and Real-World Production Practices.

Apache Kafka is an open-source, distributed event streaming platform designed to handle real-time streams of data at scale in a fault-tolerant way.

1. Core Concepts.

Clusters.

  • A collection of brokers (servers) working together to provide fault tolerance, scalability and high throughput.

  • Clusters can handle millions of messages per second in distributed systems.

Topic.

  • A topic is a logical channel where messages are produced and consumed.

  • Each topic is split into partitions for parallel processing.

  • Creating a Kafka topic:

bin/kafka-topics.sh --create \
  --topic my-first-topic \
  --bootstrap-server localhost:9092 \
  --partitions 3 \
  --replication-factor 1

  • Listing all topics:

bin/kafka-topics.sh --list --bootstrap-server localhost:9092

Partition.

  • It is an ordered append-only sequence of records inside a topic.

  • They enable parallel consumption and allow horizontal scaling.
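
  • Adding partitions to an existing topic (using the my-first-topic created above; partition counts can only be increased, never reduced):

bin/kafka-topics.sh --alter --topic my-first-topic --partitions 6 --bootstrap-server localhost:9092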

Brokers.

  • A Kafka server that stores partitions and serves clients.

  • Kafka brokers manage topic partitions, message replication, and data storage and retrieval.
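
  • Starting a broker with the default configuration that ships with Kafka:

bin/kafka-server-start.sh config/server.properties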

Producers.

  • They send messages (events) to Kafka topics.

  • They ensure that messages with the same key go to the same partition.

  • Writing messages to a Kafka topic:

bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic customer_orders
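
  • Sending keyed messages, so that records with the same key land in the same partition (the ":" key separator below is an illustrative choice):

bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic customer_orders \
  --property parse.key=true \
  --property key.separator=:

  • A line typed as user42:order-data is then routed by the key user42.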

Consumers.

  • They read messages from Kafka topics.

  • Consumers belong to consumer groups, thus allowing parallel processing.

  • Reading messages from a topic:

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic customer_orders --from-beginning
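
  • Reading as part of a consumer group (the group name order-processors is illustrative); starting several consumers with the same group splits the topic's partitions among them:

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic customer_orders --group order-processors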

Offset.

  • A unique identifier for each message within a partition.
  • Consumers use offsets to track which messages they’ve already read.
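
  • Inspecting a group's committed offsets and lag (assuming the order-processors group from above):

bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group order-processors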

ZooKeeper vs KRaft.

  • ZooKeeper is an external system used by Kafka for metadata management and cluster coordination.

  • KRaft is a ZooKeeper-free mode in which Kafka manages its own metadata using the Raft consensus algorithm.

  • Starting ZooKeeper (for clusters still running in ZooKeeper mode):

bin/zookeeper-server-start.sh config/zookeeper.properties
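
  • Starting a broker in KRaft mode instead (Kafka 3.x ships a sample KRaft config; the storage directory must be formatted with a cluster ID first):

KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"
bin/kafka-storage.sh format -t $KAFKA_CLUSTER_ID -c config/kraft/server.properties
bin/kafka-server-start.sh config/kraft/server.properties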

Kafka Connect.

  • Kafka Connect is used to stream data between Kafka and external data systems like databases, file systems, and cloud storage.

  • Running Kafka Connect in standalone mode:

bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties
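
  • The connect-file-source.properties file referenced above ships with Kafka; it streams lines from a text file into a topic, and looks roughly like this (file and topic names are the shipped defaults):

name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=test.txt
topic=connect-test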

Replication.

  • Kafka keeps multiple copies of partitions across brokers which provides fault tolerance.
  • One broker hosts the leader partition, others host replicas (followers).
  • Producers and consumers talk to the leader.
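
  • Describing a topic shows which broker leads each partition and which replicas are in sync (the ISR list):

bin/kafka-topics.sh --describe --topic my-first-topic --bootstrap-server localhost:9092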

Retention.

  • Kafka doesn’t delete messages immediately after consumption.
  • Messages are kept based on time (e.g., 7 days) or size (e.g., 1GB).
  • Allows consumers to re-read messages later (useful for replaying events).
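
  • Setting a per-topic retention period (604800000 ms is 7 days; the topic is the one used in earlier examples):

bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter --entity-type topics --entity-name customer_orders --add-config retention.ms=604800000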

2. Data Engineering Applications.

Real-Time Data Ingestion.

  • Streams data from multiple sources (databases, APIs, IoT devices, apps).
  • Example: Collecting clickstream events from a website.
  • Tools: Kafka Connect + source connectors (e.g., Debezium).

ETL/ELT Pipelines.

  • Kafka acts as the transport layer in ETL workflows.
  • Data can be cleaned/transformed on the fly using Kafka Streams.

Data Lake/Warehouse Ingestion.

  • Kafka feeds batch & streaming data into storage systems.
  • Examples:

    - Write data to S3 (data lake); see the sink-connector sketch below.
    - Send cleaned data into Snowflake, BigQuery, or Redshift.

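  • A minimal sketch of a Kafka Connect S3 sink config, assuming Confluent's S3 sink connector is installed (bucket, region, and flush size are illustrative):

name=s3-sink
connector.class=io.confluent.connect.s3.S3SinkConnector
tasks.max=1
topics=customer_orders
s3.bucket.name=my-data-lake
s3.region=us-east-1
storage.class=io.confluent.connect.s3.storage.S3Storage
format.class=io.confluent.connect.s3.format.json.JsonFormat
flush.size=1000
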
Change Data Capture.

  • Tools like Debezium integrate with Kafka to capture changes from relational databases (MySQL, Postgres, Oracle).
  • This enables real-time ETL, replication, and synchronization across systems.
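
  • Registering a Debezium MySQL source through the Connect REST API might look roughly like this (hostnames, credentials and names are placeholders, and exact config keys vary by Debezium version):

curl -X POST -H "Content-Type: application/json" http://localhost:8083/connectors -d '{
  "name": "mysql-orders-cdc",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "secret",
    "database.server.id": "184054",
    "database.include.list": "shop",
    "topic.prefix": "shop"
  }
}'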

Machine Learning Pipelines.

  • Kafka streams feed real-time features into ML models, powering use cases like fraud detection, dynamic pricing, or recommendation systems.

3. Real-World Production Practices.

Netflix

  • Netflix uses Kafka for real-time monitoring, event sourcing and recommendations.

  • Every playback event or error is streamed to Kafka for analysis.

LinkedIn

  • LinkedIn uses Kafka to process 1 trillion messages per day for activity tracking, search indexing and fraud detection.

Uber

  • Uber relies on Kafka to match riders with drivers, handle surge pricing and provide real-time estimated times of arrival (ETAs).
