Sarah Muriithi

Apache Kafka Deep Dive: Core Concepts, Data Engineering Applications and Real-World Production Practices.

Apache Kafka is an open-source, distributed event streaming platform designed to handle real-time streams of data at scale in a fault-tolerant way.

1. Core Concepts.

Clusters.

  • A collection of brokers (servers) working together to provide fault tolerance, scalability and high throughput.

  • Clusters can handle millions of messages per second in distributed systems.

Topic.

  • A topic is a logical channel where messages are produced and consumed.

  • Each topic is split into partitions for parallel processing.

  • Creating a Kafka topic:

bin/kafka-topics.sh --create \
  --topic my-first-topic \
  --bootstrap-server localhost:9092 \
  --partitions 3 \
  --replication-factor 1

  • Listing all topics:

bin/kafka-topics.sh --list --bootstrap-server localhost:9092

Partition.

  • It is an ordered append-only sequence of records inside a topic.

  • They enable parallel consumption and allow horizontal scaling.
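
  • Adding partitions to an existing topic (using the my-first-topic created above; partition counts can only be increased, never reduced):

bin/kafka-topics.sh --alter --topic my-first-topic --partitions 6 --bootstrap-server localhost:9092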

Brokers.

  • A Kafka server that stores partitions and serves clients.

  • Kafka brokers manage topic partitions, message replication, and data storage and retrieval.
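
  • Starting a broker with the default configuration that ships with Kafka:

bin/kafka-server-start.sh config/server.properties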

Producers.

  • They send messages (events) to Kafka topics.

  • They ensure that messages with the same key go to the same partition.

  • Writing messages to a Kafka topic:

bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic customer_orders
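
  • Sending keyed messages, so that records with the same key land in the same partition (the ":" key separator below is an illustrative choice):

bin/kafka-console-producer.sh --bootstrap-server localhost:9092 --topic customer_orders \
  --property parse.key=true \
  --property key.separator=:

  • A line typed as user42:order-data is then routed by the key user42.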

Consumers.

  • They read messages from Kafka topics.

  • Consumers belong to consumer groups, thus allowing parallel processing.

  • Reading messages from a topic:

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic customer_orders --from-beginning
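
  • Reading as part of a consumer group (the group name order-processors is illustrative); starting several consumers with the same group splits the topic's partitions among them:

bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic customer_orders --group order-processors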

Offset.

  • A unique identifier for each message within a partition.
  • Consumers use offsets to track which messages they’ve already read.
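
  • Inspecting a group's committed offsets and lag (assuming the order-processors group from above):

bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group order-processors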

ZooKeeper vs KRaft.

  • ZooKeeper is an external system used by Kafka for metadata management and cluster coordination.

  • KRaft is a ZooKeeper-free mode in which Kafka manages its own metadata using the Raft consensus algorithm.

  • Starting ZooKeeper (for clusters still running in ZooKeeper mode):

bin/zookeeper-server-start.sh config/zookeeper.properties
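
  • Starting a broker in KRaft mode instead (Kafka 3.x ships a sample KRaft config; the storage directory must be formatted with a cluster ID first):

KAFKA_CLUSTER_ID="$(bin/kafka-storage.sh random-uuid)"
bin/kafka-storage.sh format -t $KAFKA_CLUSTER_ID -c config/kraft/server.properties
bin/kafka-server-start.sh config/kraft/server.properties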

Kafka Connect.

  • Kafka Connect is used to stream data between Kafka and external data systems like databases, file systems, and cloud storage.

  • Running Kafka Connect in standalone mode:

bin/connect-standalone.sh config/connect-standalone.properties config/connect-file-source.properties
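
  • The connect-file-source.properties file referenced above ships with Kafka; it streams lines from a text file into a topic, and looks roughly like this (file and topic names are the shipped defaults):

name=local-file-source
connector.class=FileStreamSource
tasks.max=1
file=test.txt
topic=connect-test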

Replication.

  • Kafka keeps multiple copies of partitions across brokers which provides fault tolerance.
  • One broker hosts the leader partition, others host replicas (followers).
  • Producers and consumers talk to the leader.
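
  • Describing a topic shows which broker leads each partition and which replicas are in sync (the ISR list):

bin/kafka-topics.sh --describe --topic my-first-topic --bootstrap-server localhost:9092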

Retention.

  • Kafka doesn’t delete messages immediately after consumption.
  • Messages are kept based on time (e.g., 7 days) or size (e.g., 1GB).
  • Allows consumers to re-read messages later (useful for replaying events).
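
  • Setting a per-topic retention period (604800000 ms is 7 days; the topic is the one used in earlier examples):

bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter --entity-type topics --entity-name customer_orders --add-config retention.ms=604800000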

2. Data Engineering Applications.

Real-Time Data Ingestion.

  • Streams data from multiple sources (databases, APIs, IoT devices, apps).
  • Example: Collecting clickstream events from a website.
  • Tools: Kafka Connect + source connectors (e.g., Debezium).

ETL/ELT Pipelines.

  • Kafka acts as the transport layer in ETL workflows.
  • Data can be cleaned/transformed on the fly using Kafka Streams.

Data Lake/Warehouse Ingestion.

  • Kafka feeds batch & streaming data into storage systems.
  • Examples:

    - Write data to S3 (data lake); see the sink-connector sketch below.
    - Send cleaned data into Snowflake, BigQuery, or Redshift.

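  • A minimal sketch of a Kafka Connect S3 sink config, assuming Confluent's S3 sink connector is installed (bucket, region, and flush size are illustrative):

name=s3-sink
connector.class=io.confluent.connect.s3.S3SinkConnector
tasks.max=1
topics=customer_orders
s3.bucket.name=my-data-lake
s3.region=us-east-1
storage.class=io.confluent.connect.s3.storage.S3Storage
format.class=io.confluent.connect.s3.format.json.JsonFormat
flush.size=1000
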
Change Data Capture.

  • Tools like Debezium integrate with Kafka to capture changes from relational databases (MySQL, Postgres, Oracle).
  • This enables real-time ETL, replication, and synchronization across systems.
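
  • Registering a Debezium MySQL source through the Connect REST API might look roughly like this (hostnames, credentials and names are placeholders, and exact config keys vary by Debezium version):

curl -X POST -H "Content-Type: application/json" http://localhost:8083/connectors -d '{
  "name": "mysql-orders-cdc",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "secret",
    "database.server.id": "184054",
    "database.include.list": "shop",
    "topic.prefix": "shop"
  }
}'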

Machine Learning Pipelines.

  • Kafka streams feed real-time features into ML models, powering use cases like fraud detection, dynamic pricing, or recommendation systems.

3. Real-World Production Practices.

Netflix

  • Netflix uses Kafka for real-time monitoring, event sourcing and recommendations.

  • Every playback event or error is streamed to Kafka for analysis.

LinkedIn

  • LinkedIn uses Kafka to process 1 trillion messages per day for activity tracking, search indexing and fraud detection.

Uber

  • Uber relies on Kafka to match riders with drivers, handle surge pricing and provide real-time estimated times of arrival (ETAs).
