You've probably heard of Kafka, right? But how did it come into existence, and what kinds of problems does it solve?
Kafka was developed at LinkedIn (2010) to handle massive streams of user activity data and logs. In a 2015 post, Mammad Zadeh described Kafka as the messaging backbone that helps LinkedIn's many applications work together in a loosely coupled manner. At LinkedIn, the main use cases are:
- Activity Stream Tracking: Every click, profile view, search, or action is published to Kafka topics for analytics.
- Log Aggregation: Instead of services writing to files, logs are centralized via Kafka.
- Real-Time Analytics: Metrics like "how many people viewed my profile in the last 10 minutes" are powered by Kafka.
- Data Pipeline Backbone: Kafka acts as a central bus to feed data to Hadoop, monitoring systems, and other consumers.
This article will explore Apache Kafka and dive deeper to understand its core concepts.
What is Apache Kafka?
Apache Kafka is an open-source, distributed, event-streaming system that processes real-time data.
Kafka has three main functions:
- It enables applications to publish and subscribe to streams of events.
- It processes streams of events in real time.
- It stores streams of records durably as they occur.
What is event-streaming?
Event streaming is the real-time capture of data as it is produced by event sources such as databases, APIs, IoT devices, cloud services, and other software applications.
How does Kafka Work?
Kafka has two messaging models, queuing and publish-subscribe. Queuing distributes data processing across multiple consumers, enabling scalability, while publish-subscribe supports multiple subscribers but sends every message to all, limiting workload distribution. Kafka resolves this by using a partitioned log model. A log is an ordered record sequence, divided into partitions that can be assigned to different subscribers. This design allows multiple consumers to process the same topic while balancing the workload efficiently. Additionally, Kafka supports replayability, enabling independent applications to read and reprocess data streams at their own pace, ensuring flexibility, scalability, and reliability in real-time data processing.
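To make the partitioned log model concrete, here is a minimal sketch using the kafka-python client (the topic name and key are made up for illustration): events that share a key are hashed to the same partition, so their order is preserved for whoever consumes that partition.
# partition_key_demo.py -- illustrative sketch, assumes a broker at localhost:9092
# and a topic "user_events" that already exists with several partitions.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: k.encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# All events for the same key (here, a user id) go to the same partition,
# so a consumer of that partition sees them in the order they were sent.
for action in ["login", "view_profile", "logout"]:
    producer.send("user_events", key="user-42", value={"user": "user-42", "action": action})

producer.flush()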
Kafka Concepts Summary
a). Event: A record of something that happened (key, value, timestamp, headers).
b). Producer: Writes events to topics.
c). Consumer: Reads events from topics.
d). Topic: Stores events (like a folder).
e). Partition: Subset of a topic; preserves order for events with the same key.
f). Replication: Multiple copies of partitions for fault tolerance (commonly 3).
g). Retention: Events kept for a configurable time, not deleted on read.
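To see topics, partitions, and replication together in code, here is a hedged sketch that uses kafka-python's admin client to create the btc_prices topic used later in this article with three partitions (replication factor 1 is only acceptable for the single-broker dev setup below):
# create_topic_demo.py -- illustrative sketch, assumes a broker at localhost:9092
from kafka.admin import KafkaAdminClient, NewTopic

admin = KafkaAdminClient(bootstrap_servers="localhost:9092")

# Three partitions let up to three consumers in one group share the work;
# replication_factor=1 means no fault tolerance (fine only for local testing).
topic = NewTopic(name="btc_prices", num_partitions=3, replication_factor=1)
admin.create_topics(new_topics=[topic])
admin.close()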
A simple Quickstart project (via Docker)
Here is a simple quickstart project in Python that streams BTC/USDT price data from the Binance API into Kafka and then consumes it back.
Prerequisites
- Kafka & Zookeeper running locally. Example Docker Compose snippet:
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:7.4.0
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181
  kafka:
    image: confluentinc/cp-kafka:7.4.0
    ports:
      - "9092:9092"
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
Run:
docker-compose up -d
- Install Python dependencies (the producer also needs requests):
pip install kafka-python requests
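Optionally, you can sanity-check that the broker is reachable before writing any data; a quick way with kafka-python is to list the topics the broker currently knows about:
# check_broker.py -- optional sanity check, assumes the Compose setup above
from kafka import KafkaConsumer

consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
print("Broker reachable. Existing topics:", consumer.topics())
consumer.close()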
Code
1. Producer: Stream BTC price from Binance to Kafka
# producer.py
import json
import time

import requests
from kafka import KafkaProducer

KAFKA_TOPIC = "btc_prices"
KAFKA_BROKER = "localhost:9092"


def get_btc_price():
    url = "https://api.binance.com/api/v3/ticker/price?symbol=BTCUSDT"
    response = requests.get(url).json()
    return response


if __name__ == "__main__":
    producer = KafkaProducer(
        bootstrap_servers=KAFKA_BROKER,
        value_serializer=lambda v: json.dumps(v).encode("utf-8")
    )

    while True:
        price_data = get_btc_price()
        producer.send(KAFKA_TOPIC, price_data)
        print(f"Sent: {price_data}")
        time.sleep(2)  # fetch price every 2 seconds
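One thing worth knowing: producer.send() is asynchronous and only queues the record locally; delivery happens in the background. If you want the script to wait for the broker's acknowledgement before printing, an optional variation of the send line inside the loop looks roughly like this (it blocks on the returned future):
from kafka.errors import KafkaError

# Optional: block until the broker acknowledges the record (or an error occurs).
try:
    metadata = producer.send(KAFKA_TOPIC, price_data).get(timeout=10)
    print(f"Sent to partition {metadata.partition}, offset {metadata.offset}")
except KafkaError as err:
    print(f"Failed to deliver message: {err}")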
2. Consumer: Read BTC price from Kafka
# consumer.py
import json

from kafka import KafkaConsumer

KAFKA_TOPIC = "btc_prices"
KAFKA_BROKER = "localhost:9092"

if __name__ == "__main__":
    consumer = KafkaConsumer(
        KAFKA_TOPIC,
        bootstrap_servers=KAFKA_BROKER,
        value_deserializer=lambda m: json.loads(m.decode("utf-8")),
        auto_offset_reset="earliest",
        enable_auto_commit=True
    )

    for message in consumer:
        print(f"Received: {message.value}")
Run the Project
- Start Kafka + Zookeeper with Docker:
docker-compose up -d
- Run Producer:
python producer.py
- Run Consumer:
python consumer.py
You'll see live BTC/USDT prices flowing from Binance --> Kafka --> Consumer.
Conclusion
Kafka bridges the gap between traditional queuing and publish-subscribe systems, offering a scalable, fault-tolerant, and high-performance solution for real-time data streaming. Its partitioned log architecture enables parallel processing while ensuring data consistency and replayability, making it an essential tool for modern data-driven applications. From powering Uber's trip analytics to LinkedIn's activity feeds, Kafka has proven its reliability in large-scale production environments. As organizations continue to embrace event-driven architectures, mastering Kafka will be a valuable skill for engineers seeking to build resilient, future-ready data pipelines.
Further Reading:
- Apache Kafka Documentation: https://kafka.apache.org/documentation/
- Kafka at LinkedIn: Current and Future (Mammad Zadeh, 2015)
- What is Apache Kafka?: https://www.ibm.com/think/topics/apache-kafka