Kepha Mwandiki
Apache Kafka Deep Dive: Core Concepts, Data Engineering Applications, and Real-World Production Practices

Apache Kafka is an open-source distributed event streaming platform.

What does this mean? Kafka combines three key capabilities, so you can implement your use cases for event streaming end to end with a single solution:

  1. To publish (write) and subscribe to (read) streams of events, including continuous import/export of your data from other systems.

  2. To store streams of events durably and reliably for as long as you want.

  3. To process streams of events as they occur.

How Does Kafka Work?

Kafka is a distributed system consisting of servers and clients that communicate via a high-performance TCP network protocol.

Server/Broker - The server runs the Kafka software and is responsible for:

  • Receiving messages from producers

  • Storing them in topics & partitions

  • Serving them to consumers when requested

Each broker can handle thousands of partitions and millions of messages per second.

i. Kafka clients

They allow you to write distributed applications and services that read, write, and process streams of events in parallel, at scale, and in a fault-tolerant manner even in the case of network problems or machine failures.

ii. Producer clients

Apps that send data into Kafka topics.

Example: My weather script → sends weather JSON to Kafka.

iii. Consumer clients

Apps that read data from Kafka topics.

Example: Snowflake loader → consumes weather data and inserts into a table.
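
A minimal sketch of such a loader, assuming the snowflake-connector-python package, a weather topic, and a pre-existing WEATHER table; the topic name, column names, and connection details below are placeholders:

from kafka import KafkaConsumer
import snowflake.connector
import json

# Consume weather events from Kafka (topic name is illustrative)
consumer = KafkaConsumer(
    'weather',
    bootstrap_servers='localhost:9092',
    value_deserializer=lambda x: json.loads(x.decode('utf-8'))
)

# Placeholder Snowflake credentials -- replace with your own
conn = snowflake.connector.connect(
    user='MY_USER', password='MY_PASSWORD', account='MY_ACCOUNT',
    warehouse='MY_WH', database='MY_DB', schema='PUBLIC'
)
cursor = conn.cursor()

# Insert each message into the (hypothetical) WEATHER table
for message in consumer:
    record = message.value
    cursor.execute(
        "INSERT INTO WEATHER (CITY, TEMPERATURE) VALUES (%s, %s)",
        (record.get('city'), record.get('temperature'))
    )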

iv. Admin clients

Used to manage Kafka: creating topics, configuring partitions, checking cluster health, etc.
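
With the kafka-python package used throughout this article, a minimal admin sketch looks like this (the broker address and topic settings are assumptions):

from kafka.admin import KafkaAdminClient, NewTopic

# Connect the admin client to a local broker
admin = KafkaAdminClient(bootstrap_servers='localhost:9092')

# Create a topic with 3 partitions and a replication factor of 1
admin.create_topics([
    NewTopic(name='My_Kafka_Article', num_partitions=3, replication_factor=1)
])

# Inspect the cluster: list existing topics
print(admin.list_topics())

admin.close()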

2. Apache Kafka Core Concepts

2.1 Producer

An application that sends messages (records/events) into Kafka.

from kafka import KafkaProducer
import json
import time

# Create a KafkaProducer instance
# Using a JSON serializer for demonstration purposes
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

# Define the topic to send messages to
topic_name = 'My_Kafka_Article'

# Send a few messages
for i in range(3):
    message_data = {"id": i, "message": f"This is message number {i}"}
    print(f"Sending message: {message_data}")
    producer.send(topic_name, value=message_data)
    time.sleep(1) # Simulate some delay

# Ensure all messages are sent
producer.flush()

# Close the producer
producer.close()

print("Kafka Producer finished sending messages.")

2.2 Consumer

An application that reads messages from Kafka.

from kafka import KafkaConsumer
import json

# 1. Create a Kafka Consumer Instance
consumer = KafkaConsumer(
   'My_Kafka_Article',  
    bootstrap_servers='localhost:9092', 
    # Deserialize JSON data
    value_deserializer=lambda x: json.loads(x.decode('utf-8')) 
)

# 2. Poll for Messages and Process Them
for message in consumer:
    print(f"Received message: {message.value}")


2.3 Topic

A category or channel in Kafka where data is stored; producers write to a topic and consumers read from it.
Below is an example of some of my topics.

2.4 Partition

Topics are partitioned, meaning a topic is spread over a number of "buckets" located on different Kafka brokers. This distributed placement of your data is very important for scalability because it allows client applications to both read and write the data from/to many brokers at the same time.
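
Messages that share a key always land in the same partition, which preserves ordering per key. A minimal sketch (the topic, key, and events are illustrative):

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    key_serializer=lambda k: k.encode('utf-8'),
    value_serializer=lambda v: v.encode('utf-8')
)

# All messages for "user_42" hash to the same partition,
# so that user's events keep their relative order
for event in ['login', 'click', 'logout']:
    metadata = producer.send('My_Kafka_Article', key='user_42', value=event).get(timeout=10)
    print(f"key=user_42 value={event} -> partition {metadata.partition}")

producer.flush()
producer.close()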

2.5 Broker

A Kafka server that stores data and serves producers/consumers.

Many brokers form a Kafka cluster.

2.6 Cluster

A cluster is a group of brokers working together; Kafka distributes topics and their partitions across the brokers.
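
Clients connect to a cluster by listing one or more brokers as bootstrap servers; any single reachable broker is enough for the client to discover the rest. A brief sketch with placeholder hostnames:

from kafka import KafkaProducer

# The client discovers the full cluster from whichever broker answers first
producer = KafkaProducer(
    bootstrap_servers=['broker1:9092', 'broker2:9092', 'broker3:9092']
)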

3. Data Engineering Applications for Kafka

3.1 Real-Time Data Ingestion

This is the process of continuously collecting real-time data from various sources and streaming it into a warehouse, a lake, or a streaming platform.

3.2 Log Aggregation

Log aggregation collects physical log files off servers and puts them in a central place (a file server or HDFS perhaps) for processing. Kafka abstracts away the details of files and gives a cleaner abstraction of log or event data as a stream of messages.

3.2.1 Log Aggregation Pipeline with Kafka

3.2.1.1 Log Sources - Producers:

  • Applications
  • Web servers (Apache)
  • System logs (via Fluentd, Filebeat, syslog → Kafka)

3.2.1.2 Kafka Topics:

  • Logs are written into topics like app_logs, error_logs, access_logs.

3.2.1.3 Consumers:

  • HDFS/S3 → long-term storage.
  • Monitoring tools → Grafana.
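
To make the producer side of this pipeline concrete, here is a minimal sketch that tails an application log file and publishes each new line to the app_logs topic; the file path is a placeholder, and in production a shipper like Fluentd or Filebeat would normally do this job:

from kafka import KafkaProducer
import time

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: v.encode('utf-8')
)

# Tail the log file and ship every new line to the app_logs topic
with open('/var/log/myapp.log', 'r') as log_file:
    log_file.seek(0, 2)  # start at the end of the file, like `tail -f`
    while True:
        line = log_file.readline()
        if not line:
            time.sleep(0.5)  # wait for new log lines
            continue
        producer.send('app_logs', value=line.rstrip('\n'))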

3.3 Website Activity Tracking

The original use case for Kafka was to be able to rebuild a user activity tracking pipeline as a set of real-time publish-subscribe feeds. This means site activity is published to central topics with one topic per activity type. These feeds are available for subscription for a range of use cases including real-time processing, real-time monitoring, and loading into Hadoop or offline data warehousing systems for offline processing and reporting.

3.4 Stream processing

Stream processing means working with continuous flows of data (streams) in real time.

For example, instead of analyzing yesterday’s sales, you process every transaction as it happens.

3.4.1 Typical Flow of Stream Processing

Producers → send data into Kafka (clicks, IoT, transactions).
Kafka Topics → receive and store the messages.
Stream Processor (Kafka Streams) → processes data in real time.
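
Kafka Streams itself is a Java library; as a rough Python sketch of the same consume-process-produce pattern, the loop below keeps a running sales total per product and publishes the updated totals downstream (topic names and message fields are assumptions):

from kafka import KafkaConsumer, KafkaProducer
import json

consumer = KafkaConsumer(
    'transactions',
    bootstrap_servers='localhost:9092',
    value_deserializer=lambda x: json.loads(x.decode('utf-8'))
)
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

# Update a running total per product as each transaction arrives
totals = {}
for message in consumer:
    txn = message.value  # e.g. {"product": "shoes", "amount": 59.9}
    product = txn['product']
    totals[product] = totals.get(product, 0) + txn['amount']
    producer.send('sales_totals', value={'product': product, 'total': totals[product]})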

Below is a picture of real-time data querying using Confluent, a Kafka-based streaming platform.

4. Real-World Production Practices Using Kafka

4.1 Sports

Millions of fans worldwide need real-time updates on scores, player stats, and events.

How Kafka helps:

Producers → stadium sensors, referee systems (VAR), commentary feeds.

Kafka Topics → scores, player_stats, tracking_data.

Consumers → mobile apps, e.g. LiveScore and FotMob, get instant updates.

Stream processors → aggregate, filter, and push alerts, e.g. notifying fans that a goal has been scored.

4.2 Banking & Finance

Fraud needs to be caught in seconds, and payments must be processed quickly with no duplicates.

How Kafka helps:

Producers → ATMs, mobile banking apps.

Kafka Topics → transactions, fraud_alerts.

Stream processors → aggregate transactions per user in a 5-minute window and flag anomalies, e.g. too many withdrawals within 5 minutes.

Consumers → fraud detection systems, real-time dashboards, data warehouse for historical data.
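
Here is a minimal sketch of that windowed check, counting withdrawals per user in tumbling 5-minute buckets; the message fields and the threshold are assumptions:

from kafka import KafkaConsumer
from collections import defaultdict
import json
import time

consumer = KafkaConsumer(
    'transactions',
    bootstrap_servers='localhost:9092',
    value_deserializer=lambda x: json.loads(x.decode('utf-8'))
)

WINDOW_SECONDS = 300       # 5-minute tumbling window
MAX_WITHDRAWALS = 5        # hypothetical anomaly threshold
counts = defaultdict(int)  # withdrawals per (user, window) bucket

for message in consumer:
    txn = message.value    # e.g. {"user": "alice", "type": "withdrawal"}
    if txn.get('type') != 'withdrawal':
        continue
    window = int(time.time()) // WINDOW_SECONDS  # current 5-minute bucket
    counts[(txn['user'], window)] += 1
    if counts[(txn['user'], window)] > MAX_WITHDRAWALS:
        print(f"ALERT: too many withdrawals for {txn['user']} in 5 minutes")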

4.3 Healthcare

Patient vitals, e.g. oxygen saturation and heart rate, must be monitored continuously to enable real-time follow-up.

How Kafka helps:

Producers → IoT devices on patients, e.g. wearables and hospital monitors.

Kafka Topics → patient_vitals, alerts.

Stream processors → check thresholds, e.g. heart rate above 180 bpm, and trigger emergency alerts instantly.

Consumers → doctor dashboards, alert systems, patient history databases.
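
The threshold check can be sketched as a consume-filter-produce loop; the topic names follow this section, while the message fields are assumptions:

from kafka import KafkaConsumer, KafkaProducer
import json

consumer = KafkaConsumer(
    'patient_vitals',
    bootstrap_servers='localhost:9092',
    value_deserializer=lambda x: json.loads(x.decode('utf-8'))
)
producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

# Forward any reading above the threshold to the alerts topic
for message in consumer:
    vitals = message.value  # e.g. {"patient_id": "p1", "heart_rate": 190}
    if vitals.get('heart_rate', 0) > 180:
        producer.send('alerts', value={
            'patient_id': vitals.get('patient_id'),
            'reason': 'heart rate above 180 bpm',
            'heart_rate': vitals['heart_rate']
        })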
