Samuel Wachira

Posted on May 21

Apache Kafka for Beginners: Building Real-Time Streaming Systems with Python

#kafka #dataengineering #luxdev

Apache Kafka is widely recognized as the go-to system for real-time event streaming. Modern systems across banking, e-commerce, healthcare, gaming and government institutions use Kafka to process massive streams of data continuously and reliably.

Kafka enables organizations to:

Process real-time events.
Store historical streams.
Build scalable distributed systems.
Power modern data engineering pipelines.
At its core, Kafka acts as a distributed event streaming platform where applications continuously exchange streams of information.

Understanding Real-Time Event Streaming
An event is simply something that happens in a system. Examples include a customer purchase, a payment transaction or a fraud alert.

Kafka captures and processes these events instantly while preserving them for future replay and analysis.
Kafka solves several major challenges in distributed systems. Its main advantages include:

Scalability
Kafka scales horizontally by adding more brokers and partitions. This lets organizations process millions of messages per second and feed real-time analytics.
Parallelization
Kafka divides topics into partitions, enabling multiple consumers to read from the same topic simultaneously. This improves performance, throughput and efficiency in distributed processing.
Persistent Storage and Rolling Files
Kafka stores events on disk using append-only logs, which are split into segment files that roll over as they fill up. Instead of deleting messages immediately, Kafka retains data for configurable periods. The benefits include historical replay, fault recovery and audit trails.

Kafka Fundamentals
Kafka systems revolve around three major components.

A). Producers

A producer is an application that writes data into Kafka topics. Examples include web applications or payment systems.

Kafka producers can be written in many programming languages, including:

Python
Java
Go
C/C++

The producer below streams student records into Kafka:

from kafka import KafkaProducer
import json
import time

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

students = [
    {"id": 1, "name": "Samwel", "course": "Data Engineering"},
    {"id": 2, "name": "Alice", "course": "AI Engineering"},
    {"id": 3, "name": "John", "course": "Cloud Computing"}
]

for student in students:
    producer.send("students", value=student)

    print("Message sent:", student)

    time.sleep(1)

producer.flush()
producer.close()

Kafka producers are designed for:

Asynchronous communication
High throughput
Partition-aware messaging
Fault tolerance.

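Producers determine which partition receives a message. By default the message key is hashed to pick the partition, so messages with the same key land in the same partition.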
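To make key-based partition selection concrete, here is a minimal sketch in plain Python. The function name choose_partition, the sample keys and the partition count of 3 are assumptions made for illustration, and CRC32 is only a stand-in for the murmur2 hash that Kafka's default partitioner uses.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition index in [0, num_partitions)."""
    # CRC32 is a stand-in hash for illustration; Kafka's default
    # partitioner uses murmur2, so real partition numbers will differ.
    return zlib.crc32(key) % num_partitions


# Hypothetical events: two of them share the key "customer_101".
events = [
    ("customer_101", "order placed"),
    ("customer_202", "order placed"),
    ("customer_101", "payment received"),
]

placements = [
    (key, choose_partition(key.encode("utf-8"), 3)) for key, _ in events
]

for key, partition in placements:
    print(f"{key} -> partition {partition}")
```

Because the hash is deterministic, both customer_101 events map to the same partition, which is what keeps per-key ordering intact. With kafka-python, a key can be supplied through the key argument of producer.send(), as bytes unless a key_serializer is configured.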

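Important note: An idempotent producer ensures that a message is not written twice when a send is retried. This prevents duplicate data during failures or network interruptions.

Idempotence builds on acknowledgements from all in-sync replicas plus retries, which are configured as shown here. These two settings alone do not deduplicate retried messages: idempotence itself is a separate producer option (enable.idempotence in the Java client), and whether a Python client exposes it depends on the library and version.

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    acks='all',   # wait for all in-sync replicas to acknowledge each write
    retries=5     # retry a failed send up to 5 times
)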

Idempotent producers are critical for:

financial transactions
payment systems
stream processing pipelines
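The deduplication behind idempotent producers can be pictured with a small simulation. This is a sketch of the idea only, not Kafka's broker code: the class PartitionLog, the producer id "producer-1" and the payment values are hypothetical. The idea it models is that each producer numbers its messages, and the log refuses a sequence number it has already accepted.

```python
class PartitionLog:
    """Toy partition log that rejects retried (duplicate) writes."""

    def __init__(self):
        self.records = []
        # Highest sequence number accepted so far, per producer id.
        self.last_sequence = {}

    def append(self, producer_id, sequence, value):
        """Append value unless this sequence number was already written."""
        if sequence <= self.last_sequence.get(producer_id, -1):
            return False  # duplicate from a retry: ignore it
        self.records.append(value)
        self.last_sequence[producer_id] = sequence
        return True


log = PartitionLog()

first = log.append("producer-1", 0, {"payment": 100})
# The acknowledgement was lost, so the producer retries the same message.
retry = log.append("producer-1", 0, {"payment": 100})
second = log.append("producer-1", 1, {"payment": 250})

print("stored records:", log.records)
```

Only two records are stored even though three sends were attempted, which is why a payment is not recorded twice after a network hiccup.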

B). Brokers
A broker is a Kafka server responsible for receiving messages, storing events and managing partitions.
A collection of brokers forms a Kafka cluster.

Kafka brokers:

Manage partitions
Replicate data
Distribute workloads
Ensure durability

A broker can manage multiple partitions simultaneously.

C). Consumers
A consumer reads messages from Kafka topics. Consumers continuously pull new events, process records and track progress using offsets.

Python Kafka Consumer Example:

from kafka import KafkaConsumer
import json

consumer = KafkaConsumer(
    'students',
    bootstrap_servers='localhost:9092',
    group_id='etl-group',
    auto_offset_reset='latest',
    enable_auto_commit=True,
    value_deserializer=lambda x: json.loads(x.decode('utf-8'))
)

print("Consumer running...")

for message in consumer:
    data = message.value

    print("Received:", data)

Consumer Groups: A consumer group allows multiple consumers to share the work of processing a topic. Its benefits include:

Scalability
Parallel processing
Fault tolerance

Example:

Consumer A reads partition 1
Consumer B reads partition 2

Consumer Rebalances: A rebalance occurs when a consumer joins the group, a consumer leaves, or the number of partitions changes.
Kafka automatically redistributes partitions across the consumers in the group. Although rebalancing improves resilience, excessive rebalancing may temporarily pause processing.
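The sketch below illustrates how partitions might be shared within a group and what changes when a consumer leaves. It uses a simple round-robin scheme written for this article; the function assign_partitions and the consumer names are hypothetical, and Kafka's built-in assignors (range, round-robin, sticky) are more involved than this.

```python
def assign_partitions(partitions, consumers):
    """Give each partition to exactly one consumer, round-robin."""
    ordered = sorted(consumers)
    assignment = {consumer: [] for consumer in ordered}
    for index, partition in enumerate(sorted(partitions)):
        owner = ordered[index % len(ordered)]
        assignment[owner].append(partition)
    return assignment


partitions = [0, 1, 2, 3]

# Two consumers in the group share the four partitions.
before = assign_partitions(partitions, ["consumer-a", "consumer-b"])

# consumer-b leaves the group, so a rebalance hands everything to consumer-a.
after = assign_partitions(partitions, ["consumer-a"])

print("before rebalance:", before)
print("after rebalance:", after)
```

Each partition always has a single owner within the group, which is why adding more consumers than partitions leaves the extra consumers idle.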

Kafka Topics and Partitions

Topics
A topic is a stream of related messages. Examples include payments, orders, fraud alerts and user_activity.
Kafka places no fixed limit on the number of topics, although every partition adds overhead for the cluster.
Partitions
Topics are divided into partitions. Each partition behaves like an append-only log.
Old message -> New Message -> Latest message

Partitions allow:

Parallel processing
Scalability
Ordered event storage (ordering is guaranteed within a partition).
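A partition's append-only behaviour can be sketched in a few lines of plain Python. The Partition class here is a toy model invented for illustration: every appended record receives the next offset, and a reader can replay from any offset it chooses.

```python
class Partition:
    """Toy append-only log: records are only ever added at the end."""

    def __init__(self):
        self._log = []

    def append(self, record):
        """Store a record and return the offset it was written at."""
        self._log.append(record)
        return len(self._log) - 1

    def read_from(self, offset):
        """Return every record at or after the given offset, in order."""
        return self._log[offset:]


partition = Partition()
offsets = [
    partition.append(event)
    for event in ["old message", "new message", "latest message"]
]

# Replay from offset 1: earlier records are untouched and still readable.
replayed = partition.read_from(1)

print("offsets:", offsets)
print("replayed:", replayed)
```

Reading does not remove anything from the log, so several consumers can read the same partition at different offsets.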

Kafka Message Structure

Every Kafka event contains:

Key: Used for partitioning
Value: Actual event data
Headers: Optional metadata
Timestamp: Event creation time.

Example:

{
 "key":"customer_101",
 "value":{
   "purchase": "laptop",
   "amount":1200
   }
}
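As a rough illustration of how the four parts fit together, the sketch below models an event as a Python dataclass. The Event class and the "source" header are hypothetical; the shapes follow what kafka-python expects when sending (bytes for key and value, a list of (str, bytes) tuples for headers, a millisecond timestamp).

```python
import json
import time
from dataclasses import dataclass, field


@dataclass
class Event:
    """Illustrative container for the parts of a Kafka message."""

    key: bytes
    value: bytes
    headers: list = field(default_factory=list)
    # Defaults to the creation time in milliseconds since the epoch.
    timestamp_ms: int = field(default_factory=lambda: int(time.time() * 1000))


event = Event(
    key=b"customer_101",
    value=json.dumps({"purchase": "laptop", "amount": 1200}).encode("utf-8"),
    headers=[("source", b"web-checkout")],  # hypothetical metadata
)

# A consumer would reverse the serialization to get the payload back.
decoded = json.loads(event.value.decode("utf-8"))
print(event.key, decoded, event.headers)
```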

Kafka ETL Example with SQLite

The following ETL consumer processes Kafka messages and stores them into SQLite.

import json
import sqlite3

from kafka import KafkaConsumer

# Subscribe to the 'students' topic as part of the 'etl-group-1' consumer group
consumer = KafkaConsumer(
    'students',
    bootstrap_servers='localhost:9092',
    group_id='etl-group-1',
    auto_offset_reset='latest',
    enable_auto_commit=True,
    value_deserializer=lambda x: json.loads(x.decode('utf-8'))
)

conn = sqlite3.connect("kafka_etl_db")
cursor = conn.cursor()

# UNIQUE(name, course) lets INSERT OR IGNORE skip duplicate records
cursor.execute("""
CREATE TABLE IF NOT EXISTS students (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    name TEXT,
    course TEXT,
    UNIQUE(name, course)
)
""")

conn.commit()

print("ETL running...")

for message in consumer:
    data = message.value

    # Transform: normalize the student name to upper case
    name = data["name"].upper()
    course = data["course"]

    # Load: insert the record, ignoring rows that already exist
    cursor.execute("""
        INSERT OR IGNORE INTO students(name, course)
        VALUES (?, ?)""", (name, course))

    conn.commit()

    print("Inserted:", name, course)
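The transform and load steps can also be tried without a running broker. The sketch below is an offline variant under stated assumptions: it uses an in-memory SQLite database, and a plain Python list stands in for the Kafka consumer. The helper names create_table and load_student are hypothetical.

```python
import sqlite3


def create_table(conn):
    """Create the students table used by the ETL step."""
    conn.execute("""
        CREATE TABLE IF NOT EXISTS students (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            name TEXT,
            course TEXT,
            UNIQUE(name, course)
        )
    """)
    conn.commit()


def load_student(conn, data):
    """Transform one message and insert it, skipping duplicates."""
    name = data["name"].upper()
    course = data["course"]
    conn.execute(
        "INSERT OR IGNORE INTO students(name, course) VALUES (?, ?)",
        (name, course),
    )
    conn.commit()


# In-memory database and a list of messages in place of a Kafka topic.
conn = sqlite3.connect(":memory:")
create_table(conn)

messages = [
    {"id": 1, "name": "Samwel", "course": "Data Engineering"},
    {"id": 2, "name": "Alice", "course": "AI Engineering"},
    {"id": 1, "name": "Samwel", "course": "Data Engineering"},  # redelivered
]

for message in messages:
    load_student(conn, message)

rows = conn.execute(
    "SELECT name, course FROM students ORDER BY id"
).fetchall()
print(rows)
```

The redelivered message does not create a second row, because the UNIQUE constraint combined with INSERT OR IGNORE makes the load step safe to repeat.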

Decoupling Producers and Consumers

Kafka decouples producers from consumers. This means:

Producers continue sending data even if consumers are slow.
Consumers can fail independently.
New consumers can be added without affecting producers.

This architecture improves scalability, resilience and flexibility.

Exactly-Once Semantics (EOS)

Kafka supports a strong transactional guarantee known as Exactly-Once Semantics (EOS). EOS ensures that each message is processed exactly once, that duplicates are prevented and that failures are handled gracefully.
This is critical in banking systems, payment platforms and fraud detection.
Kafka achieves EOS using idempotent producers, transactional APIs and coordinated offset management.

Data Retention Policies
Kafka retains messages for configurable periods. The default retention period is 7 days (log.retention.hours=168).
Retention policies support historical replay, auditing, compliance and recovery from failures.
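Time-based retention can be sketched as a filter over timestamps. This is a simplification for illustration: the function apply_retention and the sample records are hypothetical, and a real broker deletes whole log segment files once they age out rather than individual records.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=7)  # mirrors the one-week default


def apply_retention(records, now, retention=RETENTION):
    """Keep only records that are still inside the retention window."""
    cutoff = now - retention
    return [record for record in records if record["timestamp"] >= cutoff]


# Fixed reference time so the example is reproducible.
now = datetime(2024, 5, 21, 12, 0, 0)

records = [
    {"offset": 0, "timestamp": now - timedelta(days=10)},
    {"offset": 1, "timestamp": now - timedelta(days=3)},
    {"offset": 2, "timestamp": now - timedelta(hours=1)},
]

kept = apply_retention(records, now)
print("offsets still available:", [record["offset"] for record in kept])
```

The record at offset 0 is older than seven days and is dropped, while the surviving records keep their original offsets.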

Kafka Durability and Availability
Kafka achieves durability through:

Replication: Partitions are replicated across brokers. If one broker fails, another replica automatically takes over.
Persistent Disk Storage: Kafka stores data on disk rather than in memory alone.
Consumer Offsets: Consumers track progress using offsets stored inside Kafka itself. This allows consumers to resume after failures.

Kafka Security Overview
Security is essential in production Kafka deployments. Kafka supports Encryption in Transit (SSL/TLS encryption to protect data across networks), Authentication (SASL and SSL certificates) and Authorization (Access Control Lists, or ACLs, for controlling producer and consumer permissions).
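For a Python client, these options are usually passed as connection settings. The dictionary below is a sketch of what that might look like with kafka-python: the parameter names are ones the library accepts, but the host name, certificate path, user name and password are placeholders invented for this example.

```python
# Connection settings for an encrypted, authenticated client.
# All values are placeholders; replace them with your cluster's details.
secure_config = {
    "bootstrap_servers": "broker.example.com:9093",  # hypothetical host
    "security_protocol": "SASL_SSL",      # TLS encryption plus SASL login
    "ssl_cafile": "/path/to/ca.pem",      # CA certificate used to verify brokers
    "sasl_mechanism": "PLAIN",
    "sasl_plain_username": "etl-service",  # hypothetical account
    "sasl_plain_password": "change-me",    # load from a secret store in practice
}

for name in sorted(secure_config):
    # Avoid printing the password itself.
    shown = "***" if "password" in name else secure_config[name]
    print(f"{name} = {shown}")
```

The same dictionary could then be unpacked into a client, for example KafkaProducer(**secure_config). Authorization is not part of the client configuration: ACLs are defined on the cluster side.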

Kafka does not provide built-in encryption at rest by default. Organizations often combine Kafka with encrypted disks, cloud encryption services and enterprise security tools.

Kafka Troubleshooting Methods

Confluent Control Center: Provides a dashboard for monitoring consumer lag, brokers, throughput and cluster health.
Log Files: Kafka logs help diagnose broker failures, replication issues, authentication errors and network problems.
SSL Logging and Authorizer Debugging: Special debugging configurations help troubleshoot SSL handshake issues, authorization failures and ACL problems.

High-Level Kafka Consumer Logic
A Kafka consumer generally follows this workflow:


  Connect to Kafka
      ↓
  Subscribe to Topic
      ↓
  Pull Messages
      ↓
  Process Data
      ↓
  Store Results
      ↓
  Commit Offsets

This loop runs continuously in real time.
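The workflow above, and the way committed offsets let a consumer resume, can be simulated without a broker. In this sketch a Python list plays the role of a partition and another list plays the role of the result store; run_consumer and its max_messages parameter are hypothetical names used to mimic a consumer that stops partway through.

```python
def run_consumer(log, committed_offset, sink, max_messages=None):
    """Pull, process, store and commit; return the new committed offset."""
    offset = committed_offset
    processed = 0
    while offset < len(log):
        if max_messages is not None and processed >= max_messages:
            break  # simulate the consumer stopping early
        message = log[offset]      # pull the next message
        result = message.upper()   # process the data
        sink.append(result)        # store the result
        offset += 1                # commit: the next offset to read
        processed += 1
    return offset


log = ["a", "b", "c", "d"]
sink = []

# First run stops after two messages, as if the consumer crashed.
committed = run_consumer(log, 0, sink, max_messages=2)
print("committed offset after first run:", committed)

# A restarted consumer picks up from the committed offset.
committed = run_consumer(log, committed, sink)
print("committed offset after restart:", committed)
print("stored results:", sink)
```

Committing after the result is stored means a crash between the two steps leads to a message being processed again rather than lost, which is one reason duplicate-tolerant writes are useful in the load step.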

Modern Kafka and KRaft
Traditional Kafka relied on ZooKeeper for:

Cluster coordination
Metadata management
Broker synchronization

Modern Kafka deployments now use KRaft mode, which eliminates the ZooKeeper dependency. Its benefits include:

Simpler architecture
Easier scaling
Faster startup
Fewer operational issues

Conclusion
Apache Kafka has become one of the most important technologies in modern data engineering. Its combination of scalability, durability, stream processing, exactly-once guarantees and fault tolerance makes it ideal for building modern real-time distributed systems.

Kafka powers systems capable of processing millions of events every second while maintaining reliability and performance.
