Understanding Apache Kafka: A Beginner's Guide to Real-time Data Streaming

#kafka #python #dataengineering #data

Apache Kafka is a technology that is mainly used to build high-performance, real-time data pipelines and streaming applications. It handles vast quantities of data and this article aims to demystify Kafka for beginners. We will explain its core concepts and show its practical application through a real-time weather data processing example.

What is Apache Kafka?

Apache Kafka is an event streaming platform that is distributed, scalable, and fault-tolerant. Think of it as a highly efficient, persistent message queue that allows different applications to communicate by sending and receiving data streams. Kafka facilitates a publish-subscribe model where data producers send messages to a central system, and data consumers can read these messages independently.

Key Components of Apache Kafka

To understand how Kafka works, it's essential to grasp its fundamental components:

1. Producers
Producers are client applications that publish (write) data records (messages) to Kafka topics. They are responsible for creating new data and sending it to the Kafka cluster. For instance, in our weather data example, the Python script fetching weather information from an API acts as a producer.

2. Consumers
Consumers are client applications that subscribe to (read) data records from Kafka topics. They process the data streams published by producers.

3. Brokers
Brokers are the core servers that form the Kafka cluster. Each broker is a Kafka server that stores data, handles requests from producers and consumers, and replicates data for fault tolerance.

4. Topics and Partitions
Topics are categories or feeds to which records are published. They are logical channels for organizing data streams. For example, open_weather_data would be a topic for all weather-related messages. Topics are further divided into partitions, which are ordered, immutable sequences of records.

5. Zookeeper (or Kraft in newer versions)
Historically, Kafka relied on Apache ZooKeeper for managing the cluster's metadata. In newer versions of Kafka, Kraft has been introduced to remove the dependency on ZooKeeper, simplifying the architecture.

Building a Real-time Weather Data Processing Pipeline with Kafka

Let's illustrate these concepts with a practical example: a real-time weather data processing pipeline using the provided Python code. This pipeline demonstrates how producers fetch data, publish it to Kafka, and consumers then process it.

Producer Code Explanation

import requests
import json
from kafka import KafkaProducer
import time
from dotenv import load_dotenv
import os

load_dotenv()

api_key = os.getenv("API_KEY")

def get_weather_data():

    cities = ["Nairobi", "Mombasa", "Kisumu", "Eldoret", "Nakuru"]

    city_list = []

    for city in cities:

        url = f"https://api.openweathermap.org/data/2.5/weather?q={city}&appid={api_key}"

        data = requests.get(url)

        raw_data = data.json()

        city_list.append(
            {
                "city": city,
                "temperature": raw_data["main"]["temp"],
                "humidity": raw_data["main"]["humidity"],
                "description": raw_data["weather"][0]["description"],
                "last_update": raw_data["dt"]
            }
        )

    return city_list

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer = lambda p:  json.dumps(p).encode('utf-8')
                            )


while True:
    weather_data = get_weather_data()
    topic = 'open_weather_data'
    producer.send(topic, value=weather_data)
    print(f"Producer: {weather_data}")
    time.sleep(5)

The script fetches current weather data for a list of Kenyan cities from the OpenWeatherMap API.
It then extracts relevant weather details (city, temperature, humidity, description, last update timestamp) and returns them as a list of dictionaries.
We then specify the address of the brokers (localhost:9092) that the Producer will connect to.
value_serializer : Defines how the data sent to Kafka should be serialized. In our case, it converts the Python dictionary containing the weather data to json and then encodes it to utf-8, which is the format Kafka expects.
The while loop ensures we run the above steps continuously.
We then define a topic as, 'open_weather_data' and use producer.send(topic, value=weather_data) to publish the fetched weather data to the topic defined.
In our code, we've defined a pause of 5 seconds before fetching the next batch but this can be adjusted accordingly.

Consumer Code Explanation

from kafka import KafkaConsumer
import json

consumer = KafkaConsumer(
    'open_weather_data',
    bootstrap_servers='localhost:9092',
    value_deserializer = lambda m: json.loads(m.decode('utf-8')),
    auto_offset_reset='earliest'
)

for message in consumer:
    print(f"Consumer: {message.value}")

'open_weather_data' specifies the topic from which the consumer will read messages.
bootstrap_servers = 'localhost:9092' is the same as for the producer, pointing to the Kafka broker(s).
value_deserializer = lambda m: json.loads(m.decode('utf-8')) defines how the received data (value) from Kafka should be deserialized. It decodes the UTF-8 bytes back into a JSON string and then parses it into a Python dictionary.
auto_offset_reset = 'earliest' : 'earliest' means the consumer will start reading from the beginning of the topic (the earliest available offset). Other options include 'latest' (start from the most recent messages) or 'none' (throw an error if no valid offset is found).

Conclusion

Apache Kafka provides a robust and scalable solution for handling real-time data streams. By understanding its core components—producers, consumers, brokers, topics, and partitions—and seeing how they interact in a practical example like our weather data pipeline, you can begin to appreciate its power. This setup allows for efficient, decoupled communication between different parts of an application, enabling real-time data processing and analytics that are vital in today's fast-paced digital world.

Top comments (2)

Andrew Tan • May 20

Nice overview of Kafka fundamentals. One thing I'd add from production experience: the "consumer group rebalancing" you mention briefly becomes a real headache at scale. When you have hundreds of consumers and someone deploys a change, the entire cluster can spend minutes rebalancing partitions while your latency spikes. Just sayin' :-)

Ng'ang'a Njongo • May 20

Hi Andrew,

Thanks for the feedback. I'm still early on in my Kafka journey, looking forward to troubleshooting on a prod environment