Cliffe Okoth

Posted on May 18 • Edited on Jun 19

How Apache Kafka Powers Real-Time Data Pipelines

#eventdriven #kafka #ai #dataengineering

Most standard data pipelines run on a schedule. You use tools like Airflow and dbt to extract and transform large batches of data once a day. However, what would happen if the data wasn't collected but rather is being collected in the moment.

This is where streaming comes in. You would require a system designed for continuous event streaming like Apache Kafka, an open-source distributed event streaming platform.
It acts as a massive central nervous system, allowing data to flow continuously from source to destination.

To understand how Kafka works, I'll break down its core concepts using a live streaming project:
a pipeline that extracts real-time weather data from the OpenWeatherMap API and streams it directly into a Cassandra database.

Let's look at Kafka's architecture.

Broker

Kafka does not run on a single machine, it is a distributed system.

A Broker is a single Kafka server responsible for receiving, storing and serving messages.

A Cluster is a group of brokers working together. If one broker fails, the cluster ensures the data is replicated and safe elsewhere.

In the project's code, you can see the connection to the broker defined via the bootstrap_servers parameter pointing to localhost:9092.

Events

In Kafka, an event (also record or message) records the fact that 'something happened.' They consist of a key, value, timestamp and headers and cannot be updated or changed. In the streaming pipeline, this is the json response extracted from the weather api.

Topic

Whereas in a database you insert data into a table, in Kafka, you push data to a topic. In the weather pipeline, the topic is simply defined as:

topic = 'weather_info'

Every API response pulled will be published to this specific topic.

producer.send(topic, {api_response})

To ensure the system can scale horizontally and process millions of messages simultaneously, topics are split into Partitions. They allow multiple consumers to read from the same topic in parallel.

Within each partition, messages are assigned a unique sequential ID known as an Offset. This allows consumers to track exactly where they left off in reading the stream, ensuring no data is skipped or read twice.

Producer

A producer is any application that publishes data to a Kafka topic. Their only job is to gather data and push it to the broker. For this project, producer.py acts as the producer. It requests data from the weather api, receives a json payload, and sends it to the topic every 5 seconds.

producer = KafkaProducer(
    bootstrap_servers='localhost:9092',
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
    )

while True:
    results = extract()
    producer.send(topic, results)
    time.sleep(5)

Serialization

Kafka is designed for maximum throughput, meaning it doesn't process the internal structure of your data. To Kafka, a complex JSON payload or DataFrame is just an array of raw bytes hence the value_serializer argument in the code block above which transforms the data to Kafka-readable bytes.

Conversely, when the data reaches its destination, it must be deserialized back into a readable format. This is why the consumer script includes a matching deserializer value_deserializer.

Consumer

A consumer subscribes to one or more topics, reads the stream of incoming records and processes them. In consumer.py, the script acts as a continuous listener on the weather-info topic. As soon as a new weather event arrives, the consumer receives the event, flattens the nested JSON, converts Unix timestamps into standard datetime formats, and executes an INSERT statement to load the clean data into a Cassandra database table.

for message in consumer:
    raw_data = message.value

    # ... data flattening and timestamp conversion ...

    session.execute(insert_query, (
        raw_data.get('id'),
        unix_to_dt(sys_data.get('sunrise')),
        # ... other fields ...
        sys_data.get('country')
    ))

Unlike a standard Python for loop that ends when it reaches the bottom of a list, a Kafka for message in consumer loop is infinite.

Why use Kafka?
Because, if the OpenWeatherMap API surges, sending exponentially more data, and you have a Python script writing directly to Cassandra, the database might become overwhelmed and crash, taking your entire pipeline down with it.

Kafka on the other hand, acts as an indestructible shock absorber. The Producer can dump millions of records into the Kafka topic at lightning speed and Kafka will just take them. The Consumer will then read from the topic at its own pace, processing and inserting records into Cassandra as fast as it can without overwhelming the database.
And even if the consumer crashes, Kafka remembers exactly where it left off, ensuring zero data loss when it restarts.

Top comments (2)

Andrew Tan • May 29

What you describe is real-time in the most positive case with a smooth performance degradation where kafka acts as the buffer. what's your production experience with real-world data in this specific scenario?

Cliffe Okoth • Jun 19

Not much. I was just exploring Apache Kafka and its foundational concepts.