<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Batrudin Jamaludin</title>
    <description>The latest articles on DEV Community by Batrudin Jamaludin (@batrudin_haji).</description>
    <link>https://dev.to/batrudin_haji</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2833194%2F7a28cf3e-681c-4d88-a530-139dcf4bdb07.png</url>
      <title>DEV Community: Batrudin Jamaludin</title>
      <link>https://dev.to/batrudin_haji</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/batrudin_haji"/>
    <language>en</language>
    <item>
      <title>A Step-by-Step Guide to Streaming Live Weather Data Using Apache Kafka and Mongodb</title>
      <dc:creator>Batrudin Jamaludin</dc:creator>
      <pubDate>Mon, 07 Apr 2025 08:00:21 +0000</pubDate>
      <link>https://dev.to/batrudin_haji/a-step-by-step-guide-to-streaming-live-weather-data-using-apache-kafka-and-mongodb-306</link>
      <guid>https://dev.to/batrudin_haji/a-step-by-step-guide-to-streaming-live-weather-data-using-apache-kafka-and-mongodb-306</guid>
      <description>&lt;p&gt;In this guide, we will walk you through the process of building a real-time weather data pipeline using Apache Kafka for streaming and MongoDB for storage. The pipeline collects weather data from the OpenWeatherMap API, streams it via Kafka, and stores it in MongoDB for real-time analysis and querying. By the end of this tutorial, you’ll have a working solution capable of streaming live weather data and storing it for further analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;This project demonstrates how to create a real-time data pipeline that extracts weather data from the OpenWeatherMap API, streams it through Apache Kafka, and stores it in MongoDB. The data will be available for querying and analysis in near real-time. Here's how each component works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Weather Data Extraction: Using the OpenWeatherMap API, we fetch live weather data for a given city.&lt;/li&gt;
&lt;li&gt;Kafka Producer: The producer sends weather data to Kafka, which allows the data to be streamed to multiple consumers.&lt;/li&gt;
&lt;li&gt;Kafka Consumer: The consumer retrieves weather data from Kafka and stores it in MongoDB for persistence.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;Before proceeding with the tutorial, ensure that you have the following installed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.x&lt;/li&gt;
&lt;li&gt;Apache Kafka (with Zookeeper)&lt;/li&gt;
&lt;li&gt;MongoDB (or MongoDB Atlas if you prefer a cloud-based instance)&lt;/li&gt;
&lt;li&gt;The Confluent Kafka Python library&lt;/li&gt;
&lt;li&gt;Pandas for data transformation&lt;/li&gt;
&lt;li&gt;An OpenWeatherMap API key&lt;/li&gt;
&lt;/ul&gt;
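
&lt;p&gt;The Python dependencies can be installed with pip; this one-liner is a convenience sketch covering the libraries used in the scripts below (package names as published on PyPI):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install confluent-kafka pymongo pandas python-dotenv requests
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;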

&lt;h2&gt;
  
  
  Setting Up Apache Kafka
&lt;/h2&gt;

&lt;p&gt;Apache Kafka is the backbone of our real-time streaming solution. It decouples the producer from the consumer, so weather data can keep streaming even if one side is temporarily slow or unavailable. Kafka organizes data into topics: producers write records to a topic, and consumers read them from it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start Zookeeper (Kafka depends on it)&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bin/zookeeper-server-start.sh config/zookeeper.properties
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Start Kafka in another terminal:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bin/kafka-server-start.sh config/server.properties
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Create a Kafka Topic&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Kafka uses topics to organize data. For our weather data, we will create a topic called weather_topic.&lt;/p&gt;

&lt;p&gt;To create the topic, run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bin/kafka-topics.sh --create --topic weather_topic --bootstrap-server localhost:9092 --partitions 6 --replication-factor 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
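

&lt;p&gt;To confirm the topic exists, you can describe it (an optional sanity check):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bin/kafka-topics.sh --describe --topic weather_topic --bootstrap-server localhost:9092
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;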



&lt;h2&gt;
  
  
  Setting Up MongoDB
&lt;/h2&gt;

&lt;p&gt;We will use MongoDB to store the weather data for real-time querying and analysis. In this tutorial we use MongoDB in the cloud (MongoDB Atlas), with a database called weather_db and a collection called weather_data. To keep the application secure, we store the MongoDB connection URI in a .env file.&lt;br&gt;
&lt;strong&gt;MongoDB&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9azsp2yu29wmr2akhrpf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9azsp2yu29wmr2akhrpf.png" alt="Mongodb set up" width="800" height="320"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Building the Weather Data Extraction Script
&lt;/h2&gt;

&lt;p&gt;We will now create the script to extract weather data from OpenWeatherMap. The script will fetch the current weather for a city and format the data into a JSON object suitable for Kafka streaming.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pymongo.mongo_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MongoClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pymongo.server_api&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ServerApi&lt;/span&gt;

&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# Load env 
&lt;/span&gt;&lt;span class="n"&gt;uri&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DB_STRING&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create a new client and connect to the server
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MongoClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;server_api&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;ServerApi&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_data&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;


    &lt;span class="n"&gt;city_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Nairobi&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

    &lt;span class="c1"&gt;# get API KEY
&lt;/span&gt;    &lt;span class="n"&gt;KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;WEATHER_KEY&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;#load weather data in Nairobi,KE from openweathermap
&lt;/span&gt;    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.openweathermap.org/data/2.5/weather?q=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;city_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;appid=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;KEY&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


    &lt;span class="n"&gt;weather_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;df_weather&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weather_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;weather&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;df_temp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;weather_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;main&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="n"&gt;country&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weather_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sys&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;country&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;df_loc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;country&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;country&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;city&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;city_name&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;#merge data frame
&lt;/span&gt;    &lt;span class="n"&gt;merged_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df_weather&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;df_temp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;left_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;right_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;outer&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                        &lt;span class="n"&gt;df_loc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;left_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;right_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;how&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;outer&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
                        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;merged_df&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;transform_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;merged_df&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# drop columns 
&lt;/span&gt;    &lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;merged_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;icon&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;#tranform temp from kelvin to celcius
&lt;/span&gt;    &lt;span class="n"&gt;temp_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;temp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;feels_like&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;temp_min&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;temp_max&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;temp_list&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;temp_list&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;273&lt;/span&gt;

    &lt;span class="n"&gt;data_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_dict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orient&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;records&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;#Convert the DataFrame to a list of dictionaries
&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data_dict&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_dict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

    &lt;span class="c1"&gt;#load to mongodb
&lt;/span&gt;
    &lt;span class="n"&gt;db&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;weather_db&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;#create db 
&lt;/span&gt;    &lt;span class="n"&gt;collection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;weather_data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="c1"&gt;#creates collection
&lt;/span&gt;

    &lt;span class="c1"&gt;# data_dict = df.to_dict(orient="records") #Convert the DataFrame to a list of dictionaries
&lt;/span&gt;
    &lt;span class="c1"&gt;# print(data_dict)
&lt;/span&gt;
    &lt;span class="n"&gt;collection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert_many&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# insert the entire DataFrame into the collection
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
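

&lt;p&gt;If you want to test the ETL functions on their own, a quick manual run might look like this (assuming the script above is saved as app.py, which is what the producer's import below expects):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# assuming the script above is saved as app.py
from app import extract_data, transform_data, load_data

records = transform_data(extract_data())  # fetch one snapshot and clean it
load_data(records)                        # write it straight to MongoDB, bypassing Kafka
print(records)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;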



&lt;p&gt;&lt;strong&gt;Kafka Producer Implementation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;confluent_kafka&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Producer&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;extract_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transform_data&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="n"&gt;conf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bootstrap.servers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;localhost:9092&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;client.id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;python-producer&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;producer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Producer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;topic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;weather_topic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;



&lt;span class="c1"&gt;# Delivery callback
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;delivery_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Delivery failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Message delivered to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;partition&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;



&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;#call functions
&lt;/span&gt;    &lt;span class="n"&gt;extracted_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;transform_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;extracted_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;produce&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;callback&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;delivery_report&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;poll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Trigger delivery callback
&lt;/span&gt;
    &lt;span class="n"&gt;producer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flush&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# make sure buffered messages are delivered before sleeping&lt;/span&gt;


    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;180&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;#after 3 minutes
&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdsb2aoa0zcvgd4xecuf5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdsb2aoa0zcvgd4xecuf5.png" alt="kafka producer" width="574" height="190"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kafka Consumer Implementation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;confluent_kafka&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Consumer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;KafkaError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;KafkaException&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_data&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pymongo.mongo_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MongoClient&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pymongo.server_api&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ServerApi&lt;/span&gt;

&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;# Load env 
&lt;/span&gt;&lt;span class="n"&gt;uri&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;DB_STRING&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create a new client and connect to the server
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MongoClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uri&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;server_api&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;ServerApi&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;


&lt;span class="c1"&gt;# Kafka Consumer configuration
&lt;/span&gt;&lt;span class="n"&gt;conf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bootstrap.servers&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;localhost:9092&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Kafka broker(s)
&lt;/span&gt;    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;group.id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;weather-consumer-group&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Consumer group ID
&lt;/span&gt;    &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;auto.offset.reset&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;earliest&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;  &lt;span class="c1"&gt;# Start consuming from the earliest message
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Create Consumer instance
&lt;/span&gt;&lt;span class="n"&gt;consumer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Consumer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Kafka topic to consume messages from
&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;weather_topic&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;

&lt;span class="c1"&gt;# Subscribe to the topic
&lt;/span&gt;&lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subscribe&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Delivery callback to handle message processing
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;delivery_report&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Message delivery failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;err&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Message delivered to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; [&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;partition&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;] at offset &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="c1"&gt;# Main consumer loop
&lt;/span&gt;&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Poll for a message (timeout )
&lt;/span&gt;        &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;consumer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;poll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;180.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 3 minutes timeout
&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No message received within the timeout period.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="c1"&gt;# Handle errors from the consumer
&lt;/span&gt;            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;code&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;KafkaError&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_PARTITION_EOF&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;End of partition reached: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;partition&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; at offset &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;offset&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;KafkaException&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Successfully received a message
&lt;/span&gt;            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Received message: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;value&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="c1"&gt;# Deserialize the message (assuming the message is in JSON format)
&lt;/span&gt;            &lt;span class="n"&gt;message_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;value&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

            &lt;span class="c1"&gt;# Load data in db
&lt;/span&gt;            &lt;span class="nf"&gt;load_data&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;message_data&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="c1"&gt;#list message data
&lt;/span&gt;
            &lt;span class="c1"&gt;# Optional: You can print or handle the processed data here
&lt;/span&gt;            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Processed data: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;message_data&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;180&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; 


    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;An error occurred: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqntzch5flcj4lyqbytwx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqntzch5flcj4lyqbytwx.png" alt="Kafka Consumer" width="586" height="441"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Streaming Data in the DB&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp7fybgkjc06mfb8mxcvd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp7fybgkjc06mfb8mxcvd.png" alt="Data streaming mongodb" width="800" height="293"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion and Recommendation
&lt;/h2&gt;

&lt;p&gt;In this guide, we've built a real-time weather data pipeline using Apache Kafka and MongoDB to fetch, stream, and store weather data. While this setup works well for basic use, there are several ways to improve and expand it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommendations:&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Dashboard Visualization&lt;/em&gt;&lt;br&gt;
Integrate tools like Grafana to visualize real-time weather data with charts and graphs for better insights.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Alert Systems&lt;/em&gt;&lt;br&gt;
Set up automatic notifications via email or SMS when the weather conditions exceed a predefined threshold.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Analytics Integration&lt;/em&gt;&lt;br&gt;
Use machine learning or statistical models to analyse long-term weather trends and predict future conditions.&lt;/p&gt;

&lt;p&gt;For complete code and further details, visit the project on &lt;a href="https://github.com/batru/Real-Time-Streaming-Weather-Data" rel="noopener noreferrer"&gt;Github&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>The Ultimate Guide to Apache Kafka: Basics, Architecture, and Core Concepts</title>
      <dc:creator>Batrudin Jamaludin</dc:creator>
      <pubDate>Mon, 10 Mar 2025 09:40:52 +0000</pubDate>
      <link>https://dev.to/batrudin_haji/the-ultimate-guide-to-apache-kafka-basics-architecture-and-core-concepts-1ji3</link>
      <guid>https://dev.to/batrudin_haji/the-ultimate-guide-to-apache-kafka-basics-architecture-and-core-concepts-1ji3</guid>
      <description>&lt;p&gt;Apache Kafka is a powerful open-source distributed streaming platform used to handle real-time data feeds. It was Originally developed by LinkedIn and later open-sourced in 2011, Kafka is now one of the most popular tools for building real-time data pipelines and streaming applications. Similar to other distributed systems, Kafka boasts a complex architecture, which may pose a challenge for new developers. Setting up Kafka involves navigating a formidable command line interface and configuring numerous settings. In this guide, I will provide insights into architectural concepts and essential commands frequently used by developers to initiate their journey with Kafka.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Apache Kafka Basics
&lt;/h2&gt;

&lt;p&gt;Kafka is different from traditional messaging systems because it allows data to be published, consumed, and stored across a distributed network of servers, enabling real-time data processing.&lt;/p&gt;

&lt;p&gt;Some of the key concepts of Kafka include clusters, topics, producers, consumers, partitions, and Connect.  &lt;/p&gt;

&lt;h2&gt;
  
  
  Kafka’s Architecture
&lt;/h2&gt;

&lt;p&gt;The architecture of Kafka is designed to be highly scalable, fault-tolerant, and efficient. It is built around a few core components:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Topic: A topic is a named stream of records into which Kafka organizes data; it is roughly the equivalent of a table in a relational database. A topic can have multiple partitions, allowing data to be distributed across different Kafka brokers, which improves scalability and fault tolerance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Producer: Producers are applications that write data to Kafka topics, sending messages through Kafka’s client libraries. Examples include a microservice, a web application, or any other system that generates real-time data (see the sketch after this list).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Consumer: Consumers are applications that read data from Kafka topics. Kafka supports multiple consumers reading from the same topic independently, and consumers are often organized into consumer groups to enable parallel processing and fault tolerance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Broker: A broker is a Kafka server instance that receives, stores, and serves messages. Multiple brokers work together to form a Kafka cluster, which lets the system scale horizontally; each broker manages a portion of the topics' data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Partition: A partition is a subdivision of a topic that splits data into smaller, more manageable chunks. Partitions let Kafka parallelize the consumption of messages, which can significantly boost throughput. Each partition is an ordered, immutable sequence of messages.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Connect: Kafka Connect is a framework for streaming data between Kafka and external systems through connectors. The Connect framework runs and manages the tasks; a connector is only responsible for generating the set of tasks and indicating to the framework when they need to be updated.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
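
&lt;p&gt;To make the producer and consumer roles concrete, here is a minimal, illustrative sketch using the confluent-kafka Python client; the broker address, topic name, and group id are placeholders:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from confluent_kafka import Producer, Consumer

# Producer: write one message to a topic
producer = Producer({'bootstrap.servers': 'localhost:9092'})
producer.produce('mytopic', value=b'hello kafka')
producer.flush()  # block until the broker acknowledges delivery

# Consumer: read messages from the same topic
consumer = Consumer({
    'bootstrap.servers': 'localhost:9092',
    'group.id': 'demo-group',
    'auto.offset.reset': 'earliest',
})
consumer.subscribe(['mytopic'])
msg = consumer.poll(10.0)  # wait up to 10 seconds for a message
if msg is not None and msg.error() is None:
    print(msg.value().decode('utf-8'))
consumer.close()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;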

&lt;h2&gt;
  
  
  Commonly Used CLI Commands
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Start Zookeeper&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;zookeeper-server-start config\zookeeper.properties
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Start Kafka Server&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kafka-server-start config\server.properties
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;List existing topics&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bin/kafka-topics.sh --zookeeper localhost:2181 --list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Describe a topic&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bin/kafka-topics.sh --zookeeper localhost:2181 --describe --topic mytopic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Delete a topic&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bin/kafka-topics.sh --zookeeper localhost:2181 --delete --topic mytopic
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Consume messages&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bin/kafka-console-consumer.sh --new-consumer --bootstrap-server localhost:9092 --topic mytopic --from-beginning
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Start Kafka Producer&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Start Kafka Consumer&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --from-beginning --topic test
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Apache Kafka's real time processing capabilities make it a powerful tool for building data pipelines and stream-processing applications. Whether you are handling log data, monitoring system activity, or building complex event-driven applications, Kafka provides a reliable and efficient solution to stream and process large volumes of data. Understanding Kafka's architecture and core concepts, such as topics, producers, consumers, and partitions, is crucial for leveraging its full potential in modern data-driven applications.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Mastering SQL for Data Engineering: Advanced Queries, Optimization, and Data Modelling Best Practices</title>
      <dc:creator>Batrudin Jamaludin</dc:creator>
      <pubDate>Mon, 10 Feb 2025 08:03:15 +0000</pubDate>
      <link>https://dev.to/batrudin_haji/mastering-sql-for-data-engineering-advanced-queries-optimization-and-data-modelling-best-14mc</link>
      <guid>https://dev.to/batrudin_haji/mastering-sql-for-data-engineering-advanced-queries-optimization-and-data-modelling-best-14mc</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8bjtlzvnqt6ij4teu8sw.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8bjtlzvnqt6ij4teu8sw.jpeg" alt="SQL For Data Engineers" width="800" height="450"&gt;&lt;/a&gt;Structured Query Language (SQL) is Key in the world of Data Engineering. From managing large amounts of data to working on complex data processing pipelines, understanding SQL is essential for extracting, transforming, and analyzing data. For data engineers, SQL is not only a tool for querying databases but the foundation for building efficient data pipelines, optimizing queries, and ensuring scalability. In this article, we'll explore advanced SQL techniques, query optimization strategies, and best practices in data modelling that every data engineer should know.&lt;/p&gt;

&lt;h2&gt;
  
  
  Core SQL Concepts for Data Engineering
&lt;/h2&gt;

&lt;p&gt;It’s important to have a solid understanding of the core SQL concepts that form the foundation of data engineering.&lt;br&gt;
SELECT, WHERE, JOIN, GROUP BY, and HAVING&lt;br&gt;
These are the building blocks of SQL and appear in almost every query. Here's a brief overview of how they work:&lt;br&gt;
• SELECT: Specifies the columns you want to retrieve from a table in a database.&lt;br&gt;
• WHERE: Filters rows based on a given condition.&lt;br&gt;
• JOIN: Combines data from multiple tables based on a related column.&lt;br&gt;
• GROUP BY: Aggregates rows that have the same values in specified columns.&lt;br&gt;
• HAVING: Filters aggregated results, similar to the WHERE clause but for grouped data.&lt;/p&gt;
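
&lt;p&gt;As a quick illustration, here is a sketch that combines several of these clauses against a hypothetical orders table (table and column names are made up for the example):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Total spend per customer in 2024, keeping only customers above 1000
SELECT customer_id, SUM(amount) AS total_spend
FROM orders
WHERE order_date &amp;gt;= '2024-01-01'
GROUP BY customer_id
HAVING SUM(amount) &amp;gt; 1000;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
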
&lt;h2&gt;
  
  
  Real-World Use Case: Data Pipelines and ETL Processes
&lt;/h2&gt;

&lt;p&gt;In a typical ETL (Extract, Transform, Load) pipeline, these concepts are frequently used to pull data from different sources, filter unnecessary data, and transform it into a structured format.&lt;/p&gt;

&lt;p&gt;Let's use an example to demonstrate this concept. Imagine you have three tables: employee_data, salary_data, and payroll_data. The task is to extract the data from employee_data and salary_data, clean it, and then load it into a structured table (payroll_data) that can be used for payroll reports.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Extract employee details from raw_employees
SELECT employee_id, employee_name, department
FROM employee_data;

-- Extract salary details from salary_data
SELECT employee_id, salary
FROM salary_data;

-- Transform data: Join employee and salary tables, and calculate salary after tax
SELECT 
    e.employee_id, 
    e.employee_name, 
    e.department, 
    s.salary, 
    (s.salary * 0.9) AS salary_after_tax
FROM 
    employee_data e
JOIN 
    salary_data s ON e.employee_id = s.employee_id;

-- Load transformed data into a new table (payroll_data)
CREATE TABLE payroll_data AS
SELECT 
    e.employee_id, 
    e.employee_name, 
    e.department, 
    s.salary, 
    (s.salary * 0.9) AS salary_after_tax
FROM 
    employee_data e
JOIN 
    salary_data s ON e.employee_id = s.employee_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Advanced SQL Techniques
&lt;/h2&gt;

&lt;p&gt;In this section we are going to dive into more complex queries that can improve your efficiency and problem-solving skills in data engineering.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Recursive Queries and Common Table Expressions (CTEs)
Recursive queries are useful when dealing with hierarchical data (e.g., organizational structures); a recursive example follows the generic CTE sketch below. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A Common Table Expression (CTE) is a temporary result set defined within the execution of a single query; once the query finishes, the result no longer exists. CTEs make queries more readable and easier to manage by letting you define, for example, a subquery once and reference it multiple times within the main query. A CTE is defined using the WITH keyword, followed by the CTE name and the query that generates the result set.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH cte_name AS (
    SELECT column1, column2
    FROM table_name
    WHERE condition
)
SELECT column1, column2
FROM cte_name;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
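

&lt;p&gt;Because the CTE above is not recursive, here is a sketch of a recursive CTE that walks a hypothetical employees(employee_id, employee_name, manager_id) hierarchy; the exact syntax varies slightly by database (PostgreSQL and MySQL 8+ use WITH RECURSIVE):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WITH RECURSIVE org_chart AS (
    -- anchor member: top-level employees with no manager
    SELECT employee_id, employee_name, manager_id, 1 AS level
    FROM employees
    WHERE manager_id IS NULL

    UNION ALL

    -- recursive member: employees reporting to someone already in org_chart
    SELECT e.employee_id, e.employee_name, e.manager_id, oc.level + 1
    FROM employees e
    JOIN org_chart oc ON e.manager_id = oc.employee_id
)
SELECT employee_id, employee_name, level
FROM org_chart
ORDER BY level, employee_id;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;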



&lt;h2&gt;
  
  
  Window Functions for Running Totals, Ranking, and Partitioning
&lt;/h2&gt;

&lt;p&gt;Window functions are powerful tools for performing calculations across a set of rows related to the current row without collapsing the result set, unlike aggregate functions.&lt;br&gt;
Examples:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;ROW_NUMBER() for ranking.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;SUM() OVER() for running totals.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;RANK() for ranking data with ties.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT 
    employee_id,
    salary,
    SUM(salary) OVER (PARTITION BY department_id ORDER BY salary DESC) AS running_total
FROM employees;
-- This query computes a running total of salaries per department, ordered by salary in descending order.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Complex JOINs and Subqueries for Efficient Data Retrieval
&lt;/h2&gt;

&lt;p&gt;Complex joins like LEFT JOIN, RIGHT JOIN, and INNER JOIN are used to combine data across different tables in flexible ways.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT e.employee_id, e.name, e.salary, d.department_name
FROM employees e
JOIN departments d ON e.department_id = d.department_id
WHERE e.salary &amp;gt; (
    SELECT AVG(salary) FROM employees WHERE department_id = e.department_id
);
-- This query returns employees who earn more than the average salary in their respective departments.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Query Optimization and Performance Tuning
&lt;/h2&gt;

&lt;p&gt;As your database grows, query optimization becomes a critical skill. Without proper tuning, even the simplest queries can become slow and inefficient. We will look at two techniques: indexing and stored procedures.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Indexing
Indexes are essential for speeding up queries, especially when dealing with large datasets. By creating an index on columns that are frequently queried, you can reduce the time needed to retrieve data.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CREATE INDEX idx_order_date ON orders(order_date);
-- This index speeds up queries that filter by order_date.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Stored Procedures
A stored procedure is a precompiled collection of one or more SQL statements that can be executed as a single unit. Stored procedures are particularly useful for repetitive tasks, improving both performance and maintainability.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;--- An example of stored procedure that selects records from a table
CREATE PROCEDURE GetOrdersByDate 
AS
BEGIN
    SELECT order_id, customer_id, order_date
    FROM orders
    WHERE order_date = @order_date;
END;
--- Execute the stored procedure
EXEC GetOrdersByDate;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Data Modeling Best Practices
&lt;/h2&gt;

&lt;p&gt;Effective data modeling ensures that your database schema is optimized for both storage and query performance.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Normalization vs. Denormalization&lt;br&gt;
• Normalization reduces data redundancy and ensures data integrity by breaking down data into smaller tables.&lt;br&gt;
• Denormalization involves merging tables to reduce joins, which can speed up read-heavy applications, at the cost of additional storage and potential update anomalies.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Designing Efficient Relational Schemas&lt;br&gt;
When designing a schema, focus on the following:&lt;br&gt;
• Minimize redundancy and maintain consistency.&lt;br&gt;
• Ensure foreign key relationships between tables to enforce integrity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Star Schema vs. Snowflake Schema&lt;br&gt;
• Star Schema: A simplified schema with a central fact table surrounded by dimension tables. Ideal for analytical queries.&lt;br&gt;
• Snowflake Schema: A more complex schema where dimension tables are normalized. While it saves storage, it may lead to slower queries due to the need for more joins.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Example of a star schema:
CREATE TABLE fact_sales (
    sale_id INT PRIMARY KEY,
    product_id INT,
    customer_id INT,
    sales_amount DECIMAL(10, 2),
    sale_date DATE
);

CREATE TABLE dim_product (
    product_id INT PRIMARY KEY,
    product_name VARCHAR(255),
    category VARCHAR(100)
);

CREATE TABLE dim_customer (
    customer_id INT PRIMARY KEY,
    customer_name VARCHAR(255)
);
In this example, the fact_sales table holds the sales data, while dim_product and dim_customer contain information on products and customers, respectively.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
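
&lt;p&gt;For contrast with point 1 above, a denormalized variant of the same design inlines the dimension attributes directly into one table, trading extra storage and update risk for join-free reads. A hedged sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Denormalized alternative: no joins needed at query time, but product
-- and customer attributes are repeated on every sale and must be kept
-- consistent by the application.
CREATE TABLE sales_denormalized (
    sale_id INT PRIMARY KEY,
    product_name VARCHAR(255),
    category VARCHAR(100),
    customer_name VARCHAR(255),
    sales_amount DECIMAL(10, 2),
    sale_date DATE
);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;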



&lt;h2&gt;
  
  
  Real-World Application &amp;amp; Case Study
&lt;/h2&gt;

&lt;p&gt;Consider the following query, which joins multiple tables but runs slowly because the join columns are not indexed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT orders.order_id, customers.customer_name, SUM(order_items.quantity * order_items.price) AS total
FROM orders
JOIN customers ON orders.customer_id = customers.customer_id
JOIN order_items ON orders.order_id = order_items.order_id
GROUP BY orders.order_id, customers.customer_name;

-- Adding indexes to the foreign key join columns (customer_id, order_id) can significantly improve performance; see the sketch after this block.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
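
&lt;p&gt;A minimal sketch of those supporting indexes (the index names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Indexes on the join columns let the database look up matching rows
-- directly instead of scanning entire tables.
CREATE INDEX idx_orders_customer_id ON orders(customer_id);
CREATE INDEX idx_order_items_order_id ON order_items(order_id);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;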



&lt;p&gt;&lt;strong&gt;Case Study: Transforming Raw Data into Structured Reports&lt;/strong&gt;&lt;br&gt;
Imagine you are tasked with transforming raw transactional data into monthly sales reports. Using SQL, you can aggregate data, join it with dimension tables for more context, and present it in a clean, readable format.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;SELECT 
    MONTH(order_date) AS month,
    SUM(order_items.quantity * order_items.price) AS total_sales
FROM orders
JOIN order_items ON orders.order_id = order_items.order_id
WHERE order_date BETWEEN '2024-01-01' AND '2024-12-31'
GROUP BY MONTH(order_date)
ORDER BY month;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
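
&lt;p&gt;Joining a dimension table adds the extra context mentioned above. As a sketch, assuming order_items carries a product_id and a dim_product table like the one from the star schema section exists (MONTH() is MySQL/SQL Server syntax):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-- Monthly sales broken down by product category
SELECT 
    MONTH(o.order_date) AS month,
    p.category,
    SUM(oi.quantity * oi.price) AS total_sales
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN dim_product p ON oi.product_id = p.product_id
WHERE o.order_date BETWEEN '2024-01-01' AND '2024-12-31'
GROUP BY MONTH(o.order_date), p.category
ORDER BY month, p.category;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;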



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Mastering SQL is a vital skill for any data engineer. By understanding and applying advanced SQL techniques, optimizing queries, and designing efficient data models, you can significantly enhance your ability to work with large datasets and build scalable data pipelines. The key takeaways are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Understand core SQL concepts, including joins, filtering, and grouping.&lt;/li&gt;
&lt;li&gt; Use advanced techniques like recursive queries, window functions, and CTEs to solve complex problems.&lt;/li&gt;
&lt;li&gt; Focus on query optimization by analyzing execution plans and leveraging indexing.&lt;/li&gt;
&lt;li&gt; Follow data modeling best practices to design scalable and efficient schemas.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By applying these strategies to real-world projects, you'll be well on your way to mastering SQL and becoming a more effective data engineer.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Building Scalable Data Pipelines with Python – A Complete Guide</title>
      <dc:creator>Batrudin Jamaludin</dc:creator>
      <pubDate>Sat, 08 Feb 2025 14:38:40 +0000</pubDate>
      <link>https://dev.to/batrudin_haji/building-scalable-data-pipelines-with-python-a-complete-guide-40a7</link>
      <guid>https://dev.to/batrudin_haji/building-scalable-data-pipelines-with-python-a-complete-guide-40a7</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9rs6c2q43bhwr5nrmiq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9rs6c2q43bhwr5nrmiq.png" alt="Image description" width="800" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a Data Pipeline?
&lt;/h2&gt;

&lt;p&gt;Ever wondered how data moves from a source to its destination, or how messy data is transformed into a clean, analysis-ready form without manual effort? That is what a data pipeline does. It's a series of steps that automate the movement of data from various sources to a destination, ensuring that it is structured, clean, and ready for analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Functions of a Data Pipeline
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Extract&lt;/strong&gt;: The process of retrieving raw data from multiple sources (databases, APIs, files, etc.).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transform&lt;/strong&gt;: Clean, filter, aggregate, or modify the data to make it useful.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load&lt;/strong&gt;: Store the transformed data in the destination system (a data warehouse, database, or analytics tool).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Hands-on Python ETL Implementation
&lt;/h2&gt;

&lt;p&gt;In this section, we will implement two example ETL pipelines in Python:&lt;/p&gt;

&lt;h3&gt;
  
  
  Reading from CSV and Writing to PostgreSQL
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Generate sales data&lt;/strong&gt; (I have used Mockaroo to generate dummy data).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create a PostgreSQL database&lt;/strong&gt;. I have used &lt;code&gt;aiven.io&lt;/code&gt; to store my transformed data and &lt;code&gt;DBeaver&lt;/code&gt; to connect to the database.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Implement the ETL process using Python&lt;/strong&gt;. Make sure to install the necessary libraries (&lt;code&gt;pandas&lt;/code&gt; and &lt;code&gt;sqlalchemy&lt;/code&gt;).
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sqlalchemy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_engine&lt;/span&gt;

&lt;span class="c1"&gt;# Database connection
&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgresql://username:password@localhost:5432/etl_db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Extract
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sales.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Transform the data
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rename&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;sales_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;  &lt;span class="c1"&gt;# Rename columns
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# Drop null values
&lt;/span&gt;
&lt;span class="c1"&gt;# Load
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sales&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;if_exists&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data successfully loaded into PostgreSQL!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Reading from API and Writing to Database
&lt;/h3&gt;

&lt;p&gt;We are going to fetch JSON data (staff data for a fictitious company) from a sample endpoint; the URL is shown in the code below.&lt;/p&gt;

&lt;p&gt;Python Code to Fetch and Load API Data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sqlalchemy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_engine&lt;/span&gt;

&lt;span class="c1"&gt;# Step 1: Fetch data from API
&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://raw.githubusercontent.com/LuxDevHQ/LuxDevHQDataEngineeringGuide/refs/heads/main/samplejson.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2: Extract - Convert JSON data to DataFrame
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 3: Transform - Clean and reshape the data
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;position&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;country&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;  &lt;span class="c1"&gt;# Select relevant columns
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;full_name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;country&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# Rename columns for clarity
&lt;/span&gt;
&lt;span class="c1"&gt;# Step 4: Load - Insert the data into PostgreSQL
&lt;/span&gt;&lt;span class="n"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_engine&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;postgresql://username:password@localhost:5432/etl_db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api_staff&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;if_exists&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;API Data Loaded Successfully!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Congratulations! 🎉
&lt;/h2&gt;

&lt;p&gt;If you've followed each step correctly, you've successfully implemented an ETL data pipeline using Python!&lt;/p&gt;

&lt;p&gt;This hands-on example demonstrates how to automate the process of moving data from CSV files and APIs into a database, streamlining your data processing workflows and making them more efficient and scalable.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
