DEV Community: Gathuru_M

Building a Real-Time Weather Streaming Pipeline with Apache Kafka 4 and Cassandra

Gathuru_M — Mon, 29 Jun 2026 00:28:17 +0000

In my last article, we broke down the core concepts of Apache Kafka — moving away from scheduled batch jobs and shifting toward real-time event streaming.

This article focuses on a practical Kafka project to help you understand the architecture better. The goal is simple: fetch live weather metrics from the OpenWeather API, stream those events instantly through an Apache Kafka topic, and consume them downstream to be stored in Apache Cassandra for time-based analysis.

To make this happen, we will be writing two standalone Python scripts: a custom weather_producer.py script to fetch and stream the data, and a separate weather_consumer.py script to read the stream and save it to our database.

Upgrading from Kafka 3.9.2 to Kafka 4.0 (KRaft Native)

Before writing our scripts, we have to address our Kafka environment. In my earlier practice sessions, I was running Apache Kafka version 3.9.2, which still relied heavily on an external service called Apache ZooKeeper to manage cluster metadata.

However, with the release of Apache Kafka 4.x, ZooKeeper support has been completely removed. Kafka now runs entirely in KRaft (Kafka Raft) mode, managing its own internal metadata log natively. This simplifies things dramatically, meaning we only have to run a single Kafka process instead of managing two entirely different software setups on our machine.

To get ready for this project, I had to completely clean out my old local environment. If you are following along on WSL or Linux, here is exactly how to uninstall version 3.9.2 and upgrade to Kafka 4.0:

i). Removing Kafka 3.9.2

First, stop any running instances of ZooKeeper and Kafka, then wipe out the old binaries and the local data directories to prevent any legacy configuration conflicts:

# Delete the old installation directory
rm -rf ~/kafka_2.13-3.9.2

# Clear out local temporary data/log directories used by the old version
rm -rf /tmp/zookeeper
rm -rf /tmp/kafka-logs

ii). Installing and Formatting Kafka 4.0

KRaft requires you to explicitly generate a cluster ID and format your storage directory before launching the broker:

# Download and extract Kafka 4.0
wget https://archive.apache.org/dist/kafka/4.0.0/kafka_2.13-4.0.0.tgz
tar -xzf kafka_2.13-4.0.0.tgz
cd kafka_2.13-4.0.0

# Generate a unique cluster ID for KRaft
KRAFT_CLUSTER_ID=$(bin/kafka-storage.sh random-uuid)

# Format your storage log directory using that ID
# (Kafka 4.0 ships its KRaft config as config/server.properties; --standalone formats a single-node quorum)
bin/kafka-storage.sh format --standalone -t $KRAFT_CLUSTER_ID -c config/server.properties

Now, starting Kafka is as simple as running one single command targeting our KRaft configuration file:

bin/kafka-server-start.sh config/server.properties

Kafka 4.0 starting up in KRaft mode
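Before moving on, it can help to confirm that the broker is actually accepting connections. The sketch below is a minimal check using only Python's standard library. It assumes the broker listens on localhost:9092 (the address used throughout this article), and the helper name is_port_open is just an illustrative choice. It only verifies that the TCP port is open, not that Kafka itself is healthy.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts
        return False


if __name__ == "__main__":
    # localhost:9092 is the bootstrap address assumed in this article
    print("Kafka port reachable:", is_port_open("localhost", 9092))
```

If this prints False, check the terminal where kafka-server-start.sh is running before debugging the Python scripts.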

Step 1: Creating the Weather Topic

Before our scripts can send or read data, our new 4.0 broker needs a destination ready. Using the Kafka CLI tools, create a dedicated topic named weather_updates with 3 partitions to allow for parallel consumption down the line:

bin/kafka-topics.sh --create --topic weather_updates --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1

Describing the weather_updates topic

Step 2: Writing the Python Producer (`weather_producer.py`)

Our first script is a standalone Python script called weather_producer.py. Its sole job is to talk to the OpenWeather API, fetch the raw metrics, and hand the payload over to Kafka. We use the kafka-python library to handle the client connection.

Notice the continuous while True loop. Unlike Airflow DAGs that run and stop, streaming application scripts run indefinitely in your terminal:

# weather_producer.py
import time
import json
import requests
from kafka import KafkaProducer

API_KEY = "YOUR_OPENWEATHER_API_KEY"
CITY = "Nairobi"
URL = f"https://api.openweathermap.org/data/2.5/weather?q={CITY}&appid={API_KEY}"

# Initialize Kafka Producer targeting our KRaft Broker port
producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

print("--- Starting Weather Producer Loop ---")
while True:
    try:
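        response = requests.get(URL, timeout=10)
        if response.status_code == 200:
            weather_data = response.json()

            # Send payload to our Kafka topic
            producer.send('weather_updates', value=weather_data)
            print(f"Sent event to Kafka: {weather_data['name']} - {weather_data['main']['temp']}K")
        else:
            # Surface API problems (bad key, rate limit) instead of failing silently
            print(f"OpenWeather returned HTTP {response.status_code}")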

    except Exception as e:
        print(f"Error fetching/sending data: {e}")

    # Poll the API every 10 seconds to create a continuous stream
    time.sleep(10)

Sending event to Kafka topic
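Kafka itself only moves bytes, which is why the producer needs a value_serializer. The sketch below pulls that lambda out into a named function and pairs it with the inverse used on the consumer side, so you can see the round trip without a broker. The sample payload is a trimmed, made-up dictionary shaped like the fields this article reads, not a real API response.

```python
import json


def serialize(value: dict) -> bytes:
    # Same logic as the producer's value_serializer lambda
    return json.dumps(value).encode("utf-8")


def deserialize(raw: bytes) -> dict:
    # Mirror image, matching the consumer's value_deserializer lambda
    return json.loads(raw.decode("utf-8"))


# Made-up sample with only the fields used in this pipeline
sample = {
    "name": "Nairobi",
    "main": {"temp": 293.15, "humidity": 60},
    "wind": {"speed": 3.6},
}

wire_bytes = serialize(sample)
restored = deserialize(wire_bytes)

print(f"{len(wire_bytes)} bytes on the wire")
print("Round trip preserved the payload:", restored == sample)
```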

Step 3: Setting Up Apache Cassandra For the First Time

If you have never used Apache Cassandra before, it is a distributed NoSQL database built specifically for lightning-fast write speeds and handling massive time-series data.

Because it doesn't come pre-installed on standard Linux distributions, we need to install it first. We will use the binary tarball. Here is the step-by-step process I followed to set it up on my WSL environment for the first time:


# 1. Navigate to your home directory and install Java 11
cd ~

sudo apt update
sudo apt install openjdk-11-jdk -y

It is important to note that Cassandra 4.1 runs on Java 8 or Java 11, not on newer releases.

Since we updated to Kafka 4.0, whose broker requires Java 17 or later, Java 17 is likely the active version on your machine, and Cassandra will crash on startup if it is launched with it.
The command above therefore installs Java 11 side by side with your existing Java 17.

# 2. Download the official Apache Cassandra binary tarball
wget https://archive.apache.org/dist/cassandra/4.1.4/apache-cassandra-4.1.4-bin.tar.gz

# 3. Extract the tarball
tar -xzf apache-cassandra-4.1.4-bin.tar.gz

# 4. Rename the folder to just 'cassandra' for cleaner navigation
mv apache-cassandra-4.1.4 cassandra

# 5. Clean up the downloaded tar.gz file to save space
rm apache-cassandra-4.1.4-bin.tar.gz

Now, everything for Cassandra lives inside ~/cassandra.

Start Cassandra
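Ubuntu allows you to have multiple versions of Java installed at the same time. Keep in mind that the switch below is system-wide: a Kafka broker that is already running is unaffected, but one started afterwards will use Java 11 unless JAVA_HOME points at Java 17. You can pick which version is active by running: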
sudo update-alternatives --config java

Look for the row that mentions java-11-openjdk. Type that selection number and hit Enter.

To make sure your terminal is officially using the older version, run java -version.

It should now say openjdk version "11.0.x"

Because Cassandra here is a binary version, to start it up, we will execute the startup script directly from the folder:

cd ~/cassandra
bin/cassandra -f

A large stream of startup logs will begin scrolling down your screen. The specific line INFO [main] ... Startup completed near the bottom confirms Cassandra started successfully.

Once the logs stop moving and Cassandra stays open safely:

Leave that terminal window completely alone (let it keep running).
Open a brand-new terminal tab/window in WSL.
Navigate to your folder and run your cluster status tool:

cd ~/cassandra
bin/nodetool status

You should see a status table containing a row that starts with UN 127.0.0.1, where UN stands for Up and Normal.

Terminal output of nodetool status confirming Cassandra is awake

Step 4: Creating the Storage Schema

Now that Cassandra is running, we can log into the Cassandra Query Language shell (cqlsh) right from our command line to create our schema.

We will create a Keyspace (Cassandra’s version of a database schema) and a table structured specifically for historical analysis, keeping each city's rows sorted from newest to oldest through a timestamp clustering column:

cqlsh

Inside the interactive shell, run the following commands:

CREATE KEYSPACE weather_analytics 
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

USE weather_analytics;

CREATE TABLE city_weather (
    city_name text,
    timestamp timestamp,
    temperature float,
    humidity int,
    wind_speed float,
    PRIMARY KEY (city_name, timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
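To build intuition for what PRIMARY KEY (city_name, timestamp) with CLUSTERING ORDER BY (timestamp DESC) gives us, here is a toy in-memory model in Python. It is only a mental model with made-up readings, not how Cassandra stores data on disk: city_name picks the partition, and rows inside that partition are kept newest first.

```python
from collections import defaultdict
from datetime import datetime, timezone

# partition key (city_name) -> rows belonging to that partition
partitions = defaultdict(list)


def insert(city_name, ts, temperature):
    rows = partitions[city_name]
    rows.append({"timestamp": ts, "temperature": temperature})
    # Imitates CLUSTERING ORDER BY (timestamp DESC): newest row first
    rows.sort(key=lambda row: row["timestamp"], reverse=True)


def latest(city_name, limit=1):
    # Reading the newest rows is just taking the head of the partition
    return partitions[city_name][:limit]


# Made-up readings for illustration
insert("Nairobi", datetime(2026, 6, 29, 0, 0, tzinfo=timezone.utc), 291.2)
insert("Nairobi", datetime(2026, 6, 29, 0, 10, tzinfo=timezone.utc), 290.8)
insert("Mombasa", datetime(2026, 6, 29, 0, 5, tzinfo=timezone.utc), 299.4)

print(latest("Nairobi"))
```

This is why a query such as "latest readings for one city" is cheap in this schema: it touches a single partition and reads from its start.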

Step 5: Writing the Python Consumer (`weather_consumer.py`)

Our second standalone script is weather_consumer.py. It runs completely independently from the producer, continuously listening to the weather_updates topic, parsing the incoming JSON data, and appending the records into Cassandra through the cassandra-driver client.

# weather_consumer.py
from kafka import KafkaConsumer
from cassandra.cluster import Cluster
import json

# Connect to local Cassandra instance
cassandra_cluster = Cluster(['127.0.0.1'])
session = cassandra_cluster.connect('weather_analytics')

# Subscribe to Kafka Topic
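consumer = KafkaConsumer(
    'weather_updates',
    bootstrap_servers=['localhost:9092'],
    # A group_id lets Kafka store committed offsets, so a restarted consumer
    # resumes where it left off (the group name itself is an arbitrary choice)
    group_id='weather-cassandra-writer',
    # With no committed offset yet, start from the oldest retained message
    auto_offset_reset='earliest',
    value_deserializer=lambda x: json.loads(x.decode('utf-8'))
)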

print("--- Weather Consumer Listening for Events ---")
for message in consumer:
    event = message.value

    # Extract specific nested fields from OpenWeather JSON payload
    city = event['name']
    temp = event['main']['temp']
    humid = event['main']['humidity']
    wind = event['wind']['speed']

    # Insert statement into Cassandra NoSQL table
    insert_query = """
    INSERT INTO city_weather (city_name, timestamp, temperature, humidity, wind_speed)
    VALUES (%s, toTimestamp(now()), %s, %s, %s);
    """

    session.execute(insert_query, (city, temp, humid, wind))
    print(f"Successfully streamed and stored record for: {city}")
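The consumer loop above indexes straight into the payload, so a single event with a missing field would raise a KeyError and stop the script. As a hedged sketch, the field extraction could be moved into a small helper that returns None for events it cannot parse. The helper name extract_row and both sample events are hypothetical, made up for illustration.

```python
from typing import Optional, Tuple


def extract_row(event: dict) -> Optional[Tuple[str, float, int, float]]:
    """Return (city, temperature, humidity, wind_speed), or None if a field is missing."""
    try:
        return (
            event["name"],
            float(event["main"]["temp"]),
            int(event["main"]["humidity"]),
            float(event["wind"]["speed"]),
        )
    except (KeyError, TypeError, ValueError):
        # Missing key, wrong nesting, or a value that is not numeric
        return None


# Made-up events: one well-formed, one shaped like an error body
good = {"name": "Nairobi", "main": {"temp": 293.15, "humidity": 60}, "wind": {"speed": 3.6}}
bad = {"cod": 401, "message": "illustrative error body"}

print(extract_row(good))
print(extract_row(bad))
```

Inside the for message in consumer loop, you would call extract_row(message.value) and skip the insert whenever it returns None.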

Verifying the Streaming Pipeline Data Flow

When both independent Python scripts are active at the same time, the pipeline functions as a living unit. To confirm that our real-time records are landing correctly inside our NoSQL storage layer, we can log into the Cassandra shell (cqlsh) and run a quick verification query:

SELECT * FROM weather_analytics.city_weather LIMIT 5;

Producer and Consumer scripts running concurrently

Major Lessons from Entering the Streaming Space

Upgrading simplifies infrastructure: Dropping ZooKeeper and upgrading from 3.9.2 to 4.x made managing the local environment much smoother.
Decoupling creates stability: If the Cassandra database goes down for maintenance, the Producer script doesn't care. It keeps pulling weather data and sending it to Kafka. Kafka holds onto those messages for its configured retention period, so once Cassandra recovers and the Consumer script is back on, it can catch up from its last committed offset (provided the consumer runs with a group_id) instead of losing the data.

What real-time data sources are you planning to stream in your next project? Let's brainstorm ideas in the comments below!

Introduction to Apache Kafka: Shifting from Batch Processing to Real-Time Streaming

Gathuru_M — Fri, 19 Jun 2026 01:13:25 +0000

In data engineering, Batch Processing is the approach where a large amount of information is collected over a period of time and processed as a single unit, while Streaming is the approach where each piece of information is processed individually as it arrives.

In Batch Processing, whether we are running Python scripts manually or scheduling them with Apache Airflow, the core concept is the same: wait for a period of time, collect a chunk of data (like an hour's worth of news or crypto prices), process it, and load it into a database.

Batch processing is recommended for reports, daily dashboards, and historical analysis. But what happens when a business needs answers in milliseconds?

Consider these scenarios:

Ride-Sharing Apps: Uber or Bolt tracking a driver’s GPS coordinates second-by-second to update your ETA.
Financial Security: A bank analyzing a credit card swipe for fraud before the transaction is approved.
E-commerce: Tracking clickstream data (every single scroll, click, and hover a user makes) to instantly update product recommendations on a homepage.

You can't wait an hour for an Airflow DAG to trigger for these use cases. You need Event Streaming, and that is exactly what Apache Kafka is designed to handle.
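The difference between the two models can be shown in a few lines of plain Python. This is only a conceptual sketch with made-up temperature readings: the batch function needs the whole chunk before it can answer, while the streaming function updates its answer on every event.

```python
from typing import Iterable, Iterator, List


def batch_average(readings: List[float]) -> float:
    # Batch: wait until the whole chunk is collected, then compute once
    return sum(readings) / len(readings)


def streaming_average(readings: Iterable[float]) -> Iterator[float]:
    # Streaming: refresh the result as each event arrives
    total, count = 0.0, 0
    for value in readings:
        total += value
        count += 1
        yield total / count


temps = [20.0, 22.0, 24.0]  # made-up readings

print(batch_average(temps))            # one answer, after all data is in
print(list(streaming_average(temps)))  # an answer after every event
```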

What is Apache Kafka?

Apache Kafka is a distributed event streaming platform.
Instead of acting like a traditional database where data sits in tables waiting to be queried, Kafka handles data as a continuous, high-speed flow of messages (called events).

It acts as a highly decoupled, fault-tolerant middleman between systems that produce data and systems that need to consume that data.

Core Components of Kafka Architecture

To manage real-time streams at scale, Kafka relies on a few key structural components:

1. Producers and Consumers

Producers: Applications that generate and send data into Kafka. For example, a mobile app sending user location updates, or a microservice publishing transaction details.
Consumers: Applications that subscribe to Kafka to read and process those incoming streams. For example, an analytics system calculating live traffic patterns, or a notification service sending an SMS receipt.

The key point here is that Producers and Consumers are completely independent.
A producer doesn't know or care who is reading its data, which prevents your entire system architecture from becoming a tangled web of direct API connections.

2. Topics and Partitions

Data within Kafka is organized into categories called Topics (similar to a table in a traditional database). If you are tracking a logistics fleet, you might have a topic named truck_gps_coordinates.

To handle massive scale, a single Topic is split into multiple pieces called Partitions.

Partitions are spread across different servers (called Brokers).
This allows Kafka to achieve parallel processing — multiple consumers can read from different partitions of the same topic simultaneously, maximizing throughput.
Inside a partition, every message is appended in the order it arrives and given a unique sequential ID called an Offset. Ordering is only guaranteed within a single partition, not across the whole topic.
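To see why partitions enable parallelism, here is a simplified sketch of how the partitions of one topic might be shared among consumers that belong to the same consumer group. It uses a plain round-robin rule written for illustration; Kafka's built-in assignment strategies are more involved, and the consumer names are made up.

```python
def assign_round_robin(partitions, consumers):
    """Spread partitions over consumers one at a time, in order."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        owner = consumers[index % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# A topic with 3 partitions shared by 2 consumers
print(assign_round_robin([0, 1, 2], ["consumer-a", "consumer-b"]))

# With more consumers than partitions, the extra consumer sits idle
print(assign_round_robin([0, 1, 2], ["a", "b", "c", "d"]))
```

The second call shows a practical limit: within one group, a partition is read by only one consumer at a time, so the partition count caps how many consumers can work in parallel.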

3. Offsets: How Kafka Tracks Progress

Unlike a traditional message queue, Kafka doesn't delete messages the moment a consumer reads them. Messages stay in Kafka for a configured amount of time (e.g., 7 days).

Because the data stays put, a consumer uses its Offset pointer like a bookmark to remember exactly which message it read last. If the consumer crashes, once it reboots, it checks its last committed offset and resumes reading exactly where it left off without losing a single event.
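The bookmark idea can be modelled with a small in-memory log. This is a toy sketch, not Kafka's implementation: the class name PartitionLog and the event strings are invented for illustration. As in Kafka, the committed offset here is the position of the next message to read.

```python
class PartitionLog:
    """Toy append-only log: reading never deletes anything."""

    def __init__(self):
        self._messages = []

    def append(self, message) -> int:
        self._messages.append(message)
        return len(self._messages) - 1  # offset of the new message

    def read_from(self, offset):
        # Return (offset, message) pairs starting at the given offset
        return list(enumerate(self._messages))[offset:]


log = PartitionLog()
for event in ["event-0", "event-1", "event-2", "event-3"]:
    log.append(event)

committed = 0  # next offset this consumer should read

# The consumer processes two messages, committing after each one
for offset, message in log.read_from(committed)[:2]:
    committed = offset + 1

# ...the consumer crashes here and restarts...
resumed = log.read_from(committed)
print(resumed)  # continues with event-2, nothing skipped or lost
```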

Local Setup: Zookeeper and Brokers

When you start practicing with Kafka locally, especially inside a Linux environment like WSL, you quickly realize it requires a bit of infrastructure orchestration to spin up.

Historically, Kafka relied on Apache Zookeeper to act as the coordinator, managing the cluster, tracking which brokers are alive, and electing leaders for partitions.

In that setup, when launching Kafka via the CLI, you have to spin up Zookeeper first, and then launch your Kafka Broker service (Kafka Server).

Running Zookeeper and Kafka Broker services in terminal

Summary

In Batch (Airflow/Postgres): Data is at rest. We run queries over the static data.
In Streaming (Kafka): Data is in motion. Our application logic stays active, and the data constantly flows through our code.

What's Next?

Understanding the architecture of topics, partitions, and offsets is step one. But as data engineers, we need to programmatically interact with this streaming cluster.

In the next article, we are going to write Python scripts using a Kafka client library. We will build a custom Python Producer to generate stream events and a Python Consumer to read and display those events in real-time inside our setup.

Are you moving into real-time streaming workflows yet, or sticking to batch pipelines? Let's discuss in the comments below!

Containerizing an ETL Pipeline with Docker Compose and Clean Git Hygiene

Gathuru_M — Sun, 14 Jun 2026 15:52:16 +0000

In my last post, we broke down the core concepts of Docker and why packaging your code into standardized containers is ideal for avoiding "dependency hell."

In this article, we look into developing an ETL project and running it on Docker using docker-compose. The project uses the Coinpaprika API to fetch and normalize ticker data.
We will also look at how to push the entire project to GitHub using atomic commits to keep our version control clean. Let’s dive into how it works!

Why Docker Compose?

Instead of managing containers individually, Docker Compose allows us to define a multi-container application in a single file → docker-compose.yml.

Our project requires two distinct services/containers:

db (PostgreSQL): The data warehouse destination where our normalized coin data will be loaded.
etl_script (Python): Our custom application container that sends requests to the Coinpaprika API, transforms the JSON response using pandas, and pushes it to our database.

Step 1: Writing the Configuration Files

To build this, we create a clean directory structure. First, write a Dockerfile for the Python ETL script to ensure it has all the necessary packages installed:

# Dockerfile
FROM python:3.11-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install -r requirements.txt

COPY etl_script.py .

CMD ["python", "etl_script.py"]

Step 2: Keeping Secrets Secret with a `.env` File

Introduce a .env file - a simple text file that holds sensitive environment variables.
In this case, the file will hold our database configurations.

Note that this file is not pushed to GitHub; add .env to your .gitignore so it is never committed by accident.

Here is what the .env file looks like:

# .env
DB_USER=crypto_admin
DB_PASSWORD=SuperSecretPassword123
DB_NAME=crypto_warehouse

Step 3: Composing the Multi-Container Network

Next, write the docker-compose.yml file to define the database and the script together.
Notice how it references the variables dynamically from our .env file using the ${VARIABLE_NAME} syntax. Docker Compose automatically detects this file in the same directory and injects the values safely:

version: '3.8'

services:
  db:
    image: postgres:15
    container_name: crypto_postgres
    restart: always
    environment:
      POSTGRES_USER: ${DB_USER}
      POSTGRES_PASSWORD: ${DB_PASSWORD}
      POSTGRES_DB: ${DB_NAME}
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data

  etl_script:
    build: .
    container_name: crypto_etl_runner
    depends_on:
      - db
    environment:
      - DB_HOST=db
      - DB_NAME=${DB_NAME}
      - DB_USER=${DB_USER}
      - DB_PASSWORD=${DB_PASSWORD}

volumes:
  postgres_data:
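The ${VARIABLE_NAME} substitution can be imitated in a few lines of Python to make the mechanism concrete. This is a rough sketch, not Docker Compose's actual parser: it ignores quoting rules and features such as default values, and the function name parse_env is an illustrative choice.

```python
from string import Template


def parse_env(text: str) -> dict:
    """Read KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


env_text = """
# .env
DB_USER=crypto_admin
DB_PASSWORD=SuperSecretPassword123
DB_NAME=crypto_warehouse
"""

# One line from docker-compose.yml, before substitution
compose_line = "POSTGRES_USER: ${DB_USER}"

# string.Template understands the same ${NAME} placeholder style
resolved = Template(compose_line).substitute(parse_env(env_text))
print(resolved)
```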

Networking in Docker:

Under etl_script, the DB_HOST environment variable is set to db instead of localhost. Because Docker Compose spins up both containers on a shared default network, they can find each other using their service names as hostnames!
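One caveat worth knowing: the short depends_on form used above only controls start order, so the ETL container can start before Postgres is ready to accept connections. The article does not show etl_script.py, so the following is a hedged sketch of how such a script might read its settings from the environment and retry the connection. The function names are hypothetical, and the connect argument stands in for whatever driver function you use.

```python
import os
import time


def db_settings(env=os.environ) -> dict:
    """Collect connection settings from the variables set in docker-compose.yml."""
    return {
        "host": env.get("DB_HOST", "localhost"),
        "dbname": env["DB_NAME"],
        "user": env["DB_USER"],
        "password": env["DB_PASSWORD"],
    }


def connect_with_retry(connect, settings, attempts=5, delay=2.0):
    """Call connect(**settings) until it succeeds or attempts run out."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return connect(**settings)
        except Exception as exc:  # the database may still be starting up
            last_error = exc
            print(f"Attempt {attempt}/{attempts} failed: {exc}")
            if attempt < attempts:
                time.sleep(delay)
    raise RuntimeError("database never became reachable") from last_error
```

With psycopg2 installed, a call could look like connect_with_retry(psycopg2.connect, db_settings()), since psycopg2.connect accepts host, dbname, user and password as keyword arguments.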

Step 4: Spinning it Up

With our files defined, launching the entire multi-container architecture requires just one command in the terminal:

docker compose up --build

Docker automatically pulls the Postgres image, builds our custom Python image, injects our hidden .env configurations, sets up the network isolation, and spins up both containers simultaneously.

Terminal logs from running docker compose up

Step 5: Practicing Clean Version Control (Atomic Commits)

Instead of working for three hours and writing a giant, generic commit message like "fixed code and added files", let's practice how to write atomic commits.

An atomic commit means each commit does exactly one logical thing. It makes your GitHub commit history readable, and if something breaks, it’s very easy to roll back to the exact step where things went wrong.

Here is an example of our commit timeline for this project:

idx: Starting point, Initializing project scaffolding
infra: Added Docker Infrastructure: Dockerfile, docker-compose.yml and env vars
feature: Add ETL script with CoinPaprika API integration and PostgreSQL connection handler

Key Takeaways

Volumes prevent data loss: Adding the volumes key to the Postgres service ensures that even if we stop and remove our containers using docker compose down (without the -v flag), the actual crypto data stays saved safely in a named volume on the host.
Environments are modular: Now, our configurations are completely decoupled from the code. If someone clones the repository from GitHub, the project won't leak any secrets, and they can easily plug in their own database credentials by creating their own local .env file.

All the best as you continue learning Docker!

Understanding Docker for Data Engineering

Gathuru_M — Sun, 14 Jun 2026 14:52:02 +0000

In your data engineering journey, you may have a pipeline running locally inside your development environment, and it works beautifully on your machine.

But what happens if you want to hand over a project to a colleague, deploy it to a shared server managed by your team leader, or push it to a cloud provider?

Suddenly, a storm of errors appears, like:

"You don't have PostgreSQL 15 installed locally?"
"Your machine is running an older version of Python that doesn't support that syntax?"
"Your operating system is missing the specific database drivers needed for psycopg2."

This is called Dependency Hell. In data engineering, ensuring your pipelines run exactly the same way everywhere is just as important as writing the code itself. That is why we use Docker.

The Container Analogy:

Before the 1950s, shipping goods across the world was incredibly messy. Workers had to manually load barrels of oil, sacks, and crates of electronics onto ships. Every item was a different shape and size, making loading slow, inefficient, and prone to accidents.

Then came the standardized shipping container.

It didn’t matter what was inside the box, whether it was cars, clothes, or frozen food: the shipping container was always the exact same size, had the same hooks, and fit perfectly on every ship, train, and crane in the world.

Docker does exactly this for software. Instead of shipping raw Python scripts and text files, Docker lets you package your application code, dependencies, runtime, and configurations into a single, standardized box called a Container. If a machine can run Docker, it can run your container seamlessly, whether it’s a Windows laptop, a Mac, or a Linux server.

A common question beginners ask is: "Why not just use a Virtual Machine (VM) to isolate our code?"

While VMs provide isolation, they are quite heavy. A VM runs an entire guest operating system (like a whole installation of Windows or Ubuntu), which consumes gigabytes of RAM and CPU before your code even starts running.

Docker containers are lightweight. They don't include a whole operating system; instead, they share the host machine’s operating system kernel and only pack the bare essentials (your app code and libraries). This means a container can spin up in seconds rather than minutes, using a fraction of the system resources.

The Three Core Concepts in Docker

1. The Dockerfile

A text document containing step by step instructions on how to build a Docker environment.
You specify the base image, install your python packages, and copy your scripts into it.

2. The Docker Image

When you run a build command on your Dockerfile, it compiles into a Docker Image. This is a read-only blueprint of your environment.
It contains all the snapshots of your libraries and setup files. You can share this image on platforms like Docker Hub.

3. The Docker Container

When you run an image, it becomes a Container. This is a running instance of the image, containing everything an application needs to run.
You can start it, stop it, write data inside it, and delete it when you're done.

Below is a guide to help you install Docker on Linux, or on Windows if you have Windows Subsystem for Linux (WSL).
-- Using Docker Engine

# Step 1: Update the system and install prerequisites
sudo apt update && sudo apt upgrade -y
sudo apt install -y ca-certificates curl

# Step 2: Add Docker's official GPG key and repository
sudo install -m 0755 -d /etc/apt/keyrings  # create the keyring directory
# Download Docker's GPG key
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg \
  -o /etc/apt/keyrings/docker.asc
# Set correct permissions on the key file
sudo chmod a+r /etc/apt/keyrings/docker.asc
# Add Docker's stable apt repository
echo "deb [arch=$(dpkg --print-architecture) \
  signed-by=/etc/apt/keyrings/docker.asc] \
  https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

# Step 3: Install Docker Engine (latest version)
sudo apt update
sudo apt install -y docker-ce docker-ce-cli containerd.io \
  docker-buildx-plugin docker-compose-plugin

# Step 4: Verify installation
docker --version
docker compose version

# Step 5: Check if Docker is running
sudo systemctl start docker   # start the Docker daemon
sudo systemctl status docker  # verify Docker is running
sudo systemctl enable docker  # enable Docker to start on boot

When you install Docker Desktop or run Docker commands via your terminal, you get an organized view of your environment.

To confirm Docker is working, pull and run the hello-world image:
docker run hello-world

What happens behind the scenes:

Docker checks locally for the 'hello-world' image
Not found locally → pulls it from Docker Hub
Creates a container from the image
Runs it → prints the success message
Container exits

Core Docker Commands

    # ── Containers ──────────────────────────────────────────────────────────────    
    docker ps                    # list RUNNING containers only    
    docker ps -a                 # list ALL containers (running + stopped + exited)    
    docker run <image>           # create and run a container from an image    
    docker run -it ubuntu bash   # run interactively (-i) with a terminal (-t)    
    docker stop <container_id>   # gracefully stop a running container    
    docker rm <container_id>     # remove a stopped container    
    docker rm $(docker ps -aq)   # remove ALL stopped containers    

    # ── Images ──────────────────────────────────────────────────────────────────    
    docker images                # list all locally stored images    
    docker pull python:3.10      # download an image from Docker Hub without running it    
    docker rmi <image_id>        # remove a local image    
    docker rmi $(docker images -q) # remove ALL local images    

    # ── Building ─────────────────────────────────────────────────────────────────    
    docker build -t myapp .      # build an image from Dockerfile in current folder    
                                 # -t = tag (name) for the image    
                                 # .  = build context (here the current directory, which holds the Dockerfile)

    # ── Running Interactively ────────────────────────────────────────────────────    
    docker run -it ubuntu bash               # run Ubuntu container with bash shell    
    docker exec -it <container_id> bash      # open a shell inside a RUNNING container    

    # ── System Info ──────────────────────────────────────────────────────────────    
    docker version               # show Docker client and server (daemon) version    
    docker info                  # detailed Docker system information

What's Next?

Understanding individual containers is the first step. However, in data engineering, projects rarely rely on just one thing. A project could need a Python environment to run a script and a PostgreSQL database to store the data.
This will require two containers to run. Running those as separate containers manually and trying to link their networks together can get complicated.

In my next article, we are going to look at Docker Compose — a tool that lets us define and run multi-container applications using a single YAML file.
We will package an entire ETL pipeline so that anyone can spin up a database and execution script with just one command: docker compose up.

TaskFlow API vs. Traditional Operators: Practical Airflow ETL Pipeline

Gathuru_M — Sun, 07 Jun 2026 19:56:12 +0000

In my last article, we went over the foundational pillars of Apache Airflow—DAGs, Tasks, and why orchestration beats manual scripts.

For this practical Airflow project, we will build an ETL pipeline to aggregate market data from Massive API. But instead of just writing it one way, we look into writing two different versions of the same DAG:

The Traditional Approach: Using classic standard operators (PythonOperator) and manual XCom pulling/pushing.
The Modern Approach: Using the TaskFlow API (@dag and @task decorators) to make the code clean and Pythonic.

If you’re confused about the difference or wondering which one you should use in your projects, let’s break down both.

The Goal: Aggregating Data from Massive API

The pipeline does three basic things:

Extract: Pulls raw daily open/close asset data from our market API client.
Transform: Normalizes the heavy, nested JSON payload into a clean structure using pandas.
Load: Inserts the structured records into a cloud PostgreSQL database.

Approach 1: The Traditional Way (Standard Operators)

When Airflow was first built, you had to explicitly define every single task using an Operator class and manually stitch them together using the bitshift operators (>>).

The trickiest part here is data sharing. Because tasks run in isolation, we have to use explicit XComs (cross-communications) to pass our API payload from the extract task to the transform task.

Here is how the traditional DAG looks:

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta
import pandas as pd
# (Assume our custom API extraction and DB loading logic are imported here)

default_args = {
    'owner': 'my_name',
    'start_date': datetime(2026, 6, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

def _extract(ti):
    # Fetching data from our massive asset API
    raw_data = {"ticker": "BTC", "price": 65000, "timestamp": "2026-06-07T00:00:00Z"} 
    # We MUST explicitly push to XCom so the next task can see it
    ti.xcom_push(key='raw_market_data', value=raw_data)

def _transform(ti):
    # We MUST explicitly pull from XCom
    raw_data = ti.xcom_pull(key='raw_market_data', task_ids='extract_task')
    print(f"Transforming data for: {raw_data['ticker']}")
    # Transformation logic goes here...
    return raw_data  # Returning automatically pushes to default XCom

with DAG(
    dag_id='market_data_traditional_v1',
    default_args=default_args,
    schedule='@hourly',  # 'schedule' replaces the deprecated 'schedule_interval' (Airflow 2.4+)
    catchup=False
) as dag:

    extract_task = PythonOperator(
        task_id='extract_task',
        python_callable=_extract
    )

    transform_task = PythonOperator(
        task_id='transform_task',
        python_callable=_transform
    )

    # Stitching the pipeline dependency together
    extract_task >> transform_task

The Verdict on Traditional:

It works perfectly, but it feels heavy. You have to write a lot of boilerplate code just to set up the tasks, and passing data requires explicitly passing the Task Instance (ti) and remembering exact task_ids string names. If you typo a string, xcom_pull quietly returns None and the task fails as soon as it tries to use the data.
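A toy model makes the string-matching problem easy to see. Airflow keeps real XComs in its metadata database; the sketch below replaces that with a plain dictionary keyed by (task_id, key). The class name ToyXCom is invented for illustration and is not part of Airflow.

```python
class ToyXCom:
    """Dictionary-backed stand-in for XCom storage."""

    def __init__(self):
        self._store = {}

    def push(self, task_id, key, value):
        self._store[(task_id, key)] = value

    def pull(self, task_ids, key):
        # A lookup that finds nothing returns None instead of raising
        return self._store.get((task_ids, key))


xcom = ToyXCom()
xcom.push("extract_task", "raw_market_data", {"ticker": "BTC", "price": 65000})

found = xcom.pull(task_ids="extract_task", key="raw_market_data")
missing = xcom.pull(task_ids="extract_tsk", key="raw_market_data")  # typo in task id

print(found)
print(missing)
```

Because the failed lookup is silent, the error only surfaces later, when the downstream code tries to index into None.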

Approach 2: The Modern Way (TaskFlow API)

Introduced in Airflow 2.0, the TaskFlow API uses Python decorators (@dag and @task). It fundamentally changes how we write pipelines by making Airflow handle XComs silently behind the scenes.

Look at how clean this version is:

from airflow.decorators import dag, task
from datetime import datetime, timedelta

default_args = {
    'owner': 'my_name',
    'start_date': datetime(2026, 6, 1),
    'retries': 1,
}

@dag(dag_id='market_data_taskflow_v1', default_args=default_args, schedule='@hourly', catchup=False)
def market_data_etl():

    @task()
    def extract():
        raw_data = {"ticker": "BTC", "price": 65000, "timestamp": "2026-06-07T00:00:00Z"}
        return raw_data  # No manual xcom_push! Airflow handles it.

    @task()
    def transform(raw_data: dict):
        # No manual xcom_pull! We just treat it like a regular Python variable.
        print(f"Transforming data for: {raw_data['ticker']}")
        return raw_data

    # Passing extract()'s output into transform() sets up the dependency AND passes the data!
    market_data = extract()
    transform(market_data)

# Instantiate the DAG
market_data_etl_dag = market_data_etl()

The Verdict on TaskFlow:

This feels like writing native Python! You don’t have to instantiate PythonOperator manually, and dependencies are implicitly built simply by passing outputs into inputs (transform(extract())).

Seeing Both in the Airflow UI

Once I deployed both DAG files to my local environment running on WSL, they popped up immediately in my Airflow Web UI dashboard.

Even though the code looks completely different, Airflow creates the exact same visual graph architecture for both under the hood.

Traditional DAG

When I ran them, checking the logs for the TaskFlow DAG showed how cleanly it handled the data context without any missing parameters.

Which One Should You Use?

After building this massive API project using both paradigms, here is my takeaway for fellow beginners:

Use TaskFlow API whenever you are working strictly with Python functions (PythonOperator). It reduces boilerplate code significantly, makes your code highly readable, and saves you from the headache of managing raw XCom keys.
Use Standard Operators when you need to interact with external tools directly via specialized operators (like PostgresOperator, S3CreateObjectOperator, or BashOperator). TaskFlow is amazing for Python-native workflows, but standard classes are still essential for interacting with cloud infra and heavy enterprise databases.

What's Next?

Now that our pipeline can be scheduled and orchestrated automatically, another major question popped up: Where do we deploy this safely so it runs the exact same way on a remote server as it does on my local machine? Right now, everything relies on my specific local Python setup and WSL configuration. If I send this code to a teammate, they might hit a wall of environment errors.

In the next part of this series, we are diving into Docker to containerize our database and ETL processes so they can run seamlessly anywhere!

Which style do you prefer writing in Airflow? Let me know your thoughts in the comments!

Why use Apache Airflow? Quick Guide for Data Engineers

Gathuru_M — Sun, 07 Jun 2026 18:02:19 +0000

If you followed my last post, we successfully built an ETL pipeline that fetched data from the News API, cleaned it with pandas, and loaded it into a PostgreSQL database. It felt amazing to watch it run successfully in the terminal.

But what if the News API goes down for 10 minutes? Or what if my laptop is asleep when the script is supposed to run?

In the real world, you can't just sit at your laptop and manually click "Run" on a Python script every day. You need automation, monitoring, and a way to handle failures. That is exactly where Apache Airflow comes in.

The Problem:

Before tools like Airflow, developers relied heavily on Cron jobs (a built-in Linux tool used to schedule scripts at specific times). Cron is great for simple things, but it has huge blind spots for data engineering:

No Dependency Management: If your "Transform" script takes longer than usual, your "Load" script might start running before the data is even ready, causing a massive crash.
Lack of Visibility: If a script fails at 3 AM, you won't know until you check the logs manually or notice empty tables the next morning.
No Easy Retries: If a network glitch causes an API call to fail, Cron won't automatically try again 5 minutes later. You have to handle that messy logic yourself in Python.
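
To make the last point concrete, here is a rough sketch of the kind of retry logic you would have to write yourself under cron. The helper name call_with_retries and its parameters are hypothetical, made up for this illustration; they are not part of cron or Airflow.

```python
import time


def call_with_retries(func, retries=3, delay_seconds=300, sleep=time.sleep):
    """Call func(); if it raises, wait and try again, up to `retries` extra attempts."""
    attempt = 0
    while True:
        try:
            return func()
        except Exception as exc:
            if attempt >= retries:
                raise  # out of attempts: let the failure surface
            attempt += 1
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay_seconds}s")
            sleep(delay_seconds)
```

In Airflow, the same behaviour is declared on the task with the retries and retry_delay arguments, so none of this needs to live in your script.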

Airflow solves all of this by acting as the workflow orchestrator. It doesn't actually store or process your data; instead, it acts as the manager telling your scripts exactly when to run, in what order, and what to do if something breaks.

The Core Concepts Explained

Let’s break down the four most important concepts using a simple analogy: Baking a Cake.

1. The DAG (Directed Acyclic Graph)

Think of a DAG as the entire recipe for your cake.

Directed: It has a clear starting point and moves in a specific direction (you can't frost the cake before you bake it).
Acyclic: It cannot go in circles. Step C cannot loop back and trigger Step A, otherwise your pipeline would run forever.
Graph: It's just a structural map of how your steps link together.

In data engineering, your DAG is the blueprint of your entire ETL pipeline.
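
The three properties above can be sketched with Python's standard-library graphlib module. This is only an illustration of the idea, not how Airflow stores DAGs, and the step names are invented for the cake analogy.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a step; its value is the set of steps that must finish first.
recipe = {
    "mix_batter": set(),
    "bake": {"mix_batter"},
    "frost": {"bake"},
}

# Directed: the dependencies give one valid running order.
order = list(TopologicalSorter(recipe).static_order())
print(order)  # ['mix_batter', 'bake', 'frost']

# Acyclic: making the first step depend on the last one creates a loop,
# and the sorter refuses to produce an order.
recipe_with_loop = dict(recipe, mix_batter={"frost"})
try:
    list(TopologicalSorter(recipe_with_loop).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)  # True
```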

2. Operators

If the DAG is the recipe, Operators are the kitchen appliances. They are the pre-built templates that define what actually gets done. Airflow provides different types of operators:

PythonOperator: Used to execute a piece of Python code (like our transform_data function).
PostgresOperator: Used to run SQL queries directly inside a Postgres database.
BashOperator: Used to run command-line terminal scripts.

3. Tasks

A Task is an operator that has been given specific instructions. It’s a single node inside your DAG. For example, using a PythonOperator to run extract_data() becomes the "Extract Task".

4. XComs (Cross-Communications)

In our standalone Python script, passing data was easy: we just returned a value from one function and passed it into the next (cleaned_df = transform_data(raw_data)).

In Airflow, tasks run completely independently. They can't easily talk to each other. XComs are like little sticky notes that tasks use to pass small amounts of data or metadata down the line. One task "pushes" a note, and the next task "pulls" it.

A Quick Peek at the Airflow UI

The absolute best part of Apache Airflow is its user interface. Instead of staring at text scrolling through a dark terminal window, Airflow gives you a beautiful visual dashboard where you can see your pipelines running in real-time.

When a pipeline runs successfully, the tasks turn a satisfying dark green. If a task fails, it turns red, making it incredibly easy to spot exactly where your pipeline broke.

Airflow Web UI Dags page

What We’re Doing Next

Now that we understand the foundational pillars of Airflow (DAGs, Operators, Tasks, and XComs) and why we use them over simple cron jobs, it’s time to get our hands dirty.

In my next article, we are going to break down an ETL project into Airflow Tasks and watch it run automatically inside the Airflow UI.

Are you using Airflow, or do you prefer another tool for orchestration?

ETL Pipeline: Fetching Real-Time News Data with Python and Postgres

Gathuru_M — Sun, 07 Jun 2026 17:25:39 +0000

The best way to actually understand data engineering is to build something that breaks, fix it, and watch it successfully run.

In this article, we build an ETL pipeline that pulls data from the News API, cleans it up using pandas, and loads it into a local PostgreSQL database.

If you are a beginner Python developer or just getting into data engineering, this one is for you!

The Goal & The Architecture

Before writing a single line of code, let’s look at what we are actually trying to achieve:

Extract: Connect to the News API using Python, fetch the top headlines about technology, and pull down the raw data.
Transform: The raw data comes back as a messy, nested JSON object. We'll use pandas to flatten it, pick the columns we actually care about, handle missing values, and format the dates.
Load: Connect to a local PostgreSQL database and append our clean data into a structured table.

Step 1: Setting Up the Database

First, we need a place for our data to live. I used a PostgreSQL instance running on the cloud with Aiven.

We need a clean target table. Here is the SQL script I used to create a simple news_articles table. Notice how we have to be careful with our data types (like using TIMESTAMP for dates and TEXT for long URLs).

CREATE TABLE IF NOT EXISTS news_articles (
    id SERIAL PRIMARY KEY,
    source VARCHAR(100),
    author VARCHAR(150),
    title TEXT,
    description TEXT,
    url TEXT,
    published_at TIMESTAMP,
    extracted_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Below is a diagram of the news_articles table created.

Step 2: The Python ETL Script

To keep things clean and modular, I broke the code down into three distinct functions representing E, T, and L.

Make sure you have your dependencies installed before running this:

pip install requests pandas psycopg2-binary sqlalchemy

Here is the full, documented script:

import os

import pandas as pd
import requests
from sqlalchemy import create_engine  # for PostgreSQL URLs, SQLAlchemy defaults to the psycopg2 driver

# Configuration
API_KEY = "YOUR_NEWS_API_KEY"
URL = f"https://newsapi.org/v2/everything?q=technology&apiKey={API_KEY}"

DB_PARAMS = {
    "host": "localhost",
    "database": "your_db_name",
    "user": "your_username",
    "password": "your_password",
    "port": "5432"
}

def extract_data():
    print("--- Starting Extraction ---")
    response = requests.get(URL)
    if response.status_code == 200:
        data = response.json()
        articles = data.get("articles", [])
        print(f"Successfully extracted {len(articles)} articles.")
        return articles
    else:
        raise Exception(f"API Request Failed with status code: {response.status_code}")

# Transforming Data

def transform_data(articles):

    # Create an empty list to store clean articles after looping
    cleaned_data = []

    # Parse through each dictionary to extract what we need
    for article in articles:
        clean_article = {
            'source' : article.get('source', {}).get('name', 'Unknown'),
            'author' : article.get('author'),
            'title' : article.get('title'),
            'description' : article.get('description'),
            'url' : article.get('url'),
            'publishedAt' : article.get('publishedAt')
        }

        cleaned_data.append(clean_article)

    df = pd.DataFrame(cleaned_data)

    # Rename column to match Postgres fields
    df = df.rename(columns={'publishedAt': 'published_at'})

    # Handle missing values  
    df['author'] = df['author'].fillna('Unknown')
    df['description'] = df['description'].fillna('No Description')

    # Format dates
    df['published_at'] = pd.to_datetime(df['published_at'])

    print("Transformation complete!")
    return df

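def load_data(df):

    # Connection string is read from the URI environment variable,
    # e.g. postgresql://user:password@host:port/dbname
    db_URI = os.getenv('URI')

    try:
        engine = create_engine(db_URI)

        # Append rows to the existing table (assumes it lives in the news_api schema)
        df.to_sql(
            name = 'news_articles',
            con = engine,
            schema = 'news_api',
            if_exists = 'append',
            index = False
        )
        print("Data loaded successfully to 'news_articles'")
    except Exception as e:
        print(f"Failed to load data to the database: {e}")
        raise  # Re-raise so the pipeline does not report success after a failed load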

# Run the pipeline
if __name__ == "__main__":
    try:
        raw_data = extract_data()
        cleaned_df = transform_data(raw_data)
        load_data(cleaned_df)
        print("ETL Pipeline Finished Successfully! 🎉")
    except Exception as e:
        print(f"Pipeline failed: {e}")

Step 3: Running the Pipeline and Verifying the Results

When I first ran this script, I ran into a classic beginner issue: the date format coming from the API included a Z at the end (e.g., 2026-06-07T06:00:00Z), which caused my local database to complain until I used pd.to_datetime() to safely parse it.
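
For anyone curious what that parse involves, here is a minimal standard-library sketch. It assumes the timestamp always has second precision and a trailing Z (meaning UTC), as in the example above; parse_published_at is a hypothetical helper name, and pd.to_datetime() does the equivalent work for a whole column at once.

```python
from datetime import datetime, timezone


def parse_published_at(value):
    """Parse an ISO 8601 UTC timestamp such as '2026-06-07T06:00:00Z'."""
    # strptime matches the trailing 'Z' as plain text, so attach UTC explicitly.
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


ts = parse_published_at("2026-06-07T06:00:00Z")
print(ts.isoformat())  # 2026-06-07T06:00:00+00:00
```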

But once those quirks were ironed out, running the script in the terminal yielded beautiful logs:

![Successful execution](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/vhq6zbzsct5asbdocdcf.png)

To verify it actually worked, I hopped back into my database client and ran a simple query:

SELECT source, author, title, published_at FROM news_articles LIMIT 5;

And there it was—real live web data, organized neatly into my own database tables!

Key Takeaways from My First Project

Building this taught me a few massive lessons that you don't get just by reading textbooks:

API Data is messy: You can almost never load API responses directly into a database. Nested dictionaries (like the source field in this project) require explicit flattening.
Idempotency matters: If I run this script twice right now, it will duplicate all the articles. As I move forward, I need to look into how to handle duplicates (like checking for existing URLs before inserting).
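
One possible way to handle the duplicate problem is to let the database reject repeated URLs. The sketch below uses an in-memory SQLite database as a stand-in so it runs anywhere, and it assumes a UNIQUE constraint on url, which the table in this article does not have. PostgreSQL accepts the same ON CONFLICT ... DO NOTHING clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE news_articles (
        id INTEGER PRIMARY KEY,
        title TEXT,
        url TEXT UNIQUE  -- assumption: one row per article URL
    )
    """
)


def load_articles(connection, articles):
    """Insert articles, skipping any URL that is already stored."""
    connection.executemany(
        "INSERT INTO news_articles (title, url) VALUES (:title, :url) "
        "ON CONFLICT (url) DO NOTHING",
        articles,
    )
    connection.commit()


batch = [
    {"title": "First story", "url": "https://example.com/a"},
    {"title": "Second story", "url": "https://example.com/b"},
]
load_articles(conn, batch)
load_articles(conn, batch)  # running the load again adds nothing

row_count = conn.execute("SELECT COUNT(*) FROM news_articles").fetchone()[0]
print(row_count)  # 2
```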

What's Next?

Running this manually via a Python script is great for practice, but what if I want this data updated every single morning at 6 AM while I'm asleep? I can't just sit here and click "Run" manually.

That is where Data Orchestration comes in. In my next article, we are going to look at Apache Airflow and how we can take this exact code and turn it into an automated, scheduled workflow!

Have you built an ETL pipeline before? What was the trickiest data cleaning issue you faced? Let’s chat in the comments!

ETL vs. ELT: Which Approach Should You Use and Why?

Gathuru_M — Thu, 14 May 2026 20:20:27 +0000

1. Introduction

Understanding a company's data architecture can feel overwhelming, but once you cut out the noise, you will notice one of two operations taking place: ETL or ELT.

These two operations are the backbone of how data moves from a source (like an app or a database) to a destination (like a data warehouse). While they sound almost identical, the order of the letters changes everything about how a company manages its data.
In this article, we will break down both approaches for better understanding and I will give my take on which is better.

2. What is ETL? (Extract, Transform, Load)

ETL is the traditional way of handling data that follows the process:

Extract: Pull data from various sources (Excel, SQL databases, APIs).
Transform: Before the data reaches its final home, it is cleaned and formatted in a Staging Layer (a temporary storage area). Business logic is applied here to make the data "useful."
Load: The cleaned, "ready-to-use" data is finally saved in the destination.

Key Characteristic: The data is transformed before it is stored.
Common Tools: Microsoft SSIS, Talend, Informatica.

3. What is ELT? (Extract, Load, Transform)

ELT is the more modern approach. As the name suggests, it uses the same three steps but in a different order.

Extract: Pull the raw data from the sources.
Load: Instead of cleaning it first, you move the raw data directly into a high-capacity storage system, like a Data Lake or a Data Warehouse (BigQuery, Snowflake).
Transform: You perform the cleaning and modelling after the data is already in its destination.

Key Characteristic: The data is stored raw and transformed afterwards, so it grows into a historical archive. Since the raw data is always there, you can go back and re-transform it differently next year if your business needs change.
Common Tools: Fivetran or Airbyte (for loading), and dbt (for the transformation part).

Comparison Table

Feature	ETL	ELT
Order	Transform before Loading	Load before Transforming
Storage	Uses temporary Staging	Uses permanent Data Lake
Flexibility	Rigid / Fixed	Highly Flexible
Best For	On-premise / Small data	Cloud / Big Data
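
To see the difference in ordering, here is a toy sketch that runs the same three rows through both approaches. It uses in-memory SQLite databases as stand-ins for a real warehouse, and the table names and sample rows are invented for illustration.

```python
import sqlite3

raw_rows = [("  Alice ", "100"), ("Bob", "250"), ("  Carol", "75")]

# ETL: transform in Python first, then load only the cleaned rows.
etl_db = sqlite3.connect(":memory:")
etl_db.execute("CREATE TABLE sales (customer TEXT, amount INTEGER)")
cleaned = [(name.strip(), int(amount)) for name, amount in raw_rows]
etl_db.executemany("INSERT INTO sales VALUES (?, ?)", cleaned)

# ELT: load the raw rows untouched, then transform inside the database with SQL.
elt_db = sqlite3.connect(":memory:")
elt_db.execute("CREATE TABLE raw_sales (customer TEXT, amount TEXT)")
elt_db.executemany("INSERT INTO raw_sales VALUES (?, ?)", raw_rows)
elt_db.execute(
    """
    CREATE TABLE sales AS
    SELECT TRIM(customer) AS customer, CAST(amount AS INTEGER) AS amount
    FROM raw_sales
    """
)

etl_total = etl_db.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
elt_total = elt_db.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
raw_kept = elt_db.execute("SELECT COUNT(*) FROM raw_sales").fetchone()[0]
print(etl_total, elt_total, raw_kept)  # 425 425 3
```

Both routes end with the same clean sales table, but only the ELT database still holds the untouched raw_sales rows for later re-transformation.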

4. Which is Better?

With cloud storage becoming cheaper and databases becoming more powerful, ELT is usually the preferred option. Here’s why:

Scalability: ELT can handle massive "Big Data" sets that would crash a traditional ETL staging server.
Flexibility: Because you store the raw data first, you never lose information. In ETL, if you don't "transform" a column, it's gone. In ELT, you can decide to use that column later.
Speed: You can load data as often as you want without waiting for complex cleaning scripts to finish.

While ETL is still used for highly sensitive data or older systems, ELT is the best approach for modern, cloud-based data engineering. It is more scalable, flexible, and allows for much deeper historical analysis.

How to connect Power BI to a PostgreSQL Database

Gathuru_M — Thu, 14 May 2026 17:20:31 +0000

Introduction

Power BI is a business analytics platform developed by Microsoft. It is mostly used by organizations to transform raw data from multiple sources into interactive visualizations that support data-driven decision-making.

Data sitting in various places in an organization—such as databases, Excel files, and cloud storage—can be collectively imported into Power BI to create a wholesome report. This allows data analysts and developers to extract insights that would otherwise be hidden in rows of text.

Most companies connect Power BI directly to databases for the following reasons:

Real-time Data Access: It gives visibility of the most current data available.
Automation: Reports can be scheduled to refresh automatically, removing the need for manual updates.
Efficiency: It saves analysts time by eliminating the need to generate new reports every time the underlying data changes.

PostgreSQL is a primary example of a SQL database. These are commonly used in companies as they offer a structured and secure environment to store, manage, and retrieve large volumes of data effectively.

In this article, we will walk through connecting a Postgres Database to Power BI Desktop.

Part 1: Connecting to a Local PostgreSQL Database

To connect to a Postgres database sitting on your own machine:

Open Power BI Desktop.
On the Home ribbon, click Get Data, select PostgreSQL database from the list, and click Connect.
Connection Details: For the Server, type localhost. This tells Power BI to look at your own machine.
Database Name: Enter the specific name of your database (e.g., sales_db).
Credentials: When prompted, enter the username and password you created during PostgreSQL installation.

After successful authentication, a popup appears showing the existing tables in your database. Select the tables you need and click Load to bring them into the Power BI environment.

Part 2: Moving to the Cloud (Aiven PostgreSQL)

In most real work environments, data lives in the cloud so everyone can access it from anywhere. For this guide, we will be using Aiven. Connecting to a cloud database requires a few extra security steps.

How to connect:

Open Power BI Desktop. Click Get Data, select PostgreSQL database, and click Connect.
Obtain Details: From your Aiven console, copy your Host, Port, Database Name, Username, and Password.
Download the Certificate: In the list of connection details, download the ca.pem file by clicking the download icon.

Why SSL?
When connecting to a cloud database, we use SSL certificates. This acts as a secure tunnel for your data as it travels over the internet. SSL encrypts the connection so that malicious actors cannot "intercept" your credentials or your company's private data.

Enter database details: Fill in the details obtained from Aiven.

Under server name, enter the host name and the port in the format: `host:port`

SSL Configuration: You may receive an error because Power BI cannot automatically verify the cloud's CA certificate.

To resolve this, manually import the certificate to your Windows machine:

Press Windows + R, type certmgr.msc, and press Enter.
Select Trusted Root Certification Authorities, then click Certificates.
Right-click the folder, choose All Tasks > Import.
Browse to your ca.pem file (change file type to "All Files" to see it). Select it and click Next.
Finish the wizard. You will get a popup saying "Import Successful."

Restart Power BI to apply the changes. Your connection will now be successful, and your tables will be ready for loading.

Part 3: Building the Data Model

Once your tables (Customers, Products, Sales, and Inventory) are loaded, they appear in the "Data" pane. Now you must connect them.

Data Modeling is the process of telling Power BI how these tables relate to one another. For example, the Customer_ID in your Sales table should link to the ID in your Customers table.

The Benefit: By creating these relationships, you can filter a chart by "Customer Name" and see exactly what "Products" they bought across all "Sales."

Power BI often creates these automatically, which you can verify in the Model View pane.

Conclusion: Why SQL is a Superpower for Power BI Analysts

You might wonder, "If I have Power BI, do I still need SQL?" The answer is yes. SQL skills are important because they allow you to:

Filter at the source: Instead of bringing 1 million rows into Power BI and slowing it down, you can write a SQL query to only bring in the specific data you need.
Data Preparation: You can perform complex aggregations and clean up messy data before it even reaches your dashboard.

Mastering the connection between SQL and Power BI is what turns a basic report into a powerful, automated business tool.

SQL Joins Explained: Case Example

Gathuru_M — Mon, 02 Mar 2026 18:43:59 +0000

Structured Query Language (SQL) is a language for storing, manipulating, and retrieving data in a relational database.

SQL joins are clauses that combine records from two or more tables in a database, just as the name suggests.

In this article, you will learn, using a case example, the different types of joins, when, and how they are used. Be sure to check for “bonus joins” included at the end of the article.

Example Data

We will use data from these 2 tables to show various ways to display employees from John Smith's company.

Departments Table

Department_Id	Department_Name
1	Executive
2	HR
3	Sales
4	Development
5	Support
6	Research

Employees Table

Employee_Id	Full_Name	Department_Id	Job_Role	Manager_Id
1	John Smith	1	CEO	Null
2	Sarah Goodes	1	CFO	1
3	Wayne Ablett	1	CIO	1
4	Michelle Carey	2	HR Manager	1
5	Chris Matthews	3	Sales Manager	2
6	Andrew Judd	4	Development Manager	3
7	Danielle McLeod	5	Support Manager	3
8	Matthew Swan	2	HR Representative	4
9	Stephanie Richardson	2	Salesperson	5
10	Tony Grant	3	Salesperson	5
11	Jenna Lockett	4	Front-End Developer	6
12	Michael Dunstall	4	Back-End Developer	6
13	Jane Voss	4	Back-End Developer	6
14	Anthony Hird	Null	Support	7
15	Natalie Rocca	5	Support	7

The code below shows the syntax for writing a JOIN:

SELECT columns
FROM table1
JOIN table2 ON table1.column1=table2.column1

Types of Joins

There are four major types of joins in SQL, as listed below:

Inner Join
Left Join
Right Join
Full Join

Inner Join

It creates a result table that displays information common between two tables based on a shared piece of information.

It is the most important and frequently used of the joins.

In this case, we could use it to display employees from the departments listed in the first table.

SELECT employees.Full_Name, employees.Job_Role, departments.Department_Name
FROM departments
INNER JOIN employees ON departments.Department_Id = employees.Department_Id
 -- You can replace the keyword INNER JOIN with JOIN

You should get the results below

Result

Full_Name	Job_Role	Department_Name
John Smith	CEO	Executive
Sarah Goodes	CFO	Executive
Wayne Ablett	CIO	Executive
Michelle Carey	HR Manager	HR
Chris Matthews	Sales Manager	Sales
Andrew Judd	Development Manager	Development
Danielle McLeod	Support Manager	Support
Matthew Swan	HR Representative	HR
Stephanie Richardson	Salesperson	HR
Tony Grant	Salesperson	Sales
Jenna Lockett	Front-End Developer	Development
Michael Dunstall	Back-End Developer	Development
Jane Voss	Back-End Developer	Development
Natalie Rocca	Support	Support

Left Join

A LEFT JOIN returns all rows from the left table, plus matched values from the right table or NULL in case of no match.

Note: The left table refers to the table that appears before the "LEFT JOIN" keywords in your SQL query. Same case applies when using RIGHT JOIN

We could use left join to retrieve a list of all employees along with their department names. If an employee doesn't belong to a department, display NULL

SELECT employees.Employee_Id, employees.Full_Name, departments.Department_Name
FROM employees
LEFT JOIN departments ON departments.Department_Id = employees.Department_Id
ORDER BY employees.Employee_Id ASC

You should get the results below

Result

Employee_Id	Full_Name	Department_Name
1	John Smith	Executive
2	Sarah Goodes	Executive
3	Wayne Ablett	Executive
4	Michelle Carey	HR
5	Chris Matthews	Sales
6	Andrew Judd	Development
7	Danielle McLeod	Support
8	Matthew Swan	HR
9	Stephanie Richardson	HR
10	Tony Grant	Sales
11	Jenna Lockett	Development
12	Michael Dunstall	Development
13	Jane Voss	Development
14	Anthony Hird	NULL
15	Natalie Rocca	Support

You can use the COALESCE() function to replace NULL values with more relatable user-defined values.

i.e. If an employee doesn't belong to a department, display "No Department" instead.

SELECT employees.Employee_Id, employees.Full_Name,
COALESCE(departments.Department_Name, 'No Department') AS Department_Name
FROM employees
LEFT JOIN departments ON departments.Department_Id = employees.Department_Id
ORDER BY employees.Employee_Id ASC

You should get the results below

Result

Employee_Id	Full_Name	Department_Name
1	John Smith	Executive
2	Sarah Goodes	Executive
3	Wayne Ablett	Executive
4	Michelle Carey	HR
5	Chris Matthews	Sales
6	Andrew Judd	Development
7	Danielle McLeod	Support
8	Matthew Swan	HR
9	Stephanie Richardson	HR
10	Tony Grant	Sales
11	Jenna Lockett	Development
12	Michael Dunstall	Development
13	Jane Voss	Development
14	Anthony Hird	No Department
15	Natalie Rocca	Support
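
If you want to try the LEFT JOIN and COALESCE() combination without setting up a database server, the sketch below runs it from Python against an in-memory SQLite database. It loads only a small subset of the example rows, and SQLite is just a stand-in here; the query itself is the same shape as the one above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE departments (Department_Id INTEGER PRIMARY KEY, Department_Name TEXT);
    CREATE TABLE employees (
        Employee_Id INTEGER PRIMARY KEY,
        Full_Name TEXT,
        Department_Id INTEGER
    );
    INSERT INTO departments VALUES (1, 'Executive'), (2, 'HR'), (6, 'Research');
    INSERT INTO employees VALUES
        (1, 'John Smith', 1),
        (4, 'Michelle Carey', 2),
        (14, 'Anthony Hird', NULL);
    """
)

# Every employee is kept; a missing department becomes 'No Department'.
rows = conn.execute(
    """
    SELECT e.Full_Name, COALESCE(d.Department_Name, 'No Department')
    FROM employees AS e
    LEFT JOIN departments AS d ON d.Department_Id = e.Department_Id
    ORDER BY e.Employee_Id
    """
).fetchall()
print(rows)
# [('John Smith', 'Executive'), ('Michelle Carey', 'HR'), ('Anthony Hird', 'No Department')]
```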

Right Join

The SQL RIGHT JOIN returns all rows from the right table, even if there are no matches in the left table. It is the mirror image of the LEFT JOIN.

In this case, we could list every department alongside its employees, including departments that have no employees.

SELECT Department_Name,
COALESCE(employees.Full_Name, 'No Employee') AS Full_Name
FROM employees
RIGHT JOIN departments ON departments.Department_Id = employees.Department_Id;

You should get the results below

Result

Department_Name	Full_Name
Executive	Wayne Ablett
Executive	Sarah Goodes
Executive	John Smith
HR	Stephanie Richardson
HR	Matthew Swan
HR	Michelle Carey
Sales	Tony Grant
Sales	Chris Matthews
Development	Jane Voss
Development	Michael Dunstall
Development	Jenna Lockett
Development	Andrew Judd
Support	Natalie Rocca
Support	Danielle McLeod
Research	No Employee

Full Join

The SQL FULL JOIN combines the results of both left and right outer joins.

The joined table will contain all records from both tables and fill in NULLs for missing
matches on either side.

SELECT employees.Employee_Id, employees.Full_Name, employees.Job_Role, 
departments.Department_Id, departments.Department_Name
FROM employees 
FULL JOIN departments ON departments.Department_Id = employees.Department_Id
ORDER BY employees.Employee_Id ASC

You should get the results below

Result

EMPLOYEE_ID	FULL_NAME	JOB_ROLE	DEPARTMENT_ID	DEPARTMENT_NAME
1	John Smith	CEO	1	Executive
2	Sarah Goodes	CFO	1	Executive
3	Wayne Ablett	CIO	1	Executive
4	Michelle Carey	HR Manager	2	HR
5	Chris Matthews	Sales Manager	3	Sales
6	Andrew Judd	Development Manager	4	Development
7	Danielle McLeod	Support Manager	5	Support
8	Matthew Swan	HR Representative	2	HR
9	Stephanie Richardson	Salesperson	2	HR
10	Tony Grant	Salesperson	3	Sales
11	Jenna Lockett	Front-End Developer	4	Development
12	Michael Dunstall	Back-End Developer	4	Development
13	Jane Voss	Back-End Developer	4	Development
14	Anthony Hird	Support	NULL	NULL
15	Natalie Rocca	Support	5	Support
NULL	NULL	NULL	6	Research

Bonus Joins

Self Join

A SELF JOIN is technically not a separate join type but a way of joining a table to itself. You use it to combine rows in a table based on a related column in that same table.

It is commonly used when:

You need to traverse a hierarchical structure where each row references another row in the same table.
You are comparing data between rows in the same table.

In this case, we will use self-join in the employees’ table to display the employee-manager relationship. Each employee record contains a reference to the manager's ID, allowing us to retrieve more information about the managers by adding another column, “Manager_Name”. In Data Science, this is an example of feature engineering.

SELECT
e.Employee_Id,
e.Full_Name,
e.Job_Role,
m.Employee_Id AS Manager_Id,
m.Full_Name AS Manager_Name
FROM employees e
INNER JOIN employees m ON e.Manager_Id = m.Employee_Id
ORDER BY e.Employee_Id;

You should get the results below

Result

EMPLOYEE_ID	FULL_NAME	JOB_ROLE	MANAGER_ID	MANAGER_NAME
2	Sarah Goodes	CFO	1	John Smith
3	Wayne Ablett	CIO	1	John Smith
4	Michelle Carey	HR Manager	1	John Smith
5	Chris Matthews	Sales Manager	2	Sarah Goodes
6	Andrew Judd	Development Manager	3	Wayne Ablett
7	Danielle McLeod	Support Manager	3	Wayne Ablett
8	Matthew Swan	HR Representative	4	Michelle Carey
9	Stephanie Richardson	Salesperson	5	Chris Matthews
10	Tony Grant	Salesperson	5	Chris Matthews
11	Jenna Lockett	Front-End Developer	6	Andrew Judd
12	Michael Dunstall	Back-End Developer	6	Andrew Judd
13	Jane Voss	Back-End Developer	6	Andrew Judd
14	Anthony Hird	Support	7	Danielle McLeod
15	Natalie Rocca	Support	7	Danielle McLeod

That's most of what you need to know about SQL Joins. Hopefully, you found this information helpful. Feel free to share it with anyone who's having a hard time with Joins, and keep learning! 😊

If you have any questions, please leave them in the comments section below.

Thanks for reading!

Turning Messy Data into Business Action

Gathuru_M — Mon, 09 Feb 2026 02:07:28 +0000

How Analysts Translate Messy Data, DAX, and Dashboards into Action Using Power BI

As a data analyst, whether in the corporate world or working on a personal project, you will find that data analysis is often driven by the need to solve a problem or wanting an accurate, insightful view of data to make a better decision.

1. Define the Objective

First, we need to come up with the objective of our analysis.

Ask Questions: Write down the specific questions you want answers to from the analysis.
Theme Your Dashboard: If you need a dashboard, design one that maintains a clear theme. Don’t just show every kind of data to the user—focus on what matters.

2. Gathering the Data

The data we need is often scattered. You might need to:

Scrape it online.
Use data from survey results.
Combine several Excel sheets.
Fetch data from a database.

So, how do analysts translate this Messy Data into action?

Power BI allows you to access data from various sources, load it, transform it, and analyze it—all in the same workspace.

In Power BI, we follow the ELT approach (Extract, Load, Transform).

The ELT Process

Extract and Load

First, we start by Extracting the required data.
When you launch Power BI, you get a prompt to select a data source. Simply select where your data lies and import it into Power BI by Loading it there.

Transform

Once all your data is loaded, the next step is Cleaning. You want to remove all inconsistencies in your data by ensuring:

Correct Data Types: All fields (Dates, Decimals, Text) must be set correctly.
Handle Missing Values: Deal with nulls or blanks by either removing rows or filling them where appropriate.
Splitting/Merging: Split or merge columns where necessary (e.g., splitting an "Address" column to get separate "Location" and "Street" fields).
Standardization: Ensure consistency (e.g., making sure all currency data is in either Dollars or Shillings across the entire dataset).

Data Modelling: The Integration Step

After the data is clean and transformed, the best way to integrate it is through Data Modelling.

As discussed in my previous article, data modelling helps you create a clear structure and establish necessary relationships before you begin any plotting or calculations. This involves creating a proper Star Schema, consisting of a Fact table and Dimension tables.

Note: Always make sure to click "Close & Apply" to save all changes made in the Power Query Editor before you start building your reports.

DAX

Once your data model is set up, you might find that the raw data doesn't provide all the answers immediately. This is where DAX (Data Analysis Expressions) comes in.
DAX is the formula language of Power BI, allowing us to calculate and generate new information in two primary ways:

Measures: Use DAX to create summary aggregations. These calculate values on the fly, such as Total Sales, Year-over-Year Growth, Profit Margin, or Average Selling Price.
Calculated Columns: Generate new columns within your tables to provide more granular information. For example, you could create a "Profit Status" column that labels each row as "Profitable" or "Loss" based on a calculation.

DAX respects your model’s relationships. When you create a measure and plot it in a visual, it automatically reacts to the "Filter Context." This means the numbers will automatically filter and update based on the Dimensions (such as Date, Category, or Location) you use in your report.
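
To make the difference between the two concrete, here is a rough Python analogy rather than actual DAX. It assumes a tiny, invented sales table; the `profit_margin` function and its keyword filters are hypothetical stand-ins for a measure and for the filter context that a visual would supply.

```python
# Hypothetical fact rows; in Power BI these would live in a Sales table.
sales = [
    {"category": "Dark", "location": "Nairobi", "revenue": 500.0, "cost": 350.0},
    {"category": "Dark", "location": "Mombasa", "revenue": 200.0, "cost": 260.0},
    {"category": "Milk", "location": "Nairobi", "revenue": 300.0, "cost": 150.0},
]

# Calculated column: evaluated once per row and stored alongside the row.
for row in sales:
    row["profit_status"] = "Profitable" if row["revenue"] > row["cost"] else "Loss"


def profit_margin(rows, **filters):
    """Measure: aggregate only the rows that match the current filters."""
    visible = [r for r in rows if all(r[k] == v for k, v in filters.items())]
    revenue = sum(r["revenue"] for r in visible)
    cost = sum(r["cost"] for r in visible)
    return (revenue - cost) / revenue if revenue else None


print(profit_margin(sales))                      # no filter: the whole table
print(profit_margin(sales, category="Dark"))     # filtered by Category
print(profit_margin(sales, location="Nairobi"))  # filtered by Location
```

The calculated column is fixed per row, while the measure returns a different number each time the filters change, which is the behaviour you see when you slice a visual by a Dimension.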

Visualizing for Answers

A well-modelled dataset lets us answer the questions we have about the data. The best way to do this is to plot Visuals. Why?

Spot Patterns Fast: Visuals help you and your audience see trends or patterns immediately.
Tell a Story: After answering your initial questions, creating a dashboard tells a story about the problem you want to solve.

A great dashboard should:

Describe the problem clearly.
Illustrate insights found in the data.
Suggest a potential solution to the problem.

By following this flow, your analysis remains relevant and provides genuine value, rather than just showing the dashboard user what they already know.

How to Implement Data Modelling in Power BI

Gathuru_M — Mon, 02 Feb 2026 21:25:14 +0000

In this article, we will explore the fundamental concepts of data modelling and, specifically, how to implement it within Power BI for effective data analysis.

Data modelling is simply structuring your data into tables and the relationships between them so that it is easy to access for analysis.

Often you'll find data spread across several sheets or systems, since that is how it's maintained. If you are lucky, your data may already sit in a neat setup (a Data Warehouse) where most of the structuring has been done.

Case Scenario:

Here is some sample data, to help make sense of this concept:
Sample Chocolate Sales Data

Understanding the Tables

In our case, we have 5 tables.

Fact Table: The one table that records the events being measured and contains pointers (keys) to the other tables, but carries little descriptive information of its own. In our case, the shipments table.
Dimension Tables: Each describes one dimension of the data, that is, what is happening from the perspective of one entity. These will be the rest of our tables.
Calendar Table: A special Dimension table found in most models that holds the time component of the analysis, used for time series, forecasting, or trend analysis.
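
The way a fact table points at its dimensions can be sketched in plain Python. The table contents, key names, and columns below are hypothetical and are not taken from the sample workbook; they only show how keys in the fact table are followed to reach descriptive fields in a dimension.

```python
# Dimension tables: one row per entity, keyed by an ID (hypothetical columns).
products = {
    "P01": {"product": "Dark Bar", "category": "Bars"},
    "P02": {"product": "Milk Bites", "category": "Bites"},
}
locations = {
    "L01": {"city": "Nairobi"},
    "L02": {"city": "Mombasa"},
}

# Fact table: one row per shipment, mostly keys plus numeric values.
shipments = [
    {"product_id": "P01", "location_id": "L01", "boxes": 10},
    {"product_id": "P01", "location_id": "L02", "boxes": 4},
    {"product_id": "P02", "location_id": "L01", "boxes": 7},
]

# Follow the keys from the fact table into a dimension to total boxes per category.
boxes_per_category = {}
for s in shipments:
    category = products[s["product_id"]]["category"]
    boxes_per_category[category] = boxes_per_category.get(category, 0) + s["boxes"]

print(boxes_per_category)
```

Power BI performs this lookup for you once the relationships exist; the loop only makes visible what a relationship does behind the scenes.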

Doing this in Power BI

Load Data: Load the data in Power BI by selecting Excel Workbook as the data source.
Automatic Modelling: Power BI automatically tries to match the tables and create a model; you can see this in the Model View.
Manual Relationship Setup: 🤔 However, not all the data was modelled automatically. To finish up, drag the 'Shipdate' field from the Shipment table onto 'cal_date' in the Calendar table to create a new relationship and complete the look.

Note: Notice that the schema now looks Star-shaped. This kind of model with one fact table and multiple dimension tables is called a Star Schema. The process of setting up this structure is Data Modelling.
Manage Relationships: You can view and edit all connections by clicking on Manage Relationships. This menu shows the relationship status, specifically whether they are Active or Inactive.
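
For the Shipdate to cal_date relationship to work, the Calendar table needs one row for every day that can appear in the fact table. A minimal sketch of that idea in Python, assuming a handful of invented ship dates and a hypothetical `build_calendar` helper:

```python
from datetime import date, timedelta

# Hypothetical ship dates taken from the fact table.
ship_dates = [date(2025, 1, 30), date(2025, 2, 2), date(2025, 1, 31)]


def build_calendar(start, end):
    """Return one row per day from start to end, inclusive."""
    rows = []
    day = start
    while day <= end:
        rows.append({
            "cal_date": day,
            "year": day.year,
            "month": day.month,
            "weekday": day.strftime("%A"),
        })
        day += timedelta(days=1)
    return rows


calendar = build_calendar(min(ship_dates), max(ship_dates))

# The relationship only works if every Shipdate finds a matching cal_date.
cal_dates = {row["cal_date"] for row in calendar}
unmatched = [d for d in ship_dates if d not in cal_dates]

print(len(calendar), "calendar rows;", len(unmatched), "unmatched ship dates")
```

The calendar includes days with no shipments at all, which is what lets a trend chart show gaps instead of silently skipping them.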

Table Characteristics

Relationship Design: The Products table lists each product once, in its own row, while the Shipments table can refer to the same product many times. This indicates a Many-to-One relationship (many shipments to one product).
Fact Table Design: The shipments table has more records than the dimension tables in our data.

Therefore, Fact tables are designed to hold many records, and may also have fewer columns than dimension tables.
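
A Many-to-One relationship can be checked with a few lines of Python. The key lists below are hypothetical; the point is only that keys are unique on the dimension side, may repeat on the fact side, and should never be missing from the dimension.

```python
from collections import Counter

# Hypothetical key columns from a dimension table and a fact table.
product_ids_in_products = ["P01", "P02", "P03"]
product_ids_in_shipments = ["P01", "P01", "P02", "P01", "P03", "P02"]

# "One" side: every key appears exactly once in the dimension table.
one_side_is_unique = len(set(product_ids_in_products)) == len(product_ids_in_products)

# "Many" side: the same key may repeat across fact rows.
rows_per_product = Counter(product_ids_in_shipments)

# Every fact key should exist in the dimension; otherwise the row
# cannot be attributed to any product.
orphans = set(product_ids_in_shipments) - set(product_ids_in_products)

print(one_side_is_unique, dict(rows_per_product), orphans)
```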

Conclusion

Effective data modelling is the backbone of any powerful analysis. When messy, multi-sheet data is transformed into a well-structured Star Schema, everything feels fast and intuitive; when the model is wrong, you spend more time fixing "broken" numbers than actually analyzing data.

