DEV Community: Joan Wambui

Data Pipeline Orchestration with Apache Airflow

Joan Wambui — Tue, 07 Jul 2026 18:11:19 +0000

Airflow is an open-source platform for programmatically authoring, scheduling, and monitoring workflows. It was developed at Airbnb in 2014 and has since become the standard for data pipeline orchestration.

Airflow does the following:

Schedules your tasks to run at the right time and in the right order.
Manages dependencies, i.e. if Task A fails, Task B never runs.
Gives you visibility into what succeeded, what failed, and how long it took.

Workflows are defined as code: each pipeline lives in a Python file known as a DAG file.

Case scenario example:

You've been running the same three scripts for months:

extract.py pulls data from the API at 6 AM.
transform.py cleans it at 6:30 AM.
load.py pushes it into the warehouse at 7 AM.

One Tuesday, the API rate-limits you. extract.py fails silently. transform.py runs on yesterday's stale file, and load.py happily overwrites your production table with duplicate data. By the time the analytics team spots the error, it's noon. You spend the rest of the day cleaning up.

This is why orchestration exists. It handles dependencies, failures, retries, and visibility. Apache Airflow turns a fragile chain of scripts into a resilient, observable pipeline.

Airflow Components:

Scheduler - watches the clock, triggers tasks when conditions are met.
Executor - runs the task. The scheduler decides what to run; the executor does the running. Different executors suit different scales: SequentialExecutor for local development, CeleryExecutor or KubernetesExecutor for distributed production workloads.
Webserver - the dashboard at localhost:8080. Every DAG, every run, every task status, full logs are visible in one place.
Metadata database - Airflow's internal record. Every run is stored: what ran, when, the result, how long it took.

The DAG:

In Airflow, a pipeline becomes a DAG (Directed Acyclic Graph). Each step is a task; the relationships between tasks define the order.

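from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# extract_gas_prices, transform_cities, and load_cities are ordinary Python
# functions assumed to be defined or imported elsewhere in this file.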

with DAG(
    dag_id="gas_prices_pipeline",
    start_date=datetime(2026, 1, 1),
    schedule="0 6 * * *",
    catchup=False,
) as dag:

    extract_task = PythonOperator(
        task_id="extract",
        python_callable=extract_gas_prices,
    )

    transform_task = PythonOperator(
        task_id="transform",
        python_callable=transform_cities,
    )

    load_task = PythonOperator(
        task_id="load",
        python_callable=load_cities,
    )

    extract_task >> transform_task >> load_task

The >> operator defines dependency. Transform won't run until extract succeeds. Load won't run until transform succeeds. A failure at any step stops what's downstream and records exactly where and why — no manual inspection required.
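To make the "failure stops what's downstream" idea concrete, here is a minimal sketch in plain Python. It is not how Airflow is implemented; run_chain and the three stand-in functions are hypothetical names used only for illustration.

```python
from typing import Callable


def run_chain(tasks: list[tuple[str, Callable[[], None]]]) -> dict[str, str]:
    """Run tasks in order; once one fails, mark everything after it as blocked."""
    status: dict[str, str] = {}
    upstream_ok = True
    for name, func in tasks:
        if not upstream_ok:
            # An earlier task failed, so this one never runs
            status[name] = "upstream_failed"
            continue
        try:
            func()
            status[name] = "success"
        except Exception as exc:
            # Record where and why the chain broke
            status[name] = f"failed: {exc}"
            upstream_ok = False
    return status


def extract() -> None:
    raise RuntimeError("API rate limit hit")


def transform() -> None:
    pass


def load() -> None:
    pass


result = run_chain([("extract", extract), ("transform", transform), ("load", load)])
print(result)
```

Here extract fails, so transform and load are recorded as blocked instead of running on stale data, which is the behaviour the Tuesday scenario was missing.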

XCom - How tasks pass data

Tasks in Airflow run independently. They don't share variables the way functions do in a script.
XCom (cross-communication) is Airflow's built-in mechanism for passing return values between tasks.

PythonOperator approach (explicit XCom pull):

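import pandas as pd

def transform_cities(**context):
    # Pull the value returned by the task whose task_id is 'extract'
    data = context['ti'].xcom_pull(task_ids='extract')
    cities = data['result']['cities']
    cities_df = pd.DataFrame(cities)
    # Return plain records so the result can travel through XCom
    return cities_df.to_dict(orient='records')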

TaskFlow approach (automatic XCom via decorators):

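import pandas as pd
from airflow.decorators import task

@task
def transform_cities(data):
    cities = data['result']['cities']
    cities_df = pd.DataFrame(cities)
    return cities_df.to_dict(orient='records')

# Wired by natural function calls inside the DAG definition;
# extract_gas_prices and load_cities are assumed to be @task functions too.
data = extract_gas_prices()
records = transform_cities(data)
load_cities(records)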

TaskFlow handles XCom automatically: the @task decorator takes care of the repetitive push and pull code.

One constraint: by default, XCom serializes data as JSON. DataFrames aren't JSON-serializable, so convert them to a list of dicts before passing them through XCom, then reconstruct the DataFrame on the other side.
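The JSON constraint can be seen with the standard library alone. This is a rough sketch: FakeFrame is a hypothetical stand-in for a DataFrame so the example runs without pandas, and the city records are made-up sample values.

```python
import json


class FakeFrame:
    """Hypothetical stand-in for a DataFrame: an object json cannot encode."""

    def __init__(self, records):
        self.records = records

    def to_records(self):
        # Plays the role of DataFrame.to_dict(orient='records')
        return list(self.records)


frame = FakeFrame([
    {"city": "Nairobi", "price": 1.5},
    {"city": "Mombasa", "price": 1.4},
])

# Passing the object itself fails, because json has no encoder for it
try:
    json.dumps(frame)
    frame_is_serializable = True
except TypeError:
    frame_is_serializable = False

# A list of dicts survives the round trip unchanged
payload = json.dumps(frame.to_records())  # what the upstream task hands over
restored = json.loads(payload)            # what the downstream task receives

print(frame_is_serializable)
print(restored)
```

With pandas installed, the downstream task would rebuild the table with pd.DataFrame(restored).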

Airflow is most needed when:

Multiple dependent steps each need independent tracking and retry logic
Failures need to alert, retry automatically, and log precisely
Multiple pipelines need managing from one interface
Teams need audit trails of what ran, when, and with what result

ETL vs. ELT in Data Engineering:

Joan Wambui — Tue, 07 Jul 2026 17:45:45 +0000

A Case Scenario:

Picture this: your startup just landed its first 500 paying customers. The product team wants a dashboard by Friday. You build a quick pipeline that lands raw data in your cloud data platform, then write a handful of SQL models to clean it up. It works perfectly. Fast forward six months: those same models now run for three hours and rack up significant compute costs every day. Your “modern” ELT pipeline is suddenly the bottleneck.

This is the real conversation behind the ETL vs. ELT debate. It’s not about which one is better. It’s about which one keeps your data team sane and your cloud bill predictable.

Definitions:

ETL (Extract → Transform → Load)

Data is extracted from source systems, cleaned and restructured on an external processing engine (like Apache Spark or a Python script), then loaded into the data warehouse in a ready-to-query state. The warehouse only sees clean data.

ELT (Extract → Load → Transform)

Raw data lands in the warehouse first, in its original, sometimes messy form. Transformation happens inside the warehouse using SQL or warehouse-native tools like dbt. The warehouse becomes both the storage layer and the computation engine.

Why did ELT become possible?

Cloud data warehouses (Snowflake, BigQuery, Redshift) separated storage from compute. Suddenly you could store petabytes cheaply and spin up massive compute only when you needed to run transforms, with no need for an external Spark cluster.

Difference between ETL and ELT:

When to choose ETL

· PII and compliance-heavy data. If you can’t legally store raw email addresses or health records in the warehouse, ETL lets you hash or mask them before they ever touch the cloud environment.

· Massive, complex transformations. When you’re doing heavy feature engineering for machine learning that doesn’t map well to SQL, an external Spark job is often cheaper and faster.

· Legacy sources with unpredictable schemas. Cleaning up a 15-year-old mainframe dump is painful in SQL; a Python script with strict type casting can save you from 100 dbt model errors.

When to choose ELT

· You need speed and self-service. Loading raw data immediately means product analysts can explore it the same day, not wait a week for a data engineer to write a pipeline.

· Your team is SQL-first. If most of your data team thinks in SQL, tools like dbt let them own transformations from raw tables to metric definitions, with no Python required.

· You’re modernising. ELT lets you gradually refactor messy tables with staging models, incremental loads, and snapshots, all inside a language the team already knows.

A Decision Framework in three questions

When designing a pipeline, pause and ask:

1. Can this raw data legally live in my warehouse?

No - ETL or a hybrid masking layer before loading.

2. Is my transformation logic SQL-friendly, or does it require complex procedural code?

Complex code - ETL.
SQL-friendly - ELT.

3. Who will own and maintain the pipeline next year?

Data engineers → ETL is comfortable.
Analytics engineers → ELT with dbt reduces handoffs.

NB: Most production systems run a hybrid. You mask PII in a lightweight pre-load step (the ETL pattern), but all business logic lives in dbt inside the warehouse (the ELT pattern). The debate isn’t binary.
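A lightweight pre-load masking step could look something like the sketch below. It assumes rows arrive as dictionaries with an email field; mask_email, mask_rows, the salt, and the sample rows are all hypothetical. Salted hashing is only one masking option, and whether it satisfies a given regulation is a separate question.

```python
import hashlib


def mask_email(email: str, salt: str) -> str:
    """Replace an email with a salted SHA-256 digest so the raw value never lands."""
    normalised = email.strip().lower()
    return hashlib.sha256((salt + normalised).encode("utf-8")).hexdigest()


def mask_rows(rows: list[dict], salt: str) -> list[dict]:
    """Return copies of the rows with the email column masked."""
    masked = []
    for row in rows:
        clean = dict(row)  # copy, so the source rows are left untouched
        clean["email"] = mask_email(row["email"], salt)
        masked.append(clean)
    return masked


raw_rows = [
    {"customer_id": 1, "email": "Amina@example.com"},
    {"customer_id": 2, "email": "amina@example.com "},
]
masked_rows = mask_rows(raw_rows, salt="demo-salt")
print(masked_rows)
```

Because the email is normalised before hashing, the same address always produces the same digest, so the masked column can still be used as a join key inside the warehouse.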

The real debate is where transformation logic lives, who owns it, and how it is version-controlled.

Python and Its Role in Data Analytics

Joan Wambui — Sun, 10 May 2026 14:12:37 +0000

Introduction

Python is a programming language. But if you are hearing that for the first time, it probably does not mean much yet. So let us break it down in a simple way.

A programming language is simply a way of giving instructions to a computer. The computer does not understand English, so we use languages like Python to communicate with it. What makes Python special is that it is written in a way that is close to how humans write and speak. It looks almost like plain English, which makes it one of the friendliest languages to start with.

According to W3Schools, Python is one of the most popular programming languages in the world, and it is designed to be readable and simple. That is exactly what draws me to it.

A Little Background on Python

Python was created by a man named Guido van Rossum and was first released in 1991. That means Python has been around for over 30 years. It has been tested, improved, and built upon by millions of developers around the world.

One thing that makes Python powerful is that it is open source. This means it is free to use, and anyone can contribute to making it better. You can simply go to python.org, download it, and start writing code today.

Python is also a general purpose language, meaning you can use it for many different things: building websites, automating tasks, artificial intelligence, and of course, data analytics, which is what we are going to focus on.

So What Is Data Analytics?

Before we talk about Python in data analytics, let us make sure we understand what data analytics actually means.

Data analytics is the process of looking at data and trying to make sense of it. Every business, every banking institution, and even every hospital has data. Sales numbers, customer records, website visits, financial transactions, all of that is data.

The job of a data analyst is to take that raw data, clean it, organise it, and then draw conclusions from it. Those conclusions help individuals and organisations make better decisions.

For example, a supermarket might want to know which products sell the most during December. A data analyst would look at the sales data, process it, and give a clear answer. That answer could influence how much stock the supermarket orders. That is the power of data analytics.

Why Python for Data Analytics?

There are other tools used in data analytics. Excel is one that most people are already familiar with. SQL and R are also used. So why Python?

The honest answer is that Python can do almost everything these tools do, and in many cases it can do more. It can handle very large datasets that would crash Excel. It can connect to databases and run SQL queries against them. And it has libraries that make data work much faster and easier.

W3Schools describes Python as having a simple syntax that allows developers to write programs with fewer lines of code compared to other programming languages. In data analytics, that matters because you are often writing the same types of processes repeatedly, and fewer lines means less room for error.

Another reason Python is preferred is the community behind it. If you get stuck, there are millions of people online who have probably had the same problem. Platforms like Stack Overflow, W3Schools, and even YouTube have free resources that can help you solve almost any Python problem.

Python Libraries That Matter in Data Analytics

A library in Python is a collection of pre-written code that you can use in your own work. Instead of writing everything from scratch, you simply import the library and use what you need.

Here are the main ones used in data analytics:

Pandas

Pandas is the most important library for data analytics in Python. It allows you to work with data in a structured way using something called a DataFrame, which is essentially a table that lives inside your Python code.

With pandas you can:

Load data from CSV files, Excel files, or URLs
Clean messy data
Filter, sort, and group data
Merge different datasets together

For example, pandas can be used to fetch JSON data from a GitHub link or an external API, convert it into a structured DataFrame, and export it as a CSV file.

NumPy

NumPy stands for Numerical Python. It is used for working with numbers and mathematical operations. While you may not use it directly as often as pandas in day-to-day analysis, it works quietly in the background.

Interesting fact: Pandas is built on top of NumPy.

If you ever need to work with large arrays of numbers or perform calculations across a dataset quickly, NumPy makes that possible.

Matplotlib and Seaborn

Once you have cleaned and analysed your data, the next step is usually to visualise it. This is made possible in Python by Matplotlib and Seaborn.

Matplotlib is the foundation of data visualisation in Python. It allows you to create line graphs, bar charts, pie charts, scatter plots, and more. Seaborn is built on top of Matplotlib and makes it easier to create more visually appealing charts with less code.

In a workplace setting, a chart often communicates a finding faster than a table of numbers. Being able to visualise your data is a very important part of the analytics process.

The Data Analytics Process in Python

Let me walk you through how a typical data analytics process looks when using Python. I will keep it simple because that is where I am coming from.

Step 1: Get the Data

This could mean reading a CSV file, connecting to a database, or fetching data from an API using the requests library. The goal is simply to bring the data into your Python environment.

import requests
import pandas as pd

url = 'https://your-data-source.json'
response = requests.get(url)
data = response.json()

Step 2: Understand the Structure

Before you do anything else, look at what you are working with. Is it a list or a dictionary? How many rows and columns does it have? Are there missing values?

df = pd.DataFrame(data)
print(df.shape)
print(df.head())
df.info()  # info() prints its summary itself and returns None

Step 3: Clean the Data

Real world data is rarely clean. There will be missing values, duplicates, wrong data types, and inconsistent formatting. Cleaning is often the longest part of the process.

df.dropna(inplace=True) # remove missing values
df.drop_duplicates(inplace=True) # remove duplicates

Step 4: Analyse the Data

Once the data is clean, you start asking questions. What is the average? What is the highest value? Which category appears the most? Python and pandas make it easy to answer these.

print(df['price'].mean())
print(df['category'].value_counts())

Step 5: Visualise and Share

The final step is turning your findings into something others can understand. It can be a chart, a report, or a CSV file.

import matplotlib.pyplot as plt

df['category'].value_counts().plot(kind='bar')
plt.title('Products by Category')
plt.show()

df.to_csv('final_output.csv', index=False)

Where Can You Learn Python for Data Analytics?

If you are reading this and you want to start, here are the platforms I have found useful:

W3Schools (w3schools.com) - very beginner friendly, explains concepts with simple examples and lets you practice in the browser
Google Colab - a free environment where you can write and run Python code without installing anything on your computer
Kaggle - has free courses and real datasets to practice on
YouTube - countless free tutorials for every level

You do not need to learn everything at once. Start with the basics on W3Schools, get comfortable with pandas, and then start working on small projects using real data.

Final food for thought:

Python is not as intimidating as it looks at first. Once you understand that every line of code is just an instruction you are giving to the computer, it starts to make sense.

In the data analytics space, Python is one of the most valuable skills you can have. It is used by analysts, data scientists, finance professionals, marketers, and many others. The fact that it is free, widely supported, and beginner friendly makes it accessible to anyone interested in the language.

Happy learning!

CTEs, Subqueries, and Query Optimisation in SQL

Joan Wambui — Tue, 21 Apr 2026 16:54:35 +0000

If you have been writing SQL for a while, you have almost certainly run into a moment where a query starts to feel unwieldy. You need to filter based on an aggregation, or you need to reference a result you just computed, and suddenly your query is nested three levels deep and impossible to read. That is where CTEs and subqueries come in. Both solve the same core problem: referencing intermediate results within a larger query. But they do it differently, and knowing when to reach for each one will make you a sharper, more deliberate SQL writer.

Subqueries

A subquery is a query written inside another query. It is enclosed in parentheses and can appear in several places: inside a WHERE clause, inside a FROM clause, or inside a SELECT clause.

Subquery in a WHERE clause

This is the most common use. You filter the outer query based on a result computed by the inner query.

SELECT product_name, price
FROM products
WHERE price > (
  SELECT AVG(price)
  FROM products
);

The inner query runs first, computes the average price across all products, and hands that single value to the outer query. The outer query then filters using it.

Subquery in a FROM clause

Here, the inner query acts as a temporary table that the outer query reads from. This is sometimes called a derived table.

SELECT customer_id, total_spent
FROM (
  SELECT customer_id, SUM(total_amount) AS total_spent
  FROM sales
  GROUP BY customer_id
) AS customer_totals
WHERE total_spent > 5000;

Notice the alias customer_totals after the closing parenthesis. This is required in most databases when you use a subquery in a FROM clause.

Correlated subquery

A correlated subquery is one that references a column from the outer query. It re-executes for every row the outer query processes.

SELECT product_name, price, category
FROM products p
WHERE price > (
  SELECT AVG(price)
  FROM products
  WHERE category = p.category
);

Here, the inner query cannot run independently. It depends on p.category from the outer query, so it runs once per row. This is powerful but can be slow on large tables.
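The difference between the plain and the correlated version is easier to see on a tiny table. The sketch below uses Python's built-in sqlite3 module rather than PostgreSQL, and the four products are made-up sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_name TEXT, price REAL, category TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        ("Laptop", 1000.0, "Computers"),
        ("Mouse", 20.0, "Computers"),
        ("Phone", 300.0, "Phones"),
        ("Charger", 100.0, "Phones"),
    ],
)

# Plain subquery: every price is compared with the overall average (355.0)
above_overall = conn.execute(
    """
    SELECT product_name FROM products
    WHERE price > (SELECT AVG(price) FROM products)
    ORDER BY product_name
    """
).fetchall()

# Correlated subquery: each price is compared with its own category's average
above_category = conn.execute(
    """
    SELECT product_name FROM products p
    WHERE price > (SELECT AVG(price) FROM products WHERE category = p.category)
    ORDER BY product_name
    """
).fetchall()

print(above_overall)   # [('Laptop',)]
print(above_category)  # [('Laptop',), ('Phone',)]
```

Phone is below the overall average but above the Phones average, so only the correlated query returns it.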

Common Table Expressions (CTEs)

A Common Table Expression, defined with the WITH keyword, lets you name an intermediate result and reference it by that name later in the same query. Think of it as a named, temporary result set that exists only for the duration of the query.

WITH customer_spending AS (
  SELECT customer_id, SUM(total_amount) AS total_spent
  FROM sales
  GROUP BY customer_id
)
SELECT c.first_name, c.last_name, cs.total_spent
FROM customers c
INNER JOIN customer_spending cs ON c.customer_id = cs.customer_id
WHERE cs.total_spent > 5000;

The WITH block defines customer_spending. The main query below it then joins against it as if it were a real table.

Chaining multiple CTEs

One of the most useful features of CTEs is that you can define several of them in sequence, and each one can reference the ones defined before it.

WITH product_sales AS (
  SELECT product_id, SUM(quantity_sold) AS total_qty
  FROM sales
  GROUP BY product_id
),
avg_sales AS (
  SELECT AVG(total_qty) AS avg_qty
  FROM product_sales
)
SELECT p.product_name, ps.total_qty
FROM products p
INNER JOIN product_sales ps ON p.product_id = ps.product_id
WHERE ps.total_qty > (SELECT avg_qty FROM avg_sales);

avg_sales is built directly on top of product_sales. This kind of step-by-step construction would be much harder to express clearly with nested subqueries.
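The chained query can be run end to end on sample data. This is a sketch using Python's sqlite3 module with invented rows; the table and column names follow the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (product_id INTEGER, product_name TEXT)")
conn.execute("CREATE TABLE sales (product_id INTEGER, quantity_sold INTEGER)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [(1, "Laptop"), (2, "Mouse"), (3, "Phone")],
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, 5), (1, 5), (2, 2), (3, 6)],
)

rows = conn.execute(
    """
    WITH product_sales AS (
      SELECT product_id, SUM(quantity_sold) AS total_qty
      FROM sales
      GROUP BY product_id
    ),
    avg_sales AS (
      SELECT AVG(total_qty) AS avg_qty
      FROM product_sales
    )
    SELECT p.product_name, ps.total_qty
    FROM products p
    INNER JOIN product_sales ps ON p.product_id = ps.product_id
    WHERE ps.total_qty > (SELECT avg_qty FROM avg_sales)
    """
).fetchall()

# Totals are 10, 2 and 6, so the average is 6 and only Laptop exceeds it
print(rows)  # [('Laptop', 10)]
```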

Recursive CTEs

CTEs also support recursion, which is useful for querying hierarchical data such as org charts, file structures, or category trees. The CTE references itself until a termination condition is met.

WITH RECURSIVE employee_hierarchy AS (
  SELECT employee_id, name, manager_id, 1 AS level
  FROM employees
  WHERE manager_id IS NULL

  UNION ALL

  SELECT e.employee_id, e.name, e.manager_id, eh.level + 1
  FROM employees e
  INNER JOIN employee_hierarchy eh ON e.manager_id = eh.employee_id
)
SELECT * FROM employee_hierarchy ORDER BY level;

Recursive CTEs are one of the few things CTEs can do that subqueries fundamentally cannot replicate cleanly.
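To watch the recursion unfold, the same query can be run against a four-person org chart. The sketch uses Python's sqlite3 module, which also supports WITH RECURSIVE; the employee names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (employee_id INTEGER, name TEXT, manager_id INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [
        (1, "Ada", None),  # top of the hierarchy: no manager
        (2, "Ben", 1),
        (3, "Cy", 2),
        (4, "Dee", 1),
    ],
)

hierarchy = conn.execute(
    """
    WITH RECURSIVE employee_hierarchy AS (
      SELECT employee_id, name, manager_id, 1 AS level
      FROM employees
      WHERE manager_id IS NULL

      UNION ALL

      SELECT e.employee_id, e.name, e.manager_id, eh.level + 1
      FROM employees e
      INNER JOIN employee_hierarchy eh ON e.manager_id = eh.employee_id
    )
    SELECT employee_id, name, level
    FROM employee_hierarchy
    ORDER BY level, employee_id
    """
).fetchall()

print(hierarchy)  # [(1, 'Ada', 1), (2, 'Ben', 2), (4, 'Dee', 2), (3, 'Cy', 3)]
```

The recursion stops on its own once a pass finds no employees whose manager is already in the result.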

CTEs vs Subqueries: Key Differences

Understanding the difference between the two goes beyond syntax. They reflect a different way of structuring your thinking.

Readability

Subqueries, especially nested ones, are read from the inside out. The deeper the nesting, the harder it is to follow. CTEs read from top to bottom, like steps in a recipe. When a query has multiple intermediate stages, CTEs are almost always clearer.

Reusability within a query

A subquery can only be used in the one place where it is written. If you need the same intermediate result in two different parts of your query, you have to write the subquery twice. A CTE can be referenced multiple times within the same query after it is defined once.

WITH high_value_sales AS (
  SELECT customer_id, SUM(total_amount) AS total_spent
  FROM sales
  GROUP BY customer_id
  HAVING SUM(total_amount) > 10000
)
SELECT COUNT(*) AS total_high_value_customers FROM high_value_sales
UNION ALL
SELECT SUM(total_spent) AS combined_revenue FROM high_value_sales;

Performance

This is more nuanced. In most modern databases like PostgreSQL, MySQL 8+, and SQL Server, the optimiser treats CTEs and subqueries similarly in many cases. However, some databases materialise CTEs (compute and store the result once), while subqueries may be inlined and optimised as part of the surrounding query. In PostgreSQL prior to version 12, CTEs were always materialised, which could actually make them slower in some situations. From PostgreSQL 12 onward, the optimiser can choose whether to materialise or inline a CTE. The practical takeaway is: do not choose one over the other for performance reasons without testing on your specific database version.

Correlated logic

Correlated subqueries are uniquely suited to row-by-row comparisons, like checking whether a value exceeds an average within its own group. CTEs cannot be correlated in the same way because they are evaluated independently of the outer query.

When to use each

Use a subquery when:

The logic is simple and self-contained
You need correlated row-by-row comparisons
The intermediate result is only needed in one place
You want to keep the query concise for a single-step filter

Use a CTE when:

You have multiple intermediate steps
You need to reference the same intermediate result more than once
You are working with recursive or hierarchical data
Readability and maintainability matter (which is most of the time)

Optimising SQL Queries

Writing a query that returns the right answer is the first goal. Writing one that does it efficiently is the second. Here are the most impactful methods for optimising SQL queries.

1. Use indexes wisely

An index is a data structure that allows the database to find rows much faster than scanning the entire table. Consider indexing the columns that appear frequently in WHERE, JOIN, ORDER BY, and GROUP BY clauses.

-- Without an index on sale_date, this scans every row
SELECT * FROM sales WHERE sale_date = '2023-06-01';

-- Creating an index
CREATE INDEX idx_sales_date ON sales(sale_date);

Be careful not to over-index. Every index adds overhead to INSERT, UPDATE, and DELETE operations because the index must be updated alongside the data.

2. Avoid SELECT *

Selecting all columns forces the database to read and transfer every column, even those you do not need. Always specify only the columns your query requires.

-- Avoid
SELECT * FROM customers;

-- Preferred
SELECT customer_id, first_name, last_name FROM customers;

This reduces the amount of data read from disk and transferred across the network.

3. Filter early

The further upstream you apply filters, the less data each subsequent step has to process. In joins, filter the data before joining where possible. Many optimisers push simple filters down on their own, so confirm any gain with the query plan.

-- Less efficient: joins all sales, then filters
SELECT c.first_name, s.total_amount
FROM customers c
INNER JOIN sales s ON c.customer_id = s.customer_id
WHERE EXTRACT(YEAR FROM s.sale_date) = 2023;

-- More efficient: pre-filter in a subquery or CTE
WITH sales_2023 AS (
  SELECT customer_id, total_amount
  FROM sales
  WHERE EXTRACT(YEAR FROM sale_date) = 2023
)
SELECT c.first_name, s.total_amount
FROM customers c
INNER JOIN sales_2023 s ON c.customer_id = s.customer_id;

4. Avoid functions on indexed columns in WHERE clauses

Wrapping a column in a function in a WHERE clause usually prevents the database from using an ordinary index on that column, because the function has to be applied to every row before filtering can happen.

-- This cannot use an index on sale_date
WHERE EXTRACT(YEAR FROM sale_date) = 2023

-- This can use a range index on sale_date
WHERE sale_date BETWEEN '2023-01-01' AND '2023-12-31'
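The effect on index usage can be checked by asking the database for its plan. The sketch below uses SQLite through Python's sqlite3 module as a stand-in for PostgreSQL. SQLite has no EXTRACT, so strftime plays that role here, and the exact plan wording varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (sale_id INTEGER PRIMARY KEY, sale_date TEXT, total_amount REAL)"
)
conn.execute("CREATE INDEX idx_sales_date ON sales(sale_date)")


def plan(sql: str) -> str:
    """Return SQLite's query plan for a statement as one line of text."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)  # the fourth column holds the description


# Function wrapped around the column: the index on sale_date cannot be used
function_plan = plan(
    "SELECT * FROM sales WHERE strftime('%Y', sale_date) = '2023'"
)

# Range on the bare column: the index can be searched
range_plan = plan(
    "SELECT * FROM sales WHERE sale_date BETWEEN '2023-01-01' AND '2023-12-31'"
)

print(function_plan)
print(range_plan)
```

The first plan is a scan of the whole table, while the second mentions idx_sales_date.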

5. Use LIMIT when exploring data

When you are investigating data or developing a query, add LIMIT to avoid accidentally returning millions of rows.

SELECT * FROM sales ORDER BY sale_date DESC LIMIT 100;

Remove it only when you need the full result set.

6. Prefer EXISTS over IN for large subquery results

When checking whether a related row exists, EXISTS can stop as soon as it finds the first match, whereas IN conceptually builds the subquery's full result and then checks membership. On large datasets EXISTS is often faster, although many modern optimisers plan both forms the same way, so test on your own database.

-- Using IN
SELECT first_name, last_name
FROM customers
WHERE customer_id IN (SELECT customer_id FROM sales);

-- Using EXISTS (generally faster on large tables)
SELECT first_name, last_name
FROM customers c
WHERE EXISTS (
  SELECT 1 FROM sales s WHERE s.customer_id = c.customer_id
);

7. Use EXPLAIN / EXPLAIN ANALYZE

Before and after any optimisation, use your database's query plan tool to understand what is actually happening. In PostgreSQL:

EXPLAIN ANALYZE
SELECT p.product_name, SUM(s.total_amount)
FROM products p
INNER JOIN sales s ON p.product_id = s.product_id
GROUP BY p.product_name;

The output shows whether indexes are being used, how many rows are being scanned, and where the most time is spent. This is the single most reliable way to confirm whether an optimisation is actually helping.

8. Avoid correlated subqueries on large tables

As noted earlier, a correlated subquery runs once per row in the outer query. On a table with 100,000 rows, that means 100,000 subquery executions. Wherever possible, rewrite correlated subqueries as joins or CTEs.

-- Correlated subquery: runs once per customer row
SELECT first_name,
  (SELECT SUM(total_amount) FROM sales WHERE customer_id = c.customer_id) AS total_spent
FROM customers c;

-- Equivalent join: far more efficient
SELECT c.first_name, SUM(s.total_amount) AS total_spent
FROM customers c
LEFT JOIN sales s ON c.customer_id = s.customer_id
GROUP BY c.customer_id, c.first_name;
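Before swapping one form for the other, it is worth confirming that they return the same rows. Here is a small check using Python's sqlite3 module and invented data; an ORDER BY is added to both queries only so the results can be compared directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, first_name TEXT)")
conn.execute("CREATE TABLE sales (customer_id INTEGER, total_amount REAL)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "Amina"), (2, "Brian"), (3, "Cate")],
)
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(1, 100.0), (1, 50.0), (2, 70.0)],
)

# Correlated subquery version
correlated = conn.execute(
    """
    SELECT first_name,
      (SELECT SUM(total_amount) FROM sales WHERE customer_id = c.customer_id) AS total_spent
    FROM customers c
    ORDER BY c.customer_id
    """
).fetchall()

# LEFT JOIN version
joined = conn.execute(
    """
    SELECT c.first_name, SUM(s.total_amount) AS total_spent
    FROM customers c
    LEFT JOIN sales s ON c.customer_id = s.customer_id
    GROUP BY c.customer_id, c.first_name
    ORDER BY c.customer_id
    """
).fetchall()

print(correlated)
print(joined)
```

Both return a NULL total (None in Python) for Cate, who has no sales. That is why the rewrite uses LEFT JOIN: an INNER JOIN would drop her.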

9. Use appropriate JOIN types

Inner joins are generally faster than outer joins because they return fewer rows and allow the optimiser more flexibility. Use LEFT JOIN or RIGHT JOIN only when you genuinely need to retain unmatched rows.

10. Aggregate before joining where possible

If you need to join an aggregated result to another table, aggregate first and then join, rather than joining first and then aggregating. This reduces the number of rows the join has to process.

-- Join first, aggregate after (more rows to process during join)
SELECT c.first_name, SUM(s.total_amount)
FROM customers c
INNER JOIN sales s ON c.customer_id = s.customer_id
GROUP BY c.customer_id, c.first_name;

-- Aggregate first, then join (fewer rows in the join)
WITH customer_totals AS (
  SELECT customer_id, SUM(total_amount) AS total_spent
  FROM sales
  GROUP BY customer_id
)
SELECT c.first_name, ct.total_spent
FROM customers c
INNER JOIN customer_totals ct ON c.customer_id = ct.customer_id;

Putting It All Together

CTEs and subqueries are not competing tools; they are complementary. Subqueries are precise and compact for simple, single-use logic. CTEs shine when a query has multiple steps, when the same intermediate result is needed more than once, or when the goal is to write something a colleague can read and understand without guessing. Query optimisation, meanwhile, is not about memorising a checklist; it is about understanding what the database is doing and removing unnecessary work at every stage. The combination of clean, readable structure and deliberate performance thinking is what separates functional SQL from genuinely good SQL.

SQL fundamentals: DDL, DML, filtering, and CASE WHEN

Joan Wambui — Mon, 13 Apr 2026 16:06:42 +0000

What is SQL?

SQL (Structured Query Language) is the standard language for interacting with relational databases. Whether you are storing customer records, tracking exam results, or analysing transactions, SQL is how you talk to the database.

The five command categories

SQL commands are grouped into five categories:

Category	Its Function	Commands
DDL (Data Definition Language)	Defines and modifies the database structure	CREATE, ALTER, DROP, TRUNCATE
DML (Data Manipulation Language)	Manages/manipulates data inside tables	INSERT, UPDATE, DELETE
DQL (Data Query Language)	Retrieves data from the tables	SELECT
TCL (Transaction Control Language)	Manages transactions	COMMIT, ROLLBACK, SAVEPOINT
DCL (Data Control Language)	Controls access and permissions	GRANT, REVOKE

This article focuses on DDL, DML, and DQL.

DDL (Data Definition Language)

DDL creates the schemas, tables and columns. It defines the shape of your data before any data exists. It has four commands, as mentioned in the table: CREATE, ALTER, DROP and TRUNCATE.

CREATE TABLE sets up a table with its columns, data types, and constraints. ALTER TABLE modifies an existing table by adding columns, renaming them, or dropping ones that are no longer needed. TRUNCATE TABLE removes the contents of the table; however, it retains the table structure. DROP TABLE removes the entire table.

A lightbulb moment: CREATE TABLE wraps its column definitions in parentheses. ALTER TABLE does not; the instruction follows directly after the table name.

-- Creation of a student table
CREATE TABLE students (
    student_id    INT PRIMARY KEY,
    first_name    VARCHAR(50) NOT NULL,
    last_name     VARCHAR(50) NOT NULL,
    gender        VARCHAR(1),
    date_of_birth DATE,
    class         VARCHAR(10),
    city          VARCHAR(50)
);

-- Addition of a phone number column
ALTER TABLE students
ADD COLUMN phone_number VARCHAR(20);

DML (Data Manipulation Language)

INSERT INTO adds rows to a table. You list the target columns, follow with VALUES, and can insert multiple rows in one statement by separating them with commas. Always quote dates. Without quotes, PostgreSQL reads 2008-03-12 as arithmetic (2008 − 3 − 12 = 1993).

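-- last_name is NOT NULL in the table definition, so it must be supplied
-- (the surnames below are illustrative).
INSERT INTO students (student_id, first_name, last_name, gender, date_of_birth, class, city)
VALUES
    (1, 'Amina', 'Hassan', 'F', '2008-03-12', 'Form 3', 'Nairobi'),
    (2, 'Brian', 'Otieno', 'M', '2007-07-25', 'Form 4', 'Mombasa');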
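The date-quoting point is easy to verify. The sketch below uses SQLite through Python's sqlite3 module as a stand-in; the article's claim is about PostgreSQL, and SQLite happens to evaluate the unquoted expression the same way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Unquoted: parsed as the subtraction 2008 - 3 - 12
unquoted = conn.execute("SELECT 2008-03-12").fetchone()[0]

# Quoted: kept as the text of a date
quoted = conn.execute("SELECT '2008-03-12'").fetchone()[0]

print(unquoted)  # 1993
print(quoted)    # 2008-03-12
```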

UPDATE modifies existing rows. DELETE removes them. Both should almost always carry a WHERE clause: without it, every row in the table is affected, not just the one you intended.

-- Updating a student's current city of residence:
UPDATE students
SET city = 'Nairobi'
WHERE student_id = 5;

-- Deleting the exam result whose result_id is 9:
DELETE FROM exam_results
WHERE result_id = 9;
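How much damage a missing WHERE does can be measured by counting the affected rows. This is a sketch using Python's sqlite3 module with three made-up students; rowcount reports how many rows a statement changed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE students (student_id INTEGER PRIMARY KEY, first_name TEXT, city TEXT)"
)
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [(1, "Amina", "Nairobi"), (2, "Brian", "Mombasa"), (3, "Cate", "Kisumu")],
)

# With WHERE: only the targeted row changes
targeted = conn.execute(
    "UPDATE students SET city = 'Nakuru' WHERE student_id = 2"
).rowcount

# Without WHERE: every row in the table is rewritten
unfiltered = conn.execute("UPDATE students SET city = 'Nakuru'").rowcount

print(targeted)    # 1
print(unfiltered)  # 3
```

Running the statement inside a transaction and checking the row count before committing is a cheap safety net.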

WHERE clause

SQL has a variety of operators that allow for various filtering conditions. The WHERE clause, when used with these operators, allows you to target specific rows.

Some operators include:

BETWEEN filters a range and is inclusive on both ends.
IN checks against a list of values. It is cleaner than chaining multiple OR conditions.
LIKE matches text patterns, where % represents any sequence of characters.

WHERE marks BETWEEN 50 AND 80
WHERE city IN ('Nairobi', 'Mombasa', 'Kisumu')
WHERE class NOT IN ('Form 1', 'Form 2')
WHERE first_name LIKE 'A%' OR first_name LIKE 'E%'
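The four filters can be tried on a small table. This sketch uses Python's sqlite3 module; the results table and its rows are invented for illustration. One difference to keep in mind: SQLite's LIKE ignores case for ASCII letters, while PostgreSQL's LIKE is case-sensitive.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (first_name TEXT, city TEXT, class TEXT, marks INTEGER)")
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?, ?)",
    [
        ("Amina", "Nairobi", "Form 3", 80),
        ("Brian", "Mombasa", "Form 4", 49),
        ("Esther", "Kisumu", "Form 1", 50),
        ("Faith", "Nakuru", "Form 2", 65),
    ],
)


def names(where: str) -> list[str]:
    """Return first names matching a WHERE condition, sorted alphabetically."""
    sql = f"SELECT first_name FROM results WHERE {where} ORDER BY first_name"
    return [row[0] for row in conn.execute(sql)]


between = names("marks BETWEEN 50 AND 80")             # includes both 50 and 80
in_list = names("city IN ('Nairobi', 'Mombasa', 'Kisumu')")
not_in = names("class NOT IN ('Form 1', 'Form 2')")
like = names("first_name LIKE 'A%' OR first_name LIKE 'E%'")

print(between)  # ['Amina', 'Esther', 'Faith']
print(in_list)  # ['Amina', 'Brian', 'Esther']
print(not_in)   # ['Amina', 'Brian']
print(like)     # ['Amina', 'Esther']
```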

CASE WHEN

CASE WHEN lets you apply conditional logic directly inside a query, generating a new column based on rules you define — without touching the underlying table.

-- Exam results labelling:

SELECT *,
    CASE
        WHEN marks >= 80 THEN 'Distinction'
        WHEN marks >= 60 THEN 'Merit'
        WHEN marks >= 40 THEN 'Pass'
        ELSE 'Fail'
    END AS performance
FROM exam_results;

PostgreSQL evaluates conditions top to bottom and stops at the first match, so order matters. The ELSE clause covers anything that falls outside your defined conditions. Without it, unmatched rows return NULL.

The CASE WHEN result exists only in the query output. The table structure is never changed.

What makes CASE WHEN powerful is that it lets you enhance your data at query time without needing extra columns in your schema. For reporting and analysis, that flexibility is very useful.
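
The three behaviours described above (first match wins, a missing ELSE yields NULL, and the table itself is untouched) can be checked with a short sketch. It assumes SQLite through Python's sqlite3 module in place of PostgreSQL, and the three marks are sample values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE exam_results (result_id INTEGER PRIMARY KEY, marks INTEGER)")
conn.executemany("INSERT INTO exam_results VALUES (?, ?)", [(1, 85), (2, 65), (3, 30)])

# No ELSE branch: marks below 40 match nothing and come back as NULL (None in Python).
no_else = [row[0] for row in conn.execute("""
    SELECT CASE
               WHEN marks >= 80 THEN 'Distinction'
               WHEN marks >= 60 THEN 'Merit'
               WHEN marks >= 40 THEN 'Pass'
           END AS performance
    FROM exam_results
    ORDER BY result_id
""")]

# Loosest condition first: 85 and 65 both stop at 'Pass', so order matters.
wrong_order = [row[0] for row in conn.execute("""
    SELECT CASE
               WHEN marks >= 40 THEN 'Pass'
               WHEN marks >= 80 THEN 'Distinction'
               ELSE 'Fail'
           END AS performance
    FROM exam_results
    ORDER BY result_id
""")]

# The table still has only its original columns after both queries.
columns = [row[1] for row in conn.execute("PRAGMA table_info(exam_results)")]

print(no_else)      # ['Distinction', 'Merit', None]
print(wrong_order)  # ['Pass', 'Pass', 'Fail']
print(columns)      # ['result_id', 'marks']
```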

How I Published My Power BI Report and Embedded It on a Website

Joan Wambui — Sun, 05 Apr 2026 22:07:04 +0000

I recently completed a Power BI project, an electronics sales dashboard built from a raw CSV file. Cleaning the data, modeling the tables, and writing DAX measures all happened in Power BI Desktop.

This article covers what happens next: getting that report off your computer and onto the internet where it can actually be used.

1. Create a Workspace

Before publishing anything, you need a workspace. In Power BI Service, a workspace is essentially a folder that holds your reports and datasets.
Go to app.powerbi.com and sign in. On the left navigation panel, click Workspaces, then click “New workspace”.

2. Publish from Power BI Desktop

Go back to Power BI Desktop, press Ctrl + S to save your file, then go to the Home tab in the ribbon and click Publish on the right side.

A dialogue box appears asking where to publish. Select the workspace you just created and click Select.

3. Generate the Embed Code

Now that the report is live in Service, you need the embed code to place it on a website.
With the report open in the browser, click File in the top menu. Hover over Embed report and select Publish to web (public).

Alternatively, choose the other option, “Website or portal”, and copy the iframe code it gives you.

4. Embed It in a Webpage

Open any HTML file in VS Code or a text editor of choice. Paste the iframe code where you want the report to appear. Save the file and open it in a browser.
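
If you would rather generate the page than edit it by hand, the step above could be scripted roughly as follows. This is a sketch, not the only way to do it: EMBED_URL is a placeholder to be replaced with the src value from your own embed code, and build_page, the page title, and the output location are names made up for this example.

```python
import tempfile
from pathlib import Path

# Placeholder: replace with the src value from the iframe code Power BI gives you.
EMBED_URL = "https://app.powerbi.com/view?r=YOUR_REPORT_TOKEN"


def build_page(embed_url, title="Electronics Sales Dashboard"):
    """Return a minimal HTML page that wraps the report in an iframe."""
    return f"""<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>{title}</title>
</head>
<body>
  <h1>{title}</h1>
  <iframe title="{title}" width="1140" height="541"
          src="{embed_url}" frameborder="0" allowFullScreen="true"></iframe>
</body>
</html>
"""


page_html = build_page(EMBED_URL)

# Written to the system temp folder here; point this at your site folder instead.
output_path = Path(tempfile.gettempdir()) / "report.html"
output_path.write_text(page_html, encoding="utf-8")
print(f"Open {output_path} in a browser to view the report.")
```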

Your report is now published and the link is ready to share.

Understanding Data Modeling in Power BI: Joins, Relationships, and Schemas Explained

Joan Wambui — Tue, 31 Mar 2026 03:15:27 +0000

What Is Data Modeling?

Data modeling is how you organise and connect your data so it can be analysed correctly. In Power BI, it defines how tables relate to each other, how filters move across visuals, and how your measures calculate results.

Joins

Joins combine data from two tables based on a shared column. In Power BI, joins happen in Power Query Editor (through Merge Queries) before data enters the model.
INNER JOIN - Returns only rows with a match in both tables.
LEFT JOIN - Returns all rows from the left table, plus the matching rows from the right table.
RIGHT JOIN - Returns all rows from the right table, plus the matching rows from the left table.
FULL OUTER JOIN - Returns everything from both tables. Useful for spotting data gaps.
LEFT ANTI JOIN - Returns only rows from the left table with no match on the right.
RIGHT ANTI JOIN - Returns only rows from the right table with no match on the left.
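
To make the six join kinds concrete, here is a small plain-Python sketch. It is not how Power Query works internally; merge, customers, and orders are names and sample data invented for this illustration. Customer 1 appears in both tables, customer 2 only on the left, and customer 3 only on the right.

```python
def merge(left, right, key, kind):
    """Combine two lists of dicts on `key`, mimicking the join kinds listed above.

    Unmatched rows simply lack the other table's columns here;
    a real join would fill those columns with nulls.
    """
    left_keys = {row[key] for row in left}
    right_keys = {row[key] for row in right}
    matched = [{**l, **r} for l in left for r in right if l[key] == r[key]]
    left_only = [l for l in left if l[key] not in right_keys]
    right_only = [r for r in right if r[key] not in left_keys]
    return {
        "inner": matched,
        "left": matched + left_only,
        "right": matched + right_only,
        "full": matched + left_only + right_only,
        "left_anti": left_only,
        "right_anti": right_only,
    }[kind]


# Made-up sample tables.
customers = [
    {"customer_id": 1, "name": "Amina"},
    {"customer_id": 2, "name": "Brian"},
]
orders = [
    {"customer_id": 1, "amount": 500},
    {"customer_id": 3, "amount": 250},
]

for kind in ("inner", "left", "right", "full", "left_anti", "right_anti"):
    print(kind, merge(customers, orders, "customer_id", kind))
```

The inner join returns one row, the left and right joins two each, the full outer join three, and each anti join returns only the single unmatched row from its side.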

Power BI Relationships

Relationships connect tables inside the model without merging them. They control how filters move between tables when you interact with visuals.
Cardinality defines how rows relate across tables:
• One-to-Many (1:M) - One customer, many orders. The most common type.
• Many-to-Many (M:M) - Requires careful handling, best resolved with a bridge table.
• One-to-One (1:1) - Usually means the tables can be merged.

Active vs Inactive relationship - Only one active relationship is allowed between two tables. Inactive relationships are activated with USERELATIONSHIP() in DAX when needed.

Cross-filter direction - Single direction is the default and safest. Bidirectional filters flow both ways but can cause performance issues if overused.

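Joins vs Relationships

Joins physically combine two tables into one in Power Query Editor, before the data enters the model. Relationships leave the tables separate and link them inside the model, so filters can move between them.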

Fact Tables vs Dimension Tables

Fact tables hold measurable data such as sales and transactions. They are long with many rows and link out to dimension tables via foreign keys.
Dimension tables hold descriptive data such as customer names, product categories, and dates. They give context to the numbers in your fact table.
In simple terms, your fact table is the center. Dimension tables surround it.
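
A fact table surrounded by dimension tables can be sketched outside Power BI as well. The example below assumes SQLite through Python's sqlite3 module, and the table names (fact_sales, dim_product, dim_customer) and rows are invented for illustration. The fact table stores keys and amounts, and the dimensions supply the labels used for grouping.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: descriptive attributes, one row per product or customer.
    CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, customer_name TEXT);

    -- Fact table: one row per sale, linked to the dimensions by foreign keys.
    CREATE TABLE fact_sales (
        sale_id     INTEGER PRIMARY KEY,
        product_id  INTEGER REFERENCES dim_product(product_id),
        customer_id INTEGER REFERENCES dim_customer(customer_id),
        amount      INTEGER
    );

    INSERT INTO dim_product  VALUES (1, 'Phones'), (2, 'Laptops');
    INSERT INTO dim_customer VALUES (1, 'Amina'), (2, 'Brian');
    INSERT INTO fact_sales   VALUES (1, 1, 1, 100), (2, 1, 2, 50), (3, 2, 1, 70);
""")

# The same facts, sliced by two different dimensions.
totals_by_category = conn.execute("""
    SELECT p.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()

totals_by_customer = conn.execute("""
    SELECT c.customer_name, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_customer AS c ON c.customer_id = f.customer_id
    GROUP BY c.customer_name
    ORDER BY c.customer_name
""").fetchall()

print(totals_by_category)  # [('Laptops', 70), ('Phones', 150)]
print(totals_by_customer)  # [('Amina', 170), ('Brian', 50)]
```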

Schemas

Star Schema - One fact table connected directly to multiple dimension tables. Fast, clean, and what Power BI is optimised for. Use this by default.
Snowflake Schema - Dimension tables broken into sub-dimensions. This is more complex and slower in Power BI. Use it only when the source data is already normalised.
Flat Table - Everything in one table. This is simple, but it creates redundancy and performance problems at scale.

Pro Tip:

A clean data model is what makes a Power BI report trustworthy. Start with a star schema, separate your facts from your dimensions, define your relationships carefully, and always build a proper Date table. The model is invisible to the end user, but it is what everything depends on.

How Linux is Used in Real-World Data Engineering

Joan Wambui — Mon, 30 Mar 2026 02:46:26 +0000

Linux is a word you hear constantly on the tech journey: a mountain of a word for beginners, but a simple and helpful foundation once you grasp it. I had heard it before but couldn't tell you what it actually was. Was it a programming language? A tool? The same thing as Bash or Git? For a while I used Linux, Bash, Git, and GitHub interchangeably. It was a quiet kind of confusion, the type you don't realise you have until something breaks and you don't know where to look.

The moment it clicked wasn't dramatic. It was sitting at a terminal during my data engineering class, connected to a remote server, and realising the environment I was operating in was Linux. Not a language. Not a tool I had installed. The environment itself.

What Actually Is Linux?

Linux is an operating system, like Windows or macOS, that sits beneath your tools, files, terminal sessions, and pipelines. Bash is the shell, the command language you speak inside it. Git is a version control tool that runs on top of it. GitHub is the cloud platform where your Git repositories live. Four different things, four different layers, and they blur together because you encounter them all at once through the same black terminal window.

This matters because data engineering assumes Linux fluency without announcing it.

How Linux Shows Up in Real Data Engineering Work

When you provision a virtual machine on Azure or AWS, you usually land in a Linux environment. When your pipeline runs on a scheduler, it is most often running on a Linux server. When you work with Docker containers, they are built on Linux. It is rarely the thing being discussed, but it is always underneath the thing being discussed.

In practice, data engineers use Linux to navigate directories where raw data lands and processed outputs go. They write Bash scripts to automate repetitive ingestion tasks (ingesting a file, validating its structure, moving it to a staging folder, logging the result) and schedule them with cron to run automatically.

# Step 1: Navigate to where raw data lands
cd /data/raw

# Step 2: Run your ingestion script
bash ingest.sh

# Step 3: Schedule it to run automatically every day at 6am.
# This line is a crontab entry, not a shell command: add it with `crontab -e`.
0 6 * * * /scripts/ingest.sh
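
The contents of ingest.sh are not shown here, so as a rough idea of what such a script does, here is the same validate, move, and log sequence sketched in Python rather than Bash. Everything in it is an assumption for illustration: the ingest function, the expected column names, and the idea that a file is valid when its CSV header matches exactly.

```python
import csv
import logging
import shutil
from pathlib import Path

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

# Hypothetical schema: the header row a valid raw file is expected to have.
EXPECTED_COLUMNS = ["order_id", "product", "amount"]


def ingest(raw_file: Path, staging_dir: Path) -> bool:
    """Validate a raw CSV file's header, move it to staging, and log the result."""
    # Validate structure: read only the first row and compare it to the schema.
    with raw_file.open(newline="") as handle:
        header = next(csv.reader(handle), [])
    if header != EXPECTED_COLUMNS:
        logging.error("Rejected %s: unexpected header %s", raw_file.name, header)
        return False

    # Move the validated file into the staging folder.
    staging_dir.mkdir(parents=True, exist_ok=True)
    shutil.move(str(raw_file), str(staging_dir / raw_file.name))

    # Log the result.
    logging.info("Moved %s to %s", raw_file.name, staging_dir)
    return True
```

A call such as ingest(Path("/data/raw/orders.csv"), Path("/data/staging")) would then be the piece that cron runs each morning; both paths are examples only.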

They monitor running jobs, check resource usage, and kill processes that hang. When something breaks on a production server, Linux is the environment you are debugging in. Knowing it is not optional.

Linux will not be the most exciting thing you learn in data engineering. It does not have a sleek interface or a compelling pitch. But it is the ground where real data work happens, on servers, in the cloud, inside containers, underneath every tool you will eventually depend on. The clearer your understanding of what it is, the faster everything else makes sense.