DEV Community: John Wakaba

ETL vs ELT: Which One Should You Use and Why?

John Wakaba — Fri, 10 Apr 2026 06:17:56 +0000

A Beginner's Guide to Data Pipeline Architecture

If you have ever worked with data, or heard engineers talk about data pipelines, you have probably come across the terms ETL and ELT. They sound almost identical, but they represent two different philosophies for moving and processing data. Understanding the difference can help you make better architectural decisions for your projects, or simply help you follow technical conversations with more confidence.

This article breaks down both approaches, explains where each one shines, and helps you figure out which one might be the right choice for your situation.

1. What is ETL?

ETL stands for Extract, Transform, Load. It is a three-step process used to move data from one place to another, usually from various source systems into a central data warehouse.

Think of it like a water treatment plant. Water (data) is collected from rivers (source systems), cleaned and purified (transformed), and then distributed to homes (loaded into a warehouse). The treatment happens before the water reaches your tap.

The Three Steps of ETL

Step 1 — Extract

Data is pulled from one or more source systems. These sources could be relational databases (like MySQL or PostgreSQL), spreadsheets, APIs, log files, CRM systems like Salesforce, or even flat files on a server.

Example: A retail company extracts daily sales records from its point-of-sale (POS) system, customer data from its CRM, and inventory data from its warehouse management system.

Step 2 — Transform

This is the most complex step. The extracted raw data is processed and reshaped in a separate staging environment (called the ETL engine or transformation layer) before it ever enters the destination.

Transformations can include:

Cleaning data (removing duplicates, fixing null values)
Standardising formats (converting dates from DD/MM/YYYY to YYYY-MM-DD)
Enriching data (adding new computed columns, e.g. calculating customer age from a birth date)
Joining data from multiple sources into a single, consistent structure
Applying business rules (e.g. marking orders over $10,000 as high-value)

Example: The sales data is cleaned to remove duplicate transaction IDs, dates are normalised to UTC, and customer names are standardised to title case.

Step 3 — Load

The clean, structured data is then loaded into the destination, typically a data warehouse built on a platform such as Microsoft SQL Server or Oracle. Because the data was already transformed, it arrives ready to query.

ETL in One Sentence:

"Extract the data, clean and reshape it on a separate server, then load only the polished result into your warehouse."

Use Cases and Strengths of ETL

ETL is well suited for scenarios where data sources are smaller in scale but transformations are complex, where there is a need to offload transformation processing away from the target system, and where data security is a priority requiring sensitive data to be masked or encrypted before it ever reaches a warehouse. ETL is an excellent choice when data consistency, quality, and compliance are non-negotiable.

Core Strength of ETL:

ETL processes data before it reaches the warehouse, reducing the risk of sensitive data exposure and ensuring that all data conforms to business rules and standards from the moment it lands.

Python as an ETL Tool

Python has become a go-to language for building ETL pipelines. Its rich ecosystem of libraries and frameworks makes every step of the ETL process (extract, transform, and load) more accessible and flexible.

Key Python Libraries for ETL

Pandas

Pandas is the workhorse of data manipulation in Python. Its DataFrame structure makes it easy to load raw data, clean it, filter rows, rename columns, and reshape datasets. For small to medium-sized ETL jobs, Pandas alone can handle the entire transformation step.

SQLAlchemy

SQLAlchemy provides a consistent, database-agnostic way to interact with relational databases. It is especially useful in the Extract phase (reading from MySQL, PostgreSQL, SQL Server) and the Load phase (writing results back into a target database).

PySpark

When your data volumes outgrow what a single machine can handle, PySpark steps in. It offers distributed data processing across a cluster of machines, making it suitable for large-scale ETL tasks.

Luigi and Apache Airflow

ETL pipelines are rarely one-off scripts. Luigi and Apache Airflow help orchestrate and schedule them. Airflow in particular is widely used for managing complex multi-step workflows.

Advantage	What It Means in Practice
Flexibility	Python libraries allow fully custom ETL processes tailored to business needs
Scalability	PySpark enables processing of massive datasets
Community Support	Large ecosystem of tutorials and libraries
Ecosystem Integration	Works well with cloud, APIs, and databases
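
To make the three steps concrete, here is a minimal sketch of an ETL job using only the Python standard library. It is an illustration under stated assumptions, not a production pipeline: the CSV content, the sales table, and the function names are all made up for this example, and an in-memory SQLite database stands in for a real warehouse.

```python
import csv
import io
import sqlite3
from datetime import datetime

# Hypothetical raw export from a point-of-sale system (illustrative data only).
RAW_CSV = """transaction_id,customer_name,sale_date,amount
1001,jane DOE,25/03/2026,120.50
1002,john smith,26/03/2026,15000.00
1001,jane DOE,25/03/2026,120.50
"""


def extract(raw_text):
    """Extract: read rows from the source (here, an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: de-duplicate, standardise formats, apply a business rule."""
    seen, clean = set(), []
    for row in rows:
        if row["transaction_id"] in seen:
            continue  # drop duplicate transaction IDs
        seen.add(row["transaction_id"])
        amount = float(row["amount"])
        clean.append({
            "transaction_id": int(row["transaction_id"]),
            "customer_name": row["customer_name"].title(),
            # DD/MM/YYYY -> YYYY-MM-DD
            "sale_date": datetime.strptime(row["sale_date"], "%d/%m/%Y").date().isoformat(),
            "amount": amount,
            # business rule from the article: orders over 10,000 are high-value
            "is_high_value": int(amount > 10_000),
        })
    return clean


def load(rows, conn):
    """Load: write only the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE sales (transaction_id INTEGER PRIMARY KEY, customer_name TEXT,"
        " sale_date TEXT, amount REAL, is_high_value INTEGER)"
    )
    conn.executemany(
        "INSERT INTO sales VALUES (:transaction_id, :customer_name, :sale_date,"
        " :amount, :is_high_value)",
        rows,
    )
    conn.commit()


warehouse = sqlite3.connect(":memory:")  # stand-in for a real warehouse
load(transform(extract(RAW_CSV)), warehouse)
loaded = warehouse.execute("SELECT * FROM sales ORDER BY transaction_id").fetchall()
```

The point to notice is the order of the calls: the warehouse never sees the duplicate row or the unformatted dates, because transform runs before load.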

2. What is ELT?

ELT stands for Extract, Load, Transform. Notice the difference: the T (Transform) and L (Load) have swapped positions.

Instead of transforming data before loading it, ELT loads the raw data first and then transforms it inside the target system, usually a modern cloud data warehouse.

Using the water analogy again: instead of treating water before distribution, you pipe all the raw water directly into a large, powerful filtration tank at the destination.

The Three Steps of ELT

Step 1 — Extract

Same as ETL — data is pulled from various source systems.

Step 2 — Load

Raw data is loaded directly into the target system without transformation.

Example: Raw transaction records are loaded into a Snowflake table called raw_transactions.

Step 3 — Transform

Transformations are applied inside the warehouse using SQL or tools like dbt.

Example: A dbt model queries raw_transactions and creates a clean table called fact_sales.
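
The same flow can be sketched in a few lines of Python. This is only an approximation of the idea: SQLite stands in for a cloud warehouse such as Snowflake, a plain CREATE TABLE AS statement stands in for a dbt model, and the sample rows and column names are invented for the example.

```python
import sqlite3

# sqlite3 stands in for a cloud warehouse; the rows below are made up.
warehouse = sqlite3.connect(":memory:")

# Load: raw records go in untouched, duplicates and inconsistent casing included.
warehouse.execute(
    "CREATE TABLE raw_transactions (transaction_id INTEGER, customer_name TEXT, amount REAL)"
)
warehouse.executemany(
    "INSERT INTO raw_transactions VALUES (?, ?, ?)",
    [(1, "jane doe", 120.5), (1, "jane doe", 120.5), (2, "JOHN SMITH", 80.0)],
)

# Transform: runs inside the warehouse, in SQL, after loading.
warehouse.execute(
    """
    CREATE TABLE fact_sales AS
    SELECT DISTINCT
        transaction_id,
        LOWER(customer_name) AS customer_name,
        amount
    FROM raw_transactions
    """
)

raw_count = warehouse.execute("SELECT COUNT(*) FROM raw_transactions").fetchone()[0]
fact_rows = warehouse.execute(
    "SELECT * FROM fact_sales ORDER BY transaction_id"
).fetchall()
```

The raw table still holds all three original rows after the transformation, which is what makes it possible to change the transformation logic later and rebuild fact_sales from scratch.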

ELT in One Sentence:

"Extract the data, load all of it into your powerful cloud warehouse first, then transform it there."

Why ELT Has Become So Popular

ELT’s rise is tied to cloud warehouses like:

Snowflake
Google BigQuery
Amazon Redshift

These systems provide:

elastic compute power
columnar storage
massively parallel processing (MPP)

Key Advantages of ELT

Flexibility

Raw data is stored first, allowing transformation logic to change later.

Efficiency at Scale

Parallel processing makes ELT faster for large datasets.

Suitability for Large Datasets

ELT scales horizontally as data volumes grow.

3. Key Differences Between ETL and ELT

Factor	ETL	ELT
Transform Location	Outside the warehouse	Inside the warehouse
Best For	Structured data	Big data analytics
Scalability	Limited by server	Cloud scalable
Flexibility	Schema defined early	Schema flexible
Speed	Slower load	Faster load
Security	Data filtered before load	Raw data stored first
Popular Tools	Talend, Informatica	dbt, Snowflake

Understanding the Most Important Differences

Where Does Transformation Happen?

ETL transforms data before loading.

ELT transforms data after loading.

Raw Data Preservation

ELT keeps original raw data available for reprocessing.

Scalability

ELT scales with the cloud warehouse itself: adding compute to the warehouse adds transformation capacity, with no separate ETL server to resize.

Speed and Data Ingestion

ELT often loads data faster because transformation happens later.

Control and Data Exposure

ETL offers more control over what enters the warehouse.

4. Real-World Use Cases

When ETL Makes Sense

Banking and Financial Reporting

Strict validation rules required.

Tools:

Informatica PowerCenter
IBM DataStage

Healthcare Data Integration

Standardised clinical data formats required.

Tools:

Talend
Microsoft SSIS
Apache NiFi

Legacy System Migration

Cleaning historical data before migration.

When ELT Makes Sense

E-commerce Analytics Platform

Tools:

Fivetran
Snowflake
dbt

SaaS Product Analytics

Tools:

Segment
Google BigQuery
dbt

Marketing Attribution Analysis

Tools:

Airbyte
Amazon Redshift
dbt

5. Popular Tools for ETL and ELT

Tool	Type	Best Known For
Informatica PowerCenter	ETL	Enterprise pipelines
Microsoft SSIS	ETL	SQL Server integration
Talend Open Studio	ETL	Open-source pipelines
Apache NiFi	ETL	Real-time flows
AWS Glue	ETL/ELT	AWS integration
Fivetran	ELT	Automated connectors
Airbyte	ELT	Open-source connectors
dbt	ELT	SQL transformations
Snowflake + dbt	ELT	Modern stack
Google BigQuery	ELT	Serverless analytics

A Closer Look at dbt

dbt enables analysts to write SQL SELECT statements that transform raw data directly inside the warehouse.

Features:

version control
testing
documentation
modular SQL models

6. Which One Should You Choose?

Situation	Recommended
Using cloud warehouse	ELT
Sensitive data	ETL
Frequent transformation changes	ELT
Legacy infrastructure	ETL
SQL-based teams	ELT
Need raw data history	ELT
Regulated industries	ETL

General Rule of Thumb:

If you are building a new pipeline using a cloud warehouse, ELT is often the better starting point.

7. Putting It All Together: A Practical Example

Scenario: Online Bookstore

Data Sources

Orders database (PostgreSQL)
Customer reviews (MongoDB)
Marketing emails (Mailchimp API)
Website behaviour (Google Analytics)

Goal

Build a dashboard showing:

daily revenue
top-selling books
customer acquisition cost
review sentiment trends

ETL Approach

Talend extracts from multiple sources, transforms on ETL server, loads into SQL Server warehouse.

ELT Approach

Fivetran loads raw data into Snowflake.

dbt transforms raw tables into analytics models.

Which approach wins?

ELT provides more flexibility for analytics teams.

Conclusion

ETL and ELT are architectural patterns with different strengths.

ETL excels in:

regulated environments
structured pipelines
legacy systems

ELT excels in:

cloud analytics
scalability
flexibility

The key difference:

ETL cleans before storing.

ELT stores before cleaning.

As modern data tooling evolves, ELT is becoming the default approach for analytics engineering workflows.

Understanding both approaches allows you to design better pipelines and make smarter technical decisions.

Advanced SQL Techniques Every Data Analyst Should Know

John Wakaba — Thu, 09 Apr 2026 10:54:27 +0000

You can write a SELECT statement. You can JOIN tables and slap on a WHERE clause. But somewhere between "I know SQL" and "I really know SQL" lies a gap that separates analysts who get things done from analysts who get things done fast, elegantly, and correctly.

This article covers the techniques that live in that gap.

1. Window Functions

Most analysts discover GROUP BY early and lean on it forever. Window functions do something fundamentally different — they let you compute aggregates without collapsing rows.

SELECT
  employee_id,
  department,
  salary,
  AVG(salary) OVER (PARTITION BY department) AS dept_avg,
  salary - AVG(salary) OVER (PARTITION BY department) AS diff_from_avg
FROM employees;

You get one row per employee, but each row carries its department's average alongside it. No subquery. No self-join. No mess.

Running totals and moving averages

SELECT
  order_date,
  revenue,
  SUM(revenue) OVER (ORDER BY order_date) AS running_total,
  AVG(revenue) OVER (
    ORDER BY order_date
    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
  ) AS rolling_7day_avg
FROM daily_sales;

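The ROWS BETWEEN clause is where window functions get powerful. You can define exactly which rows belong to each window frame: preceding rows, following rows, or any combination. Note that ROWS counts rows, not calendar days, so the 7-day label above only holds if daily_sales has exactly one row per day with no gaps.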

Ranking without ties headaches

SELECT
  product_name,
  category,
  revenue,
  RANK()       OVER (PARTITION BY category ORDER BY revenue DESC) AS rank_with_gaps,
  DENSE_RANK() OVER (PARTITION BY category ORDER BY revenue DESC) AS rank_no_gaps,
  ROW_NUMBER() OVER (PARTITION BY category ORDER BY revenue DESC) AS row_num
FROM products;

RANK() skips numbers after ties. DENSE_RANK() doesn't. ROW_NUMBER() ignores ties entirely and just counts. Know which one you actually need before reaching for the first one.
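
If the difference is hard to picture, the three behaviours can be re-implemented in a few lines of plain Python. This is a sketch of the semantics only, not how a database computes them, and the function name and sample revenues are made up.

```python
def rank_rows(revenues):
    """Return (revenue, rank, dense_rank, row_number) tuples, highest revenue first."""
    ordered = sorted(revenues, reverse=True)
    result = []
    rank = dense_rank = 0
    previous = None
    for row_number, revenue in enumerate(ordered, start=1):
        if revenue != previous:
            rank = row_number   # RANK jumps to the row position after a tie
            dense_rank += 1     # DENSE_RANK advances by exactly one
            previous = revenue
        result.append((revenue, rank, dense_rank, row_number))
    return result


# Two products tie for first place at 500.
ranked = rank_rows([500, 300, 500, 200])
```

With a tie at the top, the third row gets rank 3 under RANK, rank 2 under DENSE_RANK, and row number 3 under ROW_NUMBER.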

2. CTEs

Common Table Expressions (CTEs) don't make your query faster (usually), but they make it more readable — and readable queries are maintainable queries.

WITH monthly_revenue AS (
  SELECT
    DATE_TRUNC('month', order_date) AS month,
    SUM(total_amount) AS revenue
  FROM orders
  GROUP BY 1
),
revenue_growth AS (
  SELECT
    month,
    revenue,
    LAG(revenue) OVER (ORDER BY month) AS prev_month_revenue,
    ROUND(
      100.0 * (revenue - LAG(revenue) OVER (ORDER BY month))
           / LAG(revenue) OVER (ORDER BY month),
      2
    ) AS mom_growth_pct
  FROM monthly_revenue
)
SELECT *
FROM revenue_growth
WHERE mom_growth_pct IS NOT NULL
ORDER BY month;

Each CTE is a named, composable step. You can read this top-to-bottom and understand exactly what's happening. Compare that to a nested subquery version and you'll never go back.

Recursive CTEs for hierarchical data

When you have org charts, category trees, or any parent-child relationship, recursive CTEs are the right tool:

WITH RECURSIVE org_chart AS (
  -- Base case: top-level managers
  SELECT
    employee_id,
    name,
    manager_id,
    0 AS depth,
    name::text AS path  -- cast so the type matches the concatenated text in the recursive step
  FROM employees
  WHERE manager_id IS NULL

  UNION ALL

  -- Recursive case: direct reports
  SELECT
    e.employee_id,
    e.name,
    e.manager_id,
    oc.depth + 1,
    oc.path || ' > ' || e.name
  FROM employees e
  INNER JOIN org_chart oc ON e.manager_id = oc.employee_id
)
SELECT * FROM org_chart ORDER BY path;

This walks the entire hierarchy in a single query, no matter how deep it goes.
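
To try the pattern without a PostgreSQL server, the sketch below runs a trimmed version of the same query against an in-memory SQLite database from Python. The four employees are invented sample data, and SQLite is only a stand-in here; its recursive CTE syntax happens to match for this query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (employee_id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER)"
)
# Made-up hierarchy: Amina manages Brian and David; Brian manages Chloe.
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(1, "Amina", None), (2, "Brian", 1), (3, "Chloe", 2), (4, "David", 1)],
)

org_rows = conn.execute(
    """
    WITH RECURSIVE org_chart AS (
        -- Base case: employees with no manager
        SELECT employee_id, name, 0 AS depth, name AS path
        FROM employees
        WHERE manager_id IS NULL

        UNION ALL

        -- Recursive case: attach each employee to their manager's row
        SELECT e.employee_id, e.name, oc.depth + 1, oc.path || ' > ' || e.name
        FROM employees e
        JOIN org_chart oc ON e.manager_id = oc.employee_id
    )
    SELECT name, depth, path FROM org_chart ORDER BY path
    """
).fetchall()
```

One caveat worth keeping in mind: if the data contains a cycle (an employee who is, directly or indirectly, their own manager), a query like this will not terminate on its own.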

3. Advanced Aggregations: GROUPING SETS, ROLLUP, and CUBE

Say goodbye to UNION ALL chains for multi-level summaries.

SELECT
  region,
  product_category,
  SUM(revenue) AS total_revenue
FROM sales
GROUP BY GROUPING SETS (
  (region, product_category),  -- subtotals by region + category
  (region),                    -- subtotals by region only
  (product_category),          -- subtotals by category only
  ()                           -- grand total
);

ROLLUP is a shorthand when your groupings have a natural hierarchy:

GROUP BY ROLLUP (year, quarter, month)
-- Produces: (year, quarter, month), (year, quarter), (year), ()

CUBE generates all possible combinations. Useful for cross-dimensional analysis, but be careful — it grows exponentially with the number of dimensions.
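
A quick way to see what ROLLUP and CUBE expand to is to enumerate the grouping sets yourself. The helper names below are hypothetical and the code only lists the sets; it does not run any aggregation.

```python
from itertools import combinations


def rollup_sets(columns):
    """Grouping sets produced by ROLLUP: every prefix, longest first, down to ()."""
    return [tuple(columns[:i]) for i in range(len(columns), -1, -1)]


def cube_sets(columns):
    """Grouping sets produced by CUBE: every subset of the columns."""
    return [
        combo
        for size in range(len(columns), -1, -1)
        for combo in combinations(columns, size)
    ]


rollup = rollup_sets(["year", "quarter", "month"])
cube = cube_sets(["region", "product_category"])
```

ROLLUP over n columns yields n + 1 grouping sets, while CUBE yields 2 to the power n, which is why a CUBE over many dimensions gets expensive quickly.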

4. The FILTER Clause: Conditional Aggregation Without CASE

Most people do conditional aggregation like this:

SELECT
  SUM(CASE WHEN status = 'completed' THEN amount ELSE 0 END) AS completed_revenue,
  SUM(CASE WHEN status = 'refunded'  THEN amount ELSE 0 END) AS refunded_revenue
FROM orders;

There's a cleaner way:

SELECT
  SUM(amount) FILTER (WHERE status = 'completed') AS completed_revenue,
  SUM(amount) FILTER (WHERE status = 'refunded')  AS refunded_revenue
FROM orders;

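The FILTER clause attaches directly to the aggregate function. It is not just aesthetically cleaner; it makes the intent unmistakably clear, and it works with any aggregate function, including COUNT, AVG, and STRING_AGG, as well as aggregates used as window functions. Two caveats: FILTER is standard SQL but not universally supported (PostgreSQL and SQLite have it; MySQL and SQL Server do not), and when no rows match, SUM(...) FILTER returns NULL where the CASE ... ELSE 0 version returns 0, so wrap it in COALESCE if you need a zero.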

5. LATERAL Joins: Correlated Subqueries That Scale

A LATERAL join lets a subquery in the FROM clause reference columns from tables that appear earlier in the same FROM clause. Think of it as a "for each row, compute this" operation.

SELECT
  c.customer_id,
  c.name,
  recent.order_date,
  recent.amount
FROM customers c
CROSS JOIN LATERAL (
  SELECT order_date, amount
  FROM orders o
  WHERE o.customer_id = c.customer_id
  ORDER BY order_date DESC
  LIMIT 3
) recent;

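This fetches the 3 most recent orders per customer, something that takes an extra subquery with a window function and cannot be done with a plain join alone. Lateral joins shine for top-N-per-group patterns. Note that CROSS JOIN LATERAL drops customers with no orders; use LEFT JOIN LATERAL ... ON true to keep them.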

6. String Aggregation and Array Operations

Real-world data is messy. Sometimes you need to collapse multiple rows into a single delimited string, or work with arrays directly.

-- Collapse tags into a comma-separated list per article
SELECT
  article_id,
  STRING_AGG(tag, ', ' ORDER BY tag) AS tags
FROM article_tags
GROUP BY article_id;

-- PostgreSQL: aggregate into an actual array
SELECT
  user_id,
  ARRAY_AGG(DISTINCT product_id ORDER BY product_id) AS purchased_products
FROM purchases
GROUP BY user_id;

And once you have arrays, you can query into them:

SELECT user_id
FROM user_preferences
WHERE 'dark_mode' = ANY(feature_flags);

7. Query Optimization: Reading EXPLAIN Output

Fast queries aren't magic — they're the result of understanding what the database is actually doing.

EXPLAIN ANALYZE
SELECT c.name, COUNT(o.order_id)
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.name;

The output tells you:

Seq Scan vs Index Scan: a sequential scan on a large table that returns only a few rows is a red flag
Hash Join vs Nested Loop: hash joins are usually better for large datasets; nested loops for small ones
Actual rows vs estimated rows: large discrepancies often mean stale statistics, so run ANALYZE
Rows Removed by Filter: an index on the filtered column can cut these down sharply

A few high-impact habits:

Index foreign keys (the database often won't do this automatically)
Avoid functions on indexed columns in WHERE clauses. WHERE EXTRACT(YEAR FROM created_at) = 2024 can't use a plain index on created_at, but WHERE created_at >= '2024-01-01' AND created_at < '2025-01-01' can
Use LIMIT with OFFSET carefully — large offsets scan and discard rows; keyset pagination is faster for deep pages
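
The pagination point is easier to see side by side. The sketch below compares the two approaches on a made-up orders table in SQLite; the table, page size, and function names are assumptions for illustration, and on 100 rows both are instant, so it shows the shape of the queries rather than the speed difference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(i, i * 10.0) for i in range(1, 101)]
)

PAGE_SIZE = 10


def page_by_offset(page_number):
    """OFFSET pagination: the database still walks past every skipped row."""
    return conn.execute(
        "SELECT order_id FROM orders ORDER BY order_id LIMIT ? OFFSET ?",
        (PAGE_SIZE, (page_number - 1) * PAGE_SIZE),
    ).fetchall()


def page_by_keyset(last_seen_id):
    """Keyset pagination: seek straight to the rows after the last one seen."""
    return conn.execute(
        "SELECT order_id FROM orders WHERE order_id > ? ORDER BY order_id LIMIT ?",
        (last_seen_id, PAGE_SIZE),
    ).fetchall()


offset_page = page_by_offset(5)   # skips 40 rows, then returns 10
keyset_page = page_by_keyset(40)  # starts directly after order_id 40
```

Both calls return the same ten rows. The keyset version needs the client to remember the last order_id it saw, which is the trade-off for not scanning the skipped rows.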

8. Date and Time Manipulation

Time-series analysis is central to most analyst work, and doing it well requires a solid command of date functions.

-- Cohort analysis: group users by signup month
SELECT
  DATE_TRUNC('month', signup_date) AS cohort_month,
  DATE_TRUNC('month', activity_date) AS activity_month,
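  -- AGE returns years and months separately, so combine them to count past 11 months
  DATE_PART('year', AGE(activity_date, signup_date)) * 12
    + DATE_PART('month', AGE(activity_date, signup_date)) AS months_since_signup,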
  COUNT(DISTINCT user_id) AS active_users
FROM user_activity
GROUP BY 1, 2, 3
ORDER BY 1, 3;

-- Generate a complete date spine (no gaps even if data is missing)
SELECT date::date
FROM GENERATE_SERIES(
  '2024-01-01'::date,
  '2024-12-31'::date,
  '1 day'::interval
) AS gs(date);

The date spine pattern is essential for time-series work — join it against your data and missing dates appear as NULLs rather than disappearing from your results entirely.
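
The same idea can be shown outside SQL. This small Python sketch builds a date spine and joins made-up sales figures onto it; the function name and the data are illustrative assumptions.

```python
from datetime import date, timedelta


def date_spine(start, end):
    """Every calendar day from start to end, inclusive."""
    return [start + timedelta(days=n) for n in range((end - start).days + 1)]


# Hypothetical sparse data: nothing recorded on 2 or 4 January.
daily_sales = {
    date(2024, 1, 1): 120.0,
    date(2024, 1, 3): 75.5,
    date(2024, 1, 5): 210.0,
}

# Equivalent of a LEFT JOIN from the spine to the data: missing days become None.
filled = [
    (day, daily_sales.get(day))
    for day in date_spine(date(2024, 1, 1), date(2024, 1, 5))
]
```

Without the spine, the result would contain three rows and a chart built on it would silently connect 1 January to 3 January as if nothing were missing.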

9. NULL Handling

NULL is not zero. NULL is not an empty string. NULL is the absence of a value, and it propagates in ways that catch everyone out at some point.

-- This looks right but silently excludes NULLs from the average
SELECT AVG(response_time) FROM requests;
-- NULL values are ignored by AVG — this may be what you want, or may not be

-- Be explicit:
SELECT
  AVG(response_time) AS avg_excluding_nulls,
  AVG(COALESCE(response_time, 0)) AS avg_treating_null_as_zero,
  COUNT(*) AS total_rows,
  COUNT(response_time) AS non_null_rows
FROM requests;

And the classic NULL comparison mistake:

-- This never returns rows where manager_id is NULL
WHERE manager_id != 5

-- You need:
WHERE manager_id != 5 OR manager_id IS NULL

NULLIF is your friend for division-by-zero protection:

SELECT revenue / NULLIF(sessions, 0) AS revenue_per_session
FROM traffic_data;

When sessions is 0, NULLIF returns NULL, and dividing by NULL yields NULL — no error, no corrupt data.
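
Both behaviours are easy to verify from Python with an in-memory SQLite database. The tables and rows below are made up for the demonstration. One difference from PostgreSQL to be aware of: SQLite already returns NULL for division by zero instead of raising an error, so the NULLIF guard matters most on engines like PostgreSQL, though the query reads the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Employee 1 has no manager (NULL), employee 2 reports to 5, employee 3 to 7.
conn.execute("CREATE TABLE employees (employee_id INTEGER, manager_id INTEGER)")
conn.executemany("INSERT INTO employees VALUES (?, ?)", [(1, None), (2, 5), (3, 7)])

# NULL != 5 evaluates to NULL, not true, so employee 1 is filtered out.
naive = conn.execute(
    "SELECT employee_id FROM employees WHERE manager_id != 5 ORDER BY employee_id"
).fetchall()

# Adding the explicit IS NULL check brings employee 1 back.
null_safe = conn.execute(
    "SELECT employee_id FROM employees"
    " WHERE manager_id != 5 OR manager_id IS NULL ORDER BY employee_id"
).fetchall()

# NULLIF turns a zero divisor into NULL, and the division then yields NULL.
conn.execute("CREATE TABLE traffic_data (revenue REAL, sessions INTEGER)")
conn.executemany("INSERT INTO traffic_data VALUES (?, ?)", [(100.0, 4), (50.0, 0)])
per_session = conn.execute(
    "SELECT revenue / NULLIF(sessions, 0) FROM traffic_data ORDER BY sessions DESC"
).fetchall()
```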

10. Pivot Tables with CASE and FILTER

SQL doesn't have a native PIVOT keyword in most databases, but you can build pivot tables with conditional aggregation:

SELECT
  product_category,
  SUM(amount) FILTER (WHERE EXTRACT(YEAR FROM order_date) = 2022) AS revenue_2022,
  SUM(amount) FILTER (WHERE EXTRACT(YEAR FROM order_date) = 2023) AS revenue_2023,
  SUM(amount) FILTER (WHERE EXTRACT(YEAR FROM order_date) = 2024) AS revenue_2024,
  ROUND(
    100.0 * (
      SUM(amount) FILTER (WHERE EXTRACT(YEAR FROM order_date) = 2024) -
      SUM(amount) FILTER (WHERE EXTRACT(YEAR FROM order_date) = 2022)
    ) / NULLIF(SUM(amount) FILTER (WHERE EXTRACT(YEAR FROM order_date) = 2022), 0),
    1
  ) AS pct_change_2022_to_2024
FROM orders
GROUP BY product_category
ORDER BY revenue_2024 DESC NULLS LAST;

Conclusion

These techniques aren't separate tricks to memorize — they combine. A typical advanced query might use a CTE to build a clean base dataset, window functions to compute ranks and running totals, FILTER for conditional aggregation, and LATERAL to pull related records. The result is a single, readable, performant query that would have taken three separate queries and some Python glue code to produce otherwise.

CONNECTING POSTGRESQL WITH POWER BI (FOR A LOAN PERFORMANCE DASHBOARD)

John Wakaba — Thu, 09 Apr 2026 10:23:23 +0000

INTRODUCTION

Power BI is one of the most widely used business intelligence tools out there. One of the most valuable skills for a data analyst is the ability to transform raw data into actionable insights.

In financial institutions such as banks, SACCOs, and fintech firms, loan-related data is produced daily. This includes details about borrowers, disbursed amounts, repayment histories, and indicators of loan performance. Left unanalyzed, this information sits dormant and serves no purpose.

In this tutorial, we will develop a Loan Performance Dashboard by integrating PostgreSQL with Power BI. This project illustrates:

How to store structured data in PostgreSQL
How to connect PostgreSQL to Power BI
How to prepare data for reporting, build essential loan performance metrics, and design an interactive dashboard

Why Use PostgreSQL Together with Power BI?

PostgreSQL ranks among the most widely used relational databases in the field of data analytics. It enables analysts to store structured datasets in an organized manner and carry out transformations through SQL. Power BI, on the other hand, is a business intelligence tool that empowers analysts to build dashboards and interactive reports.

When combined, the two form a robust workflow: PostgreSQL handles data storage, SQL manages data preparation, and Power BI takes care of visualization and insight generation. This workflow finds broad application across financial analytics, credit risk analysis, fintech analytics, and business intelligence reporting.

Project Overview

I want to build a dashboard that answers questions such as:

How many loans have been issued?
How much money has been disbursed?
How much has been repaid?
How much is still outstanding?
Which counties have the highest borrowing levels?
What percentage of loans are defaulted?
How does loan activity change over time?

To accomplish this, I will set up a simple relational database consisting of three tables — borrowers, loans, and repayments.

I will work with a small sample dataset to keep the project straightforward and easy to follow. This dataset represents a simplified loan portfolio containing borrower information alongside repayment activity.

Step 1: Create Tables in PostgreSQL

Create the borrowers table

CREATE TABLE borrowers (
    borrower_id INT PRIMARY KEY,
    borrower_name VARCHAR(100),
    gender VARCHAR(20),
    county VARCHAR(50),
    employment_status VARCHAR(50)
);

Create the loans table

CREATE TABLE loans (
    loan_id INT PRIMARY KEY,
    borrower_id INT,
    loan_amount NUMERIC(12,2),
    interest_rate NUMERIC(5,2),
    loan_term_months INT,
    issue_date DATE,
    loan_status VARCHAR(20),
    FOREIGN KEY (borrower_id) REFERENCES borrowers(borrower_id)
);

Create the repayments table

CREATE TABLE repayments (
    repayment_id INT PRIMARY KEY,
    loan_id INT,
    payment_amount NUMERIC(12,2),
    payment_date DATE,
    FOREIGN KEY (loan_id) REFERENCES loans(loan_id)
);

Step 2: Load Data Into PostgreSQL

Import the CSV files into PostgreSQL using DBeaver or pgAdmin. Alternatively, insert the data manually using SQL INSERT statements.

Step 3: Creating a Reporting View

Rather than connecting Power BI directly to raw tables, we build a view that brings together borrower, loan, and repayment data in one place. This makes the reporting process in Power BI significantly more straightforward.

CREATE OR REPLACE VIEW vw_loan_dashboard AS
SELECT
    l.loan_id,
    b.borrower_name,
    b.gender,
    b.county,
    b.employment_status,
    l.loan_amount,
    l.interest_rate,
    l.loan_term_months,
    l.issue_date,
    l.loan_status,
    COALESCE(SUM(r.payment_amount), 0) AS total_paid,
    l.loan_amount - COALESCE(SUM(r.payment_amount), 0) AS outstanding_balance
FROM loans l
JOIN borrowers b
    ON l.borrower_id = b.borrower_id
LEFT JOIN repayments r
    ON l.loan_id = r.loan_id
GROUP BY
    l.loan_id,
    b.borrower_name,
    b.gender,
    b.county,
    b.employment_status,
    l.loan_amount,
    l.interest_rate,
    l.loan_term_months,
    l.issue_date,
    l.loan_status;

This view provides a clean dataset ready for Power BI.

The vw_loan_dashboard view serves as the reporting layer for this project. It joins the loans, borrowers, and repayments tables into a single clean structure, pulling in borrower details alongside each loan record. A LEFT JOIN is used with the repayments table to ensure that loans with no repayment activity are still captured. From there, two calculated fields are derived — total_paid, which sums all repayments per loan, and outstanding_balance, which subtracts the total paid from the original loan amount. This consolidated view makes it straightforward to build meaningful metrics and visualizations in Power BI without managing complex joins on the reporting side.
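
To check that the LEFT JOIN and COALESCE behave as described, the core of the view can be reproduced on a tiny made-up dataset. This is a simplified sketch: it uses SQLite from Python instead of PostgreSQL, leaves out the borrower columns, and names the view vw_loan_balance to make clear it is not the full vw_loan_dashboard.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE loans (loan_id INTEGER PRIMARY KEY, loan_amount REAL);
    CREATE TABLE repayments (
        repayment_id INTEGER PRIMARY KEY,
        loan_id INTEGER,
        payment_amount REAL
    );

    -- Loan 1 has two repayments; loan 2 has none.
    INSERT INTO loans VALUES (1, 1000.0), (2, 500.0);
    INSERT INTO repayments VALUES (1, 1, 300.0), (2, 1, 200.0);

    CREATE VIEW vw_loan_balance AS
    SELECT
        l.loan_id,
        l.loan_amount,
        COALESCE(SUM(r.payment_amount), 0) AS total_paid,
        l.loan_amount - COALESCE(SUM(r.payment_amount), 0) AS outstanding_balance
    FROM loans l
    LEFT JOIN repayments r ON l.loan_id = r.loan_id
    GROUP BY l.loan_id, l.loan_amount;
    """
)

balances = conn.execute("SELECT * FROM vw_loan_balance ORDER BY loan_id").fetchall()
```

Loan 2 still appears, with total_paid of 0 and its full amount outstanding. With an INNER JOIN it would drop out of the view, and the dashboard totals would understate the portfolio.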

Step 4: Connect PostgreSQL to Power BI

Launch Power BI Desktop and navigate to Home, then Get Data, and select PostgreSQL Database. Enter your server name along with your database name, then load the following:

borrowers
loans
repayments
vw_loan_dashboard

Step 5: Create Power BI measures

In this step we create the following DAX measures:

Total Loans

Total_Loans = COUNT('loans vw_loan_dashboard'[loan_id])

Total Disbursed Amount

Total_Disbursed_Amount = SUM('loans vw_loan_dashboard'[loan_amount])

Total Amount Paid

Total_Amount_Paid = SUM('loans vw_loan_dashboard'[total_paid])

Total Outstanding Balance

Total_Outstanding_Balance = SUM('loans vw_loan_dashboard'[outstanding_balance])

Average Loan Size

Average_Loan_Size = AVERAGE('loans vw_loan_dashboard'[loan_amount])

Defaulted Loans

Defaulted_Loans = CALCULATE(COUNT('loans vw_loan_dashboard'[loan_id]),'loans vw_loan_dashboard'[loan_status]="Defaulted")

Default Rate

Default_Rate = DIVIDE([Defaulted_Loans], [Total_Loans], 0) * 100

Expected Interest Revenue

Expected_Interest_Revenue = SUMX('loans vw_loan_dashboard','loans vw_loan_dashboard'[loan_amount] * 'loans vw_loan_dashboard'[interest_rate])

Step 6: Build The Dashboard

Dashboard Layout Example

Conclusion

Connecting PostgreSQL with Power BI allows analysts to transform structured financial data into meaningful insights.

Loan performance dashboards help organizations understand portfolio health, monitor repayment behavior, and identify potential risks.

Building an AI-Powered Personalized Learning Platform with FastAPI, PostgreSQL, and Mistral AI

John Wakaba — Tue, 10 Mar 2026 08:55:50 +0000

Artificial Intelligence is transforming education by enabling systems
that adapt to individual learning needs. In this article, I'll walk
through how I built an AI-powered personalized learning platform
that generates quizzes, tracks student progress, and provides real-time
insights for teachers.

The Problem

Traditional learning platforms often deliver the same content to every
student, regardless of their performance. However, students learn at
different speeds and struggle with different topics.

The goal of this project was to build a system that:

• Generates quizzes automatically using AI
• Tracks student learning behavior
• Detects struggling students
• Provides teachers with data-driven insights

System Architecture

The system consists of four main components:

Student Interaction Layer

FastAPI Backend

PostgreSQL Database

AI Engine (Mistral)

Architecture overview:

Students

↓

FastAPI API

↓

PostgreSQL Database

↓

Mistral AI

↓

Analytics Dashboard

AI Quiz Generation

Instead of manually creating quizzes, the platform uses Mistral AI
to generate questions dynamically.

Example API endpoint:

GET /generate-quiz/algebra

The AI returns:

Question
Multiple choice answers
Correct answer
Explanation

This allows the platform to generate unlimited quizzes for any topic.

Real-Time Feedback

When students submit answers, the backend evaluates correctness and generates explanations.

POST /submit-answer

Example response:

correct: true
score: 100
feedback: Explanation of the answer

Students receive immediate feedback, improving engagement and learning efficiency.

Adaptive Learning

One of the most important features is adaptive difficulty.

If a student performs well, the system generates harder questions.

If a student struggles, the system provides simpler explanations and
easier quizzes.

This creates a personalized learning experience.
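
One simple way such a rule could work is sketched below. This is not the platform's actual logic: the three difficulty levels, the 80 and 50 thresholds, and the function name are all assumptions chosen for illustration.

```python
# Hypothetical difficulty ladder, easiest first.
LEVELS = ["easy", "medium", "hard"]


def next_difficulty(current, recent_scores, promote_at=80.0, demote_at=50.0):
    """Pick the next quiz difficulty from the average of recent scores (0-100)."""
    if not recent_scores:
        return current  # no evidence yet, so stay at the current level
    average = sum(recent_scores) / len(recent_scores)
    index = LEVELS.index(current)
    if average >= promote_at:
        index = min(index + 1, len(LEVELS) - 1)  # step up, capped at the top level
    elif average < demote_at:
        index = max(index - 1, 0)                # step down, floored at the bottom
    return LEVELS[index]
```

For example, next_difficulty("medium", [90, 85]) moves the student up to "hard", while next_difficulty("medium", [30, 40]) moves them down to "easy". Averaging several recent scores rather than reacting to a single quiz keeps the difficulty from bouncing around.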

Data Analytics with SQL

Every interaction is stored in PostgreSQL, allowing powerful analytics.

Example insights:

Average student performance
Topic difficulty analysis
Learning trends over time
Detection of struggling students

Example SQL query:

SELECT student_id, AVG(score) FROM quiz_results GROUP BY student_id;
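
Building on that query, detecting struggling students can be as simple as adding a HAVING clause. The sketch below runs it from Python against an in-memory SQLite database; the sample rows, the topic column, and the threshold of 60 are assumptions made for the example, not values from the project.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE quiz_results (student_id INTEGER, topic TEXT, score REAL)")
conn.executemany(
    "INSERT INTO quiz_results VALUES (?, ?, ?)",
    [
        (1, "algebra", 90),
        (1, "algebra", 80),
        (2, "algebra", 40),
        (2, "geometry", 50),
    ],
)

AT_RISK_THRESHOLD = 60  # hypothetical cut-off for flagging a student

at_risk = conn.execute(
    """
    SELECT student_id, AVG(score) AS avg_score
    FROM quiz_results
    GROUP BY student_id
    HAVING AVG(score) < ?
    ORDER BY avg_score
    """,
    (AT_RISK_THRESHOLD,),
).fetchall()
```

Student 1 averages 85 and is left out; student 2 averages 45 and is flagged.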

Teacher Dashboard

To visualize insights, I built a Streamlit dashboard.

Teachers can view:

Student performance
Difficult topics
Performance trends
At-risk students

This allows educators to identify learning gaps early.

Technologies Used

FastAPI
PostgreSQL
Mistral AI
SQL Analytics
Streamlit
Python

Final Thoughts

AI-powered learning platforms have the potential to transform education by making learning personalized, adaptive, and data-driven.

This project is a simplified prototype of what modern EdTech platforms can achieve using open-source tools and AI models.

Understanding Joins and Window Functions in SQL

John Wakaba — Wed, 04 Mar 2026 09:31:49 +0000

When working with relational databases, data is rarely stored in one single table. Instead, it is organized into multiple related tables to reduce redundancy and improve structure.

In a typical transactional database, you might have:

customers → customer information
orders → order transactions
books → product details

To analyze meaningful business insights, we must first combine this data. That’s where SQL Joins come in.

Once the data is combined, we can apply Window Functions to perform advanced analysis such as ranking, running totals, and trend comparisons.

This article walks you through both concepts in a logical flow — starting with joins and finishing with window functions.

SQL JOINS

Why Joins Matter

Relational databases are built on relationships.

For example:

A customer places many orders.
An order references one book.
A book can appear in many orders.

To analyze this properly, we must join the tables together using a shared key — usually customer_id or book_id.

INNER JOIN

An INNER JOIN returns only rows that exist in both tables.

If we want to see which customers placed orders:

SELECT o.order_id,
       c.first_name,
       c.second_name,
       o.order_date,
       o.quantity
FROM orders o
INNER JOIN customers c
    ON o.customer_id = c.customer_id;

✔ Returns only customers who have placed orders

❌ Excludes customers with no orders

This is the most commonly used join in data analysis.

LEFT JOIN (LEFT OUTER JOIN)

A LEFT JOIN returns:

All rows from the left table
Matching rows from the right table
NULL where no match exists

Example: Show all customers, even those who haven’t ordered anything.

SELECT c.customer_id,
       c.first_name,
       c.second_name,
       o.order_id,
       o.quantity
FROM customers c
LEFT JOIN orders o
    ON c.customer_id = o.customer_id;

✔ Every customer appears

✔ Customers without orders will show NULL in order columns

This is extremely useful for identifying inactive customers.

RIGHT JOIN

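A RIGHT JOIN mirrors a LEFT JOIN: it returns all rows from the right table, plus matching rows from the left table, with NULL where no match exists.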

SELECT c.first_name,
       c.second_name,
       o.order_id
FROM customers c
RIGHT JOIN orders o
    ON c.customer_id = o.customer_id;

This ensures all orders appear, even if customer data is missing.

In practice, RIGHT JOIN is less common because we can usually rewrite it using LEFT JOIN by switching table order.

FULL JOIN (FULL OUTER JOIN)

A FULL JOIN returns all rows from both tables.

SELECT c.first_name,
       c.second_name,
       o.order_id
FROM customers c
FULL JOIN orders o
    ON c.customer_id = o.customer_id;

✔ Shows matches

✔ Shows unmatched customers

✔ Shows unmatched orders

This is useful for data auditing and reconciliation.

Joining Multiple Tables

Real-world analysis often requires more than two tables.

Example: See which customer ordered which book.

SELECT c.first_name,
       c.second_name,
       b.title,
       o.quantity,
       o.order_date
FROM orders o
JOIN customers c
    ON o.customer_id = c.customer_id
JOIN books b
    ON o.book_id = b.book_id;

Now we can see:

Customer name
Book title
Quantity ordered
Order date

This joined dataset becomes the foundation for deeper analysis.

CROSS JOIN

A CROSS JOIN produces all possible combinations between two tables.

SELECT c.first_name,
       b.title
FROM customers c
CROSS JOIN books b;

If you have:

10 customers
5 books

You get 50 rows.

This is useful when generating combinations for simulations or recommendation systems.
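The row count of a CROSS JOIN is simply the product of the two table sizes. As a rough illustration outside the database, Python's `itertools.product` builds the same set of pairs. The customer and book names below are placeholders.

```python
from itertools import product

# Hypothetical sample data: 10 customers and 5 book titles.
customers = [f"customer_{i}" for i in range(1, 11)]
books = [f"book_{j}" for j in range(1, 6)]

# product() pairs every customer with every book, like a CROSS JOIN.
pairs = list(product(customers, books))

print(len(pairs))  # 50
print(pairs[0])    # ('customer_1', 'book_1')
```

Because the result grows multiplicatively, it is worth checking table sizes before cross joining anything large.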

Anti-Join (Finding Missing Records)

SQL doesn’t have a direct ANTI JOIN, but we simulate it using LEFT JOIN + NULL filtering.

Example: Find customers who have never placed an order.

SELECT c.customer_id,
       c.first_name,
       c.second_name
FROM customers c
LEFT JOIN orders o
    ON c.customer_id = o.customer_id
WHERE o.customer_id IS NULL;

This is powerful for churn analysis and business reporting.
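The same question can also be written with NOT EXISTS, which many people find easier to read. The sketch below, using made-up sample rows in an in-memory SQLite database, runs both forms and shows that they return the same customers.

```python
import sqlite3

# Hypothetical sample data: Carol has never placed an order.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, first_name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Amina'), (2, 'Brian'), (3, 'Carol');
    INSERT INTO orders VALUES (101, 1), (102, 1), (103, 2);
""")

# Anti-join written as LEFT JOIN + NULL filter.
left_join_version = conn.execute("""
    SELECT c.customer_id, c.first_name
    FROM customers c
    LEFT JOIN orders o ON c.customer_id = o.customer_id
    WHERE o.customer_id IS NULL
""").fetchall()

# The same anti-join written with NOT EXISTS.
not_exists_version = conn.execute("""
    SELECT c.customer_id, c.first_name
    FROM customers c
    WHERE NOT EXISTS (
        SELECT 1 FROM orders o WHERE o.customer_id = c.customer_id
    )
""").fetchall()

print(left_join_version)   # [(3, 'Carol')]
print(not_exists_version)  # [(3, 'Carol')]
```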

How Joins Prepare Data for Window Functions

Notice something important:

Most advanced SQL analysis begins like this:

FROM orders o
JOIN customers c ON o.customer_id = c.customer_id

Why?

Because:

Joins combine related data
Window functions analyze that combined dataset

Joins prepare the structure.

Window functions perform the analytics.

Now that we understand joins, let’s move into advanced analytics.

WINDOW FUNCTIONS

Window functions perform calculations across a set of rows that are related to the current row, without collapsing those rows into one.

Window functions can be used to:

Rank rows
Calculate cumulative totals
Find the difference between consecutive rows in a dataset

Unlike GROUP BY aggregates, a window function returns a value for every row while still drawing on information from the related rows.

ROW_NUMBER()

Assigns a unique number to each row in the result set, starting from 1, in the order specified by the ORDER BY clause.

In a real-world scenario, it can help us track which order each customer made first, second, third, and so on.

If PARTITION BY is used, the numbering restarts at 1 for each partition.

Example: number each customer's orders by order date, restarting the numbering for each customer.

SELECT o.order_id, c.first_name, c.second_name, o.order_date,
 ROW_NUMBER() OVER (PARTITION BY o.customer_id ORDER BY o.order_date) AS row_num
 FROM orders o
 JOIN customers c ON o.customer_id = c.customer_id;

ROW_NUMBER(): Assigns a unique number to each order.

PARTITION BY o.customer_id: Ensures that the row numbering starts fresh for each customer.

The query lists the orders for each customer with their row number (1, 2, 3, ...) in order-date sequence.

To number the orders globally by order date, without resetting for each customer, just remove the PARTITION BY clause.

SQL Query Without Resetting Row Number:

SELECT o.order_id, c.first_name, c.second_name, o.order_date,
 ROW_NUMBER() OVER (ORDER BY o.order_date) AS row_num
 FROM orders o
 JOIN customers c ON o.customer_id = c.customer_id;

Without PARTITION BY, the numbering is continuous across all orders from all customers, based on order_date.
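To compare the two numbering styles side by side, here is a small sketch that computes both in one query against an in-memory SQLite database (window functions need SQLite 3.25 or newer). The four orders are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, order_date TEXT);
    -- Hypothetical sample data: two customers with two orders each.
    INSERT INTO orders VALUES
        (1, 1, '2024-01-05'),
        (2, 2, '2024-01-07'),
        (3, 1, '2024-02-01'),
        (4, 2, '2024-02-10');
""")

rows = conn.execute("""
    SELECT order_id,
           customer_id,
           ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY order_date) AS per_customer,
           ROW_NUMBER() OVER (ORDER BY order_date) AS overall
    FROM orders
    ORDER BY order_id
""").fetchall()

for row in rows:
    print(row)
# (1, 1, 1, 1)
# (2, 2, 1, 2)
# (3, 1, 2, 3)
# (4, 2, 2, 4)
```

The per_customer column restarts at 1 for each customer, while overall keeps counting across the whole table.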

RANK() AND DENSE_RANK()

RANK() assigns a rank to each row, with ties getting the same rank but leaving gaps in subsequent ranks.

DENSE_RANK() works similarly but without leaving gaps in the ranking.

RANK() SQL QUERY

Rank customers by the total quantity of books they have ordered.

SELECT c.first_name, c.second_name, SUM(o.quantity) AS total_quantity,
 RANK() OVER (ORDER BY SUM(o.quantity) DESC) AS quantity_rank
 FROM orders o
 JOIN customers c ON o.customer_id = c.customer_id
 GROUP BY c.customer_id, c.first_name, c.second_name;

RANK() assigns a rank based on SUM(o.quantity) in descending order. Grouping by c.customer_id as well as the name keeps two different customers who share a name from being merged into one row. The alias is quantity_rank rather than rank because RANK is a reserved word in some databases, such as MySQL 8.

If two customers have the same total quantity, they receive the same rank and the next rank is skipped. For example, two customers tied at rank 1 mean the next customer is ranked 3rd.

USING DENSE_RANK()

DENSE_RANK() assigns a rank to each row but does not leave gaps in the ranking when there are ties.

Calculate the dense rank of customers based on the total quantity of books they ordered.

SELECT c.first_name, c.second_name, SUM(o.quantity) AS total_quantity,
 DENSE_RANK() OVER (ORDER BY SUM(o.quantity) DESC) AS quantity_dense_rank
 FROM orders o
 JOIN customers c ON o.customer_id = c.customer_id
 GROUP BY c.customer_id, c.first_name, c.second_name
 ORDER BY quantity_dense_rank;

If two customers are tied, they receive the same rank, and the next customer receives the next consecutive rank: 1, 1, then 2.

This difference between RANK() and DENSE_RANK() determines how tied values are treated in your analysis.
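Here is a minimal sketch that puts RANK() and DENSE_RANK() next to each other on the same tied data. It assumes the per-customer totals are already computed and stored in a hypothetical customer_totals table, and it uses an in-memory SQLite database (3.25 or newer).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_totals (first_name TEXT, total_quantity INTEGER);
    -- Hypothetical totals: Amina and Brian are tied for first place.
    INSERT INTO customer_totals VALUES
        ('Amina', 10), ('Brian', 10), ('Carol', 7), ('David', 4);
""")

rows = conn.execute("""
    SELECT first_name,
           total_quantity,
           RANK()       OVER (ORDER BY total_quantity DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY total_quantity DESC) AS dense_rnk
    FROM customer_totals
    ORDER BY total_quantity DESC, first_name
""").fetchall()

for row in rows:
    print(row)
# ('Amina', 10, 1, 1)
# ('Brian', 10, 1, 1)
# ('Carol', 7, 3, 2)   <- RANK skips 2, DENSE_RANK does not
# ('David', 4, 4, 3)
```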

LEAD() AND LAG()

LEAD()

LEAD() accesses the next row's value.

It reads a value from a row that follows the current row at a given offset (one row ahead by default), within the same partition.

It is generally used to compare the value in the current row with the value in the row that follows it.

Compare the quantity each customer ordered in the current row with the quantity in that customer's next order.

SELECT o.order_id, o.customer_id, o.quantity,
 LEAD(o.quantity) OVER (PARTITION BY o.customer_id ORDER BY o.order_id) AS next_quantity
 FROM orders o;

PARTITION BY o.customer_id keeps the comparison within a single customer. Without it, LEAD() would read the next row in the whole table, which may belong to a different customer.

The query returns the quantity ordered in the current row and the quantity ordered by the same customer in the next row.

For each customer's last row, next_quantity is NULL because there is no next row.

LAG()

LAG() accesses the previous row's value.

It lets you read data from a preceding row in the same result set, which makes it useful for comparing the current row with the one before it and for analyzing trends or changes in behavior over time.

When a PARTITION BY clause is present, LAG() works within each partition separately.

Compare the quantity each customer ordered in the current row with the quantity in that customer's previous order using the LAG() function.

SELECT o.order_id, o.customer_id, o.quantity,
 LAG(o.quantity) OVER (PARTITION BY o.customer_id ORDER BY o.order_id) AS prev_quantity
 FROM orders o;

LAG(o.quantity) reads the quantity from the previous row for the same customer.

The query shows the quantity ordered in the current row and the quantity ordered in the previous row for the same customer.

For each customer's first row, prev_quantity is NULL because there is no previous row.
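The following sketch runs LAG() and LEAD() together on a few made-up orders in an in-memory SQLite database (3.25 or newer). It uses PARTITION BY customer_id so that each comparison stays within one customer's orders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, quantity INTEGER);
    -- Hypothetical sample data: customer 1 has three orders, customer 2 has one.
    INSERT INTO orders VALUES (1, 1, 2), (2, 1, 5), (3, 2, 1), (4, 1, 3);
""")

rows = conn.execute("""
    SELECT order_id,
           customer_id,
           quantity,
           LAG(quantity)  OVER (PARTITION BY customer_id ORDER BY order_id) AS prev_quantity,
           LEAD(quantity) OVER (PARTITION BY customer_id ORDER BY order_id) AS next_quantity
    FROM orders
    ORDER BY customer_id, order_id
""").fetchall()

for row in rows:
    print(row)
# (1, 1, 2, None, 5)
# (2, 1, 5, 2, 3)
# (4, 1, 3, 5, None)
# (3, 2, 1, None, None)
```

The None values are SQL NULLs: the first order of each customer has no previous row, and the last has no next row.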

NTILE() FUNCTION

Distributes the rows of an ordered result set into a specified number of roughly equal buckets.

It is useful for analysis and reporting because it lets you segment rows into groups such as halves, quartiles, or deciles.

We want to divide customers into 2 groups (halves) based on their total order quantity.

SELECT c.first_name, c.second_name, SUM(o.quantity) AS total_quantity,
 NTILE(2) OVER (ORDER BY SUM(o.quantity) DESC) AS quantity_tile
 FROM orders o
 JOIN customers c ON o.customer_id = c.customer_id
 GROUP BY c.customer_id, c.first_name, c.second_name
 ORDER BY quantity_tile;

NTILE(2) divides customers into two groups (halves) that are as equal in size as possible, based on their total quantity ordered. When the rows do not divide evenly, the earlier groups receive the extra rows. Using NTILE(4) instead would produce quartiles.
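To see how NTILE() handles a row count that does not divide evenly, here is a small sketch with five made-up customer totals in a hypothetical customer_totals table, run on in-memory SQLite (3.25 or newer).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer_totals (first_name TEXT, total_quantity INTEGER);
    -- Hypothetical totals for five customers.
    INSERT INTO customer_totals VALUES
        ('Amina', 50), ('Brian', 40), ('Carol', 30), ('David', 20), ('Esther', 10);
""")

rows = conn.execute("""
    SELECT first_name,
           total_quantity,
           NTILE(2) OVER (ORDER BY total_quantity DESC) AS quantity_tile
    FROM customer_totals
    ORDER BY total_quantity DESC
""").fetchall()

tiles = [tile for _, _, tile in rows]
print(tiles)  # [1, 1, 1, 2, 2]
```

With five rows and two buckets, the first bucket gets three rows and the second gets two.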

PARTITION BY

This clause divides the result set into partitions, and the window function works independently within each partition.

Show every order together with the total quantity ordered by that customer and the average price of the books that customer ordered.

SELECT c.first_name, c.second_name, o.order_id, o.quantity,
 SUM(o.quantity) OVER (PARTITION BY o.customer_id) AS total_order_quantity,
 AVG(b.price) OVER (PARTITION BY o.customer_id) AS avg_price
 FROM orders o
 JOIN customers c ON o.customer_id = c.customer_id
 JOIN books b ON o.book_id = b.book_id
 ORDER BY c.first_name;

SUM(o.quantity) OVER (...) gives the total quantity ordered by each customer, repeated on every one of that customer's rows.

PARTITION BY o.customer_id ensures the totals and averages are calculated separately for each customer. No GROUP BY is needed here: combining GROUP BY with these window functions would compute the window over the grouped rows rather than over the individual orders.
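A useful way to think about PARTITION BY is to compare it with GROUP BY on the same data. The sketch below, using made-up orders in an in-memory SQLite database (3.25 or newer), shows that GROUP BY returns one row per customer while the window version keeps every order row and attaches the customer total to it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, quantity INTEGER);
    -- Hypothetical sample data.
    INSERT INTO orders VALUES (1, 1, 2), (2, 1, 5), (3, 2, 1);
""")

# GROUP BY collapses the orders into one row per customer.
grouped = conn.execute("""
    SELECT customer_id, SUM(quantity)
    FROM orders
    GROUP BY customer_id
    ORDER BY customer_id
""").fetchall()

# The window version keeps all three orders and repeats the total on each.
windowed = conn.execute("""
    SELECT order_id, customer_id, quantity,
           SUM(quantity) OVER (PARTITION BY customer_id) AS total_order_quantity
    FROM orders
    ORDER BY order_id
""").fetchall()

print(grouped)   # [(1, 7), (2, 1)]
print(windowed)  # [(1, 1, 2, 7), (2, 1, 5, 7), (3, 2, 1, 1)]
```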

I Built an AI Tourism Assistant for Kenya Using RAG, pgvector, and Streamlit

John Wakaba — Wed, 04 Mar 2026 09:21:41 +0000

Imagine asking:

"What's the best luxury safari in Maasai Mara?"

and instantly getting personalized travel recommendations powered by
AI.

That's exactly what I built: an AI Tourism Intelligence Assistant
that helps travelers discover the best travel packages in Kenya based on
their budget, travel style, duration, and preferred destination.

In this article, I'll walk you through:

• The idea behind the project
• How I built the AI recommendation system
• The RAG architecture powering it
• How vector search makes travel discovery smarter
• Deployment with Streamlit

✨ The Idea

Kenya is one of the world's most beautiful tourism destinations,
offering:

Wildlife safaris 🦁
Tropical beaches 🏝
Mountain adventures ⛰
Cultural experiences

But planning trips can be frustrating because:

• Travel packages are scattered across multiple websites
• Platforms rarely provide personalized recommendations
• Comparing destinations based on budget or style is difficult

So I decided to build an AI-powered tourism assistant that could:

✔ Understand traveler preferences
✔ Retrieve relevant travel packages
✔ Generate intelligent recommendations

🧠 What the AI Assistant Does

Users simply input their preferences:

Budget
Travel duration
Travel style
Preferred destination

The system then returns relevant travel packages from a tourism
database.

Example query:

Budget: $2000
Days: 5
Style: Relaxing
Destination: Diani

The assistant responds with recommended travel packages matching those
criteria.

⚙️ Tech Stack

Programming

Python

Data Engineering

PostgreSQL
pgvector

AI

Mistral AI embeddings
Retrieval-Augmented Generation (RAG)

Data Collection

Playwright
BeautifulSoup

Backend

SQLAlchemy

Frontend

Streamlit

Deployment

Streamlit Cloud
Neon PostgreSQL

🏗 System Architecture

Tourism Websites
      │
      ▼
Web Scraping (Playwright)
      │
      ▼
PostgreSQL Database
      │
      ▼
Embedding Generation (Mistral AI)
      │
      ▼
Vector Database (pgvector)
      │
      ▼
Recommendation Engine
      │
      ▼
Streamlit Web Application

🔎 How the RAG System Works

The project uses Retrieval‑Augmented Generation (RAG) to deliver
intelligent responses.

Instead of the AI guessing answers, it retrieves real travel packages
from the database first.

Pipeline:

User Query
     │
     ▼
Convert Query → Embedding
     │
     ▼
Vector Similarity Search
     │
     ▼
Retrieve Relevant Travel Packages
     │
     ▼
Generate Personalized Response

This ensures the AI responds with real tourism data rather than
hallucinations.
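The retrieval half of this pipeline can be sketched in a few lines of plain Python. This is not the project's code: the three-dimensional vectors are toy stand-ins for real embeddings, the package names are illustrative, and the similarity search that pgvector would do inside PostgreSQL is done here with a hand-written cosine similarity.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (hypothetical); real ones have far more dimensions.
packages = {
    "Budget Maasai Mara safari": [0.9, 0.1, 0.0],
    "Diani beach getaway": [0.1, 0.9, 0.1],
    "Mount Kenya trek": [0.0, 0.2, 0.9],
}

def retrieve(query_embedding, top_k=2):
    """Return the names of the top_k packages most similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, embedding), name)
        for name, embedding in packages.items()
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]

# Stands in for the embedding of a query such as "affordable safari".
query_embedding = [0.8, 0.2, 0.1]
context = retrieve(query_embedding)

# The retrieved packages become the only material the model may draw on.
prompt = "Recommend a trip using only these packages:\n" + "\n".join(context)
print(prompt)
```

Grounding the prompt in retrieved rows is what keeps the response tied to packages that actually exist in the database.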

🗄 Database Design

The database stores travel information in structured tables such as:

travel_packages
destinations

Each travel package contains:

Package name
Destination
Duration
Price
Description
Vector embedding

🔍 Why Vector Search Matters

Traditional search relies on keywords.

Vector search understands meaning and context.

For example, if a user searches:

"Affordable safari in Kenya"

The system can still return:

• Budget Maasai Mara packages
• Lake Nakuru safari deals
• Amboseli wildlife tours

Even if those exact words were not used.

💻 Building the Interface

The frontend is built using Streamlit, which makes it easy to create
interactive data apps.

Users can:

✔ Enter travel preferences
✔ Browse travel packages
✔ Receive AI‑powered recommendations

🚀 Deployment

The application is deployed using:

Streamlit Cloud for hosting the web app.

Neon PostgreSQL for the managed database.

This allows the project to run fully online.

📊 Key Results

The project successfully delivers:

✔ AI-powered tourism recommendations
✔ Semantic search using vector embeddings
✔ A fully deployed web application
✔ Personalized travel package discovery

⚠️ Challenges I Faced

Web Scraping Complexity

Many travel websites load content dynamically, which required
Playwright.

Data Quality Issues

Scraped data often contained:

• Missing prices
• Duplicate packages
• Inconsistent destination names
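A cleaning pass for these three issues might look like the sketch below. It is an illustration rather than the project's actual code: the field names, the alias map, and the choice to drop packages with no price (instead of, say, flagging them) are all assumptions.

```python
# Hypothetical alias map from messy spellings to one canonical name.
DESTINATION_ALIASES = {
    "masai mara": "Maasai Mara",
    "maasai mara": "Maasai Mara",
    "diani beach": "Diani",
    "diani": "Diani",
}

def clean_packages(raw_packages):
    """Drop price-less rows, normalize destinations, and remove duplicates."""
    cleaned, seen = [], set()
    for pkg in raw_packages:
        if pkg.get("price") is None:
            continue  # assumption: packages without a price are skipped
        raw_destination = pkg["destination"].strip()
        destination = DESTINATION_ALIASES.get(raw_destination.lower(), raw_destination)
        key = (pkg["name"].strip().lower(), destination)
        if key in seen:
            continue  # same package scraped twice
        seen.add(key)
        cleaned.append(
            {"name": pkg["name"].strip(), "destination": destination, "price": pkg["price"]}
        )
    return cleaned

# Made-up scraped rows showing each problem once.
raw = [
    {"name": "Mara Classic", "destination": "Masai Mara", "price": 1200},
    {"name": "mara classic ", "destination": "maasai mara", "price": 1200},
    {"name": "Beach Escape", "destination": "Diani Beach", "price": None},
    {"name": "Coast Weekend", "destination": " diani ", "price": 450},
]
cleaned = clean_packages(raw)
print(cleaned)
```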

Embedding Rate Limits

Embedding generation triggered API rate limits, requiring retry
logic.
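Retry logic of this kind is usually written as exponential backoff. The sketch below shows the general pattern only. `RateLimitError` and `fake_embed` are hypothetical stand-ins, since the real exception type and client call depend on the embedding library in use.

```python
import time

class RateLimitError(Exception):
    """Hypothetical stand-in for the error raised on a rate-limit response."""

def with_retries(func, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call func(), waiting 1s, 2s, 4s, ... between rate-limited attempts."""
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)

# Demo with a fake embedding call that fails twice, then succeeds.
calls = {"count": 0}

def fake_embed():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RateLimitError("rate limited")
    return [0.1, 0.2, 0.3]

delays = []  # record the waits instead of actually sleeping
embedding = with_retries(fake_embed, sleep=delays.append)
print(embedding, delays)  # [0.1, 0.2, 0.3] [1.0, 2.0]
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting.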

Deployment Configuration

Deployment required careful setup of:

Environment variables
Streamlit secrets
Database connection strings

🔮 Future Improvements

Future versions of the system could include:

• AI itinerary generation
• Social media tourism trend analysis
• Integration with booking APIs
• User accounts and saved trips

🌍 Final Thoughts

Combining vector databases, AI retrieval systems, and interactive web
apps opens powerful opportunities for building intelligent data
products.

This project demonstrates how AI can improve tourism discovery and
travel planning.

🏠 Building a Machine Learning Property Price Predictor (From Web Scraping to Deployment)

John Wakaba — Mon, 23 Feb 2026 07:53:09 +0000

In this project, I built a complete end-to-end machine learning system
that:

Scrapes property listings
Cleans and engineers features
Trains multiple ML models
Deploys a pricing app
Builds a business-ready dashboard

This article walks through the entire pipeline from raw web data to a deployed ML product.

Step 1 --- Web Scraping

I built a Selenium scraper to extract:

Location
Property Type
Bedrooms
Bathrooms
Size (sqm)
Amenities
Price (KES)
Listing Date

Sample Scraping Logic

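from selenium.webdriver.common.by import By

# `driver` is assumed to be a Selenium WebDriver that has already loaded the listings page.
# Match any container div whose class name suggests it is a listing card.
listings = driver.find_elements(
    By.XPATH,
    "//div[contains(@class,'listing') or contains(@class,'property') or contains(@class,'card')]"
)

property_urls = []
for listing in listings:
    # find_elements returns an empty list rather than raising when a card has no link
    links = listing.find_elements(By.XPATH, ".//a[contains(@href,'/listings/')]")
    if links:
        property_urls.append(links[0].get_attribute("href"))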

Step 3 --- Exploratory Analysis

Most Expensive Locations

location_prices = df.groupby("Location")["Price (KES)"].median().sort_values(ascending=False)
print(location_prices)

Step 4 --- Modeling

Train/Test Split

from sklearn.model_selection import train_test_split

X = df[["Bedrooms", "Bathrooms", "Size (sqm)", "amenity_score"]]
y = df["Price (KES)"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

Linear Regression (Baseline)

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

lr = LinearRegression()
lr.fit(X_train, y_train)

pred = lr.predict(X_test)

mae = mean_absolute_error(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))
r2 = r2_score(y_test, pred)
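For readers who want to see what these three metrics actually compute, here is a plain-Python version of the standard formulas. The prices are made-up numbers, not results from this project.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) computed from their textbook definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n              # mean absolute error
    ss_res = sum(e * e for e in errors)                # residual sum of squares
    rmse = math.sqrt(ss_res / n)                       # root mean squared error
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true) # total sum of squares
    r2 = 1 - ss_res / ss_tot                           # share of variance explained
    return mae, rmse, r2

# Hypothetical prices in millions of KES.
y_true = [10.0, 12.0, 15.0, 20.0]
y_pred = [11.0, 11.0, 16.0, 18.0]

mae, rmse, r2 = regression_metrics(y_true, y_pred)
print(mae)  # 1.25
print(round(rmse, 4), round(r2, 4))
```

MAE is the easiest of the three to explain to non-technical users because it is in the same units as the price.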

Random Forest

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(
    n_estimators=200,
    random_state=42
)

rf.fit(X_train, y_train)
rf_pred = rf.predict(X_test)

XGBoost

from xgboost import XGBRegressor

xgb = XGBRegressor(
    n_estimators=300,
    learning_rate=0.05,
    max_depth=5,
    random_state=42
)

xgb.fit(X_train, y_train)
xgb_pred = xgb.predict(X_test)

Step 5 --- Deployment (Streamlit App)

The pricing app allows users to input:

Location
Bedrooms
Bathrooms
Size
Amenities

And returns:

Predicted price
Estimated range (± MAE)
Explanation of price drivers
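The estimated range can be sketched as a small helper that widens the prediction by the model's MAE on each side. The function name and the numbers are hypothetical. The app's actual implementation is not shown here.

```python
def price_range(predicted_price, mae):
    """Return (low, high) as the prediction minus and plus the MAE."""
    low = max(predicted_price - mae, 0)  # a price cannot go below zero
    high = predicted_price + mae
    return low, high

# Hypothetical values in KES.
low, high = price_range(predicted_price=12_500_000, mae=1_800_000)
print(low, high)  # 10700000 14300000
```

This is a rough band based on average error, not a statistical confidence interval.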

Run locally:

streamlit run Streamlit_app.py

Step 6 --- Executive Dashboard

Built using Streamlit with interactive filters.

Includes:

Median price by location
Monthly price trends
Price per sqm comparison
Amenity impact analysis

Run:

streamlit run Dashboard.py

Key Insights

Size is the strongest determinant of price.
Premium neighborhoods significantly increase valuation.
Amenities increase value but are secondary drivers.

From Messy Data to Confident Decisions: How Analysts Use Power BI, DAX, and Dashboards in the Real World

John Wakaba — Tue, 10 Feb 2026 08:48:51 +0000

Power BI skills are often misunderstood as “just reporting.” In reality, professional analysts use Power BI as a decision-support system — one that transforms messy, unreliable data into insights leaders trust to allocate budgets, adjust strategy, and measure performance.

This article demonstrates how technical Power BI skills translate directly into real-world business decisions and measurable impact, following the same workflow used in real organizations.

Messy Data Is a Business Risk, Not a Technical Issue

In real organizations, data arrives incomplete and inconsistent:

Regions spelled differently across systems
Missing transaction dates
Revenue stored as text
Duplicate customer records
Placeholder values like N/A and Error

When these issues are ignored, dashboards show incorrect KPIs and misleading trends.

Business Impact

When analysts clean data correctly:

Financial metrics become trustworthy
Performance comparisons are accurate
Leaders focus on decisions instead of debating numbers

This is why analysts begin in Power Query, not visuals.

Power Query: Turning Raw Inputs into Reliable Data

Power Query is where analysts reduce business risk.

Using repeatable transformation steps, analysts:

Standardize categories for consistent grouping
Remove invalid or duplicate records
Apply correct data types for calculations and time analysis
Replace pseudo-blanks with true null values

Real-World Outcome

After proper Power Query transformations:

Monthly revenue no longer fluctuates unexpectedly
Forecasts align with finance systems
Data refreshes produce consistent results automatically

Reliable data is the foundation of every decision.

Data Modeling: Structuring Data for Decision-Making

Power BI does not analyze spreadsheets — it analyzes data models.

Professional analysts design star schemas:

Fact tables store measurable business events
Dimension tables provide descriptive context

Why This Matters

Well-designed models ensure:

Predictable filter behavior
Accurate KPIs across dashboards
Strong performance as data volumes grow

Poor modeling leads to conflicting answers and erodes stakeholder trust.

DAX: Translating Business Questions into Logic

DAX allows analysts to express business logic directly in calculations.

Executives ask questions such as:

Are we improving compared to last year?
Which regions are underperforming?
How close are we to our targets?

Using DAX, analysts move beyond raw totals to meaningful metrics like:

Profit margins
Year-over-year growth
Year-to-date performance
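The arithmetic behind these three metrics is simple, and it helps to see it before writing the DAX. The sketch below is plain Python with made-up figures. It illustrates the calculations only and is not DAX syntax.

```python
def profit_margin(revenue, cost):
    """Profit as a share of revenue."""
    return (revenue - cost) / revenue

def yoy_growth(current, prior):
    """Change versus the same period last year, as a share of last year."""
    return (current - prior) / prior

def year_to_date(monthly_values, month):
    """Sum of monthly values from January up to and including `month` (1-based)."""
    return sum(monthly_values[:month])

# Hypothetical figures.
print(profit_margin(revenue=200, cost=150))   # 0.25
print(yoy_growth(current=120, prior=100))     # 0.2
print(year_to_date([10, 20, 30, 40], month=3))  # 60
```

In Power BI these calculations become measures, so they are recomputed for whatever filters the user applies.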

Business Impact

DAX enables:

Fair comparisons across time
KPI tracking against targets
Scenario-based decision-making

Without DAX, dashboards show numbers. With DAX, they show meaning.

Time Intelligence: Supporting Strategic Decisions

Time intelligence allows organizations to understand performance trends.

Using time-based analysis, analysts:

Compare current results to prior periods
Detect early signs of growth or decline
Measure progress toward annual goals

Decisions Enabled

Expanding high-growth regions
Addressing seasonal declines proactively
Adjusting forecasts based on YTD performance

Time intelligence transforms historical data into forward-looking insight.

Dashboards: From Information to Action

Dashboards are not collections of charts. They are decision interfaces.

Effective dashboards:

Highlight critical KPIs
Show trends requiring attention
Surface underperformance and exceptions
Enable fast filtering without technical effort

Measurable Outcomes

Well-designed dashboards help organizations:

Reduce time spent validating numbers
Detect issues earlier
Align teams around shared metrics
Act faster with confidence

Dashboards succeed when users know what to do next.

Measuring Success: Business Impact of Power BI

When Power BI is used effectively:

Leaders trust the data without manual validation
Decisions are backed by consistent metrics
Reporting effort decreases
Performance improvements are measurable

The real value of Power BI is not the report —

it is the decisions enabled by the report.

Conclusion: Power BI as a Strategic Asset

Analysts create measurable impact by combining:

Power Query for data reliability
Data modeling for meaningful analysis
DAX for business logic
Dashboards designed for action

This is how messy data becomes:

Trusted insights
Confident decisions
Real business outcomes

Power BI, when used professionally, is not a reporting tool — it is a strategic decision-making engine.

📊 Understanding Schemas and Data Modelling in Power BI

John Wakaba — Thu, 05 Feb 2026 11:13:12 +0000

Data modelling is the foundation of building scalable and
high-performance dashboards in Power BI. While many developers focus
heavily on visuals and DAX calculations, the true performance and
accuracy of a report depend heavily on how data is structured.

🧱 What Is Data Modelling in Power BI?

Data modelling refers to structuring data into logical formats that
support analysis and reporting. Dimensional modelling organizes data
into:

Facts (measurable metrics)
Dimensions (descriptive attributes)

A well-designed Power BI model determines:

How tables relate
How filters propagate
How fast reports load
How accurate calculations are

⭐ Star Schema

✅ What Is Star Schema?

A Star Schema organizes data using:

One central fact table
Multiple dimension tables connected directly to the fact table

The structure resembles a star, where dimension tables surround the
central fact table.

📊 Components of Star Schema

Fact Table

Contains measurable and quantitative data such as:

Sales revenue
Quantity sold
Profit
Discount

Each row represents a business event like a transaction.

Dimension Tables

Provide descriptive context to fact data such as:

Customer details
Product attributes
Date/Time
Store location
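To make the fact and dimension split concrete, here is a tiny star schema sketched with plain Python structures. The tables and values are made up. In Power BI the lookup below is what a relationship between the two tables does for you.

```python
from collections import defaultdict

# Dimension table: one row per product, holding descriptive attributes.
dim_product = {
    1: {"name": "Notebook", "category": "Stationery"},
    2: {"name": "Pen", "category": "Stationery"},
    3: {"name": "Mug", "category": "Kitchen"},
}

# Fact table: one row per sale, holding measures plus a key into the dimension.
fact_sales = [
    {"product_id": 1, "quantity": 2, "revenue": 10.0},
    {"product_id": 2, "quantity": 5, "revenue": 5.0},
    {"product_id": 3, "quantity": 1, "revenue": 8.0},
    {"product_id": 1, "quantity": 1, "revenue": 5.0},
]

# Aggregate a fact measure by a dimension attribute.
revenue_by_category = defaultdict(float)
for sale in fact_sales:
    category = dim_product[sale["product_id"]]["category"]
    revenue_by_category[category] += sale["revenue"]

print(dict(revenue_by_category))  # {'Stationery': 20.0, 'Kitchen': 8.0}
```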

🚀 Benefits of Star Schema

✔ High query performance
✔ Simple design
✔ Easier DAX calculations
✔ Optimized for reporting and dashboards

⚠️ Limitations

Data redundancy
Higher storage usage

❄️ Snowflake Schema

✅ What Is Snowflake Schema?

A Snowflake Schema extends star schema by normalizing dimension
tables into multiple related tables.

Example:

Customer → City → Country

📌 Features

Normalized dimension tables
Supports hierarchical drill-down analysis
Improves data integrity
Requires additional joins

⚖️ Advantages

✔ Reduced redundancy
✔ Better data consistency
✔ Supports complex hierarchies

⚠️ Limitations

More complex design
Slower performance due to joins

📊 Fact Tables vs Dimension Tables

📊 Fact Tables

Store numeric metrics and foreign keys linking to dimensions.
Usually contain transactional data and large volumes of records.

🧾 Dimension Tables

Store descriptive attributes that provide context to fact tables.
Used for filtering, grouping, and categorizing data.

🔗 Relationships in Power BI

Relationships connect tables and enable filtering across datasets.

📌 Types of Relationships

One-to-Many -- One dimension record links to many fact records
Many-to-One -- Reverse of one-to-many
Many-to-Many -- Multiple matching records on both sides

🚨 Why Good Data Modelling Is Critical

⚡ Performance Optimization

Improves query speed
Reduces memory usage
Enables faster dashboard loading

🎯 Accurate Reporting

Ensures correct aggregations
Maintains reliable filter behavior

🧠 Easier DAX Calculations

Simplifies analytical queries
Improves calculation accuracy

🔧 Scalability

Supports future data expansion
Easier troubleshooting and maintenance

⭐ Star Schema vs Snowflake Schema

| Feature | Star Schema | Snowflake Schema |
| --- | --- | --- |
| Structure | Denormalized | Normalized |
| Performance | Faster queries | Slower queries |
| Complexity | Simple | Complex |
| Storage | Higher storage | Lower storage |
| Use Case | Reporting dashboards | Large complex warehouses |

🏆 Power BI Data Modelling Best Practices

Use Star Schema whenever possible
Separate fact and dimension tables
Maintain clear relationships
Optimize data types
Reduce unnecessary joins

📌 Real-World Analogy

Think of a library:

Fact tables = Books (contain measurable information)
Dimension tables = Catalogue system (organizes and locates books)

🎯 Conclusion

Data modelling is one of the most important skills for Power BI
developers. Understanding schema design ensures dashboards are:

Fast
Accurate
Scalable
Maintainable

Before building visuals or writing DAX formulas, always ask:

Is my data model structured correctly?

Getting Started With Linux for Data Engineers (With Vi and Nano Examples)

John Wakaba — Thu, 05 Feb 2026 10:55:10 +0000

If you're getting into data engineering, there's one skill that keeps
showing up everywhere: Linux.

Whether you're working with cloud servers, big data tools, or pipeline
automation, Linux is almost always running behind the scenes. The good
news? You don't need to be a Linux wizard to get started.

In this guide, we'll break down:

Why data engineers need Linux
Basic commands you'll actually use
How to edit files using Vi
How to edit files using Nano
Real-world examples

Why Should Data Engineers Learn Linux?

Here's the honest truth: most production data systems run on Linux
servers.

When you deploy Spark jobs, schedule Airflow pipelines, or manage
databases, you'll likely connect to a Linux machine.

It's Built for Performance

Linux handles heavy workloads really well, which is perfect for big data
processing.

It's Highly Customizable

Since Linux is open source, companies tailor it for their
infrastructure.

It Runs the Cloud

Most AWS, Azure, and Google Cloud servers run Linux.

It Supports Automation

Data engineers constantly automate workflows using shell scripts.

Linux Commands Every Beginner Should Know

Check Where You Are

pwd

List Files

ls

Move Between Folders

cd folder_name

Create a Folder

mkdir data_project

Create a File

touch notes.txt

Read a File

cat notes.txt

Why Text Editors Matter in Linux

When you log into a server, there's usually no graphical editor like VS
Code or Notepad.

Instead, you use terminal editors like:

Vi (powerful but tricky)
Nano (simple and beginner-friendly)

Using Vi (The Power Tool)

Open or Create a File

vi sample.txt

Enter Insert Mode

Press i and start typing.

Save and Exit

Press ESC, then type:

:wq

Exit Without Saving

:q!

Example Script

vi pipeline.sh

Add:

#!/bin/bash
echo "Pipeline started"

Using Nano (The Friendly Editor)

Open a File

nano notes.txt

Save Your Work

Press:

CTRL + O

Exit Nano

CTRL + X

Example Config

nano config.conf

Add:

database=postgres
username=admin

Real-Life Data Engineering Scenario

You may need to:

Update Airflow configuration
Fix pipeline scripts
Modify database credentials
Check logs

Commands might include:

nano airflow.cfg

vi pipeline.sh

Pro Tips for Beginners

Always back up files before editing
Practice Vi commands slowly
Use Nano when learning
Learn basic shell commands daily

Final Thoughts

Linux is part of the foundation of modern data infrastructure.

Learning Linux commands and text editors gives you confidence when
working with production servers and cloud platforms.

Start with Nano.
Grow into Vi.
Practice consistently.

Real-World ETL Pipeline from a Public Google Sheet

John Wakaba — Wed, 04 Feb 2026 08:52:02 +0000

Spreadsheets are everywhere.

They’re easy to use, easy to share, and often become the first home of business data. But they’re terrible for analytics, automation, and scale.

In this article, I’ll walk through how I built a production-style ETL pipeline that:

Extracts data from a public Google Sheet
Cleans and validates the data using Python
Loads the data into PostgreSQL and MongoDB
Handles real-world issues like UUIDs, connection strings, and performance bottlenecks

The Problem

A supermarket dataset was stored in a Google Sheet. While this works for manual inspection, it introduces several problems:

No schema enforcement
No support for analytics or BI tools
Poor performance for large queries
No safe way to integrate with applications

The goal was to move this data into proper databases while following real ETL best practices.

Architecture Overview

Public Google Sheet (CSV Export)
        ↓
     Python ETL
        ↓
  Transform & Validate
        ↓
 PostgreSQL (Analytics)   MongoDB (Documents)

Why two databases?

PostgreSQL acts as the system of record for analytics and reporting
MongoDB provides flexible, document-based storage for application access

Tools Used

Python 3.12
UV for dependency and environment management
Pandas for data transformation
Requests for HTTP-based extraction
PostgreSQL (via SQLAlchemy)
MongoDB (via PyMongo)
Loguru for structured logging

Step 1: Extracting Data from a Public Google Sheet

Instead of dealing with Google Cloud authentication, I used a simpler (and very realistic) approach.

Google Sheets exposes a CSV export endpoint for public sheets.

A human-friendly link like this:

https://docs.google.com/spreadsheets/d/<sheet-id>/edit

can be converted to:

https://docs.google.com/spreadsheets/d/<sheet-id>/export?format=csv&gid=<gid>

Python can then fetch the data directly:

response = requests.get(csv_url)
df = pd.read_csv(StringIO(response.text))

No API keys. No OAuth. Fully automated.
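The link conversion itself is easy to automate. Here is a minimal sketch using only the standard library. The function name is my own, `EXAMPLE_SHEET_ID` is a placeholder rather than a real sheet, and the default `gid` of "0" assumes you want the first tab.

```python
import re

def to_csv_export_url(sheet_url, gid="0"):
    """Turn a Google Sheets edit link into its CSV export link."""
    match = re.search(r"/spreadsheets/d/([A-Za-z0-9_-]+)", sheet_url)
    if match is None:
        raise ValueError(f"Not a Google Sheets link: {sheet_url!r}")
    sheet_id = match.group(1)
    return f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv&gid={gid}"

# "EXAMPLE_SHEET_ID" is a placeholder, not a real sheet.
csv_url = to_csv_export_url(
    "https://docs.google.com/spreadsheets/d/EXAMPLE_SHEET_ID/edit"
)
print(csv_url)
# https://docs.google.com/spreadsheets/d/EXAMPLE_SHEET_ID/export?format=csv&gid=0
```

The resulting URL is what gets passed to `requests.get` in the extraction step.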

Step 2: Transforming the Data (Where Things Get Real)

This is where assumptions break.

I initially assumed the id column was numeric. It wasn’t.

It contained UUIDs like:

47d54138-a950-4ec0-9d4a-e637e8dfb290

Trying to cast this to an integer caused the pipeline to fail.

Lesson #1: The data always wins

The fix was simple but important:

Treat id as a string
Update both transformation logic and database schemas
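One way to catch this kind of surprise early is to validate the id column before loading. The helper below is a hedged sketch, not the pipeline's actual validation code. It relies on the standard library's `uuid.UUID`, which raises `ValueError` for strings that are not UUIDs.

```python
import uuid

def is_uuid(value):
    """Return True if value parses as a UUID, False otherwise."""
    try:
        uuid.UUID(str(value))
    except ValueError:
        return False
    return True

print(is_uuid("47d54138-a950-4ec0-9d4a-e637e8dfb290"))  # True
print(is_uuid("12345"))                                 # False
```

Running a check like this over a sample of rows shows what the column really contains before any type is chosen for it.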

Step 3: Loading into PostgreSQL

PostgreSQL is the backbone of the pipeline.

Key design decisions:

Strong schema enforcement
Idempotent inserts
Safe re-runs of the pipeline

The table is created if it doesn’t exist, and inserts use:

ON CONFLICT (id) DO NOTHING

This ensures:

No duplicate records
No need to truncate tables
Safe incremental runs
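The effect of ON CONFLICT (id) DO NOTHING is easy to demonstrate. The sketch below uses an in-memory SQLite database as a stand-in for PostgreSQL (SQLite 3.24 or newer accepts the same clause). The table name, columns, and rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE IF NOT EXISTS sales (id TEXT PRIMARY KEY, amount REAL)")

# Hypothetical rows keyed by UUID strings.
rows = [
    ("47d54138-a950-4ec0-9d4a-e637e8dfb290", 120.5),
    ("9c1e7f2a-3b4d-4e5f-8a6b-7c8d9e0f1a2b", 80.0),
]

def load(batch):
    """Insert rows, silently skipping any id that already exists."""
    conn.executemany(
        "INSERT INTO sales (id, amount) VALUES (?, ?) ON CONFLICT (id) DO NOTHING",
        batch,
    )
    conn.commit()

load(rows)
load(rows)  # second run inserts nothing and raises no error

count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(count)  # 2
```

Note that DO NOTHING keeps the existing row as it is, so changed values in the source are not picked up on a re-run.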

Step 4: Loading into MongoDB (and Fixing Performance)

My first MongoDB implementation used update_one() in a loop.

It worked — but it was painfully slow.

The fix was switching to bulk operations:

collection.bulk_write(operations, ordered=False)

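Bulk writes send the operations to MongoDB in batches instead of making one network round trip per document, which reduced load time.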

Step 5: Debugging a Nasty PostgreSQL Error

One of the most confusing errors I hit was:

could not translate host name "4401@localhost"

It turned out the PostgreSQL password contained an @ symbol.

Lesson #3: Database passwords must be URL-safe

The solution was URL encoding:

KIM@4401 → KIM%404401
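Rather than encoding the password by hand, the standard library can do it when the connection string is built. In this sketch the user name, port, and database name are hypothetical. Only the password comes from the example above.

```python
from urllib.parse import quote_plus

password = "KIM@4401"
encoded = quote_plus(password)  # "@" becomes "%40"

# "etl_user", port 5432, and "supermarket" are placeholder values.
database_url = f"postgresql://etl_user:{encoded}@localhost:5432/supermarket"

print(encoded)       # KIM%404401
print(database_url)  # postgresql://etl_user:KIM%404401@localhost:5432/supermarket
```

With the password encoded, the only unescaped "@" left in the URL is the one that separates the credentials from the host.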

Results

After fixing these issues, the pipeline:

Runs end-to-end
Can be safely re-run without duplicates
Loads clean data into both databases
Handles real-world data quirks correctly

Key Takeaways

Spreadsheets are common sources — but not suitable destinations
Never assume data types without inspecting real data
UUIDs are extremely common in production systems
Bulk operations matter for performance
Environment variables and connection strings are frequent failure points

Final Thoughts

This project wasn’t about flashy tools.

It was about building something real, breaking it, and fixing it.

Building a MedAdvantage RAF Engine with dbt & PostgreSQL (Step-by-Step Guide)

John Wakaba — Tue, 13 Jan 2026 10:09:54 +0000

In this project, I built a mini Medicare Advantage Risk Adjustment Factor (RAF) engine using PostgreSQL, dbt Core, and synthetic healthcare data.

The goal was to simulate a real-world healthcare analytics pipeline that transforms raw claims data into member-level risk scores.

This article walks through the entire process step by step, from raw CSV files to a final analytics mart.

1. Project Overview

The project models how healthcare organizations calculate risk scores using:

Member demographic data
Medical claims with ICD-10 diagnosis codes
Pharmacy claims with NDC drug codes
Reference mapping tables that convert codes into HCC and RxHCC categories

The final output is a table that shows:

Member ID
Service year
Gender and plan
Total HCC weight
Total RxHCC weight
Final risk score

This type of table is commonly used for actuarial analysis, reimbursement modeling, and population health analytics.

2. Tools & Environment Setup

Before starting the project, I installed and configured:

PostgreSQL as the data warehouse
DBeaver as the database client
Python + virtual environment (venv)
dbt Core with the Postgres adapter

This setup allows a modern ELT (Extract → Load → Transform) workflow where data is first loaded into the warehouse and then transformed using dbt.

3. Database Design

I used one PostgreSQL database with two schemas:

med_project → holds raw data and reference tables
analytics → holds all dbt models (staging, core, and marts)

Raw tables stored in `med_project`:

Members
Medical claims
Pharmacy claims
ICD-to-HCC mapping table
NDC-to-RxHCC mapping table

dbt models stored in `analytics`:

Staging models
Core models
Mart models (final analytics tables)

This separation keeps raw data immutable and transformed data fully governed by dbt.

4. Loading the Raw Data

Synthetic CSV files were generated and loaded into PostgreSQL using DBeaver’s Import Tool.

Each CSV corresponded to one raw table:

Members
Medical claims
Pharmacy claims

All raw table columns were stored as TEXT initially. This prevents ingestion failures during loading and allows all data type enforcement to be handled inside dbt.

5. Initializing the dbt Project

A new dbt project was created inside a Python virtual environment.

During initialization:

PostgreSQL was selected as the adapter
A profile was created in profiles.yml
The default target schema was set to analytics

The connection between dbt and PostgreSQL was then validated before any models were built (dbt debug is the usual command for this check).

6. Registering Raw Tables as dbt Sources

To allow dbt to reference raw tables safely, a source configuration file was created.

This file tells dbt:

Which schema the raw tables are in
Which tables are considered authoritative raw sources
Which tables act as reference data (HCC & RxHCC mappings)

This enables consistent use of dbt sources and prevents hard-coding schema names inside models.

7. Building the Staging Layer

The staging layer is where all raw data is cleaned and standardized.

At this stage, I performed:

Explicit parsing of U.S.-formatted dates
Numeric type conversions for amounts and quantities
Trimming and upper-casing of ICD and NDC codes
Basic null handling

Three staging models were created:

stg_members
stg_medical_claims
stg_pharmacy_claims

These models ensure that all downstream data has:

Consistent data types
Clean formats
Reliable date values
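
The staging rules are implemented as SQL in dbt, but their logic can be sketched in plain Python. This is an illustration under stated assumptions: dates are taken to be MM/DD/YYYY, and the function and column names are hypothetical.

```python
from datetime import date, datetime
from decimal import Decimal
from typing import Optional


def parse_us_date(value: Optional[str]) -> Optional[date]:
    """Parse MM/DD/YYYY text; blank or missing values become None."""
    if value is None or not value.strip():
        return None
    return datetime.strptime(value.strip(), "%m/%d/%Y").date()


def clean_code(value: Optional[str]) -> Optional[str]:
    """Trim and upper-case an ICD or NDC code; blanks become None."""
    if value is None or not value.strip():
        return None
    return value.strip().upper()


def parse_amount(value: Optional[str]) -> Optional[Decimal]:
    """Convert a text amount to Decimal; blanks become None."""
    if value is None or not value.strip():
        return None
    return Decimal(value.strip())


# Synthetic raw row: every column arrives as text.
raw_claim = {"service_date": "03/07/2025", "icd_code": " e11.9 ", "paid_amount": "125.50"}
staged_claim = {
    "service_date": parse_us_date(raw_claim["service_date"]),
    "icd_code": clean_code(raw_claim["icd_code"]),
    "paid_amount": parse_amount(raw_claim["paid_amount"]),
}
print(staged_claim)
```

Stating the date format explicitly is the important part: 03/07/2025 is read as March 7, never as July 3.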

8. Building the Core Layer

The core layer represents analytics-ready entities.

Here I focused on:

Creating one clean row per member
Passing through only validated claim and pharmacy records
Preparing the data for aggregation

Core models included:

Clean member dimension
Medical claim fact table
Pharmacy claim fact table

This layer removes duplication and creates stable structures used by the marts.

9. Mapping Diagnoses to HCCs

Medical claims were mapped to HCC categories using the ICD-to-HCC reference table.

At this stage:

Each member’s diagnosis codes were expanded
Codes were normalized
Valid diagnoses were matched to HCC categories

The output produces one record per member per year per HCC category, with an associated risk weight.
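
The mapping step can be illustrated with a short Python sketch. Everything here is synthetic: the HCC labels and weights are placeholders rather than real CMS values, and the variable names are hypothetical. In the project itself this logic lives in a dbt SQL model.

```python
# Hypothetical reference table: ICD-10 code -> (HCC label, weight).
# Labels and weights are placeholders, not real CMS values.
icd_to_hcc = {
    "E11.9": ("HCC_A", 0.25),
    "I50.9": ("HCC_B", 0.5),
}

# One row per diagnosis on a claim (synthetic data).
claims = [
    {"member_id": "M1", "service_year": 2025, "icd_code": " e11.9"},
    {"member_id": "M1", "service_year": 2025, "icd_code": "E11.9"},   # repeat diagnosis
    {"member_id": "M1", "service_year": 2025, "icd_code": "Z00.00"},  # no HCC mapping
    {"member_id": "M2", "service_year": 2025, "icd_code": "I50.9"},
]

member_hccs = {}
for claim in claims:
    code = claim["icd_code"].strip().upper()  # normalize before matching
    if code not in icd_to_hcc:
        continue  # unmapped diagnoses carry no risk weight
    hcc, weight = icd_to_hcc[code]
    # Keying on (member, year, HCC) keeps one record per category.
    member_hccs[(claim["member_id"], claim["service_year"], hcc)] = weight

for key, weight in sorted(member_hccs.items()):
    print(key, weight)
```

The repeated diagnosis for member M1 collapses into a single record, so a category contributes its weight once per member per year no matter how many claims carry it.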

10. Computing Final Member Risk Scores

The final step combined:

Aggregated HCC weights from medical claims
Aggregated RxHCC weights from pharmacy claims
Member demographic attributes

For each member and service year:

HCC weights were summed
RxHCC weights were summed
A base score was added
A final risk score was computed

This produced the final analytics mart: member_risk_scores.
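
As a rough sketch of the arithmetic, assuming the final score is simply the base score plus the two summed weights (the article does not spell out the formula), the aggregation might look like this in Python. The base value and all rows are synthetic, and the join to member demographics is left out for brevity.

```python
from collections import defaultdict

BASE_SCORE = 0.5  # hypothetical; the article does not state the base value

# (member_id, service_year, category, weight) rows; all values are synthetic.
hcc_rows = [("M1", 2025, "HCC_A", 0.25), ("M1", 2025, "HCC_B", 0.5)]
rxhcc_rows = [("M1", 2025, "RXHCC_A", 0.125)]


def total_weights(rows):
    """Sum weights per (member_id, service_year)."""
    totals = defaultdict(float)
    for member_id, year, _category, weight in rows:
        totals[(member_id, year)] += weight
    return totals


hcc_totals = total_weights(hcc_rows)
rxhcc_totals = total_weights(rxhcc_rows)

# Assumed formula: base + total HCC weight + total RxHCC weight.
member_risk_scores = {
    key: BASE_SCORE + hcc_totals.get(key, 0.0) + rxhcc_totals.get(key, 0.0)
    for key in set(hcc_totals) | set(rxhcc_totals)
}
print(member_risk_scores)
```

Using .get(key, 0.0) means a member with only medical or only pharmacy claims still receives a score instead of being dropped.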

11. Results & Key Outcomes

Results

A fully functional end-to-end healthcare analytics pipeline
Clean transformation workflow from raw CSVs to final mart
A production-style member-level risk score table
Data ready for reporting tools like Power BI or Tableau

Key Takeaways

dbt enforces strong modeling discipline through layered architecture
Reference mapping tables are the backbone of healthcare risk analytics
Explicit type casting prevents silent data quality issues
Separation of raw, staging, core, and marts ensures scalability and auditability

12. Key Challenges Faced & Resolved

PostgreSQL CSV permission issues → Resolved by using client-side imports
Cross-database reference errors in dbt → Fixed by aligning the dbt database configuration
U.S. date format parsing errors → Solved by explicitly controlling date parsing in staging
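Semicolon syntax errors in dbt models → Resolved by removing trailing semicolons (dbt wraps each model's SELECT inside its own CREATE statement, so a trailing semicolon ends up in the middle of the generated SQL)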
Source configuration mismatches → Fixed by correcting schema references

Final Thoughts

This project demonstrates how modern analytics engineering tools can simulate real Medicare Advantage risk modeling workflows using open-source technologies.