DEV Community: Kahuthu Muriuki

Python in Data Analytics: A Practical Starting Point

Kahuthu Muriuki — Mon, 11 May 2026 08:44:45 +0000

I started learning Python not because I wanted to become a software developer, but because I kept hitting walls in Excel. I work in data analytics and business intelligence, and my career has cut across fintech, payments platforms, KYC/AML compliance, and securities brokerage. In every one of those roles, there came a point where a spreadsheet could not do what I needed — whether that was cleaning 50,000 rows of messy transaction records, automating a daily reconciliation report, or pulling data from an API that only spoke JSON.

Python was the tool that removed those walls. This article is written for people in a similar position: you work with data, you are comfortable with spreadsheets, and you are wondering whether Python is worth the investment. It is.

What Python Actually Is

Python is a general-purpose programming language created by Guido van Rossum and first released in 1991. "General-purpose" means it was not designed exclusively for data work — people build web applications, automation scripts, machine learning models, and even games with it. But its design philosophy is what makes it stand out: readability and simplicity over cleverness.

Python code reads close to plain English. Where other languages require you to declare variable types, manage memory, or write dozens of lines of boilerplate before doing anything useful, Python lets you get to the point fast. This matters enormously for data work, where the goal is answering a question, not building production software.

Here is a concrete comparison. To read a CSV file and calculate an average in Python, the code looks like this:

import pandas as pd

data = pd.read_csv("transactions.csv")
average_amount = data["amount"].mean()
print(average_amount)

Four lines, counting the import. No configuration files, no class declarations, no compilation step. You write it, you run it, you get a number. That directness is a large part of why Python is so widely used in data analytics today.

Why Python Took Over Data Analytics

Python was not always the default choice. R held that position in academic statistics for years, and SAS dominated in corporate environments, particularly in banking and pharma. What shifted the balance was a combination of factors that compounded over the 2010s.

The library ecosystem matured. Pandas, NumPy, Matplotlib, and Scikit-learn went from experimental projects to production-grade tools used by companies like Netflix, Spotify, and JP Morgan. These libraries gave Python capabilities that previously required expensive licensed software.

The community grew around data problems. Stack Overflow, GitHub, and platforms like Kaggle created a feedback loop: more people used Python for data work, more questions got answered, more tutorials got written, more beginners chose Python because help was easy to find.

Industry adopted it. When companies like Google, Facebook, and Amazon made Python a standard part of their data tooling, the job market followed. Today, most data analyst and data scientist job listings mention Python as either required or preferred. This is not a passing trend; it has been the case for years.

It bridges analysis and engineering. A SQL query can pull data. Excel can summarise it. But if you need to automate a pipeline that pulls data every morning, cleans it, runs calculations, and emails a report — that is a programming task. Python handles the entire chain without switching tools.

Python Libraries That Matter for Data Analytics

A library in Python is a collection of pre-written code that handles a specific type of task. You do not build everything from scratch — you import a library and use its functions. These are the ones I use regularly, and the ones any data analyst will encounter early.

Pandas

Pandas is the workhorse. It provides the DataFrame — a table structure similar to a spreadsheet or SQL table — and an enormous set of functions for filtering, grouping, merging, reshaping, and summarising data. If you work with tabular data (and in analytics, you almost always do), Pandas is the first library you learn and the last one you stop using.

Real example: I have used Pandas to merge customer transaction records from separate payment systems, where the data came in different formats with different column names and date conventions. Pandas handled the column renaming, date parsing, and the merge on a common identifier in about 20 lines of code. In Excel, the same task involved VLOOKUP chains across three workbooks and broke every time the source files changed format.
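A minimal sketch of that kind of merge, assuming two hypothetical extracts. The frames system_a and system_b, their column names, and every value below are invented for illustration and are not taken from any real payment system.

```python
import pandas as pd

# Hypothetical extracts from two payment systems (invented names and values).
system_a = pd.DataFrame({
    "TxnRef": ["T001", "T002", "T003"],
    "TxnDate": ["11/05/2026", "12/05/2026", "13/05/2026"],  # DD/MM/YYYY
    "Amount": [1500.0, 250.0, 980.0],
})
system_b = pd.DataFrame({
    "reference": ["T001", "T002", "T004"],
    "date": ["2026-05-11", "2026-05-12", "2026-05-14"],  # YYYY-MM-DD
    "value": [1500.0, 255.0, 410.0],
})

# Rename to a shared schema, then parse each date convention explicitly.
a = system_a.rename(columns={"TxnRef": "txn_id", "TxnDate": "date", "Amount": "amount_a"})
b = system_b.rename(columns={"reference": "txn_id", "value": "amount_b"})
a["date"] = pd.to_datetime(a["date"], format="%d/%m/%Y")
b["date"] = pd.to_datetime(b["date"], format="%Y-%m-%d")

# An outer merge keeps records that exist in only one system;
# indicator=True adds a _merge column saying where each row came from.
merged = a.merge(b, on=["txn_id", "date"], how="outer", indicator=True)
print(merged)
```

Rows marked left_only or right_only in the _merge column are the ones that exist in a single system, which is usually where a reconciliation review starts.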

NumPy

NumPy handles numerical computation. It operates on arrays — ordered collections of numbers — and performs mathematical operations on them much faster than plain Python loops. Pandas is built on top of NumPy internally, so when you calculate a column average or a standard deviation in Pandas, NumPy is doing the actual maths underneath.

For most analytics work, you use NumPy indirectly through Pandas. It becomes essential when you are doing heavier numerical work — financial modelling, statistical simulations, or matrix operations.

Matplotlib and Seaborn

Matplotlib is the foundational plotting library. It creates line charts, bar charts, scatter plots, histograms, and every other standard chart type. It is powerful but verbose — producing a polished chart can take 15-20 lines of configuration code.

Seaborn sits on top of Matplotlib and provides higher-level functions that produce better-looking statistical charts with less code. If Matplotlib is the engine, Seaborn is the dashboard.

I use Matplotlib for custom charts where I need precise control over every axis label and annotation. I use Seaborn for quick exploratory charts when I am trying to understand a dataset before doing detailed analysis.

Openpyxl

Openpyxl reads and writes Excel files (.xlsx). This matters because in most organisations, the people who consume your analysis do not use Python. They use Excel. Openpyxl lets you generate formatted Excel reports programmatically — create sheets, set column widths, apply number formats, even add charts — and deliver output in the format your stakeholders actually open.

Requests

The Requests library handles HTTP communication, which in practice means pulling data from web APIs. Payment processors, financial data providers, government databases, and social media platforms all expose data through APIs. Requests lets you call those endpoints, receive the response (usually JSON), and feed it into Pandas for analysis.

In fintech and payments work specifically, I have used Requests to pull transaction status data from payment gateway APIs for reconciliation — comparing what the gateway reports against what our internal system recorded.
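The comparison step can be sketched without any network call. The payload below is a hypothetical stand-in for what a gateway might return (in real use it would come from something like requests.get(url).json()), and the field names id, status, and amount are assumptions, not any particular provider's schema.

```python
import json

# Hypothetical gateway response body (invented structure and values).
gateway_payload = json.loads("""
{"transactions": [
  {"id": "T001", "status": "SUCCESS", "amount": 1500.0},
  {"id": "T002", "status": "FAILED",  "amount": 250.0},
  {"id": "T003", "status": "SUCCESS", "amount": 980.0}
]}
""")

# Hypothetical internal ledger, keyed by transaction ID.
internal_records = {
    "T001": {"status": "SUCCESS", "amount": 1500.0},
    "T002": {"status": "SUCCESS", "amount": 250.0},
    "T004": {"status": "SUCCESS", "amount": 410.0},
}

gateway_records = {t["id"]: t for t in gateway_payload["transactions"]}

# Walk the union of IDs so records missing on either side are caught.
mismatches = []
for txn_id in sorted(gateway_records.keys() | internal_records.keys()):
    gw = gateway_records.get(txn_id)
    ours = internal_records.get(txn_id)
    if gw is None:
        mismatches.append((txn_id, "missing at gateway"))
    elif ours is None:
        mismatches.append((txn_id, "missing internally"))
    elif (gw["status"], gw["amount"]) != (ours["status"], ours["amount"]):
        mismatches.append((txn_id, "status or amount differs"))

print(mismatches)
```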

How Python Handles the Data Analytics Workflow

Data analytics is not a single task. It is a sequence: acquire data, clean it, explore it, analyse it, and communicate the findings. Python participates in every stage.

Data Acquisition

Data arrives from databases, CSV exports, Excel files, APIs, and web scraping. Python connects to all of these. The sqlite3 module talks to SQLite databases. The psycopg2 library connects to PostgreSQL. Pandas reads CSV, Excel, JSON, and Parquet files directly. Requests pulls from APIs. For web scraping, BeautifulSoup and Scrapy extract structured data from HTML pages.

Data Cleaning

This is where analysts spend the majority of their time, and where Python saves the most effort compared to manual spreadsheet work. Cleaning tasks include handling missing values, correcting data types (a column that should be numeric but was read as text), removing duplicates, standardising formats (dates in DD/MM/YYYY vs MM/DD/YYYY vs YYYY-MM-DD), and trimming whitespace.

Pandas provides direct functions for all of these:

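# Convert a text column to numeric; values that cannot be parsed become NaN
data["amount"] = pd.to_numeric(data["amount"], errors="coerce")

# Drop rows where 'amount' is missing or could not be parsed
# (this runs after the conversion so the NaNs it creates are caught too)
data = data.dropna(subset=["amount"])

# Parse dates, reading ambiguous values such as 03/04/2026 as day first
data["date"] = pd.to_datetime(data["date"], dayfirst=True)

# Remove duplicate rows
data = data.drop_duplicates()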

In compliance and KYC work, data cleaning is not optional — it is regulatory. You cannot run a sanctions screening check on a name field full of trailing spaces, inconsistent capitalisation, and encoding artefacts. Python standardises that data before it reaches the screening system.
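As a rough illustration of that standardisation step, the sketch below normalises a name field with the standard library only. The function normalise_name and the sample names are hypothetical, and this is only pre-cleaning, not a sanctions-matching method.

```python
import unicodedata

def normalise_name(raw: str) -> str:
    """Return a comparison-friendly form of a name field."""
    # Fold compatibility characters (for example non-breaking spaces) to plain forms.
    text = unicodedata.normalize("NFKC", raw)
    # Trim the ends and collapse internal runs of whitespace to one space.
    text = " ".join(text.split())
    # casefold() is a more aggressive lower() intended for comparisons.
    return text.casefold()

names = ["  Wanjiru  KAMAU ", "wanjiru kamau", "Wanjiru\u00a0Kamau"]
cleaned = [normalise_name(n) for n in names]
print(cleaned)
```

All three inputs collapse to the same string, so a downstream exact-match or fuzzy-match step sees one spelling instead of three.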

Analysis

Once data is clean, Pandas handles grouping, aggregation, and statistical summaries. SQL-like operations — GROUP BY, JOIN, WHERE — all have direct Pandas equivalents.

# Total transaction amount by region
regional_totals = data.groupby("region")["amount"].sum()

# Monthly trend
data["month"] = data["date"].dt.to_period("M")
monthly_trend = data.groupby("month")["amount"].sum()

Visualisation

Analysis without communication is incomplete. Matplotlib and Seaborn turn DataFrames into charts that tell a story:

import matplotlib.pyplot as plt

monthly_trend.plot(kind="bar", title="Monthly Transaction Volume")
plt.ylabel("Total Amount (KES)")
plt.tight_layout()
plt.savefig("monthly_trend.png")

For interactive dashboards, Python integrates with tools like Power BI (through scripted visuals or data preparation) and Looker Studio (through exported clean datasets).

IDEs — Where You Actually Write Python

An IDE, or Integrated Development Environment, is the application where you write, run, and debug your code. Choosing the right one matters because you will spend hours in it.

VS Code (Visual Studio Code)

VS Code is a free code editor made by Microsoft and built on an open-source codebase. It is not built exclusively for Python; it supports virtually every programming language through extensions. The Python extension for VS Code gives you syntax highlighting, error detection, code completion, and an integrated terminal to run scripts. VS Code also has a built-in Git interface, which matters once you start version-controlling your work.

I use VS Code for writing Python scripts, building Flask applications, and managing project files. It is lightweight, fast, and does not force a specific workflow on you.

Jupyter Notebook

Jupyter Notebook is a browser-based environment that lets you write and run Python code in cells, with the output displayed directly below each cell. This is ideal for exploratory data analysis because you can run a block of code, see the result (a table, a chart, a summary statistic), and then decide what to do next.

Jupyter is not great for building applications or scripts that need to run automatically. But for sitting down with a new dataset and figuring out what it contains, nothing beats it. Most data analytics courses use Jupyter Notebooks as the primary teaching environment.

PyCharm

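PyCharm is a full-featured Python IDE made by JetBrains. The Community Edition is free. It provides more tooling than VS Code out of the box, such as refactoring support and an integrated debugger. Some of the features most relevant to data work, including database integration, scientific mode, and the built-in profiler, have traditionally been part of the paid Professional tier rather than the free one. The tradeoff is that PyCharm is heavier and takes longer to load.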

PyCharm suits people who work in Python full-time and want deep tooling support. For analysts who split their time between Python, SQL, and BI tools, VS Code is usually the more practical choice.

DBeaver

DBeaver is not a Python IDE — it is a database management tool. I include it here because in data analytics, you rarely work in Python alone. You pull data from databases using SQL, then move to Python for cleaning and analysis. DBeaver connects to PostgreSQL, MySQL, SQLite, and dozens of other database engines, and gives you a visual interface to write queries, browse tables, and export results. It pairs with Python rather than replacing it.

Google Colab

Google Colab is essentially a Jupyter Notebook that runs in Google's cloud. You do not install anything on your machine. You open a browser, write Python code, and Google provides the computing resources. The free tier gives you access to GPUs, which matters for machine learning work but is overkill for standard analytics. Colab is useful for learning and for sharing notebooks with colleagues who do not have Python installed locally.

Real-World Applications

Python in data analytics is not theoretical. Here are contexts where it does actual operational work.

Financial reconciliation. Payment platforms process thousands of transactions daily. Discrepancies between what a customer paid, what the payment gateway recorded, and what settled in the bank account need to be identified and resolved. Python scripts pull data from each source, match records on transaction IDs, and flag mismatches — turning a day-long manual process into a 10-minute automated run.

Compliance reporting. Regulatory bodies require periodic reports on transaction volumes, suspicious activity patterns, and customer risk profiles. Python automates the data extraction, applies the classification rules, and generates the report in the required format. This is not a convenience — late or inaccurate compliance reports carry real penalties.

Agricultural operations. On a tea farm, daily plucking records need to be tracked, labour costs calculated, and production trends analysed. Python reads receipt data (even via OCR on handwritten slips), stores it in a database, and feeds it into dashboards. The same pattern applies to any agricultural operation that still runs on paper records.

Customer segmentation. E-commerce and financial service providers use Python to group customers by behaviour — transaction frequency, average spend, product preferences. Pandas and Scikit-learn handle the grouping logic, and the output informs marketing strategy and product design.

Why Beginners Should Start With Python

If you are starting out in data analytics, Python gives you more return on your learning investment than any other single tool. Here is why:

It is free and open source — no licence fees, no subscription, no approval from your IT department.

It has one of the largest support communities of any programming language. Whatever error you encounter, there is a good chance someone has already asked about it on Stack Overflow and received a detailed answer.

It transfers across industries. The Python skills you build doing financial analysis apply directly to healthcare analytics, logistics, agriculture, or any other domain. The syntax does not change because the industry changed.

It grows with you. Start with Pandas for basic analysis. Add Matplotlib for charts. Move to Scikit-learn for machine learning. Pick up PySpark for big data. The language is the same at every stage — you are just adding libraries.

And critically, it connects to everything. Databases, APIs, Excel files, cloud services, BI tools, web applications — Python has a library for each one. You never outgrow it because whatever new data source or output format your job throws at you, Python already has a way to handle it.

The barrier to starting is lower than it looks. Install Python, open VS Code, type import pandas as pd, and load your first CSV. The gap between that first line of code and a working automated report is shorter than most people expect.

SQL Subqueries vs CTEs: What I Reach For When Writing Compliance and Reconciliation Queries

Kahuthu Muriuki — Sun, 26 Apr 2026 07:29:09 +0000

When I started writing SQL against transaction data in fintech and securities brokerage environments, my queries were a mess. Three levels of nesting, the same aggregate computed twice, and a debugging process that involved commenting out chunks until something ran. The shift came when I stopped treating subqueries as the only way to compose logic and started picking the right structure for the job.

This article walks through:

what a subquery actually is
the four shapes a subquery can take
when a subquery is the correct choice
what a Common Table Expression (CTE) is
non-recursive and recursive CTEs, with examples drawn from real reporting work
a side-by-side comparison covering readability, debugging, and performance
the rule of thumb I now use to decide between the two

Most of the examples below are loosely modelled on the kinds of tables I work with day to day — member records in a SACCO, KYC verification logs, trade tickets, and transaction ledgers. The schemas are simplified for readability, but the patterns are the ones that show up in production reporting.

What Is a Subquery?

A subquery is a SELECT statement embedded inside another SQL statement. The inner query is evaluated, its result is handed to the outer query, and the outer query uses that result as a value, a row, a list, or a derived table.

Here is the kind of query I run when reviewing trade activity — pulling client orders that exceed the average ticket size for the week:

SELECT client_id, instrument_code, order_value
FROM trade_orders
WHERE trade_date BETWEEN '2026-04-13' AND '2026-04-19'
  AND order_value > (
      SELECT AVG(order_value)
      FROM trade_orders
      WHERE trade_date BETWEEN '2026-04-13' AND '2026-04-19'
  );

The inner query computes one number — the weekly average. The outer query uses that number as the comparison threshold. The subquery is doing exactly one job: producing a value for the WHERE clause.

That is the core idea. Where a subquery sits in the statement, and what it returns, determines what type it is.

The Four Types of Subqueries

1. Scalar Subquery

Returns a single value — one row, one column. You can drop it anywhere a single value is valid: a SELECT list, a WHERE comparison, a HAVING clause.

Example — showing each member's deposit balance alongside the SACCO-wide average:

SELECT
    member_id,
    deposit_balance,
    (SELECT AVG(deposit_balance) FROM member_accounts) AS scheme_average
FROM member_accounts;

I use scalar subqueries most often when I need a benchmark figure on every row of a report without joining a second table for it.

2. Single-Row Subquery

Returns one row with one or more columns. Useful when the comparison needs more than one value but the inner query is guaranteed to produce a single tuple.

Example — finding the trade ticket that matches the highest-value order of the day:

SELECT *
FROM trade_orders
WHERE (order_value, trade_date) = (
    SELECT MAX(order_value), CURRENT_DATE
    FROM trade_orders
    WHERE trade_date = CURRENT_DATE
);

The operators that work here are the standard comparison set: =, <>, >, <, >=, <=.

3. Multi-Row Subquery

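Returns a list of values: one column, many rows. Used with IN, NOT IN, ANY, or ALL. EXISTS is usually grouped here as well, although it only tests whether the inner query returns any row at all. Be careful with NOT IN: if the inner list contains a NULL, the predicate never evaluates to true and the outer query returns no rows.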

Example — pulling all KYC records that belong to clients flagged in a sanctions screening run:

SELECT kyc_id, full_name, id_number, verification_status
FROM kyc_records
WHERE client_id IN (
    SELECT client_id
    FROM sanctions_screening_hits
    WHERE hit_confidence >= 0.85
      AND screening_run_date = '2026-04-25'
);

This is the workhorse subquery in compliance work. You almost always have one set of identifiers and need to filter another table by them.

4. Correlated Subquery

The inner query references a column from the outer query, so it cannot be evaluated independently. Conceptually, it runs once per row of the outer query, although some optimisers can rewrite it into a join.

Example — finding members whose latest deposit is larger than their own twelve-month average:

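SELECT m.member_id, m.full_name, d.deposit_amount, d.deposit_date
FROM member_accounts m
JOIN deposits d ON d.member_id = m.member_id
WHERE d.deposit_date = (
    -- keep only each member's most recent deposit
    SELECT MAX(d2.deposit_date)
    FROM deposits d2
    WHERE d2.member_id = m.member_id
)
  AND d.deposit_amount > (
    -- compare it with that member's own twelve-month average
    SELECT AVG(d3.deposit_amount)
    FROM deposits d3
    WHERE d3.member_id = m.member_id
      AND d3.deposit_date >= CURRENT_DATE - INTERVAL '12 months'
);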

Correlated subqueries are powerful but expensive. On a deposits table with a few million rows, this pattern can take the query from sub-second to several minutes. I treat correlated subqueries as a last resort when I cannot express the logic with a window function or a CTE join.
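To make the window-function alternative concrete, here is a small sketch run against an in-memory SQLite database. The rows are invented, the twelve-month date filter is left out for brevity, and SQLite stands in for the production engine, so this only shows that the two forms return the same rows, not how either performs. Window functions need SQLite 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE deposits (member_id TEXT, deposit_amount REAL, deposit_date TEXT)")
conn.executemany("INSERT INTO deposits VALUES (?, ?, ?)", [
    ("M1", 100.0, "2026-01-10"),
    ("M1", 300.0, "2026-02-10"),
    ("M2", 500.0, "2026-01-15"),
    ("M2", 200.0, "2026-03-15"),
    ("M2", 800.0, "2026-04-15"),
])

# Correlated form: the inner average is tied to the outer row's member.
correlated_sql = """
SELECT d.member_id, d.deposit_amount
FROM deposits d
WHERE d.deposit_amount > (
    SELECT AVG(i.deposit_amount) FROM deposits i WHERE i.member_id = d.member_id)
ORDER BY d.member_id, d.deposit_amount
"""

# Window form: compute every member's average in one pass, then filter.
window_sql = """
SELECT member_id, deposit_amount
FROM (
    SELECT member_id, deposit_amount,
           AVG(deposit_amount) OVER (PARTITION BY member_id) AS member_avg
    FROM deposits
) AS scored
WHERE deposit_amount > member_avg
ORDER BY member_id, deposit_amount
"""

correlated_rows = conn.execute(correlated_sql).fetchall()
window_rows = conn.execute(window_sql).fetchall()
print(correlated_rows == window_rows, window_rows)
```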

When a Subquery Is the Right Choice

Subqueries earn their place when:

the logic is used once and only once in the statement
the inner result is a single scalar or a short list of identifiers
the query is short enough that the nesting does not hurt the reader
the subquery sits inside a WHERE, HAVING, or SELECT and feels like part of the predicate

For a quick filter — "trades above the daily average", "clients matching a screening list", "the latest record per group via MAX()" — a subquery is direct and the optimiser handles it well. The moment I find myself pasting the same subquery into two different parts of the same statement, or nesting subqueries three levels deep, I stop and rewrite the thing as a CTE.

What Is a CTE (Common Table Expression)?

A Common Table Expression is a named, temporary result set declared with the WITH keyword and referenced like a table within the statement that follows. The CTE exists only for the duration of that statement.

Here is the same "trades above the weekly average" query rewritten as a CTE:

WITH weekly_average AS (
    SELECT AVG(order_value) AS avg_order_value
    FROM trade_orders
    WHERE trade_date BETWEEN '2026-04-13' AND '2026-04-19'
)
SELECT t.client_id, t.instrument_code, t.order_value
FROM trade_orders t
CROSS JOIN weekly_average w
WHERE t.trade_date BETWEEN '2026-04-13' AND '2026-04-19'
  AND t.order_value > w.avg_order_value;

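Two things change. First, the average is named (weekly_average), so a reader can see what the intermediate step represents without parsing the inner query. Second, if I needed that same average a second time in the statement, I would reference the name instead of writing the aggregate out again. Whether the engine also avoids recomputing it depends on the database, which the performance note later in this article covers.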

CTEs are a structural tool. They turn a query into a sequence of named steps.
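A quick way to convince yourself that the two forms are equivalent is to run both against the same data. The sketch below uses an in-memory SQLite database with invented rows (the client and instrument codes are placeholders) and checks that the subquery and the CTE return the same result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trade_orders "
    "(client_id TEXT, instrument_code TEXT, order_value REAL, trade_date TEXT)"
)
conn.executemany("INSERT INTO trade_orders VALUES (?, ?, ?, ?)", [
    ("C1", "AAA", 100.0, "2026-04-13"),
    ("C2", "BBB", 300.0, "2026-04-15"),
    ("C3", "CCC", 500.0, "2026-04-17"),
    ("C4", "AAA", 900.0, "2026-04-30"),  # outside the week, must be ignored
])

subquery_sql = """
SELECT client_id, order_value FROM trade_orders
WHERE trade_date BETWEEN '2026-04-13' AND '2026-04-19'
  AND order_value > (
      SELECT AVG(order_value) FROM trade_orders
      WHERE trade_date BETWEEN '2026-04-13' AND '2026-04-19')
ORDER BY client_id
"""

cte_sql = """
WITH weekly_average AS (
    SELECT AVG(order_value) AS avg_order_value FROM trade_orders
    WHERE trade_date BETWEEN '2026-04-13' AND '2026-04-19')
SELECT t.client_id, t.order_value
FROM trade_orders t CROSS JOIN weekly_average w
WHERE t.trade_date BETWEEN '2026-04-13' AND '2026-04-19'
  AND t.order_value > w.avg_order_value
ORDER BY t.client_id
"""

subquery_rows = conn.execute(subquery_sql).fetchall()
cte_rows = conn.execute(cte_sql).fetchall()
print(subquery_rows, cte_rows)
```

With these rows the weekly average is 300, so only the 500 order qualifies in both versions.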

The Two Types of CTE

1. Non-Recursive CTE

The standard form. Used for breaking a complex query into stages, deduplicating logic, or replacing nested subqueries with named blocks.

Example — a reconciliation report that lines up daily transaction totals from the core ledger against the totals reported by a payment processor:

WITH ledger_totals AS (
    SELECT transaction_date,
           SUM(amount) AS ledger_amount,
           COUNT(*)    AS ledger_count
    FROM core_ledger
    WHERE transaction_date = '2026-04-25'
    GROUP BY transaction_date
),
processor_totals AS (
    SELECT settlement_date AS transaction_date,
           SUM(net_amount) AS processor_amount,
           COUNT(*)        AS processor_count
    FROM processor_settlements
    WHERE settlement_date = '2026-04-25'
    GROUP BY settlement_date
)
SELECT
    l.transaction_date,
    l.ledger_count,
    p.processor_count,
    l.ledger_count - p.processor_count           AS count_variance,
    l.ledger_amount,
    p.processor_amount,
    l.ledger_amount - p.processor_amount         AS amount_variance
FROM ledger_totals l
JOIN processor_totals p USING (transaction_date);

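Each CTE handles one source. The final SELECT does the comparison. If the variance comes out wrong, I know exactly which block to inspect: I can temporarily swap the final SELECT for SELECT * FROM ledger_totals and see what that step produced. One caveat: the inner JOIN returns nothing if either source has no rows for the date, so a FULL OUTER JOIN is safer when one side may be missing entirely.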

This is the structure I default to for any reporting query that touches more than two tables or involves more than one aggregation step.
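The block-by-block inspection can be sketched from Python by keeping the CTE definitions in one string and attaching different final SELECTs to it. The tables and amounts are invented, and SQLite is used only because it ships with Python.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE core_ledger (transaction_date TEXT, amount REAL);
CREATE TABLE processor_settlements (settlement_date TEXT, net_amount REAL);
INSERT INTO core_ledger VALUES
    ('2026-04-25', 100.0), ('2026-04-25', 250.0), ('2026-04-25', 50.0);
INSERT INTO processor_settlements VALUES
    ('2026-04-25', 100.0), ('2026-04-25', 250.0);
""")

# The named steps, written once.
ctes = """
WITH ledger_totals AS (
    SELECT transaction_date, SUM(amount) AS ledger_amount, COUNT(*) AS ledger_count
    FROM core_ledger GROUP BY transaction_date
),
processor_totals AS (
    SELECT settlement_date AS transaction_date,
           SUM(net_amount) AS processor_amount, COUNT(*) AS processor_count
    FROM processor_settlements GROUP BY settlement_date
)
"""

# Debugging view: look at one intermediate step on its own.
ledger_step = conn.execute(ctes + "SELECT * FROM ledger_totals").fetchall()

# Full report: same definitions, different final SELECT.
report = conn.execute(ctes + """
SELECT l.transaction_date,
       l.ledger_count - p.processor_count   AS count_variance,
       l.ledger_amount - p.processor_amount AS amount_variance
FROM ledger_totals l
JOIN processor_totals p ON p.transaction_date = l.transaction_date
""").fetchall()

print(ledger_step, report)
```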

2. Recursive CTE

A recursive CTE references itself. It is the standard tool for hierarchical data — anything that forms a tree or a chain.

Example — walking a SACCO's member referral chain. Each member has a referred_by column pointing at the member who introduced them. I want the full upline for a given member:

WITH RECURSIVE referral_chain AS (
    -- anchor: start with the member of interest
    SELECT member_id, full_name, referred_by, 0 AS depth
    FROM member_accounts
    WHERE member_id = 'M-104782'

    UNION ALL

    -- recursive step: walk up to each referrer
    SELECT m.member_id, m.full_name, m.referred_by, rc.depth + 1
    FROM member_accounts m
    JOIN referral_chain rc ON m.member_id = rc.referred_by
)
SELECT depth, member_id, full_name
FROM referral_chain
ORDER BY depth;

Other places I have used recursive CTEs:

expanding a chart of accounts where each account has a parent_account_id
tracing a chain of related transactions in a fraud review (transaction A funded transaction B funded transaction C)
reconstructing organisation hierarchies for branch-level reporting

Without recursion, this kind of traversal requires an unknown number of self-joins. With a recursive CTE, the depth is determined by the data.
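The referral walk can be tried end to end in SQLite, which also supports WITH RECURSIVE. The member rows below are invented, and the depth guard is an addition of this sketch rather than part of the query above: it stops the walk if bad data ever produces a referral loop.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE member_accounts "
    "(member_id TEXT PRIMARY KEY, full_name TEXT, referred_by TEXT)"
)
conn.executemany("INSERT INTO member_accounts VALUES (?, ?, ?)", [
    ("M-1", "Founder", None),
    ("M-2", "Second", "M-1"),
    ("M-3", "Third", "M-2"),
    ("M-4", "Unrelated", None),
])

upline = conn.execute("""
WITH RECURSIVE referral_chain AS (
    -- anchor: the member of interest
    SELECT member_id, full_name, referred_by, 0 AS depth
    FROM member_accounts
    WHERE member_id = ?

    UNION ALL

    -- recursive step: move up to the referrer of the previous row
    SELECT m.member_id, m.full_name, m.referred_by, rc.depth + 1
    FROM member_accounts m
    JOIN referral_chain rc ON m.member_id = rc.referred_by
    WHERE rc.depth < 50   -- guard (assumption): cap the walk in case of a loop
)
SELECT depth, member_id FROM referral_chain ORDER BY depth
""", ("M-3",)).fetchall()

print(upline)
```

The recursion ends on its own when it reaches a member whose referred_by is NULL, because the join then finds no further row.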

Subqueries vs CTEs: The Honest Comparison

Dimension	Subquery	CTE
Readability of complex logic	Drops off quickly past two levels	Stays linear regardless of depth
Reuse within the same statement	Must be rewritten each time	Defined once, referenced many times
Debugging	Must isolate by commenting out and rebuilding	Each block can be inspected independently
Recursion	Not supported	Supported with WITH RECURSIVE
Verbosity for simple filters	Compact	Adds boilerplate
Optimiser treatment	Often inlined and merged with the outer query	Varies by engine — see below

A Note on Performance

There is a popular claim that "CTEs are slower than subqueries". The truth is more specific than that, and it depends on the database engine.

PostgreSQL versions before 12 treated CTEs as an optimisation fence — the planner materialised the CTE and could not push predicates into it. That genuinely made some CTE queries slower than the equivalent subquery. From version 12 onward, non-recursive CTEs without side effects and referenced only once are inlined by default, behaving like subqueries. You can force the old behaviour with WITH ... AS MATERIALIZED or prevent it with AS NOT MATERIALIZED.
SQL Server has always inlined CTEs into the execution plan. A non-recursive CTE and the equivalent subquery typically produce identical plans.
MySQL added CTE support in 8.0 and treats them similarly to derived tables.
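BigQuery, Snowflake, and Redshift generally inline non-recursive CTEs as well, though a CTE referenced several times may be evaluated once per reference or materialised, depending on the engine and version.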

The practical implication: on any modern engine, choosing a CTE over a subquery for readability does not cost you anything measurable in most cases. The exception is correlated subqueries against very large tables, which are slow regardless of how you wrap them — the fix there is usually a window function or a join to a pre-aggregated CTE, not a different syntactic form.

When I need to know for certain, I run EXPLAIN ANALYZE (PostgreSQL) or check the actual execution plan (SQL Server). Guesses about performance are worth less than the plan in front of you.
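The same habit carries over to other engines. As a small sketch, SQLite offers EXPLAIN QUERY PLAN, a lighter relative of EXPLAIN ANALYZE that reports the chosen plan without executing or timing the query. The table and the index name idx_trade_date are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trade_orders (client_id TEXT, order_value REAL, trade_date TEXT)")
conn.execute("CREATE INDEX idx_trade_date ON trade_orders (trade_date)")

plan = conn.execute("""
EXPLAIN QUERY PLAN
SELECT client_id, order_value
FROM trade_orders
WHERE trade_date = '2026-04-25'
""").fetchall()

# Each row describes one plan step; the last column is the readable description.
for row in plan:
    print(row[-1])
```

The wording of the plan text differs between SQLite versions, so it is better read by a person than parsed by a script.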

How I Decide Between the Two

My working rule, after a few years of writing this kind of code:

Reach for a subquery when the logic is one short step, used once, and reads naturally in place. A scalar in the SELECT list, a list of IDs in an IN clause, an EXISTS check.

Reach for a CTE when any of these are true:

the same intermediate result is needed more than once
the query has more than two logical stages (filter → aggregate → compare, for example)
the logic is hierarchical and needs recursion
a colleague will read this query in three months, and I want them to understand it without a debugging session

The subquery and the CTE solve overlapping problems. The choice is rarely about performance on modern engines — it is about whether the structure of your code matches the structure of the problem you are solving.

In compliance and reconciliation work, where queries get audited and revisited, I default to CTEs the moment a query crosses two stages. The five extra lines of WITH boilerplate pay for themselves the first time someone — often me — has to come back and explain what the query does.

SQL Functions You Will Actually Use in Data Work

Kahuthu Muriuki — Sat, 18 Apr 2026 22:21:28 +0000

Most SQL tutorials stop at SELECT, WHERE, and GROUP BY. That covers retrieval, but it does not cover the layer of work that happens between raw data and a meaningful result. In financial data environments — transaction records, reconciliation tables, KYC logs — the real analytical work depends on functions and operators that transform, filter, combine, and rank data before it becomes useful.

This article covers six categories of SQL functionality that come up repeatedly in practice: row-level functions, date and time handling, string manipulation, joins, window functions, and set operators. Each section includes syntax and examples grounded in the kind of data you encounter in financial and operational contexts.

1. Row-Level Functions

Row-level functions operate on individual records one at a time. They do not aggregate — they transform or evaluate each row in isolation.

The most commonly used ones fall into three groups: conditional logic, null handling, and type conversion.

Conditional Logic — CASE

SELECT
  transaction_id,
  amount,
  CASE
    WHEN amount >= 100000 THEN 'High Value'
    WHEN amount >= 10000  THEN 'Mid Range'
    ELSE 'Standard'
  END AS transaction_tier
FROM transactions;

CASE evaluates each row against a set of conditions and returns the first match. It works anywhere in a query — SELECT, WHERE, ORDER BY, and inside aggregate functions.
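CASE inside an aggregate is the least obvious of those placements, so here is a small sketch of it: counting transactions per tier in a single pass. The amounts are invented and the query runs against an in-memory SQLite database purely for demonstration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (transaction_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [(1, 250000.0), (2, 15000.0), (3, 9000.0), (4, 100000.0), (5, 500.0)],
)

# Each SUM adds 1 for rows in its tier and 0 otherwise.
tier_counts = conn.execute("""
SELECT
  SUM(CASE WHEN amount >= 100000 THEN 1 ELSE 0 END)                    AS high_value,
  SUM(CASE WHEN amount >= 10000 AND amount < 100000 THEN 1 ELSE 0 END) AS mid_range,
  SUM(CASE WHEN amount < 10000 THEN 1 ELSE 0 END)                      AS standard
FROM transactions
""").fetchone()

print(tier_counts)
```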

Null Handling — COALESCE and ISNULL

SELECT
  customer_id,
  COALESCE(phone_number, email, 'No contact on file') AS contact_detail
FROM customers;

COALESCE returns the first non-null value from a list. This is useful when records have multiple optional fields and you need to surface whichever one is populated. ISNULL (SQL Server) or IFNULL (MySQL) handles the simpler two-value version.
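One thing worth checking in real data is whether "missing" values are truly NULL or just empty strings, because COALESCE only skips NULL. The sketch below uses invented rows in SQLite and shows a common workaround with NULLIF. Treating blank strings as missing is an assumption of this example, not a rule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, phone_number TEXT, email TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", [
    (1, "0700000001", "a@example.com"),
    (2, None, "b@example.com"),
    (3, None, None),
    (4, "", "d@example.com"),  # empty string, not NULL
])

rows = conn.execute("""
SELECT customer_id,
       COALESCE(phone_number, email, 'No contact on file') AS plain,
       -- NULLIF turns a blank phone number into NULL so COALESCE moves on
       COALESCE(NULLIF(TRIM(phone_number), ''), email, 'No contact on file') AS blank_aware
FROM customers
ORDER BY customer_id
""").fetchall()

for row in rows:
    print(row)
```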

Type Conversion — CAST and CONVERT

SELECT
  CAST(account_balance AS DECIMAL(15, 2)) AS balance,
  CAST(transaction_date AS DATE) AS txn_date
FROM accounts;

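Data pulled from flat files or external systems often arrives with the wrong data type. CAST forces a column into the type you need before filtering or calculation. CONVERT does the same job with engine-specific syntax; in SQL Server it is written CONVERT(data_type, expression) and accepts an optional style argument for date and number formatting.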

2. Date and Time Functions

Date logic is one of the areas where SQL gets used most heavily in financial and operational data work. Reporting periods, transaction timestamps, ageing calculations, and SLA tracking all depend on correct date handling.

Getting the Current Date

SELECT GETDATE();        -- SQL Server: returns current date and time
SELECT CURRENT_DATE;     -- Standard SQL / PostgreSQL: returns date only
SELECT NOW();            -- MySQL / PostgreSQL: returns date and time

Extracting Parts of a Date

-- SQL Server / MySQL
SELECT
  transaction_id,
  YEAR(transaction_date)  AS txn_year,
  MONTH(transaction_date) AS txn_month,
  DAY(transaction_date)   AS txn_day
FROM transactions;

In PostgreSQL, use EXTRACT:

SELECT EXTRACT(MONTH FROM transaction_date) AS txn_month
FROM transactions;

Calculating the Difference Between Dates

-- SQL Server
SELECT
  loan_id,
  DATEDIFF(DAY, disbursement_date, repayment_date) AS days_to_repay
FROM loan_records;

-- PostgreSQL (subtracting two DATE columns returns an integer number of days)
SELECT
  loan_id,
  repayment_date - disbursement_date AS days_to_repay
FROM loan_records;

Adding or Subtracting Time

-- SQL Server
SELECT DATEADD(MONTH, 3, GETDATE()) AS quarter_ahead;

-- PostgreSQL
SELECT CURRENT_DATE + INTERVAL '3 months' AS quarter_ahead;

Formatting Dates for Display

-- SQL Server
SELECT FORMAT(transaction_date, 'dd-MMM-yyyy') AS formatted_date
FROM transactions;

-- MySQL
SELECT DATE_FORMAT(transaction_date, '%d-%b-%Y') AS formatted_date
FROM transactions;

Date formatting matters when reports are consumed by non-technical audiences who expect dates in a specific regional format.
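Date syntax differs again in SQLite, which makes it a handy reminder to check the dialect before copying a snippet. The sketch below repeats the difference, add, and extract operations using SQLite's julianday, date, and strftime functions on one invented loan row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE loan_records "
    "(loan_id INTEGER, disbursement_date TEXT, repayment_date TEXT)"
)
conn.execute("INSERT INTO loan_records VALUES (1, '2026-01-15', '2026-04-18')")

loan_id, days_to_repay, review_date, repay_month = conn.execute("""
SELECT loan_id,
       -- difference in days: subtract Julian day numbers
       CAST(julianday(repayment_date) - julianday(disbursement_date) AS INTEGER),
       -- add three months with a date modifier
       date(disbursement_date, '+3 months'),
       -- extract the month as a two-character string
       strftime('%m', repayment_date)
FROM loan_records
""").fetchone()

print(loan_id, days_to_repay, review_date, repay_month)
```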

3. String Functions

String functions clean, reshape, and extract text data. In practice, this comes up constantly — member names with inconsistent casing, account numbers with leading spaces, reference codes that need to be split or concatenated.

UPPER, LOWER, and TRIM

SELECT
  UPPER(first_name)  AS first_name,
  LOWER(email)       AS email,
  TRIM(account_ref)  AS account_ref
FROM members;

TRIM removes leading and trailing spaces. LTRIM and RTRIM handle one side at a time.

LEN and SUBSTRING

SELECT
  account_number,
  LEN(account_number) AS char_count,
  SUBSTRING(account_number, 1, 4) AS account_prefix
FROM accounts;

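SUBSTRING(column, start, length) extracts a portion of a string. MySQL accepts SUBSTR as a synonym, and Oracle provides only SUBSTR. LEN is specific to SQL Server; PostgreSQL and MySQL use LENGTH or CHAR_LENGTH instead. Extracting substrings is useful for parsing structured codes, such as product categories embedded in reference numbers, branch identifiers in account strings, and similar patterns.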

REPLACE and CHARINDEX

-- Remove hyphens from ID numbers
SELECT REPLACE(id_number, '-', '') AS clean_id
FROM kyc_records;

-- Find position of a character (SQL Server; MySQL uses LOCATE, PostgreSQL uses POSITION)
SELECT CHARINDEX('@', email) AS at_position
FROM customers;

Concatenation

-- SQL Server / MySQL
SELECT CONCAT(first_name, ' ', last_name) AS full_name
FROM members;

-- SQL Server traditionally uses + (|| exists only in some recent versions); MySQL treats || as logical OR by default
-- PostgreSQL uses ||
SELECT first_name || ' ' || last_name AS full_name
FROM members;

LIKE for Pattern Matching

SELECT *
FROM transactions
WHERE reference_code LIKE 'TXN-%';

The % wildcard matches any sequence of characters. _ matches a single character. These are used heavily in compliance work where you are scanning transaction references or flagging records that match a particular naming pattern.

4. JOINs

JOINs combine rows from two or more tables based on a related column. Understanding which join type to use determines whether you get matched records, all records from one side, or everything from both sides.

INNER JOIN

Returns only rows where the condition is met in both tables.

SELECT
  c.customer_id,
  c.full_name,
  t.transaction_id,
  t.amount
FROM customers c
INNER JOIN transactions t ON c.customer_id = t.customer_id;

This returns customers who have at least one transaction. Customers with no transaction history do not appear.

LEFT JOIN

Returns all rows from the left table, and matching rows from the right. Where there is no match, columns from the right table return NULL.

SELECT
  c.customer_id,
  c.full_name,
  t.transaction_id,
  t.amount
FROM customers c
LEFT JOIN transactions t ON c.customer_id = t.customer_id;

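Use this when you need to identify records with no match: customers who have never transacted, accounts with no KYC record, loans with no repayment entries. Adding WHERE t.transaction_id IS NULL to the query above keeps only the customers with no transactions.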
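A minimal sketch of the unmatched-records pattern, run through Python's built-in sqlite3 module so it can be tried without a database server. The sample rows are invented for illustration; the table and column names follow the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, full_name TEXT);
    CREATE TABLE transactions (
        transaction_id INTEGER PRIMARY KEY,
        customer_id INTEGER,
        amount REAL
    );
    INSERT INTO customers VALUES
        (1, 'Amina Wanjiku'), (2, 'Brian Ochieng'), (3, 'Cynthia Mutua');
    INSERT INTO transactions VALUES
        (101, 1, 2500.0), (102, 1, 800.0), (103, 3, 12000.0);
""")

# LEFT JOIN keeps every customer; the IS NULL filter then keeps only
# the customers for whom no transaction row was found.
never_transacted = conn.execute("""
    SELECT c.customer_id, c.full_name
    FROM customers c
    LEFT JOIN transactions t ON c.customer_id = t.customer_id
    WHERE t.transaction_id IS NULL
""").fetchall()

print(never_transacted)
```

The filter is applied to a right-table column that can never be NULL in a genuine match (here the primary key), which is what makes the test reliable.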

RIGHT JOIN

The mirror of LEFT JOIN. Returns all rows from the right table.

SELECT
  c.customer_id,
  t.transaction_id,
  t.amount
FROM customers c
RIGHT JOIN transactions t ON c.customer_id = t.customer_id;

In practice, RIGHT JOIN is less common because you can always rewrite it as a LEFT JOIN by swapping the table order.

FULL OUTER JOIN

Returns all rows from both tables. Where there is no match on either side, NULLs fill the gaps.

SELECT
  c.customer_id,
  c.full_name,
  t.transaction_id
FROM customers c
FULL OUTER JOIN transactions t ON c.customer_id = t.customer_id;

Useful for reconciliation queries where you need to see both unmatched customers and unmatched transactions in one result set.

CROSS JOIN

Produces a Cartesian product — every row in the first table paired with every row in the second.

SELECT
  p.product_name,
  r.region_name
FROM products p
CROSS JOIN regions r;

This is not something you use for retrieval in most contexts, but it is practical for generating combinations — pairing every product with every region to pre-populate a reporting matrix, for example.
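As a rough illustration of the reporting-matrix idea, the sketch below builds every product and region pairing with sqlite3. The product and region values are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (product_name TEXT);
    CREATE TABLE regions (region_name TEXT);
    INSERT INTO products VALUES ('Savings'), ('Loan');
    INSERT INTO regions VALUES ('Nairobi'), ('Coast'), ('Western');
""")

# 2 products x 3 regions: the result has one row per combination.
matrix = conn.execute("""
    SELECT p.product_name, r.region_name
    FROM products p
    CROSS JOIN regions r
    ORDER BY p.product_name, r.region_name
""").fetchall()

for product_name, region_name in matrix:
    print(product_name, region_name)
```

The row count is always the product of the two table sizes, which is why a CROSS JOIN on large tables needs care.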

5. Window Functions

Window functions perform calculations across a set of rows that are related to the current row, without collapsing the result into a single aggregate value. The row count in your output stays the same — which is what makes window functions different from GROUP BY.

The syntax always includes OVER(), which defines the window.

ROW_NUMBER

Assigns a unique sequential number to each row within a partition.

SELECT
  customer_id,
  transaction_date,
  amount,
  ROW_NUMBER() OVER (PARTITION BY customer_id ORDER BY transaction_date DESC) AS txn_rank
FROM transactions;

Where txn_rank = 1, you have each customer's most recent transaction. This is a common pattern for pulling the latest record per entity.
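To actually keep only the txn_rank = 1 rows, the ranking has to sit in a CTE or subquery, because the alias is not visible to WHERE in the same SELECT. A small sketch with sqlite3 follows; it assumes SQLite 3.25 or later for window function support, and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (
        transaction_id INTEGER PRIMARY KEY,
        customer_id INTEGER,
        transaction_date TEXT,
        amount REAL
    );
    INSERT INTO transactions VALUES
        (1, 10, '2024-01-05', 100.0),
        (2, 10, '2024-02-10', 250.0),
        (3, 20, '2024-01-20', 75.0),
        (4, 20, '2024-03-01', 40.0),
        (5, 20, '2024-02-14', 60.0);
""")

# Rank inside the CTE, then filter on the rank in the outer query.
latest_per_customer = conn.execute("""
    WITH ranked AS (
        SELECT
            customer_id,
            transaction_date,
            amount,
            ROW_NUMBER() OVER (
                PARTITION BY customer_id
                ORDER BY transaction_date DESC
            ) AS txn_rank
        FROM transactions
    )
    SELECT customer_id, transaction_date, amount
    FROM ranked
    WHERE txn_rank = 1
    ORDER BY customer_id
""").fetchall()

print(latest_per_customer)
```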

RANK and DENSE_RANK

SELECT
  agent_id,
  total_collections,
  RANK()       OVER (ORDER BY total_collections DESC) AS rank_with_gaps,
  DENSE_RANK() OVER (ORDER BY total_collections DESC) AS rank_no_gaps
FROM agent_performance;

RANK leaves gaps after tied positions (1, 2, 2, 4). DENSE_RANK does not (1, 2, 2, 3). The choice depends on whether the gaps matter for how results are consumed.
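The difference is easiest to see on a small tie. The sketch below uses sqlite3 (SQLite 3.25 or later assumed) with four invented agents, two of whom share the same total.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE agent_performance (agent_id TEXT, total_collections INTEGER);
    INSERT INTO agent_performance VALUES
        ('A1', 900), ('A2', 700), ('A3', 700), ('A4', 500);
""")

rows = conn.execute("""
    SELECT
        agent_id,
        RANK()       OVER (ORDER BY total_collections DESC) AS rank_with_gaps,
        DENSE_RANK() OVER (ORDER BY total_collections DESC) AS rank_no_gaps
    FROM agent_performance
    ORDER BY total_collections DESC, agent_id
""").fetchall()

# Split the two ranking columns out so they can be compared side by side.
rank_with_gaps = [r[1] for r in rows]
rank_no_gaps = [r[2] for r in rows]
print(rank_with_gaps, rank_no_gaps)
```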

SUM, AVG, and COUNT as Window Functions

SELECT
  transaction_id,
  customer_id,
  amount,
  SUM(amount) OVER (PARTITION BY customer_id) AS customer_total,
  AVG(amount) OVER (PARTITION BY customer_id) AS customer_avg
FROM transactions;

This keeps every transaction row visible while also showing the customer-level total and average alongside each record — something GROUP BY cannot do without a subquery.

Running Totals

SELECT
  transaction_date,
  amount,
  SUM(amount) OVER (ORDER BY transaction_date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM transactions;

Running totals built this way are cleaner than self-joins and easier to read in a query review.
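A quick way to check a running total is to run it on three rows where the answer is obvious. This sketch uses sqlite3 (SQLite 3.25 or later assumed) and invented amounts, inserted out of date order on purpose.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (transaction_date TEXT, amount INTEGER);
    INSERT INTO transactions VALUES
        ('2024-01-03', 25), ('2024-01-01', 100), ('2024-01-02', 50);
""")

# The window's ORDER BY decides the accumulation order,
# regardless of the order in which rows were inserted.
running = conn.execute("""
    SELECT
        transaction_date,
        amount,
        SUM(amount) OVER (
            ORDER BY transaction_date
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS running_total
    FROM transactions
    ORDER BY transaction_date
""").fetchall()

for row in running:
    print(row)
```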

LAG and LEAD

SELECT
  transaction_date,
  amount,
  LAG(amount, 1)  OVER (ORDER BY transaction_date) AS previous_txn_amount,
  LEAD(amount, 1) OVER (ORDER BY transaction_date) AS next_txn_amount
FROM transactions;

LAG looks back at the previous row. LEAD looks ahead. Both are useful for period-over-period comparisons without needing a self-join.
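As one possible period-over-period calculation, the sketch below subtracts the previous row's amount from the current one using LAG. It runs on sqlite3 (SQLite 3.25 or later assumed) with invented figures; the change_vs_previous column name is only illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (transaction_date TEXT, amount INTEGER);
    INSERT INTO transactions VALUES
        ('2024-01-01', 100), ('2024-01-02', 50), ('2024-01-03', 25);
""")

# The first row has no predecessor, so LAG returns NULL and the
# difference is NULL as well (None in Python).
changes = conn.execute("""
    SELECT
        transaction_date,
        amount,
        amount - LAG(amount, 1) OVER (ORDER BY transaction_date) AS change_vs_previous
    FROM transactions
    ORDER BY transaction_date
""").fetchall()

for row in changes:
    print(row)
```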

6. SET Operators

SET operators combine the results of two or more SELECT statements. They operate on result sets rather than individual tables, which makes them different from JOINs.

UNION

Combines two result sets and removes duplicate rows.

SELECT customer_id, full_name FROM retail_customers
UNION
SELECT customer_id, full_name FROM business_customers;

Both SELECT statements must return the same number of columns in the same order, with compatible data types.

UNION ALL

Same as UNION but keeps duplicates. Runs faster because there is no deduplication step.

SELECT transaction_id, amount, 'Q1' AS quarter FROM transactions_q1
UNION ALL
SELECT transaction_id, amount, 'Q2' AS quarter FROM transactions_q2;

Use UNION ALL when you are certain duplicates are not an issue, or when the source tables are structured to avoid them — combining quarterly partitions into a full-year view, for example.
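To make the duplicate handling concrete, this sketch runs both operators over two small tables that share one identical row. It uses sqlite3 and invented customers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE retail_customers (customer_id INTEGER, full_name TEXT);
    CREATE TABLE business_customers (customer_id INTEGER, full_name TEXT);
    INSERT INTO retail_customers VALUES (1, 'Amina Wanjiku'), (2, 'Brian Ochieng');
    INSERT INTO business_customers VALUES (1, 'Amina Wanjiku'), (4, 'David Kamau');
""")

# UNION drops the row that appears in both tables.
union_rows = conn.execute("""
    SELECT customer_id, full_name FROM retail_customers
    UNION
    SELECT customer_id, full_name FROM business_customers
    ORDER BY customer_id
""").fetchall()

# UNION ALL keeps it, so the shared customer appears twice.
union_all_rows = conn.execute("""
    SELECT customer_id, full_name FROM retail_customers
    UNION ALL
    SELECT customer_id, full_name FROM business_customers
    ORDER BY customer_id
""").fetchall()

print(len(union_rows), len(union_all_rows))
```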

INTERSECT

Returns only rows that appear in both result sets.

SELECT customer_id FROM savings_accounts
INTERSECT
SELECT customer_id FROM loan_accounts;

This identifies customers who hold both products — useful for cross-sell analysis or eligibility filtering.

EXCEPT (or MINUS in Oracle)

Returns rows from the first result set that do not appear in the second.

SELECT customer_id FROM savings_accounts
EXCEPT
SELECT customer_id FROM loan_accounts;

This surfaces savings account holders who do not have a loan — a segment you might target for a lending product campaign, or flag for a financial inclusion review.
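INTERSECT and EXCEPT can be compared side by side on the same two tables. The sketch below uses sqlite3, which supports both operators; the customer IDs are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE savings_accounts (customer_id INTEGER);
    CREATE TABLE loan_accounts (customer_id INTEGER);
    INSERT INTO savings_accounts VALUES (1), (2), (3);
    INSERT INTO loan_accounts VALUES (2), (3), (4);
""")

# Customers holding both products.
both_products = [row[0] for row in conn.execute("""
    SELECT customer_id FROM savings_accounts
    INTERSECT
    SELECT customer_id FROM loan_accounts
    ORDER BY customer_id
""")]

# Savings customers with no loan. Order matters: swapping the two
# SELECTs would return loan customers with no savings account instead.
savings_only = [row[0] for row in conn.execute("""
    SELECT customer_id FROM savings_accounts
    EXCEPT
    SELECT customer_id FROM loan_accounts
    ORDER BY customer_id
""")]

print(both_products, savings_only)
```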

Key Points to Take Away

These six categories cover most of the transformation work that sits between a raw database and a finished analysis. A few things worth noting as you work with them:

Row-level functions and aggregates work at different layers. CASE and COALESCE operate on each row individually. SUM and AVG collapse rows. Mixing them requires understanding whether your logic belongs in SELECT, WHERE, or HAVING.

Date functions are not portable across databases. GETDATE() is SQL Server. NOW() is MySQL/PostgreSQL. CURRENT_DATE is ANSI standard. If your queries move between environments, this is where they will break first.

Window functions do not filter rows — they add columns. If you want to use a window function result as a filter condition, wrap it in a subquery or CTE. You cannot reference a window function alias directly in a WHERE clause.

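JOIN type choice affects row count. An INNER JOIN on a one-to-many relationship repeats each row from the "one" side once per match, so any sum over that side's columns is inflated. A LEFT JOIN keeps every left-table row, including those with no match, and fills the right-table columns with NULL. Test against a small known dataset before running against production volumes.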

UNION vs UNION ALL is a performance decision as much as a logic one. Deduplication has a cost. Where duplicates are structurally impossible — different source tables, different time periods — UNION ALL is the right default.

Understanding how these functions interact gives you the control to move from data retrieval into actual data work.
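The row-count point about JOINs is worth seeing once with numbers. In the sketch below, a hypothetical balance column on customers is summed before and after joining to transactions; the table contents are invented, and sqlite3 is used only so the example runs anywhere.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, balance INTEGER);
    CREATE TABLE transactions (transaction_id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 1000), (2, 500);
    INSERT INTO transactions VALUES (101, 1), (102, 1), (103, 1), (104, 2);
""")

# Summing straight from the customers table: one row per customer.
true_total = conn.execute("SELECT SUM(balance) FROM customers").fetchone()[0]

# After the join, customer 1 appears three times (once per transaction),
# so that balance is counted three times.
joined_total = conn.execute("""
    SELECT SUM(c.balance)
    FROM customers c
    INNER JOIN transactions t ON c.customer_id = t.customer_id
""").fetchone()[0]

print(true_total, joined_total)
```

Comparing a known total before and after a join is a cheap check to run on a small dataset before trusting the query at production volume.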

SQL Fundamentals: DDL, DML, and Practical Data Manipulation

Kahuthu Muriuki — Tue, 14 Apr 2026 13:59:29 +0000

I have spent a good portion of my career dealing with structured data — member records at SACCOs, trade logs at a securities brokerage, and KYC verification tables on payments platforms. What all of these have in common is that they sit in relational databases, and the language you use to build, fill, query, and maintain those databases is SQL.

This week's assignment brought me back to the basics: DDL, DML, filtering, and conditional logic. Here is what I worked through and how it connects to the kind of data work I do day to day.

DDL and DML — Two Sides of the Same Coin

SQL commands fall into categories depending on what they act on. The two that matter most when you are starting out are DDL and DML.

DDL (Data Definition Language) is about structure. It deals with the skeleton of your database — tables, columns, data types, constraints. The main DDL commands are CREATE, ALTER, and DROP. When you run a DDL statement, you are changing what the database looks like, not what is stored inside it. Think of it as drawing up the blueprint for a filing cabinet before you start stuffing folders into it.

DML (Data Manipulation Language) is about the actual records. Once DDL has set up the structure, DML is how you put data in, pull data out, change it, or remove it. The core DML commands are INSERT, UPDATE, DELETE, and SELECT.

The distinction matters because mixing them up causes real problems. I have seen junior analysts on a team attempt to INSERT into a table that did not exist yet — they skipped the DDL step entirely. Structure first, data second. That order never changes.

How I Used CREATE, INSERT, UPDATE, and DELETE

CREATE

For the assignment, I started by defining the tables I would need. In my case, I set up a members table, a loan_products table, and a transactions table — modelled loosely on SACCO operations I have worked with before.

CREATE TABLE members (
    member_id INT PRIMARY KEY,
    first_name VARCHAR(50),
    last_name VARCHAR(50),
    gender CHAR(1),
    date_of_birth DATE,
    branch VARCHAR(50),
    kyc_status VARCHAR(20),
    date_joined DATE
);

Every column has a defined data type, and member_id is set as the primary key so each record is uniquely identifiable. This is the same logic I followed when modelling dimension tables in Power BI — the grain of the table has to be clear from the start.

INSERT

With the table in place, I populated it with records.

INSERT INTO members (member_id, first_name, last_name, gender, date_of_birth, branch, kyc_status, date_joined)
VALUES
(1, 'Amina', 'Wanjiku', 'F', '1990-03-12', 'Westlands', 'Verified', '2019-06-15'),
(2, 'Brian', 'Ochieng', 'M', '1987-07-25', 'Mombasa CBD', 'Verified', '2018-01-10'),
(3, 'Cynthia', 'Mutua', 'F', '1992-11-05', 'Kisumu Town', 'Pending', '2021-04-22'),
(4, 'David', 'Kamau', 'M', '1985-02-18', 'Westlands', 'Verified', '2017-09-03'),
(5, 'Esther', 'Akinyi', 'F', '1995-06-30', 'Nakuru East', 'Rejected', '2022-08-11'),
(6, 'Felix', 'Otieno', 'M', '1988-09-14', 'Eldoret Central', 'Verified', '2020-02-28');

In a production environment I would be loading this from a CSV or an ETL pipeline, but the logic is the same — each INSERT statement maps values to the columns defined in the CREATE step.

UPDATE

Records change. A member moves branches, a KYC status gets resolved, a transaction amount needs correction. UPDATE handles that.

UPDATE members
SET branch = 'Kilimani', kyc_status = 'Verified'
WHERE member_id = 3;

This corrects Cynthia's branch and moves her KYC status from 'Pending' to 'Verified'. The WHERE clause is critical here — run an UPDATE without it and you overwrite every row in the table. I have seen that happen on a live SACCO database. It is not a mistake you make twice.
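One way to guard against the missing-WHERE mistake is to run the statement inside a transaction and commit only if the number of affected rows is what you expected. The sketch below shows that idea with sqlite3; guarded_update is a hypothetical helper written for this example, not a library function, and the members table is a cut-down version of the one above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE members (member_id INTEGER PRIMARY KEY, branch TEXT, kyc_status TEXT)"
)
conn.executemany(
    "INSERT INTO members VALUES (?, ?, ?)",
    [
        (3, "Kisumu Town", "Pending"),
        (4, "Westlands", "Verified"),
        (6, "Eldoret Central", "Verified"),
    ],
)
conn.commit()


def guarded_update(conn, sql, params, expected_rows):
    """Run an UPDATE and commit only if it touched the expected number of rows."""
    cur = conn.execute(sql, params)
    if cur.rowcount != expected_rows:
        conn.rollback()  # undo the change instead of keeping it
        return False
    conn.commit()
    return True


# Targeted update: exactly one row expected, so it is committed.
applied = guarded_update(
    conn,
    "UPDATE members SET branch = 'Kilimani', kyc_status = 'Verified' WHERE member_id = ?",
    (3,),
    expected_rows=1,
)

# Same update with the WHERE clause forgotten: three rows are hit, so it is rolled back.
applied_unfiltered = guarded_update(
    conn, "UPDATE members SET branch = 'Kilimani'", (), expected_rows=1
)

branches = dict(conn.execute("SELECT member_id, branch FROM members"))
print(applied, applied_unfiltered, branches)
```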

DELETE

Sometimes records need to go. A duplicate entry, a test row left in production, or a member account that was created in error.

DELETE FROM members
WHERE member_id = 5;

Same rule applies: always use WHERE with DELETE. Omitting it wipes the entire table clean.

Filtering with WHERE

The WHERE clause is how you narrow down results to only the rows that matter. Without it, every SELECT, UPDATE, and DELETE hits the full table.

Here are the operators I used most in the assignment:

Equality (=) — straightforward exact match.

SELECT * FROM members WHERE branch = 'Westlands';

This returns only members registered at the Westlands branch.

Greater than (>) — useful for numerical or date comparisons.

SELECT * FROM transactions WHERE amount > 50000;

In a SACCO context, this pulls transactions above KES 50,000 — the kind of threshold that triggers additional AML checks.

BETWEEN — filters within a range, inclusive on both ends.

SELECT * FROM transactions
WHERE transaction_date BETWEEN '2024-01-01' AND '2024-03-31';

I used this to isolate Q1 2024 transactions. Date range filtering comes up constantly in financial reporting.
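One caveat with date ranges: when the column also stores a time of day, an upper bound of '2024-03-31' means the very start of that day, so later transactions on 31 March fall outside the range. A half-open range (on or after the start, strictly before the day after the end) avoids this. The sketch below shows the difference with sqlite3. SQLite compares these values as text, but the outcome mirrors what a DATETIME or TIMESTAMP column does in other databases; the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE transactions (transaction_id INTEGER PRIMARY KEY, transaction_date TEXT);
    INSERT INTO transactions VALUES
        (1, '2024-01-01 00:00:00'),
        (2, '2024-03-31 14:30:00'),
        (3, '2024-04-01 09:00:00');
""")

# BETWEEN with a date-only upper bound misses the afternoon of 31 March.
between_ids = [row[0] for row in conn.execute("""
    SELECT transaction_id FROM transactions
    WHERE transaction_date BETWEEN '2024-01-01' AND '2024-03-31'
    ORDER BY transaction_id
""")]

# Half-open range: everything from 1 January up to, but not including, 1 April.
half_open_ids = [row[0] for row in conn.execute("""
    SELECT transaction_id FROM transactions
    WHERE transaction_date >= '2024-01-01'
      AND transaction_date <  '2024-04-01'
    ORDER BY transaction_id
""")]

print(between_ids, half_open_ids)
```

If transaction_date is a pure DATE column with no time component, the BETWEEN version above behaves as intended.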

IN — matches against a list of values.

SELECT * FROM members
WHERE branch IN ('Westlands', 'Mombasa CBD', 'Kisumu Town');

Cleaner than writing three separate OR conditions. I use IN a lot when pulling data for specific branches or account types.

LIKE — pattern matching on text fields.

SELECT * FROM members
WHERE last_name LIKE 'W%';

The % wildcard matches any sequence of characters. This finds all members whose last name starts with 'W'. The _ wildcard matches a single character if you need more precision.

All of these operators can be combined with AND and OR to build more specific filters. The key thing is that WHERE keeps your queries targeted — you get back what you need and nothing more.

Using CASE WHEN to Transform Data

Raw data is rarely presentation-ready. CASE WHEN lets you apply conditional logic inside a query, similar to nested IF statements in Excel but running directly on the database.

In the assignment, I used it to classify transaction amounts into risk tiers:

SELECT
    member_id,
    amount,
    CASE
        WHEN amount >= 100000 THEN 'High Value'
        WHEN amount >= 50000 THEN 'Medium Value'
        WHEN amount >= 10000 THEN 'Standard'
        ELSE 'Micro'
    END AS risk_tier
FROM transactions;

This adds a computed column called risk_tier to the result set without changing the underlying table. In compliance work, this kind of classification feeds directly into suspicious transaction reports — you tag records by threshold, then route them for review.

CASE WHEN evaluates top to bottom and stops at the first match, so the order of conditions matters. A transaction of KES 120,000 hits the first condition and gets labelled 'High Value' — it does not fall through to the next one.
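To see why the order matters, the sketch below classifies the same amount with the conditions in the original order and again with the lowest threshold listed first. It uses sqlite3 with a named parameter; the misordered version exists only to show the failure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Thresholds checked from highest to lowest, as in the query above.
correct_sql = """
    SELECT CASE
        WHEN :amount >= 100000 THEN 'High Value'
        WHEN :amount >= 50000 THEN 'Medium Value'
        WHEN :amount >= 10000 THEN 'Standard'
        ELSE 'Micro'
    END
"""

# Lowest threshold first: every amount of 10,000 or more stops here,
# so the later branches can never be reached.
misordered_sql = """
    SELECT CASE
        WHEN :amount >= 10000 THEN 'Standard'
        WHEN :amount >= 50000 THEN 'Medium Value'
        WHEN :amount >= 100000 THEN 'High Value'
        ELSE 'Micro'
    END
"""

correct_label = conn.execute(correct_sql, {"amount": 120000}).fetchone()[0]
misordered_label = conn.execute(misordered_sql, {"amount": 120000}).fetchone()[0]

print(correct_label, misordered_label)
```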

I also used it to collapse KYC status into a binary Active/Restricted indicator:

SELECT
    first_name,
    last_name,
    kyc_status,
    CASE
        WHEN kyc_status = 'Verified' THEN 'Active'
        ELSE 'Restricted'
    END AS account_status
FROM members;

This is the kind of derived column I would build into a Power BI data model as a calculated column — except here it runs at the SQL layer before the data even reaches the BI tool.

Reflection

The SQL covered this week is not new to me conceptually — I have written plenty of queries in the course of my work. But going back to the fundamentals and writing each statement out deliberately forced me to think about things I normally take for granted.

What caught my attention most was how much of my day-to-day data work maps directly to these four operations. Every member onboarding is an INSERT. Every KYC status update is an UPDATE. Every duplicate cleanup is a DELETE. And every report starts with a SELECT and a WHERE clause.

The part I found most useful was working through CASE WHEN with financial thresholds. In my compliance work, transaction classification is not optional — regulators expect it. Writing those conditions as SQL rather than handling them in Excel or Power Query is cleaner and more auditable.

One thing that tripped me up briefly was column aliasing with AS in a CASE expression: I initially placed the alias in the wrong position, inside the CASE block before the END keyword, when it belongs after END. Small syntax issue, but it threw an error that took a minute to track down. SQL is unforgiving about placement, and that is a good discipline to have when you are writing queries that run against production data.

Overall, a solid week. The fundamentals are the foundation — everything from joins to window functions to stored procedures builds on top of CREATE, INSERT, SELECT, and WHERE.

Lawrence Kahuthu Muriuki — Data & Finance Professional, Nairobi

How to Publish a Power BI Report and Embed It in a Website

Kahuthu Muriuki — Sun, 05 Apr 2026 18:55:24 +0000

Working with data professionally means that building a report is only half the job. The other half is making sure the right people can actually see it — without having to log into a dedicated analytics platform every time they need an update. That is where publishing and embedding come in.

In my work across data-heavy environments, I have found that one of the most practical things you can do with a finished Power BI report is push it to the web. Whether you are sharing dashboards with a broader team or surfacing insights on an internal site, the steps are straightforward once you know the flow. This guide walks through the full process — from setting up a workspace to getting your report live on a webpage.

What Is Power BI?

Power BI is a data analytics tool that takes raw data and turns it into interactive reports and visual dashboards. It connects to a wide range of data sources and gives you the ability to explore trends, track KPIs, and present findings in a format that non-technical stakeholders can actually engage with. It sits somewhere between a reporting tool and a full business intelligence platform — practical enough for day-to-day analysis, and capable enough for more complex data modelling work.

The Publishing and Embedding Process at a Glance

The overall flow has four stages:

Creating a workspace in the Power BI service
Uploading and publishing your report from Power BI Desktop
Generating an embed code from the published report
Pasting that code into an HTML file to display the report on a website

Let us go through each stage step by step.

1. Creating a Workspace

A workspace in Power BI is a shared environment where you organise and host your reports, dashboards, and datasets. Think of it as a project folder in the cloud — one that can be accessed and managed by multiple team members.

Step 1: Open your browser and navigate to https://app.powerbi.com.

Step 2: Sign in with your account credentials.

Step 3: Once logged in, find the left-hand navigation panel and click on Workspaces.

Step 4: Click + New Workspace to begin creating your workspace.

Step 5: A side panel will appear. Enter a name and a brief description for your workspace, then click Apply to save it. Give it a name that clearly reflects the project or report it will host — this saves confusion later, especially if you manage multiple workspaces.

2. Uploading and Publishing Your Report

With your workspace ready, the next step is getting your report from your local machine into the Power BI service.

Step 1: Open the Power BI Desktop file (.pbix) containing your report on your computer.

Step 2: At the top right of the Power BI Desktop interface, click the Publish button.

Step 3: A dialog box will appear prompting you to choose a destination. Select the workspace you just created.

Once the upload finishes, your report becomes accessible through the Power BI service. You will see a confirmation screen indicating the publish was successful.

3. Generating the Embed Code

Power BI gives you a way to embed reports directly into external web pages using an iframe snippet. Here is how to get that code.

Step 1: Open the published report in the Power BI service (navigate to your workspace and click on the report).

Step 2: In the top menu, click File.

Step 3: Hover over Embed report, then click Website or portal from the submenu that appears.

Step 4: A pop-up will display your embed code — an iframe snippet that you can copy directly.

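Copy that code. You will need it in the next step. It is worth noting that Website or portal is an authenticated embed, so viewers must sign in and have permission to see the report. The separate Publish to web (public) option is the one that exposes a report to anyone with the link, so be deliberate about which option you pick and what data you expose.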

4. Embedding the Report on a Website

With the embed code in hand, the final step is placing it inside an HTML file so that the report renders in a browser.

Step 1: Visit https://www.w3schools.com/html/ and copy one of the basic HTML page templates to use as your starting structure.

Step 2: Open a text editor of your choice. On your desktop, create a new folder and give it a name — this will hold your HTML file.

Step 3: In the text editor, open that folder and create a new file with a .html extension. Paste the HTML template you copied from the reference site into the file.

Step 4: Inside the HTML file, locate the body section and replace the placeholder content with your Power BI embed code. Save the file.

Step 5: Navigate to the folder you created on your desktop and open the HTML file in your browser.

Step 6: Sign in when prompted. Once authenticated, your Power BI report will render directly on the page.
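If you would rather script the file than paste by hand, the steps above can be sketched in a few lines of Python. Everything here is illustrative: build_page is a hypothetical helper, the iframe string is a placeholder rather than a real embed code, and the file is written to a temporary folder instead of your desktop.

```python
from pathlib import Path
import tempfile

# Placeholder only: replace this with the iframe snippet copied from the Power BI service.
EMBED_CODE = (
    '<iframe title="Sample report" width="1140" height="541" '
    'src="PASTE-YOUR-EMBED-URL-HERE" frameborder="0" allowfullscreen></iframe>'
)


def build_page(embed_code: str, title: str = "Power BI report") -> str:
    """Wrap an embed snippet in a minimal HTML page."""
    return (
        "<!DOCTYPE html>\n"
        "<html>\n<head>\n"
        f"<title>{title}</title>\n"
        "</head>\n<body>\n"
        f"{embed_code}\n"
        "</body>\n</html>\n"
    )


# Write the page to a temporary folder and report where it landed.
output_dir = Path(tempfile.mkdtemp())
page_path = output_dir / "report.html"
page_path.write_text(build_page(EMBED_CODE), encoding="utf-8")
print(f"Wrote {page_path}")
```

Opening the resulting file in a browser is the same as Step 5 above; the report only renders once the placeholder has been replaced with your own embed code.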

Key Insights

A few things worth keeping in mind as you work through this process:

Publish to web is public; Website or portal is not. The Website or portal path used in this guide is an authenticated embed: viewers sign in and must have access to the report, which requires a Power BI Pro or Premium licence. The separate Publish to web (public) option in the same menu makes the report visible to anyone with the link, with no login on the viewer's end. That works for general-purpose dashboards, but if your report contains sensitive data (financial records, client information, operational metrics), stay with the authenticated option.

Workspaces determine access and governance. The workspace you publish to controls who can view, edit, or manage the report within the Power BI service. Set up workspace access roles deliberately before sharing.

The embed code is tied to the published report. If you update and republish the report, the embed code typically remains valid and reflects the latest version automatically. You do not need to regenerate it every time you make changes to the underlying report.

This is a starting point, not the only approach. The method covered here — using a static HTML file — is the quickest way to test that your embed works. In a production environment, you would drop the same iframe code into your CMS, website template, or internal portal instead. The embed behaviour is the same regardless of where it sits.

Data currency depends on your refresh schedule. The report on your website reflects whatever data was loaded the last time Power BI refreshed the dataset. If you need the embedded report to stay current, configure a scheduled refresh in the Power BI service settings for your dataset.

Publishing a Power BI report and embedding it on a web page is a practical skill that closes the gap between analysis and audience. The built-in tooling handles most of the heavy lifting — the main thing you are doing is connecting the right pieces in the right order.

Understanding Data Modeling in Power BI: Joins, Relationships, and Schemas Explained.

Kahuthu Muriuki — Sun, 05 Apr 2026 10:52:06 +0000

Introduction

Across the fintech environments I have worked in, raw data rarely comes in a shape that is ready for analysis. You are almost always dealing with multiple tables, spread across different systems, each holding a piece of the full picture. Before you can build a single meaningful report in Power BI, you need to understand how those pieces connect.

That is what data modelling is about. It is the process of organising data into structured tables and defining how those tables relate to each other, so that reports are accurate, filters work correctly, and performance does not collapse under the weight of a large dataset. In Power BI, this modelling work sits between loading your data and building your visuals — and if you skip or rush it, everything downstream gets harder.

This article walks through the key building blocks: joins, relationships, cardinality, schemas, and a few practical mistakes worth avoiding.

1. Joins: Combining Tables at the Query Level

Before we get into how Power BI manages relationships, it helps to understand joins — because joins are how most data professionals first learn to connect tables, especially when working in SQL or within Power Query.

A join merges data from two tables into a single result set based on a shared column. Think of a payments table and a member accounts table in a SACCO system. Both tables share a member ID. A join uses that shared column to bring relevant records together.

But not every join behaves the same way, and choosing the wrong one can silently distort your analysis.

Inner Join

An inner join returns only the records where a match exists in both tables. If a member account has no corresponding payment record, it is excluded entirely. This is useful when you only care about records with complete data on both sides.

Left Join (Left Outer Join)

A left join keeps every record from the left table and pulls in matching records from the right table. Where there is no match, the right-side columns are left blank. This is particularly useful in compliance contexts — for example, when you need a full list of registered customers and want to flag those without a corresponding KYC document on file.

Right Join (Right Outer Join)

The mirror of a left join. All records from the right table are retained, and matching records from the left are attached. Non-matching left-side records are dropped.

Full Outer Join

Keeps everything from both tables, matched or not. Gaps appear where there is no corresponding record on either side. This is helpful when you want a complete picture — for instance, a reconciliation view showing all expected transactions alongside all actual ones, with clear gaps where discrepancies exist.

Left Anti Join

Returns records from the left table that have no match in the right table. In a brokerage context, this could surface client accounts that have never placed a trade — useful for identifying inactive accounts during a portfolio review.

Right Anti Join

Returns records from the right table with no match on the left. This could identify transactions that reference account IDs not found in your master client register — a red flag in any compliance or AML screening process.

Where to Apply Joins in Power BI

Joins in Power BI live in Power Query, accessed via Home → Transform Data → Power Query Editor. From there, you select the two tables, choose the matching columns, and pick your join type. The result is a merged table that can be loaded into your model.

The important thing to note is that joins are a data preparation step. They physically combine records into a single table. That is different from relationships, which we will cover next.

2. Relationships: The Backbone of Your Data Model

Once data is loaded into Power BI, the preferred way to connect tables is through relationships — not by repeatedly merging them. A relationship is a defined link between two tables based on a shared column. Define it once, and Power BI uses it automatically across every visual you build.

Consider a SACCO reporting scenario: you have a transactions table recording every deposit, withdrawal, and loan repayment, and a members table holding member details. A relationship on the member ID column means Power BI knows how to connect a member's name to their transaction history, without you writing a join every time you build a chart.

This is the fundamental difference between joins and relationships:

Joins run as a data preparation step and create a physical merged table. They increase data volume.
Relationships are persistent and logical. They keep tables separate but connected, and Power BI resolves the link at query time.

For large datasets — say, transaction logs from a payments platform — keeping tables separated through relationships rather than merged through joins is far more efficient.

Creating Relationships in Power BI

Power BI will attempt to auto-detect relationships when column names match across tables. This is convenient but unreliable. I have seen it try to link two columns both named date, from entirely unrelated tables, producing nonsensical filter behaviour. Always review auto-detected relationships and confirm or correct them manually.

There are two ways to create relationships:

Model View — drag a column from one table and drop it onto the matching column in another. A line appears between the tables showing the link.
Manage Relationships — available on the Home ribbon. This dialogue lets you create, edit, and delete relationships in one place, and gives you explicit control over cardinality and filter direction.

3. Cardinality: Defining How Tables Relate

Cardinality describes the numerical nature of the relationship between two tables — specifically, how records on one side correspond to records on the other.

One-to-Many (Most Common)

One record in the first table corresponds to many records in the second. In a securities brokerage setup, one client maps to many trades. The clients table holds one row per client; the trades table holds one row per trade, with the client ID repeated across multiple rows. This is the relationship type you will use most often — between dimension tables and fact tables.

Many-to-One

The same relationship viewed from the other direction. Many trades belong to one client. Power BI treats one-to-many and many-to-one as equivalent, depending on which table you anchor from.

One-to-One

Each record in one table matches exactly one record in another. This is uncommon but does come up — for example, linking an employee record to their biometric access profile where there is a strict one-person-one-profile constraint.

Many-to-Many

Multiple records in one table can relate to multiple records in another. This occurs in scenarios like product bundling, where one product can appear in multiple bundles, and each bundle can contain multiple products. These relationships require careful handling. Power BI supports them, but they can produce unexpected filter behaviour if not understood properly. Where possible, introduce a bridge table to resolve the many-to-many into two one-to-many relationships.
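The bridge table idea is independent of Power BI, so it can be sketched at the table level. In the example below, bundle_products is a hypothetical bridge holding one row per bundle and product pairing; all names and rows are invented, and sqlite3 is used only to make the sketch runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE bundles (bundle_id INTEGER PRIMARY KEY, bundle_name TEXT);

    -- Bridge table: each row links one bundle to one product.
    CREATE TABLE bundle_products (
        bundle_id INTEGER REFERENCES bundles (bundle_id),
        product_id INTEGER REFERENCES products (product_id),
        PRIMARY KEY (bundle_id, product_id)
    );

    INSERT INTO products VALUES
        (1, 'Savings Account'), (2, 'Debit Card'), (3, 'Mobile Wallet');
    INSERT INTO bundles VALUES (10, 'Starter'), (20, 'Premium');
    INSERT INTO bundle_products VALUES (10, 1), (10, 2), (20, 1), (20, 2), (20, 3);
""")

# Two one-to-many hops (bundles -> bridge, products -> bridge)
# stand in for the single many-to-many link.
pairs = conn.execute("""
    SELECT b.bundle_name, p.product_name
    FROM bundles b
    JOIN bundle_products bp ON b.bundle_id = bp.bundle_id
    JOIN products p ON bp.product_id = p.product_id
    ORDER BY b.bundle_name, p.product_name
""").fetchall()

for bundle_name, product_name in pairs:
    print(bundle_name, product_name)
```

In a Power BI model the same three tables would be loaded separately, with a one-to-many relationship from each of bundles and products to the bridge.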

4. Cross-Filter Direction

When a relationship exists, Power BI needs to know which direction filters should travel. This is the cross-filter direction setting.

Single direction is the default and recommended approach. Filters flow from dimension tables toward the fact table. Select a specific security in a products dimension, and the trade fact table updates to show only transactions involving that security.

Both directions allow filters to propagate in either direction between tables. This can be necessary in some advanced reporting scenarios, but it introduces complexity. Filters can cascade in unexpected ways across the model, making results harder to predict and debug. Use bidirectional filtering sparingly and document why it is needed when you do.

Active vs. Inactive Relationships

Power BI allows multiple relationships between two tables, but only one can be active at a time. The active relationship is the one Power BI uses by default in visuals.

Inactive relationships are not ignored — they exist in the model and can be called on explicitly using DAX functions like USERELATIONSHIP(). A common use case is date tables, where you might have separate relationships for trade date, settlement date, and value date, but only one can be active in the default context.

5. Fact Tables and Dimension Tables

Most professional data models are built around a clear separation between fact tables and dimension tables.

Fact tables store events or transactions. They record what happened — a payment processed, a trade executed, a loan disbursed, a member's deposit posted. Each row in a fact table is a single event. Fact tables tend to be long (many rows) and relatively narrow — they hold numeric measures and foreign keys, not descriptive details.

Dimension tables provide context for those events. They answer the who, what, where, and when. A members table, a products table, a branches table, a calendar table — these are all dimension tables. They tend to be shorter (fewer rows) and wider, with descriptive attributes that give your visuals meaningful labels.

The relationship between dimension tables and the fact table is almost always one-to-many: one member, many transactions. This structure is the foundation of the star schema.

6. Data Modelling Schemas

A schema is the overall structure of how your tables are arranged and connected. Choosing the right schema affects model performance, report flexibility, and how easy the model is to maintain as data grows.

Flat Table

Before any modelling happens, data often arrives as a flat table — one large sheet with every column crammed in: customer name, branch, product, transaction amount, date, currency, all in one place. This is common with data exports from core banking systems or when scraping transaction logs.

Flat tables are readable at a glance, but they scale poorly. Every time a customer name or branch label appears in a transaction, it is repeated in full. If a branch is renamed, you update hundreds or thousands of rows. Storage grows faster than it should, and query performance suffers. Flat tables are a starting point, not a destination.
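To make the repetition problem concrete, here is a minimal sketch that splits a flat export into a small branch dimension and a leaner fact table. The rows, column names, and integer keys are assumptions for illustration, not a real core banking export.

```python
# Hypothetical flat export: the branch name is repeated on every row.
flat_rows = [
    {"txn_id": 1, "branch": "Nairobi CBD", "amount": 1200},
    {"txn_id": 2, "branch": "Nairobi CBD", "amount": 450},
    {"txn_id": 3, "branch": "Mombasa", "amount": 900},
]

branch_ids = {}  # branch name -> integer key
fact_rows = []
for row in flat_rows:
    # Assign the next integer key the first time a branch is seen.
    branch_id = branch_ids.setdefault(row["branch"], len(branch_ids) + 1)
    fact_rows.append(
        {"txn_id": row["txn_id"], "branch_id": branch_id, "amount": row["amount"]}
    )

# Dimension table: one row per branch.
dim_branch = [{"branch_id": i, "branch": name} for name, i in branch_ids.items()]

print(dim_branch)
print(fact_rows)
```

After the split, renaming a branch means changing one row in dim_branch instead of every transaction that mentions it.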

Star Schema

The star schema is the workhorse of Power BI data modelling. One central fact table sits in the middle, with dimension tables connected directly around it. Viewed in Model View, it literally looks like a star.

In a payments context, your central fact table might hold every payment transaction — amount, currency, timestamp, sender ID, receiver ID, method ID. Surrounding it are dimension tables: a members/accounts table, a currency reference table, a payment methods table, a calendar table. Each dimension connects directly to the fact table on a one-to-many basis.

The advantages are real:

Descriptive data is stored once in dimension tables instead of being repeated in every transaction row.
The model is easy to read. You can look at the diagram and immediately understand how everything connects.
Power BI performs well with this structure. Filters propagate predictably from dimensions to the fact table.

For most reporting work — dashboards, KPI monitoring, regulatory reporting — the star schema is the right choice.

Snowflake Schema

The snowflake schema extends the star by normalising the dimension tables. Instead of one flat members dimension, you might split it into a members table, a regions table, and a countries table — each linking to the next. This breaks down hierarchical data into its component layers.

Normalisation reduces redundancy further and ensures data consistency. If Kenya is spelt "Kenya" in the countries table and linked by ID everywhere else, you never have to worry about "kenya" or "KENYA" appearing elsewhere.

The tradeoff is complexity. More tables mean more relationships to manage, more potential for misconfiguration, and more joins for Power BI to resolve at query time. Snowflake schemas make sense for very large datasets with deep hierarchies, or when the source data is already structured that way from a warehousing system. For most operational dashboards and management reporting, the star schema performs better and is much easier to maintain.

7. Common Mistakes and How to Avoid Them

A few problems come up regularly when building data models, particularly if you are moving quickly or working with unfamiliar data.

Missing relationships are the most common. If your tables are not properly linked, visuals behave unexpectedly — a slicer that should filter your transactions does nothing, or totals appear correct but do not break down properly. Always confirm that every connection between a fact table and its dimensions is in place.

Wrong cardinality can produce duplicated or inflated numbers. Setting a one-to-many relationship as many-to-many when the data does not warrant it can cause Power BI to aggregate some records more than once. Before confirming a relationship, verify whether your key column truly has unique values on the "one" side.

Descriptive columns in the fact table bloat the model unnecessarily. A member's name, branch label, or product category has no business sitting in every transaction row. That information belongs in dimension tables. Keep fact tables lean: foreign keys and numeric measures only.

Overusing bidirectional filters creates ambiguity. In a model with several connected tables, bidirectional filters can cause cascading filter propagation that makes results unpredictable. Default to single-direction filters unless you have a specific, well-understood reason to enable both.

Unnecessary snowflaking adds complexity without proportionate benefit. If a simple star schema serves your reporting needs, building a multi-level snowflake structure because it feels more rigorous will only make the model harder to troubleshoot. Start simple and add complexity only when the data or the reporting requirements genuinely demand it.
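The cardinality advice above (verify that the key column is unique on the "one" side) can be checked before the data reaches Power BI. This is a minimal sketch using a hypothetical members list. The helper name duplicate_keys is mine, not part of any library.

```python
from collections import Counter


def duplicate_keys(rows, key):
    """Return the key values that appear more than once in rows."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, n in counts.items() if n > 1)


# Hypothetical dimension table with an accidental duplicate.
members = [
    {"member_id": "M001", "name": "A. Wanjiru"},
    {"member_id": "M002", "name": "B. Otieno"},
    {"member_id": "M002", "name": "B. Otieno"},
]

dupes = duplicate_keys(members, "member_id")
print(dupes)
```

An empty result means the column can safely sit on the "one" side of a one-to-many relationship. Anything else needs cleaning first.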

8. Putting It Together: A Practical Walkthrough

Say you are building a member activity dashboard for a SACCO. You have three tables: transactions, members, and loan_products.

Step 1 — Load data: Import all three tables into Power BI via Power Query.

Step 2 — Clean data: In Power Query, remove duplicate records, standardise column names (use member_id consistently, not ID in one table and MemberID in another), and handle any null values in key columns.

Step 3 — Build relationships: In Model View, link transactions[member_id] to members[member_id] with a one-to-many relationship. Link transactions[product_id] to loan_products[product_id] the same way.

Step 4 — Apply star schema: transactions is your fact table. members and loan_products are your dimension tables.

Step 5 — Build visuals: Now you can create a bar chart showing total disbursements by loan product, a table breaking down transaction volumes by member tier, and a slicer filtering all visuals by branch — all from a single clean model.
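For readers who like to see the logic outside Power BI, the walkthrough above can be sketched in plain Python. The member IDs, tiers, product names, and amounts are invented for illustration. The two dictionary lookups stand in for the one-to-many relationships built in Step 3.

```python
from collections import defaultdict

# Dimension tables keyed by their unique IDs (hypothetical values).
members = {
    "M001": {"tier": "Gold", "branch": "Thika"},
    "M002": {"tier": "Silver", "branch": "Nakuru"},
}
loan_products = {"LP1": "Development Loan", "LP2": "Emergency Loan"}

# Fact table: one row per transaction, holding keys and a numeric measure.
transactions = [
    {"member_id": "M001", "product_id": "LP1", "amount": 50000},
    {"member_id": "M002", "product_id": "LP1", "amount": 20000},
    {"member_id": "M001", "product_id": "LP2", "amount": 5000},
]

total_by_product = defaultdict(int)
count_by_tier = defaultdict(int)
for txn in transactions:
    # Follow each foreign key to its dimension to get a label.
    product_name = loan_products[txn["product_id"]]
    tier = members[txn["member_id"]]["tier"]
    total_by_product[product_name] += txn["amount"]
    count_by_tier[tier] += 1

print(dict(total_by_product))
print(dict(count_by_tier))
```

The fact rows carry only keys and amounts. Every label in the two summaries comes from a dimension table.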

Conclusion

Data modelling is where reporting either earns its reliability or loses it. The mechanics — joins, relationships, cardinality, schemas — are not just theoretical abstractions. They directly determine whether a slicer filters correctly, whether a total is accurate, and whether a model can handle a growing dataset without grinding to a halt.

The star schema, clear one-to-many relationships, and lean fact tables are the foundation of any model worth building on. Get the structure right before you start building visuals, and the visuals become straightforward. Get it wrong, and you end up debugging report behaviour instead of actually analysing data.

How Excel is Used in Real-World Data Analysis

Kahuthu Muriuki — Sun, 29 Mar 2026 06:39:53 +0000

Introduction

Excel is not always the flashiest choice. There are more specialised tools — Power BI for dashboards, Python for large-scale processing, SQL for database queries. I use several of them. But Excel remains the place where raw data first lands, where quick checks happen, and where non-technical stakeholders can engage with findings without needing a login or a training session.

What Is Excel, and Why Does It Still Matter?

Microsoft Excel is a spreadsheet application that organises data in rows and columns, supports calculation through formulas and functions, and produces visual output through charts and pivot tables. What keeps it relevant despite faster, more capable alternatives is its low barrier to entry and near-universal presence in organisations. Every finance department, every compliance team, and every operations desk I have worked in has had Excel open.

For anyone entering data-related work, Excel builds the right instincts — understanding data types, spotting inconsistencies, and thinking about how records relate to each other. The skills transfer directly to more advanced tools later on.

Where Excel Shows Up in Real Work

Financial Reporting and Reconciliation

In financial services, Excel is where P&L summaries are assembled, cash flow projections are modelled, and month-end variance analyses are performed. Functions like SUMIF, VLOOKUP, and nested IF statements do most of the heavy lifting.

Compliance Tracking and KYC Management

A well-built Excel tracker for compliance work typically uses data validation to restrict entries to approved values, conditional formatting to flag overdue reviews in red, and DATEDIF formulas to calculate how many days remain before a document expires. These are not complex features, but they prevent errors that cost time and create regulatory exposure.
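The days-remaining logic behind such a tracker is simple date subtraction. Here is a minimal Python sketch of the same idea. The 30-day warning window and the status labels are assumptions of mine, not a regulatory rule.

```python
from datetime import date


def days_until_expiry(expiry, today):
    """Days left before a document expires (negative once it has expired)."""
    return (expiry - today).days


def review_status(expiry, today, warn_within=30):
    """Classify a document as Overdue, Due soon, or OK."""
    remaining = days_until_expiry(expiry, today)
    if remaining < 0:
        return "Overdue"
    if remaining <= warn_within:
        return "Due soon"
    return "OK"


today = date(2026, 3, 29)
print(days_until_expiry(date(2026, 4, 30), today))
print(review_status(date(2026, 3, 1), today))
```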

Data Cleaning: The Part Nobody Talks About Enough

In practice, most data arriving in Excel is not clean. Client names have inconsistent spacing, dates are stored as text, and duplicate entries appear across merged files from different systems. Cleaning this data before any analysis is not optional — it directly determines whether the conclusions drawn are accurate.

Features Regularly Used

TRIM and CLEAN

TRIM removes extra spaces that accumulate when data is exported from systems like Salesforce or Zendesk. CLEAN removes non-printable characters that sometimes appear in data pulled from legacy platforms. Both are quiet functions that prevent downstream errors.

=TRIM(A2)      — removes leading, trailing and double spaces
=CLEAN(A2)     — removes non-printable characters from text
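If the same cleanup ever needs to happen in Python, rough equivalents can be written in a few lines. These are approximations under my own assumptions, not exact reimplementations of the Excel functions. The names trim and clean are just illustrative helpers.

```python
def trim(text):
    """Rough equivalent of TRIM: strip the ends and collapse runs of spaces."""
    return " ".join(part for part in text.split(" ") if part)


def clean(text):
    """Rough equivalent of CLEAN: drop ASCII control characters (codes 0-31)."""
    return "".join(ch for ch in text if ord(ch) >= 32)


print(trim("  Acme   Holdings  Ltd "))
print(clean("Acme\x07 Ltd\n"))
```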

Find and Replace — `Ctrl+H`

Particularly useful when a field has inconsistent values — for example, a country column where the same country appears as KE, Kenya, and kenya. A few Find and Replace passes standardise the field before any counting or filtering happens.

Text to Columns

When a field contains combined data — a full name in one cell, a date and a reference number separated by a hyphen — Text to Columns splits it into usable parts. I have used this frequently when pulling client data from onboarding systems that concatenate fields.

Remove Duplicates

In compliance work, duplicate partner records are a real risk. The Remove Duplicates function, combined with conditional formatting to highlight matches first, is a reliable way to check data quality before running analysis.

Data Validation

The best way to reduce cleaning work is to prevent bad data from entering in the first place. Data Validation restricts cells to approved values, specific number ranges, or particular date formats. When I build input templates for teams, data validation is always included.

Transforming Data into Something Usable

Once data is clean, it often needs reshaping before it can be analysed. Transformation in Excel covers formatting for readability, converting data types, and restructuring how records are organised.

Formatting for Readability

Column widths that cut off values, missing headers, inconsistent number formats — these are not cosmetic issues. They slow down work and cause misreadings. Bolding header rows, applying consistent number formatting, and using freeze panes to keep headers visible while scrolling are small habits that save time across large datasets.

Data Type Conversion

Numbers stored as text are a persistent problem in exported data. They look correct, but will not respond to SUM or AVERAGE. The VALUE function converts them. Similarly, dates stored as text need conversion before date functions will work correctly.

=VALUE(A2)       — converts text that looks like a number into an actual number
=DATEVALUE(A2)   — converts text that looks like a date into a date serial number
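The same two conversions look like this in Python. This sketch assumes commas are thousands separators and dates arrive as day/month/year text. Both are assumptions about the export, so adjust them to match the actual source. Unlike DATEVALUE, it returns a date object instead of a serial number.

```python
from datetime import datetime


def to_number(text):
    """Convert text such as ' 1,250.50 ' into a float."""
    return float(text.replace(",", "").strip())


def to_date(text, fmt="%d/%m/%Y"):
    """Convert text such as '29/03/2026' into a date object."""
    return datetime.strptime(text.strip(), fmt).date()


print(to_number(" 1,250.50 "))
print(to_date("29/03/2026"))
```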

Text Standardisation

In datasets containing client names or country entries, inconsistent capitalisation creates grouping errors. PROPER, UPPER, and LOWER standardise text fields so that pivot tables and COUNTIF formulas group records correctly.

=PROPER(A2)    — capitalises the first letter of each word
=UPPER(A2)     — converts all text to capitals
=LOWER(A2)     — converts all text to lower case
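A short Python sketch shows why this matters for grouping. The country values are made up. Python's str.title() is used here as a rough stand-in for PROPER.

```python
from collections import Counter

# Hypothetical country column with inconsistent casing and a stray space.
countries = ["Kenya", "kenya ", "KENYA", "Uganda"]

# Counting the raw values treats each spelling as a different country.
raw_counts = Counter(countries)

# Stripping spaces and standardising case groups them correctly.
clean_counts = Counter(c.strip().title() for c in countries)

print(raw_counts)
print(clean_counts)
```

The raw count reports four separate "countries". The standardised count reports two.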

Analysing Data: From Raw Numbers to Decisions

The analytical stage is where the work becomes visible to others. In financial and compliance roles, this means producing numbers that someone will act on — a funding decision, a risk escalation, a process change. The formulas used here need to be correct, and the logic behind them needs to be defensible.

Core Statistical Functions

The foundational functions cover the majority of day-to-day analytical needs in operations and finance work:

=AVERAGE(range) — mean value across a set, used for KPI benchmarking
=MEDIAN(range) — middle value, more reliable than average when outliers are present
=COUNT(range) — counts numeric entries; COUNTA counts all non-empty cells
=SUM(range) — total of a range; SUMIF adds a condition
=MAX(range) and =MIN(range) — identify the ceiling and floor of a dataset

Lookup Functions

VLOOKUP and its more capable successor XLOOKUP are essential when reconciling data across multiple sources — matching partner IDs against a reference table, pulling account names from a separate register, or checking whether a client appears on a restricted list.

=XLOOKUP(lookup_value, lookup_array, return_array)

XLOOKUP is the version worth learning now. It handles left-hand lookups, returns custom values when no match is found, and is less sensitive to column order changes than VLOOKUP.
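To show what an exact-match lookup with a fallback value does, here is a small Python sketch. The function below is my own illustration that mirrors the first four XLOOKUP arguments. It is not Excel's implementation, and unlike Excel it compares text case-sensitively. The partner IDs and names are hypothetical.

```python
def xlookup(lookup_value, lookup_array, return_array, if_not_found=None):
    """Return the item in return_array that lines up with the first exact match."""
    for key, value in zip(lookup_array, return_array):
        if key == lookup_value:
            return value
    return if_not_found


# The ID column can sit anywhere relative to the name column.
partner_names = ["Alpha Ltd", "Beta Sacco", "Gamma Pay"]
partner_ids = ["P-100", "P-200", "P-300"]

print(xlookup("P-200", partner_ids, partner_names))
print(xlookup("P-999", partner_ids, partner_names, "Not on register"))
```

Because the lookup and return columns are passed separately, reordering or inserting columns does not break the lookup. That is the practical advantage over VLOOKUP's column index.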

Conditional Logic

IF statements allow analysis to respond to conditions in the data. In compliance work, this might mean classifying clients into risk tiers based on transaction volume, or flagging records where a required document field is empty.

=IF(D2>1000000,"High Risk",IF(D2>100000,"Medium Risk","Low Risk"))

IFS simplifies the logic when there are multiple conditions to evaluate, avoiding deeply nested IF functions that become difficult to read and maintain.
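The tiering logic in the nested IF above reads more clearly when written as a sequence of checks, which is the same benefit IFS offers. Here is a Python sketch using the same thresholds as the formula. The function name risk_tier is illustrative.

```python
def risk_tier(volume):
    """Classify a client by transaction volume, checking the highest tier first."""
    if volume > 1_000_000:
        return "High Risk"
    if volume > 100_000:
        return "Medium Risk"
    return "Low Risk"


for volume in (2_500_000, 1_000_000, 100_000, 50_000):
    print(volume, risk_tier(volume))
```

Note the boundary behaviour. A volume of exactly 1,000,000 falls into Medium Risk because the comparison is strictly greater than, just as in the Excel formula.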

Pivot Tables

Pivot tables are the fastest route from a large dataset to a summary. In a compliance tracker with hundreds of partner records, a pivot table can show the count of partners by risk tier, by country, and by document status in seconds — without touching the original data.
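Underneath, a count-style pivot table is a grouped tally. Here is a minimal Python sketch of that idea with invented partner records and column names. Like a pivot table, it leaves the source rows untouched.

```python
from collections import Counter

# Hypothetical partner records from a compliance tracker.
partners = [
    {"partner": "Alpha Ltd", "risk_tier": "High", "doc_status": "Expired"},
    {"partner": "Beta Sacco", "risk_tier": "Low", "doc_status": "Valid"},
    {"partner": "Gamma Pay", "risk_tier": "High", "doc_status": "Valid"},
    {"partner": "Delta Bank", "risk_tier": "Low", "doc_status": "Valid"},
]

# One "row field": count of partners per risk tier.
by_tier = Counter(p["risk_tier"] for p in partners)

# Row field plus column field: count per (risk tier, document status) pair.
by_tier_and_status = Counter((p["risk_tier"], p["doc_status"]) for p in partners)

print(by_tier)
print(by_tier_and_status)
```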

The key discipline is keeping the source data clean and consistently structured. A pivot table is only as reliable as the data feeding it. When I build reporting templates, the source data tab and the pivot summary tab are always separate, and the source data has validated, consistent entries.

Pivot tables are also the starting point for dashboards. Once the summary logic is working correctly in a pivot, the chart built from it will update whenever the pivot table is refreshed with new data.

Formula Quick Reference

Category	Formula	What It Does	Where I Use It
Statistical	`=AVERAGE(range)`	Calculates mean values	Client KPI dashboards
Statistical	`=COUNT / COUNTA`	Counts entries in a range	Onboarding completion tracking
Lookup	`=VLOOKUP / XLOOKUP`	Retrieves data across tables	Matching partner records in KYC files
Logical	`=IF / IFS`	Returns results based on a condition	Risk-tier classification
Text	`=TRIM / PROPER`	Cleans and normalises text	Standardising client name fields
Date	`=DATEDIF / TODAY()`	Calculates date intervals	Monitoring SLA and review deadlines
Aggregation	`=SUMIF / COUNTIF`	Conditional sum or count	Flagging overdue compliance cases

Data Visualisation: Making Findings Accessible

Not everyone who needs to understand the data wants to scroll through a spreadsheet. Charts and dashboards translate findings into a form that supports faster decisions in meetings, presentations, and reports.

Chart Types and When to Use Them

Bar and Column Charts — suited for comparing values across categories, such as monthly onboarding volumes across different markets or contrasting resolution times across client segments.

Line Charts — best for showing change over time. In operational reporting, line charts work well for tracking weekly ticket volumes, daily transaction counts, or month-on-month revenue trends.

Pie Charts — useful for showing proportional composition, such as the share of total cases by risk category. They become harder to read with more than five or six segments.

Scatter Plots — helpful when exploring relationships between two variables, for example, whether higher transaction values correlate with longer onboarding times.

Building a Dashboard

A functional Excel dashboard connects pivot tables and charts into a single view, using slicers and dropdown controls to filter without modifying the underlying data. Keep the layout consistent, limit the number of metrics on a single page, and make the filters obvious to someone who did not build it.

Closing Thoughts

Excel rewards the person who takes the time to understand it properly. The gap between someone who uses Excel to store numbers and someone who uses it to drive decisions is not about knowing more functions — it is about understanding the data, asking the right questions, and building outputs that other people can trust and act on.

The tool itself is not the skill; the skill is knowing how to move from raw data to a conclusion that holds up under scrutiny. That is a habit worth building early, and Excel is still one of the better places to build it.