From Chaos to Clarity: How SQL Transforms Raw Data into Powerful Stories

By Collins Njeru


I still remember the first time I ran a query that actually meant something. Not a tutorial exercise with fake names and random numbers, but a real question, against real data, that came back with a real answer. That moment is hard to describe. It feels less like writing code and more like having a conversation with your database.

That is what SQL really is. A conversation. And like any conversation, the quality of what you get back depends entirely on how well you frame what you are asking.

This article is about learning to ask better questions. We will use a retail sales database (customers, products, sales records, inventory) and work through the full range of SQL, from the basics all the way to the kind of queries that make a hiring manager lean forward in their chair.


The Database We Are Working With

Before writing a single query, you need to understand your data. That sounds obvious, but you would be surprised how many people skip this step and wonder why their results look off.

Our database has four tables:

  • customers - 50 records. Each one has a name, email, phone number, the date they registered, and a membership tier: Bronze, Silver, or Gold.
  • products - 15 items spread across Electronics, Appliances, and Accessories. Prices range from a $25 wireless mouse all the way to a $1,500 smart TV.
  • sales - 15 transactions. Each one links a customer to a product, records how many units were sold, the date, and the total amount.
  • inventory - stock levels for each product.

Small dataset. Big concepts. The patterns you learn here work exactly the same way when you are dealing with millions of rows.
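Since every query that follows depends on these tables, here is a sketch of what the schema might look like in PostgreSQL. The column names are assumptions inferred from the queries later in the article, not the author's actual DDL:

```sql
-- Hypothetical schema; column names inferred from the queries below
CREATE TABLE customers (
    customer_id        SERIAL PRIMARY KEY,
    first_name         TEXT NOT NULL,
    last_name          TEXT NOT NULL,
    email              TEXT,
    phone_number       TEXT,
    registration_date  DATE,
    membership_status  TEXT CHECK (membership_status IN ('Bronze', 'Silver', 'Gold'))
);

CREATE TABLE products (
    product_id      SERIAL PRIMARY KEY,
    product_name    TEXT NOT NULL,
    category        TEXT,
    price           NUMERIC(10, 2),
    stock_quantity  INT
);

CREATE TABLE sales (
    sale_id        SERIAL PRIMARY KEY,
    customer_id    INT REFERENCES customers(customer_id),
    product_id     INT REFERENCES products(product_id),
    quantity_sold  INT,
    sale_date      DATE,
    total_amount   NUMERIC(10, 2)
);

CREATE TABLE inventory (
    product_id      INT REFERENCES products(product_id),
    stock_quantity  INT
);
```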


Start Simple, Stay Curious

Every SQL query starts the same way: you ask for something, and the database gives it back to you.

SELECT * FROM customers;

SELECT product_name, price
FROM products
WHERE price > 500;

That second query is where things start getting interesting. You are not just pulling data; you are filtering it. You are asking a specific question: which products cost more than $500? That shift in thinking, from "give me everything" to "give me exactly this," is the foundation of good data analysis.

Aggregates take it a step further. Instead of looking at individual rows, you start summarising them:

SELECT COUNT(*) AS total_products FROM products;
SELECT AVG(price) AS average_price FROM products;
SELECT SUM(total_amount) AS total_revenue FROM sales;

One thing that trips people up early on is the difference between WHERE and HAVING. They look similar and both filter data, but they work at completely different stages. WHERE runs before any grouping happens: it narrows down the rows going in. HAVING runs after grouping: it filters the results that come out. Mix them up and you either get an error or, worse, results that look right but are actually wrong.

SELECT product_id, SUM(quantity_sold) AS total_qty
FROM sales
GROUP BY product_id
HAVING SUM(quantity_sold) >= 2;
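To see both stages in one query, here is a sketch that does row-level and group-level filtering together (the cut-off date is illustrative, assuming the sale_date column used later in the article):

```sql
-- WHERE trims rows before grouping; HAVING trims groups after
SELECT product_id, SUM(quantity_sold) AS total_qty
FROM sales
WHERE sale_date >= '2024-01-01'    -- row-level filter, runs first
GROUP BY product_id
HAVING SUM(quantity_sold) >= 2;    -- group-level filter, runs last
```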

JOINs — Where the Real Magic Happens

Here is the thing about relational databases: the interesting answers almost never live in a single table. You need to pull things together. That is what JOINs do.

The most common one is the INNER JOIN: it only returns rows where there is a match on both sides:

SELECT 
    c.first_name || ' ' || c.last_name AS full_name,
    p.product_name,
    s.quantity_sold
FROM customers c
JOIN sales s ON c.customer_id = s.customer_id
JOIN products p ON s.product_id = p.product_id;

That one query answers: who bought what, and how much of it? Three tables, one clean result.

LEFT JOINs are for when you need to keep rows even if there is no match on the other side. The classic use case is finding gaps: things that should have happened but did not:

-- Products that have never been sold
SELECT p.*
FROM products p
LEFT JOIN sales s ON p.product_id = s.product_id
WHERE s.product_id IS NULL;

This is called an anti-join. You are not looking for what matches; you are looking for what does not. It is one of the most useful patterns in SQL and one of the least talked about in beginner tutorials.

There is also the self-join, which sounds strange until you need it. Joining a table to itself lets you compare rows within the same dataset, like finding all pairs of customers who share the same membership tier:

SELECT 
    c1.first_name || ' ' || c1.last_name AS customer1,
    c2.first_name || ' ' || c2.last_name AS customer2,
    c1.membership_status
FROM customers c1
JOIN customers c2 ON c1.membership_status = c2.membership_status
WHERE c1.customer_id < c2.customer_id;

That c1.customer_id < c2.customer_id at the end is easy to miss but important — without it, you get every customer paired with themselves, and every pair listed twice.


Subqueries — Answering Questions Inside Questions

Some questions cannot be answered in one shot. You need to calculate something first, then use that result to filter or compare. That is where subqueries come in.

-- Products priced above their own category average
SELECT product_name, category, price
FROM products p
WHERE p.price > (
    SELECT AVG(p2.price)
    FROM products p2
    WHERE p2.category = p.category
);

The inner query runs first and figures out the average price per category. The outer query then compares each product against that benchmark. This is called a correlated subquery because the inner part depends on the outer part, specifically on p.category. It is powerful, though on very large tables it can be slow, since the inner query may be re-evaluated for every outer row. That is a trade-off worth knowing.
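When that cost bites, the same question can usually be answered in a single pass with AVG as a window function (a technique the article covers properly in a later section). A sketch:

```sql
-- Same question, one scan: compare each price to its category average
SELECT product_name, category, price
FROM (
    SELECT product_name, category, price,
           AVG(price) OVER (PARTITION BY category) AS category_avg
    FROM products
) t
WHERE price > category_avg;
```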

NOT EXISTS is one of the cleanest constructs in SQL:

-- Customers who never made a purchase
SELECT * FROM customers c
WHERE NOT EXISTS (
    SELECT 1 FROM sales s
    WHERE s.customer_id = c.customer_id
);

Notice the SELECT 1 inside: it does not matter what you select there. You are not asking for data. You are just checking: does a matching row exist? If it does not, the customer makes the cut. Clean, readable, efficient.


CTEs — Writing SQL You Can Actually Read Later

Nested subqueries work, but they get ugly fast. Stack a few of them and your query starts looking like a puzzle that even you cannot solve two weeks later.

CTEs (Common Table Expressions) fix that. You define named, temporary result sets using the WITH keyword and refer to them cleanly in the main query. The logic does not change. The readability does.

Without CTE:

SELECT * FROM (
    SELECT customer_id, SUM(total_amount) AS total_spent
    FROM sales
    GROUP BY customer_id
) sub
WHERE total_spent > (
    SELECT AVG(total_spent) FROM (
        SELECT SUM(total_amount) AS total_spent
        FROM sales GROUP BY customer_id
    ) inner_sub
);

With CTE:

WITH customer_spending AS (
    SELECT customer_id, SUM(total_amount) AS total_spent
    FROM sales
    GROUP BY customer_id
),
avg_spending AS (
    SELECT AVG(total_spent) AS avg_spent FROM customer_spending
)
SELECT * FROM customer_spending
WHERE total_spent > (SELECT avg_spent FROM avg_spending);

Same result. The CTE version reads almost like plain English: here is the spending per customer, here is the average, now show me everyone above it. That is the kind of code that survives handovers and 3am debugging sessions.


Window Functions — The Thing That Changes How You Think About SQL

If there is one part of SQL that genuinely changes the way you approach data problems, it is window functions. Once you understand them, a whole category of questions that used to feel complicated becomes straightforward.

The key difference from GROUP BY: window functions compute across rows without collapsing them. You still see every individual row; you just get extra computed columns alongside them.

Ranking is the simplest example:

SELECT 
    customer_id,
    SUM(total_amount) AS total_spent,
    RANK() OVER (ORDER BY SUM(total_amount) DESC) AS rank
FROM sales
GROUP BY customer_id;

Every customer keeps their own row. You just get a rank column added.

NTILE is great for segmentation:

SELECT 
    customer_id,
    SUM(total_amount) AS total_spent,
    NTILE(4) OVER (ORDER BY SUM(total_amount) DESC) AS quartile
FROM sales
GROUP BY customer_id;

Split your customers into four equal groups by spending. Bucket 1 is your top 25%. That is a customer segmentation model in six lines of SQL.

LAG and LEAD let you look at the row before or after the current one, which is incredibly useful for trend analysis:

SELECT 
    sale_id,
    sale_date,
    total_amount,
    LAG(total_amount) OVER (ORDER BY sale_date) AS previous_sale,
    LEAD(total_amount) OVER (ORDER BY sale_date) AS next_sale
FROM sales;

And running totals — a staple of any financial report:

SELECT 
    sale_id,
    sale_date,
    total_amount,
    SUM(total_amount) OVER (ORDER BY sale_date) AS running_total
FROM sales;

PARTITION BY is where things get really interesting. It lets you run window functions independently within groups, like ranking products inside each category separately:

SELECT 
    product_name,
    category,
    price,
    RANK() OVER (PARTITION BY category ORDER BY price DESC) AS rank_in_category
FROM products;

Each category gets its own ranking that resets to 1. Electronics and Appliances are ranked independently. People often try to solve this kind of thing with complicated joins or multiple queries, when a single window function handles it cleanly.
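The same clause composes with the running-total pattern from earlier. A per-customer running balance is one PARTITION BY away; a sketch, assuming the sales columns used throughout this article:

```sql
-- Running total that restarts for each customer
SELECT
    customer_id,
    sale_date,
    total_amount,
    SUM(total_amount) OVER (
        PARTITION BY customer_id
        ORDER BY sale_date
    ) AS running_total_per_customer
FROM sales;
```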


CASE WHEN — Making Your Data Tell a Story

Raw numbers are fine for machines. People need context. CASE WHEN is how you translate numbers into meaning:

SELECT 
    product_name,
    price,
    CASE
        WHEN price > 1000 THEN 'Premium'
        WHEN price BETWEEN 500 AND 1000 THEN 'Standard'
        ELSE 'Budget'
    END AS price_category
FROM products;

You can do the same for customers based on their total spending:

SELECT 
    c.customer_id,
    c.first_name,
    COALESCE(SUM(s.total_amount), 0) AS total_spent,
    CASE
        WHEN COALESCE(SUM(s.total_amount), 0) > 20000 THEN 'VIP'
        WHEN COALESCE(SUM(s.total_amount), 0) BETWEEN 10000 AND 20000 THEN 'Regular'
        ELSE 'New'
    END AS customer_level
FROM customers c
LEFT JOIN sales s ON c.customer_id = s.customer_id
GROUP BY c.customer_id, c.first_name;

Notice the COALESCE wrapping the SUM. Customers with zero purchases would return NULL from the sum; COALESCE catches that and turns it into 0 before the CASE logic runs. Small detail, big difference in your results.


The Production Stuff Nobody Talks About Enough

Writing queries is one skill. Building something reliable that other people can use is another. These are the tools that close that gap.

Stored functions let you wrap logic into something reusable and callable:

CREATE OR REPLACE FUNCTION get_customer_spending(cust_id INT)
RETURNS NUMERIC LANGUAGE plpgsql AS $$
DECLARE total NUMERIC;
BEGIN
    SELECT COALESCE(SUM(total_amount), 0) INTO total
    FROM sales WHERE customer_id = cust_id;
    RETURN total;
END;
$$;

Now instead of rewriting that aggregation every time, you call get_customer_spending(42). One place to maintain if anything ever changes.
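As a sketch of how that composes with ordinary queries (assuming the function defined above), you can call it anywhere an expression is allowed:

```sql
-- Call the function directly, or use it per row in a query
SELECT get_customer_spending(42);

SELECT customer_id, first_name,
       get_customer_spending(customer_id) AS lifetime_spend
FROM customers
ORDER BY lifetime_spend DESC;
```

One design note: the per-row version runs the aggregation once for every customer, which is fine at this scale but worth rethinking as a JOIN on millions of rows.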

Indexes are a performance tool most beginners ignore until their queries start timing out on real data:

CREATE INDEX IF NOT EXISTS idx_sales_product_id ON sales(product_id);

This tells PostgreSQL to build a sorted lookup structure on that column. JOINs and WHERE filters on indexed columns can run orders of magnitude faster. The trade-off is slightly slower inserts and updates, since the index must be maintained, but for read-heavy analytical workloads, that trade is almost always worth it.
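You can ask the planner whether it actually uses the index with EXPLAIN. Treat the expected plan as a sketch; the output varies with data size and PostgreSQL version:

```sql
-- Ask the planner how it will execute the query
EXPLAIN ANALYZE
SELECT * FROM sales
WHERE product_id = 1;
-- Look for "Index Scan using idx_sales_product_id" in the plan;
-- on tiny tables the planner may still prefer a sequential scan.
```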

Views are saved queries that behave like tables, great for hiding complexity:

CREATE OR REPLACE VIEW product_revenue AS
SELECT product_id, SUM(total_amount) AS total_revenue
FROM sales
GROUP BY product_id;

And transactions are how you protect your data when multiple things need to happen together:

BEGIN;
    INSERT INTO sales VALUES (1000, 1, 1, 2, CURRENT_DATE, 1999.98);
    UPDATE products SET stock_quantity = stock_quantity - 2 WHERE product_id = 1;
COMMIT;

Either both statements execute, or neither does. A sale gets recorded and stock gets updated or nothing changes at all. That guarantee matters more than most people realise until the day a partial update corrupts their production data.
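The flip side of COMMIT is ROLLBACK, which undoes everything since BEGIN. A sketch of the failure path, using illustrative values consistent with the sale above:

```sql
BEGIN;
    INSERT INTO sales VALUES (1001, 2, 1, 5, CURRENT_DATE, 4999.95);
    UPDATE products SET stock_quantity = stock_quantity - 5 WHERE product_id = 1;
    -- Changed your mind, or a validation check failed? Undo both statements:
ROLLBACK;
```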


On Dirty Data — It Will Humble You

No matter how well-designed a database is, real-world data finds a way to be messy. NULLs show up where you did not expect them. Phone numbers are empty strings instead of NULL. Emails have trailing spaces or inconsistent capitalisation. These things happen constantly.

SQL gives you the tools:

SELECT COALESCE(email, 'No Email Provided') AS email FROM customers;
SELECT LOWER(email) AS email FROM customers;
SELECT TRIM(first_name) AS clean_name FROM customers;
SELECT CASE WHEN phone_number = '' THEN NULL ELSE phone_number END AS phone FROM customers;

This is not glamorous work. But dirty data going into an aggregation means wrong numbers going into a report, which means bad decisions being made somewhere upstream. Treating data cleaning seriously is not optional; it is part of the craft.
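In practice those fixes get combined into one cleaning pass. A sketch, using NULLIF as a shorthand for the CASE expression above (NULLIF(a, b) returns NULL when the two arguments are equal):

```sql
-- One pass over customers that normalises the three messy columns
SELECT
    TRIM(first_name)                   AS first_name,
    LOWER(TRIM(COALESCE(email, '')))   AS email,
    NULLIF(phone_number, '')           AS phone_number
FROM customers;
```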


Three Bugs Worth Learning From

These mistakes show up constantly, even in experienced developers' work:

Forgetting to include non-aggregated columns in GROUP BY:

-- Breaks
SELECT product_id, product_name, SUM(total_amount) FROM sales GROUP BY product_id;

-- Works
SELECT s.product_id, p.product_name, SUM(s.total_amount)
FROM sales s
JOIN products p ON s.product_id = p.product_id
GROUP BY s.product_id, p.product_name;

Using WHERE to filter an aggregate result:

-- Breaks
SELECT product_id, SUM(total_amount) FROM sales WHERE SUM(total_amount) > 1000 GROUP BY product_id;

-- Works
SELECT product_id, SUM(total_amount) FROM sales
GROUP BY product_id HAVING SUM(total_amount) > 1000;

Getting the JOIN condition backwards:

-- Wrong — no error thrown, just nonsense results
JOIN products p ON s.customer_id = p.product_id;

-- Right
JOIN products p ON s.product_id = p.product_id;

That last one is especially sneaky. It does not crash. It just quietly returns results that make no logical sense, and you might not notice until someone is staring at a report asking why the numbers look strange.


Where to Go From Here

SQL has been around since the 1970s, and there is a good reason it has outlasted almost every technology trend that has come and gone in that time. It is precise, it is expressive, and it is built around a simple idea: your data has answers in it, and SQL is how you ask for them.

The gap between someone who is just starting and someone who is genuinely good at this is not really about syntax. Syntax is easy to look up. The gap is in how you think. Whether you can take a business question, break it into logical steps, and translate each one into a query. That kind of thinking builds up with practice, with reading other people's SQL, and honestly with making enough mistakes that the right patterns start to feel natural.

Get comfortable with JOINs until they stop feeling complicated. Then go learn window functions properly; that is probably the single investment in SQL that pays off fastest. Everything else tends to fall into place after that.

The data is already there. It is just waiting to be asked the right questions.


Written by *Collins Njeru* — Nairobi, Kenya.
