You know the basics. SELECT, WHERE, ORDER BY, LIMIT.
Now let's talk about the queries that actually answer business questions.
Not "give me these rows." But "how many customers bought more than twice last month?" Or "what is the average order value per city?" Or "which product category drives the most revenue?"
These questions require aggregation. Collapsing many rows into summary numbers. And SQL's aggregation tools are precise, fast, and more powerful than they first appear.
The Five Core Aggregate Functions
You saw these briefly in the last post. Now they get the full treatment.
import sqlite3
import pandas as pd
conn = sqlite3.connect("sales.db")
cursor = conn.cursor()
cursor.executescript("""
DROP TABLE IF EXISTS orders;
DROP TABLE IF EXISTS customers;
CREATE TABLE customers (
customer_id INTEGER PRIMARY KEY,
name TEXT,
city TEXT,
signup_date TEXT
);
CREATE TABLE orders (
order_id INTEGER PRIMARY KEY,
customer_id INTEGER,
product TEXT,
category TEXT,
amount REAL,
quantity INTEGER,
order_date TEXT,
status TEXT
);
INSERT INTO customers VALUES
(1, 'Alex', 'Mumbai', '2022-01-10'),
(2, 'Priya', 'Delhi', '2021-06-15'),
(3, 'Sam', 'Bangalore', '2023-03-01'),
(4, 'Jordan', 'Mumbai', '2020-11-22'),
(5, 'Lisa', 'Chennai', '2022-08-30'),
(6, 'Ravi', 'Delhi', '2021-02-14'),
(7, 'Tom', 'Bangalore', '2023-07-05');
INSERT INTO orders VALUES
(1, 1, 'Laptop', 'Electronics', 75000, 1, '2024-01-05', 'completed'),
(2, 2, 'Phone', 'Electronics', 45000, 1, '2024-01-08', 'completed'),
(3, 1, 'Headphones', 'Electronics', 3500, 2, '2024-01-12', 'completed'),
(4, 3, 'Shirt', 'Clothing', 800, 3, '2024-01-15', 'completed'),
(5, 4, 'Laptop', 'Electronics', 75000, 1, '2024-01-18', 'completed'),
(6, 2, 'Jeans', 'Clothing', 1500, 2, '2024-01-20', 'completed'),
(7, 5, 'Phone', 'Electronics', 45000, 1, '2024-01-22', 'cancelled'),
(8, 1, 'Tablet', 'Electronics', 32000, 1, '2024-02-01', 'completed'),
(9, 6, 'Headphones', 'Electronics', 3500, 1, '2024-02-05', 'completed'),
(10, 3, 'Watch', 'Accessories', 12000, 1, '2024-02-08', 'completed'),
(11, 4, 'Shirt', 'Clothing', 800, 5, '2024-02-10', 'completed'),
(12, 7, 'Laptop', 'Electronics', 75000, 1, '2024-02-14', 'completed'),
(13, 2, 'Tablet', 'Electronics', 32000, 1, '2024-02-18', 'refunded'),
(14, 6, 'Jeans', 'Clothing', 1500, 3, '2024-02-20', 'completed'),
(15, 5, 'Watch', 'Accessories', 12000, 2, '2024-02-25', 'completed');
""")
conn.commit()
def q(sql):
    return pd.read_sql_query(sql, conn)
Now query it.
result = q("""
SELECT
COUNT(*) AS total_orders,
COUNT(DISTINCT customer_id) AS unique_customers,
SUM(amount) AS total_revenue,
AVG(amount) AS avg_order_value,
MIN(amount) AS smallest_order,
MAX(amount) AS largest_order
FROM orders
WHERE status = 'completed';
""")
print(result.round(2))
Output:
   total_orders  unique_customers  total_revenue  avg_order_value  smallest_order  largest_order
0            13                 7       337600.0         25969.23           800.0        75000.0
COUNT(*) counts all rows. COUNT(DISTINCT customer_id) counts unique values. The difference between the two is the number of repeat orders: every order beyond a customer's first.
SUM(amount) is total revenue. AVG(amount) is the average order value, a key business metric. MIN and MAX show the range.
GROUP BY: Where Aggregation Gets Powerful
Without GROUP BY, aggregates collapse the entire table to one row. With GROUP BY, they collapse within each group.
result = q("""
SELECT
category,
COUNT(*) AS orders,
SUM(amount) AS revenue,
AVG(amount) AS avg_order,
SUM(quantity) AS units_sold
FROM orders
WHERE status = 'completed'
GROUP BY category
ORDER BY revenue DESC;
""")
print(result.round(2))
Output:
      category  orders   revenue  avg_order  units_sold
0  Electronics       7  309000.0   44142.86           8
1  Accessories       2   24000.0   12000.00           3
2     Clothing       4    4600.0    1150.00          13
Three categories, three very different profiles. Electronics drives over 90% of revenue by value, but Clothing moves the most units. One category earns the money; another drives volume. That distinction matters for strategy.
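To double-check that 90% figure directly, you can divide each category's revenue by the grand total using a scalar subquery. A minimal sketch against a throwaway in-memory table holding just the completed orders' category and amount (the same numbers as the inserts above):

```python
import sqlite3

# Throwaway in-memory table: category and amount of the 13 completed orders
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("Electronics", 75000), ("Electronics", 45000), ("Electronics", 3500),
     ("Clothing", 800), ("Electronics", 75000), ("Clothing", 1500),
     ("Electronics", 32000), ("Electronics", 3500), ("Accessories", 12000),
     ("Clothing", 800), ("Electronics", 75000), ("Clothing", 1500),
     ("Accessories", 12000)],
)

# The scalar subquery supplies the grand total for the share calculation
rows = conn.execute("""
    SELECT category,
           SUM(amount) AS revenue,
           ROUND(100.0 * SUM(amount) / (SELECT SUM(amount) FROM orders), 1) AS share_pct
    FROM orders
    GROUP BY category
    ORDER BY revenue DESC;
""").fetchall()
for row in rows:
    print(row)
```

The shares sum to 100, which is a useful sanity check in itself.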
Grouping by Multiple Columns
result = q("""
SELECT
category,
status,
COUNT(*) AS orders,
SUM(amount) AS revenue
FROM orders
GROUP BY category, status
ORDER BY category, revenue DESC;
""")
print(result)
Output:
      category     status  orders   revenue
0  Accessories  completed       2   24000.0
1     Clothing  completed       4    4600.0
2  Electronics  completed       7  309000.0
3  Electronics  cancelled       1   45000.0
4  Electronics   refunded       1   32000.0
Every unique combination of category and status becomes one row. Now you can see that one Electronics order was cancelled (45000 rupees lost) and one was refunded. Clothing had zero cancellations.
HAVING: Filtering Groups
WHERE cannot filter aggregated values. HAVING can.
result = q("""
SELECT
customer_id,
COUNT(*) AS order_count,
SUM(amount) AS total_spent
FROM orders
WHERE status = 'completed'
GROUP BY customer_id
HAVING COUNT(*) >= 2
ORDER BY total_spent DESC;
""")
print(result)
Output:
   customer_id  order_count  total_spent
0            1            3     110500.0
1            4            2      75800.0
2            2            2      46500.0
3            3            2      12800.0
4            6            2       5000.0
Only customers with two or more completed orders. WHERE filtered out cancelled orders before grouping. HAVING filtered out customers with fewer than two orders after grouping.
The order of operations matters. Remember: FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY, LIMIT.
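The engine enforces this ordering. Putting an aggregate in WHERE fails outright, while the identical condition in HAVING works. A small sketch with a throwaway in-memory table (the column names here are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 100), (1, 200), (2, 50)])

# WHERE is evaluated before GROUP BY, so an aggregate inside it is an error
where_failed = False
try:
    conn.execute("SELECT customer_id FROM orders "
                 "WHERE COUNT(*) >= 2 GROUP BY customer_id").fetchall()
except sqlite3.OperationalError as exc:
    where_failed = True
    print("WHERE rejected the aggregate:", exc)

# HAVING is evaluated after GROUP BY, so the same condition works there
rows = conn.execute("SELECT customer_id, COUNT(*) FROM orders "
                    "GROUP BY customer_id HAVING COUNT(*) >= 2").fetchall()
print(rows)
```

Only customer 1, with two orders, survives the HAVING filter.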
Aliases and Calculated Columns in Aggregations
result = q("""
SELECT
customer_id,
COUNT(*) AS total_orders,
SUM(amount) AS total_spent,
ROUND(AVG(amount), 2) AS avg_order_value,
MAX(order_date) AS last_order_date,
SUM(amount) / COUNT(*) AS spend_per_order
FROM orders
WHERE status = 'completed'
GROUP BY customer_id
ORDER BY total_spent DESC;
""")
print(result)
Output:
customer_id total_orders total_spent avg_order_value last_order_date spend_per_order
0 1 3 110500.0 36833.33 2024-02-01 36833.333333
1 4 2 75800.0 37900.00 2024-02-10 37900.000000
2 7 1 75000.0 75000.00 2024-02-14 75000.000000
...
ROUND(AVG(amount), 2) rounds to two decimal places. SUM(amount) / COUNT(*) computes the same average by hand. Redundant here, but the manual form pays off when the numerator and denominator need different conditions, which a plain AVG cannot express.
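Here is that manual-average pattern doing something AVG alone cannot: averaging only the completed orders while the query still scans every row. A sketch with a throwaway table (a CASE expression inside AVG would work too):

```python
import sqlite3

# Throwaway table: two completed orders and one cancelled one
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (amount REAL, status TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(100, "completed"), (300, "completed"), (500, "cancelled")])

row = conn.execute("""
    SELECT
        AVG(amount) AS avg_all_statuses,
        SUM(CASE WHEN status = 'completed' THEN amount END) * 1.0
          / COUNT(CASE WHEN status = 'completed' THEN 1 END) AS avg_completed
    FROM orders;
""").fetchone()
print(row)
```

The CASE without an ELSE yields NULL for non-matching rows, and both SUM and COUNT skip NULLs, so the division averages exactly the completed orders.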
COUNT Gotchas
COUNT(*) and COUNT(column) behave differently when NULLs are present.
cursor.execute("ALTER TABLE orders ADD COLUMN discount REAL;")
cursor.execute("UPDATE orders SET discount = 500 WHERE order_id IN (1, 3, 8);")
conn.commit()
result = q("""
SELECT
COUNT(*) AS all_rows,
COUNT(discount) AS rows_with_discount
FROM orders;
""")
print(result)
Output:
all_rows rows_with_discount
0 15 3
COUNT(*) counted all 15 rows. COUNT(discount) counted only the 3 rows where discount is not NULL. This is how you count non-null values in a specific column.
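The NULL-skipping is not unique to COUNT. AVG also ignores NULLs, so it divides by the number of non-null values rather than the row count, which can silently change the answer. A quick sketch with throwaway data:

```python
import sqlite3

# Throwaway table: two discounted orders, two with NULL discount
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, discount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 500), (2, None), (3, 500), (4, None)])

row = conn.execute("""
    SELECT AVG(discount)                  AS avg_non_null,   -- divides by 2, not 4
           SUM(discount) * 1.0 / COUNT(*) AS avg_all_rows    -- divides by the row count
    FROM orders;
""").fetchone()
print(row)
```

Decide which denominator answers your business question before reaching for AVG.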
Aggregating With CASE WHEN
CASE WHEN inside an aggregate lets you count or sum conditionally.
result = q("""
SELECT
category,
COUNT(*) AS total_orders,
SUM(CASE WHEN status = 'completed' THEN 1 ELSE 0 END) AS completed,
SUM(CASE WHEN status = 'cancelled' THEN 1 ELSE 0 END) AS cancelled,
SUM(CASE WHEN status = 'refunded' THEN 1 ELSE 0 END) AS refunded,
ROUND(
100.0 * SUM(CASE WHEN status = 'completed' THEN 1 ELSE 0 END) / COUNT(*),
1
) AS completion_rate_pct
FROM orders
GROUP BY category
ORDER BY total_orders DESC;
""")
print(result)
Output:
      category  total_orders  completed  cancelled  refunded  completion_rate_pct
0  Electronics             9          7          1         1                 77.8
1     Clothing             4          4          0         0                100.0
2  Accessories             2          2          0         0                100.0
Conditional aggregation. Count only the rows that match a condition within a group. This is one of the most useful SQL patterns for building business dashboards. You get a pivot-like table without any pivot syntax.
GROUP BY With Date Parts
Time-based aggregation is essential for trend analysis.
result = q("""
SELECT
strftime('%Y-%m', order_date) AS month,
COUNT(*) AS orders,
SUM(amount) AS revenue,
COUNT(DISTINCT customer_id) AS unique_customers
FROM orders
WHERE status = 'completed'
GROUP BY strftime('%Y-%m', order_date)
ORDER BY month;
""")
print(result)
Output:
     month  orders   revenue  unique_customers
0  2024-01       6  200800.0                 4
1  2024-02       7  136800.0                 6
Month by month revenue and customer count. strftime('%Y-%m', date_column) extracts year and month in SQLite. Other databases use DATE_TRUNC, MONTH(), YEAR() or similar functions. The concept is the same.
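A few more SQLite strftime codes worth knowing. They run on bare date literals, so you can test them without a table; note that strftime returns text, not numbers:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Common strftime format codes in SQLite (results are strings)
row = conn.execute("""
    SELECT strftime('%Y', '2024-02-14') AS year,
           strftime('%m', '2024-02-14') AS month,
           strftime('%d', '2024-02-14') AS day,
           strftime('%w', '2024-02-14') AS weekday  -- '0' = Sunday .. '6' = Saturday
""").fetchone()
print(row)
```

Because the results are text, sorting by them works for zero-padded parts like '%m', but compare them as numbers (CAST or +0) if you do arithmetic.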
Putting It Together: A Real Business Query
result = q("""
SELECT
c.city,
COUNT(DISTINCT o.customer_id) AS customers,
COUNT(o.order_id) AS total_orders,
SUM(CASE WHEN o.status = 'completed'
THEN o.amount ELSE 0 END) AS completed_revenue,
ROUND(
100.0 * SUM(CASE WHEN o.status = 'completed' THEN 1 ELSE 0 END)
/ COUNT(*), 1
) AS completion_rate,
ROUND(
SUM(CASE WHEN o.status = 'completed' THEN o.amount ELSE 0 END)
/ COUNT(DISTINCT o.customer_id), 0
) AS revenue_per_customer
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
GROUP BY c.city
HAVING COUNT(o.order_id) >= 2
ORDER BY completed_revenue DESC;
""")
print(result)
Output:
        city  customers  total_orders  completed_revenue  completion_rate  revenue_per_customer
0     Mumbai          2             5           186300.0            100.0               93150.0
1  Bangalore          2             3            87800.0            100.0               43900.0
2      Delhi          2             5            51500.0             80.0               25750.0
3    Chennai          1             2            12000.0             50.0               12000.0
City-level business metrics in one query. Revenue, customer count, completion rate, revenue per customer. This is the kind of query that goes into a sales dashboard.
You will notice JOIN customers c ON o.customer_id = c.customer_id. That is covered in the next post. For now, know that it connects the two tables through their shared key.
SQL Aggregation vs Pandas GroupBy
You have done this in Pandas already. Here is the same query both ways.
# SQL
result_sql = q("""
SELECT category, SUM(amount) AS revenue, COUNT(*) AS orders
FROM orders WHERE status = 'completed'
GROUP BY category ORDER BY revenue DESC;
""")
# Pandas equivalent
orders_df = q("SELECT * FROM orders WHERE status = 'completed'")
result_pandas = (
orders_df.groupby("category")
.agg(revenue=("amount", "sum"), orders=("order_id", "count"))
.sort_values("revenue", ascending=False)
.reset_index()
)
print("SQL result:")
print(result_sql)
print("\nPandas result:")
print(result_pandas)
Both give the same answer. Use SQL when the data lives in a database and you want to aggregate before loading into Python. Use Pandas when the data is already in memory or you need complex transformations SQL cannot do easily.
In production data work, you will use both. SQL to query and pre-aggregate from the database, Pandas to finish the analysis in Python.
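That division of labor looks like this in practice: aggregate in the database, then finish the metric in pandas. A minimal sketch with a throwaway in-memory table (the point is the shape of the pattern, not the numbers):

```python
import sqlite3
import pandas as pd

# Throwaway in-memory table standing in for a large production database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (category TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("Electronics", 75000), ("Electronics", 45000),
                  ("Clothing", 800), ("Accessories", 12000)])

# Step 1: the database does the heavy lifting and ships back three rows, not millions
summary = pd.read_sql_query(
    "SELECT category, SUM(amount) AS revenue FROM orders GROUP BY category", conn)

# Step 2: pandas finishes the analysis in memory
summary["share_pct"] = (100 * summary["revenue"] / summary["revenue"].sum()).round(1)
print(summary.sort_values("revenue", ascending=False))
```

On real data the payoff is that only the small aggregated result crosses the wire into Python.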
A Resource Worth Knowing
SQLZoo at sqlzoo.net has interactive SQL exercises that run in the browser, no setup needed. The aggregation section has progressively harder questions that test exactly what this post covers. It is free and immediately available. Go there after reading this post and do the aggregation exercises; ten minutes of practice there is worth more than re-reading this post.
Try This
Create sql_aggregations_practice.py.
Use the sales database from this post.
Write a query that shows monthly revenue trend. For each month, show: month, number of orders, total revenue, average order value, and number of unique customers. Sort chronologically.
Write a query that finds the top 3 customers by total spending. Show their customer_id, total orders, total spent, and date of most recent order.
Using CASE WHEN, write a query that categorizes each customer as "high value" (spent over 50000), "medium value" (10000-50000), or "low value" (under 10000) and counts how many customers fall into each category.
Find products that appear in more than one order. Show product name, number of times ordered, total units sold, and total revenue. Order by total revenue descending.
Write a query that shows the day of the week with the highest average order value. Use strftime('%w', order_date) to get the day number (0=Sunday, 6=Saturday).
What's Next
You can query and aggregate one table confidently now. The next post is SQL joins, connecting multiple tables through shared keys to answer questions that span your entire database. This is where SQL becomes genuinely powerful for data work.