You know the basics. SELECT, WHERE, ORDER BY, LIMIT.
Now let's talk about the queries that actually answer business questions.
Not "give me these rows." But "how many customers bought more than twice last month?" Or "what is the average order value per city?" Or "which product category drives the most revenue?"
These questions require aggregation. Collapsing many rows into summary numbers. And SQL's aggregation tools are precise, fast, and more powerful than they first appear.
The Five Core Aggregate Functions
You saw these briefly in the last post. Now they get the full treatment.
import sqlite3
import pandas as pd
conn = sqlite3.connect("sales.db")
cursor = conn.cursor()
cursor.executescript("""
DROP TABLE IF EXISTS orders;
DROP TABLE IF EXISTS customers;
CREATE TABLE customers (
customer_id INTEGER PRIMARY KEY,
name TEXT,
city TEXT,
signup_date TEXT
);
CREATE TABLE orders (
order_id INTEGER PRIMARY KEY,
customer_id INTEGER,
product TEXT,
category TEXT,
amount REAL,
quantity INTEGER,
order_date TEXT,
status TEXT
);
INSERT INTO customers VALUES
(1, 'Alex', 'Mumbai', '2022-01-10'),
(2, 'Priya', 'Delhi', '2021-06-15'),
(3, 'Sam', 'Bangalore', '2023-03-01'),
(4, 'Jordan', 'Mumbai', '2020-11-22'),
(5, 'Lisa', 'Chennai', '2022-08-30'),
(6, 'Ravi', 'Delhi', '2021-02-14'),
(7, 'Tom', 'Bangalore', '2023-07-05');
INSERT INTO orders VALUES
(1, 1, 'Laptop', 'Electronics', 75000, 1, '2024-01-05', 'completed'),
(2, 2, 'Phone', 'Electronics', 45000, 1, '2024-01-08', 'completed'),
(3, 1, 'Headphones', 'Electronics', 3500, 2, '2024-01-12', 'completed'),
(4, 3, 'Shirt', 'Clothing', 800, 3, '2024-01-15', 'completed'),
(5, 4, 'Laptop', 'Electronics', 75000, 1, '2024-01-18', 'completed'),
(6, 2, 'Jeans', 'Clothing', 1500, 2, '2024-01-20', 'completed'),
(7, 5, 'Phone', 'Electronics', 45000, 1, '2024-01-22', 'cancelled'),
(8, 1, 'Tablet', 'Electronics', 32000, 1, '2024-02-01', 'completed'),
(9, 6, 'Headphones', 'Electronics', 3500, 1, '2024-02-05', 'completed'),
(10, 3, 'Watch', 'Accessories', 12000, 1, '2024-02-08', 'completed'),
(11, 4, 'Shirt', 'Clothing', 800, 5, '2024-02-10', 'completed'),
(12, 7, 'Laptop', 'Electronics', 75000, 1, '2024-02-14', 'completed'),
(13, 2, 'Tablet', 'Electronics', 32000, 1, '2024-02-18', 'refunded'),
(14, 6, 'Jeans', 'Clothing', 1500, 3, '2024-02-20', 'completed'),
(15, 5, 'Watch', 'Accessories', 12000, 2, '2024-02-25', 'completed');
""")
conn.commit()
def q(sql):
    return pd.read_sql_query(sql, conn)
Now query it.
result = q("""
SELECT
COUNT(*) AS total_orders,
COUNT(DISTINCT customer_id) AS unique_customers,
SUM(amount) AS total_revenue,
AVG(amount) AS avg_order_value,
MIN(amount) AS smallest_order,
MAX(amount) AS largest_order
FROM orders
WHERE status = 'completed';
""")
print(result.round(2))
Output:
   total_orders  unique_customers  total_revenue  avg_order_value  smallest_order  largest_order
0            13                 7       337600.0         25969.23           800.0        75000.0
COUNT(*) counts all rows. COUNT(DISTINCT customer_id) counts unique values. The difference between the two is the number of repeat orders: every order beyond a customer's first.
SUM(amount) is total revenue. AVG(amount) is the average order value, a key business metric. MIN and MAX show the range.
GROUP BY: Where Aggregation Gets Powerful
Without GROUP BY, aggregates collapse the entire table to one row. With GROUP BY, they collapse within each group.
result = q("""
SELECT
category,
COUNT(*) AS orders,
SUM(amount) AS revenue,
AVG(amount) AS avg_order,
SUM(quantity) AS units_sold
FROM orders
WHERE status = 'completed'
GROUP BY category
ORDER BY revenue DESC;
""")
print(result.round(2))
Output:
      category  orders   revenue  avg_order  units_sold
0  Electronics       7  309000.0   44142.86           8
1  Accessories       2   24000.0   12000.00           3
2     Clothing       4    4600.0    1150.00          13
Three categories, three very different profiles. Electronics drives over 90% of revenue by value, but Clothing moves the most units. One category earns the money; another drives volume. That distinction matters for strategy.
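To double-check that 90% figure directly, you can divide each category's revenue by the grand total using a scalar subquery. A minimal sketch against a throwaway in-memory table holding just the completed orders' category and amount (the same numbers as the inserts above):

```python
import sqlite3

# Throwaway in-memory table: category and amount of the 13 completed orders
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("Electronics", 75000), ("Electronics", 45000), ("Electronics", 3500),
     ("Clothing", 800), ("Electronics", 75000), ("Clothing", 1500),
     ("Electronics", 32000), ("Electronics", 3500), ("Accessories", 12000),
     ("Clothing", 800), ("Electronics", 75000), ("Clothing", 1500),
     ("Accessories", 12000)],
)

# The scalar subquery supplies the grand total for the share calculation
rows = conn.execute("""
    SELECT category,
           SUM(amount) AS revenue,
           ROUND(100.0 * SUM(amount) / (SELECT SUM(amount) FROM orders), 1) AS share_pct
    FROM orders
    GROUP BY category
    ORDER BY revenue DESC;
""").fetchall()
for row in rows:
    print(row)
```

The shares sum to 100, which is a useful sanity check in itself.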
Grouping by Multiple Columns
result = q("""
SELECT
category,
status,
COUNT(*) AS orders,
SUM(amount) AS revenue
FROM orders
GROUP BY category, status
ORDER BY category, revenue DESC;
""")
print(result)
Output:
      category     status  orders   revenue
0  Accessories  completed       2   24000.0
1     Clothing  completed       4    4600.0
2  Electronics  completed       7  309000.0
3  Electronics  cancelled       1   45000.0
4  Electronics   refunded       1   32000.0
Every unique combination of category and status becomes one row. Now you can see that one Electronics order was cancelled (45000 rupees lost) and one was refunded. Clothing had zero cancellations.
HAVING: Filtering Groups
WHERE cannot filter aggregated values. HAVING can.
result = q("""
SELECT
customer_id,
COUNT(*) AS order_count,
SUM(amount) AS total_spent
FROM orders
WHERE status = 'completed'
GROUP BY customer_id
HAVING COUNT(*) >= 2
ORDER BY total_spent DESC;
""")
print(result)
Output:
   customer_id  order_count  total_spent
0            1            3     110500.0
1            4            2      75800.0
2            2            2      46500.0
3            3            2      12800.0
4            6            2       5000.0
Only customers with two or more completed orders. WHERE filtered out cancelled orders before grouping. HAVING filtered out customers with fewer than two orders after grouping.
The order of operations matters. Remember: FROM, WHERE, GROUP BY, HAVING, SELECT, ORDER BY, LIMIT.
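The engine enforces this ordering. Putting an aggregate in WHERE fails outright, while the identical condition in HAVING works. A small sketch with a throwaway in-memory table (the column names here are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 100), (1, 200), (2, 50)])

# WHERE is evaluated before GROUP BY, so an aggregate inside it is an error
where_failed = False
try:
    conn.execute("SELECT customer_id FROM orders "
                 "WHERE COUNT(*) >= 2 GROUP BY customer_id").fetchall()
except sqlite3.OperationalError as exc:
    where_failed = True
    print("WHERE rejected the aggregate:", exc)

# HAVING is evaluated after GROUP BY, so the same condition works there
rows = conn.execute("SELECT customer_id, COUNT(*) FROM orders "
                    "GROUP BY customer_id HAVING COUNT(*) >= 2").fetchall()
print(rows)
```

Only customer 1, with two orders, survives the HAVING filter.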
Aliases and Calculated Columns in Aggregations
result = q("""
SELECT
customer_id,
COUNT(*) AS total_orders,
SUM(amount) AS total_spent,
ROUND(AVG(amount), 2) AS avg_order_value,
MAX(order_date) AS last_order_date,
SUM(amount) / COUNT(*) AS spend_per_order
FROM orders
WHERE status = 'completed'
GROUP BY customer_id
ORDER BY total_spent DESC;
""")
print(result)
Output:
customer_id total_orders total_spent avg_order_value last_order_date spend_per_order
0 1 3 110500.0 36833.33 2024-02-01 36833.333333
1 4 2 75800.0 37900.00 2024-02-10 37900.000000
2 7 1 75000.0 75000.00 2024-02-14 75000.000000
...
ROUND(AVG(amount), 2) rounds to two decimal places. SUM(amount) / COUNT(*) computes the same average by hand. Redundant here, but the manual form pays off when the numerator and denominator need different conditions, which a plain AVG cannot express.
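Here is that manual-average pattern doing something AVG alone cannot: averaging only the completed orders while the query still scans every row. A sketch with a throwaway table (a CASE expression inside AVG would work too):

```python
import sqlite3

# Throwaway table: two completed orders and one cancelled one
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (amount REAL, status TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(100, "completed"), (300, "completed"), (500, "cancelled")])

row = conn.execute("""
    SELECT
        AVG(amount) AS avg_all_statuses,
        SUM(CASE WHEN status = 'completed' THEN amount END) * 1.0
          / COUNT(CASE WHEN status = 'completed' THEN 1 END) AS avg_completed
    FROM orders;
""").fetchone()
print(row)
```

The CASE without an ELSE yields NULL for non-matching rows, and both SUM and COUNT skip NULLs, so the division averages exactly the completed orders.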
COUNT Gotchas
COUNT(*) and COUNT(column) behave differently when NULLs are present.
cursor.execute("ALTER TABLE orders ADD COLUMN discount REAL;")
cursor.execute("UPDATE orders SET discount = 500 WHERE order_id IN (1, 3, 8);")
conn.commit()
result = q("""
SELECT
COUNT(*) AS all_rows,
COUNT(discount) AS rows_with_discount
FROM orders;
""")
print(result)
Output:
all_rows rows_with_discount
0 15 3
COUNT(*) counted all 15 rows. COUNT(discount) counted only the 3 rows where discount is not NULL. This is how you count non-null values in a specific column.
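The NULL-skipping is not unique to COUNT. AVG also ignores NULLs, so it divides by the number of non-null values rather than the row count, which can silently change the answer. A quick sketch with throwaway data:

```python
import sqlite3

# Throwaway table: two discounted orders, two with NULL discount
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, discount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 500), (2, None), (3, 500), (4, None)])

row = conn.execute("""
    SELECT AVG(discount)                  AS avg_non_null,   -- divides by 2, not 4
           SUM(discount) * 1.0 / COUNT(*) AS avg_all_rows    -- divides by the row count
    FROM orders;
""").fetchone()
print(row)
```

Decide which denominator answers your business question before reaching for AVG.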
Aggregating With CASE WHEN
CASE WHEN inside an aggregate lets you count or sum conditionally.
result = q("""
SELECT
category,
COUNT(*) AS total_orders,
SUM(CASE WHEN status = 'completed' THEN 1 ELSE 0 END) AS completed,
SUM(CASE WHEN status = 'cancelled' THEN 1 ELSE 0 END) AS cancelled,
SUM(CASE WHEN status = 'refunded' THEN 1 ELSE 0 END) AS refunded,
ROUND(
100.0 * SUM(CASE WHEN status = 'completed' THEN 1 ELSE 0 END) / COUNT(*),
1
) AS completion_rate_pct
FROM orders
GROUP BY category
ORDER BY total_orders DESC;
""")
print(result)
Output:
      category  total_orders  completed  cancelled  refunded  completion_rate_pct
0  Electronics             9          7          1         1                 77.8
1     Clothing             4          4          0         0                100.0
2  Accessories             2          2          0         0                100.0
Conditional aggregation. Count only the rows that match a condition within a group. This is one of the most useful SQL patterns for building business dashboards. You get a pivot-like table without any pivot syntax.
GROUP BY With Date Parts
Time-based aggregation is essential for trend analysis.
result = q("""
SELECT
strftime('%Y-%m', order_date) AS month,
COUNT(*) AS orders,
SUM(amount) AS revenue,
COUNT(DISTINCT customer_id) AS unique_customers
FROM orders
WHERE status = 'completed'
GROUP BY strftime('%Y-%m', order_date)
ORDER BY month;
""")
print(result)
Output:
     month  orders   revenue  unique_customers
0  2024-01       6  200800.0                 4
1  2024-02       7  136800.0                 6
Month by month revenue and customer count. strftime('%Y-%m', date_column) extracts year and month in SQLite. Other databases use DATE_TRUNC, MONTH(), YEAR() or similar functions. The concept is the same.
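A few more SQLite strftime codes worth knowing. They run on bare date literals, so you can test them without a table; note that strftime returns text, not numbers:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Common strftime format codes in SQLite (results are strings)
row = conn.execute("""
    SELECT strftime('%Y', '2024-02-14') AS year,
           strftime('%m', '2024-02-14') AS month,
           strftime('%d', '2024-02-14') AS day,
           strftime('%w', '2024-02-14') AS weekday  -- '0' = Sunday .. '6' = Saturday
""").fetchone()
print(row)
```

Because the results are text, sorting by them works for zero-padded parts like '%m', but compare them as numbers (CAST or +0) if you do arithmetic.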
Putting It Together: A Real Business Query
result = q("""
SELECT
c.city,
COUNT(DISTINCT o.customer_id) AS customers,
COUNT(o.order_id) AS total_orders,
SUM(CASE WHEN o.status = 'completed'
THEN o.amount ELSE 0 END) AS completed_revenue,
ROUND(
100.0 * SUM(CASE WHEN o.status = 'completed' THEN 1 ELSE 0 END)
/ COUNT(*), 1
) AS completion_rate,
ROUND(
SUM(CASE WHEN o.status = 'completed' THEN o.amount ELSE 0 END)
/ COUNT(DISTINCT o.customer_id), 0
) AS revenue_per_customer
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
GROUP BY c.city
HAVING COUNT(o.order_id) >= 2
ORDER BY completed_revenue DESC;
""")
print(result)
Output:
        city  customers  total_orders  completed_revenue  completion_rate  revenue_per_customer
0     Mumbai          2             5           186300.0            100.0               93150.0
1  Bangalore          2             3            87800.0            100.0               43900.0
2      Delhi          2             5            51500.0             80.0               25750.0
3    Chennai          1             2            12000.0             50.0               12000.0
City-level business metrics in one query. Revenue, customer count, completion rate, revenue per customer. This is the kind of query that goes into a sales dashboard.
You will notice JOIN customers c ON o.customer_id = c.customer_id. That is covered in the next post. For now, know that it connects the two tables through their shared key.
SQL Aggregation vs Pandas GroupBy
You have done this in Pandas already. Here is the same query both ways.
# SQL
result_sql = q("""
SELECT category, SUM(amount) AS revenue, COUNT(*) AS orders
FROM orders WHERE status = 'completed'
GROUP BY category ORDER BY revenue DESC;
""")
# Pandas equivalent
orders_df = q("SELECT * FROM orders WHERE status = 'completed'")
result_pandas = (
orders_df.groupby("category")
.agg(revenue=("amount", "sum"), orders=("order_id", "count"))
.sort_values("revenue", ascending=False)
.reset_index()
)
print("SQL result:")
print(result_sql)
print("\nPandas result:")
print(result_pandas)
Both give the same answer. Use SQL when the data lives in a database and you want to aggregate before loading into Python. Use Pandas when the data is already in memory or you need complex transformations SQL cannot do easily.
In production data work, you will use both. SQL to query and pre-aggregate from the database, Pandas to finish the analysis in Python.
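That division of labor looks like this in practice: aggregate in the database, then finish the metric in pandas. A minimal sketch with a throwaway in-memory table (the point is the shape of the pattern, not the numbers):

```python
import sqlite3
import pandas as pd

# Throwaway in-memory table standing in for a large production database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (category TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("Electronics", 75000), ("Electronics", 45000),
                  ("Clothing", 800), ("Accessories", 12000)])

# Step 1: the database does the heavy lifting and ships back three rows, not millions
summary = pd.read_sql_query(
    "SELECT category, SUM(amount) AS revenue FROM orders GROUP BY category", conn)

# Step 2: pandas finishes the analysis in memory
summary["share_pct"] = (100 * summary["revenue"] / summary["revenue"].sum()).round(1)
print(summary.sort_values("revenue", ascending=False))
```

On real data the payoff is that only the small aggregated result crosses the wire into Python.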
A Resource Worth Knowing
SQLZoo at sqlzoo.net has interactive SQL exercises that run in the browser, no setup needed. The aggregation section has progressively harder questions that test exactly what this post covers. It is free and immediately available. Go there after reading this post and do the aggregation exercises; ten minutes of practice there is worth more than re-reading this post.
Try This
Create sql_aggregations_practice.py.
Use the sales database from this post.
Write a query that shows monthly revenue trend. For each month, show: month, number of orders, total revenue, average order value, and number of unique customers. Sort chronologically.
Write a query that finds the top 3 customers by total spending. Show their customer_id, total orders, total spent, and date of most recent order.
Using CASE WHEN, write a query that categorizes each customer as "high value" (spent over 50000), "medium value" (10000-50000), or "low value" (under 10000) and counts how many customers fall into each category.
Find products that appear in more than one order. Show product name, number of times ordered, total units sold, and total revenue. Order by total revenue descending.
Write a query that shows the day of the week with the highest average order value. Use strftime('%w', order_date) to get the day number (0=Sunday, 6=Saturday).
What's Next
You can query and aggregate one table confidently now. The next post is SQL joins, connecting multiple tables through shared keys to answer questions that span your entire database. This is where SQL becomes genuinely powerful for data work.