Early in my SQL journey, subqueries were my go-to solution for tackling complex problems. They seemed like a simple and intuitive way to nest one query inside another to get the job done. However, as my queries grew more complex, this approach quickly became overwhelming. Subqueries started to feel messy—hard to read, challenging to maintain, and sometimes frustratingly slow. Debugging deeply nested queries, in particular, was a headache.
Moreover, at my current workplace, we do not use any ORM tools—instead, we rely entirely on raw SQL queries. This hands-on approach gave me a deeper understanding of SQL but also made the limitations of subqueries even more apparent. Managing intricate logic and transforming data with raw SQL often felt cumbersome. That’s when a colleague introduced me to Common Table Expressions (CTEs). Intrigued, I decided to give them a shot. It turns out that subqueries are not the only way to solve complex nested query problems.
In this article, I’ll share what I’ve explored about subqueries and CTEs, using a case study from my own experience to highlight their differences. Let’s dive in!
Subquery
Subqueries are queries nested within another SQL query, often used to perform intermediate calculations or data filtering. They are enclosed within parentheses and can be used in `SELECT`, `WHERE`, or `FROM` clauses.
```sql
SELECT column1, column2
FROM table_name
WHERE column1 IN (
    SELECT column1
    FROM another_table
    WHERE condition
);
```
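The example above nests the subquery in the `WHERE` clause; a subquery can also appear in the `SELECT` list as a scalar subquery that computes one value per outer row. A minimal sketch, using the same placeholder table and column names:

```sql
-- Scalar subquery in the SELECT list: returns one value per outer row.
-- Table and column names are placeholders, as in the example above.
SELECT
    t.column1,
    (
        SELECT COUNT(*)
        FROM another_table AS a
        WHERE a.column1 = t.column1
    ) AS related_count
FROM table_name AS t;
```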
CTE
Common Table Expressions (CTEs) are temporary named result sets defined within a SQL statement using the WITH keyword. They simplify complex queries by breaking them into smaller, reusable components, improving readability and maintainability.
```sql
WITH cte_name AS (
    SELECT column1, column2
    FROM table_name
    WHERE condition
)
SELECT *
FROM cte_name
WHERE other_condition;
```
CASE STUDY 👩💻
One of the case studies I encountered while working with a payment gateway involved managing financial data related to sellers, specifically their transactions and withdrawals. The system required calculations to provide insights into each seller's financial activity over time.
Problem Description:
The payment gateway database contains data for multiple sellers, and for each seller, there are two types of records:
- Transactions: Represent funds added to the seller's account (e.g., sales revenue or incoming payments).
- Withdrawals: Represent funds deducted from the seller's account (e.g., payouts or withdrawals to their bank account).
To understand the problem, let’s look at the structure of the tables involved, which I replicated from the actual case study:
Table: transactions
Columns:
- id (INT): Primary key for the transaction.
- seller_id (INT): Unique identifier for the seller.
- amount (DECIMAL): The amount of the transaction.
- created_at (BIGINT): The timestamp when the transaction occurred, stored as Unix epoch time in milliseconds.
Sample Data of transactions:
| id | seller_id | amount | created_at |
|-----|-----------|---------|-----------------|
| 1 | 101 | 500.00 | 1735687200000 |
| 2 | 102 | 300.00 | 1735690800000 |
| 3 | 101 | 200.00 | 1735773600000 |
Table: withdrawals
Columns:
- id (INT): Primary key for the withdrawal.
- seller_id (INT): Unique identifier for the seller.
- amount (DECIMAL): The amount of the withdrawal.
- created_at (BIGINT): The timestamp when the withdrawal occurred, stored as Unix epoch time in milliseconds.
Sample Data of withdrawals:
| id | seller_id | amount | created_at |
|-----|-----------|---------|-----------------|
| 1 | 101 | 100.00 | 1735694400000 |
| 2 | 102 | 50.00 | 1735766400000 |
| 3 | 101 | 150.00 | 1735852800000 |
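For readers who want to follow along, the two tables can be created with DDL along these lines (a sketch based on the column descriptions above; the original schema may differ in constraints and exact types):

```sql
-- A possible PostgreSQL DDL for the two tables described above.
CREATE TABLE transactions (
    id         SERIAL PRIMARY KEY,
    seller_id  INT NOT NULL,
    amount     DECIMAL(12, 2) NOT NULL,
    created_at BIGINT NOT NULL  -- Unix epoch time in milliseconds
);

CREATE TABLE withdrawals (
    id         SERIAL PRIMARY KEY,
    seller_id  INT NOT NULL,
    amount     DECIMAL(12, 2) NOT NULL,
    created_at BIGINT NOT NULL  -- Unix epoch time in milliseconds
);
```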
The task was to:
- Calculate the total balance for each seller, summing up all transactions and withdrawals.
- Determine the balance after each action for every seller, maintaining the order of actions chronologically.
- Classify each action as a debit or credit, treating transactions as credits and withdrawals as debits in the output.
Please note that PostgreSQL is being used for this analysis. In this case, we seeded the `transactions` and `withdrawals` tables with 50,000 records each using the `generate_series` function. The generated data helps us test and compare the performance of the subquery and CTE approaches in calculating the running balance for the selected `seller_id`.
```sql
INSERT INTO transactions (seller_id, amount, created_at)
SELECT
    FLOOR(RANDOM() * 100 + 1)::INT AS seller_id,
    ROUND((RANDOM() * (100000.00 - 1000.00) + 1000.00)::NUMERIC, 2) AS amount,
    (EXTRACT(EPOCH FROM NOW()) * 1000)::BIGINT - (RANDOM() * 1000000000)::BIGINT AS created_at
FROM generate_series(1, 50000);

INSERT INTO withdrawals (seller_id, amount, created_at)
SELECT
    FLOOR(RANDOM() * 100 + 1)::INT AS seller_id,
    ROUND((RANDOM() * (50000.00 - 1000.00) + 1000.00)::NUMERIC, 2) AS amount,
    (EXTRACT(EPOCH FROM NOW()) * 1000)::BIGINT - (RANDOM() * 1000000000)::BIGINT AS created_at
FROM generate_series(1, 50000);
```
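Since `created_at` is stored as epoch milliseconds, a quick way to sanity-check the seeded data is to render it as a readable timestamp with PostgreSQL's `to_timestamp` (a convenience query for inspection, not part of either solution):

```sql
-- Convert epoch milliseconds to a timestamptz for human-readable output.
SELECT
    id,
    seller_id,
    amount,
    to_timestamp(created_at / 1000.0) AS created_at_ts
FROM transactions
ORDER BY created_at
LIMIT 10;
```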
Now, let’s proceed to solve this using both approaches: subqueries and CTEs, to compare the differences.
Example Code with a Subquery
Here’s a query using a subquery approach to calculate the total balance and balance after each action:
```sql
SELECT
    t.seller_id,
    t.created_at,
    t.action_type,
    t.amount,
    (
        SELECT SUM(CASE WHEN action_type = 'credit' THEN amount ELSE -amount END)
        FROM (
            SELECT
                seller_id,
                created_at,
                amount,
                'credit' AS action_type
            FROM transactions
            WHERE seller_id = t.seller_id
            UNION ALL
            SELECT
                seller_id,
                created_at,
                amount,
                'debit' AS action_type
            FROM withdrawals
            WHERE seller_id = t.seller_id
        ) AS combined
        WHERE combined.seller_id = t.seller_id
          AND combined.created_at <= t.created_at
    ) AS balance_after
FROM (
    SELECT
        seller_id,
        created_at,
        amount,
        'credit' AS action_type
    FROM transactions
    WHERE seller_id = 1
    UNION ALL
    SELECT
        seller_id,
        created_at,
        -amount AS amount,
        'debit' AS action_type
    FROM withdrawals
    WHERE seller_id = 1
) AS t
ORDER BY t.created_at;
```
Code Explanation
In this subquery example, we calculate the running balance for each action (`credit` or `debit`) by combining transaction data and withdrawal data into a unified dataset. Here's how it works:
- Outer Query: Retrieves `seller_id`, `created_at`, `action_type`, and `amount`, and uses a correlated subquery to calculate the `balance_after` for each row.
- Inner Subquery: Combines records from the `transactions` and `withdrawals` tables using a `UNION ALL`. Each record is labeled as either a `credit` or `debit` based on its origin.
- Correlated Subquery: Computes the running balance up to the current row (`t.created_at`) by summing the `amount` values. A `CASE` statement differentiates between `credit` (positive) and `debit` (negative).
- Order and Filter: The outer query is filtered for `seller_id = 1` and ordered by `t.created_at`.
Result and Query Performance Analysis
The subquery approach accurately calculates the cumulative `balance_after` for each transaction or withdrawal. As shown in the result screenshot below, the query retrieves the `seller_id`, `created_at`, `action_type`, `amount`, and `balance_after` fields. Each row corresponds to an action (`credit` or `debit`) and its impact on the running balance.
The cumulative balance is computed correctly by summing the amounts based on the action type, using a correlated subquery for every row in the result set. However, when the query returns 1,034 rows, it takes 4.340 seconds to execute, highlighting performance concerns.
To analyze the performance of this query, `EXPLAIN ANALYZE` was executed. The query plan shows that the subquery is executed repeatedly for each row in the outer query, leading to a significant performance overhead. This is particularly evident when the dataset is large, as the repeated execution of the subquery introduces higher computational costs.
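For reference, such a plan is obtained by prefixing the statement with `EXPLAIN (ANALYZE, BUFFERS)`, which executes the query and reports actual timings and buffer usage per plan node; sketched here against a simplified query:

```sql
-- EXPLAIN (ANALYZE, BUFFERS) runs the query and reports per-node
-- execution times, row counts, and buffer hits/reads.
EXPLAIN (ANALYZE, BUFFERS)
SELECT seller_id, created_at, amount
FROM transactions
WHERE seller_id = 1
ORDER BY created_at;
```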
From the analysis, the key factors affecting performance include:
- Repeated Execution: The correlated subquery runs for each row in the outer query.
- Full Table Scans: If no indexes are available on `seller_id` and `created_at`, the database will perform full table scans for each subquery execution.
In summary, while the subquery approach produces the desired results, it may not be the most efficient solution for larger datasets or frequent execution.
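One common mitigation for the full table scans noted above is a composite index matching the correlated subquery's equality filter and range comparison; a sketch, assuming the schema from this case study:

```sql
-- Composite indexes on (seller_id, created_at) let the planner satisfy
-- the seller_id = ... filter and created_at <= ... comparison with an
-- index scan instead of a full table scan.
CREATE INDEX idx_transactions_seller_created
    ON transactions (seller_id, created_at);

CREATE INDEX idx_withdrawals_seller_created
    ON withdrawals (seller_id, created_at);
```

Even with these indexes, the correlated subquery still runs once per outer row, so indexing reduces, but does not eliminate, the overhead.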
Example Code with CTEs
Here’s the same solution rewritten using Common Table Expressions (CTEs):
```sql
WITH CombinedRecords AS (
    SELECT
        seller_id,
        created_at,
        amount,
        'credit' AS action_type
    FROM transactions
    WHERE seller_id = 1
    UNION ALL
    SELECT
        seller_id,
        created_at,
        -amount AS amount,
        'debit' AS action_type
    FROM withdrawals
    WHERE seller_id = 1
),
RunningBalance AS (
    SELECT
        seller_id,
        created_at,
        action_type,
        amount,
        SUM(amount) OVER (ORDER BY created_at) AS balance_after
    FROM CombinedRecords
)
SELECT
    seller_id,
    created_at,
    action_type,
    amount,
    balance_after
FROM RunningBalance
ORDER BY created_at;
```
Code Explanation
In the CTE example, the same logic is implemented in a more structured manner using two CTEs: `CombinedRecords` and `RunningBalance`.
- CombinedRecords CTE: Combines `transactions` and `withdrawals` into a unified dataset with a `credit` or `debit` label, similar to the subquery example.
- RunningBalance CTE: Uses a `SUM` window function with an `OVER` clause to calculate the cumulative balance (`balance_after`) for each row, ordered by `created_at`.
- Final Query: Selects the relevant columns (`seller_id`, `created_at`, `action_type`, `amount`, `balance_after`) from the `RunningBalance` CTE.
Result and Query Performance Analysis
The CTE-based approach efficiently calculates the cumulative `balance_after` for each transaction or withdrawal using a window function. The result, as shown in the screenshot below, displays the `seller_id`, `created_at`, `action_type`, `amount`, and `balance_after` fields, similar to the subquery approach. Each row is processed in chronological order, ensuring the running balance is calculated accurately. The query returns 1,034 rows for `seller_id = 1`.
The `CombinedRecords` CTE unifies the `transactions` and `withdrawals` tables, while the `RunningBalance` CTE leverages a `SUM` window function to calculate the cumulative balance efficiently.
To evaluate the performance of this query, `EXPLAIN ANALYZE` was executed as well. The query plan highlights a single-pass calculation of the cumulative sum, enabled by the window function, which processes the dataset far more efficiently than a correlated subquery.
Key observations include:
- Efficient Processing: The use of a window function ensures that the cumulative balance is computed in a single scan over the `CombinedRecords` dataset.
- Reduced Redundancy: Unlike the subquery approach, the window function avoids repeated execution, making the query scale better with larger datasets.
From the performance analysis, it's evident that, in this case, the CTE-based approach is significantly more efficient than the subquery method. This makes it a better choice for handling large datasets or meeting high-performance requirements, while also maintaining clear and modular query logic.
Conclusion
Both subquery and CTE-based approaches can achieve the desired result of calculating cumulative balances from combined transaction and withdrawal data in this case. However, their performance and readability differ significantly, particularly when working with larger datasets.
Subquery Approach:
Pros:
- Self-contained logic within a single query.
- Easier to implement for simple use cases.
Cons:
- Correlated subqueries can be inefficient because they execute repeatedly for each row in the outer query.
- Scalability issues with large datasets or frequent query execution.
CTE Approach:
Pros:
- Modular structure that separates logic into readable steps.
- Improved performance by leveraging window functions, which execute in a single pass over the data.
- Better scalability and efficiency for large datasets.
Cons:
- Requires a database system that supports CTEs and window functions (most modern systems, including PostgreSQL, SQL Server, and MySQL 8.0+, do).
Recommendation:
For scenarios where performance is critical, especially with larger datasets like the one in this case, the CTE-based approach is the clear winner thanks to its efficient handling of cumulative calculations. While the subquery method may suffice for smaller datasets or one-off queries, its limitations make it less suitable for production environments with high data volumes or strict performance requirements. PostgreSQL's support for CTEs and window functions makes it an ideal choice for implementing the CTE-based approach.