Early in my SQL journey, subqueries were my go-to solution for tackling complex problems. They seemed like a simple and intuitive way to nest one query inside another to get the job done. However, as my queries grew more complex, this approach quickly became overwhelming. Subqueries started to feel messy—hard to read, challenging to maintain, and sometimes frustratingly slow. Debugging deeply nested queries, in particular, was a headache.
Moreover, at my current workplace, we do not use any ORM tools—instead, we rely entirely on raw SQL queries. This hands-on approach gave me a deeper understanding of SQL but also made the limitations of subqueries even more apparent. Managing intricate logic and transforming data with raw SQL often felt cumbersome. That’s when a colleague introduced me to Common Table Expressions (CTEs). Intrigued, I decided to give them a shot. It turns out that subqueries are not the only way to solve complex nested query problems.
In this article, I’ll share what I’ve explored about subqueries and CTEs, using a case study from my own experience to highlight their differences. Let’s dive in!
Subquery
Subqueries are queries nested within another SQL query, often used to perform intermediate calculations or data filtering. They are enclosed within parentheses and can be used in `SELECT`, `WHERE`, or `FROM` clauses.
```sql
SELECT column1, column2
FROM table_name
WHERE column1 IN (
    SELECT column1
    FROM another_table
    WHERE condition
);
```
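The example above nests the subquery in the `WHERE` clause; a subquery can also appear in the `SELECT` list as a scalar subquery that computes one value per outer row. A minimal sketch, using the same placeholder table and column names:

```sql
-- Scalar subquery in the SELECT list: returns one value per outer row.
-- Table and column names are placeholders, as in the example above.
SELECT
    t.column1,
    (
        SELECT COUNT(*)
        FROM another_table AS a
        WHERE a.column1 = t.column1
    ) AS related_count
FROM table_name AS t;
```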
CTE
Common Table Expressions (CTEs) are temporary named result sets defined within a SQL statement using the WITH keyword. They simplify complex queries by breaking them into smaller, reusable components, improving readability and maintainability.
```sql
WITH cte_name AS (
    SELECT column1, column2
    FROM table_name
    WHERE condition
)
SELECT *
FROM cte_name
WHERE other_condition;
```
CASE STUDY 👩💻
One of the case studies I encountered while working with a payment gateway involved managing financial data related to sellers, specifically their transactions and withdrawals. The system required calculations to provide insights into each seller's financial activity over time.
Problem Description:
The payment gateway database contains data for multiple sellers, and for each seller, there are two types of records:
- Transactions: Represent funds added to the seller's account (e.g., sales revenue or incoming payments).
- Withdrawals: Represent funds deducted from the seller's account (e.g., payouts or withdrawals to their bank account).
To understand the problem, let’s look at the structure of the tables involved, which I replicated from the actual case study:
Table: transactions
Columns:
- id (INT): Primary key for the transaction.
- seller_id (INT): Unique identifier for the seller.
- amount (DECIMAL): The amount of the transaction.
- created_at (BIGINT): The timestamp when the transaction occurred, stored as Unix epoch time in milliseconds.
Sample Data of transactions:
| id | seller_id | amount | created_at |
|-----|-----------|---------|-----------------|
| 1 | 101 | 500.00 | 1735687200000 |
| 2 | 102 | 300.00 | 1735690800000 |
| 3 | 101 | 200.00 | 1735773600000 |
Table: withdrawals
Columns:
- id (INT): Primary key for the withdrawal.
- seller_id (INT): Unique identifier for the seller.
- amount (DECIMAL): The amount of the withdrawal.
- created_at (BIGINT): The timestamp when the withdrawal occurred, stored as Unix epoch time in milliseconds.
Sample Data of withdrawals:
| id | seller_id | amount | created_at |
|-----|-----------|---------|-----------------|
| 1 | 101 | 100.00 | 1735694400000 |
| 2 | 102 | 50.00 | 1735766400000 |
| 3 | 101 | 150.00 | 1735852800000 |
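For readers who want to follow along, the two tables can be created with DDL along these lines (a sketch based on the column descriptions above; the original schema may differ in constraints and exact types):

```sql
-- A possible PostgreSQL DDL for the two tables described above.
CREATE TABLE transactions (
    id         SERIAL PRIMARY KEY,
    seller_id  INT NOT NULL,
    amount     DECIMAL(12, 2) NOT NULL,
    created_at BIGINT NOT NULL  -- Unix epoch time in milliseconds
);

CREATE TABLE withdrawals (
    id         SERIAL PRIMARY KEY,
    seller_id  INT NOT NULL,
    amount     DECIMAL(12, 2) NOT NULL,
    created_at BIGINT NOT NULL  -- Unix epoch time in milliseconds
);
```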
The task was to:
- Calculate the total balance for each seller, summing up all transactions and withdrawals.
- Determine the balance after each action for every seller, maintaining the order of actions chronologically.
- Classify each action as a debit or credit, treating transactions as credits and withdrawals as debits in the output.
Please note that PostgreSQL is being used for this analysis. In this case, we seeded the `transactions` and `withdrawals` tables with 50,000 records each using the `generate_series` function. The generated data helps us test and compare the performance of the subquery and CTE approaches in calculating the running balance for the selected `seller_id`.
```sql
INSERT INTO transactions (seller_id, amount, created_at)
SELECT
    FLOOR(RANDOM() * 100 + 1)::INT AS seller_id,
    ROUND((RANDOM() * (100000.00 - 1000.00) + 1000.00)::NUMERIC, 2) AS amount,
    (EXTRACT(EPOCH FROM NOW()) * 1000)::BIGINT - (RANDOM() * 1000000000)::BIGINT AS created_at
FROM generate_series(1, 50000);

INSERT INTO withdrawals (seller_id, amount, created_at)
SELECT
    FLOOR(RANDOM() * 100 + 1)::INT AS seller_id,
    ROUND((RANDOM() * (50000.00 - 1000.00) + 1000.00)::NUMERIC, 2) AS amount,
    (EXTRACT(EPOCH FROM NOW()) * 1000)::BIGINT - (RANDOM() * 1000000000)::BIGINT AS created_at
FROM generate_series(1, 50000);
```
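Since `created_at` is stored as epoch milliseconds, a quick way to sanity-check the seeded data is to render it as a readable timestamp with PostgreSQL's `to_timestamp` (a convenience query for inspection, not part of either solution):

```sql
-- Convert epoch milliseconds to a timestamptz for human-readable output.
SELECT
    id,
    seller_id,
    amount,
    to_timestamp(created_at / 1000.0) AS created_at_ts
FROM transactions
ORDER BY created_at
LIMIT 10;
```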
Now, let’s proceed to solve this using both approaches: subqueries and CTEs, to compare the differences.
Example Code with a Subquery
Here’s a query using a subquery approach to calculate the total balance and balance after each action:
```sql
SELECT
    t.seller_id,
    t.created_at,
    t.action_type,
    t.amount,
    (
        SELECT SUM(CASE WHEN action_type = 'credit' THEN amount ELSE -amount END)
        FROM (
            SELECT
                seller_id,
                created_at,
                amount,
                'credit' AS action_type
            FROM transactions
            WHERE seller_id = t.seller_id
            UNION ALL
            SELECT
                seller_id,
                created_at,
                amount,
                'debit' AS action_type
            FROM withdrawals
            WHERE seller_id = t.seller_id
        ) AS combined
        WHERE combined.seller_id = t.seller_id
          AND combined.created_at <= t.created_at
    ) AS balance_after
FROM (
    SELECT
        seller_id,
        created_at,
        amount,
        'credit' AS action_type
    FROM transactions
    WHERE seller_id = 1
    UNION ALL
    SELECT
        seller_id,
        created_at,
        -amount AS amount,
        'debit' AS action_type
    FROM withdrawals
    WHERE seller_id = 1
) AS t
ORDER BY t.created_at;
```
Code Explanation
In this subquery example, we calculate the running balance for each action (`credit` or `debit`) by combining transaction data and withdrawal data into a unified dataset. Here's how it works:
- Outer Query: Retrieves `seller_id`, `created_at`, `action_type`, and `amount`, and uses a correlated subquery to calculate the `balance_after` for each row.
- Inner Subquery: Combines records from the `transactions` and `withdrawals` tables using a `UNION ALL`. Each record is labeled as either a `credit` or `debit` based on its origin.
- Correlated Subquery: Computes the running balance up to the current row (`t.created_at`) by summing the `amount` values. A `CASE` statement differentiates between `credit` (positive) and `debit` (negative).
- Order and Filter: The outer query is filtered for `seller_id = 1` and ordered by `t.created_at`.
Result and Query Performance Analysis
The subquery approach accurately calculates the cumulative `balance_after` for each transaction or withdrawal. As shown in the result screenshot below, the query retrieves the `seller_id`, `created_at`, `action_type`, `amount`, and `balance_after` fields. Each row corresponds to an action (`credit` or `debit`) and its impact on the running balance.
The cumulative balance is computed correctly by summing the amounts based on the action type, using a correlated subquery for every row in the result set. However, when the query returns 1,034 rows, it takes 4.340 seconds to execute, highlighting performance concerns.
To analyze the performance of this query, `EXPLAIN ANALYZE` was executed. The query plan shows that the subquery is executed repeatedly for each row in the outer query, leading to a significant performance overhead. This is particularly evident when the dataset is large, as the repeated execution of the subquery introduces higher computational costs.
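For reference, such a plan is obtained by prefixing the statement with `EXPLAIN (ANALYZE, BUFFERS)`, which executes the query and reports actual timings and buffer usage per plan node; sketched here against a simplified query:

```sql
-- EXPLAIN (ANALYZE, BUFFERS) runs the query and reports per-node
-- execution times, row counts, and buffer hits/reads.
EXPLAIN (ANALYZE, BUFFERS)
SELECT seller_id, created_at, amount
FROM transactions
WHERE seller_id = 1
ORDER BY created_at;
```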
From the analysis, the key factors affecting performance include:
- Repeated Execution: The correlated subquery runs for each row in the outer query.
- Full Table Scans: If no indexes are available on `seller_id` and `created_at`, the database will perform full table scans for each subquery execution.
In summary, while the subquery approach produces the desired results, it may not be the most efficient solution for larger datasets or frequent execution.
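One common mitigation for the full table scans noted above is a composite index matching the correlated subquery's equality filter and range comparison; a sketch, assuming the schema from this case study:

```sql
-- Composite indexes on (seller_id, created_at) let the planner satisfy
-- the seller_id = ... filter and created_at <= ... comparison with an
-- index scan instead of a full table scan.
CREATE INDEX idx_transactions_seller_created
    ON transactions (seller_id, created_at);

CREATE INDEX idx_withdrawals_seller_created
    ON withdrawals (seller_id, created_at);
```

Even with these indexes, the correlated subquery still runs once per outer row, so indexing reduces, but does not eliminate, the overhead.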
Example Code with CTEs
Here’s the same solution rewritten using Common Table Expressions (CTEs):
```sql
WITH CombinedRecords AS (
    SELECT
        seller_id,
        created_at,
        amount,
        'credit' AS action_type
    FROM transactions
    WHERE seller_id = 1
    UNION ALL
    SELECT
        seller_id,
        created_at,
        -amount AS amount,
        'debit' AS action_type
    FROM withdrawals
    WHERE seller_id = 1
),
RunningBalance AS (
    SELECT
        seller_id,
        created_at,
        action_type,
        amount,
        SUM(amount) OVER (ORDER BY created_at) AS balance_after
    FROM CombinedRecords
)
SELECT
    seller_id,
    created_at,
    action_type,
    amount,
    balance_after
FROM RunningBalance
ORDER BY created_at;
```
Code Explanation
In the CTE example, the same logic is implemented in a more structured manner using two CTEs: `CombinedRecords` and `RunningBalance`.
- CombinedRecords CTE: Combines `transactions` and `withdrawals` into a unified dataset with a `credit` or `debit` label, similar to the subquery example.
- RunningBalance CTE: Uses a `SUM` window function with an `OVER` clause to calculate the cumulative balance (`balance_after`) for each row, ordered by `created_at`.
- Final Query: Selects the relevant columns (`seller_id`, `created_at`, `action_type`, `amount`, `balance_after`) from the `RunningBalance` CTE.
Result and Query Performance Analysis
The CTE-based approach efficiently calculates the cumulative `balance_after` for each transaction or withdrawal using a window function. The result, as shown in the screenshot below, displays the `seller_id`, `created_at`, `action_type`, `amount`, and `balance_after` fields, similar to the subquery approach. Each row is processed in chronological order, ensuring the running balance is calculated accurately. The query returns 1,034 rows for `seller_id = 1`.
The `CombinedRecords` CTE unifies the `transactions` and `withdrawals` tables, while the `RunningBalance` CTE leverages a `SUM` window function to calculate the cumulative balance efficiently.
To evaluate the performance of this query, `EXPLAIN ANALYZE` was executed as well. The query plan highlights a single-pass calculation of the cumulative sum, enabled by the window function, which processes the dataset far more efficiently than a correlated subquery.
Key observations include:
- Efficient Processing: The use of a window function ensures that the cumulative balance is computed in a single scan over the `CombinedRecords` dataset.
- Reduced Redundancy: Unlike the subquery approach, the window function avoids repeated execution, making the query scale better with larger datasets.
From the performance analysis, it's evident that, in this case, the CTE-based approach is significantly more efficient than the subquery method. This makes it a better choice for handling large datasets or meeting high-performance requirements, while also maintaining clear and modular query logic.
Conclusion
Both subquery and CTE-based approaches can achieve the desired result of calculating cumulative balances from combined transaction and withdrawal data in this case. However, their performance and readability differ significantly, particularly when working with larger datasets.
Subquery Approach:
Pros:
- Self-contained logic within a single query.
- Easier to implement for simple use cases.
Cons:
- Correlated subqueries can be inefficient because they execute repeatedly for each row in the outer query.
- Scalability issues with large datasets or frequent query execution.
CTE Approach:
Pros:
- Modular structure that separates logic into readable steps.
- Improved performance by leveraging window functions, which execute in a single pass over the data.
- Better scalability and efficiency for large datasets.
Cons:
- Requires a database system that supports CTEs and window functions (most modern systems, including PostgreSQL, SQL Server, and MySQL 8.0+, do).
Recommendation:
For scenarios where performance is critical, especially with larger datasets like the one in this case, the CTE-based approach is the clear winner thanks to its efficient handling of cumulative calculations. While the subquery method may suffice for smaller datasets or one-off queries, its limitations make it less suitable for production environments with high data volumes or strict performance requirements. PostgreSQL's support for CTEs and window functions makes it an ideal choice for implementing the CTE-based approach.