Modern web applications deal with massive amounts of complex data, and relational databases are a popular choice for keeping everything organized. But even with structured data, queries can turn into nightmares when you need to retrieve large amounts of interconnected information.
Take a music streaming app that wants to retrieve song suggestions by combining a user's listening history, genre preferences, and favorite artists. The SQL query quickly becomes unwieldy with endless subqueries and WHERE
clauses, resulting in what's known as "spaghetti SQL" - code that's nearly impossible to read, maintain, or debug.
Common table expressions (CTEs) is one way to solve this problem. They let you break down complex queries into separate, manageable steps. Each step is a simple SQL query with clear results, labeled so you actually know what each part does. This can completely transform how readable and maintainable your queries become.
In this post, I'll show you how CTEs work in Postgres and how they can turn convoluted spaghetti SQL into clean, clearly defined statements. I’ll take you through practical demonstrations of CTEs used in a music streaming app. The demo app is hosted on Supabase, a backend service that offers a Postgres database for apps with built-in auth and edge functions. The CTEs will simplify logic used in some of the app’s queries. I'll also cover some advanced use cases and best practices.
What Are CTEs?
CTEs, defined in SQL using the WITH
clause, let you create one or more temporary named result sets that can be used within a single SELECT
, INSERT
, UPDATE
, or DELETE
statement. The expressions are executed in sequential order and can reference previous result sets in the same statement.
Below is an example of a SQL statement that utilizes a CTE to filter out archived song tracks, which is then used in the final query:
WITH undeleted_tracks AS (
SELECT track_id, title, artist_id, genre_id
FROM tracks
WHERE is_deleted = false
)
SELECT ut.title, a.name as artist_name, g.name as genre_name
FROM undeleted_tracks ut
INNER JOIN artists a ON a.artist_id = ut.artist_id
INNER JOIN genres g ON g.genre_id = ut.genre_id
First, a CTE is created to store a temporary collection called undeleted_tracks
with all the undeleted songs in the tracks
table. The CTE is then used in the final query. Notice you don't need to check if a track is deleted in the final query because the CTE already filtered them out.
Now, why would you use a CTE to achieve this? You could retrieve the same results using a predefined view or SQL subquery. CTEs, views, and subqueries do share some similarities; however, they have differences that make each appropriate for distinct scenarios:
- Subqueries are nested SQL queries placed within a larger query. These nested queries can span multiple levels, which makes it difficult to follow the logical flow of the data and understand how the query is constructed.
- Views are predefined SQL queries that are stored as persistent database objects. They are ideal for encapsulating query logic that spans multiple database tables and needs to be reused across different queries. Postgres lets you define a regular view, which runs its query each time it's accessed, and materialized views, which execute the query once and store the results for fast retrieval. Since materialized views store data, they require manual refreshes to update with the latest information.
- CTEs are temporary "views" that only exist for the execution of a SQL statement. They are effective at replacing complex, nested subqueries, improving the readability and maintainability of the overall query by breaking it down into logical, named steps.
With these distinctions in mind, CTEs can be invaluable for writing recursive queries, which repeatedly reference their own output to retrieve hierarchical or network data. Imagine trying to debug an SQL query that has several levels of embedded SELECT
statements, each dependent on the statement before it. The query's complexity obscures its intent, making it challenging to identify the underlying issue. In this case, CTEs' modularity makes the query's intent clear, which makes debugging and modifying the query a straightforward task as your application evolves.
Additionally, since much of Supabase's functionality is built on top of a Postgres database, you can use CTEs to query and manage your application's data if you're hosting your service there, as is the case with this music streaming app. This ensures that your backend remains easy to modify as requirements evolve and more features are implemented.
The following section will demonstrate how a complex query can be enhanced by replacing subqueries with CTEs.
The Problem with Complex SQL Queries
A key feature of a music streaming app is compiling personalized song recommendations based on the user's most-listened-to genre over the last month, excluding any songs they have already listened to or liked.
Without CTEs, a query to achieve this would consist of multiple subqueries:
SELECT t.title, ar.name AS artist_name
FROM tracks t
INNER JOIN artists ar ON t.artist_id = ar.artist_id
WHERE
t.track_id NOT IN (
-- Hardcode user_id (would be supplied as a parameter in the function)
SELECT lh.track_id
FROM listening_history lh
WHERE lh.user_id = 'a1b2c3d4-e5f6-7890-1234-567890abcdef'
)
AND t.track_id NOT IN (
-- Hardcoded user_id (would be supplied as a parameter in the function)
SELECT ls.track_id
FROM liked_songs ls
WHERE ls.user_id = 'a1b2c3d4-e5f6-7890-1234-567890abcdef'
)
AND t.genre_id = (
-- Subquery to find the user's top genre
SELECT g.genre_id
FROM genres g
JOIN tracks t_inner ON g.genre_id = t_inner.genre_id
JOIN (
-- Sub-subquery to count plays per track for the user
SELECT lh_inner.track_id, COUNT(*) as play_count
FROM listening_history lh_inner
WHERE lh_inner.user_id = 'a1b2c3d4-e5f6-7890-1234-567890abcdef'
AND lh_inner.listened_at >= NOW() - INTERVAL '30 days'
GROUP BY lh_inner.track_id
) AS user_track_plays ON t_inner.track_id = user_track_plays.track_id
GROUP BY g.genre_id
ORDER BY SUM(user_track_plays.play_count) DESC
LIMIT 1
)
-- Show newest songs first
ORDER BY t.release_date DESC
LIMIT 5;
You'd have to spend some time analyzing the spaghetti SQL to figure out how it works. Debugging would also be a nightmare as you have to extract each subquery, mock dependent values that come from other queries, and then analyze the results. If you modify a subquery, you must be careful not to break another part of the query that relies on that subquery. Finally, performance can suffer because, while Postgres's optimizer is smart, deeply nested and correlated subqueries can generate less optimal execution plans.
Solving Spaghetti SQL with CTEs
The query presented above is a perfect candidate for CTEs. It has multiple nested queries that obscurely depend on one another. The CTE will rewrite the subqueries, separating each into its own named query, which will improve the original's readability and performance.
To begin, you'll first write a CTE that counts the number of times a user has listened to different tracks in the last thirty days. You can also use a CTE to store the user ID so that it's not duplicated across queries. When written, these two CTEs will look like this:
WITH user_data AS (
-- Unnecessary if user id is an incoming parameter
SELECT 'a1b2c3d4-e5f6-7890-1234-567890abcdef'::uuid as user_id
),
user_recent_play_count AS (
SELECT lh.track_id, COUNT(*) as play_count
FROM listening_history lh
JOIN user_data ud ON lh.user_id = ud.user_id
WHERE lh.listened_at >= NOW() - INTERVAL '30 days'
GROUP BY lh.track_id
)
SELECT *
FROM user_recent_play_count
ORDER BY play_count DESC;
Notice how you can write a SELECT
statement at the end of your CTE to debug the results before moving on to the next subquery. If you run the query above, you'll see a list of track IDs and the number of times each was played by the user:
You can perform some spot checks on the results to ensure the data being returned by the CTE is correct. If it's not, you can easily modify the CTE and re-execute it until it's right. In this case, everything looks good, so you can continue with extracting the following subquery, which finds the user's top genre:
WITH user_data AS (
SELECT 'a1b2c3d4-e5f6-7890-1234-567890abcdef'::uuid as user_id
),
user_recent_play_count AS (
SELECT lh.track_id, COUNT(*) as play_count
FROM listening_history lh
JOIN user_data ud ON lh.user_id = ud.user_id
WHERE lh.listened_at >= NOW() - INTERVAL '30 days'
GROUP BY lh.track_id
),
user_top_genre AS (
SELECT g.genre_id, g.name
FROM genres g
JOIN tracks t ON g.genre_id = t.genre_id
-- Reference the previous CTE instead of a subquery
JOIN user_recent_play_count urpc ON urpc.track_id = t.track_id
GROUP BY g.genre_id
ORDER BY SUM(urpc.play_count) DESC
LIMIT 1
)
SELECT *
FROM user_top_genre;
The new query selects data from the new CTE, which should return the user's most popular genre:
Now, you can select tracks from that genre to recommend to the user. However, remember that the requirements specify that songs the user has already listened to or liked should be excluded. The next CTE you'll create will contain a list of track IDs that should not be recommended:
WITH user_data AS (
SELECT 'a1b2c3d4-e5f6-7890-1234-567890abcdef'::uuid as user_id
),
user_recent_play_count AS (
SELECT lh.track_id, COUNT(*) as play_count
FROM listening_history lh
JOIN user_data ud ON lh.user_id = ud.user_id
WHERE lh.listened_at >= NOW() - INTERVAL '30 days'
GROUP BY lh.track_id
),
user_top_genre AS (
SELECT g.genre_id, g.name
FROM genres g
JOIN tracks t ON g.genre_id = t.genre_id
JOIN user_recent_play_count urpc ON urpc.track_id = t.track_id
GROUP BY g.genre_id
ORDER BY SUM(urpc.play_count) DESC
LIMIT 1
),
user_excluded_tracks AS (
SELECT lh.track_id
FROM listening_history lh
JOIN user_data ud ON lh.user_id = ud.user_id
UNION
SELECT ls.track_id
FROM liked_songs ls
JOIN user_data ud ON ls.user_id = ud.user_id
)
SELECT uet.*, t.title
FROM user_excluded_tracks uet
INNER JOIN tracks t ON t.track_id = uet.track_id;
If you run the query above, you'll see a list of track IDs and titles of songs the user has already heard or liked, similar to the result set below:
Finally, combine all your CTEs into a single query to retrieve five song recommendations for the user:
WITH user_data AS (
SELECT 'a1b2c3d4-e5f6-7890-1234-567890abcdef'::uuid as user_id
),
user_recent_play_count AS (
SELECT lh.track_id, COUNT(*) as play_count
FROM listening_history lh
JOIN user_data ud ON lh.user_id = ud.user_id
WHERE lh.listened_at >= NOW() - INTERVAL '30 days'
GROUP BY lh.track_id
),
user_top_genre AS (
SELECT g.genre_id, g.name
FROM genres g
JOIN tracks t ON g.genre_id = t.genre_id
JOIN user_recent_play_count urpc ON urpc.track_id = t.track_id
GROUP BY g.genre_id
ORDER BY SUM(urpc.play_count) DESC
LIMIT 1
),
user_excluded_tracks AS (
SELECT lh.track_id
FROM listening_history lh
JOIN user_data ud ON lh.user_id = ud.user_id
UNION
SELECT ls.track_id
FROM liked_songs ls
JOIN user_data ud ON ls.user_id = ud.user_id
)
SELECT t.title, a.name AS artist_name
FROM tracks t
INNER JOIN artists a ON t.artist_id = a.artist_id
-- Only consider songs in the user's favourite genre
INNER JOIN user_top_genre utg ON t.genre_id = utg.genre_id
-- Left join songs to find tracks NOT in the excluded list
LEFT JOIN user_excluded_tracks uet ON uet.track_id = t.track_id
WHERE uet.track_id IS NULL -- Make sure track_id is NOT in excluded tracks
ORDER BY t.release_date DESC -- Show newest songs first
LIMIT 5;
Notice how each step in the CTE functions as a variable that holds a collection of temporary data. Even though the final query above might still look a bit complicated, it doesn't contain any complex subqueries. Rather, each CTE acts as a variable that stores a clearly defined set of results. These result sets are then combined in the final query using straightforward JOIN
statements. The WHERE
clause in the final query is also easier to read because it doesn't use subqueries. Overall, these factors make the query self-documenting and easy to debug and enhance in the future.
Optimizing CTE Performance
While the above CTE is easier to understand than the original query with nested subqueries, it does not always result in better performance.
Postgres's query planner is smart and in-lines CTEs if they are only referenced once, treating them as subqueries. However, if a CTE is referenced multiple times, in-lining becomes inefficient, and Postgres then opts to materialize the results. It does this by creating a temporary result set that can be reused across different subqueries.
Suppose you want to override the default behavior. You can do this by explicitly instructing Postgres to optimize the query using a materialized result set with the following syntax:
WITH user_data AS MATERIALIZED (...)
Likewise, you can use the following syntax to force it to in-line the query:
WITH user_data AS NOT MATERIALIZED (...)
If you're interested in how Postgres optimizes your queries, use the EXPLAIN ANALYZE
command to view the execution plan:
EXPLAIN ANALYZE WITH user_data AS (...)
SELECT ...;
Below is a screenshot of a portion of the execution plan for the CTE developed above:
So while the query planner will try its best to find the most optimal method of running a CTE, keep in mind that you can always fine-tune it if needed.
Advanced CTE Use Cases
CTEs can massively improve the readability of your query, but they're also able to do a whole lot more. Below are some more advanced use cases where CTEs excel.
Recursive CTEs
Relational databases, such as Postgres, allow tables to have self-referential fields, where a field in a table refers to another record within the same table. This is useful when storing hierarchical or network data. Recursive CTEs let you query all the rows that make up these relationships.
In music, genres are often further categorized into subgenres. Suppose you want to find all the subgenres under "Pop" and list all the tracks in those genres. You can use the following CTE:
WITH RECURSIVE genre_hierarchy (genre_id, name, parent_genre_id, parent_genre_name, level) AS (
-- Anchor Member: Start with the upper-most parent genre
SELECT g.genre_id, g.name, g.parent_genre_id, NULL, 0 AS level
FROM genres g
WHERE g.name = 'Pop'
UNION ALL
-- Recursive Member: Find child genres by joining back to genre_hierarchy
SELECT child_g.genre_id, child_g.name, child_g.parent_genre_id, gh.name, gh.level + 1
FROM genres child_g
JOIN genre_hierarchy gh ON child_g.parent_genre_id = gh.genre_id
)
SELECT gh.parent_genre_name,
gh.name AS genre_name,
gh.level,
t.title AS track_title,
ar.name AS artist_name
FROM genre_hierarchy gh
LEFT JOIN tracks t ON gh.genre_id = t.genre_id
LEFT JOIN artists ar ON t.artist_id = ar.artist_id
ORDER BY gh.level, gh.name, t.title;
This returns the following:
Notice how the CTE is defined using the WITH RECURSIVE
syntax. This instructs Postgres that the CTE will reference itself. When building the CTE query, you use the following layout:
<non-recursive query>
UNION [ALL]
<recursive query>
As the example above demonstrates, CTEs make querying hierarchical or network data straightforward. For example, they make it possible to query organizational hierarchies where employees are managed and can manage others, retrieve a file system hierarchy where each file and folder is stored as a record linked to another record, or query categories and their nested subcategories. A word of warning when working with recursive CTEs: make sure you specify an explicit termination clause that'll stop the recursive loop; otherwise, your query will loop indefinitely.
Batch Updates
CTEs are not restricted to SELECT
statements. You can also modify records using INSERT
, UPDATE
, or DELETE
statements within a CTE. These let you perform sequential data-manipulation operations and query the results.
The CTE below demonstrates how a user's old listening_history
is archived:
WITH deleted_history AS (
-- Delete old listening history and return the deleted rows
DELETE FROM listening_history
WHERE listened_at < NOW() - INTERVAL '2 years'
RETURNING user_id, track_id, listened_at
),
archived_data AS (
-- Insert the deleted rows into an archive table
INSERT INTO archived_listening_history (user_id, track_id, listened_at)
SELECT user_id, track_id, listened_at FROM deleted_history
RETURNING user_id
)
-- Report number of deleted rows per user_id
SELECT ad.user_id, COUNT(*) AS archived_tracks_count
FROM archived_data ad
GROUP BY ad.user_id
ORDER BY archived_tracks_count DESC;
First, records from the listening_history
table that are older than two years are deleted. The deleted records are returned using the RETURNING
statement. The following CTE inserts these records into an archive table and returns the user ID for each archived record. Finally, a summary is generated to count the number of records archived for each user.
The query results will look something like this:
Structure Business Logic into SQL Functions Using CTEs
CTEs can help you break down complex business logic into sequential, descriptive steps. To reuse these CTEs, you can then consider wrapping them in a SQL function.
Consider the song-recommendation CTE from earlier. This can be encapsulated in an SQL function like the one below:
CREATE OR REPLACE FUNCTION get_user_track_recommendations(p_user_id UUID, p_limit INT)
RETURNS TABLE (track_title TEXT, artist_name TEXT)
LANGUAGE sql
AS $$
WITH user_recent_play_count AS (
SELECT lh.track_id, COUNT(*) as play_count
FROM listening_history lh
WHERE lh.listened_at >= NOW() - INTERVAL '30 days'
AND lh.user_id = p_user_id
GROUP BY lh.track_id
),
user_top_genre AS (
SELECT g.genre_id, g.name
FROM genres g
JOIN tracks t ON g.genre_id = t.genre_id
JOIN user_recent_play_count urpc ON urpc.track_id = t.track_id
GROUP BY g.genre_id
ORDER BY SUM(urpc.play_count) DESC
LIMIT 1
),
user_excluded_tracks AS (
SELECT lh.track_id
FROM listening_history lh
WHERE lh.user_id = p_user_id
UNION
SELECT ls.track_id
FROM liked_songs ls
WHERE ls.user_id = p_user_id
)
SELECT t.title,
a.name AS artist_name
FROM tracks t
INNER JOIN artists a ON t.artist_id = a.artist_id
-- Only consider songs in the user's favourite genre
INNER JOIN user_top_genre utg ON t.genre_id = utg.genre_id
-- Left join songs to find tracks NOT in the excluded list
LEFT JOIN user_excluded_tracks uet ON uet.track_id = t.track_id
WHERE uet.track_id IS NULL -- Make sure track_id is NOT in excluded tracks
ORDER BY t.release_date DESC -- Show newest songs first
LIMIT p_limit;
$$;
The function accepts the user ID and the number of recommendations to return as input parameters. The result is a table with a track and the artist's name.
You can now simply call this complex CTE with the following function call:
SELECT *
FROM get_user_track_recommendations('a1b2c3d4-e5f6-7890-1234-567890abcdef', 5);
This returns the same results as before:
Using CTEs Instead of Temporary Tables
Temporary tables are a common method for storing intermediate results in a multistep query. You'll often find this used in complex queries or SQL functions that need to perform multiple steps and queries to compile the final results. However, CTEs can often provide a cleaner solution.
Consider the temporary tables below, which store a list of songs the user has listened to and their top genre:
CREATE TEMP TABLE temp_user_listens AS SELECT ...;
CREATE TEMP TABLE temp_top_genre AS SELECT ...;
SELECT ... FROM temp_top_genre JOIN ...;
DROP TABLE temp_top_genre;
DROP TABLE temp_user_listens;
CTEs offer a cleaner approach:
WITH user_listens AS (SELECT ...),
top_genre AS (SELECT ...)
SELECT ... FROM temp_top_genre JOIN ...;
Each CTE functions like a temporary table but is limited to a single query and is automatically removed after use. The entire CTE expression is cleaned up automatically, even if an error occurs. This saves you from having to manually drop the temporary tables when finished.
However, if you need to compile intermediate results that multiple subsequent queries will access, a temporary table may be a better choice. But if those results are only necessary for a single query, CTEs are a better option.
Best Practices and Pitfalls
Before you start refactoring all your Postgres code to use CTEs, it's best to keep a few best practices and pitfalls in mind.
Choosing Between CTEs, Views, and Functions
Knowing when to choose between CTEs, views, and functions is important for writing clean SQL.
Use CTEs for single queries that involve multistep data manipulation and aggregation. Also, consider using a CTE within a view or function if the query logic warrants it.
Views are ideal for simplifying queries that retrieve data from complex, multitable data structures. The view encapsulates the complex query logic and defines a stable interface that other queries can use when selecting the view. Unlike CTE expressions, they are not temporary and are often created once and then reused.
Functions are ideal for encapsulating business logic, especially when multiple input parameters are required. The function has a descriptive name and can be queried like a view or table. As you saw above, a function can still utilize a view or CTE internally to clearly define the query.
Considering Performance
Remember that CTEs don't automatically improve performance. Use the EXPLAIN ANALYZE
command to investigate your CTE's performance gaps.
While Postgres does try to create and manage the CTE intelligently, you can give it hints on how to create and consume the CTE. These hints are MATERIALIZED
and NOT MATERIALIZED
.
MATERIALIZED
indicates to Postgres to create a temporary result set for that CTE query. NOT MATERIALIZED
hints at the query engine to in-line the CTE queries before running them.
Don't assume CTEs perform better. Double-check them yourself and make the necessary optimizations if necessary.
Common Mistakes and How to Avoid Them
You'll want to avoid these common mistakes that you might run into while using CTEs.
Firstly, don't overuse CTEs. They are a powerful tool when dealing with complex query logic but can complicate a simple query unnecessarily.
Take the CTE below as an example:
WITH user_data AS (
SELECT *
FROM users
WHERE email = 'alice@example.com'
)
SELECT *
FROM listening_history lh
INNER JOIN user_data ud ON ud.user_id = lh.user_id;
It's much simpler to write as such:
SELECT *
FROM listening_history lh
INNER JOIN users u ON u.user_id = lh.user_id
WHERE u.email = 'alice@example.com';
Secondly, ensure your CTEs have appropriate names. Misleading names make it difficult to understand each query's purpose. Choosing suitable names goes a long way in clarifying the intent of each CTE.
Finally, don't use the names of existing views and tables in the database to name your CTEs. Below is an example:
WITH users AS (
SELECT username FROM users
)
SELECT * FROM users;
Some relational database management systems (RDBMS) might throw an error or warn you about naming collisions between tables and CTEs, such as the one above between the users
table and the users
CTE. However, other RDBMS software might not, in which case the CTE will take precedence over other tables in the database, leading to potentially confusing results. Save yourself future confusion and use a different name.
Tips for Testing and Debugging CTEs
When building CTEs, make sure you're debugging early and often. After writing each CTE query, run a SELECT
on that CTE to verify it returns the data you expect it to return. Since CTEs often build upon one another, it's much easier to identify and resolve issues as you go rather than trying to locate the root cause of a problem after completing all the queries.
For example, you saw in the walkthrough above how each CTE was tested before moving on to the next:
WITH user_data AS (
-- Unnecessary if user id is an incoming parameter
SELECT 'a1b2c3d4-e5f6-7890-1234-567890abcdef'::uuid as user_id
),
user_recent_play_count AS (
SELECT lh.track_id, COUNT(*) as play_count
FROM listening_history lh
JOIN user_data ud ON lh.user_id = ud.user_id
WHERE lh.listened_at >= NOW() - INTERVAL '30 days'
GROUP BY lh.track_id
)
SELECT *
FROM user_recent_play_count
ORDER BY play_count DESC;
Save yourself the headache of debugging a monolithic query by testing each CTE as you build it.
Conclusion
CTEs are an essential tool for transforming spaghetti SQL into well-structured, readable, and maintainable queries. They allow you to break down subqueries into named, sequential steps, which clarifies the data manipulations occurring to produce the final results. The article illustrated how a particularly convoluted query can become much easier to read when using CTEs. Additionally, it explored some advanced use cases for CTEs, such as recursive and batch-update CTEs and using them to organize business logic while replacing temporary tables. Finally, some pitfalls and best practices were highlighted to ensure your CTEs remain both readable and efficient.
Top comments (0)