Ivan Kahl

Posted on Oct 2 • Originally published at blog.ivankahl.com

Stop Spaghetti SQL: Simplify Complex Queries with Postgres CTEs in Supabase

#supabase #postgres #sql #webdev

Modern web applications deal with massive amounts of complex data, and relational databases are a popular choice for keeping everything organized. But even with structured data, queries can turn into nightmares when you need to retrieve large amounts of interconnected information.

Take a music streaming app that wants to retrieve song suggestions by combining a user's listening history, genre preferences, and favorite artists. The SQL query quickly becomes unwieldy with endless subqueries and WHERE clauses, resulting in what's known as "spaghetti SQL" - code that's nearly impossible to read, maintain, or debug.

Common table expressions (CTEs) is one way to solve this problem. They let you break down complex queries into separate, manageable steps. Each step is a simple SQL query with clear results, labeled so you actually know what each part does. This can completely transform how readable and maintainable your queries become.

In this post, I'll show you how CTEs work in Postgres and how they can turn convoluted spaghetti SQL into clean, clearly defined statements. I’ll take you through practical demonstrations of CTEs used in a music streaming app. The demo app is hosted on Supabase, a backend service that offers a Postgres database for apps with built-in auth and edge functions. The CTEs will simplify logic used in some of the app’s queries. I'll also cover some advanced use cases and best practices.

What Are CTEs?

CTEs, defined in SQL using the WITH clause, let you create one or more temporary named result sets that can be used within a single SELECT, INSERT, UPDATE, or DELETE statement. The expressions are executed in sequential order and can reference previous result sets in the same statement.

Below is an example of a SQL statement that utilizes a CTE to filter out archived song tracks, which is then used in the final query:

WITH undeleted_tracks AS (
    SELECT track_id, title, artist_id, genre_id
    FROM tracks
    WHERE is_deleted = false
)
SELECT ut.title, a.name as artist_name, g.name as genre_name
FROM undeleted_tracks ut
    INNER JOIN artists a ON a.artist_id = ut.artist_id
    INNER JOIN genres g ON g.genre_id = ut.genre_id

First, a CTE is created to store a temporary collection called undeleted_tracks with all the undeleted songs in the tracks table. The CTE is then used in the final query. Notice you don't need to check if a track is deleted in the final query because the CTE already filtered them out.

Now, why would you use a CTE to achieve this? You could retrieve the same results using a predefined view or SQL subquery. CTEs, views, and subqueries do share some similarities; however, they have differences that make each appropriate for distinct scenarios:

Subqueries are nested SQL queries placed within a larger query. These nested queries can span multiple levels, which makes it difficult to follow the logical flow of the data and understand how the query is constructed.
Views are predefined SQL queries that are stored as persistent database objects. They are ideal for encapsulating query logic that spans multiple database tables and needs to be reused across different queries. Postgres lets you define a regular view, which runs its query each time it's accessed, and materialized views, which execute the query once and store the results for fast retrieval. Since materialized views store data, they require manual refreshes to update with the latest information.
CTEs are temporary "views" that only exist for the execution of a SQL statement. They are effective at replacing complex, nested subqueries, improving the readability and maintainability of the overall query by breaking it down into logical, named steps.

With these distinctions in mind, CTEs can be invaluable for writing recursive queries, which repeatedly reference their own output to retrieve hierarchical or network data. Imagine trying to debug an SQL query that has several levels of embedded SELECT statements, each dependent on the statement before it. The query's complexity obscures its intent, making it challenging to identify the underlying issue. In this case, CTEs' modularity makes the query's intent clear, which makes debugging and modifying the query a straightforward task as your application evolves.

Additionally, since much of Supabase's functionality is built on top of a Postgres database, you can use CTEs to query and manage your application's data if you're hosting your service there, as is the case with this music streaming app. This ensures that your backend remains easy to modify as requirements evolve and more features are implemented.

The following section will demonstrate how a complex query can be enhanced by replacing subqueries with CTEs.

The Problem with Complex SQL Queries

A key feature of a music streaming app is compiling personalized song recommendations based on the user's most-listened-to genre over the last month, excluding any songs they have already listened to or liked.

Without CTEs, a query to achieve this would consist of multiple subqueries:

SELECT t.title, ar.name AS artist_name
FROM tracks t
    INNER JOIN artists ar ON t.artist_id = ar.artist_id
WHERE 
    t.track_id NOT IN (
        -- Hardcode user_id (would be supplied as a parameter in the function)
        SELECT lh.track_id 
        FROM listening_history lh 
        WHERE lh.user_id = 'a1b2c3d4-e5f6-7890-1234-567890abcdef'
 )
    AND t.track_id NOT IN (
        -- Hardcoded user_id (would be supplied as a parameter in the function)
        SELECT ls.track_id 
        FROM liked_songs ls 
        WHERE ls.user_id = 'a1b2c3d4-e5f6-7890-1234-567890abcdef'
 )
    AND t.genre_id = (
        -- Subquery to find the user's top genre
        SELECT g.genre_id
        FROM genres g
        JOIN tracks t_inner ON g.genre_id = t_inner.genre_id
        JOIN (
            -- Sub-subquery to count plays per track for the user
            SELECT lh_inner.track_id, COUNT(*) as play_count
            FROM listening_history lh_inner
            WHERE lh_inner.user_id = 'a1b2c3d4-e5f6-7890-1234-567890abcdef'
                  AND lh_inner.listened_at >= NOW() - INTERVAL '30 days'
            GROUP BY lh_inner.track_id
 ) AS user_track_plays ON t_inner.track_id = user_track_plays.track_id
        GROUP BY g.genre_id
        ORDER BY SUM(user_track_plays.play_count) DESC
        LIMIT 1
 )
-- Show newest songs first
ORDER BY t.release_date DESC
LIMIT 5;

You'd have to spend some time analyzing the spaghetti SQL to figure out how it works. Debugging would also be a nightmare as you have to extract each subquery, mock dependent values that come from other queries, and then analyze the results. If you modify a subquery, you must be careful not to break another part of the query that relies on that subquery. Finally, performance can suffer because, while Postgres's optimizer is smart, deeply nested and correlated subqueries can generate less optimal execution plans.

Solving Spaghetti SQL with CTEs

The query presented above is a perfect candidate for CTEs. It has multiple nested queries that obscurely depend on one another. The CTE will rewrite the subqueries, separating each into its own named query, which will improve the original's readability and performance.

To begin, you'll first write a CTE that counts the number of times a user has listened to different tracks in the last thirty days. You can also use a CTE to store the user ID so that it's not duplicated across queries. When written, these two CTEs will look like this:

WITH user_data AS (
    -- Unnecessary if user id is an incoming parameter
    SELECT 'a1b2c3d4-e5f6-7890-1234-567890abcdef'::uuid as user_id
),
user_recent_play_count AS (
    SELECT lh.track_id, COUNT(*) as play_count
    FROM listening_history lh
        JOIN user_data ud ON lh.user_id = ud.user_id
    WHERE lh.listened_at >= NOW() - INTERVAL '30 days'
    GROUP BY lh.track_id
)
SELECT * 
FROM user_recent_play_count
ORDER BY play_count DESC;

Notice how you can write a SELECT statement at the end of your CTE to debug the results before moving on to the next subquery. If you run the query above, you'll see a list of track IDs and the number of times each was played by the user:

You can perform some spot checks on the results to ensure the data being returned by the CTE is correct. If it's not, you can easily modify the CTE and re-execute it until it's right. In this case, everything looks good, so you can continue with extracting the following subquery, which finds the user's top genre:

WITH user_data AS (
    SELECT 'a1b2c3d4-e5f6-7890-1234-567890abcdef'::uuid as user_id
),
user_recent_play_count AS (
    SELECT lh.track_id, COUNT(*) as play_count
    FROM listening_history lh
        JOIN user_data ud ON lh.user_id = ud.user_id
    WHERE lh.listened_at >= NOW() - INTERVAL '30 days'
    GROUP BY lh.track_id
),
user_top_genre AS (
    SELECT g.genre_id, g.name
    FROM genres g
    JOIN tracks t ON g.genre_id = t.genre_id
    -- Reference the previous CTE instead of a subquery
    JOIN user_recent_play_count urpc ON urpc.track_id = t.track_id 
    GROUP BY g.genre_id
    ORDER BY SUM(urpc.play_count) DESC
    LIMIT 1
)
SELECT * 
FROM user_top_genre;

The new query selects data from the new CTE, which should return the user's most popular genre:

Now, you can select tracks from that genre to recommend to the user. However, remember that the requirements specify that songs the user has already listened to or liked should be excluded. The next CTE you'll create will contain a list of track IDs that should not be recommended:

WITH user_data AS (
    SELECT 'a1b2c3d4-e5f6-7890-1234-567890abcdef'::uuid as user_id
),
user_recent_play_count AS (
    SELECT lh.track_id, COUNT(*) as play_count
    FROM listening_history lh
        JOIN user_data ud ON lh.user_id = ud.user_id
    WHERE lh.listened_at >= NOW() - INTERVAL '30 days'
    GROUP BY lh.track_id
),
user_top_genre AS (
    SELECT g.genre_id, g.name
    FROM genres g
    JOIN tracks t ON g.genre_id = t.genre_id
    JOIN user_recent_play_count urpc ON urpc.track_id = t.track_id 
    GROUP BY g.genre_id
    ORDER BY SUM(urpc.play_count) DESC
    LIMIT 1
),
user_excluded_tracks AS (
    SELECT lh.track_id
    FROM listening_history lh
        JOIN user_data ud ON lh.user_id = ud.user_id
    UNION
    SELECT ls.track_id
    FROM liked_songs ls
        JOIN user_data ud ON ls.user_id = ud.user_id
)
SELECT uet.*, t.title
FROM user_excluded_tracks uet
    INNER JOIN tracks t ON t.track_id = uet.track_id;

If you run the query above, you'll see a list of track IDs and titles of songs the user has already heard or liked, similar to the result set below:

Finally, combine all your CTEs into a single query to retrieve five song recommendations for the user:

WITH user_data AS (
    SELECT 'a1b2c3d4-e5f6-7890-1234-567890abcdef'::uuid as user_id
),
user_recent_play_count AS (
    SELECT lh.track_id, COUNT(*) as play_count
    FROM listening_history lh
        JOIN user_data ud ON lh.user_id = ud.user_id
    WHERE lh.listened_at >= NOW() - INTERVAL '30 days'
    GROUP BY lh.track_id
),
user_top_genre AS (
    SELECT g.genre_id, g.name
    FROM genres g
    JOIN tracks t ON g.genre_id = t.genre_id
    JOIN user_recent_play_count urpc ON urpc.track_id = t.track_id 
    GROUP BY g.genre_id
    ORDER BY SUM(urpc.play_count) DESC
    LIMIT 1
),
user_excluded_tracks AS (
    SELECT lh.track_id
    FROM listening_history lh
        JOIN user_data ud ON lh.user_id = ud.user_id
    UNION
    SELECT ls.track_id
    FROM liked_songs ls
        JOIN user_data ud ON ls.user_id = ud.user_id
)
SELECT t.title, a.name AS artist_name
FROM tracks t
  INNER JOIN artists a ON t.artist_id = a.artist_id
  -- Only consider songs in the user's favourite genre
  INNER JOIN user_top_genre utg ON t.genre_id = utg.genre_id
  -- Left join songs to find tracks NOT in the excluded list
  LEFT JOIN user_excluded_tracks uet ON uet.track_id = t.track_id
WHERE uet.track_id IS NULL -- Make sure track_id is NOT in excluded tracks
ORDER BY t.release_date DESC -- Show newest songs first
LIMIT 5;

Notice how each step in the CTE functions as a variable that holds a collection of temporary data. Even though the final query above might still look a bit complicated, it doesn't contain any complex subqueries. Rather, each CTE acts as a variable that stores a clearly defined set of results. These result sets are then combined in the final query using straightforward JOIN statements. The WHERE clause in the final query is also easier to read because it doesn't use subqueries. Overall, these factors make the query self-documenting and easy to debug and enhance in the future.

Optimizing CTE Performance

While the above CTE is easier to understand than the original query with nested subqueries, it does not always result in better performance.

Postgres's query planner is smart and in-lines CTEs if they are only referenced once, treating them as subqueries. However, if a CTE is referenced multiple times, in-lining becomes inefficient, and Postgres then opts to materialize the results. It does this by creating a temporary result set that can be reused across different subqueries.

Suppose you want to override the default behavior. You can do this by explicitly instructing Postgres to optimize the query using a materialized result set with the following syntax:

WITH user_data AS MATERIALIZED (...)

Likewise, you can use the following syntax to force it to in-line the query:

WITH user_data AS NOT MATERIALIZED (...)

If you're interested in how Postgres optimizes your queries, use the EXPLAIN ANALYZE command to view the execution plan:

EXPLAIN ANALYZE WITH user_data AS (...)
SELECT ...;

Below is a screenshot of a portion of the execution plan for the CTE developed above:

So while the query planner will try its best to find the most optimal method of running a CTE, keep in mind that you can always fine-tune it if needed.

Advanced CTE Use Cases

CTEs can massively improve the readability of your query, but they're also able to do a whole lot more. Below are some more advanced use cases where CTEs excel.

Recursive CTEs

Relational databases, such as Postgres, allow tables to have self-referential fields, where a field in a table refers to another record within the same table. This is useful when storing hierarchical or network data. Recursive CTEs let you query all the rows that make up these relationships.

In music, genres are often further categorized into subgenres. Suppose you want to find all the subgenres under "Pop" and list all the tracks in those genres. You can use the following CTE:

WITH RECURSIVE genre_hierarchy (genre_id, name, parent_genre_id, parent_genre_name, level) AS (
    -- Anchor Member: Start with the upper-most parent genre
    SELECT g.genre_id, g.name, g.parent_genre_id, NULL, 0 AS level
    FROM genres g
    WHERE g.name = 'Pop'
    UNION ALL
    -- Recursive Member: Find child genres by joining back to genre_hierarchy
    SELECT child_g.genre_id, child_g.name, child_g.parent_genre_id, gh.name, gh.level + 1
    FROM genres child_g
    JOIN genre_hierarchy gh ON child_g.parent_genre_id = gh.genre_id
)
SELECT gh.parent_genre_name,
       gh.name AS genre_name,
       gh.level,
       t.title AS track_title,
       ar.name AS artist_name
FROM genre_hierarchy gh
  LEFT JOIN tracks t ON gh.genre_id = t.genre_id
  LEFT JOIN artists ar ON t.artist_id = ar.artist_id
ORDER BY gh.level, gh.name, t.title;

This returns the following:

Notice how the CTE is defined using the WITH RECURSIVE syntax. This instructs Postgres that the CTE will reference itself. When building the CTE query, you use the following layout:

<non-recursive query>
UNION [ALL]
<recursive query>

As the example above demonstrates, CTEs make querying hierarchical or network data straightforward. For example, they make it possible to query organizational hierarchies where employees are managed and can manage others, retrieve a file system hierarchy where each file and folder is stored as a record linked to another record, or query categories and their nested subcategories. A word of warning when working with recursive CTEs: make sure you specify an explicit termination clause that'll stop the recursive loop; otherwise, your query will loop indefinitely.

Batch Updates

CTEs are not restricted to SELECT statements. You can also modify records using INSERT, UPDATE, or DELETE statements within a CTE. These let you perform sequential data-manipulation operations and query the results.

The CTE below demonstrates how a user's old listening_history is archived:

WITH deleted_history AS (
    -- Delete old listening history and return the deleted rows
  DELETE FROM listening_history
    WHERE listened_at < NOW() - INTERVAL '2 years'
    RETURNING user_id, track_id, listened_at
),
archived_data AS (
    -- Insert the deleted rows into an archive table
 INSERT INTO archived_listening_history (user_id, track_id, listened_at)
    SELECT user_id, track_id, listened_at FROM deleted_history
    RETURNING user_id
)
-- Report number of deleted rows per user_id
SELECT ad.user_id, COUNT(*) AS archived_tracks_count
FROM archived_data ad
GROUP BY ad.user_id
ORDER BY archived_tracks_count DESC;

First, records from the listening_history table that are older than two years are deleted. The deleted records are returned using the RETURNING statement. The following CTE inserts these records into an archive table and returns the user ID for each archived record. Finally, a summary is generated to count the number of records archived for each user.

The query results will look something like this:

Structure Business Logic into SQL Functions Using CTEs

CTEs can help you break down complex business logic into sequential, descriptive steps. To reuse these CTEs, you can then consider wrapping them in a SQL function.

Consider the song-recommendation CTE from earlier. This can be encapsulated in an SQL function like the one below:

CREATE OR REPLACE FUNCTION get_user_track_recommendations(p_user_id UUID, p_limit INT)
RETURNS TABLE (track_title TEXT, artist_name TEXT)
LANGUAGE sql
AS $$
    WITH user_recent_play_count AS (
        SELECT lh.track_id, COUNT(*) as play_count
        FROM listening_history lh
        WHERE lh.listened_at >= NOW() - INTERVAL '30 days'
              AND lh.user_id = p_user_id
        GROUP BY lh.track_id
    ),
 user_top_genre AS (
        SELECT g.genre_id, g.name
        FROM genres g
        JOIN tracks t ON g.genre_id = t.genre_id
        JOIN user_recent_play_count urpc ON urpc.track_id = t.track_id 
        GROUP BY g.genre_id
        ORDER BY SUM(urpc.play_count) DESC
        LIMIT 1
    ),
 user_excluded_tracks AS (
        SELECT lh.track_id
        FROM listening_history lh
        WHERE lh.user_id = p_user_id
        UNION
        SELECT ls.track_id
        FROM liked_songs ls
        WHERE ls.user_id = p_user_id
    )
    SELECT t.title,
 a.name AS artist_name
    FROM tracks t
      INNER JOIN artists a ON t.artist_id = a.artist_id
      -- Only consider songs in the user's favourite genre
      INNER JOIN user_top_genre utg ON t.genre_id = utg.genre_id
      -- Left join songs to find tracks NOT in the excluded list
      LEFT JOIN user_excluded_tracks uet ON uet.track_id = t.track_id
    WHERE uet.track_id IS NULL -- Make sure track_id is NOT in excluded tracks
    ORDER BY t.release_date DESC -- Show newest songs first
    LIMIT p_limit;
$$;

The function accepts the user ID and the number of recommendations to return as input parameters. The result is a table with a track and the artist's name.

You can now simply call this complex CTE with the following function call:

SELECT *
FROM get_user_track_recommendations('a1b2c3d4-e5f6-7890-1234-567890abcdef', 5);

This returns the same results as before:

Using CTEs Instead of Temporary Tables

Temporary tables are a common method for storing intermediate results in a multistep query. You'll often find this used in complex queries or SQL functions that need to perform multiple steps and queries to compile the final results. However, CTEs can often provide a cleaner solution.

Consider the temporary tables below, which store a list of songs the user has listened to and their top genre:

CREATE TEMP TABLE temp_user_listens AS SELECT ...;
CREATE TEMP TABLE temp_top_genre AS SELECT ...;
SELECT ... FROM temp_top_genre JOIN ...;
DROP TABLE temp_top_genre;
DROP TABLE temp_user_listens;

CTEs offer a cleaner approach:

WITH user_listens AS (SELECT ...),
     top_genre AS (SELECT ...)
SELECT ... FROM temp_top_genre JOIN ...;

Each CTE functions like a temporary table but is limited to a single query and is automatically removed after use. The entire CTE expression is cleaned up automatically, even if an error occurs. This saves you from having to manually drop the temporary tables when finished.

However, if you need to compile intermediate results that multiple subsequent queries will access, a temporary table may be a better choice. But if those results are only necessary for a single query, CTEs are a better option.

Best Practices and Pitfalls

Before you start refactoring all your Postgres code to use CTEs, it's best to keep a few best practices and pitfalls in mind.

Choosing Between CTEs, Views, and Functions

Knowing when to choose between CTEs, views, and functions is important for writing clean SQL.

Use CTEs for single queries that involve multistep data manipulation and aggregation. Also, consider using a CTE within a view or function if the query logic warrants it.

Views are ideal for simplifying queries that retrieve data from complex, multitable data structures. The view encapsulates the complex query logic and defines a stable interface that other queries can use when selecting the view. Unlike CTE expressions, they are not temporary and are often created once and then reused.

Functions are ideal for encapsulating business logic, especially when multiple input parameters are required. The function has a descriptive name and can be queried like a view or table. As you saw above, a function can still utilize a view or CTE internally to clearly define the query.

Considering Performance

Remember that CTEs don't automatically improve performance. Use the EXPLAIN ANALYZE command to investigate your CTE's performance gaps.

While Postgres does try to create and manage the CTE intelligently, you can give it hints on how to create and consume the CTE. These hints are MATERIALIZED and NOT MATERIALIZED.

MATERIALIZED indicates to Postgres to create a temporary result set for that CTE query. NOT MATERIALIZED hints at the query engine to in-line the CTE queries before running them.

Don't assume CTEs perform better. Double-check them yourself and make the necessary optimizations if necessary.

Common Mistakes and How to Avoid Them

You'll want to avoid these common mistakes that you might run into while using CTEs.

Firstly, don't overuse CTEs. They are a powerful tool when dealing with complex query logic but can complicate a simple query unnecessarily.

Take the CTE below as an example:

WITH user_data AS (
    SELECT * 
    FROM users 
    WHERE email = 'alice@example.com'
)
SELECT *
FROM listening_history lh
  INNER JOIN user_data ud ON ud.user_id = lh.user_id;

It's much simpler to write as such:

SELECT *
FROM listening_history lh
  INNER JOIN users u ON u.user_id = lh.user_id
WHERE u.email = 'alice@example.com';

Secondly, ensure your CTEs have appropriate names. Misleading names make it difficult to understand each query's purpose. Choosing suitable names goes a long way in clarifying the intent of each CTE.

Finally, don't use the names of existing views and tables in the database to name your CTEs. Below is an example:

WITH users AS (
  SELECT username FROM users
)
SELECT * FROM users;

Some relational database management systems (RDBMS) might throw an error or warn you about naming collisions between tables and CTEs, such as the one above between the users table and the users CTE. However, other RDBMS software might not, in which case the CTE will take precedence over other tables in the database, leading to potentially confusing results. Save yourself future confusion and use a different name.

Tips for Testing and Debugging CTEs

When building CTEs, make sure you're debugging early and often. After writing each CTE query, run a SELECT on that CTE to verify it returns the data you expect it to return. Since CTEs often build upon one another, it's much easier to identify and resolve issues as you go rather than trying to locate the root cause of a problem after completing all the queries.

For example, you saw in the walkthrough above how each CTE was tested before moving on to the next:

WITH user_data AS (
    -- Unnecessary if user id is an incoming parameter
    SELECT 'a1b2c3d4-e5f6-7890-1234-567890abcdef'::uuid as user_id
),
user_recent_play_count AS (
    SELECT lh.track_id, COUNT(*) as play_count
    FROM listening_history lh
        JOIN user_data ud ON lh.user_id = ud.user_id
    WHERE lh.listened_at >= NOW() - INTERVAL '30 days'
    GROUP BY lh.track_id
)
SELECT * 
FROM user_recent_play_count
ORDER BY play_count DESC;

Save yourself the headache of debugging a monolithic query by testing each CTE as you build it.

Conclusion

CTEs are an essential tool for transforming spaghetti SQL into well-structured, readable, and maintainable queries. They allow you to break down subqueries into named, sequential steps, which clarifies the data manipulations occurring to produce the final results. The article illustrated how a particularly convoluted query can become much easier to read when using CTEs. Additionally, it explored some advanced use cases for CTEs, such as recursive and batch-update CTEs and using them to organize business logic while replacing temporary tables. Finally, some pitfalls and best practices were highlighted to ensure your CTEs remain both readable and efficient.

DEV Community