Roel Peters

Posted on Oct 1

Beyond Flat Tables: Model Hierarchical Data in Supabase with Recursive Queries

#postgres #database #fullstack #dataengineering

Modern applications often need to manage complex hierarchical relations, such as nested product categories, threaded comment systems, and even organizational charts. While obvious in the physical world, these structures pose significant challenges when they're stored and queried in a traditional relational database. Parent-child relationships require either multiple database roundtrips or complex application-level logic to reconstruct the hierarchy on demand.

Supabase, powered by Postgres at its core, provides an efficient solution for handling hierarchical data: recursive common table expressions (CTEs). This tool allows developers to query treelike structures without resorting to complex backend processing or specialized graph databases.

Read on to see how I use Postgres's recursive CTEs within Supabase to model and manipulate hierarchical data directly in the database, enabling readable and performant queries for common operations, like finding all descendants, calculating tree depths, and traversing organizational hierarchies. I'll also share some tips on how to optimize those queries.

Understanding Hierarchical Data and Recursive Queries

Hierarchical data structures organize information in a treelike format. A record typically has one parent and one or more children. This information pattern is all around us. Think about your own family tree or how we classify plants and animals. But it is even more prevalent on the web: conversational threads on social media, organizational charts, file systems, menu navigation, and so on.

Hierarchical data can be stored in various ways. XML or JSON files are flexible, portable, and widely supported. NoSQL is another database solution, which not only scales nearly infinitely but also offers extreme schema flexibility. Finally, relational databases also support hierarchical data structures, but they require specific data models to use them efficiently. These include the adjacency list model, the nested set model, path enumeration, and a closure table:

The adjacency list model is straightforward: Each record contains a reference to its parent (often a parent_id column). This allows joining the table onto itself to recreate the hierarchy.
The nested set model addresses some of these issues by assigning left and right values to each node, allowing for efficient subtree queries in both directions.
The path enumeration option is highly inefficient in terms of storage as it stores the full path from the root to itself. It does require fewer computing resources.
Finally, a closure table is like a junction table that stores all links between the records in each level of the hierarchy.

By far, the most popular is the adjacency list model, probably because of the introduction of recursive queries, which drastically simplify using it in a query. The recursive pattern was introduced in the SQL:1999 standard. However, in Postgres, recursive queries have only been made available in version 8.4, which was released in July 2009.

Recursive queries (often wrapped in recursive CTEs) leverage the adjacency list model so that developers don't need to write complex logic to extract every level of an information tree. They allow you to define a query that references itself, enabling the database to traverse parent-child relationships to any depth.

Setting Up a Hierarchical Table in Supabase/Postgres

A hierarchical table typically contains a record's ID and the ID of its parent. The rest of the article uses a table that contains a list of employees as an example. In this section, I show you how to configure the table, and in the next section, you'll query it recursively.

You can use the CREATE TABLE statement below to define what the employee list should look like:

CREATE TABLE employees (
  employee_id INTEGER PRIMARY KEY,
  full_name TEXT NOT NULL,
  manager_id INTEGER REFERENCES employees(employee_id)
);

Each employee in the table has an employee_id, a full_name, and a manager_id. The manager_id is interesting as it is constrained to be a self-referencing foreign key. The constraint ensures that any value entered in manager_id must either be NULL (for top-level employees with no manager) or match an existing employee_id in the employee table. This prevents orphaned references, meaning you can't assign an employee to a manager that doesn't exist in the table.

Now, let's assume you load the following dummy data into the employee table of the database:

employee_id	full_name	manager_id
1	Alice	NULL
2	Bob	1
3	Carol	1
4	Dave	2
5	Eve	2
6	Frank	3
7	Grace	3
8	Heidi	4

Clearly, Alice is the top boss as she does not report to a manager. Heidi, on the other hand, reports to Bob, who in turn reports to Alice.

Querying the Hierarchy with Recursive CTEs

In this section, you'll learn how to write and interpret a recursive CTE.

Although a recursive subquery pattern can be added in various clauses of a query, it is typically done with a recursive CTE. Postgres and Supabase use the WITH RECURSIVE clause. That's like a regular CTE that uses the WITH clause but with the RECURSIVE clause attached to it. It consists of two parts: an anchor member (the base case) and a recursive member (the self-referencing part).

Here's an example where a recursive CTE in Postgres finds all the subordinates of Bob, including indirect subordinates:

WITH RECURSIVE subordinates AS (
  SELECT employee_id, full_name, manager_id
  FROM employees
  WHERE manager_id = 2  -- anchor: start from manager with id 2
  UNION ALL
  SELECT e.employee_id, e.full_name, e.manager_id
  FROM employees e
  INNER JOIN subordinates s ON e.manager_id = s.employee_id
)
SELECT * FROM subordinates;

This query starts with the direct reports of the manager (anchor) and then repeatedly joins to find indirect reports (recursive member), continuing until no more subordinates are found. Here's how the recursion works:

Iteration 1: The anchor member runs, finding all direct subordinates of manager 2.
Iteration 2: The recursive member runs, finding employees who report to those found in the previous step, and collates them to the initial list using the UNION ALL statement.
Subsequent iterations: The process repeats, each time using the latest set of found employees as the new "managers" to look for further subordinates.
Termination: The recursion stops when no new employees are found in an iteration (ie the recursive member returns an empty set).

The result is a complete list of all employees reporting, directly or indirectly, to Bob. Dave and Eve report to him directly, while Heidi reports to him indirectly.

employee_id	full_name	manager_id
4	Dave	2
5	Eve	2
8	Heidi	4

This is one example of a recursive query, but it can be used for various other purposes too:

Generating sequences of numbers, like a Fibonacci sequence or factorial numbers, using the recursive pattern as a loop
Traversing a connection of nodes within a network
Listing a breadcrumb trail of products in a web store

That covers the fundamentals of hierarchical data retrieval with recursive queries, but you'll also need to be able to optimize those queries.

Optimizing Recursive Queries

Recursive queries can quickly become resource-heavy due to the many joins that happen in the database engine. Fortunately, several optimization strategies can mitigate these performance challenges. This section explores indexing, caching, and denormalization and concludes with some guidance on monitoring and troubleshooting.

Indexing Strategies for Hierarchical Data

Proper indexing is essential for speedy recursive queries in Postgres. For adjacency list models, always create indexes on the columns involved in parent-child relationships—typically both the id and parent_id fields. This enables Postgres to quickly locate parent or child nodes during each recursion step. Composite indexes can further help if your queries filter on multiple columns.

Caching Considerations

Postgres caches query plans for prepared statements, which can reduce planning overhead for frequently executed recursive queries. For expensive or frequently accessed hierarchical traversals, consider caching results at the application layer or using materialized views to store precomputed results. Materialized views can be refreshed periodically to balance freshness and performance, especially for read-heavy workloads.

For example, this is how you create a materialized view in Postgres, which calculates the revenue per product_id.

CREATE MATERIALIZED VIEW product_sales_summary AS
SELECT
  product_id,
  SUM(amount) AS total_sales,
  COUNT(*) AS sales_count
FROM
  sales
GROUP BY
  product_id;

Make sure to refresh the materialized view by running the following query recurrently:

SELECT * FROM product_sales_summary WHERE total_sales > 1000;

Denormalization

Once recursive queries become a bottleneck due to the extreme depth of your hierarchy, it may be necessary to denormalize your data. Denormalization is the process of intentionally adding redundant data to a database that has already been normalized, with the goal of improving read performance and simplifying complex queries.

In a normalized database, related data is distributed across multiple tables to eliminate redundancy and ensure data integrity. With denormalization, you combine related data into fewer tables or even a single table. This reduces the number of joins needed during queries and enables faster data retrieval, although it results in data redundancy.

Denormalization allows queries to access hierarchical relationships in a single read rather than recursively traversing parent-child links. As a result, the depth of recursion required for queries is dramatically reduced or even eliminated since the path information is precomputed and readily available. This approach trades increased storage and more complex updates for much faster reads, making it ideal when most queries need to navigate large or deep hierarchies that can't be efficiently handled with indexing alone.

Monitoring and Troubleshooting

While monitoring your queries won't speed them up, it can be used to identify unforeseen bottlenecks. Postgres's EXPLAIN and EXPLAIN ANALYZE commands help you analyze the execution plans of recursive queries, identifying issues such as sequential scans or inefficient joins.

EXPLAIN: reveals the planned execution strategy for a query, showing how tables will be accessed and joined, but does not actually run the query
EXPLAIN ANALYZE: executes the query and reports actual runtime statistics, including timing, row counts, and resource usage for each step

A typical inefficient join that can occur in a recursive query is a Cartesian join, which results from a non-unique parent_id.

Finally, Postgres stores intermediate results in temporary working tables during recursion, so keep an eye on disk I/O if queries are especially large or complex.

Conclusion

In summary, managing hierarchical data in modern applications is challenging, but Postgres's recursive CTEs offer a robust, efficient solution. By modeling hierarchies with the adjacency list approach and leveraging recursive queries, developers can traverse complex parent-child relationships directly in the database, eliminating the need for convoluted application logic.

This article also elaborated on proper indexing, caching, and, when necessary, denormalization to enhance performance further. These tools allow you to handle nested structures like organizational charts or threaded discussions with readable, maintainable SQL, ensuring both scalability and data integrity for real-world, data-driven applications.

DEV Community