DEV Community

Dico
Dico

Posted on • Originally published at dico.duba.dev on

Nested sets: Performant attribute calculation on collections

We all know what a tree is. Computer science taught us that the root is in the top, that it is occasionally red and black and that leaves look exactly like the trunk of the tree. It teaches you that if you want to store a tree, you have various options to do so: Nodes with a parent and position, nodes with a parent, left child and right sibling or even a node with parent, left and right child for binary trees. Obtaining things like a parent, children or siblings is usually not a problem with these storage formats. Obtaining properties such as all ancestors, all descendants or depth however quickly becomes a chore when trying to obtain them in a mysql database.

I only became aware of nested sets as a way of storing trees after a co-worker mentioned it to me. Unlike the previously mentioned storage methods, nested sets have an elegant way of obtaining all these properties, often only in a single query.

Tree represented as a nested set
Tree represented as a nested set. Picture made by 0x24a537r9 and released in the Public Domain.

Nested sets are based on set theory, but they have some interesting properties. If we have a node A that appears as the child of node B in a tree, with nested sets we represent that as a set A that is contained in a set B. In the example above, “Slacks” is a child of “Suits” in a tree with 4 layers. While sets are usually just collections in a dimensional universe, we need to introduce the concept of “order” in sets to be able to order the children of a node. This will allow us model trees with sets.

Storing a node as a nested set then only requires three fields, namely:

  • The id of the parent
  • The left edge of this node’s set
  • The right edge of this node’s set

Strictly speaking you could omit the id of the parent node and still define a well-formed tree, but it would make querying direct children and the direct parent of a particular node unnecessarily hard. Instead of the parent id you could instead store the depth of a node instead, but calculating the depth with a sql query is an easier task than calculating the direct parent of a node.

The problem

As mentioned before, obtaining various complex properties for a single node is fairly simple with nested sets. I will go in more detail how nested sets accomplish that later, but I want you to realise that while we sometimes are only interested in a single node, or a couple of nodes, we sometimes want to know these special properties for an entire tree. With 10.000 nodes. Meaning that a naive approach costs a whopping 40.000 queries, where some contain nested queries. Is it possible to do that in a reasonable time? The answer is yes. As it turns out, all of these properties can be calculated by visiting each node only once.

Properties on single nodes

Simple properties

Calculating the simple properties of a node, namely the children and parent node are usually easy to calculate and this is no different for nodes stored as a nested set.

Since we store the parent id on nodes itself, getting the parent node is as simple as reading a field on the node. Getting a node’s children is simply a query on a parent node:

SELECT * FROM `nodes` WHERE `parent_node` = ?

Compact nested set representation of the tree above.
Compact nested set representation of the tree above.

Complex properties

From here on I will be using the graphical representation as shown above to showcase why properties are calculated as they are. This representation represents exactly the same tree as the first image on this page.

Descendants of node “Women’s”
Descendants of node “Women’s”

Descendants are all nodes that are in a subtree of a given parent node. That subtree contains the children of the parent node, as well as their children etc. Since we defined that if a node appears under a parent node, they are contained in the set of the parent node, getting all descendants becomes very easy. Those descendants are represented by all nodes within the set of the parent node. Since we stored the left and right edge of that set, we can simply query for all nodes within the left and right bounds of a given node.

First we find the left and right bounds of that node

SELECT `left`, `right` FROM `nodes` WHERE `id` = ?

Then we use those values to find all nodes contained within those bounds.

SELECT * FROM `nodes` WHERE `left` > ? AND `right` < ?

Ancestors of node “Sun dresses”
Ancestors of node “Sun dresses”

Ancestors are all nodes you would need to travel through if you wanted to reach a given node from the root. In the case above, the ancestors of “Sun dresses” are “Clothing”, “Women’s” and “Dresses”.

Finding ancestors is a little bit harder to wrap your head around, but remember that I told you that all the descendants of a node are within the bounds of that node? Well, we can use that.

See, all ancestors of our node have one thing in common, namely that our node is in the descendants property of each of the ancestor nodes. Meanwhile, there are no other nodes that has that node in their descendants property. If our node must be within the bounds of its ancestor nodes, the ancestor nodes of our node must have their bounds outside our bounds.

Similarly as with obtaining the descendants, we start with finding our own bounds.

SELECT `left`, `right` FROM `nodes` WHERE `id` = ?

Then we find all nodes that begin left of our node and end after the end of our node

SELECT * FROM `nodes` WHERE `left` < ? AND `right` > ?

The depth of a node is how many layers deep a node is nested. It directly correlates with how many ancestor nodes a given node has.

I keep repeating myself, but once again we start with finding our own bounds

SELECT `left`, `right` FROM `nodes` WHERE `id` = ?

Then we simply count the number of ancestor nodes

SELECT COUNT(*) + 1 as `depth` FROM `nodes` WHERE `left` < ? AND `right` > ?

The position (or index or order) attribute of a node in relation to its sibling is basically an integer that allows you to order nodes under the same parent. It is arguably the least useful of the attributes described here if you are designing a tree and controllers for it from the ground up. In our case we transitioned from a different tree structure to nested sets and needed this attribute to prevent us from having to rewrite the front-end. Instead of sending a position and a parent to move nodes around, you are better of using a parent node id and/or a sibling node id to pinpoint the location you want to move/insert a node to/at.

To find this attribute, we simply need to count how many siblings appear before us under the same parent.

You can probably guess what we are about to do first…

SELECT `left`, `parent_node_id` FROM `nodes` WHERE `id` = ?

Afterwards, we find all nodes under the parent, then filter on what is before the current node, and then count the result

SELECT COUNT(*) as `position` FROM `nodes` WHERE `parent_node_id` = ? AND `left` < ?

Finding properties on collections

As outlined in the problem statement, a naive approach to finding these properties on a collection of nodes is to repeat the queries above on each node. While this approach does not seem to be that bad on a small tree, the number of queries and processing on those queries required for large trees gets out of hand quickly. Luckily we have more tricks we can use.

The following techniques all assume that you are working with a complete tree or a complete subtree. All functions only work on the data that is passed to that function. If you pass a complete subtree, it will find properties as if that subtree was the complete tree. In other words, if you calculate the ancestors property on a node while passing only a subtree, it will only contain the ancestors in that subtree. You would need to manually add the other ancestors if you wanted to calculate the ancestors of that node in the entire tree.

Repeated image of the compacted tree

Flat tree using depth-first walk

A depth-first walk, pre-order or NLR-traversal starts at the root of a tree and visits the left subtree of each node completely before visiting the right subtree of each node. There are two great things about this:

  • You are guaranteed to visit all parents of any given node before visiting that node itself
  • It is extremely easy to create a nested tree (where children are put in a children attribute of their parent) from a flat tree created this way

Why do I bring this up? The first property of this walk is extremely useful when finding the complex properties we talked about earlier. The other reason I bring this up, is because it is extremely easy to get a pre-order walk directly from the database. If you look closely at the image above and scan through the image from the left to the right, you will notice that if you write down the name of a node when you encounter its left edge, you will end up with the pre-order walk we just talked about. In other words: We only need to sort our nodes by their left edge.

SELECT * from `nodes` ORDER BY `left` ASC

We found “Sun dresses” in our walk
We found “Sun dresses” in our walk

Ancestors

The ancestors property is probably the easiest to calculate on a collection. In fact, if your language supports generators, you can easily yield nodes in pre-order while adding this property to your nodes.

When creating the ancestor property we will use the fact that we have found all ancestors of a node before reaching that node, as well as the fact that if we have reached the right edge of a parent node, we have encountered all nodes that needed that parent node as an ancestor. If we combine those two things, we can create a running stack of ancestors we can easily keep up-to-date.

If we look at the image, the light coloured area is what we have already processed. The darker areas are nodes we have processed, and have removed from the ancestor stack already. Let’s take a closer look at the code we are executing when calculating the property for the node “Sun dresses”. At the beginning of the loop we start with an ancestor stack containing “Clothing”, “Women’s”, “Dresses” and “Evening gowns”. We then purge “Evening gowns”, because the right side of this node is before the left side of the node we are currently processing. Afterwards we add our new ancestor stack containing “Clothing”, “Women’s” and “Dresses” as an attribute on our node and then pass it on. Finally we add “Sun dresses” to the ancestor stack before starting at the beginning of the loop to process “Skirts”.

If you are processing a subtree rather than an entire tree, you can seed $ancestor_stack with the ancestors of the root node to get the ancestors relative to the entire tree rather than the ancestors relative to the subtree.

Descendants and children

Getting the descendants of each node is a bit more troublesome. Just like ancestors of a node are guaranteed to appear before a node, descendants are guaranteed to appear after that node when going through the nodes in a pre-order walk. However, we are not able to use the same trick of keeping a running stack of descendants like we did with ancestors, not even if we go through the nodes from the end to the beginning.

Descendants of node “Women’s”
Descendants of node “Women’s”

Since we require data after the current node to calculate the descendants property, we are stuck in one of the following situations:

  • We need to cache all nodes to be able to yield them one by one
  • We need to prepare the entire output before returning it
  • We need to yield the nodes in a different order

Each of these situations have their up- and downsides, but I will give two examples for the first two situations.


If we want to yield nodes the moment we process them, we need a reliable way to find all descendant nodes. When we do a pre-order walk, we process the node itself, then its left subtree, then its right subtree. In other words: When we process a node, we then get a list of all our descendant nodes before going back up to our parent node. Thus, if we know how many nodes are in the subtree, we can use this information to get that many nodes after the current node.

Each node has a left and a right bound, and none of the bounds in a normalised tree overlap. This means that each node takes up two “spaces”. In a normalised tree, the left and right bound of a parent node should fit snugly around the nodes in its subtree. Since a single node takes up a fixed amount of space, we can calculate how many nodes are between the bounds of the parent node with the following calculation. That is exactly what we are going to use in our code.

numberOfNodes = (right - left - 1) / 2

We can find the children attribute by using the fact that the sets of sibling nodes touch each other, start at the left edge of the parent node and end at the right edge of the parent node. Finding the first child is easy… the node after the parent node is either the first child of that node, or some unrelated node after the right bound of the parent node. To get the next sibling all we need to do is skip over the subtree of the current sibling. All we need to do for that is figure out how many nodes are in that subtree using the left and right bound of that node, then skip ahead that many nodes.

I leave it as a reader’s exercise to combine these snippets to get the children and descendants attribute at the same time.


As an alternative, we can instead use the code structure we used to determine the ancestors. If we have all the ancestors of a node, all those ancestors have that node as their descendant. We can gradually build up the descendant attribute of each node in that ancestor stack. This means we need to either keep the entire (flat) tree in memory until the end, or yield the nodes when we pop them from the stack, scrambling the tree.

Positions / index / order

The position / index / order attribute as mentioned earlier is easiest calculated when adding the children one by one, by simply counting the number of children at the time of adding the child. Since both techniques outlined above use this approach, I will leave it to the reader to write those two lines.

Combining all of the above

We combined some of these snippets into some unsightly loop, which I will not share here. The result was that an api call that previously took between 15 and 20 seconds now barely took 2. I have not looked too closely at what is hindering performance right now, but my gut feeling is that initialising up to 10.000 models has something to do with it.

Further reading

Top comments (0)