Business applications often need to represent and store information such as a company's organizational hierarchy, a country's administrative divisions, an inventory structure, or simply a set of folders and files. Despite their apparent differences, these examples share a common feature— they are all trees. Not the green, wood-made kind that provide pleasant shade on a sunny day, but, mathematically speaking, connected acyclic graphs. In computer science, the links between nodes are described in terms of parent-child relationships. Each node in the tree, except the root, has exactly one parent and zero, one, or many children.
The figure above shows an example of a tree. Vertex A is called the root of the tree. Nodes without subtrees — C, D, F, G, and H — are called leaves. Edges of the tree show parent-child relationships between nodes. A rooted tree is k-ary if each node has no more than k children. A binary tree is a well-known particular case of a k-ary tree where k = 2.
One of the common tasks performed on tree data is the retrieval of all child nodes at any nesting level, given a parent node. For example, calculating the total turnover of an account involves determining the turnovers of all its subaccounts, or compiling a list of goods belonging to a given goods group includes lists from all the nested subgroups.
In imperative programming languages, we usually use a graph traversal algorithm to find all subnodes. Although most modern database servers support SQL queries with recursive common table expressions, using the recursive part inside complex queries on large amounts of data can significantly degrade performance.
The reason is that SQL deals primarily with sets, not graphs, so one should implement dedicated data structures to represent nodes and edges as nested sets for efficient data selection.
In this article, we will consider four ways of storing tree-like structures inside a relational database and evaluate them in terms of query performance and the space needed for storing the underlying data.
Link to the parent
Perhaps the most intuitive and easiest way to represent tree-like data is to use the adjacency matrix in a relational table. Every record in the table corresponds to a node in the tree and holds the unique ID of the node (primary key) and the ID of its parent (for root nodes, the parent is null).
Below is a DDL query that creates the test1
table:
CREATE TABLE test1 (
id INTEGER NOT NULL,
parent INTEGER,
name VARCHAR(60) NOT NULL,
PRIMARY KEY (id),
FOREIGN KEY (parent) REFERENCES test1 (id)
ON UPDATE CASCADE
ON DELETE CASCADE
)
Pay attention to the ON DELETE CASCADE
clause of the FOREIGN KEY
constraint declared on the PARENT
field. It guarantees that the entire subtree will be deleted at once if the parent node is deleted. To protect the subtree from accidental deletion, the rule should be changed from CASCADE
to RESTRICT
.
Corresponding data for the tree shown at the beginning of the article looks like:
id | parent | name |
---|---|---|
1 | NULL | A |
2 | 1 | B |
3 | 1 | E |
4 | 1 | H |
5 | 2 | C |
6 | 2 | D |
7 | 3 | F |
8 | 3 | G |
Operations on tree data
Addition of a new tree node is done by the INSERT INTO
operator:
INSERT INTO test1 (id, parent)
VALUES (<unique id of the node>,
<id of the parent node or NULL for a root node>)
To move the node from one parent to another, a value of the PARENT
field has to be changed:
UPDATE test1
SET parent = <id of the new parent>
WHERE id = <id of the node being moved>
Obviously, to delete the node, one should invoke DELETE FROM
operator providing ID of the node.
DELETE FROM test1
WHERE id = <id of the node>
Selecting whole subtree of the given node
With such an organization of the data, a simple SELECT
operator allows us to fetch only one level of nesting of node's descendants.
SELECT id
FROM test1
WHERE parent = :P
So, what should we do if the task is to extract all subnodes, regardless of how deep they lie in the subtree?
An unconventional solution would be to write a query for every possible level of nesting and then merge their outputs using the UNION
operator. Fortunately, we can use either recursive stored procedures or recursive common table expressions to assist us.
The procedure below returns the IDs of all subnodes of a given node. The second parameter specifies whether the parent node will be included in the resulting set.
CREATE PROCEDURE RecTest1(AnID INTEGER, Self INTEGER)
RETURNS (ID INTEGER)
AS
BEGIN
IF (:Self <> 0) THEN
BEGIN
ID = :AnID;
SUSPEND;
END
FOR SELECT id FROM test1
WHERE parent = :AnID INTO :ID
DO FOR SELECT id FROM RecTest1(:ID, 1) INTO :ID
DO SUSPEND;
END;
We can use the procedure on its own or JOIN
its output with the tree data to select the required columns:
SELECT t.*
FROM test1 t
JOIN RecTest1(:P, 0) r ON t.id = r.id
If a specific SQL server supports recursive common table expressions, we can write a query that selects all nested subnodes of the given node:
WITH RECURSIVE
tree AS (
SELECT id, parent
FROM test1
WHERE parent = <id of the given node>
UNION ALL
SELECT g.id, g.parent
FROM test1 g JOIN tree h
ON g.parent = h.id
)
SELECT
gt.id,
gt.parent
FROM
tree gt
Storing auxiliary links
In the previous example, we placed an id-parent link in the table with tree data. Since we only know about the relationship between a given node and its children, we need recursive queries to scan the tree in depth, which is not efficient when working with large amounts of data.
To overcome this limitation, an auxiliary table should be introduced. Such a table holds pairs of connected nodes along with the distance between them, where a distance of 1 indicates a parent-child relationship, and distances greater than 1 correspond to deeper levels of ancestry.
CREATE TABLE test2(
id INTEGER NOT NULL,
name VARCHAR(60) NOT NULL,
PRIMARY KEY (id)
);
CREATE TABLE test2_link(
id_from INTEGER NOT NULL,
id_to INTEGER NOT NULL,
distance INTEGER NOT NULL,
PRIMARY KEY (id_from, id_to),
FOREIGN KEY (id_from) REFERENCES test2 (id)
ON UPDATE CASCADE
ON DELETE CASCADE,
FOREIGN KEY (id_to) REFERENCES test2 (id)
ON UPDATE CASCADE
ON DELETE CASCADE
);
Here is data for the tree:
id_from | id_to | distance |
---|---|---|
B | A | 1 |
C | B | 1 |
D | B | 1 |
C | A | 2 |
D | A | 2 |
E | A | 1 |
F | E | 1 |
G | E | 1 |
F | A | 2 |
G | A | 2 |
H | A | 1 |
Let's get back to our business. Selection of the subtree now done much easier. No recursion required.
SELECT
t.*
FROM
test2 t JOIN test2_link l
ON l.id_from = t.id
WHERE
l.id_to = :P
Using distance
field, we can arrange the resulting set by node's closeness to the given root node.
Manipulating Tree Data
While we have managed to avoid recursion when retrieving data, this has significantly complicated the processes of inserting and modifying records.
Adding a Node
To add a node, you need to perform two operations: insert into the tree nodes table and the links table.
For example, to add a child node with an identifier 2
to a parent node with an identifier 1
, execute the following commands:
INSERT INTO test2 (id) VALUES (2);
INSERT INTO test2_link (id_from, id_to, distance)
VALUES (2, 1, 1);
Creating Triggers
To automate the maintenance of tree relationships, it's advisable to create two triggers.
The first trigger fires after inserting a record into the test2
table and adds a self-referential link with a distance of 0
into test2_link
:
CREATE TRIGGER test2_ai FOR test2
AFTER INSERT
POSITION 32000
AS
BEGIN
INSERT INTO test2_link (id_from, id_to, distance)
VALUES (NEW.id, NEW.id, 0);
END;
The second trigger fires after inserting a record into the test2_link
table.
Its purpose is to create links between the new node and its descendants (if adding a subtree) and all ancestors of the parent node:
CREATE TRIGGER test2_link_ai FOR test2_link
AFTER INSERT
POSITION 32000
AS
DECLARE VARIABLE AnID INTEGER;
DECLARE VARIABLE ADistance INTEGER;
BEGIN
IF (NEW.distance = 1) THEN
BEGIN
INSERT INTO test2_link (id_from, id_to, distance)
SELECT down.id_from, up.id_to,
down.distance + up.distance + 1
FROM test2_link up, test2_link down
WHERE up.id_from = NEW.id_to
AND down.id_to = NEW.id_from
AND down.distance + up.distance > 0;
END
END;
Moving a Node
To move a node (or subtree) from one parent to another, follow these steps:
- Remove the link to the current parent, detaching the subtree:
DELETE FROM test2_link
WHERE id_from = <ID of the node or subtree>
AND id_to = <ID of the current parent>
AND distance = 1;
- Add the subtree to the new parent:
INSERT INTO test2_link (id_from, id_to, distance)
VALUES (
<ID of the node or subtree>,
<ID of the new parent>,
1
);
The previously created insert trigger will handle the new relationships automatically.
Create a trigger that, upon deleting a link with a distance of 1
, removes all links between the nodes of the subtree and the nodes along the path from the parent node to the tree root:
CREATE TRIGGER test2_link_bd FOR test2_link
BEFORE DELETE
POSITION 32000
AS
DECLARE VARIABLE AnIDFrom INTEGER;
DECLARE VARIABLE AnIDTo INTEGER;
BEGIN
IF (OLD.distance = 1) THEN
BEGIN
FOR
SELECT down.id_from, up.id_to
FROM test2_link down
JOIN test2_link up
ON up.id_from = OLD.id_from
AND up.distance + down.distance > 1
AND down.id_to = OLD.id_from
INTO :AnIDFrom, :AnIDTo
DO
DELETE FROM test2_link
WHERE id_from = :AnIDFrom
AND id_to = :AnIDTo;
END
END;
Deleting a Node
Deleting a node doesn't require special handling, as foreign keys in the test2_link
table are configured for cascading deletions.
When a node is deleted from the test2
table, all related records in test2_link
are automatically removed:
DELETE FROM test2 WHERE id = <Node ID>;
Ensuring Data Integrity
To maintain logical data integrity and adherence to our model, create two additional triggers.
The first trigger prevents editing records in the test2_link
table:
CREATE EXCEPTION tree_e_incorrect_operation
'Incorrect operation';
CREATE TRIGGER test2_link_au FOR test2_link
BEFORE UPDATE
POSITION 32000
AS
BEGIN
EXCEPTION tree_e_incorrect_operation;
END;
The second trigger deletes a node if a record with a distance of 0
is removed from the links table:
CREATE TRIGGER test2_link_ad FOR test2_link
AFTER DELETE
POSITION 32000
AS
BEGIN
IF (OLD.distance = 0) THEN
BEGIN
DELETE FROM test2 WHERE id = OLD.id_from;
END
END;
These measures will help maintain data integrity and ensure the correct tree structure within the database.
Composite Path
It's not necessary to use a separate table to store the list of nodes leading to the root of a tree. Instead, you can concatenate the ordered list of node identifiers into a string, using a chosen delimiter, and store it directly within the node records table. For example:
CREATE TABLE test3 (
id INTEGER NOT NULL,
parent INTEGER,
path VARCHAR(60) NOT NULL,
PRIMARY KEY (id),
FOREIGN KEY (parent) REFERENCES test3 (id)
ON UPDATE CASCADE
ON DELETE CASCADE
);
To optimize retrieval operations using the STARTING WITH
operator, it's essential to create the following index to enhance performance:
CREATE INDEX test3_x_path ON test3 (path);
If you choose a forward slash (/
) as the delimiter, the test3
table for the tree depicted in the figure above would be populated as follows:
id | parent | path |
---|---|---|
A | NULL | / |
B | A | /A/ |
C | B | /A/B/ |
D | B | /A/B/ |
E | A | /A/ |
F | E | /A/E/ |
G | E | /A/E/ |
H | A | /A/ |
Retrieving Subtree Nodes
To retrieve all nodes in the subtree rooted at node P
, execute the following query:
SELECT *
FROM test3
WHERE path STARTING WITH
(SELECT path FROM test3 WHERE id = :P);
If you wish to exclude the root node itself from the results, modify the query accordingly:
SELECT *
FROM test3
WHERE path STARTING WITH
(SELECT path FROM test3 WHERE id = :P)
AND id <> :P;
Manipulating Tree Data
Similar to the simplest method of organizing hierarchical data discussed earlier, adding, moving, or deleting a node can be accomplished using standard SQL commands: INSERT
, UPDATE
, or DELETE
.
The automatic generation of the path string can be managed by implementing the following triggers:
CREATE TRIGGER test3_bi FOR test3
BEFORE INSERT
POSITION 32000
AS
DECLARE VARIABLE P INTEGER;
BEGIN
NEW.path = '';
P = NEW.parent;
WHILE (NOT :P IS NULL) DO
BEGIN
NEW.path = CAST(:P AS VARCHAR(60)) || '/'
|| NEW.path;
SELECT parent
FROM test3
WHERE id = :P
INTO :P;
END
NEW.path = '/' || NEW.path;
END;
CREATE EXCEPTION tree_e_incorrect_operation2
'Incorrect operation';
CREATE TRIGGER test3_bu FOR test3
BEFORE UPDATE
POSITION 32000
AS
DECLARE VARIABLE P INTEGER;
DECLARE VARIABLE T INTEGER;
DECLARE VARIABLE NP VARCHAR(60);
DECLARE VARIABLE OP VARCHAR(60);
BEGIN
IF((NEW.parent <> OLD.parent)
OR (NEW.parent IS NULL AND NOT OLD.parent IS NULL)
OR (NOT NEW.parent IS NULL
AND OLD.parent IS NULL)) THEN
BEGIN
NEW.path = '';
P = NEW.parent;
WHILE (NOT :P IS NULL) DO
BEGIN
NEW.path = CAST(:P AS VARCHAR(60)) || '/'
|| NEW.path;
SELECT parent
FROM test3
WHERE id = :P
INTO :P;
END
NEW.path = '/' || NEW.path;
/* Check for circular references */
IF (POSITION(('/' ||
CAST(NEW.id AS VARCHAR(60)) || '/')
IN NEW.path) > 0) THEN
EXCEPTION tree_e_incorrect_operation2;
NP = NEW.path ||
CAST(NEW.id AS VARCHAR(60)) || '/';
OP = OLD.path ||
CAST(OLD.id AS VARCHAR(60)) || '/';
T = LENGTH(OP) + 1;
UPDATE test3
SET path = :NP || SUBSTRING(path FROM :T FOR 60)
WHERE path STARTING WITH :OP;
END
END;
Storing the composite path from the tree root to the current node imposes a limitation on the maximum tree depth, approximately equal to L div (K + 1)
, where L
is the length of the string field used to store the path, and K
is the average length of the string representation of the record key.
Go to the part 2...
Top comments (0)