Trees in SQL (part 2)

#sql #database #computerscience #tutorial

This is the second and final part of the article “Trees in SQL”. Following the principles of a good narrative, the most interesting part is saved for the end.

Interval Tree

Up until now, we have tried to work with a tree structure as if it were a graph. The challenges we faced—such as using recursion, an additional table, or a bulky string field—arose because SQL is inherently designed for working with sets rather than graphs.

To represent a tree structure as nested sets, each node must be assigned two integer parameters: a left boundary and a right boundary. These parameters must satisfy the following conditions:

The right boundary of a node must be greater than or equal to the left boundary.
The left boundaries of descendants must be greater than the left boundary of their parent node.
The right boundaries of descendants must be less than or equal to the right boundary of their parent node.
The intervals defined by the left and right boundaries of nodes with the same parent must not overlap.

The following diagram illustrates the representation of a tree as nested sets:

To store an interval-based tree structure, we create the following table:

CREATE TABLE test4 (
  id INTEGER NOT NULL,
  parent INTEGER,
  lb INTEGER NOT NULL,
  rb INTEGER NOT NULL,

  PRIMARY KEY (id),
  FOREIGN KEY (parent) REFERENCES test4 (id)
    ON UPDATE CASCADE
    ON DELETE CASCADE,
  CHECK (lb <= rb)
);

To improve data access speed, two indexes are required:

CREATE INDEX test4_x_lb ON test4 (lb);
CREATE DESCENDING INDEX test4_x_rb ON test4 (rb);

Extracting Subtree Nodes

A query to extract all nodes of a subtree with root node P would look like this:

SELECT t.*
FROM test4 t JOIN test4 t2
  ON t.lb > t2.lb AND t.rb <= t2.rb
WHERE
  t2.id = :P

If we replace the strict inequality on the left boundary with a non-strict one, the root of the subtree will also be included in the result set.

Manipulating Tree Data

Compared to previously discussed structures, interval trees require the most effort when adding a new node or moving a node from one branch to another. In addition to two triggers, a stored procedure is also needed.

We start with the procedure EL_TEST4, which identifies and returns the left boundary for a descendant of a given node (PARENT). If the parent’s interval is exhausted, it will be expanded by at least the value specified by the DELTA parameter. During the interval expansion, all higher-level intervals are also expanded, and intervals to the right are shifted. Intervals falling within the boundaries defined by LB2 and RB2 are excluded from processing to prevent infinite loops when moving a subtree from one parent to another.

CREATE PROCEDURE EL_TEST4 (
  PARENT INTEGER,
  DELTA INTEGER,
  LB2 INTEGER,
  RB2 INTEGER)
RETURNS (
  LEFTBORDER INTEGER)
AS
  DECLARE VARIABLE R INTEGER;
  DECLARE VARIABLE L INTEGER;
  DECLARE VARIABLE R2 INTEGER;
  DECLARE VARIABLE MKey INTEGER;
  DECLARE VARIABLE MultiDelta INTEGER;
BEGIN
  /* Retrieve parent boundaries */
  SELECT rb, lb
  FROM test4
  WHERE id = :Parent
  INTO :R, :L;

  /* Get the rightmost boundary of the parent's descendants, if any */
  R2 = NULL;
  SELECT MAX(rb) FROM test4 WHERE parent = :Parent
    INTO :R2;

  /* If there are no descendants, use the parent's left boundary */
  IF (:R2 IS NULL) THEN
    R2 = :L;

  /* Check if there is enough space for another descendant */
  IF (:R - :R2 < :Delta) THEN
  BEGIN
    /* If not, expand the interval */
    MultiDelta = :R - :L + 100;

    /* Ensure the new range is sufficient */
    IF (:Delta > :MultiDelta) THEN
      MultiDelta = :Delta;

    /* Shift the right boundary of parent intervals */
    IF (:LB2 > -1) THEN
      UPDATE test4 SET rb = rb + :MultiDelta
      WHERE lb <= :L AND rb >= :R
        AND NOT (lb >= :LB2 AND rb <= :RB2);
    ELSE
      UPDATE test4 SET rb = rb + :MultiDelta
      WHERE lb <= :L AND rb >= :R;

    /* Shift both boundaries of intervals to the right of the parent interval */
    IF (:LB2 > -1) THEN
      UPDATE test4
        SET lb = lb + :MultiDelta,
        rb = rb + :MultiDelta
        WHERE lb > :R
          AND NOT (lb >= :LB2 AND rb <= :RB2);
    ELSE
      UPDATE test4
        SET lb = lb + :MultiDelta,
        rb = rb + :MultiDelta
        WHERE lb > :R;
  END

  /* Return the found boundary */
  LeftBorder = :R2 + 1;
END;

The trigger TEST4_BI is called after inserting a record and assigns its left and right boundaries:

CREATE TRIGGER test4_bi FOR test4
  BEFORE INSERT
  POSITION 32000
AS
BEGIN
  /* Check if adding the root of a new tree or a descendant to an existing node */
  IF (NEW.parent IS NULL) THEN
  BEGIN
    /* If adding a new tree, set its left boundary after the highest known right boundary */
    NEW.lb = NULL;
    SELECT MAX(rb) FROM test4 INTO NEW.lb;

    IF (NEW.lb IS NULL) THEN
      /* Adding the first element to the table; use integers >0 for interval boundaries */
      NEW.lb = 1;
    ELSE
      NEW.lb = NEW.lb + 1;
  END ELSE
  BEGIN
    /* Call the procedure to find a possible left boundary for the descendant. 
       If the parent's interval is exhausted, it will be expanded. */
    EXECUTE PROCEDURE el_test4 (NEW.parent, 1, -1, -1)
      RETURNING_VALUES NEW.lb;
  END

  NEW.rb = NEW.lb;
END;

The trigger TEST4_BU adjusts node intervals when moving a node to a different parent:

CREATE EXCEPTION tree_e_invalid_parent
  'Invalid parent';

CREATE TRIGGER test4_bu FOR test4
  BEFORE UPDATE
  POSITION 32000
AS
  DECLARE VARIABLE OldDelta INTEGER;
  DECLARE VARIABLE L INTEGER;
  DECLARE VARIABLE R INTEGER;
  DECLARE VARIABLE NewL INTEGER;
BEGIN
  /* Check for changes in the parent */
  IF ((NEW.parent <> OLD.parent)
    OR (OLD.parent IS NULL AND NOT NEW.parent IS NULL)
    OR (NEW.parent IS NULL AND NOT OLD.parent IS NULL))
  THEN
  BEGIN
    /* Check for circular references */
    IF (EXISTS(SELECT * FROM test4
      WHERE id = NEW.parent
      AND lb >= OLD.lb AND rb <= OLD.rb)) THEN
    BEGIN
      EXCEPTION tree_e_invalid_parent;
    END ELSE
    BEGIN
      IF (NEW.parent IS NULL) THEN
      BEGIN
        /* Retrieve the rightmost boundary */
        SELECT MAX(rb)
        FROM test4
        WHERE parent IS NULL
        INTO :NewL;
        /* Assume at least one element with a null parent exists */
        NewL = :NewL + 1;
      END ELSE
      BEGIN
        /* Retrieve the new left boundary */
        EXECUTE PROCEDURE el_test4 (NEW.parent,
          OLD.rb - OLD.lb + 1, OLD.lb, OLD.rb)
        RETURNING_VALUES :NewL;
      END

      /* Calculate the shift */
      OldDelta = :NewL - OLD.lb;
      /* Shift main branch boundaries */
      NEW.lb = OLD.lb + :OldDelta;
      NEW.rb = OLD.rb + :OldDelta;
      /* Shift child boundaries */
      UPDATE test4
        SET lb = lb + :OldDelta, rb = rb + :OldDelta
        WHERE lb > OLD.lb AND rb <= OLD.rb;
    END
  END
END;

Deleting all nested elements when a node is deleted is ensured by the ON DELETE CASCADE rule on the PARENT field, so no additional actions are required to compress the intervals.

Conclusion

A reasonable question might arise: is there a universally optimal way to organize tree structures in SQL that suits all applications? Unfortunately, the answer is no. Each structure has its strengths and weaknesses that must be evaluated based on the specific requirements of the task. For instance:

What is the total number of records in the table? For small trees, performance differences between methods are minimal and insignificant.
How often will the tree data be modified? Interval trees achieve the best results for data retrieval, but demand significant computational resources for inserting new records or moving nodes.
What is the maximum depth of the tree? For depths of 10 levels or more, the second and third methods (mentioned earlier) result in a significant increase in database size due to the need to store all node connections along the path to the root.
What is more critical for the task: retrieval speed or data storage size?

Ultimately, the choice is left to the database designer, and we hope this article serves as a helpful guide.