Manage hierarchical data using MySql

Summary

We know that relational database tables are better suited for flat lists than for hierarchical data with parent-child relationships that can be directly managed like XML.

First, define the hierarchy we are discussing, which is a set of data where each entry can have only one parent and zero or more children (the only exception is the root entry, which has no parent).Many database-dependent applications encounter hierarchical data, such as threads to forums or mailing lists, organization charts of enterprises, content management systems, or catalogs of stores.Let's take the following data as an example:

Mike Hillyer considers two different models, Adjacency List and Nested Set, to implement this hierarchy.

Adjacency List model

We can intuitively save the structure shown below.

Create a table named distributions:

CREATE TABLE distributions (
  id INT NOT NULL AUTO_INCREMENT,
  name VARCHAR(32) NOT NULL,
  parent INT NULL DEFAULT NULL,
  PRIMARY KEY (id)
)
ENGINE = InnoDB
DEFAULT CHARACTER SET = utf8;

Insert data:

INSERT INTO distributions VALUES
  (1, 'Linux', NULL),
  (2, 'Debian', 1),
  (3, 'Knoppix', 2),
  (4, 'Ubuntu', 2),
  (5, 'Gentoo', 1),
  (6, 'Red Hat', 1),
  (7, 'Fedora Core', 6),
  (8, 'RHEL', 6),
  (9, 'CentOS', 8),
  (10, 'Oracle Linux', 8);

Execution:

SELECT * FROM distributions;

You can see the data in the table as follows:

+----+--------------+--------+
| id | name         | parent |
+----+--------------+--------+
|  1 | Linux        |   NULL |
|  2 | Debian       |      1 |
|  3 | Knoppix      |      2 |
|  4 | Ubuntu       |      2 |
|  5 | Gentoo       |      1 |
|  6 | Red Hat      |      1 |
|  7 | Fedora Core  |      6 |
|  8 | RHEL         |      6 |
|  9 | CentOS       |      8 |
| 10 | Oracle Linux |      8 |
+----+--------------+--------+

Using the linked table model, each record in a table contains a pointer to its upper record.The value of this field for the top level record (Linux in this example) is NULL.The advantage of adjacency tables is intuitive and simple, and we can see at a glance that CentOS derives from RHEL, which in turn evolved from Red Hat.Although client programs may be fairly simple to handle, using pure SQL to process adjacency tables can cause some difficulties.

Get the full path of the whole tree and a single node

The first common task for working with hierarchies is to display the entire hierarchy, usually with some indentation.Using pure SQL processing often involves the use of so-called self-join techniques:

SELECT
  t1.name AS level1,
  t2.name as level2,
  t3.name as level3,
  t4.name as level4
FROM
  distributions AS t1
  LEFT JOIN distributions AS t2
    ON t2.parent = t1.id
  LEFT JOIN distributions AS t3
    ON t3.parent = t2.id
  LEFT JOIN distributions AS t4
    ON t4.parent = t3.id
WHERE t1.name = 'Linux';

The results are as follows:

+--------+---------+-------------+--------------+
| level1 | level2  | level3      | level4       |
+--------+---------+-------------+--------------+
| Linux  | Red Hat | RHEL        | CentOS       |
| Linux  | Red Hat | RHEL        | Oracle Linux |
| Linux  | Debian  | Knoppix     | NULL         |
| Linux  | Debian  | Ubuntu      | NULL         |
| Linux  | Red Hat | Fedora Core | NULL         |
| Linux  | Gentoo  | NULL        | NULL         |
+--------+---------+-------------+--------------+

As you can see, in fact, it is not easy for the client code to get this result.Compared with the original text, we find that the order in which the results are returned is also uncertain.There is no reference in practice.However, you can get the full path of a node by adding a WHERE condition:

SELECT
  t1.name AS level1,
  t2.name as level2,
  t3.name as level3,
  t4.name as level4
FROM
  distributions AS t1
  LEFT JOIN distributions AS t2
    ON t2.parent = t1.id
  LEFT JOIN distributions AS t3
    ON t3.parent = t2.id
  LEFT JOIN distributions AS t4
    ON t4.parent = t3.id
WHERE
  t1.name = 'Linux'
  AND t4.name = 'CentOS';

The results are as follows:

+--------+---------+--------+--------+
| level1 | level2  | level3 | level4 |
+--------+---------+--------+--------+
| Linux  | Red Hat | RHEL   | CentOS |
+--------+---------+--------+--------+

Find all leaf nodes

Using LEFT JOIN, we can find all the leaf nodes:

SELECT
  distributions.id, distributions.name
FROM
  distributions
  LEFT JOIN distributions as child
    ON distributions.id = child.parent
WHERE child.id IS NULL;

The results are as follows:

+----+--------------+
| id | name         |
+----+--------------+
|  3 | Knoppix      |
|  4 | Ubuntu       |
|  5 | Gentoo       |
|  7 | Fedora Core  |
|  9 | CentOS       |
| 10 | Oracle Linux |
+----+--------------+

Limitations of the adjacency table model

Using pure SQL to process adjacent table models is difficult even in the best case.Before we can get a complete path to a classification, we need to know how deep it is.In addition, we need to be extra cautious when deleting a node because it may potentially orphan the entire subtree during processing (e.g., deleting Portable Household Appliances orphans all its subcategories).Some of these limitations can be located and handled in client code or stored procedures.For example, during storage we can traverse the structure from bottom to top to return to the whole tree or a path.We can also use stored procedures to delete nodes and avoid orphaning the entire subtree by raising the level of one of its children and resetting the parent of all other children.

Nested Set model

Mike Hillyer solemnly introduces the Nested Set model because of the inconveniences of using pure SQL to process the adjacent table model.When using this model, we erase the nodes and paths of hierarchies from our mind and think of them as containers:

You can see that the hierarchy has not changed, and that large containers contain subcontainers.We use the left and right values of the container to create the data table:

CREATE TABLE nested (
  id INT NOT NULL AUTO_INCREMENT,
  name VARCHAR(32) NOT NULL,
  `left` INT NOT NULL,
  `right` INT NOT NULL,
  PRIMARY KEY (id)
)
ENGINE = InnoDB
DEFAULT CHARACTER SET = utf8;

It is important to note that left and right are reserved words for MySQL, so they are marked with an identity separator.

Insert data:

INSERT INTO nested VALUES
  (1, 'Linux', 1, 20),
  (2, 'Debian', 2, 7),
  (3, 'Knoppix', 3, 4),
  (4, 'Ubuntu', 5, 6),
  (5, 'Gentoo', 8, 9),
  (6, 'Red Hat', 10, 19),
  (7, 'Fedora Core', 11, 12),
  (8, 'RHEL', 13, 18),
  (9, 'CentOS', 14, 15),
  (10, 'Oracle Linux', 16, 17);

View content:

SELECT * FROM nested ORDER BY id;

You can see:

+----+--------------+------+-------+
| id | name         | left | right |
+----+--------------+------+-------+
|  1 | Linux        |    1 |    20 |
|  2 | Debian       |    2 |     7 |
|  3 | Knoppix      |    3 |     4 |
|  4 | Ubuntu       |    5 |     6 |
|  5 | Gentoo       |    8 |     9 |
|  6 | Red Hat      |   10 |    19 |
|  7 | Fedora Core  |   11 |    12 |
|  8 | RHEL         |   13 |    18 |
|  9 | CentOS       |   14 |    15 |
| 10 | Oracle Linux |   16 |    17 |
+----+--------------+------+-------+

How do we determine the left and right numbers? From the following illustration, we can visually see that just a few can be completed:

Returning to the tree model, anyone with a slight idea of the data structure will know that this numbering can be accomplished with a slightly modified first-order traversal algorithm:

Get the whole tree

The left number of a node is always between the left and right numbers of its parent node. Use this feature to link to the parent node using self-join to get the whole tree:

SELECT node.name
FROM
  nested AS node,
  nested AS parent
WHERE
  node.`left` BETWEEN parent.`left` AND parent.`right`
  AND parent.name = 'Linux'
ORDER BY node.`left`;

The results are as follows:

+--------------+
| name         |
+--------------+
| Linux        |
| Debian       |
| Knoppix      |
| Ubuntu       |
| Gentoo       |
| Red Hat      |
| Fedora Core  |
| RHEL         |
| CentOS       |
| Oracle Linux |
+--------------+

But then we lose the hierarchical information.What shall I do?This can be achieved by using the COUNT() function and the GROUP BY clause:

SELECT
  node.name, (COUNT(parent.name) - 1) AS depth
FROM
  nested AS node,
  nested AS parent
WHERE
  node.`left` BETWEEN parent.`left` AND parent.`right`
GROUP BY node.name
ORDER BY ANY_VALUE(node.`left`);

Note that MySQL 5.7.5 Begins Default only_full_group_by mode enabled to make GROUP BY behave in accordance with the SQL92 standard Thus, direct use of ORDER BY node.`left` results in errors:

ERROR 1055 (42000): Expression #1 of ORDER BY clause is not in GROUP BY clause and contains nonaggregated column'test.node.left' which is not functionally dependent on columns in GROUP BY clause; this is incompatible withsql_mode=only_full_group_by

Use ANY_VALUE() Is a way to avoid this problem.

The results are as follows:

+--------------+-------+
| name         | depth |
+--------------+-------+
| Linux        |     0 |
| Debian       |     1 |
| Knoppix      |     2 |
| Ubuntu       |     2 |
| Gentoo       |     1 |
| Red Hat      |     1 |
| Fedora Core  |     2 |
| RHEL         |     2 |
| CentOS       |     3 |
| Oracle Linux |     3 |
+--------------+-------+

With a few adjustments, you can display the hierarchy directly:

SELECT
  CONCAT(REPEAT(' ', COUNT(parent.name) - 1), node.name) AS name
FROM
  nested AS node,
  nested AS parent
WHERE
  node.`left` BETWEEN parent.`left` AND parent.`right`
GROUP BY node.name
ORDER BY ANY_VALUE(node.`left`);

The result is pretty:

+-----------------+
| name            |
+-----------------+
| Linux           |
|  Debian         |
|   Knoppix       |
|   Ubuntu        |
|  Gentoo         |
|  Red Hat        |
|   Fedora Core   |
|   RHEL          |
|    CentOS       |
|    Oracle Linux |
+-----------------+

Client code, of course, may prefer to use a depth value to loop through the returned result set, which Web developers can increase or decrease by using <li>/</li> or <ul>/</ul>, etc.

Gets the depth of a node in a subtree

To get the depth of a node in a subtree, we need a third self-join and a subquery to restrict the results to a specific subtree and make the necessary calculations:

SELECT
  node.name, (COUNT(parent.name) - ANY_VALUE(sub_tree.depth) - 1) AS depth
FROM
  nested AS node,
  nested AS parent,
  nested AS sub_parent,
  (
    SELECT
      node.name, (COUNT(parent.name) - 1) AS depth
    FROM
      nested AS node,
      nested AS parent
    WHERE
      node.`left` BETWEEN parent.`left` AND parent.`right`
       AND node.name = 'Red Hat'
    GROUP BY node.name, node.`left`
    ORDER BY node.`left`
  ) AS sub_tree
WHERE
  node.`left` BETWEEN parent.`left` AND parent.`right`
  AND node.`left` BETWEEN sub_parent.`left` AND sub_parent.`right`
  AND sub_parent.name = sub_tree.name
GROUP BY node.name
ORDER BY ANY_VALUE(node.`left`);

The result is:

+--------------+-------+
| name         | depth |
+--------------+-------+
| Red Hat      |     0 |
| Fedora Core  |     1 |
| RHEL         |     1 |
| CentOS       |     2 |
| Oracle Linux |     2 |
+--------------+-------+

Find a direct child of a node

This is fairly simple when using the adjacency table model.When using nested sets, we can add a HAVING clause to get the depth of each node in the subtree above:

SELECT
  node.name, (COUNT(parent.name) - ANY_VALUE(sub_tree.depth) - 1) AS depth
FROM
  nested AS node,
  nested AS parent,
  nested AS sub_parent,
  (
    SELECT
      node.name, (COUNT(parent.name) - 1) AS depth
    FROM
      nested AS node,
      nested AS parent
    WHERE
      node.`left` BETWEEN parent.`left` AND parent.`right`
       AND node.name = 'Red Hat'
    GROUP BY node.name, node.`left`
    ORDER BY node.`left`
  ) AS sub_tree
WHERE
  node.`left` BETWEEN parent.`left` AND parent.`right`
  AND node.`left` BETWEEN sub_parent.`left` AND sub_parent.`right`
  AND sub_parent.name = sub_tree.name
GROUP BY node.name
HAVING depth = 1
ORDER BY ANY_VALUE(node.`left`);

Result:

+-------------+-------+
| name        | depth |
+-------------+-------+
| Fedora Core |     1 |
| RHEL        |     1 |
+-------------+-------+

Get all leaf nodes

Looking at the numbered nested model, the judgment of leaf nodes is fairly simple. A node whose right number is exactly 1 more than the left number is a leaf node:

SELECT id, name FROM nested WHERE `right` = `left` + 1;

The results are as follows:

+----+--------------+
| id | name         |
+----+--------------+
|  3 | Knoppix      |
|  4 | Ubuntu       |
|  5 | Gentoo       |
|  7 | Fedora Core  |
|  9 | CentOS       |
| 10 | Oracle Linux |
+----+--------------+

Get the full path of a single node

The self-join technique is still used, but now you don't have to worry about the depth of the nodes:

SELECT parent.name
FROM
  nested AS node,
  nested AS parent
WHERE
  node.`left` BETWEEN parent.`left` AND parent.`right`
  AND node.name = 'CentOS'
ORDER BY parent.`left`;

The results are as follows:

+---------+
| name    |
+---------+
| Linux   |
| Red Hat |
| RHEL    |
| CentOS  |
+---------+

Aggregation operation

Let's add a release table to show aggregate operations under a nested set model:

CREATE TABLE releases (
  id INT NOT NULL AUTO_INCREMENT,
  distribution_id INT NULL,
  name VARCHAR(32) NOT NULL,
  PRIMARY KEY (id),
  INDEX distribution_id_idx (distribution_id ASC),
  CONSTRAINT distribution_id
    FOREIGN KEY (distribution_id)
    REFERENCES nested (id)
    ON DELETE CASCADE
    ON UPDATE CASCADE
)
ENGINE = InnoDB
DEFAULT CHARACTER SET = utf8;

Add some data, assuming it refers to a software-supported distribution:

INSERT INTO releases (distribution_id, name) VALUES
  (2, '7'), (2, '8'),
  (4, '14.04 LTS'), (4, '15.10'),
  (7, '22'), (7, '23'),
  (9, '5'), (9, '6'), (9, '7');

Then, the following query can tell the number of releases involved under each node. If this is a list of software-supported releases, maybe testers want to know how many virtual machines they have to prepare:

SELECT
  parent.name, COUNT(releases.name)
FROM
  nested AS node ,
  nested AS parent,
  releases
WHERE
  node.`left` BETWEEN parent.`left` AND parent.`right`
  AND node.id = releases.distribution_id
GROUP BY parent.name
ORDER BY ANY_VALUE(parent.`left`);

The results are as follows:

+-------------+----------------------+
| name        | COUNT(releases.name) |
+-------------+----------------------+
| Linux       |                    9 |
| Debian      |                    4 |
| Ubuntu      |                    2 |
| Red Hat     |                    5 |
| Fedora Core |                    2 |
| CentOS      |                    3 |
+-------------+----------------------+

If the hierarchy is a catalog of categories, this technique can be used to query how many items are associated under each category.

Add Node

Looking back at this picture again:

If we want to add a Slackware after Gentoo, the left and right numbers of this new node are 10 and 11, respectively, while all the numbers from 10 need to be added 2.We can:

LOCK TABLE nested WRITE;

SELECT @baseIndex := `right` FROM nested WHERE name = 'Gentoo';

UPDATE nested SET `right` = `right` + 2 WHERE `right` > @baseIndex;
UPDATE nested SET `left` = `left` + 2 WHERE `left` > @baseIndex;

INSERT INTO nested (name, `left`, `right`) VALUES
  ('Slackware', @baseIndex + 1, @baseIndex + 2);

UNLOCK TABLES;

Use the techniques you learned before to see what's happening now:

SELECT
  CONCAT(REPEAT(' ', COUNT(parent.name) - 1), node.name) AS name
FROM
  nested AS node,
  nested AS parent
WHERE
  node.`left` BETWEEN parent.`left` AND parent.`right`
GROUP BY node.name
ORDER BY ANY_VALUE(node.`left`);

The results are:

+-----------------+
| name            |
+-----------------+
| Linux           |
|  Debian         |
|   Knoppix       |
|   Ubuntu        |
|  Gentoo         |
|  Slackware      |
|  Red Hat        |
|   Fedora Core   |
|   RHEL          |
|    CentOS       |
|    Oracle Linux |
+-----------------+

If the parent node of the new node is a leaf node, we need to adjust the previous code a little.For example, we want to add Slax as a child node of Slackware:

LOCK TABLE nested WRITE;

SELECT @baseIndex := `left` FROM nested WHERE name = 'Slackware';

UPDATE nested SET `right` = `right` + 2 WHERE `right` > @baseIndex;
UPDATE nested SET `left` = `left` + 2 WHERE `left` > @baseIndex;

INSERT INTO nested(name, `left`, `right`) VALUES ('Slax', @baseIndex + 1, @baseIndex + 2);

UNLOCK TABLES;

Now the data looks like:

+-----------------+
| name            |
+-----------------+
| Linux           |
|  Debian         |
|   Knoppix       |
|   Ubuntu        |
|  Gentoo         |
|  Slackware      |
|   Slax          |
|  Red Hat        |
|   Fedora Core   |
|   RHEL          |
|    CentOS       |
|    Oracle Linux |
+-----------------+

Delete Node

Deleting a node is the opposite of adding. When a leaf node is deleted, the node is removed and all codes larger than the right code of the node are subtracted by 2.This idea can be extended to delete a node and all its children, delete all nodes whose left code is between the left and right numbers of the nodes, and move all the right node numbers to the left of the original number width of the node:

LOCK TABLE nested WRITE;

SELECT
  @nodeLeft := `left`,
  @nodeRight := `right`,
  @nodeWidth := `right` - `left` + 1
FROM nested
WHERE name = 'Slackware';

DELETE FROM nested WHERE `left` BETWEEN @nodeLeft AND @nodeRight;

UPDATE nested SET `right` = `right` - @nodeWidth WHERE `right` > @nodeRight;
UPDATE nested SET `left` = `left` - @nodeWidth WHERE `left` > @nodeRight;

UNLOCK TABLES;

You can see that the Slackware subtree has been deleted:

+-----------------+
| name            |
+-----------------+
| Linux           |
|  Debian         |
|   Knoppix       |
|   Ubuntu        |
|  Gentoo         |
|  Red Hat        |
|   Fedora Core   |
|   RHEL          |
|    CentOS       |
|    Oracle Linux |
+-----------------+

A slight adjustment can be made to remove a node whose child nodes are one level higher if the number corresponding to the node directly between the left and right numbers of the node to be deleted is moved left 1 and the number corresponding to the right node is moved left 2. For example, we try to delete the RHEL but keep its child nodes:

LOCK TABLE nested WRITE;

SELECT
  @nodeLeft := `left`,
  @nodeRight := `right`
FROM nested
WHERE name = 'RHEL';

DELETE FROM nested WHERE `left` = @nodeLeft;

UPDATE nested SET `right` = `right` - 1, `left` = `left` - 1 WHERE `left` BETWEEN @nodeLeft AND @nodeRight;
UPDATE nested SET `right` = `right` - 2 WHERE `right` > @nodeRight;
UPDATE nested SET `left` = `left` - 2 WHERE `left` > @nodeRight;

UNLOCK TABLES;

The results are:

SELECT
  CONCAT(REPEAT(' ', COUNT(parent.name) - 1), node.name) AS name
FROM
  nested AS node,
  nested AS parent
WHERE
  node.`left` BETWEEN parent.`left` AND parent.`right`
GROUP BY node.name
ORDER BY ANY_VALUE(node.`left`);
+----------------+
| name           |
+----------------+
| Linux          |
|  Debian        |
|   Knoppix      |
|   Ubuntu       |
|  Gentoo        |
|  Red Hat       |
|   Fedora Core  |
|   CentOS       |
|   Oracle Linux |
+----------------+

Tags: Programming Linux Red Hat CentOS RHEL

Posted on Tue, 17 Dec 2019 21:15:50 -0500 by Michael Lasky