DEV Community: Sergey Kamenev

Entity-attribute-value model in relational databases. Should globals be emulated on tables? Part 2.

Sergey Kamenev — Tue, 24 Nov 2020 12:47:01 +0000

A More Industrial-Looking Global Storage Scheme

In the first article in this series, we looked at the entity–attribute–value (EAV) model in relational databases and weighed the pros and cons of storing entities, attributes and values in tables. We learned that, despite the flexibility of this approach, it has some real disadvantages, in particular a basic mismatch between the logical structure of the data and its physical storage, which causes various difficulties.

To solve these issues, we decided to see whether globals, which are optimized for storing hierarchical information, would work for the tasks the EAV approach typically handles.

In Part 1, we created a catalog for an online store, first using tables, then using just one global. Now, let's try to implement the same structure using several globals.

In the first global, ^catalog, we’ll store the directory structure. In the second global, ^good, we’ll store goods. And in the global ^index, we’ll store indexes. Since our properties are tied to a hierarchical catalog, we won’t create a separate global for them.

With this approach, each entity (except for properties) has its own global, which keeps the logic clean. Here’s the structure of the catalog global:

Set ^catalog(root_id, "Properties", "capacity", "name") = "Capacity, GB"
Set ^catalog(root_id, "Properties", "capacity", "sort") = 1

Set ^catalog(root_id, sub1_id, "Properties", "endurance", "name") = "Endurance, TBW"
Set ^catalog(root_id, sub1_id, "Properties", "endurance", "sort") = 2

Set ^catalog(root_id, sub1_id, "goods", id_good1) = 1
Set ^catalog(root_id, sub1_id, "goods", id_good2) = 1

Set ^catalog(root_id, sub2_id, "Properties", "avg_seek_time", "name") = "Average seek time, ms"
Set ^catalog(root_id, sub2_id, "Properties", "avg_seek_time", "sort") = 3

Set ^catalog(root_id, sub2_id, "goods", id_good3) = 1
Set ^catalog(root_id, sub2_id, "goods", id_good4) = 1

A global with goods will look something like this:

Set ^good(id_good, property1) = value1
Set ^good(id_good, property2) = value2
Set ^good(id_good, property3) = value3
Set ^good(id_good, "catalog") = catalog_id

Of course, we need indexes so that for any section of the catalog with goods, we can sort by the properties we need. The index global will have a structure something like this:

Set ^index(id_catalog, property1, id_good) = 1
; To quickly get the full path to a specific catalog section
Set ^index("path", id_catalog) = "^catalog(root_id, sub1_id)"

Thus, in any section of the catalog, we can get a sorted list. An index global is optional. It’s useful only if the number of products in this section of the catalog is large.

ObjectScript Code for Working with Demo Data

Now we’ll use ObjectScript to work with our data. To start, let’s get the properties of a specific good. We have the ID of a particular good and we need to display its properties in the order given by the sort value. Here’s the code for that:

GetSortedProperties(path, boolTable)
{
  ; Note: ObjectScript names cannot contain underscores ("_" is the
  ; concatenation operator), so the routines here use camelCase names.
  ; Collect the properties in the temporary global ^tmp, keyed by sort order.
  ; Start at the given catalog section and walk up to the root so that
  ; properties defined higher in the hierarchy are included.
  Kill ^tmp
  While $QLENGTH(path) > 0 {
    If $DATA(@path@("Properties")) {
      Set ln = ""
      For {
        Set ln = $ORDER(@path@("Properties", ln))
        Quit:ln=""

        ; In table mode, keep only the properties flagged with table_view = 1
        If 'boolTable || ($GET(@path@("Properties", ln, "table_view")) = 1) {
          Set ^tmp(@path@("Properties", ln, "sort"), ln) = @path@("Properties", ln, "name")
        }
      }
    }
    ; Move one level up the catalog hierarchy
    Set path = $NAME(@path, $QLENGTH(path) - 1)
  }
}

PrintSortedPropertiesOfGood(idGood)
{
  Set idCatalog = ^good(idGood, "catalog")
  Set path = ^index("path", idCatalog)

  Do GetSortedProperties(path, 0)

  Set ln = ""
  For {
    Set ln = $ORDER(^tmp(ln))
    Quit:ln=""
    Set fn = ""
    For {
      Set fn = $ORDER(^tmp(ln, fn))
      Quit:fn=""
      ; $GET avoids an <UNDEFINED> error when a property is not filled in
      Write ^tmp(ln, fn), " ", $GET(^good(idGood, fn)), !
    }
  }
}

Next, we want to get the products from a catalog section in the form of a table, based on the catalog ID:

PrintGoodsTableOfCatalog(idCatalog)
{
  Set path = ^index("path", idCatalog)
  Do GetSortedProperties(path, 1)

  Set id = ""
  For {
    Set id = $ORDER(@path@("goods", id))
    Quit:id=""

    Write id, " ", ^good(id, "price"), " "

    Set ln = ""
    For {
      Set ln = $ORDER(^tmp(ln))
      Quit:ln=""
      Set fn = ""
      For {
        Set fn = $ORDER(^tmp(ln, fn))
        Quit:fn=""
        Write ^tmp(ln, fn), " ", $GET(^good(id, fn)), " "
      }
    }
    ; End the row: one output line per good
    Write !
  }
}
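The intent of the property lookup above (start at a catalog section, walk up to the root, and collect properties in sort order) can also be sketched in Python. This is only an illustration, not InterSystems code: nested dictionaries stand in for the ^catalog global, the section IDs "root" and "ssd" are made up, and the table_view flags are assumed values, since the catalog global shown earlier does not set them.

```python
# Hypothetical stand-in for the ^catalog global: nested dicts keyed by section ID.
catalog = {
    "root": {
        "Properties": {
            "capacity": {"name": "Capacity, GB", "sort": 1, "table_view": 1},
        },
        "ssd": {
            "Properties": {
                "endurance": {"name": "Endurance, TBW", "sort": 2, "table_view": 0},
            },
        },
    },
}


def get_sorted_properties(path, table_only):
    """Collect properties from the section at `path` up to the root, ordered by 'sort'."""
    found = []
    # Visit the section itself first, then each ancestor in turn.
    for depth in range(len(path), 0, -1):
        node = catalog
        for key in path[:depth]:
            node = node[key]
        for prop_id, prop in node.get("Properties", {}).items():
            # In table mode, keep only properties flagged for the table view.
            if not table_only or prop.get("table_view") == 1:
                found.append((prop["sort"], prop_id, prop["name"]))
    return sorted(found)


all_props = get_sorted_properties(["root", "ssd"], table_only=False)
table_props = get_sorted_properties(["root", "ssd"], table_only=True)
print(all_props)
print(table_props)
```

A real global keeps its subscripts sorted on its own, so the explicit `sorted` call here plays the role that the ^tmp global plays in the ObjectScript version.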

Readability: EAV SQL Versus Globals

Now let’s compare the use of EAV and SQL against using globals. With regard to code clarity, it’s clear that this is a subjective parameter. But let's look, for example, at creating a new product.

We’ll start with the EAV approach, using SQL. First, we need to get a list of object properties. This is a separate task and quite time-consuming. Assume we already know the IDs of these three properties: capacity, weight, and endurance.

START TRANSACTION;
INSERT INTO good (name, price, item_count, catalog_id) VALUES ('F320 3.2TB AIC SSD', 700, 10, 15);

SET @last_id = LAST_INSERT_ID();

INSERT INTO NumberValues VALUES (@last_id, @id_capacity, 3200);
INSERT INTO NumberValues VALUES (@last_id, @id_weight, 0.4);
INSERT INTO NumberValues VALUES (@last_id, @id_endurance, 29000);
COMMIT;

In this example, we have only three properties, and therefore the example doesn’t look so scary. In the general case, we’d still have a few inserts into the text table inside the transaction:

INSERT INTO TextValues VALUES (@last_id, @id_text_prop1, 'Text value of property 1');
INSERT INTO TextValues VALUES (@last_id, @id_text_prop2, 'Text value of property 2');
...
INSERT INTO TextValues VALUES (@last_id, @id_text_propN, 'Text value of property N');

Of course, we could simplify the SQL version a little if we used text names for properties instead of numeric IDs, such as “capacity” instead of a number. But in the SQL world, this isn’t the accepted practice: it’s customary to use a numeric ID to identify entity instances, because indexes are faster (fewer bytes to index), uniqueness is easier to track, and new IDs are easier to generate automatically. If we did use text names, the insert fragment would look like this:

INSERT INTO NumberValues Values(@last_id, 'capacity', 3200);
INSERT INTO NumberValues Values(@last_id, 'weight', 0.4);
INSERT INTO NumberValues Values(@last_id, 'endurance', 29000);

Here’s the same example using globals:

TSTART
Set ^good(id, "name") = "F320 3.2TB AIC SSD"
Set ^("price") = 700, ^("item_count") = 10, ^("reserved_count") = 0, ^("catalog") = id_catalog
Set ^("capacity") = 3200, ^("weight") = 0.4, ^("endurance") = 29000
TCOMMIT

Now let’s delete a good using the EAV approach:

START TRANSACTION;
DELETE FROM good WHERE id = @good_id;
DELETE FROM NumberValues WHERE good_id = @good_id;
DELETE FROM TextValues WHERE good_id = @good_id;
COMMIT;

And then do the same with globals:

Kill ^good(id_good)

We can also compare the two approaches in terms of code length. As you can see from the previous examples, when you use globals, the code is shorter. This is good. The shorter the code, the fewer the errors and the easier it is to understand and maintain.

Generally, shorter code is also faster. And, in this case, that’s certainly true, since globals are a lower-level data structure than relational tables.

Scaling Data with EAV and Globals

Next, let’s look at horizontal scaling. With the EAV approach, we have to distribute at least the three largest tables across the servers: Good, NumberValues, and TextValues. The tables with entities and attributes can simply be copied in full to all servers, since they hold little data.

On each server, with horizontal scaling, different products would be stored in the Good, NumberValues, and TextValues tables. We’d have to allocate certain ID blocks for products on each server so that there’s no duplication of IDs for different products.

For horizontal scaling with globals, we’d have to configure ID ranges in the global and assign a global range to each server.

The complexity is approximately the same for EAV and for globals, except that for the EAV approach we’d have to configure ID ranges for three tables. With globals, we’d configure IDs for just one global. That is, it’s easier to organize horizontal scaling for globals.
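The idea of giving each server its own block of IDs can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the block size of one million IDs and the helper names `id_range` and `server_for` are ours, and a real deployment would configure the ranges in the database itself.

```python
BLOCK_SIZE = 1_000_000  # assumed number of IDs reserved for each server


def id_range(server_index, block_size=BLOCK_SIZE):
    """Return the half-open range of IDs [start, stop) owned by a server."""
    start = server_index * block_size + 1
    return range(start, start + block_size)


def server_for(good_id, block_size=BLOCK_SIZE):
    """Return the index of the server whose block contains good_id."""
    return (good_id - 1) // block_size


ranges = [id_range(i) for i in range(3)]
print([(r.start, r.stop - 1) for r in ranges])
print(server_for(1), server_for(1_000_000), server_for(1_000_001))
```

With EAV, the same mapping has to be applied consistently to Good, NumberValues, and TextValues; with globals, it applies to the subscripts of a single global.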

Data Loss with EAV and Globals

Finally, let’s consider the risk of data loss due to corrupted database files. Where is it easier to save all the data: in five tables or in three globals (including an index global)?

I think it's easier in three globals. With the EAV approach, the data for different goods is mixed in tables, while for globals the information is stored more holistically. The underlying branches are stored and sorted sequentially. Therefore, corruption of part of the global is less likely to lead to damage than corruption of any of the tables in the EAV approach, where data is stored like intertwined pasta.

Another headache in data recovery is the display of information. With the EAV approach, information is divided among several tables and special scripts are required to assemble it into a single whole. In the case of globals, you can simply use the ZWRITE command to display all the values and the underlying branches of the node.
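To give a feel for why a subtree dump is convenient, here is a rough Python imitation of that kind of output. It is only a sketch: a nested dictionary stands in for the ^good global, the good ID 17 is made up, the function names are ours, and the formatting only approximates what ZWRITE prints (for example, real globals apply their own collation and number formatting).

```python
def fmt(x):
    """Quote strings with doubled inner quotes; leave numbers bare."""
    return '"' + x.replace('"', '""') + '"' if isinstance(x, str) else str(x)


def zwrite_like(name, node, subscripts=()):
    """Return one 'name(subscripts)=value' line per leaf of a nested dict."""
    lines = []
    # Keys on one level are assumed to be of one type so they can be sorted.
    for key in sorted(node):
        path = subscripts + (key,)
        value = node[key]
        if isinstance(value, dict):
            lines.extend(zwrite_like(name, value, path))
        else:
            lines.append(f"{name}({','.join(fmt(s) for s in path)})={fmt(value)}")
    return lines


good = {17: {"capacity": 3200, "name": "F320 3.2TB AIC SSD", "weight": 0.4}}
dump = zwrite_like("^good", good)
print("\n".join(dump))
```

Everything known about one good comes out as one contiguous block, with no joins needed to reassemble it.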

InterSystems IRIS Globals: A Better Approach?

The EAV approach emerged as a trick for storing hierarchical data. Tables weren’t originally designed to store nested data. The EAV approach is, de facto, an emulation of globals in tables. Given that tables are a higher-level and slower data storage structure than globals, the EAV approach fails in comparison with globals.

In my opinion, for hierarchical data structures, globals are more convenient and more comprehensible in terms of programming, and they’re faster.

If you’ve been planning an EAV approach for your project, I suggest you consider using InterSystems IRIS globals to store hierarchical data.

Entity-attribute-value model in relational databases. Should globals be emulated on tables? Part 1.

Sergey Kamenev — Thu, 19 Nov 2020 10:05:59 +0000

Introduction

In the first article in this series, we’ll take a look at the entity–attribute–value (EAV) model in relational databases to see how it’s used and what it’s good for. Then we'll compare the EAV model concepts to globals.

Sometimes you have objects with an unknown number of fields, or perhaps hierarchically nested fields, that you typically need to search on.

Take, for example, an online store with diverse groups of products. Each product group has its own unique set of properties and has common properties as well. For example, SSD and HDD drives have the common property “capacity,” while both also have unique properties, “Endurance, TBW” for SSDs and “average head positioning time” for HDDs.

In some situations, the same product, produced by different manufacturers, has its own unique properties.

So, let's imagine we have an online store that sells 50 different groups of goods. Each product group has its own five unique properties, which can be numeric or text.

If we create a table in which each product has 250 properties, though only five are really used, we not only greatly increase (50 times!) the requirements for disk space, we also greatly reduce the speed characteristics of the database, since the cache will be clogged with useless, empty properties.
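The arithmetic behind the “50 times” figure can be checked in a few lines of Python. This is only a sketch: it assumes every column takes up space whether or not it is filled, which depends on the database engine, and the variable names are ours.

```python
groups = 50           # product groups in the store
props_per_group = 5   # unique properties per group

columns = groups * props_per_group        # property columns in one wide table
used = props_per_group                    # columns actually filled for a product
overhead = columns // used                # how many times more cells are stored
empty_share = (columns - used) / columns  # fraction of property cells left empty

print(columns, overhead, empty_share)
```

In other words, under these assumptions 98% of the property cells in every row would be empty.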

But that’s not all. Every time we add a new product group with its own properties, we need to change the structure of the table using the ALTER TABLE command. On large tables, this operation might take hours or days, which is unacceptable for business.

“Yes,” the attentive reader will note, “but we can use a different table for each group of products.” Of course, you're right, but this approach gives us a database with tens of thousands of tables for a large store, which is difficult to administer. Moreover, the code, which needs to be supported, becomes increasingly complex.

On the other hand, with a table per group there’s no need to alter existing tables when adding a new group of products. You only need to add a new table for the new group.

In any case, users need to be able to easily search the products in a store, get a convenient tabular display of goods showing their current properties, and also be able to compare products.

As you can imagine, a search form with 250 fields would be extremely inconvenient for the user, as would seeing 250 columns of various properties in the product table when only five properties for the group are needed. The same applies to product comparisons.

A marketing database is another useful example. For each person stored in it, you’d need many properties (often nested) that might be constantly added, changed, or removed. A person might have bought something for a certain amount, bought certain groups of goods, participated in something, worked somewhere, have relatives, live in a certain city, belong to a certain class of society, and so on. There could be thousands of possible fields, constantly changing. Marketers are always thinking about how to distinguish different groups of customers and make compelling special offers to them.

To solve these problems and at the same time have a clear and definite database structure, the entity–attribute–value approach was developed.

The EAV Approach

The essence of the EAV approach is the separate storage of entities, attributes, and attribute values. Typically, to illustrate the EAV approach, only three tables are used, called Entity, Attribute, and Value:
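As a minimal sketch of that three-table layout, the following Python snippet builds it in an in-memory SQLite database. The column names and the sample rows are assumptions made for illustration; the article itself only names the three tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Entity    (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Attribute (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Value     (entity_id INTEGER, attribute_id INTEGER, value TEXT);

    INSERT INTO Entity    VALUES (1, 'SSD drive'), (2, 'HDD drive');
    INSERT INTO Attribute VALUES (1, 'capacity'), (2, 'endurance');
    -- Only the attributes an entity actually has get a row in Value.
    INSERT INTO Value VALUES (1, 1, '3200'), (1, 2, '29000'), (2, 1, '4000');
""")

rows = conn.execute("""
    SELECT e.name, a.name, v.value
    FROM Value AS v
    JOIN Entity    AS e ON e.id = v.entity_id
    JOIN Attribute AS a ON a.id = v.attribute_id
    ORDER BY e.id, a.id
""").fetchall()
print(rows)
```

Adding a new attribute is then an INSERT into Attribute rather than an ALTER TABLE, which is where the flexibility of the approach comes from.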

The structure of the demo data that we will store.

Implementing the EAV Approach Using Tables

Now let’s consider a more complex example using five tables (four if you choose to consolidate the last two tables into one).

The first table is Catalog:

CREATE TABLE Catalog (
id INT,
name VARCHAR (128),
parent INT
);

This table actually corresponds to Entity in the EAV approach. It will store sections of the hierarchical catalog of goods.

The second table is Field:

CREATE TABLE Field (
id INT,
name VARCHAR (128),
type_of VARCHAR (16),
searchable INT,
catalog_id INT,
table_view INT,
sort INT
);

In this table, we specify the name of the attribute, its type, and whether the attribute is searchable. We also indicate the section of the catalog that holds the goods to which these properties belong. All products in the catalog section of catalog_id or lower might have different properties that are stored in this table.

The third table is Good. It’s designed to store goods, along with their prices, the total quantity of the goods, the reserved quantity of the goods, and the name of the goods. In principle, you don’t really need this table but, in my opinion, it’s useful to have a separate table for the goods.

CREATE TABLE Good (
id INT,
name VARCHAR (128),
price FLOAT,
item_count INT,
reserved_count INT,
catalog_id INT
);

The fourth table (TextValues) and the fifth table (NumberValues) are designed to store the values of the text and numeric attributes of goods, and have a similar structure.

CREATE TABLE TextValues (
good_id INT,
field_id INT,
fValue TEXT
);

CREATE TABLE NumberValues (
good_id INT,
field_id INT,
fValue FLOAT
);

Instead of the text and number values tables, you could use a single CustomValues table with this structure:

CREATE TABLE CustomValues (
good_id INT,
field_id INT,
text_value TEXT,
number_value FLOAT
);

I prefer to store different types of data separately as it increases speed and saves space.

Accessing the Data Using the EAV Approach

Let’s start by retrieving the catalog structure using SQL:

SELECT * FROM Catalog ORDER BY id;

In order to form a tree from these values, some separate code is required. In PHP, it would look something like this:

$stmt = $pdo->query('SELECT * FROM Catalog ORDER BY id');
$aTree = [];
$idRoot = NULL;

while ($row = $stmt->fetch())
{
    // Set only the name so that a 'sub' list created earlier by a child row is preserved
    $aTree[$row['id']]['name'] = $row['name'];

    if (!$row['parent'])
      $idRoot = $row['id'];
    else
      $aTree[$row['parent']]['sub'][] = $row['id'];
}

We can then draw the tree by starting from the root node $aTree[$idRoot].

Now let’s get the properties of a specific product.

First, we’ll get a list of properties specific to this product, then attach to it those properties that are in the database. In real life, not all indicated properties are filled and therefore we’re forced to use LEFT JOIN:

SELECT * FROM
(
SELECT g.*, f.name AS field_name, f.type_of, val.fValue, f.sort FROM Good as g
INNER JOIN Field as f ON f.catalog_id = g.catalog_id
LEFT JOIN TextValues as val ON val.good_id = g.id AND f.id = val.field_id
WHERE g.id = $nGood AND f.type_of = 'text'
UNION
SELECT g.*, f.name AS field_name, f.type_of, val.fValue, f.sort FROM Good as g
INNER JOIN Field as f ON f.catalog_id = g.catalog_id
LEFT JOIN NumberValues as val ON val.good_id = g.id AND f.id = val.field_id
WHERE g.id = $nGood AND f.type_of = 'number'
) t
ORDER BY t.sort;

If we use only one table for storing both numerical and text values, the query is greatly simplified:

SELECT g.*, f.name AS field_name, f.type_of, val.text_value, val.number_value, f.sort FROM Good as g
INNER JOIN Field as f ON f.catalog_id = g.catalog_id
LEFT JOIN CustomValues as val ON val.good_id = g.id AND f.id = val.field_id
WHERE g.id = $nGood
ORDER BY f.sort;

Now we’ll get the products contained in the $nCatalog catalog section in the form of a table. First, we get a list of properties that should be reflected in the table view for this section of the catalog:

SELECT f.id, f.name, f.type_of FROM Catalog as c
INNER JOIN Field as f ON f.catalog_id = c.id
WHERE c.id = $nCatalog AND f.table_view = 1
ORDER BY f.sort;

Then we construct the query to create the table. Suppose for a tabular view we need three additional properties (not counting those in the Good table). To simplify the query, we assume that:

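SELECT g.id, g.name, g.price,
            f1.fValue as f1_val,
            f2.fValue as f2_val,
            f3.fValue as f3_val
FROM Good as g
-- $nField1, $nField2 and $nField3 are placeholders for the IDs of the three properties
LEFT JOIN TextValues as f1 ON f1.good_id = g.id AND f1.field_id = $nField1
LEFT JOIN NumberValues as f2 ON f2.good_id = g.id AND f2.field_id = $nField2
LEFT JOIN NumberValues as f3 ON f3.good_id = g.id AND f3.field_id = $nField3
WHERE g.catalog_id = $nCatalog;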
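To see how such a tabular query behaves, here is a small, self-contained Python sketch that runs it against an in-memory SQLite database. The data is made up for illustration: the field IDs 10, 20 and 30, the 'PCIe' text value and the second good are assumptions, and only the columns the query needs are created. Each join is restricted to one field ID so that every value table contributes at most one row per good.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Good (id INTEGER PRIMARY KEY, name TEXT, price REAL, catalog_id INTEGER);
    CREATE TABLE TextValues   (good_id INTEGER, field_id INTEGER, fValue TEXT);
    CREATE TABLE NumberValues (good_id INTEGER, field_id INTEGER, fValue REAL);

    INSERT INTO Good VALUES (1, 'F320 3.2TB AIC SSD', 700, 15), (2, 'Unfilled SSD', 100, 15);
    -- Hypothetical field IDs: 10 = a text property, 20 and 30 = numeric properties.
    INSERT INTO TextValues   VALUES (1, 10, 'PCIe');
    INSERT INTO NumberValues VALUES (1, 20, 3200), (1, 30, 29000), (2, 20, 500);
""")

rows = conn.execute("""
    SELECT g.id, g.name, g.price,
           f1.fValue AS f1_val, f2.fValue AS f2_val, f3.fValue AS f3_val
    FROM Good AS g
    LEFT JOIN TextValues   AS f1 ON f1.good_id = g.id AND f1.field_id = ?
    LEFT JOIN NumberValues AS f2 ON f2.good_id = g.id AND f2.field_id = ?
    LEFT JOIN NumberValues AS f3 ON f3.good_id = g.id AND f3.field_id = ?
    WHERE g.catalog_id = ?
    ORDER BY g.id
""", (10, 20, 30, 15)).fetchall()
print(rows)
```

The second good has only one property filled in, and the LEFT JOINs keep it in the result with empty cells instead of dropping the row. Note that every extra column in the table view costs one more join.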

Pros and Cons of the EAV Approach

The obvious plus of the EAV approach is flexibility. With fixed data structures such as tables, we can afford to store a wide variety of property sets for objects. And we can store different data structures without changing the database schema.

We can also use SQL, which is familiar to a great many developers.

The most obvious minus is the mismatch between the logical structure of the data and its physical storage, which causes various difficulties.

Moreover, the programming often involves very complex SQL queries. Debugging can be difficult as you need to create non-standard tools for viewing EAV data. Also, you might have to use LEFT JOIN queries, which slow down the database.

Globals: An Alternative to EAV

Since I’m familiar with both the SQL world and the world of globals, I got the idea that using globals for tasks the EAV approach solves would be much more attractive.

Globals are data structures that allow you to store sparse and hierarchical information. A very important point is that globals are carefully optimized for storing hierarchical information. Globals themselves are lower-level structures than tables, which allows them to work much faster than tables.

At the same time, the global structure itself can be selected according to the data structure, which makes the code very simple and clear.

Global Structure for Storing Demo Data

A global is such a flexible and elegant structure for storing data that we could manage with only one global for storing data in catalog sections, properties and products, for example, like this:

Notice how similar the global structure is to the data structure. This correspondence greatly simplifies coding and debugging.

In practice, it’s better to use several globals, although the temptation to store all the information in one is quite strong. It makes sense to create separate globals for indexes. You can also separate the storage of the catalog section structure from the goods.

What’s Next?

In the second article in this series, we’ll talk about the details and benefits of storing data in InterSystems IRIS globals instead of following the EAV model.