Mustafa ERBAY

Posted on May 22 • Originally published at mustafaerbay.com.tr

Product Tree Denormalization: The Hidden Killer of ERP Performance

#tutorials #erp #performance #database

The Backstage of Product Tree Denormalization: A Performance Trap

In a manufacturing ERP system, managing the Bill of Materials (BOM) forms the foundation of every stage, from project planning to shipment. This tree details which sub-components make up a main product, how much of each component is needed, and how these components are grouped internally. However, a situation I frequently encounter is that this product tree structure is filled with denormalizations that cause serious performance issues. In this post, I will explain the reasons behind these denormalizations, how they negatively impact performance, and how I dealt with these problems based on my own experiences.

The product tree is inherently hierarchical. We start from the main part that forms a product and go down to the smaller parts that make up that part. During development processes, approaches that flatten, i.e., denormalize, this hierarchy might be preferred to improve query performance or reduce data redundancy. However, if this denormalization is not managed correctly, it can create an incredible load on the database and severely undermine the system's overall performance. This situation becomes more pronounced, especially in manufacturing firms with large and complex product trees.

Why Do We Fall into the Denormalization Trap?

The primary motivation for denormalization is usually to increase query speed. In a normalized database, you might need to JOIN multiple tables to see all components of a product. This can be time-consuming, especially for frequently accessed data. With denormalization, all necessary information is gathered in a single table or fewer tables, which speeds up read operations. For example, storing the direct main product ID or the full hierarchical path for each product component.

However, denormalization comes at a cost: maintaining data consistency is more difficult, and write operations become complex. Changing or adding a component in a product tree might require you to update multiple places in the denormalized data. If these updates are incomplete or erroneous, data inconsistency occurs. These inconsistencies lead to errors in reporting, inaccuracies in production planning, and ultimately, noticeable drops in system performance. This risk increases exponentially, especially in product trees updated thousands of times a month.

ℹ️ Real-Life Example

In a manufacturing ERP project, when users updated a component in the product tree management module, the system took 15-20 seconds to respond. During detailed investigations, we found that updating a single component required updating over 50 rows in 5 different denormalized tables. This situation created a continuous database locking and blocking problem.

Performance Impacts of Denormalized Product Trees

The performance impacts of denormalized product trees are often overlooked or not noticed initially. However, over time, especially as data volume increases, these impacts become more apparent. The most common impacts I encounter are:

Slow Queries: While denormalization aims to increase read speed, if data redundancy and complex updates are involved, it can have the opposite effect. Queries might have to scan unnecessarily large amounts of data.
High CPU and Disk Usage: The database server can become overloaded as it tries to manage multiple tables simultaneously during updates. This increases CPU usage and causes intense disk I/O.
Locking Issues: The complexity arising during write operations can cause database tables or rows to remain locked for extended periods. This leads to other operations waiting and the system freezing.
Data Inconsistency and Erroneous Reporting: This is one of the most critical impacts. Incorrectly updated or missing data leads to incorrect reports and disrupts production plans.

These impacts directly affect not only database administrators but also end-users and business processes. Slowly opening screens, lengthy reporting times, and inaccurate stock information all slow down workflow and increase costs.

Realistic Scenarios and Solutions

When I encounter such performance issues, the first step is usually to understand the root cause. This involves not only examining database logs but also understanding queries at the application layer and data access patterns.

Scenario 1: Artificial Hierarchy Path

In an ERP system, storing the full product tree path (e.g., Main Product > Sub Assembly 1 > Sub Assembly 2 > Component X) in a separate column is a common denormalization technique. This allows quickly finding which main product a specific component belongs to. However, when a component is moved from one sub-assembly to another, or the main product changes, this path needs to be updated.

If this update is not done correctly, the hierarchical path can remain incorrect even if the main product ID is still accurate. This leads to serious errors in reporting or tree visualization.

-- Example problematic update
UPDATE product_components
SET hierarchy_path = REPLACE(hierarchy_path, 'Old_Sub_Assembly', 'New_Sub_Assembly')
WHERE parent_id = 123; -- Such a simple update doesn't always work.

An update like this, if the product_components table has millions of records and all of them need to be scanned, will run incredibly slowly. Furthermore, if a component has multiple parents (which is rare in product trees but complex structures can force it), updating this path becomes even more complex.

Solution: In this scenario, instead of storing the denormalized path information, it's more sensible to calculate it when needed or use a smarter structure. For example, dynamically generating the tree using recursive queries with Common Table Expressions (CTE) or storing only the immediate parent for each component and building the tree on demand.

-- Example of querying the product tree with Recursive CTE
WITH RECURSIVE BOM_CTE AS (
    -- Anchor member: Main products
    SELECT
        id,
        name,
        parent_id,
        0 AS level
    FROM products
    WHERE parent_id IS NULL

    UNION ALL

    -- Recursive member: Sub-components
    SELECT
        p.id,
        p.name,
        p.parent_id,
        BOM_CTE.level + 1
    FROM products p
    INNER JOIN BOM_CTE ON p.parent_id = BOM_CTE.id
)
SELECT * FROM BOM_CTE;

This CTE approach makes it easier to ensure data consistency because the only stored information is parent_id. While query performance might be slightly lower initially, the update complexity is eliminated, and the system runs more stably in the long run.

Scenario 2: Aggregate Quantity Denormalization

Another common denormalization is storing the final total component quantity of a main product in a separate column. For example, Total Wheel Count: 4 in the production of a car. This information can be directly used in reporting or inventory tracking.

The problem is that when a sub-component changes or a new sub-component is added, this total quantity needs to be updated correctly. If the quantity of a component in a sub-assembly changes, the total quantity in all higher-level products affected by this change must also be recalculated. This leads to a cascading effect, especially in deep hierarchies.

For instance, every new bolt added to a motorcycle's main frame increases the total component count of that main frame by one. If a product has 10,000 different components, and you change one of these components, it can affect the "total component count" field of all parent assemblies and final products containing that product.

⚠️ Update Complexity

In one company, the "total component count" field displayed on the main product page was incorrectly calculated after an update. Upon investigation, we found that after a minor change in a sub-assembly, the total component count for 3 main products affected by this change was not updated. The reason was that the update script only ran up to the directly affected sub-assembly level and did not calculate the cascading effect. This was not just a data error but also meant incorrect inventory and production planning.

Solution: Instead of denormalizing such totals, it's safer to calculate them when needed or manage them with more advanced architectures like Event Sourcing. If denormalization is absolutely necessary, database transactions and stored procedures should be used carefully to ensure these updates are atomic and guaranteed. Additionally, a background job that periodically checks the accuracy of these totals is also a precaution.

-- Example Stored Procedure for safe update
DELIMITER $$
CREATE PROCEDURE UpdateTotalComponentCount(IN productId INT)
BEGIN
    DECLARE current_total INT;
    DECLARE parent_product_id INT;

    -- First, get the current total
    SELECT total_components INTO current_total
    FROM products
    WHERE id = productId;

    -- Calculate the new total (This part can be complex, might require a recursive function)
    SET new_total = CalculateNewTotalComponents(productId); -- Hypothetical function

    IF current_total <> new_total THEN
        UPDATE products
        SET total_components = new_total
        WHERE id = productId;

        -- Update parent-level parents as well
        SELECT parent_id INTO parent_product_id
        FROM products
        WHERE id = productId;

        IF parent_product_id IS NOT NULL THEN
            CALL UpdateTotalComponentCount(parent_product_id); -- Recursive call
        END IF;
    END IF;
END$$
DELIMITER ;

This stored procedure ensures that the update is performed within a transaction and that the hierarchy is traversed upwards to update all relevant records. However, such recursive stored procedures can lead to performance issues with large datasets. Therefore, this method should be used with caution.

Alternatives Instead of Denormalization

Denormalization is not always the best solution. Especially with the advancements in modern database systems, high performance can be achieved even with normalized structures.

1. Smart Queries and Index Strategies

In a normalized database, query performance can be dramatically improved with the right indexes. For hierarchical data like product trees, special index types (e.g., ltree or hstore extensions in PostgreSQL) or applying B-tree indexes to the correct fields can make a big difference.

Index Recommendations for Hierarchical Data:
- Index on the parent_id field.
- If a hierarchy path is stored, a special index for this path (e.g., text_pattern_ops or gin index).
- Composite indexes based on frequently performed queries.

2. Views and Materialized Views

If specific denormalized datasets are frequently needed, obtaining these data through Views or Materialized Views preserves data integrity and improves performance.

View: Runs at query time, always provides up-to-date data, but query performance is similar to normal JOINs.
Materialized View: Physically stores query results, which greatly speeds up read operations. However, it needs to be refreshed periodically for data currency. It's ideal for situations where the product tree doesn't change frequently or changes can be synchronized periodically.

-- Example Materialized View
CREATE MATERIALIZED VIEW mv_product_hierarchy AS
SELECT
    p.id,
    p.name,
    -- Other necessary fields...
    -- Calculating or storing the hierarchical path
FROM products p
-- Necessary JOINs and calculations...
WITH DATA; -- Populate data initially

By periodically refreshing this materialized view, we can combine the performance of a denormalized table with the security of a normalized structure.

3. Next-Generation Database Technologies

Graph databases are naturally more suitable for managing complex relational data like product trees. In graph databases like Neo4j, relationships between nodes (products, components) and edges (creates, is part of) are modeled directly, making hierarchical queries much more efficient.

However, migrating an existing ERP system entirely to a graph database requires significant effort. Therefore, hybrid approaches are often preferred; meaning, a separate graph database is used for critical hierarchical data, while other data continues to be stored in traditional relational databases.

Conclusion: A Balanced Approach

Product tree denormalization is a common cause of performance issues in ERP systems. While it may seem attractive initially with the promise of increased query speed, it can lead to system slowdowns and errors in the long run due to data consistency and update complexity.

Even when denormalization is unavoidable, its effects must be managed carefully. Risks can be minimized with smart indexes, materialized views, and secure update mechanisms. However, the best approach is often to avoid denormalization and instead optimize a normalized structure or use technologies specifically designed for hierarchical data (e.g., ltree, graph databases).

In my experience, getting to the root of performance issues always requires patience and in-depth analysis. Product tree denormalization is a typical example of such an analysis. It's important to see that the problem is not just at the database level but is also closely related to application architecture and business processes. Therefore, it is essential to always find a balance, i.e., to improve performance while ensuring data integrity and system sustainability.

As I mentioned in my previous post, [Related: Challenges of Real-Time Inventory Tracking in ERP], as business processes become more complex, our infrastructure choices must be able to manage this complexity correctly. A wrong architectural choice for a critical structure like the product tree can invite years of performance problems.

DEV Community