Product Tree Denormalization in ERP: 3 Steps for Performance

#tutorials #erp #performance #database

Product Tree Denormalization in ERP: Why and How?

In manufacturing ERP systems, product tree (Bill of Materials - BOM) queries are often among the most challenging areas. Especially for companies with complex product structures, finding all sub-components of a product or identifying which products a component belongs to can take hours, or even cause the system to slow down. In such scenarios, optimizing the database design is critical for achieving performance improvements.

In this post, we will examine step-by-step how we can improve query performance by denormalizing the product tree structure in ERP systems. Drawing from my real-world field experience, I will share the problems I encountered, the solutions I tried, and the practical approaches I finally arrived at. By implementing these methods, you too can speed up slow queries in your ERP systems and increase operational efficiency.

The Problem: Complex Product Trees and Slow Queries

At the heart of manufacturing ERP systems often lies the product tree structure. This is a hierarchical structure that shows which sub-components a main product consists of, how much of each component is needed, and their own sub-components. However, managing this hierarchical structure directly in a relational database can lead to significant performance issues, especially on the query side.

For instance, to retrieve the full BOM of a main product, we typically use recursive queries. While the WITH RECURSIVE clause in databases like PostgreSQL is ideal for this task, performance rapidly degrades as the tree's depth increases or when multiple queries are run simultaneously. In a manufacturing plant, when thousands of shipping orders are generated daily, the need to fetch or update the product tree for each shipment can lead to operational disruptions due to this slowness.

ℹ️ Real-World Scenario

While working on a manufacturing ERP project, fetching the BOM for one of the main products took up to 30 seconds. This situation significantly slowed down the order preparation process. For thousands of orders daily, this waiting time was unacceptable.

One common method to cope with such problems is to denormalize the database structure. Denormalization, although leading to data redundancy, can significantly improve performance by speeding up specific queries. In the context of a product tree, denormalization means keeping parts or the entirety of the hierarchy in additional tables or existing tables in a way that is easily accessible during queries.

Solution 1: Denormalization with Path Enumeration

One of the first and most common steps in denormalization is to store the full path for each node in the product tree. With this method, a string representing a product's position in the hierarchy is created. For example, for B under main product A, and C under B, the path might look like: /A/B/C.

The fundamental principle of this method is to quickly find all sub-elements or super-elements of a specific product using simple string matching instead of recursive queries. For example, all paths starting with /A/ will represent all sub-components of product A.

-- PostgreSQL example
CREATE TABLE products_denormalized AS
SELECT
    p.product_id,
    p.product_name,
    -- Generating the path with a recursive query (example)
    (SELECT string_agg(parent.product_name, '/' ORDER BY level DESC)
     FROM (
         WITH RECURSIVE cte AS (
             SELECT product_id, parent_id, 0 as level
             FROM product_hierarchy
             WHERE product_id = p.product_id
             UNION ALL
             SELECT ph.product_id, ph.parent_id, cte.level + 1
             FROM product_hierarchy ph
             JOIN cte ON ph.product_id = cte.parent_id
         )
         SELECT product_id, parent_id, level
         FROM cte
         WHERE parent_id IS NOT NULL
     ) AS sub
     JOIN products parent ON sub.parent_id = parent.product_id
    ) AS product_path
FROM products p
JOIN product_hierarchy ph ON p.product_id = ph.product_id;

-- Indexing
CREATE INDEX idx_product_path ON products_denormalized (product_path);

The advantage of this method is that queries are extremely fast. For instance, LIKE or = operators working on product_path can efficiently utilize database indexes to return results within milliseconds. This provides a significant performance boost, especially for queries like "list all super-components of a product" or "show all sub-assemblies of a specific main product."

However, this method also has its disadvantages. Every time the tree structure changes (adding a new product, removing one, changing the hierarchy), all affected paths must be updated. This slows down insertion and update operations. If product tree changes are frequent, this method can be costly. Additionally, as the length of path strings increases, the database size also grows.

💡 Performance Tips

When creating the product_path field, adding a delimiter (e.g., '/') at the beginning of the path helps prevent incorrect matches in queries. For example, if you have two products named A and AB, storing their paths as /A/ and /AB/ prevents queries starting with /A/ from fetching AB as well. Furthermore, creating an appropriate B-tree or similar index on the product_path field maximizes performance.

Solution 2: Denormalization with Level and Parent References

Another effective denormalization technique is to store the level of each product in the hierarchy and the ID of its direct parent. This method involves less data redundancy compared to the "Path Enumeration" method but can still significantly improve query performance.

In this approach, fields like level and parent_id are added for each product. The level field is a value that starts at 0 or 1 for the root product and increases as you go down the hierarchy. The parent_id stores the ID of the direct parent.

CREATE TABLE products_denormalized_levels AS
SELECT
    p.product_id,
    p.product_name,
    COALESCE(ph.parent_id, p.product_id) AS parent_id, -- Root products can reference themselves as parent
    COALESCE(ph.level, 0) AS level -- Root product level is 0
FROM products p
LEFT JOIN (
    -- Finding level and parent_id with a recursive query
    WITH RECURSIVE cte AS (
        SELECT product_id, parent_id, 0 as level
        FROM product_hierarchy
        WHERE parent_id IS NULL -- Find root nodes
        UNION ALL
        SELECT ph.product_id, ph.parent_id, cte.level + 1
        FROM product_hierarchy ph
        JOIN cte ON ph.parent_id = cte.product_id
    )
    SELECT product_id, parent_id, level
    FROM cte
) AS ph ON p.product_id = ph.product_id;

-- Indexing
CREATE INDEX idx_products_denormalized_levels_parent_id ON products_denormalized_levels (parent_id);
CREATE INDEX idx_products_denormalized_levels_level ON products_denormalized_levels (level);

This structure allows you to very quickly retrieve all direct sub-elements of a specific parent using WHERE parent_id = <parent_id>. Similarly, you can list all products at a specific level using WHERE level = <level_number>.

While this method is not as flexible as "Path Enumeration," it offers excellent performance when combined with optimized indexes for the frequently queried parent_id and level fields. Update operations are also more manageable; only the levels of affected nodes and their subtrees need to be updated.

⚠️ Things to Consider

In this method, cases where the parent_id field is NULL, i.e., main products or root nodes, must be managed carefully. The parent_id of root nodes can either be itself or a special value can be assigned. Also, calculating the level field can initially take time, especially for large trees.

The biggest advantage of this method is that data redundancy is more controlled, and update operations are less costly compared to the "Path Enumeration" method. Performance improvements can be further enhanced with indexing strategies.

Solution 3: Exploded BOM Table

One of the most aggressive denormalization methods is to create a separate table containing a list of all sub-components for each main product. This table can be called an "Exploded BOM" or "BOM Usage" table. Each row in this table represents a main product and a sub-component required to build that main product.

For example, for a "Bicycle" main product, this table might look like this:

Main_Product_ID	Component_ID	Quantity	Level
Bicycle	Frame	1	0
Bicycle	Wheel	2	1
Bicycle	Handlebar	1	1
Wheel	Rim	1	2
Wheel	Tire	1	2

This structure is extremely effective for retrieving all components of a main product with a single query. There is no need for recursive queries or complex joins.

CREATE TABLE exploded_bom AS
SELECT
    main_product.product_id AS main_product_id,
    component.product_id AS component_id,
    component.product_name AS component_name,
    bom.quantity,
    bom.level
FROM product_hierarchy bom
JOIN products main_product ON bom.root_product_id = main_product.product_id -- If root_product_id exists in hierarchy
JOIN products component ON bom.product_id = component.product_id
WHERE bom.root_product_id IS NOT NULL; -- Get only sub-components belonging to main products

-- Indexing
CREATE INDEX idx_exploded_bom_main ON exploded_bom (main_product_id);
CREATE INDEX idx_exploded_bom_component ON exploded_bom (component_id);

The biggest advantage of this method is the massive increase in query performance. It is ideal, especially for frequently performed queries like "list all sub-components of a product."

However, the biggest disadvantage of this method is the highest level of data redundancy. If a sub-component is part of multiple main products, there will be multiple rows for this sub-component in the exploded_bom table. This situation can significantly increase database size and make insert/update operations costly.

💡 Data Consistency

To ensure the consistency of the exploded_bom table, changes in the main hierarchy table must be automatically reflected in this table. This can be automated using database triggers or an ETL (Extract, Transform, Load) process. Triggers reflect changes in real-time, while an ETL process can update the data at specific intervals (e.g., overnight).

When choosing this method, it's important to strike a good balance between update frequency and query intensity. If the product tree changes rarely and query performance is critical, this method can offer the best solution.

Which Method When?

Choosing the right denormalization strategy depends on your project's specific requirements. Here's a guide:

Path Enumeration:
- When: If the tree structure is relatively static, changes are infrequent, and queries like "find all ancestors/descendants of a product" are very intensive.
- Advantage: Query performance is very high.
- Disadvantage: Updates can be costly; database size can increase.
Level and Parent References:
- When: If there are moderate changes in the tree structure, and queries like "find direct sub-elements of a specific parent" or "list elements at a specific level" are dominant.
- Advantage: Updates are more manageable; data redundancy is lower.
- Disadvantage: Complex recursive queries might still be needed (though simplified with level and parent_id).
Exploded BOM:
- When: If the tree structure changes very rarely (e.g., a few times a year), and queries like "list all components of a main product" are operationally critical and require the highest performance.
- Advantage: Query performance is incredibly high.
- Disadvantage: Data redundancy is at its highest; database size increases the most; update operations are the most costly.

🔥 Don't Forget the Trade-offs!

Denormalization is always a trade-off. While gaining performance, you face challenges like ensuring data consistency, managing update costs, and keeping database size under control. Therefore, whichever method you choose, you must perform comprehensive tests and validate its suitability for your system's workload.

Real-World Example: BOM Optimization in a Financial Application

In the backend of a financial calculator application I developed, I needed to manage the components of complex financial products. A financial product consisted of multiple sub-financial products and operational components. Retrieving all components of a main product was very slow with recursive queries.

Initially, I used WITH RECURSIVE, but I experienced performance issues when the system was heavily used. Then, I tried a denormalization method that stored the full path for each product. This significantly improved query performance. For example, with a path like /FinancialCalculator/ProductX/ComponentY/, I could retrieve all relevant components in milliseconds.

However, the structure of financial products sometimes changed, requiring path updates. This made update operations a bit slow. Finally, as a more balanced approach, I switched to a table containing parent_id and level fields. This kept query performance at a very good level while making update operations more manageable. Today, I use this approach in many scenarios where performance is critical and the tree structure may change moderately.

Conclusion: Smart Design for Performance

Product tree management in ERP systems is a critical issue that increases operational efficiency when done correctly, and degrades system performance when done incorrectly. Denormalization is a powerful tool to solve this problem, but it requires careful planning and implementation.

The three main denormalization techniques I've discussed – Path Enumeration, Level and Parent References, and Exploded BOM – each have their own advantages and disadvantages. You should choose the most suitable method by considering your project's specific requirements, your query patterns, and your data update frequency.

Remember that database optimization is an ongoing process. Making the right decision from the start will provide significant benefits in the long run. By experimenting with these techniques in your own systems, you can see how to improve your ERP performance.