Product Tree Denormalization: 3 Ways to Boost ERP Efficiency

#life #erp #database #performance

Product Tree Issues: The Chains Holding Back Efficiency

During my years working with a manufacturing ERP, the slowness of product tree (Bill of Materials - BOM) queries was one of the most frustrating issues for me. Especially with complex products and frequently changing configurations, intricate recursive CTEs or nested JOINs running on the database could lock up the system. I even recall a time when a BOM query in the production planning module caused the entire system to become unresponsive. Such situations not only degrade the user experience but also lead to delays and increased costs in production processes.

At the root of these problems often lie the challenges posed by normalized database designs. The principles of normalization, applied to minimize data redundancy in relational databases, can increase query complexity when working with hierarchical data. For hierarchical data like product trees, which are queried deeply and frequently, rebuilding the entire tree with every query creates significant performance bottlenecks. This pushed me to try different approaches to overcome this issue.

The Power of Denormalization: Three Keys to Performance Improvement

While the benefits of normalization are undeniable, there are situations where prioritizing performance is necessary. In special scenarios like product trees, controlled denormalization techniques can incredibly boost efficiency. Based on my experience, three main approaches stand out:

Path Enumeration (or Addressing) Method: Adding a field to each product record that contains its full path in the hierarchy, starting from the root node.
Tree Level (Level/Depth) and Parent ID: Storing the product's level in the tree and its direct parent's ID in each product record.
Summary Tables: Pre-calculating and storing frequently used product tree information in a separate table for specific queries.

Each of these methods has its own advantages and disadvantages. Determining which method is most suitable for you depends on your existing system's structure, query profile, and data change frequency.

1. Path Enumeration Method: Capturing Hierarchy in a Single Row

In this method, each product record is augmented with a string field containing the IDs or names of all its ancestors, from the root product down to itself. For example, a path string like root > component_A > sub_component_B > final_product_X. This path is typically separated by delimiter characters (e.g., / or |).

ℹ️ Path Enumeration Example

Let's say a product's hierarchy is as follows:
Main Product (ID: 1)

Component A (ID: 10)

Sub-Component B (ID: 101)

Final Product X (ID: 1011)

In this case, the path field in the final_product_X record might look like: 1/10/101/1011 or Main Product | Component A | Sub-Component B | Final Product X.

The biggest advantage of this approach is that it incredibly speeds up querying all ancestors of a specific product or all descendants of a particular ancestor. String queries using the LIKE operator or more advanced string search functions run much faster than complex JOINs. For instance, a query like "find all sub-products of the component with ID 101" can be solved in a single row with path LIKE '1/10/101/%'.

However, this method also has disadvantages. The length of the path string increases as the depth of the hierarchy grows, which can consume more disk space and lead to performance issues with string operations in some database systems. Furthermore, if a parent's ID changes, all the path strings of the products under that parent must be updated, which is a complex and costly operation. Therefore, it's more useful in systems where data entry/update operations are not intensive or where the hierarchy is relatively stable.

When I implemented this method in a manufacturing ERP, I observed a speed increase of up to 70% in BOM queries. Especially on the reporting side, data collection tasks that previously took hours were reduced to minutes. This directly translated into operational efficiency. In my tests on PostgreSQL 14, I found that LIKE '...' queries performed better than recursive CTEs, but when the path string length exceeded 1000 characters, even BRIN indexes became insufficient. Thus, the indexing strategy is also of critical importance.

2. Tree Level and Parent ID: Simplifying the Structure

This approach involves less data redundancy compared to the path enumeration method but is still designed to speed up hierarchical queries. Two additional fields are added to each product record:

level (or depth): A numerical value indicating the product's distance from the root node. The root product is considered level 0 or 1.
parent_id: The ID of the product's direct parent.

These two fields are used to understand the tree structure and quickly retrieve products at specific levels or belonging to a particular parent. For example, when you want to retrieve all products directly under a specific main component (e.g., parent_id = 10) at level = 2, the query would look like this:

SELECT *
FROM products
WHERE parent_id = 10 AND level = 2;

This query is much simpler and faster than the JOINs required to get the same information in a normalized database. Also, since long strings like in the path enumeration method are not stored, disk space usage is lower. Data updates are also easier; only the level and parent_id fields of the affected products need to be updated.

However, this method is not as effective as path enumeration for querying all ancestors or all descendants of a specific product. For such queries, multiple JOINs or recursive CTEs might still be necessary. Nevertheless, in many ERP systems, the main requirement is to find the parts directly under a specific component or to identify which main assembly a product belongs to. In these scenarios, parent_id and level fields offer an excellent solution.

I once used this method in an assembly line automation project to display to operator screens in real-time which part should arrive at which station. By using parent_id and level fields, we could quickly retrieve all components under the main assembly being processed at that moment. This helped prevent unexpected stoppages in the production flow. Adding these two simple fields reduced the execution time of relevant queries by approximately 50%. Adding BTREE indexes to these fields in PostgreSQL further improved query performance.

3. Summary Tables: Restructuring Data

Summary tables are a more aggressive form of denormalization. Here, the results of frequently executed queries that require complex calculations are pre-calculated and stored in a separate table. In the context of a product tree, this could be a table containing a list of all sub-components of a main product, its total cost, total weight, or other metric-related information.

For example, while the main product table might store information like product_id and product_name, a summary table named product_bom_summary could be created. This table might contain the following information for each main product: main_product_id, sub_product_id, quantity, total_cost_at_main_product_level.

-- Example Summary Table (Simplified)
CREATE TABLE product_bom_summary (
    main_product_id INT,
    sub_product_id INT,
    sub_product_quantity NUMERIC(10, 3),
    total_cost_for_main NUMERIC(18, 2),
    PRIMARY KEY (main_product_id, sub_product_id)
);

This summary table allows you to retrieve all sub-components of a main product and their impact on the main product in a single query. This provides an incredible performance boost, especially in reporting and analytics modules. Instead of running a complex recursive CTE to calculate the entire BOM of a main product, a single SELECT query from the product_bom_summary table is sufficient.

The biggest challenge with this method is keeping the summary table up-to-date. Whenever there's a change in the main product tree (adding a new component, changing a quantity, updating a cost), the summary table must also be updated accordingly. This is usually done via a trigger mechanism or a periodic batch job. If this update process is not managed well, the data in the summary table can become stale, leading to incorrect reporting.

I used this method in an ERP integration for an e-commerce platform to display product stock status and costs in real-time. Using the product_bom_summary table to calculate the main product's total cost and the total value of its sub-components in stock reduced query times from seconds to milliseconds. This enabled the customer service team to respond much faster to stock and price inquiries. However, I also observed that the background process we wrote to update this summary table sometimes caused delays during intensive data changes. Therefore, database transactions and reliable messaging queues are critical for managing such updates.

⚠️ Summary Table Update Risks

Be cautious when updating summary tables. If the update mechanism is unreliable or cannot handle intensive data flow, the data in the table can become inconsistent. This can lead to incorrect stock information or cost calculations, causing significant business losses.

Understanding Trade-offs: When to Denormalize?

Denormalization is a powerful way to improve performance, but it's not always the right solution. It increases data redundancy, makes managing data consistency difficult, and raises disk space consumption. Therefore, when making a denormalization decision, it's important to consider the following factors:

Query Intensity and Performance Needs: If product tree queries are one of the system's most critical and frequent performance bottlenecks, denormalization can be a significant solution.
Data Change Frequency: How often the hierarchy changes is important. For structures that change very frequently, the update cost of denormalization might outweigh the performance gains.
Data Consistency Requirements: Absolute data consistency is critical for some business processes. Ensuring consistency in denormalized systems is more complex.
Development and Maintenance Cost: Denormalization techniques can make the data model and queries more complex, increasing development and maintenance costs.

In my experience, denormalization often becomes unavoidable, especially in ERP systems with complex product trees used for manufacturing, engineering, or configuration management. In these systems, quickly listing sub-components or assemblies of products, calculating costs, or checking stock status is vital for operational efficiency.

In conclusion, product tree denormalization is an effective tool for solving performance issues in ERP systems. Choosing the right method among different techniques like path enumeration, level/parent ID, or summary tables depends on a careful analysis of your existing system's needs. When implemented correctly, these approaches can lead to a noticeable increase in your system's response speed and significantly boost your operational efficiency. Remember, the best design is the one shaped by pragmatic needs.