Recently, I noticed that a report related to complex product trees in a manufacturing ERP was taking over 3 hours to complete. The delay in the report directly impacted shipment planning, which in turn severely disrupted the company's daily operations. Digging into the root cause, I saw how a "performance-oriented" denormalization decision made years ago in the database design had evolved into massive technical debt over time.
It's a pattern I've run into many times: a step taken to solve a problem quickly ends up creating a bigger problem in the long run. Product trees (Bills of Materials, or BOMs) play a critical role in manufacturing processes. These structures, which detail which parts a product consists of, in what quantities, and in which production sequence, can lead to operational disasters if not managed correctly. This experience showed me once again how the decisions we make as developers ripple into every corner of a company's operations.
Product Tree Complexity and Delayed Reports
In manufacturing companies, the product tree, or BOM, is the heart of everything. It shows all intermediate steps from raw materials to the final product, every component used, and its quantity. These structures are often multi-layered, meaning a part itself can be composed of other parts. This necessitates a recursive relationship in the database.
Managing these complex structures can lead to severe performance issues, especially in reporting and cost calculation. In the case I encountered, the weekly shipment report had to descend through the product tree to calculate all required sub-components and their stock status for each final product. Initially there were 100-200 distinct product types, but as the company grew this number exceeded 2,000, and the report that once finished in 10 minutes stretched to over 3 hours.
⚠️ Hidden Danger: The Pursuit of Performance
As developers, we often resort to easy solutions to fix immediate performance problems. Denormalization is one of these "easy" ways. However, this can create a larger technical debt and maintenance cost in the long run. Especially with complex and frequently changing data structures, denormalization can become the nightmare of the next update.
The delay in this report wasn't a simple software bug. It was causing disruptions in weekly production and shipment plans, risking the inability to keep promises made to customers. A shipment planning meeting scheduled for Friday morning at 09:00 AM was pushed to the afternoon because the report only finished at 12:00 PM. This situation increased the stress not only for the IT team but for all production and logistics departments.
The Allure and Pitfalls of Denormalization in Database Design
Denormalization is a method often employed in database design to overcome performance issues. While normalization aims to ensure data integrity by reducing data redundancy, denormalization allows data redundancy to improve query performance. We might resort to denormalization, especially to avoid multiple JOIN operations or to pre-store frequently used calculated values. I, too, fell for this allure in my early career. In many projects, I had seen how "fast" such solutions were, particularly for reporting needs.
In the context of a product tree, denormalization often takes the form of storing a flattened list of all sub-components on the product (e.g., in a JSONB column) or writing the total cost of each product into the main product table. In the manufacturing ERP in question, a total_material_cost column on the main product table held the sum of the costs of all sub-components, which eliminated the need for deep JOINs in the "Total Cost" report. Initially this approach brought reports down to seconds, which sounded fantastic.
-- Normalization Example: each part and its relationships live in separate tables
CREATE TABLE products (
    product_id   SERIAL PRIMARY KEY,
    product_name VARCHAR(255),
    price        NUMERIC(18, 4)  -- unit cost of the part; referenced by the cost queries below
);

CREATE TABLE bill_of_materials (
    bom_id               SERIAL PRIMARY KEY,
    parent_product_id    INT REFERENCES products(product_id),
    component_product_id INT REFERENCES products(product_id),
    quantity             NUMERIC(10, 4),
    UNIQUE (parent_product_id, component_product_id)
);

-- Denormalization Example: storing the total cost in the main product table.
-- How this cost gets updated is the critical question.
ALTER TABLE products
    ADD COLUMN total_material_cost NUMERIC(18, 4);
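To make the recursive structure tangible, here is a small, purely hypothetical data set (IDs assume the serial sequence starts at 1 on an empty table): a bicycle made of two wheel assemblies, each of which uses spokes and a rim.

-- Hypothetical sample data illustrating a two-level product tree
INSERT INTO products (product_name, price) VALUES
    ('Bicycle',        NULL),   -- final product; its cost is derived from its components
    ('Wheel assembly', NULL),   -- intermediate product
    ('Spoke',          1.50),   -- purchased component
    ('Rim',           12.00);   -- purchased component

-- Bicycle -> 2 wheel assemblies; wheel assembly -> 32 spokes + 1 rim
INSERT INTO bill_of_materials (parent_product_id, component_product_id, quantity) VALUES
    (1, 2, 2),
    (2, 3, 32),
    (2, 4, 1);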
However, this seemingly simple solution has serious pitfalls. When the cost of a component changes, all main products that include this component must have their total_material_cost column updated. If this update is not done manually or via a simple TRIGGER, data inconsistency becomes inevitable. The problem I encountered was precisely this: as the system grew, the domino effect of these updates caused performance bottlenecks. Cost changes, especially at month-end, nearly brought the system to a standstill. This was a painful experience showing how even denormalizing a single cost column could lead to such a chain reaction.
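To make this domino effect concrete, here is a minimal sketch (my own illustration, not the project's code) of finding every product whose denormalized cost is invalidated by a single component price change; each returned row is a products record whose total_material_cost would need to be recalculated.

-- All ancestors (direct and indirect parents) of the component whose price just changed
WITH RECURSIVE affected AS (
    SELECT parent_product_id
    FROM bill_of_materials
    WHERE component_product_id = 42          -- hypothetical ID of the changed component
    UNION
    SELECT b.parent_product_id
    FROM bill_of_materials b
    JOIN affected a ON b.component_product_id = a.parent_product_id
)
SELECT DISTINCT parent_product_id FROM affected;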
How Technical Debt Accumulates in Product Tree Structures
Technical debt is inevitable in the world of software development. Like financial debt, it's the cost of decisions made for short-term gains that must be paid back with interest in the long run. In complex and dynamic structures like product trees, this debt often accumulates unknowingly. In my experience, there were several main reasons for this debt to build up:
- Versioning and Validity Dates: Product trees change over time. New parts are added, quantities of existing parts change, or parts are removed entirely. Versioning and validity dates are used to track these changes. In a denormalized structure, however, every denormalized value must be correctly updated with every version change. Was the cost of the old version updated, or the new one? Questions like these multiply the complexity. (A minimal sketch of date-based versioning appears right after this list.)
- Pre-storing Calculated Values: Like the total_material_cost mentioned in the previous section. This approach, which seems logical at first, turns into a nightmare as business rules change or new calculation metrics are added. For example, when "labor cost" or "production scrap" is added to the cost definition, the meaning of the existing total_material_cost column changes and all historical data must be recalculated. That can mean days of batch processing over millions of records.
- Domino Effect of Instantaneous Changes: A change in the cost of a sub-component affects every intermediate product that uses it, and ultimately the final products. In a normalized database this is resolved with a series of SELECT and UPDATE statements, but in a denormalized structure these updates have to be managed manually or with complex TRIGGERs. Those TRIGGERs, especially in PostgreSQL, fire on every INSERT, UPDATE, or DELETE and add overhead.
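As a side note on the first item, a minimal sketch of date-based BOM versioning (my own illustration, not the schema from the project) could look like the following; the denormalized cost would then have to be maintained per validity period, which is exactly where the complexity explodes.

-- Hypothetical validity columns: a BOM line is only active within its date range
ALTER TABLE bill_of_materials
    ADD COLUMN valid_from DATE NOT NULL DEFAULT CURRENT_DATE,
    ADD COLUMN valid_to   DATE;   -- NULL = still valid

-- Note: the UNIQUE (parent_product_id, component_product_id) constraint would also need
-- to include the validity period to allow multiple versions of the same BOM line.
-- Tree traversals must then filter on the reporting date, e.g.:
--   WHERE report_date BETWEEN valid_from AND COALESCE(valid_to, report_date)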
-- A simple TRIGGER example to update the denormalized cost.
-- In a recursive tree structure this becomes much more complex.
CREATE OR REPLACE FUNCTION update_product_total_cost()
RETURNS TRIGGER AS $$
BEGIN
    -- Simplified: a real BOM tree requires a much deeper, recursive calculation.
    -- NEW is the changed bill_of_materials row, so its parent product is recalculated.
    UPDATE products
    SET total_material_cost = (
        SELECT SUM(p.price * bom.quantity)
        FROM bill_of_materials bom
        JOIN products p ON bom.component_product_id = p.product_id
        WHERE bom.parent_product_id = NEW.parent_product_id
    )
    WHERE product_id = NEW.parent_product_id;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_update_product_cost
AFTER INSERT OR UPDATE ON bill_of_materials
FOR EACH ROW
EXECUTE FUNCTION update_product_total_cost();
These types of TRIGGERs, while seemingly a small performance issue at first, can become a serious bottleneck as the data volume in the system increases. In the case I experienced, a single UPDATE operation on the bill_of_materials table led to the recalculation of over 1500 products table records, causing lockups lasting seconds. Especially during periods of heavy write operations, this situation negatively affected VACUUM operations and could trigger problems like WAL bloat.
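During those incidents, a quick look at the standard PostgreSQL catalog views (nothing project-specific here) showed who was blocking whom; a query along these lines is what I mean:

-- Sessions currently waiting on a lock, with the PIDs blocking them (PostgreSQL 9.6+)
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       state,
       wait_event_type,
       wait_event,
       left(query, 100) AS query_snippet
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;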
The Cost of Denormalization in a "Production" Environment: Real Scenarios
In a "production" environment, we can pay dearly for decisions that seem logical on paper. My experience in a manufacturing company's ERP was a live example of this. The company shipped over 250,000 final products per month, and these products had product trees with an average of 15-20 layers. There were over 15 million bill_of_materials records in total.
The problem erupted with the "Daily Production Summary" report, which started every first business day of the month at 08:30 AM and collected production data from the last 24 hours. While the report was normally expected to finish in 5 minutes, it suddenly started taking 45 minutes, then 2 hours. Eventually, the report's execution pushed the system's entire CPU usage above 95%, slowing down other critical operations. As the system administrator, looking at the journald logs, I saw that during the report's execution, PostgreSQL processes were hitting CPU limits, and some query timeouts were occurring.
The pg_stat_activity outputs showed that the query running the report was using a massive amount of temporary disk space. EXPLAIN ANALYZE results indicated that the query used a deep recursive CTE (Common Table Expression) on the bill_of_materials table and was fetching and summing costs from the products table for each sub-component. The worst part was that these costs were already stored denormalized in the products table, but the report, instead of relying on this potentially outdated denormalized data, was recalculating it every time. This was a complete paradox and a double cost.
-- An example of the CTE structure inside the report that degraded performance
WITH RECURSIVE bom_tree AS (
    SELECT
        b.parent_product_id,
        b.component_product_id,
        b.quantity,
        p.price AS component_price,
        1 AS level
    FROM bill_of_materials b
    JOIN products p ON b.component_product_id = p.product_id
    WHERE b.parent_product_id = <Final_Product_ID>
    UNION ALL
    SELECT
        bt.parent_product_id,            -- carry the final product's ID down the tree
        b.component_product_id,
        b.quantity * bt.quantity,        -- multiply quantities through the levels
        p.price AS component_price,
        bt.level + 1
    FROM bill_of_materials b
    JOIN bom_tree bt ON b.parent_product_id = bt.component_product_id
    JOIN products p ON b.component_product_id = p.product_id
)
SELECT
    parent_product_id,
    SUM(component_price * quantity) AS total_calculated_cost
FROM bom_tree
GROUP BY parent_product_id;
This complexity affected not only reporting but also data entry. When a new product tree was defined or the quantity of an existing component changed, transactions took noticeably longer to commit because of the triggers. Operators, thinking the screen had frozen, pressed the save button repeatedly, which led to more lock contention and inconsistent entries. One afternoon, between 2:00 PM and 4:00 PM, production came to a standstill because new production orders could not be entered into the ERP system; that day cost roughly $50,000 in lost production. It reminded me of an earlier experience with performance bottlenecks in enterprise software architecture, but this time the problem stemmed from denormalization itself.
Solutions and Strategies for Managing Technical Debt
It's usually not possible to resolve this kind of technical debt all at once and fundamentally. It's necessary to proceed step by step without disrupting the live operations of the existing system. The strategy I applied consisted of several phases:
- Identification and Cleanup of Existing Denormalized Fields: First, I identified denormalized columns like total_material_cost in the products table. I wrote a batch job to check whether these columns were up to date; it usually ran at night, during the quieter hours. If an inconsistency was detected, the correct costs were recalculated and written back. Although this was only a stopgap, it at least ensured that reports ran against correct data. (A sketch of this kind of consistency check appears after the materialized view below.)
- Using Materialized Views: The recursive product tree queries repeated the same calculations on every run. To avoid this, I decided to store pre-calculated product tree costs for the main products in MATERIALIZED VIEWs. These views were refreshed at defined intervals (e.g., hourly, or when production costs changed) with the REFRESH MATERIALIZED VIEW command, so reports could read pre-calculated, reasonably fresh data instead of recalculating everything on the fly.
-- A Materialized View storing the product tree's cost
CREATE MATERIALIZED VIEW product_bom_total_costs AS
WITH RECURSIVE bom_calc AS (
    SELECT
        b.parent_product_id,
        b.component_product_id,
        b.quantity,
        p.price AS component_price
    FROM bill_of_materials b
    JOIN products p ON b.component_product_id = p.product_id
    UNION ALL
    SELECT
        bc.parent_product_id,
        b.component_product_id,
        b.quantity * bc.quantity,   -- correct calculation by multiplying quantities
        p.price AS component_price
    FROM bill_of_materials b
    JOIN bom_calc bc ON b.parent_product_id = bc.component_product_id
    JOIN products p ON b.component_product_id = p.product_id
)
SELECT
    parent_product_id,
    SUM(component_price * quantity) AS total_calculated_cost,
    NOW() AS last_calculated_at
FROM bom_calc
GROUP BY parent_product_id;
-- Refreshing the Materialized View
REFRESH MATERIALIZED VIEW product_bom_total_costs;
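For completeness, the nightly consistency check mentioned earlier boiled down to a comparison like this simplified sketch, here expressed against the materialized view instead of the original recursive recalculation:

-- Products whose denormalized cost has drifted from the freshly calculated value
SELECT p.product_id,
       p.total_material_cost   AS stored_cost,
       c.total_calculated_cost AS recalculated_cost
FROM products p
JOIN product_bom_total_costs c ON c.parent_product_id = p.product_id
WHERE p.total_material_cost IS DISTINCT FROM c.total_calculated_cost;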
These MATERIALIZED VIEWs improved reporting performance by 80%. Reports that took 3 hours were reduced to 30 minutes. However, this meant that the data would be "eventually consistent"; that is, it wasn't guaranteed that the report would show updated data immediately after a cost change. I clearly communicated this to the business units. For them, getting up-to-date data with a 1-hour delay was much better than not getting it at all.
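One refinement worth mentioning, a standard PostgreSQL capability rather than something specific to this project: with a unique index on the view, the refresh can run without locking readers out, which matters when reports and refreshes overlap.

-- REFRESH ... CONCURRENTLY requires a unique index covering the view's rows
CREATE UNIQUE INDEX idx_bom_costs_parent
    ON product_bom_total_costs (parent_product_id);

REFRESH MATERIALIZED VIEW CONCURRENTLY product_bom_total_costs;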
- CQRS (Command Query Responsibility Segregation) and Event Sourcing: In the long term, I felt that CQRS and Event Sourcing architectures would be a better fit for such complex business domains. Write operations (commands) and read operations (queries) are handled in separate models. Every significant event, such as a cost change, is recorded in an event stream, and the read models (e.g., a denormalized table for product cost reports) are fed from that stream. This preserves data integrity while maximizing read performance. I had tried this architecture in a side project, my own financial calculators, with very good results.
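As a rough, purely hypothetical illustration of the idea in plain SQL (these tables are not from the project): every cost change is appended to an event stream, and the read model is derived from the stream rather than updated in place.

-- Append-only event stream of cost changes (hypothetical)
CREATE TABLE cost_change_events (
    event_id    BIGSERIAL PRIMARY KEY,
    product_id  INT NOT NULL REFERENCES products(product_id),
    new_price   NUMERIC(18, 4) NOT NULL,
    occurred_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Read model: the latest known price per product, rebuilt purely from the event stream
CREATE TABLE product_price_read_model AS
SELECT DISTINCT ON (product_id)
       product_id,
       new_price   AS current_price,
       occurred_at AS price_valid_from
FROM cost_change_events
ORDER BY product_id, occurred_at DESC;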
"Eventually Consistent" Approach and Integration with Business Processes
Eventual consistency is a perfectly workable answer when up-to-the-minute accuracy is not required. Product tree costs rarely change from one minute to the next, and when they do, reflecting the change instantly is rarely critical, so the approach fits well here. The important thing is to discuss it openly with the business units: "Your report shows the state as of the last hour, not of right now. Is that acceptable?" The "yes" I received made the technical side of the job much easier.
When managing such a transition, I leaned heavily on feature flag and dark launch techniques. While the new MATERIALIZED VIEWs ran in the background, the existing reporting system kept using the old, slow queries. I hid the new reporting infrastructure behind a feature flag, first tested it with a small group of users (a few people from the finance department), and only after their approval enabled the flag for everyone. This minimized the risk and made for a smooth transition.
ℹ️ Importance of Communication with Business Processes
Understanding the impact of your technical decisions on business processes and explaining these impacts to stakeholders in clear terms is critical. Concepts like "eventual consistency" can be confusing for non-technical people. Open communication is key to managing expectations and ensuring project success.
I continuously monitored data integrity and performance with the observability practices already in place, such as vacuum monitoring and replication lag tracking. Using pg_stat_statements in PostgreSQL to identify the slowest queries, I verified that they were actually hitting the MATERIALIZED VIEWs. I also considered setting up a CDC (Change Data Capture) mechanism to guarantee the freshness of denormalized fields, but given the complexity of the existing system, that remained a longer-term goal.
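For reference, the kind of check I ran with pg_stat_statements looked roughly like the sketch below; column names differ slightly across PostgreSQL versions (total_exec_time is the name from version 13 onward).

-- Top queries by total execution time; temp_blks_written flags queries spilling to disk
SELECT queryid,
       calls,
       total_exec_time,
       temp_blks_written,
       left(query, 80) AS query_snippet
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;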
Lessons Learned and My Recommendations for the Future
This product tree denormalization adventure once again showed me how technical debt can take deep roots not only in code but also in database design. Denormalization, which initially seemed like a "quick fix," eventually turned into an exponentially increasing cost and operational headache. Here are a few key lessons I've drawn:
- Understand the Business Domain Deeply: Before undertaking any database design, thoroughly understand the domain's requirements, data flows, and likely future changes. Complex structures like product trees do not forgive superficial approaches. Decide together with the business units which data must be instantly consistent and which can be "eventually consistent".
- Think Twice About Denormalization Decisions: Denormalization can be a powerful performance tool, but it comes with a trade-off in data consistency and maintenance cost. When making the decision, weigh not only raw query performance but also data freshness, update costs, and the impact of future changes. A well-indexed, well-optimized normalized structure often provides sufficient performance on its own (see the index sketch after this list).
- Technical Debt Is Inevitable, But Manageable: Technical debt cannot be eliminated completely. In every project, some shortcuts will be taken under time pressure or simply from not knowing better. What matters is being aware of the debt, monitoring it regularly, and planning strategically to pay it down. Just like financial debt, technical debt you stop servicing will eventually drown you in interest.
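On that point about well-indexed normalized structures: in the BOM schema above, the UNIQUE constraint already provides an index for walking the tree downward, but the upward question "which parents use this part?" needs its own index. A minimal sketch, with an illustrative index name:

-- Upward traversal: quickly find all parents that use a given component
CREATE INDEX idx_bom_component ON bill_of_materials (component_product_id);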
💡 Continuous Monitoring and Optimization
Constantly monitor your systems. Identify slow-running queries regularly with tools like pg_stat_statements in PostgreSQL or EXPLAIN ANALYZE. Keep an eye on journald logs and cgroup limits. Sometimes the problem isn't denormalization itself but a simple missing index or a bad query plan.
This experience taught me a lot as a technology professional. Once again I saw that our technical decisions affect not just the code, but an entire company's operations, its employees' stress levels, and even its customer relationships. So before your next database design or performance optimization gives in to the allure of denormalization, consider the long-term effects. Being pragmatic is good, but when you say "it's fine for now," be honest with yourself about how big that "for now" may become later.