DEV Community

Mustafa ERBAY
Mustafa ERBAY

Posted on • Originally published at mustafaerbay.com.tr

ERP Product Tree Denormalization: The Maintenance and Scale Conundrum

The Denormalized Product Tree: Initial Relief, Later a Deep Pit

While working on a manufacturing ERP, I've repeatedly seen how appealing the product tree structure looks initially. Typically, denormalization is preferred to speed up database queries and access data directly without getting into complex relationships. This might seem logical, especially during the initial development phases or in systems with a limited number of products. However, this relief often doesn't last long, leading to a complexity that becomes impossible to maintain and scale as the system grows.

For example, in one of my client's ERP systems, there was a field containing all the top-level categories for each product. This meant, for a "Washing Machine" product, this field would contain a string like "Home Appliances > Kitchen & Bath > Washing Machines". While this was a perfectly fast way to list products initially, when a category name changed or a new main category was added, a script had to be run to update this string. This script could take hours to run on millions of product records and had a high probability of error.

Why Does Denormalization Seem Appealing?

The primary appeal of denormalization is that it simplifies data access. In a normalized database, you might need to JOIN multiple tables to get all the information for a product. This can lead to performance issues, especially when dealing with many tables and complex relationships. With denormalization, frequently accessed data is consolidated into a single table, significantly speeding up read operations.

ℹ️ The Basic Logic of Denormalization

Denormalization is a technique in database design that improves read performance by duplicating or grouping data. The goal is to reduce query complexity and shorten data access time. However, this makes ensuring data consistency more difficult.

Another reason is the need for speed in the early stages of the development process. Project managers or clients want to see a working prototype quickly. At this point, denormalizing with the thought of "we'll fix it later" can be attractive. However, this "later" often never comes or is too late.

The Maintenance Nightmare: Change Management and Data Inconsistency

The biggest problem with a denormalized product tree structure is how difficult it is to maintain. When an item or category name changes, this change needs to be updated in all relevant places in the system. If this update is done through a single defined string or repeated columns, this process becomes incredibly cumbersome.

In a manufacturing company's ERP system, we developed a module to manage the product tree. Initially, the "full path" of each product (e.g., Main Category/Sub Category/Sub Sub Category) was stored in a text field. When a new sub-category was added or a product was moved to another category, this text field needed to be updated. The stored procedure we wrote to perform this update re-generated this path for each product. This became incredibly slow when run on thousands, even millions, of products. An update operation could take 3-4 hours.

⚠️ Change Management Challenges

Even a small change in denormalized data structures can create a domino effect in many parts of the system, leading to unexpected problems. This also makes the debugging process extremely complex.

This situation not only increases maintenance costs but also carries the risk of data inconsistency. If an error occurs during an update or the process is interrupted, some parts of the system might show old data while others show new data. This leads to reporting errors, incorrect inventory tracking, and ultimately, disruptions in business processes.

Maintenance Cost with Real Numbers

At another client, product tree information was stored separately for each product in a denormalized manner. There were separate columns for a product's "main category," "second-level category," "third-level category," and so on. When a product's category changed, these three columns had to be updated.

Once, when a product's main category changed, we ran a script for this update. The script took a full 4 hours and 17 minutes to complete. During this time, the CPU usage of the relevant database server did not drop below 90%, and other system operations slowed down. We had to repeat this type of operation once or twice a week. This meant at least 40-50 hours of unnecessary operational load per month. Imagine, this was just the duration of a single operation; the situation became even more dire when we added the cost of debugging and potentially resolving data inconsistencies.

Scaling Issues: The Performance Bottleneck

While denormalization might improve performance in the short term, it becomes one of the biggest bottlenecks preventing the system from scaling as the data volume increases. Especially in situations where product variety and transaction volume increase, denormalized structures start to collapse in terms of performance.

In an e-commerce platform project, we used a hierarchical structure for products. Initially, we stored a JSON blob containing all parent categories of a product in each product record. This was fast for product listings. However, as the platform grew and the number of products reached millions, managing and querying these JSON blobs became impossible. Adding a new product or updating an existing one meant processing this massive JSON data.

🔥 The Cost of Unscalable Design

Denormalization, done to increase performance initially, has the opposite effect as the system grows, reducing performance and hindering scalability. This is an approach that exponentially increases technical debt.

To give an example, when we wanted to change a product's category, we didn't just update the relevant fields; we also had to update the category information for all "child products" related to that product. This was an operation that sometimes involved thousands of records and created a significant load on the database each time. Such operations, especially when performed during peak traffic hours, directly negatively impacted user experience.

The Couscous Architecture Anti-Pattern

These types of denormalization strategies point to an anti-pattern in the software world, also known as "Couscous Architecture." As the name suggests, everything is piled on top of each other, and it's difficult to see the underlying structure. In couscous architecture, data duplication and complexity increase to such an extent that understanding, maintaining, and developing the system becomes nearly impossible.

In a client project, I saw another example of this. Tables holding product variants like color and size were linked to the main product table through different fields. When a new variant of a product was added, data had to be entered not only into the variant table but also into specific fields in the main product table. This extended development time and made ensuring data integrity difficult. A single missed field update by a developer could lead to major problems in the system.

Rediscovering Normalization: A Path to a Solution

When we face the problems brought about by denormalization, returning to the initial solution, i.e., switching to a normalized database design, is often the most logical path. Normalization increases data consistency by reducing data duplication and simplifies maintenance.

In a manufacturing ERP project, I developed a solution for managing product tree data using PostgreSQL's powerful features. We defined products and categories in separate tables. By establishing a parent_id relationship between categories, we created a tree structure.

-- Categories table
CREATE TABLE categories (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    parent_id INTEGER REFERENCES categories(id) NULL -- Self-referencing relationship
);

-- Products table
CREATE TABLE products (
    id SERIAL PRIMARY KEY,
    name VARCHAR(255) NOT NULL,
    -- Other product attributes...
);

-- Product-Category relationship table (Many-to-many relationship)
CREATE TABLE product_categories (
    product_id INTEGER REFERENCES products(id) ON DELETE CASCADE,
    category_id INTEGER REFERENCES categories(id) ON DELETE CASCADE,
    PRIMARY KEY (product_id, category_id)
);
Enter fullscreen mode Exit fullscreen mode

With this structure, we could access all category information for a product using the parent_id relationship in the categories table and with Common Table Expressions (CTE) or Recursive Queries. While this meant more queries initially, it provided incredible flexibility in terms of maintenance and scalability.

💡 Tree Querying with PostgreSQL CTE

PostgreSQL's CTE (Common Table Expressions) and especially recursive CTE features are very powerful for querying hierarchical data. This allows you to efficiently traverse complex tree structures.

This approach meant that when you changed a category name, you only had to update a single record. This change was automatically reflected in all products using that category. This dramatically reduced the operational load and eliminated the risk of data inconsistency.

Understanding the Trade-offs: Performance vs. Maintenance

Of course, moving to a normalized structure also has trade-offs. Read performance might not be as fast as a denormalized structure. Especially for querying very deep tree structures, more complex queries might be needed. However, this is where database optimization comes into play. Performance issues can be resolved using techniques like proper indexing strategies (e.g., GIN or BRIN indexes in PostgreSQL), query optimization, and, when necessary, using materialized views for specific queries.

In one project, we created a materialized view to improve the performance of product tree queries. This view stored the category hierarchy up to a certain depth in a pre-calculated manner. This way, we could use this view for frequently performed queries that required performance. When a new category was added or an existing category was updated, we would recreate the materialized view using the REFRESH MATERIALIZED VIEW command. This made read performance almost as fast as a denormalized structure while allowing us to maintain ease of maintenance.

ℹ️ Trade-off Analysis: Denormalization vs. Normalization

Feature Denormalization Normalization
Read Performance Generally higher (initially) Lower (if not optimized)
Write Performance Lower (due to data duplication) Higher (single point of update)
Data Consistency Low (difficulty in updates) High (no data duplication)
Ease of Maintenance Low (complex updates) High (simple updates)
Disk Space More (data duplication) Less (no data duplication)
Scalability Low (performance bottlenecks) High (can be optimized)

In conclusion, for complex and constantly changing data structures like ERP product trees, normalization is the most sustainable and scalable solution in the long run. Opting for denormalization for the sake of minor initial performance gains will confront us with exponentially increasing maintenance costs and scaling issues over time. Therefore, when making data modeling decisions, it is of great importance to consider long-term sustainability rather than short-term gains.

Top comments (0)