Product Tree Denormalization in Side Projects: Is It Really Necessary?

#life #denormalization #database #architecture

When working on my side projects or in a production ERP, the structure called a "product tree" always comes up. Essentially, it's a hierarchical structure that shows which components a product consists of, and how those components are further divided into sub-components. In a database, it's usually modeled with a self-referencing table or a separate relationship table.

This structure plays a key role, especially in critical processes like production planning, cost calculation, and inventory management. However, when complex queries and performance issues arise, denormalization immediately comes to mind. But is it always the right solution, especially in our small personal projects?

The Product Tree Complexity and My First Experiences

I first experienced the true complexity of a product tree when I was working in the ERP of a large manufacturing company. Imagine an engine consisting of hundreds of parts, with each part having dozens of sub-parts. When one part changed, the entire tree needed to be affected, and this had to be reflected in the costs.

In such situations, software architecture often reflects organizational flow more than just software. When designing the database model, you have to consider not only how data will be stored but also the actual workflows on the production line. A simple raw material change can directly affect the final product's price and delivery time.

ℹ️ Organizational Flow

In enterprise software development, I've seen that architectural decisions are often a reflection of business processes and organizational structure rather than purely technical considerations. The complexity of a product tree is directly related to the company's manufacturing flexibility and the depth of its supply chain management.

Once, in a client project, I saw a product tree reach 7 levels deep. Every "where used" query on this tree scanned hundreds, even thousands of rows, taking seconds. Reports were delayed, and production planning was disrupted. This situation was one of the first major signs that pushed us to consider performance improvements.

When Does Denormalization Come to Mind?

Denormalization is often seen as a savior when performance bottlenecks emerge. Especially with hierarchical data, deep queries and JOIN operations can be costly. Finding all components of a product tree or determining which parent products a specific component is used in can be challenging for the database.

For example, to calculate the cost of a specific part in a product tree, you need to sum the costs of all its sub-components and propagate them upwards. These recursive queries, especially with large datasets, increase CPU and I/O consumption. WITH RECURSIVE queries in PostgreSQL are quite powerful, but they can still slow down after a certain scale.

WITH RECURSIVE part_hierarchy AS (
    SELECT
        p.part_id,
        p.part_name,
        p.cost,
        p.part_id AS top_level_part_id
    FROM parts p
    WHERE p.parent_id IS NULL -- Root parts
    UNION ALL
    SELECT
        cp.child_part_id AS part_id,
        child_p.part_name,
        child_p.cost,
        ph.top_level_part_id
    FROM part_components cp
    JOIN part_hierarchy ph ON ph.part_id = cp.parent_part_id
    JOIN parts child_p ON child_p.part_id = cp.child_part_id
)
SELECT * FROM part_hierarchy;

A query like the one above is used to fully expand the product tree. If you run this query for every cost calculation or every reporting instance, the overall system performance will decrease. This is precisely where the idea of adding a denormalized "path" or "depth" information to the main table becomes appealing. For example, approaches like storing a tree path like "001.002.003" for each part or keeping an array of parent nodes for each node are considered.

Is the Situation Different in Side Projects?

There's a world of difference between the scale and performance expectations in an enterprise ERP and what I encounter in my own side projects. While a production ERP processes tens of thousands of orders a day, the backend of my own side product might serve a few hundred users a month. In this situation, going for a complex and high-maintenance solution like denormalization is often overkill.

In my philosophy, the "good enough" approach is very valuable in side projects. If a query takes 200ms instead of 100ms, and this doesn't noticeably affect the user's experience, it's not worth spending weeks on that optimization. The important thing is for the project to survive and continuously evolve. I'm not starting with "99.999% uptime" goals like in big companies.

⚠️ Side Project Philosophy

In side projects, it's more valuable to get a feature 80% functional and move on to the next step, rather than striving for 100% perfection. Since our resources (time, energy) are limited, pragmatism should be paramount in every decision. The denormalization decision is included in this.

In one of my side products, there was a task management application where users created their own product trees. Initially, I went with a simple self-referencing table and wasn't concerned about performance. The tree depth didn't exceed 3-4 levels, and the number of elements was limited to hundreds. WITH RECURSIVE queries ran quite fast. If, in the future, I reach, say, 10,000 users and reports slow down, then I'll consider denormalization. But until then, the simple solution is sufficient.

Reviewing Costs and Trade-offs

Denormalization has significant costs and trade-offs, as well as benefits. First and foremost, data redundancy arises. Since the same information is stored in multiple places, the risk of update anomalies increases. When a part name or cost changes, you need to update not only the main table but also all denormalized fields.

This situation can lead to consistency issues, especially if there are multiple write operations in the system. More complex transaction management or approaches like event-sourcing might be needed to maintain ACID guarantees. This prolongs development time and increases the likelihood of errors.

-- Updating a denormalized cost field
UPDATE products
SET denormalized_total_cost = (
    SELECT SUM(p.cost) FROM parts_denormalized_path pdp
    JOIN parts p ON p.part_id = pdp.part_id
    WHERE pdp.product_id = products.product_id
)
WHERE products.product_id = 'XYZ';

-- This query shows that the denormalized field needs to be updated manually.
-- In real life, this could be done by triggers or an event handler.

In my experience, such costs usually outweigh the benefits in side projects. I don't want to increase future maintenance and debugging overhead for the sake of gaining a few milliseconds. Once, in a client project, incorrect reports were generated for months because a denormalized field was forgotten to be updated. Finding and fixing this error required days of effort.

A Pragmatic Approach: Step-by-Step Optimization

My approach to denormalization is usually step-by-step optimization. That is, instead of denormalizing from the start, I prefer to proceed gradually when performance issues are truly felt. First, I focus on good indexing and query optimizations.

Proper B-tree or GIN indexes in PostgreSQL can significantly improve the performance of WITH RECURSIVE queries. By looking at EXPLAIN ANALYZE outputs, identifying where the query spends time and adding indexes accordingly is often the first and most effective step. For example, adding indexes to parent_id or path fields speeds up tree traversals.

-- Example indexing strategy
CREATE INDEX idx_parts_parent_id ON parts (parent_id);
CREATE INDEX idx_part_components_parent_child ON part_components (parent_part_id, child_part_id);

If indexing isn't enough, materialized views offer a great intermediate solution. You can store a specific view of the product tree or total costs as a MATERIALIZED VIEW and refresh it at regular intervals. This improves read performance while isolating the complex update logic that denormalization brings.

💡 Materialized Views

When you experience performance issues, consider using MATERIALIZED VIEW before resorting to denormalization. It's an excellent solution, especially for frequently read but not instantly updated data like reporting and analytics. Running the REFRESH MATERIALIZED VIEW command at appropriate intervals is sufficient.

Caching is another powerful alternative. Storing frequently accessed product tree paths or cost calculations in a Redis instance can reduce the load on the database. For example, after calculating the full cost of a product, I can write it to Redis with a 1-hour TTL (Time-To-Live). When the next request comes, I read from Redis, avoiding a database query. This approach typically brings the system's response time down to milliseconds.

My Clear Position: Generally No

For my side projects or small-scale systems, performing product tree denormalization is usually a question I answer with "no." I rarely go down this path unless there's a truly critical performance bottleneck and all other optimization options have been exhausted. Development time and ease of maintenance often take precedence over raw performance for me.

Denormalization, once implemented, is a difficult and costly decision to reverse. It complicates the data model, weakens consistency guarantees, and makes future changes harder. If your project hasn't scaled yet, there's no point in shouldering this extra burden. I experienced a similar situation before with a financial calculator project hosted on my own VPS; instead of trying to optimize everything from the start, it made much more sense to begin with a simple data model and then take subsequent steps based on performance measurements.

There are, of course, scenarios where denormalization might be justified:

Extreme Read Loads: Systems with millions of read operations per day and very rare write operations.
Static Data Structures: Situations where the product tree changes very rarely or not at all.
Specific Reporting Needs: Situations where an instant "snapshot" of the entire tree is needed, and creating this snapshot takes a very long time.

However, these scenarios don't apply to most of my side projects. For me, if a query takes 200ms instead of 50ms, but doing so requires 3 days of extra development and continuous maintenance, I'd prefer the 200ms one. My time is limited, and I prefer to spend it developing the core features of my project.

Conclusion: Simplicity Always Wins

In summary, the topic of product tree denormalization, especially in my own side projects, always starts with the question, "Is it really necessary?" For me, simplicity and sustainability are always priorities. Generally, less invasive solutions like good indexing, materialized views, and caching are sufficient to meet my performance needs.

If your project isn't dealing with hundreds of thousands or millions of data points, the complexity and costs introduced by denormalization might outweigh the performance benefits it provides. Let's remember that the best decisions in software architecture are usually those that are most appropriate for current needs and least restrict future flexibility. In my next post, I'll talk about how systemd units reliably work in production environments and a cgroup limit error I encountered.