In the early stages of application development, the data schema is typically designed with normalization as the guiding principle. However, as the application grows and tables expand, specific queries can become sluggish, negatively impacting the user experience. In such situations, it becomes essential to consider denormalizing the data.
What is denormalization?
Denormalization involves strategically duplicating a small amount of data across tables, reducing the need for joins that combine data from multiple sources. Another denormalization technique is maintaining precomputed data aggregations instead of calculating them in real time. While this approach introduces some data redundancy, it can significantly enhance query performance.
Some "textbook" examples:
- The Classic Approach: Instead of joining the potentially large "orders" and "customers" tables just to retrieve the customer name, we add the customer name as a column in the orders table (see the sketch after this list).
- The "Justin Bieber" Case: In a social media app, rather than maintaining the relationship of "users liked a post" and then aggregating the count of likes for a post, we keep the current likes_count in the posts table. This approach is named the "Justin Bieber" case because Instagram used it to manage likes for celebrity posts.
- Short-Circuiting: In scenarios with three or more levels of joins, such as organization → users → categories → posts, to eliminate one or more join levels, we add organization_id and/or user_id directly to the posts table.
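As a minimal sketch of the classic approach above (the table and column names are illustrative assumptions, not a real schema):

-- Denormalize: copy the customer name onto each order row
ALTER TABLE orders ADD COLUMN customer_name text;

-- One-time backfill from the normalized source
UPDATE orders o
SET customer_name = c.name
FROM customers c
WHERE c.id = o.customer_id;

-- Reads no longer need the join
SELECT id, total_amount, customer_name FROM orders WHERE id = 42;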
What are the drawbacks of denormalization?
Denormalization brings new problems of its own. Duplicated data requires extra storage space. When a value changes, every copy of it in other tables must be updated to avoid anomalies and discrepancies, which makes write operations slower.
Those are the classic textbook techniques. Databases have evolved, though, and we now have additional tools that let us keep the underlying data model intact.
Materialized view
What is a materialized view?
A materialized view is a database object that includes a copy of a query's result. This way, we don't need to fetch, process, and combine the data from all the tables involved in the query or aggregate and group the data repeatedly.
Examples (postgres syntax):
CREATE MATERIALIZED VIEW MV_MY_VIEW
AS SELECT * FROM <table_name>;
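And a slightly more realistic sketch that precomputes the likes count from the "Justin Bieber" example above (assuming hypothetical posts and likes tables):

-- Precompute the number of likes per post so reads don't aggregate every time
CREATE MATERIALIZED VIEW post_like_counts AS
SELECT post_id, count(*) AS likes_count
FROM likes
GROUP BY post_id;

-- Materialized views can be indexed like regular tables
CREATE UNIQUE INDEX idx_post_like_counts ON post_like_counts (post_id);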
Databases that support materialized views include Oracle (8i and later), PostgreSQL (9.3 and later, since 2013), SQL Server (2000 and later), DynamoDB, and BigQuery.
MySQL doesn't support them out of the box.
Stream processing frameworks: Apache Kafka (since v0.10.2), Apache Spark (since v2.0), Apache Flink, Kinetica DB, Materialize, and RisingWave all support materialized views on data streams.
The benefits of using materialized views:
- No need to change the DB schema
- Faster data retrieval, since results are stored instead of being recalculated on every query
- It's possible to index the results.
The cons of using materialized views:
- Usually, materialized views are not auto-refreshed, so data may not be fresh (see the refresh example after this list).
- Maintenance outside of the code
- Requires more storage
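Because of the first drawback, a refresh usually has to be triggered explicitly, for example from a cron job or after bulk writes. In Postgres it looks like this:

-- Recompute the view; readers are blocked until it finishes
REFRESH MATERIALIZED VIEW post_like_counts;

-- Or refresh without blocking readers (requires a unique index on the view)
REFRESH MATERIALIZED VIEW CONCURRENTLY post_like_counts;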
Offload as a Document
When I worked at Tapingo/GrubHub a few years ago, we aimed to decouple the ordering system from the central database. One reason, though not the primary one, was the frequent need to fetch related order data for various purposes, such as payments, POS display, ticket printing, Kitchen Display System, courier information, and order content.
We addressed this by creating a comprehensive document encapsulating all the relevant data contexts and distributing it to the various microservices. This document included duplicate data about shops and customers that was accurate as of the time the order was placed. Each microservice had the autonomy to decide where and how long to store the document, allowing for fast query times by order-id, shop-id, and other indices.
The document could be stored in a data store such as Redis or MongoDB, or streamed/pushed to a message broker or stream-processing system.
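As an illustration only (the schema below is hypothetical, not the actual model we used), one simple way to hold such a document in Postgres is a JSONB column keyed by the IDs used for lookups:

-- A denormalized order snapshot, stored as a single document
CREATE TABLE order_documents (
    order_id bigint PRIMARY KEY,
    shop_id bigint NOT NULL,
    document jsonb NOT NULL, -- customer, shop, items, courier, and payment context at order time
    created_at timestamptz NOT NULL DEFAULT now()
);

-- Secondary index for lookups by shop
CREATE INDEX idx_order_documents_shop_id ON order_documents (shop_id);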
Use Array/JSON columns
At EasyTeam, we implemented a "take a break" feature: admins define which breaks are available during a shift, and workers then clock in for them.
A simple approach would be to create a many-to-many table between the scheduled shift and break types tables. However, since scheduled shifts are replicated frequently (new shifts every week), this would lead to an enormous many-to-many table. Instead, we used an array column on the scheduled shifts table to store the break types' IDs, with a GIN index on the column for faster searches.
While this method may compromise data integrity (the database can't enforce foreign-key constraints on the array elements), it significantly reduces the amount of data stored.
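A minimal Postgres sketch of this approach (table and column names are illustrative):

-- Store the allowed break type IDs directly on the shift row
ALTER TABLE scheduled_shifts ADD COLUMN break_type_ids integer[] NOT NULL DEFAULT '{}';

-- GIN index to speed up containment searches on the array
CREATE INDEX idx_scheduled_shifts_break_type_ids ON scheduled_shifts USING GIN (break_type_ids);

-- Find shifts that offer break type 42
SELECT id FROM scheduled_shifts WHERE break_type_ids @> ARRAY[42];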
From my experience, joining a table with a JSON field and selecting the JSON field in the query (either with SELECT * or by naming the field explicitly) can hurt query performance.
In summary, denormalization helps fine-tune frequent, heavy read queries that perform poorly. Use it wisely to avoid complex maintenance, extra storage costs, and data integrity issues.
This post was originally published in my newsletter, Percentile 99th. If you want to learn how to improve application performance, I write about it extensively there.