
Demystifying normalization and denormalization

One of the first elements to be defined during the design of an application is the data model. The data model is the representation of the data: it has to be correct and designed so that consuming the data is as simple as possible. The first aspect to consider is what data needs to be saved: despite being an apparently trivial question, it involves several considerations that aren’t trivial at all. One of them is how much to normalize the data.

In the context of relational databases there are several standard normal forms you can choose to apply to your data model. Usually you will strive to apply one normal form consistently, but circumstances may call for a different approach. You can find several articles discussing this topic, but most are too exhaustive or too academic to serve as a “quick reference”, so I thought a leaner approach to the subject could be useful.

Every normal form assumes the previous ones are respected. This, again, may not necessarily apply to your data model, but make sure you have good reasons if it doesn’t.

First normal form

In this normal form every attribute of your entity holds a single, atomic value from the domain chosen for that attribute. You violate it if you save a list of values in a single attribute.

| Name\* | Address |
|----------------|------------------------------------------------|
| Telematica srl | Via Rigoletto 20, Milano; Via Manzoni 15, Roma |
| NextImage spa | Via Isonzo 3, La Spezia |

1 - This table does not satisfy the First Normal Form
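
A minimal sketch of the fix (the table and column names are my own, with SQLite standing in as an example database): the repeating addresses move into their own rows in a separate table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per company, one row per address: no list crammed into a single column.
conn.executescript("""
CREATE TABLE company (name TEXT PRIMARY KEY);
CREATE TABLE company_address (
    company_name TEXT REFERENCES company(name),
    address      TEXT,
    PRIMARY KEY (company_name, address)
);
""")

conn.execute("INSERT INTO company VALUES ('Telematica srl'), ('NextImage spa')")
conn.executemany(
    "INSERT INTO company_address VALUES (?, ?)",
    [("Telematica srl", "Via Rigoletto 20, Milano"),
     ("Telematica srl", "Via Manzoni 15, Roma"),
     ("NextImage spa", "Via Isonzo 3, La Spezia")],
)

for row in conn.execute("SELECT * FROM company_address ORDER BY company_name"):
    print(row)
```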

Second normal form

An attribute is non-prime when it is not part of any candidate key. In this normal form no non-prime attribute depends on a proper subset of a candidate key. In simpler words, some attributes of your key, taken on their own, may act as the key for other attributes; those other attributes should be moved to another table that uses this smaller key. For example, in a table having year and month as its key, the season attribute violates the second normal form, since it depends solely on the month.

| Year\*| Month\* | Season |
|------|----------|--------|
| 2019 | January | Winter |
| 2019 | August | Summer |
| 2018 | February | Winter |

2 - This table does not satisfy the Second Normal Form
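
A minimal sketch of the decomposition (illustrative names again, SQLite as the example database): the season moves to a table keyed by month alone, so it no longer depends on only part of the original key, and the original rows can be rebuilt with a join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Season depends on Month alone, so it gets its own table.
CREATE TABLE month_season (
    month  TEXT PRIMARY KEY,
    season TEXT NOT NULL
);
CREATE TABLE year_month (
    year  INTEGER,
    month TEXT REFERENCES month_season(month),
    PRIMARY KEY (year, month)
);
""")

conn.executemany("INSERT INTO month_season VALUES (?, ?)",
                 [("January", "Winter"), ("August", "Summer"), ("February", "Winter")])
conn.executemany("INSERT INTO year_month VALUES (?, ?)",
                 [(2019, "January"), (2019, "August"), (2018, "February")])

# The original table is recovered with a join when needed.
for row in conn.execute("""
    SELECT ym.year, ym.month, ms.season
    FROM year_month ym JOIN month_season ms ON ms.month = ym.month
"""):
    print(row)
```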

Third normal form

In this normal form you don’t have transitive dependencies between the key and the attributes. In other words, you don’t have non-key attributes depending on other non-key attributes, since that would mean they depend on the key only transitively. For example, in a table with month as the key and season as an attribute, the climate attribute violates this normal form if it depends only on the season.

| Month\* | Season | Climate |
|---------|--------|----------------|
| January | Winter | Freezing cold |
| August | Summer | Unbearably hot |

3 - This table does not satisfy the Third Normal Form
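
A possible decomposition, sketched with illustrative names: climate moves to a table keyed by season, so every attribute depends directly on the key of its own table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Climate depends on Season, not directly on Month,
-- so it moves to a table keyed by Season.
CREATE TABLE season_climate (
    season  TEXT PRIMARY KEY,
    climate TEXT NOT NULL
);
CREATE TABLE month_season (
    month  TEXT PRIMARY KEY,
    season TEXT REFERENCES season_climate(season)
);
""")

conn.executemany("INSERT INTO season_climate VALUES (?, ?)",
                 [("Winter", "Freezing cold"), ("Summer", "Unbearably hot")])
conn.executemany("INSERT INTO month_season VALUES (?, ?)",
                 [("January", "Winter"), ("August", "Summer")])

for row in conn.execute("""
    SELECT ms.month, ms.season, sc.climate
    FROM month_season ms JOIN season_climate sc ON sc.season = ms.season
"""):
    print(row)
```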

Boyce-Codd normal form

This normal form is trickier to define. In it you remove any functional redundancy still present in a table, so that the only functional dependencies left are those from a key to the other attributes. Usually, the third normal form already satisfies this requirement. There are, however, situations where several functional dependencies overlap on the same information, or where a dependency is left unexpressed.

For the first case, let’s imagine a table representing a list of mixed-type preferences, with the requirement that any person can have only one preference per type.

| Name\* | Type\* | Preference |
|-------|---------|------------|
| Paolo | City | Rome |
| Luca | Color | Red |
| Lucia | Climate | Hot |

4 - This table does not satisfy Boyce-Codd Normal Form

The expressed dependency, as the primary key suggests, is Name, Type -> Preference.

There is, however, an unexpressed dependency: Preference -> Type. Indeed, the value of Preference uniquely determines which Type is the correct one. In this instance a conversion to Boyce-Codd Normal Form is not possible: it would imply removing the Type attribute (using it as an attribute in a Preference, Type table), but that removal would destroy the primary key of the original table. If you just add the aforementioned second table without altering the first one, you don’t satisfy BCNF, but you do satisfy another normal form, the Elementary Key Normal Form, so that is still an improvement.
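
A sketch of that compromise (names are illustrative): the original table stays as it is, and a second table makes the Preference -> Type dependency explicit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Original table, untouched: (Name, Type) is still the primary key.
CREATE TABLE person_preference (
    name       TEXT,
    type       TEXT,
    preference TEXT NOT NULL,
    PRIMARY KEY (name, type)
);
-- Added table expressing the Preference -> Type dependency.
-- The type column above is now redundant, but removing it
-- would destroy the original primary key.
CREATE TABLE preference_type (
    preference TEXT PRIMARY KEY,
    type       TEXT NOT NULL
);
""")

conn.executemany("INSERT INTO preference_type VALUES (?, ?)",
                 [("Rome", "City"), ("Red", "Color"), ("Hot", "Climate")])
conn.executemany("INSERT INTO person_preference VALUES (?, ?, ?)",
                 [("Paolo", "City", "Rome"),
                  ("Luca", "Color", "Red"),
                  ("Lucia", "Climate", "Hot")])
```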

For the second case, imagine a table of the prescribed equipment assigned to a list of people and where they’re planning to use it.

| Name\* | Destination\*| Equipment |
|---------|-------------|-----------|
| Paolo | Mountain | Ski suit |
| Lucia | Mountain | Sunscreen |
| Alberto | Sea | Sunscreen |
| Roberta | Sea | Gym suit |

5 - This table does not satisfy Boyce-Codd Normal Form

In this case we can imagine an unexpressed dependency Equipment <-> Destination, that is, only certain combinations of Equipment and Destination make sense. If we make the hidden variable Season explicit, we can normalize the information.

| Name\* | Destination\*| Season |
|---------|-------------|--------|
| Paolo | Mountain | Winter |
| Lucia | Mountain | Summer |
| Alberto | Sea | Summer |
| Roberta | Sea | Winter |

| Destination\*| Season\*| Equipment |
|-------------|--------|-----------|
| Sea | Winter | Gym suit |
| Sea | Summer | Sunscreen |
| Mountain | Winter | Ski suit |
| Mountain | Summer | Sunscreen |

6 - These tables satisfy Boyce-Codd Normal Form
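
The same decomposition sketched as schema definitions (names are illustrative); the original prescription list comes back with a join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The hidden variable Season is made explicit.
CREATE TABLE trip (
    name        TEXT,
    destination TEXT,
    season      TEXT NOT NULL,
    PRIMARY KEY (name, destination)
);
-- Equipment is fully determined by (destination, season).
CREATE TABLE destination_equipment (
    destination TEXT,
    season      TEXT,
    equipment   TEXT NOT NULL,
    PRIMARY KEY (destination, season)
);
""")

conn.executemany("INSERT INTO trip VALUES (?, ?, ?)",
                 [("Paolo", "Mountain", "Winter"), ("Lucia", "Mountain", "Summer"),
                  ("Alberto", "Sea", "Summer"), ("Roberta", "Sea", "Winter")])
conn.executemany("INSERT INTO destination_equipment VALUES (?, ?, ?)",
                 [("Sea", "Winter", "Gym suit"), ("Sea", "Summer", "Sunscreen"),
                  ("Mountain", "Winter", "Ski suit"), ("Mountain", "Summer", "Sunscreen")])

# Rebuilding the original prescription list.
for row in conn.execute("""
    SELECT t.name, t.destination, de.equipment
    FROM trip t
    JOIN destination_equipment de
      ON de.destination = t.destination AND de.season = t.season
"""):
    print(row)
```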

Further normal forms

There exist other normal forms. A mention should be given to the Sixth Normal Form, which prescribes irreducible tables: a (possibly composite) key plus at most one other attribute.

The normal forms shown so far have only indirectly something to say about the modeling of the application domain, but the Domain-Key Normal Form explicitly tackles this aspect. It states that every constraint on the data should follow from key constraints or domain constraints. There shouldn’t be implicit dependencies, and the focus is on good modeling more than on the correct construction of the tables.

For example, imagine you have modeled the requirement to know the age and condition of a list of people as follows:

| Name\* | Condition | Age |
|---------|------------|-----|
| Paolo | Unemployed | 20 |
| Lucia | Employed | 36 |
| Clara | Unemployed | 20 |
| Stefano | Retired | 70 |

7 - Under our premises, this table does not satisfy Domain-Key Normal Form

For the sake of the example, let’s suppose there is an exact relationship between age and possible conditions; in particular, that not working before age 70 is regarded as unemployment, while not working from age 70 onward simply means being retired. A refactoring of the model itself is in order:

| Name\* | Working | Age |
|---------|---------|-----|
| Paolo | No | 20 |
| Lucia | Yes | 36 |
| Clara | No | 20 |
| Stefano | No | 70 |

| Age segment\*| Non-working condition |
|-------------|-----------------------|
| \>= 70 | Retired |
| \< 70 | Unemployed |

8 - These tables satisfy Domain-Key Normal Form
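
With this model the condition no longer needs to be stored at all: it can be derived. A minimal sketch, assuming the age-70 rule stated above:

```python
def condition(working: bool, age: int) -> str:
    """Derive the condition from the Working flag and the age segment."""
    if working:
        return "Employed"
    return "Retired" if age >= 70 else "Unemployed"

people = [("Paolo", False, 20), ("Lucia", True, 36),
          ("Clara", False, 20), ("Stefano", False, 70)]

for name, working, age in people:
    print(name, condition(working, age))
```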

Denormalization

A database with redundancies is, by definition, not normalized. Redundancy is not desirable per se, but it can be accepted when it is useful in specific circumstances. When possible, it should be isolated.

Common problems of redundancy can be avoided most of the time if the redundant information, or the source information in the case of derived data, is immutable: the typical problems linked to redundancy are the “insertion anomaly”, the “update anomaly” and the “deletion anomaly”, and if the data is immutable only the first one applies.

The process where you consciously introduce some redundancy is called denormalization. This implies there is a base, normalized data model, and that redundancy is limited to some specific cases. Denormalization can be a form of performance optimization or be dictated by specific technical considerations. One consequence of denormalization is increased complexity of write operations compensated by decreased complexity of read operations, a trade-off that can be beneficial.

Before opting for denormalization, it is important to bear in mind that databases are not only meant to store information, but also to retrieve it efficiently. Databases offer some native “tools” to optimize performance without resorting to redundancy in the data model:

  • Indexes

  • Views (especially materialized views)

  • Internal caching

In some instances, denormalization can be seen as a form of eager data caching.
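
As a small sketch of the first two tools (SQLite stands in as the example database, and it offers plain rather than materialized views), both are additions that leave the data model untouched:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total REAL);

-- Index: speeds up lookups by customer without duplicating any data.
CREATE INDEX idx_orders_customer ON orders (customer);

-- View: stores the query, not its result; a materialized view
-- (where supported) would also cache the result.
CREATE VIEW customer_totals AS
    SELECT customer, SUM(total) AS total_spent
    FROM orders
    GROUP BY customer;
""")

conn.execute("INSERT INTO orders VALUES (1, 'Telematica srl', 100.0), (2, 'Telematica srl', 250.0)")
print(conn.execute("SELECT * FROM customer_totals").fetchall())
```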

There are several ways you can apply denormalization, but I find that all of them fall into three types; a sketch of the first two follows the list:

  • saving derived information: you save the result of some computation, so you avoid having to repeat it.

  • saving pre-join information: you copy into a table attributes of another, linked table, so you avoid having to do one or more joins.

  • short-circuit keys: this can be seen as a particular case of the previous one, where you save in a table a foreign key resulting from a transitive relation with another table, again avoiding one or more joins.
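
As an illustration of the first two types (the schema and the add_order helper are made up for this example), an order row might carry a derived total and a pre-joined customer name, both redundant with the source tables and both kept in sync on every write:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer   (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE "order"    (id INTEGER PRIMARY KEY,
                         customer_id   INTEGER REFERENCES customer(id),
                         customer_name TEXT,   -- pre-join: copied from customer
                         total         REAL);  -- derived: sum of the order lines
CREATE TABLE order_line (order_id INTEGER REFERENCES "order"(id),
                         amount   REAL);
""")

def add_order(conn, order_id, customer_id, amounts):
    # Writes become more complex: the redundant columns must be kept in sync.
    name = conn.execute("SELECT name FROM customer WHERE id = ?",
                        (customer_id,)).fetchone()[0]
    conn.execute('INSERT INTO "order" VALUES (?, ?, ?, ?)',
                 (order_id, customer_id, name, sum(amounts)))
    conn.executemany("INSERT INTO order_line VALUES (?, ?)",
                     [(order_id, a) for a in amounts])

conn.execute("INSERT INTO customer VALUES (1, 'Telematica srl')")
add_order(conn, 1, 1, [100.0, 250.0])

# Reads become simpler: no join and no aggregation needed.
print(conn.execute('SELECT customer_name, total FROM "order"').fetchone())
```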

I hope this article can be a useful reference when designing data models. Some of the examples were a bit contrived, with the aim of keeping them simple.

References

As stated earlier, exhaustiveness was not the main purpose; here are two references where you can dig deeper into these topics.

On normalization: https://en.wikipedia.org/wiki/Normalization

On denormalization: https://rubygarage.org/blog/database-denormalization-with-examples
