Database Normalization

#database #productivity

Database normalization is a process of restructuring database tables in order to minimize data redundancy and avoid data anomalies such as insert anomaly, delete anomaly, and update anomaly.

Insert anomaly occurs when we have to insert unnecessary similar data. for example when we have the data above. in that case, if we want to add another student that has the same course, then we have to add the same course information such as the lecturer's name, even though that is unnecessarily needed, because, for each same course, the lecturer will be the same.

Next, there is a delete anomaly that occurs when we want to delete data in a certain domain and it affects the loss of other data domains. let's take the previous table as an example. when all students in the fundamental programming course have passed the course, we might delete those students' information because we don't need those students in that table anymore. When we remove the data, it will impact the loss of course information, such as who teaches that course. In this case, we will lose the information on who teaches the fundamental programming course.

And then, there is an update anomaly that occurs when we have to replace all same existing data that belongs to the same data in a data domain. let's take the previous table again as an example. Just imagine that somehow, the lecture that teaches the fundamental programming course (Mr. Marc Whitfield) cannot teach the fundamental programming anymore and have to be replaced by another lecturer, just call him mr. Trystan Mccall. Because of that, we have to replace all lecturer data that belongs to the fundamental programming course (Mr. Marc Whitfield) with mr. Trystan Mccall.

There are some steps to normalize the database table, which are :

1NF
2NF
3NF
BCNF
4NF
5NF

1NF

in order to make the table satisfy the first normal form, there is one simple rule that has to be applied for each table, which is to make every attribute or column in each table store a single atomic data, not a group of data. In the example above, a student have more than one course which is violae the first normal form. we can separate those course data into different rows without the loss of the student and lecturer information, so the table will look like this.
The table above satisfies the first normal form of a database table, but there is partial dependency and transitive dependency which will be discussed in the higher normal form.

2NF

In order to make the table satisfy the second normal form, the table must be in the first normal form. Then, there is an additional rule, the table has no partial dependency. Partial dependency is the dependency of a non-primary attribute which only on a part of a composite primary key, not a whole composite primary key. In the previous table, there is a composite primary key formed by student_number and course_code attribute. There are some attributes that depend only on one primary key, such as student_name that depends only on the student_number attribute, and course_name that depends only on the course_code. These attributes violate the second normal form.

To get rid of that problem, we have to separate the partial dependent attribute into a new table with the corresponding primary key. So, the tables will look like this.

3NF

In order to make the table satisfy the third normal form, the table must be in the second normal form. Then, there is an additional rule, the table has no transitive dependency. Transitive dependency is the dependency of a non-primary attribute on another non-primary attribute. In the previous table, we get rid of student_name and course name that depends on one of the composite primary key, but what about the lecturer_name attribute? it doesn't depend on any of a composite primary key, but instead, it depends on lecturer_code which is not a part of composite primary key nor a single primary key itself.

to convert the table to satisfy the third normal form, we have to separate the transitive dependent attribute into a new table with the corresponding primary key, so the table will look like this.

BCNF (Boyce-Codd Normal Form)

The table should be converted to BCNF only if the table is in the third normal form and every determinant attribute in the table is a candidate key. In the previous table, besides the student_number and course_code that were chosen to be a composite key, there is also another determinant attribute that is also a candidate key, that is lecturer_code, but for some reason, it was not chosen to be a part of the composite primary key. If we observe, one course can be taught by many lecturers, but one lecturer only teaches one course, which means that the course depends on another non-primary key attribute. This condition obviously violates the BCNF.

To convert the table into a BCNF, we can modify the composite primary key from student_number + course_code into student_number + lecturer_code, then we move attribute course code into the lecturer's table. Therefore, there is no candidate key and the tables will look like this.

4NF

To achieve the fourth normal level of a database table, the table must have no multivalued dependency. Multivalued dependency is a dependency of two or more non-primary keys on a primary key or composite of it, where these non-primary keys are independent of each other. Let's take another example in the table below.

In that table, there are language and skill attribute that depend on the same primary key (EmployeeId), but those attributes are not related to each other. In other words, the dependency of those attributes violates the fourth normal form of the database table. To deal with this, we can separate both dependent attribut into a new table with the primary key, so the table will look like this

5NF

The table can satisfy the fifth normal form if the table is in the fourth normal form and the table cannot be decomposed into some smaller tables without losing some information. It's a bit complicated to understand, but let's try to take a look at the table of relationship between brand, distributor and the customer.

There are some relationships that exist between the data, such as distributors that sell stuff with a certain brand name, customers that buy stuff from distributors, and customers that buy stuff with a certain brand name. Let's try to decompose the table into three different tables that represent those relationships.

From the Brand Customer table, we can see that Reggie Pearson buys stuff with brand ASUS. Now let's find out from which distributor Reggie Pearson buys ASUS stuff. If we observe the Distributor Customer Table, Reggie Pearson only buys stuff from distributors AAX, KGR, and BGC. If we observe the Brand Distributor table, only distributors AAX, KGR, and BGC that sell ASUS stuff. From this information, we know that Reggie Pearson might buy ASUS stuff from distributors AAX, KGR, or BGC, but if we take a look back at the original table, Reggie Pearson only buys ASUS stuff from distributor AAX. This misleading information might be caused by some conditions, maybe there is someone that buy ASUS stuff from distributor KGR and BGC, or maybe there is another brand that bought by Reggie Pearson from distributor KGR and BGC. That's why we cannot decompose the original table into some smaller tables because we lose information on who is actually buying stuff with which brand name from which distributor. In other words, the original table already satisfies the fifth normal form.