Database Normalization in Simple Terms (1st, 2nd and 3rd Normal Forms)
Picture this: you’re getting ready for a date, aiming to impress in your favorite T-shirt. But here’s the twist — your wardrobe is an absolute mess. The frantic hunt through heaps of clothes to locate that one piece is driving you crazy. This desperate search mirrors the struggle of navigating a non-normalized database. Just as a chaotic closet sabotages your quest for the perfect outfit, an unorganized database thwarts efficient data retrieval due to redundancy and inconsistencies. In this article, I’ll be diving into the world of database normalization — a systematic approach that brings order to data, much like tidying up your clothes to reclaim order in your closet.
What is Database Normalization?
Database normalization is simply the process of structuring and organizing data within a database to reduce redundancy and improve efficiency. It involves breaking down or decomposing large and complex tables into smaller and simpler tables, and then connecting each of these tables using what is known as a foreign key. The purpose of this is to ensure data accuracy, make querying faster, and maintain consistency in a database by minimizing data anomalies and errors.
Why do database engineers normalize databases?
Database engineers or developers normalize databases to reduce data duplication, avoid data modification implications, which in turn helps to simplify data queries from the database. To gain a better understanding of normalization and the challenges it addresses, let’s look at an example of a table that hasn’t been normalized.
The table above presents a list of university students, courses, and departments. It also illustrates the relationships and associations between students, courses, and departments. Additionally, it includes the names and contact details of the department heads.
Developing tables like this that serve multiple purposes causes serious challenges and problems for database systems. The most common of these challenges include:
Insert anomaly
Update anomaly and
Deletion anomaly
Let’s explore these issues with a slightly more detailed approach.
Insert Anomaly
Insert anomaly occurs when new data is inserted into a table, which then requires the insertion of additional data. Let’s use the Enrollment Table above as an example. In this table, each student is identified by a special ID. But the issue is, before I can add new records for things like courses, I must first put data in the student ID column. This is because it is a PRIMARY KEY and can not be NULL, even if it does not make sense. Essentially, this means I can’t add a new course without enrolling new students, and I can’t enroll new students without giving them an ID. And this ID can’t be left empty. So, I end up stuck — I can’t add a new course unless I also add new student data. This is what we mean by an insert anomaly, where adding one type of information depends on adding another type that shouldn’t be directly connected.
Update anomaly
An update anomaly occurs when you attempt to update a record in a table column only to discover that this will result in further updates across the table. Let’s go back to the Enrollment Table to see how this happens. In that table, the course name and department info is repeated for every student in that course. This repeating makes the database bigger and harder to keep accurate. Now, let’s say Dr. Harry, the head of the Computing Department, leaves and a new head comes in. I need to change Dr. Harry’s name to the new head’s name everywhere in the table. But I also have to change it for all the students in that department. This is tough because if I miss any students, the table will have wrong info. This is the update anomaly problem. Changing data in one spot causes changes in many other spots.
Deletion Anomaly
A deletion anomaly is when the deletion of a record of data causes the deletion of more than one set of data required in the database. For instance, let’s consider Emmanuel, a student with the ID number 4, who wants to quit his course. If I remove Emmanuel’s data, it unexpectedly erases records from the Engineering department as well, since they rely on Emmanuel’s ID. This is a case of the deletion anomaly issue. Deleting one set of data causes the removal of other related records too.
So, how can these problems be solved? The answer lies in database normalization.
Normalization optimizes the database design by creating a single purpose for each table. To normalize the Enrollment Table, I need to redesign it. As you discovered earlier, the table’s current design serves three different purposes. So the solution is to split the table in three, essentially, creating a single table for each purpose. This means that I now have a Student Table with information on each student, a Course Table that contains the records for each course, and a Department Table with information for each department. This separation of information helps to solve the anomaly challenges and also makes it easier to write sequel queries in order to search for, sort and analyze data.
How to Normalize a Database
To normalize a database table, you need to adhere to three fundamental normalization forms.
First Normal Form (1NF)
Second Normal Form (2NF)
Third Normal Form (3NF)
First Normal Form(1NF)
The main goal of the first normal form is to remove unnecessary repeating data groups by enforcing the concept of data atomicity.
Data atomicity simply means that each column attribute in a table field should have only a single instance value. In simpler terms, your tables should contain only one value per field. By getting rid of these repeating data groups, you can prevent the unnecessary duplication of data in the database. Instances of duplicated data can lead to both data redundancy and inconsistency.
To understand this better, I’ve built an unnormalized table called Course Table within a college database as seen below.
It includes information about the colleges tech courses, along with the names and contact details of the course tutors.The course ID column serves as the table’s primary key. However, there are multiple values in each row of the contact number column. For every tutor, there exists both a cell phone number and a landline number. The current structure of this table doesn’t conform to the first normal form (1NF) due to the violation of the atomicity principle. To address this, one potential solution is to introduce separate rows for each contact number, as demonstrated below. This adjustment effectively resolves the concern related to data atomicity.
The table now has just one value in each field. But this solution has also created another problem. The primary key is no longer unique because multiple rows now has the same course ID. Another way that I could solve the problem of atomicity while retaining the primary key is by creating two columns for contact numbers. One column for cell phones and second column for landline numbers as seen below.
But I still have the issue of unnecessary repeated groups of data. John Conor is the assigned tutor for two of the courses. His name appears twice in the table as do his contact details. These instances of data will continue to reappear if he’s assigned more courses to teach. It’s likely that his details will appear in other tables within a database system. This means I could have even more groups of repeated data. This creates another problem. If this user changes any of their details, then I’ll have to update their details in this table and all others in which it appears. If I miss any of these tables, then I’ll have inconsistency and invalid data within my database system. To solve this issue, I can redesign my table to adhere to 1NF or first normal form.
First, I identify the repeating groups of data. In this case it is the tutor’s name and contact numbers. Next, I identify the entities I’m dealing with, which are course and tutor. Then I split the course table so that I now have one table for each entity;A course table that contains information about the course and a Tutor’s table that maintains the name and contact numbers of each tutor. Now I need to assign a primary key to the tutor table by introducing a tutor ID column. I’ve solved the problem of data atomicity, but I also need to provide a link between the two tables. I can connect the two tables by using a foreign key. Likewise, I just add the tutor ID column to the course table. Now both tables are linked. I’ve now achieved data atomicity and eliminated unnecessary repeating groups of data. Then the final database then looks like this :
Second Normal Form(2NF)
Before we delve into second normal form or 2NF, understand that database normalization is a progressive process. So a database has to be in first normal form before you can implement second normal form (2NF). To bring a database to 2NF you need to first understand what is meant by the terms functional and partial dependency.
In a relation (table), a functional dependency occurs when the value of one or more attributes uniquely determines the value of another attribute. In simple terms, if the value of attribute A determines the value of attribute B, then A functionally determines B and B is functionally dependent on A.
Consider a table named “Students” with attributes: StudentID, FirstName, LastName, and Email. Let’s say that StudentID determines both FirstName and LastName.
Here, StudentID determines the firstname and lastname of each student therefore both first and last names are functionally dependent of studentID.
A partial dependency occurs when an attribute is functionally dependent on only part of a candidate key (OrderID+ProductID) rather than the entire key. In other words, an attribute depends on only a subset of the primary key, leading to redundancy.
Let’s consider a table named “Orders” with attributes: OrderID, ProductID, ProductName, and CustomerID. Suppose that both OrderID and ProductID together uniquely identify ProductName, but ProductName is dependent on ProductID alone and NOT (OrderID+ProductID). In other words, ProductID -> ProductName is a partial dependency because ProductName depends only on part of the candidate key (ProductID), not the whole key (OrderID + ProductID).
Now, what we aim to achieve by implementing second normal form is to avoid instances like partial dependencies as it violates the 2NF rule.Let’s look at how to upgrade this table to 2NF.Looking at the table, the non-key attributes (ProductName) are functionally dependent only on part of the candidate key (ProductID), which leads to partial dependency. To achieve 2NF, we’ll need to split the table into two tables: one for Orders and another for **Products **and then link these 2 tables using a foreign key.By doing this, we end up with the following tables;
Finally, we have the non-key attributes (ProductName) depending on ProductID, and CustomerID dependening on OrderID, with this we ’ve eliminated all unnecessary replication of data table. The 2 tables are now in the second normal form, or 2NF.
Third Normal Form(3NF)
For a relation (table) in a database to be in the third normal form, it must already be in the second normal form (2NF) and no non-key attribute is transitively dependent on the candidate key.
Transitive dependency occurs when a non-key attribute depends on another non-key attribute, which itself depends on the candidate key. In other words, the value of a non-key attribute is indirectly determined by the candidate key through another non-key attribute. Let’s take a look at this Employees table which is already in 2NF.
In this example DepartmentName depends on DepartmentID, and DepartmentID depends on EmployeeID. This is a case of transitive dependency.
To achieve 3NF, we want to remove the transitive dependencies by creating separate tables. Let’s use the “Employees” example to illustrate this process.
To remove the transitive dependency and achieve 3NF, we create separate tables for employees and departments.
Now, DepartmentName is no longer transitively dependent on EmployeeID. Instead, it is directly dependent on DepartmentID.
By splitting the original table into two normalized tables, we’ve achieved 3NF. Each table now represents a distinct entity, and the transitive dependency has been eliminated, ensuring data integrity and reducing redundancy.
You might want to ask : “what’s the difference between transitive and partial dependency, from your explanation they are almost same thing”
The key difference be is this;
Partial dependency occurs when a non-key attribute depends on only part of the candidate key. In other words, a non-key attribute depends on only some, but not all, attributes of the candidate key, the candidate key here is most times a composite primary key. This leads to redundancy and is typically resolved by decomposing the table into smaller tables.
For example, in a table with attributes (A, B, C), if B is part of the candidate key but non-key attribute C depends only on A, then it’s a case of partial dependency.
Transitive dependency, on the other hand, occurs when a non-key attribute depends on another non-key attribute, which itself depends on the candidate key. In transitive dependency, the non-key attribute is indirectly determined by the candidate key through an intermediate non-key attribute.
For example, in a table with attributes (A, B, C, D), if non-key attribute D depends on B, and B depends on the candidate key (A), then it’s a case of transitive dependency.
Difference:
The key difference lies in the nature of the dependency chain:
Partial Dependency involves a direct dependency of a non-key attribute on part of the candidate key.
Transitive Dependency involves an indirect dependency of a non-key attribute on the candidate key through an intermediate non-key attribute.
In practice, both partial and transitive dependencies violate higher normal forms (2NF, 3NF) because they lead to redundancy and anomalies. They are resolved by breaking down the original table into smaller, normalized tables to eliminate these dependency issues and improve data integrity.
Top comments (0)