Heyπ SQL enthusiast
This is a guide on Database Normalization and how to implement normalization using SQL, let's get started.
What is Database Normalization
Normalization is a technique used for evaluating and reorganizing table structures to minimize redundancies, improve data integrity, improve storage efficiency, and reduce the need to re-design the database if new data is introduced. Database normalization removes data inconsistencies which may cause the analysis of our data to be more complicated. These inconsistencies could come from updating records, inserting, or deleting records, which is known as update anomaly, insertion, and deletion anomaly, normalization also includes the removal of duplicate records which in turn saves storage.
The figure below shows the stages of Database normalization.
Let's take a look at the first normal form, what does it take for a database table to be in the first normal form. Here are the rules for a table to be in first normal form (1NF)
- Each column should contain an atomic value
- A column should contain data of the same data type eg (if a column contains an integer all the other elements in the column should be an integer)
- Each column should have a unique column name
- The order in which data are stored in the columns does not matter.
let's take a look at a table to better understand the concept of 1NF
As you can see from the above table there are columns with duplicate elements in the database table, from the last row, the columns are (student major Std_major
, courses title CTitle
, and Instructor location Ilocation
).
to solve this problem we are going to create a new table and insert only one atomic value into every column in the table.
Here is the code used to create the table and insert the element into the table from the original table.
--creating a new table for the schools Record 1NF
CREATE TABLE Schools_Record_1NF (Std_ID VARCHAR(50) NOT NULL,
Std_Name VARCHAR(50),
Std_address VARCHAR(50),
Std_Major VARCHAR(50),
CID VARCHAR NOT NULL,
CTitle VARCHAR(50),
Iname VARCHAR(50),
Ilocation VARCHAR(100),
Grade CHAR(5))
--Inserting into the Schools Record 1NF table
INSERT INTO Schools_Record_1NF(Std_ID, Std_Name,
Std_address, Std_Major,
CID, CTitle, Iname,
Ilocation, Grade)
--Using Case statement to deal with columns with duplicate entries
SELECT Std_ID, Std_Name, Std_address,
-- The CASE Statement here checks the first entry in the column Using (SUBSTR AND INSTR)
CASE WHEN
SUBSTR(Std_Major, 1, INSTR(Std_Major, ',')-1)='' THEN Std_Major
ELSE SUBSTR(Std_Major,1,INSTR(Std_Major, ',')-1)
END AS Std_Major,
CASE WHEN
SUBSTR(CID,1,INSTR(CID, ' ')-1)='' THEN CID
ELSE SUBSTR(CID,1,INSTR(CID, ' ')-1)
END AS CID,
CASE WHEN
SUBSTR(CTitle,1,INSTR(CTitle,',')-1)='' THEN CTitle
ELSE SUBSTR(CTitle,1,INSTR(CTitle,',')-1)
END AS CTitle,
CASE WHEN
SUBSTR(Iname,1,INSTR(Iname,',')-1)='' THEN Iname
ELSE SUBSTR(Iname,1,INSTR(Iname,',')-1)
END AS Iname,
CASE WHEN
SUBSTR(Ilocation,1,INSTR(Ilocation, ',')-1)='' THEN Ilocation
ELSE SUBSTR(Ilocation,1,INSTR(Ilocation, ',')-1)
END AS Ilocation,
CASE WHEN
SUBSTR(Grade,1,INSTR(Grade, ',')-1)='' THEN Grade
ELSE SUBSTR(Grade,1,INSTR(Grade, ',')-1)
END AS Grade
FROM School_Record
-- Using the union operator to join the first and the second entry of the table
UNION
SELECT Std_ID, Std_Name, Std_address,
-- Using SUBSTR and INSTR to access the second entry inside the columns
SUBSTR(Std_Major,INSTR(Std_Major,',')+1) As Std_Major,
SUBSTR(CID, INSTR(CID,' ')+1) AS CID,
SUBSTR(CTitle, INSTR(CTitle, ',')+1) AS CTitle,
SUBSTR(Iname, INSTR(Iname, ',')+1) AS Iname,
SUBSTR(Ilocation, INSTR(Ilocation, ',')+1) AS Ilocation,
SUBSTR(Grade, INSTR(Grade, ',')+1) AS Grade
FROM School_Record
-- Checking the columns with multiple entries
WHERE Std_Major LIKE "%,%" or CID LIKE "% %"
Here is the new table after applying the above code
one new row is added to the table rather than having columns with non atomic values.
Next up let's move into the second normal form (2NF)
The Second Normal Form (2NF)
The above table is not in 2NF because not all the columns in the table are depending on the primary keys of the table. e.g. columns like the student name Std_name
only depends on the student Id Std_ID
column which is one part of the primary key since both Std_ID
and CID
are the primary keys in this case and also courses title only depends on courses Id CID
without depending on student Id Std_ID
which means we are having a partial functional dependency.
To Convert a table to Second Normal Form
The table must be in 1NF.
- A table in 2NF should not have a partial functional dependency.
What is a partial functional dependency?
let's understand Functional dependency first
Dependency in a table is when the columns in the table are depending on a particular column like the primary key of a table.
Partial functional dependency is when a particular column is not full depending on all the primary key columns of the table (in this case our primary key is more than one column).
To implement the second normal form 2NF we will start by creating new tables one for the student information, instructor information, and another table for the student grade information.
--Creating the Tables for the Second Normal Form 2NF
CREATE TABLE Student_Info_2NF (
Std_ID VARCHAR(50) NOT NULL,
Std_Name VARCHAR(50),
Std_address VARCHAR(50),
Std_Major VARCHAR(50),
PRIMARY KEY(Std_ID));
CREATE TABLE Instructor_Info_2NF(
CID VARCHAR NOT NULL,
CTitle VARCHAR(50),
Iname VARCHAR(50),
Ilocation VARCHAR(100),
PRIMARY KEY(CID));
CREATE TABLE Grade_Info_2NF(
Std_ID VARCHAR(50) NOT NULL,
CID VARCHAR NOT NULL,
Grade CHAR(5),
PRIMARY KEY(Std_ID, CID),
FOREIGN KEY(Std_ID) REFERENCES Student_Info_2NF(Std_ID) ,
FOREIGN KEY(CID) REFERENCES Instructor_Info_2NF(CID));
students grade information table
Next we will be moving on to the 3NF
Third Normal Form 3NF
For Relation to be in 3NF the Relation First must be in 1NF and 2NF
- They should be no transitive dependencies
Well, transitive dependency simply means that all other columns in the table should only depend on the PRIMARY KEY Column only and primary key alone.
Based on the tables we currently have,
there is only one table that is showing an issue which is the instructor information table because the Primary key here is the course id (CID) and the instructor location can be determined by the CID
and it can also be determined by the instructor name Iname
column, so to solve this problem we are going to create a new table for the Instructor information and another table with the courses information.
here is the code to create the table and insert the data into the table from the Instructor information table above.
CREATE TABLE Instructor_Info_3NF (
Iname VARCHAR(50) NOT NULL,
Ilocation VARCHAR(100),
PRIMARY KEY (Iname));
CREATE TABLE Courses_Info_3NF(
CID VARCHAR NOT NULL,
CTitle VARCHAR(50),
Iname VARCHAR(50),
PRIMARY KEY(CID)
FOREIGN KEY(Iname) REFERENCES Instructor_Info_3NF(Iname));
--Inserting into the Instructor_Info_3NF
INSERT INTO Instructor_Info_3NF(Iname, Ilocation)
SELECT Iname, Ilocation
FROM Instructor_Info_2NF;
--Inserting into the Courses_Info_3NF
INSERT INTO Courses_Info_3NF(CID, CTitle, Iname )
SELECT CID, CTitle, Iname
FROM Instructor_Info_2NF;
Here are the new tables
Instructor information table 3NF
We are going to stop on the 3NF
Conclusion
We have now completed the normalization process, In general most Databases are in 3NF, in this post you've understood what normalization is and also learned how to implement normalization using SQL.
Hope this blog helps you understand Database normalization. If you enjoy my blog please don't forget to like or comment on the article. It will encourage me to write more articles.
Top comments (2)
Interesting stuff, takes me back to collage.
I would've added the sql code after the logical explanation.
Thank you @detzam for the comment , I will make sure I do that next time.