Effective Database Design: Part 2

#database

In the last post, we walked through the importance of proper database design, and learned about normalization - a process we can follow to ensure data integrity in our databases.

Normalization is achieved by making sure your database meets a number of normal forms. Each normal form builds upon the last, and each one makes your data a little better. In this post, we'll discuss how to achieve the first normal form (1NF).

First Normal Form

To ensure that a table meets 1NF, we need three things:

Each record is represented by a unique primary key.
Each column has only atomic (single) values.
There are no repeating groups.

I'd like to explain each bullet by starting with a table that is not well designed, and explain three steps we can take to achieve 1NF. Let's create a table that records a student, their year, and the classes they're taking.

name	year	classes
Adam	Freshman	Math, Programming
Prince	Freshman	History

Primary Key

Each record in a database table should have a unique identifier called a primary key. In many cases, this will be a number. In some cases, though, a string could suffice (think of a user name, or social security number - these should be unique among people).

A primary key is often one column, but does not have to be so. Two columns together could be the unique identifier for a record.

Our table so far has no unique identifier in any way. A student's name is not unique, nor is the combination of (name, year), nor is the combination of (name, year, classes). To resolve this, we can just add a unique id:

id	name	year	classes
1	Adam	Freshman	Math, Programming
2	Prince	Freshman	History

Atomic Values

When we talk about atomic values, what we mean is that it can not be divided any further. In plain English, this means we shouldn't have a list of values in a column. Thus, our classes column does not comply with 1NF, because we shouldn't have multiple values in one cell.

The quick solution to this, is to break them out into their own column.

id	name	year	class_one	class_two
1	Adam	Freshman	Math	Programming
2	Prince	Freshman	History	null

However, we've created a problem here.

Repeating Groups

A table should not have a repeating group of columns. What this means is, we should not have class_one, class_two, ..., class_n. The reason this is bad is because we would end up with a lot of null values for students who don't take N classes, and if there ended up being a student with more than N classes, we'd have to modify the table to add a new column.

The quick solution to this, is to treat class as one column and add a new row for each time a student takes a class:

name	year	class
Adam	Freshman	Math
Adam	Freshman	Programming
Prince	Freshman	History

Further Improvements

This design does comply with 1NF. This is, though, the exact same table we had in the first post, that I explained was not well designed due to data redundancy and integrity issues.

Another key concept of database design is ensuring each table only represents a specific entity. Here, we have the concepts of students and of classes together in one table. Let's break them out into separate tables:

A students table:

id	name	year
1	Adam	Freshman
2	Prince	Freshman

A classes table:

id	class_name
1	Math
2	Programming
3	History

The relationship between them can be captured separately, in a students_classes table:

student_id	class_id
1	1
1	2
2	3

In this table, the primary key is a composition of both (student_id, class_id) together.

I hope you found this helpful! The 1NF is an important step toward database normalization, and even making these few quick updates we have data that is far better than we started with. There's no redundant data, and we aren't at risk of any data anomalies.

In the next post we'll continue to add on to this database and make sure it complies with the 2NF.