ELI5: Database Normalization?

#explainlikeimfive #database #design #beginners

Dear fellow devs 👋🏻,

🤓 I just started learning about databases, took the mode SQL tutorial and am taking a course on Fullstack Web Development.

👩🏻‍🎓 At some point, I will take a deep dive into database design, but for now, can you help me get a general idea of database normalization?

🤔What does it mean when a table is in first, second, or third normal form (1NF, 2NF, 3NF)?

Thanks,
Kat 🐱🐾

Top comments (5)

Abdur Rehman Khalid • Dec 7 '19

It is a topic that should be covered in detail and I am sure that there are many resources available online from where you can learn about different types of normalization and keep one more thing in mind that we do normalization till 3NF. When I was studying the Database in my 5th Semester our teacher told us the following rules that helped us to understand the concept of normalization very well. And those rules are as following:

Keep Data According to the Table. e.g. If you have created a table named "Students" then you have to put data of students only not of "Teachers" and "Subjects".
Make Every Row Distinct by the Means of Primary Key or Composite Primary Key.
Always Make a Third Table for the Purpose of Many-Many Relationship.
Before making any database make a clear and effective Entity Relationship Diagram that should be first step as it will make the creation of database simple and effective.

Kat 🐆🐾 • Dec 8 '19

Thank you, that's really helpful. Also good to know that you validate my believe that I should take a deep dive into this topic!

Kasey Speakman • Dec 10 '19 • Edited

1NF just sets a few basic standards. Making sure a relation (row) is uniquely identifiable. And that an attribute (column) only contains a single value. For example, do not put 2 phone numbers in the Phone attribute. Also do not make Phone1, Phone2 attributes. Each phone number should be in its own row, along with a copy of the other data on the row. 1NF allows duplication of data.

2NF, 3NF, BCNF progressively add rules to eliminate different kinds of duplication.

Going up to BCNF (which is a slight refinement on 3NF), you usually end up with a set of tables with no data duplication. That means that data only has to be written in a single place. So if I need to update Sue's name to Susan, in a 3NF set of tables, I only have to change it in one place. But in a 1NF table, I might have to find every row where I stored Sue's name and update all of them.

But the downside of being in a higher normalized form like 3NF is that reads are slower and potentially more complex to implement. Since Sue's name is only in one place, when I am querying a table that references Sue's ID but I need the name for display purpose, I have to go find Sue's name in that other table. (A JOIN in SQL.) Whereas a set of tables in 1NF (but not 2NF or higher) will have Sue's name on each row and so it won't have to be separately looked up and joined in.

Also note that highly normalized tables can be a real pain to work with in programming languages. Chopping up a nested data structure into it's constituent elements and issuing SQL statements to save each of those pieces to tables, and then reading all the pieces and reconstituting them back into a programming data structure. That is very tedious work. In these cases where I need to load and save a nested data structure as a single unit I will instead opt to serialize them to something like JSON and save as text or binary into a single column. A common alternative is to use an ORM (object-relational mapper) framework to automagically do that chopping up and reconstituting for you. However, ORMs are complex beasts which have many pitfalls themselves. These are often not discovered until later, when it is very hard to pull the ORM out of the code base.

I just wanted to point those couple of things out to temper the common conclusion that normalized is the only right way to store data. There is no single correct way for every occasion. Although 3NF is a solid starting point for small applications.

Rolf Streefkerk • Jan 21 '20

Perhaps a bit outside of the scope of this discussion, but the reason for data normalization and SQL is because of an outdated paradigm. Data storage used to be really expensive and storing redundant information is therefore cost prohibitive. These days, storage is cheap and we're CPU bound more so than disk space bound.

To cut a long story short some data redundancy isn't a bad thing and in fact if you look at NoSQL engines they actually prefer data redundancy as a means to increased query performance.

The answer here is, it depends how relevant normalization is with respect to its context.