coder7475

Posted on Aug 7

Database Normalization and Denormalization

#database #sql #programming #backend

Normalization and denormalization are cornerstone techniques in database design, each serving distinct purposes in balancing data integrity and performance. In critical systems like medical databases, where accurate patient records and rapid data access can impact lives, these approaches are vital. Using a medical system as an example, this blog explores normalization and denormalization through a Socratic lens, encouraging you to reason through their principles, applications, and trade-offs.

What Are Database Anomalies and Why Do They Matter?

Imagine a medical database where all patient, doctor, and appointment data is stored in a single table. If a patient updates their phone number, how many records must you change to keep the data consistent? If you delete a canceled appointment, what happens if it also removes the patient’s only record? These issues—update anomalies (inconsistent updates), delete anomalies (unintended data loss), and insert anomalies (inability to add data without related records)—can disrupt a hospital’s operations.

Normalization is designed to prevent these anomalies by organizing data systematically. Let’s explore how it works using a medical system with entities like Patients, Doctors, Appointments, Medical Records, and Medical Facilities.

Normalization: Structuring Data for Integrity

Normalization is like organizing a hospital’s records into a logical system where each piece of information has a unique place, much like a librarian cataloging books to avoid duplicates and ensure easy access. It involves breaking down a large table into smaller, related tables linked by keys, eliminating redundancy and preventing anomalies.

Normalization follows a series of rules called normal forms, each addressing specific data dependencies. Let’s apply these to a medical database.

Functional Dependencies: The Building Blocks

Before diving into normal forms, consider functional dependencies. If a PatientID uniquely determines a patient’s FirstName and LastName, written as PatientID → FirstName, LastName, what does this suggest about how we should store patient data? Functional dependencies ensure attributes are logically related, guiding the normalization process.

First Normal Form (1NF): The Foundation of Organization

A table is in 1NF if:

All attributes are atomic (indivisible).
Each column has a unique name.
All entries in a column are of the same data type.
Each row is unique, typically enforced by a primary key.

Consider a denormalized table in our medical system:

AppointmentID	PatientName	DoctorName	AppointmentDetails
1	John Doe	Alice Smith	2023-01-01, 10:00, Checkup
2	Jane Roe	Bob Johnson	2023-01-02, 11:00, Surgery

The AppointmentDetails column contains multiple values (date, time, type), violating 1NF.To achieve 1NF, we split the data:

AppointmentID	PatientName	DoctorName	AppointmentDate	AppointmentTime	AppointmentType
1	John Doe	Alice Smith	2023-01-01	10:00	Checkup
2	Jane Roe	Bob Johnson	2023-01-02	11:00	Surgery

Now, each cell is atomic, and the table is in 1NF. Next up is 2NF.

Second Normal Form (2NF): Eliminating Partial Dependencies

A table is in 2NF if:

It is in 1NF.
All non-key attributes fully depend on the entire primary key, avoiding partial dependencies.

Suppose the primary key is (AppointmentID, PatientID):

AppointmentID	PatientID	PatientName	DoctorName	AppointmentDate	AppointmentTime	AppointmentType
1	1	John Doe	Alice Smith	2023-01-01	10:00	Checkup
2	2	Jane Roe	Bob Johnson	2023-01-02	11:00	Surgery

Here, PatientName depends only on PatientID, not the full key (AppointmentID, PatientID). To achieve 2NF, we split the table:

Patients Table:

PatientID	FirstName	LastName
1	John	Doe
2	Jane	Roe

Appointments Table:

AppointmentID	PatientID	DoctorName	AppointmentDate	AppointmentTime	AppointmentType
1	1	Alice Smith	2023-01-01	10:00	Checkup
2	2	Bob Johnson	2023-01-02	11:00	Surgery

Now, PatientName depends only on PatientID. Now that we've eliminated partial dependencies with 2NF, the next concern is transitive dependencies—where non-key fields depend on other non-key fields

Third Normal Form (3NF): Removing Transitive Dependencies

A table is in 3NF if:

It is in 2NF.
No non-key attribute depends on another non-key attribute (transitive dependency).

Let’s add doctor specialization to the Appointments table:

AppointmentID	PatientID	DoctorName	DoctorSpecialization	AppointmentDate	AppointmentTime	AppointmentType
1	1	Alice Smith	Cardiology	2023-01-01	10:00	Checkup
2	2	Bob Johnson	Neurology	2023-01-02	11:00	Surgery

Here, DoctorSpecialization depends on DoctorName, not the primary key (AppointmentID, PatientID). This is a transitive dependency.To achieve 3NF, we create a Doctors table:

Doctors Table:

DoctorID	FirstName	LastName	Specialization
1	Alice	Smith	Cardiology
2	Bob	Johnson	Neurology

Appointments Table (Updated):

AppointmentID	PatientID	DoctorID	AppointmentDate	AppointmentTime	AppointmentType
1	1	1	2023-01-01	10:00	Checkup
2	2	2	2023-01-02	11:00	Surgery

Now, Specialization depends on DoctorID, eliminating the transitive dependency.

Boyce-Codd Normal Form (BCNF): The Strictest Form

A table is in BCNF if it is in 3NF and every determinant (an attribute that determines another) is a candidate key. Consider a table for doctor schedules:

DoctorID	FacilityID	FacilityName	ScheduleDay
1	1	City Hospital	Monday
2	2	Clinic North	Tuesday

If FacilityName determines FacilityID, this violates BCNF because FacilityName is not a candidate key. To achieve BCNF, we split the table:

Facilities Table:

FacilityID	FacilityName
1	City Hospital
2	Clinic North

DoctorSchedules Table:

DoctorID	FacilityID	ScheduleDay
1	1	Monday
2	2	Tuesday

Denormalization: Prioritizing Performance

Normalization ensures data integrity but can slow queries requiring multiple joins, such as retrieving a patient’s appointment history with doctor and facility details. In a hospital, why might quick access to such data be essential, like for an emergency room dashboard? Denormalization introduces redundancy to optimize query performance.

When to Denormalize

Denormalization is useful for:

Read-Heavy Operations: Dashboards showing appointment schedules.
Complex Queries: Combining patient, doctor, and medical record data.
Precomputed Results: Storing aggregates, like the number of appointments per doctor.

For example, a normalized query for appointment details might be:

SELECT a.AppointmentID, p.FirstName AS PatientFirst, p.LastName AS PatientLast,
       d.FirstName AS DoctorFirst, d.LastName AS DoctorLast, a.AppointmentDate
FROM Appointments a
JOIN Patients p ON a.PatientID = p.PatientID
JOIN Doctors d ON a.DoctorID = d.DoctorID;

This query is slow for large datasets. A denormalized table might look like:

AppointmentID	PatientFirst	PatientLast	DoctorFirst	DoctorLast	AppointmentDate
1	John	Doe	Alice	Smith	2023-01-01
2	Jane	Roe	Bob	Johnson	2023-01-02

A simple SELECT * FROM AppointmentDetails is faster.

Denormalization Pros and Cons

Aspect	Pros	Cons
Performance	Faster reads due to fewer joins	Slower writes due to redundancy
Query Simplicity	Simpler SELECT statements	Complex updates to maintain consistency
Storage	-	Increased storage needs
Data Integrity	-	Risk of inconsistencies

Applying Normalization and Denormalization to a Medical System

Consider a medical database with:

Patients: PatientID, FirstName, LastName, DateOfBirth, ContactNumber, Email
Doctors: DoctorID, FirstName, LastName, Specialization, ContactNumber
Appointments: AppointmentID, PatientID, DoctorID, AppointmentDate, AppointmentTime, Status
Medical Records: RecordID, AppointmentID, Diagnosis, Prescription, TestResults
Facilities: FacilityID, FacilityName, Location
DoctorFacilities: Junction table for many-to-many relationships between doctors and facilities.

Normalized Design

The database is split into tables linked by foreign keys, ensuring integrity. For example, Appointments references PatientID and DoctorID.

Denormalized Design

For a dashboard showing appointment schedules, a materialized view might combine data:

CREATE MATERIALIZED VIEW AppointmentSummary AS
SELECT a.AppointmentID, p.FirstName AS PatientFirst, p.LastName AS PatientLast,
       d.FirstName AS DoctorFirst, d.LastName AS DoctorLast, d.Specialization,
       a.AppointmentDate, a.AppointmentTime
FROM Appointments a
JOIN Patients p ON a.PatientID = p.PatientID
JOIN Doctors d ON a.DoctorID = d.DoctorID;

This view speeds up queries but requires periodic refreshes to stay consistent.

Enforcing Integrity with Constraints

In a medical system, constraints ensure data reliability:

Primary Key: PatientID ensures unique patient records.
Foreign Key: PatientID in Appointments references Patients.
Unique Constraint: Email in Patients prevents duplicates.
Check Constraint: AppointmentDate ensures future dates.

Conclusion

Normalization and denormalization are complementary strategies in database design. Normalization builds a robust, anomaly-free structure, ideal for transactional medical systems where accuracy is paramount. Denormalization enhances performance for read-heavy tasks, like generating appointment reports. By understanding and balancing these approaches, you can design a medical database that is both reliable and efficient.

DEV Community