Table of Contents
- Introduction: What is Data Modeling Really?
- The Three Levels of Data Modeling
- Conceptual Data Modeling
- Logical Data Modeling
- Physical Data Modeling
- Entity-Relationship Modeling: The Foundation
- Entities, Attributes, and Relationships
- Cardinality and Participation
- Keys: The Anchors of Your Data
- Primary Keys, Foreign Keys, and Composite Keys
- Surrogate vs Natural Keys
- Normalization: The Science of Organization
- First Normal Form through BCNF
- When to Normalize and When to Stop
- Denormalization: Breaking the Rules Intelligently
- Dimensional Modeling: The Analytics Powerhouse
- Facts and Dimensions
- Conformed Dimensions
- Slowly Changing Dimensions
- Junk Dimensions and Degenerate Dimensions
- Schema Patterns: Star, Snowflake, and Galaxy
- Data Vault: The Agile Approach
- Advanced Concepts
- Bridge Tables and Many-to-Many Relationships
- Hierarchies and Recursive Relationships
- Temporal Modeling
- Data Modeling for Modern Systems
- NoSQL Data Modeling
- Data Lake Modeling
- Real-time Analytics Modeling
- Best Practices and Common Pitfalls
- Conclusion: Choosing Your Path
Introduction: What is Data Modeling Really?
Raj (software developer, settling down with his chai): Priya, I've been coding for 4 years now, and honestly, I've realized I don't really understand data modeling. I mean, I create tables, write queries, but when architects talk about "conceptual models" and "dimensional modeling," I'm completely lost.
Priya (data architect, smiling): You know Raj, you're not alone. Most developers learn SQL and think that's data modeling. But that's like learning to write sentences and thinking you know literature. Data modeling is much deeper—it's about understanding how to represent the real world in a database.
Raj: Real world? I just store customer names and order details!
Priya: Exactly! And that seemingly simple task has so many questions: Should you store customer addresses as one field or multiple? What happens when a customer has multiple addresses? What if a product's price changes—do you update it everywhere or keep history? What if your queries are slow—do you duplicate data? These are all data modeling decisions.
Raj: Okay, now I'm interested. Where do we even start?
Priya: Let's start from the very beginning—the three levels of data modeling. Think of it like building a house. First you have an idea, then you make blueprints, then you actually construct it.
The Three Levels of Data Modeling
Priya: Data modeling happens at three distinct levels, and most people confuse them. Each level serves a different purpose and audience.
Level 1: Conceptual Data Model
Priya: This is the high-level view. It's what you discuss with business people who don't know or care about databases. You're just identifying the main entities and how they relate.
Raj: Can you give me an example?
Priya: Sure! Let's say you're building a university system. Your conceptual model might look like this:
CONCEPTUAL MODEL - University System
Entities:
- Student
- Course
- Professor
- Department
- Enrollment
Relationships:
- Students ENROLL IN Courses
- Professors TEACH Courses
- Departments OFFER Courses
- Students BELONG TO Departments
- Professors WORK IN Departments
Raj: This is just boxes and lines, no?
Priya: Exactly! No data types, no primary keys, no technical details. Just business concepts. This is what you show to the Dean or HOD. They can validate: "Yes, students enroll in courses, professors teach courses"—that makes sense to them.
Raj: But this can't be turned into a database directly!
Priya: Correct! That's why we need the next level.
Level 2: Logical Data Model
Priya: This is where we add more structure, but still stay database-independent. We define attributes, keys, and relationships more precisely.
LOGICAL MODEL - University System
STUDENT
- StudentID (PK)
- FirstName
- LastName
- DateOfBirth
- EmailAddress
- PhoneNumber
- DepartmentID (FK)
COURSE
- CourseID (PK)
- CourseName
- CourseCode
- Credits
- DepartmentID (FK)
PROFESSOR
- ProfessorID (PK)
- FirstName
- LastName
- EmailAddress
- DepartmentID (FK)
DEPARTMENT
- DepartmentID (PK)
- DepartmentName
- BuildingLocation
- HeadOfDepartment
ENROLLMENT
- EnrollmentID (PK)
- StudentID (FK)
- CourseID (FK)
- EnrollmentDate
- Grade
- Semester
Raj: Now this looks more like a database! But you said database-independent?
Priya: Yes! Notice I haven't specified whether EmailAddress is VARCHAR(100) or TEXT. I haven't said whether this is Oracle, MySQL, or PostgreSQL. I haven't defined indexes or partitions. This is pure logical structure—what data you need and how it relates.
Raj: So when does it become actual database code?
Level 3: Physical Data Model
Priya: This is the final level—the actual implementation in your specific database system.
-- PHYSICAL MODEL - PostgreSQL Implementation
CREATE TABLE department (
department_id SERIAL PRIMARY KEY,
department_name VARCHAR(100) NOT NULL UNIQUE,
building_location VARCHAR(100),
head_of_department INTEGER,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_dept_name ON department(department_name);
CREATE TABLE student (
student_id SERIAL PRIMARY KEY,
first_name VARCHAR(50) NOT NULL,
last_name VARCHAR(50) NOT NULL,
date_of_birth DATE NOT NULL,
email_address VARCHAR(100) UNIQUE NOT NULL,
phone_number VARCHAR(15),
department_id INTEGER REFERENCES department(department_id),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_student_email ON student(email_address);
CREATE INDEX idx_student_dept ON student(department_id);
-- Partitioning for large enrollment table
CREATE TABLE enrollment (
enrollment_id SERIAL,
student_id INTEGER NOT NULL REFERENCES student(student_id),
course_id INTEGER NOT NULL REFERENCES course(course_id),
enrollment_date DATE NOT NULL,
grade VARCHAR(2),
semester VARCHAR(20) NOT NULL,
PRIMARY KEY (enrollment_id, semester)
) PARTITION BY LIST (semester);
CREATE TABLE enrollment_2024_spring PARTITION OF enrollment
FOR VALUES IN ('2024-Spring');
CREATE TABLE enrollment_2024_fall PARTITION OF enrollment
FOR VALUES IN ('2024-Fall');
Raj: Ah! Now we have data types, indexes, timestamps, even partitioning!
Priya: Exactly! The physical model includes:
- Specific data types (VARCHAR, INTEGER, DATE)
- Indexes for performance
- Partitioning strategy
- Default values and constraints
- Timestamp columns for audit
- Naming conventions (snake_case for PostgreSQL)
Raj: So I need all three levels?
Priya: Depends on the project size. For a small internal tool, you might skip the conceptual and go straight to logical. For enterprise systems, all three are important. The conceptual model is your contract with business, the logical is your design document, and the physical is your implementation.
Entity-Relationship Modeling: The Foundation
Raj: You keep using terms like entity, relationship, cardinality. Can we dig deeper into these?
Priya: Absolutely! Entity-Relationship (ER) modeling is the foundation of relational database design. It was created by Peter Chen in 1976, and it's still relevant today.
Understanding Entities
Priya: An entity is simply a "thing" in your business domain that you want to store information about. It can be:
Tangible entities: Things you can touch
- Student, Professor, Book, Product, Vehicle
Intangible entities: Concepts or events
- Enrollment, Order, Transaction, Appointment
Raj: So basically, anything that becomes a table?
Priya: Mostly, yes! But here's the key—an entity must have:
- Multiple instances (if you only have one, it's just data)
- Unique identification (each instance must be uniquely identifiable)
- Attributes (properties that describe it)
Attributes: The Properties
Priya: Attributes describe your entity. But not all attributes are equal!
STUDENT Entity
Simple Attributes:
- StudentID
- FirstName
- DateOfBirth
Composite Attributes:
- Address
├── Street
├── City
├── State
└── Pincode
Derived Attributes:
- Age (derived from DateOfBirth)
- FullName (derived from FirstName + LastName)
Multi-valued Attributes:
- PhoneNumbers (a student can have multiple)
- EmailAddresses (personal, university, alternate)
Raj: Should I store Age in the database if I can calculate it?
Priya: Generally, no! That's a derived attribute. Store DateOfBirth and calculate Age when needed. Otherwise, you need to update Age every year, which is redundant and error-prone. But there are exceptions—if calculating Age is expensive and you need it frequently, you might store it and recalculate periodically.
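A minimal sketch of the "calculate when needed" approach, using the student table from the physical model shown earlier:
-- Age computed at query time from date_of_birth (never stored)
SELECT
    student_id,
    first_name,
    last_name,
    DATE_PART('year', AGE(CURRENT_DATE, date_of_birth)) AS age
FROM student;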
Raj: What about multi-valued attributes like multiple phone numbers?
Priya: Great question! You have two choices:
Option 1: Separate table (normalized)
STUDENT
- StudentID
- FirstName
- LastName
STUDENT_PHONE
- PhoneID
- StudentID (FK)
- PhoneNumber
- PhoneType (Mobile/Home/Work)
Option 2: Store as array or JSON (modern databases)
STUDENT
- StudentID
- FirstName
- LastName
- PhoneNumbers (JSONB in PostgreSQL)
Example: [
{"type": "mobile", "number": "9876543210"},
{"type": "home", "number": "02212345678"}
]
Priya: Option 1 is traditional and works everywhere. Option 2 is simpler to store but harder to query and index: a search like "find all students with a mobile number starting with 987" means unpacking the JSON instead of filtering a plain, indexed column.
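For instance, a sketch of that search against the Option 2 layout (assuming the phone_numbers JSONB column described above exists on student):
-- Unpack the JSONB array, then filter the elements
SELECT DISTINCT s.student_id, s.first_name
FROM student s,
     jsonb_array_elements(s.phone_numbers) AS p
WHERE p->>'type' = 'mobile'
  AND p->>'number' LIKE '987%';
With Option 1 the same question is a plain WHERE clause on STUDENT_PHONE.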
Relationships: How Entities Connect
Priya: Now comes the interesting part—relationships. There are three types based on cardinality.
One-to-One (1:1)
Priya: Each instance of Entity A relates to exactly one instance of Entity B, and vice versa.
Example 1: Student and LibraryCard
- Each student has exactly one library card
- Each library card belongs to exactly one student
STUDENT
- StudentID (PK)
- Name
- Email
LIBRARY_CARD
- CardID (PK)
- StudentID (FK, UNIQUE)
- IssueDate
- ExpiryDate
Raj: Why not just put CardID, IssueDate, and ExpiryDate in the STUDENT table?
Priya: You absolutely could! One-to-one relationships often mean you can merge the tables. But you might keep them separate if:
- Library card data is managed by a different department
- Not all students have library cards (nullable would make many NULL values)
- Library card has many attributes and you want to separate concerns
- Different security/access requirements
One-to-Many (1:N) - Most Common
Priya: One instance of Entity A relates to many instances of Entity B.
Example: Department and Professors
- One department has many professors
- Each professor belongs to one department
DEPARTMENT
- DepartmentID (PK)
- DepartmentName
PROFESSOR
- ProfessorID (PK)
- Name
- DepartmentID (FK)
Raj: This is straightforward. The foreign key goes on the "many" side.
Priya: Exactly! This is the most common relationship type. Think about:
- Customer → Orders (one customer, many orders)
- Order → OrderItems (one order, many items)
- Author → Books (one author, many books... wait!)
Raj: Wait, one author can write many books, but one book can also have many authors!
Many-to-Many (M:N)
Priya: Correct! That's a many-to-many relationship. You can't implement this directly with foreign keys. You need a junction table (also called bridge table, associative table, or linking table).
Example: Students and Courses
- One student enrolls in many courses
- One course has many students enrolled
STUDENT
- StudentID (PK)
- Name
ENROLLMENT (Junction)
- EnrollmentID (PK)
- StudentID (FK)
- CourseID (FK)
- EnrollmentDate
- Grade
- Semester
COURSE
- CourseID (PK)
- CourseName
- Credits
Composite Unique on ENROLLMENT: (StudentID, CourseID, Semester)
Raj: So the junction table breaks the many-to-many into two one-to-many relationships?
Priya: Perfect! And notice that the junction table can have its own attributes—EnrollmentDate, Grade, Semester. These are properties of the relationship itself, not of Student or Course individually.
Participation: Optional vs Mandatory
Priya: Another important concept is whether participation is mandatory or optional.
Example: Professor and Course
Case 1: Mandatory participation on both sides
- Every professor must teach at least one course (mandatory)
- Every course must have at least one professor (mandatory)
Case 2: Optional participation
- A professor might not be teaching any course this semester (optional)
- A course might be created but not yet assigned to a professor (optional)
Raj: How do I implement this?
Priya: Through constraints!
-- Optional: Professor may or may not teach courses
CREATE TABLE course (
course_id SERIAL PRIMARY KEY,
course_name VARCHAR(100),
professor_id INTEGER REFERENCES professor(professor_id) -- Can be NULL
);
-- Mandatory: Every course must have a professor
CREATE TABLE course (
course_id SERIAL PRIMARY KEY,
course_name VARCHAR(100),
professor_id INTEGER NOT NULL REFERENCES professor(professor_id) -- NOT NULL
);
-- Complex mandatory: Professor must teach at least one course
-- Requires CHECK constraint or trigger
CREATE TRIGGER enforce_professor_teaches_course
BEFORE DELETE ON course
FOR EACH ROW
EXECUTE FUNCTION check_professor_has_courses();
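The trigger above references a function that isn't defined anywhere in this example; a minimal sketch of what it could look like, assuming the rule is "don't delete a professor's last course":
CREATE OR REPLACE FUNCTION check_professor_has_courses()
RETURNS TRIGGER AS $$
BEGIN
    -- Block the delete if it would leave the professor with zero courses
    IF OLD.professor_id IS NOT NULL AND NOT EXISTS (
        SELECT 1 FROM course
        WHERE professor_id = OLD.professor_id
          AND course_id <> OLD.course_id
    ) THEN
        RAISE EXCEPTION 'Professor % must teach at least one course', OLD.professor_id;
    END IF;
    RETURN OLD;
END;
$$ LANGUAGE plpgsql;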
Keys: The Anchors of Your Data
Raj: You keep mentioning PK and FK. I know they mean Primary Key and Foreign Key, but what's the deep understanding?
Priya: Keys are fundamental to relational databases. Let me explain all types.
Primary Key (PK)
Priya: A primary key uniquely identifies each row in a table. It must be:
- Unique: No two rows can have the same primary key value
- Not null: Every row must have a primary key value
- Immutable: Should never change (ideally)
-- Simple primary key
CREATE TABLE student (
student_id INTEGER PRIMARY KEY, -- Single column PK
name VARCHAR(100)
);
-- Composite primary key
CREATE TABLE enrollment (
student_id INTEGER,
course_id INTEGER,
semester VARCHAR(20),
PRIMARY KEY (student_id, course_id, semester) -- Multiple columns
);
Foreign Key (FK)
Priya: A foreign key creates a link between two tables. It references a primary key in another table.
CREATE TABLE course (
course_id INTEGER PRIMARY KEY,
course_name VARCHAR(100),
department_id INTEGER,
FOREIGN KEY (department_id) REFERENCES department(department_id)
ON DELETE CASCADE -- If department deleted, delete courses
ON UPDATE CASCADE -- If department_id updated, update here too
);
Raj: What are these ON DELETE and ON UPDATE options?
Priya: Great question! You have several options:
CASCADE: If parent deleted/updated, delete/update children
SET NULL: If parent deleted/updated, set child FK to NULL
SET DEFAULT: Set to a default value
RESTRICT: Don't allow deletion/update if children exist
NO ACTION: Similar to RESTRICT but checked at end of transaction
Example scenarios:
-- Scenario 1: Orders and OrderItems
-- If order deleted, delete all its items (CASCADE)
CREATE TABLE order_item (
item_id INTEGER PRIMARY KEY,
order_id INTEGER,
product_name VARCHAR(100),
FOREIGN KEY (order_id) REFERENCES orders(order_id)
ON DELETE CASCADE
);
-- Scenario 2: Employee and Department
-- Don't allow deleting a department if it has employees (RESTRICT)
CREATE TABLE employee (
employee_id INTEGER PRIMARY KEY,
name VARCHAR(100),
department_id INTEGER,
FOREIGN KEY (department_id) REFERENCES department(department_id)
ON DELETE RESTRICT
);
-- Scenario 3: Product and Category
-- If category deleted, set products to "Uncategorized" (SET DEFAULT)
CREATE TABLE product (
product_id INTEGER PRIMARY KEY,
product_name VARCHAR(100),
category_id INTEGER DEFAULT 1, -- 1 is "Uncategorized"
FOREIGN KEY (category_id) REFERENCES category(category_id)
ON DELETE SET DEFAULT
);
Candidate Keys
Priya: A candidate key is any column (or combination) that could serve as a primary key—it's unique and not null.
STUDENT table:
- StudentID → Unique, not null ✓ (chosen as PK)
- EmailAddress → Unique, not null ✓ (candidate key)
- RollNumber → Unique, not null ✓ (candidate key)
You pick one as primary key, others become alternate keys.
Surrogate vs Natural Keys: The Great Debate
Raj: Should I use StudentID (which we generate) or RollNumber (which the university provides)?
Priya: This is the surrogate vs natural key debate!
Natural Key: Real-world identifier that has business meaning
Examples:
- PAN Card Number
- Aadhaar Number
- Email Address
- ISBN for books
- Phone Number
Surrogate Key: Artificial identifier with no business meaning
Examples:
- Auto-increment ID (1, 2, 3...)
- UUID (550e8400-e29b-41d4-a716-446655440000)
- Sequential BIGINT
The Comparison:
Aspect | Natural Key | Surrogate Key
--------------------|--------------------------------|---------------------------
Meaning | Has business meaning | Just a number
Stability | Can change (email, phone) | Never changes
Size | Often larger (strings) | Small (integer)
Performance | Slower joins (string compare) | Fast joins (integer)
Uniqueness | Might not be truly unique | Guaranteed unique
Dependency | Depends on external system | Internal control
Priya's Recommendation:
-- Use BOTH! Surrogate as PK, Natural as Unique Key
CREATE TABLE customer (
customer_id BIGSERIAL PRIMARY KEY, -- Surrogate (for joins)
email_address VARCHAR(255) UNIQUE NOT NULL, -- Natural (for lookups)
pan_number VARCHAR(10) UNIQUE, -- Natural (for business)
name VARCHAR(100),
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
-- Fast joins use customer_id
SELECT o.*
FROM orders o
JOIN customer c ON o.customer_id = c.customer_id;
-- Business lookups use email
SELECT * FROM customer WHERE email_address = 'raj@example.com';
Raj: So I get the best of both worlds!
Priya: Exactly! The surrogate key keeps your joins fast and stable, while the natural key provides business meaning and lookup capability.
Composite Keys
Priya: Sometimes you need multiple columns to uniquely identify a row.
-- Attendance system: Student + Date + Period together form unique identity
CREATE TABLE attendance (
student_id INTEGER,
attendance_date DATE,
period_number INTEGER,
status VARCHAR(10),
PRIMARY KEY (student_id, attendance_date, period_number)
);
-- Same student can have multiple attendance records
-- Same date can have multiple records
-- But the combination is unique
Raj: Should I avoid composite keys?
Priya: Not necessarily! They're natural for junction tables and when multiple columns genuinely form the identity. But be aware:
Pros:
- More meaningful
- Enforces business rules
- Less storage (no extra ID column)
Cons:
- Larger index size
- Slower joins
- Harder to reference from other tables
- All columns must be NOT NULL
My approach:
-- For junction tables, I often use both
CREATE TABLE enrollment (
enrollment_id BIGSERIAL PRIMARY KEY, -- For easy reference
student_id INTEGER NOT NULL,
course_id INTEGER NOT NULL,
semester VARCHAR(20) NOT NULL,
enrollment_date DATE,
grade VARCHAR(2),
UNIQUE (student_id, course_id, semester) -- Business uniqueness
);
Normalization: The Science of Organization
Raj: Okay, now let's talk about the famous normalization. I've heard of 1NF, 2NF, 3NF, but never really understood them.
Priya: Normalization is a systematic way of organizing data to reduce redundancy and improve data integrity. Let me explain with a real messy example that I see all the time.
The Unnormalized Mess
Priya: Imagine you're building an online bookstore. A beginner might create this table:
BOOK_ORDERS (Unnormalized - 0NF)
--------------------------------------------------------------------------------
OrderID | CustomerName | CustomerEmail | CustomerPhone | CustomerAddress | BookTitles | BookAuthors | BookPrices | Quantities
--------------------------------------------------------------------------------
1001 | Rahul Sharma | rahul@e.com | 9876543210 | 123 MG Road, Mumbai, MH 400001 | The Alchemist, 1984 | Paulo Coelho, George Orwell | 350, 450 | 1, 2
1002 | Priya Patel | priya@e.com | 9876543211 | 456 SG Highway, Ahmedabad, GJ | The Alchemist | Paulo Coelho | 350 | 1
1003 | Rahul Sharma | rahul@e.com | 9876543210 | 123 MG Road, Mumbai, MH 400001 | Atomic Habits | James Clear | 599 | 1
Raj: Oh god, this is horrible! Multiple books in one cell, separated by commas!
Priya: Exactly! This violates even the most basic principles. Let's fix it step by step.
First Normal Form (1NF): Atomic Values
Rule: Each cell must contain only atomic (indivisible) values. No arrays, no comma-separated values.
BOOK_ORDERS (1NF Applied)
--------------------------------------------------------------------------------
OrderID | CustomerName | CustomerEmail | CustomerPhone | CustomerAddress | BookTitle | BookAuthor | BookPrice | Quantity
--------------------------------------------------------------------------------
1001 | Rahul Sharma | rahul@e.com | 9876543210 | 123 MG Road, Mumbai, MH 400001 | The Alchemist | Paulo Coelho | 350 | 1
1001 | Rahul Sharma | rahul@e.com | 9876543210 | 123 MG Road, Mumbai, MH 400001 | 1984 | George Orwell | 450 | 2
1002 | Priya Patel | priya@e.com | 9876543211 | 456 SG Highway, Ahmedabad, GJ | The Alchemist | Paulo Coelho | 350 | 1
1003 | Rahul Sharma | rahul@e.com | 9876543210 | 123 MG Road, Mumbai, MH 400001 | Atomic Habits | James Clear | 599 | 1
Raj: Better! But Rahul Sharma's information is repeated three times.
Priya: Correct! That's the next problem to solve.
Second Normal Form (2NF): Remove Partial Dependencies
Rule: Must be in 1NF, and all non-key columns must depend on the entire primary key, not just part of it.
Priya: In our current table, if we use (OrderID, BookTitle) as composite primary key, we have a problem:
- CustomerName depends only on OrderID (not on BookTitle)
- BookAuthor depends only on BookTitle (not on OrderID)
Solution: Split into separate tables
-- CUSTOMERS table
CREATE TABLE customers (
customer_id SERIAL PRIMARY KEY,
customer_name VARCHAR(100) NOT NULL,
customer_email VARCHAR(255) UNIQUE NOT NULL,
customer_phone VARCHAR(15),
customer_address TEXT
);
-- BOOKS table
CREATE TABLE books (
book_id SERIAL PRIMARY KEY,
book_title VARCHAR(255) NOT NULL,
book_author VARCHAR(100) NOT NULL,
book_price DECIMAL(10, 2) NOT NULL
);
-- ORDERS table
CREATE TABLE orders (
order_id SERIAL PRIMARY KEY,
customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
order_date TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
order_status VARCHAR(20)
);
-- ORDER_ITEMS table (junction table)
CREATE TABLE order_items (
order_item_id SERIAL PRIMARY KEY,
order_id INTEGER NOT NULL REFERENCES orders(order_id),
book_id INTEGER NOT NULL REFERENCES books(book_id),
quantity INTEGER NOT NULL,
unit_price DECIMAL(10, 2) NOT NULL -- Price at time of order
);
Raj: Now each piece of information is stored exactly once! But wait, why is book_price in both BOOKS and ORDER_ITEMS (as unit_price)?
Priya: Excellent observation! That's a business decision. The BOOKS table has the current price, but ORDER_ITEMS stores the historical price at which the customer bought it. Prices change over time, but you want to preserve what the customer actually paid.
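A sketch of how that snapshot happens at order time, copying the current catalog price into the order line (the order and book IDs here are placeholders):
-- Snapshot the current book_price into unit_price when the line is created
INSERT INTO order_items (order_id, book_id, quantity, unit_price)
SELECT 1001, b.book_id, 2, b.book_price
FROM books b
WHERE b.book_id = 42;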
Third Normal Form (3NF): Remove Transitive Dependencies
Rule: Must be in 2NF, and no non-key column should depend on another non-key column.
Priya: Look at our CUSTOMERS table. Is there a problem?
CUSTOMERS
- customer_id (PK)
- customer_name
- customer_email
- customer_phone
- customer_address (full address as text)
Raj: The address column seems messy. It contains street, city, state, pincode all together.
Priya: Right! But more importantly, if we split it:
CUSTOMERS (splitting address)
- customer_id (PK)
- customer_name
- customer_email
- customer_phone
- street
- city
- state
- pincode
Priya: Now there's a problem: state depends on city (transitive dependency). The state is determined by the city, not by the customer directly. This is a 3NF violation.
Solution:
-- CUSTOMERS table (3NF)
CREATE TABLE customers (
customer_id SERIAL PRIMARY KEY,
customer_name VARCHAR(100) NOT NULL,
customer_email VARCHAR(255) UNIQUE NOT NULL,
customer_phone VARCHAR(15),
street_address VARCHAR(255),
city_id INTEGER REFERENCES cities(city_id)
);
-- CITIES table
CREATE TABLE cities (
city_id SERIAL PRIMARY KEY,
city_name VARCHAR(100) NOT NULL,
state_id INTEGER REFERENCES states(state_id)
);
-- STATES table
CREATE TABLE states (
state_id SERIAL PRIMARY KEY,
state_name VARCHAR(100) NOT NULL,
state_code VARCHAR(2) NOT NULL UNIQUE
);
Raj: But yaar, this seems like overkill for just storing an address!
Priya: And you're absolutely right! This is where judgment comes in. In practice, for customer addresses, I usually do this:
-- Practical approach
CREATE TABLE customers (
customer_id SERIAL PRIMARY KEY,
customer_name VARCHAR(100) NOT NULL,
customer_email VARCHAR(255) UNIQUE NOT NULL,
customer_phone VARCHAR(15),
street_address VARCHAR(255),
city VARCHAR(100),
state VARCHAR(100),
pincode VARCHAR(10)
);
-- Separate reference tables for dropdowns/validation
CREATE TABLE ref_cities (
city_id SERIAL PRIMARY KEY,
city_name VARCHAR(100),
state_name VARCHAR(100),
is_active BOOLEAN DEFAULT TRUE
);
Priya: The customer table isn't strictly 3NF, but it's practical. The reference table is for application dropdowns and validation, not for foreign key constraints.
Boyce-Codd Normal Form (BCNF): The Stricter 3NF
Priya: BCNF is a slightly stricter version of 3NF. The rule is: For every functional dependency X → Y, X must be a candidate key.
Example where 3NF fails but BCNF catches it:
COURSE_INSTRUCTOR
- StudentID
- Course
- Instructor
Business rules:
- One student can take multiple courses
- One course can have multiple instructors
- Each instructor teaches only one course (specialization)
- A student can have the same instructor in different semesters
Priya: If we use (StudentID, Course) as primary key:
StudentID | Course           | Instructor
----------|------------------|-------------
S001      | Database         | Prof. Sharma
S001      | Data Science     | Prof. Patel
S002      | Database         | Prof. Sharma
S002      | Data Science     | Prof. Patel
Issue: Instructor → Course (Instructor determines Course), but Instructor is not a candidate key. This violates BCNF.
Solution:
CREATE TABLE instructors (
instructor_id SERIAL PRIMARY KEY,
instructor_name VARCHAR(100),
specialization_course VARCHAR(100) NOT NULL -- Each instructor teaches exactly one course (a course can still have many instructors)
);
CREATE TABLE student_enrollment (
enrollment_id SERIAL PRIMARY KEY,
student_id INTEGER REFERENCES students(student_id),
instructor_id INTEGER REFERENCES instructors(instructor_id),
enrollment_date DATE,
UNIQUE(student_id, instructor_id)
);
Raj: Honestly, BCNF seems very theoretical. Do people actually use it?
Priya: Rarely in practice! Most developers stop at 3NF. BCNF is important for database theory but can make your schema overly complex. Focus on 3NF, and you'll be fine for 95% of cases.
When to Stop Normalizing
Raj: So should I always normalize to 3NF?
Priya: Great question! Normalization is a tool, not a religion. Here's when to stop:
Normalize to 3NF when:
- Building OLTP systems (transaction processing)
- Data integrity is critical
- You have frequent updates
- Storage space is a concern (though less relevant today)
Don't over-normalize when:
- You need reporting performance
- Queries become too complex
- You're doing analytics/data warehousing
- You're dealing with immutable data (events, logs)
Example of over-normalization:
-- Over-normalized (bad for queries)
CREATE TABLE orders (order_id, customer_id, status_id, payment_method_id, ...);
CREATE TABLE order_status (status_id, status_name); -- Only 5 values!
CREATE TABLE payment_methods (payment_method_id, method_name); -- Only 8 values!
-- Better approach
CREATE TABLE orders (
order_id SERIAL PRIMARY KEY,
customer_id INTEGER REFERENCES customers(customer_id),
status VARCHAR(20) CHECK (status IN ('Pending','Confirmed','Shipped','Delivered','Cancelled')),
payment_method VARCHAR(20) CHECK (payment_method IN ('Card','UPI','NetBanking','COD','Wallet')),
...
);
Priya: Don't normalize small, stable lookup values. It just adds unnecessary joins.
Denormalization: Breaking the Rules Intelligently
Raj: So when exactly should I denormalize?
Priya: Denormalization is the intentional introduction of redundancy to improve read performance. But it comes with trade-offs.
Why Denormalize?
Scenario: E-commerce order summary page
Normalized approach (many joins):
SELECT
o.order_id,
c.customer_name,
c.customer_email,
o.order_date,
COUNT(oi.order_item_id) as total_items,
SUM(oi.quantity * oi.unit_price) as total_amount
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
JOIN order_items oi ON o.order_id = oi.order_id
GROUP BY o.order_id, c.customer_name, c.customer_email, o.order_date;
Denormalized approach:
CREATE TABLE order_summary (
order_id INTEGER PRIMARY KEY,
customer_name VARCHAR(100), -- Denormalized from customers
customer_email VARCHAR(255), -- Denormalized from customers
order_date DATE,
total_items INTEGER, -- Pre-calculated
total_amount DECIMAL(10,2) -- Pre-calculated
);
-- Simple query, no joins!
SELECT * FROM order_summary WHERE order_id = 1001;
Denormalization Strategies
Strategy 1: Store Calculated Values
-- Instead of calculating on every query
CREATE TABLE orders (
order_id SERIAL PRIMARY KEY,
customer_id INTEGER REFERENCES customers(customer_id),
order_date DATE,
-- Denormalized calculated fields
total_items INTEGER,
subtotal DECIMAL(10,2),
tax_amount DECIMAL(10,2),
discount_amount DECIMAL(10,2),
total_amount DECIMAL(10,2)
);
-- Update via trigger or application code
CREATE OR REPLACE FUNCTION update_order_totals()
RETURNS TRIGGER AS $$
DECLARE
    v_order_id INTEGER;
BEGIN
    -- NEW is not available on DELETE, so pick the right row variable
    IF TG_OP = 'DELETE' THEN
        v_order_id := OLD.order_id;
    ELSE
        v_order_id := NEW.order_id;
    END IF;
    UPDATE orders SET
        total_items = (SELECT COUNT(*) FROM order_items WHERE order_id = v_order_id),
        subtotal = (SELECT COALESCE(SUM(quantity * unit_price), 0) FROM order_items WHERE order_id = v_order_id)
    WHERE order_id = v_order_id;
    RETURN NULL; -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER order_items_changed
AFTER INSERT OR UPDATE OR DELETE ON order_items
FOR EACH ROW EXECUTE FUNCTION update_order_totals();
Strategy 2: Store Related Entity Names
-- Instead of always joining to get names
CREATE TABLE orders (
order_id SERIAL PRIMARY KEY,
customer_id INTEGER REFERENCES customers(customer_id),
customer_name VARCHAR(100), -- Denormalized
customer_email VARCHAR(255), -- Denormalized
shipping_address TEXT, -- Denormalized
order_date DATE
);
Raj: But what if customer changes their name?
Priya: That's the trade-off! You have three options:
Keep as is (temporal snapshot): The order shows the customer's name at the time of order. This is actually preferred for invoices and legal documents.
Update cascading: When customer name changes, update all orders. But this can be slow and defeats the purpose of denormalization.
Hybrid approach: Store both customer_id (for current data) and customer_name (for historical snapshot).
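If you do go with the cascading update (option 2), it is a single statement against the denormalized orders table shown above; a sketch (the customer_id is a placeholder):
-- Propagate a customer's new name and email into their historical orders
UPDATE orders o
SET customer_name = c.customer_name,
    customer_email = c.customer_email
FROM customers c
WHERE o.customer_id = c.customer_id
  AND c.customer_id = 1001;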
Strategy 3: Materialized Views
-- Create a denormalized view that's physically stored
CREATE MATERIALIZED VIEW customer_order_summary AS
SELECT
c.customer_id,
c.customer_name,
c.customer_email,
COUNT(DISTINCT o.order_id) as total_orders,
SUM(o.total_amount) as lifetime_value,
MAX(o.order_date) as last_order_date,
AVG(o.total_amount) as average_order_value
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.customer_name, c.customer_email;
-- Refresh periodically (daily, hourly, or on-demand)
REFRESH MATERIALIZED VIEW customer_order_summary;
-- Query is lightning fast
SELECT * FROM customer_order_summary WHERE customer_id = 1001;
Priya: Materialized views are perfect for dashboards and reporting. They give you denormalization benefits without the maintenance headache.
Dimensional Modeling: The Analytics Powerhouse
Raj: You mentioned that normalization is great for transaction systems. What about analytics and reporting?
Priya: That's where dimensional modeling comes in! This is Ralph Kimball's approach, and it's specifically designed for analytics, not transactions.
The Core Concept
Priya: In dimensional modeling, we organize data into two types of tables:
- Fact Tables: Store measurements, metrics, and facts
- Dimension Tables: Store the context for those facts
Think of it like a story: Dimensions answer "who, what, when, where, why" and facts answer "how much, how many."
Building a Fact Table
Priya: Let's build a sales fact table for a retail chain:
CREATE TABLE fact_sales (
-- Surrogate key
sales_fact_id BIGSERIAL PRIMARY KEY,
-- Foreign keys to dimensions (the context)
date_key INTEGER NOT NULL, -- When did the sale happen?
product_key INTEGER NOT NULL, -- What was sold?
store_key INTEGER NOT NULL, -- Where was it sold?
customer_key INTEGER NOT NULL, -- Who bought it?
employee_key INTEGER NOT NULL, -- Who made the sale?
promotion_key INTEGER, -- Any promotion applied?
-- Degenerate dimensions (IDs that don't need their own dimension)
transaction_number VARCHAR(50),
receipt_number VARCHAR(50),
-- Facts/Measures (the numbers we want to analyze)
quantity_sold INTEGER NOT NULL,
unit_price DECIMAL(10,2) NOT NULL,
unit_cost DECIMAL(10,2) NOT NULL,
discount_amount DECIMAL(10,2) DEFAULT 0,
tax_amount DECIMAL(10,2) DEFAULT 0,
total_amount DECIMAL(10,2) NOT NULL,
-- Metadata
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
Raj: Why have both sales_fact_id and transaction_number?
Priya: Good question! The sales_fact_id is a technical surrogate key for database purposes. The transaction_number is a degenerate dimension—it's an operational identifier that business users understand but doesn't need its own dimension table.
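Even without its own dimension table, the degenerate dimension is still useful for grouping line-level rows back into the original receipt; a small sketch:
-- Roll individual sale lines back up to the transaction they came from
SELECT
    transaction_number,
    COUNT(*) AS line_items,
    SUM(total_amount) AS transaction_total
FROM fact_sales
GROUP BY transaction_number;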
The Date Dimension: Your Best Friend
Priya: The date dimension is crucial. Never just store a date—create a full dimension!
CREATE TABLE dim_date (
date_key INTEGER PRIMARY KEY, -- Format: YYYYMMDD (e.g., 20240115)
-- Date attributes
full_date DATE NOT NULL UNIQUE,
day_of_week VARCHAR(10), -- Monday, Tuesday, ...
day_of_week_num INTEGER, -- 1-7
day_of_month INTEGER, -- 1-31
day_of_year INTEGER, -- 1-366
-- Week attributes
week_of_year INTEGER,
week_of_month INTEGER,
-- Month attributes
month_number INTEGER, -- 1-12
month_name VARCHAR(10), -- January, February, ...
month_year VARCHAR(8), -- Jan-2024
-- Quarter attributes
quarter_number INTEGER, -- 1-4
quarter_name VARCHAR(6), -- Q1, Q2, ...
quarter_year VARCHAR(8), -- Q1-2024
-- Year attributes
year INTEGER,
fiscal_year INTEGER, -- If different from calendar
fiscal_quarter INTEGER,
-- Special flags
is_weekend BOOLEAN,
is_holiday BOOLEAN,
holiday_name VARCHAR(100),
is_working_day BOOLEAN,
is_last_day_of_month BOOLEAN,
is_last_day_of_quarter BOOLEAN,
is_last_day_of_year BOOLEAN,
-- Relative dates (updated periodically)
is_current_day BOOLEAN,
is_current_month BOOLEAN,
is_current_quarter BOOLEAN,
is_current_year BOOLEAN
);
Raj: Wow, so much information for just a date!
Priya: Yes! But see how powerful queries become:
-- Sales on weekends
SELECT SUM(total_amount)
FROM fact_sales fs
JOIN dim_date dd ON fs.date_key = dd.date_key
WHERE dd.is_weekend = TRUE;
-- Compare this quarter vs last quarter
SELECT
dd.quarter_year,
SUM(fs.total_amount) as revenue
FROM fact_sales fs
JOIN dim_date dd ON fs.date_key = dd.date_key
WHERE dd.year = 2024 AND dd.quarter_number IN (1, 2)
GROUP BY dd.quarter_year;
-- Sales excluding holidays
SELECT SUM(total_amount)
FROM fact_sales fs
JOIN dim_date dd ON fs.date_key = dd.date_key
WHERE dd.is_holiday = FALSE;
Raj: This is brilliant! No complex date calculations in queries.
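Populating dim_date is usually a one-time script; a minimal sketch using generate_series that fills only a subset of the columns (the rest follow the same pattern):
INSERT INTO dim_date (date_key, full_date, day_of_week, day_of_week_num, day_of_month,
                      day_of_year, week_of_year, month_number, month_name,
                      quarter_number, year, is_weekend)
SELECT
    TO_CHAR(d, 'YYYYMMDD')::INTEGER,      -- e.g., 20240115
    d::DATE,
    TRIM(TO_CHAR(d, 'Day')),              -- Monday, Tuesday, ...
    EXTRACT(ISODOW FROM d)::INTEGER,      -- 1 = Monday ... 7 = Sunday
    EXTRACT(DAY FROM d)::INTEGER,
    EXTRACT(DOY FROM d)::INTEGER,
    EXTRACT(WEEK FROM d)::INTEGER,
    EXTRACT(MONTH FROM d)::INTEGER,
    TRIM(TO_CHAR(d, 'Month')),
    EXTRACT(QUARTER FROM d)::INTEGER,
    EXTRACT(YEAR FROM d)::INTEGER,
    EXTRACT(ISODOW FROM d) IN (6, 7)      -- Saturday/Sunday
FROM generate_series(DATE '2020-01-01', DATE '2030-12-31', INTERVAL '1 day') AS d;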
Product Dimension
CREATE TABLE dim_product (
product_key INTEGER PRIMARY KEY, -- Surrogate key
-- Natural key
product_id VARCHAR(50) NOT NULL, -- SKU from source system
-- Product attributes (denormalized!)
product_name VARCHAR(255) NOT NULL,
product_description TEXT,
-- Category hierarchy (denormalized)
category_level_1 VARCHAR(100), -- Electronics
category_level_2 VARCHAR(100), -- Mobile Phones
category_level_3 VARCHAR(100), -- Smartphones
-- Brand attributes
brand_name VARCHAR(100),
manufacturer VARCHAR(100),
-- Product characteristics
product_size VARCHAR(50),
product_color VARCHAR(50),
product_weight DECIMAL(8,2),
-- Pricing
current_price DECIMAL(10,2),
cost_price DECIMAL(10,2),
-- SCD Type 2 fields for tracking changes
effective_date DATE NOT NULL,
expiry_date DATE,
is_current BOOLEAN DEFAULT TRUE,
-- Metadata
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
Raj: You're storing category_level_1, category_level_2, category_level_3 all in one table? Isn't that denormalized?
Priya: Absolutely! That's the whole point of dimensional modeling. We intentionally denormalize to make queries simple and fast.
Normalized approach (slow):
SELECT p.product_name, cat3.name, cat2.name, cat1.name
FROM products p
JOIN categories cat3 ON p.category_id = cat3.id
JOIN categories cat2 ON cat3.parent_id = cat2.id
JOIN categories cat1 ON cat2.parent_id = cat1.id;
Denormalized approach (fast):
SELECT product_name, category_level_3, category_level_2, category_level_1
FROM dim_product;
Customer Dimension
CREATE TABLE dim_customer (
customer_key INTEGER PRIMARY KEY,
-- Natural key
customer_id VARCHAR(50) NOT NULL,
-- Personal information
customer_name VARCHAR(200),
email_address VARCHAR(255),
phone_number VARCHAR(15),
date_of_birth DATE,
age_group VARCHAR(20), -- 18-25, 26-35, etc.
gender VARCHAR(10),
-- Location (denormalized)
address_line1 VARCHAR(255),
address_line2 VARCHAR(255),
city VARCHAR(100),
state VARCHAR(100),
pincode VARCHAR(10),
country VARCHAR(100),
region VARCHAR(50), -- North, South, East, West
-- Customer segmentation
customer_segment VARCHAR(50), -- Premium, Regular, Budget
loyalty_tier VARCHAR(50), -- Gold, Silver, Bronze
is_vip BOOLEAN DEFAULT FALSE,
-- Customer lifecycle
first_purchase_date DATE,
last_purchase_date DATE,
total_lifetime_value DECIMAL(12,2),
total_orders INTEGER,
-- Demographics
occupation VARCHAR(100),
income_bracket VARCHAR(50),
-- SCD Type 2 fields
effective_date DATE NOT NULL,
expiry_date DATE,
is_current BOOLEAN DEFAULT TRUE,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
Slowly Changing Dimensions (SCD)
Raj: You keep mentioning SCD Type 2. What is that?
Priya: Great timing! Slowly Changing Dimensions handle how we track changes in dimension data over time. There are several types.
Type 0: Retain Original (No Changes)
Some attributes should never change once set:
-- Birth date should never change
date_of_birth DATE NOT NULL
Type 1: Overwrite (No History)
-- Customer changes phone number - just update it
UPDATE dim_customer
SET phone_number = '9876543999'
WHERE customer_key = 1001;
-- Old number is lost, no history maintained
Use when: History doesn't matter (phone numbers, email addresses for current contact)
Type 2: Add New Row (Full History)
-- Customer moves from Mumbai to Pune
-- Step 1: Expire the old record
UPDATE dim_customer
SET expiry_date = '2024-01-15',
is_current = FALSE
WHERE customer_key = 1001 AND is_current = TRUE;
-- Step 2: Insert new record
INSERT INTO dim_customer (
customer_id, customer_name, email_address, city, state,
effective_date, expiry_date, is_current
) VALUES (
'C001', 'Rahul Sharma', 'rahul@example.com', 'Pune', 'Maharashtra',
'2024-01-16', NULL, TRUE
);
-- Result: Two rows for same customer
customer_key | customer_id | customer_name | city | effective_date | expiry_date | is_current
-------------|-------------|----------------|--------|----------------|-------------|------------
1001 | C001 | Rahul Sharma | Mumbai | 2023-01-01 | 2024-01-15 | FALSE
1045 | C001 | Rahul Sharma | Pune | 2024-01-16 | NULL | TRUE
Use when: You need to track historical accuracy (customer addresses for analyzing regional trends)
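Querying Type 2 history then becomes a date-range filter; a sketch of a point-in-time lookup:
-- Which version of customer C001 was current on 2023-06-01?
SELECT customer_key, city, state, effective_date, expiry_date
FROM dim_customer
WHERE customer_id = 'C001'
  AND effective_date <= DATE '2023-06-01'
  AND (expiry_date >= DATE '2023-06-01' OR expiry_date IS NULL);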
Type 3: Add New Column (Limited History)
CREATE TABLE dim_product (
product_key INTEGER PRIMARY KEY,
product_name VARCHAR(255),
current_price DECIMAL(10,2),
previous_price DECIMAL(10,2),
price_change_date DATE
);
-- When price changes, shift current to previous
UPDATE dim_product
SET previous_price = current_price,
current_price = 699,
price_change_date = CURRENT_DATE
WHERE product_key = 5001;
Use when: You only need to track one previous value
Type 4: Add Historical Table
-- Current dimension
CREATE TABLE dim_product_current (
product_key INTEGER PRIMARY KEY,
product_id VARCHAR(50),
product_name VARCHAR(255),
current_price DECIMAL(10,2)
);
-- History table
CREATE TABLE dim_product_history (
history_id SERIAL PRIMARY KEY,
product_key INTEGER,
product_id VARCHAR(50),
product_name VARCHAR(255),
price DECIMAL(10,2),
effective_date DATE,
expiry_date DATE
);
Type 6: Hybrid (1+2+3)
CREATE TABLE dim_customer (
customer_key INTEGER PRIMARY KEY,
customer_id VARCHAR(50),
customer_name VARCHAR(255),
-- Type 1: Always current
email_address VARCHAR(255),
phone_number VARCHAR(15),
-- Type 2: Historical rows
city VARCHAR(100),
state VARCHAR(100),
effective_date DATE,
expiry_date DATE,
is_current BOOLEAN,
-- Type 3: Limited history
current_segment VARCHAR(50),
previous_segment VARCHAR(50)
);
Conformed Dimensions
Raj: What if multiple fact tables need the same dimension?
Priya: That's a conformed dimension—a dimension shared across multiple fact tables. This is crucial for cross-business-process analysis!
-- Same dim_date used by multiple facts
CREATE TABLE fact_sales (
date_key INTEGER REFERENCES dim_date(date_key),
...
);
CREATE TABLE fact_inventory (
date_key INTEGER REFERENCES dim_date(date_key),
...
);
CREATE TABLE fact_website_traffic (
date_key INTEGER REFERENCES dim_date(date_key),
...
);
-- Now you can correlate across business processes
SELECT
dd.full_date,
SUM(fs.total_amount) as sales,
AVG(fi.stock_level) as avg_inventory,
SUM(fwt.page_views) as website_traffic
FROM dim_date dd
LEFT JOIN fact_sales fs ON dd.date_key = fs.date_key
LEFT JOIN fact_inventory fi ON dd.date_key = fi.date_key
LEFT JOIN fact_website_traffic fwt ON dd.date_key = fwt.date_key
WHERE dd.year = 2024 AND dd.quarter_number = 1
GROUP BY dd.full_date;
Priya: Conformed dimensions ensure consistency. Everyone uses the same customer_key for the same customer across all facts.
Junk Dimensions
Raj: What if I have lots of low-cardinality flags and indicators?
Priya: Create a junk dimension! Instead of having many TRUE/FALSE columns in your fact table, consolidate them.
Bad approach: Too many flag columns in fact
CREATE TABLE fact_sales (
sales_id BIGINT,
date_key INTEGER,
product_key INTEGER,
is_discount_applied BOOLEAN,
is_return_eligible BOOLEAN,
is_gift_wrapped BOOLEAN,
is_express_delivery BOOLEAN,
is_weekend_sale BOOLEAN,
... -- Too many flags!
);
Good approach: Junk dimension
CREATE TABLE dim_sales_flags (
sales_flag_key INTEGER PRIMARY KEY,
is_discount_applied BOOLEAN,
is_return_eligible BOOLEAN,
is_gift_wrapped BOOLEAN,
is_express_delivery BOOLEAN,
is_weekend_sale BOOLEAN
);
-- Pre-populate all combinations (2^5 = 32 rows)
INSERT INTO dim_sales_flags VALUES
(1, FALSE, FALSE, FALSE, FALSE, FALSE),
(2, TRUE, FALSE, FALSE, FALSE, FALSE),
(3, FALSE, TRUE, FALSE, FALSE, FALSE),
...
CREATE TABLE fact_sales (
sales_id BIGINT,
date_key INTEGER,
product_key INTEGER,
sales_flag_key INTEGER REFERENCES dim_sales_flags(sales_flag_key),
...
);
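Rather than hand-typing all 32 rows of dim_sales_flags, you can generate every combination with a cross join; a sketch:
INSERT INTO dim_sales_flags
SELECT
    (ROW_NUMBER() OVER ())::INTEGER AS sales_flag_key,
    a, b, c, d, e
FROM (VALUES (TRUE), (FALSE)) AS t1(a)
CROSS JOIN (VALUES (TRUE), (FALSE)) AS t2(b)
CROSS JOIN (VALUES (TRUE), (FALSE)) AS t3(c)
CROSS JOIN (VALUES (TRUE), (FALSE)) AS t4(d)
CROSS JOIN (VALUES (TRUE), (FALSE)) AS t5(e); -- 2^5 = 32 rows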
Factless Fact Tables
Priya: Sometimes you need to record events that don't have measures!
Example 1: Student Attendance
CREATE TABLE fact_attendance (
attendance_key BIGSERIAL PRIMARY KEY,
date_key INTEGER REFERENCES dim_date(date_key),
student_key INTEGER REFERENCES dim_student(student_key),
course_key INTEGER REFERENCES dim_course(course_key),
class_session_key INTEGER REFERENCES dim_class_session(class_session_key)
-- No measures! The presence of a row means student attended
);
-- Query: How many students attended course CS101 last month?
SELECT COUNT(*) as attendance_count
FROM fact_attendance fa
JOIN dim_date dd ON fa.date_key = dd.date_key
JOIN dim_course dc ON fa.course_key = dc.course_key
WHERE dc.course_code = 'CS101'
AND dd.month_name = 'January'
AND dd.year = 2024;
Example 2: Promotional Coverage
CREATE TABLE fact_promotion_coverage (
coverage_key BIGSERIAL PRIMARY KEY,
date_key INTEGER,
product_key INTEGER,
store_key INTEGER,
promotion_key INTEGER
-- Records which products were on promotion where and when
);
-- Find products on promotion but didn't sell (left join with fact_sales)
SELECT dp.product_name, COUNT(fpc.coverage_key) as promotion_days
FROM fact_promotion_coverage fpc
JOIN dim_product dp ON fpc.product_key = dp.product_key
LEFT JOIN fact_sales fs ON fpc.date_key = fs.date_key
AND fpc.product_key = fs.product_key
AND fpc.store_key = fs.store_key
WHERE fs.sales_fact_id IS NULL
GROUP BY dp.product_name;
Schema Patterns: Star, Snowflake, and Galaxy
Raj: You mentioned star schema. What exactly is that?
Priya: The schema pattern defines how your fact and dimension tables are arranged. Let me show you the three main patterns.
Star Schema
Priya: In a star schema, the fact table is at the center, and dimension tables radiate out like points of a star.
Visual Representation:
dim_date dim_customer
| |
| |
| |
dim_product ---- FACT_SALES ---- dim_store
| |
| |
| |
dim_employee dim_promotion
Characteristics:
- Dimensions are fully denormalized (one table per dimension)
- Direct joins from fact to each dimension
- Simple queries
- Fast performance
- More storage due to denormalization
Example Implementation:
-- Fully denormalized dimension (star schema)
CREATE TABLE dim_product (
product_key INTEGER PRIMARY KEY,
product_name VARCHAR(255),
brand_name VARCHAR(100),
brand_description TEXT,
category_level_1 VARCHAR(100), -- All in one table!
category_level_2 VARCHAR(100),
category_level_3 VARCHAR(100),
supplier_name VARCHAR(200),
supplier_country VARCHAR(100)
);
-- Simple query - just two tables
SELECT
dp.category_level_1,
SUM(fs.total_amount) as revenue
FROM fact_sales fs
JOIN dim_product dp ON fs.product_key = dp.product_key
GROUP BY dp.category_level_1;
Snowflake Schema
Priya: In a snowflake schema, dimensions are normalized into multiple related tables.
Visual Representation:
dim_date
|
dim_brand dim_customer --- dim_city --- dim_state
| |
dim_product ---- FACT_SALES ---- dim_store --- dim_region
| |
dim_category dim_employee --- dim_department
|
dim_subcategory
Characteristics:
- Dimensions are normalized (star is "melted" into snowflake)
- Multiple levels of tables
- Complex queries (more joins)
- Slower performance
- Less storage (normalized)
Example Implementation:
-- Normalized dimensions (snowflake schema)
CREATE TABLE dim_product (
product_key INTEGER PRIMARY KEY,
product_name VARCHAR(255),
brand_key INTEGER REFERENCES dim_brand(brand_key),
category_key INTEGER REFERENCES dim_category(category_key),
supplier_key INTEGER REFERENCES dim_supplier(supplier_key)
);
CREATE TABLE dim_brand (
brand_key INTEGER PRIMARY KEY,
brand_name VARCHAR(100),
brand_description TEXT
);
CREATE TABLE dim_category (
category_key INTEGER PRIMARY KEY,
category_name VARCHAR(100),
parent_category_key INTEGER REFERENCES dim_category(category_key)
);
CREATE TABLE dim_supplier (
supplier_key INTEGER PRIMARY KEY,
supplier_name VARCHAR(200),
country_key INTEGER REFERENCES dim_country(country_key)
);
-- Complex query - many joins
SELECT
dc.category_name,
SUM(fs.total_amount) as revenue
FROM fact_sales fs
JOIN dim_product dp ON fs.product_key = dp.product_key
JOIN dim_category dc ON dp.category_key = dc.category_key
GROUP BY dc.category_name;
Raj: When would I use snowflake over star?
Priya: Honestly? Almost never in modern systems! Here's the comparison:
| Aspect | Star Schema | Snowflake Schema |
|---|---|---|
| Query Performance | ⚡ Fast (fewer joins) | 🐌 Slower (many joins) |
| Query Complexity | ✅ Simple | ❌ Complex |
| Storage Space | More (denormalized) | Less (normalized) |
| ETL Complexity | Simple | Complex |
| Maintenance | Easier | Harder |
| Business User Friendly | ✅ Yes | ❌ Confusing |
| When to Use | 95% of cases | Very large dimensions, legacy systems |
Priya: Kimball himself preferred star schemas. Use snowflake only if:
- You have millions of rows in dimensions and normalizing saves significant space
- You're dealing with legacy systems that are already snowflaked
- You have complex hierarchies that change frequently
Galaxy Schema (Fact Constellation)
Priya: When you have multiple fact tables sharing dimensions, that's a galaxy schema.
Visual Representation:
dim_date
/ \
/ \
FACT_SALES FACT_INVENTORY
/ \ / \
/ \ / \
dim_product dim_store dim_warehouse
\ /
\ /
FACT_ORDERS
Multiple fact tables, shared dimensions
Example Implementation:
-- Shared dimension
CREATE TABLE dim_product (
product_key INTEGER PRIMARY KEY,
product_name VARCHAR(255),
...
);
-- Multiple fact tables using the same dimensions
CREATE TABLE fact_sales (
sales_key BIGSERIAL PRIMARY KEY,
date_key INTEGER REFERENCES dim_date(date_key),
product_key INTEGER REFERENCES dim_product(product_key),
store_key INTEGER REFERENCES dim_store(store_key),
quantity_sold INTEGER,
total_amount DECIMAL(12,2)
);
CREATE TABLE fact_inventory (
inventory_key BIGSERIAL PRIMARY KEY,
date_key INTEGER REFERENCES dim_date(date_key),
product_key INTEGER REFERENCES dim_product(product_key),
warehouse_key INTEGER REFERENCES dim_warehouse(warehouse_key),
quantity_on_hand INTEGER,
reorder_level INTEGER
);
CREATE TABLE fact_purchases (
purchase_key BIGSERIAL PRIMARY KEY,
date_key INTEGER REFERENCES dim_date(date_key),
product_key INTEGER REFERENCES dim_product(product_key),
supplier_key INTEGER REFERENCES dim_supplier(supplier_key),
quantity_purchased INTEGER,
purchase_cost DECIMAL(12,2)
);
-- Cross-business-process analysis
SELECT
dp.product_name,
SUM(fs.quantity_sold) as units_sold,
AVG(fi.quantity_on_hand) as avg_inventory,
SUM(fp.quantity_purchased) as units_purchased
FROM dim_product dp
LEFT JOIN fact_sales fs ON dp.product_key = fs.product_key
LEFT JOIN fact_inventory fi ON dp.product_key = fi.product_key
LEFT JOIN fact_purchases fp ON dp.product_key = fp.product_key
WHERE fs.date_key = 20240115
GROUP BY dp.product_name;
Raj: So galaxy is just multiple stars sharing dimensions?
Priya: Exactly! It's the natural evolution when you have multiple business processes. The shared dimensions (like dim_date, dim_product) are called conformed dimensions—they ensure consistency across all fact tables.
Data Vault: The Agile Approach
Raj: I keep hearing about Data Vault in enterprise contexts. How is it different from Kimball?
Priya: Data Vault is a different modeling methodology created by Dan Linstedt. While Kimball focuses on the presentation layer (star schemas for reporting), Data Vault focuses on the raw data warehouse layer—storing source data in an auditable, scalable way.
The Three Core Components
Priya: Data Vault has three main table types:
- Hubs: Store unique business keys
- Links: Store relationships between hubs
- Satellites: Store descriptive attributes and temporal data
Let me show you with a real example.
Hubs: The Business Entities
-- Hub stores ONLY the business key and metadata
CREATE TABLE hub_customer (
customer_hash_key CHAR(32) PRIMARY KEY, -- MD5 or SHA hash of business key
customer_id VARCHAR(50) NOT NULL UNIQUE, -- Business key from source
load_date TIMESTAMP NOT NULL,
record_source VARCHAR(50) NOT NULL
);
CREATE TABLE hub_product (
product_hash_key CHAR(32) PRIMARY KEY,
product_id VARCHAR(50) NOT NULL UNIQUE,
load_date TIMESTAMP NOT NULL,
record_source VARCHAR(50) NOT NULL
);
CREATE TABLE hub_order (
order_hash_key CHAR(32) PRIMARY KEY,
order_id VARCHAR(50) NOT NULL UNIQUE,
load_date TIMESTAMP NOT NULL,
record_source VARCHAR(50) NOT NULL
);
Raj: Why hash the business key?
Priya: Several reasons:
- Performance: Fixed-length hash keys are faster to join than variable-length strings
- Composite keys: If business key is (customer_id + system_id), hash them together
- Consistency: Same hash algorithm across all hubs
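A sketch of deriving the hash key during loading (the staging table, the source_system column, and the delimiter choice are assumptions for illustration):
-- Delimiter prevents collisions such as ('AB','C') vs ('A','BC')
SELECT
    MD5(CONCAT_WS('||', customer_id, source_system)) AS customer_hash_key,
    customer_id,
    CURRENT_TIMESTAMP AS load_date,
    source_system AS record_source
FROM staging_customer;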
Satellites: The Descriptive Data
-- Satellite stores attributes and tracks changes (SCD Type 2)
CREATE TABLE sat_customer_details (
customer_hash_key CHAR(32) NOT NULL REFERENCES hub_customer(customer_hash_key),
load_date TIMESTAMP NOT NULL,
load_end_date TIMESTAMP, -- NULL for current record
-- Descriptive attributes
customer_name VARCHAR(200),
email_address VARCHAR(255),
phone_number VARCHAR(15),
-- Metadata
record_source VARCHAR(50) NOT NULL,
hash_diff CHAR(32) NOT NULL, -- Hash of all attributes for change detection
PRIMARY KEY (customer_hash_key, load_date)
);
-- You can have multiple satellites for the same hub
CREATE TABLE sat_customer_address (
customer_hash_key CHAR(32) NOT NULL REFERENCES hub_customer(customer_hash_key),
load_date TIMESTAMP NOT NULL,
load_end_date TIMESTAMP,
-- Address attributes
street_address VARCHAR(255),
city VARCHAR(100),
state VARCHAR(100),
pincode VARCHAR(10),
country VARCHAR(100),
record_source VARCHAR(50) NOT NULL,
hash_diff CHAR(32) NOT NULL,
PRIMARY KEY (customer_hash_key, load_date)
);
Raj: Why split into two satellites?
Priya: Separation of concerns! Customer details (name, email) might change independently from address. With separate satellites:
- Load them independently and in parallel
- Different refresh schedules (details daily, address monthly)
- If address loading fails, details still update
- Easier to manage and understand
Links: The Relationships
-- Link stores relationships between hubs
CREATE TABLE link_customer_order (
customer_order_hash_key CHAR(32) PRIMARY KEY, -- Hash of (customer_hash + order_hash)
customer_hash_key CHAR(32) NOT NULL REFERENCES hub_customer(customer_hash_key),
order_hash_key CHAR(32) NOT NULL REFERENCES hub_order(order_hash_key),
load_date TIMESTAMP NOT NULL,
record_source VARCHAR(50) NOT NULL,
UNIQUE (customer_hash_key, order_hash_key)
);
-- Link for many-to-many relationship
CREATE TABLE link_order_product (
order_product_hash_key CHAR(32) PRIMARY KEY,
order_hash_key CHAR(32) NOT NULL REFERENCES hub_order(order_hash_key),
product_hash_key CHAR(32) NOT NULL REFERENCES hub_product(product_hash_key),
load_date TIMESTAMP NOT NULL,
record_source VARCHAR(50) NOT NULL,
UNIQUE (order_hash_key, product_hash_key)
);
-- Satellite on link (for relationship attributes)
CREATE TABLE sat_order_product_details (
order_product_hash_key CHAR(32) NOT NULL REFERENCES link_order_product(order_product_hash_key),
load_date TIMESTAMP NOT NULL,
load_end_date TIMESTAMP,
-- Attributes of the relationship
quantity INTEGER,
unit_price DECIMAL(10,2),
discount_amount DECIMAL(10,2),
record_source VARCHAR(50) NOT NULL,
hash_diff CHAR(32) NOT NULL,
PRIMARY KEY (order_product_hash_key, load_date)
);
Example: Loading Data into Data Vault
-- Source data arrives
-- order_id: ORD001, customer_id: CUST123, product_id: PROD456, quantity: 2
-- Step 1: Load into hub_customer (if not exists)
INSERT INTO hub_customer (customer_hash_key, customer_id, load_date, record_source)
SELECT
MD5('CUST123'),
'CUST123',
CURRENT_TIMESTAMP,
'ERP_SYSTEM'
WHERE NOT EXISTS (
SELECT 1 FROM hub_customer WHERE customer_id = 'CUST123'
);
-- Step 2: Load into hub_order (if not exists)
INSERT INTO hub_order (order_hash_key, order_id, load_date, record_source)
SELECT MD5('ORD001'), 'ORD001', CURRENT_TIMESTAMP, 'ERP_SYSTEM'
WHERE NOT EXISTS (SELECT 1 FROM hub_order WHERE order_id = 'ORD001');
-- Step 3: Load into hub_product (if not exists)
INSERT INTO hub_product (product_hash_key, product_id, load_date, record_source)
SELECT MD5('PROD456'), 'PROD456', CURRENT_TIMESTAMP, 'ERP_SYSTEM'
WHERE NOT EXISTS (SELECT 1 FROM hub_product WHERE product_id = 'PROD456');
-- Step 4: Load link between customer and order
INSERT INTO link_customer_order (
customer_order_hash_key, customer_hash_key, order_hash_key,
load_date, record_source
)
SELECT
MD5(CONCAT(MD5('CUST123'), MD5('ORD001'))),
MD5('CUST123'),
MD5('ORD001'),
CURRENT_TIMESTAMP,
'ERP_SYSTEM'
WHERE NOT EXISTS (
SELECT 1 FROM link_customer_order
WHERE customer_hash_key = MD5('CUST123')
AND order_hash_key = MD5('ORD001')
);
-- Step 5: Load satellite data (with change detection)
-- Close previous record if data changed
UPDATE sat_order_product_details
SET load_end_date = CURRENT_TIMESTAMP
WHERE order_product_hash_key = MD5(CONCAT(MD5('ORD001'), MD5('PROD456')))
AND load_end_date IS NULL
AND hash_diff != MD5(CONCAT('2', '99.99', '0')); -- If data changed
-- Insert new record
INSERT INTO sat_order_product_details (
order_product_hash_key, load_date, load_end_date,
quantity, unit_price, discount_amount,
record_source, hash_diff
)
VALUES (
MD5(CONCAT(MD5('ORD001'), MD5('PROD456'))),
CURRENT_TIMESTAMP,
NULL,
2, 99.99, 0,
'ERP_SYSTEM',
MD5(CONCAT('2', '99.99', '0'))
);
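Reading the current state back out means joining the hub to the satellite rows whose load_end_date is still NULL; a sketch:
-- Current view of each customer assembled from hub + satellites
SELECT
    h.customer_id,
    d.customer_name,
    d.email_address,
    a.city,
    a.state
FROM hub_customer h
LEFT JOIN sat_customer_details d
       ON d.customer_hash_key = h.customer_hash_key
      AND d.load_end_date IS NULL
LEFT JOIN sat_customer_address a
       ON a.customer_hash_key = h.customer_hash_key
      AND a.load_end_date IS NULL;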
Data Vault vs Kimball
Raj: This seems way more complex than star schema!
Priya: It is! But they serve different purposes. Here's the comparison:
| Aspect | Data Vault | Kimball (Star Schema) |
|---|---|---|
| Purpose | Raw data warehouse layer | Presentation/reporting layer |
| Target Users | ETL developers, data engineers | Business users, analysts |
| Query Performance | Slower (many joins) | Fast (denormalized) |
| Historical Tracking | Complete audit trail | Moderate (SCD Type 2) |
| Scalability | Excellent (parallel loading) | Good |
| Flexibility | Very flexible (schema changes easy) | Less flexible |
| Learning Curve | Steep | Moderate |
| Source System Changes | Handles gracefully | Requires restructuring |
| Complexity | High | Moderate |
Modern Architecture: Best of Both Worlds
Priya: In modern enterprise data warehouses, we often use both:
Architecture Flow:
Source Systems (ERP, CRM, Website, Mobile App)
↓
Data Lake / Staging Area (raw files)
↓
Data Vault Layer (Raw Data Warehouse)
- Hubs, Links, Satellites
- Complete history, audit trail
- Handles multiple sources
↓
Business Vault Layer (Business Rules)
- Calculated fields
- Business logic
- Data quality rules
↓
Star Schema Layer (Presentation/Marts)
- Fact and dimension tables
- Optimized for reporting
- Business-user friendly
↓
BI Tools (Tableau, Power BI, Looker)
Raj: So Data Vault for the heavy lifting, Star Schema for the reporting?
Priya: Exactly! Data Vault gives you:
- Auditability (who changed what, when, from which source)
- Flexibility (add new sources easily)
- Scalability (parallel loading)
Star Schema gives you:
- Performance (fast queries)
- Simplicity (business users understand it)
- BI tool compatibility
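To make that hand-off concrete, here's a minimal sketch of deriving a presentation-layer customer dimension from the raw vault. It assumes a sat_customer_details satellite (not shown above) that carries customer_name and customer_segment alongside hub_customer:
-- Sketch: derive a presentation-layer dimension from the raw vault
-- (sat_customer_details is an assumed satellite with customer_name and customer_segment)
CREATE VIEW dim_customer_current AS
SELECT
h.customer_id,
s.customer_name,
s.customer_segment
FROM hub_customer h
JOIN sat_customer_details s
ON s.customer_hash_key = h.customer_hash_key
WHERE s.load_end_date IS NULL; -- current version only
In practice this logic lives in the Business Vault / mart-building ETL rather than a view, but the shape is the same: hubs provide the business keys, satellites provide the attributes.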
Advanced Concepts
Raj: I feel like I'm getting a good grasp now. What are some advanced topics I should know?
Priya: Let me cover some patterns and scenarios you'll encounter in real projects.
Bridge Tables and Many-to-Many Relationships
Priya: Many-to-many relationships in dimensional modeling require special handling.
Scenario: Products can belong to multiple categories (cross-selling)
-- Product dimension (standard)
CREATE TABLE dim_product (
product_key INTEGER PRIMARY KEY,
product_id VARCHAR(50),
product_name VARCHAR(255),
primary_category VARCHAR(100) -- Main category
);
-- Category dimension
CREATE TABLE dim_category (
category_key INTEGER PRIMARY KEY,
category_name VARCHAR(100),
category_level VARCHAR(20)
);
-- Bridge table (handles M:N relationship)
CREATE TABLE bridge_product_category (
product_key INTEGER REFERENCES dim_product(product_key),
category_key INTEGER REFERENCES dim_category(category_key),
weight_factor DECIMAL(5,4) DEFAULT 1.0, -- For allocation
PRIMARY KEY (product_key, category_key)
);
-- Example data:
-- iPhone belongs to: Electronics (primary), Smartphones, Apple Products
INSERT INTO bridge_product_category VALUES
(1001, 101, 0.5000), -- Electronics (primary category, half the allocation)
(1001, 102, 0.2500), -- Smartphones
(1001, 103, 0.2500); -- Apple Products
-- Weights for a product should sum to 1.0 so each sale is allocated exactly once
Raj: What's this weight_factor?
Priya: It solves the double-counting problem! When you aggregate sales by category, without weights you'd count the same sale multiple times.
-- Without weight factor (wrong - counts sale 3 times)
SELECT c.category_name, SUM(fs.total_amount)
FROM fact_sales fs
JOIN bridge_product_category bpc ON fs.product_key = bpc.product_key
JOIN dim_category c ON bpc.category_key = c.category_key
GROUP BY c.category_name;
-- With weight factor (correct - allocates sale across categories)
SELECT c.category_name, SUM(fs.total_amount * bpc.weight_factor)
FROM fact_sales fs
JOIN bridge_product_category bpc ON fs.product_key = bpc.product_key
JOIN dim_category c ON bpc.category_key = c.category_key
GROUP BY c.category_name;
Hierarchies and Recursive Relationships
Priya: Handling organizational hierarchies or category trees.
Scenario: Employee reporting structure
Option 1: Flatten the Hierarchy (Kimball approach)
CREATE TABLE dim_employee (
employee_key INTEGER PRIMARY KEY,
employee_id VARCHAR(50),
employee_name VARCHAR(200),
-- Flatten the hierarchy
manager_name VARCHAR(200),
manager_level_1 VARCHAR(200), -- Department head
manager_level_2 VARCHAR(200), -- VP
manager_level_3 VARCHAR(200), -- CXO
department VARCHAR(100),
level_in_hierarchy INTEGER,
is_manager BOOLEAN
);
Option 2: Parent-Child (Normalized approach)
CREATE TABLE dim_employee (
employee_key INTEGER PRIMARY KEY,
employee_id VARCHAR(50),
employee_name VARCHAR(200),
manager_key INTEGER REFERENCES dim_employee(employee_key), -- Self-reference
department VARCHAR(100),
level_in_hierarchy INTEGER
);
-- Query hierarchy with recursive CTE
WITH RECURSIVE employee_hierarchy AS (
-- Base case: Start with CXO (no manager)
SELECT
employee_key, employee_name, manager_key,
1 as level,
employee_name as hierarchy_path
FROM dim_employee
WHERE manager_key IS NULL
UNION ALL
-- Recursive case: Add employees reporting to previous level
SELECT
e.employee_key, e.employee_name, e.manager_key,
eh.level + 1,
eh.hierarchy_path || ' > ' || e.employee_name
FROM dim_employee e
JOIN employee_hierarchy eh ON e.manager_key = eh.employee_key
)
SELECT * FROM employee_hierarchy ORDER BY level, employee_name;
Option 3: Bridge Table for Hierarchies
CREATE TABLE dim_employee (
employee_key INTEGER PRIMARY KEY,
employee_id VARCHAR(50),
employee_name VARCHAR(200)
);
-- Bridge table stores all ancestor-descendant relationships
CREATE TABLE bridge_employee_hierarchy (
employee_key INTEGER,
ancestor_employee_key INTEGER,
distance INTEGER, -- How many levels away
PRIMARY KEY (employee_key, ancestor_employee_key)
);
-- Example: CEO → VP → Manager → Employee
-- For "Employee", bridge table has:
-- (emp_key, emp_key, 0)     -- Self
-- (emp_key, manager_key, 1) -- Direct manager
-- (emp_key, vp_key, 2)      -- VP
-- (emp_key, ceo_key, 3)     -- CEO
-- Query: Total sales for a specific manager, including everyone who rolls up to them
SELECT
e.employee_name,
SUM(fs.total_amount) as team_sales
FROM fact_sales fs
JOIN bridge_employee_hierarchy beh ON fs.employee_key = beh.employee_key
JOIN dim_employee e ON beh.ancestor_employee_key = e.employee_key
WHERE beh.ancestor_employee_key = 1001 -- Specific manager
GROUP BY e.employee_name;
Temporal Modeling: Point-in-Time Tables
Priya: What if you need to know "What did the data look like on 2023-06-15?"
Problem: With SCD Type 2, you get current and historical rows, but finding "as of" a specific date requires complex queries.
Solution: Point-in-Time (PIT) tables
-- Regular dimension with SCD Type 2
CREATE TABLE dim_customer (
customer_key INTEGER,
customer_id VARCHAR(50),
customer_name VARCHAR(200),
customer_segment VARCHAR(50),
effective_date DATE,
expiry_date DATE,
is_current BOOLEAN
);
-- Point-in-Time snapshot table
CREATE TABLE pit_customer_daily (
snapshot_date DATE,
customer_id VARCHAR(50),
customer_key INTEGER, -- Which version was active on this date
customer_segment VARCHAR(50), -- Pre-joined for performance
PRIMARY KEY (snapshot_date, customer_id)
);
-- Populated daily
INSERT INTO pit_customer_daily
SELECT
CURRENT_DATE as snapshot_date,
customer_id,
customer_key,
customer_segment
FROM dim_customer
WHERE is_current = TRUE;
-- Query: What was customer segmentation on 2023-06-15?
SELECT customer_id, customer_segment
FROM pit_customer_daily
WHERE snapshot_date = '2023-06-15';
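-- For contrast (a sketch using the SCD Type 2 columns above): the same "as of" question
-- without a PIT table needs a range predicate instead of a simple equality
SELECT customer_id, customer_segment
FROM dim_customer
WHERE effective_date <= DATE '2023-06-15'
AND (expiry_date > DATE '2023-06-15' OR expiry_date IS NULL);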
-- Compare customer segments over time
SELECT
c.customer_id,
pit_old.customer_segment as segment_2023_01,
pit_new.customer_segment as segment_2024_01,
CASE
WHEN pit_old.customer_segment != pit_new.customer_segment
THEN 'Changed' ELSE 'Same'
END as status
FROM dim_customer c
JOIN pit_customer_daily pit_old ON c.customer_id = pit_old.customer_id
AND pit_old.snapshot_date = '2023-01-01'
JOIN pit_customer_daily pit_new ON c.customer_id = pit_new.customer_id
AND pit_new.snapshot_date = '2024-01-01'
WHERE c.is_current = TRUE;
Multi-Valued Dimensions
Priya: What if a transaction has multiple values for a dimension?
Scenario: An order has multiple payment methods (50% card, 50% wallet)
Option 1: Multi-Row (most common)
CREATE TABLE fact_payment (
payment_key BIGSERIAL PRIMARY KEY,
order_key INTEGER,
payment_method_key INTEGER,
amount_paid DECIMAL(10,2)
);
-- Same order, multiple rows
INSERT INTO fact_payment VALUES
(1, 1001, 1, 500), -- Card
(2, 1001, 3, 500); -- Wallet
Option 2: Multi-Value Dimension (array/JSON)
CREATE TABLE fact_order (
order_key INTEGER PRIMARY KEY,
customer_key INTEGER,
payment_methods JSONB, -- ["Card", "Wallet"]
total_amount DECIMAL(10,2)
);
-- Query using JSON functions
SELECT
COUNT(*) as orders_with_card
FROM fact_order
WHERE payment_methods @> '["Card"]';
Option 3: Weighting Bridge Table
CREATE TABLE bridge_order_payment_method (
order_key INTEGER,
payment_method_key INTEGER,
allocation_percentage DECIMAL(5,2),
PRIMARY KEY (order_key, payment_method_key)
);
-- Order 1001: 50% card, 50% wallet
INSERT INTO bridge_order_payment_method VALUES
(1001, 1, 50.00),
(1001, 3, 50.00);
-- Allocate sales by payment method
SELECT
pm.payment_method_name,
SUM(fo.total_amount * bopm.allocation_percentage / 100) as allocated_revenue
FROM fact_order fo
JOIN bridge_order_payment_method bopm ON fo.order_key = bopm.order_key
JOIN dim_payment_method pm ON bopm.payment_method_key = pm.payment_method_key
GROUP BY pm.payment_method_name;
Data Modeling for Modern Systems
Raj: Everything we've discussed so far is for relational databases. What about NoSQL, data lakes, real-time systems?
Priya: Great question! Modern data ecosystems are more diverse. Let me cover the key patterns.
NoSQL Data Modeling
Priya: NoSQL databases (MongoDB, Cassandra, DynamoDB) require a completely different mindset.
Key Principle: Model for your queries, not for normalization!
Example: E-commerce Order in MongoDB
// Denormalized document (optimized for reading)
{
"_id": "ORD1001",
"order_date": ISODate("2024-01-15"),
"order_status": "Delivered",
// Embedded customer info (denormalized)
"customer": {
"customer_id": "CUST123",
"name": "Rahul Sharma",
"email": "rahul@example.com",
"phone": "9876543210"
},
// Embedded address (denormalized)
"shipping_address": {
"street": "123 MG Road",
"city": "Mumbai",
"state": "Maharashtra",
"pincode": "400001"
},
// Array of items (embedded)
"items": [
{
"product_id": "PROD456",
"product_name": "iPhone 14",
"quantity": 1,
"unit_price": 79900,
"discount": 5000,
"total": 74900
},
{
"product_id": "PROD789",
"product_name": "AirPods Pro",
"quantity": 1,
"unit_price": 24900,
"discount": 0,
"total": 24900
}
],
// Embedded payment info
"payment": {
"method": "Card",
"transaction_id": "TXN99999",
"amount_paid": 99800
},
// Calculated totals
"subtotal": 104800,
"total_discount": 5000,
"tax_amount": 14982,
"grand_total": 99800
}
Raj: Everything is in one document! What if I need to update customer email?
Priya: That's the trade-off! In NoSQL:
Embed when:
- Data is read together frequently
- Data doesn't change often
- Data is specific to this context
Reference when:
- Data is accessed independently
- Data changes frequently
- Data is shared across documents
Hybrid approach:
// Order document
{
"_id": "ORD1001",
"customer_ref": "CUST123", // Reference for updates
"customer_snapshot": { // Embedded for reading
"name": "Rahul Sharma",
"email": "rahul@example.com"
},
"items": [...],
"total_amount": 99800
}
// Separate customer document
{
"_id": "CUST123",
"name": "Rahul Sharma",
"email": "rahul@example.com", // Update here
"phone": "9876543210",
"addresses": [...]
}
Data Lake Modeling
Priya: Data lakes store raw data in its native format. The key is organizing it for discovery and processing.
Common Structure:
/data-lake/
/raw/ # Landing zone (as-is from source)
/erp/
/orders/
/date=2024-01-15/
orders_001.parquet
orders_002.parquet
/crm/
/customers/
/date=2024-01-15/
customers.json
/web-logs/
/year=2024/month=01/day=15/
access-logs.gz
/processed/ # Cleaned, validated data
/orders/
/year=2024/month=01/
orders.parquet
/customers/
/active=true/
customers.parquet
/active=false/
customers.parquet
/curated/ # Business-ready datasets
/sales-analytics/
fact_sales.parquet
dim_product.parquet
dim_customer.parquet
/customer-360/
customer_unified.parquet
Partitioning Strategy (crucial for performance):
-- Bad: No partitioning (scans entire dataset)
SELECT * FROM orders WHERE order_date = '2024-01-15';
-- Good: Partitioned by date
/orders/year=2024/month=01/day=15/
SELECT * FROM orders WHERE year=2024 AND month=01 AND day=15;
-- Common partition patterns:
-- Time-based: year/month/day or year/week
-- Region-based: country/state/city
-- Category-based: department/category
-- Hybrid: year/month/day/region
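As one concrete flavor of this, here's a sketch using PostgreSQL's declarative range partitioning; lake engines such as Hive, Spark, or Athena express the same idea through the directory layout shown above (the table and partition names here are illustrative):
-- Time-based partitioning: date filters prune partitions instead of scanning everything
CREATE TABLE orders_partitioned (
order_id BIGINT,
customer_id VARCHAR(50),
order_date DATE NOT NULL,
total_amount DECIMAL(10,2)
) PARTITION BY RANGE (order_date);
CREATE TABLE orders_2024_01 PARTITION OF orders_partitioned
FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
-- Touches only the January 2024 partition
SELECT * FROM orders_partitioned WHERE order_date = '2024-01-15';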
Real-Time Analytics Modeling
Priya: For real-time systems (Kafka, Flink, Kinesis), we use event-based modeling.
Event Schema:
{
"event_id": "evt_12345",
"event_type": "ORDER_PLACED",
"event_timestamp": "2024-01-15T10:30:45Z",
"user_id": "CUST123",
"session_id": "sess_xyz",
"payload": {
"order_id": "ORD1001",
"items": [...],
"total_amount": 99800
},
"metadata": {
"source": "mobile-app",
"version": "2.5.0",
"device_type": "iOS"
}
}
Wide Table for Analytics (denormalized):
CREATE TABLE events_wide (
event_id VARCHAR(50) PRIMARY KEY,
event_type VARCHAR(50),
event_timestamp TIMESTAMP,
-- User context (denormalized)
user_id VARCHAR(50),
user_name VARCHAR(200),
user_segment VARCHAR(50),
-- Session context
session_id VARCHAR(50),
device_type VARCHAR(20),
app_version VARCHAR(20),
-- Event-specific fields (sparse, most will be NULL)
order_id VARCHAR(50),
product_id VARCHAR(50),
page_name VARCHAR(100),
search_term VARCHAR(255),
amount DECIMAL(10,2),
-- Pre-aggregated fields
items_in_cart INTEGER,
session_duration INTEGER,
-- Ingestion metadata
ingestion_timestamp TIMESTAMP,
source_system VARCHAR(50)
);
Lambda Architecture Pattern:
Real-Time Layer (Speed)
Source → Kafka → Flink/Spark Streaming → Real-time aggregates
↓
Data Lake
↓
Batch Processing (Daily) → Historical aggregates
↓
Serving Layer (Merge real-time + batch)
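A minimal sketch of that serving-layer merge, assuming the events_wide table above as the speed layer and a hypothetical daily_order_totals table produced by the batch job:
-- Serving layer: pre-computed batch history unioned with today's streaming events
SELECT event_date, SUM(amount) AS total_amount
FROM (
SELECT order_date AS event_date, total_amount AS amount
FROM daily_order_totals -- batch layer (recomputed daily, up to yesterday)
WHERE order_date < CURRENT_DATE
UNION ALL
SELECT CAST(event_timestamp AS DATE) AS event_date, amount
FROM events_wide -- speed layer (today's events only)
WHERE event_type = 'ORDER_PLACED'
AND event_timestamp >= CURRENT_DATE
) combined
GROUP BY event_date;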
Best Practices and Common Pitfalls
Raj: This has been incredibly comprehensive! Any final advice on best practices?
Priya: Absolutely! Let me share the lessons I've learned over the years.
Best Practices
1. Start Simple, Evolve
Don't do this:
Day 1: Build complex Data Vault with 100 tables
Do this:
Day 1: Build simple normalized schema
Week 2: Add dimensional model for reporting
Month 3: Add Data Vault if needed
2. Name Consistently
-- Bad naming
CREATE TABLE tbl1 (id, nm, dt);
CREATE TABLE customers_new_final_v2 (...);
-- Good naming
CREATE TABLE dim_customer (
customer_key INTEGER, -- dimension_prefix + _key for DW
customer_id VARCHAR(50), -- natural key
customer_name VARCHAR(200), -- descriptive
created_date DATE -- include date/timestamp suffix
);
CREATE TABLE fact_sales (
sales_fact_id BIGINT, -- fact_prefix + _fact_id
date_key INTEGER, -- dimension + _key
quantity_sold INTEGER, -- measure with clear name
...
);
3. Document Everything
-- Add comments to tables and columns
COMMENT ON TABLE dim_customer IS 'Customer dimension with SCD Type 2 for address changes';
COMMENT ON COLUMN dim_customer.effective_date IS 'Date this version became active';
COMMENT ON COLUMN fact_sales.total_amount IS 'Final amount after discounts and tax (in INR)';
4. Use Surrogate Keys in Dimensional Models
-- Always use surrogate keys for dimensions
CREATE TABLE dim_product (
product_key INTEGER PRIMARY KEY, -- Surrogate (stable, fast)
product_id VARCHAR(50) NOT NULL UNIQUE, -- Natural (business meaning)
...
);
5. Separate Concerns
-- Don't mix transactional and analytical in one database
OLTP Database (PostgreSQL) → ETL → OLAP Database (Snowflake/BigQuery)
-- Don't mix raw and curated data
/raw/ (landing zone) → /processed/ (cleaned) → /curated/ (business-ready)
Common Pitfalls
Pitfall 1: Over-Normalization in OLTP
-- Too normalized (bad)
CREATE TABLE orders (..., payment_status_id INTEGER);
CREATE TABLE payment_status (status_id INTEGER, status_name VARCHAR(20));
-- Only 5 values: Pending, Processing, Complete, Failed, Refunded
-- Better
CREATE TABLE orders (
...,
payment_status VARCHAR(20) CHECK (payment_status IN
('Pending','Processing','Complete','Failed','Refunded'))
);
Pitfall 2: Not Planning for History
-- Bad: Can't track history
CREATE TABLE products (
product_id INTEGER PRIMARY KEY,
price DECIMAL(10,2) -- What if price changes?
);
-- Good: Plan for history
CREATE TABLE products (
product_id INTEGER,
price DECIMAL(10,2),
effective_date DATE,
expiry_date DATE,
is_current BOOLEAN,
PRIMARY KEY (product_id, effective_date)
);
Pitfall 3: Ignoring NULL Semantics
-- Problem: NULL in aggregations
SELECT AVG(rating) FROM products;
-- NULLs are excluded, might not be what you want
-- Better: Be explicit
SELECT
AVG(COALESCE(rating, 0)) as avg_rating,
AVG(rating) as avg_non_null_rating,
COUNT(*) as total_products,
COUNT(rating) as products_with_rating
FROM products;
Pitfall 4: Poor Indexing
-- Missing indexes on foreign keys
CREATE TABLE orders (
order_id INTEGER PRIMARY KEY,
customer_id INTEGER -- No index! Joins will be slow
);
-- Better
CREATE TABLE orders (
order_id INTEGER PRIMARY KEY,
customer_id INTEGER REFERENCES customers(customer_id)
);
CREATE INDEX idx_orders_customer ON orders(customer_id);
-- Also index filter columns
CREATE INDEX idx_orders_date ON orders(order_date);
CREATE INDEX idx_orders_status ON orders(order_status);
Pitfall 5: Not Considering Query Patterns
-- If you frequently query: "Orders in last 30 days by status"
-- Composite index is better than separate indexes
CREATE INDEX idx_orders_date_status ON orders(order_date, order_status);
-- If the table is partitioned by date, you usually don't need date as the leading index column
CREATE TABLE orders (...) PARTITION BY RANGE (order_date);
-- Partition pruning already narrows scans by order_date; an index on (order_status) within each partition is usually enough
Pitfall 6: Storing Derived Data Without Update Strategy
-- Bad: Stored calculated field with no trigger/update mechanism
CREATE TABLE customers (
customer_id INTEGER,
total_orders INTEGER, -- How is this updated?
total_spent DECIMAL(12,2) -- What if order is refunded?
);
-- Better: Use view or materialized view
CREATE MATERIALIZED VIEW customer_summary AS
SELECT
c.customer_id,
c.customer_name,
COUNT(o.order_id) as total_orders,
SUM(o.total_amount) as total_spent
FROM customers c
LEFT JOIN orders o ON c.customer_id = o.customer_id
AND o.order_status != 'Cancelled' -- filter inside the join so customers with no orders still appear
GROUP BY c.customer_id, c.customer_name;
REFRESH MATERIALIZED VIEW customer_summary; -- Refresh on schedule
Conclusion: Choosing Your Path
Raj: Priya, my brain is full! But I feel like I finally understand data modeling. How do I decide which approach to use for my next project?
Priya: Great question! Let me give you a decision framework:
Decision Tree
START: What type of system are you building?
├─ Small Application (< 100k rows)
│ └─ Use: Normalized (3NF) database
│ └─ Tools: PostgreSQL, MySQL
│ └─ Keep it simple!
│
├─ Transactional System (OLTP)
│ ├─ Frequent updates/inserts
│ ├─ Data integrity critical
│ └─ Use: Normalized (3NF) database
│ └─ Tools: PostgreSQL, Oracle, SQL Server
│
├─ Reporting & Analytics (OLAP)
│ ├─ Read-heavy workload
│ ├─ Complex queries
│ ├─ Business users need self-service
│ └─ Use: Dimensional Model (Star Schema)
│ └─ Tools: Snowflake, BigQuery, Redshift
│
├─ Enterprise Data Warehouse
│ ├─ Multiple source systems
│ ├─ Audit requirements
│ ├─ Frequent schema changes
│ └─ Use: Data Vault (raw) + Star Schema (presentation)
│ └─ Tools: Snowflake, Databricks, Azure Synapse
│
├─ Real-Time Analytics
│ ├─ Streaming data
│ ├─ Low latency requirements
│ └─ Use: Event-based wide tables
│ └─ Tools: Apache Kafka, Flink, ClickHouse
│
└─ Document/Unstructured Data
├─ Flexible schema
├─ Nested/hierarchical data
└─ Use: Document model
└─ Tools: MongoDB, Cosmos DB
Final Checklist
Before You Design:
- [ ] Understand the business requirements
- [ ] Identify key entities and relationships
- [ ] Determine OLTP vs OLAP workload
- [ ] Assess data volume and growth
- [ ] Consider query patterns
- [ ] Identify audit/compliance needs
- [ ] Evaluate team skills
Design Principles:
- [ ] Start with conceptual model (business view)
- [ ] Move to logical model (database-independent)
- [ ] Finalize physical model (specific database)
- [ ] Use consistent naming conventions
- [ ] Document grain of fact tables
- [ ] Plan for slowly changing dimensions
- [ ] Add appropriate indexes
- [ ] Consider partitioning for large tables
Before You Deploy:
- [ ] Create sample data and test queries
- [ ] Validate with business users
- [ ] Load test with realistic volumes
- [ ] Document the model
- [ ] Plan for evolution/changes
Raj: This is perfect! One last question—what if I make mistakes in my model?
Priya: You will make mistakes—everyone does! The key is:
- Start with a clear understanding: Spend time upfront understanding the business
- Build iteratively: Don't try to design everything perfectly on day one
- Get feedback early: Show your model to business users and fellow developers
- Monitor and refine: Track query performance and adjust
- Don't be afraid to refactor: Sometimes you need to redesign. That's okay!
Raj: Thank you so much, Priya! I feel equipped to handle data modeling now.
Priya: You're welcome, Raj! Remember, data modeling is both science and art. The science is the techniques we discussed—normalization, dimensional modeling, keys, relationships. The art is knowing when to apply what, based on your specific context. Keep practicing, keep learning, and don't hesitate to reach out!
Raj: Definitely! Next chai is on me! 😊
Quick Reference Guide
Normalization Forms
| Form | Rule | Example Violation |
|---|---|---|
| 1NF | Atomic values only | "red,blue,green" in one cell |
| 2NF | No partial dependencies | Non-key depends on part of composite key |
| 3NF | No transitive dependencies | Non-key depends on another non-key |
| BCNF | Every determinant is a candidate key | Instructor → Course but Instructor not a key |
Key Types
| Key Type | Purpose | Example |
|---|---|---|
| Primary Key | Unique identifier | customer_id |
| Foreign Key | Link to another table | orders.customer_id → customers.customer_id |
| Surrogate Key | Artificial identifier | customer_key (auto-increment) |
| Natural Key | Business identifier | email, PAN number |
| Composite Key | Multiple columns as key | (student_id, course_id, semester) |
SCD Types Summary
| Type | History | Storage | Use Case |
|---|---|---|---|
| Type 0 | None | Original value never changes | Birth date, SSN |
| Type 1 | None | Overwrite | Current phone, email |
| Type 2 | Full | New row for each change | Address, segment |
| Type 3 | Limited | Previous value column | Price (current + previous) |
| Type 6 | Hybrid | Combination of 1+2+3 | Complex requirements |
When to Use What
| Scenario | Approach |
|---|---|
| Small app, frequent updates | 3NF (Normalized) |
| Reporting, complex queries | Star Schema |
| Multiple sources, audit needs | Data Vault |
| Real-time analytics | Wide tables, Event-based |
| Document/hierarchical data | NoSQL (MongoDB) |
| Time-series data | Columnar (ClickHouse) |
Remember: There's no single "right" answer in data modeling. The best model is the one that meets your requirements, performs well, and your team can maintain!
Happy Modeling! 🚀📊