DEV Community

Cover image for Part 1: Database Concepts & Architecture
Data Tech Bridge
Data Tech Bridge

Posted on

Part 1: Database Concepts & Architecture

Data Modeling (Advanced)

1. Explain the difference between Star Schema and Snowflake Schema. When would you choose one over the other?

Answer:

Star Schema:

  • Central fact table connected directly to denormalized dimension tables
  • Dimension tables are not normalized (contain redundant data)
  • Simpler queries with fewer joins
  • Better query performance
  • Larger storage footprint
  • Easier for end users to understand

Snowflake Schema:

  • Dimension tables are normalized into multiple related tables
  • Reduces data redundancy
  • More complex queries with more joins
  • Slower query performance
  • Smaller storage footprint
  • More complex for end users

When to Choose:

Star Schema:

  • Query performance is critical
  • Storage is not a constraint
  • Simpler BI tool integration needed
  • End users need direct access
  • Example: Sales reporting dashboard

Snowflake Schema:

  • Storage optimization is important
  • Data integrity and normalization are priorities
  • Complex hierarchical dimensions exist
  • ETL processes can handle complexity
  • Example: Large enterprise data warehouse with complex product hierarchies

Real-world Example:

-- Star Schema
FactSales (SalesID, DateKey, ProductKey, CustomerKey, Amount)
DimProduct (ProductKey, ProductName, Category, SubCategory, Brand, Supplier)
DimCustomer (CustomerKey, Name, City, State, Country, Region)

-- Snowflake Schema
FactSales (SalesID, DateKey, ProductKey, CustomerKey, Amount)
DimProduct (ProductKey, ProductName, CategoryKey, BrandKey)
DimCategory (CategoryKey, Category, SubCategory)
DimBrand (BrandKey, Brand, SupplierKey)
DimSupplier (SupplierKey, SupplierName, Country)
Enter fullscreen mode Exit fullscreen mode

2. How would you design a data model for a multi-tenant SaaS application? What are the key considerations?

Answer:

Three Main Approaches:

A. Shared Schema (Single Database, Shared Tables)

CREATE TABLE Customers (
    CustomerID INT PRIMARY KEY,
    TenantID INT NOT NULL,  -- Discriminator column
    CustomerName VARCHAR(100),
    Email VARCHAR(100),
    CONSTRAINT FK_Tenant FOREIGN KEY (TenantID) REFERENCES Tenants(TenantID)
);

CREATE INDEX IX_Customers_TenantID ON Customers(TenantID);

-- Row Level Security (SQL Server 2016+)
CREATE SECURITY POLICY TenantFilter
ADD FILTER PREDICATE dbo.fn_TenantAccessPredicate(TenantID)
ON dbo.Customers
WITH (STATE = ON);
Enter fullscreen mode Exit fullscreen mode

Pros:

  • Cost-effective (single database)
  • Easy maintenance and upgrades
  • Efficient resource utilization

Cons:

  • Security concerns (data leakage risk)
  • Difficult to customize per tenant
  • Noisy neighbor problem
  • Scaling challenges

B. Separate Schema (Single Database, Multiple Schemas)

-- Tenant 1
CREATE SCHEMA Tenant_1;
CREATE TABLE Tenant_1.Customers (
    CustomerID INT PRIMARY KEY,
    CustomerName VARCHAR(100),
    Email VARCHAR(100)
);

-- Tenant 2
CREATE SCHEMA Tenant_2;
CREATE TABLE Tenant_2.Customers (
    CustomerID INT PRIMARY KEY,
    CustomerName VARCHAR(100),
    Email VARCHAR(100)
);

-- Application logic
SET @schema = 'Tenant_' + @TenantID;
EXEC('SELECT * FROM ' + @schema + '.Customers');
Enter fullscreen mode Exit fullscreen mode

Pros:

  • Better isolation than shared schema
  • Easier per-tenant customization
  • Simpler queries (no TenantID filter)

Cons:

  • Schema management complexity
  • Limited scalability
  • Backup/restore complexity

C. Separate Database (Database per Tenant)

-- Tenant1DB
CREATE DATABASE Tenant_1_DB;
USE Tenant_1_DB;
CREATE TABLE Customers (...);

-- Tenant2DB
CREATE DATABASE Tenant_2_DB;
USE Tenant_2_DB;
CREATE TABLE Customers (...);

-- Connection string routing
string connectionString = GetConnectionString(tenantId);
Enter fullscreen mode Exit fullscreen mode

Pros:

  • Maximum isolation and security
  • Easy per-tenant customization
  • Independent scaling
  • Simplified compliance (data residency)

Cons:

  • Highest cost
  • Complex maintenance
  • Resource intensive

Key Considerations:

  1. Security & Isolation:
-- Implement Row-Level Security
CREATE FUNCTION dbo.fn_TenantAccessPredicate(@TenantID INT)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN SELECT 1 AS result
WHERE @TenantID = CAST(SESSION_CONTEXT(N'TenantID') AS INT);
Enter fullscreen mode Exit fullscreen mode
  1. Performance:
-- Partition by TenantID for large tables
CREATE PARTITION FUNCTION PF_TenantID (INT)
AS RANGE RIGHT FOR VALUES (1, 2, 3, 4, 5);

CREATE PARTITION SCHEME PS_TenantID
AS PARTITION PF_TenantID ALL TO ([PRIMARY]);

CREATE TABLE Orders (
    OrderID INT,
    TenantID INT,
    OrderDate DATE,
    Amount DECIMAL(10,2)
) ON PS_TenantID(TenantID);
Enter fullscreen mode Exit fullscreen mode
  1. Scalability:
  2. Implement sharding strategy
  3. Use read replicas for reporting
  4. Consider hybrid approach (small tenants shared, large tenants isolated)

  5. Compliance:

  6. Data residency requirements

  7. GDPR, HIPAA considerations

  8. Audit logging per tenant

Recommended Hybrid Approach:

-- Metadata database (shared)
CREATE TABLE Tenants (
    TenantID INT PRIMARY KEY,
    TenantName VARCHAR(100),
    TierLevel VARCHAR(20), -- 'Free', 'Premium', 'Enterprise'
    DatabaseName VARCHAR(100), -- NULL for shared, specific for isolated
    IsActive BIT
);

-- Small tenants: Shared schema with TenantID
-- Large tenants: Separate database
Enter fullscreen mode Exit fullscreen mode

3. Explain Slowly Changing Dimensions (SCD). Describe Type 1, 2, and 3 with practical examples.

Answer:

Type 1 SCD - Overwrite (No History)

Simply update the existing record, losing historical data.

-- Initial state
CustomerID | CustomerName | City      | State
1          | John Doe     | Boston    | MA

-- Customer moves to New York
UPDATE DimCustomer
SET City = 'New York',
    State = 'NY',
    LastModifiedDate = GETDATE()
WHERE CustomerID = 1;

-- Result (history lost)
CustomerID | CustomerName | City      | State
1          | John Doe     | New York  | NY
Enter fullscreen mode Exit fullscreen mode

Use Cases:

  • Correcting data errors
  • Attributes that don't require history (e.g., email format changes)
  • Non-critical dimensional attributes

Implementation:

CREATE PROCEDURE usp_UpdateCustomerType1
    @CustomerID INT,
    @NewCity VARCHAR(50),
    @NewState VARCHAR(2)
AS
BEGIN
    UPDATE DimCustomer
    SET City = @NewCity,
        State = @NewState,
        LastModifiedDate = GETDATE(),
        ModifiedBy = SYSTEM_USER
    WHERE CustomerID = @CustomerID;
END;
Enter fullscreen mode Exit fullscreen mode

Type 2 SCD - Add New Row (Full History)

Create a new record for each change, maintaining complete history.

-- Table structure
CREATE TABLE DimCustomer (
    CustomerKey INT IDENTITY(1,1) PRIMARY KEY,  -- Surrogate key
    CustomerID INT NOT NULL,                     -- Natural key
    CustomerName VARCHAR(100),
    City VARCHAR(50),
    State VARCHAR(2),
    EffectiveDate DATE NOT NULL,
    EndDate DATE NULL,
    IsCurrent BIT NOT NULL DEFAULT 1,
    RowVersion INT NOT NULL DEFAULT 1
);

-- Initial record
CustomerKey | CustomerID | CustomerName | City    | State | EffectiveDate | EndDate    | IsCurrent
1           | 1          | John Doe     | Boston  | MA    | 2020-01-01    | NULL       | 1

-- Customer moves (insert new record, update old)
BEGIN TRANSACTION;

-- Close old record
UPDATE DimCustomer
SET EndDate = '2023-06-30',
    IsCurrent = 0
WHERE CustomerID = 1 AND IsCurrent = 1;

-- Insert new record
INSERT INTO DimCustomer (CustomerID, CustomerName, City, State, EffectiveDate, IsCurrent, RowVersion)
VALUES (1, 'John Doe', 'New York', 'NY', '2023-07-01', 1, 2);

COMMIT TRANSACTION;

-- Result
CustomerKey | CustomerID | CustomerName | City     | State | EffectiveDate | EndDate    | IsCurrent | RowVersion
1           | 1          | John Doe     | Boston   | MA    | 2020-01-01    | 2023-06-30 | 0         | 1
2           | 1          | John Doe     | New York | NY    | 2023-07-01    | NULL       | 1         | 2
Enter fullscreen mode Exit fullscreen mode

Querying Type 2:

-- Get current record
SELECT * FROM DimCustomer
WHERE CustomerID = 1 AND IsCurrent = 1;

-- Get historical record at specific date
SELECT * FROM DimCustomer
WHERE CustomerID = 1
  AND '2023-03-15' BETWEEN EffectiveDate AND COALESCE(EndDate, '9999-12-31');

-- Join fact table with dimension (point-in-time)
SELECT 
    f.SalesDate,
    f.Amount,
    d.City,
    d.State
FROM FactSales f
INNER JOIN DimCustomer d 
    ON f.CustomerKey = d.CustomerKey  -- Use surrogate key
WHERE f.SalesDate BETWEEN d.EffectiveDate AND COALESCE(d.EndDate, '9999-12-31');
Enter fullscreen mode Exit fullscreen mode

Use Cases:

  • Customer address changes
  • Product price changes
  • Employee department transfers
  • Any attribute requiring full audit trail

Type 3 SCD - Add New Column (Limited History)

Add columns to track previous value(s).

CREATE TABLE DimCustomer (
    CustomerID INT PRIMARY KEY,
    CustomerName VARCHAR(100),
    CurrentCity VARCHAR(50),
    CurrentState VARCHAR(2),
    PreviousCity VARCHAR(50),
    PreviousState VARCHAR(2),
    EffectiveDate DATE,
    PreviousEffectiveDate DATE
);

-- Initial state
CustomerID | CustomerName | CurrentCity | CurrentState | PreviousCity | PreviousState
1          | John Doe     | Boston      | MA           | NULL         | NULL

-- Customer moves
UPDATE DimCustomer
SET PreviousCity = CurrentCity,
    PreviousState = CurrentState,
    PreviousEffectiveDate = EffectiveDate,
    CurrentCity = 'New York',
    CurrentState = 'NY',
    EffectiveDate = '2023-07-01'
WHERE CustomerID = 1;

-- Result
CustomerID | CustomerName | CurrentCity | CurrentState | PreviousCity | PreviousState
1          | John Doe     | New York    | NY           | Boston       | MA
Enter fullscreen mode Exit fullscreen mode

Use Cases:

  • Tracking one or two previous values
  • Comparing current vs. previous (e.g., sales territory changes)
  • Limited history requirements

Comparison Query:

-- Analyze customers who changed cities
SELECT 
    CustomerID,
    CustomerName,
    PreviousCity,
    CurrentCity,
    DATEDIFF(DAY, PreviousEffectiveDate, EffectiveDate) AS DaysSinceChange
FROM DimCustomer
WHERE PreviousCity IS NOT NULL
  AND PreviousCity <> CurrentCity;
Enter fullscreen mode Exit fullscreen mode

Hybrid Approach (Type 2 + Type 1):

CREATE TABLE DimProduct (
    ProductKey INT IDENTITY(1,1) PRIMARY KEY,
    ProductID INT NOT NULL,
    ProductName VARCHAR(100),        -- Type 1 (overwrite)
    ProductDescription TEXT,         -- Type 1 (overwrite)
    Price DECIMAL(10,2),            -- Type 2 (track history)
    Category VARCHAR(50),           -- Type 2 (track history)
    EffectiveDate DATE NOT NULL,
    EndDate DATE NULL,
    IsCurrent BIT NOT NULL
);

-- Update procedure
CREATE PROCEDURE usp_UpdateProduct
    @ProductID INT,
    @NewName VARCHAR(100),
    @NewDescription TEXT,
    @NewPrice DECIMAL(10,2),
    @NewCategory VARCHAR(50)
AS
BEGIN
    DECLARE @CurrentPrice DECIMAL(10,2);
    DECLARE @CurrentCategory VARCHAR(50);

    -- Get current values
    SELECT @CurrentPrice = Price, @CurrentCategory = Category
    FROM DimProduct
    WHERE ProductID = @ProductID AND IsCurrent = 1;

    -- Check if Type 2 attributes changed
    IF @NewPrice <> @CurrentPrice OR @NewCategory <> @CurrentCategory
    BEGIN
        -- Type 2: Create new record
        BEGIN TRANSACTION;

        UPDATE DimProduct
        SET EndDate = GETDATE(), IsCurrent = 0
        WHERE ProductID = @ProductID AND IsCurrent = 1;

        INSERT INTO DimProduct (ProductID, ProductName, ProductDescription, Price, Category, EffectiveDate, IsCurrent)
        VALUES (@ProductID, @NewName, @NewDescription, @NewPrice, @NewCategory, GETDATE(), 1);

        COMMIT TRANSACTION;
    END
    ELSE
    BEGIN
        -- Type 1: Update existing record
        UPDATE DimProduct
        SET ProductName = @NewName,
            ProductDescription = @NewDescription
        WHERE ProductID = @ProductID AND IsCurrent = 1;
    END
END;
Enter fullscreen mode Exit fullscreen mode

Performance Considerations:

-- Indexes for Type 2 SCD
CREATE INDEX IX_DimCustomer_CustomerID_IsCurrent 
ON DimCustomer(CustomerID, IsCurrent) 
INCLUDE (CustomerName, City, State);

CREATE INDEX IX_DimCustomer_EffectiveDate_EndDate 
ON DimCustomer(EffectiveDate, EndDate) 
INCLUDE (CustomerID);

-- Partitioning for large Type 2 tables
CREATE PARTITION FUNCTION PF_SCD_Date (DATE)
AS RANGE RIGHT FOR VALUES 
('2020-01-01', '2021-01-01', '2022-01-01', '2023-01-01');

CREATE PARTITION SCHEME PS_SCD_Date
AS PARTITION PF_SCD_Date ALL TO ([PRIMARY]);

CREATE TABLE DimCustomer_Partitioned (
    CustomerKey INT IDENTITY(1,1),
    CustomerID INT NOT NULL,
    CustomerName VARCHAR(100),
    City VARCHAR(50),
    State VARCHAR(2),
    EffectiveDate DATE NOT NULL,
    EndDate DATE NULL,
    IsCurrent BIT NOT NULL
) ON PS_SCD_Date(EffectiveDate);
Enter fullscreen mode Exit fullscreen mode

4. What is Data Vault modeling? How does it differ from traditional dimensional modeling?

Answer:

Data Vault 2.0 Components:

1. Hubs - Core Business Entities

-- Hub: Stores unique business keys
CREATE TABLE Hub_Customer (
    CustomerHashKey BINARY(16) PRIMARY KEY,  -- MD5/SHA hash of business key
    CustomerID VARCHAR(50) NOT NULL,         -- Natural business key
    LoadDate DATETIME NOT NULL,
    RecordSource VARCHAR(50) NOT NULL,
    CONSTRAINT UQ_Hub_Customer_ID UNIQUE (CustomerID)
);

-- Example data
CustomerHashKey                  | CustomerID | LoadDate            | RecordSource
0x1A2B3C4D5E6F7G8H9I0J1K2L3M4N | CUST001    | 2023-01-15 10:30:00 | CRM_System
Enter fullscreen mode Exit fullscreen mode

2. Links - Relationships Between Hubs

-- Link: Represents many-to-many relationships
CREATE TABLE Link_CustomerOrder (
    CustomerOrderHashKey BINARY(16) PRIMARY KEY,
    CustomerHashKey BINARY(16) NOT NULL,
    OrderHashKey BINARY(16) NOT NULL,
    LoadDate DATETIME NOT NULL,
    RecordSource VARCHAR(50) NOT NULL,
    CONSTRAINT FK_Link_Customer FOREIGN KEY (CustomerHashKey) 
        REFERENCES Hub_Customer(CustomerHashKey),
    CONSTRAINT FK_Link_Order FOREIGN KEY (OrderHashKey) 
        REFERENCES Hub_Order(OrderHashKey)
);
Enter fullscreen mode Exit fullscreen mode

3. Satellites - Descriptive Attributes

-- Satellite: Stores descriptive attributes and tracks history
CREATE TABLE Sat_Customer (
    CustomerHashKey BINARY(16) NOT NULL,
    LoadDate DATETIME NOT NULL,
    LoadEndDate DATETIME NULL,
    CustomerName VARCHAR(100),
    Email VARCHAR(100),
    Phone VARCHAR(20),
    Address VARCHAR(200),
    City VARCHAR(50),
    State VARCHAR(2),
    HashDiff BINARY(16) NOT NULL,  -- Hash of all attributes
    RecordSource VARCHAR(50) NOT NULL,
    PRIMARY KEY (CustomerHashKey, LoadDate),
    CONSTRAINT FK_Sat_Customer FOREIGN KEY (CustomerHashKey) 
        REFERENCES Hub_Customer(CustomerHashKey)
);

-- Multiple satellites for different change rates
CREATE TABLE Sat_Customer_ContactInfo (
    CustomerHashKey BINARY(16) NOT NULL,
    LoadDate DATETIME NOT NULL,
    Email VARCHAR(100),
    Phone VARCHAR(20),
    HashDiff BINARY(16) NOT NULL,
    PRIMARY KEY (CustomerHashKey, LoadDate)
);

CREATE TABLE Sat_Customer_Demographics (
    CustomerHashKey BINARY(16) NOT NULL,
    LoadDate DATETIME NOT NULL,
    DateOfBirth DATE,
    Gender VARCHAR(10),
    MaritalStatus VARCHAR(20),
    HashDiff BINARY(16) NOT NULL,
    PRIMARY KEY (CustomerHashKey, LoadDate)
);
Enter fullscreen mode Exit fullscreen mode

Complete Example:

-- Hub: Customer
CREATE TABLE Hub_Customer (
    CustomerHashKey BINARY(16) PRIMARY KEY,
    CustomerID VARCHAR(50) NOT NULL UNIQUE,
    LoadDate DATETIME NOT NULL,
    RecordSource VARCHAR(50) NOT NULL
);

-- Hub: Product
CREATE TABLE Hub_Product (
    ProductHashKey BINARY(16) PRIMARY KEY,
    ProductID VARCHAR(50) NOT NULL UNIQUE,
    LoadDate DATETIME NOT NULL,
    RecordSource VARCHAR(50) NOT NULL
);

-- Link: Customer-Product (Purchase)
CREATE TABLE Link_Purchase (
    PurchaseHashKey BINARY(16) PRIMARY KEY,
    CustomerHashKey BINARY(16) NOT NULL,
    ProductHashKey BINARY(16) NOT NULL,
    LoadDate DATETIME NOT NULL,
    RecordSource VARCHAR(50) NOT NULL
);

-- Satellite: Purchase Details
CREATE TABLE Sat_Purchase (
    PurchaseHashKey BINARY(16) NOT NULL,
    LoadDate DATETIME NOT NULL,
    LoadEndDate DATETIME NULL,
    PurchaseDate DATE,
    Quantity INT,
    UnitPrice DECIMAL(10,2),
    TotalAmount DECIMAL(10,2),
    HashDiff BINARY(16) NOT NULL,
    RecordSource VARCHAR(50) NOT NULL,
    PRIMARY KEY (PurchaseHashKey, LoadDate)
);

-- Loading data with hash keys
INSERT INTO Hub_Customer (CustomerHashKey, CustomerID, LoadDate, RecordSource)
SELECT 
    HASHBYTES('MD5', CustomerID) AS CustomerHashKey,
    CustomerID,
    GETDATE(),
    'CRM_System'
FROM SourceCustomers
WHERE NOT EXISTS (
    SELECT 1 FROM Hub_Customer h 
    WHERE h.CustomerID = SourceCustomers.CustomerID
);
Enter fullscreen mode Exit fullscreen mode

Data Vault vs. Dimensional Modeling:

Aspect Data Vault Dimensional (Kimball)
Purpose Enterprise data warehouse, raw data vault Business intelligence, analytics
Structure Hubs, Links, Satellites Facts, Dimensions
Normalization Highly normalized Denormalized (Star) or Normalized (Snowflake)
History Full history by design Type 2 SCD for history
Flexibility Highly flexible, easy to add sources Less flexible, schema changes complex
Query Performance Slower (many joins) Faster (fewer joins)
Load Performance Very fast (parallel loads) Slower (lookups, updates)
Auditability Excellent (full lineage) Limited
Business Keys Preserved Surrogate keys
Use Case Data integration layer Presentation layer

Practical Implementation:

-- ETL Process for Data Vault

-- Step 1: Load Hub (if not exists)
MERGE INTO Hub_Customer AS target
USING (
    SELECT DISTINCT
        HASHBYTES('MD5', CustomerID) AS CustomerHashKey,
        CustomerID,
        GETDATE() AS LoadDate,
        'CRM_System' AS RecordSource
    FROM Staging_Customers
) AS source
ON target.CustomerHashKey = source.CustomerHashKey
WHEN NOT MATCHED THEN
    INSERT (CustomerHashKey, CustomerID, LoadDate, RecordSource)
    VALUES (source.CustomerHashKey, source.CustomerID, source.LoadDate, source.RecordSource);

-- Step 2: Load Satellite (track changes)
INSERT INTO Sat_Customer (CustomerHashKey, LoadDate, CustomerName, Email, Phone, HashDiff, RecordSource)
SELECT 
    h.CustomerHashKey,
    GETDATE() AS LoadDate,
    s.CustomerName,
    s.Email,
    s.Phone,
    HASHBYTES('MD5', CONCAT(s.CustomerName, s.Email, s.Phone)) AS HashDiff,
    'CRM_System' AS RecordSource
FROM Staging_Customers s
INNER JOIN Hub_Customer h ON s.CustomerID = h.CustomerID
WHERE NOT EXISTS (
    SELECT 1 
    FROM Sat_Customer sat
    WHERE sat.CustomerHashKey = h.CustomerHashKey
      AND sat.HashDiff = HASHBYTES('MD5', CONCAT(s.CustomerName, s.Email, s.Phone))
      AND sat.LoadEndDate IS NULL
);

-- Step 3: Close previous satellite records
UPDATE sat
SET LoadEndDate = GETDATE()
FROM Sat_Customer sat
INNER JOIN (
    SELECT CustomerHashKey, MAX(LoadDate) AS MaxLoadDate
    FROM Sat_Customer
    WHERE LoadEndDate IS NULL
    GROUP BY CustomerHashKey
) latest ON sat.CustomerHashKey = latest.CustomerHashKey 
         AND sat.LoadDate < latest.MaxLoadDate
WHERE sat.LoadEndDate IS NULL;
Enter fullscreen mode Exit fullscreen mode

Querying Data Vault (Point-in-Time):

-- Create view for business users (Information Mart)
CREATE VIEW vw_Customer_Current AS
SELECT 
    h.CustomerID,
    s.CustomerName,
    s.Email,
    s.Phone,
    s.LoadDate AS EffectiveDate
FROM Hub_Customer h
INNER JOIN Sat_Customer s ON h.CustomerHashKey = s.CustomerHashKey
WHERE s.LoadEndDate IS NULL;  -- Current records only

-- Historical query
CREATE FUNCTION fn_GetCustomerAtDate(@AsOfDate DATETIME)
RETURNS TABLE
AS
RETURN
(
    SELECT 
        h.CustomerID,
        s.CustomerName,
        s.Email,
        s.Phone
    FROM Hub_Customer h
    INNER JOIN Sat_Customer s ON h.CustomerHashKey = s.CustomerHashKey
    WHERE @AsOfDate >= s.LoadDate 
      AND (@AsOfDate < s.LoadEndDate OR s.LoadEndDate IS NULL)
);

-- Usage
SELECT * FROM fn_GetCustomerAtDate('2023-06-15');
Enter fullscreen mode Exit fullscreen mode

Advantages of Data Vault:

  1. Auditability: Complete lineage and history
  2. Flexibility: Easy to add new data sources
  3. Parallel Loading: Hubs, Links, Satellites load independently
  4. No Data Loss: All source data preserved
  5. Agile: Incremental development

Disadvantages:

  1. Complexity: Steep learning curve
  2. Query Performance: Many joins required
  3. Storage: More storage than dimensional models
  4. Tooling: Limited BI tool support

Best Practice - Hybrid Architecture:

Source Systems → Data Vault (Raw Vault) → Information Marts (Dimensional) → BI Tools
Enter fullscreen mode Exit fullscreen mode

5. How would you handle hierarchical data in a relational database? Compare different approaches.

Answer:

Approach 1: Adjacency List (Parent-Child)

Most common and simplest approach.

CREATE TABLE Employees (
    EmployeeID INT PRIMARY KEY,
    EmployeeName VARCHAR(100),
    ManagerID INT NULL,
    Title VARCHAR(50),
    Salary DECIMAL(10,2),
    CONSTRAINT FK_Manager FOREIGN KEY (ManagerID) 
        REFERENCES Employees(EmployeeID)
);

-- Sample data
INSERT INTO Employees VALUES
(1, 'John CEO', NULL, 'CEO', 200000),
(2, 'Jane VP', 1, 'VP Sales', 150000),
(3, 'Bob VP', 1, 'VP Engineering', 150000),
(4, 'Alice Manager', 2, 'Sales Manager', 100000),
(5, 'Charlie Manager', 3, 'Engineering Manager', 100000),
(6, 'David Rep', 4, 'Sales Rep', 60000),
(7, 'Eve Engineer', 5, 'Senior Engineer', 90000);

-- Query: Get direct reports
SELECT 
    e.EmployeeID,
    e.EmployeeName,
    e.Title,
    m.EmployeeName AS ManagerName
FROM Employees e
LEFT JOIN Employees m ON e.ManagerID = m.EmployeeID;

-- Query: Get all subordinates (Recursive CTE)
WITH EmployeeHierarchy AS (
    -- Anchor: Start with a specific manager
    SELECT 
        EmployeeID,
        EmployeeName,
        ManagerID,
        Title,
        Salary,
        0 AS Level,
        CAST(EmployeeID AS VARCHAR(MAX)) AS HierarchyPath
    FROM Employees
    WHERE EmployeeID = 1  -- CEO

    UNION ALL

    -- Recursive: Get subordinates
    SELECT 
        e.EmployeeID,
        e.EmployeeName,
        e.ManagerID,
        e.Title,
        e.Salary,
        eh.Level + 1,
        CAST(eh.HierarchyPath + '/' + CAST(e.EmployeeID AS VARCHAR(10)) AS VARCHAR(MAX))
    FROM Employees e
    INNER JOIN EmployeeHierarchy eh ON e.ManagerID = eh.EmployeeID
)
SELECT 
    REPLICATE('  ', Level) + EmployeeName AS OrgChart,
    Level,
    Title,
    Salary,
    HierarchyPath
FROM EmployeeHierarchy
ORDER BY HierarchyPath;

-- Result:
-- John CEO (Level 0)
--   Jane VP (Level 1)
--     Alice Manager (Level 2)
--       David Rep (Level 3)
--   Bob VP (Level 1)
--     Charlie Manager (Level 2)
--       Eve Engineer (Level 3)
Enter fullscreen mode Exit fullscreen mode

Pros:

  • Simple to understand and implement
  • Easy to insert/update/delete
  • Referential integrity enforced

Cons:

  • Recursive queries needed for hierarchy traversal
  • Performance issues with deep hierarchies
  • No easy way to get all ancestors/descendants

Approach 2: Nested Sets (Modified Preorder Tree Traversal)

Stores left and right boundaries for each node.

CREATE TABLE Categories (
    CategoryID INT PRIMARY KEY,
    CategoryName VARCHAR(100),
    LeftValue INT NOT NULL,
    RightValue INT NOT NULL,
    CONSTRAINT UQ_Left UNIQUE (LeftValue),
    CONSTRAINT UQ_Right UNIQUE (RightValue)
);

-- Sample data (Electronics hierarchy)
INSERT INTO Categories VALUES
(1, 'Electronics', 1, 20),
(2, 'Computers', 2, 9),
(3, 'Laptops', 3, 4),
(4, 'Desktops', 5, 6),
(5, 'Tablets', 7, 8),
(6, 'TVs', 10, 15),
(7, 'LED', 11, 12),
(8, 'OLED', 13, 14),
(9, 'Audio', 16, 19),
(10, 'Headphones', 17, 18);

-- Query: Get all descendants of a node
SELECT 
    child.CategoryName,
    (COUNT(parent.CategoryID) - 1) AS Depth
FROM Categories AS child
CROSS JOIN Categories AS parent
WHERE child.LeftValue BETWEEN parent.LeftValue AND parent.RightValue
  AND parent.CategoryName = 'Electronics'
GROUP BY child.CategoryName, child.LeftValue
ORDER BY child.LeftValue;

-- Query: Get path to root
SELECT 
    parent.CategoryName
FROM Categories AS child
CROSS JOIN Categories AS parent
WHERE child.LeftValue BETWEEN parent.LeftValue AND parent.RightValue
  AND child.CategoryName = 'Laptops'
ORDER BY parent.LeftValue;

-- Query: Get immediate children
SELECT 
    child.CategoryName
FROM Categories AS parent
INNER JOIN Categories AS child 
    ON child.LeftValue > parent.LeftValue 
    AND child.RightValue < parent.RightValue
WHERE parent.CategoryName = 'Computers'
  AND NOT EXISTS (
      SELECT 1 
      FROM Categories AS intermediate
      WHERE intermediate.LeftValue > parent.LeftValue
        AND intermediate.RightValue < parent.RightValue
        AND child.LeftValue > intermediate.LeftValue
        AND child.RightValue < intermediate.RightValue
  );

-- Insert new node (complex)
-- To insert "Gaming Laptops" under "Laptops" (CategoryID 3)
DECLARE @ParentLeft INT = 3;
DECLARE @ParentRight INT = 4;

-- Make space
UPDATE Categories 
SET RightValue = RightValue + 2 
WHERE RightValue >= @ParentRight;

UPDATE Categories 
SET LeftValue = LeftValue + 2 
WHERE LeftValue > @ParentRight;

-- Insert new node
INSERT INTO Categories VALUES
(11, 'Gaming Laptops', @ParentRight, @ParentRight + 1);
Enter fullscreen mode Exit fullscreen mode

Pros:

  • Very fast reads (no recursion)
  • Easy to get all descendants/ancestors
  • Efficient for read-heavy workloads

Cons:

  • Complex inserts/updates/deletes
  • Difficult to maintain
  • Hard to understand
  • Concurrent modifications problematic

Approach 3: Closure Table (Transitive Closure)

Stores all ancestor-descendant relationships.

CREATE TABLE Categories (
    CategoryID INT PRIMARY KEY,
    CategoryName VARCHAR(100)
);

CREATE TABLE CategoryClosure (
    AncestorID INT NOT NULL,
    DescendantID INT NOT NULL,
    Depth INT NOT NULL,
    PRIMARY KEY (AncestorID, DescendantID),
    CONSTRAINT FK_Ancestor FOREIGN KEY (AncestorID) 
        REFERENCES Categories(CategoryID),
    CONSTRAINT FK_Descendant FOREIGN KEY (DescendantID) 
        REFERENCES Categories(CategoryID)
);

-- Sample data
INSERT INTO Categories VALUES
(1, 'Electronics'),
(2, 'Computers'),
(3, 'Laptops'),
(4, 'Desktops'),
(5, 'TVs');

-- Insert closure records (including self-references)
INSERT INTO CategoryClosure VALUES
-- Self-references
(1, 1, 0), (2, 2, 0), (3, 3, 0), (4, 4, 0), (5, 5, 0),
-- Electronics -> Computers
(1, 2, 1),
-- Electronics -> Laptops (through Computers)
(1, 3, 2), (2, 3, 1),
-- Electronics -> Desktops (through Computers)
(1, 4, 2), (2, 4, 1),
-- Electronics -> TVs
(1, 5, 1);

-- Query: Get all descendants
SELECT 
    c.CategoryName,
    cc.Depth
FROM CategoryClosure cc
INNER JOIN Categories c ON cc.DescendantID = c.CategoryID
WHERE cc.AncestorID = 1  -- Electronics
  AND cc.Depth > 0
ORDER BY cc.Depth, c.CategoryName;

-- Query: Get all ancestors
SELECT 
    c.CategoryName,
    cc.Depth
FROM CategoryClosure cc
INNER JOIN Categories c ON cc.AncestorID = c.CategoryID
WHERE cc.DescendantID = 3  -- Laptops
  AND cc.Depth > 0
ORDER BY cc.Depth DESC;

-- Query: Get immediate children
SELECT 
    c.CategoryName
FROM CategoryClosure cc
INNER JOIN Categories c ON cc.DescendantID = c.CategoryID
WHERE cc.AncestorID = 2  -- Computers
  AND cc.Depth = 1;

-- Insert new node (easier than Nested Sets)
-- Add "Gaming Laptops" under "Laptops" (CategoryID 3)
DECLARE @NewCategoryID INT = 6;
DECLARE @ParentID INT = 3;

-- Insert into Categories
INSERT INTO Categories VALUES (@NewCategoryID, 'Gaming Laptops');

-- Insert closure records
-- 1. Self-reference
INSERT INTO CategoryClosure VALUES (@NewCategoryID, @NewCategoryID, 0);

-- 2. Copy all ancestors of parent
INSERT INTO CategoryClosure (AncestorID, DescendantID, Depth)
SELECT 
    AncestorID,
    @NewCategoryID,
    Depth + 1
FROM CategoryClosure
WHERE DescendantID = @ParentID;

-- Delete node and all descendants
DELETE FROM CategoryClosure
WHERE DescendantID IN (
    SELECT DescendantID
    FROM CategoryClosure
    WHERE AncestorID = @CategoryIDToDelete
);

DELETE FROM Categories
WHERE CategoryID IN (
    SELECT DescendantID
    FROM CategoryClosure
    WHERE AncestorID = @CategoryIDToDelete
);
Enter fullscreen mode Exit fullscreen mode

Pros:

  • Fast reads (no recursion)
  • Easier inserts/deletes than Nested Sets
  • Can query any relationship easily
  • Good for complex hierarchies

Cons:

  • More storage (O(n²) in worst case)
  • More complex than Adjacency List
  • Redundant data

Approach 4: Materialized Path (Path Enumeration)

Store the full path from root to node.

CREATE TABLE Categories (
    CategoryID INT PRIMARY KEY,
    CategoryName VARCHAR(100),
    PathString VARCHAR(500) NOT NULL,  -- e.g., '/1/2/3/'
    Level INT NOT NULL,
    CONSTRAINT UQ_Path UNIQUE (PathString)
);

CREATE INDEX IX_Categories_Path ON Categories(PathString);

-- Sample data
INSERT INTO Categories VALUES
(1, 'Electronics', '/1/', 0),
(2, 'Computers', '/1/2/', 1),
(3, 'Laptops', '/1/2/3/', 2),
(4, 'Desktops', '/1/2/4/', 2),
(5, 'TVs', '/1/5/', 1),
(6, 'LED TVs', '/1/5/6/', 2);

-- Query: Get all descendants
SELECT 
    CategoryID,
    CategoryName,
    Level,
    PathString
FROM Categories
WHERE PathString LIKE '/1/%'  -- All under Electronics
  AND CategoryID <> 1
ORDER BY PathString;

-- Query: Get all ancestors
SELECT 
    CategoryID,
    CategoryName,
    Level
FROM Categories
WHERE '/1/2/3/' LIKE PathString + '%'  -- Ancestors of Laptops
  AND CategoryID <> 3
ORDER BY Level;

-- Query: Get immediate children
SELECT 
    CategoryID,
    CategoryName
FROM Categories
WHERE PathString LIKE '/1/2/%'  -- Children of Computers
  AND Level = 2;

-- Query: Get siblings
SELECT 
    CategoryID,
    CategoryName
FROM Categories c1
WHERE EXISTS (
    SELECT 1 
    FROM Categories c2
    WHERE c2.CategoryID = 3  -- Laptops
      AND c1.PathString LIKE SUBSTRING(c2.PathString, 1, 
          LEN(c2.PathString) - CHARINDEX('/', REVERSE(c2.PathString), 2)) + '%'
      AND c1.Level = c2.Level
      AND c1.CategoryID <> c2.CategoryID
);

-- Insert new node
DECLARE @ParentPath VARCHAR(500) = '/1/2/';
DECLARE @NewID INT = 7;

INSERT INTO Categories VALUES
(@NewID, 'Gaming Laptops', @ParentPath + CAST(@NewID AS VARCHAR(10)) + '/', 2);

-- Move subtree (change parent)
DECLARE @OldPath VARCHAR(500) = '/1/2/3/';
DECLARE @NewPath VARCHAR(500) = '/1/5/3/';

UPDATE Categories
SET PathString = REPLACE(PathString, @OldPath, @NewPath),
    Level = Level + (LEN(@NewPath) - LEN(@OldPath)) / 2
WHERE PathString LIKE @OldPath + '%';
Enter fullscreen mode Exit fullscreen mode

Pros:

  • Simple to understand
  • Fast reads with LIKE operator
  • Easy to get ancestors/descendants
  • Can use indexes effectively

Cons:

  • Path length limitation
  • Updates can be expensive (moving subtrees)
  • Redundant data storage

Comparison Table:

Feature Adjacency List Nested Sets Closure Table Materialized Path
Simplicity ⭐⭐⭐⭐⭐ ⭐⭐ ⭐⭐⭐ ⭐⭐⭐⭐
Read Performance ⭐⭐ ⭐⭐⭐⭐⭐ ⭐⭐⭐⭐⭐ ⭐⭐⭐⭐
Write Performance ⭐⭐⭐⭐⭐ ⭐⭐ ⭐⭐⭐ ⭐⭐⭐⭐
Storage ⭐⭐⭐⭐⭐ ⭐⭐⭐⭐ ⭐⭐ ⭐⭐⭐
Referential Integrity ⭐⭐⭐⭐⭐ ⭐⭐⭐ ⭐⭐⭐⭐⭐ ⭐⭐⭐
Move Subtree ⭐⭐⭐⭐⭐ ⭐⭐⭐ ⭐⭐

Recommendation:

  • Adjacency List: Default choice for most applications
  • Closure Table: When you need fast reads and can afford storage
  • Materialized Path: Good balance, especially with limited depth
  • Nested Sets: Only for read-heavy, rarely-updated hierarchies

Hybrid Approach:

-- Combine Adjacency List with Materialized Path
CREATE TABLE Categories (
    CategoryID INT PRIMARY KEY,
    CategoryName VARCHAR(100),
    ParentID INT NULL,  -- Adjacency List
    PathString VARCHAR(500),  -- Materialized Path
    Level INT,
    LeftValue INT,  -- Optional: Add Nested Sets for specific queries
    RightValue INT,
    CONSTRAINT FK_Parent FOREIGN KEY (ParentID) 
        REFERENCES Categories(CategoryID)
);
Enter fullscreen mode Exit fullscreen mode

This gives you flexibility to use the best approach for each query type.

Top comments (0)