Data Tech Bridge

Posted on Dec 18, 2025

Part 1: Database Concepts & Architecture

#architecture #database #dataengineering

Data Modeling (Advanced)

1. Explain the difference between Star Schema and Snowflake Schema. When would you choose one over the other?

Answer:

Star Schema:

Central fact table connected directly to denormalized dimension tables
Dimension tables are not normalized (contain redundant data)
Simpler queries with fewer joins
Better query performance
Larger storage footprint
Easier for end users to understand

Snowflake Schema:

Dimension tables are normalized into multiple related tables
Reduces data redundancy
More complex queries with more joins
Slower query performance
Smaller storage footprint
More complex for end users

When to Choose:

Star Schema:

Query performance is critical
Storage is not a constraint
Simpler BI tool integration needed
End users need direct access
Example: Sales reporting dashboard

Snowflake Schema:

Storage optimization is important
Data integrity and normalization are priorities
Complex hierarchical dimensions exist
ETL processes can handle complexity
Example: Large enterprise data warehouse with complex product hierarchies

Real-world Example:

-- Star Schema
FactSales (SalesID, DateKey, ProductKey, CustomerKey, Amount)
DimProduct (ProductKey, ProductName, Category, SubCategory, Brand, Supplier)
DimCustomer (CustomerKey, Name, City, State, Country, Region)

-- Snowflake Schema
FactSales (SalesID, DateKey, ProductKey, CustomerKey, Amount)
DimProduct (ProductKey, ProductName, CategoryKey, BrandKey)
DimCategory (CategoryKey, Category, SubCategory)
DimBrand (BrandKey, Brand, SupplierKey)
DimSupplier (SupplierKey, SupplierName, Country)

2. How would you design a data model for a multi-tenant SaaS application? What are the key considerations?

Answer:

Three Main Approaches:

A. Shared Schema (Single Database, Shared Tables)

CREATE TABLE Customers (
    CustomerID INT PRIMARY KEY,
    TenantID INT NOT NULL,  -- Discriminator column
    CustomerName VARCHAR(100),
    Email VARCHAR(100),
    CONSTRAINT FK_Tenant FOREIGN KEY (TenantID) REFERENCES Tenants(TenantID)
);

CREATE INDEX IX_Customers_TenantID ON Customers(TenantID);

-- Row Level Security (SQL Server 2016+)
CREATE SECURITY POLICY TenantFilter
ADD FILTER PREDICATE dbo.fn_TenantAccessPredicate(TenantID)
ON dbo.Customers
WITH (STATE = ON);

Pros:

Cost-effective (single database)
Easy maintenance and upgrades
Efficient resource utilization

Cons:

Security concerns (data leakage risk)
Difficult to customize per tenant
Noisy neighbor problem
Scaling challenges

B. Separate Schema (Single Database, Multiple Schemas)

-- Tenant 1
CREATE SCHEMA Tenant_1;
CREATE TABLE Tenant_1.Customers (
    CustomerID INT PRIMARY KEY,
    CustomerName VARCHAR(100),
    Email VARCHAR(100)
);

-- Tenant 2
CREATE SCHEMA Tenant_2;
CREATE TABLE Tenant_2.Customers (
    CustomerID INT PRIMARY KEY,
    CustomerName VARCHAR(100),
    Email VARCHAR(100)
);

-- Application logic
SET @schema = 'Tenant_' + @TenantID;
EXEC('SELECT * FROM ' + @schema + '.Customers');

Pros:

Better isolation than shared schema
Easier per-tenant customization
Simpler queries (no TenantID filter)

Cons:

Schema management complexity
Limited scalability
Backup/restore complexity

C. Separate Database (Database per Tenant)

-- Tenant1DB
CREATE DATABASE Tenant_1_DB;
USE Tenant_1_DB;
CREATE TABLE Customers (...);

-- Tenant2DB
CREATE DATABASE Tenant_2_DB;
USE Tenant_2_DB;
CREATE TABLE Customers (...);

-- Connection string routing
string connectionString = GetConnectionString(tenantId);

Pros:

Maximum isolation and security
Easy per-tenant customization
Independent scaling
Simplified compliance (data residency)

Cons:

Highest cost
Complex maintenance
Resource intensive

Key Considerations:

Security & Isolation:

-- Implement Row-Level Security
CREATE FUNCTION dbo.fn_TenantAccessPredicate(@TenantID INT)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN SELECT 1 AS result
WHERE @TenantID = CAST(SESSION_CONTEXT(N'TenantID') AS INT);

Performance:

-- Partition by TenantID for large tables
CREATE PARTITION FUNCTION PF_TenantID (INT)
AS RANGE RIGHT FOR VALUES (1, 2, 3, 4, 5);

CREATE PARTITION SCHEME PS_TenantID
AS PARTITION PF_TenantID ALL TO ([PRIMARY]);

CREATE TABLE Orders (
    OrderID INT,
    TenantID INT,
    OrderDate DATE,
    Amount DECIMAL(10,2)
) ON PS_TenantID(TenantID);

Scalability:
Implement sharding strategy
Use read replicas for reporting
Consider hybrid approach (small tenants shared, large tenants isolated)
Compliance:
Data residency requirements
GDPR, HIPAA considerations
Audit logging per tenant

Recommended Hybrid Approach:

-- Metadata database (shared)
CREATE TABLE Tenants (
    TenantID INT PRIMARY KEY,
    TenantName VARCHAR(100),
    TierLevel VARCHAR(20), -- 'Free', 'Premium', 'Enterprise'
    DatabaseName VARCHAR(100), -- NULL for shared, specific for isolated
    IsActive BIT
);

-- Small tenants: Shared schema with TenantID
-- Large tenants: Separate database

3. Explain Slowly Changing Dimensions (SCD). Describe Type 1, 2, and 3 with practical examples.

Answer:

Type 1 SCD - Overwrite (No History)

Simply update the existing record, losing historical data.

-- Initial state
CustomerID | CustomerName | City      | State
1          | John Doe     | Boston    | MA

-- Customer moves to New York
UPDATE DimCustomer
SET City = 'New York',
    State = 'NY',
    LastModifiedDate = GETDATE()
WHERE CustomerID = 1;

-- Result (history lost)
CustomerID | CustomerName | City      | State
1          | John Doe     | New York  | NY

Use Cases:

Correcting data errors
Attributes that don't require history (e.g., email format changes)
Non-critical dimensional attributes

Implementation:

CREATE PROCEDURE usp_UpdateCustomerType1
    @CustomerID INT,
    @NewCity VARCHAR(50),
    @NewState VARCHAR(2)
AS
BEGIN
    UPDATE DimCustomer
    SET City = @NewCity,
        State = @NewState,
        LastModifiedDate = GETDATE(),
        ModifiedBy = SYSTEM_USER
    WHERE CustomerID = @CustomerID;
END;

Type 2 SCD - Add New Row (Full History)

Create a new record for each change, maintaining complete history.

-- Table structure
CREATE TABLE DimCustomer (
    CustomerKey INT IDENTITY(1,1) PRIMARY KEY,  -- Surrogate key
    CustomerID INT NOT NULL,                     -- Natural key
    CustomerName VARCHAR(100),
    City VARCHAR(50),
    State VARCHAR(2),
    EffectiveDate DATE NOT NULL,
    EndDate DATE NULL,
    IsCurrent BIT NOT NULL DEFAULT 1,
    RowVersion INT NOT NULL DEFAULT 1
);

-- Initial record
CustomerKey | CustomerID | CustomerName | City    | State | EffectiveDate | EndDate    | IsCurrent
1           | 1          | John Doe     | Boston  | MA    | 2020-01-01    | NULL       | 1

-- Customer moves (insert new record, update old)
BEGIN TRANSACTION;

-- Close old record
UPDATE DimCustomer
SET EndDate = '2023-06-30',
    IsCurrent = 0
WHERE CustomerID = 1 AND IsCurrent = 1;

-- Insert new record
INSERT INTO DimCustomer (CustomerID, CustomerName, City, State, EffectiveDate, IsCurrent, RowVersion)
VALUES (1, 'John Doe', 'New York', 'NY', '2023-07-01', 1, 2);

COMMIT TRANSACTION;

-- Result
CustomerKey | CustomerID | CustomerName | City     | State | EffectiveDate | EndDate    | IsCurrent | RowVersion
1           | 1          | John Doe     | Boston   | MA    | 2020-01-01    | 2023-06-30 | 0         | 1
2           | 1          | John Doe     | New York | NY    | 2023-07-01    | NULL       | 1         | 2

Querying Type 2:

-- Get current record
SELECT * FROM DimCustomer
WHERE CustomerID = 1 AND IsCurrent = 1;

-- Get historical record at specific date
SELECT * FROM DimCustomer
WHERE CustomerID = 1
  AND '2023-03-15' BETWEEN EffectiveDate AND COALESCE(EndDate, '9999-12-31');

-- Join fact table with dimension (point-in-time)
SELECT 
    f.SalesDate,
    f.Amount,
    d.City,
    d.State
FROM FactSales f
INNER JOIN DimCustomer d 
    ON f.CustomerKey = d.CustomerKey  -- Use surrogate key
WHERE f.SalesDate BETWEEN d.EffectiveDate AND COALESCE(d.EndDate, '9999-12-31');

Use Cases:

Customer address changes
Product price changes
Employee department transfers
Any attribute requiring full audit trail

Type 3 SCD - Add New Column (Limited History)

Add columns to track previous value(s).

CREATE TABLE DimCustomer (
    CustomerID INT PRIMARY KEY,
    CustomerName VARCHAR(100),
    CurrentCity VARCHAR(50),
    CurrentState VARCHAR(2),
    PreviousCity VARCHAR(50),
    PreviousState VARCHAR(2),
    EffectiveDate DATE,
    PreviousEffectiveDate DATE
);

-- Initial state
CustomerID | CustomerName | CurrentCity | CurrentState | PreviousCity | PreviousState
1          | John Doe     | Boston      | MA           | NULL         | NULL

-- Customer moves
UPDATE DimCustomer
SET PreviousCity = CurrentCity,
    PreviousState = CurrentState,
    PreviousEffectiveDate = EffectiveDate,
    CurrentCity = 'New York',
    CurrentState = 'NY',
    EffectiveDate = '2023-07-01'
WHERE CustomerID = 1;

-- Result
CustomerID | CustomerName | CurrentCity | CurrentState | PreviousCity | PreviousState
1          | John Doe     | New York    | NY           | Boston       | MA

Use Cases:

Tracking one or two previous values
Comparing current vs. previous (e.g., sales territory changes)
Limited history requirements

Comparison Query:

-- Analyze customers who changed cities
SELECT 
    CustomerID,
    CustomerName,
    PreviousCity,
    CurrentCity,
    DATEDIFF(DAY, PreviousEffectiveDate, EffectiveDate) AS DaysSinceChange
FROM DimCustomer
WHERE PreviousCity IS NOT NULL
  AND PreviousCity <> CurrentCity;

Hybrid Approach (Type 2 + Type 1):

CREATE TABLE DimProduct (
    ProductKey INT IDENTITY(1,1) PRIMARY KEY,
    ProductID INT NOT NULL,
    ProductName VARCHAR(100),        -- Type 1 (overwrite)
    ProductDescription TEXT,         -- Type 1 (overwrite)
    Price DECIMAL(10,2),            -- Type 2 (track history)
    Category VARCHAR(50),           -- Type 2 (track history)
    EffectiveDate DATE NOT NULL,
    EndDate DATE NULL,
    IsCurrent BIT NOT NULL
);

-- Update procedure
CREATE PROCEDURE usp_UpdateProduct
    @ProductID INT,
    @NewName VARCHAR(100),
    @NewDescription TEXT,
    @NewPrice DECIMAL(10,2),
    @NewCategory VARCHAR(50)
AS
BEGIN
    DECLARE @CurrentPrice DECIMAL(10,2);
    DECLARE @CurrentCategory VARCHAR(50);

    -- Get current values
    SELECT @CurrentPrice = Price, @CurrentCategory = Category
    FROM DimProduct
    WHERE ProductID = @ProductID AND IsCurrent = 1;

    -- Check if Type 2 attributes changed
    IF @NewPrice <> @CurrentPrice OR @NewCategory <> @CurrentCategory
    BEGIN
        -- Type 2: Create new record
        BEGIN TRANSACTION;

        UPDATE DimProduct
        SET EndDate = GETDATE(), IsCurrent = 0
        WHERE ProductID = @ProductID AND IsCurrent = 1;

        INSERT INTO DimProduct (ProductID, ProductName, ProductDescription, Price, Category, EffectiveDate, IsCurrent)
        VALUES (@ProductID, @NewName, @NewDescription, @NewPrice, @NewCategory, GETDATE(), 1);

        COMMIT TRANSACTION;
    END
    ELSE
    BEGIN
        -- Type 1: Update existing record
        UPDATE DimProduct
        SET ProductName = @NewName,
            ProductDescription = @NewDescription
        WHERE ProductID = @ProductID AND IsCurrent = 1;
    END
END;

Performance Considerations:

-- Indexes for Type 2 SCD
CREATE INDEX IX_DimCustomer_CustomerID_IsCurrent 
ON DimCustomer(CustomerID, IsCurrent) 
INCLUDE (CustomerName, City, State);

CREATE INDEX IX_DimCustomer_EffectiveDate_EndDate 
ON DimCustomer(EffectiveDate, EndDate) 
INCLUDE (CustomerID);

-- Partitioning for large Type 2 tables
CREATE PARTITION FUNCTION PF_SCD_Date (DATE)
AS RANGE RIGHT FOR VALUES 
('2020-01-01', '2021-01-01', '2022-01-01', '2023-01-01');

CREATE PARTITION SCHEME PS_SCD_Date
AS PARTITION PF_SCD_Date ALL TO ([PRIMARY]);

CREATE TABLE DimCustomer_Partitioned (
    CustomerKey INT IDENTITY(1,1),
    CustomerID INT NOT NULL,
    CustomerName VARCHAR(100),
    City VARCHAR(50),
    State VARCHAR(2),
    EffectiveDate DATE NOT NULL,
    EndDate DATE NULL,
    IsCurrent BIT NOT NULL
) ON PS_SCD_Date(EffectiveDate);

4. What is Data Vault modeling? How does it differ from traditional dimensional modeling?

Answer:

Data Vault 2.0 Components:

1. Hubs - Core Business Entities

-- Hub: Stores unique business keys
CREATE TABLE Hub_Customer (
    CustomerHashKey BINARY(16) PRIMARY KEY,  -- MD5/SHA hash of business key
    CustomerID VARCHAR(50) NOT NULL,         -- Natural business key
    LoadDate DATETIME NOT NULL,
    RecordSource VARCHAR(50) NOT NULL,
    CONSTRAINT UQ_Hub_Customer_ID UNIQUE (CustomerID)
);

-- Example data
CustomerHashKey                  | CustomerID | LoadDate            | RecordSource
0x1A2B3C4D5E6F7G8H9I0J1K2L3M4N | CUST001    | 2023-01-15 10:30:00 | CRM_System

2. Links - Relationships Between Hubs

-- Link: Represents many-to-many relationships
CREATE TABLE Link_CustomerOrder (
    CustomerOrderHashKey BINARY(16) PRIMARY KEY,
    CustomerHashKey BINARY(16) NOT NULL,
    OrderHashKey BINARY(16) NOT NULL,
    LoadDate DATETIME NOT NULL,
    RecordSource VARCHAR(50) NOT NULL,
    CONSTRAINT FK_Link_Customer FOREIGN KEY (CustomerHashKey) 
        REFERENCES Hub_Customer(CustomerHashKey),
    CONSTRAINT FK_Link_Order FOREIGN KEY (OrderHashKey) 
        REFERENCES Hub_Order(OrderHashKey)
);

3. Satellites - Descriptive Attributes

-- Satellite: Stores descriptive attributes and tracks history
CREATE TABLE Sat_Customer (
    CustomerHashKey BINARY(16) NOT NULL,
    LoadDate DATETIME NOT NULL,
    LoadEndDate DATETIME NULL,
    CustomerName VARCHAR(100),
    Email VARCHAR(100),
    Phone VARCHAR(20),
    Address VARCHAR(200),
    City VARCHAR(50),
    State VARCHAR(2),
    HashDiff BINARY(16) NOT NULL,  -- Hash of all attributes
    RecordSource VARCHAR(50) NOT NULL,
    PRIMARY KEY (CustomerHashKey, LoadDate),
    CONSTRAINT FK_Sat_Customer FOREIGN KEY (CustomerHashKey) 
        REFERENCES Hub_Customer(CustomerHashKey)
);

-- Multiple satellites for different change rates
CREATE TABLE Sat_Customer_ContactInfo (
    CustomerHashKey BINARY(16) NOT NULL,
    LoadDate DATETIME NOT NULL,
    Email VARCHAR(100),
    Phone VARCHAR(20),
    HashDiff BINARY(16) NOT NULL,
    PRIMARY KEY (CustomerHashKey, LoadDate)
);

CREATE TABLE Sat_Customer_Demographics (
    CustomerHashKey BINARY(16) NOT NULL,
    LoadDate DATETIME NOT NULL,
    DateOfBirth DATE,
    Gender VARCHAR(10),
    MaritalStatus VARCHAR(20),
    HashDiff BINARY(16) NOT NULL,
    PRIMARY KEY (CustomerHashKey, LoadDate)
);

Complete Example:

-- Hub: Customer
CREATE TABLE Hub_Customer (
    CustomerHashKey BINARY(16) PRIMARY KEY,
    CustomerID VARCHAR(50) NOT NULL UNIQUE,
    LoadDate DATETIME NOT NULL,
    RecordSource VARCHAR(50) NOT NULL
);

-- Hub: Product
CREATE TABLE Hub_Product (
    ProductHashKey BINARY(16) PRIMARY KEY,
    ProductID VARCHAR(50) NOT NULL UNIQUE,
    LoadDate DATETIME NOT NULL,
    RecordSource VARCHAR(50) NOT NULL
);

-- Link: Customer-Product (Purchase)
CREATE TABLE Link_Purchase (
    PurchaseHashKey BINARY(16) PRIMARY KEY,
    CustomerHashKey BINARY(16) NOT NULL,
    ProductHashKey BINARY(16) NOT NULL,
    LoadDate DATETIME NOT NULL,
    RecordSource VARCHAR(50) NOT NULL
);

-- Satellite: Purchase Details
CREATE TABLE Sat_Purchase (
    PurchaseHashKey BINARY(16) NOT NULL,
    LoadDate DATETIME NOT NULL,
    LoadEndDate DATETIME NULL,
    PurchaseDate DATE,
    Quantity INT,
    UnitPrice DECIMAL(10,2),
    TotalAmount DECIMAL(10,2),
    HashDiff BINARY(16) NOT NULL,
    RecordSource VARCHAR(50) NOT NULL,
    PRIMARY KEY (PurchaseHashKey, LoadDate)
);

-- Loading data with hash keys
INSERT INTO Hub_Customer (CustomerHashKey, CustomerID, LoadDate, RecordSource)
SELECT 
    HASHBYTES('MD5', CustomerID) AS CustomerHashKey,
    CustomerID,
    GETDATE(),
    'CRM_System'
FROM SourceCustomers
WHERE NOT EXISTS (
    SELECT 1 FROM Hub_Customer h 
    WHERE h.CustomerID = SourceCustomers.CustomerID
);

Data Vault vs. Dimensional Modeling:

Aspect	Data Vault	Dimensional (Kimball)
Purpose	Enterprise data warehouse, raw data vault	Business intelligence, analytics
Structure	Hubs, Links, Satellites	Facts, Dimensions
Normalization	Highly normalized	Denormalized (Star) or Normalized (Snowflake)
History	Full history by design	Type 2 SCD for history
Flexibility	Highly flexible, easy to add sources	Less flexible, schema changes complex
Query Performance	Slower (many joins)	Faster (fewer joins)
Load Performance	Very fast (parallel loads)	Slower (lookups, updates)
Auditability	Excellent (full lineage)	Limited
Business Keys	Preserved	Surrogate keys
Use Case	Data integration layer	Presentation layer

Practical Implementation:

-- ETL Process for Data Vault

-- Step 1: Load Hub (if not exists)
MERGE INTO Hub_Customer AS target
USING (
    SELECT DISTINCT
        HASHBYTES('MD5', CustomerID) AS CustomerHashKey,
        CustomerID,
        GETDATE() AS LoadDate,
        'CRM_System' AS RecordSource
    FROM Staging_Customers
) AS source
ON target.CustomerHashKey = source.CustomerHashKey
WHEN NOT MATCHED THEN
    INSERT (CustomerHashKey, CustomerID, LoadDate, RecordSource)
    VALUES (source.CustomerHashKey, source.CustomerID, source.LoadDate, source.RecordSource);

-- Step 2: Load Satellite (track changes)
INSERT INTO Sat_Customer (CustomerHashKey, LoadDate, CustomerName, Email, Phone, HashDiff, RecordSource)
SELECT 
    h.CustomerHashKey,
    GETDATE() AS LoadDate,
    s.CustomerName,
    s.Email,
    s.Phone,
    HASHBYTES('MD5', CONCAT(s.CustomerName, s.Email, s.Phone)) AS HashDiff,
    'CRM_System' AS RecordSource
FROM Staging_Customers s
INNER JOIN Hub_Customer h ON s.CustomerID = h.CustomerID
WHERE NOT EXISTS (
    SELECT 1 
    FROM Sat_Customer sat
    WHERE sat.CustomerHashKey = h.CustomerHashKey
      AND sat.HashDiff = HASHBYTES('MD5', CONCAT(s.CustomerName, s.Email, s.Phone))
      AND sat.LoadEndDate IS NULL
);

-- Step 3: Close previous satellite records
UPDATE sat
SET LoadEndDate = GETDATE()
FROM Sat_Customer sat
INNER JOIN (
    SELECT CustomerHashKey, MAX(LoadDate) AS MaxLoadDate
    FROM Sat_Customer
    WHERE LoadEndDate IS NULL
    GROUP BY CustomerHashKey
) latest ON sat.CustomerHashKey = latest.CustomerHashKey 
         AND sat.LoadDate < latest.MaxLoadDate
WHERE sat.LoadEndDate IS NULL;

Querying Data Vault (Point-in-Time):

-- Create view for business users (Information Mart)
CREATE VIEW vw_Customer_Current AS
SELECT 
    h.CustomerID,
    s.CustomerName,
    s.Email,
    s.Phone,
    s.LoadDate AS EffectiveDate
FROM Hub_Customer h
INNER JOIN Sat_Customer s ON h.CustomerHashKey = s.CustomerHashKey
WHERE s.LoadEndDate IS NULL;  -- Current records only

-- Historical query
CREATE FUNCTION fn_GetCustomerAtDate(@AsOfDate DATETIME)
RETURNS TABLE
AS
RETURN
(
    SELECT 
        h.CustomerID,
        s.CustomerName,
        s.Email,
        s.Phone
    FROM Hub_Customer h
    INNER JOIN Sat_Customer s ON h.CustomerHashKey = s.CustomerHashKey
    WHERE @AsOfDate >= s.LoadDate 
      AND (@AsOfDate < s.LoadEndDate OR s.LoadEndDate IS NULL)
);

-- Usage
SELECT * FROM fn_GetCustomerAtDate('2023-06-15');

Advantages of Data Vault:

Auditability: Complete lineage and history
Flexibility: Easy to add new data sources
Parallel Loading: Hubs, Links, Satellites load independently
No Data Loss: All source data preserved
Agile: Incremental development

Disadvantages:

Complexity: Steep learning curve
Query Performance: Many joins required
Storage: More storage than dimensional models
Tooling: Limited BI tool support

Best Practice - Hybrid Architecture:

Source Systems → Data Vault (Raw Vault) → Information Marts (Dimensional) → BI Tools

5. How would you handle hierarchical data in a relational database? Compare different approaches.

Answer:

Approach 1: Adjacency List (Parent-Child)

Most common and simplest approach.

CREATE TABLE Employees (
    EmployeeID INT PRIMARY KEY,
    EmployeeName VARCHAR(100),
    ManagerID INT NULL,
    Title VARCHAR(50),
    Salary DECIMAL(10,2),
    CONSTRAINT FK_Manager FOREIGN KEY (ManagerID) 
        REFERENCES Employees(EmployeeID)
);

-- Sample data
INSERT INTO Employees VALUES
(1, 'John CEO', NULL, 'CEO', 200000),
(2, 'Jane VP', 1, 'VP Sales', 150000),
(3, 'Bob VP', 1, 'VP Engineering', 150000),
(4, 'Alice Manager', 2, 'Sales Manager', 100000),
(5, 'Charlie Manager', 3, 'Engineering Manager', 100000),
(6, 'David Rep', 4, 'Sales Rep', 60000),
(7, 'Eve Engineer', 5, 'Senior Engineer', 90000);

-- Query: Get direct reports
SELECT 
    e.EmployeeID,
    e.EmployeeName,
    e.Title,
    m.EmployeeName AS ManagerName
FROM Employees e
LEFT JOIN Employees m ON e.ManagerID = m.EmployeeID;

-- Query: Get all subordinates (Recursive CTE)
WITH EmployeeHierarchy AS (
    -- Anchor: Start with a specific manager
    SELECT 
        EmployeeID,
        EmployeeName,
        ManagerID,
        Title,
        Salary,
        0 AS Level,
        CAST(EmployeeID AS VARCHAR(MAX)) AS HierarchyPath
    FROM Employees
    WHERE EmployeeID = 1  -- CEO

    UNION ALL

    -- Recursive: Get subordinates
    SELECT 
        e.EmployeeID,
        e.EmployeeName,
        e.ManagerID,
        e.Title,
        e.Salary,
        eh.Level + 1,
        CAST(eh.HierarchyPath + '/' + CAST(e.EmployeeID AS VARCHAR(10)) AS VARCHAR(MAX))
    FROM Employees e
    INNER JOIN EmployeeHierarchy eh ON e.ManagerID = eh.EmployeeID
)
SELECT 
    REPLICATE('  ', Level) + EmployeeName AS OrgChart,
    Level,
    Title,
    Salary,
    HierarchyPath
FROM EmployeeHierarchy
ORDER BY HierarchyPath;

-- Result:
-- John CEO (Level 0)
--   Jane VP (Level 1)
--     Alice Manager (Level 2)
--       David Rep (Level 3)
--   Bob VP (Level 1)
--     Charlie Manager (Level 2)
--       Eve Engineer (Level 3)

Pros:

Simple to understand and implement
Easy to insert/update/delete
Referential integrity enforced

Cons:

Recursive queries needed for hierarchy traversal
Performance issues with deep hierarchies
No easy way to get all ancestors/descendants

Approach 2: Nested Sets (Modified Preorder Tree Traversal)

Stores left and right boundaries for each node.

CREATE TABLE Categories (
    CategoryID INT PRIMARY KEY,
    CategoryName VARCHAR(100),
    LeftValue INT NOT NULL,
    RightValue INT NOT NULL,
    CONSTRAINT UQ_Left UNIQUE (LeftValue),
    CONSTRAINT UQ_Right UNIQUE (RightValue)
);

-- Sample data (Electronics hierarchy)
INSERT INTO Categories VALUES
(1, 'Electronics', 1, 20),
(2, 'Computers', 2, 9),
(3, 'Laptops', 3, 4),
(4, 'Desktops', 5, 6),
(5, 'Tablets', 7, 8),
(6, 'TVs', 10, 15),
(7, 'LED', 11, 12),
(8, 'OLED', 13, 14),
(9, 'Audio', 16, 19),
(10, 'Headphones', 17, 18);

-- Query: Get all descendants of a node
SELECT 
    child.CategoryName,
    (COUNT(parent.CategoryID) - 1) AS Depth
FROM Categories AS child
CROSS JOIN Categories AS parent
WHERE child.LeftValue BETWEEN parent.LeftValue AND parent.RightValue
  AND parent.CategoryName = 'Electronics'
GROUP BY child.CategoryName, child.LeftValue
ORDER BY child.LeftValue;

-- Query: Get path to root
SELECT 
    parent.CategoryName
FROM Categories AS child
CROSS JOIN Categories AS parent
WHERE child.LeftValue BETWEEN parent.LeftValue AND parent.RightValue
  AND child.CategoryName = 'Laptops'
ORDER BY parent.LeftValue;

-- Query: Get immediate children
SELECT 
    child.CategoryName
FROM Categories AS parent
INNER JOIN Categories AS child 
    ON child.LeftValue > parent.LeftValue 
    AND child.RightValue < parent.RightValue
WHERE parent.CategoryName = 'Computers'
  AND NOT EXISTS (
      SELECT 1 
      FROM Categories AS intermediate
      WHERE intermediate.LeftValue > parent.LeftValue
        AND intermediate.RightValue < parent.RightValue
        AND child.LeftValue > intermediate.LeftValue
        AND child.RightValue < intermediate.RightValue
  );

-- Insert new node (complex)
-- To insert "Gaming Laptops" under "Laptops" (CategoryID 3)
DECLARE @ParentLeft INT = 3;
DECLARE @ParentRight INT = 4;

-- Make space
UPDATE Categories 
SET RightValue = RightValue + 2 
WHERE RightValue >= @ParentRight;

UPDATE Categories 
SET LeftValue = LeftValue + 2 
WHERE LeftValue > @ParentRight;

-- Insert new node
INSERT INTO Categories VALUES
(11, 'Gaming Laptops', @ParentRight, @ParentRight + 1);

Pros:

Very fast reads (no recursion)
Easy to get all descendants/ancestors
Efficient for read-heavy workloads

Cons:

Complex inserts/updates/deletes
Difficult to maintain
Hard to understand
Concurrent modifications problematic

Approach 3: Closure Table (Transitive Closure)

Stores all ancestor-descendant relationships.

CREATE TABLE Categories (
    CategoryID INT PRIMARY KEY,
    CategoryName VARCHAR(100)
);

CREATE TABLE CategoryClosure (
    AncestorID INT NOT NULL,
    DescendantID INT NOT NULL,
    Depth INT NOT NULL,
    PRIMARY KEY (AncestorID, DescendantID),
    CONSTRAINT FK_Ancestor FOREIGN KEY (AncestorID) 
        REFERENCES Categories(CategoryID),
    CONSTRAINT FK_Descendant FOREIGN KEY (DescendantID) 
        REFERENCES Categories(CategoryID)
);

-- Sample data
INSERT INTO Categories VALUES
(1, 'Electronics'),
(2, 'Computers'),
(3, 'Laptops'),
(4, 'Desktops'),
(5, 'TVs');

-- Insert closure records (including self-references)
INSERT INTO CategoryClosure VALUES
-- Self-references
(1, 1, 0), (2, 2, 0), (3, 3, 0), (4, 4, 0), (5, 5, 0),
-- Electronics -> Computers
(1, 2, 1),
-- Electronics -> Laptops (through Computers)
(1, 3, 2), (2, 3, 1),
-- Electronics -> Desktops (through Computers)
(1, 4, 2), (2, 4, 1),
-- Electronics -> TVs
(1, 5, 1);

-- Query: Get all descendants
SELECT 
    c.CategoryName,
    cc.Depth
FROM CategoryClosure cc
INNER JOIN Categories c ON cc.DescendantID = c.CategoryID
WHERE cc.AncestorID = 1  -- Electronics
  AND cc.Depth > 0
ORDER BY cc.Depth, c.CategoryName;

-- Query: Get all ancestors
SELECT 
    c.CategoryName,
    cc.Depth
FROM CategoryClosure cc
INNER JOIN Categories c ON cc.AncestorID = c.CategoryID
WHERE cc.DescendantID = 3  -- Laptops
  AND cc.Depth > 0
ORDER BY cc.Depth DESC;

-- Query: Get immediate children
SELECT 
    c.CategoryName
FROM CategoryClosure cc
INNER JOIN Categories c ON cc.DescendantID = c.CategoryID
WHERE cc.AncestorID = 2  -- Computers
  AND cc.Depth = 1;

-- Insert new node (easier than Nested Sets)
-- Add "Gaming Laptops" under "Laptops" (CategoryID 3)
DECLARE @NewCategoryID INT = 6;
DECLARE @ParentID INT = 3;

-- Insert into Categories
INSERT INTO Categories VALUES (@NewCategoryID, 'Gaming Laptops');

-- Insert closure records
-- 1. Self-reference
INSERT INTO CategoryClosure VALUES (@NewCategoryID, @NewCategoryID, 0);

-- 2. Copy all ancestors of parent
INSERT INTO CategoryClosure (AncestorID, DescendantID, Depth)
SELECT 
    AncestorID,
    @NewCategoryID,
    Depth + 1
FROM CategoryClosure
WHERE DescendantID = @ParentID;

-- Delete node and all descendants
DELETE FROM CategoryClosure
WHERE DescendantID IN (
    SELECT DescendantID
    FROM CategoryClosure
    WHERE AncestorID = @CategoryIDToDelete
);

DELETE FROM Categories
WHERE CategoryID IN (
    SELECT DescendantID
    FROM CategoryClosure
    WHERE AncestorID = @CategoryIDToDelete
);

Pros:

Fast reads (no recursion)
Easier inserts/deletes than Nested Sets
Can query any relationship easily
Good for complex hierarchies

Cons:

More storage (O(n²) in worst case)
More complex than Adjacency List
Redundant data

Approach 4: Materialized Path (Path Enumeration)

Store the full path from root to node.

CREATE TABLE Categories (
    CategoryID INT PRIMARY KEY,
    CategoryName VARCHAR(100),
    PathString VARCHAR(500) NOT NULL,  -- e.g., '/1/2/3/'
    Level INT NOT NULL,
    CONSTRAINT UQ_Path UNIQUE (PathString)
);

CREATE INDEX IX_Categories_Path ON Categories(PathString);

-- Sample data
INSERT INTO Categories VALUES
(1, 'Electronics', '/1/', 0),
(2, 'Computers', '/1/2/', 1),
(3, 'Laptops', '/1/2/3/', 2),
(4, 'Desktops', '/1/2/4/', 2),
(5, 'TVs', '/1/5/', 1),
(6, 'LED TVs', '/1/5/6/', 2);

-- Query: Get all descendants
SELECT 
    CategoryID,
    CategoryName,
    Level,
    PathString
FROM Categories
WHERE PathString LIKE '/1/%'  -- All under Electronics
  AND CategoryID <> 1
ORDER BY PathString;

-- Query: Get all ancestors
SELECT 
    CategoryID,
    CategoryName,
    Level
FROM Categories
WHERE '/1/2/3/' LIKE PathString + '%'  -- Ancestors of Laptops
  AND CategoryID <> 3
ORDER BY Level;

-- Query: Get immediate children
SELECT 
    CategoryID,
    CategoryName
FROM Categories
WHERE PathString LIKE '/1/2/%'  -- Children of Computers
  AND Level = 2;

-- Query: Get siblings
SELECT 
    CategoryID,
    CategoryName
FROM Categories c1
WHERE EXISTS (
    SELECT 1 
    FROM Categories c2
    WHERE c2.CategoryID = 3  -- Laptops
      AND c1.PathString LIKE SUBSTRING(c2.PathString, 1, 
          LEN(c2.PathString) - CHARINDEX('/', REVERSE(c2.PathString), 2)) + '%'
      AND c1.Level = c2.Level
      AND c1.CategoryID <> c2.CategoryID
);

-- Insert new node
DECLARE @ParentPath VARCHAR(500) = '/1/2/';
DECLARE @NewID INT = 7;

INSERT INTO Categories VALUES
(@NewID, 'Gaming Laptops', @ParentPath + CAST(@NewID AS VARCHAR(10)) + '/', 2);

-- Move subtree (change parent)
DECLARE @OldPath VARCHAR(500) = '/1/2/3/';
DECLARE @NewPath VARCHAR(500) = '/1/5/3/';

UPDATE Categories
SET PathString = REPLACE(PathString, @OldPath, @NewPath),
    Level = Level + (LEN(@NewPath) - LEN(@OldPath)) / 2
WHERE PathString LIKE @OldPath + '%';

Pros:

Simple to understand
Fast reads with LIKE operator
Easy to get ancestors/descendants
Can use indexes effectively

Cons:

Path length limitation
Updates can be expensive (moving subtrees)
Redundant data storage

Comparison Table:

Feature	Adjacency List	Nested Sets	Closure Table	Materialized Path
Simplicity	⭐⭐⭐⭐⭐	⭐⭐	⭐⭐⭐	⭐⭐⭐⭐
Read Performance	⭐⭐	⭐⭐⭐⭐⭐	⭐⭐⭐⭐⭐	⭐⭐⭐⭐
Write Performance	⭐⭐⭐⭐⭐	⭐⭐	⭐⭐⭐	⭐⭐⭐⭐
Storage	⭐⭐⭐⭐⭐	⭐⭐⭐⭐	⭐⭐	⭐⭐⭐
Referential Integrity	⭐⭐⭐⭐⭐	⭐⭐⭐	⭐⭐⭐⭐⭐	⭐⭐⭐
Move Subtree	⭐⭐⭐⭐⭐	⭐	⭐⭐⭐	⭐⭐

Recommendation:

Adjacency List: Default choice for most applications
Closure Table: When you need fast reads and can afford storage
Materialized Path: Good balance, especially with limited depth
Nested Sets: Only for read-heavy, rarely-updated hierarchies

Hybrid Approach:

-- Combine Adjacency List with Materialized Path
CREATE TABLE Categories (
    CategoryID INT PRIMARY KEY,
    CategoryName VARCHAR(100),
    ParentID INT NULL,  -- Adjacency List
    PathString VARCHAR(500),  -- Materialized Path
    Level INT,
    LeftValue INT,  -- Optional: Add Nested Sets for specific queries
    RightValue INT,
    CONSTRAINT FK_Parent FOREIGN KEY (ParentID) 
        REFERENCES Categories(CategoryID)
);

This gives you flexibility to use the best approach for each query type.

DEV Community

Part 1: Database Concepts & Architecture

Data Modeling (Advanced)

Top comments (0)