Data Modeling (Advanced)
1. Explain the difference between Star Schema and Snowflake Schema. When would you choose one over the other?
Answer:
Star Schema:
- Central fact table connected directly to denormalized dimension tables
- Dimension tables are not normalized (contain redundant data)
- Simpler queries with fewer joins
- Better query performance
- Larger storage footprint
- Easier for end users to understand
Snowflake Schema:
- Dimension tables are normalized into multiple related tables
- Reduces data redundancy
- More complex queries with more joins
- Slower query performance
- Smaller storage footprint
- More complex for end users
When to Choose:
Star Schema:
- Query performance is critical
- Storage is not a constraint
- Simpler BI tool integration needed
- End users need direct access
- Example: Sales reporting dashboard
Snowflake Schema:
- Storage optimization is important
- Data integrity and normalization are priorities
- Complex hierarchical dimensions exist
- ETL processes can handle complexity
- Example: Large enterprise data warehouse with complex product hierarchies
Real-world Example:
-- Star Schema
FactSales (SalesID, DateKey, ProductKey, CustomerKey, Amount)
DimProduct (ProductKey, ProductName, Category, SubCategory, Brand, Supplier)
DimCustomer (CustomerKey, Name, City, State, Country, Region)
-- Snowflake Schema
FactSales (SalesID, DateKey, ProductKey, CustomerKey, Amount)
DimProduct (ProductKey, ProductName, CategoryKey, BrandKey)
DimCategory (CategoryKey, Category, SubCategory)
DimBrand (BrandKey, Brand, SupplierKey)
DimSupplier (SupplierKey, SupplierName, Country)
2. How would you design a data model for a multi-tenant SaaS application? What are the key considerations?
Answer:
Three Main Approaches:
A. Shared Schema (Single Database, Shared Tables)
CREATE TABLE Customers (
CustomerID INT PRIMARY KEY,
TenantID INT NOT NULL, -- Discriminator column
CustomerName VARCHAR(100),
Email VARCHAR(100),
CONSTRAINT FK_Tenant FOREIGN KEY (TenantID) REFERENCES Tenants(TenantID)
);
CREATE INDEX IX_Customers_TenantID ON Customers(TenantID);
-- Row Level Security (SQL Server 2016+)
CREATE SECURITY POLICY TenantFilter
ADD FILTER PREDICATE dbo.fn_TenantAccessPredicate(TenantID)
ON dbo.Customers
WITH (STATE = ON);
Pros:
- Cost-effective (single database)
- Easy maintenance and upgrades
- Efficient resource utilization
Cons:
- Security concerns (data leakage risk)
- Difficult to customize per tenant
- Noisy neighbor problem
- Scaling challenges
B. Separate Schema (Single Database, Multiple Schemas)
-- Tenant 1
CREATE SCHEMA Tenant_1;
CREATE TABLE Tenant_1.Customers (
CustomerID INT PRIMARY KEY,
CustomerName VARCHAR(100),
Email VARCHAR(100)
);
-- Tenant 2
CREATE SCHEMA Tenant_2;
CREATE TABLE Tenant_2.Customers (
CustomerID INT PRIMARY KEY,
CustomerName VARCHAR(100),
Email VARCHAR(100)
);
-- Application logic
SET @schema = 'Tenant_' + @TenantID;
EXEC('SELECT * FROM ' + @schema + '.Customers');
Pros:
- Better isolation than shared schema
- Easier per-tenant customization
- Simpler queries (no TenantID filter)
Cons:
- Schema management complexity
- Limited scalability
- Backup/restore complexity
C. Separate Database (Database per Tenant)
-- Tenant1DB
CREATE DATABASE Tenant_1_DB;
USE Tenant_1_DB;
CREATE TABLE Customers (...);
-- Tenant2DB
CREATE DATABASE Tenant_2_DB;
USE Tenant_2_DB;
CREATE TABLE Customers (...);
-- Connection string routing
string connectionString = GetConnectionString(tenantId);
Pros:
- Maximum isolation and security
- Easy per-tenant customization
- Independent scaling
- Simplified compliance (data residency)
Cons:
- Highest cost
- Complex maintenance
- Resource intensive
Key Considerations:
- Security & Isolation:
-- Implement Row-Level Security
CREATE FUNCTION dbo.fn_TenantAccessPredicate(@TenantID INT)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN SELECT 1 AS result
WHERE @TenantID = CAST(SESSION_CONTEXT(N'TenantID') AS INT);
- Performance:
-- Partition by TenantID for large tables
CREATE PARTITION FUNCTION PF_TenantID (INT)
AS RANGE RIGHT FOR VALUES (1, 2, 3, 4, 5);
CREATE PARTITION SCHEME PS_TenantID
AS PARTITION PF_TenantID ALL TO ([PRIMARY]);
CREATE TABLE Orders (
OrderID INT,
TenantID INT,
OrderDate DATE,
Amount DECIMAL(10,2)
) ON PS_TenantID(TenantID);
- Scalability:
- Implement sharding strategy
- Use read replicas for reporting
Consider hybrid approach (small tenants shared, large tenants isolated)
Compliance:
Data residency requirements
GDPR, HIPAA considerations
Audit logging per tenant
Recommended Hybrid Approach:
-- Metadata database (shared)
CREATE TABLE Tenants (
TenantID INT PRIMARY KEY,
TenantName VARCHAR(100),
TierLevel VARCHAR(20), -- 'Free', 'Premium', 'Enterprise'
DatabaseName VARCHAR(100), -- NULL for shared, specific for isolated
IsActive BIT
);
-- Small tenants: Shared schema with TenantID
-- Large tenants: Separate database
3. Explain Slowly Changing Dimensions (SCD). Describe Type 1, 2, and 3 with practical examples.
Answer:
Type 1 SCD - Overwrite (No History)
Simply update the existing record, losing historical data.
-- Initial state
CustomerID | CustomerName | City | State
1 | John Doe | Boston | MA
-- Customer moves to New York
UPDATE DimCustomer
SET City = 'New York',
State = 'NY',
LastModifiedDate = GETDATE()
WHERE CustomerID = 1;
-- Result (history lost)
CustomerID | CustomerName | City | State
1 | John Doe | New York | NY
Use Cases:
- Correcting data errors
- Attributes that don't require history (e.g., email format changes)
- Non-critical dimensional attributes
Implementation:
CREATE PROCEDURE usp_UpdateCustomerType1
@CustomerID INT,
@NewCity VARCHAR(50),
@NewState VARCHAR(2)
AS
BEGIN
UPDATE DimCustomer
SET City = @NewCity,
State = @NewState,
LastModifiedDate = GETDATE(),
ModifiedBy = SYSTEM_USER
WHERE CustomerID = @CustomerID;
END;
Type 2 SCD - Add New Row (Full History)
Create a new record for each change, maintaining complete history.
-- Table structure
CREATE TABLE DimCustomer (
CustomerKey INT IDENTITY(1,1) PRIMARY KEY, -- Surrogate key
CustomerID INT NOT NULL, -- Natural key
CustomerName VARCHAR(100),
City VARCHAR(50),
State VARCHAR(2),
EffectiveDate DATE NOT NULL,
EndDate DATE NULL,
IsCurrent BIT NOT NULL DEFAULT 1,
RowVersion INT NOT NULL DEFAULT 1
);
-- Initial record
CustomerKey | CustomerID | CustomerName | City | State | EffectiveDate | EndDate | IsCurrent
1 | 1 | John Doe | Boston | MA | 2020-01-01 | NULL | 1
-- Customer moves (insert new record, update old)
BEGIN TRANSACTION;
-- Close old record
UPDATE DimCustomer
SET EndDate = '2023-06-30',
IsCurrent = 0
WHERE CustomerID = 1 AND IsCurrent = 1;
-- Insert new record
INSERT INTO DimCustomer (CustomerID, CustomerName, City, State, EffectiveDate, IsCurrent, RowVersion)
VALUES (1, 'John Doe', 'New York', 'NY', '2023-07-01', 1, 2);
COMMIT TRANSACTION;
-- Result
CustomerKey | CustomerID | CustomerName | City | State | EffectiveDate | EndDate | IsCurrent | RowVersion
1 | 1 | John Doe | Boston | MA | 2020-01-01 | 2023-06-30 | 0 | 1
2 | 1 | John Doe | New York | NY | 2023-07-01 | NULL | 1 | 2
Querying Type 2:
-- Get current record
SELECT * FROM DimCustomer
WHERE CustomerID = 1 AND IsCurrent = 1;
-- Get historical record at specific date
SELECT * FROM DimCustomer
WHERE CustomerID = 1
AND '2023-03-15' BETWEEN EffectiveDate AND COALESCE(EndDate, '9999-12-31');
-- Join fact table with dimension (point-in-time)
SELECT
f.SalesDate,
f.Amount,
d.City,
d.State
FROM FactSales f
INNER JOIN DimCustomer d
ON f.CustomerKey = d.CustomerKey -- Use surrogate key
WHERE f.SalesDate BETWEEN d.EffectiveDate AND COALESCE(d.EndDate, '9999-12-31');
Use Cases:
- Customer address changes
- Product price changes
- Employee department transfers
- Any attribute requiring full audit trail
Type 3 SCD - Add New Column (Limited History)
Add columns to track previous value(s).
CREATE TABLE DimCustomer (
CustomerID INT PRIMARY KEY,
CustomerName VARCHAR(100),
CurrentCity VARCHAR(50),
CurrentState VARCHAR(2),
PreviousCity VARCHAR(50),
PreviousState VARCHAR(2),
EffectiveDate DATE,
PreviousEffectiveDate DATE
);
-- Initial state
CustomerID | CustomerName | CurrentCity | CurrentState | PreviousCity | PreviousState
1 | John Doe | Boston | MA | NULL | NULL
-- Customer moves
UPDATE DimCustomer
SET PreviousCity = CurrentCity,
PreviousState = CurrentState,
PreviousEffectiveDate = EffectiveDate,
CurrentCity = 'New York',
CurrentState = 'NY',
EffectiveDate = '2023-07-01'
WHERE CustomerID = 1;
-- Result
CustomerID | CustomerName | CurrentCity | CurrentState | PreviousCity | PreviousState
1 | John Doe | New York | NY | Boston | MA
Use Cases:
- Tracking one or two previous values
- Comparing current vs. previous (e.g., sales territory changes)
- Limited history requirements
Comparison Query:
-- Analyze customers who changed cities
SELECT
CustomerID,
CustomerName,
PreviousCity,
CurrentCity,
DATEDIFF(DAY, PreviousEffectiveDate, EffectiveDate) AS DaysSinceChange
FROM DimCustomer
WHERE PreviousCity IS NOT NULL
AND PreviousCity <> CurrentCity;
Hybrid Approach (Type 2 + Type 1):
CREATE TABLE DimProduct (
ProductKey INT IDENTITY(1,1) PRIMARY KEY,
ProductID INT NOT NULL,
ProductName VARCHAR(100), -- Type 1 (overwrite)
ProductDescription TEXT, -- Type 1 (overwrite)
Price DECIMAL(10,2), -- Type 2 (track history)
Category VARCHAR(50), -- Type 2 (track history)
EffectiveDate DATE NOT NULL,
EndDate DATE NULL,
IsCurrent BIT NOT NULL
);
-- Update procedure
CREATE PROCEDURE usp_UpdateProduct
@ProductID INT,
@NewName VARCHAR(100),
@NewDescription TEXT,
@NewPrice DECIMAL(10,2),
@NewCategory VARCHAR(50)
AS
BEGIN
DECLARE @CurrentPrice DECIMAL(10,2);
DECLARE @CurrentCategory VARCHAR(50);
-- Get current values
SELECT @CurrentPrice = Price, @CurrentCategory = Category
FROM DimProduct
WHERE ProductID = @ProductID AND IsCurrent = 1;
-- Check if Type 2 attributes changed
IF @NewPrice <> @CurrentPrice OR @NewCategory <> @CurrentCategory
BEGIN
-- Type 2: Create new record
BEGIN TRANSACTION;
UPDATE DimProduct
SET EndDate = GETDATE(), IsCurrent = 0
WHERE ProductID = @ProductID AND IsCurrent = 1;
INSERT INTO DimProduct (ProductID, ProductName, ProductDescription, Price, Category, EffectiveDate, IsCurrent)
VALUES (@ProductID, @NewName, @NewDescription, @NewPrice, @NewCategory, GETDATE(), 1);
COMMIT TRANSACTION;
END
ELSE
BEGIN
-- Type 1: Update existing record
UPDATE DimProduct
SET ProductName = @NewName,
ProductDescription = @NewDescription
WHERE ProductID = @ProductID AND IsCurrent = 1;
END
END;
Performance Considerations:
-- Indexes for Type 2 SCD
CREATE INDEX IX_DimCustomer_CustomerID_IsCurrent
ON DimCustomer(CustomerID, IsCurrent)
INCLUDE (CustomerName, City, State);
CREATE INDEX IX_DimCustomer_EffectiveDate_EndDate
ON DimCustomer(EffectiveDate, EndDate)
INCLUDE (CustomerID);
-- Partitioning for large Type 2 tables
CREATE PARTITION FUNCTION PF_SCD_Date (DATE)
AS RANGE RIGHT FOR VALUES
('2020-01-01', '2021-01-01', '2022-01-01', '2023-01-01');
CREATE PARTITION SCHEME PS_SCD_Date
AS PARTITION PF_SCD_Date ALL TO ([PRIMARY]);
CREATE TABLE DimCustomer_Partitioned (
CustomerKey INT IDENTITY(1,1),
CustomerID INT NOT NULL,
CustomerName VARCHAR(100),
City VARCHAR(50),
State VARCHAR(2),
EffectiveDate DATE NOT NULL,
EndDate DATE NULL,
IsCurrent BIT NOT NULL
) ON PS_SCD_Date(EffectiveDate);
4. What is Data Vault modeling? How does it differ from traditional dimensional modeling?
Answer:
Data Vault 2.0 Components:
1. Hubs - Core Business Entities
-- Hub: Stores unique business keys
CREATE TABLE Hub_Customer (
CustomerHashKey BINARY(16) PRIMARY KEY, -- MD5/SHA hash of business key
CustomerID VARCHAR(50) NOT NULL, -- Natural business key
LoadDate DATETIME NOT NULL,
RecordSource VARCHAR(50) NOT NULL,
CONSTRAINT UQ_Hub_Customer_ID UNIQUE (CustomerID)
);
-- Example data
CustomerHashKey | CustomerID | LoadDate | RecordSource
0x1A2B3C4D5E6F7G8H9I0J1K2L3M4N | CUST001 | 2023-01-15 10:30:00 | CRM_System
2. Links - Relationships Between Hubs
-- Link: Represents many-to-many relationships
CREATE TABLE Link_CustomerOrder (
CustomerOrderHashKey BINARY(16) PRIMARY KEY,
CustomerHashKey BINARY(16) NOT NULL,
OrderHashKey BINARY(16) NOT NULL,
LoadDate DATETIME NOT NULL,
RecordSource VARCHAR(50) NOT NULL,
CONSTRAINT FK_Link_Customer FOREIGN KEY (CustomerHashKey)
REFERENCES Hub_Customer(CustomerHashKey),
CONSTRAINT FK_Link_Order FOREIGN KEY (OrderHashKey)
REFERENCES Hub_Order(OrderHashKey)
);
3. Satellites - Descriptive Attributes
-- Satellite: Stores descriptive attributes and tracks history
CREATE TABLE Sat_Customer (
CustomerHashKey BINARY(16) NOT NULL,
LoadDate DATETIME NOT NULL,
LoadEndDate DATETIME NULL,
CustomerName VARCHAR(100),
Email VARCHAR(100),
Phone VARCHAR(20),
Address VARCHAR(200),
City VARCHAR(50),
State VARCHAR(2),
HashDiff BINARY(16) NOT NULL, -- Hash of all attributes
RecordSource VARCHAR(50) NOT NULL,
PRIMARY KEY (CustomerHashKey, LoadDate),
CONSTRAINT FK_Sat_Customer FOREIGN KEY (CustomerHashKey)
REFERENCES Hub_Customer(CustomerHashKey)
);
-- Multiple satellites for different change rates
CREATE TABLE Sat_Customer_ContactInfo (
CustomerHashKey BINARY(16) NOT NULL,
LoadDate DATETIME NOT NULL,
Email VARCHAR(100),
Phone VARCHAR(20),
HashDiff BINARY(16) NOT NULL,
PRIMARY KEY (CustomerHashKey, LoadDate)
);
CREATE TABLE Sat_Customer_Demographics (
CustomerHashKey BINARY(16) NOT NULL,
LoadDate DATETIME NOT NULL,
DateOfBirth DATE,
Gender VARCHAR(10),
MaritalStatus VARCHAR(20),
HashDiff BINARY(16) NOT NULL,
PRIMARY KEY (CustomerHashKey, LoadDate)
);
Complete Example:
-- Hub: Customer
CREATE TABLE Hub_Customer (
CustomerHashKey BINARY(16) PRIMARY KEY,
CustomerID VARCHAR(50) NOT NULL UNIQUE,
LoadDate DATETIME NOT NULL,
RecordSource VARCHAR(50) NOT NULL
);
-- Hub: Product
CREATE TABLE Hub_Product (
ProductHashKey BINARY(16) PRIMARY KEY,
ProductID VARCHAR(50) NOT NULL UNIQUE,
LoadDate DATETIME NOT NULL,
RecordSource VARCHAR(50) NOT NULL
);
-- Link: Customer-Product (Purchase)
CREATE TABLE Link_Purchase (
PurchaseHashKey BINARY(16) PRIMARY KEY,
CustomerHashKey BINARY(16) NOT NULL,
ProductHashKey BINARY(16) NOT NULL,
LoadDate DATETIME NOT NULL,
RecordSource VARCHAR(50) NOT NULL
);
-- Satellite: Purchase Details
CREATE TABLE Sat_Purchase (
PurchaseHashKey BINARY(16) NOT NULL,
LoadDate DATETIME NOT NULL,
LoadEndDate DATETIME NULL,
PurchaseDate DATE,
Quantity INT,
UnitPrice DECIMAL(10,2),
TotalAmount DECIMAL(10,2),
HashDiff BINARY(16) NOT NULL,
RecordSource VARCHAR(50) NOT NULL,
PRIMARY KEY (PurchaseHashKey, LoadDate)
);
-- Loading data with hash keys
INSERT INTO Hub_Customer (CustomerHashKey, CustomerID, LoadDate, RecordSource)
SELECT
HASHBYTES('MD5', CustomerID) AS CustomerHashKey,
CustomerID,
GETDATE(),
'CRM_System'
FROM SourceCustomers
WHERE NOT EXISTS (
SELECT 1 FROM Hub_Customer h
WHERE h.CustomerID = SourceCustomers.CustomerID
);
Data Vault vs. Dimensional Modeling:
| Aspect | Data Vault | Dimensional (Kimball) |
|---|---|---|
| Purpose | Enterprise data warehouse, raw data vault | Business intelligence, analytics |
| Structure | Hubs, Links, Satellites | Facts, Dimensions |
| Normalization | Highly normalized | Denormalized (Star) or Normalized (Snowflake) |
| History | Full history by design | Type 2 SCD for history |
| Flexibility | Highly flexible, easy to add sources | Less flexible, schema changes complex |
| Query Performance | Slower (many joins) | Faster (fewer joins) |
| Load Performance | Very fast (parallel loads) | Slower (lookups, updates) |
| Auditability | Excellent (full lineage) | Limited |
| Business Keys | Preserved | Surrogate keys |
| Use Case | Data integration layer | Presentation layer |
Practical Implementation:
-- ETL Process for Data Vault
-- Step 1: Load Hub (if not exists)
MERGE INTO Hub_Customer AS target
USING (
SELECT DISTINCT
HASHBYTES('MD5', CustomerID) AS CustomerHashKey,
CustomerID,
GETDATE() AS LoadDate,
'CRM_System' AS RecordSource
FROM Staging_Customers
) AS source
ON target.CustomerHashKey = source.CustomerHashKey
WHEN NOT MATCHED THEN
INSERT (CustomerHashKey, CustomerID, LoadDate, RecordSource)
VALUES (source.CustomerHashKey, source.CustomerID, source.LoadDate, source.RecordSource);
-- Step 2: Load Satellite (track changes)
INSERT INTO Sat_Customer (CustomerHashKey, LoadDate, CustomerName, Email, Phone, HashDiff, RecordSource)
SELECT
h.CustomerHashKey,
GETDATE() AS LoadDate,
s.CustomerName,
s.Email,
s.Phone,
HASHBYTES('MD5', CONCAT(s.CustomerName, s.Email, s.Phone)) AS HashDiff,
'CRM_System' AS RecordSource
FROM Staging_Customers s
INNER JOIN Hub_Customer h ON s.CustomerID = h.CustomerID
WHERE NOT EXISTS (
SELECT 1
FROM Sat_Customer sat
WHERE sat.CustomerHashKey = h.CustomerHashKey
AND sat.HashDiff = HASHBYTES('MD5', CONCAT(s.CustomerName, s.Email, s.Phone))
AND sat.LoadEndDate IS NULL
);
-- Step 3: Close previous satellite records
UPDATE sat
SET LoadEndDate = GETDATE()
FROM Sat_Customer sat
INNER JOIN (
SELECT CustomerHashKey, MAX(LoadDate) AS MaxLoadDate
FROM Sat_Customer
WHERE LoadEndDate IS NULL
GROUP BY CustomerHashKey
) latest ON sat.CustomerHashKey = latest.CustomerHashKey
AND sat.LoadDate < latest.MaxLoadDate
WHERE sat.LoadEndDate IS NULL;
Querying Data Vault (Point-in-Time):
-- Create view for business users (Information Mart)
CREATE VIEW vw_Customer_Current AS
SELECT
h.CustomerID,
s.CustomerName,
s.Email,
s.Phone,
s.LoadDate AS EffectiveDate
FROM Hub_Customer h
INNER JOIN Sat_Customer s ON h.CustomerHashKey = s.CustomerHashKey
WHERE s.LoadEndDate IS NULL; -- Current records only
-- Historical query
CREATE FUNCTION fn_GetCustomerAtDate(@AsOfDate DATETIME)
RETURNS TABLE
AS
RETURN
(
SELECT
h.CustomerID,
s.CustomerName,
s.Email,
s.Phone
FROM Hub_Customer h
INNER JOIN Sat_Customer s ON h.CustomerHashKey = s.CustomerHashKey
WHERE @AsOfDate >= s.LoadDate
AND (@AsOfDate < s.LoadEndDate OR s.LoadEndDate IS NULL)
);
-- Usage
SELECT * FROM fn_GetCustomerAtDate('2023-06-15');
Advantages of Data Vault:
- Auditability: Complete lineage and history
- Flexibility: Easy to add new data sources
- Parallel Loading: Hubs, Links, Satellites load independently
- No Data Loss: All source data preserved
- Agile: Incremental development
Disadvantages:
- Complexity: Steep learning curve
- Query Performance: Many joins required
- Storage: More storage than dimensional models
- Tooling: Limited BI tool support
Best Practice - Hybrid Architecture:
Source Systems → Data Vault (Raw Vault) → Information Marts (Dimensional) → BI Tools
5. How would you handle hierarchical data in a relational database? Compare different approaches.
Answer:
Approach 1: Adjacency List (Parent-Child)
Most common and simplest approach.
CREATE TABLE Employees (
EmployeeID INT PRIMARY KEY,
EmployeeName VARCHAR(100),
ManagerID INT NULL,
Title VARCHAR(50),
Salary DECIMAL(10,2),
CONSTRAINT FK_Manager FOREIGN KEY (ManagerID)
REFERENCES Employees(EmployeeID)
);
-- Sample data
INSERT INTO Employees VALUES
(1, 'John CEO', NULL, 'CEO', 200000),
(2, 'Jane VP', 1, 'VP Sales', 150000),
(3, 'Bob VP', 1, 'VP Engineering', 150000),
(4, 'Alice Manager', 2, 'Sales Manager', 100000),
(5, 'Charlie Manager', 3, 'Engineering Manager', 100000),
(6, 'David Rep', 4, 'Sales Rep', 60000),
(7, 'Eve Engineer', 5, 'Senior Engineer', 90000);
-- Query: Get direct reports
SELECT
e.EmployeeID,
e.EmployeeName,
e.Title,
m.EmployeeName AS ManagerName
FROM Employees e
LEFT JOIN Employees m ON e.ManagerID = m.EmployeeID;
-- Query: Get all subordinates (Recursive CTE)
WITH EmployeeHierarchy AS (
-- Anchor: Start with a specific manager
SELECT
EmployeeID,
EmployeeName,
ManagerID,
Title,
Salary,
0 AS Level,
CAST(EmployeeID AS VARCHAR(MAX)) AS HierarchyPath
FROM Employees
WHERE EmployeeID = 1 -- CEO
UNION ALL
-- Recursive: Get subordinates
SELECT
e.EmployeeID,
e.EmployeeName,
e.ManagerID,
e.Title,
e.Salary,
eh.Level + 1,
CAST(eh.HierarchyPath + '/' + CAST(e.EmployeeID AS VARCHAR(10)) AS VARCHAR(MAX))
FROM Employees e
INNER JOIN EmployeeHierarchy eh ON e.ManagerID = eh.EmployeeID
)
SELECT
REPLICATE(' ', Level) + EmployeeName AS OrgChart,
Level,
Title,
Salary,
HierarchyPath
FROM EmployeeHierarchy
ORDER BY HierarchyPath;
-- Result:
-- John CEO (Level 0)
-- Jane VP (Level 1)
-- Alice Manager (Level 2)
-- David Rep (Level 3)
-- Bob VP (Level 1)
-- Charlie Manager (Level 2)
-- Eve Engineer (Level 3)
Pros:
- Simple to understand and implement
- Easy to insert/update/delete
- Referential integrity enforced
Cons:
- Recursive queries needed for hierarchy traversal
- Performance issues with deep hierarchies
- No easy way to get all ancestors/descendants
Approach 2: Nested Sets (Modified Preorder Tree Traversal)
Stores left and right boundaries for each node.
CREATE TABLE Categories (
CategoryID INT PRIMARY KEY,
CategoryName VARCHAR(100),
LeftValue INT NOT NULL,
RightValue INT NOT NULL,
CONSTRAINT UQ_Left UNIQUE (LeftValue),
CONSTRAINT UQ_Right UNIQUE (RightValue)
);
-- Sample data (Electronics hierarchy)
INSERT INTO Categories VALUES
(1, 'Electronics', 1, 20),
(2, 'Computers', 2, 9),
(3, 'Laptops', 3, 4),
(4, 'Desktops', 5, 6),
(5, 'Tablets', 7, 8),
(6, 'TVs', 10, 15),
(7, 'LED', 11, 12),
(8, 'OLED', 13, 14),
(9, 'Audio', 16, 19),
(10, 'Headphones', 17, 18);
-- Query: Get all descendants of a node
SELECT
child.CategoryName,
(COUNT(parent.CategoryID) - 1) AS Depth
FROM Categories AS child
CROSS JOIN Categories AS parent
WHERE child.LeftValue BETWEEN parent.LeftValue AND parent.RightValue
AND parent.CategoryName = 'Electronics'
GROUP BY child.CategoryName, child.LeftValue
ORDER BY child.LeftValue;
-- Query: Get path to root
SELECT
parent.CategoryName
FROM Categories AS child
CROSS JOIN Categories AS parent
WHERE child.LeftValue BETWEEN parent.LeftValue AND parent.RightValue
AND child.CategoryName = 'Laptops'
ORDER BY parent.LeftValue;
-- Query: Get immediate children
SELECT
child.CategoryName
FROM Categories AS parent
INNER JOIN Categories AS child
ON child.LeftValue > parent.LeftValue
AND child.RightValue < parent.RightValue
WHERE parent.CategoryName = 'Computers'
AND NOT EXISTS (
SELECT 1
FROM Categories AS intermediate
WHERE intermediate.LeftValue > parent.LeftValue
AND intermediate.RightValue < parent.RightValue
AND child.LeftValue > intermediate.LeftValue
AND child.RightValue < intermediate.RightValue
);
-- Insert new node (complex)
-- To insert "Gaming Laptops" under "Laptops" (CategoryID 3)
DECLARE @ParentLeft INT = 3;
DECLARE @ParentRight INT = 4;
-- Make space
UPDATE Categories
SET RightValue = RightValue + 2
WHERE RightValue >= @ParentRight;
UPDATE Categories
SET LeftValue = LeftValue + 2
WHERE LeftValue > @ParentRight;
-- Insert new node
INSERT INTO Categories VALUES
(11, 'Gaming Laptops', @ParentRight, @ParentRight + 1);
Pros:
- Very fast reads (no recursion)
- Easy to get all descendants/ancestors
- Efficient for read-heavy workloads
Cons:
- Complex inserts/updates/deletes
- Difficult to maintain
- Hard to understand
- Concurrent modifications problematic
Approach 3: Closure Table (Transitive Closure)
Stores all ancestor-descendant relationships.
CREATE TABLE Categories (
CategoryID INT PRIMARY KEY,
CategoryName VARCHAR(100)
);
CREATE TABLE CategoryClosure (
AncestorID INT NOT NULL,
DescendantID INT NOT NULL,
Depth INT NOT NULL,
PRIMARY KEY (AncestorID, DescendantID),
CONSTRAINT FK_Ancestor FOREIGN KEY (AncestorID)
REFERENCES Categories(CategoryID),
CONSTRAINT FK_Descendant FOREIGN KEY (DescendantID)
REFERENCES Categories(CategoryID)
);
-- Sample data
INSERT INTO Categories VALUES
(1, 'Electronics'),
(2, 'Computers'),
(3, 'Laptops'),
(4, 'Desktops'),
(5, 'TVs');
-- Insert closure records (including self-references)
INSERT INTO CategoryClosure VALUES
-- Self-references
(1, 1, 0), (2, 2, 0), (3, 3, 0), (4, 4, 0), (5, 5, 0),
-- Electronics -> Computers
(1, 2, 1),
-- Electronics -> Laptops (through Computers)
(1, 3, 2), (2, 3, 1),
-- Electronics -> Desktops (through Computers)
(1, 4, 2), (2, 4, 1),
-- Electronics -> TVs
(1, 5, 1);
-- Query: Get all descendants
SELECT
c.CategoryName,
cc.Depth
FROM CategoryClosure cc
INNER JOIN Categories c ON cc.DescendantID = c.CategoryID
WHERE cc.AncestorID = 1 -- Electronics
AND cc.Depth > 0
ORDER BY cc.Depth, c.CategoryName;
-- Query: Get all ancestors
SELECT
c.CategoryName,
cc.Depth
FROM CategoryClosure cc
INNER JOIN Categories c ON cc.AncestorID = c.CategoryID
WHERE cc.DescendantID = 3 -- Laptops
AND cc.Depth > 0
ORDER BY cc.Depth DESC;
-- Query: Get immediate children
SELECT
c.CategoryName
FROM CategoryClosure cc
INNER JOIN Categories c ON cc.DescendantID = c.CategoryID
WHERE cc.AncestorID = 2 -- Computers
AND cc.Depth = 1;
-- Insert new node (easier than Nested Sets)
-- Add "Gaming Laptops" under "Laptops" (CategoryID 3)
DECLARE @NewCategoryID INT = 6;
DECLARE @ParentID INT = 3;
-- Insert into Categories
INSERT INTO Categories VALUES (@NewCategoryID, 'Gaming Laptops');
-- Insert closure records
-- 1. Self-reference
INSERT INTO CategoryClosure VALUES (@NewCategoryID, @NewCategoryID, 0);
-- 2. Copy all ancestors of parent
INSERT INTO CategoryClosure (AncestorID, DescendantID, Depth)
SELECT
AncestorID,
@NewCategoryID,
Depth + 1
FROM CategoryClosure
WHERE DescendantID = @ParentID;
-- Delete node and all descendants
DELETE FROM CategoryClosure
WHERE DescendantID IN (
SELECT DescendantID
FROM CategoryClosure
WHERE AncestorID = @CategoryIDToDelete
);
DELETE FROM Categories
WHERE CategoryID IN (
SELECT DescendantID
FROM CategoryClosure
WHERE AncestorID = @CategoryIDToDelete
);
Pros:
- Fast reads (no recursion)
- Easier inserts/deletes than Nested Sets
- Can query any relationship easily
- Good for complex hierarchies
Cons:
- More storage (O(n²) in worst case)
- More complex than Adjacency List
- Redundant data
Approach 4: Materialized Path (Path Enumeration)
Store the full path from root to node.
CREATE TABLE Categories (
CategoryID INT PRIMARY KEY,
CategoryName VARCHAR(100),
PathString VARCHAR(500) NOT NULL, -- e.g., '/1/2/3/'
Level INT NOT NULL,
CONSTRAINT UQ_Path UNIQUE (PathString)
);
CREATE INDEX IX_Categories_Path ON Categories(PathString);
-- Sample data
INSERT INTO Categories VALUES
(1, 'Electronics', '/1/', 0),
(2, 'Computers', '/1/2/', 1),
(3, 'Laptops', '/1/2/3/', 2),
(4, 'Desktops', '/1/2/4/', 2),
(5, 'TVs', '/1/5/', 1),
(6, 'LED TVs', '/1/5/6/', 2);
-- Query: Get all descendants
SELECT
CategoryID,
CategoryName,
Level,
PathString
FROM Categories
WHERE PathString LIKE '/1/%' -- All under Electronics
AND CategoryID <> 1
ORDER BY PathString;
-- Query: Get all ancestors
SELECT
CategoryID,
CategoryName,
Level
FROM Categories
WHERE '/1/2/3/' LIKE PathString + '%' -- Ancestors of Laptops
AND CategoryID <> 3
ORDER BY Level;
-- Query: Get immediate children
SELECT
CategoryID,
CategoryName
FROM Categories
WHERE PathString LIKE '/1/2/%' -- Children of Computers
AND Level = 2;
-- Query: Get siblings
SELECT
CategoryID,
CategoryName
FROM Categories c1
WHERE EXISTS (
SELECT 1
FROM Categories c2
WHERE c2.CategoryID = 3 -- Laptops
AND c1.PathString LIKE SUBSTRING(c2.PathString, 1,
LEN(c2.PathString) - CHARINDEX('/', REVERSE(c2.PathString), 2)) + '%'
AND c1.Level = c2.Level
AND c1.CategoryID <> c2.CategoryID
);
-- Insert new node
DECLARE @ParentPath VARCHAR(500) = '/1/2/';
DECLARE @NewID INT = 7;
INSERT INTO Categories VALUES
(@NewID, 'Gaming Laptops', @ParentPath + CAST(@NewID AS VARCHAR(10)) + '/', 2);
-- Move subtree (change parent)
DECLARE @OldPath VARCHAR(500) = '/1/2/3/';
DECLARE @NewPath VARCHAR(500) = '/1/5/3/';
UPDATE Categories
SET PathString = REPLACE(PathString, @OldPath, @NewPath),
Level = Level + (LEN(@NewPath) - LEN(@OldPath)) / 2
WHERE PathString LIKE @OldPath + '%';
Pros:
- Simple to understand
- Fast reads with LIKE operator
- Easy to get ancestors/descendants
- Can use indexes effectively
Cons:
- Path length limitation
- Updates can be expensive (moving subtrees)
- Redundant data storage
Comparison Table:
| Feature | Adjacency List | Nested Sets | Closure Table | Materialized Path |
|---|---|---|---|---|
| Simplicity | ⭐⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| Read Performance | ⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ |
| Write Performance | ⭐⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| Storage | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐ |
| Referential Integrity | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
| Move Subtree | ⭐⭐⭐⭐⭐ | ⭐ | ⭐⭐⭐ | ⭐⭐ |
Recommendation:
- Adjacency List: Default choice for most applications
- Closure Table: When you need fast reads and can afford storage
- Materialized Path: Good balance, especially with limited depth
- Nested Sets: Only for read-heavy, rarely-updated hierarchies
Hybrid Approach:
-- Combine Adjacency List with Materialized Path
CREATE TABLE Categories (
CategoryID INT PRIMARY KEY,
CategoryName VARCHAR(100),
ParentID INT NULL, -- Adjacency List
PathString VARCHAR(500), -- Materialized Path
Level INT,
LeftValue INT, -- Optional: Add Nested Sets for specific queries
RightValue INT,
CONSTRAINT FK_Parent FOREIGN KEY (ParentID)
REFERENCES Categories(CategoryID)
);
This gives you flexibility to use the best approach for each query type.
Top comments (0)