Anthony Gicheru

Posted on May 17

Slowly Changing Dimensions Explained: How Data Warehouses Keep History Accurate

#sql #datawarehouse #dataengineering #scd

1. Why Slowly Changing Dimensions Matter

In data engineering, not all data changes the same way.

Some data changes constantly, like transactions, clicks, payments, and sensor readings. These are usually facts: events that happen at a specific point in time.

But other data changes slowly.

A customer changes their address.
A product changes category.
An employee moves to a new department.
A supplier changes region.
A user upgrades from a free plan to a premium plan.

These changes do not happen every second, but when they happen, they matter a lot.

Imagine you are building a sales report. A customer originally lived in Nairobi, then moved to Mombasa. If you simply update the customer record, all their historical sales may suddenly appear as if they happened in Mombasa.

That is a problem.

The business may ask:

“How much revenue did we make from Nairobi customers last year?”

But if you overwrote the customer’s location, your report may give the wrong answer.

This is the exact problem Slowly Changing Dimensions solve.

Slowly Changing Dimensions help data teams manage changes in descriptive data over time while keeping analytics accurate.

2. What Is a Slowly Changing Dimension?

A Slowly Changing Dimension, often shortened to SCD, is a technique used in data warehousing to manage changes in dimension tables over time.

A dimension table stores descriptive information.

For example:

customer_id	customer_name	city	customer_type
101	Mary Wanjiku	Nairobi	Regular

This is not a transaction. It describes the customer.

Now imagine Mary moves from Nairobi to Kisumu.

The question becomes:

Should we overwrite Nairobi with Kisumu, or should we keep a history of both?

That decision is what SCD is all about.

This is where Slowly Changing Dimensions become useful.

They give data teams a structured way to decide how changes should be stored.

Sometimes we only care about the latest value.

Sometimes we want to preserve the original value.

Sometimes we need the full history of every meaningful change.

And sometimes we only need a simple previous-and-current comparison.

3. How Slowly Changing Dimensions Work

In a data warehouse, data is usually organized into fact tables and dimension tables.

Fact Tables

Fact tables store business events.

Examples:

Sales
Orders
Payments
Website clicks
Deliveries

A sales fact table might look like this:

sale_id	customer_key	product_key	amount	sale_date
5001	1	20	3000	2025-01-10

Dimension Tables

Dimension tables describe the facts.

A customer dimension might look like this:

customer_key	customer_id	name	city	customer_type
1	101	Mary Wanjiku	Nairobi	Regular

The fact table tells us what happened.

The dimension table tells us who, what, where, or how it happened.

The challenge is that dimension data changes.

When Mary moves from Nairobi to Kisumu, we need to decide how to store that change.

There are different SCD types, but the most commonly used are:

SCD Type 0
SCD Type 1
SCD Type 2
SCD Type 3

Let’s go through them practically.

4. SCD Type 0: Keep the Original Value

SCD Type 0 means a value does not change in the data warehouse.

Once the value is loaded, it stays the same, even if the source system changes later.

In simple terms, Type 0 says:

“Keep the original value as it was first recorded.”

In real data warehouse work, Type 0 appears often for fields that should represent the original state of something. But many teams do not always call it “SCD Type 0” explicitly.

They may simply say:

“This field should never be updated.”

Or:

“Preserve the original value.”

So conceptually, Type 0 is common. The name “Type 0” is just less commonly emphasized.

Good examples of Type 0 fields are:

Original signup date
First purchase date
Original registration country
Original acquisition channel
Original product launch date
Original employee hire date

For example, imagine Mary Wanjiku first registered as a customer while living in Kenya through an Instagram campaign.

customer_id	name	original_signup_date	original_country	acquisition_channel
101	Mary Wanjiku	2025-01-01	Kenya	Instagram Ads

Later, Mary may move cities, upgrade her customer type, or start coming through email campaigns.

But the original acquisition channel should still remain Instagram Ads.

Why?

Because it tells the business how Mary was first acquired.

If we overwrite that value, we lose the ability to answer questions like:

“Which marketing channel originally brought us our best customers?”

That is where Type 0 is useful.

It protects values that describe the original state of a record.

5. SCD Type 1: Overwrite the Old Value

SCD Type 1 is the simplest approach.

When a value changes, you overwrite the old value with the new one.

Before:

customer_id	name	city
101	Mary Wanjiku	Nairobi

After Mary moves:

customer_id	name	city
101	Mary Wanjiku	Kisumu

The old city is gone.

When Type 1 Makes Sense

SCD Type 1 is useful when history does not matter.

For example:

Fixing a spelling mistake
Correcting wrong data
Updating an email address
Updating a phone number
Correcting a product name typo

If the original value was wrong, you usually do not want to preserve it.

Example

Imagine a customer’s name was loaded as:

Mary Wanjikuuu

Then later corrected to:

Mary Wanjiku

You do not need historical tracking for the typo. You just update the record.

That is SCD Type 1.

The Risk

The risk with Type 1 is that it destroys history.

If city changes from Nairobi to Kisumu, all past reports will now treat Mary as a Kisumu customer, even if she lived in Nairobi when the sales happened.

So Type 1 is simple, but dangerous when historical accuracy matters.

6. SCD Type 2: Keep Full History

SCD Type 2 is the most important and most commonly used SCD technique in analytics.

Instead of overwriting the old record, you create a new row when important attributes change.

Before Mary moves:

customer_key	customer_id	name	city	start_date	end_date	is_current
1	101	Mary Wanjiku	Nairobi	2024-01-01	NULL	true

After Mary moves to Kisumu:

customer_key	customer_id	name	city	start_date	end_date	is_current
1	101	Mary Wanjiku	Nairobi	2024-01-01	2025-03-15	false
2	101	Mary Wanjiku	Kisumu	2025-03-15	NULL	true

Notice something important.

The customer_id stays the same because it represents the real-world customer.

But the customer_key changes because each historical version gets its own unique warehouse key.

This is usually called a surrogate key.

Why This Matters

Now, if Mary made a purchase while living in Nairobi, the fact table can point to the Nairobi version of her customer record.

If she made another purchase after moving to Kisumu, that sale can point to the Kisumu version.

This allows historical reports to stay accurate.

Example Fact Table

sale_id	customer_key	amount	sale_date
5001	1	3000	2025-02-10
5002	2	4500	2025-04-20

The first sale belongs to Mary when she was in Nairobi.

The second sale belongs to Mary when she was in Kisumu.

That is the power of SCD Type 2.

7. SCD Type 3: Store Limited History in Columns

SCD Type 3 keeps limited history by adding extra columns.

For example:

customer_id	name	current_city	previous_city
101	Mary Wanjiku	Kisumu	Nairobi

This lets you see the current value and one previous value.

When Type 3 Makes Sense

SCD Type 3 is useful when you only care about a small amount of history.

For example:

Previous region and current region
Previous plan and current plan
Previous department and current department

But it does not scale well if changes happen many times.

What happens if Mary moves from Nairobi to Kisumu, then Nakuru, then Eldoret?

You would need more columns:

previous_city_1
previous_city_2
previous_city_3

That becomes messy quickly.

So Type 3 is useful, but only for very specific cases.

8. Practical Example in a Data Warehousing Project

Let’s say you are building a sales analytics warehouse for an e-commerce company.

You have data coming from:

PostgreSQL for application data
Kafka for order events
Airflow for orchestration
dbt for transformations
Snowflake, BigQuery, Redshift, or PostgreSQL as the warehouse

Your source customer table in PostgreSQL looks like this:

customer_id	name	city	customer_type	acquisition_channel	updated_at
101	Mary Wanjiku	Nairobi	Regular	Instagram Ads	2025-01-01

Later, the same customer changes:

customer_id	name	city	customer_type	acquisition_channel	updated_at
101	Mary Wanjiku	Kisumu	Premium	Instagram Ads	2025-03-15

Now the data team must decide:

Do we overwrite the old record?

Or do we preserve the old version?

And what do we do with the original acquisition channel?

In this example:

acquisition_channel can be treated as Type 0 because it represents how Mary was originally acquired.
city and customer_type can be treated as Type 2 because they affect historical reporting.

For analytics, this is why we often combine different SCD behaviors in the same dimension table. Some fields preserve the original value, while others keep full history.

A Type 2 customer dimension may look like this:

customer_key	customer_id	name	city	customer_type	acquisition_channel	valid_from	valid_to	is_current
1	101	Mary Wanjiku	Nairobi	Regular	Instagram Ads	2025-01-01	2025-03-15	false
2	101	Mary Wanjiku	Kisumu	Premium	Instagram Ads	2025-03-15	NULL	true

Now your reports can answer questions like:

How many sales came from Nairobi customers in February?
How much revenue came from Premium customers after March?
What was the customer type at the time of purchase?
How many customers upgraded from Regular to Premium?

Without SCD Type 2, these questions become difficult or inaccurate.

9. A Simple SCD Type 2 Flow

A typical SCD Type 2 pipeline works like this:

Step 1: Load the Latest Source Data

You extract the latest customer data from the source system.

This could come from PostgreSQL, an API, a CSV file, or CDC events from Kafka.

Step 2: Compare Source Data With Current Dimension Records

You compare the incoming record with the current active record in the warehouse.

For example, compare:

source.city
source.customer_type

against:

dim_customer.city
dim_customer.customer_type

Step 3: Detect Changes

If nothing changed, do nothing.

If important attributes changed, expire the old record.

For example:

customer_key	customer_id	city	valid_to	is_current
1	101	Nairobi	2025-03-15	false

Step 4: Insert a New Current Record

Then insert the new version:

customer_key	customer_id	city	valid_from	valid_to	is_current
2	101	Kisumu	2025-03-15	NULL	true

Step 5: Use the Correct Dimension Version in Fact Tables

When loading fact data, join the fact date to the correct dimension record using the validity period.

For example:

SELECT
    f.order_id,
    d.customer_key,
    f.order_date,
    f.amount
FROM staging_orders f
JOIN dim_customer d
    ON f.customer_id = d.customer_id
   AND f.order_date >= d.valid_from
   AND (
        f.order_date < d.valid_to
        OR d.valid_to IS NULL
   );

This ensures the order connects to the correct version of the customer.

10. Common Mistakes Beginners Make

Mistake 1: Using Type 1 When History Matters

This is probably the most common mistake.

Beginners often overwrite dimension records because it feels simple.

But later, when the business asks historical questions, the warehouse cannot answer correctly.

For example:

“What was revenue by customer region last year?”

If you overwrote all customer regions with the latest value, the report will be wrong.

How to Avoid It

Before choosing Type 1, ask:

“Will the business ever need to know what this value was in the past?”

If yes, consider Type 2.

Mistake 2: Tracking Every Column as Type 2

Not every change deserves a new historical version.

For example, do you really need a new customer dimension row when the phone number changes?

Maybe not.

If you track every small change, your dimension table can grow unnecessarily large and become harder to manage.

How to Avoid It

Classify columns carefully.

For example:

Column	SCD Type
customer_name typo fix	Type 1
city	Type 2
customer_type	Type 2
phone_number	Type 1
email	Type 1 or Type 2 depending on business need

The decision should be based on reporting needs, not just technical preference.

Mistake 3: Not Using a Surrogate Key

A big mistake is using the source system ID as the primary key for the dimension table.

For example, using customer_id as the only key.

That becomes a problem in Type 2 because the same customer can have multiple historical versions.

Better Approach

Use:

customer_id as the business key
customer_key as the warehouse surrogate key

Example:

customer_key	customer_id	city	is_current
1	101	Nairobi	false
2	101	Kisumu	true

The surrogate key uniquely identifies each version.

Mistake 4: Forgetting the Current Flag

Without an is_current column, it becomes harder to query the latest version of each record.

You would have to check for:

valid_to IS NULL

That works, but is_current makes queries easier and clearer.

A good SCD Type 2 table usually has:

valid_from
valid_to
is_current

Some teams also add:

created_at
updated_at
record_hash

Mistake 5: Poor Date Handling

SCD Type 2 depends heavily on dates.

If your valid_from and valid_to values are wrong, your historical joins will be wrong.

Common problems include:

Overlapping date ranges
Gaps between versions
Incorrect timezone handling
Using load date instead of actual business effective date
Not handling late-arriving data

How to Avoid It

Be very intentional about what your dates mean.

For example:

valid_from: when this version became valid
valid_to: when this version stopped being valid
loaded_at: when the data entered the warehouse

These are not always the same thing.

11. Best Practices for SCD

Use SCD Type 2 for Business-Critical History

If a change affects reporting, segmentation, revenue analysis, or compliance, preserve history.

Examples:

Customer region
Customer plan
Product category
Sales territory
Employee department
Account status

These are usually worth tracking with Type 2.

Use Hashing to Detect Changes

Instead of comparing many columns one by one, you can create a hash from the important attributes.

Example:

MD5(CONCAT(city, customer_type, region))

Then compare the source hash with the current dimension hash.

If the hash changes, something important changed.

This makes SCD pipelines easier to maintain, especially when there are many columns.

Keep Your SCD Logic Clear

SCD logic can become confusing quickly.

Use clear column names like:

valid_from
valid_to
is_current

Avoid unclear names like:

start
end
flag

Your future self and your teammates will thank you.

Document Which Columns Are Tracked

Do not leave SCD behavior hidden inside SQL code only.

Document which columns are:

Type 0
Type 1
Type 2
Type 3
Ignored
Used for change detection

This is especially important in team environments.

Avoid Duplicates in Current Records

For a Type 2 dimension, each business key should have only one current record.

This should be true:

One customer_id = one current record

You can test this with SQL:

SELECT
    customer_id,
    COUNT(*) AS current_records
FROM dim_customer
WHERE is_current = true
GROUP BY customer_id
HAVING COUNT(*) > 1;

If this query returns rows, your dimension has a problem.

Test for Overlapping Validity Periods

For each business key, the date ranges should not overlap.

Bad example:

customer_id	city	valid_from	valid_to
101	Nairobi	2025-01-01	2025-04-01
101	Kisumu	2025-03-15	NULL

These overlap between March 15 and April 1.

That means a sale on March 20 could match both records.

That is dangerous.

12. When to Use Slowly Changing Dimensions

Use SCD when dimension data changes and those changes affect analysis.

Good use cases include:

Customer address history
Product category history
Subscription plan changes
Employee department changes
Supplier region changes
Account status changes
Sales territory changes

SCD is especially useful in data warehouses and analytics systems where historical accuracy matters.

For example:

“Show revenue by the customer’s region at the time of purchase.”

That is a classic SCD Type 2 problem.

13. When SCD May Not Be the Best Choice

SCD is powerful, but not every situation needs it.

Do Not Use Type 2 for Every Small Change

If a field changes often and does not matter historically, Type 2 may create unnecessary complexity.

For example:

Last login timestamp
Profile picture URL
Phone number
Minor spelling corrections
Temporary status fields

For these, Type 1 may be enough.

Be Careful With Very High-Volume Changes

If a dimension changes too frequently, Type 2 can grow very fast.

At that point, you may need a different modeling approach, such as:

Event sourcing
Audit tables
Snapshot tables
Data lake versioning
Change Data Capture history

SCD is best for slowly changing descriptive attributes, not every event that happens in the system.

14. SCD in Modern Data Tools

SCD is not limited to traditional warehouses.

You can implement SCD patterns in many modern data stacks.

In dbt

dbt supports snapshots, which are commonly used to implement SCD Type 2.

A dbt snapshot can track when records change and automatically create historical versions.

In Airflow

Airflow can orchestrate SCD pipelines by scheduling extraction, staging, comparison, and dimension loading tasks.

In Spark

Spark is useful when you are handling large-scale dimension updates.

You can compare source and target datasets, detect changes, and write updated records to a lakehouse or warehouse.

In Kafka and CDC

Kafka can stream changes from source systems.

For example, using CDC tools, you can capture customer updates from PostgreSQL and send them into Kafka.

From there, you can process those changes and update your dimension tables.

In Warehouses

SCD can be implemented in:

Snowflake
BigQuery
Redshift
PostgreSQL
Databricks
SQL Server

The concept stays the same, even if the syntax differs.

15. Final Summary

Slowly Changing Dimensions help data engineers manage changes in dimension data over time.

They solve an important problem:

How do we keep analytics accurate when descriptive data changes?

The most common SCD types are:

Type	Meaning	Best For
Type 0	Keep the original value unchanged	Original values that should not change
Type 1	Overwrite old values	Corrections and non-historical changes
Type 2	Keep full history using new rows	Historical reporting
Type 3	Keep limited history in columns	Simple previous/current comparisons

In real data engineering projects, SCD Type 2 is especially important because it allows the warehouse to answer historical questions correctly.

Without SCD, reports can quietly become wrong.

A customer’s current city may overwrite their past city.
A product’s new category may rewrite old sales history.
An employee’s new department may change old performance reports.

That is why SCD matters.

The practical takeaway is simple: