1. Why Slowly Changing Dimensions Matter
In data engineering, not all data changes the same way.
Some data changes constantly, like transactions, clicks, payments, and sensor readings. These are usually facts: events that happen at a specific point in time.
But other data changes slowly.
A customer changes their address.
A product changes category.
An employee moves to a new department.
A supplier changes region.
A user upgrades from a free plan to a premium plan.
These changes do not happen every second, but when they happen, they matter a lot.
Imagine you are building a sales report. A customer originally lived in Nairobi, then moved to Mombasa. If you simply update the customer record, all their historical sales may suddenly appear as if they happened in Mombasa.
That is a problem.
The business may ask:
“How much revenue did we make from Nairobi customers last year?”
But if you overwrote the customer’s location, your report may give the wrong answer.
This is the exact problem Slowly Changing Dimensions solve.
Slowly Changing Dimensions help data teams manage changes in descriptive data over time while keeping analytics accurate.
2. What Is a Slowly Changing Dimension?
A Slowly Changing Dimension, often shortened to SCD, is a technique used in data warehousing to manage changes in dimension tables over time.
A dimension table stores descriptive information.
For example:
| customer_id | customer_name | city | customer_type |
|---|---|---|---|
| 101 | Mary Wanjiku | Nairobi | Regular |
This is not a transaction. It describes the customer.
Now imagine Mary moves from Nairobi to Kisumu.
The question becomes:
Should we overwrite Nairobi with Kisumu, or should we keep a history of both?
That decision is what SCD is all about.
This is where Slowly Changing Dimensions become useful.
They give data teams a structured way to decide how changes should be stored.
Sometimes we only care about the latest value.
Sometimes we want to preserve the original value.
Sometimes we need the full history of every meaningful change.
And sometimes we only need a simple previous-and-current comparison.
3. How Slowly Changing Dimensions Work
In a data warehouse, data is usually organized into fact tables and dimension tables.
Fact Tables
Fact tables store business events.
Examples:
- Sales
- Orders
- Payments
- Website clicks
- Deliveries
A sales fact table might look like this:
| sale_id | customer_key | product_key | amount | sale_date |
|---|---|---|---|---|
| 5001 | 1 | 20 | 3000 | 2025-01-10 |
Dimension Tables
Dimension tables describe the facts.
A customer dimension might look like this:
| customer_key | customer_id | name | city | customer_type |
|---|---|---|---|---|
| 1 | 101 | Mary Wanjiku | Nairobi | Regular |
The fact table tells us what happened.
The dimension table tells us who, what, where, or how it happened.
The challenge is that dimension data changes.
When Mary moves from Nairobi to Kisumu, we need to decide how to store that change.
There are different SCD types, but the most commonly used are:
- SCD Type 0
- SCD Type 1
- SCD Type 2
- SCD Type 3
Let’s go through them practically.
4. SCD Type 0: Keep the Original Value
SCD Type 0 means a value does not change in the data warehouse.
Once the value is loaded, it stays the same, even if the source system changes later.
In simple terms, Type 0 says:
“Keep the original value as it was first recorded.”
In real data warehouse work, Type 0 appears often for fields that should represent the original state of something. But many teams do not always call it “SCD Type 0” explicitly.
They may simply say:
“This field should never be updated.”
Or:
“Preserve the original value.”
So conceptually, Type 0 is common. The name “Type 0” is just less commonly emphasized.
Good examples of Type 0 fields are:
- Original signup date
- First purchase date
- Original registration country
- Original acquisition channel
- Original product launch date
- Original employee hire date
For example, imagine Mary Wanjiku first registered as a customer while living in Kenya through an Instagram campaign.
| customer_id | name | original_signup_date | original_country | acquisition_channel |
|---|---|---|---|---|
| 101 | Mary Wanjiku | 2025-01-01 | Kenya | Instagram Ads |
Later, Mary may move cities, upgrade her customer type, or start coming through email campaigns.
But the original acquisition channel should still remain Instagram Ads.
Why?
Because it tells the business how Mary was first acquired.
If we overwrite that value, we lose the ability to answer questions like:
“Which marketing channel originally brought us our best customers?”
That is where Type 0 is useful.
It protects values that describe the original state of a record.
5. SCD Type 1: Overwrite the Old Value
SCD Type 1 is the simplest approach.
When a value changes, you overwrite the old value with the new one.
Before:
| customer_id | name | city |
|---|---|---|
| 101 | Mary Wanjiku | Nairobi |
After Mary moves:
| customer_id | name | city |
|---|---|---|
| 101 | Mary Wanjiku | Kisumu |
The old city is gone.
When Type 1 Makes Sense
SCD Type 1 is useful when history does not matter.
For example:
- Fixing a spelling mistake
- Correcting wrong data
- Updating an email address
- Updating a phone number
- Correcting a product name typo
If the original value was wrong, you usually do not want to preserve it.
Example
Imagine a customer’s name was loaded as:
Mary Wanjikuuu
Then later corrected to:
Mary Wanjiku
You do not need historical tracking for the typo. You just update the record.
That is SCD Type 1.
The Risk
The risk with Type 1 is that it destroys history.
If city changes from Nairobi to Kisumu, all past reports will now treat Mary as a Kisumu customer, even if she lived in Nairobi when the sales happened.
So Type 1 is simple, but dangerous when historical accuracy matters.
6. SCD Type 2: Keep Full History
SCD Type 2 is the most important and most commonly used SCD technique in analytics.
Instead of overwriting the old record, you create a new row when important attributes change.
Before Mary moves:
| customer_key | customer_id | name | city | start_date | end_date | is_current |
|---|---|---|---|---|---|---|
| 1 | 101 | Mary Wanjiku | Nairobi | 2024-01-01 | NULL | true |
After Mary moves to Kisumu:
| customer_key | customer_id | name | city | start_date | end_date | is_current |
|---|---|---|---|---|---|---|
| 1 | 101 | Mary Wanjiku | Nairobi | 2024-01-01 | 2025-03-15 | false |
| 2 | 101 | Mary Wanjiku | Kisumu | 2025-03-15 | NULL | true |
Notice something important.
The customer_id stays the same because it represents the real-world customer.
But the customer_key changes because each historical version gets its own unique warehouse key.
This is usually called a surrogate key.
Why This Matters
Now, if Mary made a purchase while living in Nairobi, the fact table can point to the Nairobi version of her customer record.
If she made another purchase after moving to Kisumu, that sale can point to the Kisumu version.
This allows historical reports to stay accurate.
Example Fact Table
| sale_id | customer_key | amount | sale_date |
|---|---|---|---|
| 5001 | 1 | 3000 | 2025-02-10 |
| 5002 | 2 | 4500 | 2025-04-20 |
The first sale belongs to Mary when she was in Nairobi.
The second sale belongs to Mary when she was in Kisumu.
That is the power of SCD Type 2.
7. SCD Type 3: Store Limited History in Columns
SCD Type 3 keeps limited history by adding extra columns.
For example:
| customer_id | name | current_city | previous_city |
|---|---|---|---|
| 101 | Mary Wanjiku | Kisumu | Nairobi |
This lets you see the current value and one previous value.
When Type 3 Makes Sense
SCD Type 3 is useful when you only care about a small amount of history.
For example:
- Previous region and current region
- Previous plan and current plan
- Previous department and current department
But it does not scale well if changes happen many times.
What happens if Mary moves from Nairobi to Kisumu, then Nakuru, then Eldoret?
You would need more columns:
previous_city_1
previous_city_2
previous_city_3
That becomes messy quickly.
So Type 3 is useful, but only for very specific cases.
8. Practical Example in a Data Warehousing Project
Let’s say you are building a sales analytics warehouse for an e-commerce company.
You have data coming from:
- PostgreSQL for application data
- Kafka for order events
- Airflow for orchestration
- dbt for transformations
- Snowflake, BigQuery, Redshift, or PostgreSQL as the warehouse
Your source customer table in PostgreSQL looks like this:
| customer_id | name | city | customer_type | acquisition_channel | updated_at |
|---|---|---|---|---|---|
| 101 | Mary Wanjiku | Nairobi | Regular | Instagram Ads | 2025-01-01 |
Later, the same customer changes:
| customer_id | name | city | customer_type | acquisition_channel | updated_at |
|---|---|---|---|---|---|
| 101 | Mary Wanjiku | Kisumu | Premium | Instagram Ads | 2025-03-15 |
Now the data team must decide:
Do we overwrite the old record?
Or do we preserve the old version?
And what do we do with the original acquisition channel?
In this example:
-
acquisition_channelcan be treated as Type 0 because it represents how Mary was originally acquired. -
cityandcustomer_typecan be treated as Type 2 because they affect historical reporting.
For analytics, this is why we often combine different SCD behaviors in the same dimension table. Some fields preserve the original value, while others keep full history.
A Type 2 customer dimension may look like this:
| customer_key | customer_id | name | city | customer_type | acquisition_channel | valid_from | valid_to | is_current |
|---|---|---|---|---|---|---|---|---|
| 1 | 101 | Mary Wanjiku | Nairobi | Regular | Instagram Ads | 2025-01-01 | 2025-03-15 | false |
| 2 | 101 | Mary Wanjiku | Kisumu | Premium | Instagram Ads | 2025-03-15 | NULL | true |
Now your reports can answer questions like:
- How many sales came from Nairobi customers in February?
- How much revenue came from Premium customers after March?
- What was the customer type at the time of purchase?
- How many customers upgraded from Regular to Premium?
Without SCD Type 2, these questions become difficult or inaccurate.
9. A Simple SCD Type 2 Flow
A typical SCD Type 2 pipeline works like this:
Step 1: Load the Latest Source Data
You extract the latest customer data from the source system.
This could come from PostgreSQL, an API, a CSV file, or CDC events from Kafka.
Step 2: Compare Source Data With Current Dimension Records
You compare the incoming record with the current active record in the warehouse.
For example, compare:
source.city
source.customer_type
against:
dim_customer.city
dim_customer.customer_type
Step 3: Detect Changes
If nothing changed, do nothing.
If important attributes changed, expire the old record.
For example:
| customer_key | customer_id | city | valid_to | is_current |
|---|---|---|---|---|
| 1 | 101 | Nairobi | 2025-03-15 | false |
Step 4: Insert a New Current Record
Then insert the new version:
| customer_key | customer_id | city | valid_from | valid_to | is_current |
|---|---|---|---|---|---|
| 2 | 101 | Kisumu | 2025-03-15 | NULL | true |
Step 5: Use the Correct Dimension Version in Fact Tables
When loading fact data, join the fact date to the correct dimension record using the validity period.
For example:
SELECT
f.order_id,
d.customer_key,
f.order_date,
f.amount
FROM staging_orders f
JOIN dim_customer d
ON f.customer_id = d.customer_id
AND f.order_date >= d.valid_from
AND (
f.order_date < d.valid_to
OR d.valid_to IS NULL
);
This ensures the order connects to the correct version of the customer.
10. Common Mistakes Beginners Make
Mistake 1: Using Type 1 When History Matters
This is probably the most common mistake.
Beginners often overwrite dimension records because it feels simple.
But later, when the business asks historical questions, the warehouse cannot answer correctly.
For example:
“What was revenue by customer region last year?”
If you overwrote all customer regions with the latest value, the report will be wrong.
How to Avoid It
Before choosing Type 1, ask:
“Will the business ever need to know what this value was in the past?”
If yes, consider Type 2.
Mistake 2: Tracking Every Column as Type 2
Not every change deserves a new historical version.
For example, do you really need a new customer dimension row when the phone number changes?
Maybe not.
If you track every small change, your dimension table can grow unnecessarily large and become harder to manage.
How to Avoid It
Classify columns carefully.
For example:
| Column | SCD Type |
|---|---|
| customer_name typo fix | Type 1 |
| city | Type 2 |
| customer_type | Type 2 |
| phone_number | Type 1 |
| Type 1 or Type 2 depending on business need |
The decision should be based on reporting needs, not just technical preference.
Mistake 3: Not Using a Surrogate Key
A big mistake is using the source system ID as the primary key for the dimension table.
For example, using customer_id as the only key.
That becomes a problem in Type 2 because the same customer can have multiple historical versions.
Better Approach
Use:
-
customer_idas the business key -
customer_keyas the warehouse surrogate key
Example:
| customer_key | customer_id | city | is_current |
|---|---|---|---|
| 1 | 101 | Nairobi | false |
| 2 | 101 | Kisumu | true |
The surrogate key uniquely identifies each version.
Mistake 4: Forgetting the Current Flag
Without an is_current column, it becomes harder to query the latest version of each record.
You would have to check for:
valid_to IS NULL
That works, but is_current makes queries easier and clearer.
A good SCD Type 2 table usually has:
valid_from
valid_to
is_current
Some teams also add:
created_at
updated_at
record_hash
Mistake 5: Poor Date Handling
SCD Type 2 depends heavily on dates.
If your valid_from and valid_to values are wrong, your historical joins will be wrong.
Common problems include:
- Overlapping date ranges
- Gaps between versions
- Incorrect timezone handling
- Using load date instead of actual business effective date
- Not handling late-arriving data
How to Avoid It
Be very intentional about what your dates mean.
For example:
-
valid_from: when this version became valid -
valid_to: when this version stopped being valid -
loaded_at: when the data entered the warehouse
These are not always the same thing.
11. Best Practices for SCD
Use SCD Type 2 for Business-Critical History
If a change affects reporting, segmentation, revenue analysis, or compliance, preserve history.
Examples:
- Customer region
- Customer plan
- Product category
- Sales territory
- Employee department
- Account status
These are usually worth tracking with Type 2.
Use Hashing to Detect Changes
Instead of comparing many columns one by one, you can create a hash from the important attributes.
Example:
MD5(CONCAT(city, customer_type, region))
Then compare the source hash with the current dimension hash.
If the hash changes, something important changed.
This makes SCD pipelines easier to maintain, especially when there are many columns.
Keep Your SCD Logic Clear
SCD logic can become confusing quickly.
Use clear column names like:
valid_from
valid_to
is_current
Avoid unclear names like:
start
end
flag
Your future self and your teammates will thank you.
Document Which Columns Are Tracked
Do not leave SCD behavior hidden inside SQL code only.
Document which columns are:
- Type 0
- Type 1
- Type 2
- Type 3
- Ignored
- Used for change detection
This is especially important in team environments.
Avoid Duplicates in Current Records
For a Type 2 dimension, each business key should have only one current record.
This should be true:
One customer_id = one current record
You can test this with SQL:
SELECT
customer_id,
COUNT(*) AS current_records
FROM dim_customer
WHERE is_current = true
GROUP BY customer_id
HAVING COUNT(*) > 1;
If this query returns rows, your dimension has a problem.
Test for Overlapping Validity Periods
For each business key, the date ranges should not overlap.
Bad example:
| customer_id | city | valid_from | valid_to |
|---|---|---|---|
| 101 | Nairobi | 2025-01-01 | 2025-04-01 |
| 101 | Kisumu | 2025-03-15 | NULL |
These overlap between March 15 and April 1.
That means a sale on March 20 could match both records.
That is dangerous.
12. When to Use Slowly Changing Dimensions
Use SCD when dimension data changes and those changes affect analysis.
Good use cases include:
- Customer address history
- Product category history
- Subscription plan changes
- Employee department changes
- Supplier region changes
- Account status changes
- Sales territory changes
SCD is especially useful in data warehouses and analytics systems where historical accuracy matters.
For example:
“Show revenue by the customer’s region at the time of purchase.”
That is a classic SCD Type 2 problem.
13. When SCD May Not Be the Best Choice
SCD is powerful, but not every situation needs it.
Do Not Use Type 2 for Every Small Change
If a field changes often and does not matter historically, Type 2 may create unnecessary complexity.
For example:
- Last login timestamp
- Profile picture URL
- Phone number
- Minor spelling corrections
- Temporary status fields
For these, Type 1 may be enough.
Be Careful With Very High-Volume Changes
If a dimension changes too frequently, Type 2 can grow very fast.
At that point, you may need a different modeling approach, such as:
- Event sourcing
- Audit tables
- Snapshot tables
- Data lake versioning
- Change Data Capture history
SCD is best for slowly changing descriptive attributes, not every event that happens in the system.
14. SCD in Modern Data Tools
SCD is not limited to traditional warehouses.
You can implement SCD patterns in many modern data stacks.
In dbt
dbt supports snapshots, which are commonly used to implement SCD Type 2.
A dbt snapshot can track when records change and automatically create historical versions.
In Airflow
Airflow can orchestrate SCD pipelines by scheduling extraction, staging, comparison, and dimension loading tasks.
In Spark
Spark is useful when you are handling large-scale dimension updates.
You can compare source and target datasets, detect changes, and write updated records to a lakehouse or warehouse.
In Kafka and CDC
Kafka can stream changes from source systems.
For example, using CDC tools, you can capture customer updates from PostgreSQL and send them into Kafka.
From there, you can process those changes and update your dimension tables.
In Warehouses
SCD can be implemented in:
- Snowflake
- BigQuery
- Redshift
- PostgreSQL
- Databricks
- SQL Server
The concept stays the same, even if the syntax differs.
15. Final Summary
Slowly Changing Dimensions help data engineers manage changes in dimension data over time.
They solve an important problem:
How do we keep analytics accurate when descriptive data changes?
The most common SCD types are:
| Type | Meaning | Best For |
|---|---|---|
| Type 0 | Keep the original value unchanged | Original values that should not change |
| Type 1 | Overwrite old values | Corrections and non-historical changes |
| Type 2 | Keep full history using new rows | Historical reporting |
| Type 3 | Keep limited history in columns | Simple previous/current comparisons |
In real data engineering projects, SCD Type 2 is especially important because it allows the warehouse to answer historical questions correctly.
Without SCD, reports can quietly become wrong.
A customer’s current city may overwrite their past city.
A product’s new category may rewrite old sales history.
An employee’s new department may change old performance reports.
That is why SCD matters.
The practical takeaway is simple:
Whenever a dimension value changes, ask whether the business needs to remember the old value.
If the value should never change, Type 0 may be the right choice.
If the old value does not matter, Type 1 may be enough.
If the business needs full history, Type 2 is usually the best choice.
If the business only needs a simple previous-and-current comparison, Type 3 may work.
Top comments (0)