DEV Community

Anthony Gicheru
Anthony Gicheru

Posted on

Slowly Changing Dimensions Explained: How Data Warehouses Keep History Accurate

1. Why Slowly Changing Dimensions Matter

In data engineering, not all data changes the same way.

Some data changes constantly, like transactions, clicks, payments, and sensor readings. These are usually facts: events that happen at a specific point in time.

But other data changes slowly.

A customer changes their address.
A product changes category.
An employee moves to a new department.
A supplier changes region.
A user upgrades from a free plan to a premium plan.

These changes do not happen every second, but when they happen, they matter a lot.

Imagine you are building a sales report. A customer originally lived in Nairobi, then moved to Mombasa. If you simply update the customer record, all their historical sales may suddenly appear as if they happened in Mombasa.

That is a problem.

The business may ask:

“How much revenue did we make from Nairobi customers last year?”

But if you overwrote the customer’s location, your report may give the wrong answer.

This is the exact problem Slowly Changing Dimensions solve.

Slowly Changing Dimensions help data teams manage changes in descriptive data over time while keeping analytics accurate.


2. What Is a Slowly Changing Dimension?

A Slowly Changing Dimension, often shortened to SCD, is a technique used in data warehousing to manage changes in dimension tables over time.

A dimension table stores descriptive information.

For example:

customer_id customer_name city customer_type
101 Mary Wanjiku Nairobi Regular

This is not a transaction. It describes the customer.

Now imagine Mary moves from Nairobi to Kisumu.

The question becomes:

Should we overwrite Nairobi with Kisumu, or should we keep a history of both?

That decision is what SCD is all about.

This is where Slowly Changing Dimensions become useful.

They give data teams a structured way to decide how changes should be stored.

Sometimes we only care about the latest value.

Sometimes we want to preserve the original value.

Sometimes we need the full history of every meaningful change.

And sometimes we only need a simple previous-and-current comparison.


3. How Slowly Changing Dimensions Work

In a data warehouse, data is usually organized into fact tables and dimension tables.

Fact Tables

Fact tables store business events.

Examples:

  • Sales
  • Orders
  • Payments
  • Website clicks
  • Deliveries

A sales fact table might look like this:

sale_id customer_key product_key amount sale_date
5001 1 20 3000 2025-01-10

Dimension Tables

Dimension tables describe the facts.

A customer dimension might look like this:

customer_key customer_id name city customer_type
1 101 Mary Wanjiku Nairobi Regular

The fact table tells us what happened.

The dimension table tells us who, what, where, or how it happened.

The challenge is that dimension data changes.

When Mary moves from Nairobi to Kisumu, we need to decide how to store that change.

There are different SCD types, but the most commonly used are:

  • SCD Type 0
  • SCD Type 1
  • SCD Type 2
  • SCD Type 3

Let’s go through them practically.


4. SCD Type 0: Keep the Original Value

SCD Type 0 means a value does not change in the data warehouse.

Once the value is loaded, it stays the same, even if the source system changes later.

In simple terms, Type 0 says:

“Keep the original value as it was first recorded.”

In real data warehouse work, Type 0 appears often for fields that should represent the original state of something. But many teams do not always call it “SCD Type 0” explicitly.

They may simply say:

“This field should never be updated.”

Or:

“Preserve the original value.”

So conceptually, Type 0 is common. The name “Type 0” is just less commonly emphasized.

Good examples of Type 0 fields are:

  • Original signup date
  • First purchase date
  • Original registration country
  • Original acquisition channel
  • Original product launch date
  • Original employee hire date

For example, imagine Mary Wanjiku first registered as a customer while living in Kenya through an Instagram campaign.

customer_id name original_signup_date original_country acquisition_channel
101 Mary Wanjiku 2025-01-01 Kenya Instagram Ads

Later, Mary may move cities, upgrade her customer type, or start coming through email campaigns.

But the original acquisition channel should still remain Instagram Ads.

Why?

Because it tells the business how Mary was first acquired.

If we overwrite that value, we lose the ability to answer questions like:

“Which marketing channel originally brought us our best customers?”

That is where Type 0 is useful.

It protects values that describe the original state of a record.


5. SCD Type 1: Overwrite the Old Value

SCD Type 1 is the simplest approach.

When a value changes, you overwrite the old value with the new one.

Before:

customer_id name city
101 Mary Wanjiku Nairobi

After Mary moves:

customer_id name city
101 Mary Wanjiku Kisumu

The old city is gone.

When Type 1 Makes Sense

SCD Type 1 is useful when history does not matter.

For example:

  • Fixing a spelling mistake
  • Correcting wrong data
  • Updating an email address
  • Updating a phone number
  • Correcting a product name typo

If the original value was wrong, you usually do not want to preserve it.

Example

Imagine a customer’s name was loaded as:

Mary Wanjikuuu
Enter fullscreen mode Exit fullscreen mode

Then later corrected to:

Mary Wanjiku
Enter fullscreen mode Exit fullscreen mode

You do not need historical tracking for the typo. You just update the record.

That is SCD Type 1.

The Risk

The risk with Type 1 is that it destroys history.

If city changes from Nairobi to Kisumu, all past reports will now treat Mary as a Kisumu customer, even if she lived in Nairobi when the sales happened.

So Type 1 is simple, but dangerous when historical accuracy matters.


6. SCD Type 2: Keep Full History

SCD Type 2 is the most important and most commonly used SCD technique in analytics.

Instead of overwriting the old record, you create a new row when important attributes change.

Before Mary moves:

customer_key customer_id name city start_date end_date is_current
1 101 Mary Wanjiku Nairobi 2024-01-01 NULL true

After Mary moves to Kisumu:

customer_key customer_id name city start_date end_date is_current
1 101 Mary Wanjiku Nairobi 2024-01-01 2025-03-15 false
2 101 Mary Wanjiku Kisumu 2025-03-15 NULL true

Notice something important.

The customer_id stays the same because it represents the real-world customer.

But the customer_key changes because each historical version gets its own unique warehouse key.

This is usually called a surrogate key.

Why This Matters

Now, if Mary made a purchase while living in Nairobi, the fact table can point to the Nairobi version of her customer record.

If she made another purchase after moving to Kisumu, that sale can point to the Kisumu version.

This allows historical reports to stay accurate.

Example Fact Table

sale_id customer_key amount sale_date
5001 1 3000 2025-02-10
5002 2 4500 2025-04-20

The first sale belongs to Mary when she was in Nairobi.

The second sale belongs to Mary when she was in Kisumu.

That is the power of SCD Type 2.


7. SCD Type 3: Store Limited History in Columns

SCD Type 3 keeps limited history by adding extra columns.

For example:

customer_id name current_city previous_city
101 Mary Wanjiku Kisumu Nairobi

This lets you see the current value and one previous value.

When Type 3 Makes Sense

SCD Type 3 is useful when you only care about a small amount of history.

For example:

  • Previous region and current region
  • Previous plan and current plan
  • Previous department and current department

But it does not scale well if changes happen many times.

What happens if Mary moves from Nairobi to Kisumu, then Nakuru, then Eldoret?

You would need more columns:

previous_city_1
previous_city_2
previous_city_3
Enter fullscreen mode Exit fullscreen mode

That becomes messy quickly.

So Type 3 is useful, but only for very specific cases.


8. Practical Example in a Data Warehousing Project

Let’s say you are building a sales analytics warehouse for an e-commerce company.

You have data coming from:

  • PostgreSQL for application data
  • Kafka for order events
  • Airflow for orchestration
  • dbt for transformations
  • Snowflake, BigQuery, Redshift, or PostgreSQL as the warehouse

Your source customer table in PostgreSQL looks like this:

customer_id name city customer_type acquisition_channel updated_at
101 Mary Wanjiku Nairobi Regular Instagram Ads 2025-01-01

Later, the same customer changes:

customer_id name city customer_type acquisition_channel updated_at
101 Mary Wanjiku Kisumu Premium Instagram Ads 2025-03-15

Now the data team must decide:

Do we overwrite the old record?

Or do we preserve the old version?

And what do we do with the original acquisition channel?

In this example:

  • acquisition_channel can be treated as Type 0 because it represents how Mary was originally acquired.
  • city and customer_type can be treated as Type 2 because they affect historical reporting.

For analytics, this is why we often combine different SCD behaviors in the same dimension table. Some fields preserve the original value, while others keep full history.

A Type 2 customer dimension may look like this:

customer_key customer_id name city customer_type acquisition_channel valid_from valid_to is_current
1 101 Mary Wanjiku Nairobi Regular Instagram Ads 2025-01-01 2025-03-15 false
2 101 Mary Wanjiku Kisumu Premium Instagram Ads 2025-03-15 NULL true

Now your reports can answer questions like:

  • How many sales came from Nairobi customers in February?
  • How much revenue came from Premium customers after March?
  • What was the customer type at the time of purchase?
  • How many customers upgraded from Regular to Premium?

Without SCD Type 2, these questions become difficult or inaccurate.


9. A Simple SCD Type 2 Flow

A typical SCD Type 2 pipeline works like this:

Step 1: Load the Latest Source Data

You extract the latest customer data from the source system.

This could come from PostgreSQL, an API, a CSV file, or CDC events from Kafka.

Step 2: Compare Source Data With Current Dimension Records

You compare the incoming record with the current active record in the warehouse.

For example, compare:

source.city
source.customer_type
Enter fullscreen mode Exit fullscreen mode

against:

dim_customer.city
dim_customer.customer_type
Enter fullscreen mode Exit fullscreen mode

Step 3: Detect Changes

If nothing changed, do nothing.

If important attributes changed, expire the old record.

For example:

customer_key customer_id city valid_to is_current
1 101 Nairobi 2025-03-15 false

Step 4: Insert a New Current Record

Then insert the new version:

customer_key customer_id city valid_from valid_to is_current
2 101 Kisumu 2025-03-15 NULL true

Step 5: Use the Correct Dimension Version in Fact Tables

When loading fact data, join the fact date to the correct dimension record using the validity period.

For example:

SELECT
    f.order_id,
    d.customer_key,
    f.order_date,
    f.amount
FROM staging_orders f
JOIN dim_customer d
    ON f.customer_id = d.customer_id
   AND f.order_date >= d.valid_from
   AND (
        f.order_date < d.valid_to
        OR d.valid_to IS NULL
   );
Enter fullscreen mode Exit fullscreen mode

This ensures the order connects to the correct version of the customer.


10. Common Mistakes Beginners Make

Mistake 1: Using Type 1 When History Matters

This is probably the most common mistake.

Beginners often overwrite dimension records because it feels simple.

But later, when the business asks historical questions, the warehouse cannot answer correctly.

For example:

“What was revenue by customer region last year?”

If you overwrote all customer regions with the latest value, the report will be wrong.

How to Avoid It

Before choosing Type 1, ask:

“Will the business ever need to know what this value was in the past?”

If yes, consider Type 2.


Mistake 2: Tracking Every Column as Type 2

Not every change deserves a new historical version.

For example, do you really need a new customer dimension row when the phone number changes?

Maybe not.

If you track every small change, your dimension table can grow unnecessarily large and become harder to manage.

How to Avoid It

Classify columns carefully.

For example:

Column SCD Type
customer_name typo fix Type 1
city Type 2
customer_type Type 2
phone_number Type 1
email Type 1 or Type 2 depending on business need

The decision should be based on reporting needs, not just technical preference.


Mistake 3: Not Using a Surrogate Key

A big mistake is using the source system ID as the primary key for the dimension table.

For example, using customer_id as the only key.

That becomes a problem in Type 2 because the same customer can have multiple historical versions.

Better Approach

Use:

  • customer_id as the business key
  • customer_key as the warehouse surrogate key

Example:

customer_key customer_id city is_current
1 101 Nairobi false
2 101 Kisumu true

The surrogate key uniquely identifies each version.


Mistake 4: Forgetting the Current Flag

Without an is_current column, it becomes harder to query the latest version of each record.

You would have to check for:

valid_to IS NULL
Enter fullscreen mode Exit fullscreen mode

That works, but is_current makes queries easier and clearer.

A good SCD Type 2 table usually has:

valid_from
valid_to
is_current
Enter fullscreen mode Exit fullscreen mode

Some teams also add:

created_at
updated_at
record_hash
Enter fullscreen mode Exit fullscreen mode

Mistake 5: Poor Date Handling

SCD Type 2 depends heavily on dates.

If your valid_from and valid_to values are wrong, your historical joins will be wrong.

Common problems include:

  • Overlapping date ranges
  • Gaps between versions
  • Incorrect timezone handling
  • Using load date instead of actual business effective date
  • Not handling late-arriving data

How to Avoid It

Be very intentional about what your dates mean.

For example:

  • valid_from: when this version became valid
  • valid_to: when this version stopped being valid
  • loaded_at: when the data entered the warehouse

These are not always the same thing.


11. Best Practices for SCD

Use SCD Type 2 for Business-Critical History

If a change affects reporting, segmentation, revenue analysis, or compliance, preserve history.

Examples:

  • Customer region
  • Customer plan
  • Product category
  • Sales territory
  • Employee department
  • Account status

These are usually worth tracking with Type 2.


Use Hashing to Detect Changes

Instead of comparing many columns one by one, you can create a hash from the important attributes.

Example:

MD5(CONCAT(city, customer_type, region))
Enter fullscreen mode Exit fullscreen mode

Then compare the source hash with the current dimension hash.

If the hash changes, something important changed.

This makes SCD pipelines easier to maintain, especially when there are many columns.


Keep Your SCD Logic Clear

SCD logic can become confusing quickly.

Use clear column names like:

valid_from
valid_to
is_current
Enter fullscreen mode Exit fullscreen mode

Avoid unclear names like:

start
end
flag
Enter fullscreen mode Exit fullscreen mode

Your future self and your teammates will thank you.


Document Which Columns Are Tracked

Do not leave SCD behavior hidden inside SQL code only.

Document which columns are:

  • Type 0
  • Type 1
  • Type 2
  • Type 3
  • Ignored
  • Used for change detection

This is especially important in team environments.


Avoid Duplicates in Current Records

For a Type 2 dimension, each business key should have only one current record.

This should be true:

One customer_id = one current record
Enter fullscreen mode Exit fullscreen mode

You can test this with SQL:

SELECT
    customer_id,
    COUNT(*) AS current_records
FROM dim_customer
WHERE is_current = true
GROUP BY customer_id
HAVING COUNT(*) > 1;
Enter fullscreen mode Exit fullscreen mode

If this query returns rows, your dimension has a problem.


Test for Overlapping Validity Periods

For each business key, the date ranges should not overlap.

Bad example:

customer_id city valid_from valid_to
101 Nairobi 2025-01-01 2025-04-01
101 Kisumu 2025-03-15 NULL

These overlap between March 15 and April 1.

That means a sale on March 20 could match both records.

That is dangerous.


12. When to Use Slowly Changing Dimensions

Use SCD when dimension data changes and those changes affect analysis.

Good use cases include:

  • Customer address history
  • Product category history
  • Subscription plan changes
  • Employee department changes
  • Supplier region changes
  • Account status changes
  • Sales territory changes

SCD is especially useful in data warehouses and analytics systems where historical accuracy matters.

For example:

“Show revenue by the customer’s region at the time of purchase.”

That is a classic SCD Type 2 problem.


13. When SCD May Not Be the Best Choice

SCD is powerful, but not every situation needs it.

Do Not Use Type 2 for Every Small Change

If a field changes often and does not matter historically, Type 2 may create unnecessary complexity.

For example:

  • Last login timestamp
  • Profile picture URL
  • Phone number
  • Minor spelling corrections
  • Temporary status fields

For these, Type 1 may be enough.

Be Careful With Very High-Volume Changes

If a dimension changes too frequently, Type 2 can grow very fast.

At that point, you may need a different modeling approach, such as:

  • Event sourcing
  • Audit tables
  • Snapshot tables
  • Data lake versioning
  • Change Data Capture history

SCD is best for slowly changing descriptive attributes, not every event that happens in the system.


14. SCD in Modern Data Tools

SCD is not limited to traditional warehouses.

You can implement SCD patterns in many modern data stacks.

In dbt

dbt supports snapshots, which are commonly used to implement SCD Type 2.

A dbt snapshot can track when records change and automatically create historical versions.

In Airflow

Airflow can orchestrate SCD pipelines by scheduling extraction, staging, comparison, and dimension loading tasks.

In Spark

Spark is useful when you are handling large-scale dimension updates.

You can compare source and target datasets, detect changes, and write updated records to a lakehouse or warehouse.

In Kafka and CDC

Kafka can stream changes from source systems.

For example, using CDC tools, you can capture customer updates from PostgreSQL and send them into Kafka.

From there, you can process those changes and update your dimension tables.

In Warehouses

SCD can be implemented in:

  • Snowflake
  • BigQuery
  • Redshift
  • PostgreSQL
  • Databricks
  • SQL Server

The concept stays the same, even if the syntax differs.


15. Final Summary

Slowly Changing Dimensions help data engineers manage changes in dimension data over time.

They solve an important problem:

How do we keep analytics accurate when descriptive data changes?

The most common SCD types are:

Type Meaning Best For
Type 0 Keep the original value unchanged Original values that should not change
Type 1 Overwrite old values Corrections and non-historical changes
Type 2 Keep full history using new rows Historical reporting
Type 3 Keep limited history in columns Simple previous/current comparisons

In real data engineering projects, SCD Type 2 is especially important because it allows the warehouse to answer historical questions correctly.

Without SCD, reports can quietly become wrong.

A customer’s current city may overwrite their past city.
A product’s new category may rewrite old sales history.
An employee’s new department may change old performance reports.

That is why SCD matters.

The practical takeaway is simple:

Whenever a dimension value changes, ask whether the business needs to remember the old value.

If the value should never change, Type 0 may be the right choice.

If the old value does not matter, Type 1 may be enough.

If the business needs full history, Type 2 is usually the best choice.

If the business only needs a simple previous-and-current comparison, Type 3 may work.

Top comments (0)