SCD is a data modeling technique used to track how changes in entity attributes affect behavior over time.
Think:
- A product’s price change vs sales impact
- A button turns red, do clicks spike?
- A post’s tone goes from formal to savage, what happens to engagement?
“What was true when this happened?”
Example
Metclicks, a streaming service startup, saw a views spike in “Spongebob Watches” , right after reclassifying it from "Cartoon"
to "Realistic"
.
Slowly Changing Dimension Common Types
Type 0 - The Fanatic
No Change, Frozen in time
customer_id | name | tier |
---|---|---|
1 | Sandy Cheeks | Gold |
Type 1 - The Chad
New values overwrites old values Doesn't give a 🦆 about old values like it wasn't existed
customer_id | name | tier |
---|---|---|
1 | Sandy Cheeks | Gold |
Type 2 - The OCD
Keeps everything, marks old values with
is_current
flag, in addition to validity period for each change
customer_sk | customer_id | name | tier | valid_from | valid_to | is_current |
---|---|---|---|---|---|---|
abc123 | 1 | Sandy Cheeks | Bronze | 2024-01-01 | 2025-06-01 | false |
def456 | 1 | Sandy Cheeks | Gold | 2025-06-02 | 2999-12-31 | true |
Type 3 - The Short memory
It only remembers the previous values. barely!
customer_id | name | tier | prev_tier |
---|---|---|---|
1 | Sandy Cheeks | Gold | Bronze |
👉 Before starting, designing SCD 2 should be taken with a grain of salt, because:
- Some newly-arrived attributes might have spelling issue or fidelity issues.
- Know how would you handle
NULLs
, doesNULL
has any business meaning? should it be taken in consideration while designing the dimension or just ignore it?
Writer opinions:
- The
MDL
should be extendable, meaning that tracking new attributes in the future should not be painful and doesn't require any huge refactoring and testing. - A lot of time you might go with a hybrid approach between SCD 2 and SCD 1, where applying strict tracking on some attributes and loosely track the others, for example: strictly track product's
product_category
but loosely trackproduct_name
(depending on business logic).
Getting Our Hands Dirty
let's have a look at this fake python-generated
dataset, which loaded into duckdb
into customers
table.
create or replace table customers as
with cte_data as (
select *, now() as delivered_at from read_csv('./customers_example.csv')
)
select * from cte_data;
Source's Schema
column_name | column_type | null | key | default | extra |
---|---|---|---|---|---|
customer_id | BIGINT | YES | NULL | NULL | NULL |
name | VARCHAR | YES | NULL | NULL | NULL |
VARCHAR | YES | NULL | NULL | NULL | |
signup_date | DATE | YES | NULL | NULL | NULL |
country | VARCHAR | YES | NULL | NULL | NULL |
date | DATE | YES | NULL | NULL | NULL |
tier | VARCHAR | YES | NULL | NULL | NULL |
interaction_type | VARCHAR | YES | NULL | NULL | NULL |
interaction_value | DOUBLE | YES | NULL | NULL | NULL |
delivered_at | TIMESTAMP WITH TIME ZONE | YES | NULL | NULL | NULL |
The goal is to track the following attributes:
Tracking Type | SCD Type | Attribute | Reason |
---|---|---|---|
Restrict | 2 | country |
Impacts customer segmentation and behavior analysis. |
Restrict | 2 | tier |
Indicates customer status; affects engagement, offers, and analytics. |
Restrict | 2 | email |
Primary communication channel; needs to be historically tracked for auditing and outreach. |
Loosy | 1 | name |
Might change (e.g. after marriage), but we treat it as static. Changes reflect corrections, not business-impacting updates. |
Loosy | 1 | signup_date |
Treated as a factual point in time. Fixes may occur (e.g. migration issues), but don't require historical tracking. |
SQL Implementation
Compress Tracked Attributes
By giving each version of attributes combination a hash, concatenate, or any type of idempotent
process, it would be easier to identify when something changes in the next couple of steps.
Make sure to include your
business key
in the hash, just to embed to whom this piece of information belong.
WITH cte_scd_columns AS (
SELECT
MD5(
COALESCE(CAST(customer_id AS TEXT), '__null__') || COALESCE(CAST(country AS TEXT), '__null__') || COALESCE(CAST(email AS TEXT), '__null__') || COALESCE(CAST(tier AS TEXT), '__null__')
) AS scd_id,
*
FROM events
)
SELECT
*
FROM cte_scd_columns
LIMIT 5
scd_id | column0 | customer_id | name | signup_date | country | timestamp | tier | interaction_type | interaction_value | delivered_at | |
---|---|---|---|---|---|---|---|---|---|---|---|
7ebaf657dc78359b4030489069de9f7d | 0 | 1 | Allison Hill | donaldgarcia@example.net | 2025-04-30 00:00:00 | Benin | 2025-05-22 09:44:21.840779 | Gold | login | nan | 2025-08-02 15:06:47.289000+03:00 |
7ebaf657dc78359b4030489069de9f7d | 1 | 1 | Allison Hill | donaldgarcia@example.net | 2025-04-30 00:00:00 | Benin | 2025-05-22 05:40:45.840779 | Gold | support_ticket | nan | 2025-08-02 15:06:47.289000+03:00 |
c67d1706d04fce3f93ec891827804882 | 2 | 1 | Allison Hill | donaldgarcia@example.net | 2025-04-30 00:00:00 | Benin | 2025-05-23 15:20:28.840779 | Silver | support_ticket | nan | 2025-08-02 15:06:47.289000+03:00 |
c67d1706d04fce3f93ec891827804882 | 3 | 1 | Allison Hill | donaldgarcia@example.net | 2025-04-30 00:00:00 | Benin | 2025-05-23 10:04:58.840779 | Silver | support_ticket | nan | 2025-08-02 15:06:47.289000+03:00 |
c67d1706d04fce3f93ec891827804882 | 4 | 1 | Allison Hill | donaldgarcia@example.net | 2025-04-30 00:00:00 | Benin | 2025-05-23 17:27:35.840779 | Silver | newsletter_click | nan | 2025-08-02 15:06:47.289000+03:00 |
Track When Exactly The Marker Changed
With window functions, it's possible to get the previous combination of attributes
per business key
in order to check whether something changed or not.
WITH cte_scd_columns AS (
SELECT
MD5(
COALESCE(CAST(customer_id AS TEXT), '__null__') || COALESCE(CAST(country AS TEXT), '__null__') || COALESCE(CAST(email AS TEXT), '__null__') || COALESCE(CAST(tier AS TEXT), '__null__')
) AS scd_id,
*
FROM events
), cte_change_indicator AS (
SELECT
scd_id,
COALESCE(scd_id <> LAG(scd_id) OVER (PARTITION BY customer_id ORDER BY timestamp), TRUE) AS change_indicator, /* coalesce to handle the first appearance of the entity */ /* as null <> [value] is NULL */
*
FROM cte_scd_columns
)
SELECT
*
FROM cte_change_indicator
LIMIT 5
scd_id | change_indicator | scd_id_1 | column0 | customer_id | name | signup_date | country | timestamp | tier | interaction_type | interaction_value | delivered_at | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ced54b448991710b857be121958a99a2 | True | ced54b448991710b857be121958a99a2 | 3177 | 53 | Sylvia Martinez | lauriecontreras@example.net | 2024-09-25 00:00:00 | Zambia | 2025-05-21 13:01:56.840779 | Platinum | purchase | 363.9 | 2025-08-02 15:06:47.289000+03:00 |
ced54b448991710b857be121958a99a2 | False | ced54b448991710b857be121958a99a2 | 3178 | 53 | Sylvia Martinez | lauriecontreras@example.net | 2024-09-25 00:00:00 | Zambia | 2025-05-25 06:26:37.840779 | Platinum | support_ticket | nan | 2025-08-02 15:06:47.289000+03:00 |
8088f8ba9d0c567eb22e8e090a78789e | True | 8088f8ba9d0c567eb22e8e090a78789e | 3179 | 53 | Sylvia Martinez | lauriecontreras@example.net | 2024-09-25 00:00:00 | Zambia | 2025-05-25 20:17:17.840779 | Gold | login | nan | 2025-08-02 15:06:47.289000+03:00 |
8088f8ba9d0c567eb22e8e090a78789e | False | 8088f8ba9d0c567eb22e8e090a78789e | 3180 | 53 | Sylvia Martinez | lauriecontreras@example.net | 2024-09-25 00:00:00 | Zambia | 2025-05-28 10:25:32.840779 | Gold | purchase | 362.82 | 2025-08-02 15:06:47.289000+03:00 |
8088f8ba9d0c567eb22e8e090a78789e | False | 8088f8ba9d0c567eb22e8e090a78789e | 3182 | 53 | Sylvia Martinez | lauriecontreras@example.net | 2024-09-25 00:00:00 | Zambia | 2025-05-29 02:13:36.840779 | Gold | newsletter_click | nan | 2025-08-02 15:06:47.289000+03:00 |
Check out the prev_scd_id
column in the output. When it's NULL
, that means it's the very first entry for that customer_id
. When prev_scd_id
is different from scd_id
, that's our signal that one of the strictly tracked attributes has changed.
At this stage, we'll tweak this CTE a bit to add a boolean column to identify changes. This change_indicator
will be TRUE
when a tracked attribute combination is new or has changed, and FALSE
otherwise. The COALESCE
is a trick to make sure the very first entry for a customer_id
is always marked as a change, since LAG()
would return NULL
for it.
Add Valid From Valid To
-
valid_from
: This is thedate
when a new combination of attributes became active. It's pulled directly from thedate
column for rows wherechange_indicator
isTRUE
. -
valid_to
: This is thedate
when that specific combination of attributes stopped being valid. We useLEAD()
here to peek at thedate
of the next change for the samecustomer_id
, then subtract one day to mark the end of the current record's validity. If there's no next change (meaning it's the most current record),LEAD()
returnsNULL
, which we'll handle in the final step. We only care about the rows wherechange_indicator
isTRUE
because those are the actual points in time when a relevant attribute shifted.
WITH cte_scd_columns AS (
SELECT
MD5(
COALESCE(CAST(customer_id AS TEXT), '__null__') || COALESCE(CAST(country AS TEXT), '__null__') || COALESCE(CAST(email AS TEXT), '__null__') || COALESCE(CAST(tier AS TEXT), '__null__')
) AS scd_id,
*
FROM events
), cte_change_indicator AS (
SELECT
scd_id,
COALESCE(scd_id <> LAG(scd_id) OVER (PARTITION BY customer_id ORDER BY timestamp), TRUE) AS change_indicator, /* coalesce to handle the first appearance of the entity */ /* as null <> [value] is NULL */
*
FROM cte_scd_columns
), cte_valid_from_valid_to AS (
SELECT
*,
timestamp AS valid_from, /* when a change occurred */
LEAD(timestamp) OVER (PARTITION BY customer_id ORDER BY timestamp) /* when the next change occurred */ - CAST('1 microsecond' AS INTERVAL) AS valid_to
FROM (
SELECT
*
FROM cte_change_indicator
WHERE
change_indicator IS TRUE
)
)
SELECT
*
FROM cte_valid_from_valid_to
LIMIT 5
scd_id | change_indicator | scd_id_1 | column0 | customer_id | name | signup_date | country | timestamp | tier | interaction_type | interaction_value | delivered_at | valid_from | valid_to | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
ff39bc29510492f665cae5e23bdaaa5a | True | ff39bc29510492f665cae5e23bdaaa5a | 1272 | 21 | Laura Sanchez | maguirre@example.org | 2023-09-27 00:00:00 | Jersey | 2025-05-20 23:26:46.840779 | Gold | support_ticket | nan | 2025-08-02 15:06:47.289000+03:00 | 2025-05-20 23:26:46.840779 | 2025-05-27 06:23:35.840778 |
ad3df688bca726e92e47974ba0e384e8 | True | ad3df688bca726e92e47974ba0e384e8 | 1277 | 21 | Laura Sanchez | maguirre@example.org | 2023-09-27 00:00:00 | Jersey | 2025-05-27 06:23:35.840779 | Bronze | support_ticket | nan | 2025-08-02 15:06:47.289000+03:00 | 2025-05-27 06:23:35.840779 | 2025-06-03 01:00:23.840778 |
ff39bc29510492f665cae5e23bdaaa5a | True | ff39bc29510492f665cae5e23bdaaa5a | 1287 | 21 | Laura Sanchez | maguirre@example.org | 2023-09-27 00:00:00 | Jersey | 2025-06-03 01:00:23.840779 | Gold | login | nan | 2025-08-02 15:06:47.289000+03:00 | 2025-06-03 01:00:23.840779 | 2025-06-12 21:08:30.840778 |
dc5be9887d65c60946b791b6e3bbda49 | True | dc5be9887d65c60946b791b6e3bbda49 | 1292 | 21 | Laura Sanchez | maguirre@example.org | 2023-09-27 00:00:00 | Jersey | 2025-06-12 21:08:30.840779 | Platinum | purchase | 51.82 | 2025-08-02 15:06:47.289000+03:00 | 2025-06-12 21:08:30.840779 | 2025-06-28 21:01:02.840778 |
ad3df688bca726e92e47974ba0e384e8 | True | ad3df688bca726e92e47974ba0e384e8 | 1302 | 21 | Laura Sanchez | maguirre@example.org | 2023-09-27 00:00:00 | Jersey | 2025-06-28 21:01:02.840779 | Bronze | support_ticket | nan | 2025-08-02 15:06:47.289000+03:00 | 2025-06-28 21:01:02.840779 | 2025-06-30 21:34:29.840778 |
Final Touch
Observe the valid_from
and valid_to
columns. Each row now represents a specific period during which the country
, email
, and tier
attributes were stable for that customer_id
. Notice some valid_to
values are NULL
for now; that's because they represent the current active record.
WITH cte_scd_columns AS (
SELECT
MD5(
COALESCE(CAST(customer_id AS TEXT), '__null__') || COALESCE(CAST(country AS TEXT), '__null__') || COALESCE(CAST(email AS TEXT), '__null__') || COALESCE(CAST(tier AS TEXT), '__null__')
) AS scd_id,
*
FROM events
), cte_change_indicator AS (
SELECT
scd_id,
COALESCE(scd_id <> LAG(scd_id) OVER (PARTITION BY customer_id ORDER BY timestamp), TRUE) AS change_indicator, /* coalesce to handle the first appearance of the entity */ /* as null <> [value] is NULL */
*
FROM cte_scd_columns
), cte_valid_from_valid_to AS (
SELECT
*,
timestamp AS valid_from, /* when a change occurred */
LEAD(timestamp) OVER (PARTITION BY customer_id ORDER BY timestamp) /* when the next change occurred */ - CAST('1 microsecond' AS INTERVAL) AS valid_to
FROM (
SELECT
*
FROM cte_change_indicator
WHERE
change_indicator IS TRUE
)
), cte_scd2_scd1 AS (
SELECT
MD5(scd_id || CAST(valid_from AS TEXT)) AS customer_sk,
customer_id,
name, /* scd1 attributes */
signup_date,
country, /* scd2 attributes */
email,
tier,
valid_from,
CAST(COALESCE(valid_to, '9999-12-31T23:59:59') AS TIMESTAMP) AS valid_to,
valid_to IS NULL AS is_effective_row
FROM cte_valid_from_valid_to
)
SELECT
*
FROM cte_scd2_scd1
LIMIT 5
customer_sk | customer_id | name | signup_date | country | tier | valid_from | valid_to | is_effective_row | |
---|---|---|---|---|---|---|---|---|---|
4aa26cc24e8bfeb92462e7b77ddd6e5e | 12 | Kyle Mcdonald | 2025-04-24 00:00:00 | Sao Tome and Principe | tasha01@example.net | Platinum | 2025-05-21 06:41:00.840779 | 2025-05-25 20:46:32.840778 | False |
509016b089696afc5f7177a4a45d38c3 | 12 | Kyle Mcdonald | 2025-04-24 00:00:00 | Sao Tome and Principe | tasha01@example.net | Bronze | 2025-05-25 20:46:32.840779 | 2025-06-14 07:11:09.840778 | False |
e929d1ae5e0d2deabb9a5e7ae42d1219 | 53 | Sylvia Martinez | 2024-09-25 00:00:00 | Zambia | lauriecontreras@example.net | Platinum | 2025-05-21 13:01:56.840779 | 2025-05-25 20:17:17.840778 | False |
fbf1b35f2d65ef63d1f24541a6d18088 | 53 | Sylvia Martinez | 2024-09-25 00:00:00 | Zambia | lauriecontreras@example.net | Gold | 2025-05-25 20:17:17.840779 | 2025-06-24 17:19:57.840778 | False |
41ce1a3fe926554585bb70b5302e48f4 | 53 | Sylvia Martinez | 2024-09-25 00:00:00 | Zambia | lauriecontreras@example.net | Bronze | 2025-06-24 17:19:57.840779 | 9999-12-31 23:59:59 | True |
After creating the table, it's essential to merge it and use it in another fact table
.
That process is beyond the scope of this article, but you can check it here
Top comments (0)