SCD is a data modeling technique used to track how changes in entity attributes affect behavior over time.
Think:
- A product’s price change vs sales impact
- A button turns red, do clicks spike?
- A post’s tone goes from formal to savage, what happens to engagement?
“What was true when this happened?”
Example
Metclicks, a streaming service startup, saw a views spike in “Spongebob Watches” , right after reclassifying it from "Cartoon" to "Realistic".
Slowly Changing Dimension Common Types
Type 0 - The Fanatic
No Change, Frozen in time
| customer_id | name | tier |
|---|---|---|
| 1 | Sandy Cheeks | Gold |
Type 1 - The Chad
New values overwrites old values Doesn't give a 🦆 about old values like it wasn't existed
| customer_id | name | tier |
|---|---|---|
| 1 | Sandy Cheeks | Gold |
Type 2 - The OCD
Keeps everything, marks old values with
is_currentflag, in addition to validity period for each change
| customer_sk | customer_id | name | tier | valid_from | valid_to | is_current |
|---|---|---|---|---|---|---|
| abc123 | 1 | Sandy Cheeks | Bronze | 2024-01-01 | 2025-06-01 | false |
| def456 | 1 | Sandy Cheeks | Gold | 2025-06-02 | 2999-12-31 | true |
Type 3 - The Short memory
It only remembers the previous values. barely!
| customer_id | name | tier | prev_tier |
|---|---|---|---|
| 1 | Sandy Cheeks | Gold | Bronze |
👉 Before starting, designing SCD 2 should be taken with a grain of salt, because:
- Some newly-arrived attributes might have spelling issue or fidelity issues.
- Know how would you handle
NULLs, doesNULLhas any business meaning? should it be taken in consideration while designing the dimension or just ignore it?
Writer opinions:
- The
MDLshould be extendable, meaning that tracking new attributes in the future should not be painful and doesn't require any huge refactoring and testing. - A lot of time you might go with a hybrid approach between SCD 2 and SCD 1, where applying strict tracking on some attributes and loosely track the others, for example: strictly track product's
product_categorybut loosely trackproduct_name(depending on business logic).
Getting Our Hands Dirty
let's have a look at this fake python-generated dataset, which loaded into duckdb into customers table.
create or replace table customers as
with cte_data as (
select *, now() as delivered_at from read_csv('./customers_example.csv')
)
select * from cte_data;
Source's Schema
| column_name | column_type | null | key | default | extra |
|---|---|---|---|---|---|
| customer_id | BIGINT | YES | NULL | NULL | NULL |
| name | VARCHAR | YES | NULL | NULL | NULL |
| VARCHAR | YES | NULL | NULL | NULL | |
| signup_date | DATE | YES | NULL | NULL | NULL |
| country | VARCHAR | YES | NULL | NULL | NULL |
| date | DATE | YES | NULL | NULL | NULL |
| tier | VARCHAR | YES | NULL | NULL | NULL |
| interaction_type | VARCHAR | YES | NULL | NULL | NULL |
| interaction_value | DOUBLE | YES | NULL | NULL | NULL |
| delivered_at | TIMESTAMP WITH TIME ZONE | YES | NULL | NULL | NULL |
The goal is to track the following attributes:
| Tracking Type | SCD Type | Attribute | Reason |
|---|---|---|---|
| Restrict | 2 | country |
Impacts customer segmentation and behavior analysis. |
| Restrict | 2 | tier |
Indicates customer status; affects engagement, offers, and analytics. |
| Restrict | 2 | email |
Primary communication channel; needs to be historically tracked for auditing and outreach. |
| Loosy | 1 | name |
Might change (e.g. after marriage), but we treat it as static. Changes reflect corrections, not business-impacting updates. |
| Loosy | 1 | signup_date |
Treated as a factual point in time. Fixes may occur (e.g. migration issues), but don't require historical tracking. |
SQL Implementation
Compress Tracked Attributes
By giving each version of attributes combination a hash, concatenate, or any type of idempotent process, it would be easier to identify when something changes in the next couple of steps.
Make sure to include your
business keyin the hash, just to embed to whom this piece of information belong.
WITH cte_scd_columns AS (
SELECT
MD5(
COALESCE(CAST(customer_id AS TEXT), '__null__') || COALESCE(CAST(country AS TEXT), '__null__') || COALESCE(CAST(email AS TEXT), '__null__') || COALESCE(CAST(tier AS TEXT), '__null__')
) AS scd_id,
*
FROM events
)
SELECT
*
FROM cte_scd_columns
LIMIT 5
| scd_id | column0 | customer_id | name | signup_date | country | timestamp | tier | interaction_type | interaction_value | delivered_at | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7ebaf657dc78359b4030489069de9f7d | 0 | 1 | Allison Hill | donaldgarcia@example.net | 2025-04-30 00:00:00 | Benin | 2025-05-22 09:44:21.840779 | Gold | login | nan | 2025-08-02 15:06:47.289000+03:00 |
| 7ebaf657dc78359b4030489069de9f7d | 1 | 1 | Allison Hill | donaldgarcia@example.net | 2025-04-30 00:00:00 | Benin | 2025-05-22 05:40:45.840779 | Gold | support_ticket | nan | 2025-08-02 15:06:47.289000+03:00 |
| c67d1706d04fce3f93ec891827804882 | 2 | 1 | Allison Hill | donaldgarcia@example.net | 2025-04-30 00:00:00 | Benin | 2025-05-23 15:20:28.840779 | Silver | support_ticket | nan | 2025-08-02 15:06:47.289000+03:00 |
| c67d1706d04fce3f93ec891827804882 | 3 | 1 | Allison Hill | donaldgarcia@example.net | 2025-04-30 00:00:00 | Benin | 2025-05-23 10:04:58.840779 | Silver | support_ticket | nan | 2025-08-02 15:06:47.289000+03:00 |
| c67d1706d04fce3f93ec891827804882 | 4 | 1 | Allison Hill | donaldgarcia@example.net | 2025-04-30 00:00:00 | Benin | 2025-05-23 17:27:35.840779 | Silver | newsletter_click | nan | 2025-08-02 15:06:47.289000+03:00 |
Track When Exactly The Marker Changed
With window functions, it's possible to get the previous combination of attributes per business key in order to check whether something changed or not.
WITH cte_scd_columns AS (
SELECT
MD5(
COALESCE(CAST(customer_id AS TEXT), '__null__') || COALESCE(CAST(country AS TEXT), '__null__') || COALESCE(CAST(email AS TEXT), '__null__') || COALESCE(CAST(tier AS TEXT), '__null__')
) AS scd_id,
*
FROM events
), cte_change_indicator AS (
SELECT
scd_id,
COALESCE(scd_id <> LAG(scd_id) OVER (PARTITION BY customer_id ORDER BY timestamp), TRUE) AS change_indicator, /* coalesce to handle the first appearance of the entity */ /* as null <> [value] is NULL */
*
FROM cte_scd_columns
)
SELECT
*
FROM cte_change_indicator
LIMIT 5
| scd_id | change_indicator | scd_id_1 | column0 | customer_id | name | signup_date | country | timestamp | tier | interaction_type | interaction_value | delivered_at | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ced54b448991710b857be121958a99a2 | True | ced54b448991710b857be121958a99a2 | 3177 | 53 | Sylvia Martinez | lauriecontreras@example.net | 2024-09-25 00:00:00 | Zambia | 2025-05-21 13:01:56.840779 | Platinum | purchase | 363.9 | 2025-08-02 15:06:47.289000+03:00 |
| ced54b448991710b857be121958a99a2 | False | ced54b448991710b857be121958a99a2 | 3178 | 53 | Sylvia Martinez | lauriecontreras@example.net | 2024-09-25 00:00:00 | Zambia | 2025-05-25 06:26:37.840779 | Platinum | support_ticket | nan | 2025-08-02 15:06:47.289000+03:00 |
| 8088f8ba9d0c567eb22e8e090a78789e | True | 8088f8ba9d0c567eb22e8e090a78789e | 3179 | 53 | Sylvia Martinez | lauriecontreras@example.net | 2024-09-25 00:00:00 | Zambia | 2025-05-25 20:17:17.840779 | Gold | login | nan | 2025-08-02 15:06:47.289000+03:00 |
| 8088f8ba9d0c567eb22e8e090a78789e | False | 8088f8ba9d0c567eb22e8e090a78789e | 3180 | 53 | Sylvia Martinez | lauriecontreras@example.net | 2024-09-25 00:00:00 | Zambia | 2025-05-28 10:25:32.840779 | Gold | purchase | 362.82 | 2025-08-02 15:06:47.289000+03:00 |
| 8088f8ba9d0c567eb22e8e090a78789e | False | 8088f8ba9d0c567eb22e8e090a78789e | 3182 | 53 | Sylvia Martinez | lauriecontreras@example.net | 2024-09-25 00:00:00 | Zambia | 2025-05-29 02:13:36.840779 | Gold | newsletter_click | nan | 2025-08-02 15:06:47.289000+03:00 |
Check out the prev_scd_id column in the output. When it's NULL, that means it's the very first entry for that customer_id. When prev_scd_id is different from scd_id, that's our signal that one of the strictly tracked attributes has changed.
At this stage, we'll tweak this CTE a bit to add a boolean column to identify changes. This change_indicator will be TRUE when a tracked attribute combination is new or has changed, and FALSE otherwise. The COALESCE is a trick to make sure the very first entry for a customer_id is always marked as a change, since LAG() would return NULL for it.
Add Valid From Valid To
-
valid_from: This is thedatewhen a new combination of attributes became active. It's pulled directly from thedatecolumn for rows wherechange_indicatorisTRUE. -
valid_to: This is thedatewhen that specific combination of attributes stopped being valid. We useLEAD()here to peek at thedateof the next change for the samecustomer_id, then subtract one day to mark the end of the current record's validity. If there's no next change (meaning it's the most current record),LEAD()returnsNULL, which we'll handle in the final step. We only care about the rows wherechange_indicatorisTRUEbecause those are the actual points in time when a relevant attribute shifted.
WITH cte_scd_columns AS (
SELECT
MD5(
COALESCE(CAST(customer_id AS TEXT), '__null__') || COALESCE(CAST(country AS TEXT), '__null__') || COALESCE(CAST(email AS TEXT), '__null__') || COALESCE(CAST(tier AS TEXT), '__null__')
) AS scd_id,
*
FROM events
), cte_change_indicator AS (
SELECT
scd_id,
COALESCE(scd_id <> LAG(scd_id) OVER (PARTITION BY customer_id ORDER BY timestamp), TRUE) AS change_indicator, /* coalesce to handle the first appearance of the entity */ /* as null <> [value] is NULL */
*
FROM cte_scd_columns
), cte_valid_from_valid_to AS (
SELECT
*,
timestamp AS valid_from, /* when a change occurred */
LEAD(timestamp) OVER (PARTITION BY customer_id ORDER BY timestamp) /* when the next change occurred */ - CAST('1 microsecond' AS INTERVAL) AS valid_to
FROM (
SELECT
*
FROM cte_change_indicator
WHERE
change_indicator IS TRUE
)
)
SELECT
*
FROM cte_valid_from_valid_to
LIMIT 5
| scd_id | change_indicator | scd_id_1 | column0 | customer_id | name | signup_date | country | timestamp | tier | interaction_type | interaction_value | delivered_at | valid_from | valid_to | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ff39bc29510492f665cae5e23bdaaa5a | True | ff39bc29510492f665cae5e23bdaaa5a | 1272 | 21 | Laura Sanchez | maguirre@example.org | 2023-09-27 00:00:00 | Jersey | 2025-05-20 23:26:46.840779 | Gold | support_ticket | nan | 2025-08-02 15:06:47.289000+03:00 | 2025-05-20 23:26:46.840779 | 2025-05-27 06:23:35.840778 |
| ad3df688bca726e92e47974ba0e384e8 | True | ad3df688bca726e92e47974ba0e384e8 | 1277 | 21 | Laura Sanchez | maguirre@example.org | 2023-09-27 00:00:00 | Jersey | 2025-05-27 06:23:35.840779 | Bronze | support_ticket | nan | 2025-08-02 15:06:47.289000+03:00 | 2025-05-27 06:23:35.840779 | 2025-06-03 01:00:23.840778 |
| ff39bc29510492f665cae5e23bdaaa5a | True | ff39bc29510492f665cae5e23bdaaa5a | 1287 | 21 | Laura Sanchez | maguirre@example.org | 2023-09-27 00:00:00 | Jersey | 2025-06-03 01:00:23.840779 | Gold | login | nan | 2025-08-02 15:06:47.289000+03:00 | 2025-06-03 01:00:23.840779 | 2025-06-12 21:08:30.840778 |
| dc5be9887d65c60946b791b6e3bbda49 | True | dc5be9887d65c60946b791b6e3bbda49 | 1292 | 21 | Laura Sanchez | maguirre@example.org | 2023-09-27 00:00:00 | Jersey | 2025-06-12 21:08:30.840779 | Platinum | purchase | 51.82 | 2025-08-02 15:06:47.289000+03:00 | 2025-06-12 21:08:30.840779 | 2025-06-28 21:01:02.840778 |
| ad3df688bca726e92e47974ba0e384e8 | True | ad3df688bca726e92e47974ba0e384e8 | 1302 | 21 | Laura Sanchez | maguirre@example.org | 2023-09-27 00:00:00 | Jersey | 2025-06-28 21:01:02.840779 | Bronze | support_ticket | nan | 2025-08-02 15:06:47.289000+03:00 | 2025-06-28 21:01:02.840779 | 2025-06-30 21:34:29.840778 |
Final Touch
Observe the valid_from and valid_to columns. Each row now represents a specific period during which the country, email, and tier attributes were stable for that customer_id. Notice some valid_to values are NULL for now; that's because they represent the current active record.
WITH cte_scd_columns AS (
SELECT
MD5(
COALESCE(CAST(customer_id AS TEXT), '__null__') || COALESCE(CAST(country AS TEXT), '__null__') || COALESCE(CAST(email AS TEXT), '__null__') || COALESCE(CAST(tier AS TEXT), '__null__')
) AS scd_id,
*
FROM events
), cte_change_indicator AS (
SELECT
scd_id,
COALESCE(scd_id <> LAG(scd_id) OVER (PARTITION BY customer_id ORDER BY timestamp), TRUE) AS change_indicator, /* coalesce to handle the first appearance of the entity */ /* as null <> [value] is NULL */
*
FROM cte_scd_columns
), cte_valid_from_valid_to AS (
SELECT
*,
timestamp AS valid_from, /* when a change occurred */
LEAD(timestamp) OVER (PARTITION BY customer_id ORDER BY timestamp) /* when the next change occurred */ - CAST('1 microsecond' AS INTERVAL) AS valid_to
FROM (
SELECT
*
FROM cte_change_indicator
WHERE
change_indicator IS TRUE
)
), cte_scd2_scd1 AS (
SELECT
MD5(scd_id || CAST(valid_from AS TEXT)) AS customer_sk,
customer_id,
name, /* scd1 attributes */
signup_date,
country, /* scd2 attributes */
email,
tier,
valid_from,
CAST(COALESCE(valid_to, '9999-12-31T23:59:59') AS TIMESTAMP) AS valid_to,
valid_to IS NULL AS is_effective_row
FROM cte_valid_from_valid_to
)
SELECT
*
FROM cte_scd2_scd1
LIMIT 5
| customer_sk | customer_id | name | signup_date | country | tier | valid_from | valid_to | is_effective_row | |
|---|---|---|---|---|---|---|---|---|---|
| 4aa26cc24e8bfeb92462e7b77ddd6e5e | 12 | Kyle Mcdonald | 2025-04-24 00:00:00 | Sao Tome and Principe | tasha01@example.net | Platinum | 2025-05-21 06:41:00.840779 | 2025-05-25 20:46:32.840778 | False |
| 509016b089696afc5f7177a4a45d38c3 | 12 | Kyle Mcdonald | 2025-04-24 00:00:00 | Sao Tome and Principe | tasha01@example.net | Bronze | 2025-05-25 20:46:32.840779 | 2025-06-14 07:11:09.840778 | False |
| e929d1ae5e0d2deabb9a5e7ae42d1219 | 53 | Sylvia Martinez | 2024-09-25 00:00:00 | Zambia | lauriecontreras@example.net | Platinum | 2025-05-21 13:01:56.840779 | 2025-05-25 20:17:17.840778 | False |
| fbf1b35f2d65ef63d1f24541a6d18088 | 53 | Sylvia Martinez | 2024-09-25 00:00:00 | Zambia | lauriecontreras@example.net | Gold | 2025-05-25 20:17:17.840779 | 2025-06-24 17:19:57.840778 | False |
| 41ce1a3fe926554585bb70b5302e48f4 | 53 | Sylvia Martinez | 2024-09-25 00:00:00 | Zambia | lauriecontreras@example.net | Bronze | 2025-06-24 17:19:57.840779 | 9999-12-31 23:59:59 | True |
After creating the table, it's essential to merge it and use it in another fact table.
That process is beyond the scope of this article, but you can check it here

Top comments (0)