DEV Community

Cover image for Capture Slowly Changing Attributes in SQL - SCD Type 2
Ahmad Muhammad
Ahmad Muhammad

Posted on

Capture Slowly Changing Attributes in SQL - SCD Type 2

SCD is a data modeling technique used to track how changes in entity attributes affect behavior over time.

Think:

  • A product’s price change vs sales impact
  • A button turns red, do clicks spike?
  • A post’s tone goes from formal to savage, what happens to engagement?

“What was true when this happened?”

Example

metclicks

Metclicks, a streaming service startup, saw a views spike in “Spongebob Watches” , right after reclassifying it from "Cartoon" to "Realistic".


Slowly Changing Dimension Common Types

Type 0 - The Fanatic

No Change, Frozen in time

customer_id name tier
1 Sandy Cheeks Gold

Type 1 - The Chad

New values overwrites old values Doesn't give a 🦆 about old values like it wasn't existed

customer_id name tier
1 Sandy Cheeks Gold

Type 2 - The OCD

Keeps everything, marks old values with is_current flag, in addition to validity period for each change

customer_sk customer_id name tier valid_from valid_to is_current
abc123 1 Sandy Cheeks Bronze 2024-01-01 2025-06-01 false
def456 1 Sandy Cheeks Gold 2025-06-02 2999-12-31 true

Type 3 - The Short memory

It only remembers the previous values. barely!

customer_id name tier prev_tier
1 Sandy Cheeks Gold Bronze

👉 Before starting, designing SCD 2 should be taken with a grain of salt, because:

  • Some newly-arrived attributes might have spelling issue or fidelity issues.
  • Know how would you handle NULLs, does NULL has any business meaning? should it be taken in consideration while designing the dimension or just ignore it?

Writer opinions:

  • The MDL should be extendable, meaning that tracking new attributes in the future should not be painful and doesn't require any huge refactoring and testing.
  • A lot of time you might go with a hybrid approach between SCD 2 and SCD 1, where applying strict tracking on some attributes and loosely track the others, for example: strictly track product's product_category but loosely track product_name (depending on business logic).

Getting Our Hands Dirty

let's have a look at this fake python-generated dataset, which loaded into duckdb into customers table.

Get data

create or replace table customers as
with cte_data as (
select *, now() as delivered_at from read_csv('./customers_example.csv')
)
select * from cte_data;
Enter fullscreen mode Exit fullscreen mode

Source's Schema

column_name column_type null key default extra
customer_id BIGINT YES NULL NULL NULL
name VARCHAR YES NULL NULL NULL
email VARCHAR YES NULL NULL NULL
signup_date DATE YES NULL NULL NULL
country VARCHAR YES NULL NULL NULL
date DATE YES NULL NULL NULL
tier VARCHAR YES NULL NULL NULL
interaction_type VARCHAR YES NULL NULL NULL
interaction_value DOUBLE YES NULL NULL NULL
delivered_at TIMESTAMP WITH TIME ZONE YES NULL NULL NULL

The goal is to track the following attributes:

Tracking Type SCD Type Attribute Reason
Restrict 2 country Impacts customer segmentation and behavior analysis.
Restrict 2 tier Indicates customer status; affects engagement, offers, and analytics.
Restrict 2 email Primary communication channel; needs to be historically tracked for auditing and outreach.
Loosy 1 name Might change (e.g. after marriage), but we treat it as static. Changes reflect corrections, not business-impacting updates.
Loosy 1 signup_date Treated as a factual point in time. Fixes may occur (e.g. migration issues), but don't require historical tracking.

SQL Implementation

Compress Tracked Attributes

By giving each version of attributes combination a hash, concatenate, or any type of idempotent process, it would be easier to identify when something changes in the next couple of steps.

Make sure to include your business key in the hash, just to embed to whom this piece of information belong.

WITH cte_scd_columns AS (
  SELECT
    MD5(
      COALESCE(CAST(customer_id AS TEXT), '__null__') || COALESCE(CAST(country AS TEXT), '__null__') || COALESCE(CAST(email AS TEXT), '__null__') || COALESCE(CAST(tier AS TEXT), '__null__')
    ) AS scd_id,
    *
  FROM events
)
SELECT
  *
FROM cte_scd_columns
LIMIT 5
Enter fullscreen mode Exit fullscreen mode
scd_id column0 customer_id name email signup_date country timestamp tier interaction_type interaction_value delivered_at
7ebaf657dc78359b4030489069de9f7d 0 1 Allison Hill donaldgarcia@example.net 2025-04-30 00:00:00 Benin 2025-05-22 09:44:21.840779 Gold login nan 2025-08-02 15:06:47.289000+03:00
7ebaf657dc78359b4030489069de9f7d 1 1 Allison Hill donaldgarcia@example.net 2025-04-30 00:00:00 Benin 2025-05-22 05:40:45.840779 Gold support_ticket nan 2025-08-02 15:06:47.289000+03:00
c67d1706d04fce3f93ec891827804882 2 1 Allison Hill donaldgarcia@example.net 2025-04-30 00:00:00 Benin 2025-05-23 15:20:28.840779 Silver support_ticket nan 2025-08-02 15:06:47.289000+03:00
c67d1706d04fce3f93ec891827804882 3 1 Allison Hill donaldgarcia@example.net 2025-04-30 00:00:00 Benin 2025-05-23 10:04:58.840779 Silver support_ticket nan 2025-08-02 15:06:47.289000+03:00
c67d1706d04fce3f93ec891827804882 4 1 Allison Hill donaldgarcia@example.net 2025-04-30 00:00:00 Benin 2025-05-23 17:27:35.840779 Silver newsletter_click nan 2025-08-02 15:06:47.289000+03:00

Track When Exactly The Marker Changed

With window functions, it's possible to get the previous combination of attributes per business key in order to check whether something changed or not.

WITH cte_scd_columns AS (
  SELECT
    MD5(
      COALESCE(CAST(customer_id AS TEXT), '__null__') || COALESCE(CAST(country AS TEXT), '__null__') || COALESCE(CAST(email AS TEXT), '__null__') || COALESCE(CAST(tier AS TEXT), '__null__')
    ) AS scd_id,
    *
  FROM events
), cte_change_indicator AS (
  SELECT
    scd_id,
    COALESCE(scd_id <> LAG(scd_id) OVER (PARTITION BY customer_id ORDER BY timestamp), TRUE) AS change_indicator, /* coalesce to handle the first appearance of the entity */ /* as null <> [value] is NULL */
    *
  FROM cte_scd_columns
)
SELECT
  *
FROM cte_change_indicator
LIMIT 5
Enter fullscreen mode Exit fullscreen mode
scd_id change_indicator scd_id_1 column0 customer_id name email signup_date country timestamp tier interaction_type interaction_value delivered_at
ced54b448991710b857be121958a99a2 True ced54b448991710b857be121958a99a2 3177 53 Sylvia Martinez lauriecontreras@example.net 2024-09-25 00:00:00 Zambia 2025-05-21 13:01:56.840779 Platinum purchase 363.9 2025-08-02 15:06:47.289000+03:00
ced54b448991710b857be121958a99a2 False ced54b448991710b857be121958a99a2 3178 53 Sylvia Martinez lauriecontreras@example.net 2024-09-25 00:00:00 Zambia 2025-05-25 06:26:37.840779 Platinum support_ticket nan 2025-08-02 15:06:47.289000+03:00
8088f8ba9d0c567eb22e8e090a78789e True 8088f8ba9d0c567eb22e8e090a78789e 3179 53 Sylvia Martinez lauriecontreras@example.net 2024-09-25 00:00:00 Zambia 2025-05-25 20:17:17.840779 Gold login nan 2025-08-02 15:06:47.289000+03:00
8088f8ba9d0c567eb22e8e090a78789e False 8088f8ba9d0c567eb22e8e090a78789e 3180 53 Sylvia Martinez lauriecontreras@example.net 2024-09-25 00:00:00 Zambia 2025-05-28 10:25:32.840779 Gold purchase 362.82 2025-08-02 15:06:47.289000+03:00
8088f8ba9d0c567eb22e8e090a78789e False 8088f8ba9d0c567eb22e8e090a78789e 3182 53 Sylvia Martinez lauriecontreras@example.net 2024-09-25 00:00:00 Zambia 2025-05-29 02:13:36.840779 Gold newsletter_click nan 2025-08-02 15:06:47.289000+03:00

Check out the prev_scd_id column in the output. When it's NULL, that means it's the very first entry for that customer_id. When prev_scd_id is different from scd_id, that's our signal that one of the strictly tracked attributes has changed.

At this stage, we'll tweak this CTE a bit to add a boolean column to identify changes. This change_indicator will be TRUE when a tracked attribute combination is new or has changed, and FALSE otherwise. The COALESCE is a trick to make sure the very first entry for a customer_id is always marked as a change, since LAG() would return NULL for it.

Add Valid From Valid To

  • valid_from: This is the date when a new combination of attributes became active. It's pulled directly from the date column for rows where change_indicator is TRUE.
  • valid_to: This is the date when that specific combination of attributes stopped being valid. We use LEAD() here to peek at the date of the next change for the same customer_id, then subtract one day to mark the end of the current record's validity. If there's no next change (meaning it's the most current record), LEAD() returns NULL, which we'll handle in the final step. We only care about the rows where change_indicator is TRUE because those are the actual points in time when a relevant attribute shifted.
WITH cte_scd_columns AS (
  SELECT
    MD5(
      COALESCE(CAST(customer_id AS TEXT), '__null__') || COALESCE(CAST(country AS TEXT), '__null__') || COALESCE(CAST(email AS TEXT), '__null__') || COALESCE(CAST(tier AS TEXT), '__null__')
    ) AS scd_id,
    *
  FROM events
), cte_change_indicator AS (
  SELECT
    scd_id,
    COALESCE(scd_id <> LAG(scd_id) OVER (PARTITION BY customer_id ORDER BY timestamp), TRUE) AS change_indicator, /* coalesce to handle the first appearance of the entity */ /* as null <> [value] is NULL */
    *
  FROM cte_scd_columns
), cte_valid_from_valid_to AS (
  SELECT
    *,
    timestamp AS valid_from, /* when a change occurred */
    LEAD(timestamp) OVER (PARTITION BY customer_id ORDER BY timestamp) /* when the next change occurred */ - CAST('1 microsecond' AS INTERVAL) AS valid_to
  FROM (
    SELECT
      *
    FROM cte_change_indicator
    WHERE
      change_indicator IS TRUE
  )
)
SELECT
  *
FROM cte_valid_from_valid_to
LIMIT 5
Enter fullscreen mode Exit fullscreen mode
scd_id change_indicator scd_id_1 column0 customer_id name email signup_date country timestamp tier interaction_type interaction_value delivered_at valid_from valid_to
ff39bc29510492f665cae5e23bdaaa5a True ff39bc29510492f665cae5e23bdaaa5a 1272 21 Laura Sanchez maguirre@example.org 2023-09-27 00:00:00 Jersey 2025-05-20 23:26:46.840779 Gold support_ticket nan 2025-08-02 15:06:47.289000+03:00 2025-05-20 23:26:46.840779 2025-05-27 06:23:35.840778
ad3df688bca726e92e47974ba0e384e8 True ad3df688bca726e92e47974ba0e384e8 1277 21 Laura Sanchez maguirre@example.org 2023-09-27 00:00:00 Jersey 2025-05-27 06:23:35.840779 Bronze support_ticket nan 2025-08-02 15:06:47.289000+03:00 2025-05-27 06:23:35.840779 2025-06-03 01:00:23.840778
ff39bc29510492f665cae5e23bdaaa5a True ff39bc29510492f665cae5e23bdaaa5a 1287 21 Laura Sanchez maguirre@example.org 2023-09-27 00:00:00 Jersey 2025-06-03 01:00:23.840779 Gold login nan 2025-08-02 15:06:47.289000+03:00 2025-06-03 01:00:23.840779 2025-06-12 21:08:30.840778
dc5be9887d65c60946b791b6e3bbda49 True dc5be9887d65c60946b791b6e3bbda49 1292 21 Laura Sanchez maguirre@example.org 2023-09-27 00:00:00 Jersey 2025-06-12 21:08:30.840779 Platinum purchase 51.82 2025-08-02 15:06:47.289000+03:00 2025-06-12 21:08:30.840779 2025-06-28 21:01:02.840778
ad3df688bca726e92e47974ba0e384e8 True ad3df688bca726e92e47974ba0e384e8 1302 21 Laura Sanchez maguirre@example.org 2023-09-27 00:00:00 Jersey 2025-06-28 21:01:02.840779 Bronze support_ticket nan 2025-08-02 15:06:47.289000+03:00 2025-06-28 21:01:02.840779 2025-06-30 21:34:29.840778

Final Touch

Observe the valid_from and valid_to columns. Each row now represents a specific period during which the country, email, and tier attributes were stable for that customer_id. Notice some valid_to values are NULL for now; that's because they represent the current active record.

WITH cte_scd_columns AS (
  SELECT
    MD5(
      COALESCE(CAST(customer_id AS TEXT), '__null__') || COALESCE(CAST(country AS TEXT), '__null__') || COALESCE(CAST(email AS TEXT), '__null__') || COALESCE(CAST(tier AS TEXT), '__null__')
    ) AS scd_id,
    *
  FROM events
), cte_change_indicator AS (
  SELECT
    scd_id,
    COALESCE(scd_id <> LAG(scd_id) OVER (PARTITION BY customer_id ORDER BY timestamp), TRUE) AS change_indicator, /* coalesce to handle the first appearance of the entity */ /* as null <> [value] is NULL */
    *
  FROM cte_scd_columns
), cte_valid_from_valid_to AS (
  SELECT
    *,
    timestamp AS valid_from, /* when a change occurred */
    LEAD(timestamp) OVER (PARTITION BY customer_id ORDER BY timestamp) /* when the next change occurred */ - CAST('1 microsecond' AS INTERVAL) AS valid_to
  FROM (
    SELECT
      *
    FROM cte_change_indicator
    WHERE
      change_indicator IS TRUE
  )
), cte_scd2_scd1 AS (
  SELECT
    MD5(scd_id || CAST(valid_from AS TEXT)) AS customer_sk,
    customer_id,
    name, /* scd1 attributes */
    signup_date,
    country, /* scd2 attributes */
    email,
    tier,
    valid_from,
    CAST(COALESCE(valid_to, '9999-12-31T23:59:59') AS TIMESTAMP) AS valid_to,
    valid_to IS NULL AS is_effective_row
  FROM cte_valid_from_valid_to
)
SELECT
  *
FROM cte_scd2_scd1
LIMIT 5
Enter fullscreen mode Exit fullscreen mode
customer_sk customer_id name signup_date country email tier valid_from valid_to is_effective_row
4aa26cc24e8bfeb92462e7b77ddd6e5e 12 Kyle Mcdonald 2025-04-24 00:00:00 Sao Tome and Principe tasha01@example.net Platinum 2025-05-21 06:41:00.840779 2025-05-25 20:46:32.840778 False
509016b089696afc5f7177a4a45d38c3 12 Kyle Mcdonald 2025-04-24 00:00:00 Sao Tome and Principe tasha01@example.net Bronze 2025-05-25 20:46:32.840779 2025-06-14 07:11:09.840778 False
e929d1ae5e0d2deabb9a5e7ae42d1219 53 Sylvia Martinez 2024-09-25 00:00:00 Zambia lauriecontreras@example.net Platinum 2025-05-21 13:01:56.840779 2025-05-25 20:17:17.840778 False
fbf1b35f2d65ef63d1f24541a6d18088 53 Sylvia Martinez 2024-09-25 00:00:00 Zambia lauriecontreras@example.net Gold 2025-05-25 20:17:17.840779 2025-06-24 17:19:57.840778 False
41ce1a3fe926554585bb70b5302e48f4 53 Sylvia Martinez 2024-09-25 00:00:00 Zambia lauriecontreras@example.net Bronze 2025-06-24 17:19:57.840779 9999-12-31 23:59:59 True

After creating the table, it's essential to merge it and use it in another fact table.
That process is beyond the scope of this article, but you can check it here

Top comments (0)