Baldwin Apps

Posted on Jun 13

SQL Pattern Series #5: The Deduplication Pattern

#sql #database #tutorial #beginners

Keeping the row you want and removing the rest

SQL Pattern Series #5 of 21

A collection of practical SQL patterns that help developers recognize common solutions to recurring database problems.

What You'll Learn

In this article you'll learn:

How to identify duplicate records
Why duplicates are often unavoidable in real systems
How ROW_NUMBER() helps isolate the record you want to keep
A common pattern for deduplicating data

Most developers eventually encounter a table that contains duplicate records.

Sometimes the duplicates are accidental.

Sometimes they are the result of imports, integrations, retries, or application bugs.

The challenge is usually not finding the duplicates.

The challenge is deciding:

Which row should survive?

The Problem

Imagine a Users table:

UserID	Email	CreatedDate
1	alice@example.com	2024-01-01
2	alice@example.com	2024-03-15
3	bob@example.com	2024-02-01

The email address appears more than once.

If only one row should remain, which one do we keep?

The oldest?

The newest?

The one with the highest ID?

The answer depends on the business rule.

Once that rule is defined, the Deduplication Pattern becomes useful.

The Deduplication Pattern

The most common approach uses:

ROW_NUMBER()

combined with:

PARTITION BY

to assign a sequence number to each duplicate group.

For example:

ROW_NUMBER() OVER (
    PARTITION BY Email
    ORDER BY CreatedDate DESC
)

This creates groups based on Email.

Within each group, rows are numbered.

The newest record receives:

rn = 1

The next receives:

rn = 2

and so on.

Example

SELECT *
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY Email
               ORDER BY CreatedDate DESC
           ) AS rn
    FROM Users
) t
WHERE rn = 1;

This returns:

The most recent record for each email address.

All other duplicates are excluded.

Why This Pattern Matters

Deduplication is one of the most common data-cleaning tasks.

You'll encounter it when:

importing data
merging systems
cleaning customer records
preparing analytics datasets
fixing application issues

The pattern allows you to express exactly which record should survive.

That makes the logic clear and repeatable.

Window Functions Make This Easy

Before window functions became widely available, deduplication often required:

self joins
nested subqueries
temporary tables

Window functions simplified the process dramatically.

By combining:

PARTITION BY

with:

ROW_NUMBER()

you can isolate duplicates in a single query.

Choosing the Winning Row

The most important part of the pattern is often:

ORDER BY

For example:

ORDER BY CreatedDate DESC

keeps the newest row.

But you could also choose:

ORDER BY CreatedDate ASC

to keep the oldest.

Or:

ORDER BY UserID DESC

to keep the highest ID.

The ordering defines the business rule.

A Note on Ties

One subtle detail:

ORDER BY CreatedDate DESC

works well when every CreatedDate value is unique.

But what happens if two rows share the same timestamp?

Email                CreatedDate
-------------------  -------------------
alice@example.com    2024-03-15 10:00:00
alice@example.com    2024-03-15 10:00:00

In that case, ROW_NUMBER() will still assign:

rn = 1
rn = 2

but the database may choose either row as the winner.

If the result must be predictable, add a tiebreaker:

ROW_NUMBER() OVER (
    PARTITION BY Email
    ORDER BY CreatedDate DESC,
             UserID DESC
)

Now the ordering is deterministic.

When multiple rows share the same timestamp, the row with the highest UserID wins consistently.

When I Reach for This Pattern

I typically use the Deduplication Pattern when:

duplicate records exist
I need one row per business entity
data quality issues must be cleaned
multiple systems have been merged

Examples include:

duplicate customer records
duplicate email addresses
repeated imports
overlapping transaction data

Key Takeaway

Finding duplicates is only half the problem.

The real question is:

Which row should survive?

The Deduplication Pattern provides a repeatable way to answer that question.

By combining:

ROW_NUMBER()

with:

PARTITION BY

you can identify the winning row and remove the rest.