DEV Community

Cover image for SQL Pattern Series #5: The Deduplication Pattern
Baldwin Apps
Baldwin Apps

Posted on

SQL Pattern Series #5: The Deduplication Pattern

Keeping the row you want and removing the rest

SQL Pattern Series #5 of 21

A collection of practical SQL patterns that help developers recognize common solutions to recurring database problems.

What You'll Learn

In this article you'll learn:

  • How to identify duplicate records
  • Why duplicates are often unavoidable in real systems
  • How ROW_NUMBER() helps isolate the record you want to keep
  • A common pattern for deduplicating data

Most developers eventually encounter a table that contains duplicate records.

Sometimes the duplicates are accidental.

Sometimes they are the result of imports, integrations, retries, or application bugs.

The challenge is usually not finding the duplicates.

The challenge is deciding:

Which row should survive?


The Problem

Imagine a Users table:

UserID Email CreatedDate
1 alice@example.com 2024-01-01
2 alice@example.com 2024-03-15
3 bob@example.com 2024-02-01

The email address appears more than once.

If only one row should remain, which one do we keep?

The oldest?

The newest?

The one with the highest ID?

The answer depends on the business rule.

Once that rule is defined, the Deduplication Pattern becomes useful.



The Deduplication Pattern

The most common approach uses:

ROW_NUMBER()
Enter fullscreen mode Exit fullscreen mode

combined with:

PARTITION BY
Enter fullscreen mode Exit fullscreen mode

to assign a sequence number to each duplicate group.

For example:

ROW_NUMBER() OVER (
    PARTITION BY Email
    ORDER BY CreatedDate DESC
)
Enter fullscreen mode Exit fullscreen mode

This creates groups based on Email.

Within each group, rows are numbered.

The newest record receives:

rn = 1
Enter fullscreen mode Exit fullscreen mode

The next receives:

rn = 2
Enter fullscreen mode Exit fullscreen mode

and so on.


Example

SELECT *
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY Email
               ORDER BY CreatedDate DESC
           ) AS rn
    FROM Users
) t
WHERE rn = 1;
Enter fullscreen mode Exit fullscreen mode

This returns:

The most recent record for each email address.

All other duplicates are excluded.


Why This Pattern Matters

Deduplication is one of the most common data-cleaning tasks.

You'll encounter it when:

  • importing data
  • merging systems
  • cleaning customer records
  • preparing analytics datasets
  • fixing application issues

The pattern allows you to express exactly which record should survive.

That makes the logic clear and repeatable.


Window Functions Make This Easy

Before window functions became widely available, deduplication often required:

  • self joins
  • nested subqueries
  • temporary tables

Window functions simplified the process dramatically.

By combining:

PARTITION BY
Enter fullscreen mode Exit fullscreen mode

with:

ROW_NUMBER()
Enter fullscreen mode Exit fullscreen mode

you can isolate duplicates in a single query.


Choosing the Winning Row

The most important part of the pattern is often:

ORDER BY
Enter fullscreen mode Exit fullscreen mode

For example:

ORDER BY CreatedDate DESC
Enter fullscreen mode Exit fullscreen mode

keeps the newest row.

But you could also choose:

ORDER BY CreatedDate ASC
Enter fullscreen mode Exit fullscreen mode

to keep the oldest.

Or:

ORDER BY UserID DESC
Enter fullscreen mode Exit fullscreen mode

to keep the highest ID.

The ordering defines the business rule.


A Note on Ties

One subtle detail:

ORDER BY CreatedDate DESC
Enter fullscreen mode Exit fullscreen mode

works well when every CreatedDate value is unique.

But what happens if two rows share the same timestamp?

Email                CreatedDate
-------------------  -------------------
alice@example.com    2024-03-15 10:00:00
alice@example.com    2024-03-15 10:00:00
Enter fullscreen mode Exit fullscreen mode

In that case, ROW_NUMBER() will still assign:

rn = 1
rn = 2
Enter fullscreen mode Exit fullscreen mode

but the database may choose either row as the winner.

If the result must be predictable, add a tiebreaker:

ROW_NUMBER() OVER (
    PARTITION BY Email
    ORDER BY CreatedDate DESC,
             UserID DESC
)
Enter fullscreen mode Exit fullscreen mode

Now the ordering is deterministic.

When multiple rows share the same timestamp, the row with the highest UserID wins consistently.


When I Reach for This Pattern

I typically use the Deduplication Pattern when:

  • duplicate records exist
  • I need one row per business entity
  • data quality issues must be cleaned
  • multiple systems have been merged

Examples include:

  • duplicate customer records
  • duplicate email addresses
  • repeated imports
  • overlapping transaction data

Key Takeaway

Finding duplicates is only half the problem.

The real question is:

Which row should survive?

The Deduplication Pattern provides a repeatable way to answer that question.

By combining:

ROW_NUMBER()
Enter fullscreen mode Exit fullscreen mode

with:

PARTITION BY
Enter fullscreen mode Exit fullscreen mode

you can identify the winning row and remove the rest.


SQL Pattern Series

This article is part of the SQL Pattern Series, a collection of practical SQL patterns that help developers recognize common problem-solving approaches found in reporting, analytics, and application development.


SQL Bubble Pop

If you are learning SQL or helping others learn SQL, I created SQL Bubble Pop, a mobile game that teaches SQL concepts through quick, interactive challenges and pattern recognition exercises.

The goal is simple:

Learn SQL by recognizing patterns instead of memorizing syntax.

Top comments (0)