Keeping the row you want and removing the rest
SQL Pattern Series #5 of 21
A collection of practical SQL patterns that help developers recognize common solutions to recurring database problems.
What You'll Learn
In this article you'll learn:
- How to identify duplicate records
- Why duplicates are often unavoidable in real systems
- How
ROW_NUMBER()helps isolate the record you want to keep - A common pattern for deduplicating data
Most developers eventually encounter a table that contains duplicate records.
Sometimes the duplicates are accidental.
Sometimes they are the result of imports, integrations, retries, or application bugs.
The challenge is usually not finding the duplicates.
The challenge is deciding:
Which row should survive?
The Problem
Imagine a Users table:
| UserID | CreatedDate | |
|---|---|---|
| 1 | alice@example.com | 2024-01-01 |
| 2 | alice@example.com | 2024-03-15 |
| 3 | bob@example.com | 2024-02-01 |
The email address appears more than once.
If only one row should remain, which one do we keep?
The oldest?
The newest?
The one with the highest ID?
The answer depends on the business rule.
Once that rule is defined, the Deduplication Pattern becomes useful.
The Deduplication Pattern
The most common approach uses:
ROW_NUMBER()
combined with:
PARTITION BY
to assign a sequence number to each duplicate group.
For example:
ROW_NUMBER() OVER (
PARTITION BY Email
ORDER BY CreatedDate DESC
)
This creates groups based on Email.
Within each group, rows are numbered.
The newest record receives:
rn = 1
The next receives:
rn = 2
and so on.
Example
SELECT *
FROM (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY Email
ORDER BY CreatedDate DESC
) AS rn
FROM Users
) t
WHERE rn = 1;
This returns:
The most recent record for each email address.
All other duplicates are excluded.
Why This Pattern Matters
Deduplication is one of the most common data-cleaning tasks.
You'll encounter it when:
- importing data
- merging systems
- cleaning customer records
- preparing analytics datasets
- fixing application issues
The pattern allows you to express exactly which record should survive.
That makes the logic clear and repeatable.
Window Functions Make This Easy
Before window functions became widely available, deduplication often required:
- self joins
- nested subqueries
- temporary tables
Window functions simplified the process dramatically.
By combining:
PARTITION BY
with:
ROW_NUMBER()
you can isolate duplicates in a single query.
Choosing the Winning Row
The most important part of the pattern is often:
ORDER BY
For example:
ORDER BY CreatedDate DESC
keeps the newest row.
But you could also choose:
ORDER BY CreatedDate ASC
to keep the oldest.
Or:
ORDER BY UserID DESC
to keep the highest ID.
The ordering defines the business rule.
A Note on Ties
One subtle detail:
ORDER BY CreatedDate DESC
works well when every CreatedDate value is unique.
But what happens if two rows share the same timestamp?
Email CreatedDate
------------------- -------------------
alice@example.com 2024-03-15 10:00:00
alice@example.com 2024-03-15 10:00:00
In that case, ROW_NUMBER() will still assign:
rn = 1
rn = 2
but the database may choose either row as the winner.
If the result must be predictable, add a tiebreaker:
ROW_NUMBER() OVER (
PARTITION BY Email
ORDER BY CreatedDate DESC,
UserID DESC
)
Now the ordering is deterministic.
When multiple rows share the same timestamp, the row with the highest UserID wins consistently.
When I Reach for This Pattern
I typically use the Deduplication Pattern when:
- duplicate records exist
- I need one row per business entity
- data quality issues must be cleaned
- multiple systems have been merged
Examples include:
- duplicate customer records
- duplicate email addresses
- repeated imports
- overlapping transaction data
Key Takeaway
Finding duplicates is only half the problem.
The real question is:
Which row should survive?
The Deduplication Pattern provides a repeatable way to answer that question.
By combining:
ROW_NUMBER()
with:
PARTITION BY
you can identify the winning row and remove the rest.
SQL Pattern Series
This article is part of the SQL Pattern Series, a collection of practical SQL patterns that help developers recognize common problem-solving approaches found in reporting, analytics, and application development.
SQL Bubble Pop
If you are learning SQL or helping others learn SQL, I created SQL Bubble Pop, a mobile game that teaches SQL concepts through quick, interactive challenges and pattern recognition exercises.
The goal is simple:
Learn SQL by recognizing patterns instead of memorizing syntax.

Top comments (0)