DEV Community

loading...

Remove Duplicate Records via SQL Server's Rank Function

Paul Delcogliano
Iā€™m an old dog developer learning new tricks. I come from the Microsoft world with over 25 years' experience.
惻3 min read

Removing duplicates from a table can be a frustrating process. Various methods for identifying, and removing records exist. Creating a query to remove certain records and keep others requires lots of careful analysis. I recently implemented a time saving, simple solution using SQL Server's RANK function to identify and mark records for deletion within a duplicate set.

The SQL Server RANK function returns the rank, or row number, of each row within the partitioned grouping of a result set. I'm not going to dive into the RANK function in this article. You can find more information about it here.

Using a query with RANK, I create a query that creates sets of duplicate records and assigns a number to each row in duplicate set. The row number will reset for each duplicate set in the table. I then use the number to identify rows to be deleted.

Let's set up a quick sample. Run the queries below to create a table which contains duplicate records. The test data sets up a scenario where duplicate orders were imported on a specific date, 4/20/2020.

create table duplicates (
    Id int identity(1,1)
    ,CustomerId int
    ,OrderAmount numeric(7,2)
    ,OrderDate datetime
)

insert into duplicates (CustomerId, OrderAmount, OrderDate)
values (1, 100.45, '4/20/2020'), (1, 100.45, '4/20/2020'), (1, 33.12, '4/14/2020'), (1, 75.45, '4/2/2020'), 
       (2, 10.99, '4/20/2020'), (2, 10.99, '4/20/2020'), (2, 100.45, '4/14/2020'), (2, 12.75, '4/2/2020'), 
       (3, 55.66, '4/20/2020'), (3, 55.66, '4/20/2020'), (3, 31.44, '4/14/2020'), (3, 85.22, '4/2/2020'),
       (4, 46.89, '4/20/2020'), (5, 88.33, '4/20/2020'), (6, 90.13, '4/14/2020'), (7, 3.77, '4/2/2020')

select * from duplicates
Enter fullscreen mode Exit fullscreen mode

As you can see from the query results, there are duplicate orders on 4/20/2020 for CustomerId's 1, 2, and 3.

Id CustomerId OrderAmount OrderDate
1 1 100.45 Apr 20 2020
2 1 100.45 Apr 20 2020
3 1 33.12 Apr 14 2020
4 1 75.45 Apr 2 2020
5 2 10.99 Apr 20 2020
6 2 10.99 Apr 20 2020
7 2 100.45 Apr 14 2020
8 2 12.75 Apr 2 2020
9 3 55.66 Apr 20 2020
10 3 55.66 Apr 20 2020
11 3 31.44 Apr 14 2020
12 3 85.22 Apr 2 2020
13 4 46.89 Apr 20 2020
14 5 88.33 Apr 20 2020
15 6 90.13 Apr 14 2020
16 7 3.77 Apr 2 2020

To identify the duplicate records, I create a query which uses the RANK function to assign a number to each record in each duplicate set. The PARTITION BY clause specifies groupings of data which the function is applied. The query creates groupings of CustomerId and OrderAmount. For each grouping, a new number is assigned. In the query below, the new number is aliased as RowNumberWithinDuplicateSet.

SELECT Id, CustomerId, OrderAmount, OrderDate,
       RANK() OVER (PARTITION BY CustomerId, OrderAmount ORDER BY Id DESC) AS RowNumberWithinDuplicateSet
FROM duplicates
WHERE OrderDate = '4/20/2020'
Enter fullscreen mode Exit fullscreen mode

The query returns the following results:

Id CustomerId OrderAmount OrderDate RowNumberWithinDuplicateSet
2 1 100.45 Apr 20 2020 1
1 1 100.45 Apr 20 2020 2
6 2 10.99 Apr 20 2020 1
5 2 10.99 Apr 20 2020 2
10 3 55.66 Apr 20 2020 1
9 3 55.66 Apr 20 2020 2
13 4 46.89 Apr 20 2020 1
14 5 88.33 Apr 20 2020 1

Once RowNumberWithinDuplicateSet is assigned, I use it to remove one of the duplicate records in each set. Using a query like the one below, I use the modulus of RowNumberWithinDuplicateSet to find one row in each set to delete.

delete 
    d
from 
    duplicates d
    inner join (
    select Id,
           RANK() OVER (PARTITION BY CustomerId, OrderAmount ORDER BY Id DESC) AS RowNumberWithinDuplicateSet
    from duplicates
    where 
            OrderDate = '4/20/2020'
    ) x on d.id = x.id
where 
    x.RowNumberWithinDuplicateSet % 2 = 0
Enter fullscreen mode Exit fullscreen mode

After this delete is executed, rows with Ids 1, 5, and 9 are removed, eliminating the duplicate records. This technique is really effective when the PARTITION BY columns are key fields and the duplicates can be filtered from the main table with a WHERE clause.

SQL Server's RANK function makes the complex process of identifying and removing duplicate records a cinch. The next time you need to write a query to remove duplicate rows from a table, think of the RANK function. Using the RANK function, you can create simple queries to assist you in removing duplicate records.

Discussion (2)

Collapse
allanford17 profile image
allanford17
    OUTER APPLY (SELECT TOP 1
Enter fullscreen mode Exit fullscreen mode

also useful for this ..

e.g.

    OUTER APPLY (SELECT TOP 1
        ClaimNo, ClaimLineNo, ProviderID, AgreementID, CampusID, Version, BenefitID, SvcClass, SetNo, SeqNo
        FROM   #Claimlines CL
        WHERE
            E.Episodeid = CL.Episodeid
        AND CL.Admitdate IS NOT NULL
        AND CL.Dischargedate IS NOT NULL
        AND CL.Facilitylevel IS NOT NULL
        AND CL.Dischargedate <=
                    (
                      SELECT MIN(Dischargedate)
                      FROM #Claimlines
                      WHERE Episodeid = CL.Episodeid
                        AND Admitdate IS NOT NULL
                        AND Dischargedate IS NOT NULL
                        AND Facilitylevel IS NOT NULL
                     )
        ORDER BY
            CL.SetNo,
            CL.SeqNo,
            CL.ClaimNo,
            CL.ClaimLineNo
        )  CL
Enter fullscreen mode Exit fullscreen mode
Collapse
amangarg47 profile image
amangarg47

We can do it by group by easily :). Btw nice approach