Jake McLelland
Handling Duplicate Records

Let's talk about a scenario that comes up occasionally:
How can I deal with duplicate records in a database?

This happens for various reasons outside the scope of this article. The question is how to deal with them.

You have a few options.

Ideally, you would have an index on your table that enforces a unique constraint. That way, duplicate data simply gets rejected by the database instead of sneaking in. In my case, the table in question is a generic log with lots of optional fields, and that complexity prevented me from defining any unique constraint that made sense.
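If a unique constraint is an option for your table, it can be as simple as one index. A minimal sketch, using the same (obfuscated) table and column names that appear later in this post:

```sql
-- Hypothetical: enforce uniqueness on the columns that define a duplicate.
-- Any INSERT that would create a second matching row fails with a
-- duplicate-key error instead of silently creating a dupe.
CREATE UNIQUE INDEX UX_TransactionLogs_NoDupes
ON TransactionLogs (TransactionDate, CustomerId, InvoiceId, TransactionType);
```

Of course, this only works if that combination of columns really is unique for legitimate rows, which wasn't true for my log table.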

Another option, if you have control over the software that inserts the records, is to update it to check for existing records first instead of blindly doing inserts. If you can't apply unique constraints at the database level, this is your next best option, and it's what I did. However, I still had all the duplicates sitting in the database from before I added the protective code to my API.
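In my case that check lived in the API code, but the same idea can be sketched in T-SQL (hypothetical parameters, matching the obfuscated names used later in this post):

```sql
-- Hypothetical sketch: only insert when no matching row exists yet.
IF NOT EXISTS (
    SELECT 1 FROM TransactionLogs
    WHERE TransactionDate = @transactionDate
      AND CustomerId = @customerId
      AND InvoiceId = @invoiceId
      AND TransactionType = @transactionType
)
BEGIN
    INSERT INTO TransactionLogs (TransactionDate, CustomerId, InvoiceId, TransactionType)
    VALUES (@transactionDate, @customerId, @invoiceId, @transactionType);
END
```

One caveat: without a unique constraint backing it up, two concurrent inserts can still race past this check and both succeed, so it reduces duplicates rather than guaranteeing zero.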

If you have just a few dupes, you could locate them manually and execute delete statements using whatever primary keys or other specific data identifies just the duplicate records. The drawback is that you could accidentally delete non-dupes, other data, or, worse, the whole table. Terrifying, isn't it! (This is what keeps DBAs up at night! "Don't forget the WHERE clause!!" you'll hear them scream as they wake from their nightmare, sweating and shaking uncontrollably.)
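If you go the manual route, it's just a handful of targeted deletes. A sketch with made-up Ids against the hypothetical table used later in this post:

```sql
-- Hypothetical: delete specific duplicate rows by primary key,
-- keeping the original. Triple-check the WHERE clause before running!
DELETE FROM TransactionLogs
WHERE Id IN (1001, 1002);
```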

There are probably other options. Here's a solution that worked for me.

I'm using a Microsoft SQL Server database hosted on Azure, so I can take advantage of some slightly non-ISO-standard SQL syntax. (You could easily adapt this to whatever DB platform you're running.)

My solution uses a cursor.

Now, I know. I can hear you gasp and click your tongue at me. But rest assured, this is probably one of only . . . well, honestly, I can't remember any other time I've been forced to use a cursor when I had any alternative. (Full disclosure: I have approved a code review for someone else who was forced to use a cursor, and I've also suffered the indignity of maintaining someone else's code base that included a few cursors for whatever reason. Don't judge me!)

The basic overview: we declare some variables to hold the values we'll use for finding and matching the duplicates. Then we set up the cursor, preferably with FORWARD_ONLY specified to keep things light on the server. Next, we loop: for each dupe-set, we find one "good" record, then delete all the other matching records except that one. Finally, don't forget to close and deallocate the cursor so you can pretend you never used one and feign ignorance.

Now for the good stuff.

(Variable, table, and column names obfuscated below to protect the innocent)

-- Set up the variables
DECLARE 
  @countOfDupes int, 
  @transactionDate datetime2(7), 
  @customerId int, 
  @invoiceId nvarchar(13), 
  @transactionType nvarchar(25)

-- Prepare the cursor we'll use
DECLARE logDupes CURSOR FORWARD_ONLY 
FOR
SELECT
  CountOfDupes = count(Id),
  TransactionDate,
  CustomerId,
  InvoiceId,
  TransactionType
FROM TransactionLogs
GROUP BY TransactionDate, CustomerId, InvoiceId, TransactionType
HAVING count(Id) > 1 -- only the ones that have duplicates

OPEN logDupes

FETCH NEXT FROM logDupes   
INTO @countOfDupes, @transactionDate, @customerId, @invoiceId, @transactionType

-- Now loop through each set that has duplicates
WHILE @@FETCH_STATUS = 0
BEGIN
    PRINT concat('Handling ', @countOfDupes, ' duplicates for customer: ', @customerId, ' on ''', @transactionDate, '''')

    -- First, get the @firstMatchingId for just the top 1 matching result
    DECLARE @firstMatchingId int

    SELECT 
      TOP 1 
      @firstMatchingId = Id
    FROM TransactionLogs
    WHERE CustomerId = @customerId
    AND TransactionDate = @transactionDate
    AND InvoiceId = @invoiceId
    AND TransactionType = @transactionType
    ORDER BY Id -- without an ORDER BY, TOP 1 picks an arbitrary row each run

    -- Now use that @firstMatchingId to delete all the matching results EXCEPT FOR the top 1
    DELETE FROM
    TransactionLogs
    WHERE Id <> @firstMatchingId
    AND CustomerId = @customerId
    AND TransactionDate = @transactionDate
    AND InvoiceId = @invoiceId
    AND TransactionType = @transactionType

    -- Continue the loop
    FETCH NEXT FROM logDupes   
    INTO @countOfDupes, @transactionDate, @customerId, @invoiceId, @transactionType
END

-- Don't forget to close and deallocate the cursor!
CLOSE logDupes  
DEALLOCATE logDupes

See Microsoft's documentation for specific guidance on using cursors in SQL Server.

If you've found a better way to delete a bunch of duplicate records from your db, I'd love to hear your story!
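One common set-based answer to that question, for anyone who wants to skip the cursor entirely, is to number the rows in each dupe-set with ROW_NUMBER() and delete everything after the first. A sketch against the same hypothetical table (note that in T-SQL you can DELETE through an updatable CTE):

```sql
-- Set-based alternative: one statement, no cursor.
WITH ranked AS (
    SELECT Id,
           ROW_NUMBER() OVER (
               PARTITION BY TransactionDate, CustomerId, InvoiceId, TransactionType
               ORDER BY Id
           ) AS rn
    FROM TransactionLogs
)
DELETE FROM ranked
WHERE rn > 1; -- keep the first row of each dupe-set, delete the rest
```

On a large table you'd likely want to test this on a copy first, since it deletes everything in one transaction rather than one dupe-set at a time.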

Hope you find this amusing, or possibly helpful.
