DEV Community

Cover image for SQL Duplicates
Andy Allison
Andy Allison

Posted on

3 2

SQL Duplicates

Disclaimer

This is a purely self serving article. I've had to do this a number of times and end up having to re-learn it all every time I've done it. SOOOOO this time I will drop some basics here as a reference for my failing memory but if it helps anybody else even better. It's not indented to be instructional or in anyway best practice so if you have issues or know of more appropriate ways please feel fee to drop things in the comments etc.

The Problem

The key problem being addressed is a number of duplicate records that have found their way into our datasets. These duplicates are matched on the master record id and the description used in the child example

id parentId description
1 P1 this is ok
2 P1 I'm here twice
3 P1 I'm here twice

Breaking it down

So the first thing I did was a query that will actualy help me find the rows I was looking for and the proof there are actual duplicates.

select "parentId",
       id,
       description,
       row_number()
       over (partition by "parentId", description order by "parentId", overview) as RN
from "Detail"
Enter fullscreen mode Exit fullscreen mode

This produces results like these

id parentId description RN
1 P1 this is ok 1
2 P1 I'm here twice 1
3 P1 I'm here twice 2

So we now have a row id/number based on the parentId and the description matching.

We then use this in the delete query to kill al the row numbers over 1

delete
from "Detail"
where id in (
    select id
    from (
             select "parentId",
                    id,
                    description,
                    row_number()
                    over (partition by "parentId", description order by "parentId", overview) as RN
             from "Detail"
         ) as F
    where RN > 1
);
Enter fullscreen mode Exit fullscreen mode

This is probably a convoluted way to achieve this and I'm sure there will be more efficient way but it has worked for me in this occasion so I hope it helps others.

Running the previous query then results in

id parentId description RN
1 P1 this is ok 1
2 P1 I'm here twice 1

Image of Timescale

Timescale – the developer's data platform for modern apps, built on PostgreSQL

Timescale Cloud is PostgreSQL optimized for speed, scale, and performance. Over 3 million IoT, AI, crypto, and dev tool apps are powered by Timescale. Try it free today! No credit card required.

Try free

Top comments (0)

AWS Security LIVE!

Tune in for AWS Security LIVE!

Join AWS Security LIVE! for expert insights and actionable tips to protect your organization and keep security teams prepared.

Learn More

👋 Kindness is contagious

Explore a sea of insights with this enlightening post, highly esteemed within the nurturing DEV Community. Coders of all stripes are invited to participate and contribute to our shared knowledge.

Expressing gratitude with a simple "thank you" can make a big impact. Leave your thanks in the comments!

On DEV, exchanging ideas smooths our way and strengthens our community bonds. Found this useful? A quick note of thanks to the author can mean a lot.

Okay