I've been working on an inventory management app and recently ran into an oversight from the initial rollout of the database.
This one oversight left me with a total of 186 duplicate records stemming from what should have been 82 unique records. Doing the math, it's not a clean case of two duplicates per item. That meant consolidating by the latest (or earliest) modified record wouldn't work: some of the latest duplicates of a given part contained missing or bad data while others didn't, and in other cases a part had up to 4 duplicates, making it hard to determine programmatically which one to pick.
After taking some time to think through ways to fix this, I landed on a recovery strategy: build a few reports that group all the duplicates and show what changed between them, drop every record that has duplicates, add a uniqueness constraint, and then re-insert the records with the reports in hand and the ability to check the inventory physically.
Even though it was a costly mistake on my end, it was the best way to guarantee the data would be accurate.
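To give a sense of what those reports looked like, here's a rough sketch of the "what changed" piece, using the same sometable/pn names as the queries further down and placeholder column names: for each duplicated part number, count how many distinct values each field holds, which makes it obvious where the copies actually differ.
-- For each duplicated part number, count the distinct (non-null) values per
-- field so it's clear which fields actually differ between the copies.
-- description, quantity and location are placeholder column names.
SELECT
    pn,
    COUNT(*)                    AS copies,
    COUNT(DISTINCT description) AS distinct_descriptions,
    COUNT(DISTINCT quantity)    AS distinct_quantities,
    COUNT(DISTINCT location)    AS distinct_locations
FROM sometable
GROUP BY pn
HAVING COUNT(*) > 1;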
First step: Backups
I'm using a managed Postgres instance on Heroku for the prod database, and luckily it runs backups periodically. But that's not enough on its own, so I decided to create a cron job that runs a daily manual backup (the database is quite small at this point in time).
#!/bin/bash

echo "Running database backup"

# Directory where the .dump files are kept
backup_dir="/Users/felipe/backups/<client>-db-backups"

# Grab the most recent backup, if any (suppress ls noise when the dir is empty)
latest_backup=$(ls -t "$backup_dir"/*.dump 2>/dev/null | head -n1)

if [ -n "$latest_backup" ]; then
  echo "Previous backup found"
  # Report when the most recent backup was created
  last_modified=$(date -r "$latest_backup" "+%m-%d-%Y %H:%M:%S")
  echo "Created at: $last_modified"
else
  echo "No existing backup was found."
fi

echo "Running backup..."

# Capture a fresh backup on Heroku and download it with a dated filename
backup_filename="$backup_dir/<client>_$(date +%m-%d-%y).dump"
heroku pg:backups:capture -a server
heroku pg:backups:download -a server -o "$backup_filename"
So breaking it down:
- Grab the latest backup, if any, from the backup directory
- If one exists, print a message to stdout saying a backup exists and when it was created
- If none exists, print a message saying no backups were found
- Capture a fresh backup on the prod instance and download it
Finally, add the job to the cron table so it runs daily at 3 AM:
$ crontab -e
0 3 * * * /bin/bash /path/to/the/script.sh
Second step: Write the update script
So this part is fairly simple, but a lot can go wrong.
First I ran a few queries to ensure what I was about to do was targeting the right data:
-- 1: find the total count to get an idea of how much data we're talking about
SELECT COUNT(*)
FROM (
    SELECT pn, COUNT(pn)
    FROM sometable
    GROUP BY pn
    HAVING COUNT(pn) > 1
) AS duplicated_pns;
-- 2: unpack the previous query and get an idea of how many duplicates exist per item
SELECT pn, COUNT(pn)
FROM sometable
GROUP BY pn
HAVING COUNT(pn) > 1;
-- 3: inspect the duplicate data
SELECT *
FROM sometable
WHERE pn IN (
SELECT pn
FROM sometable
GROUP BY pn
HAVING COUNT(pn) > 1
)
ORDER BY pn;
Cool. After looking over the result sets I felt fairly confident about what I was about to delete, and I took the opportunity to export these reports since they would come in handy in the final step.
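One way to export a report like query 3 straight from psql is to wrap it in \copy (a single-line meta-command; the CSV filename here is just an example):
\copy (SELECT * FROM sometable WHERE pn IN (SELECT pn FROM sometable GROUP BY pn HAVING COUNT(pn) > 1) ORDER BY pn) TO 'duplicates_report.csv' WITH CSV HEADER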
-- Delete every row whose pn appears more than once
-- (the good copies get re-inserted later from the reports)
DELETE FROM sometable
WHERE id IN (
    SELECT id
    FROM sometable
    WHERE pn IN (
        SELECT pn
        FROM sometable
        GROUP BY pn
        HAVING COUNT(pn) > 1
    )
);
-- Let's try to prevent this from happening again
ALTER TABLE sometable
ADD CONSTRAINT unique_col UNIQUE (pn);
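As a quick sanity check (a sketch with placeholder values, assuming pn is the only required column), trying to insert the same part number twice should now fail loudly instead of silently creating a duplicate:
-- With the constraint in place, the second insert is rejected.
INSERT INTO sometable (pn) VALUES ('ABC-123');
INSERT INTO sometable (pn) VALUES ('ABC-123');
-- ERROR:  duplicate key value violates unique constraint "unique_col"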
In part 2, I will work on changes to both the client and the API so that a possible duplicate insertion is handled gracefully and the UX around it improves. Also, if you'd like to share any similar stories, please do.