Nikhil Soman Sahu

Deleting Duplicate Data from MySQL

Hey there, 👋 Welcome to another blog post. Today, we’ll tackle something every database engineer faces at some point—deleting duplicate data in MySQL. This is a crucial skill, especially when managing large datasets where data cleanliness is vital.

Why Should We Remove Duplicates?

Imagine you’re managing a user database for a website. If users sign up multiple times due to some glitch or error, you’ll end up with duplicates. This can mess up your reports, slow down queries, and increase storage costs. So, let’s get our hands dirty and see how to remove these duplicates!


Setting Up the Scenario

Let’s start with a simple users table:

CREATE TABLE users (
    id INT AUTO_INCREMENT PRIMARY KEY,
    email VARCHAR(255) NOT NULL,
    name VARCHAR(100),
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

And let’s insert some duplicate data to simulate the real-world problem:

INSERT INTO users (email, name) VALUES
('john.doe@example.com', 'John Doe'),
('jane.smith@example.com', 'Jane Smith'),
('john.doe@example.com', 'John Doe'),
('mary.jane@example.com', 'Mary Jane'),
('jane.smith@example.com', 'Jane Smith');

Now, if we run a simple query to check the data:

SELECT * FROM users;

You’ll notice there are duplicate emails. Let’s work on cleaning that up!


Step 1: Identifying Duplicates

Before we delete anything, we need to identify which rows are duplicates. One way to find duplicate entries is by using the GROUP BY clause in MySQL:

SELECT email, COUNT(*)
FROM users
GROUP BY email
HAVING COUNT(*) > 1;

This query shows you any email addresses that appear more than once.
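The GROUP BY query tells you which emails are duplicated, but not which rows carry them. Here's a minimal sketch that joins the duplicated emails back to the table so you can see every affected row along with its id. The aliases u and d are just names picked for illustration.

```sql
-- List every row whose email appears more than once.
-- Assumes the users table defined earlier in this post.
SELECT u.id, u.email, u.name, u.created_at
FROM users AS u
JOIN (
    -- Emails that occur in more than one row
    SELECT email
    FROM users
    GROUP BY email
    HAVING COUNT(*) > 1
) AS d ON d.email = u.email
ORDER BY u.email, u.id;
```

With the sample data above, this should return the four John Doe and Jane Smith rows and skip Mary Jane.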

What other ways could duplicates sneak into a table? It’s not always about emails—what else should we look out for? Comment below with your thoughts!


Step 2: Deleting Duplicates, the Safe Way

Deleting rows from a database always requires caution. A common way to delete duplicate rows while keeping one copy is a subquery combined with ROW_NUMBER(). MySQL versions before 8.0 don’t support window functions, though, so we’ll start with a more classic approach that works on older versions too: a temporary table.

Here’s how to safely delete duplicates:

  1. Create a Temporary Table that holds the unique rows.
CREATE TEMPORARY TABLE temp_users AS
SELECT MIN(id) AS id, email, name
FROM users
GROUP BY email, name;

  2. Delete the Duplicates from the original table.
DELETE FROM users
WHERE id NOT IN (SELECT id FROM temp_users);

  3. Verify the Result. Let’s make sure only unique entries remain:
SELECT * FROM users;

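Boom! 🎉 You’ve removed the duplicates, leaving one row (the one with the lowest id) for each combination of email and name. Keep in mind that the temporary table groups by both columns, so two rows that share an email but have different names would both survive.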
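If you'd rather not create a temporary table, one alternative that also works before MySQL 8.0 is a self-join using MySQL's multi-table DELETE syntax. This is a different technique from the steps above. It's sketched under the assumption that rows count as duplicates when both email and name match, and that you want to keep the lowest id. The aliases dup and keeper are illustrative.

```sql
-- Delete every row for which an older row (lower id)
-- with the same email and name exists.
-- Assumes the users table defined earlier in this post.
DELETE dup
FROM users AS dup
JOIN users AS keeper
  ON dup.email = keeper.email
 AND dup.name <=> keeper.name   -- null-safe equality, since name may be NULL
 AND dup.id > keeper.id;        -- the lowest id in each group is never matched as dup
```

Because this compares rows pairwise, it may get slow on large tables that have no index on email, so try it on a copy first.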


Step 3: Another Method for MySQL 8.0+

If you’re using MySQL 8.0 or later, you have access to the ROW_NUMBER() function, which makes this process even easier. Let’s check it out:

WITH RankedUsers AS (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn
    FROM users
)
DELETE FROM users WHERE id IN (
    SELECT id FROM RankedUsers WHERE rn > 1
);

Here, ROW_NUMBER() numbers the rows within each group of identical emails, ordered by id. The row with rn = 1 (the lowest id) is the one we keep; rows with rn > 1 are the duplicates and get deleted.
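To see what the window function produces before deleting anything, you can run the ranking on its own. This is a small sketch for MySQL 8.0+ against the same users table. The alias ranked is just an illustrative name.

```sql
-- Preview the row numbers without deleting anything (MySQL 8.0+).
-- Rows shown with rn > 1 are the ones the DELETE would remove.
SELECT id, email, name, rn
FROM (
    SELECT id, email, name,
           ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn
    FROM users
) AS ranked
ORDER BY email, rn;
```

If the rows marked rn > 1 are not the ones you expected to lose, adjust the PARTITION BY or ORDER BY columns before running the delete.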


Common Pitfalls to Avoid

  1. Accidental Deletion: Always run a SELECT query to identify duplicates before jumping to DELETE. I recommend testing on a small subset of data before running it on the entire table.

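  2. Handling More Complex Duplicates: Sometimes duplicates aren’t exact matches, for instance john.doe@example.com vs. JOHN.DOE@EXAMPLE.COM. Whether MySQL treats these as equal depends on the column’s collation: the default collations are case-insensitive, but a case-sensitive or binary collation sees two different values. To avoid surprises, normalize data before deduplication.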
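Both pitfalls can be addressed together. The sketch below is one possible approach, not the only one: it wraps the cleanup in a transaction so it can be undone, and compares emails after normalizing them with LOWER() and TRIM(). It assumes the users table uses InnoDB (needed for ROLLBACK) and that rows with the same normalized email are duplicates regardless of name. The names normalized_email, keep_id, and keepers are illustrative.

```sql
START TRANSACTION;

-- 1. Preview: which normalized emails occur more than once?
SELECT LOWER(TRIM(email)) AS normalized_email, COUNT(*) AS copies
FROM users
GROUP BY LOWER(TRIM(email))
HAVING COUNT(*) > 1;

-- 2. Keep the lowest id per normalized email and delete the rest.
--    The extra derived table (keepers) is there so MySQL does not
--    reject the statement for reading from the table it deletes from.
DELETE FROM users
WHERE id NOT IN (
    SELECT keep_id
    FROM (
        SELECT MIN(id) AS keep_id
        FROM users
        GROUP BY LOWER(TRIM(email))
    ) AS keepers
);

-- 3. Inspect the result while the transaction is still open.
SELECT * FROM users ORDER BY id;

-- 4. Undo everything. Replace with COMMIT once the result looks right.
ROLLBACK;
```

Ending with ROLLBACK makes the whole script a dry run, which is a convenient way to test on real data without changing it.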


Final Thoughts: Keep it Clean!

Data quality is key to any successful project. Removing duplicates keeps your database lean and your queries fast. The methods I shared here are good starting points, but always adapt to your specific use case.


Interactive Question:

What strategies do you use to prevent duplicates from entering your database in the first place? Share your best practices below!
