DEV Community

Sean Coughlin
Sean Coughlin

Posted on • Originally published at blog.seancoughlin.me on

Delete Duplicate Emails

Eliminating Duplicate Emails Efficiently: A Comprehensive Guide to Data Cleanup Using Pandas and SQL

The Problem

With this article, I will be covering the delete duplicate emails Leetcode problem.

Leetcode describes this problem as easy. That's a super reasonable evaluation as the solution requires only basic SQL or Pandas knowledge.

The problem description is as follows:

Write a solution to delete all duplicate emails, keeping only one unique email with the smallest id.

For SQL users, please note that you are supposed to write a DELETE statement and not a SELECT one.

For Pandas users, please note that you are supposed to modify Person in place.

After running your script, the answer shown is the Person table. The driver will first compile and run your piece of code and then show the Person table. The final order of the Person table does not matter.

+-------------+---------+
| Column Name | Type |
+-------------+---------+
| id | int |
| email | varchar |
+-------------+---------+
id is the primary key (column with unique values) for this table.
Each row of this table contains an email. 
The emails will not contain uppercase letters.

The Solution

Pandas

Pandas is a great Python tool for data analysis and manipulation. Built into that library is the drop duplicates function. Please note that the problem statement asks us to do this in place.

Using the Pandas library this can be achieved by first in-place sorting by the id field and then dropping the duplicates from email. We want to keep at least the first duplicated element.

def delete_duplicate_emails(person: pd.DataFrame) -> None:
    person.sort_values(by='id', inplace=True)
    person.drop_duplicates(subset='email', keep='first', inplace=True)

Enter fullscreen mode Exit fullscreen mode

Voila! I love these simple questions.

SQL

In SQL we want to run a delete query. We will grab two copies of the person table and match them based on the email. To keep at least one of the solutions we only delete the entry with the higher id value. This keeps the email associated with the smallest id.

DELETE p1 
FROM person p1, person p2 
WHERE p1.email = p2.email AND p1.id > p2.id;

Enter fullscreen mode Exit fullscreen mode

As with many problems, there are multiple ways to solve them. These Pandas and SQL solutions are but one way of approaching the delete duplicate question.


Originally published at https://blog.seancoughlin.me.

Top comments (0)