Rahul Rahi

How to Remove Duplicates from a List in Python? (with code)

Lists are one of the most fundamental and versatile data structures. They are similar to dynamic arrays, capable of holding an ordered collection of objects, which can be of any type. Python, with its simplicity and power, provides an intuitive way to work with lists. However, like any data structure, lists come with their own challenges. One such challenge is the presence of duplicate objects or elements.

Imagine you’re compiling a list of email subscribers, and you notice that some email addresses appear more than once. Or perhaps you are collecting data from sensors, and due to some glitches, some data points are recorded multiple times. These repetitions, known as duplicates, can cause inaccuracies in data analysis, increased memory usage, and even errors in some algorithms.

But why do duplicates matter, and why should we be concerned about them? There are many reasons. From ensuring data integrity to optimizing memory and ensuring the accuracy of data analysis, handling duplicates is an important aspect of data management in Python.

In this guide, we’ll embark on a journey to understand what duplicates are in a list, why they may appear, and most importantly, different ways to remove them efficiently. Whether you’re just starting out with Python or are an experienced developer looking for a refresher, this guide aims to provide a clear and concise overview of handling duplicates in Python lists.

What are duplicates in a list?
In the context of programming and data structures, a list is a collection of objects, called elements, that can be of any data type, such as integers, strings, or even other lists. When two or more elements in a list have the same value, they are considered duplicates.

For example, consider the list: [1, 2, 3, 2, 4, 3].

In this list, numbers 2 and 3 occur more than once, so they are duplicates.

Why might you want to remove duplicates?
There are several reasons why someone might want to remove duplicates from a list:

  • Data integrity: Duplicates can sometimes be the result of errors in data collection or processing. By removing duplicates, you ensure that each item in the list is unique, thereby maintaining the integrity of your data.
  • Efficiency: Duplicates can take up unnecessary space in memory. If you’re working with large datasets, removing duplicates can help optimize memory usage.
  • Accurate analysis: If you’re doing statistical analysis or data visualization, duplicates can skew your results. For example, if you are calculating the average of a list of numbers, duplicates may affect the result.
  • User experience: In applications where users interact with lists (for example, a list of search results or product listings), showing duplicate items can be unnecessary and confusing.
  • Database operations: When inserting data into a database, especially in relational databases, duplicates may violate unique constraints or lead to redundant records.
  • Algorithm requirements: Some algorithms require input lists of unique elements to function correctly or optimally.

Example of a list with duplicates

In the world of programming, real-world data is often messy and incomplete. When dealing with lists in Python, it is common to encounter duplicates. For example, suppose you are collecting feedback ratings from a website, and due to some technical glitches, some ratings are recorded multiple times. Your list might look something like this:

ratings = [5, 4, 3, 5, 5, 3, 2, 4, 5]

In the above list, the rating 5 appears four times, 4 appears twice, and 3 appears twice. These repetitions are the duplicates we’re referring to.

The challenge of preserving order

Removing duplicates may seem simple at first glance. An obvious first thought is to convert the list into a set, which by its nature does not allow duplicates. However, there is a catch: sets do not preserve the order of elements, and in many scenarios the order of elements in a list matters.

Let’s take the example of our ratings. If the ratings were given in chronological order, converting the list to a set and then back to a list would lose this chronological information: the original order in which the ratings were given would be gone.

Using set to remove duplicates

unique_ratings = list(set(ratings))
print(unique_ratings) # The order might be different from the original list

In data analysis, order often carries important information: think of a time series of stock prices, temperature readings, or the sequence of DNA bases in bioinformatics. Preserving this order while removing duplicates requires a more careful approach than a simple set conversion.
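
To preview one of the order-preserving approaches discussed below, here is a minimal sketch using dict.fromkeys(), which keeps only the first occurrence of each value. Since Python 3.7, dictionaries preserve insertion order, so the original sequence of first appearances survives:

ratings = [5, 4, 3, 5, 5, 3, 2, 4, 5]
unique_ratings = list(dict.fromkeys(ratings))  # dict keys are unique and keep insertion order
print(unique_ratings)  # [5, 4, 3, 2]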

Methods to Remove Duplicates from a List

Lists are a fundamental data structure in Python, often used to store collections of objects. However, as data is collected, processed, or manipulated, duplicates can be inadvertently introduced into these lists. Duplicates can lead to inaccuracies in data analysis, increased memory usage, and potential errors in some algorithms.

Therefore, we need techniques to remove these duplicates efficiently while considering factors such as preserving the order of elements.

List of Methods to Remove Duplicates:

Using a Loop: A basic approach where we iterate over the list and construct a new list without duplicates.

Using List Comprehension: A concise method that leverages Python’s list comprehension feature combined with sets to filter out duplicates (quick sketches of this and a couple of the other approaches follow this list).

Using the set Data Type: A direct method that uses the properties of sets to eliminate duplicates but might not preserve order.

Using dict.fromkeys(): A method that exploits the uniqueness of dictionary keys to remove duplicates while maintaining the order.

Using Python Libraries: There are built-in Python libraries like itertools and collections that offer tools to handle duplicates.

Custom Functions for Complex Data Types: For lists containing complex data types like objects or dictionaries, custom functions might be needed to define uniqueness and remove duplicates.
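
Before going through each method in detail, here are brief, minimal sketches of a few of the approaches listed above, reusing the ratings list from earlier. These are illustrations of the general techniques, not the exact implementations covered in the following sections.

A list comprehension paired with a set of already-seen values keeps the first occurrence of each element while preserving order:

ratings = [5, 4, 3, 5, 5, 3, 2, 4, 5]
seen = set()
unique_ratings = [r for r in ratings if not (r in seen or seen.add(r))]  # seen.add() returns None, so only unseen values pass
print(unique_ratings)  # [5, 4, 3, 2]

From the standard library, collections.OrderedDict.fromkeys() removes duplicates while keeping order on any Python 3 version:

from collections import OrderedDict

unique_ratings = list(OrderedDict.fromkeys(ratings))
print(unique_ratings)  # [5, 4, 3, 2]

For complex data types such as dictionaries, a small custom function can define uniqueness by a chosen field. The helper name unique_by_key and the id field below are just illustrative:

def unique_by_key(items, key):
    seen = set()
    result = []
    for item in items:
        value = item[key]
        if value not in seen:  # keep only the first item with this key value
            seen.add(value)
            result.append(item)
    return result

users = [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}, {"id": 1, "name": "A"}]
print(unique_by_key(users, "id"))  # [{'id': 1, 'name': 'A'}, {'id': 2, 'name': 'B'}]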

Now we will start with an explanation of each method, one by one.

1. Using a Loop
One of the most intuitive ways to remove duplicates from a list in Python is to use a loop. This method involves iterating over the original list and building a new list that contains only unique items. Although it is straightforward and easy to understand, it is important to be aware of its performance characteristics, especially with large lists. Let’s look at this method in detail.

Code Example
def remove_duplicates(input_list):
    no_duplicates = []  # Initialize an empty list to store unique items
    for item in input_list:  # Iterate over each item in the input list
        if item not in no_duplicates:  # Check if the item is already in our unique list
            no_duplicates.append(item)  # If not, add it to our unique list
    return no_duplicates

Input

sample_list = [1, 2, 3, 2, 4, 3]
print(remove_duplicates(sample_list))

Output

[1, 2, 3, 4]

Explanation:
We start by initializing an empty list called no_duplicates. This list will house our unique items as we identify them.

We then iterate over each item in the input_list using a for loop.
For each item, we check whether it already exists in our no_duplicates list, using the condition if item not in no_duplicates.
If the item is not already in no_duplicates (i.e., it is unique), we append it to the list.

Once the loop completes, we have a list (no_duplicates) containing all the unique items from the original list, preserving their order. We return this list.
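
As noted above, this approach checks item not in no_duplicates on every iteration, and that check scans the whole list, so the method can become slow for large inputs (roughly quadratic in the number of elements). A common variation, shown here as a sketch rather than as part of the original example, tracks seen values in a set so that membership checks are fast on average while the output order is still preserved:

def remove_duplicates_fast(input_list):
    seen = set()  # Set membership checks are O(1) on average
    no_duplicates = []
    for item in input_list:
        if item not in seen:  # Only the first occurrence of each value passes
            seen.add(item)
            no_duplicates.append(item)
    return no_duplicates

print(remove_duplicates_fast([1, 2, 3, 2, 4, 3]))  # [1, 2, 3, 4]

Note that this variation requires the elements to be hashable (numbers, strings, and tuples are; lists and dictionaries are not).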
