MK

Posted on • Originally published at webdesignguy.me

Efficient File Management: How to Find and Remove Duplicate Files Using Python

Introduction

Duplicate files can clutter your storage space and make it difficult to manage your data efficiently. Whether you want to free up disk space or simply keep your files organized, finding and removing duplicate files is a useful task. In this blog post, we will explore how to check for duplicate files in a directory using Python and create a simple script for this purpose.

Python and hashlib

Python is a versatile programming language that allows you to automate various tasks, including file management. We will use the hashlib library in Python to calculate hash values for files. A hash value is a compact fingerprint of a file's contents: files with identical contents always produce the same hash, which makes hashes ideal for detecting duplicates.
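
As a quick illustration, hashing the same bytes always yields the same digest, while even a one-character change produces a completely different one:

import hashlib

print(hashlib.md5(b"hello world").hexdigest())   # 5eb63bbbe01eeed093cb22bb8f5acdc3
print(hashlib.md5(b"hello world!").hexdigest())  # a completely different digest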

Calculating File Hashes

To compare files, we need to calculate hash values for each file in the directory. We'll use the MD5 hash algorithm provided by the hashlib library. MD5 is no longer considered secure for cryptographic purposes, but it is fast and perfectly adequate for spotting duplicate files. Here's a Python function that calculates the MD5 hash of a file:

import hashlib

def get_file_hash(file_path):
    """Return the MD5 hex digest of the file at file_path."""
    hash_md5 = hashlib.md5()
    # Read in 4 KB chunks so large files don't need to fit in memory.
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

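As a quick sanity check, you can compare the digests of two files directly. The file names below are placeholders, not part of the script; substitute any two files on your machine:

print(get_file_hash("example.txt"))
print(get_file_hash("example.txt") == get_file_hash("copy_of_example.txt"))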

Finding Duplicate Files

Now that we can calculate hash values for files, we'll create a function to find duplicate files in a directory. The script will iterate through all files in the specified directory and its subdirectories, comparing their hash values. Here's the function:

import os

def find_duplicate_files(directory):
    # Maps each hash seen so far to the first file that produced it.
    file_hash_dict = {}
    duplicate_files = []

    # Walk the directory tree, hashing every file along the way.
    for root, dirs, files in os.walk(directory):
        for file_name in files:
            file_path = os.path.join(root, file_name)
            file_hash = get_file_hash(file_path)

            if file_hash in file_hash_dict:
                # Same hash as an earlier file: record the pair.
                duplicate_files.append((file_path, file_hash_dict[file_hash]))
            else:
                file_hash_dict[file_hash] = file_path

    return duplicate_files


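One caveat: this hashes every file it encounters, which can be slow on large directory trees. A common optimization, not part of the original script, is to group files by size first and only hash files that share a size, since files of different sizes can never be identical. A minimal sketch building on get_file_hash:

import os
from collections import defaultdict

def find_duplicate_files_fast(directory):
    # Group file paths by size first; only same-size files can be duplicates.
    files_by_size = defaultdict(list)
    for root, dirs, files in os.walk(directory):
        for file_name in files:
            file_path = os.path.join(root, file_name)
            files_by_size[os.path.getsize(file_path)].append(file_path)

    duplicate_files = []
    for paths in files_by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have duplicates
        seen_hashes = {}
        for path in paths:
            file_hash = get_file_hash(path)
            if file_hash in seen_hashes:
                duplicate_files.append((path, seen_hashes[file_hash]))
            else:
                seen_hashes[file_hash] = path
    return duplicate_files

On a directory full of mostly unique files, this skips the expensive hashing step for everything whose size appears only once.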

Putting It All Together

Now, let's create the main part of our script. We'll prompt the user to input the directory path they want to check for duplicate files, and then we'll call the functions we defined earlier. Here's the main function:

def main():
    directory = input("Enter the directory path to check for duplicate files: ")

    if not os.path.isdir(directory):
        print("Invalid directory path.")
        return

    duplicates = find_duplicate_files(directory)

    if duplicates:
        print("Duplicate files found:")
        # Each pair holds the later duplicate and the first file seen with that hash.
        for duplicate, original in duplicates:
            print(f"Duplicate: {duplicate}")
            print(f"Original: {original}")
            print("-" * 30)
    else:
        print("No duplicate files found.")

if __name__ == "__main__":
    main()


Running the Script

To use this script:

  1. Save it as a .py file (e.g., find_duplicates.py).

  2. Open a terminal or command prompt.

  3. Navigate to the directory where you saved the script.

  4. Run the script by entering python find_duplicates.py.

  5. Enter the directory path you want to check for duplicate files when prompted.

The script will then identify and display any duplicate files in the specified directory.
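
Note that the script only reports duplicates; it never deletes anything. If you want to act on the results, a cautious extension (a sketch, not part of the original script) is to confirm each deletion interactively:

def remove_duplicates(duplicates):
    # Each pair is (duplicate, original); the original is always kept.
    for duplicate, original in duplicates:
        answer = input(f"Delete {duplicate} (duplicate of {original})? [y/N] ")
        if answer.strip().lower() == "y":
            os.remove(duplicate)
            print(f"Deleted {duplicate}")

You could call remove_duplicates(duplicates) at the end of main() after reviewing the printed list. Test it on files you don't care about first.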

Conclusion

Managing duplicate files is an essential part of keeping your storage organized and efficient. With this Python script, you can quickly find duplicate files in any directory and decide which copies to remove. Feel free to use and modify the script to suit your specific needs. Happy file management!
