MK

Posted on • Originally published at webdesignguy.me

Efficient File Management: How to Find and Remove Duplicate Files Using Python

Introduction

Duplicate files can clutter your storage space and make it difficult to manage your data efficiently. Whether you want to free up disk space or simply keep your files organized, finding and removing duplicate files is a useful task. In this blog post, we will explore how to check for duplicate files in a directory using Python and create a simple script for this purpose.

Python and hashlib

Python is a versatile programming language that allows you to automate various tasks, including file management. We will use the hashlib library in Python to calculate hash values for files. A hash value is a compact fingerprint of a file's contents: files with identical contents always produce the same hash, which makes hashes ideal for detecting duplicates.
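
As a quick illustration, hashing the same bytes always yields the same digest, while even a one-character change produces a completely different one:

import hashlib

print(hashlib.md5(b"hello world").hexdigest())   # 5eb63bbbe01eeed093cb22bb8f5acdc3
print(hashlib.md5(b"hello world!").hexdigest())  # a completely different digest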

Calculating File Hashes

To compare files, we need to calculate hash values for each file in the directory. We'll use the MD5 hash algorithm provided by the hashlib library. MD5 is no longer considered secure for cryptographic purposes, but it is fast and perfectly adequate for spotting duplicate files. Here's a Python function that calculates the MD5 hash of a file:

import hashlib

def get_file_hash(file_path):
    """Return the MD5 hex digest of the file at file_path."""
    hash_md5 = hashlib.md5()
    # Read in 4 KB chunks so large files don't need to fit in memory.
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

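As a quick sanity check, you can compare the digests of two files directly. The file names below are placeholders, not part of the script; substitute any two files on your machine:

print(get_file_hash("example.txt"))
print(get_file_hash("example.txt") == get_file_hash("copy_of_example.txt"))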

Finding Duplicate Files

Now that we can calculate hash values for files, we'll create a function to find duplicate files in a directory. The script will iterate through all files in the specified directory and its subdirectories, comparing their hash values. Here's the function:

import os

def find_duplicate_files(directory):
    # Maps each hash seen so far to the first file that produced it.
    file_hash_dict = {}
    duplicate_files = []

    # Walk the directory tree, hashing every file along the way.
    for root, dirs, files in os.walk(directory):
        for file_name in files:
            file_path = os.path.join(root, file_name)
            file_hash = get_file_hash(file_path)

            if file_hash in file_hash_dict:
                # Same hash as an earlier file: record the pair.
                duplicate_files.append((file_path, file_hash_dict[file_hash]))
            else:
                file_hash_dict[file_hash] = file_path

    return duplicate_files


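One caveat: this hashes every file it encounters, which can be slow on large directory trees. A common optimization, not part of the original script, is to group files by size first and only hash files that share a size, since files of different sizes can never be identical. A minimal sketch building on get_file_hash:

import os
from collections import defaultdict

def find_duplicate_files_fast(directory):
    # Group file paths by size first; only same-size files can be duplicates.
    files_by_size = defaultdict(list)
    for root, dirs, files in os.walk(directory):
        for file_name in files:
            file_path = os.path.join(root, file_name)
            files_by_size[os.path.getsize(file_path)].append(file_path)

    duplicate_files = []
    for paths in files_by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have duplicates
        seen_hashes = {}
        for path in paths:
            file_hash = get_file_hash(path)
            if file_hash in seen_hashes:
                duplicate_files.append((path, seen_hashes[file_hash]))
            else:
                seen_hashes[file_hash] = path
    return duplicate_files

On a directory full of mostly unique files, this skips the expensive hashing step for everything whose size appears only once.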

Putting It All Together

Now, let's create the main part of our script. We'll prompt the user to input the directory path they want to check for duplicate files, and then we'll call the functions we defined earlier. Here's the main function:

def main():
    directory = input("Enter the directory path to check for duplicate files: ")

    if not os.path.isdir(directory):
        print("Invalid directory path.")
        return

    duplicates = find_duplicate_files(directory)

    if duplicates:
        print("Duplicate files found:")
        # Each pair holds the later duplicate and the first file seen with that hash.
        for duplicate, original in duplicates:
            print(f"Duplicate: {duplicate}")
            print(f"Original: {original}")
            print("-" * 30)
    else:
        print("No duplicate files found.")

if __name__ == "__main__":
    main()


Running the Script

To use this script:

  1. Save it as a .py file (e.g., find_duplicates.py).

  2. Open a terminal or command prompt.

  3. Navigate to the directory where you saved the script.

  4. Run the script by entering python find_duplicates.py.

  5. Enter the directory path you want to check for duplicate files when prompted.

The script will then identify and display any duplicate files in the specified directory.
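
Note that the script only reports duplicates; it never deletes anything. If you want to act on the results, a cautious extension (a sketch, not part of the original script) is to confirm each deletion interactively:

def remove_duplicates(duplicates):
    # Each pair is (duplicate, original); the original is always kept.
    for duplicate, original in duplicates:
        answer = input(f"Delete {duplicate} (duplicate of {original})? [y/N] ")
        if answer.strip().lower() == "y":
            os.remove(duplicate)
            print(f"Deleted {duplicate}")

You could call remove_duplicates(duplicates) at the end of main() after reviewing the printed list. Test it on files you don't care about first.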

Conclusion

Managing duplicate files is an essential part of keeping your storage organized and efficient. With this Python script, you can quickly find duplicate files in any directory and decide which copies to remove. Feel free to use and modify the script to suit your specific needs. Happy file management!
