Jordan Kalebu

Posted on Oct 5, 2020 • Edited on May 22, 2022

How to build a Simple file cleaner in Python

#python #codenewbie #computerscience #codepen

Hi guys,

Today you're going to learn how to use Python to save a lot of space on your drive by removing duplicate files.

Intro

In many situations you end up with duplicate files on your disk, and tracking them down and checking them manually can be tedious.

Here's a solution

Instead of searching through your disk by hand to see if there are duplicates, you can automate the process by writing a program that recursively walks through the disk and removes every duplicate it finds. That's what this article is about.

But how do we do it?

If we were to read each whole file and compare it byte by byte against every other file in the given directory, it would take a very long time. So how do we do it?

The answer is hashing. With hashing we can generate a string of letters and numbers that acts as the identity of a given file, and if we find any other file with the same identity, we delete it.
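
To make the "identity" idea concrete, here is a minimal sketch. The helper name fingerprint is made up for this illustration and is not part of the tool we build later. It only shows that identical bytes give an identical digest, while even one extra byte gives a different one.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the hex SHA-256 digest of a bytes object (hypothetical helper)."""
    return hashlib.sha256(data).hexdigest()


original = b"the same content"
copy = b"the same content"
different = b"the same content!"

# Identical bytes always produce an identical digest
print(fingerprint(original) == fingerprint(copy))

# A single extra byte is enough to produce a different digest
print(fingerprint(original) == fingerprint(different))
```

This is why comparing short digests can stand in for comparing whole files.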

There are a variety of hashing algorithms out there, such as:

md5
sha1
sha224, sha256, sha384 and sha512
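
As a rough comparison, the sketch below hashes the same bytes with each of the algorithms listed above and records how long each digest is. The variable names are only for illustration. Keep in mind that md5 and sha1 are no longer considered safe against deliberately crafted collisions, which is one reason a tool might prefer sha256.

```python
import hashlib

data = b"Duplython is amazing"
digest_bits = {}

for name in ("md5", "sha1", "sha224", "sha256", "sha384", "sha512"):
    digest = hashlib.new(name, data).hexdigest()
    # Each hex character encodes 4 bits of the digest
    digest_bits[name] = len(digest) * 4
    print(f"{name:<7} {digest_bits[name]:>3} bits  {digest[:16]}...")
```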

Let's do some coding

Hashing in Python is pretty straightforward. We are going to use the hashlib library, which is part of the Python standard library.

Below is an example of how we hash things using hashlib. We are going to hash a string in Python using the md5 hashing algorithm.

Example of Usage

>>> import hashlib
>>> example_text = "Duplython is amazing".encode('utf-8')
>>> hashlib.md5(example_text).hexdigest()
'73a14f46eadcc04f4e04bec8eb66f2ab'

It's straightforward: you just need to import hashlib, call md5() on the encoded bytes to create the hash, and finally call hexdigest() to get the hash as a string.

The above example shows how to hash a string, but for the project we are about to build we are more concerned with files than strings, so another question arises:

How do we hash files?

Hashing files is similar to hashing a string, with just a minor difference: we first need to open the file in binary mode and then generate a hash of the file's binary contents.

Hashing File

Let's say you have a simple text document named learn.txt in your project directory. This is how we would do it.

>>> import hashlib
>>> file = open('learn.txt', 'rb').read()
>>> hashlib.md5(file).hexdigest()
'0534cf6d5816c4f1ace48fff75f616c9'

Even if you generate the hash a second time, the value doesn't change as long as the file's contents are the same.
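
The same holds for a copy of the file under a different name, which is exactly what a duplicate is. Here is a small sketch of that, working in a temporary folder so nothing on your disk is touched. The helper md5_of and the file names are assumptions made for this example.

```python
import hashlib
import os
import shutil
import tempfile


def md5_of(path: str) -> str:
    """Hash a whole (small) file in one go; hypothetical helper for this demo."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


with tempfile.TemporaryDirectory() as folder:
    original = os.path.join(folder, "learn.txt")
    duplicate = os.path.join(folder, "learn (copy).txt")

    with open(original, "w", encoding="utf-8") as f:
        f.write("Duplython is amazing")
    shutil.copyfile(original, duplicate)

    # The file name plays no part in the hash, only the contents do
    same_hash = md5_of(original) == md5_of(duplicate)

print(same_hash)
```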

The challenge arises when we try to read a very large file: loading it all at once takes a while and uses a lot of memory. Instead of waiting for the whole file to load into memory, we can keep computing the hash as we read the file.

Computing the hash while reading requires us to read the file in blocks of a given size and keep updating the hash with each block until we have hashed the whole file.

Doing it this way saves us from waiting for the whole file to be loaded, and keeps memory use low no matter how large the file is.

Example of Usage

>>> import hashlib
>>> block_size = 1024
>>> file_hash = hashlib.md5()
>>> with open('learn.txt', 'rb') as file:
...     block = file.read(block_size)
...     while len(block) > 0:
...         file_hash.update(block)
...         block = file.read(block_size)
...     print(file_hash.hexdigest())
... 
0534cf6d5816c4f1ace48fff75f616c9

As you can see, the hash has not changed; it's still the same. We are therefore ready to build our Python tool to do the job.
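
If you want to convince yourself that block-by-block hashing really gives the same result as hashing everything at once, here is a minimal sketch. The function name hash_file and the file sample.bin are made up for this example, and it uses sha256 simply because that is what the tool below imports.

```python
import hashlib
import os
import tempfile


def hash_file(path: str, block_size: int = 65536) -> str:
    """Hash a file block by block so it never has to fit in memory at once."""
    file_hash = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() keeps calling f.read(block_size) until it returns b"" (end of file)
        for block in iter(lambda: f.read(block_size), b""):
            file_hash.update(block)
    return file_hash.hexdigest()


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.bin")
    payload = os.urandom(200_000)  # random bytes standing in for a bigger file
    with open(path, "wb") as f:
        f.write(payload)

    chunked = hash_file(path, block_size=1024)
    whole = hashlib.sha256(payload).hexdigest()

print(chunked == whole)
```

The block size only affects how much is read per step, not the resulting digest.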

But wait, hashing is just one step. We also need a way to actually remove the duplicates, and for that we are going to use Python's built-in os module.

We are going to use the os.remove() function to remove the duplicates on our drive.

Let's try deleting learn.txt with the os module

Example of Usage (os module):

>>> import os
>>> os.listdir()
['Desktop-File-cleaner', '.git', 'learn.txt', 'app.py', 'README.md']
>>> os.remove('learn.txt')
>>> os.listdir()
['Desktop-File-cleaner', '.git', 'app.py', 'README.md']

Well, that's simple: you just call remove() with the name of the file you want to remove, and it's done. Now let's go build our application.
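
One thing worth knowing is that os.remove() raises FileNotFoundError when the file is already gone. Below is a small sketch of a wrapper that reports this instead of crashing. The name remove_if_exists is an assumption made for this example, not something from the os module.

```python
import os
import tempfile


def remove_if_exists(path: str) -> bool:
    """Delete a file; return False instead of raising if it is already gone."""
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        return False


with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "learn.txt")
    open(target, "w").close()  # create an empty file to delete

    first = remove_if_exists(target)   # the file exists, so it is deleted
    second = remove_if_exists(target)  # the file is already gone

print(first, second)
```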

Building our cleaning tool

Importing necessary libraries

import time
import os
from hashlib import sha256

I'm a huge fan of object-oriented programming, so in this article we are going to build our tool as a single class. Below is just a skeleton class for our code.

import time
import os
from hashlib import sha256

class Duplython:
    def __init__(self):
        self.home_dir = os.getcwd()
        self.File_hashes = []
        self.Cleaned_dirs = []
        self.Total_bytes_saved = 0
        self.block_size = 65536
        self.count_cleaned = 0

    def welcome(self)->None:
        print('******************************************************************')
        print('****************        DUPLYTHON      ****************************')
        print('********************************************************************\n\n')
        print('----------------        WELCOME        ----------------------------')
        time.sleep(3)
        print('\nCleaning .................')

    def main(self)->None:
        self.welcome()

if __name__ == '__main__':
    App = Duplython()
    App.main()

That's just the initial skeleton of our Python program; when we run it, it just prints the welcome banner to the screen.

Output :

$ python3 app.py
******************************************************************
****************        DUPLYTHON      ****************************
********************************************************************

----------------        WELCOME        ----------------------------

Cleaning .................

We now have to create a simple method to generate the hash of a file at a given path, using the hashing knowledge we learned above.

import time
import os
from hashlib import sha256

class Duplython:
    def __init__(self):
        self.home_dir = os.getcwd()
        self.File_hashes = []
        self.Cleaned_dirs = []
        self.Total_bytes_saved = 0
        self.block_size = 65536
        self.count_cleaned = 0

    def welcome(self)->None:
        print('******************************************************************')
        print('****************        DUPLYTHON      ****************************')
        print('********************************************************************\n\n')
        print('----------------        WELCOME        ----------------------------')
        time.sleep(3)
        print('\nCleaning .................')

    def generate_hash(self, Filename:str)->str:
        Filehash = sha256()
        try:
            with open(Filename, 'rb') as File:
                # Read the file block by block so large files never sit fully in memory
                fileblock = File.read(self.block_size)
                while len(fileblock)>0:
                    Filehash.update(fileblock)
                    fileblock = File.read(self.block_size)
            return Filehash.hexdigest()
        except OSError:
            # Unreadable file (for example, permission denied): signal failure to the caller
            return False

    def main(self)->None:
        self.welcome()

if __name__ == '__main__':
    App = Duplython()
    App.main()

Now let's implement our program logic

Now that we have a method to generate the hash of a file at a given path, let's implement the part where we compare those hashes and remove any duplicates found.

I have made a simple method called clean() to do just that, as shown below.

import time
import os
from hashlib import sha256

class Duplython:
    def __init__(self):
        self.home_dir = os.getcwd()
        self.File_hashes = []
        self.Cleaned_dirs = []
        self.Total_bytes_saved = 0
        self.block_size = 65536
        self.count_cleaned = 0

    def welcome(self)->None:
        print('******************************************************************')
        print('****************        DUPLYTHON      ****************************')
        print('********************************************************************\n\n')
        print('----------------        WELCOME        ----------------------------')
        time.sleep(3)
        print('\nCleaning .................')

    def generate_hash(self, Filename:str)->str:
        Filehash = sha256()
        try:
            with open(Filename, 'rb') as File:
                # Read the file block by block so large files never sit fully in memory
                fileblock = File.read(self.block_size)
                while len(fileblock)>0:
                    Filehash.update(fileblock)
                    fileblock = File.read(self.block_size)
            return Filehash.hexdigest()
        except OSError:
            # Unreadable file (for example, permission denied): signal failure to the caller
            return False

    def clean(self)->None:
        # Collect every directory under the current one before changing into any of them
        all_dirs = [path[0] for path in os.walk('.')]
        for path in all_dirs:
            os.chdir(path)
            All_Files = [file for file in os.listdir() if os.path.isfile(file)]
            for file in All_Files:
                filehash = self.generate_hash(file)
                if not filehash:
                    # The file could not be read, so skip it
                    continue
                if filehash not in self.File_hashes:
                    # First time this content is seen: remember its hash
                    self.File_hashes.append(filehash)
                else:
                    # Same content already seen: count the saved bytes and delete the file
                    self.Total_bytes_saved += os.path.getsize(file)
                    self.count_cleaned += 1
                    os.remove(file)
                    print(file, '.. cleaned ')
            # Return to the starting directory so the next relative path resolves
            os.chdir(self.home_dir)

    def main(self)->None:
        self.welcome()
        self.clean()

if __name__ == '__main__':
    App = Duplython()
    App.main()
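
Since clean() deletes files right away, it can be handy to look before you leap. Below is a rough sketch of a more cautious variant that only reports duplicates and deletes nothing. The names hash_file and find_duplicates are made up for this example and are not part of Duplython. It also uses full paths instead of os.chdir() and a dict instead of a list, so each lookup stays fast however many files have been seen.

```python
import os
import tempfile
from hashlib import sha256


def hash_file(path, block_size=65536):
    """Return the SHA-256 hex digest of a file, read block by block."""
    file_hash = sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            file_hash.update(block)
    return file_hash.hexdigest()


def find_duplicates(root):
    """Return (duplicate_path, first_seen_path) pairs without deleting anything."""
    seen = {}        # digest -> first path found with that content
    duplicates = []
    for folder, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subfolders in a predictable order
        for name in sorted(filenames):
            path = os.path.join(folder, name)
            try:
                digest = hash_file(path)
            except OSError:
                continue  # unreadable file, skip it
            if digest in seen:
                duplicates.append((path, seen[digest]))
            else:
                seen[digest] = path
    return duplicates


# Small demo in a temporary folder: a.txt and sub/b.txt share the same content
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "sub"))
    for relative, text in (("a.txt", "hello"), ("c.txt", "other"), ("sub/b.txt", "hello")):
        with open(os.path.join(root, relative), "w", encoding="utf-8") as f:
            f.write(text)
    found = find_duplicates(root)

for duplicate, original in found:
    print(os.path.basename(duplicate), "duplicates", os.path.basename(original))
```

You could review the reported pairs first and only then pass the ones you agree with to os.remove().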

Our program is now nearly complete. We just have to add a simple method that prints a summary of the cleaning process to the command line: how many files were removed and how many bytes of disk space were saved.

I have implemented the method cleaning_summary() to do just that. It prints the summary of the cleaning process to the screen, which completes our Python tool as shown below.

import time
import os
from hashlib import sha256

class Duplython:
    def __init__(self):
        self.home_dir = os.getcwd()
        self.File_hashes = []
        self.Cleaned_dirs = []
        self.Total_bytes_saved = 0
        self.block_size = 65536
        self.count_cleaned = 0

    def welcome(self)->None:
        print('******************************************************************')
        print('****************        DUPLYTHON      ****************************')
        print('********************************************************************\n\n')
        print('----------------        WELCOME        ----------------------------')
        time.sleep(3)
        print('\nCleaning .................')

    def generate_hash(self, Filename:str)->str:
        Filehash = sha256()
        try:
            with open(Filename, 'rb') as File:
                # Read the file block by block so large files never sit fully in memory
                fileblock = File.read(self.block_size)
                while len(fileblock)>0:
                    Filehash.update(fileblock)
                    fileblock = File.read(self.block_size)
            return Filehash.hexdigest()
        except OSError:
            # Unreadable file (for example, permission denied): signal failure to the caller
            return False

    def clean(self)->None:
        # Collect every directory under the current one before changing into any of them
        all_dirs = [path[0] for path in os.walk('.')]
        for path in all_dirs:
            os.chdir(path)
            All_Files = [file for file in os.listdir() if os.path.isfile(file)]
            for file in All_Files:
                filehash = self.generate_hash(file)
                if not filehash:
                    # The file could not be read, so skip it
                    continue
                if filehash not in self.File_hashes:
                    # First time this content is seen: remember its hash
                    self.File_hashes.append(filehash)
                else:
                    # Same content already seen: count the saved bytes and delete the file
                    self.Total_bytes_saved += os.path.getsize(file)
                    self.count_cleaned += 1
                    os.remove(file)
                    print(file, '.. cleaned ')
            # Return to the starting directory so the next relative path resolves
            os.chdir(self.home_dir)

    def cleaning_summary(self)->None:
        mb_saved = self.Total_bytes_saved/1048576
        mb_saved = round(mb_saved, 2)
        print('\n\n--------------FINISHED CLEANING ------------')
        print('File cleaned  : ', self.count_cleaned)
        print('Total Space saved : ', mb_saved, 'MB')
        print('-----------------------------------------------')

    def main(self)->None:
        self.welcome()
        self.clean()
        self.cleaning_summary()

if __name__ == '__main__':
    App = Duplython()
    App.main()
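
A quick note on the number in cleaning_summary(): 1048576 is 1024 * 1024, so the total is reported in units of 1024 * 1024 bytes. If you would rather have the unit adapt to the size, here is a small sketch. The function format_size is a made-up helper for this example, not part of Duplython.

```python
def format_size(num_bytes: int) -> str:
    """Format a byte count using 1024-based units (hypothetical helper)."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB"):
        if size < 1024:
            return f"{size:.2f} {unit}"
        size /= 1024
    return f"{size:.2f} TB"


print(format_size(512))
print(format_size(1536))
print(format_size(1048576))
```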

Our app is now complete. Run it from the specific folder you want to clean, and it will iterate recursively over that folder to find all the files and remove the duplicate ones.

Example output :

$ python3 app.py 
******************************************************************
****************        DUPLYTHON      ****************************
********************************************************************


----------------        WELCOME        ----------------------------

Cleaning .................
0(copy).jpeg .. cleaned 
0 (1)(copy).jpeg .. cleaned 
0 (2)(copy).jpeg .. cleaned 


--------------FINISHED CLEANING ------------
File cleaned  :  3
Total Space saved :  0.38 MB
-----------------------------------------------

I hope you found this post interesting. Now it's time to share it with your friends on Twitter and other dev communities.

The original article can be found on kalebujordan.dev

Kalebu / Duplython

CLI tool that recursively removes all the duplicate files in a given directory

Duplython

What is Duplython ?

Duplython is a simple CLI app that can be used to recursively remove duplicate files in a given directory using Python

How does it work ?

Under the hood, Duplython detects duplicate files using hashing. Every file, whether it's an image, music, or video, has a hash determined by its contents, so if any two files have the same hash, one of them will be removed.

Getting started

To start using this tool you might want to clone or download this repository

$-> git clone https://github.com/Kalebu/Duplython

Dependencies

You don't need to install anything; all the libraries and modules used in this tool are found in the Python standard library.

Move into project directory

Now move into the project repository and you will see a script named app.py. Move it to the top directory of the folder where you would like to clean up the duplicates…


Top comments (2)

Bill Miller • Oct 7 '20 • Edited

If I'm reading this correctly, it seems like it will keep the first file it finds and then turf the rest. This may be a "bad thing" if the first file is a temp file and a subsequent file is the "good one". Just an observation, and other than that, it's a great idea!!
To alleviate this problem it may make more sense to have it create a script to delete the duplicate files instead of actually deleting them immediately. That way a user can view the files to be deleted and remove lines that would be a problem. This could be a bad idea too... YMMV :D

Jordan Kalebu • Oct 8 '20

Well, thanks @bill Miller for pointing that out.

I think it's a good idea; I had never considered that. To fix the temp file issue, we could also try removing the temp folder from all_dirs so that the script only focuses on the real files on the drive.