Duplicate files in this case are files with different names but the same contents.
How do we find them?
TL;DR
- Mostly good to use fdupes
- There's a way to handle the process (it won't be exact).
How to find duplicate files
First, consider a file like the following
$ ls -l
total 20
-rw-r--r-- 1 root root 1 May 6 00:53 a.dat
-rw-r--r-- 1 root root 2 May 6 00:53 b.dat
-rw-r--r-- 1 root root 1024 May 6 01:00 c.dat
-rw-r--r-- 1 root root 1024 May 6 01:00 d.dat
-rw-r--r-- 1 root root 1024 May 6 01:00 e.dat
For a small file like this, you can calculate the hash of all files as follows, but let's try to be creative.
The hash value calculation can be done with md5, sha-1 or whatever, but we use light md5 as well as fdupes.
$ md5sum * | perl -nalE 'push(@{$u{$F[0]}},$F[1])}{for(keys %u){say"@{$u{$_}}"if@{$u{$_}}>1}'
c.dat e.dat
For a.dat and b.dat, the file size is 1 byte and 2 bytes, respectively, so we know they are unique without any calculation.
Next, before calculating the hash values of c.dat, d.dat, and e.dat, let's calculate the first 10 bytes.
$ head -c 10 c.dat | md5sum
f6513d74de14a6c81ef2e7c1c1de4ab1 -
$ head -c 10 d.dat | md5sum
660f54f5a7658cbf1462b2a91fbe7487 -
$ head -c 10 e.dat | md5sum
f6513d74de14a6c81ef2e7c1c1de4ab1 -
This way we know that d.dat is unique!
In this way, it may be faster to calculate and compare the first Xbytes first before calculating the whole file.
For example, if c.dat and d.dat are 10Gbyte files, you need to read 20Gbyte to calculate the hash.
If there is a difference within the first 10 bytes, the data to be read will be only 20 bytes, which is much faster.
fdupes reads the first 4096 bytes
If there is a difference within 4096 bytes, the process will proceed at high speed.
However, if they are the same size and the first 100kbytes or so are the same, there is a high probability that they are the same.
I didn't want it to be that accurate, but I wanted it to display quickly, so I made it myself.
There seems to be nothing in fdupes that terminates processing at the first X bytes.
https://github.com/adrianlopezroche/fdupes/issues/79
jdupes has an option (-TT option) to determine duplicate files by the first 4096 bytes, which fdupes cannot do.
However, in my environment, there were several files that were the same up to 4096 bytes.
Implementation
https://github.com/yaasita/chofuku
Use SQLite's in-memory mode.
First, record the name and file.
I have not considered hard links this time.
If you want to support hard links, you can record the i-node number in the i-node and give the same one without hash calculation.
symbolic links are skipped.
SELECT * FROM files;
name | size | head100k_hash | full_hash |
---|---|---|---|
a.dat | 1 | ||
b.dat | 2 | ||
c.dat | 1024 | ||
d.dat | 1024 | ||
e.dat | 1024 | ||
f.dat | 1150976 | ||
g.dat | 1150976 | ||
h.dat | 1150976 |
Count duplicates with the following SQL (I always use this query to count duplicates)
SELECT size, head100k_hash, full_hash, json_group_array(name) FROM files GROUP BY size, head100k_hash, full_hash HAVING count(*) > 1;
size | head100k_hash | full_hash | json_group_array(name) |
---|---|---|---|
1024 | ["c.dat","d.dat","e.dat"] | ||
1150976 | ["f.dat","g.dat","h.dat"] |
I got a list of files that are the same size.
If the -size-only option is specified, the process will be terminated.
The next step is to calculate the hash for files of the same size.
Files with 0 bytes will not be hashed.
name | size | head100k_hash | full_hash |
---|---|---|---|
a.dat | 1 | ||
b.dat | 2 | ||
c.dat | 1024 | 2f2bf74e24d26a2a159c4f130eec39ac | |
d.dat | 1024 | fc65c6cb47f6eed0aa6a34448a8bfcec | |
e.dat | 1024 | 2f2bf74e24d26a2a159c4f130eec39ac | |
f.dat | 1150976 | 595cd4e40357324cec2737e067d582b1 | |
g.dat | 1150976 | 595cd4e40357324cec2737e067d582b1 | |
h.dat | 1150976 | 595cd4e40357324cec2737e067d582b1 |
Counting duplicates.
SELECT size, head100k_hash, full_hash, json_group_array(name) FROM files GROUP BY size, head100k_hash, full_hash HAVING count(*) > 1;
size | head100k_hash | full_hash | json_group_array(name) |
---|---|---|---|
1024 | 2f2bf74e24d26a2a159c4f130eec39ac | ["c.dat","e.dat"] | |
1150976 | 595cd4e40357324cec2737e067d582b1 | ["f.dat","g.dat","h.dat"] |
If the -100k-only option is specified, the process will be terminated here.
For files that are still duplicates, calculate the hash value of the entire file.
If the file is less than 100Kbytes, the calculation will be skipped.
The hash value calculation part can be parallelized, but you need to make sure that it does not exceed the limit of file opening (the value of ulimit -n).
In this case, it is done sequentially.
name | size | head100k_hash | full_hash |
---|---|---|---|
a.dat | 1 | ||
b.dat | 2 | ||
c.dat | 1024 | 2f2bf74e24d26a2a159c4f130eec39ac | |
d.dat | 1024 | fc65c6cb47f6eed0aa6a34448a8bfcec | |
e.dat | 1024 | 2f2bf74e24d26a2a159c4f130eec39ac | |
f.dat | 1150976 | 595cd4e40357324cec2737e067d582b1 | ca2e51ae14747a1f1f0dcb81e982c287 |
g.dat | 1150976 | 595cd4e40357324cec2737e067d582b1 | 067d1eed705e0f7756ceb37a10462665 |
h.dat | 1150976 | 595cd4e40357324cec2737e067d582b1 | ca2e51ae14747a1f1f0dcb81e982c287 |
Count and output duplicates.
SELECT size, head100k_hash, full_hash, json_group_array(name) FROM files GROUP BY size, head100k_hash, full_hash HAVING count(*) > 1;
size | head100k_hash | full_hash | json_group_array(name) |
---|---|---|---|
1024 | 2f2bf74e24d26a2a159c4f130eec39ac | ["c.dat","e.dat"] | |
1150976 | 595cd4e40357324cec2737e067d582b1 | ca2e51ae14747a1f1f0dcb81e982c287 | ["f.dat","h.dat"] |
If you want to go through the whole file hashing process, it's much faster to use fdupes.
If you don't specify the "-100k-only" or "-size-only" option, then there is no point in using this program.
conclusion
Even though the hash calculation was censored at 100kbytes, the result was the same as that of fdupes.
However, this is only for my environment.
There may be some files in the world where the first 100 kbytes are used as a common header.
I think it's better to change the method to suit the situation.
Most of the time, fdupes are good.
You can also use a file system like the following.
ipfs can look for uniqueness across the network.
Top comments (0)