DEV Community

STrRedWolf
STrRedWolf

Posted on

2

The right hash(s) for the job?

Once again, I'm finding myself deduplicating a repository of files, and trying to do it right. Yeah, I could use a dedicated tool... but I'm not sure they're doing it correctly.

You see, they hash the file contents and try to match on that, assuming that they're going to be unique for each file. But in reality, it doesn't work that way. You're going to collide.

How do I know this? I wrote my own deduplication tool.

#/usr/bin/perl
#use Digest::FarmHash qw(farmhash64);
use File::Find;
use DBI;
use Digest::SHA qw(sha256_hex);
#use bigint;
$|=1;
# Connect...
my $dbh = DBI->connect("DBI:mysql:database=tygris:host=localhost",
"tygris","lunataur") || die "$!";
$dbh->{RaiseError} = 1;
my $prefix=`pwd`;
chomp $prefix;
# Find what's in this directory.
#find(\&hashthis, "1.Import");
find(\&hashthis, "00.Master");
# main routine for each file...
sub hashthis
{
return unless -f;
my $file = $_;
my $path = $File::Find::dir;
my @s = stat;
my $mtime = $s[9];
my $fsize = $s[7];
# Which file are we on...
my $filename=$File::Find::name;
# print $filename, "\n";
# Is it already registered?
my $id;
my @row=$dbh->selectrow_array("SELECT fileid FROM dedupe_files WHERE name=? AND path=?",
undef, $file, $path);
if($row[0]) {
$dbh->do("UPDATE dedupe_files SET mtime=? WHERE fileid=?",undef,$row[0],$mtime);
print "Skip: $filename\n";
return;
}
print $filename,"\n";
# Now hash it. Do an overall hash.
my $bighash = Digest::SHA->new(256);
open(IN,"<",$file) || die "Can't open < $file: $!";
my $buff;
while(read IN,$buff,4096*1024) # 4M chunks
{
$bighash->add($buff);
}
# Now check if we hashed an exact dupe.
# Match it by file size and hash
@row=$dbh->selectrow_array("SELECT MIN(fileid) FROM dedupe_files WHERE size=? AND hash=?",
undef, $fsize, $bighash);
unless($row[0]) { # No collision...
$dbh->do("INSERT INTO dedupe_files (name,path,mtime,size,hash) VALUES (?,?,?,?,?)",
undef, $file, $path, $mtime, $fsize, $bighash);
# print "Reg'd: $filename\n";
return;
}
# at this point, the file size and hash collided, so verify byte-by-byte.
my $srcid=$row[0];
@row=$dbh->selectrow_array("SELECT fileid, name, path FROM dedupe_files WHERE fileid=?",
undef, $srcid);
my $srcfile=$prefix."/".$row[2]."/".$row[1];
print "Likey dupe w/$srcfile\n";
system('cmp','-s',$srcfile,$file);
if($? == -1) {
die "cmp failed: $!";
} elsif($? & 127) {
die sprintf("child died w/signal %d, %s coredump",
$? & 127, ($? & 128) ? 'with' : 'without');
} else {
my $rv=$? >> 8;
if($rv) { # cmp exit(1)'s when different, so skip.
print "### cmp says no match ($rv)\n";
$dbh->do("INSERT INTO dedupe_files (name,path,mtime,size,hash) VALUES (?,?,?,?,?)",
undef, $file, $path, $mtime, $fsize, $bighash);
return;
}
}
# print "Exact match.\n";
# It's an exact match. Ugh. Toss it.
# toss($id,$filename,$path);
print "DUPE: ",$filename,"\n";
}
### Toss files away (into a dedicated trash folder. We'll nuke 'em later)
sub toss {
my $id=shift;
my $filename=shift;
my $path=shift;
my $dest='zz.Trash/'.$path;
system('mkdir','-p',$dest);
system('mv',$filename,$dest);
}
view raw scantest.pl hosted with ❤ by GitHub

Right now it's using SHA-256 to generate the hash digests, going over the entire file. It stores them in a MariaDB database and compares them to all the other files as it steps through each file. If there's a hash collision, it breaks out the venerable cmp tool and verifies that they're byte-to-byte duplicates.

The result?

In a hand-deduped archive of 69248 files, 420 were false positives.

My question now is: Is SHA-256 the right choice? Which would be better, or should I be using multiple hash algorithms to reduce that false positive number? Speed boosting here is gravy at this point.

Sentry image

See why 4M developers consider Sentry, “not bad.”

Fixing code doesn’t have to be the worst part of your day. Learn how Sentry can help.

Learn more

Top comments (1)

Collapse
 
jrw profile image
jrw

It is practically impossible to find even one SHA-256 collision, let alone 420 of them. (Search for SHA256 collision). So you must be doing something wrong. Probably truncating the results when you store them in the database. In any case, checking the hash + file size should be sufficient to eliminate any practical possibility of finding a false positive.

Sentry image

See why 4M developers consider Sentry, “not bad.”

Fixing code doesn’t have to be the worst part of your day. Learn how Sentry can help.

Learn more

👋 Kindness is contagious

Please leave a ❤️ or a friendly comment on this post if you found it helpful!

Okay