Once again, I'm finding myself deduplicating a repository of files, and trying to do it right. Yeah, I could use a dedicated tool... but I'm not sure they're doing it correctly.
You see, they hash the file contents and match on the digest, assuming each file's hash will be unique. In practice, it doesn't work that way: you're going to collide.
How do I know this? I wrote my own deduplication tool.
Right now it's using SHA-256 to generate the hash digests, hashing each file's entire contents. It stores the digests in a MariaDB database and, as it steps through each file, compares its digest against those of all the files seen so far. If there's a hash collision, it breaks out the venerable cmp tool and verifies that the files are byte-for-byte duplicates.
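For readers who want to see the shape of that workflow, here's a minimal sketch in Python. It's an assumption-laden stand-in, not my actual tool: an in-memory dict replaces the MariaDB table, and Python's `filecmp.cmp` (with `shallow=False`) plays the role of `cmp` for the byte-for-byte check.

```python
import filecmp
import hashlib
from pathlib import Path

def sha256_file(path, chunk_size=1 << 20):
    """Hash a file's full contents with SHA-256, reading in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def find_duplicates(root):
    """Group files by digest, then confirm each match byte-for-byte."""
    seen = {}  # digest -> first file seen with that digest (stand-in for the DB table)
    confirmed, false_positives = [], []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        digest = sha256_file(path)
        if digest in seen:
            # Digests matched: verify the bytes, like breaking out cmp.
            if filecmp.cmp(seen[digest], path, shallow=False):
                confirmed.append((seen[digest], path))
            else:
                false_positives.append((seen[digest], path))
        else:
            seen[digest] = path
    return confirmed, false_positives
```

The chunked read keeps memory flat regardless of file size, and the byte-for-byte verification step is what catches any pair the hash wrongly lumps together.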
In a hand-deduplicated archive of 69,248 files, 420 of the hash matches turned out to be false positives.
My question now is: Is SHA-256 the right choice? Which would be better, or should I be using multiple hash algorithms to reduce that false positive number? Speed boosting here is gravy at this point.