The Fastest Set Is Often Not a Set: 4050 Duplicate-Detection Benchmarks

#cpp #performance #algorithms #datastructures

Duplicate detection looks solved: keep a hash set, skip what you have already seen. A benchmark suite of 4050 measurements across 480 workloads says the fastest strategy can be 94x faster than std::unordered_set, or 90,000x slower, depending on what you are deduplicating and what guarantees you need.

Dense integers are an array problem

When keys are dense, bounded 32-bit integers, a hash set wastes work: it hashes, probes buckets, and chases pointers. A bitset turns membership into one indexed bit. At one million uniform integers:

strategy	ns per insert
growable bitset	5.1
sort then unique	60
roaring bitmap	165
`std::unordered_set`	483
`std::set`	1154

The bitset is 94x faster than the hash set for the same correct answer. If your key is already an array index, do not turn it into a hashing problem.

Text keys change the cost model

For long strings, comparison and hashing dominate. Sorting with fingerprints (with full-key verification when correctness matters) can beat a hash set by 1.8x to 2.7x. For clustered duplicate strings, a hash set is excellent because recent buckets stay hot in cache.

Streaming is a forgetting problem

For unbounded streams, the question is what to remember. An in-memory sliding window costs ~68 ns/event. A PostgreSQL-backed detector with per-event transactions costs ~6.1 ms/event, a 90,000x gap for the same logical check. Batching commits closes most of it.