Duplicate detection looks solved: keep a hash set, skip what you have already seen. A benchmark suite of 4050 measurements across 480 workloads says the fastest strategy can be 94x faster than std::unordered_set, or 90,000x slower, depending on what you are deduplicating and what guarantees you need.
Dense integers are an array problem
When keys are dense, bounded 32-bit integers, a hash set wastes work: it hashes, probes buckets, and chases pointers. A bitset turns membership into one indexed bit. At one million uniform integers:
| strategy | ns per insert |
|---|---|
| growable bitset | 5.1 |
| sort then unique | 60 |
| roaring bitmap | 165 |
std::unordered_set |
483 |
std::set |
1154 |
The bitset is 94x faster than the hash set for the same correct answer. If your key is already an array index, do not turn it into a hashing problem.
Text keys change the cost model
For long strings, comparison and hashing dominate. Sorting with fingerprints (with full-key verification when correctness matters) can beat a hash set by 1.8x to 2.7x. For clustered duplicate strings, a hash set is excellent because recent buckets stay hot in cache.
Streaming is a forgetting problem
For unbounded streams, the question is what to remember. An in-memory sliding window costs ~68 ns/event. A PostgreSQL-backed detector with per-event transactions costs ~6.1 ms/event, a 90,000x gap for the same logical check. Batching commits closes most of it.
A practical decision table
- Dense bounded ints -> pre-sized bitset (30x to 110x faster).
- Sparse 64-bit ints -> Roaring bitmap, or sort + unique on finite batches.
- Long strings -> fingerprinted sorting, verify on match.
- Streaming, bounded memory -> in-memory sliding window.
- Streaming, durable -> RocksDB or Postgres with batched writes.
The fastest set is often not a set. It is the data structure your key space was trying to be.
For all 4050 measurements, the winner heatmaps, and the streaming benchmarks: The Shape of Duplicate Detection.
Top comments (0)