DEV Community

Cover image for The Fastest Set Is Often Not a Set: 4050 Duplicate-Detection Benchmarks
kartikay dubey
kartikay dubey

Posted on • Originally published at dubeykartikay.com

The Fastest Set Is Often Not a Set: 4050 Duplicate-Detection Benchmarks

Duplicate detection looks solved: keep a hash set, skip what you have already seen. A benchmark suite of 4050 measurements across 480 workloads says the fastest strategy can be 94x faster than std::unordered_set, or 90,000x slower, depending on what you are deduplicating and what guarantees you need.

Dense integers are an array problem

When keys are dense, bounded 32-bit integers, a hash set wastes work: it hashes, probes buckets, and chases pointers. A bitset turns membership into one indexed bit. At one million uniform integers:

strategy ns per insert
growable bitset 5.1
sort then unique 60
roaring bitmap 165
std::unordered_set 483
std::set 1154

The bitset is 94x faster than the hash set for the same correct answer. If your key is already an array index, do not turn it into a hashing problem.

Text keys change the cost model

For long strings, comparison and hashing dominate. Sorting with fingerprints (with full-key verification when correctness matters) can beat a hash set by 1.8x to 2.7x. For clustered duplicate strings, a hash set is excellent because recent buckets stay hot in cache.

Streaming is a forgetting problem

For unbounded streams, the question is what to remember. An in-memory sliding window costs ~68 ns/event. A PostgreSQL-backed detector with per-event transactions costs ~6.1 ms/event, a 90,000x gap for the same logical check. Batching commits closes most of it.

A practical decision table

  • Dense bounded ints -> pre-sized bitset (30x to 110x faster).
  • Sparse 64-bit ints -> Roaring bitmap, or sort + unique on finite batches.
  • Long strings -> fingerprinted sorting, verify on match.
  • Streaming, bounded memory -> in-memory sliding window.
  • Streaming, durable -> RocksDB or Postgres with batched writes.

The fastest set is often not a set. It is the data structure your key space was trying to be.

For all 4050 measurements, the winner heatmaps, and the streaming benchmarks: The Shape of Duplicate Detection.

Top comments (0)