
Doogal Simpson

Posted on • Originally published at doogal.dev

Stop Joining Millions of Rows for Every Single Swipe

TL;DR: Dating apps avoid the architectural nightmare of joining millions of left-swipe records by using Bloom filters. By hashing user IDs into a bit array, they get a 100% guarantee that a '0' means a user is new, while accepting a rare false positive as a necessary trade-off for high-concurrency performance.

I’ve seen plenty of teams try to scale discovery feeds by throwing more hardware at SQL joins, and it is a losing game. If I have to check a user’s entire history of left swipes against a pool of millions of profiles every time they refresh their deck, the database isn't just going to be slow—it’s going to stop breathing. We aren't talking about a simple query; we are talking about a cross-reference that grows every time a user interacts with the app.

Scaling a discovery engine requires moving away from absolute row-level checks and toward probabilistic data structures. When I look at how top-tier apps handle the "have they seen this person?" problem, I don’t see massive JOIN statements. I see Bloom filters.

Why can't we just use a standard database join?

I can't use a standard join because the latency of checking billions of swipe records for every profile in a stack is unsustainable at scale. Even with perfect indexing, the I/O overhead of a massive `NOT IN` subquery or `LEFT JOIN ... IS NULL` anti-join on that volume will kill the request-response cycle and frustrate the user.

Think about the math. If I have 50 million users and each user has swiped on a few thousand people, the swipes table holds on the order of 100 billion rows. If I try to run a query to "show me 10 people this user hasn't swiped on," the engine has to scan or index-hop through that mountain of data. By the time the database returns a result, the user has already closed the app. I need a way to filter those candidates without touching the disk for every individual check.

How does a Bloom filter handle membership tests?

I use a Bloom filter to treat membership as a bitmask operation rather than a record lookup. It consists of a fixed-size array of bits, all starting at zero, which I flip to one based on the output of multiple hash functions.

When a user swipes left on someone, I don't just log the event; I run that profile’s ID through a set of hash functions. If I'm using two functions, each output, taken modulo the array length, gives an index, say 1 and 3. I go to my bit array, find those indices, and set them to one. That profile is now "recorded."

Later, when I'm deciding whether to show that same profile again, I run the ID through those same two hashes. If I get back 1 and 3, and both are already set to "1," I assume the user has seen them. But if I check a different profile and the hashes return 3 and 5, and index 5 is still a "0," I know for a fact—with 100% mathematical certainty—that this user has never seen this profile. I show it to them immediately.
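That flip-and-check cycle fits in a few lines of Python. This is a minimal sketch, not any particular app's implementation; the `BloomFilter` class, the salted-SHA-256 hashing, and the sizes are all illustrative choices:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: a fixed bit array plus k salted hash functions."""

    def __init__(self, size_bits: int, num_hashes: int):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)  # all zeros to start

    def _indexes(self, item: str):
        # Derive k "independent" hash functions by salting one strong hash.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for i in self._indexes(item):
            self.bits[i // 8] |= 1 << (i % 8)  # flip the bit to one

    def might_contain(self, item: str) -> bool:
        # Any zero bit means "definitely never added"; all ones means "probably added".
        return all(self.bits[i // 8] & (1 << (i % 8)) for i in self._indexes(item))


seen = BloomFilter(size_bits=10_000, num_hashes=2)
seen.add("profile:12345")                    # user swiped left on profile 12345
print(seen.might_contain("profile:12345"))   # True: both bits are set, so "probably seen"
print(seen.might_contain("profile:99999"))   # False means definitely new; True would be a rare collision
```

Note that `might_contain` never answers "no" incorrectly: a zero bit can only mean the ID was never added, which is exactly the 100% negative guarantee described above.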

The Engineering Trade-offs of Probabilistic Filtering

| Factor | Bloom Filter Approach |
| --- | --- |
| Storage | Bit-level storage; constant footprint. |
| Speed | Constant-time (O(k)) lookups, where k is the number of hash functions. |
| Accuracy | 100% accurate negatives; probabilistic positives. |
| Cost | Negligible CPU/memory overhead. |

What happens when the filter gives a false positive?

I accept the false positive as a necessary trade-off: the app simply skips showing a specific profile because it incorrectly thinks the user has already swiped on it. In a discovery engine with millions of candidates, missing one potential match is an invisible error that saves the entire system from a performance collapse.

If I used a standard Hash Set to keep track of every swipe, the memory usage would balloon until I was spending a fortune on RAM just to keep the feed functional. With a Bloom filter, I keep the memory footprint predictable. If the filter tells me all the bits are "1," but it’s actually a collision and the user hasn't seen that person, I just move to the next candidate. The user doesn't care about the one person they didn't see; they care that the app didn't hang for five seconds while loading the next card.
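The "just move to the next candidate" behavior is easy to see in a deck builder. Here is a compact sketch that holds the whole filter in a single Python integer and uses salted CRC32 as the two hash functions; the sizing and all profile IDs are hypothetical:

```python
import zlib

BITS = 1 << 16   # 65,536-bit filter held in one Python int (hypothetical sizing)
seen_bits = 0    # the per-user "already swiped" filter

def _mask(profile_id: str) -> int:
    # Two salted CRC32 hashes pick two bit positions for this profile.
    mask = 0
    for salt in ("a", "b"):
        mask |= 1 << (zlib.crc32(f"{salt}:{profile_id}".encode()) % BITS)
    return mask

def record_left_swipe(profile_id: str) -> None:
    global seen_bits
    seen_bits |= _mask(profile_id)

def probably_seen(profile_id: str) -> bool:
    m = _mask(profile_id)
    return seen_bits & m == m

record_left_swipe("profile:202")
candidates = ["profile:101", "profile:202", "profile:303"]
# A false positive here just drops one extra candidate; the user never notices.
deck = [c for c in candidates if not probably_seen(c)]
print(deck)
```

The filter-as-integer trick works because Python integers are arbitrary-precision; a production system would more likely keep the bit array in something like Redis, but the skip-on-collision logic is identical.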

FAQ

Can I remove a swipe from a Bloom filter?
No. In a standard Bloom filter, I can't flip a bit back to zero because I don't know which other IDs also hashed to that same bit. If I need to support "undoing" a swipe, I’d have to use a more complex structure like a Counting Bloom Filter or just rebuild the filter from the source of truth occasionally.
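A Counting Bloom Filter makes removal possible by storing a small counter per slot instead of a single bit. This is a rough sketch of the idea, assuming salted CRC32 hashing and plain integer counters rather than any real library's layout:

```python
import zlib

class CountingBloomFilter:
    """Counters instead of bits, so an 'undo' can decrement (a sketch, not production code)."""

    def __init__(self, size: int, num_hashes: int = 2):
        self.size = size
        self.k = num_hashes
        self.counts = [0] * size

    def _indexes(self, item: str):
        return [zlib.crc32(f"{salt}:{item}".encode()) % self.size
                for salt in range(self.k)]

    def add(self, item: str) -> None:
        for i in self._indexes(item):
            self.counts[i] += 1

    def remove(self, item: str) -> None:
        # Only safe for items that were actually added; otherwise counters drift.
        for i in self._indexes(item):
            self.counts[i] -= 1

    def might_contain(self, item: str) -> bool:
        return all(self.counts[i] > 0 for i in self._indexes(item))


cbf = CountingBloomFilter(size=8192)
cbf.add("profile:42")
cbf.remove("profile:42")            # the "undo swipe" case a plain Bloom filter can't handle
print(cbf.might_contain("profile:42"))  # False: the counters returned to zero
```

The price is several bits of counter per slot instead of one bit, which is why many systems prefer the simpler option above: keep a plain Bloom filter and rebuild it from the database when undo support matters.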

How do I decide the size of the bit array?
It’s a balance between memory and the error rate. If I make the array too small, it fills up with ones too quickly and starts giving me false positives for everyone. I calculate the size based on how many swipes I expect a user to perform over the lifetime of their account to keep the collision rate under a certain threshold.

Does this replace my primary database?
Absolutely not. I still store the actual swipe data in a persistent database like Postgres or Cassandra for long-term records and analytics. The Bloom filter is a performance layer I use to make real-time decisions in the discovery feed without hitting the heavy data store every single time.
