When your system grows to millions of users, even the simplest operations—like checking if a phone number or email already exists—can become costly.
Yes, you can add database indexes, but every lookup still eats up I/O and CPU cycles. Under heavy signup traffic, this can quickly overwhelm your database.
This is where Bloom Filters come to the rescue. 🌸
🌱 What is a Bloom Filter?
A Bloom Filter is a probabilistic data structure used for set membership checks. It allows us to ask:
👉 “Does this value possibly exist?”
It can say:
- ❌ Definitely Not → Safe to skip DB.
- ✅ Might Exist → Confirm with DB.
This tiny compromise (allowing false positives, but never false negatives) gives us O(1) lookups with very little memory usage.
🔬 Anatomy of a Bloom Filter (Abscissa)
At its core, a Bloom filter is just:
-
A Bit Array (size
m
) → starts with all 0’s. -
k Hash Functions → each maps an input to one of the
m
positions.
👉 When we add an element:
- Run it through all
k
hash functions. - Flip those positions in the bit array to
1
.
👉 When we check an element:
- Run it through the same
k
functions. - If all those positions are
1
→ the element might exist. - If any position is
0
→ it definitely does not exist.
📈 Visual Abscissa
Think of it as a number line (abscissa = x-axis):
Bit Array (size m)
0 1 2 3 4 5 6 7 8 9 ... m-1
[0][0][0][0][0][0][0][0][0][0]
- Each hash function picks some positions along this line.
- Adding
"alice@example.com"
might flip positions 3, 7, and 9. - Checking
"bob@example.com"
? If one of its hash positions is still0
, we know Bob isn’t in the set.
⚖️ Balancing Act
- More bits (m) → fewer collisions, lower false positives.
- More hash functions (k) → more accuracy, but also more computation.
- The sweet spot depends on expected number of elements
n
.
Formula for optimal k
:
k=n/m(ln2)
This balance is why Bloom filters are tiny in memory yet mighty in scale.
🏗️ Our Architecture
We built a Bloom Filter Service in Node.js that acts as a fast gatekeeper before the database.
It consists of:
- Routes Layer → API endpoints for clients.
- Handler Layer → Processes requests, interacts with the service.
- Service Layer → Manages Bloom filters, population, refresh, and lookups.
📜 Routes Layer
We expose three endpoints under /bloom_filter
:
import express from 'express';
import { getBloomFilterStatus, refreshBloomFilter, checkPhoneExists } from './handler.js';
const router = express.Router();
router.get('/status', getBloomFilterStatus);
router.post('/refresh', refreshBloomFilter);
router.get('/check', checkPhoneExists);
export default router;
-
GET /status
→ Monitor filters. -
POST /refresh
→ Force a refresh. -
GET /check?phoneNumber=...
→ Check existence.
⚙️ Handler Layer
The handlers sit between API requests and the service. They manage errors and responses:
import userSearchBloomFilter from '../../services/userSearchBloomFilter.js';
import { generateInternalServerErrorRepsonse } from '../../utils/errorHandler.js';
export async function getBloomFilterStatus(req, res) {
try {
const status = userSearchBloomFilter.getStatus();
return res.status(200).json({ success: true, data: status });
} catch (error) {
const errorResponse = await generateInternalServerErrorRepsonse(error, 'getBloomFilterStatus');
return res.status(500).json(errorResponse);
}
}
export async function refreshBloomFilter(req, res) {
try {
await userSearchBloomFilter.refresh();
return res.status(200).json({
success: true,
message: 'Bloom filter refreshed successfully'
});
} catch (error) {
const errorResponse = await generateInternalServerErrorRepsonse(error, 'refreshBloomFilter');
return res.status(500).json(errorResponse);
}
}
export async function checkPhoneExists(req, res) {
try {
const { phoneNumber } = req.query;
if (!phoneNumber) {
return res.status(400).json({
success: false,
error: 'Phone number is required'
});
}
const mightExist = userSearchBloomFilter.mightExist(phoneNumber);
return res.status(200).json({
success: true,
data: {
phoneNumber,
mightExist,
note: mightExist
? 'Might exist - check database'
: 'Definitely does not exist'
}
});
} catch (error) {
const errorResponse = await generateInternalServerErrorRepsonse(error, 'checkPhoneExists');
return res.status(500).json(errorResponse);
}
}
👉 Notice how checkPhoneExists
does not immediately hit the DB. It asks the Bloom filter first.
🧠 Service Layer: Core Bloom Filter Logic
This is where the real magic happens.
We maintain four Bloom filters:
emailFilter
phoneFilter
alternateEmailFilter
alternatePhoneFilter
Each filter is initialized with a target capacity and error rate (0.01
= 1%).
import BloomFilter from '../utils/BloomFilter.js';
import User from '../models/user.js';
import logger from '../setup/logger.js';
class UserSearchBloomFilterService {
constructor() {
this.emailFilter = new BloomFilter(100000, 0.01);
this.phoneFilter = new BloomFilter(100000, 0.01);
this.alternateEmailFilter = new BloomFilter(50000, 0.01);
this.alternatePhoneFilter = new BloomFilter(50000, 0.01);
this.isInitialized = false;
this.lastUpdated = null;
this.updateInterval = 24 * 60 * 60 * 1000; // 24 hours
}
🔄 Populating the Filters
On startup, we fetch users in batches and add their identifiers into the filters:
async populateFilters() {
const batchSize = 1000;
let offset = 0;
let hasMoreUsers = true;
while (hasMoreUsers) {
const users = await User.query(qb => {
qb.select('email', 'phone_number', 'alternate_email', 'alternate_phone');
qb.whereNotNull('email').orWhereNotNull('phone_number');
qb.limit(batchSize);
qb.offset(offset);
}).fetchAll();
const userList = users.toJSON();
if (userList.length === 0) {
hasMoreUsers = false;
break;
}
userList.forEach(user => {
if (user.email) this.emailFilter.add(user.email);
if (user.phone_number) this.phoneFilter.add(user.phone_number);
if (user.alternate_email) this.alternateEmailFilter.add(user.alternate_email);
if (user.alternate_phone) this.alternatePhoneFilter.add(user.alternate_phone);
});
offset += batchSize;
logger.info(`Processed ${offset} users for bloom filter`);
}
logger.info(`Bloom filter population completed. Total users processed: ${offset}`);
}
This ensures all existing users are represented in the filter.
⚡ Lookup Logic
When a phone/email check request arrives:
mightExist(searchKey) {
if (!this.isInitialized) {
return true; // fail-safe until initialized
}
const normalizedKey = searchKey.toLowerCase().trim();
return (
this.emailFilter.mightContain(normalizedKey) ||
this.phoneFilter.mightContain(normalizedKey) ||
this.alternateEmailFilter.mightContain(normalizedKey) ||
this.alternatePhoneFilter.mightContain(normalizedKey)
);
}
👉 If it returns false
, we know for sure the user doesn’t exist.
👉 If it returns true
, we query the DB to confirm.
🕒 Auto-Refreshing
To stay fresh with new data, we schedule a 24-hour refresh:
schedulePeriodicUpdate() {
setInterval(async () => {
try {
logger.info('Starting scheduled bloom filter update');
await this.refresh();
} catch (error) {
logger.error('Scheduled bloom filter update failed:', error);
}
}, this.updateInterval);
}
This clears and repopulates the filters.
📊 Status Reporting
Finally, we can inspect filter health:
getStatus() {
return {
isInitialized: this.isInitialized,
lastUpdated: this.lastUpdated,
emailFilterStats: this.emailFilter.getStats(),
phoneFilterStats: this.phoneFilter.getStats(),
alternateEmailFilterStats: this.alternateEmailFilter.getStats(),
alternatePhoneFilterStats: this.alternatePhoneFilter.getStats()
};
}
🚀 Example Flow
A signup form checks if 1231881971
exists:
- Client calls →
GET /bloom_filter/check?phoneNumber=<phoneNumber>
- Bloom filter says:
- ❌ Not in set → Return immediately (skip DB).
- ✅ Might exist → Query DB to confirm.
This cuts DB load massively.
✅ Benefits
- O(1) Lookups → Super fast.
- Reduced DB Load → Fewer queries.
- Scalable → Handles millions of entries.
- Fault-Tolerant → Always errs on the safe side (false positives only).
⚠️ Limitations
- False Positives → Might say “exists” when it doesn’t.
- No Deletion → Standard Bloom filters don’t support removing entries.
- Cold Start → Until initialized, returns “might exist” to avoid false negatives.
🌟 Final Thoughts
Bloom Filters are not a database replacement—they’re an optimization layer.
Think of them as a bouncer at the club entrance:
- If you’re definitely not on the list, you’re turned away instantly.
- If you might be on the list, they’ll double-check inside.
For high-traffic systems, this simple gatekeeping can make a world of difference.
Top comments (2)
Some rough sketches to ideate the flow
How is this more beneficial then a redis set cache layer....