Abhinav

Posted on Sep 17

⚡ Scaling User Search with Bloom Filters in Node.js

When your system grows to millions of users, even the simplest operations—like checking if a phone number or email already exists—can become costly.

Yes, you can add database indexes, but every lookup still eats up I/O and CPU cycles. Under heavy signup traffic, this can quickly overwhelm your database.

This is where Bloom Filters come to the rescue. 🌸

🌱 What is a Bloom Filter?

A Bloom Filter is a probabilistic data structure used for set membership checks. It allows us to ask:

👉 “Does this value possibly exist?”

It can say:

❌ Definitely Not → Safe to skip DB.
✅ Might Exist → Confirm with DB.

This small compromise (false positives are possible, false negatives are not) gives us lookups that cost only k hash computations, no matter how many items are stored, with very little memory usage.

🔬 Anatomy of a Bloom Filter (Abscissa)

At its core, a Bloom filter is just:

A Bit Array (size m) → starts with all 0’s.
k Hash Functions → each maps an input to one of the m positions.

👉 When we add an element:

Run it through all k hash functions.
Flip those positions in the bit array to 1.

👉 When we check an element:

Run it through the same k functions.
If all those positions are 1 → the element might exist.
If any position is 0 → it definitely does not exist.
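
The add and check steps above can be sketched in a few lines. This is a minimal illustration, not the BloomFilter utility used later in this post: the class name SimpleBloomFilter, the FNV-1a hash, and the double-hashing trick for deriving k positions from two hashes are all assumptions chosen to keep the example dependency-free.

```javascript
// Minimal Bloom filter sketch: a bit array of m bits plus k derived hash positions.
class SimpleBloomFilter {
    constructor(m, k) {
        this.m = m; // number of bits
        this.k = k; // number of hash positions per item
        this.bits = new Uint8Array(Math.ceil(m / 8)); // all bits start at 0
    }

    // 32-bit FNV-1a hash; the seed changes the starting state to get a second hash.
    static fnv1a(value, seed) {
        let hash = (0x811c9dc5 ^ seed) >>> 0;
        for (let i = 0; i < value.length; i++) {
            hash ^= value.charCodeAt(i);
            hash = Math.imul(hash, 0x01000193) >>> 0;
        }
        return hash;
    }

    // Double hashing: position_i = (h1 + i * h2) mod m stands in for k hash functions.
    positions(value) {
        const h1 = SimpleBloomFilter.fnv1a(value, 0);
        const h2 = (SimpleBloomFilter.fnv1a(value, 0x9e3779b9) | 1) >>> 0; // forced odd, never 0
        const result = [];
        for (let i = 0; i < this.k; i++) {
            result.push((h1 + i * h2) % this.m);
        }
        return result;
    }

    // Set every position for this value to 1.
    add(value) {
        for (const pos of this.positions(value)) {
            this.bits[pos >> 3] |= 1 << (pos & 7);
        }
    }

    // True only if every position is 1; a single 0 proves the value was never added.
    mightContain(value) {
        return this.positions(value).every(
            pos => (this.bits[pos >> 3] & (1 << (pos & 7))) !== 0
        );
    }
}

const filter = new SimpleBloomFilter(1024, 3);
filter.add('alice@example.com');
console.log(filter.mightContain('alice@example.com')); // true: added values are never missed
console.log(filter.mightContain('bob@example.com'));   // very likely false; true would be a false positive
```

Packing the bits into a Uint8Array keeps memory at roughly m / 8 bytes, which is where the small footprint comes from.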

📈 Visual Abscissa

Think of it as a number line (abscissa = x-axis):

Bit Array (size m)
 0  1  2  3  4  5  6  7  8  9  ... m-1
[0][0][0][0][0][0][0][0][0][0] ... [0]

Each hash function picks some positions along this line.
Adding "alice@example.com" might flip positions 3, 7, and 9.
Checking "bob@example.com"? If one of its hash positions is still 0, we know Bob isn’t in the set.

⚖️ Balancing Act

More bits (m) → fewer collisions, lower false positives.
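More hash functions (k) → fewer false positives up to a point. Past the optimum, each item sets so many bits that the array fills up faster and false positives rise again, and every lookup costs more computation.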
The sweet spot depends on expected number of elements n.

Formula for the optimal k, given m bits and n expected elements:

$k = \frac{m}{n} \ln 2$

This balance is why Bloom filters are tiny in memory yet mighty in scale.
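
To make the trade-off concrete, here is a small sizing sketch. It uses the standard Bloom filter formulas m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2; the function name bloomParameters is hypothetical, and the inputs simply mirror the capacity and error rate used for the filters later in this post.

```javascript
// Standard sizing formulas for a Bloom filter holding n items at false-positive rate p:
//   m = -n * ln(p) / (ln 2)^2   (number of bits)
//   k = (m / n) * ln 2          (number of hash functions)
function bloomParameters(n, p) {
    const m = Math.ceil((-n * Math.log(p)) / (Math.LN2 ** 2));
    const k = Math.max(1, Math.round((m / n) * Math.LN2));
    return { m, k };
}

const { m, k } = bloomParameters(100000, 0.01);
console.log(m, k); // 958506 7
```

That is about 117 KB of bits for 100,000 entries at a 1% false-positive rate, or just under 10 bits per entry.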

🏗️ Our Architecture

We built a Bloom Filter Service in Node.js that acts as a fast gatekeeper before the database.

It consists of:

Routes Layer → API endpoints for clients.
Handler Layer → Processes requests, interacts with the service.
Service Layer → Manages Bloom filters, population, refresh, and lookups.

📜 Routes Layer

We expose three endpoints under /bloom_filter:

import express from 'express';
import { getBloomFilterStatus, refreshBloomFilter, checkPhoneExists } from './handler.js';

const router = express.Router();

router.get('/status', getBloomFilterStatus);
router.post('/refresh', refreshBloomFilter);
router.get('/check', checkPhoneExists);

export default router;

GET /status → Monitor filters.
POST /refresh → Force a refresh.
GET /check?phoneNumber=... → Check existence.

⚙️ Handler Layer

The handlers sit between API requests and the service. They manage errors and responses:

import userSearchBloomFilter from '../../services/userSearchBloomFilter.js';
import { generateInternalServerErrorRepsonse } from '../../utils/errorHandler.js';

export async function getBloomFilterStatus(req, res) {
    try {
        const status = userSearchBloomFilter.getStatus();
        return res.status(200).json({ success: true, data: status });
    } catch (error) {
        const errorResponse = await generateInternalServerErrorRepsonse(error, 'getBloomFilterStatus');
        return res.status(500).json(errorResponse);
    }
}

export async function refreshBloomFilter(req, res) {
    try {
        await userSearchBloomFilter.refresh();
        return res.status(200).json({
            success: true,
            message: 'Bloom filter refreshed successfully'
        });
    } catch (error) {
        const errorResponse = await generateInternalServerErrorRepsonse(error, 'refreshBloomFilter');
        return res.status(500).json(errorResponse);
    }
}

export async function checkPhoneExists(req, res) {
    try {
        const { phoneNumber } = req.query;

        if (!phoneNumber) {
            return res.status(400).json({
                success: false,
                error: 'Phone number is required'
            });
        }

        const mightExist = userSearchBloomFilter.mightExist(phoneNumber);

        return res.status(200).json({
            success: true,
            data: {
                phoneNumber,
                mightExist,
                note: mightExist
                    ? 'Might exist - check database'
                    : 'Definitely does not exist'
            }
        });
    } catch (error) {
        const errorResponse = await generateInternalServerErrorRepsonse(error, 'checkPhoneExists');
        return res.status(500).json(errorResponse);
    }
}

👉 Notice that checkPhoneExists never hits the DB itself. It only asks the Bloom filter, and the caller queries the DB when the answer is "might exist".

🧠 Service Layer: Core Bloom Filter Logic

This is where the real magic happens.

We maintain four Bloom filters:

emailFilter
phoneFilter
alternateEmailFilter
alternatePhoneFilter

Each filter is initialized with a target capacity and error rate (0.01 = 1%).

import BloomFilter from '../utils/BloomFilter.js';
import User from '../models/user.js';
import logger from '../setup/logger.js';

class UserSearchBloomFilterService {
    constructor() {
        this.emailFilter = new BloomFilter(100000, 0.01);
        this.phoneFilter = new BloomFilter(100000, 0.01);
        this.alternateEmailFilter = new BloomFilter(50000, 0.01);
        this.alternatePhoneFilter = new BloomFilter(50000, 0.01);
        this.isInitialized = false;
        this.lastUpdated = null;
        this.updateInterval = 24 * 60 * 60 * 1000; // 24 hours
    }

🔄 Populating the Filters

On startup, we fetch users in batches and add their identifiers into the filters:

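async populateFilters() {
    const batchSize = 1000;
    let offset = 0;
    let totalProcessed = 0;
    // Normalize the same way mightExist() does, so a stored "Alice@Example.com"
    // still matches a lowercased, trimmed lookup key.
    const normalize = value => String(value).toLowerCase().trim();

    while (true) {
        const users = await User.query(qb => {
            qb.select('email', 'phone_number', 'alternate_email', 'alternate_phone');
            qb.whereNotNull('email').orWhereNotNull('phone_number');
            // Offset pagination needs a stable order to avoid skipped or repeated rows.
            // Assumption: the users table has an `id` primary key.
            qb.orderBy('id');
            qb.limit(batchSize);
            qb.offset(offset);
        }).fetchAll();

        const userList = users.toJSON();
        if (userList.length === 0) break; // no more rows

        userList.forEach(user => {
            if (user.email) this.emailFilter.add(normalize(user.email));
            if (user.phone_number) this.phoneFilter.add(normalize(user.phone_number));
            if (user.alternate_email) this.alternateEmailFilter.add(normalize(user.alternate_email));
            if (user.alternate_phone) this.alternatePhoneFilter.add(normalize(user.alternate_phone));
        });

        offset += batchSize;
        totalProcessed += userList.length; // count rows actually read, not the offset
        logger.info(`Processed ${totalProcessed} users for bloom filter`);
    }

    logger.info(`Bloom filter population completed. Total users processed: ${totalProcessed}`);
}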

This ensures every existing user with an email or phone number is represented in the filters.

⚡ Lookup Logic

When a phone/email check request arrives:

mightExist(searchKey) {
    if (!this.isInitialized) {
        return true; // fail-safe until initialized
    }

    const normalizedKey = searchKey.toLowerCase().trim();

    return (
        this.emailFilter.mightContain(normalizedKey) ||
        this.phoneFilter.mightContain(normalizedKey) ||
        this.alternateEmailFilter.mightContain(normalizedKey) ||
        this.alternatePhoneFilter.mightContain(normalizedKey)
    );
}

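👉 If it returns false, the key was not in the data the filters were built from, so the user doesn’t exist as of the last refresh. Users who signed up since then are only covered if their identifiers are also added to the filters at signup.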
👉 If it returns true, we query the DB to confirm.
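
The two outcomes above can be combined into one lookup path. The sketch below is illustrative only: identifierExists, the fake service, and the fake database function are hypothetical stand-ins, with one key hard-coded to behave like a false positive.

```javascript
// Hypothetical helper: consult the Bloom filter first and fall back to the
// database only when the filter cannot rule the key out.
async function identifierExists(key, bloomService, findUserInDb) {
    if (!bloomService.mightExist(key)) {
        return false; // definite miss: no database round trip
    }
    // Possible hit: either a real user or a false positive, so confirm.
    const user = await findUserInDb(key);
    return user !== null;
}

async function demo() {
    // Stand-ins for the real service and database query, for illustration only.
    const knownUsers = new Set(['1231881971']);
    let dbCalls = 0;
    const fakeBloomService = {
        // '5550000000' simulates a false positive from the filter.
        mightExist: key => knownUsers.has(key) || key === '5550000000'
    };
    const fakeFindUserInDb = async key => {
        dbCalls += 1;
        return knownUsers.has(key) ? { phone_number: key } : null;
    };

    const a = await identifierExists('1231881971', fakeBloomService, fakeFindUserInDb); // real user
    const b = await identifierExists('5550000000', fakeBloomService, fakeFindUserInDb); // false positive
    const c = await identifierExists('9990000000', fakeBloomService, fakeFindUserInDb); // definite miss
    return { a, b, c, dbCalls };
}

demo().then(result => console.log(result));
// { a: true, b: false, c: false, dbCalls: 2 }
```

Only the two "might exist" answers reach the database. The definite miss costs nothing beyond the filter check.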

🕒 Auto-Refreshing

To stay fresh with new data, we schedule a 24-hour refresh:

schedulePeriodicUpdate() {
    setInterval(async () => {
        try {
            logger.info('Starting scheduled bloom filter update');
            await this.refresh();
        } catch (error) {
            logger.error('Scheduled bloom filter update failed:', error);
        }
    }, this.updateInterval);
}

This clears and repopulates the filters.
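
The refresh method itself is not shown in this post. One detail worth considering: if the live filters are cleared in place, lookups that arrive during repopulation can see a half-empty filter and wrongly answer "definitely not". A possible way around that, sketched below under assumptions, is to build the replacement on the side and swap it in with a single assignment. The names RefreshableFilter, StandInFilter, and loadIdentifiers are hypothetical, and a Set stands in for the real Bloom filter to keep the sketch short.

```javascript
// Stand-in with the same add/mightContain shape as a Bloom filter.
// (A Set never gives false positives; it is used here only for brevity.)
class StandInFilter {
    constructor() { this.items = new Set(); }
    add(value) { this.items.add(value); }
    mightContain(value) { return this.items.has(value); }
}

class RefreshableFilter {
    constructor(loadIdentifiers) {
        this.loadIdentifiers = loadIdentifiers; // hypothetical async loader
        this.active = new StandInFilter();
        this.lastUpdated = null;
    }

    async refresh() {
        // Build the replacement off to the side; lookups keep using this.active.
        const next = new StandInFilter();
        for (const id of await this.loadIdentifiers()) {
            next.add(id);
        }
        // Swap in one synchronous assignment so no lookup sees a half-filled filter.
        this.active = next;
        this.lastUpdated = new Date();
    }

    mightExist(key) {
        return this.active.mightContain(key);
    }
}

async function demoRefresh() {
    let rows = ['alice@example.com'];
    const filters = new RefreshableFilter(async () => rows);
    await filters.refresh();

    rows = ['alice@example.com', 'bob@example.com'];
    const pending = filters.refresh(); // refresh in flight
    const duringRefresh = filters.mightExist('alice@example.com'); // served by the old filter
    await pending;
    return { duringRefresh, afterRefresh: filters.mightExist('bob@example.com') };
}

demoRefresh().then(result => console.log(result));
// { duringRefresh: true, afterRefresh: true }
```

The cost is holding two copies of the filter in memory for the duration of a refresh.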

📊 Status Reporting

Finally, we can inspect filter health:

getStatus() {
    return {
        isInitialized: this.isInitialized,
        lastUpdated: this.lastUpdated,
        emailFilterStats: this.emailFilter.getStats(),
        phoneFilterStats: this.phoneFilter.getStats(),
        alternateEmailFilterStats: this.alternateEmailFilter.getStats(),
        alternatePhoneFilterStats: this.alternatePhoneFilter.getStats()
    };
}

🚀 Example Flow

A signup form checks if 1231881971 exists:

Client calls → GET /bloom_filter/check?phoneNumber=<phoneNumber>
Bloom filter says:

❌ Not in set → Return immediately (skip DB).
✅ Might exist → Query DB to confirm.

Most signup attempts use identifiers that are not registered yet, so most checks end at the filter and never reach the DB.

✅ Benefits

Constant-Time Lookups → Each check costs k hash computations, independent of the number of users.
Reduced DB Load → Fewer queries.
Scalable → Handles millions of entries.
Fault-Tolerant → Always errs on the safe side (false positives only).

⚠️ Limitations

False Positives → Might say “exists” when it doesn’t.
No Deletion → Standard Bloom filters don’t support removing entries.
Cold Start → Until initialized, returns “might exist” to avoid false negatives.
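
One consequence of the false-positive limitation is easy to miss: mightExist ORs four filters together, so a key that is in none of them gets four chances to hit a false positive. A rough estimate, assuming the filters behave independently and each sits at its configured 1% rate (the function name is hypothetical):

```javascript
// Probability that at least one of `filterCount` independent filters, each with
// false-positive rate p, reports a false positive for a key that is in none of them.
function combinedFalsePositiveRate(p, filterCount) {
    return 1 - (1 - p) ** filterCount;
}

console.log(combinedFalsePositiveRate(0.01, 4).toFixed(4)); // 0.0394
```

So roughly 4% of lookups for unknown keys would still go to the database, not 1%. Checking emails only against the email filters and phone numbers only against the phone filters would bring that back down.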