Bhupesh Chikara

Posted on • Originally published at github.com

How to Check 10 Million Usernames in Under 1 Millisecond

Every second, platforms like GitHub, Instagram, and Twitter process thousands of username availability checks. It is a seemingly simple operation that becomes a critical performance bottleneck at scale. I built a production-grade proof-of-concept to measure exactly how different architectures handle this challenge.

GitHub Username Validation Interface - Shows real-time username availability checking during signup

The Challenge

When a user types "john_doe" during signup, the system must instantly verify if that username exists among millions of registered users. At 1000 requests per second, this translates to 86.4 million database queries per day.

Traditional approaches create two fundamental problems:

Performance Degradation: Each username check requires a database roundtrip, introducing 10-50ms latency per request.

Infrastructure Costs: Database CPU and I/O costs scale linearly with request volume, so the bill grows in lockstep with traffic.

Three Architectural Approaches

I tested three production-viable architectures against a dataset of 10 million usernames under sustained load of 1000 requests/second.

Architecture 1: PostgreSQL Direct Query

// Every request hits PostgreSQL
const result = await db.query(
  'SELECT EXISTS(SELECT 1 FROM usernames WHERE username = $1)',
  [username]
);

Characteristics:

  • 100% accuracy guaranteed
  • Network + query latency on every request
  • Database becomes the bottleneck
  • Simple to implement and maintain

Architecture 2: Redis In-Memory Cache

// Query Redis SET
const exists = await redis.sIsMember('usernames', username);

Characteristics:

  • 100% accuracy with proper synchronization
  • Sub-5ms latency
  • Requires ~500MB memory for 10M usernames
  • Zero database queries after initial load
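Populating the Redis SET at startup is the main operational step in this architecture. Below is a minimal sketch of a batched loader; the `chunk` helper is illustrative, and the node-redis v4 `sAdd` call is an assumption rather than code from the repo:

```javascript
// Split an array into fixed-size batches so the initial load issues one
// SADD round trip per batch instead of one per username.
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Hypothetical startup loader (assumes a connected node-redis v4 client):
// for (const batch of chunk(allUsernames, 10_000)) {
//   await redis.sAdd('usernames', batch); // sAdd accepts an array of members
// }

console.log(chunk(['ann', 'bob', 'cam'], 2).length); // 2 batches
```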

Architecture 3: Bloom Filter + Database Fallback

// Check Bloom filter first
const mightExist = bloomFilter.has(username);

if (!mightExist) {
  // Definitely NOT exists - return immediately
  return { available: true };
}

// Maybe exists - verify with database
const actuallyExists = await db.query('SELECT EXISTS...');
return { available: !actuallyExists };

Characteristics:

  • In-process checks eliminate network overhead
  • 95% of requests return instantly without database query
  • ~10MB memory for 10M usernames
  • 1-5% false positive rate (acceptable for this use case)

Performance Comparison Chart - PostgreSQL vs Redis vs Bloom Filter showing latency metrics (P50, P95, P99) and throughput across three architectures

Experiment Design

Infrastructure Stack

  • Runtime: Node.js 18 with Express
  • Database: PostgreSQL 15 with indexed username column
  • Cache: Redis 7.x with optimized memory settings
  • Bloom Filter: bloom-filters library (1% false positive rate)
  • Load Testing: k6 with realistic traffic patterns
  • Monitoring: Prometheus + Grafana for metrics collection

Test Methodology

Dataset: 10 million pre-populated usernames
Traffic Pattern: 90% new users, 10% existing users (realistic signup distribution)
Load Profile:

Phase 1 (0-30s):   Warmup - 100 req/s
Phase 2 (30-90s):  Baseline - 500 req/s
Phase 3 (90-210s): Peak - 1000 req/s
Phase 4 (210-240s): Ramp down

Each test ran for 4 minutes with continuous monitoring of latency, throughput, and resource utilization.
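The four phases above map naturally onto k6-style ramp stages. This is a sketch of the profile's shape, not the repo's actual load script:

```javascript
// The load profile expressed as k6-style stages (each stage ramps the
// request rate toward `target` over `duration`).
const stages = [
  { duration: '30s',  target: 100 },  // Phase 1: warmup
  { duration: '60s',  target: 500 },  // Phase 2: baseline
  { duration: '120s', target: 1000 }, // Phase 3: peak
  { duration: '30s',  target: 0 },    // Phase 4: ramp down
];

// In an actual k6 script these would go under `export const options`.
const totalSeconds = stages
  .map(s => parseInt(s.duration, 10))
  .reduce((sum, secs) => sum + secs, 0);
console.log(totalSeconds); // 240 seconds — the 4-minute test window
```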

Performance Results

| Metric | PostgreSQL | Redis | Bloom Filter |
|---|---|---|---|
| P50 Latency | 23.4 ms | 2.8 ms | 0.08 ms |
| P95 Latency | 45.2 ms | 8.1 ms | 0.31 ms |
| P99 Latency | 67.8 ms | 14.5 ms | 2.1 ms |
| Throughput | 850 req/s | 3,500 req/s | 12,000 req/s |
| DB Queries (per 100K) | 100,000 | 0 | 5,127 |
| Memory Usage | Minimal | 487 MB | 9.6 MB |
| Accuracy | 100% | 100% | 99.87% |

Key Findings

1. Latency Improvement: 300x Faster

Bloom filters achieved a P50 latency of 0.08ms compared to PostgreSQL's 23.4ms, a roughly 290x speedup. The gain comes from eliminating network overhead entirely: the check happens in-process.

2. Database Query Reduction: 95%

Out of 100,000 requests, the Bloom filter passed only 5,127 (5.13%) through to the database for verification. The remaining 94.87% returned instantly with zero database load.

3. Memory Efficiency: 50x Less

Redis required 487MB to store 10 million usernames. The Bloom filter used just 9.6MB for the same dataset - a 98% memory reduction.

4. Cost Implications

At $0.000001 per database query:

  • PostgreSQL: $86.40/day for 86.4M queries
  • Bloom Filter: $4.32/day for 4.32M queries (5% fallback)
  • Savings: $2,462/month at 1000 req/s sustained load
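The arithmetic behind those numbers, using the assumed $0.000001-per-query rate:

```javascript
// Back-of-envelope cost model from the figures above.
const COST_PER_QUERY = 0.000001;     // assumed rate from the post
const queriesPerDay = 1000 * 86_400; // 1000 req/s sustained → 86.4M/day

const pgCostPerDay = queriesPerDay * COST_PER_QUERY;           // every request hits the DB
const bloomCostPerDay = queriesPerDay * 0.05 * COST_PER_QUERY; // ~5% fallback rate

console.log(pgCostPerDay.toFixed(2));    // "86.40"
console.log(bloomCostPerDay.toFixed(2)); // "4.32"
console.log(((pgCostPerDay - bloomCostPerDay) * 30).toFixed(0)); // "2462" per month
```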

Understanding Bloom Filters

A Bloom filter is a probabilistic data structure that tests set membership with two critical properties:

No False Negatives: If the filter says an element doesn't exist, it's 100% correct.

Possible False Positives: If the filter says an element might exist, you must verify with the source of truth.

How It Works

When adding a username to the filter:

"john_doe" → hash1(x), hash2(x), hash3(x) → set bits at positions [142, 891, 1523]

When checking if a username exists:

If ANY bit is 0 → Definitely NOT exists (return immediately)
If ALL bits are 1 → MAYBE exists (check database)
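The add/check mechanics above can be demonstrated with a toy implementation. This is an illustrative sketch, not the `bloom-filters` library used in the benchmark; it derives k bit positions per key via double hashing over two seeded FNV-1a hashes:

```javascript
// Minimal Bloom filter sketch: k bit positions per key via h1 + i*h2.
class ToyBloomFilter {
  constructor(bits = 1024, hashes = 3) {
    this.bits = bits;
    this.hashes = hashes;
    this.array = new Uint8Array(bits); // one byte per bit, for clarity
  }
  // Seeded FNV-1a, kept in the unsigned 32-bit range
  _hash(key, seed) {
    let h = 2166136261 ^ seed;
    for (let i = 0; i < key.length; i++) {
      h ^= key.charCodeAt(i);
      h = Math.imul(h, 16777619);
    }
    return h >>> 0;
  }
  _positions(key) {
    const h1 = this._hash(key, 0);
    const h2 = this._hash(key, 0x9e3779b9);
    const out = [];
    for (let i = 0; i < this.hashes; i++) out.push((h1 + i * h2) % this.bits);
    return out;
  }
  add(key) {
    for (const p of this._positions(key)) this.array[p] = 1;
  }
  has(key) {
    // ANY bit 0 → definitely not present; ALL bits 1 → maybe present
    return this._positions(key).every(p => this.array[p] === 1);
  }
}

const toy = new ToyBloomFilter();
toy.add('john_doe');
console.log(toy.has('john_doe')); // true — added keys are never missed
```

Added keys are always reported present (no false negatives); a key that was never added usually maps to at least one zero bit, but can occasionally collide on all k positions (a false positive).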

The false positive rate is configurable through bit array size and number of hash functions. Our implementation uses a 1% FPR, meaning 99% accuracy for "might exist" cases.
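Those two knobs follow standard formulas: for n items and target false positive rate p, the optimal bit count is m = -n*ln(p)/(ln 2)^2 and the hash count is k = (m/n)*ln 2. A quick sketch (note the raw-bit math gives roughly 12 MB for this configuration, the same order as the ~10 MB measured; real libraries may round sizes differently):

```javascript
// Standard Bloom filter sizing: m = -n*ln(p)/(ln 2)^2 bits, k = (m/n)*ln 2 hashes.
function bloomSizing(n, p) {
  const m = Math.ceil((-n * Math.log(p)) / (Math.LN2 * Math.LN2)); // total bits
  const k = Math.round((m / n) * Math.LN2);                        // hash functions
  return { bits: m, hashes: k, megabytes: m / 8 / 1e6 };
}

const s = bloomSizing(10_000_000, 0.01); // 10M usernames, 1% FPR
console.log(s.hashes);                   // 7 hash functions
console.log(s.megabytes.toFixed(1));     // ~12.0 MB of raw bits
```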

Bloom Filter Bit Array Visualization - Demonstrates how hash functions map usernames to bit positions and how membership checks work with visual representation of bit array operations

Production Considerations

When to Use Each Approach

Choose PostgreSQL when:

  • Request volume is low (<100 req/s)
  • Strong consistency is critical
  • Team lacks operational expertise for distributed systems
  • Simplicity outweighs performance optimization

Choose Redis when:

  • High traffic requires <5ms response times
  • 100% accuracy is non-negotiable
  • Budget supports 50MB+ memory per million items
  • Need distributed caching across multiple services

Choose Bloom Filters when:

  • Massive scale demands sub-millisecond response
  • Negative lookups dominate (90%+ checking new items)
  • Memory constraints exist
  • 1-5% false positive rate is acceptable
  • Minimizing database load is critical

Handling False Positives

False positives in username availability checks are operationally acceptable because:

  1. User experience remains identical (username shown as taken)
  2. Database verification occurs transparently
  3. No data corruption or incorrect state
  4. Performance benefit outweighs the rare false positive

// Bloom filter says "maybe exists"
if (bloomFilter.has(username)) {
  // Always verify with database
  const actuallyExists = await db.query('SELECT EXISTS...');

  if (!actuallyExists) {
    // False positive detected
    // User can still register - no negative impact
    metrics.increment('bloom_filter_false_positive');
  }
}

Data Synchronization

Maintain filter accuracy through:

// On new user registration
async function registerUser(username) {
  // 1. Insert into database
  await db.query('INSERT INTO users (username) VALUES ($1)', [username]);

  // 2. Update Bloom filter immediately
  bloomFilter.add(username);

  // 3. Persist filter periodically (every hour)
  if (shouldPersist()) {
    await saveBloomFilterToDisk(bloomFilter);
  }
}

Real-World Applications

Major technology platforms use Bloom filters for similar use cases:

GitHub: Checks user passwords against 10 billion leaked credentials in <1ms
Medium: Filters already-read articles from recommendation feeds
Google Chrome: Detects malicious URLs locally without server requests
Akamai CDN: Performs cache existence checks at edge nodes
Bitcoin: Validates transaction double-spending in UTXO sets

Implementation Guide

Building the Bloom Filter

const fs = require('fs');
const { BloomFilter } = require('bloom-filters');

// Create filter: 10M capacity, 1% false positive rate
const filter = BloomFilter.create(10000000, 0.01);

// Load existing usernames
const users = await db.query('SELECT username FROM users');
users.rows.forEach(user => filter.add(user.username));

// Persist to disk (note: _bits is a private field; recent versions of the
// bloom-filters library expose saveAsJSON()/fromJSON() for serialization)
const filterData = {
  size: filter.size,
  nbHashes: filter.nbHashes,
  bits: Array.from(filter._bits),
  metadata: { totalItems: users.rows.length, fpr: 0.01 }
};
fs.writeFileSync('bloom-filter.json', JSON.stringify(filterData));

Production API Implementation

app.post('/check-username', async (req, res) => {
  const { username } = req.body;
  const start = Date.now();

  // In-process Bloom filter check
  const mightExist = bloomFilter.has(username);

  if (!mightExist) {
    // 95% of requests return here
    return res.json({
      available: true,
      latency: Date.now() - start,
      source: 'bloom_filter'
    });
  }

  // 5% of requests verify with database
  const result = await db.query(
    'SELECT EXISTS(SELECT 1 FROM users WHERE username = $1)',
    [username]
  );

  res.json({
    available: !result.rows[0].exists,
    latency: Date.now() - start,
    source: 'database_fallback',
    falsePositive: !result.rows[0].exists
  });
});

Reproduction Instructions

Full source code and setup available on GitHub:

git clone https://github.com/builtbychikara/WhatIfSeries.git
cd WhatIfSeries/what-if/bloom-filter

# Install dependencies
npm install

# Start infrastructure (PostgreSQL, Redis, Prometheus, Grafana)
npm run docker:up

# Seed 10M usernames (takes ~15 minutes)
npm run seed

# Run services (3 separate terminals)
npm run start:postgres  # Terminal 1
npm run start:redis     # Terminal 2
npm run start:bloom     # Terminal 3

# Execute load tests
npm run experiment

# View results
open http://localhost:3100  # Grafana dashboard

Conclusion

Bloom filters provide a compelling solution for username availability checks at scale:

  • 300x latency improvement over direct database queries
  • 95% reduction in database load
  • 50x memory efficiency compared to Redis caching
  • Production-proven in systems like GitHub, Medium, and Google Chrome

The 1-5% false positive rate is an acceptable trade-off for the massive performance and cost benefits. For high-traffic applications where negative lookups dominate, Bloom filters are a battle-tested solution.

The complete proof-of-concept demonstrates production-grade implementation with comprehensive monitoring, load testing, and detailed metrics. All code is open source and ready to run.


GitHub Repository: https://github.com/builtbychikara/WhatIfSeries/tree/main/what-if/bloom-filter

Tech Stack: Node.js, PostgreSQL, Redis, Bloom Filters, k6, Prometheus, Grafana, Docker

Part of the What If Series - Production-grade POCs exploring system design decisions through empirical measurement.
