<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Bhupesh Chikara</title>
    <description>The latest articles on DEV Community by Bhupesh Chikara (@bchikara).</description>
    <link>https://dev.to/bchikara</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3098843%2F59b6ca33-10e4-44ce-8456-ec41137692b4.jpeg</url>
      <title>DEV Community: Bhupesh Chikara</title>
      <link>https://dev.to/bchikara</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/bchikara"/>
    <language>en</language>
    <item>
      <title>How to Check 10 Million Usernames in Under 1 Millisecond</title>
      <dc:creator>Bhupesh Chikara</dc:creator>
      <pubDate>Mon, 02 Mar 2026 03:51:38 +0000</pubDate>
      <link>https://dev.to/bchikara/how-to-check-10-million-usernames-in-under-1-millisecond-38bp</link>
      <guid>https://dev.to/bchikara/how-to-check-10-million-usernames-in-under-1-millisecond-38bp</guid>
      <description>&lt;p&gt;Every second, platforms like GitHub, Instagram, and Twitter process thousands of username availability checks. A seemingly simple operation that becomes a critical performance bottleneck at scale. I built a production-grade proof-of-concept to measure exactly how different architectures handle this challenge.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvuct9w9eg6cchys5cfvk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvuct9w9eg6cchys5cfvk.png" alt="GitHub Username Validation Interface - Shows real-time username availability checking during signup" width="800" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Challenge
&lt;/h2&gt;

&lt;p&gt;When a user types "john_doe" during signup, the system must instantly verify if that username exists among millions of registered users. At 1000 requests per second, this translates to 86.4 million database queries per day.&lt;/p&gt;

&lt;p&gt;Traditional approaches create two fundamental problems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance Degradation&lt;/strong&gt;: Each username check requires a database roundtrip, introducing 10-50ms latency per request.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure Costs&lt;/strong&gt;: Database CPU and I/O costs scale linearly with request volume, so the bill grows in lockstep with traffic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Architectural Approaches
&lt;/h2&gt;

&lt;p&gt;I tested three production-viable architectures against a dataset of 10 million usernames under sustained load of 1000 requests/second.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture 1: PostgreSQL Direct Query
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Every request hits PostgreSQL&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SELECT EXISTS(SELECT 1 FROM usernames WHERE username = $1)&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;username&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Characteristics&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;100% accuracy guaranteed&lt;/li&gt;
&lt;li&gt;Network + query latency on every request&lt;/li&gt;
&lt;li&gt;Database becomes the bottleneck&lt;/li&gt;
&lt;li&gt;Simple to implement and maintain&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Architecture 2: Redis In-Memory Cache
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Query Redis SET&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;exists&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sIsMember&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;usernames&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;username&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Characteristics&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;100% accuracy with proper synchronization&lt;/li&gt;
&lt;li&gt;Sub-5ms latency&lt;/li&gt;
&lt;li&gt;Requires ~500MB memory for 10M usernames&lt;/li&gt;
&lt;li&gt;Zero database queries after initial load&lt;/li&gt;
&lt;/ul&gt;
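&lt;p&gt;The "zero database queries after initial load" property depends on a one-time warm-up that copies every username from PostgreSQL into the Redis SET. A minimal sketch of that load, assuming a node-redis v4 client (matching the sIsMember call above) and a users table; the batch size and keyset pagination are illustrative choices:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Hypothetical warm-up: stream usernames from PostgreSQL into the Redis SET.
// Keyset pagination plus batched SADD keeps round trips manageable at 10M rows.
async function warmRedisCache(db, redis) {
  const batchSize = 10000;
  let cursor = '';

  for (;;) {
    const { rows } = await db.query(
      'SELECT username FROM users WHERE username &amp;gt; $1 ORDER BY username LIMIT $2',
      [cursor, batchSize]
    );
    if (rows.length === 0) break;

    // node-redis v4 sAdd accepts an array of members
    await redis.sAdd('usernames', rows.map(r =&amp;gt; r.username));
    cursor = rows[rows.length - 1].username;
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;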

&lt;h3&gt;
  
  
  Architecture 3: Bloom Filter + Database Fallback
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Check Bloom filter first&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;mightExist&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;bloomFilter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;has&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;username&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;mightExist&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Definitely NOT exists - return immediately&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;available&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Maybe exists - verify with database&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;actuallyExists&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SELECT EXISTS...&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;available&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;actuallyExists&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Characteristics&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In-process checks eliminate network overhead&lt;/li&gt;
&lt;li&gt;95% of requests return instantly without database query&lt;/li&gt;
&lt;li&gt;~10MB memory for 10M usernames&lt;/li&gt;
&lt;li&gt;1-5% false positive rate (acceptable for this use case)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc7kmyna20pvcnh6ettgu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc7kmyna20pvcnh6ettgu.png" alt="Performance Comparison Chart - PostgreSQL vs Redis vs Bloom Filter showing latency metrics (P50, P95, P99) and throughput across three architectures" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Experiment Design
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Infrastructure Stack
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Runtime&lt;/strong&gt;: Node.js 18 with Express&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database&lt;/strong&gt;: PostgreSQL 15 with indexed username column&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache&lt;/strong&gt;: Redis 7.x with optimized memory settings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bloom Filter&lt;/strong&gt;: bloom-filters library (1% false positive rate)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load Testing&lt;/strong&gt;: k6 with realistic traffic patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring&lt;/strong&gt;: Prometheus + Grafana for metrics collection&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Test Methodology
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: 10 million pre-populated usernames&lt;br&gt;
&lt;strong&gt;Traffic Pattern&lt;/strong&gt;: 90% new users, 10% existing users (realistic signup distribution)&lt;br&gt;
&lt;strong&gt;Load Profile&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Phase 1 (0-30s):   Warmup - 100 req/s
Phase 2 (30-90s):  Baseline - 500 req/s
Phase 3 (90-210s): Peak - 1000 req/s
Phase 4 (210-240s): Ramp down
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each test ran for 4 minutes with continuous monitoring of latency, throughput, and resource utilization.&lt;/p&gt;
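&lt;p&gt;For reference, this load profile maps directly onto a k6 ramping-arrival-rate scenario. The sketch below is illustrative: the service URL, port, and payload shape are assumptions, and k6 ramps linearly between stage targets rather than holding each phase perfectly flat:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;import http from 'k6/http';

// Four-phase profile from above, expressed as arrival rates (req/s)
export const options = {
  scenarios: {
    username_checks: {
      executor: 'ramping-arrival-rate',
      startRate: 0,
      timeUnit: '1s',
      preAllocatedVUs: 200,
      maxVUs: 500,
      stages: [
        { target: 100, duration: '30s' },   // Phase 1: warmup
        { target: 500, duration: '60s' },   // Phase 2: baseline
        { target: 1000, duration: '120s' }, // Phase 3: peak
        { target: 0, duration: '30s' },     // Phase 4: ramp down
      ],
    },
  },
};

export default function () {
  // Endpoint and payload are assumptions for illustration
  http.post(
    'http://localhost:3000/check-username',
    JSON.stringify({ username: `user_${Math.random()}` }),
    { headers: { 'Content-Type': 'application/json' } }
  );
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;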

&lt;h2&gt;
  
  
  Performance Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;PostgreSQL&lt;/th&gt;
&lt;th&gt;Redis&lt;/th&gt;
&lt;th&gt;Bloom Filter&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;P50 Latency&lt;/td&gt;
&lt;td&gt;23.4ms&lt;/td&gt;
&lt;td&gt;2.8ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.08ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P95 Latency&lt;/td&gt;
&lt;td&gt;45.2ms&lt;/td&gt;
&lt;td&gt;8.1ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.31ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P99 Latency&lt;/td&gt;
&lt;td&gt;67.8ms&lt;/td&gt;
&lt;td&gt;14.5ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.1ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Throughput&lt;/td&gt;
&lt;td&gt;850 req/s&lt;/td&gt;
&lt;td&gt;3500 req/s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;12,000 req/s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DB Queries (per 100K)&lt;/td&gt;
&lt;td&gt;100,000&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5,127&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory Usage&lt;/td&gt;
&lt;td&gt;Minimal&lt;/td&gt;
&lt;td&gt;487 MB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;9.6 MB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Accuracy&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;99.87%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Key Findings
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Latency Improvement: 300x Faster&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Bloom filters achieved a P50 latency of 0.08ms compared to PostgreSQL's 23.4ms, a roughly 290x speedup. The improvement comes from eliminating network overhead entirely: the check happens in-process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Database Query Reduction: 95%&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Out of 100,000 requests, the Bloom filter triggered only 5,127 database queries, a 5.13% fallback rate (genuinely taken usernames plus false positives). The remaining 94.87% returned instantly with zero database load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Memory Efficiency: 50x Less&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Redis required 487MB to store 10 million usernames. The Bloom filter used just 9.6MB for the same dataset - a 98% memory reduction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Cost Implications&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At $0.000001 per database query:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PostgreSQL: $86.40/day for 86.4M queries&lt;/li&gt;
&lt;li&gt;Bloom Filter: $4.32/day for 4.32M queries (5% fallback)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Savings&lt;/strong&gt;: $2,462/month at 1000 req/s sustained load&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Understanding Bloom Filters
&lt;/h2&gt;

&lt;p&gt;A Bloom filter is a probabilistic data structure that tests set membership with two critical properties:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No False Negatives&lt;/strong&gt;: If the filter says an element doesn't exist, it's 100% correct.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Possible False Positives&lt;/strong&gt;: If the filter says an element might exist, you must verify with the source of truth.&lt;/p&gt;

&lt;h3&gt;
  
  
  How It Works
&lt;/h3&gt;

&lt;p&gt;When adding a username to the filter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"john_doe" → hash1(x), hash2(x), hash3(x) → set bits at positions [142, 891, 1523]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When checking if a username exists:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;If ANY bit is 0 → Definitely NOT exists (return immediately)
If ALL bits are 1 → MAYBE exists (check database)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The false positive rate is configurable through the bit array size and the number of hash functions. Our implementation targets a 1% FPR, meaning only about 1 in 100 absent usernames is falsely flagged as "might exist" and sent to the database for verification.&lt;/p&gt;
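&lt;p&gt;The standard sizing formulas make this trade-off concrete: for n items and a target false positive rate p, the optimal bit count is m = -n * ln(p) / (ln 2)^2 and the optimal hash count is k = (m / n) * ln 2. A quick sketch of the arithmetic for this article's parameters (the theoretical ~11.4 MB is the same order as the 9.6 MB measured above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Optimal Bloom filter sizing for n = 10M items at p = 1% FPR
const n = 10_000_000;
const p = 0.01;

const m = Math.ceil(-(n * Math.log(p)) / Math.log(2) ** 2); // bits in the array
const k = Math.round((m / n) * Math.log(2));                // hash functions

console.log((m / 8 / 1024 / 1024).toFixed(1), 'MB'); // ~11.4 MB
console.log(k, 'hash functions');                    // 7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;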

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdflflfh2dooyl82uumro.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdflflfh2dooyl82uumro.png" alt="Bloom Filter Bit Array Visualization - Demonstrates how hash functions map usernames to bit positions and how membership checks work with visual representation of bit array operations" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Production Considerations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  When to Use Each Approach
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Choose PostgreSQL&lt;/strong&gt; when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Request volume is low (&amp;lt;100 req/s)&lt;/li&gt;
&lt;li&gt;Strong consistency is critical&lt;/li&gt;
&lt;li&gt;Team lacks operational expertise for distributed systems&lt;/li&gt;
&lt;li&gt;Simplicity outweighs performance optimization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choose Redis&lt;/strong&gt; when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High traffic requires &amp;lt;5ms response times&lt;/li&gt;
&lt;li&gt;100% accuracy is non-negotiable&lt;/li&gt;
&lt;li&gt;Budget supports 50MB+ memory per million items&lt;/li&gt;
&lt;li&gt;Need distributed caching across multiple services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choose Bloom Filters&lt;/strong&gt; when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Massive scale demands sub-millisecond response&lt;/li&gt;
&lt;li&gt;Negative lookups dominate (90%+ checking new items)&lt;/li&gt;
&lt;li&gt;Memory constraints exist&lt;/li&gt;
&lt;li&gt;1-5% false positive rate is acceptable&lt;/li&gt;
&lt;li&gt;Minimizing database load is critical&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Handling False Positives
&lt;/h3&gt;

&lt;p&gt;False positives in username availability checks are &lt;strong&gt;operationally acceptable&lt;/strong&gt; because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User experience remains identical (username shown as taken)&lt;/li&gt;
&lt;li&gt;Database verification occurs transparently&lt;/li&gt;
&lt;li&gt;No data corruption or incorrect state&lt;/li&gt;
&lt;li&gt;Performance benefit outweighs rare false positive
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Bloom filter says "maybe exists"&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;bloomFilter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;has&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;username&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Always verify with database&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;actuallyExists&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SELECT EXISTS...&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;actuallyExists&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// False positive detected&lt;/span&gt;
    &lt;span class="c1"&gt;// User can still register - no negative impact&lt;/span&gt;
    &lt;span class="nx"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;increment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;bloom_filter_false_positive&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Data Synchronization
&lt;/h3&gt;

&lt;p&gt;Maintain filter accuracy through:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// On new user registration&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;registerUser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;username&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// 1. Insert into database&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;INSERT INTO users (username) VALUES ($1)&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;username&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;

  &lt;span class="c1"&gt;// 2. Update Bloom filter immediately&lt;/span&gt;
  &lt;span class="nx"&gt;bloomFilter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;username&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// 3. Persist filter periodically (every hour)&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;shouldPersist&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;saveBloomFilterToDisk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;bloomFilter&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
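&lt;p&gt;One caveat the registration path above does not cover: a standard Bloom filter supports additions but not removals, so deleted or renamed accounts leave stale bits behind. These act as extra false positives and slowly raise the database-fallback rate. A common mitigation, sketched here under the same setup (the rebuild interval is an illustrative assumption), is to rebuild the filter from the source of truth on a schedule:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Periodic rebuild: recreate the filter from the database so usernames freed
// by deleted accounts stop triggering unnecessary database fallbacks.
async function rebuildBloomFilter(db) {
  const fresh = BloomFilter.create(10000000, 0.01);

  const { rows } = await db.query('SELECT username FROM users');
  rows.forEach(r =&amp;gt; fresh.add(r.username));

  // Atomic reference swap; in-flight checks keep using the old filter
  bloomFilter = fresh;
}

// Once a day is illustrative; tune to your deletion volume
setInterval(() =&amp;gt; rebuildBloomFilter(db).catch(console.error),
            24 * 60 * 60 * 1000);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;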



&lt;h2&gt;
  
  
  Real-World Applications
&lt;/h2&gt;

&lt;p&gt;Major technology platforms have reportedly used Bloom filters for similar problems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt;: Screens user passwords against 10 billion leaked credentials in &amp;lt;1ms&lt;br&gt;
&lt;strong&gt;Medium&lt;/strong&gt;: Filters already-read articles from recommendation feeds&lt;br&gt;
&lt;strong&gt;Google Chrome&lt;/strong&gt;: Historically pre-checked URLs against a local Safe Browsing filter before contacting the server&lt;br&gt;
&lt;strong&gt;Akamai CDN&lt;/strong&gt;: Performs cache existence checks at edge nodes&lt;br&gt;
&lt;strong&gt;Bitcoin&lt;/strong&gt;: Light (SPV) clients use Bloom filters to subscribe to transactions relevant to their wallets&lt;/p&gt;
&lt;h2&gt;
  
  
  Implementation Guide
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Building the Bloom Filter
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;BloomFilter&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;bloom-filters&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Create filter: 10M capacity, 1% false positive rate&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;filter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;BloomFilter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Load existing usernames&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;users&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SELECT username FROM users&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forEach&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;username&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

&lt;span class="c1"&gt;// Persist to disk&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;filterData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;nbHashes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;nbHashes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;bits&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Array&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_bits&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;totalItems&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;fpr&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.01&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writeFileSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;bloom-filter.json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;filterData&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
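&lt;p&gt;Note that the serialization above reaches into the library's private _bits field, which can break across versions. Recent releases of the bloom-filters library expose saveAsJSON() and BloomFilter.fromJSON() for exactly this purpose; a sketch assuming those helpers are available:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;const fs = require('fs');
const { BloomFilter } = require('bloom-filters');

// Persist using the library's own serializer (no private fields involved)
fs.writeFileSync('bloom-filter.json', JSON.stringify(filter.saveAsJSON()));

// ...and restore it at startup, before the API starts accepting traffic
const saved = JSON.parse(fs.readFileSync('bloom-filter.json', 'utf8'));
const restored = BloomFilter.fromJSON(saved);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;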

&lt;h3&gt;
  
  
  Production API Implementation
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/check-username&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;username&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="c1"&gt;// In-process Bloom filter check&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;mightExist&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;bloomFilter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;has&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;username&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;mightExist&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// 95% of requests return here&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;available&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;latency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;bloom_filter&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// 5% of requests verify with database&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SELECT EXISTS(SELECT 1 FROM users WHERE username = $1)&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;username&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;available&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;latency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;database_fallback&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;falsePositive&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;exists&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Reproduction Instructions
&lt;/h2&gt;

&lt;p&gt;Full source code and setup available on GitHub:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/builtbychikara/WhatIfSeries.git
&lt;span class="nb"&gt;cd &lt;/span&gt;what-if/bloom-filter

&lt;span class="c"&gt;# Install dependencies&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt;

&lt;span class="c"&gt;# Start infrastructure (PostgreSQL, Redis, Prometheus, Grafana)&lt;/span&gt;
npm run docker:up

&lt;span class="c"&gt;# Seed 10M usernames (takes ~15 minutes)&lt;/span&gt;
npm run seed

&lt;span class="c"&gt;# Run services (3 separate terminals)&lt;/span&gt;
npm run start:postgres  &lt;span class="c"&gt;# Terminal 1&lt;/span&gt;
npm run start:redis     &lt;span class="c"&gt;# Terminal 2&lt;/span&gt;
npm run start:bloom     &lt;span class="c"&gt;# Terminal 3&lt;/span&gt;

&lt;span class="c"&gt;# Execute load tests&lt;/span&gt;
npm run experiment

&lt;span class="c"&gt;# View results&lt;/span&gt;
open http://localhost:3100  &lt;span class="c"&gt;# Grafana dashboard&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Bloom filters provide a compelling solution for username availability checks at scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;300x latency improvement&lt;/strong&gt; over direct database queries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;95% reduction&lt;/strong&gt; in database load&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;50x memory efficiency&lt;/strong&gt; compared to Redis caching&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production-proven&lt;/strong&gt; in systems like GitHub, Medium, and Chrome&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 1-5% false positive rate is an acceptable trade-off for the massive performance and cost benefits. For high-traffic applications where negative lookups dominate, Bloom filters are a battle-tested solution.&lt;/p&gt;

&lt;p&gt;The complete proof-of-concept demonstrates production-grade implementation with comprehensive monitoring, load testing, and detailed metrics. All code is open source and ready to run.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;GitHub Repository&lt;/strong&gt;: &lt;a href="https://github.com/builtbychikara/WhatIfSeries/tree/main/what-if/bloom-filter" rel="noopener noreferrer"&gt;https://github.com/builtbychikara/WhatIfSeries/tree/main/what-if/bloom-filter&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tech Stack&lt;/strong&gt;: Node.js, PostgreSQL, Redis, Bloom Filters, k6, Prometheus, Grafana, Docker&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Part of the What If Series&lt;/strong&gt; - Production-grade POCs exploring system design decisions through empirical measurement.&lt;/p&gt;

</description>
      <category>performance</category>
      <category>database</category>
      <category>redis</category>
      <category>postgres</category>
    </item>
    <item>
      <title>What If Your Database Goes Down? REST vs Kafka Under Fire</title>
      <dc:creator>Bhupesh Chikara</dc:creator>
      <pubDate>Thu, 26 Feb 2026 22:01:13 +0000</pubDate>
      <link>https://dev.to/bchikara/what-if-your-database-goes-down-rest-vs-kafka-under-fire-3469</link>
      <guid>https://dev.to/bchikara/what-if-your-database-goes-down-rest-vs-kafka-under-fire-3469</guid>
      <description>&lt;h1&gt;
  
  
  What If Your Database Goes Down? REST vs Kafka Under Fire
&lt;/h1&gt;

&lt;p&gt;Companies like Uber, Netflix, and Airbnb have migrated from traditional REST APIs to event-driven architectures. This shift isn't driven by trends, but by fundamental architectural resilience requirements. To quantify these differences, I built a production-grade chaos engineering proof-of-concept, simulating real-world database failures under sustained load.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hypothesis
&lt;/h2&gt;

&lt;p&gt;Traditional synchronous REST APIs create tight coupling between API servers and databases. When the database fails, the API fails. Event-driven architectures using Kafka decouple producers from consumers through message buffering, theoretically providing resilience during infrastructure failures.&lt;/p&gt;

&lt;p&gt;I designed an experiment to measure this resilience difference empirically.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5s7g07yrdedkdf2l9yud.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5s7g07yrdedkdf2l9yud.png" alt="REST vs Kafka Architecture Comparison" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Experiment Design
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Scenario
&lt;/h3&gt;

&lt;p&gt;I simulated Uber's real-time driver location tracking system, where high-frequency updates must be reliably persisted to PostgreSQL.&lt;/p&gt;

&lt;h3&gt;
  
  
  Infrastructure
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Load Pattern&lt;/strong&gt;: Constant 50 requests/second&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failure Injection&lt;/strong&gt;: PostgreSQL crash for 120 seconds (t=90s to t=210s)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Duration&lt;/strong&gt;: 5 minutes total (300 seconds)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load Testing&lt;/strong&gt;: k6 with automated orchestration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring&lt;/strong&gt;: Prometheus + Grafana for real-time metrics collection&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Architectures Tested
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Architecture A: Synchronous REST&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Direct HTTP POST to REST API&lt;/li&gt;
&lt;li&gt;Immediate database INSERT&lt;/li&gt;
&lt;li&gt;Response after database confirmation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Architecture B: Asynchronous Kafka&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HTTP POST to Producer API&lt;/li&gt;
&lt;li&gt;Event published to Kafka topic&lt;/li&gt;
&lt;li&gt;Consumer processes events with circuit breaker pattern&lt;/li&gt;
&lt;li&gt;Asynchronous database INSERT&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both architectures handled identical traffic patterns and experienced the same database failure window.&lt;/p&gt;
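&lt;p&gt;For reference, the constant-load portion of this setup is straightforward to express as a k6 constant-arrival-rate scenario. The sketch below is illustrative: the route and payload shape are assumptions, and the ports follow the reproduction guide later in this post (3001 for REST, 3002 for the Kafka producer):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;import http from 'k6/http';

// Constant 50 req/s against one architecture. The database crash at t=90s
// is injected externally, so the load script itself never changes.
export const options = {
  scenarios: {
    steady_load: {
      executor: 'constant-arrival-rate',
      rate: 50,
      timeUnit: '1s',
      duration: '300s',
      preAllocatedVUs: 100,
    },
  },
};

export default function () {
  // Swap the port to 3002 to drive the Kafka producer instead
  http.post(
    'http://localhost:3001/locations',
    JSON.stringify({ driverId: 42, lat: 37.77, lng: -122.42 }),
    { headers: { 'Content-Type': 'application/json' } }
  );
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;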

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  REST API Performance
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Metric                  Value
Total Requests         30,002
Successful             15,001
Failed                 15,001
Error Rate             50.00%
P50 Latency            3.60ms
P95 Latency            15.55ms
Average Latency        11.24ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Failure Characteristics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Immediate error propagation to clients&lt;/li&gt;
&lt;li&gt;No request buffering capability&lt;/li&gt;
&lt;li&gt;Manual intervention required for recovery&lt;/li&gt;
&lt;li&gt;50% of requests returned HTTP 500 errors&lt;/li&gt;
&lt;li&gt;Complete service degradation during database outage&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Kafka Architecture Performance
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Metric                  Value
Total Requests         15,000
Successful             15,000
Failed                 0
Error Rate             0.00%
P50 Latency            3.47ms
P95 Latency            12.10ms
Average Latency        6.77ms
Latency Improvement    39.77%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Failure Characteristics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zero client-facing errors&lt;/li&gt;
&lt;li&gt;Automatic request buffering in Kafka topics&lt;/li&gt;
&lt;li&gt;Circuit breaker pattern enabled graceful degradation&lt;/li&gt;
&lt;li&gt;Automatic recovery when database restored&lt;/li&gt;
&lt;li&gt;Transparent failure handling from client perspective&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture Analysis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  REST: Synchronous Coupling
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client → REST API → PostgreSQL → Response
           ↓ (if DB fails)
       HTTP 500 Error
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Failure Mode:&lt;/strong&gt;&lt;br&gt;
When PostgreSQL becomes unavailable, the REST API has no buffering mechanism. Each incoming request attempts a database connection, fails after timeout (2000ms), and returns an error to the client. This creates a cascading failure pattern where API health is directly tied to database availability.&lt;/p&gt;
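&lt;p&gt;In code, the coupling is easy to see: the handler cannot respond until the INSERT resolves, so a dead database turns every request into a timeout followed by a 500. A minimal sketch using Express and node-postgres (route and schema are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Synchronous REST: the client's response is gated on the database write
app.post('/locations', async (req, res) =&amp;gt; {
  const { driverId, lat, lng } = req.body;
  try {
    await db.query(
      'INSERT INTO driver_locations (driver_id, lat, lng) VALUES ($1, $2, $3)',
      [driverId, lat, lng]
    );
    res.status(201).json({ persisted: true });
  } catch (err) {
    // The connection timeout (~2000ms) surfaces directly to the caller
    res.status(500).json({ error: 'database unavailable' });
  }
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;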

&lt;p&gt;&lt;strong&gt;Characteristics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strong consistency guarantees&lt;/li&gt;
&lt;li&gt;Immediate failure feedback&lt;/li&gt;
&lt;li&gt;No request buffering&lt;/li&gt;
&lt;li&gt;Tight coupling between components&lt;/li&gt;
&lt;li&gt;Simple operational model&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Kafka: Asynchronous Decoupling
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client → Producer API → Kafka Topic → Consumer → PostgreSQL
         ↓ (immediate)           ↓ (buffered)
     HTTP 202 Accepted     Circuit Breaker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Failure Mode:&lt;/strong&gt;&lt;br&gt;
When PostgreSQL fails, the Producer API continues accepting requests and publishing events to Kafka. The Consumer detects database failures through the circuit breaker pattern, transitions to OPEN state, and stops attempting writes. Events accumulate in Kafka's persistent log. When PostgreSQL recovers, the circuit breaker transitions to HALF_OPEN, tests connectivity, and resumes processing the buffered events.&lt;/p&gt;
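&lt;p&gt;The producer side of this pipeline is what keeps clients unaware of the outage: it acknowledges as soon as the event is appended to the topic. A sketch using KafkaJS, the client named in the tech stack below (topic name, route, and payload are assumptions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;const { Kafka } = require('kafkajs');

const kafka = new Kafka({ clientId: 'producer-api', brokers: ['localhost:9092'] });
const producer = kafka.producer();

app.post('/locations', async (req, res) =&amp;gt; {
  // Append to the Kafka log; PostgreSQL is never on this code path
  await producer.send({
    topic: 'driver-locations',
    messages: [{ key: String(req.body.driverId), value: JSON.stringify(req.body) }],
  });

  // 202 Accepted: the write is buffered durably, not yet in the database
  res.status(202).json({ accepted: true });
});

producer.connect().then(() =&amp;gt; app.listen(3002));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;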

&lt;p&gt;&lt;strong&gt;Characteristics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Eventual consistency model&lt;/li&gt;
&lt;li&gt;Request buffering via Kafka topics&lt;/li&gt;
&lt;li&gt;Automatic failure detection and recovery&lt;/li&gt;
&lt;li&gt;Component independence&lt;/li&gt;
&lt;li&gt;Complex operational requirements&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Circuit Breaker Implementation
&lt;/h2&gt;

&lt;p&gt;The circuit breaker pattern is critical for preventing cascading failures in distributed systems. Our implementation uses a state machine with three states:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CircuitBreaker&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;timeout&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;30000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;resetTimeout&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;CLOSED&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;failureThreshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resetTimeout&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;resetTimeout&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;failureCount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;nextAttempt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;OPEN&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;nextAttempt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Circuit breaker OPEN - database unavailable&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;HALF_OPEN&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
      &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;onSuccess&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;onFailure&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
      &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nf"&gt;onSuccess&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;failureCount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;HALF_OPEN&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;successCount&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;successCount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;CLOSED&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;successCount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nf"&gt;onFailure&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;failureCount&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;failureCount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;failureThreshold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;OPEN&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;nextAttempt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resetTimeout&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;State Transitions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CLOSED&lt;/strong&gt;: Normal operation, all requests processed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OPEN&lt;/strong&gt;: Failure threshold exceeded, stop attempting database writes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HALF_OPEN&lt;/strong&gt;: Testing recovery, require 3 consecutive successes before full recovery&lt;/li&gt;
&lt;/ul&gt;
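&lt;p&gt;Wired into the consumer, the breaker wraps every database write. While it is OPEN, execute() throws before touching PostgreSQL, the message is not acknowledged, and the event is retried later from Kafka's log. A sketch of that wiring with a KafkaJS consumer (topic name and SQL are assumptions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;const breaker = new CircuitBreaker(5, 30000, 60000);

async function runConsumer() {
  await consumer.subscribe({ topic: 'driver-locations' });
  await consumer.run({
    eachMessage: async ({ message }) =&amp;gt; {
      const loc = JSON.parse(message.value.toString());

      // Throws while OPEN; KafkaJS retries the message rather than losing it
      await breaker.execute(() =&amp;gt;
        db.query(
          'INSERT INTO driver_locations (driver_id, lat, lng) VALUES ($1, $2, $3)',
          [loc.driverId, loc.lat, loc.lng]
        )
      );
    },
  });
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;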

&lt;h2&gt;
  
  
  Business Impact Analysis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Assumptions
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Request rate: 1,000 requests/second&lt;/li&gt;
&lt;li&gt;Revenue per request: $0.01&lt;/li&gt;
&lt;li&gt;Database outage duration: 5 minutes&lt;/li&gt;
&lt;li&gt;Error rate impact: 50% (REST) vs 0% (Kafka)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  REST Architecture Impact
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Failed Requests:    150,000 (50% × 1,000 req/s × 300s)
Revenue Loss:       $1,500
Recovery Time:      Manual intervention required
Customer Impact:    Severe degradation, visible errors
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Kafka Architecture Impact
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Failed Requests:    0 (buffered in Kafka)
Revenue Loss:       $0
Recovery Time:      Automatic
Customer Impact:    None (transparent to clients)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;ROI Calculation:&lt;/strong&gt;&lt;br&gt;
At scale, assuming one 5-minute database outage per month, Kafka prevents $18,000 in annual revenue loss. This excludes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Customer support costs from incident tickets&lt;/li&gt;
&lt;li&gt;Engineering time for manual recovery&lt;/li&gt;
&lt;li&gt;Reputational damage from service degradation&lt;/li&gt;
&lt;li&gt;Compliance implications for SLA violations&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Technical Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Runtime&lt;/strong&gt;: Node.js 18+ with Express&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Message Broker&lt;/strong&gt;: Apache Kafka 3.x with KafkaJS client&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database&lt;/strong&gt;: PostgreSQL 15 with connection pooling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load Testing&lt;/strong&gt;: k6 with custom JavaScript scenarios&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring&lt;/strong&gt;: Prometheus 2.x + Grafana 9.x&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestration&lt;/strong&gt;: Docker Compose for reproducible environments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chaos Engineering&lt;/strong&gt;: Automated failure injection scripts&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Observability Metrics
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Prometheus Instrumentation
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// REST API Metrics&lt;/span&gt;
&lt;span class="nx"&gt;rest_http_requests_total&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;200|500&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;rest_request_duration_seconds&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;quantile&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;0.5|0.95|0.99&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Kafka Producer Metrics&lt;/span&gt;
&lt;span class="nx"&gt;kafka_produce_total&lt;/span&gt;
&lt;span class="nx"&gt;kafka_produce_latency_milliseconds&lt;/span&gt;
&lt;span class="nx"&gt;kafka_produce_errors_total&lt;/span&gt;

&lt;span class="c1"&gt;// Consumer Metrics with Circuit Breaker&lt;/span&gt;
&lt;span class="nx"&gt;circuit_breaker_state&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;CLOSED|HALF_OPEN|OPEN&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;kafka_consumer_lag_seconds&lt;/span&gt;
&lt;span class="nx"&gt;db_write_failures_total&lt;/span&gt;
&lt;span class="nx"&gt;db_connection_pool_size&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;These metrics enabled real-time failure detection and post-mortem analysis of system behavior during the outage window.&lt;/p&gt;
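&lt;p&gt;These series come from standard prom-client instrumentation inside each service. A sketch of how the circuit breaker gauge and failure counter might be registered (metric names mirror the list above; the numeric state encoding is an assumption):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;const client = require('prom-client');

// Encode breaker state numerically: 0 = CLOSED, 1 = HALF_OPEN, 2 = OPEN
const breakerState = new client.Gauge({
  name: 'circuit_breaker_state',
  help: 'Current circuit breaker state',
});

const dbWriteFailures = new client.Counter({
  name: 'db_write_failures_total',
  help: 'Database writes that failed inside the consumer',
});

// Call from the breaker's state transitions
function recordState(state) {
  breakerState.set({ CLOSED: 0, HALF_OPEN: 1, OPEN: 2 }[state]);
}

// Expose /metrics for the Prometheus scraper
app.get('/metrics', async (req, res) =&amp;gt; {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;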
&lt;h2&gt;
  
  
  Reproduction Guide
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install k6 load testing tool&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;k6  &lt;span class="c"&gt;# macOS&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get &lt;span class="nb"&gt;install &lt;/span&gt;k6  &lt;span class="c"&gt;# Linux&lt;/span&gt;

&lt;span class="c"&gt;# Clone repository&lt;/span&gt;
git clone https://github.com/builtbychikara/WhatIfSeries
&lt;span class="nb"&gt;cd &lt;/span&gt;WhatIfSeries/what-if/kafka-vs-rest-polling
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Infrastructure Startup
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start Kafka, PostgreSQL, Prometheus, and Grafana&lt;/span&gt;
docker-compose up &lt;span class="nt"&gt;-d&lt;/span&gt;

&lt;span class="c"&gt;# Verify all services are healthy&lt;/span&gt;
docker-compose ps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Service Deployment
&lt;/h3&gt;

&lt;p&gt;Open three terminal windows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Terminal 1: REST API (Port 3001)&lt;/span&gt;
npm run start:rest

&lt;span class="c"&gt;# Terminal 2: Kafka Producer API (Port 3002)&lt;/span&gt;
npm run start:kafka

&lt;span class="c"&gt;# Terminal 3: Kafka Consumer with Circuit Breaker (Port 3003)&lt;/span&gt;
npm run start:consumer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Execute Experiment
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run automated chaos engineering experiment&lt;/span&gt;
npm run experiment
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Experiment Timeline
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Phase         Time        Load        Database State
Warmup        0-30s       25 req/s    Healthy
Normal        30-90s      50 req/s    Healthy
Crash         90-210s     50 req/s    OFFLINE
Recovery      210-300s    50 req/s    Healthy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The experiment script automatically injects the database failure at t=90s and restores it at t=210s, while maintaining constant load throughout.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Findings
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Resilience Through Decoupling
&lt;/h3&gt;

&lt;p&gt;Event-driven architectures achieve resilience not through redundancy, but through temporal decoupling. Kafka's persistent log provides a buffer that absorbs transient failures, converting availability problems into latency problems.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Circuit Breaker Necessity
&lt;/h3&gt;

&lt;p&gt;Without the circuit breaker pattern, the Kafka consumer would continuously retry failed database writes, wasting resources and potentially overwhelming the database during recovery. The circuit breaker provides the following (a minimal sketch appears after this list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automatic failure detection&lt;/li&gt;
&lt;li&gt;Resource conservation during outages&lt;/li&gt;
&lt;li&gt;Controlled recovery testing&lt;/li&gt;
&lt;li&gt;Metrics for failure state visibility&lt;/li&gt;
&lt;/ul&gt;
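
&lt;p&gt;Concretely, such a breaker might look like the sketch below. The state names match the &lt;code&gt;circuit_breaker_state&lt;/code&gt; metric earlier; the thresholds and structure are assumptions, not the repository's implementation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Minimal circuit breaker sketch: CLOSED -&amp;gt; OPEN after repeated failures,
// OPEN -&amp;gt; HALF_OPEN after a cooldown, HALF_OPEN -&amp;gt; CLOSED on one success.
class CircuitBreaker {
  constructor(action, { failureThreshold = 5, cooldownMs = 10000 } = {}) {
    this.action = action;
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.failures = 0;
    this.state = 'CLOSED';
    this.openedAt = 0;
  }

  async call(...args) {
    if (this.state === 'OPEN') {
      if (Date.now() - this.openedAt &amp;lt; this.cooldownMs) {
        throw new Error('circuit open: skipping database write');
      }
      this.state = 'HALF_OPEN'; // let one probe request through
    }
    try {
      const result = await this.action(...args);
      this.failures = 0;
      this.state = 'CLOSED';
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.state === 'HALF_OPEN' || this.failures &amp;gt;= this.failureThreshold) {
        this.state = 'OPEN';
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The consumer would wrap each database write in &lt;code&gt;breaker.call(...)&lt;/code&gt; and pause message processing while the breaker reports &lt;code&gt;OPEN&lt;/code&gt;.&lt;/p&gt;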

&lt;h3&gt;
  
  
  3. Consistency Trade-offs
&lt;/h3&gt;

&lt;p&gt;REST provides strong consistency - clients know immediately whether their write succeeded. Kafka provides eventual consistency - clients receive acknowledgment that the write is buffered, not that it's persisted. This trade-off must align with business requirements.&lt;/p&gt;
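
&lt;p&gt;A hedged sketch of the two acknowledgment semantics, side by side; the handlers, topic name, and setup here are illustrative, not the repo's endpoints:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;const express = require('express');
const { Pool } = require('pg');
const { Kafka } = require('kafkajs');

const db = new Pool(); // reads PG* environment variables
const producer = new Kafka({ brokers: ['localhost:9092'] }).producer();
// (assume producer.connect() has been awaited during startup)

const restApp = express().use(express.json());
const kafkaApp = express().use(express.json());

// Synchronous REST: a 201 means the row is durably written.
restApp.post('/usernames', async (req, res) =&amp;gt; {
  try {
    await db.query('INSERT INTO usernames (name) VALUES ($1)', [req.body.name]);
    res.status(201).end();
  } catch (err) {
    res.status(500).end(); // a database outage is visible to the client
  }
});

// Kafka-backed: a 202 means the event is buffered in the log,
// not that it has reached the database yet.
kafkaApp.post('/usernames', async (req, res) =&amp;gt; {
  await producer.send({
    topic: 'username-events', // illustrative topic name
    messages: [{ value: JSON.stringify({ name: req.body.name }) }],
  });
  res.status(202).end(); // persistence happens later, in the consumer
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;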

&lt;h3&gt;
  
  
  4. Operational Complexity
&lt;/h3&gt;

&lt;p&gt;Kafka introduces operational overhead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Additional infrastructure (brokers, ZooKeeper/KRaft)&lt;/li&gt;
&lt;li&gt;Message ordering guarantees to maintain&lt;/li&gt;
&lt;li&gt;Consumer lag monitoring requirements&lt;/li&gt;
&lt;li&gt;Topic partition management&lt;/li&gt;
&lt;li&gt;Schema evolution considerations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This complexity must be justified by resilience requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision Framework
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Choose REST When:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Traffic Patterns&lt;/strong&gt;: Request rate under 100 req/s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistency Requirements&lt;/strong&gt;: Strong consistency required&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Team Capabilities&lt;/strong&gt;: Limited operational expertise&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System Simplicity&lt;/strong&gt;: Monolithic architecture preferred&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency Requirements&lt;/strong&gt;: Sub-10ms response times required&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Choose Kafka When:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Traffic Patterns&lt;/strong&gt;: Request rate exceeds 1,000 req/s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resilience Requirements&lt;/strong&gt;: Infrastructure failures must be transparent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Event Consumers&lt;/strong&gt;: Multiple downstream systems need the same events&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistency Trade-offs&lt;/strong&gt;: Eventual consistency acceptable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operational Maturity&lt;/strong&gt;: Team can manage distributed systems&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Future Work
&lt;/h2&gt;

&lt;p&gt;This experiment is part of the What If Series, exploring system design decisions through empirical measurement rather than theoretical analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Upcoming Experiments:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High-throughput scaling: Measuring Kafka performance at 1M+ req/s&lt;/li&gt;
&lt;li&gt;Cache failure analysis: Redis outage impact on application tier&lt;/li&gt;
&lt;li&gt;Real-time processing: Stream processing 1TB of logs with sub-second latency&lt;/li&gt;
&lt;li&gt;Network partition testing: Consistency guarantees under split-brain scenarios&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Code Repository
&lt;/h2&gt;

&lt;p&gt;Full implementation including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complete source code for both architectures&lt;/li&gt;
&lt;li&gt;Docker Compose infrastructure definitions&lt;/li&gt;
&lt;li&gt;k6 load testing scenarios&lt;/li&gt;
&lt;li&gt;Prometheus metric exporters&lt;/li&gt;
&lt;li&gt;Automated chaos injection scripts&lt;/li&gt;
&lt;li&gt;Grafana dashboard configurations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/builtbychikara/WhatIfSeries/tree/main/what-if/kafka-vs-rest-polling" rel="noopener noreferrer"&gt;https://github.com/builtbychikara/WhatIfSeries/tree/main/what-if/kafka-vs-rest-polling&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Files&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;kafka-pubsub-service/consumer.js&lt;/code&gt; - Circuit breaker implementation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;k6/runner.js&lt;/code&gt; - Chaos engineering experiment orchestration&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;docker-compose.yml&lt;/code&gt; - Complete infrastructure stack&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Event-driven architectures using Kafka demonstrate measurably superior resilience during database failures, with zero client-facing errors compared to 50% error rates in synchronous REST implementations. However, this resilience comes at the cost of operational complexity and consistency guarantees.&lt;/p&gt;

&lt;p&gt;The choice between REST and Kafka should be driven by quantified requirements for resilience, throughput, and acceptable consistency models, not by architectural preferences or industry trends.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Acknowledgments&lt;/strong&gt;: Built in collaboration with Aakanksh Singh to demonstrate production-grade system design patterns through empirical measurement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tech Stack&lt;/strong&gt;: Node.js, Apache Kafka, PostgreSQL, k6, Prometheus, Grafana, Docker&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>systemdesign</category>
      <category>chaosengineering</category>
      <category>node</category>
    </item>
    <item>
      <title>Building CodeNova: System Design Deep Dive into an AI-Enhanced Coding Platform</title>
      <dc:creator>Bhupesh Chikara</dc:creator>
      <pubDate>Sun, 30 Nov 2025 01:14:02 +0000</pubDate>
      <link>https://dev.to/bchikara/building-codenova-system-design-deep-dive-into-an-ai-enhanced-coding-platform-11d4</link>
      <guid>https://dev.to/bchikara/building-codenova-system-design-deep-dive-into-an-ai-enhanced-coding-platform-11d4</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;I designed and built &lt;strong&gt;CodeNova&lt;/strong&gt;, a scalable coding interview platform handling 10K+ concurrent users with three AI-powered features: &lt;strong&gt;video avatar tutor&lt;/strong&gt;, &lt;strong&gt;algorithm visualizer&lt;/strong&gt;, and &lt;strong&gt;collaborative whiteboard&lt;/strong&gt;. This is a deep dive into the system architecture and design decisions.&lt;/p&gt;




&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/BlJtQZ85rnw"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;h2&gt;
  
  
  🎯 What is CodeNova?
&lt;/h2&gt;

&lt;p&gt;CodeNova is an AI-enhanced coding interview platform designed for scalability and learning. Core features include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;155+ problems&lt;/strong&gt; across multiple difficulty levels&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10+ programming languages&lt;/strong&gt; with sandboxed execution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI video tutor&lt;/strong&gt; with realistic avatar and natural voice&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic algorithm visualization&lt;/strong&gt; for any code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time collaborative whiteboard&lt;/strong&gt; for mock interviews&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contest leaderboards&lt;/strong&gt; with analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scale:&lt;/strong&gt; Built to handle &lt;strong&gt;10,000 concurrent users&lt;/strong&gt;, &lt;strong&gt;1,000 submissions/minute&lt;/strong&gt;, with &lt;strong&gt;99.9% uptime&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  🏗️ High-Level Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F46evcoh1sk21eclygwr1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F46evcoh1sk21eclygwr1.png" alt="CodeNova Architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  System Overview
&lt;/h3&gt;

&lt;p&gt;The architecture follows a &lt;strong&gt;microservices-ready design&lt;/strong&gt; with clear separation of concerns across 6 layers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Layer 1: Client (Browser)
    ↓
Layer 2: CDN &amp;amp; Load Balancing (CloudFlare + Nginx)
    ↓
Layer 3: Application Tier (Next.js + Express + Socket.io)
    ↓
Layer 4: Data Tier (MongoDB + Redis + PostgreSQL)
    ↓
Layer 5: Queue Layer (BullMQ)
    ↓
Layer 6: Workers &amp;amp; External Services (Judge0, Gemini AI, ElevenLabs, ANAM)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🌟 Three Unique Features - Architecture Breakdown
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. AI Video Avatar Tutor
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Challenge:&lt;/strong&gt;&lt;br&gt;
How do you provide personalized video explanations to thousands of users without hiring human tutors?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Solution: Three-Stage Pipeline&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Question → Gemini AI → ElevenLabs → ANAM AI → Cached Video
              (Text Gen)   (TTS)        (Avatar)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Architecture Decisions:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision 1: Why Three Separate Services?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gemini AI&lt;/strong&gt; - Best at generating educational content&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ElevenLabs&lt;/strong&gt; - Most natural-sounding TTS (better than AWS Polly)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ANAM AI&lt;/strong&gt; - Realistic lip-sync (alternatives: D-ID, Synthesia)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Trade-off:&lt;/strong&gt; Higher complexity but better quality. Users prefer natural voice over robotic TTS.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision 2: Caching Strategy&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Problem:&lt;/strong&gt; Generating avatar videos takes 30 seconds per request&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Solution:&lt;/strong&gt; Redis cache with 24-hour TTL for common questions (sketched after this list)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result:&lt;/strong&gt; 70% cache hit rate significantly reduces generation load&lt;/li&gt;
&lt;/ul&gt;
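
&lt;p&gt;A cache-aside sketch of that strategy; the key scheme and the &lt;code&gt;generateAvatarVideo&lt;/code&gt; call are hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;const Redis = require('ioredis');
const crypto = require('crypto');

const redis = new Redis(); // assumes Redis on localhost:6379
const DAY_SECONDS = 24 * 60 * 60;

async function getAvatarVideo(question) {
  // Normalize so near-identical phrasings share a cache key.
  const key = 'avatar:' + crypto.createHash('sha1')
    .update(question.trim().toLowerCase())
    .digest('hex');

  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached); // the ~70% fast path

  const video = await generateAvatarVideo(question); // hypothetical 30s pipeline
  await redis.set(key, JSON.stringify(video), 'EX', DAY_SECONDS); // 24-hour TTL
  return video;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;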

&lt;p&gt;&lt;strong&gt;Decision 3: Async Processing&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Why:&lt;/strong&gt; 30-second generation time blocks API&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How:&lt;/strong&gt; BullMQ job queue (sketched after this list)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benefit:&lt;/strong&gt; User sees loading screen, gets notification when ready&lt;/li&gt;
&lt;/ul&gt;
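
&lt;p&gt;A minimal BullMQ sketch of that flow; the queue name, job payload, and helper functions are assumptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;const { Queue, Worker } = require('bullmq');

const connection = { host: 'localhost', port: 6379 };
const avatarQueue = new Queue('avatar-videos', { connection });

// API side: enqueue and return immediately instead of blocking for ~30s.
async function requestAvatarVideo(userId, question) {
  const job = await avatarQueue.add('generate', { userId, question });
  return job.id; // the client shows a loading screen and waits for a push
}

// Worker side: runs the slow three-stage pipeline in the background.
new Worker('avatar-videos', async (job) =&amp;gt; {
  const video = await generateAvatarVideo(job.data.question); // hypothetical pipeline
  notifyUser(job.data.userId, video.url); // hypothetical WebSocket push
}, { connection, concurrency: 5 });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;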




&lt;h3&gt;
  
  
  2. AI-Powered Algorithm Visualizer
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Challenge:&lt;/strong&gt;&lt;br&gt;
Traditional visualizers need manual step creation for each algorithm. How to support ANY algorithm without manual work?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Solution: AI-Generated Visualization Steps&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Code → Gemini AI → JSON Steps → Canvas Renderer → Interactive Visualization
         (Analyze)    (Generate)    (Frontend)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Architecture Decisions:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision 1: Why AI Over Templates?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Templates approach:&lt;/strong&gt; 155+ algorithms × manual steps = months of work&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI approach:&lt;/strong&gt; Gemini analyzes ANY code automatically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trade-off:&lt;/strong&gt; API dependency vs. automatic generation at scale&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Decision 2: Where to Render?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Server-side rendering:&lt;/strong&gt; High CPU usage, poor UX&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Client-side (Canvas API):&lt;/strong&gt; Better performance, lower server load&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chosen:&lt;/strong&gt; Client-side with JSON steps from server&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Decision 3: Data Format&lt;/strong&gt;&lt;br&gt;
Gemini returns structured JSON:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Step format:
- Description (plain English)
- Array state at this step
- Elements to highlight
- Comparison pointers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
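
&lt;p&gt;A hypothetical example of one such step for a bubble-sort comparison (field names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// One visualization step, as the Canvas renderer might receive it.
const step = {
  description: 'Compare indices 0 and 1; swap because 5 &amp;gt; 2',
  array: [2, 5, 8, 1, 9],   // array state after this step
  highlight: [0, 1],        // elements to draw in the accent color
  pointers: { i: 0, j: 1 }, // comparison pointers
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;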



&lt;p&gt;&lt;strong&gt;Supported Algorithms:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sorting: Bubble, Merge, Quick, Heap, Insertion&lt;/li&gt;
&lt;li&gt;Searching: Binary, Linear, DFS, BFS&lt;/li&gt;
&lt;li&gt;Data Structures: Stack, Queue, Trees, Graphs&lt;/li&gt;
&lt;li&gt;DP: Fibonacci, Knapsack, LCS with table visualization&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  3. Collaborative Whiteboard
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Challenge:&lt;/strong&gt;&lt;br&gt;
Enable real-time drawing for multiple users in mock interviews.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Solution: WebSocket + Pub/Sub Architecture&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User A draws → Socket.io Server → Redis Pub/Sub → All Users in Room
                     ↓
                 MongoDB (persist)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Architecture Decisions:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision 1: WebSocket vs. Polling?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Polling:&lt;/strong&gt; Simple but wasteful (10K users × 5s intervals = 2K QPS)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WebSocket:&lt;/strong&gt; Persistent connection, instant updates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chosen:&lt;/strong&gt; Socket.io for fallback support (WebSocket → long polling)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Decision 2: How to Scale WebSockets Across Multiple Servers?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Problem:&lt;/strong&gt; User A on Server 1, User B on Server 2&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Solution:&lt;/strong&gt; Redis Pub/Sub for cross-server communication&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How it works&lt;/strong&gt; (see the sketch after this list):

&lt;ul&gt;
&lt;li&gt;Server 1 publishes draw event to Redis&lt;/li&gt;
&lt;li&gt;Server 2 subscribes and receives event&lt;/li&gt;
&lt;li&gt;Server 2 sends to User B via WebSocket&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
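
&lt;p&gt;A sketch of that wiring using the standard &lt;code&gt;@socket.io/redis-adapter&lt;/code&gt; package, which wraps the same Redis Pub/Sub mechanism; the room and event names are assumptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;const { createServer } = require('http');
const { Server } = require('socket.io');
const { createClient } = require('redis');
const { createAdapter } = require('@socket.io/redis-adapter');

async function main() {
  const httpServer = createServer();
  const io = new Server(httpServer);

  // Each app server publishes/subscribes through Redis, so an emit on
  // Server 1 reaches sockets connected to Server 2.
  const pubClient = createClient({ url: 'redis://localhost:6379' });
  const subClient = pubClient.duplicate();
  await Promise.all([pubClient.connect(), subClient.connect()]);
  io.adapter(createAdapter(pubClient, subClient));

  io.on('connection', (socket) =&amp;gt; {
    socket.on('join', (roomId) =&amp;gt; socket.join(roomId));
    socket.on('draw', ({ roomId, stroke }) =&amp;gt; {
      // Fan out to everyone else in the room, across all servers.
      socket.to(roomId).emit('draw', stroke);
    });
  });

  httpServer.listen(3000);
}

main();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;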

&lt;p&gt;&lt;strong&gt;Decision 3: Persistence Strategy&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Approach 1:&lt;/strong&gt; Save on every draw → Too many DB writes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Approach 2:&lt;/strong&gt; Save on disconnect → Lose data if server crashes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chosen:&lt;/strong&gt; Auto-save every 5 seconds to MongoDB (sketched after the data model below)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recovery:&lt;/strong&gt; Load from DB on reconnect&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Data Model:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WhiteboardSession {
  sessionId: unique identifier
  problemId: which problem being discussed
  participants: array of user IDs with roles
  elements: Excalidraw drawing data
  createdAt, updatedAt
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
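
&lt;p&gt;A sketch of the 5-second auto-save loop, assuming a Mongoose-style &lt;code&gt;WhiteboardSession&lt;/code&gt; model matching the shape above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Track boards that changed since the last flush, then write them in
// batches every 5 seconds instead of on every stroke.
const dirty = new Map(); // sessionId -&amp;gt; latest Excalidraw elements

function onDrawEvent(sessionId, elements) {
  dirty.set(sessionId, elements);
}

setInterval(async () =&amp;gt; {
  for (const [sessionId, elements] of dirty) {
    dirty.delete(sessionId);
    await WhiteboardSession.updateOne(
      { sessionId },
      { $set: { elements, updatedAt: new Date() } },
      { upsert: true }
    );
  }
}, 5000);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;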






&lt;h2&gt;
  
  
  🔐 Security Architecture - Defense in Depth
&lt;/h2&gt;

&lt;h3&gt;
  
  
  6 Layers of Security
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: Network Perimeter&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CloudFlare DDoS protection (unlimited)&lt;/li&gt;
&lt;li&gt;Rate limiting: 1000 requests/minute per IP&lt;/li&gt;
&lt;li&gt;TLS 1.3 encryption&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Layer 2: Load Balancer (Nginx)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Per-user rate limiting (100 req/min)&lt;/li&gt;
&lt;li&gt;Request size limits (10 MB max)&lt;/li&gt;
&lt;li&gt;Header validation &amp;amp; sanitization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Layer 3: Authentication &amp;amp; Authorization&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;JWT tokens:&lt;/strong&gt; HS256 algorithm, 7-day expiry&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session validation:&lt;/strong&gt; Every request checks Redis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RBAC:&lt;/strong&gt; User vs Admin permissions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Layer 4: Input Validation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code size limit: 10 KB (prevents DoS)&lt;/li&gt;
&lt;li&gt;Forbidden pattern detection (a sketch follows this list):

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;require('child_process')&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;import subprocess&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Runtime.getRuntime().exec()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;system()&lt;/code&gt;, &lt;code&gt;eval()&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
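
&lt;p&gt;A sketch of how that validation layer might look; the pattern list mirrors the bullets above, and the function shape is an assumption:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Reject known-dangerous constructs before code reaches the sandbox.
const FORBIDDEN = [
  /require\(['"]child_process['"]\)/,
  /import\s+subprocess/,
  /Runtime\.getRuntime\(\)\.exec/,
  /\bsystem\s*\(/,
  /\beval\s*\(/,
];
const MAX_CODE_BYTES = 10 * 1024; // the 10 KB limit above

function validateSubmission(code) {
  if (Buffer.byteLength(code, 'utf8') &amp;gt; MAX_CODE_BYTES) {
    return { ok: false, reason: 'code exceeds 10 KB limit' };
  }
  const hit = FORBIDDEN.find((pattern) =&amp;gt; pattern.test(code));
  return hit ? { ok: false, reason: `forbidden pattern: ${hit}` } : { ok: true };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Pattern matching alone is easy to evade; it exists to fail fast, while the Docker sandbox in Layer 5 provides the real guarantee.&lt;/p&gt;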

&lt;p&gt;&lt;strong&gt;Layer 5: Code Execution Sandbox (Judge0)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Docker isolation:&lt;/strong&gt; Each submission in separate container&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource limits:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;CPU time: 2 seconds max&lt;/li&gt;
&lt;li&gt;Memory: 256 MB max&lt;/li&gt;
&lt;li&gt;Processes: 30 max&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Network:&lt;/strong&gt; Completely disabled&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Filesystem:&lt;/strong&gt; Read-only (except /tmp)&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Seccomp profiles:&lt;/strong&gt; Block dangerous syscalls&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Layer 6: Data Security&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Encryption at rest: AES-256&lt;/li&gt;
&lt;li&gt;Password hashing: Bcrypt (10 rounds)&lt;/li&gt;
&lt;li&gt;Secrets: AWS Secrets Manager&lt;/li&gt;
&lt;li&gt;Database backups: Daily full + 6h incremental&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why 6 Layers?&lt;/strong&gt;&lt;br&gt;
If an attacker bypasses one layer, 5 more remain. Single points of failure = bad.&lt;/p&gt;


&lt;h2&gt;
  
  
  📊 Scalability: Handling 10,000 Concurrent Users
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Horizontal Scaling Strategy
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes HPA (Horizontal Pod Autoscaler):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Configuration:
- Min replicas: 3 (high availability)
- Max replicas: 20 (resource management)
- Scale up: CPU &amp;gt; 70% OR Memory &amp;gt; 80%
- Scale down: CPU &amp;lt; 40% for 5 minutes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why Kubernetes?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Auto-healing (pod crashes → restart)&lt;/li&gt;
&lt;li&gt;Rolling updates (zero downtime deploys)&lt;/li&gt;
&lt;li&gt;Resource management (CPU/memory limits)&lt;/li&gt;
&lt;li&gt;Service discovery (automatic DNS)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Database Scaling Strategy
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;MongoDB (Primary Database):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Architecture: Replica Set (PSS)
- 1 Primary (us-east-1) → All writes
- 1 Secondary (us-west-1) → Read queries
- 1 Secondary (eu-west-1) → Read queries

Read Preference: secondaryPreferred (40% load on each secondary)
Write Concern: majority (data safety)

Future: Shard when &amp;gt; 10M documents
Shard Key: { userId: "hashed" } for even distribution
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;PostgreSQL (Analytics):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Architecture: Master-Replica
- Master: All writes (metrics, logs)
- Replica 1: Analytics queries
- Replica 2: Reporting dashboards

Extension: TimescaleDB for time-series optimization
Use case: User activity over time, submission trends
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Redis (Cache &amp;amp; Pub/Sub):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Architecture: Cluster (3 nodes)
- Node 1: Master (cache + sessions)
- Node 2: Replica (failover)
- Node 3: Replica (failover)

Persistence: RDB snapshots (5 min) + AOF
Max Memory: 4 GB
Eviction Policy: allkeys-lru (least recently used)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Worker Scaling
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;BullMQ Queue Configuration:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Code Execution Queue:
- Min workers: 5
- Max workers: 50
- Concurrency: 10 jobs per worker
- Scale trigger: Queue depth &amp;gt; 100

AI Avatar Queue:
- Min workers: 2
- Max workers: 20
- Concurrency: 5 jobs per worker
- Scale trigger: Queue depth &amp;gt; 50

Visualizer Queue:
- Min workers: 2
- Max workers: 15
- Concurrency: 5 jobs per worker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Math Check:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Peak Load: 1,000 submissions/minute
         = 16.7 submissions/second

Average execution time: 2 seconds

Required concurrent workers:
16.7 submissions/sec × 2 sec = 33.4 workers

Configured max: 50 workers
Headroom: 50 - 34 = 16 workers (47% buffer) ✓
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🗄️ Data Architecture Decisions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why MongoDB for Primary DB?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;br&gt;
✅ Flexible schema (problems have varying test cases)&lt;br&gt;
✅ Horizontal scaling with sharding&lt;br&gt;
✅ Rich query language (filter by difficulty, tags, companies)&lt;br&gt;
✅ Replica sets for HA&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;br&gt;
❌ Multi-document transactions only since 4.0 (4.2 for sharded clusters)&lt;br&gt;
❌ Larger storage footprint&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Problems collection (155+ documents)&lt;/li&gt;
&lt;li&gt;Submissions collection (millions of documents)&lt;/li&gt;
&lt;li&gt;Whiteboard sessions&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Why PostgreSQL for Analytics?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;br&gt;
✅ ACID transactions&lt;br&gt;
✅ Complex joins for user analytics&lt;br&gt;
✅ TimescaleDB for time-series optimization&lt;br&gt;
✅ Better for aggregations&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Submission analytics (success rate over time)&lt;/li&gt;
&lt;li&gt;User activity logs&lt;/li&gt;
&lt;li&gt;Leaderboard snapshots&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Why Redis?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;br&gt;
✅ Sub-millisecond latency&lt;br&gt;
✅ Sorted Sets for leaderboards (O(log N) operations)&lt;br&gt;
✅ Pub/Sub for WebSocket scaling&lt;br&gt;
✅ Built-in TTL for sessions&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Session storage (7-day TTL)&lt;/li&gt;
&lt;li&gt;Problem caching (1-hour TTL)&lt;/li&gt;
&lt;li&gt;Leaderboard (Redis Sorted Set)&lt;/li&gt;
&lt;li&gt;WebSocket pub/sub&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Leaderboard Implementation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Data Structure: Redis Sorted Set
Command: ZADD leaderboard:contest123 &amp;lt;score&amp;gt; &amp;lt;userId&amp;gt;
Retrieve Top 100: ZREVRANGE leaderboard:contest123 0 99 WITHSCORES

Time Complexity: O(log N)
Handles: 10K users × 5-second polling = 2K QPS easily
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
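
&lt;p&gt;The same two commands from Node.js with &lt;code&gt;ioredis&lt;/code&gt; (a minimal sketch):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;const Redis = require('ioredis');
const redis = new Redis(); // assumes Redis on localhost:6379

// Record or update a user's score; the sorted set stays ordered.
async function submitScore(contestId, userId, score) {
  await redis.zadd(`leaderboard:${contestId}`, score, userId);
}

// Top 100 in descending order; returns [member, score, member, score, ...].
async function topHundred(contestId) {
  return redis.zrevrange(`leaderboard:${contestId}`, 0, 99, 'WITHSCORES');
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;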






&lt;h2&gt;
  
  
  🎯 Architecture Decisions Explained
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Decision 1: Why BullMQ Over AWS SQS?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Comparison:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;BullMQ (Redis)&lt;/th&gt;
&lt;th&gt;AWS SQS&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Latency&lt;/td&gt;
&lt;td&gt;&amp;lt; 10ms&lt;/td&gt;
&lt;td&gt;50-100ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Priority Queues&lt;/td&gt;
&lt;td&gt;✅ Native&lt;/td&gt;
&lt;td&gt;❌ Separate queues&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retry Logic&lt;/td&gt;
&lt;td&gt;✅ Built-in&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local Dev&lt;/td&gt;
&lt;td&gt;✅ Easy&lt;/td&gt;
&lt;td&gt;❌ Need AWS account&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Infrastructure&lt;/td&gt;
&lt;td&gt;Uses existing Redis&lt;/td&gt;
&lt;td&gt;Additional service&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Chosen:&lt;/strong&gt; BullMQ for lower latency and simpler infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Decision 2: Why Socket.io Over Native WebSocket?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Socket.io Advantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Automatic fallback (WebSocket → long polling)&lt;/li&gt;
&lt;li&gt;✅ Reconnection logic built-in&lt;/li&gt;
&lt;li&gt;✅ Room-based messaging&lt;/li&gt;
&lt;li&gt;✅ Cross-platform (web + mobile)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Trade-off:&lt;/strong&gt; Slightly larger bundle size, but better compatibility.&lt;/p&gt;

&lt;h3&gt;
  
  
  Decision 3: Why Next.js Over Pure React?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Next.js Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Server-side rendering (better SEO)&lt;/li&gt;
&lt;li&gt;✅ API routes (no separate Express for simple endpoints)&lt;/li&gt;
&lt;li&gt;✅ Image optimization&lt;/li&gt;
&lt;li&gt;✅ Automatic code splitting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Case:&lt;/strong&gt; Problem listing page needs SEO for Google.&lt;/p&gt;

&lt;h3&gt;
  
  
  Decision 4: Why Separate PostgreSQL for Analytics?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why Not Just MongoDB?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MongoDB aggregations are slower for complex queries&lt;/li&gt;
&lt;li&gt;PostgreSQL better for JOINs (users + submissions + problems)&lt;/li&gt;
&lt;li&gt;TimescaleDB optimizes time-series queries (activity over time)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Trade-off:&lt;/strong&gt; More complexity (2 databases) but better performance.&lt;/p&gt;




&lt;h2&gt;
  
  
  🚀 Performance Metrics
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Achieved SLA:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Code execution: &amp;lt; 3s (p95)&lt;/li&gt;
&lt;li&gt;✅ Page load: &amp;lt; 2s&lt;/li&gt;
&lt;li&gt;✅ API latency: &amp;lt; 500ms (p95)&lt;/li&gt;
&lt;li&gt;✅ WebSocket latency: &amp;lt; 100ms&lt;/li&gt;
&lt;li&gt;✅ Cache hit rate: &amp;gt; 70%&lt;/li&gt;
&lt;li&gt;✅ Uptime: 99.9% (43 minutes downtime/month allowed)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How We Measure:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prometheus for metrics collection&lt;/li&gt;
&lt;li&gt;Grafana for dashboards&lt;/li&gt;
&lt;li&gt;Sentry for error tracking&lt;/li&gt;
&lt;li&gt;ELK Stack for log aggregation&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🎓 Key Learnings
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Async Processing is Non-Negotiable
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Early Mistake:&lt;/strong&gt;&lt;br&gt;
I initially tried synchronous code execution. When 1000 submissions/minute hit, API servers timed out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;br&gt;
BullMQ job queue with auto-scaling workers. Now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API responds instantly with "submitted"&lt;/li&gt;
&lt;li&gt;Worker processes in background&lt;/li&gt;
&lt;li&gt;WebSocket notifies user when done&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Caching is Critical for Performance
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Without Caching:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every problem fetch → MongoDB query&lt;/li&gt;
&lt;li&gt;Every avatar question → 30-second generation time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;With Caching:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;85% of problem queries served from Redis&lt;/li&gt;
&lt;li&gt;70% of avatar videos served from cache&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result:&lt;/strong&gt; 80% reduction in MongoDB load, instant response for cached queries&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Security in Layers, Not Walls
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Wrong Approach:&lt;/strong&gt;&lt;br&gt;
"If our firewall is strong, we're safe."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Right Approach:&lt;/strong&gt;&lt;br&gt;
6 layers of defense. If one fails, 5 remain.&lt;/p&gt;

&lt;p&gt;Example: Even if attacker bypasses rate limiting (Layer 1-2), they hit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JWT validation (Layer 3)&lt;/li&gt;
&lt;li&gt;Input sanitization (Layer 4)&lt;/li&gt;
&lt;li&gt;Docker sandbox (Layer 5)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Monitor Before You Scale
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Built Monitoring First:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prometheus metrics from day one&lt;/li&gt;
&lt;li&gt;Grafana dashboards before launch&lt;/li&gt;
&lt;li&gt;Sentry error tracking in alpha&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt; You can't optimize what you can't measure. Without metrics, scaling is guesswork.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔮 Future Improvements
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Technical Debt to Address
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Self-host Judge0&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current: Using Judge0 API&lt;/li&gt;
&lt;li&gt;Plan: Docker on Kubernetes for better control&lt;/li&gt;
&lt;li&gt;Benefit: More flexibility in resource allocation&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Multi-region Deployment&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current: Single region (us-east-1)&lt;/li&gt;
&lt;li&gt;Issue: High latency for Asia/Europe users&lt;/li&gt;
&lt;li&gt;Plan: CloudFlare Workers + edge caching&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Database Sharding&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current: Single MongoDB replica set&lt;/li&gt;
&lt;li&gt;Trigger: When &amp;gt; 10M submissions&lt;/li&gt;
&lt;li&gt;Strategy: Shard by userId (hashed)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;GraphQL API&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current: REST with over-fetching&lt;/li&gt;
&lt;li&gt;Benefit: Reduce data transfer by 40%&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  🤔 Questions I'd Ask Myself in System Design Interview
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q: Why not use AWS Lambda for code execution?&lt;/strong&gt;&lt;br&gt;
A: Lambda has 15-minute timeout, cold starts add latency. Judge0 in Docker has consistent performance and better resource limits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Why MongoDB AND PostgreSQL? Why not just one?&lt;/strong&gt;&lt;br&gt;
A: Different workloads. MongoDB excels at flexible schemas and horizontal scaling. PostgreSQL excels at complex analytics. Multi-database is common in microservices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: How do you prevent one user from DDoSing your platform?&lt;/strong&gt;&lt;br&gt;
A: Rate limiting at 3 levels - CloudFlare (per IP), Nginx (per user), Application (per API endpoint). Plus BullMQ queue prevents worker overload.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: What happens if Redis goes down?&lt;/strong&gt;&lt;br&gt;
A: 3-node cluster with automatic failover. If all nodes fail: Sessions lost (users re-login), cache miss (MongoDB serves requests), WebSocket disconnects (auto-reconnect). Not ideal, but platform stays up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Why 99.9% uptime and not 99.99%?&lt;/strong&gt;&lt;br&gt;
A: Trade-off between availability and complexity. 99.9% = 43 min/month downtime (acceptable for coding practice). 99.99% requires multi-region deployment with significantly more infrastructure complexity.&lt;/p&gt;




&lt;h2&gt;
  
  
  📖 Recommended Reading
&lt;/h2&gt;

&lt;p&gt;If you're designing a similar system:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Books:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Designing Data-Intensive Applications" by Martin Kleppmann&lt;/li&gt;
&lt;li&gt;"System Design Interview" by Alex Xu&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.hellointerview.com/learn/system-design/problem-breakdowns/leetcode" rel="noopener noreferrer"&gt;LeetCode System Design (HelloInterview)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.mongodb.com/docs/manual/sharding/" rel="noopener noreferrer"&gt;MongoDB Sharding Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://redis.io/commands/zadd/" rel="noopener noreferrer"&gt;Redis Sorted Sets for Leaderboards&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🎯 Conclusion
&lt;/h2&gt;

&lt;p&gt;Building CodeNova taught me that &lt;strong&gt;good architecture is about trade-offs&lt;/strong&gt;, not perfection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Takeaways:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Async everything&lt;/strong&gt; - Queues are your friend&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache aggressively&lt;/strong&gt; - Improves performance and reduces load&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security in layers&lt;/strong&gt; - Defense in depth&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measure first, optimize second&lt;/strong&gt; - Metrics before scaling&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The architecture diagram isn't just boxes and arrows - it represents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hundreds of hours of research&lt;/li&gt;
&lt;li&gt;Dozens of failed experiments&lt;/li&gt;
&lt;li&gt;Lessons from production incidents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;If I were to start over&lt;/strong&gt;, I'd:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Build monitoring first (kept this)&lt;/li&gt;
&lt;li&gt;✅ Use queues from day one (learned this the hard way)&lt;/li&gt;
&lt;li&gt;✅ Start with fewer databases (added PostgreSQL later)&lt;/li&gt;
&lt;li&gt;❌ Not self-host initially (buy before build)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  💬 Discussion
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;How would you design this differently?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Would you use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Serverless (Lambda) instead of Kubernetes?&lt;/li&gt;
&lt;li&gt;GraphQL instead of REST?&lt;/li&gt;
&lt;li&gt;DynamoDB instead of MongoDB?&lt;/li&gt;
&lt;li&gt;Different AI providers?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Drop your thoughts in the comments! 👇&lt;/p&gt;

&lt;p&gt;I'm especially interested in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Better ways to optimize AI response generation&lt;/li&gt;
&lt;li&gt;Better ways to scale WebSockets&lt;/li&gt;
&lt;li&gt;Alternative code execution sandboxes&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Built with ❤️ and lots of ☕ by Bhupesh Chikara&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;#systemdesign #architecture #webdev #ai #mongodb #kubernetes #redis #postgresql #websocket #nodejs #react #typescript #microservices #cloudcomputing #devops&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>The CAP Theorem: Why Consistency, Availability, and Partition Tolerance Can't All Be Friends</title>
      <dc:creator>Bhupesh Chikara</dc:creator>
      <pubDate>Fri, 30 May 2025 07:00:56 +0000</pubDate>
      <link>https://dev.to/bchikara/demystifying-the-cap-theorem-a-developers-guide-5727</link>
      <guid>https://dev.to/bchikara/demystifying-the-cap-theorem-a-developers-guide-5727</guid>
      <description>&lt;p&gt;Hey Devs! 👋&lt;/p&gt;

&lt;p&gt;Heard of the "CAP theorem" in system design? It sounds academic, but it's crucial for distributed systems (like microservices or multi-server databases). This post breaks CAP down simply. Let's go!&lt;/p&gt;

&lt;h2&gt;
  
  
  🤔 What is the CAP Theorem?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvmotz3rmdmshnx2jgeuu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvmotz3rmdmshnx2jgeuu.png" alt="CAP Theorem" width="800" height="666"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The CAP theorem (or Brewer's theorem) is key for distributed data stores. It states &lt;strong&gt;a distributed system can't guarantee all three simultaneously&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;C&lt;/strong&gt;onsistency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A&lt;/strong&gt;vailability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;P&lt;/strong&gt;artition Tolerance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Understanding these trade-offs is vital for good design.&lt;/p&gt;

&lt;h2&gt;
  
  
  🧐 Breaking Down "CAP"
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpuiv2f3nbe08xyfpwwhx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpuiv2f3nbe08xyfpwwhx.png" alt="CAP Theorem Overall Trade-off" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;C&lt;/strong&gt;onsistency: All Nodes See the Same Data, Now
&lt;/h3&gt;

&lt;p&gt;All reads get the most recent write or an error. After a write, all nodes reflect that update, giving users a unified data view.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Analogy:&lt;/em&gt; A shared doc where everyone instantly sees the latest saved version.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;A&lt;/strong&gt;vailability: Every Request Gets a Response
&lt;/h3&gt;

&lt;p&gt;Every request to a working node gets a response. The system is operational, though responses might not always have the absolute latest data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Analogy:&lt;/em&gt; An online store that's always open, even if product info occasionally has a slight delay in updating everywhere.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;P&lt;/strong&gt;artition Tolerance: System Works Despite Network Issues
&lt;/h3&gt;

&lt;p&gt;The system works despite network communication failures between nodes (e.g., due to a failed switch or cable).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Analogy:&lt;/em&gt; Office branches operating independently when their network connection drops, then syncing later.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why Partition Tolerance is Key:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Network failures (partitions) are inevitable in distributed systems. Thus, &lt;strong&gt;P&lt;/strong&gt;artition Tolerance is essential; without it, systems become unreliable during glitches. So, &lt;strong&gt;most distributed systems need partition tolerance&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  ⚖️ The Core Trade-off: CP vs. AP
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvhqbfg8i29266jqus9hb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvhqbfg8i29266jqus9hb.png" alt="CP vs AP Systems Trade-off" width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since Partition Tolerance (P) is usually required, the main CAP trade-off during a partition is between Consistency (C) and Availability (A).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;CP Systems (Consistency + Partition Tolerance)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Prioritize consistency during partitions. If data can't be verified as current, the affected part of the system may become unavailable (refusing writes/reads) to prevent inconsistency.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Use Cases:&lt;/em&gt; Financial systems, inventory management—where accuracy trumps constant uptime.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;AP Systems (Availability + Partition Tolerance)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Prioritize availability during partitions. The system stays operational, even if it means some nodes serve slightly older data (eventual consistency) until the partition resolves.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Use Cases:&lt;/em&gt; Social media, e-commerce listings—where high availability is key, and slight data staleness is acceptable.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🌈 Nuances to CAP
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Not just 2 of 3:&lt;/strong&gt; Real systems have nuanced behaviors; choices aren't always absolute.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency matters:&lt;/strong&gt; Operation speed is critical beyond CAP guarantees.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Eventual Consistency:&lt;/strong&gt; Common in AP systems; data eventually becomes consistent if no new updates occur.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context is King:&lt;/strong&gt; The best CP/AP choice depends on your app's needs (e.g., banking: &lt;strong&gt;CP&lt;/strong&gt;; social feeds: &lt;strong&gt;AP&lt;/strong&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🎁 Wrapping Up
&lt;/h2&gt;

&lt;p&gt;CAP isn't about achieving all three guarantees—it's a model for understanding vital trade-offs in distributed systems. Knowing C, A, and P helps you make informed design choices. Keep CAP in mind when architecting or choosing distributed databases.&lt;/p&gt;

&lt;p&gt;Share your CAP experiences in the comments! 👇&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>captheorem</category>
      <category>highleveldesign</category>
      <category>designpatterns</category>
    </item>
  </channel>
</rss>
