<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Bhupesh Chikara</title>
    <description>The latest articles on DEV Community by Bhupesh Chikara (@bchikara).</description>
    <link>https://dev.to/bchikara</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3098843%2F59b6ca33-10e4-44ce-8456-ec41137692b4.jpeg</url>
      <title>DEV Community: Bhupesh Chikara</title>
      <link>https://dev.to/bchikara</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/bchikara"/>
    <language>en</language>
    <item>
      <title>How to Check 10 Million Usernames in Under 1 Millisecond</title>
      <dc:creator>Bhupesh Chikara</dc:creator>
      <pubDate>Mon, 02 Mar 2026 03:51:38 +0000</pubDate>
      <link>https://dev.to/bchikara/how-to-check-10-million-usernames-in-under-1-millisecond-38bp</link>
      <guid>https://dev.to/bchikara/how-to-check-10-million-usernames-in-under-1-millisecond-38bp</guid>
      <description>&lt;p&gt;Every second, platforms like GitHub, Instagram, and Twitter process thousands of username availability checks. A seemingly simple operation that becomes a critical performance bottleneck at scale. I built a production-grade proof-of-concept to measure exactly how different architectures handle this challenge.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvuct9w9eg6cchys5cfvk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvuct9w9eg6cchys5cfvk.png" alt="GitHub Username Validation Interface - Shows real-time username availability checking during signup" width="800" height="422"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Challenge
&lt;/h2&gt;

&lt;p&gt;When a user types "john_doe" during signup, the system must instantly verify if that username exists among millions of registered users. At 1000 requests per second, this translates to 86.4 million database queries per day.&lt;/p&gt;

&lt;p&gt;Traditional approaches create two fundamental problems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance Degradation&lt;/strong&gt;: Each username check requires a database roundtrip, introducing 10-50ms latency per request.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Infrastructure Costs&lt;/strong&gt;: Database CPU and I/O costs scale linearly with request volume, so the bill grows in lockstep with traffic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Architectural Approaches
&lt;/h2&gt;

&lt;p&gt;I tested three production-viable architectures against a dataset of 10 million usernames under sustained load of 1000 requests/second.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture 1: PostgreSQL Direct Query
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Every request hits PostgreSQL&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SELECT EXISTS(SELECT 1 FROM usernames WHERE username = $1)&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;username&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Characteristics&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;100% accuracy guaranteed&lt;/li&gt;
&lt;li&gt;Network + query latency on every request&lt;/li&gt;
&lt;li&gt;Database becomes the bottleneck&lt;/li&gt;
&lt;li&gt;Simple to implement and maintain&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Architecture 2: Redis In-Memory Cache
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Query Redis SET&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;exists&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sIsMember&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;usernames&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;username&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Characteristics&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;100% accuracy with proper synchronization&lt;/li&gt;
&lt;li&gt;Sub-5ms latency&lt;/li&gt;
&lt;li&gt;Requires ~500MB memory for 10M usernames&lt;/li&gt;
&lt;li&gt;Zero database queries after initial load&lt;/li&gt;
&lt;/ul&gt;
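&lt;p&gt;The "zero database queries after initial load" property depends on a one-time warm-up that copies every username from PostgreSQL into the Redis SET. A minimal sketch of that load, assuming a node-redis v4 client (matching the sIsMember call above) and a users table; the batch size and keyset pagination are illustrative choices:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Hypothetical warm-up: stream usernames from PostgreSQL into the Redis SET.
// Keyset pagination plus batched SADD keeps round trips manageable at 10M rows.
async function warmRedisCache(db, redis) {
  const batchSize = 10000;
  let cursor = '';

  for (;;) {
    const { rows } = await db.query(
      'SELECT username FROM users WHERE username &amp;gt; $1 ORDER BY username LIMIT $2',
      [cursor, batchSize]
    );
    if (rows.length === 0) break;

    // node-redis v4 sAdd accepts an array of members
    await redis.sAdd('usernames', rows.map(r =&amp;gt; r.username));
    cursor = rows[rows.length - 1].username;
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;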

&lt;h3&gt;
  
  
  Architecture 3: Bloom Filter + Database Fallback
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Check Bloom filter first&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;mightExist&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;bloomFilter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;has&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;username&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;mightExist&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Definitely NOT exists - return immediately&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;available&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Maybe exists - verify with database&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;actuallyExists&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SELECT EXISTS...&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;available&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;actuallyExists&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Characteristics&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In-process checks eliminate network overhead&lt;/li&gt;
&lt;li&gt;95% of requests return instantly without database query&lt;/li&gt;
&lt;li&gt;~10MB memory for 10M usernames&lt;/li&gt;
&lt;li&gt;1-5% false positive rate (acceptable for this use case)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc7kmyna20pvcnh6ettgu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc7kmyna20pvcnh6ettgu.png" alt="Performance Comparison Chart - PostgreSQL vs Redis vs Bloom Filter showing latency metrics (P50, P95, P99) and throughput across three architectures" width="800" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Experiment Design
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Infrastructure Stack
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Runtime&lt;/strong&gt;: Node.js 18 with Express&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database&lt;/strong&gt;: PostgreSQL 15 with indexed username column&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache&lt;/strong&gt;: Redis 7.x with optimized memory settings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bloom Filter&lt;/strong&gt;: bloom-filters library (1% false positive rate)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load Testing&lt;/strong&gt;: k6 with realistic traffic patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring&lt;/strong&gt;: Prometheus + Grafana for metrics collection&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Test Methodology
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Dataset&lt;/strong&gt;: 10 million pre-populated usernames&lt;br&gt;
&lt;strong&gt;Traffic Pattern&lt;/strong&gt;: 90% new users, 10% existing users (realistic signup distribution)&lt;br&gt;
&lt;strong&gt;Load Profile&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Phase 1 (0-30s):   Warmup - 100 req/s
Phase 2 (30-90s):  Baseline - 500 req/s
Phase 3 (90-210s): Peak - 1000 req/s
Phase 4 (210-240s): Ramp down
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each test ran for 4 minutes with continuous monitoring of latency, throughput, and resource utilization.&lt;/p&gt;
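&lt;p&gt;For reference, this load profile maps directly onto a k6 ramping-arrival-rate scenario. The sketch below is illustrative: the service URL, port, and payload shape are assumptions, and k6 ramps linearly between stage targets rather than holding each phase perfectly flat:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;import http from 'k6/http';

// Four-phase profile from above, expressed as arrival rates (req/s)
export const options = {
  scenarios: {
    username_checks: {
      executor: 'ramping-arrival-rate',
      startRate: 0,
      timeUnit: '1s',
      preAllocatedVUs: 200,
      maxVUs: 500,
      stages: [
        { target: 100, duration: '30s' },   // Phase 1: warmup
        { target: 500, duration: '60s' },   // Phase 2: baseline
        { target: 1000, duration: '120s' }, // Phase 3: peak
        { target: 0, duration: '30s' },     // Phase 4: ramp down
      ],
    },
  },
};

export default function () {
  // Endpoint and payload are assumptions for illustration
  http.post(
    'http://localhost:3000/check-username',
    JSON.stringify({ username: `user_${Math.random()}` }),
    { headers: { 'Content-Type': 'application/json' } }
  );
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;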

&lt;h2&gt;
  
  
  Performance Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;PostgreSQL&lt;/th&gt;
&lt;th&gt;Redis&lt;/th&gt;
&lt;th&gt;Bloom Filter&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;P50 Latency&lt;/td&gt;
&lt;td&gt;23.4ms&lt;/td&gt;
&lt;td&gt;2.8ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.08ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P95 Latency&lt;/td&gt;
&lt;td&gt;45.2ms&lt;/td&gt;
&lt;td&gt;8.1ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.31ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;P99 Latency&lt;/td&gt;
&lt;td&gt;67.8ms&lt;/td&gt;
&lt;td&gt;14.5ms&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;2.1ms&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Throughput&lt;/td&gt;
&lt;td&gt;850 req/s&lt;/td&gt;
&lt;td&gt;3500 req/s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;12,000 req/s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DB Queries (per 100K)&lt;/td&gt;
&lt;td&gt;100,000&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5,127&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory Usage&lt;/td&gt;
&lt;td&gt;Minimal&lt;/td&gt;
&lt;td&gt;487 MB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;9.6 MB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Accuracy&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;100%&lt;/td&gt;
&lt;td&gt;99.87%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Key Findings
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1. Latency Improvement: 300x Faster&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Bloom filters achieved a P50 latency of 0.08ms compared to PostgreSQL's 23.4ms, a roughly 290x speedup. The improvement comes from eliminating network overhead entirely: the check happens in-process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Database Query Reduction: 95%&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Out of 100,000 requests, the Bloom filter triggered only 5,127 database queries, a 5.13% fallback rate (genuinely taken usernames plus false positives). The remaining 94.87% returned instantly with zero database load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Memory Efficiency: 50x Less&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Redis required 487MB to store 10 million usernames. The Bloom filter used just 9.6MB for the same dataset - a 98% memory reduction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Cost Implications&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At $0.000001 per database query:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PostgreSQL: $86.40/day for 86.4M queries&lt;/li&gt;
&lt;li&gt;Bloom Filter: $4.32/day for 4.32M queries (5% fallback)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Savings&lt;/strong&gt;: $2,462/month at 1000 req/s sustained load&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Understanding Bloom Filters
&lt;/h2&gt;

&lt;p&gt;A Bloom filter is a probabilistic data structure that tests set membership with two critical properties:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No False Negatives&lt;/strong&gt;: If the filter says an element doesn't exist, it's 100% correct.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Possible False Positives&lt;/strong&gt;: If the filter says an element might exist, you must verify with the source of truth.&lt;/p&gt;

&lt;h3&gt;
  
  
  How It Works
&lt;/h3&gt;

&lt;p&gt;When adding a username to the filter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"john_doe" → hash1(x), hash2(x), hash3(x) → set bits at positions [142, 891, 1523]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When checking if a username exists:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;If ANY bit is 0 → Definitely NOT exists (return immediately)
If ALL bits are 1 → MAYBE exists (check database)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The false positive rate is configurable through the bit array size and the number of hash functions. Our implementation targets a 1% FPR, meaning only about 1 in 100 absent usernames is falsely flagged as "might exist" and sent to the database for verification.&lt;/p&gt;
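&lt;p&gt;The standard sizing formulas make this trade-off concrete: for n items and a target false positive rate p, the optimal bit count is m = -n * ln(p) / (ln 2)^2 and the optimal hash count is k = (m / n) * ln 2. A quick sketch of the arithmetic for this article's parameters (the theoretical ~11.4 MB is the same order as the 9.6 MB measured above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Optimal Bloom filter sizing for n = 10M items at p = 1% FPR
const n = 10_000_000;
const p = 0.01;

const m = Math.ceil(-(n * Math.log(p)) / Math.log(2) ** 2); // bits in the array
const k = Math.round((m / n) * Math.log(2));                // hash functions

console.log((m / 8 / 1024 / 1024).toFixed(1), 'MB'); // ~11.4 MB
console.log(k, 'hash functions');                    // 7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;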

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdflflfh2dooyl82uumro.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdflflfh2dooyl82uumro.png" alt="Bloom Filter Bit Array Visualization - Demonstrates how hash functions map usernames to bit positions and how membership checks work with visual representation of bit array operations" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Production Considerations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  When to Use Each Approach
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Choose PostgreSQL&lt;/strong&gt; when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Request volume is low (&amp;lt;100 req/s)&lt;/li&gt;
&lt;li&gt;Strong consistency is critical&lt;/li&gt;
&lt;li&gt;Team lacks operational expertise for distributed systems&lt;/li&gt;
&lt;li&gt;Simplicity outweighs performance optimization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choose Redis&lt;/strong&gt; when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High traffic requires &amp;lt;5ms response times&lt;/li&gt;
&lt;li&gt;100% accuracy is non-negotiable&lt;/li&gt;
&lt;li&gt;Budget supports 50MB+ memory per million items&lt;/li&gt;
&lt;li&gt;Need distributed caching across multiple services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choose Bloom Filters&lt;/strong&gt; when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Massive scale demands sub-millisecond response&lt;/li&gt;
&lt;li&gt;Negative lookups dominate (90%+ checking new items)&lt;/li&gt;
&lt;li&gt;Memory constraints exist&lt;/li&gt;
&lt;li&gt;1-5% false positive rate is acceptable&lt;/li&gt;
&lt;li&gt;Minimizing database load is critical&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Handling False Positives
&lt;/h3&gt;

&lt;p&gt;False positives in username availability checks are &lt;strong&gt;operationally acceptable&lt;/strong&gt; because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User experience remains identical (username shown as taken)&lt;/li&gt;
&lt;li&gt;Database verification occurs transparently&lt;/li&gt;
&lt;li&gt;No data corruption or incorrect state&lt;/li&gt;
&lt;li&gt;Performance benefit outweighs rare false positive
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Bloom filter says "maybe exists"&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;bloomFilter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;has&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;username&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Always verify with database&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;actuallyExists&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SELECT EXISTS...&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;actuallyExists&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// False positive detected&lt;/span&gt;
    &lt;span class="c1"&gt;// User can still register - no negative impact&lt;/span&gt;
    &lt;span class="nx"&gt;metrics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;increment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;bloom_filter_false_positive&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Data Synchronization
&lt;/h3&gt;

&lt;p&gt;Maintain filter accuracy through:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// On new user registration&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;registerUser&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;username&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// 1. Insert into database&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;INSERT INTO users (username) VALUES ($1)&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;username&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;

  &lt;span class="c1"&gt;// 2. Update Bloom filter immediately&lt;/span&gt;
  &lt;span class="nx"&gt;bloomFilter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;username&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// 3. Persist filter periodically (every hour)&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;shouldPersist&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;saveBloomFilterToDisk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;bloomFilter&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
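&lt;p&gt;One caveat the registration path above does not cover: a standard Bloom filter supports additions but not removals, so deleted or renamed accounts leave stale bits behind. These act as extra false positives and slowly raise the database-fallback rate. A common mitigation, sketched here under the same setup (the rebuild interval is an illustrative assumption), is to rebuild the filter from the source of truth on a schedule:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Periodic rebuild: recreate the filter from the database so usernames freed
// by deleted accounts stop triggering unnecessary database fallbacks.
async function rebuildBloomFilter(db) {
  const fresh = BloomFilter.create(10000000, 0.01);

  const { rows } = await db.query('SELECT username FROM users');
  rows.forEach(r =&amp;gt; fresh.add(r.username));

  // Atomic reference swap; in-flight checks keep using the old filter
  bloomFilter = fresh;
}

// Once a day is illustrative; tune to your deletion volume
setInterval(() =&amp;gt; rebuildBloomFilter(db).catch(console.error),
            24 * 60 * 60 * 1000);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;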



&lt;h2&gt;
  
  
  Real-World Applications
&lt;/h2&gt;

&lt;p&gt;Major technology platforms have reportedly used Bloom filters for similar problems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt;: Screens user passwords against 10 billion leaked credentials in &amp;lt;1ms&lt;br&gt;
&lt;strong&gt;Medium&lt;/strong&gt;: Filters already-read articles from recommendation feeds&lt;br&gt;
&lt;strong&gt;Google Chrome&lt;/strong&gt;: Historically pre-checked URLs against a local Safe Browsing filter before contacting the server&lt;br&gt;
&lt;strong&gt;Akamai CDN&lt;/strong&gt;: Performs cache existence checks at edge nodes&lt;br&gt;
&lt;strong&gt;Bitcoin&lt;/strong&gt;: Light (SPV) clients use Bloom filters to subscribe to transactions relevant to their wallets&lt;/p&gt;
&lt;h2&gt;
  
  
  Implementation Guide
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Building the Bloom Filter
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;BloomFilter&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;require&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;bloom-filters&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Create filter: 10M capacity, 1% false positive rate&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;filter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;BloomFilter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Load existing usernames&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;users&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SELECT username FROM users&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forEach&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;user&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;username&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;

&lt;span class="c1"&gt;// Persist to disk&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;filterData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;size&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;size&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;nbHashes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;nbHashes&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;bits&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Array&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;from&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;_bits&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;totalItems&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;users&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;fpr&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.01&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="nx"&gt;fs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;writeFileSync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;bloom-filter.json&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;filterData&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
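&lt;p&gt;Note that the serialization above reaches into the library's private _bits field, which can break across versions. Recent releases of the bloom-filters library expose saveAsJSON() and BloomFilter.fromJSON() for exactly this purpose; a sketch assuming those helpers are available:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;const fs = require('fs');
const { BloomFilter } = require('bloom-filters');

// Persist using the library's own serializer (no private fields involved)
fs.writeFileSync('bloom-filter.json', JSON.stringify(filter.saveAsJSON()));

// ...and restore it at startup, before the API starts accepting traffic
const saved = JSON.parse(fs.readFileSync('bloom-filter.json', 'utf8'));
const restored = BloomFilter.fromJSON(saved);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;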

&lt;h3&gt;
  
  
  Production API Implementation
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;/check-username&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;username&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;req&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

  &lt;span class="c1"&gt;// In-process Bloom filter check&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;mightExist&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;bloomFilter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;has&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;username&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;mightExist&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// 95% of requests return here&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;available&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;latency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;bloom_filter&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="c1"&gt;// 5% of requests verify with database&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;SELECT EXISTS(SELECT 1 FROM users WHERE username = $1)&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;username&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="nx"&gt;res&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;available&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;exists&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;latency&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;database_fallback&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;falsePositive&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;exists&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Reproduction Instructions
&lt;/h2&gt;

&lt;p&gt;Full source code and setup available on GitHub:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/builtbychikara/WhatIfSeries.git
&lt;span class="nb"&gt;cd &lt;/span&gt;what-if/bloom-filter

&lt;span class="c"&gt;# Install dependencies&lt;/span&gt;
npm &lt;span class="nb"&gt;install&lt;/span&gt;

&lt;span class="c"&gt;# Start infrastructure (PostgreSQL, Redis, Prometheus, Grafana)&lt;/span&gt;
npm run docker:up

&lt;span class="c"&gt;# Seed 10M usernames (takes ~15 minutes)&lt;/span&gt;
npm run seed

&lt;span class="c"&gt;# Run services (3 separate terminals)&lt;/span&gt;
npm run start:postgres  &lt;span class="c"&gt;# Terminal 1&lt;/span&gt;
npm run start:redis     &lt;span class="c"&gt;# Terminal 2&lt;/span&gt;
npm run start:bloom     &lt;span class="c"&gt;# Terminal 3&lt;/span&gt;

&lt;span class="c"&gt;# Execute load tests&lt;/span&gt;
npm run experiment

&lt;span class="c"&gt;# View results&lt;/span&gt;
open http://localhost:3100  &lt;span class="c"&gt;# Grafana dashboard&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Bloom filters provide a compelling solution for username availability checks at scale:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;300x latency improvement&lt;/strong&gt; over direct database queries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;95% reduction&lt;/strong&gt; in database load&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;50x memory efficiency&lt;/strong&gt; compared to Redis caching&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Production-proven&lt;/strong&gt; in systems like GitHub, Medium, and Chrome&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 1-5% false positive rate is an acceptable trade-off for the massive performance and cost benefits. For high-traffic applications where negative lookups dominate, Bloom filters are a battle-tested solution.&lt;/p&gt;

&lt;p&gt;The complete proof-of-concept demonstrates production-grade implementation with comprehensive monitoring, load testing, and detailed metrics. All code is open source and ready to run.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;GitHub Repository&lt;/strong&gt;: &lt;a href="https://github.com/builtbychikara/WhatIfSeries/tree/main/what-if/bloom-filter" rel="noopener noreferrer"&gt;https://github.com/builtbychikara/WhatIfSeries/tree/main/what-if/bloom-filter&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tech Stack&lt;/strong&gt;: Node.js, PostgreSQL, Redis, Bloom Filters, k6, Prometheus, Grafana, Docker&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Part of the What If Series&lt;/strong&gt; - Production-grade POCs exploring system design decisions through empirical measurement.&lt;/p&gt;

</description>
      <category>performance</category>
      <category>database</category>
      <category>redis</category>
      <category>postgres</category>
    </item>
    <item>
      <title>What If Your Database Goes Down? REST vs Kafka Under Fire</title>
      <dc:creator>Bhupesh Chikara</dc:creator>
      <pubDate>Thu, 26 Feb 2026 22:01:13 +0000</pubDate>
      <link>https://dev.to/bchikara/what-if-your-database-goes-down-rest-vs-kafka-under-fire-3469</link>
      <guid>https://dev.to/bchikara/what-if-your-database-goes-down-rest-vs-kafka-under-fire-3469</guid>
      <description>&lt;h1&gt;
  
  
  What If Your Database Goes Down? REST vs Kafka Under Fire
&lt;/h1&gt;

&lt;p&gt;Companies like Uber, Netflix, and Airbnb have migrated from traditional REST APIs to event-driven architectures. This shift isn't driven by trends, but by fundamental architectural resilience requirements. To quantify these differences, I built a production-grade chaos engineering proof-of-concept, simulating real-world database failures under sustained load.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Hypothesis
&lt;/h2&gt;

&lt;p&gt;Traditional synchronous REST APIs create tight coupling between API servers and databases. When the database fails, the API fails. Event-driven architectures using Kafka decouple producers from consumers through message buffering, theoretically providing resilience during infrastructure failures.&lt;/p&gt;

&lt;p&gt;I designed an experiment to measure this resilience difference empirically.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5s7g07yrdedkdf2l9yud.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5s7g07yrdedkdf2l9yud.png" alt="REST vs Kafka Architecture Comparison" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Experiment Design
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Scenario
&lt;/h3&gt;

&lt;p&gt;I simulated Uber's real-time driver location tracking system, where high-frequency updates must be reliably persisted to PostgreSQL.&lt;/p&gt;

&lt;h3&gt;
  
  
  Infrastructure
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Load Pattern&lt;/strong&gt;: Constant 50 requests/second&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failure Injection&lt;/strong&gt;: PostgreSQL crash for 120 seconds (t=90s to t=210s)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Duration&lt;/strong&gt;: 5 minutes total (300 seconds)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load Testing&lt;/strong&gt;: k6 with automated orchestration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring&lt;/strong&gt;: Prometheus + Grafana for real-time metrics collection&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Architectures Tested
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Architecture A: Synchronous REST&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Direct HTTP POST to REST API&lt;/li&gt;
&lt;li&gt;Immediate database INSERT&lt;/li&gt;
&lt;li&gt;Response after database confirmation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Architecture B: Asynchronous Kafka&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;HTTP POST to Producer API&lt;/li&gt;
&lt;li&gt;Event published to Kafka topic&lt;/li&gt;
&lt;li&gt;Consumer processes events with circuit breaker pattern&lt;/li&gt;
&lt;li&gt;Asynchronous database INSERT&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both architectures handled identical traffic patterns and experienced the same database failure window.&lt;/p&gt;
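&lt;p&gt;For reference, the constant-load portion of this setup is straightforward to express as a k6 constant-arrival-rate scenario. The sketch below is illustrative: the route and payload shape are assumptions, and the ports follow the reproduction guide later in this post (3001 for REST, 3002 for the Kafka producer):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;import http from 'k6/http';

// Constant 50 req/s against one architecture. The database crash at t=90s
// is injected externally, so the load script itself never changes.
export const options = {
  scenarios: {
    steady_load: {
      executor: 'constant-arrival-rate',
      rate: 50,
      timeUnit: '1s',
      duration: '300s',
      preAllocatedVUs: 100,
    },
  },
};

export default function () {
  // Swap the port to 3002 to drive the Kafka producer instead
  http.post(
    'http://localhost:3001/locations',
    JSON.stringify({ driverId: 42, lat: 37.77, lng: -122.42 }),
    { headers: { 'Content-Type': 'application/json' } }
  );
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;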

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  REST API Performance
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Metric                  Value
Total Requests         30,002
Successful             15,001
Failed                 15,001
Error Rate             50.00%
P50 Latency            3.60ms
P95 Latency            15.55ms
Average Latency        11.24ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Failure Characteristics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Immediate error propagation to clients&lt;/li&gt;
&lt;li&gt;No request buffering capability&lt;/li&gt;
&lt;li&gt;Manual intervention required for recovery&lt;/li&gt;
&lt;li&gt;50% of requests returned HTTP 500 errors&lt;/li&gt;
&lt;li&gt;Complete service degradation during database outage&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Kafka Architecture Performance
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Metric                  Value
Total Requests         15,000
Successful             15,000
Failed                 0
Error Rate             0.00%
P50 Latency            3.47ms
P95 Latency            12.10ms
Average Latency        6.77ms
Latency Improvement    39.77%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Failure Characteristics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Zero client-facing errors&lt;/li&gt;
&lt;li&gt;Automatic request buffering in Kafka topics&lt;/li&gt;
&lt;li&gt;Circuit breaker pattern enabled graceful degradation&lt;/li&gt;
&lt;li&gt;Automatic recovery when database restored&lt;/li&gt;
&lt;li&gt;Transparent failure handling from client perspective&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture Analysis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  REST: Synchronous Coupling
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client → REST API → PostgreSQL → Response
           ↓ (if DB fails)
       HTTP 500 Error
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Failure Mode:&lt;/strong&gt;&lt;br&gt;
When PostgreSQL becomes unavailable, the REST API has no buffering mechanism. Each incoming request attempts a database connection, fails after timeout (2000ms), and returns an error to the client. This creates a cascading failure pattern where API health is directly tied to database availability.&lt;/p&gt;
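&lt;p&gt;In code, the coupling is easy to see: the handler cannot respond until the INSERT resolves, so a dead database turns every request into a timeout followed by a 500. A minimal sketch using Express and node-postgres (route and schema are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Synchronous REST: the client's response is gated on the database write
app.post('/locations', async (req, res) =&amp;gt; {
  const { driverId, lat, lng } = req.body;
  try {
    await db.query(
      'INSERT INTO driver_locations (driver_id, lat, lng) VALUES ($1, $2, $3)',
      [driverId, lat, lng]
    );
    res.status(201).json({ persisted: true });
  } catch (err) {
    // The connection timeout (~2000ms) surfaces directly to the caller
    res.status(500).json({ error: 'database unavailable' });
  }
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;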

&lt;p&gt;&lt;strong&gt;Characteristics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strong consistency guarantees&lt;/li&gt;
&lt;li&gt;Immediate failure feedback&lt;/li&gt;
&lt;li&gt;No request buffering&lt;/li&gt;
&lt;li&gt;Tight coupling between components&lt;/li&gt;
&lt;li&gt;Simple operational model&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Kafka: Asynchronous Decoupling
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Client → Producer API → Kafka Topic → Consumer → PostgreSQL
         ↓ (immediate)           ↓ (buffered)
     HTTP 202 Accepted     Circuit Breaker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Failure Mode:&lt;/strong&gt;&lt;br&gt;
When PostgreSQL fails, the Producer API continues accepting requests and publishing events to Kafka. The Consumer detects database failures through the circuit breaker pattern, transitions to OPEN state, and stops attempting writes. Events accumulate in Kafka's persistent log. When PostgreSQL recovers, the circuit breaker transitions to HALF_OPEN, tests connectivity, and resumes processing the buffered events.&lt;/p&gt;
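&lt;p&gt;The producer side of this pipeline is what keeps clients unaware of the outage: it acknowledges as soon as the event is appended to the topic. A sketch using KafkaJS, the client named in the tech stack below (topic name, route, and payload are assumptions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;const { Kafka } = require('kafkajs');

const kafka = new Kafka({ clientId: 'producer-api', brokers: ['localhost:9092'] });
const producer = kafka.producer();

app.post('/locations', async (req, res) =&amp;gt; {
  // Append to the Kafka log; PostgreSQL is never on this code path
  await producer.send({
    topic: 'driver-locations',
    messages: [{ key: String(req.body.driverId), value: JSON.stringify(req.body) }],
  });

  // 202 Accepted: the write is buffered durably, not yet in the database
  res.status(202).json({ accepted: true });
});

producer.connect().then(() =&amp;gt; app.listen(3002));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;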

&lt;p&gt;&lt;strong&gt;Characteristics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Eventual consistency model&lt;/li&gt;
&lt;li&gt;Request buffering via Kafka topics&lt;/li&gt;
&lt;li&gt;Automatic failure detection and recovery&lt;/li&gt;
&lt;li&gt;Component independence&lt;/li&gt;
&lt;li&gt;Complex operational requirements&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Circuit Breaker Implementation
&lt;/h2&gt;

&lt;p&gt;The circuit breaker pattern is critical for preventing cascading failures in distributed systems. Our implementation uses a state machine with three states:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CircuitBreaker&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;threshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;timeout&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;30000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;resetTimeout&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;60000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;CLOSED&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;failureThreshold&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resetTimeout&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;resetTimeout&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;failureCount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;nextAttempt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;OPEN&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;nextAttempt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Circuit breaker OPEN - database unavailable&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
      &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;HALF_OPEN&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
      &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;onSuccess&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;onFailure&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
      &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nf"&gt;onSuccess&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;failureCount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;HALF_OPEN&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;successCount&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;successCount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;CLOSED&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;successCount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="nf"&gt;onFailure&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;failureCount&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;failureCount&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;failureThreshold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;OPEN&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
      &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;nextAttempt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="k"&gt;this&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;resetTimeout&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;State Transitions:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CLOSED&lt;/strong&gt;: Normal operation, all requests processed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OPEN&lt;/strong&gt;: Failure threshold exceeded, stop attempting database writes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;HALF_OPEN&lt;/strong&gt;: Testing recovery, require 3 consecutive successes before full recovery&lt;/li&gt;
&lt;/ul&gt;
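&lt;p&gt;Wired into the consumer, the breaker wraps every database write. While it is OPEN, execute() throws before touching PostgreSQL, the message is not acknowledged, and the event is retried later from Kafka's log. A sketch of that wiring with a KafkaJS consumer (topic name and SQL are assumptions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;const breaker = new CircuitBreaker(5, 30000, 60000);

async function runConsumer() {
  await consumer.subscribe({ topic: 'driver-locations' });
  await consumer.run({
    eachMessage: async ({ message }) =&amp;gt; {
      const loc = JSON.parse(message.value.toString());

      // Throws while OPEN; KafkaJS retries the message rather than losing it
      await breaker.execute(() =&amp;gt;
        db.query(
          'INSERT INTO driver_locations (driver_id, lat, lng) VALUES ($1, $2, $3)',
          [loc.driverId, loc.lat, loc.lng]
        )
      );
    },
  });
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;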

&lt;h2&gt;
  
  
  Business Impact Analysis
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Assumptions
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Request rate: 1,000 requests/second&lt;/li&gt;
&lt;li&gt;Revenue per request: $0.01&lt;/li&gt;
&lt;li&gt;Database outage duration: 5 minutes&lt;/li&gt;
&lt;li&gt;Error rate impact: 50% (REST) vs 0% (Kafka)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  REST Architecture Impact
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Failed Requests:    150,000 (50% × 1,000 req/s × 300s)
Revenue Loss:       $1,500
Recovery Time:      Manual intervention required
Customer Impact:    Severe degradation, visible errors
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Kafka Architecture Impact
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Failed Requests:    0 (buffered in Kafka)
Revenue Loss:       $0
Recovery Time:      Automatic
Customer Impact:    None (transparent to clients)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;ROI Calculation:&lt;/strong&gt;&lt;br&gt;
At scale, assuming one 5-minute database outage per month, Kafka prevents $18,000 in annual revenue loss. This excludes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Customer support costs from incident tickets&lt;/li&gt;
&lt;li&gt;Engineering time for manual recovery&lt;/li&gt;
&lt;li&gt;Reputational damage from service degradation&lt;/li&gt;
&lt;li&gt;Compliance implications for SLA violations&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Technical Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Runtime&lt;/strong&gt;: Node.js 18+ with Express&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Message Broker&lt;/strong&gt;: Apache Kafka 3.x with KafkaJS client&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Database&lt;/strong&gt;: PostgreSQL 15 with connection pooling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Load Testing&lt;/strong&gt;: k6 with custom JavaScript scenarios&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monitoring&lt;/strong&gt;: Prometheus 2.x + Grafana 9.x&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Orchestration&lt;/strong&gt;: Docker Compose for reproducible environments&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chaos Engineering&lt;/strong&gt;: Automated failure injection scripts&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Observability Metrics
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Prometheus Instrumentation
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// REST API Metrics&lt;/span&gt;
&lt;span class="nx"&gt;rest_http_requests_total&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;status&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;200|500&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;rest_request_duration_seconds&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;quantile&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;0.5|0.95|0.99&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Kafka Producer Metrics&lt;/span&gt;
&lt;span class="nx"&gt;kafka_produce_total&lt;/span&gt;
&lt;span class="nx"&gt;kafka_produce_latency_milliseconds&lt;/span&gt;
&lt;span class="nx"&gt;kafka_produce_errors_total&lt;/span&gt;

&lt;span class="c1"&gt;// Consumer Metrics with Circuit Breaker&lt;/span&gt;
&lt;span class="nx"&gt;circuit_breaker_state&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;CLOSED|HALF_OPEN|OPEN&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;kafka_consumer_lag_seconds&lt;/span&gt;
&lt;span class="nx"&gt;db_write_failures_total&lt;/span&gt;
&lt;span class="nx"&gt;db_connection_pool_size&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;These metrics enabled real-time failure detection and post-mortem analysis of system behavior during the outage window.&lt;/p&gt;
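&lt;p&gt;These series come from standard prom-client instrumentation inside each service. A sketch of how the circuit breaker gauge and failure counter might be registered (metric names mirror the list above; the numeric state encoding is an assumption):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;const client = require('prom-client');

// Encode breaker state numerically: 0 = CLOSED, 1 = HALF_OPEN, 2 = OPEN
const breakerState = new client.Gauge({
  name: 'circuit_breaker_state',
  help: 'Current circuit breaker state',
});

const dbWriteFailures = new client.Counter({
  name: 'db_write_failures_total',
  help: 'Database writes that failed inside the consumer',
});

// Call from the breaker's state transitions
function recordState(state) {
  breakerState.set({ CLOSED: 0, HALF_OPEN: 1, OPEN: 2 }[state]);
}

// Expose /metrics for the Prometheus scraper
app.get('/metrics', async (req, res) =&amp;gt; {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;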
&lt;h2&gt;
  
  
  Reproduction Guide
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install k6 load testing tool&lt;/span&gt;
brew &lt;span class="nb"&gt;install &lt;/span&gt;k6  &lt;span class="c"&gt;# macOS&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt-get &lt;span class="nb"&gt;install &lt;/span&gt;k6  &lt;span class="c"&gt;# Linux&lt;/span&gt;

&lt;span class="c"&gt;# Clone repository&lt;/span&gt;
git clone https://github.com/builtbychikara/WhatIfSeries
&lt;span class="nb"&gt;cd &lt;/span&gt;WhatIfSeries/what-if/kafka-vs-rest-polling
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Infrastructure Startup
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Start Kafka, PostgreSQL, Prometheus, and Grafana&lt;/span&gt;
docker-compose up &lt;span class="nt"&gt;-d&lt;/span&gt;

&lt;span class="c"&gt;# Verify all services are healthy&lt;/span&gt;
docker-compose ps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Service Deployment
&lt;/h3&gt;

&lt;p&gt;Open three terminal windows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Terminal 1: REST API (Port 3001)&lt;/span&gt;
npm run start:rest

&lt;span class="c"&gt;# Terminal 2: Kafka Producer API (Port 3002)&lt;/span&gt;
npm run start:kafka

&lt;span class="c"&gt;# Terminal 3: Kafka Consumer with Circuit Breaker (Port 3003)&lt;/span&gt;
npm run start:consumer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Execute Experiment
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Run automated chaos engineering experiment&lt;/span&gt;
npm run experiment
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Experiment Timeline
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Phase         Time        Load        Database State
Warmup        0-30s       25 req/s    Healthy
Normal        30-90s      50 req/s    Healthy
Crash         90-210s     50 req/s    OFFLINE
Recovery      210-300s    50 req/s    Healthy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The experiment script automatically injects the database failure at t=90s and restores it at t=210s, while maintaining constant load throughout.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Findings
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Resilience Through Decoupling
&lt;/h3&gt;

&lt;p&gt;Event-driven architectures achieve resilience not through redundancy, but through temporal decoupling. Kafka's persistent log provides a buffer that absorbs transient failures, converting availability problems into latency problems.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Circuit Breaker Necessity
&lt;/h3&gt;

&lt;p&gt;Without the circuit breaker pattern, the Kafka consumer would continuously retry failed database writes, wasting resources and potentially overwhelming the database during recovery. The circuit breaker provides the following (a minimal sketch appears after this list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automatic failure detection&lt;/li&gt;
&lt;li&gt;Resource conservation during outages&lt;/li&gt;
&lt;li&gt;Controlled recovery testing&lt;/li&gt;
&lt;li&gt;Metrics for failure state visibility&lt;/li&gt;
&lt;/ul&gt;
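
&lt;p&gt;Concretely, such a breaker might look like the sketch below. The state names match the &lt;code&gt;circuit_breaker_state&lt;/code&gt; metric earlier; the thresholds and structure are assumptions, not the repository's implementation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Minimal circuit breaker sketch: CLOSED -&amp;gt; OPEN after repeated failures,
// OPEN -&amp;gt; HALF_OPEN after a cooldown, HALF_OPEN -&amp;gt; CLOSED on one success.
class CircuitBreaker {
  constructor(action, { failureThreshold = 5, cooldownMs = 10000 } = {}) {
    this.action = action;
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.failures = 0;
    this.state = 'CLOSED';
    this.openedAt = 0;
  }

  async call(...args) {
    if (this.state === 'OPEN') {
      if (Date.now() - this.openedAt &amp;lt; this.cooldownMs) {
        throw new Error('circuit open: skipping database write');
      }
      this.state = 'HALF_OPEN'; // let one probe request through
    }
    try {
      const result = await this.action(...args);
      this.failures = 0;
      this.state = 'CLOSED';
      return result;
    } catch (err) {
      this.failures += 1;
      if (this.state === 'HALF_OPEN' || this.failures &amp;gt;= this.failureThreshold) {
        this.state = 'OPEN';
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The consumer would wrap each database write in &lt;code&gt;breaker.call(...)&lt;/code&gt; and pause message processing while the breaker reports &lt;code&gt;OPEN&lt;/code&gt;.&lt;/p&gt;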

&lt;h3&gt;
  
  
  3. Consistency Trade-offs
&lt;/h3&gt;

&lt;p&gt;REST provides strong consistency - clients know immediately whether their write succeeded. Kafka provides eventual consistency - clients receive acknowledgment that the write is buffered, not that it's persisted. This trade-off must align with business requirements.&lt;/p&gt;
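
&lt;p&gt;A hedged sketch of the two acknowledgment semantics, side by side; the handlers, topic name, and setup here are illustrative, not the repo's endpoints:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;const express = require('express');
const { Pool } = require('pg');
const { Kafka } = require('kafkajs');

const db = new Pool(); // reads PG* environment variables
const producer = new Kafka({ brokers: ['localhost:9092'] }).producer();
// (assume producer.connect() has been awaited during startup)

const restApp = express().use(express.json());
const kafkaApp = express().use(express.json());

// Synchronous REST: a 201 means the row is durably written.
restApp.post('/usernames', async (req, res) =&amp;gt; {
  try {
    await db.query('INSERT INTO usernames (name) VALUES ($1)', [req.body.name]);
    res.status(201).end();
  } catch (err) {
    res.status(500).end(); // a database outage is visible to the client
  }
});

// Kafka-backed: a 202 means the event is buffered in the log,
// not that it has reached the database yet.
kafkaApp.post('/usernames', async (req, res) =&amp;gt; {
  await producer.send({
    topic: 'username-events', // illustrative topic name
    messages: [{ value: JSON.stringify({ name: req.body.name }) }],
  });
  res.status(202).end(); // persistence happens later, in the consumer
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;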

&lt;h3&gt;
  
  
  4. Operational Complexity
&lt;/h3&gt;

&lt;p&gt;Kafka introduces operational overhead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Additional infrastructure (brokers, ZooKeeper/KRaft)&lt;/li&gt;
&lt;li&gt;Message ordering guarantees to maintain&lt;/li&gt;
&lt;li&gt;Consumer lag monitoring requirements&lt;/li&gt;
&lt;li&gt;Topic partition management&lt;/li&gt;
&lt;li&gt;Schema evolution considerations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This complexity must be justified by resilience requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  Decision Framework
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Choose REST When:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Traffic Patterns&lt;/strong&gt;: Request rate under 100 req/s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistency Requirements&lt;/strong&gt;: Strong consistency required&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Team Capabilities&lt;/strong&gt;: Limited operational expertise&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System Simplicity&lt;/strong&gt;: Monolithic architecture preferred&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency Requirements&lt;/strong&gt;: Sub-10ms response times required&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Choose Kafka When:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Traffic Patterns&lt;/strong&gt;: Request rate exceeds 1,000 req/s&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resilience Requirements&lt;/strong&gt;: Infrastructure failures must be transparent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Event Consumers&lt;/strong&gt;: Multiple downstream systems need the same events&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistency Trade-offs&lt;/strong&gt;: Eventual consistency acceptable&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operational Maturity&lt;/strong&gt;: Team can manage distributed systems&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Future Work
&lt;/h2&gt;

&lt;p&gt;This experiment is part of the What If Series, exploring system design decisions through empirical measurement rather than theoretical analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Upcoming Experiments:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High-throughput scaling: Measuring Kafka performance at 1M+ req/s&lt;/li&gt;
&lt;li&gt;Cache failure analysis: Redis outage impact on application tier&lt;/li&gt;
&lt;li&gt;Real-time processing: Stream processing 1TB of logs with sub-second latency&lt;/li&gt;
&lt;li&gt;Network partition testing: Consistency guarantees under split-brain scenarios&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Code Repository
&lt;/h2&gt;

&lt;p&gt;Full implementation including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complete source code for both architectures&lt;/li&gt;
&lt;li&gt;Docker Compose infrastructure definitions&lt;/li&gt;
&lt;li&gt;k6 load testing scenarios&lt;/li&gt;
&lt;li&gt;Prometheus metric exporters&lt;/li&gt;
&lt;li&gt;Automated chaos injection scripts&lt;/li&gt;
&lt;li&gt;Grafana dashboard configurations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/builtbychikara/WhatIfSeries/tree/main/what-if/kafka-vs-rest-polling" rel="noopener noreferrer"&gt;https://github.com/builtbychikara/WhatIfSeries/tree/main/what-if/kafka-vs-rest-polling&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Files&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;kafka-pubsub-service/consumer.js&lt;/code&gt; - Circuit breaker implementation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;k6/runner.js&lt;/code&gt; - Chaos engineering experiment orchestration&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;docker-compose.yml&lt;/code&gt; - Complete infrastructure stack&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Event-driven architectures using Kafka demonstrate measurably superior resilience during database failures, with zero client-facing errors compared to 50% error rates in synchronous REST implementations. However, this resilience comes at the cost of operational complexity and consistency guarantees.&lt;/p&gt;

&lt;p&gt;The choice between REST and Kafka should be driven by quantified requirements for resilience, throughput, and acceptable consistency models, not by architectural preferences or industry trends.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Acknowledgments&lt;/strong&gt;: Built in collaboration with Aakanksh Singh to demonstrate production-grade system design patterns through empirical measurement.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tech Stack&lt;/strong&gt;: Node.js, Apache Kafka, PostgreSQL, k6, Prometheus, Grafana, Docker&lt;/p&gt;

</description>
      <category>kafka</category>
      <category>systemdesign</category>
      <category>chaosengineering</category>
      <category>node</category>
    </item>
    <item>
      <title>Building CodeNova: System Design Deep Dive into an AI-Enhanced Coding Platform</title>
      <dc:creator>Bhupesh Chikara</dc:creator>
      <pubDate>Sun, 30 Nov 2025 01:14:02 +0000</pubDate>
      <link>https://dev.to/bchikara/building-codenova-system-design-deep-dive-into-an-ai-enhanced-coding-platform-11d4</link>
      <guid>https://dev.to/bchikara/building-codenova-system-design-deep-dive-into-an-ai-enhanced-coding-platform-11d4</guid>
      <description>&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;I designed and built &lt;strong&gt;CodeNova&lt;/strong&gt;, a scalable coding interview platform handling 10K+ concurrent users with three AI-powered features: &lt;strong&gt;video avatar tutor&lt;/strong&gt;, &lt;strong&gt;algorithm visualizer&lt;/strong&gt;, and &lt;strong&gt;collaborative whiteboard&lt;/strong&gt;. This is a deep dive into the system architecture and design decisions.&lt;/p&gt;




&lt;p&gt;

  &lt;iframe src="https://www.youtube.com/embed/BlJtQZ85rnw"&gt;
  &lt;/iframe&gt;


&lt;/p&gt;

&lt;h2&gt;
  
  
  🎯 What is CodeNova?
&lt;/h2&gt;

&lt;p&gt;CodeNova is an AI-enhanced coding interview platform designed for scalability and learning. Core features include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;155+ problems&lt;/strong&gt; across multiple difficulty levels&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;10+ programming languages&lt;/strong&gt; with sandboxed execution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI video tutor&lt;/strong&gt; with realistic avatar and natural voice&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic algorithm visualization&lt;/strong&gt; for any code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time collaborative whiteboard&lt;/strong&gt; for mock interviews&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Contest leaderboards&lt;/strong&gt; with analytics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Scale:&lt;/strong&gt; Built to handle &lt;strong&gt;10,000 concurrent users&lt;/strong&gt;, &lt;strong&gt;1,000 submissions/minute&lt;/strong&gt;, with &lt;strong&gt;99.9% uptime&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  🏗️ High-Level Architecture
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F46evcoh1sk21eclygwr1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F46evcoh1sk21eclygwr1.png" alt="CodeNova Architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  System Overview
&lt;/h3&gt;

&lt;p&gt;The architecture follows a &lt;strong&gt;microservices-ready design&lt;/strong&gt; with clear separation of concerns across 6 layers:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Layer 1: Client (Browser)
    ↓
Layer 2: CDN &amp;amp; Load Balancing (CloudFlare + Nginx)
    ↓
Layer 3: Application Tier (Next.js + Express + Socket.io)
    ↓
Layer 4: Data Tier (MongoDB + Redis + PostgreSQL)
    ↓
Layer 5: Queue Layer (BullMQ)
    ↓
Layer 6: Workers &amp;amp; External Services (Judge0, Gemini AI, ElevenLabs, ANAM)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🌟 Three Unique Features - Architecture Breakdown
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. AI Video Avatar Tutor
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Challenge:&lt;/strong&gt;&lt;br&gt;
How do you provide personalized video explanations to thousands of users without hiring human tutors?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Solution: Three-Stage Pipeline&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Question → Gemini AI → ElevenLabs → ANAM AI → Cached Video
              (Text Gen)   (TTS)        (Avatar)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Architecture Decisions:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision 1: Why Three Separate Services?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Gemini AI&lt;/strong&gt; - Best at generating educational content&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ElevenLabs&lt;/strong&gt; - Most natural-sounding TTS (better than AWS Polly)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ANAM AI&lt;/strong&gt; - Realistic lip-sync (alternatives: D-ID, Synthesia)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Trade-off:&lt;/strong&gt; Higher complexity but better quality. Users prefer natural voice over robotic TTS.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision 2: Caching Strategy&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Problem:&lt;/strong&gt; Generating avatar videos takes 30 seconds per request&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Solution:&lt;/strong&gt; Redis cache with 24-hour TTL for common questions (sketched after this list)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result:&lt;/strong&gt; 70% cache hit rate significantly reduces generation load&lt;/li&gt;
&lt;/ul&gt;
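
&lt;p&gt;A cache-aside sketch of that strategy; the key scheme and the &lt;code&gt;generateAvatarVideo&lt;/code&gt; call are hypothetical:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;const Redis = require('ioredis');
const crypto = require('crypto');

const redis = new Redis(); // assumes Redis on localhost:6379
const DAY_SECONDS = 24 * 60 * 60;

async function getAvatarVideo(question) {
  // Normalize so near-identical phrasings share a cache key.
  const key = 'avatar:' + crypto.createHash('sha1')
    .update(question.trim().toLowerCase())
    .digest('hex');

  const cached = await redis.get(key);
  if (cached) return JSON.parse(cached); // the ~70% fast path

  const video = await generateAvatarVideo(question); // hypothetical 30s pipeline
  await redis.set(key, JSON.stringify(video), 'EX', DAY_SECONDS); // 24-hour TTL
  return video;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;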

&lt;p&gt;&lt;strong&gt;Decision 3: Async Processing&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Why:&lt;/strong&gt; 30-second generation time blocks API&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How:&lt;/strong&gt; BullMQ job queue (sketched after this list)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benefit:&lt;/strong&gt; User sees loading screen, gets notification when ready&lt;/li&gt;
&lt;/ul&gt;
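
&lt;p&gt;A minimal BullMQ sketch of that flow; the queue name, job payload, and helper functions are assumptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;const { Queue, Worker } = require('bullmq');

const connection = { host: 'localhost', port: 6379 };
const avatarQueue = new Queue('avatar-videos', { connection });

// API side: enqueue and return immediately instead of blocking for ~30s.
async function requestAvatarVideo(userId, question) {
  const job = await avatarQueue.add('generate', { userId, question });
  return job.id; // the client shows a loading screen and waits for a push
}

// Worker side: runs the slow three-stage pipeline in the background.
new Worker('avatar-videos', async (job) =&amp;gt; {
  const video = await generateAvatarVideo(job.data.question); // hypothetical pipeline
  notifyUser(job.data.userId, video.url); // hypothetical WebSocket push
}, { connection, concurrency: 5 });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;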




&lt;h3&gt;
  
  
  2. AI-Powered Algorithm Visualizer
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Challenge:&lt;/strong&gt;&lt;br&gt;
Traditional visualizers need manual step creation for each algorithm. How to support ANY algorithm without manual work?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Solution: AI-Generated Visualization Steps&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User Code → Gemini AI → JSON Steps → Canvas Renderer → Interactive Visualization
         (Analyze)    (Generate)    (Frontend)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Architecture Decisions:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision 1: Why AI Over Templates?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Templates approach:&lt;/strong&gt; 155+ algorithms × manual steps = months of work&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AI approach:&lt;/strong&gt; Gemini analyzes ANY code automatically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trade-off:&lt;/strong&gt; API dependency vs. automatic generation at scale&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Decision 2: Where to Render?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Server-side rendering:&lt;/strong&gt; High CPU usage, poor UX&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Client-side (Canvas API):&lt;/strong&gt; Better performance, lower server load&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chosen:&lt;/strong&gt; Client-side with JSON steps from server&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Decision 3: Data Format&lt;/strong&gt;&lt;br&gt;
Gemini returns structured JSON:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Step format:
- Description (plain English)
- Array state at this step
- Elements to highlight
- Comparison pointers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
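
&lt;p&gt;A hypothetical example of one such step for a bubble-sort comparison (field names are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// One visualization step, as the Canvas renderer might receive it.
const step = {
  description: 'Compare indices 0 and 1; swap because 5 &amp;gt; 2',
  array: [2, 5, 8, 1, 9],   // array state after this step
  highlight: [0, 1],        // elements to draw in the accent color
  pointers: { i: 0, j: 1 }, // comparison pointers
};
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;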



&lt;p&gt;&lt;strong&gt;Supported Algorithms:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sorting: Bubble, Merge, Quick, Heap, Insertion&lt;/li&gt;
&lt;li&gt;Searching: Binary, Linear, DFS, BFS&lt;/li&gt;
&lt;li&gt;Data Structures: Stack, Queue, Trees, Graphs&lt;/li&gt;
&lt;li&gt;DP: Fibonacci, Knapsack, LCS with table visualization&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  3. Collaborative Whiteboard
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Challenge:&lt;/strong&gt;&lt;br&gt;
Enable real-time drawing for multiple users in mock interviews.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Solution: WebSocket + Pub/Sub Architecture&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User A draws → Socket.io Server → Redis Pub/Sub → All Users in Room
                     ↓
                 MongoDB (persist)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Architecture Decisions:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Decision 1: WebSocket vs. Polling?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Polling:&lt;/strong&gt; Simple but wasteful (10K users × 5s intervals = 2K QPS)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WebSocket:&lt;/strong&gt; Persistent connection, instant updates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chosen:&lt;/strong&gt; Socket.io for fallback support (WebSocket → long polling)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Decision 2: How to Scale WebSockets Across Multiple Servers?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Problem:&lt;/strong&gt; User A on Server 1, User B on Server 2&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Solution:&lt;/strong&gt; Redis Pub/Sub for cross-server communication&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How it works&lt;/strong&gt; (see the sketch after this list):

&lt;ul&gt;
&lt;li&gt;Server 1 publishes draw event to Redis&lt;/li&gt;
&lt;li&gt;Server 2 subscribes and receives event&lt;/li&gt;
&lt;li&gt;Server 2 sends to User B via WebSocket&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
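
&lt;p&gt;A sketch of that wiring using the standard &lt;code&gt;@socket.io/redis-adapter&lt;/code&gt; package, which wraps the same Redis Pub/Sub mechanism; the room and event names are assumptions:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;const { createServer } = require('http');
const { Server } = require('socket.io');
const { createClient } = require('redis');
const { createAdapter } = require('@socket.io/redis-adapter');

async function main() {
  const httpServer = createServer();
  const io = new Server(httpServer);

  // Each app server publishes/subscribes through Redis, so an emit on
  // Server 1 reaches sockets connected to Server 2.
  const pubClient = createClient({ url: 'redis://localhost:6379' });
  const subClient = pubClient.duplicate();
  await Promise.all([pubClient.connect(), subClient.connect()]);
  io.adapter(createAdapter(pubClient, subClient));

  io.on('connection', (socket) =&amp;gt; {
    socket.on('join', (roomId) =&amp;gt; socket.join(roomId));
    socket.on('draw', ({ roomId, stroke }) =&amp;gt; {
      // Fan out to everyone else in the room, across all servers.
      socket.to(roomId).emit('draw', stroke);
    });
  });

  httpServer.listen(3000);
}

main();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;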

&lt;p&gt;&lt;strong&gt;Decision 3: Persistence Strategy&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Approach 1:&lt;/strong&gt; Save on every draw → Too many DB writes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Approach 2:&lt;/strong&gt; Save on disconnect → Lose data if server crashes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chosen:&lt;/strong&gt; Auto-save every 5 seconds to MongoDB (sketched after the data model below)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recovery:&lt;/strong&gt; Load from DB on reconnect&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Data Model:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;WhiteboardSession {
  sessionId: unique identifier
  problemId: which problem being discussed
  participants: array of user IDs with roles
  elements: Excalidraw drawing data
  createdAt, updatedAt
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
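
&lt;p&gt;A sketch of the 5-second auto-save loop, assuming a Mongoose-style &lt;code&gt;WhiteboardSession&lt;/code&gt; model matching the shape above:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Track boards that changed since the last flush, then write them in
// batches every 5 seconds instead of on every stroke.
const dirty = new Map(); // sessionId -&amp;gt; latest Excalidraw elements

function onDrawEvent(sessionId, elements) {
  dirty.set(sessionId, elements);
}

setInterval(async () =&amp;gt; {
  for (const [sessionId, elements] of dirty) {
    dirty.delete(sessionId);
    await WhiteboardSession.updateOne(
      { sessionId },
      { $set: { elements, updatedAt: new Date() } },
      { upsert: true }
    );
  }
}, 5000);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;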






&lt;h2&gt;
  
  
  🔐 Security Architecture - Defense in Depth
&lt;/h2&gt;

&lt;h3&gt;
  
  
  6 Layers of Security
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: Network Perimeter&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CloudFlare DDoS protection (unlimited)&lt;/li&gt;
&lt;li&gt;Rate limiting: 1000 requests/minute per IP&lt;/li&gt;
&lt;li&gt;TLS 1.3 encryption&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Layer 2: Load Balancer (Nginx)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Per-user rate limiting (100 req/min)&lt;/li&gt;
&lt;li&gt;Request size limits (10 MB max)&lt;/li&gt;
&lt;li&gt;Header validation &amp;amp; sanitization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Layer 3: Authentication &amp;amp; Authorization&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;JWT tokens:&lt;/strong&gt; HS256 algorithm, 7-day expiry&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Session validation:&lt;/strong&gt; Every request checks Redis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RBAC:&lt;/strong&gt; User vs Admin permissions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Layer 4: Input Validation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code size limit: 10 KB (prevents DoS)&lt;/li&gt;
&lt;li&gt;Forbidden pattern detection (a sketch follows this list):

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;require('child_process')&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;import subprocess&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Runtime.getRuntime().exec()&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;system()&lt;/code&gt;, &lt;code&gt;eval()&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
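
&lt;p&gt;A sketch of how that validation layer might look; the pattern list mirrors the bullets above, and the function shape is an assumption:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Reject known-dangerous constructs before code reaches the sandbox.
const FORBIDDEN = [
  /require\(['"]child_process['"]\)/,
  /import\s+subprocess/,
  /Runtime\.getRuntime\(\)\.exec/,
  /\bsystem\s*\(/,
  /\beval\s*\(/,
];
const MAX_CODE_BYTES = 10 * 1024; // the 10 KB limit above

function validateSubmission(code) {
  if (Buffer.byteLength(code, 'utf8') &amp;gt; MAX_CODE_BYTES) {
    return { ok: false, reason: 'code exceeds 10 KB limit' };
  }
  const hit = FORBIDDEN.find((pattern) =&amp;gt; pattern.test(code));
  return hit ? { ok: false, reason: `forbidden pattern: ${hit}` } : { ok: true };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Pattern matching alone is easy to evade; it exists to fail fast, while the Docker sandbox in Layer 5 provides the real guarantee.&lt;/p&gt;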

&lt;p&gt;&lt;strong&gt;Layer 5: Code Execution Sandbox (Judge0)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Docker isolation:&lt;/strong&gt; Each submission in separate container&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource limits:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;CPU time: 2 seconds max&lt;/li&gt;
&lt;li&gt;Memory: 256 MB max&lt;/li&gt;
&lt;li&gt;Processes: 30 max&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Network:&lt;/strong&gt; Completely disabled&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Filesystem:&lt;/strong&gt; Read-only (except /tmp)&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Seccomp profiles:&lt;/strong&gt; Block dangerous syscalls&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Layer 6: Data Security&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Encryption at rest: AES-256&lt;/li&gt;
&lt;li&gt;Password hashing: Bcrypt (10 rounds)&lt;/li&gt;
&lt;li&gt;Secrets: AWS Secrets Manager&lt;/li&gt;
&lt;li&gt;Database backups: Daily full + 6h incremental&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why 6 Layers?&lt;/strong&gt;&lt;br&gt;
If an attacker bypasses one layer, 5 more remain. Single points of failure = bad.&lt;/p&gt;


&lt;h2&gt;
  
  
  📊 Scalability: Handling 10,000 Concurrent Users
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Horizontal Scaling Strategy
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Kubernetes HPA (Horizontal Pod Autoscaler):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Configuration:
- Min replicas: 3 (high availability)
- Max replicas: 20 (resource management)
- Scale up: CPU &amp;gt; 70% OR Memory &amp;gt; 80%
- Scale down: CPU &amp;lt; 40% for 5 minutes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Why Kubernetes?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Auto-healing (pod crashes → restart)&lt;/li&gt;
&lt;li&gt;Rolling updates (zero downtime deploys)&lt;/li&gt;
&lt;li&gt;Resource management (CPU/memory limits)&lt;/li&gt;
&lt;li&gt;Service discovery (automatic DNS)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Database Scaling Strategy
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;MongoDB (Primary Database):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Architecture: Replica Set (PSS)
- 1 Primary (us-east-1) → All writes
- 1 Secondary (us-west-1) → Read queries
- 1 Secondary (eu-west-1) → Read queries

Read Preference: secondaryPreferred (40% load on each secondary)
Write Concern: majority (data safety)

Future: Shard when &amp;gt; 10M documents
Shard Key: { userId: "hashed" } for even distribution
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;PostgreSQL (Analytics):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Architecture: Master-Replica
- Master: All writes (metrics, logs)
- Replica 1: Analytics queries
- Replica 2: Reporting dashboards

Extension: TimescaleDB for time-series optimization
Use case: User activity over time, submission trends
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Redis (Cache &amp;amp; Pub/Sub):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Architecture: Cluster (3 nodes)
- Node 1: Master (cache + sessions)
- Node 2: Replica (failover)
- Node 3: Replica (failover)

Persistence: RDB snapshots (5 min) + AOF
Max Memory: 4 GB
Eviction Policy: allkeys-lru (least recently used)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Worker Scaling
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;BullMQ Queue Configuration:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Code Execution Queue:
- Min workers: 5
- Max workers: 50
- Concurrency: 10 jobs per worker
- Scale trigger: Queue depth &amp;gt; 100

AI Avatar Queue:
- Min workers: 2
- Max workers: 20
- Concurrency: 5 jobs per worker
- Scale trigger: Queue depth &amp;gt; 50

Visualizer Queue:
- Min workers: 2
- Max workers: 15
- Concurrency: 5 jobs per worker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Math Check:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Peak Load: 1,000 submissions/minute
         = 16.7 submissions/second

Average execution time: 2 seconds

Required concurrent workers:
16.7 submissions/sec × 2 sec = 33.4 workers

Configured max: 50 workers
Headroom: 50 - 34 = 16 workers (47% buffer) ✓
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🗄️ Data Architecture Decisions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why MongoDB for Primary DB?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;br&gt;
✅ Flexible schema (problems have varying test cases)&lt;br&gt;
✅ Horizontal scaling with sharding&lt;br&gt;
✅ Rich query language (filter by difficulty, tags, companies)&lt;br&gt;
✅ Replica sets for HA&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cons:&lt;/strong&gt;&lt;br&gt;
❌ Multi-document transactions only since 4.0 (4.2 for sharded clusters)&lt;br&gt;
❌ Larger storage footprint&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Problems collection (155+ documents)&lt;/li&gt;
&lt;li&gt;Submissions collection (millions of documents)&lt;/li&gt;
&lt;li&gt;Whiteboard sessions&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Why PostgreSQL for Analytics?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;br&gt;
✅ ACID transactions&lt;br&gt;
✅ Complex joins for user analytics&lt;br&gt;
✅ TimescaleDB for time-series optimization&lt;br&gt;
✅ Better for aggregations&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Submission analytics (success rate over time)&lt;/li&gt;
&lt;li&gt;User activity logs&lt;/li&gt;
&lt;li&gt;Leaderboard snapshots&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Why Redis?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Pros:&lt;/strong&gt;&lt;br&gt;
✅ Sub-millisecond latency&lt;br&gt;
✅ Sorted Sets for leaderboards (O(log N) operations)&lt;br&gt;
✅ Pub/Sub for WebSocket scaling&lt;br&gt;
✅ Built-in TTL for sessions&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Session storage (7-day TTL)&lt;/li&gt;
&lt;li&gt;Problem caching (1-hour TTL)&lt;/li&gt;
&lt;li&gt;Leaderboard (Redis Sorted Set)&lt;/li&gt;
&lt;li&gt;WebSocket pub/sub&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Leaderboard Implementation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Data Structure: Redis Sorted Set
Command: ZADD leaderboard:contest123 &amp;lt;score&amp;gt; &amp;lt;userId&amp;gt;
Retrieve Top 100: ZREVRANGE leaderboard:contest123 0 99 WITHSCORES

Time Complexity: O(log N)
Handles: 10K users × 5-second polling = 2K QPS easily
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
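
&lt;p&gt;The same two commands from Node.js with &lt;code&gt;ioredis&lt;/code&gt; (a minimal sketch):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;const Redis = require('ioredis');
const redis = new Redis(); // assumes Redis on localhost:6379

// Record or update a user's score; the sorted set stays ordered.
async function submitScore(contestId, userId, score) {
  await redis.zadd(`leaderboard:${contestId}`, score, userId);
}

// Top 100 in descending order; returns [member, score, member, score, ...].
async function topHundred(contestId) {
  return redis.zrevrange(`leaderboard:${contestId}`, 0, 99, 'WITHSCORES');
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;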






&lt;h2&gt;
  
  
  🎯 Architecture Decisions Explained
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Decision 1: Why BullMQ Over AWS SQS?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Comparison:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;BullMQ (Redis)&lt;/th&gt;
&lt;th&gt;AWS SQS&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Latency&lt;/td&gt;
&lt;td&gt;&amp;lt; 10ms&lt;/td&gt;
&lt;td&gt;50-100ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Priority Queues&lt;/td&gt;
&lt;td&gt;✅ Native&lt;/td&gt;
&lt;td&gt;❌ Separate queues&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retry Logic&lt;/td&gt;
&lt;td&gt;✅ Built-in&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Local Dev&lt;/td&gt;
&lt;td&gt;✅ Easy&lt;/td&gt;
&lt;td&gt;❌ Need AWS account&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Infrastructure&lt;/td&gt;
&lt;td&gt;Uses existing Redis&lt;/td&gt;
&lt;td&gt;Additional service&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Chosen:&lt;/strong&gt; BullMQ for lower latency and simpler infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  Decision 2: Why Socket.io Over Native WebSocket?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Socket.io Advantages:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Automatic fallback (WebSocket → long polling)&lt;/li&gt;
&lt;li&gt;✅ Reconnection logic built-in&lt;/li&gt;
&lt;li&gt;✅ Room-based messaging&lt;/li&gt;
&lt;li&gt;✅ Cross-platform (web + mobile)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Trade-off:&lt;/strong&gt; Slightly larger bundle size, but better compatibility.&lt;/p&gt;

&lt;h3&gt;
  
  
  Decision 3: Why Next.js Over Pure React?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Next.js Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Server-side rendering (better SEO)&lt;/li&gt;
&lt;li&gt;✅ API routes (no separate Express for simple endpoints)&lt;/li&gt;
&lt;li&gt;✅ Image optimization&lt;/li&gt;
&lt;li&gt;✅ Automatic code splitting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Case:&lt;/strong&gt; Problem listing page needs SEO for Google.&lt;/p&gt;

&lt;h3&gt;
  
  
  Decision 4: Why Separate PostgreSQL for Analytics?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Why Not Just MongoDB?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MongoDB aggregations are slower for complex queries&lt;/li&gt;
&lt;li&gt;PostgreSQL better for JOINs (users + submissions + problems)&lt;/li&gt;
&lt;li&gt;TimescaleDB optimizes time-series queries (activity over time)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Trade-off:&lt;/strong&gt; More complexity (2 databases) but better performance.&lt;/p&gt;




&lt;h2&gt;
  
  
  🚀 Performance Metrics
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Achieved SLA:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Code execution: &amp;lt; 3s (p95)&lt;/li&gt;
&lt;li&gt;✅ Page load: &amp;lt; 2s&lt;/li&gt;
&lt;li&gt;✅ API latency: &amp;lt; 500ms (p95)&lt;/li&gt;
&lt;li&gt;✅ WebSocket latency: &amp;lt; 100ms&lt;/li&gt;
&lt;li&gt;✅ Cache hit rate: &amp;gt; 70%&lt;/li&gt;
&lt;li&gt;✅ Uptime: 99.9% (43 minutes downtime/month allowed)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How We Measure:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prometheus for metrics collection&lt;/li&gt;
&lt;li&gt;Grafana for dashboards&lt;/li&gt;
&lt;li&gt;Sentry for error tracking&lt;/li&gt;
&lt;li&gt;ELK Stack for log aggregation&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🎓 Key Learnings
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Async Processing is Non-Negotiable
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Early Mistake:&lt;/strong&gt;&lt;br&gt;
I initially tried synchronous code execution. When 1000 submissions/minute hit, API servers timed out.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution:&lt;/strong&gt;&lt;br&gt;
BullMQ job queue with auto-scaling workers. Now:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;API responds instantly with "submitted"&lt;/li&gt;
&lt;li&gt;Worker processes in background&lt;/li&gt;
&lt;li&gt;WebSocket notifies user when done&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Caching is Critical for Performance
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Without Caching:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every problem fetch → MongoDB query&lt;/li&gt;
&lt;li&gt;Every avatar question → 30-second generation time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;With Caching:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;85% of problem queries served from Redis&lt;/li&gt;
&lt;li&gt;70% of avatar videos served from cache&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Result:&lt;/strong&gt; 80% reduction in MongoDB load, instant response for cached queries&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Security in Layers, Not Walls
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Wrong Approach:&lt;/strong&gt;&lt;br&gt;
"If our firewall is strong, we're safe."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Right Approach:&lt;/strong&gt;&lt;br&gt;
6 layers of defense. If one fails, 5 remain.&lt;/p&gt;

&lt;p&gt;Example: Even if attacker bypasses rate limiting (Layer 1-2), they hit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JWT validation (Layer 3)&lt;/li&gt;
&lt;li&gt;Input sanitization (Layer 4)&lt;/li&gt;
&lt;li&gt;Docker sandbox (Layer 5)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Monitor Before You Scale
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Built Monitoring First:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prometheus metrics from day one&lt;/li&gt;
&lt;li&gt;Grafana dashboards before launch&lt;/li&gt;
&lt;li&gt;Sentry error tracking in alpha&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt; You can't optimize what you can't measure. Without metrics, scaling is guesswork.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔮 Future Improvements
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Technical Debt to Address
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Self-host Judge0&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current: Using Judge0 API&lt;/li&gt;
&lt;li&gt;Plan: Docker on Kubernetes for better control&lt;/li&gt;
&lt;li&gt;Benefit: More flexibility in resource allocation&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Multi-region Deployment&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current: Single region (us-east-1)&lt;/li&gt;
&lt;li&gt;Issue: High latency for Asia/Europe users&lt;/li&gt;
&lt;li&gt;Plan: CloudFlare Workers + edge caching&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Database Sharding&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current: Single MongoDB replica set&lt;/li&gt;
&lt;li&gt;Trigger: When &amp;gt; 10M submissions&lt;/li&gt;
&lt;li&gt;Strategy: Shard by userId (hashed)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;GraphQL API&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Current: REST with over-fetching&lt;/li&gt;
&lt;li&gt;Benefit: Reduce data transfer by 40%&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  🤔 Questions I'd Ask Myself in System Design Interview
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Q: Why not use AWS Lambda for code execution?&lt;/strong&gt;&lt;br&gt;
A: Lambda has 15-minute timeout, cold starts add latency. Judge0 in Docker has consistent performance and better resource limits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Why MongoDB AND PostgreSQL? Why not just one?&lt;/strong&gt;&lt;br&gt;
A: Different workloads. MongoDB excels at flexible schemas and horizontal scaling. PostgreSQL excels at complex analytics. Multi-database is common in microservices.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: How do you prevent one user from DDoSing your platform?&lt;/strong&gt;&lt;br&gt;
A: Rate limiting at 3 levels - CloudFlare (per IP), Nginx (per user), Application (per API endpoint). Plus BullMQ queue prevents worker overload.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: What happens if Redis goes down?&lt;/strong&gt;&lt;br&gt;
A: 3-node cluster with automatic failover. If all nodes fail: Sessions lost (users re-login), cache miss (MongoDB serves requests), WebSocket disconnects (auto-reconnect). Not ideal, but platform stays up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Q: Why 99.9% uptime and not 99.99%?&lt;/strong&gt;&lt;br&gt;
A: Trade-off between availability and complexity. 99.9% = 43 min/month downtime (acceptable for coding practice). 99.99% requires multi-region deployment with significantly more infrastructure complexity.&lt;/p&gt;




&lt;h2&gt;
  
  
  📖 Recommended Reading
&lt;/h2&gt;

&lt;p&gt;If you're designing a similar system:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Books:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Designing Data-Intensive Applications" by Martin Kleppmann&lt;/li&gt;
&lt;li&gt;"System Design Interview" by Alex Xu&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Resources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.hellointerview.com/learn/system-design/problem-breakdowns/leetcode" rel="noopener noreferrer"&gt;LeetCode System Design (HelloInterview)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.mongodb.com/docs/manual/sharding/" rel="noopener noreferrer"&gt;MongoDB Sharding Guide&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://redis.io/commands/zadd/" rel="noopener noreferrer"&gt;Redis Sorted Sets for Leaderboards&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🎯 Conclusion
&lt;/h2&gt;

&lt;p&gt;Building CodeNova taught me that &lt;strong&gt;good architecture is about trade-offs&lt;/strong&gt;, not perfection.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Takeaways:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Async everything&lt;/strong&gt; - Queues are your friend&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cache aggressively&lt;/strong&gt; - Improves performance and reduces load&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Security in layers&lt;/strong&gt; - Defense in depth&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measure first, optimize second&lt;/strong&gt; - Metrics before scaling&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The architecture diagram isn't just boxes and arrows - it represents:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hundreds of hours of research&lt;/li&gt;
&lt;li&gt;Dozens of failed experiments&lt;/li&gt;
&lt;li&gt;Lessons from production incidents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;If I were to start over&lt;/strong&gt;, I'd:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Build monitoring first (kept this)&lt;/li&gt;
&lt;li&gt;✅ Use queues from day one (learned this the hard way)&lt;/li&gt;
&lt;li&gt;✅ Start with fewer databases (added PostgreSQL later)&lt;/li&gt;
&lt;li&gt;❌ Not self-host initially (buy before build)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  💬 Discussion
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;How would you design this differently?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Would you use:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Serverless (Lambda) instead of Kubernetes?&lt;/li&gt;
&lt;li&gt;GraphQL instead of REST?&lt;/li&gt;
&lt;li&gt;DynamoDB instead of MongoDB?&lt;/li&gt;
&lt;li&gt;Different AI providers?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Drop your thoughts in the comments! 👇&lt;/p&gt;

&lt;p&gt;I'm especially interested in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Better ways to optimize AI response generation&lt;/li&gt;
&lt;li&gt;Better ways to scale WebSockets&lt;/li&gt;
&lt;li&gt;Alternative code execution sandboxes&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Built with ❤️ and lots of ☕ by Bhupesh Chikara&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;#systemdesign #architecture #webdev #ai #mongodb #kubernetes #redis #postgresql #websocket #nodejs #react #typescript #microservices #cloudcomputing #devops&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>The CAP Theorem: Why Consistency, Availability, and Partition Tolerance Can't All Be Friends</title>
      <dc:creator>Bhupesh Chikara</dc:creator>
      <pubDate>Fri, 30 May 2025 07:00:56 +0000</pubDate>
      <link>https://dev.to/bchikara/demystifying-the-cap-theorem-a-developers-guide-5727</link>
      <guid>https://dev.to/bchikara/demystifying-the-cap-theorem-a-developers-guide-5727</guid>
      <description>&lt;p&gt;Hey Devs! 👋&lt;/p&gt;

&lt;p&gt;Heard of the "CAP theorem" in system design? It sounds academic, but it's crucial for distributed systems (like microservices or multi-server databases). This post breaks CAP down simply. Let's go!&lt;/p&gt;

&lt;h2&gt;
  
  
  🤔 What is the CAP Theorem?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvmotz3rmdmshnx2jgeuu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvmotz3rmdmshnx2jgeuu.png" alt="CAP Theorem" width="800" height="666"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The CAP theorem (or Brewer's theorem) is key for distributed data stores. It states &lt;strong&gt;a distributed system can't guarantee all three simultaneously&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;C&lt;/strong&gt;onsistency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A&lt;/strong&gt;vailability&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;P&lt;/strong&gt;artition Tolerance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Understanding these trade-offs is vital for good design.&lt;/p&gt;

&lt;h2&gt;
  
  
  🧐 Breaking Down "CAP"
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpuiv2f3nbe08xyfpwwhx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpuiv2f3nbe08xyfpwwhx.png" alt="CAP Theorem Overall Trade-off" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;C&lt;/strong&gt;onsistency: All Nodes See the Same Data, Now
&lt;/h3&gt;

&lt;p&gt;All reads get the most recent write or an error. After a write, all nodes reflect that update, giving users a unified data view.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Analogy:&lt;/em&gt; A shared doc where everyone instantly sees the latest saved version.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;A&lt;/strong&gt;vailability: Every Request Gets a Response
&lt;/h3&gt;

&lt;p&gt;Every request to a working node gets a response. The system is operational, though responses might not always have the absolute latest data.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Analogy:&lt;/em&gt; An online store that's always open, even if product info occasionally has a slight delay in updating everywhere.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;P&lt;/strong&gt;artition Tolerance: System Works Despite Network Issues
&lt;/h3&gt;

&lt;p&gt;The system works despite network communication failures between nodes (e.g., due to a failed switch or cable).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Analogy:&lt;/em&gt; Office branches operating independently when their network connection drops, then syncing later.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why Partition Tolerance is Key:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Network failures (partitions) are inevitable in distributed systems. Thus, &lt;strong&gt;P&lt;/strong&gt;artition Tolerance is essential; without it, systems become unreliable during glitches. So, &lt;strong&gt;most distributed systems need partition tolerance&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  ⚖️ The Core Trade-off: CP vs. AP
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvhqbfg8i29266jqus9hb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvhqbfg8i29266jqus9hb.png" alt="CP vs AP Systems Trade-off" width="800" height="365"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since Partition Tolerance (P) is usually required, the main CAP trade-off during a partition is between Consistency (C) and Availability (A).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;CP Systems (Consistency + Partition Tolerance)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Prioritize consistency during partitions. If data can't be verified as current, the affected part of the system may become unavailable (refusing writes/reads) to prevent inconsistency.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Use Cases:&lt;/em&gt; Financial systems, inventory management—where accuracy trumps constant uptime.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;AP Systems (Availability + Partition Tolerance)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Prioritize availability during partitions. The system stays operational, even if it means some nodes serve slightly older data (eventual consistency) until the partition resolves.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Use Cases:&lt;/em&gt; Social media, e-commerce listings—where high availability is key, and slight data staleness is acceptable.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🌈 Nuances to CAP
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Not just 2 of 3:&lt;/strong&gt; Real systems have nuanced behaviors; choices aren't always absolute.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency matters:&lt;/strong&gt; Operation speed is critical beyond CAP guarantees.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Eventual Consistency:&lt;/strong&gt; Common in AP systems; data eventually becomes consistent if no new updates occur.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context is King:&lt;/strong&gt; The best CP/AP choice depends on your app's needs (e.g., banking: &lt;strong&gt;CP&lt;/strong&gt;; social feeds: &lt;strong&gt;AP&lt;/strong&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  🎁 Wrapping Up
&lt;/h2&gt;

&lt;p&gt;CAP isn't about achieving all three guarantees—it's a model for understanding vital trade-offs in distributed systems. Knowing C, A, and P helps you make informed design choices. Keep CAP in mind when architecting or choosing distributed databases.&lt;/p&gt;

&lt;p&gt;Share your CAP experiences in the comments! 👇&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>captheorem</category>
      <category>highleveldesign</category>
      <category>designpatterns</category>
    </item>
  </channel>
</rss>
