Building CodeNova: System Design Deep Dive into an AI-Enhanced Coding Platform

TL;DR

I designed and built CodeNova, a coding interview platform that scales to 10K+ concurrent users and ships three AI-powered features: a video avatar tutor, an algorithm visualizer, and a collaborative whiteboard. This is a deep dive into the system architecture and the design decisions behind it.


🎯 What is CodeNova?

CodeNova is an AI-enhanced coding interview platform designed for scalability and learning. Core features include:

  • 155+ problems across multiple difficulty levels
  • 10+ programming languages with sandboxed execution
  • AI video tutor with realistic avatar and natural voice
  • Automatic algorithm visualization for any code
  • Real-time collaborative whiteboard for mock interviews
  • Contest leaderboards with analytics

Scale: Built to handle 10,000 concurrent users, 1,000 submissions/minute, with 99.9% uptime.


๐Ÿ—๏ธ High-Level Architecture

[Architecture diagram: CodeNova]

System Overview

The architecture follows a microservices-ready design with clear separation of concerns across 6 layers:

Layer 1: Client (Browser)
    ↓
Layer 2: CDN & Load Balancing (CloudFlare + Nginx)
    ↓
Layer 3: Application Tier (Next.js + Express + Socket.io)
    ↓
Layer 4: Data Tier (MongoDB + Redis + PostgreSQL)
    ↓
Layer 5: Queue Layer (BullMQ)
    ↓
Layer 6: Workers & External Services (Judge0, Gemini AI, ElevenLabs, ANAM)

🌟 Three Unique Features - Architecture Breakdown

1. AI Video Avatar Tutor

The Challenge:
How do you provide personalized video explanations to thousands of users without hiring human tutors?

The Solution: Three-Stage Pipeline

User Question → Gemini AI → ElevenLabs → ANAM AI → Cached Video
               (Text Gen)    (TTS)        (Avatar)

Architecture Decisions:

Decision 1: Why Three Separate Services?

  • Gemini AI - Best at generating educational content
  • ElevenLabs - Most natural-sounding TTS (better than AWS Polly)
  • ANAM AI - Realistic lip-sync (alternatives: D-ID, Synthesia)

Trade-off: Higher complexity but better quality. Users prefer natural voice over robotic TTS.

Decision 2: Caching Strategy

  • Problem: Generating avatar videos takes 30 seconds per request
  • Solution: Redis cache with 24-hour TTL for common questions
  • Result: 70% cache hit rate significantly reduces generation load

Decision 3: Async Processing

  • Why: 30-second generation time blocks API
  • How: BullMQ job queue
  • Benefit: User sees a loading screen and gets a notification when the video is ready (see the sketch below)
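
To make Decisions 2 and 3 concrete, here's a minimal sketch of the cache-then-enqueue flow, assuming ioredis and BullMQ; the queue name, cache-key scheme, and function names are illustrative, not CodeNova's actual code:

import { createHash } from 'crypto';
import { Queue } from 'bullmq';
import Redis from 'ioredis';

const redis = new Redis();
const avatarQueue = new Queue('avatar-generation', {
  connection: { host: 'localhost', port: 6379 },
});

// Cache key derived from the normalized question text.
const cacheKey = (q: string) =>
  `avatar:${createHash('sha256').update(q.trim().toLowerCase()).digest('hex')}`;

export async function requestAvatarVideo(question: string) {
  // 1. Serve from cache when a matching video exists (24-hour TTL).
  const cached = await redis.get(cacheKey(question));
  if (cached) return { status: 'ready', videoUrl: cached };

  // 2. Otherwise enqueue the Gemini -> ElevenLabs -> ANAM pipeline and
  //    return immediately; the client waits for a "ready" notification.
  const job = await avatarQueue.add('generate', { question });
  return { status: 'queued', jobId: job.id };
}

// Worker side, after a successful generation:
// await redis.set(cacheKey(question), videoUrl, 'EX', 60 * 60 * 24);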

2. AI-Powered Algorithm Visualizer

The Challenge:
Traditional visualizers need manual step creation for each algorithm. How do you support ANY algorithm without manual work?

The Solution: AI-Generated Visualization Steps

User Code → Gemini AI → JSON Steps → Canvas Renderer → Interactive Visualization
           (Analyze)   (Generate)   (Frontend)

Architecture Decisions:

Decision 1: Why AI Over Templates?

  • Template approach: 155+ algorithms × manual steps = months of work
  • AI approach: Gemini analyzes ANY code automatically
  • Trade-off: API dependency vs. automatic generation at scale

Decision 2: Where to Render?

  • Server-side rendering: High CPU usage, poor UX
  • Client-side (Canvas API): Better performance, lower server load
  • Chosen: Client-side with JSON steps from server

Decision 3: Data Format
Gemini returns structured JSON:

Step format:
- Description (plain English)
- Array state at this step
- Elements to highlight
- Comparison pointers
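
One way to type that step format on the frontend, as a hypothetical TypeScript sketch (field names are illustrative; the actual schema may differ):

// Shape of one visualization step, mirroring the fields listed above.
interface VisualizationStep {
  description: string;              // plain-English explanation
  arrayState: number[];             // array contents at this step
  highlights: number[];             // indices to highlight
  pointers: Record<string, number>; // comparison pointers, e.g. { i: 0, j: 1 }
}

// Example: one bubble-sort step comparing indices 0 and 1.
const step: VisualizationStep = {
  description: 'Compare 5 and 3; swap because 5 > 3',
  arrayState: [3, 5, 8, 1],
  highlights: [0, 1],
  pointers: { i: 0, j: 1 },
};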

Supported Algorithms:

  • Sorting: Bubble, Merge, Quick, Heap, Insertion
  • Searching: Binary, Linear, DFS, BFS
  • Data Structures: Stack, Queue, Trees, Graphs
  • DP: Fibonacci, Knapsack, LCS with table visualization

3. Collaborative Whiteboard

The Challenge:
Enable real-time drawing for multiple users in mock interviews.

The Solution: WebSocket + Pub/Sub Architecture

User A draws → Socket.io Server → Redis Pub/Sub → All Users in Room
                     ↓
                 MongoDB (persist)

Architecture Decisions:

Decision 1: WebSocket vs. Polling?

  • Polling: Simple but wasteful (10K users × 5s intervals = 2K QPS)
  • WebSocket: Persistent connection, instant updates
  • Chosen: Socket.io for fallback support (WebSocket → long polling)

Decision 2: How to Scale WebSockets Across Multiple Servers?

  • Problem: User A on Server 1, User B on Server 2
  • Solution: Redis Pub/Sub for cross-server communication
  • How it works (sketched below):
    • Server 1 publishes draw event to Redis
    • Server 2 subscribes and receives event
    • Server 2 sends to User B via WebSocket
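
A minimal sketch of this setup, assuming the official @socket.io/redis-adapter package (event names and ports are illustrative):

import { Server } from 'socket.io';
import { createAdapter } from '@socket.io/redis-adapter';
import { createClient } from 'redis';

const pubClient = createClient({ url: 'redis://localhost:6379' });
const subClient = pubClient.duplicate();
await Promise.all([pubClient.connect(), subClient.connect()]);

// The adapter publishes room broadcasts to Redis so every server instance sees them.
const io = new Server({ adapter: createAdapter(pubClient, subClient) });

io.on('connection', (socket) => {
  socket.on('join-room', (sessionId: string) => socket.join(sessionId));

  // Relay draw events to everyone else in the room, across all servers.
  socket.on('draw', (sessionId: string, element: unknown) => {
    socket.to(sessionId).emit('draw', element);
  });
});

io.listen(3001);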

Decision 3: Persistence Strategy

  • Approach 1: Save on every draw → Too many DB writes
  • Approach 2: Save on disconnect → Lose data if the server crashes
  • Chosen: Auto-save every 5 seconds to MongoDB (sketched after the data model below)
  • Recovery: Load from DB on reconnect

Data Model:

WhiteboardSession {
  sessionId: unique identifier
  problemId: which problem being discussed
  participants: array of user IDs with roles
  elements: Excalidraw drawing data
  createdAt, updatedAt
}
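
And a sketch of the 5-second autosave loop under that data model, assuming Mongoose and a hypothetical in-memory map of live sessions with a dirty flag:

import mongoose from 'mongoose';

const WhiteboardSession = mongoose.model(
  'WhiteboardSession',
  new mongoose.Schema(
    {
      sessionId: { type: String, unique: true },
      problemId: String,
      participants: [{ userId: String, role: String }],
      elements: mongoose.Schema.Types.Mixed, // Excalidraw scene data
    },
    { timestamps: true }, // createdAt, updatedAt
  ),
);

// Hypothetical in-memory state for rooms hosted on this server.
const liveSessions = new Map<string, { elements: unknown; dirty: boolean }>();

// Flush dirty sessions every 5 seconds instead of writing on every stroke.
setInterval(async () => {
  for (const [sessionId, state] of liveSessions) {
    if (!state.dirty) continue;
    state.dirty = false;
    await WhiteboardSession.updateOne(
      { sessionId },
      { $set: { elements: state.elements } },
      { upsert: true },
    );
  }
}, 5000);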

๐Ÿ” Security Architecture - Defense in Depth

6 Layers of Security

Layer 1: Network Perimeter

  • CloudFlare DDoS protection (unmetered)
  • Rate limiting: 1000 requests/minute per IP
  • TLS 1.3 encryption

Layer 2: Load Balancer (Nginx)

  • Per-user rate limiting (100 req/min)
  • Request size limits (10 MB max)
  • Header validation & sanitization

Layer 3: Authentication & Authorization

  • JWT tokens: HS256 algorithm, 7-day expiry
  • Session validation: Every request checks Redis (middleware sketch below)
  • RBAC: User vs Admin permissions
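
A sketch of this check as Express middleware, assuming jsonwebtoken and ioredis (the session-key scheme and payload shape are illustrative):

import jwt from 'jsonwebtoken';
import Redis from 'ioredis';
import type { NextFunction, Request, Response } from 'express';

const redis = new Redis();

export async function requireAuth(req: Request, res: Response, next: NextFunction) {
  try {
    const token = req.headers.authorization?.replace('Bearer ', '');
    if (!token) return res.status(401).json({ error: 'Missing token' });

    // 1. Verify signature and expiry (HS256, 7-day tokens).
    const payload = jwt.verify(token, process.env.JWT_SECRET!) as { userId: string };

    // 2. Confirm the session is still active in Redis (enables server-side logout).
    const session = await redis.get(`session:${payload.userId}`);
    if (!session) return res.status(401).json({ error: 'Session expired' });

    (req as Request & { userId?: string }).userId = payload.userId;
    next();
  } catch {
    return res.status(401).json({ error: 'Invalid token' });
  }
}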

Layer 4: Input Validation

  • Code size limit: 10 KB (prevents DoS)
  • Forbidden pattern detection (see the validator sketch below):
    • require('child_process')
    • import subprocess
    • Runtime.getRuntime().exec()
    • system(), eval()
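
A sketch of that validation step; the patterns mirror the list above, and this is only a pre-screen, since the Docker sandbox (Layer 5) is the real enforcement:

// Illustrative pre-screen; determined attackers can obfuscate, so never rely on this alone.
const FORBIDDEN_PATTERNS: RegExp[] = [
  /require\s*\(\s*['"]child_process['"]\s*\)/,
  /import\s+subprocess/,
  /Runtime\.getRuntime\(\)\.exec/,
  /\bsystem\s*\(/,
  /\beval\s*\(/,
];

export function validateSubmission(code: string): { ok: boolean; reason?: string } {
  // Code size limit: 10 KB (prevents DoS via huge payloads).
  if (Buffer.byteLength(code, 'utf8') > 10 * 1024) {
    return { ok: false, reason: 'Code exceeds 10 KB limit' };
  }
  const hit = FORBIDDEN_PATTERNS.find((pattern) => pattern.test(code));
  return hit ? { ok: false, reason: `Forbidden pattern: ${hit.source}` } : { ok: true };
}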

Layer 5: Code Execution Sandbox (Judge0)

  • Docker isolation: Each submission in separate container
  • Resource limits:
    • CPU time: 2 seconds max
    • Memory: 256 MB max
    • Processes: 30 max
  • Network: Completely disabled
  • Filesystem: Read-only (except /tmp)
  • Seccomp profiles: Block dangerous syscalls

Layer 6: Data Security

  • Encryption at rest: AES-256
  • Password hashing: Bcrypt (10 rounds; snippet below)
  • Secrets: AWS Secrets Manager
  • Database backups: Daily full + 6h incremental
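
The password-hashing piece is small enough to show whole, assuming the bcrypt npm package:

import bcrypt from 'bcrypt';

const ROUNDS = 10; // cost factor from Layer 6 above

export const hashPassword = (plain: string) => bcrypt.hash(plain, ROUNDS);
export const verifyPassword = (plain: string, hash: string) => bcrypt.compare(plain, hash);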

Why 6 Layers?
If an attacker bypasses one layer, five more remain. A single point of failure is exactly what defense in depth avoids.


📊 Scalability: Handling 10,000 Concurrent Users

Horizontal Scaling Strategy

Kubernetes HPA (Horizontal Pod Autoscaler):

Configuration:
- Min replicas: 3 (high availability)
- Max replicas: 20 (resource management)
- Scale up: CPU > 70% OR Memory > 80%
- Scale down: CPU < 40% for 5 minutes

Why Kubernetes?

  • Auto-healing (pod crashes → restart)
  • Rolling updates (zero downtime deploys)
  • Resource management (CPU/memory limits)
  • Service discovery (automatic DNS)

Database Scaling Strategy

MongoDB (Primary Database):

Architecture: Replica Set (PSS)
- 1 Primary (us-east-1) → All writes
- 1 Secondary (us-west-1) → Read queries
- 1 Secondary (eu-west-1) → Read queries

Read Preference: secondaryPreferred (40% load on each secondary)
Write Concern: majority (data safety)

Future: Shard when > 10M documents
Shard Key: { userId: "hashed" } for even distribution
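
How those read/write settings look in the Node.js driver, as a sketch (the URI and database name are placeholders):

import { MongoClient, ReadPreference } from 'mongodb';

// Reads prefer secondaries; writes wait for majority acknowledgment.
const client = new MongoClient(process.env.MONGO_URI!, {
  readPreference: ReadPreference.SECONDARY_PREFERRED,
  writeConcern: { w: 'majority' },
});

await client.connect();
const db = client.db('codenova');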

PostgreSQL (Analytics):

Architecture: Master-Replica
- Master: All writes (metrics, logs)
- Replica 1: Analytics queries
- Replica 2: Reporting dashboards

Extension: TimescaleDB for time-series optimization
Use case: User activity over time, submission trends

Redis (Cache & Pub/Sub):

Architecture: Cluster (3 nodes)
- Node 1: Master (cache + sessions)
- Node 2: Replica (failover)
- Node 3: Replica (failover)

Persistence: RDB snapshots (5 min) + AOF
Max Memory: 4 GB
Eviction Policy: allkeys-lru (least recently used)

Worker Scaling

BullMQ Queue Configuration:

Code Execution Queue:
- Min workers: 5
- Max workers: 50
- Concurrency: 10 jobs per worker
- Scale trigger: Queue depth > 100

AI Avatar Queue:
- Min workers: 2
- Max workers: 20
- Concurrency: 5 jobs per worker
- Scale trigger: Queue depth > 50

Visualizer Queue:
- Min workers: 2
- Max workers: 15
- Concurrency: 5 jobs per worker
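
A sketch of one code-execution worker under this configuration, assuming BullMQ; runOnJudge0 is a hypothetical Judge0 client:

import { Worker } from 'bullmq';

// Hypothetical wrapper around Judge0's HTTP API.
declare function runOnJudge0(
  code: string,
  languageId: number,
  testCases: unknown[],
): Promise<unknown>;

// One worker process handling up to 10 jobs concurrently.
const worker = new Worker(
  'code-execution',
  async (job) => {
    const { code, languageId, testCases } = job.data;
    return runOnJudge0(code, languageId, testCases);
  },
  { connection: { host: 'localhost', port: 6379 }, concurrency: 10 },
);

worker.on('completed', (job) => console.log(`Job ${job.id} done`));
worker.on('failed', (job, err) => console.error(`Job ${job?.id} failed:`, err));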

Math Check:

Peak Load: 1,000 submissions/minute
         = 16.7 submissions/second

Average execution time: 2 seconds

Required concurrent workers:
16.7 submissions/sec × 2 sec = 33.4 ≈ 34 workers

Configured max: 50 workers
Headroom: 50 - 34 = 16 workers (47% buffer) ✓

๐Ÿ—„๏ธ Data Architecture Decisions

Why MongoDB for Primary DB?

Pros:
✅ Flexible schema (problems have varying test cases)
✅ Horizontal scaling with sharding
✅ Rich query language (filter by difficulty, tags, companies)
✅ Replica sets for HA

Cons:
❌ Weaker transactions (multi-document transactions only arrived in 4.0)
❌ Larger storage footprint

Use Cases:

  • Problems collection (155+ documents)
  • Submissions collection (millions of documents)
  • Whiteboard sessions

Why PostgreSQL for Analytics?

Pros:
✅ ACID transactions
✅ Complex joins for user analytics
✅ TimescaleDB for time-series optimization
✅ Better for aggregations

Use Cases:

  • Submission analytics (success rate over time)
  • User activity logs
  • Leaderboard snapshots

Why Redis?

Pros:
✅ Sub-millisecond latency
✅ Sorted Sets for leaderboards (O(log N) operations)
✅ Pub/Sub for WebSocket scaling
✅ Built-in TTL for sessions

Use Cases:

  • Session storage (7-day TTL)
  • Problem caching (1-hour TTL)
  • Leaderboard (Redis Sorted Set)
  • WebSocket pub/sub

Leaderboard Implementation:

Data Structure: Redis Sorted Set
Command: ZADD leaderboard:contest123 <score> <userId>
Retrieve Top 100: ZREVRANGE leaderboard:contest123 0 99 WITHSCORES

Time Complexity: O(log N)
Handles: 10K users × 5-second polling = 2K QPS easily
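
Those two commands via ioredis, as a sketch (the key-naming scheme is illustrative):

import Redis from 'ioredis';

const redis = new Redis();

// Record a score; the sorted set keeps members ordered, so updates are O(log N).
export const submitScore = (contestId: string, userId: string, score: number) =>
  redis.zadd(`leaderboard:${contestId}`, score, userId);

// Top 100 with scores, highest first.
export const topHundred = (contestId: string) =>
  redis.zrevrange(`leaderboard:${contestId}`, 0, 99, 'WITHSCORES');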

🎯 Architecture Decisions Explained

Decision 1: Why BullMQ Over AWS SQS?

Comparison:

Feature         | BullMQ (Redis)      | AWS SQS
Latency         | < 10ms              | 50-100ms
Priority Queues | ✅ Native            | ❌ Separate queues
Retry Logic     | ✅ Built-in          | Manual
Local Dev       | ✅ Easy              | ❌ Need AWS account
Infrastructure  | Uses existing Redis | Additional service

Chosen: BullMQ for lower latency and simpler infrastructure.

Decision 2: Why Socket.io Over Native WebSocket?

Socket.io Advantages:

  • ✅ Automatic fallback (WebSocket → long polling)
  • ✅ Reconnection logic built-in
  • ✅ Room-based messaging
  • ✅ Cross-platform (web + mobile)

Trade-off: Slightly larger bundle size, but better compatibility.

Decision 3: Why Next.js Over Pure React?

Next.js Benefits:

  • ✅ Server-side rendering (better SEO)
  • ✅ API routes (no separate Express for simple endpoints)
  • ✅ Image optimization
  • ✅ Automatic code splitting

Use Case: Problem listing page needs SEO for Google.

Decision 4: Why Separate PostgreSQL for Analytics?

Why Not Just MongoDB?

  • MongoDB aggregations are slower for complex queries
  • PostgreSQL better for JOINs (users + submissions + problems)
  • TimescaleDB optimizes time-series queries (activity over time)

Trade-off: More complexity (2 databases) but better performance.


🚀 Performance Metrics

Achieved SLA:

  • ✅ Code execution: < 3s (p95)
  • ✅ Page load: < 2s
  • ✅ API latency: < 500ms (p95)
  • ✅ WebSocket latency: < 100ms
  • ✅ Cache hit rate: > 70%
  • ✅ Uptime: 99.9% (43 minutes downtime/month allowed)

How We Measure:

  • Prometheus for metrics collection
  • Grafana for dashboards
  • Sentry for error tracking
  • ELK Stack for log aggregation

🎓 Key Learnings

1. Async Processing is Non-Negotiable

Early Mistake:
I initially tried synchronous code execution. When 1,000 submissions/minute hit, API servers timed out.

Solution:
BullMQ job queue with auto-scaling workers. Now:

  • API responds instantly with "submitted"
  • Worker processes in background
  • WebSocket notifies user when done

2. Caching is Critical for Performance

Without Caching:

  • Every problem fetch → MongoDB query
  • Every avatar question → 30-second generation time

With Caching:

  • 85% of problem queries served from Redis
  • 70% of avatar videos served from cache
  • Result: 80% reduction in MongoDB load and instant responses for cached queries

3. Security in Layers, Not Walls

Wrong Approach:
"If our firewall is strong, we're safe."

Right Approach:
6 layers of defense. If one fails, 5 remain.

Example: Even if an attacker bypasses rate limiting (Layers 1-2), they still hit:

  • JWT validation (Layer 3)
  • Input sanitization (Layer 4)
  • Docker sandbox (Layer 5)

4. Monitor Before You Scale

Built Monitoring First:

  • Prometheus metrics from day one
  • Grafana dashboards before launch
  • Sentry error tracking in alpha

Why? You can't optimize what you can't measure. Without metrics, scaling is guesswork.


🔮 Future Improvements

Technical Debt to Address

  1. Self-host Judge0

    • Current: Using Judge0 API
    • Plan: Docker on Kubernetes for better control
    • Benefit: More flexibility in resource allocation
  2. Multi-region Deployment

    • Current: Single region (us-east-1)
    • Issue: High latency for Asia/Europe users
    • Plan: CloudFlare Workers + edge caching
  3. Database Sharding

    • Current: Single MongoDB replica set
    • Trigger: When > 10M submissions
    • Strategy: Shard by userId (hashed)
  4. GraphQL API

    • Current: REST with over-fetching
    • Benefit: Reduce data transfer by 40%

🤔 Questions I'd Ask Myself in a System Design Interview

Q: Why not use AWS Lambda for code execution?
A: Lambda has a 15-minute timeout, and cold starts add latency. Judge0 in Docker has consistent performance and better resource limits.

Q: Why MongoDB AND PostgreSQL? Why not just one?
A: Different workloads. MongoDB excels at flexible schemas and horizontal scaling. PostgreSQL excels at complex analytics. Multi-database is common in microservices.

Q: How do you prevent one user from DDoSing your platform?
A: Rate limiting at 3 levels - CloudFlare (per IP), Nginx (per user), Application (per API endpoint). Plus BullMQ queue prevents worker overload.

Q: What happens if Redis goes down?
A: 3-node cluster with automatic failover. If all nodes fail: sessions are lost (users re-login), caches miss (MongoDB serves requests), and WebSockets disconnect (then auto-reconnect). Not ideal, but the platform stays up.

Q: Why 99.9% uptime and not 99.99%?
A: Trade-off between availability and complexity. 99.9% = 43 min/month downtime (acceptable for coding practice). 99.99% requires multi-region deployment with significantly more infrastructure complexity.


📖 Recommended Reading

If you're designing a similar system:

Books:

  • "Designing Data-Intensive Applications" by Martin Kleppmann
  • "System Design Interview" by Alex Xu


🎯 Conclusion

Building CodeNova taught me that good architecture is about trade-offs, not perfection.

Key Takeaways:

  1. Async everything - Queues are your friend
  2. Cache aggressively - Improves performance and reduces load
  3. Security in layers - Defense in depth
  4. Measure first, optimize second - Metrics before scaling

The architecture diagram isn't just boxes and arrows - it represents:

  • Hundreds of hours of research
  • Dozens of failed experiments
  • Lessons from production incidents

If I were to start over, I'd:

  • ✅ Build monitoring first (kept this)
  • ✅ Use queues from day one (learned this the hard way)
  • ✅ Start with fewer databases (added PostgreSQL later)
  • ❌ Not self-host initially (buy before build)

💬 Discussion

How would you design this differently?

Would you use:

  • Serverless (Lambda) instead of Kubernetes?
  • GraphQL instead of REST?
  • DynamoDB instead of MongoDB?
  • Different AI providers?

Drop your thoughts in the comments! 👇

I'm especially interested in:

  • Better ways to optimize AI response generation
  • Better ways to scale WebSockets
  • Alternative code execution sandboxes

Built with โค๏ธ and lots of โ˜• by Bhupesh Chikara


#systemdesign #architecture #webdev #ai #mongodb #kubernetes #redis #postgresql #websocket #nodejs #react #typescript #microservices #cloudcomputing #devops
