<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Vikas Kumar</title>
    <description>The latest articles on DEV Community by Vikas Kumar (@learnwithvikzzy).</description>
    <link>https://dev.to/learnwithvikzzy</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3615995%2F1789893c-bed9-4d41-b337-b30fb15043a6.jpg</url>
      <title>DEV Community: Vikas Kumar</title>
      <link>https://dev.to/learnwithvikzzy</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/learnwithvikzzy"/>
    <language>en</language>
    <item>
      <title>Operational Transformation (OT)</title>
      <dc:creator>Vikas Kumar</dc:creator>
      <pubDate>Wed, 11 Feb 2026 09:21:27 +0000</pubDate>
      <link>https://dev.to/learnwithvikzzy/operational-transformation-ot-267d</link>
      <guid>https://dev.to/learnwithvikzzy/operational-transformation-ot-267d</guid>
      <description>&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;Distributed systems that allow &lt;strong&gt;concurrent updates&lt;/strong&gt; face a difficult problem:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How do we keep replicas consistent when operations arrive late, out of order, or at the same time?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Operational Transformation (OT) is a technique that solves this by &lt;strong&gt;modifying operations&lt;/strong&gt; instead of merging full state. It is best known from collaborative editors, but the idea applies to any system where replicas exchange operations.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Problem
&lt;/h2&gt;

&lt;p&gt;In distributed systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple replicas hold the same logical data&lt;/li&gt;
&lt;li&gt;Each replica can update independently&lt;/li&gt;
&lt;li&gt;Network delays are unavoidable&lt;/li&gt;
&lt;li&gt;Messages may arrive late or out of order&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When a user creates an operation, it is based on the &lt;strong&gt;current local state&lt;/strong&gt;. By the time the operation reaches another replica, that state may have changed.&lt;/p&gt;

&lt;p&gt;If we apply the operation without adjustment, inconsistencies can occur.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Operations Break
&lt;/h2&gt;

&lt;p&gt;Consider a replicated list:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[A B C D]
Indexes: 0 1 2 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two users edit at the same time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User 1 → Delete(1)      // removes B
User 2 → Insert(2, X)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both operations are valid locally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Replica 1:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Delete(1) → [A C D]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Replica 2:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Insert(2, X) → [A B X C D]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now Replica 1 receives &lt;code&gt;Insert(2, X)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;But index &lt;code&gt;2&lt;/code&gt; no longer points to the same position as before. The operation’s assumption about the structure is now wrong.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Operational Transformation Does
&lt;/h2&gt;

&lt;p&gt;Operational Transformation prevents this issue by &lt;strong&gt;rewriting incoming operations&lt;/strong&gt; so they match the current state.&lt;/p&gt;

&lt;p&gt;Instead of merging state, OT systems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detect concurrent operations&lt;/li&gt;
&lt;li&gt;Transform late operations&lt;/li&gt;
&lt;li&gt;Adjust parameters (like indexes)&lt;/li&gt;
&lt;li&gt;Apply safely&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No updates are discarded, and replicas remain consistent.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Key Idea Behind OT
&lt;/h2&gt;

&lt;p&gt;At the center of OT is the transformation function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;T(op_incoming, op_existing) → transformed_op
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Meaning:&lt;/p&gt;

&lt;p&gt;Before applying a remote operation, transform it relative to operations already applied.&lt;/p&gt;

&lt;p&gt;Goal:&lt;br&gt;
Ensure the operation still represents the user’s original intent.&lt;/p&gt;


&lt;h3&gt;
  
  
  Example — Insert vs Insert
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Initial state&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[A B C D]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Concurrent operations&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;op1 = Insert(1, X)
op2 = Insert(1, Y)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both operations target the same position.&lt;/p&gt;

&lt;p&gt;If applied blindly, replicas may diverge depending on arrival order.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Replica 1&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Apply op1 -&amp;gt; [A X B C D]
Apply op2 -&amp;gt; [A Y X B C D]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Replica 2&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Apply op2 -&amp;gt; [A Y B C D]
Apply op1 -&amp;gt; [A X Y B C D]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Final states differ -&amp;gt; &lt;strong&gt;divergence&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  OT Resolution Requires Deterministic Ordering
&lt;/h3&gt;

&lt;p&gt;Assume a deterministic rule:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;op1 &amp;lt; op2   (example: lower site ID or earlier timestamp)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Transform op2 against op1:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Transform Insert(1, Y) against Insert(1, X)
-&amp;gt; Insert(1 + length(X), Y)
-&amp;gt; Insert(2, Y)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Apply Operations Safely
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Apply op1 -&amp;gt; [A X B C D]
Apply transformed op2 -&amp;gt; [A X Y B C D]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  If Ordering Were Reversed
&lt;/h3&gt;

&lt;p&gt;If the rule says:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;op2 &amp;lt; op1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Transform op1 against op2:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Insert(1, X) -&amp;gt; Insert(2, X)
Final -&amp;gt; [A Y X B C D]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
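&lt;p&gt;The two insert-vs-insert cases above can be sketched as a single transform function in Python. This is an illustrative sketch, not a production OT engine; representing each operation as an &lt;code&gt;(index, value, site)&lt;/code&gt; tuple with a site-ID tie-break is an assumption for the example.&lt;/p&gt;

```python
# Sketch of the insert-vs-insert rule (illustrative, single-element operations).
# Each op is (index, value, site_id); site_id provides the deterministic ordering.

def transform_insert_insert(op, against):
    """Rewrite `op` so it can be applied after `against` has been applied."""
    idx, val, site = op
    a_idx, _a_val, a_site = against
    if a_idx != idx:
        # The other insert landed at an earlier index: shift right by one.
        shift = 1 if idx > a_idx else 0
    else:
        # Same index: the lower site ID keeps the slot, the other shifts right.
        shift = 1 if site > a_site else 0
    return (idx + shift, val, site)

op1 = (1, "X", 1)   # Insert(1, X) from site 1
op2 = (1, "Y", 2)   # Insert(1, Y) from site 2
assert transform_insert_insert(op2, op1) == (2, "Y", 2)  # becomes Insert(2, Y)
assert transform_insert_insert(op1, op2) == (1, "X", 1)  # op1 wins the slot
```

&lt;p&gt;With this rule, both replicas apply &lt;code&gt;op1&lt;/code&gt; at index 1 and the transformed &lt;code&gt;op2&lt;/code&gt; at index 2, converging on &lt;code&gt;[A X Y B C D]&lt;/code&gt;.&lt;/p&gt;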






&lt;h3&gt;
  
  
  Example — Insert vs Delete
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Initial state&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[A B C D]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Concurrent operations&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;op1 = Delete(1)
op2 = Insert(2, X)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If deletion happens first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Delete(1) -&amp;gt; [A C D]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One element before the insert position is gone. The insert must shift left:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Insert(2 - 1, X) -&amp;gt; Insert(1, X)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[A X C D]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If insertion happens first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Insert(2, X) -&amp;gt; [A B X C D]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The deletion still works unchanged: the insert happened at a higher index, so &lt;code&gt;Delete(1)&lt;/code&gt; still points at the same element, and both orders converge on &lt;code&gt;[A X C D]&lt;/code&gt;.&lt;/p&gt;
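&lt;p&gt;Both directions of the insert-vs-delete rule can be sketched as small helpers (illustrative functions for single-element operations; the names are assumptions for this example):&lt;/p&gt;

```python
# Sketch of the insert-vs-delete rules above (single-element operations).

def transform_insert_against_delete(ins_idx, del_idx):
    # A concurrent delete before the insert position shifts the insert left.
    return ins_idx - 1 if ins_idx > del_idx else ins_idx

def transform_delete_against_insert(del_idx, ins_idx):
    # A concurrent insert at or before the delete position shifts the delete right.
    return del_idx + 1 if del_idx >= ins_idx else del_idx

# op1 = Delete(1), op2 = Insert(2, X) on [A B C D]
assert transform_insert_against_delete(2, 1) == 1   # Insert(2, X) becomes Insert(1, X)
assert transform_delete_against_insert(1, 2) == 1   # Delete(1) is unaffected
```

&lt;p&gt;Either way the operations arrive, both replicas end at &lt;code&gt;[A X C D]&lt;/code&gt;.&lt;/p&gt;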




&lt;h3&gt;
  
  
  Example — Concurrent Deletes
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Initial state&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[A B C D]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Concurrent operations&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;op1 = Delete(1)
op2 = Delete(2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Without transformation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Delete(1) -&amp;gt; [A C D]
Delete(2) -&amp;gt; removes D (wrong)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With OT:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Transform Delete(2) against Delete(1)
-&amp;gt; Delete(2 - 1)
-&amp;gt; Delete(1)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Final state:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[A D]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both deletions behave as intended.&lt;/p&gt;
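&lt;p&gt;The delete-vs-delete adjustment can be sketched the same way. One subtlety worth encoding: if both replicas deleted the &lt;em&gt;same&lt;/em&gt; element, the second delete must become a no-op (returned as &lt;code&gt;None&lt;/code&gt; in this illustrative sketch):&lt;/p&gt;

```python
# Sketch of the delete-vs-delete rule. Deleting at a lower index shifts a
# concurrent delete one slot left; equal indexes collapse to a no-op.

def transform_delete_delete(del_idx, against_idx):
    if del_idx > against_idx:
        return del_idx - 1          # an earlier element is gone: shift left
    if del_idx == against_idx:
        return None                 # both deleted the same element: drop the op
    return del_idx

state = ["A", "B", "C", "D"]
state.pop(1)                                   # Delete(1) removes B
t = transform_delete_delete(2, 1)              # Delete(2) becomes Delete(1)
state.pop(t)                                   # removes C, as intended
assert state == ["A", "D"]
```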




&lt;h2&gt;
  
  
  Final State Property
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;All operations are preserved or correctly adjusted&lt;/li&gt;
&lt;li&gt;Deterministic rules guarantee convergence&lt;/li&gt;
&lt;li&gt;Replicas reach identical states despite different arrival orders&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What Correctness Means in OT
&lt;/h2&gt;

&lt;p&gt;A correct OT system must preserve three important properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Convergence&lt;/strong&gt; – All replicas reach the same state&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intention Preservation&lt;/strong&gt; – User actions keep their meaning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Causality Preservation&lt;/strong&gt; – Dependencies are respected&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These guarantees prevent subtle divergence bugs.&lt;/p&gt;

&lt;h2&gt;
  
  
  High‑Level Execution Flow
&lt;/h2&gt;

&lt;p&gt;Typical OT workflow:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User generates a local operation&lt;/li&gt;
&lt;li&gt;Apply it locally immediately (low latency)&lt;/li&gt;
&lt;li&gt;Send it to other replicas&lt;/li&gt;
&lt;li&gt;On receiving a remote operation:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Transform it against concurrent operations&lt;/li&gt;
&lt;li&gt;Apply the transformed version&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Transformation ensures safety under concurrency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Transformation Is Necessary
&lt;/h2&gt;

&lt;p&gt;Operations depend on structure. When the structure changes, operations may become invalid due to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Index shifts&lt;/li&gt;
&lt;li&gt;Element movement&lt;/li&gt;
&lt;li&gt;Structural changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;OT continuously adjusts operations so they remain correct.&lt;/p&gt;




&lt;h2&gt;
  
  
  Advantages of OT
&lt;/h2&gt;

&lt;p&gt;Operational Transformation enables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Immediate local updates&lt;/li&gt;
&lt;li&gt;Smooth concurrent editing&lt;/li&gt;
&lt;li&gt;Low metadata overhead&lt;/li&gt;
&lt;li&gt;Natural user experience&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These benefits made OT popular in early collaborative systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Challenges &amp;amp; Limitations
&lt;/h2&gt;

&lt;p&gt;OT is conceptually simple but hard to implement correctly.&lt;/p&gt;

&lt;p&gt;Common difficulties include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Complex transformation rules&lt;/li&gt;
&lt;li&gt;Many edge cases&lt;/li&gt;
&lt;li&gt;Subtle correctness bugs&lt;/li&gt;
&lt;li&gt;Difficult testing and verification&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Incorrect logic can cause silent divergence between replicas.&lt;/p&gt;




&lt;h2&gt;
  
  
  OT vs Naive Conflict Handling
&lt;/h2&gt;

&lt;p&gt;Simpler systems often:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Overwrite updates&lt;/li&gt;
&lt;li&gt;Reject concurrent changes&lt;/li&gt;
&lt;li&gt;Use last‑write‑wins rules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;OT avoids these issues by transforming operations instead of discarding them.&lt;/p&gt;

&lt;h2&gt;
  
  
  OT vs CRDT (Conceptual Difference)
&lt;/h2&gt;

&lt;p&gt;Both OT and CRDTs aim for replica convergence but follow different strategies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Operational Transformation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Focuses on rewriting operations&lt;/li&gt;
&lt;li&gt;Sensitive to ordering&lt;/li&gt;
&lt;li&gt;Lower metadata overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;CRDTs&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Focus on specially designed data structures&lt;/li&gt;
&lt;li&gt;Ordering‑independent merges&lt;/li&gt;
&lt;li&gt;Higher metadata overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both approaches are valid depending on system requirements.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Perspective
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;A simple way to understand OT:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Before applying a remote operation, rewrite it so it makes sense for the current state.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Or even more intuitively:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Fix the coordinates before executing the command.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Operational Transformation is a good fit when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Systems exchange operations rather than full state&lt;/li&gt;
&lt;li&gt;Updates depend on positional context&lt;/li&gt;
&lt;li&gt;Low metadata overhead is desired&lt;/li&gt;
&lt;li&gt;Some central coordination is acceptable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most commonly seen in collaborative editing systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Central Insight of OT:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Operational Transformation does not pick a winning update. Instead, it keeps &lt;strong&gt;all valid operations&lt;/strong&gt; and modifies them only when needed. Conflicts are handled by making operations compatible rather than rejecting them. At its core, Operational Transformation is about &lt;strong&gt;keeping operations valid&lt;/strong&gt; in a constantly changing replicated system.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Instead of rejecting conflicts or merging full state, OT reshapes operations to guarantee:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consistency&lt;/li&gt;
&lt;li&gt;Convergence&lt;/li&gt;
&lt;li&gt;Preservation of user intent&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>algorithms</category>
      <category>systemdesign</category>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Conflict-free Replicated Data Types (CRDTs)</title>
      <dc:creator>Vikas Kumar</dc:creator>
      <pubDate>Wed, 11 Feb 2026 08:32:40 +0000</pubDate>
      <link>https://dev.to/learnwithvikzzy/conflict-free-replicated-data-types-crdts-ij6</link>
      <guid>https://dev.to/learnwithvikzzy/conflict-free-replicated-data-types-crdts-ij6</guid>
      <description>&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;Modern distributed systems frequently &lt;strong&gt;replicate data&lt;/strong&gt; across multiple machines, regions, or user devices. Replication is a fundamental design choice that improves system behavior and user experience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why replication matters:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;High availability&lt;/strong&gt; – the system continues working even if some nodes fail&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Low latency&lt;/strong&gt; – users interact with nearby replicas&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Offline support&lt;/strong&gt; – devices can operate while disconnected&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fault tolerance&lt;/strong&gt; – redundancy prevents data loss&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Fundamental Challenge
&lt;/h3&gt;

&lt;p&gt;Replication introduces a critical question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What happens when multiple replicas modify the same data concurrently?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In distributed environments, concurrent updates are not an edge case — they are the norm.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Core Problem in Distributed Systems
&lt;/h2&gt;

&lt;p&gt;Distributed systems inherently operate under imperfect conditions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Nodes maintain independent copies of data&lt;/li&gt;
&lt;li&gt;Network partitions and disconnections occur&lt;/li&gt;
&lt;li&gt;Updates may happen at the same time&lt;/li&gt;
&lt;li&gt;Messages can be delayed or reordered&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without careful design, these realities can cause:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Conflicts&lt;/strong&gt; between updates&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Lost updates&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Diverging replicas&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Inconsistent system state&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Traditional Approaches: Coordination
&lt;/h3&gt;

&lt;p&gt;Classic distributed system designs rely on &lt;strong&gt;coordination mechanisms&lt;/strong&gt; to preserve correctness:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Locks&lt;/li&gt;
&lt;li&gt;Leader-based systems&lt;/li&gt;
&lt;li&gt;Consensus protocols (e.g., Paxos, Raft)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While effective, coordination introduces trade‑offs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increased latency&lt;/li&gt;
&lt;li&gt;Reduced availability during failures&lt;/li&gt;
&lt;li&gt;Higher system complexity&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Correctness is preserved, but performance and resilience may suffer.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  A Different Perspective: CRDTs
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Conflict-free Replicated Data Types (CRDTs)&lt;/strong&gt; take a fundamentally different approach.&lt;/p&gt;

&lt;p&gt;Instead of preventing conflicts through coordination, CRDTs are designed so that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Concurrent updates are &lt;strong&gt;expected&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Conflicts are &lt;strong&gt;mathematically impossible&lt;/strong&gt; or &lt;strong&gt;automatically resolved&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Replicas &lt;strong&gt;always converge&lt;/strong&gt; to the same state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This enables systems that remain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Highly available&lt;/li&gt;
&lt;li&gt;Low latency&lt;/li&gt;
&lt;li&gt;Partition tolerant&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;CRDTs shift the burden from runtime coordination to &lt;strong&gt;data structure design&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is a CRDT?
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;Conflict-free Replicated Data Type (CRDT)&lt;/strong&gt; is a data structure specifically designed for distributed systems where multiple replicas may update data independently.&lt;/p&gt;

&lt;p&gt;A CRDT ensures that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replicas can update data &lt;strong&gt;independently&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Replicas can &lt;strong&gt;merge safely&lt;/strong&gt; without coordination&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Conflicts do not occur&lt;/strong&gt; (by design)&lt;/li&gt;
&lt;li&gt;All replicas &lt;strong&gt;eventually converge&lt;/strong&gt; to the same state&lt;/li&gt;
&lt;li&gt;No &lt;strong&gt;central coordinator&lt;/strong&gt; or locking mechanism is required&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;CRDTs provide &lt;strong&gt;strong eventual consistency&lt;/strong&gt; through deterministic merge rules.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why CRDTs Work
&lt;/h2&gt;

&lt;p&gt;CRDTs rely on mathematically defined merge operations with three critical properties:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Commutative
&lt;/h3&gt;

&lt;p&gt;The order of merging does not matter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;merge(A, B) = merge(B, A)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Associative
&lt;/h3&gt;

&lt;p&gt;The grouping of merges does not matter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;merge(A, merge(B, C)) = merge(merge(A, B), C)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Idempotent
&lt;/h3&gt;

&lt;p&gt;Repeating merges is safe and produces no side effects.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;merge(A, A) = A
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because CRDT merge operations satisfy these properties, replicas &lt;strong&gt;always converge&lt;/strong&gt;, regardless of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Message delays&lt;/li&gt;
&lt;li&gt;Network partitions&lt;/li&gt;
&lt;li&gt;Duplicate updates&lt;/li&gt;
&lt;li&gt;Out-of-order delivery&lt;/li&gt;
&lt;/ul&gt;
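&lt;p&gt;These three properties can be checked on a concrete merge function. The sketch below uses an elementwise maximum over per-replica maps, the same merge that state-based counters use; the replica keys are made up for the example.&lt;/p&gt;

```python
# Elementwise-max merge over per-replica maps: commutative, associative, idempotent.

def merge(a, b):
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in set(a) | set(b)}

A = {"r1": 3, "r2": 0}
B = {"r1": 1, "r2": 5}
C = {"r3": 2}

assert merge(A, B) == merge(B, A)                       # commutative
assert merge(A, merge(B, C)) == merge(merge(A, B), C)   # associative
assert merge(A, A) == A                                 # idempotent
```

&lt;p&gt;Because every interleaving of these calls produces the same result, delayed, duplicated, or reordered synchronization messages cannot cause divergence.&lt;/p&gt;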




&lt;h2&gt;
  
  
  Two Main Types of CRDTs
&lt;/h2&gt;

&lt;h3&gt;
  
  
  State-Based CRDTs (Convergent Replicated Data Types)
&lt;/h3&gt;

&lt;p&gt;Replicas exchange their &lt;strong&gt;entire state&lt;/strong&gt; during synchronization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How they work:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Each replica updates its local state independently&lt;/li&gt;
&lt;li&gt;Replicas periodically share their full state&lt;/li&gt;
&lt;li&gt;A deterministic &lt;strong&gt;merge function&lt;/strong&gt; combines states&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Key characteristics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple to reason about&lt;/li&gt;
&lt;li&gt;Naturally resilient to message duplication&lt;/li&gt;
&lt;li&gt;Robust under unreliable networks&lt;/li&gt;
&lt;li&gt;Larger messages due to full-state transfer&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Operation-Based CRDTs (Commutative Replicated Data Types)
&lt;/h3&gt;

&lt;p&gt;Replicas exchange &lt;strong&gt;operations&lt;/strong&gt; instead of full state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How they work:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Replicas generate operations (add, remove, insert, etc.)&lt;/li&gt;
&lt;li&gt;Operations are broadcast to other replicas&lt;/li&gt;
&lt;li&gt;Operations are designed to &lt;strong&gt;commute&lt;/strong&gt; safely&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Key characteristics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More bandwidth-efficient&lt;/li&gt;
&lt;li&gt;Lower message size&lt;/li&gt;
&lt;li&gt;Requires reliable delivery assumptions&lt;/li&gt;
&lt;li&gt;More complex to design correctly&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Example 1 — Distributed Counter
&lt;/h2&gt;

&lt;p&gt;Assume two replicas start with the same value:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Value = 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both replicas go offline and update independently:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Replica A increments → +1&lt;/li&gt;
&lt;li&gt;Replica B increments → +1&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After synchronization, the correct final value should be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Value = 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  How a CRDT Counter Solves This
&lt;/h3&gt;

&lt;p&gt;Instead of storing a single integer, each replica maintains &lt;strong&gt;per-replica state&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Replica A → { A: 1, B: 0 }
Replica B → { A: 0, B: 1 }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Merge rule:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Take the maximum value for each replica slot
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Merged result:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{ A: 1, B: 1 } → Value = 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No updates are lost, even without coordination.&lt;/p&gt;
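&lt;p&gt;The per-replica counter above can be sketched as a small G-Counter class (a minimal illustration; the replica IDs &lt;code&gt;"A"&lt;/code&gt; and &lt;code&gt;"B"&lt;/code&gt; follow the example):&lt;/p&gt;

```python
class GCounter:
    """Grow-only counter: one slot per replica, merged by elementwise max."""

    def __init__(self, replica_id):
        self.id = replica_id
        self.slots = {}

    def increment(self, n=1):
        # Each replica only ever increments its own slot.
        self.slots[self.id] = self.slots.get(self.id, 0) + n

    def merge(self, other):
        for k, v in other.slots.items():
            self.slots[k] = max(self.slots.get(k, 0), v)

    def value(self):
        return sum(self.slots.values())

a, b = GCounter("A"), GCounter("B")
a.increment()          # offline update on replica A
b.increment()          # offline update on replica B
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 2   # no update lost, no coordination needed
```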

&lt;h3&gt;
  
  
  PN-Counter (Supports Decrements)
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;PN-Counter&lt;/strong&gt; extends the basic counter to support decrements.&lt;/p&gt;

&lt;p&gt;It internally maintains two counters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One for increments (P = Positive)&lt;/li&gt;
&lt;li&gt;One for decrements (N = Negative)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Final value calculation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;value = increments − decrements
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This preserves convergence while allowing both operations.&lt;/p&gt;
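&lt;p&gt;A PN-Counter can be sketched as two grow-only maps merged independently (an illustration of the P/N split above, not a library implementation):&lt;/p&gt;

```python
class PNCounter:
    """Two grow-only maps: P tracks increments, N tracks decrements."""

    def __init__(self, replica_id):
        self.id = replica_id
        self.p = {}   # increments per replica
        self.n = {}   # decrements per replica

    def increment(self):
        self.p[self.id] = self.p.get(self.id, 0) + 1

    def decrement(self):
        self.n[self.id] = self.n.get(self.id, 0) + 1

    def merge(self, other):
        # Both halves merge with the same elementwise-max rule.
        for src, dst in ((other.p, self.p), (other.n, self.n)):
            for k, v in src.items():
                dst[k] = max(dst.get(k, 0), v)

    def value(self):
        return sum(self.p.values()) - sum(self.n.values())

a, b = PNCounter("A"), PNCounter("B")
a.increment(); a.increment()
b.decrement()
a.merge(b)
assert a.value() == 1   # 2 increments minus 1 decrement
```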




&lt;h2&gt;
  
  
  Example 2 — Concurrent Text Editing
&lt;/h2&gt;

&lt;p&gt;Initial text:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Hello World
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two users edit concurrently at the same logical position:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User A inserts "vikas" after "Hello "&lt;/li&gt;
&lt;li&gt;User B inserts "nannu" at the same place&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why Traditional Systems Struggle
&lt;/h3&gt;

&lt;p&gt;If edits rely purely on numeric indexes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Both target index 6&lt;/li&gt;
&lt;li&gt;Order of arrival affects result&lt;/li&gt;
&lt;li&gt;One update may overwrite the other&lt;/li&gt;
&lt;li&gt;Replicas may diverge&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How CRDTs Fix This
&lt;/h3&gt;

&lt;p&gt;CRDT-based editors avoid fragile positional indexes.&lt;/p&gt;

&lt;p&gt;Instead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Every character is assigned a &lt;strong&gt;unique identifier&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Insertions occur relative to identifiers, not indexes&lt;/li&gt;
&lt;li&gt;Concurrent inserts are preserved by design&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Possible merged results:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Hello vikasnannu World
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;or&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Hello nannuvikas World
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The exact order depends on deterministic rules, but &lt;strong&gt;all replicas agree&lt;/strong&gt; on the same result.&lt;/p&gt;
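&lt;p&gt;The identifier idea can be sketched in miniature. The scheme below (position, site, offset) is a made-up toy, far simpler than real sequence CRDTs such as RGA or Logoot, but it shows why a union of uniquely identified characters sorted by a deterministic key converges regardless of merge order:&lt;/p&gt;

```python
# Toy identifier-based text: each character is keyed by a unique, totally
# ordered ID (base_index, site, offset). Merging is set union; rendering
# sorts by ID, so every replica produces the same string.

def render(chars):
    return "".join(ch for _id, ch in sorted(chars.items()))

# "Hello World": base characters authored by site "0".
base = {(i, "0", 0): ch for i, ch in enumerate("Hello World")}

# Concurrent inserts after "Hello " (between base indexes 5 and 6).
edit_a = {(5, "A", i): ch for i, ch in enumerate("vikas", 1)}
edit_b = {(5, "B", i): ch for i, ch in enumerate("nannu", 1)}

r1 = {**base, **edit_a, **edit_b}   # replica 1 merges A's edit, then B's
r2 = {**base, **edit_b, **edit_a}   # replica 2 merges in the opposite order
assert render(r1) == render(r2)     # identical result on both replicas
```

&lt;p&gt;Site &lt;code&gt;"A"&lt;/code&gt; sorts before &lt;code&gt;"B"&lt;/code&gt;, so both replicas place "vikas" before "nannu"; the point is not which order wins, but that the rule is deterministic.&lt;/p&gt;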




&lt;h2&gt;
  
  
  CRDT Data Structure Categories
&lt;/h2&gt;

&lt;p&gt;CRDTs are not limited to a single data model. They exist for many common data structures, enabling safe replication across a wide range of application needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Registers
&lt;/h3&gt;

&lt;p&gt;Registers store a &lt;strong&gt;single value&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt; Last-Write-Wins (LWW) Register&lt;br&gt;
Merge rule: choose the value with the latest timestamp.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Configuration values&lt;/li&gt;
&lt;li&gt;User profile fields&lt;/li&gt;
&lt;li&gt;Simple shared state&lt;/li&gt;
&lt;/ul&gt;
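&lt;p&gt;An LWW register can be sketched in a few lines. Breaking timestamp ties with the replica ID (an assumption added here) keeps the winner deterministic even when two writes share a timestamp:&lt;/p&gt;

```python
class LWWRegister:
    """Last-Write-Wins register: highest (timestamp, replica_id) stamp wins."""

    def __init__(self):
        self.value = None
        self.stamp = (0, "")

    def set(self, value, timestamp, replica_id):
        self._take(value, (timestamp, replica_id))

    def merge(self, other):
        self._take(other.value, other.stamp)

    def _take(self, value, stamp):
        if stamp > self.stamp:
            self.value, self.stamp = value, stamp

r1, r2 = LWWRegister(), LWWRegister()
r1.set("dark", timestamp=1, replica_id="A")
r2.set("light", timestamp=2, replica_id="B")
r1.merge(r2); r2.merge(r1)
assert r1.value == r2.value == "light"   # latest write wins on every replica
```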

&lt;h3&gt;
  
  
  Counters
&lt;/h3&gt;

&lt;p&gt;Counters track &lt;strong&gt;numeric updates&lt;/strong&gt; under concurrency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;G-Counter&lt;/strong&gt; (Grow-only) – supports increments only&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PN-Counter&lt;/strong&gt; (Positive-Negative) – supports increments and decrements&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Likes / views / reactions&lt;/li&gt;
&lt;li&gt;Distributed metrics&lt;/li&gt;
&lt;li&gt;Rate tracking&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Sets
&lt;/h3&gt;

&lt;p&gt;Sets maintain &lt;strong&gt;collections of elements&lt;/strong&gt; with safe concurrent modifications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;G-Set (Grow-only Set)&lt;/strong&gt; – elements can only be added&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OR-Set (Observed-Remove Set)&lt;/strong&gt; – supports add and remove safely&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tags / labels&lt;/li&gt;
&lt;li&gt;Membership tracking&lt;/li&gt;
&lt;li&gt;Feature flags&lt;/li&gt;
&lt;/ul&gt;
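&lt;p&gt;The OR-Set's add/remove safety comes from tagging each add uniquely and removing only the tags a replica has observed. A minimal sketch (tag generation via &lt;code&gt;uuid&lt;/code&gt; is one possible choice):&lt;/p&gt;

```python
import uuid

class ORSet:
    """Observed-Remove Set: removes only delete the add-tags they observed,
    so a concurrent re-add (with a fresh tag) survives the remove."""

    def __init__(self):
        self.adds = set()      # (element, tag) pairs
        self.removes = set()   # tombstoned (element, tag) pairs

    def add(self, element):
        self.adds.add((element, uuid.uuid4().hex))

    def remove(self, element):
        # Tombstone every tag for this element that we have seen so far.
        self.removes |= {pair for pair in self.adds if pair[0] == element}

    def merge(self, other):
        self.adds |= other.adds
        self.removes |= other.removes

    def contents(self):
        return {e for (e, tag) in self.adds - self.removes}

a, b = ORSet(), ORSet()
a.add("beta")
b.merge(a)
b.remove("beta")    # b removes the add it observed
a.add("beta")       # concurrently, a re-adds with a fresh tag
a.merge(b); b.merge(a)
assert a.contents() == b.contents() == {"beta"}   # concurrent add wins
```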

&lt;h3&gt;
  
  
  Maps / JSON Structures
&lt;/h3&gt;

&lt;p&gt;Complex objects can be built by composing smaller CRDTs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Idea:&lt;/strong&gt; Each field is itself a CRDT.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shared documents&lt;/li&gt;
&lt;li&gt;Application state&lt;/li&gt;
&lt;li&gt;Nested data models&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Sequences
&lt;/h3&gt;

&lt;p&gt;Sequences maintain &lt;strong&gt;ordered collections&lt;/strong&gt;, essential for collaborative editing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Text editors&lt;/li&gt;
&lt;li&gt;Real-time collaboration tools&lt;/li&gt;
&lt;li&gt;Ordered shared logs&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Handling Deletions
&lt;/h2&gt;

&lt;p&gt;Deletion is fundamentally harder than insertion in distributed systems.&lt;/p&gt;

&lt;p&gt;A common CRDT technique is the use of &lt;strong&gt;tombstones&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Elements are marked as deleted instead of removed&lt;/li&gt;
&lt;li&gt;Metadata is preserved for correct merging&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Trade-off:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increased storage / metadata overhead&lt;/li&gt;
&lt;li&gt;Guaranteed convergence and correctness&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What CRDTs Guarantee
&lt;/h2&gt;

&lt;p&gt;CRDT-based systems provide strong distributed safety properties:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;No lost updates&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;No manual conflict resolution&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Eventual convergence&lt;/strong&gt; across replicas&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High availability&lt;/strong&gt; under failures&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partition tolerance&lt;/strong&gt; by design&lt;/li&gt;
&lt;li&gt;No locks, leaders, or coordination required&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Advantages of CRDTs
&lt;/h2&gt;

&lt;p&gt;CRDTs are powerful because they naturally align with distributed environments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Allow independent replica updates&lt;/li&gt;
&lt;li&gt;Operate correctly under offline conditions&lt;/li&gt;
&lt;li&gt;Eliminate complex conflict resolution logic&lt;/li&gt;
&lt;li&gt;Scale efficiently across regions&lt;/li&gt;
&lt;li&gt;Reduce coordination overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Limitations of CRDTs
&lt;/h2&gt;

&lt;p&gt;CRDTs are not universally applicable. Practical challenges include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Metadata growth over time&lt;/li&gt;
&lt;li&gt;Memory and storage overhead&lt;/li&gt;
&lt;li&gt;Non-intuitive ordering behavior&lt;/li&gt;
&lt;li&gt;Difficulty enforcing strict invariants&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Poor fit for systems requiring:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strong consistency guarantees&lt;/li&gt;
&lt;li&gt;Global ordering constraints&lt;/li&gt;
&lt;li&gt;Complex transactional invariants&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Banking systems&lt;/li&gt;
&lt;li&gt;Financial ledgers&lt;/li&gt;
&lt;li&gt;Strictly serialized workflows&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  CRDTs vs Strong Consistency Systems
&lt;/h2&gt;

&lt;p&gt;Two contrasting design philosophies exist in distributed systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strong Consistency Systems:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use consensus protocols&lt;/li&gt;
&lt;li&gt;Enforce global ordering&lt;/li&gt;
&lt;li&gt;Provide immediate consistency&lt;/li&gt;
&lt;li&gt;Typically incur higher latency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;CRDT-Based Systems:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Avoid coordination&lt;/li&gt;
&lt;li&gt;Accept eventual consistency&lt;/li&gt;
&lt;li&gt;Prioritize availability and latency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The correct choice depends entirely on application requirements.&lt;/p&gt;




&lt;h2&gt;
  
  
  Ideal Use Cases for CRDTs
&lt;/h2&gt;

&lt;p&gt;CRDTs work best in environments where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Concurrent updates are common&lt;/li&gt;
&lt;li&gt;Offline operation is expected&lt;/li&gt;
&lt;li&gt;Low latency is critical&lt;/li&gt;
&lt;li&gt;Eventual consistency is acceptable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Examples:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Collaborative editors&lt;/li&gt;
&lt;li&gt;Offline-first applications&lt;/li&gt;
&lt;li&gt;Distributed counters&lt;/li&gt;
&lt;li&gt;Edge / multi-device systems&lt;/li&gt;
&lt;li&gt;Shared state applications&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;CRDTs do not resolve conflicts after they occur. They &lt;strong&gt;prevent conflicts by design&lt;/strong&gt;. Every update is structured so merging is always deterministic and safe.&lt;/p&gt;

&lt;p&gt;A helpful way to reason about CRDTs:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Replicas never fight over updates.&lt;br&gt;
They record changes independently and merge deterministically.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;CRDTs represent an elegant shift in distributed system design:&lt;/p&gt;

&lt;p&gt;Instead of coordinating every update, replicas evolve independently while still guaranteeing convergence.&lt;/p&gt;

&lt;p&gt;They are especially valuable in modern systems where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Offline usage is normal&lt;/li&gt;
&lt;li&gt;Latency directly impacts user experience&lt;/li&gt;
&lt;li&gt;Global coordination is expensive&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Used appropriately, CRDTs dramatically simplify distributed data management while improving system resilience.&lt;/p&gt;




</description>
      <category>algorithms</category>
      <category>computerscience</category>
      <category>database</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Design HLD - Recommendation System</title>
      <dc:creator>Vikas Kumar</dc:creator>
      <pubDate>Sun, 08 Feb 2026 07:33:44 +0000</pubDate>
      <link>https://dev.to/learnwithvikzzy/design-hld-recomendation-sytem-4c9p</link>
      <guid>https://dev.to/learnwithvikzzy/design-hld-recomendation-sytem-4c9p</guid>
      <description>&lt;h2&gt;
  
  
  About - Recommendation System
&lt;/h2&gt;

&lt;p&gt;A recommendation system is a service that predicts and ranks items a user is most likely to engage with, based on their behavior, preferences, and context. It helps users discover relevant content at scale while optimizing business goals like engagement, retention, or revenue.&lt;/p&gt;




&lt;h2&gt;
  
  
  Requirements
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Functional Requirements
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Support &lt;strong&gt;personalized recommendations&lt;/strong&gt; for users.&lt;/li&gt;
&lt;li&gt;Support &lt;strong&gt;homepage, “Up Next”&lt;/strong&gt;, and contextual recommendations.&lt;/li&gt;
&lt;li&gt;Support &lt;strong&gt;hybrid recommendation&lt;/strong&gt; strategies (collaborative + content-based).&lt;/li&gt;
&lt;li&gt;Support &lt;strong&gt;real-time personalization&lt;/strong&gt; using recent user interactions.&lt;/li&gt;
&lt;li&gt;Support &lt;strong&gt;large-scale candidate generation&lt;/strong&gt; from billions of items.&lt;/li&gt;
&lt;li&gt;Support &lt;strong&gt;multi-stage ranking&lt;/strong&gt; (candidate generation, scoring, re-ranking).&lt;/li&gt;
&lt;li&gt;Support &lt;strong&gt;cold-start&lt;/strong&gt; handling for new users and new items.&lt;/li&gt;
&lt;li&gt;Support &lt;strong&gt;business-rule&lt;/strong&gt; and &lt;strong&gt;policy-based&lt;/strong&gt; re-ranking.&lt;/li&gt;
&lt;li&gt;Support &lt;strong&gt;tracking of user interactions&lt;/strong&gt; and feedback signals.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Non-Functional Requirements
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Highly available&lt;/strong&gt; and &lt;strong&gt;fault tolerant&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Low-latency&lt;/strong&gt; recommendation serving (sub-200 ms).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High throughput&lt;/strong&gt; for large-scale user traffic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Horizontally scalable&lt;/strong&gt; with growing users and content.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Real-time freshness&lt;/strong&gt; of recommendations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistent user experience&lt;/strong&gt; across devices and regions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-efficient&lt;/strong&gt; operation at scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secure access&lt;/strong&gt; to user and content data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Observability&lt;/strong&gt; for model performance and system health.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Key Concepts You Must Know
&lt;/h2&gt;

&lt;h4&gt;
  
  
  What Is an Embedding?
&lt;/h4&gt;

&lt;p&gt;An embedding is a way to convert users and items (videos, products, songs) into numbers that computers can compare.&lt;/p&gt;

&lt;p&gt;A user embedding represents what a user likes. An item embedding represents what an item is about. If a user and an item have similar embeddings, the system assumes the user may like that item.&lt;/p&gt;

&lt;p&gt;Think of embeddings like coordinates on a map. Users and items that are close on the map are considered a good match. This is the foundation of modern recommendation systems.&lt;/p&gt;
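&lt;p&gt;The “coordinates on a map” intuition can be shown with tiny 3-dimensional vectors (real embeddings have hundreds of dimensions; the numbers here are made up):&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    """Similarity of two embeddings: 1.0 means pointing the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

user = [0.9, 0.1, 0.2]            # a user who mostly watches cooking
cooking_video = [0.8, 0.2, 0.1]
gaming_video = [0.1, 0.9, 0.3]

# The cooking video is closer to the user, so it is the better match.
```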

&lt;h4&gt;
  
  
  What Is Candidate Generation?
&lt;/h4&gt;

&lt;p&gt;The system has billions of items, but it cannot look at all of them for every user.&lt;/p&gt;

&lt;p&gt;So the first step is candidate generation:&lt;br&gt;
Quickly select a small shortlist (usually a few thousand items). These items are possibly relevant to the user. Speed is more important than accuracy here.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
From 10 billion videos → pick 10,000 “maybe interesting” videos&lt;br&gt;
If a good item is not picked here, it will never be recommended later.&lt;/p&gt;
&lt;h4&gt;
  
  
  Candidate Generation vs Ranking
&lt;/h4&gt;

&lt;p&gt;Candidate Generation answers “What are some items this user might like?” It is fast, rough, and recall-focused.&lt;/p&gt;

&lt;p&gt;Ranking answers “Out of these candidates, which ones are the best?” It is slower, smarter, and accuracy-focused.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;You cannot fix a bad candidate list with ranking later.&lt;/em&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Multi-Stage Recommendation Architecture
&lt;/h4&gt;

&lt;p&gt;Real systems do not use one big model.&lt;/p&gt;

&lt;p&gt;They use multiple stages: Candidate Generation (thousands), Light Ranking (hundreds), Heavy Ranking (dozens), Re-Ranking (final list)&lt;/p&gt;

&lt;p&gt;Each stage looks at fewer items, uses more computation, and improves quality.&lt;/p&gt;

&lt;p&gt;This is how systems stay fast and scalable.&lt;/p&gt;
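&lt;p&gt;The funnel can be sketched in a few lines of Python. Everything here is illustrative: the stage sizes, the scoring fields, and the one-item-per-creator rule are stand-ins for real models and policies:&lt;/p&gt;

```python
def recommend(user_pref, catalog, limit=3):
    """Multi-stage funnel sketch: each stage looks at fewer items and
    applies a more expensive score."""
    # Stage 1: candidate generation, a cheap recall-focused filter.
    candidates = [item for item in catalog if item["category"] == user_pref]
    # Stage 2: light ranking, a rough popularity score.
    shortlist = sorted(candidates, key=lambda i: i["views"], reverse=True)[:100]
    # Stage 3: heavy ranking, a stand-in for an expensive per-item model.
    ranked = sorted(shortlist, key=lambda i: i["quality"], reverse=True)
    # Stage 4: re-ranking with a business rule: at most one item per creator.
    seen, final = set(), []
    for item in ranked:
        if item["creator"] not in seen:
            final.append(item)
            seen.add(item["creator"])
    return final[:limit]
```

&lt;p&gt;Each stage narrows the pool before the next, more expensive stage runs, which is what keeps latency bounded.&lt;/p&gt;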
&lt;h4&gt;
  
  
  Collaborative vs Content-Based Recommendations
&lt;/h4&gt;

&lt;p&gt;Collaborative Filtering is based on user behavior (“Users like you watched this”) and does NOT need item details.&lt;/p&gt;

&lt;p&gt;Content-Based Filtering is based on item properties (“This video is about cooking, and you watch cooking videos”) and works even for new items.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Most real systems use both together (hybrid).&lt;/em&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Real-Time Signals vs Offline Signals
&lt;/h4&gt;

&lt;p&gt;Offline signals: Long-term behavior, Historical data, Stable preferences&lt;/p&gt;

&lt;p&gt;Real-time signals: What the user just watched, Recent clicks or searches, Current intent&lt;/p&gt;

&lt;p&gt;Good systems combine both: Offline = who the user is &amp;amp; Real-time = what the user wants right now&lt;/p&gt;
&lt;h4&gt;
  
  
  Cold Start Problem
&lt;/h4&gt;

&lt;p&gt;Cold start happens when: A new user joins (no history), A new item is added (no views)&lt;/p&gt;

&lt;p&gt;How systems handle it: Use item content (title, genre, tags), Use popularity or trending items, Use basic user info (language, location)&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Show items to small groups and learn quickly&lt;/em&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Exploration vs Exploitation
&lt;/h4&gt;

&lt;p&gt;Exploitation: Show items the system is confident the user will like&lt;/p&gt;

&lt;p&gt;Exploration: Occasionally show something new or uncertain&lt;/p&gt;

&lt;p&gt;Why exploration matters: it prevents boring, repetitive feeds, helps users discover new interests, and gives new items exposure.&lt;/p&gt;

&lt;p&gt;Usually: Top positions → safe choices, Lower positions → more exploration&lt;/p&gt;
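&lt;p&gt;One common way to mix the two is epsilon-greedy, sketched below (real systems often use bandit algorithms such as Thompson sampling instead):&lt;/p&gt;

```python
import random

def pick_item(ranked_items, candidate_pool, epsilon=0.1, rng=random):
    """Epsilon-greedy: usually exploit the top-ranked (safe) item, but
    with probability epsilon explore a random candidate instead."""
    if rng.random() >= epsilon:
        return ranked_items[0]         # exploit: the confident choice
    return rng.choice(candidate_pool)  # explore: something uncertain
```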
&lt;h4&gt;
  
  
  Feedback Loops &amp;amp; Popularity Bias
&lt;/h4&gt;

&lt;p&gt;Recommendations change user behavior.&lt;/p&gt;

&lt;p&gt;Problem: Popular items get shown more; more views bring even more recommendations; new or niche items get ignored. This is called popularity bias.&lt;/p&gt;

&lt;p&gt;Systems fix this by: Adding diversity, Limiting overexposure, Forcing exploration&lt;/p&gt;
&lt;h4&gt;
  
  
  Re-Ranking &amp;amp; Business Rules
&lt;/h4&gt;

&lt;p&gt;Even after ranking, the system may adjust results to: Increase diversity, Promote fresh content, Enforce policies (age, region, safety), Support creators or business goals&lt;/p&gt;

&lt;p&gt;This happens in the final re-ranking step.&lt;/p&gt;
&lt;h4&gt;
  
  
  Implicit vs Explicit Feedback
&lt;/h4&gt;

&lt;p&gt;Explicit feedback: likes and ratings. Clear but rare.&lt;/p&gt;

&lt;p&gt;Implicit feedback: watch time, clicks, skips. Noisy but abundant.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Most systems rely mainly on implicit feedback, because users rarely rate content.&lt;/em&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Approximate Nearest Neighbor (ANN) Search
&lt;/h4&gt;

&lt;p&gt;To find similar embeddings, exact comparison is too slow at scale. ANN finds “close enough” matches very fast.&lt;/p&gt;

&lt;p&gt;Trade-off: a slight accuracy loss for a huge speed gain.&lt;/p&gt;

&lt;p&gt;ANN is what makes large-scale recommendations possible.&lt;/p&gt;
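&lt;p&gt;Production systems rely on ANN libraries (FAISS, ScaNN, and similar); the core idea can be sketched with random-hyperplane hashing, where vectors on the same side of a few random hyperplanes land in the same bucket and only bucket-mates are compared exactly:&lt;/p&gt;

```python
import random

def random_hyperplanes(num_planes, dim, rng=random):
    """A few random directions; each one splits the space in half."""
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_planes)]

def bucket_key(vector, hyperplanes):
    """Sign pattern of dot products: similar vectors tend to match."""
    bits = []
    for plane in hyperplanes:
        dot = sum(x * y for x, y in zip(vector, plane))
        bits.append("1" if dot > 0 else "0")
    return "".join(bits)
```

&lt;p&gt;At query time, only items whose key matches the query’s key are scored exactly, trading a little accuracy for a large speed gain.&lt;/p&gt;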
&lt;h4&gt;
  
  
  Feature Freshness &amp;amp; Drift
&lt;/h4&gt;

&lt;p&gt;User interests change. Trends change. Models trained on old data become wrong.&lt;/p&gt;

&lt;p&gt;Systems must update features frequently, detect when data patterns change, and retrain or adjust models. Otherwise, recommendations silently degrade.&lt;/p&gt;
&lt;h4&gt;
  
  
  Observability for Recommendation Systems
&lt;/h4&gt;

&lt;p&gt;The system can be “up” but still be bad.&lt;/p&gt;

&lt;p&gt;So we monitor: Engagement (CTR, watch time), Model accuracy, Bias and diversity, Feature freshness&lt;/p&gt;

&lt;p&gt;Without observability, problems are discovered too late.&lt;/p&gt;
&lt;h4&gt;
  
  
  Simple Mental Model
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;User → Embedding → Candidate Generation → Ranking → Re-ranking → Recommendation&lt;/strong&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Capacity Estimation
&lt;/h2&gt;
&lt;h4&gt;
  
  
  Key Assumptions
&lt;/h4&gt;

&lt;p&gt;Daily Active Users (DAU): 100M&lt;br&gt;
Sessions per user per day: 5&lt;br&gt;
Recommendation surfaces per session: 2 (Homepage, Up Next)&lt;br&gt;
Items recommended per request: 10&lt;br&gt;
Total item catalog: ~10B&lt;br&gt;
Target latency: &amp;lt; 200 ms&lt;/p&gt;
&lt;h4&gt;
  
  
  Traffic Estimation
&lt;/h4&gt;

&lt;p&gt;Recommendation requests per user/day ⇒ 5 × 2 = 10&lt;br&gt;
Total requests/day ⇒ 100M × 10 = 1B requests/day&lt;br&gt;
Average QPS ⇒ ~12K requests/sec&lt;br&gt;
Peak QPS (5–10×) ⇒ ~60K–120K requests/sec&lt;/p&gt;
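&lt;p&gt;These figures are easy to re-derive with back-of-envelope arithmetic:&lt;/p&gt;

```python
# Back-of-envelope check of the traffic estimates above.
dau = 100_000_000
requests_per_user_per_day = 5 * 2                   # sessions x surfaces
daily_requests = dau * requests_per_user_per_day    # 1B requests/day
avg_qps = daily_requests / 86_400                   # seconds in a day

print(round(avg_qps))        # 11574, i.e. roughly 12K requests/sec
print(round(avg_qps) * 5)    # ~58K: the low end of the 5-10x peak range
```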
&lt;h4&gt;
  
  
  Candidate Generation &amp;amp; Ranking
&lt;/h4&gt;

&lt;p&gt;Candidates per request: ~10K&lt;br&gt;
Heavy ranking input: ~500&lt;br&gt;
Final output: Top 10&lt;br&gt;
Only a tiny fraction of the catalog reaches expensive models, keeping latency and cost under control.&lt;/p&gt;
&lt;h4&gt;
  
  
  Storage Estimation
&lt;/h4&gt;

&lt;p&gt;User embeddings: ⇒ 100M users × 1 KB ≈ 100 GB&lt;br&gt;
Item embeddings: ⇒ 10B items × 1 KB ≈ ~10 TB&lt;br&gt;
Stored in distributed vector storage with sharding and replication.&lt;/p&gt;
&lt;h4&gt;
  
  
  Interaction Data
&lt;/h4&gt;

&lt;p&gt;User interactions/day: ~5B events&lt;br&gt;
Event size: ~200 bytes&lt;br&gt;
Daily ingestion volume: ~1 TB/day&lt;br&gt;
Processed asynchronously via streaming pipelines.&lt;/p&gt;
&lt;h4&gt;
  
  
  Key Takeaways
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Recommendation systems are read-heavy and bursty&lt;/li&gt;
&lt;li&gt;Candidate generation dominates compute cost&lt;/li&gt;
&lt;li&gt;Caching and ANN search are mandatory&lt;/li&gt;
&lt;li&gt;Heavy models must be used sparingly&lt;/li&gt;
&lt;li&gt;Latency is driven by QPS, not storage&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Core Entities
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;User:&lt;/strong&gt; Represents a platform user for whom recommendations are generated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Item (Video / Product / Content):&lt;/strong&gt; Represents a recommendable entity such as a video, movie, or product.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User–Item Interaction:&lt;/strong&gt; Represents an interaction between a user and an item (view, click, watch time, like, skip).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User Embedding:&lt;/strong&gt; Represents a numerical vector capturing a user’s preferences and interests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Item Embedding:&lt;/strong&gt; Represents a numerical vector capturing an item’s characteristics and semantics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Candidate Set:&lt;/strong&gt; Represents a shortlist of potentially relevant items generated for ranking.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommendation Context:&lt;/strong&gt; Represents the request-time context such as surface, device, time, and current item.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommendation Result:&lt;/strong&gt; Represents the final ranked list of items shown to the user.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feature Store:&lt;/strong&gt; Represents a centralized store for precomputed user and item features used during inference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Interaction Event:&lt;/strong&gt; Represents a logged feedback event used for training, evaluation, and monitoring.&lt;/p&gt;


&lt;h2&gt;
  
  
  Database Design
&lt;/h2&gt;
&lt;h4&gt;
  
  
  Database Choice
&lt;/h4&gt;

&lt;p&gt;A recommendation system uses multiple specialized data stores, not a single database, because access patterns are very different.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Distributed NoSQL Store (Cassandra / DynamoDB / Bigtable)&lt;/strong&gt;&lt;br&gt;
Used for high-throughput storage of user profiles, interaction events, and recommendation metadata. Chosen for horizontal scalability, fast writes, and predictable performance at scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vector Store / ANN Index (FAISS / ScaNN / Milvus / OpenSearch Vector)&lt;/strong&gt;&lt;br&gt;
Used to store and query user and item embeddings for candidate generation.&lt;br&gt;
Optimized for approximate nearest neighbor search, not relational queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Object Storage (S3 / GCS / HDFS)&lt;/strong&gt;&lt;br&gt;
Used for raw interaction logs, training data, and offline analytics.&lt;br&gt;
Cheap, durable, and suitable for batch processing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cache (Redis / Memcached)&lt;/strong&gt;&lt;br&gt;
Used for hot data such as user embeddings, recent interactions, and precomputed recommendations. Critical for meeting sub-200 ms latency.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This separation ensures each workload is handled by the right storage system.&lt;/em&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  Schema
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;User Table&lt;/strong&gt;&lt;br&gt;
Represents platform users.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User
- user_id (PK)
- language
- region
- account_created_at
- status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Used for: Personalization, Cold-start handling, Feature lookup&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Item Table&lt;/strong&gt;&lt;br&gt;
Represents recommendable content.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Item
- item_id (PK)
- type (video / product)
- category
- language
- creator_id
- published_at
- status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Used for: Content-based filtering, Policy and availability checks&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User–Item Interaction Table&lt;/strong&gt;&lt;br&gt;
Represents user feedback signals.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;UserItemInteraction
- user_id (PK)
- item_id (PK)
- interaction_type (view / click / like / watch)
- interaction_value (e.g. watch_time)
- timestamp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Used for: Model training, Real-time personalization, Feedback loops, Write-heavy and append-only.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User Embedding Table&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;UserEmbedding
- user_id (PK)
- embedding_vector
- updated_at
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Used for: Candidate generation, ANN search&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Item Embedding Table&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ItemEmbedding
- item_id (PK)
- embedding_vector
- updated_at
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Used for: Similarity search, Cold-start recommendations&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommendation Result Cache&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;UserRecommendation
- user_id (PK)
- surface (homepage / up_next)
- item_list
- generated_at
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Used for: Fast homepage loads, Cache-heavy read paths&lt;/p&gt;

&lt;h4&gt;
  
  
  Indexing Strategy
&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;Interactions indexed by (user_id, timestamp) for recent behavior&lt;br&gt;
Items indexed by (category, status)&lt;br&gt;
Embeddings indexed in ANN structures, not traditional DB indexes&lt;br&gt;
Time-based partitioning for interaction logs&lt;br&gt;
Indexes are chosen based on actual query patterns, not normalization.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Transaction Model
&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;The system avoids multi-table transactions in the serving path.&lt;br&gt;
Each write (interaction, embedding update, log event) is independent&lt;br&gt;
Reads are eventually consistent across stores&lt;br&gt;
Recommendation requests are read-only operations&lt;br&gt;
This keeps latency low and throughput high.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Failure Handling
&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;Interaction events are written asynchronously via queues&lt;br&gt;
If embedding updates fail, older embeddings are reused&lt;br&gt;
Cache failures fall back to database reads&lt;br&gt;
ANN service failures fall back to popular or cached recommendations&lt;br&gt;
The system degrades gracefully, never blocks the user.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Consistency Model
&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;Strong consistency: Used for user identity and item availability&lt;br&gt;
Eventual consistency: Used for interactions, embeddings, analytics, and recommendations&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Why this works:&lt;/strong&gt;&lt;br&gt;
Slightly stale recommendations are acceptable, High availability and low latency are more important than strict consistency&lt;/p&gt;


&lt;h2&gt;
  
  
  API / Endpoints
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Get Recommendations&lt;/strong&gt;: Fetches personalized recommendations for a user and surface.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;GET /recommendations&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Request&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "user_id": "string",
  "surface": "homepage | up_next | related",
  "context": {
    "current_item_id": "string (optional)",
    "device": "mobile | web | tv",
    "region": "string"
  },
  "limit": 10
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Response&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "user_id": "string",
  "surface": "homepage",
  "recommendations": [
    {
      "item_id": "string",
      "score": 0.92
    }
  ],
  "generated_at": "datetime"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Get Cached Recommendations:&lt;/strong&gt; Returns precomputed recommendations if available.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;GET /recommendations/cached&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Request&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "user_id": "string",
  "surface": "homepage"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Response&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "recommendations": ["item_1", "item_2", "item_3"],
  "generated_at": "datetime"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Log Interaction Event:&lt;/strong&gt; Records user feedback for training and personalization.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;POST /interactions&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Request&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "user_id": "string",
  "item_id": "string",
  "interaction_type": "view | click | watch | like | skip",
  "interaction_value": 120,
  "surface": "homepage | up_next",
  "timestamp": "datetime"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Response&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "status": "accepted"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Update User Profile:&lt;/strong&gt; Updates user attributes used for personalization.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;PUT /users/{user_id}&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Request&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "language": "string",
  "region": "string",
  "preferences": {
    "categories": ["string"]
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Response&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "status": "updated"
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Trigger Model Refresh (Internal / Admin):&lt;/strong&gt; Triggers offline or near-real-time model updates.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;POST /models/refresh&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Response&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "status": "refresh_started"
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Key API Design Notes
&lt;/h4&gt;

&lt;blockquote&gt;
&lt;p&gt;All recommendation APIs are read-optimized and low latency&lt;br&gt;
Interaction logging APIs are asynchronous and non-blocking&lt;br&gt;
Recommendation responses may be eventually consistent&lt;br&gt;
Cached and real-time recommendations coexist&lt;br&gt;
Admin APIs are restricted to internal services&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  System Components
&lt;/h2&gt;

&lt;h4&gt;
  
  
  1. Client (Web / Mobile / TV Apps)
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Primary Responsibilities:&lt;/em&gt;&lt;br&gt;
Requests recommendations for different surfaces (Homepage, Up Next, Search, Contextual).&lt;br&gt;
Sends user interaction events such as views, clicks, watch time, skips, likes.&lt;br&gt;
Passes lightweight context (device, locale, time, surface type).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Examples:&lt;/em&gt;&lt;br&gt;
Web apps, Mobile apps, Smart TV apps&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why:&lt;/em&gt;&lt;br&gt;
Keeps recommendation logic centralized and ensures consistent experience across devices.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. API Gateway
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Primary Responsibilities:&lt;/em&gt;&lt;br&gt;
Acts as the secure ingress for recommendation APIs.&lt;br&gt;
Handles authentication, authorization, and request validation.&lt;br&gt;
Applies rate limits and traffic shaping.&lt;br&gt;
Routes requests to the Recommendation Service.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Examples:&lt;/em&gt;&lt;br&gt;
API Gateway, Envoy, NGINX&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why:&lt;/em&gt;&lt;br&gt;
Provides centralized security and traffic control without coupling clients to backend services.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Recommendation Service (Serving Orchestrator)
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Primary Responsibilities:&lt;/em&gt;&lt;br&gt;
Accepts recommendation requests with user and context.&lt;br&gt;
Orchestrates candidate generation, ranking, and re-ranking.&lt;br&gt;
Applies timeout budgets and fallback strategies.&lt;br&gt;
Aggregates final ranked results and returns them to clients.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Examples:&lt;/em&gt;&lt;br&gt;
Stateless microservice (Java / Go / Node.js)&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why:&lt;/em&gt;&lt;br&gt;
Acts as the real-time brain of the system while remaining horizontally scalable.&lt;/p&gt;

&lt;h4&gt;
  
  
  4. Candidate Generation Service
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Primary Responsibilities:&lt;/em&gt;&lt;br&gt;
Retrieves a large pool of potentially relevant items (thousands).&lt;br&gt;
Uses lightweight models, embeddings, popularity, and heuristics.&lt;br&gt;
Optimized for high recall and low latency.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Examples:&lt;/em&gt;&lt;br&gt;
Embedding-based retrieval, popularity services&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why:&lt;/em&gt;&lt;br&gt;
Reduces billions of items to a manageable candidate set for downstream ranking.&lt;/p&gt;

&lt;h4&gt;
  
  
  5. Vector Store / ANN Service
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Primary Responsibilities:&lt;/em&gt;&lt;br&gt;
Stores user and item embeddings.&lt;br&gt;
Supports approximate nearest neighbor (ANN) search.&lt;br&gt;
Provides fast similarity lookups at scale.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Examples:&lt;/em&gt;&lt;br&gt;
Vector databases, ANN indices&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why:&lt;/em&gt;&lt;br&gt;
Exact similarity search does not scale; ANN makes embedding-based retrieval feasible in real time.&lt;/p&gt;

&lt;h4&gt;
  
  
  6. Ranking Service
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Primary Responsibilities:&lt;/em&gt;&lt;br&gt;
Scores candidate items using ML models.&lt;br&gt;
Combines user features, item features, and context.&lt;br&gt;
Produces relevance scores for each candidate.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Examples:&lt;/em&gt;&lt;br&gt;
Two-tower models, deep ranking models&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why:&lt;/em&gt;&lt;br&gt;
Provides high-precision ordering once the candidate set is small enough.&lt;/p&gt;

&lt;h4&gt;
  
  
  7. Re-Ranking &amp;amp; Policy Engine
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Primary Responsibilities:&lt;/em&gt;&lt;br&gt;
Applies business rules and constraints: Diversity, Freshness, Fairness, Content safety, Sponsored content&lt;br&gt;
Adjusts ordering without retraining models.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why:&lt;/em&gt;&lt;br&gt;
Ensures recommendations align with product, legal, and business goals.&lt;/p&gt;

&lt;h4&gt;
  
  
  8. Feature Store
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Primary Responsibilities:&lt;/em&gt;&lt;br&gt;
Stores precomputed user and item features.&lt;br&gt;
Serves features consistently to both training and serving pipelines.&lt;br&gt;
Supports low-latency online reads.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Examples:&lt;/em&gt;&lt;br&gt;
Online + offline feature stores&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why:&lt;/em&gt;&lt;br&gt;
Prevents feature skew and avoids expensive recomputation at request time.&lt;/p&gt;

&lt;h4&gt;
  
  
  9. Interaction Logging Service
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Primary Responsibilities:&lt;/em&gt;&lt;br&gt;
Collects user interaction events asynchronously.&lt;br&gt;
Validates and enriches events.&lt;br&gt;
Publishes events to the event stream.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Examples:&lt;/em&gt;&lt;br&gt;
Event ingestion microservice&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why:&lt;/em&gt;&lt;br&gt;
Decouples user actions from downstream analytics and training systems.&lt;/p&gt;

&lt;h4&gt;
  
  
  10. Event Stream / Message Queue
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Primary Responsibilities:&lt;/em&gt;&lt;br&gt;
Buffers user interaction events at scale.&lt;br&gt;
Provides durability and backpressure handling.&lt;br&gt;
Enables multiple consumers (real-time + batch).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Examples:&lt;/em&gt;&lt;br&gt;
Distributed message queues&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why:&lt;/em&gt;&lt;br&gt;
Absorbs traffic spikes and enables reliable data pipelines.&lt;/p&gt;

&lt;h4&gt;
  
  
  11. Stream Processing Service (Real-Time Layer)
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Primary Responsibilities:&lt;/em&gt;&lt;br&gt;
Processes interaction events in near real time.&lt;br&gt;
Updates short-term user interests and trends.&lt;br&gt;
Feeds real-time personalization features.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Examples:&lt;/em&gt;&lt;br&gt;
Stream processors&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why:&lt;/em&gt;&lt;br&gt;
Keeps recommendations fresh and responsive to recent user behavior.&lt;/p&gt;
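&lt;p&gt;One common way to keep short-term interests fresh is an exponentially decayed profile updated per event (a sketch; the half-life and the score/timestamp tuple layout are illustrative):&lt;/p&gt;

```python
import math

def update_interest(profile, category, weight, now, half_life_s=1800.0):
    """Fold one interaction event into a short-term interest profile.

    Older scores decay exponentially (half-life in seconds), so the
    profile tracks what the user cared about most recently.
    """
    score, ts = profile.get(category, (0.0, now))
    age = max(now - ts, 0.0)
    decayed = score * math.exp(-age * math.log(2.0) / half_life_s)
    profile[category] = (decayed + weight, now)
    return profile
```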

&lt;h4&gt;
  
  
  12. Cache Layer
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Primary Responsibilities:&lt;/em&gt;&lt;br&gt;
Caches hot recommendations and embeddings.&lt;br&gt;
Stores precomputed results for frequent users.&lt;br&gt;
Reduces load on backend services.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Examples:&lt;/em&gt;&lt;br&gt;
In-memory caches&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why:&lt;/em&gt;&lt;br&gt;
Critical for meeting sub-200 ms latency SLOs.&lt;/p&gt;

&lt;h4&gt;
  
  
  13. Metadata Database
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Primary Responsibilities:&lt;/em&gt;&lt;br&gt;
Stores user profiles, item metadata, and configuration.&lt;br&gt;
Supports high read throughput and horizontal scaling.&lt;br&gt;
Acts as the source of truth for non-ML data.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Examples:&lt;/em&gt;&lt;br&gt;
Distributed NoSQL databases&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why:&lt;/em&gt;&lt;br&gt;
Optimized for scale and availability rather than complex transactions.&lt;/p&gt;




&lt;h2&gt;
  
  
  High-Level Flows
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Flow 0: Homepage Recommendation (Happy Path)
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Client requests recommendations for the Homepage with user ID and context (device, locale, time).&lt;/li&gt;
&lt;li&gt;API Gateway authenticates the request and forwards it to the Recommendation Service.&lt;/li&gt;
&lt;li&gt;Recommendation Service checks the cache for precomputed results.&lt;/li&gt;
&lt;li&gt;On cache miss, it triggers candidate generation.&lt;/li&gt;
&lt;li&gt;Candidate Generation retrieves thousands of relevant items using embeddings, popularity, and heuristics.&lt;/li&gt;
&lt;li&gt;Ranking Service scores candidates using heuristics or ML models.&lt;/li&gt;
&lt;li&gt;Re-Ranking &amp;amp; Policy Engine applies diversity, freshness, and safety rules.&lt;/li&gt;
&lt;li&gt;Final ranked list is returned to the client.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Guarantee:&lt;/strong&gt; Sub-200 ms latency with high-quality personalized recommendations.&lt;/p&gt;
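&lt;p&gt;The happy path above can be sketched as a thin orchestrator over the stages (all stage functions here are hypothetical placeholders injected for illustration):&lt;/p&gt;

```python
def recommend(user_id, context, cache, generate, rank, rerank):
    """Happy-path orchestration: cache first, then the full pipeline."""
    key = (user_id, context.get("surface", "homepage"))
    cached = cache.get(key)
    if cached is not None:
        return cached                        # step 3: cache hit
    candidates = generate(user_id, context)  # steps 4-5: high-recall retrieval
    scored = rank(user_id, candidates)       # step 6: precise scoring
    result = rerank(scored)                  # step 7: business rules
    cache[key] = result                      # keep the hot path warm
    return result
```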

&lt;h4&gt;
  
  
  Flow 1: Real-Time Personalization Update
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;User watches, clicks, skips, or searches for content.&lt;/li&gt;
&lt;li&gt;Client sends interaction events asynchronously to the Interaction Logging Service.&lt;/li&gt;
&lt;li&gt;Events are published to the Event Stream.&lt;/li&gt;
&lt;li&gt;Stream Processing Service updates short-term user features (recent interests, intent).&lt;/li&gt;
&lt;li&gt;Updated features are written to the Feature Store.&lt;/li&gt;
&lt;li&gt;Subsequent recommendation requests reflect the latest behavior.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Guarantee:&lt;/strong&gt; Recommendations adapt within seconds to recent user actions.&lt;/p&gt;

&lt;h4&gt;
  
  
  Flow 2: “Up Next” / Contextual Recommendation
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Client requests recommendations with current item context (e.g., video being watched).&lt;/li&gt;
&lt;li&gt;Recommendation Service forwards context to Candidate Generation.&lt;/li&gt;
&lt;li&gt;Candidate Generation retrieves items similar to the current item and user preferences.&lt;/li&gt;
&lt;li&gt;Ranking prioritizes relevance, continuity, and completion likelihood.&lt;/li&gt;
&lt;li&gt;Re-ranking enforces freshness and avoids repetition.&lt;/li&gt;
&lt;li&gt;Results are returned to the client.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Guarantee:&lt;/strong&gt; Smooth content continuation and session-level engagement.&lt;/p&gt;

&lt;h4&gt;
  
  
  Flow 3: Cold Start – New User
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;New user requests recommendations with no interaction history.&lt;/li&gt;
&lt;li&gt;Recommendation Service detects missing user embeddings.&lt;/li&gt;
&lt;li&gt;Candidate Generation falls back to popular content, regional and language-based items, and editorial or curated lists.&lt;/li&gt;
&lt;li&gt;Lightweight ranking applies basic personalization using context.&lt;/li&gt;
&lt;li&gt;Results are cached with short TTL.&lt;/li&gt;
&lt;li&gt;As interactions arrive, the system transitions to personalized recommendations.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Guarantee:&lt;/strong&gt; Reasonable recommendations even without historical data.&lt;/p&gt;
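&lt;p&gt;A minimal cold-start fallback might blend the sources above in priority order, deduplicating as it goes (source names and the limit are illustrative):&lt;/p&gt;

```python
def cold_start_candidates(user_ctx, popular, regional, curated, limit=50):
    """Blend fallback sources for a user with no interaction history.

    Region-specific items come first, then global popularity, then
    editorial picks; duplicates are dropped while preserving order.
    """
    sources = (
        regional.get(user_ctx.get("region"), []),
        popular,
        curated,
    )
    seen, out = set(), []
    for source in sources:
        for item in source:
            if item not in seen:
                seen.add(item)
                out.append(item)
    return out[:limit]
```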

&lt;h4&gt;
  
  
  Flow 4: Cold Start – New Item
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;A new item is added to the platform.&lt;/li&gt;
&lt;li&gt;Item metadata and content features are processed offline.&lt;/li&gt;
&lt;li&gt;Item embedding is generated and stored in the Vector Store.&lt;/li&gt;
&lt;li&gt;Candidate Generation includes the item for relevant users.&lt;/li&gt;
&lt;li&gt;Exposure is throttled and monitored to collect early feedback.&lt;/li&gt;
&lt;li&gt;Interaction signals gradually improve ranking confidence.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Guarantee:&lt;/strong&gt; New items get fair exposure without degrading recommendation quality.&lt;/p&gt;

&lt;h4&gt;
  
  
  Flow 5: Cache-First Serving Path
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Recommendation Service checks the cache using a (user_id, surface) key.&lt;/li&gt;
&lt;li&gt;If hit, cached recommendations are returned immediately.&lt;/li&gt;
&lt;li&gt;If stale or expired, async refresh is triggered in the background.&lt;/li&gt;
&lt;li&gt;Fresh results replace the cache entry.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Guarantee:&lt;/strong&gt; Ultra-low latency for frequent users and popular surfaces.&lt;/p&gt;
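&lt;p&gt;The cache-first path can be sketched with a TTL store that distinguishes a fresh hit from a stale one, so the caller knows when to trigger an asynchronous refresh (a simplified in-process sketch, not a distributed cache):&lt;/p&gt;

```python
import time

class TTLCache:
    """Minimal cache-first store: fresh hits return instantly,
    stale entries are reported so a background refresh can run."""

    def __init__(self, ttl_s):
        self.ttl_s = ttl_s
        self.store = {}

    def put(self, key, value, now=None):
        now = time.time() if now is None else now
        self.store[key] = (value, now + self.ttl_s)  # store the deadline

    def get(self, key, now=None):
        now = time.time() if now is None else now
        entry = self.store.get(key)
        if entry is None:
            return None, True                  # miss: compute synchronously
        value, deadline = entry
        remaining = max(deadline - now, 0.0)
        stale = remaining == 0.0               # expired: serve it, refresh async
        return value, stale
```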

&lt;h4&gt;
  
  
  Flow 6: Fallback on Dependency Failure
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Candidate Generation or Ranking exceeds timeout budget.&lt;/li&gt;
&lt;li&gt;Recommendation Service triggers a fallback strategy: cached results, popular or trending items, or simplified heuristic ranking.&lt;/li&gt;
&lt;li&gt;Response is returned within latency SLO.&lt;/li&gt;
&lt;li&gt;Failure metrics are emitted for monitoring.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Guarantee:&lt;/strong&gt; System degrades gracefully without user-visible failures.&lt;/p&gt;
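&lt;p&gt;The fallback decision can be sketched as a budget check around the pipeline call (a simplification: real systems cancel in-flight work rather than measuring after the fact, and would emit metrics to a monitoring system rather than a dict):&lt;/p&gt;

```python
import time

def recommend_with_fallback(compute, fallback, budget_s, metrics):
    """Serve the full pipeline under a latency budget; on failure or
    overrun, serve a cheap fallback and record why."""
    start = time.monotonic()
    try:
        result = compute()
    except Exception:
        metrics["fallback_error"] = metrics.get("fallback_error", 0) + 1
        return fallback()
    overrun = max(time.monotonic() - start - budget_s, 0.0)
    if overrun == 0.0:
        return result                 # finished within budget
    metrics["fallback_timeout"] = metrics.get("fallback_timeout", 0) + 1
    return fallback()
```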

&lt;h4&gt;
  
  
  Flow 7: Observability &amp;amp; Feedback Loop
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Recommendation impressions and interactions are logged.&lt;/li&gt;
&lt;li&gt;Analytics pipelines compute engagement and quality metrics.&lt;/li&gt;
&lt;li&gt;Alerts trigger on drops in CTR, watch time, or diversity.&lt;/li&gt;
&lt;li&gt;Insights feed back into model tuning and policy updates.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Guarantee:&lt;/strong&gt; Silent recommendation degradation is detected early.&lt;/p&gt;




&lt;h2&gt;
  
  
  Deep Dives – Functional Requirements
&lt;/h2&gt;

&lt;h4&gt;
  
  
  1. Support Personalized Recommendations for Users
&lt;/h4&gt;

&lt;p&gt;The system generates recommendations tailored to each user based on their historical behavior, preferences, and context.&lt;br&gt;
Personalization is achieved by combining long-term user signals (past interactions) with short-term intent (recent activity) to avoid generic or repetitive recommendations.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Support Homepage, “Up Next”, and Contextual Recommendations
&lt;/h4&gt;

&lt;p&gt;Different surfaces have different goals and constraints.&lt;br&gt;
The system supports multiple recommendation surfaces by accepting surface type and context at request time, allowing the same backend to produce results optimized for discovery (Homepage), continuity (Up Next), or relevance to a current item (Contextual).&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Support Hybrid Recommendation Strategies
&lt;/h4&gt;

&lt;p&gt;Relying on a single signal source is fragile at scale.&lt;br&gt;
The system combines interaction-based signals (what similar users engage with) and content-based signals (item metadata and semantics) to improve robustness, coverage, and cold-start behavior.&lt;/p&gt;

&lt;h4&gt;
  
  
  4. Support Real-Time Personalization Using Recent User Interactions
&lt;/h4&gt;

&lt;p&gt;User intent changes rapidly during a session.&lt;br&gt;
Recent interactions such as clicks, skips, and watch time are processed asynchronously and reflected in near real time, ensuring recommendations adapt within seconds instead of waiting for offline updates.&lt;/p&gt;

&lt;h4&gt;
  
  
  5. Support Large-Scale Candidate Generation from Billions of Items
&lt;/h4&gt;

&lt;p&gt;Scoring the entire catalog per request is infeasible.&lt;br&gt;
The system first retrieves a high-recall candidate set using lightweight retrieval techniques, reducing the search space from billions of items to thousands before applying more expensive ranking logic.&lt;/p&gt;
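&lt;p&gt;A rough funnel makes the argument concrete (all numbers here are assumptions for illustration, not from a specific platform):&lt;/p&gt;

```python
# Illustrative funnel: each stage shrinks the set that the next,
# more expensive, stage has to score.
catalog_size    = 2_000_000_000   # full item catalog
after_retrieval = 5_000           # cheap, high-recall candidate generation
after_ranking   = 500             # ML scoring applied to candidates only
final_page      = 20              # what the user actually sees

# the heavy ranking model touches a tiny fraction of the catalog
retrieval_reduction = catalog_size // after_retrieval
ranking_share = after_retrieval / catalog_size
```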

&lt;h4&gt;
  
  
  6. Support Multi-Stage Ranking
&lt;/h4&gt;

&lt;p&gt;Recommendation quality and latency are balanced using a staged pipeline.&lt;br&gt;
Early stages prioritize speed and recall, while later stages focus on precision and ordering, allowing strict latency budgets to be met without sacrificing relevance.&lt;/p&gt;

&lt;h4&gt;
  
  
  7. Support Cold-Start Handling for New Users and New Items
&lt;/h4&gt;

&lt;p&gt;New users and items lack interaction history.&lt;br&gt;
The system falls back to popularity, regional trends, content attributes, and contextual signals, gradually transitioning to personalized recommendations as interactions are collected.&lt;/p&gt;

&lt;h4&gt;
  
  
  8. Support Business-Rule and Policy-Based Re-Ranking
&lt;/h4&gt;

&lt;p&gt;Model scores alone are insufficient for production systems.&lt;br&gt;
Final ranking applies constraints such as diversity, freshness, fairness, content safety, and sponsored placement to align recommendations with product, legal, and business requirements without retraining core logic.&lt;/p&gt;

&lt;h4&gt;
  
  
  9. Support Tracking of User Interactions and Feedback Signals
&lt;/h4&gt;

&lt;p&gt;Every recommendation impression and user interaction is logged asynchronously.&lt;br&gt;
These signals power real-time personalization, offline evaluation, monitoring, and long-term system improvement without impacting serving latency.&lt;/p&gt;




&lt;h2&gt;
  
  
  Deep Dives – Non-Functional Requirements
&lt;/h2&gt;

&lt;h4&gt;
  
  
  1. Highly Available and Fault Tolerant
&lt;/h4&gt;

&lt;p&gt;The system must continue serving recommendations despite failures in individual services or dependencies.&lt;br&gt;
All serving components are stateless and horizontally scalable, while critical data is stored in replicated and durable systems to avoid single points of failure.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Low-Latency Recommendation Serving (Sub-200 ms)
&lt;/h4&gt;

&lt;p&gt;Recommendation requests are latency-sensitive and must return results within strict SLOs.&lt;br&gt;
The system enforces cache-first access, timeout budgets per stage, and lightweight fallbacks to guarantee predictable response times under load.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. High Throughput for Large-Scale User Traffic
&lt;/h4&gt;

&lt;p&gt;The system must handle millions of concurrent users and bursty traffic patterns.&lt;br&gt;
Asynchronous event ingestion, batched processing, and partitioned queues ensure sustained high throughput without impacting serving performance.&lt;/p&gt;

&lt;h4&gt;
  
  
  4. Horizontally Scalable with Growing Users and Content
&lt;/h4&gt;

&lt;p&gt;All core components scale horizontally by adding instances rather than redesigning the system.&lt;br&gt;
Growth in users, content, or regions is handled through partitioning, sharding, and independent scaling of retrieval, ranking, and caching layers.&lt;/p&gt;

&lt;h4&gt;
  
  
  5. Real-Time Freshness of Recommendations
&lt;/h4&gt;

&lt;p&gt;User behavior and content trends change rapidly.&lt;br&gt;
The system incorporates near real-time interaction signals and frequent cache refreshes to prevent stale recommendations while avoiding excessive recomputation.&lt;/p&gt;

&lt;h4&gt;
  
  
  6. Consistent User Experience Across Devices and Regions
&lt;/h4&gt;

&lt;p&gt;Users may switch devices or locations frequently.&lt;br&gt;
Recommendations are generated using a unified serving pipeline with region-aware data access, ensuring consistency while respecting latency and locality constraints.&lt;/p&gt;

&lt;h4&gt;
  
  
  7. Cost-Efficient Operation at Scale
&lt;/h4&gt;

&lt;p&gt;Serving recommendations is a high-QPS workload.&lt;br&gt;
The system minimizes cost by using multi-stage pipelines, aggressive caching, and lightweight retrieval before expensive computation, ensuring cost scales linearly with traffic.&lt;/p&gt;

&lt;h4&gt;
  
  
  8. Secure Access to User and Content Data
&lt;/h4&gt;

&lt;p&gt;User behavior data is sensitive and must be protected.&lt;br&gt;
All APIs are authenticated and authorized, data is encrypted in transit and at rest, and access is restricted based on service identity and least-privilege principles.&lt;/p&gt;

&lt;h4&gt;
  
  
  9. Observability for System Health and Recommendation Quality
&lt;/h4&gt;

&lt;p&gt;System health cannot be judged by uptime alone.&lt;br&gt;
The system tracks latency, error rates, cache hit ratios, and downstream dependency health, along with engagement and quality metrics to detect silent degradation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Failure Handling &amp;amp; Fallback Strategies
&lt;/h2&gt;

&lt;p&gt;Recommendation systems must remain responsive even when dependencies fail or degrade. The system is designed to fail fast, degrade gracefully, and never block the user experience.&lt;/p&gt;

&lt;h4&gt;
  
  
  Cache Miss or Cache Unavailability
&lt;/h4&gt;

&lt;p&gt;If cached recommendations are unavailable or expired, the system bypasses the cache and triggers the normal serving pipeline.&lt;br&gt;
If recomputation exceeds latency budgets, a simpler fallback (popular or trending items) is returned to avoid user-visible delays.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Guarantee:&lt;/em&gt; Cache failures never block recommendation delivery.&lt;/p&gt;

&lt;h4&gt;
  
  
  Candidate Generation Timeout or Failure
&lt;/h4&gt;

&lt;p&gt;Candidate generation has a strict timeout budget. If it fails or times out, the system falls back to recently cached candidate sets, popular or regional content, or lightweight heuristic-based retrieval.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Guarantee:&lt;/em&gt; Requests complete within latency SLOs even if retrieval degrades.&lt;/p&gt;

&lt;h4&gt;
  
  
  Vector Store / ANN Service Degradation
&lt;/h4&gt;

&lt;p&gt;If the ANN service becomes slow or unavailable, the system avoids synchronous retries. Requests are served using precomputed or cached candidates while health checks and alerts trigger remediation.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Guarantee:&lt;/em&gt; Embedding search failures do not cascade into full system outages.&lt;/p&gt;

&lt;h4&gt;
  
  
  Ranking Service Timeout
&lt;/h4&gt;

&lt;p&gt;Ranking is bounded by a hard deadline. If ranking exceeds its time budget, partially scored results or previously cached rankings are returned.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Guarantee:&lt;/em&gt; Ranking accuracy is sacrificed before latency guarantees.&lt;/p&gt;

&lt;h4&gt;
  
  
  Re-Ranking or Policy Engine Failure
&lt;/h4&gt;

&lt;p&gt;Re-ranking logic is designed to be optional and best-effort. If it fails, the system returns the ranked list without additional constraints rather than failing the request.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Guarantee:&lt;/em&gt; Business rules enhance quality but never break delivery.&lt;/p&gt;

&lt;h4&gt;
  
  
  Real-Time Signal Unavailability
&lt;/h4&gt;

&lt;p&gt;If real-time personalization signals are delayed or unavailable, the system falls back to long-term user preferences. Offline features remain the stable baseline.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Guarantee:&lt;/em&gt; Recommendation quality degrades gracefully without sudden behavior shifts.&lt;/p&gt;

&lt;h4&gt;
  
  
  Event Stream Backlog or Processing Lag
&lt;/h4&gt;

&lt;p&gt;If interaction events lag or queues build up, serving continues unaffected. Lag is monitored and corrected asynchronously without blocking recommendation requests.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Guarantee:&lt;/em&gt; Data pipeline issues never impact real-time serving.&lt;/p&gt;

&lt;h4&gt;
  
  
  Partial Data or Feature Store Outage
&lt;/h4&gt;

&lt;p&gt;If some features cannot be fetched, the system proceeds with a reduced feature set. Missing features are treated as optional, not mandatory.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Guarantee:&lt;/em&gt; Feature unavailability does not cause request failures.&lt;/p&gt;

&lt;h4&gt;
  
  
  Regional Failure or Zone Outage
&lt;/h4&gt;

&lt;p&gt;If a region or availability zone becomes unhealthy, traffic is shifted to healthy regions. Cached and regionally replicated data ensures continuity.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Guarantee:&lt;/em&gt; Regional outages result in degraded quality, not downtime.&lt;/p&gt;

&lt;h4&gt;
  
  
  Graceful Degradation Under Extreme Load
&lt;/h4&gt;

&lt;p&gt;When the system is overloaded, low-priority surfaces are throttled, cache TTLs are increased, and expensive computation is skipped.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Guarantee:&lt;/em&gt; Core recommendation flows remain available under peak load.&lt;/p&gt;




&lt;h2&gt;
  
  
  Trade-Offs
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Multi-Stage Retrieval vs Single-Stage Ranking
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Choice:&lt;/em&gt; Multi-stage recommendation pipeline&lt;br&gt;
&lt;em&gt;Pros:&lt;/em&gt; Scales to billions of items, predictable latency, independent optimization of stages&lt;br&gt;
&lt;em&gt;Cons:&lt;/em&gt; Higher system and operational complexity&lt;br&gt;
&lt;em&gt;Why This Works:&lt;/em&gt; Single-stage scoring is infeasible at scale; staged pipelines are the only practical way to meet strict latency SLOs.&lt;/p&gt;

&lt;h4&gt;
  
  
  Approximate Retrieval vs Exact Search
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Choice:&lt;/em&gt; Approximate retrieval for candidate generation&lt;br&gt;
&lt;em&gt;Pros:&lt;/em&gt; Orders-of-magnitude faster, enables real-time serving, bounded latency&lt;br&gt;
&lt;em&gt;Cons:&lt;/em&gt; Slight recall loss due to approximation&lt;br&gt;
&lt;em&gt;Why This Works:&lt;/em&gt; Small recall loss is acceptable and compensated by downstream ranking.&lt;/p&gt;

&lt;h4&gt;
  
  
  Real-Time Freshness vs Serving Latency
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Choice:&lt;/em&gt; Near real-time personalization with bounded freshness&lt;br&gt;
&lt;em&gt;Pros:&lt;/em&gt; Responsive to recent behavior without blocking requests&lt;br&gt;
&lt;em&gt;Cons:&lt;/em&gt; Very recent actions may not appear immediately&lt;br&gt;
&lt;em&gt;Why This Works:&lt;/em&gt; Users value fast responses more than perfectly fresh recommendations.&lt;/p&gt;

&lt;h4&gt;
  
  
  Cache-First Serving vs Always Compute
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Choice:&lt;/em&gt; Cache-first serving with asynchronous refresh&lt;br&gt;
&lt;em&gt;Pros:&lt;/em&gt; Low latency, reduced backend load, improved tail performance&lt;br&gt;
&lt;em&gt;Cons:&lt;/em&gt; Cached results can be slightly stale&lt;br&gt;
&lt;em&gt;Why This Works:&lt;/em&gt; Slight staleness is acceptable in exchange for reliability and speed.&lt;/p&gt;

&lt;h4&gt;
  
  
  Personalization Depth vs System Cost
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Choice:&lt;/em&gt; Deep personalization only after candidate reduction&lt;br&gt;
&lt;em&gt;Pros:&lt;/em&gt; Keeps compute cost bounded, predictable scaling&lt;br&gt;
&lt;em&gt;Cons:&lt;/em&gt; Early-stage retrieval is less personalized&lt;br&gt;
&lt;em&gt;Why This Works:&lt;/em&gt; Fine-grained personalization only matters when the candidate set is small.&lt;/p&gt;

&lt;h4&gt;
  
  
  Exploration vs Exploitation
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Choice:&lt;/em&gt; Controlled exploration at lower ranks&lt;br&gt;
&lt;em&gt;Pros:&lt;/em&gt; Prevents stagnation, discovers new interests and content&lt;br&gt;
&lt;em&gt;Cons:&lt;/em&gt; Short-term engagement may dip slightly&lt;br&gt;
&lt;em&gt;Why This Works:&lt;/em&gt; Long-term engagement improves with limited, targeted exploration.&lt;/p&gt;

&lt;h4&gt;
  
  
  Consistency vs Availability
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Choice:&lt;/em&gt; Eventual consistency for recommendation data&lt;br&gt;
&lt;em&gt;Pros:&lt;/em&gt; Higher availability and lower latency&lt;br&gt;
&lt;em&gt;Cons:&lt;/em&gt; Temporary inconsistencies in results&lt;br&gt;
&lt;em&gt;Why This Works:&lt;/em&gt; Recommendations are advisory, not transactional.&lt;/p&gt;

&lt;h4&gt;
  
  
  Centralized Orchestration vs Fully Distributed Logic
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Choice:&lt;/em&gt; Centralized Recommendation Service&lt;br&gt;
&lt;em&gt;Pros:&lt;/em&gt; Clear ownership, better observability, strict latency control&lt;br&gt;
&lt;em&gt;Cons:&lt;/em&gt; Requires careful horizontal scaling&lt;br&gt;
&lt;em&gt;Why This Works:&lt;/em&gt; Central orchestration simplifies control without sacrificing scalability.&lt;/p&gt;

&lt;h4&gt;
  
  
  Business Rules in Models vs Post-Ranking Policies
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Choice:&lt;/em&gt; Policy-based re-ranking outside models&lt;br&gt;
&lt;em&gt;Pros:&lt;/em&gt; Faster iteration, no retraining required&lt;br&gt;
&lt;em&gt;Cons:&lt;/em&gt; Additional processing step&lt;br&gt;
&lt;em&gt;Why This Works:&lt;/em&gt; Business logic changes faster than models and should remain decoupled.&lt;/p&gt;




&lt;h2&gt;
  
  
  Frequently Asked Questions in Interviews
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Why can’t we score all items for every recommendation request?
&lt;/h4&gt;

&lt;p&gt;Because real-world catalogs can contain billions of items, and scoring each one would exceed both latency and compute budgets.&lt;br&gt;
Multi-stage retrieval limits expensive computation to a small candidate set, making real-time serving feasible.&lt;/p&gt;

&lt;h4&gt;
  
  
  What happens if candidate generation misses good items?
&lt;/h4&gt;

&lt;p&gt;Those items will never reach downstream ranking or re-ranking stages.&lt;br&gt;
This is why candidate generation is optimized for high recall and often uses multiple retrieval strategies to reduce blind spots.&lt;/p&gt;

&lt;h4&gt;
  
  
  Why do we separate candidate generation and ranking?
&lt;/h4&gt;

&lt;p&gt;Candidate generation focuses on recall and speed, while ranking focuses on precision and ordering.&lt;br&gt;
Separating these concerns allows each stage to be optimized independently under strict latency constraints.&lt;/p&gt;

&lt;h4&gt;
  
  
  Why do we need both ranking and re-ranking?
&lt;/h4&gt;

&lt;p&gt;Ranking determines relevance based on learned signals and context.&lt;br&gt;
Re-ranking applies product, safety, fairness, and diversity constraints that are difficult or risky to encode directly into ranking logic.&lt;/p&gt;

&lt;h4&gt;
  
  
  How do you handle real-time personalization without increasing latency?
&lt;/h4&gt;

&lt;p&gt;User interactions are ingested asynchronously and reflected through fast-access features.&lt;br&gt;
Serving never blocks on real-time pipelines and falls back to long-term preferences if recent signals are delayed.&lt;/p&gt;

&lt;h4&gt;
  
  
  How does the system handle cold-start users?
&lt;/h4&gt;

&lt;p&gt;When no interaction history exists, the system relies on popularity, regional trends, and contextual signals.&lt;br&gt;
As soon as interactions are collected, personalization gradually increases without abrupt behavior changes.&lt;/p&gt;

&lt;h4&gt;
  
  
  How does the system handle cold-start items?
&lt;/h4&gt;

&lt;p&gt;New items rely on content attributes and controlled initial exposure.&lt;br&gt;
Early interaction signals are monitored before the item is fully trusted in ranking to avoid quality degradation.&lt;/p&gt;

&lt;h4&gt;
  
  
  How do you ensure recommendations stay fresh?
&lt;/h4&gt;

&lt;p&gt;Short-term signals update frequently and cached results use bounded TTLs.&lt;br&gt;
Offline updates continuously refresh long-term preferences without impacting live traffic.&lt;/p&gt;

&lt;h4&gt;
  
  
  What happens if the vector store or retrieval layer goes down?
&lt;/h4&gt;

&lt;p&gt;The system avoids retries on the critical path and switches to cached or heuristic-based candidates.&lt;br&gt;
Availability and latency are preserved even if relevance temporarily degrades.&lt;/p&gt;

&lt;h4&gt;
  
  
  Why is eventual consistency acceptable in recommendation systems?
&lt;/h4&gt;

&lt;p&gt;Recommendations guide user choice but do not represent a source of truth.&lt;br&gt;
Temporary inconsistencies are preferable to increased latency or reduced availability.&lt;/p&gt;

&lt;h4&gt;
  
  
  How do you prevent popularity bias and content monopolization?
&lt;/h4&gt;

&lt;p&gt;The system applies diversity constraints, exposure caps, and controlled exploration.&lt;br&gt;
This ensures long-tail content receives visibility while preserving relevance.&lt;/p&gt;

&lt;h4&gt;
  
  
  How do you debug bad or surprising recommendations?
&lt;/h4&gt;

&lt;p&gt;Every recommendation request and interaction is logged with traceable identifiers.&lt;br&gt;
Drops in engagement, diversity, or freshness trigger alerts and investigation.&lt;/p&gt;

&lt;h4&gt;
  
  
  Which metrics matter most in recommendation systems?
&lt;/h4&gt;

&lt;p&gt;System health metrics include latency, error rates, and cache hit ratios.&lt;br&gt;
Quality metrics include engagement, retention, diversity, and long-term user satisfaction.&lt;/p&gt;

&lt;h4&gt;
  
  
  How does the system scale to 10× or 100× traffic?
&lt;/h4&gt;

&lt;p&gt;All serving components are stateless and horizontally scalable.&lt;br&gt;
Capacity is increased by adding replicas, cache nodes, and partitions without redesigning the system.&lt;/p&gt;

&lt;h4&gt;
  
  
  Why are training and serving decoupled?
&lt;/h4&gt;

&lt;p&gt;Coupling them would make serving dependent on slow or unstable pipelines.&lt;br&gt;
Serving always relies on the last known good state to protect latency and availability.&lt;/p&gt;

&lt;h4&gt;
  
  
  How do you ensure consistent recommendations across devices?
&lt;/h4&gt;

&lt;p&gt;A unified serving pipeline is used across web, mobile, and TV clients.&lt;br&gt;
Device context influences ranking behavior without fragmenting core logic.&lt;/p&gt;

&lt;h4&gt;
  
  
  What are the biggest scalability bottlenecks?
&lt;/h4&gt;

&lt;p&gt;Candidate retrieval latency and cache miss amplification at peak traffic.&lt;br&gt;
These are mitigated using aggressive caching, fallbacks, and timeout budgets.&lt;/p&gt;

&lt;h4&gt;
  
  
  What would you simplify if system traffic were low?
&lt;/h4&gt;

&lt;p&gt;Reduce the number of stages, caching layers, and fallback paths.&lt;br&gt;
System complexity should scale with traffic and business needs, not precede them.&lt;/p&gt;




&lt;h2&gt;
  
  
  High-Level Summary
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;This recommendation system uses a multi-stage, cache-first architecture to serve personalized results at scale under strict latency constraints. Candidate generation, ranking, and policy-based re-ranking are cleanly separated to balance relevance, freshness, and business rules. The system is highly available, horizontally scalable, and designed to degrade gracefully during partial failures. Real-time feedback loops and strong observability ensure recommendation quality improves continuously without impacting reliability.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Feel free to ask questions or share your thoughts — happy to discuss!&lt;/em&gt;&lt;/p&gt;




</description>
      <category>ai</category>
      <category>architecture</category>
      <category>machinelearning</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Design HLD - Notification System</title>
      <dc:creator>Vikas Kumar</dc:creator>
      <pubDate>Sat, 07 Feb 2026 06:30:45 +0000</pubDate>
      <link>https://dev.to/learnwithvikzzy/design-hld-notification-sytem-eo7</link>
      <guid>https://dev.to/learnwithvikzzy/design-hld-notification-sytem-eo7</guid>
      <description>&lt;h2&gt;
  
  
  Requirements
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Functional Requirements
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Support &lt;strong&gt;sending notifications&lt;/strong&gt; to users.&lt;/li&gt;
&lt;li&gt;Support &lt;strong&gt;delivery across multiple channels&lt;/strong&gt; (Email, SMS, Push, In-app).&lt;/li&gt;
&lt;li&gt;Support &lt;strong&gt;critical and promotional notification&lt;/strong&gt; types.&lt;/li&gt;
&lt;li&gt;Support user &lt;strong&gt;notification preferences&lt;/strong&gt; and opt-in/opt-out.&lt;/li&gt;
&lt;li&gt;Support &lt;strong&gt;scheduled notifications.&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Support &lt;strong&gt;bulk notifications&lt;/strong&gt; targeting large user groups.&lt;/li&gt;
&lt;li&gt;Support &lt;strong&gt;safe retries and idempotent notification&lt;/strong&gt; processing.&lt;/li&gt;
&lt;li&gt;Support &lt;strong&gt;tracking of notification&lt;/strong&gt; delivery status.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Non-Functional Requirements
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Highly available&lt;/strong&gt; and &lt;strong&gt;fault tolerant&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Low-latency&lt;/strong&gt; delivery for critical notifications.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High throughput&lt;/strong&gt; with large-scale fan-out.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Highly scalable&lt;/strong&gt; with increasing traffic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Durable notification processing&lt;/strong&gt; with no message loss.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secure notification delivery&lt;/strong&gt; and access control.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-efficient&lt;/strong&gt; operation at scale.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Key Concepts You Must Know
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Notification vs Delivery Attempt
&lt;/h4&gt;

&lt;p&gt;A notification represents the logical intent to notify a user, while delivery attempts represent concrete, channel-specific executions. A single notification can result in multiple delivery attempts due to retries, fallbacks, or multi-channel delivery.&lt;/p&gt;

&lt;h4&gt;
  
  
  Critical vs Promotional Isolation
&lt;/h4&gt;

&lt;p&gt;Critical notifications such as OTPs or chat messages must be processed in isolation from promotional traffic. This prevents head-of-line blocking and guarantees that spikes in bulk or campaign traffic do not impact latency-sensitive notifications.&lt;/p&gt;

&lt;h4&gt;
  
  
  Priority-Aware Queuing
&lt;/h4&gt;

&lt;p&gt;Notifications are routed through priority-aware queues so that high-priority messages are always processed ahead of lower-priority ones. This ensures predictable latency for critical flows even under heavy system load.&lt;/p&gt;

&lt;h4&gt;
  
  
  Idempotent Processing
&lt;/h4&gt;

&lt;p&gt;All notification operations must be idempotent to safely handle retries caused by network failures or timeouts. Repeating the same request should always result in the same final state without creating duplicate notifications.&lt;/p&gt;
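&lt;p&gt;Idempotency is typically enforced with a client-supplied request ID checked against a durable store before sending (sketched here with a plain dict standing in for that store):&lt;/p&gt;

```python
def process_notification(request_id, payload, processed, deliver):
    """Idempotent handler: replays of the same request_id return the
    stored result instead of sending a duplicate notification."""
    if request_id in processed:
        return processed[request_id]   # retry: same final state, no resend
    result = deliver(payload)          # actual channel send
    processed[request_id] = result     # durable write in production
    return result
```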

&lt;h4&gt;
  
  
  Safe Retries
&lt;/h4&gt;

&lt;p&gt;Transient failures during delivery should trigger automatic retries using controlled retry policies such as exponential backoff. Retries must be bounded to avoid infinite loops and system overload.&lt;/p&gt;
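&lt;p&gt;A bounded exponential backoff schedule with full jitter might look like this (the parameter values are illustrative defaults, not recommendations):&lt;/p&gt;

```python
import random

def backoff_schedule(max_attempts=5, base_s=1.0, factor=2.0,
                     cap_s=30.0, rng=None):
    """Return the delay in seconds before each retry attempt.

    Growth is exponential but capped, and full jitter spreads
    retries out so failed clients do not stampede in lockstep.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(max_attempts):
        raw = min(base_s * factor ** attempt, cap_s)  # bounded growth
        delays.append(rng.uniform(0.0, raw))          # full jitter
    return delays
```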

&lt;h4&gt;
  
  
  Scheduling vs Immediate Delivery
&lt;/h4&gt;

&lt;p&gt;Immediate notifications are dispatched as soon as they are accepted by the system, while scheduled notifications are stored and triggered at a future time. Scheduling logic must be reliable and time-correct to ensure notifications are sent neither early nor late.&lt;/p&gt;

&lt;h4&gt;
  
  
  Bulk Fan-out Model
&lt;/h4&gt;

&lt;p&gt;Bulk notifications should be expanded asynchronously into individual notification instances. Fan-out must happen outside the critical path to prevent large campaigns from overwhelming the system.&lt;/p&gt;

&lt;h4&gt;
  
  
  User Preferences Enforcement
&lt;/h4&gt;

&lt;p&gt;Notification delivery must respect user-configured preferences such as opt-in, opt-out, preferred channels, and quiet hours. Preferences are enforced consistently across all notification types, with configurable exceptions for critical messages.&lt;/p&gt;

&lt;h4&gt;
  
  
  Dead Letter Queue (DLQ)
&lt;/h4&gt;

&lt;p&gt;Notifications that fail permanently after exhausting retries are moved to a Dead Letter Queue. The DLQ provides visibility, auditability, and a mechanism for manual inspection or reprocessing.&lt;/p&gt;

&lt;h4&gt;
  
  
  Durable Event Processing
&lt;/h4&gt;

&lt;p&gt;Once a notification is accepted, it must be durably persisted so it is not lost due to crashes or restarts. Durability guarantees that every accepted notification is eventually processed or explicitly marked as failed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Capacity Estimation
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Key Assumptions
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;DAU (Daily Active Users): ~50 million&lt;/li&gt;
&lt;li&gt;Notifications per user per day: ~5&lt;/li&gt;
&lt;li&gt;Traffic mix: ~80% critical, ~20% promotional&lt;/li&gt;
&lt;li&gt;Traffic pattern: Write-heavy with bursty fan-out&lt;/li&gt;
&lt;li&gt;System scale: Large-scale, distributed SaaS system assumed&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Notification Volume Estimation
&lt;/h4&gt;

&lt;p&gt;Total notifications per day ⇒ 50M users × 5 notifications ⇒ ~250M notifications/day&lt;br&gt;
Critical notifications ⇒ ~80% of 250M ≈ ~200M/day&lt;br&gt;
Promotional notifications ⇒ ~20% of 250M ≈ ~50M/day&lt;/p&gt;
&lt;h4&gt;
  
  
  Throughput Estimation (QPS)
&lt;/h4&gt;

&lt;p&gt;Average write QPS ⇒ 250M / 86,400 ⇒ ~2,900 notifications/sec&lt;br&gt;
Peak write QPS ⇒ Up to ~1,000,000 notifications/sec during bulk fan-out and campaign spikes&lt;br&gt;
Fan-out amplification ⇒ A single bulk request can expand into thousands to millions of notifications&lt;/p&gt;
&lt;h4&gt;
  
  
  Read Traffic Estimation
&lt;/h4&gt;

&lt;p&gt;Status checks, analytics, dashboards ⇒ Reads assumed ~2–3× writes ⇒ Average read QPS ≈ ~6,000–9,000/sec&lt;/p&gt;
&lt;h4&gt;
  
  
  Metadata Size Estimation
&lt;/h4&gt;

&lt;p&gt;Metadata per notification ⇒ ~1 KB (IDs, user, channel, status, retries, timestamps)&lt;br&gt;
Metadata per day ⇒ 250M × 1 KB ⇒ ~250 GB/day&lt;br&gt;
Monthly metadata (30 days retention) ⇒ ~7.5 TB&lt;/p&gt;
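&lt;p&gt;The back-of-envelope estimates above can be reproduced as plain arithmetic:&lt;/p&gt;

```python
# Capacity estimation from the stated assumptions.
dau = 50_000_000                              # daily active users
per_user = 5                                  # notifications per user per day
total = dau * per_user                        # -> 250M notifications/day
critical = int(total * 0.8)                   # -> 200M/day
promotional = total - critical                # -> 50M/day
avg_qps = total / 86_400                      # -> ~2,900 notifications/sec
metadata_gb_per_day = total * 1 / 1_000_000   # 1 KB each -> ~250 GB/day
monthly_tb = metadata_gb_per_day * 30 / 1_000 # 30-day retention -> ~7.5 TB

assert total == 250_000_000
assert (critical, promotional) == (200_000_000, 50_000_000)
assert 2_800 < avg_qps < 3_000
assert metadata_gb_per_day == 250.0
assert monthly_tb == 7.5
```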


&lt;h2&gt;
  
  
  Core Entities
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;User:&lt;/strong&gt; Represents a system user who receives notifications.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Notification:&lt;/strong&gt; Represents the logical intent to notify a user; stores type, priority, schedule, and lifecycle state, not delivery execution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delivery Attempt:&lt;/strong&gt; Represents a single channel-specific attempt to deliver a notification and captures retries and failures.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Notification Preference:&lt;/strong&gt; Represents user-defined preferences such as opt-in/opt-out, preferred channels, and quiet hours.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Campaign:&lt;/strong&gt; Represents a bulk or promotional notification request that targets a large group of users.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Schedule:&lt;/strong&gt; Represents a time-based trigger that controls when a notification or campaign should be delivered.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retry Task:&lt;/strong&gt; Represents a delayed retry for a failed delivery attempt using a retry policy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dead Letter Entry:&lt;/strong&gt; Represents a permanently failed notification that requires audit or manual intervention.&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Database Design
&lt;/h2&gt;
&lt;h4&gt;
  
  
  Database Choice
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;The system uses a distributed NoSQL database (such as Cassandra or DynamoDB) to store notification metadata. This is because the system needs to handle very high write traffic, scale horizontally, and remain fast even during large notification spikes.&lt;/li&gt;
&lt;li&gt;Data is partitioned by tenant and user so that notifications are evenly spread across nodes and no single partition becomes a bottleneck. Time-based fields (like creation time) are used to efficiently query recent notifications and to clean up old data.&lt;/li&gt;
&lt;li&gt;A relational database may be used for tenant configuration, billing, and reporting, where strong relationships and transactional queries are more important than write throughput.&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Users Table
&lt;/h4&gt;

&lt;p&gt;Represents system users.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User

user_id (PK)
tenant_id
created_at
status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Used for&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User identity&lt;/li&gt;
&lt;li&gt;Tenant isolation&lt;/li&gt;
&lt;li&gt;Preference lookup&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Notification Table
&lt;/h4&gt;

&lt;p&gt;Represents a user-visible notification.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Notification

notification_id (PK)
user_id (FK → User)
tenant_id
type (critical / promotional)
priority
status (pending / delivered / failed / expired)
scheduled_at
expiry_at
created_at
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key Points&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One row per user notification&lt;/li&gt;
&lt;li&gt;Represents intent and lifecycle&lt;/li&gt;
&lt;li&gt;Used for auditing and status queries&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  DeliveryAttempt Table
&lt;/h4&gt;

&lt;p&gt;Represents channel-level delivery execution.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DeliveryAttempt

attempt_id (PK)
notification_id (FK → Notification)
channel (email / sms / push / in-app)
status (success / failed / retrying)
retry_count
last_error
created_at
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key Points&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiple attempts per notification&lt;/li&gt;
&lt;li&gt;Tracks retries and failures&lt;/li&gt;
&lt;li&gt;Enables per-channel isolation&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  NotificationPreference Table
&lt;/h4&gt;

&lt;p&gt;Represents user notification preferences.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;NotificationPreference

user_id (PK)
channel
enabled
quiet_hours
updated_at
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key Points&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Source of truth for opt-in / opt-out&lt;/li&gt;
&lt;li&gt;Enforced during processing&lt;/li&gt;
&lt;/ul&gt;
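&lt;p&gt;Preference enforcement can be sketched as a pure check; the 22:00–08:00 quiet-hours window (which crosses midnight) and the critical-bypass rule are illustrative:&lt;/p&gt;

```python
from datetime import time

# Quiet-hours check that handles windows wrapping past midnight.
def in_quiet_hours(now, start=time(22, 0), end=time(8, 0)):
    if start <= end:
        return start <= now < end
    return now >= start or now < end  # window wraps past midnight

# Enforcement order: opt-out wins, then the critical exception,
# then quiet hours for everything else.
def should_deliver(channel_enabled, is_critical, now):
    if not channel_enabled:
        return False          # opt-out is absolute
    if is_critical:
        return True           # configurable exception for critical messages
    return not in_quiet_hours(now)

assert should_deliver(True, False, time(23, 30)) is False  # quiet hours
assert should_deliver(True, True, time(23, 30)) is True    # critical bypass
assert should_deliver(True, False, time(12, 0)) is True
assert should_deliver(False, True, time(12, 0)) is False   # opted out
```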

&lt;h4&gt;
  
  
  Campaign Table
&lt;/h4&gt;

&lt;p&gt;Represents bulk notification requests.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Campaign

campaign_id (PK)
tenant_id
status (scheduled / active / completed / cancelled)
scheduled_at
expiry_at
created_at
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key Points&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Used only for bulk notifications&lt;/li&gt;
&lt;li&gt;Expanded asynchronously into notifications&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  RetryTask Table
&lt;/h4&gt;

&lt;p&gt;Represents scheduled retries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RetryTask

retry_task_id (PK)
attempt_id (FK → DeliveryAttempt)
next_retry_at
retry_policy
created_at
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key Points&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retries are time-based, not immediate&lt;/li&gt;
&lt;li&gt;Drives retry scheduling&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  DeadLetter Table
&lt;/h4&gt;

&lt;p&gt;Represents permanently failed notifications.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;DeadLetter

notification_id
channel
failure_reason
created_at
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key Points&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Terminal failure state&lt;/li&gt;
&lt;li&gt;Used for audit and investigation&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Indexing Strategy
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| Access Pattern           | Index                 |
| ------------------------ | --------------------- |
| Fetch user notifications | (user_id, created_at) |
| Priority processing      | (priority, status)    |
| Retry scheduling         | (next_retry_at)       |
| Campaign expansion       | (campaign_id)         |
| Cleanup jobs             | (status, expiry_at)   |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Indexes are chosen based on actual query patterns, not theoretical normalization.&lt;/p&gt;

&lt;h4&gt;
  
  
  Transaction Model
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;The system avoids complex multi-table transactions. Each notification-related operation is handled as a single atomic write, which keeps the system fast and reliable.&lt;/li&gt;
&lt;li&gt;To handle retries safely, the system uses idempotency keys, ensuring that the same request processed multiple times results in only one notification. Notification state moves forward in a controlled manner (for example: PENDING → DELIVERED, or PENDING → FAILED) and never moves backward.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach keeps the system correct even when requests are retried or processed in parallel.&lt;/p&gt;
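&lt;p&gt;The forward-only rule can be sketched as an explicit transition table; the state names follow the lifecycle used above:&lt;/p&gt;

```python
# Forward-only lifecycle transitions: any attempt to move backwards
# (e.g. DELIVERED back to PENDING from a late retry) is rejected.
ALLOWED = {
    "PENDING":   {"DELIVERED", "FAILED", "EXPIRED"},
    "DELIVERED": set(),  # terminal
    "FAILED":    set(),  # terminal (moved to DLQ)
    "EXPIRED":   set(),  # terminal
}

def transition(current, target):
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current} -> {target}")
    return target

state = transition("PENDING", "DELIVERED")
assert state == "DELIVERED"
try:
    transition("DELIVERED", "PENDING")  # stale retry arriving late
except ValueError:
    pass                                # update ignored, state unchanged
assert state == "DELIVERED"
```

&lt;p&gt;With a transition table like this, concurrent workers can race on the same row and the first terminal state simply wins.&lt;/p&gt;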

&lt;h4&gt;
  
  
  Failure Handling
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;If a notification is saved successfully but delivery fails, it remains in a pending or retryable state and is retried automatically. Retry information is stored so the system can safely continue even after crashes or restarts.&lt;/li&gt;
&lt;li&gt;Notifications that fail permanently are moved to a Dead Letter Queue, making failures visible and easy to investigate. Background jobs periodically scan for stuck or inconsistent records and safely recover or clean them up.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Consistency Model
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;The system uses strong consistency for critical data such as notification creation, status updates, retries, and user preferences. This ensures users do not receive duplicate or incorrect notifications.&lt;/li&gt;
&lt;li&gt;For analytics and reporting, the system uses eventual consistency, since slight delays in metrics do not affect correctness. This balance allows the system to scale efficiently while keeping user-facing behavior correct.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  API / Endpoints
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Send Notification → POST: /notifications
&lt;/h4&gt;

&lt;p&gt;Creates a new notification request.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Request&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "user_id": "string",
  "type": "critical | promotional",
  "channels": ["email", "sms", "push"],
  "message": {
    "title": "string",
    "body": "string"
  },
  "schedule_at": "datetime (optional)",
  "expiry_at": "datetime (optional)",
  "idempotency_key": "string"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Response&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "status": "accepted",
  "notification_id": "uuid"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Send Bulk Notifications
&lt;/h4&gt;

&lt;p&gt;Creates a bulk notification campaign. → POST: /notifications/bulk&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Request&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "campaign_name": "string",
  "type": "promotional",
  "target": {
    "segment_id": "string"
  },
  "channels": ["email", "push"],
  "message": {
    "title": "string",
    "body": "string"
  },
  "schedule_at": "datetime",
  "expiry_at": "datetime"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Response&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "status": "accepted",
  "campaign_id": "uuid"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Get Notification Status
&lt;/h4&gt;

&lt;p&gt;Fetches the current status of a notification. → GET: /notifications/{notification_id}&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Response&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "notification_id": "uuid",
  "status": "pending | delivered | failed | expired",
  "last_updated": "datetime"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Retry Notification (Internal / Admin)
&lt;/h4&gt;

&lt;p&gt;Triggers a retry for a failed notification. → POST: /notifications/{notification_id}/retry&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Response&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "status": "retry_scheduled"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Cancel Scheduled Notification
&lt;/h4&gt;

&lt;p&gt;Cancels a notification that has not yet been delivered. → DELETE: /notifications/{notification_id}&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Response&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "status": "cancelled"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Get User Notification Preferences
&lt;/h4&gt;

&lt;p&gt;Fetches notification preferences for a user. → GET: /users/{user_id}/preferences&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Response&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "channels": {
    "email": true,
    "sms": false,
    "push": true
  },
  "quiet_hours": {
    "start": "22:00",
    "end": "08:00"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Update User Notification Preferences
&lt;/h4&gt;

&lt;p&gt;Updates notification preferences for a user. → PUT: /users/{user_id}/preferences&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Request&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "channels": {
    "email": true,
    "sms": false,
    "push": true
  },
  "quiet_hours": {
    "start": "22:00",
    "end": "08:00"
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Response&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "status": "updated"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  List Notifications (Optional)
&lt;/h4&gt;

&lt;p&gt;Fetches recent notifications for a user. → GET: /users/{user_id}/notifications?limit=20&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Response&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "notifications": [
    {
      "notification_id": "uuid",
      "status": "delivered",
      "created_at": "datetime"
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Key API Design Notes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;All write APIs are idempotent using idempotency_key.&lt;/li&gt;
&lt;li&gt;APIs are asynchronous; delivery is not guaranteed at request time.&lt;/li&gt;
&lt;li&gt;Bulk APIs only enqueue campaigns; fan-out happens asynchronously.&lt;/li&gt;
&lt;li&gt;Admin and retry APIs are restricted to internal services.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  System Components
&lt;/h2&gt;

&lt;h4&gt;
  
  
  1. Client (Web / Mobile / Backend Producers)
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Primary Responsibilities:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generates notification requests in response to user actions or system events such as login, payment, chat messages, or campaigns.&lt;/li&gt;
&lt;li&gt;Attaches idempotency keys and contextual metadata (user, tenant, type, priority).&lt;/li&gt;
&lt;li&gt;Does not wait for delivery completion and treats notification APIs as asynchronous.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Examples:&lt;/em&gt;&lt;br&gt;
Web apps, Mobile apps, Order Service, Auth Service, Chat Service&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why:&lt;/em&gt;&lt;br&gt;
Keeps product services simple and prevents notification latency from impacting core user flows.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. API Gateway
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Primary Responsibilities:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Acts as the secure ingress layer for all notification APIs.&lt;/li&gt;
&lt;li&gt;Performs authentication, authorization, tenant validation, schema validation, and request normalization.&lt;/li&gt;
&lt;li&gt;Applies per-tenant and per-client rate limits to protect downstream systems.&lt;/li&gt;
&lt;li&gt;Rejects duplicate requests early using idempotency keys when possible.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Examples:&lt;/em&gt;&lt;br&gt;
AWS API Gateway, Kong, NGINX, Envoy&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why:&lt;/em&gt;&lt;br&gt;
Provides centralized security, traffic control, and isolation at scale.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Notification Service (Control Plane)
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Primary Responsibilities:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Validates notification requests and applies business rules.&lt;/li&gt;
&lt;li&gt;Classifies notifications as critical or promotional and assigns priority.&lt;/li&gt;
&lt;li&gt;Fetches and enforces user preferences including opt-in, channel selection, and quiet hours.&lt;/li&gt;
&lt;li&gt;Validates scheduling and expiry constraints.&lt;/li&gt;
&lt;li&gt;Persists notification metadata as the source of truth.&lt;/li&gt;
&lt;li&gt;Publishes notification events to the message queue for further processing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Examples:&lt;/em&gt;&lt;br&gt;
Spring Boot / Node.js / Go microservice&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why:&lt;/em&gt;&lt;br&gt;
Centralizes orchestration logic while keeping the system asynchronous and scalable.&lt;/p&gt;

&lt;h4&gt;
  
  
  4. Message Queue / Event Bus
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Primary Responsibilities:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Decouples notification ingestion from processing and delivery.&lt;/li&gt;
&lt;li&gt;Buffers traffic spikes and absorbs bursty workloads.&lt;/li&gt;
&lt;li&gt;Provides ordering guarantees where required (e.g., per user).&lt;/li&gt;
&lt;li&gt;Uses separate topics or queues to isolate critical traffic from promotional traffic.&lt;/li&gt;
&lt;li&gt;Ensures at-least-once delivery semantics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Examples:&lt;/em&gt;&lt;br&gt;
Apache Kafka, AWS SNS + SQS&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why:&lt;/em&gt;&lt;br&gt;
Enables high-throughput, fault-tolerant, and scalable event-driven processing.&lt;/p&gt;

&lt;h4&gt;
  
  
  5. Scheduler Service
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Primary Responsibilities:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stores and manages scheduled notifications and delayed retry tasks.&lt;/li&gt;
&lt;li&gt;Triggers notification events at their scheduled execution time, within the scheduler's tick resolution.&lt;/li&gt;
&lt;li&gt;Ensures notifications are not delivered before schedule_at or after expiry_at.&lt;/li&gt;
&lt;li&gt;Handles large volumes of scheduled tasks using partitioned or sharded scheduling.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Examples:&lt;/em&gt;&lt;br&gt;
Kafka delay topics, Redis Sorted Sets, Quartz, AWS EventBridge&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why:&lt;/em&gt;&lt;br&gt;
Provides reliable time-based execution without inefficient polling.&lt;/p&gt;
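&lt;p&gt;A time-indexed schedule store can be sketched with an in-process min-heap standing in for a Redis Sorted Set keyed by trigger timestamp; the scheduler loop just pops everything that is due:&lt;/p&gt;

```python
import heapq

# Minimal time-indexed schedule store: entries are ordered by trigger
# time, so finding due work never requires scanning the full set.
_schedule = []  # (trigger_at, notification_id) min-heap

def schedule(trigger_at, notification_id):
    heapq.heappush(_schedule, (trigger_at, notification_id))

def due(now):
    """Pop and return every notification whose trigger time has passed."""
    ready = []
    while _schedule and _schedule[0][0] <= now:
        ready.append(heapq.heappop(_schedule)[1])
    return ready

schedule(100, "n1")
schedule(50, "n2")
schedule(200, "n3")
assert due(now=120) == ["n2", "n1"]  # earliest trigger fires first
assert due(now=120) == []            # nothing fires twice
assert due(now=250) == ["n3"]
```

&lt;p&gt;In production the same shape maps onto Redis ZADD/ZRANGEBYSCORE or Kafka delay topics, sharded by time bucket to spread load.&lt;/p&gt;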

&lt;h4&gt;
  
  
  6. Campaign / Fan-out Service
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Primary Responsibilities:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Processes bulk notification requests and resolves target audiences.&lt;/li&gt;
&lt;li&gt;Expands campaigns into per-user notification events asynchronously.&lt;/li&gt;
&lt;li&gt;Applies batching, throttling, and backpressure to control fan-out rate.&lt;/li&gt;
&lt;li&gt;Tracks campaign progress and completion state.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Examples:&lt;/em&gt;&lt;br&gt;
Custom fan-out service + Kafka consumers, Flink/Spark for very large campaigns&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why:&lt;/em&gt;&lt;br&gt;
Prevents large campaigns from overwhelming real-time notification flows.&lt;/p&gt;
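&lt;p&gt;Batched, throttled fan-out can be sketched as follows; the batch size and the publish hook are illustrative assumptions:&lt;/p&gt;

```python
import itertools

# Campaign fan-out sketch: the audience is expanded in fixed-size
# batches rather than enqueued all at once, so a large campaign cannot
# flood the queue. A real service would also sleep or rate-limit
# between publish_batch calls.
def fan_out(user_ids, publish_batch, batch_size=1000):
    it = iter(user_ids)
    batches = 0
    while True:
        batch = list(itertools.islice(it, batch_size))
        if not batch:
            return batches
        publish_batch(batch)  # one queue publish per batch, throttleable
        batches += 1

published = []
n = fan_out(range(2500), published.append, batch_size=1000)
assert n == 3
assert [len(b) for b in published] == [1000, 1000, 500]
```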

&lt;h4&gt;
  
  
  7. Channel Workers – Email
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Primary Responsibilities:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consumes email notification events and formats email content.&lt;/li&gt;
&lt;li&gt;Integrates with email providers and handles provider-specific constraints.&lt;/li&gt;
&lt;li&gt;Manages retries, bounces, and transient failures.&lt;/li&gt;
&lt;li&gt;Emits delivery results back into the system.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Examples:&lt;/em&gt;&lt;br&gt;
Amazon SES, SendGrid, Mailgun&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why:&lt;/em&gt;&lt;br&gt;
Email delivery requires specialized handling and independent scaling.&lt;/p&gt;

&lt;h4&gt;
  
  
  8. Channel Workers – SMS
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Primary Responsibilities:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Delivers SMS notifications with low latency.&lt;/li&gt;
&lt;li&gt;Handles provider throttling, regional routing, and failover.&lt;/li&gt;
&lt;li&gt;Normalizes errors from different providers into a common failure model.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Examples:&lt;/em&gt;&lt;br&gt;
Twilio, Vonage (Nexmo), AWS SNS&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why:&lt;/em&gt;&lt;br&gt;
SMS delivery is latency-sensitive and highly provider-dependent.&lt;/p&gt;

&lt;h4&gt;
  
  
  9. Channel Workers – Push
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Primary Responsibilities:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sends push notifications to mobile and web devices.&lt;/li&gt;
&lt;li&gt;Manages device tokens, expiration, and invalid token cleanup.&lt;/li&gt;
&lt;li&gt;Handles platform-specific delivery semantics and retries.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Examples:&lt;/em&gt;&lt;br&gt;
Firebase Cloud Messaging (FCM), Apple Push Notification Service (APNs)&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why:&lt;/em&gt;&lt;br&gt;
Push platforms require tight integration with OS-level services.&lt;/p&gt;

&lt;h4&gt;
  
  
  10. Channel Workers – In-App
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Primary Responsibilities:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Delivers real-time notifications to active users over persistent connections.&lt;/li&gt;
&lt;li&gt;Maintains connection state and fan-out to connected clients.&lt;/li&gt;
&lt;li&gt;Falls back gracefully when users are offline.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Examples:&lt;/em&gt;&lt;br&gt;
WebSockets, Server-Sent Events (SSE), Redis Pub/Sub&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why:&lt;/em&gt;&lt;br&gt;
Provides the lowest-latency notification path for active users.&lt;/p&gt;

&lt;h4&gt;
  
  
  11. Retry Service
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Primary Responsibilities:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tracks failed delivery attempts and retry counts.&lt;/li&gt;
&lt;li&gt;Applies retry policies such as exponential backoff and maximum retry limits.&lt;/li&gt;
&lt;li&gt;Schedules retries through the Scheduler Service.&lt;/li&gt;
&lt;li&gt;Ensures retries are controlled and do not cause retry storms.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Examples:&lt;/em&gt;&lt;br&gt;
Kafka retry topics, Redis delay queues, SQS with visibility timeout&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why:&lt;/em&gt;&lt;br&gt;
Improves reliability while protecting the system under failure conditions.&lt;/p&gt;

&lt;h4&gt;
  
  
  12. Dead Letter Queue (DLQ)
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Primary Responsibilities:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stores notifications that fail permanently after all retries.&lt;/li&gt;
&lt;li&gt;Captures failure context and error metadata.&lt;/li&gt;
&lt;li&gt;Supports auditing, alerting, and optional manual reprocessing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Examples:&lt;/em&gt;&lt;br&gt;
Kafka DLQ topics, AWS SQS DLQ&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why:&lt;/em&gt;&lt;br&gt;
Ensures failures are visible and never silently dropped.&lt;/p&gt;

&lt;h4&gt;
  
  
  13. Preference Service
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Primary Responsibilities:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stores user notification preferences and channel-level settings.&lt;/li&gt;
&lt;li&gt;Provides low-latency reads for preference enforcement.&lt;/li&gt;
&lt;li&gt;Acts as the single source of truth for opt-in and quiet hours.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Examples:&lt;/em&gt;&lt;br&gt;
Microservice + Redis cache + DynamoDB/Cassandra&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why:&lt;/em&gt;&lt;br&gt;
Preference checks are on the critical path and must be fast and consistent.&lt;/p&gt;

&lt;h4&gt;
  
  
  14. Metadata Database
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Primary Responsibilities:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stores notification lifecycle state, delivery attempts, retry metadata, and audit logs.&lt;/li&gt;
&lt;li&gt;Supports strong consistency for state transitions.&lt;/li&gt;
&lt;li&gt;Optimized for high write throughput and time-based access patterns.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Examples:&lt;/em&gt;&lt;br&gt;
Cassandra, DynamoDB, ScyllaDB&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why:&lt;/em&gt;&lt;br&gt;
Designed for massive scale and durability under heavy write load.&lt;/p&gt;

&lt;h4&gt;
  
  
  15. Cache
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Primary Responsibilities:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Caches hot data such as preferences, idempotency keys, and rate-limit counters.&lt;/li&gt;
&lt;li&gt;Reduces load on the primary database and lowers latency.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Examples:&lt;/em&gt;&lt;br&gt;
Redis, Memcached&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why:&lt;/em&gt;&lt;br&gt;
Improves performance and protects databases under peak load.&lt;/p&gt;

&lt;h4&gt;
  
  
  16. Analytics &amp;amp; Tracking Service
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Primary Responsibilities:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Consumes delivery events asynchronously.&lt;/li&gt;
&lt;li&gt;Generates metrics for success rate, latency, retries, and failures.&lt;/li&gt;
&lt;li&gt;Supports dashboards, alerts, and reporting.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Examples:&lt;/em&gt;&lt;br&gt;
Kafka Streams, Flink, ClickHouse, BigQuery&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why:&lt;/em&gt;&lt;br&gt;
Separates observability from the critical delivery path.&lt;/p&gt;

&lt;h4&gt;
  
  
  17. Monitoring &amp;amp; Alerting Service
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Primary Responsibilities:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tracks system health, queue lag, error rates, and SLOs.&lt;/li&gt;
&lt;li&gt;Triggers alerts for abnormal behavior or degradation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Examples:&lt;/em&gt;&lt;br&gt;
Prometheus, Grafana, Datadog&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why:&lt;/em&gt;&lt;br&gt;
Early detection is critical in high-throughput systems.&lt;/p&gt;

&lt;h4&gt;
  
  
  18. Logging Service
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Primary Responsibilities:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Aggregates logs from all services for debugging and audits.&lt;/li&gt;
&lt;li&gt;Supports correlation across distributed requests.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Examples:&lt;/em&gt;&lt;br&gt;
ELK Stack, OpenSearch&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why:&lt;/em&gt;&lt;br&gt;
Distributed systems require centralized visibility.&lt;/p&gt;

&lt;h4&gt;
  
  
  19. Security &amp;amp; Secrets Management
&lt;/h4&gt;

&lt;p&gt;&lt;em&gt;Primary Responsibilities:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Manages encryption keys, API credentials, and sensitive configuration.&lt;/li&gt;
&lt;li&gt;Enforces encryption at rest and in transit.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Examples:&lt;/em&gt;&lt;br&gt;
AWS KMS, HashiCorp Vault, AWS Secrets Manager&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Why:&lt;/em&gt;&lt;br&gt;
Protects sensitive data and ensures compliance.&lt;/p&gt;




&lt;h2&gt;
  
  
  High-Level Flows
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Flow 0: Default Notification Flow (Happy Path)
&lt;/h4&gt;

&lt;p&gt;This is the baseline flow that everything else builds on.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Client sends a notification request with an idempotency key to the API Gateway.&lt;/li&gt;
&lt;li&gt;API Gateway authenticates the client, validates the request, and applies rate limits.&lt;/li&gt;
&lt;li&gt;Request is forwarded to the Notification Service.&lt;/li&gt;
&lt;li&gt;Notification Service: Validates payload, Classifies notification type (critical / promotional), Assigns priority, Fetches and enforces user preferences, Validates scheduling and expiry&lt;/li&gt;
&lt;li&gt;Notification metadata is written durably to the database.&lt;/li&gt;
&lt;li&gt;Notification Service publishes an event to the appropriate queue/topic.&lt;/li&gt;
&lt;li&gt;Channel Worker consumes the event and sends the notification via the provider.&lt;/li&gt;
&lt;li&gt;Delivery result is recorded and emitted to analytics.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Guarantee:&lt;/strong&gt; Notification is accepted, processed asynchronously, and delivered successfully.&lt;/p&gt;

&lt;h4&gt;
  
  
  Flow 1: Critical Notification (Low-Latency Path)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Notification is classified as critical (OTP, chat, security alert).&lt;/li&gt;
&lt;li&gt;Event is published to a high-priority queue/topic.&lt;/li&gt;
&lt;li&gt;Dedicated high-priority Channel Workers consume the event immediately.&lt;/li&gt;
&lt;li&gt;Worker sends notification to the provider with aggressive timeouts.&lt;/li&gt;
&lt;li&gt;Delivery result is recorded synchronously.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Guarantee:&lt;/strong&gt; Sub-second p99 latency, No impact from bulk or promotional traffic&lt;/p&gt;

&lt;h4&gt;
  
  
  Flow 2: Promotional Notification (Best-Effort Path)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Notification is classified as promotional.&lt;/li&gt;
&lt;li&gt;Notification Service enforces: Opt-in / opt-out, Quiet hours, Frequency caps, Expiry time&lt;/li&gt;
&lt;li&gt;Event is published to a low-priority queue/topic.&lt;/li&gt;
&lt;li&gt;Workers process messages opportunistically.&lt;/li&gt;
&lt;li&gt;Before sending, expiry is re-checked.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Guarantee:&lt;/strong&gt; Delivered only within validity window, Never blocks critical traffic&lt;/p&gt;

&lt;h4&gt;
  
  
  Flow 3: Scheduled Notification
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Client provides schedule_at.&lt;/li&gt;
&lt;li&gt;Notification Service stores the notification in scheduled state.&lt;/li&gt;
&lt;li&gt;Scheduler Service tracks the schedule using a time-indexed store.&lt;/li&gt;
&lt;li&gt;At trigger time, Scheduler publishes the event to the queue.&lt;/li&gt;
&lt;li&gt;Normal delivery flow resumes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Guarantee:&lt;/strong&gt; Sent at the scheduled time; no early or late delivery&lt;/p&gt;
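&lt;p&gt;A minimal sketch of the time-indexed store: an in-memory min-heap keyed by trigger time. A production Scheduler Service would use a durable, partitioned store, but the polling logic is the same:&lt;/p&gt;

```python
import heapq

class Scheduler:
    """Time-indexed store sketch: a min-heap keyed by trigger time.

    This in-memory version only illustrates the polling logic; a real
    scheduler persists entries so they survive restarts.
    """

    def __init__(self):
        self._heap = []

    def schedule(self, trigger_at: float, notification_id: str) -> None:
        heapq.heappush(self._heap, (trigger_at, notification_id))

    def due(self, now: float):
        """Pop and return every notification whose trigger time has passed."""
        ready = []
        while self._heap and self._heap[0][0] <= now:
            ready.append(heapq.heappop(self._heap)[1])
        return ready
```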

&lt;h4&gt;
  
  
  Flow 4: Bulk Notification / Campaign (Fan-out)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Client creates a bulk campaign.&lt;/li&gt;
&lt;li&gt;Notification Service stores campaign metadata.&lt;/li&gt;
&lt;li&gt;Campaign Service resolves target users asynchronously.&lt;/li&gt;
&lt;li&gt;Campaign is expanded into per-user notifications in batches.&lt;/li&gt;
&lt;li&gt;Batched events are published gradually with throttling.&lt;/li&gt;
&lt;li&gt;Channel Workers deliver independently.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Guarantee:&lt;/strong&gt; Fan-out is controlled; bulk traffic never overloads real-time flows&lt;/p&gt;
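&lt;p&gt;The batched expansion step can be sketched as a generator; the caller throttles between batches (for example with a rate limiter), and the batch size is illustrative:&lt;/p&gt;

```python
def expand_in_batches(user_ids, batch_size):
    """Yield the campaign audience in fixed-size batches.

    Publishing one batch at a time lets the Campaign Service throttle
    fan-out and apply backpressure between batches.
    """
    for i in range(0, len(user_ids), batch_size):
        yield user_ids[i:i + batch_size]
```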

&lt;h4&gt;
  
  
  Flow 5: Retry on Transient Failure
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Failure Detection&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Channel Worker calls provider.&lt;/li&gt;
&lt;li&gt;Provider returns a transient error: timeout, 5xx, rate limit, or network error.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Retry Handling&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Worker records failure and retry count.&lt;/li&gt;
&lt;li&gt;Retry Service evaluates the retry policy: is the error retryable? Is the retry count below the maximum?&lt;/li&gt;
&lt;li&gt;Retry Service computes next retry time (exponential backoff).&lt;/li&gt;
&lt;li&gt;Retry is scheduled via Scheduler Service.&lt;/li&gt;
&lt;li&gt;Scheduler republishes the event at retry time.&lt;/li&gt;
&lt;li&gt;Worker retries delivery.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Guarantee:&lt;/strong&gt; Safe retries; no retry storms; the system remains stable under partial outages&lt;/p&gt;
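&lt;p&gt;The next-retry-time computation above is typically exponential backoff with jitter. A hedged sketch, where the base delay and cap are assumed defaults:&lt;/p&gt;

```python
import random

def next_retry_delay(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Exponential backoff with full jitter.

    The delay doubles per attempt, is capped to avoid unbounded waits,
    and is randomized so synchronized retries do not form a retry storm.
    """
    delay = min(cap, base * (2 ** attempt))
    return random.uniform(0.0, delay)
```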

&lt;h4&gt;
  
  
  Flow 6: Provider Failover (Multi-Vendor)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Channel Worker detects provider degradation: High error rate, Throttling, Timeouts.&lt;/li&gt;
&lt;li&gt;Circuit breaker opens for the failing provider.&lt;/li&gt;
&lt;li&gt;Traffic is shifted to a secondary provider (if configured).&lt;/li&gt;
&lt;li&gt;Delivery attempts continue via backup provider.&lt;/li&gt;
&lt;li&gt;Primary provider is retried after cool-down.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Guarantee:&lt;/strong&gt; High availability despite provider outages; graceful degradation&lt;/p&gt;
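&lt;p&gt;A minimal sketch of the circuit breaker described above, using a consecutive-failure count and a cool-down window; the threshold and cool-down values are illustrative:&lt;/p&gt;

```python
class CircuitBreaker:
    """Count-based circuit breaker sketch.

    Opens after N consecutive failures; after the cool-down elapses a
    trial request is allowed (half-open), and a success closes it again.
    """

    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None  # timestamp when the breaker opened

    def record_failure(self, now: float) -> None:
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = now

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def allow_request(self, now: float) -> bool:
        if self.opened_at is None:
            return True
        return now - self.opened_at >= self.cooldown_s  # half-open trial
```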

&lt;h4&gt;
  
  
  Flow 7: Permanent Failure → DLQ
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Notification exceeds maximum retry attempts OR&lt;/li&gt;
&lt;li&gt;Error is classified as non-retryable (invalid number, blocked email).&lt;/li&gt;
&lt;li&gt;Notification is marked as failed.&lt;/li&gt;
&lt;li&gt;Payload and failure context are written to DLQ.&lt;/li&gt;
&lt;li&gt;Alerts are triggered for investigation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Guarantee:&lt;/strong&gt; No silent drops; full auditability&lt;/p&gt;
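&lt;p&gt;The retry-vs-DLQ decision can be sketched as a small classifier; the error codes and retry budget below are assumptions, not a fixed taxonomy:&lt;/p&gt;

```python
# Illustrative non-retryable error codes: retrying these cannot succeed.
NON_RETRYABLE = {"invalid_number", "blocked_email", "unsubscribed"}

def disposition(error_code: str, attempt: int, max_attempts: int = 5) -> str:
    """Return 'retry' or 'dlq' for a failed delivery attempt."""
    if error_code in NON_RETRYABLE:
        return "dlq"    # permanent failure: route to DLQ immediately
    if attempt >= max_attempts:
        return "dlq"    # retry budget exhausted
    return "retry"
```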

&lt;h4&gt;
  
  
  Flow 8: Idempotent Request Handling
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Client retries request due to timeout.&lt;/li&gt;
&lt;li&gt;API Gateway / Notification Service checks idempotency key.&lt;/li&gt;
&lt;li&gt;Duplicate request is detected.&lt;/li&gt;
&lt;li&gt;Existing notification reference is returned.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Guarantee:&lt;/strong&gt; No duplicate notifications; safe client retries&lt;/p&gt;
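&lt;p&gt;The idempotency check can be sketched as a put-if-absent operation. This in-memory version stands in for a shared store (for example Redis with a TTL); the method name is illustrative:&lt;/p&gt;

```python
class IdempotencyStore:
    """In-memory stand-in for the idempotency-key check.

    The first request with a key wins; a retried duplicate gets back the
    reference to the notification created by the original request.
    """

    def __init__(self):
        self._seen = {}

    def put_if_absent(self, key: str, notification_id: str):
        """Return (created, existing_id) for this idempotency key."""
        if key in self._seen:
            return False, self._seen[key]
        self._seen[key] = notification_id
        return True, notification_id
```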

&lt;h4&gt;
  
  
  Flow 9: Cancellation of Scheduled Notification
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Client requests cancellation.&lt;/li&gt;
&lt;li&gt;Notification Service validates state.&lt;/li&gt;
&lt;li&gt;Notification is marked cancelled.&lt;/li&gt;
&lt;li&gt;Scheduler skips execution if encountered.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Guarantee:&lt;/strong&gt; Safe cancellation before delivery&lt;/p&gt;

&lt;h4&gt;
  
  
  Flow 10: Expiry Enforcement
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Notification has expiry_at.&lt;/li&gt;
&lt;li&gt;Before delivery, worker checks current time.&lt;/li&gt;
&lt;li&gt;If expired, delivery is skipped and the status is marked expired.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Guarantee:&lt;/strong&gt; Promotions are never delivered late&lt;/p&gt;

&lt;h4&gt;
  
  
  Flow 11: Per-User Ordering (When Required)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Notifications are keyed by user/device.&lt;/li&gt;
&lt;li&gt;Queue guarantees ordering per key.&lt;/li&gt;
&lt;li&gt;Workers process in order for each user.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Guarantee:&lt;/strong&gt; Correct ordering for chat and conversational flows&lt;/p&gt;
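&lt;p&gt;The keying step can be sketched as a stable hash of the user ID; any deterministic hash works, as long as every producer computes the same partition for the same key:&lt;/p&gt;

```python
import hashlib

def partition_for(user_id: str, num_partitions: int) -> int:
    """Stable hash of the ordering key.

    Every notification for the same user lands in the same partition, so
    per-partition FIFO consumption yields per-user ordering without any
    global ordering guarantee.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```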

&lt;h4&gt;
  
  
  Flow 12: Analytics &amp;amp; Tracking
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Workers emit delivery events.&lt;/li&gt;
&lt;li&gt;Analytics Service consumes asynchronously.&lt;/li&gt;
&lt;li&gt;Metrics, dashboards, and alerts update.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Guarantee:&lt;/strong&gt; Observability without impacting delivery latency&lt;/p&gt;




&lt;h2&gt;
  
  
  Deep Dives – Functional Requirements
&lt;/h2&gt;

&lt;h4&gt;
  
  
  1. Support Sending Notifications to Users
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;The system exposes asynchronous APIs that allow internal services and external clients to trigger notifications in a non-blocking manner.&lt;/li&gt;
&lt;li&gt;Once a request is accepted, notification intent is durably persisted, ensuring the notification is not lost even if downstream components fail.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  2. Support Delivery Across Multiple Channels
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Notifications can be delivered through Email, SMS, Push, and In-app channels.&lt;/li&gt;
&lt;li&gt;Each channel is implemented as an independent delivery pipeline with its own workers, providers, retry logic, and scaling policy, preventing failures in one channel from impacting others.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  3. Support Critical and Promotional Notification Types
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Notifications are classified at ingestion time based on type and priority.&lt;/li&gt;
&lt;li&gt;Critical notifications are routed through high-priority queues and dedicated workers to guarantee low latency, while promotional notifications are routed through low-priority paths that tolerate delay and throttling.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  4. Support User Notification Preferences and Opt-In/Opt-Out
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;User preferences such as channel enablement, quiet hours, and frequency limits are enforced before delivery.&lt;/li&gt;
&lt;li&gt;Preferences are cached for low-latency access and treated as the source of truth, with limited and explicit overrides allowed for critical system alerts.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  5. Support Scheduled Notifications
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;The system allows notifications to be scheduled for future delivery using a distributed scheduler.&lt;/li&gt;
&lt;li&gt;Scheduled notifications are triggered exactly at the specified time, survive service restarts, and are validated against expiry constraints before being dispatched.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  6. Support Bulk Notifications Targeting Large User Groups
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Bulk notifications are modeled as campaigns that are expanded asynchronously into per-user notifications.&lt;/li&gt;
&lt;li&gt;Fan-out is performed in batches with throttling and backpressure to protect downstream systems and preserve the performance of real-time notifications.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  7. Support Safe Retries and Idempotent Processing
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;All notification operations use idempotency keys to ensure retries do not create duplicates.&lt;/li&gt;
&lt;li&gt;Delivery failures are retried using controlled retry policies such as exponential backoff, with retry state persisted to survive crashes and restarts.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  8. Support Tracking of Notification Delivery Status
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Each notification and its delivery attempts are tracked through well-defined lifecycle states.&lt;/li&gt;
&lt;li&gt;Delivery events are emitted asynchronously to analytics systems, enabling auditing, monitoring, and reporting without impacting delivery latency.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Non-Functional Requirements
&lt;/h2&gt;

&lt;h4&gt;
  
  
  1. Highly Available and Fault Tolerant
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;The system is composed of stateless services deployed across multiple availability zones.&lt;/li&gt;
&lt;li&gt;All critical state (notification metadata, retry state, schedules) is stored in replicated and durable systems.&lt;/li&gt;
&lt;li&gt;Failures of individual services, nodes, or zones do not result in downtime or message loss.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  2. Low-Latency Delivery for Critical Notifications
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Critical notifications are isolated using priority-aware queues and dedicated worker pools.&lt;/li&gt;
&lt;li&gt;This prevents head-of-line blocking from bulk or promotional traffic.&lt;/li&gt;
&lt;li&gt;The critical delivery path minimizes synchronous work to achieve predictable sub-second p99 latency.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  3. High Throughput with Large-Scale Fan-out
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;The system uses asynchronous ingestion and delivery pipelines backed by high-throughput message queues.&lt;/li&gt;
&lt;li&gt;Bulk notifications are expanded and delivered in batches with controlled fan-out rates.&lt;/li&gt;
&lt;li&gt;This allows the system to sustain millions of notifications per second during peak events.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  4. Highly Scalable with Increasing Traffic
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;All components scale horizontally and independently.&lt;/li&gt;
&lt;li&gt;API servers scale with request volume, queues scale via partitioning, and workers scale based on backlog and lag.&lt;/li&gt;
&lt;li&gt;Capacity increases linearly by adding instances, without architectural changes.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  5. Durable Notification Processing with No Message Loss
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Once a notification request is accepted, it is durably persisted before processing begins.&lt;/li&gt;
&lt;li&gt;At-least-once delivery guarantees ensure notifications are eventually processed even after crashes or restarts.&lt;/li&gt;
&lt;li&gt;Explicit lifecycle states prevent silent drops or stuck notifications.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  6. Secure Notification Delivery and Access Control
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;All APIs are authenticated and authorized at the gateway layer with tenant-level isolation.&lt;/li&gt;
&lt;li&gt;Sensitive data is encrypted both in transit and at rest.&lt;/li&gt;
&lt;li&gt;Access to external delivery providers is tightly controlled using scoped credentials and secret rotation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  7. Cost-Efficient Operation at Scale
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;The system avoids synchronous delivery and keeps the critical path lightweight.&lt;/li&gt;
&lt;li&gt;Promotional traffic is throttled and deprioritized to reduce peak infrastructure costs.&lt;/li&gt;
&lt;li&gt;Analytics and reporting are handled asynchronously, keeping delivery fast and cost-efficient.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Trade-Offs
&lt;/h2&gt;

&lt;h4&gt;
  
  
  1. At-Least-Once Delivery vs Exactly-Once Delivery
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Choice:&lt;/strong&gt; At-least-once delivery with idempotent processing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ensures no notification is ever lost.&lt;/li&gt;
&lt;li&gt;Simplifies system design and improves throughput.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Duplicate delivery attempts are possible in failure scenarios.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why This Works&lt;/strong&gt;&lt;br&gt;
Idempotency keys and state tracking prevent user-visible duplicates while preserving durability, which is more critical than strict exactly-once semantics.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Priority Isolation vs Single Unified Queue
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Choice:&lt;/strong&gt; Separate queues and workers for critical and promotional notifications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Guarantees low latency for critical notifications.&lt;/li&gt;
&lt;li&gt;Prevents promotional spikes from impacting OTPs or chat messages.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increases operational complexity and infrastructure cost.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why This Works&lt;/strong&gt;&lt;br&gt;
Latency guarantees for critical traffic are non-negotiable in real systems, and isolation is the simplest and most reliable way to enforce them.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Asynchronous Processing vs Synchronous Delivery
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Choice:&lt;/strong&gt; Asynchronous notification ingestion and delivery.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enables very high throughput and resilience to downstream failures.&lt;/li&gt;
&lt;li&gt;Protects clients from provider latency and outages.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Clients do not get immediate delivery confirmation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why This Works&lt;/strong&gt;&lt;br&gt;
Notifications are inherently asynchronous, and durability plus retries provide stronger guarantees than blocking APIs.&lt;/p&gt;

&lt;h4&gt;
  
  
  4. Fan-out at Write Time vs Fan-out at Read Time
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Choice:&lt;/strong&gt; Fan-out at write time for bulk and campaign notifications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simplifies delivery logic and tracking.&lt;/li&gt;
&lt;li&gt;Allows per-user preference checks and rate limiting.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Higher write amplification and storage usage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why This Works&lt;/strong&gt;&lt;br&gt;
Write-heavy fan-out enables precise control, retries, and auditing, which are required for large-scale notification platforms.&lt;/p&gt;

&lt;h4&gt;
  
  
  5. Strong Consistency vs Eventual Consistency
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Choice:&lt;/strong&gt; Strong consistency for notification state, eventual consistency for analytics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prevents duplicate deliveries and inconsistent user experience.&lt;/li&gt;
&lt;li&gt;Improves availability and performance for non-critical data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Analytics may lag slightly behind real-time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why This Works&lt;/strong&gt;&lt;br&gt;
Users care about correct delivery, not real-time dashboards. Separating consistency models optimizes both correctness and scale.&lt;/p&gt;

&lt;h4&gt;
  
  
  6. Centralized Preference Checks vs Cached Preferences
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Choice:&lt;/strong&gt; Cache-first preference checks with database fallback.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduces latency and database load.&lt;/li&gt;
&lt;li&gt;Supports real-time delivery at scale.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cache invalidation adds complexity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why This Works&lt;/strong&gt;&lt;br&gt;
Preferences change infrequently compared to delivery volume, making caching a high-impact optimization.&lt;/p&gt;

&lt;h4&gt;
  
  
  7. Single Provider vs Multi-Provider Strategy
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Choice:&lt;/strong&gt; Multi-provider integration for email and SMS.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Improves reliability and reduces vendor lock-in.&lt;/li&gt;
&lt;li&gt;Enables failover during provider outages.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Higher integration and operational complexity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why This Works&lt;/strong&gt;&lt;br&gt;
External providers are unreliable by nature; redundancy is essential for critical notifications.&lt;/p&gt;

&lt;h4&gt;
  
  
  8. Aggressive Retries vs Controlled Backoff
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Choice:&lt;/strong&gt; Controlled retries with exponential backoff.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prevents retry storms and provider overload.&lt;/li&gt;
&lt;li&gt;Improves system stability under failure.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Retries may introduce delivery delays.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why This Works&lt;/strong&gt;&lt;br&gt;
Stability and provider trust are more important than aggressive retrying, especially at high scale.&lt;/p&gt;

&lt;h4&gt;
  
  
  9. Immediate Deletion vs Retained Delivery Logs
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Choice:&lt;/strong&gt; Retain notification logs with configurable TTL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Supports auditing, debugging, and compliance.&lt;/li&gt;
&lt;li&gt;Enables analytics and reporting.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requires additional storage.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why This Works&lt;/strong&gt;&lt;br&gt;
Storage is cheap compared to the cost of missing audit data in incidents or compliance scenarios.&lt;/p&gt;

&lt;h4&gt;
  
  
  10. Cost Optimization vs Peak Performance
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Choice:&lt;/strong&gt; Optimize cost for promotional traffic, optimize performance for critical traffic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Keeps infrastructure costs predictable.&lt;/li&gt;
&lt;li&gt;Protects user experience for high-priority notifications.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Promotional notifications may be delayed during peak load.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why This Works&lt;/strong&gt;&lt;br&gt;
Business impact of delayed promotions is far lower than delayed critical alerts.&lt;/p&gt;




&lt;h2&gt;
  
  
  Frequently Asked Questions in Interviews
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Q. Why do we separate critical and promotional notifications?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Critical notifications (OTP, security alerts, chat messages) have strict latency and reliability SLOs, while promotional notifications can tolerate delays.&lt;/li&gt;
&lt;li&gt;By isolating them into separate queues, partitions, and worker pools, we prevent head-of-line blocking where a promotional spike could delay time-sensitive messages.&lt;/li&gt;
&lt;li&gt;This guarantees predictable latency for critical traffic even during large campaigns.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Q. Why is at-least-once delivery preferred over exactly-once delivery?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Exactly-once delivery requires distributed transactions across queues, databases, and external providers, which is expensive and fragile at scale.&lt;/li&gt;
&lt;li&gt;At-least-once delivery guarantees durability and availability, which are more important for notifications.&lt;/li&gt;
&lt;li&gt;User-visible duplicates are avoided using idempotency keys and state checks, achieving practical correctness with far lower complexity.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Q. How do you prevent duplicate notifications during retries?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Each notification has a globally unique notification ID or idempotency key.&lt;/li&gt;
&lt;li&gt;Before sending, workers check the persisted delivery state to ensure the notification hasn’t already been delivered.&lt;/li&gt;
&lt;li&gt;Retries update state atomically, so even if the same message is processed twice, only one delivery attempt succeeds.&lt;/li&gt;
&lt;/ul&gt;
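&lt;p&gt;The atomic state update can be sketched as a conditional transition; a real system would implement it as a conditional database update (compare-and-set), not an in-memory dict:&lt;/p&gt;

```python
def try_mark_sending(state_store: dict, notification_id: str) -> bool:
    """Conditional state transition: 'queued' -> 'sending'.

    Only the first worker performs the transition; a redelivered copy of
    the same message sees the changed state and becomes a no-op. In
    production this is an atomic conditional UPDATE, not a dict check.
    """
    if state_store.get(notification_id) == "queued":
        state_store[notification_id] = "sending"
        return True
    return False
```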

&lt;h4&gt;
  
  
  Q. How do you handle massive fan-out for promotional campaigns?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Bulk campaigns are expanded asynchronously rather than synchronously at API time.&lt;/li&gt;
&lt;li&gt;The system processes recipients in batches, applies preferences and rate limits, and enqueues individual delivery tasks gradually.&lt;/li&gt;
&lt;li&gt;Fan-out rate is throttled to protect downstream providers and internal infrastructure.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Q. What happens if the notification service crashes mid-processing?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;All important state transitions are persisted before moving to the next step.&lt;/li&gt;
&lt;li&gt;If a worker crashes after pulling a message but before acknowledging it, the message is re-delivered by the queue.&lt;/li&gt;
&lt;li&gt;Because processing is idempotent, retries do not corrupt state or cause duplicates.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Q. How is per-user ordering guaranteed?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Notifications are partitioned by user ID (or user-channel key) in the message queue.&lt;/li&gt;
&lt;li&gt;Consumers process messages sequentially within a partition, ensuring ordering for a given user.&lt;/li&gt;
&lt;li&gt;Global ordering is intentionally not guaranteed, as it does not scale and is unnecessary.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Q. How do you handle external provider failures (SMS, Email, Push)?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Providers are treated as unreliable dependencies.&lt;/li&gt;
&lt;li&gt;Each provider integration includes timeouts, bounded retries, and circuit breakers.&lt;/li&gt;
&lt;li&gt;Failures are retried later or routed to fallback providers if configured.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Q. What if a provider is slow but not fully down?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Latency-based circuit breakers detect degradation even when errors are low.&lt;/li&gt;
&lt;li&gt;Traffic is gradually reduced or paused to avoid queue buildup and cascading failures.&lt;/li&gt;
&lt;li&gt;This protects system stability and prevents retry storms.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Q. How do you ensure users don’t receive expired promotions?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Promotional notifications include an explicit expiration timestamp.&lt;/li&gt;
&lt;li&gt;Workers validate the expiry at delivery time and discard expired notifications immediately.&lt;/li&gt;
&lt;li&gt;This ensures correctness even if notifications are delayed due to retries or backpressure.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Q. How are user preferences enforced at scale?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;User preferences are cached in memory (e.g., Redis) for fast access.&lt;/li&gt;
&lt;li&gt;The database remains the source of truth but is only consulted on cache misses or updates.&lt;/li&gt;
&lt;li&gt;This allows preference checks to be performed inline without adding latency.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Q. How do you support scheduled notifications at large scale?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Scheduled notifications are stored in time-partitioned storage keyed by execution time.&lt;/li&gt;
&lt;li&gt;A scheduler scans upcoming time windows and enqueues notifications just-in-time for delivery.&lt;/li&gt;
&lt;li&gt;This avoids keeping millions of delayed messages sitting in queues.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Q. How do you prevent notification spam?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Rate limits are applied per user, per channel, and per tenant.&lt;/li&gt;
&lt;li&gt;Promotional notifications are capped daily, while critical notifications bypass limits.&lt;/li&gt;
&lt;li&gt;This protects user experience without impacting essential communication.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Q. How is multi-tenancy handled?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Each tenant has isolated identifiers, quotas, rate limits, and metrics.&lt;/li&gt;
&lt;li&gt;Traffic from one tenant cannot starve resources for others.&lt;/li&gt;
&lt;li&gt;Billing and usage tracking are enforced at the tenant level.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Q. How do you monitor system health?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Metrics track queue depth, consumer lag, latency percentiles, retry rates, and provider errors.&lt;/li&gt;
&lt;li&gt;Dashboards provide real-time visibility, and alerts trigger when SLOs are violated.&lt;/li&gt;
&lt;li&gt;This allows proactive issue detection before users are impacted.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Q. How do you debug a missing or delayed notification?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Every notification has a traceable lifecycle with immutable logs.&lt;/li&gt;
&lt;li&gt;Operators can trace a notification ID across ingestion, scheduling, retries, and delivery attempts.&lt;/li&gt;
&lt;li&gt;Dead Letter Queues preserve full context for permanent failures.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Q. What are the biggest scalability bottlenecks?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Metadata writes, fan-out amplification, and external provider rate limits.&lt;/li&gt;
&lt;li&gt;These are mitigated using partitioning, batching, caching, and backpressure.&lt;/li&gt;
&lt;li&gt;Provider limits often become the true ceiling, not internal infrastructure.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Q. How does the system behave under extreme load?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Critical notifications continue to flow with priority.&lt;/li&gt;
&lt;li&gt;Promotional traffic is throttled, delayed, or dropped first.&lt;/li&gt;
&lt;li&gt;The system degrades gracefully instead of failing catastrophically.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Q. Why not make notification delivery synchronous?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Synchronous delivery couples system availability to external providers.&lt;/li&gt;
&lt;li&gt;Any provider latency or outage would block clients and reduce availability.&lt;/li&gt;
&lt;li&gt;Asynchronous processing decouples ingestion from delivery and improves resilience.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Q. How would the system change at 10× or 100× scale?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;The architecture remains the same.&lt;/li&gt;
&lt;li&gt;We increase partitions, workers, and regional deployments.&lt;/li&gt;
&lt;li&gt;No redesign is required, only capacity expansion.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Q. How do you add a new notification channel (e.g., WhatsApp)?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Add a new channel processor and provider integration.&lt;/li&gt;
&lt;li&gt;Core ingestion, scheduling, retry, and tracking logic remains unchanged.&lt;/li&gt;
&lt;li&gt;This keeps the system extensible and pluggable.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Q. What guarantees does the system actually provide?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Near-real-time delivery for critical notifications.&lt;/li&gt;
&lt;li&gt;At-least-once delivery with idempotency.&lt;/li&gt;
&lt;li&gt;Per-user ordering where required.&lt;/li&gt;
&lt;li&gt;No delivery after expiry for promotions.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  High-Level Summary
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;This notification system delivers low-latency, highly reliable critical notifications while supporting large-scale promotional fan-out without interference.&lt;br&gt;
It uses an asynchronous, event-driven architecture with durable queues, idempotent processing, and safe retries to prevent message loss or duplication.&lt;br&gt;
Traffic isolation, rate limiting, and expiry checks ensure correctness and user experience even during spikes or provider failures.&lt;br&gt;
The system scales linearly and cost-efficiently, matching real-world production notification platforms.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Feel free to ask questions or share your thoughts — happy to discuss!&lt;/em&gt;&lt;/p&gt;




</description>
      <category>systemdesign</category>
      <category>hld</category>
      <category>interview</category>
    </item>
    <item>
      <title>Design HLD - Distributed File Storage System -Dropbox | Image Upload Service</title>
      <dc:creator>Vikas Kumar</dc:creator>
      <pubDate>Fri, 06 Feb 2026 08:25:37 +0000</pubDate>
      <link>https://dev.to/learnwithvikzzy/design-hld-dropbox-image-upload-service-57gl</link>
      <guid>https://dev.to/learnwithvikzzy/design-hld-dropbox-image-upload-service-57gl</guid>
      <description>&lt;h2&gt;
  
  
  Requirements
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Functional Requirements
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Support &lt;strong&gt;image upload and download&lt;/strong&gt; across devices.&lt;/li&gt;
&lt;li&gt;Identify and manage &lt;strong&gt;exact duplicate images&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Ensure &lt;strong&gt;safe retry&lt;/strong&gt; of upload operations.&lt;/li&gt;
&lt;li&gt;Support &lt;strong&gt;image transformations&lt;/strong&gt; (e.g., thumbnails).&lt;/li&gt;
&lt;li&gt;Provide &lt;strong&gt;secure image access&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Support &lt;strong&gt;automatic synchronization&lt;/strong&gt; across user devices.&lt;/li&gt;
&lt;li&gt;Support &lt;strong&gt;safe image deletion&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Non Functional Requirements
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Highly available&lt;/strong&gt; and &lt;strong&gt;fault tolerant&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Low-latency&lt;/strong&gt; and &lt;strong&gt;high-throughput&lt;/strong&gt; operations.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;High scalability&lt;/strong&gt; with growing traffic.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Durable&lt;/strong&gt; and &lt;strong&gt;reliable&lt;/strong&gt; file storage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Secure storage&lt;/strong&gt; and &lt;strong&gt;access control&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Support large file&lt;/strong&gt; uploads up to 50 GB.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost-efficient&lt;/strong&gt; at scale.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Key Concepts You Must Know
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;To be discussed during design&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Object Storage vs Metadata Storage
&lt;/h4&gt;

&lt;p&gt;Object storage is a distributed storage system optimized for storing large, unstructured binary data, while metadata storage is a structured data store used to manage information about those objects.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Databases are optimized for small, structured records and queries, not large files.&lt;/li&gt;
&lt;li&gt;Object storage systems are optimized for durability, scalability, and cost, but not for complex querying.&lt;/li&gt;
&lt;li&gt;Separating image bytes from metadata allows each system to do what it is best at.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Analogy (Library Model)&lt;/em&gt;&lt;br&gt;
Object storage is the warehouse storing heavy books. Metadata storage is the catalog system telling you what the book is and where it lives.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example&lt;/em&gt;&lt;br&gt;
Metadata DB → image_id, owner_id, size, hash, storage_path&lt;br&gt;
Object Store → actual image bytes&lt;/p&gt;
&lt;h4&gt;
  
  
  Multipart / Resumable Uploads
&lt;/h4&gt;

&lt;p&gt;Multipart uploads divide large files into smaller parts that can be uploaded independently and reassembled by the storage system.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Large uploads are prone to network failures and timeouts.&lt;/li&gt;
&lt;li&gt;Chunking allows retries at a fine-grained level instead of restarting the entire upload.&lt;/li&gt;
&lt;li&gt;Upload state is tracked via an upload session.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Analogy (Shipping Boxes)&lt;/em&gt;&lt;br&gt;
Instead of shipping one huge box, ship many small boxes. If one box is lost, only that box is resent.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example&lt;/em&gt;&lt;br&gt;
UploadSession ID&lt;br&gt;
→ Chunk 1 uploaded&lt;br&gt;
→ Chunk 2 uploaded&lt;br&gt;
→ Chunk 3 failed → retry&lt;/p&gt;
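&lt;p&gt;The retry behaviour can be illustrated with a minimal sketch. The &lt;code&gt;flaky_send&lt;/code&gt; transport and &lt;code&gt;UploadSession&lt;/code&gt; class are illustrative, not a real upload API:&lt;/p&gt;

```python
# Minimal resumable multipart upload sketch over an unreliable transport.
CHUNK_SIZE = 4

def split_chunks(data, size=CHUNK_SIZE):
    return [data[i:i + size] for i in range(0, len(data), size)]

class UploadSession:
    def __init__(self, data):
        self.chunks = split_chunks(data)
        self.uploaded = {}  # chunk_number -> bytes

    def upload(self, send_chunk):
        # Retry only chunks that have not succeeded yet.
        for number, chunk in enumerate(self.chunks):
            if number in self.uploaded:
                continue
            try:
                send_chunk(number, chunk)
                self.uploaded[number] = chunk
            except ConnectionError:
                pass  # left pending; a later upload() call retries it

    def assemble(self):
        assert len(self.uploaded) == len(self.chunks), "upload incomplete"
        return b"".join(self.uploaded[i] for i in range(len(self.chunks)))

# Simulate a transport that fails the first attempt at chunk 2.
failures = {2}
def flaky_send(number, chunk):
    if number in failures:
        failures.discard(number)
        raise ConnectionError("network blip")

session = UploadSession(b"0123456789abcdef")
session.upload(flaky_send)   # chunk 2 fails, others succeed
session.upload(flaky_send)   # retries only chunk 2
```

&lt;p&gt;Only the lost "box" is resent; completed chunks are never uploaded twice.&lt;/p&gt;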
&lt;h4&gt;
  
  
  Signed / Time-Bound URLs
&lt;/h4&gt;

&lt;p&gt;Signed URLs provide temporary, secure access to private objects by embedding authentication information into the URL itself.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The backend validates access and generates a URL with an expiry time and signature.&lt;/li&gt;
&lt;li&gt;Storage systems trust the signature and serve the object directly.&lt;/li&gt;
&lt;li&gt;This avoids routing large downloads through application servers.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Analogy (Hotel Key Card)&lt;/em&gt;&lt;br&gt;
A hotel card opens your room only for a limited time. After checkout, it stops working automatically.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example&lt;/em&gt;&lt;br&gt;
GET /image/123&lt;br&gt;
→ Backend returns signed URL (expires in 5 min)&lt;br&gt;
→ Client downloads from storage&lt;/p&gt;
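&lt;p&gt;A signed URL is, at its core, an HMAC over the path and expiry time. The sketch below shows the idea with Python's standard library; the hostname and signing key are made up for illustration:&lt;/p&gt;

```python
import hashlib
import hmac
import time

SECRET = b"server-side-signing-key"  # illustrative; real systems use managed keys

def sign_url(path, expires_at, secret=SECRET):
    # Bind the signature to both the path and the expiry timestamp.
    msg = f"{path}|{expires_at}".encode()
    sig = hmac.new(secret, msg, hashlib.sha256).hexdigest()
    return f"https://storage.example.com{path}?expires={expires_at}&sig={sig}"

def verify_url(path, expires_at, sig, now, secret=SECRET):
    if now > expires_at:
        return False  # expired, like a hotel key card after checkout
    expected = hmac.new(secret, f"{path}|{expires_at}".encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)

now = int(time.time())
url = sign_url("/image/123", now + 300)   # valid for 5 minutes
sig = url.split("sig=")[-1]
```

&lt;p&gt;The storage layer only needs the shared secret to verify the signature; it never consults the application servers, which is what keeps them out of the download path.&lt;/p&gt;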
&lt;h4&gt;
  
  
  Content-Based Deduplication
&lt;/h4&gt;

&lt;p&gt;Content-based deduplication eliminates redundant data by identifying identical content using cryptographic hashes.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Before storing an image, the system computes its hash.&lt;/li&gt;
&lt;li&gt;If the hash already exists, storage is skipped and a new reference is created.&lt;/li&gt;
&lt;li&gt;Multiple users can reference the same underlying object.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Analogy (Pointer to Same File)&lt;/em&gt;&lt;br&gt;
Instead of saving the same file twice, create another pointer to it.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example&lt;/em&gt;&lt;br&gt;
Hash(H1) exists&lt;br&gt;
→ ref_count++&lt;br&gt;
→ no new storage write&lt;/p&gt;
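&lt;p&gt;The ref-count bookkeeping can be sketched with an in-memory table (a stand-in for the blob store and its metadata):&lt;/p&gt;

```python
import hashlib

# Content-addressed dedup sketch: one blob row per unique hash,
# with a reference count instead of a second copy of the bytes.
blobs = {}  # content_hash -> {"data": bytes, "ref_count": int}

def store(data):
    h = hashlib.sha256(data).hexdigest()
    if h in blobs:
        blobs[h]["ref_count"] += 1   # duplicate: new reference, no new bytes
    else:
        blobs[h] = {"data": data, "ref_count": 1}
    return h

h1 = store(b"same image bytes")
h2 = store(b"same image bytes")  # exact duplicate -> ref_count++
h3 = store(b"different bytes")   # new content -> new blob
```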
&lt;h4&gt;
  
  
  Cryptographic Hash (SHA-256)
&lt;/h4&gt;

&lt;p&gt;SHA-256 is a cryptographic hash function that produces a fixed-length, collision-resistant fingerprint for any input.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Same input always produces the same hash.&lt;/li&gt;
&lt;li&gt;Any change in input produces a drastically different hash.&lt;/li&gt;
&lt;li&gt;Collision probability is negligible for practical systems.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Analogy (DNA for Files)&lt;/em&gt;&lt;br&gt;
Just as DNA uniquely identifies a person, a SHA-256 hash uniquely identifies a file's contents.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example&lt;/em&gt;&lt;br&gt;
image.jpg → SHA-256 → 256-bit hash&lt;/p&gt;
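&lt;p&gt;The two properties the system relies on, determinism and the avalanche effect, are easy to demonstrate with Python's &lt;code&gt;hashlib&lt;/code&gt;:&lt;/p&gt;

```python
import hashlib

# Same input -> same digest; a tiny change -> an unrelated digest.
a = hashlib.sha256(b"image bytes v1").hexdigest()
b = hashlib.sha256(b"image bytes v1").hexdigest()
c = hashlib.sha256(b"image bytes v2").hexdigest()
```

&lt;p&gt;The hex digest is 64 characters, i.e. the 256-bit fingerprint mentioned above.&lt;/p&gt;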
&lt;h4&gt;
  
  
  Idempotent Operations
&lt;/h4&gt;

&lt;p&gt;Idempotency ensures that repeating an operation produces the same final state as executing it once.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Network failures often cause retries.&lt;/li&gt;
&lt;li&gt;Without idempotency, retries can corrupt data or create duplicates.&lt;/li&gt;
&lt;li&gt;Idempotency is usually enforced using unique request IDs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Analogy (Light Switch)&lt;/em&gt;&lt;br&gt;
Turning the light ON multiple times keeps it ON.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example&lt;/em&gt;&lt;br&gt;
DELETE image/123&lt;br&gt;
→ deleted = true&lt;br&gt;
→ retry DELETE → no change&lt;/p&gt;
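&lt;p&gt;The DELETE example above behaves like this sketch, where repeating the call leaves the same final state (names are illustrative):&lt;/p&gt;

```python
# Idempotent delete sketch: setting deleted=True twice equals setting it once.
images = {"img_123": {"deleted": False}}

def delete_image(image_id):
    img = images.get(image_id)
    if img is None:
        return "not_found"
    img["deleted"] = True  # the "light switch": ON stays ON
    return "deleted"

first = delete_image("img_123")
retry = delete_image("img_123")  # client retries after a timeout
```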
&lt;h4&gt;
  
  
  Two-Phase Deletion
&lt;/h4&gt;

&lt;p&gt;Two-phase deletion separates logical deletion from physical deletion to ensure safety and consistency.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Immediate physical deletion is risky in distributed systems.&lt;/li&gt;
&lt;li&gt;Soft delete hides the image immediately.&lt;/li&gt;
&lt;li&gt;Hard delete is done later by a background process.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Analogy (Recycle Bin)&lt;/em&gt;&lt;br&gt;
You delete a file → it goes to trash → later permanently removed.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example&lt;/em&gt;&lt;br&gt;
Phase 1: deleted = true&lt;br&gt;
Phase 2: GC job removes blob&lt;/p&gt;
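&lt;p&gt;Combined with deduplication, the two phases look roughly like this sketch: phase 1 flips a flag, and a GC pass removes a blob only once nothing references it:&lt;/p&gt;

```python
# Two-phase delete sketch: soft delete hides the image; a GC pass
# removes the blob only when no live image references its content.
images = {"img_1": {"hash": "H1", "deleted": False},
          "img_2": {"hash": "H1", "deleted": False}}   # dedup: shared blob
blobs = {"H1": b"shared bytes"}

def soft_delete(image_id):
    images[image_id]["deleted"] = True  # hidden immediately (recycle bin)

def gc_pass():
    live = {img["hash"] for img in images.values() if not img["deleted"]}
    for h in list(blobs):
        if h not in live:
            del blobs[h]  # safe: no live image references this content

soft_delete("img_1")
gc_pass()
survived_first_gc = "H1" in blobs  # True: img_2 still references H1
soft_delete("img_2")
gc_pass()                          # now H1 is unreferenced and removed
```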


&lt;h2&gt;
  
  
  Capacity Estimation
&lt;/h2&gt;
&lt;h4&gt;
  
  
  Key Assumptions
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;DAU (Daily Active Users): ~10 million&lt;/li&gt;
&lt;li&gt;Uploads per user per day: ~2 images&lt;/li&gt;
&lt;li&gt;Average image size: ~5 MB&lt;/li&gt;
&lt;li&gt;Traffic pattern: Read-heavy (images viewed more than uploaded)&lt;/li&gt;
&lt;li&gt;System scale: Large-scale, distributed system assumed&lt;/li&gt;
&lt;/ul&gt;
&lt;h4&gt;
  
  
  Upload Volume Estimation
&lt;/h4&gt;

&lt;p&gt;Total uploads per day =&amp;gt; 10M users × 2 uploads = ~20M images/day&lt;br&gt;
Total data uploaded per day =&amp;gt; 20M images × 5 MB ≈ ~100 TB/day&lt;/p&gt;
&lt;h4&gt;
  
  
  Throughput Estimation (QPS)
&lt;/h4&gt;

&lt;p&gt;Write Traffic - Average write QPS (Queries Per Second) =&amp;gt; 20M / 86,400 ≈ ~230 uploads/sec&lt;br&gt;
Read Traffic - Reads are assumed ~5× writes =&amp;gt; Average read QPS: ~1,150/sec&lt;/p&gt;
&lt;h4&gt;
  
  
  Metadata Size Estimation
&lt;/h4&gt;

&lt;p&gt;Metadata per image: ~100 bytes (IDs, hash, timestamps, flags)&lt;br&gt;
Metadata per day =&amp;gt; 20M × 100 B ≈ ~2 GB/day&lt;/p&gt;
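&lt;p&gt;The estimates above can be reproduced in a few lines of Python (the exact write quotient is ≈231/sec before rounding):&lt;/p&gt;

```python
# Back-of-the-envelope numbers from the stated assumptions.
dau = 10_000_000
uploads_per_user = 2
avg_image_mb = 5
metadata_bytes = 100

uploads_per_day = dau * uploads_per_user                          # images/day
upload_tb_per_day = uploads_per_day * avg_image_mb / 1_000_000    # MB -> TB
write_qps = uploads_per_day / 86_400                              # seconds/day
read_qps = write_qps * 5                                          # read-heavy assumption
metadata_gb_per_day = uploads_per_day * metadata_bytes / 1_000_000_000
```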


&lt;h2&gt;
  
  
  Core Entities
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;User&lt;/strong&gt;: Represents a system user who uploads, owns, and accesses images.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image&lt;/strong&gt;: Represents a logical image uploaded by a user; stores ownership and state, not the raw image bytes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ImageObject (ImageBlob)&lt;/strong&gt;: Represents the actual binary image file stored in object storage; can be shared across multiple images due to deduplication.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ImageVariant&lt;/strong&gt;: Represents derived versions of an image such as thumbnails or resized formats.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;UploadSession&lt;/strong&gt;: Represents an in-progress multipart upload and enables safe retries and resumable uploads.&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Database Design
&lt;/h2&gt;
&lt;h4&gt;
  
  
  Users Table
&lt;/h4&gt;

&lt;p&gt;Represents system users.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User
----
user_id (PK)
email
created_at
status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Used for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ownership&lt;/li&gt;
&lt;li&gt;Sharing&lt;/li&gt;
&lt;li&gt;Access control&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Image (Asset) Table
&lt;/h4&gt;

&lt;p&gt;Represents a user-visible image.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Image
-----
image_id (PK)
owner_id (FK → User)
content_hash
name
size
visibility
status (active / deleted)
created_at
updated_at
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key Points&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One row per user image.&lt;/li&gt;
&lt;li&gt;Multiple images can reference the same content hash.&lt;/li&gt;
&lt;li&gt;Soft delete is handled via status.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  ImageContent (Blob) Table
&lt;/h4&gt;

&lt;p&gt;Represents the actual stored image content.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ImageContent
------------
content_hash (PK)
storage_path
size
ref_count
created_at
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key Points&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One row per unique image content.&lt;/li&gt;
&lt;li&gt;ref_count tracks how many images reference this blob.&lt;/li&gt;
&lt;li&gt;Enables safe deduplication and deletion.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  ImageVariant Table
&lt;/h4&gt;

&lt;p&gt;Represents thumbnails or resized versions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ImageVariant
------------
variant_id (PK)
content_hash (FK → ImageContent)
variant_type (thumbnail_small, large, etc.)
storage_path
created_at
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key Points&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Variants are tied to content, not individual users.&lt;/li&gt;
&lt;li&gt;Generated asynchronously.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  UploadSession Table
&lt;/h4&gt;

&lt;p&gt;Tracks multipart uploads.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;UploadSession
-------------
upload_session_id (PK)
owner_id
content_fingerprint
status (uploading / completed)
created_at
expires_at
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Optional (if chunk-level tracking is needed)&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;UploadChunk
-----------
upload_session_id (FK)
chunk_number
status (uploaded / pending)
etag
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key Points&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enables resumable uploads.&lt;/li&gt;
&lt;li&gt;Prevents restarting large uploads.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Indexing Strategy
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;| Access Pattern    | Index                  |
| ----------------- | ---------------------- |
| Fetch user images | (owner_id, created_at) |
| Dedup lookup      | content_hash           |
| Cleanup jobs      | status + ref_count     |
| Sync              | updated_at             |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Indexes are chosen based on actual query patterns, not theoretical normalization.&lt;/p&gt;

&lt;h4&gt;
  
  
  Consistency Model
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Strong consistency for metadata updates (uploads, deletes).&lt;/li&gt;
&lt;li&gt;Eventual consistency for: sync across devices, variant availability, and background cleanup.
&lt;em&gt;This balances correctness with scalability.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Transactions &amp;amp; Conditional Writes
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Deduplication uses conditional inserts on content_hash.&lt;/li&gt;
&lt;li&gt;Reference counts are updated atomically.&lt;/li&gt;
&lt;li&gt;Prevents race conditions when multiple users upload the same image.&lt;/li&gt;
&lt;/ul&gt;
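&lt;p&gt;The race the last point describes can be made concrete with a sketch. A lock stands in for the database's conditional write; in a real system this is a single conditional INSERT/UPDATE transaction:&lt;/p&gt;

```python
import threading

# Race-free dedup sketch: "insert if absent, else bump ref_count"
# executed atomically, standing in for a DB conditional write.
lock = threading.Lock()
content_table = {}  # content_hash -> ref_count

def upsert_content(content_hash):
    with lock:
        if content_hash in content_table:
            content_table[content_hash] += 1
            return "deduplicated"
        content_table[content_hash] = 1
        return "stored"

# Ten users upload the same image concurrently.
threads = [threading.Thread(target=upsert_content, args=("H1",)) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

&lt;p&gt;Without the atomic check-and-write, two concurrent uploads of the same content could both take the "new blob" branch and store duplicate bytes.&lt;/p&gt;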

&lt;h4&gt;
  
  
  Failure Handling at DB Level
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;If metadata write fails → upload not finalized.&lt;/li&gt;
&lt;li&gt;Orphaned blobs are cleaned by background jobs.&lt;/li&gt;
&lt;li&gt;DB failures degrade performance, not correctness.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  API / Endpoints
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Start Upload → POST: /uploads
&lt;/h4&gt;

&lt;p&gt;Initializes a new upload session and returns the chunk size and session ID.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Request&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "file_name": "photo.jpg",
  "file_size": 50000000,
  "mime_type": "image/jpeg"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Response&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "upload_session_id": "us_123",
  "chunk_size": 5000000
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Upload Chunk → PUT: /uploads/{upload_session_id}/chunks/{chunk_number}
&lt;/h4&gt;

&lt;p&gt;Uploads a single chunk of the file and supports safe retries.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Request&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Raw binary chunk data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Response&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "chunk_number": 3,
  "status": "uploaded"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Chunk number = position of this piece in the file (0,1,2,…)&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Complete Upload → POST: /uploads/{upload_session_id}/complete
&lt;/h4&gt;

&lt;p&gt;Finalizes the upload, assembles chunks, checks deduplication, and creates the image.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Response&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "image_id": "img_456",
  "status": "completed"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Get Image → GET: /images/{image_id}
&lt;/h4&gt;

&lt;p&gt;Returns a time-bound signed URL to securely download the image.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Response&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "download_url": "https://signed-url",
  "expires_in": 300
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Get Image Metadata → GET: /images/{image_id}/metadata
&lt;/h4&gt;

&lt;p&gt;Fetches lightweight metadata without downloading the image.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Response&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "image_id": "img_456",
  "owner_id": "user_1",
  "size": 50000000,
  "status": "active",
  "created_at": "2026-02-05T10:00:00Z"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Update Image Metadata → PATCH: /images/{image_id}
&lt;/h4&gt;

&lt;p&gt;Updates image metadata such as name or visibility.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Request&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "name": "vacation_photo.jpg",
  "visibility": "private"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Response&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "status": "updated"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Image Variants (Thumbnails) → GET: /images/{image_id}/variants/{variant_type}
&lt;/h4&gt;

&lt;p&gt;Returns a signed URL for a specific image variant (e.g., thumbnail).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Response&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "download_url": "https://signed-url",
  "variant": "thumbnail_small"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Soft Delete → DELETE: /images/{image_id}
&lt;/h4&gt;

&lt;p&gt;Soft-deletes the image by marking it as deleted in metadata.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Response&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "status": "deleted"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Hard Delete (Internal) → POST: /internal/images/{image_id}/cleanup
&lt;/h4&gt;

&lt;p&gt;Permanently removes the image from storage after safety checks.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Response&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "status": "permanently_deleted"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Sync API (Multi-Device) → GET: /sync?since=timestamp
&lt;/h4&gt;

&lt;p&gt;Returns images added, updated, or deleted since the last sync.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Response&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "added": ["img_789"],
  "updated": ["img_456"],
  "deleted": ["img_123"]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
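&lt;p&gt;Server-side, the delta computation behind this response can be sketched as follows (timestamps are simplified to integers, and the field names are illustrative):&lt;/p&gt;

```python
# Delta-sync sketch: given a last-sync timestamp, return only what changed,
# mirroring the /sync?since= response shape above.
images = [
    {"id": "img_123", "updated_at": 100, "created_at": 10,  "status": "deleted"},
    {"id": "img_456", "updated_at": 120, "created_at": 20,  "status": "active"},
    {"id": "img_789", "updated_at": 150, "created_at": 150, "status": "active"},
]

def sync(since):
    changed = [img for img in images if img["updated_at"] > since]
    return {
        "added":   [i["id"] for i in changed if i["created_at"] > since and i["status"] == "active"],
        "updated": [i["id"] for i in changed if i["created_at"] <= since and i["status"] == "active"],
        "deleted": [i["id"] for i in changed if i["status"] == "deleted"],
    }

delta = sync(since=90)
```

&lt;p&gt;Only deltas cross the wire; unchanged images are never re-sent, which keeps sync cheap even for large libraries.&lt;/p&gt;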






&lt;h2&gt;
  
  
  System Components
&lt;/h2&gt;

&lt;h4&gt;
  
  
  1. Client (Web / Mobile)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Provides UI for users to upload, download, view, and delete images.&lt;/li&gt;
&lt;li&gt;Splits large images into fixed-size chunks and uploads them independently.&lt;/li&gt;
&lt;li&gt;Retries only failed chunks during network failures.&lt;/li&gt;
&lt;li&gt;Maintains local image state and syncs changes with the server.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  2. Load Balancer &amp;amp; API Gateway
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Acts as the single entry point for all client requests.&lt;/li&gt;
&lt;li&gt;Authenticates users and enforces authorization rules.&lt;/li&gt;
&lt;li&gt;Applies rate limiting and routes requests to backend services.&lt;/li&gt;
&lt;li&gt;Shields backend services from direct internet exposure.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  3. Image Service (Application Layer)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Stateless service that orchestrates all workflows.&lt;/li&gt;
&lt;li&gt;Creates and manages upload sessions.&lt;/li&gt;
&lt;li&gt;Generates signed URLs for secure upload and download.&lt;/li&gt;
&lt;li&gt;Validates permissions and updates image metadata.&lt;/li&gt;
&lt;li&gt;Coordinates deduplication, deletion, and sync logic.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Never handles raw image bytes directly.&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  4. Metadata Database
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Persists all image-related metadata and relationships.&lt;/li&gt;
&lt;li&gt;Stores ownership, content hash, object location, reference counts, and lifecycle state.&lt;/li&gt;
&lt;li&gt;Serves as the source of truth for: deduplication, access control, synchronization, and deletion safety.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  5. Object Storage
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Stores the actual image binaries and transformed variants.&lt;/li&gt;
&lt;li&gt;Images are addressed using their content hash.&lt;/li&gt;
&lt;li&gt;Guarantees high durability and virtually unlimited scale.&lt;/li&gt;
&lt;li&gt;Supports large objects (up to 50 GB).&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  6. Image Processing Service (Async Workers)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Consumes upload-completion events.&lt;/li&gt;
&lt;li&gt;Generates thumbnails and other image variants asynchronously.&lt;/li&gt;
&lt;li&gt;Writes transformed images back to object storage.&lt;/li&gt;
&lt;li&gt;Updates metadata once processing completes.&lt;/li&gt;
&lt;li&gt;Scales independently from user traffic.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  7. CDN (Content Delivery Network)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Caches images and thumbnails close to end users.&lt;/li&gt;
&lt;li&gt;Serves read-heavy traffic efficiently.&lt;/li&gt;
&lt;li&gt;Uses signed URLs to ensure only authorized access.&lt;/li&gt;
&lt;li&gt;Reduces load on object storage and backend services.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  8. Sync / Notification Layer
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Observes metadata changes in the system.&lt;/li&gt;
&lt;li&gt;Notifies connected devices of updates using push (WebSockets/SSE) for active clients and polling for inactive ones.&lt;/li&gt;
&lt;li&gt;Enables eventual consistency across all devices.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  High-Level Flows
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Flow 1: Image Upload
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Client requests an upload session from the Image Service.&lt;/li&gt;
&lt;li&gt;Image Service returns chunk size and signed upload URLs.&lt;/li&gt;
&lt;li&gt;Client uploads image chunks directly to object storage.&lt;/li&gt;
&lt;li&gt;On completion, the Image Service computes the SHA-256 hash, checks for duplicates, and creates or updates metadata.&lt;/li&gt;
&lt;li&gt;Image becomes available across devices.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Flow 2: Retry / Resume Upload
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;If a chunk upload fails, the client retries only that chunk.&lt;/li&gt;
&lt;li&gt;Upload session tracks completed chunks.&lt;/li&gt;
&lt;li&gt;Duplicate chunk uploads are ignored.&lt;/li&gt;
&lt;li&gt;Ensures idempotent and reliable uploads.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Flow 3: Image Download
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Client requests access to an image.&lt;/li&gt;
&lt;li&gt;Image Service verifies ownership or shared access.&lt;/li&gt;
&lt;li&gt;A time-bound signed URL is generated.&lt;/li&gt;
&lt;li&gt;Client downloads the image from CDN or object storage.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Flow 4: Deduplication
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;SHA-256 hash uniquely identifies image content.&lt;/li&gt;
&lt;li&gt;If a matching hash exists: no new blob is stored, and the reference count is incremented.&lt;/li&gt;
&lt;li&gt;If not: the image is stored as a new object.&lt;/li&gt;
&lt;li&gt;Each user receives an independent asset reference.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Flow 5: Image Transformation
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Upload completion emits an asynchronous event.&lt;/li&gt;
&lt;li&gt;Image processing workers generate thumbnails and variants.&lt;/li&gt;
&lt;li&gt;Variants are stored as separate objects.&lt;/li&gt;
&lt;li&gt;Metadata is updated to reference new variants.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Flow 6: Multi-Device Synchronization
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Metadata updates record change timestamps or versions.&lt;/li&gt;
&lt;li&gt;Other devices fetch changes via sync APIs or receive push notifications.&lt;/li&gt;
&lt;li&gt;Devices apply updates locally.&lt;/li&gt;
&lt;li&gt;System converges using eventual consistency.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Flow 7: Image Deletion (Two-Phase)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;User deletes image → metadata is marked as deleted.&lt;/li&gt;
&lt;li&gt;Image is immediately hidden from all devices.&lt;/li&gt;
&lt;li&gt;Background job checks reference count.&lt;/li&gt;
&lt;li&gt;Image blob is permanently removed only when no references remain.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Deep Dives - Functional Requirements
&lt;/h2&gt;

&lt;h4&gt;
  
  
  1. Support Image Upload and Download Across Devices
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Clients (web, mobile, desktop) upload images using direct-to-object-storage uploads via signed URLs.&lt;/li&gt;
&lt;li&gt;Large files are split into chunks and uploaded independently.&lt;/li&gt;
&lt;li&gt;Downloads use time-bound signed URLs and are served via CDN.&lt;/li&gt;
&lt;li&gt;This allows seamless access from any device with low latency and high throughput.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  2. Identify and Manage Exact Duplicate Images
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;The system computes a SHA-256 hash of image content during upload.&lt;/li&gt;
&lt;li&gt;This hash uniquely identifies the image bytes.&lt;/li&gt;
&lt;li&gt;If the hash already exists, the image blob is not stored again.&lt;/li&gt;
&lt;li&gt;A new metadata reference (asset) is created pointing to the existing content.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  3. Ensure Safe Retry of Upload Operations
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Uploads use multipart (chunked) uploads.&lt;/li&gt;
&lt;li&gt;Each chunk is uploaded independently and tracked via an upload session.&lt;/li&gt;
&lt;li&gt;Failed chunks are retried without re-uploading completed chunks.&lt;/li&gt;
&lt;li&gt;Operations are idempotent, preventing duplicate writes.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  4. Support Image Transformations (e.g., Thumbnails)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;After upload completion, an event is emitted.&lt;/li&gt;
&lt;li&gt;Asynchronous workers generate thumbnails and other image variants.&lt;/li&gt;
&lt;li&gt;Transformed images are stored separately and linked via metadata.&lt;/li&gt;
&lt;li&gt;This keeps uploads fast and processing scalable.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  5. Provide Secure Image Access
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;All images are stored in private object storage.&lt;/li&gt;
&lt;li&gt;Access is granted using short-lived signed URLs after permission checks.&lt;/li&gt;
&lt;li&gt;URLs expire automatically, limiting unauthorized access.&lt;/li&gt;
&lt;li&gt;CDN integration ensures fast and secure delivery.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  6. Support Automatic Synchronization Across User Devices
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Metadata is the source of truth for image state.&lt;/li&gt;
&lt;li&gt;Clients sync changes using polling or push notifications (WebSocket/SSE).&lt;/li&gt;
&lt;li&gt;Only deltas (added, updated, deleted images) are synced.&lt;/li&gt;
&lt;li&gt;Ensures eventual consistency across all devices.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  7. Support Safe Image Deletion
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Deletion is handled using a two-phase delete.&lt;/li&gt;
&lt;li&gt;First, the image is soft-deleted in metadata and hidden immediately.&lt;/li&gt;
&lt;li&gt;A background job deletes the image blob only when no references remain.&lt;/li&gt;
&lt;li&gt;This prevents accidental data loss and works with deduplication.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Deep Dives - Non-Functional Requirements
&lt;/h2&gt;

&lt;h4&gt;
  
  
  1. High Availability &amp;amp; Fault Tolerance
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;All backend services are stateless and deployed across multiple availability zones.&lt;/li&gt;
&lt;li&gt;Metadata and storage systems are replicated.&lt;/li&gt;
&lt;li&gt;Idempotent APIs ensure retries don’t corrupt state.&lt;/li&gt;
&lt;li&gt;Availability: 99.9%+ (system remains usable despite node/AZ failures)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  2. Low Latency &amp;amp; High Throughput
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Uploads and downloads go directly to object storage using signed URLs.&lt;/li&gt;
&lt;li&gt;CDN serves read traffic close to users.&lt;/li&gt;
&lt;li&gt;Duplicate uploads are short-circuited before storing data.&lt;/li&gt;
&lt;li&gt;Heavy work (thumbnails, scans) runs asynchronously.&lt;/li&gt;
&lt;li&gt;Duplicate upload latency: &amp;lt; 50 ms (no file transfer)&lt;/li&gt;
&lt;li&gt;Image read latency (CDN): ~5–20 ms&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  3. High Scalability with Growing Traffic
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Stateless services scale horizontally.&lt;/li&gt;
&lt;li&gt;Metadata, storage, and processing scale independently.&lt;/li&gt;
&lt;li&gt;Sharding by user/content hash avoids hotspots.&lt;/li&gt;
&lt;li&gt;Scaling model: Linear (add instances → increase capacity)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  4. Durable &amp;amp; Reliable File Storage
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Images are stored in object storage with built-in replication.&lt;/li&gt;
&lt;li&gt;Content-addressed (hash-based) storage ensures immutability.&lt;/li&gt;
&lt;li&gt;Metadata is persisted in a replicated database.&lt;/li&gt;
&lt;li&gt;Durability: Object storage-grade (11 nines)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  5. Secure Storage &amp;amp; Access Control
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;All data encrypted in transit and at rest.&lt;/li&gt;
&lt;li&gt;Storage buckets remain private.&lt;/li&gt;
&lt;li&gt;Access granted via short-lived signed URLs after permission checks.&lt;/li&gt;
&lt;li&gt;Signed URL validity: 5–10 minutes&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  6. Support Large File Uploads (Up to 50 GB)
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Files are uploaded using multipart (chunked) uploads.&lt;/li&gt;
&lt;li&gt;Clients retry only failed chunks.&lt;/li&gt;
&lt;li&gt;Upload state tracked via upload sessions.&lt;/li&gt;
&lt;li&gt;Max file size: 50 GB (network-bound, not server-bound)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  7. Cost Efficiency at Scale
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Exact deduplication stores identical images only once.&lt;/li&gt;
&lt;li&gt;CDN reduces repeated reads from storage.&lt;/li&gt;
&lt;li&gt;Lifecycle rules clean up unused data.&lt;/li&gt;
&lt;li&gt;Storage savings via dedup: Significant (workload-dependent)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Trade-Offs
&lt;/h2&gt;

&lt;h4&gt;
  
  
  1. Object Storage vs Database for Image Bytes
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Choice:&lt;/strong&gt; Store image bytes in object storage, not in a database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Handles very large files efficiently&lt;/li&gt;
&lt;li&gt;High durability and low cost&lt;/li&gt;
&lt;li&gt;Scales independently from metadata&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No complex querying on image data&lt;/li&gt;
&lt;li&gt;Requires separate metadata store&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why This Works&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Databases are optimized for small, structured data. Object storage is purpose-built for large blobs and is the industry standard for this use case.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Content-Based Deduplication (SHA-256)
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Choice:&lt;/strong&gt; Deduplicate images using cryptographic hashes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Massive storage savings&lt;/li&gt;
&lt;li&gt;Simple, deterministic duplicate detection&lt;/li&gt;
&lt;li&gt;Enables safe reference counting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hash computation adds CPU overhead&lt;/li&gt;
&lt;li&gt;Only detects exact duplicates (not visually similar images)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why This Works&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Exact deduplication is reliable, fast, and sufficient for most storage optimization needs. Near-duplicate detection can be added later asynchronously.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Multipart Uploads vs Single Upload
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Choice:&lt;/strong&gt; Use multipart (chunked) uploads.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Supports very large files (up to 50 GB)&lt;/li&gt;
&lt;li&gt;Allows resumable uploads&lt;/li&gt;
&lt;li&gt;Improves user experience and reliability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More complex client logic&lt;/li&gt;
&lt;li&gt;Requires tracking upload state&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why This Works&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Single uploads do not scale for large files and fail badly under unreliable networks. Chunking is the industry-standard solution.&lt;/p&gt;

&lt;h4&gt;
  
  
  4. Direct-to-Object Storage Uploads
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Choice:&lt;/strong&gt; Clients upload/download directly from object storage using signed URLs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Very high throughput&lt;/li&gt;
&lt;li&gt;Backend stays lightweight and scalable&lt;/li&gt;
&lt;li&gt;Lower infrastructure cost&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Less visibility into byte-level progress on backend&lt;/li&gt;
&lt;li&gt;Requires careful security handling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why This Works&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Keeping application servers out of the data path is critical for performance and cost at scale.&lt;/p&gt;

&lt;h4&gt;
  
  
  5. Asynchronous Image Processing
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Choice:&lt;/strong&gt; Generate thumbnails and variants asynchronously.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Faster upload completion&lt;/li&gt;
&lt;li&gt;Better system throughput&lt;/li&gt;
&lt;li&gt;Easy horizontal scaling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Variants are not immediately available&lt;/li&gt;
&lt;li&gt;Requires eventual consistency handling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why This Works&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Users care more about upload completion than immediate thumbnails. Async processing optimizes both latency and scale.&lt;/p&gt;

&lt;h4&gt;
  
  
  6. Two-Phase Deletion
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Choice:&lt;/strong&gt; Soft delete first, hard delete later.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prevents accidental data loss&lt;/li&gt;
&lt;li&gt;Works safely with deduplication&lt;/li&gt;
&lt;li&gt;Enables recovery and auditing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requires background cleanup jobs&lt;/li&gt;
&lt;li&gt;Storage freed with a delay&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why This Works&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Immediate deletion is dangerous in distributed systems. Two-phase deletion is safer and widely used.&lt;/p&gt;
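&lt;p&gt;A minimal in-memory sketch of the two phases; the table shapes, field names, and 7-day grace period are assumptions for illustration:&lt;/p&gt;

```python
GRACE_SECONDS = 7 * 24 * 3600  # assumed recovery window before hard delete

images = {}      # image_id -&gt; {"hash": ..., "deleted_at": None or timestamp}
blobs = {}       # hash -&gt; {"ref_count": int}
storage = set()  # object keys currently present in object storage

def soft_delete(image_id, now):
    """Phase 1: mark only; bytes and metadata stay recoverable."""
    images[image_id]["deleted_at"] = now

def hard_delete_pass(now):
    """Phase 2: a background job reclaims rows past the grace period and
    frees a blob only when no live image references it."""
    for image_id, row in list(images.items()):
        ts = row["deleted_at"]
        if ts is not None and now - ts >= GRACE_SECONDS:
            blob = blobs[row["hash"]]
            blob["ref_count"] -= 1
            if blob["ref_count"] == 0:
                storage.discard(row["hash"])  # safe: no remaining owners
            del images[image_id]
```

Note how deduplication and deletion interact: a shared blob survives until the last owner's grace period expires.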

&lt;h4&gt;
  
  
  7. Eventual Consistency for Sync
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Choice:&lt;/strong&gt; Use eventual consistency for multi-device synchronization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High availability and scalability&lt;/li&gt;
&lt;li&gt;Reduced coordination overhead&lt;/li&gt;
&lt;li&gt;Better performance under load&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Temporary inconsistencies across devices&lt;/li&gt;
&lt;li&gt;Requires conflict resolution logic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why This Works&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Strong consistency is unnecessary for file sync and would significantly reduce system availability and throughput.&lt;/p&gt;

&lt;h4&gt;
  
  
  8. Signed URLs as Bearer Tokens
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Choice:&lt;/strong&gt; Use short-lived signed URLs for access control.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pros&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Simple and scalable access control&lt;/li&gt;
&lt;li&gt;Works seamlessly with CDN&lt;/li&gt;
&lt;li&gt;No backend involvement during download&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Cons&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;URLs can be shared while valid&lt;/li&gt;
&lt;li&gt;Requires short expiration windows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why This Works&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Short-lived URLs significantly reduce risk while enabling high-performance delivery. Additional restrictions can be layered if needed.&lt;/p&gt;
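&lt;p&gt;For intuition, here is a minimal HMAC-based sketch of signing and verifying such URLs. The domain, secret, and query format are invented for the example, not any specific provider's scheme:&lt;/p&gt;

```python
import hashlib
import hmac
from urllib.parse import urlencode

SECRET = b"server-side-signing-key"  # placeholder; never shipped to clients

def sign_url(object_key: str, operation: str, ttl_seconds: int, now: int) -> str:
    """Issue a short-lived URL scoped to one object and one operation."""
    expires = now + ttl_seconds
    payload = f"{object_key}|{operation}|{expires}".encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    query = urlencode({"op": operation, "expires": expires, "sig": sig})
    return f"https://storage.example.com/{object_key}?{query}"

def verify(object_key: str, operation: str, expires: int, sig: str, now: int) -> bool:
    """Reject expired URLs and any tampering with key, operation, or expiry."""
    if now >= expires:
        return False
    payload = f"{object_key}|{operation}|{expires}".encode()
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, sig)
```

Because the expiry is inside the signed payload, a client cannot extend a URL's lifetime by editing the query string.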




&lt;h2&gt;
  
  
  Frequently Asked Questions in Interviews
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Q. Why do production systems strictly separate binary storage from metadata storage?
&lt;/h4&gt;

&lt;p&gt;Relational and NoSQL databases are optimized for small, mutable records with indexing and transactions. Storing large binaries in them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pollutes the buffer cache&lt;/li&gt;
&lt;li&gt;Increases replication lag&lt;/li&gt;
&lt;li&gt;Makes backups and restores slow&lt;/li&gt;
&lt;li&gt;Raises cost per GB significantly&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Object storage, by contrast, is optimized for immutable large objects, providing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multi-AZ replication by default&lt;/li&gt;
&lt;li&gt;High write throughput&lt;/li&gt;
&lt;li&gt;Lifecycle policies (cold storage, deletion)&lt;/li&gt;
&lt;li&gt;No need for manual sharding&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The metadata DB stores only pointers (object_key, hash, size), never raw bytes.&lt;/p&gt;

&lt;h4&gt;
  
  
  Q. What does a real metadata schema look like?
&lt;/h4&gt;

&lt;p&gt;A minimal but scalable model:&lt;/p&gt;

&lt;p&gt;Blob Table (Content-level)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;hash (PK)
object_key
size
ref_count
created_at
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Image Table (Ownership-level)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;image_id (PK)
user_id (indexed)
hash (FK)
visibility / ACL
created_at
deleted_at
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This allows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exact deduplication&lt;/li&gt;
&lt;li&gt;Independent ownership&lt;/li&gt;
&lt;li&gt;Safe deletion via reference counting&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Q. Why are uploads designed as direct-to-object-storage in real systems?
&lt;/h4&gt;

&lt;p&gt;Because backend servers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Are expensive per byte&lt;/li&gt;
&lt;li&gt;Are limited by NIC bandwidth&lt;/li&gt;
&lt;li&gt;Add failure points&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In production, backend servers act as a control plane:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Issue upload credentials&lt;/li&gt;
&lt;li&gt;Validate metadata&lt;/li&gt;
&lt;li&gt;Finalize uploads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All file bytes flow directly from client → object storage.&lt;/p&gt;

&lt;h4&gt;
  
  
  Q. How are signed uploads implemented technically?
&lt;/h4&gt;

&lt;p&gt;Backend:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Initiates multipart upload with object storage&lt;/li&gt;
&lt;li&gt;Generates signed URLs for each part&lt;/li&gt;
&lt;li&gt;Returns upload session metadata to client&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Client:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Uploads parts directly using signed URLs&lt;/li&gt;
&lt;li&gt;Retries failed parts independently&lt;/li&gt;
&lt;li&gt;Calls the “complete upload” API after all parts succeed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The backend never touches file bytes.&lt;/p&gt;

&lt;h4&gt;
  
  
  Q. How is the entire upload workflow made idempotent?
&lt;/h4&gt;

&lt;p&gt;Idempotency is enforced at three layers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The upload session ID uniquely identifies an upload attempt&lt;/li&gt;
&lt;li&gt;Chunk uploads are keyed by (session_id, part_number)&lt;/li&gt;
&lt;li&gt;The completion step uses a conditional update:&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;UPDATE uploads
SET status = COMPLETED
WHERE session_id = X AND status != COMPLETED
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Retries are safe at every step.&lt;/p&gt;
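&lt;p&gt;A runnable sketch of the conditional update, using SQLite to stand in for the metadata DB:&lt;/p&gt;

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE uploads (session_id TEXT PRIMARY KEY, status TEXT)")

def complete_upload(session_id: str) -> bool:
    """Return True only for the attempt that actually flips the state.
    Re-running the call is harmless: the WHERE clause makes it a no-op."""
    cur = db.execute(
        "UPDATE uploads SET status = 'COMPLETED' "
        "WHERE session_id = ? AND status != 'COMPLETED'",
        (session_id,),
    )
    return cur.rowcount == 1
```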

&lt;h4&gt;
  
  
  Q. What happens if object storage succeeds but metadata commit fails?
&lt;/h4&gt;

&lt;p&gt;The upload remains in a COMPLETED_IN_STORAGE but PENDING_METADATA state.&lt;/p&gt;

&lt;p&gt;A background reconciler:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scans incomplete uploads&lt;/li&gt;
&lt;li&gt;Verifies object existence&lt;/li&gt;
&lt;li&gt;Retries metadata commit&lt;/li&gt;
&lt;li&gt;Expires uploads past TTL&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No user-visible corruption occurs.&lt;/p&gt;

&lt;h4&gt;
  
  
  Q. Why is content-addressed storage used instead of IDs?
&lt;/h4&gt;

&lt;p&gt;IDs identify ownership, not content.&lt;/p&gt;

&lt;p&gt;Content hashes provide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deterministic identity&lt;/li&gt;
&lt;li&gt;Deduplication&lt;/li&gt;
&lt;li&gt;Integrity verification&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Using IDs alone makes deduplication race-prone and expensive.&lt;/p&gt;

&lt;h4&gt;
  
  
  Q. When and how is the hash computed?
&lt;/h4&gt;

&lt;p&gt;Client computes hash while chunking the file (streaming).&lt;br&gt;
This avoids loading the full file into memory.&lt;/p&gt;

&lt;p&gt;Optionally:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Backend verifies hash asynchronously for trust&lt;/li&gt;
&lt;li&gt;Upload path is never blocked on verification&lt;/li&gt;
&lt;/ul&gt;
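&lt;p&gt;The streaming hash can be sketched in a few lines of Python; the 4 MB window size is an arbitrary choice for the example:&lt;/p&gt;

```python
import hashlib

CHUNK = 4 * 1024 * 1024  # hash in 4 MB windows (assumed chunk size)

def streaming_sha256(read_chunk) -> str:
    """Fold chunks into the digest as they are produced, so memory use
    stays constant regardless of file size. read_chunk(n) should return
    up to n bytes, or b'' at end of stream (e.g. a file object's .read)."""
    h = hashlib.sha256()
    while True:
        chunk = read_chunk(CHUNK)
        if not chunk:
            break
        h.update(chunk)
    return h.hexdigest()
```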
&lt;h4&gt;
  
  
  Q. How do you safely deduplicate under concurrent uploads?
&lt;/h4&gt;

&lt;p&gt;Blob creation uses conditional insert:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;INSERT INTO blobs (hash, ...)
IF NOT EXISTS
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Outcomes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One writer wins&lt;/li&gt;
&lt;li&gt;Others reuse the existing blob&lt;/li&gt;
&lt;li&gt;The reference count increment is atomic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No locks, no race conditions.&lt;/p&gt;

&lt;h4&gt;
  
  
  Q. How do you avoid hot-hash contention?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Shard the blob table by hash prefix&lt;/li&gt;
&lt;li&gt;Cache hash existence in Redis&lt;/li&gt;
&lt;li&gt;Use Bloom filters to skip DB hits on negative lookups&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This keeps deduplication fast even for viral content.&lt;/p&gt;

&lt;h4&gt;
  
  
  Q. Why are multipart uploads mandatory?
&lt;/h4&gt;

&lt;p&gt;Single uploads fail due to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Client timeouts&lt;/li&gt;
&lt;li&gt;Gateway size limits&lt;/li&gt;
&lt;li&gt;Network instability&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Multipart uploads allow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Parallelism&lt;/li&gt;
&lt;li&gt;Resume from failure&lt;/li&gt;
&lt;li&gt;Independent retries per chunk&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Q. How is resume implemented without backend state?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Object storage tracks uploaded parts.&lt;/li&gt;
&lt;li&gt;Client queries uploaded part list and uploads only missing chunks.&lt;/li&gt;
&lt;li&gt;Backend state is optional — object storage is the source of truth.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Q. What happens if an application server crashes mid-request?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Nothing breaks. All servers are stateless.&lt;/li&gt;
&lt;li&gt;Requests retry against another instance.&lt;/li&gt;
&lt;li&gt;No in-memory state is required for recovery.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Q. How does the system survive AZ or region failures?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;App servers: multi-AZ autoscaling&lt;/li&gt;
&lt;li&gt;Metadata DB: replicas + failover&lt;/li&gt;
&lt;li&gt;Object storage: multi-AZ by default&lt;/li&gt;
&lt;li&gt;CDN serves cached content during partial outages&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Availability degrades gracefully, not catastrophically.&lt;/p&gt;

&lt;h4&gt;
  
  
  Q. Why is eventual consistency chosen?
&lt;/h4&gt;

&lt;p&gt;Strong consistency requires cross-region coordination, increasing latency and reducing availability.&lt;/p&gt;

&lt;p&gt;Eventual consistency:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Matches user expectations for file systems&lt;/li&gt;
&lt;li&gt;Improves availability&lt;/li&gt;
&lt;li&gt;Enables global scaling&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Correctness is preserved at the metadata layer.&lt;/p&gt;

&lt;h4&gt;
  
  
  Q. How do multiple devices stay in sync?
&lt;/h4&gt;

&lt;p&gt;Devices sync metadata deltas, not binaries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Polling or push notifications signal changes&lt;/li&gt;
&lt;li&gt;Only changed image IDs are fetched&lt;/li&gt;
&lt;li&gt;Actual images are downloaded lazily&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This minimizes bandwidth and latency.&lt;/p&gt;

&lt;h4&gt;
  
  
  Q. How is access control enforced technically?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Buckets are private&lt;/li&gt;
&lt;li&gt;The backend validates ACLs&lt;/li&gt;
&lt;li&gt;Signed URLs are scoped to object + operation + expiry&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Clients never receive long-lived credentials.&lt;/p&gt;

&lt;h4&gt;
  
  
  Q. What prevents signed URL abuse?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Short expiration (minutes)&lt;/li&gt;
&lt;li&gt;Single-object scope&lt;/li&gt;
&lt;li&gt;Optional IP or device binding&lt;/li&gt;
&lt;li&gt;Read-only vs write-only URLs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even leaked URLs have a minimal blast radius.&lt;/p&gt;

&lt;h4&gt;
  
  
  Q. What are the largest cost optimizations in practice?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Exact deduplication (storage)&lt;/li&gt;
&lt;li&gt;CDN caching (egress)&lt;/li&gt;
&lt;li&gt;Avoiding backend data transfer&lt;/li&gt;
&lt;li&gt;Lifecycle rules for cold data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These dwarf micro-optimizations.&lt;/p&gt;

&lt;h4&gt;
  
  
  Q. Why not aggressively compress images?
&lt;/h4&gt;

&lt;p&gt;JPEG/PNG/WebP are already compressed.&lt;br&gt;
Extra compression:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increases CPU cost&lt;/li&gt;
&lt;li&gt;Adds latency&lt;/li&gt;
&lt;li&gt;Saves negligible space&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Compression is applied selectively, not globally.&lt;/p&gt;

&lt;h4&gt;
  
  
  Q. What bottleneck appears first at scale?
&lt;/h4&gt;

&lt;p&gt;Metadata write throughput.&lt;br&gt;
Solved via:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sharding&lt;/li&gt;
&lt;li&gt;Batching&lt;/li&gt;
&lt;li&gt;Async writes&lt;/li&gt;
&lt;li&gt;Cache-first lookups&lt;/li&gt;
&lt;/ul&gt;
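&lt;p&gt;Sharding by hash prefix is simple to sketch, since a cryptographic hash is already uniformly distributed; the shard count below is illustrative:&lt;/p&gt;

```python
NUM_SHARDS = 16  # illustrative shard count

def shard_for(content_hash: str) -> int:
    """Route a blob row by its leading hash byte. Because the hash is
    uniform, prefix routing spreads writes evenly with no lookup table."""
    return int(content_hash[:2], 16) % NUM_SHARDS
```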

&lt;h4&gt;
  
  
  Q. What changes at 10× or 100× scale?
&lt;/h4&gt;

&lt;p&gt;Architecture remains unchanged.&lt;br&gt;
We add:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More shards&lt;/li&gt;
&lt;li&gt;More async workers&lt;/li&gt;
&lt;li&gt;More regions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No redesign is required, only capacity expansion.&lt;/p&gt;




&lt;h3&gt;
  
  
  High-Level Summary
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;This system allows users to upload, store, and sync images across devices at scale. Images are stored using content-addressed object storage to enable exact deduplication, while metadata drives access control, synchronization, and lifecycle management. Large uploads are handled using multipart uploads, and all heavy processing is done asynchronously to keep latency low.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Feel free to ask questions or share your thoughts — happy to discuss!&lt;/p&gt;




</description>
      <category>systemdesign</category>
      <category>dropbox</category>
      <category>hld</category>
      <category>interview</category>
    </item>
  </channel>
</rss>
