DEV Community

Vikas Kumar

Design HLD - Recommendation System

About - Recommendation System

A recommendation system is a service that predicts and ranks items a user is most likely to engage with, based on their behavior, preferences, and context. It helps users discover relevant content at scale while optimizing business goals like engagement, retention, or revenue.


Requirements

Functional Requirements

  • Support personalized recommendations for users.
  • Support homepage, “Up Next”, and contextual recommendations.
  • Support hybrid recommendation strategies (collaborative + content-based).
  • Support real-time personalization using recent user interactions.
  • Support large-scale candidate generation from billions of items.
  • Support multi-stage ranking (candidate generation, scoring, re-ranking).
  • Support cold-start handling for new users and new items.
  • Support business-rule and policy-based re-ranking.
  • Support tracking of user interactions and feedback signals.

Non-Functional Requirements

  • Highly available and fault tolerant.
  • Low-latency recommendation serving (sub-200 ms).
  • High throughput for large-scale user traffic.
  • Horizontally scalable with growing users and content.
  • Real-time freshness of recommendations.
  • Consistent user experience across devices and regions.
  • Cost-efficient operation at scale.
  • Secure access to user and content data.
  • Observability for model performance and system health.

Key Concepts You Must Know

What Is an Embedding?

An embedding is a way to convert users and items (videos, products, songs) into numbers that computers can compare.

A user embedding represents what a user likes. An item embedding represents what an item is about. If a user and an item have similar embeddings, the system assumes the user may like that item.

Think of embeddings like coordinates on a map. Users and items that are close on the map are considered a good match. This is the foundation of modern recommendation systems.
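The "coordinates on a map" idea can be shown with a few lines of code. This is a minimal sketch using made-up 3-dimensional vectors (real embeddings have hundreds of learned dimensions, and the "cooking/travel/gaming" axes here are purely illustrative):

```python
import math

def cosine_similarity(a, b):
    """1.0 = same direction (good match), 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms

# Toy axes: (cooking, travel, gaming)
user          = [0.9, 0.1, 0.0]   # mostly watches cooking
cooking_video = [1.0, 0.0, 0.0]
gaming_video  = [0.0, 0.0, 1.0]

print(cosine_similarity(user, cooking_video))  # close to 1 -> good match
print(cosine_similarity(user, gaming_video))   # 0 -> poor match
```

Items whose embeddings point in the same direction as the user's embedding score near 1 and become recommendation candidates.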

What Is Candidate Generation?

The system has billions of items, but it cannot look at all of them for every user.

So the first step is candidate generation:
Quickly select a small shortlist (usually a few thousand items). These items are possibly relevant to the user. Speed is more important than accuracy here.

Example:
From 10 billion videos → pick 10,000 “maybe interesting” videos
If a good item is not picked here, it will never be recommended later.

Candidate Generation vs Ranking

Candidate Generation: “What are some items this user might like?”, Fast, rough, recall-focused

Ranking: “Out of these candidates, which ones are the best?”, Slower, smarter, accuracy-focused

You cannot fix a bad candidate list with ranking later.

Multi-Stage Recommendation Architecture

Real systems do not use one big model.

They use multiple stages: Candidate Generation (thousands), Light Ranking (hundreds), Heavy Ranking (dozens), Re-Ranking (final list)

Each stage: Looks at fewer items, Uses more computation, Improves quality

This is how systems stay fast and scalable.
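The funnel shape is the key idea, so here is a toy sketch of it. The "models" below are stand-ins (random sampling and sorting); only the shrinking set sizes at each stage are the point:

```python
import random

random.seed(42)
catalog = list(range(1_000_000))  # stand-in for a billions-scale catalog

def candidate_generation(user_id, k=10_000):
    return random.sample(catalog, k)             # cheap, recall-focused retrieval

def light_ranking(candidates, k=500):
    return sorted(candidates)[:k]                # cheap scoring model stand-in

def heavy_ranking(candidates, k=50):
    return sorted(candidates, reverse=True)[:k]  # expensive model stand-in

def re_rank(candidates, k=10):
    return candidates[:k]                        # business rules, diversity, etc.

shortlist = candidate_generation("user_1")
final = re_rank(heavy_ranking(light_ranking(shortlist)))
print(len(shortlist), "->", len(final))
```

Each stage spends more compute per item on fewer items, which is what keeps total cost bounded.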

Collaborative vs Content-Based Recommendations

Collaborative Filtering: Based on user behavior, “Users like you watched this”, Does NOT need item details

Content-Based Filtering: Based on item properties, “This video is about cooking, and you watch cooking videos”, Works even for new items

Most real systems use both together (hybrid).
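One simple way systems combine the two is a weighted blend of scores. This is a sketch; the weight 0.7 is an illustrative choice, not a standard value, and real hybrids are often learned rather than hand-weighted:

```python
def hybrid_score(collab_score, content_score, alpha=0.7):
    """Weighted blend of collaborative and content-based signals."""
    return alpha * collab_score + (1 - alpha) * content_score

# Established item: both signals available.
print(hybrid_score(0.9, 0.5))
# Brand-new item: no collaborative signal yet, but the content signal
# still gives it a non-zero score, which helps with cold start.
print(hybrid_score(0.0, 0.8))
```

Notice the second call: hybrids degrade gracefully when one signal is missing, which pure collaborative filtering cannot do.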

Real-Time Signals vs Offline Signals

Offline signals: Long-term behavior, Historical data, Stable preferences

Real-time signals: What the user just watched, Recent clicks or searches, Current intent

Good systems combine both: Offline = who the user is & Real-time = what the user wants right now

Cold Start Problem

Cold start happens when: A new user joins (no history), A new item is added (no views)

How systems handle it: Use item content (title, genre, tags), Use popularity or trending items, Use basic user info (language, location), Show items to small groups and learn quickly.
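A cold-start fallback for a new user can be as simple as filtering trending items by basic profile attributes. This is a hypothetical sketch; the field names (`language`) and data shapes are assumptions for illustration:

```python
def cold_start_candidates(user_profile, trending, catalog_meta, k=5):
    """Prefer trending items in the user's language, then global trending."""
    lang = user_profile.get("language")
    matching = [i for i in trending if catalog_meta[i]["language"] == lang]
    rest = [i for i in trending if i not in matching]
    return (matching + rest)[:k]

catalog_meta = {
    "v1": {"language": "hi"}, "v2": {"language": "en"},
    "v3": {"language": "hi"}, "v4": {"language": "en"},
}
trending = ["v2", "v1", "v4", "v3"]  # global popularity order
print(cold_start_candidates({"language": "hi"}, trending, catalog_meta, k=3))
```

As real interactions arrive, these heuristic candidates are replaced by embedding-based personalization.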

Exploration vs Exploitation

Exploitation: Show items the system is confident the user will like

Exploration: Occasionally show something new or uncertain

Why exploration matters: Prevents boring, repetitive feeds, Helps discover new interests, Helps new items get exposure

Usually: Top positions → safe choices, Lower positions → more exploration
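The "safe top, exploratory tail" pattern can be sketched as a simple epsilon-greedy slate builder. This is a toy under assumed parameters (3 protected slots, epsilon = 0.2); production systems use bandit algorithms with far more care:

```python
import random

random.seed(7)

def build_slate(ranked, fresh_pool, slots=10, epsilon=0.2):
    """Keep top positions safe; fill lower slots, exploring with prob epsilon."""
    slate = list(ranked[:3])                    # safe, exploitative top
    rest = list(ranked[3:])
    fresh = list(fresh_pool)
    while len(slate) < slots and (rest or fresh):
        if fresh and (not rest or random.random() < epsilon):
            slate.append(fresh.pop(0))          # explore an unproven item
        else:
            slate.append(rest.pop(0))           # exploit next-best ranked item
    return slate

ranked = [f"ranked_{i}" for i in range(10)]
slate = build_slate(ranked, ["new_a", "new_b"])
print(slate)
```

The top of the slate is always the model's confident picks; exploration risk is pushed to positions where a miss costs little engagement.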

Feedback Loops & Popularity Bias

Recommendations change user behavior.

Problem: Popular items get shown more → more views → even more recommendations, while new or niche items get ignored. This is called popularity bias.

Systems fix this by: Adding diversity, Limiting overexposure, Forcing exploration

Re-Ranking & Business Rules

Even after ranking, the system may adjust results to: Increase diversity, Promote fresh content, Enforce policies (age, region, safety), Support creators or business goals

This happens in the final re-ranking step.
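One common diversity rule is capping how many items from the same category can appear near the top. This is a minimal greedy sketch (the cap of 2 is an arbitrary illustrative choice):

```python
def diversify(ranked, categories, max_per_category=2):
    """Keep score order but cap items per category; overflow is demoted, not dropped."""
    counts, kept, overflow = {}, [], []
    for item in ranked:
        cat = categories[item]
        if counts.get(cat, 0) < max_per_category:
            counts[cat] = counts.get(cat, 0) + 1
            kept.append(item)
        else:
            overflow.append(item)
    return kept + overflow

categories = {"a": "cooking", "b": "cooking", "c": "cooking", "d": "travel"}
print(diversify(["a", "b", "c", "d"], categories))  # -> ['a', 'b', 'd', 'c']
```

Because it runs after ranking, this rule can change (e.g. a new cap per category) without retraining any model.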

Implicit vs Explicit Feedback

Explicit feedback: Likes, ratings, Clear but rare

Implicit feedback: Watch time, clicks, skips, Noisy but abundant

Most systems rely mainly on implicit feedback, because users rarely rate content.

Approximate Nearest Neighbor (ANN) Search

To find similar embeddings: Exact comparison is too slow at scale, ANN finds “close enough” matches very fast

Trade-off: Slight accuracy loss. Huge speed gain

ANN is what makes large-scale recommendations possible.
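To make the trade-off concrete, here is a toy locality-sensitive-hashing sketch in pure Python: vectors are hashed by which side of a few random hyperplanes they fall on, and a query only scans its own bucket instead of the whole catalog. Production systems use dedicated libraries (FAISS, ScaNN, Milvus) rather than anything like this:

```python
import math
import random

random.seed(0)
DIM, PLANES = 8, 4
hyperplanes = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(PLANES)]

def bucket(vec):
    """Hash a vector by its side of each random hyperplane (a 4-bit key)."""
    return tuple(1 if sum(h * v for h, v in zip(plane, vec)) >= 0 else 0
                 for plane in hyperplanes)

# Index 1,000 random item embeddings into at most 2^4 = 16 buckets.
items = {f"item_{i}": [random.gauss(0, 1) for _ in range(DIM)] for i in range(1000)}
index = {}
for item_id, vec in items.items():
    index.setdefault(bucket(vec), []).append(item_id)

def ann_search(query, k=5):
    """Scan only the query's bucket (~1/16 of items), then rank by distance."""
    candidates = index.get(bucket(query), [])
    return sorted(candidates, key=lambda i: math.dist(items[i], query))[:k]

print(ann_search(items["item_0"]))
```

Neighbors that hash into a different bucket are missed (the accuracy loss), but each query touches a small fraction of the catalog (the speed gain).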

Feature Freshness & Drift

User interests change. Trends change. Models trained on old data become wrong.

Systems must: Update features frequently, Detect when data patterns change, Retrain or adjust models. Otherwise, recommendations silently degrade.

Observability for Recommendation Systems

The system can be “up” but still be bad.

So we monitor: Engagement (CTR, watch time), Model accuracy, Bias and diversity, Feature freshness

Without observability, problems are discovered too late.

Simple Mental Model

User → Embedding → Candidate Generation → Ranking → Re-ranking → Recommendation


Capacity Estimation

Key Assumptions

Daily Active Users (DAU): 100M
Sessions per user per day: 5
Recommendation surfaces per session: 2 (Homepage, Up Next)
Items recommended per request: 10
Total item catalog: ~10B
Target latency: < 200 ms

Traffic Estimation

Recommendation requests per user/day ⇒ 5 × 2 = 10
Total requests/day ⇒ 100M × 10 = 1B requests/day
Average QPS ⇒ ~12K requests/sec
Peak QPS (5–10×) ⇒ ~60K–120K requests/sec
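The traffic numbers above can be verified with a few lines of arithmetic (86,400 seconds per day):

```python
dau = 100_000_000
requests_per_user_per_day = 5 * 2            # 5 sessions × 2 surfaces
daily_requests = dau * requests_per_user_per_day
avg_qps = daily_requests / 86_400            # seconds per day
print(f"{daily_requests:,} req/day ≈ {avg_qps:,.0f} avg QPS, "
      f"peak ~{avg_qps * 5:,.0f}–{avg_qps * 10:,.0f} QPS")
```

1B requests/day works out to ~11.6K average QPS, which rounds to the ~12K figure above.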

Candidate Generation & Ranking

Candidates per request: ~10K
Heavy ranking input: ~500
Final output: Top 10
Only a tiny fraction of the catalog reaches expensive models, keeping latency and cost under control.

Storage Estimation

User embeddings: ⇒ 100M users × 1 KB ≈ 100 GB
Item embeddings: ⇒ 10B items × 1 KB ≈ ~10 TB
Stored in distributed vector storage with sharding and replication.

Interaction Data

User interactions/day: ~5B events
Event size: ~200 bytes
Daily ingestion volume: ~1 TB/day
Processed asynchronously via streaming pipelines.
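The storage and ingestion figures can be checked the same way (decimal units, matching the estimates above):

```python
KB, GB, TB = 1_000, 1_000**3, 1_000**4       # decimal units
user_embedding_bytes = 100_000_000 * KB      # 100M users × 1 KB each
item_embedding_bytes = 10_000_000_000 * KB   # 10B items × 1 KB each
daily_event_bytes = 5_000_000_000 * 200      # 5B events × 200 bytes
print(user_embedding_bytes // GB, "GB user embeddings,",
      item_embedding_bytes // TB, "TB item embeddings,",
      daily_event_bytes // TB, "TB/day of events")
```

User embeddings fit comfortably in a cache tier (100 GB), while the 10 TB item-embedding index must be sharded across the vector store.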

Key Takeaways

  • Recommendation systems are read-heavy and bursty
  • Candidate generation dominates compute cost
  • Caching and ANN search are mandatory
  • Heavy models must be used sparingly
  • Latency is driven by QPS, not storage

Core Entities

User: Represents a platform user for whom recommendations are generated.

Item (Video / Product / Content): Represents a recommendable entity such as a video, movie, or product.

User–Item Interaction: Represents an interaction between a user and an item (view, click, watch time, like, skip).

User Embedding: Represents a numerical vector capturing a user’s preferences and interests.

Item Embedding: Represents a numerical vector capturing an item’s characteristics and semantics.

Candidate Set: Represents a shortlist of potentially relevant items generated for ranking.

Recommendation Context: Represents the request-time context such as surface, device, time, and current item.

Recommendation Result: Represents the final ranked list of items shown to the user.

Feature Store: Represents a centralized store for precomputed user and item features used during inference.

Interaction Event: Represents a logged feedback event used for training, evaluation, and monitoring.


Database Design

Database Choice

A recommendation system uses multiple specialized data stores, not a single database, because access patterns are very different.

Distributed NoSQL Store (Cassandra / DynamoDB / Bigtable)
Used for high-throughput storage of user profiles, interaction events, and recommendation metadata. Chosen for horizontal scalability, fast writes, and predictable performance at scale.

Vector Store / ANN Index (FAISS / ScaNN / Milvus / OpenSearch Vector)
Used to store and query user and item embeddings for candidate generation.
Optimized for approximate nearest neighbor search, not relational queries.

Object Storage (S3 / GCS / HDFS)
Used for raw interaction logs, training data, and offline analytics.
Cheap, durable, and suitable for batch processing.

Cache (Redis / Memcached)
Used for hot data such as user embeddings, recent interactions, and precomputed recommendations. Critical for meeting sub-200 ms latency.

This separation ensures each workload is handled by the right storage system.

Schema

User Table
Represents platform users.

User
- user_id (PK)
- language
- region
- account_created_at
- status

Used for: Personalization, Cold-start handling, Feature lookup

Item Table
Represents recommendable content.

Item
- item_id (PK)
- type (video / product)
- category
- language
- creator_id
- published_at
- status

Used for: Content-based filtering, Policy and availability checks

User–Item Interaction Table
Represents user feedback signals.

UserItemInteraction
- user_id (PK)
- item_id (PK)
- interaction_type (view / click / like / watch)
- interaction_value (e.g. watch_time)
- timestamp

Used for: Model training, Real-time personalization, Feedback loops, Write-heavy and append-only.

User Embedding Table

UserEmbedding
- user_id (PK)
- embedding_vector
- updated_at

Used for: Candidate generation, ANN search

Item Embedding Table

ItemEmbedding
- item_id (PK)
- embedding_vector
- updated_at

Used for: Similarity search, Cold-start recommendations

Recommendation Result Cache

UserRecommendation
- user_id (PK)
- surface (homepage / up_next)
- item_list
- generated_at

Used for: Fast homepage loads, Cache-heavy read paths
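The cache-heavy read path keyed by (user_id, surface) can be sketched as a toy TTL cache. This is an in-process illustration only; a real deployment would use Redis or Memcached with native expiry:

```python
import time

class RecommendationCache:
    """Toy TTL cache keyed by (user_id, surface)."""
    def __init__(self, ttl_seconds=300):
        self.ttl = ttl_seconds
        self.store = {}

    def get(self, user_id, surface):
        entry = self.store.get((user_id, surface))
        if entry is None:
            return None                      # miss -> caller runs the pipeline
        items, generated_at = entry
        if time.time() - generated_at > self.ttl:
            return None                      # expired -> caller recomputes
        return items

    def put(self, user_id, surface, items):
        self.store[(user_id, surface)] = (items, time.time())

cache = RecommendationCache(ttl_seconds=300)
cache.put("u1", "homepage", ["item_1", "item_2"])
print(cache.get("u1", "homepage"))
```

A short TTL bounds staleness: expired entries force a trip through the full serving pipeline, whose result then refreshes the cache.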

Indexing Strategy

Interactions indexed by (user_id, timestamp) for recent behavior
Items indexed by (category, status)
Embeddings indexed in ANN structures, not traditional DB indexes
Time-based partitioning for interaction logs
Indexes are chosen based on actual query patterns, not normalization.

Transaction Model

The system avoids multi-table transactions in the serving path.
Each write (interaction, embedding update, log event) is independent
Reads are eventually consistent across stores
Recommendation requests are read-only operations
This keeps latency low and throughput high.

Failure Handling

Interaction events are written asynchronously via queues
If embedding updates fail, older embeddings are reused
Cache failures fall back to database reads
ANN service failures fall back to popular or cached recommendations
The system degrades gracefully, never blocks the user.

Consistency Model

Strong consistency: Used for user identity and item availability
Eventual consistency: Used for interactions, embeddings, analytics, and recommendations

Why this works:
Slightly stale recommendations are acceptable; high availability and low latency matter more than strict consistency.


API / Endpoints

Get Recommendations: Fetches personalized recommendations for a user and surface.

GET /recommendations

Request

{
  "user_id": "string",
  "surface": "homepage | up_next | related",
  "context": {
    "current_item_id": "string (optional)",
    "device": "mobile | web | tv",
    "region": "string"
  },
  "limit": 10
}

Response

{
  "user_id": "string",
  "surface": "homepage",
  "recommendations": [
    {
      "item_id": "string",
      "score": 0.92
    }
  ],
  "generated_at": "datetime"
}

Get Cached Recommendations: Returns precomputed recommendations if available.

GET /recommendations/cached

Request

{
  "user_id": "string",
  "surface": "homepage"
}

Response

{
  "recommendations": ["item_1", "item_2", "item_3"],
  "generated_at": "datetime"
}

Log Interaction Event: Records user feedback for training and personalization.

POST /interactions

Request

{
  "user_id": "string",
  "item_id": "string",
  "interaction_type": "view | click | watch | like | skip",
  "interaction_value": 120,
  "surface": "homepage | up_next",
  "timestamp": "datetime"
}

Response

{
  "status": "accepted"
}

Update User Profile: Updates user attributes used for personalization.

PUT /users/{user_id}

Request

{
  "language": "string",
  "region": "string",
  "preferences": {
    "categories": ["string"]
  }
}

Response

{
  "status": "updated"
}


Trigger Model Refresh (Internal / Admin): Triggers offline or near-real-time model updates.

POST /models/refresh

Response

{
  "status": "refresh_started"
}


Key API Design Notes

All recommendation APIs are read-optimized and low latency
Interaction logging APIs are asynchronous and non-blocking
Recommendation responses may be eventually consistent
Cached and real-time recommendations coexist
Admin APIs are restricted to internal services


System Components

1. Client (Web / Mobile / TV Apps)

Primary Responsibilities:
Requests recommendations for different surfaces (Homepage, Up Next, Search, Contextual).
Sends user interaction events such as views, clicks, watch time, skips, likes.
Passes lightweight context (device, locale, time, surface type).

Examples:
Web apps, Mobile apps, Smart TV apps

Why:
Keeps recommendation logic centralized and ensures consistent experience across devices.

2. API Gateway

Primary Responsibilities:
Acts as the secure ingress for recommendation APIs.
Handles authentication, authorization, and request validation.
Applies rate limits and traffic shaping.
Routes requests to the Recommendation Service.

Examples:
API Gateway, Envoy, NGINX

Why:
Provides centralized security and traffic control without coupling clients to backend services.

3. Recommendation Service (Serving Orchestrator)

Primary Responsibilities:
Accepts recommendation requests with user and context.
Orchestrates candidate generation, ranking, and re-ranking.
Applies timeout budgets and fallback strategies.
Aggregates final ranked results and returns them to clients.

Examples:
Stateless microservice (Java / Go / Node.js)

Why:
Acts as the real-time brain of the system while remaining horizontally scalable.

4. Candidate Generation Service

Primary Responsibilities:
Retrieves a large pool of potentially relevant items (thousands).
Uses lightweight models, embeddings, popularity, and heuristics.
Optimized for high recall and low latency.

Examples:
Embedding-based retrieval, popularity services

Why:
Reduces billions of items to a manageable candidate set for downstream ranking.

5. Vector Store / ANN Service

Primary Responsibilities:
Stores user and item embeddings.
Supports approximate nearest neighbor (ANN) search.
Provides fast similarity lookups at scale.

Examples:
Vector databases, ANN indices

Why:
Exact similarity search does not scale; ANN makes embedding-based retrieval feasible in real time.

6. Ranking Service

Primary Responsibilities:
Scores candidate items using ML models.
Combines user features, item features, and context.
Produces relevance scores for each candidate.

Examples:
Two-tower models, deep ranking models

Why:
Provides high-precision ordering once the candidate set is small enough.

7. Re-Ranking & Policy Engine

Primary Responsibilities:
Applies business rules and constraints: Diversity, Freshness, Fairness, Content safety, Sponsored content
Adjusts ordering without retraining models.

Why:
Ensures recommendations align with product, legal, and business goals.

8. Feature Store

Primary Responsibilities:
Stores precomputed user and item features.
Serves features consistently to both training and serving pipelines.
Supports low-latency online reads.

Examples:
Online + offline feature stores

Why:
Prevents feature skew and avoids expensive recomputation at request time.

9. Interaction Logging Service

Primary Responsibilities:
Collects user interaction events asynchronously.
Validates and enriches events.
Publishes events to the event stream.

Examples:
Event ingestion microservice

Why:
Decouples user actions from downstream analytics and training systems.

10. Event Stream / Message Queue

Primary Responsibilities:
Buffers user interaction events at scale.
Provides durability and backpressure handling.
Enables multiple consumers (real-time + batch).

Examples:
Distributed message queues

Why:
Absorbs traffic spikes and enables reliable data pipelines.

11. Stream Processing Service (Real-Time Layer)

Primary Responsibilities:
Processes interaction events in near real time.
Updates short-term user interests and trends.
Feeds real-time personalization features.

Examples:
Stream processors

Why:
Keeps recommendations fresh and responsive to recent user behavior.

12. Cache Layer

Primary Responsibilities:
Caches hot recommendations and embeddings.
Stores precomputed results for frequent users.
Reduces load on backend services.

Examples:
In-memory caches

Why:
Critical for meeting sub-200 ms latency SLOs.

13. Metadata Database

Primary Responsibilities:
Stores user profiles, item metadata, and configuration.
Supports high read throughput and horizontal scaling.
Acts as the source of truth for non-ML data.

Examples:
Distributed NoSQL databases

Why:
Optimized for scale and availability rather than complex transactions.


High-Level Flows

Flow 0: Homepage Recommendation (Happy Path)

  1. Client requests recommendations for the Homepage with user ID and context (device, locale, time).
  2. API Gateway authenticates the request and forwards it to the Recommendation Service.
  3. Recommendation Service checks the cache for precomputed results.
  4. On cache miss, it triggers candidate generation.
  5. Candidate Generation retrieves thousands of relevant items using embeddings, popularity, and heuristics.
  6. Ranking Service scores candidates using heuristics or ML models.
  7. Re-Ranking & Policy Engine applies diversity, freshness, and safety rules.
  8. Final ranked list is returned to the client.

Guarantee: Sub-200 ms latency with high-quality personalized recommendations.

Flow 1: Real-Time Personalization Update

  1. User watches, clicks, skips, or searches for content.
  2. Client sends interaction events asynchronously to the Interaction Logging Service.
  3. Events are published to the Event Stream.
  4. Stream Processing Service updates short-term user features (recent interests, intent).
  5. Updated features are written to the Feature Store.
  6. Subsequent recommendation requests reflect the latest behavior.

Guarantee: Recommendations adapt within seconds to recent user actions.

Flow 2: “Up Next” / Contextual Recommendation

  1. Client requests recommendations with current item context (e.g., video being watched).
  2. Recommendation Service forwards context to Candidate Generation.
  3. Candidate Generation retrieves items similar to the current item and user preferences.
  4. Ranking prioritizes relevance, continuity, and completion likelihood.
  5. Re-ranking enforces freshness and avoids repetition.
  6. Results are returned to the client.

Guarantee: Smooth content continuation and session-level engagement.

Flow 3: Cold Start – New User

  1. New user requests recommendations with no interaction history.
  2. Recommendation Service detects missing user embeddings.
  3. Candidate Generation falls back to popular content, regional and language-based items, and editorial or curated lists.
  4. Lightweight ranking applies basic personalization using context.
  5. Results are cached with short TTL.
  6. As interactions arrive, the system transitions to personalized recommendations.

Guarantee: Reasonable recommendations even without historical data.

Flow 4: Cold Start – New Item

  1. A new item is added to the platform.
  2. Item metadata and content features are processed offline.
  3. Item embedding is generated and stored in the Vector Store.
  4. Candidate Generation includes the item for relevant users.
  5. Exposure is throttled and monitored to collect early feedback.
  6. Interaction signals gradually improve ranking confidence.

Guarantee: New items get fair exposure without degrading recommendation quality.

Flow 5: Cache-First Serving Path

  1. Recommendation Service checks cache using (user_id, surface) key.
  2. If hit, cached recommendations are returned immediately.
  3. If stale or expired, async refresh is triggered in the background.
  4. Fresh results replace the cache entry.

Guarantee: Ultra-low latency for frequent users and popular surfaces.

Flow 6: Fallback on Dependency Failure

  1. Candidate Generation or Ranking exceeds its timeout budget.
  2. Recommendation Service triggers a fallback strategy: cached results, popular or trending items, or simplified heuristic ranking.
  3. Response is returned within the latency SLO.
  4. Failure metrics are emitted for monitoring.

Guarantee: System degrades gracefully without user-visible failures.
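The fallback flow above can be sketched as a wrapper around the ranking call. This is a simplified illustration: the budget check here runs only after ranking returns, whereas a production system enforces the deadline with real request cancellation (per-stage timeouts in the RPC layer):

```python
import time

POPULAR_FALLBACK = ["trend_1", "trend_2", "trend_3"]  # hypothetical trending list

def serve_with_fallback(user_id, rank_fn, budget_ms=150):
    """Serve ranked results; on error or budget overrun, return popular items."""
    start = time.monotonic()
    try:
        result = rank_fn(user_id)
        if (time.monotonic() - start) * 1000 > budget_ms:
            raise TimeoutError("ranking exceeded budget")
        return result, "ranked"
    except Exception:
        return POPULAR_FALLBACK[:10], "fallback"  # degraded, but never empty

def broken_ranker(user_id):
    raise RuntimeError("ranking model unavailable")

items, source = serve_with_fallback("u1", broken_ranker)
print(source, items)
```

The caller always receives a non-empty list plus a source tag, so the fallback rate can be emitted as a metric for monitoring.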

Flow 7: Observability & Feedback Loop

  1. Recommendation impressions and interactions are logged.
  2. Analytics pipelines compute engagement and quality metrics.
  3. Alerts trigger on drops in CTR, watch time, or diversity.
  4. Insights feed back into model tuning and policy updates.

Guarantee: Silent recommendation degradation is detected early.


Deep Dives – Functional Requirements

1. Support Personalized Recommendations for Users

The system generates recommendations tailored to each user based on their historical behavior, preferences, and context.
Personalization is achieved by combining long-term user signals (past interactions) with short-term intent (recent activity) to avoid generic or repetitive recommendations.

2. Support Homepage, “Up Next”, and Contextual Recommendations

Different surfaces have different goals and constraints.
The system supports multiple recommendation surfaces by accepting surface type and context at request time, allowing the same backend to produce results optimized for discovery (Homepage), continuity (Up Next), or relevance to a current item (Contextual).

3. Support Hybrid Recommendation Strategies

Relying on a single signal source is fragile at scale.
The system combines interaction-based signals (what similar users engage with) and content-based signals (item metadata and semantics) to improve robustness, coverage, and cold-start behavior.

4. Support Real-Time Personalization Using Recent User Interactions

User intent changes rapidly during a session.
Recent interactions such as clicks, skips, and watch time are processed asynchronously and reflected in near real time, ensuring recommendations adapt within seconds instead of waiting for offline updates.

5. Support Large-Scale Candidate Generation from Billions of Items

Scoring the entire catalog per request is infeasible.
The system first retrieves a high-recall candidate set using lightweight retrieval techniques, reducing the search space from billions of items to thousands before applying more expensive ranking logic.

6. Support Multi-Stage Ranking

Recommendation quality and latency are balanced using a staged pipeline.
Early stages prioritize speed and recall, while later stages focus on precision and ordering, allowing strict latency budgets to be met without sacrificing relevance.

7. Support Cold-Start Handling for New Users and New Items

New users and items lack interaction history.
The system falls back to popularity, regional trends, content attributes, and contextual signals, gradually transitioning to personalized recommendations as interactions are collected.

8. Support Business-Rule and Policy-Based Re-Ranking

Model scores alone are insufficient for production systems.
Final ranking applies constraints such as diversity, freshness, fairness, content safety, and sponsored placement to align recommendations with product, legal, and business requirements without retraining core logic.

9. Support Tracking of User Interactions and Feedback Signals

Every recommendation impression and user interaction is logged asynchronously.
These signals power real-time personalization, offline evaluation, monitoring, and long-term system improvement without impacting serving latency.


Deep Dives – Non-Functional Requirements

1. Highly Available and Fault Tolerant

The system must continue serving recommendations despite failures in individual services or dependencies.
All serving components are stateless and horizontally scalable, while critical data is stored in replicated and durable systems to avoid single points of failure.

2. Low-Latency Recommendation Serving (Sub-200 ms)

Recommendation requests are latency-sensitive and must return results within strict SLOs.
The system enforces cache-first access, timeout budgets per stage, and lightweight fallbacks to guarantee predictable response times under load.

3. High Throughput for Large-Scale User Traffic

The system must handle millions of concurrent users and bursty traffic patterns.
Asynchronous event ingestion, batched processing, and partitioned queues ensure sustained high throughput without impacting serving performance.

4. Horizontally Scalable with Growing Users and Content

All core components scale horizontally by adding instances rather than redesigning the system.
Growth in users, content, or regions is handled through partitioning, sharding, and independent scaling of retrieval, ranking, and caching layers.

5. Real-Time Freshness of Recommendations

User behavior and content trends change rapidly.
The system incorporates near real-time interaction signals and frequent cache refreshes to prevent stale recommendations while avoiding excessive recomputation.

6. Consistent User Experience Across Devices and Regions

Users may switch devices or locations frequently.
Recommendations are generated using a unified serving pipeline with region-aware data access, ensuring consistency while respecting latency and locality constraints.

7. Cost-Efficient Operation at Scale

Serving recommendations is a high-QPS workload.
The system minimizes cost by using multi-stage pipelines, aggressive caching, and lightweight retrieval before expensive computation, ensuring cost scales linearly with traffic.

8. Secure Access to User and Content Data

User behavior data is sensitive and must be protected.
All APIs are authenticated and authorized, data is encrypted in transit and at rest, and access is restricted based on service identity and least-privilege principles.

9. Observability for System Health and Recommendation Quality

System health cannot be judged by uptime alone.
The system tracks latency, error rates, cache hit ratios, and downstream dependency health, along with engagement and quality metrics to detect silent degradation.


Failure Handling & Fallback Strategies

Recommendation systems must remain responsive even when dependencies fail or degrade. The system is designed to fail fast, degrade gracefully, and never block the user experience.

Cache Miss or Cache Unavailability

If cached recommendations are unavailable or expired, the system bypasses the cache and triggers the normal serving pipeline.
If recomputation exceeds latency budgets, a simpler fallback (popular or trending items) is returned to avoid user-visible delays.

Guarantee: Cache failures never block recommendation delivery.

Candidate Generation Timeout or Failure

Candidate generation has a strict timeout budget. If it fails or times out, the system falls back to: Recently cached candidate sets, Popular or regional content, Lightweight heuristic-based retrieval

Guarantee: Requests complete within latency SLOs even if retrieval degrades.

Vector Store / ANN Service Degradation

If the ANN service becomes slow or unavailable, the system avoids synchronous retries. Requests are served using precomputed or cached candidates while health checks and alerts trigger remediation.

Guarantee: Embedding search failures do not cascade into full system outages.

Ranking Service Timeout

Ranking is bounded by a hard deadline. If ranking exceeds its time budget, partially scored results or previously cached rankings are returned.

Guarantee: Ranking accuracy is sacrificed before latency guarantees.

Re-Ranking or Policy Engine Failure

Re-ranking logic is designed to be optional and best-effort. If it fails, the system returns the ranked list without additional constraints rather than failing the request.

Guarantee: Business rules enhance quality but never break delivery.

Real-Time Signal Unavailability

If real-time personalization signals are delayed or unavailable, the system falls back to long-term user preferences. Offline features remain the stable baseline.

Guarantee: Recommendation quality degrades gracefully without sudden behavior shifts.

Event Stream Backlog or Processing Lag

If interaction events lag or queues build up, serving continues unaffected. Lag is monitored and corrected asynchronously without blocking recommendation requests.

Guarantee: Data pipeline issues never impact real-time serving.

Partial Data or Feature Store Outage

If some features cannot be fetched, the system proceeds with a reduced feature set. Missing features are treated as optional, not mandatory.

Guarantee: Feature unavailability does not cause request failures.

Regional Failure or Zone Outage

If a region or availability zone becomes unhealthy, traffic is shifted to healthy regions. Cached and regionally replicated data ensures continuity.

Guarantee: Regional outages result in degraded quality, not downtime.

Graceful Degradation Under Extreme Load

When the system is overloaded: Low-priority surfaces are throttled, Cache TTLs are increased, Expensive computation is skipped

Guarantee: Core recommendation flows remain available under peak load.


Trade-Offs

Multi-Stage Retrieval vs Single-Stage Ranking

Choice: Multi-stage recommendation pipeline
Pros: Scales to billions of items, predictable latency, independent optimization of stages
Cons: Higher system and operational complexity
Why This Works: Single-stage scoring is infeasible at scale; staged pipelines are the only practical way to meet strict latency SLOs.

Approximate Retrieval vs Exact Search

Choice: Approximate retrieval for candidate generation
Pros: Orders-of-magnitude faster, enables real-time serving, bounded latency
Cons: Slight recall loss due to approximation
Why This Works: Small recall loss is acceptable and compensated by downstream ranking.

Real-Time Freshness vs Serving Latency

Choice: Near real-time personalization with bounded freshness
Pros: Responsive to recent behavior without blocking requests
Cons: Very recent actions may not appear immediately
Why This Works: Users value fast responses more than perfectly fresh recommendations.

Cache-First Serving vs Always Compute

Choice: Cache-first serving with asynchronous refresh
Pros: Low latency, reduced backend load, improved tail performance
Cons: Cached results can be slightly stale
Why This Works: Slight staleness is acceptable in exchange for reliability and speed.
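A minimal cache-first sketch, assuming a simple TTL check (the TTL value and compute function are placeholders; a production version would refresh asynchronously rather than on the request path):

```python
# Sketch of cache-first serving: return cached results within a TTL and
# recompute only on a miss. TTL and the compute function are assumptions.

CACHE: dict = {}
TTL_S = 60.0

def compute_recommendations(user_id: str) -> list:
    """Stand-in for the full (expensive) recommendation pipeline."""
    return [f"item-{user_id}-{i}" for i in range(3)]

def serve(user_id: str, now: float) -> tuple:
    """Return (recommendations, cache_hit)."""
    entry = CACHE.get(user_id)
    if entry and now - entry["at"] < TTL_S:
        return entry["items"], True            # fast path: possibly slightly stale
    items = compute_recommendations(user_id)   # slow path: recompute and cache
    CACHE[user_id] = {"items": items, "at": now}
    return items, False

items1, hit1 = serve("u1", now=0.0)    # miss: computed and cached
items2, hit2 = serve("u1", now=30.0)   # hit: served from cache
items3, hit3 = serve("u1", now=120.0)  # TTL expired: recomputed
print(hit1, hit2, hit3)
```

The trade-off is visible in the second call: the user may see results up to `TTL_S` old, in exchange for skipping the pipeline entirely.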

Personalization Depth vs System Cost

Choice: Deep personalization only after candidate reduction
Pros: Keeps compute cost bounded, predictable scaling
Cons: Early-stage retrieval is less personalized
Why This Works: Fine-grained personalization only matters when the candidate set is small.

Exploration vs Exploitation

Choice: Controlled exploration at lower ranks
Pros: Prevents stagnation, discovers new interests and content
Cons: Short-term engagement may dip slightly
Why This Works: Long-term engagement improves with limited, targeted exploration.
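"Controlled exploration at lower ranks" can be sketched as a re-ranker that protects the head of the list and probabilistically swaps exploration items into the tail. The protected-top size, exploration rate, and seeded RNG are illustrative assumptions:

```python
import random

# Sketch of controlled exploration: the top ranks stay pure exploitation,
# while lower ranks occasionally swap in an unexplored item. The rate and
# protected-top size are illustrative assumptions.

def rerank_with_exploration(ranked, explore_pool, protect_top=3, rate=0.2, rng=None):
    rng = rng or random.Random(1)            # seeded here only for a repeatable demo
    result = list(ranked[:protect_top])      # never touch the head of the list
    pool = list(explore_pool)
    for item in ranked[protect_top:]:
        if pool and rng.random() < rate:
            result.append(pool.pop(0))       # inject an exploration item
        else:
            result.append(item)
    return result

ranked = ["r1", "r2", "r3", "r4", "r5", "r6"]
out = rerank_with_exploration(ranked, ["new1", "new2"])
print(out)
```

Because only tail slots are eligible, the short-term engagement cost is capped while new content still gets measurable exposure.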

Consistency vs Availability

Choice: Eventual consistency for recommendation data
Pros: Higher availability and lower latency
Cons: Temporary inconsistencies in results
Why This Works: Recommendations are advisory, not transactional.

Centralized Orchestration vs Fully Distributed Logic

Choice: Centralized Recommendation Service
Pros: Clear ownership, better observability, strict latency control
Cons: Requires careful horizontal scaling
Why This Works: Central orchestration simplifies control without sacrificing scalability.

Business Rules in Models vs Post-Ranking Policies

Choice: Policy-based re-ranking outside models
Pros: Faster iteration, no retraining required
Cons: Additional processing step
Why This Works: Business logic changes faster than models and should remain decoupled.


Frequently Asked Questions in Interviews

Why can’t we score all items for every recommendation request?

Because real-world catalogs can contain billions of items, and scoring each one would exceed both latency and compute budgets.
Multi-stage retrieval limits expensive computation to a small candidate set, making real-time serving feasible.

What happens if candidate generation misses good items?

Those items will never reach downstream ranking or re-ranking stages.
This is why candidate generation is optimized for high recall and often uses multiple retrieval strategies to reduce blind spots.

Why do we separate candidate generation and ranking?

Candidate generation focuses on recall and speed, while ranking focuses on precision and ordering.
Separating these concerns allows each stage to be optimized independently under strict latency constraints.

Why do we need both ranking and re-ranking?

Ranking determines relevance based on learned signals and context.
Re-ranking applies product, safety, fairness, and diversity constraints that are difficult or risky to encode directly into ranking logic.

How do you handle real-time personalization without increasing latency?

User interactions are ingested asynchronously and reflected through fast-access features.
Serving never blocks on real-time pipelines and falls back to long-term preferences if recent signals are delayed.

How does the system handle cold-start users?

When no interaction history exists, the system relies on popularity, regional trends, and contextual signals.
As soon as interactions are collected, personalization gradually increases without abrupt behavior changes.
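One common way to make that ramp-up gradual is a blend weight that grows with interaction count. The ramp length here (50 interactions) is an illustrative assumption:

```python
# Sketch of gradual cold-start handoff: blend popularity scores with
# personalized scores by a weight that grows with interaction count.
# The ramp length is an illustrative assumption.

def blend_weight(n_interactions: int, ramp: int = 50) -> float:
    """0.0 for a brand-new user, reaching 1.0 as history accumulates."""
    return min(n_interactions / ramp, 1.0)

def blended_score(pop_score: float, personal_score: float, n_interactions: int) -> float:
    w = blend_weight(n_interactions)
    return (1 - w) * pop_score + w * personal_score

print(blended_score(0.8, 0.2, 0))    # new user: pure popularity
print(blended_score(0.8, 0.2, 25))   # halfway through the ramp
print(blended_score(0.8, 0.2, 100))  # established user: pure personalization
```

Because the weight moves continuously, there is no abrupt switch from popularity-driven to personalized results.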

How does the system handle cold-start items?

New items rely on content attributes and controlled initial exposure.
Early interaction signals are monitored before the item is fully trusted in ranking to avoid quality degradation.

How do you ensure recommendations stay fresh?

Short-term signals update frequently and cached results use bounded TTLs.
Offline updates continuously refresh long-term preferences without impacting live traffic.

What happens if the vector store or retrieval layer goes down?

The system avoids retries on the critical path and switches to cached or heuristic-based candidates.
Availability and latency are preserved even if relevance temporarily degrades.
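The no-retry fallback chain might look like the following sketch, where a failed retrieval call falls through to the last cached result and then to popularity-based candidates (all names here are hypothetical):

```python
# Sketch of a no-retry fallback chain: if the retrieval layer fails, serve
# cached or popularity-based candidates instead of retrying on the critical
# path. All names below are illustrative.

POPULAR_FALLBACK = ["pop1", "pop2", "pop3"]

def get_candidates(retrieve, user_id: str, cache: dict) -> tuple:
    """Return (candidates, source) without ever retrying the live path."""
    try:
        return retrieve(user_id), "live"
    except Exception:
        if user_id in cache:                  # first fallback: last good result
            return cache[user_id], "cached"
        return POPULAR_FALLBACK, "popular"    # last resort: heuristic candidates

def healthy(user_id):
    return ["a", "b", "c"]

def down(user_id):
    raise ConnectionError("vector store unavailable")

print(get_candidates(healthy, "u1", {}))           # live path
print(get_candidates(down, "u1", {"u1": ["x"]}))   # cached fallback
print(get_candidates(down, "u2", {}))              # popularity fallback
```

Every path returns in bounded time; only the `source` tag changes, which also gives observability into how often degraded paths are hit.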

Why is eventual consistency acceptable in recommendation systems?

Recommendations guide user choice but do not represent a source of truth.
Temporary inconsistencies are preferable to increased latency or reduced availability.

How do you prevent popularity bias and content monopolization?

The system applies diversity constraints, exposure caps, and controlled exploration.
This ensures long-tail content receives visibility while preserving relevance.

How do you debug bad or surprising recommendations?

Every recommendation request and interaction is logged with traceable identifiers.
Drops in engagement, diversity, or freshness trigger alerts and investigation.

Which metrics matter most in recommendation systems?

System health metrics include latency, error rates, and cache hit ratios.
Quality metrics include engagement, retention, diversity, and long-term user satisfaction.

How does the system scale to 10× or 100× traffic?

All serving components are stateless and horizontally scalable.
Capacity is increased by adding replicas, cache nodes, and partitions without redesigning the system.

Why are training and serving decoupled?

Coupling them would make serving dependent on slow or unstable pipelines.
Serving always relies on the last known good state to protect latency and availability.

How do you ensure consistent recommendations across devices?

A unified serving pipeline is used across web, mobile, and TV clients.
Device context influences ranking behavior without fragmenting core logic.

What are the biggest scalability bottlenecks?

Candidate retrieval latency and cache miss amplification at peak traffic.
These are mitigated using aggressive caching, fallbacks, and timeout budgets.

What would you simplify if system traffic were low?

Reduce the number of stages, caching layers, and fallback paths.
System complexity should scale with traffic and business needs, not precede them.


High-Level Summary

This recommendation system uses a multi-stage, cache-first architecture to serve personalized results at scale under strict latency constraints. Candidate generation, ranking, and policy-based re-ranking are cleanly separated to balance relevance, freshness, and business rules. The system is highly available, horizontally scalable, and designed to degrade gracefully during partial failures. Real-time feedback loops and strong observability ensure recommendation quality improves continuously without impacting reliability.

Feel free to ask questions or share your thoughts — happy to discuss!

