Gabriel Anhaia

Netflix's Recommendation Pipeline Without ML First


The interviewer says "design Netflix's recommendation system." Eight minutes in, the candidate is whiteboarding a transformer architecture, sketching attention heads, debating whether to fine-tune Llama or train from scratch. The interviewer is staring at the corner of the room.

The recommendation panel at any FAANG-shaped company is not testing whether you remember the YouTube DNN paper. It is testing whether you understand that the model is one box on the architecture diagram, not the whole diagram. My rough split when I sketch these systems on a whiteboard: the model is maybe 15% of the work, and the remaining 85% is candidate generation, ranking infrastructure, feature stores, online A/B, and the data pipelines that feed all four. Numbers are illustrative, but the ratio matches the shape of every production system I've seen. If you spend the interview on the 15%, you fail the 85%.

The Netflix tech blog's piece on system architectures for personalization lays this out explicitly: the architecture splits offline, nearline, and online layers, and the model is one component among many. Walk that pipeline first. If there's time at the end, talk about the model.

Latency budget — the constraint that decides everything

Before any architecture, the budget. A reasonable working budget for a homepage row is on the order of 100 ms after the user opens the app — that's the number I use when I reason about systems of this shape, not a published Netflix figure. One plausible split (mine, illustrative):

  • 15 ms — gateway, TLS, deserialization
  • 20 ms — feature fetch
  • 30 ms — ranking inference
  • 20 ms — post-filtering, business rules, formatting
  • 15 ms — slack for tail latency

That leaves zero milliseconds for "score every title in the catalog against this user." Take a catalog on the order of tens of thousands of titles (Netflix's global catalog has been quoted around 17–18k, varies per region). Multiply by ~200 features per title, then a model forward pass each. Try to fit that in 30 ms? It doesn't. So the system narrows the candidate set first, then ranks the narrowed set. This is the entire point of the multi-stage pipeline.
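The back-of-envelope, with illustrative per-item costs (the 20 µs figure is an assumption for feature assembly plus a forward pass, not a measured number):

CATALOG = 18_000       # order-of-magnitude catalog size, varies per region
CANDIDATES = 500       # what survives candidate generation
PER_ITEM_US = 20       # assumed per-item scoring cost in microseconds

full_catalog_ms = CATALOG * PER_ITEM_US / 1000    # 360 ms: blows the entire 100 ms budget
narrowed_ms = CANDIDATES * PER_ITEM_US / 1000     # 10 ms: fits inside the 30 ms ranking slot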

Three stages, each with a different cost-per-item budget:

  1. Candidate generation. ~18,000 → 500 items. Cheap retrieval. Sub-millisecond per item.
  2. Ranking. 500 → 50 items. Expensive scoring. Tens of microseconds per item.
  3. Post-filtering and re-ranking. 50 → final ordered list. Business rules and diversity.

Naming the budget upfront in an interview signals you've actually built one of these.

Candidate generation without an ML model

The instinct is to reach for a two-tower deep network. The senior answer is "we have four candidate generators running in parallel, three of them are not ML." Catalog-scale retrieval is mostly bookkeeping.

The four classics:

  • Popularity. Top items by region, recency-weighted. One precomputed list per region, refreshed every 15 minutes. Hits the long tail of users with no history.
  • Recently watched + similar. Item-to-item collaborative filtering. For each item, store the top-K co-watched items, computed offline from session co-occurrence.
  • Continue watching. User has a half-finished episode. That row exists because somebody wrote a SQL query against the playback events table.
  • New release injection. Boost items released in the last 30 days, gated by genre affinity from the user's watch history.

Three of those are "run a daily Spark job, dump the result into Redis or Cassandra, fetch with a key lookup at request time." Zero models. In practice they generate the majority of what ends up in the final ranked list — my own rough estimate from walking the design with engineers who've shipped these, not a published figure.

Here is the candidate-generation skeleton — popularity plus item-to-item collaborative filtering, no ML libraries imported.

import math
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Iterable

@dataclass
class WatchEvent:
    user_id: str
    item_id: str
    completed: bool
    watched_at_ts: int

def cooccurrence_matrix(
    events: Iterable[WatchEvent], top_k: int = 50
) -> dict[str, list[tuple[str, float]]]:
    """For each item, the top-K co-watched items by Jaccard."""
    user_items: dict[str, set[str]] = defaultdict(set)
    item_users: dict[str, set[str]] = defaultdict(set)
    for e in events:
        if not e.completed:
            continue
        user_items[e.user_id].add(e.item_id)
        item_users[e.item_id].add(e.user_id)

    sims: dict[str, list[tuple[str, float]]] = {}
    for item, users in item_users.items():
        scores: Counter = Counter()
        for u in users:
            for other in user_items[u]:
                if other != item:
                    scores[other] += 1
        ranked = []
        # Over-fetch by raw co-count, then re-rank those candidates by Jaccard similarity.
        for other, n_co in scores.most_common(top_k * 4):
            denom = len(users | item_users[other])
            if denom:
                ranked.append((other, n_co / denom))
        ranked.sort(key=lambda x: -x[1])
        sims[item] = ranked[:top_k]
    return sims

def popularity_by_region(
    events: Iterable[WatchEvent],
    user_region: dict[str, str],
    half_life_hours: float = 72,
) -> dict[str, list[str]]:
    """Region → recency-decayed top items."""
    events = list(events)  # materialize: the iterable is walked twice below
    now = max(e.watched_at_ts for e in events)
    decay = math.log(2) / (half_life_hours * 3600)
    scores: dict[str, Counter] = defaultdict(Counter)
    for e in events:
        region = user_region.get(e.user_id, "ROW")
        weight = math.exp(-decay * (now - e.watched_at_ts))
        scores[region][e.item_id] += weight
    return {
        r: [i for i, _ in c.most_common(500)]
        for r, c in scores.items()
    }

That's the offline half. Both functions run once a day in Spark, write their dictionaries to a KV store, and never touch the request path. Now the online half — three lookups and a sort:

def candidates_for(
    user_id: str,
    history: list[str],
    region: str,
    sims: dict[str, list[tuple[str, float]]],
    pop: dict[str, list[str]],
    cap: int = 500,
) -> list[str]:
    seen = set(history)
    pool: dict[str, float] = {}

    # Item-to-item from recent history.
    for item in history[-20:]:
        for other, score in sims.get(item, []):
            if other in seen:
                continue
            pool[other] = max(pool.get(other, 0.0), score)

    # Backfill with regional popularity.
    for i, item in enumerate(pop.get(region, [])):
        if item in seen or item in pool:
            continue
        pool[item] = 0.1 * (1.0 - i / len(pop[region]))

    return [i for i, _ in sorted(
        pool.items(), key=lambda kv: -kv[1]
    )[:cap]]

That's a working candidate generator, and not a model in sight. Expect single-digit-millisecond P99 on a warm cache (back-of-envelope, not benchmarked).
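To see the pieces fit together, here's a toy run with hypothetical users and titles:

events = [
    WatchEvent("u1", "stranger_things", True, 1_000_000),
    WatchEvent("u1", "dark", True, 1_000_500),
    WatchEvent("u2", "stranger_things", True, 1_001_000),
    WatchEvent("u2", "dark", True, 1_001_200),
    WatchEvent("u3", "bridgerton", True, 1_002_000),
]
regions = {"u1": "US", "u2": "US", "u3": "US"}

sims = cooccurrence_matrix(events)            # offline, daily
pop = popularity_by_region(events, regions)   # offline, daily

# Online: "dark" arrives via item-to-item, "bridgerton" backfills from regional popularity.
print(candidates_for("u1", ["stranger_things"], "US", sims, pop))
# ['dark', 'bridgerton']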

The two-tower model becomes a fifth generator slotted next to these. In the systems I've seen it tends to contribute a minority share of the final candidates — often something like 20–30%, rough estimate. It does not replace the cheap ones.

Ranking — where ML earns its keep

Ranking is where a learned model genuinely helps. You are scoring 500 items against a single user, with 100–300 features per pair. Logistic regression, gradient-boosted trees, or a small DNN. The architecture matters less than the feature engineering and the serving infrastructure.

The features come from a feature store, fetched in a single batch:

  • User features — embeddings from watch history, demographic priors, time-of-day patterns.
  • Item features — genre tags, release date, popularity decile.
  • Interaction features — has the user seen the trailer, dwelled on the box art, skipped the title before.
  • Context features — device, time of day, previously watched in this session.

The store you pick matters. Redis works for small workloads. At Netflix scale you need something like Feast, Vertex Feature Store, or a homegrown system on Cassandra. The constraint: 500 items × 200 features in 20 ms means you fetch in one batch, not 500 round-trips.

The ranker emits a score per item. You sort. You keep the top 50. Done.
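A minimal sketch of the online ranking step, under loud assumptions: the feature store client and its get_user / get_items_batch methods are stand-ins for whatever you actually run, and the weight vector stands in for a trained ranker. The point is the single batched fetch and the vectorized scoring pass.

import numpy as np

def rank_candidates(
    user_id: str,
    candidate_ids: list[str],
    feature_store,           # assumed client with batched reads: one round-trip, not 500
    weights: np.ndarray,     # stand-in for a trained model (LR weights here for simplicity)
    top_n: int = 50,
) -> list[str]:
    # One batched fetch: user features once, item features for every candidate at once.
    user_vec = feature_store.get_user(user_id)               # shape (d_user,)
    item_mat = feature_store.get_items_batch(candidate_ids)  # shape (n, d_item)

    # Assemble the (n, d_user + d_item) feature matrix for the user-item pairs.
    pairs = np.hstack([np.tile(user_vec, (len(candidate_ids), 1)), item_mat])

    # Score every pair in one vectorized pass, sort, keep the top N.
    scores = pairs @ weights
    order = np.argsort(-scores)[:top_n]
    return [candidate_ids[i] for i in order]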

Post-filtering and re-ranking

The 50 ranked items still don't go straight to the user. Three more passes:

  • Diversity. Don't show six action movies in a row. Apply a determinantal point process or a simpler greedy "max-K-from-genre" rule (sketched below).
  • Business constraints. Regional licensing — does this title exist in this country's catalog right now? Content moderation — is this title currently flagged?
  • Freshness boost. Recently added titles get a small lift, capped per row.

All three passes are rules-based, and all three are non-negotiable for the product to work.
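Here's what the greedy version looks like, assuming one primary genre per title (a simplification; real metadata is multi-label) and leaving the DPP variant aside:

from collections import Counter

def rerank_with_genre_cap(
    ranked: list[str],
    genre_of: dict[str, str],    # assumed: one primary genre per title
    max_per_genre: int = 2,
    row_size: int = 10,
) -> list[str]:
    """Greedy max-K-from-genre: walk the ranked list, skip a title once its genre quota is full."""
    counts: Counter = Counter()
    row: list[str] = []
    for item in ranked:
        genre = genre_of.get(item, "unknown")
        if counts[genre] >= max_per_genre:
            continue  # quota hit: let a lower-ranked title from another genre through
        row.append(item)
        counts[genre] += 1
        if len(row) == row_size:
            break
    return row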

A/B testing infrastructure is half the system

The recommendation pipeline is a hypothesis machine. Every change needs an A/B test before it ships: a new candidate generator, a retrained ranker, a tweaked diversity rule. The infrastructure to run those experiments is half the engineering investment.

What the experimentation platform owes you:

  • Deterministic bucketing. A user lands in the same arm every request. Hash (user_id, experiment_id) mod N. Sticky and stateless (sketch after this list).
  • Holdout populations. A small fraction of traffic never enters any experiment, used for long-run measurement.
  • Metric pipelines. Watch time, completion rate, retention at 7/30 days, churn proxies. Computed from event logs in batch, not online.
  • Guardrails. Auto-stop an arm if a top-line metric drops by more than X%. Catches bad rankers before they bleed engagement.
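A minimal version of that bucketing. The sha256 choice and the key format are mine, not anything Netflix has published; any stable hash with good dispersion does the job:

import hashlib

def assign_arm(user_id: str, experiment_id: str, n_arms: int) -> int:
    """Deterministic, stateless bucketing: same (user, experiment) -> same arm, every request."""
    key = f"{user_id}:{experiment_id}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % n_arms

# assign_arm("u42", "ranker_v2_rollout", 2) returns the same 0 or 1 on every call.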

The candidates who skip A/B in their interview answer signal that they've never shipped a recommendation change in production. The interviewer is listening for it.

What the panel is silently scoring

The interviewer is not waiting for you to draw a transformer. They are waiting to hear you name a latency budget before any architecture, treat candidate generators as a portfolio rather than a single model, talk about the feature store as the load-bearing piece, and put A/B infrastructure on the diagram as a first-class box. Hit those four beats and the ranker discussion is optional. Skip them and a perfect ranker won't save the answer.

If you want to practice this, sketch the same pipeline shape against a domain you actually know — your company's search ranker, a feed, a notifications system. The stages translate. The budget exercise is the part most candidates have never done out loud, and it's the part that flips the panel's read of you from "knows the papers" to "has shipped one."

If this was useful

Recommendation pipelines and retrieval pipelines share more structure than people expect: candidate generation, reranking, diversity injection, online vs offline split. System Design Pocket Guide: Interviews walks through the recommendation system design among 14 others, with the same emphasis on naming the latency budget before the architecture. If you're shipping the same shape of pipeline against documents instead of titles, RAG Pocket Guide is the document-retrieval analogue — same multi-stage funnel, different inputs.
