Introduction
Netflix streams video — much of it in 4K — to over 200 million subscribers, and playback starts in seconds with virtually no buffering. On the surface it is just playing a video file. Underneath it is one of the most sophisticated distributed systems ever built — spanning global content delivery, adaptive streaming, parallel encoding pipelines, geo licensing, and personalized recommendations all working together seamlessly. This post walks through every challenge, question by question, including wrong turns and how to navigate out of them.
Challenge 1: Global Video Delivery
Interview Question: Suppose one central server in the US stores all Netflix movies, and a user in Mumbai, a user in London, and a user in Tokyo all click play simultaneously. What is the most fundamental problem they all face?
Navigation: Serving 100GB video files from a single central server to users across the world means every user pays the cost of geographic distance — high latency, slow start times, and saturated long distance network links. The solution is obvious once you frame it correctly — video data needs to physically live close to the user.
Solution: Content Delivery Network — CDN with regional servers.
Netflix built their own CDN called Open Connect. They place servers called Open Connect Appliances directly inside ISP data centers worldwide. Instead of video traveling from a US central server to Mumbai, it travels from a Mumbai ISP server a few milliseconds away.
- Central server stores master copy of all content
- Regional CDN servers cache popular content close to users
- Mumbai user streams from Mumbai CDN server — low latency
- London user streams from London CDN server — low latency
- Tokyo user streams from Tokyo CDN server — low latency
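The routing idea above can be sketched in a few lines. This is a toy illustration, not Netflix's actual steering logic — the region names and hostnames are made up:

```python
# Toy illustration of CDN routing: map each user's region to the nearest
# edge server instead of the central origin. All names are hypothetical.
EDGE_SERVERS = {
    "mumbai": "oca-mumbai.example.net",
    "london": "oca-london.example.net",
    "tokyo":  "oca-tokyo.example.net",
}
ORIGIN = "origin-us.example.net"

def stream_host(user_region: str) -> str:
    """Serve from the regional edge when one exists, else fall back to origin."""
    return EDGE_SERVERS.get(user_region, ORIGIN)

print(stream_host("mumbai"))      # oca-mumbai.example.net
print(stream_host("antarctica"))  # origin-us.example.net
```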
Key Insight: Video data must live geographically close to the user. A CDN is not an optimization — it is a fundamental requirement for global video streaming at scale.
Challenge 2: Smart CDN Cache Management
Interview Question: CDN storage is expensive. You cannot cache every title on every regional server. How do you decide what content to cache on which regional server — and what happens when a user requests a title not cached on their nearest CDN?
Solution: Two caching strategies working together — push and pull.
Push based caching — proactive:
- New blockbuster release approaching launch day
- Netflix proactively pushes content to all relevant CDN servers before launch
- Day one release — content already cached everywhere — zero cache misses on launch
Pull based caching — reactive:
- User requests title not cached on nearest CDN server
- CDN pulls content from central server on first request
- Caches it locally with TTL for future requests
- All subsequent users in that region hit the cache
Unpopular titles:
- Rarely requested content never gets cached
- Served directly from central server
- CDN storage reserved for content that justifies caching
Dynamic TTL based on viewing patterns:
A fixed TTL for all content is naive. A show extremely popular during its launch week but dead 3 months later should not hold CDN space indefinitely.
Solution: Viewing Analytics Service collects watch events and publishes to Kafka. TTL Management Service consumes from Kafka, computes viewing frequency per title per region, and dynamically adjusts TTL accordingly.
- Avengers launch week — 10 million views — TTL 30 days
- Avengers 3 months later — 1000 views — TTL 3 days
- Obscure documentary — 10 views — TTL 0, evict immediately
All TTL adjustments happen asynchronously — never blocking the streaming experience.
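The TTL logic above boils down to a mapping from recent demand to cache lifetime. Here is a minimal sketch of what the TTL Management Service might compute — the function name and thresholds are illustrative assumptions, not Netflix internals:

```python
# Hypothetical sketch of dynamic TTL: map a title's recent regional view
# count to a CDN cache TTL. Thresholds mirror the examples in the text.
from datetime import timedelta

def cache_ttl(weekly_views: int) -> timedelta:
    """Return how long a title should stay cached on a regional CDN server."""
    if weekly_views >= 1_000_000:   # blockbuster at peak demand
        return timedelta(days=30)
    if weekly_views >= 1_000:       # steady long-tail demand
        return timedelta(days=3)
    return timedelta(0)             # too cold to cache: evict immediately

print(cache_ttl(10_000_000))  # launch-week hit -> 30 days
print(cache_ttl(1_000))       # fading title    -> 3 days
print(cache_ttl(10))          # obscure title   -> evict
```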
Key Insight: Smart CDN management combines proactive push for known blockbusters, reactive pull for long tail content, and dynamic TTL adjustment based on real viewing patterns. Static caching strategies waste expensive CDN storage.
Challenge 3: Video Chunking and Instant Playback
Interview Question: A 4K HDR movie is 100GB. A user has a 50 Mbps connection. Downloading 100GB takes 4.4 hours. But Netflix starts playing in under 3 seconds. How is this physically possible?
Navigation: The user does not need all 100GB before playback starts. They only need the first few seconds of video. If you split the video into small chunks and start playing the first chunk while the rest download in the background, playback starts almost instantly.
Solution: Video Chunking — split video into 2 to 10 second chunks.
- 3 hour movie split into roughly 2000 individual chunks
- Each chunk is 2 to 10 seconds of video
- User needs only the first chunk to start playing — a few megabytes
- Subsequent chunks download in background while current chunk plays
- Playback starts in under 3 seconds regardless of file size
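The arithmetic behind this is worth checking. Using the numbers from the question — a 100GB file, a 50 Mbps link, and a roughly 4MB first chunk at 4K:

```python
# Back-of-the-envelope check: full file vs first chunk on a 50 Mbps link.
file_gb = 100
link_mbps = 50
chunk_mb = 4  # ~4 MB first chunk at 4K (8 Mbps x ~4 s of video)

full_file_hours = file_gb * 8 * 1000 / link_mbps / 3600
first_chunk_seconds = chunk_mb * 8 / link_mbps

print(f"full file:   {full_file_hours:.1f} hours")  # ~4.4 hours
print(f"first chunk: {first_chunk_seconds:.2f} s")  # well under a second
```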
Key Insight: You never need the whole file to start playing. Chunking transforms an impossible 100GB download problem into a trivial few megabyte first chunk problem.
Challenge 4: Adaptive Bitrate Streaming
Interview Question: Netflix stores the same chunk at 5 different quality levels. Why — and who decides which quality to request?
Netflix stores every chunk at multiple quality levels:
- 4K HDR — 8 Mbps — 4MB per chunk
- 1080p — 4 Mbps — 2MB per chunk
- 720p — 2 Mbps — 1MB per chunk
- 480p — 1 Mbps — 0.5MB per chunk
- 360p — 0.5 Mbps — 0.25MB per chunk
Different users have different bandwidth. The same user has different bandwidth at different moments — strong WiFi at home, weak signal in the kitchen, mobile data on the commute.
With a single quality video: connection degrades → buffering → terrible experience.
With multiple quality versions: connection degrades → seamlessly switch to lower quality chunk → no buffering.
The client decides — not the server. The Netflix app runs an adaptive bitrate (ABR) algorithm that continuously monitors the download speed of recent chunks and the current buffer level, then decides which quality chunk to request next.
ABR decision logic:
- Buffer above 30 seconds and speed above 20 Mbps — request 4K chunk
- Buffer above 15 seconds and speed above 8 Mbps — request 1080p chunk
- Buffer below 10 seconds and speed dropping — request 720p chunk
- Buffer below 5 seconds — request 360p immediately to prevent buffering
Quality adjustments are seamless and invisible to the user. The app makes hundreds of these decisions per viewing session.
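The decision table above can be expressed as a small pure function. This is a simplified sketch — real ABR algorithms also consider bandwidth trends and switching penalties — with the thresholds taken from the rules listed:

```python
def next_quality(buffer_s: float, speed_mbps: float) -> str:
    """Pick the next chunk's quality from buffer level and measured speed.
    A simplified client-side ABR rule; thresholds from the decision table."""
    if buffer_s < 5:
        return "360p"   # emergency: smallest chunk to prevent a stall
    if buffer_s < 10:
        return "720p"   # buffer draining: step down
    if buffer_s > 30 and speed_mbps > 20:
        return "4K"
    if buffer_s > 15 and speed_mbps > 8:
        return "1080p"
    return "720p"       # safe default between thresholds

print(next_quality(40, 25))  # 4K
print(next_quality(3, 25))   # 360p
```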
Key Insight: Multiple quality versions per chunk combined with client side adaptive bitrate selection eliminates buffering under any network condition. The client always has the right quality for current bandwidth.
Challenge 5: Parallel Video Encoding Pipeline
Interview Question: A raw master file could be 1TB of uncompressed footage. Netflix needs to create thousands of chunks at 5 quality levels each. Encoding a 3 hour movie sequentially on one machine could take days. Netflix adds thousands of new titles every year. How do you process them fast enough?
Navigation: The key insight is that chunks are independent of each other. You do not need to encode chunk 1 before encoding chunk 2. If you can encode all chunks simultaneously across thousands of machines, a process that took days takes minutes.
Solution: Parallel chunk encoding across a distributed encoding farm.
- Raw master file arrives
- Chunking Service splits movie into 2000 independent chunks
- Each chunk published as a job to Kafka job queue
- 2000 encoding machines each pick up one job
- Every chunk encodes simultaneously across the farm
- Each machine produces 5 quality versions of its chunk
- All 2000 chunks complete — Merge Service reassembles into final encoded movie
- CDN Distribution pushes encoded content to regional servers
A 3 hour movie that took days on one machine now takes minutes across 2000 machines.
Key Insight: Chunking solves two problems simultaneously — instant playback for users and parallel encoding for Netflix. The same chunk boundaries that enable streaming also enable massively parallel encoding.
Challenge 6: Fault Tolerant Encoding Pipeline
Interview Question: Your encoding farm has 2000 machines encoding chunks simultaneously. One machine fails mid encoding. 1999 chunks complete successfully but chunk 847 is lost. The entire movie is incomplete. At scale machine failures happen constantly. How do you make the pipeline fault tolerant?
Navigation: The failed chunk needs to be detected and retried on a different machine automatically. This requires a coordinator that tracks the state of every chunk job and reassigns failed jobs without human intervention.
Solution: Kafka job queue with Coordinator Service tracking chunk states.
Every chunk job has a state:
- PENDING — waiting to be picked up by a worker
- IN PROGRESS — currently being encoded by a worker
- COMPLETED — successfully encoded
- FAILED — encoding failed, needs retry
Coordinator Service monitors all job states:
- Job stuck IN PROGRESS too long — worker crashed — set back to PENDING — reassigned to healthy worker
- All 2000 jobs COMPLETED — trigger Merge Service automatically
- Job fails repeatedly — alert engineering team
TTL on the IN PROGRESS state — the same pattern used in the WhatsApp and Stock Exchange designs:
- Worker picks up chunk — job marked IN PROGRESS with TTL of expected encoding time plus buffer
- Worker crashes — TTL expires — job automatically returns to PENDING — reassigned
- No manual intervention needed — pipeline self heals
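The self-healing rule is simple enough to sketch directly. Field names and the TTL value below are illustrative assumptions:

```python
# Minimal sketch of the coordinator's TTL check: any job stuck IN_PROGRESS
# past its TTL is assumed lost and flipped back to PENDING for reassignment.
JOB_TTL_S = 600  # expected encoding time plus buffer (illustrative)

def requeue_stalled(jobs: dict, now: float) -> list:
    """Return ids of jobs returned to PENDING because their worker died."""
    requeued = []
    for job_id, job in jobs.items():
        if job["state"] == "IN_PROGRESS" and now - job["started_at"] > JOB_TTL_S:
            job["state"] = "PENDING"
            requeued.append(job_id)
    return requeued

jobs = {
    846: {"state": "COMPLETED", "started_at": 0},
    847: {"state": "IN_PROGRESS", "started_at": 0},     # worker crashed
    848: {"state": "IN_PROGRESS", "started_at": 1000},  # still healthy
}
print(requeue_stalled(jobs, now=1100))  # [847]
```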
Key Insight: A job queue with explicit state tracking and TTL based failure detection makes distributed encoding pipelines self healing. Individual machine failures never block movie processing.
Challenge 7: Geo Licensing Checks
Interview Question: Netflix operates in 190 countries with different licensing rules per title per country. When a user clicks play Netflix must instantly verify if content is licensed in their country. This check must happen in milliseconds. How do you design it?
Solution: Redis Set per title with DynamoDB fallback and fail open strategy.
Data structure — Redis Set per title:
- Key is title ID
- Value is Redis Set containing all countries where title is licensed
- SISMEMBER titleID countryCode returns true or false in O(1)
- 10000 titles times 190 countries — entirely manageable in Redis memory
Licensing updates — eventual consistency is acceptable:
- Licensing changes happen a few times per day not in milliseconds
- DynamoDB stores licensing rules as source of truth
- Async background job syncs DynamoDB to Redis every few minutes
- Eventual consistency is perfectly fine — nobody needs sub-second licensing propagation
- This is deliberate under-engineering — Kafka streaming for licensing updates would be overkill
Fallback strategy — graceful degradation:
- Redis available — O(1) licensing check — instant response
- Redis down — fall back to DynamoDB — slightly slower but always available
- Both down — fail open — allow playback — minor licensing risk accepted
Fail open philosophy: Netflix prioritizes user experience over minor licensing violations. Blocking 200 million users from watching anything during a 30 second Redis outage causes massive revenue loss and reputational damage. Serving unlicensed content to a small number of users for 30 seconds is an acceptable tradeoff.
This is called Graceful Degradation — system degrades to a slower but functional state rather than failing completely.
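The layered check reads naturally as a try/except chain. This is a sketch under assumptions: the client objects are stand-ins (real code would use redis-py and boto3), and `get_licensing` is a hypothetical helper, not a real DynamoDB API:

```python
# Sketch of the graceful-degradation chain: Redis first, DynamoDB on
# Redis failure, fail open if both layers are down.
def is_licensed(title_id: str, country: str, redis_client, dynamo_client) -> bool:
    try:
        # O(1) set membership: SISMEMBER license:{title_id} {country}
        return redis_client.sismember(f"license:{title_id}", country)
    except Exception:
        pass  # Redis down: degrade to the source of truth
    try:
        item = dynamo_client.get_licensing(title_id)  # hypothetical helper
        return country in item["countries"]
    except Exception:
        return True  # both layers down: fail open, allow playback
```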
Key Insight: Redis Set gives O(1) geo licensing checks. DynamoDB fallback means Redis downtime is never user facing. Fail open philosophy ensures Netflix never goes dark over a licensing check infrastructure failure.
Challenge 8: Personalized Recommendations
Interview Question: Netflix shows every user a completely personalized homepage based on watch history, ratings, similar users, trending content, and time of day patterns. Generating this in real time for 200 million users simultaneously seems impossible. How do you make personalized recommendations appear instantly?
Navigation: Recommendations do not need to be real time. Your taste does not change between 2pm and 2:05pm. Generating recommendations once per day and caching them is indistinguishable from real time generation — but vastly cheaper and faster.
Solution: Offline precomputed recommendations with DynamoDB plus Redis cache.
Offline ML pipeline runs continuously in background:
- Analyzes watch history of all 200 million users
- Runs collaborative filtering — users with similar taste to you watched X
- Runs content based filtering — you liked action movies so here are more
- Computes personalized top 100 recommendations per user per region
- Stores results in DynamoDB
- Invalidates Redis cache so fresh recommendations load on next session
User opens Netflix app:
- Fetch precomputed recommendations from Redis cache — instant O(1) read
- Cache miss — load from DynamoDB — populate Redis — return results
- No ML computation at request time — pure database read
- Homepage loads in under 200ms
Cache invalidation strategy:
- ML pipeline reruns — new recommendations computed — DynamoDB updated
- Redis cache invalidated per user
- Next app open — cache miss — fresh recommendations loaded from DynamoDB — cached again
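The read path above is the classic cache-aside pattern. A minimal sketch, with plain dicts standing in for Redis and DynamoDB:

```python
# Cache-aside read: Redis hit is O(1); on a miss, load the precomputed
# recommendations from DynamoDB and repopulate the cache for next time.
def get_recommendations(user_id: str, redis_cache: dict, dynamo_table: dict) -> list:
    key = f"recs:{user_id}"
    if key in redis_cache:
        return redis_cache[key]           # cache hit: instant read
    recs = dynamo_table.get(user_id, [])  # cache miss: fetch precomputed list
    redis_cache[key] = recs               # repopulate for the next app open
    return recs

cache, table = {}, {"u1": ["Stranger Things", "Dark"]}
print(get_recommendations("u1", cache, table))  # miss -> loaded from table
print(get_recommendations("u1", cache, table))  # hit  -> served from cache
```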
Key Insight: Precomputing recommendations offline transforms an impossibly complex real time ML problem into a trivial database read. User taste changes slowly — daily recomputation is indistinguishable from real time for the user experience.
Full Architecture Summary
Global video delivery — CDN with regional servers, Netflix Open Connect inside ISPs
CDN cache management — Push for blockbusters, pull on demand, dynamic TTL via Kafka
Video chunking — 2 to 10 second chunks for instant playback start
Adaptive bitrate — 5 quality versions per chunk, client side ABR algorithm
Encoding pipeline — Parallel chunk encoding across distributed farm
Fault tolerant encoding — Kafka job queue with coordinator and TTL based failure detection
Geo licensing — Redis Set per title with DynamoDB fallback and fail open strategy
Personalized recommendations — Offline ML pipeline stored in DynamoDB plus Redis cache
Final Thoughts
Netflix is a masterclass in knowing when to compute eagerly and when to compute lazily. Recommendations are precomputed because real time ML at 200 million users is impossible. Chunks are encoded in parallel because sequential encoding is too slow. Licensing checks use eventual consistency because sub-second propagation is unnecessary overkill.
The recurring theme across every layer is that the right architecture matches the actual requirements — not a theoretical ideal. Netflix does not need real time recommendations. It does not need synchronous licensing updates. It does not need to serve the full 100GB file before playback. Recognizing what you do not need is just as important as knowing what you do.
Happy building. 🚀