DEV Community

Cover image for Netflix System Design- Backend Architecture
Elegberun Olugbenga
Elegberun Olugbenga

Posted on • Updated on

Netflix System Design- Backend Architecture

Cover Photo by Alexander Shatov on Unsplash

Netflix accounts for about 15% of the world's internet bandwidth traffic, serving over 6 billion hours of content per month to nearly every country in the world. Building a robust, highly scalable, reliable, and efficient backend system is no small engineering feat, but the ambitious team at Netflix has proven that problems exist to be solved.

This article analyzes the Netflix system architecture as researched from online sources. Section 1 provides a simplified overview of the Netflix system. Section 2 provides an overview of the backend architecture, and section 3 provides a detailed look at the individual system components. For a complete guide on modern system design, you can try grokking the modern system design for engineers course.

1. Overview

Netflix operates in two clouds Amazon Web Services and Open Connect(Netflix content delivery network).
Netflix-High-Level-System-Architecture

The overall Netflix system consists of three main parts.

  • Open Connect Appliances(OCA) - Open Connect is Netflix’s custom global content delivery network(CDN). These OCA servers are placed inside internet service providers (ISPs) and internet exchange locations (IXPs) networks around the world to deliver Netflix content to users.

  • Client — A client is any device from which you play Netflix videos. This consists of all the applications that interface with the Netflix servers.

Netflix supports many different devices, including smart TVs, Android and iOS platforms, gaming consoles, etc. All these apps are written using platform-specific code. The Netflix web app is written using reactJS, which was influenced by several factors, some of which include startup speed, runtime performance, and modularity.

  • Backend - This includes databases, servers, logging frameworks, application monitoring, recommendation engine, background services, etc... When the user loads the Netflix app, all requests are handled by the backend server in AWS Login, recommendations, the home page, users history, billing, customer support. Some of these backend services include (AWS EC2 instances, AWS S3, AWS DynamoDB, Cassandra, Hadoop, Kafka, etc).

2. Backend Architecture

Netflix is one of the major drivers of microservices architecture. Every component of their system is a collection of loosely coupled services that collaborate. The microservice architecture enables the rapid, frequent, and reliable delivery of large, complex applications. The figure below is an overview of the backend architecture.
backend

Backend Architecture
  1. The Client sends a Play request to a Backend running on AWS. Netflix routes traffic to its services using Amazon's Elastic Load Balancer (ELB) service.
  2. AWS ELB will forward that request to the API Gateway Service. Netflix uses Zuul as its API gateway, which is built to allow dynamic routing, traffic monitoring, security, and resilience to failures at the edge of cloud deployment.
  3. The application API component is the core business logic behind Netflix's operations. Several types of API correspond to different user activities, such as the Signup API and the Discovery/Recommendation API for retrieving video recommendations. In this scenario, the forwarded request from the API Gateway Service is handled by the Play API.
  4. Play API will call a microservice or a sequence of microservices to fulfill the request.
  5. Microservices are mostly stateless small programs. Thousands of these services can communicate with each other.
  6. Microservices can save or get data from a data store during this process.
  7. Microservices can send events to track user activities or other data to the Stream Processing Pipeline for either real-time processing of personalized recommendations or batch processing of business intelligence tasks.
  8. The Stream Processing Pipeline data can be persistented to other data stores such as AWS S3, Hadoop HDFS, Cassandra, etc.

3. Backend Components

Open Connect

Open Connect handles everything that happens after you hit play on a video. This system is responsible for streaming video to your device. The following diagram illustrates how the playback process works.
PlayBlackProcess

Open Connect Design Image
  1. OCAs ping AWS instances to report their health, the routes they have learned, and the files they have on them.
  2. A user on a client device requests playback of a title (TV show or movie) from the Netflix application in AWS.
  3. The Netflix playback service checks for the user's authorization, permission, and licensing, then chooses which files to serve the client taking into account the current network speed and client resolution.
  4. The steering service picks the OCA from which the files should be served, generates URLs for these OCAs, and hands them back to the playback service.
  5. The playback service hands over the URLs of the OCA to the client, and the client requests the video files from that OCA.

Zuul2-API GATEWAY

Netflix uses Amazon's Elastic Load Balancer (ELB) service to route traffic to services. ELBs are set up such that the load is balanced across zones first, then instances.
AmazonELB

Amazon Elastic Load Balancer

This load balancer routes requests to the API gateway service; Netflix uses Zuul as its API gateway; it handles all the requests and performs the dynamic routing of microservice applications. It works as a front door for all the requests.

For Example, /api/products is mapped to the product service, and /api/user is mapped to the user service. The Zuul Server dynamically routes the requests to the respective backend applications. Zuul provides a range of different types of filters that allow them to quickly and nimbly apply functionality to the edge service.

The Cloud Gateway team at Netflix runs and operates more than 80 clusters of Zuul 2, sending traffic to about 100 (and growing) backend service clusters which amount to more than 1 million requests per second.

open-sourcing-zuul-2
zuul2

Zuul Architecture

The Netty handlers on the front and back of the filters are mainly responsible for handling the network protocol, web server, connection management, and proxying work. With those inner workings abstracted away, the filters do all the heavy lifting.

  • The inbound filters run before proxying the request and can be used for authentication, routing, or decorating the request.
  • The endpoint filters can either return a static response or proxy the request to the backend service. The outbound filters run after a response has been returned and can be used to add or remove custom headers or metrics.

The Zuul 2 Api gateway forwards the request to the appropriate Application API.

Application API

Currently, the Application APIs are defined under three categories: Signup API -for non-member requests such as sign-up, billing, free trial, etc., Discovery API-for search, recommendation requests, and Play API- for streaming, view licensing requests, etc. When a user clicks signup, for example, Zuul will route the request to the Signup API.

If you consider an example of an already subscribed user. Supposing the user clicks on play for the latest episode of Peaky Blinders, the request will be routed to the playback API. The API, in turn, calls several microservices under the hood. Some of these calls can be made in parallel because they don’t depend on each other. Others have to be sequenced in a specific order. The API contains all the logic to sequence and parallelize the calls as necessary. The device, in turn, doesn’t need to know anything about the orchestration that goes on under the hood when the customer clicks “play.”
Api archi

Netflix API Architecture

Signup requests map to signup backend services, Playback requests, with some exceptions, map only to playback backend services, and similarly, discovery APIs map to discovery services.

Hystrix- Distributed API Services Management

Hystrix
Hystrix

Hystrix Architecture

In any distributed environment (with a lot of dependencies), inevitably, some of the many service dependencies will fail. It can be unmanageable to monitor the health and state of all the services as more and more services will be stood up, and some services may be taken down or simply broken down. Hystrix comes with help by providing a user-friendly dashboard. Hystrix library is used to control the interaction between these distributed services by adding some latency tolerance and fault tolerance logic.

Consider this example from Netflix: They have a microservice that provides a tailored list of movies back to the user. If the service fails, they reroute the traffic to circumvent the failure to another vanilla microservice that simply returns the top 10 family-friendly movies. So they have this safe failover that they can go to, and that is the classic example of the first circuit breaking.

Note:

Netflix Hystrix is no longer in active development and is currently in maintenance mode. Some internal projects are currently being built with resilience4j

https://github.com/resilience4j/resilience4j

Titus- Container Management

Titus
Titus is a container management platform that provides scalable and reliable container execution and cloud-native integration with Amazon AWS.
titus-arch

Titus Architecture

It is a framework on top of Apache Mesos, a cluster management system that brokers available resources across a fleet of machines.
Titus is run in production at Netflix, managing thousands of AWS EC2 instances and launching hundreds of thousands of containers daily for both batch and service workloads. Just think of it as the Netflix version of Kubernetes.

Titus runs about 3 million containers per week.

Datastores

EVCache

A cache's primary purpose is to increase data retrieval performance by reducing the need to access the underlying slower storage layer. Trading off capacity for speed, a cache typically stores a subset of data transiently.
EVCache

Two use cases for caching is to:

  • Provides fast access to frequently stored data.
  • Provides fast access to computed(memoized) data. Netflix's microservices rely on caches for fast, reliable access to multiple types of data like a member’s viewing history, ratings, and personalized recommendations.
    EVCache
    EVCache Diagram

EVCache is a Memcached and spymemcached-based caching solution mainly used for caching frequently used data on AWS EC2 infrastructure.
EVCache is an abbreviation for:

  • Ephemeral - The data stored is for a short duration as specified by its TTL (Time To Live).
  • Volatile - The data can disappear at any time (Evicted).
  • Cache - An in-memory key-value store.

SSDs for Caching

Traditionally, caching is done on RAM. Storing large amounts of data on RAM is expensive, so Netflix decided to move some caching data to SSD.

Modern disk technologies based on SSD are providing fast access to data but at a much lower cost when compared to RAM. The cost to store 1 TB of data on SSD is much lower than storing the same amount using RAM.

Evolution of application Data Caching

MySQL

Netflix uses AWS EC2 instances of MYSQL for its Billing infrastructure. Billing infrastructure is responsible for managing Netflix members' billing states. This includes keeping track of open/paid billing periods, the amount of credit on the member’s account, managing the member's payment status, initiating charge requests, and what date the member has paid through.

The payment processor needed the ACID capabilities of an RDBMS to process charge transactions.
Netflix Datastore

Netflix Datastore

Apache Cassandra

Cassandra is a free and open-source distributed wide-column store. The NoSQL database is designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure.

Netflix uses Cassandra for its scalability, lack of single points of failure, and cross-regional deployments. ” In effect, a single global Cassandra cluster can simultaneously service applications and asynchronously replicate data across multiple geographic locations.

Netflix stores all kinds of data across its Cassandra DB instances, including all user-collected event metrics.

As user data began to increase, there needed to be a more efficient way to manage data storage. Netflix Redesigned data storage architecture with two main goals in mind:

  • Smaller Storage Footprint.
  • Consistent Read/Write Performance as viewing per member grows.

The solution to the large data problem was to compress the old rows. Data were divided into two types:

  • Live Viewing History (LiveVH): Small number of recent viewing records with frequent updates. The data is stored in uncompressed form.
  • Compressed Viewing History (CompressedVH): A large number of older viewing records with rare updates. The data is compressed to reduce the storage footprint. Compressed viewing history is stored in a single column per row key.
    CompressedVH
    Compressed Viewing History

Stream Processing Pipeline

Did you know that Netflix personalizes movie artwork just for you? You might be surprised to learn the image shown for each video is selected specifically for you. Not everyone sees the same image.
stranger-things
Netflix tries to select artwork that highlights the most relevant aspect of a video based on the data it has learned about you, such as your viewing history and interests.

Stream Processing Data Pipeline has become Netflix’s data backbone of business analytics and personalized recommendation tasks. It is responsible for producing, collecting, processing, aggregating, and moving all microservice events to other data processors in near real-time.

Streaming data is data that is generated continuously by thousands of data sources, which typically send in the data records simultaneously and in small sizes (order of Kilobytes). Streaming data includes a wide variety of data such as log files generated by customers using your mobile or web applications, e-commerce purchases, in-game player activity, information from social networks, financial trading floors, or geospatial services, and telemetry from connected devices or instrumentation in data centers.

AWS- What is streaming Data?

This data needs to be processed sequentially and incrementally on a record-by-record basis or over sliding time windows and used for a wide variety of analytics, including correlations, aggregations, filtering, and sampling.

Information derived from such analysis gives companies visibility into many aspects of their business and customer activity, such as service usage (for metering/billing), server activity, website clicks, and the geo-location of devices, people, and physical goods, and enables them to respond promptly to emerging situations. For example, businesses can track changes in public sentiment on their brands and products by continuously analyzing social media streams and responding promptly as necessary.

The stream processing platform processes trillions of events and petabytes of data per day.
streaming

The Viewing History Service captures all the videos played by members. Beacon is another service that captures all impression events and user activities within Netflix. All the data collected by the Viewing History and Beacon services is sent to Kafka.

Apache Kafka- Analyzing Streaming Data

Kafka is open-source software that provides a framework for storing, reading, and analyzing streaming data.

Netflix embraces Apache Kafka® as the de-facto standard for its eventing, messaging, and stream processing needs. Kafka acts as a bridge for all point-to-point and Netflix Studio-wide communications.

How Netflix uses Kafka
Kafka

Apache Chukwe- Analyzing Streaming Data

Apache Chukwe is an open-source data collection system that collects logs or events from a distributed system. It is built on top of HDFS and Map-reduce framework. It comes with Hadoop’s scalability and robustness features. It includes a lot of powerful and flexible toolkits to display, monitor, and analyze data. Chukwe collects the events from different parts of the system; From Chukwe, you can do monitoring and analysis, or you can use the dashboard to view the events. Chukwe writes the event in the Hadoop file sequence format (S3).

Apache Spark - Analyzing Streaming Data

Netflix uses Apache Spark and Machine learning to recommend movies. Apache Spark is an open-source unified analytics engine for large-scale data processing.

On a live user request, the aggregated play popularity(how many times a video is played) and take rate(Fraction of play events over impression events for a given video) data, along with other explicit signals such as members’ viewing history and past ratings, are used to compute personalized content for the user. The following figure shows the end-to-end infrastructure for building user movie recommendations.
data-processor-engine

Data Processing Engine

Elastic Search - Error Logging and Monitoring

Netflix uses elastic search for data visualization, customer support, and error detection in the system.

Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents.

With elastic search, they can easily monitor the state of the system and troubleshoot error logs and failures.

Conclusion

This article provides a detailed analysis of the Netflix Backend architecture. To test your knowledge of Netflix system design, try out this interactive Quiz on Netflix's Design. For more information, refer to the references in the section below.

If you're looking for a detailed guide to system design, check out this system design interview prep.

Follow me here and across my social media for more content like this Linkedin. Twitter

REFERENCES

  1. Titus
  2. Open Sourcing Zuul
  3. Open Connect
  4. How Netflix Backend Operates
  5. A Design Analysis Of Cloud-Based Microservices Architecture At Netflix
  6. Netflix System Design
  7. Hystrix What is it and Why is it Used
  8. Engineering Tradeoffs and the Netflix API Re-architecture
  9. Application Data Caching Using ssds
  10. Evolution Of Application Data Caching From Ram To ssd
  11. Netflix Billing Migration To Aws
  12. Optimizing The Netflix Api
  13. What's Trending On Netflix
  14. Evolution Of The Netflix Data Pipeline
  15. Zuul API Gateway
  16. CAP_theorem
  17. What Happens When You Press Play
  18. Scaling Time Series Data Storage Part1
  19. Netflix-High-Level-System-Architecture
  20. How-Kafka-is-used-by-Netflix
  21. Netflix Consumes 15% of the World's Internet Bandwidth
  22. Top 14 System Design interview questions for software engineers
  23. Quiz on Netflix System Design

Top comments (20)

Collapse
 
codingjlu profile image
codingjlu

Thank you so much for sharing! I'm wondering, where do you get all this data from?

Collapse
 
gbengelebs profile image
Elegberun Olugbenga

You're welcome. Its mostly from Netflix's engineering blog

Collapse
 
couchcamote profile image
couchcamote

Thanks for this compilation. I was looking before for one page "summary" of Netflix's architecture. I was reading from different sources and had not enough time or patience of doing it like this.

It would also be cool to have a some sort of something like this for Netflix's resiliency designs and tools like Chaos Monkey. But maybe leave it to someone else who has time.

Collapse
 
gbengelebs profile image
Elegberun Olugbenga

Thank you for your feedback couchcamote, Chaos monkey deserves its own seperate article but I think I'd leave it for someone else 😅

Collapse
 
fcpauldiaz profile image
Pablo Díaz

How do they test their work? they have a dev environment?

Collapse
 
dino16m profile image
Dino16m

I'm interested in knowing this as well. The system has so many moving parts.

Collapse
 
gbengelebs profile image
Elegberun Olugbenga

Netflix practices something called Chaos Engineering.
You can check out this video for more info.
youtu.be/3WRVgC8SiGc

“Chaos engineering…is the discipline of experimenting on production to find vulnerabilities in the system before they render it unusable for your customers. We do this at Netflix through a tool that we call ChAP…[It] can catch vulnerabilities, and allows users to inject failures into services and prod that validate their assumptions about those services before they become full-blown outages.”

Collapse
 
gbengelebs profile image
Elegberun Olugbenga

Yeah, I guess they have they also carry out tests in production. Check out this video.
youtu.be/3WRVgC8SiGc

Collapse
 
masonsansa profile image
Sansa Mason

Great to know about the complete working architecture of Netflix with end to end strategies applied for it. Its really a special treat for the business peoples to know about Netflix. Try to update How does Netflix make money over here, It will add some additional information in this sections. Thanks.

Collapse
 
itsahsanmangal profile image
Ahsan Mangal 👨🏻‍💻

I learned a lot from reading it. However, I wish the post had more details on some topics, such as how Netflix handles failures and redundancy, how Netflix optimizes video quality for different devices and bandwidths, and how Netflix uses AWS and Open Connect clouds. It would also be helpful if the post had some references or sources for the information presented. Overall, a good introduction to Netflix’s backend architecture.

Collapse
 
morpheus04 profile image
morpheus04

fantastic article. totally enjoyed the read! learnt a new thing too - "Not everyone sees the same image." wow really didn't realize that but now I do

Collapse
 
gbengelebs profile image
Elegberun Olugbenga

Thank you morpheus. Glad you enjoyed it. I was wowed when I realized that too. So many amazing things going on behind the scenes.

Collapse
 
kef profile image
Kef

Great series, i learned a lot. Thank you for sharing. You really did your homework on this. Keep them coming.

Collapse
 
gbengelebs profile image
Elegberun Olugbenga

Thank you Kef, Glad you enjoyed it. Definitely more coming

Collapse
 
mobiletonster profile image
tony spencer

This is a great article. Nice job! No wonder my Netflix rates keep going up….lol

Collapse
 
gbengelebs profile image
Elegberun Olugbenga

Lool, I guess that's the price we pay for their good work 😅

Collapse
 
kingroyaltea profile image
KingRoyaltea

Good day elegberun can you please help me with an asp.net application

Collapse
 
csanni profile image
csanni • Edited

Thanks for aggregating things at one blog ! Great effort !

Collapse
 
sourjaya profile image
Sourjaya Das

Great Read. Motivated me to write about another system and its design which is not properly documented under one post.

Collapse
 
ironbats profile image
ironbats

Amazing