The Initial Approach :
When Netflix first started handling large amounts of viewing data, they chose Apache Cassandra® for the following reasons:
Apache Cassandra® allows for a flexible structure, where each row can store a growing number of viewing records without performance issues.
Netflix’s system processes significantly more writes (data being stored) than reads (data being retrieved). The ratio is approximately 9:1, meaning for every 9 new records added, only 1 is read. Apache Cassandra® excels in handling such workloads.
Instead of enforcing strict consistency, Netflix prioritizes availability and speed, ensuring users always have access to their watch history, even if data updates take a little longer to sync. Apache Cassandra® supports this tradeoff with eventual consistency, meaning that all copies of data will eventually match up across the system.
See the diagram below, which shows the data model of Apache Cassandra® using column families.
To structure the data efficiently, Netflix designed a simple yet scalable storage model in Cassandra®.
Each user’s viewing history was stored under their unique ID (CustomerId). Every viewing record (such as a movie or TV show watched) was stored in a separate column under that user’s ID. To handle millions of users, Netflix used "horizontal partitioning," meaning data was spread across multiple servers based on CustomerId. This ensured that no single server was overloaded.
Reads and Writes in the Initial System
The diagram below shows how the initial system handled reads and writes to the viewing history data.
Every time a user started watching a show or movie, Netflix added a new column to their viewing history record in the database. If the user paused or stopped watching, that same column was updated to reflect their latest progress.
While storing data was easy, retrieving it efficiently became more challenging as users' viewing histories grew. Netflix used three different methods to fetch data, each with its advantages and drawbacks:
Retrieving the Entire Viewing History: If a user had only watched a few shows, Netflix could quickly fetch their entire history in one request. However, for long-time users with large viewing histories, this approach became slow and inefficient as more data accumulated.
Searching by Time Range: Sometimes, Netflix only needed to fetch records from a specific time period, like a user’s viewing history from the last month. While this method worked well in some cases, performance varied depending on how many records were stored within the selected time range.
Using Pagination for Large Histories: To avoid loading huge amounts of data at once, Netflix used pagination. This approach prevented timeouts (when a request takes too long and fails), but it also increased the overall time needed to retrieve all records.
Top comments (0)