Note: we are not focusing on video streaming, but on the live commenting feature.
Here it goes:
Scope:
• User can comment on the content she is viewing.
• User can view comments from other people commenting on the same content.
Scale Numbers:
• 10K content items per minute.
• 650K comments per minute.
• 100K user views per second.
Clarifying Questions:
Can a user comment only while the stream is live?
(Based on this, the data retention policy can be decided.)
Non-Functional Requirements:
• Highly Scalable
• Highly Available (99.99%)
• Minimum Latency (p99 500 ms)
• Eventual Consistency
API:
• POST /ActivateViewership(userId, contentId)
• POST /DeactivateViewership(userId, contentId)
• POST /Comment(userId, contentId, comment)
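A minimal in-memory sketch of these three endpoints as a service class (the class and method names here are illustrative; a real implementation would persist to the tables described below and push over a real transport):

```python
import time
from collections import defaultdict

class LiveCommentService:
    """Toy stand-in for the three endpoints above (no persistence)."""

    def __init__(self):
        self.viewers = defaultdict(set)    # contentId -> {userId}
        self.comments = defaultdict(list)  # contentId -> [(userId, text, ts)]

    def activate_viewership(self, user_id, content_id):
        self.viewers[content_id].add(user_id)

    def deactivate_viewership(self, user_id, content_id):
        self.viewers[content_id].discard(user_id)

    def comment(self, user_id, content_id, text):
        self.comments[content_id].append((user_id, text, time.time()))
        # Return the active viewers who should receive the broadcast.
        return self.viewers[content_id] - {user_id}
```

The `comment` call returns the set of viewers to fan out to, which is where the pull-versus-push decision below comes in.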
PULL Model:
This model does HTTP polling at a given interval to fetch the comments for the pertinent content.
It would not give the user a real-time experience; also, if there are no new comments, we exhaust HTTP calls doing nothing.
Shrinking the polling interval to roughly 5 seconds or less would increase the server load drastically.
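One poll step can be sketched like this (`fetch` stands in for the HTTP call; all names are illustrative):

```python
def poll_once(fetch, content_id, last_seen_ts):
    """Fetch comments and keep only the ones newer than last_seen_ts.

    On a quiet stream this returns an empty list most of the time,
    which is exactly the wasted work described above.
    """
    comments = fetch(content_id)
    new = [c for c in comments if c["ts"] > last_seen_ts]
    next_ts = max((c["ts"] for c in new), default=last_seen_ts)
    return new, next_ts

# Client loop sketch:
#   while viewing:
#       new, last_seen_ts = poll_once(http_get_comments, content_id, last_seen_ts)
#       render(new)
#       time.sleep(5)  # shrinking this interval multiplies server load
```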
PUSH Model:
User lands on the content -> store the user's viewership info in the DB -> get the viewership info for that content -> broadcast to the respective viewers.
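The steps above can be sketched with the transport abstracted as a per-user send callback (in practice the callback would write to a WebSocket or similar push channel; all names are illustrative):

```python
from collections import defaultdict

class Broadcaster:
    def __init__(self):
        # Viewership info keyed by content (in-memory stand-in for the DB).
        self.viewers = defaultdict(dict)   # contentId -> {userId: send_fn}

    def on_view(self, user_id, content_id, send_fn):
        # User lands on the content: store the viewership info.
        self.viewers[content_id][user_id] = send_fn

    def on_leave(self, user_id, content_id):
        self.viewers[content_id].pop(user_id, None)

    def on_comment(self, author_id, content_id, text):
        # Get the viewership info for this content, broadcast to the others.
        delivered = 0
        for user_id, send in self.viewers[content_id].items():
            if user_id != author_id:
                send(text)
                delivered += 1
        return delivered
```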
Data Modelling:
Content_viewership:
Columns: ContentId, UserId, CreatedTime, IsActive
ContentId should be indexed, as most queries will be on this column.
UserId can also be indexed: when a user leaves the live commenting panel, we need to update IsActive = false for that user's row.
(We can delete the inactive records from the main table and move them to HDFS or another file system
if future auditing or analytics are required; this should be clarified.)
Content_Comments
Columns: CommentId, ContentId, UserId, Comment, CreatedTime, IsActive
In this table too, we need to index ContentId and UserId (the user who makes the comment).
The IsActive flag lets us move deactivated data to a file system and free up the main table.
Calculation:
Compute:
W : 10K QPS
R : 100K QPS
The commenting (write) rate is significantly lower than the viewing (read) rate.
Storage:
1 comment record ≈ 3 KB (viewership + comment)
Total: 3 KB × 650K per minute
Approx: 2 GB per minute
But if we are deleting the inactive rows, we can assume there would be negligible net growth in the DB.
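A quick back-of-envelope check of the numbers above:

```python
comments_per_min = 650_000
bytes_per_comment = 3 * 1024           # ~3 KB per comment (viewership + comment)

per_minute = comments_per_min * bytes_per_comment
gib_per_minute = per_minute / 1024**3  # ≈ 1.86 GiB, i.e. roughly 2 GB per minute
```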
High Level Design:
In addition to this, *Write Locally and Read Globally* can be discussed during an interview; the concept is very well described here: https://engineering.fb.com/2011/02/07/core-data/live-commenting-behind-the-scenes/
That is a lot of writing already, so I am putting up the High Level Diagram. The scale, message queue, and caching details are fairly common.
If you see any bottleneck or have any suggestion, feel free to leave a comment.
Top comments (4)
Thank you for this great content. Would you please explain the indexing on ContentId and UserId? How is it done, and how does it help this design?
Thanks for the comment.
While broadcasting comments to the viewers of a video, viewership information needs to be queried per content. That's why the index on ContentId is important.
Also, when you scroll down (or move out of the comment panel) and leave a video, you are not supposed to see the comments anymore.
In that case there can be an update query that deactivates your viewership info:
UPDATE Viewership_Info SET IsActive = false Where ContentId = 1 AND UserId = 1
This query will be efficient if we index on both ContentId and UserId.
Now, how can it be done?
In an RDBMS scenario, we can create a non-clustered index on both columns.
But since we have very frequent writes, over-indexing could hurt write performance; we can consider NoSQL solutions while keeping the indexing requirement intact.
Let me do some reading on this; I will get back soon on the NoSQL database selection for this scenario.
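As a concrete illustration of the composite index and the deactivation query, here is a small sqlite3 sketch (sqlite does not distinguish clustered from non-clustered indexes, so a plain composite index stands in; table and column names follow the query above, the index name is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Viewership_Info (
    ContentId INTEGER, UserId INTEGER, CreatedTime TEXT, IsActive INTEGER)""")
# Composite index covering the WHERE clause of the deactivation query.
conn.execute("CREATE INDEX idx_content_user ON Viewership_Info (ContentId, UserId)")
conn.execute("INSERT INTO Viewership_Info VALUES (1, 1, '2024-01-01', 1)")
conn.execute("UPDATE Viewership_Info SET IsActive = 0 WHERE ContentId = 1 AND UserId = 1")
row = conn.execute(
    "SELECT IsActive FROM Viewership_Info WHERE ContentId = 1 AND UserId = 1").fetchone()
```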
Thank you for your response. I am very new to the system design area; I learned a lot from your post. Are you using WebSockets in this design? Does it mean that while a user is viewing content, a hash table should store the userID-to-serverID mapping?
I was thinking that with NoSQL, we would have to keep two key-value datastores:
1) userID (key) : a set data structure of postIDs -- O(1) insert and delete
2) postID (key) : a set data structure of userIDs -- O(1) insert and delete
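These two stores can be mocked with in-memory sets (a production key-value store with native set support, e.g. Redis sets, is an assumption here; it would give the same O(1) behavior):

```python
from collections import defaultdict

user_to_posts = defaultdict(set)   # userID -> {postID}
post_to_users = defaultdict(set)   # postID -> {userID}

def start_viewing(user_id, post_id):
    user_to_posts[user_id].add(post_id)      # O(1) insert
    post_to_users[post_id].add(user_id)      # O(1) insert

def stop_viewing(user_id, post_id):
    user_to_posts[user_id].discard(post_id)  # O(1) delete
    post_to_users[post_id].discard(user_id)  # O(1) delete
```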
Very informative content