Design Twitter search:
Use cases:
Search for tweets
Search for users
Search for trending topics
Functional requirements:
Search for tweets
Non-Functional requirements:
The system should be highly available (a search should always return some results).
Design Constraints:
1B total users
500M are active users
100M tweets every day are created
Each tweet is ~100 bytes on average
100M tweets * 100 bytes = 10GB/day of new storage
10GB / (24*60*60 seconds) ≈ 115KB/second of incoming write bandwidth
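As a quick sanity check, here is a small Python sketch of the back-of-envelope numbers above (the tweet volume and average size are the assumptions stated in the constraints):

```python
# Back-of-envelope figures; tweet volume and average size are the
# assumptions stated in the design constraints above.
TWEETS_PER_DAY = 100_000_000            # 100M new tweets per day
AVG_TWEET_SIZE_BYTES = 100              # ~100 bytes of text per tweet

daily_storage = TWEETS_PER_DAY * AVG_TWEET_SIZE_BYTES
print(daily_storage / 10**9)            # -> 10.0 GB/day of new tweet text

write_bandwidth = daily_storage / (24 * 60 * 60)
print(write_bandwidth / 10**3)          # -> ~115.7, i.e. ~115 KB/second incoming
```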
Api Design:
search(api_dev_key, token, noOfResults, rankScheme, timestamp);
api_dev_key: the developer key of the authenticated user, used to identify and throttle the caller
token: the search word (or phrase)
noOfResults: maximum number of results to return to the user
rankScheme: how the results should be ranked, e.g. by likes, popularity, etc.
timestamp: time of the search query
return: a JSON response containing a list of tweet details: tweetId, likes, timestamp of the tweet, and other metadata
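As an illustration only, a minimal Python sketch of what such a search endpoint could look like. The parameter names follow the signature above; the helper stubs for throttling, the index, and the tweet store are placeholders, not a real implementation:

```python
import json
import time

def _is_valid_key(api_dev_key):          # placeholder throttling/auth check
    return bool(api_dev_key)

def _index_lookup(token, limit, rank):   # placeholder for the inverted index
    return []                            # would return ranked TweetIDs

def _fetch_tweet_details(tweet_ids):     # placeholder for the tweet store
    return []                            # would return (id, likes, ts, text) tuples

def search(api_dev_key, token, noOfResults=20, rankScheme="popularity",
           timestamp=None):
    if not _is_valid_key(api_dev_key):
        return json.dumps({"error": "invalid or throttled api_dev_key"})
    timestamp = timestamp or int(time.time())
    tweet_ids = _index_lookup(token, limit=noOfResults, rank=rankScheme)
    results = [{"tweetId": tid, "likes": likes, "timestamp": ts, "text": text}
               for (tid, likes, ts, text) in _fetch_tweet_details(tweet_ids)]
    return json.dumps({"query": token, "timestamp": timestamp, "results": results})
```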
High-level design:
Storage estimation:
10GB/day, i.e. 10GB * 30 days * 12 months = 3.6TB per year
Let's say we want to store the tweets for the next 5 years, then 3.6TB * 5 = 18TB
Let's say we never want to be more than 80% full at any time, i.e. 18TB / 0.8 = 22.5TB
Let's say the data is duplicated once, then the total storage will be 22.5TB * 2 = 45TB
If a modern server can have a storage capacity of 4TB, then we will need about 12 such servers.
Let's say we are using a MySQL database to store the tweet objects in a table with two columns: tweetId and tweet (the text of the tweet).
Let's say we want tweetId to be a unique number; total tweets per day are 100M, so over 5 years we get 100M * 365 * 5 = 182,500M = 182.5B tweets.
To uniquely identify 182.5B tweets we need log2(182.5B) bits, which is close to 38 bits, i.e. 5 bytes per tweetId.
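A short sketch that reproduces the storage and TweetID-sizing arithmetic above (same assumptions: 10GB/day, 5 years of retention, 80% maximum utilisation, one duplicate copy, 4TB servers):

```python
import math

per_year_tb = 10 * 30 * 12 / 1000        # 3.6 TB/year of new tweet text
five_year_tb = per_year_tb * 5           # 18 TB for 5 years
with_headroom = five_year_tb / 0.8       # 22.5 TB at 80% max utilisation
with_duplicate = with_headroom * 2       # 45 TB with one extra copy
servers = math.ceil(with_duplicate / 4)  # 12 servers of 4 TB each

tweets_in_5_years = 100_000_000 * 365 * 5            # 182.5B tweets
id_bits = math.ceil(math.log2(tweets_in_5_years))    # 38 bits -> fits in 5 bytes
```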
Index:
Let us think of generating indexes based on the words in the tweets, so each word will be mapped to a list of tweetIds of the tweets that contain it.
Let us say we are trying to generate indexes for the most popular English words and nouns like names of famous people and places, etc.
Let us say we have 300K such English words and 200k such nouns, it will be a total of 500K words.
Let us assume each word is 5 bytes (or has 5 chars, 1 byte each); the total storage for the words will be 500K * 5 bytes = 2.5MB.
Let us say we want to index tweets of only the past 2 years; since we get 182.5B tweets in five years, that gives 182.5B * 2/5 = 73B tweets.
That makes a total of 73B * 5 bytes (since each tweetId is 5 bytes) = 365GB.
Let us say on average each tweet has 40 words, and that we are not indexing prepositions and other small words; after removing them, roughly 15 words per tweet need to be indexed. This means each TweetID will be stored 15 times in our index, which makes a total of 365GB * 15 + 2.5MB ≈ 5.5TB.
Let us assume that the memory capacity of each index server is 144GB; then we will need around 38 such index servers to hold the index.
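The same index-sizing arithmetic as a quick sketch (500K words, 5-byte words and TweetIDs, 2 years of tweets, 15 indexed words per tweet, 144GB index servers):

```python
words = 500_000
vocab_bytes = words * 5                      # 2.5 MB for the word list itself
tweets_2_years = int(182.5e9 * 2 / 5)        # 73B tweets indexed
ids_bytes = tweets_2_years * 5               # 365 GB if each TweetID appeared once
index_bytes = ids_bytes * 15 + vocab_bytes   # ~5.5 TB with 15 indexed words per tweet
index_servers = index_bytes / (144 * 10**9)  # ~38 servers of 144 GB each
```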
So our index would be like a big distributed hash table, where ‘key’ would be the word and ‘value’ would be a list of TweetIDs of all those tweets that contain that word.
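A minimal in-memory sketch of such an inverted index; the stop-word list and the single-machine dictionary are simplifications, and a real deployment would shard this table across index servers as discussed next:

```python
from collections import defaultdict

class InvertedIndex:
    """word -> set of TweetIDs containing that word."""
    def __init__(self, stop_words=None):
        self.index = defaultdict(set)
        # tiny illustrative stop-word list (prepositions etc. are not indexed)
        self.stop_words = stop_words or {"a", "an", "the", "of", "in", "on", "to", "was"}

    def add_tweet(self, tweet_id, text):
        for word in text.lower().split():
            if word not in self.stop_words:
                self.index[word].add(tweet_id)

    def search(self, word):
        return self.index.get(word.lower(), set())

# Example usage:
idx = InvertedIndex()
idx.add_tweet(1, "The game in Paris was amazing")
idx.add_tweet(2, "Paris fashion week starts today")
print(idx.search("Paris"))   # -> {1, 2}
```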
We can shard our data based on two criteria:
Sharding based on words: While building our index, we will iterate through all the words of a tweet and calculate the hash of each word to find the server where it will be indexed. To find all tweets containing a specific word, we only have to query the server that holds that word.
We have a couple of issues with this approach:
- What if a word becomes hot? Then there will be a lot of queries on the server holding that word. This high load will affect the performance of our service.
- Over time, some words can end up storing a lot of TweetIDs compared to others, therefore, maintaining a uniform distribution of words while tweets are growing is quite tricky. To recover from these situations we either have to repartition our data or use consistent hashing.
Sharding based on the tweet object: While storing, we will pass the TweetID to our hash function to find the server and index all the words of the tweet on that server. While querying for a particular word, we have to query all the servers, and each server will return a set of TweetIDs. A centralized server will aggregate these results to return them to the user.
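A rough sketch of the two sharding schemes; the `servers` objects and their `lookup` method are assumed placeholders for the index servers:

```python
import hashlib

NUM_SHARDS = 38   # e.g. the ~38 index servers estimated earlier

def shard_for(key):
    """Stable hash of a word or TweetID onto one of the index servers."""
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Scheme 1: sharding by word -- each word lives on exactly one server,
# so a query touches a single shard.
def query_by_word_sharding(word, servers):
    return servers[shard_for(word)].lookup(word)

# Scheme 2: sharding by TweetID -- any shard may hold tweets containing
# the word, so the aggregator scatter-gathers across all shards.
def query_by_tweet_sharding(word, servers):
    results = set()
    for server in servers:
        results |= server.lookup(word)   # each shard returns its TweetIDs
    return results
```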
Fault tolerance:
What will happen when an index server dies? We can have a secondary replica of each server and if the primary server dies it can take control after the failover. Both primary and secondary servers will have the same copy of the index.
What if both primary and secondary servers die at the same time? We have to allocate a new server and rebuild the same index on it. How can we do that? We don’t know what words/tweets were kept on this server. If we were using ‘Sharding based on the tweet object’, the brute-force solution would be to iterate through the whole database and filter TweetIDs using our hash function to figure out all the required tweets that would be stored on this server. This would be inefficient and also during the time when the server was being rebuilt we would not be able to serve any query from it, thus missing some tweets that should have been seen by the user.
How can we efficiently retrieve a mapping between tweets and index servers? We have to build a reverse index that maps every TweetID to its index server. Our Index-Builder server can hold this information. We will need to build a Hashtable where the ‘key’ is the index-server number and the ‘value’ is a HashSet containing all the TweetIDs kept on that index server. Notice that we are keeping the TweetIDs in a HashSet; this enables us to add/remove tweets from our index quickly. So now, whenever an index server has to rebuild itself, it can simply ask the Index-Builder server for all the tweets it needs to store and then fetch those tweets to build its index. This approach will surely be fast. We should also have a replica of the Index-Builder server for fault tolerance.
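A minimal sketch of the Index-Builder bookkeeping described above (index-server number mapped to a HashSet of TweetIDs), assuming a simple in-memory table:

```python
from collections import defaultdict

class IndexBuilder:
    """Reverse mapping: index-server number -> set of TweetIDs stored there."""
    def __init__(self):
        self.server_to_tweets = defaultdict(set)

    def record(self, server_id, tweet_id):
        self.server_to_tweets[server_id].add(tweet_id)

    def remove(self, server_id, tweet_id):
        self.server_to_tweets[server_id].discard(tweet_id)

    def tweets_for(self, server_id):
        """Called when an index server (and its replica) must be rebuilt."""
        return self.server_to_tweets[server_id]
```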
Caches:
We can keep hot tweets in an in-memory cache in front of the database, with an LRU eviction policy. Based on clients' usage patterns, we can decide how many cache instances we need.
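A minimal LRU cache sketch for hot tweets, assuming an in-memory store keyed by TweetID; the capacity would be tuned from the observed usage pattern:

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()          # tweet_id -> tweet object

    def get(self, tweet_id):
        if tweet_id not in self.items:
            return None                     # cache miss -> caller falls back to the DB
        self.items.move_to_end(tweet_id)    # mark as most recently used
        return self.items[tweet_id]

    def put(self, tweet_id, tweet):
        self.items[tweet_id] = tweet
        self.items.move_to_end(tweet_id)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used tweet
```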
Load balancing: We can place load balancers between the client app and the Twitter search servers, between the search servers and the cache, and between the search servers and the backend (index) servers.
We can use round robin for distributing requests. It can stop sending requests to servers that are down; the drawback is that it does not take the current load on a server into account and will keep sending requests even to a server that is overwhelmed, so we might want a more intelligent, load-aware LB.
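A simple round-robin load-balancer sketch that skips servers marked as down; as noted above, it is not load-aware:

```python
import itertools

class RoundRobinLB:
    def __init__(self, servers):
        self.servers = servers                   # list of server identifiers
        self.healthy = set(servers)
        self._cycle = itertools.cycle(servers)

    def mark_down(self, server):
        self.healthy.discard(server)

    def mark_up(self, server):
        self.healthy.add(server)

    def next_server(self):
        # visit each server at most once per call, skipping unhealthy ones
        for _ in range(len(self.servers)):
            candidate = next(self._cycle)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy servers available")
```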
Ranking:
How about if we want to rank the search results by social graph distance, popularity, relevance, etc?
Let’s assume we want to rank tweets by popularity, like how many likes or comments a tweet is getting, etc. In such a case, our ranking algorithm can calculate a ‘popularity number’ (based on the number of likes etc.) and store it with the index. Each partition can sort the results based on this popularity number before returning results to the aggregator server. The aggregator server combines all these results, sorts them based on the popularity number, and sends the top results to the user.
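A small sketch of the aggregation step, assuming each partition returns its matches already sorted by the precomputed popularity number:

```python
import heapq
from itertools import islice

def aggregate(partition_results, k=20):
    """partition_results: lists of (popularity, tweet_id) pairs, each list
    already sorted by popularity in descending order."""
    merged = heapq.merge(*partition_results, key=lambda pair: pair[0], reverse=True)
    return list(islice(merged, k))

# Example: two partitions, each pre-sorted by popularity (descending).
p1 = [(980, 17), (450, 3)]
p2 = [(700, 42), (120, 9)]
print(aggregate([p1, p2], k=3))   # -> [(980, 17), (700, 42), (450, 3)]
```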
Similarly, we can extend this approach to design Twitter itself.