The prerequisite for this is how to design a URL Shortener Service. Please read it here...
Pastebin is a service that allows users to paste text-based data or images over the internet and generate a unique short URL for a user to access the uploaded data. The user can also share the short URL so that anyone can view the content.
The user who created it can also update or modify the data if he/she is logged in (or has an API key if the request is made using the REST API). Try using Pastebin once from here...
Our service should provide the following features:
- Users can either enter a block of text or upload a text file and get a unique short URL.
- URLs should expire after a certain period of time if the expiration time is provided.
- Given a short URL, a user should be able to access the original content.
- The service should be REST API accessible.
- Users should be able to log in.
- Users should be able to modify the content.
- It should provide analytics, i.e., how many times a URL has been visited.
- Users should be able to make the content password protected.
The important thing to note is that the number of read requests will be 100 times more than the number of paste (write) requests.
Suppose we are going to receive 100M (100 million) new paste requests per month, with a 100:1 read/write ratio. So, the total number of read requests per month:
100 * 100M = 10,000M, or 10B paste read requests
and the total number of write requests in 10 years:
100M * 12 (months/year) * 10 (years) = 12B write requests
Let the maximum size per paste be 1MB.
Amount of paste data per month =
100M * 1MB = 100 TB of pastes/month
Amount of paste data in a year =
100TB * 12 = 1200TB, or 1.2PB of pastes/year
Assume our service runs for 10 years; then the total data accumulated =
1200TB * 10 years = 12 PB approx.
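These estimates can be double-checked with a few lines of arithmetic (a sketch; the constants simply restate the assumptions above):

```python
# Back-of-the-envelope capacity estimation for the numbers above.
MONTHLY_WRITES = 100_000_000        # 100M new pastes per month
READ_WRITE_RATIO = 100              # 100 reads per write
MAX_PASTE_SIZE_MB = 1               # upper bound per paste
YEARS = 10

monthly_reads = MONTHLY_WRITES * READ_WRITE_RATIO                      # 10B reads/month
total_writes = MONTHLY_WRITES * 12 * YEARS                             # 12B writes in 10 years
monthly_storage_tb = MONTHLY_WRITES * MAX_PASTE_SIZE_MB / 1_000_000    # 100 TB/month
total_storage_pb = monthly_storage_tb * 12 * YEARS / 1000              # 12 PB in 10 years
```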
Our service is going to be REST accessible, so let's start by defining the REST-accessible functions:
1. create_paste( api_key, content, expiration_date, visibility ):
- api_key: A unique API key provided to each user, to protect against spammers and to control each user's access and resources.
- content: The actual content the user wants to paste.
- expiration_date: The amount of time after which the content should expire, e.g., week, month, year, never.
- Return Value: The short URL generated to access the content, or an error code in case of an invalid parameter.
2. get_paste( short_url ):
- short_url: The URL provided by the create_paste method to access the content.
- Return Value: The original paste content, or an invalid-URL error code if the URL is incorrect.
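A minimal in-memory sketch of these two endpoints might look like the following (the error codes, the demo API key, and the placeholder key generation are illustrative assumptions, not a real Pastebin API):

```python
import secrets

# Toy in-memory stores standing in for the real database.
_pastes = {}               # short_url -> paste content
_api_keys = {"demo-key"}   # assumed set of valid API keys

def create_paste(api_key, content, expiration_date=None, visibility="public"):
    if api_key not in _api_keys:
        return {"error": "invalid_api_key"}
    # Placeholder key generation; key generation is discussed later.
    short_url = secrets.token_urlsafe(5)[:7]
    _pastes[short_url] = content
    return {"short_url": short_url}

def get_paste(short_url):
    if short_url not in _pastes:
        return {"error": "invalid_url"}
    return {"content": _pastes[short_url]}
```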
When a user makes a paste, we need to return a unique short URL. But now, how should this URL be created?🤔
The encoding can be Base36 (characters allowed: [a-z][0-9]), Base62 ([A-Z][a-z][0-9]), or Base64 ([A-Z][a-z][0-9], '+', '/'). Base64 gives us the most keys per character, but '+' and '/' are not URL-safe, so we will use the URL-safe Base64 variant, which replaces them with '-' and '_'.
Using Base64 encoding, if we choose a:
- URL of length 6, we get 64^6 = ~68.7 billion URLs
- URL of length 7, we get 64^7 = ~4.4 trillion URLs
- URL of length 8, we get 64^8 = ~281 trillion URLs
We have already estimated that we will be storing 12B URLs, so we can safely choose a short URL of 7 characters.
Now, to generate a unique short URL, we can calculate the MD5 hash of the paste content and the user's IP address, which produces a 128-bit hash value. When we encode the MD5 result in Base64, the resultant string will be 22 characters long.
(MD5 gives a 32-character hex output = 32 * 4 bits = 128 bits. Base64 uses 6 bits per character, so MD5 -> Base64 gives 128/6 ≈ 22 characters.)
To choose the short URL, we can take the first 7 characters of the resulting Base64 encoding.
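This scheme is a few lines with Python's standard library (a sketch; the URL-safe Base64 variant is used here since '+' and '/' are not safe in URLs):

```python
import base64
import hashlib

def short_key(content: str, ip: str) -> str:
    """Hash content + IP with MD5, URL-safe-Base64 encode the 128-bit
    digest (22 chars after stripping '=' padding), keep the first 7."""
    digest = hashlib.md5((content + ip).encode()).digest()   # 16 bytes = 128 bits
    encoded = base64.urlsafe_b64encode(digest).decode().rstrip("=")
    return encoded[:7]
```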
The issue with the above algorithm: the URL generated using this method can repeat. Since we keep only 7 characters (7 * 6 = 42 bits), the chance that two keys collide is roughly 1/2^42. That is slim, but still, in the worst case the repetition is there.
To overcome this problem we will use a KGS, or Key Generation Service.
The standalone Key Generation Service (KGS) will generate random seven-character strings beforehand and store them in a database. Whenever we want a short URL, we just take one of the already-generated keys and use it. This approach makes things quite simple and fast: not only do we skip encoding the URL, we also don't have to worry about duplicates or collisions, because KGS makes sure all the keys inserted into the key-DB are unique.
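A toy version of the KGS idea (the class name and pool size are illustrative; a real KGS would persist keys in a key-DB and mark each one as used atomically):

```python
import secrets
import string

# URL-safe Base64 alphabet for the 7-character keys.
ALPHABET = string.ascii_letters + string.digits + "-_"

class KeyGenerationService:
    def __init__(self, n_keys: int):
        # Pre-generate a pool of unique keys; the set dedupes any collisions.
        self.unused = set()
        while len(self.unused) < n_keys:
            self.unused.add("".join(secrets.choice(ALPHABET) for _ in range(7)))

    def next_key(self) -> str:
        # Hand out each pre-generated key exactly once.
        return self.unused.pop()
```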
Since we are storing the actual text-based content (paste) here, and the limit for each paste is 1MB, the service will accumulate about 12 PB of data over 10 years (as calculated above), which is too much to store directly in a database.
Hence we will use an Object Storage service such as AWS S3.
So, the flow will be as follows: we will store the actual paste content in an S3 bucket and store the paste path (URL) in the database.
Let’s see the data we need to store in the database.

For each user:
- User ID: A unique user id or API key to make users globally distinguishable.
- Name: The name of the user.
- Email: The email id of the user.
- Password: Password of the user to facilitate the login feature.
- Creation Date: The date on which the user was registered.

For each paste:
- Short Url: 7-character-long unique short URL.
- Paste Path: The URL of the S3 bucket where the original content is stored.
- Expiration Date: The date after which this short URL should become invalid.
- Last Accessed: The time this paste was last accessed (for analytics purposes).
- UserId: The unique user id or API key of the user who created the short URL.
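The two record layouts above can be sketched as plain data classes (field names are illustrative; storing a password hash rather than the raw password is an assumption added here):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class User:
    user_id: str           # unique id / API key
    name: str
    email: str
    password_hash: str     # store a hash, never the raw password
    creation_date: datetime

@dataclass
class Paste:
    short_url: str                       # 7-char unique key (row key in the NoSQL store)
    paste_path: str                      # S3 object URL holding the actual content
    expiration_date: Optional[datetime]  # None means "never expires"
    last_accessed: datetime              # for analytics
    user_id: str                         # creator's id / API key
```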
We have two different choices of databases: 1) Relational databases (MySQL) 2) NoSQL databases (Cassandra).
In general, relational databases are a good fit if we have lots of complex queries involving joins, but those joins make them comparatively slow at scale. NoSQL databases are poor at handling relationship queries, but they are faster and scale out more easily.
Now, we don't really need many relationships among the data, but we do need fast read and write speeds. Hence we will choose a NoSQL database. The key for each row can be the short URL, because it is going to be globally unique.
To scale out our database, we need to partition it across several machines or nodes so that it can store information about billions of URLs; with more machines, we can also hold more data in memory. For database sharding, we will use the Consistent Hashing technique.
In this technique, we compute the hash of the short URL we are going to store and use Consistent Hashing to determine the machine/shard in which this particular URL will live. The hash function will randomly and uniformly distribute the URLs across the different partitions or shards. We decide the number of shards up front and then choose an appropriate hash function that maps each URL to a number representing its partition/shard.
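A minimal consistent-hash ring might look like this (virtual nodes and MD5 as the hash are common choices, but they are assumptions here, not requirements):

```python
import bisect
import hashlib

class ConsistentHash:
    """Each shard gets several virtual points on a ring; a short URL is
    stored on the first shard clockwise from its own hash position."""

    def __init__(self, shards, vnodes=100):
        self.ring = sorted(
            (self._hash(f"{s}-{v}"), s) for s in shards for v in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def shard_for(self, short_url: str) -> str:
        # Wrap around the ring with the modulo.
        idx = bisect.bisect(self.keys, self._hash(short_url)) % len(self.keys)
        return self.ring[idx][1]
```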
Scalable PasteBin type service architecture
We know that our database is going to be read-heavy. So far we have found a way to speed up the writing process, but reads are still slow, so we need some way to speed up the reading process as well.
Caching is the solution. We can cache the URLs that are going to be accessed frequently. For example, a URL that appears on the trending page of any social networking website. Hence many people are going to visit the URL. We can use caching services like Memcached.
Things we need to consider after adding the caching layer:
When the cache is full, we need to make room for the trending URLs. For this, we will use the LRU (Least Recently Used) policy: the URL in the cache that has gone unaccessed for the longest time will be evicted first.
Synchronizing the cache with the original content. If the user updates or deletes the original paste content, the corresponding changes have to be reflected in the cache too.
We can shard the cache too. This will let us keep more data in memory across additional machines. Deciding which entry goes to which shard can again be done using Consistent Hashing.
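The LRU policy itself can be sketched with an ordered dictionary (a toy version; Memcached handles this internally, and the miss behavior here is illustrative):

```python
from collections import OrderedDict

class LRUCache:
    """On a hit, the entry moves to the 'recent' end; when full, the
    least recently used entry is evicted from the other end."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.store = OrderedDict()    # short_url -> paste content

    def get(self, short_url):
        if short_url not in self.store:
            return None               # cache miss: caller falls through to DB/S3
        self.store.move_to_end(short_url)
        return self.store[short_url]

    def put(self, short_url, content):
        if short_url in self.store:
            self.store.move_to_end(short_url)
        self.store[short_url] = content
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)   # evict least recently used
```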
Now, this is how we speed up our read and write requests, but our system is still prone to network bandwidth bottlenecks and a single point of failure.
To overcome the problems of limited network bandwidth and a single point of failure, we will use Load Balancers. A Load Balancer does its magic by dividing the traffic among a group of servers, resulting in improved response times and availability of a website or application. Read more here…
To distribute the load among the servers, we will use the Least Bandwidth Method. This algorithm chooses the server currently serving the least amount of traffic, measured in megabits per second (Mbps).
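The Least Bandwidth choice itself reduces to picking the minimum over per-server traffic counters (the server names and Mbps figures below are made up for illustration):

```python
def least_bandwidth(servers: dict) -> str:
    """servers maps server name -> current traffic in Mbps;
    return the server carrying the least traffic right now."""
    return min(servers, key=servers.get)

# Hypothetical snapshot of current per-server traffic.
current_load = {"web-1": 480.0, "web-2": 120.5, "web-3": 300.2}
```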
We can place the Load Balancers between:
The client and the server.
The server and the database.
The web server and the Key Generating Service.
This is how we will design a highly scalable PasteBin type Service that can work in realtime. 👨💻
This post was originally published at nlogn.in