loading...
Nlogn

Hacking into Pastebin scalable architecture - System Design

mayankjoshi profile image mayank joshi Updated on ・7 min read

System Architecture (4 Part Series)

1) Designing a URL shortening service from scratch to scale million of users. 2) Hacking into Pastebin scalable architecture - System Design 3) Design a Scalable request Rate Limiting Algorithm for API- System Design 4) How to use cache and Redux to make your App blazing fast

The prerequisite for this is, how to design a URL Shortner Service. Please read it here...

1. What is Pastebin?

Pastebin is a service that allows users to paste text-based data or images over the internet and generate a unique short URL for a user to access the uploaded data. The user can also share the short URL so that anyone can view the content.

The user who created it can also update or modify the data if he/she is logged in (or has an API key if the request is made using Rest API). Try using Pastebin once from here...

2. Design Goals

Our service should provide the following features:

1. Mandatory Features

  1. Users can either enter a block of text or upload a text file and get a unique short URL.
  2. URLs should expire after a certain period of time if the expiration time is provided.
  3. Given a short URL user should be able to access the original content.
  4. The service should be REST API accessible.

2. Optional Features

  1. Users should be able to log in
  2. Users should be able to modify the content.
  3. It should provide analytics features i.e, how many times the URL is visited.
  4. Users should be able to make the content password protected.

3. Capacity Estimation

The important thing to note is that the number of reading requests will be 100 times more than the number of paste(writing) requests.

Suppose, we are going to receive 100M(100 Million) new requests per month, with a 100:1 read/write ratio. So, the total number of reading requests:

100*100M = 10000M or 10B paste read request

and the total number of writing request in 10 years

100M*12(months/year)*10(years)= 12B write requests

Let the maximum size per paste be 1MB.

Amount of paste size per month =

  100M*1MB =  100 TB paste/month

Amount of paste in an year =

100TB*12 = 1200TB or 1.2PB paste/year

Assume our service runs for 10 years, then total data accumulated =

1200TB * 10 years = 12 PB approx

4. Algorithm REST Endpoints

Our service is going to be REST accessible. So, let’s starts by making REST accessible functions:

1. create_paste( api_key, content, expiration_date, visibility):

  1. api_key: A unique API key provided to each user, to protect from the spammers, access and resource control for user etc.
  2. content: The actual content the user wants to paste.
  3. expiration_date: The amount of time after which the content should be expired. Ex: week, month, year, never.
  4. Return Value: The short Url generated to access the content or an error code in case of the inappropriate parameter.

2. read_paste(short_url):

  1. short_url: The URL provided by create_paste method to access the content.
  2. Return Value: The original paste content or an invalid error code, if the URL is incorrect.

5. Short URL Generating Algorithm

When a user makes a paste, we need to return a unique short URL. But now, how should this URL be created?🤔

URL Encoding

The encoding can be Base36( characters allowed [a-z][0-9]) or Base62( characters allowed [A-Z][a-z][0-9]) or Base64(characters allowed [A-Z][a-z][0-9],’+’,’/’). Since all of the characters in Base64 are URL safe characters, we will choose Base64 as our encoding technique.

Url Length 

Using Base64 encoding if we choose the

URL with length 6, will give 64^6 = ~68.7 Billion URLs
URL with length 7, will give 64^7 = ~5 Trillion URLs
URL with length 8, will give 64^8 = ~281 Trillion URLs

We have already estimated that we would be storing 12B URLs, we can safely choose the short URL to be 7 characters long.

Now, to generate a unique short URL, we can calculate the MD5 hash of the paste content and user IP Address, which will produce a 128-bit hash value. Now when we encode the MD5 result to Base64 encoding the resultant string will be 22 characters long.

(MD5 result = 32character output = 32*4bit = 128 bit output.
Base64 encoding will use 6 bit to represent each character,
MD5 -> Base64 give (128/6) ~ 22 character output)

For choosing the short URL, first, we can choose the first 7 characters of the resulting Base64 encoding.

The issue with the above algorithm: The URL, generated using this method can be repeated(though changes are really slim 1/2^22 but still, in the worst case the repetition is there).

To overcome this problem we will use KGS or Key Generating Service.

Key Generating Service

The standalone Key Generation Service (KGS) will generate random seven-letter strings beforehand and stores them in a database. Whenever we want a short URL, we will just take one of the already-generated keys and use it. This approach will make things quite simple and fast. Not only are we not encoding the URL, but we won’t have to worry about duplications or collisions. KGS will make sure all the keys inserted into key-DB are unique

6. Storage Choice & Design

Since we are storing here actual text-based content(paste) and limit for each paste size is 1MB. As calculated above if service runs for 10 years, we will end up accumulating 12 PB of data, which is too much to be stored in a database.

Hence we will Object Storage service such as AWS S3.

So, the flow will be as follow: We will store the actual paste content in the S3 bucket and store the paste-path(URL) in the database.

Let’s see the data we need to store in the database:

1. Data Related to user

  1. User ID: A unique user id or API key to make users globally distinguishable.
  2. Name: The name of the user.
  3. Email: The email id of the user
  4. Password: Password of the user to facilitate the login feature.
  5. Creation Date: The date on which the user was registered.

2. Data Related to Paste

  1. Short Url: 7 characters long unique short URL.
  2. Paste Path: The URL of S3 Bucket, where the original content is stored.
  3. Expiration Date: The date after which this short URL should become invalid
  4. Last Accessed: The time this paste was last accessed. (For analytics purpose)
  5. UserId: The unique user id or API key of the user who created the short URL.

We have two different choices of databases: 1) Relational Databases(MySQL) 2) NoSQL Databases(Cassandra).

In general, Relational Databases are good if we have lots of complex queries involving joins, but they are slow. NoSQL databases are pathetic at handling the relationship queries but they are faster.

Now, we don’t really need lots of relationships among data, but we do need a fast read and write speed. Hence we will choose NoSQL Database. The key for each row can be the short URL because it is going to be globally unique.

DATABASE SHARDING

To scale out our database, we need to partition it into several machines or nodes, so that it can store information about billions of URLs. Hence now we can store more data in memory because of more machines or nodes. For database sharding, we will use the Consistent Hashing technique.

In this technique, we will find the hash of the short URL we are going to store and determine the machine/shard in which we are going to store this particular URL using Consistent Hashing. The hash function will randomly and uniformly distribute the URLs into different partitions or shards. We can decide the number of shards we are going to make and then we can choose an appropriate hash function that random number representing the partition/shard number.

Scalable PasteBin type service architecture

7. SPEEDING UP THE READ OPERATION

We know that our database is going to be read heavily. Till now we have found a way to speed up the writing process, but the reads are still slow. So we have to find some way to speed up the reading process.

Caching is the solution. We can cache the URLs that are going to be accessed frequently. For example, a URL that appears on the trending page of any social networking website. Hence many people are going to visit the URL. We can use caching services like Memcached.

Things we need to consider after adding the caching layer:

When the cache is full, we need to replace the URLs in the cache with the trending ones. For this, we will use the LRU(Least Recently Used) policy. The URL in the cache which has been referred least number of times will be removed.
Synchronizing the cache with the original content. If the user updates or deletes the original paste content, the corresponding changes have to be reflected in the cache too.
We can shard the cache too. This will help us store more data in memory because of the more machines. For deciding which thing goes to which shard can be done using “Consistent Hashing”.

Now, this is how we speed up our read and write requests, but still, our system is prone to network bandwidth bottleneck and single point of failure.

8. LOAD BALANCING

To overcome the problem of limited network bandwidth and single point of failure, we will use Load Balancers. Load Balancer does its magic by diving the traffic among a group of server thus resulting in improved response and availability of a website or application. Read more here…

To distribute the load among server we will use the Least Bandwidth Method. This algorithm will choose the server currently serving the least amount of traffic, measured in megabits per second (Mbps).

We can place the Load Balancers between:

The client and the server.
The server and the database.
The web server and the Key Generating Service.

This is how we will design a highly scalable PasteBin type Service that can work in realtime. 👨‍💻

Further Reading:

  1. Consistent Hashing
  2. Load Balancers
  3. Designing a Tiny URL Service

This post was originally published at nlogn.in

System Architecture (4 Part Series)

1) Designing a URL shortening service from scratch to scale million of users. 2) Hacking into Pastebin scalable architecture - System Design 3) Design a Scalable request Rate Limiting Algorithm for API- System Design 4) How to use cache and Redux to make your App blazing fast

Posted on Mar 28 by:

mayankjoshi profile

mayank joshi

@mayankjoshi

I love system design and most of the time I find myself learning or designing one of them.

Nlogn

nlogn is a Computer Science portal specially designed to help you prepare for product-based companies interviews by practicing a wide variety of problem-related to system Desing, Data-Structure and much more.

Discussion

markdown guide
 

12PB. Too much data, how much is that in compressed format?

It's not too much after 10 years though, since it's not like you guys are not gonna do anything in those 10 years after first development. 😁

 

You are correct, lots and lots of work will be done in 10 years, but here we are assuming that even if we don't make any update for 10 years, the service will work perfectly fine.

How much is that in a compressed format? See it will depend on the compression algorithm we are using. Also, we are not going to store the data in compressed format, because it will again add an overhead of compressing and decompressing the data thus decreased performance.