
Designing a URL shortening service from scratch to scale to millions of users

mayank joshi · Originally published at nlogn.in · 7 min read

The goal of this project is to design a URL shortening service, like bit.ly or tinyurl.com, that scales in real time.

1. URL Shortener

A URL shortener is a service that creates a short alias URL for a corresponding long URL. Whenever a user visits the short URL, they are redirected to the original URL.

Example:
Long URL: https://nlogn.in/horizontal-scaling-and-vertical-scaling/
Short URL: https://tinyurl.com/tepyk8t

Whenever you click the short URL, you are automatically redirected to the page referred to by the long URL.

2. Design Goals 

The URL shortening service should have the following features:

Mandatory Features

  1. Given a long URL, it should generate a unique short URL.
  2. Given a short URL, it should redirect to the original URL.
  3. The short URL should expire after a timespan.

Optional Features

  1. The service should be accessible through a REST API.
  2. It should provide analytics, i.e., how many times a URL has been visited.
  3. A user should be able to pick a custom short URL.

3. Capacity Estimation

The important thing to note is that the number of read requests will be about 100 times the number of write requests.

Suppose we are going to receive 1B (1 billion) new URL-shortening requests per month, with a 100:1 read/write ratio. The total number of redirections per month is then:

100*1B = 100B redirections

 
If we store each URL (the long URL and its corresponding short URL) for roughly 10 years, the number of URLs we will store is:

1B * 12 months * 10 years = 120B URLs

Now, if on average every URL record takes 200 bytes, the total storage required is:

120B * 200 bytes = 24TB of storage 😱
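These back-of-envelope numbers are easy to sanity-check with a few lines of Python (the figures are just the assumptions above, not measurements):

```python
# Capacity estimation, using the assumptions from this section.
new_urls_per_month = 1_000_000_000        # 1B write requests per month
read_write_ratio = 100                    # 100 reads per write
years_stored = 10
bytes_per_url = 200

redirections_per_month = new_urls_per_month * read_write_ratio
total_urls = new_urls_per_month * 12 * years_stored
storage_bytes = total_urls * bytes_per_url

print(f"{redirections_per_month:,} redirections/month")   # 100,000,000,000
print(f"{total_urls:,} URLs stored")                       # 120,000,000,000
print(f"{storage_bytes / 1e12:.0f} TB of storage")         # 24 TB
```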

4. REST API Endpoints

Let's start by making two functions accessible through a REST API:

1. create_shortURL(long_url, api_key, custom_url, expiry_date)

  1. long_url: The long URL that needs to be shortened.
  2. api_key: A unique API key provided to each user, to protect against spammers and to control access and resources per user.
  3. custom_url (optional): The custom short URL the user wants to use.
  4. expiry_date (optional): The date after which the short URL becomes invalid.
  5. Return value: The generated short URL, or an error code in case of an inappropriate parameter.

2. redirect_shortURL(short_url)

  1. short_url: The short URL generated by the function above.
  2. Return value: The original long URL, or an invalid-URL error code.

5. Shortening Algorithm

The URL we generate should be short. But how short? And what type of encoding should we use? 🤔

URL Encoding

The encoding can be Base36 (characters allowed: [a-z][0-9]), Base62 ([A-Z][a-z][0-9]), or Base64 ([A-Z][a-z][0-9] plus two more characters). Standard Base64 uses '+' and '/', which are not URL safe, so we will choose the URL-safe Base64 variant, which replaces them with '-' and '_', as our encoding technique.

URL Length

Using Base64 encoding, if we choose a

URL of length 6, we get 64^6 ≈ 68.7 billion URLs
URL of length 7, we get 64^7 ≈ 4.4 trillion URLs
URL of length 8, we get 64^8 ≈ 281 trillion URLs

We have already estimated that we would be storing 120B URLs, hence we can safely choose the short URL to be 7 characters long.

Now, to generate a unique short URL, we can calculate the MD5 hash of the long URL, which produces a 128-bit hash value. When we encode the MD5 result in Base64, the resulting string will be 22 characters long.

(MD5 result = 32 hex characters = 32 * 4 bits = 128-bit output.
Base64 uses 6 bits to represent each character, so
MD5 -> Base64 gives ceil(128/6) = 22 characters of output.)

To choose the short URL, we can first randomly swap some characters of the Base64-encoded result and then pick any 7 characters at random from it.
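A minimal sketch of this in Python: MD5 the URL, Base64-encode with the URL-safe alphabet, strip the '=' padding to get the 22 characters, then sample 7 of them.

```python
import base64
import hashlib
import random

def generate_short_code(long_url: str, length: int = 7) -> str:
    """Hash the URL (128 bits), URL-safe Base64-encode the digest
    (22 chars after stripping '=' padding), then pick `length` chars."""
    digest = hashlib.md5(long_url.encode("utf-8")).digest()
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode()
    return "".join(random.sample(encoded, length))

print(generate_short_code("https://nlogn.in/horizontal-scaling-and-vertical-scaling/"))
```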

Figure: URL shortener working

Potential Issues

Using the above method, we can run into two potential issues:

  1. If multiple users enter the same long URL, in the worst case they can get the same short URL.
  2. If a user submits a URL-encoded URL: https://nlogn.in/system-design?rel=source (UTF-8 format) and https%3A%2F%2Fnlogn.in%2Fsystem-design%3Frel%3Dsource (URL-encoded format) are the same URL with different encodings.

Let's solve the first issue:

We can append the user's IP address to the long URL and then do the shortening; that way we won't get redundancy. Correct? 🤔
No: two users can share the same public IP address if they are connected through the same router.

One solution is to append the user-id or API key to the long URL and then do the shortening. This will work fine, but the user has to be logged in to create a short URL. Let's stick with this for now; a sketch follows.
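With the generate_short_code sketch above, that's just a change to the hash input (the key and URL here are made up):

```python
# Hypothetical values; generate_short_code is the sketch from earlier.
api_key = "user-42-api-key"
long_url = "https://nlogn.in/system-design"

# Hashing the API key together with the URL means two users shortening
# the same long URL get different short codes.
short_code = generate_short_code(api_key + ":" + long_url)
```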

Let's solve the second issue:

For every request received, we will use UTF-8 encoding only. If a URL is URL-encoded, we will first convert it to its UTF-8 form. For this we can have our own service or use a third-party service.
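A minimal normalization sketch using Python's standard urllib, rather than a third-party service (the loop handles doubly-encoded input):

```python
from urllib.parse import unquote

def normalize_url(url: str) -> str:
    """Collapse percent-encoding so both spellings map to one canonical form."""
    previous = None
    while url != previous:  # keep decoding until the string stops changing
        previous, url = url, unquote(url)
    return url

# Both spellings of the example URL normalize to the same string.
assert (normalize_url("https%3A%2F%2Fnlogn.in%2Fsystem-design%3Frel%3Dsource")
        == normalize_url("https://nlogn.in/system-design?rel=source"))
```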

6. Database Design & Choice

Let's see the data we need to store:

Data Related to the User

User ID: A unique user id or API key that makes the user globally distinguishable.
Name: The name of the user.
Email: The email address of the user.
Password: The user's password, to support the login feature.
Creation Date: The date on which the user registered.

Data Related to the Short Link

Short URL: The 7-character unique short URL.
Original URL: The original long URL.
User ID: The unique user id or API key of the user who created the short URL.
Expiration Date: The date after which this short URL becomes invalid.

Data Storage Observations
  1. We will be storing billions of links.
  2. Short-link redirection should be fast.
  3. Our service will be read-heavy.
  4. The only relation is which user created which URL, and even that will be accessed infrequently.

We have two broad choices of database: 1) relational databases (e.g., MySQL); 2) NoSQL databases (e.g., Cassandra).

In general, relational databases are a good fit when we have lots of complex queries involving joins, but those joins make them slower at scale. NoSQL databases are poor at handling relationship queries but are faster for simple key-based reads and writes.

Now, we don't really need many relationships among the data, but we do need fast reads and writes, so we will choose a NoSQL database. The key for each row can be the short URL, because it is globally unique.
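As a sketch, the two record types above might look like this (field names are illustrative; in the NoSQL store, the short URL is the row key):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ShortLinkRow:
    short_url: str          # row key: 7-character code, globally unique
    original_url: str
    user_id: str
    expiration_date: datetime

@dataclass
class UserRow:
    user_id: str            # row key: unique user id / API key
    name: str
    email: str
    password_hash: str      # store a hash, never the raw password
    creation_date: datetime
```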

Database Sharding

To scale out our database, we need to partition it across several machines (nodes) so that it can store information about billions of URLs. With more machines, we can also keep more data in memory. For database sharding, we will use hash-based partitioning.

In this technique, we compute a hash of the short URL we are going to store and use it to determine the machine/shard in which to store that particular URL. The hash function distributes the URLs roughly uniformly across the partitions (shards). We decide the number of shards up front and choose an appropriate hash function that yields a number identifying the partition/shard (e.g., if we have chosen 512 shards, the hash function maps each key to a number in [1-512] or [0-511]).
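A minimal sketch of that mapping (512 shards, as in the example):

```python
import hashlib

NUM_SHARDS = 512

def shard_for(short_url: str) -> int:
    """Map a short URL to a shard number in [0, 511]."""
    digest = hashlib.md5(short_url.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % NUM_SHARDS
```

Note that plain modulo hashing reshuffles most keys if the shard count changes; consistent hashing (mentioned in Section 7) avoids that.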

7. Speeding Up the Read Operation

We know that our database is going to be read-heavy. So far we have found ways to speed up the writing process, but reads are still slow, so we have to find some way to speed up the reading process.

Caching is the solution. We can cache the URLs that are accessed frequently, for example, a URL that appears on the trending page of a social networking website and is therefore visited by many people. We can use a caching service like Memcached.

Things we need to consider after adding the caching layer:

  1. When the cache is full, we need to evict entries to make room for trending URLs. For this we will use the LRU (Least Recently Used) policy: the URL in the cache that was accessed least recently is removed first. A sketch follows this list.
  2. Synchronizing the cache with the database: if the user updates or deletes the original link, the corresponding change has to be reflected in the cache too.
  3. We can shard the cache too. This helps us store more data in memory across more machines. Deciding which entry goes to which shard can be done using hashing or consistent hashing.
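A minimal LRU cache along these lines, built on Python's OrderedDict (in production, Memcached handles eviction for us):

```python
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = OrderedDict()          # short URL -> long URL

    def get(self, short_url):
        if short_url not in self.items:
            return None                     # miss: fall back to the database
        self.items.move_to_end(short_url)   # mark as most recently used
        return self.items[short_url]

    def put(self, short_url, long_url):
        if short_url in self.items:
            self.items.move_to_end(short_url)
        self.items[short_url] = long_url
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used
```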

With this, we have sped up both reads and writes, but our system is still prone to network-bandwidth bottlenecks and a single point of failure.

8. Load Balancing

To overcome the problems of limited network bandwidth and a single point of failure, we will use load balancers. A load balancer does its magic by dividing the traffic among a group of servers, resulting in improved response times and availability of a website or application.

To distribute the load among the servers, we will use the Least Bandwidth Method. This algorithm chooses the server currently serving the least amount of traffic, measured in megabits per second (Mbps).
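As a sketch, the selection rule is just a minimum over the current per-server traffic (server names and numbers are made up):

```python
# Hypothetical per-server traffic, in Mbps.
servers = {"server-a": 120.5, "server-b": 87.2, "server-c": 240.0}

def pick_server(current_mbps: dict) -> str:
    """Least Bandwidth Method: pick the server serving the least traffic."""
    return min(current_mbps, key=current_mbps.get)

print(pick_server(servers))  # -> server-b
```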

We can place the Load Balancers between:

  1. The client and the server.
  2. The server and the database.

This is how we can design a highly scalable URL shortening service that works in real time. 👨‍💻



Discussion (14)

Amelia Gapin

Can you talk a little about the use of the expiration date? Why not make the links last forever? If they can expire, you're potentially dumping billions of links onto the internet that will eventually 404 or 410. At Bitly, our links last forever, at least on our end, because we don't want to have billions of inactive links out there in the world when it can be avoided. We can't control what happens on the side of the original link's host, but we can at least control our part of the chain.

Your solution still has a collision issue for the "hashes." If you're randomly choosing characters from the hashes, there's nothing that prevents duplicates other than hoping that the statistical likelihood of that happening is on your side.

mayank joshi (Author) • Edited

First Question: We are adding an expiration field so that, if a user wants a link to be active only for a certain time, they can provide one. Secondly, we can't keep every link forever; that's lots and lots of data. Here we already assume a link will last for 10 years, and 10 years is a long time.
Still, we can use the analytics feature to see how frequently links are being used. That way we can filter out unused or spam links and keep only the links that are active or still being used; the expiration date helps us in that cleanup of spam links.

Second Question:

Your solution still has a collision issue for the "hashes."

No, it doesn't; here we have assumed that a user has to log in to create short links.

One solution is to add user-id or API key to the long URL and then do the shortening. This will work fine, but the user has to be logged in to create a short URL. Let's stick to this for now.

Since every user has to be logged in (or needs an API key for REST calls), we append it to the original URL to avoid collisions. Hence two users shortening the same URL will get different short links, and the same user can't create two different hashes for one URL.

Feel free to tell me if you have any other doubts.
Cheers.

David Paine20

Hello Mayank, I think it would be better to mention the 10-year period and the use of the analytics feature (to see how frequently the links are being used) in the article itself. And for URL decoding/encoding, here is a great tool I'd like to suggest:
url-decode.com/
That tool will really help in your future work.

mayank joshi (Author)

Sure, I will do that, and thank you for the tool recommendation.

Amelia Gapin

So, for the first point, ten years is definitely a long time, but things can also last a long time on the internet. Letting the user set this date is asking for a lot of shorter-term URLs that end up expired fairly quickly (months?). There's a reason we keep them forever at Bitly.

If you're removing expired links from your database, you can't even return a 410, you have to return a 404. And in that case, you're going to eventually end up with a lot of links on your domain that are 404ing. That doesn't inspire much confidence in the average user who is just clicking on your shortened links. Eventually, they learn that links on your domain are a crapshoot. Especially because the 404 is coming from you, not the original URL's host.

Over years and years, you're going to be using a lot of storage for the links, yes, but if you're also keeping analytics data, that is A TON more data. If you want to keep your data storage smaller, you could keep all links, but set a limit on the analytics data you keep to something like 2-3 years.

Anyway, we definitely have links at Bitly that I would expect to be used for >10 years.

On the second point, you're not using the entire hash. You're only taking seven random characters from it. You have no guarantee that you didn't happen to end up with those same seven characters from another hashed URL.

Now, to generate a unique short URL, we can calculate the MD5 hash of the long URL, which will produce a 128-bit hash value. Now when we encode the MD5 result to Base64 encoding the resultant string will be 22 characters long.

For choosing the short URL, first, we can randomly swap some character of the Base64 encoding result and then pick any 7 characters randomly from the result.

This is essentially random for all intents and purposes. You started with something non-random, a hash, but then by swapping characters and randomly selecting any 7 of them, you ended up at a result that is no better than random.

Nothing there prevents any two links from ending up with the same seven characters in the backhalf.

Naing Aung Phyo

I also wonder about the random picking of 7 characters.
Can you tell me the logic behind how bit.ly guarantees the uniqueness of those links?

mayank joshi (Author) • Edited


This is essentially random for all intents and purposes. You started with something non-random, a hash, but then by swapping characters and randomly selecting any 7 of them, you ended up at a result that is no better than random

@ameliagapin I understood what you are trying to say.

What about this approach: take the first 5 characters of the 22-character result (of hashing and encoding) and the remaining 2 characters from the end of the generated string?

What I meant by analytics is that we can store the number of times a link is visited and the last time it was visited.
Then we can find links with the fewest visits in 10 years, or links that have not been used in 10 long years or have been used only 10-20 times, and remove those links.

Or, one more solution: we can check whether the corresponding long URL is still active, and if not, delete the short link.

In this way, we are keeping popular URLs forever and eliminating unwanted URLs.

What are your views on this architecture?

mayank joshi (Author) • Edited

@naingaungphyo Please refer to section 5. Shortening Algorithm for the explanation.

Naing Aung Phyo • Edited

@mayankjoshi Sorry for the misunderstanding.

I was asking @ameliagapin for that question.

Can you tell me the logic behind how bit.ly guarantees the uniqueness of those links?

mayank joshi (Author)

See, even I am trying to figure this out. I have proposed an architecture design to @ameliagapin; I will keep you updated once I'm certain.

Rehan van der Merwe • Edited

Load balancing, servers, caching, Cassandra clusters (not sure why this has the #serverless tag?)... It may scale, but your wallet won't. I also get what @ameliagapin is saying: you might have left out a step where you check if that short_id exists, and if it does, choose other characters/combinations. To be honest, I don't understand why you go through all that hashing trouble. Why not just use something like npmjs.com/package/shortid?

mayank joshi (Author)

This architecture design deliberately skips checking whether a short URL already exists, because checking on every request would add overhead and could reduce performance.

The URL-shortening algorithm is written in such a way that every short URL generated should be unique, so we don't have to check for uniqueness.

Tell me if you have further doubts.

Rehan van der Merwe

Yeah, that might be unique, but you can still pick the same 7 characters as before. At some point you will have a collision.

I take it that you mostly write systems on paper and implement much less? It might sound like a "good-ish" idea, but you will at some point have a collision, as @ameliagapin states.

Maybe you should mark these posts as "theoretical designs" rather than #serverless, as there are no serverless parts in this? You don't want anyone to actually start building this, wreck their wallet, and then later have terrible scaling issues due to the different types of technologies involved, the expert team required, and collisions happening all the time without anyone even being aware of them.

mayank joshi (Author)

That's why I have proposed a new design to @ameliagapin to see if it works as a random unique URL generator.

You don't want anyone to actually start building this, wreck their wallet, and then later have terrible scaling issues due to the different types of technologies involved, the expert team required, and collisions happening all the time without anyone even being aware of them.

Actually, I am planning to implement it; that's why I am really curious about the failure cases of this architecture, so I can update it accordingly.

Sure, I will change the tags.

Thank you,
cheers.