
re: Designing a URL shortening service from scratch to scale million of users.

 

Can you talk a little about the use of the expiration date? Why not make the links last forever? If they can expire, you're potentially dumping billions of links onto the internet that will eventually 404 or 410. At Bitly, our links last forever, at least on our end, because we don't want to have billions of inactive links out there in the world when it can be avoided. We can't control what happens on the side of the original link's host, but we can at least control our part of the chain.

Your solution still has a collision issue for the "hashes." If you're randomly choosing characters from the hashes, there's nothing that prevents duplicates other than hoping that the statistical likelihood of that happening is on your side.

 

First Question: We are adding an expiration field so that a user who wants a link to be active only for a certain time can specify it. Secondly, we can't keep every link forever; that is a lot of data. We are already assuming here that a link lasts for 10 years, and 10 years is a long time.
Still, we can use the analytics feature to see how frequently the links are being used. That lets us filter out unused or spam links and keep only the links that are still in use, and the expiration date helps with cleaning up spam links.

Second Question:

Your solution still has a collision issue for the "hashes."

No, it doesn't. Here we have assumed that a user has to log in to create short links.

One solution is to add user-id or API key to the long URL and then do the shortening. This will work fine, but the user has to be logged in to create a short URL. Let's stick to this for now.

Since every user has to be logged in or have an API key for REST calls, we append that identifier to the original URL to avoid a collision. Hence, two users shortening the same URL will get different short links, and the same user can't create two different hashes for one URL.
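A minimal sketch of that idea, assuming MD5 plus URL-safe Base64 as described in the article. The function name `short_code`, the `:` separator, the sample keys, and the choice to keep the first 7 characters are all assumptions made for illustration, not the article's exact algorithm.

```python
import base64
import hashlib


def short_code(long_url: str, api_key: str, length: int = 7) -> str:
    """Hash the caller's key together with the URL and keep a short prefix."""
    # MD5 yields 16 bytes (128 bits).
    digest = hashlib.md5(f"{api_key}:{long_url}".encode("utf-8")).digest()
    # URL-safe Base64 without '=' padding is 22 characters for 16 bytes.
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return encoded[:length]


# Hypothetical API keys for two different users shortening the same URL.
alice = short_code("https://example.com/page", "key-alice")
bob = short_code("https://example.com/page", "key-bob")
print(alice, bob)
```

Mixing in the key makes the input to the hash different per user, and the same user always gets the same code for the same URL. It does not by itself rule out two different inputs sharing the same 7 characters after truncation.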

Feel free to tell me if you have any other doubts.
Cheers.

 

Hello Mayank, I think it would be better to mention the 10-year period, and the use of the analytics feature to see how frequently the links are being used, in the article itself. For URL decoding/encoding, here is a great tool I'd like to suggest:
url-decode.com/
That tool should help you in your future searches.

Sure, I will do that, and thank you for the tool recommendation.

 

So, for the first point, ten years is definitely a long time, but things can also last a long time on the internet. Letting the user set this date is asking for a lot of shorter-term URLs that end up expired fairly quickly (months?). There's a reason we keep them forever at Bitly.

If you're removing expired links from your database, you can't even return a 410, you have to return a 404. And in that case, you're going to eventually end up with a lot of links on your domain that are 404ing. That doesn't inspire much confidence in the average user who is just clicking on your shortened links. Eventually, they learn that links on your domain are a crapshoot. Especially because the 404 is coming from you, not the original URL's host.
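To make the 404 versus 410 point concrete, here is a small sketch, assuming expired links are kept in the store with their expiry time instead of being deleted. The store shape, the `resolve` function, and the use of 301 for a live redirect are assumptions for illustration.

```python
from datetime import datetime, timezone
from typing import Dict, Optional, Tuple

# Hypothetical store: short code -> (long URL, optional expiry time).
Links = Dict[str, Tuple[str, Optional[datetime]]]


def resolve(links: Links, code: str, now: datetime) -> Tuple[int, Optional[str]]:
    """Return an HTTP status and, for live links, the redirect target."""
    record = links.get(code)
    if record is None:
        # Nothing is known about this code (never existed, or row deleted).
        return 404, None
    long_url, expires_at = record
    if expires_at is not None and now >= expires_at:
        # The row still exists, so the service can say the link is gone.
        return 410, None
    return 301, long_url


links: Links = {
    "abc1234": ("https://example.com/sale", datetime(2020, 1, 1, tzinfo=timezone.utc)),
    "xyz9876": ("https://example.com/docs", None),
}
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(resolve(links, "abc1234", now))  # expired link
print(resolve(links, "nope000", now))  # unknown code
```

Once the row for an expired link is deleted, the first branch is the only one left for it, which is why a cleaned-up link can only answer with 404.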

Over years and years, you're going to be using a lot of storage for the links, yes, but if you're also keeping analytics data, that is A TON more data. If you want to keep your data storage smaller, you could keep all links, but set a limit on the analytics data you keep to something like 2-3 years.
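A rough sketch of that retention split, assuming click events are stored separately from the links themselves. The `Click` tuple shape, the `prune_clicks` name, and the three-year default are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple

# Hypothetical click event: (short code, time of the click).
Click = Tuple[str, datetime]


def prune_clicks(clicks: List[Click], now: datetime, keep_days: int = 3 * 365) -> List[Click]:
    """Keep only click events newer than the retention window."""
    cutoff = now - timedelta(days=keep_days)
    return [click for click in clicks if click[1] >= cutoff]


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
clicks = [
    ("abc1234", datetime(2019, 6, 1, tzinfo=timezone.utc)),
    ("abc1234", datetime(2023, 6, 1, tzinfo=timezone.utc)),
]
recent = prune_clicks(clicks, now)
print(len(recent))
```

Only the analytics rows are dropped here. The link table is never touched, so every short link keeps resolving.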

Anyway, we definitely have links at Bitly that I would expect to be used for >10 years.

On the second point, you're not using the entire hash. You're only taking seven random characters from it. You have no guarantee that you didn't happen to end up with those same seven characters from another hashed URL.

Now, to generate a unique short URL, we can calculate the MD5 hash of the long URL, which will produce a 128-bit hash value. Now when we encode the MD5 result to Base64 encoding the resultant string will be 22 characters long.

For choosing the short URL, first, we can randomly swap some character of the Base64 encoding result and then pick any 7 characters randomly from the result.

This is essentially random for all intents and purposes. You started with something non-random, a hash, but then by swapping characters and randomly selecting any 7 of them, you ended up at a result that is no better than random.

Nothing there prevents any two links from ending up with the same seven characters in the backhalf.
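To put a rough number on that, here is a sketch using the standard birthday approximation. It assumes the 7 characters behave like uniform random picks from a 64-character alphabet, which is the "no better than random" case described above. The function name and the sample link counts are illustrative.

```python
import math


def collision_probability(n_links: int, code_length: int = 7, alphabet_size: int = 64) -> float:
    """Approximate chance that at least two of n_links share a code."""
    space = alphabet_size ** code_length  # 64**7 = 2**42 possible codes
    # Birthday approximation: p is about 1 - exp(-n(n-1) / (2 * space)).
    return -math.expm1(-n_links * (n_links - 1) / (2 * space))


for n in (1_000_000, 10_000_000):
    print(n, round(collision_probability(n), 4))
```

Under these assumptions, one million links already give roughly a one in ten chance of at least one collision, and ten million make a collision all but certain, so the service needs an explicit uniqueness check rather than relying on the odds.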

I also wonder about the random picking of 7 characters.
Can you tell me the logic behind the uniqueness of those links at bit.ly?


This is essentially random for all intents and purposes. You started with something non-random, a hash, but then by swapping characters and randomly selecting any 7 of them, you ended up at a result that is no better than random

@ameliagapin I understand what you are trying to say.

What about this approach: take the first 5 characters of the 22-character string generated by hashing and encoding, and the remaining 2 characters from the end of that string?
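A sketch of that slicing, assuming MD5 and unpadded URL-safe Base64. The names `candidate_code` and `register` and the dict-based store are hypothetical. One detail to keep in mind: 16 bytes do not fill 22 Base64 characters exactly, so the last character carries only 2 bits and can take just four values, which means the 7 characters hold about 38 bits, not 42.

```python
import base64
import hashlib
from typing import Dict


def candidate_code(long_url: str) -> str:
    """First 5 plus last 2 characters of the 22-character Base64 MD5 string."""
    digest = hashlib.md5(long_url.encode("utf-8")).digest()
    encoded = base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")
    return encoded[:5] + encoded[-2:]


def register(store: Dict[str, str], long_url: str) -> bool:
    """Store the mapping; return False if the code already maps to another URL."""
    code = candidate_code(long_url)
    existing = store.setdefault(code, long_url)
    return existing == long_url


store: Dict[str, str] = {}
print(candidate_code("https://example.com/a"), register(store, "https://example.com/a"))
```

The slice is deterministic, so the same URL always maps to the same code, but two different URLs can still produce the same 7 characters. The `register` check is where such a collision would be detected and handled.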

What I meant by analytics is that we can store the number of times a link is visited and the last time it was visited.
Then we can find the links with the fewest visits in 10 years, links that have not been used for 10 years, or links that have been used only 10-20 times, and remove them.

Another option is to check whether the corresponding long URL is still active. If not, we can delete those links.

In this way, we are keeping popular URLs forever and eliminating unwanted URLs.
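A sketch of the visit-count and last-visit filter, assuming those two fields plus a creation date are stored per link. The `LinkStats` fields, the thresholds, and the rule that only links older than the window are considered are assumptions added for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List


@dataclass
class LinkStats:
    code: str
    created_at: datetime
    visits: int
    last_visited: datetime


def removal_candidates(stats: List[LinkStats], now: datetime,
                       window_days: int = 3650, min_visits: int = 20) -> List[str]:
    """Codes older than the window that are idle or were rarely used."""
    cutoff = now - timedelta(days=window_days)
    return [
        s.code for s in stats
        if s.created_at < cutoff
        and (s.last_visited < cutoff or s.visits < min_visits)
    ]


def utc(year: int) -> datetime:
    return datetime(year, 1, 1, tzinfo=timezone.utc)


stats = [
    LinkStats("old-idle", utc(2010), 500, utc(2012)),
    LinkStats("old-popular", utc(2010), 5000, utc(2023)),
    LinkStats("old-rare", utc(2011), 5, utc(2023)),
    LinkStats("new-quiet", utc(2023), 3, utc(2023)),
]
print(removal_candidates(stats, utc(2024)))
```

Note that `old-rare` is flagged even though it was clicked recently, which is the kind of link the "keep them forever" argument above is worried about breaking.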

What are your views on this architecture?

@naingaungphyo Please refer to section 5. Shortening Algorithm for the explanation.

@mayankjoshi Sorry for the misunderstanding.

I was asking @ameliagapin that question.

Can you tell me the logic behind the uniqueness of those links at bit.ly?

See, even I am trying to figure this out. I have proposed an architecture design to @ameliagapin. I will update you once I'm certain.
