re: Designing a URL shortening service from scratch to scale million of users.

re: First Question: We are adding an expiration field so, that in case a user wants the links to be active for a certain time, then he can provide it. ...
 

So, for the first point, ten years is definitely a long time, but things can also last a long time on the internet. Letting the user set this date invites a lot of shorter-term URLs that end up expiring fairly quickly (months?). There's a reason we keep them forever at Bitly.

If you're removing expired links from your database, you can't even return a 410, you have to return a 404. And in that case, you're going to eventually end up with a lot of links on your domain that are 404ing. That doesn't inspire much confidence in the average user who is just clicking on your shortened links. Eventually, they learn that links on your domain are a crapshoot. Especially because the 404 is coming from you, not the original URL's host.
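To make the 410 vs. 404 point concrete, here is a minimal sketch. The `LINKS` store, the `resolve` function, and the choice of 301 for a live link are all assumptions for illustration, not anything Bitly or the original post specifies.

```python
from datetime import datetime, timezone
from typing import Optional, Tuple

# Hypothetical in-memory store: short code -> (long URL, optional expiry time).
LINKS = {
    "abc1234": ("https://example.com/a", None),
    "old5678": ("https://example.com/b", datetime(2020, 1, 1, tzinfo=timezone.utc)),
}


def resolve(code: str, now: Optional[datetime] = None) -> Tuple[int, Optional[str]]:
    """Return (HTTP status, redirect target) for a short code."""
    now = now or datetime.now(timezone.utc)
    record = LINKS.get(code)
    if record is None:
        # The row never existed or was deleted; the server cannot tell which,
        # so 404 is the only honest answer.
        return 404, None
    long_url, expires_at = record
    if expires_at is not None and expires_at <= now:
        # The row was kept after expiry, so the server knows the link is gone.
        return 410, None
    return 301, long_url  # assumed redirect status for a live link
```

While the expired row is still stored, `resolve("old5678")` can answer 410. Once that row is deleted, the same request falls into the first branch and looks exactly like a code that never existed.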

Over years and years, you're going to be using a lot of storage for the links, yes, but if you're also keeping analytics data, that is A TON more data. If you want to keep your data storage smaller, you could keep all links, but set a limit on the analytics data you keep to something like 2-3 years.
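As a rough sketch of that idea, assuming click events are stored separately from the links themselves (the `RETENTION` value, the event shape, and `prune_clicks` are hypothetical):

```python
from datetime import datetime, timedelta, timezone

# Assumed retention window of roughly three years for analytics only.
RETENTION = timedelta(days=3 * 365)


def prune_clicks(clicks, now):
    """Keep only click events newer than the retention window.

    The links themselves are not touched; only analytics rows are dropped.
    """
    cutoff = now - RETENTION
    return [c for c in clicks if c["at"] >= cutoff]
```

The link table never shrinks under this policy, so every short URL keeps resolving; only the per-click history ages out.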

Anyway, we definitely have links at Bitly that I would expect to be used for >10 years.

On the second point, you're not using the entire hash. You're only taking seven random characters from it. You have no guarantee that you didn't happen to end up with those same seven characters from another hashed URL.

Now, to generate a unique short URL, we can calculate the MD5 hash of the long URL, which will produce a 128-bit hash value. Now when we encode the MD5 result to Base64 encoding the resultant string will be 22 characters long.

For choosing the short URL, first, we can randomly swap some character of the Base64 encoding result and then pick any 7 characters randomly from the result.

This is essentially random for all intents and purposes. You started with something non-random, a hash, but then by swapping characters and randomly selecting any 7 of them, you ended up at a result that is no better than random.

Nothing there prevents any two links from ending up with the same seven characters in the backhalf.
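To put a rough number on that, here is a birthday-bound estimate. It assumes the seven characters behave like uniform random picks from a 64-character alphabet, which is only an approximation of the swap-and-pick scheme; the function name is made up for this sketch.

```python
import math

ALPHABET_SIZE = 64   # Base64 alphabet
CODE_LENGTH = 7
SPACE = ALPHABET_SIZE ** CODE_LENGTH  # 64^7 = 2^42 possible back-halves


def collision_probability(n_links: int, space: int = SPACE) -> float:
    """Birthday approximation: 1 - exp(-n(n-1) / (2 * space))."""
    return -math.expm1(-n_links * (n_links - 1) / (2 * space))


for n in (100_000, 1_000_000, 10_000_000):
    print(f"{n:>10} links -> P(at least one collision) ~ {collision_probability(n):.4f}")
```

Under those assumptions, the chance of at least one collision is already around one in ten at a million links, and it is close to certain by ten million.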

I also wonder about the random picking of 7 characters.
Can you tell me what logic bit.ly uses to keep those links unique?


This is essentially random for all intents and purposes. You started with something non-random, a hash, but then by swapping characters and randomly selecting any 7 of them, you ended up at a result that is no better than random

@ameliagapin I understood what you are trying to say.

What about this approach: take the first 5 characters of the 22-character string produced by hashing and encoding, and the remaining 2 characters from the end of that string?
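For reference, a sketch of what that proposal computes. The URL-safe Base64 variant and the name `short_code` are assumptions; the thread only says "Base64".

```python
import base64
import hashlib


def short_code(long_url: str) -> str:
    """First 5 + last 2 characters of the unpadded Base64 MD5 digest."""
    digest = hashlib.md5(long_url.encode("utf-8")).digest()  # 16 bytes = 128 bits
    # 16 bytes encode to 24 characters with padding, 22 without.
    encoded = base64.urlsafe_b64encode(digest).decode("ascii").rstrip("=")
    return encoded[:5] + encoded[-2:]


print(short_code("https://example.com/some/long/path"))
```

This version is deterministic, so the same long URL always gets the same code. It still maps a 128-bit hash onto seven characters, so two different URLs can share a code. One detail to watch: the 22nd Base64 character carries only 2 bits of the digest (128 = 21 × 6 + 2), so it can only be one of `A`, `Q`, `g`, `w`, and the code has about 38 bits of variety instead of 42.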

What I meant by analytics is that we can store the number of times a link is visited and the last time it was visited.
Then we can find the links that have not been used at all for 10 years, or that have been used only 10-20 times in 10 years, and remove them.

Another option is to check whether the corresponding long URL is still active. If it is not, we can delete the link.

In this way, we are keeping popular URLs forever and eliminating unwanted URLs.
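The visit-based rule above could look something like this. It is only a sketch: `LinkStats`, `removal_candidates`, and the thresholds are hypothetical, and `created_at` is an added field so that new links with few visits are not swept up.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class LinkStats:
    code: str
    created_at: datetime
    visits: int
    last_visited: datetime


def removal_candidates(stats, now, window=timedelta(days=3650), min_visits=20):
    """Return codes that were idle for the whole window, or that are older
    than the window and have at most `min_visits` visits in total."""
    cutoff = now - window
    candidates = []
    for s in stats:
        idle = s.last_visited < cutoff
        old_and_rare = s.created_at < cutoff and s.visits <= min_visits
        if idle or old_and_rare:
            candidates.append(s.code)
    return candidates
```

Note that this only selects candidates. Whether to actually delete them is the open question in this thread, since a deleted link can only answer with a 404.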

What are your views on this architecture?

@naingaungphyo Please refer to section 5. Shortening Algorithm for the explanation.

@mayankjoshi Sorry for the misunderstanding.

That question was meant for @ameliagapin.

Can you tell me what logic bit.ly uses to keep those links unique?

I am trying to figure this out too. I have proposed an architecture design to @ameliagapin, and I will update you once I'm certain.
