What is a Tiny URL
It's a service that takes a long URL like https://www.example.com/averylongpathoftheurlthatneedstobeshortened and converts it to something like https://sho.rt/34ssd21. Whenever someone visits the short URL, they are redirected to the original URL.
Advantage
With a short URL, it becomes easier to share and remember links (provided you can define your own endpoint).
Disadvantages
The one problem that arises with this is the risk of not knowing where you are being redirected to. This can easily be used to phish people.
Requirements
- Generate a short URL for a given long URL. We need to ensure the short URLs are unique; if not, users can end up on websites they were not meant to see (like your portfolio).
- Be able to see the number of clicks of the short URL generated.
Performance Considerations
- The median URL has around 10k clicks, and a popular URL can have more than a million clicks
- We may have to support 1 trillion short URLs at a time (yes, it's trillion)
- The average size of a stored URL entry (long URL plus metadata) would be around 1 KB, so the total space required would be:
1 trillion * 1 KB = 1 PB
- We will optimize for reads, since most people will be following short URLs rather than generating them
Link generation
In order to generate a shorter link for the provided long link, we can use a hash function which takes a few parameters and generates a short URL path for us.
Example hash function
hash(long_url, user_id, upload_timestamp)
- long_url: The URL that needs to be shortened
- user_id: The user who is shortening the URL
- upload_timestamp: The time at which the short url is created
Now we need to accommodate a lot of URLs while avoiding collisions between links, so we need to determine the length of the short URL path (i.e. https://www.sho.rt/<hash_function_output>).
We will use lowercase letters and digits to generate the short URL path, so we have 36 characters in total (10 digits + 26 letters).
If the length of the short URL path is n, then we have 36^n possibilities. In our earlier assumption we aimed to store 1 trillion URLs.
For n = 8:
Number of possibilities = 36^8 ≈ 2.8 trillion
With n = 8, we have enough possibilities.
So our hashed URL would look something like this:
hash(long_url, user_id, upload_timestamp) = https://www.sho.rt/2fjh7rw6
The short URL path 2fjh7rw6 has a length of 8.
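To make this concrete, here is a minimal sketch of such a hash function in Python. The choice of SHA-256 and the `|` separator are illustrative assumptions; any hash that distributes well over the 36-character alphabet would do.

```python
import hashlib

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"  # 10 digits + 26 lowercase letters
CODE_LENGTH = 8  # 36^8 possibilities, enough headroom for ~1 trillion URLs

def encode_base36(number: int) -> str:
    """Encode an integer as a fixed-length base-36 string."""
    chars = []
    for _ in range(CODE_LENGTH):
        number, remainder = divmod(number, 36)
        chars.append(ALPHABET[remainder])
    return "".join(reversed(chars))

def shorten(long_url: str, user_id: str, upload_timestamp: int) -> str:
    """Hash (long_url, user_id, upload_timestamp) into an 8-character base-36 path."""
    payload = f"{long_url}|{user_id}|{upload_timestamp}".encode()
    digest = int.from_bytes(hashlib.sha256(payload).digest(), "big")
    return encode_base36(digest % 36 ** CODE_LENGTH)

print("https://www.sho.rt/" + shorten("https://www.example.com/a/very/long/path", "user-42", 1700000000))
```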
Now, what should we do to deal with hash collisions? Since we are storing the data in a DB table, we can probe for the next available hash.
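A minimal sketch of that probing step, building on the code above: treat the 8-character code as a base-36 number and keep adding 1 (wrapping around at the end of the keyspace) until a free slot is found. The in-memory set here stands in for the DB lookup.

```python
def decode_base36(code: str) -> int:
    """Inverse of encode_base36: turn an 8-character code back into an integer."""
    number = 0
    for ch in code:
        number = number * 36 + ALPHABET.index(ch)
    return number

def assign_code(long_url: str, user_id: str, ts: int, taken: set) -> str:
    """Probe forward from the hashed code until an unused one is found (the set stands in for the DB)."""
    code = shorten(long_url, user_id, ts)
    while code in taken:
        # Next candidate: increment the base-36 value by 1, wrapping around.
        code = encode_base36((decode_base36(code) + 1) % 36 ** CODE_LENGTH)
    taken.add(code)
    return code
```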
Assigning URLs (Writes)
Now we will try to decide on strategies to store the generated short URL.
Replication
We want to maximize our write throughput wherever possible even though our overall system would be optimized for reads.
Now let's assume we are using multi-leader replication with 2 masters.
User 1 creates a link with hash abc on Master 1, and User 2 creates the same hash abc on Master 2.
After some time both masters sync up and discover that hash abc exists on both of them. To resolve this conflict the system uses a Last Write Wins policy, so the value stored for abc becomes def.com (uploaded by User 2) and the value xyz.com (set by User 1) is lost.
When User 3, who is expecting xyz.com for hash abc, instead gets def.com, this leads to confusion.
The above issue can cause a lot of confusion, so we will go with Single Leader Replication for our case.
With a single leader we won't have the data-loss issue described above. We do now have a single point of failure for writes, but since our system is read heavy, we should be fine with single leader replication.
Caching
We could use a cache to speed up our writes by first writing to the cache and then flushing those values to our database.
This approach has the same issue of the same hash being generated on two different caches.
So, using a cache for writes would not make much sense here.
Partitioning
Partitioning is an important aspect of our design, since we can use it to improve our reads and writes by distributing the load across different nodes.
Since the short URL we generate is a hash, we can use range-based partitioning; the generated short URLs should be spread relatively evenly across the ranges.
If a hash is generated for a long URL but the node it maps to already has an entry for that short URL, we can use probing to deal with it. The advantage of probing is that similar hashes end up on the same node.
We also need to keep track of the consistent hashing ring on a coordination service, to minimize reshuffling when the cluster size changes.
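A minimal sketch of range-based partitioning on the first character of the short URL; the four-node layout and range boundaries are illustrative assumptions. Since the codes come out of a hash, each range should receive a roughly even share of keys.

```python
# Illustrative four-node layout; in practice the ranges would live in the coordination service.
PARTITION_RANGES = [
    ("0", "8", "node-0"),  # codes starting with 0-8
    ("9", "h", "node-1"),  # 9, a-h
    ("i", "q", "node-2"),  # i-q
    ("r", "z", "node-3"),  # r-z
]

def pick_node(short_code: str) -> str:
    """Route a short code to the node owning its range (keyed on the first character)."""
    first = short_code[0]
    for low, high, node in PARTITION_RANGES:
        if low <= first <= high:
            return node
    raise ValueError(f"{first!r} is outside the base-36 alphabet")

print(pick_node("2fjh7rw6"))  # -> node-0
```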
Single Node (Master Node)
In single leader replication, we have a single write node. Even so, there can be cases where 2 users end up generating the same hash. What should be done in that situation? We could use locking, but the hash entry does not exist in the DB yet.
To deal with this we have some workarounds.
Predicate Locks
What are predicate locks? A predicate lock is a shared lock on all rows that satisfy some condition, including rows that don't exist yet.
For example,
We have the below query
SELECT * FROM urls WHERE shorturl = 'dgf4221d'
If user1 tries to run this query while user2 already holds a lock covering the condition in the query, then user1 needs to wait for user2 to finish.
So we can lock the rows that don't exist yet and then insert the required row. If another user wants to create the same entry, they will not be able to do so, since the first user's predicate lock already covers that row.
Example query to create a lock on the row (for Postgres):
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
-- Check if the short URL exists (this will acquire a predicate lock)
SELECT * FROM urls WHERE shorturl = 'dgf4221d';
-- If no rows are found, insert
INSERT INTO urls (shorturl, longurl) VALUES ('dgf4221d', 'https://www.example.com/averylongpath');
COMMIT;
To improve query speed, we can add an index on the short URL column, which makes the predicate query much faster (O(log N) instead of O(N)).
We can also use a stored procedure to deal with hash collisions: if a user hits a collision, the procedure probes for the next available hash and stores the long URL there.
Materializing conflicts
We can pre-populate the DB with all the possible rows so that users can lock onto a row when they are trying to write.
Let's calculate the total space required to store all possible short URLs:
No. of rows ≈ 2.8 trillion
Size of 1 character = 1 byte
Length of each short URL = 8
Total space required ≈ 2.8 trillion * 1 byte * 8 characters ≈ 22 TB
22 TB is not a lot of storage, considering even personal PCs ship with 1 TB drives these days.
When a user is updating an existing entry, the DB will lock that row, and if another user tries to update the same row they will have to wait.
Engine Implementation
We don't need range-based queries, since we mostly query a single row while reading or writing, so a hash index would be much faster. We also cannot use an in-memory DB, since we have around 22 TB of data.
We have 2 choices: LSM tree + SSTable, and B-Tree.
| Feature | LSM Tree + SSTable | B-Tree |
|---|---|---|
| Write Performance | ✅ High (sequential writes, batched flushing) | ❌ Lower (random disk writes for every insert/update) |
| Read Performance | ❌ Moderate (requires merging multiple SSTables) | ✅ High (direct lookup using tree traversal) |
| Write Amplification | ✅ Lower (writes are buffered and flushed in bulk) | ❌ Higher (writes propagate across multiple levels) |
| Read Amplification | ❌ Higher (may need to scan multiple SSTables) | ✅ Lower (direct path to data via tree traversal) |
| Storage Efficiency | ✅ Better (compaction reduces fragmentation) | ❌ Less Efficient (fragmentation due to node splits) |
| Compaction Needed? | ✅ Yes (merging SSTables to optimize reads) | ❌ No (data is structured directly in a balanced tree) |
| Durability | ✅ Yes (WAL ensures no data loss before flushing to SSTables) | ✅ Yes (data stored directly in the tree structure) |
| Concurrency Control | ✅ Better (writes are append-only, reducing contention) | ❌ More Locking Needed (modifies multiple nodes in-place) |
| Disk I/O | ✅ Optimized (sequential writes, fewer random writes) | ❌ More I/O (random writes and in-place updates) |
| Use Case | 🔹 Write-heavy workloads (NoSQL, Logs, Streaming data) | 🔹 Read-heavy workloads (Relational Databases, Indexes) |
| Examples | 🔹 Apache Cassandra, LevelDB, RocksDB, Bigtable | 🔹 MySQL (InnoDB), PostgreSQL, Oracle DB |
Considering all of the above points and the requirements of our system, it makes a lot of sense to use a B-Tree based engine.
Database Choice
So far we have made the following choices
- Single leader replication
- Partitioning
- B-Tree based
Keeping all of these things in mind, we can use a relational DB, and since we don't need to make distributed queries we can choose MySQL for our case (we could also go with PostgreSQL here).
Maximizing read speeds
We can replicate data on multiple nodes in order to handle the user traffic.
We could possibly get stale data from a replica, but we can fall back to the leader on a null result (this can put more load on the leader).
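A minimal sketch of that read path, with plain dictionaries standing in for the replica and leader database connections; the class and function names are illustrative.

```python
class UrlStore:
    """Stands in for a database node; a real system would use a DB client instead."""
    def __init__(self, rows=None):
        self.rows = dict(rows or {})

    def get(self, short_code):
        return self.rows.get(short_code)

leader = UrlStore({"2fjh7rw6": "https://www.example.com/a/very/long/path"})
replica = UrlStore()  # lagging: has not received the row yet

def resolve(short_code):
    """Serve reads from the replica, falling back to the leader on a null result."""
    long_url = replica.get(short_code)
    if long_url is None:
        long_url = leader.get(short_code)  # extra load on the leader, but hides replication lag
    return long_url

print(resolve("2fjh7rw6"))
```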
Handling Hotlinks
There may be some links that are accessed by a much larger number of people than the other links. These links can be called hotlinks.
A caching layer can help mitigate a lot of load.
- Caching layer can be scaled independently of DB
- Partitioning the cache by shortURL should lead to fewer cache misses
Populating cache
We can't push every link to the cache since we don't know which links will be popular. So what should the approach be?
Use write-back
- Writes go to the cache first and are flushed to the DB later. This can cause write conflicts, since there could be multiple entries for the same short URL on different caches, and we risk losing data if a cache goes down before flushing.
Use write-through
- Every write goes to both the cache and the DB. This slows down the write path, since we are now waiting for the DB to finish the update as well, and it fills the cache with links that may never be read.
Use write-around
- We update the DB first, and on a cache miss during a read we populate the cache. This increases the cache miss rate initially, but eventually gives us better reads (see the sketch below).
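A minimal sketch of write-around caching for hotlinks, with in-process dicts standing in for the cache and the relational store; a real deployment would use something like Redis or Memcached, partitioned by short URL.

```python
cache = {}  # stands in for Redis/Memcached, partitioned by short URL in practice
db = {}     # stands in for the relational store

def create_link(short_code, long_url):
    """Write-around: writes go to the DB only; the cache is untouched."""
    db[short_code] = long_url

def lookup(short_code):
    """Reads populate the cache lazily on a miss, so only links that are actually accessed get cached."""
    if short_code in cache:
        return cache[short_code]
    long_url = db.get(short_code)
    if long_url is not None:
        cache[short_code] = long_url  # subsequent reads for this hot link are served from the cache
    return long_url

create_link("2fjh7rw6", "https://www.example.com/a/very/long/path")
print(lookup("2fjh7rw6"))  # first read: cache miss, hits the DB and fills the cache
print(lookup("2fjh7rw6"))  # second read: served from the cache
```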
Analytics Solutions
We will now discuss the ways to update the number of clicks on a particular shortURL
Naive Approach
We keep a clicks counter per row and just increment it on every visit.
If 2 users read the same row at the same time, we can have a race condition: both users try to increment the counter at the same time, which can lead to wrong counts.
We could use something like a lock, or an atomic increment operation per row, as in the sketch below.
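A minimal sketch of the difference, using threads against an in-memory counter; the lock plays the role of the per-row lock or atomic increment (e.g. an UPDATE that adds 1 in place) that the database would provide.

```python
import threading

clicks = {"2fjh7rw6": 0}
lock = threading.Lock()

def record_click_unsafe(code):
    clicks[code] = clicks[code] + 1  # read-modify-write: two readers can see the same value and lose an update

def record_click_safe(code):
    with lock:  # plays the role of a per-row lock / atomic increment in the DB
        clicks[code] = clicks[code] + 1

def worker():
    for _ in range(1000):
        record_click_safe("2fjh7rw6")

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(clicks["2fjh7rw6"])  # 4000 with the safe version; the unsafe one can come up short
```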
Stream Processing
We can place the individual click events somewhere that doesn't require grabbing locks, and then aggregate them later.
We can explore the below options
DB - relatively slow
In-memory DB - super fast but not durable
Log-based messaging - we can write to an append-only log (like a write-ahead log), which is durable and can be processed at a later point
Now we can place the click data in some sort of queue, like Kafka.
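A minimal sketch of producing click events to such a queue, assuming the kafka-python client; the broker address and the topic name clicks are illustrative.

```python
import json
import time

from kafka import KafkaProducer  # assumes the kafka-python package is installed

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",             # illustrative broker address
    value_serializer=lambda v: json.dumps(v).encode(),
)

def record_click(short_code: str) -> None:
    """Append a click event to the log instead of locking a DB row on the hot path."""
    event = {"short_url": short_code, "ts": time.time()}
    # Keying by short URL sends all clicks for a code to the same partition,
    # which the "one publisher per row" idea later relies on.
    producer.send("clicks", key=short_code.encode(), value=event)

record_click("2fjh7rw6")
producer.flush()
```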
Click consumer
We have the following options to process the clicks data from the queue
HDFS + Spark - Batch jobs to aggregate the clicks data would be too infrequent, since we would need to dump the clicks data to HDFS first and then process it.
Flink - Processes each event individually in real time; may send too many writes to the database depending on the implementation.
Spark Streaming - Processes events in configurable mini-batch sizes; unlike Flink, this does not put a lot of load on the DB.
Considering our scenario and how frequently we want to update the clicks data, we can choose Spark Streaming, since it processes the data in batches, which keeps the load on the DB low while still updating it frequently.
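A minimal sketch of that consumer using PySpark Structured Streaming, assuming the clicks topic from before, JSON events with a short_url field, and the Spark-Kafka connector being available; the 30-second trigger and the update statement in the comment are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, count, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("click-aggregator").getOrCreate()

schema = StructType([StructField("short_url", StringType())])

clicks = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "clicks")
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("event"))
    .select("event.short_url")
)

def write_batch(batch_df, batch_id):
    # Count clicks per short URL within the micro-batch and push one aggregated
    # increment per URL to the DB, instead of one write per click.
    increments = batch_df.groupBy("short_url").agg(count("*").alias("clicks")).collect()
    for row in increments:
        # In a real job this would issue an idempotent update, e.g.
        # UPDATE urls SET clicks = clicks + <row.clicks> WHERE shorturl = <row.short_url>
        print(batch_id, row.short_url, row.clicks)

query = (
    clicks.writeStream.trigger(processingTime="30 seconds")  # mini-batch size is configurable
    .foreachBatch(write_batch)
    .start()
)
query.awaitTermination()
```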
Stream consumer frameworks let us ensure exactly-once processing of events via checkpointing or queue offsets.
Process Exactly once
The queue and Spark Streaming guarantee that each event is processed exactly once, but when we try to update the DB with the click data we might still face issues.
Say Spark Streaming wants to increase the click count by 100 and sends a request to the DB. The DB applies the change, but the Spark Streaming service goes down before receiving the acknowledgment. When the service comes back up, it does not know the previous 100-click update succeeded, so it pushes the same data to the DB again.
Now we have a couple of options for this:
Two-phase commit - This is too slow for our system, since we would need a coordinator node to run the two-phase commit between the Spark Streaming service and the DB.
Idempotency keys - Every update carries an idempotency key; if the same key has already been applied, we reject the update.
This scales poorly if there are multiple publishers for the same row, i.e. lots of updates from multiple Spark Streaming services to the same row.
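A minimal sketch of the idempotency-key check, with a dict standing in for the DB row and its last applied key; the key format (consumer id plus batch id) is an assumption.

```python
rows = {"2fjh7rw6": {"clicks": 0, "last_key": None}}  # stands in for the DB row

def apply_increment(short_code, delta, idempotency_key):
    """Apply a click-count update only if this idempotency key has not been seen for the row."""
    row = rows[short_code]
    if row["last_key"] == idempotency_key:
        return False  # duplicate delivery after a crash/retry: reject it
    row["clicks"] += delta
    row["last_key"] = idempotency_key
    return True

print(apply_increment("2fjh7rw6", 100, "consumer-1:batch-17"))  # True, applied
print(apply_increment("2fjh7rw6", 100, "consumer-1:batch-17"))  # False, retried duplicate rejected
print(rows["2fjh7rw6"]["clicks"])  # 100
```

Storing only the last key per row works when a single publisher sends updates in order; with multiple publishers you would need to keep many keys per row, which is exactly the scaling problem mentioned above and the motivation for the next section.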
One Publisher per row
Imagine our data is growing and we need to process all the clicks data. To handle the traffic we have multiple Kafka queues and Spark Streaming consumers, but now when we update the data we might need to store multiple idempotency keys per row in order to apply updates correctly, which is not a great idea.
Instead, we can partition our Kafka queues and Spark Streaming consumers by short URL, ensuring that only one consumer publishes clicks for a given short URL at a time.
Benefits
- Fewer idempotency keys to store
- No need to grab locks on publish step
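A minimal sketch of why keying by short URL gives us one publisher per row: Kafka-style partitioners map a key to a partition deterministically (here with SHA-1 for illustration; Kafka's default partitioner uses murmur2), so every click for a given code lands on the same partition and therefore the same consumer.

```python
import hashlib

NUM_PARTITIONS = 12  # illustrative partition count

def partition_for(short_code: str) -> int:
    """Deterministically map a short URL to a partition."""
    digest = int.from_bytes(hashlib.sha1(short_code.encode()).digest(), "big")
    return digest % NUM_PARTITIONS

# Every click for the same code goes to the same partition, hence the same consumer.
print(partition_for("2fjh7rw6"), partition_for("2fjh7rw6"))
```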
Deleting Expired Links
We can run a relatively inexpensive batch job every x hours to check for expired links.
- It only has to grab locks on the expired rows it deletes, which are unlikely to be the rows commonly being read.
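A minimal sketch of such a batch job using an in-memory SQLite table; the expires_at column and the chunked delete are assumptions about the schema, and in production this would run on a schedule (cron or similar) against the real store.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (shorturl TEXT PRIMARY KEY, longurl TEXT, expires_at REAL)")
conn.execute(
    "INSERT INTO urls VALUES ('2fjh7rw6', 'https://www.example.com/a/very/long/path', ?)",
    (time.time() - 3600,),  # already expired an hour ago
)

def delete_expired(batch_size=1000):
    """Delete expired links in small chunks so we never hold locks on many rows at once."""
    while True:
        expired = conn.execute(
            "SELECT shorturl FROM urls WHERE expires_at < ? LIMIT ?",
            (time.time(), batch_size),
        ).fetchall()
        if not expired:
            break
        conn.executemany("DELETE FROM urls WHERE shorturl = ?", expired)
        conn.commit()

delete_expired()
print(conn.execute("SELECT COUNT(*) FROM urls").fetchone()[0])  # 0
```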
### High Level Diagram
Conclusion
Now with this system we should be able to support the initial numbers that we have quoted. Do let me know what can be improved in this system.
Credits
https://youtu.be/5V6Lam8GZo4?list=PLjTveVh7FakJOoY6GPZGWHHl4shhDT8iV