DEV Community: Andrei

Lean System Design: URL Shortener

Andrei — Tue, 23 Jun 2026 12:00:00 +0000

Hey!

As I wrote in the previous post, I became really interested in Lean System Design – real-life system design, not forced to satisfy Google-scale requirements.

Today, I want to review a classic example of a system design interview question – URL Shortener.

The task is simple – we need to design an application that converts long URLs to short ones, which can then be used for redirects to the original URLs. The goal is to have a short URL that can be easily shared as text or a readable QR code.

Common examples of such services are Bitly and TinyURL.

So we have just two use cases here: creating a tiny URL and opening a tiny URL.

And of course, we have non-functional requirements. Let's use a popular book by Alex Xu, System Design Interview, as a reference for what these requirements are and the canonical expected interview solution for this task.

Requirements

Interview requirements:

100 million URLs generated per day
The resulting URLs should be as short as possible
High availability, *scalability*, etc.

And we have some assumptions we can make out of these requirements to have a better understanding of numbers:

100 million URLs per day is about 1160 writes per second
We assume the read-to-write ratio is 10:1, so we'll have 11600 reads per second
We assume our service will run for 10 years, so we'll have 365 billion records, and with an average URL length of 100 chars, we'll need about 36.5 TB of storage (365 billion × 100 bytes)

OK, looks solid.

But are these requirements and assumptions realistic?

Let's see what we know about the biggest players on the market – Bitly and TinyURL.

Bitly, on their main page, says they have 256M Links & QR Codes created monthly
Bitly in 2022 had 5.7M monthly active users and 500K+ customers, according to their press release
Bitly also says in the same press release that they have "10 Billion+ Clicks and QR Code Scans Every Month"
TinyURL says it's trusted by ~3.9M users
TinyURL has created 30 billion short URLs over 23 years

What can we see here? Well, first of all, the biggest player on the market – Bitly – has just 256M URLs created monthly. That's only about 8 million URLs generated per day. And their 10 billion monthly clicks and scans come to only about 330 million redirects per day.

And these numbers convert to:

93 writes per second
3820 reads per second
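
As a quick check of the arithmetic above, here is a minimal Go sketch that converts daily counts into average per-second rates. The function name `perSecond` is made up for illustration, and the inputs are the rounded daily figures from the text, so the results are averages, not peaks.

```go
package main

import "fmt"

// secondsPerDay is the number of seconds in a 24-hour day.
const secondsPerDay = 24 * 60 * 60

// perSecond converts a daily event count into an average per-second rate.
func perSecond(perDay float64) float64 {
	return perDay / secondsPerDay
}

func main() {
	// Rounded daily figures taken from the text above.
	fmt.Printf("writes/s: %.0f\n", perSecond(8_000_000))
	fmt.Printf("reads/s:  %.0f\n", perSecond(330_000_000))
}
```

This prints 93 and 3819, in line with the rounded numbers in the list.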


So, the biggest player on the market, whose scale I believe is hard to reach from a business perspective, has a noticeably smaller load than the interview task requires. If you want to build something similar in your company, your scale will probably be even smaller than that.

Note: there are services that shorten basically every URL they see, like Twitter/X. They shorten any URL any user posts, and they are probably the biggest player here, but this is an internal feature, and they don't share usage statistics, so I can't use it here for comparison. Also, you are usually not forced to shorten every URL your system sees. Twitter is a bit special, and there is no big reason to follow their decision.

Interview System Design

How can we design such a system?

There are two classical solutions to this, both described in Alex Xu's book:

Hash the long URL, insert the result into a relational DB, and retry on collision (appending a predefined string to the URL and re-hashing until the collision is gone)
Use a distributed unique ID generator, and convert these IDs to Base62

Both of these options use a fixed-length ID to support the maximum number of URLs that can be generated (over 10 years) and to keep short URLs short.

I agree that both will work, and they are good answers for the system design interview nowadays. But what is not perfect in these options?

Hashing

The known issue with this option is that you need to check whether the hash is already in use. To do this efficiently, you will need an index on your DB. The ideal index here is something with a bloom filter – a probabilistic data structure which allows you to check if some element is definitely not in the set (or there is a chance it is in the set). However, a B-tree index will obviously work here as well.
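
To make the "definitely not in the set" property concrete, here is a minimal Bloom filter sketch in Go. It is an illustration only: the type and function names, the bit-array size, the number of probes, and the FNV-based double hashing are all my assumptions, not something taken from a particular database.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bloom is a fixed-size Bloom filter that uses double hashing.
type bloom struct {
	bits []uint64
	m    uint64 // number of bits in the filter
	k    uint64 // number of probes per key
}

func newBloom(m, k uint64) *bloom {
	return &bloom{bits: make([]uint64, (m+63)/64), m: m, k: k}
}

// hashes derives two base hashes from one 64-bit FNV-1a digest.
func hashes(key string) (uint64, uint64) {
	h := fnv.New64a()
	h.Write([]byte(key))
	sum := h.Sum64()
	// The second hash is forced to be odd so it is never zero.
	return sum & 0xffffffff, sum>>32 | 1
}

// Add sets the k bits that belong to key.
func (b *bloom) Add(key string) {
	h1, h2 := hashes(key)
	for i := uint64(0); i < b.k; i++ {
		pos := (h1 + i*h2) % b.m
		b.bits[pos/64] |= 1 << (pos % 64)
	}
}

// MayContain returns false only if key was definitely never added.
// A true result means "possibly present" and still needs a real lookup.
func (b *bloom) MayContain(key string) bool {
	h1, h2 := hashes(key)
	for i := uint64(0); i < b.k; i++ {
		pos := (h1 + i*h2) % b.m
		if b.bits[pos/64]&(1<<(pos%64)) == 0 {
			return false
		}
	}
	return true
}

func main() {
	b := newBloom(1<<20, 4)
	b.Add("aZ3kQ9x")
	fmt.Println(b.MayContain("aZ3kQ9x")) // an added key is always reported as possibly present
	fmt.Println(b.MayContain("unused"))  // usually false, but false positives are possible
}
```

A negative answer lets you skip the DB lookup entirely; a positive answer still has to be confirmed against the real data.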

So, we can use the index to speed up the check for whether the hash is already in use. But at the same time, the DB will need to maintain this index, which will decrease insert speed.
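
The hash-and-retry flow could look roughly like the Go sketch below. Everything specific in it is an assumption made for illustration: the in-memory `store` map stands in for a table with a unique index, and the 7-character code length, the SHA-256 hash, and the `#attempt` suffix used to change the hash input are placeholders, not the book's or any library's exact scheme.

```go
package main

import (
	"crypto/sha256"
	"encoding/base64"
	"errors"
	"fmt"
)

const codeLen = 7 // hypothetical short-code length

// store is an in-memory stand-in for a table with a unique index on the code.
type store map[string]string

// shortCode hashes the input and keeps the first codeLen characters.
func shortCode(input string) string {
	sum := sha256.Sum256([]byte(input))
	return base64.RawURLEncoding.EncodeToString(sum[:])[:codeLen]
}

// shorten stores longURL under a short code and retries with a modified
// hash input when the code is already taken by a different URL.
func (s store) shorten(longURL string, maxRetries int) (string, error) {
	input := longURL
	for attempt := 0; attempt <= maxRetries; attempt++ {
		code := shortCode(input)
		existing, taken := s[code] // this is the lookup the index speeds up
		if !taken {
			s[code] = longURL
			return code, nil
		}
		if existing == longURL {
			return code, nil // the same URL is already stored under this code
		}
		// Collision with another URL: change the input and hash again.
		input = fmt.Sprintf("%s#%d", longURL, attempt)
	}
	return "", errors.New("too many collisions")
}

func main() {
	s := store{}
	code, err := s.shorten("https://example.com/some/long/path", 3)
	fmt.Println(code, err)
}
```

Every insert pays for at least one lookup, which is exactly the cost the paragraph above describes.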

Distributed Unique ID Generator

The Distributed Unique ID Generator option isn't that bad. We want to use an ID generator here that reduces the chance of ID collisions in a distributed system to zero or nearly zero. One option is to use Snowflake ID.

With this option, you don't need to pay a lot of attention to collisions or maintaining your index. And this is a good direction if you really need to scale. But let's see what we can achieve without going distributed.
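
For reference, here is a minimal Snowflake-style generator in Go. It is a sketch under stated assumptions: it uses the commonly described layout of 41 timestamp bits, 10 node bits, and 12 sequence bits; the epoch is an arbitrary value I picked; and when the sequence is exhausted it simply advances its logical clock by one millisecond instead of waiting, which is a simplification.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

const (
	nodeBits = 10
	seqBits  = 12
	maxSeq   = 1<<seqBits - 1
)

// epoch is a hypothetical custom epoch; any fixed instant in the past works.
var epoch = time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)

// generator produces IDs that are unique per node value.
// node is assumed to fit into nodeBits (0..1023).
type generator struct {
	mu     sync.Mutex
	node   int64
	lastMs int64
	seq    int64
}

// Next returns the next ID: timestamp | node | sequence.
func (g *generator) Next() int64 {
	g.mu.Lock()
	defer g.mu.Unlock()

	now := time.Since(epoch).Milliseconds()
	if now < g.lastMs {
		now = g.lastMs // never go back in time, even if the clock does
	}
	if now == g.lastMs {
		g.seq = (g.seq + 1) & maxSeq
		if g.seq == 0 {
			now++ // sequence exhausted: borrow the next millisecond
		}
	} else {
		g.seq = 0
	}
	g.lastMs = now
	return now<<(nodeBits+seqBits) | g.node<<seqBits | g.seq
}

func main() {
	g := &generator{node: 1}
	fmt.Println(g.Next())
	fmt.Println(g.Next())
}
```

Uniqueness across machines relies on every instance getting a distinct `node` value, and assigning those values is the coordination problem you take on when going distributed.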

Real-World Design

As we saw, in the real world, we need to support just up to 100 writes per second and up to 4k reads per second. Both numbers look more than achievable on a single machine with a canonical DB.

I have vibecoded a simple solution in Go that uses either SQLite or PostgreSQL to demonstrate the possibilities of a single machine here.

We will not use hashing; we will use only an auto-incrementing primary key, encoded as Base64URL. So, this is a non-distributed version of canonical solution #2.
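
One way to turn a numeric ID into a Base64URL code is sketched below. It is not necessarily what the repo does: the function names and the choice to encode the ID as big-endian bytes with leading zero bytes dropped are my assumptions.

```go
package main

import (
	"encoding/base64"
	"encoding/binary"
	"errors"
	"fmt"
)

// encodeID turns a numeric ID into a Base64URL string.
// Leading zero bytes are dropped so that small IDs give short codes.
func encodeID(id uint64) string {
	var buf [8]byte
	binary.BigEndian.PutUint64(buf[:], id)
	i := 0
	for i < len(buf)-1 && buf[i] == 0 {
		i++
	}
	return base64.RawURLEncoding.EncodeToString(buf[i:])
}

// decodeID reverses encodeID and rejects malformed codes.
func decodeID(code string) (uint64, error) {
	raw, err := base64.RawURLEncoding.DecodeString(code)
	if err != nil {
		return 0, err
	}
	if len(raw) == 0 || len(raw) > 8 {
		return 0, errors.New("invalid short code length")
	}
	var buf [8]byte
	copy(buf[8-len(raw):], raw) // right-align the bytes to restore the value
	return binary.BigEndian.Uint64(buf[:]), nil
}

func main() {
	for _, id := range []uint64{1, 1000, 3_000_000_000} {
		code := encodeID(id)
		back, _ := decodeID(code)
		fmt.Println(id, code, back)
	}
}
```

With this scheme, an ID around 3 billion fits in 4 bytes and therefore encodes to 6 characters, and a redirect becomes a plain primary-key lookup with no collision check.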

To test the solution, I used my home server with a NAS motherboard and an Intel Celeron N5105 (which is very far from high-tier hardware, and it cost me like 150 EUR on AliExpress) and some HDD (not even an SSD).

I've also simulated a pre-filled DB with 3 billion records, which is the expected DB size after 1 year of service usage.

$ ./dbgen -type=sqlite -count=3000000000

Total records: 3,000,000,000
Total time:    1h26m48s
Avg rate:      576,035 records/s
File size:     438.66 GB

And, if we check the results, they will be more than acceptable.

$ ./loadtest -duration=60s -concurrency=50

=== URL Creation ===
Total Requests:  152412
Successful:      152412
Failed:          0
RPS:             2540.20 req/s

=== Redirect ===
Total Requests:  952170
Successful:      952170
Failed:          0
RPS:             15869.50 req/s

So, we have like 2540 writes per second and 15869 reads per second, which is enough not only for our realistic requirements, but also for the original interview requirements. And that's on my cheap, bizarre hardware.

All the scripts for data generation and load testing are available in the repo, so you can make these experiments yourself: GitHub Repo.

Availability

As we use a single machine setup, we have perfect consistency, but what about availability? If our machine goes down, our service will not be operational until it is back online.

But how bad is that?

If we analyse Bitly SLAs, we'll see that on their Enterprise plan (they don't give any guarantees on other plans), they promise 99.9% availability. That's 43m 50s of downtime per month. That's quite a lot!

There are two main reasons why you might have downtime in your system: deployments and incidents.

Deployments are usually straightforward, and you can replace binaries on the machine quite fast. If you don't have a long warm-up for your service, it's a matter of seconds. And I can't find any reason why the described URL shortener should have any warm-up. The example that I demonstrate and test doesn't use any caching or anything besides a lightweight wrapper around a DB.

It's more problematic when you want to migrate data during the downtime, but you probably don't want to. Even in a non-distributed world, it's better to avoid incompatible changes, and this shouldn't really be a big problem if you understand how to make compatible changes. And with compatible changes, you can migrate your data while your system is running.

For incidents, it's more complicated. You can never predict whether you will have an incident tomorrow, and you can't control their duration. It may depend heavily on your code quality, your processes, and many other factors. But it's important to remember that going distributed is not a silver bullet. Even with redundancy, you may still experience downtime for other reasons. And some of these reasons might be related to the fact that you went distributed and introduced this redundancy (due to an increased number of moving parts, greater complexity, etc.).

First Steps of Scaling

Still, I agree: making simple, stateless applications that scale horizontally is a cheap way to improve our scalability and availability. It's a step up in complexity, but it is not that bad, since our app is stateless, and this doesn't require any complex distributed connections.

So, let's switch our application from SQLite to PostgreSQL and check the performance.

A lot of cloud providers offer PostgreSQL at a reasonable price and reasonable availability guarantees (like Multi-AZ in AWS), and it provides a lot of functionality to you, so it's usually a good default choice, but technically, you could use any other relational DB.

What we are doing here is delegating the complexity of distributed systems to a well-matched tool that is solid, mature, well-represented on the market, and that gives us what we need. We basically build our system around this technology.

PostgreSQL

To test the PostgreSQL version, I used a c7i.large EC2 instance and a PostgreSQL instance on db.r8g.large via RDS. I also used an io2 disk with baseline IOPS for our DB.

The results of the same experiment are the following.

No Multi-AZ case:

$ ./loadtest -url=http://localhost:3000 -duration=60s -concurrency=50

=== URL Creation ===
Total Requests:  954455
Successful:      954455
Failed:          0
RPS:             15907.58 req/s

=== Redirect ===
Total Requests:  1474310
Successful:      1474310
Failed:          0
RPS:             24571.83 req/s

Multi-AZ case:

$ ./loadtest -url=http://localhost:3000 -duration=60s -concurrency=50

=== URL Creation ===
Total Requests:  881595
Successful:      881595
Failed:          0
RPS:             14693.25 req/s

=== Redirect ===
Total Requests:  1575232
Successful:      1575232
Failed:          0
RPS:             26253.87 req/s

This option doesn't use any sharding. It's basically a single DB with an additional standby instance to improve availability. Still, it performs well.

The total cost of this AWS solution is approximately USD 750 per month, which should be more than affordable if you want to compete with Bitly. And you can definitely go cheaper by adjusting the hardware to your actual load.

Below are charts showing write and read RPS for:

PostgreSQL solution hosted on AWS
SQLite solution hosted on my NAS
Interview requirements
Real-world requirements

As you can see, our lean design shows performance that easily satisfies both interview and real-world requirements.

Conclusion

Of course, such classical problems, like URL Shortener, are just an instrument in interviews to provoke further discussions and check how candidates approach system design. But isn't it remarkable that people are trained to answer such kinds of questions, overcomplicating everything?

As you may see, there is a lot of freedom in how to solve that. And you can pick a straightforward solution that will be performant enough to beat the biggest players on the market.

And of course, in some cases, this simple PostgreSQL setup will not be enough, and we'll need to scale further and adopt a different design, probably with some sharding. But how can we check if single-machine Postgres is enough? I've prepared a benchmark and a Cheat Sheet on PostgreSQL performance across different AWS instance types ⬇️

Get PostgreSQL Sizing Cheat Sheet here.

Lean System Design

Andrei — Tue, 12 May 2026 12:00:00 +0000

Hey!

Over the past year, I became interested in a systematic approach to system design. It turned out to be quite an overloaded term nowadays – it includes pure, in-depth knowledge about technological systems; it includes how to use them and how to build new systems out of existing ones in real life; and it also includes the "system design interview" area, which is somewhat different from the first two points.

Eventually, I also realised there is a strong push in the community toward building complex distributed systems. This is mainly driven by large tech companies, which need to scale their products to millions and millions of users. So, they require this way of thinking in their interviews, and they popularise it at conferences and in the tools they develop.

The problem is that you probably don't work at Google, and your system will never need to handle Google-scale traffic.

Distributed systems are complex, and applying approaches popularised by big tech companies might be harmful to most other companies. It could be harmful in terms of cost, development velocity, and maintainability.

Over my 13 years in engineering, I've seen many complex distributed systems that engineering teams built over months, only to be decommissioned shortly after release for business reasons. And I've also seen systems that became so unnecessarily complicated that maintaining them took up 80% of engineers' time.

I feel we need to address these concerns with something I'd call Lean System Design.

Lean System Design is the design of systems to meet explicit requirements with minimal complexity: measurable requirements, system-lifetime-aware decisions, long-term cost efficiency, and operational simplicity.

Real-World Requirements

No matter if you work for a big tech company or not, you need to know numbers in order to make decisions on the solutions you build.

The common issue here is that in big tech, you almost always need proper scalability to support millions of users, while companies outside of big tech frequently lack the internal expertise to decide what to build. So they follow what is popular, and may not choose the best option for their case.

Real-world systems frequently follow requirements similar to these:

10 to 1000 RPS, not millions;
latency and availability targets aimed at ordinary users, not some sophisticated SLAs;
per use-case data correctness guarantees;
operability and maintainability – it should be easy to diagnose and fix the issues;
limited infra and people costs.

It doesn't look like we need something very complicated to satisfy this, right? So why do people choose distributed?


Why Distributed?

Distributed systems are complicated because of their nature. When building such systems (and using existing ones), we need to constantly think about consistency and availability. And there is no single "slider" between consistency and availability that you can select some point on. We have a lot of tools to build distributed systems, and each tool is different: it can be configured differently and will behave differently, with its own issues, problems, and trade-offs. And choosing "distributed", we'll need to handle all these complexities and maintain the resulting system.

However, after resolving all these issues, we'll achieve what is never possible with a single-machine system – almost infinite scalability.

There are also other potential benefits frequently associated with building distributed (in a broad sense) systems, such as better team scalability or clearer domain separation. In my opinion, though, these reasons are not that significant, and shouldn't be considered in the distributed vs non-distributed question. Team organisation and domain separation just accidentally ended up in the same discussion because of the popularisation of microservices. These aspects may differ significantly even in a non-distributed world, and shouldn't be that tightly linked.

Single Machine

OK, so scalability. It is indeed not possible to achieve the same scalability with a single-machine setup. But what are the limits? Let's check the real-life examples.

Probably, the most popular example is Stack Overflow. In its best years, it had a single database and handled 6k queries per second, running smoothly and delivering low latency. The hardware needed to handle that would now probably cost a couple of thousand USD per month, which is more than affordable for a company of that scale. Also, Stack Overflow is quite large, and most of the systems engineers build will never reach such a scale or load. See more about their architecture.

GitHub Pages is also a notable example. Until around 2015, all the GitHub Pages were hosted on a single pair of machines (active + standby) with an nginx config regenerated by a cron job. It was a straightforward architecture that was very efficient for them and supported thousands of requests per second. See more.

These are examples of successful companies that served millions of users while maintaining a simple architecture without large distributed systems.

Why Go Distributed?

We see there might be successful products and systems that run on a single machine or somewhere near that.

But I'm not rejecting distributed systems completely here. So, why should we go distributed?

In my opinion, the main reasons are:

If you clearly see that you can't achieve the needed scalability on a single machine. To do that, you need to know "numbers" – the requirements for expected load and system growth. And you also need to understand the limits of a single machine and of the software it runs. I'll go deeper into that in my next posts.
If you see that you can use an existing self-contained and stable distributed system, and you can build your system around it. A common example is databases. People who develop distributed databases have already solved many issues and addressed much of the complexity, and they usually hide some of this complexity inside. So, you, as a user, will only need a small portion of this complexity and delegate everything else to this central system. The problem here is that this "small portion of complexity" is usually still quite big, and people underestimate it.
If you are doing an interview for a big tech company 🙂

What's Next?

Of course, this is not only about distributed vs. non-distributed. With this post, I'd like to start a series reviewing system design topics through the lens of lean system design.

Next week, I'll pick a typical example system and walk through:

how to estimate rough load without pretending you know the future;
how to choose availability and consistency targets that match your business;
and what "good enough" looks like for the first year.
