Rajkiran

System Design - Back-of-Envelope Estimation: The Skill That Separates Senior Engineers from Everyone Else

The Interview That Changed How I Think About Numbers
Imagine you're in a system design interview. The interviewer says: "Design YouTube."

Candidate A immediately starts drawing boxes — video service, CDN, transcoding pipeline...

Candidate B pauses and says: "Before I design anything, let me understand the scale. YouTube has roughly 2 billion monthly active users. If 0.1% upload a video per day, that's 2 million uploads. At an average of 5 minutes at 1080p and about 5 Mbps, that's roughly 200MB per video... we're talking about 400 terabytes of new storage daily, before transcoding into multiple resolutions. That tells me we need object storage like S3, not a database, and a distributed transcoding pipeline, not a single server."

Candidate B gets the offer.

Estimation isn't a math test. It's a demonstration that you understand how the numbers drive the architecture. Get the order of magnitude right, and your design decisions become obvious.

Why Engineers Fear Estimation (And Why You Shouldn't)
Most people think estimation requires precision. It doesn't.

Estimation in system design is about orders of magnitude — is this problem in the kilobytes or gigabytes range? Thousands of requests per second or millions?

Getting within 5-10x is perfectly acceptable. The goal is to avoid catastrophically wrong decisions — like designing a system to store 1TB when the actual data is 1 petabyte.

The tools you need are simpler than you think.

Foundation: Powers of 2 (Memorize These)
Every storage and data estimation comes back to these numbers. Internalize them:
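For reference, the powers of two behind the common storage units can be generated with a short sketch. It assumes the binary convention (1 KB = 2^10 bytes), and the name `power_table` is only illustrative.

```python
# Powers of two behind the common storage units, paired with the decimal
# approximation that makes mental arithmetic easy.
UNITS = [
    (10, "KB", "thousand"),
    (20, "MB", "million"),
    (30, "GB", "billion"),
    (40, "TB", "trillion"),
    (50, "PB", "quadrillion"),
]


def power_table():
    """Return rows of (exponent, exact byte count, unit, approximation)."""
    return [(exp, 2 ** exp, unit, f"~1 {name} bytes") for exp, unit, name in UNITS]


for exp, exact, unit, approx in power_table():
    print(f"2^{exp} = {exact:>25,} bytes = 1 {unit} ({approx})")
```

For estimation, treating each step as a clean factor of 1,000 is accurate enough; the exact binary values only differ by a few percent.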

Quick rules for estimation:

A plain text character ≈ 1 byte
A tweet (280 chars) ≈ 300 bytes with metadata
A high-res photo ≈ 3–5 MB
An MP3 song ≈ 4 MB
A 4K video (2 hours) ≈ 50–100 GB at Blu-ray-quality bitrates, ~10 GB at streaming bitrates (truly uncompressed 4K runs to terabytes)

Foundation: Latency Numbers Every Engineer Should Know
These numbers were famously compiled by Jeff Dean at Google. They tell you how long operations take — which drives every architecture decision about caching, async processing, and geographic distribution.
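As a rough sketch, here are a few of the commonly quoted figures from that list, with each one expressed as a multiple of a main-memory reference. The values are approximate and vary with hardware and year, so treat them as order-of-magnitude assumptions rather than measurements. The helper name `slowdown` is illustrative.

```python
# Approximate latencies in nanoseconds, as commonly quoted from the
# "latency numbers" list. Order-of-magnitude figures only.
LATENCY_NS = {
    "Main memory reference": 100,
    "Random 4 KB read from SSD": 150_000,
    "Round trip within a datacenter": 500_000,
    "HDD seek": 10_000_000,
    "Round trip California <-> Netherlands": 150_000_000,
}


def slowdown(operation, baseline="Main memory reference"):
    """How many times slower `operation` is than `baseline`."""
    return LATENCY_NS[operation] / LATENCY_NS[baseline]


for name, ns in LATENCY_NS.items():
    print(f"{name:<40} {ns:>13,} ns  ({slowdown(name):>11,.0f}x RAM)")
```

With these particular figures, a random SSD read is about 1,500 times a memory reference, a datacenter round trip about 5,000 times, and an intercontinental round trip about 1.5 million times.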

What these numbers tell you:

1. Memory is 1000x faster than SSD. This is why caching exists.
2. SSD is 100x faster than HDD. This is why databases moved to SSDs.
3. Same-datacenter network is fast, cross-continent is slow. This is why CDNs and edge deployments exist.
4. Your app servers and databases should be in the same datacenter. Putting your DB in a different region can add tens of milliseconds, and up to ~150ms across continents, to every query.

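The mental model: If an operation goes to RAM (~100 ns), it's fast. If it goes to SSD (~150 µs), it's roughly 1,000x slower. If it makes a round trip inside the datacenter (~0.5 ms), it's several thousand times slower than RAM. If it crosses continents (~150 ms), it's over a million times slower.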

The 4 Core Estimation Formulas
These four formulas cover 90% of all estimation questions in system design interviews.

Formula 1: QPS (Queries Per Second)
Average QPS = (DAU × requests per user per day) ÷ 86,400

Peak QPS = Average QPS × 2 to 3

Why 86,400? There are 86,400 seconds in a day (24 × 60 × 60).

Why 2-3x for peak? Traffic isn't uniform. Most apps see 2-3x their average load during peak hours (evenings for social apps, mornings for news, etc.).

Example — Twitter:

DAU = 300 million

Requests per user per day = ~20 (timeline, tweets, search)

Average QPS = 300M × 20 ÷ 86,400 ≈ 69,444 QPS ≈ 70K QPS

Peak QPS = 70K × 3 = 210K QPS

This immediately tells you: you need distributed infrastructure, not a single server.
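The QPS formula is simple enough to sketch in a few lines. The function names are illustrative, and the inputs are the same assumed Twitter figures used above.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def average_qps(dau, requests_per_user_per_day):
    """Average queries per second, spread evenly over one day."""
    return dau * requests_per_user_per_day / SECONDS_PER_DAY


def peak_qps(avg_qps, peak_multiplier=3):
    """Peak load, assuming traffic peaks at a multiple of the average."""
    return avg_qps * peak_multiplier


avg = average_qps(dau=300_000_000, requests_per_user_per_day=20)
print(f"average ~ {avg:,.0f} QPS")
print(f"peak    ~ {peak_qps(avg):,.0f} QPS")
```

The unrounded results come out slightly below 70K and 210K. In an interview, rounding first and multiplying second is the faster path, and the difference does not change any design decision.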

Formula 2: Storage
Storage per day = records per day × record size (bytes)

Total storage = storage per day × retention period (days)

Example: WhatsApp messages

DAU = 1 billion

Messages per user per day = 100

Records per day = 1B × 100 = 100 billion messages/day

Record size = 100 bytes (text message + metadata)

Storage per day = 100B × 100 bytes = 10 TB/day

5-year retention = 10 TB × 365 × 5 ≈ 18 PB

18 petabytes in 5 years. Now you know why WhatsApp doesn't store message history on servers — it's stored locally on your device. That's not just privacy, it's economics.
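The same storage arithmetic as a minimal sketch, using decimal units (1 TB = 10^12 bytes) to match the figures above. Function names are illustrative.

```python
TB = 10 ** 12
PB = 10 ** 15


def storage_per_day(records_per_day, record_size_bytes):
    """Bytes written per day."""
    return records_per_day * record_size_bytes


def total_storage(bytes_per_day, retention_days):
    """Bytes accumulated over the retention period."""
    return bytes_per_day * retention_days


# Assumptions from the example: 1B DAU, 100 messages each, 100 bytes per message.
daily = storage_per_day(records_per_day=1_000_000_000 * 100, record_size_bytes=100)
total = total_storage(daily, retention_days=365 * 5)

print(f"per day: {daily / TB:,.0f} TB")
print(f"5 years: {total / PB:,.2f} PB")
```

Keeping the unit constants explicit makes it harder to slip by a factor of 1,000, which is the most common arithmetic error in this kind of estimate.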

Formula 3: Bandwidth
Bandwidth = QPS × average request/response size

Example — YouTube read bandwidth:

Peak concurrent streams = 100,000 (for streaming, concurrent streams × bitrate plays the role of QPS × response size)

Average video bitrate = 5 Mbps (HD streaming)

Bandwidth = 100,000 × 5 Mbps = 500 Gbps

500 Gbps of outbound bandwidth. This is why YouTube uses a global CDN — no single data center can serve 500 Gbps to users across the world efficiently.
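A small sketch of the bandwidth calculation, including the conversion from bits to bytes. It assumes decimal network units (1 Gbps = 1,000 Mbps), and the function names are illustrative.

```python
def bandwidth_gbps(concurrent_streams, bitrate_mbps):
    """Aggregate outbound bandwidth in Gbps (1 Gbps = 1,000 Mbps)."""
    return concurrent_streams * bitrate_mbps / 1_000


def gbps_to_gigabytes_per_second(gbps):
    """Convert a bit rate to a byte rate: 8 bits per byte."""
    return gbps / 8


total = bandwidth_gbps(concurrent_streams=100_000, bitrate_mbps=5)
print(f"{total:,.0f} Gbps = {gbps_to_gigabytes_per_second(total):,.1f} GB/s")
```

Expressing the result in bytes per second as well is useful when the next step is sizing disks or caches, which are rated in bytes rather than bits.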

Formula 4: Cache Size
Cache size = daily active data × 20%

The 80/20 rule of caching: 20% of your data serves 80% of requests. Cache the hot 20%.

Example: Twitter timeline cache

Daily active timeline data = ~50 TB

Cache size needed = 50 TB × 20% = 10 TB

10 TB of Redis cache to serve 80% of timeline reads from memory. The remaining 20% of requests fall through to the database. This is why Twitter's feed loads in milliseconds.
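A hedged sketch of the cache-size rule follows. The 20% hot fraction comes from the rule above; the 64 GB of usable memory per cache node is a purely hypothetical figure chosen for illustration, as are the function names.

```python
import math

GB = 10 ** 9
TB = 10 ** 12


def cache_size_bytes(daily_active_bytes, hot_percent=20):
    """Size of the hot set, assuming `hot_percent` of data serves most reads."""
    return daily_active_bytes * hot_percent // 100


def nodes_needed(cache_bytes, usable_bytes_per_node):
    """Number of cache nodes required, rounded up to a whole node."""
    return math.ceil(cache_bytes / usable_bytes_per_node)


cache = cache_size_bytes(50 * TB)
# Hypothetical node size: 64 GB of usable memory per cache node.
nodes = nodes_needed(cache, usable_bytes_per_node=64 * GB)
print(f"cache: {cache / TB:,.0f} TB across about {nodes} nodes")
```

Turning the cache size into a node count is what converts the estimate into an architectural statement: at this scale the cache is itself a sharded, distributed system.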

Worked Example: Estimate Twitter's Storage
Let's do this end-to-end, the way you'd do it in an interview.

Given assumptions (state these out loud):

300M DAU
Average user posts 1 tweet/day
A tweet = 280 characters max ≈ 300 bytes with metadata
Media mix: 50% of tweets have no media, 30% have an image (3MB), 20% have a video (30MB)
Retention: indefinite (assume 10 years)

Text storage:

Tweets per day = 300M × 1 = 300M tweets

Text per day = 300M × 300 bytes = 90 GB/day

10-year text storage = 90 GB × 3,650 days ≈ 330 TB

Media storage:

300M tweets × 30% with images = 90M image tweets/day

90M × 3 MB = 270 TB/day just for images

300M tweets × 20% with videos = 60M video tweets/day

60M × 30 MB = 1,800 TB = 1.8 PB/day for videos

Total daily storage: ~2 PB/day

10-year total: ~7.3 EB (exabytes)

Now your architecture makes sense: You need distributed object storage (not a database) for media, you need compression and tiered storage, and you need a CDN for delivery. The numbers told you this.
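The whole worked example fits in a short script, which makes it easy to change one assumption and see the effect. Every constant below is one of the stated assumptions, not a real Twitter figure, and units are decimal.

```python
MB, GB, TB, PB, EB = 10**6, 10**9, 10**12, 10**15, 10**18

# Stated assumptions from the worked example.
DAU = 300_000_000
TWEETS_PER_USER_PER_DAY = 1
TEXT_BYTES = 300
IMAGE_PERCENT, IMAGE_BYTES = 30, 3 * MB
VIDEO_PERCENT, VIDEO_BYTES = 20, 30 * MB
RETENTION_DAYS = 3_650


def daily_breakdown():
    """Bytes stored per day, split by content type."""
    tweets = DAU * TWEETS_PER_USER_PER_DAY
    return {
        "text": tweets * TEXT_BYTES,
        "images": tweets * IMAGE_PERCENT // 100 * IMAGE_BYTES,
        "videos": tweets * VIDEO_PERCENT // 100 * VIDEO_BYTES,
    }


daily = daily_breakdown()
daily_total = sum(daily.values())
ten_year_total = daily_total * RETENTION_DAYS

for kind, size in daily.items():
    print(f"{kind:<7} {size / TB:>10,.2f} TB/day")
print(f"total   {daily_total / PB:>10,.2f} PB/day")
print(f"10 yrs  {ten_year_total / EB:>10,.2f} EB")
```

Without intermediate rounding the daily total is about 2.07 PB, which gives roughly 7.6 EB over ten years. Rounding to 2 PB/day first gives the 7.3 EB quoted above. Both land in the same order of magnitude, and text is a rounding error next to media.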

Worked Example: Estimate Netflix Bandwidth
Given assumptions:

220M subscribers
Peak concurrent viewers = 15% = 33M users
Average video quality = 1080p HD = 5 Mbps bitrate

Bandwidth calculation:

Concurrent viewers = 33 million

Bandwidth per viewer = 5 Mbps

Total bandwidth = 33M × 5 Mbps = 165,000 Gbps = 165 Tbps

Netflix needs to deliver 165 terabits per second at peak.

This is why Netflix has its own CDN (Open Connect), deploys CDN servers inside ISP networks, and pre-positions popular content at the edge before you even request it. The numbers mandated this architecture.
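Since both the concurrency share and the bitrate are guesses, it can help to sweep them and check how sensitive the answer is. This is a sketch under the assumptions above; the alternative percentages and bitrates are illustrative, not Netflix data.

```python
def peak_bandwidth_tbps(subscribers, concurrent_percent, bitrate_mbps):
    """Peak egress in Tbps for a given share of subscribers streaming at once."""
    viewers = subscribers * concurrent_percent // 100
    return viewers * bitrate_mbps / 1_000_000  # Mbps -> Tbps


SUBSCRIBERS = 220_000_000

# Illustrative ranges around the 15% / 5 Mbps assumptions.
for percent in (10, 15, 20):
    for bitrate in (3, 5, 8):
        tbps = peak_bandwidth_tbps(SUBSCRIBERS, percent, bitrate)
        print(f"{percent}% concurrent at {bitrate} Mbps -> {tbps:,.0f} Tbps")
```

Across this range the result stays between tens and hundreds of Tbps, so the architectural conclusion holds even if the individual assumptions are off.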

Common Estimation Mistakes

1. Forgetting the peak multiplier. Average QPS of 10,000 with 3x peak = 30,000 QPS. Your servers must handle peak, not average.

2. Forgetting write amplification from replication. For every write to a database with 3 replicas, you actually do 3 writes. Storage and write I/O cost about 3x what you'd naively calculate.

3. Mixing up MB and Mbps. Storage is in bytes (MB, GB, TB). Network bandwidth is in bits per second (Mbps, Gbps). There are 8 bits in a byte. 1 Gbps of bandwidth can transfer 125 MB/second.

4. Not compressing. A 2-hour 4K video at a lightly compressed, Blu-ray-style bitrate is about 50GB. With H.265 at streaming bitrates it is closer to 5-10GB. Always assume compression for media. State your assumption.

5. Ignoring metadata overhead. Every record has metadata: timestamps, user IDs, indexes. A 100-byte record in a database is probably stored as 200-300 bytes with indexes and overhead.
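Three of these corrections are mechanical enough to sketch as helpers. The default replica count and overhead factor below are taken from the rough figures in the list, not from any specific database, and the function names are illustrative.

```python
TB = 10 ** 12


def gbps_to_mb_per_second(gbps):
    """Bits to bytes: 1 Gbps = 1,000 Mbps, and 8 bits per byte."""
    return gbps * 1_000 / 8


def provisioned_bytes(raw_bytes, replicas=3, overhead_factor=2):
    """Raw payload size scaled by replication and per-record overhead.

    overhead_factor=2 assumes a 100-byte record occupies about 200 bytes
    once indexes and metadata are included.
    """
    return raw_bytes * replicas * overhead_factor


def required_capacity_qps(average_qps, peak_multiplier=3):
    """Capacity must cover peak load, not the daily average."""
    return average_qps * peak_multiplier


print(f"1 Gbps = {gbps_to_mb_per_second(1):,.0f} MB/s")
print(f"10 TB/day raw -> {provisioned_bytes(10 * TB) / TB:,.0f} TB/day provisioned")
print(f"10,000 avg QPS -> {required_capacity_qps(10_000):,} QPS capacity")
```

Applied together, replication and overhead turn a 10 TB/day raw estimate into about 60 TB/day of provisioned storage, which is a large enough gap to change a cost estimate.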

The Estimation Mindset for Interviews
When you walk into a system design interview, here's the script:

Step 1: Clarify the scale. "What's the expected number of users? DAU or MAU?"

Step 2: State your assumptions. "I'll assume each user sends 50 requests per day. Let me know if that's off."

Step 3: Calculate, thinking out loud. "So that's 100M × 50 ÷ 86,400 ≈ 58K QPS, call it 60K. Peak might be 3x, so 180K QPS."

Step 4: Let numbers drive design. "At 180K QPS, a single server won't cut it. We need horizontal scaling and a load balancer. That also means stateless design..."

The interviewer isn't checking your arithmetic. They're checking whether you use numbers to drive decisions, or whether you design by intuition and hope.

Key Takeaways
- Estimation is about order of magnitude, not precision. Within 10x is fine.
- 86,400 seconds/day and a 2-3x peak multiplier: never forget these two.
- Latency hierarchy: RAM → SSD → Network (same DC) → Network (cross-continent). Each step is slower, by anywhere from a few times to roughly 1,000x.
- Four formulas: QPS, Storage, Bandwidth, Cache Size. Master these and you can estimate anything.
- Numbers drive architecture. Before drawing a single component, estimate the scale. The scale tells you what you need.
- Always state your assumptions explicitly. A wrong assumption stated clearly is better than a correct answer that came from nowhere.
