Intro
I have always found this topic misleading and overcomplicated. The version below is much simpler than real-life calculations, but it still covers 99% of what you are likely to encounter in an interview.
What to estimate?
QPS - queries per second
RPS - reads per second
WPS - writes per second
Peak QPS = QPS * 2 (usually)
RW - read/write ratio
Message size - size of one message in bytes (assume a value if it is not given)
Read Throughput - RPS * message size = N bytes per second
Write Throughput - WPS * message size = N bytes per second
💡 Throughput is how much data actually passes through; bandwidth is how much data CAN pass through (it depends on the network configuration)
Ex: 1 Gbps of network bandwidth can carry at most 125 MB/s (1 Gbps / 8 bits per byte)
Storage - usually storage for N years
Replica storage - storage * 2-3 (the number of replicas)
Cache storage - usually 20% of storage or so
Cache replica storage - cache storage * 2-3 (the number of replicas)
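The definitions above reduce to a handful of multiplications. Here is a minimal Python sketch of them, assuming a peak factor of 2, 3 replicas, and a 20% cache fraction as defaults. The function names are illustrative and do not come from any library.

```python
def peak_qps(avg_qps: float, peak_factor: float = 2.0) -> float:
    """Peak QPS = QPS * 2 (usually); the factor is an assumption."""
    return avg_qps * peak_factor


def throughput_bytes_per_s(ops_per_s: float, message_bytes: float) -> float:
    """Read or write throughput = operations per second * message size."""
    return ops_per_s * message_bytes


def bandwidth_mb_per_s(gbps: float) -> float:
    """Convert link bandwidth in Gbps to MB/s (8 bits per byte, decimal units)."""
    return gbps * 1e9 / 8 / 1e6


def with_replicas(storage_bytes: float, replicas: int = 3) -> float:
    """Replica storage = storage * number of replicas (usually 2-3)."""
    return storage_bytes * replicas


def cache_bytes(storage_bytes: float, cache_fraction: float = 0.2) -> float:
    """Cache storage = a fraction of storage (usually 20% or so)."""
    return storage_bytes * cache_fraction


if __name__ == "__main__":
    print(bandwidth_mb_per_s(1))  # 1 Gbps link in MB/s
    print(peak_qps(12_000))       # peak for an average of 12,000 QPS
```

Comparing `throughput_bytes_per_s(...)` against `bandwidth_mb_per_s(...)` is a quick way to see whether a single network link could carry the estimated load.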
Basic Numbers
seconds in a day - 24 * 60 * 60 = 86400, roughly 10^5
1 ASCII character - 1 byte
timestamp - 8 bytes (64 bits, 2^64 possible values)
10^3 bytes - 1 KB
10^6 bytes - 1 MB
10^9 bytes - 1 GB
10^12 bytes - 1 TB
10^15 bytes - 1 PB
10^18 bytes - 1 EB
Powers of two
Power   Exact Value           Approx Value   Bytes
---------------------------------------------------------------
7       128
8       256
10      1,024                 1 thousand     1 KB
16      65,536                               64 KB
20      1,048,576             1 million      1 MB
30      1,073,741,824         1 billion      1 GB
32      4,294,967,296                        4 GB
40      1,099,511,627,776     1 trillion     1 TB
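The "Approx Value" column treats 2^10 as 10^3, 2^20 as 10^6, and so on. A short Python sketch shows how far off that shortcut is. The helper name `approximation_error_pct` is made up for this illustration.

```python
def approximation_error_pct(power_of_two: int, power_of_ten: int) -> float:
    """Percent by which 2**power_of_two exceeds 10**power_of_ten."""
    exact = 2 ** power_of_two
    approx = 10 ** power_of_ten
    return (exact - approx) / approx * 100


if __name__ == "__main__":
    for p2, p10 in [(10, 3), (20, 6), (30, 9), (40, 12)]:
        print(f"2^{p2} vs 10^{p10}: {approximation_error_pct(p2, p10):.1f}% larger")
```

The gap grows from about 2.4% at 2^10 to just under 10% at 2^40, which is small enough to ignore in an interview-level estimate.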
Latency numbers every programmer should know
Latency Comparison Numbers
--------------------------
L1 cache reference                          0.5 ns
Branch mispredict                             5 ns
L2 cache reference                            7 ns                      14x L1 cache
Mutex lock/unlock                            25 ns
Main memory reference                       100 ns                      20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy             10,000 ns       10 us
Send 1 KB bytes over 1 Gbps network      10,000 ns       10 us
Read 4 KB randomly from SSD*            150,000 ns      150 us          ~1GB/sec SSD
Read 1 MB sequentially from memory      250,000 ns      250 us
Round trip within same datacenter       500,000 ns      500 us
Read 1 MB sequentially from SSD*      1,000,000 ns    1,000 us    1 ms  ~1GB/sec SSD, 4X memory
HDD seek                             10,000,000 ns   10,000 us   10 ms  20x datacenter roundtrip
Read 1 MB sequentially from 1 Gbps   10,000,000 ns   10,000 us   10 ms  40x memory, 10X SSD
Read 1 MB sequentially from HDD      30,000,000 ns   30,000 us   30 ms  120x memory, 30X SSD
Send packet CA->Netherlands->CA     150,000,000 ns  150,000 us  150 ms
Notes
-----
1 ns = 10^-9 seconds
1 us = 10^-6 seconds = 1,000 ns
1 ms = 10^-3 seconds = 1,000 us = 1,000,000 ns
Handy metrics based on latency numbers
Read sequentially from HDD at 30 MB/s
Read sequentially from 1 Gbps Ethernet at 100 MB/s
Read sequentially from SSD at 1 GB/s
Read sequentially from main memory at 4 GB/s
6-7 world-wide round trips per second
2,000 round trips per second within a data center
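These handy metrics follow directly from the latency table: divide one second by the time per megabyte or per round trip. A small Python sketch of that derivation, with dictionary keys that are labels chosen for this example:

```python
NS_PER_SECOND = 1_000_000_000

# Latencies copied from the table above, in nanoseconds.
LATENCY_NS = {
    "read_1mb_hdd": 30_000_000,
    "read_1mb_1gbps": 10_000_000,
    "read_1mb_ssd": 1_000_000,
    "read_1mb_memory": 250_000,
    "round_trip_datacenter": 500_000,
    "round_trip_ca_nl_ca": 150_000_000,
}


def mb_per_second(ns_per_mb: int) -> float:
    """Sequential read rate implied by the time to read 1 MB."""
    return NS_PER_SECOND / ns_per_mb


def round_trips_per_second(ns_per_round_trip: int) -> float:
    """How many sequential round trips fit in one second."""
    return NS_PER_SECOND / ns_per_round_trip


if __name__ == "__main__":
    for name in ("read_1mb_hdd", "read_1mb_1gbps", "read_1mb_ssd", "read_1mb_memory"):
        print(f"{name}: {mb_per_second(LATENCY_NS[name]):,.0f} MB/s")
    for name in ("round_trip_datacenter", "round_trip_ca_nl_ca"):
        print(f"{name}: {round_trips_per_second(LATENCY_NS[name]):,.1f} per second")
```

The HDD figure comes out at about 33 MB/s, which the list rounds down to 30 MB/s.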
How to estimate?
- Clarify the number of daily active users and the total number of users.
- Ask how many requests each user makes per day on average. From this you can derive QPS.
- Think about peak QPS, reads, and writes.
- Assume (and clarify) the message size.
- Calculate throughput.
- If possible, estimate the average size of the stored data, then calculate storage and cache.
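The steps above can be sketched as one small Python function. This is only an illustration under stated assumptions: the `Workload` fields and the `estimate` function are hypothetical names, QPS is taken as reads plus writes, and the peak factor defaults to 2.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 86_400


@dataclass
class Workload:
    """Inputs you would clarify with the interviewer (all assumed)."""
    daily_active_users: int
    reads_per_user_per_day: float
    writes_per_user_per_day: float
    read_message_bytes: int
    write_message_bytes: int
    peak_factor: float = 2.0


def estimate(w: Workload) -> dict:
    """Walk the steps: users -> RPS/WPS -> peak QPS -> throughput -> daily data."""
    rps = w.daily_active_users * w.reads_per_user_per_day / SECONDS_PER_DAY
    wps = w.daily_active_users * w.writes_per_user_per_day / SECONDS_PER_DAY
    return {
        "rps": rps,
        "wps": wps,
        "peak_qps": (rps + wps) * w.peak_factor,
        "read_bytes_per_s": rps * w.read_message_bytes,
        "write_bytes_per_s": wps * w.write_message_bytes,
        # Bytes written per day: the starting point for storage and cache sizing.
        "written_bytes_per_day": (
            w.daily_active_users * w.writes_per_user_per_day * w.write_message_bytes
        ),
    }


if __name__ == "__main__":
    print(estimate(Workload(1_000_000, 10, 1, 100, 1_000)))
```

Keeping the inputs in one place makes it easy to redo the estimate when the interviewer changes an assumption.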
Estimation example
You have 10M daily active users. Each of them makes 100 read requests per day on average and creates new data 5 times per day.
RPS = 10M * 100 / 86400 ≈ 12,000 r/s
WPS = 10M * 5 / 86400 ≈ 580 w/s
Peak read QPS = 12,000 * 2 = 24,000 r/s
Let's assume (clarify with the interviewer) that the average read message size is 50 bytes and the average written message is 1 KB.
Avg Read throughput = 50 bytes * 12 * 10^3 = 600 KB/s
Avg Write throughput = 1 KB * 580 = 580 KB/s
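As a sanity check on the rounding, the request rates and throughput for this example can be recomputed in a few lines of Python. The variable names are illustrative, and the inputs are the ones assumed in the example.

```python
SECONDS_PER_DAY = 86_400
DAILY_ACTIVE_USERS = 10_000_000

# Unrounded rates from the example's inputs.
rps_exact = DAILY_ACTIVE_USERS * 100 / SECONDS_PER_DAY
wps_exact = DAILY_ACTIVE_USERS * 5 / SECONDS_PER_DAY

# Rounded rates, as you would use them on a whiteboard.
rps_rounded = 12_000
wps_rounded = 580

# Throughput = operations per second * message size in bytes.
read_bytes_per_s = 50 * rps_rounded
write_bytes_per_s = 1_000 * wps_rounded

if __name__ == "__main__":
    print(f"RPS exact: {rps_exact:,.0f}, WPS exact: {wps_exact:,.0f}")
    print(f"Read throughput:  {read_bytes_per_s / 1_000:,.0f} KB/s")
    print(f"Write throughput: {write_bytes_per_s / 1_000:,.0f} KB/s")
```

Although reads outnumber writes 20 to 1, the two throughput figures end up close to each other because each written message is 20 times larger.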
Here we can think about the type of data, metadata, and so on. Let's assume you have clarified with your interviewer that the size of each new record is 1 KB.
5 years of storage - 10M * 1 KB * 5 times per day * 365 days per year * 5 years ≈ 91 TB; with 3 replicas that is about 274 TB, or roughly 300 TB.
Let's assume that only 10% of the data is hot and you agreed to cache 20% of it.
Cache storage - 10% * 90 TB * 20% * 3 replicas ≈ 5.4 TB
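The storage and cache arithmetic is easy to get wrong by a factor of 10, so here is a short Python check using decimal units. The constant names are illustrative, and the 10%, 20%, and 3-replica figures are the assumptions from the example.

```python
KB = 10**3
TB = 10**12

DAILY_ACTIVE_USERS = 10_000_000
WRITES_PER_USER_PER_DAY = 5
RECORD_BYTES = 1 * KB
YEARS = 5
REPLICAS = 3

# Raw data written over the retention period.
daily_bytes = DAILY_ACTIVE_USERS * WRITES_PER_USER_PER_DAY * RECORD_BYTES
five_year_bytes = daily_bytes * 365 * YEARS
replicated_bytes = five_year_bytes * REPLICAS

# Cache: 10% of the data is hot, 20% of that is cached, times 3 replicas.
# Uses the rounded 90 TB figure, as in the example.
cache_total_bytes = 90 * TB * 0.10 * 0.20 * REPLICAS

if __name__ == "__main__":
    print(f"5-year storage: {five_year_bytes / TB:.2f} TB")
    print(f"With replicas:  {replicated_bytes / TB:.2f} TB")
    print(f"Cache storage:  {cache_total_bytes / TB:.2f} TB")
```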
Top comments (2)
I have some confusion about calculating the cache size for 5 years, so I asked ChatGPT the following.
And here's the relevant part from its answer.
A More Accurate Estimation Approach:
Instead of considering 5 years’ worth of data, you should estimate the cache size based on the data’s access patterns and the working set size. Here’s how you might adjust the calculation:
1. Determine the Working Set Period: Decide on a time frame that represents the typical period during which data remains “hot.” For many applications, this could be the last few days or weeks.
2. Calculate Data Generated in the Working Set Period:
• For example, if the working set period is one month:
• Daily data generated: 10M users x 1KB x 5 times/day = 50 GB/day
• Monthly data generated: 50 GB/day x 30 days = 1.5 TB
3. Estimate the Hot Data Percentage:
• If 10% of this monthly data is hot:
• Hot data size: 1.5 TB x 10% = 150 GB
4. Calculate Cache Size with Replication and Overhead:
• If you use 20% of the hot data size for cache (which might represent a cache hit ratio you’re aiming for):
• Cache size: 150 GB x 20% = 30 GB
• With 3 replicas: Total cache storage: 30 GB x 3 = 90 GB
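The four steps in the quoted answer can be checked with a short Python sketch. The function name and parameters are hypothetical, and the 30-day window, 10% hot fraction, 20% cache fraction, and 3 replicas are the assumptions stated in the answer.

```python
GB = 10**9


def working_set_cache_bytes(
    daily_bytes: float,
    working_set_days: int,
    hot_fraction: float,
    cache_fraction: float,
    replicas: int,
) -> float:
    """Cache size based on recently written data instead of all stored data."""
    working_set = daily_bytes * working_set_days  # step 2: data in the window
    hot = working_set * hot_fraction              # step 3: hot portion
    cache = hot * cache_fraction                  # step 4: cached portion
    return cache * replicas                       # step 4: replication


if __name__ == "__main__":
    daily = 10_000_000 * 1_000 * 5  # 10M users x 1 KB x 5 writes/day = 50 GB/day
    total = working_set_cache_bytes(daily, 30, 0.10, 0.20, 3)
    print(f"Total cache storage: {total / GB:.0f} GB")
```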
@vladisov Thanks for the simple and clear explanation.
Even some popular books in the area couldn't convey it this easily.