Reliable System Design comes down to two things: clear requirements and realistic estimates. Part one covered how to turn ideas into requirements. This part focuses on estimation and how to validate your design against real-world constraints.
From my time leading teams at Microsoft and Meta, I saw the same pattern repeat. Systems didn’t fail because of surprises. They failed because no one ran the numbers early enough. Skipping estimation is like designing blind; you only hit the real limits once the system is live, often during an outage.
Here, I’ll share why estimation is critical for scalability, the techniques I use, and how you can apply them to real problems.
Why estimations truly matter
In my experience, teams that skip upfront estimation pay for it later, often during a major outage when the system can’t handle the load. Estimation in System Design is the discipline of quantifying uncertainty. It turns abstract requirements into concrete numbers for user load, data storage, and transaction rates.
In 2012, Instagram learned a difficult lesson. After being acquired by Facebook, their user base grew so fast that their single database couldn’t keep up. The overwhelming traffic caused the service to slow down and created a risk of crashing. To handle the massive increase, they had to urgently restructure their system, breaking up their data and distributing it across many separate databases.
In my experience, estimation matters for three main reasons:
- Capacity planning: Estimations forecast metrics like daily active users (DAUs) and interaction patterns, which help predict resource usage and guide decisions on server capacity, database size, and network infrastructure. The goal is to provide for current traffic while preparing for future growth.
- Performance optimization: Estimations help identify potential bottlenecks. By calculating expected queries per second (QPS) and response times under peak load, you can design systems that stay responsive as usage grows (see the short sketch after this list).
- Cost management: Over-provisioning wastes capital, while under-provisioning harms user experience. Accurate estimations strike a balance, supporting cost-effective scaling. In large organizations like Meta and Microsoft, even small errors in these estimates can lead to millions in unnecessary infrastructure costs.
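As a minimal illustration of how these pillars become numbers, the sketch below converts an assumed DAU figure and per-user activity into average and peak QPS. The DAU count, requests per user, and 3x peak factor are hypothetical placeholders, not figures from any real system.

```python
# Minimal capacity-planning sketch: hypothetical inputs, rough outputs.
SECONDS_PER_DAY = 24 * 60 * 60

def estimate_qps(dau: int, requests_per_user_per_day: float, peak_factor: float = 3.0):
    """Return (average QPS, peak QPS) implied by the assumed usage."""
    avg_qps = dau * requests_per_user_per_day / SECONDS_PER_DAY
    return avg_qps, avg_qps * peak_factor

avg, peak = estimate_qps(dau=10_000_000, requests_per_user_per_day=20)
print(f"average ≈ {avg:,.0f} QPS, peak ≈ {peak:,.0f} QPS")
# 10M DAU × 20 requests/user/day ≈ 2,315 QPS on average, ~6,944 QPS at an assumed 3x peak
```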
The diagram below illustrates how these three pillars (capacity, performance, and cost) are interdependent.
Ultimately, these three pillars are interconnected. A failure in capacity planning directly impacts performance, and both have significant cost implications. Solid estimations are the starting point for getting all three right. The next step is learning the practical techniques to generate these numbers reliably.
Essential estimation techniques
Having a toolkit of estimation techniques is essential. The art is knowing which method to apply based on the available information and the required precision. The visual below summarizes a few proven methods for grounding your estimations in real numbers.
Let’s discuss them in detail:
- Back-of-the-envelope calculations: The first tool to reach for is back-of-the-envelope math. It involves quick, approximate calculations to establish a baseline and check for feasibility. The goal is less about getting perfect numbers and more about testing whether an idea is viable. This includes using rules of thumb and breaking a large estimate into smaller, manageable parts.
Even at Google, Jeff Dean was known for using back-of-the-envelope math in design reviews to quickly determine whether proposals were feasible.
- Order-of-magnitude thinking: Before diving into specifics, ask, “What is the scale of this problem?” Are you designing for thousands of users, millions, or billions? Estimating values to the nearest power of ten is a quick way to assess scale without getting lost in details. This framing shapes the entire architectural approach.
- Breakdown and aggregation: When faced with a complex system, break it into its core components. Estimate the requirements for each part individually (e.g., authentication service, data ingestion pipeline) and then aggregate them. This modular approach is far more accurate than estimating the entire system in one pass (a minimal sketch follows this list).
- Historical data analysis and benchmarks: Data from past projects is often the most reliable starting point. If a similar service handled A traffic with B resources, that history provides a powerful baseline. Organizations like Netflix and Uber often publish engineering blogs with valuable performance benchmarks.
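To make breakdown and aggregation concrete, here is a minimal sketch that estimates load and storage per component and then sums them. Aside from the authentication and ingestion components named above, the component names and every figure are hypothetical.

```python
# Breakdown and aggregation: estimate each component separately, then sum.
# All figures are hypothetical placeholders for illustration.
components = {
    # component: (requests per second, storage in GB)
    "auth_service":   (500,   50),
    "data_ingestion": (2_000, 4_000),
    "reporting_api":  (150,   800),
}

total_rps = sum(rps for rps, _ in components.values())
total_storage_tb = sum(gb for _, gb in components.values()) / 1_000

print(f"aggregate: ~{total_rps:,} RPS, ~{total_storage_tb:.1f} TB")
# Order-of-magnitude framing: a "thousands of RPS, single-digit TB" system.
```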
The table below is directional; it highlights each method’s trade-offs in speed, accuracy, and use cases.
The best practice is to start broad with order-of-magnitude estimates and refine them iteratively as more data becomes available.
Techniques provide structure, but real-world estimation faces challenges such as evolving requirements, optimism bias, and new technologies. Next, we’ll examine these challenges in detail.
Common estimation challenges
Estimation is part calculation and part judgment; even experienced teams struggle with it. The biggest obstacles are often human and organizational rather than purely technical.
The most common pitfalls include:
- Uncertainty in requirements: Early-stage projects often have vague or evolving requirements. A product manager’s vision can change, market conditions can shift, and user behavior can be unpredictable. This makes fixed, long-term estimations difficult.
- Human factors: Optimism bias is a powerful force in engineering. We often underestimate complexity and the time required to complete tasks. This can be compounded by organizational pressure to meet ambitious deadlines, leading to unrealistic commitments.
Research on large IT initiatives shows a consistent pattern: projects run significantly over budget and schedule, while delivering only a fraction of the expected benefits. The gap comes less from technical problems and more from early mistakes and overconfidence.
- Technological constraints: When working with new technologies, there is often a lack of historical data or established benchmarks. This lack of precedent makes it difficult to accurately predict performance and resource needs.
Handling these uncertainties requires iteration. Start broad, refine as requirements evolve, and adjust as real-world data comes in. Involve cross-functional teams to bring diverse perspectives and reduce bias. Use historical data and industry benchmarks to keep estimates grounded. Above all, build a culture of transparency where forecasts can evolve instead of being treated as fixed predictions.
With those challenges in mind, here’s how estimation looks in action.
Estimation in practice
Let’s look at a summarized back-of-the-envelope estimation for a URL shortener to see how these numbers guide our design. While this example is simple, the process is the same for any large-scale system. It all starts by establishing a few baseline assumptions to define the scale of the problem.
Baseline assumptions
The first step is to frame the problem with a few straightforward assumptions about how the system will be used and how much data it will handle.
- 200M new URL shortening requests per month
- 1:100 shorten-to-redirect ratio
- Each shortened URL record requires ~500 bytes
- Entries stored for up to 5 years
We can perform quick back-of-the-envelope math with these assumptions to calculate our core metrics.
Back-of-the-envelope calculations
Let’s translate these assumptions into concrete estimates for storage, traffic, bandwidth, and server needs; a short script reproducing the arithmetic follows the list.
- Storage: 200M new URLs/month × 500 bytes ≈ 100 GB/month, or about 6 TB over 5 years
- Write traffic (average): ~200M new URLs ÷ 30 days ÷ 24 hours ÷ 3,600 seconds ≈ 77 URLs/sec
- Read traffic (average): 77 writes/sec × 100 reads/write ≈ 7.7K redirects/sec
- Bandwidth: At ~1 KB per redirect and an average of ~7.7K redirects per second (RPS), outgoing bandwidth is roughly 60 Mbps, with about 3x that at peak
- Servers: We’d need 5–6 servers for the application logic (assuming peak traffic is 3x the average, or ~23K RPS, and each server handles 5K RPS), plus additional servers for redundancy.
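For readers who prefer to see the arithmetic spelled out, the short script below reproduces these numbers from the stated assumptions. It is a sketch of the calculation only; the 3x peak factor and 5K RPS per server are the same assumptions used above.

```python
# Back-of-the-envelope numbers for the URL shortener, derived from the
# assumptions above (200M writes/month, 1:100 shorten-to-redirect ratio,
# ~500 bytes per record, 5-year retention, assumed 3x peak factor).

new_urls_per_month = 200_000_000
bytes_per_record = 500
retention_months = 5 * 12
reads_per_write = 100
peak_factor = 3
per_server_rps = 5_000          # assumed capacity of one application server

storage_tb = new_urls_per_month * bytes_per_record * retention_months / 1e12
write_rps = new_urls_per_month / (30 * 24 * 3600)
read_rps = write_rps * reads_per_write
peak_rps = (write_rps + read_rps) * peak_factor
bandwidth_mbps = read_rps * 1_000 * 8 / 1e6   # ~1 KB per redirect, average load
servers = peak_rps / per_server_rps           # before adding redundancy

print(f"storage ≈ {storage_tb:.0f} TB, writes ≈ {write_rps:.0f}/s, "
      f"reads ≈ {read_rps/1000:.1f}K/s, bandwidth ≈ {bandwidth_mbps:.0f} Mbps, "
      f"servers ≈ {servers:.1f}")
# storage ≈ 6 TB, writes ≈ 77/s, reads ≈ 7.7K/s, bandwidth ≈ 62 Mbps, servers ≈ 4.7
```

Rounding the 4.7 servers up and adding headroom for redundancy lands at the 5–6 application servers mentioned above.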
These estimates are not exact but provide a foundation for our architecture.
The projected redirection rate makes a cache layer essential. The need to store billions of links points toward a scalable key-value store, not a single relational database. Strict availability targets demand redundancy and automatic failover.
The diagram below connects these estimates to the final architecture:
This is the real value of estimation: it translates numbers into design choices long before implementation begins.
For a detailed, step-by-step walk-through, see the TinyURL System Design, which covers the calculation process.
The back-of-the-envelope math we used for TinyURL is perfect for high-level design. But we need to explore more specialized tools and models for more complex systems or when greater precision is required.
Advanced estimation models
Back-of-the-envelope calculations are great for scoping, but complex systems often need more refined models. As designs mature, moving from rough guesses to structured methods reduces risk.
For scenarios involving high degrees of parallelism or uncertainty, I’ve seen teams successfully employ specialized models:
- BSP model: For large-scale parallel computations such as data pipelines or ML training, the Bulk Synchronous Parallel (BSP) model helps predict performance. It models computation, communication, and synchronization phases to estimate completion times and identify scaling bottlenecks. For example, BSP-style reasoning has been widely applied in optimizing MapReduce and Spark jobs to spot where network shuffles or synchronization barriers limit throughput (a minimal cost-model sketch follows this list).
- Fuzzy logic models: System Design often involves incomplete or uncertain information. Fuzzy logic addresses this using degrees of truth instead of binary values, making it useful for modeling ambiguous requirements or sparse data. It can guide decisions in resource allocation and performance prediction under variable load. When estimating server capacity for a new product with no usage history, a team I worked with used fuzzy logic: instead of requiring rigid numbers, the model took qualitative inputs like “user interest” (low, medium, high) and produced a range of potential resource needs, guiding a more flexible and resilient architecture.
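To illustrate the kind of reasoning BSP enables, the sketch below applies the standard BSP cost formula, in which each superstep costs roughly max(local work) + h·g + l, to a hypothetical three-superstep job. The workload figures and the machine parameters g and l are assumptions, not measurements from any real cluster.

```python
# Standard BSP cost model: a superstep costs roughly
#   max(local work) + h * g + l
# where h is the largest message volume any worker sends or receives,
# g is the per-word communication cost, and l is the barrier cost.
# All numbers below are hypothetical.

def superstep_cost(local_work_ms, h_words, g, l):
    """Estimated time (ms) for one superstep."""
    return max(local_work_ms) + h_words * g + l

supersteps = [
    # (per-worker local work in ms, max words communicated by any worker)
    ([120, 110, 130, 125], 50_000),
    ([400, 380, 390, 410], 200_000),  # shuffle-heavy phase
    ([60, 55, 65, 58], 10_000),
]
g = 0.001  # ms per word
l = 5.0    # ms per barrier synchronization

total_ms = sum(superstep_cost(work, h, g, l) for work, h in supersteps)
print(f"estimated job time ≈ {total_ms:.0f} ms")
# The shuffle-heavy superstep dominates (~615 of ~880 ms), which is exactly
# the kind of scaling bottleneck this style of reasoning is meant to expose.
```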
These advanced models complement fundamental estimation techniques rather than replace them. Consider classic methods as the first filter, giving you a baseline, and modern approaches as the fine-tuning step when problems carry more uncertainty. The key is to match the tool’s complexity to the level of uncertainty in the problem.
The following table compares these models in terms of their main usage scenarios, reliability, advantages, and key drawbacks:
Strive for precision
Precise estimation in System Design separates resilient systems from those that fail under pressure. During my time at Microsoft and Meta, I saw how quantitative reasoning sets strong engineers apart.
Here’s a quick estimation validity check to apply in your own designs:
- Define 3–5 baseline assumptions (users, traffic, data growth).
- Run quick back-of-the-envelope math for storage, throughput, and cost.
- Refine estimates as requirements evolve.
- Validate numbers against benchmarks or past system data.
By applying these techniques, you shift from reactive fixes to proactive architecture. You anticipate growth, manage costs, and ensure performance. The strongest engineers I worked with were the ones who ran numbers early and refined them as designs evolved.
Ready to apply these principles? Master these concepts with these comprehensive courses: Grokking the Modern System Design Interview, Grokking the Frontend System Design, Advanced System Design, and Product and Architecture System Design. These courses and many others on the platform will help you translate theoretical knowledge into practical, real-world skills.