
Ronit Dahiya

Stop drawing system design diagrams. Start simulating them.

Your diagram never fails.

(Screenshot: the System Design Simulator canvas)

That's the problem.

You draw a load balancer, three app servers, Redis cache, Postgres. Arrows everywhere. Looks great. Interviewer nods. You nod. Everyone nods. And then you get the follow-up question: "What happens at 50,000 requests per second?"

And you hand-wave. "Well, the cache handles most of it, so the database is mostly fine..."

Mostly fine. The two most dangerous words in system design.


The lie we tell in interviews

I've sat through a lot of system design prep. The advice is always the same: draw the components, explain the tradeoffs, mention CAP theorem, say "it depends" a lot.

Here's what nobody tells you: a diagram is a snapshot of your best intentions, not a model of your system's behavior.

Your diagram shows what components exist. It says nothing about what happens when:

  • Your cache cold-starts after a deploy
  • Two pods restart simultaneously and 40,000 requests hit your database directly
  • Your message queue backs up because a downstream service is slow

The diagram sits there looking perfect while your system falls apart in ways you didn't anticipate.

I got tired of hand-waving through these scenarios. So I built something that actually runs them.


What simulation adds that drawing doesn't

I built SysSimulator — a system design simulator that runs entirely in your browser. You drag components onto a canvas, configure them (cache hit rate, database connection pool size, replica count), set your RPS, and hit run. It uses a discrete-event simulation engine compiled from Rust to WebAssembly, so it's running real math, not vibes.

(The engineering behind the Rust/WASM part is in my previous article if that's your thing. This one is about what you actually learn from running it.)
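If "discrete-event simulation" sounds abstract, the core idea is small enough to sketch. Here's a toy version in Python (my own illustration with made-up parameters, not the actual Rust engine): requests arrive at random intervals, grab the next free server, and wait if none is free. Push arrivals past capacity and the p99 explodes.

```python
import heapq
import random

def simulate(rps, duration_s, service_time_s, servers):
    """Toy discrete-event simulation: random arrivals into a fixed pool
    of servers, tracking per-request latency. Returns the p99 latency."""
    random.seed(42)
    # Generate arrival times with exponential inter-arrival gaps.
    t, arrivals = 0.0, []
    while t < duration_s:
        t += random.expovariate(rps)
        arrivals.append(t)
    free_at = [0.0] * servers          # when each server next frees up
    heapq.heapify(free_at)
    latencies = []
    for arrival in arrivals:
        soonest = heapq.heappop(free_at)
        start = max(arrival, soonest)  # wait if every server is busy
        finish = start + service_time_s
        heapq.heappush(free_at, finish)
        latencies.append(finish - arrival)
    latencies.sort()
    return latencies[int(len(latencies) * 0.99)]

# 20 servers at 10ms per request can sustain ~2,000 RPS.
print(simulate(rps=1000, duration_s=5, service_time_s=0.01, servers=20))
print(simulate(rps=3000, duration_s=5, service_time_s=0.01, servers=20))
```

The real engine presumably models far more (queues, connection pools, retries), but the arrival/service loop above is the basic shape: the second run is 50% over capacity, so the backlog, and with it the p99, grows without bound.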

Here's the thing about running a simulation: it tells you things your diagram can't.


The cache stampede I didn't expect

Let me give you a concrete example. Classic interview scenario: high-read system with a Redis cache in front of Postgres.

I drew this a hundred times. Cache handles 90% of reads, database handles 10%, everything is fine.

Then I simulated it.

I set up the blueprint: load balancer → app servers → Redis → Postgres. Cache hit rate: 90%. Starting load: 1,000 RPS. Then I hit the "cache expiry" chaos scenario, which simulates all your cache TTLs expiring at once, as they do after a cold start or a cache flush.

What my diagram predicted: cache handles 90% of reads, brief spike on the database, recovers quickly.

What actually happened:

  • p99 latency: 48ms → 2,400ms in 11 seconds
  • Database error rate: 34%
  • Queue depth: backed up to 8,000 pending requests
  • Recovery time after cache refills: 4 minutes

(Screenshot: simulation metrics after the cache-expiry run)

The diagram said "brief spike." The simulation said "your database is on fire and users are seeing 2.4-second loads for 4 minutes."

That's a cache stampede. It's a known failure mode — there's even a Wikipedia article about it — and my beautiful diagram completely missed it.
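What the diagram hides is one line of arithmetic. Using the numbers I'd quote in an interview (10K RPS, a 90% hit rate, a pool that sustains about 2K QPS; these are my assumptions, not the simulator's internals):

```python
# Back-of-envelope: what the database sees before and after the
# cache TTLs expire together. All numbers are assumed inputs.
RPS = 10_000          # total read traffic
HIT_RATE = 0.90       # warm-cache hit rate
POOL_QPS = 2_000      # what the connection pool can sustain

db_qps_warm = RPS * (1 - HIT_RATE)   # only misses reach the database
db_qps_cold = RPS                    # every request is now a miss

print(round(db_qps_warm))            # 1000 QPS: comfortable
print(db_qps_cold / POOL_QPS)        # 5.0: five times over capacity
```

The warm-cache number looks safe, which is exactly why the diagram looks safe. The cold-cache number is the stampede.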


Why this changes how you talk in interviews

Here's the interview prep angle that I think gets overlooked.

When you simulate a scenario, you get specific numbers. And specific numbers change how you talk.

Before simulation:

"The cache handles most of the load, so the database should be fine."

After simulating the cache stampede:

"At 10K RPS with a 90% cache hit rate, the database normally handles around 1,000 QPS — well within its connection pool limit. But if the cache expires simultaneously, we lose that 90% hit rate for about 60-90 seconds. That sends the full 10K RPS to the database, which saturates the connection pool. p99 spikes from ~50ms to over 2 seconds, error rates hit 30%+. The fix is probabilistic cache expiry — instead of all keys expiring at the same TTL, you add jitter. Each key expires at TTL + random(0, TTL*0.1), so the stampede becomes a trickle."
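The jitter fix from that answer is a few lines in practice. A minimal sketch (my own helper, not any library's API; the Redis call in the comment is a hypothetical usage):

```python
import random

def jittered_ttl(base_ttl_s: float) -> float:
    """Probabilistic cache expiry: add up to 10% random jitter to the
    base TTL so key expiries spread out instead of synchronizing."""
    return base_ttl_s + random.uniform(0, base_ttl_s * 0.1)

# Hypothetical use when writing a key with redis-py:
#   r.set("user:42", payload, ex=int(jittered_ttl(300)))
```

With a 300-second base TTL, keys now expire anywhere between 300 and 330 seconds, so a flush's worth of refills arrives as a trickle rather than a wall.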

Same underlying knowledge. Completely different delivery. The second version sounds like someone who has actually seen this happen.

You haven't seen it happen — but you've run the simulation, and that's close enough to talk about it with conviction.


The narration framework that actually works

For system design interviews, I use a three-part structure for any scenario:

1. The steady state — what happens at normal load

"At 10K RPS with warm cache, p99 is around 45ms. Database handles 800 QPS, well within its 2,000 connection limit."

2. The failure mode — what breaks it and why

"If we lose the cache — cold start, flush, or a stampede — the full load hits the database directly. Connection pool saturates in about 8 seconds."

3. The fix — specific, not vague

"Cache-aside with jitter on TTL. Or serve stale-while-revalidate with a background refresh. Either approach limits the blast radius of cache expiry."
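To make the stale-while-revalidate option concrete, here's a toy version in Python (my own sketch, not any particular library's API): past the TTL, readers get the stale value back immediately while a single background refresh repopulates the key, so the database sees one miss per key instead of thousands.

```python
import threading
import time

class SWRCache:
    """Toy stale-while-revalidate cache. Stale reads return instantly;
    exactly one background refresh per key reloads from the database."""
    def __init__(self, loader, ttl_s):
        self.loader, self.ttl_s = loader, ttl_s
        self.store = {}              # key -> (value, fetched_at)
        self.refreshing = set()      # keys with a refresh in flight
        self.lock = threading.Lock()

    def get(self, key):
        with self.lock:
            entry = self.store.get(key)
            if entry and time.time() - entry[1] < self.ttl_s:
                return entry[0]              # fresh hit
            if entry:                        # stale: serve it, refresh once
                if key not in self.refreshing:
                    self.refreshing.add(key)
                    threading.Thread(target=self._refresh, args=(key,)).start()
                return entry[0]
        value = self.loader(key)             # cold miss: load synchronously
        with self.lock:
            self.store[key] = (value, time.time())
        return value

    def _refresh(self, key):
        value = self.loader(key)
        with self.lock:
            self.store[key] = (value, time.time())
            self.refreshing.discard(key)
```

The design choice worth naming in an interview: stale reads trade a little freshness for a bounded database load, which is usually the right trade for high-read systems.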

The simulation gives you the numbers for parts 1 and 2. Part 3 is where you show you understand the tradeoffs.

I wrote up a longer version of this framework with the exact narration for the cache stampede scenario — word-for-word, with the numbers. It's what I'd say if I were put on the spot in an interview right now.


Running it yourself

If you want to try the cache stampede scenario specifically:

  1. Go to SysSimulator
  2. Load the "Three-Tier Web App" blueprint (it's in the blueprints panel)
  3. Set RPS to 10,000
  4. Run the simulation — watch p99 and error rate in the live metrics bar
  5. Hit the "Cache Stampede" chaos scenario
  6. Watch what happens to p99

It takes about 3 minutes. At the end you'll have specific numbers you can use in any interview that involves a cache.

The tool is free, requires no login, and runs entirely in your browser. The simulation engine is WebAssembly, so it's fast: a 100K RPS scenario simulates faster than real time.


What to do with the other blueprints

The stampede is one scenario. There are 57 blueprints total — CQRS, event sourcing, Saga pattern, rate limiting architectures, MCP agent systems — and 28 chaos scenarios you can inject on any of them.

The ones that surprise people most in practice:

  • Read replica lag — your replica is "caught up" until it isn't, and now your read-after-write consistency assumption is broken
  • Thundering herd on load balancer restart — all connections re-establishing simultaneously
  • Queue backpressure — your Kafka consumer is 800ms behind, your producer doesn't slow down, your broker fills up

None of these look dangerous in a diagram. All of them have produced real incidents. Running the simulation doesn't replace production experience — but it's a lot faster and cheaper than getting paged at 2am.
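The backpressure one in particular rewards doing the arithmetic out loud. With assumed rates (mine, not from the simulator), the whole failure is a subtraction:

```python
# Toy backpressure arithmetic: a consumer that drains slower than the
# producer fills means the broker's backlog grows linearly, forever.
produce_rate = 5_000   # msgs/s written to the topic (assumed)
consume_rate = 4_200   # msgs/s the slow consumer drains (assumed)

lag_growth = produce_rate - consume_rate      # net backlog growth, msgs/s
seconds_to_1m_backlog = 1_000_000 / lag_growth

print(lag_growth)                    # 800 msgs/s of growing lag
print(round(seconds_to_1m_backlog))  # 1250 s to a million-message backlog
```

No spike required: steady load plus a consumer 16% too slow fills the broker in about twenty minutes, which is why backpressure (slowing the producer) beats buffering.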


I spent a lot of time drawing boxes in Excalidraw and not enough time asking "what actually breaks here." This is me trying to fix that.

If you find the cache stampede numbers useful for prep, or have a specific scenario you want to model, drop it in the comments. Happy to work through it.
