DEV Community: Ronit Dahiya

28 chaos engineering scenarios mapped to system design interview questions

Ronit Dahiya — Sat, 06 Jun 2026 16:13:44 +0000

"Your cache just died at 30,000 RPS. Walk me through what happens."

Most candidates pause, then describe it in the abstract. The cache goes down, requests hit the database, the database gets overwhelmed. Technically correct. Completely unconvincing.

The candidate who has actually watched a cache stampede answers differently:

"p99 goes from 48ms to around 2,400ms in under a second. Connection pool exhaustion on the database is what causes it, not the database being slow, but the number of concurrent connections queued waiting for a slot. At 30K RPS with a 0% cache hit rate and a database connection pool sized for 10% of that traffic, the pool saturates in roughly 400 milliseconds. Error rate climbs from 0.1% to 34% before any circuit breaker fires."

That answer comes from having seen it happen, not from having read about it. Chaos engineering is how you see it happen before an interview, or before production does it for you.

Why system design interviews now test failure reasoning

The shift is real and it has happened over the last two to three years. Drawing the architecture is the entry bar, not the differentiator. Interviewers at companies running large distributed systems are explicitly looking for failure reasoning: what your design does when a component behaves badly, how failure propagates, and what you designed to contain it.

The question "what happens when your cache fails?" is not a curveball. It is a standard follow-up to any design that includes a cache.

So is:

"What happens when your primary database goes down?"
"What does your system do when the payment processor is slow but not timing out?"
"What happens if all your clients reconnect at the same moment after a 30-second outage?"

Each of those maps to a named chaos engineering scenario. Knowing the scenario means knowing what metric changes first, what cascade follows, and what design decision prevents or contains it.

Four categories, eight scenarios, eight interview questions

Network chaos

Latency injection

What the interviewer is testing with: "Your third-party payment processor starts responding slowly. Your checkout service calls it synchronously. Walk me through the impact on your cart API."

What latency injection shows: In a synchronous call chain, injected latency is passed through in full rather than absorbed. A 200ms injection on one downstream service adds 200ms to every request that touches it. If your checkout service makes five synchronous calls and one of them is slow, end-to-end p99 climbs by the full injection amount, not a fraction of it.

The design decision being tested is whether you have identified which dependencies should be async and which genuinely need to be synchronous.
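A minimal sketch of that arithmetic, assuming the downstream calls run one after another and that moving a call onto a queue removes it from the response path. The function names and the per-call latencies are hypothetical, chosen only for illustration.

```rust
/// Latency of one request, counting only the calls that block the response.
/// Each entry is (latency in ms, whether the call is on the critical path).
fn response_latency_ms(calls: &[(f64, bool)]) -> f64 {
    calls
        .iter()
        .filter(|(_, blocking)| *blocking)
        .map(|(latency_ms, _)| *latency_ms)
        .sum()
}

/// Return a copy of `calls` with `extra_ms` added to the call at `index`.
/// An out-of-range index leaves the calls unchanged.
fn inject_latency(calls: &[(f64, bool)], index: usize, extra_ms: f64) -> Vec<(f64, bool)> {
    let mut injected = calls.to_vec();
    if let Some(call) = injected.get_mut(index) {
        call.0 += extra_ms;
    }
    injected
}
```

With five blocking calls of 20, 30, 25, 15, and 40 ms, the request takes 130 ms. Injecting 200 ms into the last call makes it 330 ms. Marking that call as non-blocking brings the response back to 90 ms, which is the argument for making slow third-party dependencies asynchronous where the flow allows it.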

Network partition

What the interviewer is testing with: "Your cache and your application servers can't communicate. What does your system do?"

What network partition shows: Whichever side of the partition keeps running will exhaust its connection pool trying to reach the other side. Without circuit breakers, callers queue requests that will never complete.

With partition, you see the CAP theorem stop being theoretical. You are making a real choice between consistency and availability, and your design either has an explicit answer or it does not.
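To make "without circuit breakers, callers queue requests that will never complete" concrete, here is a small circuit breaker sketch. It is a simplified illustration, not any particular library's implementation: the type names, the consecutive-failure threshold, and the caller-supplied millisecond clock are all assumptions, and the half-open state lets every request through until one result is recorded.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum BreakerState {
    Closed,
    Open { opened_at_ms: u64 },
    HalfOpen,
}

struct CircuitBreaker {
    state: BreakerState,
    consecutive_failures: u32,
    failure_threshold: u32,
    cooldown_ms: u64,
}

impl CircuitBreaker {
    fn new(failure_threshold: u32, cooldown_ms: u64) -> Self {
        CircuitBreaker {
            state: BreakerState::Closed,
            consecutive_failures: 0,
            failure_threshold,
            cooldown_ms,
        }
    }

    /// Should the caller attempt the request at time `now_ms`?
    fn allow_request(&mut self, now_ms: u64) -> bool {
        match self.state {
            BreakerState::Closed | BreakerState::HalfOpen => true,
            BreakerState::Open { opened_at_ms } => {
                if now_ms.saturating_sub(opened_at_ms) >= self.cooldown_ms {
                    // Cooldown elapsed: let a probe through.
                    self.state = BreakerState::HalfOpen;
                    true
                } else {
                    // Fail fast instead of holding a pooled connection.
                    false
                }
            }
        }
    }

    fn record_success(&mut self) {
        self.consecutive_failures = 0;
        self.state = BreakerState::Closed;
    }

    fn record_failure(&mut self, now_ms: u64) {
        self.consecutive_failures += 1;
        let probe_failed = self.state == BreakerState::HalfOpen;
        if probe_failed || self.consecutive_failures >= self.failure_threshold {
            self.state = BreakerState::Open { opened_at_ms: now_ms };
        }
    }
}
```

While the breaker is open, callers get an immediate failure and can choose the availability side of the trade-off explicitly, for example by serving stale data, instead of waiting on a connection that will never return.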

Infrastructure chaos

Node failure

What the interviewer is testing with: "One of your three application servers goes down. What happens to traffic?"

What node failure shows: Single points of failure that were not obvious on the diagram become obvious when you kill one component and watch traffic stack on the survivors. If your load balancer is not health-checking aggressively enough, it keeps routing to the dead node. If your remaining nodes were sized assuming N capacity rather than N-1, they saturate immediately.
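The N versus N-1 sizing point is easy to check with arithmetic. A rough sketch, assuming traffic spreads evenly across healthy nodes and each node has a fixed RPS capacity; the function name and the numbers in the note are illustrative.

```rust
/// Fraction of each healthy node's capacity in use when `total_rps` is
/// spread evenly across `healthy_nodes`. Values above 1.0 mean saturation.
/// Returns None when no node is left to serve traffic.
fn per_node_utilization(total_rps: f64, healthy_nodes: u32, node_capacity_rps: f64) -> Option<f64> {
    if healthy_nodes == 0 {
        return None;
    }
    Some(total_rps / (healthy_nodes as f64 * node_capacity_rps))
}
```

At 9,000 RPS on three nodes rated for 4,000 RPS each, utilization is 75%. Lose one node and the same traffic puts the two survivors at 112.5%, which is what "sized for N rather than N-1" looks like in numbers.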

Cache stampede

What the interviewer is testing with: "You do a rolling deploy and your cache gets cleared. It's 2pm on a Tuesday. Walk me through the next 60 seconds."

What cache stampede shows: Forcing the cache hit rate to near zero at meaningful RPS reveals whether your origin can handle the full load. Usually it cannot, because the origin was sized assuming the cache absorbs 80 to 95% of reads.

The database connection pool saturates, queue depth climbs, and p99 spikes.

The design decisions it tests:

Probabilistic early expiration, so hot keys are refreshed shortly before they expire instead of all expiring at once.
Request coalescing, so a thousand simultaneous cache misses for the same key only produce one origin request.
Cache warming or write-through population, so the cache is not empty when it starts taking traffic after a deploy.
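For the first item in that list, one well-known formulation is the "XFetch" rule from the probabilistic early expiration literature: each request independently decides to refresh a key slightly before its expiry, with the probability rising as expiry approaches. The sketch below assumes that rule; it is not a claim about how any specific cache or the simulator implements it. The random draw `u` is passed in so the function stays deterministic and testable.

```rust
/// Decide whether this request should recompute a cached value early.
///
/// now_s       current time, in seconds
/// expiry_s    when the cached value expires, in seconds
/// recompute_s how long a recompute takes (often called delta)
/// beta        values above 1.0 favour earlier refreshes
/// u           a uniform random draw in (0, 1]
fn should_refresh_early(now_s: f64, expiry_s: f64, recompute_s: f64, beta: f64, u: f64) -> bool {
    // ln(u) is zero or negative, so subtracting it shifts "now" forward
    // by a random amount that scales with the recompute cost.
    now_s - recompute_s * beta * u.ln() >= expiry_s
}
```

Because each request draws its own `u`, only a few requests refresh a hot key early, and the rest keep serving the cached value. Once the key has actually expired the function returns true for every draw.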

Data layer chaos

Replication lag

What the interviewer is testing with: "Your read replicas are falling 8 seconds behind the primary. Which users notice, and what do they experience?"

What replication lag shows: Write-then-read flows break first. A user updates their profile photo and immediately refreshes. The replica serves the old photo.

The design decision it tests is whether you have identified which flows require reading from the primary versus which can tolerate eventual consistency. Most designs treat all reads as replica-safe. Replication lag reveals which ones are not.
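One common way to handle write-then-read flows is to pin a user's reads to the primary for a short window after they write. A minimal sketch of that routing decision, assuming the application tracks the time since each user's last write; the names and the pin window are hypothetical.

```rust
#[derive(Debug, PartialEq)]
enum ReadTarget {
    Primary,
    Replica,
}

/// Route a read for one user.
/// `ms_since_last_write` is None when the user has not written recently.
/// `pin_window_ms` should be at least the worst replication lag you tolerate.
fn choose_read_target(ms_since_last_write: Option<u64>, pin_window_ms: u64) -> ReadTarget {
    match ms_since_last_write {
        Some(elapsed) if elapsed < pin_window_ms => ReadTarget::Primary,
        _ => ReadTarget::Replica,
    }
}
```

With replicas 8 seconds behind and a 10-second pin window, the user who just changed their profile photo reads from the primary, while everyone else keeps reading from replicas.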

Connection pool exhaustion

What the interviewer is testing with: "Your database is running at 30% CPU but your application is throwing connection timeout errors. What's happening?"

What connection pool exhaustion shows: A database can be entirely healthy while the application layer is unable to use it, because the connection pool is full of connections that are slow to complete.

This scenario teaches that pool utilization is a more important leading indicator than database CPU. The design decision it tests is how you size pools and whether you have pool utilization alerts before exhaustion, not after.
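Little's law gives a quick estimate of pool utilization: connections in use are roughly the query rate multiplied by how long each query holds a connection. The sketch below applies that relationship; the function names and the example numbers are assumptions for illustration.

```rust
/// Average number of connections held at once (Little's law):
/// arrival rate multiplied by time in system.
fn connections_in_use(queries_per_second: f64, mean_hold_time_ms: f64) -> f64 {
    queries_per_second * mean_hold_time_ms / 1000.0
}

/// Demand on the pool as a fraction of its size.
/// Values above 1.0 mean callers are queueing for a connection.
fn pool_utilization(queries_per_second: f64, mean_hold_time_ms: f64, pool_size: u32) -> f64 {
    connections_in_use(queries_per_second, mean_hold_time_ms) / pool_size as f64
}
```

At 1,000 QPS with 20 ms queries, about 20 connections are busy, so a 100-connection pool sits at 20%. If a lock or a slow downstream stretches hold time to 150 ms, demand becomes 150 connections against a pool of 100, and the application starts timing out while database CPU still looks comfortable.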

Traffic chaos

Request spike

What the interviewer is testing with: "Your service goes viral. Traffic goes from 5,000 to 50,000 RPS in four minutes. What breaks first?"

What a request spike shows: The first component to saturate is always the one with the smallest capacity headroom, which is often not the component you expected. At 10x traffic, the cache usually holds because hit rate stays high. The database write path saturates first if writes are not queued.

Auto-scaling helps, but it has a lag. The design decision it tests is what happens between the spike starting and the new capacity coming online.
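A rough way to reason about that gap is to compute when a linear ramp crosses current capacity and compare it with how long new capacity takes to arrive. The sketch assumes a linear ramp; the capacity figure in the note is a made-up example, not a measurement.

```rust
/// Seconds into a linear ramp from `start_rps` to `end_rps` (over `ramp_s`
/// seconds) at which load first reaches `capacity_rps`.
/// Returns None if the ramp never reaches capacity.
fn seconds_until_saturation(start_rps: f64, end_rps: f64, ramp_s: f64, capacity_rps: f64) -> Option<f64> {
    if capacity_rps <= start_rps {
        return Some(0.0); // already saturated before the spike
    }
    if capacity_rps >= end_rps {
        return None; // enough headroom for the whole spike
    }
    if ramp_s <= 0.0 {
        return Some(0.0); // instantaneous jump past capacity
    }
    Some((capacity_rps - start_rps) / (end_rps - start_rps) * ramp_s)
}
```

For the 5,000 to 50,000 RPS spike over four minutes, a tier with a hypothetical 20,000 RPS of capacity saturates 80 seconds in. If scaling out takes longer than that, the difference is the window the design has to survive with queues, load shedding, or pre-provisioned headroom.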

Thundering herd

What the interviewer is testing with: "Your service goes down for 30 seconds and comes back up. All your clients try to reconnect at the same moment. What happens?"

What thundering herd shows: Coordinated load is more damaging than the same volume of load spread over time. A million clients reconnecting in the same two-second window produce a load spike that no steady-state capacity planning accounts for.

The design decisions it tests are jittered reconnection backoff on the client side and request coalescing or rate limiting on the server side.
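On the client side, jittered backoff usually means exponential backoff with a random delay drawn below a growing ceiling (often called "full jitter"). A minimal sketch under that assumption; the function names are hypothetical, and the random draw `u` is supplied by the caller so the logic can be tested deterministically.

```rust
/// Upper bound on the delay for a given retry attempt:
/// base * 2^attempt, capped at `cap_ms`. Saturates instead of overflowing.
fn backoff_ceiling_ms(attempt: u32, base_ms: u64, cap_ms: u64) -> u64 {
    let multiplier = 1u64.checked_shl(attempt).unwrap_or(u64::MAX);
    base_ms.saturating_mul(multiplier).min(cap_ms)
}

/// Full jitter: pick a delay uniformly between 0 and the ceiling.
/// `u` is a uniform random draw in [0, 1].
fn jittered_delay_ms(attempt: u32, base_ms: u64, cap_ms: u64, u: f64) -> u64 {
    let ceiling = backoff_ceiling_ms(attempt, base_ms, cap_ms);
    (u.clamp(0.0, 1.0) * ceiling as f64) as u64
}
```

Without the random draw, every client that disconnected at the same moment retries at the same moment. With it, reconnects spread across the whole interval up to the ceiling.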

How to narrate a chaos scenario in an interview

The structure that works:

State the steady state first.
Describe the injection.
Walk through the cascade in order.
Give specific numbers.
State your design fix.

Steady state:

"At 30K RPS with a warm cache, p99 is 48ms and error rate is 0.1%."

Injection:

"The cache restarts, so hit rate drops to near zero."

Cascade:

"Origin requests spike by a factor of 10. The database connection pool is sized for 3,000 concurrent connections. At 30K RPS hitting origin, that pool exhausts in under 500 milliseconds."

Numbers:

"p99 climbs to 2,400ms. Error rate hits 34%."

Fix:

"Request coalescing at the cache layer means a thousand simultaneous misses for the same key produce one origin request. Probabilistic early expiration means the cache never goes fully cold during a deploy."

The numbers are what make the answer credible. Abstract descriptions of cascade failure sound like everyone else's answer.

Where to actually practice this

Reading about cache stampede and watching it happen on a live metrics graph are different experiences.

I built a free browser-based chaos engineering simulator with all 28 scenarios: load any blueprint, run traffic at your target RPS, inject a chaos scenario, and watch p99, error rate, and throughput change in real time. No infrastructure, no signup, runs entirely in the browser.

The cache stampede on the Twitter / X Clone blueprint is the most instructive one to start with.

The difference is familiarity

The candidates who answer chaos questions well are not smarter. They have seen the failure mode before.

They know that connection pool exhaustion, not database CPU, is what takes the system down during a cache stampede. They know that injected latency passes through a synchronous call chain in full. They know that thundering herd hits hardest in the two seconds after recovery, not during the outage.

That familiarity is the difference between a hand-wave and a convincing answer, and it is entirely learnable before the interview.

Why your AWS architecture cost estimate is always wrong - and how to fix it

Ronit Dahiya — Sat, 06 Jun 2026 15:57:16 +0000

$12,000 per month or $1,800. Same architecture. Same traffic. Same instance types. One architecture decision different.

That decision is whether you put a CDN in front of your origin. Not a specific vendor button. Not a pricing-calculator checkbox. Just the architectural choice to add an edge cache between users and the services that would otherwise serve every read directly.

The reason most cost estimates miss this is that most estimates start from the wrong place: service configuration instead of architecture.

Here is what typically happens. An engineer needs to estimate AWS costs for a new service. They open the AWS Pricing Calculator, select EC2, guess an instance type, type in some hours. Then RDS, ElastiCache, maybe a load balancer. They add it up, present the number, and move on.

The problem is that by the time you are filling in that calculator, you have already made the three or four architectural decisions that determine whether the bill is reasonable or catastrophic. The calculator is just arithmetic on top of those decisions. If the decisions are wrong, precise arithmetic makes things worse, not better.

The line item everyone underestimates

Data transfer out is the most expensive surprise in AWS billing. It does not look expensive per unit. A few cents per GB sounds harmless until you do the multiplication.

A standard web application at 10,000 requests per second with an average response size of 5KB produces roughly 130 terabytes of outbound transfer per month.

That math is simple: 10,000 RPS x 5KB x 2,592,000 seconds per month. At roughly $0.09/GB, that is around $11,700 per month from data transfer alone, before you have paid for a single EC2 instance, database, cache, or queue.

This is not an edge case. Any application with meaningful read traffic and no CDN layer runs into this wall. The application tier, the database, and the cache are usually not the expensive part. The pipe is.

Now put a CDN component in front of the same origin. The CDN serves cached static assets, media, pages, or API responses from the edge when possible. The origin only handles cache misses and dynamic requests.

If the CDN absorbs 85 to 90 percent of read traffic, origin transfer falls to roughly 10 to 15 percent of what it was. The architecture did not need a new database. It did not need a bigger app server. It needed the read path to stop treating the origin as the first stop for every request.
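The arithmetic is short enough to write down. This sketch uses the article's figures (10,000 RPS, 5 KB responses, a 30-day month, roughly $0.09 per GB) and decimal units. It deliberately ignores tiered pricing and whatever the CDN itself charges for egress, so treat it as an origin-side estimate only; the function names are illustrative.

```rust
const SECONDS_PER_30_DAY_MONTH: f64 = 2_592_000.0;

/// Outbound transfer per month in GB (decimal: 1 GB = 1,000,000 KB).
fn monthly_transfer_gb(rps: f64, response_kb: f64) -> f64 {
    rps * response_kb * SECONDS_PER_30_DAY_MONTH / 1_000_000.0
}

/// Origin data-transfer cost per month when a CDN absorbs `cdn_offload`
/// (0.0 to 1.0) of the traffic. CDN fees are NOT included.
fn monthly_origin_transfer_cost(rps: f64, response_kb: f64, price_per_gb: f64, cdn_offload: f64) -> f64 {
    let origin_share = 1.0 - cdn_offload.clamp(0.0, 1.0);
    monthly_transfer_gb(rps, response_kb) * origin_share * price_per_gb
}
```

With no CDN this gives 129,600 GB and about $11,664 per month. With 85 percent offload the origin's share drops to about $1,750, which is where the gap between the two headline numbers comes from.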

That is the point most estimates miss. The CDN is not just a performance optimization with a cost. At scale, it is often a cost-control decision that also improves latency.

The three decisions that set your bill before you open any calculator

Compute model. EC2, ECS Fargate, and Lambda have fundamentally different cost structures at different request rates. Lambda charges per invocation and duration. That can be cheap at low, intermittent traffic and expensive at sustained high volume. The crossover point where EC2 or containers become cheaper than Lambda is usually workload-specific, but it often appears once traffic becomes steady rather than spiky.

Most teams pick one model and stick with it without checking where that crossover sits for their workload. Checking it before you build costs nothing. Discovering it after months of high bills is more expensive.
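Checking the crossover can be a few lines of arithmetic. The sketch below compares a per-invocation model against always-on instances. All prices and the per-instance throughput are parameters, and the values used in the note are placeholders rather than current AWS list prices, so substitute your own before drawing conclusions.

```rust
const HOURS_PER_MONTH: f64 = 730.0;
const SECONDS_PER_MONTH: f64 = 2_592_000.0;

/// Placeholder pricing inputs for a per-invocation compute model.
struct LambdaPricing {
    per_request: f64,
    per_gb_second: f64,
}

/// Monthly cost of serving `rps` with per-request plus duration billing.
fn lambda_monthly_cost(rps: f64, avg_duration_ms: f64, memory_gb: f64, pricing: &LambdaPricing) -> f64 {
    let invocations = rps * SECONDS_PER_MONTH;
    let gb_seconds = invocations * (avg_duration_ms / 1000.0) * memory_gb;
    invocations * pricing.per_request + gb_seconds * pricing.per_gb_second
}

/// Monthly cost of serving `rps` with always-on instances.
/// At least one instance is billed even when traffic is near zero.
fn instance_monthly_cost(rps: f64, rps_per_instance: f64, hourly_price: f64) -> f64 {
    let instances = (rps / rps_per_instance).ceil().max(1.0);
    instances * hourly_price * HOURS_PER_MONTH
}
```

With placeholder prices of 0.0000002 per request and 0.000017 per GB-second, 100 ms invocations at 0.5 GB cost about $2.72 per month at 1 RPS, against $29.20 for one instance at $0.04 per hour. At a steady 500 RPS the same function costs about $1,361, while one instance that can handle 500 RPS still costs $29.20. The crossover sits somewhere in between, and where exactly depends on your workload.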

Data path. This is the CDN question, but it is broader than CDN alone. Are clients pulling directly from app servers, or can static assets and media come from a CDN? Are large files proxied through compute, or served from object storage? Are repeat reads hitting a cache, or falling through to the database?

These choices decide whether scale multiplies expensive origin work or gets absorbed closer to the user.

Database tier. RDS, Aurora Serverless, DynamoDB, and cache-backed database patterns all have different cost curves. A steady 24/7 API has a different answer than a batch-heavy internal tool or a bursty consumer product. The right database estimate depends on write rate, read rate, storage growth, replication, and how much traffic can be served from cache.

The wrong database tier can cost more than the wrong instance type. The wrong read path can cost more than both.

Estimate before the architecture is locked, not after

The most useful cost estimate is not "what will this cost at our current traffic." It is "what will this cost at 2x, 5x, and 10x traffic, and where does the curve bend?"

The point where cost growth accelerates tells you where the first architectural change will be required. Sometimes it is the database. Sometimes it is the cache. Sometimes it is data transfer because every request is still going back to origin.

Almost never is it only the compute instance type, which is the part engineers spend the most time optimizing.

Running this estimate before committing to an architecture means you design for the right bottleneck. Running it after means you redesign under pressure when the bill arrives.

I built a free browser-based cost estimator that starts from your architecture diagram rather than service configuration: you draw components like Load Balancer, App Server, Cache, Database, Object Storage, and CDN, set your target RPS, and see the cost breakdown update from the diagram in real time with no account required: https://syssimulator.com/aws-cost-estimator

The bill is set before the calculator opens

Cost estimation done after architectural decisions are made is just arithmetic on choices you have already locked in. The CDN question, the compute model question, the data path question, and the database tier question are what move the bill by factors of two to ten.

Instance type optimization still matters, but it usually moves the number by tens of percent. Architecture moves it by orders of magnitude.

Estimate while the architecture is still fluid. Run the numbers at 2x, 5x, and 10x. Find where the curve bends. That is where your first architectural constraint lives, and it is much cheaper to discover it on a diagram than in production.

Stop drawing system design diagrams. Start simulating them.

Ronit Dahiya — Thu, 23 Apr 2026 20:04:18 +0000

Your diagram never fails.

That's the problem.

You draw a load balancer, three app servers, Redis cache, Postgres. Arrows everywhere. Looks great. Interviewer nods. You nod. Everyone nods. And then you get the follow-up question: "What happens at 50,000 requests per second?"

And you hand-wave. "Well, the cache handles most of it, so the database is mostly fine..."

Mostly fine. The two most dangerous words in system design.

The lie we tell in interviews

I've sat through a lot of system design prep. The advice is always the same: draw the components, explain the tradeoffs, mention CAP theorem, say "it depends" a lot.

Here's what nobody tells you: a diagram is a snapshot of your best intentions, not a model of your system's behavior.

Your diagram shows what components exist. It says nothing about what happens when:

Your cache cold-starts after a deploy
Two pods restart simultaneously and 40,000 requests hit your database directly
Your message queue backs up because a downstream service is slow

The diagram sits there looking perfect while your system falls apart in ways you didn't anticipate.

I got tired of hand-waving through these scenarios. So I built something that actually runs them.

What simulation adds that drawing doesn't

I built SysSimulator — a system design simulator that runs entirely in your browser. You drag components onto a canvas, configure them (cache hit rate, database connection pool size, replica count), set your RPS, and hit run. It uses a discrete-event simulation engine compiled from Rust to WebAssembly, so it's running real math, not vibes.

(The engineering behind the Rust/WASM part is in my previous article if that's your thing. This one is about what you actually learn from running it.)

Here's the thing about running a simulation: it tells you things your diagram can't.

The cache stampede I didn't expect

Let me give you a concrete example. Classic interview scenario: high-read system with a Redis cache in front of Postgres.

I drew this a hundred times. Cache handles 90% of reads, database handles 10%, everything is fine.

Then I simulated it.

I set up the blueprint: load balancer → app servers → Redis → Postgres. Cache hit rate: 90%. Starting load: 1,000 RPS. Then I hit the "cache expiry" chaos scenario — simulates what happens when your cache TTLs expire simultaneously, which happens after a cold start or a cache flush.

What my diagram predicted: cache handles 90% of reads, brief spike on the database, recovers quickly.

What actually happened:

p99 latency: 48ms → 2,400ms in 11 seconds
Database error rate: 34%
Queue depth: backed up to 8,000 pending requests
Recovery time after cache refills: 4 minutes

The diagram said "brief spike." The simulation said "your database is on fire and users are seeing 2.4-second loads for 4 minutes."

That's a cache stampede. It's a known failure mode — there's even a Wikipedia article about it — and my beautiful diagram completely missed it.

Why this changes how you talk in interviews

Here's the interview prep angle that I think gets overlooked.

When you simulate a scenario, you get specific numbers. And specific numbers change how you talk.

Before simulation:

"The cache handles most of the load, so the database should be fine."

After simulating the cache stampede:

"At 10K RPS with a 90% cache hit rate, the database normally handles around 1,000 QPS — well within its connection pool limit. But if the cache expires simultaneously, we lose that 90% hit rate for about 60-90 seconds. That sends the full 10K RPS to the database, which saturates the connection pool. p99 spikes from ~50ms to over 2 seconds, error rates hit 30%+. The fix is probabilistic cache expiry — instead of all keys expiring at the same TTL, you add jitter. Each key expires at TTL + random(0, TTL*0.1), so the stampede becomes a trickle."

Same underlying knowledge. Completely different delivery. The second version sounds like someone who has actually seen this happen.

You haven't seen it happen — but you've run the simulation, and that's close enough to talk about it with conviction.
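The jitter formula in that answer, TTL + random(0, TTL*0.1), is a one-liner. A sketch assuming the random draw is supplied by the caller; the function name and the 300-second TTL in the note are illustrative.

```rust
/// TTL with up to `jitter_fraction` of extra lifetime added at random.
/// `u` is a uniform random draw in [0, 1]; jitter_fraction 0.1 matches
/// the TTL + random(0, TTL * 0.1) rule.
fn jittered_ttl_s(base_ttl_s: f64, jitter_fraction: f64, u: f64) -> f64 {
    base_ttl_s + u.clamp(0.0, 1.0) * base_ttl_s * jitter_fraction
}
```

With a 300-second base TTL, keys written at the same moment now expire anywhere between 300 and 330 seconds later, so the refill load is spread over 30 seconds instead of landing in one instant.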

The narration framework that actually works

For system design interviews, I use a three-part structure for any scenario:

1. The steady state — what happens at normal load

"At 10K RPS with warm cache, p99 is around 45ms. Database handles 800 QPS, well within its 2,000 connection limit."

2. The failure mode — what breaks it and why

"If we lose the cache — cold start, flush, or a stampede — the full load hits the database directly. Connection pool saturates in about 8 seconds."

3. The fix — specific, not vague

"Cache-aside with jitter on TTL. Or serve stale-while-revalidate with a background refresh. Either approach limits the blast radius of cache expiry."

The simulation gives you the numbers for part 1 and 2. Part 3 is where you show you understand the tradeoffs.

I wrote up a longer version of this framework with the exact narration for the cache stampede scenario, word for word, with the numbers. It's what I'd say if I were on the spot in an interview right now.

Running it yourself

If you want to try the cache stampede scenario specifically:

Go to SysSimulator
Load the "Three-Tier Web App" blueprint (it's in the blueprints panel)
Set RPS to 10,000
Run the simulation — watch p99 and error rate in the live metrics bar
Hit the "Cache Stampede" chaos scenario
Watch what happens to p99

It takes about 3 minutes. At the end you'll have specific numbers you can use in any interview that involves a cache.

The tool is free, no login, runs entirely in your browser. The simulation runs in WebAssembly so it's fast — 100K RPS simulations run ahead of wall-clock time.

What to do with the other blueprints

The stampede is one scenario. There are 57 blueprints total — CQRS, event sourcing, Saga pattern, rate limiting architectures, MCP agent systems — and 28 chaos scenarios you can inject on any of them.

The ones that surprise people most in practice:

Read replica lag — your replica is "caught up" until it isn't, and now your read-after-write consistency assumption is broken
Thundering herd on load balancer restart — all connections re-establishing simultaneously
Queue backpressure — your Kafka consumer is 800ms behind, your producer doesn't slow down, your broker fills up

None of these look dangerous in a diagram. All of them have produced real incidents. Running the simulation doesn't replace production experience — but it's a lot faster and cheaper than getting paged at 2am.

I spent a lot of time drawing boxes in Excalidraw and not enough time asking "what actually breaks here." This is me trying to fix that.

If you find the cache stampede numbers useful for prep, or have a specific scenario you want to model, drop it in the comments. Happy to work through it.

I compiled Rust to WebAssembly to build a system design simulator that runs entirely in your browser!

Ronit Dahiya — Mon, 20 Apr 2026 10:23:38 +0000

Static diagrams don't fail. Systems do.

That was the problem I kept running into when practicing system design. I'd draw boxes and arrows, convince myself the architecture was solid, and then an interviewer would ask "what happens to your read traffic if the cache goes down?" - and I'd be narrating from memory, not from evidence.

I wanted to watch my architecture break. So I built SysSimulator - a free browser-based tool that lets you simulate real traffic, inject chaos scenarios, and watch cascade failures in real time. No install. No signup. No backend required.

The interesting engineering decision was the foundation: the simulation engine is written in Rust, compiled to WebAssembly, and runs entirely in your browser.

Here's why, and what I learned.

Why not just use JavaScript?

The obvious choice for a browser-based simulator is JavaScript. It's already in the browser. You don't need a compilation step. Every tutorial on "build a simulation in the browser" uses it.

But simulation engines have a specific performance profile that JavaScript handles badly.

A discrete-event simulation (DES) processes thousands of events per second - request arrivals, processing completions, timeout triggers, state transitions. Each event modifies shared state (component queues, error counts, latency distributions) and may produce new events. At 100,000 RPS with 10 components, you're processing hundreds of thousands of state mutations per second.

JavaScript's garbage collector can pause execution mid-simulation. At high event rates, those pauses become visible - the particle animation stutters, the metrics bar freezes, the simulation loses time fidelity. It's not fatal for a toy, but it breaks the sense of "real" that makes the tool actually useful for building intuition.

Rust gives you:

Deterministic memory management - no GC pauses, no stop-the-world
Predictable performance - the simulation advances at wall-clock speed without hitches
Zero-cost abstractions - rich type system and pattern matching with no runtime overhead
Direct WASM compilation via wasm-pack with minimal boilerplate

The tradeoff is compile time and complexity. It's worth it.

How the DES engine works

A discrete-event simulation has three core concepts:

1. Events - things that happen at a specific simulated time. In SysSimulator, events are things like:

RequestArrived { component_id, timestamp, request_id }
ProcessingComplete { component_id, timestamp, latency_ms }
ChaosInjected { scenario, target_component, severity }

2. The event queue - a priority queue ordered by timestamp. The engine always processes the earliest event first. This is what makes DES "discrete" - time jumps forward in steps, not continuously.

3. State - the current condition of every component. Queue depth, active connections, error rates, latency distributions. Each event reads and writes state.
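One detail worth spelling out for the event queue: Rust's standard BinaryHeap is a max-heap, and f64 timestamps do not implement Ord, so an earliest-first queue needs a reversed, total ordering. The sketch below shows one way to do that with the standard library. TimedEvent is a hypothetical stand-in, not the simulator's actual event type.

```rust
use std::cmp::Ordering;
use std::collections::BinaryHeap;

#[derive(Debug)]
struct TimedEvent {
    timestamp: f64,
    label: &'static str,
}

impl Ord for TimedEvent {
    fn cmp(&self, other: &Self) -> Ordering {
        // Reversed on purpose: BinaryHeap pops the "greatest" element,
        // so the earliest timestamp must compare as greatest.
        other.timestamp.total_cmp(&self.timestamp)
    }
}

impl PartialOrd for TimedEvent {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

impl PartialEq for TimedEvent {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}

impl Eq for TimedEvent {}

/// Pop every event and return the labels in the order they were processed.
fn drain_in_time_order(events: Vec<TimedEvent>) -> Vec<&'static str> {
    let mut queue: BinaryHeap<TimedEvent> = events.into_iter().collect();
    let mut order = Vec::new();
    while let Some(event) = queue.pop() {
        order.push(event.label);
    }
    order
}
```

Wrapping events in std::cmp::Reverse is an alternative when the event type already has a natural ordering. Either way, the engine can rely on pop() returning the next event in simulated time.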

The core loop in Rust is approximately:

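pub fn step(&mut self) -> SimulationResult {
    // Fix the tick boundary up front: self.clock moves as events are processed,
    // so comparing against clock + tick_duration inside the loop would never stop.
    let tick_end = self.clock + self.tick_duration;

    // Peek before popping so an event that belongs to a later tick stays queued.
    // Assumes event_queue is ordered so the earliest timestamp pops first.
    while self
        .event_queue
        .peek()
        .map_or(false, |next| next.timestamp <= tick_end)
    {
        let Some(event) = self.event_queue.pop() else { break };
        self.clock = event.timestamp;

        let new_events = self.process_event(&event);
        for e in new_events {
            self.event_queue.push(e);
        }

        self.update_metrics(&event);
    }

    // Advance to the tick boundary even if no event landed exactly on it.
    self.clock = tick_end;
    self.collect_metrics()
}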

The real complexity is in process_event - each component type (load balancer, cache, database, message queue) has its own behaviour model. A cache hit generates a fast response event. A cache miss cascades to a database read. A database under memory pressure starts dropping connections. The interactions are what make simulation genuinely useful.

Compiling Rust to WASM with wasm-pack

The compilation pipeline is simpler than I expected.

Cargo.toml:

[lib]
crate-type = ["cdylib"]

[dependencies]
wasm-bindgen = "0.2"
js-sys = "0.3"
serde = { version = "1.0", features = ["derive"] }
serde-wasm-bindgen = "0.6"

[profile.release]
opt-level = "s"  # optimise for size, not speed

The opt-level = "s" is important - WASM bundles transferred over the network should be small. A smaller binary also downloads and compiles faster, though size-optimised code is not guaranteed to run faster than opt-level = 3, so benchmark both if simulation throughput matters.

Exposing functions to JavaScript:

use std::collections::BinaryHeap;
use wasm_bindgen::prelude::*;

#[wasm_bindgen]
pub struct SimEngine {
    state: SimulationState,
    event_queue: BinaryHeap<SimEvent>,
    clock: f64,
}

#[wasm_bindgen]
impl SimEngine {
    #[wasm_bindgen(constructor)]
    pub fn new(topology: JsValue) -> SimEngine {
        let parsed: Topology = serde_wasm_bindgen::from_value(topology).unwrap();
        SimEngine::from_topology(parsed)
    }

    pub fn step(&mut self, tick_ms: f64) -> JsValue {
        let result = self.advance(tick_ms);
        serde_wasm_bindgen::to_value(&result).unwrap()
    }

    pub fn inject_chaos(&mut self, scenario: JsValue) {
        let chaos: ChaosScenario = serde_wasm_bindgen::from_value(scenario).unwrap();
        self.apply_chaos(chaos);
    }
}

Build command:

wasm-pack build --target bundler --release

This produces a pkg/ directory with:

syssimulator_bg.wasm - the compiled WASM binary
syssimulator.js - the JS glue code generated by wasm-bindgen
TypeScript type definitions

The generated JS handles the memory bridge between JavaScript and WASM automatically. You call Rust functions like they're regular JS functions. The serde-wasm-bindgen crate handles serialisation of complex types (the topology JSON, metrics output) across the boundary.

The WASM loading strategy

First load time is the main UX risk with WASM. The binary needs to be fetched, compiled, and instantiated before the simulation can run. On a slow connection this can be several seconds.

My approach:

1. Async loading with a visible loading state. The UI renders immediately from static HTML. The simulation controls are shown but disabled. A loading indicator shows "Initialising simulation engine..." so users know what's happening.

2. Streaming compilation. Modern browsers can compile WASM while it's still downloading via WebAssembly.instantiateStreaming. This is enabled automatically when you serve WASM with the correct Content-Type: application/wasm header. On Vercel, this is handled automatically.

3. Persistent caching. The WASM binary is served with a content-hash filename and long Cache-Control headers. After the first visit, subsequent loads are instant - the binary comes from the browser cache.

// The generated wasm-bindgen glue handles this, but conceptually:
const { instance } = await WebAssembly.instantiateStreaming(
  fetch('/pkg/syssimulator_bg.wasm'),
  importObject
);

Modelling the 18 component types

Every component in the simulator has a behaviour model that determines:

Processing latency - a latency distribution (P50, P95, P99), derived from real-world measurements for that component type
Concurrency limits - max simultaneous requests before queuing begins
Failure modes - what happens under chaos injection (node crash, memory pressure, etc.)

Here are a few interesting ones:

Cache (Redis model)

fn process_cache_request(&mut self, req: &Request) -> ProcessResult {
    let hit = self.rng.gen::<f64>() < self.hit_rate;

    if hit {
        // Cache hit: fast response, no downstream call
        ProcessResult::complete(req, self.hit_latency_dist.sample())
    } else {
        // Cache miss: forward to origin with cache miss overhead
        let miss_latency = self.miss_overhead_dist.sample();
        ProcessResult::forward_to_origin(req, miss_latency)
    }
}

Under a cache stampede chaos scenario, the hit rate drops to near zero and hundreds of requests simultaneously hit the origin - which is where the cascade failure becomes visible in the simulation.

Load balancer (round-robin model)

fn route_request(&mut self, req: Request) -> Option<ComponentId> {
    // Round-robin with health check
    let healthy_backends: Vec<_> = self.backends
        .iter()
        .filter(|b| b.is_healthy())
        .collect();

    if healthy_backends.is_empty() {
        return None; // All backends unhealthy - request fails
    }

    let idx = self.counter % healthy_backends.len();
    self.counter += 1;
    Some(healthy_backends[idx].id)
}

When you inject a node failure on one of the app servers, the load balancer's health check detects it and routes around it - but if enough backends fail, capacity drops and latency rises. This is the exact behaviour pattern that shows up in production incidents.

The chaos engine - 28 scenarios

The chaos system is separate from the simulation engine. Each scenario is a function that modifies component state:

pub fn apply_chaos(&mut self, scenario: ChaosScenario) {
    match scenario.kind {
        ChaosKind::NetworkPartition => {
            // Drop all requests between two components
            self.add_connection_filter(
                scenario.source,
                scenario.target,
                ConnectionFilter::DropAll
            );
        },
        ChaosKind::LatencyInjection { p50_ms, p99_ms } => {
            // Add artificial latency distribution to a component
            self.components[scenario.target]
                .add_latency_overhead(LatencyDist::new(p50_ms, p99_ms));
        },
        ChaosKind::CacheStampede => {
            // Force cache hit rate to near zero
            if let Component::Cache(ref mut cache) = self.components[scenario.target] {
                cache.override_hit_rate(0.02);
            }
        },
        ChaosKind::NodeFailure => {
            // Take component offline - load balancers detect and route around
            self.components[scenario.target].set_health(ComponentHealth::Down);
        },
        // ... 24 more scenarios
    }
}

The interesting design decision was making chaos composable. You can inject a network partition AND a memory pressure event simultaneously and watch the compounding failure. In production, incidents are rarely single-cause - this teaches engineers to think in terms of failure combinations.
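One way to get that composability, sketched here under my own assumptions rather than copied from the engine, is to have each effect touch its own piece of component state so that effects stack instead of overwriting each other. `ComponentState`, `ChaosEffect`, and `apply_all` are hypothetical names used only for this illustration.

```rust
// Hypothetical per-component state; field names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
struct ComponentState {
    reachable: bool,
    added_latency_ms: f64,
    capacity_factor: f64, // 1.0 = full capacity
}

#[derive(Debug, Clone, Copy)]
enum ChaosEffect {
    NetworkPartition,
    LatencyInjection { extra_ms: f64 },
    MemoryPressure { capacity_factor: f64 },
}

impl ComponentState {
    fn healthy() -> Self {
        ComponentState { reachable: true, added_latency_ms: 0.0, capacity_factor: 1.0 }
    }

    /// Each effect adjusts one field, so applying several compounds them.
    fn apply(&mut self, effect: ChaosEffect) {
        match effect {
            ChaosEffect::NetworkPartition => self.reachable = false,
            ChaosEffect::LatencyInjection { extra_ms } => self.added_latency_ms += extra_ms,
            ChaosEffect::MemoryPressure { capacity_factor } => {
                self.capacity_factor *= capacity_factor
            }
        }
    }
}

/// Apply a batch of effects in order.
fn apply_all(state: &mut ComponentState, effects: &[ChaosEffect]) {
    for &effect in effects {
        state.apply(effect);
    }
}

fn main() {
    let mut db = ComponentState::healthy();
    apply_all(
        &mut db,
        &[
            ChaosEffect::MemoryPressure { capacity_factor: 0.5 },
            ChaosEffect::LatencyInjection { extra_ms: 50.0 },
            ChaosEffect::NetworkPartition,
        ],
    );
    println!("{:?}", db);
}
```

The design choice worth noticing is that latency adds and capacity multiplies. Two latency injections stack, and memory pressure on top of an already degraded node shrinks what is left rather than resetting it.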

AWS cost estimation

This was the feature I was most uncertain about including, and it turned out to be one of the most useful.

Every component maps to an AWS service and a pricing model:

pub fn estimate_monthly_cost(&self, topology: &Topology, rps: f64) -> CostBreakdown {
    let mut compute = 0.0;
    let mut storage = 0.0;
    let mut networking = 0.0;
    let mut requests = 0.0;

    for component in &topology.components {
        match component.kind {
            ComponentKind::WebServer => {
                // EC2 t3.medium equivalent based on configured throughput
                let instance_count = (rps / component.throughput_limit).ceil();
                compute += instance_count * EC2_T3_MEDIUM_HOURLY * 730.0;
            },
            ComponentKind::Serverless => {
                // Lambda pricing: per-request + duration.
                // Uses the same 730-hour month as the EC2 and RDS lines.
                let monthly_requests = rps * 3600.0 * 730.0;
                requests += monthly_requests * LAMBDA_PER_REQUEST;
                requests += monthly_requests * (component.avg_duration_ms / 1000.0)
                    * LAMBDA_PER_GB_SECOND * component.memory_gb;
            },
            ComponentKind::Database => {
                // One RDS db.t3.medium instance; storage is billed separately per GB
                compute += RDS_T3_MEDIUM_HOURLY * 730.0;
                storage += component.storage_gb * RDS_STORAGE_PER_GB;
            },
            // ... other component types
        }
    }

    CostBreakdown { compute, storage, networking, requests }
}

The numbers are rough order-of-magnitude estimates, not exact billing. But "adding 3 more app servers costs approximately $280/month" is the right kind of answer to "why not just scale horizontally indefinitely?" - which is exactly the cost-awareness question that separates senior engineers from mid-level ones in system design interviews.
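The marginal-cost arithmetic behind that kind of answer is small enough to sketch on its own. The hourly rate and per-instance throughput below are placeholders I picked for the example, not AWS prices or the simulator's constants; only the 730-hour month and the ceil-based instance count come from the function above.

```rust
const HOURS_PER_MONTH: f64 = 730.0;

/// Instances needed to serve `rps` when each one handles `per_instance_rps`.
/// The ceil means cost rises in steps, not smoothly with traffic.
fn instances_needed(rps: f64, per_instance_rps: f64) -> f64 {
    (rps / per_instance_rps).ceil()
}

/// Monthly compute cost for a number of always-on instances.
fn monthly_compute_cost(instances: f64, hourly_rate: f64) -> f64 {
    instances * hourly_rate * HOURS_PER_MONTH
}

fn main() {
    let hourly_rate = 0.125; // HYPOTHETICAL rate in USD/hour, not a real AWS price
    let per_instance_rps = 2_500.0; // HYPOTHETICAL throughput limit per server

    let before = instances_needed(10_000.0, per_instance_rps);
    let after = instances_needed(17_000.0, per_instance_rps);
    let extra = after - before;

    println!(
        "{} -> {} instances, +${:.2}/month",
        before,
        after,
        monthly_compute_cost(extra, hourly_rate)
    );
}
```

Because of the ceil, going from 10,000 to 17,000 RPS in this sketch buys three whole extra servers even though the last one is only partly used. That step function is usually the interesting part of the cost conversation.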

What surprised me about WASM in production

The good:

Performance exceeded expectations. At 100,000 simulated RPS with 10+ components, the engine advances simulation time faster than wall clock time - there's headroom to spare.
Debugging is better than expected. wasm-pack test --chrome runs your Rust unit tests in an actual browser. Source maps work reasonably well with the right setup.
The memory model forced better design. Rust's ownership rules pushed me toward an architecture where simulation state is clearly separated from UI state. The resulting code is more correct.

The hard parts:

Serialisation overhead is real. Every call across the JS/WASM boundary that involves complex types goes through serialisation. Calling step() 60 times per second is fine. Passing large topology objects on every frame would not be.
Error handling across the boundary is awkward. Rust's Result<T, E> doesn't cross the boundary cleanly. I ended up encoding errors as optional fields in the return value rather than using WASM exceptions.
Bundle size management is ongoing. The WASM binary is currently ~280KB gzipped. Acceptable, but I'm tracking it.
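For the error-handling point, this is roughly the shape I mean by "errors as optional fields". It is a plain-Rust sketch with no wasm-bindgen attributes, and `StepMetrics`, `StepOutcome`, `step`, and `step_for_js` are hypothetical names with placeholder values, not the engine's real API.

```rust
#[derive(Debug, Clone, PartialEq)]
struct StepMetrics {
    p99_ms: f64,
    error_rate: f64,
}

/// Flat, boundary-friendly return value: exactly one of the two fields is Some.
#[derive(Debug, Clone, PartialEq)]
struct StepOutcome {
    metrics: Option<StepMetrics>,
    error: Option<String>,
}

impl From<Result<StepMetrics, String>> for StepOutcome {
    fn from(result: Result<StepMetrics, String>) -> Self {
        match result {
            Ok(metrics) => StepOutcome { metrics: Some(metrics), error: None },
            Err(message) => StepOutcome { metrics: None, error: Some(message) },
        }
    }
}

/// Internal API keeps idiomatic Result; the values returned are placeholders.
fn step(dt_ms: f64) -> Result<StepMetrics, String> {
    if !dt_ms.is_finite() || dt_ms <= 0.0 {
        return Err(format!("invalid time step: {}", dt_ms));
    }
    Ok(StepMetrics { p99_ms: 48.0, error_rate: 0.001 })
}

/// What the exported function would hand to JavaScript in this sketch.
fn step_for_js(dt_ms: f64) -> StepOutcome {
    step(dt_ms).into()
}

fn main() {
    println!("{:?}", step_for_js(16.0));
    println!("{:?}", step_for_js(-1.0));
}
```

The Rust side keeps using `Result` and `?` everywhere, and the conversion happens once at the edge. The JS caller checks `error` before reading `metrics`, which is clunkier than a thrown exception but keeps the return type the same on every call.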

Results — what the simulator shows that a whiteboard can't

When you inject a cache stampede on a 10,000 RPS e-commerce architecture, you see:

Cache hit rate drops from 98% → 2%
Database connections saturate within 400ms
App server queue depth climbs until requests start timing out
Error rate spikes from 0.1% → 34%
P99 latency goes from 48ms → 2,400ms

That sequence - and the ability to narrate exactly what happened and why - is what interviewers at FAANG are evaluating when they ask "what happens to your read traffic if the cache goes down?"
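The first two steps of that sequence can be checked on the back of an envelope. The sketch below uses the 10,000 RPS and the 98% and 2% hit rates from the list above; the 5 ms query time and the pool size of 20 are assumptions I made up for the example. It applies Little's law (connections in use = arrival rate times the time each query holds a connection) to show why the pool is the first thing to give.

```rust
/// Requests per second that miss the cache and reach the database.
fn origin_rps(total_rps: f64, hit_rate: f64) -> f64 {
    total_rps * (1.0 - hit_rate)
}

/// Little's law: average connections in use = arrival rate x hold time.
fn connections_in_use(db_rps: f64, query_ms: f64) -> f64 {
    db_rps * (query_ms / 1000.0)
}

fn main() {
    let total_rps = 10_000.0; // from the scenario above
    let query_ms = 5.0; // HYPOTHETICAL average query time
    let pool_size = 20.0; // HYPOTHETICAL connection pool size

    for (label, hit_rate) in [("normal", 0.98), ("stampede", 0.02)] {
        let db_rps = origin_rps(total_rps, hit_rate);
        let needed = connections_in_use(db_rps, query_ms);
        let verdict = if needed > pool_size { "pool saturates" } else { "pool copes" };
        println!(
            "{}: about {:.0} RPS reach the DB, about {:.0} connections needed, {}",
            label, db_rps, needed, verdict
        );
    }
}
```

Under these assumptions the database goes from roughly 200 to roughly 9,800 RPS, a 49x jump, and the steady-state connection demand goes from about 1 to about 49. Real behaviour is worse than this steady-state figure, because queries slow down under load and hold their connections longer.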

A static diagram cannot show you that. A simulator built on a proper DES engine can.

Try it

SysSimulator is free, runs in your browser, no account required.

57 architecture blueprints (e-commerce, chat, payment systems, Kafka pipelines, MCP AI agents), 28 chaos scenarios, real-time AWS cost estimation.

I'm considering open-sourcing the WASM simulation engine - leave a comment if that's interesting to you.

What questions do you have about the Rust/WASM approach? Specifically curious if others have tackled the serialisation overhead problem differently - would love to compare notes.

Built by Ronit Dahiya. LinkedIn | GitHub