<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Saumya Karnwal</title>
    <description>The latest articles on DEV Community by Saumya Karnwal (@saumya_karnwal).</description>
    <link>https://dev.to/saumya_karnwal</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3784885%2F3035e4a4-c758-4c47-b825-1a3097332ff2.jpeg</url>
      <title>DEV Community: Saumya Karnwal</title>
      <link>https://dev.to/saumya_karnwal</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/saumya_karnwal"/>
    <language>en</language>
    <item>
      <title>Line of Defense: Three Systems, Not One</title>
      <dc:creator>Saumya Karnwal</dc:creator>
      <pubDate>Sat, 28 Feb 2026 04:50:03 +0000</pubDate>
      <link>https://dev.to/saumya_karnwal/line-of-defense-three-systems-not-one-4h6o</link>
      <guid>https://dev.to/saumya_karnwal/line-of-defense-three-systems-not-one-4h6o</guid>
      <description>&lt;h2&gt;
  
  
  Three Systems, Not One
&lt;/h2&gt;

&lt;p&gt;"Rate limiting" gets used as a catch-all for anything that rejects or slows down requests. But there are actually three distinct mechanisms, each protecting against a different failure mode, each asking a different question:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mechanism&lt;/th&gt;
&lt;th&gt;Question it asks&lt;/th&gt;
&lt;th&gt;What it protects&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Load shedding&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"Is this server healthy enough to handle ANY request?"&lt;/td&gt;
&lt;td&gt;The server from itself&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Rate limiting&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"Is THIS CALLER sending too many requests?"&lt;/td&gt;
&lt;td&gt;The system from abusive callers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Adaptive throttling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;"Is the DOWNSTREAM struggling right now?"&lt;/td&gt;
&lt;td&gt;Downstream services from this server&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A rate limiter won't save you when your server is OOM-ing — every user is within their quota, the server is just dying. Load shedding won't stop one customer from consuming 80% of your capacity — total concurrency is fine, the distribution is unfair. And neither will prevent you from hammering a downstream service that's already struggling.&lt;/p&gt;

&lt;p&gt;These are complementary systems. Treating them as one thing — or building only one of the three — leaves gaps that show up exactly when you need protection most.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three Layers
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2vt0kl8gdl5rnw9n3pua.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2vt0kl8gdl5rnw9n3pua.png" alt=" " width="800" height="701"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each layer asks a different question. Each protects a different thing.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Layer 1 — Load Shedding&lt;/strong&gt; protects this server from itself. Is memory pressure too high? Are there too many concurrent requests? Did a downstream just return RESOURCE_EXHAUSTED? If any of these are true, reject immediately — doesn't matter who the user is, doesn't matter what the request is. The building is at capacity.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Layer 2 — Rate Limiting&lt;/strong&gt; protects the system from abusive users. Is this specific user, API key, or IP address sending more than their allowed share? This is the classic rate limiter — per-user counters, sliding windows, token buckets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Layer 3 — Adaptive Throttling&lt;/strong&gt; protects downstream services from this server. The server tracks its success rate when calling each downstream. If 20% of calls to the payment service are failing, it starts probabilistically dropping 20% of outbound calls — giving the payment service breathing room to recover.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
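&lt;p&gt;The Layer 3 behavior can be sketched in a few lines. This is an illustrative in-process model, not a production client: the class name and window size are invented here, and a real implementation would track outcomes per downstream and decay old data.&lt;/p&gt;

```python
import random
from collections import deque

class AdaptiveThrottle:
    """Track recent outcomes of calls to one downstream; before each new
    outbound call, drop it with probability equal to the observed failure
    rate, giving the downstream room to recover."""

    def __init__(self, window=100, rng=random.random):
        self.outcomes = deque(maxlen=window)  # True = success, False = failure
        self.rng = rng  # injectable for testing

    def failure_rate(self):
        if not self.outcomes:
            return 0.0  # no data yet: send everything
        return 1 - sum(self.outcomes) / len(self.outcomes)

    def should_send(self):
        # If 20% of recent calls failed, skip roughly 20% of new calls.
        return self.rng() >= self.failure_rate()

    def record(self, success):
        self.outcomes.append(success)
```

&lt;p&gt;If 20 of the last 100 calls to the payment service failed, &lt;code&gt;should_send()&lt;/code&gt; returns false about 20% of the time.&lt;/p&gt;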

&lt;h2&gt;
  
  
  Why the Order Matters
&lt;/h2&gt;

&lt;p&gt;Layer 1 runs at the highest priority — before authentication, before request parsing, before anything. Here's why:&lt;/p&gt;

&lt;p&gt;If rate limiting (Layer 2) runs first, the server spends CPU checking Redis counters, computing sliding window math, and doing per-user lookups. Then it reaches Layer 1, which says "actually, the server is dying, reject everything." All that rate-limit computation was wasted on a request you were going to reject anyway.&lt;/p&gt;

&lt;p&gt;Load shedding is cheap — one atomic counter check or one GC flag read. It takes microseconds. Rate limiting might require a Redis round-trip. Run the cheap check first.&lt;/p&gt;

&lt;p&gt;Think of it like a nightclub. The fire marshal at the door (load shedding) doesn't check your ID. "Building is at capacity. Nobody gets in." Only if the building isn't full does the bouncer (rate limiter) check your guest list.&lt;/p&gt;
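&lt;p&gt;The ordering argument can be made concrete. In this sketch (all names are illustrative, and the per-user limiter is an in-memory stand-in for a Redis-backed one), the load-shedding check is a single local comparison and runs before the rate limiter, which may cost a network round-trip.&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class Server:
    in_flight: int       # requests currently being processed
    max_concurrent: int  # Layer 1 concurrency ceiling

class SimpleLimiter:
    """Stand-in per-user limiter; a real one would live in Redis."""
    def __init__(self, limit):
        self.limit = limit
        self.counts = {}

    def allow(self, user):
        count = self.counts.get(user, 0)
        if count >= self.limit:
            return False
        self.counts[user] = count + 1
        return True

def admit(user, server, limiter):
    # Layer 1 -- load shedding: one local comparison, microseconds.
    # Runs first so a dying server never pays for the rate-limit lookup.
    if server.in_flight >= server.max_concurrent:
        return "503 shed"
    # Layer 2 -- rate limiting: potentially a Redis round-trip.
    if not limiter.allow(user):
        return "429 rate limited"
    return "accept"
```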

&lt;h2&gt;
  
  
  What Each Layer Catches That The Others Miss
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A bad deployment causes OOM.&lt;/strong&gt; Your new ML model eats 3x the expected memory. Layer 1 sees GC pressure spike and starts shedding. Layer 2 is blind — every user is within their rate limit. Layer 3 is blind — the downstream is fine. Without load shedding, you're relying on Kubernetes to OOM-kill the pod and restart it, which takes 30-60 seconds of full outage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One customer sends 10x their normal traffic.&lt;/strong&gt; A migration script gone wrong. Layer 2 catches it immediately — their per-user counter crosses the threshold. Layer 1 might eventually catch it (if the extra traffic pushes overall concurrency past the limit), but it can't distinguish "one bad user" from "legitimate traffic spike." Layer 3 is blind — the downstream doesn't know which user caused the load.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A downstream payment service enters degraded state.&lt;/strong&gt; It accepts 60% of requests, returns RESOURCE_EXHAUSTED on the rest. Layer 3 sees the failure rate climb and starts probabilistically dropping outbound calls — giving the payment service room to breathe. Layer 1 catches the RESOURCE_EXHAUSTED responses and triggers a reactive backoff. Layer 2 is completely blind — users are within their limits, the problem is downstream.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A DDoS hits your API.&lt;/strong&gt; Thousands of IPs, each sending moderate traffic. Layer 1 catches it (total concurrency spikes). Layer 2 catches it (per-IP limits hit). Layer 3 is blind — this is an inbound problem, not outbound. Both layers contribute, but neither alone is sufficient — the DDoS might stay under per-IP limits while overwhelming total capacity, or it might come from one IP but stay under concurrency limits.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A slow dependency causes thread pool exhaustion.&lt;/strong&gt; A database query that usually takes 5ms starts taking 2 seconds. Threads pile up waiting. Layer 1 sees concurrent request count spike toward the limit. Layer 3 would catch it if the dependency returned errors, but slow responses aren't errors — the threads just accumulate silently. Layer 2 is blind. This is the scenario where load shedding saves you — it's the only layer watching the server's actual resource consumption.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No single layer handles everything. That's the point. They're complementary, not redundant.&lt;/p&gt;

&lt;p&gt;If Layer 2 has a bug or Redis goes down, Layer 1 still protects the server from overload. If Layer 1's threshold is set too high, Layer 2 still limits abusive users. If both fail, Layer 3 at least prevents a cascade into downstream services.&lt;/p&gt;

&lt;p&gt;Defense in depth. Not defense in one.&lt;/p&gt;

&lt;h2&gt;
  
  
  Layer 2 Has Two Personalities: Reject or Delay
&lt;/h2&gt;

&lt;p&gt;Rate limiting (Layer 2) isn't one tool — it's two, with opposite behaviors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rejection&lt;/strong&gt; says "no." The request is over the limit. Return 429. The caller deals with it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Delay&lt;/strong&gt; says "wait." The request is over the limit, but instead of rejecting it, hold it in a queue and release it when the rate allows. The caller doesn't even know it was throttled — just that the response was a bit slow.&lt;/p&gt;

&lt;p&gt;Same goal (enforce a rate), completely different experience for the caller.&lt;/p&gt;

&lt;p&gt;The question is: when do you reject, and when do you delay?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reject when someone external is holding a connection.&lt;/strong&gt; A user called your API. Their HTTP connection is open. If you delay them, you're holding that connection — which means a thread, a socket, memory. Delay 500 users and you've exhausted your connection pool. Now legitimate users who are &lt;em&gt;under&lt;/em&gt; the limit can't get a connection. Your rate limiter just caused an outage for good users by being too nice to bad ones. Reject fast. Free the connection. Let the client's retry logic handle it.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Delay when your own system needs the request to succeed.&lt;/strong&gt; You're calling Stripe's payment API. You know their limit: 100 requests per second. The 101st request doesn't need to fail — it just needs to wait 10 milliseconds for the next second's budget. If you reject it instead, you need retry logic, backoff timers, dead letter queues, monitoring for the retries — an entire infrastructure to handle a problem that "just wait" solves.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Five scenarios to build the intuition:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Your public API gets a burst from a customer.&lt;/strong&gt; Reject. Return 429 instantly. The customer's SDK has built-in retry with exponential backoff. Your server processed the rejection in microseconds and moved on. If you delayed instead, 500 connections held open, connection pool starved, outage for everyone.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;You're sending 50,000 marketing emails through SendGrid.&lt;/strong&gt; Delay. SendGrid allows 500/sec. Queue all 50,000, drip them at 500/sec. Takes 100 seconds, every email delivered. If you rejected instead, 49,500 emails bounced in the first second. Now you need a dead letter queue and retry scheduling for a problem that "wait your turn" solves completely.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Your gRPC server receives internal traffic from an upstream service.&lt;/strong&gt; Reject. Return RESOURCE_EXHAUSTED. The upstream's adaptive throttler (Layer 3 on their side) sees the error and automatically backs off. The system self-heals. If you delayed instead, the upstream's gRPC deadline expires while its request sits in your queue. Timeout errors are worse than clean rejections — the upstream can't tell "server is slow" from "I'm being rate limited."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A batch job scrapes 10,000 records from a partner API nightly.&lt;/strong&gt; Delay. Partner allows 50 req/sec. Pace it perfectly — 3.3 minutes, all requests succeed, partner never sees a spike. If you rejected instead, 9,950 requests fail immediately, retry logic fires, you hammer the partner for 20 minutes instead of a clean 3-minute crawl.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A user calls your payment endpoint during checkout.&lt;/strong&gt; Reject. The user is staring at a button that says "Pay Now." A 200ms rejection with a "please try again" message is infinitely better than a 5-second delay where they think the page froze, hit refresh, and trigger a duplicate payment.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The rule is simple: &lt;strong&gt;reject when someone is waiting for the connection. Delay when you can afford to wait.&lt;/strong&gt;&lt;/p&gt;
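&lt;p&gt;The "delay" personality is worth sketching, since it is less familiar than returning 429. This hypothetical pacer hands each caller a wait time instead of a verdict: at 500 requests per second, a burst of requests is told to sleep 2&amp;nbsp;ms, 4&amp;nbsp;ms, and so on, rather than to fail. The clock is injectable so the logic is testable.&lt;/p&gt;

```python
import time

class Pacer:
    """Delay-style limiter for outbound calls: instead of rejecting the
    request over the limit, compute how long to wait for the next slot."""
    def __init__(self, rate_per_sec, now=time.monotonic):
        self.interval = 1.0 / rate_per_sec  # seconds between slots
        self.now = now
        self.next_slot = now()

    def reserve(self):
        """Return how many seconds the caller should sleep before sending."""
        t = self.now()
        wait = max(0.0, self.next_slot - t)
        # Claim the slot and push the next one further out.
        self.next_slot = max(t, self.next_slot) + self.interval
        return wait
```

&lt;p&gt;A worker drains a queue with &lt;code&gt;time.sleep(pacer.reserve())&lt;/code&gt; before each send — that is the whole SendGrid drip from the scenario above.&lt;/p&gt;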

</description>
      <category>systemdesign</category>
      <category>distributedsystems</category>
      <category>ratelimiter</category>
    </item>
    <item>
      <title>Distributed Rate Limiting — Five Problems That Break Your Counters</title>
      <dc:creator>Saumya Karnwal</dc:creator>
      <pubDate>Fri, 27 Feb 2026 14:17:51 +0000</pubDate>
      <link>https://dev.to/saumya_karnwal/distributed-rate-limiting-five-problems-that-break-your-counters-454</link>
      <guid>https://dev.to/saumya_karnwal/distributed-rate-limiting-five-problems-that-break-your-counters-454</guid>
      <description>&lt;h2&gt;
  
  
  Why Local Rate Limiting Breaks
&lt;/h2&gt;

&lt;p&gt;A rate limiter on a single server works exactly as advertised. But most production systems aren't a single server — they're 10, 50, or 200 instances behind a load balancer. And that changes the math.&lt;/p&gt;

&lt;p&gt;If your limit is 100 requests per minute and you have 50 instances, the load balancer sprays traffic round-robin. Each instance sees ~2 requests per minute from any given user. Every instance says "well under the limit." Nobody rejects anything. The user sends 3,000 requests. All pass.&lt;/p&gt;

&lt;p&gt;Your per-instance rate limiter silently became a &lt;code&gt;limit × num_instances&lt;/code&gt; rate limiter. You didn't change the code. You changed the deployment.&lt;/p&gt;

&lt;p&gt;The fix is shared state — usually Redis. All instances read and write to the same counter, so the global count is enforced globally. But the moment you introduce shared state over a network, five new problems appear.&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem 1: The Race You Can't See
&lt;/h2&gt;

&lt;p&gt;Two requests from the same user arrive at the same millisecond, hitting two different instances. Both call Redis.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Instance A                         Instance B
──────────                         ──────────
GET counter → 99                   GET counter → 99
99 &amp;lt; 100? YES                      99 &amp;lt; 100? YES
SET counter → 100                  SET counter → 100

Both allowed. Real count: 101. Limit breached.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is TOCTOU — time-of-check-time-of-use. You read, decided, then wrote. But someone else read the same value in the gap between your read and your write.&lt;/p&gt;

&lt;p&gt;The fix sounds simple: use &lt;code&gt;INCR&lt;/code&gt; instead of &lt;code&gt;GET&lt;/code&gt; + &lt;code&gt;SET&lt;/code&gt;. Redis &lt;code&gt;INCR&lt;/code&gt; atomically increments and returns the new value. No gap.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Instance A                         Instance B
──────────                         ──────────
INCR counter → 100                 INCR counter → 101
100 ≤ 100? ALLOW                   101 &amp;gt; 100? REJECT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But what about more complex algorithms — sliding window, token bucket — where you need to read a value, do math, then conditionally write? You can't do that with one &lt;code&gt;INCR&lt;/code&gt;. You need Redis Lua scripts. A Lua script runs atomically on Redis's single thread — no other command can interleave.&lt;/p&gt;

&lt;p&gt;One network round-trip. One atomic operation. No race.&lt;/p&gt;

&lt;p&gt;The alternative is &lt;code&gt;WATCH&lt;/code&gt;/&lt;code&gt;MULTI&lt;/code&gt;/&lt;code&gt;EXEC&lt;/code&gt; — Redis's optimistic locking. You &lt;code&gt;WATCH&lt;/code&gt; a key, read it, start a &lt;code&gt;MULTI&lt;/code&gt; transaction, write your changes, and &lt;code&gt;EXEC&lt;/code&gt;. If anyone modified the watched key between your &lt;code&gt;WATCH&lt;/code&gt; and &lt;code&gt;EXEC&lt;/code&gt;, the transaction aborts and you retry. It's compare-and-swap over the network. More flexible than Lua, but slower under contention because of retries.&lt;/p&gt;
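&lt;p&gt;The &lt;code&gt;INCR&lt;/code&gt; fix can be modeled in-process. This toy stand-in uses a lock where Redis uses its single thread, but the shape is the same: increment-and-read is one indivisible step, and the allow/reject decision is made from the value that step returned, not from a separate read.&lt;/p&gt;

```python
import threading

class AtomicCounter:
    """In-process stand-in for Redis INCR: increment-and-return is one
    indivisible step, so there is no gap between check and update."""
    def __init__(self):
        self._lock = threading.Lock()
        self._count = 0

    def incr(self):
        with self._lock:
            self._count += 1
            return self._count

def allow(counter, limit):
    # Decide from the value incr() returned -- no separate GET, no TOCTOU.
    return counter.incr() <= limit
```

&lt;p&gt;Hammer it from several threads and exactly &lt;code&gt;limit&lt;/code&gt; requests pass, never &lt;code&gt;limit + 1&lt;/code&gt; — the property the GET-then-SET version cannot guarantee.&lt;/p&gt;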

&lt;h2&gt;
  
  
  Problem 2: Redis Dies. Now What?
&lt;/h2&gt;

&lt;p&gt;Redis is down. Or the network between your service and Redis is partitioned. Every rate limit check fails. You have three options, and none of them are good.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fail open:&lt;/strong&gt; Allow all requests. Your system stays up, but you have no rate limiting. If Redis went down because of load, you just removed the only thing protecting you from more load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fail closed:&lt;/strong&gt; Reject all requests. Congratulations, your rate limiter just became a denial-of-service attack on your own users.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fall back to local:&lt;/strong&gt; Switch to per-instance in-memory counters with &lt;code&gt;global_limit / num_instances&lt;/code&gt; as the local limit. Inaccurate, but bounded.&lt;/p&gt;

&lt;p&gt;Option three is what most production systems do. But the transition is tricky. When Redis comes back, do you trust the Redis counter (which is stale) or the local counter (which is approximate)? Most teams reset the Redis counter on recovery and accept a brief window of inaccuracy.&lt;/p&gt;

&lt;p&gt;The deeper issue: &lt;strong&gt;how fast do you detect the failure?&lt;/strong&gt; If your Redis timeout is 500ms, every rate-limited request adds 500ms of latency while you wait to find out Redis is dead. You need a circuit breaker — after N consecutive timeouts, stop asking Redis for a cooldown period. Go straight to local. Check again in 10 seconds.&lt;/p&gt;
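&lt;p&gt;A minimal circuit breaker for the Redis path might look like this. Names and thresholds are illustrative, and the clock is injectable so the cooldown logic is testable.&lt;/p&gt;

```python
import time

class CircuitBreaker:
    """After `threshold` consecutive Redis timeouts, stop asking Redis for
    `cooldown_secs` so callers fall straight back to local counters."""
    def __init__(self, threshold, cooldown_secs, now=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown_secs
        self.now = now
        self.failures = 0
        self.open_until = 0.0

    def available(self):
        # True while the circuit is closed: it is OK to try Redis.
        return self.now() >= self.open_until

    def record_success(self):
        self.failures = 0

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.threshold:
            # Trip: skip Redis entirely until the cooldown expires.
            self.open_until = self.now() + self.cooldown
            self.failures = 0
```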

&lt;h2&gt;
  
  
  Problem 3: Your Servers Disagree About What Time It Is
&lt;/h2&gt;

&lt;p&gt;Window-based algorithms need to answer "which window does this request belong to?" That requires knowing what time it is. Across 50 servers, even with NTP, clocks drift by 10-50ms.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Server A:  10:00:00.000  (on time)
Server B:  10:00:00.150  (150ms ahead)
Server C:  09:59:59.900  (100ms behind)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At 10:00:00 — the window boundary — Server C thinks it's still in the old window. Server B thinks the new window started 150ms ago. They compute different bucket IDs and increment different Redis keys.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Server C:  INCR rate_limit:user123:window_599   ← old window
Server A:  INCR rate_limit:user123:window_600   ← new window
Server B:  INCR rate_limit:user123:window_600   ← new window

Requests at the boundary split across two keys.
Neither hits the limit. Both pass.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For a 60-second window, 150ms of skew is 0.25% — noise. For a 1-second window, it's 15% — a real problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fixes:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use Redis's clock.&lt;/strong&gt; Let the Lua script call &lt;code&gt;redis.call('TIME')&lt;/code&gt; to determine the current window. One clock, one truth. Adds no extra round-trip since you're already in a Lua script.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use large windows.&lt;/strong&gt; If your window is 60 seconds, clock skew doesn't matter. If you need sub-second precision, you need to solve clock sync first.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Use token bucket.&lt;/strong&gt; No windows, no boundaries, no clock skew problem. The refill math is based on elapsed time (&lt;code&gt;now - last_refill&lt;/code&gt;), and small drift in "now" produces proportionally small drift in tokens. A 50ms clock difference on a 1-token-per-second refill rate means 0.05 tokens of error.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
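&lt;p&gt;A token bucket sketch makes the elapsed-time argument visible: there are no window IDs anywhere in the code, so there is no boundary for skew to split. Names are illustrative; the clock is injectable for testing.&lt;/p&gt;

```python
import time

class TokenBucket:
    """Token bucket: refill is proportional to elapsed time, so small clock
    drift produces proportionally small token error -- no window boundaries."""
    def __init__(self, rate, capacity, now=time.monotonic):
        self.rate = float(rate)          # tokens added per second
        self.capacity = float(capacity)  # burst ceiling
        self.tokens = float(capacity)
        self.now = now
        self.last_refill = now()

    def allow(self):
        t = self.now()
        # Refill based on time elapsed since the last check, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (t - self.last_refill) * self.rate)
        self.last_refill = t
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```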

&lt;h2&gt;
  
  
  Problem 4: One User Melts Your Redis
&lt;/h2&gt;

&lt;p&gt;Per-user rate limiting means one Redis key per user. Most users generate 10 requests per minute. Then one user — or one bot — sends 50,000. Every request hits the same Redis key.&lt;/p&gt;

&lt;p&gt;Redis is single-threaded. A hot key means one user's traffic is serialized through one CPU core, and if that core is saturated, ALL other Redis operations on that shard slow down. One abusive user degrades rate limiting for everyone.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Normal:   user:12345  →  10 INCR/min      (invisible)
Abusive:  user:99999  →  50,000 INCR/min  (hot key, one CPU core)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The layered fix:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reject early.&lt;/strong&gt; If a user is 10x over the limit, you know the answer without asking Redis. Keep a local approximate counter. If local count &amp;gt;&amp;gt; limit, reject immediately. Only call Redis when the count is near the threshold — the boundary where you actually need distributed accuracy.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Batch concurrent checks.&lt;/strong&gt; If 200 requests from the same user arrive in the same millisecond, don't make 200 Redis calls. Batch them: one Redis call for the batch, then distribute the result to all 200 waiters locally. This is what production throttlers do — one network round-trip per batch, not per request.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Shard the key.&lt;/strong&gt; Split &lt;code&gt;user:99999&lt;/code&gt; into &lt;code&gt;user:99999:0&lt;/code&gt;, &lt;code&gt;user:99999:1&lt;/code&gt;, ..., &lt;code&gt;user:99999:7&lt;/code&gt;. Each instance writes to a random shard. To check the total, sum all shards. You trade perfect accuracy for throughput — the sum might be slightly stale.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
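&lt;p&gt;Step 1 can be sketched as a local pre-filter. This is an approximation by design: clearly-under traffic and absurdly-over traffic are decided locally, and only the band near the threshold pays for a Redis round-trip. The class name and thresholds are invented for illustration.&lt;/p&gt;

```python
class LocalPrefilter:
    """Approximate local counter in front of Redis. Obvious decisions are
    made locally; only counts near the limit pay for a network call."""
    def __init__(self, limit, slack=2.0):
        self.limit = limit
        self.hard_cap = int(limit * slack)  # so far over that Redis is unnecessary
        self.local = {}  # user -> approximate local count

    def check(self, user):
        count = self.local.get(user, 0) + 1
        self.local[user] = count
        if count > self.hard_cap:
            return "reject-local"   # way over: the answer is known locally
        if count > self.limit // 2:
            return "ask-redis"      # near the boundary: need distributed accuracy
        return "allow-local"        # clearly under: skip the round-trip
```

&lt;p&gt;A real version would reset or decay the local counts each window; the point is that the hot key only reaches Redis for the fraction of traffic where the distributed count actually matters.&lt;/p&gt;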

&lt;h2&gt;
  
  
  Problem 5: Three Regions, Three Redis Instances, One Limit
&lt;/h2&gt;

&lt;p&gt;You deploy in US-East, US-West, and EU-West. Each region has its own Redis. A user with a global limit of 1,000/min sends 400 requests to each region.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;US-East Redis:  user:123 → 400  (under 1000, allow)
US-West Redis:  user:123 → 400  (under 1000, allow)
EU-West Redis:  user:123 → 400  (under 1000, allow)

Total: 1,200 allowed. Limit is 1,000.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Nobody noticed because no single Redis saw more than 400.&lt;/p&gt;

&lt;p&gt;Your options, ranked by pragmatism:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Split the quota.&lt;/strong&gt; Give each region &lt;code&gt;1000 / 3 = 333&lt;/code&gt;. Simple, but a user who only uses US-East gets 333 instead of 1,000. You're penalizing them for your architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Over-provision the limit.&lt;/strong&gt; Give each region a quota above the fair split, say 400, and accept that the real effective limit lands somewhere between 400 (a single-region user) and 1,200 (traffic spread evenly across all three). For most use cases, "roughly 1,000" is good enough.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Single global Redis.&lt;/strong&gt; All regions talk to one Redis in US-East. Accurate, but US-West adds 60-80ms and EU adds 100-150ms per request. For a rate limit check that should take 1ms, that's a 100x latency penalty.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-region sync.&lt;/strong&gt; Each region publishes its local count every second. Each region subscribes to the others. You get eventual consistency with a 1-2 second window of inaccuracy. Complex to build, complex to debug, and you still have the "what if the sync is down" problem.&lt;/p&gt;

&lt;p&gt;Most teams pick option one or two. The engineering cost of options three and four is almost never justified by the accuracy gain. Rate limiting is about protection, not precision — being off by 20% is fine if it still prevents abuse.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Meta-Problem
&lt;/h2&gt;

&lt;p&gt;These five problems share a root cause: &lt;strong&gt;rate limiting is global state enforced locally.&lt;/strong&gt; Every instance needs to know the global count, but global knowledge has a cost — latency (network hops), availability (what if the store is down), and consistency (what if two instances disagree).&lt;/p&gt;

&lt;p&gt;You can't have all three. Pick two:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Accurate + Available:&lt;/strong&gt; Central store with local fallback (most common)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accurate + Fast:&lt;/strong&gt; Single instance, no distribution (doesn't scale)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fast + Available:&lt;/strong&gt; Local-only with periodic sync (inaccurate)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every production rate limiter is a choice on this triangle. Understanding which trade-off your system made — and which failure mode it accepted — is the difference between "we have rate limiting" and "our rate limiting actually works."&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>distributedsystems</category>
      <category>ratelimiter</category>
    </item>
    <item>
      <title>Five Ways to Say "Slow Down" — A Field Guide to Rate Limiting Algorithms</title>
      <dc:creator>Saumya Karnwal</dc:creator>
      <pubDate>Thu, 26 Feb 2026 18:25:05 +0000</pubDate>
      <link>https://dev.to/saumya_karnwal/five-ways-to-say-slow-down-a-field-guide-to-rate-limiting-algorithms-g96</link>
      <guid>https://dev.to/saumya_karnwal/five-ways-to-say-slow-down-a-field-guide-to-rate-limiting-algorithms-g96</guid>
      <description>&lt;h2&gt;
  
  
  What Is Rate Limiting?
&lt;/h2&gt;

&lt;p&gt;Rate limiting is a rule: &lt;em&gt;no more than N requests in T time.&lt;/em&gt; 100 API calls per minute. 5 login attempts per 15 minutes. 1,000 database writes per second.&lt;/p&gt;

&lt;p&gt;At its core, it's a counter with a clock. A request comes in, you check the count, and you either let it through or reject it. That's the whole idea.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why You Need It
&lt;/h2&gt;

&lt;p&gt;Every system has a capacity — the load it can handle before performance falls off a cliff. Not a gentle slope. A cliff.&lt;/p&gt;

&lt;p&gt;A database tuned for 500 writes/sec doesn't get 1% slower at 501. It stays fine at 550, maybe a bit sluggish at 580, and then at 600 the query queue backs up, the connection pool exhausts, and latency goes from 50ms to 8 seconds in under a minute. Recovery takes even longer because the backed-up requests are still draining.&lt;/p&gt;

&lt;p&gt;Without rate limiting, three things go wrong:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. One bad actor takes down everyone.&lt;/strong&gt; A single misconfigured client retrying in a tight loop can saturate your service. The other 10,000 well-behaved clients suffer equally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Cascading failures amplify the damage.&lt;/strong&gt; When Service A slows down, Service B (which calls A) starts timing out. B's callers retry. A 20% overload on one service becomes a 300% overload on three.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Recovery becomes harder than survival.&lt;/strong&gt; Even after the spike passes, the queue of backed-up requests keeps the system pinned. Without a way to shed load, you can stay degraded for minutes after the cause is gone.&lt;/p&gt;

&lt;p&gt;Rate limiting is the difference between "gracefully reject 5% of traffic during a spike" and "return errors to 100% of traffic for 10 minutes."&lt;/p&gt;

&lt;h2&gt;
  
  
  Five Algorithms, Five Trade-offs
&lt;/h2&gt;

&lt;p&gt;But "rate limiting" isn't one algorithm. It's five, each with a different trade-off between accuracy, memory, burst tolerance, and implementation complexity. Each one is introduced below with how it works, what you give up, and when it's the right pick.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Fixed Window Counter
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt; Divide time into fixed intervals (e.g., 1-minute windows). Keep a counter per window. Increment on each request. If the counter exceeds the limit, reject. When the window ends, reset to zero.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F56yu6nomyje0j6559akg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F56yu6nomyje0j6559akg.png" alt=" " width="800" height="448"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The trade-off:&lt;/strong&gt; You get simplicity and near-zero memory (~16 bytes per key). You give up accuracy at window boundaries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Minute 1:                        Minute 2:
..............90 reqs at :59  |  90 reqs at :00..............
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both windows say "under 100, allowed." But in a 2-second real window spanning the boundary? 180 requests — nearly 2x your limit. In the worst case, a client can get double the allowed rate by timing requests around the boundary.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where to use it:&lt;/strong&gt; When the limit is a rough safety net, not a precise guarantee. Login throttling ("5 attempts per 15 minutes") is the classic case — even if someone exploits the boundary to squeeze out 10 attempts, that's still worthless for brute-force. API key quotas where "close enough" is fine. Anywhere you need something working in an hour, not a week.&lt;/p&gt;
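&lt;p&gt;A fixed window counter fits in a dozen lines, which is most of its appeal. A minimal in-memory, single-process sketch with illustrative names (a production version keeps the counter in Redis and expires old windows):&lt;/p&gt;

```python
import time

class FixedWindowLimiter:
    """Fixed window: one counter per (key, window) pair, reset at boundaries."""
    def __init__(self, limit, window_secs, now=time.time):
        self.limit = limit
        self.window = window_secs
        self.now = now
        self.counts = {}  # (key, window_id) -> count

    def allow(self, key):
        window_id = int(self.now() // self.window)  # which window are we in?
        bucket = (key, window_id)
        count = self.counts.get(bucket, 0)
        if count >= self.limit:
            return False
        self.counts[bucket] = count + 1
        return True
```

&lt;p&gt;The boundary weakness is visible in the code: a new &lt;code&gt;window_id&lt;/code&gt; means a fresh counter, no matter how much traffic arrived seconds earlier.&lt;/p&gt;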

&lt;h3&gt;
  
  
  2. Sliding Window Log
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt; Instead of counting per window, store the exact timestamp of every request. When a new request arrives, evict all timestamps older than the window duration, then count what's left.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fixul9tivky55lpnlojc9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fixul9tivky55lpnlojc9.png" alt=" " width="800" height="574"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Perfectly accurate. Zero boundary tricks. The window slides smoothly with every request.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The trade-off:&lt;/strong&gt; You get perfect precision. You give up memory. You're storing up to &lt;code&gt;limit&lt;/code&gt; timestamps per key — typically in a Redis sorted set. At 10,000 req/min across a million users, that's up to 10 billion timestamps. At 8 bytes each, that's ~80 GB of Redis just for rate limiting state.&lt;/p&gt;

&lt;p&gt;If you have 2,000 users making 50 requests/day each? The sorted set holds 50 entries per key. Total memory: negligible. But scale that to millions of keys and the math stops working.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where to use it:&lt;/strong&gt; When precision matters more than scale. Financial transaction limits where regulatory compliance demands &lt;em&gt;exact&lt;/em&gt; counting — the auditor doesn't care about "99.7% accurate." Database write protection where exceeding the threshold causes corruption, not just slowness. Low-volume, high-stakes APIs where "off by one" has real consequences. The key constraint: either the limit is small, or the user count is small. Ideally both.&lt;/p&gt;
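&lt;p&gt;A minimal in-memory sketch of the log (a production version would use a Redis sorted set scored by timestamp): evict, count, decide.&lt;/p&gt;

```python
import time
from collections import deque

class SlidingWindowLog:
    """Sliding window log: store every timestamp, evict the old, count the rest."""
    def __init__(self, limit, window_secs, now=time.monotonic):
        self.limit = limit
        self.window = window_secs
        self.now = now
        self.log = deque()  # request timestamps, oldest first

    def allow(self):
        t = self.now()
        # Evict everything that has aged out of the sliding window.
        while self.log and self.log[0] <= t - self.window:
            self.log.popleft()
        if len(self.log) < self.limit:
            self.log.append(t)
            return True
        return False
```

&lt;p&gt;The memory cost is equally visible: &lt;code&gt;self.log&lt;/code&gt; holds up to &lt;code&gt;limit&lt;/code&gt; timestamps per key, which is exactly what stops scaling at millions of keys.&lt;/p&gt;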

&lt;h3&gt;
  
  
  3. Sliding Window Counter
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt; A compromise between the first two. Keep counters for the current window and the previous window. Estimate the sliding total using weighted math based on how far into the current window you are.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpqfncd6y2a38gpur0woe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpqfncd6y2a38gpur0woe.png" alt=" " width="800" height="466"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Limit: 5/min    Current time: 1:18 (18 sec into window)

Previous window [0:00-1:00]:  5 requests
Current window  [1:00-2:00]:  3 requests

Weighted total = 5 × (42/60) + 3
               = 3.5 + 3
               = 6.5  (&gt; 5)  →  REJECT
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two counters per key. ~32 bytes of memory. Not sorted sets of timestamps — two integers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The trade-off:&lt;/strong&gt; You get near-perfect accuracy at minimal memory cost. You give up a small margin of error. Cloudflare measured 99.7% accuracy against a perfect sliding window. The 0.3% error comes from assuming requests were uniformly distributed in the previous window. In practice, your measurement noise is already larger than that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where to use it:&lt;/strong&gt; When you need accuracy at scale. Millions of keys, limited memory, and you can tolerate a rounding error smaller than your measurement noise. This is the default choice for most production rate limiters. If you're not sure which algorithm to pick, start here.&lt;/p&gt;

&lt;p&gt;Cloudflare uses this. AWS API Gateway uses this. Most "rate limit by API key" implementations in production use some variant of this.&lt;/p&gt;
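&lt;p&gt;The weighted estimate is a few lines of arithmetic. A single-process sketch, assuming in-memory state (a production version would do the read-modify-write atomically in Redis):&lt;/p&gt;

```python
class SlidingWindowCounter:
    """Two counters per key; estimate the sliding total with a weighted sum."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counts = {}  # key -> [window_index, prev_count, curr_count]

    def allow(self, key, now):
        idx = int(now // self.window)        # index of the current fixed window
        state = self.counts.setdefault(key, [idx, 0, 0])
        if idx == state[0] + 1:              # rolled into the next window
            state[:] = [idx, state[2], 0]
        elif idx > state[0] + 1:             # idle for a full window: reset both
            state[:] = [idx, 0, 0]
        # Fraction of the previous window still inside the sliding window.
        overlap = 1.0 - (now % self.window) / self.window
        estimated = state[1] * overlap + state[2]
        if estimated >= self.limit:
            return False
        state[2] += 1
        return True
```

&lt;p&gt;With the numbers from the example (5 in the previous window, 3 so far, 18 seconds in), the estimate is 5 × 0.7 + 3 = 6.5, over the limit of 5, so the request is rejected.&lt;/p&gt;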

&lt;h3&gt;
  
  
  4. Token Bucket
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt; A different mental model. Instead of counting requests in a time window, imagine a bucket that fills with tokens at a steady rate. Each request consumes one token. If the bucket is empty, the request is rejected. The bucket has a maximum capacity, so tokens don't accumulate forever.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxb7cg26zc0k55khupkti.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxb7cg26zc0k55khupkti.png" alt=" " width="800" height="787"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two parameters: the refill rate (long-term average) and the bucket capacity (maximum burst size).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The trade-off:&lt;/strong&gt; You get burst tolerance — short spikes are absorbed as long as the long-term average stays within limits. You give up strict per-window guarantees. A user can consume their entire bucket in a single burst, which means the instantaneous rate can be much higher than the average rate.&lt;/p&gt;

&lt;p&gt;The key insight: &lt;strong&gt;it rewards idle time.&lt;/strong&gt; A user who hasn't called your API in 5 seconds has accumulated tokens and can burst. This matches how real humans use APIs — idle, idle, idle, &lt;em&gt;click click click click&lt;/em&gt;, idle. A window-based counter punishes that pattern. Token bucket embraces it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where to use it:&lt;/strong&gt; When your users are bursty and that's expected. A developer building a dashboard fires 12 parallel API calls on page load, then nothing for 45 seconds. A mobile app syncs on wake. A CLI tool batches requests. In all these cases, the &lt;em&gt;average&lt;/em&gt; rate is fine — it's the &lt;em&gt;shape&lt;/em&gt; that doesn't fit a fixed window. Token bucket lets legitimate bursts through while still enforcing a long-term ceiling.&lt;/p&gt;

&lt;p&gt;Stripe, GitHub, and AWS EC2 all use token bucket for their public APIs. Google's Guava &lt;code&gt;RateLimiter&lt;/code&gt; library implements it.&lt;/p&gt;
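&lt;p&gt;The whole algorithm fits in a dozen lines. A single-process sketch (parameter names are illustrative):&lt;/p&gt;

```python
class TokenBucket:
    """Tokens drip in at a steady rate; each request spends one."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens per second: the long-term average
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start full
        self.last = 0.0

    def allow(self, now):
        # Lazy refill: add tokens for the elapsed time, capped at capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

&lt;p&gt;Note there is no timer and no background refill job: the bucket is topped up lazily, on each request, from the elapsed time.&lt;/p&gt;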

&lt;h3&gt;
  
  
  5. Leaky Bucket
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;How it works:&lt;/strong&gt; The mirror image of token bucket. Requests pour into a fixed-size queue. The queue drains at a constant rate. If the queue is full when a new request arrives, it's rejected.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb4uk3wr9408vc0k6g93b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb4uk3wr9408vc0k6g93b.png" alt=" " width="800" height="250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Say the queue holds 10 requests and drains at 1 request/sec. If 50 requests arrive in one second, 1 goes through immediately, 10 wait in the queue, and 39 are rejected. The output is &lt;strong&gt;perfectly smooth&lt;/strong&gt; — always exactly the drain rate, no matter how spiky the input.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The trade-off:&lt;/strong&gt; You get perfectly shaped output — no burst ever reaches the downstream system. You give up two things: (1) burst tolerance — even legitimate spikes get queued or rejected, and (2) latency — requests sit in the queue waiting their turn instead of being processed immediately.&lt;/p&gt;

&lt;p&gt;The critical difference from token bucket: token bucket &lt;em&gt;allows&lt;/em&gt; bursts and limits the average. Leaky bucket &lt;em&gt;eliminates&lt;/em&gt; bursts and smooths the output. They look similar on paper but behave very differently under spiky load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where to use it:&lt;/strong&gt; When your downstream truly cannot handle bursts — not even brief ones. An SMS provider that charges 5x for burst traffic. A legacy database that crashes above 100 writes/sec rather than degrading. A hardware device with a fixed processing rate. Network traffic shaping where you need constant bandwidth. Anywhere the &lt;em&gt;shape&lt;/em&gt; of the output matters as much as the &lt;em&gt;volume&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;NGINX's &lt;code&gt;limit_req&lt;/code&gt; module uses leaky bucket by default. Twilio and SendGrid submission queues work this way. Network QoS traffic shapers are leaky buckets over bytes instead of requests.&lt;/p&gt;
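&lt;p&gt;A sketch of the queue-plus-drain behaviour, at whole-request granularity in a single process (a real implementation drains on a timer, or like NGINX in fractional units):&lt;/p&gt;

```python
from collections import deque

class LeakyBucket:
    """Requests queue up; the queue drains at a constant rate; overflow is rejected."""

    def __init__(self, drain_rate, queue_size):
        self.drain_rate = drain_rate   # requests released per second
        self.queue_size = queue_size
        self.queue = deque()
        self.last_drain = 0.0

    def offer(self, request, now):
        # Release whole requests for the time elapsed since the last drain.
        drained = int((now - self.last_drain) * self.drain_rate)
        if drained:
            for _ in range(min(drained, len(self.queue))):
                self.queue.popleft()   # handed to the downstream system
            self.last_drain = now
        if len(self.queue) >= self.queue_size:
            return False               # bucket full: rejected
        self.queue.append(request)
        return True                    # queued; drains in arrival order
```

&lt;p&gt;Spiky input, constant output: that asymmetry is the entire point of the algorithm.&lt;/p&gt;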

&lt;h2&gt;
  
  
  Choosing the Right One
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fevedbisi0vqea0vzuhi8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fevedbisi0vqea0vzuhi8.png" alt=" " width="800" height="554"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But there's a sixth option that doesn't fit neatly into the tree.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bonus: Adaptive Throttling — The Self-Tuning Client
&lt;/h3&gt;

&lt;p&gt;What if you don't want to pick a number at all?&lt;/p&gt;

&lt;p&gt;Google's SRE Book describes a pattern where the client tracks its own success rate and starts probabilistically dropping requests when the server is struggling:&lt;/p&gt;

&lt;p&gt;The client tracks two numbers, requests attempted and requests the server accepted, and starts preemptively dropping calls once attempts outpace accepts by more than a factor of K (the book suggests K = 2). The server gets breathing room. As it recovers, your success rate climbs and you ramp back up automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use it for:&lt;/strong&gt; Internal service-to-service calls where you control both sides. No manual tuning. No threshold guessing. The system finds its own equilibrium. Implement it as a client interceptor (gRPC, HTTP middleware) that backs off when the backend returns overload errors.&lt;/p&gt;
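&lt;p&gt;The SRE Book gives a concrete formula: reject client-side with probability &lt;code&gt;max(0, (requests - K * accepts) / (requests + 1))&lt;/code&gt;. A sketch (the unbounded counters here are a simplification; the book uses a sliding history):&lt;/p&gt;

```python
import random

class AdaptiveThrottle:
    """Client-side adaptive throttling from Google's SRE Book."""

    def __init__(self, k=2.0):
        self.k = k          # K = 2: tolerate up to 2x goodput in attempts
        self.requests = 0   # attempts this client has made
        self.accepts = 0    # attempts the backend actually accepted

    def reject_probability(self):
        p = (self.requests - self.k * self.accepts) / (self.requests + 1)
        return max(0.0, p)

    def should_send(self, rng=random.random):
        # Drop the request locally with the computed probability.
        return rng() >= self.reject_probability()

    def record(self, accepted):
        self.requests += 1
        if accepted:
            self.accepts += 1
```

&lt;p&gt;While the backend accepts everything, the probability stays at zero; as rejections mount, the client throttles itself harder, and the equilibrium emerges with no configured threshold.&lt;/p&gt;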

&lt;h2&gt;
  
  
  The Uncomfortable Truth
&lt;/h2&gt;

&lt;p&gt;Rate limiting is admission control — it decides who gets rejected. And rejection has a cost. A rejected API call might mean a failed checkout or a user staring at a spinner.&lt;/p&gt;

&lt;p&gt;The algorithms above are tools. The harder question is policy: &lt;em&gt;Who&lt;/em&gt; gets limited? Per-IP is easy to circumvent. Per-API-key punishes shared keys. Per-user requires authentication before rate checking. And in a multi-tenant SaaS, your free-tier user and your enterprise customer probably shouldn't share a bucket.&lt;/p&gt;

&lt;p&gt;The algorithm is the mechanism. The policy is the product decision. Get both right.&lt;/p&gt;

</description>
      <category>systemdesign</category>
      <category>ratelimiting</category>
      <category>backpressure</category>
    </item>
    <item>
      <title>How to Build Workflows That Never Lose Progress</title>
      <dc:creator>Saumya Karnwal</dc:creator>
      <pubDate>Tue, 24 Feb 2026 19:33:45 +0000</pubDate>
      <link>https://dev.to/saumya_karnwal/how-to-build-workflows-that-never-lose-progress-3ndd</link>
      <guid>https://dev.to/saumya_karnwal/how-to-build-workflows-that-never-lose-progress-3ndd</guid>
      <description>&lt;h2&gt;
  
  
  The Half-Deployed Model
&lt;/h2&gt;

&lt;p&gt;Imagine you're running an ML platform. A weekly cron job fires at 3 AM to retrain a customer's model. The pipeline has five steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Generate training data from BigQuery&lt;/li&gt;
&lt;li&gt;Train the model on a Kubernetes cluster&lt;/li&gt;
&lt;li&gt;Push the model artifact to a registry&lt;/li&gt;
&lt;li&gt;Create a scoring configuration in the scoring service database&lt;/li&gt;
&lt;li&gt;Authorize the model for the customer's traffic&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Steps 1 through 3 take about two hours and cost real money — compute time, BigQuery slots, container images. At 5:02 AM, step 3 completes. The model is trained and pushed.&lt;/p&gt;

&lt;p&gt;Step 4 calls the scoring service to create the config. The scoring service is in the middle of a routine database migration. Connection refused.&lt;/p&gt;

&lt;p&gt;Now you have a problem. The model is sitting in the artifact registry, trained and ready. But it can't serve traffic because there's no scoring config. The pipeline marks the whole run as "FAILED."&lt;/p&gt;

&lt;p&gt;What happens next depends on how you built the system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you start over:&lt;/strong&gt; The 6 AM retry re-runs from step 1. Two more hours of BigQuery and Kubernetes compute, re-training a model that's identical to the one you already have. You just burned money and time rebuilding something that already exists.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you do nothing:&lt;/strong&gt; The model sits orphaned in the registry. The customer's production model is stale. A data scientist notices three days later and manually creates the config.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you built a saga:&lt;/strong&gt; The system knows step 3 completed. It retries step 4. The scoring service comes back from its migration at 5:15 AM. Retry succeeds. Step 5 runs. By 5:20 AM the customer has a fresh model. Nobody was woken up. No work was wasted.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's a Saga?
&lt;/h2&gt;

&lt;p&gt;The original problem was database transactions that span multiple systems — you can't use a single &lt;code&gt;BEGIN/COMMIT&lt;/code&gt; because the data lives in different databases.&lt;/p&gt;

&lt;p&gt;The solution: break the big transaction into a &lt;strong&gt;sequence of smaller steps&lt;/strong&gt;, each with:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A known &lt;strong&gt;state&lt;/strong&gt; (pending, in-progress, complete, failed)&lt;/li&gt;
&lt;li&gt;The ability to &lt;strong&gt;retry&lt;/strong&gt; safely (idempotency)&lt;/li&gt;
&lt;li&gt;An optional &lt;strong&gt;compensating action&lt;/strong&gt; (undo what was done if we need to abort)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The state machine IS the recovery mechanism. You don't need a separate "recovery system" — you just need each step to be resumable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Idempotency Is the Hard Part
&lt;/h2&gt;

&lt;p&gt;The saga pattern sounds simple: track state, retry failed steps. But there's a catch. What if the step &lt;em&gt;did&lt;/em&gt; succeed, but you didn't get the confirmation?&lt;/p&gt;

&lt;p&gt;Picture this: your pipeline calls the scoring service to create a config. The service creates it, writes it to the database, and starts sending back a 200 response. At that exact moment, the network blips. Your pipeline gets a timeout. It thinks step 4 failed.&lt;/p&gt;

&lt;p&gt;On retry, the pipeline calls the scoring service again: "Create this config." If the service isn't idempotent, it creates a &lt;em&gt;second&lt;/em&gt; config. Now you have duplicate scoring entries, and the model might score users twice.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Idempotency&lt;/strong&gt; means: running the same operation twice produces the same result as running it once. The service checks "does this config already exist for this model version?" and if so, returns the existing one instead of creating a duplicate.&lt;/p&gt;

&lt;p&gt;This is the non-negotiable foundation. If a step can't be safely retried, the entire saga pattern breaks.&lt;/p&gt;
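&lt;p&gt;The "does this already exist?" check is small. A sketch with an in-memory store (the &lt;code&gt;model_version&lt;/code&gt; key and the shape of the config are illustrative, not any real scoring service's API):&lt;/p&gt;

```python
class ScoringService:
    """Create-config endpoint that is safe to retry."""

    def __init__(self):
        self.configs = {}  # model_version -> config
        self.writes = 0    # how many times we actually created one

    def create_config(self, model_version, params):
        existing = self.configs.get(model_version)
        if existing is not None:
            # The earlier call succeeded but the response was lost:
            # return the same config instead of creating a duplicate.
            return existing
        config = {"model_version": model_version, "params": params}
        self.configs[model_version] = config
        self.writes += 1
        return config
```

&lt;p&gt;A client that timed out and retries gets the original config back, and the write happens exactly once.&lt;/p&gt;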

&lt;h2&gt;
  
  
  What This Looks Like in Practice
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The State Machine
&lt;/h3&gt;

&lt;p&gt;Every deployment in the system has a status that tracks exactly where it is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;PENDING
   │
   ▼
DATA_GEN_IN_PROGRESS ──(fail)──▶ DATA_GEN_FAILED
   │                                     │
   ▼                                  (retry)
DATA_GEN_COMPLETE                        │
   │                              ◀──────┘
   ▼
TRAINING_IN_PROGRESS ──(fail)──▶ TRAINING_FAILED
   │                                     │
   ▼                                  (retry)
TRAINING_COMPLETE                        │
   │                              ◀──────┘
   ▼
PUSHING_IN_PROGRESS ──(fail)──▶ PUSH_FAILED
   │                                     │
   ▼                                  (retry)
PUSHING_COMPLETE                         │
   │                              ◀──────┘
   ▼
CONFIGURING ──(fail)──▶ CONFIG_PENDING
   │                          │
   ▼                    (reconciliation
READY                    loop retries)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This looks like a lot of states. But each state is dirt-simple: "I know exactly what succeeded, and I know exactly what to do next."&lt;/p&gt;

&lt;h3&gt;
  
  
  Retry With Backoff
&lt;/h3&gt;

&lt;p&gt;When a step fails, the system doesn't immediately retry in a tight loop. That would hammer a service that might already be struggling.&lt;br&gt;
Exponential backoff gives the downstream service time to recover. If it's a 30-second blip, attempt 2 or 3 catches it. If it's a longer outage, the system backs off gracefully.&lt;/p&gt;
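&lt;p&gt;A minimal backoff helper (5 attempts and a 60-second cap are arbitrary defaults; production versions also add jitter so retries from many clients don't synchronize):&lt;/p&gt;

```python
import time

def retry_with_backoff(step, max_attempts=5, base_delay=1.0, cap=60.0,
                       sleep=time.sleep):
    """Run step(); on failure wait base_delay * 2^attempt (capped), retry."""
    for attempt in range(max_attempts):
        try:
            return step()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of retries: let the saga mark the step failed
            sleep(min(cap, base_delay * (2 ** attempt)))
```

&lt;p&gt;Injecting &lt;code&gt;sleep&lt;/code&gt; keeps the helper testable without real waiting.&lt;/p&gt;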

&lt;h3&gt;
  
  
  The Reconciliation Loop
&lt;/h3&gt;

&lt;p&gt;What if 5 retries aren't enough? The deployment state says &lt;code&gt;CONFIG_PENDING&lt;/code&gt;. The pipeline stops actively retrying. But it's not abandoned. A background process — the reconciliation loop — periodically scans for stuck deployments.&lt;br&gt;
When the downstream service recovers (maybe after a database migration, maybe after an outage), the reconciliation loop picks up the stuck deployments and completes them. No human intervention. No lost work.&lt;/p&gt;

&lt;p&gt;The user sees: "Deployment in progress — model trained, awaiting configuration." Not an error. Not a failure. Just... waiting, and it'll fix itself.&lt;/p&gt;
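&lt;p&gt;One pass of such a loop can be sketched in a few lines (state names follow the diagram above; the store and the completion call are illustrative):&lt;/p&gt;

```python
def reconcile(deployments, create_config):
    """Scan for deployments stuck awaiting configuration and finish them.
    deployments maps id -> state; create_config may still fail."""
    for dep_id, state in list(deployments.items()):
        if state != "CONFIG_PENDING":
            continue
        try:
            create_config(dep_id)
            deployments[dep_id] = "READY"
        except Exception:
            pass  # downstream still unhealthy; the next pass retries
```

&lt;p&gt;Run it on a schedule, such as a cron entry or a background thread. Because the completion step is idempotent, a pass that races with a live retry is harmless.&lt;/p&gt;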

&lt;h2&gt;
  
  
  Making Each Step Idempotent
&lt;/h2&gt;

&lt;p&gt;In practice, idempotency looks different for each type of operation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Database writes&lt;/strong&gt;: Use &lt;code&gt;INSERT ... ON CONFLICT DO NOTHING&lt;/code&gt; or check-before-write. If the row exists with the same key, it's a no-op.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API calls&lt;/strong&gt;: Include a unique request ID (sometimes called an idempotency key). The server caches results by this key — if it's seen the key before, it returns the cached result.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;State changes&lt;/strong&gt;: Read current state before deciding what to do. If the current state is already what you want, do nothing. This is how Kubernetes controllers work — they compare desired state to actual state on every loop.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pattern is always the same: &lt;strong&gt;check if the work is already done before doing it again.&lt;/strong&gt;&lt;/p&gt;
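&lt;p&gt;The API-call variant can be sketched as a thin server-side cache keyed by request ID (the handler and key format here are illustrative):&lt;/p&gt;

```python
class IdempotentEndpoint:
    """Cache results by idempotency key: a retry with the same key
    returns the original result instead of redoing the work."""

    def __init__(self, handler):
        self.handler = handler
        self.seen = {}  # idempotency_key -> cached result

    def handle(self, idempotency_key, payload):
        if idempotency_key in self.seen:
            return self.seen[idempotency_key]
        result = self.handler(payload)  # the real, side-effecting work
        self.seen[idempotency_key] = result
        return result
```

&lt;p&gt;Two calls with the same key produce one side effect, which is exactly the guarantee the timed-out client needs.&lt;/p&gt;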

&lt;h2&gt;
  
  
  The Anatomy of a Good Saga
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. State Must Be Durable
&lt;/h3&gt;

&lt;p&gt;The state machine lives in a database, not in memory. If the orchestrator crashes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It restarts&lt;/li&gt;
&lt;li&gt;Reads the state from the database&lt;/li&gt;
&lt;li&gt;Picks up where it left off&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the state was in memory, a crash means starting over. If it's in a database, a crash means a brief pause.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Compensating Actions for the Unhappy Path
&lt;/h3&gt;

&lt;p&gt;Sometimes you need to abort, not retry. If a model is deployed but turns out to be bad, you don't just retry — you rollback.&lt;br&gt;
The compensating actions are the "undo" for each step. Not every step needs one (training data in BigQuery doesn't hurt anyone just sitting there), but state changes in production databases definitely do.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Visibility Into the State
&lt;/h3&gt;

&lt;p&gt;A saga that works perfectly but is opaque to users is almost as bad as one that fails. The user should be able to see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which step the deployment is on&lt;/li&gt;
&lt;li&gt;What failed and why&lt;/li&gt;
&lt;li&gt;Whether the system is retrying or waiting for manual intervention&lt;/li&gt;
&lt;li&gt;A "Retry" button for failed steps&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  When You Need a Saga
&lt;/h2&gt;

&lt;p&gt;Two conditions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The operation spans multiple services or systems.&lt;/strong&gt; If it's a single database transaction, use a regular transaction. If it crosses service boundaries, you need a saga.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Partial completion is worse than complete failure.&lt;/strong&gt; If step 3 of 5 fails and you're left in a half-done state, that's a problem. The saga ensures you either complete or cleanly recover.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A quick gut check: if you find yourself writing code like "first do X, then do Y, and if Y fails... um..." — you need a saga.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where You've Seen This Pattern
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stripe's idempotency keys&lt;/strong&gt; — Every Stripe API call accepts an &lt;code&gt;Idempotency-Key&lt;/code&gt; header. If your server crashes after Stripe processes a charge but before you record the response, you retry with the same key. Stripe returns the original result. No double-charge. This is idempotency as a first-class API concept.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kubernetes controllers&lt;/strong&gt; — The entire K8s control plane is a saga engine. Controllers compare desired state to current state on a reconciliation loop. If a controller crashes mid-action, it restarts, re-evaluates, and acts on the delta. It doesn't need to remember what it did — it looks at what exists.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Airline booking systems&lt;/strong&gt; — When you book a flight, the system reserves a seat, charges your card, issues a ticket, and sends confirmation. If the charge fails, a compensating action releases the seat hold. If ticketing fails, it retries without re-charging. Each step knows what happened before it.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>architecture</category>
      <category>devops</category>
      <category>machinelearning</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>“You Can’t Do That” Is Not a User Experience</title>
      <dc:creator>Saumya Karnwal</dc:creator>
      <pubDate>Sun, 22 Feb 2026 11:32:48 +0000</pubDate>
      <link>https://dev.to/saumya_karnwal/you-cant-do-that-is-not-a-user-experience-1fm1</link>
      <guid>https://dev.to/saumya_karnwal/you-cant-do-that-is-not-a-user-experience-1fm1</guid>
      <description>&lt;h2&gt;
  
  
  The 28 Slack Messages
&lt;/h2&gt;

&lt;p&gt;Imagine you've built an ML platform. Product managers can deploy pre-built model templates to customers — select a template, pick a customer, click deploy. Easy.&lt;/p&gt;

&lt;p&gt;One Tuesday afternoon, a PM deploys a click-through rate model to a new customer. The customer onboarded three days ago. The model template needs at least 14 days of interaction data to train.&lt;/p&gt;

&lt;p&gt;The PM doesn't know this. Why would they? They're a product manager, not a data scientist. They click "Deploy," get a spinner for 90 seconds, and then:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Error: Training pipeline failed. Exit code 1.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No explanation. No guidance. No next step.&lt;/p&gt;

&lt;p&gt;So the PM does what anyone would do — they message the engineering channel on Slack: &lt;em&gt;"Hey, I tried to deploy CTR model to Customer Y and it failed. Can someone look?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;An engineer investigates. Twenty minutes later: &lt;em&gt;"Oh, they only have 3 days of data. The model needs 14 days minimum. You'll have to wait until around March 5."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Now multiply this by every PM, every customer, every model type with different requirements. You get 28 Slack messages a week, each one asking some variation of "why did this fail?" And each answer is something the system already knew — it just didn't bother to tell anyone.&lt;/p&gt;

&lt;p&gt;The system had all the information. It chose to say nothing useful.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pattern: Guide, Don't Block
&lt;/h2&gt;

&lt;p&gt;Most systems handle bad input the same way: reject it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User tries something invalid → "Error: Invalid request" → User confused
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This technically works. The system is "safe." But the user is stuck, frustrated, and about to file a support ticket — which means a human ends up solving what the system should have.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Guard rails&lt;/strong&gt; take a different approach:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User tries something → Validation fails → "Here's WHY, here's WHAT to fix, here's WHEN to retry"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The system still prevents the bad thing from happening. But instead of slamming a door in your face, it puts up a guardrail on the highway — you can see the edge, you know you're drifting, and you correct before going off the cliff.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Failure Mode This Prevents
&lt;/h2&gt;

&lt;p&gt;Guard rails solve a specific problem: &lt;strong&gt;self-serve systems that generate support tickets instead of successful outcomes.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Imagine you're building a platform where non-technical users trigger complex operations. Maybe it's:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A product manager deploying an ML model to a customer&lt;/li&gt;
&lt;li&gt;A marketing team scheduling a push notification campaign&lt;/li&gt;
&lt;li&gt;An ops engineer provisioning infrastructure for a new region&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These users don't have the mental model of the system. They don't know the prerequisites. They don't know what "Error: insufficient data for training pipeline" means.&lt;/p&gt;

&lt;p&gt;Without guard rails, every failed operation becomes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User → tries action → cryptic error → Slack message to engineering →
engineer investigates → "oh, you need 14 days of data first" →
user waits → tries again → maybe fails again
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With guard rails:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User → tries action → "This customer has 3 days of data.
This model needs at least 14 days. Earliest deploy: March 5.
[Notify me when ready]"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No Slack message. No engineering ticket. No frustration. The system explained the constraint and offered a next step.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Good Guard Rails Look Like
&lt;/h2&gt;

&lt;p&gt;Back to the ML platform. Here's what that PM should have seen instead of "Exit code 1":&lt;/p&gt;

&lt;h3&gt;
  
  
  The Validation Checklist
&lt;/h3&gt;

&lt;p&gt;Before the system accepts a deploy request, it runs a series of checks:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs5074a87qmssf7s84xjp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs5074a87qmssf7s84xjp.png" alt=" " width="800" height="662"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice what this does:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Shows what passed&lt;/strong&gt; — the user knows they're not doing everything wrong&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explains what failed&lt;/strong&gt; — specific, not generic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gives a date&lt;/strong&gt; — "when" is more useful than "no"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Offers a next action&lt;/strong&gt; — "Notify me" means they don't have to keep checking manually&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  The Thresholds Are Part of the Template
&lt;/h3&gt;

&lt;p&gt;Different models need different things. A simple bandit model might work with 14 days of data. A gradient-boosted model might need 30 days. A deep learning model might need 90 days and a GPU.&lt;/p&gt;

&lt;p&gt;The guard rail thresholds aren't hardcoded — they're defined by the people who know: the data scientists who built the template. When DS publishes a template, they specify:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Check&lt;/th&gt;
&lt;th&gt;Bandit&lt;/th&gt;
&lt;th&gt;LightGBM&lt;/th&gt;
&lt;th&gt;Deep Learning&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Min data days&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;td&gt;90&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Min interaction rows&lt;/td&gt;
&lt;td&gt;1,000&lt;/td&gt;
&lt;td&gt;50,000&lt;/td&gt;
&lt;td&gt;500,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Feature table required&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU required&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The platform enforces these automatically. DS defines the rules once, every deployment is validated against them forever.&lt;/p&gt;
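&lt;p&gt;Checking a deploy request against a template's thresholds, and phrasing each failure as "why, what, when" rather than a bare rejection, might look like this (field names are illustrative):&lt;/p&gt;

```python
def validate_deploy(template, customer):
    """Return one (check, passed, message) per rule: what passed, what
    failed and why, and when a blocked deploy becomes possible."""
    results = []
    need, have = template["min_data_days"], customer["data_days"]
    if have >= need:
        results.append(("min data days", True, f"{have} days (needs {need})"))
    else:
        results.append(("min data days", False,
                        f"Customer has {have} days of data; this template "
                        f"needs {need}. Earliest deploy: in {need - have} days."))
    if template["gpu_required"] and not customer["has_gpu"]:
        results.append(("gpu", False, "Template needs a GPU pool; none attached."))
    else:
        results.append(("gpu", True, "ok"))
    return results
```

&lt;p&gt;The UI can render this directly as the checklist above: green rows for passes, and for each failure a specific reason plus a date to act on.&lt;/p&gt;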

&lt;h2&gt;
  
  
  The Anatomy of a Good Guard Rail
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Validate Early, Not Late
&lt;/h3&gt;

&lt;p&gt;Don't let the user fill out a 10-step form, click "Submit," and THEN tell them step 2 was wrong. Validate as early as possible.&lt;/p&gt;

&lt;p&gt;Even better — validate &lt;em&gt;before they even start&lt;/em&gt;. If you know a customer doesn't have enough data, show a warning on the template selection page. Don't wait until they've configured everything.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Be Specific, Not Generic
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;❌ "Validation error"
❌ "Cannot deploy model"
❌ "Insufficient data"
✅ "Customer Y has 3 days of interaction data. CTR Bandit requires at least 14 days."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every word in the error message should reduce the user's uncertainty. If they read it and still don't know what to do, the message failed.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Tell Them When, Not Just No
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;❌ "Not enough data. Try again later."
✅ "Not enough data. Earliest deploy date: March 5, 2026."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;"Later" is useless. A date is actionable. They can set a calendar reminder and come back.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Offer a Next Step
&lt;/h3&gt;

&lt;p&gt;The best guard rails don't just inform — they offer an action:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Notify me when ready" (subscribe to a check)&lt;/li&gt;
&lt;li&gt;"Deploy with reduced accuracy" (accept the risk explicitly)&lt;/li&gt;
&lt;li&gt;"Contact DS team" (escalate with context pre-filled)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The user should never be left staring at a message with nothing to click.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where You've Seen This Pattern
&lt;/h2&gt;

&lt;p&gt;Once you recognize guard rails, you see them everywhere:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Stripe's onboarding&lt;/strong&gt; doesn't say "Error 403: Account not verified." It shows a checklist of what's complete and what's missing, with time estimates and action buttons for each step.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub branch protection&lt;/strong&gt; doesn't just reject your push to &lt;code&gt;main&lt;/code&gt;. It shows which checks failed (with links to logs), which reviews are missing (with names), and a "Re-run" button for flaky tests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Terraform plan&lt;/strong&gt; shows you exactly what will be created, changed, and destroyed &lt;em&gt;before&lt;/em&gt; you apply — so you make the call with full information, not after the damage is done.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google's "Did you mean?"&lt;/strong&gt; — the original guard rail. You search for "pythn tutoral" and instead of "no results," you get a suggested correction and results anyway.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are all systems that chose to guide instead of block. The user is still prevented from doing the wrong thing — but they're never left staring at a wall.&lt;/p&gt;

&lt;h2&gt;
  
  
  When You Need Guard Rails
&lt;/h2&gt;

&lt;p&gt;Two conditions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Non-expert users trigger complex operations.&lt;/strong&gt; If only senior engineers use the system, a terse error might be fine. If product managers, marketers, or ops people use it — they need guidance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Failures are recoverable but wasteful.&lt;/strong&gt; If the system would eventually fail anyway (out of memory, bad training data, missing features), it's better to catch it before spending 2 hours and $50 in compute on a doomed training run.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The guard rails pattern works best at &lt;strong&gt;system boundaries&lt;/strong&gt; — where user intent meets system constraints. That's where the mismatch between "what I want to do" and "what the system can handle" is highest.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>product</category>
      <category>softwareengineering</category>
      <category>ux</category>
    </item>
  </channel>
</rss>
