If this has happened to you…
- two requests update the same thing at the same time (race conditions)
- retries create duplicate effects (double emails, double charges, double writes)
- rate limits / quotas are inconsistent under load
- ordering matters per customer/resource, but events arrive “whenever”
- “it worked locally” but breaks under real traffic
…you’re in distributed systems territory.
And you don’t need a “massive distributed system” for this to be true. Even with a single server, concurrent requests + retries + partial failures can create the same class of problems. If you later add replicas, the pain gets amplified fast.
So the topic isn’t “how many servers you have”. It’s coordination. Per-key coordination.
Per-key coordination means: for one specific thing (like an order or a user), there’s one place that decides what happens.
That sentence sounds obvious… until you meet the moment where it stops being optional.
The moment it becomes real
Imagine a button: Pay.
On the happy path, it’s boring: click -> charge -> 200 OK -> “Paid.”
The failure path is where your system reveals what it actually believes.
The user clicks Pay. The server charges the card. Then something goes wrong before the user sees success. It could be a timeout, a network hiccup, a crash, the client giving up early, a proxy retrying, a job runner retrying later—pick your favorite. The point is: the system did some work, but the outside world didn’t get a clean “done” signal.
So the intent arrives again.
Now you’re not debugging payments. You’re debugging this question:
Did we already do the thing? If yes, what should we do now?
That question doesn’t live only in checkout flows. It shows up when you send an email, create a subscription, increment a quota, finalize an order, apply a state transition, update a profile, accept an invite.
This is why production bugs can feel haunted. The code looks fine. Tests pass. Logs look normal. Yet outcomes are wrong—because the system answers “did it already happen?” inconsistently.
“Maybe” is the most expensive state
At some point your system can’t confidently say “yes” or “no.” It can only say “maybe.”
And “maybe” is expensive because it forces two bad choices. If you do it again, you create duplicates (double charge, double email, double write). If you refuse to do it again, you miss work (no email, stale state, inconsistent outcomes).
The frustrating part is that “maybe” isn’t rare. It shows up through normal reality: concurrent requests, retries, webhook redeliveries, at-least-once job processing, crashes between “side effect happened” and “response delivered.”
So the fix usually isn’t “add another condition.” The fix is to introduce a consistent place where decisions get made.
The missing Lego brick
When the weirdness clusters around one thing (an order, a user, a tenant, a resource), the shape of the fix is usually the same:
one coordinator per key.
One place that can say: “I’ve already seen this requestId, don’t apply it twice.” Or: “For this order, state transitions happen in order.” Or: “For this tenant, quotas are enforced consistently.” Or: “Only one worker can hold this lock right now.”
In this series, I’ll use Cloudflare Durable Objects as the coordination primitive so we can focus on the patterns, not the plumbing.
Why this series uses Durable Objects
You can solve coordination problems on any major cloud. The question isn’t “can I do it on AWS/GCP?” — you can.
The question is: how many moving parts do you need to make it correct, and how easy is it to reason about under retries and concurrency?
Coordination bugs rarely come from missing features. They come from the system being forced to answer “did it already happen?” and having no single place that can answer consistently for a key.
Durable Objects show up in this series because they give you a very direct Lego brick for that shape of problem:
one key -> one stateful place to decide, with storage attached.
What you usually end up building on AWS/GCP (and why it’s easy to get wrong)
If you try to recreate “one place per key decides” using common cloud primitives, it typically becomes a small distributed system of its own:
- stateless compute (Lambda/Cloud Run/etc.)
- a database with conditional writes / transactions (DynamoDB/Spanner/Postgres)
- often a cache/lock service (Redis) for counters/locks
- sometimes a queue/workflow layer (SQS/PubSub/Step Functions/Cloud Tasks)
Each piece is fine on its own; the problem is the glue:
- you now have multiple services that must agree on ordering and idempotency
- you have to design retries across service boundaries
- you have to handle partial failures
- locks need TTLs, renewal, fencing, and careful failure handling
- debugging becomes “which service saw what, in what order?”
You can absolutely do it, but the coordination logic gets spread across infrastructure decisions, not just code.
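To make the lock point concrete, here’s a hedged sketch of why DIY locks need fencing. The idea: each lock grant carries an increasing token, and storage rejects writes from stale holders. `FencedStore` is a name made up for this illustration:

```typescript
// Sketch: fencing tokens for DIY locks. Without them, a paused or
// expired lock holder can come back and overwrite newer state.
class FencedStore {
  private highestToken = 0;
  private value: string | null = null;

  // Accept a write only if its fencing token is at least as new
  // as the newest one we've seen.
  write(token: number, v: string): boolean {
    if (token < this.highestToken) return false; // stale lock holder: reject
    this.highestToken = token;
    this.value = v;
    return true;
  }

  read(): string | null {
    return this.value;
  }
}

// Holder A takes the lock with token 1, then stalls (GC pause, slow I/O).
// The lock's TTL expires and holder B acquires it with token 2.
const store = new FencedStore();
store.write(2, "from B");              // B writes with the newer token
const late = store.write(1, "from A"); // A wakes up and tries to write: rejected
```

This is one of the “careful failure handling” items from the list above — and one of the things a single-threaded coordinator per key lets you skip entirely.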
What Durable Objects are (and what you get by default)
A Durable Object is a stateful instance addressed by an ID (or a name-derived key) that combines compute with persistent storage.
Three properties matter for coordination:
- Requests for the same key go to the same object. That gives you a natural “home” for decisions about order:123 or tenant:acme.
- Single-threaded execution per object. You can write coordination logic without reinventing locks inside your own code.
- Storage is attached to the object. The place that decides can also remember what it decided (dedupe keys, current state, counters, queue state).
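As a mental model only — this is plain TypeScript, not the actual Workers API, and `MiniDurableObject` / `objectForKey` are invented names — those three properties combine roughly like this:

```typescript
// Toy model of the three properties: stable routing by key,
// serialized handlers per object, and attached storage.
class MiniDurableObject {
  private storage = new Map<string, unknown>();
  private tail: Promise<unknown> = Promise.resolve();

  // Handlers run one at a time, in arrival order ("single-threaded").
  run<T>(handler: (storage: Map<string, unknown>) => Promise<T>): Promise<T> {
    const next = this.tail.then(() => handler(this.storage));
    this.tail = next.catch(() => undefined); // keep the chain alive on errors
    return next;
  }
}

// Same key -> same object, every time.
const objects = new Map<string, MiniDurableObject>();
function objectForKey(key: string): MiniDurableObject {
  let o = objects.get(key);
  if (!o) {
    o = new MiniDurableObject();
    objects.set(key, o);
  }
  return o;
}

// Three "concurrent" increments on one key can't race: each handler
// sees the storage state left by the previous one.
const counter = objectForKey("counter:42");
const bump = () =>
  counter.run(async (s) => {
    const n = (s.get("n") as number | undefined) ?? 0;
    await Promise.resolve(); // yield mid-handler, like real I/O would
    s.set("n", n + 1);
    return n + 1;
  });
const results = Promise.all([bump(), bump(), bump()]);
```

Without the serialization in `run`, all three handlers could read `n = 0` before any of them writes — the classic lost update. The real platform gives you this per-object serialization without you writing the promise-chaining yourself.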
That combination means you can implement patterns like:
- idempotency/dedupe (requestId sets)
- single-writer ordering per key
- per-key rate limits/quotas
- per-key queues
- stampede protection (one refresh, many wait)
…without first assembling a separate coordination stack.
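As a taste of the stampede item, here’s a hedged single-flight sketch in plain TypeScript (again no Durable Objects; `singleFlight` and `refreshValue` are illustrative names): concurrent callers for the same key share one in-flight refresh instead of piling on.

```typescript
// Sketch of "single flight": the first caller for a key starts the
// refresh; everyone else awaits the same promise until it settles.
const inflight = new Map<string, Promise<string>>();

async function refreshValue(key: string): Promise<string> {
  // Stand-in for an expensive recompute or origin fetch.
  return `value-for-${key}`;
}

function singleFlight(key: string): Promise<string> {
  const existing = inflight.get(key);
  if (existing) return existing; // join the refresh already in flight
  const p = refreshValue(key).finally(() => inflight.delete(key));
  inflight.set(key, p);
  return p;
}

// Two "simultaneous" callers get the very same promise, so the
// expensive refresh runs once, not twice.
const a = singleFlight("report:today");
const b = singleFlight("report:today");
```

In a single process a `Map` is enough; the series will show the per-key-coordinator version, where this works even when the callers arrive on different machines.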
The real benefit
The benefit isn’t “Cloudflare vs AWS”. It’s that Durable Objects reduce coordination from “a system design problem across multiple services” into “a local decision inside one keyed instance.”
That’s why they’re a great teaching tool for these patterns: we can spend the series on ordering, dedupe, rate limits, per-key queues, stampede protection, and sharding, instead of spending half the series wiring and operationalizing the coordination stack.
When you shouldn’t use Durable Objects
Durable Objects aren’t a universal default. If a clean database transaction already solves the correctness problem, or your workload is purely stateless, or the primary requirement is global querying/analytics across many keys, DO may be unnecessary.
The guiding principle for the series stays the same:
use the simplest tool that can make the decision consistently.
What I’m going to build over 30 days
The goal is to end with a toolbox: the same coordination idea, applied as repeatable patterns.
We’ll start by making the primitive “click” with tiny demos: mapping keys to coordinators, choosing keys safely, and understanding what is and isn’t durable. From there we’ll turn the “Pay button story” into real solutions: deduping retries (idempotency), enforcing ordering (single-writer per key), and keeping quotas consistent (rate limiting). Then we’ll move into the patterns that show up once you ship: per-key queues, locks, stampede protection, hot keys, sharding, and the trade-offs that decide when DO is the right tool versus when a database, Redis, or a queue is the better fit.
I’ll keep it flexible on purpose: if a topic turns out to be more useful than planned, I’ll spend more time there.
Posting schedule (holiday break)
This is a 30-post run, but there will be no posts on Dec 24, 25, 29, 30, 31, and Jan 1, 2.
Planned map (subject to change)
I’ll keep this updated as days ship.
Day 01 — One Key, One Coordinator: The primitive: route by key -> one stateful place to decide.
Day 02 — Key Design = Partitioning: Good keys isolate; bad keys collide (and create hot spots).
Day 03 — What’s Actually Durable?: Memory vs storage vs “what survives” (and what doesn’t).
Day 04 — Single-Writer per Key: Enforce ordering and avoid races by serializing per key.
Day 05 — Idempotency: Dedupe Retries: Turn “maybe” into “already handled” with request IDs.
Day 06 — Rate Limiting per Key: Consistent quotas even when you scale and retries happen.
Day 07 — Weekly Recap #1 + Cheatsheet: The first “pattern index” you can bookmark.
Day 08 — A Per-Key Queue: Queue work per key to control order and throughput.
Day 09 — Locks per Resource (and when not to): When you truly need mutual exclusion, and the footguns.
Day 10 — Debounce/Throttle per Key: Collapse bursts into one decision.
Day 11 — Stampede Protection (“Single Flight”): One refresh runs; everyone else waits.
Day 12 — Consistent Counters per Key: Quotas, usage, and “exactly-once-ish” counting.
Day 13 — Leader per Key (Coordinator Role): When one instance must orchestrate steps for a key.
Day 14 — Weekly Recap #2 + Pattern Index: Consolidate: dedupe, ordering, queues, locks, stampedes.
Day 15 — Handling At-Least-Once Delivery: Designing for duplicates as a normal case.
Day 16 — Webhooks: Redelivery Without Panic: Make webhook handlers safe under retries.
Day 17 — Sagas per Key (Multi-Step Workflows): A simple saga state machine you can reason about.
Day 18 — Backpressure per Key: Protect correctness when load spikes.
Day 19 — Hot Keys: Symptoms and Triage: How to recognize and mitigate before it melts.
Day 20 — Sharding a Hot Key: Split one key into many without losing correctness.
Day 21 — Weekly Recap #3 + Failure Modes Checklist: The “what breaks in prod” list.
Day 22 — Observability That Helps Coordination Bugs: Logs/metrics/tracing for “did it already happen?”
Day 23 — Testing Retries, Races, and Ordering: A harness to reproduce the “haunted” bugs.
Day 24 — Anti-Patterns (What Not To Do): The mistakes that create invisible correctness debt.
Day 25 — DO vs DB vs Redis vs Queues (Honest Trade-offs): How to choose the simplest correct tool.
Day 26 — Multi-Tenant Boundaries: Isolation, fairness, and per-tenant abuse prevention.
Day 27 — Coordinating Fan-Out (Realtime / Rooms): When many clients depend on one key’s truth.
Day 28 — Composition: Building a “Coordination Kit”: Combine patterns instead of rewriting them.
Day 29 — A Small Capstone Demo: A realistic flow that uses multiple patterns together.
Day 30 — Final Index + Learning Paths: Where to start depending on your problem (retries, ordering, quotas, hot keys).
How to follow along
The code lives in a single GitHub repo: Link
Stay curious and ship.