A few years ago I watched a webhook handler charge a customer's card twice in the same minute. The success path committed. The retry path committed. They'd been written six months apart by different engineers, neither aware the other existed. The fix took an afternoon. The conversation about why the system allowed it took two weeks.
The fix was an idempotency key. The two weeks were about everything else.
I've been writing this kind of code for 20 years, and that incident wasn't the first or the worst. Across consulting gigs, Fortune 500 integrations, and products I've shipped, the same shape keeps showing up. A webhook comes in. A payload gets parsed. We enrich it with a customer record, call an API, post a notification, write a row, enqueue a follow-up job, and handle the error path if something goes wrong.
At first, it's just application code. A function here, a queue consumer there, a few retries, a few logs, maybe a database status column. It doesn't feel like a workflow system. It feels like the practical thing you write because the product needs to move.
That's the trap.
Useful workflows don't stay small. They grow branches. They get approval steps. They need retries. They need cancellation. They need to avoid charging a card twice or posting the same Slack message three times. They need to resume after a deploy, a crash, a timeout, or a handler bug. They need enough history that someone can answer a basic operator question: what happened to this run?
At that point, the workflow is no longer just glue code. It's a state machine. The only question is whether that state machine is explicit, or scattered through conditionals, database fields, queue messages, and side effects. In my experience it's almost always the second one.
The shape of the bug
The bugs that bother me most in workflow code aren't clever algorithm bugs.
They're state and boundary bugs:
- A handler assumes a step already happened, but the persisted state doesn't prove it.
- A retry repeats a side effect that should have been committed once.
- A failure is stored as a string, so retry policy depends on text that was never meant to be a contract.
- A function looks like a pure transform but quietly calls a database, an API, or a notification service.
- A workflow can enter a state no caller expected because transitions are just convention.
- The UI, worker, API, and deploy target each have a slightly different idea of what the workflow is.
I've been in the meeting where I asked "what state can this be in?" and got a 20-minute conversation involving three engineers and a database query.
None of those problems are exotic. That's what makes them annoying. They're the normal failure mode of workflow automation when the workflow model is implicit. The code can look clean locally. Each function reads fine in review. The system-level contract is still informal — and the informal part is exactly where the production incident is going to come from.
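To make the failure-as-string bullet concrete, here is a minimal sketch in TypeScript of what a typed failure could look like. The `StepFailure` shape, the `decideRetry` function, and the backoff numbers are all hypothetical names and values chosen for illustration, not any particular library's API.

```typescript
// Hypothetical failure shape: the kinds and fields are illustrative assumptions.
type StepFailure =
  | { kind: "timeout"; afterMs: number }
  | { kind: "rate_limited"; retryAfterMs: number }
  | { kind: "rejected"; reason: string }; // e.g. a declined card; retrying won't help

type RetryDecision =
  | { retry: true; delayMs: number }
  | { retry: false };

// Retry policy reads the failure's kind, never its message text.
function decideRetry(
  failure: StepFailure,
  attempt: number, // 1-based attempt that just failed
  maxAttempts = 3, // assumed limit, for illustration only
): RetryDecision {
  if (attempt >= maxAttempts) return { retry: false };
  switch (failure.kind) {
    case "timeout":
      // Exponential backoff: 1s after attempt 1, 2s after attempt 2, and so on.
      return { retry: true, delayMs: 1000 * 2 ** (attempt - 1) };
    case "rate_limited":
      // Honor the delay carried by the failure itself.
      return { retry: true, delayMs: failure.retryAfterMs };
    case "rejected":
      return { retry: false };
  }
}
```

The `reason` string is still there for humans, but nothing branches on it. Adding a new failure kind forces the `switch` to be revisited, because the compiler flags the missing return path.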
The hidden state machine
In the Marines, an operations order is a contract. It names the phases of the operation, the conditions to move between them, the abort criteria, what each element is doing in each phase, and what counts as mission complete. It isn't a vibe. It's a document everyone references when things start going sideways. Because things will start going sideways.
A workflow is the same kind of contract. It just doesn't usually look like one.
The double-charge incident I opened with had an implicit state machine. There were states — "received," "processing," "charged," "failed," "retried" — but none of them were named anywhere. They lived in a status column, a queue message, and a handful of conditionals. When the retry path was added, nobody wrote down the rule that "charged" was terminal. Nobody had to. The code worked.
Until it didn't.
If those states had been explicit, the questions would have answered themselves before shipping: What transitions are legal? Which state is terminal? Which step commits an externally visible action? Which failure states can retry?
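As a rough sketch of what "explicit" could mean here, the five states from the incident can be written down with a transition table. The state names come from the story above; the specific transitions and the `transition` and `isTerminal` helpers are my assumptions for illustration, not a reconstruction of the original system.

```typescript
type State = "received" | "processing" | "charged" | "failed" | "retried";

// Assumed legal transitions. An empty list marks a terminal state.
const transitions: Record<State, readonly State[]> = {
  received: ["processing"],
  processing: ["charged", "failed"],
  failed: ["retried"],
  retried: ["processing"],
  charged: [], // terminal: nothing may follow a successful charge
};

function isTerminal(state: State): boolean {
  return transitions[state].length === 0;
}

// Returns the new state, or throws if the table does not allow the move.
function transition(from: State, to: State): State {
  if (!transitions[from].includes(to)) {
    throw new Error(`illegal transition: ${from} -> ${to}`);
  }
  return to;
}
```

With a table like this, a retry path that tries to move a run from "charged" back to "processing" fails loudly at the transition instead of quietly charging the card again.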
When states are implicit, you still answer those questions. You just answer them at 2 AM, after the fact, distributed across code paths, database columns, log messages, and the heads of whoever wrote the code three sprints ago.
That's where glue code becomes expensive. Not because functions are bad. Not because JSON is bad. Because the workflow contract is hiding in the implementation instead of being something the implementation follows.
The side-effect problem
Workflows are mostly about side effects.
Parsing JSON isn't the hard part. The hard part is deciding what the runtime is allowed to repeat.
There's a real difference between these operations:
- validate a payload
- compute a retry decision
- format a message
- read a cached record
- charge a card
- send an email
- post to Slack
- write to an external system
In ordinary glue code, the line between what can replay safely and what can't is expressed by naming, comments, or reviewer discipline. Discipline doesn't save you when things go sideways. The runtime needs to know which calls are replay-safe and which calls are externally visible commitments, and it needs to know that without inferring it from a function name.
A workflow runtime needs a side-effect boundary it can reason about. If the contract doesn't say where committed work happens, the runtime is guessing from implementation details, and the customer is the one who finds out.
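One way to sketch that boundary: each step declares its effect, and the runner treats committed steps differently from replay-safe ones. Everything here is hypothetical (`Step`, `CommitLog`, `execute`, and the key format), and the `Map` is an in-memory stand-in for what would have to be a durable store in a real system.

```typescript
type Effect = "replay_safe" | "commit";

interface Step<T> {
  name: string;
  effect: Effect; // declared in the contract, not inferred from the name
  run: () => T;
}

// In-memory stand-in for a durable record of committed steps.
class CommitLog {
  private done = new Map<string, unknown>();
  has(key: string): boolean { return this.done.has(key); }
  get(key: string): unknown { return this.done.get(key); }
  record(key: string, result: unknown): void { this.done.set(key, result); }
}

function execute<T>(runId: string, step: Step<T>, log: CommitLog): T {
  // Replay-safe steps can simply run again.
  if (step.effect === "replay_safe") return step.run();

  // Committed steps run at most once per idempotency key.
  const key = `${runId}:${step.name}`;
  if (log.has(key)) return log.get(key) as T;
  const result = step.run();
  log.record(key, result);
  return result;
}
```

This sketch still leaves a window if the process dies after `run` and before `record`. Closing it usually means also sending the same key to the external system, when that system accepts an idempotency key, so that it can deduplicate on its side.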
The bet
There are good workflow engines already. My skepticism isn't that durable-execution systems are useless. It's that most of them put the runtime at the center of gravity. You describe a graph or write code against a framework, and the platform interprets or orchestrates the execution. That's a fine answer for a lot of teams. It isn't the answer I want as an engineer.
I want the workflow contract to be the source of truth, not the runtime.
A workflow should be readable before it's runnable. You should be able to open one file and see the states the workflow can be in, the transitions it allows, the data carried by each state, the side effects it can perform, the committed actions it might take, and the failure shape it exposes to the runtime. Then the generated code and operational tooling should preserve that contract instead of re-inventing it somewhere else.
That's why I stopped trusting workflow glue code as the primary abstraction. Not because glue code is always bad. Because workflows eventually need stronger boundaries than glue code naturally provides — and by the time you notice, you're already paying for it.
The next post is about what I'm building to fix it.