We've written before about how StateKeep migrates running actors when a workflow definition changes — routing by event history fingerprint instead of current state, so two actors sitting in the same state can receive different migration decisions based on how they got there.
What we haven't covered is the formal side of it: why this isn't a heuristic that happens to work most of the time, but a model with axioms and proofs behind it.
The full formal writeup — the axioms, the routing law derived from them, and the proofs of the properties below — is part of an ongoing research effort and isn't public yet. What follows is the shape of the model: precise enough that you can evaluate whether it holds up, without it being a how-to.
The Core Idea
The question at the center of this is simple to state and hard to answer: when a machine definition changes, how do you migrate in-flight actors to the new version without replaying their entire event history?
The reason this is hard is that current state is not enough information.
Two actors can both be sitting in awaiting_documents. One got there after paying a verification fee. The other got there after that fee was waived. They are in the same state. A version number cannot distinguish them. Their current state cannot distinguish them. Only their event history can.
The model we built treats this seriously. Every actor's event history places it in a prefix class — actors that have processed exactly the same sequence of event types belong to the same class, and actors in the same class must receive identical routing decisions. This is a hard requirement, not a guideline: it's what makes the rest of the model work.
A change-point is a registered point in the system's history: a deployment time, a target prefix class, and a refinement index (for hotfixes to the same point). Each change-point has an associated target version.
The routing rule — the thing every other guarantee is derived from — is this:
For an actor currently on version
v, evaluated at timeT: find the earliest registered change-point at or afterTwhose prefix class matchesv's continuation. If multiple refinements exist at that point, take the latest one. Migrate to its target. If no such change-point exists, the actor stays onv, unchanged.
That's it. That's the entire routing decision. No replay, no re-firing of side effects, no developer judgment call per actor.
What This Gives You: Determinism
The first property that falls out of this rule is determinism: given the same actor and the same registry, the routing decision is always the same.
This sounds obvious until you ask why it's guaranteed. It's not because we tested a lot of cases. It's because of what the registry structurally is.
"Earliest applicable change-point" is a minimum over a set with a total order — and a non-empty set under a total order has a unique minimum. "Latest refinement at that point" is a maximum over a finite set of integers — also unique. And each change-point maps to exactly one target version, by definition.
Three steps, each one structurally forced to have a single answer. There's no point in the chain where two different valid answers could exist. Determinism isn't a property we verified after the fact — it's a property the model can't not have, given how a change-point registry is defined.
The More Surprising One: Irreversibility
Here's a question that sounds like it should require careful engineering to answer: can a sequence of deployments ever create a routing cycle — where an actor that moved from version A to version B could, after some later deployment, get routed back to a branch anchored at A?
The model's answer is: no, and not because anything checks for it at deploy time.
Once an actor's prefix class diverges from a branch's anchor point, every change-point it can ever become eligible for inherits that diverged prefix — not the original one. There is no change-point, registered now or in the future, that is anchored to the old prefix and also applicable to this actor, because the actor's prefix simply isn't that anymore.
The practical consequence: the migration graph is a DAG by construction. Not "we recommend not creating cycles." Not "our deploy tooling validates against cycles." It is structurally impossible to construct one, in the same way it's impossible to construct a triangle whose three angles sum to more than 180° in Euclidean geometry — the geometry itself forecloses it.
This matters in practice because it means you can deploy v1 → v2 → v3 → a hotfix to v2's logic → v4, in any combination, and there is no sequence of deployments that accidentally routes an actor backward into a branch it already left. The "no rollback footguns" property isn't a feature we added. It's a consequence of the prefix-class structure.
The Hard Part: Parallel State
The base model — the routing rule above — was originally developed for flat, linear event histories. Real statecharts aren't flat. XState actors routinely have parallel regions: orthogonal pieces of state that evolve independently, where a single event might advance one region and leave another untouched.
This breaks the simple version of the model in a specific way: two different parallel regions can end up with the identical event history (e.g., both a payment region and a shipping region might independently process [CONFIRM]), which would make them look like the same prefix class to a naive routing engine — causing a change-point meant for one region to incorrectly match the other.
This was a real problem the extended formal model had to address, not a footnote. The extension — covering hierarchical and parallel statechart routing — proves a parallel determinism result with the same shape as the flat case: for a well-formed registry, regional routing is uniquely defined, with tie-breaking across deployment time, selector specificity, and refinement, in that order. We're not going to walk through how region identities are kept distinct here, but it's solved, and it's part of what the formal model covers.
What Happens When a Deployment Is Wrong
One more property worth stating, because it's the one that determines whether any of the above is usable in production: what happens when you deploy a buggy version and actors migrate into it before you catch it?
Under the routing rule above, those actors are now anchored to the buggy version's prefix class — and by irreversibility, they can't be routed back to the original branch. They're stuck on the bug. Is that a dead end?
No. The model includes a rescue mechanism: a new change-point, anchored to the marooned actors' current (post-bug) prefix class, with a corrected version as its target. This is just another instance of the same routing rule — a forward-only branch from where the actors actually are now, not a reversal of where they were. Irreversibility isn't violated, because nothing routes backward; a new path is created forward from the actor's current position.
In practice: when StateKeep can't find a safe migration target for an actor, it doesn't guess. The actor is marked needs_rescue, stays queryable and operational, and a corrected deployment targeting its current history gives it a forward path. The worst case in this model is "we need more information before we can move this actor" — never "we moved it somewhere wrong."
Why This Matters for XState
If you've worked with long-running XState actors and changed a machine definition while actors were still in flight — written a getVersion()-style guard you weren't fully confident about, or kept two versions of a workflow's logic running side by side because you couldn't prove a migration was safe — this is the formal answer to a question that's been open for years: a deterministic, paradox-free migration model for long-running actors is possible, and it's possible without replaying event logs or re-firing side effects.
StateKeep is the implementation of this model. The routing rule above — earliest applicable change-point, prefix-anchored, latest refinement — is the actual decision StateKeep's engine makes for every actor, on every deployment.
Early Access
StateKeep is in private early access. We're giving developers access to a live demo instance where you can deploy breaking changes to XState-style machines, watch the routing decisions happen, and inspect the needs_rescue path directly.
If you want access, email statekeep.support@gmail.com with a line about what kind of workflows you're building. We're especially interested in teams that have already hit the in-flight migration problem in production.
Top comments (1)
This is a remarkable formalization of deterministic actor migration for XState. I really appreciate how the model handles prefix-class routing, change-points, irreversibility, and rescue paths to ensure migrations are paradox-free and deterministic, even for complex hierarchical and parallel statecharts. The distinction between structural guarantees and post-hoc verification is very instructive for teams maintaining long-running workflows.
I’d love to collaborate and explore extending this model—experimenting with multi-region actors, automated change-point generation, and integration with CI/CD pipelines for safe workflow evolution. Sharing strategies for formal migration proofs, real-time monitoring, and rescue mechanisms could benefit teams dealing with in-flight actor migrations in production.
Would you be open to discussing collaboration to prototype additional scenarios, stress-test the routing rules, or extend StateKeep to larger, multi-agent XState environments?