State machines are easy to deploy the first time.
The hard problem starts later, when the workflow definition changes but thousands of actors are already running inside the old version.
- A loan application is halfway through review.
- An order is waiting for fulfillment.
- A pull request is already approved.
- A customer onboarding flow is stuck in verification.
- A subscription lifecycle is moving through parallel billing and access states.
Then the business changes the process.
- A compliance gate gets added.
- A review state is split into nested substates.
- A flat workflow becomes parallel.
- A state is renamed or removed.
- A shortcut path becomes invalid.
Now the question is not:
How do we define the new workflow?
The question is:
How do we safely move live actors from the old workflow version into the new one?
That is the in-flight workflow versioning problem.
Most teams work around it by freezing deployments, keeping old versions alive forever, branching workflow code by version, or writing one-off migration scripts against production state.
StateKeep was built to solve this directly.
The core is APV: Anchor Point Versioning.
APV solves the migration decision problem: deciding whether and how each live actor should move when a workflow definition changes.
Current state is not enough. Two actors can both be in document_review, but one may have reached it through a clean verification path while another reached it through retries, manual overrides, or an old branch.
Same state label. Different history. Different migration risk.
Anchor Point Versioning uses stable points in each actor's execution history to route actors across workflow versions. If the route is safe, the actor migrates. If no safe route exists, StateKeep marks it as needs_rescue instead of silently corrupting it.
We tested this against hierarchical and parallel XState-style workflows with intentional breaking changes: state renames, nested state changes, flat-to-parallel restructuring, and missing mappings.
Here is what happened.
Why This Problem Keeps Coming Back
Most workflow systems handle the first deployment well.
You define states, transitions, guards, actions, and persistence. New actors enter the workflow and move forward.
But real systems do not stay still.
- A loan platform changes its underwriting process.
- A healthcare intake flow adds a required verification step.
- An order system changes fulfillment logic.
- A SaaS onboarding flow splits one review stage into multiple paths.
The new definition may be correct for new actors, but what about the actors already in flight?
If you only store the current state, you usually end up with a database migration script:
UPDATE applications
SET status = 'compliance_review'
WHERE status = 'manual_review'
  AND amount >= 50000;
That might work for a simple case.
But it breaks down when the correct migration depends on how the actor reached that state.
For example, two applications may both be in manual_review:
- one passed credit check, paid the fee, and entered manual review normally
- another skipped a step through an older branch and was manually pushed forward
A current-state migration treats them the same.
A safe migration should not.
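A minimal sketch of that difference, in Python. The actor records, field names, and event names (CREDIT_CHECK_PASSED, FEE_PAID, MANUAL_OVERRIDE) are hypothetical and are not StateKeep's data model. The point is only that a filter on status selects both applications, while a check on history separates them.

```python
# Hypothetical actor records: same current state, different histories.
applications = [
    {"id": "app_1", "state": "manual_review",
     "history": ["SUBMITTED", "CREDIT_CHECK_PASSED", "FEE_PAID"]},
    {"id": "app_2", "state": "manual_review",
     "history": ["SUBMITTED", "MANUAL_OVERRIDE"]},
]

# Events an application must have seen to count as a normal entry (assumed).
REQUIRED_EVENTS = {"CREDIT_CHECK_PASSED", "FEE_PAID"}


def selected_by_status(apps, status):
    """IDs a status-only UPDATE ... WHERE status = ... would touch."""
    return [a["id"] for a in apps if a["state"] == status]


def entered_normally(app):
    """True if the history contains every required event."""
    return REQUIRED_EVENTS.issubset(app["history"])


status_only = selected_by_status(applications, "manual_review")
history_aware = [a["id"] for a in applications if entered_normally(a)]
```

Here status_only contains both applications, while history_aware contains only the one that took the normal path.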
This is the reason teams end up freezing deployments, keeping old versions alive forever, or writing increasingly fragile migration scripts.
The missing layer is a workflow-versioning engine that can answer:
Given this actor's current state and history, where does it belong in the new definition?
That is what APV is for.
What Anchor Point Versioning Does
Anchor Point Versioning, or APV, is the migration model behind StateKeep.
An anchor point is a stable, meaningful point in an actor's execution history. It might represent that the actor passed a specific gate, entered a specific branch, completed a required obligation, or reached a known point in a previous workflow version.
APV uses those anchor points to make migration decisions across versions.
It does not ask only:
What state is this actor in right now?
It asks:
How did this actor get here, and what does that mean under the new definition?
That difference matters most when workflows evolve beyond simple flat states.
A flat state rename can sometimes be handled with a simple mapping.
But real statecharts are often hierarchical or parallel:
- a top-level state may contain nested substates
- one flat state may become a compound state
- a workflow may split into parallel regions
- a state may be renamed and moved under a parent state
- two actors in the same state may need different destinations based on their history
StateKeep makes migration routing explicit, previewable, and auditable. Safe actors migrate. Unsafe actors are isolated into needs_rescue instead of being silently moved into the wrong state.
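To make the idea concrete, here is a rough sketch of anchor-based routing in Python. It is not StateKeep's engine or API: the AnchorRule type, the anchor names, and the first-match-wins ordering are all assumptions chosen for illustration. It only shows the shape of the decision: same state, different anchors, different destination, and needs_rescue when nothing matches.

```python
from typing import NamedTuple


class AnchorRule(NamedTuple):
    from_state: str       # state value in the old definition
    required_anchor: str  # anchor that must appear in the actor's history
    target: str           # state in the new definition


# Hypothetical rules. They are checked in order, so the more cautious
# rule (manual override) is listed before the clean-path rule.
RULES = [
    AnchorRule("document_review", "manual_override", "compliance_review"),
    AnchorRule("document_review", "verification_passed", "document_review"),
]

NEEDS_RESCUE = "needs_rescue"


def route(state: str, anchors: list[str], rules=RULES) -> str:
    """Return the target state, or NEEDS_RESCUE when no rule applies."""
    for rule in rules:
        if rule.from_state == state and rule.required_anchor in anchors:
            return rule.target
    return NEEDS_RESCUE


clean = route("document_review", ["verification_passed"])
overridden = route("document_review", ["verification_passed", "manual_override"])
unknown = route("document_review", [])
```

The fallback matters more than the rules: an actor with no matching anchor is never given a guessed destination.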
The Test Scenario: A Breaking Workflow Change
We ran a real test suite against StateKeep using three workflow types:
- a CI/CD pipeline
- a pull request review workflow
- a SaaS subscription lifecycle
The tests included hierarchical and parallel XState-style machines, not just flat enum states.
One simple example was a pull request review workflow.
The old workflow looked like this:
open → awaiting_review → under_review → approved → merged
Then security introduced a new policy:
open → awaiting_review → under_review → security_review → approved → merged
The business requirement was clear:
PRs that were already approved should not merge until they pass the new security review.
In a traditional system, this often becomes a migration script:
UPDATE pull_requests
SET status = 'security_review'
WHERE status = 'approved';
That is fine only if the status value approved, on its own, carries enough information to decide where the PR belongs.
In larger workflows, it often does not.
In StateKeep, the migration is attached to the new definition as routing metadata:
stateMapping: {
  approved: 'security_review'
}
The migration is not a separate production script. It is part of the workflow deployment artifact.
The result:
sim-pr-v2 migration complete — migrated=30 failed=0
10 approved PRs routed into security_review
0 actors stranded
That is the good path: explicit mapping, clean migration, no manual database rewrite.
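The effect of that mapping can be sketched in a few lines of Python. This is an illustration under assumptions, not StateKeep's implementation: the migrate_state helper is hypothetical, and apart from the 10 approved PRs the distribution of the 30 actors across states is made up.

```python
from collections import Counter

NEW_STATES = {"open", "awaiting_review", "under_review",
              "security_review", "approved", "merged"}
STATE_MAPPING = {"approved": "security_review"}


def migrate_state(old_state, mapping, new_states):
    """Explicit mapping wins; unmapped states carry over only if they still exist."""
    target = mapping.get(old_state, old_state)
    return target if target in new_states else None  # None = no safe target


# Illustrative population: only the 10 approved PRs come from the run above.
old_states = (["open"] * 5 + ["awaiting_review"] * 5
              + ["under_review"] * 10 + ["approved"] * 10)
migrated = [migrate_state(s, STATE_MAPPING, NEW_STATES) for s in old_states]
counts = Counter(migrated)
```

Note that approved still exists in the new definition, so the explicit mapping has to take precedence over "same name, same state" for the policy to hold.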
Then We Broke It on Purpose
The real test is not the clean case.
The real test is what happens when a workflow changes in a way that cannot be safely inferred.
We intentionally deployed breaking changes without the required mappings.
One workflow changed from flat states into a more complex parallel/hierarchical structure.
The old issue workflow was simple:
const ISSUE_V1 = {
  initial: 'open',
  states: {
    open: { on: { ASSIGN: 'in_progress' } },
    in_progress: { on: { SUBMIT_PR: 'in_review' } },
    in_review: { on: { APPROVE: 'approved' } },
    approved: { on: { MERGE: 'merged' } },
    merged: { type: 'final' }
  }
};
The new version restructured the workflow:
- in_progress became a parallel state with coding and checklist regions
- in_review was renamed to code_review
- code_review became a compound state with nested substates
A simplified version looked like this:
const ISSUE_V2 = {
  initial: 'open',
  states: {
    open: { on: { ASSIGN: 'in_progress' } },
    in_progress: {
      type: 'parallel',
      states: {
        coding: {
          initial: 'working',
          states: {
            working: { on: { SELF_REVIEW: 'reviewed' } },
            reviewed: {}
          }
        },
        checklist: {
          initial: 'pending',
          states: {
            pending: { on: { RUN_CHECKS: 'passed' } },
            passed: {}
          }
        }
      },
      on: { SUBMIT_PR: 'code_review' }
    },
    code_review: {
      initial: 'awaiting_reviewer',
      states: {
        awaiting_reviewer: { on: { REVIEWER_ASSIGNED: 'under_review' } },
        under_review: {}
      },
      on: { APPROVE: 'approved' }
    },
    approved: { on: { MERGE: 'merged' } },
    merged: { type: 'final' }
  }
};
This is the kind of change that causes real incidents:
- some states still exist
- some states were renamed
- some states became nested
- one flat state became parallel
- some actors no longer have an obvious target
When we deployed the broken version without enough routing metadata, StateKeep did not guess.
It did not silently push actors into the closest-looking state.
It isolated unsafe actors into needs_rescue.
v2 migration:
build-v2: migrated=25 failed=5
issue-v2: migrated=20 failed=10
session-v2: migrated=20 failed=10
GET /v1/actors?status=needs_rescue
affected actors grouped by broken stateValue:
"expiring" → 10 actors
"in_review" → 10 actors
"failed" → 5 actors
"active" → 3 actors
This is the safety property that matters.
A bad migration should not corrupt live actors.
The worst acceptable outcome is:
This actor cannot be safely migrated without more information.
That is what needs_rescue represents.
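The grouping in the output above is easy to reproduce on the client side once you have the list of stranded actors. The sketch below assumes each item carries status and stateValue fields; the article does not show the response shape of GET /v1/actors, so treat the records as hypothetical.

```python
from collections import Counter


def group_by_state_value(actors):
    """Count needs_rescue actors per stateValue, largest group first."""
    counts = Counter(
        a["stateValue"] for a in actors if a["status"] == "needs_rescue"
    )
    return counts.most_common()


# Hypothetical response items; the real payload may differ.
actors = (
    [{"id": f"issue-{i}", "status": "needs_rescue", "stateValue": "in_review"}
     for i in range(10)]
    + [{"id": f"build-{i}", "status": "needs_rescue", "stateValue": "failed"}
       for i in range(5)]
    + [{"id": f"issue-ok-{i}", "status": "active", "stateValue": "open"}
       for i in range(3)]
)
groups = group_by_state_value(actors)
```

Sorting by group size puts the mapping that would unblock the most actors at the top of the list.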
The Fix Was a Deployment Mapping, Not a Database Script
After the broken deploy, the fix was explicit state mapping attached to the next definition:
BUILD fix: { testing: "testing", deploying: "deploying", failed: "error" }
ISSUE fix: { in_progress: "in_progress", in_review: "code_review" }
SESSION fix: { active: "active", expiring: "active" }
Then v3 migrated successfully:
build-v3: migrated=30 failed=0
issue-v3: migrated=30 failed=0
session-v3: migrated=30 failed=0
Active actors continued processing. Unsafe actors were isolated until the corrected mapping was deployed.
The important part is not that a mapping exists.
The important part is where it lives.
In StateKeep, migration routing is part of the workflow deployment artifact. It is reviewable. It is testable. It can be previewed before deployment. It is not an ad hoc production database script.
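Because the mapping is plain data, it can be checked in an ordinary unit test before it goes anywhere near production. A minimal sketch, assuming a hypothetical check_mapping helper and looking only at top-level state names (it does not validate nested or parallel targets):

```python
def check_mapping(mapping, new_top_level_states, stranded_states):
    """Return a list of problems; an empty list means nothing was found."""
    problems = []
    for source, target in mapping.items():
        if target not in new_top_level_states:
            problems.append(f"{source} -> {target}: target not in new definition")
    for state in sorted(stranded_states):
        if state not in mapping:
            problems.append(f"{state}: stranded state has no mapping")
    return problems


# Top-level states of ISSUE_V2 and the ISSUE fix mapping from this article.
ISSUE_V2_STATES = {"open", "in_progress", "code_review", "approved", "merged"}
ISSUE_FIX = {"in_progress": "in_progress", "in_review": "code_review"}

ok = check_mapping(ISSUE_FIX, ISSUE_V2_STATES, {"in_review"})
bad = check_mapping({"in_review": "review"}, ISSUE_V2_STATES,
                    {"in_review", "in_progress"})
```

A check like this catches typos in target names and forgotten states in code review, which is the practical meaning of "reviewable" and "testable" here.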
Preview Before Deploying
StateKeep includes a migration preview endpoint:
POST /v1/definitions/preview
Before deploying a new workflow definition, you can ask:
- how many actors would migrate cleanly?
- how many actors would strand?
- which state values or paths need attention?
The response gives you counts such as:
wouldMigrate: 240
wouldStrand: 12
That wouldStrand > 0 result is your signal to stop and add routing metadata before deploying.
In a traditional setup, you may only discover this after a script runs or after users report broken flows.
StateKeep moves that failure earlier, into preview.
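One way to use those counts is as a deployment gate in CI. The sketch below works on the parsed preview response and uses only the two fields shown above; the request body for POST /v1/definitions/preview is not shown in this article, so the HTTP call itself is left out and the function names are hypothetical.

```python
def safe_to_deploy(preview: dict) -> bool:
    """True only when the preview reports zero actors that would strand."""
    # A missing field raises KeyError on purpose: a malformed response
    # must not be mistaken for a clean preview.
    return preview["wouldStrand"] == 0


def describe(preview: dict) -> str:
    """One-line summary suitable for a CI log."""
    verdict = "OK" if safe_to_deploy(preview) else "BLOCKED"
    return (f"{verdict}: wouldMigrate={preview['wouldMigrate']} "
            f"wouldStrand={preview['wouldStrand']}")


example = {"wouldMigrate": 240, "wouldStrand": 12}
summary = describe(example)
```

With the example counts above, the gate blocks the deploy until the missing routing metadata is added.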
How needs_rescue Works
needs_rescue is StateKeep's safety mechanism for unsafe migrations.
An actor enters needs_rescue when the engine cannot safely route it into the new definition.
That can happen when:
- the actor's current state no longer exists
- the state was moved into a nested structure
- a flat state became parallel
- the old path does not satisfy the new workflow's obligations
- no explicit mapping covers the actor
When that happens:
- the actor is not silently migrated
- its state and history remain queryable
- event processing can be blocked for that actor
- operators can inspect the affected actors
- a corrected mapping or manual decision can resolve it
This is important because silent corruption is worse than visible failure.
A needs_rescue actor is operational work.
A silently corrupted actor is a business incident.
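The blocking behavior can be pictured with a small sketch. This is not StateKeep's code: the actor dictionary, the ActorBlocked exception, and the transition table are hypothetical, and the only claim is the shape of the guard, which rejects events for a stranded actor while leaving its state and history untouched.

```python
class ActorBlocked(Exception):
    """Raised when an event is sent to an actor that is awaiting rescue."""


def apply_event(actor: dict, event: str, transitions: dict) -> dict:
    """Return the advanced actor, or raise if the actor is in needs_rescue."""
    if actor["status"] == "needs_rescue":
        raise ActorBlocked(f"actor {actor['id']} needs rescue; {event} rejected")
    next_state = transitions.get(actor["state"], {}).get(event)
    if next_state is None:
        return actor  # event not handled in this state: no change
    return {**actor, "state": next_state, "history": actor["history"] + [event]}


transitions = {"open": {"ASSIGN": "in_progress"}}
healthy = {"id": "a1", "status": "active", "state": "open", "history": []}
stranded = {"id": "a2", "status": "needs_rescue", "state": "in_review",
            "history": ["ASSIGN", "SUBMIT_PR"]}

advanced = apply_event(healthy, "ASSIGN", transitions)
try:
    apply_event(stranded, "APPROVE", transitions)
    blocked = False
except ActorBlocked:
    blocked = True
```

The stranded actor is still fully inspectable afterwards, which is what lets an operator resolve it with a corrected mapping or a manual decision.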
Performance
The migration/routing engine is implemented as a compiled C library and loaded by the Node.js service.
In our engine-level tests:
p50: 1.26µs per actor
p95: 1.48µs per actor
p99: 1.79µs per actor
throughput: ~97,000 actors/sec
For large migration batches, the bottleneck is usually database writes and coordination, not the routing decision itself.
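A quick back-of-envelope calculation shows why. The throughput figure is the one quoted above; the per-actor database write cost is a hypothetical placeholder, not a measurement, so substitute your own number.

```python
THROUGHPUT_PER_SEC = 97_000    # engine-level figure quoted above
DB_WRITE_SECONDS = 0.001       # hypothetical per-actor write cost, not measured


def routing_seconds(actor_count: int) -> float:
    """Engine time to make routing decisions for actor_count actors."""
    return actor_count / THROUGHPUT_PER_SEC


def serial_write_seconds(actor_count: int) -> float:
    """Time to persist results one at a time at the assumed write cost."""
    return actor_count * DB_WRITE_SECONDS


routing = routing_seconds(1_000_000)       # roughly 10.3 seconds
writing = serial_write_seconds(1_000_000)  # 1000 seconds under the assumption
```

Under that assumption, persisting a million results serially takes about a hundred times longer than deciding them.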
End-to-end API lifecycle tests across 90 concurrent actors produced:
CI/CD workflow: 30/30 completed
PR review workflow: 30/30 completed
Subscription workflow: 30/30 completed
Total wall-clock: ~32 seconds
Those were full lifecycle runs through the API, not just engine microbenchmarks.
Why Self-Hosted Matters
Workflow state often contains sensitive business data.
Loan applications, patient intake records, identity checks, claims, refunds, approvals, and onboarding flows can include information that many teams do not want to send to a third-party workflow cloud.
StateKeep is self-hosted by design.
Your workflow state, actor context, event history, and routing decisions stay in your infrastructure.
That matters for teams with data residency, compliance, enterprise procurement, or customer privacy requirements.
Why HTTP and JSON
StateKeep definitions are JSON, and actors are advanced over HTTP.
That means the backend language does not matter.
You can send events from:
- Node.js
- Python
- Go
- Java
- Ruby
- any service that can make HTTP requests
The TypeScript SDK exists, but it is not required.
A Python service can spawn an actor and send events like this:
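import httpx

# Client configured with your StateKeep base URL and API key
client = httpx.Client(
    base_url="https://statekeep.yourcompany.com",
    headers={"x-api-key": "sk_live_..."},
)

# Spawn an actor from a deployed definition
resp = client.post("/v1/actors", json={
    "definitionId": "loan-application-v1",
    "initialContext": {
        "applicantId": "user_123",
        "amount": 50000,
    },
})
resp.raise_for_status()  # stop here if the spawn request failed
loan_actor_id = resp.json()["id"]

# Send an event to advance the actor
client.post(f"/v1/actors/{loan_actor_id}/event", json={
    "type": "SUBMIT",
})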
StateKeep is not trying to force your entire backend into one language or framework.
It is a workflow runtime exposed over HTTP.
Who This Is For
StateKeep is for teams that have long-running workflows and cannot afford to treat workflow versioning as an afterthought.
It is especially relevant if you are building:
- loan processing
- KYC or onboarding
- claims processing
- document approvals
- order and refund flows
- internal review queues
- CI/CD or release pipelines
- subscription lifecycle systems
The common pattern is the same:
Many live actors are inside a workflow, and the workflow needs to change.
If your current answer is a status column, a migration script, and a hope that nobody missed an edge case, StateKeep is built for the problem you are eventually going to hit.
What StateKeep Changes
StateKeep changes workflow versioning from an operational workaround into a first-class deployment concern.
Instead of:
- freezing deployments
- keeping old versions alive forever
- branching code by schema version
- writing one-off migration scripts
- discovering breakage from user reports
You get:
- persistent actors
- versioned workflow definitions
- previewable migration impact
- anchor-aware routing with APV
- needs_rescue for unsafe actors
- auditable routing decisions
- self-hosted deployment
- HTTP access from any backend
That is the difference.
The goal is not to pretend every workflow migration can be fully automatic.
The goal is to make the migration decision explicit, safe, previewable, and recoverable.
Early Access
StateKeep is in private early access.
We are giving developers access to a live demo instance where they can run the chaos simulation, try the preview endpoint, break workflow definitions, and inspect the recovery path.
If you want access, email:
statekeep.support@gmail.com
Send a line about what kind of workflows you are building.
We are especially interested in teams that have already hit the in-flight workflow versioning problem in production.
Those edge cases are exactly what StateKeep was built for.