StateKeep

Posted on May 29

We Solved the Hard Part of Workflow Versioning: Changing State Machines While Actors Are Still Running

#webdev #stately #tooling #xstate

State machines are easy to deploy the first time.

The hard problem starts later, when the workflow definition changes but thousands of actors are already running inside the old version.

A loan application is halfway through review.
An order is waiting for fulfillment.
A pull request is already approved.
A customer onboarding flow is stuck in verification.
A subscription lifecycle is moving through parallel billing and access states.

Then the business changes the process.

A compliance gate gets added.
A review state is split into nested substates.
A flat workflow becomes parallel.
A state is renamed or removed.
A shortcut path becomes invalid.

Now the question is not:

How do we define the new workflow?

The question is:

How do we safely move live actors from the old workflow version into the new one?

That is the in-flight workflow versioning problem.

Most teams work around it by freezing deployments, keeping old versions alive forever, branching workflow code by version, or writing one-off migration scripts against production state.

StateKeep was built to solve this directly.

The core is APV: Anchor Point Versioning.

APV solves the migration decision problem: deciding whether and how each live actor should move when a workflow definition changes.

Current state is not enough. Two actors can both be in document_review, but one may have reached it through a clean verification path while another reached it through retries, manual overrides, or an old branch.

Same state label. Different history. Different migration risk.

Anchor Point Versioning uses stable points in each actor's execution history to route actors across workflow versions. If the route is safe, the actor migrates. If no safe route exists, StateKeep marks it as needs_rescue instead of silently corrupting it.

We tested this against hierarchical and parallel XState-style workflows with intentional breaking changes: state renames, nested state changes, flat-to-parallel restructuring, and missing mappings.

Here is what happened.

Why This Problem Keeps Coming Back

Most workflow systems handle the first deployment well.

You define states, transitions, guards, actions, and persistence. New actors enter the workflow and move forward.

But real systems do not stay still.

A loan platform changes its underwriting process.
A healthcare intake flow adds a required verification step.
An order system changes fulfillment logic.
A SaaS onboarding flow splits one review stage into multiple paths.

The new definition may be correct for new actors, but what about the actors already in flight?

If you only store the current state, you usually end up with a database migration script:

UPDATE applications
SET status = 'compliance_review'
WHERE status = 'manual_review'
  AND amount >= 50000;

That might work for a simple case.

But it breaks down when the correct migration depends on how the actor reached that state.

For example, two applications may both be in manual_review:

one passed credit check, paid the fee, and entered manual review normally
another skipped a step through an older branch and was manually pushed forward

A current-state migration treats them the same.

A safe migration should not.
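The difference can be sketched in a few lines of Python. This is an illustration only, not StateKeep's data model: the `Application` record, its `history` field, and the event names (`credit_check_passed`, `fee_paid`, `manual_override`) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Application:
    app_id: str
    status: str
    history: list = field(default_factory=list)  # hypothetical event log


# Gates assumed to be required for a normal entry into manual_review.
REQUIRED_GATES = {"credit_check_passed", "fee_paid"}


def migrate_by_status(app: Application) -> str:
    """Current-state migration: only the status column is consulted."""
    return "compliance_review" if app.status == "manual_review" else app.status


def migrate_by_history(app: Application) -> str:
    """History-aware migration: route only when the required gates were passed."""
    if app.status != "manual_review":
        return app.status
    if REQUIRED_GATES.issubset(app.history):
        return "compliance_review"
    return "needs_rescue"  # no safe route, so leave the decision to an operator


clean = Application("app_1", "manual_review", ["credit_check_passed", "fee_paid"])
pushed = Application("app_2", "manual_review", ["credit_check_passed", "manual_override"])

for app in (clean, pushed):
    print(app.app_id, migrate_by_status(app), migrate_by_history(app))
```

The status-only function sends both applications to the same place. The history-aware one moves the clean application and holds back the one that skipped the fee step.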

This is the reason teams end up freezing deployments, keeping old versions alive forever, or writing increasingly fragile migration scripts.

The missing layer is a workflow-versioning engine that can answer:

Given this actor's current state and history, where does it belong in the new definition?

That is what APV is for.

What Anchor Point Versioning Does

Anchor Point Versioning, or APV, is the migration model behind StateKeep.

An anchor point is a stable, meaningful point in an actor's execution history. It might represent that the actor passed a specific gate, entered a specific branch, completed a required obligation, or reached a known point in a previous workflow version.

APV uses those anchor points to make migration decisions across versions.

It does not ask only:

What state is this actor in right now?

It asks:

How did this actor get here, and what does that mean under the new definition?

That difference matters most when workflows evolve beyond simple flat states.

A flat state rename can sometimes be handled with a simple mapping.

But real statecharts are often hierarchical or parallel:

a top-level state may contain nested substates
one flat state may become a compound state
a workflow may split into parallel regions
a state may be renamed and moved under a parent state
two actors in the same state may need different destinations based on their history

StateKeep makes migration routing explicit, previewable, and auditable. Safe actors migrate. Unsafe actors are isolated into needs_rescue instead of being silently moved into the wrong state.
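As a rough mental model, anchor-based routing can be pictured as a lookup keyed on both the old state and the most recent anchor the actor passed. The sketch below is an assumption about the general shape, not StateKeep's engine: the anchor names, the `ROUTES` table, and the `compliance_review` destination are hypothetical.

```python
from typing import Optional

# Hypothetical anchor names: stable points an actor may have passed.
ANCHORS = {"verification_passed", "manual_override"}

# Hypothetical routing table for one version change:
# (state in old definition, most recent anchor) -> state in new definition.
ROUTES = {
    ("document_review", "verification_passed"): "document_review",
    ("document_review", "manual_override"): "compliance_review",
}


def latest_anchor(history: list) -> Optional[str]:
    """Return the most recent history entry that is a known anchor."""
    for event in reversed(history):
        if event in ANCHORS:
            return event
    return None


def route(state: str, history: list) -> str:
    """Pick a destination from state plus anchor; fall back to needs_rescue."""
    return ROUTES.get((state, latest_anchor(history)), "needs_rescue")


print(route("document_review", ["submitted", "verification_passed"]))
print(route("document_review", ["submitted", "retry", "manual_override"]))
print(route("document_review", ["submitted"]))
```

Two actors with the same state label get different destinations, and an actor with no recognizable anchor gets no destination at all. The table never falls back to a closest match.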

The Test Scenario: A Breaking Workflow Change

We ran a real test suite against StateKeep using three workflow types:

a CI/CD pipeline
a pull request review workflow
a SaaS subscription lifecycle

The tests included hierarchical and parallel XState-style machines, not just flat enum states.

One simple example was a pull request review workflow.

The old workflow looked like this:

open → awaiting_review → under_review → approved → merged

Then security introduced a new policy:

open → awaiting_review → under_review → security_review → approved → merged

The business requirement was clear:

PRs that were already approved should not merge until they pass the new security review.

In a traditional system, this often becomes a migration script:

UPDATE pull_requests
SET status = 'security_review'
WHERE status = 'approved';

That is fine only if status = 'approved' carries enough information to decide where each PR belongs.

In larger workflows, it often does not.

In StateKeep, the migration is attached to the new definition as routing metadata:

stateMapping: {
  approved: 'security_review'
}

The migration is not a separate production script. It is part of the workflow deployment artifact.

The result:

sim-pr-v2 migration complete — migrated=30 failed=0
10 approved PRs routed into security_review
0 actors stranded

That is the good path: explicit mapping, clean migration, no manual database rewrite.
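To make the mapping's role concrete, here is a minimal sketch of how a per-state lookup could behave. The state names come from the flow above; the `resolve_target` helper and its precedence rule (explicit mapping first, then same-name match) are assumptions for illustration, not StateKeep's documented behavior.

```python
from typing import Optional

# States of the new pull request definition, taken from the flow above.
PR_V2_STATES = {
    "open", "awaiting_review", "under_review",
    "security_review", "approved", "merged",
}

# Routing metadata shipped with the new definition.
STATE_MAPPING = {"approved": "security_review"}


def resolve_target(state: str) -> Optional[str]:
    """Explicit mapping wins; otherwise keep the state if it still exists."""
    target = STATE_MAPPING.get(state, state)
    return target if target in PR_V2_STATES else None


print(resolve_target("approved"))      # routed by the explicit mapping
print(resolve_target("under_review"))  # unchanged, still exists in v2
print(resolve_target("draft"))         # hypothetical state with no target
```

Note that approved still exists in the new definition. A name-matching default would have left those PRs where they were, free to merge without the security review. The explicit mapping is what encodes the business requirement.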

Then We Broke It on Purpose

The real test is not the clean case.

The real test is what happens when a workflow changes in a way that cannot be safely inferred.

We intentionally deployed breaking changes without the required mappings.

One workflow changed from flat states into a more complex parallel/hierarchical structure.

The old issue workflow was simple:

const ISSUE_V1 = {
  initial: 'open',
  states: {
    open:        { on: { ASSIGN: 'in_progress' } },
    in_progress: { on: { SUBMIT_PR: 'in_review' } },
    in_review:   { on: { APPROVE: 'approved' } },
    approved:    { on: { MERGE: 'merged' } },
    merged:      { type: 'final' }
  }
};

The new version restructured the workflow:

in_progress became a parallel state with coding and checklist regions
in_review was renamed to code_review
code_review became a compound state with nested substates

A simplified version looked like this:

const ISSUE_V2 = {
  initial: 'open',
  states: {
    open: { on: { ASSIGN: 'in_progress' } },
    in_progress: {
      type: 'parallel',
      states: {
        coding: {
          initial: 'working',
          states: {
            working: { on: { SELF_REVIEW: 'reviewed' } },
            reviewed: {}
          }
        },
        checklist: {
          initial: 'pending',
          states: {
            pending: { on: { RUN_CHECKS: 'passed' } },
            passed: {}
          }
        }
      },
      on: { SUBMIT_PR: 'code_review' }
    },
    code_review: {
      initial: 'awaiting_reviewer',
      states: {
        awaiting_reviewer: { on: { REVIEWER_ASSIGNED: 'under_review' } },
        under_review: {}
      },
      on: { APPROVE: 'approved' }
    },
    approved: { on: { MERGE: 'merged' } },
    merged: { type: 'final' }
  }
};

This is the kind of change that causes real incidents:

some states still exist
some states were renamed
some states became nested
one flat state became parallel
some actors no longer have an obvious target

When we deployed the broken version without enough routing metadata, StateKeep did not guess.

It did not silently push actors into the closest-looking state.

It isolated unsafe actors into needs_rescue.

v2 migration:
  build-v2:   migrated=25 failed=5
  issue-v2:   migrated=20 failed=10
  session-v2: migrated=20 failed=10

GET /v1/actors?status=needs_rescue
  affected actors grouped by broken stateValue:
    "expiring"  → 10 actors
    "in_review" → 10 actors
    "failed"    → 5 actors
    "active"    → 3 actors

This is the safety property that matters.

A bad migration should not corrupt live actors.

The worst acceptable outcome is:

This actor cannot be safely migrated without more information.

That is what needs_rescue represents.
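One way to see why some actors had no safe target is to diff the two definitions. The sketch below uses trimmed Python copies of ISSUE_V1 and ISSUE_V2 from above and a deliberately conservative rule: a state needs a mapping if it disappeared or if its shape changed. The `shape` and `diff_states` helpers are hypothetical, and StateKeep's actual rules may differ.

```python
# Trimmed Python copies of ISSUE_V1 and ISSUE_V2; transitions are omitted.
ISSUE_V1 = {"states": {
    "open": {}, "in_progress": {}, "in_review": {},
    "approved": {}, "merged": {"type": "final"},
}}
ISSUE_V2 = {"states": {
    "open": {},
    "in_progress": {"type": "parallel", "states": {
        "coding": {"states": {"working": {}, "reviewed": {}}},
        "checklist": {"states": {"pending": {}, "passed": {}}},
    }},
    "code_review": {"states": {"awaiting_reviewer": {}, "under_review": {}}},
    "approved": {},
    "merged": {"type": "final"},
}}


def shape(node: dict) -> str:
    """Classify a state node as atomic, compound, parallel, or final."""
    if node.get("type") in ("parallel", "final"):
        return node["type"]
    return "compound" if "states" in node else "atomic"


def diff_states(old: dict, new: dict) -> dict:
    """Compare the top-level states of two definitions."""
    report = {}
    for name, node in old["states"].items():
        target = new["states"].get(name)
        if target is None:
            report[name] = "missing"
        elif shape(target) != shape(node):
            report[name] = "restructured"
        else:
            report[name] = "unchanged"
    return report


report = diff_states(ISSUE_V1, ISSUE_V2)
needs_mapping = [name for name, verdict in report.items() if verdict != "unchanged"]
print(needs_mapping)
```

Under this rule, in_review is missing and in_progress is restructured. Everything else carries over unchanged.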

The Fix Was a Deployment Mapping, Not a Database Script

After the broken deploy, the fix was an explicit state mapping attached to the next definition:

BUILD   fix: { testing: "testing", deploying: "deploying", failed: "error" }
ISSUE   fix: { in_progress: "in_progress", in_review: "code_review" }
SESSION fix: { active: "active", expiring: "active" }

Then v3 migrated successfully:

build-v3:   migrated=30 failed=0
issue-v3:   migrated=30 failed=0
session-v3: migrated=30 failed=0

Active actors continued processing. Unsafe actors were isolated until the corrected mapping was deployed.

The important part is not that a mapping exists.

The important part is where it lives.

In StateKeep, migration routing is part of the workflow deployment artifact. It is reviewable. It is testable. It can be previewed before deployment. It is not an ad hoc production database script.
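Because the mapping is plain data that ships with the definition, it can be checked like any other artifact. The following is a sketch of the kind of test a team might run in CI, using the ISSUE fix mapping from above. The `validate_mapping` helper is hypothetical and is not part of StateKeep.

```python
def validate_mapping(mapping: dict, new_states: set) -> list:
    """Return mapping entries whose target is not a state in the new definition."""
    return [
        f"{source} -> {target}"
        for source, target in mapping.items()
        if target not in new_states
    ]


# Top-level states of ISSUE_V2 and the ISSUE fix mapping shown above.
ISSUE_V2_STATES = {"open", "in_progress", "code_review", "approved", "merged"}
ISSUE_FIX = {"in_progress": "in_progress", "in_review": "code_review"}


def test_issue_fix_targets_exist() -> None:
    """Fail the build if any mapping entry points at a state that is gone."""
    assert validate_mapping(ISSUE_FIX, ISSUE_V2_STATES) == []


test_issue_fix_targets_exist()
```

A typo such as mapping in_review to itself would fail this check in review, not in production.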

Preview Before Deploying

StateKeep includes a migration preview endpoint:

POST /v1/definitions/preview

Before deploying a new workflow definition, you can ask:

how many actors would migrate cleanly?
how many actors would strand?
which state values or paths need attention?

The response gives you counts such as:

wouldMigrate: 240
wouldStrand: 12

A wouldStrand value above 0 is your signal to stop and add routing metadata before deploying.

In a traditional setup, you may only discover this after a script runs or after users report broken flows.

StateKeep moves that failure earlier, into preview.
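Conceptually, a preview is a dry run: evaluate every live actor against the new definition and count outcomes without writing anything. Here is a minimal local sketch of that idea. It does not call the endpoint. The `wouldMigrate` and `wouldStrand` names come from the response above, while `strandedByState`, the `preview` function, and the sample population are hypothetical.

```python
from collections import Counter


def preview(actor_states: list, new_states: set, mapping: dict) -> dict:
    """Dry run: count outcomes without changing any actor."""
    stranded = Counter()
    would_migrate = 0
    for state in actor_states:
        target = mapping.get(state, state)
        if target in new_states:
            would_migrate += 1
        else:
            stranded[state] += 1
    return {
        "wouldMigrate": would_migrate,
        "wouldStrand": sum(stranded.values()),
        "strandedByState": dict(stranded),  # hypothetical field name
    }


# Hypothetical population of live issue actors.
live = ["open"] * 3 + ["in_review"] * 2 + ["approved"]
v2_states = {"open", "in_progress", "code_review", "approved", "merged"}

before = preview(live, v2_states, mapping={})
after = preview(live, v2_states, mapping={"in_review": "code_review"})

if before["wouldStrand"] > 0:
    print("stop and add routing metadata for:", before["strandedByState"])
print(after)
```

Without a mapping, the two in_review actors would strand. With the in_review to code_review entry in place, all six would migrate.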

How `needs_rescue` Works

needs_rescue is StateKeep's safety mechanism for unsafe migrations.

An actor enters needs_rescue when the engine cannot safely route it into the new definition.

That can happen when:

the actor's current state no longer exists
the state was moved into a nested structure
a flat state became parallel
the old path does not satisfy the new workflow's obligations
no explicit mapping covers the actor

When that happens:

the actor is not silently migrated
its state and history remain queryable
event processing can be blocked for that actor
operators can inspect the affected actors
a corrected mapping or manual decision can resolve it
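The behavior in the list above can be modeled with a small in-memory actor. This is a toy model for illustration, not StateKeep's runtime: the `Actor` class, the `ActorBlocked` exception, and the `rescue` method are hypothetical names.

```python
class ActorBlocked(Exception):
    """Raised when an event is sent to an actor that is awaiting rescue."""


class Actor:
    def __init__(self, actor_id: str, state: str):
        self.actor_id = actor_id
        self.state = state  # last known state stays readable while isolated
        self.status = "active"

    def mark_needs_rescue(self) -> None:
        self.status = "needs_rescue"

    def send(self, event: str, transitions: dict) -> None:
        """Apply an event unless the actor is isolated."""
        if self.status == "needs_rescue":
            raise ActorBlocked(f"{self.actor_id} is in needs_rescue")
        self.state = transitions.get((self.state, event), self.state)

    def rescue(self, mapping: dict, new_states: set) -> bool:
        """Resolve with a corrected mapping; stay isolated if it still fails."""
        target = mapping.get(self.state)
        if target not in new_states:
            return False
        self.state, self.status = target, "active"
        return True


V2_STATES = {"open", "in_progress", "code_review", "approved", "merged"}
V2_TRANSITIONS = {("code_review", "APPROVE"): "approved"}

actor = Actor("issue_7", "in_review")
actor.mark_needs_rescue()
try:
    actor.send("APPROVE", V2_TRANSITIONS)
except ActorBlocked as exc:
    print("blocked:", exc)

actor.rescue({"in_review": "code_review"}, V2_STATES)
actor.send("APPROVE", V2_TRANSITIONS)
print(actor.state, actor.status)
```

While isolated, the actor keeps its old state value and rejects events. Once a corrected mapping gives it a valid target, it resumes from there.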

This is important because silent corruption is worse than visible failure.

A needs_rescue actor is operational work.

A silently corrupted actor is a business incident.

Performance

The migration/routing engine is implemented as a compiled C library and loaded by the Node.js service.

In our engine-level tests:

p50: 1.26µs per actor
p95: 1.48µs per actor
p99: 1.79µs per actor
throughput: ~97,000 actors/sec

For large migration batches, the bottleneck is usually database writes and coordination, not the routing decision itself.

End-to-end API lifecycle tests across 90 concurrent actors produced:

CI/CD workflow:        30/30 completed
PR review workflow:    30/30 completed
Subscription workflow: 30/30 completed
Total wall-clock:      ~32 seconds

Those were full lifecycle runs through the API, not just engine microbenchmarks.

Why Self-Hosted Matters

Workflow state often contains sensitive business data.

Loan applications, patient intake records, identity checks, claims, refunds, approvals, and onboarding flows can include information that many teams do not want to send to a third-party workflow cloud.

StateKeep is self-hosted by design.

Your workflow state, actor context, event history, and routing decisions stay in your infrastructure.

That matters for teams with data residency, compliance, enterprise procurement, or customer privacy requirements.

Why HTTP and JSON

StateKeep definitions are JSON, and actors are advanced over HTTP.

That means the backend language does not matter.

You can send events from:

Node.js
Python
Go
Java
Ruby
any service that can make HTTP requests

The TypeScript SDK exists, but it is not required.

A Python service can spawn an actor and send events like this:

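import httpx

# Client for a self-hosted StateKeep instance; the API key is a placeholder.
client = httpx.Client(
    base_url="https://statekeep.yourcompany.com",
    headers={"x-api-key": "sk_live_..."}
)

# Spawn a new actor from a workflow definition.
resp = client.post("/v1/actors", json={
    "definitionId": "loan-application-v1",
    "initialContext": {
        "applicantId": "user_123",
        "amount": 50000
    }
})
resp.raise_for_status()  # fail fast on a non-2xx response

loan_actor_id = resp.json()["id"]

# Advance the actor by sending it an event.
client.post(f"/v1/actors/{loan_actor_id}/event", json={
    "type": "SUBMIT"
})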

StateKeep is not trying to force your entire backend into one language or framework.

It is a workflow runtime exposed over HTTP.

Who This Is For

StateKeep is for teams that have long-running workflows and cannot afford to treat workflow versioning as an afterthought.

It is especially relevant if you are building:

loan processing
KYC or onboarding
claims processing
document approvals
order and refund flows
internal review queues
CI/CD or release pipelines
subscription lifecycle systems

The common pattern is the same:

Many live actors are inside a workflow, and the workflow needs to change.

If your current answer is a status column, a migration script, and a hope that nobody missed an edge case, StateKeep is built for the problem you are eventually going to hit.

What StateKeep Changes

StateKeep changes workflow versioning from an operational workaround into a first-class deployment concern.

Instead of:

freezing deployments
keeping old versions alive forever
branching workflow code by version
writing one-off migration scripts
discovering breakage from user reports

You get:

persistent actors
versioned workflow definitions
previewable migration impact
anchor-aware routing with APV
needs_rescue for unsafe actors
auditable routing decisions
self-hosted deployment
HTTP access from any backend

That is the difference.

The goal is not to pretend every workflow migration can be fully automatic.

The goal is to make the migration decision explicit, safe, previewable, and recoverable.

Early Access

StateKeep is in private early access.

We are giving developers access to a live demo instance where they can run the chaos simulation, try the preview endpoint, break workflow definitions, and inspect the recovery path.

If you want access, email:

statekeep.support@gmail.com

Send a line about what kind of workflows you are building.

We are especially interested in teams that have already hit the in-flight workflow versioning problem in production.

Those edge cases are exactly what StateKeep was built for.
