StateKeep

The XState persistence problem is five years old. Here is what we built to finally solve it.

In 2019 someone opened a GitHub issue in the XState repository. The title was "How do I persist XState actor state between server restarts?" It became one of the most-upvoted open issues in the repo. The answer from the core team was honest: XState doesn't handle persistence. Serialize the state object and store it yourself.
Five years later, that answer hasn't changed. Dozens of blog posts show developers how to roll their own persistence layer. Every one of them describes a developer rebuilding the same infrastructure, none of which has anything to do with their actual product.
We got tired of reinventing it. So we built StateKeep.

The problem with rolling your own
Persisting XState state sounds straightforward. Call actor.getSnapshot(), serialize it, store it in Redis or Postgres. On startup, rehydrate. Done.
It works. Until it doesn't.
XState v5 shipped in 2023 and changed the snapshot format. Teams with actors persisted in the old format had a bad week. Some serialized state just crashed on deserialization. Others silently corrupted context. The format was never treated as a public API — because XState is a library, and libraries are not responsible for what you do with their output.
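The roll-your-own approach and its failure mode can be sketched as follows. This is a minimal illustration, not XState's real snapshot format: the `StoredSnapshot` shape, the `formatVersion` field, and the in-memory `store` standing in for Redis or Postgres are all hypothetical. The point is that a persisted snapshot with no format marker cannot be told apart from a current one, while a tagged envelope at least fails loudly instead of silently corrupting context.

```typescript
// Hypothetical stored shape for illustration; NOT XState's actual snapshot format.
interface StoredSnapshot {
  formatVersion: number;
  stateValue: string;
  context: Record<string, unknown>;
}

const CURRENT_FORMAT = 2;

// In-memory stand-in for Redis or Postgres.
const store = new Map<string, string>();

function saveSnapshot(
  actorId: string,
  stateValue: string,
  context: Record<string, unknown>
): void {
  const envelope: StoredSnapshot = { formatVersion: CURRENT_FORMAT, stateValue, context };
  store.set(actorId, JSON.stringify(envelope));
}

function loadSnapshot(actorId: string): StoredSnapshot | undefined {
  const raw = store.get(actorId);
  if (raw === undefined) return undefined;
  const parsed = JSON.parse(raw) as StoredSnapshot;
  // Refuse to rehydrate a snapshot written in a different format.
  if (parsed.formatVersion !== CURRENT_FORMAT) {
    throw new Error(
      `Snapshot for ${actorId} uses format ${parsed.formatVersion}, expected ${CURRENT_FORMAT}`
    );
  }
  return parsed;
}

// A snapshot written by the current release round-trips cleanly.
saveSnapshot("order-1", "awaiting_payment", { total: 42 });

// A snapshot written by an older release, before the format changed.
store.set("order-0", JSON.stringify({ formatVersion: 1, value: "paid" }));
```

Loading `order-1` returns the stored state, while loading `order-0` throws. Detecting the mismatch is the easy part. Deciding what to do with every old snapshot is the migration problem.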
Even before the v5 breakage, there was the migration problem. Your order management workflow lives in a table. You have 40,000 active orders. Your product team needs to add a quality review step. You write a migration script. You test it in staging. You run it in production at midnight. You discover that 847 orders were in an edge state you didn't account for. You spend the next three hours fixing them manually.
This is not an XState problem. It is not a developer competence problem. It is an infrastructure gap. There is no layer between "XState the library" and "your application database" that takes responsibility for keeping actors alive through code changes.

What Stately built — and where it ends
Stately saw this problem. They built Stately Cloud: a hosted service that persists your XState actors, keeps them alive between restarts, and gives you an API to send events and read state.
It is a real solution for the right use case. If you are building a side project, you are a JavaScript shop, and your data can live on their servers — Stately Cloud is worth evaluating.
Three things make it a hard no for a large chunk of teams:

  1. Data residency. Your actor context contains your customer's data. For any team in fintech, healthcare, insurance, or enterprise SaaS with compliance requirements — sending that data to a third-party hosted service is often not an option. Stately Cloud has no self-hosted deployment.
  2. Language lock-in. Stately Cloud requires XState. Not "XState-compatible JSON" — actual XState TypeScript. If your backend is Python, Go, Java, or anything other than JavaScript, you are locked out.
  3. No path-based migration. When you update a workflow definition, Stately Cloud does not help you decide which of your 40,000 in-flight actors should move to the new version and where they should land. You write that logic yourself.

The third is the one we spent the most time on.

The migration problem is harder than it looks
Here is a scenario that sounds simple but breaks every migration tool we have seen.
You have a loan application workflow. 50,000 active applications. You need to add a compliance check step — but only for applications that went through the paid verification path, because that is the regulatory requirement for that specific path.
Your database has 50,000 actors. Some paid the verification fee. Some waived it. Both groups are currently in awaiting_docs. They are in the same state. They look identical to any query that reads current state.
Any system that routes migrations by current state will treat them identically. That is the wrong answer.
The correct answer requires looking at each actor's history. An actor that processed PAY_FEE belongs in the new compliance check flow. An actor that processed WAIVE_FEE does not. The only way to know which group an actor belongs to is to look at what events it has already processed.
This is the problem we built StateKeep to solve.
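The difference between the two routing strategies can be sketched in a few lines. This is an illustration under stated assumptions, not StateKeep's implementation: the `ActorRecord` shape, the stored `history` array, and both routing functions are hypothetical names, and "matches" is taken to mean that the processed events equal the declared path exactly.

```typescript
// Hypothetical record shape: each actor keeps the event types it has processed, in order.
interface ActorRecord {
  actorId: string;
  stateValue: string;
  history: string[];
}

// Route by current state: both groups sit in the same state, so both match.
function routeByState(actor: ActorRecord, state: string): boolean {
  return actor.stateValue === state;
}

// Route by history: match only if the processed events equal the declared path.
function routeByHistory(actor: ActorRecord, path: string[]): boolean {
  return (
    actor.history.length === path.length &&
    actor.history.every((eventType, i) => eventType === path[i])
  );
}

const actors: ActorRecord[] = [
  { actorId: "app-paid", stateValue: "awaiting_docs", history: ["SUBMIT_INFO", "PAY_FEE"] },
  { actorId: "app-waived", stateValue: "awaiting_docs", history: ["SUBMIT_INFO", "WAIVE_FEE"] },
];

const paidPath = ["SUBMIT_INFO", "PAY_FEE"];

// State-based routing selects both applications.
const selectedByState = actors
  .filter((a) => routeByState(a, "awaiting_docs"))
  .map((a) => a.actorId);

// History-based routing selects only the application that paid the fee.
const selectedByHistory = actors
  .filter((a) => routeByHistory(a, paidPath))
  .map((a) => a.actorId);
```

Here `selectedByState` contains both actors and `selectedByHistory` contains only `app-paid`, which is the regulatory split the scenario requires.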

How StateKeep handles it
StateKeep is a self-hosted statechart hosting platform. You deploy XState-compatible JSON definitions via HTTP, spawn actors, and send events. Any backend language works — Python, Go, Java, Node, anything with an HTTP client.
When you deploy a new version, you declare a historyPath:
```json
{
  "id": "loan-v2",
  "parentId": "loan-v1",
  "historyPath": ["SUBMIT_INFO", "PAY_FEE"],
  "definition": { ... }
}
```

StateKeep evaluates every actor. Each one carries a fingerprint of its event history — a rolling FNV-1a hash of every event type it has processed in order. Actors whose history matches the declared path migrate. Actors whose history does not match stay on the current version.
Alice paid the fee. Her fingerprint matches. She migrates to loan-v2, where the new definition routes her through the compliance check step before approval.
Bob waived the fee. His fingerprint does not match. He stays on loan-v1, continuing normally.
Both actors keep running. Neither restarts. Neither loses context. No migration script. No midnight deployment anxiety.
The routing is not based on engineering confidence. It is backed by a formal mathematical proof that guarantees every actor ends up on exactly the correct version, with no actor evaluated twice, regardless of timing or evaluation order. The algorithm runs in native C at a median (p50) of 1.26µs per actor. At that rate, evaluating 50,000 actors takes roughly 63ms, well under a second on modest hardware.
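The fingerprint comparison described above might look like the following sketch. The FNV-1a offset basis and prime are the standard 32-bit constants, but everything else is an assumption made for illustration: the 32-bit width, the newline delimiter between event types, the ASCII-only event names, and the rule that an actor matches when its fingerprint equals the fingerprint of the declared path. The function names are hypothetical and this is TypeScript, not the native C engine.

```typescript
// Standard FNV-1a 32-bit constants.
const FNV_OFFSET_BASIS = 0x811c9dc5;
const FNV_PRIME = 0x01000193;

// Fold the characters of `text` into a running FNV-1a hash.
// Assumes ASCII event names, so each char code fits in one byte.
function fnv1aFold(hash: number, text: string): number {
  let h = hash;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i) & 0xff;
    h = Math.imul(h, FNV_PRIME) >>> 0; // 32-bit multiply, kept unsigned
  }
  return h;
}

// Advance a fingerprint by one processed event. The "\n" delimiter (an
// assumption) keeps ["AB", "C"] distinct from ["A", "BC"].
function recordEvent(fingerprint: number, eventType: string): number {
  return fnv1aFold(fingerprint, eventType + "\n");
}

// Fingerprint of a whole declared path, computed the same way.
function fingerprintOf(path: string[]): number {
  return path.reduce((acc, eventType) => recordEvent(acc, eventType), FNV_OFFSET_BASIS);
}

const declared = fingerprintOf(["SUBMIT_INFO", "PAY_FEE"]);

// Alice and Bob each update their fingerprint as events arrive.
let alice = FNV_OFFSET_BASIS;
alice = recordEvent(alice, "SUBMIT_INFO");
alice = recordEvent(alice, "PAY_FEE");

let bob = FNV_OFFSET_BASIS;
bob = recordEvent(bob, "SUBMIT_INFO");
bob = recordEvent(bob, "WAIVE_FEE");

const aliceMigrates = alice === declared;
const bobMigrates = bob === declared;
```

Because the hash is rolling, each actor stores a single number rather than its full event log, and the per-actor check at deploy time is one integer comparison.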

What it looks like in practice
```typescript
import { createClient } from '@statekeep/sdk';

const sk = createClient({
  baseUrl: 'https://your-instance.com',
  apiKey: 'sk_...'
});

// Deploy a machine definition
await sk.deploy('loan-v1', {
  id: 'loan', initial: 'submitted',
  states: {
    submitted:       { on: { SUBMIT_INFO: 'under_review' } },
    under_review:    { on: { PAY_FEE: 'awaiting_docs',
                             WAIVE_FEE: 'awaiting_docs' } },
    awaiting_docs:   { on: { APPROVE: 'approved', REJECT: 'rejected' } },
    approved:        { type: 'final' },
    rejected:        { type: 'final' },
  }
});

// Spawn one actor per loan application
const actor = await sk.spawn('loan-v1', {
  applicantId: 'usr-001',
  loanAmount: 25000
});

// Send events as things happen in your system
await sk.send(actor.actorId, 'SUBMIT_INFO');
await sk.send(actor.actorId, 'PAY_FEE');

// Later: deploy a new version targeting only paid-path actors.
// newDefinition is the updated machine definition (not shown here).
await sk.deploy('loan-v2', newDefinition, {
  parentId: 'loan-v1',
  historyPath: ['SUBMIT_INFO', 'PAY_FEE'],
});
// Only actors who processed PAY_FEE migrate. Everyone else stays.
```

You can preview the migration before committing:

```typescript
const preview = await sk.preview('loan-v2', newDefinition, {
  parentId: 'loan-v1',
  historyPath: ['SUBMIT_INFO', 'PAY_FEE'],
});
console.log(preview.migration.wouldMigrate.length); // 1,203
console.log(preview.migration.wouldStay.length);    // 847
```

The preview uses the exact same evaluation function as the live deployment. What you see is what will happen.
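One way to get that guarantee is to route both code paths through a single pure function. The sketch below shows the idea. The `Actor` shape, the `evaluate`, `previewMigration`, and `deployMigration` names, and the exact-match rule are all hypothetical, and the result fields only borrow the `wouldMigrate` and `wouldStay` names from the SDK example above.

```typescript
interface Actor {
  actorId: string;
  version: string;
  history: string[]; // event types processed, in order
}

interface Evaluation {
  wouldMigrate: string[];
  wouldStay: string[];
}

// Single source of truth: pure, no writes, same input gives same output.
function evaluate(actors: Actor[], parentId: string, historyPath: string[]): Evaluation {
  const result: Evaluation = { wouldMigrate: [], wouldStay: [] };
  const wanted = historyPath.join("\n");
  for (const actor of actors) {
    if (actor.version !== parentId) continue; // only actors on the parent version are candidates
    const matches = actor.history.join("\n") === wanted;
    (matches ? result.wouldMigrate : result.wouldStay).push(actor.actorId);
  }
  return result;
}

// Preview: report the evaluation and change nothing.
function previewMigration(actors: Actor[], parentId: string, historyPath: string[]): Evaluation {
  return evaluate(actors, parentId, historyPath);
}

// Deploy: run the same evaluation, then apply it.
function deployMigration(
  actors: Actor[],
  newId: string,
  parentId: string,
  historyPath: string[]
): Evaluation {
  const result = evaluate(actors, parentId, historyPath);
  const migrating = new Set(result.wouldMigrate);
  for (const actor of actors) {
    if (migrating.has(actor.actorId)) actor.version = newId;
  }
  return result;
}

const fleet: Actor[] = [
  { actorId: "a1", version: "loan-v1", history: ["SUBMIT_INFO", "PAY_FEE"] },
  { actorId: "a2", version: "loan-v1", history: ["SUBMIT_INFO", "WAIVE_FEE"] },
];

const previewed = previewMigration(fleet, "loan-v1", ["SUBMIT_INFO", "PAY_FEE"]);
const deployed = deployMigration(fleet, "loan-v2", "loan-v1", ["SUBMIT_INFO", "PAY_FEE"]);
```

Since the preview never writes and the deploy calls the same function before writing, the two results can only differ if the actors themselves changed in between.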

What it does not do
StateKeep tracks state. It does not execute your code.
Guards (guard: 'isEligible') are ignored entirely: every transition fires unconditionally when the matching event arrives. Do not rely on guards to block invalid transitions. Check eligibility in your backend before calling send.
Actions (actions: 'sendEmail') are no-ops: the state changes but nothing executes. Your backend reads the new stateValue from the response and handles side effects itself.
This is a deliberate design choice. The engine is pure data: JSON definitions in, state transitions out. No secrets, no database connections, no application context. Your business logic stays in your application where it belongs.
The upside: migrations never accidentally re-fire side effects. 50,000 actors migrating to a new version do not trigger 50,000 emails.
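The guard and action behavior described above can be sketched as a pure lookup over the JSON definition. This is an assumed model, not StateKeep's engine: the type names, the `nextState` and `approveIfEligible` functions, and the choice to leave the state unchanged on an unknown event are hypothetical.

```typescript
// A transition is either a bare target or an object that may name a guard and actions.
type TransitionTarget = string | { target: string; guard?: string; actions?: string };

interface StateNode {
  type?: "final";
  on?: Record<string, TransitionTarget>;
}

interface MachineDefinition {
  id: string;
  initial: string;
  states: Record<string, StateNode>;
}

// Pure data in, state out. `guard` and `actions` are never read.
function nextState(def: MachineDefinition, current: string, eventType: string): string {
  const transition = def.states[current]?.on?.[eventType];
  if (transition === undefined) return current; // assumption: unknown events leave state unchanged
  return typeof transition === "string" ? transition : transition.target;
}

const loan: MachineDefinition = {
  id: "loan",
  initial: "under_review",
  states: {
    under_review: {
      on: { APPROVE: { target: "approved", guard: "isEligible", actions: "sendEmail" } },
    },
    approved: { type: "final" },
  },
};

// The guard is not evaluated, so the transition fires regardless of eligibility.
const unguarded = nextState(loan, "under_review", "APPROVE");

// The backend therefore checks eligibility itself before sending the event.
function approveIfEligible(isEligible: boolean, current: string): string {
  return isEligible ? nextState(loan, current, "APPROVE") : current;
}

const blocked = approveIfEligible(false, "under_review");
const allowed = approveIfEligible(true, "under_review");
```

Here `unguarded` and `allowed` are both `approved`, while `blocked` stays in `under_review` because the backend never sent the event. No email is sent in any of the three cases. That side effect belongs to whatever code reads the new state.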

The current state
StateKeep is in early access. The platform is running in production with 400+ tests passing and the APV engine active. It is self-hosted on a VPS with AES-256-GCM encryption at rest and continuous backup via Litestream, and it ships with a full dashboard, CLI tooling, and a TypeScript SDK.
We are looking for developers who have hit the XState persistence problem or the workflow migration problem in production — people who have written that midnight migration script, who have lost actor state on a server restart, who have kept two versions of workflow code running forever because there was no clean upgrade path.
Free access for anyone willing to give honest feedback. Reach out at statekeep.support@gmail.com or comment below.
