WonderLab

Posted on Jul 1

Workflow Series (03): State Management — Persistence, Idempotency, and Version Binding

#ai #workflow #engineering #productivity

Why State Management Is the Core Problem

An Agent Workflow crashes during Phase 5. After a restart, does it begin again from the top, or pick up at Phase 5 where it stopped?

Without state persistence, it starts over: every LLM call, tool execution, and human approval is discarded.

State management means the workflow resumes from the last checkpoint no matter when the interruption occurs. It is also the mechanism behind human approval gates: a triggered gate pauses the workflow at a specific state, and the human's response resumes it from exactly that state.

Durable Execution Pattern

Serialize execution as a series of recoverable checkpoints. After any interruption, the workflow resumes from the most recent checkpoint and, provided each step is idempotent, produces the same results as an uninterrupted run. Temporal.io implements this at the code layer, but the same semantics can be achieved with a JSON file.

State File Structure

{
  "workflow_id": "wf-bug-e2e-AE-33995-20260601",
  "workflow_version": "1.3.0",
  "jira_key": "AE-33995",
  "started_at": "2026-06-01T10:00:00+08:00",
  "phase": "phase_4",
  "phases": {
    "phase_1": {
      "status": "done",
      "completed_at": "2026-06-01T10:02:30+08:00",
      "output_file": "bug_info.json"
    },
    "phase_4": {
      "status": "in_progress",
      "step": "step_4_2",
      "steps": {
        "step_4_1": {"status": "done", "output_file": "candidate_a.json"},
        "step_4_2": {"status": "in_progress"},
        "step_4_3": {"status": "pending"}
      }
    }
  }
}

Resume Protocol

import json
from pathlib import Path

def resume_workflow(state_file: Path) -> None:
    state = json.loads(state_file.read_text())

    # Phases are visited in the order they appear in the state file
    for phase_id, phase_data in state["phases"].items():
        if phase_data["status"] == "done":
            continue  # skip completed phases

        # in_progress, pending, and failed phases are all re-executed
        # (idempotency guarantees this is safe); a failed phase must not
        # be skipped, and later phases run once this one succeeds
        execute_phase(phase_id, state)

The key principle: trust only the state file, not memory. The main Agent doesn't remember what it did — it reads the status field. Phases marked in_progress get re-executed, which requires every phase operation to be idempotent.
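
The state file example above also records steps inside a phase, while resume_workflow only looks at the phase status. A minimal sketch of locating the resume point at step granularity, assuming the phases / steps / status layout from that example. The helper name find_resume_point is hypothetical and not part of the original workflow; it relies on json.loads preserving key order, which Python dicts do.

```python
from typing import Optional, Tuple


def find_resume_point(state: dict) -> Optional[Tuple[str, Optional[str]]]:
    """Return (phase_id, step_id) of the first unit of work that is not done.

    step_id is None when the phase tracks no steps or all its steps are done.
    Returns None when every phase is done.
    """
    for phase_id, phase in state["phases"].items():
        if phase.get("status") == "done":
            continue
        for step_id, step in phase.get("steps", {}).items():
            if step.get("status") != "done":
                return phase_id, step_id
        # Phase is not done but no unfinished step was found: rerun the phase itself.
        return phase_id, None
    return None


state = {
    "phases": {
        "phase_1": {"status": "done"},
        "phase_4": {
            "status": "in_progress",
            "steps": {
                "step_4_1": {"status": "done"},
                "step_4_2": {"status": "in_progress"},
                "step_4_3": {"status": "pending"},
            },
        },
    }
}
print(find_resume_point(state))  # ('phase_4', 'step_4_2')
```

Resuming at the step rather than the phase avoids repeating step_4_1, at the cost of requiring each step (not just each phase) to write its own status.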

Double-Ended Writes

Write to the state file both before a phase starts and after it completes — not only on completion:

def execute_phase(phase_id: str, state: dict) -> None:
    # setdefault covers phases that have no entry in the state file yet
    phase = state["phases"].setdefault(phase_id, {})

    # Before start: mark in_progress
    # (if a crash occurs, resume finds this phase and re-executes it)
    phase["status"] = "in_progress"
    phase.pop("error", None)  # clear the error left by an earlier failed attempt
    write_state(state)

    try:
        result = run_phase_logic(phase_id, state)

        # After completion: mark done, record output file path
        phase["status"] = "done"
        phase["output_file"] = result.output_file
        write_state(state)

    except Exception as e:
        phase["status"] = "failed"
        phase["error"] = str(e)
        write_state(state)
        raise
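
The snippets in this article call write_state without defining it, once with only the state and once with the state and a path. Because the state file is the single source of truth, a crash in the middle of the write itself would corrupt the checkpoint. The following is one possible sketch of an atomic write using a temporary file and os.replace. The default path DEFAULT_STATE_FILE is an assumption for illustration; the optional second argument lets both call styles work.

```python
import json
import os
import tempfile
from pathlib import Path

DEFAULT_STATE_FILE = Path("workflow_state.json")  # hypothetical default location


def write_state(state: dict, state_file: Path = DEFAULT_STATE_FILE) -> None:
    """Write the state so readers see either the old file or the new one, never a partial file."""
    state_file = Path(state_file)
    state_file.parent.mkdir(parents=True, exist_ok=True)
    # Create the temp file in the target directory so the rename stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=state_file.parent, prefix=state_file.name, suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(state, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_name, state_file)  # replaces the old file in one step
    except BaseException:
        # Remove the temp file if anything failed before the rename completed.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

With this shape, a crash before os.replace leaves the previous checkpoint untouched, so resume still reads a valid state file.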

Idempotency Design

The resume protocol re-executes in_progress phases, meaning a phase can run twice. Operations that aren't idempotent produce duplicate side effects: two Jira comments, two git commits, two notification emails.

Idempotency Analysis by Operation Type

File writes (naturally idempotent)

# Overwrite is idempotent — running twice produces the same result
output_file.write_text(json.dumps(result))  # ✅

Jira comments (not idempotent — requires detection)

# ❌ Wrong: direct write produces a duplicate comment on re-run
jira.add_comment(issue_key, comment_text)

# ✅ Correct: check for existing comment with this run's ID first
def add_comment_idempotent(issue_key: str, comment_text: str, run_id: str) -> None:
    existing = jira.get_comments(issue_key)
    marker = f"[run_id:{run_id}]"  # unique marker per workflow run

    if any(marker in c.body for c in existing):
        return  # already written — skip

    jira.add_comment(issue_key, f"{marker}\n{comment_text}")

Git commits (not idempotent — requires detection)

# ❌ Wrong: direct commit creates a second commit on re-run
git.commit(message)

# ✅ Correct: check if commit result file exists and passed=true
def commit_idempotent(message: str, output_file: Path) -> dict:
    if output_file.exists():
        result = json.loads(output_file.read_text())
        if result.get("passed"):
            return result  # already committed successfully

    commit_sha = git.commit(message)
    result = {"passed": True, "sha": commit_sha}
    output_file.write_text(json.dumps(result))
    return result
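
One gap in the result-file check: if the process crashes after git.commit returns but before the result file is written, the re-run finds no file and commits again. A possible alternative, sketched here under assumptions, is to ask git itself by embedding a marker in the commit message and searching history for it. The marker format, the function names, and the step_id parameter are hypothetical; the sketch also assumes the repository already has at least one commit, since git log fails on an empty history.

```python
import subprocess
from typing import Callable, List


def run_git(args: List[str]) -> str:
    """Run a git command in the current directory and return its stdout, stripped."""
    done = subprocess.run(["git", *args], check=True, capture_output=True, text=True)
    return done.stdout.strip()


def commit_once(message: str, run_id: str, step_id: str,
                git: Callable[[List[str]], str] = run_git) -> str:
    """Commit staged changes unless history already contains this run/step marker.

    Returns the SHA of the existing or newly created commit.
    """
    # Brackets close the marker so that step "s1" cannot match step "s10".
    marker = f"[run_id:{run_id} step:{step_id}]"
    # --fixed-strings makes git treat the marker literally instead of as a regex.
    existing = git(["log", "--fixed-strings", f"--grep={marker}", "--format=%H", "-n", "1"])
    if existing:
        return existing  # already committed, skip
    # The second -m is appended as its own paragraph in the commit message.
    git(["commit", "-m", message, "-m", marker])
    return git(["rev-parse", "HEAD"])
```

The git parameter exists only so the logic can be exercised against a fake in tests; in normal use the default run_git shells out to the real CLI.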

External API triggers (conditionally idempotent)

# Adding a Gerrit reviewer: duplicate adds don't error — naturally idempotent ✅
gerrit.add_reviewer(change_id, reviewer)

# Creating a cron job: duplicate creates produce two jobs ❌
# Fix: list first, create only if not already present
def create_cron_idempotent(job_config: dict) -> None:
    existing_jobs = cron.list_jobs()
    if any(j["name"] == job_config["name"] for j in existing_jobs):
        return  # already exists — skip
    cron.create_job(job_config)

Idempotency Self-Check

For every new Step, answer these three questions before implementing:

□ If this step runs twice, does it produce side effects?
□ If yes, how do you detect "already executed" and skip?
□ Is the detection logic itself idempotent?

The third question is easy to miss. If detection depends on in-memory state or has side effects of its own, it fails in the resume scenario just like the original operation.
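
The first question on the checklist can be tested mechanically: run the step twice and compare the observable state after each run. A small sketch, assuming the side effect can be observed through a function you supply; check_idempotent and the in-memory comment list are hypothetical stand-ins used only for illustration.

```python
from typing import Any, Callable, List


def check_idempotent(step: Callable[[], None], observe: Callable[[], Any]) -> bool:
    """Run `step` twice and report whether the observed state is the same after both runs."""
    step()
    after_first = observe()
    step()  # simulates the re-execution that happens on resume
    after_second = observe()
    return after_first == after_second


# In-memory stand-in for a comment thread.
comments: List[str] = []


def naive_comment() -> None:
    comments.append("build passed")


def guarded_comment() -> None:
    marker = "[run_id:demo]"
    if not any(marker in c for c in comments):
        comments.append(f"{marker} build passed")


print(check_idempotent(naive_comment, lambda: list(comments)))    # False
comments.clear()
print(check_idempotent(guarded_comment, lambda: list(comments)))  # True
```

The observe callback returns a copy of the list so the first snapshot is not mutated by the second run.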

State File Version Binding

Modify a workflow definition mid-run — add a new Step, for example — and the old state file has no record of it. When the workflow resumes, the main Agent has no basis for handling the missing step.

The fix: bind the workflow version in the state file and verify it on resume.

import json
from datetime import datetime, timezone
from pathlib import Path

class WorkflowVersionMismatch(Exception):
    """Raised when the state file was written by a different workflow version."""

def start_or_resume(state_file: Path, current_version: str) -> dict:
    if state_file.exists():
        state = json.loads(state_file.read_text())
        saved_version = state.get("workflow_version")

        if saved_version != current_version:
            raise WorkflowVersionMismatch(
                f"State file version: {saved_version}\n"
                f"Current workflow version: {current_version}\n"
                f"Options:\n"
                f"  1. Resume with saved state using old workflow ({saved_version})\n"
                f"  2. Start fresh with new workflow ({current_version})\n"
                f"  3. Manually migrate the state file"
            )

        return state  # versions match, resume normally

    # New run: create state file
    state = {
        "workflow_version": current_version,
        "started_at": datetime.now(timezone.utc).isoformat(),
        "phases": {}
    }
    write_state(state, state_file)
    return state

Version Number Rules (MAJOR.MINOR.PATCH)

MAJOR: Phase structure changes (add/remove Phase, major routing changes)
        → Breaks in-progress runs; requires explicit handling
        → Cannot resume directly; user must decide

MINOR: Add Step, template improvements, new gate options
        → Backward compatible; in-progress runs complete with old version
        → New runs use new version

PATCH: Wording tweaks, config adjustments, behavior unchanged
        → Safe to upgrade; old state files resume without issue
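
Note that start_or_resume, as written, rejects any version difference, including a PATCH bump that these rules describe as safe. One way to reconcile the two is to classify the difference before deciding. The sketch below is an assumption about how the rules could map to code; ResumePolicy, parse_version, and resume_policy are hypothetical names, and versions are assumed to be plain MAJOR.MINOR.PATCH strings.

```python
from enum import Enum
from typing import Tuple


class ResumePolicy(Enum):
    RESUME = "resume"                                # same version or PATCH-only difference
    RESUME_WITH_SAVED = "resume_with_saved_version"  # MINOR difference: finish on the old version
    ASK_USER = "ask_user"                            # MAJOR difference: user must decide


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' into integers; raises ValueError on malformed input."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def resume_policy(saved_version: str, current_version: str) -> ResumePolicy:
    saved = parse_version(saved_version)
    current = parse_version(current_version)
    if saved[0] != current[0]:
        return ResumePolicy.ASK_USER
    if saved[1] != current[1]:
        return ResumePolicy.RESUME_WITH_SAVED
    return ResumePolicy.RESUME


print(resume_policy("1.3.0", "1.3.2"))  # ResumePolicy.RESUME
print(resume_policy("1.3.0", "1.4.0"))  # ResumePolicy.RESUME_WITH_SAVED
print(resume_policy("1.3.0", "2.0.0"))  # ResumePolicy.ASK_USER
```

Only the ASK_USER case would then need to raise WorkflowVersionMismatch with the list of options.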

Design Checklist

State persistence

[ ] Every Phase/Step writes in_progress before starting and done after completing
[ ] Resume protocol reads only the state file, not conversation history
[ ] State file includes workflow_version

Idempotency

[ ] All external writes (Jira comments, git commits, API calls) have idempotency checks
[ ] Detection uses a unique identifier (run_id or output file existence)
[ ] The detection logic itself produces no side effects

Version binding

[ ] Version is verified on resume against the current workflow version
[ ] MAJOR version changes have an explicit handling strategy
[ ] Version mismatches surface user-actionable options, not just an error exit

Summary

Durable Execution requires double-ended writes: write in_progress before the phase starts and done after — a crash at any point allows precise resumption
Resume requires idempotency: in_progress phases get re-executed, so every external write must be safe to run twice; file writes are naturally idempotent, Jira comments and git commits need explicit detection
Version binding prevents silent errors: when a workflow is modified, a mismatch between the old state file and the new workflow version should surface actionable options — not silently apply new logic to old state

