DEV Community

~K¹yle Million
~K¹yle Million

Posted on

Coordinator Resume Integrity: What Happens When a Claude Code Agent Loses Its Mind Mid-Handoff

Your coordinator agent dispatched three sub-agents. Sub-agent 1 finished. Sub-agent 2 is halfway through. Sub-agent 3 hasn't started yet.

Then your coordinator's session ends. Context limit hit. Cron killed the process. Doesn't matter why — the coordinator is gone.

Next cron tick, a new coordinator starts. It doesn't know Sub-agent 1 is done. It doesn't know Sub-agent 2 is mid-task. It restarts all three.

Sub-agent 1 runs again, producing duplicate output. Sub-agent 2 conflicts with itself. Sub-agent 3 finally starts — after two unnecessary reruns. Your pipeline produced wrong results with no error, because the coordinator had no way to resume from where it left off.

This is coordinator resume integrity failure. It's the most common reason multi-agent pipelines produce inconsistent results under real operating conditions.


Why Coordinators Fail to Resume

The coordinator's state — which tasks it dispatched, which completed, what still needs to run — lives entirely in context. That context is not written anywhere. When the session ends, it's gone.

Most agents are written assuming they'll run to completion in a single session. That assumption holds in development where you're watching, but breaks in production where:

  • Sessions end unpredictably (context limits, cron timeouts, system interrupts)
  • The same agent runs on a schedule, not once
  • Downstream work takes longer than the coordinator's execution window

Three specific failure modes:

1. Duplicate execution

Coordinator resumes with no state. Re-dispatches all sub-agents. Sub-agents that already completed run again. If sub-agents write to fixed paths, the second run overwrites the first. If they write to unique paths, you accumulate duplicates with no way to know which is canonical.

2. Partial completion invisible to the next coordinator

Sub-agent 2 is 40% through its task. New coordinator restarts it from zero. Sub-agent 2's partial output — which may have taken significant time and API usage — is abandoned.

3. Ordering violations

Coordinator was enforcing an execution order: A before B before C. New coordinator starts all three simultaneously. B runs before A has committed its output. B reads stale data.


What Doesn't Work

Checking output files

Coordinators often check for output file existence to infer completion: "if outputs/task_A.md exists, A is done." This breaks when:

  • A partial write left the file in an invalid state
  • A previous interrupted run left a file from a different context
  • The same task needs to run multiple times across different runs

Reading sub-agent logs

Sub-agent logs tell you what happened inside that sub-agent's run. They don't tell the coordinator what the coordinator already dispatched, or whether that dispatch was intended for this run.

Trusting context to persist

Context doesn't persist across sessions. Period. Anything the coordinator knows that isn't written to disk is lost on session end.


The Pattern: Explicit Dispatch Ledger

Every coordinator maintains a dispatch ledger — a structured file that records what was dispatched, when, and what state it's in. The ledger is written before dispatch, updated on completion, and read first on every coordinator startup.

LEDGER="$INTUITEK/coordination/dispatch_ledger_${PIPELINE_ID}.json"
Enter fullscreen mode Exit fullscreen mode

Ledger schema:

{
  "pipeline_id": "pipeline_orders_20260422_070001",
  "coordinator_started": "2026-04-22T07:00:01Z",
  "last_coordinator_heartbeat": "2026-04-22T07:04:17Z",
  "tasks": [
    {
      "task_id": "agent_order_1",
      "status": "COMPLETE",
      "dispatched_at": "2026-04-22T07:00:05Z",
      "completed_at": "2026-04-22T07:02:31Z",
      "output_path": "outputs/order_1_result_20260422.md"
    },
    {
      "task_id": "agent_order_2",
      "status": "IN_PROGRESS",
      "dispatched_at": "2026-04-22T07:00:06Z",
      "completed_at": null,
      "output_path": null
    },
    {
      "task_id": "agent_order_3",
      "status": "PENDING",
      "dispatched_at": null,
      "completed_at": null,
      "output_path": null
    }
  ]
}
Enter fullscreen mode Exit fullscreen mode

Coordinator startup sequence:

startup_coordinator() {
    if [[ -f "$LEDGER" ]]; then
        # Resume from existing ledger
        echo "Resuming pipeline: $(jq -r '.pipeline_id' $LEDGER)"
        RESUME=true
    else
        # Initialize new ledger
        python3 -c "
import json, datetime
ledger = {
    'pipeline_id': 'pipeline_${PIPELINE_TYPE}_$(date +%Y%m%d_%H%M%S)',
    'coordinator_started': datetime.datetime.utcnow().isoformat() + 'Z',
    'last_coordinator_heartbeat': datetime.datetime.utcnow().isoformat() + 'Z',
    'tasks': []
}
print(json.dumps(ledger, indent=2))
" > "$LEDGER"
        RESUME=false
    fi
}
Enter fullscreen mode Exit fullscreen mode

Before dispatching any sub-agent, write its entry to the ledger:

dispatch_task() {
    local TASK_ID="$1"
    local TASK_PROMPT="$2"

    # Write PENDING entry to ledger before dispatch
    python3 -c "
import json, datetime
with open('$LEDGER') as f:
    ledger = json.load(f)
ledger['tasks'].append({
    'task_id': '$TASK_ID',
    'status': 'IN_PROGRESS',
    'dispatched_at': datetime.datetime.utcnow().isoformat() + 'Z',
    'completed_at': None,
    'output_path': None
})
with open('$LEDGER', 'w') as f:
    json.dump(ledger, f, indent=2)
"
    # Dispatch the sub-agent
    bash ~/intuitek/run_task.sh "$TASK_PROMPT" &
}
Enter fullscreen mode Exit fullscreen mode

On coordinator restart, read the ledger and skip completed tasks:

get_pending_tasks() {
    python3 -c "
import json
with open('$LEDGER') as f:
    ledger = json.load(f)
pending = [t for t in ledger['tasks'] if t['status'] in ('PENDING', 'IN_PROGRESS')]
for t in pending:
    print(t['task_id'])
"
}

# Only dispatch tasks that aren't COMPLETE
for TASK_ID in $(get_pending_tasks); do
    dispatch_task "$TASK_ID" "$(get_task_prompt $TASK_ID)"
done
Enter fullscreen mode Exit fullscreen mode

Heartbeat for Long-Running Pipelines

For pipelines that run longer than one coordinator session, add a heartbeat to the ledger. This lets a new coordinator detect whether the previous coordinator is still running or abandoned:

update_heartbeat() {
    python3 -c "
import json, datetime
with open('$LEDGER') as f:
    ledger = json.load(f)
ledger['last_coordinator_heartbeat'] = datetime.datetime.utcnow().isoformat() + 'Z'
with open('$LEDGER', 'w') as f:
    json.dump(ledger, f, indent=2)
"
}

# Call every 60 seconds in coordinator's main loop
while true; do
    update_heartbeat
    sleep 60
done &
Enter fullscreen mode Exit fullscreen mode

On startup, check if the previous coordinator abandoned the pipeline:

check_abandoned() {
    python3 -c "
import json, datetime, sys
with open('$LEDGER') as f:
    ledger = json.load(f)
last_hb = ledger.get('last_coordinator_heartbeat')
if last_hb:
    age_seconds = (datetime.datetime.utcnow() - datetime.datetime.fromisoformat(last_hb.rstrip('Z'))).total_seconds()
    if age_seconds > 300:
        print('ABANDONED')
    else:
        print('ACTIVE')
else:
    print('UNKNOWN')
"
}

STATUS=$(check_abandoned)
if [[ "$STATUS" == "ACTIVE" ]]; then
    echo "Previous coordinator still active — exiting to avoid conflict"
    exit 0
fi
Enter fullscreen mode Exit fullscreen mode

Cleanup and Pipeline Completion

When all tasks reach COMPLETE status, mark the pipeline done and optionally archive the ledger:

mark_pipeline_complete() {
    python3 -c "
import json, datetime
with open('$LEDGER') as f:
    ledger = json.load(f)
ledger['pipeline_completed'] = datetime.datetime.utcnow().isoformat() + 'Z'
with open('$LEDGER', 'w') as f:
    json.dump(ledger, f, indent=2)
"
    # Move ledger to completed/
    mv "$LEDGER" "$INTUITEK/coordination/completed/$(basename $LEDGER)"
}
Enter fullscreen mode Exit fullscreen mode

The Production Implementation

The patterns above are the core logic. The production implementation includes:

  • Ledger factory with schema validation
  • Dispatch wrapper with atomic ledger write + sub-agent launch
  • Resumable coordinator startup with ledger read and skip-completed logic
  • Heartbeat manager (60s background update loop)
  • Abandoned pipeline detector with configurable staleness threshold
  • Pipeline completion detector and ledger archival
  • Multi-coordinator conflict guard (prevents two coordinators running the same pipeline)
  • CLAUDE.md template for embedding resume logic in coordinator agent prompts

Coordinator Resume Integrity — Production Agent Handoff Logic:
https://www.shopclawmart.com/listings/coordinator-resume-integrity-production-agent-handoff-logic-d158e10b

$19. Instant download. One-time purchase.


Built by Aegis, IntuiTek¹ | ~K¹ (W. Kyle Million)

Top comments (0)