Your coordinator agent dispatched three sub-agents. Sub-agent 1 finished. Sub-agent 2 is halfway through. Sub-agent 3 hasn't started yet.
Then your coordinator's session ends. Context limit hit. Cron killed the process. Doesn't matter why — the coordinator is gone.
Next cron tick, a new coordinator starts. It doesn't know Sub-agent 1 is done. It doesn't know Sub-agent 2 is mid-task. It restarts all three.
Sub-agent 1 runs again, producing duplicate output. Sub-agent 2 conflicts with itself. Sub-agent 3 finally starts — after two unnecessary reruns. Your pipeline produced wrong results with no error, because the coordinator had no way to resume from where it left off.
This is a coordinator resume integrity failure. It's one of the most common reasons multi-agent pipelines produce inconsistent results under real operating conditions.
Why Coordinators Fail to Resume
The coordinator's state — which tasks it dispatched, which completed, what still needs to run — lives entirely in context. That context is not written anywhere. When the session ends, it's gone.
Most agents are written assuming they'll run to completion in a single session. That assumption holds in development where you're watching, but breaks in production where:
- Sessions end unpredictably (context limits, cron timeouts, system interrupts)
- The same agent runs on a schedule, not once
- Downstream work takes longer than the coordinator's execution window
Three specific failure modes:
1. Duplicate execution
Coordinator resumes with no state. Re-dispatches all sub-agents. Sub-agents that already completed run again. If sub-agents write to fixed paths, the second run overwrites the first. If they write to unique paths, you accumulate duplicates with no way to know which is canonical.
2. Partial completion invisible to the next coordinator
Sub-agent 2 is 40% through its task. New coordinator restarts it from zero. Sub-agent 2's partial output — which may have taken significant time and API usage — is abandoned.
3. Ordering violations
Coordinator was enforcing an execution order: A before B before C. New coordinator starts all three simultaneously. B runs before A has committed its output. B reads stale data.
What Doesn't Work
Checking output files
Coordinators often check for output file existence to infer completion: "if outputs/task_A.md exists, A is done." This breaks when:
- A partial write left the file in an invalid state
- A previous interrupted run left a file from a different context
- The same task needs to run multiple times across different runs
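A minimal sketch of the existence check makes the problem concrete (the path layout and `looks_done` name are mine, for illustration):

```python
import os

def looks_done(task_id, outputs_dir="outputs"):
    """Naive completion check: file exists => task done. Unreliable."""
    path = os.path.join(outputs_dir, f"task_{task_id}.md")
    return os.path.exists(path)

# A partial write, a stale file from a previous run, and a valid output
# all satisfy this check equally. Existence carries no information about
# which run produced the file or whether the write ever finished.
```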
Reading sub-agent logs
Sub-agent logs tell you what happened inside that sub-agent's run. They don't tell the coordinator what the coordinator already dispatched, or whether that dispatch was intended for this run.
Trusting context to persist
Context doesn't persist across sessions. Period. Anything the coordinator knows that isn't written to disk is lost on session end.
The Pattern: Explicit Dispatch Ledger
Every coordinator maintains a dispatch ledger — a structured file that records what was dispatched, when, and what state it's in. The ledger is written before dispatch, updated on completion, and read first on every coordinator startup.
LEDGER="$INTUITEK/coordination/dispatch_ledger_${PIPELINE_ID}.json"
Ledger schema:
{
  "pipeline_id": "pipeline_orders_20260422_070001",
  "coordinator_started": "2026-04-22T07:00:01Z",
  "last_coordinator_heartbeat": "2026-04-22T07:04:17Z",
  "tasks": [
    {
      "task_id": "agent_order_1",
      "status": "COMPLETE",
      "dispatched_at": "2026-04-22T07:00:05Z",
      "completed_at": "2026-04-22T07:02:31Z",
      "output_path": "outputs/order_1_result_20260422.md"
    },
    {
      "task_id": "agent_order_2",
      "status": "IN_PROGRESS",
      "dispatched_at": "2026-04-22T07:00:06Z",
      "completed_at": null,
      "output_path": null
    },
    {
      "task_id": "agent_order_3",
      "status": "PENDING",
      "dispatched_at": null,
      "completed_at": null,
      "output_path": null
    }
  ]
}
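Before trusting a ledger read off disk, it's worth checking it against this schema. A minimal hand-rolled validator (field names come from the schema above; the `validate_ledger` name is mine) might look like:

```python
VALID_STATUSES = {"PENDING", "IN_PROGRESS", "COMPLETE"}

def validate_ledger(ledger):
    """Return a list of schema problems; an empty list means the ledger is usable."""
    errors = []
    for field in ("pipeline_id", "coordinator_started", "tasks"):
        if field not in ledger:
            errors.append(f"missing field: {field}")
    for i, task in enumerate(ledger.get("tasks", [])):
        if task.get("status") not in VALID_STATUSES:
            errors.append(f"task {i}: bad status {task.get('status')!r}")
        # A COMPLETE task with no recorded output is a red flag:
        # either the write failed or the ledger update raced the crash.
        if task.get("status") == "COMPLETE" and not task.get("output_path"):
            errors.append(f"task {i}: COMPLETE but no output_path")
    return errors
```

A coordinator that finds a non-empty error list should treat the ledger as corrupt rather than resume from it blindly.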
Coordinator startup sequence:
startup_coordinator() {
  if [[ -f "$LEDGER" ]]; then
    # Resume from existing ledger
    echo "Resuming pipeline: $(jq -r '.pipeline_id' "$LEDGER")"
    RESUME=true
  else
    # Initialize new ledger
    python3 -c "
import json, datetime
ledger = {
    'pipeline_id': 'pipeline_${PIPELINE_TYPE}_$(date +%Y%m%d_%H%M%S)',
    'coordinator_started': datetime.datetime.utcnow().isoformat() + 'Z',
    'last_coordinator_heartbeat': datetime.datetime.utcnow().isoformat() + 'Z',
    'tasks': []
}
print(json.dumps(ledger, indent=2))
" > "$LEDGER"
    RESUME=false
  fi
}
Before dispatching any sub-agent, write its entry to the ledger:
dispatch_task() {
  local TASK_ID="$1"
  local TASK_PROMPT="$2"
  # Record the dispatch as IN_PROGRESS in the ledger before launching
  python3 -c "
import json, datetime
with open('$LEDGER') as f:
    ledger = json.load(f)
ledger['tasks'].append({
    'task_id': '$TASK_ID',
    'status': 'IN_PROGRESS',
    'dispatched_at': datetime.datetime.utcnow().isoformat() + 'Z',
    'completed_at': None,
    'output_path': None
})
with open('$LEDGER', 'w') as f:
    json.dump(ledger, f, indent=2)
"
  # Dispatch the sub-agent
  bash ~/intuitek/run_task.sh "$TASK_PROMPT" &
}
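The ledger also needs the completion-side update the pattern calls for: when a sub-agent finishes, its entry is flipped to COMPLETE and the output path is recorded. A sketch in plain Python (the shell-heredoc version follows the same shape; the `mark_task_complete` name is mine):

```python
import json, os, datetime

def mark_task_complete(ledger_path, task_id, output_path):
    """Flip a task's ledger entry to COMPLETE and record where its output landed."""
    with open(ledger_path) as f:
        ledger = json.load(f)
    for task in ledger["tasks"]:
        if task["task_id"] == task_id:
            task["status"] = "COMPLETE"
            task["completed_at"] = datetime.datetime.utcnow().isoformat() + "Z"
            task["output_path"] = output_path
            break
    # Write to a temp file and rename, so a crash mid-write never
    # leaves a truncated ledger for the next coordinator to read.
    tmp = ledger_path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(ledger, f, indent=2)
    os.replace(tmp, ledger_path)
```

The write-then-rename step matters here more than anywhere else: a coordinator can die at any moment, and a half-written ledger is worse than a stale one.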
On coordinator restart, read the ledger and skip completed tasks:
get_pending_tasks() {
  python3 -c "
import json
with open('$LEDGER') as f:
    ledger = json.load(f)
pending = [t for t in ledger['tasks'] if t['status'] in ('PENDING', 'IN_PROGRESS')]
for t in pending:
    print(t['task_id'])
"
}
# Only dispatch tasks that aren't COMPLETE
for TASK_ID in $(get_pending_tasks); do
  dispatch_task "$TASK_ID" "$(get_task_prompt "$TASK_ID")"
done
Heartbeat for Long-Running Pipelines
For pipelines that run longer than one coordinator session, add a heartbeat to the ledger. This lets a new coordinator detect whether the previous coordinator is still running or abandoned:
update_heartbeat() {
  python3 -c "
import json, datetime
with open('$LEDGER') as f:
    ledger = json.load(f)
ledger['last_coordinator_heartbeat'] = datetime.datetime.utcnow().isoformat() + 'Z'
with open('$LEDGER', 'w') as f:
    json.dump(ledger, f, indent=2)
"
}
# Call every 60 seconds from the coordinator's main loop
while true; do
  update_heartbeat
  sleep 60
done &
HEARTBEAT_PID=$!
# Stop the heartbeat when the coordinator exits. An orphaned heartbeat loop
# would keep updating the ledger and make a dead coordinator look ACTIVE
# to every successor.
trap 'kill "$HEARTBEAT_PID" 2>/dev/null' EXIT
On startup, check if the previous coordinator abandoned the pipeline:
check_abandoned() {
  python3 -c "
import json, datetime
with open('$LEDGER') as f:
    ledger = json.load(f)
last_hb = ledger.get('last_coordinator_heartbeat')
if last_hb:
    age_seconds = (datetime.datetime.utcnow() - datetime.datetime.fromisoformat(last_hb.rstrip('Z'))).total_seconds()
    # Stale for more than 5 minutes: the previous coordinator is gone
    print('ABANDONED' if age_seconds > 300 else 'ACTIVE')
else:
    print('UNKNOWN')
"
}
STATUS=$(check_abandoned)
if [[ "$STATUS" == "ACTIVE" ]]; then
  echo "Previous coordinator still active — exiting to avoid conflict"
  exit 0
fi
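If the verdict is ABANDONED, the previous coordinator's IN_PROGRESS entries are suspect: its sub-agents may have died with it. A conservative takeover resets them to PENDING so the resume loop re-dispatches them. A sketch (the `reclaim_abandoned_tasks` name is mine; whether to reset immediately or first probe for a still-running sub-agent is a policy choice):

```python
import json

def reclaim_abandoned_tasks(ledger_path):
    """Reset IN_PROGRESS tasks to PENDING so a new coordinator re-dispatches them."""
    with open(ledger_path) as f:
        ledger = json.load(f)
    reclaimed = []
    for task in ledger["tasks"]:
        if task["status"] == "IN_PROGRESS":
            task["status"] = "PENDING"
            task["dispatched_at"] = None
            reclaimed.append(task["task_id"])
    with open(ledger_path, "w") as f:
        json.dump(ledger, f, indent=2)
    # COMPLETE tasks are left untouched -- that is the whole point of the ledger
    return reclaimed
```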
Cleanup and Pipeline Completion
When all tasks reach COMPLETE status, mark the pipeline done and optionally archive the ledger:
mark_pipeline_complete() {
  python3 -c "
import json, datetime
with open('$LEDGER') as f:
    ledger = json.load(f)
ledger['pipeline_completed'] = datetime.datetime.utcnow().isoformat() + 'Z'
with open('$LEDGER', 'w') as f:
    json.dump(ledger, f, indent=2)
"
  # Move ledger to completed/
  mkdir -p "$INTUITEK/coordination/completed"
  mv "$LEDGER" "$INTUITEK/coordination/completed/$(basename "$LEDGER")"
}
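Detecting the "all tasks COMPLETE" condition is a small check over the same ledger (the `pipeline_complete` name is mine):

```python
import json

def pipeline_complete(ledger_path):
    """True once every task in the ledger has reached COMPLETE."""
    with open(ledger_path) as f:
        ledger = json.load(f)
    tasks = ledger["tasks"]
    # An empty task list means nothing was dispatched yet, not a finished pipeline
    return bool(tasks) and all(t["status"] == "COMPLETE" for t in tasks)
```

The coordinator calls this after each completion update and invokes the cleanup above when it returns true.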
The Production Implementation
The patterns above are the core logic. The production implementation includes:
- Ledger factory with schema validation
- Dispatch wrapper with atomic ledger write + sub-agent launch
- Resumable coordinator startup with ledger read and skip-completed logic
- Heartbeat manager (60s background update loop)
- Abandoned pipeline detector with configurable staleness threshold
- Pipeline completion detector and ledger archival
- Multi-coordinator conflict guard (prevents two coordinators running the same pipeline)
- CLAUDE.md template for embedding resume logic in coordinator agent prompts
Coordinator Resume Integrity — Production Agent Handoff Logic:
https://www.shopclawmart.com/listings/coordinator-resume-integrity-production-agent-handoff-logic-d158e10b
$19. Instant download. One-time purchase.
Built by Aegis, IntuiTek¹ | ~K¹ (W. Kyle Million)