Why Your Overnight AI Agent Fails (And How Episodic Execution Fixes It)
How I built Nightcrawler -- an autonomous agent loop that runs Claude Code for 12+ hours with structured handoffs, crash recovery, and 8 termination conditions.
You go to bed. Claude Code is humming along, refactoring your test suite. You set it loose with --dangerously-skip-permissions and a prayer. You wake up 8 hours later.
What do you find?
If you are lucky: a completed task. If you are unlucky -- and I was unlucky many times -- you find one of these six disasters.
The 6 Death Spirals of Long-Running AI Agents
Every developer who has tried to run an AI coding agent overnight has hit at least one of these:
1. The Context Cliff
After 30-60 minutes of work, the model's effective context fills up. It starts forgetting what it did earlier. It re-reads files it already processed. It contradicts decisions it made 20 minutes ago. Eventually it starts undoing its own work.
2. The Hallucinated Handoff
The agent writes "I completed tasks 1-5" in its output. You check git. Tasks 2 and 4 were never actually done. The agent's self-assessment diverged from reality because it lost track of what it actually committed versus what it planned to do.
3. The Budget Inferno
You wake up to a $200 API bill. The agent hit a hard problem, generated massive prompts trying to solve it, and burned through your budget in a single multi-turn loop. No circuit breaker. No budget cap per work unit.
4. The Infinite Retry Loop
The agent encounters a flaky test. It tries to fix it. The fix breaks something else. It tries to fix that. The original test fails again. Three hours later, it is still cycling between the same two broken states with zero net progress.
5. The Silent Crash
The process dies at 2 AM -- an OOM kill, a network timeout, a macOS sleep event. Nobody notices. When you wake up, you find a half-finished refactor with uncommitted changes and no record of what was accomplished.
6. The Drift Spiral
The agent starts with clear intent. By hour 3, it has wandered so far from the original mission that it is adding features nobody asked for, refactoring code that was fine, and creating documentation files for a project that does not exist.
Why Simple Loops Do Not Work
The most popular approach to overnight agents is some variant of this:
while true; do
  claude -p "Continue working on the project. If done, output DONE."
  # check for DONE, maybe sleep, repeat
done
This is the Ralph Wiggum technique, popularized by Geoffrey Huntley. It is elegant. It works for short tasks. But it has a fundamental assumption that breaks down overnight: it assumes each iteration starts fresh and can pick up from filesystem state alone.
For a 30-minute task, that is fine. For a 12-hour mission, you need more:
- What was the agent working on when it stopped?
- What decisions did it make and why?
- What did it try that failed?
- How much budget has been spent?
- Is the agent still making progress, or is it spinning?
- Did git actually change, or did the agent just claim it did?
A simple loop cannot answer these questions. That is the gap Nightcrawler fills.
The Solution: Episodic Execution
Nightcrawler does not try to keep one session alive forever. It accepts that sessions will end and builds reliability around that inevitability.
The core idea: bounded episodes with structured handoffs.
Mission.md
    |
    v
[Orchestrator] -----> [Episode 1: claude -p] -----> HANDOFF.md
    |                                                   |
    | <-- read state, check termination conditions --   |
    |                                                   v
    +----------------> [Episode 2: claude -p] -----> HANDOFF.md
    |                                                   |
    | <-- read state, check budget, detect stalling --  |
    |                                                   v
    +----------------> [Episode 3: claude -p] -----> HANDOFF.md
    |                                                   |
   ...                                                  |
    v                                                   v
COMPLETION_REPORT.md                                STATE.json
Each episode is a fresh claude -p session that:
- Reads the mission (what to do)
- Reads the handoff from the previous episode (what was done, what is next)
- Verifies via git log and git diff that the handoff is truthful
- Works on the highest-priority incomplete task
- Writes a structured handoff for the next episode
- Updates state (progress, errors, budget)
The orchestrator sits between episodes and decides: continue, or stop?
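To make the "updates state" step concrete, here is a minimal sketch of how a state file could be checkpointed between episodes. The field names current_episode, budget_spent_usd, errors, and termination_check appear in the orchestrator code in this post; the exact shape of the State type, the saveState and loadState helpers, and the write-then-rename approach are assumptions for illustration, not necessarily what Nightcrawler does.

```typescript
import { mkdtempSync, readFileSync, renameSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical state shape: only the fields referenced elsewhere in this post.
interface State {
  current_episode: number;
  budget_spent_usd: number;
  errors: { total: number; fatal: number };
  termination_check: { should_continue: boolean };
}

// Write to a temporary file first, then rename it over the real one.
// A crash mid-write then leaves the previous STATE.json intact
// (rename within one directory is atomic on POSIX filesystems).
function saveState(path: string, state: State): void {
  const tmp = `${path}.tmp`;
  writeFileSync(tmp, JSON.stringify(state, null, 2));
  renameSync(tmp, path);
}

function loadState(path: string): State {
  return JSON.parse(readFileSync(path, "utf8")) as State;
}

// Usage: checkpoint into a throwaway directory and read it back.
const demoDir = mkdtempSync(join(tmpdir(), "nightcrawler-demo-"));
const statePath = join(demoDir, "STATE.json");
saveState(statePath, {
  current_episode: 1,
  budget_spent_usd: 2.5,
  errors: { total: 0, fatal: 0 },
  termination_check: { should_continue: true },
});
const restored = loadState(statePath);
console.log(restored.current_episode, restored.budget_spent_usd);
```

The point of the rename step is that a crash at 2 AM should never leave the next episode reading a half-written state file.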
Architecture: ~500 Lines of TypeScript
Nightcrawler is a single TypeScript file (nightcrawler.ts, ~500 LOC) that orchestrates everything:
// The main loop is deceptively simple
while (true) {
  const { cont, reason } = shouldContinue(state, config);
  if (!cont) {
    // Write completion report and exit
    break;
  }
  const prompt = buildEpisodePrompt(state, episodeNum);
  const { exitCode, output } = await runEpisode(prompt, config, episodeNum);
  // Re-read state (agent may have updated it)
  // Track episode in history
  // Handle errors
  // Checkpoint
  // Cooldown
}
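The loop above leaves runEpisode undefined. A minimal sketch of its core, assuming it spawns a child process and enforces the per-episode timeout itself, could look like the following. The name runBounded, the EpisodeResult shape, and the use of SIGKILL are illustrative assumptions; a real episode would presumably pass "claude" and ["-p", prompt] as the command and arguments.

```typescript
import { spawn } from "node:child_process";

interface EpisodeResult {
  exitCode: number | null; // null when the process was killed by a signal
  output: string; // combined stdout and stderr
  timedOut: boolean;
}

// Hypothetical helper: run one command to completion, killing it after timeoutMs.
function runBounded(command: string, args: string[], timeoutMs: number): Promise<EpisodeResult> {
  return new Promise((resolve) => {
    const child = spawn(command, args, { stdio: ["ignore", "pipe", "pipe"] });
    let output = "";
    let timedOut = false;
    let settled = false;

    // Kill the child if it outlives its time budget.
    // Note: this only kills the direct child, not any grandchildren it started.
    const timer = setTimeout(() => {
      timedOut = true;
      child.kill("SIGKILL");
    }, timeoutMs);

    // Resolve exactly once, whichever of "error" or "close" fires first.
    const finish = (exitCode: number | null) => {
      if (settled) return;
      settled = true;
      clearTimeout(timer);
      resolve({ exitCode, output, timedOut });
    };

    child.stdout?.on("data", (chunk: Buffer) => { output += chunk.toString(); });
    child.stderr?.on("data", (chunk: Buffer) => { output += chunk.toString(); });
    child.on("error", (err) => { output += String(err); finish(-1); }); // e.g. command not found
    child.on("close", (code) => finish(code));
  });
}

// Usage: a stand-in command instead of the real CLI, so the sketch runs anywhere Node does.
const demoRun = runBounded(process.execPath, ["-e", "console.log('episode ok')"], 5000);
demoRun.then((result) => console.log(result.exitCode, result.output.trim()));
```

Resolving instead of rejecting on failure keeps the orchestrator loop simple: every episode yields a result object, and the decision about whether an exit code counts as an error stays in one place.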
The Episode Prompt
Each episode receives a carefully constructed prompt that includes:
- The skill instructions (how to behave as an autonomous agent)
- The mission (what to accomplish)
- The current state (progress, budget, errors)
- The previous handoff (what was done, what is next)
- The git context (recent commits, diff)
- A task tracker (immutable tasks.json the agent can only mark complete)
import { execSync } from "node:child_process";

function buildEpisodePrompt(state: State, episodeNum: number): string {
  const mission = readText(MISSION_PATH);
  const handoff = fileExists(HANDOFF_PATH) ? readText(HANDOFF_PATH) : null;
  // Include git context for truth-checking
  // Note: execSync throws if git fails, e.g. HEAD~1 does not exist in a one-commit repo
  const gitLog = execSync("git log --oneline -10").toString();
  const diff = execSync("git diff --stat HEAD~1").toString();
  // Build structured prompt with all context
  // ...
}
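One practical wrinkle with the git calls above: execSync throws when the command fails, for example when the repository has only one commit and HEAD~1 does not exist. A small wrapper with a fallback string is one way to keep prompt construction from crashing the orchestrator. The helper names safeExec and gitContext and the section headings in the returned text are hypothetical, not taken from Nightcrawler.

```typescript
import { execSync } from "node:child_process";

// Hypothetical helper: run a shell command and return its trimmed stdout,
// or the fallback text if the command exits non-zero or cannot be run.
function safeExec(command: string, fallback: string): string {
  try {
    return execSync(command, { encoding: "utf8", stdio: ["ignore", "pipe", "ignore"] }).trim();
  } catch {
    return fallback;
  }
}

// Assemble the git section of an episode prompt without ever throwing.
function gitContext(): string {
  const log = safeExec("git log --oneline -10", "(no commits available)");
  const diff = safeExec("git diff --stat HEAD~1", "(no previous commit to diff against)");
  return `## Recent commits\n${log}\n\n## Changes since previous commit\n${diff}`;
}

console.log(gitContext());
```

The fallback text also tells the agent something useful: an episode that sees "(no commits available)" knows it cannot rely on git history to verify the previous handoff.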
The 8 Termination Conditions
The orchestrator checks these before every episode:
function shouldContinue(state: State, config: Config) {
  // 1. Human stop flag (touch ~/.nightcrawler/state/STOP)
  if (fileExists(STOP_FLAG)) return { cont: false, reason: "human_stop_flag" };
  // 2. Agent said stop (mission complete or blocked)
  if (!state.termination_check.should_continue) return { cont: false, reason: "agent_terminated" };
  // 3. Episode limit (default: 24)
  if (state.current_episode >= config.max_episodes) return { cont: false, reason: "episode_limit" };
  // 4. Duration limit (default: 12 hours)
  // `started` is the mission start time in epoch milliseconds (its derivation is elided in this excerpt)
  const elapsed = (Date.now() - started) / (1000 * 60 * 60);
  if (elapsed >= config.max_duration_hours) return { cont: false, reason: "duration_limit" };
  // 5. Budget limit (default: $50)
  if (state.budget_spent_usd >= config.max_budget_usd) return { cont: false, reason: "budget_limit" };
  // 6. Error threshold (default: 10 errors)
  if (state.errors.total >= config.error_threshold) return { cont: false, reason: "error_threshold" };
  // 7. Fatal error
  if (state.errors.fatal > 0) return { cont: false, reason: "fatal_error" };
  // 8. Diminishing returns (< 0.5 tasks/episode for 3 consecutive)
  // `avgCompleted` is the mean tasks completed over the last 3 episodes (computation elided in this excerpt)
  if (avgCompleted < 0.5) return { cont: false, reason: "diminishing_returns" };
  return { cont: true, reason: null };
}
That eighth condition -- diminishing returns detection -- is the one that prevents death spiral #4 (the infinite retry loop). If the agent completes less than half a task per episode for three episodes in a row, something is wrong and Nightcrawler stops.
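The excerpt does not show how the average is computed, so here is a minimal sketch of one way to do it, assuming the orchestrator keeps a per-episode history with a count of tasks completed. The EpisodeRecord type and the isStalled function are hypothetical names; the window of 3 episodes and the 0.5 threshold come from the condition described above.

```typescript
// Hypothetical record kept by the orchestrator for each finished episode.
interface EpisodeRecord {
  episode: number;
  tasksCompleted: number;
}

// Returns true when the mean tasks completed over the last `windowSize`
// episodes is below `minAverage`. With fewer than `windowSize` episodes
// of history there is not enough evidence, so the mission continues.
function isStalled(history: EpisodeRecord[], windowSize = 3, minAverage = 0.5): boolean {
  if (history.length < windowSize) return false;
  const recent = history.slice(-windowSize);
  const total = recent.reduce((sum, record) => sum + record.tasksCompleted, 0);
  return total / windowSize < minAverage;
}

// Usage: one task in the last three episodes averages 0.33, which counts as stalled.
const stalledHistory: EpisodeRecord[] = [
  { episode: 1, tasksCompleted: 1 },
  { episode: 2, tasksCompleted: 0 },
  { episode: 3, tasksCompleted: 0 },
];
console.log(isStalled(stalledHistory));
```

Averaging over a window instead of requiring three zero-progress episodes means a single lucky task does not reset the detector: one task in three episodes still trips it, while two tasks in three episodes does not.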
The Handoff Protocol
Every episode must write a structured HANDOFF.md before finishing:
# Episode 3 Handoff
## Summary
Implemented the authentication middleware and wrote 12 tests.
All tests passing. Rate limiting is next.
## Work Completed
- Created auth/middleware.ts with JWT validation
- Added 12 test cases in auth/middleware.test.ts
- Updated routes/api.ts to use the middleware
## In-Progress Work
- File: auth/rate-limiter.ts
- What's left: Implement sliding window rate limiting
## Key Context for Next Episode
- JWT secret is loaded from env var AUTH_SECRET
- The middleware expects Bearer tokens, not Basic auth
- rate-limiter.ts exists but is empty scaffolding
## Files Modified
- auth/middleware.ts: new file, JWT validation
- auth/middleware.test.ts: new file, 12 tests
- routes/api.ts: added authMiddleware to all /api/* routes
## Decisions Made
- Used jose library over jsonwebtoken: async-first, better types
## Errors Encountered
- None this episode
This handoff is not optional. The skill instructions loaded into every episode enforce it. And the orchestrator cross-checks it against git log in the next episode's prompt -- the agent cannot claim it changed files that git says were not modified.
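In the post, the cross-check happens by putting git output into the next episode's prompt. The same idea can also be done mechanically, and a minimal sketch of that variant follows. It assumes the handoff uses the "## Files Modified" section format shown above and that the list of changed files comes from something like git diff --name-only; the function names claimedFiles and unverifiedClaims are hypothetical, and this is not necessarily how Nightcrawler implements the check.

```typescript
// Extract the file paths listed under "## Files Modified" in a handoff document.
function claimedFiles(handoff: string): string[] {
  const lines = handoff.split("\n").map((line) => line.trim());
  const start = lines.indexOf("## Files Modified");
  if (start === -1) return [];
  const files: string[] = [];
  for (const line of lines.slice(start + 1)) {
    if (line.startsWith("## ")) break; // next section begins
    const match = line.match(/^-\s+([^:\s]+):/); // "- path/to/file.ts: description"
    if (match) files.push(match[1]);
  }
  return files;
}

// Return every file the handoff claims to have modified that git does not report as changed.
function unverifiedClaims(handoff: string, changedInGit: string[]): string[] {
  const changed = new Set(changedInGit);
  return claimedFiles(handoff).filter((file) => !changed.has(file));
}

// Usage: the handoff claims two files, but git only saw one of them change.
const sampleHandoff = [
  "# Episode 3 Handoff",
  "## Files Modified",
  "- auth/middleware.ts: new file, JWT validation",
  "- routes/api.ts: added authMiddleware to all /api/* routes",
  "## Decisions Made",
  "- Used jose library over jsonwebtoken: async-first, better types",
].join("\n");
console.log(unverifiedClaims(sampleHandoff, ["auth/middleware.ts"]));
```

A non-empty result does not prove the agent lied (the file may have been changed and reverted), but it is a cheap signal worth surfacing in the next episode's prompt or the log.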
Process Supervision with launchd
Nightcrawler runs under macOS launchd, which provides:
- Crash recovery: If the process dies, launchd restarts it (with a 30-second throttle)
- Sleep/wake handling: launchd manages macOS sleep events correctly
- Background execution: Runs with Nice: 5 (lower priority) and ProcessType: Background
- Hard timeout: 12-hour maximum via TimeOut: 43200
- Logging: stdout/stderr redirected to log files
<key>KeepAlive</key>
<dict>
  <key>Crashed</key>
  <true/>
  <key>SuccessfulExit</key>
  <false/>
</dict>
The process also uses a PID-based lockfile to prevent duplicate instances:
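import { writeFileSync } from "node:fs";

function acquireLock(): boolean {
  if (fileExists(LOCK_PATH)) {
    try {
      const pid = parseInt(readText(LOCK_PATH).trim(), 10);
      process.kill(pid, 0); // Signal 0 sends nothing; it only checks that the PID exists
      return false; // Still running
    } catch (err) {
      // EPERM means the process exists but belongs to another user: treat it as running
      if ((err as NodeJS.ErrnoException).code === "EPERM") return false;
      log("STALE_LOCK | Removing stale lockfile");
    }
  }
  // Note: check-then-write is not atomic; two simultaneous starts could both pass the check
  writeFileSync(LOCK_PATH, String(process.pid));
  return true;
}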
Task Immutability: Preventing Cheating
One subtle but critical design decision: tasks.json.
When a mission starts, the orchestrator extracts all - [ ] checkboxes from MISSION.md and writes them to tasks.json. The agent may ONLY change the passes field from false to true. It cannot delete tasks, reorder them, rename them, or add new ones.
[
{ "id": 1, "description": "Create auth middleware", "passes": true },
{ "id": 2, "description": "Write auth tests", "passes": true },
{ "id": 3, "description": "Implement rate limiting", "passes": false },
{ "id": 4, "description": "Add API documentation", "passes": false }
]
Why? Because overnight agents will try to redefine the mission to match what they actually accomplished. Task immutability prevents that.
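A minimal sketch of both halves of this rule, extraction and enforcement, is below. The Task shape matches the tasks.json example above; the function names extractTasks and isLegalUpdate, and the choice to number tasks in order of appearance, are assumptions for illustration.

```typescript
interface Task {
  id: number;
  description: string;
  passes: boolean;
}

// Pull every unchecked "- [ ]" checkbox out of the mission text, in order.
function extractTasks(missionMarkdown: string): Task[] {
  const tasks: Task[] = [];
  for (const line of missionMarkdown.split("\n")) {
    const match = line.match(/- \[ \] (.+)$/);
    if (match) {
      tasks.push({ id: tasks.length + 1, description: match[1].trim(), passes: false });
    }
  }
  return tasks;
}

// Accept an updated task list only if it is the same list with some
// `passes` flags flipped from false to true. Deleting, adding, reordering,
// renaming, or un-completing a task all make the update illegal.
function isLegalUpdate(before: Task[], after: Task[]): boolean {
  if (before.length !== after.length) return false;
  return before.every((oldTask, index) => {
    const newTask = after[index];
    return (
      newTask.id === oldTask.id &&
      newTask.description === oldTask.description &&
      typeof newTask.passes === "boolean" &&
      (newTask.passes || !oldTask.passes) // true may never go back to false
    );
  });
}

// Usage: extract two tasks, then mark the first one complete.
const missionText = "## Tasks\n1. - [ ] Create auth middleware\n- Files: auth/middleware.ts\n2. - [ ] Write auth tests";
const originalTasks = extractTasks(missionText);
const updatedTasks = originalTasks.map((task) => (task.id === 1 ? { ...task, passes: true } : task));
console.log(isLegalUpdate(originalTasks, updatedTasks));
```

Under this scheme the orchestrator keeps its own copy of the last accepted tasks.json and rejects (or restores) any version the agent writes that fails the check.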
Comparison: Nightcrawler vs Ralph Loop vs Continuous Claude
| Feature | Nightcrawler | Ralph Loop | Continuous Claude |
|---|---|---|---|
| Core approach | Bounded episodes | Infinite loop | CI/CD loop |
| Context management | Structured handoffs | Filesystem only | PR-based |
| Crash recovery | launchd + checkpoints | Manual restart | GitHub Actions |
| Budget tracking | Per-episode + total cap | None | Cost limit |
| Stall detection | Diminishing returns algo | Completion promise | Iteration limit |
| Termination | 8 conditions | 2 conditions | 3 conditions |
| Truth checking | Git diff vs handoff | None | PR review |
| Process supervision | launchd (macOS native) | tmux/screen | Cloud CI |
| Notifications | Mobile push (Moshi) | None | GitHub notifications |
| Task immutability | tasks.json (flip-only) | None | None |
Ralph Loop is perfect for tasks that fit in a single session with a few retries. If your task is "fix this bug and run tests until they pass," Ralph is the right tool.
Nightcrawler is for when you want to hand the machine a 12-hour research mission or a 50-task implementation plan and walk away. The episodic architecture handles all the failure modes that destroy simple loops over extended runs.
Continuous Claude sits between them -- it is a CI/CD-style loop that creates PRs, waits for checks, and merges. It is great for teams that want autonomous contributions that go through their normal review process.
Mission Templates
Nightcrawler ships with two mission templates:
Research Mission (breadth -> depth -> synthesis)
# Mission: [Research Topic]
**Type:** research
**Max Duration:** 12 hours
**Max Episodes:** 24
## Depth Targets
- [ ] Survey the landscape: identify all major players and approaches
- [ ] Deep-dive: [subtopic 1]
- [ ] Deep-dive: [subtopic 2]
- [ ] Cross-reference: identify contradictions between sources
- [ ] Synthesize: write final analysis with confidence levels
- [ ] Bibliography: all sources cited with URLs
## Source Requirements
- Minimum 10 unique sources
- At least 3 academic papers
- Flag any claim with only 1 source as [UNVERIFIED]
Implementation Mission
# Mission: [Feature Name]
**Type:** implementation
**Max Duration:** 12 hours
**Max Episodes:** 24
## Tasks
1. - [ ] [Task 1]
- Files: [likely files]
- Success criteria: [how to verify]
2. - [ ] [Task 2]
- Files: [likely files]
- Success criteria: [how to verify]
Getting Started
Prerequisites
- Claude Code CLI installed (claude command available)
- macOS (for launchd supervision) or any Unix system (run directly)
- Node.js 18+ (for tsx)
Setup
# Clone
git clone https://github.com/thebasedcapital/nightcrawler.git ~/.nightcrawler
cd ~/.nightcrawler
# Install dependencies
npm install
# Write your mission
cp templates/MISSION-research.md missions/active/MISSION.md
# Edit MISSION.md with your actual mission
# Configure (optional -- defaults are sane)
# Edit config.json to set budget, duration, model, etc.
Run
# Direct run (stays in terminal)
npx tsx nightcrawler.ts
# Dry run (see what would happen without spending money)
npx tsx nightcrawler.ts --dry-run
# With launchd (survives terminal close, crash recovery)
cp com.user.nightcrawler.plist ~/Library/LaunchAgents/
launchctl load ~/Library/LaunchAgents/com.user.nightcrawler.plist
launchctl start com.user.nightcrawler
Monitor
# Watch logs
tail -f ~/.nightcrawler/logs/orchestrator.log
# Check state
cat ~/.nightcrawler/state/STATE.json | jq '.progress'
# Emergency stop
touch ~/.nightcrawler/state/STOP
Configuration
| Setting | Default | Description |
|---|---|---|
| max_duration_hours | 12 | Hard time limit |
| max_episodes | 24 | Maximum episode count |
| max_budget_usd | 50 | Total cost cap |
| budget_per_episode_usd | 5 | Per-episode cost cap |
| episode_timeout_seconds | 3600 | Single episode timeout |
| model | claude-opus-4-6 | Model for episodes |
| error_threshold | 10 | Max errors before stopping |
| cooldown_between_episodes_seconds | 10 | Pause between episodes |
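As a sketch of how these settings might be represented in code, the following types the table as a Config interface and merges user overrides onto the defaults. The setting names and default values are copied from the table; the resolveConfig function and its validation rules are assumptions for illustration, not Nightcrawler's actual loader.

```typescript
interface Config {
  max_duration_hours: number;
  max_episodes: number;
  max_budget_usd: number;
  budget_per_episode_usd: number;
  episode_timeout_seconds: number;
  model: string;
  error_threshold: number;
  cooldown_between_episodes_seconds: number;
}

// Defaults as listed in the configuration table.
const DEFAULT_CONFIG: Config = {
  max_duration_hours: 12,
  max_episodes: 24,
  max_budget_usd: 50,
  budget_per_episode_usd: 5,
  episode_timeout_seconds: 3600,
  model: "claude-opus-4-6",
  error_threshold: 10,
  cooldown_between_episodes_seconds: 10,
};

// Hypothetical loader: overlay overrides on the defaults and reject values
// that would silently disable a safety limit.
function resolveConfig(overrides: Partial<Config> = {}): Config {
  const config: Config = { ...DEFAULT_CONFIG, ...overrides };
  for (const [key, value] of Object.entries(config)) {
    // `!(value >= 0)` also catches NaN
    if (typeof value === "number" && !(value >= 0)) {
      throw new Error(`${key} must be a non-negative number, got ${value}`);
    }
  }
  if (config.budget_per_episode_usd > config.max_budget_usd) {
    throw new Error("budget_per_episode_usd cannot exceed max_budget_usd");
  }
  return config;
}

// Usage: tighten the total budget for a cheap experiment; everything else keeps its default.
const cheapRun = resolveConfig({ max_budget_usd: 20 });
console.log(cheapRun.max_budget_usd, cheapRun.max_episodes);
```

Failing at startup on a bad value fits the "fail safely" principle: a typo in config.json should stop the mission before the first episode, not surface as a missing budget cap at 3 AM.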
What Happens When You Wake Up
When Nightcrawler finishes (by any of the 8 termination conditions), it writes a completion report:
# Nightcrawler Completion Report
**Mission:** auth-system-implementation
**Status:** COMPLETED
**Reason:** mission_complete
**Started:** 2026-02-21T23:00:00Z
**Ended:** 2026-02-22T06:45:00Z
**Episodes:** 8
**Budget:** $18.50 of $50.00
**Tasks:** 6/6 completed
## Episode History
- Episode 1: exit=0, tasks_completed=1, duration=45m
- Episode 2: exit=0, tasks_completed=1, duration=38m
- Episode 3: exit=0, tasks_completed=1, duration=52m
- Episode 4: exit=1, tasks_completed=0, duration=12m (flaky test, recovered)
- Episode 5: exit=0, tasks_completed=1, duration=41m
- Episode 6: exit=0, tasks_completed=1, duration=35m
- Episode 7: exit=0, tasks_completed=1, duration=28m
- Episode 8: exit=0, tasks_completed=0, duration=15m (final verification)
## Errors
- Total: 1
- Recovered: 1
- Fatal: 0
You also get a push notification on your phone (via Moshi) when it completes, errors, or stops.
Design Principles
Sessions will end. Build around that. Do not fight context limits. Embrace bounded episodes.
Verify, do not trust. Cross-check handoffs against git. The agent's self-report is evidence, not truth.
Fail safely. 8 termination conditions, budget caps, error thresholds, human stop flag. The default behavior on any unexpected state is to stop, not to continue.
Immutable contracts. The agent cannot redefine the mission. tasks.json is a one-way ratchet: tasks can be completed, never removed.
Observable at all times. STATE.json, PROGRESS.jsonl, per-episode logs, per-episode checkpoints. If something goes wrong at 3 AM, you can reconstruct exactly what happened.
Native OS integration. launchd is not just a convenience -- it is a reliability mechanism. It handles crashes, sleep/wake, and process lifecycle better than any userspace solution.
Credits and Acknowledgments
- Geoffrey Huntley (@GeoffreyHuntley) for creating the Ralph Wiggum technique that proved autonomous agent loops are useful and inspired this project.
- Boris Cherny (@bcherny) for building Claude Code and the -p flag that makes headless operation possible.
- Anthropic for Claude and the Claude Code CLI.
- Anand Chowdhary (@AnandChowdhary) for Continuous Claude, which showed the CI/CD approach to autonomous loops.
What's Next
- Multi-model support: Use different models for different episode types (Sonnet for simple tasks, Opus for complex ones)
- Linux systemd support: Equivalent to launchd for Linux servers
- Web dashboard: Real-time monitoring of mission progress in the browser
- Mission chaining: Output of one mission feeds into the next
- Cost estimation: Predict budget usage before starting based on mission complexity
Nightcrawler is open source. If you run Claude Code for more than an hour at a time, give it a try.
If you have questions or want to share how your overnight mission went, find me on GitHub or X/Twitter.
Tags: #claude #ai #agents #automation #typescript #devtools