DEV Community: thebasedcapital

Why Your Overnight AI Agent Fails (And How Episodic Execution Fixes It)

thebasedcapital — Mon, 23 Feb 2026 04:12:14 +0000

How I built Nightcrawler -- an autonomous agent loop that runs Claude Code for 12+ hours with structured handoffs, crash recovery, and 8 termination conditions.

You go to bed. Claude Code is humming along, refactoring your test suite. You set it loose with --dangerously-skip-permissions and a prayer. You wake up 8 hours later.

What do you find?

If you are lucky: a completed task. If you are unlucky -- and I was unlucky many times -- you find one of these six disasters.

The 6 Death Spirals of Long-Running AI Agents

Every developer who has tried to run an AI coding agent overnight has hit at least one of these:

1. The Context Cliff

After 30-60 minutes of work, the model's effective context fills up. It starts forgetting what it did earlier. It re-reads files it already processed. It contradicts decisions it made 20 minutes ago. Eventually it starts undoing its own work.

2. The Hallucinated Handoff

The agent writes "I completed tasks 1-5" in its output. You check git. Tasks 2 and 4 were never actually done. The agent's self-assessment diverged from reality because it lost track of what it actually committed versus what it planned to do.

3. The Budget Inferno

You wake up to a $200 API bill. The agent hit a hard problem, generated massive prompts trying to solve it, and burned through your budget in a single multi-turn loop. No circuit breaker. No budget cap per work unit.

4. The Infinite Retry Loop

The agent encounters a flaky test. It tries to fix it. The fix breaks something else. It tries to fix that. The original test fails again. Three hours later, it is still cycling between the same two broken states with zero net progress.

5. The Silent Crash

The process dies at 2 AM -- an OOM kill, a network timeout, a macOS sleep event. Nobody notices. When you wake up, you find a half-finished refactor with uncommitted changes and no record of what was accomplished.

6. The Drift Spiral

The agent starts with clear intent. By hour 3, it has wandered so far from the original mission that it is adding features nobody asked for, refactoring code that was fine, and creating documentation files for a project that does not exist.

Why Simple Loops Do Not Work

The most popular approach to overnight agents is some variant of this:

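while true; do
  output=$(claude -p "Continue working on the project. If done, output DONE.")
  # Stop once the agent reports completion; otherwise pause briefly and repeat
  if printf '%s\n' "$output" | grep -q "DONE"; then break; fi
  sleep 5
done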

This is the Ralph Wiggum technique, popularized by Geoffrey Huntley. It is elegant, and it works for short tasks. But it rests on an assumption that breaks down overnight: that each fresh iteration can pick up from filesystem state alone.

For a 30-minute task, that is fine. For a 12-hour mission, you need more:

What was the agent working on when it stopped?
What decisions did it make and why?
What did it try that failed?
How much budget has been spent?
Is the agent still making progress, or is it spinning?
Did git actually change, or did the agent just claim it did?

A simple loop cannot answer these questions. That is the gap Nightcrawler fills.

The Solution: Episodic Execution

Nightcrawler does not try to keep one session alive forever. It accepts that sessions will end and builds reliability around that inevitability.

The core idea: bounded episodes with structured handoffs.

Mission.md
    |
    v
[Orchestrator] -----> [Episode 1: claude -p] -----> HANDOFF.md
    |                                                    |
    |  <-- read state, check termination conditions --   |
    |                                                    v
    +----------------> [Episode 2: claude -p] -----> HANDOFF.md
    |                                                    |
    |  <-- read state, check budget, detect stalling --  |
    |                                                    v
    +----------------> [Episode 3: claude -p] -----> HANDOFF.md
    |                                                    |
    ...                                                  |
    v                                                    v
COMPLETION_REPORT.md                              STATE.json

Each episode is a fresh claude -p session that:

Reads the mission (what to do)
Reads the handoff from the previous episode (what was done, what is next)
Verifies via git log and git diff that the handoff is truthful
Works on the highest-priority incomplete task
Writes a structured handoff for the next episode
Updates state (progress, errors, budget)

The orchestrator sits between episodes and decides: continue, or stop?

Architecture: ~500 Lines of TypeScript

Nightcrawler is a single TypeScript file (nightcrawler.ts, ~500 LOC) that orchestrates everything:

// The main loop is deceptively simple
while (true) {
  const { cont, reason } = shouldContinue(state, config);
  if (!cont) {
    // Write completion report and exit
    break;
  }

  const prompt = buildEpisodePrompt(state, episodeNum);
  const { exitCode, output } = await runEpisode(prompt, config, episodeNum);

  // Re-read state (agent may have updated it)
  // Track episode in history
  // Handle errors
  // Checkpoint
  // Cooldown
}

The Episode Prompt

Each episode receives a carefully constructed prompt that includes:

The skill instructions (how to behave as an autonomous agent)
The mission (what to accomplish)
The current state (progress, budget, errors)
The previous handoff (what was done, what is next)
The git context (recent commits, diff)
A task tracker (immutable tasks.json the agent can only mark complete)

import { execSync } from "node:child_process";

function buildEpisodePrompt(state: State, episodeNum: number): string {
  const mission = readText(MISSION_PATH);
  const handoff = fileExists(HANDOFF_PATH) ? readText(HANDOFF_PATH) : null;

  // Include git context for truth-checking.
  // Note: HEAD~1 does not exist in a repository with a single commit, so the diff call would throw there.
  const gitLog = execSync("git log --oneline -10").toString();
  const diff = execSync("git diff --stat HEAD~1").toString();

  // Build structured prompt with all context
  // ...
}

The 8 Termination Conditions

The orchestrator checks these before every episode:

function shouldContinue(state: State, config: Config) {
  // 1. Human stop flag (touch ~/.nightcrawler/state/STOP)
  if (fileExists(STOP_FLAG)) return { cont: false, reason: "human_stop_flag" };

  // 2. Agent said stop (mission complete or blocked)
  if (!state.termination_check.should_continue) return { cont: false, reason: "agent_terminated" };

  // 3. Episode limit (default: 24)
  if (state.current_episode >= config.max_episodes) return { cont: false, reason: "episode_limit" };

  // 4. Duration limit (default: 12 hours)
  const elapsed = (Date.now() - started) / (1000 * 60 * 60);
  if (elapsed >= config.max_duration_hours) return { cont: false, reason: "duration_limit" };

  // 5. Budget limit (default: $50)
  if (state.budget_spent_usd >= config.max_budget_usd) return { cont: false, reason: "budget_limit" };

  // 6. Error threshold (default: 10 errors)
  if (state.errors.total >= config.error_threshold) return { cont: false, reason: "error_threshold" };

  // 7. Fatal error
  if (state.errors.fatal > 0) return { cont: false, reason: "fatal_error" };

  // 8. Diminishing returns (< 0.5 tasks/episode for 3 consecutive)
  if (avgCompleted < 0.5) return { cont: false, reason: "diminishing_returns" };

  return { cont: true, reason: null };
}

That eighth condition -- diminishing returns detection -- is the one that prevents death spiral #4 (the infinite retry loop). If the agent averages fewer than 0.5 completed tasks per episode over three consecutive episodes, something is wrong and Nightcrawler stops.
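The check itself is small. Here is a minimal sketch of how it might be computed, not Nightcrawler's actual code: the `EpisodeRecord` shape and the `isDiminishingReturns` name are assumptions for illustration, while the window of 3 episodes and the 0.5 threshold come from the condition above.

```typescript
// Hypothetical shape of one entry in the episode history.
interface EpisodeRecord {
  episode: number;
  tasksCompleted: number;
}

// True when the last `window` episodes average fewer than `minAvg` completed tasks.
// Returns false until `window` episodes exist, so a slow start does not trip it.
function isDiminishingReturns(
  history: EpisodeRecord[],
  window = 3,
  minAvg = 0.5,
): boolean {
  if (history.length < window) return false;
  const recent = history.slice(-window);
  const total = recent.reduce((sum, ep) => sum + ep.tasksCompleted, 0);
  return total / window < minAvg;
}

const stalled: EpisodeRecord[] = [
  { episode: 1, tasksCompleted: 2 },
  { episode: 2, tasksCompleted: 0 },
  { episode: 3, tasksCompleted: 1 },
  { episode: 4, tasksCompleted: 0 },
];
// Last three episodes: (0 + 1 + 0) / 3 is about 0.33, below 0.5, so this is true.
const stalledResult = isDiminishingReturns(stalled);
```

Averaging over a window, instead of requiring zero progress, lets one slow episode pass without ending the mission while still catching a sustained stall.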

The Handoff Protocol

Every episode must write a structured HANDOFF.md before finishing:

# Episode 3 Handoff

## Summary
Implemented the authentication middleware and wrote 12 tests.
All tests passing. Rate limiting is next.

## Work Completed
- Created auth/middleware.ts with JWT validation
- Added 12 test cases in auth/middleware.test.ts
- Updated routes/api.ts to use the middleware

## In-Progress Work
- File: auth/rate-limiter.ts
- What's left: Implement sliding window rate limiting

## Key Context for Next Episode
- JWT secret is loaded from env var AUTH_SECRET
- The middleware expects Bearer tokens, not Basic auth
- rate-limiter.ts exists but is empty scaffolding

## Files Modified
- auth/middleware.ts: new file, JWT validation
- auth/middleware.test.ts: new file, 12 tests
- routes/api.ts: added authMiddleware to all /api/* routes

## Decisions Made
- Used jose library over jsonwebtoken: async-first, better types

## Errors Encountered
- None this episode

This handoff is not optional. The skill instructions loaded into every episode enforce it. The orchestrator also puts git log and git diff output into the next episode's prompt so the handoff can be cross-checked -- the agent cannot claim it changed files that git says were not modified.
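To make the cross-check concrete, here is a rough sketch that compares the "Files Modified" section of a handoff against a list of changed paths. The function names are hypothetical, and the sketch assumes the changed paths were collected separately (for example from `git diff --name-only` over the episode's commits). Nightcrawler itself may do this differently, since the post describes the check as happening through the episode prompt.

```typescript
// Extract file paths from the "## Files Modified" section of a handoff.
// Assumes each entry looks like "- path/to/file.ts: description", as in the sample above.
function claimedFiles(handoff: string): string[] {
  const files: string[] = [];
  let inSection = false;
  for (const line of handoff.split("\n")) {
    if (line.startsWith("## ")) {
      inSection = line.trim() === "## Files Modified";
      continue;
    }
    const match = inSection ? /^- ([^:]+):/.exec(line.trim()) : null;
    if (match) files.push(match[1].trim());
  }
  return files;
}

// Files the handoff claims were modified that the git-reported list does not contain.
function unverifiedClaims(handoff: string, changedInGit: string[]): string[] {
  const changed = new Set(changedInGit);
  return claimedFiles(handoff).filter((file) => !changed.has(file));
}

const sampleHandoff = [
  "# Episode 3 Handoff",
  "## Files Modified",
  "- auth/middleware.ts: new file, JWT validation",
  "- routes/api.ts: added authMiddleware to all /api/* routes",
  "## Decisions Made",
  "- Used jose library over jsonwebtoken: async-first, better types",
].join("\n");

// Git only reports auth/middleware.ts, so routes/api.ts is an unverified claim.
const missing = unverifiedClaims(sampleHandoff, ["auth/middleware.ts"]);
// missing is ["routes/api.ts"]
```

A non-empty result is a signal to distrust the handoff, which is the "verify, do not trust" idea applied mechanically.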

Process Supervision with launchd

Nightcrawler runs under macOS launchd, which provides:

Crash recovery: If the process dies, launchd restarts it (with a 30-second throttle)
Sleep/wake handling: launchd manages macOS sleep events correctly
Background execution: Runs with Nice: 5 (lower priority) and ProcessType: Background
Hard timeout: 12-hour maximum via TimeOut: 43200
Logging: stdout/stderr redirected to log files

<key>KeepAlive</key>
<dict>
    <key>Crashed</key>
    <true/>
    <key>SuccessfulExit</key>
    <false/>
</dict>

The process also uses a PID-based lockfile to prevent duplicate instances:

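function acquireLock(): boolean {
  if (fileExists(LOCK_PATH)) {
    try {
      const pid = parseInt(readText(LOCK_PATH).trim(), 10);
      process.kill(pid, 0); // Signal 0 sends nothing; it only checks that the process exists
      return false; // Another instance is still running
    } catch {
      // kill() threw: the recorded PID is dead or unreadable, so treat the lock as stale
      log("STALE_LOCK | Removing stale lockfile");
    }
  }
  // Overwrite any stale lock with our own PID
  writeFileSync(LOCK_PATH, String(process.pid));
  return true;
}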

Task Immutability: Preventing Cheating

One subtle but critical design decision: tasks.json.

When a mission starts, the orchestrator extracts all - [ ] checkboxes from MISSION.md and writes them to tasks.json. The agent may ONLY change the passes field from false to true. It cannot delete tasks, reorder them, rename them, or add new ones.

[
  { "id": 1, "description": "Create auth middleware", "passes": true },
  { "id": 2, "description": "Write auth tests", "passes": true },
  { "id": 3, "description": "Implement rate limiting", "passes": false },
  { "id": 4, "description": "Add API documentation", "passes": false }
]

Why? Because overnight agents will try to redefine the mission to match what they actually accomplished. Task immutability prevents that.
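The flip-only rule is easy to enforce mechanically. The following is a sketch of one possible validator, assuming the orchestrator keeps a copy of tasks.json from before the episode and compares it with the version the agent left behind. The `Task` interface mirrors the JSON above; `checkTaskUpdate` is a hypothetical name.

```typescript
interface Task {
  id: number;
  description: string;
  passes: boolean;
}

// Returns null when `after` is a legal update of `before`, otherwise a reason string.
// Legal means: same length, same ids and descriptions in the same order,
// and `passes` either unchanged or flipped from false to true.
function checkTaskUpdate(before: Task[], after: Task[]): string | null {
  if (before.length !== after.length) return "task count changed";
  for (let i = 0; i < before.length; i++) {
    const prev = before[i];
    const next = after[i];
    if (prev.id !== next.id) return `task order or id changed at index ${i}`;
    if (prev.description !== next.description) return `task ${prev.id} was renamed`;
    if (prev.passes && !next.passes) return `task ${prev.id} was un-completed`;
  }
  return null;
}

const original: Task[] = [
  { id: 3, description: "Implement rate limiting", passes: false },
  { id: 4, description: "Add API documentation", passes: false },
];

// Flipping task 3 to passes: true is allowed.
const legal = checkTaskUpdate(original, [{ ...original[0], passes: true }, original[1]]);
// Dropping task 4 is rejected with "task count changed".
const illegal = checkTaskUpdate(original, [original[0]]);
```

If the check fails, restoring the previous copy of tasks.json would be one way to keep the ratchet intact.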

Comparison: Nightcrawler vs Ralph Loop vs Continuous Claude

Feature	Nightcrawler	Ralph Loop	Continuous Claude
Core approach	Bounded episodes	Infinite loop	CI/CD loop
Context management	Structured handoffs	Filesystem only	PR-based
Crash recovery	launchd + checkpoints	Manual restart	GitHub Actions
Budget tracking	Per-episode + total cap	None	Cost limit
Stall detection	Diminishing returns algo	Completion promise	Iteration limit
Termination	8 conditions	2 conditions	3 conditions
Truth checking	Git diff vs handoff	None	PR review
Process supervision	launchd (macOS native)	tmux/screen	Cloud CI
Notifications	Mobile push (Moshi)	None	GitHub notifications
Task immutability	tasks.json (flip-only)	None	None

Ralph Loop is perfect for tasks that fit in a single session with a few retries. If your task is "fix this bug and run tests until they pass," Ralph is the right tool.

Nightcrawler is for when you want to hand the machine a 12-hour research mission or a 50-task implementation plan and walk away. The episodic architecture handles all the failure modes that destroy simple loops over extended runs.

Continuous Claude sits between them -- it is a CI/CD-style loop that creates PRs, waits for checks, and merges. It is great for teams that want autonomous contributions that go through their normal review process.

Mission Templates

Nightcrawler ships with two mission templates:

Research Mission (breadth -> depth -> synthesis)

# Mission: [Research Topic]

**Type:** research
**Max Duration:** 12 hours
**Max Episodes:** 24

## Depth Targets
- [ ] Survey the landscape: identify all major players and approaches
- [ ] Deep-dive: [subtopic 1]
- [ ] Deep-dive: [subtopic 2]
- [ ] Cross-reference: identify contradictions between sources
- [ ] Synthesize: write final analysis with confidence levels
- [ ] Bibliography: all sources cited with URLs

## Source Requirements
- Minimum 10 unique sources
- At least 3 academic papers
- Flag any claim with only 1 source as [UNVERIFIED]

Implementation Mission

# Mission: [Feature Name]

**Type:** implementation
**Max Duration:** 12 hours
**Max Episodes:** 24

## Tasks
1. - [ ] [Task 1]
   - Files: [likely files]
   - Success criteria: [how to verify]

2. - [ ] [Task 2]
   - Files: [likely files]
   - Success criteria: [how to verify]

Getting Started

Prerequisites

Claude Code CLI installed (claude command available)
macOS (for launchd supervision) or any Unix system (run directly)
Node.js 18+ (for tsx)

Setup

# Clone
git clone https://github.com/thebasedcapital/nightcrawler.git ~/.nightcrawler
cd ~/.nightcrawler

# Install dependencies
npm install

# Write your mission
cp templates/MISSION-research.md missions/active/MISSION.md
# Edit MISSION.md with your actual mission

# Configure (optional -- defaults are sane)
# Edit config.json to set budget, duration, model, etc.

Run

# Direct run (stays in terminal)
npx tsx nightcrawler.ts

# Dry run (see what would happen without spending money)
npx tsx nightcrawler.ts --dry-run

# With launchd (survives terminal close, crash recovery)
cp com.user.nightcrawler.plist ~/Library/LaunchAgents/
launchctl load ~/Library/LaunchAgents/com.user.nightcrawler.plist
launchctl start com.user.nightcrawler

Monitor

# Watch logs
tail -f ~/.nightcrawler/logs/orchestrator.log

# Check state
cat ~/.nightcrawler/state/STATE.json | jq '.progress'

# Emergency stop
touch ~/.nightcrawler/state/STOP

Configuration

Setting	Default	Description
`max_duration_hours`	12	Hard time limit
`max_episodes`	24	Maximum episode count
`max_budget_usd`	50	Total cost cap
`budget_per_episode_usd`	5	Per-episode cost cap
`episode_timeout_seconds`	3600	Single episode timeout
`model`	claude-opus-4-6	Model for episodes
`error_threshold`	10	Max errors before stopping
`cooldown_between_episodes_seconds`	10	Pause between episodes
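In code, these defaults could be modeled as a typed object that a partial config.json overrides. This is a sketch under assumptions: the `Config` interface, `DEFAULT_CONFIG`, and `resolveConfig` are illustrative names, and the sanity check is an addition of this example, not something the table documents.

```typescript
interface Config {
  max_duration_hours: number;
  max_episodes: number;
  max_budget_usd: number;
  budget_per_episode_usd: number;
  episode_timeout_seconds: number;
  model: string;
  error_threshold: number;
  cooldown_between_episodes_seconds: number;
}

// Defaults copied from the table above.
const DEFAULT_CONFIG: Config = {
  max_duration_hours: 12,
  max_episodes: 24,
  max_budget_usd: 50,
  budget_per_episode_usd: 5,
  episode_timeout_seconds: 3600,
  model: "claude-opus-4-6",
  error_threshold: 10,
  cooldown_between_episodes_seconds: 10,
};

// Merge user overrides over the defaults and reject an obviously inconsistent budget.
function resolveConfig(overrides: Partial<Config> = {}): Config {
  const config: Config = { ...DEFAULT_CONFIG, ...overrides };
  if (config.budget_per_episode_usd > config.max_budget_usd) {
    throw new Error("budget_per_episode_usd exceeds max_budget_usd");
  }
  return config;
}

// A cheaper overnight run: only the two overridden fields change.
const overnight = resolveConfig({ max_budget_usd: 20, max_episodes: 12 });
```

With the defaults, 24 episodes at up to $5 each would allow $120 of spend, so the $50 total cap is the limit that binds first.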

What Happens When You Wake Up

When Nightcrawler finishes (by any of the 8 termination conditions), it writes a completion report:

# Nightcrawler Completion Report

**Mission:** auth-system-implementation
**Status:** COMPLETED
**Reason:** mission_complete
**Started:** 2026-02-21T23:00:00Z
**Ended:** 2026-02-22T06:45:00Z
**Episodes:** 8
**Budget:** $18.50 of $50.00
**Tasks:** 6/6 completed

## Episode History
- Episode 1: exit=0, tasks_completed=1, duration=45m
- Episode 2: exit=0, tasks_completed=1, duration=38m
- Episode 3: exit=0, tasks_completed=1, duration=52m
- Episode 4: exit=1, tasks_completed=0, duration=12m (flaky test, recovered)
- Episode 5: exit=0, tasks_completed=1, duration=41m
- Episode 6: exit=0, tasks_completed=1, duration=35m
- Episode 7: exit=0, tasks_completed=1, duration=28m
- Episode 8: exit=0, tasks_completed=0, duration=15m (final verification)

## Errors
- Total: 1
- Recovered: 1
- Fatal: 0

You also get a push notification on your phone (via Moshi) when it completes, errors, or stops.

Design Principles

Sessions will end. Build around that. Do not fight context limits. Embrace bounded episodes.
Verify, do not trust. Cross-check handoffs against git. The agent's self-report is evidence, not truth.
Fail safely. 8 termination conditions, budget caps, error thresholds, human stop flag. The default behavior on any unexpected state is to stop, not to continue.
Immutable contracts. The agent cannot redefine the mission. tasks.json is a one-way ratchet: tasks can be completed, never removed.
Observable at all times. STATE.json, PROGRESS.jsonl, per-episode logs, per-episode checkpoints. If something goes wrong at 3 AM, you can reconstruct exactly what happened.
Native OS integration. launchd is not just a convenience -- it is a reliability mechanism. It handles crashes, sleep/wake, and process lifecycle more reliably than a hand-rolled supervisor script or a tmux session.

Credits and Acknowledgments

Geoffrey Huntley (@GeoffreyHuntley) for creating the Ralph Wiggum technique that proved autonomous agent loops are useful and inspired this project.
Boris Cherny (@bcherny) for building Claude Code and the -p flag that makes headless operation possible.
Anthropic for Claude and the Claude Code CLI.
Anand Chowdhary (@AnandChowdhary) for Continuous Claude, which showed the CI/CD approach to autonomous loops.

What's Next

Multi-model support: Use different models for different episode types (Sonnet for simple tasks, Opus for complex ones)
Linux systemd support: Equivalent to launchd for Linux servers
Web dashboard: Real-time monitoring of mission progress in the browser
Mission chaining: Output of one mission feeds into the next
Cost estimation: Predict budget usage before starting based on mission complexity

Nightcrawler is open source. If you run Claude Code for more than an hour at a time, give it a try.

If you have questions or want to share how your overnight mission went, find me on GitHub or X/Twitter.

Tags: #claude #ai #agents #automation #typescript #devtools

Why I Ditched RAG for Hebbian Synapses (and My Agent Actually Got Faster)

thebasedcapital — Sun, 22 Feb 2026 11:32:55 +0000

Every agent memory system I tried follows the same playbook: embed text into vectors, store them in a database, cosine search on recall. It works for storing facts. But my Claude Code agent kept doing the same dumb thing — grepping for auth.ts every single session, even though I open it 10 times a day.

The problem: RAG gives agents declarative memory (facts). It doesn't give them procedural memory (learned behavior). No amount of vector embeddings will teach an agent that auth.ts and session.ts always go together.

The neuroscience answer

Real brains don't embed memories into vectors. They form synaptic connections through repeated co-activation. Donald Hebb described this principle in 1949; it is now commonly summarized as "neurons that fire together wire together."

I built BrainBox to apply this to coding agents.

How it works

BrainBox has three primitives:

Neurons — represent files, tools, and errors. Created automatically when your agent interacts with them.

Synapses — connections between neurons. When you access auth.ts then session.ts within the same session, a synapse forms. Access them together 10 more times and the synapse strengthens.

Myelination — real neurons wrap frequently-used axons in myelin sheaths for faster signal propagation. BrainBox does the same: once a synapse crosses a threshold, it becomes a "superhighway" — instant recall, maximum confidence.

The math (simplified)

Synapse strengthening uses SNAP sigmoid plasticity:

delta = learning_rate * sigmoid_gain(current_weight)

Where sigmoid_gain makes strong synapses resist further strengthening (prevents any single connection from dominating). This mirrors real synaptic saturation.
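To illustrate the shape of that update, here is a sketch with one plausible gain curve. The post does not give the exact form of sigmoid_gain, so the logistic function, the midpoint of 0.5, the steepness of 10, and the learning rate of 0.1 are all assumptions chosen for illustration, not BrainBox's actual parameters.

```typescript
// Hypothetical gain curve: close to 1 for weak synapses, falling toward 0 as the
// weight approaches 1. Midpoint and steepness are illustrative assumptions.
function sigmoidGain(weight: number, midpoint = 0.5, steepness = 10): number {
  return 1 / (1 + Math.exp(steepness * (weight - midpoint)));
}

// One co-activation: delta = learning_rate * sigmoid_gain(current_weight),
// with the result clamped so the weight never exceeds 1.
function strengthen(weight: number, learningRate = 0.1): number {
  const delta = learningRate * sigmoidGain(weight);
  return Math.min(1, weight + delta);
}

// A weak synapse gains far more from one co-activation than a strong one.
const weakGain = strengthen(0.1) - 0.1;   // gain is about 0.98, so delta is about 0.098
const strongGain = strengthen(0.9) - 0.9; // gain is about 0.018, so delta is about 0.002
```

Any decreasing gain curve would produce the same qualitative saturation; the logistic form is just a convenient one to reason about.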

Spreading activation (Collins & Loftus, 1975):

When you recall auth.ts, BrainBox activates it, then propagates activation to connected neurons through 2-hop BFS. Each hop decays by 1/sqrt(degree) (Anderson's fan effect). Files 2 hops away get weaker activation than direct neighbors.
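A compact sketch of that propagation follows. It makes several assumptions the post does not spell out: the graph is stored as an adjacency map, "degree" means the number of outgoing synapses of the node doing the spreading, activation is also scaled by synapse weight, and a node reached by several paths on the same hop keeps the strongest value. File names other than auth.ts and session.ts are made up for the example.

```typescript
interface Synapse {
  to: string;
  weight: number; // connection strength in [0, 1]
}
type Graph = Map<string, Synapse[]>;

// Breadth-first spreading activation from `seed`, limited to `maxHops` hops.
function spreadActivation(graph: Graph, seed: string, maxHops = 2): Map<string, number> {
  const activation = new Map<string, number>([[seed, 1]]);
  let frontier = [seed];
  for (let hop = 0; hop < maxHops; hop++) {
    const reached = new Map<string, number>();
    for (const node of frontier) {
      const edges = graph.get(node) ?? [];
      // Fan effect: the more connections a node has, the less each one receives.
      const fanDecay = 1 / Math.sqrt(Math.max(1, edges.length));
      const source = activation.get(node) ?? 0;
      for (const { to, weight } of edges) {
        if (activation.has(to)) continue; // already reached on an earlier hop
        const value = source * weight * fanDecay;
        reached.set(to, Math.max(reached.get(to) ?? 0, value));
      }
    }
    for (const [node, value] of reached) activation.set(node, value);
    frontier = [...reached.keys()];
  }
  return activation;
}

const graph: Graph = new Map<string, Synapse[]>([
  ["auth.ts", [
    { to: "session.ts", weight: 1.0 },
    { to: "db.ts", weight: 0.5 },
    { to: "routes.ts", weight: 0.5 },
    { to: "utils.ts", weight: 0.25 },
  ]],
  ["session.ts", [{ to: "cookies.ts", weight: 0.8 }]],
]);

// auth.ts has 4 outgoing synapses, so each hop out of it is scaled by 1/sqrt(4) = 0.5.
// session.ts (direct neighbor) gets 0.5; cookies.ts (two hops) gets 0.5 * 0.8 = 0.4.
const activated = spreadActivation(graph, "auth.ts");
```

Sorting the resulting map by value would give a ranked list of files to suggest alongside the one being recalled.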

Decay follows Ebbinghaus forgetting curves — unused connections weaken naturally:

Activation: -15% per cycle
Synapses: -2% per cycle
Myelination: -0.5% per cycle
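To get a feel for these rates, here is a small sketch that treats each one as a multiplicative loss per cycle, so a value shrinks by a fixed fraction every cycle. That compounding interpretation is an assumption; the post lists only the percentages. The names `DECAY_PER_CYCLE`, `decayed`, and `halfLifeCycles` are illustrative.

```typescript
// Per-cycle decay rates from the list above, as fractions.
const DECAY_PER_CYCLE = {
  activation: 0.15,
  synapse: 0.02,
  myelination: 0.005,
};

// Value remaining after `cycles` idle cycles, assuming compounding decay.
function decayed(value: number, ratePerCycle: number, cycles: number): number {
  return value * Math.pow(1 - ratePerCycle, cycles);
}

// Number of idle cycles until a value falls to half of where it started.
function halfLifeCycles(ratePerCycle: number): number {
  return Math.log(0.5) / Math.log(1 - ratePerCycle);
}

const synapseAfterTen = decayed(1, DECAY_PER_CYCLE.synapse, 10); // 0.98^10, about 0.817
const activationHalfLife = halfLifeCycles(DECAY_PER_CYCLE.activation); // about 4.3 cycles
```

Under this interpretation, activation halves in roughly 4 cycles, a synapse in roughly 34, and myelination in roughly 138, which matches the intent that superhighways are the slowest thing to fade.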

Results from production

After 5 hours of real Claude Code usage:

79 neurons, 3,554 synapses formed
67% top-1 recall accuracy (the first file recalled was the right one)
8.9% gross token savings (fewer grep/search operations)
<5ms recall latency
3 superhighways formed naturally

Install

npm install brainbox-hebbian

Auto-configures Claude Code hooks. Works with any MCP agent. Full whitepaper in the repo.

GitHub: github.com/thebasedcapital/brainbox

Why not both?

BrainBox is complementary to RAG, not a replacement. RAG handles L2 (declarative facts). BrainBox handles L3 (behavioral patterns). The 4-layer memory model:

Layer	What	Examples
L1	Buffer	Context window, chat history
L2	Declarative	Mem0, Zep, SuperMemory (facts/conversations)
L3	Procedural	BrainBox (behavior patterns, muscle memory)
L4	Identity	Personal style, values, habits

L3 was empty until now. That's the gap BrainBox fills.

Full algorithm details in WHITEPAPER.md — covers SNAP plasticity, BCM theory, spreading activation, error-fix learning, and production benchmarks.