DEV Community: Beni

Our Agent's #1 Failure Mode: Thinking

Beni — Fri, 27 Mar 2026 16:43:11 +0000

Thirty-three tasks. Four projects. $32.93. Time to read the spreadsheet.

MissionControl has been running for a week. Quick context if you're just joining: autonomous dev agent. Describe a coding task in Telegram, it spawns a Claude Code session, builds the feature, opens a PR on GitHub. Post 1 covered the 16-hour build. Posts 2 through 5 covered the bugs, the trust chain, the architecture, and a task that deployed a full MVP then got marked as failed. All anecdotal. Now there's enough data to stop telling stories and start reading spreadsheets.

The Raw Numbers

Metric	Value
Tasks created	33
Completed	12 (36%)
Failed	19 (58%)
Cancelled	2 (6%)
Total spend	$32.93

36% completion rate. Worse than the roughly 50% that Post 4 reported for new tasks after the bug-fix sprint. But the raw number lies: it's weighed down by early infrastructure failures that no longer exist. Strip those out and the picture changes.

Where the Money Went

Not all failures are equal. Some cost pennies. One category cost almost $9.

"No commits produced" — 5 tasks, $8.88

The real failure mode. Five tasks where Opus ran for its full budget or turn limit and produced zero commits. Tasks #20, #23, #25, #27, #29 — all greenfield builds ("Build a full-stack...") on $2 budgets.

The pattern is consistent: Opus starts by reading the entire codebase. Then it plans. Then it plans more. Explores alternative approaches. Considers edge cases it will never hit. By the time it's ready to write code, the budget is gone.

$8.88 burned on thinking. Not a single line committed.

API and infra failures — 10 tasks, $0.69

Ten tasks failed on infrastructure issues — all fixed since. Anthropic API 500s during early testing (4 tasks, $0.69). Missing sudo, stale OAuth tokens, missing worker user (6 tasks, $0). Resolved in the first week. Noise in the data now.

Timeout — 1 task

Default timeout was too short for a full-stack build on a 2-core box. Bumped it. Hasn't recurred.

CLI quirk — 1 task

--print combined with --output-format=stream-json silently requires --verbose. Without it, the CLI exits 1 with no useful error. Fixed in worker.ts.
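
One way to keep this quirk from coming back is to build the argument list in a single place, so the flag pairing can't be forgotten. A minimal sketch: `buildCliArgs` and `CliOptions` are hypothetical names, not the contents of worker.ts, and only the three flags named above come from this post.

```typescript
// Hypothetical helper; the option names are illustrative only.
interface CliOptions {
  prompt: string;
  streamJson: boolean;
}

function buildCliArgs(opts: CliOptions): string[] {
  const args = ["--print", opts.prompt];
  if (opts.streamJson) {
    // As described above: --print with stream-json output needs --verbose,
    // otherwise the CLI exits 1 with no useful error.
    args.push("--output-format=stream-json", "--verbose");
  }
  return args;
}
```

Callers never pass `--verbose` themselves; asking for stream JSON is enough.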

The Funnel

Signal separated from noise:

33 total tasks
 - 10 infra/API failures (fixed, no longer relevant)
 -  2 cancelled
 -  1 timeout (fixed)
 -  1 CLI quirk (fixed)
 = 19 real attempts
 - 12 completed
 -  5 "no commits" (the actual problem)
 -  2 other failures

Strip the noise: roughly 63% on real attempts. Not bad for an autonomous agent with no human in the loop. But 5 tasks and $8.88 wasted on overthinking — that's the leak.
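
The funnel arithmetic, spelled out as a small sketch. The counts are copied from the list above; the object and function names are made up for illustration.

```typescript
// Counts copied from the funnel above.
const funnel = {
  total: 33,
  infraOrApi: 10,
  cancelled: 2,
  timeout: 1,
  cliQuirk: 1,
  completed: 12,
};

// Tasks that actually got a fair shot at producing code.
function realAttempts(f: typeof funnel): number {
  return f.total - f.infraOrApi - f.cancelled - f.timeout - f.cliQuirk;
}

// Completion rate over real attempts, rounded to a whole percent.
function adjustedCompletionPct(f: typeof funnel): number {
  return Math.round((f.completed / realAttempts(f)) * 100);
}
```

With these inputs, `realAttempts(funnel)` is 19 and `adjustedCompletionPct(funnel)` is 63.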

Model Economics

Model	Tasks	Cost	Avg/Task	Raw Success	Adjusted
Opus	30	$30.65	$1.02	30% (9/30)	50% (9/18)
Sonnet	3	$2.28	$0.76	100% (3/3)	100% (3/3)

Three data points isn't a sample size. But the pattern is worth noting.

Opus's failure mode is overthinking. Reads everything, considers everything, plans extensively. On a constrained budget, that means it runs out of money before it writes code. On greenfield builds — where the codebase is small and the task is "just build it" — this is exactly wrong.

Sonnet's strength is mechanical execution. Clear task, does the task. No exploration spirals. No alternative-architecture tangents. Three tasks, three completions, $0.76 average.

This isn't "Sonnet is better." It's about matching the model to the task shape: Opus for complex modifications to large codebases where understanding context matters, Sonnet for greenfield builds and mechanical fixes where the path is clear.

Three Changes We Made

The data pointed to three specific interventions. Shipped all three before starting the next batch.

1. Doubled All Budgets

Parameter	Old	New
Default task budget	$5	$10
Max task budget	$10	$20
Daily budget cap	$50	$100

The hypothesis: "no commits produced" isn't an intelligence failure — it's a budget failure. Opus needs room to think and build. At $2, it can do one or the other. At $4-10, it can do both.

This is a bet. If doubling budgets converts those five failures into completions, the ROI is obvious — spending $4 to get working code beats spending $2 to get nothing. If it doesn't, we have a deeper problem that money won't fix.

2. Two-Phase Reviews

Single-phase reviews were inconsistent. Task #33 came back with "Done" and no detail. Task #31 found a real bug. Same prompt, different quality. Split analysis from execution.

Phase 1 — Opus analyzes. Read-only access. Reviews the PR diff against a structured checklist: logic errors, security, styling, imports, TypeScript compliance. Outputs a machine-readable verdict:

<!-- REVIEW_VERDICT {"approved": false, "issues": [
  "src/components/VotingPanel.tsx:42 — duplicate accent color logic",
  "src/components/Icon.tsx — missing style?: CSSProperties prop"
]} -->

Budget: $1.50. Model: Opus. Tools: read-only (Bash, Read, Glob, Grep).
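
A verdict in that shape can be pulled out of the review output with a few lines. This is a sketch under assumptions, not MissionControl's actual parser: `parseReviewVerdict` is a hypothetical name, and the only thing taken from the post is the comment format shown above.

```typescript
interface ReviewVerdict {
  approved: boolean;
  issues: string[];
}

// Returns null when no well-formed verdict comment is present,
// so the caller can treat a bare "Done" as an unusable review.
function parseReviewVerdict(output: string): ReviewVerdict | null {
  const match = /<!--\s*REVIEW_VERDICT\s*([\s\S]*?)\s*-->/.exec(output);
  if (!match) return null;
  try {
    const parsed: unknown = JSON.parse(match[1]);
    if (typeof parsed !== "object" || parsed === null) return null;
    const v = parsed as { approved?: unknown; issues?: unknown };
    if (typeof v.approved !== "boolean" || !Array.isArray(v.issues)) return null;
    if (!v.issues.every((i) => typeof i === "string")) return null;
    return { approved: v.approved, issues: v.issues as string[] };
  } catch {
    return null; // malformed JSON inside the comment
  }
}
```

Returning null instead of throwing keeps "no verdict" and "bad verdict" on the same path: neither should auto-create a fix task.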

Phase 2 — Sonnet fixes. If Phase 1 finds issues, a child task is auto-created. Sonnet gets the issue list, fixes each one, runs tsc --noEmit and npm run build, commits, and pushes.

Budget: $1.00. Model: Sonnet. Tools: full access.

Already caught real bugs in production PRs. The duplicate accent color in VotingPanel would have shipped. The missing style prop on icon components would have caused runtime issues in any consumer passing inline styles. Total review cost: $2.50 for analysis plus fixes — cheaper than a single Opus task that might or might not find anything.

3. Commit-Early Culture

The lead dev prompt now emphasizes incremental commits over perfect final PRs. Old pattern: plan everything, build everything, commit once at the end. Budget runs out before that final commit — zero output.

New pattern: commit after each meaningful unit of work. A partial feature with three commits is infinitely more valuable than a complete feature with zero commits.

Can't force the model to commit early — it's guidance, not enforcement. But combined with higher budgets, the goal is to shift the failure mode from "zero output" to "partial output." Partial output can be retried. Zero output is wasted money.

What We're Watching

Batch 2 starts now. Three questions:

Does doubling budgets convert failures? If the five "no commits" tasks would have succeeded at $4-10, the completion rate will show it. If they still fail at higher budgets, the problem is in the prompt or the task shape, not the money.

Does two-phase review scale? Three review tasks isn't a pattern. Need 15-20 to know if the structured verdict format is reliable and if Sonnet consistently fixes what Opus finds.

Can we auto-calibrate? A greenfield build and a one-line config change shouldn't share a budget. Considering scope-size flags — small, medium, large — that auto-set budget and timeout based on expected complexity. Not built yet. Waiting for more data to set the thresholds.

The Takeaway

Thirty-three tasks taught us more than building the system did. The system works. The question was always "how well?" Now we know: ~63% on real attempts, with a clear #1 failure mode we can measure and attack.

Not crashes. Not bugs. Not infrastructure. The agent thinks too much and ships nothing. Solvable problem. Higher budgets give it room. Two-phase reviews separate thinking from doing. Commit-early guidance reduces the blast radius of a timeout.

$32.93 for 33 tasks and a clear roadmap for improvement. Not bad.

Next up: batch 2 results — did the changes work?

Error!! Failed Successfully

Beni — Thu, 26 Mar 2026 16:07:35 +0000

Part 5 of the MissionControl series. MissionControl is a Telegram bot that takes coding tasks in plain English, spawns a Claude Code session, and ships pull requests autonomously. Post 4 covered the safety stack and architecture after the first 48 hours.

The Notification

12:16 AM UTC. Telegram notification:

Task #23 failed: No commits produced despite success claim

Opened the Vercel URL anyway. The app loaded. Login screen with four demo users. Picked "Jordan," dragged some flight options around, watched the Borda count scores update in real time. Fully functional group travel planner. Mobile responsive. TypeScript strict. Deployed to production.

The database said failed. The app said otherwise.

The Task

Task #23 was a stress test. After the safety stack work in Post 4 — budget caps, timeouts, commit verification — the question was simple: could the bot take a complex, multi-feature MVP brief and ship it end-to-end? No human intervention, no hand-holding, no retries.

The prompt: a detailed spec for a group travel planning app. React + Tailwind, deployed to Vercel, with trip creation, drag-to-rank voting across multiple categories (flights, lodging, activities), four demo users with different roles, pre-seeded data, and an admin/member permission split. About 250 words — a product brief you'd hand a junior developer on day one.

Full task prompt

Build a group travel planning web app (React + Tailwind, deploy to Vercel).

Core concept: One person creates a trip, invites others. The lead planner
adds options for each category; all members vote via drag-to-rank. Results
are visible to everyone in real time.

MVP Sections:
- Flights
- Lodging (hotel or Airbnb)
- Activities & Dining

Each section lets the planner add multiple options (name, price, link,
image). Members drag to rank them. The top-ranked option is highlighted
as the group pick.

Demo Setup:
- No auth needed — login screen with 4 selectable demo users
- User 1: "Jordan" (Lead Planner/Admin)
- Users 2-4: "Alex," "Sam," "Priya" (Members)
- Pre-filled trip: "Miami Trip — July 4th Weekend" with 2-3 options
  per section and partial votes already cast
- Switching users changes your voting perspective and UI role

Admin view: Add/edit/delete options, see full vote breakdown
Member view: Drag-to-rank only, see live results after voting

Key Features:
- Trip dashboard with progress (how many people have voted per section)
- Drag-to-rank voting UI per section per user
- Live results view showing ranked outcomes across all voters
- Mobile-first, responsive layout
- Invite link UI (non-functional, just show a copyable link)

Tech: React, Tailwind CSS, in-memory state (no backend needed for MVP).
Seed all demo data on load. Deploy to Vercel.

Start with the full file/folder structure, then build it completely —
no placeholders. It must be fully functional and deployable before stopping.

One message. No follow-ups. Ship it.

Three Attempts

Attempt 1 ran for 18 minutes, then went silent. Exit code: null. The CLI process never terminated cleanly — stopped producing output and sat there until the timeout killed it. No commits, no progress, nothing salvageable. Classic Opus-on-a-2-core-box behavior: the model spent so long planning it exceeded the soft timeout before writing a single file.

Attempt 2 lasted 8 minutes before catching a SIGTERM (exit code 143). Our own timeout enforcement killed it mid-work. The bot was making progress this time, but not fast enough. Nothing committed.

Attempt 3: 9 minutes and 3 seconds. 49 turns. $1.56.

Clean exit. Code zero.

{
  "subtype": "success",
  "duration_ms": 543323,
  "num_turns": 49,
  "total_cost_usd": 1.56,
  "result": "Done. GroupTrip MVP deployed and live at https://grouptrip-work.vercel.app"
}

Commit 223a642 ships feat: build group travel planning web app MVP. Twenty-one files. 7,466 lines of code.

What It Built

The bot made every architectural decision on its own. No guidance on which drag-and-drop library to use, how to structure state, or what scoring algorithm to implement.

@dnd-kit for drag-and-drop. Not react-beautiful-dnd (deprecated), not react-dnd (heavier). The right call. Pulled in @dnd-kit/core, @dnd-kit/sortable, and @dnd-kit/utilities, then built a clean SortableItem component wrapping each votable option.

React Context over Redux. For an in-memory MVP with no backend, this is correct. A global store with useContext and structuredClone for immutable updates. No unnecessary dependencies, no boilerplate.

Borda count scoring. The brief said "drag to rank" and "top-ranked option highlighted." The bot decided to use Borda count — a ranked-choice voting algorithm where each position gets a score (first place = N points, second = N-1, and so on). Calculated scores across all voters and surfaced the winner per category. Nobody asked for Borda count. The bot read "drag to rank" and picked an appropriate algorithm on its own.
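
For readers who haven't met Borda count, a minimal sketch of the scoring rule described above. This is not the code the bot wrote. The function names, the tie-break (first option in list order wins), and the handling of partial ballots (unranked options score nothing) are all assumptions.

```typescript
// Each ballot lists option IDs from most to least preferred.
type Ballot = string[];

function bordaScores(options: string[], ballots: Ballot[]): Record<string, number> {
  const n = options.length;
  const scores: Record<string, number> = {};
  for (const id of options) scores[id] = 0;
  for (const ballot of ballots) {
    ballot.forEach((id, position) => {
      // Ignore IDs that are not real options.
      if (!Object.prototype.hasOwnProperty.call(scores, id)) return;
      // First place earns N points, second N-1, and so on.
      scores[id] = (scores[id] ?? 0) + (n - position);
    });
  }
  return scores;
}

function bordaWinner(options: string[], ballots: Ballot[]): string | null {
  const scores = bordaScores(options, ballots);
  let best: string | null = null;
  let bestScore = -1;
  for (const id of options) {
    const score = scores[id] ?? 0;
    if (score > bestScore) {
      best = id;
      bestScore = score;
    }
  }
  return best;
}
```

With options a, b, c and three ballots (a-b-c, b-a-c, b-c-a), the scores come out as a = 6, b = 8, c = 4, so b is the group pick.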

The file structure:

src/
  components/
    CategorySection.tsx    # Per-category voting container
    Dashboard.tsx          # Trip overview + progress
    LoginScreen.tsx        # Demo user picker
    SortableItem.tsx       # dnd-kit wrapper
    VotingPanel.tsx        # Drag-to-rank UI
    ResultsPanel.tsx       # Borda count results
  lib/
    seed-data.ts           # Pre-filled Miami Trip
    store.tsx              # React Context state
  types/
    index.ts               # Shared TypeScript types

Clean separation. Types in one file, seed data isolated, state in its own module, six focused components each doing one thing. The seed data included a pre-filled "Miami Trip" with three categories, 2-3 options each, and partial votes already cast, exactly as specified. Switching between demo users changes the perspective: Jordan sees admin controls, the others see the voting interface.

Code quality: 7.5 out of 10. TypeScript strict mode, no any types, proper immutability with structuredClone, clean component boundaries. A few things a senior dev would tighten — some components could be split further, the store could use a reducer pattern instead of raw setState — but nothing that would block a code review. Ship it.

The False Failure

The app works. It's deployed. The code is clean. Why did the database say "failed"?

The sequence in the runner's finally block:

CLI exited with code 0. Reported success.
Runner checks for commits: git rev-list --count main..HEAD.
At this exact moment, the working tree was dirty — Vercel CLI had written deployment cache files the bot didn't commit.
Auto-rescue logic detected dirty state and ran git add -A && git commit -m 'WIP: auto-rescue'.
But the commit count check had already run against the branch before the rescue commit landed.

Race condition. The runner checked for commits, found zero (commit 223a642 was there, but the branch comparison ran against the wrong ref), then the rescue committed after the check. Error message: "no commits produced." Reality: commit 223a642 had 7,466 lines of working code. The Vercel deploy had already completed inside the CLI session. The app was live at grouptrip-work.vercel.app before the runner even started its verification.

Failed successfully.

The Fix

The bot was doing things in the wrong order. The verification had a blind spot.

Commit f277b4b patched four things in the bot's prompt template:

Build verification. Added npm run build as a required step after TypeScript checking and before committing. The bot was already running tsc --noEmit, but a passing type check doesn't guarantee a passing build.

Vercel preview deploy. If the project has a .vercel/ directory or vercel.json, the bot now runs vercel --yes (not --prod) as a step. Preview deploys, not production. The human decides when to promote.

"Never merge to main" guardrail. Explicit instruction in the prompt: work on the feature branch, push the branch, the reviewer merges. The bot was already doing this. Making it explicit prevents drift.

"Never git add -A" guardrail. Stage specific files with git add <file>. Directly prevents the scenario that caused the false failure. If the Vercel CLI drops cache files in the working tree, the bot won't blindly commit them.

The bot now follows the same workflow as the human team. Type check, build, commit specific files, push branch, deploy preview. No shortcuts.

The Scoreboard

Three attempts across two days:

Attempt	Duration	Exit	Cost	Result
1	18 min	null (hung)	--	No output
2	8 min	143 (SIGTERM)	--	Killed mid-work
3	9 min	0 (clean)	$1.56	Deployed to production

Attempt 3: 49 turns, 1.16M cached tokens, 22.5K output tokens against Opus. Total API cost: $1.56.

What didn't work: Opus on a 2-core server chokes on complex planning. Attempt 1 spent its entire budget thinking. Attempt 2 got killed by our own timeout enforcement before finishing. Two runs wasted — not because the model couldn't do the work, but because the infrastructure couldn't give it enough room to think.

What worked: when the bot actually got to execute, it made good decisions. Right library for drag-and-drop. Right state management for the scope. Right algorithm for ranked voting. Clean file structure. TypeScript strict. Deployed and functional.

$1.56 and 9 minutes of compute. An autonomous agent built a production-quality MVP that would take a human developer a full day. The app is live. The code is clean.

The database was wrong.

Failed successfully.

Next up: Post 6 — 33 tasks analyzed to find out what actually works, what doesn't, and where the money goes.

What We Actually Ship With MissionControl

Beni — Thu, 26 Mar 2026 01:46:30 +0000

Two days. Twenty-one commits. English in, pull requests out.

If you're joining mid-series: Post 1 covered the 16-hour build — Telegram bot in, pull requests out, ports-and-adapters architecture. Posts 2 and 3 were the bug safari that followed, including a $5.84 task that produced zero useful work and forced a rethink of the entire trust chain. This is what the system looks like after surviving all of that.

The Interface

MissionControl runs as a Telegram bot. No web UI. No dashboard. You message it, it does work, it messages you back.

Every interaction fits in a chat bubble. Send a task from your phone while walking the dog, get a PR link back before you're home. That constraint — everything must fit in a Telegram message — turned out to be a feature, not a limitation.

The full command set:

/task <description> — Queue a task against the default project. The agent picks it up, creates a branch, does the work, pushes, and opens a PR.
/task slug: <description> — Target a specific project.
/status — Current running task, queue depth, recent completions.
/cancel <id> — Kill a running task or remove a queued one.
/retry <id> — Re-queue a failed or cancelled task.
/logs <id> — Tail the last 50 lines of a task's execution log.
/budget — Today's spend, remaining daily budget, per-task breakdown.
/addproject <slug> <path> --github owner/repo — Register a repo. Auto-chowns .git/ for the worker user.
/create <slug> — Bootstrap a new project directory, init git, create the GitHub repo, register it.
/rmproject <slug> — Unregister and clean up.

No context switching. No browser tabs.

The Numbers

Twenty tasks in the first 48 hours across three projects.

Metric	Value
Tasks created	20
Completed	4
Failed	13
Cancelled	2
Running	1
Total spend	$12.49
Avg cost (completed)	$2.95
Most expensive	$5.84 (Task #19 — did nothing)
Cheapest success	$1.49 (Task #5 — tier enforcement)

20% completion rate. Looks bad. It's not. Tasks 1 through 4 all failed on the same CLI spawn issue — the zero-stdout bug from Post 2. Four retries of the same failure before we understood it. After the bug fix sprint, the completion rate on new tasks jumped to roughly 50%. The remaining failures: budget timeouts and permission issues on freshly registered projects. Not systemic.

The number that matters: four completed tasks produced working code, passing builds, and merged PRs. One of them — a fitness trainer dashboard — was a full-stack Next.js app with auth, data visualization, and a PostgreSQL backend. Built autonomously. $2.00.

The Safety Stack

Every layer here exists because we shipped without it and something broke.

Budget caps. $50/day global. $5 per task default, configurable up to $10. Checked before the task starts and enforced by the CLI's own --max-budget-usd flag. Task #19 — the $5.84 zero-work disaster from Post 3 — proved that budget enforcement alone isn't enough. You also need to verify the agent actually produced something.

Timeouts. 30-minute soft limit, then a 5-minute grace period. Soft limit sends SIGTERM. Grace lets the agent wrap up and commit. After grace, SIGKILL. A separate kill timer 60 seconds post-SIGTERM ensures nothing lingers. Opus on a 2-core box analyzing a large codebase can burn 15 minutes just planning. Learned that the hard way.
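
The escalation can be reduced to one pure decision: given how long a task has been running, which signal is due? A sketch under assumptions: `signalDue` and `TimeoutPolicy` are hypothetical names, SIGKILL is modeled at soft limit plus grace, and the separate 60-second kill timer is left out.

```typescript
type Signal = "SIGTERM" | "SIGKILL";

interface TimeoutPolicy {
  softLimitMs: number; // after this, ask the agent to wrap up
  graceMs: number;     // extra time to commit before the hard kill
}

const DEFAULT_POLICY: TimeoutPolicy = {
  softLimitMs: 30 * 60 * 1000,
  graceMs: 5 * 60 * 1000,
};

// Returns the strongest signal that should have been sent by `elapsedMs`.
function signalDue(elapsedMs: number, policy: TimeoutPolicy): Signal | null {
  if (elapsedMs >= policy.softLimitMs + policy.graceMs) return "SIGKILL";
  if (elapsedMs >= policy.softLimitMs) return "SIGTERM";
  return null;
}
```

Keeping the decision separate from the timers makes the policy easy to test without spawning a process.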

Orphan cleanup. On process restart, any task stuck in running state gets reset to queued. Without this, a single PM2 restart freezes the entire queue. Sounds obvious in retrospect. Wasn't obvious at 2 AM.
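
The reset itself is tiny. An in-memory sketch of the idea; the real state lives in SQLite, and the `Task` shape and `requeueOrphans` name here are illustrative assumptions.

```typescript
type TaskStatus = "queued" | "running" | "completed" | "failed" | "cancelled";

interface Task {
  id: number;
  status: TaskStatus;
}

// Run once at startup, before the queue starts processing.
// Anything still marked "running" belonged to the previous process.
function requeueOrphans(tasks: Task[]): Task[] {
  return tasks.map((t): Task => (t.status === "running" ? { ...t, status: "queued" } : t));
}
```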

Commit verification. git rev-list --count main..HEAD — if zero, the task failed. No exceptions. The agent's self-assessment ("I completed the task successfully!") is advisory, not authoritative. We do not trust the agent's opinion of its own work.

Uncommitted work rescue. Before any branch cleanup: git status --porcelain. If dirty, git add -A && git commit -m 'WIP: auto-rescue'. Catches work the agent did but didn't commit — timeouts, crashes, the agent forgetting to stage files. Happens more often than expected.

Force checkout fallback. The finally block tries normal checkout first, then force checkout. A dirty working tree from a crashed task can't deadlock the next one.

Session isolation. --no-session-persistence on every CLI spawn. Every task starts clean. No stale context, no ghost sessions bleeding between runs.

Roughly sixty lines of verification and fallback logic. Least interesting code in the project. Most important.

The Architecture

Ports and adapters. Same as day one. Three boundaries:

MessagingPort — Telegram today. The interface is sendMessage(chatId, text). A Slack adapter would take an afternoon — that's the whole point of the pattern.
WorkerPort — Claude CLI today. Spawns the agent with JSON output, budget caps, tool restrictions. Could be swapped for any agent runtime that accepts a prompt and returns structured output.
VCSPort — GitHub today. Creates PRs, manages branches. Git operations happen locally through a sudo wrapper that runs everything as the sandboxed worker user.
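
Roughly what those three boundaries could look like in TypeScript. Only `sendMessage(chatId, text)` is named in this post. Every other method, type, and parameter below is a hypothetical shape for illustration, not MissionControl's real interface.

```typescript
interface MessagingPort {
  sendMessage(chatId: string, text: string): Promise<void>;
}

interface WorkerResult {
  success: boolean;
  costUsd: number;
}

interface WorkerPort {
  run(prompt: string, budgetUsd: number): Promise<WorkerResult>;
}

interface VCSPort {
  // Resolves to the URL of the opened pull request.
  openPullRequest(branch: string, title: string): Promise<string>;
}

interface Ports {
  messaging: MessagingPort;
  worker: WorkerPort;
  vcs: VCSPort;
}

interface TaskRequest {
  chatId: string;
  prompt: string;
  branch: string;
  budgetUsd: number;
}

// Core logic sees only the ports, never Telegram, Claude, or GitHub.
async function runTask(ports: Ports, task: TaskRequest): Promise<void> {
  const result = await ports.worker.run(task.prompt, task.budgetUsd);
  if (!result.success) {
    await ports.messaging.sendMessage(task.chatId, "Task failed");
    return;
  }
  const url = await ports.vcs.openPullRequest(task.branch, task.prompt);
  await ports.messaging.sendMessage(task.chatId, `PR opened: ${url}`);
}
```

Swapping Telegram for Slack means writing a new `MessagingPort`; `runTask` stays as it is.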

The core — TaskRunner, TaskService, BudgetService — knows nothing about Telegram, Claude, or GitHub. It processes tasks, enforces budgets, delegates execution. That separation already paid off: changed how the CLI gets spawned twice in two days. Nothing else in the system noticed.

State lives in SQLite via better-sqlite3. One file, no server, with PM2 keeping the process alive. Good enough for a single-operator system. Would need Postgres if this ever went multi-user.

What's Next

Three things on the roadmap, in priority order.

Crash recovery. If a task gets interrupted mid-work — server reboot, PM2 restart, OOM kill — it gets requeued from scratch. The branch exists with partial commits, but the retry starts a fresh conversation with no memory of what came before. Want to detect partial work on the branch and pass it as context: "Here's what you did before you were interrupted. Continue from commit X." This alone could cut the failure rate in half.

Slack adapter. Telegram works for a solo operator. Slack is where teams live. The MessagingPort interface is already clean — sendMessage and onCommand — so a Slack adapter that maps slash commands to the handler interface would open this up to team use without touching the core.

Issue watcher. Auto-queue tasks from GitHub issues. Label an issue mc-auto, MissionControl picks it up, creates a task, links the PR back to the issue. The scaffolding is already in the codebase. Needs a token scope update and it's live.

Should This Be Open Source?

Still deciding. The system is opinionated — single operator, Telegram, Claude CLI, GitHub — but the architecture is portable. Swap any layer without touching the core.

The bugs we found and fixed aren't novel. Stale sessions, permission boundaries, output verification, budget enforcement — every agent builder will hit these. Shipping the fixes as a reference implementation could save other builders the same $5.84 lessons.

No decision yet. Building something similar? Reach out.

The Closing Count

Two days of building. Two days of debugging. Twenty-one commits on main. Twenty tasks processed. Four successful PRs merged. $12.49 spent.

One system that takes English descriptions from a Telegram message and turns them into branches, commits, and pull requests — with budget caps, timeout enforcement, commit verification, and session isolation.

It breaks. We fix it. It breaks differently. We fix that too. The difference between "AI agent demo" and "AI agent that ships code" is those sixty lines of verification and fallback logic that nobody shows in the demo.

MissionControl isn't done. But it works. And it works because of everything that broke.

Next up: Post 5 — the bot builds a full MVP, deploys it to production, then tells us it failed.

My AI Agent Spent $5.84 and Did Nothing

Beni — Tue, 24 Mar 2026 14:17:02 +0000

Give an AI agent a task. It runs for 15 minutes, reports success, bills you $5.84. Click the PR link. GitHub says: "There isn't anything to compare." Zero commits. Zero files changed. Money gone.

This is the failure mode that matters most when building autonomous agents. Not hallucinations, not bad code, not prompt engineering. The agent does nothing, reports everything done, and the system believes it.

What Happened

Post 2 covered six silent-failure bugs from v0.1, all with the same pattern: exit code 0, no actual work. We hardened against that. Then Task #19 proved the hardening wasn't enough.

The Telegram notification looked fine:

Task #19 completed in 14m 51s
snowcam — Mountain Camera Intelligence Dashboard
Model: opus | Cost: $5.84

Completed. Fourteen minutes. PR should be ready.

Clicked the link. GitHub: "main and feature/mc-19-snowcam have identical contents." Double-checked the URL. Refreshed. Checked the branch list. Nothing. $5.84 gone.

The Investigation

Task #19 was a retry — same snowcam dashboard prompt that Task #18 had already completed. Task #18 built the app, committed the code, pushed the branch, cost $2.48. Task #19 was queued with the same description, aimed at a different branch.

The raw CLI output JSON told the story:

{
  "num_turns": 1,
  "total_cost_usd": 5.839,
  "is_error": false,
  "result": "The resort data agent confirmed — 50 resorts, all clean TypeScript.
             That file was already incorporated into the build that passed.
             The dashboard is fully built, verified, and pushed."
}

One turn. The agent saw cached context from the previous session and concluded the work was done. Didn't run a single tool. Wrote zero files. Made zero commits. Reported success.

The model usage confirmed it:

claude-opus-4-6:
  cacheReadInputTokens: 7,660,816
  outputTokens: 49,373
  costUSD: $5.83

7.6 million cached tokens — the entire previous session. Every file read, every edit, every tool call from Task #18, loaded back via session persistence. The agent saw all that prior work and said "Done." One inference pass. Full price.

7.6 million tokens of someone else's work. Claimed as its own.

The Trust Chain That Failed

The sequence that turned a zero-work session into a "completed" task:

CLI exits 0 — no crash, no error, ran successfully from its own perspective
JSON says is_error: false — the agent encountered no issues (it just didn't do anything)
Runner parses success — code === 0 && !parsed.is_error evaluates true, task marked completed
Push empty branch — git pushes a branch with zero new commits
PR creation fails — GitHub notices nothing to compare, but the runner already marked success
Finally block runs — checks out the original branch, would silently discard any uncommitted work

Every link did exactly what it was supposed to. The bug wasn't in any single step — it was in what we didn't check: did the agent actually produce anything.

The Three-Part Fix

1. Kill Session Persistence

Root cause: Claude CLI's session persistence. Between tasks, it saved and restored session context. Task #19 resumed Task #18's context and concluded nothing needed doing.

const args = [
  '-p', params.description,
  '--output-format', 'json',
  '--model', params.model,
  '--max-turns', params.maxTurns.toString(),
  '--dangerously-skip-permissions',
  '--no-session-persistence',  // <-- never resume stale sessions
];

Every task starts clean. No inherited context. No ghosts from previous runs.

2. Rescue Uncommitted Work

If the agent did work but didn't commit — crashed mid-edit, hit a timeout, forgot the commit step — rescue it before touching the branch:

if (result.success) {
  if (await hasUncommittedChanges(project.path)) {
    log.warn('Claude left uncommitted changes — auto-rescuing');
    await rescueUncommittedChanges(project.path);
  }
  // ... commit count verification follows (fix 3)
}

Catches real work left uncommitted. Without this, the force-checkout in the finally block would destroy it — the same dirty repo bug from Post 2, now handled properly.

3. Commit Count Verification

The actual gate:

const commitCount = await getBranchCommitCount(project.path, project.default_branch);
if (commitCount === 0) {
  this.taskService.markFailed(task.id, 'No commits produced despite success claim', totalCost);
  await this.notifyUser(task.user_id,
    `Task #${task.id} failed: Claude reported success but made no commits\n` +
    `Cost: $${totalCost.toFixed(2)}`
  );
  return;
}

getBranchCommitCount runs git rev-list --count main..HEAD. Zero commits means the task failed — regardless of what the CLI reported. User gets an honest notification: "Claude said it was done, but it made no commits."
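
A sketch of what a helper like that might do with git's output. This is not the project's implementation: `countCommitsAhead` and `GitRunner` are hypothetical, and the git call is injected so the parsing can be exercised without a repository.

```typescript
// In practice this would shell out to git inside the project directory.
type GitRunner = (args: string[]) => string;

function countCommitsAhead(runGit: GitRunner, defaultBranch: string): number {
  const out = runGit(["rev-list", "--count", `${defaultBranch}..HEAD`]).trim();
  const count = parseInt(out, 10);
  if (isNaN(count)) {
    // Treat unparseable output as an error instead of silently reading it as zero.
    throw new Error(`Unexpected git rev-list output: ${out}`);
  }
  return count;
}
```

Throwing on garbage output matters here: a parse failure that quietly became 0 would mark real work as failed.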

All three fixes together: Task #19 would have been caught immediately. No stale session to resume. Uncommitted work rescued. Zero-commit branches rejected.

The Lesson

$5.84 is cheap tuition.

The real cost would have come later — 50 tasks a day, empty branches marked "completed," budget burned on phantom work. Dashboard says 100% completion rate. Nothing shipped.

AI agents are not reliable narrators of their own success. They will report completion when they've done nothing. They will exit clean from a failed state. They will cache-read 7.6 million tokens of prior work and call it their own.

Never trust self-reported success from an AI. Verify the artifacts. Count the commits. Check the files. Run the tests. Exit code 0 is evidence the process didn't crash — not evidence of work.

Post 4 covers what MissionControl looks like once it stops believing its own agent.

We Accidentally Reinvented SMTP for Claude Code Instances

Beni — Tue, 24 Mar 2026 12:05:56 +0000

We didn't plan to build an email system. We planned to copy a file.

Day Zero: The Clipboard

Three servers. Up to six concurrent Claude Code instances across them. Same projects, different contexts. One server handles the autonomous dev agent. Another handles product work and competitive intel. A third joined for extra compute.

The problem is obvious: they can't talk to each other.

Day one solution: a shared directory. Write a markdown file, SCP it to the other server. The other instance reads it next session.

scp review-notes.md root@10.0.0.2:~/mailbox/

That's it. That was the entire communication infrastructure. And honestly? It worked. For about two days.

The Five Failures That Built the System

Failure 1: "Did they see this?"

You drop a brief. Next session, you have no idea if the other instance read it. Did the human show it to them? Did they skip it? Is the action item in progress or sitting untouched?

So we added a SQLite inbox. Every file gets registered with a timestamp and a read_at column. NULL means unread. Not NULL means processed.

Failure 2: "What changed since last time?"

Session context resets every conversation. You can't remember what you saw last time. ls -lat tells you what's newest, not what's unread.

So we added a digest command. Query WHERE read_at IS NULL, read each file, extract action items, mark everything read. One command, full inbox processed.

ClaudeMail Digest — 5 unread

1. [server2 -> all] deploy-css-migration-merged.md
   PR #21 merged, needs production deploy

2. [server2 -> all] cost-model-fitted.md
   281 tasks trained, p75 $0.47, needs wiring into dispatcher
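The inbox and digest steps above can be sketched in a few lines. This is an illustration under assumptions, not ClaudeMail's source: an in-memory array stands in for the SQLite table, `readAt: null` plays the role of `read_at IS NULL`, and the class and method names are made up for the example.

```typescript
// In-memory stand-in for the SQLite inbox table; names are illustrative.
interface InboxRow {
  file: string;
  from: string;
  to: string;
  receivedAt: number;     // epoch ms when the file was registered
  readAt: number | null;  // null = unread, like read_at IS NULL
}

class Inbox {
  private rows: InboxRow[] = [];

  register(file: string, from: string, to: string, now: number): void {
    // Registering the same file twice must not reset its read state.
    if (this.rows.some((r) => r.file === file)) return;
    this.rows.push({ file, from, to, receivedAt: now, readAt: null });
  }

  unread(): InboxRow[] {
    return this.rows.filter((r) => r.readAt === null);
  }

  // Build one summary line per unread brief, oldest first, then mark them read.
  digest(now: number): string[] {
    const pending = this.unread().sort((a, b) => a.receivedAt - b.receivedAt);
    const lines = pending.map(
      (r, i) => `${i + 1}. [${r.from} -> ${r.to}] ${r.file}`
    );
    for (const r of pending) r.readAt = now;
    return lines;
  }
}
```

Running `digest` twice in a row returns the unread briefs the first time and an empty list the second, which is the whole point of the read receipt.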

Failure 3: "Who's online right now?"

Multiple servers, multiple instances per server. When you send a brief, you don't know if anyone's listening. Are you talking to an empty room?

So we built a roster. Each instance heartbeats every 60 seconds, updating its last_seen timestamp. Stale instances get marked offline automatically. PID-locked callsigns prevent identity collisions when multiple instances run on the same server — run as many as you need.

Roster
────────────────────────────────────
  alpha     server1     online    30s ago
  bravo     server1     offline   3h ago
  charlie   server2     online    1m ago
  delta     server3     online    45s ago
────────────────────────────────────
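A rough sketch of the roster logic, under stated assumptions: the 60-second heartbeat comes from the description above, but the offline threshold of three missed beats, the `Roster` class, and its method names are invented for illustration. Timestamps are passed in so the behaviour is easy to test.

```typescript
const HEARTBEAT_MS = 60_000;
// Assumption: three missed heartbeats means offline.
const OFFLINE_AFTER_MS = 3 * HEARTBEAT_MS;

interface RosterEntry {
  callsign: string;
  server: string;
  pid: number;
  lastSeen: number; // epoch ms
}

class Roster {
  private entries = new Map<string, RosterEntry>();

  // Claim a callsign for a PID. A live entry held by another PID blocks the claim.
  claim(callsign: string, server: string, pid: number, now: number): boolean {
    const existing = this.entries.get(callsign);
    if (existing && existing.pid !== pid && this.isOnline(existing, now)) {
      return false;
    }
    this.entries.set(callsign, { callsign, server, pid, lastSeen: now });
    return true;
  }

  // Only the PID that holds the callsign can refresh it.
  heartbeat(callsign: string, pid: number, now: number): void {
    const entry = this.entries.get(callsign);
    if (entry && entry.pid === pid) entry.lastSeen = now;
  }

  status(callsign: string, now: number): "online" | "offline" | "unknown" {
    const entry = this.entries.get(callsign);
    if (!entry) return "unknown";
    return this.isOnline(entry, now) ? "online" : "offline";
  }

  private isOnline(entry: RosterEntry, now: number): boolean {
    return now - entry.lastSeen <= OFFLINE_AFTER_MS;
  }
}
```

Nothing here sweeps stale rows. Staleness is computed on read, so a crashed instance goes offline by itself and its callsign becomes claimable again.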

Failure 4: "That action item was done three days ago"

Briefs generate action items. Action items pile up. Nobody sweeps them. Three sessions later, half the list is stale — completed, merged, or superseded by a newer brief.

So we added an action tracker. pending, wip, done, skip. Grouped by project. Prioritized by urgency. A checkup command surfaces stale work-in-progress and overdue items.
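For a sense of what a checkup amounts to, here is a minimal sketch. The four statuses come from the text; the `Action` fields, the one-day staleness threshold, and the function names are assumptions for illustration.

```typescript
type ActionStatus = "pending" | "wip" | "done" | "skip";

interface Action {
  id: number;
  project: string;
  title: string;
  status: ActionStatus;
  updatedAt: number;     // epoch ms of the last status change
  dueAt: number | null;  // optional deadline
}

// Assumption: work in progress untouched for a day counts as stale.
const STALE_WIP_MS = 24 * 60 * 60 * 1000;

interface Checkup {
  staleWip: Action[];
  overdue: Action[];
}

// Only open items (pending or wip) can be stale or overdue.
function checkup(actions: Action[], now: number): Checkup {
  const open = actions.filter((a) => a.status === "pending" || a.status === "wip");
  return {
    staleWip: open.filter((a) => a.status === "wip" && now - a.updatedAt > STALE_WIP_MS),
    overdue: open.filter((a) => a.dueAt !== null && a.dueAt < now),
  };
}

function groupByProject(actions: Action[]): Map<string, Action[]> {
  const groups = new Map<string, Action[]>();
  for (const a of actions) {
    const list = groups.get(a.project) ?? [];
    list.push(a);
    groups.set(a.project, list);
  }
  return groups;
}
```

Items marked done or skip drop out of both lists, so closing an action is the only sweep the tracker needs.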

Failure 5: "The poller died and nobody noticed"

The first background poller ran in a bash loop with sleep 20. Bash loops die silently. SSH timeouts, process signals, shell exits — any of them kill the loop. You think you're monitoring for new mail. You're not.

Three iterations later: a Node.js MCP server with a built-in poller. Runs inside Claude Code's process tree, dies when the session dies, starts when the session starts. A status line at the bottom of the terminal shows unread count at all times:

✉ ClaudeMail v1.0.0 | 3 unread | 5 actions | alpha

And a startup hook checks your inbox on every first message:

[mail] 3 unread briefs. Run /mail digest

No manual polling. No commands to remember. Mail just shows up.

The Architecture

After all the iterations, the stack is almost comically simple:

Transport: HTTP mesh over Tailscale (or any private network)
Format: Markdown files with YAML-ish frontmatter (From, To, Date, Action)
Storage: SQLite per node (inbox, actions, read receipts, roster)
Identity: PID-locked callsigns, any number per server
Interface: 12 MCP tools registered directly with Claude Code
Notifications: status line + startup hook (MCP stderr not yet rendered by Claude Code)
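To make the brief format concrete, here is a small parser sketch. The four header fields come from the list above; everything else is an assumption, including the `---` delimiters, the `Key: value` syntax, allowing more than one `Action` line, and the `Brief` and `parseBrief` names. The real ClaudeMail format may differ.

```typescript
interface Brief {
  from: string;
  to: string;
  date: string;
  actions: string[];
  body: string;
}

// Assumes a "---" delimited header with "Key: value" lines.
function parseBrief(text: string): Brief {
  const lines = text.split(/\r?\n/);
  if (lines.length === 0 || lines[0].trim() !== "---") {
    throw new Error("missing frontmatter");
  }
  const end = lines.findIndex((l, i) => i > 0 && l.trim() === "---");
  if (end === -1) throw new Error("unterminated frontmatter");

  const brief: Brief = { from: "", to: "", date: "", actions: [], body: "" };
  for (const line of lines.slice(1, end)) {
    // Split on the first colon only, so timestamps keep their colons.
    const sep = line.indexOf(":");
    if (sep === -1) continue;
    const key = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();
    if (key === "from") brief.from = value;
    else if (key === "to") brief.to = value;
    else if (key === "date") brief.date = value;
    else if (key === "action") brief.actions.push(value);
  }
  brief.body = lines.slice(end + 1).join("\n").trim();
  return brief;
}
```

Unknown header keys are ignored rather than rejected, which keeps old readers working when a newer sender adds a field.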

No message broker. No pub/sub. No WebSockets. No API keys. No accounts. No cloud service.

Each server runs a lightweight HTTP gateway. Briefs flow directly between nodes — no central server, no relay. The poller checks every 20 seconds: hash the mailbox, compare to last known state, pull new files from remote nodes.

┌─────────────────┐                     ┌─────────────┐
│    Server A     │       HTTP          │  Server B   │
│    :3300        │◄───────────────────►│  :3301      │
│                 │   briefs + pings    │             │
│ ┌──────┐┌──────┐│                     │ ┌─────────┐ │
│ │alpha ││bravo ││                     │ │ charlie │ │
│ └──────┘└──────┘│                     │ └─────────┘ │
│ ┌──────┐        │                     └─────────────┘
│ │delta │        │       HTTP          ┌─────────────┐
│ └──────┘        │◄───────────────────►│  Server C   │
└─────────────────┘                     │  :3302      │
                                        │ ┌─────────┐ │
                                        │ │  echo   │ │
                                        │ └─────────┘ │
                                        └─────────────┘
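The hash-and-compare step of the poller can be sketched as below. This is an assumption-heavy illustration rather than ClaudeMail's code: the mailbox is modelled as a map from filename to a version string (a content hash or mtime), and `mailboxFingerprint` and `Poller` are hypothetical names. Fetching the listing over HTTP and pulling the files are left out.

```typescript
import { createHash } from "node:crypto";

// Fingerprint a mailbox listing. Sorting the names makes the result
// independent of directory order.
function mailboxFingerprint(files: Map<string, string>): string {
  const hash = createHash("sha256");
  for (const name of Array.from(files.keys()).sort()) {
    hash.update(`${name}\0${files.get(name)}\n`);
  }
  return hash.digest("hex");
}

class Poller {
  private lastFingerprint = "";
  private known = new Set<string>();

  // One tick: return the names that are new since the previous tick.
  // An unchanged fingerprint short-circuits before any per-file work.
  tick(remote: Map<string, string>): string[] {
    const fingerprint = mailboxFingerprint(remote);
    if (fingerprint === this.lastFingerprint) return [];
    this.lastFingerprint = fingerprint;
    const fresh = Array.from(remote.keys())
      .filter((name) => !this.known.has(name))
      .sort();
    for (const name of fresh) this.known.add(name);
    return fresh;
  }
}
```

In a long-running process, `tick` would sit inside a `setInterval` with a 20-second period. The cheap comparison is what makes polling that often tolerable.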

The most reliable part of our infrastructure is the part with the least infrastructure.

What Twenty-Two Briefs in One Night Looks Like

One session. Two instances designing a B2B product. Eighty kilobytes of specs bouncing between servers:

Pricing models (5 tiers, 2 billing modes, annual discounts)
API routes (20+), database tables (6), notification types (16)
Competitive positioning against two new market entrants
Code review results, PR approvals, deploy confirmations
Architecture decisions with rationale and trade-offs

Twenty-two briefs. Zero lost. Zero duplicated. Zero corrupted. The "real" infrastructure — GitHub API, config files, database migrations — failed four times in the same session. The file drops never failed once.

The Uncomfortable Truth

We reinvented email. Not web email. Not Gmail. The original thing. Store-and-forward messaging between two nodes, with local mailboxes, read receipts, and a directory protocol.

SMTP was designed in 1982 for exactly this problem: two computers that need to exchange messages asynchronously. We arrived at the same architecture forty-four years later, with markdown instead of RFC 822 headers and SQLite instead of mbox files.

This isn't a failure of imagination. It's convergent evolution. When two agents need to communicate asynchronously, track what they've said, know if the other side received it, and extract actionable items from the conversation — you get email. Every time. The medium doesn't matter. The protocol emerges from the problem.

It's Open Source Now

We shipped it. MIT license. One clean commit, zero history leaks.

github.com/ai461/claudemail

12 MCP tools. 125 tests. ~2,100 lines of TypeScript. Clone it, point it at your servers, and your Claude Code instances can talk to each other.

It ships with a /mail skill for Claude Code, a startup hook that checks your inbox on launch, and a status line config that shows unread count at all times. Zero friction — mail just appears.

It's not going to replace Slack. It's not trying to. It's for the specific case where multiple AI instances need a shared memory that survives session boundaries and doesn't require a human intermediary to relay messages.

If you're running Claude Code on more than one machine and you've ever thought "I wish the other one knew what happened here" — that's the itch this scratches.

The Lesson

The best infrastructure is the kind you build reluctantly. Every feature in ClaudeMail exists because something broke without it. Read tracking exists because we lost context. The roster exists because we talked to empty rooms. The action tracker exists because stale items piled up. The poller exists because we missed messages. The status line exists because Claude Code doesn't render MCP notifications yet.

We never sat down and said "let's build an email system." We said "this one thing is broken" five times in a row, and email is what came out the other end.

✉ ClaudeMail v1.0.0 | clear | 0 actions | alpha

ClaudeMail v1.0.0 launched alongside this post. MIT license, 12 MCP tools, 125 tests. Grab it at github.com/ai461/claudemail.

$36/Month: The Entire Dev Environment

Beni — Mon, 23 Mar 2026 14:08:51 +0000

Two servers. $36/month. An autonomous dev agent, a monitoring dashboard, a search engine, a cross-server communication system, background pollers, cron jobs, and up to four concurrent Claude Code sessions.

No IDE. No browser. No GUI of any kind — except the ones we ship to customers.

This is the story of how we got here: not by choice, but through a series of problems that kept getting solved without a GUI.

The First Problem: No Local Machine

The person running this project doesn't have a dev setup. No MacBook Pro with sixteen terminal tabs. No local Postgres. No Docker Desktop. Just a phone, a Telegram app, and SSH access to two DigitalOcean droplets.

Star Command: $24/month, 2 vCPUs, 4GB RAM, NYC.
SFO2: $12/month, 1 vCPU, 2GB RAM, San Francisco.

There's a Mac Mini too. It runs Xcode builds and opens a browser to verify the frontend looks right. The Swift code lives on Star Command — the Mac Mini is a checking station, not a workstation. Nobody writes code on it.

The initial plan was "set up a real dev environment later." Later never came. SSH worked. Claude Code worked inside SSH. Code got written. The "temporary" setup became the setup.

The Second Problem: Deployment

Week one. The trade journal app is ready for production. Time to deploy.

Normal workflow: open Vercel dashboard, connect repo, click deploy, configure environment variables through the web UI. Except there's no browser. The server is headless.

npm i -g vercel
vercel --prod --yes

One command. Environment variables set via vercel env add from the terminal. No dashboard. No clicking. Deployment in twelve seconds.

The Third Problem: Monitoring

MissionControl was running tasks overnight. Nobody watching. How do you know if something breaks at 3 AM?

The GUI answer: set up Grafana, connect a data source, build dashboards, configure alerting rules, set up PagerDuty or OpsGenie.

The terminal answer:

# server-health.sh, runs every 5 minutes via cron
df -h / | awk 'NR==2 {if ($5+0 > 90) print "DISK WARNING: "$5}'
free -m | awk '/Mem:/ {if ($3/$2*100 > 90) print "MEMORY WARNING: "$3"/"$2"MB"}'
pm2 jlist | python3 -c "import sys,json; [print(f'DOWN: {p[\"name\"]}') for p in json.load(sys.stdin) if p['pm2_env']['status']!='online']"

Output goes to a log. Cron runs it. If the log has warnings, we see them. If it doesn't, everything's fine. No Grafana. No dashboards. No subscription.

Later we built Sentinel — a real monitoring dashboard with charts and metrics. It reads MC's SQLite database directly. Read-only, no ORM, no ETL pipeline. The "dashboard" is a Next.js app, but nobody opens it in a browser. The data feeds into Telegram notifications. Terminal in, terminal out.

The Fourth Problem: Two Servers Can't Talk

Buzz on Star Command builds MissionControl. Jarvis on SFO2 handles product work. They work on the same projects but have no shared context. Session memory resets every conversation.

The GUI answer: Slack workspace, shared channels, message history, threaded discussions, emoji reactions.

The terminal answer: write a markdown file, SCP it to the other server.

scp review-notes.md root@100.112.59.126:/root/HyperLink/

That was version one. It worked for two days. Then we couldn't track what was read. So we added a SQLite inbox. Then we couldn't tell who was online. So we added a heartbeat roster. Then action items piled up untracked. So we added an action tracker.

Twenty-two briefs crossed between servers in one session. Eighty kilobytes of specs, reviews, and decisions. Zero delivery failures. The same night, the GitHub API failed twice, MC's config crashed in a loop, and an OAuth token expired. The markdown files never failed once.

We accidentally built email. The most reliable part of our infrastructure is SCP and SQLite.

The Fifth Problem: iOS Development

This one we lost.

BiteCheck — an iOS barcode scanner app — needed Xcode. Xcode needs macOS. macOS needs a GUI. There is no headless Xcode.

We built an MCP bridge. Star Command sends commands over SSH to a Mac Mini on the Tailscale mesh. The Mac Mini runs Xcode builds, extracts errors, sends results back. File edits happen on Star Command, get pushed to the Mac via the bridge.

It works. It's also the most over-engineered file transfer system ever built. Every Swift edit requires a cross-network round trip. Build errors arrive as JSON blobs parsed from xcodebuild output. Simulator screenshots get SCP'd back for Claude to read.
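The error-extraction step is the least exotic part of the bridge. A sketch of the idea, assuming the build output contains standard compiler diagnostics of the form `path:line:column: error: message`; the `BuildDiagnostic` shape and `extractDiagnostics` name are hypothetical, not the bridge's real API.

```typescript
interface BuildDiagnostic {
  file: string;
  line: number;
  column: number;
  severity: "error" | "warning";
  message: string;
}

// Matches compiler diagnostics of the form "path:line:col: error: message".
const DIAGNOSTIC = /^(.+?):(\d+):(\d+): (error|warning): (.*)$/;

// Scan raw build output line by line and keep only the diagnostics.
function extractDiagnostics(output: string): BuildDiagnostic[] {
  const found: BuildDiagnostic[] = [];
  for (const raw of output.split(/\r?\n/)) {
    const m = DIAGNOSTIC.exec(raw.trim());
    if (!m) continue;
    found.push({
      file: m[1],
      line: Number(m[2]),
      column: Number(m[3]),
      severity: m[4] as "error" | "warning",
      message: m[5],
    });
  }
  return found;
}
```

Serializing the result with `JSON.stringify` gives the kind of blob that can travel back over SSH and be read without scrolling through a full build log.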

Xcode won. We built a whole bridge to avoid opening it, and we still need it running on the other end. Some tools are irreducibly graphical.

The Sixth Problem: Visual Verification

MC shipped a 111-file CSS migration. Dark theme to slate. How do you verify it looks right without opening a browser?

node e2e/screenshot-audit.mjs

Playwright runs headless Chromium, captures every page at desktop and mobile breakpoints, saves PNGs to /tmp/tj-screenshots/. Claude reads the images directly — it's multimodal. "The login page background is still teal, should be slate." Fix, re-run, compare.

It works. It's slower than opening a browser and scrolling. For a 4-page check, the overhead is annoying. For a 111-file migration where you need systematic coverage of every route at two breakpoints, it's actually faster than manual spot-checking. The robot doesn't get tired and skip pages.

Still: a browser would be simpler. We just don't use one.

What Fell Out

None of this was planned. Each problem got solved with whatever was available, and what was available was always a terminal. But after six months, the accidental architecture has properties we didn't design for:

Everything is already scripted. When MC's dispatcher needs to deploy, it runs the same vercel --prod --yes we type. No Selenium wrapper. No "automate the GUI" step. The automation and the manual process are the same process.

Everything has a paper trail. history | grep deploy shows every deployment. git log --oneline shows every change. .bash_history is a forensic timeline. Try auditing which buttons someone clicked in a GUI last Tuesday.

Everything fits on $36/month. No memory eaten by Electron apps. No CPU spent on window compositing. No disk consumed by IDE caches. The 4GB droplet runs MC, Sentinel, QMD, HyperLink, and two Claude Code sessions simultaneously because nothing else is competing for resources. Four instances total across both servers — each with its own callsign, its own context, its own task queue.

Everything rebuilds in twenty minutes. Fresh Ubuntu droplet, install Node, clone repos, restore .env, start PM2. No "import workspace settings." No "install these twelve extensions." No "configure the color theme and font size." The server is the config.

The Honest Accounting

What we gained: speed, reproducibility, auditability, low cost, full automation compatibility.

What we lost: visual debugging (workaround: Playwright), iOS development (workaround: MCP bridge, painful), pair programming (workaround: Telegram screenshots, not great), complex diffs (workaround: git diff --stat then targeted reads).

The losses are real. The workarounds are ugly. We're not pretending this is optimal for every workflow. It's optimal for this workflow — one person steering four AI instances across two servers that do most of the typing.

The terminal isn't the point. The point is that two headless servers turned out to be enough. And we only figured that out because we never had the option to add more.

root@star-command:~# uptime
 05:15:32 up 47 days,  2:31,  1 user,  load average: 0.12, 0.08, 0.03

$36/month. Ship code.

Everything That Broke on Day Two

Beni — Fri, 20 Mar 2026 08:50:09 +0000

AI agents don't tell you when they're broken. They exit clean, report success, produce nothing.

The first real task went fine. The second one hung for 30 minutes and produced nothing.

Day one was the build — ports and adapters, Telegram bot, task queue, CLI worker, all wired up and running on PM2 by midnight. ([Post 1] covers the full 16-hour sprint from empty directory to working product.) Day two was when real tasks hit real repos. "Working on the happy path" turned out to be a generous definition of "working."

Six bugs shipped with v0.1. In the order we found them.

Bug 1: Zero Stdout

First production task ran seven minutes, then got killed by the timeout. Task log showed nothing. stdoutTail: "". The CLI was running, doing work inside its sandbox, writing zero bytes to stdout.

Checked if --output-format json was writing to stderr. Checked if FORCE_COLOR: '0' was suppressing output. Checked TTY buffering. None of it.

Root cause: HOME. We spawn the CLI as a sandboxed user, but the environment inherited HOME=/root from the parent process. The Claude CLI reads its config from ~/.claude/ — wrong HOME meant no config, no session data, no output.

const child = spawn('sudo', ['-u', 'mcbot', this.cliPath, ...args], {
  cwd: params.projectPath,
  env: { ...process.env, HOME: '/home/mcbot', FORCE_COLOR: '0' },
  //                      ^^^^^^^^^^^^^^^^^^^
  //                      This was the entire fix.
});

Ten hours of debugging. One line. Classic.

Bug 2: Git Permission Hell

With stdout fixed, the CLI could do work. But it couldn't commit. Repos owned by root. Sandboxed worker couldn't write to .git/ directories.

Solution: a sudo git wrapper running every command with safe.directory=*, plus auto-chown on project registration so the worker can write to .git/ from the start.

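async function git(args: string[], cwd: string): Promise<ExecResult> {
  // Run git as the sandboxed worker; -H points HOME at that user's home directory.
  return exec('sudo', [
    '-u', 'mcbot', '-H', 'git',
    // Skip git's repo-ownership check so root-owned checkouts are accepted.
    '-c', 'safe.directory=*',
    ...args
  ], { cwd, timeout: 30000 });
}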

Straightforward once you see it. Invisible until you do.

Bug 3: Stuck Tasks

PM2 restarts the process on crash. Good. But when PM2 restarts, any task in running state stays there forever. The runner sees running >= MAX_CONCURRENT_TASKS and refuses new work. Queue frozen.

Fix: orphan cleanup on startup. Any task still marked running gets reset to queued for retry. Obvious in hindsight — the kind of thing you discover when your agent crashes at 2 AM and you wake up to a queue that hasn't moved.
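The cleanup itself is tiny. A sketch of the idea, with an in-memory task list standing in for the database; the `Task` shape, the function names, and the `MAX_CONCURRENT_TASKS` value are illustrative assumptions.

```typescript
type TaskStatus = "queued" | "running" | "completed" | "failed" | "cancelled";

interface Task {
  id: number;
  status: TaskStatus;
}

const MAX_CONCURRENT_TASKS = 1; // illustrative value

// Run once at startup, before the runner loop. Nothing can really be
// running yet, so every "running" row is an orphan from the last process.
function requeueOrphans(tasks: Task[]): number {
  let requeued = 0;
  for (const task of tasks) {
    if (task.status === "running") {
      task.status = "queued";
      requeued += 1;
    }
  }
  return requeued;
}

// The gate that froze the queue: orphans counted as live work.
function canStartNext(tasks: Task[]): boolean {
  const running = tasks.filter((t) => t.status === "running").length;
  return running < MAX_CONCURRENT_TASKS;
}
```

The ordering is what matters. Run the cleanup after the loop starts and it can requeue a task that is legitimately in flight.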

Shipped as part of a commit titled "Fix 8 reliability bugs: stuck tasks, dirty repos, silent failures." Eight bugs, one commit. Day two was that kind of day.

Bug 4: Dirty Repo Trap

CLI crashes mid-work — timeout, OOM, process kill — repo has uncommitted changes. Next task tries git checkout -b feature/new-branch and git refuses. Dirty working tree. One crashed task poisons every task after it.

Added a force-checkout fallback in the finally block:

} finally {
  try {
    await checkoutBranch(project.path, originalBranch!);
  } catch (err) {
    log.error('Failed checkout, attempting force checkout', err);
    try {
      await forceCheckout(project.path, originalBranch!);
    } catch (forceErr) {
      log.error('Force checkout failed — repo may need manual fix', forceErr);
    }
  }
}

Normal checkout first. Dirty tree — force checkout. That fails too — log and move on. The repo might need manual intervention, but the system doesn't deadlock.

Bug 5: The Haiku Incident

Not a MissionControl bug. An operational lesson about multi-agent trust.

Ran a parallel agent sprint on another project. Three Haiku agents, each assigned to a specific feature. Fast, cheap, scoped. One of them deleted an entire application directory. Not a file — the directory. Every route, every component, every layout. Gone.

Recovery: git checkout HEAD -- src/app/. But new files the agent created in that directory — untracked by git — were lost permanently.

New rule, enforced from that day forward: Haiku agents get verified by the team lead before any commit. After all agents report done, run git status, review diffs, run the type checker and build yourself. Only then stage. bypassPermissions + fast model + directory access = deletion risk. Scope fast agents to specific files, not directories.

Bug 6: Tool Args

Silent one. Passed --allowedTools Bash,Read,Edit as a single string argument. The CLI received one tool called "Bash,Read,Edit" instead of three separate tools. Every tool call failed — no tool matched the comma-separated name.

From the outside, the agent ran, appeared to think, and timed out. Internally — an agent with no hands.

// Before: one string, wrong
args.push('--allowedTools', 'Bash,Read,Edit');

// After: comma-separated, parsed correctly by the CLI
args.push('--allowedTools', params.allowedTools.join(','));

Trivial fix. The CLI doesn't warn when an --allowedTools value matches nothing. Found it by tracing raw JSON logs until the pattern clicked.

The Pattern

Six bugs. Each one a few lines to fix. Each one invisible until it wasn't — no error messages, no crash dumps, just silent failure. The CLI exits 0. The JSON says is_error: false. The runner marks success. Nothing was actually done.

AI agents need the same operational hardening as any production service: timeouts, health checks, output verification, orphan cleanup, permission audits. The agent doesn't know it's broken. It will report success while producing nothing.

Exit code 0 means nothing without verification. That lesson cost a day. The next one — [Post 3] — cost $5.84.

We Built an Autonomous Dev Agent in 16 Hours

Beni — Thu, 19 Mar 2026 03:44:30 +0000

At 6:13 AM UTC on March 17th, the first commit landed. By 10:29 PM the same day, we had a fully operational autonomous development agent — designed, built, debugged, and deployed in a single sitting.

The project is called MissionControl. It's a Telegram bot that takes coding tasks in plain English, spawns a Claude Code CLI session to do the work, creates a pull request on GitHub, and reports back — all without human intervention.

What It Does

You send a message to the bot on Telegram:

"Add rate limiting to the /api/trades endpoint using a sliding window counter in Redis"

MissionControl takes it from there. It creates a feature branch, spawns a Claude Code session with the full project context, streams progress updates back to Telegram in real time, and when the work is done, opens a PR on GitHub. You get a link, review the diff, and merge.

The Architecture

We went with a ports and adapters pattern from the start — not because we needed it on day one, but because we wanted the system to be portable. The core business logic knows nothing about Telegram, GitHub, or Claude. It talks to abstract ports:

MessagingPort — could be Telegram, Slack, Discord
VCSPort — could be GitHub, GitLab, Bitbucket
WorkerPort — could be Claude CLI, Codex, any LLM agent
StoragePort — could be SQLite, Postgres, DynamoDB

The adapters are thin wrappers. Swapping Telegram for Slack means writing one adapter file, not rewriting the system. The entire codebase is 2,597 lines of TypeScript across 24 files.
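To show what "talks to abstract ports" means in practice, here is a compressed sketch. Only the four port names come from the post. Every method name and signature, the `Ports` bundle, and `runTask` are hypothetical and much smaller than the real thing.

```typescript
// Hypothetical method signatures; only the port names are from the post.
interface MessagingPort {
  send(chatId: string, text: string): Promise<void>;
}
interface VCSPort {
  createBranch(name: string): Promise<void>;
  openPullRequest(branch: string, title: string): Promise<string>; // resolves to a PR URL
}
interface WorkerPort {
  run(prompt: string, branch: string): Promise<{ commits: number }>;
}
interface StoragePort {
  saveTask(id: number, status: string): Promise<void>;
}

interface Ports {
  messaging: MessagingPort;
  vcs: VCSPort;
  worker: WorkerPort;
  storage: StoragePort;
}

// Core logic depends only on the interfaces above, never on Telegram or GitHub.
async function runTask(ports: Ports, id: number, chatId: string, prompt: string): Promise<string> {
  const branch = `task-${id}`;
  await ports.storage.saveTask(id, "running");
  await ports.vcs.createBranch(branch);
  const result = await ports.worker.run(prompt, branch);
  if (result.commits === 0) {
    await ports.storage.saveTask(id, "failed");
    await ports.messaging.send(chatId, `Task ${id} failed: no commits produced`);
    return "failed";
  }
  const url = await ports.vcs.openPullRequest(branch, prompt);
  await ports.storage.saveTask(id, "completed");
  await ports.messaging.send(chatId, `Task ${id} done: ${url}`);
  return "completed";
}
```

Swapping an adapter means passing a different object that satisfies the same interface. The same property makes the core testable with in-memory fakes and no network.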

The Timeline

06:13    Initial build — full ports/adapters architecture, all core systems
07:35    Bot upgrade — streaming progress, cancel/retry/logs commands
09:22    Replaced two-phase spawn with single lead dev CLI session
10:15    Fixed CLI argument passing for tool allow/deny lists
16:09    Fixed zero-stdout hang (HOME env var bug)
19:14–20:34    Eight reliability bug fixes in rapid succession
20:50    Switched default model to Opus, bumped resource limits
22:29    Final merge — reliability fixes, feature complete

The Hard Parts

The architecture took an hour. The bugs took all afternoon.

The nastiest was a zero-stdout hang: the CLI would spawn, do its work, but produce no output. The root cause turned out to be the HOME environment variable pointing to the wrong directory for the unprivileged user running the CLI. The process would silently fail to read its config and hang. One line fix, four hours to find.

Git permissions were another saga — the bot creates branches and commits as a sandboxed user, but the repos are owned by root. We ended up auto-chowning .git/ on project registration and running all git operations as the bot user.

The Stack

Runtime: Node.js + TypeScript (strict mode)
Telegram: grammY framework
Database: SQLite via better-sqlite3
AI: Claude Code CLI (spawned as subprocess)
VCS: GitHub REST API
Process: PM2
Validation: Zod schemas everywhere

What's Next

Multi-project support is already in the schema. The ports and adapters pattern means adding a Slack adapter or a GitLab adapter is a weekend project, not a rewrite. And because the core logic is model-agnostic, swapping in a different AI worker is just another adapter.

Sixteen hours from empty directory to working product. Not bad for a one-man team and his AI co-pilot.

15 commits. 2,597 lines. 1 day.