DEV Community

Beni

Our Agent's #1 Failure Mode: Thinking


Thirty-three tasks. Four projects. $32.93. Time to read the spreadsheet.

MissionControl has been running for a week. Quick context if you're just joining: autonomous dev agent. Describe a coding task in Telegram, it spawns a Claude Code session, builds the feature, opens a PR on GitHub. Post 1 covered the 16-hour build. Posts 2 through 5 covered the bugs, the trust chain, the architecture, and a task that deployed a full MVP then got marked as failed. All anecdotal. Now there's enough data to stop telling stories and start reading spreadsheets.

The Raw Numbers

| Metric | Value |
| --- | --- |
| Tasks created | 33 |
| Completed | 12 (36%) |
| Failed | 19 (58%) |
| Cancelled | 2 (6%) |
| Total spend | $32.93 |

36% completion rate. Worse than the 50% reported after 20 tasks. But the raw number lies — it's weighed down by early infrastructure failures that no longer exist. Strip those out and the picture changes.

Where the Money Went

Not all failures are equal. Some cost pennies. One category cost almost $9.

"No commits produced" — 5 tasks, $8.88

The real failure mode. Five tasks where Opus ran for its full budget or turn limit and produced zero commits. Tasks #20, #23, #25, #27, #29 — all greenfield builds ("Build a full-stack...") on $2 budgets.

The pattern is consistent: Opus starts by reading the entire codebase. Then it plans. Then it plans more. Explores alternative approaches. Considers edge cases it will never hit. By the time it's ready to write code, the budget is gone.

$8.88 burned on thinking. Not a single line committed.

API and infra failures — 10 tasks, $0.69

Ten tasks failed on infrastructure issues — all fixed since. Anthropic API 500s during early testing (4 tasks, $0.69). Missing sudo, stale OAuth tokens, missing worker user (6 tasks, $0). Resolved in the first week. Noise in the data now.

Timeout — 1 task

Default timeout was too short for a full-stack build on a 2-core box. Bumped it. Hasn't recurred.

CLI quirk — 1 task

`--print` combined with `--output-format=stream-json` silently requires `--verbose`. Without it, the CLI exits 1 with no useful error. Fixed in `worker.ts`.
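A minimal sketch of the guard this implies (function and variable names here are illustrative, not the actual `worker.ts` code):

```typescript
// Build the argument list for a non-interactive Claude Code run.
// The key detail: stream-json output in --print mode requires
// --verbose, or the CLI exits 1 with no useful error.
function buildClaudeArgs(prompt: string, streamJson: boolean): string[] {
  const args: string[] = ["--print", prompt];
  if (streamJson) {
    args.push("--output-format=stream-json");
    args.push("--verbose"); // silently required alongside stream-json
  }
  return args;
}

console.log(buildClaudeArgs("Build the feature", true).join(" "));
```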

The Funnel

Signal separated from noise:

```
33 total tasks
 - 10 infra/API failures (fixed, no longer relevant)
 -  2 cancelled
 -  1 timeout (fixed)
 -  1 CLI quirk (fixed)
 = 19 real attempts
 - 12 completed
 -  5 "no commits" (the actual problem)
 -  2 other failures
```

Strip the noise and the picture improves: 12 of 19, roughly 63% on real attempts. Not bad for an autonomous agent with no human in the loop. But 5 tasks and $8.88 wasted on overthinking — that's the leak.

Model Economics

| Model | Tasks | Cost | Avg/Task | Raw Success | Adjusted |
| --- | --- | --- | --- | --- | --- |
| Opus | 30 | $30.65 | $1.02 | 30% (9/30) | 50% (9/18) |
| Sonnet | 3 | $2.28 | $0.76 | 100% (3/3) | 100% (3/3) |

Three data points isn't a sample size. But the pattern is worth noting.

Opus's failure mode is overthinking. Reads everything, considers everything, plans extensively. On a constrained budget, that means it runs out of money before it writes code. On greenfield builds — where the codebase is small and the task is "just build it" — this is exactly wrong.

Sonnet's strength is mechanical execution. Clear task, does the task. No exploration spirals. No alternative-architecture tangents. Three tasks, three completions, $0.76 average.

This isn't "Sonnet is better." It's match the model to the task shape. Opus for complex modifications to large codebases where understanding context matters. Sonnet for greenfield builds and mechanical fixes where the path is clear.
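As a sketch, the routing rule this data suggests (the task-shape labels are ours, hypothetical, not a MissionControl API):

```typescript
type TaskShape = "greenfield" | "mechanical-fix" | "complex-modification";

// Hypothetical router matching model to task shape, per the pattern above.
function pickModel(shape: TaskShape): "opus" | "sonnet" {
  switch (shape) {
    // Clear path, mechanical execution: Sonnet's strength.
    case "greenfield":
    case "mechanical-fix":
      return "sonnet";
    // Large-codebase modifications where context matters: Opus.
    case "complex-modification":
      return "opus";
  }
}
```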

Three Changes We Made

The data pointed to three specific interventions. Shipped all three before starting the next batch.

1. Doubled All Budgets

| Parameter | Old | New |
| --- | --- | --- |
| Default task budget | $5 | $10 |
| Max task budget | $10 | $20 |
| Daily budget cap | $50 | $100 |

The hypothesis: "no commits produced" isn't an intelligence failure — it's a budget failure. Opus needs room to think and build. At $2, it can do one or the other. At $4-10, it can do both.

This is a bet. If doubling budgets converts those five failures into completions, the ROI is obvious — spending $4 to get working code beats spending $2 to get nothing. If it doesn't, we have a deeper problem that money won't fix.

2. Two-Phase Reviews

Single-phase reviews were inconsistent. Task #33 came back with "Done" and no detail. Task #31 found a real bug. Same prompt, different quality. Split analysis from execution.

Phase 1 — Opus analyzes. Read-only access. Reviews the PR diff against a structured checklist: logic errors, security, styling, imports, TypeScript compliance. Outputs a machine-readable verdict:

```html
<!-- REVIEW_VERDICT {"approved": false, "issues": [
  "src/components/VotingPanel.tsx:42 — duplicate accent color logic",
  "src/components/Icon.tsx — missing style?: CSSProperties prop"
]} -->
```

Budget: $1.50. Model: Opus. Tools: read-only (Bash, Read, Glob, Grep).
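The verdict comment exists to be machine-parsed. A sketch of how an orchestrator might extract it (the regex and function name are ours, not the production code):

```typescript
interface ReviewVerdict {
  approved: boolean;
  issues: string[];
}

// Pull the JSON payload out of the <!-- REVIEW_VERDICT ... --> comment.
// Returns null when the comment is absent or the JSON is malformed,
// so the caller can flag the review for retry.
function parseVerdict(output: string): ReviewVerdict | null {
  const match = output.match(/<!--\s*REVIEW_VERDICT\s*([\s\S]*?)-->/);
  if (!match) return null;
  try {
    return JSON.parse(match[1]) as ReviewVerdict;
  } catch {
    return null;
  }
}
```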

Phase 2 — Sonnet fixes. If Phase 1 finds issues, a child task is auto-created. Sonnet gets the issue list, fixes each one, runs `tsc --noEmit` and `npm run build`, commits, and pushes.

Budget: $1.00. Model: Sonnet. Tools: full access.
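The handoff from Phase 1 to Phase 2 is just the issue list turned into a task. A hypothetical child-task builder (field names and the function itself are illustrative, not MissionControl's actual API):

```typescript
interface FixTask {
  model: "sonnet";
  budgetUsd: number;
  prompt: string;
}

// Turn Phase 1's issue list into a Phase 2 fix task. The verification
// commands mirror the text above; everything else is a sketch.
function makeFixTask(issues: string[]): FixTask {
  return {
    model: "sonnet",
    budgetUsd: 1.0,
    prompt: [
      "Fix each issue below, then run `tsc --noEmit` and `npm run build`,",
      "commit, and push:",
      ...issues.map((issue) => `- ${issue}`),
    ].join("\n"),
  };
}
```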

Already caught real bugs in production PRs. The duplicate accent color in VotingPanel would have shipped. The missing style prop on icon components would have caused runtime issues in any consumer passing inline styles. Total review cost: $2.50 for analysis plus fixes — cheaper than a single Opus task that might or might not find anything.

3. Commit-Early Culture

The lead dev prompt now emphasizes incremental commits over perfect final PRs. Old pattern: plan everything, build everything, commit once at the end. Budget runs out before that final commit — zero output.

New pattern: commit after each meaningful unit of work. A partial feature with three commits is infinitely more valuable than a complete feature with zero commits.

Can't force the model to commit early — it's guidance, not enforcement. But combined with higher budgets, the goal is to shift the failure mode from "zero output" to "partial output." Partial output can be retried. Zero output is wasted money.
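The guidance itself is just prompt text. An illustrative excerpt of the kind of addition this describes (the wording is ours, not the exact production prompt):

```typescript
// Hypothetical lead-dev prompt fragment encoding the commit-early rule.
const COMMIT_EARLY_GUIDANCE = `
Commit after each meaningful unit of work: a passing component, a wired
route, a working migration. Do not save everything for one final commit.
A partial feature with three commits beats a complete feature with zero.
`.trim();

console.log(COMMIT_EARLY_GUIDANCE);
```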

What We're Watching

Batch 2 starts now. Three questions:

Does doubling budgets convert failures? If the five "no commits" tasks would have succeeded at $4-10, the completion rate will show it. If they still fail at higher budgets, the problem is in the prompt or the task shape, not the money.

Does two-phase review scale? Three review tasks isn't a pattern. Need 15-20 to know if the structured verdict format is reliable and if Sonnet consistently fixes what Opus finds.

Can we auto-calibrate? A greenfield build and a one-line config change shouldn't share a budget. Considering scope-size flags — small, medium, large — that auto-set budget and timeout based on expected complexity. Not built yet. Waiting for more data to set the thresholds.
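The scope-flag idea sketches out naturally as a lookup table. All values below are placeholders until batch 2 data sets real thresholds; nothing here is built:

```typescript
type Scope = "small" | "medium" | "large";

// Hypothetical calibration table: budgets and timeouts keyed by expected
// task complexity. Numbers are illustrative placeholders only.
const SCOPE_DEFAULTS: Record<Scope, { budgetUsd: number; timeoutMin: number }> = {
  small: { budgetUsd: 2, timeoutMin: 15 },    // one-line config change
  medium: { budgetUsd: 6, timeoutMin: 45 },   // feature in an existing codebase
  large: { budgetUsd: 10, timeoutMin: 120 },  // greenfield full-stack build
};

function defaultsFor(scope: Scope) {
  return SCOPE_DEFAULTS[scope];
}
```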

The Takeaway

Thirty-three tasks taught us more than building the system did. The system works. The question was always "how well?" Now we know: ~63% on real attempts, with a clear #1 failure mode we can measure and attack.

Not crashes. Not bugs. Not infrastructure. The agent thinks too much and ships nothing. Solvable problem. Higher budgets give it room. Two-phase reviews separate thinking from doing. Commit-early guidance reduces the blast radius of a timeout.

$32.93 for 33 tasks and a clear roadmap for improvement. Not bad.

Next up: batch 2 results — did the changes work?
