Error!! Failed Successfully
Part 5 of the MissionControl series. MissionControl is a Telegram bot that takes coding tasks in plain English, spawns a Claude Code session, and ships pull requests autonomously. Post 4 covered the safety stack and architecture after the first 48 hours.
The Notification
12:16 AM UTC. Telegram notification:
```
Task #23 failed: No commits produced despite success claim
```
Opened the Vercel URL anyway. The app loaded. Login screen with four demo users. Picked "Jordan," dragged some flight options around, watched the Borda count scores update in real time. Fully functional group travel planner. Mobile responsive. TypeScript strict. Deployed to production.
The database said failed. The app said otherwise.
The Task
Task #23 was a stress test. After the safety stack work in Post 4 — budget caps, timeouts, commit verification — the question was simple: could the bot take a complex, multi-feature MVP brief and ship it end-to-end? No human intervention, no hand-holding, no retries.
The prompt: a detailed spec for a group travel planning app. React + Tailwind, deployed to Vercel, with trip creation, drag-to-rank voting across multiple categories (flights, lodging, activities), four demo users with different roles, pre-seeded data, and an admin/member permission split. About 250 words — a product brief you'd hand a junior developer on day one.
Full task prompt
```
Build a group travel planning web app (React + Tailwind, deploy to Vercel).

Core concept: One person creates a trip, invites others. The lead planner
adds options for each category; all members vote via drag-to-rank. Results
are visible to everyone in real time.

MVP Sections:
- Flights
- Lodging (hotel or Airbnb)
- Activities & Dining

Each section lets the planner add multiple options (name, price, link,
image). Members drag to rank them. The top-ranked option is highlighted
as the group pick.

Demo Setup:
- No auth needed — login screen with 4 selectable demo users
- User 1: "Jordan" (Lead Planner/Admin)
- Users 2-4: "Alex," "Sam," "Priya" (Members)
- Pre-filled trip: "Miami Trip — July 4th Weekend" with 2-3 options
  per section and partial votes already cast
- Switching users changes your voting perspective and UI role

Admin view: Add/edit/delete options, see full vote breakdown
Member view: Drag-to-rank only, see live results after voting

Key Features:
- Trip dashboard with progress (how many people have voted per section)
- Drag-to-rank voting UI per section per user
- Live results view showing ranked outcomes across all voters
- Mobile-first, responsive layout
- Invite link UI (non-functional, just show a copyable link)

Tech: React, Tailwind CSS, in-memory state (no backend needed for MVP).
Seed all demo data on load. Deploy to Vercel.

Start with the full file/folder structure, then build it completely —
no placeholders. It must be fully functional and deployable before stopping.

One message. No follow-ups. Ship it.
```
Three Attempts
Attempt 1 ran for 18 minutes, then went silent. Exit code: null. The CLI process never terminated cleanly — stopped producing output and sat there until the timeout killed it. No commits, no progress, nothing salvageable. Classic Opus-on-a-2-core-box behavior: the model spent so long planning it exceeded the soft timeout before writing a single file.
Attempt 2 lasted 8 minutes before catching a SIGTERM (exit code 143). Our own timeout enforcement killed it mid-work. The bot was making progress this time, but not fast enough. Nothing committed.
Attempt 3 — 9 minutes and 5 seconds. 49 turns. $1.56.
Clean exit. Code zero.
```json
{
  "subtype": "success",
  "duration_ms": 543323,
  "num_turns": 49,
  "total_cost_usd": 1.56,
  "result": "Done. GroupTrip MVP deployed and live at https://grouptrip-work.vercel.app"
}
```
Commit `223a642` ships `feat: build group travel planning web app MVP`. Twenty-one files. 7,466 lines of code.
What It Built
The bot made every architectural decision on its own. No guidance on which drag-and-drop library to use, how to structure state, or what scoring algorithm to implement.
`@dnd-kit` for drag-and-drop. Not `react-beautiful-dnd` (deprecated), not `react-dnd` (heavier). The right call. Pulled in `@dnd-kit/core`, `@dnd-kit/sortable`, and `@dnd-kit/utilities`, then built a clean `SortableItem` component wrapping each votable option.
React Context over Redux. For an in-memory MVP with no backend, this is correct. A global store with `useContext` and `structuredClone` for immutable updates. No unnecessary dependencies, no boilerplate.
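The `structuredClone` update pattern is easy to sketch. This is a hypothetical shape, not the actual `store.tsx`; `Trip` and `castVote` are illustrative names:

```typescript
// Hypothetical in-memory store update using structuredClone.
// Not the bot's actual code -- just the immutability pattern described above.
interface Trip {
  votes: Record<string, string[]>; // userId -> ranked option ids
}

function castVote(trip: Trip, userId: string, ranking: string[]): Trip {
  const next = structuredClone(trip); // deep copy, so React sees a new reference
  next.votes[userId] = ranking;
  return next; // the original `trip` is untouched
}

const trip: Trip = { votes: { jordan: ["a", "b"] } };
const updated = castVote(trip, "alex", ["b", "a"]);
console.log(trip.votes["alex"]);    // undefined -- original not mutated
console.log(updated.votes["alex"]); // ["b", "a"]
```

Because every update returns a fresh object, React re-renders on reference change without needing a reducer or an external state library.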
Borda count scoring. The brief said "drag to rank" and "top-ranked option highlighted." The bot decided to use Borda count — a ranked-choice voting algorithm where each position gets a score (first place = N points, second = N-1, and so on). Calculated scores across all voters and surfaced the winner per category. Nobody asked for Borda count. The bot read "drag to rank" and picked an appropriate algorithm on its own.
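Borda count itself is simple to sketch. A minimal TypeScript version with hypothetical names (`bordaScores`, `groupPick`) and made-up option ids; this is not the bot's actual scoring code:

```typescript
// Borda count: with N options, first place earns N points, second earns N-1,
// and so on. Highest total across all voters wins. Illustrative sketch only.
type Ranking = string[]; // option ids, best first

function bordaScores(rankings: Ranking[]): Map<string, number> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    const n = ranking.length;
    ranking.forEach((optionId, position) => {
      // position 0 (first place) gets n points, last place gets 1
      scores.set(optionId, (scores.get(optionId) ?? 0) + (n - position));
    });
  }
  return scores;
}

function groupPick(rankings: Ranking[]): string | undefined {
  const scores = bordaScores(rankings);
  let best: string | undefined;
  for (const [id, score] of scores) {
    if (best === undefined || score > (scores.get(best) ?? 0)) best = id;
  }
  return best;
}

// Three voters rank three flight options:
const votes: Ranking[] = [
  ["delta", "united", "jetblue"],
  ["delta", "jetblue", "united"],
  ["united", "delta", "jetblue"],
];
console.log(groupPick(votes)); // "delta"
```

With three options a first-place vote is worth 3 points, so "delta" wins with 8 points (3+3+2) over "united" at 6, which is exactly the kind of consensus-over-plurality behavior a drag-to-rank UI wants.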
The file structure:
```text
src/
  components/
    CategorySection.tsx   # Per-category voting container
    Dashboard.tsx         # Trip overview + progress
    LoginScreen.tsx       # Demo user picker
    SortableItem.tsx      # dnd-kit wrapper
    VotingPanel.tsx       # Drag-to-rank UI
    ResultsPanel.tsx      # Borda count results
  lib/
    seed-data.ts          # Pre-filled Miami Trip
    store.tsx             # React Context state
  types/
    index.ts              # Shared TypeScript types
```
Clean separation. Types in one file, seed data isolated, state in its own module, seven focused components each doing one thing. The seed data included a pre-filled "Miami Trip" with three categories, 2-3 options each, and partial votes already cast — exactly as specified. Switching between demo users changes the perspective: Jordan sees admin controls, the others see the voting interface.
Code quality: 7.5 out of 10. TypeScript strict mode, no `any` types, proper immutability with `structuredClone`, clean component boundaries. A few things a senior dev would tighten — some components could be split further, the store could use a reducer pattern instead of raw `setState` — but nothing that would block a code review. Ship it.
The False Failure
The app works. It's deployed. The code is clean. Why did the database say "failed"?
The sequence in the runner's `finally` block:
1. The CLI exited with code 0 and reported success.
2. The runner checked for commits: `git rev-list --count main..HEAD`.
3. At that exact moment, the working tree was dirty — the Vercel CLI had written deployment cache files the bot didn't commit.
4. Auto-rescue logic detected the dirty state and ran `git add -A && git commit -m 'WIP: auto-rescue'`.
5. But the commit-count check had already run against the branch before the rescue commit landed.
Race condition. The runner checked for commits, found zero (commit `223a642` was there, but the branch comparison ran against the wrong ref), then the rescue committed after the check. Error message: "no commits produced." Reality: commit `223a642` had 7,466 lines of working code. The Vercel deploy had already completed inside the CLI session. The app was live at grouptrip-work.vercel.app before the runner even started its verification.
Failed successfully.
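The ordering hazard can be modeled in a few lines. This is an illustrative simulation, not the runner's actual code; `Repo`, `verifyBuggy`, and `verifyFixed` are made-up names:

```typescript
// Simulated repo state: commits on the feature branch vs. uncommitted files
// left in the working tree. Illustrative model of the check/rescue ordering.
interface Repo {
  branchCommits: number;
  dirtyFiles: string[];
}

// Buggy order: count commits first, rescue-commit after.
// The rescue commit lands too late for the check to see it.
function verifyBuggy(repo: Repo): boolean {
  const count = repo.branchCommits; // check runs first...
  if (repo.dirtyFiles.length > 0) {
    repo.branchCommits += 1;        // ...rescue commit lands after
    repo.dirtyFiles = [];
  }
  return count > 0;
}

// Fixed order: rescue first, then count what's actually on the branch.
function verifyFixed(repo: Repo): boolean {
  if (repo.dirtyFiles.length > 0) {
    repo.branchCommits += 1;
    repo.dirtyFiles = [];
  }
  return repo.branchCommits > 0;
}

const repo: Repo = { branchCommits: 0, dirtyFiles: [".vercel/project.json"] };
console.log(verifyBuggy(structuredClone(repo))); // false -- "no commits produced"
console.log(verifyFixed(structuredClone(repo))); // true
```

Same repo state, opposite verdicts. Any verification that reads mutable state and then mutates it in the same pass has this failure mode baked in.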
The Fix
The bot's workflow was doing things in the wrong order, and the runner's verification had a blind spot.
Commit `f277b4b` patched four things in the bot's prompt template:
Build verification. Added `npm run build` as a required step after TypeScript checking and before committing. The bot was already running `tsc --noEmit`, but a passing type check doesn't guarantee a passing build.
Vercel preview deploy. If the project has a `.vercel/` directory or `vercel.json`, the bot now runs `vercel --yes` (not `--prod`) as a step. Preview deploys, not production. The human decides when to promote.
"Never merge to main" guardrail. Explicit instruction in the prompt: work on the feature branch, push the branch, the reviewer merges. The bot was already doing this. Making it explicit prevents drift.
"Never `git add -A`" guardrail. Stage specific files with `git add <file>`. Directly prevents the scenario that caused the false failure. If the Vercel CLI drops cache files in the working tree, the bot won't blindly commit them.
The bot now follows the same workflow as the human team. Type check, build, commit specific files, push branch, deploy preview. No shortcuts.
The Scoreboard
Three attempts across two days:
| Attempt | Duration | Exit | Cost | Result |
|---|---|---|---|---|
| 1 | 18 min | null (hung) | -- | No output |
| 2 | 8 min | 143 (SIGTERM) | -- | Killed mid-work |
| 3 | 9 min | 0 (clean) | $1.56 | Deployed to production |
Attempt 3: 49 turns, 1.16M cached tokens, 22.5K output tokens against Opus. Total API cost: $1.56.
What didn't work: Opus on a 2-core server chokes on complex planning. Attempt 1 spent its entire budget thinking. Attempt 2 got killed by our own timeout enforcement before finishing. Two runs wasted — not because the model couldn't do the work, but because the infrastructure couldn't give it enough room to think.
What worked: when the bot actually got to execute, it made good decisions. Right library for drag-and-drop. Right state management for the scope. Right algorithm for ranked voting. Clean file structure. TypeScript strict. Deployed and functional.
$1.56 and 9 minutes of compute. An autonomous agent built a production-quality MVP that would take a human developer a full day. The app is live. The code is clean.
The database was wrong.
Failed successfully.
Next up: Post 6 — 33 tasks analyzed to find out what actually works, what doesn't, and where the money goes.