Joske Vermeulen

What Breaks When You Let AI Agents Run Unsupervised for 4 Days

OpenClaw Challenge Submission 🦞

This is a submission for the OpenClaw Writing Challenge

I gave 7 AI coding agents $100 each and told them to build startups. No human coding. They pick the idea, write the code, deploy the site, and try to get users. I just handle the infrastructure and answer help requests (max 1 hour per week per agent).

Four days in, I've learned more about how autonomous agents actually behave than I did in months of reading benchmarks. Here's what nobody tells you about running AI agents in production.

The memory problem is worse than you think

Every agent session starts fresh. The model has no memory of previous sessions. So we use markdown files as the memory layer: PROGRESS.md (what's been done), DECISIONS.md (key choices), IDENTITY.md (the startup vision). The agent reads these at the start and updates them at the end.
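As a rough sketch of how an orchestrator might consume this memory layer (the file names match the post; the function names and "Day 1" logic are my own illustration, not the experiment's actual code):

```python
from pathlib import Path

# The markdown files that serve as the agent's cross-session memory.
MEMORY_FILES = ["PROGRESS.md", "DECISIONS.md", "IDENTITY.md"]

def load_memory(repo_root):
    """Read the agent's memory files from the project root.

    Missing files come back as None. Note this only looks in the
    root -- if an agent writes its files into a subfolder, the
    orchestrator sees nothing and the agent starts from scratch.
    """
    root = Path(repo_root)
    return {
        name: (root / name).read_text() if (root / name).exists() else None
        for name in MEMORY_FILES
    }

def is_fresh_start(memory):
    # No PROGRESS.md means the next session believes it's Day 1.
    return memory["PROGRESS.md"] is None
```

The fragility is visible in the code itself: everything hinges on an exact path convention that nothing enforces.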

Sounds simple. Here's what actually happened.

One agent (Kimi, running through Kimi CLI) put all its files in a startup/ subfolder instead of the project root. The orchestrator reads PROGRESS.md from root. When the next session started, there was no progress file. The agent thought it was Day 1. It brainstormed a completely different startup idea and built it from scratch.

Kimi now has two half-built startups in the same repository: a log analysis tool called LogDrop in the subfolder, and a SQL schema diff tool called SchemaLens in the root. After 14 sessions, it still hasn't discovered the subfolder. The first startup is just sitting there, abandoned, with a working MVP that nobody knows about.

The lesson isn't "use better memory systems." The lesson is that file conventions are load-bearing infrastructure for autonomous agents. One wrong directory equals total amnesia.

*The race dashboard showing Kimi's stats*

Agents interpret everything as instructions

The orchestrator prompt included this line: "Your repo auto-deploys on every git push." It was meant as context, explaining how Vercel works. One agent (Codex) read it as an instruction and ran git push after every single commit during its sessions. It burned through 26 of the account's 100 daily Vercel deployments by itself.

We fixed the prompt: "Do NOT run git push. The orchestrator pushes after your session."

Codex obeyed the letter of the rule. It stopped running git push. Instead, it started running npx vercel --prod directly. Same result, different command. It also started taking Playwright screenshots of its own pricing page at mobile and desktop sizes to visually verify the layout before committing. Nobody told it to do this.

The result: Codex has the most polished live product of all 7 agents. The immediate feedback loop from deploying after every change is making it a better builder than the agents that commit blindly and hope for the best.

We decided to let it keep doing this. Sometimes the best behavior comes from agents working around your constraints.

The agents that ask for help are beating the ones that just code

All 7 agents get the same instructions about requesting human help: "Create a file called HELP-REQUEST.md with what you need, steps for the human, time estimate, and priority."
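For concreteness, here is one way the orchestrator's side of that protocol could look. The filename and the four required sections come from the instructions quoted above; the parsing and function names are my guess at a triage step, not the experiment's real code:

```python
from pathlib import Path

# Sections the instructions require in every help request.
REQUIRED_SECTIONS = ["what", "steps", "time estimate", "priority"]

def read_help_request(repo_root):
    """Return the help request text if the agent filed one, else None.

    An agent that writes its request anywhere else (say, into the
    response file HELP-STATUS.md) is invisible to this check.
    """
    path = Path(repo_root) / "HELP-REQUEST.md"
    if not path.exists():
        return None
    return path.read_text()

def missing_sections(request_text):
    # Crude keyword scan: which required sections never appear?
    lower = request_text.lower()
    return [s for s in REQUIRED_SECTIONS if s not in lower]
```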

Five agents figured this out. Two didn't.

Claude (running through Claude Code) used 55 of its 60 weekly help minutes in two requests. It got its entire infrastructure set up in one shot: domain, Supabase database, Stripe payments, Resend email, cron jobs, admin dashboard. Smart move. It has the fewest sessions per day (expensive model) so it maximized human help to compensate.

GLM asked for exactly three things on Day 1: domain, Stripe, and Google Analytics. Clean, focused, with backup plans for each item. It now has 12 real users and is the only agent with actual traffic data.

Codex submitted the same help request 5 sessions in a row until we set up email sending. Persistent to the point of spamming. Then it sent 6 customer validation emails to real companies within 24 hours of getting access.

Meanwhile, Gemini has never created a help request in 27 sessions. We investigated and found something fascinating: it's been editing HELP-STATUS.md (the file where the orchestrator writes human responses) saying "I still need database credentials." It's writing in the response channel instead of the request channel. Like an employee who writes "I need database access" in their journal but never emails IT.

DeepSeek hasn't asked for help either. It has Stripe integration code ready but never requested API keys. It's been polishing the checkout flow for 4+ commits. A beautiful integration that can never work because there are no keys behind it.

Same instructions. Wildly different behavior.

*Help Request Tracker*

Self-inflicted traps are the hardest to escape

DeepSeek created a DEPLOY-STATUS.md file early on, saying it needs Stripe keys and an OpenAI API key. The orchestrator prompt says: "If DEPLOY-STATUS.md exists, your site is BROKEN. Fix it before anything else."

The site isn't broken. DeepSeek just used the wrong file to document what it needs. But now every session starts by trying to fix a non-existent deployment problem. Twenty-four sessions lost to a file it wrote itself.

We eventually upgraded the deploy checker to also verify the homepage returns HTTP 200 (not just that the build succeeded). This caught the real issue: DeepSeek's vercel.json routing config was broken, and the site was returning 404 for all pages. The build "succeeded" but nothing was actually served.
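The upgraded check boils down to one idea: a green build is not a live site, so issue a real GET against the homepage. A minimal sketch with the standard library (the function names are mine; the experiment's checker may look different):

```python
import urllib.error
import urllib.request

def homepage_status(url, timeout=10):
    """Fetch the homepage and return its HTTP status code.

    A successful build says nothing about routing: a broken
    vercel.json can leave every page returning 404 while the
    build log stays green. Only a real request catches that.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx raise, but still carry a status

def site_is_healthy(status_code):
    # The upgraded rule: build success AND homepage returns 200.
    return status_code == 200
```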

The agent had no way of knowing. It never checked its own site. It never asked for analytics. It just kept coding.

Quantity vs quality is playing out in real time

Gemini gets 8 sessions per day (the most of any agent). It has written 235 blog posts in 27 sessions. One blog post every 14 minutes during active sessions. All variations of "Local SEO for [industry] in 2026."

It also wrote blog post #89: "The Human Advantage: Why AI-Generated Content is Failing Local Businesses." An AI agent that writes 9 blog posts per session wrote an article about why AI content doesn't work.

GLM gets 2 sessions per day (the fewest). It has 5 working calculators, 8 blog posts, and 12 real users. Every session ships something useful.

The question the race is testing: does Gemini's 235 posts outperform GLM's 5 calculators? We'll know in a few weeks when Google indexes everything and we can see what actually ranks.

What I'd do differently

If I were starting over, I'd change three things:

  1. Enforce file structure from the start. A pre-commit hook that validates PROGRESS.md exists in root would have prevented Kimi's amnesia.

  2. Add a homepage health check from Day 1. We added it on Day 4 after discovering DeepSeek's site had been returning 404 for days. Every agent should know immediately if their site is broken.

  3. Make the help request system more obvious. Two of seven agents never figured out HELP-REQUEST.md despite clear instructions. Maybe the orchestrator should prompt them: "Do you need human help? Create HELP-REQUEST.md."
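Fix #1 above is a few lines of code. A sketch of the validation a pre-commit hook could run, assuming the memory convention described earlier (the function name and messages are illustrative):

```python
from pathlib import Path

def progress_file_problems(repo_root="."):
    """Return a list of memory-file violations, empty if all is well.

    Two checks: PROGRESS.md must exist in the project root, and no
    copy may hide in an immediate subfolder -- the exact failure
    mode that gave Kimi two half-built startups.
    """
    root = Path(repo_root)
    problems = []
    if not (root / "PROGRESS.md").exists():
        problems.append("PROGRESS.md missing from project root")
    for stray in root.glob("*/PROGRESS.md"):
        problems.append(f"stray memory file: {stray}")
    return problems
```

Wired into `.git/hooks/pre-commit` as a script that exits non-zero when the list is non-empty, this would have rejected Kimi's very first misplaced commit.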

But honestly, the failures are the most valuable data. An experiment where everything works perfectly teaches you nothing. The broken parts are where the insights live.


The race runs for 12 weeks. Daily digests and weekly recaps at aimadetools.com/race. All 7 repos are public on GitHub. If you're building with autonomous agents, the patterns we're documenting might save you from the same mistakes.
