Joske Vermeulen

AI Startup Race Day 1 Recap: One Agent Forgot Its Own Work

I'm running an experiment called The $100 AI Startup Race: 7 AI coding agents each get $100 and 12 weeks to build a real startup from scratch. No human coding. They autonomously pick a business idea, write code, deploy a live website, and try to get real users and revenue.

The agents: Claude, Codex, Gemini, Kimi, DeepSeek, Xiaomi (MiMo), and GLM.

Day 1 is done. Here's what happened.

The scoreboard

Agent     Startup                            Commits  Sessions  Blog Posts
Gemini    LocalLeads (local SEO)                 169        10         104
DeepSeek  NameForge AI (name generator)           91        10           0
Kimi      SchemaLens / LogDrop                    58         5           9
Codex     NoticeKit (GDPR notices)                56         7           0
Claude    PricePulse (pricing intel)              53         3          11
GLM       FounderMath (startup calculators)       24         2           5
Xiaomi    WaitlistKit (viral waitlists)           16         3           1

Total: 467 commits, 7 live websites, 130 blog posts. In 24 hours.

Kimi forgot its own work

This is the story of the day.

Kimi's first session ran at 3 AM. It chose to build LogDrop, a log analysis tool. It created identity files, a backlog, landing pages, pricing, a blog, and even a working MVP with a JSON log parser, search, filters, and CSV export.

One problem: it put everything in a startup/ subfolder instead of the root directory.

The orchestrator gives agents their memory between sessions by reading PROGRESS.md from the root. When Kimi's second session started, there was no PROGRESS.md in root. The agent thought it was Day 1. It brainstormed a completely different idea. It built SchemaLens, a SQL schema diff tool, from scratch.
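The failure mode is easy to reproduce. Here's a minimal sketch of that memory handoff (my own illustration; the orchestrator's actual code isn't shown in this post, and `load_agent_memory` / `find_progress_anywhere` are hypothetical names):

```python
from pathlib import Path

def load_agent_memory(repo_root: str) -> str:
    """Read the agent's progress notes from the repo root.

    If PROGRESS.md is missing from the root (because the agent
    wrote it into a subfolder), this returns nothing and the next
    session starts as if it were Day 1 -- Kimi's exact failure.
    """
    progress = Path(repo_root) / "PROGRESS.md"
    if progress.is_file():
        return progress.read_text()
    return ""  # no memory found: total amnesia

def find_progress_anywhere(repo_root: str) -> str:
    """A more defensive variant: search subfolders before giving up."""
    matches = sorted(Path(repo_root).rglob("PROGRESS.md"))
    return matches[0].read_text() if matches else ""
```

A recursive fallback like `find_progress_anywhere` would have rescued Kimi's session, at the cost of possibly picking up a stale file from the wrong project.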

Kimi now has two half-built startups in the same repo. Its help request for LogDrop's domain is stuck in the subfolder where the orchestrator can't find it.

One wrong directory = total memory loss between sessions.

The agent didn't crash. It didn't throw an error. It just quietly forgot everything and started over with a different idea.

Gemini wrote 104 blog posts

Gemini has 8 sessions per day (the most of any agent). By end of Day 1, LocalLeads had 104 blog posts on local SEO topics. One blog post every 14 minutes.

For comparison: Claude wrote 11. GLM wrote 5. Xiaomi wrote 1.

The question for the rest of the race: does quantity beat quality?

Codex burned 26 Vercel deployments

The orchestrator prompt said: "Your repo auto-deploys on every git push." This was meant as context. Codex read it as an instruction.

It ran git push after nearly every commit during its sessions. Each push triggered a Vercel deployment. By mid-afternoon, Codex had consumed 26 of the account's 100 daily deployments.

Lesson: with autonomous agents, every sentence in the prompt is a potential instruction. If you don't want them to do something, say so explicitly.

We fixed it with three changes:

  1. Prompt update: "Do NOT run git push. The orchestrator pushes after your session."
  2. vercel.json to disable preview deployments
  3. Commit squashing (all session commits become one before pushing)
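For the second fix, the kill switch lives in vercel.json. This is a sketch based on Vercel's documented `git.deploymentEnabled` setting (the branch names are assumptions for this repo):

```json
{
  "git": {
    "deploymentEnabled": {
      "main": true,
      "*": false
    }
  }
}
```

Setting non-production branches to `false` stops every push from minting a preview deployment, while production deploys from `main` keep working.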

GLM's quality approach

GLM only had 2 sessions but made them count. FounderMath already has three working calculators: SAFE note calculator (all 4 YC SAFE types), dilution calculator, and runway calculator.
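To show the kind of math these tools wrap in a UI (my own sketch, not FounderMath's code): runway is cash on hand divided by net monthly burn, and dilution in a priced round is the amount raised over the post-money valuation.

```python
def runway_months(cash: float, monthly_burn: float,
                  monthly_revenue: float = 0.0) -> float:
    """Months of runway left: cash divided by net monthly burn."""
    net_burn = monthly_burn - monthly_revenue
    if net_burn <= 0:
        return float("inf")  # break-even or profitable: unbounded runway
    return cash / net_burn

def dilution(pre_money: float, raise_amount: float) -> float:
    """Fraction of the company sold in a priced round."""
    post_money = pre_money + raise_amount
    return raise_amount / post_money

# e.g. $120k in the bank, $15k/mo burn, $5k/mo revenue
# -> 120000 / (15000 - 5000) = 12 months of runway
```

The SAFE calculator is the genuinely tricky one of the three, since each of the four YC SAFE variants (pre/post-money, with/without discount) converts differently.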

It also submitted the best help request of any agent: clear format, backup plans for each item, budget specified, priority levels, and even suggested the DNS record type for the domain.

What I learned on Day 1

  1. File conventions are critical for agent memory. One agent putting files in a subfolder caused total amnesia.
  2. Prompt wording is everything. Context gets interpreted as instructions.
  3. Shared deployment limits are a real constraint. 7 agents + 1 blog on one Vercel account = problems.
  4. Agents without web search pick generic ideas. The two agents running without web access (DeepSeek, Xiaomi) chose the most crowded markets.

Follow along

Everything is public: code, costs, decisions, and progress.

I'll be posting weekly recaps and daily highlights for the full 12 weeks. Would love to hear what you'd want to see tracked or compared.
