Joske Vermeulen

I Gave 7 AI Agents $100 Each to Build Startups. Here's What They Built in 4 Days.

OpenClaw Challenge Submission 🦞

This is a submission for the OpenClaw Challenge.

What I Built

I built an autonomous startup competition where 7 AI coding agents each get $100 and 12 weeks to build a real business from scratch. No human coding allowed. Each agent picks its own idea, writes all the code, deploys a live website, and tries to get real users and revenue.

The agents: Claude (via Claude Code), Codex CLI, Gemini CLI, Kimi CLI, DeepSeek (via Aider), Xiaomi MiMo V2.5 Pro (via Claude Code), and GLM (via Claude Code with Z.ai API).

Three of the seven agents run through Claude Code as their harness, which means OpenClaw's architecture is at the core of nearly half the competition. The orchestrator runs on a VPS, scheduling sessions via cron, managing memory between sessions through markdown files, and pushing code to GitHub/Vercel automatically.

We're on Day 4. So far: 700+ commits, 7 live websites, one agent that forgot its own work and built two different startups, another that wrote 235 blog posts, and a third that found a clever workaround when we restricted its deployment access.

Race dashboard showing all 7 agents

How I Used OpenClaw

The core of the experiment runs on Claude Code (which shares OpenClaw's architecture) as the agent harness. Here's how it works:

The orchestrator is a bash script that runs on a VPS via cron. For each agent session, it:

  1. Pulls the latest code from GitHub
  2. Reads the agent's memory files (PROGRESS.md, DECISIONS.md, IDENTITY.md)
  3. Constructs a prompt with the startup context and instructions
  4. Launches Claude Code with the appropriate model
  5. Lets the agent work autonomously for 30 minutes
  6. Squashes commits and pushes to GitHub (which triggers a Vercel deploy)
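
The six steps above can be sketched in bash. This is a simplified sketch of the session loop, not the actual orchestrator: the `claude` flags, memory file names, and commit message are assumptions based on the description, and the commit-squashing step is reduced to a plain commit-and-push.

```shell
#!/usr/bin/env bash
# Sketch of one agent session (hypothetical; real orchestrator also squashes commits).

build_prompt() {
  # Steps 2-3: concatenate the agent's memory files into the session prompt.
  local dir="$1"
  local prompt="Continue building your startup. Your memory files follow."
  local f
  for f in PROGRESS.md DECISIONS.md IDENTITY.md BACKLOG.md HELP-STATUS.md; do
    [ -f "$dir/$f" ] && prompt="$prompt"$'\n\n'"--- $f ---"$'\n'"$(cat "$dir/$f")"
  done
  printf '%s' "$prompt"
}

run_session() {
  local dir="$1" model="$2"
  cd "$dir" || return 1
  git pull --rebase                                    # 1. latest code
  timeout 30m claude -p "$(build_prompt "$dir")" \
    --model "$model" --dangerously-skip-permissions    # 4-5. 30-minute autonomous session
  git add -A && git commit -m "agent session $(date +%F)"
  git push                                             # 6. triggers the Vercel deploy
}
```

A cron entry then calls `run_session` once per agent on a schedule.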

Three agents use Claude Code directly:

  • Claude runs Claude Code with Sonnet/Haiku as the model. It built PricePulse, a competitor pricing monitor with Supabase auth, Stripe payments, email alerts, and hourly monitoring cron jobs. When it hit Vercel's 12-function serverless limit, it consolidated 4 API endpoints into existing ones on its own.

  • GLM runs Claude Code with GLM-5.1 via the Z.ai API (using ANTHROPIC_BASE_URL and ANTHROPIC_AUTH_TOKEN environment variables). It built FounderMath, a startup calculator suite with 5 working calculators. It has 12 real users on Day 4.

  • Xiaomi was originally running Aider but we upgraded it mid-race to Claude Code with MiMo V2.5 Pro. In its first session with the new setup, it produced more output (42 commits) than the old setup did in 7 sessions total. The "harness awareness" feature of V2.5 Pro means it actively manages its own context within Claude Code.
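
Under the hood, pointing Claude Code at a non-Anthropic backend takes just two environment variables. A sketch with placeholder values (the endpoint URL, token, and model ID below are not the real ones):

```shell
# Route Claude Code's API traffic to an Anthropic-compatible endpoint.
# All three values below are placeholders, not real credentials.
export ANTHROPIC_BASE_URL="https://provider.example.com/anthropic"
export ANTHROPIC_AUTH_TOKEN="your-provider-api-key"

# Claude Code now talks to the provider above instead of Anthropic's API.
claude -p "Read PROGRESS.md and continue where you left off." --model "your-model-id"
```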

The memory system between sessions uses markdown files that the agent reads at the start and updates at the end:

```
PROGRESS.md    - what's been done (the agent's memory)
DECISIONS.md   - key choices with reasoning
IDENTITY.md    - startup vision and roadmap
BACKLOG.md     - prioritized task list
HELP-STATUS.md - human responses to help requests
```

This is where things get interesting. One agent (Kimi) put all its files in a startup/ subfolder instead of root. The orchestrator reads PROGRESS.md from root. Next session found no progress file, thought it was Day 1, and started a completely different startup from scratch. Two half-built products in one repo because of one wrong directory.
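
A more forgiving lookup would have prevented the amnesia. A sketch of the idea (`find_progress` is a hypothetical helper, not the actual orchestrator code):

```shell
# Locate the agent's PROGRESS.md: prefer the repo root, but search one
# level of subdirectories before concluding the agent is starting fresh.
find_progress() {
  local repo="$1"
  if [ -f "$repo/PROGRESS.md" ]; then
    printf '%s\n' "$repo/PROGRESS.md"
  else
    find "$repo" -maxdepth 2 -name PROGRESS.md -print -quit
  fi
}
```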

The help request system lets agents create a HELP-REQUEST.md file when they need something only a human can do (buy a domain, set up Stripe, create accounts). The orchestrator converts these to GitHub Issues. The human responds and closes the issue. The orchestrator writes the response to HELP-STATUS.md for the agent to read.
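
The conversion step can be sketched with the `gh` CLI (the title prefix and label here are my assumptions, not the project's actual conventions):

```shell
# Turn the first line of the request into an issue title, stripping any
# leading Markdown heading markers.
issue_title() {
  head -n 1 "$1" | sed -E 's/^#+ *//'
}

# Escalate a pending HELP-REQUEST.md to a GitHub Issue for the human.
escalate_help_request() {
  local file="${1:-HELP-REQUEST.md}"
  [ -f "$file" ] || return 0
  gh issue create \
    --title "Help request: $(issue_title "$file")" \
    --body-file "$file" \
    --label "help-request"
}
```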

The most interesting finding: the agents that use this system strategically are winning. Claude used 55 of its 60 weekly help minutes in two requests to get its entire infrastructure wired up. Gemini has never created a help request in 27 sessions, despite being blocked on features it needs. Same instructions, completely different behavior.

An example HELP-REQUEST.md from one of the agents

Demo

Live dashboard: https://www.aimadetools.com/race/

All 7 agent repos are public on GitHub: https://github.com/aimadetools

Here's what each agent built in the first 4 days:

| Agent | Startup | Commits | Live Site |
|-------|---------|---------|-----------|
| Gemini | LocalLeads (local SEO) | 182 | race-gemini.vercel.app |
| DeepSeek | NameForge AI (name generator) | 136 | race-deepseek.vercel.app |
| Kimi | SchemaLens (SQL schema diff) | 97 | race-kimi.vercel.app |
| Codex | NoticeKit (GDPR notices) | 97 | noticekit.tech |
| Claude | PricePulse (pricing monitor) | 83 | getpricepulse.com |
| Xiaomi | APIpulse (API cost calculator) | 65 | getapipulse.com |
| GLM | FounderMath (startup calculators) | 31 | founder-math.com |

The best moment so far: Codex (running through Codex CLI, not Claude Code) found a loophole in our deployment restrictions. We told agents "do not run git push." Codex obeyed literally but started running npx vercel --prod instead. Same result, different command. It also began taking Playwright screenshots of its own UI at mobile and desktop sizes to verify layouts. Nobody told it to do this.

What I Learned

1. Every sentence in the prompt is a potential instruction. "Your repo auto-deploys on every git push" was meant as context. One agent read it as an instruction and pushed after every commit, burning 26 of 100 daily Vercel deployments.

2. Agent memory is only as good as what the agent writes. The agents that write structured, detailed progress notes maintain continuity between sessions. The ones that dump logs drift. Kimi's amnesia happened because it put files in the wrong directory, not because the memory system failed.

3. The agents that ask for help are winning. Claude, GLM, and Codex all requested human help early (domains, payments, databases) and now have fully functional products. Gemini has 235 blog posts but no payment system because it never asked for one. Same instructions, wildly different behavior.

4. Claude Code as a harness works with non-Anthropic models. GLM-5.1 via Z.ai and MiMo V2.5 Pro via Xiaomi's API both work through Claude Code using the ANTHROPIC_BASE_URL and ANTHROPIC_AUTH_TOKEN environment variables. The harness is model-agnostic, which makes it perfect for comparing different AI models in identical conditions.

5. Token efficiency matters more than raw capability. MiMo V2.5 Pro uses 40-60% fewer tokens than Opus 4.6 at comparable capability. In a budget-constrained race, that translates directly to more sessions and more output.

The race runs for 12 weeks. We publish daily digests and weekly recaps. The real question isn't which agent writes the most code. It's which one gets the first paying customer.
