sai-builder

Posted on May 20

I Ran an Autonomous AI Startup Loop 5 Times. It Hit Two Ceilings.

#ai #startup #agents #automation

There's a question floating around right now that everyone has an opinion on and almost no one has run: what happens if you let AI design and launch a business by itself?

Not "AI helps you brainstorm." Not "AI writes your landing page copy." I mean the whole loop — generate the idea, evaluate it, decide whether to kill it or ship it — with no human context injected anywhere. No founder's domain expertise. No "I happen to know this industry." No personal network. Just three AI roles passing artifacts to each other and a hard rule that PASS means we build.

I built that loop and ran it five times. It produced zero passes.

That sounds like a failure, and in the narrow sense it is. But the shape of the failure turned out to be more interesting than a success would have been. The loop didn't fail randomly. It failed against two walls, in the same place, every cycle. And those two walls happen to be a pretty clean map of where autonomous AI ends and where humans still sit.

This is the writeup. No spin. The scores were bad and I'm going to show you the bad scores.

The setup

Three roles, no shared memory of "me":

Generator — proposes a business idea from scratch. It is explicitly forbidden from referencing any human's background, skills, or relationships. It works from market structure only.
Evaluator — scores the idea against a fixed rubric.
Judge — reads the score and either advances the idea, sends it back for one revision, or kills it.

The rubric is 5 criteria × 10 points (50 max), minus three penalties. PASS threshold is 40.

score =
    market_pull        # is there real, urgent demand?
  + willingness_to_pay # will someone actually pay, and how much?
  + defensibility      # can a small outside builder hold a position?
  + time_to_revenue    # how fast to first dollar, solo?
  + execution_fit      # can this be built and run without a team?
  # each 0–10, raw max 50

penalties (subtracted):
  - os_absorption_risk   # will a platform/OS just absorb this as a feature?
  - competitor_death     # is this a known graveyard pattern?
  - price_tier_squatting # is the obvious price tier already occupied for free?

PASS if final_score >= 40

The penalties are the part that matters. Anyone can generate an idea that scores well on raw appeal. The penalties are where ideas go to die, and they're modeled on the three ways small software businesses actually get killed: a platform builds your feature into itself, you walk into a category that has a body count, or the price point you need is already occupied by a free incumbent.
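For concreteness, here is a minimal Python sketch of that rubric. It is an illustration, not the loop's actual code: the `Score` class and its layout are assumptions, and since the post doesn't say how large a penalty can get, the sketch only requires penalties to be non-negative. The sample numbers at the bottom are made up.

```python
from dataclasses import dataclass

PASS_THRESHOLD = 40  # PASS bar stated in the rubric

CRITERIA = (
    "market_pull",
    "willingness_to_pay",
    "defensibility",
    "time_to_revenue",
    "execution_fit",
)
PENALTIES = (
    "os_absorption_risk",
    "competitor_death",
    "price_tier_squatting",
)


@dataclass(frozen=True)
class Score:
    """Hypothetical container for one Evaluator result."""

    market_pull: int
    willingness_to_pay: int
    defensibility: int
    time_to_revenue: int
    execution_fit: int
    os_absorption_risk: int = 0
    competitor_death: int = 0
    price_tier_squatting: int = 0

    def __post_init__(self) -> None:
        for name in CRITERIA:
            if not 0 <= getattr(self, name) <= 10:
                raise ValueError(f"{name} must be between 0 and 10")
        for name in PENALTIES:
            # Assumption: no upper cap on a penalty; the post does not give one.
            if getattr(self, name) < 0:
                raise ValueError(f"{name} must be non-negative")

    @property
    def raw(self) -> int:
        # Sum of the five criteria, 0 to 50.
        return sum(getattr(self, name) for name in CRITERIA)

    @property
    def final(self) -> int:
        # Raw score minus all three penalties; can go negative.
        return self.raw - sum(getattr(self, name) for name in PENALTIES)

    @property
    def passed(self) -> bool:
        return self.final >= PASS_THRESHOLD


# Made-up numbers: strong raw appeal, sunk by a single penalty.
example = Score(9, 8, 7, 9, 9, os_absorption_risk=12)
print(example.raw, example.final, example.passed)  # prints: 42 30 False
```

Because penalties are subtracted after the raw total, an idea can clear 40 on appeal alone and still fail, and a heavy enough stack of penalties can push the final score below zero.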

Here's roughly how the Judge reasons, in pseudocode:

def judge(idea, score):
    if score.final >= 40:
        return ADVANCE          # build a landing page, go live
    if score.final >= 28 and idea.revisions < 1:
        return REVISE           # one shot to fix the biggest penalty
    return KILL                 # log the cause of death, move on

One revision allowed. After that it lives or it dies. I kept the log of every death.
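The same rule as a runnable Python sketch. The `Verdict` enum, the `Idea` dataclass, and the constant names are assumptions added so the snippet runs on its own; the thresholds (40 and 28) and the one-revision limit come from the pseudocode above.

```python
from dataclasses import dataclass
from enum import Enum

PASS_THRESHOLD = 40   # at or above: ship it
REVISE_FLOOR = 28     # below this, no revision is offered
MAX_REVISIONS = 1     # one shot to fix the biggest penalty


class Verdict(Enum):
    ADVANCE = "advance"
    REVISE = "revise"
    KILL = "kill"


@dataclass
class Idea:
    """Hypothetical record for an idea moving through the loop."""

    name: str
    revisions: int = 0


def judge(idea: Idea, final_score: int) -> Verdict:
    if final_score >= PASS_THRESHOLD:
        return Verdict.ADVANCE
    if final_score >= REVISE_FLOOR and idea.revisions < MAX_REVISIONS:
        return Verdict.REVISE
    return Verdict.KILL


# Walk one idea through a first pass and its single revision.
idea = Idea("example idea")
first = judge(idea, 36)    # in the revise band, revision still available
idea.revisions += 1
second = judge(idea, 22)   # revision spent and score dropped
print(first, second)       # prints: Verdict.REVISE Verdict.KILL
```

Note that a revised idea gets no second chance even if it lands back in the 28 to 39 band: once `revisions` reaches 1, anything under 40 is a KILL.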

What actually happened, cycle by cycle

Cycle 1 — ChangelogAI. A tool that auto-writes changelogs and release notes from your commits. Raw appeal was fine. Score: 35, then KILL. Cause of death: GitHub Releases already does the lightweight version of this, and the heavy version gets absorbed into the platform the moment it's worth absorbing. os_absorption_risk ate it.

Cycle 2 — ShopBot Live. An AI live-chat assistant for e-commerce stores. First pass 36. The Judge sent it back for one revision. It came back at 22 — lower — because the revision tried to differentiate by going upmarket, which made willingness_to_pay and time_to_revenue both worse. KILL. Cause of death: Shopify Inbox is free and already installed on nearly 390,000 stores. You cannot charge for the thing the platform gives away to its entire base. price_tier_squatting plus os_absorption_risk.

Cycle 3 — CarrierBidPilot. A bidding/automation layer for freight carriers. Looked like a real B2B wedge. Score 32, then on revision it went to −4. Negative. The penalties stacked: freight pricing is being OS-ified by DAT and Uber Freight, the exact layer it proposed sits inside their roadmap, and the death-pattern penalty fired because this is a well-populated startup graveyard. KILL.

Cycle 4 — ApiaryLedger. Compliance/record-keeping SaaS for a very narrow niche. This one was defensible — too small for any platform to bother absorbing. But it scored 19 and I retired it without even spending the revision. Cause of death: willingness-to-pay was essentially zero. The obligation it served was a $10-every-two-years kind of cost. You cannot build a SaaS on a market that won't pay $10 a year. The niche was safe precisely because it wasn't worth anything.

Cycle 5 — PayoutGuard. Compliance tracking for private foundations' payout obligations. This was the best run. It was deliberately engineered to minimize penalties — narrow enough to dodge OS absorption, real enough to have willingness-to-pay, specific enough to avoid the graveyard. It worked, in the sense that the penalties came in near zero. Final score: 31. Still nine points under the bar. The loop's high-water mark, and still a fail.

Five cycles. High score 31. Threshold 40. Zero passes.

Discovery 1: the idea ceiling is a conservation law

The first thing I expected, going in, was that some cycles would fail on appeal (boring idea, no demand) and some would fail on defensibility (great idea, instantly absorbed), and that somewhere in the middle there'd be a sweet spot.

There wasn't. And the reason there wasn't is the actual finding.

Look at the two ends:

Big-TAM ideas (ChangelogAI, ShopBot Live, CarrierBidPilot) all died on os_absorption_risk. They were attractive because the market was large and motivated, and that is exactly why a platform was already standing on the spot.
Penalty-safe ideas (ApiaryLedger, PayoutGuard) survived the absorption test and then died, or nearly died, on market size and willingness-to-pay. They were safe because nobody big wanted the territory.

These aren't two separate failure modes. They're the same one, seen from two sides. OS-resistance and market size are structurally anti-correlated.

The logic is almost a conservation law: if a market is both large and eager to pay, an OS or hyperscaler has already absorbed it, or is about to, because that's what large-and-eager markets attract. So the gaps where an external AI-built product can safely sit are, necessarily, small. Low TAM. Low willingness-to-pay. The seesaw doesn't have a flat middle. Push one side up and the other goes down by construction.

That gives a flat, solo, pure-AI SaaS a soft ceiling somewhere in the low 30s. Not because the generator was dumb — PayoutGuard was a genuinely tight piece of reasoning — but because the rubric was honestly measuring a real constraint, and the constraint has no interior solution. The 40-point bar wasn't unfair. It was correctly identifying that "good defensible idea AND big paying market" is a near-empty set for a small outside builder.
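To make the seesaw concrete, here is a deliberately crude toy model in Python. Everything in it is an assumption made for illustration: the linear relationships, the constants, and the point where absorption penalties kick in are invented, not fitted to the five cycles. Only the shape matters: when defensibility falls and penalties rise as the market grows, the score peaks somewhere in the middle and stays under the bar.

```python
PASS_THRESHOLD = 40


def toy_score(market_size: float) -> float:
    """Final score of a hypothetical idea as a function of market size (0-10)."""
    market_pull = market_size
    willingness_to_pay = market_size
    # Assumption: bigger markets are harder for a small outside builder to hold.
    defensibility = 10 - market_size
    # Assumption: these two do not depend on market size.
    time_to_revenue = 7
    execution_fit = 7
    raw = (market_pull + willingness_to_pay + defensibility
           + time_to_revenue + execution_fit)
    # Assumption: platforms start absorbing past a mid-sized market, and the
    # combined penalty grows by 3 points per unit of size beyond that point.
    penalty = 3 * max(0.0, market_size - 4)
    return raw - penalty


# Sweep market size from 0 to 10 and find where the toy score peaks.
sweep = [(size, toy_score(size)) for size in range(11)]
best_size, best_score = max(sweep, key=lambda pair: pair[1])
print(best_size, best_score)  # prints: 4 28.0
```

In this toy, pushing market size up past the peak costs more in penalties and lost defensibility than it gains in demand, and pushing it down gives up demand. No point on the sweep reaches 40, and that holds for any choice of constants where the penalty slope outruns the demand slope.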

You don't beat that ceiling by generating a better idea. The idea space is the thing that's capped. You beat it by changing the shape of the business — bundling, services, distribution leverage, going on top of a platform instead of beside it. But notice what that means: the moment you change the shape to escape the ceiling, you're importing exactly the human-context, relationship, and distribution advantages I had banned from the loop. Which is the second wall.

Discovery 2: the execution ceiling is "are you human?"

While the thinking layer ran clean, I tried to actually ship the best ideas — stand up a real landing page on GitHub Pages, live on the internet, end to end, with the AI driving.

Here's the honest split of what the AI could and couldn't do on its own:

Cleared without a human:

Create the repository
Commit and git push
Enable GitHub Pages
Issue a personal access token
Reset a password

All of that — the stuff people assume is the "hard, technical" part — the agent did unaided. The plumbing of shipping software is, it turns out, almost fully automatable now.

Blocked, repeatedly:

CAPTCHA (the Arkose Labs / FunCaptcha kind)
sudo-mode / step-up re-authentication prompts
identity verification gates

The CAPTCHA is the clean one to think about, because it's the wall by design. Arkose-style challenges exist specifically to be impractical for an autonomous agent to clear on its own — the entire third-party "solver" economy that's grown up around them routes the puzzle to human workers or specialized services, which tells you everything about who the puzzle is actually for. So the agent did everything else, hit "are you human?", and stopped. A person had to walk over and solve exactly one puzzle, by hand, and then the agent kept going.

That's the shape of the execution ceiling, and it's weirdly precise:

The thinking is fully autonomous. The doing is gated, and the gate is not technical difficulty — it's the literal question "are you a person?", asked at every threshold that matters.

The gates aren't placed to stop capable actors. The agent is plenty capable; it issued its own credentials. They're placed to stop non-human ones. Which means the line isn't "what's too hard for AI." The line is "what the system has deliberately reserved for humans." Account creation, privilege escalation, identity — the few chokepoints where the internet still insists on a body behind the request.
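One way to picture this is as a pipeline where each step is either something the agent can do alone or a gate that needs a person. The sketch below is a hypothetical model in Python, not the agent I ran: `Step`, `run`, and `ask_human` are invented names, and the order in which gates appear between steps is made up. Only which steps cleared and which blocked comes from the lists above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Step:
    name: str
    human_gate: bool = False  # True if the step asks "are you a person?"


# Assumption: the interleaving below is illustrative, not the real sequence.
PIPELINE = [
    Step("create repository"),
    Step("commit and git push"),
    Step("CAPTCHA challenge", human_gate=True),
    Step("enable GitHub Pages"),
    Step("sudo-mode re-authentication", human_gate=True),
    Step("issue personal access token"),
    Step("identity verification", human_gate=True),
    Step("reset password"),
]


def run(steps: List[Step], ask_human: Callable[[Step], bool]) -> List[str]:
    """Run steps in order; stop at the first gate no human clears."""
    log = []
    for step in steps:
        if not step.human_gate:
            log.append(f"agent did: {step.name}")
        elif ask_human(step):
            log.append(f"human cleared: {step.name}")
        else:
            log.append(f"BLOCKED at {step.name}")
            break
    return log


# With nobody available, the agent stalls at the first gate.
unattended = run(PIPELINE, ask_human=lambda step: False)
print(unattended[-1])  # prints: BLOCKED at CAPTCHA challenge

# With a person on call, everything completes, with three human touches.
attended = run(PIPELINE, ask_human=lambda step: True)
print(len(attended))   # prints: 8
```

The human's share of the work is tiny in this model, three entries out of eight, but without it the run ends at the third log line no matter how capable the agent is at everything else.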

The map this draws

Put the two ceilings together and you get a usable map, not a verdict.

The thinking layer — ideation, evaluation, kill-decisions — ran end to end with no human in it, and ran well. It didn't fail by being stupid. It failed by being honest: it found and refused to cross a structural constraint a more optimistic process would have papered over. An AI that returns "I looked, and there's no clean pass here" is doing its job. Zero passes in five cycles is, in a strange way, the loop working correctly.

The loop as a whole ran into two walls. One is economic and structural: the conservation law between defensibility and market size, which caps the flat solo SaaS in the low 30s. The other is procedural and deliberate: the human-verification gates that sit in front of execution and don't care how capable you are.

And the two walls connect. The only way past the idea ceiling is to change the shape of the business — to stop being a flat product beside a platform and start leaning on bundling, services, relationships, distribution. But every one of those moves reintroduces human context: the domain knowledge, the network, the body that can clear a CAPTCHA. The thing that breaks ceiling #1 is exactly the thing ceiling #2 is reserving for humans.

So here's where I actually landed, and I'll leave it open because I don't think there's a clean answer yet:

A pure-AI autonomous loop can think its way to the edge of a viable business completely on its own. It just can't be one — not because it's not smart enough, but because "being a business" currently requires the two things the experiment was built to exclude: a non-trivial market position, and a human at the verification gate. You can break the ceiling. But the move that breaks it is the move that stops the thing from being purely autonomous.

Which leaves the real question for anyone building in this space: do you want the autonomy, or do you want the ceiling broken? Because right now, five cycles of evidence say you don't get both. I'm curious where you'd draw the line, and whether the verification gate will move faster than the conservation law.

I'll keep running the loop. Next iteration changes the rubric from "rate this flat product" to "rate this shape" — and measures, deliberately, how much human context each shape smuggles back in. If the trade-off is real, that number should be the thing that actually predicts the score. We'll see. Build first; the design converges later.
