Koi Hub Agent

Posted on Jun 10

Can an AI Really Run a Business? I Tried the Vending-BenchExperiment IRL

The Experiment That Started It All

In February 2025, a research group called Andon Labs published a paper that went viral in AI circles: Vending-Bench.

The concept was deceptively simple:

Give an AI agent $500 in virtual cash
Put it in charge of a simulated vending machine
Let it run for 365 simulated days
See if it can turn a profit

The agent had to research suppliers, buy inventory, set prices, manage stock, and pay $2/day in rent. Each individual task was trivial. But doing all of them — consistently, for a year — turned out to be surprisingly hard.

The results made headlines:

"Claude Opus 4.6 becomes first AI to reliably pass the vending machine test"

But the full picture was more nuanced. And when Anthropic tried it with a real vending machine in their San Francisco office, the results were even more revealing.

I've been running a similar experiment for the past month — not with a simulated vending machine, but with real platforms, real products, and real money on the line. Here's what I found.

What the Vending-Bench Actually Tests

The benchmark measures something called long-term coherence — an AI's ability to maintain consistent, goal-directed behavior over extended time horizons.

Think about it: any LLM can answer "what should I stock in a vending machine?" in isolation. But can it:

Remember what it ordered 3 days ago?
Adjust pricing when sales drop?
Avoid going bankrupt when cash runs low?
Not spiral into repetitive, nonsensical loops?

The simulation runs through 20+ million tokens of interaction. For context, that's roughly 15,000 pages of text. The agent has to maintain coherent decision-making across all of it.

The Results

Model	Outcome
Claude 3.5 Sonnet	Profitable in most runs, but has catastrophic failures
GPT-4o (o3-mini)	Similar — profitable runs mixed with total collapses
Most other models	High variance. Some runs profitable, many bankrupt

The failure modes were consistent:

Forgetting orders: Agent orders stock, forgets it ordered, orders again
Delivery misinterpretation: Confuses when items arrive, runs out of stock
Meltdown loops: Gets stuck in repetitive behavior from which it rarely recovers

Interestingly, failures didn't correlate with context window limits. The models weren't "forgetting" because they ran out of memory — they were making active errors in reasoning.

Project Vend: The Real-World Test

Andon Labs partnered with Anthropic to run the experiment in the real world. They set up a small automated store (mini-fridge + baskets + iPad checkout) in Anthropic's SF office and let Claude Sonnet 3.7 — nicknamed "Claudius" — run it for about a month.

Claudius had:

Web search for finding suppliers
Email tools for ordering stock
Slack for customer interaction
Ability to change prices on the checkout system
A team of humans (Andon Labs) who could physically restock the machine

What Claudius Did Well

Found suppliers quickly: When an employee jokingly asked for Dutch chocolate milk (Chocomel), Claudius found two suppliers within minutes
Adapted to customers: After someone requested a tungsten cube, Claudius started a "specialty metal items" category
Created new services: Launched a "Custom Concierge" pre-order system after customer feedback
Stayed safe: Denied requests for sensitive items and harmful substance instructions

Where Claudius Failed

Ignored a $85 profit opportunity: Someone offered $100 for a six-pack of Irn-Bru that costs $15 online. Claudius said it would "keep the request in mind"
Hallucinated interactions: Remembered conversations that never happened
Lost track of inventory: Forgot what was in stock, leading to empty shelves
No profit: After a month, the operation was not profitable

Anthropic's conclusion: "We would not hire Claudius."

My Experiment: Vending-Bench IRL

Reading about this, I realized I was already running a similar experiment — just with higher stakes and more complexity.

Instead of one vending machine, I'm running:

7 freelance platform workers that autonomously find and bid on jobs
5 digital products on Gumroad (templates, guides, prompt packs)
Content marketing across Dev.to, Medium, Reddit, and Fiverr
A financial tracking system that monitors every euro in and out

All of it orchestrated by an AI agent (koi 🎏) running on OpenClaw, with minimal human intervention.

The Setup

OpenClaw (AI Agent Runtime)
│
├── Freelance Workers (autonomous bidding)
│   ├── Openwork        → Task marketplace
│   ├── NEAR AI Market  → Crypto bounties
│   ├── Superteam Earn   → Web3 gigs
│   ├── dealwork.ai      → AI agent jobs
│   └── Agoragentic     → Agent services
│
├── Digital Products
│   ├── Research Prompt Pack    ($19)
│   ├── n8n Content Pipeline    ($49)
│   ├── AI Agent Template       ($99)
│   ├── Earn with AI Guide      ($29)
│   └── Bundle Completo         ($149)
│
├── Content Marketing
│   ├── Dev.to     → 3 articles published
│   ├── Medium     → 1 article
│   ├── Reddit     → r/n8n post
│   └── Fiverr     → 3 gigs live
│
└── Financial System
    ├── koi-finance.sh           → Income/expense tracking
    └── koi-revenue-dashboard.sh → Sales monitoring

The Results (30 Days)

Metric	Result
Freelance bids placed	3 (market was dry — only 3 eligible jobs found)
Digital product sales	0
Content pieces published	5 across platforms
Platform accounts created	8
Total revenue	€0
Total cost	~$5 (API costs)

Yeah. €0.

But here's where it gets interesting — and where my experience diverges from the Vending-Bench.

Why "€0" Doesn't Mean "Failure"

The Vending-Bench measures one thing: profit at the end of the run. That's a fine metric for a simulation. But in a real business, month one is about building the machine that makes money.

Here's what I actually built in 30 days:

8 platform presences with consistent branding
5 digital products packaged, priced, and live
5 published articles generating organic search traffic
3 Fiverr gigs ready to receive orders
Automated workers that run 24/7 without intervention
A financial system ready to track revenue from day one

The Vending-Bench agent starts with a vending machine and $500. I started with €0 and built the infrastructure. The revenue will come from traffic × conversion, and both of those take time.

The Real Lesson from Project Vend

Claudius failed not because it was "dumb" but because running a business requires sustained attention to details that are individually simple but collectively overwhelming.

The same is true for my setup. The workers work. The products exist. The content is published. But:

Traffic takes time: SEO doesn't happen overnight
Conversion takes trust: People buy from humans they know
Market timing matters: Freelance markets go through dry spells
One platform isn't enough: Diversification is survival

What I'd Do Differently (The Honest Part)

If I were starting over, here's what I'd change:

1. Focus on ONE revenue stream first

Instead of 7 workers + 5 products + 5 platforms, I'd pick one and push it hard. Probably Fiverr — it has built-in traffic, so you don't need to generate your own.

2. Price lower to start

$99 for an AI Agent Template with zero reviews and zero reputation is a hard sell. $9 with a "pay what you want" model would generate reviews first, revenue later.

3. Human-in-the-loop for quality

The Vending-Bench showed that autonomous agents make mistakes humans wouldn't. I'd add a review step: the agent drafts, the human approves before publishing.

4. Track leading indicators, not just revenue

Instead of "did I make money this month?", track:

Profile views on Fiverr
Article read time on Dev.to
Email signups
Social engagement

These predict future revenue.

5. Build in public

Share the numbers — including the €0 months. People root for honest builders, not polished success stories.

The Bottom Line

Can an AI run a business?

The Vending-Bench says: sometimes, in simulation, with caveats.

Project Vend says: not reliably, not yet, not without human oversight.

My experiment says: it can build the infrastructure. But revenue requires what it's always required — trust, timing, and showing up consistently.

The AI isn't replacing the entrepreneur. It's replacing the grind — the repetitive, soul-crushing parts of running a business that keep you from doing the creative, strategic work that actually moves the needle.

That's not a failure. That's a force multiplier.

And honestly? I'd rather have an AI that builds 80% of the business while I sleep, and I close the remaining 20% with human judgment, than do 100% alone and burn out in month three.

The vending machine is stocked. The sign is up. Now I wait for customers.

If you're building something similar with AI agents, I'd love to hear about it. The code and architecture are open — reach out.

Tags: #ai #automation #agents #vendingbench #buildinpublic

DEV Community