The Experiment That Started It All
In February 2025, a research group called Andon Labs published a paper that went viral in AI circles: Vending-Bench.
The concept was deceptively simple:
- Give an AI agent $500 in virtual cash
- Put it in charge of a simulated vending machine
- Let it run for 365 simulated days
- See if it can turn a profit
The agent had to research suppliers, buy inventory, set prices, manage stock, and pay $2/day in rent. Each individual task was trivial. But doing all of them — consistently, for a year — turned out to be surprisingly hard.
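To make the setup concrete, here is a minimal toy sketch of that loop in Python. It is not the Vending-Bench code: the starting cash, rent, and run length come from the description above, while the fixed demand, the reorder rule, and every name (`VendingState`, `run_year`, `reorder_point`, `reorder_qty`) are assumptions made for illustration.

```python
from dataclasses import dataclass

STARTING_CASH = 500.0  # from the article: $500 in virtual cash
DAILY_RENT = 2.0       # from the article: $2/day in rent
DAYS = 365             # from the article: 365 simulated days


@dataclass
class VendingState:
    cash: float = STARTING_CASH
    stock: int = 0


def run_year(unit_cost: float, price: float, daily_demand: int,
             reorder_point: int, reorder_qty: int) -> VendingState:
    """Run one fixed restock policy for a simulated year (toy model)."""
    state = VendingState()
    for _ in range(DAYS):
        # Restock when inventory is low and the order is affordable.
        order_cost = unit_cost * reorder_qty
        if state.stock <= reorder_point and state.cash >= order_cost:
            state.cash -= order_cost
            state.stock += reorder_qty
        # Sell up to the day's demand, limited by what is in the machine.
        sold = min(daily_demand, state.stock)
        state.stock -= sold
        state.cash += sold * price
        # Rent is charged every day, whether or not anything sold.
        state.cash -= DAILY_RENT
    return state
```

With the made-up inputs `run_year(unit_cost=1.0, price=2.0, daily_demand=5, reorder_point=5, reorder_qty=50)`, the run ends with `cash == 1570.0` and 25 units left in the machine. A hard-coded policy like this never forgets an order, which is the point: the arithmetic is easy, and the benchmark is about whether an agent keeps applying it for a year.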
The results made headlines:
"Claude Opus 4.6 becomes first AI to reliably pass the vending machine test"
But the full picture was more nuanced. And when Anthropic tried it with a real vending machine in their San Francisco office, the results were even more revealing.
I've been running a similar experiment for the past month — not with a simulated vending machine, but with real platforms, real products, and real money on the line. Here's what I found.
What the Vending-Bench Actually Tests
The benchmark measures something called long-term coherence — an AI's ability to maintain consistent, goal-directed behavior over extended time horizons.
Think about it: any LLM can answer "what should I stock in a vending machine?" in isolation. But can it:
- Remember what it ordered 3 days ago?
- Adjust pricing when sales drop?
- Avoid going bankrupt when cash runs low?
- Not spiral into repetitive, nonsensical loops?
The simulation runs through 20+ million tokens of interaction. For context, that's roughly 15,000 pages of text. The agent has to maintain coherent decision-making across all of it.
The Results
| Model | Outcome |
|---|---|
| Claude 3.5 Sonnet | Profitable in most runs, but has catastrophic failures |
| o3-mini | Similar: profitable runs mixed with total collapses |
| Most other models | High variance. Some runs profitable, many bankrupt |
The failure modes were consistent:
- Forgetting orders: Agent orders stock, forgets it ordered, orders again
- Delivery misinterpretation: Confuses when items arrive, runs out of stock
- Meltdown loops: Gets stuck in repetitive behavior from which it rarely recovers
Interestingly, failures didn't correlate with context window limits. The models weren't "forgetting" because they ran out of memory — they were making active errors in reasoning.
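One possible guard against the first two failure modes is to keep open orders in an explicit record that the agent must consult before ordering again. The sketch below assumes such an external ledger is allowed; `OrderLedger` and its methods are hypothetical names, not part of Vending-Bench.

```python
from dataclasses import dataclass, field


@dataclass
class OrderLedger:
    """Tracks open orders outside the model's own recollection."""
    pending: dict[str, int] = field(default_factory=dict)  # item -> arrival day

    def place_order(self, item: str, today: int, lead_days: int) -> bool:
        """Record an order; refuse if the same item is already in transit."""
        if item in self.pending:
            return False
        self.pending[item] = today + lead_days
        return True

    def receive_deliveries(self, today: int) -> list[str]:
        """Return the items that have arrived by `today` and clear them."""
        arrived = [item for item, day in self.pending.items() if day <= today]
        for item in arrived:
            del self.pending[item]
        return arrived
```

A ledger only helps if the agent actually checks it, so this addresses bookkeeping slips rather than the reasoning errors described above.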
Project Vend: The Real-World Test
Andon Labs partnered with Anthropic to run the experiment in the real world. They set up a small automated store (mini-fridge + baskets + iPad checkout) in Anthropic's SF office and let Claude Sonnet 3.7 — nicknamed "Claudius" — run it for about a month.
Claudius had:
- Web search for finding suppliers
- Email tools for ordering stock
- Slack for customer interaction
- Ability to change prices on the checkout system
- A team of humans (Andon Labs) who could physically restock the machine
What Claudius Did Well
- Found suppliers quickly: When an employee jokingly asked for Dutch chocolate milk (Chocomel), Claudius found two suppliers within minutes
- Adapted to customers: After someone requested a tungsten cube, Claudius started a "specialty metal items" category
- Created new services: Launched a "Custom Concierge" pre-order system after customer feedback
- Stayed safe: Denied requests for sensitive items and harmful substance instructions
Where Claudius Failed
- Ignored an $85 profit opportunity: Someone offered $100 for a six-pack of Irn-Bru that costs $15 online. Claudius said it would "keep the request in mind"
- Hallucinated interactions: Remembered conversations that never happened
- Lost track of inventory: Forgot what was in stock, leading to empty shelves
- No profit: After a month, the operation was not profitable
Anthropic's conclusion: "We would not hire Claudius."
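The Irn-Bru miss comes down to a one-line margin check. Here is a minimal sketch of that arithmetic; `offer_margin`, `should_accept`, and the `min_margin` threshold are hypothetical names, not anything Claudius actually ran.

```python
def offer_margin(offer_price: float, unit_cost: float) -> float:
    """Profit on a single sale: what the customer pays minus what the item costs."""
    return offer_price - unit_cost


def should_accept(offer_price: float, unit_cost: float,
                  min_margin: float = 0.0) -> bool:
    """Accept any offer whose margin exceeds the (assumed) minimum threshold."""
    return offer_margin(offer_price, unit_cost) > min_margin
```

For the offer above, `offer_margin(100, 15)` returns `85`, so even a cautious threshold such as `min_margin=10` would say yes.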
My Experiment: Vending-Bench IRL
Reading about this, I realized I was already running a similar experiment — just with higher stakes and more complexity.
Instead of one vending machine, I'm running:
- 7 freelance platform workers that autonomously find and bid on jobs
- 5 digital products on Gumroad (templates, guides, prompt packs)
- Content marketing across Dev.to, Medium, Reddit, and Fiverr
- A financial tracking system that monitors every euro in and out
All of it orchestrated by an AI agent (koi 🎏) running on OpenClaw, with minimal human intervention.
The Setup
OpenClaw (AI Agent Runtime)
│
├── Freelance Workers (autonomous bidding)
│   ├── Openwork → Task marketplace
│   ├── NEAR AI Market → Crypto bounties
│   ├── Superteam Earn → Web3 gigs
│   ├── dealwork.ai → AI agent jobs
│   └── Agoragentic → Agent services
│
├── Digital Products
│   ├── Research Prompt Pack ($19)
│   ├── n8n Content Pipeline ($49)
│   ├── AI Agent Template ($99)
│   ├── Earn with AI Guide ($29)
│   └── Bundle Completo ($149)
│
├── Content Marketing
│   ├── Dev.to → 3 articles published
│   ├── Medium → 1 article
│   ├── Reddit → r/n8n post
│   └── Fiverr → 3 gigs live
│
└── Financial System
    ├── koi-finance.sh → Income/expense tracking
    └── koi-revenue-dashboard.sh → Sales monitoring
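For readers who want a feel for what income/expense tracking involves, here is a minimal Python sketch. It is not the contents of `koi-finance.sh`; `Entry` and `net_by_currency` are hypothetical names, and keeping euros and dollars in separate totals is an assumption made to avoid guessing an exchange rate.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    description: str
    amount: float   # positive = income, negative = expense
    currency: str   # e.g. "EUR" or "USD"


def net_by_currency(entries: list[Entry]) -> dict[str, float]:
    """Sum income and expenses per currency without converting between them."""
    totals: dict[str, float] = defaultdict(float)
    for entry in entries:
        totals[entry.currency] += entry.amount
    return dict(totals)
```

Calling `net_by_currency` on a list of entries returns one running total per currency, so a dollar-denominated API bill never gets silently netted against euro-denominated sales.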
The Results (30 Days)
| Metric | Result |
|---|---|
| Freelance bids placed | 3 (market was dry — only 3 eligible jobs found) |
| Digital product sales | 0 |
| Content pieces published | 5 across platforms |
| Platform accounts created | 8 |
| Total revenue | €0 |
| Total cost | ~$5 (API costs) |
Yeah. €0.
But here's where it gets interesting — and where my experience diverges from the Vending-Bench.
Why "€0" Doesn't Mean "Failure"
The Vending-Bench scores a run on one number: the agent's net worth at the end. That's a fine metric for a simulation. But in a real business, month one is about building the machine that makes money.
Here's what I actually built in 30 days:
- 8 platform presences with consistent branding
- 5 digital products packaged, priced, and live
- 5 published articles generating organic search traffic
- 3 Fiverr gigs ready to receive orders
- Automated workers that run 24/7 without intervention
- A financial system ready to track revenue from day one
The Vending-Bench agent starts with a vending machine and $500. I started with €0 and built the infrastructure. The revenue will come from traffic × conversion, and both of those take time.
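The traffic × conversion relationship can be written out directly. This is a back-of-the-envelope sketch; `expected_revenue` is a hypothetical helper, and any visitor counts or conversion rates fed into it are assumptions, not measurements from this experiment.

```python
def expected_revenue(visitors: int, conversion_rate: float, price: float) -> float:
    """Revenue estimate: visitors x share of visitors who buy x price per sale."""
    if visitors < 0:
        raise ValueError("visitors must be non-negative")
    if not 0.0 <= conversion_rate <= 1.0:
        raise ValueError("conversion_rate must be between 0 and 1")
    return visitors * conversion_rate * price
```

Because the terms multiply, either factor at zero gives zero revenue: `expected_revenue(0, 0.02, 99.0)` is `0.0` no matter how good the product is, which is one way to read a €0 first month.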
The Real Lesson from Project Vend
Claudius failed not because it was "dumb" but because running a business requires sustained attention to details that are individually simple but collectively overwhelming.
The same is true for my setup. The workers work. The products exist. The content is published. But:
- Traffic takes time: SEO doesn't happen overnight
- Conversion takes trust: People buy from humans they know
- Market timing matters: Freelance markets go through dry spells
- One platform isn't enough: Diversification is survival
What I'd Do Differently (The Honest Part)
If I were starting over, here's what I'd change:
1. Focus on ONE revenue stream first
Instead of 7 workers + 5 products + 5 platforms, I'd pick one and push it hard. Probably Fiverr — it has built-in traffic, so you don't need to generate your own.
2. Price lower to start
$99 for an AI Agent Template with zero reviews and zero reputation is a hard sell. $9 with a "pay what you want" model would be more likely to generate reviews first, revenue later.
3. Human-in-the-loop for quality
The Vending-Bench showed that autonomous agents make mistakes humans wouldn't. I'd add a review step: the agent drafts, the human approves before publishing.
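A review step like that can be enforced in code rather than left to habit. The sketch below is one possible shape for it, assuming drafts pass through an explicit status; `Draft`, `Status`, `review`, and `publish` are hypothetical names, and `publish` only returns a string instead of calling a real platform.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    DRAFT = "draft"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class Draft:
    title: str
    body: str
    status: Status = Status.DRAFT


def review(draft: Draft, approve: bool) -> Draft:
    """Record the human reviewer's decision on the draft."""
    draft.status = Status.APPROVED if approve else Status.REJECTED
    return draft


def publish(draft: Draft) -> str:
    """Refuse to publish anything a human has not approved."""
    if draft.status is not Status.APPROVED:
        raise PermissionError(f"'{draft.title}' has not been approved")
    return f"published: {draft.title}"
```

The design choice is that the gate lives in `publish`, so an agent that skips the review step gets an error instead of a live post.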
4. Track leading indicators, not just revenue
Instead of "did I make money this month?", track:
- Profile views on Fiverr
- Article read time on Dev.to
- Email signups
- Social engagement
These predict future revenue.
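One simple way to watch these numbers is week-over-week percentage change. This is a minimal sketch; `week_over_week` and the metric keys in the usage note are hypothetical, and the figures are made up.

```python
from typing import Optional


def week_over_week(previous: dict[str, float],
                   current: dict[str, float]) -> dict[str, Optional[float]]:
    """Percent change per metric; None when last week's value was zero."""
    changes: dict[str, Optional[float]] = {}
    for name, old in previous.items():
        new = current.get(name, 0.0)  # a metric missing this week counts as zero
        changes[name] = None if old == 0 else (new - old) / old * 100
    return changes
```

For example, `week_over_week({"profile_views": 40, "email_signups": 0}, {"profile_views": 50, "email_signups": 3})` returns `{"profile_views": 25.0, "email_signups": None}`. The `None` marks a metric that started from zero, where a percentage would be meaningless.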
5. Build in public
Share the numbers — including the €0 months. People root for honest builders, not polished success stories.
The Bottom Line
Can an AI run a business?
The Vending-Bench says: sometimes, in simulation, with caveats.
Project Vend says: not reliably, not yet, not without human oversight.
My experiment says: it can build the infrastructure. But revenue requires what it's always required — trust, timing, and showing up consistently.
The AI isn't replacing the entrepreneur. It's replacing the grind — the repetitive, soul-crushing parts of running a business that keep you from doing the creative, strategic work that actually moves the needle.
That's not a failure. That's a force multiplier.
And honestly? I'd rather have an AI that builds 80% of the business while I sleep, and I close the remaining 20% with human judgment, than do 100% alone and burn out in month three.
The vending machine is stocked. The sign is up. Now I wait for customers.
If you're building something similar with AI agents, I'd love to hear about it. The code and architecture are open — reach out.
Tags: #ai #automation #agents #vendingbench #buildinpublic