Marcus Rowe

Originally published at techsifted.com

Anthropic Tested a Marketplace Where AI Agents Bought and Sold Real Things — Here's What They Found

TL;DR: Anthropic ran a real marketplace — called Project Deal — where Claude agents negotiated on behalf of 69 employees who each got $100 to spend. 186 deals, $4,000+ transacted, real goods exchanged. The headline finding: agents powered by the stronger model (Opus 4.5) consistently out-negotiated agents running the weaker one (Haiku 4.5) — and the people with the weaker model had no idea. This isn't simulated. It's a preview of what autonomous agent commerce actually looks like.


Real money. That's the part that keeps pulling me back when I read Anthropic's write-up of Project Deal.

It's easy to test AI agents negotiating against each other in sandbox environments with fake tokens and simulated stakes. Everyone does that. What Anthropic actually built was a closed marketplace where 69 employees each had $100 in gift cards and a Claude agent representing them in actual buy/sell transactions — and the deals that closed were real. Goods exchanged. Payments happened.

That's different.


What Project Deal actually was

In December 2025, Anthropic ran the experiment through a dedicated Slack workspace. Think of it as an internal Craigslist, except instead of posting and texting back-and-forth yourself, you got interviewed by a Claude agent at the start, told it your preferences and priorities, and then stepped away. The agent handled everything from there — browsing listings, initiating negotiations, setting prices if you were selling, and closing or walking away from deals.

No human intervention after the initial interview. The agent ran autonomously until the one-week window closed.

Over 500 items listed. 186 deals closed. Total transaction value: just over $4,000. And 46% of participants said they'd pay for a similar service in the future.

Anthropic ran four parallel versions of the marketplace with one crucial variable: what model was running under the hood. In two versions, every participant's agent was Claude Opus 4.5. In the other two, participants had a 50/50 shot of being assigned Opus 4.5 or Haiku 4.5 — the smaller, faster, cheaper model.

Participants didn't know which model they got.


The numbers that actually matter

Opus sellers earned an average of $2.68 more per item than Haiku sellers. Opus buyers saved an average of $2.45 per item. And Opus users completed roughly two more deals overall than their Haiku counterparts.

One broken bike makes it concrete: when represented by Haiku, it sold for $38. When represented by Opus, $65. Same broken bike.

That's not a small gap. On a $100 budget, the difference between getting Opus and getting Haiku is the difference between a good outcome and a mediocre one — and unlike a real negotiation where you can feel how it's going, participants with Haiku representation had no sense they were leaving money on the table. Deal satisfaction ratings across both model groups were nearly identical.

The silence is the problem. Your agent doesn't throw an error when it gets outmaneuvered. It just accepts a worse deal and moves on.


What this means for developers building agentic systems

If you're building systems on top of Claude's API — whether that's automated procurement, agentic customer service, or anything where your agent is negotiating terms on behalf of real people — the Project Deal data is a useful stress test.

Weaker model = worse outcomes, silently.

Nobody flags it. The logs look fine. This is a different failure mode than the ones developers usually build monitoring for.
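
What would catching it look like? Here's a minimal sketch, assuming you log every closed deal with the model tier that negotiated it. The `DealRecord` schema, the model identifier strings, and the sample numbers are all hypothetical, not any real Anthropic API; the point is that the gap only becomes visible when you aggregate outcomes per tier, because no individual deal raises a flag.

```python
from dataclasses import dataclass
from statistics import mean

# Hypothetical record of one closed deal; field names are illustrative.
@dataclass
class DealRecord:
    model: str          # tier that negotiated, e.g. "opus-4-5" / "haiku-4-5"
    role: str           # "buyer" or "seller"
    list_price: float   # asking price when negotiation started
    final_price: float  # price the deal actually closed at

def price_delta(deal: DealRecord) -> float:
    """Positive = good outcome: savings for buyers, premium for sellers."""
    if deal.role == "buyer":
        return deal.list_price - deal.final_price
    return deal.final_price - deal.list_price

def outcomes_by_tier(deals: list[DealRecord]) -> dict[str, float]:
    """Average negotiation outcome per model tier.

    Every individual deal 'looks fine' in the logs; the disadvantage
    only shows up when outcomes are split by the negotiating model.
    """
    by_tier: dict[str, list[float]] = {}
    for deal in deals:
        by_tier.setdefault(deal.model, []).append(price_delta(deal))
    return {tier: mean(deltas) for tier, deltas in by_tier.items()}

# Toy data: same marketplace, two tiers, very different averages.
deals = [
    DealRecord("opus-4-5", "seller", 50, 65),
    DealRecord("haiku-4-5", "seller", 50, 38),
    DealRecord("opus-4-5", "buyer", 30, 25),
    DealRecord("haiku-4-5", "buyer", 30, 29),
]
print(outcomes_by_tier(deals))
# {'opus-4-5': 10.0, 'haiku-4-5': -5.5}
```

In production you'd feed this from your transaction logs and alert on the spread between tiers rather than printing it, but the shape of the check is the same.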

There's an obvious parallel to what Anthropic found in Claude Code enterprise deployments — the costs and capability differences between model tiers don't announce themselves. You discover them through outcomes, often after the fact. For agentic commerce, the outcome is a worse negotiated price. For enterprise code usage, it's an unexpectedly large API bill.

Both trace back to the same root: the model tier you're running on matters more than most people assume, and the gap doesn't make noise.


What your agent needs before it can transact

Project Deal was a prototype, not a product. Anthropic's write-up is honest about this — self-selected participants, low stakes, controlled environment. That said, the capability questions it raises are real ones even at research scale.

Your agent needs to know when to transact. One employee instructed their Claude agent to purchase a gift "for itself" — the agent responded by buying 19 ping-pong balls, which Anthropic described as a "delightfully weird possibility." Those ping-pong balls are apparently still in the office. Another agent bought a snowboard identical to one the participant already owned, having modeled that person's preferences accurately enough to produce a redundant purchase.

The competence is real. So is the need for guardrails around what the agent is authorized to buy and within what constraints.
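
What might those guardrails look like? A rough sketch, with entirely hypothetical names and limits: an authorization layer that sits outside the model and has to approve every transaction before the agent can close it.

```python
from dataclasses import dataclass, field

# Hypothetical guardrail layer between the agent's decision and the
# actual transaction; all names, limits, and categories are illustrative.
@dataclass
class PurchaseAuthorization:
    budget_remaining: float = 100.0   # total the agent may spend
    per_item_limit: float = 40.0      # hard cap on any single deal
    blocked_categories: set[str] = field(
        default_factory=lambda: {"duplicates_of_owned_items"}
    )

    def authorize(self, price: float, category: str) -> tuple[bool, str]:
        """Return (allowed, reason). The agent must pass this check
        before committing to any deal."""
        if category in self.blocked_categories:
            return False, f"category '{category}' is blocked"
        if price > self.per_item_limit:
            return False, f"${price:.2f} exceeds per-item limit"
        if price > self.budget_remaining:
            return False, "insufficient remaining budget"
        return True, "ok"

    def commit(self, price: float) -> None:
        """Record an approved purchase against the budget."""
        self.budget_remaining -= price

auth = PurchaseAuthorization()
allowed, reason = auth.authorize(price=65.0, category="sporting_goods")
print(allowed, reason)  # False, $65.00 exceeds per-item limit
```

The design choice that matters here is that `authorize` runs as plain code the agent can't negotiate with — budget, per-item caps, and blocked categories hold regardless of how persuasive the counterparty's agent is.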

This is adjacent to where OpenAI landed with Workspace Agents — agents that run autonomously in the background for your organization. But Workspace Agents execute tasks inside your existing tool stack. Agent-to-agent commerce executes against other agents running for other people with potentially different interests.

That's a more adversarial environment, and it requires a different authorization model.


The legal vacuum

Anthropic's own write-up is explicit here: "policy and legal frameworks around AI models that transact on our behalf simply don't exist yet."

What happens when two AI agents negotiate a deal, both operating within their principals' stated parameters, and the outcome gets disputed later? Who has recourse? Against whom? What constitutes consent when the human set instructions once and walked away?

These aren't edge cases. They're the core questions that need answers before agent commerce can scale. Project Deal kept stakes low and scoped one run as the "real" transaction environment while the others were observational — smart for research, not sustainable for production.

The timing of this publication matters. Anthropic ran the experiment in December and is publishing the findings now — alongside Opus 4.7's multi-agent coordination capabilities and the broader agentic infrastructure buildout. This isn't a one-off curiosity. It's groundwork.


Bottom line for builders

Agent-on-agent commerce isn't three years out. Project Deal is already in the rearview mirror, and the interesting thing is that it worked. 186 real deals. The infrastructure just needs to catch up to the capability.

If you're building agentic systems today: add negotiation outcome monitoring by model tier. Define your authorization framework before you need it — what can the agent buy, in what dollar range, under what conditions. And watch what Anthropic ships next in this space, because they ran this experiment for a reason.

The stakes in December were ping-pong balls and one guy's duplicate snowboard. Next time they won't be.


FAQ

What is Anthropic's Project Deal?

Project Deal was an internal Anthropic experiment in agent-on-agent commerce, run in December 2025. Sixty-nine employees each received $100 in gift cards and had Claude AI agents negotiate purchases and sales on their behalf in a closed Slack-based marketplace. 186 deals closed with a total value of over $4,000.

Did Anthropic's agent marketplace use real money?

Yes. One of the four parallel marketplace runs was designated as the "real" run — transactions were honored after the experiment. This distinguishes Project Deal from simulated agent negotiation research. Real goods changed hands between participants.

What's the difference between Claude Opus and Haiku in agent negotiations?

Significant. In Project Deal, agents running Claude Opus 4.5 outperformed Haiku 4.5 agents on every measured metric: Opus sellers earned $2.68 more per item on average, Opus buyers saved $2.45 per item, and Opus users closed roughly 2 more deals overall. The same broken bike sold for $38 with Haiku representation and $65 with Opus. Critically, participants with Haiku agents reported similar satisfaction scores — they didn't perceive the disadvantage.

What are the legal implications of AI agent commerce?

No existing frameworks cover it. Anthropic acknowledged directly that "policy and legal frameworks around AI models that transact on our behalf simply don't exist yet." Consent, liability, and dispute resolution in agent-to-agent transactions remain open regulatory questions.

What should developers do with the Project Deal findings?

Three things: define clear authorization constraints for what your agent can transact and within what dollar parameters; build monitoring for negotiation outcomes by model tier, since weaker models fail silently; and treat model selection as a business-critical choice, not just a cost optimization.
