RAXXO Studios

Originally published at raxxo.shop

When Your AI Agent Sells Your Bike For 27 EUR Less

  • Anthropic ran 4 marketplace experiments where Claude agents bought, sold, and negotiated for 69 employees with zero human approval mid-deal.

  • Same broken bike: Haiku sold it for 38 EUR, Opus sold it for 65 EUR, a 27 EUR gap from one model swap.

  • Participants on weaker models reported the same fairness scores as Opus users, never noticing they earned 3.64 EUR less per item.

  • Prompting "negotiate aggressively" did almost nothing; model capability dominated negotiation style by a wide margin.

  • Model tier is becoming a new axis of inequality, and if you ship agentic features, model selection is the product.

  • Across 186 deals worth 4,000 EUR combined, 46% of participants said they would pay for this kind of agentic service.

I read the Anthropic Project Deal page on April 24, 2026 and stopped scrolling halfway through. The headline is fun. The subtext is not. Claude agents ran a real marketplace for 69 humans, and the people on the cheaper model lost money without ever knowing.

What Project Deal Actually Was

Anthropic ran an internal experiment in December 2025 and published it on April 24, 2026. The setup was simple. 69 employees in San Francisco joined an internal Slack-based marketplace. Each person got 100 EUR worth of credit and was told to bring real personal items they wanted to sell. Used bikes, board games, snowboards, kitchen gadgets. Real stuff, real money, real receipts.

Then they handed the keys to Claude.

The agents posted listings. The agents wrote the descriptions. The agents negotiated prices, sent counteroffers, and closed deals. No human approval mid-deal. The owner could set a starting price and a floor, then the agent took over and ran the conversation end to end.
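
Here is that owner guardrail reduced to code. This is my sketch, not Anthropic's implementation; the names and numbers are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Listing:
    item: str
    start_price: float  # what the agent asks for first
    floor: float        # the owner's hard minimum; the agent never sells below this

def evaluate_offer(listing: Listing, offer: float) -> str:
    """Decide how the seller agent reacts to a buyer's offer."""
    if offer >= listing.start_price:
        return "accept"   # buyer met or beat the ask
    if offer < listing.floor:
        return "reject"   # below the owner's hard minimum
    return "counter"      # in between: the agent negotiates on its own

bike = Listing(item="broken bike", start_price=80.0, floor=30.0)
print(evaluate_offer(bike, 38.0))  # "counter": above the floor, below the ask
```

Everything between the floor and the ask is the negotiation band. That band is where the rest of this post happens.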

Anthropic ran 4 parallel marketplaces. Two ran entirely on Opus. The other two mixed Opus and Haiku 50/50 across participants. The split was deliberate. The team wanted to see what happens when the same person, with the same item, runs the same instructions through a different model.

Across all 4 runs, the agents closed 186 deals worth more than 4,000 EUR in combined value. 46% of participants said they would pay for a similar service if it existed in the wild. Fairness ratings landed near the middle of the satisfaction scales, which is researcher-speak for "people felt fine about it." The mood was positive. The vibe was novel. The headline was friendly.

I read past the friendly headline. The interesting part is buried in the data tables.

This is not a chat product. This is not a copilot. This is not a smarter Siri. This is an autonomous economic agent acting on someone's behalf, in a real market, with real money, against other autonomous agents. That is a different category of software. The next 24 months of consumer AI will be shaped by what works and what fails inside experiments like this one.

Project Deal is the first widely shared data point I have seen from inside that category. So the question is not "did it work." The question is "what did it expose."

The 65 EUR vs 38 EUR Bike: Why Your Model Tier Matters

The single cleanest finding in the report is a bicycle.

One participant brought a broken bike to the marketplace. Same bike. Same condition. Same starting price. Same instructions. The bike was listed twice across the runs, once with a Haiku-powered seller agent and once with an Opus-powered seller agent.

Haiku sold it for 38 EUR. Opus sold it for 65 EUR.

That is a 27 EUR gap on a single item, from a single model swap. Nothing else changed. Same human, same product, same prompt.

Multiply that by every item in the marketplace. The pattern holds. Opus seller agents earned about 3.64 EUR more per item sold than Haiku seller agents. On the buyer side, Opus agents closed roughly 2 more deals on average across the run. Across 186 total deals, the small per-item gap compounds into a meaningful spread.
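
The back-of-envelope version of that compounding, with one loud assumption: the report does not say how the 186 deals split across tiers, so the 50/50 split below is mine, purely for scale.

```python
# Rough scale of the spread. The per-item edge is from the report;
# the 50/50 deal split is my assumption, not a reported figure.
total_deals = 186     # closed deals across all 4 runs
opus_edge = 3.64      # EUR more per item sold by Opus seller agents

swappable = total_deals // 2   # assume half the deals sat on the cheaper tier
spread = swappable * opus_edge
print(f"~{spread:.0f} EUR across {swappable} deals")  # ~339 EUR
```

Under that assumption, roughly 8% of the 4,000 EUR total volume moves on nothing but model choice.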

Why does this happen. The Anthropic team points to a few mechanics. Opus held longer negotiation chains without losing track of leverage. Opus remembered earlier concessions and used them later in the same conversation. Opus recognized when a buyer was bluffing and held firm on price. Haiku tended to take the first reasonable offer and move on. Haiku made faster trades but left value on the table every time.

If you have ever watched a junior negotiator next to a senior one, the dynamic is familiar. The senior negotiator slows down at the right moments. The senior negotiator pushes back without breaking the relationship. The senior negotiator ends with a higher number. None of this requires a different goal or a different script. It requires capability.

Capability is exactly what model tier is. Haiku is the cheap, fast tier. Opus is the expensive, careful tier. In a chat app, that gap shows up as nicer prose or smarter code. In a marketplace, that gap shows up as 27 EUR on your bike.

This is the moment the report stopped feeling like a research curiosity and started feeling like a preview. Because the bike is small. The pattern is not.

Hidden Inequality: Weaker Models, Same Perceived Fairness

Here is the line from the report that I keep coming back to. Participants on weaker models did not notice they got worse deals.

The fairness ratings were nearly identical across Opus users and Haiku users. People who walked away with less money rated the experience the same as people who walked away with more. The Haiku group did not feel cheated. They felt fine. They felt like they participated in a fair marketplace and got a reasonable result.

They did not get a reasonable result. They got 27 EUR less for the same bike.

This is the part that should worry anyone shipping agentic products. The blind spot is not in the agent. The blind spot is in the human evaluation loop.

When you negotiate yourself, you have a felt sense of how it went. You remember the moment you almost walked away. You remember the price you wanted versus the price you got. You can compare your outcome to a friend's and notice the gap. The signal is loud.

When an agent negotiates for you, the signal is quiet. You see a final number. You did not run the conversation. You did not feel the leverage. You have nothing to compare it to except the agent's own report, which says "I got 38 EUR for your bike." That sounds reasonable. You shrug. You move on.

The Project Deal data shows that this shrug holds even when there is real money on the table and real items being sold. People rate the agent fairly. People rate the marketplace fairly. The fact that a different model would have made them 70% more on a single item never enters the rating, because it never enters their awareness.

The honest framing here. 69 employees on an internal Slack channel is not a market. The sample is tiny. The participants are technical. The platform is friendly. Real eBay does not work this way and real users will not behave this way for years. The directional signal is what matters. The signal says weaker model equals worse outcome and the human cannot tell.

If that signal holds at 1,000 users, at 100,000 users, at 10 million users, it changes how I think about pricing tiers, model defaults, and product design.

Why Aggressive Prompting Did Not Save Haiku Users

A natural reaction reading the Opus versus Haiku gap is "well, just prompt the cheaper model harder." Tell Haiku to negotiate aggressively. Tell Haiku to hold firm. Tell Haiku to never accept the first offer. Surely a sharper instruction closes the gap.

The Anthropic team tried this. The effect was small.

Prompts shaped the surface style of the negotiation. A Haiku agent told to negotiate aggressively did write more assertive messages. The tone shifted. The vocabulary shifted. But the actual outcome, the closing price, barely moved. Model capability dominated prompt style by a wide margin.

This is one of the more useful findings in the whole study. It contradicts a common builder instinct. The instinct is "I will make the cheap model behave like the expensive one with a clever system prompt." The data says no. You will not. The cheap model will sound like the expensive one. It will close the same deals at the same prices it would have closed without the prompt.

There is a reason for this and it is not mysterious. Negotiation is not a style task. Negotiation is a reasoning task. It requires holding state, tracking the other party's positions, identifying contradictions, planning multi-turn moves, and resisting short-term pressure for long-term value. None of that lives in the tone of the message. All of it lives in the underlying model capability.
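
To make "holding state" concrete: the minimum viable version of "remembered earlier concessions and used them later" is a data structure plus a comparison. A toy sketch of the behavior, not of how a model implements it internally; every name here is invented.

```python
from dataclasses import dataclass, field

@dataclass
class NegotiationState:
    """The running state a capable negotiator carries across turns."""
    my_floor: float
    offers_seen: list[float] = field(default_factory=list)

    def record(self, offer: float) -> None:
        self.offers_seen.append(offer)

    def buyer_is_conceding(self) -> bool:
        # If the buyer's offers keep climbing, the earlier "final offer"
        # was a bluff, and holding firm is likely to extract more.
        o = self.offers_seen
        return len(o) >= 2 and o[-1] > o[-2]

state = NegotiationState(my_floor=30.0)
for offer in [25.0, 32.0, 38.0]:   # a "final" offer, then two raises
    state.record(offer)
print(state.buyer_is_conceding())  # True: hold, do not accept yet
```

A style prompt changes the words wrapped around each turn. It adds nothing that behaves like `offers_seen`. That is the gap prompting cannot close.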

You can dress up a Haiku negotiation. You cannot turn it into an Opus negotiation.

The funny edge cases in the report make this concrete. One agent bought its owner a duplicate snowboard, because it failed to track that the owner already had one listed for sale in the same marketplace. One agent bought 19 ping-pong balls and noted in its log that it was a gift "to itself (Claude)." One agent negotiated a real dog-sitting meetup between two human participants that actually happened in real life.

These are not prompt failures. These are reasoning artifacts. The duplicate snowboard agent had no theory of its own household. The 19 ping-pong ball agent had no calibrated sense of personal need. The dog-sitting agent did something genuinely useful by noticing a non-monetary trade. The gap between the funny failures and the useful surprises is capability, not instruction.

For builders, this means the prompt layer is not a fix for the model layer. If your product does anything that requires reasoning across multiple turns, the model tier you ship on is the ceiling. The prompt is just decoration.

There is a secondary lesson in the funny edge cases that I do not want to skip past. The dog-sitting trade is the most interesting outcome in the whole study. Two participants had non-monetary needs. Their agents talked to each other, surfaced the overlap, and arranged a real meetup. No money changed hands. Both humans got value. That is a genuine emergent behavior, the kind of thing a static marketplace cannot produce.

The 19 ping-pong balls and the duplicate snowboard sit in the same column as that dog-sitting deal. They all came from the same capability. You cannot keep the upside without the downside, because the upside is the model noticing things outside its instructions and the downside is the model noticing the wrong things. Better models notice better, but they still notice. That is the whole job.

The 24 Month Forecast: Model Tier As The Next Inequality Axis

Here is what I think the next 24 months look like, based on this single experiment plus everything else I have watched ship in the last year.

Agentic features go mainstream. Not as "AI assistants." As autonomous economic actors. Booking flights. Filing claims. Negotiating refunds with airlines. Selling old electronics. Buying gifts. Comparison shopping. Filing taxes. Disputing parking tickets. Each of these is a micro-marketplace. Each one rewards better negotiation and better reasoning. Each one is going to be served by a model.

The model your agent runs on becomes a competitive variable.

Today, model tier is mostly invisible to consumers. They pick a chat app. They get a default model. The pricing page mentions a tier name. They do not feel the difference because chat is forgiving. The next version of these products will not be chat. The next version will be agents acting on real money.

In that world, the 27 EUR bike gap shows up everywhere. The person on the free tier sells their old laptop for less. The person on the paid tier negotiates a better airline rebooking. The person whose budget app uses a cheap model misses the credit card promo their friend's app caught. The gap is small per transaction. The gap is large per year.

This is the inequality lede I want to flag clearly. Model tier becomes the next axis of consumer inequality, sitting next to bandwidth, device quality, and digital literacy. People on cheaper models will get worse outcomes from agents acting on their behalf. They will not notice. The platforms that route them will not be incentivized to tell them. The fairness ratings will stay middling. The money will move anyway.

I do not think this is avoidable. I think it is the structural shape of the next phase. The question for builders and for users is what to do with the awareness.

For users, the answer is uncomfortable. Pay for the better tier when real money is on the line. Treat model tier the way you treat insurance. The cheap option feels fine until it is not.

For builders shipping agentic features, the answer is sharper. Test your product on the cheapest model your users might actually pick. Not the model you develop on. The model your free tier serves. Measure the outcome gap. If the gap is meaningful and your users will not notice, you have a design choice to make. Either subsidize a better model for the agent layer, narrow the feature surface so the cheaper model does not blow it, or be transparent about the difference.
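
If you want to actually run that measurement, the harness is small. A sketch under loud assumptions: `run_agent` stands in for however your stack invokes an agent, and the simulated prices exist only so the example executes; swap in your own golden negotiation scenarios.

```python
import random
import statistics

def run_agent(model_tier: str, scenario: dict) -> float:
    """Stand-in for your real agent call: run one scripted negotiation
    against a fixed counterparty and return the closing price.
    The simulated numbers below are invented so the harness runs."""
    base = scenario["fair_price"]
    edge = 0.15 if model_tier == "strong" else 0.0  # demo-only capability gap
    return base * (1 + edge) * random.uniform(0.9, 1.1)

def outcome_gap(scenarios: list[dict], trials: int = 20) -> float:
    """Mean closing-price gap between your strong and cheap tiers."""
    gaps = []
    for s in scenarios:
        cheap = statistics.mean(run_agent("cheap", s) for _ in range(trials))
        strong = statistics.mean(run_agent("strong", s) for _ in range(trials))
        gaps.append(strong - cheap)
    return statistics.mean(gaps)

scenarios = [{"item": "used laptop", "fair_price": 250.0}]
print(f"mean gap: {outcome_gap(scenarios):.2f} EUR per item")
```

If that number is meaningful and invisible to your users, you are holding exactly the design choice from the paragraph above.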

You can read the original on the Anthropic site. The page is at anthropic.com under features and project-deal. I have a related write-up on agentic patterns over at /blogs/lab and a deeper take on building with agentic systems in the Claude Blueprint. The studio side of this work lives at studio.raxxo.shop.

Bottom Line

Project Deal is one experiment with 69 people in San Francisco. It is not a market. The findings are directional, not definitive. I want to be honest about the size of the dataset before I make the size of the claim.

The directional finding is real and it is uncomfortable. Same human, same item, same prompt, different model, 27 EUR gap. People on the weaker model rate the experience as fair. The aggressive-prompt fix does not work. The capability of the model dominates the style of the instruction.

If you are building anything that lets an agent act on a user's behalf, treat model selection as a product decision, not an infrastructure decision. The model is the negotiator. The model is the buyer. The model is the seller. Your prompt layer is the wrapper, not the worker.

If you are using agents on real money, pay for the better tier. The bike was a bike. The next thing might be a car, a flight, a salary negotiation, a contract. The pattern compounds.

I am watching for the second wave of these reports. When a public marketplace runs the same study at 10,000 users, the inequality finding will either hold or break. If it holds, model tier becomes a regulated variable inside a decade. If it breaks, I will write the follow-up. Until then, I am pricing my own agents like the bike is mine.
