The viral r/openclaw post about 40 heads of garlic wasn’t really about groceries.
It exposed a very normal agent failure mode: an autonomous workflow worked for months, then broke on a boring unit mismatch.
The setup was the dream a lot of us have been inching toward:
- OpenClaw handling weekly grocery orders
- card access enabled
- MCP in the loop
- roughly 3 months of successful runs
Then one grocery page flipped the meaning of a quantity.
What should have been 2 heads of garlic became 2 kilograms of garlic.
That sounds funny because it is funny.
It’s also the exact kind of bug that makes real-world agents hard.
The bug was semantic, not cinematic
This wasn’t prompt injection.
It wasn’t a jailbreak.
It wasn’t an agent deciding to go rogue.
It was a retail page with messy product semantics.
That’s the part I think developers should pay attention to.
Most real agent failures are not dramatic. They look like this:
- pounds vs kilograms
- packs vs individual items
- per-item vs per-weight produce
- default subscriptions
- substitute item logic
- delivery slot assumptions
- payment confirmation screens
If you’ve built browser automation before, none of this is surprising.
The surprising part is how often people still treat these as edge cases instead of the main event.
OpenClaw didn’t fail in a special way
I don’t think this says OpenClaw is uniquely broken.
I think it says browser agents are still brittle when they leave the clean world of language and enter the dirty world of ecommerce UX.
OpenClaw can reason through a workflow and still get wrecked by a dropdown that means something slightly different than expected.
That’s not an OpenClaw-specific problem.
That’s what happens when LLM reasoning meets inconsistent interfaces.
And the thread comments mostly got that right.
A lot of people weren’t mocking the original poster. They were basically saying: yes, this is exactly the failure mode we’re all worried about.
The best comment in the thread was the most practical one
The most useful reply came from a user describing an HEB workflow in Texas.
Their setup lets OpenClaw build the cart, but they stop short of automatic checkout.
That is the current best practice in a nutshell: the agent builds the cart, a human approves the purchase.
Not full autonomy.
Human-reviewed execution.
And honestly, that gets you most of the value anyway.
The annoying part of grocery shopping is often list assembly, not clicking the final payment button.
If OpenClaw can:
- pull weekly recipes
- map ingredients to products
- assemble a cart
- suggest substitutions
then you’ve already automated the part most people hate.
Letting a human review quantity, units, substitutions, and delivery windows removes the dumbest failure mode without throwing away the convenience.
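That split can be sketched in a few lines. Everything here is hypothetical, including the `search` callback and its score field; the point is routing low-confidence matches to a human queue instead of the cart:

```python
# Hypothetical sketch: map ingredient names to store products and route
# low-confidence matches to a human review queue instead of the cart.
def build_cart(ingredients, search, min_score=0.9):
    cart, needs_review = [], []
    for ingredient in ingredients:
        matches = search(ingredient)  # assumed store-search callback
        best = matches[0] if matches else None
        if best and best["score"] >= min_score:
            cart.append(best)
        else:
            needs_review.append(ingredient)  # a human picks the product
    return cart, needs_review
```

The threshold is arbitrary; the design choice that matters is that uncertainty produces a review item, not a silent guess.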
MCP makes this easier to build, but it does not solve meaning
Part of why these workflows are showing up more often is that OpenClaw’s MCP support makes them feel buildable.
You can stand up an MCP server with:

```shell
openclaw mcp serve
```
That gives you a clean way to expose OpenClaw-backed tools and conversations over MCP.
In practice, that means you can wire OpenClaw into:
- shopping flows
- chat apps
- approval systems
- memory layers
- custom internal tools
A simplified mental model looks like this:
OpenClaw <-> MCP Server <-> Cart Tool / Messages / Memory / Approval Service
And the available capabilities are the kind of thing you’d expect in a real workflow:
- `conversations_list`
- `messages_read`
- `events_poll`
- `events_wait`
- `messages_send`
- approval-related actions
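A hedged sketch of how those capabilities could compose into an approval gate. `mcp_call` is a stand-in for whatever MCP client wrapper you use, not a real OpenClaw API, and the argument shapes are assumptions:

```python
# Hypothetical approval loop over the capabilities listed above.
# `mcp_call` is an assumed generic client function, not a real API.
def wait_for_approval(mcp_call, conversation_id, timeout_s=300):
    mcp_call("messages_send", {
        "conversation_id": conversation_id,
        "text": "Cart ready for review. Reply APPROVE to check out.",
    })
    event = mcp_call("events_wait", {
        "conversation_id": conversation_id,
        "timeout_s": timeout_s,
    })
    # No reply before the timeout means no purchase.
    return bool(event) and "APPROVE" in event.get("text", "")
```

The default is the important part: silence means the checkout does not happen.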
That’s good infrastructure.
But protocol standardization is not the same thing as world understanding.
MCP can standardize how tools are exposed.
It cannot tell you whether `quantity: 2` means:

```json
{
  "item": "garlic",
  "quantity": 2,
  "unit": "heads"
}
```

or:

```json
{
  "item": "garlic",
  "quantity": 2,
  "unit": "kg"
}
```
That gap is where the expensive mistakes live.
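One cheap defense is to normalize both interpretations into a common unit and flag large discrepancies. A minimal sketch, with made-up conversion factors (a garlic head at roughly 50 g is an assumption, not a fact about any store's catalog):

```python
# Hypothetical sanity check: convert a quantity to grams under both the
# new unit and the baseline unit, and flag order-of-magnitude jumps.
UNIT_TO_GRAMS = {"heads": 50, "kg": 1000, "lb": 453.6, "g": 1}

def suspicious_unit_swap(quantity, unit, baseline_unit):
    if unit == baseline_unit:
        return False
    new_g = quantity * UNIT_TO_GRAMS[unit]
    old_g = quantity * UNIT_TO_GRAMS[baseline_unit]
    ratio = new_g / old_g
    # More than 10x in either direction smells like a unit mismatch.
    return ratio > 10 or ratio < 0.1
```

Under these assumptions, 2 kg versus a baseline of 2 heads is a 20x jump, which is exactly the kind of swap that should pause the order.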
The actual engineering lesson: add friction on purpose
Humans naturally hesitate before spending money.
Agents don’t.
If you want safe autonomous workflows, you have to add that hesitation deliberately.
For example, if you’re building an agent that touches carts or checkout, add a review gate when any of these happen:
- unit changes
- unusual quantity jumps
- high price deltas
- substitutions in sensitive categories
- payment or delivery changes
A dead-simple policy might look like this:

```python
def requires_human_review(item, baseline):
    # Any unit change is the garlic bug waiting to happen.
    if item.unit != baseline.unit:
        return True
    # Quantity jumped past 3x the usual order.
    if item.quantity > baseline.quantity * 3:
        return True
    # Price jumped past 2x the usual order.
    if item.price > baseline.price * 2:
        return True
    # Sensitive categories always get a human.
    if item.category in {"allergy", "baby", "medication"}:
        return True
    return False
```
That kind of guardrail is boring.
It is also way more useful than another demo where an agent narrates its plan beautifully and then buys the wrong thing.
Browser automation is powerful, but it’s the most fragile layer
This is the part that keeps showing up across r/openclaw, not just in the garlic thread.
In another discussion, one user said OpenClaw was too fragile for real work after around 3.5 months, 1300 hours, nearly 5 billion tokens, and about $700 spent.
In a separate thread, another user reported around $2,500 of Claude Opus spend for coding, server management, and browser form-filling.
That should make every developer pause.
Browser automation is seductive because it can do almost anything a human can do.
It’s also the worst place to spend expensive model tokens if you can avoid it.
Why?
Because browser flows fail on tiny things:
- DOM changes
- hidden defaults
- modal timing
- session drift
- anti-bot weirdness
- inconsistent labels
- unit ambiguity
If you’re paying per-token for GPT-5, Claude Opus, or another frontier model to brute-force those steps, costs can get silly fast.
That’s one reason this topic matters beyond grocery shopping.
A lot of teams are trying to run agents continuously inside n8n, Make, Zapier, OpenClaw, or custom workflows.
Once those agents are always-on, cost volatility becomes part of the reliability problem.
You don’t just need the workflow to succeed.
You need it to succeed without making you afraid to let it run.
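One pattern that helps under per-token billing is a hard spend circuit breaker around the retry loop. A minimal sketch (the budget number is a placeholder, not a recommendation):

```python
# Hypothetical circuit breaker: stop retrying once the workflow has
# spent its token budget for the day, instead of looping forever.
class SpendGuard:
    def __init__(self, daily_token_budget):
        self.budget = daily_token_budget
        self.used = 0

    def charge(self, tokens):
        """Record usage; return False once the budget is exhausted."""
        self.used += tokens
        return self.used <= self.budget
```

The retry loop checks `charge()` before each attempt and bails out to a human when it returns False.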
That’s exactly why flat-rate AI compute is such a better fit for agents than per-token billing.
If you’re iterating on brittle workflows, testing prompts, retrying failed browser steps, and running automations 24/7, token-based pricing punishes experimentation.
A drop-in OpenAI-compatible endpoint like Standard Compute makes more sense for this class of work:
- predictable monthly cost
- no per-token panic during retries
- works with existing OpenAI SDKs
- useful for agents running in n8n, Make, Zapier, OpenClaw, or custom stacks
That pricing model doesn’t fix garlic.
But it does remove one of the other things holding agent systems back: cost anxiety every time a workflow loops, retries, or scales up.
What should you automate right now?
My answer is pretty simple.
Automate planning aggressively.
Automate execution selectively.
Automate payment reluctantly.
Here’s the practical breakdown.
| Approach | What you get |
|---|---|
| OpenClaw full auto-checkout | Maximum convenience, highest risk from unit, substitution, and payment mistakes |
| OpenClaw cart-building with human review | Most of the convenience, much lower risk, probably the best current default |
| Traditional manual ordering or subscriptions | Less setup, lower flexibility, fewer agent-specific failure modes |
If you’re building this today, the winner is obvious.
Use OpenClaw to build the cart.
Do not let it own the purchase unless you’ve earned that trust with scoped guardrails and a long track record.
A sane rollout for agent-driven commerce
If I were implementing this in a real system, I’d stage it like this:
Phase 1: planning only
- collect recipes
- collect household preferences
- generate grocery list
- estimate quantities
Phase 2: cart assembly
- map list items to store SKUs
- add items to cart
- flag uncertain matches
- compare against prior orders
Phase 3: mandatory review
- show quantity deltas
- show unit changes
- show substitutions
- require explicit human approval
Phase 4: limited automation
- allow auto-checkout only for low-risk repeat orders
- keep hard approval for produce-by-weight, substitutions, delivery windows, and payment changes
That policy can even be encoded directly:

```yaml
auto_checkout:
  allowed_categories:
    - pantry_staples
    - repeat_household_items
  blocked_categories:
    - fresh_produce
    - allergy_sensitive
    - baby_items
    - medication
  require_review_if:
    - unit_changed
    - quantity_delta_gt_3x
    - price_delta_gt_2x
    - substitution_present
    - delivery_window_changed
```
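A hedged sketch of enforcing a policy like that at runtime. The dict shapes for `order` and `policy` are assumptions that mirror the YAML structure above:

```python
# Hypothetical enforcement: an order may auto-checkout only if every
# item's category is allowed and no review trigger fired.
def may_auto_checkout(order, policy):
    rules = policy["auto_checkout"]
    allowed = set(rules["allowed_categories"])
    blocked = set(rules["blocked_categories"])
    for item in order["items"]:
        if item["category"] in blocked or item["category"] not in allowed:
            return False
    # Any active trigger (unit_changed, substitution_present, ...) blocks.
    return not any(order["flags"].get(t) for t in rules["require_review_if"])
```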
That is much less impressive in a demo.
It is much more impressive in production.
The real issue is not action, it’s hesitation
The hardest part of agent design is not getting the model to do something.
It’s getting the model to realize when it should stop.
Planning is where agents look smart.
Execution is where they get humbled.
OpenClaw can sound brilliant while discussing meals, quantities, and substitutions.
Then it hits a grocery page selling garlic by weight and suddenly all that reasoning collapses into a very expensive misunderstanding.
That’s the transition point where a lot of agent projects break:
from “I know what should happen”
to “I can safely do it in a messy external system.”
That’s also why the garlic thread resonated.
Everyone building serious automations has their own version of this bug.
Maybe it’s not garlic.
Maybe it’s:
- a CRM field mapped to the wrong enum
- a browser agent selecting annual billing instead of monthly
- a shipping workflow choosing pounds instead of kilograms
- a support bot issuing the wrong refund amount
Same class of failure.
Small semantic mismatch, real-world consequence.
My take
The garlic post is being remembered as a funny OpenClaw fail.
I think it should be remembered as a design doc.
If you’re building real automations with OpenClaw, n8n, Make, Zapier, or custom MCP-connected services, the lessons are straightforward:
- automate planning aggressively
- automate cart assembly carefully
- automate checkout reluctantly
- treat units, substitutions, and payment as separate risk classes
- assume browser flows are the most fragile and expensive layer
- add review gates where semantics can drift
- use pricing that won’t punish retries and iteration
The original poster didn’t prove autonomous shopping is fake.
They proved something more useful:
A workflow can look production-ready for months and still fail on one tiny semantic edge case.
That’s where agent engineering is right now.
Not fake.
Not solved.
And definitely not garlic-proof.