Patrick Hughes

Posted on May 2 • Originally published at bmdpat.com

The CrewAI demo worked. Then the tool call retried 913 times.

#ai #crewai #python #devops

Here is the agent failure mode nobody shows in the demo.

The CrewAI flow works on the first run.
The agents are named well.
The tasks are clear.
The tool call returns the right data.

Then production input changes.

The vendor API returns a 429.
The search tool returns the same empty page.
The file tool cannot find the path.

The agent does what agents do.

It tries again.

Why this gets expensive

One retry is fine.
Ten retries might be fine.

Nine hundred retries is not a bug report.
It is a bill.

The problem is not CrewAI.
The problem is shipping an autonomous loop without runtime limits.

If the agent can retry, the runtime needs to know:

For a CrewAI workflow, I want a simple map:

Crew -> agent -> task -> tool call -> retry -> budget -> alert -> kill state.

Not a wall of spans.
Not a giant trace log.

A control map.

When a tool repeats, I want it obvious.
When spend crosses a cap, I want it obvious.
When a kill switch is armed, I want it obvious.

That is the difference between watching an agent and operating one.

Example:

Agent: vendor research agent.
Task: enrich a vendor before contract review.
Tool: company search API.

The API starts returning 429.
The agent keeps asking the same question.
The retries produce no new data.
Cost rises.
No one gets an alert.

That is exactly the kind of run that should stop itself.

Good does not mean the agent never fails.

Good means the failure is bounded.

That is what a client should see before trusting an agent with real work.