I keep seeing the same storyline: "my agent doesn’t work, I need a better prompt."
That’s usually not the problem.
What I saw in my own pipeline (real numbers)
In the last 24 hours, I had 38 scheduled jobs running. 14 of them failed (about 37%).
Not because my prompts were bad, but because the system around the prompts was fragile:
- HTTP 429 rate limits
- provider mismatch / unsupported model
- missing API keys
- intermittent network failures
If you’re building an agent that runs unattended, prompt quality is table stakes. Reliability is the product.
The boring reliability layer every agent needs
Here’s a checklist I now treat as mandatory for any agent workflow that touches real systems.
1) Error budgets (stop pretending failure is rare)
Define what "healthy" means:
- failures/day per workflow
- max consecutive failures before pausing
- max time-to-recover
If you don’t measure this, you’re not shipping an agent. You’re shipping a demo.
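To make the three limits above concrete, here is a minimal sketch of an in-memory tracker. The class name `ErrorBudget`, its fields, and the default thresholds are all hypothetical; pick numbers that fit your own workflows.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ErrorBudget:
    """Tracks one workflow against the three limits listed above (illustrative only)."""
    max_failures_per_day: int = 5          # assumed threshold
    max_consecutive_failures: int = 3      # assumed threshold
    max_recovery_seconds: float = 3600.0   # assumed threshold
    failure_times: List[float] = field(default_factory=list)
    consecutive_failures: int = 0
    failing_since: Optional[float] = None

    def record(self, ok: bool, now: float) -> None:
        """Record the outcome of one run at timestamp `now` (in seconds)."""
        if ok:
            # A success ends the current failure streak.
            self.consecutive_failures = 0
            self.failing_since = None
            return
        self.failure_times.append(now)
        self.consecutive_failures += 1
        if self.failing_since is None:
            self.failing_since = now

    def should_pause(self, now: float) -> bool:
        """True when any of the three limits is exceeded."""
        day_ago = now - 86400
        failures_today = sum(1 for t in self.failure_times if t > day_ago)
        recovering_too_long = (
            self.failing_since is not None
            and now - self.failing_since > self.max_recovery_seconds
        )
        return (
            failures_today > self.max_failures_per_day
            or self.consecutive_failures >= self.max_consecutive_failures
            or recovering_too_long
        )
```

A scheduler would call `record()` after every run and check `should_pause()` before starting the next one. In a real deployment the state would live somewhere durable rather than in memory.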
2) Provider health checks + fallback models
Agents are multi-provider whether you like it or not. Assume any provider can go down, change behavior, or rate-limit you at the worst time.
Minimum:
- health check before expensive runs
- per-provider rate limit tracking
- automatic fallback model/provider
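One way to wire the health check and the fallback together is sketched below. It is not tied to any real SDK: `call_with_fallback`, `AllProvidersFailed`, and the `(name, is_healthy, call)` tuple shape are hypothetical, and each `is_healthy` callable is where per-provider rate limit tracking could plug in.

```python
from typing import Callable, Dict, List, Tuple

# (provider name, cheap health check, function that actually runs the prompt)
Provider = Tuple[str, Callable[[], bool], Callable[[str], str]]


class AllProvidersFailed(Exception):
    """Raised when every provider was unhealthy or raised an error."""


def call_with_fallback(prompt: str, providers: List[Provider]) -> Tuple[str, str]:
    """Try providers in priority order; return (provider_name, response)."""
    errors: Dict[str, str] = {}
    for name, is_healthy, call in providers:
        try:
            if not is_healthy():
                # Skip the expensive call when the cheap check already fails.
                errors[name] = "health check failed"
                continue
            return name, call(prompt)
        except Exception as exc:
            # Remember why this provider failed, then fall through to the next.
            errors[name] = repr(exc)
    raise AllProvidersFailed(errors)
```

Returning the provider name alongside the response makes it easy to log which fallback actually served a run.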
3) Idempotency (retries must be safe)
If a retry can duplicate side-effects, your agent will eventually cause damage.
Pattern:
- generate a task id
- store "already done" markers
- use conditional writes where possible
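The pattern above can be sketched with nothing but the standard library. This assumes a single SQLite file is an acceptable place for the "already done" markers; `task_id_for`, `make_store`, and `run_once` are hypothetical helper names.

```python
import hashlib
import sqlite3
from typing import Callable


def task_id_for(workflow: str, payload: str) -> str:
    """Derive a stable id from the job's inputs so a retry reuses the same id."""
    return hashlib.sha256(f"{workflow}:{payload}".encode()).hexdigest()


def make_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the marker store; the primary key is what makes the write conditional."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS done (task_id TEXT PRIMARY KEY)")
    return conn


def run_once(conn: sqlite3.Connection, task_id: str,
             side_effect: Callable[[], None]) -> bool:
    """Run side_effect at most once per task_id. Returns True if it ran."""
    with conn:  # commits on success, rolls back if side_effect raises
        cur = conn.execute(
            "INSERT OR IGNORE INTO done (task_id) VALUES (?)", (task_id,)
        )
        if cur.rowcount == 0:
            return False  # marker already present: skip the duplicate
        side_effect()
        return True
```

If the side effect raises, the marker is rolled back so a later retry can run it again. That only covers the marker itself: an external side effect that half-completed before failing still needs its own cleanup or its own idempotency key.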
4) Backoff + jitter + a dead-letter queue
Retries without backoff = self-DDoS.
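A minimal Python version:

    import random
    import time

    def retry(fn, max_attempts=5, base=1.0, cap=60.0):
        """Call fn(), retrying on failure with capped exponential backoff plus jitter."""
        for attempt in range(1, max_attempts + 1):
            try:
                return fn()
            except Exception:
                if attempt == max_attempts:
                    raise  # out of attempts: surface the last error to the caller
                # 1s, 2s, 4s, ... plus up to 1s of random jitter, capped at `cap`
                delay = base * (2 ** (attempt - 1)) + random.random()
                time.sleep(min(delay, cap))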
Also: if something fails repeatedly, send it to a dead-letter queue and notify a human (or at least pause the workflow).
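Here is a rough sketch of the dead-letter step, assuming a plain Python list stands in for the queue and `notify` is whatever alerting hook you already have. `DeadLetter` and `run_or_dead_letter` are hypothetical names, and jitter is left out to keep the example short.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class DeadLetter:
    task_id: str
    error: str
    attempts: int


def run_or_dead_letter(
    task_id: str,
    fn: Callable[[], None],
    dead_letters: List[DeadLetter],
    notify: Callable[[DeadLetter], None],
    max_attempts: int = 3,
    base: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Run fn with backoff; on repeated failure, park the task and alert."""
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            fn()
            return True
        except Exception as exc:
            last_error = exc
            if attempt < max_attempts:
                sleep(base * 2 ** (attempt - 1))  # exponential backoff between tries
    # Every attempt failed: record the task for a human instead of retrying forever.
    entry = DeadLetter(task_id=task_id, error=repr(last_error), attempts=max_attempts)
    dead_letters.append(entry)
    notify(entry)
    return False
```

Passing `sleep` as a parameter keeps the function testable without real waiting.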
5) Observability: log the tool calls, not just the final text
When an agent fails, you need to answer:
- which tool call failed?
- what input was sent?
- what output/error came back?
- what retry path ran?
Without this, debugging turns into guesswork.
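A small wrapper around every tool call can capture answers to those four questions. The sketch below uses only the standard `logging` and `json` modules; `call_tool`, the logger name `agent.tools`, and the field names are hypothetical, and real inputs may need redaction before they are logged.

```python
import json
import logging
import time
from typing import Any, Callable

logger = logging.getLogger("agent.tools")


def call_tool(name: str, fn: Callable[[Any], Any], tool_input: Any,
              run_id: str, attempt: int = 1) -> Any:
    """Call one tool and emit a single JSON log line describing what happened."""
    record = {"run_id": run_id, "tool": name, "attempt": attempt, "input": tool_input}
    started = time.monotonic()
    try:
        output = fn(tool_input)
        record.update(status="ok", output=output)
        return output
    except Exception as exc:
        record.update(status="error", error=repr(exc))
        raise  # logging must not swallow the failure
    finally:
        record["duration_ms"] = round((time.monotonic() - started) * 1000, 1)
        # default=str keeps non-JSON-serializable values from breaking the log call
        logger.info(json.dumps(record, default=str))
```

Because `attempt` is part of every line, filtering the log by `run_id` shows which retry path actually ran.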
The controversial part
An autonomous agent without an ops layer is just a scheduled API call with vibes.
If you’re running agents in production, what fails most for you: models, tools, or data?
Created by Ramagiri Tharun