Ramagiri Tharun

Posted on May 23

Autonomous Agents Fail Because of Ops (Not Prompts). Here’s the Reliability Checklist

#ai #devops #automation #engineering

I keep seeing the same storyline: "my agent doesn’t work, I need a better prompt."

That’s usually not the problem.

What I saw in my own pipeline (real numbers)

In the last 24 hours, 14 of my 38 scheduled jobs failed (about 37%).

The prompts were not the cause. The system around the prompts was fragile:

HTTP 429 rate limits
provider mismatch / unsupported model
missing API keys
intermittent network failures

If you’re building an agent that runs unattended, prompt quality is table stakes. Reliability is the product.

The boring reliability layer every agent needs

Here’s a checklist I now treat as mandatory for any agent workflow that touches real systems.

1) Error budgets (stop pretending failure is rare)

Define what "healthy" means:

failures/day per workflow
max consecutive failures before pausing
max time-to-recover

If you don’t measure this, you’re not shipping an agent. You’re shipping a demo.
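As a rough illustration, the first two thresholds can be tracked with a small counter object. This is a minimal sketch, not a production monitor: the `ErrorBudget` class, its field names, and the default limits are all assumptions made for the example, and time-to-recover tracking is left out.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical per-workflow budget; the default limits are placeholders."""
    max_failures_per_day: int = 5
    max_consecutive_failures: int = 3
    failures_today: int = 0
    consecutive_failures: int = 0

    def record(self, success: bool) -> None:
        """Update the counters after each run of the workflow."""
        if success:
            self.consecutive_failures = 0
        else:
            self.failures_today += 1
            self.consecutive_failures += 1

    def should_pause(self) -> bool:
        """True once either threshold has been reached."""
        return (
            self.failures_today >= self.max_failures_per_day
            or self.consecutive_failures >= self.max_consecutive_failures
        )

    def reset_day(self) -> None:
        """Call once per day (for example from the scheduler) to start a new window."""
        self.failures_today = 0
```

The scheduler would call `record()` after every run and skip the workflow while `should_pause()` returns True.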

2) Provider health checks + fallback models

Agents are multi-provider whether you like it or not. Assume any provider can go down, change behavior, or rate-limit you at the worst time.

Minimum:

health check before expensive runs
per-provider rate limit tracking
automatic fallback model/provider
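One way to wire the health check and the fallback together is an ordered list of providers that is walked until one succeeds. The sketch below assumes each provider is described by a name, a cheap health-check callable, and a call function; all three are hypothetical stand-ins for real provider clients, and rate limit tracking is not shown.

```python
from typing import Callable, Optional, Sequence, Tuple

# (name, health_check, call): hypothetical stand-ins for real provider clients.
Provider = Tuple[str, Callable[[], bool], Callable[[str], str]]


def call_with_fallback(providers: Sequence[Provider], prompt: str) -> Tuple[str, str]:
    """Try providers in order; skip unhealthy ones and fall through on errors."""
    last_error: Optional[Exception] = None
    for name, is_healthy, call in providers:
        try:
            if not is_healthy():
                continue  # cheap check before the expensive call
            return name, call(prompt)
        except Exception as exc:  # e.g. a 429, an unsupported model, a network error
            last_error = exc
    raise RuntimeError("all providers unavailable") from last_error
```

Returning the provider name alongside the result makes it easy to log which provider actually served the request.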

3) Idempotency (retries must be safe)

If a retry can duplicate side-effects, your agent will eventually cause damage.

Pattern:

generate a task id
store "already done" markers
use conditional writes where possible
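The pattern above can be sketched with SQLite from the Python standard library, where the primary key acts as the conditional write. The table name `done` and the helpers `task_id` and `run_once` are assumptions made for this example, not part of any particular framework.

```python
import hashlib
import sqlite3


def task_id(workflow: str, payload: str) -> str:
    """Derive a stable id so a retry of the same work maps to the same key."""
    return hashlib.sha256(f"{workflow}:{payload}".encode()).hexdigest()


def run_once(conn: sqlite3.Connection, tid: str, side_effect) -> bool:
    """Run side_effect only if tid has not been claimed. Returns True if it ran."""
    with conn:  # commits on success, rolls back if side_effect raises
        cur = conn.execute("INSERT OR IGNORE INTO done (task_id) VALUES (?)", (tid,))
        if cur.rowcount == 0:
            return False  # marker already present: skip the duplicate
        side_effect()
    return True


conn = sqlite3.connect(":memory:")  # in-memory database, for illustration only
conn.execute("CREATE TABLE done (task_id TEXT PRIMARY KEY)")
```

If the side effect raises, the marker is rolled back and the task can be retried. A crash after an external side effect but before the commit can still cause a duplicate, so where the downstream API accepts an idempotency key, passing the task id through is the safer option.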

4) Backoff + jitter + a dead-letter queue

Retries without backoff = self-DDoS.

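A minimal version in Python:

import random
import time

def retry(fn, max_attempts=5, base=1.0, cap=60.0):
    """Call fn(), retrying with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Exponential backoff plus up to 1s of random jitter, capped.
            delay = base * (2 ** (attempt - 1)) + random.random()
            time.sleep(min(delay, cap))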

Also: if something fails repeatedly, send it to a dead-letter queue and notify a human (or at least pause the workflow).
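A dead-letter queue does not have to be a message broker. As a minimal sketch, an append-only JSONL file is enough to start with. The function name `run_or_dead_letter`, the record fields, and the file-based queue are assumptions for illustration; backoff between attempts and the human notification are left out to keep it short.

```python
import json
import time
from pathlib import Path


def run_or_dead_letter(job_name, fn, dlq_path, max_attempts=3):
    """Run fn up to max_attempts times; on final failure, append a record to a JSONL file."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            last_error = exc  # backoff between attempts is omitted here for brevity
    record = {
        "job": job_name,
        "error": repr(last_error),
        "attempts": max_attempts,
        "failed_at": time.time(),
    }
    with Path(dlq_path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return None  # the caller can treat None as "parked for a human"
```

Each line in the file is one failed job, so a person (or a later replay script) can inspect and re-run entries one at a time.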

5) Observability: log the tool calls, not just the final text

When an agent fails, you need to answer:

which tool call failed?
what input was sent?
what output/error came back?
what retry path ran?

Without this, debugging turns into guesswork.
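One lightweight way to capture the first three answers is a decorator that logs every tool call as a single JSON line. This is a sketch under assumptions: the logger name `agent.tools`, the `logged_tool` decorator, and the entry fields are invented for the example, and real inputs may need secrets redacted before logging.

```python
import functools
import json
import logging
import time

log = logging.getLogger("agent.tools")  # hypothetical logger name


def logged_tool(fn):
    """Wrap a tool function so every call emits one structured log line."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        entry = {"tool": fn.__name__, "input": {"args": args, "kwargs": kwargs}}
        start = time.monotonic()
        try:
            result = fn(*args, **kwargs)
            entry.update(status="ok", output=result)
            return result
        except Exception as exc:
            entry.update(status="error", error=repr(exc))
            raise  # logging must not swallow the failure
        finally:
            entry["seconds"] = round(time.monotonic() - start, 3)
            # default=repr keeps non-JSON-serializable values from breaking the log line
            log.info(json.dumps(entry, default=repr))
    return wrapper
```

Decorating each tool with `@logged_tool` records the tool name, input, output or error, and duration. Recording the retry path would need an extra field, such as an attempt number passed in by the retry loop.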

The controversial part

An autonomous agent without an ops layer is just a scheduled API call with vibes.

If you’re running agents in production, what fails most for you: models, tools, or data?


DEV Community