Leonidas Williamson
Why AI Agents Keep Failing in Production (And How I Fixed It)

I spent 6 months building AI agents. They kept dying. So I built the infrastructure to keep them alive.

Last year, I was excited about AI agents. I built a research agent that could search the web, summarize papers, and draft reports. It worked great in demos.
Then I tried to run it in production.
Within a week:

It crashed 47 times
A single runaway loop cost me $340 in API calls
A multi-step research task failed halfway through, leaving corrupted state
I had no idea what any of my agents were actually doing

Sound familiar?
I realized the problem wasn't the agents themselves. It was that we have no infrastructure for running agents reliably.
So I built one.

The Problem: Agents Are Fragile
Here's what nobody tells you about AI agents:

  1. They crash. A lot. Network timeouts. Rate limits. Malformed responses. Context window overflows. An agent that works 99% of the time will fail multiple times per day at scale.
  2. Multi-step tasks are disasters waiting to happen. Your agent is on step 7 of 10. It crashes. What now? Do you restart from the beginning? Do you have step 6's output saved? Can you even tell what step it was on?
  3. Costs are invisible until they're catastrophic. One bad prompt, one infinite loop, one overly curious agent — and you're staring at a $500 bill from a task that should have cost $0.50.
  4. You're flying blind. What's your agent doing right now? Which step is it on? How much has it spent? Is it stuck? Most agent frameworks give you zero visibility.

The Solution: Orchestration

I looked at how other industries solved similar problems:

Telecom had Erlang/OTP — supervisors that restart crashed processes automatically

Finance had the Saga pattern — multi-step transactions that roll back cleanly on failure

Infrastructure had Kubernetes — orchestration for containers with health checks and auto-healing

AI agents had... nothing.

So I built Nexus OS — an orchestration layer that brings these battle-tested patterns to AI agents.

What Nexus OS Does

  1. Supervisors (Stolen from Erlang)

In Erlang, processes crash all the time. That's fine — supervisors restart them automatically. The system stays up even when individual pieces fail. Nexus brings this to agents:

```yaml
supervisor:
  name: research-team
  strategy: one-for-one  # Only restart the agent that crashed
  agents:
    - researcher
    - writer
    - reviewer
  maxRestarts: 5
  withinSeconds: 60
```

Three restart strategies:

one-for-one: Only restart the crashed agent
one-for-all: If one crashes, restart all (for tightly coupled agents)
rest-for-one: Restart the crashed agent and all agents started after it

Your agents will crash. Supervisors make that okay.
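To make the three strategies concrete, here's a minimal Rust sketch of the selection logic a supervisor applies when an agent crashes. The names and types are illustrative, not the actual Nexus OS internals:

```rust
// Which agents should a supervisor relaunch after a crash?
// Purely illustrative: "restarting" here just means selecting names.

#[derive(Debug, PartialEq)]
enum Strategy {
    OneForOne,
    OneForAll,
    RestForOne,
}

/// Given the start-ordered agent list and the index of the crashed agent,
/// return the agents the supervisor should restart.
fn agents_to_restart<'a>(agents: &[&'a str], crashed: usize, strategy: &Strategy) -> Vec<&'a str> {
    match strategy {
        // Only the crashed agent comes back.
        Strategy::OneForOne => vec![agents[crashed]],
        // Everyone restarts (for tightly coupled teams).
        Strategy::OneForAll => agents.to_vec(),
        // The crashed agent and everything started after it.
        Strategy::RestForOne => agents[crashed..].to_vec(),
    }
}

fn main() {
    let team = ["researcher", "writer", "reviewer"];
    // "writer" (index 1) crashes:
    assert_eq!(agents_to_restart(&team, 1, &Strategy::OneForOne), vec!["writer"]);
    assert_eq!(agents_to_restart(&team, 1, &Strategy::RestForOne), vec!["writer", "reviewer"]);
    assert_eq!(agents_to_restart(&team, 1, &Strategy::OneForAll).len(), 3);
}
```

The interesting design point is rest-for-one: it encodes the assumption that later agents may depend on earlier ones, so anything downstream of the crash is considered tainted.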

  2. Sagas (Stolen from Distributed Systems)

A saga is a sequence of steps where each step has a compensation action. If step 5 fails, you run compensations for steps 4, 3, 2, 1 — in reverse order.

```yaml
saga:
  name: publish-article
  steps:
    - name: research
      action: research-agent
      compensation: delete-research-notes

    - name: draft
      action: writing-agent
      compensation: delete-draft

    - name: review
      action: review-agent
      compensation: revert-review

    - name: publish
      action: publish-agent
      compensation: unpublish
```

If publishing fails, the article gets unpublished, the review gets reverted, the draft gets deleted, and the research notes get cleaned up. Automatically.
No more corrupted state from half-finished tasks.
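The unwind logic is the whole trick, and it fits in a few lines: run steps in order, and on the first failure, compensate the completed steps newest-first. This Rust sketch illustrates the pattern; it is not the Nexus OS implementation:

```rust
// Saga execution with reverse-order compensation. A step's `succeeds`
// flag stands in for actually running an agent.

struct Step {
    name: &'static str,
    succeeds: bool,
}

/// Run steps in order. On the first failure, return the compensation
/// actions to execute for the already-completed steps, newest first.
fn run_saga(steps: &[Step]) -> Result<(), Vec<String>> {
    let mut completed: Vec<&str> = Vec::new();
    for step in steps {
        if step.succeeds {
            completed.push(step.name);
        } else {
            // Unwind: compensate completed steps in reverse order.
            let compensations: Vec<String> = completed
                .iter()
                .rev()
                .map(|name| format!("compensate:{name}"))
                .collect();
            return Err(compensations);
        }
    }
    Ok(())
}

fn main() {
    let saga = [
        Step { name: "research", succeeds: true },
        Step { name: "draft", succeeds: true },
        Step { name: "review", succeeds: true },
        Step { name: "publish", succeeds: false }, // publishing fails
    ];
    let unwind = run_saga(&saga).unwrap_err();
    assert_eq!(unwind, vec!["compensate:review", "compensate:draft", "compensate:research"]);
}
```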

  3. Cost Controllers (Because $500 Surprises Suck)

Every agent gets a budget. When they hit it, you decide what happens:

```yaml
cost:
  agent: research-bot
  budget:
    maxTokens: 100000
    maxDollars: 5.00
  onLimit: pause  # or: throttle, alert, kill
```

Real-time tracking. Hard limits. No more surprise bills.

  4. Pools (For Parallel Work)

Fan out work to multiple agents, merge the results:

```yaml
pool:
  name: research-pool
  agents:
    - researcher-1
    - researcher-2
    - researcher-3
  strategy: majority  # Return when 2/3 agree
```

Strategies:

all: Wait for everyone
first: Return the fastest response
majority: Wait for >50% agreement
quorum: Custom threshold
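As an illustration, the majority strategy boils down to counting matching answers and returning one only when more than half the pool agrees. This Rust sketch shows the idea, not the actual merge code:

```rust
// Merge pool results under the "majority" strategy: an answer wins
// only if strictly more than half the agents returned it.

use std::collections::HashMap;

fn majority(answers: &[&str]) -> Option<String> {
    let mut counts: HashMap<&str, usize> = HashMap::new();
    for &a in answers {
        *counts.entry(a).or_insert(0) += 1;
    }
    counts
        .into_iter()
        // n * 2 > len is the integer-safe way to say n > len / 2.
        .find(|(_, n)| *n * 2 > answers.len())
        .map(|(a, _)| a.to_string())
}

fn main() {
    // researcher-1 and researcher-3 agree, researcher-2 dissents: 2/3 majority.
    assert_eq!(majority(&["paris", "lyon", "paris"]), Some("paris".to_string()));
    // Three-way split: no majority, so the pool reports no consensus.
    assert_eq!(majority(&["a", "b", "c"]), None);
}
```

A quorum strategy would be the same shape with the `n * 2 > len` predicate swapped for a custom threshold.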

  5. AXIS Trust (Identity for Agents)

This one's different. I built a separate system called AXIS Trust for agent identity and reputation. Every agent gets:

AUID: A unique identifier
Trust Score: 0-100, based on behavior
Credit Rating: AAA to D

Before an agent runs, Nexus can verify its trust level:

```yaml
trust:
  provider: axis
  requirements:
    minTrustTier: T3
    minCreditRating: BBB
  enforcement:
    onUntrusted: reject
```
As agents start interacting with each other (and with money), trust infrastructure becomes critical.

The Technical Decisions
Why Rust?

Single binary: No runtime, no dependencies. Download and run.
Performance: Orchestration needs to be fast and lightweight.
WASM support: Agents run in sandboxed WASM containers via wasmtime.
Memory safety: Long-running processes can't afford memory leaks.

The entire binary is ~10MB.
Why WASM Sandboxing?
Agents run arbitrary code. That's terrifying.
WASM gives us:

Memory isolation
CPU time limits
No filesystem access (unless explicitly granted)
No network access (unless explicitly granted)

An agent can't `rm -rf /`. It can't exfiltrate data. It can only do what you allow.

Why YAML Config?

Controversial take: YAML is fine.
For infrastructure configuration, YAML is readable, diffable, and familiar. Your orchestration config should live in version control alongside your code.

Getting Started

Install:

```shell
cargo install --git https://github.com/leonidas-esquire/nexus-os.git
```

Don't have Rust? Install it first:

```shell
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```

Create a project:

```shell
naos init my-project
cd my-project
```

Create an agent:

```shell
naos create researcher --template research
```

Run it:

```shell
naos run researcher
```

See what's happening:

```shell
naos dashboard
```

This opens a web UI at localhost:4200 showing all your agents, their status, costs, and logs.

What I Learned Building This

  1. Production is a different planet The gap between "works in a notebook" and "runs reliably in production" is massive. Most agent frameworks are optimized for the notebook. Nexus is optimized for production.
  2. Erlang got it right 40 years ago The "let it crash" philosophy with supervisor trees is brilliant. Instead of trying to handle every possible error, you accept that crashes happen and build systems that recover automatically.
  3. Visibility is a feature Half of "reliability" is just knowing what's happening. A dashboard that shows agent status, costs, and logs in real-time is worth more than clever error handling.
  4. Cost controls aren't optional AI agents with access to paid APIs are like employees with company credit cards. You need limits, tracking, and alerts. This should be built into the infrastructure, not bolted on.

What's Next

Nexus OS is open source (Apache 2.0). The roadmap:
Now:

Core orchestration (supervisors, sagas, workflows, pools)
Cost controls
AXIS Trust integration
Web dashboard

Coming soon:

WASM skill marketplace (reusable agent capabilities, devs earn money)
TypeScript SDK
Multi-node clustering

Later:

Managed cloud offering
Enterprise features (SSO, RBAC, audit logs)

Try It

GitHub: github.com/leonidas-esquire/nexus-os
Docs: aiagents.nexus/docs
Website: aiagents.nexus
I'd love feedback — especially on the API design and what orchestration patterns you'd want to see.
If you've struggled with keeping AI agents running in production, give Nexus a try. And if you have war stories about agent failures, I'd love to hear them in the comments.

Building something with AI agents? I write about agent infrastructure, reliability patterns, and lessons learned.

Top comments (2)

Archit Mittal

The production failure patterns you describe are painfully accurate. In my experience building automation workflows with AI agents, the #1 killer is error cascading - when one tool call fails and the agent tries to 'recover' by making increasingly wrong decisions instead of gracefully degrading. The fix that worked best for me was implementing explicit checkpoint/rollback semantics - every agent action gets a snapshot, and on failure you roll back to the last known good state rather than letting the LLM improvise a recovery. Also, structured output validation between every step catches hallucinated parameters before they hit your APIs.

Leonidas Williamson

Thanks Archit — error cascading is exactly the nightmare scenario that pushed me to build this.

You nailed it: letting the LLM improvise a recovery is asking for trouble. They'll confidently make things worse.

The checkpoint/rollback pattern you describe is essentially what Sagas do in Nexus — every step gets a compensation action, and on failure you unwind cleanly instead of hoping the agent figures it out.

The structured output validation point is interesting. Right now Nexus validates at the orchestration layer (did the step succeed/fail), but validating the content of outputs between steps could catch hallucinated parameters before they propagate.

Would you want that as a built-in primitive, or more of a "validation agent" you wire into your workflow?

Curious what automation workflows you've been building.