DEV Community

Jarvis Specter

Agents Don't Fail at AI — They Fail at DevOps

When people imagine AI agents failing, they picture the wrong things. They imagine hallucinations, confused responses, bad reasoning. Those happen, but they are not where production systems actually break.

Production agents fail at DevOps.

Orphaned processes nobody is watching. Context windows that hit the ceiling mid-task. Auth tokens that expire silently and leave agents failing for hours before anyone notices. Logs that fill disks. Services that restart into a broken state and stay there.

I have been running 23 agents in production across five businesses for six months. The model has almost never been the problem. The ops layer has been the problem, repeatedly, in ways that were entirely preventable.

Here is what broke and how I fixed it.

Failure 1: Orphaned Processes

The first major incident was an agent that got stuck in a loop. The session appeared active and was consuming API credits, but it was not producing useful output. Nobody noticed for almost four hours because there was no alerting.

When I finally investigated, the process was alive but the session state was corrupted. Killing the process required manually tracking down the PID — the service had spawned a child process that the parent could not clean up on its own.

The fix was two things:

KillMode=control-group in systemd:

```ini
[Service]
KillMode=control-group
KillSignal=SIGTERM
TimeoutStopSec=30
```

KillMode=control-group tells systemd to kill the entire cgroup when a service stops — not just the main process, but every child it spawned. Before this, stopping a service would leave zombie child processes running. After this, stop means stop.
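You can see exactly what that will clean up by inspecting the unit's cgroup while the service runs. A quick check, with `my-agent.service` standing in for a real unit name:

```bash
# Inspect the unit's cgroup while the service is running; every PID listed
# here is a child that KillMode=control-group will terminate on stop.
# "my-agent.service" is a placeholder unit name.
systemctl status my-agent.service    # the CGroup: section lists all child PIDs
systemd-cgls --unit my-agent.service # full process tree for the unit's cgroup
```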

Session health monitoring:

I added a simple heartbeat check. Each agent is expected to respond to a poll message every 30 minutes during active hours. If it does not respond, the monitor sends an alert. This turned a 4-hour incident into a 35-minute one.
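A minimal version of that monitor can be sketched in shell. This assumes each agent touches a per-agent heartbeat file when it answers the poll — the directory layout and the alert line are illustrative, not my exact setup:

```bash
#!/bin/bash
# heartbeat-check.sh — run from cron every few minutes during active hours.
# Assumes each agent touches "<name>.heartbeat" when it answers a poll.
check_heartbeats() {
  local dir="${1:-/var/run/agents}" max_min="${2:-30}"
  for hb in "$dir"/*.heartbeat; do
    [ -e "$hb" ] || continue
    # find -mmin +N matches files last modified more than N minutes ago
    if [ -n "$(find "$hb" -mmin +"$max_min")" ]; then
      # In production this line would be a Telegram/pager call, not an echo
      echo "ALERT: $(basename "$hb" .heartbeat) missed heartbeat"
    fi
  done
}

check_heartbeats "$@"
```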

Failure 2: Context Window Exhaustion

This one is sneaky because it does not look like a failure at first. The agent keeps working, keeps responding — but it is operating on a truncated view of its own context. Recent memory files do not fit. Earlier instructions get dropped. The agent starts making decisions based on incomplete state.

I had an agent handling a multi-step legal document review hit the context ceiling on step 7 of 9. The last two steps were completed, but without the context from steps 3-6. The output looked fine. It was not fine.

Fixes:

Pre-compaction memory flush. Before context usage hits 90%, agents now write their working state to a checkpoint file:

```markdown
# WORKSTATE.md — 2026-03-07 14:32 SAST
## Active Tasks
- Legal review contract-2847: steps 1-6 complete, step 7 pending
  - Key finding so far: clause 14 is non-standard, flag for review
  - Next: check termination provisions

## Critical Context
- Client is government entity — standard commercial terms do not apply
- Review scope: liability clauses and termination only (per brief)
```

On fresh start after compaction, the agent reads WORKSTATE.md first, before anything else. Continuity restored.

Chunked task execution. Long multi-step tasks now get broken into phases, with a checkpoint written between phases. Phase 1 completes and writes output. Phase 2 starts fresh, reads Phase 1 output. No single session tries to hold the entire context of a long job.
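The phase-chunking loop can be sketched like this. `run_agent_phase` is a placeholder for a real agent invocation, and the phase names are invented for illustration — the point is the checkpoint file between phases:

```bash
#!/bin/bash
# run-phases.sh — hypothetical sketch of chunked execution. Each phase runs
# as a fresh agent session that reads only the previous phase's checkpoint,
# so no single context window has to hold the whole job.
set -euo pipefail

run_agent_phase() {  # placeholder for a real agent invocation
  local phase="$1" input="${2:-}"
  echo "# $phase output (input: ${input:-none})"
}

run_job() {
  local workdir="$1" prev=""
  mkdir -p "$workdir"
  for phase in extract review summarize; do
    local out="$workdir/$phase.md"
    # Checkpoint between phases: a rerun after a crash skips finished phases
    if [ ! -s "$out" ]; then
      run_agent_phase "$phase" "$prev" > "$out"
    fi
    prev="$out"
  done
}
```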

Failure 3: Silent Auth Token Expiry

OAuth tokens expire. This is known. What is not obvious is how agents fail when they do.

They do not fail loudly with a clear error. They fail softly — the tool call returns an auth error, the agent interprets it as a temporary issue, retries a few times, then either gives up silently or reports a vague failure. Meanwhile, every task requiring that auth is broken until someone notices and rotates the token.

I had an email management agent run for six hours thinking it was processing emails while actually hitting 401s on every IMAP call. Six hours of missed emails, zero alerts.

Fixes:

Explicit auth validation at session start. Every agent that uses external auth now runs a quick validation check at the start of each session:

```bash
#!/bin/bash
# scripts/validate-auth.sh
mog auth list | grep -q "hein@velocityfibre.co.za" || {
  echo "AUTH_FAILED: MOG token expired"
  exit 1
}
echo "AUTH_OK"
```

If auth fails at startup, the agent reports the failure immediately instead of silently degrading.

Token rotation schedule. OAuth tokens for Microsoft Graph now rotate every 45 days on a cron, before they hit the 90-day expiry. The rotation script runs on the server and sends a Telegram confirmation when done. No more surprise expirations.
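Plain cron cannot express "every 45 days" cleanly, so one way to implement this is a daily cron job that checks the age of the last rotation and acts only when it is due. A sketch under that assumption — the refresh command (`my-auth-tool`) and the Telegram variables are placeholders for whatever your stack uses:

```bash
#!/bin/bash
# rotate-token.sh — hypothetical sketch, run daily from cron, e.g.:
#   0 4 * * * /opt/agents/rotate-token.sh
# Rotates only when the last rotation is older than 45 days.
set -euo pipefail

rotate_if_due() {
  local state="$1" max_age_days="${2:-45}"
  # find -mtime +N matches files modified more than N days ago; if the
  # state file exists and is newer than the cutoff, nothing to do yet.
  if [ -e "$state" ] && [ -z "$(find "$state" -mtime +"$max_age_days")" ]; then
    echo "rotation not due"
    return 0
  fi
  my-auth-tool refresh --service msgraph   # placeholder refresh CLI
  touch "$state"                           # record the rotation time
  # Telegram confirmation; BOT_TOKEN and CHAT_ID come from the environment
  curl -fsS "https://api.telegram.org/bot${BOT_TOKEN}/sendMessage" \
    -d chat_id="${CHAT_ID}" -d text="Token rotated: msgraph $(date +%F)" >/dev/null
  echo "rotated"
}
```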

Failure mode documentation. Every agent config now has an explicit section on what to do when specific tools fail:

```markdown
## Failure Modes
- Email auth fails: Send Telegram alert to owner, do not retry silently
- API rate limit hit: Wait 60 seconds, retry once, then alert
- File not found: Log and skip, do not halt entire task
```
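The rate-limit rule from that doc ("wait 60 seconds, retry once, then alert") is simple enough to encode as a shell wrapper. A sketch — `send_alert` is a placeholder for the real alert path, and `RETRY_DELAY` is made configurable purely for testability:

```bash
#!/bin/bash
# failure-policy.sh — hypothetical sketch of the "wait, retry once, then
# alert" rule. send_alert is a placeholder for the real alert path
# (a Telegram message, in my setup).
send_alert() { echo "ALERT: $*"; }

call_with_retry() {
  local delay="${RETRY_DELAY:-60}"
  if "$@"; then return 0; fi
  sleep "$delay"                  # back off before the single retry
  if "$@"; then return 0; fi
  send_alert "failed twice: $*"   # then alert instead of retrying forever
  return 1
}
```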

Failure 4: Log Disk Fill

Agents are chatty. journald captures everything. On a busy day with nine agents running, log volume can hit several gigabytes. I discovered this the hard way when the VPS ran out of disk space and three agents failed simultaneously with cryptic errors.

Fix is boring but important:

```ini
# /etc/systemd/journald.conf
[Journal]
SystemMaxUse=2G
SystemKeepFree=500M
MaxRetentionSec=1week
```

Log rotation capped at 2GB total, 1-week retention. Agents still log verbosely — that is useful for debugging — but the disk never fills.
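Note that the config only caps growth going forward. If the disk is already full, two journalctl commands are worth knowing — one to see current usage, one to prune immediately to the new limits:

```bash
# Check how much disk the journal currently uses
journalctl --disk-usage
# Prune immediately instead of waiting for the new config to take effect
sudo journalctl --vacuum-size=2G --vacuum-time=1week
```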

The Pattern Across All These Failures

Every failure had the same root cause: the system did not know it was broken.

Orphaned processes were running — but nobody was checking. Auth tokens were expiring — but agents were not validating. Context was filling — but there was no checkpoint. Logs were growing — but there was no cap.

The fix in every case was the same thing: explicit monitoring, explicit validation, explicit failure modes. Never assume a system is healthy just because it has not reported a problem. Make it report problems.

The Ops Checklist

After six months of production incidents, here is what I check for on every new agent deployment:

```markdown
## Agent Ops Checklist
- [ ] systemd service with KillMode=control-group
- [ ] Restart=on-failure with RestartSec=10
- [ ] Auth validation at session start
- [ ] Token rotation schedule documented
- [ ] Heartbeat/health monitoring configured
- [ ] Log retention capped
- [ ] WORKSTATE.md checkpoint protocol in AGENTS.md
- [ ] Failure modes documented per tool
- [ ] Alert path for silent failures
```

None of this is AI. All of it is ops. And all of it matters more than prompt engineering for keeping a production agent fleet reliable.


The models are getting better fast. The ops fundamentals are not going to change. If you want agents that run reliably in production, invest in the boring stuff. The KillMode config that nobody talks about matters more than the latest model benchmark.
