Tijo Gaucher

I ran ONE AI agent for 30 days straight — here's what actually broke

Most AI agent demos are shaped like a 90-second loop: prompt → tool call → answer. The interesting failures don't show up there. They show up around day 7, when the process you started in a tmux session has eaten 4 GB of RAM, your browser sub-agent is wedged on a captcha you never noticed, and the thing has been retrying the same failed Stripe webhook for 36 hours.

I ran a single OpenClaw agent on a small VPS for 30 days. It was scoped to one boring job: triage incoming sales emails, draft replies, file them in the right folder, ping Slack on anything weird. The agent ran continuously, scheduled by cron, with persistent state in SQLite. No multi-agent orchestration, no fancy memory layer — just one process trying to stay alive.
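
For orientation, here's roughly the shape of that setup, as a minimal sketch rather than the real code: a cron-invoked Python entry point with SQLite as the only persistent state. The database path and schema are illustrative.

```python
import sqlite3

# Skeleton only: cron invokes this every 15 minutes, and SQLite is the
# single source of persistent state. Path and table are illustrative.
def main() -> None:
    db = sqlite3.connect("agent-state.db")
    db.execute("""CREATE TABLE IF NOT EXISTS handled
                  (message_id TEXT PRIMARY KEY, handled_at TEXT)""")
    # ... fetch inbox, triage, draft replies, file, ping Slack ...
    db.commit()
    db.close()

if __name__ == "__main__":
    main()
```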

Here is what actually broke, in the order it happened.

Day 1–3: everything looks great

The first three days are a honeymoon. Latency is good, the agent handles edge cases I didn't think to specify, and the inbox triage rules quietly improve as it picks up patterns. This is where most demo videos end. It's also where most teams declare victory and move on, which is the mistake.

Two things to instrument before day 4 even starts:

  1. Per-run token cost, written to a flat log (a minimal logger is sketched just after this list). You'll need this when you investigate cost drift in week two.
  2. Process RSS memory, sampled every minute. The number that matters isn't the peak — it's the slope.
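
Here's what that flat-log line can look like. This is a sketch: the function name, pricing parameters, and log path are all illustrative, and your provider's usage object will name the token counts differently.

```python
import datetime
import pathlib

# Hypothetical flat log; one line per run, easy to grep and plot.
COST_LOG = pathlib.Path("token-costs.log")

def log_cost(run_id: str, prompt_tokens: int, completion_tokens: int,
             usd_per_1k_in: float, usd_per_1k_out: float) -> None:
    # Convert token counts to dollars using per-1k prices.
    cost = (prompt_tokens * usd_per_1k_in
            + completion_tokens * usd_per_1k_out) / 1000
    stamp = datetime.datetime.now(datetime.timezone.utc).isoformat()
    with COST_LOG.open("a") as f:
        f.write(f"{stamp} {run_id} in={prompt_tokens} "
                f"out={completion_tokens} usd={cost:.4f}\n")
```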

If you're using a hosted setup like RapidClaw's managed OpenClaw runtime, the slope is graphed for you. If you're self-hosting, write the sampler yourself before you forget; a minimal version follows. You will forget.
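
A sketch of that sampler, assuming a Linux host where the agent writes its PID to a file; both paths are hypothetical.

```python
import datetime
import pathlib
import time

# Hypothetical paths: point these at wherever your agent records its PID
# and wherever you keep metrics.
PID_FILE = pathlib.Path("/var/run/agent.pid")
LOG_FILE = pathlib.Path("/var/log/agent-rss.log")

def rss_kib(pid: int) -> int:
    # /proc/<pid>/status contains a line like "VmRSS:    123456 kB"
    for line in pathlib.Path(f"/proc/{pid}/status").read_text().splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return 0

while True:
    try:
        sample = rss_kib(int(PID_FILE.read_text().strip()))
    except (FileNotFoundError, ValueError):
        sample = -1  # process (or PID file) gone: that's signal too
    stamp = datetime.datetime.now(datetime.timezone.utc).isoformat()
    with LOG_FILE.open("a") as f:
        f.write(f"{stamp} {sample}\n")  # plot this and watch the slope
    time.sleep(60)
```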

Day 4: the context bloat starts

The agent's working memory file grew to 18,000 tokens. None of it was strictly wrong. It was just… accumulated. Old email threads it had handled, notes about edge cases, a half-finished plan for a problem that resolved itself two days earlier.

The cost per run had quietly tripled.

This is the most boring failure mode in long-running agents and the one nobody warns you about. Your prompt isn't getting worse — your context window is getting fatter. The fix is unglamorous: a compaction step that runs nightly, summarizes anything older than 48 hours into a few bullet points, and archives the rest to a file the agent can grep but doesn't auto-load.
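
A hedged sketch of that nightly pass, assuming working memory is a JSON-lines file with an ISO timestamp per entry; summarize() is a placeholder for whatever model call produces the bullet points.

```python
import datetime
import json
import pathlib

# Assumed layout: one {"ts": ..., "text": ...} entry per line.
# Both paths are hypothetical.
MEMORY = pathlib.Path("memory.jsonl")
ARCHIVE = pathlib.Path("archive.jsonl")   # grep-able, never auto-loaded

def summarize(texts: list[str]) -> str:
    # Placeholder: in practice, ask the model for a few bullet points.
    return "\n".join(f"- {t[:80]}" for t in texts[:5])

def compact(now: datetime.datetime) -> None:
    cutoff = now - datetime.timedelta(hours=48)
    fresh, stale = [], []
    for line in MEMORY.read_text().splitlines():
        entry = json.loads(line)
        ts = datetime.datetime.fromisoformat(entry["ts"])
        (fresh if ts >= cutoff else stale).append(entry)
    if stale:
        # Archive the originals; keep only a short summary in working memory.
        with ARCHIVE.open("a") as f:
            f.writelines(json.dumps(e) + "\n" for e in stale)
        fresh.insert(0, {"ts": now.isoformat(),
                         "text": summarize([e["text"] for e in stale])})
    MEMORY.write_text("".join(json.dumps(e) + "\n" for e in fresh))
```

Run it from its own nightly cron entry, not inside the agent loop, so a stuck agent can't skip its own diet.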

If you skip this, by day 14 you're paying GPT-4-class prices to send the model a partially-decayed copy of last week's todo list every single run.

Day 7: the first silent kill

The OOM killer took the process at 3:47 AM. There was no error in the logs because the process didn't get to write one. It just stopped existing.

This is where most self-hosted agent setups quietly die in production, and the operator doesn't notice for two days. The cron job that launches the agent every 15 minutes exits cleanly whether or not the process survived; there is no parent supervising health.

Three things you want before day 7:

  • A liveness file the agent touches on every successful run, plus an external check that alerts when it's stale for more than 30 minutes (a sketch follows this list).
  • A systemd unit (or equivalent) with Restart=on-failure and MemoryMax= set well below your VPS's actual RAM. You want the agent to die predictably and come back, not get reaped silently.
  • Logs that flush on every event, not on buffer fill. A buffered log is a log you don't have when the OOM killer arrives.
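
The first bullet is the cheapest to get wrong, so here's a minimal version of the external staleness check. The heartbeat path is hypothetical, and the agent side is a single HEARTBEAT.touch() after each successful run.

```python
import pathlib
import sys
import time

# Hypothetical path; the agent calls HEARTBEAT.touch() on every good run.
HEARTBEAT = pathlib.Path("/var/run/agent.heartbeat")
MAX_AGE_S = 30 * 60

def main() -> int:
    try:
        age = time.time() - HEARTBEAT.stat().st_mtime
    except FileNotFoundError:
        age = float("inf")  # never ran, or the file was wiped: also an alert
    if age > MAX_AGE_S:
        print(f"agent heartbeat stale ({age:.0f}s)", file=sys.stderr)
        return 1  # non-zero exit: wire this into cron mail, a pager, whatever
    return 0

if __name__ == "__main__":
    sys.exit(main())
```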

This is also the point where the "managed hosting" pitch starts to make economic sense for non-developers. Setting up systemd, a watchdog, log shipping, and metric scraping for one agent is two evenings of work for a competent backend engineer. SMEs don't have that engineer.

Day 11: the captcha trap

The agent's browser sub-task hit a captcha while loading a vendor portal. It didn't fail. It didn't error. It just waited. For 90 minutes. Then the headless Chrome process leaked and the next 14 runs spawned new Chrome instances on top of it.

The lesson is that anything involving a real browser needs both a hard wall-clock timeout and a "did the page actually finish loading the thing I asked for?" assertion. A 200 response is not a success signal when the body is a captcha challenge.
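
As a sketch of that pattern, here's the timeout-plus-assertion shape using Playwright's sync API. The URL, selector, and timeouts are illustrative, not from my actual setup.

```python
from playwright.sync_api import TimeoutError as PWTimeout, sync_playwright

def fetch_invoices(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.set_default_timeout(60_000)  # hard wall-clock ceiling (ms)
            page.goto(url)
            # A 200 is not success: assert the content we came for exists.
            page.wait_for_selector("#invoice-table", timeout=30_000)
            return page.content()
        except PWTimeout:
            raise RuntimeError("invoice table never appeared: captcha or stall?")
        finally:
            browser.close()  # no leaked Chrome, even on failure
```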

If your agent does any web automation at all, this will happen to you. The honest version of the agent demo isn't "watch it browse the web" — it's "watch the watchdog kill a stuck browser session and surface a human-readable reason for it."

Day 18: model drift on the provider side

The replies started getting weirdly formal. Not wrong — just off. I couldn't reproduce it on Claude with the same prompt locally, but in production the change was clear over a 3-day window.

Eventually I figured out the provider had silently routed a percentage of traffic to a slightly different model variant. This is a real thing that happens, and the only way you catch it is logging a stable hash of the prompt and the full response for every run, then diffing aggregates week-over-week. If you're not doing this, you'll just notice "vibes feel different" and have no evidence.
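
A sketch of that per-run log: a SHA-256 of the prompt buckets identical prompts together, so week-over-week aggregates compare like with like. The path and field names are assumptions.

```python
import datetime
import hashlib
import json
import pathlib

# Hypothetical log path; field names are assumptions, not a provider API.
DRIFT_LOG = pathlib.Path("drift.jsonl")

def log_run(prompt: str, response: str, model: str) -> None:
    record = {
        "ts": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "model": model,
        # Stable hash: identical prompts share a bucket, so aggregates
        # (response length, tone markers, refusal rate) are comparable.
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "response": response,
        "response_chars": len(response),
    }
    with DRIFT_LOG.open("a") as f:
        f.write(json.dumps(record) + "\n")
```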

Day 24: the small bug that hid in the schedule

A timezone bug in the cron expression meant the agent ran exactly zero times for 18 hours during a holiday DST shift. Nobody noticed because there was no one in the inbox to notice. The triage queue piled up, and the agent's first run after the gap took 11 minutes and 92,000 tokens to dig out.

Schedules are infrastructure. Test them on a fake clock before you ship them.
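
One way to do that, sketched with the third-party croniter library: walk the schedule across the DST boundary on a fake clock and assert consecutive fires never drift past the interval.

```python
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

from croniter import croniter  # third-party: pip install croniter

# Illustrative schedule and timezone; pick the boundary your deployment
# actually crosses.
tz = ZoneInfo("America/New_York")
clock = datetime(2024, 3, 9, 22, 0, tzinfo=tz)  # night before spring-forward
schedule = croniter("*/15 * * * *", clock)

prev = clock
for _ in range(200):  # roughly 50 hours of simulated fires
    nxt = schedule.get_next(datetime)
    assert nxt - prev <= timedelta(minutes=16), f"gap at {nxt}: {nxt - prev}"
    prev = nxt
print("no gaps across the DST shift")
```

If your scheduler does its own local-time arithmetic, this is the harness that catches it before production does.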

Day 30: what stuck

The agent is still running. The job is unglamorous, the per-run cost is now lower than day 1 because of the compaction step, and most weeks I don't think about it. That's the real success criterion for a long-running agent: do you stop having to think about it?

The narrative around "ambient AI agents that do your whole job" is still mostly vibes. The agents that actually pay rent today are boring: scheduled jobs, browser automation, coding agents, inbox triage. They're sticky because once you have one running and supervised, the cost of replacing it is high. They're hard because the supervision is the actual product.

If you're a developer building these for yourself, lean into systemd, structured logs, and a 5-line health check. If you're not — or you're shipping this for non-technical operators who can't be on-call for a Python process — managed runtimes like RapidClaw exist precisely because day-7 reliability is a product, not a feature.

The demo is easy. The 30-day uptime is the moat.


Tijo writes about practical AI agents at Human + AI. RapidClaw is the managed-hosting side of the same operator-focused practice — built for people who want a working AI assistant without becoming a Linux admin.
