At Anythoughts.ai, we run AI agents continuously — writing content, sending outreach, enriching leads. Most of the time, it works. Then one Tuesday night, the whole pipeline silently stopped for six hours. Nobody noticed until a client asked why their weekly report hadn't arrived.
Here's what broke and what we changed.
The Setup
Our outreach agent runs on a cron schedule: pull leads from Apollo, enrich with Hunter.io, draft personalized emails, send via Resend. Simple pipeline, maybe 40 lines of orchestration code.
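The orchestration, simplified, looks roughly like this (a sketch, not our actual code; the function names stand in for the Apollo, Hunter.io, and Resend calls):

```javascript
// Hypothetical sketch of the cron-triggered pipeline. The four
// stage functions are injected so the flow is easy to test.
async function runPipeline({ pullLeads, enrich, draftEmail, send }) {
  const leads = await pullLeads();       // Apollo
  const results = [];
  for (const lead of leads) {
    const enriched = await enrich(lead); // Hunter.io
    const email = draftEmail(enriched);  // LLM draft
    results.push(await send(email));     // Resend
  }
  return { status: 'done', count: results.length };
}
```

Nothing exotic: fetch, transform, send, report a count.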
The failure? A rate limit response from Apollo that our agent treated as an empty result instead of an error. The agent looped happily, found "no leads," and exited cleanly. Zero alerts.
Lesson 1: Silent success is worse than a loud failure
Our agent returned exit code 0. Logged "No new leads found." Everything looked fine in the dashboard. The bug wasn't a crash — it was a wrong assumption dressed as valid output.
Fix: we added output validation. If the agent returns zero results on a run that historically returns 10-50, that's flagged as anomalous and triggers a human review ping.
```javascript
// Before
if (leads.length === 0) return { status: 'done', count: 0 };

// After
if (leads.length === 0) {
  if (runHistory.avgResults > 5) {
    throw new Error('Unexpectedly empty results — possible upstream failure');
  }
}
```
Small change. Big difference.
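The runHistory object in that check only needs a rolling average of recent run sizes. A minimal sketch (makeRunHistory and its field names are illustrative, not our exact implementation):

```javascript
// Hypothetical helper: keep the last N run counts so an empty run
// can be compared against what the agent normally returns.
function makeRunHistory(windowSize = 20) {
  const counts = [];
  return {
    record(count) {
      counts.push(count);
      if (counts.length > windowSize) counts.shift();
    },
    get avgResults() {
      return counts.length
        ? counts.reduce((a, b) => a + b, 0) / counts.length
        : 0;
    },
  };
}
```

Persist the counts wherever you already keep run metadata; the point is only that "zero" gets judged against history, not in isolation.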
Lesson 2: Agents need circuit breakers, not just retries
We had retry logic — 3 attempts with exponential backoff. But we didn't have a circuit breaker. When Apollo rate-limited us, the agent retried three times, failed gracefully, and then the next scheduled run tried again 30 minutes later. And the one after that.
By morning we'd burned through most of our monthly quota on failed retries.
Fix: a simple state file. If the last N runs failed with rate-limit errors, skip the next scheduled run and emit a warning instead.
```json
// ~/.state/agent-circuit.json
{
  "apollo": {
    "consecutiveFailures": 3,
    "lastFailureType": "rate_limit",
    "circuitOpen": true,
    "openUntil": "2026-03-20T06:00:00Z"
  }
}
```
Not glamorous. Completely effective.
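Reading that state and gating the next run takes only a few lines. A sketch (the threshold, the two-hour cooldown, and the function names are illustrative; file I/O is left out):

```javascript
const FAILURE_THRESHOLD = 3;

// Hypothetical gate: should this scheduled run proceed, given the
// persisted circuit state for one upstream service?
function shouldRun(circuit, now = new Date()) {
  if (!circuit || !circuit.circuitOpen) return true;
  // Circuit is open: skip runs until the cooldown expires.
  return now >= new Date(circuit.openUntil);
}

// Update the circuit state after a failed run; the circuit opens
// once failures reach the threshold.
function recordFailure(circuit, type, cooldownMs = 2 * 60 * 60 * 1000, now = new Date()) {
  const failures = (circuit?.consecutiveFailures ?? 0) + 1;
  return {
    consecutiveFailures: failures,
    lastFailureType: type,
    circuitOpen: failures >= FAILURE_THRESHOLD,
    openUntil: new Date(now.getTime() + cooldownMs).toISOString(),
  };
}
```

A successful run just resets the entry. No library needed; the whole thing is a JSON file and two functions.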
Lesson 3: Log what the agent decided, not just what it did
Our logs said: "Fetched 0 leads. Exiting." That told us nothing. What we needed was: "Apollo returned HTTP 429. Interpreted as empty result. Exiting."
Agents make micro-decisions constantly. When something goes wrong at 3AM, you want a decision trail — not just an action log.
We now enforce a simple rule: every conditional branch in an agent gets a log line explaining the choice.
```javascript
if (response.status === 429) {
  log.warn('Apollo rate limit hit — treating as temporary failure, not empty results');
  throw new RateLimitError(response);
}
```
Ten extra log lines turned a six-hour mystery into a five-minute root cause analysis.
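The RateLimitError in that snippet is just a small typed error, so downstream code (retries, the circuit breaker) can tell rate limits apart from everything else. A sketch (the retryAfter field is an assumption about how we'd read the header):

```javascript
// Hypothetical custom error carrying the details a caller needs
// to decide whether to back off or open the circuit.
class RateLimitError extends Error {
  constructor(response) {
    super(`Rate limited: HTTP ${response.status}`);
    this.name = 'RateLimitError';
    this.status = response.status;
    // Retry-After header, when the upstream provides one.
    this.retryAfter = response.headers?.['retry-after'] ?? null;
  }
}
```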
The Bigger Pattern
AI agents fail in boring ways. Not dramatic hallucinations or runaway loops — just wrong assumptions, swallowed errors, and missing observability.
The fixes aren't AI-specific. They're the same patterns that make any distributed system reliable: circuit breakers, anomaly detection, decision logging. We just had to learn them the hard way.
Three things we now build into every agent from day one:
- Output sanity checks — does the result make sense given historical context?
- Circuit breakers — stop hammering a failing dependency
- Decision logging — log the why, not just the what
If you're building agents that run unattended, steal these patterns. Your future self at 3AM will thank you.
Anythoughts.ai builds AI agents that handle real business workflows — outreach, reporting, content. We share what we learn in public.