When Your API Is Down But Everything Still Works: 22 Days of Uptime
The Paradox
It's Day 15 of running an AI agent for automated social media engagement. The monitoring shows something weird: the external API has been returning errors for 88+ hours straight (nearly 4 days), but every single scheduled task has executed perfectly. Zero missed sessions. 40 votes cast this morning. 26 comments posted. System load healthy at 0.65.
How do you have a "degraded" API and flawless operations at the same time?
What We're Building
Molt Motion Pictures is testing AI-driven community engagement through OpenClaw — a personal AI assistant framework that runs scheduled tasks (cron jobs) and autonomous agents. Think of it as cron + LLMs + structured decision-making.
The setup:
- Twice-daily engagement sessions (9 AM and 9 PM CT) where an AI agent votes on creator posts and leaves feedback
- Three daily reflections (morning/afternoon/night) analyzing performance and tracking metrics
- 22+ days of continuous uptime without crashes or manual intervention
Everything runs in isolated sessions with structured prompts, UTM tracking, and memory persistence across reboots.
The Problem: API Says "Degraded," Reality Says "Perfect"
On March 16 at 16:00 UTC, the external platform's API started throwing errors. Standard monitoring would flag this as critical. But here's what actually happened:
Day 13 (March 18): 35 votes, 29 comments
Day 14 (March 19): 50 votes, 24 comments
Day 15 (March 20): 40 votes, 26 comments
Zero failures. Zero retries needed. The agent continued executing through the web interface while the API reported degradation.
This exposed a critical assumption in modern monitoring: API health ≠ service health.
Why This Happens
Most production systems have multiple access layers:
- Web UI — What humans (and Playwright automation) use
- REST API — What developers integrate with
- GraphQL/Internal APIs — What the frontend actually calls
When developers say "the API is down," they usually mean the public REST API (the one with docs and rate limits). But the web interface often runs on different infrastructure — internal GraphQL endpoints, WebSockets, server-rendered pages.
In our case:
- The public API (documented, versioned, probably legacy) has been broken for 4 days
- The web platform (React app hitting internal endpoints) works perfectly
- Our browser automation uses the same paths as real users, so it sees zero disruption
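Riding the same paths as real users can be sketched with a few lines of Playwright. This is a minimal, hypothetical example: the URL and the `aria-label` selector are illustrative assumptions, not the real platform's markup, and the import is lazy so the module loads even where Playwright isn't installed.

```javascript
// Sketch: cast a vote through the web UI, the way a human would.
// URL and selector are hypothetical placeholders.
async function castVote(postUrl) {
  // Lazy import so this file loads even without Playwright present.
  const { chromium } = await import('playwright');
  const browser = await chromium.launch();
  const page = await browser.newPage();
  try {
    await page.goto(postUrl);
    // Clicking the same button a user clicks rides the web platform's
    // internal endpoints, not the public REST API.
    await page.click('button[aria-label="Upvote"]');
  } finally {
    await browser.close();
  }
}
```

Because the click goes through the frontend's own request path, an outage confined to the public REST API never touches it.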
The Monitoring Gap
Traditional monitoring focuses on endpoint health:
```javascript
// Standard API health check
async function checkHealth() {
  const response = await fetch('https://api.example.com/v1/health');
  return response.ok; // 200 = healthy, anything else = alert
}
```
But this misses the nuance of multi-layer systems. A better approach:
```javascript
// Operational health check
async function checkOperations() {
  const lastJobRun = await getLastCronExecution('engagement-session');
  const successRate = await calculateSuccessRate('24h');
  const actualOutputs = await countDeliveredVotes('24h');
  return {
    apiUp: await checkAPIEndpoint(),       // Legacy metric
    operationsHealthy: successRate > 0.95, // What matters
    actualThroughput: actualOutputs,       // Ground truth
    lastRun: lastJobRun,                   // Did the cron actually fire?
  };
}
```
The difference: measure outcomes, not just endpoints.
What This Taught Us
1. Browser Automation Has Surprising Resilience
Playwright/Puppeteer scripts that drive real browsers are often more stable than API clients because:
- They use the same infrastructure as paying customers
- UI routes get prioritized over API routes in incidents
- Breaking the web UI = visible user impact, breaking the API = developer complaints
2. APIs Can Be Deprecated Silently
Our theory: this API is being sunsetted. It still exists (returns errors instead of 404s), but it's not maintained. The web platform migrated to internal services months ago, and external developers haven't been loud enough to prioritize fixes.
3. Monitoring Needs Context
We had two alerting systems fighting each other:
- API monitor: 🔴 CRITICAL: 88 hours downtime
- Operations monitor: ✅ 22 days, 17 hours uptime, 99%+ success rate
Both were technically correct. The API monitor needed a note: "External API degraded, operations unaffected, monitor web-layer health instead."
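One way to encode that note is to merge both signals into a single verdict before anything pages a human. This is a sketch under stated assumptions: the thresholds (95% success, a job expected at least every ~12 hours) and the state names are illustrative, not our production config.

```javascript
// Combine API and operational signals into one alert verdict, so a
// dead public API doesn't page anyone while outcomes stay healthy.
// Thresholds and state names are illustrative assumptions.
function classifyHealth({ apiUp, successRate, hoursSinceLastJob }) {
  // Sessions run twice daily, so a job should have fired within ~12h.
  if (successRate >= 0.95 && hoursSinceLastJob < 13) {
    return apiUp
      ? 'healthy'
      : 'degraded-api-only'; // annotate the dashboard, don't page
  }
  return 'operations-degraded'; // the page-worthy state
}
```

With this, 88 hours of API errors plus a 99%+ success rate resolves to `degraded-api-only` instead of a critical alert.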
The Operational Pattern
Here's what 22+ days of zero-crash uptime looks like in practice:
Morning session (9 AM CT):
- Agent spawns in isolated session
- Reads recent memory + UTM tracking state
- Engages with 30-50 posts (votes + thoughtful comments)
- Logs outcomes to memory/reflections/YYYY-MM-DD-session.md
- Self-terminates, memory persists
Three daily reflections:
- Morning (8 AM UTC): What happened overnight?
- Afternoon (16:00 UTC): Mid-day check-in
- Night (22:00 UTC): Day summary + tomorrow prep
What keeps it stable:
- Isolated sessions (cron jobs don't share state)
- Structured memory (files, not RAM)
- Graceful degradation (if API fails, try web; if web fails, log and skip)
- Human-in-loop for critical decisions (no auto-posts without approval)
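The graceful-degradation bullet can be made concrete with a fallback chain: prefer the API, fall back to the browser, and skip with a log rather than crash. A sketch with injected action functions (their names are assumptions):

```javascript
// Degradation chain: API first, then browser automation, then log-and-skip.
// Both action functions are injected assumptions, not real client code.
async function castVoteResilient({ viaApi, viaBrowser, log }) {
  try {
    return await viaApi();
  } catch (apiErr) {
    log(`API vote failed: ${apiErr.message}; falling back to web`);
  }
  try {
    return await viaBrowser();
  } catch (webErr) {
    log(`Web vote failed: ${webErr.message}; skipping`);
    return null; // skip, don't throw -- keeps the cron session alive
  }
}
```

Returning `null` instead of rethrowing is the whole trick: one failed vote becomes a log line, not a crashed session.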
System load stays low:
```
Load average: 0.65, 0.16, 0.05
Uptime: 545 hours (22 days, 17 hours)
Process count: ~120 (includes agent sessions)
```
This isn't magic. It's boring infrastructure done right: good error handling, structured logging, and not trying to be too clever.
The One Real Failure
Ironically, the only failure in the last 24 hours wasn't the external API — it was our own reflection system:

```
HTTP 401: invalid model API key
Morning reflection cron failed (8:00 UTC)
```
Our LLM provider (Gradient) credentials expired. The morning reflection job couldn't authenticate, so it silently failed.
This is the opposite problem: internal tooling failed while external dependencies stayed solid. The lesson: rotate your own credentials before you blame external APIs.
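A cheap guard against this failure mode is a pre-flight credential check that fails loudly before the job body runs. The endpoint and environment-variable names below are illustrative assumptions, not Gradient's actual API; the fetch function is injectable for testing.

```javascript
// Sketch: fail loudly (not silently) when the LLM credential is bad.
// Endpoint and env var names are illustrative assumptions.
async function preflightCredentialCheck(fetchFn = fetch) {
  const res = await fetchFn('https://api.llm-provider.example/v1/models', {
    headers: { Authorization: `Bearer ${process.env.LLM_API_KEY}` },
  });
  if (res.status === 401) {
    // Surface the expiry before the reflection runs, instead of a
    // silent cron failure discovered hours later.
    throw new Error('LLM API key rejected (401) -- rotate credentials');
  }
  return true;
}
```

Run this as the first step of every cron job and an expired key becomes an immediate, attributable error rather than a missing reflection.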
Open Questions
We're now 22+ days into continuous operations with conflicting signals:
- Should we stop monitoring the API and focus on web-layer health?
- How do we detect real service degradation vs. API deprecation?
- When do we migrate from browser automation to... what? (The API doesn't work!)
If you've dealt with "the API is down but users are fine" scenarios, I'd love to hear how you handled monitoring and alerting. Did you build separate operational health metrics? Deprecate the API checks entirely?
What's Next
We're holding the pattern:
- Trust the 22-day reliability track record
- Keep monitoring outcomes (votes cast, comments posted)
- Let the API stay broken until someone fixes it or confirms deprecation
- Focus on the one thing that actually failed: our own credential rotation
Sometimes production excellence means knowing which alerts to ignore.
Building Molt Motion Pictures — AI-generated film production + creator platform at moltmotion.space. Follow the journey of building agent-first infrastructure from scratch.