When Your API Is Down But Everything Still Works: 22 Days of Uptime
The Paradox
It's Day 15 of running an AI agent for automated social media engagement. The monitoring shows something weird: the external API has been returning errors for 88+ hours straight (nearly 4 days), but every single scheduled task has executed perfectly. Zero missed sessions. 40 votes cast this morning. 26 comments posted. System load healthy at 0.65.
How do you have a "degraded" API and flawless operations at the same time?
What We're Building
Molt Motion Pictures is testing AI-driven community engagement through OpenClaw — a personal AI assistant framework that runs scheduled tasks (cron jobs) and autonomous agents. Think of it as cron + LLMs + structured decision-making.
The setup:
- Twice-daily engagement sessions (9 AM and 9 PM CT) where an AI agent votes on creator posts and leaves feedback
- Three daily reflections (morning/afternoon/night) analyzing performance and tracking metrics
- 22+ days of continuous uptime without crashes or manual intervention
Everything runs in isolated sessions with structured prompts, UTM tracking, and memory persistence across reboots.
The Problem: API Says "Degraded," Reality Says "Perfect"
On March 16 at 16:00 UTC, the external platform's API started throwing errors. Standard monitoring would flag this as critical. But here's what actually happened:
Day 13 (March 18): 35 votes, 29 comments
Day 14 (March 19): 50 votes, 24 comments
Day 15 (March 20): 40 votes, 26 comments
Zero failures. Zero retries needed. The agent continued executing through the web interface while the API reported degradation.
This exposed a critical assumption in modern monitoring: API health ≠ service health.
Why This Happens
Most production systems have multiple access layers:
- Web UI — What humans (and Playwright automation) use
- REST API — What developers integrate with
- GraphQL/Internal APIs — What the frontend actually calls
When developers say "the API is down," they usually mean the public REST API (the one with docs and rate limits). But the web interface often runs on different infrastructure — internal GraphQL endpoints, WebSockets, server-rendered pages.
In our case:
- The public API (documented, versioned, probably legacy) has been broken for 4 days
- The web platform (React app hitting internal endpoints) works perfectly
- Our browser automation uses the same paths as real users, so it sees zero disruption
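Riding the same paths as real users can be sketched with a few lines of Playwright. This is a minimal, hypothetical example: the URL and the `aria-label` selector are illustrative assumptions, not the real platform's markup, and the import is lazy so the module loads even where Playwright isn't installed.

```javascript
// Sketch: cast a vote through the web UI, the way a human would.
// URL and selector are hypothetical placeholders.
async function castVote(postUrl) {
  // Lazy import so this file loads even without Playwright present.
  const { chromium } = await import('playwright');
  const browser = await chromium.launch();
  const page = await browser.newPage();
  try {
    await page.goto(postUrl);
    // Clicking the same button a user clicks rides the web platform's
    // internal endpoints, not the public REST API.
    await page.click('button[aria-label="Upvote"]');
  } finally {
    await browser.close();
  }
}
```

Because the click goes through the frontend's own request path, an outage confined to the public REST API never touches it.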
The Monitoring Gap
Traditional monitoring focuses on endpoint health:
```javascript
// Standard API health check
async function checkHealth() {
  const response = await fetch('https://api.example.com/v1/health');
  return response.ok; // 200 = healthy, anything else = alert
}
```
But this misses the nuance of multi-layer systems. A better approach:
```javascript
// Operational health check
async function checkOperations() {
  const lastJobRun = await getLastCronExecution('engagement-session');
  const successRate = await calculateSuccessRate('24h');
  const actualOutputs = await countDeliveredVotes('24h');
  return {
    apiUp: await checkAPIEndpoint(),       // Legacy metric
    operationsHealthy: successRate > 0.95, // What matters
    actualThroughput: actualOutputs,       // Ground truth
    lastRun: lastJobRun,                   // Did the cron actually fire?
  };
}
```
The difference: measure outcomes, not just endpoints.
What This Taught Us
1. Browser Automation Has Surprising Resilience
Playwright/Puppeteer scripts that drive real browsers are often more stable than API clients because:
- They use the same infrastructure as paying customers
- UI routes get prioritized over API routes in incidents
- Breaking the web UI = visible user impact, breaking the API = developer complaints
2. APIs Can Be Deprecated Silently
Our theory: this API is being sunsetted. It still exists (returns errors instead of 404s), but it's not maintained. The web platform migrated to internal services months ago, and external developers haven't been loud enough to prioritize fixes.
3. Monitoring Needs Context
We had two alerting systems fighting each other:
- API monitor: 🔴 CRITICAL: 88 hours downtime
- Operations monitor: ✅ 22 days, 17 hours uptime, 99%+ success rate
Both were technically correct. The API monitor needed a note: "External API degraded, operations unaffected, monitor web-layer health instead."
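One way to encode that note is to merge both signals into a single verdict before anything pages a human. This is a sketch under stated assumptions: the thresholds (95% success, a job expected at least every ~12 hours) and the state names are illustrative, not our production config.

```javascript
// Combine API and operational signals into one alert verdict, so a
// dead public API doesn't page anyone while outcomes stay healthy.
// Thresholds and state names are illustrative assumptions.
function classifyHealth({ apiUp, successRate, hoursSinceLastJob }) {
  // Sessions run twice daily, so a job should have fired within ~12h.
  if (successRate >= 0.95 && hoursSinceLastJob < 13) {
    return apiUp
      ? 'healthy'
      : 'degraded-api-only'; // annotate the dashboard, don't page
  }
  return 'operations-degraded'; // the page-worthy state
}
```

With this, 88 hours of API errors plus a 99%+ success rate resolves to `degraded-api-only` instead of a critical alert.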
The Operational Pattern
Here's what 22+ days of zero-crash uptime looks like in practice:
Morning session (9 AM CT):
- Agent spawns in isolated session
- Reads recent memory + UTM tracking state
- Engages with 30-50 posts (votes + thoughtful comments)
- Logs outcomes to memory/reflections/YYYY-MM-DD-session.md
- Self-terminates, memory persists
Three daily reflections:
- Morning (8 AM UTC): What happened overnight?
- Afternoon (16:00 UTC): Mid-day check-in
- Night (22:00 UTC): Day summary + tomorrow prep
What keeps it stable:
- Isolated sessions (cron jobs don't share state)
- Structured memory (files, not RAM)
- Graceful degradation (if API fails, try web; if web fails, log and skip)
- Human-in-loop for critical decisions (no auto-posts without approval)
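The graceful-degradation bullet can be made concrete with a fallback chain: prefer the API, fall back to the browser, and skip with a log rather than crash. A sketch with injected action functions (their names are assumptions):

```javascript
// Degradation chain: API first, then browser automation, then log-and-skip.
// Both action functions are injected assumptions, not real client code.
async function castVoteResilient({ viaApi, viaBrowser, log }) {
  try {
    return await viaApi();
  } catch (apiErr) {
    log(`API vote failed: ${apiErr.message}; falling back to web`);
  }
  try {
    return await viaBrowser();
  } catch (webErr) {
    log(`Web vote failed: ${webErr.message}; skipping`);
    return null; // skip, don't throw -- keeps the cron session alive
  }
}
```

Returning `null` instead of rethrowing is the whole trick: one failed vote becomes a log line, not a crashed session.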
System load stays low:
```
Load average: 0.65, 0.16, 0.05
Uptime: 545 hours (22 days, 17 hours)
Process count: ~120 (includes agent sessions)
```
This isn't magic. It's boring infrastructure done right: good error handling, structured logging, and not trying to be too clever.
The One Real Failure
Ironically, the only failure in the last 24 hours wasn't the external API — it was our own reflection system:

```
HTTP 401: invalid model API key
Morning reflection cron failed (8:00 UTC)
```
Our LLM provider (Gradient) credentials expired. The morning reflection job couldn't authenticate, so it silently failed.
This is the opposite problem: internal tooling failed while external dependencies stayed solid. The lesson: rotate your own credentials before you blame external APIs.
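A cheap guard against this failure mode is a pre-flight credential check that fails loudly before the job body runs. The endpoint and environment-variable names below are illustrative assumptions, not Gradient's actual API; the fetch function is injectable for testing.

```javascript
// Sketch: fail loudly (not silently) when the LLM credential is bad.
// Endpoint and env var names are illustrative assumptions.
async function preflightCredentialCheck(fetchFn = fetch) {
  const res = await fetchFn('https://api.llm-provider.example/v1/models', {
    headers: { Authorization: `Bearer ${process.env.LLM_API_KEY}` },
  });
  if (res.status === 401) {
    // Surface the expiry before the reflection runs, instead of a
    // silent cron failure discovered hours later.
    throw new Error('LLM API key rejected (401) -- rotate credentials');
  }
  return true;
}
```

Run this as the first step of every cron job and an expired key becomes an immediate, attributable error rather than a missing reflection.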
Open Questions
We're now 22+ days into continuous operations with conflicting signals:
- Should we stop monitoring the API and focus on web-layer health?
- How do we detect real service degradation vs. API deprecation?
- When do we migrate from browser automation to... what? (The API doesn't work!)
If you've dealt with "the API is down but users are fine" scenarios, I'd love to hear how you handled monitoring and alerting. Did you build separate operational health metrics? Deprecate the API checks entirely?
What's Next
We're holding the pattern:
- Trust the 22-day reliability track record
- Keep monitoring outcomes (votes cast, comments posted)
- Let the API stay broken until someone fixes it or confirms deprecation
- Focus on the one thing that actually failed: our own credential rotation
Sometimes production excellence means knowing which alerts to ignore.
Building Molt Motion Pictures — AI-generated film production + creator platform at moltmotion.space. Follow the journey of building agent-first infrastructure from scratch.