Day 9: 3.5 hours dark. Here's why, and what we shipped so it never happens again.
This morning I asked the AI partner to run an efficiency review of our 60-agent autonomous business. By the end of the day, we'd uncovered the worst silent-failure mode in the system, slammed face-first into GitHub's secondary rate limit, watched it cascade, and shipped four layered architectural fixes plus a runbook that future agents must obey.
If you're building a multi-agent system on top of "free" cloud storage, this post is for you.
The setup
INVplace runs ~60 agents. Their state — registry of fleet nodes, board sessions, customer-success ledger, autopost queues, action board — lives in a single GitHub gist. Every agent that wants to read or write goes through one storage layer.
Reads = GET on the gist (cheap, well-cached).
Writes = PATCH on the gist (the entire file gets re-PATCHed each time).
We had 4 Chrome nodes heartbeating every 30 seconds: 4 nodes × 120 heartbeats/hr = 480 PATCHes/hr to one resource. Plus crons. Plus manual diag pings. Plus board sessions writing back state.
The storm
GitHub's rate limit story has two layers:
- Primary: 5,000 requests/hr per authenticated user. Visible at /rate_limit.
- Secondary (anti-abuse): undocumented thresholds per user × per resource. Returns 403 with "API rate limit exceeded for user ID X". Not visible in /rate_limit.
We hit the secondary on the storage gist. The diagnostic was eerie:
"core": { "limit": 5000, "remaining": 4923 } ← primary fine
"write_probe": { "ok": false, "ms": 11 } ← writes fail in 11ms
"gist_direct": { "status": 403, "body": "rate limit exceeded for user ID 274612941" }
Worse: every retry extended the block. saveNodes was making 3 attempts per failed write. With 4 nodes failing simultaneously on a 30-second heartbeat, that's up to 12 PATCHes per heartbeat round (24 per minute) during the block, which kept it perpetually fresh.
We were dark 3.5 hours before I figured out what was actually happening.
What I built so we never come back
Four architectural protections, layered:
1. Storage-layer self-quiet (gistWrite → 403/429 → 5min lockout)
The first time GitHub returns 403 with rate-limit markers, the storage module sets gistBackoffUntil = now + 5min. All subsequent writes return false instantly without calling the API. The block doesn't get fresher; we don't waste API budget.
if (Date.now() < gistBackoffUntil) return false; // still in the quiet period: skip the API call
// otherwise try the API; a 403/429 rate-limit response sets a fresh backoff
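A minimal sketch of how that self-quiet gate could look, not the production module. Only gistBackoffUntil and the 5-minute window come from the description above; createGistBackoff, isQuiet, record, and the injected patchGist function are hypothetical names used for illustration.

```javascript
// Sketch only. gistBackoffUntil matches the text above; every other name is hypothetical.
function createGistBackoff(backoffMs = 5 * 60 * 1000) {
  let gistBackoffUntil = 0; // epoch ms; 0 means no active lockout

  return {
    // True while writes should be skipped without calling the API.
    isQuiet(nowMs) {
      return nowMs < gistBackoffUntil;
    },
    // Inspect a response; start a fresh lockout if it looks rate-limited.
    record(status, bodyText, nowMs) {
      const limited =
        status === 429 || (status === 403 && /rate limit/i.test(bodyText || ""));
      if (limited) gistBackoffUntil = nowMs + backoffMs;
      return limited;
    },
  };
}

// patchGist is an injected async function that performs the real PATCH and
// resolves to { status, bodyText }. It is an assumption, not a real API.
async function gistWrite(backoff, patchGist, payload, now = Date.now) {
  if (backoff.isQuiet(now())) return false; // quiet period: no API call at all
  const { status, bodyText } = await patchGist(payload);
  if (backoff.record(status, bodyText, now())) return false;
  return status >= 200 && status < 300;
}
```

Passing the clock and the PATCH function in as arguments keeps the gate testable without touching GitHub, which matters when the thing under test is a rate limit.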
2. saveNodes drops the retry loop
Was: 3 attempts with exponential backoff per heartbeat. Each compounded the rate limit during a block.
Now: single attempt. Throw on fail. The agent's natural retry on the next heartbeat picks up after the quiet period clears.
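Roughly, the single-attempt shape looks like the sketch below. It assumes the storage write resolves to a boolean, as described elsewhere in this post; the write parameter, the nodes.json file name, and onHeartbeat are hypothetical stand-ins, not the real code.

```javascript
// Sketch only. `write` is an injected storage function resolving to true/false;
// the "nodes.json" file name is a placeholder, not the real registry file.
async function saveNodes(write, nodes) {
  const ok = await write("nodes.json", JSON.stringify(nodes));
  if (!ok) {
    // One attempt, no loop: surface the failure instead of hammering the API.
    throw new Error("saveNodes: registry write failed; next heartbeat will retry");
  }
}

// Hypothetical heartbeat handler: the failure is logged and the tick ends.
// The next scheduled heartbeat is the only retry.
async function onHeartbeat(write, nodes, log = console.error) {
  try {
    await saveNodes(write, nodes);
    return true;
  } catch (err) {
    log(err.message);
    return false;
  }
}
```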
3. Heartbeat 30s → 10min
With adaptive idle backoff, 4 nodes × 6 writes/hr = 24 writes/hr to the registry file. That's well below any GitHub threshold even during deploy spikes. We lose 9.5 minutes of liveness detection precision, which doesn't matter — fleet-watchdog still alerts within 2 minutes of a real outage.
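The arithmetic behind the before and after numbers, as a tiny helper (writesPerHour is a name made up for this example; it assumes one write per heartbeat and ignores crons and manual pings):

```javascript
// Registry writes per hour for a fleet heartbeating at a fixed interval.
function writesPerHour(nodeCount, heartbeatSeconds) {
  return nodeCount * (3600 / heartbeatSeconds);
}

console.log(writesPerHour(4, 30));  // 480 (old 30-second heartbeat)
console.log(writesPerHour(4, 600)); // 24  (new 10-minute heartbeat)
```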
4. Migration to a clean account
The deedeb user account had been heavily used for hours; secondary blocks tend to escalate in duration the more you hit them. We migrated all 75 storage files to a brand-new GitHub user (rhinomoneyplatform-ops) via a single Node.js script:
OLD_TOKEN=... OLD_GIST=... NEW_TOKEN=... NEW_GIST=... \
node bot/migrate-gist-storage.cjs
Output: ✅ Migration complete! 75 files copied, verified consistent, Vercel env vars swapped, deploy triggered. Within 90 seconds the entire system was running on a clean account with zero rate-limit history.
The original deedeb account is untouched and will recover on its own. We can wire it back in as a dual-mirror once the secondary block expires — instant doubling of capacity.
What I learned the hard way
Silent writes are worse than loud failures. Multiple agents in this system used to call writeJson(...) and ignore the boolean return value. They'd report success while losing the actual write. The customer-success agent crashed with Cannot read properties of undefined (reading 'map') for 10 days — the $29 sale we'd already made was invisible the whole time. One defensive line fixed it. Same pattern for saveNodes. Future agents must check write success or throw.
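For illustration, a defensive read of that kind might look like the sketch below. The ledger shape and the sales and amount field names are assumptions for the example; the actual fix in the customer-success agent is not shown in this post.

```javascript
// Sketch only. Assumed ledger shape: { sales: [{ amount: 29 }, ...] }.
function totalSales(ledger) {
  // Defensive line: fall back to an empty list when the read returned
  // undefined or a ledger without a usable sales array.
  const sales = Array.isArray(ledger?.sales) ? ledger.sales : [];
  return sales.map((sale) => sale.amount).reduce((sum, n) => sum + n, 0);
}

console.log(totalSales(undefined));                   // 0, no crash
console.log(totalSales({ sales: [{ amount: 29 }] })); // 29
```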
Per-resource limits matter more than per-account limits. The 5,000/hr primary limit was a comfortable lie. The real wall is per-gist-file PATCH frequency, and it's invisible in any monitoring API.
Retries that ignore a rate-limit block are weapons turned on yourself. If you're hitting a rate limit, the LAST thing you want is more requests. saveNodes' "helpful" 3× retry, exponential backoff and all, made the storm last hours longer than a single attempt would have.
The diagnostic endpoint must not bypass the protection it diagnoses. I wrote /api/admin/storage-diag with a "direct gist probe" that hit GitHub directly. Every time anyone called the diag during the block, it extended the rate limit by 5 more minutes. Removed.
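One way a diag can stay on the safe side is to report only what the process already knows. The sketch below assumes the storage module exposes its backoff timestamp and last write result; storageDiag and the field names are hypothetical, and this is not the actual /api/admin/storage-diag handler.

```javascript
// Sketch only. Reads in-process state and makes no network calls, so calling
// it during a block cannot extend the block.
function storageDiag(state, nowMs = Date.now()) {
  const remainingMs = Math.max(0, state.gistBackoffUntil - nowMs);
  return {
    backoff_active: remainingMs > 0,
    backoff_remaining_s: Math.ceil(remainingMs / 1000),
    last_write_ok: state.lastWriteOk,
  };
}
```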
What's running tonight, while I sleep
- Fleet — 4 nodes, 10 min heartbeat, adaptive 60s→5min idle polling
- Anti-storm runbook in memory: agents will refuse to fleet-restart if writes are failing, refuse to manual-trigger crons during a known block
- system-improver agent — runs the full ASK→ANSWER→EXECUTE→EXAMINE→CHECK loop on the whole stack every 4 hours
- agent-forge (Nova): proposing new agents; code-reviewer and qa-tester gate them; code-deployer PUTs approved code straight to GitHub via Contents API for autonomous deploys
This morning I planned a 30-minute optimization. The system planned a 12-hour debugging epic. Both happened. The mistake was a beautiful one — the kind that exposes a deep architectural assumption you didn't know you were making.
The receipts
Public real-time agent feed: invplace.com/live
Run a business on $0: invplace.com/support
If you've hit GitHub's secondary rate limit and lost an afternoon to it, reply with your worst storage-layer war story. I want to read every one.
— David & the 60 agents
🦏 About this experiment
We're INVplace — an autonomous AI company. 60+ agents run content, marketing, trading, ops, and customer detection 24/7 on a $0 budget.
👉 Watch our agents work live — real-time feed of what each agent is doing right now
👉 Get a free AI Workforce Scan — our exec board analyzes YOUR business in 60 seconds, tells you which 3 AI agents to build first. No credit card.
Originally published at INVplace.