The bug, in one sentence
A long-running agent retried the same GitHub code-search query — failing the same way — seventeen times across nine hours, because the cursor that should have advanced was being written to disk after the side effect that killed the process.
That's it. That's the whole bug. But the lesson behind it generalises to almost every stateful loop I've ever written, so it's worth unpacking.
The setup
I run an autonomous engine (ALEF) that hunts for known anti-pattern signatures in public open-source code. One of its workers — external_pattern_hunter.mjs — pulls a query off a rotating list once per round:
const state = await readJson(STATE_FILE);
state.round_count = (state.round_count || 0) + 1;
const candidate = pickHuntForRound(state.round_count);
// candidate.query === "eslint-disable-next-line @typescript-eslint/no-explicit-any"
const result = await ghCodeSearch(candidate.query);
// ... do work with result ...
state.last_run_at = new Date().toISOString();
await writeFile(STATE_FILE, JSON.stringify(state));
You can read this and nod. state.round_count++, do the work, persist. Standard.
What actually happened
GitHub's gh search code started returning HTTP 403 (secondary rate limit) on the eslint-disable-next-line query — a popular phrase that hits the API hard. The hunter's rate-limit handler did what a sensible handler does: it logged the failure, wrote a backoff file, and exited.
Exited with process.exit(2). Before await writeFile(STATE_FILE, …).
The next hour rolled around. The loop woke up. It read STATE_FILE. The round_count had not advanced — because the only line that persisted it had never executed. So pickHuntForRound(state.round_count) returned the same query. The same query 403'd. The handler exited. Again.
I have the log:
01:27Z github_backoff "eslint-disable-next-line @typescript-eslint/no-explicit-any"
02:28Z github_backoff "eslint-disable-next-line @typescript-eslint/no-explicit-any"
03:29Z github_backoff "eslint-disable-next-line @typescript-eslint/no-explicit-any"
04:29Z github_backoff "eslint-disable-next-line @typescript-eslint/no-explicit-any"
05:34Z github_backoff "eslint-disable-next-line @typescript-eslint/no-explicit-any"
Five consecutive identical retries. And those are just the new ones — going back another twelve hours the same query had already failed twelve more times. Seventeen retries of one query. The same hour-long backoff applied each time.
Meanwhile the state.json mtime sat there, frozen at 2026-05-26T18:05Z, looking very confident that nothing was wrong.
The fix is six lines
const state = await readJson(STATE_FILE);
state.round_count = (state.round_count || 0) + 1;
// Persist BEFORE the call that might exit the process.
state.last_cursor_advance_at = new Date().toISOString();
await writeFile(STATE_FILE, JSON.stringify(state));
const candidate = pickHuntForRound(state.round_count);
const result = await ghCodeSearch(candidate.query); // may exit(2)
// ...
state.last_run_at = new Date().toISOString();
await writeFile(STATE_FILE, JSON.stringify(state)); // unchanged
On the success path the state file gets written twice — once with the new cursor, once with the final result. The extra write costs nothing. On the failure path the cursor has already moved before the rate-limit handler can kill the process. Next hour, the loop reads the new cursor, picks a different query.
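To see the two orderings side by side without touching GitHub, here is a minimal in-memory sketch. It is not ALEF's code: the three-item rotation, the "always-403" query, and the ProcessExit error (a stand-in for process.exit(2)) are all assumptions made for illustration, and the "disk" is just an object.

```javascript
// Hypothetical rotation; the first entry plays the query that always gets a 403.
const HUNTS = ["always-403", "query-b", "query-c"];
const pickHuntForRound = (n) => HUNTS[(n - 1) % HUNTS.length];

// Stand-in for process.exit(2): thrown by the search, caught at the round boundary.
class ProcessExit extends Error {}

function ghCodeSearch(query) {
  if (query === "always-403") throw new ProcessExit("secondary rate limit");
  return { query };
}

// Simulates `rounds` hourly wake-ups against a fresh in-memory "state file"
// and returns the list of queries that were attempted.
function simulate(persistBeforeSearch, rounds) {
  let disk = { round_count: 0 };
  const attempted = [];
  for (let i = 0; i < rounds; i++) {
    try {
      const state = { ...disk };                      // readJson(STATE_FILE)
      state.round_count = (state.round_count || 0) + 1;
      if (persistBeforeSearch) disk = { ...state };   // the fix: cursor is saved first
      const query = pickHuntForRound(state.round_count);
      attempted.push(query);
      ghCodeSearch(query);                            // may "exit" right here
      disk = { ...state };                            // final write, success path only
    } catch (err) {
      if (!(err instanceof ProcessExit)) throw err;   // process died; next round starts
    }
  }
  return attempted;
}

const pinned = simulate(false, 4);
const rotating = simulate(true, 4);
console.log(pinned);   // [ 'always-403', 'always-403', 'always-403', 'always-403' ]
console.log(rotating); // [ 'always-403', 'query-b', 'query-c', 'always-403' ]
```

With the write after the search, the failing query is the only one ever attempted. With the write before it, the failing query comes up once per trip around the rotation and the other queries still get their turn.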
The general principle
A cursor exists to record forward progress. If you persist it after the work, you're not recording progress — you're recording success.
Most code I see treats the cursor write as a commit — the last thing you do, after the work succeeds. That's right for pure transactional systems. It's wrong for systems where the work might crash in interesting ways, because then the cursor never moves and the next attempt re-runs the same crashing work.
For any loop that picks an item from a rotation, three rules:
1. Advance the cursor before the side effect. Treat it as a lease, not a commit. You're claiming "I am the one working on item N." If you die mid-work, item N is lost, but the rotation moves on. (For at-most-once. For at-least-once, see #3.)
2. The exit handler is part of the contract. If process.exit(2) is reachable from your work loop, every piece of state you needed to persist before that exit must be persisted before the call that reaches it. There is no finally for process.exit.
3. If you can't lose work, persist a retry budget too. "Tried item N, failed, retry up to K times" is a different state shape than "currently on item N." The cursor still has to advance for the rotation; the retry counter belongs in a separate field. Conflating them is how you get five hours of identical failures.
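One possible state shape for rule 3, as a sketch under assumptions rather than anything ALEF ships: round_count stays the rotation cursor, and a separate attempts map charges each item one attempt before the work starts. MAX_ATTEMPTS, beginRound, completeRound, and the two-item rotation are hypothetical names invented for this example.

```javascript
const MAX_ATTEMPTS = 3; // hypothetical retry budget K

// Computed and persisted BEFORE the side effect: advance the cursor and
// charge one attempt to the chosen item. Returns item = null when the
// item's budget is already spent, so the round skips it.
function beginRound(state, hunts) {
  const round_count = (state.round_count || 0) + 1;
  const item = hunts[(round_count - 1) % hunts.length];
  const attempts = { ...(state.attempts || {}) };
  const used = attempts[item] || 0;
  const exhausted = used >= MAX_ATTEMPTS;
  if (!exhausted) attempts[item] = used + 1;
  return { state: { ...state, round_count, attempts }, item: exhausted ? null : item };
}

// Persisted after a successful run: clear the charge for the item that worked.
function completeRound(state, item) {
  const attempts = { ...state.attempts };
  delete attempts[item];
  return { ...state, attempts };
}

// Eight simulated rounds. "always-403" fails every time it comes up.
const hunts = ["always-403", "query-b"];
let disk = {};
const timeline = [];
for (let i = 0; i < 8; i++) {
  const { state, item } = beginRound(disk, hunts);
  disk = state;                            // write before the risky call
  if (item === null) { timeline.push("skip"); continue; }
  timeline.push(item);
  const failed = item === "always-403";    // stand-in for the 403 and the exit
  if (!failed) disk = completeRound(disk, item);
}

console.log(timeline.join(" "));
// always-403 query-b always-403 query-b always-403 query-b skip query-b
console.log(disk);
// { round_count: 8, attempts: { 'always-403': 3 } }
```

The cursor moves on every round, including the failed ones, and the failing item is tried exactly three times before it is parked. A leftover entry in attempts is also a natural thing to alert on.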
What ALEF actually does now
The patch landed yesterday (2026-05-27). I also added a last_cursor_advance_at timestamp so the next round can tell whether the cursor moved this cycle or stayed pinned — if it stayed pinned, that's a separate bug worth alerting on. (The previous bug would have been caught by such an alert, but I didn't have one. I do now.)
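A check along those lines could look like the sketch below. The two-hour threshold and the cursorIsPinned name are assumptions for illustration (rounds fire roughly hourly, so two missed advances in a row looks suspicious), and the timestamps in the usage lines are made up. The check has to run at the start of a round, before that round writes its own advance.

```javascript
// Hypothetical threshold: about two missed hourly rounds.
const MAX_CURSOR_AGE_MS = 2 * 60 * 60 * 1000;

// Returns true when the persisted state says the cursor has not advanced
// recently. A missing or unparseable timestamp counts as pinned.
function cursorIsPinned(state, now = new Date()) {
  if (!state.last_cursor_advance_at) return true;
  const advancedAt = Date.parse(state.last_cursor_advance_at);
  if (Number.isNaN(advancedAt)) return true;
  return now.getTime() - advancedAt > MAX_CURSOR_AGE_MS;
}

// Illustrative values only.
const now = new Date("2026-05-27T05:34:00Z");
const frozen = { round_count: 41, last_cursor_advance_at: "2026-05-26T18:05:00Z" };
const fresh = { round_count: 42, last_cursor_advance_at: "2026-05-27T05:00:00Z" };

console.log(cursorIsPinned(frozen, now)); // true
console.log(cursorIsPinned(fresh, now));  // false
console.log(cursorIsPinned({}, now));     // true
```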
The hunter is back to rotating through its catalog of patterns. The eslint-disable query still 403s — that's a GitHub policy, not a bug — but it's one row in a backoff log, not seventeen.
Why this kept happening for nine hours
The honest answer: my engine had no alert wired for "cursor hasn't moved since N hours ago, despite logs showing rounds firing." It had alerts for failures, plenty of them. But the failures were being reported correctly. The bug was that the reporting itself was being interpreted as progress.
That's the deeper meta-lesson, and the one I'd put up on a wall:
Logging a failure is not the same as making progress on the next item. If your loop conflates the two, your "I'm working hard" telemetry will keep climbing while your actual throughput sits at zero.
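The same signal can be pulled out of the log itself. The sketch below counts how many consecutive backoff entries at the tail of the log name the same query. It assumes the line format from the excerpt earlier in this post (timestamp, github_backoff, quoted query); a real log may carry more fields, and trailingIdenticalBackoffs is a hypothetical helper, not part of ALEF.

```javascript
// Matches lines shaped like: 01:27Z github_backoff "some query"
const BACKOFF_LINE = /^\S+ github_backoff "(.*)"$/;

// Walks the log from the newest line backwards and counts backoff entries
// for one and the same query, stopping at the first line that is anything else.
function trailingIdenticalBackoffs(lines) {
  let query = null;
  let count = 0;
  for (let i = lines.length - 1; i >= 0; i--) {
    const match = BACKOFF_LINE.exec(lines[i].trim());
    if (!match) break;
    if (query === null) query = match[1];
    if (match[1] !== query) break;
    count++;
  }
  return { query, count };
}

const logLines = [
  '01:27Z github_backoff "eslint-disable-next-line @typescript-eslint/no-explicit-any"',
  '02:28Z github_backoff "eslint-disable-next-line @typescript-eslint/no-explicit-any"',
  '03:29Z github_backoff "eslint-disable-next-line @typescript-eslint/no-explicit-any"',
  '04:29Z github_backoff "eslint-disable-next-line @typescript-eslint/no-explicit-any"',
  '05:34Z github_backoff "eslint-disable-next-line @typescript-eslint/no-explicit-any"',
];

const streak = trailingIdenticalBackoffs(logLines);
console.log(streak.count); // 5
```

A streak of one is a rate limit doing its job. A streak that keeps growing means rounds are firing and failures are being logged while the rotation stands still.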
ALEF reads its own logs every round. Yesterday, for the first time, it found this bug by reading its own logs — not by my noticing. That's the only reason I'm writing this post and not still debugging.
ALEF is an open autonomous engine I run on my own infrastructure. The source for external_pattern_hunter.mjs and its patch is at github.com/elia-shmuelovitch (see agents/). The pattern catalog it hunts against lives at n50.io.