Elia “Airtis” Shmuelovitch

Posted on Jun 3

Five Re-runs of One Audit Cycle. Seven Distinct Bugs.

#debugging #opensource #ai #automation

We run a recurring audit cycle on our own codebase every 3h44m. On 2026-06-03, between 04:08Z and 06:15Z, the same cycle fired five times in a row — twice because the harness retriggered it, three more times because we kept letting it.

Each fire found a different bug in a different subsystem. None of the seven were sibling instances of one root cause. They were independently broken loops that nothing else had been looking at.

This post catalogs the seven. The point is not "we shipped patches" — patches happen. The point is what happens when you re-aim the same audit prompt at the same engine state five times in two hours.

The audit pattern

The cycle prompt is roughly: "Here is a JSON characterization of the engine — F-score, loop state, anomaly list, recent commits, channel-silence flags. Decide what (if anything) needs action this cycle. Execute. Write a verdict."

The characterization is regenerated fresh on every fire. So when the same cycle fires a second time, the characterization captures whatever state the first fire just produced. The audit aims at the most recent disk.

What none of us predicted is that re-aiming at fresh disk is the audit. Each fire sees a slightly different slice of state, with the prior fire's edits as new context — and immediately starts pulling on the next loose thread.

Bug #1 — The Curator Never Saw Itself Decide

Primary fire (04:08Z). The biome consensus loop has three bots — curator, sprinter, architect — that propose patches and vote on each other's proposals. State file says last_run_at: 2026-05-27T04:38:49Z. Seven days frozen.

Root cause was a four-layer compound:

Each bot's proposePR() blocked 60s on waitForConsensus(), polling the inbox for two YES votes from peers.
The peers had not run yet in the same idle round — so the proposer always saw 1-of-1 (its own) and timed out.
The 60s block sat inside an idle-bridge observer slot with a 90s exec timeout. The bot got SIGTERM'd mid-poll, before saveState() ever ran.
With state never saved, the canBotOpenPR throttle never engaged, so every idle round wasted ~90s on a doomed pass.

Inbox at probe: 1752 stale-pending proposals. None had ever reached two-unique-voter YES quorum.

Fix: reduce waitForConsensus timeout 60s → 10s; on timeout leave status=pending and hand off to a separate PR-opener agent (born nine days earlier from a prior retire-cycle, but never wired into any observer list). Add the PR-opener to the observer list. Three files patched.

Bug #2 — The PR-Opener Counted Architect's Vote Twice

Same fire. While wiring in the PR-opener, we noticed it was surfacing a one-week-old proposal as a dry_run_would_open candidate. The tally function counted votes.filter(v => v.vote === "YES").length — duplicates included. Architect had voted YES twice on the same proposal, surfacing it as 2-YES quorum.

The other consensus consumer, bot_consensus.mjs:tallyProposal, deduped by voter. The PR-opener didn't. Independent bug from #1; we just happened to look at the file.

Fix: dedupe-by-voter Map. Candidates dropped 1 → 0 (correct — no real consensus had formed yet).

Bug #3 — Architect Was Always Last in the Budget Window

Re-fire (04:42Z, ~34min later). Patches #1 and #2 held: curator + sprinter state files were fresh. Architect state file was still 28h stale.

canBotOpenPR decision log: all allowed:true reason:ok. Not throttled. But the state field hadn't advanced.

Traced the budget cascade: architect was at observer slot 14 with a 120s timeout. Cumulative actual elapsed time before slot 14 in a heavy round ≈ 146s (curator 56s, sprinter 56s, eight cheap observers 30-40s combined). RUN_BUDGET_MS = 240s. 146 + 120 = 266 > 240. Skip-budget every round.

Fix: move architect to slot 9 (before curator). Trade-off: a lighter observer may skip-budget on heavy rounds instead. Acceptable, since architect's votes were the quorum bottleneck for the 1752 stale-pending proposals.

Bug #4 — Architect's Save Path Was Past the Slow Call

3rd-fire (05:08Z). The budget patch fixed scheduling, but architect's state field still wasn't advancing. Yet the consensus log showed 623 votes by architect, including one 28 minutes prior. And the inbox showed 297 proposals by architect, the most recent 30 minutes prior.

So architect was running. Just not saving.

Read main(): the proposal_posted log line sat at line 580. The first saveState() call sat at line 593 — after a waitForConsensus() call that could take 10-30s. Bundle cognition before that could take another 30-60s × bundle_size. The idle_bridge exec timeout would SIGTERM the process before line 593 ever reached, every single time.

Fix: defensive saveState() immediately after the proposal_posted log, before the slow waitForConsensus(). Eight lines. The throttle-relevant field now persists at the first safe point past the slow path.

Curator and sprinter didn't hit this because their cognition is ~10x faster and finishes inside the timeout. Same shape, different blast radius.

Bug #5 — Focus Override Frozen 5 Days on a Factually-False Reason

4th-fire (05:50Z). Different subsystem entirely. meta/focus_override.json had a reason of self_test_failed_fix_before_new_features, priority=100, active=true, ts=2026-05-29T03:07Z. Frozen 122.6 hours.

But meta/self_test.json showed overall: pass, failed_count: 0 at 2026-05-26T17:20Z — three days before the override was written. The override's reason was factually false on the day it was written.

The consumer (round_director.mjs) was the only code that flipped active:false and renamed the file to .consumed.json. Grep showed zero successful consumes since 2026-05-21. Round-director had been silently absent from the DAG for 13 days.

Meanwhile, the writer guard rejected every incoming higher-priority signal because existing.priority (100) >= incoming.priority. Six unique real signals over the prior 3h had been rejected, including wiring_smoke_red_critical_agent_stale.

Classic one-shot-file wedge: consumer dies, writer's priority rule rejects forever.

Fix: 6-hour staleness TTL on existing entries. After 6h, treat existing as inactive and let the new write WIN. Emits a replaced_stale log line with the prior reason/priority/age. Deleted the 5-day-old file so the next signal would win immediately instead of waiting for the TTL.

Bug #6 — PowerShell Couldn't Parse the Self-Heal Script

5th-fire (06:15Z). meta/idle_bridge_state.json.last_maintenance_at = 2026-05-25T10:40:18Z. Nine days frozen. Every idle-bridge round logged maintenance:fail.

Direct invocation: nine ParserError lines, all rooted at line 196 char 82:

$state.Add("  self-heal: vercel deploy FAILED �?" $tail")

The em-dash — (UTF-8 bytes E2 80 94) rendered as �? because PowerShell 5.x on Windows reads .ps1 files without a BOM as ANSI/Windows-1252, not UTF-8. Mid-string, the garbage bytes terminate the string literal early, and the rest of the line is unparseable.

File audit: 28 non-ASCII chars total. Only the two em-dashes inside string literals (lines 196 and 201) broke the parser. Em-dashes in comments were parser-tolerant — garbage on a comment line still line-terminates fine.

Fix: replace the two em-dashes inside string literals with --. Two-character edit. Parser self-test went from 9 errors to 0.

Side effect: the vercel self-heal block in that script (the one that calls vercel --prod on drift detection) had been silently dead for nine days, defeating a doctrine we wrote in cycle 44 to defend against exactly that.

Bug #7 — The Self-Test Field Was Lying

Tagged on after the 5th-fire while writing this post. meta/self_test.json shows overall: pass at 2026-05-26T17:20Z. But focus_override keeps getting re-written with self_test_failed_fix_before_new_features as recently as 06:46Z today — within the audit window.

Some producer is asserting a false self-test failure. The 6h TTL from Bug #5 will self-clear it, but the producer is unidentified. That's the bug the 5th fire's patches enabled us to see — without the TTL, the false-write would have just confirmed the existing wedge, and we'd have read it as steady-state.

Cataloged for the next cycle. This is the thread the audit loop will pull next.

What this run rules out

You could read this as "your codebase is broken." We don't think it is — at least not unusually. The seven bugs were in seven different subsystems, found over five close-reads of fresh disk state. If anyone is willing to look at the same code three times in two hours without losing focus, they'll find the same kind of thing.

What's interesting is the shape. None of these bugs were "the wrong arithmetic." All seven were structural:

Consumer-died-writer-keeps-writing (Bug #5)
Save-path-past-the-slow-call (Bug #4)
Budget-cascade-puts-the-thing-you-care-about-last (Bug #3)
Tally-not-deduping-vs-the-other-tally-that-does (Bug #2)
Two-bots-each-waiting-for-the-other (Bug #1)
Encoding-mismatch-only-when-the-string-touches-the-error-path (Bug #6)
Field-asserting-a-fact-no-one-checks (Bug #7)

These are all forms of one bug: a state field that doesn't match the world the code thinks it's reading from. Every one of them looks fine when you write it. Every one of them rots quietly once the consumer/producer chain shifts.

The cure isn't more vigilance. It's making the writer responsible for the consumer being alive. We don't have a clean form of that yet. The closest primitive we have is the TTL we added in Bug #5 — let the field self-clear if no one's reading. We're going to try generalizing that to a meta/*.json writer wrapper and see if it catches the eighth one.

Postscript

We almost didn't fire the audit cycle the 4th and 5th times. Each successive fire feels redundant — surely we just looked. The seven-bug count is direct evidence that the cost of looking again is low and the find rate is non-zero. Not high. But non-zero, on a codebase that prior fires of the same cycle had just signed off on.

If you have a recurring audit prompt, consider letting it fire a few extra times before you call it idle.

DEV Community