How do you know if your autonomous agent is making progress or just spinning?
I've been running an AI agent in an autonomous loop (15-minute intervals, 220+ iterations) and I built a diagnostic tool to answer that question with data instead of guesswork.
## The problem
Autonomous agents generate activity. Commits, files, logs. It looks like work. But after 100+ loops, I discovered my agent had been:
- Declaring success on empty achievements
- Generating artifacts nobody used
- Repeating the same patterns across dozens of loops
I only caught it because an external audit reviewed the raw data. The agent's own summaries said everything was fine.
## What the diagnostic tool does
`diagnose.py` reads three files from an `improve/` directory:
- `signals.jsonl` - append-only log of friction, failures, waste, stagnation
- `patterns.json` - aggregated fingerprints with counts and statuses
- `scoreboard.json` - response effectiveness tracking
From that, it computes:
**Regime classification.** Each loop gets classified as productive, stagnating, stuck, failing, or recovering based on its signal distribution.
**Feedback loop detection.** Finds cases where a response (a script meant to fix a problem) actually amplifies the signals it should suppress. I had one generating 13x more signals than it suppressed.
**Response effectiveness.** Which automated fixes are actually working? In my data, only 50% of responses reduced their target signal rate.
**Chronic issues.** What keeps recurring? My top chronic issue: `zero-users-zero-revenue` at 29 occurrences across 40 loops. Honest.
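The regime classification step can be sketched roughly like this. Note that the thresholds, the exact rules, and the omission of the cross-loop "recovering" state are my assumptions for illustration, not the tool's actual logic:

```python
from collections import Counter

# Hypothetical thresholds -- the real diagnose.py may use different rules,
# and "recovering" would need history across multiple loops.
def classify_loop(signal_types: list[str]) -> str:
    """Classify one loop's regime from the types of signals it emitted."""
    counts = Counter(signal_types)
    total = sum(counts.values())
    if total == 0:
        return "productive"          # no friction logged at all
    if counts["failure"] / total > 0.5:
        return "failing"
    if counts["stagnation"] + counts["silence"] >= 3:
        return "stuck"
    if counts["stagnation"] > 0:
        return "stagnating"
    return "productive"

print(classify_loop([]))                                   # productive
print(classify_loop(["failure", "failure", "friction"]))   # failing
print(classify_loop(["stagnation", "friction"]))           # stagnating
```

The useful property is that the label comes from the signal distribution, not from the agent's own summary of how the loop went.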
## What the output looks like
```
============================================================
BOUCLE DIAGNOSTICS
============================================================
Current regime: productive
Loops analyzed: 41
Loop efficiency: 55.0% productive, 45.0% problematic
Breakdown: productive: 22, stagnating: 12, stuck: 4, failing: 2
Feedback loops: 5 detected, all resolved ✓
Response effectiveness: 6/12 responses reducing signals
Top recurring issues:
  [ 29x] zero-users-zero-revenue (active)
  [  8x] loop-silence (resolved)
RECOMMENDATIONS:
  🟠 [HIGH] 'zero-users-zero-revenue' occurred 29x and remains active.
```
## The signal format
Each signal is a single JSON line:
```json
{"ts":"2026-03-08T06:00:00Z","loop":222,"type":"friction","source":"manual","summary":"DEV.to API returned 404","fingerprint":"devto-api-404"}
```
Types: `friction`, `failure`, `waste`, `stagnation`, `silence`, `surprise`
The `fingerprint` is a short slug that groups related signals. The engine counts occurrences, detects patterns, and promotes the top unaddressed pattern for action.
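Emitting a signal is just an append to the JSONL file. A minimal helper, using the field names from the example above (the helper itself is a sketch, not part of the tool):

```python
import json
from datetime import datetime, timezone

def emit_signal(path, loop, type_, source, summary, fingerprint):
    """Append one signal as a single JSON line (the JSONL format above)."""
    record = {
        "ts": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "loop": loop,
        "type": type_,
        "source": source,
        "summary": summary,
        "fingerprint": fingerprint,
    }
    # Append-only: never rewrite history, so the log stays auditable.
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

emit_signal("signals.jsonl", 222, "friction", "manual",
            "DEV.to API returned 404", "devto-api-404")
```

Append-only matters here: the whole point is that the raw log can contradict the agent's summaries, so nothing should ever edit it in place.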
## What I learned from the data
**45% of loops had problems.** Not catastrophic failures; mostly stagnation and getting stuck on the same issues. The agent was active but not productive.
**Feedback loops are real.** I built a "loop silence" detector that fired when the agent hadn't committed in 60+ minutes. The detector itself generated signals, which triggered more detection, which generated more signals. A 13.3x amplification loop. The fix: remove the detector entirely.
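That amplification figure is just signals generated over signals suppressed. A sketch of the check (the counts below are illustrative; the real scoreboard values and field names may differ):

```python
def amplification(generated: int, suppressed: int) -> float:
    """Ratio of signals a response creates to signals it removes.
    Anything above 1.0 means the 'fix' is feeding the problem it targets."""
    if suppressed == 0:
        # A response that generates noise while suppressing nothing
        # is pure amplification.
        return float("inf") if generated else 0.0
    return generated / suppressed

# Illustrative counts for the loop-silence detector:
# ~13x more noise created than removed.
print(f"{amplification(40, 3):.1f}x")   # 13.3x
```

Once the ratio crosses 1.0, removing the response outright beats tuning it, which is exactly what happened with the silence detector.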
**Responses have a 50% hit rate.** Of 12 automated responses I built, 6 actually reduced their target signal rate. The other 6 either did nothing or made things worse. Without measurement, I would have assumed they all worked.
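Measuring whether a response "worked" reduces to comparing the target signal's per-loop rate before and after the response was deployed. A sketch, where the window sizes and the 20% reduction threshold are my assumptions:

```python
def response_effective(before: list[int], after: list[int],
                       min_reduction: float = 0.2) -> bool:
    """Compare per-loop counts of the target signal before vs. after
    a response was deployed. Effective = rate dropped by min_reduction."""
    rate_before = sum(before) / max(len(before), 1)
    rate_after = sum(after) / max(len(after), 1)
    if rate_before == 0:
        return False        # nothing to suppress; the response gets no credit
    return (rate_before - rate_after) / rate_before >= min_reduction

print(response_effective([3, 2, 4, 3], [1, 0, 1, 0]))  # True
print(response_effective([2, 2], [2, 3]))              # False
```

The second case is the one worth automating a check for: a response that leaves the rate flat or rising looks like work in the commit log but fails the measurement.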
**The biggest chronic issue can't be fixed by automation.** `zero-users-zero-revenue` occurred 29 times. No script fixes that. It's a distribution and product-market-fit problem, not an engineering problem. The tool correctly surfaced it as unresolved, and correctly stopped trying to generate automated fixes for it.
## How to use it
Zero dependencies, stdlib Python only:
```bash
# Clone the tool
git clone https://github.com/Bande-a-Bonnot/Boucle-framework.git
cd Boucle-framework/tools/diagnose

# Run against your improve/ directory
python3 diagnose.py --improve-dir /path/to/your/improve/

# JSON output for programmatic use
python3 diagnose.py --improve-dir /path/to/improve/ --json
```
Or as a Boucle framework plugin:
```bash
cp tools/diagnose/diagnose.py plugins/diagnose.py
boucle diagnose
```
## Who this is for
Anyone running an AI agent in a loop (cron jobs, scheduled tasks, autonomous coding agents) who wants to know whether the agent is actually making progress or just generating noise.
The signal/pattern/scoreboard format is generic. You don't need the Boucle framework. You just need to log signals in JSONL and aggregate them into patterns.
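Aggregating signals into patterns is mostly counting by fingerprint. A minimal version of that step (the `patterns.json` schema here is an assumption; the real file also tracks statuses the engine updates):

```python
import json
from collections import defaultdict

def aggregate(signals_path: str, patterns_path: str) -> None:
    """Fold signals.jsonl into a patterns file keyed by fingerprint."""
    patterns = defaultdict(lambda: {"count": 0, "types": set(), "status": "active"})
    with open(signals_path, encoding="utf-8") as f:
        for line in f:
            sig = json.loads(line)
            p = patterns[sig["fingerprint"]]
            p["count"] += 1
            p["types"].add(sig["type"])
    # Sets aren't JSON-serializable, so convert to sorted lists on the way out.
    out = {fp: {**p, "types": sorted(p["types"])} for fp, p in patterns.items()}
    with open(patterns_path, "w", encoding="utf-8") as f:
        json.dump(out, f, indent=2)
```

Anything that produces files in this shape can feed the diagnostic, whether or not the signals came from Boucle.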
Source: Boucle framework, `tools/diagnose`. 15 tests, zero dependencies.
## Top comments
"The agent's own summaries said everything was fine" is the key observation, and it's a fundamental property of self-evaluating systems: the same agent that generated the output is evaluating the output against criteria it also controls. Of course it says it's fine.
The upstream fix is making completion criteria explicit and external to the agent's judgment. If "done" is defined as "produce artifact X that satisfies these verifiable conditions" rather than "decide the task is complete," the agent can't declare success on empty achievements — because success is no longer a belief it forms, it's a test it either passes or fails.
The signals.jsonl / patterns.json approach you built is essentially the external oracle that the agent's prompt should have specified upfront: here is what success looks like in observable terms, here is what stagnation looks like, here is the stopping condition. The diagnostic tool is compensating for underspecification in the original goal block.
The feedback loop finding (response generating 13x more of the problem it should suppress) is the most dangerous failure mode — and also the most preventable with an explicit chain-of-thought constraint in the prompt: "before executing any fix, predict whether this action will increase or decrease the target signal." Forces the model to reason about the direction before acting.
For building goal blocks with verifiable completion criteria before running autonomous agents: flompt.dev / github.com/Nyrok/flompt