Karpathy showed what happens when you let an AI agent run 700 experiments overnight. The model proposes hypotheses, runs them, scores results, keeps what works, throws away what doesn't. Repeat.
The part nobody talks about: how do you know which experiments actually mattered?
I've been building with AI coding agents for months. Claude Code, Codex, Gemini CLI. The pattern is always the same: you give an agent a task, it runs, it produces output. Sometimes the output is good. Sometimes it's not. You squint at logs, compare diffs, make a judgment call. Move on.
That loop works fine for single tasks. It breaks completely when you want the agent to iterate on its own work.
## The Problem
Say you want an agent to optimize a function. Or fix a flaky test. Or refactor a module until it passes a quality gate.
Without loops, you're doing this manually. Run the agent. Check the output. Run it again with different instructions. Check again. Copy-paste the good parts. This is not what "autonomous" means.
Karpathy's autoresearch proved the loop works for research. Run, score, keep, discard, iterate. The scoring function is the key. Without a scoring function, you're just running the same thing over and over hoping something changes.
## The Solution: Backbeat Loops
Backbeat v0.7.0 shipped loops. Two strategies.
Retry: run a task until a shell command returns exit code 0.
```shell
beat loop "fix the failing test in auth.test.ts" --until "npm test"
```
The agent runs. `npm test` fails. The agent runs again with fresh context. `npm test` passes. Done.
Optimize: score each iteration with an eval script. Keep the best.
```shell
beat loop "reduce bundle size of the dashboard module" \
  --eval "node scripts/measure-bundle.js" \
  --direction minimize
```
Each iteration gets scored. Backbeat tracks the best result. After 10 iterations (configurable), you get the version that scored lowest. No squinting at experiment logs.
## How It Works
Each loop iteration runs in a clean agent context by default. The agent doesn't carry baggage from previous failures. Fresh start, same goal, same scoring function.
For more complex workflows, the next release will add the ability to loop entire pipelines:
```shell
beat loop --pipeline \
  --step "refactor the payment module" \
  --step "run the integration tests" \
  --step "measure test coverage" \
  --until "node scripts/check-coverage.js --min 90"
```
All three steps run per iteration. The exit condition evaluates after the full pipeline completes.
Safety controls keep things sane:
- Max iterations (default 10, 0 for unlimited if you're feeling brave)
- Max consecutive failures before stopping (default 3)
- Cooldown between iterations in milliseconds
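The control flow those three knobs imply can be sketched in a few lines. `runLoop` and its option names are illustrative, not Backbeat's internals:

```javascript
// Sketch of the safety controls above: stop on success, cap total
// iterations (0 = unlimited), bail after too many consecutive failures,
// and sleep between attempts.
async function runLoop(task, { maxIterations = 10, maxConsecutiveFailures = 3, cooldownMs = 0 } = {}) {
  let failures = 0;
  for (let i = 1; maxIterations === 0 || i <= maxIterations; i++) {
    const ok = await task(i); // one iteration, fresh agent context
    if (ok) return { ok: true, iterations: i };
    if (++failures >= maxConsecutiveFailures) break;
    if (cooldownMs > 0) await new Promise((resolve) => setTimeout(resolve, cooldownMs));
  }
  return { ok: false };
}
```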
## This Is the Karpathy Loop for Production
Autoresearch runs experiments in cycles. Propose, train, evaluate, keep or discard. Backbeat does the same thing but for production coding tasks instead of research.
The scoring function is what makes it work. Without one, the agent just retries blindly. With one, it optimizes. `npm test` is a scoring function. Bundle size measurement is a scoring function. Test coverage is a scoring function. Anything that returns a number or an exit code works.
First production implementation of this pattern for coding agents. Claude Code, Codex, Gemini CLI, any agent that speaks MCP.
## Getting Started
Add to your project's `.mcp.json`:
```json
{
  "mcpServers": {
    "backbeat": {
      "command": "npx",
      "args": ["-y", "backbeat", "mcp", "start"]
    }
  }
}
```
Or use the CLI directly:
```shell
npm install -g backbeat
beat loop "your task" --until "your exit condition"
```
As always, open source, MIT. github.com/dean0x/backbeat
I'm particularly interested in how people evaluate their agent outputs. What does your eval function look like?