Building a Self-Improving AI Agent Evaluation Platform in Rust

#rust #ai #machinelearning #agents

When you're building AI agents, evaluation is only half the problem. The harder half is closing the improvement loop: taking what you learned from failing evals and automatically making the agent better.

That's what AgentForge is - an open-source platform that does the full cycle: evaluate → cluster failures → patch prompts → re-evaluate → gate promotion.

Architecture
AgentForge ships as 16 focused Rust crates, including:

| Crate | Role |
| --- | --- |
| `agentforge-runner` | Parallel scenario execution |
| `agentforge-scorer` | Multi-dimensional LLM-as-judge |
| `agentforge-optimizer` | Automatic prompt patching |
| `agentforge-redteam` | Adversarial probing |
| `agentforge-gatekeeper` | Promotion gate |
| `agentforge-observability` | OTLP tracing + cost |
| `agentforge-multiagent` | Multi-agent orchestration |
| `agentforge-finetune` | Fine-tune dataset exporter |

The Optimizer Loop
The optimizer reads failure clusters from the scorer, calls an LLM to generate a prompt patch, reruns just the failing scenarios against the patched agent, and writes the result back to the eval database. If the patched agent clears the threshold, the gatekeeper promotes it.

This runs entirely in-process with no external orchestration needed.
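To make the control flow concrete, here is a minimal sketch of a patch-and-re-evaluate loop. It is not AgentForge's actual code: `optimize`, `LoopConfig`, and the two closures are hypothetical stand-ins (one for the LLM call that rewrites the prompt, one for rerunning the failing scenarios and scoring them). The stop reasons borrow the converged / no_improvement / max_iterations names the author mentions in the comments; how each one is triggered here is an assumption.

```rust
/// Why the loop stopped. The variant names follow the outcomes mentioned in
/// the discussion; the conditions that trigger them are assumptions.
#[derive(Debug, PartialEq)]
pub enum StopReason {
    Converged,
    NoImprovement,
    MaxIterations,
}

/// Hypothetical knobs for the loop.
pub struct LoopConfig {
    /// Stop as soon as the best score reaches this value.
    pub target_score: f64,
    /// A candidate must beat the current best by more than this to be kept.
    pub min_gain: f64,
    /// Hard cap on patch rounds.
    pub max_iterations: usize,
}

/// Runs patch -> re-evaluate rounds until a stop condition fires.
/// `patch` stands in for the LLM call that produces a patched prompt;
/// `evaluate` stands in for rerunning the failing scenarios and scoring them.
pub fn optimize<P, E>(
    base_prompt: &str,
    cfg: &LoopConfig,
    mut patch: P,
    mut evaluate: E,
) -> (String, f64, StopReason)
where
    P: FnMut(&str, usize) -> String,
    E: FnMut(&str) -> f64,
{
    let mut best_prompt = base_prompt.to_string();
    let mut best_score = evaluate(&best_prompt);

    for round in 0..cfg.max_iterations {
        if best_score >= cfg.target_score {
            return (best_prompt, best_score, StopReason::Converged);
        }
        let candidate = patch(&best_prompt, round);
        let score = evaluate(&candidate);
        if score - best_score > cfg.min_gain {
            // Keep the candidate only if it clears the minimum gain.
            best_prompt = candidate;
            best_score = score;
        } else {
            // Stop on the first round that fails to improve enough.
            return (best_prompt, best_score, StopReason::NoImprovement);
        }
    }

    // The last round may have reached the target exactly at the cap.
    let reason = if best_score >= cfg.target_score {
        StopReason::Converged
    } else {
        StopReason::MaxIterations
    };
    (best_prompt, best_score, reason)
}
```

Passing the patcher and evaluator as closures keeps the sketch free of any LLM or database dependency, so the termination logic can be tested with stubs. The returned prompt would still have to clear the promotion gate before replacing the current agent.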

Shadow Runs
Before promoting, you can run a shadow comparison: the current champion and the challenger handle the same traffic, scores are diffed, and only a statistically significant improvement triggers promotion.
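The post does not say which statistical test the shadow comparison uses. As one plausible illustration, the sketch below runs a paired one-sided test on per-request score differences: because both agents see the same traffic, each request yields a (champion, challenger) pair. The function names, the `ShadowVerdict` type, and the caller-supplied critical value are all assumptions.

```rust
/// Outcome of a shadow comparison (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum ShadowVerdict {
    Promote,
    Keep,
}

/// Paired t-statistic on per-request differences (challenger - champion).
/// Returns None when the inputs are unusable: mismatched lengths, fewer than
/// two pairs, or zero variance in the differences.
pub fn paired_t_statistic(champion: &[f64], challenger: &[f64]) -> Option<f64> {
    if champion.len() != challenger.len() || champion.len() < 2 {
        return None;
    }
    let n = champion.len() as f64;
    let diffs: Vec<f64> = challenger
        .iter()
        .zip(champion)
        .map(|(c, b)| c - b)
        .collect();
    let mean = diffs.iter().sum::<f64>() / n;
    // Sample variance of the differences (n - 1 in the denominator).
    let var = diffs.iter().map(|d| (d - mean).powi(2)).sum::<f64>() / (n - 1.0);
    let std_err = (var / n).sqrt();
    if std_err == 0.0 {
        // Degenerate case: treated conservatively as "cannot decide".
        return None;
    }
    Some(mean / std_err)
}

/// Promotes only when the statistic exceeds the caller-supplied critical value.
pub fn shadow_verdict(champion: &[f64], challenger: &[f64], critical: f64) -> ShadowVerdict {
    match paired_t_statistic(champion, challenger) {
        Some(t) if t > critical => ShadowVerdict::Promote,
        _ => ShadowVerdict::Keep,
    }
}
```

For large samples, a critical value of 1.645 corresponds to a one-sided 5% level under the normal approximation; small samples need the larger value from a t-table for the matching degrees of freedom. Any unusable input falls through to `Keep`, so the champion stays in place by default.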

Why Rust?

Zero-overhead async via Tokio for high-concurrency scenario runs
sqlx compile-time checked queries against Postgres
Single static binary → ships as a GitHub Action with no runtime deps

Get started
cargo install agentforge-cli
agentforge run --agent my-agent.yaml --scenarios 50 --threshold 0.85

Or drop it into CI:
- uses: bhavinkotak/agentforge@v0.1.10
  with:
    agent_file: fixtures/my-agent.yaml
    threshold: '0.85'

Repo: https://github.com/bhavinkotak/agentforge

Top comments (2)

Harjot Singh • May 31

"Evaluation is only half the problem, the harder half is closing the improvement loop" is exactly the gap most eval tooling leaves open: they tell you the score and stop, and turning a failing eval into an actually-better agent is left as an exercise. Your cycle (evaluate → cluster failures → patch → re-evaluate → gate promotion) is the right shape, and two steps in it are doing the heavy lifting most people skip. Clustering failures is the unlock, because raw eval failures are noise until you group them into the few underlying causes worth fixing; otherwise you patch symptoms one at a time forever. And gate promotion is the discipline that keeps the loop honest. A self-improving loop without a gate is a self-deceiving loop: the agent has to prove the patch beat the old version on a held-out set before it ships, or you just hill-climb into overfitting the eval. The thing I'd guard hardest: the eval set the gate uses has to be immutable and separate from what the patcher sees, or the loop optimizes the scorer instead of the agent. Closed-loop self-improvement only works if the gate can't be gamed. That eval-cluster-patch-but-gate-on-held-out instinct is core to how I think in Moonshift. How do you stop the patch step from overfitting to the failure cluster it was shown?

Bhavin Kotak • Jun 10

Really sharp breakdown, and the framing of "a self-improving loop without a gate is a self-deceiving loop" is exactly right. That's one of the hardest problems to get correct in this space.

On the held-out eval set: You've put your finger on the most critical invariant in the system. In AgentForge, the Gatekeeper (agentforge-gatekeeper) always gates on a separate eval set that the Optimizer never sees during the patch step. The Optimizer (agentforge-optimizer) runs candidate variants against a quick-eval 25-scenario subset - but the final promotion gate fires against the full held-out run database. These are enforced as separate database entries via the migration-tracked PostgreSQL schema (migrations/010_opt_tracking.sql), and the opt_best_agent_id on the parent run record tracks lineage so you can always trace which candidate beat which baseline on what data slice.
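As an illustration of the invariant only (the reply above says AgentForge enforces the separation through its database schema, not through code like this), here is a sketch of one way to keep the two pools disjoint: assign each scenario to a pool from a stable hash of its ID. `Pool`, `assign_pool`, `split`, and the percentage parameter are hypothetical names.

```rust
/// Which pool a scenario belongs to (hypothetical type).
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum Pool {
    Optimizer,
    HeldOut,
}

/// 64-bit FNV-1a hash. Implemented by hand so the assignment stays stable
/// across runs and toolchains; std's default hasher algorithm is unspecified.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for b in bytes {
        hash ^= *b as u64;
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Assigns a scenario to a pool from its ID alone, so the same ID always
/// lands in the same pool no matter when or where the split is computed.
pub fn assign_pool(scenario_id: &str, held_out_percent: u64) -> Pool {
    if fnv1a(scenario_id.as_bytes()) % 100 < held_out_percent {
        Pool::HeldOut
    } else {
        Pool::Optimizer
    }
}

/// Splits IDs into (optimizer pool, held-out pool). The pools are disjoint
/// by construction because each ID maps to exactly one of them.
pub fn split<'a>(ids: &[&'a str], held_out_percent: u64) -> (Vec<&'a str>, Vec<&'a str>) {
    ids.iter()
        .copied()
        .partition(|id| assign_pool(id, held_out_percent) == Pool::Optimizer)
}
```

Because membership depends only on the scenario ID, adding new scenarios later never moves an existing one between pools, which is the property the gate needs.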

On preventing overfitting to the failure cluster: here's how the loop guards against it currently:

Mixed scenario composition at promotion time - The Scenario Generator (agentforge-scenarios) always produces a blend: 50% schema-derived, 30% adversarial, 20% domain-seeded. The Optimizer only sees the failure cluster subset to generate patches, but the Gatekeeper re-evaluates against the full distribution. If a prompt patch just memorizes the failure cluster, it'll fail the broader adversarial and schema-derived scenarios at the gate.

Three-gate promotion logic - The Gatekeeper applies: a Score Gate (+3% improvement over champion), a Regression Gate (≥99% pass on previously passing scenarios), and a Stability Gate (evaluated across 3 random seeds). The Regression Gate is the direct overfitting stopper - a patch that cherry-picks known failure modes while regressing on previously passing scenarios will fail this gate.

auto_optimize with convergence termination - The loop terminates on converged / no_improvement / max_iterations rather than running unbounded. Each round, a candidate must improve by >1 percentage point on the subset to even be saved as a candidate. This prevents noisy hill-climbing from accumulating.
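The three-gate logic described above can be sketched roughly as follows. This is an illustration rather than the Gatekeeper's real code: the types and field names are hypothetical, "+3%" is read as an absolute gain of 0.03 on a 0 to 1 score, and the Stability Gate is interpreted as "every seed must pass the other two gates". All three readings are assumptions.

```rust
/// One evaluation of the candidate under a single random seed (hypothetical shape).
pub struct SeedRun {
    /// Aggregate score in [0, 1].
    pub score: f64,
    /// Scenarios the champion passed.
    pub prev_passing_total: usize,
    /// How many of those the candidate still passes.
    pub prev_passing_still_pass: usize,
}

/// Thresholds taken from the description: 0.03, 0.99, and 3 seeds.
pub struct GateConfig {
    pub min_score_gain: f64,
    pub min_regression_pass: f64,
    pub required_seeds: usize,
}

#[derive(Debug, PartialEq)]
pub enum GateOutcome {
    Promote,
    FailedScore,
    FailedRegression,
    FailedStability,
}

pub fn evaluate_gates(champion_score: f64, runs: &[SeedRun], cfg: &GateConfig) -> GateOutcome {
    // Stability gate (assumed reading): results from enough seeds must exist,
    // and every one of them must clear the score and regression gates below.
    if runs.len() < cfg.required_seeds {
        return GateOutcome::FailedStability;
    }
    for run in runs {
        // Score gate: the candidate must beat the champion by the minimum gain.
        if run.score - champion_score < cfg.min_score_gain {
            return GateOutcome::FailedScore;
        }
        // Regression gate: share of previously passing scenarios still passing.
        let pass_rate = if run.prev_passing_total == 0 {
            1.0
        } else {
            run.prev_passing_still_pass as f64 / run.prev_passing_total as f64
        };
        if pass_rate < cfg.min_regression_pass {
            return GateOutcome::FailedRegression;
        }
    }
    GateOutcome::Promote
}
```

Returning the first failed gate instead of a bare boolean makes a rejected promotion explainable, which matters when a patch improves the failure cluster but breaks scenarios that used to pass.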

What's still an open problem (and where you're right to push): The eval set mutation risk - if you keep adding new scenarios derived from failure clusters back into the main eval pool, the gate can slowly drift toward optimizing the scorer. The current approach keeps the failure cluster as an input to the Optimizer only, not as a feedback signal to grow the eval set. A more principled future direction (on the roadmap) is a locked benchmark set akin to agentforge-benchmarks that's never touched by the improvement loop - similar to how GAIA/AgentBench are used as held-out evals.