Building a Self-Improving AI Agent Evaluation Platform in Rust

Bhavin Kotak — Fri, 29 May 2026 03:20:19 +0000

When you're building AI agents, evaluation is only half the problem. The harder half is closing the improvement loop: taking what you learned from failing evals and automatically making the agent better.

AgentForge is an open-source platform built for exactly that. It runs the full cycle: evaluate → cluster failures → patch prompts → re-evaluate → gate promotion.

Architecture
AgentForge ships as 16 focused Rust crates, including:

Crate	Role
`agentforge-runner`	Parallel scenario execution
`agentforge-scorer`	Multi-dimensional LLM-as-judge
`agentforge-optimizer`	Automatic prompt patching
`agentforge-redteam`	Adversarial probing
`agentforge-gatekeeper`	Promotion gate
`agentforge-observability`	OTLP tracing + cost
`agentforge-multiagent`	Multi-agent orchestration
`agentforge-finetune`	Fine-tune dataset exporter

The Optimizer Loop
The optimizer reads failure clusters from the scorer, calls an LLM to generate a prompt patch, reruns just the failing scenarios against the patched agent, and writes the result back to the eval database. If the patched agent clears the threshold, the gatekeeper promotes it.

The whole loop runs in-process, with no external orchestrator required.
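To make the loop concrete, here is a minimal sketch of one optimizer pass. Every type and function in it (`Agent`, `ScenarioResult`, `evaluate`, `cluster_failures`, `generate_patch`, `optimize_once`) is a hypothetical stand-in, not AgentForge's actual API. The judge is a toy keyword check and the "LLM patch" is a string append, so the control flow is the only thing to take from it.

```rust
use std::collections::BTreeMap;

// All names below are hypothetical stand-ins, not AgentForge's real API.

/// (scenario id, keyword the prompt must mention for the toy judge to pass it)
type Scenario = (u32, &'static str);

#[derive(Debug, Clone)]
struct Agent {
    prompt: String,
}

#[derive(Debug, Clone)]
struct ScenarioResult {
    scenario_id: u32,
    score: f64,                  // 1.0 = pass, 0.0 = fail in this toy judge
    failure_tag: Option<String>, // cluster label when the scenario failed
}

struct Outcome {
    agent: Agent,
    score: f64,
    promoted: bool,
}

/// Toy judge: a scenario passes when the prompt mentions its keyword.
fn evaluate(agent: &Agent, scenarios: &[Scenario]) -> Vec<ScenarioResult> {
    scenarios
        .iter()
        .map(|&(scenario_id, keyword)| {
            let passed = agent.prompt.contains(keyword);
            ScenarioResult {
                scenario_id,
                score: if passed { 1.0 } else { 0.0 },
                failure_tag: if passed { None } else { Some(keyword.to_string()) },
            }
        })
        .collect()
}

/// Group failing scenario ids by failure tag.
fn cluster_failures(results: &[ScenarioResult]) -> BTreeMap<String, Vec<u32>> {
    let mut clusters: BTreeMap<String, Vec<u32>> = BTreeMap::new();
    for r in results {
        if let Some(tag) = &r.failure_tag {
            clusters.entry(tag.clone()).or_default().push(r.scenario_id);
        }
    }
    clusters
}

/// Stand-in for the LLM call that writes a prompt patch.
fn generate_patch(agent: &Agent, clusters: &BTreeMap<String, Vec<u32>>) -> Agent {
    let mut prompt = agent.prompt.clone();
    for tag in clusters.keys() {
        prompt.push_str(&format!(" Always handle {}.", tag));
    }
    Agent { prompt }
}

fn mean_score(results: &[ScenarioResult]) -> f64 {
    if results.is_empty() {
        return 0.0;
    }
    results.iter().map(|r| r.score).sum::<f64>() / results.len() as f64
}

/// One pass: evaluate, cluster, patch, rerun only the failures, then gate.
fn optimize_once(champion: &Agent, scenarios: &[Scenario], threshold: f64) -> Outcome {
    let baseline = evaluate(champion, scenarios);
    let clusters = cluster_failures(&baseline);
    if clusters.is_empty() {
        // Nothing failed, so there is no challenger to promote.
        return Outcome { agent: champion.clone(), score: mean_score(&baseline), promoted: false };
    }
    let challenger = generate_patch(champion, &clusters);
    let failing_ids: Vec<u32> = clusters.values().flatten().copied().collect();
    let failing: Vec<Scenario> = scenarios
        .iter()
        .copied()
        .filter(|(id, _)| failing_ids.contains(id))
        .collect();
    // Keep baseline passes and replace the failures with the rerun results.
    let mut merged: Vec<ScenarioResult> =
        baseline.into_iter().filter(|r| r.failure_tag.is_none()).collect();
    merged.extend(evaluate(&challenger, &failing));
    let score = mean_score(&merged);
    Outcome { agent: challenger, score, promoted: score >= threshold }
}

fn main() {
    let champion = Agent { prompt: "You are a support agent.".to_string() };
    let scenarios: [Scenario; 4] = [(1, "support"), (2, "refunds"), (3, "agent"), (4, "refunds")];
    let outcome = optimize_once(&champion, &scenarios, 0.85);
    println!("prompt: {}", outcome.agent.prompt);
    println!("score={:.2} promoted={}", outcome.score, outcome.promoted);
}
```

In this toy run, the two "refunds" scenarios fail on the baseline, land in one cluster, and pass after the patch, so the challenger clears the 0.85 threshold. A real pass would swap in an LLM judge and an LLM-generated patch and persist results to the eval database instead of returning them.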

Shadow Runs
Before promoting, you can run a shadow comparison: the current champion and the challenger handle the same traffic, scores are diffed, and only a statistically significant improvement triggers promotion.
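The post does not say which statistical test the shadow comparison uses, so the sketch below assumes one plausible choice: a paired, one-sided z-test (normal approximation) on per-request score differences. `ShadowReport`, `shadow_verdict`, and the 1.645 critical value (roughly a 5% one-sided level) are illustrative assumptions, not AgentForge's API or defaults.

```rust
// Hypothetical sketch: the names and the choice of test are assumptions.

#[derive(Debug)]
struct ShadowReport {
    mean_diff: f64, // average (challenger - champion) score per request
    z: f64,         // mean_diff divided by its standard error
    promote: bool,
}

/// Paired one-sided z-test on scores from the same requests.
/// Returns None when the samples are not paired or are too small.
fn shadow_verdict(champion: &[f64], challenger: &[f64], z_critical: f64) -> Option<ShadowReport> {
    if champion.len() != challenger.len() || champion.len() < 2 {
        return None;
    }
    let n = champion.len() as f64;
    let diffs: Vec<f64> = challenger.iter().zip(champion).map(|(c, b)| c - b).collect();
    let mean_diff = diffs.iter().sum::<f64>() / n;
    // Sample variance of the differences (n - 1 in the denominator).
    let variance = diffs.iter().map(|d| (d - mean_diff).powi(2)).sum::<f64>() / (n - 1.0);
    let std_err = (variance / n).sqrt();
    let z = if std_err > 0.0 {
        mean_diff / std_err
    } else if mean_diff > 0.0 {
        f64::INFINITY // identical positive gain on every request
    } else if mean_diff < 0.0 {
        f64::NEG_INFINITY
    } else {
        0.0
    };
    Some(ShadowReport { mean_diff, z, promote: mean_diff > 0.0 && z > z_critical })
}

fn main() {
    // Made-up scores for eight mirrored requests.
    let champion = [0.70, 0.80, 0.75, 0.60, 0.90, 0.65, 0.72, 0.81];
    let challenger = [0.80, 0.85, 0.83, 0.72, 0.92, 0.74, 0.79, 0.87];
    match shadow_verdict(&champion, &challenger, 1.645) {
        Some(r) => println!("mean_diff={:.4} z={:.2} promote={}", r.mean_diff, r.z, r.promote),
        None => println!("samples are not comparable"),
    }
}
```

Pairing matters here because both agents see the same traffic: differencing per request removes the variance that comes from some requests simply being harder than others. With only a handful of requests a t-test would be the more careful choice. The normal approximation is used to keep the sketch dependency-free.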

Why Rust?

Zero-overhead async via Tokio for high-concurrency scenario runs
sqlx compile-time checked queries against Postgres
Single static binary → ships as a GitHub Action with no runtime deps

Get started
cargo install agentforge-cli
agentforge run --agent my-agent.yaml --scenarios 50 --threshold 0.85

Or drop it into CI:
- uses: bhavinkotak/agentforge@v0.1.10
  with:
    agent_file: fixtures/my-agent.yaml
    threshold: '0.85'

Repo: https://github.com/bhavinkotak/agentforge

DEV Community: Bhavin Kotak
