When you're building AI agents, evaluation is only half the problem. The harder half is closing the improvement loop: taking what you learned from failing evals and automatically making the agent better.
That's what AgentForge is for. It's an open-source platform that runs the full cycle: evaluate → cluster failures → patch prompts → re-evaluate → gate promotion.
Architecture
AgentForge ships as 16 focused Rust crates, including:
| Crate | Role |
|---|---|
| agentforge-runner | Parallel scenario execution |
| agentforge-scorer | Multi-dimensional LLM-as-judge |
| agentforge-optimizer | Automatic prompt patching |
| agentforge-redteam | Adversarial probing |
| agentforge-gatekeeper | Promotion gate |
| agentforge-observability | OTLP tracing + cost |
| agentforge-multiagent | Multi-agent orchestration |
| agentforge-finetune | Fine-tune dataset exporter |
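To make the handoff from scorer to optimizer concrete, here is a minimal Rust sketch of one way multi-dimensional judge scores could be turned into failure clusters. Everything in it is an assumption for illustration: `ScoredRun`, `cluster_failures`, the dimension names, and the rule of grouping each failing run under its weakest dimension are hypothetical, not AgentForge's actual API or clustering method.

```rust
use std::collections::BTreeMap;

// Hypothetical shape of one scored scenario run; not an AgentForge type.
struct ScoredRun {
    scenario_id: u32,
    // (dimension name, judge score in [0.0, 1.0]); the names are invented.
    dimensions: Vec<(&'static str, f64)>,
}

impl ScoredRun {
    // Unweighted mean across dimensions; 0.0 if nothing was scored.
    fn overall(&self) -> f64 {
        if self.dimensions.is_empty() {
            return 0.0;
        }
        let total: f64 = self.dimensions.iter().map(|(_, s)| s).sum();
        total / self.dimensions.len() as f64
    }

    // Name of the lowest-scoring dimension, if any dimension exists.
    fn weakest_dimension(&self) -> Option<&'static str> {
        self.dimensions
            .iter()
            .min_by(|a, b| a.1.total_cmp(&b.1))
            .map(|(name, _)| *name)
    }
}

// Assumed clustering rule: a run fails when its overall score is below the
// threshold, and it is filed under its weakest dimension.
fn cluster_failures(runs: &[ScoredRun], threshold: f64) -> BTreeMap<&'static str, Vec<u32>> {
    let mut clusters = BTreeMap::new();
    for run in runs.iter().filter(|r| r.overall() < threshold) {
        if let Some(dim) = run.weakest_dimension() {
            clusters.entry(dim).or_insert_with(Vec::new).push(run.scenario_id);
        }
    }
    clusters
}

// Made-up scores used only to exercise the sketch.
fn sample_runs() -> Vec<ScoredRun> {
    vec![
        ScoredRun { scenario_id: 1, dimensions: vec![("accuracy", 0.95), ("tone", 0.9), ("safety", 1.0)] },
        ScoredRun { scenario_id: 2, dimensions: vec![("accuracy", 0.4), ("tone", 0.9), ("safety", 0.9)] },
        ScoredRun { scenario_id: 3, dimensions: vec![("accuracy", 0.9), ("tone", 0.3), ("safety", 0.8)] },
        ScoredRun { scenario_id: 4, dimensions: vec![("accuracy", 0.5), ("tone", 0.8), ("safety", 0.9)] },
    ]
}

fn main() {
    for (dimension, scenario_ids) in cluster_failures(&sample_runs(), 0.85) {
        println!("{dimension}: {scenario_ids:?}");
    }
}
```

A real clusterer would likely look at judge rationales or embeddings rather than a single weakest score. The sketch only shows the shape of the data a downstream optimizer might consume: a label plus the scenario IDs that failed for that reason.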
The Optimizer Loop
The optimizer reads failure clusters from the scorer, calls an LLM to generate a prompt patch, reruns just the failing scenarios against the patched agent, and writes the result back to the eval database. If the patched agent clears the threshold, the gatekeeper promotes it.
This runs entirely in-process with no external orchestration needed.
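The loop described above can be sketched in a few dozen lines of Rust. This is a sketch under stated assumptions, not the agentforge-optimizer code: the `Patcher` and `Evaluator` traits stand in for the LLM call and the scenario rerun, `optimize_once` and every type name are hypothetical, patches are simply appended to the prompt, and the write-back to the eval database is left out.

```rust
// Hypothetical types; none of these names come from the AgentForge crates.
#[derive(Debug, Clone, PartialEq)]
struct Agent {
    prompt: String,
}

struct FailureCluster {
    label: String,
    scenario_ids: Vec<u32>,
}

// Stand-in for the LLM call that proposes a prompt patch for one cluster.
trait Patcher {
    fn propose_patch(&self, agent: &Agent, cluster: &FailureCluster) -> String;
}

// Stand-in for rerunning scenarios; returns a pass rate in [0.0, 1.0].
trait Evaluator {
    fn pass_rate(&self, agent: &Agent, scenario_ids: &[u32]) -> f64;
}

#[derive(Debug, PartialEq)]
enum Outcome {
    Promote(Agent),
    Keep,
}

// One pass of the loop: patch per cluster, rerun only the failing scenarios,
// then apply the threshold gate.
fn optimize_once(
    agent: &Agent,
    clusters: &[FailureCluster],
    patcher: &dyn Patcher,
    evaluator: &dyn Evaluator,
    threshold: f64,
) -> Outcome {
    let mut candidate = agent.clone();
    let mut failing: Vec<u32> = Vec::new();
    for cluster in clusters {
        let patch = patcher.propose_patch(&candidate, cluster);
        // Assumption: a patch is extra instruction text appended to the prompt.
        candidate.prompt.push('\n');
        candidate.prompt.push_str(&patch);
        failing.extend(&cluster.scenario_ids);
    }
    if failing.is_empty() {
        return Outcome::Keep; // nothing failed, so there is nothing to patch
    }
    if evaluator.pass_rate(&candidate, &failing) >= threshold {
        Outcome::Promote(candidate)
    } else {
        Outcome::Keep
    }
}

// Toy implementations so the sketch runs without an LLM or a scenario runner.
struct CannedPatcher;
impl Patcher for CannedPatcher {
    fn propose_patch(&self, _agent: &Agent, cluster: &FailureCluster) -> String {
        format!("When handling {}, ask for the missing details first.", cluster.label)
    }
}

struct KeywordEvaluator;
impl Evaluator for KeywordEvaluator {
    fn pass_rate(&self, agent: &Agent, _scenario_ids: &[u32]) -> f64 {
        // Toy rule: every scenario passes once the prompt contains the patch text.
        if agent.prompt.contains("missing details") { 1.0 } else { 0.0 }
    }
}

fn main() {
    let agent = Agent { prompt: "You are a support agent.".to_string() };
    let clusters = vec![FailureCluster {
        label: "refund requests".to_string(),
        scenario_ids: vec![3, 7],
    }];
    let outcome = optimize_once(&agent, &clusters, &CannedPatcher, &KeywordEvaluator, 0.85);
    println!("{outcome:?}");
}
```

Putting the LLM and the runner behind traits keeps the loop itself testable with canned responses, which matters when the real calls are slow and cost money.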
Shadow Runs
Before promoting, you can run a shadow comparison: the current champion and the challenger handle the same traffic, scores are diffed, and only a statistically significant improvement triggers promotion.
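As a rough illustration of what "statistically significant" could mean here, the sketch below runs a one-sided paired t-test on per-scenario scores. The choice of test is an assumption, since the post does not say which one AgentForge uses, and `paired_t_statistic`, `should_promote`, and the sample scores are hypothetical.

```rust
// Paired t-statistic over per-scenario differences (challenger - champion).
// Returns None if the slices differ in length or hold fewer than two pairs.
fn paired_t_statistic(champion: &[f64], challenger: &[f64]) -> Option<f64> {
    if champion.len() != challenger.len() || champion.len() < 2 {
        return None;
    }
    let n = champion.len() as f64;
    let diffs: Vec<f64> = champion.iter().zip(challenger).map(|(a, b)| b - a).collect();
    let mean = diffs.iter().sum::<f64>() / n;
    // Sample variance of the differences (n - 1 in the denominator).
    let variance = diffs.iter().map(|d| (d - mean).powi(2)).sum::<f64>() / (n - 1.0);
    let std_err = (variance / n).sqrt();
    if std_err == 0.0 {
        // All differences are identical, so only the sign of the mean matters.
        return Some(if mean > 0.0 {
            f64::INFINITY
        } else if mean < 0.0 {
            f64::NEG_INFINITY
        } else {
            0.0
        });
    }
    Some(mean / std_err)
}

// Promote only if the challenger's gain exceeds the critical value.
// The caller picks `critical` for its sample size and significance level.
fn should_promote(champion: &[f64], challenger: &[f64], critical: f64) -> bool {
    matches!(paired_t_statistic(champion, challenger), Some(t) if t > critical)
}

fn main() {
    // Made-up scores for six shared scenarios.
    let champion = [0.70, 0.72, 0.68, 0.71, 0.69, 0.70];
    let consistent_gain = [0.80, 0.79, 0.81, 0.78, 0.82, 0.80];
    let noisy = [0.75, 0.66, 0.74, 0.65, 0.73, 0.70];

    // 2.015 is the one-sided 5% critical value for 5 degrees of freedom.
    let critical = 2.015;
    println!("consistent gain: {}", should_promote(&champion, &consistent_gain, critical));
    println!("noisy: {}", should_promote(&champion, &noisy, critical));
}
```

Pairing by scenario matters because both agents see the same traffic: differencing each pair removes scenario difficulty from the comparison, so a small but consistent gain can clear the bar while a noisy one does not.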
Why Rust?
- Zero-overhead async via Tokio for high-concurrency scenario runs
- sqlx compile-time checked queries against Postgres
- Single static binary → ships as a GitHub Action with no runtime deps
Get Started
    cargo install agentforge-cli
    agentforge run --agent my-agent.yaml --scenarios 50 --threshold 0.85
Or drop it into CI:
    - uses: bhavinkotak/agentforge@v0.1.10
      with:
        agent_file: fixtures/my-agent.yaml
        threshold: '0.85'