DEV Community

Bhavin Kotak
Building a Self-Improving AI Agent Evaluation Platform in Rust

When you're building AI agents, evaluation is only half the problem. The harder half is closing the improvement loop: taking what you learned from failing evals and automatically making the agent better.

That's what AgentForge is: an open-source platform that runs the full cycle of evaluate → cluster failures → patch prompts → re-evaluate → gate promotion.

Architecture
AgentForge ships as 16 focused Rust crates, including:

Crate                     Role
agentforge-runner         Parallel scenario execution
agentforge-scorer         Multi-dimensional LLM-as-judge
agentforge-optimizer      Automatic prompt patching
agentforge-redteam        Adversarial probing
agentforge-gatekeeper     Promotion gate
agentforge-observability  OTLP tracing + cost
agentforge-multiagent     Multi-agent orchestration
agentforge-finetune       Fine-tune dataset exporter

The Optimizer Loop
The optimizer reads failure clusters from the scorer, calls an LLM to generate a prompt patch, reruns just the failing scenarios against the patched agent, and writes the result back to the eval database. If the patched agent clears the threshold, the gatekeeper promotes it.

This runs entirely in-process with no external orchestration needed.
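
To make the loop concrete, here is a minimal sketch of one optimizer pass using only the standard library. It is not AgentForge's actual code: the `Scored` struct, the `Decision` enum, the function names, and the closures standing in for the LLM patch call and the scenario rerun are all hypothetical. It also assumes that "clears the threshold" means the mean rerun score of the previously failing scenarios is at or above the threshold, which the real gatekeeper may define differently.

```rust
use std::collections::BTreeMap;

/// One scored scenario (hypothetical shape, not AgentForge's schema).
#[derive(Debug, Clone, PartialEq)]
struct Scored {
    scenario_id: u32,
    score: f64,
    /// Failure cluster label assigned by the scorer, if any.
    cluster: Option<String>,
}

/// Outcome of a single optimizer pass (hypothetical).
#[derive(Debug, PartialEq)]
enum Decision {
    Promote,
    Reject,
    NothingToFix,
}

/// Group the IDs of scenarios scoring below `threshold` by cluster label.
fn cluster_failures(results: &[Scored], threshold: f64) -> BTreeMap<String, Vec<u32>> {
    let mut clusters: BTreeMap<String, Vec<u32>> = BTreeMap::new();
    for r in results.iter().filter(|r| r.score < threshold) {
        let label = r.cluster.clone().unwrap_or_else(|| "unclustered".to_string());
        clusters.entry(label).or_default().push(r.scenario_id);
    }
    clusters
}

/// One pass: cluster failures, patch the prompt, rerun only the failing
/// scenarios, then gate on the mean rerun score (an assumed gating rule).
/// `patch` stands in for the LLM call; `evaluate` stands in for runner + scorer.
fn optimize_once<P, E>(
    prompt: &str,
    results: &[Scored],
    threshold: f64,
    patch: P,
    evaluate: E,
) -> (String, Decision)
where
    P: Fn(&str, &BTreeMap<String, Vec<u32>>) -> String,
    E: Fn(&str, u32) -> f64,
{
    let clusters = cluster_failures(results, threshold);
    if clusters.is_empty() {
        return (prompt.to_string(), Decision::NothingToFix);
    }

    let patched = patch(prompt, &clusters);

    // Every cluster holds at least one ID, so `failing` is non-empty here.
    let failing: Vec<u32> = clusters.values().flatten().copied().collect();
    let mean = failing
        .iter()
        .map(|&id| evaluate(&patched, id))
        .sum::<f64>()
        / failing.len() as f64;

    let decision = if mean >= threshold {
        Decision::Promote
    } else {
        Decision::Reject
    };
    (patched, decision)
}
```

Passing the patch and evaluation steps as closures keeps the sketch in-process and testable without a model or a database; a real implementation would presumably also persist each rerun result before gating.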

Shadow Runs
Before promoting, you can run a shadow comparison: the current champion and the challenger handle the same traffic, scores are diffed, and only a statistically significant improvement triggers promotion.
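
As an illustration of what "statistically significant" could mean here, the sketch below runs a one-sided paired t-test on per-request scores. The post does not say which test AgentForge uses, so the choice of test, the `ShadowDiff` struct, and both function names are assumptions. The caller supplies the critical value, for example 2.132 for a one-sided 5% test with 5 pairs (4 degrees of freedom).

```rust
/// Summary of a paired champion/challenger comparison (hypothetical type).
#[derive(Debug)]
struct ShadowDiff {
    /// Mean of (challenger - champion) over the shared requests.
    mean_diff: f64,
    /// Paired t-statistic: mean difference divided by its standard error.
    t_stat: f64,
}

/// Compare scores for the same requests, index by index.
/// Returns None if the inputs are mismatched, too short, or have zero
/// variance in the differences (the t-statistic is undefined in that case).
fn paired_diff(champion: &[f64], challenger: &[f64]) -> Option<ShadowDiff> {
    if champion.len() != challenger.len() || champion.len() < 2 {
        return None;
    }
    let n = champion.len() as f64;
    let diffs: Vec<f64> = challenger
        .iter()
        .zip(champion)
        .map(|(new, old)| new - old)
        .collect();

    let mean = diffs.iter().sum::<f64>() / n;
    // Sample variance of the differences (n - 1 in the denominator).
    let var = diffs.iter().map(|d| (d - mean).powi(2)).sum::<f64>() / (n - 1.0);
    let std_err = (var / n).sqrt();
    if std_err == 0.0 {
        return None;
    }
    Some(ShadowDiff {
        mean_diff: mean,
        t_stat: mean / std_err,
    })
}

/// Promote only when the challenger's improvement exceeds `critical`.
/// An undefined comparison never promotes, which errs on the safe side.
fn should_promote(champion: &[f64], challenger: &[f64], critical: f64) -> bool {
    paired_diff(champion, challenger).map_or(false, |d| d.t_stat > critical)
}
```

Pairing the scores by request removes variation between easy and hard inputs from the comparison, which is why the same traffic is sent to both agents. With small samples the critical value should come from the t-distribution rather than the normal approximation.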

Why Rust?

  • Zero-overhead async via Tokio for high-concurrency scenario runs
  • sqlx compile-time checked queries against Postgres
  • Single static binary → ships as a GitHub Action with no runtime deps

Get started
cargo install agentforge-cli
agentforge run --agent my-agent.yaml --scenarios 50 --threshold 0.85

Or drop it into CI:
- uses: bhavinkotak/agentforge@v0.1.10
  with:
    agent_file: fixtures/my-agent.yaml
    threshold: '0.85'

Repo: https://github.com/bhavinkotak/agentforge
