A passing agent eval is not always reassuring.
Sometimes it means the agent behaved correctly.
Sometimes it means the eval got too narrow, the fixture got stale, or the evaluator rewarded the wrong behavior.
A passing eval should leave evidence.
For agent systems, I want each eval run to record:
- model and provider
- prompt/skill version
- tool surface
- fixture state
- expected behavior
- actual behavior
- evidence path
- cost and latency
- evaluator decision
- reason code
The reason code matters because "passed" is not a diagnosis. It is a label.
This is one of the ideas behind Armorer Guard: agent gates and evaluators should create decision receipts that can be inspected later.
Repo:
https://github.com/ArmorerLabs/Armorer-Guard
And Armorer is the local layer where those agent runs can be installed, observed, stopped, repaired, and replayed:
https://github.com/ArmorerLabs/Armorer
Green dashboards are nice. Replayable receipts are better.
Top comments (0)