Agent evals should explain why they passed

#agents #ai #llm #testing

A passing agent eval is not always reassuring.

Sometimes it means the agent behaved correctly.

Sometimes it means the eval got too narrow, the fixture got stale, or the evaluator rewarded the wrong behavior.

A passing eval should leave evidence.

For agent systems, I want each eval run to record:

model and provider
prompt/skill version
tool surface
fixture state
expected behavior
actual behavior
evidence path
cost and latency
evaluator decision
reason code

The reason code matters because "passed" is not a diagnosis. It is a label.

This is one of the ideas behind Armorer Guard: agent gates and evaluators should create decision receipts that can be inspected later.

Repo:
https://github.com/ArmorerLabs/Armorer-Guard

And Armorer is the local layer where those agent runs can be installed, observed, stopped, repaired, and replayed:
https://github.com/ArmorerLabs/Armorer

Green dashboards are nice. Replayable receipts are better.

DEV Community

Agent evals should explain why they passed

Top comments (0)