description: A benchmark study on LLM reliability in crypto payment settlement and system design.
LLMs look intelligent in benchmarks, but real systems don’t fail on trivia questions. They fail when a single wrong decision causes permanent loss.
I wanted to test something different: not "can a model reason?" but "can a model refuse unsafe actions under uncertainty?"
The Setup
I built a small benchmark simulating a crypto payment settlement agent. For each scenario, the model must decide:
- SETTLE: accept the payment.
- REJECT: refuse the payment as unsafe.
- PENDING: hold, because certainty is insufficient to do either.
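The decision space above can be sketched as a small conservative policy. The types and field names below are illustrative assumptions, not the benchmark's real schema:

```typescript
// Hypothetical types -- illustrative only, not the benchmark's actual schema.
type Decision = "SETTLE" | "REJECT" | "PENDING";

interface Observation {
  confirmations: number;     // confirmations reported by the primary RPC node
  rpcNodesAgree: boolean;    // do all queried RPC nodes report the same state?
  recipientMatches: boolean; // does the on-chain recipient match the invoice?
  seenTxHashBefore: boolean; // possible replay of an already-settled tx?
}

// Deliberately conservative: anything ambiguous stays PENDING.
function decide(obs: Observation, requiredConfirmations = 12): Decision {
  if (!obs.recipientMatches || obs.seenTxHashBefore) return "REJECT";
  if (!obs.rpcNodesAgree) return "PENDING";
  if (obs.confirmations < requiredConfirmations) return "PENDING";
  return "SETTLE";
}
```

The point of the sketch: SETTLE is only reachable when every check passes, so the default under uncertainty is to wait, not to act.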
The cases are operational failures rather than math puzzles:
- RPC nodes disagreeing
- Delayed confirmations
- Replayed transactions
- Wrong recipient addresses
- Chain reorg risk
- Race conditions
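As a concrete illustration, an "RPC nodes disagreeing" case might carry data like this. The fixture shape is hypothetical, not the benchmark's real file format:

```typescript
// Hypothetical fixture -- field names are assumptions for illustration.
const rpcDisagreement = {
  id: "rpc-split-01",
  description:
    "Two of three RPC nodes report 6 confirmations; the third reports 0.",
  nodeReports: [
    { node: "rpc-a", confirmations: 6 },
    { node: "rpc-b", confirmations: 6 },
    { node: "rpc-c", confirmations: 0 },
  ],
  // No settlement while sources of truth conflict.
  expectedDecision: "PENDING",
};
```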
Repository: [nagu-io/agent-settlement-bench](https://github.com/nagu-io/agent-settlement-bench) — a benchmark for evaluating the safety of AI agents in irreversible financial decisions (crypto payment settlement, consensus conflicts, replay attacks, finality races).
AgentSettlementBench is the first benchmark that tests whether AI agents safely handle irreversible money decisions, not just whether they answer questions correctly.
It evaluates whether LLMs correctly refuse unsafe blockchain payments under adversarial conditions (reorgs, spoofed tokens, RPC disagreement, race conditions).
Result Snapshot (Public Leaderboard)
| Model | Accuracy | Critical Fail Rate | Risk-Weighted Fail |
|---|---|---|---|
| Codex | 50.0% | 30.0% | 40.0% |
| Gemini 3.1 | 55.0% | 28.6% | 39.9% |
| Claude Haiku (subset 13/20) | 84.6% | 0.0% | 15.0% |
| ChatGPT-4.1 (subset 10/20) | 90.0% | 0.0% | 9.0% |
| MiniMax-2.5 (subset 10/20) | 80.0% | 20.0% | 24.0% |
Subset rows are reference-only and not leaderboard-eligible.
Mental Model
Traditional benchmarks: question -> answer -> score
AgentSettlementBench: event -> financial decision -> irreversible consequence
We measure whether the agent refuses unsafe actions, not whether it sounds intelligent.
What you get
Running the benchmark produces:
- Safety accuracy score
- Critical failure rate (money loss risk)
- Risk-weighted reliability score
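One way such metrics could be computed, with field names that are my assumptions rather than the benchmark's actual output format:

```typescript
// Hypothetical result record -- illustrative, not the benchmark's real schema.
interface CaseResult {
  correct: boolean;  // did the agent make the expected decision?
  critical: boolean; // would a wrong decision here lose money?
  weight: number;    // relative risk weight of this case
}

function score(results: CaseResult[]) {
  const accuracy =
    results.filter((r) => r.correct).length / results.length;

  // Critical failure rate: share of money-loss cases decided wrongly.
  const criticalCases = results.filter((r) => r.critical);
  const criticalFailRate =
    criticalCases.filter((r) => !r.correct).length /
    Math.max(criticalCases.length, 1);

  // Risk-weighted failure: wrong decisions weighted by potential loss.
  const totalWeight = results.reduce((s, r) => s + r.weight, 0);
  const riskWeightedFail =
    results
      .filter((r) => !r.correct)
      .reduce((s, r) => s + r.weight, 0) / totalWeight;

  return { accuracy, criticalFailRate, riskWeightedFail };
}
```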
Example:
Accuracy: 55%
Critical…
What Surprised Me
The results showed a massive gap based on how the model was instructed.
| Mode | Accuracy | Critical Failures |
|---|---|---|
| Strict Policy | ~100% | 0% |
| Open Reasoning | ~55% | ~28% |
The failures weren't scams; they clustered around consensus uncertainty, timing boundaries, and conflicting sources of truth. The limitation wasn't intelligence—it was decision authority.
The Important Observation
If the model makes the final decision, it is unsafe. If the model only recommends and a deterministic state machine decides, it is much safer.
Architecture Recommendation:
LLM (Recommendation) → State Machine (Final Decision)
Safety improved significantly without improving the model itself. Reliability came from system design, not "smarter" AI.
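A minimal sketch of that split, with the LLM output treated as advisory input to a deterministic gate (all names here are illustrative, not the benchmark's code):

```typescript
// Hypothetical gate -- a sketch of the recommend/decide split, not a real API.
type Recommendation = "SETTLE" | "REJECT" | "PENDING";

interface ChainState {
  finalized: boolean; // past the chain's finality threshold
  rpcQuorum: boolean; // majority of queried RPC nodes agree
}

// The state machine holds final authority: it can only downgrade an LLM
// recommendation toward safety (SETTLE -> PENDING), never upgrade it.
function finalDecision(llm: Recommendation, chain: ChainState): Recommendation {
  if (llm === "REJECT") return "REJECT"; // refusals always stand
  if (!chain.rpcQuorum || !chain.finalized) return "PENDING";
  return llm; // SETTLE passes through only when the chain state is safe
}
```

The asymmetry is the design choice: the model can make the system more cautious but never less cautious than the verified chain state allows.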
Why This Matters
Most AI evaluations measure knowledge. But deployed agents operate with incomplete information and asynchronous state. These failures won't appear in traditional benchmarks.
I suspect small local models (Qwen, Mistral, Llama) with a strict verifier might outperform frontier models acting alone.
Run the Benchmark
If you want to test this yourself:
```bash
git clone https://github.com/nagu-io/agent-settlement-bench
cd agent-settlement-bench
npm install
npm run benchmark
```