I tested whether AI can safely make irreversible financial decisions

Nagu121 — Wed, 25 Feb 2026 15:47:09 +0000

description: A benchmark study on LLM reliability in crypto payment settlement and system design.

LLMs look intelligent in benchmarks, but real systems don’t fail on trivia questions. They fail when a single wrong decision causes permanent loss.

I wanted to test something different: Not "can a model reason?" but "can a model refuse unsafe actions under uncertainty?"

The Setup

I built a small benchmark simulating a crypto payment settlement agent. For each scenario, the model must decide:

SETTLE: Accept payment.
REJECT: Unsafe payment.
PENDING: Insufficient certainty.

The cases are operational failures rather than math puzzles:

RPC nodes disagreeing
Delayed confirmations
Replayed transactions
Wrong recipient addresses
Chain reorg risk
Race conditions

nagu-io / agent-settlement-bench

Benchmark for evaluating safety of AI agents in irreversible financial decisions (crypto payment settlement, consensus conflicts, replay attacks, finality races).

AgentSettlementBench

Safety benchmark for AI agents making irreversible financial decisions.

AgentSettlementBench is the first benchmark that tests whether AI agents safely handle irreversible money decisions, not just whether they answer questions correctly.

Result Snapshot (Public Leaderboard)

It evaluates whether LLMs correctly refuse unsafe blockchain payments under adversarial conditions (reorgs, spoofed tokens, RPC disagreement, race conditions).

Model	Accuracy	Critical Fail Rate	Risk-Weighted Fail
Codex	50.0%	30.0%	40.0%
Gemini 3.1	55.0%	28.6%	39.9%
Claude Haiku (subset 13/20)	84.6%	0.0%	15.0%
ChatGPT-4.1 (subset 10/20)	90.0%	0.0%	9.0%
MiniMax-2.5 (subset 10/20)	80.0%	20.0%	24.0%

Subset rows are reference-only and not leaderboard-eligible.

Mental Model

Traditional benchmarks: question -> answer -> score

AgentSettlementBench: event -> financial decision -> irreversible consequence

We measure whether the agent refuses unsafe actions, not whether it sounds intelligent.

What you get

Running the benchmark produces:

Safety accuracy score
Critical failure rate (money loss risk)
Risk-weighted reliability score

Example:

Accuracy: 55%
Critical

…

View on GitHub

What Surprised Me

The results showed a massive gap based on how the model was instructed.

Mode	Accuracy	Critical Failures
Strict Policy	~100%	0%
Open Reasoning	~55%	~28%

The failures weren't scams; they clustered around consensus uncertainty, timing boundaries, and conflicting sources of truth. The limitation wasn't intelligence—it was decision authority.

The Important Observation

If the model makes the final decision, it is unsafe. If the model only recommends and a deterministic state machine decides, it is much safer.

Architecture Recommendation:
LLM (Recommendation) → State Machine (Final Decision)

Safety improved significantly without improving the model itself. Reliability came from system design, not "smarter" AI.

Why This Matters

Most AI evaluations measure knowledge. But deployed agents operate with incomplete information and asynchronous state. These failures won't appear in traditional benchmarks.

I suspect small local models (Qwen, Mistral, Llama) with a strict verifier might outperform frontier models acting alone.

Run the Benchmark

If you want to test this yourself:


bash
git clone https://github.com/nagu-io/agent-settlement-bench
cd agent-settlement-bench
npm install
npm run benchmark

DEV Community: Nagu121