DEV Community

Nagu121
Nagu121

Posted on

I tested whether AI can safely make irreversible financial decisions

description: A benchmark study on LLM reliability in crypto payment settlement and system design.

LLMs look intelligent in benchmarks, but real systems don’t fail on trivia questions. They fail when a single wrong decision causes permanent loss.

I wanted to test something different: Not "can a model reason?" but "can a model refuse unsafe actions under uncertainty?"

The Setup

I built a small benchmark simulating a crypto payment settlement agent. For each scenario, the model must decide:

  • SETTLE: Accept payment.
  • REJECT: Unsafe payment.
  • PENDING: Insufficient certainty.

The cases are operational failures rather than math puzzles:

  • RPC nodes disagreeing
  • Delayed confirmations
  • Replayed transactions
  • Wrong recipient addresses
  • Chain reorg risk
  • Race conditions

GitHub logo nagu-io / agent-settlement-bench

Benchmark for evaluating safety of AI agents in irreversible financial decisions (crypto payment settlement, consensus conflicts, replay attacks, finality races).

AgentSettlementBench

Safety benchmark for AI agents making irreversible financial decisions.

AgentSettlementBench is the first benchmark that tests whether AI agents safely handle irreversible money decisions, not just whether they answer questions correctly.

Status Domain Smoke

Result Snapshot (Public Leaderboard)

It evaluates whether LLMs correctly refuse unsafe blockchain payments under adversarial conditions (reorgs, spoofed tokens, RPC disagreement, race conditions).

Model Accuracy Critical Fail Rate Risk-Weighted Fail
Codex 50.0% 30.0% 40.0%
Gemini 3.1 55.0% 28.6% 39.9%
Claude Haiku (subset 13/20) 84.6% 0.0% 15.0%
ChatGPT-4.1 (subset 10/20) 90.0% 0.0% 9.0%
MiniMax-2.5 (subset 10/20) 80.0% 20.0% 24.0%

Subset rows are reference-only and not leaderboard-eligible.

Mental Model

Traditional benchmarks: question -> answer -> score

AgentSettlementBench: event -> financial decision -> irreversible consequence

We measure whether the agent refuses unsafe actions, not whether it sounds intelligent.

What you get

Running the benchmark produces:

  • Safety accuracy score
  • Critical failure rate (money loss risk)
  • Risk-weighted reliability score

Example:

Accuracy: 55%
Critical

What Surprised Me

The results showed a massive gap based on how the model was instructed.

Mode Accuracy Critical Failures
Strict Policy ~100% 0%
Open Reasoning ~55% ~28%

The failures weren't scams; they clustered around consensus uncertainty, timing boundaries, and conflicting sources of truth. The limitation wasn't intelligence—it was decision authority.

The Important Observation

If the model makes the final decision, it is unsafe. If the model only recommends and a deterministic state machine decides, it is much safer.

Architecture Recommendation:
LLM (Recommendation) → State Machine (Final Decision)

Safety improved significantly without improving the model itself. Reliability came from system design, not "smarter" AI.

Why This Matters

Most AI evaluations measure knowledge. But deployed agents operate with incomplete information and asynchronous state. These failures won't appear in traditional benchmarks.

I suspect small local models (Qwen, Mistral, Llama) with a strict verifier might outperform frontier models acting alone.

Run the Benchmark

If you want to test this yourself:


bash
git clone https://github.com/nagu-io/agent-settlement-bench
cd agent-settlement-bench
npm install
npm run benchmark
Enter fullscreen mode Exit fullscreen mode

Top comments (1)

Collapse
 
nagu2103 profile image
Nagu121

Dev.to boosts active authors heavily.