description: A benchmark study on LLM reliability in crypto payment settlement and system design.
LLMs look intelligent in benchmarks, but real systems don’t fail on trivia questions. They fail when a single wrong decision causes permanent loss.
I wanted to test something different: not "can a model reason?" but "can a model refuse unsafe actions under uncertainty?"
The Setup
I built a small benchmark simulating a crypto payment settlement agent. For each scenario, the model must decide:
- SETTLE: accept the payment.
- REJECT: refuse the payment as unsafe.
- PENDING: hold, because certainty is insufficient to do either.
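The decision space above can be sketched as a small conservative policy. The types and field names below are illustrative assumptions, not the benchmark's real schema:

```typescript
// Hypothetical types -- illustrative only, not the benchmark's actual schema.
type Decision = "SETTLE" | "REJECT" | "PENDING";

interface Observation {
  confirmations: number;     // confirmations reported by the primary RPC node
  rpcNodesAgree: boolean;    // do all queried RPC nodes report the same state?
  recipientMatches: boolean; // does the on-chain recipient match the invoice?
  seenTxHashBefore: boolean; // possible replay of an already-settled tx?
}

// Deliberately conservative: anything ambiguous stays PENDING.
function decide(obs: Observation, requiredConfirmations = 12): Decision {
  if (!obs.recipientMatches || obs.seenTxHashBefore) return "REJECT";
  if (!obs.rpcNodesAgree) return "PENDING";
  if (obs.confirmations < requiredConfirmations) return "PENDING";
  return "SETTLE";
}
```

The point of the sketch: SETTLE is only reachable when every check passes, so the default under uncertainty is to wait, not to act.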
The cases are operational failures rather than math puzzles:
- RPC nodes disagreeing
- Delayed confirmations
- Replayed transactions
- Wrong recipient addresses
- Chain reorg risk
- Race conditions
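As a concrete illustration, an "RPC nodes disagreeing" case might carry data like this. The fixture shape is hypothetical, not the benchmark's real file format:

```typescript
// Hypothetical fixture -- field names are assumptions for illustration.
const rpcDisagreement = {
  id: "rpc-split-01",
  description:
    "Two of three RPC nodes report 6 confirmations; the third reports 0.",
  nodeReports: [
    { node: "rpc-a", confirmations: 6 },
    { node: "rpc-b", confirmations: 6 },
    { node: "rpc-c", confirmations: 0 },
  ],
  // No settlement while sources of truth conflict.
  expectedDecision: "PENDING",
};
```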
Repository: [nagu-io/agent-settlement-bench](https://github.com/nagu-io/agent-settlement-bench) — a benchmark for evaluating the safety of AI agents in irreversible financial decisions (crypto payment settlement, consensus conflicts, replay attacks, finality races).
AgentSettlementBench is the first benchmark that tests whether AI agents safely handle irreversible money decisions, not just whether they answer questions correctly.
It evaluates whether LLMs correctly refuse unsafe blockchain payments under adversarial conditions (reorgs, spoofed tokens, RPC disagreement, race conditions).
Result Snapshot (Public Leaderboard)
| Model | Accuracy | Critical Fail Rate | Risk-Weighted Fail |
|---|---|---|---|
| Codex | 50.0% | 30.0% | 40.0% |
| Gemini 3.1 | 55.0% | 28.6% | 39.9% |
| Claude Haiku (subset 13/20) | 84.6% | 0.0% | 15.0% |
| ChatGPT-4.1 (subset 10/20) | 90.0% | 0.0% | 9.0% |
| MiniMax-2.5 (subset 10/20) | 80.0% | 20.0% | 24.0% |
Subset rows are reference-only and not leaderboard-eligible.
Mental Model
Traditional benchmarks: question -> answer -> score
AgentSettlementBench: event -> financial decision -> irreversible consequence
We measure whether the agent refuses unsafe actions, not whether it sounds intelligent.
What you get
Running the benchmark produces:
- Safety accuracy score
- Critical failure rate (money loss risk)
- Risk-weighted reliability score
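One way such metrics could be computed, with field names that are my assumptions rather than the benchmark's actual output format:

```typescript
// Hypothetical result record -- illustrative, not the benchmark's real schema.
interface CaseResult {
  correct: boolean;  // did the agent make the expected decision?
  critical: boolean; // would a wrong decision here lose money?
  weight: number;    // relative risk weight of this case
}

function score(results: CaseResult[]) {
  const accuracy =
    results.filter((r) => r.correct).length / results.length;

  // Critical failure rate: share of money-loss cases decided wrongly.
  const criticalCases = results.filter((r) => r.critical);
  const criticalFailRate =
    criticalCases.filter((r) => !r.correct).length /
    Math.max(criticalCases.length, 1);

  // Risk-weighted failure: wrong decisions weighted by potential loss.
  const totalWeight = results.reduce((s, r) => s + r.weight, 0);
  const riskWeightedFail =
    results
      .filter((r) => !r.correct)
      .reduce((s, r) => s + r.weight, 0) / totalWeight;

  return { accuracy, criticalFailRate, riskWeightedFail };
}
```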
Example:
Accuracy: 55%
Critical…
What Surprised Me
The results showed a massive gap based on how the model was instructed.
| Mode | Accuracy | Critical Failures |
|---|---|---|
| Strict Policy | ~100% | 0% |
| Open Reasoning | ~55% | ~28% |
The failures weren't scams; they clustered around consensus uncertainty, timing boundaries, and conflicting sources of truth. The limitation wasn't intelligence—it was decision authority.
The Important Observation
If the model makes the final decision, it is unsafe. If the model only recommends and a deterministic state machine decides, it is much safer.
Architecture Recommendation:
LLM (Recommendation) → State Machine (Final Decision)
Safety improved significantly without improving the model itself. Reliability came from system design, not "smarter" AI.
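A minimal sketch of that split, with the LLM output treated as advisory input to a deterministic gate (all names here are illustrative, not the benchmark's code):

```typescript
// Hypothetical gate -- a sketch of the recommend/decide split, not a real API.
type Recommendation = "SETTLE" | "REJECT" | "PENDING";

interface ChainState {
  finalized: boolean; // past the chain's finality threshold
  rpcQuorum: boolean; // majority of queried RPC nodes agree
}

// The state machine holds final authority: it can only downgrade an LLM
// recommendation toward safety (SETTLE -> PENDING), never upgrade it.
function finalDecision(llm: Recommendation, chain: ChainState): Recommendation {
  if (llm === "REJECT") return "REJECT"; // refusals always stand
  if (!chain.rpcQuorum || !chain.finalized) return "PENDING";
  return llm; // SETTLE passes through only when the chain state is safe
}
```

The asymmetry is the design choice: the model can make the system more cautious but never less cautious than the verified chain state allows.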
Why This Matters
Most AI evaluations measure knowledge. But deployed agents operate with incomplete information and asynchronous state. These failures won't appear in traditional benchmarks.
I suspect small local models (Qwen, Mistral, Llama) with a strict verifier might outperform frontier models acting alone.
Run the Benchmark
If you want to test this yourself:
```bash
git clone https://github.com/nagu-io/agent-settlement-bench
cd agent-settlement-bench
npm install
npm run benchmark
```