<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nagu121</title>
    <description>The latest articles on DEV Community by Nagu121 (@nagu2103).</description>
    <link>https://dev.to/nagu2103</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3792006%2Fb2d0584a-92ee-4ff3-adfa-cc4f0f093b26.png</url>
      <title>DEV Community: Nagu121</title>
      <link>https://dev.to/nagu2103</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nagu2103"/>
    <language>en</language>
    <item>
      <title>I tested whether AI can safely make irreversible financial decisions</title>
      <dc:creator>Nagu121</dc:creator>
      <pubDate>Wed, 25 Feb 2026 15:47:09 +0000</pubDate>
      <link>https://dev.to/nagu2103/i-tested-whether-ai-can-safely-make-irreversible-financial-decisions-1ohd</link>
      <guid>https://dev.to/nagu2103/i-tested-whether-ai-can-safely-make-irreversible-financial-decisions-1ohd</guid>
      <description>&lt;p&gt;&lt;strong&gt;description&lt;/strong&gt;: &lt;strong&gt;A benchmark study on LLM reliability in crypto payment settlement and system design.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;LLMs look intelligent in benchmarks, but real systems don’t fail on trivia questions. They fail when a single wrong decision causes permanent loss.&lt;/p&gt;

&lt;p&gt;I wanted to test something different: &lt;strong&gt;Not "can a model reason?" but "can a model refuse unsafe actions under uncertainty?"&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;I built a small benchmark simulating a crypto payment settlement agent. For each scenario, the model must decide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;SETTLE:&lt;/strong&gt; Accept payment.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;REJECT:&lt;/strong&gt; Unsafe payment.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;PENDING:&lt;/strong&gt; Insufficient certainty.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The cases are operational failures rather than math puzzles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  RPC nodes disagreeing&lt;/li&gt;
&lt;li&gt;  Delayed confirmations&lt;/li&gt;
&lt;li&gt;  Replayed transactions&lt;/li&gt;
&lt;li&gt;  Wrong recipient addresses&lt;/li&gt;
&lt;li&gt;  Chain reorg risk&lt;/li&gt;
&lt;li&gt;  Race conditions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;

&lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/nagu-io" rel="noopener noreferrer"&gt;
        nagu-io
      &lt;/a&gt; / &lt;a href="https://github.com/nagu-io/agent-settlement-bench" rel="noopener noreferrer"&gt;
        agent-settlement-bench
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      Benchmark for evaluating safety of AI agents in irreversible financial decisions (crypto payment settlement, consensus conflicts, replay attacks, finality races).
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;AgentSettlementBench&lt;/h1&gt;
&lt;/div&gt;
&lt;p&gt;Safety benchmark for AI agents making irreversible financial decisions.&lt;/p&gt;
&lt;p&gt;AgentSettlementBench is the first benchmark that tests whether AI agents safely handle irreversible money decisions, not just whether they answer questions correctly.&lt;/p&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/a4c6f7b045492279710752c1e1a6510a2fa27f8e6cc404f9ef5d242923a7a26c/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f62656e63686d61726b2d6163746976652d627269676874677265656e"&gt;&lt;img src="https://camo.githubusercontent.com/a4c6f7b045492279710752c1e1a6510a2fa27f8e6cc404f9ef5d242923a7a26c/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f62656e63686d61726b2d6163746976652d627269676874677265656e" alt="Status"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/d6240be40ee386e5bb2652bce5aad451b6955b42f244c5400497e98749e87a3e/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f646f6d61696e2d41492532305361666574792d626c7565"&gt;&lt;img src="https://camo.githubusercontent.com/d6240be40ee386e5bb2652bce5aad451b6955b42f244c5400497e98749e87a3e/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f646f6d61696e2d41492532305361666574792d626c7565" alt="Domain"&gt;&lt;/a&gt;
&lt;a href="https://github.com/nagu-io/agent-settlement-bench/actions/workflows/smoke.yml" rel="noopener noreferrer"&gt;&lt;img src="https://github.com/nagu-io/agent-settlement-bench/actions/workflows/smoke.yml/badge.svg" alt="Smoke"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Result Snapshot (Public Leaderboard)&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;It evaluates whether LLMs correctly refuse unsafe blockchain payments under adversarial conditions (reorgs, spoofed tokens, RPC disagreement, race conditions).&lt;/p&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;th&gt;Critical Fail Rate&lt;/th&gt;
&lt;th&gt;Risk-Weighted Fail&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Codex&lt;/td&gt;
&lt;td&gt;50.0%&lt;/td&gt;
&lt;td&gt;30.0%&lt;/td&gt;
&lt;td&gt;40.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 3.1&lt;/td&gt;
&lt;td&gt;55.0%&lt;/td&gt;
&lt;td&gt;28.6%&lt;/td&gt;
&lt;td&gt;39.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude Haiku (subset 13/20)&lt;/td&gt;
&lt;td&gt;84.6%&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;td&gt;15.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ChatGPT-4.1 (subset 10/20)&lt;/td&gt;
&lt;td&gt;90.0%&lt;/td&gt;
&lt;td&gt;0.0%&lt;/td&gt;
&lt;td&gt;9.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MiniMax-2.5 (subset 10/20)&lt;/td&gt;
&lt;td&gt;80.0%&lt;/td&gt;
&lt;td&gt;20.0%&lt;/td&gt;
&lt;td&gt;24.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;p&gt;Subset rows are reference-only and not leaderboard-eligible.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Mental Model&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;Traditional benchmarks:
question -&amp;gt; answer -&amp;gt; score&lt;/p&gt;
&lt;p&gt;AgentSettlementBench:
event -&amp;gt; financial decision -&amp;gt; irreversible consequence&lt;/p&gt;
&lt;p&gt;We measure whether the agent refuses unsafe actions, not whether it sounds intelligent.&lt;/p&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;What you get&lt;/h2&gt;

&lt;/div&gt;
&lt;p&gt;Running the benchmark produces:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Safety accuracy score&lt;/li&gt;
&lt;li&gt;Critical failure rate (money loss risk)&lt;/li&gt;
&lt;li&gt;Risk-weighted reliability score&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Example:&lt;/p&gt;
&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;
&lt;pre class="notranslate"&gt;&lt;code&gt;Accuracy: 55%
Critical&lt;/code&gt;&lt;/pre&gt;…&lt;/div&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/nagu-io/agent-settlement-bench" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;




&lt;h2&gt;
  
  
  What Surprised Me
&lt;/h2&gt;

&lt;p&gt;The results showed a massive gap based on how the model was instructed.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mode&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;th&gt;Critical Failures&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Strict Policy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~100%&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Open Reasoning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~55%&lt;/td&gt;
&lt;td&gt;~28%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The failures weren't scams; they clustered around &lt;strong&gt;consensus uncertainty&lt;/strong&gt;, &lt;strong&gt;timing boundaries&lt;/strong&gt;, and &lt;strong&gt;conflicting sources of truth&lt;/strong&gt;. The limitation wasn't intelligence—it was decision authority.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Important Observation
&lt;/h2&gt;

&lt;p&gt;If the model makes the final decision, it is unsafe. If the model only &lt;strong&gt;recommends&lt;/strong&gt; and a deterministic state machine decides, it is much safer.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Architecture Recommendation:&lt;/strong&gt; &lt;br&gt;
LLM (Recommendation) → State Machine (Final Decision)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Safety improved significantly without improving the model itself. Reliability came from &lt;strong&gt;system design&lt;/strong&gt;, not "smarter" AI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;Most AI evaluations measure knowledge. But deployed agents operate with incomplete information and asynchronous state. These failures won't appear in traditional benchmarks.&lt;/p&gt;

&lt;p&gt;I suspect small local models (Qwen, Mistral, Llama) with a strict verifier might outperform frontier models acting alone.&lt;/p&gt;

&lt;h2&gt;
  
  
  Run the Benchmark
&lt;/h2&gt;

&lt;p&gt;If you want to test this yourself:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
bash
git clone https://github.com/nagu-io/agent-settlement-bench
cd agent-settlement-bench
npm install
npm run benchmark
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>security</category>
      <category>blockchain</category>
    </item>
  </channel>
</rss>
