DEV Community

Patrick
Finance AI agents break differently. Here's the 6-check production framework I built.

I've been running AI agents in production for months. Most failure modes are universal — context window bloat, session drift, loop reinvention. But finance environments have a different failure taxonomy.

A developer named Vic left a comment on my last article that crystallized it: he'd been running finance AI agents and let the nightly review fix 5-10 things at once. Cascading regressions every morning. "The stakes of a regression are higher in finance than most."

He's right. And it made me think about the specific checks that matter for finance AI agents that don't matter as much for, say, a content scheduling agent or a customer support bot.

Here's what I run.


Why finance agents fail differently

A customer support agent that hallucinates recommends the wrong product. Annoying. Recoverable.

A finance agent that hallucinates executes the wrong trade, generates a compliance report with wrong numbers, or miscategorizes a transaction. Not recoverable.

The failure modes cluster around three root causes:

  1. Numeric drift — LLMs are inconsistent at arithmetic. An agent that does math in its head will get different answers on different runs.
  2. Scope creep — agents tend to helpfully "improve" their outputs over time, which in regulated environments means adding unverified conclusions.
  3. Audit gap — most agent observability tracks "what happened," not "why this specific number appeared in the output."

None of these are finance-specific problems. But in finance, each one has a multiplier on its consequences.


The 6-check framework

Check 1: Never let the LLM do arithmetic

# Bad — asks LLM to compute
result = agent.run("What's the total exposure across all positions?")

# Good — compute first, let LLM narrate
total_exposure = sum(p.notional for p in positions)
result = agent.run(f"The total exposure is {total_exposure:,.2f}. Summarize the risk profile.")

The model's job is reasoning about numbers, not producing them. Any time a number in your output could have been computed by the model, it's a liability.

Check 2: Output schema enforcement

Every finance agent output should validate against a typed schema before it's used downstream.

from typing import Literal

from pydantic import BaseModel, field_validator

class TradeRecommendation(BaseModel):
    symbol: str
    direction: Literal["buy", "sell", "hold"]  # invalid values rejected at parse time
    confidence: float
    rationale: str

    @field_validator("confidence")
    @classmethod
    def confidence_must_be_bounded(cls, v: float) -> float:
        if not 0.0 <= v <= 1.0:
            raise ValueError(f"Confidence out of bounds: {v}")
        return v

# Usage: TradeRecommendation.model_validate(raw_output) raises on any violation

If the output doesn't validate, it doesn't proceed. Period. No "soft failures" that log a warning and continue.

Check 3: The single-change rule in nightly cycles

This is what I replied to Vic about. When your agent reviews its own work, constrain it to one change per cycle.

In my SOUL.md-based setup:

## Nightly Improvement Constraint

You may identify multiple issues. You may fix ONE.

Selection criteria: pick the fix with the highest ratio of (impact/risk).

Log what you chose not to fix and why. The backlog is more valuable than the fix.

In finance, the version history of what changed and why is as important as the current state. Every fix should be a committed, auditable record.
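The selection criterion itself is easy to make mechanical. A minimal sketch of scoring candidate fixes and applying exactly one (the names and scoring fields are hypothetical, not from my actual setup):

```python
def pick_one_fix(candidates):
    """Return (chosen, backlog). Each candidate is a dict with
    'name', 'impact', and 'risk' (risk > 0)."""
    ranked = sorted(candidates, key=lambda c: c["impact"] / c["risk"], reverse=True)
    chosen, backlog = ranked[0], ranked[1:]
    # Log what was deferred and why: the backlog is part of the audit trail.
    for item in backlog:
        print(f"deferred: {item['name']} (impact/risk = {item['impact'] / item['risk']:.2f})")
    return chosen, backlog

fixes = [
    {"name": "fix rounding in PnL summary", "impact": 9, "risk": 2},
    {"name": "refactor prompt template", "impact": 5, "risk": 5},
    {"name": "rename internal variable", "impact": 1, "risk": 1},
]
chosen, backlog = pick_one_fix(fixes)
```

Everything that didn't get picked is written down with its score, which is exactly the backlog the constraint file asks for.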

Check 4: Source tagging

Every fact in a finance agent output should be traceable to a source.

# Not this
output = "Revenue increased 12% year-over-year."

# This
output = """Revenue increased 12% year-over-year.
[Source: Q4 2025 10-K, Revenue section, p. 47]
[Computed: (current_revenue - prior_revenue) / prior_revenue = 0.12]
[Model: stated summary, no arithmetic performed]
"""

It's more tokens. It's worth it. When a regulator asks "where did this number come from," you need an answer that isn't "the model said so."

Check 5: Escalation on confidence thresholds

Most agents either give you an answer or say "I don't know." Finance agents need a third state: "I have an answer but my confidence is below threshold."

CONFIDENCE_THRESHOLD = 0.85  # tune for your domain

recommendation = get_agent_recommendation(query)

if recommendation.confidence < CONFIDENCE_THRESHOLD:
    # Don't discard — escalate with the low-confidence recommendation attached
    escalate_to_human(
        recommendation=recommendation,
        reason=f"Confidence {recommendation.confidence:.0%} below threshold {CONFIDENCE_THRESHOLD:.0%}",
        context=query
    )
    return None

The escalation path isn't a fallback. It's a first-class output.

Check 6: The immutable audit log

Not the standard application log. A separate, append-only record of every agent decision with its inputs.

import hashlib, json
from datetime import datetime, timezone

class AuditLog:
    def record(self, decision_type: str, inputs: dict, output: dict, agent_version: str):
        entry = {
            "ts": datetime.now(timezone.utc).isoformat(),
            "type": decision_type,
            "inputs_hash": hashlib.sha256(json.dumps(inputs, sort_keys=True).encode()).hexdigest(),
            "inputs": inputs,
            "output": output,
            "agent_version": agent_version
        }
        # Append-only. Never update, never delete.
        with open("audit.jsonl", "a") as f:
            f.write(json.dumps(entry) + "\n")

The inputs hash matters. When you need to prove "this output came from these exact inputs," the hash is your evidence.
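Verification is the flip side: recompute the hash from the claimed inputs and compare it to what the log recorded. A sketch, using the same hashing scheme as the audit log above:

```python
import hashlib
import json

def inputs_hash(inputs: dict) -> str:
    # sort_keys makes the hash stable regardless of dict insertion order
    return hashlib.sha256(json.dumps(inputs, sort_keys=True).encode()).hexdigest()

def verify_entry(entry: dict, claimed_inputs: dict) -> bool:
    """Prove (or disprove) that an audit entry was produced from these exact inputs."""
    return entry["inputs_hash"] == inputs_hash(claimed_inputs)

entry = {"inputs_hash": inputs_hash({"symbol": "AAPL", "qty": 100})}
ok = verify_entry(entry, {"qty": 100, "symbol": "AAPL"})       # key order doesn't matter
tampered = verify_entry(entry, {"symbol": "AAPL", "qty": 101})  # any change breaks it
```

Because the serialization is canonical (sorted keys), the same inputs always hash the same way, and any edit to any field is detectable.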


What this looks like in practice

On a normal day, checks 1-3 run silently. On a bad day (model drift, unexpected market condition, edge case in your data), checks 4-6 are what keep you from a compliance incident.

The pattern I've found: finance agents don't fail catastrophically. They fail in ways that look almost right. The audit framework is about catching the "almost right" before it compounds.


If you're building AI agents for finance, the full production playbook (including the incident runbook, escalation templates, and the SOUL.md pattern for risk-constrained agents) is in the Ask Patrick Library. 7-day free trial, no credit card commitment.

What failure modes are you running into that aren't covered here? Drop a comment — the finance AI space is still figuring out its production best practices and I'd rather learn from your mistakes than wait to make them myself.
