By a Backend Lead Engineer | 10+ years building core banking and fintech systems in the UK
2,400 words · 11 min read · Intermediate to Senior Engineers
If you're building AI agents that touch financial decisions in 2026, the architecture choices you make in the next six months will determine whether you survive your first regulatory audit. This is the practical guide: audit logs, guardrails, circuit breakers, EU AI Act compliance, and why you can never let an LLM write directly to financial state.
👉 Part 2 of a series. If you missed Part 1, start here: Fintech Backend Architecture: Building Systems That Don't Break When Money Is Involved
Agentic AI in banking is not coming. It's here. Goldman Sachs is running autonomous agents against core trade systems. RBC is projecting a $1B revenue lift. CIBC has deployed AI copilots to 1,700+ engineers.
And almost every article about it is written for a boardroom, not an engine room.
This one isn't. This is what it actually takes to architect, trust, audit, and govern an AI agent making financial decisions in a regulated environment.
Introduction: Forget the Pilot – You're in Production Now
Here's how most AI-in-banking stories go:
- 2024: "We're running an exciting pilot."
- 2025: "Our pilot showed promising results."
- 2026: "Our agent blocked 40,000 legitimate transactions before anyone noticed."
That last one doesn't make the press releases. But it happens.
The shift from pilot to production is where the real engineering starts. And the real engineering looks nothing like the demo.
The question is no longer "Should we use AI?" – it's "When our AI makes a wrong decision at scale, can we explain it, roll it back, and survive the regulatory review?"
That's what this article is about.
What "Agentic AI in Banking" Actually Means (Not a Chatbot)
I'm not talking about a chatbot that summarises statements.
I mean a system that:
- Reads live financial data – transactions, risk signals, account history
- Makes a decision without a human approving it
- Triggers an action that changes financial state
That's the definition. Hold it in your head.
Here's what that looks like in the wild right now:
- Fraud decision agents – block or allow a payment in under 200ms
- KYC/AML agents – classify customers, surface suspicious patterns, auto-escalate
- Payment routing agents – choose the cheapest, fastest, lowest-risk rail
- Compliance monitoring agents – watch every transaction for DORA/FCA violations, continuously
- Credit decision agents – approve or decline a lending application
Every single one of these affects real money and real people.
And that changes everything about how you build them. A bug in your API returns a 500. A bug in a fraud agent blocks someone's rent payment.
⚠️ Real-world lesson: The same properties that make an AI agent powerful in banking – speed, scale, autonomy – are exactly the properties that cause catastrophic damage when it goes wrong. Design for failure before you design for success.
Three Problems Nobody Warns You About (Explainability, Non-Determinism, Rollback)
Every conference talk on AI in fintech covers use cases and ROI. Almost none cover these.
Problem 1: Explainability Under FCA and EU AI Act – "The Model Decided" Is Not an Answer
Picture this.
A regulator walks in. Sits down. Slides a sheet of paper across the table.
"Why did your system block Mr. Ahmed's payment on 14 March?"
You cannot say: "the model gave it a 0.73 risk score."
Under the FCA, the EU AI Act, and current UK financial regulation, high-risk AI decisions require documented, human-interpretable explanations. Not attention weights. Not probability distributions. A traceable reasoning chain that a compliance officer can read, understand, and defend.
This is not a future requirement. It is enforceable now.
Problem 2: LLM Non-Determinism in Fraud Detection – A Compliance Violation Waiting to Happen
LLMs produce different outputs for identical inputs. That's a feature in a creative writing tool. In a fraud detection system, it's a compliance violation waiting to happen.
If your fraud agent blocked the same transaction on Tuesday that it approved on Monday – identical inputs, different outcome – you have a legal problem.
You cannot fix this by tuning the model. You fix it by architecting around it. More on that shortly.
Problem 3: AI Agent Rollback at Scale – When It Goes Wrong, It Goes Wrong Fast
An AI agent in production doesn't make one bad decision. It makes thousands. Per minute.
When the model drifts, or a training bug ships, or a fraudster figures out how to game it, you need to:
- Detect the problem in seconds
- Stop the agent without taking down the payment system
- Reverse affected decisions systematically
- Explain the full blast radius to Risk and Compliance
None of this is possible if your AI agent writes directly to financial state.
⚠️ Real-world lesson: A fraud model trained on a biased dataset went live on a Friday afternoon. By Saturday morning it had blocked 40% of legitimate transactions from a specific postcode. The rollback took 3 days. The regulatory incident report took 3 weeks. The Friday deployment window was never used again.
The Architecture Pattern That Solves All Three: Read-Reason-Emit
Here's the pattern. It's not complicated. It's just not obvious until someone tells you.
The AI agent must never write directly to financial state. It reads, it reasons, it emits a decision. A separate deterministic service executes that decision.
In practice:
[Event Stream]
      |
      ▼
[AI Agent Layer]       – reads context, NEVER writes financial state
      |
      ▼
[Decision Queue]       – append-only, immutable, fully auditable
      |
      ▼
[Execution Service]    – deterministic, idempotent, saga-driven
      |
      ▼
[Ledger / Financial State]
Why this works:
- The AI agent is stateless. It reads, reasons, and emits. It never mutates anything.
- The Decision Queue is append-only. Immutable. Just like your ledger. Every decision ever made is permanently recorded.
- The Execution Service is deterministic. It applies decisions exactly once, idempotently, with full compensation logic.
- Rollback is safe. Mark the decision as reversed in the queue. Run compensations. Done.
If this looks familiar, it should. It's the same append-only, idempotent, saga-driven pattern from good fintech backend design. The AI is just a new layer at the top.
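The separation can be sketched in a few lines. All names here (DecisionQueue, agentDecide, execute) are illustrative, not a real API:

```javascript
// Append-only decision queue: entries are never updated in place.
class DecisionQueue {
  constructor() { this.entries = []; }
  append(decision) {
    const entry = { ...decision, seq: this.entries.length, status: 'PENDING' };
    this.entries.push(entry);
    return entry.seq;
  }
  // Rollback is a new fact appended to the log, not a mutation.
  // A real system would also run compensations via the execution service.
  reverse(seq) {
    this.entries.push({ reversal_of: seq, status: 'REVERSED', seq: this.entries.length });
  }
}

// The agent layer: reads context, reasons, emits. No write access to the ledger.
function agentDecide(context) {
  return { decision: 'BLOCK', reasoning: 'velocity 3.4x over 30-day baseline', context };
}

// The execution service: deterministic and idempotent (exactly-once apply).
function execute(queue, ledger, applied = new Set()) {
  for (const e of queue.entries) {
    if (e.status !== 'PENDING' || applied.has(e.seq)) continue;
    ledger.push({ seq: e.seq, action: e.decision });
    applied.add(e.seq);
  }
}
```

Running execute twice applies each decision once; reversing a decision appends a new entry rather than rewriting history.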
⚠️ Real-world lesson: Every team that gave their AI agent direct database write access ended up in an incident. Every single one. Separate the layers. Non-negotiable.
Building an AI Decision Audit Log That Satisfies Regulators
Every AI decision needs a paper trail. Not just the outcome – the complete context that produced it.
This is the minimum schema that will satisfy an FCA or EU AI Act audit:
CREATE TABLE ai_decision_log (
decision_id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
agent_id VARCHAR(100) NOT NULL, -- which agent made this call
model_version VARCHAR(50) NOT NULL, -- exact model version (mandatory)
input_hash VARCHAR(64) NOT NULL, -- SHA-256 of the full input context
input_snapshot JSONB NOT NULL, -- the FULL input, not a reference
decision VARCHAR(50) NOT NULL, -- ALLOW / BLOCK / ESCALATE
confidence NUMERIC(5,4), -- 0.0000 to 1.0000
reasoning TEXT NOT NULL, -- plain-English explanation
rules_triggered JSONB, -- every guardrail that fired
execution_id UUID, -- FK to execution service
created_at TIMESTAMPTZ DEFAULT NOW(),
reviewed_by VARCHAR(100), -- human reviewer if escalated
reviewed_at TIMESTAMPTZ
);
Three fields that engineers always want to skip. Don't.
- input_snapshot – store the full input, not a reference. Data gets mutated. Audit logs must not. If a regulator pulls a 2-year-old decision, you need to show exactly what the agent saw at that moment.
- model_version – mandatory, not optional. When your model gets retrained next month, you need to know which decisions were made by which version. This is also how you scope a rollback.
- reasoning – human-readable text generated by the agent as part of its output. Not post-hoc rationalisation. Not a confidence score dressed up as an explanation. Enforce this at the API contract.
⚠️ Real-world lesson: "The model gave it a 0.73 risk score" is not a regulatory explanation. "This transaction was blocked because it exceeded the account's 30-day velocity threshold by 340%, originated from an IP linked to 3 previous fraud reports, and the beneficiary account was opened 6 hours ago" – that is. Build your agents to produce the second one.
Hard Guardrails, Soft Guardrails, and Circuit Breakers for LLM Agents in Fintech
Guardrails are not suggestions for the AI to consider. They are hard stops that the AI layer never sees.
Hard Guardrails – The LLM Never Gets Involved
// These fire BEFORE the AI agent. If triggered: BLOCK, log, done.
const hardGuardrails = {
maxSingleTransactionGBP: 50_000,
maxDailyVolumePerAccount: 200_000,
sanctionedCountries: ['XX', 'YY'], // OFAC / HMT list
requiredFields: ['reference', 'beneficiary_name'],
minAccountAgeForHighValue: 30, // days
};
// Hard guardrail fires? Block immediately. Don't ask the AI.
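A pre-model check over that config might look like this sketch (the transaction field names and the £10,000 high-value cutoff are illustrative assumptions):

```javascript
// Sketch: hard guardrails run BEFORE the model is consulted.
// Any violation means BLOCK and log; the AI never sees the transaction.
function checkHardGuardrails(txn, rules) {
  const violations = [];
  if (txn.amountGBP > rules.maxSingleTransactionGBP) violations.push('MAX_SINGLE_TXN');
  if (rules.sanctionedCountries.includes(txn.beneficiaryCountry)) violations.push('SANCTIONED_COUNTRY');
  for (const f of rules.requiredFields) {
    if (!txn[f]) violations.push(`MISSING_${f.toUpperCase()}`);
  }
  // Illustrative high-value cutoff for the account-age rule.
  if (txn.amountGBP > 10_000 && txn.accountAgeDays < rules.minAccountAgeForHighValue) {
    violations.push('ACCOUNT_TOO_NEW');
  }
  return violations.length ? { decision: 'BLOCK', violations } : null; // null = pass to the AI
}
```

Returning null means "no hard rule fired, the AI layer may now look at it" – the only path by which the model is ever consulted.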
Soft Guardrails – The AI Can Override, But It Better Explain Why
// These fire AFTER the AI decision.
// AI says ALLOW + soft guardrail fires = ESCALATE to human review.
const softGuardrails = {
velocityMultiplierThreshold: 3.0, // 3x account's normal monthly volume
newPayeeHighValueThreshold: 5_000, // first payment ever to this payee
unusualHours: { start: 1, end: 5 }, // 1am–5am local time
confidenceMinimum: 0.80, // AI confidence must clear 80%
};
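And the post-decision pass, as a sketch (the transaction field names are illustrative assumptions):

```javascript
// Sketch: soft guardrails run AFTER the AI decision.
// AI says ALLOW while a soft rule fired => ESCALATE to human review.
function applySoftGuardrails(aiDecision, txn, rules) {
  const fired = [];
  if (txn.monthlyVolumeMultiplier >= rules.velocityMultiplierThreshold) fired.push('VELOCITY');
  if (txn.isNewPayee && txn.amountGBP >= rules.newPayeeHighValueThreshold) fired.push('NEW_PAYEE_HIGH_VALUE');
  if (txn.localHour >= rules.unusualHours.start && txn.localHour < rules.unusualHours.end) fired.push('UNUSUAL_HOURS');
  if (aiDecision.confidence < rules.confidenceMinimum) fired.push('LOW_CONFIDENCE');

  if (aiDecision.decision === 'ALLOW' && fired.length) {
    return { decision: 'ESCALATE', rules_triggered: fired }; // a human reviews it, not the model
  }
  return { ...aiDecision, rules_triggered: fired }; // BLOCKs stand; fired rules still get logged
}
```

Note the asymmetry: a soft rule can demote an ALLOW to ESCALATE, but it never promotes a BLOCK – the conservative decision always survives.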
Circuit Breakers – Because AI Models Fail Silently If You Let Them
Your circuit breaker watches the AI's decision pattern in real time. The moment something looks wrong, it yanks the AI out of the loop and routes everything to a deterministic fallback.
// Trip conditions – any one of these fires the breaker:
//   blockRateSpike: +15% block rate in a 5-minute window
//   allowRateSpike: +20% allow rate (could mean the model is being gamed)
//   latencyBreach:  P99 > 800ms (agent is struggling under load)
//   errorRate:      >1% errors in 60 seconds
//   modelDrift:     decision distribution >2 sigma from the 7-day baseline
// On trip: fall back to the rules engine, page on-call, open an incident.
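One of those trip conditions, the block-rate spike, can be sketched as a sliding-window check (the window size, baseline, and delta are illustrative assumptions):

```javascript
// Sketch: a block-rate circuit breaker over a sliding window of recent decisions.
class BlockRateBreaker {
  constructor({ windowSize = 200, baselineRate = 0.05, spikeDelta = 0.15 } = {}) {
    this.windowSize = windowSize;
    this.baselineRate = baselineRate; // historical block rate
    this.spikeDelta = spikeDelta;     // allowed increase before tripping
    this.window = [];
    this.tripped = false;
  }
  record(decision) {
    this.window.push(decision === 'BLOCK' ? 1 : 0);
    if (this.window.length > this.windowSize) this.window.shift();
    const rate = this.window.reduce((a, b) => a + b, 0) / this.window.length;
    // Trip only on a full window, when the rate exceeds baseline + allowed spike.
    if (this.window.length === this.windowSize && rate > this.baselineRate + this.spikeDelta) {
      this.tripped = true; // route to deterministic fallback, page on-call
    }
    return this.tripped;
  }
}
```

A production breaker would also watch latency, error rate, and drift, and would trip on any one of them; the structure is the same.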
⚠️ Real-world lesson: Test your circuit breaker in production. Quarterly. Deliberately trigger it. A circuit breaker that has never fired in a drill will not fire reliably when you're at 3am staring at a $2M transaction anomaly.
How to Test an AI Agent in Banking When You Can't Unit Test It
You cannot write "given input X, expect output Y" and call your AI agent tested. The model doesn't work like that.
What you can do:
Shadow Mode Testing – Run It Live, But Without Consequences
Deploy the agent alongside your existing system. It processes every real transaction and logs its decision. But the live decision is still made by the existing rules engine.
Then compare.
- Run for a minimum of 4 weeks across all transaction types and volumes
- Target ≥98% agreement rate with the existing system before going anywhere near live
- Every disagreement gets reviewed manually – these are your highest-signal edge cases
- No go-live without sign-off from Risk, Compliance, and Engineering. All three.
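The comparison step itself is simple enough to sketch (rulesEngine and agent are stand-ins for your live engine and the shadow agent):

```javascript
// Sketch: shadow-mode comparison. The rules engine stays authoritative;
// the agent's decision is only logged and compared, never executed.
function shadowCompare(transactions, rulesEngine, agent) {
  const disagreements = [];
  let agreed = 0;
  for (const txn of transactions) {
    const live = rulesEngine(txn); // the decision that actually takes effect
    const shadow = agent(txn);     // logged, never executed
    if (live === shadow) agreed++;
    else disagreements.push({ txn, live, shadow }); // feeds the manual review queue
  }
  return { agreementRate: agreed / transactions.length, disagreements };
}
```

Gate go-live on the agreement rate clearing your threshold AND the disagreement queue being fully reviewed – a high rate with unexamined disagreements is not a pass.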
Red-Team Testing for Fraud AI – Someone Else Will If You Don't
Fraudsters don't read your model card. They probe your system, find the edges, and exploit them.
Before deployment, hire someone to do it first:
- Craft transactions that probe just under every hard guardrail threshold
- Test distributional shift – transaction patterns the training data never saw
- Test boundary inputs that have never occurred in your historical data
- Regression test on every model update. Every single one. No exceptions.
⚠️ Real-world lesson: A retrained fraud model looked great on all benchmarks. The red team found it could be systematically bypassed using split transactions just below the guardrail threshold – a pattern not in the training set. Update rolled back. Two days of red-teaming. Would have been a catastrophic production incident otherwise.
DORA, FCA, and EU AI Act Compliance for AI Agents: What Engineers Must Know
I've seen engineers treat regulation as someone else's problem.
It isn't. Not anymore.
If you build an AI agent that makes financial decisions in the UK or EU, here's what the law already requires:
- Model cards for every agent. What the model does, what it was trained on, known failure modes, performance across demographic groups. This is a legal artefact, not documentation for documentation's sake.
- Immutable model versioning. Every version that ever went to production must be retained and reproducible. If a claim surfaces about a 2-year-old decision, you need to be able to re-run that exact model.
- High-risk AI classification under the EU AI Act. Credit scoring and fraud detection are "high-risk." That triggers mandatory conformity assessments before deployment. Not after.
- Mandatory human oversight for high-stakes decisions. Above certain thresholds, a human must be in the loop. Design your escalation queues for this now, not as an afterthought.
- Continuous bias monitoring. If your fraud agent is blocking transactions from certain groups at a higher rate, you need automated detection. Manual sampling at scale doesn't cut it.
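That last requirement is mechanical enough to sketch: group decisions by a protected or proxy attribute and flag any group whose block rate is out of line with the overall rate. The grouping key and the 10-point disparity threshold are illustrative assumptions, not regulatory figures:

```javascript
// Sketch: automated block-rate disparity check across groups.
function blockRateByGroup(decisions) {
  const stats = new Map();
  for (const { group, decision } of decisions) {
    const s = stats.get(group) || { total: 0, blocked: 0 };
    s.total++;
    if (decision === 'BLOCK') s.blocked++;
    stats.set(group, s);
  }
  const rates = {};
  for (const [g, s] of stats) rates[g] = s.blocked / s.total;
  return rates;
}

// Flag any group whose block rate exceeds the overall rate by more than `delta`.
function flagDisparity(rates, overallRate, delta = 0.10) {
  return Object.entries(rates)
    .filter(([, r]) => r > overallRate + delta)
    .map(([g]) => g);
}
```

Run it continuously over the decision log and page someone when the flagged list is non-empty – that is the automated detection the regulation expects, as opposed to quarterly manual sampling.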
Most engineering teams are scrambling to retrofit this onto systems that were built without it. Don't be that team.
⚠️ Real-world lesson: Engineers who understand AI governance are rare and extremely well-compensated right now. The vast majority of developers avoid learning it because it seems boring. That's your competitive advantage. Take it.
The Honest Truth About Agentic AI in Banking
This isn't a hard problem. It's a discipline problem.
The patterns that make financial systems reliable apply directly to AI agents. Immutability. Idempotency. Auditability. Circuit breakers. Append-only state. None of this is new.
What's new is having the AI layer sitting on top of all of it β and the discipline to keep it there, instead of letting it reach down and touch financial state directly.
The stack that works:
- Append-only decision log – same principles as your ledger
- Idempotent execution service – same principles as your payment processor
- Hard guardrails that fire before the model is ever consulted
- Circuit breakers tested in production, not just in staging
- Shadow mode before any agent goes live. Always.
The difference between a bank that deploys AI confidently and one that fears it is not the quality of the model. It's the quality of the architecture around it.
"When this AI agent makes a wrong decision β and it will β can I explain exactly what happened, to a regulator, at 9am on a Monday?"
If yes, you're building it right.
What to Read Next
If this was useful, Part 1 covers the foundational backend patterns that everything in this article builds on:
👉 Fintech Backend Architecture: Building Systems That Don't Break When Money Is Involved
Follow if you want more on distributed systems, fintech backend architecture, and building AI you can actually trust in production. More coming on event-sourced architectures, DORA incident response, and real-time fraud pipelines.
Backend Lead Engineer. 10+ years in UK core banking. Distributed systems, financial data integrity, regulatory compliance, and AI-powered fintech tooling.