Building agents for banking is 30% AI work and 70% compliance plumbing. The 30% is the easy part. Here's how to handle the 70%.
The first time you build an underwriting agent for a bank, you'll write the credit logic in about a week. Then you'll spend the next two months on audit logging, explainability, human-in-the-loop checkpoints, and data residency, and you'll understand why fintech AI projects take longer than the demo suggests.
This is the architecture that gets through an audit, using loan underwriting as the running example.
The Core Requirement: Every Decision Must Be Traceable
Regulators don't accept "the model gave it a score of 0.82." They want to know what data was used, what reasoning was applied, and what a human would need to review to understand and challenge the decision.
from datetime import datetime, timezone

# Placeholder value; set this from your own release process.
AGENT_VERSION = "underwriting-agent-v1"


class AuditableDecision:
    """
    Every agent decision in a regulated context must
    produce this structure. Not optional. Not added later.
    """

    def __init__(self, decision_id: str):
        self.decision_id = decision_id
        self.inputs_used = {}
        self.reasoning_steps = []
        self.data_sources_consulted = []
        self.model_version = AGENT_VERSION
        # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated).
        self.timestamp = datetime.now(timezone.utc).isoformat()
        self.outcome = None
        self.confidence = None
        self.human_reviewable_explanation = None

    def add_reasoning_step(self, step_description: str, evidence: dict):
        # Steps are numbered in the order they are recorded.
        self.reasoning_steps.append({
            "step": len(self.reasoning_steps) + 1,
            "description": step_description,
            "evidence": evidence,
            "timestamp": datetime.now(timezone.utc).isoformat()
        })

    def finalise(self, outcome: str, confidence: float, explanation: str):
        self.outcome = outcome
        self.confidence = confidence
        self.human_reviewable_explanation = explanation

    def to_audit_record(self) -> dict:
        return {
            "decision_id": self.decision_id,
            "inputs": self.inputs_used,
            "reasoning_chain": self.reasoning_steps,
            "data_sources": self.data_sources_consulted,
            "model_version": self.model_version,
            "outcome": self.outcome,
            "confidence": self.confidence,
            "explanation": self.human_reviewable_explanation,
            "timestamp": self.timestamp
        }
Every step in the agent's reasoning gets appended to this structure as it happens. Reconstructing the chain afterward from logs is unreliable, and auditors specifically check for it.
The Underwriting Decision Pipeline
import json

from anthropic import AsyncAnthropic

# Reads ANTHROPIC_API_KEY from the environment.
client = AsyncAnthropic()

# verify_documents and audit_store are assumed to be provided elsewhere
# in the application (a document service client and an append-only store).


async def run_underwriting_agent(application: dict) -> dict:
    decision = AuditableDecision(decision_id=application["application_id"])

    # Step 1: Document verification
    doc_result = await verify_documents(application["documents"])
    decision.data_sources_consulted.append("document_verification_service")
    decision.add_reasoning_step(
        "Verified submitted financial documents",
        {"documents_checked": doc_result["count"],
         "verification_status": doc_result["status"]}
    )
    if doc_result["status"] != "verified":
        decision.finalise(
            outcome="ESCALATE_TO_HUMAN",
            confidence=1.0,
            explanation=f"Document verification failed: {doc_result['reason']}. "
                        "Routed to human reviewer for manual document check."
        )
        await audit_store.append(decision.to_audit_record())
        return {"status": "escalated", "reason": "document_verification"}

    # Record exactly what the model is about to see, so the audit
    # record's "inputs" field is not left empty.
    decision.inputs_used = {"financials": application["financials"]}

    # Step 2: Risk scoring via Claude with explicit reasoning
    risk_response = await client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1500,
        system="""Analyse this loan application's risk factors.
You MUST cite specific data points for every risk factor
identified. Vague reasoning fails compliance review.
Return JSON: {
"risk_score": float,
"risk_factors": [{"factor": str, "data_point": str, "weight": str}],
"recommendation": "APPROVE" | "DENY" | "HUMAN_REVIEW",
"explanation": "2-3 sentences a non-technical auditor can verify"
}""",
        messages=[{"role": "user", "content": json.dumps(application["financials"])}]
    )
    # Raises json.JSONDecodeError if the model returns anything but JSON.
    risk_analysis = json.loads(risk_response.content[0].text)
    decision.add_reasoning_step(
        "Risk assessment based on financial data",
        risk_analysis
    )

    # Step 3: Threshold-based routing - NOT autonomous final decisions
    if risk_analysis["risk_score"] > 0.7:
        outcome = "HUMAN_REVIEW"  # high risk always escalates
    elif risk_analysis["risk_score"] < 0.3:
        outcome = "AUTO_APPROVE"  # only very low risk auto-decides
    else:
        outcome = "HUMAN_REVIEW"  # middle ground escalates too

    decision.finalise(
        outcome=outcome,
        # Confidence grows with distance from the ambiguous midpoint:
        # 0.0 at a score of 0.5, 1.0 at a score of 0.0 or 1.0.
        confidence=2 * abs(0.5 - risk_analysis["risk_score"]),
        explanation=risk_analysis["explanation"]
    )
    await audit_store.append(decision.to_audit_record())
    return {"status": outcome, "decision_record": decision.to_audit_record()}
Note the threshold structure: only very low risk applications auto-approve. Everything else, high risk and the ambiguous middle ground, escalates to a human. This isn't conservative for the sake of it; it's the threshold design that auditors expect and that protects the institution from autonomous decisions on the cases that matter most.
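The pipeline above parses the model's JSON directly, so a malformed or vague response would either raise or slip into the audit record unchecked. Here is a minimal sketch of a validation step that could sit between the model call and the routing logic. The function name `parse_risk_analysis` is hypothetical, and the assumption that `risk_score` lies between 0 and 1 is inferred from the 0.3 and 0.7 thresholds rather than stated anywhere.

```python
import json
from typing import Optional

ALLOWED_RECOMMENDATIONS = {"APPROVE", "DENY", "HUMAN_REVIEW"}


def parse_risk_analysis(raw_text: str) -> Optional[dict]:
    """Return the parsed analysis, or None if it is unusable.

    Hypothetical helper: the checks mirror the JSON shape that the
    system prompt asks the model to return.
    """
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None

    score = data.get("risk_score")
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return None
    # Assumed scale: scores run from 0 (lowest risk) to 1 (highest).
    if not 0.0 <= score <= 1.0:
        return None

    recommendation = data.get("recommendation")
    if not isinstance(recommendation, str):
        return None
    if recommendation not in ALLOWED_RECOMMENDATIONS:
        return None

    factors = data.get("risk_factors")
    if not isinstance(factors, list):
        return None
    for factor in factors:
        # Every factor must cite a concrete, non-empty data point.
        if not isinstance(factor, dict):
            return None
        if not str(factor.get("data_point", "")).strip():
            return None

    explanation = data.get("explanation")
    if not isinstance(explanation, str) or not explanation.strip():
        return None
    return data
```

A caller would treat a `None` result the same way as a failed document check: finalise the decision as an escalation and write the audit record, rather than retrying silently.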
Human-in-the-Loop: Not a Button, a Checkpoint
A human-in-the-loop checkpoint needs to give the reviewer enough information to make a genuine decision, not just a button to click.
# get_applicable_lending_regulations is assumed to be defined elsewhere.
def build_human_review_package(decision: AuditableDecision,
                               application: dict) -> dict:
    return {
        "application_summary": application["summary"],
        "agent_reasoning_chain": decision.reasoning_steps,
        "agent_recommendation": decision.outcome,
        "confidence_level": decision.confidence,
        # Weights live on the individual risk factors inside each step's
        # evidence, so collect the high-weight factors from there.
        "specific_concerns": [
            factor
            for step in decision.reasoning_steps
            for factor in step.get("evidence", {}).get("risk_factors", [])
            if factor.get("weight") == "high"
        ],
        "override_requires_justification": True,
        "regulatory_basis": get_applicable_lending_regulations(application)
    }
Setting override_requires_justification to True matters because a reviewer who can override without explaining why produces rubber-stamp approvals, not genuine oversight, and rubber-stamping is specifically what examiners look for.
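One way to make that flag more than a hint to the UI is to enforce it where the reviewer's decision is recorded. The sketch below is illustrative only: `ReviewerOverride`, `record_override`, and the 20-character minimum are assumptions made for the example, not part of the pipeline above or any regulatory figure.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Assumed policy value for illustration; not a regulatory requirement.
MIN_JUSTIFICATION_CHARS = 20


@dataclass(frozen=True)
class ReviewerOverride:
    decision_id: str
    reviewer_id: str
    agent_recommendation: str
    reviewer_outcome: str
    is_override: bool
    justification: str
    recorded_at: str


def record_override(review_package: dict, decision_id: str, reviewer_id: str,
                    reviewer_outcome: str, justification: str) -> ReviewerOverride:
    """Build the reviewer's record, rejecting unexplained overrides."""
    agent_recommendation = review_package["agent_recommendation"]
    is_override = reviewer_outcome != agent_recommendation
    text = justification.strip()

    # Default to requiring a justification if the flag is missing.
    required = review_package.get("override_requires_justification", True)
    if is_override and required and len(text) < MIN_JUSTIFICATION_CHARS:
        raise ValueError(
            "Override rejected: a written justification is required."
        )

    return ReviewerOverride(
        decision_id=decision_id,
        reviewer_id=reviewer_id,
        agent_recommendation=agent_recommendation,
        reviewer_outcome=reviewer_outcome,
        is_override=is_override,
        justification=text,
        recorded_at=datetime.now(timezone.utc).isoformat(),
    )
```

The resulting record would be appended to the same audit store as the agent's decision, so the override and its justification sit next to the recommendation they replaced.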
PII Handling in Prompts
Financial data going into LLM prompts needs careful handling, especially for cross-border banks with data residency requirements.
import copy

# hash_id and tokenise_address are assumed to be defined elsewhere.


def sanitise_for_prompt(application: dict, residency_zone: str) -> dict:
    """
    Strip or tokenise PII before it reaches the LLM prompt,
    based on the data residency requirements for this customer.
    """
    # Deep copy so nested fields on the original application are never mutated.
    sanitised = copy.deepcopy(application)

    # Replace direct identifiers with tokens
    sanitised["applicant_name"] = f"APPLICANT_{hash_id(application['applicant_id'])}"
    sanitised["ssn_or_national_id"] = "[REDACTED]"

    if residency_zone == "EU":
        # GDPR-specific handling
        sanitised["address"] = tokenise_address(application["address"])

    return sanitised
For cross-border banks, the model call itself may need to happen in a specific region, or via a locally-hosted model, depending on the residency zone. This is an architectural decision made before any agent code is written, not a configuration flag added later.
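As a rough illustration of what deciding this up front can look like, the sketch below resolves a residency zone to an approved model endpoint and fails closed when none exists. Everything in it is hypothetical: the zone names, region labels, and `example-bank.internal` URLs are placeholders, and `endpoint_for` is not part of any real SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelEndpoint:
    region: str
    base_url: str


# Hypothetical mapping; real values come from your compliance team
# and your model provider's regional offerings.
APPROVED_ENDPOINTS = {
    "EU": ModelEndpoint(region="eu-central",
                        base_url="https://llm.eu.example-bank.internal"),
    "UK": ModelEndpoint(region="uk-south",
                        base_url="https://llm.uk.example-bank.internal"),
}


class ResidencyError(RuntimeError):
    """Raised when no approved endpoint exists for a residency zone."""


def endpoint_for(residency_zone: str) -> ModelEndpoint:
    """Return the approved endpoint for a zone, or fail closed."""
    try:
        return APPROVED_ENDPOINTS[residency_zone]
    except KeyError:
        # Never fall back to a default region: an unknown zone means
        # the data must not leave the caller until someone decides.
        raise ResidencyError(
            f"No approved model endpoint for residency zone {residency_zone!r}"
        ) from None
```

The design choice worth noting is the absence of a default: an unmapped zone stops the request instead of quietly routing customer data to whichever region happens to be configured.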
What Auditors Actually Check
Decision immutability: audit records can't be modified after creation.
Reasoning chain completeness: every step that influenced the outcome must be in the record, not reconstructed from application logs after the fact.
Override documentation: every human override of an agent recommendation requires a written justification.
Threshold justification: the bank must be able to explain why the auto-approve threshold is set where it is, with evidence it doesn't create disparate impact.
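Decision immutability is usually enforced by the storage layer, but it helps to be able to demonstrate that records have not been altered. One common technique is hash chaining, where each entry's hash covers the previous entry's hash. The class below is a minimal in-memory sketch under that assumption; `HashChainedAuditLog` is a hypothetical name, it is synchronous unlike the `audit_store` used in the pipeline, and a production store would persist entries to write-once storage.

```python
import hashlib
import json


class HashChainedAuditLog:
    """Append-only log where each entry commits to the one before it."""

    GENESIS = "0" * 64  # stand-in "previous hash" for the first entry

    def __init__(self):
        self._entries = []

    @staticmethod
    def _digest(prev_hash: str, payload: str) -> str:
        return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()

    def append(self, record: dict) -> str:
        """Serialise the record, chain it to the last entry, return its hash."""
        prev_hash = self._entries[-1]["entry_hash"] if self._entries else self.GENESIS
        # Canonical serialisation so the same record always hashes the same.
        payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
        entry_hash = self._digest(prev_hash, payload)
        self._entries.append({
            "prev_hash": prev_hash,
            "payload": payload,
            "entry_hash": entry_hash,
        })
        return entry_hash

    def verify(self) -> bool:
        """Recompute every hash; False means some entry was altered."""
        prev_hash = self.GENESIS
        for entry in self._entries:
            if entry["prev_hash"] != prev_hash:
                return False
            if entry["entry_hash"] != self._digest(prev_hash, entry["payload"]):
                return False
            prev_hash = entry["entry_hash"]
        return True
```

Because each hash depends on everything before it, editing one historical record invalidates every later entry, which is what lets `verify()` detect tampering without comparing against a second copy.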
The full AI agents for underwriting in banking architecture guide covers the complete pipeline including the specific regulatory frameworks (FCRA, ECOA, FCA lending standards) that shape these design decisions.
Underwriting Is One Workflow
The patterns here (audit trails, explainability, human-in-the-loop thresholds) generalise across banking AI use cases, but the specific requirements differ meaningfully by workflow. Fraud detection has different latency requirements (you need decisions in milliseconds, not minutes) and different audit needs (pattern-based reasoning rather than document-based reasoning). It's worth reading the fraud detection architecture guide before you design either system, because the architectural choices you make for one affect how cleanly you can extend to the other.
Published by Dextra Labs | AI Consulting & Enterprise Agent Development