How to Architect AI Agents That Pass Banking Compliance Audits (Real Patterns, Not Theory)

Building agents for banking is 30% AI work and 70% compliance plumbing. The 30% is the easy part. Here's how to handle the 70%.

The first time you build an underwriting agent for a bank, you'll write the credit logic in about a week. Then you'll spend the next two months on audit logging, explainability, human-in-the-loop checkpoints, and data residency, and you'll understand why fintech AI projects take longer than the demo suggests.

This is the architecture that gets through an audit, using loan underwriting as the running example.

The Core Requirement: Every Decision Must Be Traceable

Regulators don't accept "the model gave it a score of 0.82." They want to know what data was used, what reasoning was applied, and what a human would need to review to understand and challenge the decision.

from datetime import datetime, timezone

# Placeholder value: set this from your own release or deployment metadata.
AGENT_VERSION = "underwriting-agent-v1"


class AuditableDecision:
    """
    Every agent decision in a regulated context must
    produce this structure. Not optional. Not added later.
    """
    def __init__(self, decision_id: str):
        self.decision_id = decision_id
        self.inputs_used = {}
        self.reasoning_steps = []
        self.data_sources_consulted = []
        self.model_version = AGENT_VERSION
        # Timezone-aware UTC; datetime.utcnow() is deprecated since Python 3.12
        self.timestamp = datetime.now(timezone.utc).isoformat()
        self.outcome = None
        self.confidence = None
        self.human_reviewable_explanation = None

    def add_reasoning_step(self, step_description: str, evidence: dict):
        # Append as the step happens, so the chain is never rebuilt from logs
        self.reasoning_steps.append({
            "step": len(self.reasoning_steps) + 1,
            "description": step_description,
            "evidence": evidence,
            "timestamp": datetime.now(timezone.utc).isoformat()
        })

    def finalise(self, outcome: str, confidence: float, explanation: str):
        self.outcome = outcome
        self.confidence = confidence
        self.human_reviewable_explanation = explanation

    def to_audit_record(self) -> dict:
        return {
            "decision_id": self.decision_id,
            "inputs": self.inputs_used,
            "reasoning_chain": self.reasoning_steps,
            "data_sources": self.data_sources_consulted,
            "model_version": self.model_version,
            "outcome": self.outcome,
            "confidence": self.confidence,
            "explanation": self.human_reviewable_explanation,
            "timestamp": self.timestamp
        }

Every step in the agent's reasoning is appended to this structure as it happens. Reconstructing the chain afterward from logs is unreliable, and it is something auditors specifically check for.

The Underwriting Decision Pipeline

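import json

from anthropic import AsyncAnthropic

client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment
# verify_documents() and audit_store are assumed to be defined elsewhere in your service.


async def run_underwriting_agent(application: dict) -> dict:
    decision = AuditableDecision(decision_id=application["application_id"])
    # Record what the model will see, so the audit record's "inputs" field is not empty
    decision.inputs_used = application["financials"]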

    # Step 1: Document verification
    doc_result = await verify_documents(application["documents"])
    decision.data_sources_consulted.append("document_verification_service")
    decision.add_reasoning_step(
        "Verified submitted financial documents",
        {"documents_checked": doc_result["count"], 
         "verification_status": doc_result["status"]}
    )

    if doc_result["status"] != "verified":
        decision.finalise(
            outcome="ESCALATE_TO_HUMAN",
            confidence=1.0,
            explanation=f"Document verification failed: {doc_result['reason']}. "
                        f"Routed to human reviewer for manual document check."
        )
        await audit_store.append(decision.to_audit_record())
        return {"status": "escalated", "reason": "document_verification"}

    # Step 2: Risk scoring via Claude with explicit reasoning
    risk_response = await client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1500,
        system="""Analyse this loan application's risk factors. 
        You MUST cite specific data points for every risk factor 
        identified. Vague reasoning fails compliance review.

        Return JSON: {
            "risk_score": float,
            "risk_factors": [{"factor": str, "data_point": str, "weight": str}],
            "recommendation": "APPROVE" | "DENY" | "HUMAN_REVIEW",
            "explanation": "2-3 sentences a non-technical auditor can verify"
        }""",
        messages=[{"role": "user", "content": json.dumps(application["financials"])}]
    )

    risk_analysis = json.loads(risk_response.content[0].text)
    decision.add_reasoning_step(
        "Risk assessment based on financial data",
        risk_analysis
    )

    # Step 3: Threshold-based routing - NOT autonomous final decisions
    if risk_analysis["risk_score"] > 0.7:
        outcome = "HUMAN_REVIEW"  # high risk always escalates
    elif risk_analysis["risk_score"] < 0.3:
        outcome = "AUTO_APPROVE"  # only very low risk auto-decides
    else:
        outcome = "HUMAN_REVIEW"  # middle ground escalates too

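    decision.finalise(
        outcome=outcome,
        # Distance from the 0.5 midpoint, scaled to [0, 1]: scores near 0 or 1
        # are confident, scores near 0.5 are ambiguous
        confidence=abs(risk_analysis["risk_score"] - 0.5) * 2,
        explanation=risk_analysis["explanation"]
    )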

    await audit_store.append(decision.to_audit_record())
    return {"status": outcome, "decision_record": decision.to_audit_record()}

Note the threshold structure: only very low risk applications auto-approve. Everything else, high risk and the ambiguous middle ground, escalates to a human. This isn't conservative for the sake of it; it's the threshold design that auditors expect and that protects the institution from autonomous decisions on the cases that matter most.
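
The routing rule above can also be pulled into a small function that fails closed when the model's JSON is malformed. This is a minimal sketch, not the pipeline's actual code: `route_decision`, `RoutingResult`, and the rule that an analysis with no cited risk factors goes to a human are assumptions added for illustration, and the 0.3 threshold simply mirrors the value used in the pipeline.

```python
from dataclasses import dataclass

AUTO_APPROVE_BELOW = 0.3  # assumed threshold, mirroring the pipeline above


@dataclass(frozen=True)
class RoutingResult:
    outcome: str
    reason: str


def route_decision(risk_analysis: dict) -> RoutingResult:
    """Fail closed: anything that is not a clean, very-low-risk score goes to a human."""
    score = risk_analysis.get("risk_score")

    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, (int, float)):
        return RoutingResult("HUMAN_REVIEW", "risk_score missing or not numeric")
    # NaN fails this comparison too, so it also escalates.
    if not 0.0 <= score <= 1.0:
        return RoutingResult("HUMAN_REVIEW", "risk_score outside [0, 1]")
    # Assumption: an analysis without cited risk factors is not auditable.
    if not risk_analysis.get("risk_factors"):
        return RoutingResult("HUMAN_REVIEW", "no cited risk factors")
    if score < AUTO_APPROVE_BELOW:
        return RoutingResult("AUTO_APPROVE", "risk_score below auto-approve threshold")
    return RoutingResult("HUMAN_REVIEW", "risk_score at or above auto-approve threshold")
```

Returning a reason alongside the outcome gives the audit record a plain-language answer to "why was this escalated?" without re-deriving it from the score.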

Human-in-the-Loop: Not a Button, a Checkpoint

A human-in-the-loop checkpoint needs to give the reviewer enough information to make a genuine decision, not just a button to click.

def build_human_review_package(decision: AuditableDecision, 
                                 application: dict) -> dict:
    return {
        "application_summary": application["summary"],
        "agent_reasoning_chain": decision.reasoning_steps,
        "agent_recommendation": decision.outcome,
        "confidence_level": decision.confidence,
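        # Risk factors live inside each step's evidence (see the "risk_factors"
        # list in the model's JSON), so flatten them before filtering on weight
        "specific_concerns": [
            factor
            for step in decision.reasoning_steps
            for factor in step["evidence"].get("risk_factors", [])
            if factor.get("weight") == "high"
        ],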
        "override_requires_justification": True,
        "regulatory_basis": get_applicable_lending_regulations(application)
    }

override_requires_justification: True matters because a reviewer who can override without explaining why produces rubber-stamp approval rates, not genuine oversight, and that's specifically what examiners check for.
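
One way to make that flag more than a hint to the UI is to enforce it where the reviewer's decision is recorded. The sketch below is illustrative only: `record_review`, `ReviewerDecision`, and the 30-character minimum are hypothetical names and values, not part of the pipeline above and not a regulatory figure.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

MIN_JUSTIFICATION_CHARS = 30  # assumed internal policy value


@dataclass(frozen=True)
class ReviewerDecision:
    decision_id: str
    reviewer_id: str
    agent_recommendation: str
    reviewer_outcome: str
    justification: str
    is_override: bool
    reviewed_at: str


def record_review(decision_id: str, reviewer_id: str, agent_recommendation: str,
                  reviewer_outcome: str, justification: str = "") -> ReviewerDecision:
    """Build a review record, rejecting overrides that lack a written justification."""
    text = justification.strip()
    is_override = reviewer_outcome != agent_recommendation
    if is_override and len(text) < MIN_JUSTIFICATION_CHARS:
        raise ValueError(
            "Overriding the agent recommendation requires a written justification"
        )
    return ReviewerDecision(
        decision_id=decision_id,
        reviewer_id=reviewer_id,
        agent_recommendation=agent_recommendation,
        reviewer_outcome=reviewer_outcome,
        justification=text,
        is_override=is_override,
        reviewed_at=datetime.now(timezone.utc).isoformat(),
    )
```

A length check cannot tell a real justification from filler, so it is only a floor; the recorded `is_override` flag is what lets you report override rates per reviewer later.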

PII Handling in Prompts

Financial data going into LLM prompts needs careful handling, especially for cross-border banks with data residency requirements.

def sanitise_for_prompt(application: dict, residency_zone: str) -> dict:
    """
    Strip or tokenise PII before it reaches the LLM prompt,
    based on the data residency requirements for this customer.
    """
    sanitised = application.copy()

    # Replace direct identifiers with tokens
    sanitised["applicant_name"] = f"APPLICANT_{hash_id(application['applicant_id'])}"
    sanitised["ssn_or_national_id"] = "[REDACTED]"

    if residency_zone == "EU":
        # GDPR-specific handling
        sanitised["address"] = tokenise_address(application["address"])

    return sanitised
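
The hash_id helper is not shown in the function above. One possible shape for it, offered as an assumption rather than the actual implementation, is a keyed hash: `tokenise_id` and its `secret_key` parameter are hypothetical, and the key would need to come from your own secrets management. Python's built-in hash() is not suitable here because string hashes are randomised per process, so the same applicant would get a different token on every run.

```python
import hashlib
import hmac


def tokenise_id(applicant_id: str, secret_key: bytes) -> str:
    """Return a stable token for an applicant ID that cannot be reversed without the key."""
    digest = hmac.new(secret_key, applicant_id.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncated for readability in prompts; keep the full digest if collisions are a concern.
    return f"APPLICANT_{digest[:16]}"
```

Using a keyed HMAC rather than a bare SHA-256 means someone who sees the prompt cannot confirm a guessed applicant ID by hashing it themselves.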

For cross-border banks, the model call itself may need to happen in a specific region, or via a locally hosted model, depending on the residency zone. This is an architectural decision made before any agent code is written, not a configuration flag added later.
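
Whatever the architecture ends up being, the agent code can at least refuse to call a model when no endpoint has been approved for a customer's zone. The sketch below assumes a simple zone-to-endpoint table; the zone names, regions, URLs, and the `resolve_endpoint` function are all hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelEndpoint:
    region: str
    base_url: str
    self_hosted: bool


# Hypothetical entries: replace with the endpoints your compliance team has approved.
ENDPOINTS_BY_ZONE = {
    "EU": ModelEndpoint("eu-central", "https://llm.eu.internal.example", self_hosted=True),
    "US": ModelEndpoint("us-east", "https://llm.us.internal.example", self_hosted=False),
}


class ResidencyError(Exception):
    """Raised when no approved endpoint exists for a residency zone."""


def resolve_endpoint(residency_zone: str) -> ModelEndpoint:
    """Fail closed: never fall back to a default region for an unknown zone."""
    try:
        return ENDPOINTS_BY_ZONE[residency_zone]
    except KeyError:
        raise ResidencyError(
            f"No approved model endpoint for residency zone {residency_zone!r}"
        ) from None
```

The important property is the absence of a default: an unmapped zone stops the request instead of silently sending data to whichever region happens to be configured.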

What Auditors Actually Check

Decision immutability: audit records can't be modified after creation.
Reasoning chain completeness: every step that influenced the outcome must be in the record, not reconstructed from application logs after the fact.
Override documentation: every human override of an agent recommendation requires a written justification.
Threshold justification: the bank must be able to explain why the auto-approve threshold is set where it is, with evidence it doesn't create disparate impact.
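
Decision immutability can be made checkable by having each audit entry commit to the one before it. The following is a minimal in-memory sketch of that hash-chaining idea, not the audit_store used in the pipeline (which is awaited and whose implementation is not shown); `HashChainedAuditLog` is a hypothetical name. It makes tampering detectable rather than impossible, so it would normally sit on top of storage that is itself write-once.

```python
import hashlib
import json


class HashChainedAuditLog:
    """In-memory, append-only log where each entry commits to the previous one."""

    GENESIS = "0" * 64  # stand-in "previous hash" for the first entry

    def __init__(self):
        self._entries = []

    def append(self, record: dict) -> str:
        prev_hash = self._entries[-1]["entry_hash"] if self._entries else self.GENESIS
        # Canonical serialisation so the same record always hashes the same way
        payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
        entry_hash = hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()
        self._entries.append(
            {"payload": payload, "prev_hash": prev_hash, "entry_hash": entry_hash}
        )
        return entry_hash

    def verify(self) -> bool:
        """Recompute every hash; any edited, removed, or reordered entry breaks the chain."""
        prev_hash = self.GENESIS
        for entry in self._entries:
            expected = hashlib.sha256(
                (prev_hash + entry["payload"]).encode("utf-8")
            ).hexdigest()
            if entry["prev_hash"] != prev_hash or entry["entry_hash"] != expected:
                return False
            prev_hash = entry["entry_hash"]
        return True
```

Storing the serialised payload rather than the live dict matters: the caller can keep mutating its own object without changing what was logged.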

The full architecture guide on AI agents for underwriting in banking covers the complete pipeline, including the specific regulatory frameworks (FCRA, ECOA, FCA lending standards) that shape these design decisions.

Underwriting Is One Workflow

The patterns here (audit trails, explainability, human-in-the-loop thresholds) generalise across banking AI use cases, but the specific requirements differ meaningfully by workflow. Fraud detection has different latency requirements (you need decisions in milliseconds, not minutes) and different audit needs (pattern-based reasoning rather than document-based reasoning). It's worth reading the fraud detection architecture guide before you design either system, because the architectural choices you make for one affect how cleanly you can extend to the other.

Published by Dextra Labs | AI Consulting & Enterprise Agent Development