兆鹏于

Posted on Jun 29

Banking AI Agent Architecture: From Monolith to Skill-Based Coordination

#agents #ai #architecture #systemdesign

Walk into any bank's IT department and you hear the same sighs: the system is slow again, the new requirement can't be scheduled, and the regulators are back. This isn't one bank's problem — it's the entire industry's. Let me break down the three fundamental traps I've seen play out repeatedly, and how we can escape them with a skill-based agent architecture.

The Three Traps of Bank IT

Trap 1: The Monolith's "Pull One Thread, Unravel Everything"

One city commercial bank I worked with had a core lending system that ballooned to 2.3 million lines of code in a single repo. Changing one rule branch meant a 3-day regression test cycle. When the business said "just a small tweak," the dev team responded: "Are you sure this won't break the other 47 call sites?"

┌─────────────────────────────────────────────┐
│        Monolith Architecture (2.3M LOC)      │
│  ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐    │
│  │Credit │ │ Risk  │ │Post-  │ │Report │    │
│  │Module │ │Module │ │loan   │ │Module │    │
│  └───┬───┘ └───┬───┘ └───┬───┘ └───┬───┘    │
│      └─────┬───┴───────┬┘         │         │
│         ┌──┴───────────┴──┐       │         │
│         │ Shared DB Layer  │───────┘         │
│         └─────────────────┘                 │
│  Pain: 1 change → full regression → 3 days  │
└─────────────────────────────────────────────┘

Trap 2: Rules Engine vs. LLM — Flying Solo

The risk team ran a rules engine, the customer service team used an LLM for intent recognition, and the compliance team manually checked with Excel. Three systems, three languages, zero unified data standards. When regulators asked "explain the basis for this approval decision," risk said "rule 37 was triggered," the LLM said "based on semantic analysis," and compliance said "manual review needed." Three logic chains that couldn't line up.

Trap 3: Reinventing the Wheel Across Business Lines

The retail team built a KYC agent. The corporate team built another KYC agent. The interbank team was building a third. Three teams with zero awareness of each other duplicated the same capability, and even their entity extraction prompts were all different. Worse: when AML rules were updated, all three KYC agents had to be patched separately.

The root cause isn't outdated technology. It's that the architecture paradigm hasn't kept up with AI capability evolution. Monoliths serve deterministic logic, but the AI era needs composable, replaceable, evaluable agent collaboration.

From "Buying Systems" to "Raising Agents": Three Phases of Digital Transformation

Bank digital transformation isn't a single leap — it goes through three phases, each with different architectures and governance models.

Phase 1: System Procurement (2010-2018)

The keyword was "buy." Core systems from IBM, lending systems from Hundsun, CRM from Siebel. Fast to deploy — but you bought a bunch of black boxes. When custom requirements came, vendor quotes made you question your life choices. The deeper problem: data locked in silos. A "360-degree customer view" required extracting from 6 systems, ETL pipelines running 4 hours, T+1 data freshness.

Phase 2: Platform Construction (2018-2023)

The middle-platform concept surged. Banks built data platforms, AI platforms, business platforms. Right direction, flawed execution. Many banks' platforms became "a second core system" — heavy on construction, light on operations, lots of platform, no scenarios.

One joint-stock bank's AI platform had only 7 connected scenarios a year after launch, averaging fewer than 200 API calls per day. The root cause: the platform provided "raw capabilities" (OCR, NLP, speech recognition), not "business skills" (intelligent credit document review, compliance document comparison). Business teams still had to wrap their own logic on top of the APIs.
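
To make the gap between a raw capability and a business skill concrete, here is a minimal sketch. The functions `ocr_extract` and `classify_doc_type`, and the list of required documents, are hypothetical stand-ins for illustration, not a real platform API.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical raw capabilities: stand-ins for a platform's OCR and NLP endpoints.
def ocr_extract(file_name: str) -> str:
    return f"text of {file_name}"

def classify_doc_type(text: str) -> str:
    if "license" in text:
        return "business_license"
    if "statement" in text:
        return "financial_statement"
    return "unknown"

@dataclass
class ReviewResult:
    complete: bool
    missing: List[str] = field(default_factory=list)

# Business rule (assumed for illustration): a credit file needs these documents.
REQUIRED_DOCS = ["business_license", "financial_statement"]

def review_credit_documents(files: List[str]) -> ReviewResult:
    """Business skill: composes the raw capabilities and applies the business rule."""
    found = {classify_doc_type(ocr_extract(f)) for f in files}
    missing = [d for d in REQUIRED_DOCS if d not in found]
    return ReviewResult(complete=not missing, missing=missing)

result = review_credit_documents(["license.pdf"])
print(result.complete, result.missing)
```

The business team calls `review_credit_documents` and gets an answer in business terms; the OCR and classification calls stay hidden behind the skill.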

Phase 3: Agent Raising (2023-Present)

LLMs changed the game. When an LLM can understand business language and execute multi-step tasks, banks can finally stop "buying systems" and start "raising agents."

class BankingAgent:
    def __init__(self, name: str, skills: list, memory_store: dict):
        self.name = name
        self.llm = LLMEngine(model="qwen2.5-72b")
        self.skills = {s.name: s for s in skills}
        self.memory = MemoryStore(memory_store)

    async def execute(self, task: str, context: dict):
        # Step 1: LLM understands the task, selects skills
        skill_plan = await self.llm.plan(
            task, available_skills=list(self.skills.keys())
        )

        # Step 2: Execute skills in planned order
        results = {}
        for step in skill_plan.steps:
            skill = self.skills[step.skill_name]
            results[step.id] = await skill.run(step.params, context)
            # Write each step result to memory for downstream skills
            self.memory.set(step.id, results[step.id])

        # Step 3: Synthesize output
        return await self.llm.synthesize(task, results)

"Raising" means: agents start with simple tasks, accumulate experience (Memory) in production, gradually unlock new capabilities (Skills), and evolve from "one-trick ponies" to "independent operators." More flexible than buying systems, more practical than building platforms.
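
One way to picture "gradually unlocking skills" is a gate that opens a harder skill only after a prerequisite skill has built a track record. This is a sketch under assumptions: the `SkillGate` class and its thresholds are hypothetical and do not come from a specific framework.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class SkillRecord:
    runs: int = 0
    successes: int = 0

    @property
    def success_rate(self) -> float:
        return self.successes / self.runs if self.runs else 0.0

class SkillGate:
    """Unlocks a harder skill once a prerequisite skill has a proven track record."""

    def __init__(self, min_runs: int = 100, min_success_rate: float = 0.98):
        # Thresholds are illustrative assumptions, not values from the article.
        self.min_runs = min_runs
        self.min_success_rate = min_success_rate
        self.records: Dict[str, SkillRecord] = {}

    def record(self, skill: str, success: bool) -> None:
        rec = self.records.setdefault(skill, SkillRecord())
        rec.runs += 1
        rec.successes += int(success)

    def unlocked(self, prerequisite: str) -> bool:
        rec = self.records.get(prerequisite, SkillRecord())
        return rec.runs >= self.min_runs and rec.success_rate >= self.min_success_rate

gate = SkillGate(min_runs=3, min_success_rate=0.66)
for outcome in (True, True, False):
    gate.record("kyc", outcome)
print(gate.unlocked("kyc"))
```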

LLM + Rules Engine: The Hybrid Architecture for Finance

In financial scenarios, pure-LLM and pure-rules approaches both have fatal flaws. Pure LLMs are unexplainable and uncontrollable. Pure rules are inflexible and expensive to maintain. The answer is combining both.

Why Hybrid Is Non-Negotiable

A bank tried pure LLM for credit approval. Test accuracy: 95%. Regulators shut it down. Why? It couldn't explain why Customer A was rejected. The LLM said "the model assessed high risk overall," but regulators demand: "which rule was hit, what was the weight, and what data was the basis?"

Conversely, pure rules had their own bottleneck. One bank's risk rules grew from 120 to 4,700. Conflicts between rules multiplied. The maintenance team swelled from 3 to 17 people. Every rule update required two weeks of conflict detection.

Hybrid Architecture Design

Core principle: Rules as safety net, models as enhancement, arbitration for conflicts, full explainability throughout.

from typing import TypedDict, Optional
from enum import Enum

class DecisionSource(Enum):
    RULE = "rule"
    LLM = "llm"
    HYBRID = "hybrid"

class HybridDecision(TypedDict):
    result: str           # approve / reject / review
    confidence: float     # 0-1
    source: DecisionSource
    rule_hits: list       # rules that were triggered
    llm_factors: list     # factors contributed by LLM
    explanation: str      # explainable output

class HybridDecisionEngine:
    def __init__(self, rule_engine, llm_engine, arbitration_policy="rule_first"):
        self.rule_engine = rule_engine
        self.llm_engine = llm_engine
        self.policy = arbitration_policy

    async def decide(self, application: dict) -> HybridDecision:
        # Step 1: Run rules engine first — deterministic logic skips the model
        rule_result = self.rule_engine.evaluate(application)

        # Hard rule hit: direct decision, no model needed
        if rule_result["hard_hit"]:
            return {
                "result": rule_result["action"],
                "confidence": 1.0,
                "source": DecisionSource.RULE,
                "rule_hits": rule_result["hits"],
                "llm_factors": [],
                "explanation": self._rule_explanation(rule_result["hits"])
            }

        # Step 2: Gray area — let the model supplement
        llm_result = await self.llm_engine.analyze(
            application=application,
            rule_context=rule_result,
            prompt="Based on the application and rule evaluation, supplement risk analysis"
        )

        # Step 3: Arbitration
        final = self._arbitrate(rule_result, llm_result)
        return final

    def _rule_explanation(self, hits: list) -> str:
        # Assumed minimal implementation: list the rules that forced the decision
        return f"Hard rule(s) triggered: {hits}"

    def _arbitrate(self, rule_result: dict, llm_result: dict) -> HybridDecision:
        rule_action = rule_result["soft_action"]
        llm_tendency = llm_result["tendency"]
        # Fields shared by every hybrid decision
        base = {
            "source": DecisionSource.HYBRID,
            "rule_hits": rule_result["hits"],
            "llm_factors": llm_result["factors"],
        }

        # Both agree → high-confidence output
        if rule_action == llm_tendency:
            return {
                **base,
                "result": rule_action,
                "confidence": 0.95,
                "explanation": f"Rules and LLM agree: {rule_action}"
            }

        # Rule says reject + LLM says approve → never auto-approve; a human decides
        if rule_action == "reject" and llm_tendency == "approve":
            return {
                **base,
                "result": "review",
                "confidence": 0.6,
                "explanation": f"Rules flagged {rule_result['hits']}, LLM leans approve; escalate to human review"
            }

        # Any other disagreement (e.g. rules approve but the LLM has concerns)
        # → human review, with the LLM factors attached
        return {
            **base,
            "result": "review",
            "confidence": 0.7,
            "explanation": f"Rules lean {rule_action} but LLM identifies {llm_result['factors']}; human review recommended"
        }

Real-World Results

After deploying the hybrid architecture in a consumer lending scenario:

Metric	Rules Only	LLM Only	Hybrid
Approval accuracy	87.3%	94.8%	95.2%
Explainability pass rate	100%	23%	100%
Human intervention rate	31%	12%	8%
Rule maintenance cost	High	Low	Medium
Regulatory audit pass	Yes	No	Yes

The pure LLM beat rules alone on accuracy (94.8% vs. 87.3%), but its 23% explainability pass rate made it undeployable. The hybrid approach kept explainability at 100% while reaching 95.2% accuracy, 7.9 points above rules alone and slightly above the pure LLM.
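
The comparison can be checked with a few lines of arithmetic. The figures below are copied from the results table; nothing else is assumed.

```python
# Figures copied from the results table above (percentages).
accuracy = {"rules": 87.3, "llm": 94.8, "hybrid": 95.2}
human_intervention = {"rules": 31.0, "llm": 12.0, "hybrid": 8.0}

# Accuracy gain of hybrid over rules only, in percentage points
accuracy_gain = round(accuracy["hybrid"] - accuracy["rules"], 1)

# Drop in cases sent to humans, absolute (points) and relative to rules only
intervention_drop = human_intervention["rules"] - human_intervention["hybrid"]
relative_drop = round(intervention_drop / human_intervention["rules"], 2)

print(accuracy_gain)      # 7.9
print(intervention_drop)  # 23.0
print(relative_drop)      # 0.74
```

In other words, hybrid removes roughly three quarters of the manual reviews that the rules-only setup required.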

In financial scenarios, explainability isn't optional — it's an entry requirement. The hybrid architecture's essence isn't piling rules on models, but letting each do what it's best at, with an arbitration layer to resolve conflicts.

How 7 Skills Collaborate: The Agent Architecture Design

Now for the core architecture question: how do 7 skills work together? Let me walk through a complete corporate lending scenario, from customer onboarding to post-loan monitoring.

7 Skills and Their Responsibilities

Skill	Responsibility	Input	Output	Key Capability
KYC Verification	Identity verification & profiling	Business registration + legal rep info	Customer profile + risk labels	Entity extraction, relationship graph
Credit Assessment	Credit rating & limit calculation	Financial statements + credit data	Credit grade + suggested limit	Rules engine + model scoring
Compliance Review	Regulatory compliance check	Business docs + regulation library	Compliance report + violation list	Regulation matching, clause interpretation
Document Comparison	Smart contract/agreement comparison	Standard template + actual document	Diff list + risk warnings	OCR + semantic similarity
Risk Pricing	Rate & fee calculation	Rating result + market data	Pricing plan + profit simulation	Monte Carlo + sensitivity analysis
Post-Loan Monitoring	Repayment tracking & early warning	Account flows + behavioral data	Warning signals + action suggestions	Time-series anomaly detection
Report Generation	Automated approval report	Previous skill outputs	Structured approval report	Template filling + logic validation

The Core of Collaboration: Dependency Graphs and Parallel Scheduling

7 skills don't execute serially. Some run in parallel, some have dependencies. Here's the dependency-driven orchestrator I built:

from dataclasses import dataclass
from typing import Dict, Any, List
import asyncio

@dataclass
class SkillResult:
    skill_name: str
    status: str  # success / failed / skipped
    data: Dict[str, Any]
    duration_ms: int

class SkillOrchestrator:
    def __init__(self):
        # Define skill dependency graph
        self.dependency_graph = {
            "kyc": [],                    # No dependency, runs first
            "credit_assess": ["kyc"],     # Depends on KYC result
            "compliance": ["kyc"],        # Depends on KYC result
            "doc_compare": [],            # No dependency, parallel with KYC
            "risk_pricing": ["credit_assess", "compliance"],
            "post_loan": ["credit_assess"],
            "report_gen": ["risk_pricing", "doc_compare", "compliance"]
        }
        self.skill_instances = {}

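    def register(self, name: str, skill, depends_on=None):
        self.skill_instances[name] = skill
        # Optional: declare dependencies at registration, so a new skill
        # needs no change to the hard-coded graph above
        if depends_on is not None:
            self.dependency_graph[name] = list(depends_on)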

    async def execute_pipeline(self, application: dict) -> Dict[str, SkillResult]:
        results = {}
        completed = set()

        # Topological sort + parallel scheduling
        while len(completed) < len(self.dependency_graph):
            # Find skills whose dependencies are all completed
            ready = [
                name for name, deps in self.dependency_graph.items()
                if name not in completed 
                and all(d in completed for d in deps)
            ]

            if not ready:
                raise RuntimeError("Circular dependency detected, cannot proceed")

            # Execute all ready skills in parallel
            tasks = []
            for skill_name in ready:
                skill_input = self._build_input(skill_name, application, results)
                tasks.append(self._run_skill(skill_name, skill_input))

            batch_results = await asyncio.gather(*tasks, return_exceptions=True)

            for skill_name, result in zip(ready, batch_results):
                if isinstance(result, Exception):
                    results[skill_name] = SkillResult(
                        skill_name=skill_name, status="failed",
                        data={"error": str(result)}, duration_ms=0
                    )
                else:
                    results[skill_name] = result
                completed.add(skill_name)

        return results

    def _build_input(self, skill_name: str, application: dict, 
                     completed_results: dict) -> dict:
        """Inject upstream skill outputs into current skill's input"""
        deps = self.dependency_graph[skill_name]
        enriched = {"application": application, "upstream": {}}
        for dep in deps:
            if dep in completed_results and completed_results[dep].status == "success":
                enriched["upstream"][dep] = completed_results[dep].data
        return enriched

    async def _run_skill(self, name: str, input_data: dict) -> SkillResult:
        import time
        # perf_counter is monotonic, so durations are unaffected by system clock adjustments
        start = time.perf_counter()
        skill = self.skill_instances[name]
        result_data = await skill.run(input_data)
        duration = int((time.perf_counter() - start) * 1000)
        return SkillResult(
            skill_name=name, status="success",
            data=result_data, duration_ms=duration
        )

In this flow, KYC and document comparison launch in parallel. Once KYC completes, credit assessment and compliance review run in parallel. Once both of those finish, risk pricing and post-loan monitoring run in parallel (risk pricing needs both; post-loan monitoring needs only credit assessment). Report generation runs last, after risk pricing, document comparison, and compliance review.

Serial execution: ~4.2 seconds. Parallel scheduling: ~1.8 seconds, about 57% less wall-clock time.
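
The batches can be derived directly from the dependency graph. The sketch below uses Python's standard-library `graphlib` (3.9+) instead of the hand-rolled loop; `parallel_batches` is a hypothetical helper name, and the last lines redo the latency arithmetic from the figures quoted above.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Same dependency graph as the orchestrator: skill -> skills it depends on.
dependency_graph = {
    "kyc": [],
    "credit_assess": ["kyc"],
    "compliance": ["kyc"],
    "doc_compare": [],
    "risk_pricing": ["credit_assess", "compliance"],
    "post_loan": ["credit_assess"],
    "report_gen": ["risk_pricing", "doc_compare", "compliance"],
}

def parallel_batches(graph):
    """Group skills into batches; every skill in a batch can run concurrently."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)
    return batches

batches = parallel_batches(dependency_graph)
for number, batch in enumerate(batches, start=1):
    print(number, batch)

# Latency reduction from the quoted figures: (4.2 - 1.8) / 4.2
reduction = (4.2 - 1.8) / 4.2
print(f"{reduction:.0%}")
```

One design note: batch-by-batch scheduling puts a barrier after each batch, so post-loan monitoring waits for compliance review even though it depends only on credit assessment. Calling `sorter.done(name)` as each individual skill finishes would remove that barrier, at the cost of a slightly more involved scheduling loop.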

The key isn't "who calls whom" but dependency-graph-driven parallel scheduling. Each skill only declares its dependencies; the orchestrator automatically resolves parallelism. Adding a new skill only requires registering its dependencies — zero code intrusion into the orchestration logic.

Evaluating Your Agent: A 4-Dimension Measurement Framework

"Is the agent smart enough?" can't be answered by gut feeling. In financial scenarios, we need a quantifiable, comparable, trackable evaluation framework.

The 4-Dimension Model

Dimension	Core Metric	Finance Requirement	Method
Accuracy	End-to-end accuracy, false positive/negative rate	FP < 5%, FN < 1%	Labeled dataset + A/B testing
Explainability	Decision traceability rate, rule coverage	100% traceable	Automated audit scripts
Stability	Output variance, adversarial robustness	CV < 0.05	Stress testing + adversarial samples
Efficiency	P95 latency, throughput, cost per 1K calls	P95 < 2s	Performance benchmarks

Implementation

from dataclasses import dataclass
from typing import List, Dict
import statistics

@dataclass
class EvalCase:
    input_data: dict
    expected_output: dict
    category: str  # kyc / credit / compliance / ...
    difficulty: str  # easy / medium / hard

class AgentEvaluator:
    def __init__(self, agent, eval_dataset: List[EvalCase]):
        self.agent = agent
        self.dataset = eval_dataset

    async def run_evaluation(self) -> Dict:
        results = {
            "accuracy": await self._eval_accuracy(),
            "explainability": await self._eval_explainability(),
            "stability": await self._eval_stability(),
            "efficiency": await self._eval_efficiency(),
        }
        # Composite score: accuracy and explainability share the highest weight
        weights = {"accuracy": 0.3, "explainability": 0.3, 
                   "stability": 0.25, "efficiency": 0.15}
        results["composite_score"] = sum(
            results[k]["score"] * weights[k] for k in weights
        )
        return results

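    async def _eval_accuracy(self) -> Dict:
        correct, false_positive, false_negative = 0, 0, 0
        negatives, positives = 0, 0  # cases labeled "normal" vs. all other labels
        total = len(self.dataset)

        for case in self.dataset:
            actual = await self.agent.execute(
                task=case.input_data["task"],
                context=case.input_data["context"]
            )
            expected = case.expected_output["result"]
            is_negative = expected == "normal"
            negatives += int(is_negative)
            positives += int(not is_negative)
            if actual["result"] == expected:
                correct += 1
            elif is_negative:
                false_positive += 1  # a normal case was flagged
            else:
                false_negative += 1  # a flagged case was missed or mislabeled

        # FP rate is measured over actual negatives, FN rate over actual positives
        return {
            "score": correct / total if total else 0.0,
            "accuracy": correct / total if total else 0.0,
            "false_positive_rate": false_positive / negatives if negatives else 0.0,
            "false_negative_rate": false_negative / positives if positives else 0.0,
            "detail": f"{correct}/{total} correct, {false_positive} FP, {false_negative} FN"
        }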

    async def _eval_explainability(self) -> Dict:
        traceable, rule_covered, rule_cases = 0, 0, 0
        total = len(self.dataset)

        for case in self.dataset:
            result = await self.agent.execute(
                task=case.input_data["task"],
                context=case.input_data["context"]
            )
            if result.get("rule_hits") or result.get("llm_factors"):
                traceable += 1
            expected_rules = case.expected_output.get("expected_rules", [])
            if expected_rules:
                rule_cases += 1
                actual_rules = [r["rule_id"] for r in result.get("rule_hits", [])]
                if all(r in actual_rules for r in expected_rules):
                    rule_covered += 1

        traceable_rate = traceable / total if total else 0.0
        # Coverage is measured only over cases that declare expected rules
        rule_coverage = rule_covered / rule_cases if rule_cases else 1.0
        return {
            "score": (traceable_rate + rule_coverage) / 2,
            "traceable_rate": traceable_rate,
            "rule_coverage": rule_coverage,
            "detail": f"Traceability {traceable_rate:.1%}, Rule coverage {rule_coverage:.1%}"
        }

    async def _eval_stability(self) -> Dict:
        import random
        sample_cases = random.sample(self.dataset, min(20, len(self.dataset)))
        score_variances = []

        for case in sample_cases:
            scores = []
            for _ in range(5):  # Run each case 5 times
                result = await self.agent.execute(
                    task=case.input_data["task"],
                    context=case.input_data["context"]
                )
                scores.append(result.get("confidence", 0))

            if statistics.mean(scores) > 0:
                cv = statistics.stdev(scores) / statistics.mean(scores)
                score_variances.append(cv)

        avg_cv = statistics.mean(score_variances) if score_variances else 0
        return {
            "score": max(0, 1 - avg_cv * 10),
            "avg_cv": avg_cv,
            "detail": f"Average CV {avg_cv:.4f}, target < 0.05"
        }

    async def _eval_efficiency(self) -> Dict:
        import time
        latencies = []

        for case in self.dataset[:50]:
            start = time.time()
            await self.agent.execute(
                task=case.input_data["task"],
                context=case.input_data["context"]
            )
            latencies.append((time.time() - start) * 1000)

        latencies.sort()
        p95 = latencies[int(len(latencies) * 0.95)] if latencies else 0
        avg = statistics.mean(latencies) if latencies else 0

        return {
            "score": min(1, 2000 / p95) if p95 > 0 else 1,
            "p95_ms": p95,
            "avg_ms": avg,
            "detail": f"P95={p95:.0f}ms, Avg={avg:.0f}ms, Target P95<2000ms"
        }

Interpreting Results

Accuracy < 90%: Check data quality first, then optimize the model, finally tune rule weights
Explainability < 100%: Unacceptable — this is a regulatory red line, decision chains must be complete
Stability CV > 0.1: Check if prompts have randomness (temperature too high?), add more few-shot examples
P95 > 2s: Check for unnecessary serial calls, optimize the skill orchestration graph
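
The targets from the 4-dimension table can be turned into an automated release gate. This is a sketch, assuming the evaluator's metrics have been flattened into one dict; `failed_gates` and the flat key names are hypothetical, while the thresholds come from the table.

```python
def failed_gates(metrics: dict) -> list:
    """Return the names of metrics that miss their target; empty means all pass."""
    checks = [
        ("false_positive_rate", metrics["false_positive_rate"] < 0.05),  # FP < 5%
        ("false_negative_rate", metrics["false_negative_rate"] < 0.01),  # FN < 1%
        ("traceable_rate", metrics["traceable_rate"] >= 1.0),            # 100% traceable
        ("avg_cv", metrics["avg_cv"] < 0.05),                            # CV < 0.05
        ("p95_ms", metrics["p95_ms"] < 2000),                            # P95 < 2s
    ]
    return [name for name, passed in checks if not passed]

sample = {
    "false_positive_rate": 0.03,
    "false_negative_rate": 0.004,
    "traceable_rate": 0.97,
    "avg_cv": 0.02,
    "p95_ms": 2400,
}
print(failed_gates(sample))  # ['traceable_rate', 'p95_ms']
```

Because explainability is a red line, a pipeline built on this would typically block the release on any `traceable_rate` failure, regardless of how the composite score looks.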

The Open-Source Financial AI Skill Library: 56 Scenarios

No single bank can develop AI capabilities for every scenario. The open-source financial AI skill library lets banks focus on business logic instead of reinventing wheels.

Skill Acceptance Criteria

Every skill must pass three types of acceptance before entering the library:

SKILL_ACCEPTANCE_CRITERIA = {
    # 1. Functional: must have I/O schemas and minimum test cases
    "functional": {
        "input_schema": True,
        "output_schema": True,
        "test_cases_min": 3,
        "edge_cases_min": 1,
    },
    # 2. Security: hard requirements for financial scenarios
    "security": {
        "no_hardcoded_secrets": True,
        "pii_redaction": True,
        "audit_log": True,
        "max_data_retention_days": 30,
    },
    # 3. Performance: basic SLA
    "performance": {
        "p95_latency_ms": 3000,
        "error_rate_max": 0.01,
        "timeout_ms": 10000,
    }
}
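
A checker for these criteria could look like the sketch below. It assumes each skill ships a self-reported manifest with the same keys as the criteria. That manifest format, the `check_acceptance` helper, and the convention that `_min` keys are lower bounds while other numeric keys are upper bounds are all assumptions for illustration. The criteria here are a trimmed copy, to keep the example self-contained.

```python
# Trimmed copy of the acceptance criteria, to keep the example self-contained.
CRITERIA = {
    "functional": {"input_schema": True, "test_cases_min": 3},
    "security": {"pii_redaction": True, "max_data_retention_days": 30},
    "performance": {"p95_latency_ms": 3000, "error_rate_max": 0.01},
}

def check_acceptance(manifest: dict, criteria: dict) -> list:
    """Return human-readable failures; an empty list means the skill is accepted."""
    failures = []
    for group, rules in criteria.items():
        reported = manifest.get(group, {})
        for key, limit in rules.items():
            value = reported.get(key)
            if value is None:
                ok = False                # missing values always fail
            elif isinstance(limit, bool):
                ok = value is limit       # boolean requirements must match exactly
            elif key.endswith("_min"):
                ok = value >= limit       # lower bound
            else:
                ok = value <= limit       # upper bound (latency, retention, error rate)
            if not ok:
                failures.append(f"{group}.{key}: got {value!r}, required {limit!r}")
    return failures

manifest = {
    "functional": {"input_schema": True, "test_cases_min": 5},
    "security": {"pii_redaction": True, "max_data_retention_days": 90},
    "performance": {"p95_latency_ms": 1200, "error_rate_max": 0.002},
}
print(check_acceptance(manifest, CRITERIA))
```

Here the only failure is the 90-day retention, which exceeds the 30-day limit.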

Contribution Paths

User: Install from the library, configure parameters, deploy
Customizer: Fork, adjust business parameters (rule thresholds, prompt templates), contribute back upstream
Contributor: Develop new skills, submit PRs, pass all three acceptance types

56 scenarios aren't 56 standalone systems — they're 56 standardized skill building blocks. A bank's competitive advantage lies not in the blocks themselves, but in assembling them into differentiated business workflows.

Engineering Checklist: From Design to Deployment

Architecture Design Phase

#	Check Item	Pass Criteria	Common Pitfall
1	Skill granularity	Each skill independently testable and deployable	Too fine → orchestration overhead; too coarse → no reuse
2	Dependency graph acyclic	Topological sort is executable	A→B→A circular deadlock
3	Hybrid coverage	Every decision node has rule fallback	LLM-only decisions → unexplainable
4	Data traceability	Every output traceable to input + rule + model	Lost intermediate results → audit failure
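
Check item 2 can be automated at design time, before anything runs. The sketch below uses the standard-library `graphlib`; `find_cycle` and `missing_dependencies` are hypothetical helper names. The second helper catches a related mistake: a dependency that is referenced but never declared, which a scheduler that waits on unfinished dependencies would otherwise report as a circular dependency.

```python
from graphlib import CycleError, TopologicalSorter  # standard library, Python 3.9+

def missing_dependencies(graph: dict) -> set:
    """Dependencies that are referenced but never declared as skills."""
    declared = set(graph)
    referenced = {dep for deps in graph.values() for dep in deps}
    return referenced - declared

def find_cycle(graph: dict):
    """Return the nodes of one dependency cycle, or None if the graph is acyclic."""
    try:
        TopologicalSorter(graph).prepare()
    except CycleError as err:
        return err.args[1]  # graphlib reports the cycle as the second argument
    return None

good = {"kyc": [], "credit_assess": ["kyc"]}
bad = {"a": ["b"], "b": ["a"]}
typo = {"kyc": [], "credit_assess": ["kyc_v2"]}

print(find_cycle(good))            # None
print(find_cycle(bad))             # a list that starts and ends on the same node
print(missing_dependencies(typo))  # {'kyc_v2'}
```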

Development Phase

from abc import ABC, abstractmethod
from pydantic import BaseModel

class SkillInput(BaseModel):
    """Every skill must define an input schema"""
    application: dict
    upstream: dict = {}

class SkillOutput(BaseModel):
    """Every skill must define an output schema"""
    result: str
    confidence: float
    rule_hits: list = []
    llm_factors: list = []
    explanation: str = ""
    metadata: dict = {}

class BaseSkill(ABC):
    """Base class all financial AI skills must inherit"""

    name: str
    version: str

    @abstractmethod
    async def run(self, input_data: SkillInput) -> SkillOutput:
        pass

    @abstractmethod
    def validate_input(self, input_data: dict) -> bool:
        """Input validation: prevent injection and unauthorized access"""
        pass

    @abstractmethod
    def redact_output(self, output: SkillOutput) -> SkillOutput:
        """Output redaction: remove PII and sensitive fields"""
        pass

    async def safe_run(self, input_data: SkillInput) -> SkillOutput:
        """Safe execution: validate → execute → redact → audit"""
        if not self.validate_input(input_data.dict()):
            raise ValueError(f"[{self.name}] Input validation failed")

        output = await self.run(input_data)
        redacted = self.redact_output(output)

        # Structured, tamper-proof audit log
        self._audit_log(input_data, redacted)

        return redacted

    @abstractmethod
    def _audit_log(self, input_data: SkillInput, output: SkillOutput) -> None:
        """Audit hook: persist the input and the redacted output for later review"""
        pass
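
One common way to approach the audit log requirement is a hash chain, where each entry commits to the hash of the previous one. The sketch below is an assumption about how `_audit_log` could be backed, not the article's implementation; `HashChainedAuditLog` is a hypothetical class. Strictly speaking it is tamper-evident rather than tamper-proof: it detects edits but cannot prevent them.

```python
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder hash for the first entry

def _digest(body: dict) -> str:
    # Canonical JSON (sorted keys) so the same content always hashes the same way
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()

class HashChainedAuditLog:
    """Append-only log where each entry commits to the previous entry's hash."""

    def __init__(self):
        self.entries = []

    def append(self, skill_name: str, payload: dict) -> dict:
        prev_hash = self.entries[-1]["hash"] if self.entries else GENESIS_HASH
        body = {"skill": skill_name, "payload": payload, "prev_hash": prev_hash}
        entry = {**body, "hash": _digest(body)}
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute every hash; any edited or reordered entry breaks the chain."""
        prev_hash = GENESIS_HASH
        for entry in self.entries:
            body = {k: entry[k] for k in ("skill", "payload", "prev_hash")}
            if entry["prev_hash"] != prev_hash or entry["hash"] != _digest(body):
                return False
            prev_hash = entry["hash"]
        return True

log = HashChainedAuditLog()
log.append("kyc", {"result": "pass"})
log.append("credit_assess", {"result": "review"})
print(log.verify())                            # True
log.entries[0]["payload"]["result"] = "fail"   # simulate tampering
print(log.verify())                            # False
```

In production the chain head would also need to be anchored somewhere the application cannot rewrite, such as write-once storage; that part is outside this sketch.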

Testing Phase

Unit tests: Each skill independently, covering normal + edge + error cases
Integration tests: 2-3 skills grouped, verify data flow and dependency resolution
End-to-end tests: Full 7-skill pipeline, including adversarial samples and stress tests

Deployment Phase

Step	Action	Key Checkpoint
1	Gradual rollout	5% → 20% → 50% → 100%, each stage for at least 3 days
2	Accuracy reconciliation	Agent decision vs. human decision, >5% deviation pauses rollout
3	Explainability verification	Random sample 100 decisions, manually verify traceability
4	Regulatory filing	Submit model documentation (algorithm logic, data sources, risk controls)
5	Emergency plan	One-click fallback to rules-engine-only mode, ensure business continuity
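
Steps 1 and 2 can be sketched as deterministic traffic bucketing plus a stage gate. The function names and the hash-bucket scheme are assumptions; the stage percentages and the 5% deviation limit come from the table. The three-day minimum per stage is not modeled here.

```python
import hashlib

ROLLOUT_STAGES = [0.05, 0.20, 0.50, 1.00]  # stages from the deployment table

def in_rollout(application_id: str, fraction: float) -> bool:
    """Deterministically route a stable share of applications to the agent."""
    digest = hashlib.sha256(application_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < fraction

def next_stage(current: float, deviation: float, max_deviation: float = 0.05) -> float:
    """Advance one stage, or hold when agent-vs-human deviation exceeds the limit."""
    if deviation > max_deviation:
        return current  # pause the rollout at the current stage
    index = ROLLOUT_STAGES.index(current)
    return ROLLOUT_STAGES[min(index + 1, len(ROLLOUT_STAGES) - 1)]

print(next_stage(0.05, deviation=0.02))  # 0.2
print(next_stage(0.20, deviation=0.08))  # 0.2
```

Hashing the application ID keeps routing stable: an application served by the agent at 5% stays with the agent at 20%, which keeps the accuracy reconciliation comparable across stages.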

From monolith to skill-based collaboration isn't a technology choice — it's the inevitable path for banking AI. Buying systems was Phase 1's answer, building platforms was Phase 2's exploration, and raising agents is Phase 3's correct answer. What makes agents actually deployable, auditable, and evolvable is: hybrid architecture for safety, skill orchestration for efficiency, four-dimension evaluation for direction, and ecosystem reuse for cost reduction. Get these four things right, and banking AI stops being a slide deck and becomes real business productivity.
