Banking AI Agent Architecture: From Monolith to Skill-Based Collaboration
Walk into any bank's IT department and you hear the same sighs: the system is slow again, the new requirement can't be scheduled, and the regulators are back. This isn't one bank's problem — it's the entire industry's. Let me break down the three fundamental traps I've seen play out repeatedly, and how we can escape them with a skill-based agent architecture.
The Three Traps of Bank IT
Trap 1: The Monolith's "Pull One Thread, Unravel Everything"
One city commercial bank I worked with had a core lending system that ballooned to 2.3 million lines of code in a single repo. Changing one rule branch meant a 3-day regression test cycle. When the business said "just a small tweak," the dev team responded: "Are you sure this won't break the other 47 call sites?"
┌─────────────────────────────────────────────┐
│      Monolith Architecture (2.3M LOC)       │
│  ┌───────┐ ┌───────┐ ┌───────┐ ┌───────┐    │
│  │Credit │ │ Risk  │ │Post-  │ │Report │    │
│  │Module │ │Module │ │loan   │ │Module │    │
│  └───┬───┘ └───┬───┘ └───┬───┘ └───┬───┘    │
│      └─────┬───┴───────┬─┘         │        │
│         ┌──┴───────────┴──┐        │        │
│         │ Shared DB Layer │────────┘        │
│         └─────────────────┘                 │
│  Pain: 1 change → full regression → 3 days  │
└─────────────────────────────────────────────┘
Trap 2: Rules Engine vs. LLM — Flying Solo
The risk team ran a rules engine, the customer service team used an LLM for intent recognition, and the compliance team manually checked with Excel. Three systems, three languages, zero unified data standards. When regulators asked "explain the basis for this approval decision," risk said "rule 37 was triggered," the LLM said "based on semantic analysis," and compliance said "manual review needed." Three logic chains that couldn't line up.
Trap 3: Reinventing the Wheel Across Business Lines
The retail team built a KYC agent. The corporate team built another KYC agent. The interbank team was building a third. Three teams, zero awareness of each other, and duplicated capabilities; even their entity extraction prompts were all different. Worse, when AML rules were updated, all three KYC agents had to be patched separately.
The root cause isn't outdated technology. It's that the architecture paradigm hasn't kept up with AI capability evolution. Monoliths serve deterministic logic, but the AI era needs composable, replaceable, evaluable agent collaboration.
From "Buying Systems" to "Raising Agents": Three Phases of Digital Transformation
Bank digital transformation isn't a single leap — it goes through three phases, each with different architectures and governance models.
Phase 1: System Procurement (2010-2018)
The keyword was "buy." Core systems from IBM, lending systems from Hundsun, CRM from Siebel. Fast to deploy — but you bought a bunch of black boxes. When custom requirements came, vendor quotes made you question your life choices. The deeper problem: data locked in silos. A "360-degree customer view" required extracting from 6 systems, ETL pipelines running 4 hours, T+1 data freshness.
Phase 2: Platform Construction (2018-2023)
The middle-platform concept surged. Banks built data platforms, AI platforms, business platforms. Right direction, flawed execution. Many banks' platforms became "a second core system" — heavy on construction, light on operations, lots of platform, no scenarios.
One joint-stock bank's AI platform had 7 connected scenarios a year after launch, averaging fewer than 200 daily API calls. The root cause: the platform provided "raw capabilities" (OCR, NLP, speech recognition), not "business skills" (intelligent credit document review, compliance document comparison). Business teams still had to wrap their own logic on top of the APIs.
Phase 3: Agent Raising (2023-Present)
LLMs changed the game. When an LLM can understand business language and execute multi-step tasks, banks can finally stop "buying systems" and start "raising agents."
class BankingAgent:
    # LLMEngine and MemoryStore are not defined in this article;
    # treat them as placeholders for your own LLM client and memory layer.
    def __init__(self, name: str, skills: list, memory_store: dict):
        self.name = name
        self.llm = LLMEngine(model="qwen2.5-72b")
        self.skills = {s.name: s for s in skills}
        self.memory = MemoryStore(memory_store)

    async def execute(self, task: str, context: dict):
        # Step 1: LLM understands the task, selects skills
        skill_plan = await self.llm.plan(
            task, available_skills=list(self.skills.keys())
        )
        # Step 2: Execute skills in planned order
        results = {}
        for step in skill_plan.steps:
            # Guard against the planner naming a skill this agent does not have
            if step.skill_name not in self.skills:
                raise ValueError(f"Unknown skill in plan: {step.skill_name}")
            skill = self.skills[step.skill_name]
            results[step.id] = await skill.run(step.params, context)
            # Write each step result to memory for downstream skills
            self.memory.set(step.id, results[step.id])
        # Step 3: Synthesize output
        return await self.llm.synthesize(task, results)
"Raising" means: agents start with simple tasks, accumulate experience (Memory) in production, gradually unlock new capabilities (Skills), and evolve from "one-trick ponies" to "independent operators." More flexible than buying systems, more practical than building platforms.
LLM + Rules Engine: The Hybrid Architecture for Finance
In financial scenarios, pure-LLM and pure-rules approaches both have fatal flaws. Pure LLMs are unexplainable and uncontrollable. Pure rules are inflexible and expensive to maintain. The answer is combining both.
Why Hybrid Is Non-Negotiable
A bank tried pure LLM for credit approval. Test accuracy: 95%. Regulators shut it down. Why? It couldn't explain why Customer A was rejected. The LLM said "the model assessed high risk overall," but regulators demand: "which rule was hit, what was the weight, and what data was the basis?"
Conversely, pure rules had their own bottleneck. One bank's risk rules grew from 120 to 4,700. Conflicts between rules multiplied. The maintenance team swelled from 3 to 17 people. Every rule update required two weeks of conflict detection.
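To make the conflict problem concrete, here is a minimal sketch of pairwise conflict detection. It assumes a deliberately simplified rule shape (one numeric field, one half-open range, one action); the `Rule` class, the field name `debt_ratio`, and the rule IDs are hypothetical, and real rules engines use far richer conditions.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Rule:
    """Hypothetical, simplified rule: one field, one range, one action."""
    rule_id: str
    field: str
    low: float    # inclusive lower bound
    high: float   # exclusive upper bound
    action: str   # "approve" / "reject" / "review"


def find_conflicts(rules):
    """Return pairs of rule IDs whose ranges overlap on the same field
    but whose actions differ."""
    conflicts = []
    for a, b in combinations(rules, 2):
        same_field = a.field == b.field
        overlap = a.low < b.high and b.low < a.high
        if same_field and overlap and a.action != b.action:
            conflicts.append((a.rule_id, b.rule_id))
    return conflicts


rules = [
    Rule("R37", "debt_ratio", 0.6, 1.0, "reject"),
    Rule("R112", "debt_ratio", 0.5, 0.7, "approve"),
    Rule("R205", "debt_ratio", 0.0, 0.5, "approve"),
]
conflicts = find_conflicts(rules)
print(conflicts)  # R37 and R112 overlap on [0.6, 0.7) with different actions
```

The check is pairwise, so 4,700 rules means roughly 11 million comparisons (4,700 × 4,699 / 2), before accounting for multi-field conditions. That gives a feel for why conflict detection stops being a manual job well before that size.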
Hybrid Architecture Design
Core principle: Rules as safety net, models as enhancement, arbitration for conflicts, full explainability throughout.
from typing import TypedDict
from enum import Enum


class DecisionSource(Enum):
    RULE = "rule"
    LLM = "llm"
    HYBRID = "hybrid"


class HybridDecision(TypedDict):
    result: str              # approve / reject / review
    confidence: float        # 0-1
    source: DecisionSource
    rule_hits: list          # rules that were triggered
    llm_factors: list        # factors contributed by LLM
    explanation: str         # explainable output


class HybridDecisionEngine:
    def __init__(self, rule_engine, llm_engine, arbitration_policy="rule_first"):
        self.rule_engine = rule_engine
        self.llm_engine = llm_engine
        self.policy = arbitration_policy

    async def decide(self, application: dict) -> HybridDecision:
        # Step 1: Run rules engine first; deterministic logic skips the model
        rule_result = self.rule_engine.evaluate(application)
        # Hard rule hit: direct decision, no model needed
        if rule_result["hard_hit"]:
            return {
                "result": rule_result["action"],
                "confidence": 1.0,
                "source": DecisionSource.RULE,
                "rule_hits": rule_result["hits"],
                "llm_factors": [],
                "explanation": self._rule_explanation(rule_result["hits"]),
            }
        # Step 2: Gray area, so let the model supplement the rules
        llm_result = await self.llm_engine.analyze(
            application=application,
            rule_context=rule_result,
            prompt="Based on the application and rule evaluation, supplement risk analysis",
        )
        # Step 3: Arbitration
        return self._arbitrate(rule_result, llm_result)

    def _rule_explanation(self, hits: list) -> str:
        # Minimal placeholder: list the triggered rules verbatim
        return f"Hard rule(s) triggered: {hits}"

    def _arbitrate(self, rule_result: dict, llm_result: dict) -> HybridDecision:
        # Rule says reject + LLM says approve → never auto-approve; escalate to human review
        if rule_result["soft_action"] == "reject" and llm_result["tendency"] == "approve":
            return {
                "result": "review",
                "confidence": 0.6,
                "source": DecisionSource.HYBRID,
                "rule_hits": rule_result["hits"],
                "llm_factors": llm_result["factors"],
                "explanation": f"Rules flagged {rule_result['hits']}, LLM leans approve; escalate to human review",
            }
        # Rule says approve + LLM has concerns → escalate with the LLM factors attached
        if rule_result["soft_action"] == "approve" and llm_result["tendency"] == "reject":
            return {
                "result": "review",
                "confidence": 0.7,
                "source": DecisionSource.HYBRID,
                "rule_hits": rule_result["hits"],
                "llm_factors": llm_result["factors"],
                "explanation": f"Rules pass but LLM identifies {llm_result['factors']}; human review recommended",
            }
        # Both agree → high confidence output
        return {
            "result": rule_result["soft_action"],
            "confidence": 0.95,
            "source": DecisionSource.HYBRID,
            "rule_hits": rule_result["hits"],
            "llm_factors": llm_result["factors"],
            "explanation": f"Rules and LLM agree: {rule_result['soft_action']}",
        }
Real-World Results
After deploying the hybrid architecture in a consumer lending scenario:
| Metric | Rules Only | LLM Only | Hybrid |
|---|---|---|---|
| Approval accuracy | 87.3% | 94.8% | 95.2% |
| Explainability pass rate | 100% | 23% | 100% |
| Human intervention rate | 31% | 12% | 8% |
| Rule maintenance cost | High | Low | Medium |
| Regulatory audit pass | Yes | No | Yes |
The pure LLM beat rules alone on accuracy (94.8% vs. 87.3%), but its 23% explainability pass rate made it undeployable. The hybrid approach kept explainability at 100% while raising accuracy by 7.9 points over rules alone and edging past the pure LLM (95.2% vs. 94.8%).
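The deltas behind that comparison are simple arithmetic on the table. A small sketch that recomputes them, so the numbers can be rechecked if the table changes (the dictionary layout and the `delta` helper are just illustrative):

```python
# Figures copied from the results table above (percentages).
results = {
    "rules_only": {"accuracy": 87.3, "human_intervention": 31.0},
    "llm_only": {"accuracy": 94.8, "human_intervention": 12.0},
    "hybrid": {"accuracy": 95.2, "human_intervention": 8.0},
}


def delta(metric, baseline, candidate="hybrid"):
    """Percentage-point difference between candidate and baseline."""
    return round(results[candidate][metric] - results[baseline][metric], 1)


acc_gain_vs_rules = delta("accuracy", "rules_only")
acc_gain_vs_llm = delta("accuracy", "llm_only")
review_drop_vs_rules = delta("human_intervention", "rules_only")

print(acc_gain_vs_rules, acc_gain_vs_llm, review_drop_vs_rules)
```

These are percentage-point differences, not relative improvements.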
In financial scenarios, explainability isn't optional — it's an entry requirement. The hybrid architecture's essence isn't piling rules on models, but letting each do what it's best at, with an arbitration layer to resolve conflicts.
How 7 Skills Collaborate: The Agent Architecture Design
Now for the core architecture question: how do 7 skills work together? Let me walk through a complete corporate lending scenario, from customer onboarding to post-loan monitoring.
7 Skills and Their Responsibilities
| Skill | Responsibility | Input | Output | Key Capability |
|---|---|---|---|---|
| KYC Verification | Identity verification & profiling | Business registration + legal rep info | Customer profile + risk labels | Entity extraction, relationship graph |
| Credit Assessment | Credit rating & limit calculation | Financial statements + credit data | Credit grade + suggested limit | Rules engine + model scoring |
| Compliance Review | Regulatory compliance check | Business docs + regulation library | Compliance report + violation list | Regulation matching, clause interpretation |
| Document Comparison | Smart contract/agreement comparison | Standard template + actual document | Diff list + risk warnings | OCR + semantic similarity |
| Risk Pricing | Rate & fee calculation | Rating result + market data | Pricing plan + profit simulation | Monte Carlo + sensitivity analysis |
| Post-Loan Monitoring | Repayment tracking & early warning | Account flows + behavioral data | Warning signals + action suggestions | Time-series anomaly detection |
| Report Generation | Automated approval report | Previous skill outputs | Structured approval report | Template filling + logic validation |
The Core of Collaboration: Dependency Graphs and Parallel Scheduling
7 skills don't execute serially. Some run in parallel, some have dependencies. Here's the dependency-driven orchestrator I built:
from dataclasses import dataclass
from typing import Dict, Any
import asyncio
import time


@dataclass
class SkillResult:
    skill_name: str
    status: str              # success / failed / skipped
    data: Dict[str, Any]
    duration_ms: int


class SkillOrchestrator:
    def __init__(self):
        # Define skill dependency graph
        self.dependency_graph = {
            "kyc": [],                      # No dependency, runs first
            "credit_assess": ["kyc"],       # Depends on KYC result
            "compliance": ["kyc"],          # Depends on KYC result
            "doc_compare": [],              # No dependency, parallel with KYC
            "risk_pricing": ["credit_assess", "compliance"],
            "post_loan": ["credit_assess"],
            "report_gen": ["risk_pricing", "doc_compare", "compliance"],
        }
        self.skill_instances = {}

    def register(self, name: str, skill):
        self.skill_instances[name] = skill

    async def execute_pipeline(self, application: dict) -> Dict[str, SkillResult]:
        results = {}
        completed = set()
        # Topological sort + parallel scheduling, one wave at a time
        while len(completed) < len(self.dependency_graph):
            # Find skills whose dependencies are all completed
            ready = [
                name for name, deps in self.dependency_graph.items()
                if name not in completed
                and all(d in completed for d in deps)
            ]
            if not ready:
                # Also reached when a skill depends on a name missing from the graph
                raise RuntimeError("Circular or unknown dependency detected, cannot proceed")
            # Execute all ready skills in parallel
            tasks = []
            for skill_name in ready:
                skill_input = self._build_input(skill_name, application, results)
                tasks.append(self._run_skill(skill_name, skill_input))
            batch_results = await asyncio.gather(*tasks, return_exceptions=True)
            for skill_name, result in zip(ready, batch_results):
                if isinstance(result, Exception):
                    results[skill_name] = SkillResult(
                        skill_name=skill_name, status="failed",
                        data={"error": str(result)}, duration_ms=0
                    )
                else:
                    results[skill_name] = result
                # A failed skill still counts as completed, so downstream skills
                # run without its output (see _build_input)
                completed.add(skill_name)
        return results

    def _build_input(self, skill_name: str, application: dict,
                     completed_results: dict) -> dict:
        """Inject upstream skill outputs into current skill's input"""
        deps = self.dependency_graph[skill_name]
        enriched = {"application": application, "upstream": {}}
        for dep in deps:
            # Only successful upstream results are passed along
            if dep in completed_results and completed_results[dep].status == "success":
                enriched["upstream"][dep] = completed_results[dep].data
        return enriched

    async def _run_skill(self, name: str, input_data: dict) -> SkillResult:
        # perf_counter is monotonic, so it is safer than time.time() for durations
        start = time.perf_counter()
        skill = self.skill_instances[name]
        result_data = await skill.run(input_data)
        duration = int((time.perf_counter() - start) * 1000)
        return SkillResult(
            skill_name=name, status="success",
            data=result_data, duration_ms=duration
        )
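In this flow, KYC and document comparison launch in parallel as the first wave. The orchestrator schedules wave by wave, so once that wave finishes, credit assessment and compliance review run in parallel. After both of those complete, risk pricing and post-loan monitoring run in parallel (post-loan monitoring needs only the credit assessment). Report generation runs last, once risk pricing, document comparison, and compliance review are done.
Serial execution takes about 4.2 seconds; parallel scheduling takes about 1.8 seconds, a 57% reduction in end-to-end latency.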
The key isn't "who calls whom" but dependency-graph-driven parallel scheduling. Each skill only declares its dependencies; the orchestrator automatically resolves parallelism. Adding a new skill only requires registering its dependencies — zero code intrusion into the orchestration logic.
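To see what the scheduler does with the graph, the sketch below computes the execution waves without running any skills. It mirrors the loop in `execute_pipeline`; the `execution_waves` helper and the extra `esg_screen` skill are hypothetical additions for illustration.

```python
def execution_waves(graph):
    """Group skills into waves: every skill in a wave has all of its
    dependencies in earlier waves, so a wave can run in parallel."""
    done, waves = set(), []
    while len(done) < len(graph):
        ready = sorted(
            name for name, deps in graph.items()
            if name not in done and all(d in done for d in deps)
        )
        if not ready:
            raise ValueError("cycle or unknown dependency in graph")
        waves.append(ready)
        done.update(ready)
    return waves


graph = {
    "kyc": [],
    "credit_assess": ["kyc"],
    "compliance": ["kyc"],
    "doc_compare": [],
    "risk_pricing": ["credit_assess", "compliance"],
    "post_loan": ["credit_assess"],
    "report_gen": ["risk_pricing", "doc_compare", "compliance"],
}
waves = execution_waves(graph)
for i, wave in enumerate(waves, start=1):
    print(i, wave)

# Adding a skill is one new entry; the scheduling loop is untouched.
# "esg_screen" is a made-up skill used only to show the mechanics.
graph["esg_screen"] = ["kyc"]
waves_with_new_skill = execution_waves(graph)
```

The seven skills collapse into four waves, and the hypothetical eighth skill joins the second wave without adding a fifth. One trade-off of wave scheduling is that a fast skill still waits for the slowest member of the previous wave, even when that member is not one of its own dependencies.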
Evaluating Your Agent: A 4-Dimension Measurement Framework
"Is the agent smart enough?" can't be answered by gut feeling. In financial scenarios, we need a quantifiable, comparable, trackable evaluation framework.
The 4-Dimension Model
| Dimension | Core Metric | Finance Requirement | Method |
|---|---|---|---|
| Accuracy | End-to-end accuracy, false positive/negative rate | FP < 5%, FN < 1% | Labeled dataset + A/B testing |
| Explainability | Decision traceability rate, rule coverage | 100% traceable | Automated audit scripts |
| Stability | Output variance, adversarial robustness | CV < 0.05 | Stress testing + adversarial samples |
| Efficiency | P95 latency, throughput, cost per 1K calls | P95 < 2s | Performance benchmarks |
Implementation
from dataclasses import dataclass
from typing import List, Dict
import random
import statistics
import time


@dataclass
class EvalCase:
    input_data: dict
    expected_output: dict
    category: str            # kyc / credit / compliance / ...
    difficulty: str          # easy / medium / hard


class AgentEvaluator:
    def __init__(self, agent, eval_dataset: List[EvalCase]):
        self.agent = agent
        self.dataset = eval_dataset

    async def run_evaluation(self) -> Dict:
        # Every metric below divides by the dataset size
        if not self.dataset:
            raise ValueError("eval_dataset must not be empty")
        results = {
            "accuracy": await self._eval_accuracy(),
            "explainability": await self._eval_explainability(),
            "stability": await self._eval_stability(),
            "efficiency": await self._eval_efficiency(),
        }
        # Composite score: explainability shares the top weight with accuracy
        weights = {"accuracy": 0.3, "explainability": 0.3,
                   "stability": 0.25, "efficiency": 0.15}
        results["composite_score"] = sum(
            results[k]["score"] * weights[k] for k in weights
        )
        return results

    async def _eval_accuracy(self) -> Dict:
        correct, false_positive, false_negative = 0, 0, 0
        total = len(self.dataset)
        for case in self.dataset:
            actual = await self.agent.execute(
                task=case.input_data["task"],
                context=case.input_data["context"]
            )
            if actual["result"] == case.expected_output["result"]:
                correct += 1
            # Assumes "normal" is the negative label: a miss on a normal case
            # is a false positive, any other miss counts as a false negative
            elif case.expected_output["result"] == "normal":
                false_positive += 1
            else:
                false_negative += 1
        return {
            "score": correct / total,
            "accuracy": correct / total,
            "false_positive_rate": false_positive / total,
            "false_negative_rate": false_negative / total,
            "detail": f"{correct}/{total} correct, {false_positive} FP, {false_negative} FN"
        }

    async def _eval_explainability(self) -> Dict:
        traceable, rule_covered, cases_with_rules = 0, 0, 0
        total = len(self.dataset)
        for case in self.dataset:
            result = await self.agent.execute(
                task=case.input_data["task"],
                context=case.input_data["context"]
            )
            if result.get("rule_hits") or result.get("llm_factors"):
                traceable += 1
            expected_rules = case.expected_output.get("expected_rules", [])
            if expected_rules:
                cases_with_rules += 1
                actual_rules = [r["rule_id"] for r in result.get("rule_hits", [])]
                if all(r in actual_rules for r in expected_rules):
                    rule_covered += 1
        traceable_rate = traceable / total
        # Coverage is measured only over cases that declare expected rules;
        # dividing by the full dataset would understate it
        rule_coverage = rule_covered / cases_with_rules if cases_with_rules else 1.0
        return {
            "score": (traceable_rate + rule_coverage) / 2,
            "traceable_rate": traceable_rate,
            "rule_coverage": rule_coverage,
            "detail": f"Traceability {traceable_rate:.1%}, Rule coverage {rule_coverage:.1%}"
        }

    async def _eval_stability(self) -> Dict:
        sample_cases = random.sample(self.dataset, min(20, len(self.dataset)))
        score_variances = []
        for case in sample_cases:
            scores = []
            for _ in range(5):  # Run each case 5 times
                result = await self.agent.execute(
                    task=case.input_data["task"],
                    context=case.input_data["context"]
                )
                scores.append(result.get("confidence", 0))
            # Coefficient of variation (stdev / mean) of the confidence score
            if statistics.mean(scores) > 0:
                cv = statistics.stdev(scores) / statistics.mean(scores)
                score_variances.append(cv)
        avg_cv = statistics.mean(score_variances) if score_variances else 0
        return {
            "score": max(0, 1 - avg_cv * 10),
            "avg_cv": avg_cv,
            "detail": f"Average CV {avg_cv:.4f}, target < 0.05"
        }

    async def _eval_efficiency(self) -> Dict:
        latencies = []
        for case in self.dataset[:50]:
            start = time.perf_counter()
            await self.agent.execute(
                task=case.input_data["task"],
                context=case.input_data["context"]
            )
            latencies.append((time.perf_counter() - start) * 1000)
        latencies.sort()
        p95 = latencies[int(len(latencies) * 0.95)] if latencies else 0
        avg = statistics.mean(latencies) if latencies else 0
        return {
            "score": min(1, 2000 / p95) if p95 > 0 else 1,
            "p95_ms": p95,
            "avg_ms": avg,
            "detail": f"P95={p95:.0f}ms, Avg={avg:.0f}ms, Target P95<2000ms"
        }
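A worked example makes the composite score easier to sanity-check. The weights below are the ones used in `run_evaluation`; the four dimension scores are made-up inputs, and the hard gate on traceability is an assumption layered on top, reflecting the 100% traceability requirement in the table above.

```python
# Weights as used in run_evaluation above.
WEIGHTS = {"accuracy": 0.3, "explainability": 0.3,
           "stability": 0.25, "efficiency": 0.15}


def composite(scores, traceable_rate):
    """Weighted composite plus a pass/fail gate on traceability.

    The gate is an illustrative assumption: a weighted average alone
    can look healthy while traceability sits below 100%.
    """
    weighted = sum(scores[name] * weight for name, weight in WEIGHTS.items())
    passed_gate = traceable_rate >= 1.0
    return round(weighted, 4), passed_gate


# Hypothetical dimension scores, each in [0, 1].
example_scores = {"accuracy": 0.95, "explainability": 1.0,
                  "stability": 0.8, "efficiency": 0.9}

score, ok = composite(example_scores, traceable_rate=1.0)
print(score, ok)
# 0.95*0.3 + 1.0*0.3 + 0.8*0.25 + 0.9*0.15 = 0.285 + 0.3 + 0.2 + 0.135 = 0.92
```

Keeping the gate separate from the weighted sum means a strong accuracy score cannot compensate for an untraceable decision.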
Interpreting Results
- Accuracy < 90%: Check data quality first, then optimize the model, finally tune rule weights
- Explainability < 100%: Unacceptable — this is a regulatory red line, decision chains must be complete
- Stability CV > 0.1: Check if prompts have randomness (temperature too high?), add more few-shot examples
- P95 > 2s: Check for unnecessary serial calls, optimize the skill orchestration graph
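The checklist above can be turned into a small triage function that reads the dictionary returned by `run_evaluation`. This is a sketch: the thresholds come from the bullets, the result keys match the evaluator code, and the `triage` name and message wording are hypothetical.

```python
def triage(results):
    """Map evaluation results to the follow-up actions listed above."""
    findings = []
    # Explainability is checked first because it is the regulatory red line.
    if results["explainability"]["traceable_rate"] < 1.0:
        findings.append("BLOCKER: decision chain incomplete, fix traceability")
    if results["accuracy"]["accuracy"] < 0.90:
        findings.append("accuracy: check data quality, then model, then rule weights")
    if results["stability"]["avg_cv"] > 0.1:
        findings.append("stability: check temperature, add few-shot examples")
    if results["efficiency"]["p95_ms"] > 2000:
        findings.append("efficiency: look for unnecessary serial calls")
    return findings


# Hypothetical evaluation output, trimmed to the keys triage() reads.
sample = {
    "accuracy": {"accuracy": 0.93},
    "explainability": {"traceable_rate": 0.98},
    "stability": {"avg_cv": 0.03},
    "efficiency": {"p95_ms": 2600.0},
}
findings = triage(sample)
for line in findings:
    print(line)
```

Ordering the checks by severity keeps the blocker at the top of the report, which is usually what a release review wants to see first.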
The Open-Source Financial AI Skill Library: 56 Scenarios
No single bank can develop AI capabilities for every scenario. The open-source financial AI skill library lets banks focus on business logic instead of reinventing wheels.
Skill Acceptance Criteria
Every skill must pass three types of acceptance before entering the library:
SKILL_ACCEPTANCE_CRITERIA = {
    # 1. Functional: must have I/O schemas and minimum test cases
    "functional": {
        "input_schema": True,
        "output_schema": True,
        "test_cases_min": 3,
        "edge_cases_min": 1,
    },
    # 2. Security: hard requirements for financial scenarios
    "security": {
        "no_hardcoded_secrets": True,
        "pii_redaction": True,
        "audit_log": True,
        "max_data_retention_days": 30,
    },
    # 3. Performance: basic SLA
    "performance": {
        "p95_latency_ms": 3000,
        "error_rate_max": 0.01,
        "timeout_ms": 10000,
    },
}
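One way such criteria might be enforced is a checker that compares a skill's self-reported manifest against them. The sketch below is an assumption about how that could look: the manifest format and `check_acceptance` are hypothetical, and the interpretation of keys (booleans must match, `*_min` keys are lower bounds, every other number is an upper bound) is inferred from the names above.

```python
def check_acceptance(manifest, criteria):
    """Return a list of human-readable failures; empty means accepted."""
    failures = []
    for section, checks in criteria.items():
        reported = manifest.get(section, {})
        for key, required in checks.items():
            actual = reported.get(key)
            if actual is None:
                failures.append(f"{section}.{key}: missing")
            elif isinstance(required, bool):
                # bool is checked before numbers because bool is an int subclass
                if actual is not required:
                    failures.append(f"{section}.{key}: expected {required}")
            elif key.endswith("_min"):
                if actual < required:
                    failures.append(f"{section}.{key}: {actual} < required {required}")
            elif actual > required:
                failures.append(f"{section}.{key}: {actual} > allowed {required}")
    return failures


# A subset of the criteria above, to keep the example short.
criteria = {
    "functional": {"input_schema": True, "test_cases_min": 3},
    "performance": {"p95_latency_ms": 3000, "error_rate_max": 0.01},
}
# Hypothetical manifest submitted with a pull request.
manifest = {
    "functional": {"input_schema": True, "test_cases_min": 2},
    "performance": {"p95_latency_ms": 2400},
}
failures = check_acceptance(manifest, criteria)
print(failures)
```

Treating a missing value as a failure rather than a pass keeps the gate conservative: a contributor has to state every number explicitly.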
Contribution Paths
- User: Install from the library, configure parameters, deploy
- Customizer: Fork, adjust business parameters (rule thresholds, prompt templates), contribute back upstream
- Contributor: Develop new skills, submit PRs, pass all three acceptance types
56 scenarios aren't 56 standalone systems — they're 56 standardized skill building blocks. A bank's competitive advantage lies not in the blocks themselves, but in assembling them into differentiated business workflows.
Engineering Checklist: From Design to Deployment
Architecture Design Phase
| # | Check Item | Pass Criteria | Common Pitfall |
|---|---|---|---|
| 1 | Skill granularity | Each skill independently testable and deployable | Too fine → orchestration overhead; too coarse → no reuse |
| 2 | Dependency graph acyclic | Topological sort is executable | A→B→A circular deadlock |
| 3 | Hybrid coverage | Every decision node has rule fallback | LLM-only decisions → unexplainable |
| 4 | Data traceability | Every output traceable to input + rule + model | Lost intermediate results → audit failure |
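Check item 2 can be automated at design time with the standard library. The sketch below uses `graphlib.TopologicalSorter` (Python 3.9+), which accepts the same "node → list of prerequisites" mapping as the orchestrator's dependency graph; the `validate_graph` wrapper is a hypothetical helper.

```python
from graphlib import CycleError, TopologicalSorter


def validate_graph(graph):
    """Return (True, execution_order) or (False, reason)."""
    # Dependencies that are not declared as skills would stall the scheduler.
    unknown = sorted({d for deps in graph.values() for d in deps} - set(graph))
    if unknown:
        return False, f"unknown dependencies: {unknown}"
    try:
        order = list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        # The second element of args holds the nodes forming the cycle
        return False, f"cycle: {exc.args[1]}"
    return True, order


good_graph = {
    "kyc": [],
    "credit_assess": ["kyc"],
    "compliance": ["kyc"],
    "risk_pricing": ["credit_assess", "compliance"],
}
ok, order = validate_graph(good_graph)
print(ok, order)

bad_ok, reason = validate_graph({"a": ["b"], "b": ["a"]})
print(bad_ok, reason)
```

Running this in CI against the registered graph catches an A→B→A loop before it reaches the orchestrator at runtime.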
Development Phase
from abc import ABC, abstractmethod
from pydantic import BaseModel


class SkillInput(BaseModel):
    """Every skill must define an input schema"""
    application: dict
    upstream: dict = {}


class SkillOutput(BaseModel):
    """Every skill must define an output schema"""
    result: str
    confidence: float
    rule_hits: list = []
    llm_factors: list = []
    explanation: str = ""
    metadata: dict = {}


class BaseSkill(ABC):
    """Base class all financial AI skills must inherit"""
    name: str
    version: str

    @abstractmethod
    async def run(self, input_data: SkillInput) -> SkillOutput:
        pass

    @abstractmethod
    def validate_input(self, input_data: dict) -> bool:
        """Input validation: prevent injection and unauthorized access"""
        pass

    @abstractmethod
    def redact_output(self, output: SkillOutput) -> SkillOutput:
        """Output redaction: remove PII and sensitive fields"""
        pass

    def _audit_log(self, input_data: SkillInput, output: SkillOutput) -> None:
        """Audit hook: subclasses write to their audit store here"""
        raise NotImplementedError

    async def safe_run(self, input_data: SkillInput) -> SkillOutput:
        """Safe execution: validate → execute → redact → audit"""
        # .dict() is the pydantic v1 spelling; on pydantic v2 prefer model_dump()
        if not self.validate_input(input_data.dict()):
            raise ValueError(f"[{self.name}] Input validation failed")
        output = await self.run(input_data)
        redacted = self.redact_output(output)
        # Structured, tamper-proof audit log
        self._audit_log(input_data, redacted)
        return redacted
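The audit step in `safe_run` is left abstract above. One common way to approximate a tamper-proof log is a hash chain, where each entry's hash covers the previous entry's hash. The sketch below is an assumption, not the article's implementation: the `AuditLog` class is hypothetical, it keeps entries in memory, and strictly speaking it is tamper-evident (edits are detectable) rather than tamper-proof.

```python
import hashlib
import json


class AuditLog:
    """In-memory hash-chained log (illustrative only)."""

    GENESIS = "0" * 64

    def __init__(self):
        self.entries = []

    @staticmethod
    def _digest(prev_hash, record):
        # sort_keys gives a stable serialization, so equal records hash equally
        payload = json.dumps(record, sort_keys=True)
        return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()

    def append(self, record):
        prev_hash = self.entries[-1]["hash"] if self.entries else self.GENESIS
        entry_hash = self._digest(prev_hash, record)
        self.entries.append({"record": record, "prev": prev_hash, "hash": entry_hash})
        return entry_hash

    def verify(self):
        """Recompute the chain; any edited or reordered entry breaks it."""
        prev_hash = self.GENESIS
        for entry in self.entries:
            if entry["prev"] != prev_hash:
                return False
            if entry["hash"] != self._digest(prev_hash, entry["record"]):
                return False
            prev_hash = entry["hash"]
        return True


log = AuditLog()
log.append({"skill": "kyc", "result": "approve"})
log.append({"skill": "credit_assess", "result": "review"})
intact = log.verify()

# Simulate someone rewriting history after the fact.
log.entries[0]["record"]["result"] = "reject"
after_tampering = log.verify()
print(intact, after_tampering)
```

In production the chain head would also need to be anchored somewhere the writer cannot modify (for example, a separate system or write-once storage); otherwise an attacker could simply rebuild the whole chain.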
Testing Phase
- Unit tests: Each skill independently, covering normal + edge + error cases
- Integration tests: 2-3 skills grouped, verify data flow and dependency resolution
- End-to-end tests: Full 7-skill pipeline, including adversarial samples and stress tests
Deployment Phase
| Step | Action | Key Checkpoint |
|---|---|---|
| 1 | Gradual rollout | 5% → 20% → 50% → 100%, each stage for at least 3 days |
| 2 | Accuracy reconciliation | Agent decision vs. human decision, >5% deviation pauses rollout |
| 3 | Explainability verification | Random sample 100 decisions, manually verify traceability |
| 4 | Regulatory filing | Submit model documentation (algorithm logic, data sources, risk controls) |
| 5 | Emergency plan | One-click fallback to rules-engine-only mode, ensure business continuity |
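Steps 1 and 2 of the table combine naturally into a rollout gate. A minimal sketch, assuming the decision labels are comparable strings and that agent and human decisions are sampled for the same cases; the `next_action` function and its return strings are hypothetical, while the stages, the 3-day minimum, and the 5% deviation threshold come from the table.

```python
STAGES = [5, 20, 50, 100]      # traffic percentage per stage
MIN_DAYS_PER_STAGE = 3
MAX_DEVIATION = 0.05           # more than 5% disagreement pauses the rollout


def next_action(stage_pct, days_at_stage, agent_decisions, human_decisions):
    """Decide whether to pause, hold, or advance the rollout."""
    if not agent_decisions or len(agent_decisions) != len(human_decisions):
        raise ValueError("need equal-length, non-empty decision samples")
    mismatches = sum(a != h for a, h in zip(agent_decisions, human_decisions))
    deviation = mismatches / len(agent_decisions)
    if deviation > MAX_DEVIATION:
        return "pause"
    if days_at_stage < MIN_DAYS_PER_STAGE or stage_pct == STAGES[-1]:
        return "hold"
    return f"advance to {STAGES[STAGES.index(stage_pct) + 1]}%"


# Hypothetical sample: 100 paired decisions with 3 disagreements.
human = ["approve"] * 100
agent = ["approve"] * 97 + ["reject"] * 3

print(next_action(5, 3, agent, human))   # 3% deviation, 3 days served
print(next_action(5, 1, agent, human))   # too early to advance
```

Checking deviation before the day count means a bad stage is paused immediately instead of waiting out its minimum duration.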
From monolith to skill-based collaboration isn't a technology choice — it's the inevitable path for banking AI. Buying systems was Phase 1's answer, building platforms was Phase 2's exploration, and raising agents is Phase 3's correct answer. What makes agents actually deployable, auditable, and evolvable is: hybrid architecture for safety, skill orchestration for efficiency, four-dimension evaluation for direction, and ecosystem reuse for cost reduction. Get these four things right, and banking AI stops being a slide deck and becomes real business productivity.