RaftLabs - AI App Dev Agency

Posted on May 31 • Originally published at raftlabs.com

Why your KYC agent should never decide pass/fail

#ai #architecture #python #fintech

A full KYC review takes a compliance team 7-10 business days. The agent we build does the same groundwork in under 10 minutes. That part is easy. The hard part is building it so a regulator accepts the output — and that constraint, not the model, is what decides your architecture.

The engineering problem hiding inside "automate KYC" is a routing problem: which decisions go to an LLM, which go to a rules engine, which go to an ML model, and which never leave a human. Get the boundaries wrong and you ship something that's either useless (humans still do everything) or illegal (the model files a regulatory report). Most teams get them wrong in the same way.

TL;DR

KYC/AML automation is a layered pipeline, not one agent. Five layers: extraction → screening → scoring → narrative → human queue.
The one decision the whole system hangs on: the LLM never produces the verdict. It extracts and drafts. Rules and ML decide. Humans sign off.
Real numbers: extraction 5 min → under 10s; full review 7-10 days → under 10 min + human review; build is 10-14 weeks for one jurisdiction, +4-6 weeks per additional one.
The hard parts aren't ML. They're the audit trail, the false-positive feedback loop, and per-jurisdiction rulesets — and all three have to be designed in from day one.
Autonomous SAR filing is illegal in the US, UK, and EU. Build "draft and present," never "draft and submit."

Why this is a real engineering problem, not a wrapper

The temptation is to treat this as a thin layer over a frontier model: feed it the documents, the transaction history, the sanctions list, and ask "is this customer risky?" Ship in a weekend.

That design fails in production for a reason that has nothing to do with model quality. An LLM can give inconsistent answers across identical inputs, and compliance is a domain where the same input must produce the same decision every time, with a reason attached. Run the same ambiguous transaction through a chat model twice and you can get "suspicious" and "not suspicious." When a regulator asks why you flagged one customer and not an identical one, "sampling temperature" is not an acceptable answer.

So the model can't own the verdict. But it's genuinely excellent at the work around the verdict — and that work is 60-70% of an analyst's day. The whole architecture falls out of taking that split seriously.

The core abstraction: what each layer is allowed to decide

The pipeline reduces to five layers, and the only thing that matters is which tool is allowed to make a decision at each one.

Layer	Job	Tool	Allowed to decide?
1. Document extraction	Pull fields off IDs, proof of address, registration docs	LLM / VLM	No — produces structured data
2. Rules-based screening	Sanctions, PEP, threshold checks	Rules engine + LLM for name matching	The rule decides; LLM only proposes matches
3. Risk scoring	Combine signals into a score	ML model on your historical decisions	Produces a score, not a pass/fail
4. Narrative generation	Write the human-readable case	LLM	No — drafts only
5. Human review queue	Approve, escalate, file	Human	Yes — the only layer that decides

The principle in one line: LLMs for extraction and drafting, rules and ML for decisions, humans for sign-off. Your rules engine will never hallucinate a sanctions match, and it will never turn raw JSON into a coherent narrative; your LLM can do the second but cannot promise the first. Each layer does the job it is least likely to get wrong.
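One way to keep the table honest is to encode it, so a layer that tries to decide outside its lane fails loudly instead of quietly. This is a minimal sketch under assumptions: the `Layer` enum, the actor strings, and `check_write` are hypothetical names for illustration, not part of any library or of the build described here.

```python
from enum import Enum


class Layer(Enum):
    EXTRACTION = "extraction"
    SCREENING = "screening"
    SCORING = "scoring"
    NARRATIVE = "narrative"
    HUMAN_REVIEW = "human_review"


# Which actor may record an output at each layer (mirrors the table above).
ALLOWED_ACTORS = {
    Layer.EXTRACTION: {"llm"},           # structured data only
    Layer.SCREENING: {"rules_engine"},   # LLM proposes, the rule records the result
    Layer.SCORING: {"ml_model"},         # a score, not a pass/fail
    Layer.NARRATIVE: {"llm"},            # drafts only
    Layer.HUMAN_REVIEW: {"human"},       # the only layer that issues a verdict
}

# Only these layers may emit a final pass/fail verdict.
VERDICT_LAYERS = {Layer.HUMAN_REVIEW}


def check_write(layer: Layer, actor: str, is_verdict: bool) -> None:
    """Raise if an actor tries to record something its layer does not permit."""
    if actor not in ALLOWED_ACTORS[layer]:
        raise PermissionError(f"{actor} may not write at layer {layer.value}")
    if is_verdict and layer not in VERDICT_LAYERS:
        raise PermissionError(f"layer {layer.value} may not issue a verdict")
```

Calling `check_write(Layer.NARRATIVE, "llm", is_verdict=True)` raises, which is the point: the boundary lives in code, not in a prompt.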

The body: where the LLM earns its place

Layer 1 — Document extraction

Reading a passport and pulling name, DOB, document number, and expiry is 5 minutes of analyst time. A vision model does it in under 10 seconds, including handwritten fields, non-Latin scripts, and bad scans. At 100 documents/day that's 7-8 analyst-hours reclaimed daily.

The design decision here is schema-constrained output, not free text. Extraction returns a typed object you can validate, not prose you have to re-parse.

from dataclasses import dataclass
from datetime import date

@dataclass
class ExtractedID:
    full_name: str
    date_of_birth: date
    document_number: str
    expiry_date: date
    confidence: float          # confidence drives routing (one overall value in this sketch)
    source_doc_id: str         # ties the field back to the image for audit

def extract_id(image_bytes: bytes) -> ExtractedID:
    # VLM call returns JSON matching the schema; we never accept free text.
    raw = vision_model.extract(image_bytes, schema=ExtractedID)
    # Low confidence does not get silently accepted — it routes to a human.
    if raw.confidence < 0.85:               # this threshold is the whole game
        raw = flag_for_manual_review(raw)
    return raw

That confidence < 0.85 line is not a detail. It's the dial that decides how much work the human queue gets, and you will tune it for months (more on calibration below).
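To make "we never accept free text" concrete, here is a hedged sketch of strict parsing at the boundary, assuming the vision model returns a JSON string with ISO-format dates. `parse_extraction` and `route` are hypothetical helpers, and the sanity checks are examples rather than a complete validation policy.

```python
import json
from dataclasses import dataclass
from datetime import date


@dataclass
class ExtractedID:
    full_name: str
    date_of_birth: date
    document_number: str
    expiry_date: date
    confidence: float
    source_doc_id: str


def parse_extraction(raw_json: str) -> ExtractedID:
    """Parse model output strictly; a missing or malformed field raises."""
    data = json.loads(raw_json)
    record = ExtractedID(
        full_name=str(data["full_name"]).strip(),
        date_of_birth=date.fromisoformat(data["date_of_birth"]),
        document_number=str(data["document_number"]).strip(),
        expiry_date=date.fromisoformat(data["expiry_date"]),
        confidence=float(data["confidence"]),
        source_doc_id=str(data["source_doc_id"]),
    )
    if not record.full_name or not record.document_number:
        raise ValueError("empty required field")
    if not 0.0 <= record.confidence <= 1.0:
        raise ValueError("confidence out of range")
    if record.date_of_birth >= record.expiry_date:
        raise ValueError("date of birth must precede expiry date")
    return record


def route(record: ExtractedID, threshold: float = 0.85) -> str:
    """Return the queue for a record; 0.85 is the starting value from the text."""
    return "auto" if record.confidence >= threshold else "manual_review"
```

A parse failure and a low-confidence parse are different events, and it is worth logging them separately: the first says the model broke the contract, the second says the document was hard.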

Layer 2 — Screening, where LLMs match but rules decide

OFAC, UN, EU, and HM Treasury lists carry names in multiple transliterations and aliases. An exact-match rules engine misses "Mohammed Al-Rashid" when the list says "Muhammad Al Rasheed." An LLM handles the variation naturally.

But here's the boundary most teams blur: the LLM proposes a candidate match; the rules engine, not the model, records the screening result. The model's job is recall (don't miss a real match); the deterministic layer owns the recorded decision and the evidence.

def screen_name(name: str, sanctions_list: list[SanctionsEntry]) -> ScreeningResult:
    # LLM proposes candidates across transliterations/aliases — high recall.
    candidates = llm_name_match(name, sanctions_list)
    # The DECISION is deterministic: a candidate above the match rule is a hit.
    # The LLM never gets to say "probably fine, skip it."
    hits = [c for c in candidates if c.match_score >= MATCH_THRESHOLD]
    return ScreeningResult(
        name=name,
        hits=hits,
        evidence=[c.matched_entry for c in candidates],  # logged regardless
        decided_by="rules_engine",                        # never "llm"
    )
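The snippet above leaves open where `match_score` comes from. One option, sketched here as an assumption rather than as the method used in this build, is to have the deterministic layer rescore every LLM-proposed candidate with a reproducible string similarity, so the recorded hit never depends on a model-reported number. The normalization rules and the 0.75 threshold are placeholders to tune against labeled matches.

```python
import unicodedata
from difflib import SequenceMatcher

MATCH_THRESHOLD = 0.75  # placeholder value, not a recommendation


def normalize_name(name: str) -> str:
    """Lowercase, strip diacritics, and collapse punctuation to single spaces."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() else " " for ch in no_marks.lower())
    return " ".join(cleaned.split())


def match_score(name: str, listed_name: str) -> float:
    """Deterministic similarity in [0, 1]; same inputs always give the same score."""
    return SequenceMatcher(
        None, normalize_name(name), normalize_name(listed_name)
    ).ratio()


def decide_hits(name: str, candidates: list[str],
                threshold: float = MATCH_THRESHOLD) -> list[tuple[str, float]]:
    """Keep every LLM-proposed candidate whose deterministic score clears the rule."""
    scored = [(entry, match_score(name, entry)) for entry in candidates]
    return [(entry, score) for entry, score in scored if score >= threshold]
```

With this placeholder threshold, "Mohammed Al-Rashid" against "Muhammad Al Rasheed" clears the rule and an unrelated name does not. Plain edit-distance similarity is weaker than an LLM on transliteration, which is exactly why the LLM generates candidates and this layer only confirms and records them.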

Layer 4 — Narrative generation

Instead of returning a raw score, the agent writes the case: this entity has a registered address in a high-risk jurisdiction, the UBO shares a name variant with an OFAC SDN listing, and two adverse-media mentions from Q3 reference a regulatory investigation in their home market. The analyst gets context, not a number. SAR drafting alone saves 2-3 hours per filing; at 10 SARs/month that's 20-30 hours back.

The narrative is downstream of the decision, not part of it. It reads the structured outputs of layers 1-3 and explains them. It never re-litigates the verdict.
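A small sketch of what "downstream of the decision" can look like in practice: the prompt is assembled only from the structured outputs of the earlier layers, and it explicitly withholds the verdict from the model. The function name, the prompt wording, and the dictionary shapes are all assumptions for illustration.

```python
import json


def build_narrative_prompt(case_id: str, extraction: dict,
                           screening: dict, risk: dict) -> str:
    """Assemble the drafting prompt from structured layer outputs only."""
    facts = {
        "case_id": case_id,
        "extraction": extraction,
        "screening": screening,
        "risk": risk,
    }
    return (
        "Summarize the facts below for a compliance analyst.\n"
        "Cite only what appears in the facts. Do not recommend approval, "
        "rejection, or filing.\n\n"
        "FACTS:\n" + json.dumps(facts, indent=2, sort_keys=True)
    )
```

Because the facts are serialized with sorted keys, the same case always produces the same prompt, which makes the drafting input easy to store alongside the rest of the case evidence.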

The parts nobody warns you about

This is where the project actually lives or dies. None of it is model work.

1. The audit trail is a schema decision, not a logging afterthought

FinCEN (US), the FCA (UK), and AMLD6 (EU) don't prohibit AI in compliance. They require explainability and human accountability. In a regulatory exam you will be asked to produce the full decision trail for a specific named case: every piece of evidence considered, every reasoning step, every human touch, timestamped.

"The model said so" is not an answer a regulator accepts, which means an agent that returns a score with no attached reasoning isn't an asset — it's a liability you've shipped to production. Retrofitting this is brutal. Design the decision record first and make every layer write to it:

from dataclasses import dataclass
from datetime import datetime

@dataclass
class DecisionRecord:
    case_id: str
    layer: str                  # which pipeline stage
    inputs_hash: str            # reproducibility: same input, same hash
    evidence: list[dict]        # what was considered
    output: dict                # score / hit / extracted fields
    decided_by: str             # "rules_engine" | "ml_model" | "human:<id>"
    model_version: str          # you WILL be asked which model decided
    timestamp: datetime

Note model_version. When you upgrade the extraction model six months in, you need to know which version produced a decision a regulator is now questioning. Treat model versions like database migrations: tracked, dated, and traceable to every decision they produced.
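For the `inputs_hash` field to support "same input, same record," the hash has to be computed over a canonical form of the inputs. A minimal sketch, assuming the inputs are JSON-serializable; `hash_inputs`, `make_record`, and `to_log_line` are hypothetical helpers, and the record class is repeated so the block runs on its own.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class DecisionRecord:
    case_id: str
    layer: str
    inputs_hash: str
    evidence: list
    output: dict
    decided_by: str
    model_version: str
    timestamp: datetime


def hash_inputs(inputs: dict) -> str:
    """SHA-256 over canonical JSON, so key order never changes the hash."""
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def make_record(case_id: str, layer: str, inputs: dict, evidence: list,
                output: dict, decided_by: str, model_version: str) -> DecisionRecord:
    """Build a record with a UTC timestamp and a reproducible inputs hash."""
    return DecisionRecord(
        case_id=case_id,
        layer=layer,
        inputs_hash=hash_inputs(inputs),
        evidence=evidence,
        output=output,
        decided_by=decided_by,
        model_version=model_version,
        timestamp=datetime.now(timezone.utc),
    )


def to_log_line(record: DecisionRecord) -> str:
    """Serialize one record as a JSON line for an append-only store."""
    payload = asdict(record)
    payload["timestamp"] = record.timestamp.isoformat()
    return json.dumps(payload, sort_keys=True)
```

Where the lines get written (a write-once bucket, an append-only table) is a separate decision; the sketch only shows that every layer can emit the same shape.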

2. False positives are an adoption problem, and the fix is a feedback loop

Early AML alert systems are notorious for producing alerts that are roughly 95% legitimate transactions. Industry analysis (McKinsey, cited in the original; re-source before you quote it) puts false positives at over 90% of transaction-monitoring alerts at most banks. Even an agent that wrongly flags only 10% of the 1,000 transactions it reviews each day hands your compliance team 100 wrong alerts daily. They stop trusting it inside a week.

The only fix is a loop from analyst decisions back into the model. If you don't capture "analyst overrode this alert" as labeled training data from day one, your false-positive rate never improves — it's frozen at whatever you launched with. Agentic systems can auto-resolve 55-75% of sanctions alerts, but only with that loop built in. This is a data-pipeline commitment, not a feature you bolt on in Q3.
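Capturing overrides as data can be very small on day one. The sketch below assumes each alert ends with an analyst's final call; `AlertOutcome` and both helpers are hypothetical names. Note that it measures the share of alerts that turned out to be false, which is the "over 90% of alerts" kind of figure, not a per-transaction rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertOutcome:
    alert_id: str
    model_flagged: bool
    analyst_confirmed: bool  # the analyst's final call on this alert


def to_training_label(outcome: AlertOutcome) -> dict:
    """Turn an analyst decision into a labeled example for retraining."""
    return {
        "alert_id": outcome.alert_id,
        "label": 1 if outcome.analyst_confirmed else 0,
        "overridden": outcome.model_flagged and not outcome.analyst_confirmed,
    }


def false_alert_share(outcomes: list[AlertOutcome]) -> float:
    """Fraction of flagged alerts that the analyst overrode."""
    flagged = [o for o in outcomes if o.model_flagged]
    if not flagged:
        return 0.0
    overridden = sum(1 for o in flagged if not o.analyst_confirmed)
    return overridden / len(flagged)
```

Tracking `false_alert_share` per week gives you the one chart that tells you whether the loop is working.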

3. Per-jurisdiction rulesets, not a global standard

What FinCEN requires is not what the FCA requires is not what AMLD6 requires. If you build one global ruleset, you're simultaneously over-compliant in one market (killing onboarding conversion) and under-compliant in another (the actual risk). The screening and scoring layers need to be parameterized by jurisdiction, resolved per case. Build for multiple rulesets even if you launch with one — the seam is expensive to add later.
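The seam can be as simple as a ruleset object resolved per case. In this sketch every value is a placeholder: the list names, thresholds, and field names are illustrative assumptions, not statements of what any regulator requires, and the real contents should come from compliance counsel.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ruleset:
    jurisdiction: str
    sanctions_lists: tuple[str, ...]
    match_threshold: float        # placeholder, tuned per market
    enhanced_review_score: float  # placeholder, tuned per market


# Placeholder values only; real thresholds and lists are a legal decision.
RULESETS = {
    "US": Ruleset("US", ("OFAC", "UN"), 0.80, 0.70),
    "UK": Ruleset("UK", ("HMT", "UN"), 0.80, 0.70),
    "EU": Ruleset("EU", ("EU", "UN"), 0.80, 0.70),
}


def resolve_ruleset(jurisdiction: str) -> Ruleset:
    """Fail closed: an unknown jurisdiction is an error, never a silent default."""
    try:
        return RULESETS[jurisdiction]
    except KeyError:
        raise LookupError(f"no ruleset configured for {jurisdiction!r}") from None
```

Refusing to fall back to a default ruleset is a deliberate choice here: a case screened under the wrong market's rules is the under-compliance failure described above.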

4. The one thing you are legally barred from automating

Autonomous SAR filing is illegal in the US, UK, and EU. A SAR requires a named individual to sign off before filing. An agent that pre-populates and drafts a SAR is valuable. An agent that files one is a regulatory incident. The architectural consequence: layer 5 is not optional, and the "submit" action must be gated behind human identity, not a service account.
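Gating "submit" behind human identity can be enforced in the type of the caller rather than in a policy document. A minimal sketch, assuming your auth layer can tell a human principal from a service account; `Actor`, `SarDraft`, and the return value of `submit` are hypothetical, and no real filing API is called.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Actor:
    actor_id: str
    kind: str  # "human" or "service"


@dataclass
class SarDraft:
    case_id: str
    body: str
    approved_by: Optional[str] = None


def approve(draft: SarDraft, actor: Actor) -> None:
    """Record a named human sign-off on the draft."""
    if actor.kind != "human":
        raise PermissionError("only a named human can approve a SAR draft")
    draft.approved_by = actor.actor_id


def submit(draft: SarDraft, actor: Actor) -> str:
    """Hand an approved draft to filing; the agent's service account cannot."""
    if actor.kind != "human":
        raise PermissionError("service accounts cannot submit a SAR")
    if draft.approved_by is None:
        raise PermissionError("draft has no human sign-off")
    return f"queued-for-filing:{draft.case_id}:by:{actor.actor_id}"
```

The agent can create and edit `SarDraft` objects all day; it simply has no code path that reaches `submit`.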

Sequencing: don't build five layers at once

The teams that try to ship the full stack in one sprint ship nothing in three months. The teams that ship one layer, validate it, and expand have a production system in 12-16 weeks. Build in this order:

Weeks 1-4: Document extraction only. Highest volume, lowest complexity, individual KYC cases. Validate extraction accuracy against a labeled set. Confirm the audit trail actually holds. Confirm the low-confidence route lands cases in the human queue.
Weeks 5-8: Sanctions + PEP screening. Wire in the rules engine. Tune MATCH_THRESHOLD against known true/false matches. Start capturing analyst overrides now; this is your future training data.
Weeks 9-12: Risk scoring. Train on your historical decisions, not a generic model. Output a score plus the contributing signals.
Weeks 12-16: Narrative + SAR draft prep. Last, because it depends on everything above being structured and reliable.

Add 4-6 weeks per additional jurisdiction with meaningfully different requirements. And the unglamorous prerequisite: work with compliance counsel before the technical build, not after. The requirements define the architecture here; this is one of the few builds where the regulation genuinely should drive the schema. The same up-front-constraints discipline shows up in our PCI DSS notes for app builders if you're touching payment data in the same system.

Calibration is the real ongoing work

Two thresholds run this system: extraction confidence (layer 1) and match score (layer 2). Both ship at a guess and get tuned for months against production data. Set extraction confidence too high and everything routes to humans — you've automated nothing. Too low and you accept garbage fields into a screening decision. Same tension on match score: high recall floods the queue with false hits; high precision misses a real sanctions match, which is the one error class that ends careers. There's no correct static value. There's a curve, and you move along it as your feedback loop matures and your analysts tell you where the pain is. If you're approaching the rollout itself, the 12-week MVP sequencing we use for production AI maps cleanly onto the layer order above.
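"Moving along the curve" is easier to discuss with the curve in front of you. This sketch sweeps a threshold over labeled examples, assumed to be (score, is_true_match) pairs from analyst-reviewed cases, and reports precision, recall, and queue size at each value. The function name and data shape are assumptions; the same sweep applies to either threshold.

```python
def sweep_thresholds(scored: list[tuple[float, bool]],
                     thresholds: list[float]) -> list[dict]:
    """Compute precision, recall, and flagged volume at each candidate threshold."""
    rows = []
    for t in thresholds:
        tp = sum(1 for score, truth in scored if score >= t and truth)
        fp = sum(1 for score, truth in scored if score >= t and not truth)
        fn = sum(1 for score, truth in scored if score < t and truth)
        rows.append({
            "threshold": t,
            # With nothing flagged there are no false positives, so report 1.0.
            "precision": tp / (tp + fp) if (tp + fp) else 1.0,
            # With no true matches in the data there is nothing to miss.
            "recall": tp / (tp + fn) if (tp + fn) else 1.0,
            "queue_size": tp + fp,
        })
    return rows


if __name__ == "__main__":
    labeled = [(0.95, True), (0.90, False), (0.85, True),
               (0.70, False), (0.60, True), (0.40, False)]
    for row in sweep_thresholds(labeled, [0.5, 0.8, 0.92]):
        print(row)
```

For sanctions matching, read the table from the recall column first: a missed true match is the expensive error, so the question is how much queue you are willing to pay for each point of recall.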

Closer

The whole design hinges on one boundary: the LLM never owns the verdict. If you agree with that, the five layers more or less write themselves. If you don't — if you think a sufficiently good model should be allowed to decide pass/fail with the right guardrails — that's the part I'd genuinely want to argue out in a design review, because that's where I've seen teams talk themselves into something a regulator later tears down.

One flag before you quote anything elsewhere: the false-positive and alert-resolution figures (90%+ false positives, 55-75% auto-resolution) trace to McKinsey via the original post — re-source them directly before you put them in front of a regulator or a board. The timeline and time-savings numbers are from real builds, but your jurisdiction mix will move them. Curious where other people draw the LLM-vs-rules line, especially on name matching — that's the layer where I'm least sure the boundary I drew is the only defensible one.
