TL;DR
Rule-based AML systems generate false positive rates exceeding 90%. For enterprise banks, this translates to £10M or more in wasted operational cost annually. By replacing static thresholds with behavioural baselines and dynamic scoring, teams consistently achieve 60% to 80% false positive reduction whilst maintaining 95%+ true positive rates. This post walks through the TypeScript architecture, a 90-day parallel-run validation strategy, and the production metrics that matter.
Last Wednesday, a compliance director at a mid-tier bank told me her team processed 847 AML alerts that week. Six were genuine. The other 841 consumed 127 hours of analyst time investigating perfectly legitimate transactions. That is £15,000 in salaries to confirm nothing was wrong.
If you are building or maintaining AML transaction monitoring, you have probably seen this pattern from the data side. Your rules fire on every transaction above a fixed amount. Your screening hits on every partial name match. Your analysts drown. And somewhere in the noise, real financial crime slips through.
I have been tracking performance data across 40 institutions, and the numbers are consistent. The typical enterprise compliance team processes 200 alerts daily. At an average of 22 minutes per alert, that is over 73 hours of analyst time every day. When 95% of those alerts are false positives (the FCA's own estimate), you are paying senior professionals £180,000 annually to repeatedly confirm that legitimate transactions are legitimate.
For enterprise banks with £50 billion in assets, this scales to £12M to £20M in wasted operational costs annually. Even conservatively, most institutions land at £8M to £15M in annual false positive costs once you account for analyst salaries, management overhead, and system maintenance.
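The arithmetic behind those figures can be sketched as a simple cost model. The rates and volumes below are illustrative assumptions (200 alerts a day, 95% false positives, 22 minutes each, a £60 fully loaded analyst hour, 250 working days), not benchmarks; plug in your own team's numbers:

```typescript
// Illustrative cost model for false-positive alert handling.
// Every input is an assumption for the example, not a benchmark.
interface AlertCostInputs {
  alertsPerDay: number;
  falsePositiveRate: number; // fraction of alerts that are false positives
  minutesPerAlert: number;
  analystHourlyRate: number; // fully loaded, in GBP
  workingDaysPerYear: number;
}

function annualFalsePositiveCost(i: AlertCostInputs): number {
  const fpAlertsPerYear =
    i.alertsPerDay * i.falsePositiveRate * i.workingDaysPerYear;
  const hours = (fpAlertsPerYear * i.minutesPerAlert) / 60;
  return hours * i.analystHourlyRate;
}

// Roughly £1M per year for a single team at these assumed rates.
const cost = annualFalsePositiveCost({
  alertsPerDay: 200,
  falsePositiveRate: 0.95,
  minutesPerAlert: 22,
  analystHourlyRate: 60,
  workingDaysPerYear: 250,
});
```

Multiply that across the teams of a large institution and the £8M to £15M range stops looking pessimistic.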
The fix is not "buy better software". It is architectural. Let me walk you through it.
Why static rules create most of the noise
Traditional AML systems work like security guards with a very long, very specific checklist. Transaction over £10,000? Flag it. Multiple deposits just under £10,000? Flag it. Wire transfer to a high-risk jurisdiction? Flag it.
The problem is legitimate business activity triggers these rules constantly. A property developer paying contractors, a consultant receiving delayed invoices, an importer settling quarterly accounts: all generate alerts despite representing normal commercial behaviour.
Fraud rarely arrives fully formed. It escalates over time. But rule-based systems can only spot the end state, not the progression. By the time a rule fires, you are investigating completed schemes rather than preventing emerging risks.
The engineering answer: replace static thresholds with behavioural baselines that learn what "normal" looks like for each customer.
Step 1: model a customer behaviour profile
The core data structure captures transaction patterns per customer over time:
```typescript
interface CustomerBehaviourProfile {
  customerId: string;
  riskRating: 'LOW' | 'MEDIUM' | 'HIGH';
  relationshipMonths: number;
  businessType: string;
  typicalTransactionAmount: number;
  averageMonthlyVolume: number;
  commonCounterparties: string[];
  normalTransactionTimes: number[]; // hours of day
  seasonalPatterns: MonthlyPattern[];
}

interface MonthlyPattern {
  month: number;
  averageVolume: number;
  typicalTransactionTypes: string[];
}
```
This profile is rebuilt on a rolling window (we typically use 12 months, weighted towards the most recent 3). The key insight: anomaly detection flags statistically unusual behaviour rather than breaches of arbitrary thresholds. A £50K transaction from a regular corporate client scores differently from the same amount sent to a new cryptocurrency exchange.
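A minimal sketch of the profile-building step, assuming a simplified transaction shape and a crude recency scheme (transactions from the last 90 days count triple when estimating the typical amount). The `Txn` shape, the tripling weight, and the "seen at least 3 times" counterparty rule are all illustrative assumptions:

```typescript
// Sketch: build a behaviour profile from a customer's transaction
// history, weighting the most recent ~3 months more heavily.
interface Txn {
  amount: number;
  counterparty: string;
  timestamp: Date; // used for recency weighting
}

function buildProfile(customerId: string, history: Txn[], now: Date) {
  const threeMonthsMs = 90 * 24 * 60 * 60 * 1000;

  // Recent transactions count triple when estimating the typical amount.
  const weighted = history.flatMap((t) =>
    now.getTime() - t.timestamp.getTime() < threeMonthsMs ? [t, t, t] : [t]
  );
  const typicalTransactionAmount =
    weighted.reduce((sum, t) => sum + t.amount, 0) /
    Math.max(weighted.length, 1);

  // Counterparties seen at least 3 times count as "common".
  const counterpartyCounts = new Map<string, number>();
  for (const t of history) {
    counterpartyCounts.set(
      t.counterparty,
      (counterpartyCounts.get(t.counterparty) ?? 0) + 1
    );
  }
  const commonCounterparties = [...counterpartyCounts.entries()]
    .filter(([, n]) => n >= 3)
    .map(([name]) => name);

  return { customerId, typicalTransactionAmount, commonCounterparties };
}
```

In production you would replace the weighted mean with something robust to outliers (a trimmed mean or median), but the shape of the computation is the same.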
Step 2: score deviations, not amounts
Instead of binary flag/no-flag decisions, produce a continuous anomaly score:
```typescript
interface AnomalyScore {
  score: number; // 0-100
  factors: DeviationFactor[];
  riskLevel: 'LOW' | 'MEDIUM' | 'HIGH';
  requiresInvestigation: boolean;
}

interface DeviationFactor {
  type: 'AMOUNT' | 'FREQUENCY' | 'COUNTERPARTY' | 'TIME' | 'GEOGRAPHY';
  severity: number;
  description: string;
}
```
Each deviation factor contributes independently. A transaction can score LOW on amount but HIGH on counterparty novelty. The composite score decides whether it reaches a human analyst at all.
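One way to fold independent factors into a composite is to let the worst deviation dominate while the rest contribute a diminishing tail, so several mild anomalies can still escalate a transaction. The halving weights and the 40/70 band cut-offs below are illustrative assumptions, not a calibrated model:

```typescript
// Sketch: combine independent deviation factors into one 0-100 score.
type FactorType = 'AMOUNT' | 'FREQUENCY' | 'COUNTERPARTY' | 'TIME' | 'GEOGRAPHY';

interface Factor {
  type: FactorType;
  severity: number; // 0-100 per factor
}

function compositeScore(factors: Factor[]) {
  // Sort severities descending; each subsequent factor contributes
  // half as much as the one before it.
  const sorted = [...factors].map((f) => f.severity).sort((a, b) => b - a);
  const raw = sorted.reduce((acc, s, i) => acc + s / Math.pow(2, i), 0);
  const score = Math.min(100, Math.round(raw));
  const riskLevel = score >= 70 ? 'HIGH' : score >= 40 ? 'MEDIUM' : 'LOW';
  return { score, riskLevel, requiresInvestigation: score >= 40 };
}
```

With this scheme a transaction that is mildly unusual on amount but strongly unusual on counterparty still lands in the HIGH band, which is exactly the behaviour the paragraph above describes.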
Step 3: replace fixed thresholds with dynamic ones
This is where the false positive rate drops dramatically. Instead of a single monetary threshold for all customers, calculate a personalised threshold:
```typescript
interface DynamicThreshold {
  baseAmount: number;
  customerRiskMultiplier: number;
  relationshipAgeMultiplier: number;
  businessTypeMultiplier: number;
  calculatedThreshold: number;
}

function calculatePersonalisedThreshold(
  customer: CustomerBehaviourProfile
): DynamicThreshold {
  const baseAmount = 10000; // regulatory minimum
  const riskMultiplier =
    customer.riskRating === 'HIGH'
      ? 0.5
      : customer.riskRating === 'MEDIUM'
        ? 1.0
        : 2.0;
  const ageMultiplier = customer.relationshipMonths > 24 ? 1.5 : 1.0;
  // Lookup table keyed by business type, e.g. cash-intensive sectors
  // get a lower multiplier (implementation not shown).
  const businessMultiplier = getBusinessTypeMultiplier(customer.businessType);
  return {
    baseAmount,
    customerRiskMultiplier: riskMultiplier,
    relationshipAgeMultiplier: ageMultiplier,
    businessTypeMultiplier: businessMultiplier,
    calculatedThreshold:
      baseAmount * riskMultiplier * ageMultiplier * businessMultiplier,
  };
}
```
A high-risk customer two months into the relationship gets a threshold of £5,000 (10,000 × 0.5 × 1.0). A low-risk corporate client of three years' standing gets £30,000 (10,000 × 2.0 × 1.5). Both examples assume a neutral business-type multiplier of 1.0. Same regulatory baseline, vastly different alert volumes.
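A quick sanity check of those two worked examples, with the business-type multiplier pinned to a neutral 1.0 in both cases:

```typescript
// The personalised threshold is a straight product of the multipliers.
function threshold(
  base: number,
  risk: number,
  age: number,
  business: number
): number {
  return base * risk * age * business;
}

// High-risk customer, 2-month relationship, neutral business type.
const highRiskNewClient = threshold(10000, 0.5, 1.0, 1.0); // £5,000
// Low-risk corporate client of 3 years, neutral business type.
const lowRiskLongClient = threshold(10000, 2.0, 1.5, 1.0); // £30,000
```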
Step 4: parallel-run for 90 days before cutting over
This is non-negotiable. Run both systems simultaneously for 90 days. Your regulator expects you to demonstrate that technology improves outcomes, not just reduces costs. The parallel run gives you the evidence:
```typescript
interface ParallelRunResults {
  legacySystemAlerts: number;
  aiSystemAlerts: number;
  sharedTruePositives: number;
  aiOnlyTruePositives: number;
  legacyOnlyFalsePositives: number;
  reductionPercentage: number;
}

function analyseParallelResults(
  legacyAlerts: Alert[],
  aiAlerts: Alert[]
): ParallelRunResults {
  // The find* helpers (not shown) compare the two alert sets against
  // the analyst dispositions recorded during the parallel run.
  const sharedTruePositives = findSharedTruePositives(legacyAlerts, aiAlerts);
  const aiOnlyTruePositives = findAiOnlyTruePositives(legacyAlerts, aiAlerts);
  const legacyOnlyFalsePositives = findLegacyOnlyFalsePositives(
    legacyAlerts,
    aiAlerts
  );
  return {
    legacySystemAlerts: legacyAlerts.length,
    aiSystemAlerts: aiAlerts.length,
    sharedTruePositives: sharedTruePositives.length,
    aiOnlyTruePositives: aiOnlyTruePositives.length,
    legacyOnlyFalsePositives: legacyOnlyFalsePositives.length,
    // Share of legacy alerts that were false positives the new
    // system correctly suppressed.
    reductionPercentage:
      (legacyOnlyFalsePositives.length / legacyAlerts.length) * 100,
  };
}
```
The test cases that matter during the parallel run:
- Regular business payment: corporate client making usual supplier payment should score LOW.
- New counterparty deviation: same client paying a new overseas entity should score MEDIUM to HIGH.
- Amount threshold bypass: transaction below the static threshold but unusual for the customer should still flag.
- Temporal pattern recognition: transaction outside normal business hours from a business account should elevate the score.
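The third scenario, the amount-threshold bypass, is the one static rules miss entirely. A sketch of how a behavioural baseline catches it, using a simple z-score against the customer's own history; the £800 mean, £300 standard deviation, and 3-sigma cut-off are illustrative assumptions:

```typescript
// Sketch of the "amount threshold bypass" scenario: a £7,000 payment
// sits below the £10,000 static threshold, but is wildly unusual for
// a customer who normally transacts around £800.
function zScore(amount: number, mean: number, stdDev: number): number {
  return (amount - mean) / stdDev;
}

function flagsOnDeviation(
  amount: number,
  mean: number,
  stdDev: number
): boolean {
  // Flag anything more than 3 standard deviations from the baseline.
  return Math.abs(zScore(amount, mean, stdDev)) > 3;
}

const staticThreshold = 10000;
const amount = 7000;
const passesStaticRule = amount > staticThreshold; // false: legacy rule stays silent
const flaggedByBaseline = flagsOnDeviation(amount, 800, 300); // true: baseline escalates
```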
Target metrics after 90 days
Here is what the data shows across deployments we have tracked:
- False positive reduction: 60% to 80%
- True positive maintenance: 95%+ (you must not lose real alerts)
- Investigation time reduction: 50%
- Alert precision: 5% to 15% with rule-based systems versus 35% to 55% with calibrated AI models
- Alert volume reduction: 40% to 60% through better initial scoring that filters low-risk activity before human review
One standout case: a tier-one bank reduced alert volume from 200 to 80 per day whilst improving detection precision from 8% to 45%. Same regulations, same data sources, completely different outcomes.
Between 2023 and 2024, suspicious activity reports climbed 27%, which makes detection quality more important, not less. More alerts do not mean better detection. They mean more noise.
Production deployment: the parts that matter
Once the parallel run proves effectiveness, three things determine whether you capture ROI or revert to legacy approaches:
Model governance. Document threshold logic, maintain audit trails, prepare regulatory explanations. AI systems must remain explainable for examination purposes. Regulators are asking tougher questions about SAR quality, not just SAR quantity.
Monitoring dashboards. Track false positive rates daily. If they climb above baseline, investigate model drift or new attack patterns. AI-enabled and synthetic identities now account for 42% of third-party identity fraud cases, so the threat landscape shifts faster than annual rule reviews can handle.
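A minimal sketch of that daily drift check, assuming alert dispositions are available from case management. The `DailyAlertStats` shape, the averaging window, and the 5-point tolerance are illustrative assumptions:

```typescript
// Sketch: flag drift when the recent false-positive rate climbs
// above the baseline established during the parallel run.
interface DailyAlertStats {
  date: string;
  alerts: number;
  confirmedFalsePositives: number;
}

function fpRate(day: DailyAlertStats): number {
  return day.alerts === 0 ? 0 : day.confirmedFalsePositives / day.alerts;
}

function driftDetected(
  recent: DailyAlertStats[],
  baselineFpRate: number,
  tolerance: number = 0.05
): boolean {
  // Average the recent window and compare against the baseline.
  const avg =
    recent.reduce((sum, d) => sum + fpRate(d), 0) /
    Math.max(recent.length, 1);
  return avg > baselineFpRate + tolerance;
}
```

When this fires, the paragraph above names the two usual suspects: model drift, or a new attack pattern the baselines have not yet absorbed.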
Human oversight. AI reduces volume but does not replace analyst judgement. The most successful implementations maintain experienced human oversight whilst automating the repetitive data gathering and initial risk scoring. In recent industry discussions, 65% of compliance professionals prioritised improving cross-team collaboration over investing in new tools. That signals something: governance matters more than features.
Against platform costs of £2M to £4M annually, payback typically arrives before the pilot completes for mid-sized institutions. Teams that reduce false positives by 70% do not just save money. They refocus analyst attention on genuine risks. Instead of spending Tuesday morning investigating a consultant's legitimate invoice payment, your senior analyst can deep-dive into a suspicious structuring pattern.
What I would do differently if starting today
If I were building an AML pipeline from scratch in 2026, I would start with data quality. Weak data governance undermines every model you deploy. Clean, consistent transaction data is non-negotiable.
Then I would define clear metrics before writing a single line of model code: false positive reduction targets (aim for 60%+), investigation time improvements (target 50%+), and alert precision improvements. Without baselines, you cannot prove anything to your regulator or your CFO.
Finally, I would keep the human loop tight. AI handles pattern detection and data gathering. Humans handle complex judgement calls and regulatory interpretation. That split is where the £10M in annual savings actually materialises, and where detection quality improves instead of degrades.
At Zenoo, we see institutions capturing £12M to £20M in annual savings with 40% to 60% false positive reduction. The technology is mature. Implementation discipline determines which teams capture ROI versus those that pilot expensive systems and revert to legacy approaches.
If you are building compliance flows and want to see how orchestrated screening, scoring, and case management work in a single pipeline, check out zenoo.com. 30 minutes. Your data. No slides.
Stuart Watkins is CEO of Zenoo, where we help compliance teams reduce false positives whilst improving detection quality.