Verifex
How we built a sanctions screening API that outperformed the Federal Reserve's benchmark

The Federal Reserve published a sanctions screening
benchmark in September 2025. Their best result using
GPT-4o: 98.95% F1.

We hit 100%. Here's how.

The problem with existing tools

90-95% of sanctions screening alerts are false positives.
The industry spends an estimated $130B/year investigating
alerts that turn out to be wrong.

The root cause: basic fuzzy matching. Most tools use
Jaro-Winkler with a threshold. That's it.
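Jaro-Winkler rewards shared prefixes, which is exactly why patronymic derivatives slip through. A minimal, self-contained implementation (illustrative, not Verifex's code) makes the failure concrete:

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: matches within a sliding window, minus transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(len1, len2) // 2 - 1
    match1, match2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == ch:
                match1[i] = match2[j] = True
                matches += 1
                break
    if not matches:
        return 0.0
    # transpositions: matched characters that appear in a different order
    t, k = 0, 0
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Winkler boost: reward a common prefix of up to 4 characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

print(jaro_winkler("Ivan", "Ivanov"))  # ≈ 0.933
```

"Ivan" vs. "Ivanov" scores about 0.93, comfortably above a typical 0.85 threshold, so a bare Jaro-Winkler screen flags the patronymic as a hit.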

What we built

Nine penalty layers, each targeting a specific false-positive pattern. Among them:

  • Patronymic derivatives (Ivan ≠ Ivanov)
  • Business-to-person mismatch
  • Substring traps ("Computing" contains "Putin")
  • Common name IDF weighting
  • Mixed-script rejection
  • Zero-width character evasion detection
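Two of those layers can be sketched in a few lines. The penalty factor and the exact character set are illustrative assumptions, not Verifex's actual weights:

```python
# Common zero-width code points used to break up a sanctioned name
# (e.g. "Pu\u200btin" renders as "Putin" but defeats exact matching).
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\ufeff"}

def zero_width_evasion(s: str) -> bool:
    """Flag inputs containing invisible characters: likely deliberate evasion."""
    return any(ch in ZERO_WIDTH for ch in s)

def substring_trap_penalty(query: str, candidate: str) -> float:
    """Penalize matches where the candidate name only appears embedded
    inside a longer token of the query ("Computing" contains "putin").
    The 0.5 factor is a hypothetical placeholder."""
    c = candidate.lower()
    for token in query.lower().split():
        if c in token and c != token:
            return 0.5
    return 1.0

print(substring_trap_penalty("Quantum Computing Ltd", "Putin"))  # 0.5
print(zero_width_evasion("Pu\u200btin"))                         # True
```

The substring check leaves exact token matches ("Vladimir Putin" vs. "Putin") untouched while suppressing the embedded hit.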

The matching pipeline

  1. Normalization → smartNormalize()
  2. FAISS MiniLM semantic ANN search
  3. Jaro-Winkler + Monge-Elkan + Soft TF-IDF
  4. Double Metaphone phonetic blocking
  5. 9 penalty layers
  6. LLM cascade for borderline scores (confidence 40-85)
  7. Adjudication engine
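Stages 3 and 6-7 can be sketched as follows. Here difflib's ratio stands in for the real Jaro-Winkler/Monge-Elkan/Soft TF-IDF ensemble, and the band boundaries other than the 40-85 LLM range from the list above are hypothetical:

```python
from difflib import SequenceMatcher
from statistics import mean

def monge_elkan(query: str, candidate: str) -> float:
    """Monge-Elkan: average, over query tokens, of the best-matching
    candidate token (difflib ratio as a stand-in inner similarity)."""
    q, c = query.lower().split(), candidate.lower().split()
    return mean(max(SequenceMatcher(None, t, u).ratio() for u in c) for t in q)

def adjudicate(score: float) -> str:
    """Route a 0-100 match confidence. Scores in the 40-85 band go to
    the LLM cascade; the outer cutoffs are illustrative assumptions."""
    if score > 85:
        return "match"
    if score >= 40:
        return "escalate_to_llm"
    return "clear"

score = 100 * monge_elkan("osama bin laden", "usama bin ladin")  # ≈ 86.7
print(adjudicate(score))  # "escalate_to_llm"
```

Token-level averaging is what lets transliteration variants ("usama"/"osama", "ladin"/"laden") score high without any single exact token match.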

The benchmark

145 real test cases across 13 categories:

  • OFAC, UN, EU, UK sanctions lists
  • Arabic/Cyrillic transliteration
  • Phonetic matching
  • Substring traps
  • Adversarial inputs

Result: 145/145. 100% F1, 100% Recall, 100% Precision.

The Federal Reserve tested organization names only,
Latin script only, 10 countries. They explicitly noted
individual names and non-Latin scripts were
"beyond the scope."

That's exactly what we tested.

The dataset is public

verifex.dev/benchmark

Anyone can run it against any provider.
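Scoring a run reduces each test case to a binary match/no-match decision. A small scorer (assuming that case format; this is not Verifex's harness) computes the three reported metrics:

```python
def prf1(gold: list[bool], pred: list[bool]) -> tuple[float, float, float]:
    """Precision, recall, and F1 over paired gold labels and predictions."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum(p and not g for g, p in zip(gold, pred))
    fn = sum(g and not p for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# A perfect 145/145 run yields (1.0, 1.0, 1.0); one false positive
# and one false negative each drag all three metrics down.
print(prf1([True, True, False, False], [True, False, True, False]))
```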

We're Verifex — sanctions screening API for developers.
$49/month. verifex.dev
