The Federal Reserve published a sanctions screening
benchmark in September 2025. Their best result using
GPT-4o: 98.95% F1.
We hit 100%. Here's how.
The problem with existing tools
90-95% of sanctions screening alerts are false positives.
The industry spends roughly $130B/year on analysts
investigating alerts that turn out to be wrong.
The root cause: basic fuzzy matching. Most tools use
Jaro-Winkler with a threshold. That's it.
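To see why that falls short, here is a minimal sketch of the baseline: a pure-Python Jaro-Winkler with a flat threshold (not any vendor's actual code). Note the false positive it produces on the patronymic pair "ivan" / "ivanov":

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: matching chars within a sliding window, minus transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(len1, len2) // 2 - 1
    m1, m2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    t, k = 0, 0
    for i in range(len1):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Winkler variant: boost scores for a shared prefix of up to 4 chars."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

# "ivan" vs "ivanov" scores ~0.93 — above any typical 0.85-0.90 threshold,
# so a threshold-only tool flags a distinct surname as a sanctions hit.
```

A distance metric alone cannot distinguish a name that is similar from a name that is the same person.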
What we built
9 penalty layers, each targeting a specific false positive pattern, including:
- Patronymic derivatives (Ivan ≠ Ivanov)
- Business-to-person mismatch
- Substring traps ("Computing" contains "Putin")
- Common name IDF weighting
- Mixed-script rejection
- Zero-width character evasion detection
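To make the idea concrete, here is a sketch of three of these layers as score deductions. The function names, penalty weights, and thresholds are illustrative assumptions, not Verifex's actual implementation:

```python
import re
import unicodedata

# Common zero-width code points used to splice a sanctioned name past a matcher.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

def zero_width_penalty(name: str) -> float:
    """Evasion detection: penalize names containing zero-width characters."""
    return 0.5 if any(ch in ZERO_WIDTH for ch in name) else 0.0

def script_of(ch: str) -> str:
    # Coarse script label derived from the Unicode character name, e.g. "CYRILLIC".
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:
        return "UNKNOWN"

def mixed_script_penalty(name: str) -> float:
    """Reject Latin/Cyrillic homoglyph mixing within a single name."""
    scripts = {script_of(ch) for ch in name if ch.isalpha()}
    return 0.4 if {"LATIN", "CYRILLIC"} <= scripts else 0.0

def substring_trap_penalty(query: str, candidate: str) -> float:
    """'Putin' appears inside 'Computing' — penalize hits not on a word boundary."""
    if query.lower() in candidate.lower() and not re.search(
        rf"\b{re.escape(query)}\b", candidate, re.IGNORECASE
    ):
        return 0.6
    return 0.0

def apply_penalties(base_score: float, query: str, candidate: str) -> float:
    """Subtract all triggered penalties from the raw similarity score."""
    total = (
        zero_width_penalty(candidate)
        + mixed_script_penalty(candidate)
        + substring_trap_penalty(query, candidate)
    )
    return max(0.0, base_score - total)
```

Each layer fires only on its specific pattern, so legitimate matches ("Putin" against "Vladimir Putin") pass through with no deduction.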
The matching pipeline
- Normalization → smartNormalize()
- FAISS MiniLM semantic ANN search
- Jaro-Winkler + Monge-Elkan + Soft TF-IDF
- Double Metaphone phonetic blocking
- 9 penalty layers
- LLM cascade (invoked only for scores in the 40-85 confidence band)
- Adjudication engine
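The cascade step can be sketched as a simple confidence router: auto-decide the clear cases and pay for an LLM call only in the ambiguous band. The band boundaries come from the pipeline above; the function names and the stub reviewer are hypothetical:

```python
from typing import Callable

def cascade(
    query: str,
    candidate: str,
    score: float,                     # 0-100 after penalty layers
    llm_review: Callable[[str, str], str],
) -> str:
    """Route by confidence: only the ambiguous 40-85 band reaches the LLM."""
    if score >= 85:
        return "MATCH"                # high confidence: auto-match
    if score < 40:
        return "CLEAR"                # low confidence: auto-clear, no LLM cost
    return llm_review(query, candidate)  # ambiguous: escalate to the LLM

def mock_llm(query: str, candidate: str) -> str:
    # Stand-in for a real LLM adjudication call: exact-token check only.
    return "MATCH" if query.lower() in candidate.lower().split() else "CLEAR"
```

Since 90-95% of alerts are false positives, most candidates score outside the band and never incur LLM latency or cost.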
The benchmark
145 real test cases across 13 categories, including:
- OFAC, UN, EU, UK sanctions lists
- Arabic/Cyrillic transliteration
- Phonetic matching
- Substring traps
- Adversarial inputs
Result: 145/145. 100% F1, 100% Recall, 100% Precision.
The Federal Reserve tested organization names only,
Latin script only, 10 countries. They explicitly noted
individual names and non-Latin scripts were
"beyond the scope."
That's exactly what we tested.
The dataset is public
verifex.dev/benchmark
Anyone can run it against any provider.
We're Verifex — sanctions screening API for developers.
$49/month. verifex.dev