The Federal Reserve published a sanctions screening
benchmark in September 2025. Their best result using
GPT-4o: 98.95% F1.
We hit 100%. Here's how.
The problem with existing tools
90-95% of sanctions screening alerts are false positives.
The industry spends roughly $130B/year on analysts
investigating alerts that turn out to be wrong.
The root cause: basic fuzzy matching. Most tools use
Jaro-Winkler with a threshold. That's it.
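To see why that falls short, here is a minimal sketch of the baseline: a pure-Python Jaro-Winkler with a flat threshold (not any vendor's actual code). Note the false positive it produces on the patronymic pair "ivan" / "ivanov":

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: matching chars within a sliding window, minus transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if not len1 or not len2:
        return 0.0
    window = max(len1, len2) // 2 - 1
    m1, m2 = [False] * len1, [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not m2[j] and s2[j] == c:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    t, k = 0, 0
    for i in range(len1):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Winkler variant: boost scores for a shared prefix of up to 4 chars."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

# "ivan" vs "ivanov" scores ~0.93 — above any typical 0.85-0.90 threshold,
# so a threshold-only tool flags a distinct surname as a sanctions hit.
```

A distance metric alone cannot distinguish a name that is similar from a name that is the same person.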
What we built
9 penalty layers, each targeting a specific false positive pattern, including:
- Patronymic derivatives (Ivan ≠ Ivanov)
- Business-to-person mismatch
- Substring traps ("Computing" contains "Putin")
- Common name IDF weighting
- Mixed-script rejection
- Zero-width character evasion detection
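To make the idea concrete, here is a sketch of three of these layers as score deductions. The function names, penalty weights, and thresholds are illustrative assumptions, not Verifex's actual implementation:

```python
import re
import unicodedata

# Common zero-width code points used to splice a sanctioned name past a matcher.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

def zero_width_penalty(name: str) -> float:
    """Evasion detection: penalize names containing zero-width characters."""
    return 0.5 if any(ch in ZERO_WIDTH for ch in name) else 0.0

def script_of(ch: str) -> str:
    # Coarse script label derived from the Unicode character name, e.g. "CYRILLIC".
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:
        return "UNKNOWN"

def mixed_script_penalty(name: str) -> float:
    """Reject Latin/Cyrillic homoglyph mixing within a single name."""
    scripts = {script_of(ch) for ch in name if ch.isalpha()}
    return 0.4 if {"LATIN", "CYRILLIC"} <= scripts else 0.0

def substring_trap_penalty(query: str, candidate: str) -> float:
    """'Putin' appears inside 'Computing' — penalize hits not on a word boundary."""
    if query.lower() in candidate.lower() and not re.search(
        rf"\b{re.escape(query)}\b", candidate, re.IGNORECASE
    ):
        return 0.6
    return 0.0

def apply_penalties(base_score: float, query: str, candidate: str) -> float:
    """Subtract all triggered penalties from the raw similarity score."""
    total = (
        zero_width_penalty(candidate)
        + mixed_script_penalty(candidate)
        + substring_trap_penalty(query, candidate)
    )
    return max(0.0, base_score - total)
```

Each layer fires only on its specific pattern, so legitimate matches ("Putin" against "Vladimir Putin") pass through with no deduction.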
The matching pipeline
- Normalization → smartNormalize()
- FAISS MiniLM semantic ANN search
- Jaro-Winkler + Monge-Elkan + Soft TF-IDF
- Double Metaphone phonetic blocking
- 9 penalty layers
- LLM cascade (invoked only for scores in the 40-85 confidence band)
- Adjudication engine
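The cascade step can be sketched as a simple confidence router: auto-decide the clear cases and pay for an LLM call only in the ambiguous band. The band boundaries come from the pipeline above; the function names and the stub reviewer are hypothetical:

```python
from typing import Callable

def cascade(
    query: str,
    candidate: str,
    score: float,                     # 0-100 after penalty layers
    llm_review: Callable[[str, str], str],
) -> str:
    """Route by confidence: only the ambiguous 40-85 band reaches the LLM."""
    if score >= 85:
        return "MATCH"                # high confidence: auto-match
    if score < 40:
        return "CLEAR"                # low confidence: auto-clear, no LLM cost
    return llm_review(query, candidate)  # ambiguous: escalate to the LLM

def mock_llm(query: str, candidate: str) -> str:
    # Stand-in for a real LLM adjudication call: exact-token check only.
    return "MATCH" if query.lower() in candidate.lower().split() else "CLEAR"
```

Since 90-95% of alerts are false positives, most candidates score outside the band and never incur LLM latency or cost.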
The benchmark
145 real test cases across 13 categories, including:
- OFAC, UN, EU, UK sanctions lists
- Arabic/Cyrillic transliteration
- Phonetic matching
- Substring traps
- Adversarial inputs
Result: 145/145. 100% F1, 100% Recall, 100% Precision.
The Federal Reserve tested organization names only,
Latin script only, 10 countries. They explicitly noted
individual names and non-Latin scripts were
"beyond the scope."
That's exactly what we tested.
The dataset is public
verifex.dev/benchmark
Anyone can run it against any provider.
We're Verifex — sanctions screening API for developers.
$49/month. verifex.dev