DEV Community: Verifex

From Fuzzy Matching to Evidence Capsules: Building an Explainable Sanctions Screening Engine

Verifex — Thu, 14 May 2026 14:20:59 +0000

Sanctions screening looks simple from the outside.

Take a name, compare it against a list, compute a score, and send anything above a threshold to review.

That was how I thought about it before I started building Verifex.

The reality is different.

The problem nobody talks about

A compliance reviewer does not just need to know that two names are similar. They need to understand why a match was created, what evidence supports it, what weakens it, and whether the decision holds up during an audit six months later.

A score alone does not answer any of those questions.

When the engine returns 0.92, the reviewer is still left asking: was that the surname? The alias? The date of birth? The country? The source list?

Without that breakdown, every review is manual reconstruction from scratch.

What fuzzy matching misses

Fuzzy string matching works fine for clean data.

John Smith vs John Smith -- no problem.
ACME Ltd vs ACME Limited -- no problem.

But real sanctions data is messier than that.

Names get reordered. Transliteration varies across source lists. Some entries have aliases, some do not. Dates of birth are missing or partial. Nationalities are stored inconsistently. Common names create noise. Some lists store names as "SURNAME, Given Patronymic", and a naive parser flips the parts.

That last one caused a real bug in early versions of the engine. The parser was treating PUTIN as a given name because it appeared before the comma. The match score dropped even though the match was obvious to any human reviewer.

A single final score would have only told me something was wrong. The evidence breakdown told me exactly where.
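
As a minimal sketch of the fix, assuming the comma convention described above, a parser can treat everything before the comma as the surname. The `ParsedName` type and `parse_list_name` function are hypothetical names for illustration, not the engine's actual code.

```python
from dataclasses import dataclass


@dataclass
class ParsedName:
    surname: str
    given: str
    patronymic: str = ""


def parse_list_name(raw: str) -> ParsedName:
    """Parse 'SURNAME, Given Patronymic', falling back to 'Given [Patronymic] Surname'."""
    if "," in raw:
        # Text before the comma is the surname, not a given name.
        surname, _, rest = raw.partition(",")
        parts = rest.split()
    else:
        # Assumption: without a comma, the last token is the surname.
        tokens = raw.split()
        surname, parts = (tokens[-1], tokens[:-1]) if tokens else ("", [])
    given = parts[0] if parts else ""
    patronymic = " ".join(parts[1:])
    return ParsedName(surname.strip().title(), given.title(), patronymic.title())


print(parse_list_name("PUTIN, Vladimir Vladimirovich"))
# ParsedName(surname='Putin', given='Vladimir', patronymic='Vladimirovich')
```

Both input orders end up in the same fields, so the token-level comparison downstream compares surname with surname.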

Evidence Capsules

The idea I have been building around is simple.

Instead of returning only a score, the engine produces a structured evidence object for every candidate match. I call this an Evidence Capsule.

Each capsule contains:

the query name and the candidate name
source list information
token-level name comparison
date of birth signal
country and nationality signal
identifier signals
a list of supporting evidence
a list of weakening evidence
reason codes
audit warnings

The goal is not to replace the reviewer. The goal is to give the reviewer a structured explanation so they are not starting from zero every time.
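
One way to picture a capsule is as a plain data structure that serializes cleanly for audit storage. This is a sketch only: the field names, signal values, and reason codes below are assumptions made for illustration, not the actual Verifex schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class EvidenceCapsule:
    query_name: str
    candidate_name: str
    source_list: str
    token_comparison: list[dict] = field(default_factory=list)
    dob_signal: str = "missing"       # hypothetical values: exact, partial, mismatch, missing
    country_signal: str = "missing"
    identifier_signals: list[str] = field(default_factory=list)
    supporting: list[str] = field(default_factory=list)
    weakening: list[str] = field(default_factory=list)
    reason_codes: list[str] = field(default_factory=list)
    audit_warnings: list[str] = field(default_factory=list)

    def to_json(self) -> str:
        # Persisting the full capsule lets a later audit replay the reasoning.
        return json.dumps(asdict(self), ensure_ascii=False, indent=2)


capsule = EvidenceCapsule(
    query_name="John Smith",
    candidate_name="SMITH, John",
    source_list="example-list",       # placeholder, not a real list identifier
    token_comparison=[{"query": "smith", "candidate": "smith", "match": "exact"}],
    supporting=["surname exact match"],
    weakening=["date of birth missing on both sides"],
    reason_codes=["SURNAME_EXACT", "DOB_MISSING"],
    audit_warnings=["common name"],
)
print(capsule.to_json())
```

Keeping supporting and weakening evidence as separate lists means the reviewer sees both sides of the match without reverse-engineering the score.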

Scoring as evidence weighting

Fuzzy matching produces a similarity score.

What I wanted was something closer to evidence-weighted reasoning.

The internal model follows a log-odds structure:

log_odds = prior_log_odds + sum(evidence_weights)
posterior = sigmoid(log_odds)

Each signal contributes independently. An exact surname match increases the score. An exact date of birth increases it strongly. A country mismatch pulls it down. A match based only on a common given name gets penalized. Missing context is recorded explicitly rather than ignored.
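
A minimal sketch of that structure follows. The weights and signal names are made up for illustration; they are not the engine's real values.

```python
import math

# Illustrative weights only; positive values support a match, negative weaken it.
EVIDENCE_WEIGHTS = {
    "surname_exact": 2.0,
    "dob_exact": 3.0,
    "country_mismatch": -1.5,
    "common_given_name_only": -2.0,
}


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def score(prior: float, signals: list[str]) -> tuple[float, list[tuple[str, float]]]:
    """Return the posterior plus the per-signal contributions that produced it."""
    prior_log_odds = math.log(prior / (1.0 - prior))
    contributions = [(name, EVIDENCE_WEIGHTS[name]) for name in signals]
    log_odds = prior_log_odds + sum(weight for _, weight in contributions)
    return sigmoid(log_odds), contributions


posterior, contributions = score(0.01, ["surname_exact", "dob_exact"])
print(round(posterior, 2))   # 0.6
print(contributions)         # [('surname_exact', 2.0), ('dob_exact', 3.0)]
```

Because the contributions are returned alongside the score, the same call can feed both the ranking and the evidence breakdown.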

This is not the same as saying the output is a calibrated probability. That distinction matters.

Why calibration matters

If the engine outputs 0.90, that does not automatically mean the result is 90% likely to be a true match. To know that, you need calibration data.

The measurement layer I added tracks:

Brier Score
Expected Calibration Error
Reliability curves
Threshold sweeps across source families

These answer the practical questions. When the engine says 0.9, how often is it right? Which source family is overconfident? What threshold increases review burden without catching more true matches?
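
For reference, the first two metrics are short to compute. The sketch below uses equal-width bins for Expected Calibration Error and a toy set of five scored pairs; the data is illustrative, not a benchmark result.

```python
def brier_score(probs: list[float], labels: list[int]) -> float:
    """Mean squared difference between predicted score and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs: list[float], labels: list[int], n_bins: int = 10) -> float:
    """Weighted average gap between mean confidence and observed match rate per bin."""
    bins: list[list[tuple[float, int]]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)   # a score of 1.0 goes in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(p for p, _ in bucket) / len(bucket)
        match_rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / len(probs)) * abs(avg_conf - match_rate)
    return ece


# Toy data: four pairs scored 0.9 (three true matches), one pair scored 0.1 (no match).
probs = [0.9, 0.9, 0.9, 0.9, 0.1]
labels = [1, 1, 1, 0, 0]
print(round(brier_score(probs, labels), 3))                 # 0.17
print(round(expected_calibration_error(probs, labels), 3))  # 0.14
```

The per-bin pairs of mean confidence and match rate are also the points of a reliability curve, and running the same functions per source family shows which family is overconfident.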

Compliance systems should not hide behind vague scores. They need measurable behavior.

What this does not claim

This is not a claim that the engine has zero false negatives.

It is not a claim that human review is unnecessary.

The current goal is more limited and more honest: build a screening engine that can explain its own reasoning, persist that reasoning for audit, and measure whether its scores reflect reality.

A proper benchmark against labeled outcomes is still in progress.

Why this direction matters

The hard part of sanctions screening is rarely finding a possible match. The hard part is explaining why it was escalated, cleared, or reviewed, in a way that holds up later.

That is the shift I think compliance infrastructure needs:

from fuzzy scores to structured evidence to defensible review workflows.

That is what I am building with Verifex.

Bank of Scotland was fined £160K for a Cyrillic transliteration failure. Here's the technical breakdown.

Verifex — Sun, 12 Apr 2026 14:15:31 +0000

In January 2026, OFSI fined Bank of Scotland £160,000.
24 payments went through to a designated Russian individual.
Root cause: the screening tool couldn't match Cyrillic
transliteration variants.

This wasn't negligence. It was a technical failure that
most sanctions screening tools still have today.

Why Cyrillic matching fails

There are multiple competing standards for Cyrillic → Latin
transliteration: BGN/PCGN (used by US/UK governments), ISO 9,
GOST, ICAO, and dozens of informal spellings.

A single name like "Шварц" legitimately appears as:

Shvarts
Shvartz
Schwarz
Shvarc
Svarc

Every one of them is "correct" — depending on which standard
was used. Most screening tools pick one. If the watchlist
entry uses BGN/PCGN and the customer's passport uses ICAO,
you get a miss. That miss cost Bank of Scotland £160K.
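
One way to avoid picking a single standard is to index each watchlist entry under several transliterations at once. The sketch below is an assumption-heavy illustration: the three tables cover only the letters in "Шварц", they only approximate the named schemes (the ISO 9 one has its diacritics stripped), and `variants` and `build_index` are hypothetical helpers.

```python
# Partial, simplified tables for illustration; not complete implementations
# of any transliteration standard.
SCHEMES = {
    "bgn_pcgn_like": {"ш": "sh", "в": "v", "а": "a", "р": "r", "ц": "ts"},
    "iso9_ascii_like": {"ш": "s", "в": "v", "а": "a", "р": "r", "ц": "c"},
    "german_style": {"ш": "sch", "в": "w", "а": "a", "р": "r", "ц": "z"},
}


def variants(cyrillic: str) -> set[str]:
    """Transliterate one Cyrillic name under every table."""
    out = set()
    for table in SCHEMES.values():
        out.add("".join(table.get(ch, ch) for ch in cyrillic.lower()))
    return out


def build_index(entries: list[str]) -> dict[str, str]:
    """Map every Latin variant back to the original watchlist entry."""
    index = {}
    for entry in entries:
        for variant in variants(entry):
            index[variant] = entry
    return index


index = build_index(["Шварц"])
print(sorted(index))          # ['schwarz', 'shvarts', 'svarc']
print(index.get("schwarz"))   # Шварц
```

Exact lookup on variants only covers spellings the tables generate, so in practice it would sit in front of fuzzy and phonetic matching rather than replace them.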

The patronymic problem

Russian names have three parts: given name, patronymic,
and surname.

"Ivan," "Ivanov," and "Ivanovich" are three different name
components, and usually point to different people:

Ivan → given name
Ivanov → surname ("of Ivan")
Ivanovich → patronymic ("son of Ivan")

A naive fuzzy matcher sees 70%+ character overlap and scores
them as near-matches. This floods compliance queues with
false positives while simultaneously missing real hits.
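
The overlap is easy to reproduce. The sketch below uses Python's `difflib.SequenceMatcher` as a stand-in for a fuzzy matcher, then applies a suffix-based role check. The suffix lists and the 0.5 penalty are illustrative assumptions, and the heuristic will misclassify some names (a given name such as "Lev" ends in "ev").

```python
from difflib import SequenceMatcher

# Rough heuristic suffixes for transliterated Russian names; not exhaustive.
PATRONYMIC_SUFFIXES = ("ovich", "evich", "ovna", "evna")
SURNAME_SUFFIXES = ("ov", "ev", "ova", "eva")


def name_role(token: str) -> str:
    t = token.lower()
    if t.endswith(PATRONYMIC_SUFFIXES):
        return "patronymic"
    if t.endswith(SURNAME_SUFFIXES):
        return "surname"
    return "other"


def raw_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def role_aware_similarity(a: str, b: str) -> float:
    score = raw_similarity(a, b)
    if name_role(a) != name_role(b):
        score *= 0.5   # illustrative penalty for comparing different name parts
    return score


print(raw_similarity("Ivan", "Ivanov"))          # 0.8
print(role_aware_similarity("Ivan", "Ivanov"))   # 0.4
print(name_role("Ivanovich"))                    # patronymic
```

The penalty only fires when the two tokens fall into different roles, so "Ivanov" against "Ivanov" keeps its full score.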

The "Mohammed problem"

Arabic has 12+ formal romanization systems: ALA-LC, ISO 233,
UNGEGN, BGN/PCGN, DIN 31635...

A single Arabic name can produce 300+ valid Latin spellings.
"Mohammed," "Muhammad," "Mohamed," "Mehmet," "Muhamad":
same name, different systems.

The Beider-Morse algorithm — arguably the most sophisticated
phonetic matching system ever built — explicitly removed
Arabic support. The maintainers cited "severe performance
issues related to excessively complicated phonetics."

If the best phonetic algorithm gives up on Arabic, what are
most commercial tools doing?

Answer: Jaro-Winkler with a threshold. Which is why false
positive rates on Arabic names run above 90% in most systems.

The substring trap

"Computing" contains the substring "p-u-t-i-n."

Without whole-word boundary enforcement, your screening
system flags tech companies. This sounds absurd — but it
happens in production systems every day.

We caught this when testing our own engine. A query for
a software company returned a high-confidence sanctions
match because a substring of the company name overlapped
with a sanctioned individual's name.

The fix: whole-word tokenization. Only match on complete
tokens, never on substrings.
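
The difference between the two checks fits in a few lines. This is a simplified sketch; the company name is made up, and `substring_hit` and `token_hit` are hypothetical helper names.

```python
import re


def tokens(text: str) -> set[str]:
    """Split into whole words, case-insensitively."""
    return set(re.findall(r"\w+", text.casefold()))


def substring_hit(listed: str, query: str) -> bool:
    # The trap: matches anywhere inside the query string.
    return listed.casefold() in query.casefold()


def token_hit(listed: str, query: str) -> bool:
    # Every token of the listed name must appear as a complete token in the query.
    listed_tokens = tokens(listed)
    return bool(listed_tokens) and listed_tokens <= tokens(query)


print(substring_hit("Putin", "Acme Computing Ltd"))   # True
print(token_hit("Putin", "Acme Computing Ltd"))       # False
print(token_hit("Putin", "Vladimir Putin"))           # True
```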

What the benchmark gap looks like

No commercial sanctions screening vendor publishes accuracy
benchmarks. Not Refinitiv, not ComplyAdvantage, not
sanctions.io.

OpenSanctions, the best open-source system, publishes
its numbers: 91.3% F1, 99% recall, 84.5% precision.
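
For readers comparing figures like these, F1 is the harmonic mean of precision and recall. The helpers below are a generic sketch, not tied to any vendor's tooling. Recomputing from the rounded precision and recall quoted above gives about 0.912, which agrees with the reported 91.3% up to rounding of the inputs.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision and recall from true positive, false positive, false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f1_score(precision: float, recall: float) -> float:
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f1_score(0.845, 0.99), 3))   # 0.912
```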

The Federal Reserve published a sanctions screening paper
in September 2025. Best result using GPT-4o: 98.95% F1 —
tested on Latin-script organization names only.

Nobody is publishing results on Arabic transliteration,
Cyrillic variants, or patronymic edge cases. Exactly the
cases that generate real fines.

What we built

We built Verifex (verifex.dev) to address this directly.
The matching engine combines:

Soft TF-IDF + Monge-Elkan — the academic gold standard for string matching (Cohen, Ravikumar, Fienberg 2003)
IDF corpus weighting — "Mohammed" and "Kim" are statistically common. They should score lower than rare tokens like "Qadhafi"
Double Metaphone phonetic blocking — across multiple transliteration standards simultaneously
9 penalty layers — patronymic derivatives, substring boundaries, entity-type mismatches, mixed-script detection
LLM cascade — for ambiguous matches in the 40-95% confidence range
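
To illustrate the IDF idea from the list above, here is a sketch of an IDF-weighted token overlap. It is a plain weighted Jaccard score, a deliberately simpler stand-in for Soft TF-IDF, and the four-name corpus is a toy example chosen only to make "mohammed" common and "qadhafi" rare.

```python
import math
from collections import Counter


def build_idf(names: list[str]) -> dict[str, float]:
    """Smoothed inverse document frequency per token."""
    n = len(names)
    df = Counter(tok for name in names for tok in set(name.lower().split()))
    return {tok: math.log((1 + n) / (1 + count)) + 1.0 for tok, count in df.items()}


def weighted_overlap(query: str, candidate: str, idf: dict[str, float]) -> float:
    """Shared token weight divided by total token weight."""
    default = max(idf.values())   # unseen tokens are treated as rare
    q, c = set(query.lower().split()), set(candidate.lower().split())
    shared = sum(idf.get(t, default) for t in q & c)
    total = sum(idf.get(t, default) for t in q | c)
    return shared / total if total else 0.0


corpus = ["Mohammed Ali", "Mohammed Hassan", "Mohammed Saleh", "Muammar Qadhafi"]
idf = build_idf(corpus)
print(idf["mohammed"] < idf["qadhafi"])   # True

common = weighted_overlap("Mohammed Karim", "Mohammed Ali", idf)
rare = weighted_overlap("Omar Qadhafi", "Muammar Qadhafi", idf)
print(common < rare)                      # True
```

Both pairs share exactly one token, but sharing a rare surname counts for more than sharing a common given name.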

Result: 100% F1 on an independent 145-case benchmark —
including Arabic transliteration, Cyrillic variants, phonetic
matching, and adversarial substring inputs.

The full benchmark is public: verifex.dev/benchmark

Anyone can run it against any provider.

Bank of Scotland's fine was preventable. The technology
to handle Cyrillic transliteration exists — it's just not
in most commercial tools. If you're building or evaluating
a sanctions screening solution, the benchmark cases at
verifex.dev/benchmark show exactly where most tools fail.

How we built a sanctions screening API that outperformed the Federal Reserve's benchmark

Verifex — Sat, 11 Apr 2026 20:43:22 +0000

The Federal Reserve published a sanctions screening
benchmark in September 2025. Their best result using
GPT-4o: 98.95% F1.

We hit 100%. Here's how.

The problem with existing tools

90-95% of sanctions screening alerts are false positives.
Analysts spend $130B/year investigating alerts that are wrong.

The root cause: basic fuzzy matching. Most tools use
Jaro-Winkler with a threshold. That's it.

What we built

9 penalty layers targeting specific false positive patterns, including:

Patronymic derivatives (Ivan ≠ Ivanov)
Business-to-person mismatch
Substring traps ("Computing" contains "Putin")
Common name IDF weighting
Mixed-script rejection
Zero-width character evasion detection
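
The last two items can be sketched with the standard library alone. This is an illustration under assumptions: the set of invisible characters is a common subset rather than a complete list, and script detection via the first word of the Unicode character name is a rough heuristic.

```python
import unicodedata

# Common invisible characters used to break up a name; not an exhaustive set.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def strip_zero_width(text: str) -> str:
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)


def scripts_in_token(token: str) -> set[str]:
    """Collect the script of each letter, e.g. 'LATIN' or 'CYRILLIC'."""
    scripts = set()
    for ch in token:
        if ch.isalpha():
            # Unicode names start with the script: "CYRILLIC CAPITAL LETTER ER".
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def is_mixed_script(token: str) -> bool:
    return len(scripts_in_token(token)) > 1


print(strip_zero_width("Pu\u200btin"))    # Putin
print(is_mixed_script("\u0420utin"))      # True  (Cyrillic ER followed by Latin letters)
print(is_mixed_script("Putin"))           # False
```

Checking per token rather than per string matters here, because a full name can legitimately contain one token in each script.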

The matching pipeline

Normalization → smartNormalize()
FAISS MiniLM semantic ANN search
Jaro-Winkler + Monge-Elkan + Soft TF-IDF
Double Metaphone phonetic blocking
9 penalty layers
LLM cascade (40-85 confidence range)
Adjudication engine
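
As an illustration of one stage in this pipeline, Monge-Elkan scores each token in one name against its best-matching token in the other and averages the results, which makes it tolerant of reordered names. The sketch below swaps in `difflib.SequenceMatcher` as the inner similarity instead of Jaro-Winkler, purely to stay within the standard library.

```python
from difflib import SequenceMatcher


def token_sim(a: str, b: str) -> float:
    # Stand-in inner similarity; a production system might use Jaro-Winkler here.
    return SequenceMatcher(None, a, b).ratio()


def monge_elkan(a: str, b: str) -> float:
    """Average, over tokens of a, of the best similarity to any token of b."""
    ta, tb = a.lower().split(), b.lower().split()
    if not ta or not tb:
        return 0.0
    return sum(max(token_sim(x, y) for y in tb) for x in ta) / len(ta)


def symmetric_monge_elkan(a: str, b: str) -> float:
    # Monge-Elkan is asymmetric, so average both directions.
    return (monge_elkan(a, b) + monge_elkan(b, a)) / 2


print(monge_elkan("Vladimir Putin", "Putin Vladimir"))   # 1.0
print(token_sim("vladimir putin", "putin vladimir") < 1.0)   # True
```

Comparing the whole strings penalizes the reordering, while the token-level score does not.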

The benchmark

145 real test cases across 13 categories, including:

OFAC, UN, EU, UK sanctions lists
Arabic/Cyrillic transliteration
Phonetic matching
Substring traps
Adversarial inputs

Result: 145/145. 100% F1, 100% Recall, 100% Precision.

The Federal Reserve tested organization names only,
Latin script only, 10 countries. They explicitly noted
individual names and non-Latin scripts were
"beyond the scope."

That's exactly what we tested.

The dataset is public

verifex.dev/benchmark

Anyone can run it against any provider.

We're Verifex — sanctions screening API for developers.
$49/month. verifex.dev