<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Verifex</title>
    <description>The latest articles on DEV Community by Verifex (@verifex).</description>
    <link>https://dev.to/verifex</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3845333%2F3ae43f1d-edba-4efa-b097-f9c2a7290da3.png</url>
      <title>DEV Community: Verifex</title>
      <link>https://dev.to/verifex</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/verifex"/>
    <language>en</language>
    <item>
      <title>From Fuzzy Matching to Evidence Capsules: Building an Explainable Sanctions Screening Engine</title>
      <dc:creator>Verifex</dc:creator>
      <pubDate>Thu, 14 May 2026 14:20:59 +0000</pubDate>
      <link>https://dev.to/verifex/from-fuzzy-matching-to-evidence-capsules-building-an-explainable-sanctions-screening-engine-62c</link>
      <guid>https://dev.to/verifex/from-fuzzy-matching-to-evidence-capsules-building-an-explainable-sanctions-screening-engine-62c</guid>
      <description>&lt;p&gt;Sanctions screening looks simple from the outside.&lt;/p&gt;

&lt;p&gt;Take a name, compare it against a list, and return a score. If the score is above a threshold, send it to review.&lt;/p&gt;

&lt;p&gt;That was how I thought about it before I started building Verifex.&lt;/p&gt;

&lt;p&gt;The reality is different.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem nobody talks about
&lt;/h2&gt;

&lt;p&gt;A compliance reviewer does not just need to know that two names are similar. They need to understand why a match was created, what evidence supports it, what weakens it, and whether the decision holds up during an audit six months later.&lt;/p&gt;

&lt;p&gt;A score alone does not answer any of those questions.&lt;/p&gt;

&lt;p&gt;When the engine returns 0.92, the reviewer is still left asking: was that the surname? The alias? The date of birth? The country? The source list?&lt;/p&gt;

&lt;p&gt;Without that breakdown, every review is manual reconstruction from scratch.&lt;/p&gt;

&lt;h2&gt;
  
  
  What fuzzy matching misses
&lt;/h2&gt;

&lt;p&gt;Fuzzy string matching works fine for clean data.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;John Smith&lt;/code&gt; vs &lt;code&gt;John Smith&lt;/code&gt; -- no problem.&lt;br&gt;
&lt;code&gt;ACME Ltd&lt;/code&gt; vs &lt;code&gt;ACME Limited&lt;/code&gt; -- no problem.&lt;/p&gt;

&lt;p&gt;But real sanctions data is messier than that.&lt;/p&gt;

&lt;p&gt;Names get reordered. Transliteration varies across source lists. Some entries have aliases, some do not. Dates of birth are missing or partial. Nationalities are stored inconsistently. Common names create noise. Some lists store names as &lt;code&gt;SURNAME, Given Patronymic&lt;/code&gt; and a naive parser flips them.&lt;/p&gt;

&lt;p&gt;That last one caused a real bug in early versions of the engine. The parser was treating &lt;code&gt;PUTIN&lt;/code&gt; as a given name because it appeared before the comma. The match score dropped even though the match was obvious to any human reviewer.&lt;/p&gt;

&lt;p&gt;A single final score would have only told me something was wrong. The evidence breakdown told me exactly where.&lt;/p&gt;
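
&lt;p&gt;A minimal sketch of comma-aware parsing, to make the bug concrete. The &lt;code&gt;ParsedName&lt;/code&gt; type and &lt;code&gt;parse_listed_name&lt;/code&gt; function are hypothetical names for illustration, not the engine's actual parser, and the sketch assumes only the two layouts shown in its comments.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParsedName:
    surname: str
    given: str
    patronymic: str


def parse_listed_name(raw: str) -> ParsedName:
    """Split a list entry into surname, given name, and patronymic."""
    if "," in raw:
        # Comma form: the text before the comma is the surname.
        surname, _, rest = raw.partition(",")
        parts = rest.split()
        given = parts[0] if parts else ""
        patronymic = " ".join(parts[1:])
    else:
        # No comma: assume natural order with the surname last.
        parts = raw.split()
        surname = parts[-1] if parts else ""
        given = parts[0] if len(parts) >= 2 else ""
        patronymic = " ".join(parts[1:-1])
    return ParsedName(surname.strip().casefold(), given.casefold(), patronymic.casefold())


print(parse_listed_name("PUTIN, Vladimir Vladimirovich"))
```

&lt;p&gt;Both layouts resolve to the same structured name, so the surname is compared against the surname regardless of how the source list ordered it.&lt;/p&gt;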
&lt;h2&gt;
  
  
  Evidence Capsules
&lt;/h2&gt;

&lt;p&gt;The idea I have been building around is simple.&lt;/p&gt;

&lt;p&gt;Instead of returning only a score, the engine produces a structured evidence object for every candidate match. I call this an Evidence Capsule.&lt;/p&gt;

&lt;p&gt;Each capsule contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the query name and the candidate name&lt;/li&gt;
&lt;li&gt;source list information&lt;/li&gt;
&lt;li&gt;token-level name comparison&lt;/li&gt;
&lt;li&gt;date of birth signal&lt;/li&gt;
&lt;li&gt;country and nationality signal&lt;/li&gt;
&lt;li&gt;identifier signals&lt;/li&gt;
&lt;li&gt;a list of supporting evidence&lt;/li&gt;
&lt;li&gt;a list of weakening evidence&lt;/li&gt;
&lt;li&gt;reason codes&lt;/li&gt;
&lt;li&gt;audit warnings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is not to replace the reviewer. The goal is to give the reviewer a structured explanation so they are not starting from zero every time.&lt;/p&gt;
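
&lt;p&gt;One way the capsule could be modeled is a plain data class that serializes to JSON for audit storage. Every field name, reason code, and value below is an assumption made for illustration. The real schema is not shown in this post.&lt;/p&gt;

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class EvidenceCapsule:
    """Hypothetical shape for one candidate match; field names are illustrative."""
    query_name: str
    candidate_name: str
    source_list: str
    token_comparison: dict = field(default_factory=dict)
    dob_signal: str = "missing"
    country_signal: str = "missing"
    identifier_signals: list = field(default_factory=list)
    supporting: list = field(default_factory=list)
    weakening: list = field(default_factory=list)
    reason_codes: list = field(default_factory=list)
    audit_warnings: list = field(default_factory=list)

    def to_json(self) -> str:
        # Sorted keys keep the stored record stable across runs.
        return json.dumps(asdict(self), sort_keys=True)


capsule = EvidenceCapsule(
    query_name="Vladimir Putin",
    candidate_name="PUTIN, Vladimir Vladimirovich",
    source_list="example-list",
    token_comparison={"surname": "exact", "given": "exact"},
    supporting=["surname matches exactly"],
    weakening=["no date of birth supplied in the query"],
    reason_codes=["SURNAME_EXACT", "GIVEN_EXACT"],
    audit_warnings=["DOB_MISSING"],
)
print(capsule.to_json())
```

&lt;p&gt;Keeping supporting and weakening evidence in separate lists lets a reviewer see both sides of a match without recomputing anything.&lt;/p&gt;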
&lt;h2&gt;
  
  
  Scoring as evidence weighting
&lt;/h2&gt;

&lt;p&gt;Fuzzy matching produces a similarity score.&lt;/p&gt;

&lt;p&gt;What I wanted was something closer to evidence-weighted reasoning.&lt;/p&gt;

&lt;p&gt;The internal model follows a log-odds structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;log_odds = prior_log_odds + sum(evidence_weights)
posterior = sigmoid(log_odds)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each signal contributes independently. An exact surname match increases the score. An exact date of birth increases it strongly. A country mismatch pulls it down. A match based only on a common given name gets penalized. Missing context is recorded explicitly rather than ignored.&lt;/p&gt;

&lt;p&gt;This is not the same as saying the output is a calibrated probability. That distinction matters.&lt;/p&gt;
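
&lt;p&gt;The log-odds structure above can be sketched in a few lines. The prior and the weight values here are made up for illustration and the signal names are hypothetical. In practice they would have to come from tuning or labeled data.&lt;/p&gt;

```python
import math


def sigmoid(x: float) -> float:
    """Map log-odds to the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-x))


def score_candidate(prior: float, evidence_weights: dict) -> tuple:
    """Return (log_odds, posterior) for one candidate match."""
    prior_log_odds = math.log(prior / (1.0 - prior))
    log_odds = prior_log_odds + sum(evidence_weights.values())
    return log_odds, sigmoid(log_odds)


# Illustrative weights only: positive values support the match,
# negative values weaken it.
evidence = {
    "surname_exact": 3.0,
    "dob_exact": 4.0,
    "country_mismatch": -2.0,
    "common_given_name_only": -1.5,
}

log_odds, posterior = score_candidate(prior=0.01, evidence_weights=evidence)
print(f"log_odds={log_odds:.3f} posterior={posterior:.3f}")
```

&lt;p&gt;Because the weights are added before the sigmoid, each signal's contribution can be stored and shown to a reviewer on its own.&lt;/p&gt;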

&lt;h2&gt;
  
  
  Why calibration matters
&lt;/h2&gt;

&lt;p&gt;If the engine outputs 0.90, that does not automatically mean the result is 90% likely to be a true match. To know that, you need calibration data.&lt;/p&gt;

&lt;p&gt;The measurement layer I added tracks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Brier Score&lt;/li&gt;
&lt;li&gt;Expected Calibration Error&lt;/li&gt;
&lt;li&gt;Reliability curves&lt;/li&gt;
&lt;li&gt;Threshold sweeps across source families&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These answer the practical questions. When the engine says 0.9, how often is it right? Which source family is overconfident? What threshold increases review burden without catching more true matches?&lt;/p&gt;
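
&lt;p&gt;The first two metrics in that list are short enough to sketch directly. This version assumes scores in the range 0 to 1, binary labels (1 for a true match), and equal-width bins. The function names and the sample data are illustrative, not the measurement layer itself.&lt;/p&gt;

```python
def brier_score(scores: list, labels: list) -> float:
    """Mean squared difference between score and outcome (lower is better)."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)


def expected_calibration_error(scores: list, labels: list, n_bins: int = 10) -> float:
    """Weighted average gap between mean score and hit rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        # A score of exactly 1.0 belongs in the last bin.
        index = min(int(s * n_bins), n_bins - 1)
        bins[index].append((s, y))
    total = len(scores)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_score = sum(s for s, _ in bucket) / len(bucket)
        hit_rate = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_score - hit_rate)
    return ece


# Made-up reviewed outcomes: 1 means the reviewer confirmed a true match.
scores = [0.9, 0.9, 0.1, 0.1]
labels = [1, 1, 0, 0]
print(brier_score(scores, labels), expected_calibration_error(scores, labels))
```

&lt;p&gt;Running the same two functions per source family is one way to see which family is overconfident.&lt;/p&gt;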

&lt;p&gt;Compliance systems should not hide behind vague scores. They need measurable behavior.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this does not claim
&lt;/h2&gt;

&lt;p&gt;This is not a claim that the engine has zero false negatives.&lt;/p&gt;

&lt;p&gt;It is not a claim that human review is unnecessary.&lt;/p&gt;

&lt;p&gt;The current goal is more limited and more honest: build a screening engine that can explain its own reasoning, persist that reasoning for audit, and measure whether its scores reflect reality.&lt;/p&gt;

&lt;p&gt;A proper benchmark against labeled outcomes is still in progress.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this direction matters
&lt;/h2&gt;

&lt;p&gt;The hard part of sanctions screening is rarely finding a possible match. The hard part is explaining why it was escalated, cleared, or reviewed, in a way that holds up later.&lt;/p&gt;

&lt;p&gt;That is the shift I think compliance infrastructure needs:&lt;/p&gt;

&lt;p&gt;from fuzzy scores to structured evidence to defensible review workflows.&lt;/p&gt;

&lt;p&gt;That is what I am building with &lt;a href="https://verifex.dev" rel="noopener noreferrer"&gt;Verifex&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>regtech</category>
      <category>api</category>
      <category>compliance</category>
      <category>typescript</category>
    </item>
    <item>
      <title>Bank of Scotland was fined £160K for a Cyrillic transliteration failure. Here's the technical breakdown.</title>
      <dc:creator>Verifex</dc:creator>
      <pubDate>Sun, 12 Apr 2026 14:15:31 +0000</pubDate>
      <link>https://dev.to/verifex/bank-of-scotland-was-fined-ps160k-for-a-cyrillic-transliteration-failure-heres-the-technical-3h5m</link>
      <guid>https://dev.to/verifex/bank-of-scotland-was-fined-ps160k-for-a-cyrillic-transliteration-failure-heres-the-technical-3h5m</guid>
      <description>&lt;p&gt;In January 2026, OFSI fined Bank of Scotland £160,000. &lt;br&gt;
24 payments went through to a designated Russian individual. &lt;br&gt;
Root cause: the screening tool couldn't match Cyrillic &lt;br&gt;
transliteration variants.&lt;/p&gt;

&lt;p&gt;This wasn't negligence. It was a technical failure that &lt;br&gt;
most sanctions screening tools still have today.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Cyrillic matching fails
&lt;/h2&gt;

&lt;p&gt;There are multiple competing standards for Cyrillic → Latin &lt;br&gt;
transliteration: BGN/PCGN (used by US/UK governments), ISO 9, &lt;br&gt;
GOST, ICAO, and dozens of informal spellings.&lt;/p&gt;

&lt;p&gt;A single name like "Шварц" legitimately appears as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Shvarts&lt;/li&gt;
&lt;li&gt;Shvartz&lt;/li&gt;
&lt;li&gt;Schwarz&lt;/li&gt;
&lt;li&gt;Shvarc&lt;/li&gt;
&lt;li&gt;Svarc&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every one of them is "correct" — depending on which standard &lt;br&gt;
was used. Most screening tools pick one. If the watchlist &lt;br&gt;
entry uses BGN/PCGN and the customer's passport uses ICAO, &lt;br&gt;
you get a miss. That miss cost Bank of Scotland £160K.&lt;/p&gt;

&lt;h2&gt;
  
  
  The patronymic problem
&lt;/h2&gt;

&lt;p&gt;Russian names have three parts: given name, patronymic, &lt;br&gt;
and surname.&lt;/p&gt;

&lt;p&gt;"Ivan," "Ivanov," and "Ivanovich" are three different name &lt;br&gt;
components, not spellings of one name:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ivan → given name&lt;/li&gt;
&lt;li&gt;Ivanov → surname ("of Ivan")&lt;/li&gt;
&lt;li&gt;Ivanovich → patronymic ("son of Ivan")&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A naive fuzzy matcher sees 70%+ character overlap and scores &lt;br&gt;
them as near-matches. This floods compliance queues with &lt;br&gt;
false positives while simultaneously missing real hits.&lt;/p&gt;

&lt;h2&gt;
  
  
  The "Mohammed problem"
&lt;/h2&gt;

&lt;p&gt;Arabic has 12+ formal romanization systems: ALA-LC, ISO 233, &lt;br&gt;
UNGEGN, BGN/PCGN, DIN 31635...&lt;/p&gt;

&lt;p&gt;A single Arabic name can produce 300+ valid Latin spellings. &lt;br&gt;
"Mohammed," "Muhammad," "Mohamed," and "Muhamad" are the same name &lt;br&gt;
under different systems, and "Mehmet" is its Turkish form.&lt;/p&gt;

&lt;p&gt;The Beider-Morse algorithm — arguably the most sophisticated &lt;br&gt;
phonetic matching system ever built — explicitly removed &lt;br&gt;
Arabic support. The maintainers cited "severe performance &lt;br&gt;
issues related to excessively complicated phonetics."&lt;/p&gt;

&lt;p&gt;If the best phonetic algorithm gives up on Arabic, what are &lt;br&gt;
most commercial tools doing?&lt;/p&gt;

&lt;p&gt;Answer: Jaro-Winkler with a threshold. Which is why false &lt;br&gt;
positive rates on Arabic names run above 90% in most systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  The substring trap
&lt;/h2&gt;

&lt;p&gt;"Computing" contains the substring "p-u-t-i-n."&lt;/p&gt;

&lt;p&gt;Without whole-word boundary enforcement, your screening &lt;br&gt;
system flags tech companies. This sounds absurd — but it &lt;br&gt;
happens in production systems every day.&lt;/p&gt;

&lt;p&gt;We caught this when testing our own engine. A query for &lt;br&gt;
a software company returned a high-confidence sanctions &lt;br&gt;
match because a substring of the company name overlapped &lt;br&gt;
with a sanctioned individual's name.&lt;/p&gt;

&lt;p&gt;The fix: whole-word tokenization. Only match on complete &lt;br&gt;
tokens, never on substrings.&lt;/p&gt;
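
&lt;p&gt;A minimal sketch of the difference, assuming simple word-character tokenization. The company name and both function names are hypothetical, and a production tokenizer would need to handle more than this regular expression does.&lt;/p&gt;

```python
import re


def tokens(text: str) -> set:
    """Lowercased whole-word tokens."""
    return set(re.findall(r"\w+", text.casefold()))


def substring_hit(query: str, listed_name: str) -> bool:
    """The trap: any character overlap inside a longer word counts."""
    return listed_name.casefold() in query.casefold()


def whole_token_hit(query: str, listed_name: str) -> bool:
    """The fix: every listed token must appear as a complete query token."""
    listed = tokens(listed_name)
    return bool(listed) and listed.issubset(tokens(query))


company = "Acme Computing Ltd"  # hypothetical query
print(substring_hit(company, "Putin"), whole_token_hit(company, "Putin"))
```

&lt;p&gt;The substring check fires on the company name while the whole-token check does not, and a query that really contains the token still matches.&lt;/p&gt;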

&lt;h2&gt;
  
  
  What the benchmark gap looks like
&lt;/h2&gt;

&lt;p&gt;No commercial sanctions screening vendor publishes accuracy &lt;br&gt;
benchmarks. Not Refinitiv, not ComplyAdvantage, not &lt;br&gt;
sanctions.io.&lt;/p&gt;

&lt;p&gt;OpenSanctions, the best open-source system, publishes &lt;br&gt;
its numbers: &lt;strong&gt;91.3% F1, 99% recall, 84.5% precision.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Federal Reserve published a sanctions screening paper &lt;br&gt;
in September 2025. Best result using GPT-4o: &lt;strong&gt;98.95% F1&lt;/strong&gt; — &lt;br&gt;
tested on Latin-script organization names only.&lt;/p&gt;

&lt;p&gt;Nobody is publishing results on Arabic transliteration, &lt;br&gt;
Cyrillic variants, or patronymic edge cases. Exactly the &lt;br&gt;
cases that generate real fines.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we built
&lt;/h2&gt;

&lt;p&gt;We built Verifex (verifex.dev) to address this directly. &lt;br&gt;
The matching engine combines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Soft TF-IDF + Monge-Elkan&lt;/strong&gt; — the academic gold standard 
for string matching (Cohen, Ravikumar, Fienberg 2003)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IDF corpus weighting&lt;/strong&gt; — "Mohammed" and "Kim" are 
statistically common. They should score lower than rare 
tokens like "Qadhafi"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Double Metaphone phonetic blocking&lt;/strong&gt; — across multiple 
transliteration standards simultaneously&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;9 penalty layers&lt;/strong&gt; — patronymic derivatives, substring 
boundaries, entity-type mismatches, mixed-script detection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM cascade&lt;/strong&gt; — for ambiguous matches in the 40-95% 
confidence range&lt;/li&gt;
&lt;/ul&gt;
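
&lt;p&gt;To illustrate the IDF item in the list above, here is a sketch of smoothed inverse document frequency over a toy name corpus. The corpus, the smoothing formula, and the &lt;code&gt;idf_weights&lt;/code&gt; name are assumptions for illustration. A real corpus would be the full watchlist, and the engine's exact weighting is not described here.&lt;/p&gt;

```python
import math
from collections import Counter


def idf_weights(names: list) -> dict:
    """Smoothed inverse document frequency for each token in a name corpus."""
    doc_freq = Counter()
    for name in names:
        # Count each token once per name, however often it repeats.
        doc_freq.update(set(name.casefold().split()))
    n = len(names)
    return {tok: math.log((1 + n) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}


# Toy corpus for illustration only.
corpus = ["Mohammed Ali", "Mohammed Hassan", "Mohammed Karim", "Muammar Qadhafi"]
weights = idf_weights(corpus)
print(weights["mohammed"], weights["qadhafi"])
```

&lt;p&gt;The common token ends up with a smaller weight than the rare one, so a match on the common token alone contributes less to the final score.&lt;/p&gt;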

&lt;p&gt;Result: &lt;strong&gt;100% F1 on an independent 145-case benchmark&lt;/strong&gt; — &lt;br&gt;
including Arabic transliteration, Cyrillic variants, phonetic &lt;br&gt;
matching, and adversarial substring inputs.&lt;/p&gt;

&lt;p&gt;The full benchmark is public: &lt;strong&gt;verifex.dev/benchmark&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Anyone can run it against any provider.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Bank of Scotland's fine was preventable. The technology &lt;br&gt;
to handle Cyrillic transliteration exists — it's just not &lt;br&gt;
in most commercial tools. If you're building or evaluating &lt;br&gt;
a sanctions screening solution, the benchmark cases at &lt;br&gt;
verifex.dev/benchmark show exactly where most tools fail.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>fintech</category>
      <category>api</category>
      <category>security</category>
    </item>
    <item>
      <title>How we built a sanctions screening API that outperformed the Federal Reserve's benchmark</title>
      <dc:creator>Verifex</dc:creator>
      <pubDate>Sat, 11 Apr 2026 20:43:22 +0000</pubDate>
      <link>https://dev.to/verifex/how-we-built-a-sanctions-screening-api-that-outperformed-the-federal-reserves-benchmark-57m2</link>
      <guid>https://dev.to/verifex/how-we-built-a-sanctions-screening-api-that-outperformed-the-federal-reserves-benchmark-57m2</guid>
      <description>&lt;p&gt;The Federal Reserve published a sanctions screening &lt;br&gt;
benchmark in September 2025. Their best result using &lt;br&gt;
GPT-4o: 98.95% F1.&lt;/p&gt;

&lt;p&gt;We hit 100%. Here's how.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem with existing tools
&lt;/h2&gt;

&lt;p&gt;90-95% of sanctions screening alerts are false positives.&lt;br&gt;
Analysts spend $130B/year investigating alerts that are wrong.&lt;/p&gt;

&lt;p&gt;The root cause: basic fuzzy matching. Most tools use &lt;br&gt;
Jaro-Winkler with a threshold. That's it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we built
&lt;/h2&gt;

&lt;p&gt;9 penalty layers targeting specific false-positive patterns, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Patronymic derivatives (Ivan ≠ Ivanov)&lt;/li&gt;
&lt;li&gt;Business-to-person mismatch&lt;/li&gt;
&lt;li&gt;Substring traps ("Computing" contains "Putin")&lt;/li&gt;
&lt;li&gt;Common name IDF weighting&lt;/li&gt;
&lt;li&gt;Mixed-script rejection&lt;/li&gt;
&lt;li&gt;Zero-width character evasion detection&lt;/li&gt;
&lt;/ul&gt;
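
&lt;p&gt;The last two items in that list can be sketched with the standard library alone. The zero-width set below is deliberately short and not exhaustive, and reading the script from the first word of the Unicode character name is a rough heuristic chosen for this example, not necessarily how the engine does it.&lt;/p&gt;

```python
import unicodedata

# A few common invisible characters; not an exhaustive list.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def strip_zero_width(text: str) -> str:
    """Remove invisible characters that can split a name to evade matching."""
    return "".join(ch for ch in text if ch not in ZERO_WIDTH)


def scripts_used(text: str) -> set:
    """Rough script detection from Unicode character names."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split(" ")[0])  # e.g. LATIN or CYRILLIC
    return scripts


def is_mixed_script(text: str) -> bool:
    return len(scripts_used(text)) > 1


# "\u0420" is CYRILLIC CAPITAL LETTER ER, which looks like a Latin P.
print(is_mixed_script("\u0420utin"), strip_zero_width("Pu\u200btin"))
```

&lt;p&gt;Stripping invisible characters before tokenization and flagging mixed-script tokens both have to happen before any similarity score is computed, otherwise the evasion has already worked.&lt;/p&gt;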

&lt;h2&gt;
  
  
  The matching pipeline
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Normalization → smartNormalize()&lt;/li&gt;
&lt;li&gt;FAISS MiniLM semantic ANN search&lt;/li&gt;
&lt;li&gt;Jaro-Winkler + Monge-Elkan + Soft TF-IDF&lt;/li&gt;
&lt;li&gt;Double Metaphone phonetic blocking&lt;/li&gt;
&lt;li&gt;9 penalty layers&lt;/li&gt;
&lt;li&gt;LLM cascade (40-85 confidence range)&lt;/li&gt;
&lt;li&gt;Adjudication engine&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The benchmark
&lt;/h2&gt;

&lt;p&gt;145 real test cases across 13 categories, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OFAC, UN, EU, UK sanctions lists&lt;/li&gt;
&lt;li&gt;Arabic/Cyrillic transliteration&lt;/li&gt;
&lt;li&gt;Phonetic matching&lt;/li&gt;
&lt;li&gt;Substring traps&lt;/li&gt;
&lt;li&gt;Adversarial inputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Result: 145/145. 100% F1, 100% Recall, 100% Precision.&lt;/p&gt;

&lt;p&gt;The Federal Reserve tested organization names only, &lt;br&gt;
Latin script only, 10 countries. They explicitly noted &lt;br&gt;
individual names and non-Latin scripts were &lt;br&gt;
"beyond the scope."&lt;/p&gt;

&lt;p&gt;That's exactly what we tested.&lt;/p&gt;

&lt;h2&gt;
  
  
  The dataset is public
&lt;/h2&gt;

&lt;p&gt;verifex.dev/benchmark&lt;/p&gt;

&lt;p&gt;Anyone can run it against any provider.&lt;/p&gt;

&lt;p&gt;We're Verifex — sanctions screening API for developers.&lt;br&gt;
$49/month. verifex.dev&lt;/p&gt;

</description>
      <category>fintech</category>
      <category>api</category>
      <category>webdev</category>
      <category>security</category>
    </item>
  </channel>
</rss>
