<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Siyana Hristova</title>
    <description>The latest articles on DEV Community by Siyana Hristova (@siyana_hristova_900e581ee).</description>
    <link>https://dev.to/siyana_hristova_900e581ee</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3770953%2F2ca04f47-ddf4-42c8-8235-be0725e997c0.png</url>
      <title>DEV Community: Siyana Hristova</title>
      <link>https://dev.to/siyana_hristova_900e581ee</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/siyana_hristova_900e581ee"/>
    <language>en</language>
    <item>
      <title>How to fuzzy-match 1M rows with dbt in under 10 minutes (2026 guide)</title>
      <dc:creator>Siyana Hristova</dc:creator>
      <pubDate>Sun, 29 Mar 2026 20:58:21 +0000</pubDate>
      <link>https://dev.to/siyana_hristova_900e581ee/how-to-fuzzy-match-1m-rows-with-dbt-in-under-10-minutes-2026-guide-2edg</link>
      <guid>https://dev.to/siyana_hristova_900e581ee/how-to-fuzzy-match-1m-rows-with-dbt-in-under-10-minutes-2026-guide-2edg</guid>
      <description>&lt;p&gt;Duplicate records rarely look like a priority at first — until they start breaking reporting, outreach, or reconciliation workflows.&lt;/p&gt;

&lt;p&gt;From slightly different versions of "Acme Inc" in a CRM to inconsistent supplier names across systems or messy post-merger datasets, fuzzy matching becomes essential whenever identical strings are no longer a reliable signal of the same real-world entity.&lt;/p&gt;

&lt;h2&gt;
  
  
  The scaling wall: why warehouse-native fuzzy matching breaks at scale
&lt;/h2&gt;

&lt;p&gt;Fuzzy matching looks simple on a 1,000-row sample. But at real scale, the math changes. A naive all-to-all comparison grows at &lt;strong&gt;O(N²)&lt;/strong&gt;. Once you hit 100k+ rows, comparison space explodes, and warehouse-native approaches become slow, expensive, or brittle.&lt;/p&gt;
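&lt;p&gt;A back-of-the-envelope sketch makes the quadratic growth concrete (plain Python, no dependencies):&lt;/p&gt;

```python
# Unique unordered pairs a naive all-to-all comparison must score: N * (N - 1) / 2
def naive_comparisons(n_rows: int) -> int:
    return n_rows * (n_rows - 1) // 2

for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9,} rows -> {naive_comparisons(n):>18,} comparisons")

# 1M rows means roughly 500 billion comparisons - far beyond what a
# per-pair similarity function can chew through in interactive time.
```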

&lt;p&gt;In practice, teams usually try a sequence of approaches before realizing the real complexity. They might start with warehouse similarity functions, hit performance limits, then move to Python or notebook experiments — only to discover new bottlenecks around memory usage, blocking strategy design, and data cleanup. At that point, what looked like a simple dedupe task starts turning into a permanent matching pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Blocking and candidate generation logic&lt;/li&gt;
&lt;li&gt;String normalization and suffix cleanup&lt;/li&gt;
&lt;li&gt;Threshold tuning and evaluation loops&lt;/li&gt;
&lt;li&gt;Parallelization and memory management&lt;/li&gt;
&lt;/ul&gt;
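&lt;p&gt;To make the first bullet concrete: "blocking" means only comparing records that share a cheap candidate key, instead of every possible pair. A toy sketch (the key choice here is purely illustrative):&lt;/p&gt;

```python
from collections import defaultdict

names = ["Acme Inc", "ACME Incorporated", "Beta LLC", "Beta Limited", "Gamma GmbH"]

# Group rows by a naive blocking key: the lowercased first token.
blocks = defaultdict(list)
for i, name in enumerate(names):
    blocks[name.lower().split()[0]].append(i)

# Only generate candidate pairs within each block.
candidate_pairs = [
    (a, b)
    for ids in blocks.values()
    for pos, a in enumerate(ids)
    for b in ids[pos + 1:]
]
print(candidate_pairs)  # → [(0, 1), (2, 3)] instead of all 10 possible pairs
```

Real blocking strategies need multiple keys and recall tuning, which is exactly the maintenance burden described above.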

&lt;p&gt;What started as a quick cleanup task quietly turns into ongoing engineering overhead.&lt;/p&gt;

&lt;h2&gt;
  
  
  The solution: call a production fuzzy-matching engine from dbt
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://similarity-api.com/" rel="noopener noreferrer"&gt;Similarity API&lt;/a&gt; is a hosted infrastructure service designed for high-performance deduplication and reconciliation.&lt;/p&gt;

&lt;p&gt;Instead of building and maintaining your own matching pipeline, you send the relevant strings to a dedicated matching engine optimized for noisy real-world data and large-scale workloads — then load the results back into your warehouse as a normal dbt model output.&lt;/p&gt;

&lt;h2&gt;
  
  
  The technical edge: adaptive preprocessing at scale
&lt;/h2&gt;

&lt;p&gt;In real workflows, fuzzy matching quality is determined as much by data preparation strategy as by the similarity metric itself.&lt;/p&gt;

&lt;p&gt;Local implementations often require teams to design custom normalization rules, suffix cleaning logic, token ordering heuristics, and blocking strategies — each of which must be tuned as datasets evolve.&lt;/p&gt;

&lt;p&gt;Similarity API embeds these steps directly into the matching engine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dataset-aware normalization: preprocessing adapts dynamically to string length, token density, and noise patterns&lt;/li&gt;
&lt;li&gt;Scale-optimized cleaning pipeline: preprocessing runs as part of the distributed matching flow, preventing cleanup stages from becoming bottlenecks at 1M+ rows&lt;/li&gt;
&lt;li&gt;Configuration instead of custom code: matching behaviour is controlled through parameters such as &lt;code&gt;similarity_threshold&lt;/code&gt;, &lt;code&gt;use_token_sort&lt;/code&gt;, and &lt;code&gt;remove_punctuation&lt;/code&gt;, rather than bespoke scripts&lt;/li&gt;
&lt;/ul&gt;
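&lt;p&gt;To see why a parameter like &lt;code&gt;use_token_sort&lt;/code&gt; exists, compare raw character similarity for reordered tokens. This sketch uses Python's &lt;code&gt;difflib&lt;/code&gt; as a stand-in metric, not the API's internal scorer:&lt;/p&gt;

```python
from difflib import SequenceMatcher

def ratio(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

a, b = "acme holdings inc", "inc acme holdings"

raw = ratio(a, b)  # character-level comparison penalizes token reordering
token_sorted = ratio(" ".join(sorted(a.split())),
                     " ".join(sorted(b.split())))  # sorted tokens are identical → 1.0

print(f"raw={raw:.2f} token_sorted={token_sorted:.2f}")
```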

&lt;p&gt;This architecture allows teams to focus on match review and downstream data actions rather than maintaining fragile preprocessing pipelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why fuzzy matching belongs in your dbt layer
&lt;/h2&gt;

&lt;p&gt;This guide is designed for a dbt Python model workflow.&lt;/p&gt;

&lt;p&gt;dbt is a strong execution surface for fuzzy matching because you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;pull source data from your warehouse using dbt refs or sources&lt;/li&gt;
&lt;li&gt;call the matching API inside a repeatable transformation workflow&lt;/li&gt;
&lt;li&gt;materialize match results back into warehouse tables&lt;/li&gt;
&lt;li&gt;keep dedupe logic close to the rest of your analytics engineering stack&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, this means you can move from one-off cleanup to a reusable model that runs as part of your broader data pipeline.&lt;/p&gt;

&lt;p&gt;Before running this model, you will need a Similarity API production token.&lt;/p&gt;

&lt;p&gt;You can generate one from the Similarity API dashboard. The token is passed as a standard Bearer authorization header in the request.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you actually get back
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Example input
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Acme Inc"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ACME Incorporated"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Beta LLC"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Beta Limited"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Example output (index_pairs)
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.94&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.91&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each result represents two rows that likely refer to the same real-world entity, along with a similarity score.&lt;/p&gt;

&lt;p&gt;By default, the API returns index pairs, which you can join back to the staged input rows for review, clustering, or merge workflows.&lt;/p&gt;

&lt;p&gt;Output format is configurable — you can instead return string pairs, clustered groups of duplicates, or fully deduplicated record lists depending on your cleanup strategy.&lt;/p&gt;
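&lt;p&gt;Joining the index pairs back to the staged rows is a one-line &lt;code&gt;map&lt;/code&gt; per column. Using the example data above:&lt;/p&gt;

```python
import pandas as pd

strings = ["Acme Inc", "ACME Incorporated", "Beta LLC", "Beta Limited"]
index_pairs = [[0, 1, 0.94], [2, 3, 0.91]]  # [idx_1, idx_2, score]

pairs_df = pd.DataFrame(index_pairs, columns=["idx_1", "idx_2", "score"])
pairs_df["name_1"] = pairs_df["idx_1"].map(lambda i: strings[i])
pairs_df["name_2"] = pairs_df["idx_2"].map(lambda i: strings[i])
print(pairs_df)
```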

&lt;p&gt;The following dbt Python model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reads company names from an upstream dbt model&lt;/li&gt;
&lt;li&gt;sends them to the Similarity API&lt;/li&gt;
&lt;li&gt;returns duplicate index pairs joined back to the original strings
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dbt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;dbt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SIMILARITY_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;api_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.similarity-api.com/dedupe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;source_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dbt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stg_companies&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to_pandas&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;source_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;source_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;strings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;source_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;company_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loaded &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; rows from dbt ref(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;stg_companies&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;config&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;similarity_threshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.65&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;remove_punctuation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to_lowercase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;use_token_sort&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;index_pairs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;api_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Workflow complete: found &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; duplicate pairs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;idx_1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;idx_2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;company_name_1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;company_name_2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;dedupe_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;idx_1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;idx_2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;dedupe_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;company_name_1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dedupe_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;idx_1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;dedupe_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;company_name_2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dedupe_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;idx_2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;dedupe_df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;SIMILARITY_API_KEY&lt;/code&gt; environment variable just needs to be available to the dbt runtime (exported by your shell, scheduler, or orchestrator before &lt;code&gt;dbt run&lt;/code&gt;), and the resulting table can then feed review models, merge workflows, or downstream entity clustering.&lt;/p&gt;

&lt;h2&gt;
  
  
  The honest "under 10-minute" claim
&lt;/h2&gt;

&lt;p&gt;Here is how the timing works in practice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~7 minutes: benchmarked processing time for a 1M-row dataset in Similarity API. This varies with string length and duplicate density.&lt;/li&gt;
&lt;li&gt;~2 minutes: drop the model into your dbt project, set the API key, and run it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No blocking strategy design. No distributed compute tuning. No regex cleanup scripts.&lt;/p&gt;

&lt;h2&gt;
  
  
  From prototype to production
&lt;/h2&gt;

&lt;p&gt;The advantage of dbt is that this does not have to stay a one-off experiment.&lt;/p&gt;

&lt;p&gt;Once the model works, you can schedule it as part of your normal transformation workflow and build downstream logic on top of the output table:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;review likely duplicate pairs&lt;/li&gt;
&lt;li&gt;cluster entities before enrichment&lt;/li&gt;
&lt;li&gt;feed survivorship / merge logic&lt;/li&gt;
&lt;li&gt;monitor duplicate volume over time&lt;/li&gt;
&lt;/ul&gt;
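&lt;p&gt;Clustering is the step most teams build first on top of the pair table. A minimal union-find sketch that turns pairs into entity groups (the helper name is hypothetical):&lt;/p&gt;

```python
def cluster_pairs(n_rows, pairs):
    """Group row indices connected by duplicate pairs into clusters."""
    parent = list(range(n_rows))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    # Union the two endpoints of every matched pair.
    for a, b, _score in pairs:
        parent[find(a)] = find(b)

    # Collect rows by root; singletons are not duplicates, so drop them.
    clusters = {}
    for i in range(n_rows):
        clusters.setdefault(find(i), []).append(i)
    return [ids for ids in clusters.values() if len(ids) > 1]

pairs = [[0, 1, 0.94], [2, 3, 0.91], [1, 4, 0.88]]
print(cluster_pairs(5, pairs))  # → [[0, 1, 4], [2, 3]]
```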

&lt;p&gt;Because the interface is standard HTTP, the matching engine becomes a reusable data-quality component inside the same dbt workflow your team already maintains.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final word
&lt;/h2&gt;

&lt;p&gt;At large scale, fuzzy matching stops being a string-similarity problem and becomes an infrastructure problem.&lt;/p&gt;

&lt;p&gt;Similarity API is built for teams that prefer to spend engineering time on analytics and product logic — not on maintaining custom deduplication pipelines.&lt;/p&gt;

&lt;p&gt;Instead of weeks of pipeline work, you can run one dbt model and move straight to reviewing and acting on clean data.&lt;/p&gt;

&lt;p&gt;Stop building matching infrastructure. Start acting on clean entities.&lt;/p&gt;

</description>
      <category>dbt</category>
      <category>bigquery</category>
      <category>bigdata</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>How to Fuzzy-Match 1 Million Rows in BigQuery in under 10 minutes</title>
      <dc:creator>Siyana Hristova</dc:creator>
      <pubDate>Mon, 23 Mar 2026 14:20:18 +0000</pubDate>
      <link>https://dev.to/siyana_hristova_900e581ee/how-to-fuzzy-match-1-million-rows-in-bigquery-in-under-10-minutes-3hfn</link>
      <guid>https://dev.to/siyana_hristova_900e581ee/how-to-fuzzy-match-1-million-rows-in-bigquery-in-under-10-minutes-3hfn</guid>
      <description>&lt;p&gt;Duplicate records rarely look like a priority at first — until they start breaking reporting, outreach, or reconciliation workflows.&lt;/p&gt;

&lt;p&gt;From slightly different versions of "Acme Inc" in a CRM to inconsistent supplier names across systems or messy post-merger datasets, fuzzy matching becomes essential whenever identical strings are no longer a reliable signal of the same real-world entity.&lt;/p&gt;




&lt;h2&gt;
  
  
  The scaling wall: why warehouse-native fuzzy matching breaks at scale
&lt;/h2&gt;

&lt;p&gt;Fuzzy matching looks simple on a 1,000-row sample. But at real scale, the math changes. A naive all-to-all comparison grows at &lt;strong&gt;O(N²)&lt;/strong&gt;. Once you hit 100k+ rows, comparison space explodes, and local scripts or warehouse-native approaches become slow, expensive, or brittle.&lt;/p&gt;

&lt;p&gt;In practice, teams usually try a sequence of approaches before realizing the real complexity. They might start with warehouse similarity functions (such as edit distance or token similarity), hit performance limits, then switch to a quick Python script — only to discover new bottlenecks around memory usage, blocking strategy design, and data cleanup.&lt;/p&gt;

&lt;p&gt;At that point, what looked like a simple task starts turning into a permanent matching pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Blocking and candidate generation logic
&lt;/li&gt;
&lt;li&gt;String normalization and suffix cleanup
&lt;/li&gt;
&lt;li&gt;Threshold tuning and evaluation loops
&lt;/li&gt;
&lt;li&gt;Parallelization and memory management
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What started as a quick dedupe task quietly turns into ongoing engineering overhead.&lt;/p&gt;




&lt;h2&gt;
  
  
  The solution: call a production fuzzy-matching engine
&lt;/h2&gt;

&lt;p&gt;Similarity API is a hosted infrastructure service designed for high-performance deduplication and reconciliation.&lt;/p&gt;

&lt;p&gt;Instead of building and maintaining your own matching pipeline, you send the dataset to a dedicated matching engine optimized for noisy real-world data and large-scale workloads.&lt;/p&gt;




&lt;h2&gt;
  
  
  The technical edge: adaptive preprocessing at scale
&lt;/h2&gt;

&lt;p&gt;In real workflows, fuzzy matching quality is determined as much by data preparation strategy as by the similarity metric itself.&lt;/p&gt;

&lt;p&gt;Local implementations often require teams to design custom normalization rules, suffix cleaning logic, token ordering heuristics, and blocking strategies — each of which must be tuned as datasets evolve.&lt;/p&gt;

&lt;p&gt;Similarity API embeds these steps directly into the matching engine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dataset-aware normalization:&lt;/strong&gt; preprocessing adapts dynamically to string length, token density, and noise patterns
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale-optimized cleaning pipeline:&lt;/strong&gt; preprocessing runs as part of the distributed matching flow, preventing cleanup stages from becoming bottlenecks at 1M+ rows
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuration instead of custom code:&lt;/strong&gt; matching behaviour is controlled through parameters such as &lt;code&gt;similarity_threshold&lt;/code&gt;, &lt;code&gt;use_token_sort&lt;/code&gt;, and &lt;code&gt;remove_punctuation&lt;/code&gt;, rather than bespoke scripts
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This architecture allows teams to focus on match review and downstream data actions rather than maintaining fragile preprocessing pipelines.&lt;/p&gt;




&lt;h2&gt;
  
  
  The BigQuery Notebook
&lt;/h2&gt;

&lt;p&gt;This guide is designed to run inside a &lt;strong&gt;BigQuery notebook environment&lt;/strong&gt; (Colab Enterprise integrated into BigQuery).&lt;/p&gt;

&lt;p&gt;These notebooks let you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Query production tables directly from BigQuery
&lt;/li&gt;
&lt;li&gt;Run Python data workflows without provisioning infrastructure
&lt;/li&gt;
&lt;li&gt;Call external APIs for heavy compute tasks
&lt;/li&gt;
&lt;li&gt;Write results back into BigQuery tables
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, this makes them an ideal surface for large-scale fuzzy matching workflows: data stays in the warehouse, while compute-intensive matching runs in a scalable external service.&lt;/p&gt;

&lt;p&gt;Before running the notebook cell, you will need a &lt;strong&gt;Similarity API production token&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You can generate one from the Similarity API dashboard. The token is passed as a standard Bearer authorization header in the request.&lt;/p&gt;




&lt;h2&gt;
  
  
  What you actually get back
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Example input
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;["Acme Inc", "ACME Incorporated", "Beta LLC", "Beta Limited"]&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Example output
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.94&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.91&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each result represents two rows that likely refer to the same real-world entity, along with a similarity score.&lt;/p&gt;

&lt;p&gt;By default, the API returns index pairs, which you can quickly join back to your BigQuery table for review or merge workflows.&lt;/p&gt;

&lt;p&gt;Output format is configurable — you can instead return string pairs, clustered groups of duplicates, or fully deduplicated record lists depending on your cleanup strategy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Notebook example
&lt;/h2&gt;

&lt;p&gt;The following code snippet:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reads a dataset directly from BigQuery&lt;/li&gt;
&lt;li&gt;sends company names to the Similarity API&lt;/li&gt;
&lt;li&gt;returns duplicate index pairs
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.cloud&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;bigquery&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="c1"&gt;# ---- CONFIG ----
&lt;/span&gt;&lt;span class="n"&gt;PROJECT_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_PROJECT_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;DATASET&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_DATASET&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;TABLE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_TABLE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;COLUMN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;company_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;API_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_PRODUCTION_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;API_URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.similarity-api.com/dedupe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# ---- LOAD DATA FROM BIGQUERY ----
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
SELECT &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;COLUMN&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
FROM `&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;DATASET&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;TABLE&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;`
WHERE &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;COLUMN&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; IS NOT NULL
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;strings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_dataframe&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="n"&gt;COLUMN&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loaded &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; rows from BigQuery&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ---- CALL SIMILARITY API ----
&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;config&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;similarity_threshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.65&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;remove_punctuation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to_lowercase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;use_token_sort&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;index_pairs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;API_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;API_KEY&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Workflow complete: found &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; duplicate pairs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ---- OPTIONAL: SAVE RESULTS BACK TO BIGQUERY ----
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;dup_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;idx_1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;idx_2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="n"&gt;table_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;DATASET&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.dedupe_results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_table_from_dataframe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dup_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;table_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Saved results to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The honest "under 10-minute" claim
&lt;/h2&gt;

&lt;p&gt;Here is how the timing works in practice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~2 minutes: copy-paste the notebook cell, run the query, and start the job&lt;/li&gt;
&lt;li&gt;~7 minutes: benchmarked Similarity API processing time for a 1M-row dataset (varies with string length)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No blocking strategy design. No distributed compute tuning. No regex cleanup scripts.&lt;/p&gt;

&lt;h2&gt;
  
  
  From prototype to production
&lt;/h2&gt;

&lt;p&gt;Notebooks are ideal for validating matching quality and running one-off reconciliation jobs.&lt;/p&gt;

&lt;p&gt;In production, the same API call pattern can be embedded into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;scheduled BigQuery workflows&lt;/li&gt;
&lt;li&gt;Airflow or Prefect pipelines&lt;/li&gt;
&lt;li&gt;backend data services&lt;/li&gt;
&lt;li&gt;low-code automation tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because the interface is standard HTTP, the matching engine becomes a reusable data-quality component across your stack.&lt;/p&gt;
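&lt;p&gt;A minimal sketch of that reuse, with the same endpoint and config keys as the notebook above (the wrapper itself is hypothetical, not an official client):&lt;/p&gt;

```python
API_URL = "https://api.similarity-api.com/dedupe"

def build_dedupe_payload(strings, threshold=0.8):
    """Assemble the request body used throughout this article."""
    return {
        "data": strings,
        "config": {
            "similarity_threshold": threshold,
            "remove_punctuation": True,
            "to_lowercase": True,
            "output_format": "index_pairs",
        },
    }

def find_duplicates(strings, api_key, threshold=0.8, timeout=3600):
    """POST one batch of strings and return the duplicate index pairs."""
    import requests  # imported lazily so the payload builder has no HTTP dependency

    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json=build_dedupe_payload(strings, threshold),
        timeout=timeout,
    )
    response.raise_for_status()
    return response.json().get("response_data", [])
```

&lt;p&gt;Wrapped this way, the same function body drops into an Airflow task, a Prefect flow, or a backend service without modification.&lt;/p&gt;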

&lt;h2&gt;
  
  
  Final word
&lt;/h2&gt;

&lt;p&gt;At large scale, fuzzy matching stops being a string-similarity problem and becomes an infrastructure problem.&lt;/p&gt;

&lt;p&gt;Similarity API is built for teams that prefer to spend engineering time on analytics and product logic — not on maintaining custom deduplication pipelines.&lt;/p&gt;

&lt;p&gt;Instead of weeks of pipeline work, you can run a notebook cell and move straight to reviewing and acting on clean data.&lt;/p&gt;

</description>
      <category>bigquery</category>
      <category>fuzzymatching</category>
      <category>productivity</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How to Reconcile Salesforce Leads Against Contacts at Scale</title>
      <dc:creator>Siyana Hristova</dc:creator>
      <pubDate>Mon, 23 Mar 2026 13:41:59 +0000</pubDate>
      <link>https://dev.to/siyana_hristova_900e581ee/how-to-reconcile-salesforce-leads-against-contacts-at-scale-2nd4</link>
      <guid>https://dev.to/siyana_hristova_900e581ee/how-to-reconcile-salesforce-leads-against-contacts-at-scale-2nd4</guid>
      <description>&lt;p&gt;Duplicate identity records are almost inevitable in modern Salesforce environments.&lt;/p&gt;

&lt;p&gt;Leads enter the CRM from web forms, enrichment tools, outbound prospecting platforms, partner integrations, event uploads, product sign-ups, and manual entry. Even in well-governed systems, slight variations in names, emails, company formatting, and job titles accumulate over time.&lt;/p&gt;

&lt;p&gt;At scale, teams eventually need to answer practical operational questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which of our newly imported leads already exist as contacts?&lt;/li&gt;
&lt;li&gt;Who should own this inbound lead if the account already exists?&lt;/li&gt;
&lt;li&gt;How do we clean identity data before migrations or reporting resets?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where lead-to-contact reconciliation workflows emerge.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why teams run lead-to-contact reconciliation
&lt;/h2&gt;

&lt;p&gt;This workflow is typically driven by operational needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reporting accuracy&lt;/strong&gt; — duplicate identities fragment attribution and pipeline analytics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Routing correctness&lt;/strong&gt; — inbound leads often need to inherit ownership from existing accounts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Import risk reduction&lt;/strong&gt; — bulk uploads can create thousands of duplicates without pre-checks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automation enablement&lt;/strong&gt; — surfacing candidate matches enables auto-assignment and conversion rules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Over time, reconciliation becomes a recurring RevOps capability rather than a one-off cleanup exercise.&lt;/p&gt;




&lt;h2&gt;
  
  
  What reconciliation workflows look like in practice
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pre-import identity checks
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Export existing contacts&lt;/li&gt;
&lt;li&gt;Compare new leads against the contact base&lt;/li&gt;
&lt;li&gt;Review high-confidence matches&lt;/li&gt;
&lt;li&gt;Merge or update records before import&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Scheduled identity cleanup jobs
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Compare recently created leads to contacts&lt;/li&gt;
&lt;li&gt;Write similarity scores or match IDs to custom fields&lt;/li&gt;
&lt;li&gt;Create review queues for RevOps teams&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Automation-driven identity resolution
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Apex triggers call external reconciliation endpoints before lead insert&lt;/li&gt;
&lt;li&gt;Salesforce Flows surface candidate matches for SDR review&lt;/li&gt;
&lt;li&gt;Nightly jobs reassign leads to existing account owners&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this stage, similarity matching becomes part of operational CRM infrastructure.&lt;/p&gt;




&lt;h2&gt;
  
  
  Exact vs similarity matching in CRM reconciliation
&lt;/h2&gt;

&lt;p&gt;Traditional deduplication relies on exact matching — typically strict email equality or rule-based logic.&lt;/p&gt;

&lt;p&gt;Exact matching works well when identity signals are clean and standardized.&lt;/p&gt;

&lt;p&gt;In real go-to-market environments, identity data drifts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;People use multiple email addresses&lt;/li&gt;
&lt;li&gt;Company names appear in different formats&lt;/li&gt;
&lt;li&gt;Titles and suffixes vary&lt;/li&gt;
&lt;li&gt;Records are created across disconnected systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Similarity-based matching addresses this ambiguity by asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Are these records likely to represent the same real-world person?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Exact matching remains a useful first filter.&lt;br&gt;
Similarity matching expands coverage to edge cases that strict rules cannot resolve at scale.&lt;/p&gt;
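&lt;p&gt;A tiny example of the difference, using Python's difflib as a stand-in scorer (the API's actual algorithm differs):&lt;/p&gt;

```python
import string
from difflib import SequenceMatcher

# Strip punctuation the way the article's config options do.
_PUNCT = str.maketrans("", "", string.punctuation)

def similarity(a, b):
    """Score two identity strings after lowercase/punctuation normalization."""
    a = a.lower().translate(_PUNCT)
    b = b.lower().translate(_PUNCT)
    return SequenceMatcher(None, a, b).ratio()

# Exact equality misses a one-character drift that similarity scoring catches.
print("Jane Doe" == "Janet Doe")                      # False
print(round(similarity("Jane Doe", "Janet Doe"), 2))  # 0.94
```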


&lt;h2&gt;
  
  
  How reconciliation pipelines typically work
&lt;/h2&gt;

&lt;p&gt;Conceptually, identity matching pipelines involve:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pre-processing&lt;/strong&gt; — normalize casing, punctuation, token order, and company suffixes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Similarity scoring&lt;/strong&gt; — compare identity strings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filtering&lt;/strong&gt; — retain matches above a defined confidence threshold&lt;/li&gt;
&lt;/ol&gt;
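&lt;p&gt;The three steps can be sketched locally in a few lines, here with difflib as an illustrative scorer and the brute-force nested loop that makes this approach collapse at scale:&lt;/p&gt;

```python
import string
from difflib import SequenceMatcher

def normalize(s):
    # Step 1: pre-processing: lowercase, strip punctuation, sort tokens.
    s = s.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(s.split()))

def match(leads, contacts, threshold=0.85):
    results = []
    for i, lead in enumerate(leads):  # brute force: every lead vs every contact
        for j, contact in enumerate(contacts):
            # Step 2: similarity scoring on normalized identity strings.
            score = SequenceMatcher(None, normalize(lead), normalize(contact)).ratio()
            # Step 3: filtering: keep pairs above the confidence threshold.
            if score >= threshold:
                results.append((i, j, round(score, 2)))
    return results

print(match(["Doe, Jane (Acme Inc.)"], ["jane doe acme inc"]))  # [(0, 0, 1.0)]
```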

&lt;p&gt;This approach works on small datasets.&lt;br&gt;
It becomes harder when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CRM datasets reach hundreds of thousands of records&lt;/li&gt;
&lt;li&gt;Identity drift occurs continuously through imports and enrichment&lt;/li&gt;
&lt;li&gt;Reconciliation must run automatically or on a frequent schedule&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that point, teams often move from ad-hoc scripts toward scalable matching infrastructure.&lt;/p&gt;


&lt;h2&gt;
  
  
  Replacing the pipeline with a single reconciliation call
&lt;/h2&gt;

&lt;p&gt;Instead of designing and maintaining a full matching pipeline, teams can use a reconciliation API.&lt;/p&gt;

&lt;p&gt;Example request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;lead_match_strings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;contact_match_strings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;config&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;similarity_threshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.82&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;top_n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to_lowercase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;remove_punctuation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;use_token_sort&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flat_table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.similarity-api.com/reconcile&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;API_KEY&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A key design decision is defining the &lt;strong&gt;identity string&lt;/strong&gt; — commonly a combination of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First name&lt;/li&gt;
&lt;li&gt;Last name&lt;/li&gt;
&lt;li&gt;Email&lt;/li&gt;
&lt;li&gt;Company / account name&lt;/li&gt;
&lt;li&gt;Job title&lt;/li&gt;
&lt;/ul&gt;
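&lt;p&gt;A minimal sketch of composing that identity string from Salesforce-style field names (the field keys here are illustrative, and missing fields are simply skipped):&lt;/p&gt;

```python
def identity_string(record):
    """Concatenate the identity fields listed above into one match string."""
    parts = [
        record.get("FirstName", ""),
        record.get("LastName", ""),
        record.get("Email", ""),
        record.get("Company", ""),
        record.get("Title", ""),
    ]
    # Keep only populated fields so sparse records still produce a usable string.
    return " ".join(p.strip() for p in parts if p).strip()

lead = {"FirstName": "Jane", "LastName": "Doe",
        "Email": "jane@acme.com", "Company": "Acme Inc"}
print(identity_string(lead))  # Jane Doe jane@acme.com Acme Inc
```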




&lt;h2&gt;
  
  
  Example reconciliation output
&lt;/h2&gt;

&lt;p&gt;When using a flat table output format, matches are returned at row level:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;lead_index&lt;/th&gt;
&lt;th&gt;lead_identity&lt;/th&gt;
&lt;th&gt;contact_index&lt;/th&gt;
&lt;th&gt;contact_identity&lt;/th&gt;
&lt;th&gt;score&lt;/th&gt;
&lt;th&gt;matched&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Jane Doe&lt;/td&gt;
&lt;td&gt;&lt;a href="mailto:jane@acme.com"&gt;jane@acme.com&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Acme Inc&lt;/td&gt;
&lt;td&gt;1542&lt;/td&gt;
&lt;td&gt;Jane Doe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Jane Doe&lt;/td&gt;
&lt;td&gt;&lt;a href="mailto:jane@acme.com"&gt;jane@acme.com&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Acme Inc&lt;/td&gt;
&lt;td&gt;9811&lt;/td&gt;
&lt;td&gt;Janet Doe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Mark Lee&lt;/td&gt;
&lt;td&gt;&lt;a href="mailto:mark@north.io"&gt;mark@north.io&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;North IO&lt;/td&gt;
&lt;td&gt;2207&lt;/td&gt;
&lt;td&gt;Marc Lee&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These candidate matches can then power:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lead conversion workflows&lt;/li&gt;
&lt;li&gt;Ownership reassignment&lt;/li&gt;
&lt;li&gt;Deduplication review queues&lt;/li&gt;
&lt;li&gt;Automated CRM hygiene jobs&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;Lead-to-contact reconciliation is not just a data cleanup task.&lt;br&gt;
In high-volume Salesforce environments, it becomes a foundational operational capability.&lt;/p&gt;

&lt;p&gt;Teams that implement scalable identity matching gain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More reliable pipeline attribution&lt;/li&gt;
&lt;li&gt;Cleaner account ownership signals&lt;/li&gt;
&lt;li&gt;Safer bulk imports&lt;/li&gt;
&lt;li&gt;Stronger automation across RevOps workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As CRM datasets grow, reconciliation workflows evolve from manual checks into continuous identity infrastructure.&lt;/p&gt;

&lt;p&gt;Try a 100k-row lead dedupe for free at &lt;a href="https://similarity-api.com/try-it" rel="noopener noreferrer"&gt;https://similarity-api.com/try-it&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>salesforce</category>
      <category>cleanleads</category>
      <category>crm</category>
      <category>revenueoperations</category>
    </item>
    <item>
      <title>How to fuzzy-match a 1M-row dataset to a canonical reference in under 10 minutes (2026 guide)</title>
      <dc:creator>Siyana Hristova</dc:creator>
      <pubDate>Fri, 20 Mar 2026 14:35:13 +0000</pubDate>
      <link>https://dev.to/siyana_hristova_900e581ee/how-to-fuzzy-match-a-1m-row-dataset-to-a-canonical-reference-in-under-10-minutes-2026-guide-3gp1</link>
      <guid>https://dev.to/siyana_hristova_900e581ee/how-to-fuzzy-match-a-1m-row-dataset-to-a-canonical-reference-in-under-10-minutes-2026-guide-3gp1</guid>
      <description>&lt;p&gt;Unifying operational data against a canonical reference is a foundational analytics task — and one that becomes surprisingly complex at scale.&lt;/p&gt;

&lt;p&gt;Whether you are matching a newly acquired CRM against an existing customer base, aligning vendor lists across procurement systems, or validating inbound leads before enrichment, reconciliation is the practical way to identify which records refer to the same real-world entities across datasets.&lt;/p&gt;




&lt;h2&gt;
  
  
  The scaling wall: why cross-dataset matching gets hard fast
&lt;/h2&gt;

&lt;p&gt;Matching problems often start with what looks like a manageable task: align a large operational dataset with a canonical reference table.&lt;/p&gt;

&lt;p&gt;But when that reference dataset contains hundreds of thousands or millions of rows, naive matching approaches quickly become impractical. A brute-force comparison of &lt;strong&gt;3,000 records against 1,000,000 candidates already implies billions of potential similarity checks&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In real workflows, teams typically try a sequence of approaches before realizing the full complexity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;warehouse similarity joins that become slow or expensive
&lt;/li&gt;
&lt;li&gt;Python scripts that run out of memory or require heavy batching
&lt;/li&gt;
&lt;li&gt;ad-hoc preprocessing logic for suffix cleanup and token normalization
&lt;/li&gt;
&lt;li&gt;fragile threshold tuning loops that must be revisited as data evolves
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What began as a simple reconciliation step can quietly turn into a long-term engineering burden.&lt;/p&gt;




&lt;h2&gt;
  
  
  The solution: use a purpose-built reconciliation engine
&lt;/h2&gt;

&lt;p&gt;Similarity API provides a hosted infrastructure layer designed specifically for large-scale &lt;strong&gt;A-to-B entity matching&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of engineering candidate-generation logic, blocking strategies, and distributed compute orchestration yourself, you send:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a smaller dataset (for example 3K inbound records)
&lt;/li&gt;
&lt;li&gt;a larger reference dataset (for example a 1M-row master table)
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The engine handles the matching workflow and returns the most likely corresponding entities.&lt;/p&gt;

&lt;p&gt;This lets teams focus on review, enrichment, and downstream automation rather than building matching infrastructure.&lt;/p&gt;




&lt;h2&gt;
  
  
  The technical edge: adaptive matching across asymmetric datasets
&lt;/h2&gt;

&lt;p&gt;Reconciliation is fundamentally different from deduplication because datasets are &lt;strong&gt;asymmetric in size and structure&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Local implementations typically require custom logic to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;generate candidate pools efficiently
&lt;/li&gt;
&lt;li&gt;normalize naming conventions across systems
&lt;/li&gt;
&lt;li&gt;tune similarity thresholds for different entity types
&lt;/li&gt;
&lt;li&gt;rank or filter multiple potential matches
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Similarity API embeds these steps directly into the matching engine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Adaptive candidate generation:&lt;/strong&gt; optimized search strategies reduce comparison space automatically
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dataset-aware normalization:&lt;/strong&gt; cleaning logic adapts to string density and noise patterns
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configurable ranking behavior:&lt;/strong&gt; parameters control match strictness and output structure
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This allows teams to run reconciliation workflows at scale without designing bespoke matching pipelines.&lt;/p&gt;




&lt;h2&gt;
  
  
  What you actually get back
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Example input datasets&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Dataset A (new records):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;["Acme Corporation", "Beta Solutions Ltd", "Gamma Tech"]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Dataset B (reference dataset excerpt):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;["ACME Corp", "Beta Solutions Limited", "Delta Industries", "Gamma Technologies"]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example reconciliation output (top match pairs)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.93&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.91&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.88&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each result represents a likely match between a record in the smaller dataset and a candidate in the larger reference dataset, along with a similarity score.&lt;/p&gt;

&lt;p&gt;Output format is configurable depending on workflow needs. Teams may choose to return:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;top match index pairs
&lt;/li&gt;
&lt;li&gt;ranked candidate lists
&lt;/li&gt;
&lt;li&gt;string match previews for validation
&lt;/li&gt;
&lt;li&gt;enriched reconciliation tables
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This flexibility allows the same matching engine to support exploratory validation, automated enrichment, or production reconciliation pipelines.&lt;/p&gt;
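&lt;p&gt;For instance, the index pairs from the example output above resolve back to readable name pairs with a simple lookup (assuming the [index_a, index_b, score] layout shown):&lt;/p&gt;

```python
# The two example datasets from this article.
data_a = ["Acme Corporation", "Beta Solutions Ltd", "Gamma Tech"]
data_b = ["ACME Corp", "Beta Solutions Limited", "Delta Industries", "Gamma Technologies"]

# Top match pairs as returned in the example output: [index_a, index_b, score].
matches = [[0, 0, 0.93], [1, 1, 0.91], [2, 3, 0.88]]

for ia, ib, score in matches:
    print(f"{data_a[ia]} -> {data_b[ib]}  (score {score})")
```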




&lt;h2&gt;
  
  
  Example reconciliation call
&lt;/h2&gt;

&lt;p&gt;This minimal Python example demonstrates the core workflow. In practice, the same call can be embedded into notebooks, orchestration pipelines, backend services, or analytics transformations.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;API_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_PRODUCTION_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;API_URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.similarity-api.com/reconcile&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;new_records&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Acme Corporation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Beta Solutions Ltd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Gamma Tech&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;reference_records&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_large_reference_dataset_somehow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# e.g. warehouse extract
&lt;/span&gt;
&lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;new_records&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;reference_records&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;config&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;top_n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;similarity_threshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;remove_punctuation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to_lowercase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;API_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;API_KEY&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;matches&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Found &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; reconciliation matches&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
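&lt;p&gt;For production use, it is worth hardening this call: &lt;code&gt;raise_for_status()&lt;/code&gt; surfaces HTTP errors instead of silently parsing an error body, and a small retry loop absorbs transient failures. A minimal sketch (the retry count and backoff schedule are illustrative choices, not API requirements; the HTTP function is injected so the logic can be tested offline):&lt;/p&gt;

```python
import time

def post_with_retries(post, url, headers, payload, retries=3, timeout=3600,
                      sleep=time.sleep):
    """POST with simple exponential backoff for transient failures.

    `post` is the HTTP function to use, e.g. requests.post, passed in so
    the retry logic stays testable without a live endpoint.
    """
    for attempt in range(retries):
        try:
            resp = post(url, headers=headers, json=payload, timeout=timeout)
            resp.raise_for_status()  # surface 4xx/5xx instead of parsing an error body
            return resp.json().get("response_data", [])
        except OSError:  # requests.RequestException subclasses OSError
            if attempt == retries - 1:
                raise
            sleep(2 ** attempt)  # back off 1s, 2s, 4s between attempts
```

&lt;p&gt;Called as &lt;code&gt;post_with_retries(requests.post, API_URL, headers, payload)&lt;/code&gt; with the same URL, headers, and payload as above, it is a drop-in replacement for the direct call.&lt;/p&gt;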






&lt;h2&gt;
  
  
  The honest “under 10-minute” claim
&lt;/h2&gt;

&lt;p&gt;For a common workload such as &lt;strong&gt;reconciling ~3,000 inbound records against a 1M-row reference dataset&lt;/strong&gt;, runtime typically breaks down as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;~7 minutes:&lt;/strong&gt; matching and ranking performed by the reconciliation engine
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~2–3 minutes:&lt;/strong&gt; extracting the reference dataset and triggering the workflow
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No custom blocking logic. No distributed similarity joins. No manual candidate ranking pipelines.&lt;/p&gt;




&lt;h2&gt;
  
  
  From ad-hoc validation to production reconciliation
&lt;/h2&gt;

&lt;p&gt;Once teams validate reconciliation accuracy, this workflow can be embedded into recurring processes such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;lead enrichment validation before CRM ingestion
&lt;/li&gt;
&lt;li&gt;supplier master data alignment
&lt;/li&gt;
&lt;li&gt;post-migration entity reconciliation
&lt;/li&gt;
&lt;li&gt;data quality monitoring across system boundaries
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because the interface is standard HTTP, reconciliation becomes a reusable infrastructure component rather than a bespoke project.&lt;/p&gt;
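&lt;p&gt;One way to treat it as a component is to centralize the request shape in a small helper that every pipeline reuses, so thresholds and preprocessing flags change in one place. A sketch (the payload shape mirrors the example call earlier in the post; the function name and defaults are illustrative):&lt;/p&gt;

```python
def build_reconcile_payload(new_records, reference_records,
                            top_n=1, similarity_threshold=0.7):
    """One canonical request body, shared by every pipeline that reconciles."""
    return {
        "data_a": list(new_records),
        "data_b": list(reference_records),
        "config": {
            "top_n": top_n,
            "similarity_threshold": similarity_threshold,
            "remove_punctuation": True,
            "to_lowercase": True,
        },
    }

# Any HTTP-capable environment can then send the same body
payload = build_reconcile_payload(["Acme Corporation"], ["Acme Corp", "Delta LLC"])
```

&lt;p&gt;Whether the body is sent from a notebook, an orchestration task, or a backend service, the matching configuration stays consistent across all of them.&lt;/p&gt;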




&lt;h2&gt;
  
  
  Final word
&lt;/h2&gt;

&lt;p&gt;At scale, reconciliation is not a similarity-function problem — it is a &lt;strong&gt;candidate-generation and infrastructure problem&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Similarity API enables teams to match asymmetric datasets quickly without building custom pipelines for blocking, ranking, and normalization.&lt;/p&gt;

&lt;p&gt;Instead of engineering reconciliation logic from scratch, you can focus on reviewing matches and acting on unified entity data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stop building matching infrastructure. Start operating on reconciled entities.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Get a free API token at &lt;a href="https://similarity-api.com/" rel="noopener noreferrer"&gt;https://similarity-api.com/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>databricks</category>
      <category>airflow</category>
      <category>dbt</category>
      <category>bigquery</category>
    </item>
    <item>
      <title>Why It Rarely Makes Sense to Build Fuzzy Matching Yourself in 2026</title>
      <dc:creator>Siyana Hristova</dc:creator>
      <pubDate>Thu, 19 Mar 2026 11:25:23 +0000</pubDate>
      <link>https://dev.to/siyana_hristova_900e581ee/why-it-rarely-makes-sense-to-build-fuzzy-matching-yourself-in-2026-1k9f</link>
      <guid>https://dev.to/siyana_hristova_900e581ee/why-it-rarely-makes-sense-to-build-fuzzy-matching-yourself-in-2026-1k9f</guid>
      <description>&lt;p&gt;Fuzzy matching finds records that refer to the same entity even when the text is not identical. It shows up everywhere: CRM deduplication, company name matching across systems, lead and account cleanup, product catalog cleanup, supplier matching, and post‑merger data reconciliation.&lt;/p&gt;

&lt;p&gt;In practice, that sounds much easier than it is.&lt;/p&gt;

&lt;h2&gt;
  
  
  The scale problem
&lt;/h2&gt;

&lt;p&gt;On small datasets, basic approaches can look good enough.&lt;/p&gt;

&lt;p&gt;At real operational scale, they stop being practical. Naive all‑to‑all comparison grows too fast, which is why workflows that seem fine on a sample often become slow, expensive, or unusable on large datasets.&lt;/p&gt;

&lt;h2&gt;
  
  
  The hidden pipeline problem
&lt;/h2&gt;

&lt;p&gt;The hard part is not just scoring string similarity.&lt;/p&gt;

&lt;p&gt;To make fuzzy matching work in production, teams usually have to build a full supporting pipeline around it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;preprocessing and normalization&lt;/li&gt;
&lt;li&gt;company suffix and token cleanup&lt;/li&gt;
&lt;li&gt;blocking and candidate generation&lt;/li&gt;
&lt;li&gt;threshold tuning&lt;/li&gt;
&lt;li&gt;batching and memory management&lt;/li&gt;
&lt;li&gt;evaluation and ongoing maintenance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of those steps affects both speed and match quality. For example, blocking and candidate generation are often necessary to make matching fast enough, but if they are designed poorly, they can quietly miss true matches.&lt;/p&gt;
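&lt;p&gt;A toy illustration of that recall risk, using only the Python standard library (first-letter blocking is intentionally crude here, chosen to make the missed pair obvious):&lt;/p&gt;

```python
from collections import defaultdict
from difflib import SequenceMatcher

names = ["Acme Company", "The Acme Company", "Beta Ltd"]

# Deliberately crude blocking: only names sharing a first letter
# are ever compared with each other.
blocks = defaultdict(list)
for name in names:
    blocks[name[0].lower()].append(name)

# "Acme Company" lands in block 'a', "The Acme Company" in block 't',
# so this pair is never scored, even though a direct comparison finds
# them highly similar:
score = SequenceMatcher(None, "acme company", "the acme company").ratio()
print(sorted(blocks))   # ['a', 'b', 't']
print(round(score, 2))  # 0.86, well above a typical 0.8 match threshold
```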

&lt;h2&gt;
  
  
  The real cost of building it yourself
&lt;/h2&gt;

&lt;p&gt;Even optimistic assumptions make DIY fuzzy matching more expensive than it first appears.&lt;/p&gt;

&lt;p&gt;According to U.S. Bureau of Labor Statistics data, the median software engineer salary is about $133k/year. When benefits and overhead are included, total employer cost is typically around 1.4× salary, which translates to roughly $90/hour loaded engineering cost.&lt;/p&gt;

&lt;p&gt;If a team builds an internal fuzzy‑matching pipeline in just 2 weeks (≈80 engineering hours), the implementation cost alone is roughly:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;≈ $7,200 in engineering time (80 hours × $90/hour)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This excludes ongoing tuning, maintenance, infrastructure cost, and the risk of degraded match quality at larger scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  The math with Similarity API
&lt;/h2&gt;

&lt;p&gt;Using Similarity API changes the cost structure completely.&lt;/p&gt;

&lt;p&gt;Assume:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;5 hours of engineering time to evaluate, integrate, and operationalize the API&lt;/li&gt;
&lt;li&gt;Loaded engineering cost ≈ $90/hour&lt;/li&gt;
&lt;li&gt;API pricing $1.99 per 10,000 rows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a workload of 1,000,000 rows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Engineering setup cost ≈ $450&lt;/li&gt;
&lt;li&gt;API processing cost ≈ $199&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Total ≈ $649 to get production fuzzy matching on a 1M‑row dataset.&lt;/strong&gt;&lt;/p&gt;
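&lt;p&gt;The arithmetic can be sanity-checked in a few lines, using the rounded estimates above (a $90/hour loaded rate; all figures are approximations):&lt;/p&gt;

```python
HOURLY_LOADED = 90      # loaded engineering cost, $/hour (rounded estimate above)
DIY_HOURS = 80          # optimistic two-week internal build
API_SETUP_HOURS = 5     # evaluate, integrate, operationalize the API
API_RUN_COST = 199      # $1.99 per 10,000 rows x 100 batches = 1,000,000 rows

diy_build_cost = HOURLY_LOADED * DIY_HOURS                      # ~$7,200
api_first_run = HOURLY_LOADED * API_SETUP_HOURS + API_RUN_COST  # $649

# Monthly 1M-row runs needed before cumulative API spend reaches the DIY build
months_to_break_even = (diy_build_cost - HOURLY_LOADED * API_SETUP_HOURS) / API_RUN_COST
print(api_first_run)                   # 649
print(round(months_to_break_even, 1))  # 33.9, i.e. roughly 3 years of monthly runs
```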

&lt;h2&gt;
  
  
  Why the tradeoff is clear
&lt;/h2&gt;

&lt;p&gt;Compared to a conservative DIY build cost of about $7,200, a team would need to run 1M rows every month for roughly 3 years before total Similarity API spend reaches the same level.&lt;/p&gt;

&lt;p&gt;And that comparison still ignores:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ongoing pipeline maintenance&lt;/li&gt;
&lt;li&gt;model tuning as data evolves&lt;/li&gt;
&lt;li&gt;engineering opportunity cost&lt;/li&gt;
&lt;li&gt;reliability risks in edge cases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most teams do not actually want a fuzzy‑matching project. They want correct matches at scale.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi4fzqfl3w3jla4sj9zcw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi4fzqfl3w3jla4sj9zcw.png" alt=" " width="800" height="509"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The practical conclusion
&lt;/h2&gt;

&lt;p&gt;Similarity API removes the need to design, implement, tune, and maintain a dedicated fuzzy‑matching pipeline.&lt;/p&gt;

&lt;p&gt;Instead of investing weeks of engineering effort upfront and carrying long‑term maintenance risk, teams can call an API built specifically for large‑scale deduplication and reconciliation — and move on to higher‑leverage work.&lt;/p&gt;

&lt;p&gt;In 2026, for most real workloads, that is simply the more rational engineering and financial decision.&lt;/p&gt;

&lt;p&gt;Try it for free at &lt;a href="https://similarity-api.com/try-it" rel="noopener noreferrer"&gt;https://similarity-api.com/try-it&lt;/a&gt;&lt;/p&gt;

</description>
      <category>databricks</category>
      <category>aws</category>
      <category>airflow</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Fuzzy-match 1M rows in under 10 minutes (2026 Edition)</title>
      <dc:creator>Siyana Hristova</dc:creator>
      <pubDate>Wed, 11 Mar 2026 12:49:49 +0000</pubDate>
      <link>https://dev.to/siyana_hristova_900e581ee/fuzzy-match-1m-rows-in-under-10-minutes-2026-edition-4eh9</link>
      <guid>https://dev.to/siyana_hristova_900e581ee/fuzzy-match-1m-rows-in-under-10-minutes-2026-edition-4eh9</guid>
      <description>&lt;p&gt;Duplicate records are easy to ignore until they're everywhere. &lt;/p&gt;

&lt;p&gt;Whether it's three versions of "Acme, Inc." in your CRM, a messy lead import, or a post-merger database reconciliation, fuzzy matching is the only way to find records that refer to the same entity when exact string matches fail.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Scaling Wall: Why DIY Fails
&lt;/h2&gt;

&lt;p&gt;Fuzzy matching sounds simple on a 1,000-row sample. But at scale, the math changes. A naive all-to-all comparison scales at O(N²). Once you hit 100k+ rows, the comparison space explodes, and your local script or SQL workflow will grind to a halt.&lt;/p&gt;
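&lt;p&gt;The quadratic blow-up is easy to quantify:&lt;/p&gt;

```python
def candidate_pairs(n: int) -> int:
    """Unordered comparisons in a naive all-to-all dedupe: n(n-1)/2."""
    return n * (n - 1) // 2

print(candidate_pairs(1_000))      # 499500: fine on a laptop
print(candidate_pairs(100_000))    # 4999950000: ~5 billion comparisons
print(candidate_pairs(1_000_000))  # 499999500000: ~half a trillion
```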

&lt;p&gt;I spent a long time trying to build these pipelines myself. Most of us start with a simple Python script and end up building a monster. You quickly find yourself manually managing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure:&lt;/strong&gt; Blocking, indexing, and parallelization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tuning:&lt;/strong&gt; Endless threshold tweaking and "brittle" regex cleanup.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintenance:&lt;/strong&gt; Keeping custom pipelines alive as your data volume grows.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result? Your "simple task" turns into a permanent engineering tax.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Technical Edge: Adaptive Preprocessing
&lt;/h2&gt;

&lt;p&gt;The hardest part of fuzzy matching isn't just the comparison—it's the cleaning. Similarity API uses an internal engine that adapts its strategy depending on the input size and noise level.&lt;/p&gt;

&lt;p&gt;Unlike local libraries that force you to write your own cleanup code, this engine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Adapts to Dataset Structure:&lt;/strong&gt; Automatically adjusts normalization strategies based on string length and density.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimized for Scale:&lt;/strong&gt; Preprocessing is baked into the matching pipeline, ensuring that even at 1M+ rows, the "cleanup" phase doesn't become a bottleneck.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuration over Code:&lt;/strong&gt; You don't write cleaning scripts; you toggle parameters like &lt;code&gt;token_sort&lt;/code&gt; or &lt;code&gt;remove_punctuation&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Solution: A Production-Ready Infrastructure
&lt;/h2&gt;

&lt;p&gt;After testing various approaches, I started leaning into Similarity API for my own professional workflows. It is a hosted, paid infrastructure service designed for high-performance deduplication.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Value Prop:&lt;/strong&gt; You aren't just buying speed; you're buying a production-ready component. By offloading matching to a dedicated API, you move the complexity out of your codebase and into a scalable, managed environment.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💰 &lt;strong&gt;Note: Professional Infrastructure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is not a free, community-maintained library. Similarity API is a commercial service. You will need to sign up for an API key, and it operates on a usage-based pricing model. Because it's a paid service, you get guaranteed uptime and dedicated support. If you are building tools for your company, offloading this to a paid service is a small price to pay to avoid the "engineering tax" of maintaining custom matching code.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Integration: Build Once, Automate Forever
&lt;/h2&gt;

&lt;p&gt;While the example below runs easily in a notebook for prototyping, the real power is embedding this into repeatable production workflows.&lt;/p&gt;

&lt;p&gt;For smaller datasets, the direct API call is the fastest route. However, if your dataset exceeds 10MB, you should use the specialized File Upload endpoint, which is designed to handle larger batches efficiently.&lt;/p&gt;

&lt;p&gt;Because it is a standard REST API, you can integrate it into any environment that supports HTTP requests:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Code-First:&lt;/strong&gt; Airflow, Prefect, GitHub Actions, or Python/Node.js backend services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No-Code/Low-Code:&lt;/strong&gt; n8n, Zapier, Make.com, or Retool.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise:&lt;/strong&gt; Databricks, Snowflake, or AWS Lambda jobs.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="c1"&gt;# Professional-grade matching requires a paid API key
&lt;/span&gt;&lt;span class="n"&gt;API_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_PRODUCTION_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;API_URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.similarity-api.com/dedupe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Load your production dataset
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;large_dataset.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;strings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;company_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Define your configuration
&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;config&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;similarity_threshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;remove_punctuation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to_lowercase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;use_token_sort&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;index_pairs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# The API handles the orchestration and scaling automatically
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;API_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;API_KEY&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                         &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Workflow Complete: Found &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; duplicates.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  ⏱️ The Honest "10-Minute" Claim
&lt;/h2&gt;

&lt;p&gt;I claim you can dedupe 1M rows in under 10 minutes. Here is the math:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;7 Minutes:&lt;/strong&gt; The time the engine actually takes to crunch through 1,000,000 rows (based on my public benchmarks).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3 Minutes:&lt;/strong&gt; The time it takes for you to copy the code above, paste it into Colab, and grab a coffee while it runs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're faster at copy-pasting, you might even finish in 8.&lt;/p&gt;

&lt;p&gt;Want to prove it yourself? Don't take my word for it. I keep the methodology transparent—because when you pay for infrastructure, you should know exactly what you're getting.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Word
&lt;/h2&gt;

&lt;p&gt;When data gets large, the hard part isn't the similarity function; it's the infrastructure. Similarity API is a service for teams that value engineering time over building custom deduplication scripts. It allows you to skip the pipeline work and get straight to the results: reviewing, merging, and acting on clean data.&lt;/p&gt;

&lt;p&gt;Explore full API docs on &lt;a href="https://similarity-api.com/documentation" rel="noopener noreferrer"&gt;https://similarity-api.com/documentation&lt;/a&gt;&lt;/p&gt;

</description>
      <category>databricks</category>
      <category>airflow</category>
      <category>awsbigdata</category>
      <category>snowflake</category>
    </item>
    <item>
      <title>Fuzzy-match millions of rows in Databricks (2026)</title>
      <dc:creator>Siyana Hristova</dc:creator>
      <pubDate>Wed, 25 Feb 2026 13:08:28 +0000</pubDate>
      <link>https://dev.to/siyana_hristova_900e581ee/fuzzy-match-millions-of-rows-in-databricks-2026-832</link>
      <guid>https://dev.to/siyana_hristova_900e581ee/fuzzy-match-millions-of-rows-in-databricks-2026-832</guid>
      <description>&lt;p&gt;When you fuzzy-match 10 million rows, you aren't "just comparing strings." A naïve dedupe implies roughly n(n−1)/2 ≈ 5×10¹³ potential pairs. At this scale, approaches that feel "quick" on small tables start to break.&lt;/p&gt;

&lt;p&gt;In Databricks, most teams reach for one of three options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Spark-native candidate generation (LSH/MinHash)&lt;/strong&gt;&lt;br&gt;
Fast to start, but you end up tuning a tradeoff between missed matches and huge candidate sets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Entity-resolution frameworks&lt;/strong&gt;&lt;br&gt;
Powerful, but often heavier than you want for "dedupe this column."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Custom Python scoring (UDFs / pandas UDFs)&lt;/strong&gt;&lt;br&gt;
Easy to prototype, but at large scale jobs become dominated by Python overhead, skew, and shuffles.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A practical approach is to let Databricks handle what it's best at (data access, ETL, governance) and offload the actual matching step to a service built specifically for high-scale deduplication.&lt;/p&gt;

&lt;p&gt;In this tutorial, we'll do that using Similarity API — an async "job" style matching service where you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;upload a dataset once (CSV or Parquet)&lt;/li&gt;
&lt;li&gt;start a job&lt;/li&gt;
&lt;li&gt;poll status&lt;/li&gt;
&lt;li&gt;then download results (as Parquet or CSV)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This doesn't eliminate all cost — you still export data and ingest results — but it avoids the most fragile part: doing pairwise matching inside Spark.&lt;/p&gt;
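&lt;p&gt;The start/poll/download loop can be sketched as a small helper. This is an illustrative pattern rather than the exact client code: &lt;code&gt;get_status&lt;/code&gt; stands in for whatever status call you wire up against the job endpoint, and is injected as a callable so the loop stays testable offline:&lt;/p&gt;

```python
import time

def wait_for_job(get_status, poll_interval_s=15, max_polls=240, sleep=time.sleep):
    """Poll a job-status callable until it reaches a terminal state.

    get_status: () -> str, e.g. "pending", "running", "completed", "failed".
    max_polls=240 at 15s intervals gives a one-hour budget.
    """
    for _ in range(max_polls):
        status = get_status()
        if status == "completed":
            return status
        if status == "failed":
            raise RuntimeError("matching job failed")
        sleep(poll_interval_s)
    raise TimeoutError("job did not finish within the polling budget")
```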

&lt;h2&gt;
  
  
  Why use Similarity API for the matching step?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Avoid Spark-side pairwise matching&lt;/strong&gt;: no cartesian joins or UDF-based scoring at scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Normalization options built in&lt;/strong&gt;: punctuation removal, lowercasing, token sorting, and a company_names preset that strips common business suffixes (Inc/LLC/Ltd/etc.).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deterministic output artifact&lt;/strong&gt;: the service returns a file you can land back into Delta (e.g., per-row annotations, membership maps, or match pairs).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proven at 1M+ scale&lt;/strong&gt;: &lt;a href="https://similarity-api.com/blog/speed-benchmarks" rel="noopener noreferrer"&gt;see the benchmark run&lt;/a&gt; (1M rows in ~7 minutes) and comparisons vs common fuzzy matching approaches.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The workflow: Databricks ↔ Similarity API
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Network egress&lt;/strong&gt;: This workflow assumes your Databricks compute can make outbound HTTPS requests to the Similarity API hostname. In many enterprise and some serverless setups, outbound internet/DNS is restricted by policy—if so, you'll need an admin to allow outbound access (or allowlist the API domain) for the notebook to reach the service.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Access (pricing + token)&lt;/strong&gt;: Similarity API is a paid service. To run this notebook you’ll need an API token—create an account to get one (there’s typically a free trial/credits for testing), then store the token in Databricks Secrets as &lt;code&gt;API_TOKEN&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Important detail about Similarity API's current output contract: for "row annotations," the result includes a &lt;code&gt;row_id&lt;/code&gt; that is the 0..n-1 positional index of the uploaded file. To join results back to your source table, we'll create an explicit index in Databricks and persist an &lt;code&gt;idx&lt;/code&gt; → &lt;code&gt;primary_key&lt;/code&gt; mapping.&lt;/p&gt;
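&lt;p&gt;The join-back itself is mechanical once that mapping exists. Shown with plain Python for clarity (in the notebook this becomes a Spark join on &lt;code&gt;idx = row_id&lt;/code&gt;; the mapping values and the &lt;code&gt;group_id&lt;/code&gt; annotation field are illustrative assumptions, not the service's exact schema):&lt;/p&gt;

```python
# idx -> pk mapping persisted in Step 1 (illustrative values)
idx_to_pk = {0: "C-1001", 1: "C-1002", 2: "C-1003"}

# Row annotations returned by the service, keyed by positional row_id;
# group_id here is an assumed example of a per-row annotation field
annotations = [
    {"row_id": 0, "group_id": 7},
    {"row_id": 2, "group_id": 7},
]

# Re-key each annotation by the real primary key
by_pk = {idx_to_pk[a["row_id"]]: a["group_id"] for a in annotations}
print(by_pk)  # {'C-1001': 7, 'C-1003': 7}
```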

&lt;h2&gt;
  
  
  Step 1 — Build a stable index and export a single-column parquet
&lt;/h2&gt;

&lt;p&gt;We create:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;idx&lt;/code&gt;: 0..n-1&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pk&lt;/code&gt;: your real primary key&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;value&lt;/code&gt;: the string column you want to dedupe&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then we write:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;an index map table to Delta (&lt;code&gt;idx&lt;/code&gt;, &lt;code&gt;pk&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;a single-column Parquet containing only &lt;code&gt;value&lt;/code&gt; (in the same row order) for upload
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql import functions as F
from pyspark.sql.window import Window
import time, os

# Config
SOURCE_TABLE  = "main.crm.customers"
PK_COLUMN     = "customer_id"    # change to your true PK
STRING_COLUMN = "company_name"   # change to your string column

# Use DBFS scheme for Spark paths
TMP_DIR_DBFS     = f"dbfs:/tmp/similarity_api/{int(time.time())}"
PARQUET_DIR_DBFS = f"{TMP_DIR_DBFS}/input_parquet"

base = (
    spark.table(SOURCE_TABLE)
    .select(
        F.col(PK_COLUMN).cast("string").alias("pk"),
        F.col(STRING_COLUMN).cast("string").alias("value"),
    )
    .where(F.col("value").isNotNull() &amp;amp; (F.length(F.trim(F.col("value"))) &amp;gt; 0))
)

# Create a deterministic 0..n-1 index by ordering on pk.
w = Window.orderBy(F.col("pk"))
indexed = base.withColumn("idx", (F.row_number().over(w) - 1).cast("long"))

# Persist idx → pk mapping for join-back
indexed.select("idx", "pk").write.mode("overwrite") \
    .format("delta").saveAsTable("main.crm.similarity_idx_map")

# Export ONLY the value column, in the same row order.
# Use coalesce(1) (not repartition(1)) to avoid a full shuffle.
indexed.select("value").coalesce(1).write.mode("overwrite") \
    .parquet(PARQUET_DIR_DBFS)

print("Wrote parquet to:", PARQUET_DIR_DBFS)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: &lt;code&gt;coalesce(1)&lt;/code&gt; makes Spark write a single &lt;code&gt;part-*.parquet&lt;/code&gt; data file (plus a few small metadata files). Similarity API currently returns one signed upload URL for one object, so this "single part file" approach is the simplest way to upload. For very large datasets, you'll eventually want multi-part ingestion (multiple files) or storage-native ingestion — this version is intentionally "works now."&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2 — Create a Similarity API job and upload the Parquet file
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import glob, requests

API_URL = "https://api.similarity-api.com"
TOKEN   = dbutils.secrets.get(scope="similarity", key="API_TOKEN")
headers = {"Authorization": f"Bearer {TOKEN}"}

payload = {
    "config": {
        "input_format":         "parquet",
        "similarity_threshold": 0.85,
        "use_case":             "company_names",
        "output_format":        "row_annotations",
        "output_file_format":   "parquet",
        "top_k":                50
        # If you upload a multi-column parquet later, add:
        # "input_column": "value"
    }
}

# Convert Spark path -&amp;gt; local driver path for Python file access
PARQUET_DIR_LOCAL = PARQUET_DIR_DBFS.replace("dbfs:", "/dbfs")
part_file = glob.glob(f"{PARQUET_DIR_LOCAL}/part-*.parquet")[0]

# 1) Create job
resp = requests.post(
    f"{API_URL}/dedupe/jobs",
    headers=headers,
    json=payload,
    timeout=120
)
resp.raise_for_status()
data = resp.json()
job_id     = data["job_id"]
upload_url = data["upload_url"]
print("job_id:", job_id)

# 2) Upload file bytes to signed URL
with open(part_file, "rb") as f:
    r = requests.put(
        upload_url,
        data=f,
        headers={"Content-Type": "application/octet-stream"},
        timeout=3600
    )
    r.raise_for_status()

# 3) Commit (starts the async run)
r = requests.post(
    f"{API_URL}/dedupe/jobs/{job_id}/commit",
    headers=headers,
    timeout=120
)
r.raise_for_status()
print("Committed. rows_total:", r.json().get("rows_total"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 3 — Poll, download results, and land back into Delta
&lt;/h2&gt;

&lt;p&gt;Similarity API returns a signed &lt;code&gt;result_url&lt;/code&gt; (HTTPS). Spark typically won't read HTTPS URLs directly as Parquet, so we download to DBFS first and then load with &lt;code&gt;spark.read.parquet&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time, requests, os

def wait_for_results(job_id: str) -&amp;gt; str:
    while True:
        resp = requests.get(
            f"{API_URL}/dedupe/jobs/{job_id}",
            headers=headers,
            timeout=120
        )
        resp.raise_for_status()
        res = resp.json()
        print(f"Stage: {res.get('stage')} ({res.get('progress')}%) | Status: {res.get('job_status')}")
        if res.get("job_status") == "completed":
            if "result_url" not in res:
                raise RuntimeError("Job completed but no result_url returned.")
            return res["result_url"]
        if res.get("job_status") == "failed":
            raise RuntimeError(f"Job failed: {res.get('error')}")
        time.sleep(10)

result_url = wait_for_results(job_id)

# Save results to DBFS (driver local path for Python)
OUT_DIR_DBFS  = f"{TMP_DIR_DBFS}/results"
OUT_DIR_LOCAL = OUT_DIR_DBFS.replace("dbfs:", "/dbfs")
os.makedirs(OUT_DIR_LOCAL, exist_ok=True)
local_path = f"{OUT_DIR_LOCAL}/result.parquet"

with requests.get(result_url, stream=True, timeout=3600) as r:
    r.raise_for_status()
    with open(local_path, "wb") as f:
        for chunk in r.iter_content(chunk_size=8 * 1024 * 1024):
            if chunk:
                f.write(chunk)

# Spark reads from dbfs:/...
results_df = spark.read.parquet(f"{OUT_DIR_DBFS}/result.parquet")
results_df.write.mode("overwrite") \
    .format("delta").saveAsTable("main.crm.similarity_results")

# Join back to your original pk using the idx map
idx_map = spark.table("main.crm.similarity_idx_map")  # idx, pk
joined = results_df.join(
    idx_map, results_df["row_id"] == idx_map["idx"], "left"
)
joined.write.mode("overwrite") \
    .format("delta").saveAsTable("main.crm.customers_dedupe_annotations")

print("Wrote Delta tables: main.crm.similarity_results, main.crm.customers_dedupe_annotations")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this point, you've got a Delta table keyed by your original &lt;code&gt;pk&lt;/code&gt; with whatever annotations Similarity API returned (representatives, membership, similarity scores, etc.). You can inspect schema with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spark.table("main.crm.customers_dedupe_annotations").printSchema()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Security note&lt;/strong&gt;: Similarity API processes uploaded data only to compute the requested matching results. Customer data is not sold, shared, or used for advertising or model training. To minimize exposure, this workflow exports only the single string column required for deduplication.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;At 10M rows, the bottleneck isn't string similarity — it's building a reliable end-to-end workflow that doesn't devolve into Spark shuffles, UDF overhead, and constant tuning.&lt;/p&gt;

&lt;p&gt;By letting Databricks handle data access and governance, and offloading the matching step to Similarity API, you get a workflow that's reproducible, configurable, and doesn't require maintaining custom matching infrastructure.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>dataengineering</category>
      <category>databricks</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Scaling Fuzzy Matching: From Local Scripts to Production Pipelines</title>
      <dc:creator>Siyana Hristova</dc:creator>
      <pubDate>Mon, 23 Feb 2026 09:38:25 +0000</pubDate>
      <link>https://dev.to/siyana_hristova_900e581ee/scaling-fuzzy-matching-from-local-scripts-to-production-pipelines-49i3</link>
      <guid>https://dev.to/siyana_hristova_900e581ee/scaling-fuzzy-matching-from-local-scripts-to-production-pipelines-49i3</guid>
      <description>&lt;p&gt;I’ve handled fuzzy matching across the spectrum: academic research, scrappy startups, and enterprise-grade production environments. While the core objective—deduplicating or reconciling "messy" data—remains the same, the engineering constraints shift drastically as your row count climbs.&lt;/p&gt;

&lt;p&gt;At its heart, fuzzy matching is a two-dimensional problem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Precision&lt;/strong&gt;: Defining similarity (Levenshtein, Jaro-Winkler, Cosine, etc.).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale&lt;/strong&gt;: Managing the computational cost of comparisons.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most tutorials focus on the first. This article focuses on the second: the operational "pain bands" that force you to change your architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Quadratic Trap: Why Size Matters
&lt;/h2&gt;

&lt;p&gt;The fundamental challenge of fuzzy matching is that it is natively a quadratic problem. A naive comparison of every record against every other record follows O(n²) complexity. This means that as your dataset grows, the computational effort doesn't just increase—it explodes.&lt;/p&gt;

&lt;p&gt;What works for 1,000 rows (1,000,000 comparisons) becomes an operational nightmare at 100,000 rows (10,000,000,000 comparisons). At this volume, the time and memory required to complete a single run exceed the limits of standard hardware. To survive, you must move from "compare everything" to "intelligent blocking and indexing."&lt;/p&gt;
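&lt;p&gt;To make that jump concrete, here is the arithmetic as a two-line sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def naive_comparisons(n):
    # The article's framing: a naive all-to-all comparison grows as n squared.
    return n * n

for n in (1_000, 100_000):
    print(f"{n:,} rows: {naive_comparisons(n):,} comparisons")
# 1,000 rows: 1,000,000 comparisons
# 100,000 rows: 10,000,000,000 comparisons
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;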

&lt;h2&gt;
  
  
  Small Scale: Up to 50k Rows
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The "Laptop Scale"
&lt;/h3&gt;

&lt;p&gt;At this volume, the overhead of a distributed system or a complex API is usually overkill. You can still afford to be slightly inefficient because the total compute time is measured in seconds or minutes, not hours.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Solutions
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Power Query / Excel Fuzzy Lookup&lt;/strong&gt;: Perfect for one-off analyst reconciliation. It’s accessible and requires zero code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenRefine&lt;/strong&gt;: A powerhouse for interactive clustering. If your data is "messy" (misspellings, varying formats), the human-in-the-loop approach here is unbeatable for accuracy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local Python Libraries&lt;/strong&gt;: Libraries like RapidFuzz or TheFuzz (formerly FuzzyWuzzy) allow you to bake matching into your scripts. RapidFuzz is significantly faster due to its C++ backbone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hosted APIs&lt;/strong&gt; (e.g., Similarity API): At this scale, a hosted API is "super cheap" (often free) and saves hours of implementation.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Preprocessing&lt;/strong&gt;: These APIs handle the heavy lifting of normalization—stripping whitespace, fixing casing, and removing punctuation—automatically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Domain Optimization&lt;/strong&gt;: Most are pre-optimized for specific use cases like company names, automatically handling legal suffixes (Inc, Ltd, Corp, GmbH) so "Apple" and "Apple Inc." match without custom logic.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
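&lt;p&gt;The primitive all of these tools build on can be sketched with the standard library's &lt;code&gt;difflib&lt;/code&gt; (RapidFuzz computes the same kind of normalized score, just orders of magnitude faster; the sample names below are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from difflib import SequenceMatcher

def ratio(a, b):
    # Normalized similarity in [0, 1]; RapidFuzz's fuzz.ratio is the same idea on a 0..100 scale.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

candidates = ["Acme Inc", "ACME Incorporated", "Globex Corp"]
best = max(candidates, key=lambda c: ratio("acme inc.", c))
print(best)  # Acme Inc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;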

&lt;h3&gt;
  
  
  Cost and Complexity
&lt;/h3&gt;

&lt;p&gt;The direct cost here is essentially zero (software-wise), but the engineering cost is in the "tweak-and-wait" cycle. You’ll spend time writing regex pre-processors and testing similarity thresholds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommended Option&lt;/strong&gt;: RapidFuzz. If you’re a developer, it's the fastest path to a working prototype without adding external dependencies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mid Scale: 50k–200k Rows
&lt;/h2&gt;

&lt;p&gt;This is where the quadratic growth starts to bite. A naive "all-against-all" comparison will likely crash your local machine or run for hours. You now need to introduce blocking (only comparing records that share a common key, like a ZIP code or a first initial).&lt;/p&gt;
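&lt;p&gt;A minimal sketch of blocking, using a deliberately crude first-letter key (real pipelines block on phonetic codes, n-grams, or attributes like ZIP code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from collections import defaultdict
from itertools import combinations

records = ["Acme Inc", "Acme GmbH", "Apple Ltd", "Apricot LLC", "Globex Corp", "Global Inc"]

# Block on the first character of the lowercased name; only compare within a block.
blocks = defaultdict(list)
for r in records:
    blocks[r[0].lower()].append(r)

pairs = [p for block in blocks.values() for p in combinations(block, 2)]
all_pairs = len(records) * (len(records) - 1) // 2
print(len(pairs), "candidate pairs instead of", all_pairs)  # 7 candidate pairs instead of 15
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Here 7 of the 15 possible pairs survive; with realistic keys the reduction is far more dramatic, at the cost of missing matches that land in different blocks.&lt;/p&gt;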

&lt;h3&gt;
  
  
  The Solutions
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DIY Blocking Pipelines&lt;/strong&gt;: You write logic to partition the data. This reduces the O(n²) problem to a series of smaller, manageable chunks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Splink&lt;/strong&gt;: An open-source Python library for probabilistic record linkage. It uses the Fellegi-Sunter model to "learn" how to match records based on patterns in your data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hosted APIs&lt;/strong&gt;: Similarity API becomes more attractive here because it handles the blocking and indexing logic under the hood. You simply send the data and get matches back.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cost and Complexity
&lt;/h3&gt;

&lt;p&gt;Complexity jumps significantly. You aren't just matching strings anymore; you’re managing an indexing strategy. If your blocking rules are too strict, you miss matches; too loose, and your compute bill (or wait time) skyrockets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommended Option&lt;/strong&gt;: Hosted APIs (Similarity API). At this scale, the time spent maintaining custom blocking logic often exceeds the cost of a managed service.&lt;/p&gt;

&lt;h2&gt;
  
  
  Large Scale: 200k–2M Rows
&lt;/h2&gt;

&lt;p&gt;You have officially left the realm of local processing. You now need a distributed environment or a highly optimized indexing engine.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Solutions
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Distributed Processing (Apache Spark / Databricks)&lt;/strong&gt;: This is the industry standard for big data. You distribute the O(n²) load across a cluster. It is incredibly powerful but requires a Data Engineer to maintain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Entity Resolution Engines&lt;/strong&gt;: Purpose-built software (like Senzing or Tilores) designed specifically for identity resolution and linking.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hosted APIs&lt;/strong&gt;: A robust Similarity API can process a million records in a few minutes by utilizing high-performance indexing. This provides a "cloud-native" way to get Spark-level performance without the Spark-level maintenance.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cost and Complexity
&lt;/h3&gt;

&lt;p&gt;The cost is now split between Compute (Cloud fees) and Headcount (Engineering time). Running a Spark cluster isn't cheap, and the time spent "tuning" the cluster for fuzzy joins is a hidden drain on productivity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommended Option&lt;/strong&gt;: Hosted APIs (Similarity API). It provides the best balance of "Time to Value" vs. "Performance" for recurring production workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  Very Large Scale: 2M+ Rows
&lt;/h2&gt;

&lt;p&gt;At this scale, you aren't just "matching"; you are performing Entity Resolution. You need persistent IDs that stay consistent even as the data changes over time.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Solutions
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Master Data Management (MDM) Platforms&lt;/strong&gt;: Enterprise suites (Informatica, Reltio) that handle the entire lifecycle of data. They are expensive and take months to implement.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector Databases&lt;/strong&gt;: Using embeddings and "Approximate Nearest Neighbor" (ANN) search to find matches in high-dimensional space.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hosted APIs&lt;/strong&gt;: Similarity API can be used as the matching engine for a custom MDM, providing the heavy-duty compute while your internal systems handle the "golden record" logic.&lt;/li&gt;
&lt;/ul&gt;
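&lt;p&gt;The vector idea in miniature (toy data, with a brute-force scan standing in for the ANN index a real vector database would use):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from collections import Counter
from math import sqrt

def bigram_vector(s):
    # Represent a string as counts of its character bigrams.
    s = s.lower()
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

def cosine(a, b):
    dot = sum(a[g] * b[g] for g in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

names = ["Acme Inc", "ACME Incorporated", "Globex Corp"]
vectors = {n: bigram_vector(n) for n in names}
query = bigram_vector("acme inc.")

# A vector database replaces this linear scan with an ANN index such as HNSW.
best = max(names, key=lambda n: cosine(query, vectors[n]))
print(best)  # Acme Inc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;ANN indexes make that lookup sublinear, which is exactly what vector databases productize.&lt;/p&gt;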

&lt;h3&gt;
  
  
  Cost and Complexity
&lt;/h3&gt;

&lt;p&gt;The scale demands a significant budget. MDM licenses can reach six figures, while DIY Vector DB solutions require specialized knowledge of machine learning and embedding models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommended Option&lt;/strong&gt;: Hosted APIs paired with an internal Entity Store. This allows you to scale the matching logic infinitely while keeping the business logic (your "Source of Truth") in-house.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Comparison Table: Choosing Your Path&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd2xd8nltr2c33hckfx4c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd2xd8nltr2c33hckfx4c.png" alt=" " width="800" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: Scale is a Strategy, Not a Bug
&lt;/h2&gt;

&lt;p&gt;Fuzzy matching is often treated as a "one-and-done" cleanup task. But as data grows, it quickly transforms into a significant architectural bottleneck. The goal isn't just to find the most accurate algorithm; it's to choose a path that balances computational cost, engineering maintenance, and iteration speed.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;At small scales&lt;/strong&gt;, don't over-engineer. Use a Hosted API to skip the preprocessing headache and move on to your actual work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;At mid-to-large scales&lt;/strong&gt;, recognize that you are no longer in "scripting" territory. Every hour spent debugging a Spark cluster or tuning a blocking rule is an hour not spent on your core product.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ultimately, the best fuzzy matching implementation is the one you don't have to think about. Whether you "buy" via a Hosted API or "build" via a distributed cluster, ensure your choice accounts for the O(n²) reality before your data lake becomes a data swamp.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>data</category>
      <category>dataengineering</category>
      <category>python</category>
    </item>
    <item>
      <title>I built a fuzzy matching engine that's 300x faster than RapidFuzz on 1M records</title>
      <dc:creator>Siyana Hristova</dc:creator>
      <pubDate>Tue, 17 Feb 2026 14:18:03 +0000</pubDate>
      <link>https://dev.to/siyana_hristova_900e581ee/i-built-a-fuzzy-matching-engine-thats-300x-faster-than-rapidfuzz-on-1m-records-1o00</link>
      <guid>https://dev.to/siyana_hristova_900e581ee/i-built-a-fuzzy-matching-engine-thats-300x-faster-than-rapidfuzz-on-1m-records-1o00</guid>
      <description>&lt;p&gt;Fuzzy matching is one of those tasks that feels "easy" until you hit real-world data volumes.&lt;/p&gt;

&lt;p&gt;If you’re comparing two strings, &lt;code&gt;fuzz.ratio("Microsoft", "Micsrosoft Corpp")&lt;/code&gt; works in microseconds. But what happens when you have to deduplicate a CRM with 1,000,000 rows?&lt;/p&gt;

&lt;p&gt;I spent the last few weeks benchmarking the "standard" Python ways to do this - RapidFuzz, TheFuzz, and Levenshtein - and I realized why everyone hates data cleaning: the O(N²) scaling wall is real.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The Benchmark: 10k to 1M Rows&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;I set up a head-to-head comparison in a standard Google Colab environment (2 vCPUs, 13GB RAM) using synthetic data with realistic typos (swaps, replacements, and "fat-finger" errors).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxkf7a0hn3bmy9b867rgs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxkf7a0hn3bmy9b867rgs.png" alt=" " width="800" height="275"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The "Wall"&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;At 10,000 records, RapidFuzz is a beast. It’s fast, optimized C++, and totally usable.&lt;/p&gt;

&lt;p&gt;But fuzzy matching at scale is fundamentally a "many-to-many" problem. When you double your data, you quadruple the work. By the time I hit 100,000 rows, RapidFuzz was taking over 20 minutes. At 1,000,000 rows, local libraries don't just get slow - they crash. You run out of RAM during the matrix construction or your CPU sits at 100% for three days.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How I Optimized for 1M+ Rows&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To get the Similarity API to finish a 1M-row dedupe in 7 minutes, I had to move away from naive loops and implement a dual-engine strategy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deterministic Indexing:&lt;/strong&gt; Instead of comparing every string to every other string (quadratic time), I use an adaptive indexing strategy that "blocks" similar strings together before the math starts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;N-Gram Vectorization:&lt;/strong&gt; I treat strings as high-dimensional vectors. This allows me to use optimized linear algebra libraries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Off-Heap Memory Management:&lt;/strong&gt; To prevent the "OOM (Out of Memory)" crashes common in Python, I use memory-mapping (&lt;code&gt;np.memmap&lt;/code&gt;) to process data larger than the available RAM.&lt;/li&gt;
&lt;/ul&gt;
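&lt;p&gt;The indexing idea can be sketched as an inverted trigram index: only rows that share at least one trigram ever become candidate pairs, and everything else is never compared (a toy version, not the production engine):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from collections import defaultdict
from itertools import combinations

def trigrams(s):
    s = "  " + s.lower() + " "  # pad so short strings still produce grams
    return {s[i:i + 3] for i in range(len(s) - 2)}

names = ["Microsoft", "Micsrosoft Corpp", "Oracle", "Orcale Inc"]

# Inverted index: trigram to the set of row ids containing it.
index = defaultdict(set)
for i, name in enumerate(names):
    for g in trigrams(name):
        index[g].add(i)

# Candidate pairs are rows sharing at least one trigram.
candidates = set()
for rows in index.values():
    candidates.update(combinations(sorted(rows), 2))
print(sorted(candidates))  # [(0, 1), (2, 3)]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;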

&lt;h3&gt;
  
  
  &lt;strong&gt;Stop building dedupe pipelines&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If you are a developer, your time is better spent building features than babysitting a 12-hour deduplication script that might crash at 99%.&lt;/p&gt;

&lt;p&gt;I’ve open-sourced the benchmark suite and the Google Colab environment I used so you can verify the numbers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/similarity-api/similarity-api-benchmarks/blob/main/fuzzy_matching_speed_benchmarks_2025" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;:&lt;/strong&gt; View the Benchmark Code — See how we handle 1M+ rows.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://colab.research.google.com/drive/1uEtWQ7HYCdykjL85bbg83KcABiF-3TQV?usp=sharing" rel="noopener noreferrer"&gt;Google Colab&lt;/a&gt;:&lt;/strong&gt; Run the Demo — Test the engine in your browser.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I’ve set up a free tier for the API that handles up to 100,000 records. You can generate a free token with &lt;a href="https://similarity-api.com/" rel="noopener noreferrer"&gt;a free sign up&lt;/a&gt;. It’s meant to be a low-friction way to test real-world data without having to spin up your own infrastructure.&lt;/p&gt;

&lt;p&gt;I’m also looking for a few people with very large datasets (5M+ rows) to help me stress-test the next version of the async engine. If you're hitting scale limits that current tools can't solve, feel free to reach out.&lt;/p&gt;

</description>
      <category>python</category>
      <category>showdev</category>
      <category>performance</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
