<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Siyana Hristova</title>
    <description>The latest articles on DEV Community by Siyana Hristova (@siyana_hristova_900e581ee).</description>
    <link>https://dev.to/siyana_hristova_900e581ee</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3770953%2F2ca04f47-ddf4-42c8-8235-be0725e997c0.png</url>
      <title>DEV Community: Siyana Hristova</title>
      <link>https://dev.to/siyana_hristova_900e581ee</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/siyana_hristova_900e581ee"/>
    <language>en</language>
    <item>
      <title>How to fuzzy-match 1M rows with dbt in under 10 minutes (2026 guide)</title>
      <dc:creator>Siyana Hristova</dc:creator>
      <pubDate>Sun, 29 Mar 2026 20:58:21 +0000</pubDate>
      <link>https://dev.to/siyana_hristova_900e581ee/how-to-fuzzy-match-1m-rows-with-dbt-in-under-10-minutes-2026-guide-2edg</link>
      <guid>https://dev.to/siyana_hristova_900e581ee/how-to-fuzzy-match-1m-rows-with-dbt-in-under-10-minutes-2026-guide-2edg</guid>
      <description>&lt;p&gt;Duplicate records rarely look like a priority at first — until they start breaking reporting, outreach, or reconciliation workflows.&lt;/p&gt;

&lt;p&gt;From slightly different versions of "Acme Inc" in a CRM to inconsistent supplier names across systems or messy post-merger datasets, fuzzy matching becomes essential whenever identical strings are no longer a reliable signal of the same real-world entity.&lt;/p&gt;

&lt;h2&gt;
  
  
  The scaling wall: why warehouse-native fuzzy matching breaks at scale
&lt;/h2&gt;

&lt;p&gt;Fuzzy matching looks simple on a 1,000-row sample. But at real scale, the math changes. A naive all-to-all comparison grows at &lt;strong&gt;O(N²)&lt;/strong&gt;. Once you hit 100k+ rows, comparison space explodes, and warehouse-native approaches become slow, expensive, or brittle.&lt;/p&gt;
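&lt;p&gt;A back-of-the-envelope sketch makes the quadratic growth concrete (plain Python, no dependencies):&lt;/p&gt;

```python
# Unique unordered pairs a naive all-to-all comparison must score: N * (N - 1) / 2
def naive_comparisons(n_rows: int) -> int:
    return n_rows * (n_rows - 1) // 2

for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9,} rows -> {naive_comparisons(n):>18,} comparisons")

# 1M rows means roughly 500 billion comparisons - far beyond what a
# per-pair similarity function can chew through in interactive time.
```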

&lt;p&gt;In practice, teams usually try a sequence of approaches before realizing the real complexity. They might start with warehouse similarity functions, hit performance limits, then move to Python or notebook experiments — only to discover new bottlenecks around memory usage, blocking strategy design, and data cleanup. At that point, what looked like a simple dedupe task starts turning into a permanent matching pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Blocking and candidate generation logic&lt;/li&gt;
&lt;li&gt;String normalization and suffix cleanup&lt;/li&gt;
&lt;li&gt;Threshold tuning and evaluation loops&lt;/li&gt;
&lt;li&gt;Parallelization and memory management&lt;/li&gt;
&lt;/ul&gt;
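&lt;p&gt;To make the first bullet concrete: "blocking" means only comparing records that share a cheap candidate key, instead of every possible pair. A toy sketch (the key choice here is purely illustrative):&lt;/p&gt;

```python
from collections import defaultdict

names = ["Acme Inc", "ACME Incorporated", "Beta LLC", "Beta Limited", "Gamma GmbH"]

# Group rows by a naive blocking key: the lowercased first token.
blocks = defaultdict(list)
for i, name in enumerate(names):
    blocks[name.lower().split()[0]].append(i)

# Only generate candidate pairs within each block.
candidate_pairs = [
    (a, b)
    for ids in blocks.values()
    for pos, a in enumerate(ids)
    for b in ids[pos + 1:]
]
print(candidate_pairs)  # → [(0, 1), (2, 3)] instead of all 10 possible pairs
```

Real blocking strategies need multiple keys and recall tuning, which is exactly the maintenance burden described above.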

&lt;p&gt;What started as a quick cleanup task quietly turns into ongoing engineering overhead.&lt;/p&gt;

&lt;h2&gt;
  
  
  The solution: call a production fuzzy-matching engine from dbt
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://similarity-api.com/" rel="noopener noreferrer"&gt;Similarity API&lt;/a&gt; is a hosted infrastructure service designed for high-performance deduplication and reconciliation.&lt;/p&gt;

&lt;p&gt;Instead of building and maintaining your own matching pipeline, you send the relevant strings to a dedicated matching engine optimized for noisy real-world data and large-scale workloads — then load the results back into your warehouse as a normal dbt model output.&lt;/p&gt;

&lt;h2&gt;
  
  
  The technical edge: adaptive preprocessing at scale
&lt;/h2&gt;

&lt;p&gt;In real workflows, fuzzy matching quality is determined as much by data preparation strategy as by the similarity metric itself.&lt;/p&gt;

&lt;p&gt;Local implementations often require teams to design custom normalization rules, suffix cleaning logic, token ordering heuristics, and blocking strategies — each of which must be tuned as datasets evolve.&lt;/p&gt;

&lt;p&gt;Similarity API embeds these steps directly into the matching engine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dataset-aware normalization: preprocessing adapts dynamically to string length, token density, and noise patterns&lt;/li&gt;
&lt;li&gt;Scale-optimized cleaning pipeline: preprocessing runs as part of the distributed matching flow, preventing cleanup stages from becoming bottlenecks at 1M+ rows&lt;/li&gt;
&lt;li&gt;Configuration instead of custom code: matching behaviour is controlled through parameters such as &lt;code&gt;similarity_threshold&lt;/code&gt;, &lt;code&gt;use_token_sort&lt;/code&gt;, and &lt;code&gt;remove_punctuation&lt;/code&gt;, rather than bespoke scripts&lt;/li&gt;
&lt;/ul&gt;
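&lt;p&gt;To see why a parameter like &lt;code&gt;use_token_sort&lt;/code&gt; exists, compare raw character similarity for reordered tokens. This sketch uses Python's &lt;code&gt;difflib&lt;/code&gt; as a stand-in metric, not the API's internal scorer:&lt;/p&gt;

```python
from difflib import SequenceMatcher

def ratio(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

a, b = "acme holdings inc", "inc acme holdings"

raw = ratio(a, b)  # character-level comparison penalizes token reordering
token_sorted = ratio(" ".join(sorted(a.split())),
                     " ".join(sorted(b.split())))  # sorted tokens are identical → 1.0

print(f"raw={raw:.2f} token_sorted={token_sorted:.2f}")
```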

&lt;p&gt;This architecture allows teams to focus on match review and downstream data actions rather than maintaining fragile preprocessing pipelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why fuzzy matching belongs in your dbt layer
&lt;/h2&gt;

&lt;p&gt;This guide is designed for a dbt Python model workflow.&lt;/p&gt;

&lt;p&gt;dbt is a strong execution surface for fuzzy matching because you can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;pull source data from your warehouse using dbt refs or sources&lt;/li&gt;
&lt;li&gt;call the matching API inside a repeatable transformation workflow&lt;/li&gt;
&lt;li&gt;materialize match results back into warehouse tables&lt;/li&gt;
&lt;li&gt;keep dedupe logic close to the rest of your analytics engineering stack&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, this means you can move from one-off cleanup to a reusable model that runs as part of your broader data pipeline.&lt;/p&gt;

&lt;p&gt;Before running this model, you will need a Similarity API production token.&lt;/p&gt;

&lt;p&gt;You can generate one from the Similarity API dashboard. The token is passed as a standard Bearer authorization header in the request.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you actually get back
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Example input
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Acme Inc"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ACME Incorporated"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Beta LLC"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Beta Limited"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Example output (index_pairs)
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.94&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.91&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each result represents two rows that likely refer to the same real-world entity, along with a similarity score.&lt;/p&gt;

&lt;p&gt;By default, the API returns index pairs, which you can join back to the staged input rows for review, clustering, or merge workflows.&lt;/p&gt;

&lt;p&gt;Output format is configurable — you can instead return string pairs, clustered groups of duplicates, or fully deduplicated record lists depending on your cleanup strategy.&lt;/p&gt;
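&lt;p&gt;Joining the index pairs back to the staged rows is a one-line &lt;code&gt;map&lt;/code&gt; per column. Using the example data above:&lt;/p&gt;

```python
import pandas as pd

strings = ["Acme Inc", "ACME Incorporated", "Beta LLC", "Beta Limited"]
index_pairs = [[0, 1, 0.94], [2, 3, 0.91]]  # [idx_1, idx_2, score]

pairs_df = pd.DataFrame(index_pairs, columns=["idx_1", "idx_2", "score"])
pairs_df["name_1"] = pairs_df["idx_1"].map(lambda i: strings[i])
pairs_df["name_2"] = pairs_df["idx_2"].map(lambda i: strings[i])
print(pairs_df)
```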

&lt;p&gt;The following dbt Python model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reads company names from an upstream dbt model&lt;/li&gt;
&lt;li&gt;sends them to the Similarity API&lt;/li&gt;
&lt;li&gt;returns duplicate index pairs joined back to the original strings
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dbt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;dbt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;materialized&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SIMILARITY_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;api_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.similarity-api.com/dedupe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;source_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dbt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stg_companies&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to_pandas&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;source_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;source_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;strings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;source_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;company_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loaded &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; rows from dbt ref(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;stg_companies&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;config&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;similarity_threshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.65&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;remove_punctuation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to_lowercase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;use_token_sort&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;index_pairs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;api_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Workflow complete: found &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; duplicate pairs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;idx_1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;idx_2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;company_name_1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;company_name_2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;dedupe_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;idx_1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;idx_2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;dedupe_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;company_name_1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dedupe_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;idx_1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;dedupe_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;company_name_2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dedupe_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;idx_2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;dedupe_df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;SIMILARITY_API_KEY&lt;/code&gt; environment variable just needs to be available to the dbt runtime (exported by your shell, scheduler, or orchestrator before &lt;code&gt;dbt run&lt;/code&gt;), and the resulting table can then feed review models, merge workflows, or downstream entity clustering.&lt;/p&gt;

&lt;h2&gt;
  
  
  The honest "under 10-minute" claim
&lt;/h2&gt;

&lt;p&gt;Here is how the timing works in practice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~7 minutes: benchmarked processing time for a 1M-row dataset in Similarity API. This varies with string length and duplicate density.&lt;/li&gt;
&lt;li&gt;~2 minutes: drop the model into your dbt project, set the API key, and run it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No blocking strategy design. No distributed compute tuning. No regex cleanup scripts.&lt;/p&gt;

&lt;h2&gt;
  
  
  From prototype to production
&lt;/h2&gt;

&lt;p&gt;The advantage of dbt is that this does not have to stay a one-off experiment.&lt;/p&gt;

&lt;p&gt;Once the model works, you can schedule it as part of your normal transformation workflow and build downstream logic on top of the output table:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;review likely duplicate pairs&lt;/li&gt;
&lt;li&gt;cluster entities before enrichment&lt;/li&gt;
&lt;li&gt;feed survivorship / merge logic&lt;/li&gt;
&lt;li&gt;monitor duplicate volume over time&lt;/li&gt;
&lt;/ul&gt;
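&lt;p&gt;Clustering is the step most teams build first on top of the pair table. A minimal union-find sketch that turns pairs into entity groups (the helper name is hypothetical):&lt;/p&gt;

```python
def cluster_pairs(n_rows, pairs):
    """Group row indices connected by duplicate pairs into clusters."""
    parent = list(range(n_rows))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    # Union the two endpoints of every matched pair.
    for a, b, _score in pairs:
        parent[find(a)] = find(b)

    # Collect rows by root; singletons are not duplicates, so drop them.
    clusters = {}
    for i in range(n_rows):
        clusters.setdefault(find(i), []).append(i)
    return [ids for ids in clusters.values() if len(ids) > 1]

pairs = [[0, 1, 0.94], [2, 3, 0.91], [1, 4, 0.88]]
print(cluster_pairs(5, pairs))  # → [[0, 1, 4], [2, 3]]
```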

&lt;p&gt;Because the interface is standard HTTP, the matching engine becomes a reusable data-quality component inside the same dbt workflow your team already maintains.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final word
&lt;/h2&gt;

&lt;p&gt;At large scale, fuzzy matching stops being a string-similarity problem and becomes an infrastructure problem.&lt;/p&gt;

&lt;p&gt;Similarity API is built for teams that prefer to spend engineering time on analytics and product logic — not on maintaining custom deduplication pipelines.&lt;/p&gt;

&lt;p&gt;Instead of weeks of pipeline work, you can run one dbt model and move straight to reviewing and acting on clean data.&lt;/p&gt;

&lt;p&gt;Stop building matching infrastructure. Start acting on clean entities.&lt;/p&gt;

</description>
      <category>dbt</category>
      <category>bigquery</category>
      <category>bigdata</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>How to Fuzzy-Match 1 Million Rows in BigQuery in under 10 minutes</title>
      <dc:creator>Siyana Hristova</dc:creator>
      <pubDate>Mon, 23 Mar 2026 14:20:18 +0000</pubDate>
      <link>https://dev.to/siyana_hristova_900e581ee/how-to-fuzzy-match-1-million-rows-in-bigquery-in-under-10-minutes-3hfn</link>
      <guid>https://dev.to/siyana_hristova_900e581ee/how-to-fuzzy-match-1-million-rows-in-bigquery-in-under-10-minutes-3hfn</guid>
      <description>&lt;p&gt;Duplicate records rarely look like a priority at first — until they start breaking reporting, outreach, or reconciliation workflows.&lt;/p&gt;

&lt;p&gt;From slightly different versions of "Acme Inc" in a CRM to inconsistent supplier names across systems or messy post-merger datasets, fuzzy matching becomes essential whenever identical strings are no longer a reliable signal of the same real-world entity.&lt;/p&gt;




&lt;h2&gt;
  
  
  The scaling wall: why warehouse-native fuzzy matching breaks at scale
&lt;/h2&gt;

&lt;p&gt;Fuzzy matching looks simple on a 1,000-row sample. But at real scale, the math changes. A naive all-to-all comparison grows at &lt;strong&gt;O(N²)&lt;/strong&gt;. Once you hit 100k+ rows, comparison space explodes, and local scripts or warehouse-native approaches become slow, expensive, or brittle.&lt;/p&gt;

&lt;p&gt;In practice, teams usually try a sequence of approaches before realizing the real complexity. They might start with warehouse similarity functions (such as edit distance or token similarity), hit performance limits, then switch to a quick Python script — only to discover new bottlenecks around memory usage, blocking strategy design, and data cleanup.&lt;/p&gt;

&lt;p&gt;At that point, what looked like a simple task starts turning into a permanent matching pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Blocking and candidate generation logic
&lt;/li&gt;
&lt;li&gt;String normalization and suffix cleanup
&lt;/li&gt;
&lt;li&gt;Threshold tuning and evaluation loops
&lt;/li&gt;
&lt;li&gt;Parallelization and memory management
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What started as a quick dedupe task quietly turns into ongoing engineering overhead.&lt;/p&gt;




&lt;h2&gt;
  
  
  The solution: call a production fuzzy-matching engine
&lt;/h2&gt;

&lt;p&gt;Similarity API is a hosted infrastructure service designed for high-performance deduplication and reconciliation.&lt;/p&gt;

&lt;p&gt;Instead of building and maintaining your own matching pipeline, you send the dataset to a dedicated matching engine optimized for noisy real-world data and large-scale workloads.&lt;/p&gt;




&lt;h2&gt;
  
  
  The technical edge: adaptive preprocessing at scale
&lt;/h2&gt;

&lt;p&gt;In real workflows, fuzzy matching quality is determined as much by data preparation strategy as by the similarity metric itself.&lt;/p&gt;

&lt;p&gt;Local implementations often require teams to design custom normalization rules, suffix cleaning logic, token ordering heuristics, and blocking strategies — each of which must be tuned as datasets evolve.&lt;/p&gt;

&lt;p&gt;Similarity API embeds these steps directly into the matching engine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dataset-aware normalization:&lt;/strong&gt; preprocessing adapts dynamically to string length, token density, and noise patterns
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale-optimized cleaning pipeline:&lt;/strong&gt; preprocessing runs as part of the distributed matching flow, preventing cleanup stages from becoming bottlenecks at 1M+ rows
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuration instead of custom code:&lt;/strong&gt; matching behaviour is controlled through parameters such as &lt;code&gt;similarity_threshold&lt;/code&gt;, &lt;code&gt;use_token_sort&lt;/code&gt;, and &lt;code&gt;remove_punctuation&lt;/code&gt;, rather than bespoke scripts
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This architecture allows teams to focus on match review and downstream data actions rather than maintaining fragile preprocessing pipelines.&lt;/p&gt;




&lt;h2&gt;
  
  
  The BigQuery Notebook
&lt;/h2&gt;

&lt;p&gt;This guide is designed to run inside a &lt;strong&gt;BigQuery notebook environment&lt;/strong&gt; (Colab Enterprise integrated into BigQuery).&lt;/p&gt;

&lt;p&gt;These notebooks let you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Query production tables directly from BigQuery
&lt;/li&gt;
&lt;li&gt;Run Python data workflows without provisioning infrastructure
&lt;/li&gt;
&lt;li&gt;Call external APIs for heavy compute tasks
&lt;/li&gt;
&lt;li&gt;Write results back into BigQuery tables
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, this makes them an ideal surface for large-scale fuzzy matching workflows: data stays in the warehouse, while compute-intensive matching runs in a scalable external service.&lt;/p&gt;

&lt;p&gt;Before running the notebook cell, you will need a &lt;strong&gt;Similarity API production token&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You can generate one from the Similarity API dashboard. The token is passed as a standard Bearer authorization header in the request.&lt;/p&gt;




&lt;h2&gt;
  
  
  What you actually get back
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Example input
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;["Acme Inc", "ACME Incorporated", "Beta LLC", "Beta Limited"]&lt;/code&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Example output
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.94&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.91&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each result represents two rows that likely refer to the same real-world entity, along with a similarity score.&lt;/p&gt;

&lt;p&gt;By default, the API returns index pairs, which you can quickly join back to your BigQuery table for review or merge workflows.&lt;/p&gt;

&lt;p&gt;Output format is configurable — you can instead return string pairs, clustered groups of duplicates, or fully deduplicated record lists depending on your cleanup strategy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Notebook example
&lt;/h2&gt;

&lt;p&gt;The following code snippet:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;reads a dataset directly from BigQuery&lt;/li&gt;
&lt;li&gt;sends company names to the Similarity API&lt;/li&gt;
&lt;li&gt;returns duplicate index pairs
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;google.cloud&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;bigquery&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="c1"&gt;# ---- CONFIG ----
&lt;/span&gt;&lt;span class="n"&gt;PROJECT_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_PROJECT_ID&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;DATASET&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_DATASET&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;TABLE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_TABLE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;COLUMN&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;company_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;API_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_PRODUCTION_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;API_URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.similarity-api.com/dedupe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# ---- LOAD DATA FROM BIGQUERY ----
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bigquery&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;project&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
SELECT &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;COLUMN&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
FROM `&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;DATASET&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;TABLE&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;`
WHERE &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;COLUMN&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; IS NOT NULL
&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;strings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_dataframe&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="n"&gt;COLUMN&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loaded &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; rows from BigQuery&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ---- CALL SIMILARITY API ----
&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;config&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;similarity_threshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.65&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;remove_punctuation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to_lowercase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;use_token_sort&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;index_pairs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;API_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;API_KEY&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Workflow complete: found &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; duplicate pairs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# ---- OPTIONAL: SAVE RESULTS BACK TO BIGQUERY ----
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;dup_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;idx_1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;idx_2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="n"&gt;table_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PROJECT_ID&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;DATASET&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.dedupe_results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;job&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_table_from_dataframe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dup_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;table_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;job&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Saved results to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;table_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The honest "under 10-minute" claim
&lt;/h2&gt;

&lt;p&gt;Here is how the timing works in practice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~2 minutes: copy-paste the notebook cell, run the query, and start the job&lt;/li&gt;
&lt;li&gt;~7 minutes: benchmarked Similarity API processing time for a 1M-row dataset (varies with string length)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No blocking strategy design. No distributed compute tuning. No regex cleanup scripts.&lt;/p&gt;

&lt;h2&gt;
  
  
  From prototype to production
&lt;/h2&gt;

&lt;p&gt;Notebooks are ideal for validating matching quality and running one-off reconciliation jobs.&lt;/p&gt;

&lt;p&gt;In production, the same API call pattern can be embedded into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;scheduled BigQuery workflows&lt;/li&gt;
&lt;li&gt;Airflow or Prefect pipelines&lt;/li&gt;
&lt;li&gt;backend data services&lt;/li&gt;
&lt;li&gt;low-code automation tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because the interface is standard HTTP, the matching engine becomes a reusable data-quality component across your stack.&lt;/p&gt;
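&lt;p&gt;A minimal sketch of that reuse, with the same endpoint and config keys as the notebook above (the wrapper itself is hypothetical, not an official client):&lt;/p&gt;

```python
API_URL = "https://api.similarity-api.com/dedupe"

def build_dedupe_payload(strings, threshold=0.8):
    """Assemble the request body used throughout this article."""
    return {
        "data": strings,
        "config": {
            "similarity_threshold": threshold,
            "remove_punctuation": True,
            "to_lowercase": True,
            "output_format": "index_pairs",
        },
    }

def find_duplicates(strings, api_key, threshold=0.8, timeout=3600):
    """POST one batch of strings and return the duplicate index pairs."""
    import requests  # imported lazily so the payload builder has no HTTP dependency

    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json=build_dedupe_payload(strings, threshold),
        timeout=timeout,
    )
    response.raise_for_status()
    return response.json().get("response_data", [])
```

&lt;p&gt;Wrapped this way, the same function body drops into an Airflow task, a Prefect flow, or a backend service without modification.&lt;/p&gt;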

&lt;h2&gt;
  
  
  Final word
&lt;/h2&gt;

&lt;p&gt;At large scale, fuzzy matching stops being a string-similarity problem and becomes an infrastructure problem.&lt;/p&gt;

&lt;p&gt;Similarity API is built for teams that prefer to spend engineering time on analytics and product logic — not on maintaining custom deduplication pipelines.&lt;/p&gt;

&lt;p&gt;Instead of weeks of pipeline work, you can run a notebook cell and move straight to reviewing and acting on clean data.&lt;/p&gt;

</description>
      <category>bigquery</category>
      <category>fuzzymatching</category>
      <category>productivity</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How to Reconcile Salesforce Leads Against Contacts at Scale</title>
      <dc:creator>Siyana Hristova</dc:creator>
      <pubDate>Mon, 23 Mar 2026 13:41:59 +0000</pubDate>
      <link>https://dev.to/siyana_hristova_900e581ee/how-to-reconcile-salesforce-leads-against-contacts-at-scale-2nd4</link>
      <guid>https://dev.to/siyana_hristova_900e581ee/how-to-reconcile-salesforce-leads-against-contacts-at-scale-2nd4</guid>
      <description>&lt;p&gt;Duplicate identity records are almost inevitable in modern Salesforce environments.&lt;/p&gt;

&lt;p&gt;Leads enter the CRM from web forms, enrichment tools, outbound prospecting platforms, partner integrations, event uploads, product sign-ups, and manual entry. Even in well-governed systems, slight variations in names, emails, company formatting, and job titles accumulate over time.&lt;/p&gt;

&lt;p&gt;At scale, teams eventually need to answer practical operational questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which of our newly imported leads already exist as contacts?&lt;/li&gt;
&lt;li&gt;Who should own this inbound lead if the account already exists?&lt;/li&gt;
&lt;li&gt;How do we clean identity data before migrations or reporting resets?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where lead-to-contact reconciliation workflows emerge.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why teams run lead-to-contact reconciliation
&lt;/h2&gt;

&lt;p&gt;This workflow is typically driven by operational needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Reporting accuracy&lt;/strong&gt; — duplicate identities fragment attribution and pipeline analytics&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Routing correctness&lt;/strong&gt; — inbound leads often need to inherit ownership from existing accounts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Import risk reduction&lt;/strong&gt; — bulk uploads can create thousands of duplicates without pre-checks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automation enablement&lt;/strong&gt; — surfacing candidate matches enables auto-assignment and conversion rules&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Over time, reconciliation becomes a recurring RevOps capability rather than a one-off cleanup exercise.&lt;/p&gt;




&lt;h2&gt;
  
  
  What reconciliation workflows look like in practice
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pre-import identity checks
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Export existing contacts&lt;/li&gt;
&lt;li&gt;Compare new leads against the contact base&lt;/li&gt;
&lt;li&gt;Review high-confidence matches&lt;/li&gt;
&lt;li&gt;Merge or update records before import&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Scheduled identity cleanup jobs
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Compare recently created leads to contacts&lt;/li&gt;
&lt;li&gt;Write similarity scores or match IDs to custom fields&lt;/li&gt;
&lt;li&gt;Create review queues for RevOps teams&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Automation-driven identity resolution
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Apex triggers call external reconciliation endpoints before lead insert&lt;/li&gt;
&lt;li&gt;Salesforce Flows surface candidate matches for SDR review&lt;/li&gt;
&lt;li&gt;Nightly jobs reassign leads to existing account owners&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At this stage, similarity matching becomes part of operational CRM infrastructure.&lt;/p&gt;




&lt;h2&gt;
  
  
  Exact vs similarity matching in CRM reconciliation
&lt;/h2&gt;

&lt;p&gt;Traditional deduplication relies on exact matching — typically strict email equality or rule-based logic.&lt;/p&gt;

&lt;p&gt;Exact matching works well when identity signals are clean and standardized.&lt;/p&gt;

&lt;p&gt;In real go-to-market environments, identity data drifts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;People use multiple email addresses&lt;/li&gt;
&lt;li&gt;Company names appear in different formats&lt;/li&gt;
&lt;li&gt;Titles and suffixes vary&lt;/li&gt;
&lt;li&gt;Records are created across disconnected systems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Similarity-based matching addresses this ambiguity by asking:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Are these records likely to represent the same real-world person?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Exact matching remains a useful first filter.&lt;br&gt;
Similarity matching expands coverage to edge cases that strict rules cannot resolve at scale.&lt;/p&gt;
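&lt;p&gt;A tiny example of the difference, using Python's difflib as a stand-in scorer (the API's actual algorithm differs):&lt;/p&gt;

```python
import string
from difflib import SequenceMatcher

# Strip punctuation the way the article's config options do.
_PUNCT = str.maketrans("", "", string.punctuation)

def similarity(a, b):
    """Score two identity strings after lowercase/punctuation normalization."""
    a = a.lower().translate(_PUNCT)
    b = b.lower().translate(_PUNCT)
    return SequenceMatcher(None, a, b).ratio()

# Exact equality misses a one-character drift that similarity scoring catches.
print("Jane Doe" == "Janet Doe")                      # False
print(round(similarity("Jane Doe", "Janet Doe"), 2))  # 0.94
```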


&lt;h2&gt;
  
  
  How reconciliation pipelines typically work
&lt;/h2&gt;

&lt;p&gt;Conceptually, identity matching pipelines involve:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pre-processing&lt;/strong&gt; — normalize casing, punctuation, token order, and company suffixes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Similarity scoring&lt;/strong&gt; — compare identity strings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filtering&lt;/strong&gt; — retain matches above a defined confidence threshold&lt;/li&gt;
&lt;/ol&gt;
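&lt;p&gt;The three steps can be sketched locally in a few lines, here with difflib as an illustrative scorer and the brute-force nested loop that makes this approach collapse at scale:&lt;/p&gt;

```python
import string
from difflib import SequenceMatcher

def normalize(s):
    # Step 1: pre-processing: lowercase, strip punctuation, sort tokens.
    s = s.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(s.split()))

def match(leads, contacts, threshold=0.85):
    results = []
    for i, lead in enumerate(leads):  # brute force: every lead vs every contact
        for j, contact in enumerate(contacts):
            # Step 2: similarity scoring on normalized identity strings.
            score = SequenceMatcher(None, normalize(lead), normalize(contact)).ratio()
            # Step 3: filtering: keep pairs above the confidence threshold.
            if score >= threshold:
                results.append((i, j, round(score, 2)))
    return results

print(match(["Doe, Jane (Acme Inc.)"], ["jane doe acme inc"]))  # [(0, 0, 1.0)]
```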

&lt;p&gt;This approach works on small datasets.&lt;br&gt;
It becomes harder when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CRM datasets reach hundreds of thousands of records&lt;/li&gt;
&lt;li&gt;Identity drift occurs continuously through imports and enrichment&lt;/li&gt;
&lt;li&gt;Reconciliation must run automatically or on a frequent schedule&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that point, teams often move from ad-hoc scripts toward scalable matching infrastructure.&lt;/p&gt;


&lt;h2&gt;
  
  
  Replacing the pipeline with a single reconciliation call
&lt;/h2&gt;

&lt;p&gt;Instead of designing and maintaining a full matching pipeline, teams can use a reconciliation API.&lt;/p&gt;

&lt;p&gt;Example request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;lead_match_strings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;contact_match_strings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;config&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;similarity_threshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.82&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;top_n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to_lowercase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;remove_punctuation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;use_token_sort&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flat_table&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;res&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.similarity-api.com/reconcile&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;API_KEY&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;
&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A key design decision is defining the &lt;strong&gt;identity string&lt;/strong&gt; — commonly a combination of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First name&lt;/li&gt;
&lt;li&gt;Last name&lt;/li&gt;
&lt;li&gt;Email&lt;/li&gt;
&lt;li&gt;Company / account name&lt;/li&gt;
&lt;li&gt;Job title&lt;/li&gt;
&lt;/ul&gt;
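&lt;p&gt;A minimal sketch of composing that identity string from Salesforce-style field names (the field keys here are illustrative, and missing fields are simply skipped):&lt;/p&gt;

```python
def identity_string(record):
    """Concatenate the identity fields listed above into one match string."""
    parts = [
        record.get("FirstName", ""),
        record.get("LastName", ""),
        record.get("Email", ""),
        record.get("Company", ""),
        record.get("Title", ""),
    ]
    # Keep only populated fields so sparse records still produce a usable string.
    return " ".join(p.strip() for p in parts if p).strip()

lead = {"FirstName": "Jane", "LastName": "Doe",
        "Email": "jane@acme.com", "Company": "Acme Inc"}
print(identity_string(lead))  # Jane Doe jane@acme.com Acme Inc
```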




&lt;h2&gt;
  
  
  Example reconciliation output
&lt;/h2&gt;

&lt;p&gt;When using a flat table output format, matches are returned at row level:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;lead_index&lt;/th&gt;
&lt;th&gt;lead_identity&lt;/th&gt;
&lt;th&gt;contact_index&lt;/th&gt;
&lt;th&gt;contact_identity&lt;/th&gt;
&lt;th&gt;score&lt;/th&gt;
&lt;th&gt;matched&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Jane Doe&lt;/td&gt;
&lt;td&gt;&lt;a href="mailto:jane@acme.com"&gt;jane@acme.com&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Acme Inc&lt;/td&gt;
&lt;td&gt;1542&lt;/td&gt;
&lt;td&gt;Jane Doe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Jane Doe&lt;/td&gt;
&lt;td&gt;&lt;a href="mailto:jane@acme.com"&gt;jane@acme.com&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;Acme Inc&lt;/td&gt;
&lt;td&gt;9811&lt;/td&gt;
&lt;td&gt;Janet Doe&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Mark Lee&lt;/td&gt;
&lt;td&gt;&lt;a href="mailto:mark@north.io"&gt;mark@north.io&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;North IO&lt;/td&gt;
&lt;td&gt;2207&lt;/td&gt;
&lt;td&gt;Marc Lee&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These candidate matches can then power:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lead conversion workflows&lt;/li&gt;
&lt;li&gt;Ownership reassignment&lt;/li&gt;
&lt;li&gt;Deduplication review queues&lt;/li&gt;
&lt;li&gt;Automated CRM hygiene jobs&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;Lead-to-contact reconciliation is not just a data cleanup task.&lt;br&gt;
In high-volume Salesforce environments, it becomes a foundational operational capability.&lt;/p&gt;

&lt;p&gt;Teams that implement scalable identity matching gain:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More reliable pipeline attribution&lt;/li&gt;
&lt;li&gt;Cleaner account ownership signals&lt;/li&gt;
&lt;li&gt;Safer bulk imports&lt;/li&gt;
&lt;li&gt;Stronger automation across RevOps workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As CRM datasets grow, reconciliation workflows evolve from manual checks into continuous identity infrastructure.&lt;/p&gt;

&lt;p&gt;Try a 100k-row lead dedupe for free at &lt;a href="https://similarity-api.com/try-it" rel="noopener noreferrer"&gt;https://similarity-api.com/try-it&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>salesforce</category>
      <category>cleanleads</category>
      <category>crm</category>
      <category>revenueoperations</category>
    </item>
    <item>
      <title>How to fuzzy-match a 1M-row dataset to a canonical reference in under 10 minutes (2026 guide)</title>
      <dc:creator>Siyana Hristova</dc:creator>
      <pubDate>Fri, 20 Mar 2026 14:35:13 +0000</pubDate>
      <link>https://dev.to/siyana_hristova_900e581ee/how-to-fuzzy-match-a-1m-row-dataset-to-a-canonical-reference-in-under-10-minutes-2026-guide-3gp1</link>
      <guid>https://dev.to/siyana_hristova_900e581ee/how-to-fuzzy-match-a-1m-row-dataset-to-a-canonical-reference-in-under-10-minutes-2026-guide-3gp1</guid>
      <description>&lt;p&gt;Unifying operational data against a canonical reference is a foundational analytics task — and one that becomes surprisingly complex at scale.&lt;/p&gt;

&lt;p&gt;Whether you are matching a newly acquired CRM against an existing customer base, aligning vendor lists across procurement systems, or validating inbound leads before enrichment, reconciliation is the practical way to identify which records refer to the same real-world entities across datasets.&lt;/p&gt;




&lt;h2&gt;
  
  
  The scaling wall: why cross-dataset matching gets hard fast
&lt;/h2&gt;

&lt;p&gt;Matching problems often start with what looks like a manageable task: align a large operational dataset with a canonical reference table.&lt;/p&gt;

&lt;p&gt;But when that reference dataset contains hundreds of thousands or millions of rows, naive matching approaches quickly become impractical. A brute-force comparison of &lt;strong&gt;3,000 records against 1,000,000 candidates already implies billions of potential similarity checks&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In real workflows, teams typically try a sequence of approaches before realizing the full complexity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;warehouse similarity joins that become slow or expensive
&lt;/li&gt;
&lt;li&gt;Python scripts that run out of memory or require heavy batching
&lt;/li&gt;
&lt;li&gt;ad-hoc preprocessing logic for suffix cleanup and token normalization
&lt;/li&gt;
&lt;li&gt;fragile threshold tuning loops that must be revisited as data evolves
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What began as a simple reconciliation step can quietly turn into a long-term engineering burden.&lt;/p&gt;




&lt;h2&gt;
  
  
  The solution: use a purpose-built reconciliation engine
&lt;/h2&gt;

&lt;p&gt;Similarity API provides a hosted infrastructure layer designed specifically for large-scale &lt;strong&gt;A-to-B entity matching&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of engineering candidate-generation logic, blocking strategies, and distributed compute orchestration yourself, you send:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a smaller dataset (for example 3K inbound records)
&lt;/li&gt;
&lt;li&gt;a larger reference dataset (for example a 1M-row master table)
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The engine handles the matching workflow and returns the most likely corresponding entities.&lt;/p&gt;

&lt;p&gt;This lets teams focus on review, enrichment, and downstream automation rather than building matching infrastructure.&lt;/p&gt;




&lt;h2&gt;
  
  
  The technical edge: adaptive matching across asymmetric datasets
&lt;/h2&gt;

&lt;p&gt;Reconciliation is fundamentally different from deduplication because datasets are &lt;strong&gt;asymmetric in size and structure&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Local implementations typically require custom logic to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;generate candidate pools efficiently
&lt;/li&gt;
&lt;li&gt;normalize naming conventions across systems
&lt;/li&gt;
&lt;li&gt;tune similarity thresholds for different entity types
&lt;/li&gt;
&lt;li&gt;rank or filter multiple potential matches
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Similarity API embeds these steps directly into the matching engine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Adaptive candidate generation:&lt;/strong&gt; optimized search strategies reduce comparison space automatically
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dataset-aware normalization:&lt;/strong&gt; cleaning logic adapts to string density and noise patterns
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configurable ranking behavior:&lt;/strong&gt; parameters control match strictness and output structure
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This allows teams to run reconciliation workflows at scale without designing bespoke matching pipelines.&lt;/p&gt;




&lt;h2&gt;
  
  
  What you actually get back
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Example input datasets&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Dataset A (new records):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;["Acme Corporation", "Beta Solutions Ltd", "Gamma Tech"]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Dataset B (reference dataset excerpt):&lt;/p&gt;

&lt;p&gt;&lt;code&gt;["ACME Corp", "Beta Solutions Limited", "Delta Industries", "Gamma Technologies"]&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example reconciliation output (top match pairs)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.93&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.91&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.88&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each result represents a likely match between a record in the smaller dataset and a candidate in the larger reference dataset, along with a similarity score.&lt;/p&gt;

&lt;p&gt;Output format is configurable depending on workflow needs. Teams may choose to return:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;top match index pairs
&lt;/li&gt;
&lt;li&gt;ranked candidate lists
&lt;/li&gt;
&lt;li&gt;string match previews for validation
&lt;/li&gt;
&lt;li&gt;enriched reconciliation tables
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This flexibility allows the same matching engine to support exploratory validation, automated enrichment, or production reconciliation pipelines.&lt;/p&gt;
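&lt;p&gt;For instance, the index pairs from the example output above resolve back to readable name pairs with a simple lookup (assuming the [index_a, index_b, score] layout shown):&lt;/p&gt;

```python
# The two example datasets from this article.
data_a = ["Acme Corporation", "Beta Solutions Ltd", "Gamma Tech"]
data_b = ["ACME Corp", "Beta Solutions Limited", "Delta Industries", "Gamma Technologies"]

# Top match pairs as returned in the example output: [index_a, index_b, score].
matches = [[0, 0, 0.93], [1, 1, 0.91], [2, 3, 0.88]]

for ia, ib, score in matches:
    print(f"{data_a[ia]} -> {data_b[ib]}  (score {score})")
```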




&lt;h2&gt;
  
  
  Example reconciliation call
&lt;/h2&gt;

&lt;p&gt;This minimal Python example demonstrates the core workflow. In practice, the same call can be embedded into notebooks, orchestration pipelines, backend services, or analytics transformations.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;API_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_PRODUCTION_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;API_URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.similarity-api.com/reconcile&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;new_records&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Acme Corporation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Beta Solutions Ltd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Gamma Tech&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="n"&gt;reference_records&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_large_reference_dataset_somehow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;  &lt;span class="c1"&gt;# e.g. warehouse extract
&lt;/span&gt;
&lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;new_records&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;reference_records&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;config&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;top_n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;similarity_threshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;remove_punctuation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to_lowercase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;API_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;API_KEY&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;matches&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Found &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;matches&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; reconciliation matches&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
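&lt;p&gt;For production use, it is worth hardening this call: &lt;code&gt;raise_for_status()&lt;/code&gt; surfaces HTTP errors instead of silently parsing an error body, and a small retry loop absorbs transient failures. A minimal sketch (the retry count and backoff schedule are illustrative choices, not API requirements; the HTTP function is injected so the logic can be tested offline):&lt;/p&gt;

```python
import time

def post_with_retries(post, url, headers, payload, retries=3, timeout=3600,
                      sleep=time.sleep):
    """POST with simple exponential backoff for transient failures.

    `post` is the HTTP function to use, e.g. requests.post, passed in so
    the retry logic stays testable without a live endpoint.
    """
    for attempt in range(retries):
        try:
            resp = post(url, headers=headers, json=payload, timeout=timeout)
            resp.raise_for_status()  # surface 4xx/5xx instead of parsing an error body
            return resp.json().get("response_data", [])
        except OSError:  # requests.RequestException subclasses OSError
            if attempt == retries - 1:
                raise
            sleep(2 ** attempt)  # back off 1s, 2s, 4s between attempts
```

&lt;p&gt;Called as &lt;code&gt;post_with_retries(requests.post, API_URL, headers, payload)&lt;/code&gt; with the same URL, headers, and payload as above, it is a drop-in replacement for the direct call.&lt;/p&gt;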






&lt;h2&gt;
  
  
  The honest “under 10-minute” claim
&lt;/h2&gt;

&lt;p&gt;For a common workload such as &lt;strong&gt;reconciling ~3,000 inbound records against a 1M-row reference dataset&lt;/strong&gt;, runtime typically breaks down as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;~7 minutes:&lt;/strong&gt; matching and ranking performed by the reconciliation engine
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~2–3 minutes:&lt;/strong&gt; extracting the reference dataset and triggering the workflow
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No custom blocking logic. No distributed similarity joins. No manual candidate ranking pipelines.&lt;/p&gt;




&lt;h2&gt;
  
  
  From ad-hoc validation to production reconciliation
&lt;/h2&gt;

&lt;p&gt;Once teams validate reconciliation accuracy, this workflow can be embedded into recurring processes such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;lead enrichment validation before CRM ingestion
&lt;/li&gt;
&lt;li&gt;supplier master data alignment
&lt;/li&gt;
&lt;li&gt;post-migration entity reconciliation
&lt;/li&gt;
&lt;li&gt;data quality monitoring across system boundaries
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because the interface is standard HTTP, reconciliation becomes a reusable infrastructure component rather than a bespoke project.&lt;/p&gt;
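&lt;p&gt;One way to treat it as a component is to centralize the request shape in a small helper that every pipeline reuses, so thresholds and preprocessing flags change in one place. A sketch (the payload shape mirrors the example call earlier in the post; the function name and defaults are illustrative):&lt;/p&gt;

```python
def build_reconcile_payload(new_records, reference_records,
                            top_n=1, similarity_threshold=0.7):
    """One canonical request body, shared by every pipeline that reconciles."""
    return {
        "data_a": list(new_records),
        "data_b": list(reference_records),
        "config": {
            "top_n": top_n,
            "similarity_threshold": similarity_threshold,
            "remove_punctuation": True,
            "to_lowercase": True,
        },
    }

# Any HTTP-capable environment can then send the same body
payload = build_reconcile_payload(["Acme Corporation"], ["Acme Corp", "Delta LLC"])
```

&lt;p&gt;Whether the body is sent from a notebook, an orchestration task, or a backend service, the matching configuration stays consistent across all of them.&lt;/p&gt;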




&lt;h2&gt;
  
  
  Final word
&lt;/h2&gt;

&lt;p&gt;At scale, reconciliation is not a similarity-function problem — it is a &lt;strong&gt;candidate-generation and infrastructure problem&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Similarity API enables teams to match asymmetric datasets quickly without building custom pipelines for blocking, ranking, and normalization.&lt;/p&gt;

&lt;p&gt;Instead of engineering reconciliation logic from scratch, you can focus on reviewing matches and acting on unified entity data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stop building matching infrastructure. Start operating on reconciled entities.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Get a free API token at &lt;a href="https://similarity-api.com/" rel="noopener noreferrer"&gt;https://similarity-api.com/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>databricks</category>
      <category>airflow</category>
      <category>dbt</category>
      <category>bigquery</category>
    </item>
    <item>
      <title>Why It Rarely Makes Sense to Build Fuzzy Matching Yourself in 2026</title>
      <dc:creator>Siyana Hristova</dc:creator>
      <pubDate>Thu, 19 Mar 2026 11:25:23 +0000</pubDate>
      <link>https://dev.to/siyana_hristova_900e581ee/why-it-rarely-makes-sense-to-build-fuzzy-matching-yourself-in-2026-1k9f</link>
      <guid>https://dev.to/siyana_hristova_900e581ee/why-it-rarely-makes-sense-to-build-fuzzy-matching-yourself-in-2026-1k9f</guid>
      <description>&lt;p&gt;Fuzzy matching finds records that refer to the same entity even when the text is not identical. It shows up everywhere: CRM deduplication, company name matching across systems, lead and account cleanup, product catalog cleanup, supplier matching, and post‑merger data reconciliation.&lt;/p&gt;

&lt;p&gt;In practice, that sounds much easier than it is.&lt;/p&gt;

&lt;h2&gt;
  
  
  The scale problem
&lt;/h2&gt;

&lt;p&gt;On small datasets, basic approaches can look good enough.&lt;/p&gt;

&lt;p&gt;At real operational scale, they stop being practical. Naive all‑to‑all comparison grows too fast, which is why workflows that seem fine on a sample often become slow, expensive, or unusable on large datasets.&lt;/p&gt;

&lt;h2&gt;
  
  
  The hidden pipeline problem
&lt;/h2&gt;

&lt;p&gt;The hard part is not just scoring string similarity.&lt;/p&gt;

&lt;p&gt;To make fuzzy matching work in production, teams usually have to build a full supporting pipeline around it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;preprocessing and normalization&lt;/li&gt;
&lt;li&gt;company suffix and token cleanup&lt;/li&gt;
&lt;li&gt;blocking and candidate generation&lt;/li&gt;
&lt;li&gt;threshold tuning&lt;/li&gt;
&lt;li&gt;batching and memory management&lt;/li&gt;
&lt;li&gt;evaluation and ongoing maintenance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of those steps affects both speed and match quality. For example, blocking and candidate generation are often necessary to make matching fast enough, but if they are designed poorly, they can quietly miss true matches.&lt;/p&gt;
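&lt;p&gt;A toy illustration of that recall risk, using only the Python standard library (first-letter blocking is intentionally crude here, chosen to make the missed pair obvious):&lt;/p&gt;

```python
from collections import defaultdict
from difflib import SequenceMatcher

names = ["Acme Company", "The Acme Company", "Beta Ltd"]

# Deliberately crude blocking: only names sharing a first letter
# are ever compared with each other.
blocks = defaultdict(list)
for name in names:
    blocks[name[0].lower()].append(name)

# "Acme Company" lands in block 'a', "The Acme Company" in block 't',
# so this pair is never scored, even though a direct comparison finds
# them highly similar:
score = SequenceMatcher(None, "acme company", "the acme company").ratio()
print(sorted(blocks))   # ['a', 'b', 't']
print(round(score, 2))  # 0.86, well above a typical 0.8 match threshold
```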

&lt;h2&gt;
  
  
  The real cost of building it yourself
&lt;/h2&gt;

&lt;p&gt;Even optimistic assumptions make DIY fuzzy matching more expensive than it first appears.&lt;/p&gt;

&lt;p&gt;According to U.S. Bureau of Labor Statistics data, the median software engineer salary is about $133k/year. When benefits and overhead are included, total employer cost is typically around 1.4× salary, which translates to roughly $90/hour loaded engineering cost.&lt;/p&gt;

&lt;p&gt;If a team builds an internal fuzzy‑matching pipeline in just 2 weeks (≈80 engineering hours), the implementation cost alone is roughly:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;≈ $7,200 in engineering time (80 hours × $90/hour)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This excludes ongoing tuning, maintenance, infrastructure cost, and the risk of degraded match quality at larger scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  The math with Similarity API
&lt;/h2&gt;

&lt;p&gt;Using Similarity API changes the cost structure completely.&lt;/p&gt;

&lt;p&gt;Assume:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;5 hours of engineering time to evaluate, integrate, and operationalize the API&lt;/li&gt;
&lt;li&gt;Loaded engineering cost ≈ $90/hour&lt;/li&gt;
&lt;li&gt;API pricing $1.99 per 10,000 rows&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a workload of 1,000,000 rows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Engineering setup cost ≈ $450&lt;/li&gt;
&lt;li&gt;API processing cost ≈ $199&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Total ≈ $649 to get production fuzzy matching on a 1M‑row dataset.&lt;/strong&gt;&lt;/p&gt;
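&lt;p&gt;The arithmetic can be sanity-checked in a few lines, using the rounded estimates above (a $90/hour loaded rate; all figures are approximations):&lt;/p&gt;

```python
HOURLY_LOADED = 90      # loaded engineering cost, $/hour (rounded estimate above)
DIY_HOURS = 80          # optimistic two-week internal build
API_SETUP_HOURS = 5     # evaluate, integrate, operationalize the API
API_RUN_COST = 199      # $1.99 per 10,000 rows x 100 batches = 1,000,000 rows

diy_build_cost = HOURLY_LOADED * DIY_HOURS                      # ~$7,200
api_first_run = HOURLY_LOADED * API_SETUP_HOURS + API_RUN_COST  # $649

# Monthly 1M-row runs needed before cumulative API spend reaches the DIY build
months_to_break_even = (diy_build_cost - HOURLY_LOADED * API_SETUP_HOURS) / API_RUN_COST
print(api_first_run)                   # 649
print(round(months_to_break_even, 1))  # 33.9, i.e. roughly 3 years of monthly runs
```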

&lt;h2&gt;
  
  
  Why the tradeoff is clear
&lt;/h2&gt;

&lt;p&gt;Compared to a conservative DIY build cost of about $7,200, a team would need to run 1M rows every month for roughly 3 years before total Similarity API spend reaches the same level.&lt;/p&gt;

&lt;p&gt;And that comparison still ignores:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ongoing pipeline maintenance&lt;/li&gt;
&lt;li&gt;model tuning as data evolves&lt;/li&gt;
&lt;li&gt;engineering opportunity cost&lt;/li&gt;
&lt;li&gt;reliability risks in edge cases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most teams do not actually want a fuzzy‑matching project. They want correct matches at scale.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi4fzqfl3w3jla4sj9zcw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi4fzqfl3w3jla4sj9zcw.png" alt=" " width="800" height="509"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The practical conclusion
&lt;/h2&gt;

&lt;p&gt;Similarity API removes the need to design, implement, tune, and maintain a dedicated fuzzy‑matching pipeline.&lt;/p&gt;

&lt;p&gt;Instead of investing weeks of engineering effort upfront and carrying long‑term maintenance risk, teams can call an API built specifically for large‑scale deduplication and reconciliation — and move on to higher‑leverage work.&lt;/p&gt;

&lt;p&gt;In 2026, for most real workloads, that is simply the more rational engineering and financial decision.&lt;/p&gt;

&lt;p&gt;Try it for free at &lt;a href="https://similarity-api.com/try-it" rel="noopener noreferrer"&gt;https://similarity-api.com/try-it&lt;/a&gt;&lt;/p&gt;

</description>
      <category>databricks</category>
      <category>aws</category>
      <category>airflow</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Fuzzy-match 1M rows in under 10 minutes (2026 Edition)</title>
      <dc:creator>Siyana Hristova</dc:creator>
      <pubDate>Wed, 11 Mar 2026 12:49:49 +0000</pubDate>
      <link>https://dev.to/siyana_hristova_900e581ee/fuzzy-match-1m-rows-in-under-10-minutes-2026-edition-4eh9</link>
      <guid>https://dev.to/siyana_hristova_900e581ee/fuzzy-match-1m-rows-in-under-10-minutes-2026-edition-4eh9</guid>
      <description>&lt;p&gt;Duplicate records are easy to ignore until they're everywhere. &lt;/p&gt;

&lt;p&gt;Whether it's three versions of "Acme, Inc." in your CRM, a messy lead import, or a post-merger database reconciliation, fuzzy matching is the only way to find records that refer to the same entity when exact string matches fail.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Scaling Wall: Why DIY Fails
&lt;/h2&gt;

&lt;p&gt;Fuzzy matching sounds simple on a 1,000-row sample. But at scale, the math changes. A naive all-to-all comparison scales at O(N²). Once you hit 100k+ rows, the comparison space explodes, and your local script or SQL workflow will grind to a halt.&lt;/p&gt;
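&lt;p&gt;The quadratic blow-up is easy to quantify:&lt;/p&gt;

```python
def candidate_pairs(n: int) -> int:
    """Unordered comparisons in a naive all-to-all dedupe: n(n-1)/2."""
    return n * (n - 1) // 2

print(candidate_pairs(1_000))      # 499500: fine on a laptop
print(candidate_pairs(100_000))    # 4999950000: ~5 billion comparisons
print(candidate_pairs(1_000_000))  # 499999500000: ~half a trillion
```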

&lt;p&gt;I spent a long time trying to build these pipelines myself. Most of us start with a simple Python script and end up building a monster. You quickly find yourself manually managing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Infrastructure:&lt;/strong&gt; Blocking, indexing, and parallelization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tuning:&lt;/strong&gt; Endless threshold tweaking and "brittle" regex cleanup.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Maintenance:&lt;/strong&gt; Keeping custom pipelines alive as your data volume grows.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The result? Your "simple task" turns into a permanent engineering tax.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Technical Edge: Adaptive Preprocessing
&lt;/h2&gt;

&lt;p&gt;The hardest part of fuzzy matching isn't just the comparison—it's the cleaning. Similarity API uses an internal engine that adapts its strategy depending on the input size and noise level.&lt;/p&gt;

&lt;p&gt;Unlike local libraries that force you to write your own cleanup code, this engine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Adapts to Dataset Structure:&lt;/strong&gt; Automatically adjusts normalization strategies based on string length and density.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optimized for Scale:&lt;/strong&gt; Preprocessing is baked into the matching pipeline, ensuring that even at 1M+ rows, the "cleanup" phase doesn't become a bottleneck.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Configuration over Code:&lt;/strong&gt; You don't write cleaning scripts; you toggle parameters like &lt;code&gt;token_sort&lt;/code&gt; or &lt;code&gt;remove_punctuation&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Solution: A Production-Ready Infrastructure
&lt;/h2&gt;

&lt;p&gt;After testing various approaches, I started leaning into Similarity API for my own professional workflows. It is a hosted, paid infrastructure service designed for high-performance deduplication.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Value Prop:&lt;/strong&gt; You aren't just buying speed; you're buying a production-ready component. By offloading matching to a dedicated API, you move the complexity out of your codebase and into a scalable, managed environment.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💰 &lt;strong&gt;Note: Professional Infrastructure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is not a free, community-maintained library. Similarity API is a commercial service. You will need to sign up for an API key, and it operates on a usage-based pricing model. Because it's a paid service, you get guaranteed uptime and dedicated support. If you are building tools for your company, offloading this to a paid service is a small price to pay to avoid the "engineering tax" of maintaining custom matching code.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Integration: Build Once, Automate Forever
&lt;/h2&gt;

&lt;p&gt;While the example below runs easily in a notebook for prototyping, the real power is embedding this into repeatable production workflows.&lt;/p&gt;

&lt;p&gt;For smaller datasets, the direct API call is the fastest route. However, if your dataset exceeds 10MB, you should use the specialized File Upload endpoint, which is designed to handle larger batches efficiently.&lt;/p&gt;

&lt;p&gt;Because it is a standard REST API, you can integrate it into any environment that supports HTTP requests:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Code-First:&lt;/strong&gt; Airflow, Prefect, GitHub Actions, or Python/Node.js backend services.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No-Code/Low-Code:&lt;/strong&gt; n8n, Zapier, Make.com, or Retool.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise:&lt;/strong&gt; Databricks, Snowflake, or AWS Lambda jobs.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;

&lt;span class="c1"&gt;# Professional-grade matching requires a paid API key
&lt;/span&gt;&lt;span class="n"&gt;API_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_PRODUCTION_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;API_URL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.similarity-api.com/dedupe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Load your production dataset
&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;large_dataset.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;strings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;company_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;dropna&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Define your configuration
&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;strings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;config&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;similarity_threshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;remove_punctuation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to_lowercase&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;use_token_sort&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_format&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;index_pairs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# The API handles the orchestration and scaling automatically
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;API_URL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;API_KEY&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                         &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                         &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3600&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response_data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Workflow Complete: Found &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; duplicates.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  ⏱️ The Honest "10-Minute" Claim
&lt;/h2&gt;

&lt;p&gt;I claim you can dedupe 1M rows in under 10 minutes. Here is the math:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;7 Minutes:&lt;/strong&gt; The time the engine actually takes to crunch through 1,000,000 rows (based on my public benchmarks).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;3 Minutes:&lt;/strong&gt; The time it takes for you to copy the code above, paste it into Colab, and grab a coffee while it runs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're faster at copy-pasting, you might even finish in 8.&lt;/p&gt;

&lt;p&gt;Want to prove it yourself? Don't take my word for it. I keep the methodology transparent—because when you pay for infrastructure, you should know exactly what you're getting.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Word
&lt;/h2&gt;

&lt;p&gt;When data gets large, the hard part isn't the similarity function; it's the infrastructure. Similarity API is a service for teams that value engineering time over building custom deduplication scripts. It allows you to skip the pipeline work and get straight to the results: reviewing, merging, and acting on clean data.&lt;/p&gt;

&lt;p&gt;Explore full API docs on &lt;a href="https://similarity-api.com/documentation" rel="noopener noreferrer"&gt;https://similarity-api.com/documentation&lt;/a&gt;&lt;/p&gt;

</description>
      <category>databricks</category>
      <category>airflow</category>
      <category>awsbigdata</category>
      <category>snowflake</category>
    </item>
    <item>
      <title>Fuzzy-match millions of rows in Databricks (2026)</title>
      <dc:creator>Siyana Hristova</dc:creator>
      <pubDate>Wed, 25 Feb 2026 13:08:28 +0000</pubDate>
      <link>https://dev.to/siyana_hristova_900e581ee/fuzzy-match-millions-of-rows-in-databricks-2026-832</link>
      <guid>https://dev.to/siyana_hristova_900e581ee/fuzzy-match-millions-of-rows-in-databricks-2026-832</guid>
      <description>&lt;p&gt;When you fuzzy-match 10 million rows, you aren't "just comparing strings." A naïve dedupe implies roughly n(n−1)/2 ≈ 5×10¹³ potential pairs. At this scale, approaches that feel "quick" on small tables start to break.&lt;/p&gt;

&lt;p&gt;In Databricks, most teams reach for one of three options:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Spark-native candidate generation (LSH/MinHash)&lt;/strong&gt;&lt;br&gt;
Fast to start, but you end up tuning a tradeoff between missed matches and huge candidate sets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Entity-resolution frameworks&lt;/strong&gt;&lt;br&gt;
Powerful, but often heavier than you want for "dedupe this column."&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Custom Python scoring (UDFs / pandas UDFs)&lt;/strong&gt;&lt;br&gt;
Easy to prototype, but at large scale jobs become dominated by Python overhead, skew, and shuffles.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A practical approach is to let Databricks handle what it's best at (data access, ETL, governance) and offload the actual matching step to a service built specifically for high-scale deduplication.&lt;/p&gt;

&lt;p&gt;In this tutorial, we'll do that using Similarity API — an async "job" style matching service where you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;upload a dataset once (CSV or Parquet)&lt;/li&gt;
&lt;li&gt;start a job&lt;/li&gt;
&lt;li&gt;poll status&lt;/li&gt;
&lt;li&gt;then download results (as Parquet or CSV)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This doesn't eliminate all cost — you still export data and ingest results — but it avoids the most fragile part: doing pairwise matching inside Spark.&lt;/p&gt;
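&lt;p&gt;The start/poll/download loop can be sketched as a small helper. This is an illustrative pattern rather than the exact client code: &lt;code&gt;get_status&lt;/code&gt; stands in for whatever status call you wire up against the job endpoint, and is injected as a callable so the loop stays testable offline:&lt;/p&gt;

```python
import time

def wait_for_job(get_status, poll_interval_s=15, max_polls=240, sleep=time.sleep):
    """Poll a job-status callable until it reaches a terminal state.

    get_status: () -> str, e.g. "pending", "running", "completed", "failed".
    max_polls=240 at 15s intervals gives a one-hour budget.
    """
    for _ in range(max_polls):
        status = get_status()
        if status == "completed":
            return status
        if status == "failed":
            raise RuntimeError("matching job failed")
        sleep(poll_interval_s)
    raise TimeoutError("job did not finish within the polling budget")
```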

&lt;h2&gt;
  
  
  Why use Similarity API for the matching step?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Avoid Spark-side pairwise matching&lt;/strong&gt;: no cartesian joins or UDF-based scoring at scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Normalization options built in&lt;/strong&gt;: punctuation removal, lowercasing, token sorting, and a company_names preset that strips common business suffixes (Inc/LLC/Ltd/etc.).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deterministic output artifact&lt;/strong&gt;: the service returns a file you can land back into Delta (e.g., per-row annotations, membership maps, or match pairs).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proven at 1M+ scale&lt;/strong&gt;: &lt;a href="https://similarity-api.com/blog/speed-benchmarks" rel="noopener noreferrer"&gt;see the benchmark run&lt;/a&gt; (1M rows in ~7 minutes) and comparisons vs common fuzzy matching approaches.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The workflow: Databricks ↔ Similarity API
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Network egress&lt;/strong&gt;: This workflow assumes your Databricks compute can make outbound HTTPS requests to the Similarity API hostname. In many enterprise and some serverless setups, outbound internet/DNS is restricted by policy—if so, you'll need an admin to allow outbound access (or allowlist the API domain) for the notebook to reach the service.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Access (pricing + token)&lt;/strong&gt;: Similarity API is a paid service. To run this notebook you’ll need an API token—create an account to get one (there’s typically a free trial/credits for testing), then store the token in Databricks Secrets as &lt;code&gt;API_TOKEN&lt;/code&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Important detail about Similarity API's current output contract: for "row annotations," the result includes a &lt;code&gt;row_id&lt;/code&gt; that is the 0..n-1 positional index of the uploaded file. To join results back to your source table, we'll create an explicit index in Databricks and persist an &lt;code&gt;idx&lt;/code&gt; → &lt;code&gt;primary_key&lt;/code&gt; mapping.&lt;/p&gt;
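&lt;p&gt;The join-back itself is mechanical once that mapping exists. Shown with plain Python for clarity (in the notebook this becomes a Spark join on &lt;code&gt;idx = row_id&lt;/code&gt;; the mapping values and the &lt;code&gt;group_id&lt;/code&gt; annotation field are illustrative assumptions, not the service's exact schema):&lt;/p&gt;

```python
# idx -> pk mapping persisted in Step 1 (illustrative values)
idx_to_pk = {0: "C-1001", 1: "C-1002", 2: "C-1003"}

# Row annotations returned by the service, keyed by positional row_id;
# group_id here is an assumed example of a per-row annotation field
annotations = [
    {"row_id": 0, "group_id": 7},
    {"row_id": 2, "group_id": 7},
]

# Re-key each annotation by the real primary key
by_pk = {idx_to_pk[a["row_id"]]: a["group_id"] for a in annotations}
print(by_pk)  # {'C-1001': 7, 'C-1003': 7}
```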

&lt;h2&gt;
  
  
  Step 1 — Build a stable index and export a single-column parquet
&lt;/h2&gt;

&lt;p&gt;We create:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;idx&lt;/code&gt;: 0..n-1&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pk&lt;/code&gt;: your real primary key&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;value&lt;/code&gt;: the string column you want to dedupe&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then we write:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;an index map table to Delta (&lt;code&gt;idx&lt;/code&gt;, &lt;code&gt;pk&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;a single-column Parquet containing only &lt;code&gt;value&lt;/code&gt; (in the same row order) for upload
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql import functions as F
from pyspark.sql.window import Window
import time, os

# Config
SOURCE_TABLE  = "main.crm.customers"
PK_COLUMN     = "customer_id"    # change to your true PK
STRING_COLUMN = "company_name"   # change to your string column

# Use DBFS scheme for Spark paths
TMP_DIR_DBFS     = f"dbfs:/tmp/similarity_api/{int(time.time())}"
PARQUET_DIR_DBFS = f"{TMP_DIR_DBFS}/input_parquet"

base = (
    spark.table(SOURCE_TABLE)
    .select(
        F.col(PK_COLUMN).cast("string").alias("pk"),
        F.col(STRING_COLUMN).cast("string").alias("value"),
    )
    .where(F.col("value").isNotNull() &amp;amp; (F.length(F.trim(F.col("value"))) &amp;gt; 0))
)

# Create a deterministic 0..n-1 index by ordering on pk.
w = Window.orderBy(F.col("pk"))
indexed = base.withColumn("idx", (F.row_number().over(w) - 1).cast("long"))

# Persist idx → pk mapping for join-back
indexed.select("idx", "pk").write.mode("overwrite") \
    .format("delta").saveAsTable("main.crm.similarity_idx_map")

# Export ONLY the value column, in the same row order.
# Use coalesce(1) (not repartition(1)) to avoid a full shuffle.
indexed.select("value").coalesce(1).write.mode("overwrite") \
    .parquet(PARQUET_DIR_DBFS)

print("Wrote parquet to:", PARQUET_DIR_DBFS)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: &lt;code&gt;coalesce(1)&lt;/code&gt; makes Spark write a single &lt;code&gt;part-*.parquet&lt;/code&gt; data file (plus a few small metadata files). Similarity API currently returns one signed upload URL for one object, so this "single part file" approach is the simplest way to upload. For very large datasets, you'll eventually want multi-part ingestion (multiple files) or storage-native ingestion — this version is intentionally "works now."&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2 — Create a Similarity API job and upload the Parquet file
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import glob, requests

API_URL = "https://api.similarity-api.com"
TOKEN   = dbutils.secrets.get(scope="similarity", key="API_TOKEN")
headers = {"Authorization": f"Bearer {TOKEN}"}

payload = {
    "config": {
        "input_format":         "parquet",
        "similarity_threshold": 0.85,
        "use_case":             "company_names",
        "output_format":        "row_annotations",
        "output_file_format":   "parquet",
        "top_k":                50
        # If you upload a multi-column parquet later, add:
        # "input_column": "value"
    }
}

# Convert Spark path -&amp;gt; local driver path for Python file access
PARQUET_DIR_LOCAL = PARQUET_DIR_DBFS.replace("dbfs:", "/dbfs")
part_file = glob.glob(f"{PARQUET_DIR_LOCAL}/part-*.parquet")[0]

# 1) Create job
resp = requests.post(
    f"{API_URL}/dedupe/jobs",
    headers=headers,
    json=payload,
    timeout=120
)
resp.raise_for_status()
data = resp.json()
job_id     = data["job_id"]
upload_url = data["upload_url"]
print("job_id:", job_id)

# 2) Upload file bytes to signed URL
with open(part_file, "rb") as f:
    r = requests.put(
        upload_url,
        data=f,
        headers={"Content-Type": "application/octet-stream"},
        timeout=3600
    )
    r.raise_for_status()

# 3) Commit (starts the async run)
r = requests.post(
    f"{API_URL}/dedupe/jobs/{job_id}/commit",
    headers=headers,
    timeout=120
)
r.raise_for_status()
print("Committed. rows_total:", r.json().get("rows_total"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 3 — Poll, download results, and land back into Delta
&lt;/h2&gt;

&lt;p&gt;Similarity API returns a signed &lt;code&gt;result_url&lt;/code&gt; (HTTPS). Spark typically won't read HTTPS URLs directly as Parquet, so we download to DBFS first and then load with &lt;code&gt;spark.read.parquet&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import time, requests, os

def wait_for_results(job_id: str) -&amp;gt; str:
    while True:
        resp = requests.get(
            f"{API_URL}/dedupe/jobs/{job_id}",
            headers=headers,
            timeout=120
        )
        resp.raise_for_status()
        res = resp.json()
        print(f"Stage: {res.get('stage')} ({res.get('progress')}%) | Status: {res.get('job_status')}")
        if res.get("job_status") == "completed":
            if "result_url" not in res:
                raise RuntimeError("Job completed but no result_url returned.")
            return res["result_url"]
        if res.get("job_status") == "failed":
            raise RuntimeError(f"Job failed: {res.get('error')}")
        time.sleep(10)

result_url = wait_for_results(job_id)

# Save results to DBFS (driver local path for Python)
OUT_DIR_DBFS  = f"{TMP_DIR_DBFS}/results"
OUT_DIR_LOCAL = OUT_DIR_DBFS.replace("dbfs:", "/dbfs")
os.makedirs(OUT_DIR_LOCAL, exist_ok=True)
local_path = f"{OUT_DIR_LOCAL}/result.parquet"

with requests.get(result_url, stream=True, timeout=3600) as r:
    r.raise_for_status()
    with open(local_path, "wb") as f:
        for chunk in r.iter_content(chunk_size=8 * 1024 * 1024):
            if chunk:
                f.write(chunk)

# Spark reads from dbfs:/...
results_df = spark.read.parquet(f"{OUT_DIR_DBFS}/result.parquet")
results_df.write.mode("overwrite") \
    .format("delta").saveAsTable("main.crm.similarity_results")

# Join back to your original pk using the idx map
idx_map = spark.table("main.crm.similarity_idx_map")  # idx, pk
joined = results_df.join(
    idx_map, results_df["row_id"] == idx_map["idx"], "left"
)
joined.write.mode("overwrite") \
    .format("delta").saveAsTable("main.crm.customers_dedupe_annotations")

print("Wrote Delta tables: main.crm.similarity_results, main.crm.customers_dedupe_annotations")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At this point, you've got a Delta table keyed by your original &lt;code&gt;pk&lt;/code&gt; with whatever annotations Similarity API returned (representatives, membership, similarity scores, etc.). You can inspect schema with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spark.table("main.crm.customers_dedupe_annotations").printSchema()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Security note&lt;/strong&gt;: Similarity API processes uploaded data only to compute the requested matching results. Customer data is not sold, shared, or used for advertising or model training. To minimize exposure, this workflow exports only the single string column required for deduplication.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;At 10M rows, the bottleneck isn't string similarity — it's building a reliable end-to-end workflow that doesn't devolve into Spark shuffles, UDF overhead, and constant tuning.&lt;/p&gt;

&lt;p&gt;By letting Databricks handle data access and governance, and offloading the matching step to Similarity API, you get a workflow that's reproducible, configurable, and doesn't require maintaining custom matching infrastructure.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>dataengineering</category>
      <category>databricks</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Scaling Fuzzy Matching: From Local Scripts to Production Pipelines</title>
      <dc:creator>Siyana Hristova</dc:creator>
      <pubDate>Mon, 23 Feb 2026 09:38:25 +0000</pubDate>
      <link>https://dev.to/siyana_hristova_900e581ee/scaling-fuzzy-matching-from-local-scripts-to-production-pipelines-49i3</link>
      <guid>https://dev.to/siyana_hristova_900e581ee/scaling-fuzzy-matching-from-local-scripts-to-production-pipelines-49i3</guid>
      <description>&lt;p&gt;I’ve handled fuzzy matching across the spectrum: academic research, scrappy startups, and enterprise-grade production environments. While the core objective—deduplicating or reconciling "messy" data—remains the same, the engineering constraints shift drastically as your row count climbs.&lt;/p&gt;

&lt;p&gt;At its heart, fuzzy matching is a two-dimensional problem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Precision&lt;/strong&gt;: Defining similarity (Levenshtein, Jaro-Winkler, Cosine, etc.).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scale&lt;/strong&gt;: Managing the computational cost of comparisons.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most tutorials focus on the first. This article focuses on the second: the operational "pain bands" that force you to change your architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Quadratic Trap: Why Size Matters
&lt;/h2&gt;

&lt;p&gt;The fundamental challenge of fuzzy matching is that it is natively a quadratic problem. A naive comparison of every record against every other record follows O(n²) complexity. This means that as your dataset grows, the computational effort doesn't just increase—it explodes.&lt;/p&gt;

&lt;p&gt;What works for 1,000 rows (1,000,000 comparisons) becomes an operational nightmare at 100,000 rows (10,000,000,000 comparisons). At this volume, the time and memory required to complete a single run exceed the limits of standard hardware. To survive, you must move from "compare everything" to "intelligent blocking and indexing."&lt;/p&gt;
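&lt;p&gt;To make that jump concrete, here is the arithmetic as a two-line sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def naive_comparisons(n):
    # The article's framing: a naive all-to-all comparison grows as n squared.
    return n * n

for n in (1_000, 100_000):
    print(f"{n:,} rows: {naive_comparisons(n):,} comparisons")
# 1,000 rows: 1,000,000 comparisons
# 100,000 rows: 10,000,000,000 comparisons
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;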

&lt;h2&gt;
  
  
  Small Scale: Up to 50k Rows
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The "Laptop Scale"
&lt;/h3&gt;

&lt;p&gt;At this volume, the overhead of a distributed system or a complex API is usually overkill. You can still afford to be slightly inefficient because the total compute time is measured in seconds or minutes, not hours.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Solutions
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Power Query / Excel Fuzzy Lookup&lt;/strong&gt;: Perfect for one-off analyst reconciliation. It’s accessible and requires zero code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenRefine&lt;/strong&gt;: A powerhouse for interactive clustering. If your data is "messy" (misspellings, varying formats), the human-in-the-loop approach here is unbeatable for accuracy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local Python Libraries&lt;/strong&gt;: Libraries like RapidFuzz or TheFuzz (formerly FuzzyWuzzy) allow you to bake matching into your scripts. RapidFuzz is significantly faster due to its C++ backbone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hosted APIs&lt;/strong&gt; (e.g., Similarity API): At this scale, a hosted API is "super cheap" (often free) and saves hours of implementation.

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Preprocessing&lt;/strong&gt;: These APIs handle the heavy lifting of normalization—stripping whitespace, fixing casing, and removing punctuation—automatically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Domain Optimization&lt;/strong&gt;: Most are pre-optimized for specific use cases like company names, automatically handling legal suffixes (Inc, Ltd, Corp, GmbH) so "Apple" and "Apple Inc." match without custom logic.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
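&lt;p&gt;The primitive all of these tools build on can be sketched with the standard library's &lt;code&gt;difflib&lt;/code&gt; (RapidFuzz computes the same kind of normalized score, just orders of magnitude faster; the sample names below are illustrative):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from difflib import SequenceMatcher

def ratio(a, b):
    # Normalized similarity in [0, 1]; RapidFuzz's fuzz.ratio is the same idea on a 0..100 scale.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

candidates = ["Acme Inc", "ACME Incorporated", "Globex Corp"]
best = max(candidates, key=lambda c: ratio("acme inc.", c))
print(best)  # Acme Inc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;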

&lt;h3&gt;
  
  
  Cost and Complexity
&lt;/h3&gt;

&lt;p&gt;The direct cost here is essentially zero (software-wise), but the engineering cost is in the "tweak-and-wait" cycle. You’ll spend time writing regex pre-processors and testing similarity thresholds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommended Option&lt;/strong&gt;: RapidFuzz. If you’re a developer, it's the fastest path to a working prototype without adding external dependencies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Mid Scale: 50k–200k Rows
&lt;/h2&gt;

&lt;p&gt;This is where the quadratic growth starts to bite. A naive "all-against-all" comparison will likely crash your local machine or run for hours. You now need to introduce blocking (only comparing records that share a common key, like a ZIP code or a first initial).&lt;/p&gt;
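&lt;p&gt;A minimal sketch of blocking, using a deliberately crude first-letter key (real pipelines block on phonetic codes, n-grams, or attributes like ZIP code):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from collections import defaultdict
from itertools import combinations

records = ["Acme Inc", "Acme GmbH", "Apple Ltd", "Apricot LLC", "Globex Corp", "Global Inc"]

# Block on the first character of the lowercased name; only compare within a block.
blocks = defaultdict(list)
for r in records:
    blocks[r[0].lower()].append(r)

pairs = [p for block in blocks.values() for p in combinations(block, 2)]
all_pairs = len(records) * (len(records) - 1) // 2
print(len(pairs), "candidate pairs instead of", all_pairs)  # 7 candidate pairs instead of 15
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Here 7 of the 15 possible pairs survive; with realistic keys the reduction is far more dramatic, at the cost of missing matches that land in different blocks.&lt;/p&gt;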

&lt;h3&gt;
  
  
  The Solutions
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DIY Blocking Pipelines&lt;/strong&gt;: You write logic to partition the data. This reduces the O(n²) problem to a series of smaller, manageable chunks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Splink&lt;/strong&gt;: An open-source Python library for probabilistic record linkage. It uses the Fellegi-Sunter model to "learn" how to match records based on patterns in your data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hosted APIs&lt;/strong&gt;: Similarity API becomes more attractive here because it handles the blocking and indexing logic under the hood. You simply send the data and get matches back.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cost and Complexity
&lt;/h3&gt;

&lt;p&gt;Complexity jumps significantly. You aren't just matching strings anymore; you’re managing an indexing strategy. If your blocking rules are too strict, you miss matches; too loose, and your compute bill (or wait time) skyrockets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommended Option&lt;/strong&gt;: Hosted APIs (Similarity API). At this scale, the time spent maintaining custom blocking logic often exceeds the cost of a managed service.&lt;/p&gt;

&lt;h2&gt;
  
  
  Large Scale: 200k–2M Rows
&lt;/h2&gt;

&lt;p&gt;You have officially left the realm of local processing. You now need a distributed environment or a highly optimized indexing engine.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Solutions
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Distributed Processing (Apache Spark / Databricks)&lt;/strong&gt;: This is the industry standard for big data. You distribute the O(n²) load across a cluster. It is incredibly powerful but requires a Data Engineer to maintain.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Entity Resolution Engines&lt;/strong&gt;: Purpose-built software (like Senzing or Tilores) designed specifically for identity resolution and linking.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hosted APIs&lt;/strong&gt;: A robust Similarity API can process a million records in a few minutes by utilizing high-performance indexing. This provides a "cloud-native" way to get Spark-level performance without the Spark-level maintenance.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cost and Complexity
&lt;/h3&gt;

&lt;p&gt;The cost is now split between Compute (Cloud fees) and Headcount (Engineering time). Running a Spark cluster isn't cheap, and the time spent "tuning" the cluster for fuzzy joins is a hidden drain on productivity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommended Option&lt;/strong&gt;: Hosted APIs (Similarity API). It provides the best balance of "Time to Value" vs. "Performance" for recurring production workloads.&lt;/p&gt;

&lt;h2&gt;
  
  
  Very Large Scale: 2M+ Rows
&lt;/h2&gt;

&lt;p&gt;At this scale, you aren't just "matching"; you are performing Entity Resolution. You need persistent IDs that stay consistent even as the data changes over time.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Solutions
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Master Data Management (MDM) Platforms&lt;/strong&gt;: Enterprise suites (Informatica, Reltio) that handle the entire lifecycle of data. They are expensive and take months to implement.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector Databases&lt;/strong&gt;: Using embeddings and "Approximate Nearest Neighbor" (ANN) search to find matches in high-dimensional space.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hosted APIs&lt;/strong&gt;: Similarity API can be used as the matching engine for a custom MDM, providing the heavy-duty compute while your internal systems handle the "golden record" logic.&lt;/li&gt;
&lt;/ul&gt;
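&lt;p&gt;The vector idea in miniature (toy data, with a brute-force scan standing in for the ANN index a real vector database would use):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from collections import Counter
from math import sqrt

def bigram_vector(s):
    # Represent a string as counts of its character bigrams.
    s = s.lower()
    return Counter(s[i:i + 2] for i in range(len(s) - 1))

def cosine(a, b):
    dot = sum(a[g] * b[g] for g in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

names = ["Acme Inc", "ACME Incorporated", "Globex Corp"]
vectors = {n: bigram_vector(n) for n in names}
query = bigram_vector("acme inc.")

# A vector database replaces this linear scan with an ANN index such as HNSW.
best = max(names, key=lambda n: cosine(query, vectors[n]))
print(best)  # Acme Inc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;ANN indexes make that lookup sublinear, which is exactly what vector databases productize.&lt;/p&gt;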

&lt;h3&gt;
  
  
  Cost and Complexity
&lt;/h3&gt;

&lt;p&gt;The scale demands a significant budget. MDM licenses can reach six figures, while DIY Vector DB solutions require specialized knowledge of machine learning and embedding models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommended Option&lt;/strong&gt;: Hosted APIs paired with an internal Entity Store. This allows you to scale the matching logic infinitely while keeping the business logic (your "Source of Truth") in-house.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Comparison Table: Choosing Your Path&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd2xd8nltr2c33hckfx4c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd2xd8nltr2c33hckfx4c.png" alt=" " width="800" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: Scale is a Strategy, Not a Bug
&lt;/h2&gt;

&lt;p&gt;Fuzzy matching is often treated as a "one-and-done" cleanup task. But as data grows, it quickly transforms into a significant architectural bottleneck. The goal isn't just to find the most accurate algorithm; it's to choose a path that balances computational cost, engineering maintenance, and iteration speed.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;At small scales&lt;/strong&gt;, don't over-engineer. Use a Hosted API to skip the preprocessing headache and move on to your actual work.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;At mid-to-large scales&lt;/strong&gt;, recognize that you are no longer in "scripting" territory. Every hour spent debugging a Spark cluster or tuning a blocking rule is an hour not spent on your core product.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ultimately, the best fuzzy matching implementation is the one you don't have to think about. Whether you "buy" via a Hosted API or "build" via a distributed cluster, ensure your choice accounts for the O(n²) reality before your data lake becomes a data swamp.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>data</category>
      <category>dataengineering</category>
      <category>python</category>
    </item>
    <item>
      <title>I built a fuzzy matching engine that's 300x faster than RapidFuzz on 1M records</title>
      <dc:creator>Siyana Hristova</dc:creator>
      <pubDate>Tue, 17 Feb 2026 14:18:03 +0000</pubDate>
      <link>https://dev.to/siyana_hristova_900e581ee/i-built-a-fuzzy-matching-engine-thats-300x-faster-than-rapidfuzz-on-1m-records-1o00</link>
      <guid>https://dev.to/siyana_hristova_900e581ee/i-built-a-fuzzy-matching-engine-thats-300x-faster-than-rapidfuzz-on-1m-records-1o00</guid>
      <description>&lt;p&gt;Fuzzy matching is one of those tasks that feels "easy" until you hit real-world data volumes.&lt;/p&gt;

&lt;p&gt;If you’re comparing two strings, &lt;code&gt;fuzz.ratio("Microsoft", "Micsrosoft Corpp")&lt;/code&gt; works in microseconds. But what happens when you have to deduplicate a CRM with 1,000,000 rows?&lt;/p&gt;

&lt;p&gt;I spent the last few weeks benchmarking the "standard" Python ways to do this - RapidFuzz, TheFuzz, and Levenshtein - and I realized why everyone hates data cleaning: the O(N²) scaling wall is real.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The Benchmark: 10k to 1M Rows&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;I set up a head-to-head comparison in a standard Google Colab environment (2 vCPUs, 13GB RAM) using synthetic data with realistic typos (swaps, replacements, and "fat-finger" errors).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxkf7a0hn3bmy9b867rgs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxkf7a0hn3bmy9b867rgs.png" alt=" " width="800" height="275"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The "Wall"&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;At 10,000 records, RapidFuzz is a beast. It’s fast, optimized C++, and totally usable.&lt;/p&gt;

&lt;p&gt;But fuzzy matching at scale is fundamentally a "many-to-many" problem. When you double your data, you quadruple the work. By the time I hit 100,000 rows, RapidFuzz was taking over 20 minutes. At 1,000,000 rows, local libraries don't just get slow - they crash. You run out of RAM during the matrix construction or your CPU sits at 100% for three days.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How I Optimized for 1M+ Rows&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To get the Similarity API to finish a 1M-row dedupe in 7 minutes, I had to move away from naive loops and implement a dual-engine strategy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Deterministic Indexing:&lt;/strong&gt; Instead of comparing every string to every other string (quadratic time), I use an adaptive indexing strategy that "blocks" similar strings together before the math starts.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;N-Gram Vectorization:&lt;/strong&gt; I treat strings as high-dimensional vectors. This allows me to use optimized linear algebra libraries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Off-Heap Memory Management:&lt;/strong&gt; To prevent the "OOM (Out of Memory)" crashes common in Python, I use memory-mapping (&lt;code&gt;np.memmap&lt;/code&gt;) to process data larger than the available RAM.&lt;/li&gt;
&lt;/ul&gt;
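&lt;p&gt;The indexing idea can be sketched as an inverted trigram index: only rows that share at least one trigram ever become candidate pairs, and everything else is never compared (a toy version, not the production engine):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from collections import defaultdict
from itertools import combinations

def trigrams(s):
    s = "  " + s.lower() + " "  # pad so short strings still produce grams
    return {s[i:i + 3] for i in range(len(s) - 2)}

names = ["Microsoft", "Micsrosoft Corpp", "Oracle", "Orcale Inc"]

# Inverted index: trigram to the set of row ids containing it.
index = defaultdict(set)
for i, name in enumerate(names):
    for g in trigrams(name):
        index[g].add(i)

# Candidate pairs are rows sharing at least one trigram.
candidates = set()
for rows in index.values():
    candidates.update(combinations(sorted(rows), 2))
print(sorted(candidates))  # [(0, 1), (2, 3)]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;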

&lt;h3&gt;
  
  
  &lt;strong&gt;Stop building dedupe pipelines&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If you are a developer, your time is better spent building features than babysitting a 12-hour deduplication script that might crash at 99%.&lt;/p&gt;

&lt;p&gt;I’ve open-sourced the benchmark suite and the Google Colab environment I used so you can verify the numbers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/similarity-api/similarity-api-benchmarks/blob/main/fuzzy_matching_speed_benchmarks_2025" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;:&lt;/strong&gt; View the Benchmark Code — See how we handle 1M+ rows.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;a href="https://colab.research.google.com/drive/1uEtWQ7HYCdykjL85bbg83KcABiF-3TQV?usp=sharing" rel="noopener noreferrer"&gt;Google Colab&lt;/a&gt;:&lt;/strong&gt; Run the Demo — Test the engine in your browser.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I’ve set up a free tier for the API that handles up to 100,000 records. You can generate a free token with &lt;a href="https://similarity-api.com/" rel="noopener noreferrer"&gt;a free sign up&lt;/a&gt;. It’s meant to be a low-friction way to test real-world data without having to spin up your own infrastructure.&lt;/p&gt;

&lt;p&gt;I’m also looking for a few people with very large datasets (5M+ rows) to help me stress-test the next version of the async engine. If you're hitting scale limits that current tools can't solve, feel free to reach out.&lt;/p&gt;

</description>
      <category>python</category>
      <category>showdev</category>
      <category>performance</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
