DEV Community

GrimLabs
GrimLabs

Posted on

I Need Fuzzy Matching But I Dont Know Python

I posted on Reddit a few months ago asking how to match two lists of company names that werent exactly identical. One list from our CRM, one from a vendor database. About 12,000 records each. I needed to find the overlaps but VLOOKUP was useless because the names were formatted differently.

Every single response told me to use Python. "Just pip install fuzzywuzzy." "Use pandas merge with a custom matching function." "Write a script with the thefuzz library."

Great advice if you know Python. I dont. I'm an operations analyst. I use Excel, Google Sheets, and SQL when i have to. Python is not in my toolkit and honestly i dont have three months to learn it just to solve this one problem.

And thats the thing. The people who most need fuzzy matching are almost never developers. They're the people in finance matching invoices. The salespeople deduplicating lead lists. The ops team reconciling inventory data. The HR person merging employee records from two systems after an acquisition.

These are spreadsheet people. And fuzzy matching tools for spreadsheet people basically dont exist.

The mismatch between need and access

Turns out i'm not alone in this frustration. Not even close.

According to Stack Overflow's 2023 Developer Survey, Python is used by about 49% of professional developers. But professional developers are a small fraction of the people who work with data. A Forrester study estimated that there are roughly 10x more "data workers" (people who use data in their jobs) than there are developers.

So if 49% of developers know Python, and developers are maybe 10% of data workers, then roughly 5% of the people who work with data can use Python for fuzzy matching. The other 95% are stuck.

And its not like these 95% dont know the problem exists. They deal with messy data every day. They know that "Johnson & Johnson" and "Johnson and Johnson" should match. They just dont have a tool that can do it without code.

What the current options actually look like

If you need fuzzy matching today and you cant write code, here are your realistic options. Not gonna lie, its a short list.

Excel add-ins. There are a few third-party add-ins that claim to do fuzzy matching in Excel. Fuzzy Lookup is one from Microsoft Research. It works for small datasets (a few thousand rows) but its slow, not very configurable, and hasnt been updated in years. The matching quality is mediocre.

Google Sheets add-ons. Same story. A few exist. Most are limited to a few hundred rows before they time out. Some charge per match, which gets expensive fast.

OpenRefine. This is a free, open source tool that handles fuzzy matching well. Its more powerful than Excel add-ins and can handle larger datasets. But the interface is... not intuitive. Theres a real learning curve. I spent an afternoon trying to use it and gave up around hour three when i couldnt figure out how to configure the clustering settings the way i needed.

Dedupe.io. A web-based dedup tool that uses machine learning. Its pretty good actually. But its primarily for deduplication within a single list, not for matching between two lists. And pricing starts at $100/month.

Ask someone who knows Python. This is what most people end up doing. You find the one person in your company who can code, beg them for help, and wait three days for them to have time. Its a terrible workflow for something that should be self-service.

What a non-technical user actually needs

After going through all of these options, i have a pretty clear picture of what would actually solve this problem. And its simpler than you'd think.

Upload two files. CSV or Excel. Drag and drop. No configuration wizards, no data source connections.

Select the columns to match. Click on "Company Name" in file A. Click on "Vendor Name" in file B. Tell the tool these are the columns to compare.

Choose matching sensitivity. A slider or simple setting. "Strict" for near-exact matches only. "Moderate" for standard fuzzy matching. "Loose" for catching everything that might be related.

Get results with confidence scores. A table showing each match with a percentage confidence. "Acme Corp" matched to "Acme Corporation Inc" with 92% confidence. "IBM" matched to "International Business Machines" with 87% confidence. "ABC Consulting" matched to "ABC Consulting Group LLC" with 78% confidence.

Review and approve. A human checks the low-confidence matches. Approves or rejects. Exports the clean matched dataset.

Thats it. Five steps. No code. No formulas. No dependencies to install. Just upload, match, review, export.

Thats actually why DataReconIQ exists. I built it to handle exactly this, fuzzy matching without writing a single line of code. Upload, match, review, export.

Why this gap exists

You might wonder why this tool gap persists. If so many people need fuzzy matching without code, why hasnt someone built it ages ago?

I think theres a few reasons.

Developer blind spot. The people building data tools are developers. For them, fuzzy matching is a solved problem. "Just use fuzzywuzzy, its three lines of code." They dont experience the pain of non-coders because theyre not non-coders.

Market categorization. Fuzzy matching gets lumped into "data engineering" or "data science" categories, which are assumed to be developer territories. Nobody makes a "fuzzy matching for operations teams" product category because the market doesn't think about it that way.

Enterprise lock-in. The companies that do offer fuzzy matching as a service (Informatica, Talend, IBM DataStage) price it for enterprise. $50K+ implementations with consultants. The mid-market user who needs to match 12,000 records is invisible to them.

Spreadsheet assumptions. Tool builders assume that if someone is working in spreadsheets, VLOOKUP is good enough. They dont consider that spreadsheet users might have matching needs that go beyond exact match.

The hidden productivity loss

Heres what happens in the real world when fuzzy matching isnt accessible.

The operations analyst who needs to match vendor lists does one of two things. Either they spend 8-12 hours doing it manually (side-by-side comparison, sorting, squinting at similar names) or they do it partially and accept a 20-30% miss rate.

Both options are bad. The manual approach wastes time and is error-prone because of fatigue. The partial approach means missed matches that cause downstream problems (duplicate payments, missed invoices, broken reports).

Multiply this across every department in every company that works with imperfect data, and the aggregate productivity loss is staggering.

A McKinsey report on data-driven organizations found that employees spend about 30% of their time searching for, validating, and reconciling data. Not analyzing it. Not making decisions with it. Just getting it into a usable state.

Accessible fuzzy matching tools would chip away at that 30% significantly. Not eliminate it, but reduce it enough to matter.

Its a tool problem, not a skills problem

I want to be clear about something. The people who need fuzzy matching and dont know Python are not unskilled. They're domain experts. They know their data, their business processes, and their specific matching requirements better than any developer would.

What they lack is a tool that speaks their language. They think in terms of spreadsheets, columns, and business rules. Not in terms of libraries, functions, and algorithms.

The gap isnt in their skills. Its in the tools available to them. And that gap is slowly closing as more products recognize that data matching is a mainstream need, not a developer-only problem.

If you're one of the 95% who needs fuzzy matching but doesnt code, know that the problem isnt you. The tools just havent caught up to the need yet. But theyre getting there. And in the meantime, you shouldnt have to learn a programming language just to match two lists of company names.

Top comments (0)