You're probably using the wrong fuzzy matching algorithm (and here's how to see why)

Most developers reach for fuzzywuzzy or difflib.SequenceMatcher the moment they need fuzzy string matching. The ratio comes back — 0.73, looks reasonable — and they ship it. But Levenshtein distance and SequenceMatcher measure fundamentally different things, and picking the wrong one silently corrupts your results.

I built a terminal app that animates both algorithms step by step so you can see why they disagree. Here's what I learned.


The experiment that changed how I think about fuzzy matching

Compare these two strings:

A: "Acme Corp."
B: "ACME Corporation"

Obviously the same company, right? Let's check:

from difflib import SequenceMatcher

# SequenceMatcher
SequenceMatcher(None, "Acme Corp.", "ACME Corporation").ratio()
# → 0.4615 (46.2%)

# Levenshtein ratio
# distance = 10 edits, max_len = 16
# ratio = 1 - 10/16 = 0.375 (37.5%)

SequenceMatcher says 46.2%. Levenshtein says 37.5%. That's not a minor disagreement: with a 40% threshold, one algorithm matches and the other rejects.
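
Don't take my word for the numbers; they are easy to reproduce. The levenshtein_distance helper below is a minimal textbook sketch added for illustration (it is not part of difflib or the demo app):

from difflib import SequenceMatcher

def levenshtein_distance(a: str, b: str) -> int:
    # Two-row Wagner-Fischer DP: prev[j] holds the distance
    # between the previous prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

a, b = "Acme Corp.", "ACME Corporation"
print(SequenceMatcher(None, a, b).ratio())    # 0.4615...
d = levenshtein_distance(a, b)                # 10
print(1 - d / max(len(a), len(b)))            # 0.375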

Now try these:

A: "Saturday"
B: "Sunday"
SequenceMatcher(None, "Saturday", "Sunday").ratio()
# → 0.714 (71.4%)

# Levenshtein: distance = 3, max_len = 8
# ratio = 1 - 3/8 = 0.625 (62.5%)

Both algorithms rate this pair far higher than the company names, but look at the swing: Levenshtein jumps from 37.5% to 62.5%. A few scattered edits in short strings are cheap under an edit model, while expanding an abbreviation by six characters is expensive.

This isn't a bug. They're answering different questions.


What Levenshtein actually measures

Vladimir Levenshtein published his distance metric in 1965 at the Keldysh Institute in Moscow. The question is simple:

What is the minimum number of single-character operations (insert, delete, substitute) needed to transform string A into string B?

The algorithm builds a dynamic programming matrix. Each cell D[i][j] holds the edit distance between the first i characters of A and the first j characters of B:

        ε    s    i    t    t    i    n    g
   ε    0    1    2    3    4    5    6    7
   k    1    1    2    3    4    5    6    7
   i    2    2    1    2    3    4    5    6
   t    3    3    2    1    2    3    4    5
   t    4    4    3    2    1    2    3    4
   e    5    5    4    3    2    2    3    4
   n    6    6    5    4    3    3    2    3

"kitten" → "sitting" = 3 edits: k→s (substitute), e→i (substitute), +g (insert).

Key property: Every edit costs exactly 1. Levenshtein doesn't care if you're changing case ("A"→"a"), expanding abbreviations ("Corp."→"Corporation"), or fixing typos ("teh"→"the"). Each character-level change is equally expensive.

This is why "Acme Corp." vs "ACME Corporation" scores so low: the best path needs 10 edits (three case changes, one substitution, six insertions), and each one costs a full point.


What SequenceMatcher actually measures

Python's difflib.SequenceMatcher implements the Ratcliff/Obershelp "Gestalt Pattern Matching" algorithm (1983). The question is different:

What proportion of both strings consists of contiguous matching blocks?

Ratio = 2 · M / T
M = total characters in matching blocks
T = total characters in both strings

The algorithm works by divide-and-conquer:

  1. Find the longest contiguous match between the two strings
  2. Recursively find matches to the left and right of that match
  3. Sum up all matching characters

For "The quick brown fox" vs "The quikc brown fax":

Step 1: Find longest match → " brown f" (8 chars)
Step 2: Left: "The quick" vs "The quikc" → "The qui" (7 chars)
Step 3: Remainders: "ck" vs "kc" → "c" (1 char)
Step 4: Right: "ox" vs "ax" → "x" (1 char)

M = 8 + 7 + 1 + 1 = 17
T = 19 + 19 = 38
Ratio = 34/38 = 89.5%
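
difflib exposes these blocks directly, so the arithmetic above is checkable in a REPL:

from difflib import SequenceMatcher

a, b = "The quick brown fox", "The quikc brown fax"
sm = SequenceMatcher(None, a, b)

# Each Match(a, b, size) names a block's start in both strings;
# the final (19, 19, 0) entry is difflib's terminating sentinel.
for block in sm.get_matching_blocks():
    print(block, repr(a[block.a:block.a + block.size]))

m = sum(block.size for block in sm.get_matching_blocks())
print(2 * m / (len(a) + len(b)), sm.ratio())   # both 0.8947...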

Key property: Long contiguous matches are rewarded disproportionately. "Acme Corp." and "ACME Corporation" share the five-character block " Corp" plus the leading "A", so SequenceMatcher scores them noticeably higher than Levenshtein despite the character-level differences.


The comparison table that matters

Scenario Levenshtein SequenceMatcher Winner
Typo: "programing"→"programming" 90.9% 95.2% Both good
Company: "Acme Corp."→"ACME Corporation" 37.5% 46.2% SM
Days: "Saturday"→"Sunday" 62.5% 71.4% SM
Accounting: "Invoice #12345"→"Inv. 12345" 64.3% 75.0% SM
Cyrillic: "Левенщайн"→"Левенштейн" 70.0% 73.7% SM slight
Typo: "The quick brown fox"→"The quikc brown fax" 84.2% 89.5% SM slight
Different: "algorithm"→"altruistic" 40.0% 52.6% SM

Pattern: SequenceMatcher is consistently the more forgiving scorer. The gap is widest when the strings share long blocks despite format variations (company names, invoices) and narrowest on plain typos. Levenshtein punishes expansions hardest: every inserted character is a full-price edit, normalized against the longer string.
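
The whole table comes from one loop. This sketch reuses the levenshtein_distance helper defined earlier in the post:

from difflib import SequenceMatcher

pairs = [
    ("programing", "programming"),
    ("Acme Corp.", "ACME Corporation"),
    ("Saturday", "Sunday"),
    ("Invoice #12345", "Inv. 12345"),
    ("Левенщайн", "Левенштейн"),
    ("The quick brown fox", "The quikc brown fax"),
    ("algorithm", "altruistic"),
]

for a, b in pairs:
    # levenshtein_distance() is the two-row helper from earlier
    lev = 1 - levenshtein_distance(a, b) / max(len(a), len(b))
    sm = SequenceMatcher(None, a, b).ratio()
    print(f"{a!r} vs {b!r}: Lev {lev:.1%}, SM {sm:.1%}")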


Seeing is believing — the demo

I built a terminal app that animates both algorithms in real time. Here's what each demo shows:

Demo 1: Watch the DP matrix fill

Every cell fills with a flash, colored by operation type. You literally see the wavefront of dynamic programming propagate through the matrix. Then the optimal path traces back in magenta, and the edit operations appear:

k→s (sub)  i=i (match)  t=t  t=t  e→i (sub)  n=n  +g (ins)

Demo 2: Block discovery in real time

SequenceMatcher's divide-and-conquer becomes visible:

  • Gray highlight = current search region
  • Green flash = discovered matching block
  • Step log shows the recursion order

You can see why it finds " brown f" before "The qui" — longest first, then recurse.

Demo 3: Head-to-head arena

Animated bars grow simultaneously for 5 string pairs. The winner indicator appears per round. You viscerally see where the algorithms diverge.

Demo 4: Try your own strings

Type any two strings and get the full analysis: DP matrix (if short enough), both scores, colored diff, matching blocks, edit operations.

Demo 5: Real-world scenarios

  • Typo correction: Dictionary lookup with ranked results
  • Name dedup: Company name clustering
  • Fuzzy VLOOKUP: Invoice → catalog matching

Demo 6: Hybrid scoring

An animated weight sweep from w=0.0 (pure SM) to w=1.0 (pure Lev) with decision guidance.


When to use which — the decision framework

Typo correction, spell checking?
  → Levenshtein (edit model matches typo generation)

Name/entity deduplication?
  → SequenceMatcher (format-tolerant block matching)

Accounting codes, invoice matching?
  → Hybrid w=0.3-0.5 (format varies but typos also matter)

Plagiarism detection, document similarity?
  → SequenceMatcher (long shared passages are the signal)

Search autocomplete?
  → SequenceMatcher + prefix bonus

DNA/protein alignment?
  → Weighted Levenshtein (Needleman-Wunsch with substitution matrices)
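
The hybrid score above is just a weighted blend. A sketch, with w following the demo's convention (0.0 = pure SequenceMatcher, 1.0 = pure Levenshtein); the function name and default weight are mine, not the demo's API:

from difflib import SequenceMatcher

def hybrid_score(a, b, w=0.4):
    # levenshtein_distance() is the two-row helper from earlier in the post
    lev = 1 - levenshtein_distance(a, b) / max(len(a), len(b), 1)
    sm = SequenceMatcher(None, a, b).ratio()
    return w * lev + (1 - w) * sm

print(hybrid_score("Invoice #12345", "Inv. 12345"))  # lands between the two scores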

Try it yourself

git clone https://github.com/RMANOV/fuzzy-match-visual.git
cd fuzzy-match-visual
python3 demo.py

Single file, zero dependencies, Python 3.8+, any modern terminal with truecolor.

Controls: arrow keys to navigate, Enter to select, 1-6 to jump to demos, S for speed control (0.25×-4×), Ctrl+C returns to menu.


The historical footnote

Levenshtein published in 1965, working on error-correcting codes at the Soviet Academy of Sciences. The Wagner-Fischer DP algorithm came in 1974. Ratcliff/Obershelp's Gestalt matching appeared in Dr. Dobb's Journal in 1988. Tim Peters (author of the Zen of Python) wrote Python's difflib.SequenceMatcher.

Three decades of algorithm design, two fundamentally different philosophies of what "similarity" means, and most developers still use them interchangeably.

Now you can see why that's a mistake.


GitHub: https://github.com/RMANOV/fuzzy-match-visual
