FuzzyEquals()

#textprocessing #csharp #dotnet

Let's say you want to determine whether something is almost equal to what you're looking for, but it doesn't have to strictly be equal. You just want it close enough. There's all sorts of reasons for this, like writing a spell checker, or just trying to find some possibly misspelled words. It's also useful for attempting to tolerate slightly mangled data sent over a wire on a protocol that doesn't have error correction. After all, if you can correct it on your end, you don't have to send a packet back, hope it gets there, and then wait for the mangled packet to be resent. It's even useful in genetic analysis, although don't ask me why.

If this seems complex, you're right. There's a lot of work that goes into how to efficiently detect these situations. Luckily, you don't have to understand any of that.

FuzzyEquals() is an extension method you can use almost exactly like Equals() to do what you'd expect.

"bob".Equals("bob") // true
"bob".Equals("mob") // false

"bob".FuzzyEquals("bob") // true
"bob".FuzzyEquals("mob") // true
"bob".FuzzyEquals("boob") // true
"bob".FuzzyEquals("mom") // false

Aside from the FuzzyEquals(this String, String) form, you can also specify as a third parameter, either the maximum allowed edits, or, the minimum similarity to satisfy. Either way, it's tunable to your exact accuracy needs. More on that later.

Furthermore, like most of Stringier's functions, any String parameter is overloaded for almost any text type you could imagine, including Char[], Span<Char>, ReadOnlySpan<Char>, and even the unsafe fat-pointer Char*, Int32!

So, how does this work? In computational linguistics, a subfield of computer science, there's a class of functions called edit-distance metrics or string-similarity metrics. These are technically two different things, but an edit-distance algorithm can easily be adapted for both purposes, so we'll be talking about that specifically. Inside the library that also provides FuzzyEquals() are implementations of various edit-distance algorithms. FuzzyEquals() actually exists as a sort of abstraction layer on top of those, so that the actual implementation could be changed over time, for performance or accuracy needs. But what exactly is an edit-distance?

In simplest terms, it's the number of edits required to get from one text to another.

But this is computer science, so of course it's not that simple. Luckily, as far as concepts go, this isn't a bad one.

"bob" -> "bob"

This is an easy one to explain, as there's no changes. So, there's an edit-distance of 0.

"bob" -> "mob"
 ^        ^

This time, at the caret was a substitution ('b', 'm'). This counts as a single change, so an edit distance of 1.

"bob" -> "boob"
   ^        ^^
   1        23

Now we've got a confusing one, because there's multiple ways to view this, and indeed different edit-distance algorithms handle this differently. It's clear 1 is the only thing being changed, but, how? One way to view it is that 1 was substituted as ('b', 'o'), and then 3 was inserted. But it's also possible to view it as 2 was inserted, and 1 was transposed into the position at 3. In either case, this is two-edits.

"bob" -> "obb"

See if you can figure this one out on your own.

What if I told you this could be two different results? No seriously. Consider the first approach.

"bob" -> "obb"
 ^^       ^^
 12       21

Transposing 1 and 2 swaps both characters in one edit.

However, substituting 1 as ('b', 'o') and 2 as ('o', 'b') is two substitutions, for two edits.

Exactly what results you'll get depends extensively on what specific metric you're using, because they all have certain limitations. Hamming only counts substitutions. Levenshtein counts insertions, deletions, and substitutions. Damerau-Levenshtein adds transpositions. There's metrics out there which only count transpositions!

FuzzyEquals() doesn't require you to understand any of that. It's just a nice, user-friendly wrapper for "hey, are these approximately the same thing?".

So what about those fine tuning options? Well, the maximum edits parameter should be pretty obvious considering we just went over what counts for an edit. The other option has to do with a similarity score, which is a Double of range [0, 1], where 0 is completely dissimilar and 1 is identical. While a maximum edit bound is a fixed amount regardless of the length of either string, a minimum similarity bound is relative to the size of the strings, and might arguably be more useful.

If you're interested in utilizing this, or just playing around, you can get the package here.

DEV Community

FuzzyEquals()

Top comments (0)