loading...

Understanding and tuning fuzzy queries in ElasticSearch by example

hernamvel profile image Hernan Velasquez ・3 min read

What's a fuzzy query?

Fuzzy queries are one of the most powerful features available in ElasticSearch (ES). It allows to search for similar terms across thousands of documents on an existing index. For example, with fuzzy queries a search term like "colombia" is found on entries like "columbia", "colombie" or "locombia" depending on how tuned we configure the request query.

Here is assumed that the reader has some basic knowledge on Elasticsearch, in particular how to send query requests to the server. There are a ton of great resources to catch up before going forward with this post.

To execute fuzzy queries, it's important to understand that ElasticSearch measures similarity using the Levenshtein Edit Distance algorithm, defined as the number of individual changes required to make one term identical to the another one. A change can be an insert, edit or delete operation (ES can also handle transpositions but we'll leave it for another post). It's not that complicated. Here are two examples:

The Levenshtein distance between colombia and columbia is 1 because you only need to make one change: change the 'u' for a 'o' in in the term columbia.

The Levenshtein distance between colombia and columbi is 2 because you only need to make two changes: change the 'u' for a 'o' in in the term columbia and insert an 'a' at the end of the term.

Executing a fuzzy query in ElasticSearch

Executing a fuzzy query is not that far from executing a typical search query in ElasticSearch. It's important to add the fuzziness parameter (even the fact that its optional) in the json request with the maximum Levenshtein distance you want to allow. Let's assume we have and index with this documents:

doc1: "i will marry you because I love you"
doc2: "i will live with harry"
doc2: "i'm sorry for your loss"

The query

GET /_search
{
  "query": {
    "fuzzy": {
      "name": {
        "value": "harry"
        "fuziness": 1
      }
    }
  }
}

Will return documents doc1 and doc2 because

Levenshtein('harry', 'marry') = 1 in doc1
Levenshtein('harry', 'harry') = 0 in doc2
Levenshtein('harry', 'sorry') = 2 in doc2

Of course, if we change fuziness from 1 to 2, all doc1, doc2 and doc3 documents will be returned.

Undestanding "fuziness": "auto"

Sometimes we need more flexibility on how long we want the maximum Levenshtein Distance to be defined. For short terms it makes no sense to have a distance greater that 2 because it can generate a lot of false positives.

In this sense, ElasticSearch allows the value "auto" to the fuziness parameter that will work in this way:

If the length of the term is 0, 1 or 2 the maximum Levenshtein distance allowed will be 0. In other words, only terms with exact matches will be returned.

If the length of the term is 3, 4 or 5 the maximum Levenshtein distance allowed will be 1.

If the length of the term is greater than 5 the maximum Levenshtein distance allowed will be 2.

To illustrate this, lets consider this example:

doc1: "don't be so hard on me"
doc2: "i will live with my friends"
doc2: "love this twitter trends"

The query

GET /_search
{
  "query": {
    "fuzzy": {
      "name": {
        "value": "me"
        "fuziness": "auto"
      }
    }
  }
}

Will return doc1 since it has the exact term "me". On doc2 although Levenshtein('me', 'my') = 1, the length of the term is 2 characters so it will be discarded.

Lets consider this other query over the same documents:

GET /_search
{
  "query": {
    "fuzzy": {
      "name": {
        "value": "friends"
        "fuziness": "auto"
      }
    }
  }
}

This will return:

doc2 because Levenshtein('friends', 'friends') = 0 (exact match) and length of the term is 6

doc3 because Levenshtein('friends', 'trends') = 2 (change 'f' -> 'r' and insert 'i' in term 'trends' and length of the term is 6

Tuning up "fuziness": "auto"

Although the default definition for auto fuziness in ES is a widely tested and accepted standard, sometimes you'll want to tweak a little bit the relation between Levenshtein Distance and the length of the terms.

For instance, I find sometimes inconvenient to make fuzzy queries on terms with length 3, so I'm inclined to look for exact matches on these terms. To achieve that, we can tweak the query request passing length ranges on the fuzziness entry. For example:

"fuziness": "auto:4,6"

Will work in this way:

If the length of the term is 0, 1, 2 or 3 a exact match is expected.

If the length of the term is 4 or 5 the maximum Levenshtein distance allowed will be 1.

If the length of the term is greater than 5 the maximum Levenshtein distance allowed will be 2.

Conclusion

ElasticSearch is a great tool that may be sometime used under its real capabilities. This small post just wanted to illustrate how fuzzy queries work using simple examples. I hope the reader find this useful.

Posted on by:

hernamvel profile

Hernan Velasquez

@hernamvel

Ruby on Rails, Javascript frameworks and general web development

Discussion

markdown guide