loizenai
Elasticsearch Tokenizers – Partial Word Tokenizers

https://grokonez.com/elasticsearch/elasticsearch-tokenizers-partial-word-tokenizers


In this tutorial, we're going to look at two tokenizers that break text or words into small fragments for partial word matching: the N-Gram Tokenizer and the Edge N-Gram Tokenizer.

I. N-Gram Tokenizer

The ngram tokenizer does two things:

  • breaks text into words when it encounters any of the specified characters (whitespace, punctuation, ...)
  • emits N-grams of each word in the configured length range (quick with length = 2 -> [qu, ui, ic, ck])

=> N-grams are like a sliding window of contiguous letters.
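The sliding-window idea can be sketched in a few lines of Python (a simplified illustration of the behavior, not Elasticsearch's actual implementation):

```python
def char_ngrams(word, min_gram=1, max_gram=2):
    """Emit all substrings of length min_gram..max_gram, left to right."""
    grams = []
    for i in range(len(word)):                   # window start position
        for n in range(min_gram, max_gram + 1):  # window width
            if i + n <= len(word):
                grams.append(word[i:i + n])
    return grams

print(char_ngrams("quick", min_gram=2, max_gram=2))  # ['qu', 'ui', 'ic', 'ck']
```

With the defaults (min_gram=1, max_gram=2) this reproduces the "Spring 5" output shown below, spaces included, since no character classes are excluded.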

For example:


POST _analyze
{
  "tokenizer": "ngram",
  "text": "Spring 5"
}

It generates terms with a sliding window (1-character minimum width, 2-character maximum width):


[ "S", "Sp", "p", "pr", "r", "ri", "i", "in", "n", "ng", "g", "g ", " ", " 5", "5" ]

Configuration

  • min_gram: minimum length of characters in a gram (min-width of the sliding window). Defaults to 1.
  • max_gram: maximum length of characters in a gram (max-width of the sliding window). Defaults to 2.
  • token_chars: character classes that will be included in a token. Elasticsearch will split on characters that don't belong to any of:
      • letter (a, b, ...)
      • digit (1, 2, ...)
      • whitespace (" ", "\n", ...)
      • punctuation (!, ", ...)
      • symbol ($, %, ...)

    Defaults to [] (keep all characters).

For example, we will create a tokenizer with a fixed-width sliding window (width = 3) and character classes restricted to letter & digit.


PUT jsa_index_n-gram
{
  "settings": {
    "analysis": {
      "analyzer": {
        "jsa_analyzer": {
          "tokenizer": "jsa_tokenizer"
        }
      },
      "tokenizer": {
        "jsa_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }
}

POST jsa_index_n-gram/_analyze
{
  "analyzer": "jsa_analyzer",
  "text": "Tut101: Spring 5"
}

Terms:

[ "Tut", "ut1", "t10", "101", "Spr", "pri", "rin", "ing" ]
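Under these settings the tokenizer first keeps only letter/digit runs (splitting on the ":" and the spaces), then slides a fixed width-3 window over each run. A rough Python simulation (illustrative only; the function name is mine, not an Elasticsearch API, and the ASCII regex only approximates Elasticsearch's Unicode-aware letter/digit classes):

```python
import re

def simulate_jsa_tokenizer(text, min_gram=3, max_gram=3):
    """Approximate the custom ngram tokenizer: split on anything that is
    not a letter or digit, then emit n-grams of each remaining word."""
    terms = []
    for word in re.findall(r"[A-Za-z0-9]+", text):  # token_chars: letter, digit
        for i in range(len(word)):
            for n in range(min_gram, max_gram + 1):
                if i + n <= len(word):
                    terms.append(word[i:i + n])
    return terms

print(simulate_jsa_tokenizer("Tut101: Spring 5"))
# ['Tut', 'ut1', 't10', '101', 'Spr', 'pri', 'rin', 'ing']
# ("5" produces no term: it is shorter than min_gram)
```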

More at:

https://grokonez.com/elasticsearch/elasticsearch-tokenizers-partial-word-tokenizers

