Elasticsearch Tokenizers – Word Oriented Tokenizers

https://grokonez.com/elasticsearch/elasticsearch-tokenizers-word-oriented-tokenizers

A tokenizer breaks a stream of characters up into individual tokens (characters, words, ...) and then outputs a stream of tokens. We can also use a tokenizer to record the order or position of each term (for phrase and word proximity queries), or the start and end character offsets of the original word that the term represents (for highlighting search snippets).

In this tutorial, we're gonna look at how to use some Word Oriented Tokenizers, which tokenize full text into individual words.

1. Standard Tokenizer

The standard tokenizer provides grammar-based tokenization:


POST _analyze
{
  "tokenizer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

Tokens:


{
  "tokens": [
    {
      "token": "The",
      "start_offset": 0,
      "end_offset": 3,
      "type": "",
      "position": 0
    },
    {
      "token": "2",
      "start_offset": 4,
      "end_offset": 5,
      "type": "",
      "position": 1
    },
    {
      "token": "QUICK",
      "start_offset": 6,
      "end_offset": 11,
      "type": "",
      "position": 2
    },
    {
      "token": "Brown",
      "start_offset": 12,
      "end_offset": 17,
      "type": "",
      "position": 3
    },
    {
      "token": "Foxes",
      "start_offset": 18,
      "end_offset": 23,
      "type": "",
      "position": 4
    },
    {
      "token": "jumped",
      "start_offset": 24,
      "end_offset": 30,
      "type": "",
      "position": 5
    },
    {
      "token": "over",
      "start_offset": 31,
      "end_offset": 35,
      "type": "",
      "position": 6
    },
    {
      "token": "the",
      "start_offset": 36,
      "end_offset": 39,
      "type": "",
      "position": 7
    },
    {
      "token": "lazy",
      "start_offset": 40,
      "end_offset": 44,
      "type": "",
      "position": 8
    },
    {
      "token": "dog's",
      "start_offset": 45,
      "end_offset": 50,
      "type": "",
      "position": 9
    },
    {
      "token": "bone",
      "start_offset": 51,
      "end_offset": 55,
      "type": "",
      "position": 10
    }
  ]
}

To keep things simple, we can write the terms from the tokens this way:


[ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone ]

Max Token Length

We can configure the maximum token length (max_token_length, which defaults to 255).
If a token exceeds this length, it is split at max_token_length intervals.
For example, if we set max_token_length to 4, QUICK is split into QUIC and K, as sketched below.
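Here is a minimal sketch of how that configuration might look (the names my_index, my_analyzer, and my_tokenizer are just placeholders for illustration, not part of the original tutorial):


PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "standard",
          "max_token_length": 4
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "QUICK"
}

Since QUICK exceeds the 4-character limit, this _analyze request should produce:


[ QUIC, K ]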

More at:

https://grokonez.com/elasticsearch/elasticsearch-tokenizers-word-oriented-tokenizers

