Our previous blog, Understanding RASA Pipelines, describes how RASA handles stories, rules, policies and forms.
Here, we'll dive deeper into how pipelines should be developed and how each pipeline component works.
Contents of this blog:
- Developing Pipelines
- WhitespaceTokenizer
Developing Pipelines:
As we discussed in the last blog, a pipeline is the basic architecture of any chatbot. These pipelines are built in a similar manner to functional or object-oriented code, where the programmer writes functions for specific operations which are then extended with further functionality.
def add(x, y):
    return x + y

def add_two_num(a, b):
    print(add(a, b))

if __name__ == "__main__":
    num1 = int(input("Provide the 1st num: "))
    num2 = int(input("Provide the 2nd num: "))
    add_two_num(num1, num2)
When we develop a pipeline, the basic considerations are: what do we want to achieve, and is there a pre-existing package which already does what we want?
If your answer is yes, it makes things very easy!
The most basic resources for anyone working with RASA lie in its base documentation, the base repository, and its API.
Once we identify a set of pipeline components which could be useful for us, we begin stacking one on top of the other, building our functionality, as in the sketch below.
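For instance, a hedged sketch of such a stacked pipeline in config.yml could look like the following; the CountVectorsFeaturizer and DIETClassifier here are just commonly used examples, not a prescription:

pipeline:
  - name: WhitespaceTokenizer
  - name: CountVectorsFeaturizer
  - name: DIETClassifier
    epochs: 100

Each component builds on the output of the previous one: the tokenizer produces tokens, the featurizer turns them into features, and the classifier consumes those features to predict intents and entities.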
Example:
Very recently I developed a bot for a clinic; as the answers required consistency and the queries could range from 'I need a vet' to 'Mind one for the animal doctor', RASA was the perfect fit.
When I was working with RASA, I built the architecture with a bottom-up approach, beginning by defining how a word should be treated.
This is where we use the:
WhitespaceTokenizer
Now, even though we've mentioned the 'WhitespaceTokenizer' before this blog, I want to dive deep into the working of the module.
It is the first step within a RASA NLU pipeline:
pipeline:
  - name: WhitespaceTokenizer
    intent_tokenization_flag: true
    intent_split_symbol: "_"
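The two options above control whether intent labels themselves are tokenized and on which symbol they are split. Conceptually, the effect on a label is close to a plain string split; the snippet below is only an illustration of that idea, not RASA's internal code:

# Illustration only: what splitting an intent label on
# intent_split_symbol: "_" roughly amounts to.
intent = "ask_opening_hours"
print(intent.split("_"))  # ['ask', 'opening', 'hours']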
The only purpose it serves is to break the user's sentences into 'tokens'; it is not used for syntactic analysis, intent analysis, or even sentence normalisation.
It is what decides where one 'token' ends and the next begins, and as redundant as this may sound, we deal in tokens, not words.
As ML models are unable to work directly with large string data, or rather raw text, they use tokens, which are then converted to features and further into embeddings. The WhitespaceTokenizer is the simplest type: it only looks for whitespace within a sentence and defines tokens around it.
Internal working:
Tokenization
Consider a sentence such as:
'Hey? Can you direct me to the purchase page?'
Now the tokenizer works by dividing the sentence on whitespace, forming a list:
["Hey?", "Can", "you", "direct", "me", "to", "the", "purchase", "page?"]
The tokenizer does not remove any punctuation from sentences; this simple rule allows a range of emotions to be captured from each input.
Linguistically, 'Hey!', 'Hey?' (hesitant), or even a plain 'Hey' can have a multitude of different meanings, which the model must capture to be precise. Whenever the module forms a single token, the information it stores consists of the token's text, its starting character position within the string, and its ending character position.
{
  "text": "direct",
  "start": 13,
  "end": 19
}
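To make that bookkeeping concrete, here is a minimal sketch of a whitespace tokenizer that records the same three fields; it is an illustration of the idea, not RASA's actual implementation:

def whitespace_tokenize(text):
    # Walk the string and emit one dict per whitespace-separated token,
    # with its text and character offsets (0-indexed, end exclusive).
    tokens = []
    position = 0
    for word in text.split():
        start = text.index(word, position)
        end = start + len(word)
        tokens.append({"text": word, "start": start, "end": end})
        position = end
    return tokens

print(whitespace_tokenize("Hey? Can you direct me to the purchase page?"))
# The token for "direct" comes out as {'text': 'direct', 'start': 13, 'end': 19}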
In terms of low-level code, one could compare this bookkeeping to how strings are terminated in C with '\0', or how null pointers mark the end of a linked list.
Rather than being used on its own, the WhitespaceTokenizer is seen as a building block. Another similar tokenizer, which splits on periods, is the RegexTokenizer. It too is consistently used within projects, but rather than working with word-level tokens, it works with paragraphs and divides them further into sentences.
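The idea behind such a period-based splitter can be sketched in a few lines of Python; this is only an illustration of the concept, not the actual component's code:

import re

def split_into_sentences(paragraph):
    # Naive illustration: split wherever '.', '!' or '?' is followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]

print(split_into_sentences("I need a vet. Is the clinic open today? Thanks!"))
# ['I need a vet.', 'Is the clinic open today?', 'Thanks!']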
Now that we have our building block in place, we'll move on to how sentences are handled syntactically.
The next blog: To be released