Unknownerror-404
The predecessors of LLMs: Understanding Chatbots

Contents of this blog:

  • Sentence segmentation
  • Tokenization
  • POS tagging
  • Parsing
  • Named Entity Recognition
  • Relation extraction
  • Conversational Chatbots using RASA

Natural Language Processing:

For those unfamiliar with it, Natural Language Processing (NLP) can be described as the application of computational linguistics within computer science. While this definition captures the theory, its practical meaning is best understood through application.

In practice, NLP involves building systems that can process and work with human language, ranging from analyzing sentence structure to generating appropriate responses based on that analysis, as seen in modern large language models (LLMs).

However, generating meaningful responses requires a clear understanding of several foundational concepts, some theory, and a significant amount of practical experimentation.

In this series, I aim to explore the process of building small-scale pretrained chatbots, beginning with rule and intent-based systems using RASA and YAML, and gradually progressing toward small-scale LLMs. So, let’s begin with the basics…

Sentence segmentation

Sentence segmentation is one of the earliest and most essential processing steps. It is used to identify where each sentence starts and ends within a given paragraph.
For example:

```
It was nearly midnight. The Doctor was on his way out.

It was nearly midnight. -> Sentence 1
The Doctor was on his way out. -> Sentence 2
```

Accurate sentence segmentation is critical, as errors at this stage can propagate to downstream tasks such as parsing, named entity recognition, and information extraction.
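To make this concrete, here is a minimal rule-based segmenter in Python. This is only a sketch: it assumes a sentence ends at '.', '!' or '?' followed by whitespace, and a real segmenter (such as spaCy's) also handles abbreviations like "Dr." that would break this naive rule.

```python
import re

def segment_sentences(text: str) -> list[str]:
    # Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
    # Abbreviations ("Dr.", "e.g.") would wrongly trigger a split here,
    # which is exactly why production segmenters use richer rules or models.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(segment_sentences("It was nearly midnight. The Doctor was on his way out."))
# -> ['It was nearly midnight.', 'The Doctor was on his way out.']
```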

Tokenization

Tokenization is the process of dividing each sentence produced by segmentation into smaller units called "tokens". Each token is a meaningful unit of the sentence, so the sentence is effectively broken into pieces that preserve its essential structural information.

```
The doctor reviewed the patient’s chart.
Tokens: ["The", "doctor", "reviewed", "the", "patient", "’s", "chart", "."]
```

Tokenization helps the model reason about what a word represents within a given structure. Inaccurate tokenization can likewise propagate downstream, leading to illogical wording patterns.

Tokenization can be further subdivided into three categories based on the requirement: word extraction, sub-word tokenization, and character-level tokenization. Let's briefly look at each, as some of these techniques are still used in modern NLP.

Word Extraction: Word extraction tokenization works as explained above, splitting the text into tokens that correspond to whole words.
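A rough word-level tokenizer can be sketched with a single regular expression. This is a simplification (shown here with a straight apostrophe); production tokenizers handle many more clitics and edge cases:

```python
import re

def word_tokenize(sentence: str) -> list[str]:
    # Match the clitic "'s" as one token, otherwise a run of word
    # characters, otherwise any single punctuation mark.
    return re.findall(r"'s|\w+|[^\w\s]", sentence)

print(word_tokenize("The doctor reviewed the patient's chart."))
# -> ['The', 'doctor', 'reviewed', 'the', 'patient', "'s", 'chart', '.']
```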

Sub-word Tokenization: Sub-word tokenization breaks words into smaller units to better handle rare, ambiguous, or previously unseen terms. Instead of relying on a fixed vocabulary of complete words, sub-word tokenizers decompose words into frequently occurring character sequences learned from training data.

This approach allows lightweight or vocabulary-limited models to generalize effectively without treating unfamiliar words as entirely unknown.

```
E.g.: 'Antibiotics' -> ['Anti', 'biotics'] or ['Anti', 'bio', 'tics']
```

These sub-word units are derived from statistical patterns rather than semantic meaning, and the exact split depends on the tokenization algorithm used (such as BPE or WordPiece).
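As a sketch of how such a tokenizer might operate, here is a greedy longest-match-first splitter (similar in spirit to WordPiece matching) over a toy, hand-written vocabulary. The vocabulary here is an invention for illustration; real vocabularies are learned from data, and BPE proper works by iteratively merging frequent character pairs rather than by lookup:

```python
def subword_tokenize(word: str, vocab: set[str]) -> list[str]:
    # Greedy longest-match-first, as in WordPiece-style tokenizers.
    # Falls back to single characters when no vocabulary entry matches.
    tokens, i = [], 0
    w = word.lower()
    while i < len(w):
        for j in range(len(w), i, -1):
            if w[i:j] in vocab:
                tokens.append(w[i:j])
                i = j
                break
        else:
            tokens.append(w[i])  # unknown character, emit as-is
            i += 1
    return tokens

# Toy vocabulary for illustration only; real ones are learned from corpora.
vocab = {"anti", "bio", "tics", "biotics"}
print(subword_tokenize("Antibiotics", vocab))
# -> ['anti', 'biotics']
```

Note how the greedy rule prefers the longest match ("biotics") over the shorter pieces ("bio", "tics") that are also in the vocabulary.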

While sub-word tokenization is primarily an NLP technique, it can indirectly support text-to-speech (TTS) systems in integrated pipelines by enabling consistent handling of rare or complex words before phoneme or pronunciation modeling occurs.

Character level tokenization: Character-level tokenization is a more fine-grained approach in which text is decomposed into individual characters rather than words or sub-words.

```
E.g.: 'Antibiotics' -> ['A', 'n', 't', 'i', 'b', 'i', 'o', 't', 'i', 'c', 's']
```

This method is useful for handling noisy input, spelling variations, and highly specialized terminology, though it often increases sequence length and computational cost. Character-level tokenization is typically used in niche applications or combined with higher-level tokenization strategies.

POS tagging

POS tagging stands for part-of-speech tagging. During this process, each token is labeled with its linguistic part of speech.
Just like in high school grammar, POS tagging simply states:

```
Doctor -> Noun
screeched -> Verb
! -> Punctuation (PUNCT internally)
```
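A toy lookup-based tagger illustrates the idea; real taggers (e.g. in spaCy or NLTK) use trained statistical models rather than a hand-written dictionary like the one assumed below:

```python
# Hand-written lookup table for illustration only.
TAGS = {"doctor": "NOUN", "screeched": "VERB", "!": "PUNCT"}

def pos_tag(tokens: list[str]) -> list[tuple[str, str]]:
    # Unknown words fall back to "X", the Universal POS tag
    # for unanalyzable tokens.
    return [(tok, TAGS.get(tok.lower(), "X")) for tok in tokens]

print(pos_tag(["Doctor", "screeched", "!"]))
# -> [('Doctor', 'NOUN'), ('screeched', 'VERB'), ('!', 'PUNCT')]
```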

Parsing

Although parsing is no longer a central architectural component in modern LLM-based chatbots, it remains a foundational concept that historically informed how linguistic structure is modeled in NLP systems. Parsing focuses on identifying how words within a sentence relate to one another through grammatical roles and dependencies.

At its core, parsing assigns syntactic roles to words, allowing a sentence to be represented in a structured form. For example:

```
Doctor -> Subject
treated -> Verb (POS determined)
the -> Determiner
dog -> Object
```

The aim of parsing is to perform a kind of syntactic analysis: it relates each word in the sentence to the others and assigns it a relation type, which is then passed on for entity recognition by the entity recognition module.
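The role assignment above can be sketched with a toy heuristic over POS-tagged tokens. This word-order rule is an assumption for illustration; real parsers build full dependency trees instead:

```python
def assign_roles(tagged: list[tuple[str, str]]) -> dict[str, str]:
    # Toy heuristic: the noun before the verb is the subject, the noun
    # after it is the object. Real parsers model grammar, not position.
    roles, seen_verb = {}, False
    for word, tag in tagged:
        if tag == "VERB":
            roles[word] = "verb"
            seen_verb = True
        elif tag == "NOUN":
            roles[word] = "object" if seen_verb else "subject"
        elif tag == "DET":
            roles[word] = "determiner"
    return roles

print(assign_roles([("Doctor", "NOUN"), ("treated", "VERB"),
                    ("the", "DET"), ("dog", "NOUN")]))
# -> {'Doctor': 'subject', 'treated': 'verb', 'the': 'determiner', 'dog': 'object'}
```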

Named Entity Recognition

The Entity Recognition Module (ERM), also known as Named Entity Recognition (NER), identifies named entities in the output produced by parsing. It is the most important module in the application because it can be swapped out depending on the task, i.e. it is task-specific. Depending on the NER module used, we obtain results such as:

```
Dr. XYZ -> Doctor
amoxicillin -> Medicine
```

It is necessary for contextualizing tokens, detecting entities, and classifying them. Depending on the requirements, the module can be rule-based, ML-based, or neural; these are best suited to simple, learned, and complex applications respectively.
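As an illustration of the rule-based flavor, here is a tiny gazetteer/pattern NER. The patterns and labels are invented for this example; a real task-specific module would be far richer:

```python
import re

# Hand-written patterns for illustration; real NER modules may be
# rule-based, ML-based, or neural depending on the task.
PATTERNS = {
    "Doctor":   re.compile(r"\bDr\.\s+\w+"),
    "Medicine": re.compile(r"\b(?:amoxicillin|ibuprofen)\b", re.IGNORECASE),
}

def extract_entities(text: str) -> list[tuple[str, str]]:
    entities = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            entities.append((match.group(), label))
    return entities

print(extract_entities("Dr. XYZ prescribed amoxicillin to the patient."))
# -> [('Dr. XYZ', 'Doctor'), ('amoxicillin', 'Medicine')]
```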

Relation extraction

Relation Extraction (RE) is an NLP process that identifies and classifies meaningful relationships between entities detected in text. While Named Entity Recognition (NER) answers “what entities are present?”, relation extraction answers “how are these entities connected?”
Relation extraction operates on text where entities have already been identified and determines the semantic relationship between them.

```
Dr. XYZ prescribed amoxicillin to patient.
(amoxicillin given_to patient)
```

In this way, only the most important relationships are identified and mapped by the extractor.
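A minimal pattern-based extractor for the example above might look like this. The single hand-written pattern is an assumption for illustration; practical extractors use many patterns, or learned models operating over NER output:

```python
import re

# One illustrative pattern: "<Doctor> prescribed <drug> to <patient>".
PRESCRIBE = re.compile(r"(Dr\.\s+\w+) prescribed (\w+) to (?:the )?(\w+)")

def extract_relations(text: str) -> list[tuple[str, str, str]]:
    triples = []
    for doctor, drug, patient in PRESCRIBE.findall(text):
        triples.append((drug, "given_to", patient))
        triples.append((doctor, "prescribed", drug))
    return triples

print(extract_relations("Dr. XYZ prescribed amoxicillin to patient."))
# -> [('amoxicillin', 'given_to', 'patient'), ('Dr. XYZ', 'prescribed', 'amoxicillin')]
```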

Conversational Chatbots using RASA

So, what does all of this imply for chatbots? Directly, very little, but it gives you a clear understanding of how computers handle sentences when trying to understand them. When working with RASA, we will not implement any of these steps ourselves, but we will still work with a few related concepts such as intents, entities, and relations.

In the blogs following this one, we'll dive deeper into how RASA builds chatbots, starting from similar basics and working all the way up to (hopefully) a working chatbot you can converse with.

So, until next time!
The next blog: to be released! It will link to my first RASA-based clinical chatbot.
