Introduction to Transformer Models

(This is a beginner's guide, so basic examples are used.)

NLP

NLP is a field of linguistics and machine learning focused on understanding everything related to human language.

What can NLP do?

  • Classifying whole sentences — sentiment analysis
  • Classifying each word in a sentence — identifying its grammatical role (noun, verb, adjective)
  • Generating text content — completing a prompt with auto-generated text

Transformers and NLP
Transformers are game-changers in NLP. Unlike traditional models, they excel at understanding connections between words, no matter the distance. This "attention" allows them to act like language experts, analyzing massive amounts of text to perform tasks like translation and summarization with impressive accuracy. We'll explore how these transformers work next!

Transformers

These are models that can handle almost every NLP task; several are shown below. The most basic object for running these tasks is the pipeline() function from the Hugging Face Transformers library.

Sentiment analysis
It classifies whether a sentence is positive or negative.
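A minimal sketch using pipeline() (it assumes the transformers library is installed; the example sentence is made up):

```python
from transformers import pipeline

# A default English sentiment-analysis model is used when none is specified
classifier = pipeline("sentiment-analysis")

print(classifier("I love learning about transformer models!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.999...}]
```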
A score of 0.999… means the model is about 99.9% confident in its prediction.
We can also pass several sentences at once; a label and score are returned for each.
By default, this pipeline selects a particular pretrained model that has been fine-tuned for sentiment analysis in English. The model is downloaded and cached when we create the classifier object.

Zero-shot classification
It allows us to supply our own candidate labels instead of relying on the labels the model was trained with.
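A minimal sketch (the sentence and the candidate labels are made up):

```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
result = classifier(
    "This course teaches you how to build transformer models.",
    candidate_labels=["education", "politics", "business"],
)
print(result)
# e.g. {'sequence': ..., 'labels': ['education', ...], 'scores': [0.9..., ...]}
```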

Text generation
The main idea of text generation is that we provide a prompt and the model continues it. We can also control the total length of the generated text.
If we don't specify any model, the pipeline uses a default one; otherwise we can name a specific model, as in the example below.
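A minimal sketch (distilgpt2 is just one example of a model we could name; max_length and num_return_sequences are optional):

```python
from transformers import pipeline

generator = pipeline("text-generation", model="distilgpt2")
outputs = generator(
    "In this course, we will teach you how to",
    max_length=30,            # limit the total length of the output
    num_return_sequences=2,   # ask for two different continuations
)
print(outputs)
```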

Mask filling
The idea of this task is to fill in the blanks in a sentence.
The top_k value tells the pipeline how many candidate words to return for the masked position, as in the sketch below.
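A minimal sketch (the default fill-mask model uses the <mask> token; BERT-style models use [MASK] instead):

```python
from transformers import pipeline

unmasker = pipeline("fill-mask")
results = unmasker("This course will teach you all about <mask> models.", top_k=2)
for r in results:
    print(r["token_str"], r["score"])  # candidate word and its probability
```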

Named entity recognition
It can pick out the persons, organizations, locations, and other entities mentioned in a sentence.
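A minimal sketch (grouped_entities=True merges sub-word pieces that belong to the same entity; the example sentence is made up):

```python
from transformers import pipeline

ner = pipeline("ner", grouped_entities=True)
print(ner("My name is Rohab and I work at DEV Community in Lahore."))
# e.g. [{'entity_group': 'PER', 'word': 'Rohab', ...},
#       {'entity_group': 'ORG', ...}, {'entity_group': 'LOC', ...}]
```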

  • PER – person
  • ORG – organization
  • LOC – location

Question answering
It gives an answer based on the provided information. It does not generate answers; it just extracts the answer from the given context.
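A minimal sketch (the question and the context are made up):

```python
from transformers import pipeline

qa = pipeline("question-answering")
result = qa(
    question="Where do I work?",
    context="My name is Rohab and I work at DEV Community in Lahore.",
)
print(result)
# e.g. {'score': ..., 'start': ..., 'end': ..., 'answer': 'DEV Community'}
```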

Summarization
In this case, it summarizes the paragraph we provide.
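A minimal sketch (max_length and min_length are optional and control the summary size; the text is just a placeholder):

```python
from transformers import pipeline

summarizer = pipeline("summarization")
text = """
Transformer models have changed natural language processing. They are trained
on large amounts of raw text and can then be fine-tuned for tasks such as
translation, summarization, and question answering with relatively little
labeled data.
"""
print(summarizer(text, max_length=40, min_length=10))
```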

Translation
It translates the provided text into a different language.
In the sketch below, a model name is given that specifies the translation direction "en-ur" (English to Urdu).
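A minimal sketch (Helsinki-NLP/opus-mt-en-ur is assumed here as one English-to-Urdu checkpoint on the Hugging Face Hub; any en-ur translation model would work):

```python
from transformers import pipeline

# The model name encodes the language pair: en (English) -> ur (Urdu)
translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-ur")
print(translator("Transformer models are changing the world."))
```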

How do transformers work?

The Transformer architecture was introduced in 2017; influential models built on it include GPT and BERT.
Transformer models are basically language models, meaning they have been trained on large amounts of raw text in a self-supervised fashion. Self-supervised learning means that humans are not needed to label the data. Such a pretrained model is not directly useful for specific practical tasks, so we turn to transfer learning: transferring the knowledge of a pretrained model to another, more specific task.
Transformers are large models; to achieve better results they must be trained on large amounts of data, but training at that scale has a heavy environmental impact due to carbon dioxide emissions.
So instead of pretraining (training a model from scratch), we fine-tune existing pretrained models in order to reduce time, cost, and impact on the environment.
Fine-tuning a model therefore has lower time, data, financial, and environmental costs. It is also quicker and easier to iterate over different fine-tuning schemes, as the training is less constraining than a full pretraining.
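As a rough illustration, here is a hedged sketch of fine-tuning with the Trainer API (the distilbert-base-uncased checkpoint, the imdb dataset, and the hyperparameters are only assumptions for the example; real setups vary by task and library version):

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Tokenize a labeled dataset (imdb movie reviews, used here just as an example)
dataset = load_dataset("imdb")
def tokenize(batch):
    return tokenizer(batch["text"], truncation=True)
tokenized = dataset.map(tokenize, batched=True)

# Train for one epoch on top of the pretrained weights
args = TrainingArguments("finetuned-distilbert", num_train_epochs=1)
trainer = Trainer(model=model, args=args,
                  train_dataset=tokenized["train"],
                  tokenizer=tokenizer)
trainer.train()
```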

General Architecture
It generally consists of two parts:

  • Encoders
  • Decoders

The encoder receives the input and builds a representation of its features.
The decoder uses that representation to generate an output.

Models
There are three types of models:

  • Encoder-only — good for tasks that require understanding of the input, such as named entity recognition.
  • Decoder-only — good for generative tasks such as text generation.
  • Encoder-decoder — good for generative tasks that need an input, such as summarization or translation.


ENCODERS

The architecture of BERT (the most popular model) is “encoder-only”.

How does it actually work?
It takes a sequence of words as input and generates a numerical representation (a feature vector) for each word.
The values generated for a word are not just a value for that word in isolation; they depend on the context of the whole sentence, looking both left and right of the word (the self-attention mechanism, applied bi-directionally).
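A small sketch of getting these feature vectors from BERT (it assumes the transformers and torch libraries; the sentence is made up):

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Transformers are powerful models.", return_tensors="pt")
outputs = model(**inputs)

# One feature vector per token; each vector has 768 dimensions for bert-base
print(outputs.last_hidden_state.shape)  # e.g. torch.Size([1, 8, 768])
```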

When encoders can be used

  • Classifications tasks
  • Question answering tasks
  • Masked language modeling
Encoders really shine at these tasks.

Representatives of this family

  • ALBERT
  • BERT
  • DistilBERT
  • ELECTRA
  • RoBERTa

DECODERS

We can do similar tasks with decoders as with encoders, usually with a little loss of performance.
The difference is that encoders use a self-attention mechanism over the whole sentence, while decoders use a masked self-attention mechanism: when building the representation for a word, they can only attend to the words that come before it, not the ones after.

When should we use a decoder?

  • Text generation (generating text word by word given the words that came before is called causal language modeling in NLP)
  • Next-word prediction (see the sketch after this list)
At each stage, for a given word, the attention layers can only access the words positioned before it in the sentence. These models are often called auto-regressive models.
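A hedged sketch of that auto-regressive loop using GPT-2 (it assumes the transformers and torch libraries; greedy decoding is used for simplicity):

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The transformer architecture", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(10):                                       # generate 10 new tokens
        logits = model(input_ids).logits                      # scores for every position
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # most likely next token
        input_ids = torch.cat([input_ids, next_id], dim=-1)      # feed it back in as input

print(tokenizer.decode(input_ids[0]))
```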

Representatives of this family

  • CTRL
  • GPT
  • GPT-2
  • Transformer XL

ENCODER-DECODER

In this type of model, the encoder is used alongside the decoder.

Working
Let’s take translation (a sequence transduction task) as an example.
We give a sentence as input to the encoder, which generates a numerical representation of those words; that representation is then fed to the decoder. The decoder decodes it and outputs a word. A start-of-sequence token tells the decoder to begin decoding. Once we have the first word and the feature vector (the representation generated by the encoder), the encoder itself is no longer needed.
We have already seen the auto-regressive manner of the decoder, so the word it outputs can now be used as part of its input to generate the second word. This goes on until the sequence is finished.
In this model, the encoder takes care of understanding the input sequence and the decoder takes care of generating the output based on that understanding.
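A small sketch with an encoder-decoder model (t5-small is just one example checkpoint; T5 expects a task prefix in its input):

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

inputs = tokenizer("translate English to German: The house is wonderful.",
                   return_tensors="pt")
# The encoder reads the whole input; the decoder then generates the output
# token by token, starting from a start-of-sequence token.
output_ids = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```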

Where we can use these

  • Translation
  • Summarization
  • Generative question answering

Representatives of this family

  • BART
  • mBART
  • Marian
  • T5

Limitations
An important note to end the article: whether you use a pretrained model directly or fine-tune it, these models are powerful but come with limitations.
When asked to fill in a mask for sentences like the ones in the sketch below, the model returns gender-stereotyped suggestions. So if you are using any of these models, this kind of bias can be an issue.
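A minimal sketch of the issue (bert-base-uncased is used just as an illustration; exact outputs vary by model):

```python
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

print([r["token_str"] for r in unmasker("This man works as a [MASK].")])
print([r["token_str"] for r in unmasker("This woman works as a [MASK].")])
# The two lists tend to contain stereotypically gendered occupations,
# even though the sentences differ only in the word man/woman.
```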

Conclusion
In conclusion, transformer models have revolutionized the field of NLP. Their ability to understand relationships between words and handle long sequences makes them powerful tools for a wide range of tasks, from translation and text summarization to question answering and text generation. While the technical details can be complex, hopefully, this introduction has given you a basic understanding of how transformers work and their potential impact on the future of human-computer interaction.
