Muhammad Saim

Attention Mechanism

Attention simply means focusing on the important information and ignoring the unimportant.

Abstract

Imagine you are in a stadium where many cricket teams are present and you want to watch the Pakistan team: you simply look for the players wearing the green kit and ignore everyone else. Your brain treats the green colour as the important cue because it is the one thing that distinguishes them from the others.

Introduction

The same analogy applies in deep learning. If you want to improve a model's effectiveness, the attention mechanism plays an important role, and it has been a major factor in the rise of modern deep learning. The mechanism takes the input, breaks it into parts, considers every part, and assigns a score to each one. Parts with higher scores are treated as more important and have a greater impact on the output, while parts with low scores are down-weighted.

Previous work

Previously, LSTM/RNN encoder-decoder models were used. The encoder compresses the input into a summary and passes it to the decoder, but if the sentence is long it cannot produce a good summary, which leads to a poor output from the decoder. RNNs also struggle to remember long sentences and sequences due to the vanishing/exploding gradient problem.

Key Concepts

Query, Key, and Value:
Query (Q): The element for which we are seeking attention.
Key (K): The elements in the input sequence that the model can potentially focus on.
Value (V): The elements in the input sequence that are associated with the keys, from which the output is generated.
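As a rough sketch of how Q, K, and V are produced in practice (the layer names W_q, W_k, W_v and the dimensions here are illustrative, not from the original post), each one is a separate learned linear projection of the same input:

```python
import torch
import torch.nn as nn

d_model = 8
x = torch.randn(1, 4, d_model)            # 1 sequence of 4 tokens, each an 8-dim embedding

# Q, K, V come from the same input through separate learned projections.
W_q = nn.Linear(d_model, d_model, bias=False)
W_k = nn.Linear(d_model, d_model, bias=False)
W_v = nn.Linear(d_model, d_model, bias=False)

Q = W_q(x)   # what each token is looking for
K = W_k(x)   # what each token offers to be matched against
V = W_v(x)   # the content that gets mixed into the output
print(Q.shape, K.shape, V.shape)          # each: torch.Size([1, 4, 8])
```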

Attention Score

The attention score is calculated by taking the dot product of the query and the key, which measures how much focus each key should get relative to the query.
These scores are then normalized using a softmax function to produce a probability distribution.
Weighted Sum:
The normalized attention scores are used to create a weighted sum of the values. This weighted sum represents the attention output.
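To make this concrete, here is a tiny NumPy sketch with made-up numbers (one query against two key/value pairs), assuming plain dot-product scoring as described above:

```python
import numpy as np

q = np.array([1.0, 0.0])                  # query
K = np.array([[1.0, 0.0], [0.0, 1.0]])    # keys, one per row
V = np.array([[10.0, 0.0], [0.0, 10.0]])  # values, one per row

scores = K @ q                                    # dot products -> [1.0, 0.0]
weights = np.exp(scores) / np.exp(scores).sum()   # softmax -> ~[0.73, 0.27]
output = weights @ V                              # weighted sum -> ~[7.31, 2.69]
print(weights, output)
```

The first key matches the query better, so its value dominates the weighted sum.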

Types of Attention

Self-Attention (computed with Scaled Dot-Product Attention):
Used in transformer models where the query, key, and value all come from the same sequence.
Involves computing attention scores between every pair of elements in the sequence.
Multi-Head Attention:
Extends the self-attention mechanism by using multiple sets of queries, keys, and values.
Each set, or "head," processes the input differently, and the results are concatenated and linearly transformed to produce the final output.
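A compact sketch of the idea (the class name, dimensions, and the fused qkv projection are my own choices for brevity, not from the post):

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=64, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_head = num_heads, d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # one projection producing Q, K, V
        self.out = nn.Linear(d_model, d_model)       # final linear transform

    def forward(self, x):                            # x: (batch, seq, d_model)
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # split d_model into heads -> (batch, heads, seq, d_head) so each head attends independently
        q, k, v = (z.view(b, t, self.h, self.d_head).transpose(1, 2) for z in (q, k, v))
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        weights = scores.softmax(dim=-1)
        heads = weights @ v                                 # per-head attention outputs
        concat = heads.transpose(1, 2).reshape(b, t, -1)    # concatenate the heads
        return self.out(concat)                             # linearly transform the concatenation

x = torch.randn(2, 5, 64)
print(MultiHeadSelfAttention()(x).shape)   # torch.Size([2, 5, 64])
```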

Mathematical Formulation

Score(Q, K) = QKᵀ

Scaled Scores:

Scaled Score(Q, K) = QKᵀ / √dₖ

Softmax to get Attention Weights:

Attention Weights = softmax(QKᵀ / √dₖ)

Weighted Sum to get the final output:

Attention Output = Attention Weights · V
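Putting the four steps together, a minimal PyTorch sketch of scaled dot-product attention (the shapes are illustrative) could look like this:

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1)        # Score(Q, K) = QKᵀ
    scaled = scores / math.sqrt(d_k)        # scale by √dₖ
    weights = F.softmax(scaled, dim=-1)     # attention weights
    return weights @ V, weights             # output = Attention Weights · V

# Self-attention: Q, K, V all come from the same sequence (1 batch, 4 tokens, dim 8).
Q = K = V = torch.randn(1, 4, 8)
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.shape)                   # torch.Size([1, 4, 8]) torch.Size([1, 4, 4])
```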




Understanding the Attention Mechanism

In an RNN encoder-decoder, the encoder produces a sequence of hidden states, but only the final hidden state is passed to the decoder, which performs its computation on that single vector to produce the result.
Take machine translation as an example: the sentence is encoded, yet the translation is not up to the mark, because only the final hidden state reaches the decoder.

The attention mechanism solves this by passing all of the encoder's hidden states to the decoder instead of just the final one. The decoder can then weigh the relevant states at each step, which lets it work more effectively and produce a much better translation.
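One way to sketch this idea (a simple dot-product attention over all encoder hidden states, in the spirit of Luong attention; the shapes and random tensors are placeholders):

```python
import torch
import torch.nn.functional as F

encoder_states = torch.randn(1, 6, 128)   # one hidden state per source token: (batch, src_len, hidden)
decoder_state  = torch.randn(1, 1, 128)   # current decoder hidden state: (batch, 1, hidden)

# Score every encoder state against the current decoder state.
scores  = decoder_state @ encoder_states.transpose(1, 2)   # (1, 1, 6)
weights = F.softmax(scores, dim=-1)                        # one weight per source token
context = weights @ encoder_states                         # (1, 1, 128) context vector

# The context vector is combined with the decoder state to predict the next word.
print(weights.squeeze(0), context.shape)
```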

Transformer Model

In the transformer architecture, attention mechanisms are crucial for both the encoder and the decoder:
Encoder: Each layer uses self-attention to process the input sequence and generate a representation.
Decoder: Uses a combination of self-attention (to process the output sequence so far) and encoder-decoder attention (to focus on relevant parts of the input sequence).
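As a rough illustration of those two roles using PyTorch's built-in nn.MultiheadAttention (the tensors here are random placeholders, not a full transformer):

```python
import torch
import torch.nn as nn

d_model, heads = 64, 4
src = torch.randn(2, 10, d_model)   # encoder input: (batch, src_len, d_model)
tgt = torch.randn(2, 7, d_model)    # decoder output so far: (batch, tgt_len, d_model)

self_attn  = nn.MultiheadAttention(d_model, heads, batch_first=True)
cross_attn = nn.MultiheadAttention(d_model, heads, batch_first=True)

# Encoder self-attention: Q, K, V all come from the source sequence.
memory, _ = self_attn(src, src, src)

# Decoder encoder-decoder attention: queries come from the target,
# keys and values come from the encoder output.
out, attn_weights = cross_attn(tgt, memory, memory)
print(out.shape, attn_weights.shape)   # torch.Size([2, 7, 64]) torch.Size([2, 7, 10])
```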
The attention mechanism has been instrumental in the success of models like BERT, GPT, and other transformer-based models, enabling them to handle complex tasks such as translation, summarization, and question answering effectively.
