Shambhavi Mishra
Let's pay some Attention!

Before discussing a new technology or methodology, we should try to understand the need for it. So, let us see what paved the way for Transformer networks.

Challenges with Recurrent Neural Networks

(Image source: mc.ai)

Gradients are simply vectors pointing in the direction of the highest rate of increase of a function. During backpropagation, gradients pass through repeated matrix multiplications via the chain rule. Small gradients keep getting smaller until they effectively vanish, and it becomes harder to train the earlier weights. This is called the vanishing gradient problem.
Conversely, if the gradients are large, the repeated multiplications make them grow and grow, resulting in very large updates to our network's weights. This is known as the exploding gradient problem.
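To make this concrete, here is a minimal NumPy sketch (my own illustration, not taken from any particular RNN implementation). The matrices stand in for the Jacobians an RNN multiplies across time steps: with factors below 1 the gradient norm collapses towards zero, with factors above 1 it blows up.

```python
import numpy as np

rng = np.random.default_rng(0)
grad = rng.normal(size=4)          # pretend this is dLoss/dh at the last time step

W_small = 0.5 * np.eye(4)          # Jacobian-like matrix with singular values < 1
W_large = 1.5 * np.eye(4)          # Jacobian-like matrix with singular values > 1

g_vanish, g_explode = grad.copy(), grad.copy()
for step in range(1, 31):
    g_vanish = W_small.T @ g_vanish    # chain rule: multiply by the Jacobian again
    g_explode = W_large.T @ g_explode
    if step % 10 == 0:
        print(f"step {step:2d}: |g_vanish| = {np.linalg.norm(g_vanish):.2e}, "
              f"|g_explode| = {np.linalg.norm(g_explode):.2e}")
```

After 30 steps the first gradient has shrunk by roughly 0.5^30 while the second has grown by roughly 1.5^30, which is exactly the vanishing/exploding behaviour described above.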

Another challenge one faces with RNNs is that of recurrence: recurrence prevents parallel computation.
Also, a large number of training steps is required to train an RNN.

The solution to all these problems is the Transformer!
As the title says, "Attention Is All You Need" by Vaswani et al. (2017) is the paper that introduced the Transformer.
Let us first understand the Attention Mechanism.
Attached below is an image from my notes on Prof. Pascal Poupart's lecture on Transformers.

The Attention Mechanism mimics the retrieval of a value (v) for a query (q) based on a key (k), much like a database lookup.
We have a query and some keys (k1, k2, k3, k4), and we aim to produce an output that is a linear combination of the values, where the weights come from the similarity between our query and the keys.
In the diagram above, the first layer consists of the keys (vectors). We generate the next layer by comparing each key with the query (q), so the second layer consists of similarity scores (s).

We take the softmax of these similarities to yield another layer of weights (a). The sum of the values (v), weighted by (a), gives us the attention value.
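Here is a minimal NumPy sketch of that retrieval for a single query. The dimensions and the plain dot-product similarity are my own choices for illustration; the paper itself uses scaled dot-product attention (dividing the similarities by the square root of the key dimension).

```python
import numpy as np

def attention(q, K, V):
    """Return a weighted combination of the rows of V for a single query q."""
    s = K @ q                      # similarities s_i = k_i . q
    a = np.exp(s - s.max())        # softmax over the similarities
    a = a / a.sum()
    return a @ V                   # linear combination of the values

rng = np.random.default_rng(0)
K = rng.normal(size=(4, 8))   # four keys k1..k4, each 8-dimensional
V = rng.normal(size=(4, 8))   # one value per key
q = rng.normal(size=8)        # the query

output = attention(q, K, V)
print(output.shape)           # (8,) -- a weighted sum of the four values
```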

So far we have understood what gave rise to the need for Attention and what exactly the Attention Mechanism is.
What more will we cover?

  • Multihead Attention
  • Masked Multihead Attention
  • Layer Normalisation
  • Positional Embedding
  • Comparison of Self Attention and Recurrent Layers

Let's cover all this in the next blog!
You can follow me on Twitter, where I share all the good content and blogs!
