Before discussing a new technology or methodology, we should try to understand why it is needed. So, let us look at what led to Transformer networks.

## Challenges with Recurrent Neural Networks

Gradients are simply vectors pointing in the direction of the highest rate of increase of a function. During backpropagation through an RNN, the chain rule multiplies the gradient by a matrix once for every time step. If these factors shrink the gradient, it gets smaller at every step until it effectively vanishes, and the weights that depend on early time steps become very hard to train. This is called the vanishing gradient problem.

The opposite can also happen: if the factors enlarge the gradient, it keeps growing at every step and results in very large updates to the network. This is known as the exploding gradient problem.
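To get a feel for both problems, here is a minimal sketch in Python. It assumes, for simplicity, that the gradient is a single number scaled by the same factor at every time step; in a real RNN the factor is a matrix product that changes from step to step. The name `backprop_scale` and the factors 0.5 and 1.5 are illustrative choices, not taken from any library.

```python
def backprop_scale(factor: float, steps: int) -> float:
    """Scale of a gradient after being multiplied by `factor` once per time step.

    Toy model (an assumption for illustration): one scalar factor stands in
    for the per-step matrix multiplication in backpropagation through time.
    """
    grad = 1.0
    for _ in range(steps):
        grad *= factor
    return grad


# A factor below 1 shrinks the gradient, a factor above 1 blows it up.
vanishing = backprop_scale(0.5, 50)
exploding = backprop_scale(1.5, 50)

print(f"factor 0.5 over 50 steps: {vanishing:.3e}")
print(f"factor 1.5 over 50 steps: {exploding:.3e}")
```

Even over just 50 steps, the two results differ by many orders of magnitude, which is why long sequences are so hard on RNNs.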

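Another challenge one faces with RNNs is recurrence itself. Each hidden state depends on the previous one, so the time steps of a sequence must be processed one after another, which prevents parallel computation.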
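The sequential dependency can be seen in a minimal sketch of a recurrent update. This is a toy scalar RNN, not a real implementation: the function name `rnn_hidden_states` and the weights `w_h` and `w_x` are hypothetical, and a real RNN would use weight matrices and vectors.

```python
import math


def rnn_hidden_states(inputs, w_h=0.5, w_x=1.0):
    """Run a toy scalar RNN and return the hidden state at every time step.

    w_h and w_x are made-up scalar weights used only for illustration.
    """
    h = 0.0  # initial hidden state
    states = []
    for x in inputs:
        # h at step t needs h from step t-1, so this loop cannot be
        # split across time steps and run in parallel.
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states


states = rnn_hidden_states([1.0, 0.0, 0.0])
print(states)
```

The first input still influences the last hidden state, but only by passing through every intermediate step in order.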

Also, a large number of training steps is required to train an RNN.

The solution to all these problems is **Transformers**!

As the title says, *Attention Is All You Need* by Vaswani et al. (2017) is the paper that introduced the concept of Transformers.

Let us first understand the **Attention Mechanism**.

The image below is from my notes on Prof. Pascal Poupart's lecture on Transformers.

The attention mechanism mimics the retrieval of a value (v) for a query (q) based on a key (k) in a database.

We have a query and some keys (k1, k2, k3, k4). We aim to produce an output that is a linear combination of the values, where the weights come from the similarity between the query and each key.

In the diagram above, the first layer consists of the keys (vectors). Comparing each key with the query (q) gives a similarity score, so the second layer consists of the similarities (s).

We take the softmax of these similarities to yield another layer of attention weights (a), which are non-negative and sum to 1. Multiplying each weight in (a) by its corresponding value (v) and summing the results gives the attention value.
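The steps above (similarities, softmax, weighted sum) can be sketched in plain Python. The similarity function is not specified here, so this sketch assumes a dot product between the query and each key; the function names `softmax` and `attention` and the example vectors are illustrative only.

```python
import math


def softmax(scores):
    """Turn similarity scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Return (weights, output) for a single query.

    Assumption: similarity is the dot product of the query with each key.
    """
    # s: one similarity score per key
    sims = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # a: attention weights from the softmax of the similarities
    weights = softmax(sims)
    # output: weighted sum (linear combination) of the value vectors
    dim = len(values[0])
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]  # k1..k4
values = [[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0], [0.0, -10.0]]  # v1..v4

weights, output = attention(query, keys, values)
print("weights:", weights)
print("output:", output)
```

Because the query is most similar to k1, the first weight is the largest and the output is pulled towards v1.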

So far we have understood what gave rise to the need for *Attention* and what exactly the *Attention Mechanism* is.

What more will we cover?

- Multihead Attention
- Masked Multihead Attention
- Layer Normalisation
- Positional Embedding
- Comparison of Self Attention and Recurrent Layers

Let's cover all this in the next blog!

You can follow me on Twitter, where I share all the good content and blogs!
