Hello, I'm Ganesh. I'm building git-lrc, an AI code reviewer that runs on every commit. It is free, unlimited, and source-available on GitHub. Star us to help other devs discover the project, and do give it a try and share your feedback to help improve the product.
In the previous article we discussed step 2 of the transformer model, positional encoding.
In this article we will discuss step 3: Multi-Head Attention.
Why didn't traditional RNN models work for long sentences?
Before 2017, we were using LSTM and RNN models for NLP tasks.
These models process a sentence one word at a time, so the context carried from earlier words fades as the sentence grows longer.
For example, consider two short phrases that both contain the word "bank":
The river bank.
The United Bank.
In the first phrase "bank" means the edge of a river; in the second it is part of an organization's name. The two uses share no related meaning, yet with only word embeddings and positional encodings the word "bank" gets the same vector in both phrases, so the model has very little chance of understanding the context.
Here is an example of how the vector for "bank" might look in both phrases.
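Below is a minimal NumPy sketch of the problem (the embedding table and all its values are made up purely for illustration): a static lookup table returns the identical vector for "bank" no matter which words surround it.

```python
import numpy as np

# Hypothetical static embedding table -- values invented for illustration.
embeddings = {
    "the":    np.array([0.1, 0.3, 0.2, 0.5]),
    "river":  np.array([0.9, 0.1, 0.4, 0.2]),
    "united": np.array([0.2, 0.8, 0.6, 0.1]),
    "bank":   np.array([0.5, 0.5, 0.7, 0.3]),
}

bank_in_river = embeddings["bank"]   # "bank" in "The river bank"
bank_in_united = embeddings["bank"]  # "bank" in "The United Bank"

# The lookup never sees the surrounding words, so both vectors are identical.
print(np.array_equal(bank_in_river, bank_in_united))  # True
```

Attention fixes exactly this: it lets each token's vector be updated using the vectors of the tokens around it.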

How does a single attention head work?
A single attention head works by determining how much focus a specific token (word) in a sequence should place on other tokens to better understand its own context.
Let's take an example: "The cat sat on the mat."
For the token "sat", the attention head might learn to pay high attention to "cat" and "mat" because they are directly related to "sat".
In the next article we will understand this in more detail by actually implementing it step by step.
Feedback and contributions are welcome! It's online, source-available, and ready for anyone to use.
⭐ Star it on GitHub: https://github.com/HexmosTech/git-lrc

