JamesLi

Thoughts on how Positional Embedding in Transformers was designed

I have used Transformers for a while but had never really dug into the details. Recently I went through this blog post, which explains very clearly what positional embedding is, how it works, and the intuition behind its basic mechanism.

That post is enough if you want to understand what positional embedding is and how it is constructed. But I would like to know, or at least guess, how the authors tackled this problem step by step and what their initial thoughts were. By retracing that path, I may pick up something about the methodology for solving similar problems, which I think is more important than positional embedding itself.

Notice that most of the time the semantic vector of each word consists of small numbers, usually somewhere between -1 and 1. As a result, we don't want to make our positional embedding too big and drown out the meaning, because in the end the semantic vector and the positional vector are added together before later processing. So ideally our positional embedding vectors should also consist of small numbers between -1 and 1. This naturally leads us to sin and cos.
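To make the "added together" part concrete, here is a minimal NumPy sketch; the dimension d = 8 and the random token embedding are just made-up toy values:

```python
import numpy as np

d = 8                                    # toy embedding dimension (made up for illustration)
semantic = np.random.uniform(-1, 1, d)   # stand-in for a learned word embedding
positional = np.sin(np.arange(d))        # some bounded positional values in [-1, 1]

# The two vectors are simply summed before the rest of the model sees them,
# so the positional part must stay small enough not to drown out the semantic part.
combined = semantic + positional
print(combined)
```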

But which one should we choose? It is not a big deal at this point, because a phase shift φ converts sin into cos and vice versa (sin(x + π/2) = cos(x)). Let's go with sin for now.

In general, entry i of the positional vector at time step t, for an embedding of dimension d, can be written as a function like

f(i, t) = sin(ω_i · t)    (1)

or it can be

f(i, t) = sin(ω_t · i)    (2)

But we want the positional vectors to be as different from each other as possible, because they indicate different positions of words in a sentence. We definitely don't want these embeddings to confuse our model.

If we choose function (1), entry i of all the positional embeddings is periodic in t. To be specific, if we take the time step t as the x axis and the value of entry i as the y axis, the resulting dots all lie on the same sin curve, which is periodic.

On the other hand, if we choose function (2), then for each time step t, in other words for each distinct positional vector, the values of all the entries are periodic in the entry index i. To be specific, if we take the entry index i as the x axis and its value as the y axis, the resulting dots all lie on the same sin curve, which is periodic.
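A quick way to check these two claims is to build both candidate matrices and look at them. This is only an illustrative sketch; the sequence length, the dimension d, and the frequencies ω are made-up values:

```python
import numpy as np

seq_len, d = 16, 8
t = np.arange(seq_len)[:, None]          # time steps as rows
i = np.arange(d)[None, :]                # entry indices as columns

# Function (1): one frequency per entry; each column is one sin curve over t.
omega_i = np.linspace(0.5, 2.0, d)
pe1 = np.sin(omega_i[None, :] * t)       # shape (seq_len, d)

# Function (2): one frequency per time step; each row is one sin curve over i.
omega_t = np.linspace(0.5, 2.0, seq_len)
pe2 = np.sin(omega_t[:, None] * i)       # shape (seq_len, d)

# With (2), every row is sampled from a single periodic curve, so it looks like
# repeated short blocks; with (1), the entries of a row come from different sin
# functions with different frequencies.
print(pe1.round(2))
print(pe2.round(2))
```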

So which one should we choose then?

Let's look at function (2) first. In the end, our goal is to add this positional vector to a semantic vector of the same dimension d, so if the values of each positional vector are periodic in the entry index i, the positional vector looks like a concatenation of short vectors that all look the same. As a result, it cannot use the full dimension d to carry position information, which is definitely not what we want.

But if we look at function (1) and focus on a single positional vector, it is easy to see that its entries come from different sin functions and are not periodic in i, which avoids the redundancy problem we faced with function (2).

So, at this point, we will go with function (1).

f(i, t) = sin(ω_i · t)

Now it's time to determine the parameter ω_i, or equivalently, since ω_i = 2π / T_i, to determine the cycle T. To get there, we need to think a little deeper.

What will happen if we set a huge T? If you are familiar with sin, it is easy to tell that entry i of all the positional vectors will be almost 0, since sin(2πt / T) ≈ 2πt / T ≈ 0 when T is much larger than t. Conversely, if we set a really small T, it's harder to tell: those entries may fluctuate dramatically, or they may happen to be pretty stable.
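To see both extremes, here is a tiny sketch; the cycle lengths and the range of time steps are made-up values:

```python
import numpy as np

t = np.arange(50)      # time steps of a typical short sentence

huge_T = 1e6           # cycle far longer than any sentence
small_T = 1.3          # cycle shorter than the spacing between time steps

print(np.sin(2 * np.pi * t / huge_T)[:5])   # all ~0, since sin(x) ≈ x for tiny x
print(np.sin(2 * np.pi * t / small_T)[:5])  # values jump around with no obvious pattern
```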

We want the final combined vector to keep both positional and semantic information, so we have to make sure that some of the entries stay so close to 0 that they make only minor or even no changes to the original semantic vector. With that in mind, I think it is pretty clear why we choose a bigger and bigger cycle T as i gets larger.

How big should it be? Really big, I'd say, because we want those later entries to sit somewhere around 0. How do we get gigantic numbers? Exponential functions! And this is how we arrive at the confusing ω_k mentioned in the article.
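Putting it all together, here is a sketch of the standard sinusoidal encoding from the original Transformer paper, where the cycle grows geometrically with the entry index; even entries use sin and odd entries use the cos counterpart obtained by the phase shift mentioned earlier. The seq_len and d values in the example are arbitrary:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d: int) -> np.ndarray:
    """For even dimension 2k the frequency is 1 / 10000**(2k/d); d is assumed even."""
    pos = np.arange(seq_len)[:, None]        # time step t
    two_k = np.arange(0, d, 2)[None, :]      # even dimension indices 0, 2, 4, ...
    omega = 1.0 / (10000.0 ** (two_k / d))   # frequencies shrink (cycles grow exponentially) with k

    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(pos * omega)        # even entries
    pe[:, 1::2] = np.cos(pos * omega)        # odd entries: same frequency, phase-shifted
    return pe

pe = sinusoidal_positional_encoding(seq_len=10, d=8)
print(pe.round(3))   # later columns change slowly and sit near 0 (sin) or 1 (cos)
```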
