Abstract
Encoder: the part of the RNN setup we studied that reads the input sequence and "digests" it, the part responsible for creating and updating the hidden state; in a way it's like a person reading something and keeping the "gist" of it in mind
Decoder: the part responsible for taking that "gist", the hidden state (a mathematical vector), and using it to produce an output
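To make the "digest the input, then unroll an output" idea concrete, here's a toy numpy sketch of an RNN encoder-decoder. All the names, sizes, and weights are made up for illustration, it's not the paper's architecture, just the pre-transformer baseline idea:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: 3-token input, 4-dim embeddings, 5-dim hidden state.
inputs = rng.normal(size=(3, 4))        # embedded input sequence
W_xh = rng.normal(size=(4, 5)) * 0.1    # input -> hidden weights
W_hh = rng.normal(size=(5, 5)) * 0.1    # hidden -> hidden weights
W_hy = rng.normal(size=(5, 4)) * 0.1    # hidden -> output weights

# Encoder: read the sequence one token at a time, updating the hidden
# state -- the running "gist" of everything read so far.
h = np.zeros(5)
for x in inputs:
    h = np.tanh(x @ W_xh + h @ W_hh)

gist = h  # a single fixed-size vector, no matter how long the input was

# Decoder: start from that gist and unroll it into output vectors,
# feeding each output back in as the next step's input.
y = np.zeros(4)
outputs = []
for _ in range(3):
    h = np.tanh(y @ W_xh + h @ W_hh)
    y = h @ W_hy
    outputs.append(y)
```

Note the bottleneck: everything the decoder knows about the input has to squeeze through that one `gist` vector.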
Attention mechanism: instead of relying only on that one final hidden state, it lets the decoder look back at all of the encoder's hidden states and weigh whichever parts are relevant at each step
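A minimal sketch of that weighing step, in the scaled dot-product style the paper uses (toy sizes, random stand-in vectors, the function name is mine):

```python
import numpy as np

def attention(query, keys, values):
    """Score each key against the query, softmax the scores into
    weights, and return a weighted sum of the values."""
    scores = keys @ query / np.sqrt(query.shape[-1])
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()
    return weights @ values, weights

rng = np.random.default_rng(1)
encoder_states = rng.normal(size=(6, 8))  # 6 input positions, 8-dim states
decoder_state = rng.normal(size=8)        # what the decoder is "asking about"

# The context vector is a blend of ALL encoder states, not just the last one.
context, weights = attention(decoder_state, encoder_states, encoder_states)
```

The `weights` are one number per input position, summing to 1, which is literally "how much attention" each input word gets for this decoding step.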
Dispensing with: getting rid of
Pros of transformers:
- Better results
- Parallelization
- Less time to train
"Speedometer" reading for AI translation ability:
BLEU (Bilingual Evaluation Understudy):
it's a math formula used as a metric to grade a machine's translation by comparing it with a translation written by a professional, a human of course
0.0 would mean there was no match at all and the model produced a horribly wrong result
100.0 (or 1.0, depending on the scale) would mean a perfect match, but of course that can never happen: the exact same thing can be said with lots of different synonyms and phrasings and still be correct, so in reality 100.0 is impossible, and a score in the high 20s or 30s is quite impressive and considered very high quality
WMT 2014 (Workshop on Machine Translation)
seems to be an annual Olympics for translation AIs; the 2014 edition is considered a famous benchmark exactly because of this paper, since it broke all the previous records
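To get a feel for the BLEU idea, here's a toy slice of it: modified unigram precision. Real BLEU combines this over 1- to 4-grams and adds a brevity penalty, so this is a simplification, not the actual formula:

```python
from collections import Counter

def modified_unigram_precision(candidate, reference):
    """What fraction of the candidate's words appear in the reference,
    with each reference word usable only as often as it occurs there
    (so repeating "the the the" can't cheat the score)."""
    cand_counts = Counter(candidate.split())
    ref_counts = Counter(reference.split())
    matches = sum(min(n, ref_counts[w]) for w, n in cand_counts.items())
    return matches / sum(cand_counts.values())

ref = "the cat is on the mat"
good = modified_unigram_precision("the cat sat on the mat", ref)  # 5 of 6 words match
bad = modified_unigram_precision("dog dog dog", ref)              # no overlap at all
```

Even this crude version shows the flavor: overlap with a human reference, clipped so repetition doesn't inflate the score.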
Why English to German? This specific language pair is famously tricky because German has complex grammar rules and long compound words
Ensembles: a team approach, where we take 4 to 5 versions of a model and make them work together on the same input (here, a sentence), and they sort of "vote" for the best translation; the logic is that a team of experts beats a single expert working alone (hmm), but the catch is that ensembles are very slow and expensive
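The "vote" is usually just averaging each model's probability distribution over the next word. A toy sketch, with random stand-ins for the trained models (all names and sizes made up):

```python
import numpy as np

rng = np.random.default_rng(2)

# Pretend we have 4 trained models; each outputs a probability
# distribution over a 5-word vocabulary for the next word.
n_models, vocab = 4, 5
logits = rng.normal(size=(n_models, vocab))
e = np.exp(logits - logits.max(axis=1, keepdims=True))
preds = e / e.sum(axis=1, keepdims=True)   # one distribution per model

# Ensemble "vote": average the distributions, then pick the best word.
# The cost is obvious -- every step runs n_models forward passes.
ensemble = preds.mean(axis=0)
best_word = int(ensemble.argmax())
```

That per-step multiplier in compute is exactly why ensembles are slow and expensive to serve.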
So what they're saying here is that a single transformer bested a whole team of what were previously the best models, like "we won by a landslide against the toughest possible competition"
2 BLEU is the victory margin, basically, which is actually quite a crazy jump, because an improvement of 0.2 or 0.5 was enough to get a paper published in a top venue
So it seems like before, getting good results always meant running 4 to 8 models at once and averaging their answers, aka using "ensembles", and now this new model didn't need a team: solo, it beat everyone else, and by far, establishing a new single-model state of the art, breaking world records basically
For English to French, the score is way higher, 41.0, huge compared to the German one, but that's because French is very similar to English: they share a lot of words, roughly the same word order, and similar structures, so it typically scores higher anyway. Still, a score of 40+ is considered the threshold for high-quality translation, meaning you rarely find a major grammatical error, and hitting such a score with a single model, no less, was impressive and unheard of before transformers
"Training for 3.5 days on 8 GPUs": this part is basically a flex, saying this model is faster, cheaper, and way smarter than anything that came before. It usually took weeks or even months to train a world-class translation model, and it would certainly need more computing power, while the transformer used 8 GPUs, which in the world of supercomputing is a tiny setup. And the training time, 3.5 days? That's crazy fast: it means one could iterate, test many different ideas, and see results in days, not months. We can actually test lots of theories and refine the model, instead of waiting months for training to finish every time we want to try a new architecture
"A small fraction of the training cost of the best models from the literature": this points out how the old SOTA models required massive amounts of computing power, money, electricity, you name it, to reach their scores, while the transformer got much better results with just a fraction of that; this is also due to parallelization, but we'll get to that later