In natural language processing, a word embedding is basically a clever way to represent each word as a vector: a list of numbers that captures its meaning. Think of it like giving words GPS coordinates in a multi-dimensional space, where similar words (like "happy" and "joyful") end up as neighbors, making it easier for computers to analyze text. These embeddings are born from techniques like neural networks, dimensionality reduction that squishes down word co-occurrence data, or even probabilistic models that play detective with word contexts.
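To make the "GPS coordinates" idea concrete, here is a minimal sketch using gensim's pre-trained GloVe vectors. The package, the specific model name, and the download step are my assumptions for illustration; any set of pre-trained embeddings behaves the same way.

```python
# A minimal sketch using gensim's downloader API (assumes the gensim package
# and an internet connection to fetch the pre-trained GloVe vectors).
import gensim.downloader as api

# Each word becomes a 50-dimensional vector: its "GPS coordinates" in meaning-space.
glove = api.load("glove-wiki-gigaword-50")

print(glove["happy"].shape)                  # (50,) -- the vector for "happy"
print(glove.similarity("happy", "joyful"))   # high cosine similarity: neighbors
print(glove.most_similar("happy", topn=3))   # nearest words in the vector space
```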
Why do we care? Because when you plug these embeddings into NLP neural networks for tasks like parsing sentence structures or sniffing out sentiment in reviews, they supercharge performance: suddenly the model understands language nuances far better, all by representing words as vectors in a multidimensional space. And those embedding layers are usually kept frozen; you don't need to train them, because they already encode what the words mean.
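Here is a hedged sketch of what "plugging in a frozen embedding layer" looks like in a small PyTorch sentiment classifier. The vocabulary size, dimensions, and the random stand-in for real pre-trained weights are placeholders, not a definitive recipe.

```python
import torch
import torch.nn as nn

# Stand-in for real pre-trained vectors (e.g. the GloVe matrix from above):
# 10,000 words, 50 dimensions each.
pretrained = torch.randn(10_000, 50)

class SentimentNet(nn.Module):
    def __init__(self, embedding_weights, num_classes=2):
        super().__init__()
        # freeze=True: the embedding layer is never updated during training,
        # it already "knows" the words.
        self.embed = nn.Embedding.from_pretrained(embedding_weights, freeze=True)
        self.encoder = nn.LSTM(embedding_weights.size(1), 64, batch_first=True)
        self.head = nn.Linear(64, num_classes)

    def forward(self, token_ids):
        vectors = self.embed(token_ids)          # (batch, seq_len, dim)
        _, (hidden, _) = self.encoder(vectors)   # last hidden state of the LSTM
        return self.head(hidden[-1])             # class logits

model = SentimentNet(pretrained)
logits = model(torch.randint(0, 10_000, (4, 12)))  # a fake batch of token ids
print(logits.shape)  # (4, 2)
```

Only the LSTM and the small linear head get trained; the word knowledge comes for free from the frozen embeddings.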
This gives us the very famous example: take "king", remove the "man" component, add "woman", and you get "queen" (King - Man + Woman = Queen). It's a great way to show that words really have become vectors in a multidimensional space.
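The analogy really is just arithmetic on the vectors. A quick sketch, again assuming gensim and the same GloVe vectors as in the earlier snippet:

```python
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")  # same vectors as the earlier snippet

# "king" - "man" + "woman" lands closest to "queen" in the vector space.
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```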
This is like taking a layer in your network and sending it to primary school to pick up basic vocabulary and the semantic relationships between words. Just vocabulary! Only words, nothing more!
Now what if we take this primary-school layer, put it inside a big model, and send the whole model to high school and university, where it gains all the fundamental linguistic and contextual intelligence?
Looking at LLMs as a natural extension of word embeddings makes sense: a new backbone for anything related to natural language processing and generation. Just as word embeddings represent words as vectors in a pre-trained layer, an LLM is the whole (or most?) of our linguistic knowledge represented as a neural network. The pre-trained knowledge moves from mere words to representations of meanings and relationships, all of humankind's knowledge as a neural network!
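One concrete way to read "LLM as an embedding" is to stop generating text and just pull hidden states out of a frozen pre-trained model. A rough sketch with Hugging Face transformers; GPT-2 here is only a small, openly available stand-in for any decoder-only LLM:

```python
# Treat an LLM as a frozen "embedding" of language: extract contextual
# vectors instead of generating text.
import torch
from transformers import AutoModel, AutoTokenizer

name = "gpt2"  # placeholder for any decoder-only LLM checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)
model.eval()  # no training: use the pre-trained knowledge as-is

text = "Word embeddings went to primary school; LLMs went to university."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per token, e.g. (1, num_tokens, 768) for GPT-2.
print(outputs.last_hidden_state.shape)
```

Where a word embedding gave you one fixed vector per word, these vectors depend on the whole sentence, which is exactly the upgrade from vocabulary to meaning and relationships.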
This new way of looking at LLMs, not just as a final product used mainly for generation but as a compressed representation of knowledge that can be "embedded", lets us do a number of things.
In a recent research project, our team tried to add more heads to a LLaMA model, heads specialized in analysis tasks rather than generation. The idea is to keep the whole model as is and only train a small part that depends on the task. We weren't the only team doing this; a number of researchers are now following the same recipe. Keep the LLM as is, or introduce a minor change such as making it bidirectional, and build a named-entity recognition (NER) head on top. Or freeze everything, train only the query projections (Q) of the attention layers, and build a document classifier, maybe after tweaking the loss function a bit. The sketch below shows the general shape of this setup.
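To make the pattern concrete, here is a generic sketch (not our exact research setup): freeze a pre-trained backbone, optionally unfreeze only the query projections, and train a small classification head on top. The model name, head size, and pooling choice below are all placeholder assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModel

class FrozenLLMClassifier(nn.Module):
    def __init__(self, backbone_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
                 num_classes=4):
        super().__init__()
        # Placeholder checkpoint; any LLaMA-family model works the same way.
        self.backbone = AutoModel.from_pretrained(backbone_name)

        # Freeze everything in the backbone...
        for param in self.backbone.parameters():
            param.requires_grad = False

        # ...then re-enable only the query projections (Q) so the attention
        # can adapt slightly to the new task. "q_proj" is the LLaMA-style
        # parameter name; other architectures name these layers differently.
        for name, param in self.backbone.named_parameters():
            if "q_proj" in name:
                param.requires_grad = True

        hidden = self.backbone.config.hidden_size
        self.head = nn.Linear(hidden, num_classes)  # the new, trainable head

    def forward(self, input_ids, attention_mask=None):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        # Simple last-token pooling; real code would respect the attention mask.
        pooled = out.last_hidden_state[:, -1, :]
        return self.head(pooled)

model = FrozenLLMClassifier()
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(len(trainable), "trainable parameter tensors")  # only Q projections + head
```

Training then touches only a small fraction of the parameters, which is what keeps this approach affordable compared to fully fine-tuning the backbone.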
The possibilities are endless. With that in mind, when prompting alone leaves you hanging with little success, and you don't want to fall back on legacy NLP methods, try multi-head LLMs (especially if you can afford to break the bank). At least you'll end up with something almost as knowledgeable as you are.