This is the second post in a series where we build Transformers and Vision Transformers from the ground up.
In the previous post, we covered the neural network fundamentals required for Transformers.
In this post, we focus on embeddings: the step that makes attention possible.
Why Embeddings Matter More Than Attention
Attention often gets all the credit in Transformers.
But attention operates on vectors, not on words, tokens, or pixels.
Before a Transformer can reason, compare, or attend to anything, we must answer a simple question:
How do we represent discrete symbols in a neural network?
The answer is embeddings. Without embeddings, attention has nothing to work on.
Neural Networks Don't Understand Tokens
Neural networks operate on:
- Real-valued vectors
- Continuous spaces
- Matrix multiplication
They do not understand words, characters, tokens, or categories.
If we fed raw token IDs directly into a Transformer, the model would assume numeric relationships that do not exist.
For example:
- Token ID 10 is not "twice" token ID 5
- Token ID 1000 is not "larger" than token ID 10 in any semantic sense
We need a representation that removes this artificial ordering.
One-Hot Encoding: The Simplest Idea
The most basic way to represent tokens is one-hot encoding. If the vocabulary size is V:
- Each token is a vector of length V
- Exactly one position is 1, the rest are 0
Example:
"cat" β [0, 0, 1, 0, 0, ...]
"dog" β [0, 1, 0, 0, 0, ...]
Problems with One-Hot Encoding
One-hot vectors:
- Are extremely sparse
- Do not encode similarity
- Scale poorly with large vocabularies
From a neural network's perspective:
- "cat" and "dog" are just as different as "cat" and "car"
This is not useful for learning language.
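To see the problem numerically, we can continue the hypothetical vocabulary from the sketch above: the dot product between any two distinct one-hot vectors is exactly zero, so the geometry carries no notion of similarity at all.

```python
# Continuing the one_hot() sketch above: all distinct one-hot pairs are orthogonal.
print(one_hot("cat") @ one_hot("dog"))  # 0.0
print(one_hot("cat") @ one_hot("car"))  # 0.0 -- "dog" is no closer to "cat" than "car" is
```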
Dense Embeddings: Learning Meaningful Representations
Instead of fixed one-hot vectors, we learn dense embeddings.
An embedding maps each token to a vector in a lower-dimensional continuous space:
token → ℝ^d
where:
d is the embedding dimension (e.g. 128, 512, 768)
Example:
"cat" β [0.12, -0.31, 0.78, ...]
"dog" β [0.15, -0.28, 0.74, ...]
Now:
- Similar words have similar vectors
- Distance and direction carry meaning
- Neural networks can operate naturally

Figure 1: One-hot vectors are sparse and encode no similarity. Dense embeddings are compact and capture semantic relationships.
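A quick sketch of this, using the example vectors above (truncated to three dimensions purely for illustration) plus a hypothetical vector for an unrelated word like "car":

```python
import numpy as np

# Illustrative 3-dimensional slices of the example embeddings above.
cat = np.array([0.12, -0.31, 0.78])
dog = np.array([0.15, -0.28, 0.74])
car = np.array([-0.65, 0.40, 0.05])  # hypothetical vector for an unrelated word

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: how aligned two vectors are, ignoring their length."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(cat, dog))  # close to 1: similar words point in similar directions
print(cosine(cat, car))  # much lower: distance and direction now carry meaning
```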
Embedding Layer = Lookup Table + Learning
An embedding layer is conceptually simple.
It is just a matrix:
E ∈ ℝ^(V×d)
Here:
V: vocabulary size
d: embedding dimension
Given a token ID i, the embedding is:
x = E[i]
That's it.
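A minimal sketch of this lookup, assuming PyTorch and small illustrative values for V and d:

```python
import torch

V, d = 5, 4            # illustrative vocabulary size and embedding dimension
E = torch.randn(V, d)  # the embedding matrix E ∈ R^(V×d)

i = 2                  # token ID for, say, "cat" in a toy vocabulary
x = E[i]               # the embedding is just row i of E
print(x.shape)         # torch.Size([4])
```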
During training:
- Gradients flow into this matrix
- Embeddings are learned end-to-end
- Meaning emerges from task optimization
There is no separate "semantic training phase": embeddings learn meaning because the model needs it.
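In PyTorch, this lookup table is nn.Embedding, whose weight matrix is an ordinary trainable parameter. A rough sketch of gradients flowing into it, using a dummy loss in place of a real task objective:

```python
import torch
import torch.nn as nn

V, d = 5, 4
embedding = nn.Embedding(V, d)    # internally a V×d weight matrix
token_ids = torch.tensor([2, 1])  # e.g. "cat", "dog" in a toy vocabulary

vectors = embedding(token_ids)    # lookup: shape (2, d)
loss = vectors.sum()              # stand-in for a real task loss
loss.backward()

# Only the rows that were looked up receive a nonzero gradient.
print(embedding.weight.grad)
```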
Why Attention Needs Embeddings
Attention computes similarity using dot products:
Attention(Q, K, V) ∝ QK^T
Dot products only make sense if:
- Inputs live in the same vector space
- Dimensions align
- Geometry is meaningful
Embeddings provide exactly that.
Without embeddings:
- Queries and keys would be meaningless
- Similarity scores would be arbitrary
- Attention would collapse
Figure 2: Embeddings map discrete tokens into a shared vector space where similarity and attention become meaningful.
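As a rough sketch (with random, purely illustrative embeddings and projection matrices, rather than trained ones), here is how embeddings feed directly into dot-product attention scores:

```python
import torch
import torch.nn as nn

d = 8                    # illustrative embedding dimension
x = torch.randn(3, d)    # embeddings for a 3-token sequence

# Hypothetical query/key projections; in a real model these are learned.
W_q = nn.Linear(d, d, bias=False)
W_k = nn.Linear(d, d, bias=False)
Q, K = W_q(x), W_k(x)    # queries and keys share the same vector space

scores = Q @ K.T / d ** 0.5        # scaled dot-product similarities, shape (3, 3)
weights = scores.softmax(dim=-1)   # attention weights over the sequence
print(weights)
```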
Token Embeddings Are Only Part of the Story
So far, embeddings only tell us what the token is.
They do not tell us:
- Where the token appears in the sequence
- Whether it comes before or after another token
Transformers solve this using positional information, which we'll cover in the next post.
For now, remember:
- Token embeddings encode identity and meaning
- Position must be injected separately
Embeddings Are Learned, Not Hardcoded
A common misconception is that embeddings are fixed or predefined.
In Transformers:
- Embeddings are trainable parameters
- They adapt to the task
- They change during fine-tuning
This is why:
- The same word has different embeddings in different models
- Domain-specific fine-tuning changes representation quality
Key Takeaways
- Neural networks cannot operate on discrete symbols
- One-hot encoding is insufficient for learning semantics
- Dense embeddings map tokens into continuous vector spaces
- Embedding layers are trainable lookup tables
- Attention relies entirely on embeddings to function
What's Next?
In the next post, we'll cover Positional Encoding: how Transformers understand order without recurrence or convolution.
