This is the second post in a series where we build Transformers and Vision Transformers from the ground up.
In the previous post, we covered the neural network fundamentals required for Transformers.
In this post, we focus on embeddings: the step that makes attention possible.
Why Embeddings Matter More Than Attention
Attention often gets all the credit in Transformers.
But attention operates on vectors, not on words, tokens, or pixels.
Before a Transformer can reason, compare, or attend to anything, we must answer a simple question:
How do we represent discrete symbols in a neural network?
The answer is embeddings. Without embeddings, attention has nothing to work on.
Neural Networks Don't Understand Tokens
Neural networks operate on:
- Real-valued vectors
- Continuous spaces
- Matrix multiplication
They do not understand words, characters, tokens, or categories.
If we fed raw token IDs directly into a Transformer, the model would assume numeric relationships that do not exist.
For example:
- Token ID 10 is not "twice" token ID 5
- Token ID 1000 is not "larger" than token ID 10 in any semantic sense
We need a representation that removes this artificial ordering.
One-Hot Encoding: The Simplest Idea
The most basic way to represent tokens is one-hot encoding. If the vocabulary size is V:
- Each token is a vector of length V
- Exactly one position is 1, the rest are 0
Example:
"cat" β [0, 0, 1, 0, 0, ...]
"dog" β [0, 1, 0, 0, 0, ...]
Problems with One-Hot Encoding
One-hot vectors:
- Are extremely sparse
- Do not encode similarity
- Scale poorly with large vocabularies
From a neural network's perspective:
- "cat" and "dog" are just as different as "cat" and "car"
This is not useful for learning language.
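To see the problem numerically, we can continue the hypothetical vocabulary from the sketch above: the dot product between any two distinct one-hot vectors is exactly zero, so the geometry carries no notion of similarity at all.

```python
# Continuing the one_hot() sketch above: all distinct one-hot pairs are orthogonal.
print(one_hot("cat") @ one_hot("dog"))  # 0.0
print(one_hot("cat") @ one_hot("car"))  # 0.0 -- "dog" is no closer to "cat" than "car" is
```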
Dense Embeddings: Learning Meaningful Representations
Instead of fixed one-hot vectors, we learn dense embeddings.
An embedding maps each token to a vector in a lower-dimensional continuous space:
token → ℝ^d
where:
d is the embedding dimension (e.g. 128, 512, 768)
Example:
"cat" β [0.12, -0.31, 0.78, ...]
"dog" β [0.15, -0.28, 0.74, ...]
Now:
- Similar words have similar vectors
- Distance and direction carry meaning
- Neural networks can operate naturally

Figure 1: One-hot vectors are sparse and encode no similarity. Dense embeddings are compact and capture semantic relationships.
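A quick sketch of this, using the example vectors above (truncated to three dimensions purely for illustration) plus a hypothetical vector for an unrelated word like "car":

```python
import numpy as np

# Illustrative 3-dimensional slices of the example embeddings above.
cat = np.array([0.12, -0.31, 0.78])
dog = np.array([0.15, -0.28, 0.74])
car = np.array([-0.65, 0.40, 0.05])  # hypothetical vector for an unrelated word

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: how aligned two vectors are, ignoring their length."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(cat, dog))  # close to 1: similar words point in similar directions
print(cosine(cat, car))  # much lower: distance and direction now carry meaning
```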
Embedding Layer = Lookup Table + Learning
An embedding layer is conceptually simple.
It is just a matrix:
E ∈ ℝ^(V×d)
Here:
V: vocabulary size
d: embedding dimension
Given a token ID i, the embedding is:
x = E[i]
That's it.
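A minimal sketch of this lookup, assuming PyTorch and small illustrative values for V and d:

```python
import torch

V, d = 5, 4            # illustrative vocabulary size and embedding dimension
E = torch.randn(V, d)  # the embedding matrix E ∈ R^(V×d)

i = 2                  # token ID for, say, "cat" in a toy vocabulary
x = E[i]               # the embedding is just row i of E
print(x.shape)         # torch.Size([4])
```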
During training:
- Gradients flow into this matrix
- Embeddings are learned end-to-end
- Meaning emerges from task optimization
There is no separate "semantic training phase": embeddings learn meaning because the model needs it.
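In PyTorch, this lookup table is nn.Embedding, whose weight matrix is an ordinary trainable parameter. A rough sketch of gradients flowing into it, using a dummy loss in place of a real task objective:

```python
import torch
import torch.nn as nn

V, d = 5, 4
embedding = nn.Embedding(V, d)    # internally a V×d weight matrix
token_ids = torch.tensor([2, 1])  # e.g. "cat", "dog" in a toy vocabulary

vectors = embedding(token_ids)    # lookup: shape (2, d)
loss = vectors.sum()              # stand-in for a real task loss
loss.backward()

# Only the rows that were looked up receive a nonzero gradient.
print(embedding.weight.grad)
```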
Why Attention Needs Embeddings
Attention computes similarity using dot products:
Attention(Q, K, V) ∝ QK^T
Dot products only make sense if:
- Inputs live in the same vector space
- Dimensions align
- Geometry is meaningful
Embeddings provide exactly that.
Without embeddings:
- Queries and keys would be meaningless
- Similarity scores would be arbitrary
- Attention would collapse
Figure 2: Embeddings map discrete tokens into a shared vector space where similarity and attention become meaningful.
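As a rough sketch (with random, purely illustrative embeddings and projection matrices, rather than trained ones), here is how embeddings feed directly into dot-product attention scores:

```python
import torch
import torch.nn as nn

d = 8                    # illustrative embedding dimension
x = torch.randn(3, d)    # embeddings for a 3-token sequence

# Hypothetical query/key projections; in a real model these are learned.
W_q = nn.Linear(d, d, bias=False)
W_k = nn.Linear(d, d, bias=False)
Q, K = W_q(x), W_k(x)    # queries and keys share the same vector space

scores = Q @ K.T / d ** 0.5        # scaled dot-product similarities, shape (3, 3)
weights = scores.softmax(dim=-1)   # attention weights over the sequence
print(weights)
```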
Token Embeddings Are Only Part of the Story
So far, embeddings only tell us what the token is.
They do not tell us:
- Where the token appears in the sequence
- Whether it comes before or after another token
Transformers solve this using positional information, which we'll cover in the next post.
For now, remember:
- Token embeddings encode identity and meaning
- Position must be injected separately
Embeddings Are Learned, Not Hardcoded
A common misconception is that embeddings are fixed or predefined.
In Transformers:
- Embeddings are trainable parameters
- They adapt to the task
- They change during fine-tuning
This is why:
- The same word has different embeddings in different models
- Domain-specific fine-tuning changes representation quality
Key Takeaways
- Neural networks cannot operate on discrete symbols
- One-hot encoding is insufficient for learning semantics
- Dense embeddings map tokens into continuous vector spaces
- Embedding layers are trainable lookup tables
- Attention relies entirely on embeddings to function
What's Next?
In the next post, we'll cover Positional Encoding: how Transformers understand order without recurrence or convolution.
