🧠 Tokens in Transformers: Developer Notes
🔹 What is a Token?
A token is the smallest unit of text that a transformer model processes.
It is created by a tokenizer and then converted into numerical IDs before entering the model.
⚠️ Important: a token is not always a full word.
🔹 What Can Be a Token?
Depending on the tokenizer, a token may be:
- whole word
- subword (most common)
- character
- punctuation
- special symbol
✅ Modern transformers mainly use subword tokenization.
🔹 Example
Sentence:
I like eating apples
Possible subword tokens:
[I] [like] [eat] [##ing] [apple] [##s]
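
The exact splits depend on the tokenizer's vocabulary, and the quickest way to see real output is to run a tokenizer directly. A minimal sketch using the Hugging Face transformers library and the bert-base-uncased checkpoint (both are illustrative assumptions, not requirements of these notes):

```python
from transformers import AutoTokenizer

# Assumption: transformers is installed and bert-base-uncased can be downloaded.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Common words may stay whole; rarer words are split into subwords marked with "##".
print(tokenizer.tokenize("I like eating apples"))
print(tokenizer.tokenize("tokenization"))  # typically something like ['token', '##ization']
```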
🔹 Transformer Processing Pipeline
Raw Text → Tokenizer → Tokens → Token IDs → Embeddings → Transformer
Neural networks only understand numbers, so tokens must be converted to IDs and then to vectors.
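
A sketch of the same pipeline in code (again assuming the Hugging Face transformers library and the bert-base-uncased checkpoint):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Raw Text -> Tokens -> Token IDs
inputs = tokenizer("I like eating apples", return_tensors="pt")
print(inputs["input_ids"])      # integer IDs, shape (1, n)

# Token IDs -> Embeddings (lookup in the model's input embedding table)
with torch.no_grad():
    embeddings = model.get_input_embeddings()(inputs["input_ids"])
print(embeddings.shape)         # (1, n, hidden_size)
```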
🔹 Why Tokenization Is Needed
Tokenization helps to:
- reduce vocabulary size
- handle unknown words
- capture morphology
- improve generalization
- enable efficient training
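
Two of these points are easy to check directly. A quick sketch reusing the tokenizer from the examples above (the invented word is purely illustrative):

```python
# The subword vocabulary is small and fixed, yet an invented word is still
# typically covered by known subword pieces rather than collapsing to [UNK].
print(len(tokenizer))                             # ~30k entries for bert-base-uncased
print(tokenizer.tokenize("hyperquantumization"))  # falls back to known subwords
```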
🔹 Special Tokens (Encoder Models)
Typical encoder input:
[CLS] I like apples [SEP]
Roles
- [CLS] → sentence-level representation
- [SEP] → separator between sentences
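
The tokenizer inserts these special tokens automatically; a small sketch, again assuming the BERT tokenizer from the earlier examples:

```python
# Encoding a sentence pair shows both [CLS] and the [SEP] separators.
encoded = tokenizer("I like apples", "They are sweet")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))
# Expected pattern: ['[CLS]', ..., '[SEP]', ..., '[SEP]']
```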
Token in Vector Form
🔹 Hidden Size Rule
If a Transformer model has:
hidden_size = 768
→ Every token is represented by a 768-dimensional vector inside the model.
- Fixed by architecture
- Independent of sentence length
- Same dimension for all tokens
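
The hidden size can be read straight from the model config; a sketch assuming the microsoft/deberta-base checkpoint:

```python
from transformers import AutoConfig

# hidden_size is fixed by the architecture, not by the input text.
config = AutoConfig.from_pretrained("microsoft/deberta-base")
print(config.hidden_size)   # 768 for the base model
```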
🔹 For n Tokens
If tokenization produces n tokens, the representation matrix is:
H ∈ ℝ^(n × 768)
Meaning
- n → number of tokens (rows)
- 768 → hidden dimension (columns)
- Each row → contextual vector of one token
🔹 Example
Sentence → tokenized into:
7 subword tokens
Then:
H ∈ ℝ^(7 × 768)
→ 7 token vectors
→ each vector size = 768
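
The same shape shows up on a real forward pass. A sketch reusing the BERT model and inputs from the pipeline example; the exact token count n depends on how the sentence is split and includes [CLS] and [SEP]:

```python
# Reuses `torch`, `model`, and `inputs` from the pipeline sketch above.
with torch.no_grad():
    outputs = model(**inputs)

# (batch=1, n, 768): one 768-dimensional contextual vector per token.
print(outputs.last_hidden_state.shape)
```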
🔹 Important Clarification
❌ The tokens are NOT multiplied by 768; (7 × 768) is a matrix shape
✅ Token vectors are stacked row by row
Correct view:
token₁ ∈ ℝ^768
token₂ ∈ ℝ^768
...
token₇ ∈ ℝ^768
Stacked as:
H = [token₁_vec
     token₂_vec
     ...
     token₇_vec] ∈ ℝ^(7 × 768)
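
The stacking itself is nothing more than placing the per-token vectors into the rows of one matrix; an illustration with random vectors (PyTorch assumed):

```python
import torch

# Illustration only: 7 per-token vectors of size 768, stacked row by row.
token_vecs = [torch.randn(768) for _ in range(7)]
H = torch.stack(token_vecs)   # rows = tokens, columns = hidden dimensions
print(H.shape)                # torch.Size([7, 768])
```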
🔹 Shape Through Encoder Layers
In encoder models (like DeBERTa):
Input shape = (n × 768)
After layer 1 = (n × 768)
After layer 2 = (n × 768)
...
✅ Shape stays constant
✅ Only the values become more contextual
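
This is easy to verify by asking the model for every layer's output; a sketch reusing the model and inputs from the earlier forward pass:

```python
# Reuses `torch`, `model`, and `inputs` from the pipeline sketch above.
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states = embedding output plus one tensor per encoder layer.
# Every entry keeps the same (1, n, 768) shape; only the values change.
for i, h in enumerate(outputs.hidden_states):
    print(i, h.shape)
```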
🔹 Mental Model
Number of tokens → number of rows
Hidden size → number of columns
In models like DeBERTa-base, each token is mapped to a fixed 768-dimensional vector, so a sentence with n tokens produces a representation matrix of shape (n × 768).
🔹 Important Interview Points
- Token ≠ word
- Most transformers use subword tokens
- Tokenizer is model-specific
- Tokens are converted to IDs before embeddings
- Similar tokens → similar vectors (in context)
🔹 Common Tokenizer Types
| Model Family | Tokenizer |
|---|---|
| BERT / DeBERTa | WordPiece |
| GPT family | BPE |
| LLaMA | SentencePiece |
🔹 One-Line Mental Model
Token = smallest text unit the transformer understands.
🔹 Ultra-Short Interview Answer
A token is the smallest textual unit produced by a tokenizer and converted into numerical form so that a transformer model can process input text.