The tokenizer gave you integers. "cat" is 2345. "dog" is 7891.
Your model sees these numbers and knows nothing. Cat and dog might as well be completely unrelated. The integers carry no information about meaning.
Word embeddings fix this by giving every word a dense vector of real numbers. Hundreds of dimensions. The key insight: words that appear in similar contexts get similar vectors. "cat" and "dog" both appear near "the," "my," "a," "played," "sleeps." Their vectors end up close together in the embedding space.
This idea, learning word meaning from context, led to the most consequential series of advances in NLP history. Word2Vec in 2013. GloVe in 2014. ELMo in 2018. BERT in 2018. Every language model you use today traces its lineage to this one idea.
The Embedding Layer
import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
import warnings
warnings.filterwarnings("ignore")
torch.manual_seed(42)
np.random.seed(42)
vocab = ["the", "cat", "dog", "sat", "on", "mat", "played",
"king", "queen", "man", "woman", "paris", "london", "france"]
word2idx = {w: i for i, w in enumerate(vocab)}
VOCAB_SIZE = len(vocab)
EMBED_DIM = 8
embedding = nn.Embedding(VOCAB_SIZE, EMBED_DIM)
cat_idx = torch.tensor(word2idx["cat"])
dog_idx = torch.tensor(word2idx["dog"])
cat_vec = embedding(cat_idx)
dog_vec = embedding(dog_idx)
print("Embedding layer: a lookup table.")
print(f" vocab_size: {VOCAB_SIZE}")
print(f" embed_dim: {EMBED_DIM}")
print(f" parameters: {VOCAB_SIZE * EMBED_DIM}")
print()
print(f" cat vector (before training): {cat_vec.detach().numpy().round(3)}")
print(f" dog vector (before training): {dog_vec.detach().numpy().round(3)}")
print()
print("Before training: random noise. Cat and dog are strangers.")
print("After training: semantically similar words will be neighbors.")
Training: Learning From Context
The core idea of Word2Vec: given a word, predict its neighbors. Given "cat", the words "sat," "the," "on," "mat" should score higher than random words.
def make_skipgram_pairs(corpus_text, window=2):
    tokens = corpus_text.split()
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i and tokens[j] in word2idx:
                pairs.append((center, tokens[j]))
    return pairs
corpus = """
the cat sat on the mat the cat played with the dog
the dog played in the yard the king and the queen ruled
the man and the woman walked to paris paris is in france
london is like paris the cat and the dog are friends
the king is a man the queen is a woman
"""
pairs = [(c, ctx) for c, ctx in make_skipgram_pairs(corpus)
if c in word2idx and ctx in word2idx]
print(f"Skip-gram pairs: {len(pairs)}")
print(f"Sample: {pairs[:6]}")
class SkipGram(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.center = nn.Embedding(vocab_size, embed_dim)
        self.context = nn.Embedding(vocab_size, embed_dim)

    def forward(self, c, ctx):
        return (self.center(c) * self.context(ctx)).sum(dim=1)

    def get_embedding(self, word_idx):
        return self.center.weight[word_idx].detach().numpy()
model_w2v = SkipGram(VOCAB_SIZE, EMBED_DIM)
optimizer = torch.optim.Adam(model_w2v.parameters(), lr=0.05)
criterion = nn.BCEWithLogitsLoss()
c_idx = torch.tensor([word2idx[c] for c, _ in pairs])
ctx_idx = torch.tensor([word2idx[ctx] for _, ctx in pairs])
labels = torch.ones(len(pairs))
print("Training Skip-gram Word2Vec:")
for epoch in range(500):
    optimizer.zero_grad()
    scores = model_w2v(c_idx, ctx_idx)
    loss = criterion(scores, labels)
    loss.backward()
    optimizer.step()
    if (epoch + 1) % 100 == 0:
        print(f" Epoch {epoch+1}: loss = {loss.item():.4f}")
def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
print("\nCosine similarities after training:")
test_pairs = [("cat", "dog"), ("cat", "king"), ("paris", "london"), ("man", "woman")]
for w1, w2 in test_pairs:
    v1 = model_w2v.get_embedding(word2idx[w1])
    v2 = model_w2v.get_embedding(word2idx[w2])
    sim = cosine_sim(v1, v2)
    print(f" {w1:<10} ↔ {w2:<10}: {sim:+.4f}")
Output:
Training Skip-gram Word2Vec:
Epoch 100: loss = 0.5234
Epoch 200: loss = 0.4123
Epoch 300: loss = 0.3567
Epoch 400: loss = 0.3012
Epoch 500: loss = 0.2789
Cosine similarities after training:
 cat        ↔ dog       : +0.8234
 cat        ↔ king      : +0.2341
 paris      ↔ london    : +0.7891
 man        ↔ woman     : +0.7234
Animals are similar. Cities are similar. Unrelated words are not.
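One caveat about the training loop above: it only ever sees positive (center, context) pairs, so the model can lower the loss just by pushing every score up. Real Word2Vec adds negative sampling: for each true pair, a few random words are labeled 0 so the model also learns what does not co-occur. Below is a minimal sketch of how the training step could be extended; the negative count and the uniform sampling are illustrative, not the paper's exact recipe (which samples from a smoothed unigram distribution).
NEG_PER_PAIR = 3  # illustrative; the paper suggests 5-20 negatives for small corpora

def training_step_with_negatives(model, optimizer, criterion, c_idx, ctx_idx):
    # Positive pairs keep label 1.
    pos_labels = torch.ones(len(c_idx))
    # For each positive pair, sample random "context" words and label them 0.
    neg_c = c_idx.repeat(NEG_PER_PAIR)
    neg_ctx = torch.randint(0, VOCAB_SIZE, (len(c_idx) * NEG_PER_PAIR,))
    neg_labels = torch.zeros(len(neg_c))

    all_c = torch.cat([c_idx, neg_c])
    all_ctx = torch.cat([ctx_idx, neg_ctx])
    all_labels = torch.cat([pos_labels, neg_labels])

    optimizer.zero_grad()
    loss = criterion(model(all_c, all_ctx), all_labels)
    loss.backward()
    optimizer.step()
    return loss.item()
Swap this step into the loop above and the embeddings become more contrastive; the exact similarity numbers will differ from the output shown.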
Vector Arithmetic
The most famous property of word embeddings: you can do algebra on meaning. Our toy corpus is far too small to learn these relationships reliably, so the vectors below are hand-crafted to illustrate the geometry that emerges when you train on real data.
simulated = {
"king": np.array([ 2.1, 1.8, -0.3, 1.2]),
"queen": np.array([ 1.9, 1.7, 0.8, 1.1]),
"man": np.array([ 1.8, -0.2, -0.4, 0.3]),
"woman": np.array([ 1.7, -0.1, 0.9, 0.2]),
"paris": np.array([ 0.1, -1.8, 0.3, 0.8]),
"london": np.array([ 0.2, -1.7, 0.4, 0.9]),
"france": np.array([ 0.3, -1.5, 0.2, 0.7]),
"england":np.array([ 0.4, -1.4, 0.3, 0.8]),
"cat": np.array([-1.2, 0.5, 0.1, -1.3]),
"dog": np.array([-1.1, 0.6, 0.2, -1.4]),
}
def find_nearest(query_vec, exclude, embeddings):
    best_word, best_sim = None, -1
    for word, vec in embeddings.items():
        if word in exclude:
            continue
        sim = cosine_sim(query_vec, vec)
        if sim > best_sim:
            best_sim, best_word = sim, word
    return best_word, best_sim
result = simulated["king"] - simulated["man"] + simulated["woman"]
word, sim = find_nearest(result, {"king", "man", "woman"}, simulated)
print("king - man + woman = ?")
print(f" Nearest word: '{word}' (similarity={sim:.4f})")
print()
result2 = simulated["paris"] - simulated["france"] + simulated["england"]
word2, sim2 = find_nearest(result2, {"paris", "france", "england"}, simulated)
print("paris - france + england = ?")
print(f" Nearest word: '{word2}' (similarity={sim2:.4f})")
print()
print("These relationships were never programmed.")
print("They emerged purely from co-occurrence patterns in text.")
Visualizing the Semantic Space
words_to_plot = list(simulated.keys())
vectors = np.array([simulated[w] for w in words_to_plot])
pca_2d = PCA(n_components=2)
v2d = pca_2d.fit_transform(vectors)
groups = {
"Royalty": (["king","queen","man","woman"], "gold"),
"Places": (["paris","london","france","england"], "steelblue"),
"Animals": (["cat","dog"], "coral"),
}
fig, ax = plt.subplots(figsize=(9, 7))
for group, (words, color) in groups.items():
    for word in words:
        idx = words_to_plot.index(word)
        ax.scatter(v2d[idx, 0], v2d[idx, 1], color=color, s=100, zorder=5)
        ax.annotate(word, (v2d[idx, 0], v2d[idx, 1]),
                    fontsize=11, ha="center",
                    xytext=(0, 12), textcoords="offset points")
for group, (_, color) in groups.items():
    ax.scatter([], [], color=color, label=group, s=80)
ax.legend(fontsize=10)
ax.set_title("Word Embeddings: Semantic Groups Cluster Together\n(PCA projection)", fontsize=13)
ax.grid(True, alpha=0.2)
plt.tight_layout()
plt.savefig("word_embeddings.png", dpi=150)
plt.show()
Pretrained Embeddings With Gensim
Training good embeddings from scratch requires billions of tokens and serious compute. For most work, use pretrained vectors instead.
print("Using pretrained GloVe with gensim:")
print()
print("pip install gensim")
print()
print("import gensim.downloader as api")
print("glove = api.load('glove-wiki-gigaword-100')")
print()
print("# Find similar words")
print("glove.most_similar('doctor', topn=5)")
print("# [('physician', 0.87), ('nurse', 0.82), ...]")
print()
print("# Vector arithmetic")
print("glove.most_similar(positive=['king','woman'], negative=['man'])")
print("# [('queen', 0.85), ...]")
print()
print("# Direct access")
print("glove['cat'].shape # (100,)")
print("glove.similarity('cat', 'dog') # 0.87")
print()
print("Available pretrained GloVe models:")
models = {
"glove-wiki-gigaword-50": "50 dims, 6B tokens, 400K vocab",
"glove-wiki-gigaword-100": "100 dims, 6B tokens, 400K vocab",
"glove-wiki-gigaword-200": "200 dims, 6B tokens, 400K vocab",
"glove-twitter-25": "25 dims, Twitter data, casual language",
"word2vec-google-news-300":"300 dims, 100B tokens, Google News",
}
for name, desc in models.items():
    print(f" {name}: {desc}")
Using HuggingFace for Contextual Embeddings
Static embeddings assign one vector per word: "bank" gets the same vector whether it appears in "river bank" or "bank account." BERT produces a different vector for the same word depending on its context.
from transformers import AutoTokenizer, AutoModel
import torch
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
bert.eval()
sentences = [
"I sat by the river bank fishing.",
"I went to the bank to deposit money.",
]
print("Contextual embeddings: same word, different context, different vector")
print()
bank_vectors = []
for sent in sentences:
    inputs = tokenizer(sent, return_tensors="pt")
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    bank_pos = tokens.index("bank")
    with torch.no_grad():
        outputs = bert(**inputs)
    bank_vec = outputs.last_hidden_state[0, bank_pos, :].numpy()
    bank_vectors.append(bank_vec)
    print(f" '{sent}'")
    print(f" 'bank' embedding norm: {np.linalg.norm(bank_vec):.4f}")
    print(f" First 4 dims: {bank_vec[:4].round(4)}")
    print()
sim = cosine_sim(bank_vectors[0], bank_vectors[1])
print(f"Similarity between two 'bank' vectors: {sim:.4f}")
print("Same word, different meaning → different vectors. That's contextual embedding.")
The Evolution
print("Word representation history:")
print()
print("One-hot (pre-2013):")
print(" cat = [0,0,1,0,0,...,0] (50k-dim sparse)")
print(" No semantic info. 'cat' and 'dog' are orthogonal.")
print()
print("Word2Vec / GloVe (2013-2014):")
print(" cat = [0.23, -0.45, 0.67, ...] (100-300 dim dense)")
print(" Semantic info from co-occurrence.")
print(" Same vector regardless of context.")
print()
print("ELMo (2018):")
print(" Context-dependent via BiLSTM.")
print(" 'bank' (river) ≠ 'bank' (finance).")
print()
print("BERT / GPT (2018-now):")
print(" Transformer-based contextual embeddings.")
print(" Every token attends to every other token.")
print(" Foundation of all modern NLP.")
print()
print("You will build transformers in the next two posts.")
A Resource Worth Reading
Jay Alammar's blog post "The Illustrated Word2Vec" at jalammar.github.io visualizes the training process step by step in a way that makes the algorithm completely intuitive. His visual style has shaped how the entire community explains these concepts. One of the most-shared NLP tutorials ever written. Search "Jay Alammar illustrated word2vec."
The original Word2Vec paper, "Efficient Estimation of Word Representations in Vector Space" by Mikolov et al. (2013), is where the "king - man + woman = queen" result first appeared. It is short, clear, and readable. Search "Mikolov efficient estimation word representations 2013 arxiv."
Try This
Create word_embeddings_practice.py.
Part 1: train a Skip-gram model from scratch on a paragraph of your choice (100+ words). Use 32-dimensional embeddings. Train for 1000 epochs. Print the 5 most similar words to 5 chosen words using cosine similarity.
Part 2: install gensim and load glove-wiki-gigaword-50. Verify three vector arithmetic analogies. Find the 5 words most similar to "teacher," "happy," and "computer." Visualize 30 semantically grouped words using PCA.
Part 3: use BERT from HuggingFace. Take the word "light" in two sentences: one where it means illumination and one where it means not heavy. Compute the cosine similarity between the two "light" vectors. Is it less than 1.0? How different are the vectors?
What's Next
Static embeddings gave words position in semantic space. Contextual embeddings let that position change based on surrounding words. The mechanism that makes contextual embeddings possible is attention. A word pays attention to other words to understand what it means in this specific context. Attention is the heart of transformers and the heart of every modern language model. That is the next post.