Google Search used to work by matching keywords.
You type "jaguar speed." You get pages about the Jaguar car. Because "speed" and "jaguar" appear on car performance pages. The fact that you might mean the animal does not matter. Keywords do not carry context.
In 2019, Google upgraded its search to use BERT. Now when you type "can you get medicine for someone pharmacy," the model understands that "for someone" means you are picking up a prescription for another person, not buying it for yourself. That context completely changes the relevant results.
BERT is a transformer encoder pretrained on 3.3 billion words using a clever self-supervised objective: predict randomly masked words. No labels required. The pretraining forces the model to build deep contextual understanding of language. Then you fine-tune on your specific task with a small labeled dataset.
The results shifted the entire field. In October 2018, BERT achieved state-of-the-art on 11 NLP benchmarks simultaneously. A single model, pretrained once, beating specialized models on every task.
How BERT Was Pretrained
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
from transformers import (BertTokenizer, BertModel, BertForSequenceClassification,
BertForMaskedLM, pipeline, AutoTokenizer, AutoModel)
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
torch.manual_seed(42)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print("BERT Pretraining Objectives:")
print()
print("1. MASKED LANGUAGE MODEL (MLM)")
print(" Randomly mask 15% of tokens.")
print(" Predict the masked tokens.")
print(" Uses BOTH left and right context simultaneously.")
print(" This is the 'bidirectional' in Bidirectional Encoder Representations.")
print()
print("2. NEXT SENTENCE PREDICTION (NSP)")
print(" Given sentence A and sentence B:")
print(" Predict whether B follows A in the original text.")
print(" Trains the model to understand sentence relationships.")
print()
sentence = "The capital of France is Paris and it is a beautiful city."
tokens = tokenizer.tokenize(sentence)
print(f"Original: '{sentence}'")
print(f"Tokens: {tokens}")
print()
masked_tokens = tokens.copy()
np.random.seed(42)
mask_indices = np.random.choice(len(tokens), size=int(0.15 * len(tokens)), replace=False)
for idx in mask_indices:
    masked_tokens[idx] = "[MASK]"
print(f"Masked: {masked_tokens}")
print("BERT must predict the masked tokens using all surrounding context.")
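The demo above masks tokens by hand; a pretrained BERT can actually fill the blanks through the fill-mask pipeline. A minimal sketch (the example sentence and top-5 display are mine, not from the original training setup):

```python
from transformers import pipeline

# fill-mask runs BERT's masked-language-model head directly
fill_mask = pipeline("fill-mask", model="bert-base-uncased", device=-1)

# BERT scores vocabulary candidates for the [MASK] slot
# using context from BOTH sides of the gap
predictions = fill_mask("The capital of France is [MASK].")
for pred in predictions:
    print(f"{pred['score']:.3f}  {pred['token_str']}")
```

The highest-scoring candidates should be contextually plausible completions. This is the pretraining objective described above, run at inference time.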
BERT's Architecture
bert_configs = {
"bert-base-uncased": {"layers": 12, "d_model": 768, "n_heads": 12, "params": "110M"},
"bert-large-uncased": {"layers": 24, "d_model": 1024, "n_heads": 16, "params": "340M"},
"bert-base-multilingual": {"layers": 12, "d_model": 768, "n_heads": 12, "params": "179M"},
"distilbert-base-uncased": {"layers": 6, "d_model": 768, "n_heads": 12, "params": "66M"},
}
print("BERT Model Variants:")
print(f"{'Model':<30} {'Layers':>8} {'d_model':>8} {'Heads':>8} {'Params':>10}")
print("=" * 68)
for name, cfg in bert_configs.items():
    print(f"{name:<30} {cfg['layers']:>8} {cfg['d_model']:>8} "
          f"{cfg['n_heads']:>8} {cfg['params']:>10}")
print()
bert = BertModel.from_pretrained("bert-base-uncased")
print(f"\nActual parameter count: {sum(p.numel() for p in bert.parameters()):,}")
print()
print("BERT Input Format:")
print(" [CLS] sentence_A [SEP] sentence_B [SEP]")
print()
text_a = "The cat sat on the mat."
text_b = "The dog played in the yard."
encoded = tokenizer(text_a, text_b)
tokens = tokenizer.convert_ids_to_tokens(encoded["input_ids"])
print(f" Tokens: {tokens}")
print(f" Type IDs: {encoded['token_type_ids']}")
print(f" 0 = sentence A, 1 = sentence B")
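Circling back to the parameter count printed earlier: a quick breakdown by top-level submodule (the grouping heuristic is mine, not from the paper) shows where bert-base's ~110M parameters sit — most in the 12 encoder layers, with the token embedding table taking most of the rest.

```python
from collections import Counter
from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-uncased")

# Group parameter counts by each parameter's top-level submodule name:
# 'embeddings', 'encoder', or 'pooler'
buckets = Counter()
for name, p in bert.named_parameters():
    buckets[name.split(".")[0]] += p.numel()

for module, count in buckets.most_common():
    print(f"{module:<12} {count:>12,}")
```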
Getting BERT Embeddings
bert.eval()
sentences = [
"The bank is on the river.",
"I deposited money at the bank.",
"The cat sat on the mat.",
"Dogs are friendly animals.",
]
all_cls_embeddings = []
for sentence in sentences:
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = bert(**inputs)
    cls_embedding = outputs.last_hidden_state[:, 0, :]
    all_cls_embeddings.append(cls_embedding.squeeze().numpy())
all_cls_embeddings = np.array(all_cls_embeddings)
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
emb_2d = pca.fit_transform(all_cls_embeddings)
fig, ax = plt.subplots(figsize=(9, 6))
colors = ["steelblue", "steelblue", "coral", "coral"]
for i, (sentence, color) in enumerate(zip(sentences, colors)):
    ax.scatter(emb_2d[i, 0], emb_2d[i, 1], color=color, s=120, zorder=5)
    ax.annotate(sentence[:30] + "...", (emb_2d[i, 0], emb_2d[i, 1]),
                fontsize=9, xytext=(8, 5), textcoords="offset points")
ax.set_title("BERT [CLS] Embeddings: Sentence Similarity in 2D", fontsize=12)
ax.grid(True, alpha=0.2)
plt.tight_layout()
plt.savefig("bert_embeddings.png", dpi=150)
plt.show()
from sklearn.metrics.pairwise import cosine_similarity
sim_matrix = cosine_similarity(all_cls_embeddings)
print("Cosine similarity between sentences:")
for i, s1 in enumerate(sentences):
    for j, s2 in enumerate(sentences):
        if i < j:
            print(f" {sim_matrix[i,j]:.3f} '{s1[:30]}' ↔ '{s2[:30]}'")
Fine-Tuning BERT for Text Classification
newsgroups = fetch_20newsgroups(
subset="all",
categories=["sci.space", "rec.sport.hockey",
"talk.politics.guns", "comp.graphics"],
remove=("headers", "footers", "quotes")
)
texts = newsgroups.data[:1200]
labels = newsgroups.target[:1200]
texts = [t[:512] for t in texts]
X_tr, X_te, y_tr, y_te = train_test_split(
texts, labels, test_size=0.2, random_state=42, stratify=labels)
print(f"Training samples: {len(X_tr)}")
print(f"Test samples: {len(X_te)}")
print(f"Classes: {newsgroups.target_names}")
print()
class NewsDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=128):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self): return len(self.texts)

    def __getitem__(self, i):
        enc = self.tokenizer(
            self.texts[i],
            max_length=self.max_len,
            padding="max_length",
            truncation=True,
            return_tensors="pt"
        )
        return {
            "input_ids": enc["input_ids"].squeeze(),
            "attention_mask": enc["attention_mask"].squeeze(),
            "label": torch.tensor(self.labels[i], dtype=torch.long)
        }
train_ds = NewsDataset(X_tr, y_tr, tokenizer)
test_ds = NewsDataset(X_te, y_te, tokenizer)
train_loader = DataLoader(train_ds, batch_size=16, shuffle=True)
test_loader = DataLoader(test_ds, batch_size=32, shuffle=False)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
bert_clf = BertForSequenceClassification.from_pretrained(
"bert-base-uncased",
num_labels=4
)
bert_clf = bert_clf.to(device)
optimizer = optim.AdamW(bert_clf.parameters(), lr=2e-5, weight_decay=0.01)
scheduler = optim.lr_scheduler.LinearLR(optimizer, start_factor=1.0,
end_factor=0.0, total_iters=len(train_loader)*3)
print("Fine-tuning BERT for news classification:")
print(f"{'Epoch':>6} {'Train Loss':>12} {'Train Acc':>10} {'Test Acc':>10}")
print("=" * 42)
for epoch in range(3):
    bert_clf.train()
    total_loss = correct = total = 0
    for batch in train_loader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels_b = batch["label"].to(device)
        optimizer.zero_grad()
        outputs = bert_clf(input_ids=input_ids,
                           attention_mask=attention_mask,
                           labels=labels_b)
        loss = outputs.loss
        loss.backward()
        torch.nn.utils.clip_grad_norm_(bert_clf.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        total_loss += loss.item()
        correct += outputs.logits.argmax(1).eq(labels_b).sum().item()
        total += labels_b.size(0)
    bert_clf.eval()
    t_correct = t_total = 0
    with torch.no_grad():
        for batch in test_loader:
            out = bert_clf(
                input_ids=batch["input_ids"].to(device),
                attention_mask=batch["attention_mask"].to(device)
            )
            t_correct += out.logits.argmax(1).eq(batch["label"].to(device)).sum().item()
            t_total += batch["label"].size(0)
    print(f"{epoch+1:>6} {total_loss/len(train_loader):>12.4f} "
          f"{correct/total:>10.2%} {t_correct/t_total:>10.2%}")
Output:
Fine-tuning BERT for news classification:
Epoch Train Loss Train Acc Test Acc
==========================================
1 0.4823 86.77% 90.42%
2 0.1912 94.12% 92.08%
3 0.0934 97.19% 93.33%
93% accuracy after three epochs of fine-tuning on 960 training examples. Compare that to the Naive Bayes and TF-IDF classifiers from Phase 6, which typically reach 80-85% on similar tasks.
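To make that comparison concrete, here is a hedged sketch of the classical baseline on the same data and split. The vectorizer settings are illustrative, and exact numbers will vary:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Same four categories and same 1200-sample slice as the BERT run above
newsgroups = fetch_20newsgroups(
    subset="all",
    categories=["sci.space", "rec.sport.hockey",
                "talk.politics.guns", "comp.graphics"],
    remove=("headers", "footers", "quotes"),
)
texts = [t[:512] for t in newsgroups.data[:1200]]
labels = newsgroups.target[:1200]
X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels)

# Bag-of-words baseline: no pretraining, no context, just term weights
vec = TfidfVectorizer(max_features=20000, ngram_range=(1, 2))
clf = LogisticRegression(max_iter=1000)
clf.fit(vec.fit_transform(X_tr), y_tr)
acc = accuracy_score(y_te, clf.predict(vec.transform(X_te)))
print(f"TF-IDF + LogisticRegression accuracy: {acc:.2%}")
```

This trains in seconds on a CPU, which is the real trade-off: BERT buys accuracy with far more compute.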
Using the HuggingFace Pipeline API
print("Quick BERT usage with HuggingFace Pipelines:")
print()
sentiment_pipeline = pipeline("sentiment-analysis",
model="distilbert-base-uncased-finetuned-sst-2-english",
device=-1)
test_sentences = [
"This movie was absolutely incredible!",
"I hated every minute of that film.",
"The food was okay, nothing special.",
"Best experience I have ever had.",
"Complete waste of money and time.",
]
print("Sentiment Analysis:")
for sentence in test_sentences:
    result = sentiment_pipeline(sentence)[0]
    print(f" {result['label']:<10} {result['score']:.3f} '{sentence}'")
Output:
Sentiment Analysis:
POSITIVE 0.999 'This movie was absolutely incredible!'
NEGATIVE 0.998 'I hated every minute of that film.'
NEGATIVE 0.576 'The food was okay, nothing special.'
POSITIVE 0.999 'Best experience I have ever had.'
NEGATIVE 0.998 'Complete waste of money and time.'
ner_pipeline = pipeline("ner",
model="dbmdz/bert-large-cased-finetuned-conll03-english",
aggregation_strategy="simple",
device=-1)
ner_text = "Elon Musk founded SpaceX in Hawthorne, California in 2002."
entities = ner_pipeline(ner_text)
print(f"\nNamed Entity Recognition:")
print(f"Text: '{ner_text}'")
print("Entities found:")
for ent in entities:
    print(f" {ent['entity_group']:<6} '{ent['word']}' (score={ent['score']:.3f})")
Output:
Named Entity Recognition:
Text: 'Elon Musk founded SpaceX in Hawthorne, California in 2002.'
Entities found:
PER 'Elon Musk' (score=0.999)
ORG 'SpaceX' (score=0.998)
LOC 'Hawthorne' (score=0.987)
LOC 'California' (score=0.996)
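Extractive question answering is a third pipeline that BERT-family encoders handle well: the model selects an answer span from the passage rather than generating text. A sketch using one common public SQuAD fine-tune (the checkpoint choice and example passage are mine):

```python
from transformers import pipeline

# A distilled BERT-family model fine-tuned on SQuAD for extractive QA
qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad",
              device=-1)

context = ("BERT was released by Google in 2018. It is pretrained with a "
           "masked language model objective on books and Wikipedia.")
result = qa(question="Who released BERT?", context=context)
print(f"answer='{result['answer']}' (score={result['score']:.3f})")
```

Note that the answer is always a literal substring of the context; an encoder can point at text, it cannot compose new text.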
BERT for Semantic Search
def encode_sentences(sentences, model, tokenizer, device="cpu"):
    model.eval()
    embeddings = []
    for sent in sentences:
        inputs = tokenizer(sent, return_tensors="pt",
                           truncation=True, max_length=128,
                           padding=True).to(device)
        with torch.no_grad():
            out = model(**inputs)
        cls_emb = out.last_hidden_state[:, 0, :].squeeze().cpu().numpy()
        embeddings.append(cls_emb)
    return np.array(embeddings)
from sklearn.metrics.pairwise import cosine_similarity
document_corpus = [
"Machine learning is a subset of artificial intelligence.",
"Deep learning uses neural networks with many layers.",
"The stock market crashed in 2008 causing global recession.",
"Python is the most popular programming language for data science.",
"Transformers revolutionized natural language processing in 2017.",
"The Amazon rainforest produces 20% of the world's oxygen.",
"BERT stands for Bidirectional Encoder Representations from Transformers.",
"Random forests combine many decision trees to improve accuracy.",
]
query = "How do neural networks understand text?"
doc_embeddings = encode_sentences(document_corpus, bert, tokenizer)
query_embedding = encode_sentences([query], bert, tokenizer)
similarities = cosine_similarity(query_embedding, doc_embeddings)[0]
ranked = sorted(enumerate(similarities), key=lambda x: -x[1])
print(f"Semantic Search: '{query}'")
print()
print(f"{'Rank':>5} {'Score':>8} Document")
print("-" * 65)
for rank, (idx, score) in enumerate(ranked[:4], 1):
    print(f"{rank:>5} {score:>8.4f} {document_corpus[idx]}")
What BERT Cannot Do
print("BERT limitations:")
print()
print("1. CANNOT GENERATE TEXT")
print(" BERT is an encoder. It understands, it does not write.")
print(" Bidirectional attention means no causal left-to-right generation.")
print()
print("2. FIXED CONTEXT WINDOW")
print(" bert-base: max 512 tokens.")
print(" Longer documents must be chunked or truncated.")
print()
print("3. EXPENSIVE INFERENCE")
print(" 110M parameters for every prediction.")
print(" DistilBERT (66M) is 40% smaller and 60% faster with ~97% of the performance.")
print()
print("4. PRETRAINING DOMAIN MATTERS")
print(" bert-base trained on books + Wikipedia.")
print(" For medical text: BioBERT.")
print(" For legal text: LegalBERT.")
print(" For code: CodeBERT.")
print(" General BERT underperforms on specialized domains.")
print()
print("5. REQUIRES FINE-TUNING PER TASK")
print(" BERT with no task head outputs embeddings, not answers.")
print(" GPT-style models can answer questions with zero or few-shot prompting.")
print(" This prompted the shift toward GPT-style models for general AI assistants.")
A Resource Worth Reading
The original BERT paper, "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" by Devlin et al. (2018) from Google AI, is essential reading. It clearly describes both pretraining objectives, the fine-tuning approach, and the results across 11 benchmarks. It is one of the most cited papers in NLP history. Search "Devlin BERT pre-training deep bidirectional transformers 2018 arxiv."
Jay Alammar's "The Illustrated BERT, ELMo, and co." at jalammar.github.io visualizes the masking strategy, the input format, and the fine-tuning approach with characteristic clarity. The companion piece to his transformer visualization. Mandatory reading for understanding how pretraining works visually. Search "Jay Alammar illustrated BERT ELMo."
Try This
Create bert_practice.py.
Part 1: semantic similarity. Load BERT. Encode 10 pairs of sentences (5 similar, 5 unrelated). Compute cosine similarity for each pair. Do the similar pairs consistently score higher? Plot as a bar chart.
Part 2: masked language model. Load BertForMaskedLM. Take three sentences and mask different words (important nouns, verbs, adjectives). Print the top 5 predictions for each mask. Does BERT predict contextually appropriate words?
Part 3: fine-tuning. Fine-tune BertForSequenceClassification on the IMDB sentiment dataset (or any binary classification dataset). Train for 3 epochs. Report accuracy, precision, recall, and F1. Compare to TF-IDF + LogisticRegression from Phase 6.
Part 4: feature extraction. Use BERT as a frozen feature extractor. Extract [CLS] embeddings for all samples. Train an sklearn LogisticRegression on top. How does this compare to full fine-tuning in accuracy and training time?
What's Next
BERT reads text from both directions simultaneously. It understands language. But it cannot generate language.
GPT flips the design: decoder only, causal attention, trained to predict the next token. This one change is responsible for ChatGPT, text generation, code completion, and every generative AI application. Next post.