Akhilesh

Posted on May 22

94. HuggingFace: Your Library for Every Pretrained Model

#ai #productivity #beginners #huggingface

You could implement every model from scratch. We did that in the last few posts.

You won't do that in practice.

HuggingFace is where nearly every open state-of-the-art model lives. BERT, GPT-2, LLaMA, Stable Diffusion, Whisper, CLIP. All of them. You load most of them in about three lines. You fine-tune them in about twenty. You push your own trained models to share with the world.

It's the npm of machine learning. You need to know it.

What You'll Learn Here

The four core HuggingFace libraries and what each does
Pipelines: the fastest path from model to output
Tokenizers: everything you need to know about preprocessing
Loading models and configs from the hub
The Datasets library: clean data loading and processing
AutoClasses: one line for any task
Saving and pushing models to the hub
Practical patterns for real projects

Installing HuggingFace

pip install transformers datasets accelerate evaluate torch
pip install sentencepiece  # for some tokenizers

The Four Core Libraries

# 1. transformers - models and tokenizers
from transformers import (
    AutoModel, AutoTokenizer,
    AutoModelForSequenceClassification,
    AutoModelForTokenClassification,
    AutoModelForQuestionAnswering,
    AutoModelForCausalLM,
    pipeline
)

# 2. datasets - loading and processing datasets
from datasets import load_dataset, Dataset, DatasetDict

# 3. accelerate - distributed training and GPU management
from accelerate import Accelerator

# 4. evaluate - metrics and evaluation
import evaluate

print("HuggingFace libraries loaded")

Pipelines: From Zero to Working Model in One Call

Pipelines are the highest-level abstraction. One function call handles tokenization, model forward pass, and output decoding.

from transformers import pipeline

# Text classification / sentiment
classifier = pipeline('sentiment-analysis')
results = classifier([
    "This product changed my life.",
    "Absolute garbage, don't waste your money.",
    "It's fine I guess."
])
for r in results:
    print(f"  {r['label']:<10} {r['score']:.4f}")

Output:

  POSITIVE   0.9998
  NEGATIVE   0.9997
  NEGATIVE   0.6421
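
The label and score come from the last stage of the pipeline, output decoding. For a classifier that is roughly a softmax over the model's raw logits plus a lookup in a label table. Here is a minimal sketch of that step in plain Python. The logits and the `id2label` table are made up for illustration, and `decode_classification` is a hypothetical helper, not part of transformers.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode_classification(logits, id2label):
    """Pick the most probable class and report it in pipeline-style form."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}

# Hypothetical logits for one sentence; a real model would produce these.
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
result = decode_classification([-2.0, 3.0], id2label)
print(result["label"], round(result["score"], 4))   # POSITIVE 0.9933
```

A score near 0.5 on a two-class model means the logits were close together, which is what happens with lukewarm text like "It's fine I guess."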

# Named Entity Recognition
ner = pipeline('ner', aggregation_strategy='simple')  # replaces the deprecated grouped_entities=True
entities = ner("Elon Musk is the CEO of SpaceX, based in Hawthorne, California.")
for e in entities:
    print(f"  {e['entity_group']:<6} '{e['word']}'  score={e['score']:.3f}")

Output:

  PER    'Elon Musk'  score=0.999
  ORG    'SpaceX'     score=0.997
  LOC    'Hawthorne'  score=0.982
  LOC    'California' score=0.994

# Question Answering
qa = pipeline('question-answering')
answer = qa(
    question="What is the boiling point of water?",
    context="Water boils at 100 degrees Celsius at standard atmospheric pressure."
)
print(f"Answer: '{answer['answer']}'  score={answer['score']:.3f}")

# Text generation
generator = pipeline('text-generation', model='gpt2')
result = generator("The best way to learn programming is", max_new_tokens=40, do_sample=True)
print(result[0]['generated_text'])

# Summarization
summarizer = pipeline('summarization', model='facebook/bart-large-cnn')
text = """
Machine learning is a method of data analysis that automates analytical model building.
It is based on the idea that systems can learn from data, identify patterns and make decisions
with minimal human intervention. Machine learning is a type of artificial intelligence that
allows software applications to become more accurate at predicting outcomes without being
explicitly programmed to do so.
"""
summary = summarizer(text, max_length=50, min_length=20)
print(summary[0]['summary_text'])

# Translation
translator = pipeline('translation_en_to_fr', model='Helsinki-NLP/opus-mt-en-fr')
result = translator("Hello, how are you today?")
print(result[0]['translation_text'])

# Zero-shot classification (no fine-tuning needed)
zero_shot = pipeline('zero-shot-classification')
result = zero_shot(
    "This new drug reduced tumor size by 40% in clinical trials.",
    candidate_labels=['medicine', 'sports', 'technology', 'politics']
)
print("\nZero-shot classification:")
for label, score in zip(result['labels'], result['scores']):
    print(f"  {label:<12}: {score:.3f}")

# Image classification (yes, it does vision too)
img_classifier = pipeline('image-classification',
                           model='google/vit-base-patch16-224')
# Pass image path or URL
# result = img_classifier('cat.jpg')

# Speech recognition
# transcriber = pipeline('automatic-speech-recognition', model='openai/whisper-base')
# result = transcriber('audio.mp3')

print("\nAll these pipelines use the same interface.")
print("Swap the model name to change the model. That's it.")

Specifying a Device

from transformers import pipeline
import torch

device = 0 if torch.cuda.is_available() else -1   # 0 = first GPU, -1 = CPU

classifier = pipeline('sentiment-analysis', device=device)
print(f"Pipeline running on: {'GPU' if device == 0 else 'CPU'}")

Tokenizers: Everything You Need to Know

The tokenizer converts raw text into token IDs the model understands.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Basic tokenization
text = "HuggingFace makes NLP surprisingly easy."
tokens = tokenizer(text)

print(f"Input IDs:      {tokens['input_ids']}")
print(f"Attention mask: {tokens['attention_mask']}")
print()

# See the actual tokens
decoded_tokens = tokenizer.convert_ids_to_tokens(tokens['input_ids'])
print(f"Tokens: {decoded_tokens}")

Output:

Input IDs:      [101, 17662, 12172, 3084, ..., 1012, 102]
Attention mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Tokens: ['[CLS]', 'hugging', '##face', 'makes', 'nl', '##p', 'surprisingly', 'easy', '.', '[SEP]']

"HuggingFace" splits into "hugging" + "##face". "NLP" splits into "nl" + "##p". The ## prefix means subword continuation. This is WordPiece tokenization.
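
To see where those splits come from, here is a toy version of the greedy longest-match-first rule that WordPiece uses at tokenization time. It is a sketch under assumptions: `toy_vocab` is a tiny made-up vocabulary and `wordpiece` is a hypothetical helper, while the real tokenizer works against BERT's full vocabulary of roughly 30,000 entries and handles casing and punctuation first.

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Split one lowercase word into subwords by greedy longest-match-first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate   # mark a continuation piece
            if candidate in vocab:
                piece = candidate
                break
            end -= 1                           # try a shorter substring
        if piece is None:
            return [unk]                       # nothing matched: whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# Tiny hypothetical vocabulary, just enough for the words discussed above.
toy_vocab = {"hugging", "##face", "nl", "##p", "makes", "easy"}

print(wordpiece("huggingface", toy_vocab))  # ['hugging', '##face']
print(wordpiece("nlp", toy_vocab))          # ['nl', '##p']
print(wordpiece("xyz", toy_vocab))          # ['[UNK]']
```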

# Batch tokenization with padding and truncation
texts = [
    "Short text.",
    "This is a much longer piece of text that will need to be truncated or padded.",
    "Medium length sentence here."
]

batch = tokenizer(
    texts,
    padding=True,       # pad shorter sequences to the longest one in the batch
    truncation=True,    # truncate longer sequences
    max_length=20,      # max token length
    return_tensors='pt' # return PyTorch tensors
)

print(f"Input IDs shape: {batch['input_ids'].shape}")
print(f"\nInput IDs:\n{batch['input_ids']}")
print(f"\nAttention mask:\n{batch['attention_mask']}")
print("\nZeros in attention mask = padding positions (model ignores these)")

Output:

Input IDs shape: torch.Size([3, 20])

Input IDs:
tensor([[ 101, 2460, 3793, 1012,  102,    0,    0,    0, ...],
        [ 101, 2023, 2003, 1037, 2172, ...  102],
        [ 101, 5396, 3091, 6251, 2182, 1012,  102,    0, ...]])

Attention mask:
tensor([[1, 1, 1, 1, 1, 0, 0, 0, ...],
        [1, 1, 1, 1, 1, ... 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0, ...]])

Zeros in attention mask = padding positions (model ignores these)
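
If the relationship between padding, truncation, and the attention mask is still fuzzy, this plain-Python sketch may help. `pad_batch` is a hypothetical helper and the token IDs are invented. It also simplifies one thing: a real tokenizer keeps the closing [SEP] token when it truncates, while this version just cuts the list.

```python
def pad_batch(sequences, max_length, pad_id=0):
    """Truncate to max_length, then pad every sequence to the longest one left."""
    truncated = [seq[:max_length] for seq in sequences]
    width = max(len(seq) for seq in truncated)
    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = width - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # pad on the right
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# Hypothetical token IDs for a short and a long sequence.
ids, mask = pad_batch([[101, 7, 102], [101, 7, 8, 9, 10, 102]], max_length=5)
print(ids)   # [[101, 7, 102, 0, 0], [101, 7, 8, 9, 10]]
print(mask)  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```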

# Decode back to text
decoded = tokenizer.decode(batch['input_ids'][0], skip_special_tokens=True)
print(f"\nDecoded text 0: '{decoded}'")

# Fast tokenizer: returns word IDs for token-level tasks
fast_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', use_fast=True)
encoding = fast_tokenizer("John Smith works at OpenAI.", return_offsets_mapping=True)

print(f"\nWord IDs: {encoding.word_ids()}")
print(f"Character offsets: {encoding['offset_mapping']}")
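
Word IDs matter most for token-level tasks like NER, where your labels are per word but the model sees subword tokens. A common pattern is to label only the first piece of each word and mark everything else with -100, the index PyTorch's cross-entropy loss ignores by default. Here is a sketch of that alignment. `align_labels` is a hypothetical helper, and the `word_ids` list is written by hand rather than produced by a tokenizer.

```python
def align_labels(word_ids, word_labels, ignore_index=-100):
    """Give the first token of each word its label; mask everything else."""
    token_labels = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            token_labels.append(ignore_index)          # special tokens such as [CLS], [SEP]
        elif word_id != previous:
            token_labels.append(word_labels[word_id])  # first piece of a new word
        else:
            token_labels.append(ignore_index)          # later pieces of the same word
        previous = word_id
    return token_labels

# Hand-written word_ids: three words, the last one split into two pieces.
word_ids = [None, 0, 1, 2, 2, None]
word_labels = [1, 2, 0]   # one label ID per word (the meaning of each ID is up to you)
print(align_labels(word_ids, word_labels))  # [-100, 1, 2, 0, -100, -100]
```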

AutoClasses: One API for Any Model

Instead of importing a specific class for each architecture, AutoClasses read the checkpoint's config (its model_type field) to detect the architecture and load the right class.

from transformers import (
    AutoModel,
    AutoTokenizer,
    AutoModelForSequenceClassification,
    AutoModelForTokenClassification,
    AutoModelForQuestionAnswering,
    AutoModelForCausalLM,
    AutoModelForSeq2SeqLM,
)

# These all work with the same syntax regardless of architecture
# BERT, RoBERTa, DistilBERT, ALBERT, etc.

model_name = 'distilbert-base-uncased'

tokenizer   = AutoTokenizer.from_pretrained(model_name)
base_model  = AutoModel.from_pretrained(model_name)
clf_model   = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)
ner_model   = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=9)

print(f"Base model params:        {sum(p.numel() for p in base_model.parameters()):,}")
print(f"Classification params:    {sum(p.numel() for p in clf_model.parameters()):,}")
print(f"NER params:               {sum(p.numel() for p in ner_model.parameters()):,}")

# Swap to RoBERTa with one line change
model_name = 'roberta-base'
tokenizer_roberta = AutoTokenizer.from_pretrained(model_name)
clf_roberta       = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)
print(f"\nRoBERTa classification params: {sum(p.numel() for p in clf_roberta.parameters()):,}")
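
Conceptually, an AutoClass is a lookup table from an architecture name to a concrete class. The toy below shows that dispatch idea only. Every name in it (`ToyBert`, `ToyRoberta`, `MODEL_REGISTRY`, `ToyAutoModel`) is invented for illustration, and the real library does considerably more, such as downloading files and loading weights.

```python
class ToyBert:
    def __init__(self, config):
        self.config = config

class ToyRoberta:
    def __init__(self, config):
        self.config = config

# Hypothetical registry: model_type string -> concrete class.
MODEL_REGISTRY = {"bert": ToyBert, "roberta": ToyRoberta}

class ToyAutoModel:
    @staticmethod
    def from_config(config):
        """Look up the class named by config['model_type'] and build it."""
        model_type = config["model_type"]
        if model_type not in MODEL_REGISTRY:
            raise ValueError(f"Unknown model_type: {model_type!r}")
        return MODEL_REGISTRY[model_type](config)

model = ToyAutoModel.from_config({"model_type": "roberta", "hidden_size": 768})
print(type(model).__name__)  # ToyRoberta
```

This is why swapping checkpoints is a one-line change: your code names the task, and the config names the architecture.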

The Datasets Library

from datasets import load_dataset, Dataset
import pandas as pd

# Load a public dataset
dataset = load_dataset('imdb')
print(dataset)
print(f"\nTrain size: {len(dataset['train'])}")
print(f"Test size:  {len(dataset['test'])}")
print(f"\nExample:\n{dataset['train'][0]}")

Output:

DatasetDict({
    train: Dataset({features: ['text', 'label'], num_rows: 25000})
    test:  Dataset({features: ['text', 'label'], num_rows: 25000})
    unsupervised: Dataset({features: ['text', 'label'], num_rows: 50000})
})

Train size: 25000
Test size:  25000

Example:
{'text': 'I rented I AM CURIOUS...', 'label': 0}

# Filter, map, and process
small_train = dataset['train'].select(range(1000))   # first 1000 examples
positive    = dataset['train'].filter(lambda x: x['label'] == 1)

print(f"Small train: {len(small_train)}")
print(f"Positive reviews: {len(positive)}")

# Tokenize the whole dataset efficiently
def tokenize_function(examples):
    return tokenizer(
        examples['text'],
        truncation=True,
        padding='max_length',
        max_length=128
    )

tokenized = small_train.map(tokenize_function, batched=True)
tokenized = tokenized.remove_columns(['text'])
tokenized = tokenized.rename_column('label', 'labels')
tokenized.set_format('torch')

print(f"\nTokenized columns: {tokenized.column_names}")
print(f"First item keys:   {list(tokenized[0].keys())}")
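
One thing that trips people up with batched=True: your function receives a dict of lists (one list per column), not a single row, and it must return lists of matching length. Here is a plain-Python sketch of that contract. `batched_map` and `add_length` are hypothetical stand-ins, not the Datasets implementation.

```python
def batched_map(columns, fn, batch_size):
    """Apply fn to column-oriented slices and merge the new columns back in."""
    n_rows = len(next(iter(columns.values())))
    out = {name: [] for name in columns}
    for start in range(0, n_rows, batch_size):
        # Slice every column the same way so rows stay aligned.
        batch = {name: values[start:start + batch_size] for name, values in columns.items()}
        new_cols = fn(batch)  # fn sees lists, not single rows
        for name, values in {**batch, **new_cols}.items():
            out.setdefault(name, []).extend(values)
    return out

def add_length(batch):
    # Hypothetical batched function: one new column, one value per row.
    return {"n_chars": [len(t) for t in batch["text"]]}

data = {"text": ["good", "bad", "fine"], "label": [1, 0, 1]}
result = batched_map(data, add_length, batch_size=2)
print(result)
# {'text': ['good', 'bad', 'fine'], 'label': [1, 0, 1], 'n_chars': [4, 3, 4]}
```

Tokenizers are built to take lists of strings, which is why passing examples['text'] straight through works and why batching speeds things up.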

# Create a dataset from your own data
my_data = {
    'text':  ['This is great!', 'Terrible experience.', 'Pretty good overall.'],
    'label': [1, 0, 1]
}
custom_dataset = Dataset.from_dict(my_data)
print(f"\nCustom dataset: {custom_dataset}")

# From pandas dataframe
df = pd.DataFrame(my_data)
from_pandas = Dataset.from_pandas(df)
print(f"From pandas: {from_pandas}")

The Trainer API: Fine-Tuning Without Writing a Loop

The Trainer class handles the training loop, evaluation, checkpointing, and logging for you.

from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding
)
from datasets import load_dataset
import evaluate
import numpy as np

# Load data
raw_datasets = load_dataset('imdb')
small_train = raw_datasets['train'].select(range(2000))
small_test  = raw_datasets['test'].select(range(500))

# Tokenize
model_name = 'distilbert-base-uncased'
tokenizer  = AutoTokenizer.from_pretrained(model_name)

def tokenize(examples):
    return tokenizer(examples['text'], truncation=True, max_length=128)

train_ds = small_train.map(tokenize, batched=True)
test_ds  = small_test.map(tokenize,  batched=True)

# Data collator handles dynamic padding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Model
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Metrics
accuracy_metric = evaluate.load('accuracy')

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions    = np.argmax(logits, axis=-1)
    return accuracy_metric.compute(predictions=predictions, references=labels)

# Training configuration
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=2,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    eval_strategy='epoch',   # named evaluation_strategy in transformers releases before 4.41
    save_strategy='epoch',
    load_best_model_at_end=True,
    metric_for_best_model='accuracy',
    logging_steps=50,
    report_to='none'   # disable wandb/tensorboard for simplicity
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    tokenizer=tokenizer,   # recent transformers releases rename this argument to processing_class
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Train
trainer.train()

# Evaluate
results = trainer.evaluate()
print(f"\nFinal accuracy: {results['eval_accuracy']:.3f}")

The Trainer handles batching, gradient accumulation, mixed precision, distributed training, and checkpointing, all configurable through TrainingArguments. Early stopping is available too, by passing EarlyStoppingCallback in the Trainer's callbacks list.
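
Notice that the tokenize function in this section does not pad. It leaves padding to DataCollatorWithPadding, which pads each batch only to that batch's longest sequence. The sketch below counts how many padding tokens that can save compared with padding everything to a fixed length. `count_pad_tokens` is a hypothetical helper and the lengths are invented.

```python
def count_pad_tokens(lengths, batch_size, fixed_length=None):
    """Count padding tokens when batches are padded to fixed_length,
    or to each batch's own longest sequence when fixed_length is None."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        target = fixed_length if fixed_length is not None else max(batch)
        total += sum(target - n for n in batch)
    return total

# Hypothetical token counts for six already-truncated examples.
lengths = [12, 15, 9, 120, 128, 110]
print(count_pad_tokens(lengths, batch_size=3, fixed_length=128))  # 374
print(count_pad_tokens(lengths, batch_size=3))                    # 35
```

Fewer padding tokens means less wasted compute per step. How much you save depends on how varied your text lengths are.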

Saving and Loading Models

# Save locally
model.save_pretrained('./my_model')
tokenizer.save_pretrained('./my_model')
print("Saved to ./my_model/")

# Load from local
from transformers import AutoModelForSequenceClassification, AutoTokenizer
loaded_model     = AutoModelForSequenceClassification.from_pretrained('./my_model')
loaded_tokenizer = AutoTokenizer.from_pretrained('./my_model')
print("Loaded from local directory")

# Push to HuggingFace Hub (requires login)
# from huggingface_hub import notebook_login
# notebook_login()
# model.push_to_hub('your-username/your-model-name')
# tokenizer.push_to_hub('your-username/your-model-name')

The Model Hub: Finding What You Need

The HuggingFace hub at huggingface.co/models has 500,000+ models. Knowing how to search it matters.

from huggingface_hub import list_models, model_info

# List models for a specific task
models = list(list_models(
    task='text-classification',
    sort='downloads',
    limit=5
))

print("Top 5 text-classification models by downloads:")
for m in models:
    print(f"  {m.id}")

print()

# Get info about a specific model
info = model_info('distilbert-base-uncased-finetuned-sst-2-english')
print(f"Model: {info.id}")
print(f"Task:  {info.pipeline_tag}")
print(f"Downloads last month: {info.downloads:,}")

# Useful search patterns on the hub:

# For a specific task:
# task:text-classification
# task:token-classification
# task:question-answering
# task:translation
# task:summarization

# For a specific language:
# language:fr   (French)
# language:zh   (Chinese)

# For size constraints:
# Search for "distil" or "tiny" or "mini" models
# Examples:
#   distilbert-base-uncased  (66M params)
#   albert-base-v2           (12M params)
#   google/mobilebert-uncased (25M params)

Model Card: Understanding What You're Loading

Every good model on the hub has a model card. It tells you:

1. What the model was trained on
2. What tasks it's designed for
3. Known limitations and biases
4. Evaluation results
5. How to use it

Always read the model card before using a model in production. A model fine-tuned on English Wikipedia should not be used to classify French medical texts.

from transformers import AutoModelForSequenceClassification

# Print the config to inspect the architecture and label mapping
model = AutoModelForSequenceClassification.from_pretrained(
    'distilbert-base-uncased-finetuned-sst-2-english'
)
print(model.config)

The config tells you architecture details, number of labels, id2label mapping, and what the model was trained for.
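
A small detail about id2label: in the saved config.json its keys are strings, because JSON has no integer keys. The transformers config object gives you integer keys, but if you ever read the file yourself you need to convert them. A sketch, using a hand-written fragment modeled on that file rather than one downloaded from the hub:

```python
import json

# Hand-written fragment for illustration; a real config.json has many more fields.
config_json = '{"model_type": "distilbert", "id2label": {"0": "NEGATIVE", "1": "POSITIVE"}}'

config = json.loads(config_json)
id2label = {int(k): v for k, v in config["id2label"].items()}  # string keys -> int keys
label2id = {v: k for k, v in id2label.items()}                 # reverse mapping

predicted_class = 1   # e.g. the argmax over a model's logits
print(id2label[predicted_class])   # POSITIVE
print(label2id["NEGATIVE"])        # 0
```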

Quick Reference: Common Model Names

# Classification
'distilbert-base-uncased-finetuned-sst-2-english'   # sentiment (English)
'cardiffnlp/twitter-roberta-base-sentiment'          # tweet sentiment
'ProsusAI/finbert'                                   # financial sentiment

# NER
'dslim/bert-base-NER'                               # general NER
'Jean-Baptiste/roberta-large-ner-english'           # high accuracy NER

# Question Answering
'deepset/roberta-base-squad2'                       # extractive QA
'bert-large-uncased-whole-word-masking-finetuned-squad'  # high accuracy QA

# Text generation
'gpt2'                                              # small, fast
'gpt2-large'                                        # better quality
'microsoft/DialoGPT-medium'                         # conversational

# Summarization
'facebook/bart-large-cnn'                           # news summarization
't5-small'                                          # general seq2seq

# Translation
'Helsinki-NLP/opus-mt-en-fr'                        # English to French
'Helsinki-NLP/opus-mt-fr-en'                        # French to English

# Embeddings
'sentence-transformers/all-MiniLM-L6-v2'           # fast, good quality
'sentence-transformers/all-mpnet-base-v2'           # higher quality

Complete Practical Example: Build a Topic Classifier

from transformers import pipeline

# Use zero-shot for quick prototype (no training needed)
classifier = pipeline(
    'zero-shot-classification',
    model='facebook/bart-large-mnli'
)

articles = [
    "The Federal Reserve raised interest rates by 0.25 percentage points.",
    "Manchester City won the Premier League title for the fourth consecutive year.",
    "Scientists discovered a new species of deep-sea fish near the Mariana Trench.",
    "Apple unveiled the latest iPhone with improved camera capabilities.",
    "A new study shows that regular exercise reduces the risk of heart disease.",
]

labels = ['finance', 'sports', 'science', 'technology', 'health']

print(f"{'Article':<55} {'Prediction':<12} {'Score'}")
print("-" * 80)

for article in articles:
    result = classifier(article, candidate_labels=labels)
    top_label = result['labels'][0]
    top_score = result['scores'][0]
    print(f"{article[:52]+'...':<55} {top_label:<12} {top_score:.3f}")

Output (article text shortened here; exact scores will vary):

Article                                                 Prediction   Score
--------------------------------------------------------------------------------
The Federal Reserve raised interest rates...            finance      0.891
Manchester City won the Premier League...               sports       0.976
Scientists discovered a new species...                  science      0.934
Apple unveiled the latest iPhone...                     technology   0.887
A new study shows that regular exercise...              health       0.892
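
How does this work with no training? The model behind it, facebook/bart-large-mnli, is a natural language inference model. The pipeline turns each candidate label into a hypothesis sentence (the default template is "This example is {}."), asks the model whether the article entails it, and in single-label mode softmaxes the entailment scores across labels. The sketch below shows just that bookkeeping. `zero_shot_scores` is a hypothetical helper and the entailment logits are made up, since producing real ones needs the model.

```python
import math

def zero_shot_scores(labels, entailment_logits, template="This example is {}."):
    """Build one hypothesis per label and softmax the entailment logits."""
    hypotheses = [template.format(label) for label in labels]
    top = max(entailment_logits)                        # for numerical stability
    exps = [math.exp(x - top) for x in entailment_logits]
    total = sum(exps)
    scores = [e / total for e in exps]
    ranked = sorted(zip(labels, scores), key=lambda pair: pair[1], reverse=True)
    return hypotheses, ranked

# Made-up entailment logits, one per label; a real NLI model would supply them.
labels = ["finance", "sports", "science"]
hypotheses, ranked = zero_shot_scores(labels, [2.0, -1.0, 0.0])
print(hypotheses[0])   # This example is finance.
print(ranked[0][0])    # finance
```

Because scores are normalized across your labels, a badly chosen label set still produces a confident-looking winner. Pick labels that actually cover your data.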

Quick Cheat Sheet

Task	Pipeline name	Common model
Sentiment	`sentiment-analysis`	`distilbert-base-uncased-finetuned-sst-2-english`
NER	`ner`	`dslim/bert-base-NER`
QA	`question-answering`	`deepset/roberta-base-squad2`
Text gen	`text-generation`	`gpt2`
Summarization	`summarization`	`facebook/bart-large-cnn`
Translation	`translation_XX_to_YY`	`Helsinki-NLP/opus-mt-*`
Zero-shot	`zero-shot-classification`	`facebook/bart-large-mnli`
Embeddings	`feature-extraction`	`sentence-transformers/all-MiniLM-L6-v2`

HuggingFace task	Code
Load tokenizer	`AutoTokenizer.from_pretrained(name)`
Load model	`AutoModelForXxx.from_pretrained(name)`
Tokenize batch	`tokenizer(texts, padding=True, truncation=True, return_tensors='pt')`
Load dataset	`load_dataset('imdb')`
Fine-tune	`Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)`
Save	`model.save_pretrained('./dir')`
Push to hub	`model.push_to_hub('username/model-name')`

Practice Challenges

Level 1:
Use three different pipelines on the same piece of text: sentiment, NER, and zero-shot classification with 5 labels you choose. Print all results. Notice how easy swapping tasks is.

Level 2:
Load the imdb dataset. Tokenize the first 500 training examples with distilbert-base-uncased. Create a DataLoader from it. Run one forward pass and print the output shape. This is the full preprocessing pipeline for fine-tuning.

Level 3:
Use the Trainer API to fine-tune distilbert-base-uncased on any small classification dataset of your choice (emotion detection, spam classification, topic classification). Evaluate on a test set. Push the fine-tuned model to the HuggingFace hub and share the model card.

Next up, Post 95: Fine-Tuning LLMs: Make a General Model Do Your Specific Job. Dataset preparation, training config, evaluation, and the right way to fine-tune without destroying what the model already knows.
