Akhilesh

94. HuggingFace: Your Library for Every Pretrained Model

You could implement every model from scratch. We did that in the last few posts.

You won't do that in practice.

HuggingFace is where nearly every openly released state-of-the-art model lives. BERT, GPT-2, LLaMA, Stable Diffusion, Whisper, CLIP. You load most of them in three lines. You fine-tune them in about twenty. You push your own trained models to share with the world.

It's the npm of machine learning. You need to know it.


What You'll Learn Here

  • The four core HuggingFace libraries and what each does
  • Pipelines: the fastest path from model to output
  • Tokenizers: everything you need to know about preprocessing
  • Loading models and configs from the hub
  • The Datasets library: clean data loading and processing
  • AutoClasses: one line for any task
  • Saving and pushing models to the hub
  • Practical patterns for real projects

Installing HuggingFace

pip install transformers datasets accelerate evaluate
pip install torch           # PyTorch backend used by the examples below
pip install sentencepiece   # for some tokenizers

The Four Core Libraries

# 1. transformers - models and tokenizers
from transformers import (
    AutoModel, AutoTokenizer,
    AutoModelForSequenceClassification,
    AutoModelForTokenClassification,
    AutoModelForQuestionAnswering,
    AutoModelForCausalLM,
    pipeline
)

# 2. datasets - loading and processing datasets
from datasets import load_dataset, Dataset, DatasetDict

# 3. accelerate - distributed training and GPU management
from accelerate import Accelerator

# 4. evaluate - metrics and evaluation
import evaluate

print("HuggingFace libraries loaded")

Pipelines: From Zero to Working Model in One Call

Pipelines are the highest-level abstraction. One function call handles tokenization, model forward pass, and output decoding.

from transformers import pipeline

# Text classification / sentiment
classifier = pipeline('sentiment-analysis')
results = classifier([
    "This product changed my life.",
    "Absolute garbage, don't waste your money.",
    "It's fine I guess."
])
for r in results:
    print(f"  {r['label']:<10} {r['score']:.4f}")

Output:

  POSITIVE   0.9998
  NEGATIVE   0.9997
  NEGATIVE   0.6421

# Named Entity Recognition
ner = pipeline('ner', aggregation_strategy='simple')  # replaces the deprecated grouped_entities=True
entities = ner("Elon Musk is the CEO of SpaceX, based in Hawthorne, California.")
for e in entities:
    print(f"  {e['entity_group']:<6} '{e['word']}'  score={e['score']:.3f}")

Output:

  PER    'Elon Musk'  score=0.999
  ORG    'SpaceX'     score=0.997
  LOC    'Hawthorne'  score=0.982
  LOC    'California' score=0.994

# Question Answering
qa = pipeline('question-answering')
answer = qa(
    question="What is the boiling point of water?",
    context="Water boils at 100 degrees Celsius at standard atmospheric pressure."
)
print(f"Answer: '{answer['answer']}'  score={answer['score']:.3f}")

# Text generation
generator = pipeline('text-generation', model='gpt2')
result = generator("The best way to learn programming is", max_new_tokens=40, do_sample=True)
print(result[0]['generated_text'])

# Summarization
summarizer = pipeline('summarization', model='facebook/bart-large-cnn')
text = """
Machine learning is a method of data analysis that automates analytical model building.
It is based on the idea that systems can learn from data, identify patterns and make decisions
with minimal human intervention. Machine learning is a type of artificial intelligence that
allows software applications to become more accurate at predicting outcomes without being
explicitly programmed to do so.
"""
summary = summarizer(text, max_length=50, min_length=20)
print(summary[0]['summary_text'])

# Translation
translator = pipeline('translation_en_to_fr', model='Helsinki-NLP/opus-mt-en-fr')
result = translator("Hello, how are you today?")
print(result[0]['translation_text'])

# Zero-shot classification (no fine-tuning needed)
zero_shot = pipeline('zero-shot-classification')
result = zero_shot(
    "This new drug reduced tumor size by 40% in clinical trials.",
    candidate_labels=['medicine', 'sports', 'technology', 'politics']
)
print("\nZero-shot classification:")
for label, score in zip(result['labels'], result['scores']):
    print(f"  {label:<12}: {score:.3f}")

# Image classification (yes, it does vision too)
img_classifier = pipeline('image-classification',
                           model='google/vit-base-patch16-224')
# Pass image path or URL
# result = img_classifier('cat.jpg')

# Speech recognition
# transcriber = pipeline('automatic-speech-recognition', model='openai/whisper-base')
# result = transcriber('audio.mp3')

print("\nAll these pipelines use the same interface.")
print("Swap the model name to change the model. That's it.")
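Every pipeline call chains the three stages named at the top of this section: tokenization, a forward pass, and output decoding. As a rough, dependency-free sketch of that flow (the whitespace tokenizer, the word-score table, and all function names here are made up for illustration and are not how transformers implements it):

```python
import math

# Hypothetical stand-ins: a real pipeline uses a trained tokenizer and model.
WORD_SCORES = {"great": 2.0, "love": 1.5, "garbage": -2.5, "waste": -1.5}
ID2LABEL = {0: "NEGATIVE", 1: "POSITIVE"}

def preprocess(text):
    """Stage 1: turn raw text into tokens (here: lowercase whitespace split)."""
    return text.lower().replace(".", "").replace(",", "").split()

def forward(tokens):
    """Stage 2: produce one logit per class (here: a toy word-score sum)."""
    score = sum(WORD_SCORES.get(t, 0.0) for t in tokens)
    return [-score, score]  # [negative logit, positive logit]

def postprocess(logits):
    """Stage 3: softmax the logits and map the best index to a label."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    probs = [e / sum(exps) for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": ID2LABEL[best], "score": probs[best]}

def toy_pipeline(text):
    return postprocess(forward(preprocess(text)))

print(toy_pipeline("Absolute garbage, a waste of money."))
print(toy_pipeline("I love it, great product."))
```

The real pipelines differ in every stage's details, but the shape is the same: the returned dict with 'label' and 'score' is just the softmaxed logits run through the model's label mapping.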

Specifying a Device

from transformers import pipeline
import torch

device = 0 if torch.cuda.is_available() else -1   # 0 = first GPU, -1 = CPU

classifier = pipeline('sentiment-analysis', device=device)
print(f"Pipeline running on: {'GPU' if device == 0 else 'CPU'}")

Tokenizers: Everything You Need to Know

The tokenizer converts raw text into token IDs the model understands.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Basic tokenization
text = "HuggingFace makes NLP surprisingly easy."
tokens = tokenizer(text)

print(f"Input IDs:      {tokens['input_ids']}")
print(f"Attention mask: {tokens['attention_mask']}")
print()

# See the actual tokens
decoded_tokens = tokenizer.convert_ids_to_tokens(tokens['input_ids'])
print(f"Tokens: {decoded_tokens}")

Output:

Input IDs:      [101, 17662, 12172, 3084, 17953, 2102, 4074, 1012, 102]
Attention mask: [1, 1, 1, 1, 1, 1, 1, 1, 1]

Tokens: ['[CLS]', 'hugging', '##face', 'makes', 'nl', '##p', 'surprisingly', 'easy', '.', '[SEP]']

"HuggingFace" splits into "hugging" + "##face". "NLP" splits into "nl" + "##p". The ## prefix means subword continuation. This is WordPiece tokenization.
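To make the splitting rule concrete, here is a minimal sketch of the greedy longest-match-first lookup that WordPiece uses at inference time. The tiny vocabulary is made up for illustration (the real bert-base-uncased vocabulary has roughly 30,000 entries), and the function name is hypothetical:

```python
def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single lowercase word."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark a continuation subword
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

# Tiny made-up vocabulary, just enough for the examples in the text.
vocab = {"hugging", "##face", "nl", "##p", "makes", "easy"}
print(wordpiece("huggingface", vocab))  # ['hugging', '##face']
print(wordpiece("nlp", vocab))          # ['nl', '##p']
print(wordpiece("xyz", vocab))          # ['[UNK]']
```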

# Batch tokenization with padding and truncation
texts = [
    "Short text.",
    "This is a much longer piece of text that will need to be truncated or padded.",
    "Medium length sentence here."
]

batch = tokenizer(
    texts,
    padding=True,       # pad shorter sequences
    truncation=True,    # truncate longer sequences
    max_length=20,      # max token length
    return_tensors='pt' # return PyTorch tensors
)

print(f"Input IDs shape: {batch['input_ids'].shape}")
print(f"\nInput IDs:\n{batch['input_ids']}")
print(f"\nAttention mask:\n{batch['attention_mask']}")
print("\nZeros in attention mask = padding positions (model ignores these)")

Output:

Input IDs shape: torch.Size([3, 20])

Input IDs:
tensor([[ 101, 2460, 3793, 1012,  102,    0,    0,    0, ...],
        [ 101, 2023, 2003, 1037, 2172, ...  102],
        [ 101, 5396, 3091, 6251, 2182, 1012,  102,    0, ...]])

Attention mask:
tensor([[1, 1, 1, 1, 1, 0, 0, 0, ...],
        [1, 1, 1, 1, 1, ... 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 0, ...]])

Zeros in attention mask = padding positions (model ignores these)
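The padding behaviour in that output can be reproduced by hand. A minimal sketch, assuming plain Python lists of IDs; the helper name pad_and_truncate is made up here, and unlike a real tokenizer it does not put [SEP] back after truncating:

```python
def pad_and_truncate(batch_ids, max_length, pad_id=0):
    """Truncate to max_length, then pad every row to the longest row."""
    truncated = [ids[:max_length] for ids in batch_ids]
    width = max(len(ids) for ids in truncated)
    input_ids = [ids + [pad_id] * (width - len(ids)) for ids in truncated]
    attention_mask = [[1] * len(ids) + [0] * (width - len(ids)) for ids in truncated]
    return input_ids, attention_mask

# IDs copied from the two shorter rows of the output above.
batch = [
    [101, 2460, 3793, 1012, 102],              # "Short text."
    [101, 5396, 3091, 6251, 2182, 1012, 102],  # "Medium length sentence here."
]
ids, mask = pad_and_truncate(batch, max_length=20)
print(ids[0])   # [101, 2460, 3793, 1012, 102, 0, 0]
print(mask[0])  # [1, 1, 1, 1, 1, 0, 0]
```

With padding=True the width is set by the longest row in the batch, not by max_length, which is why this two-row batch is only 7 wide.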
# Decode back to text
decoded = tokenizer.decode(batch['input_ids'][0], skip_special_tokens=True)
print(f"\nDecoded text 0: '{decoded}'")

# Fast tokenizer: returns word IDs for token-level tasks
fast_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', use_fast=True)
encoding = fast_tokenizer("John Smith works at OpenAI.", return_offsets_mapping=True)

print(f"\nWord IDs: {encoding.word_ids()}")
print(f"Character offsets: {encoding['offset_mapping']}")
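Word IDs matter for token-level tasks such as NER because labels come per word while the model sees subword tokens. A common pattern, sketched here under assumptions, is to label only the first subword of each word and mask the rest with -100, the value PyTorch's cross-entropy loss ignores by default. The word_ids list and tag ids below are hypothetical, not the tokenizer's actual output for the sentence above:

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default

def align_labels(word_ids, word_labels):
    """Give each token its word's label; mask special tokens and continuations."""
    aligned, previous = [], None
    for wid in word_ids:
        if wid is None:            # special token such as [CLS] or [SEP]
            aligned.append(IGNORE_INDEX)
        elif wid != previous:      # first subword of a word keeps the label
            aligned.append(word_labels[wid])
        else:                      # later subwords are ignored in the loss
            aligned.append(IGNORE_INDEX)
        previous = wid
    return aligned

# Assumed shape: six words, with word 4 split into two subword tokens.
word_ids = [None, 0, 1, 2, 3, 4, 4, 5, None]
word_labels = [1, 2, 0, 0, 3, 0]  # hypothetical tag ids: B-PER, I-PER, O, O, B-ORG, O
print(align_labels(word_ids, word_labels))
# [-100, 1, 2, 0, 0, 3, -100, 0, -100]
```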

AutoClasses: One API for Any Model

Instead of importing a specific class for each architecture, AutoClasses read the architecture from the checkpoint's config (its model_type field) and load the matching class.

from transformers import (
    AutoModel,
    AutoTokenizer,
    AutoModelForSequenceClassification,
    AutoModelForTokenClassification,
    AutoModelForQuestionAnswering,
    AutoModelForCausalLM,
    AutoModelForSeq2SeqLM,
)

# These all work with the same syntax regardless of architecture
# BERT, RoBERTa, DistilBERT, ALBERT, etc.

model_name = 'distilbert-base-uncased'

tokenizer   = AutoTokenizer.from_pretrained(model_name)
base_model  = AutoModel.from_pretrained(model_name)
clf_model   = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)
ner_model   = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=9)

print(f"Base model params:        {sum(p.numel() for p in base_model.parameters()):,}")
print(f"Classification params:    {sum(p.numel() for p in clf_model.parameters()):,}")
print(f"NER params:               {sum(p.numel() for p in ner_model.parameters()):,}")

# Swap to RoBERTa with one line change
model_name = 'roberta-base'
tokenizer_roberta = AutoTokenizer.from_pretrained(model_name)
clf_roberta       = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)
print(f"\nRoBERTa classification params: {sum(p.numel() for p in clf_roberta.parameters()):,}")
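Conceptually, an AutoClass is a lookup from a checkpoint's model_type to a concrete class. The toy registry below sketches that idea only; the class names are invented and this is not the library's internal code:

```python
class DistilBertToy:
    """Hypothetical stand-in for a concrete model class."""
    def __init__(self, config):
        self.config = config

class RobertaToy:
    """Hypothetical stand-in for a second architecture."""
    def __init__(self, config):
        self.config = config

# model_type string -> concrete class
REGISTRY = {"distilbert": DistilBertToy, "roberta": RobertaToy}

class AutoToy:
    @classmethod
    def from_config(cls, config):
        """Pick the concrete class from the config's model_type field."""
        model_type = config["model_type"]
        if model_type not in REGISTRY:
            raise ValueError(f"Unrecognized model_type: {model_type!r}")
        return REGISTRY[model_type](config)

model = AutoToy.from_config({"model_type": "roberta", "num_labels": 3})
print(type(model).__name__)  # RobertaToy
```

This is why swapping the checkpoint name is the only change needed: the calling code never names the architecture.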

The Datasets Library

from datasets import load_dataset, Dataset
import pandas as pd

# Load a public dataset
dataset = load_dataset('imdb')
print(dataset)
print(f"\nTrain size: {len(dataset['train'])}")
print(f"Test size:  {len(dataset['test'])}")
print(f"\nExample:\n{dataset['train'][0]}")

Output:

DatasetDict({
    train:        Dataset({features: ['text', 'label'], num_rows: 25000})
    test:         Dataset({features: ['text', 'label'], num_rows: 25000})
    unsupervised: Dataset({features: ['text', 'label'], num_rows: 50000})
})

Train size: 25000
Test size:  25000

Example:
{'text': 'I rented I AM CURIOUS...', 'label': 0}

# Filter, map, and process
small_train = dataset['train'].select(range(1000))   # first 1000 examples
positive    = dataset['train'].filter(lambda x: x['label'] == 1)

print(f"Small train: {len(small_train)}")
print(f"Positive reviews: {len(positive)}")

# Tokenize the whole dataset efficiently
def tokenize_function(examples):
    return tokenizer(
        examples['text'],
        truncation=True,
        padding='max_length',
        max_length=128
    )

tokenized = small_train.map(tokenize_function, batched=True)
tokenized = tokenized.remove_columns(['text'])
tokenized = tokenized.rename_column('label', 'labels')
tokenized.set_format('torch')

print(f"\nTokenized columns: {tokenized.column_names}")
print(f"First item keys:   {list(tokenized[0].keys())}")
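With batched=True, the function passed to map receives a dict of lists (one list per column, covering a slice of rows) and returns a dict of lists. The plain-Python sketch below imitates that contract on a list of row dicts; batched_map and add_length are illustrative names, not part of the datasets library:

```python
def batched_map(rows, fn, batch_size=2):
    """Apply fn to column-wise batches and merge the new columns into each row."""
    out = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # Rows -> columns: {'text': [...], 'label': [...]}
        columns = {key: [row[key] for row in chunk] for key in chunk[0]}
        new_columns = fn(columns)
        for i, row in enumerate(chunk):
            merged = dict(row)
            merged.update({key: values[i] for key, values in new_columns.items()})
            out.append(merged)
    return out

def add_length(examples):
    # Receives a dict of lists, returns a dict of lists (one value per row).
    return {"n_words": [len(text.split()) for text in examples["text"]]}

rows = [
    {"text": "This is great!", "label": 1},
    {"text": "Terrible experience.", "label": 0},
    {"text": "Pretty good overall.", "label": 1},
]
result = batched_map(rows, add_length)
print(result[0])  # {'text': 'This is great!', 'label': 1, 'n_words': 3}
```

Handing the tokenizer a whole list of texts at once is what makes the batched version fast: fast tokenizers process a list far more efficiently than one string at a time.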
# Create a dataset from your own data
my_data = {
    'text':  ['This is great!', 'Terrible experience.', 'Pretty good overall.'],
    'label': [1, 0, 1]
}
custom_dataset = Dataset.from_dict(my_data)
print(f"\nCustom dataset: {custom_dataset}")

# From pandas dataframe
df = pd.DataFrame(my_data)
from_pandas = Dataset.from_pandas(df)
print(f"From pandas: {from_pandas}")

The Trainer API: Fine-Tuning Without Writing a Loop

The Trainer class handles the training loop, evaluation, checkpointing, and logging for you.

from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding
)
from datasets import load_dataset
import evaluate
import numpy as np

# Load data
raw_datasets = load_dataset('imdb')
small_train = raw_datasets['train'].select(range(2000))
small_test  = raw_datasets['test'].select(range(500))

# Tokenize
model_name = 'distilbert-base-uncased'
tokenizer  = AutoTokenizer.from_pretrained(model_name)

def tokenize(examples):
    return tokenizer(examples['text'], truncation=True, max_length=128)

train_ds = small_train.map(tokenize, batched=True)
test_ds  = small_test.map(tokenize,  batched=True)

# Data collator handles dynamic padding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Model
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Metrics
accuracy_metric = evaluate.load('accuracy')

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions    = np.argmax(logits, axis=-1)
    return accuracy_metric.compute(predictions=predictions, references=labels)

# Training configuration
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=2,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.01,
    eval_strategy='epoch',      # called evaluation_strategy before transformers 4.41
    save_strategy='epoch',
    load_best_model_at_end=True,
    metric_for_best_model='accuracy',
    logging_steps=50,
    report_to='none'   # disable wandb/tensorboard for simplicity
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    processing_class=tokenizer,  # called tokenizer= before transformers 4.46
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

# Train
trainer.train()

# Evaluate
results = trainer.evaluate()
print(f"\nFinal accuracy: {results['eval_accuracy']:.3f}")

The Trainer handles batching, gradient accumulation, mixed precision, distributed training, and checkpointing, all configurable through TrainingArguments. Early stopping is available too, by passing an EarlyStoppingCallback in the Trainer's callbacks argument.
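One detail worth a closer look is the dynamic padding done by DataCollatorWithPadding: each batch is padded only to its own longest example instead of a fixed max_length. A back-of-the-envelope sketch of the saving, using made-up token counts and hypothetical helper names:

```python
def padded_tokens_fixed(lengths, max_length):
    """Total positions when every example is padded to max_length."""
    return len(lengths) * max_length

def padded_tokens_dynamic(lengths, batch_size):
    """Total positions when each batch is padded to its own longest example."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        total += len(batch) * max(batch)
    return total

lengths = [12, 30, 18, 128, 25, 40, 9, 64]  # hypothetical token counts after truncation
print(padded_tokens_fixed(lengths, max_length=128))   # 1024
print(padded_tokens_dynamic(lengths, batch_size=4))   # 768
```

The second batch contains no 128-token example, so it is padded only to 64. On real data with many short texts the gap is usually larger.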


Saving and Loading Models

# Save locally
model.save_pretrained('./my_model')
tokenizer.save_pretrained('./my_model')
print("Saved to ./my_model/")

# Load from local
from transformers import AutoModelForSequenceClassification, AutoTokenizer
loaded_model     = AutoModelForSequenceClassification.from_pretrained('./my_model')
loaded_tokenizer = AutoTokenizer.from_pretrained('./my_model')
print("Loaded from local directory")

# Push to HuggingFace Hub (requires login)
# from huggingface_hub import notebook_login
# notebook_login()
# model.push_to_hub('your-username/your-model-name')
# tokenizer.push_to_hub('your-username/your-model-name')

The Model Hub: Finding What You Need

The HuggingFace hub at huggingface.co/models has 500,000+ models. Knowing how to search it matters.

from huggingface_hub import list_models, model_info

# List models for a specific task
models = list(list_models(
    task='text-classification',
    sort='downloads',
    limit=5
))

print("Top 5 text-classification models by downloads:")
for m in models:
    print(f"  {m.id}")

print()

# Get info about a specific model
info = model_info('distilbert-base-uncased-finetuned-sst-2-english')
print(f"Model: {info.id}")
print(f"Task:  {info.pipeline_tag}")
print(f"Downloads last month: {info.downloads:,}")

# Useful search patterns on the hub:

# For a specific task:
# task:text-classification
# task:token-classification
# task:question-answering
# task:translation
# task:summarization

# For a specific language:
# language:fr   (French)
# language:zh   (Chinese)

# For size constraints:
# Search for "distil" or "tiny" or "mini" models
# Examples:
#   distilbert-base-uncased  (66M params)
#   albert-base-v2           (12M params)
#   google/mobilebert-uncased (25M params)

Model Card: Understanding What You're Loading

Every good model on the hub has a model card. It tells you:

1. What the model was trained on
2. What tasks it's designed for
3. Known limitations and biases
4. Evaluation results
5. How to use it

Always read the model card before using a model in production. A model fine-tuned on English Wikipedia should not be used to classify French medical texts.

from transformers import AutoModelForSequenceClassification

# Load the model, then print its config to inspect it
model = AutoModelForSequenceClassification.from_pretrained(
    'distilbert-base-uncased-finetuned-sst-2-english'
)
print(model.config)

The config lists architecture details, the number of labels, and the id2label mapping. What the model was trained on and what it is meant for are documented in the model card, not the config.


Quick Reference: Common Model Names

# Classification
'distilbert-base-uncased-finetuned-sst-2-english'   # sentiment (English)
'cardiffnlp/twitter-roberta-base-sentiment'          # tweet sentiment
'ProsusAI/finbert'                                   # financial sentiment

# NER
'dslim/bert-base-NER'                               # general NER
'Jean-Baptiste/roberta-large-ner-english'           # high accuracy NER

# Question Answering
'deepset/roberta-base-squad2'                       # extractive QA
'bert-large-uncased-whole-word-masking-finetuned-squad'  # high accuracy QA

# Text generation
'gpt2'                                              # small, fast
'gpt2-large'                                        # better quality
'microsoft/DialoGPT-medium'                         # conversational

# Summarization
'facebook/bart-large-cnn'                           # news summarization
't5-small'                                          # general seq2seq

# Translation
'Helsinki-NLP/opus-mt-en-fr'                        # English to French
'Helsinki-NLP/opus-mt-fr-en'                        # French to English

# Embeddings
'sentence-transformers/all-MiniLM-L6-v2'           # fast, good quality
'sentence-transformers/all-mpnet-base-v2'           # higher quality

Complete Practical Example: Build a Topic Classifier

from transformers import pipeline

# Use zero-shot for quick prototype (no training needed)
classifier = pipeline(
    'zero-shot-classification',
    model='facebook/bart-large-mnli'
)

articles = [
    "The Federal Reserve raised interest rates by 0.25 percentage points.",
    "Manchester City won the Premier League title for the fourth consecutive year.",
    "Scientists discovered a new species of deep-sea fish near the Mariana Trench.",
    "Apple unveiled the latest iPhone with improved camera capabilities.",
    "A new study shows that regular exercise reduces the risk of heart disease.",
]

labels = ['finance', 'sports', 'science', 'technology', 'health']

print(f"{'Article':<55} {'Prediction':<12} {'Score'}")
print("-" * 80)

for article in articles:
    result = classifier(article, candidate_labels=labels)
    top_label = result['labels'][0]
    top_score = result['scores'][0]
    print(f"{article[:52]+'...':<55} {top_label:<12} {top_score:.3f}")

Output:

Article                                                 Prediction   Score
--------------------------------------------------------------------------------
The Federal Reserve raised interest rates...            finance      0.891
Manchester City won the Premier League...               sports       0.976
Scientists discovered a new species...                  science      0.934
Apple unveiled the latest iPhone...                     technology   0.887
A new study shows that regular exercise...              health       0.892
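Zero-shot classification works without training because the model behind it is a natural language inference (NLI) model. Each candidate label is turned into a hypothesis such as "This example is finance.", the model scores how strongly the text entails it, and the entailment scores are softmaxed across labels. A minimal sketch of that last step, with made-up logits standing in for real model output:

```python
import math

def zero_shot_scores(entailment_logits):
    """Softmax entailment logits across labels; return (label, score) best-first."""
    peak = max(entailment_logits.values())
    exps = {label: math.exp(v - peak) for label, v in entailment_logits.items()}
    total = sum(exps.values())
    return sorted(
        ((label, e / total) for label, e in exps.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

template = "This example is {}."
labels = ["finance", "sports", "science"]
hypotheses = [template.format(label) for label in labels]

# Hypothetical entailment logits, one per (text, hypothesis) pair.
logits = {"finance": 3.1, "sports": -1.2, "science": 0.4}
ranked = zero_shot_scores(logits)

print(hypotheses[0])   # This example is finance.
print(ranked[0][0])    # finance
```

Because scores are normalized across the candidate labels, they always sum to 1, so a high score means "best of the labels offered", not "certainly correct".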

Quick Cheat Sheet

Task            Pipeline name              Common model
Sentiment       sentiment-analysis         distilbert-base-uncased-finetuned-sst-2-english
NER             ner                        dslim/bert-base-NER
QA              question-answering         deepset/roberta-base-squad2
Text gen        text-generation            gpt2
Summarization   summarization              facebook/bart-large-cnn
Translation     translation_XX_to_YY       Helsinki-NLP/opus-mt-*
Zero-shot       zero-shot-classification   facebook/bart-large-mnli
Embeddings      feature-extraction         sentence-transformers/all-MiniLM-L6-v2

HuggingFace task   Code
Load tokenizer     AutoTokenizer.from_pretrained(name)
Load model         AutoModelForXxx.from_pretrained(name)
Tokenize batch     tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
Load dataset       load_dataset('imdb')
Fine-tune          Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
Save               model.save_pretrained('./dir')
Push to hub        model.push_to_hub('username/model-name')

Practice Challenges

Level 1:
Use three different pipelines on the same piece of text: sentiment, NER, and zero-shot classification with 5 labels you choose. Print all results. Notice how easy swapping tasks is.

Level 2:
Load the imdb dataset. Tokenize the first 500 training examples with distilbert-base-uncased. Create a DataLoader from it. Run one forward pass and print the output shape. This is the full preprocessing pipeline for fine-tuning.

Level 3:
Use the Trainer API to fine-tune distilbert-base-uncased on any small classification dataset of your choice (emotion detection, spam classification, topic classification). Evaluate on a test set. Push the fine-tuned model to the HuggingFace hub and share the model card.


Next up, Post 95: Fine-Tuning LLMs: Make a General Model Do Your Specific Job. Dataset preparation, training config, evaluation, and the right way to fine-tune without destroying what the model already knows.
