Akhilesh

95. Fine-Tuning LLMs: Make a General Model Do Your Specific Job

A general language model knows a little about everything.

It knows some medicine. Some law. Some code. Some cooking. But it doesn't know your specific domain deeply. It doesn't know your company's tone, your product's terminology, or your task's format.

Fine-tuning fixes this. You take a pretrained model that already understands language and specialize it for your specific task with a fraction of the data and compute you'd need to train from scratch.

This post covers how to do it properly.


What You'll Learn Here

  • What fine-tuning actually does to a pretrained model
  • The three types of fine-tuning and when to use each
  • Preparing datasets for instruction fine-tuning
  • Full fine-tuning with the HuggingFace Trainer
  • Evaluating fine-tuned models properly
  • Catastrophic forgetting and how to avoid it
  • Tips that actually make a difference

What Fine-Tuning Does

A pretrained LLM has learned a general representation of language from billions of tokens. Its weights encode grammar, facts, reasoning patterns, and world knowledge.

Fine-tuning continues training on a smaller, task-specific dataset. The model adapts its weights slightly to specialize. The key word is slightly. You don't want to destroy the general knowledge. You want to build on it.

Pretrained model:
  - Knows language deeply
  - Broad but shallow domain knowledge
  - No concept of your task format

After fine-tuning:
  - Still knows language
  - Deep knowledge of your domain
  - Understands your task format
  - Responds in your required style

The weights change. But not completely. A well-fine-tuned model retains its general capabilities while gaining task-specific expertise.


Three Types of Fine-Tuning

Type 1: Full Fine-Tuning
Update all weights. Best results. Expensive. Needs lots of data. Risk of catastrophic forgetting.

Type 2: Feature Extraction (Frozen backbone)
Freeze the pretrained model. Only train a new head (classification layer, etc.). Fast. Needs very little data. Limited adaptation.

Type 3: Parameter-Efficient Fine-Tuning (LoRA, adapters)
Add small trainable modules. Freeze most of the model. Train only a tiny fraction of parameters. Best of both worlds. Covered deeply in Post 96.

# Type 1: Full fine-tuning
for param in model.parameters():
    param.requires_grad = True   # all params update

# Type 2: Frozen backbone
for param in model.base_model.parameters():
    param.requires_grad = False  # freeze backbone
# only classifier head trains

# Type 3: LoRA (simplified)
# Covered in Post 96
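To get a feel for what "a tiny fraction of parameters" means for Type 3, here is a back-of-the-envelope sketch. It assumes the usual LoRA setup, where a frozen weight matrix of shape d x k gets two small trainable matrices of shapes d x r and r x k. The function name and the 768 / rank-8 numbers are illustrative choices, not taken from a specific model config.

```python
def lora_trainable_fraction(d: int, k: int, r: int) -> float:
    """Trainable LoRA weights relative to the size of one frozen d x k matrix."""
    frozen = d * k            # original weight matrix, kept frozen
    trainable = r * (d + k)   # one d x r matrix plus one r x k matrix
    return trainable / frozen

# Illustrative numbers only: a 768 x 768 projection with rank 8
frac = lora_trainable_fraction(d=768, k=768, r=8)
print(f"Rank-8 LoRA on a 768x768 layer trains {frac:.2%} as many weights as the layer holds")
```

The fraction shrinks as the layer gets wider, which is part of why this approach scales well to large models.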

Dataset Preparation

Good data beats a good model almost every time. This is where most fine-tuning projects live or die.

For classification fine-tuning:

from datasets import Dataset, DatasetDict
import pandas as pd

# Your labeled data
data = {
    'text': [
        "The patient presented with acute chest pain radiating to the left arm.",
        "The quarterly earnings exceeded analyst expectations by 15%.",
        "The defendant claims he was not present at the scene of the crime.",
        "Treatment with metformin reduced HbA1c levels significantly.",
        "Revenue growth was driven by strong performance in cloud services.",
        "The prosecution presented DNA evidence linking the suspect to the crime.",
        "MRI results showed no signs of cerebral hemorrhage.",
        "Operating margins expanded by 200 basis points year over year.",
        "The jury found the defendant not guilty on all counts.",
        "The patient was discharged after a three-day hospitalization.",
    ],
    'label': [0, 1, 2, 0, 1, 2, 0, 1, 2, 0]  # 0=medical, 1=finance, 2=legal
}

df = pd.DataFrame(data)

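# Train/val split
# With only 10 rows and 3 classes, a stratified validation set needs at least
# one row per class, so hold out 30% (3 rows). test_size=0.2 (2 rows) raises an error.
from sklearn.model_selection import train_test_split
train_df, val_df = train_test_split(df, test_size=0.3, random_state=42, stratify=df['label'])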

train_dataset = Dataset.from_pandas(train_df.reset_index(drop=True))
val_dataset   = Dataset.from_pandas(val_df.reset_index(drop=True))

dataset = DatasetDict({'train': train_dataset, 'validation': val_dataset})
print(dataset)

For instruction fine-tuning (making a model follow prompts):

# Alpaca-style instruction format (one common convention; chat models usually ship their own chat template)
def format_instruction(example):
    return f"""### Instruction:
{example['instruction']}

### Input:
{example['input']}

### Response:
{example['output']}"""

# Example instruction dataset
instruction_data = [
    {
        'instruction': 'Classify this medical text into one of: diagnosis, treatment, symptom.',
        'input': 'Patient reports persistent cough and shortness of breath for 3 weeks.',
        'output': 'symptom'
    },
    {
        'instruction': 'Classify this medical text into one of: diagnosis, treatment, symptom.',
        'input': 'Prescribed amoxicillin 500mg three times daily for 7 days.',
        'output': 'treatment'
    },
    {
        'instruction': 'Classify this medical text into one of: diagnosis, treatment, symptom.',
        'input': 'Confirmed diagnosis of type 2 diabetes mellitus based on HbA1c of 7.8%.',
        'output': 'diagnosis'
    },
]

for example in instruction_data:
    print(format_instruction(example))
    print("-" * 50)
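Real instruction datasets often contain examples with no separate input, and they are commonly stored as one JSON object per line. The sketch below shows one way to handle both, assuming you want to drop the "### Input:" section when it is empty. The helper names format_example and to_jsonl are hypothetical, and the two sample records are made up.

```python
import json

def format_example(example: dict) -> str:
    """Build the training text, skipping the Input section when it is empty."""
    parts = [f"### Instruction:\n{example['instruction']}"]
    if example.get('input', '').strip():
        parts.append(f"### Input:\n{example['input']}")
    parts.append(f"### Response:\n{example['output']}")
    return "\n\n".join(parts)

def to_jsonl(examples: list) -> str:
    """Serialize examples as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(ex, ensure_ascii=False) for ex in examples)

samples = [
    {'instruction': 'Classify this medical text.',
     'input': 'Prescribed amoxicillin 500mg.', 'output': 'treatment'},
    {'instruction': 'Name one symptom of the flu.', 'input': '', 'output': 'fever'},
]

for s in samples:
    print(format_example(s))
    print("-" * 50)
print(to_jsonl(samples))
```

Whatever template you pick, use exactly the same one at inference time; the model learns the format along with the task.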

Data Quality Checklist

Before fine-tuning, verify your data:

import pandas as pd
import numpy as np

def audit_dataset(df, text_col='text', label_col='label'):
    print("=" * 50)
    print("DATASET AUDIT REPORT")
    print("=" * 50)

    # Size
    print(f"\nTotal examples: {len(df):,}")

    # Class distribution
    print(f"\nClass distribution:")
    dist = df[label_col].value_counts(normalize=True)
    for label, pct in dist.items():
        count = df[label_col].value_counts()[label]
        print(f"  Class {label}: {count} ({pct:.1%})")

    # Imbalance check
    max_class = dist.max()
    min_class = dist.min()
    ratio     = max_class / min_class
    if ratio > 5:
        print(f"  WARNING: Imbalance ratio {ratio:.1f}x. Consider oversampling or class weights.")

    # Text length
    lengths = df[text_col].str.len()
    print(f"\nText length:")
    print(f"  Min:    {lengths.min()}")
    print(f"  Max:    {lengths.max()}")
    print(f"  Median: {lengths.median():.0f}")
    print(f"  Mean:   {lengths.mean():.0f}")

    # Long texts warning
    if lengths.max() > 512 * 4:  # rough estimate of 512 tokens
        print(f"  WARNING: Some texts may exceed token limits. Check truncation strategy.")

    # Duplicates
    n_dupes = df[text_col].duplicated().sum()
    if n_dupes > 0:
        print(f"\n  WARNING: {n_dupes} duplicate texts found. Remove before training.")

    # Missing values
    missing = df.isnull().sum().sum()
    if missing > 0:
        print(f"\n  WARNING: {missing} missing values found.")
    else:
        print(f"\nNo missing values.")

    print("=" * 50)

audit_dataset(pd.DataFrame(data))
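The audit only reports problems. As a rough sketch of acting on two of its warnings, the helpers below remove exact duplicate texts and compute inverse-frequency class weights using n_samples / (n_classes * class_count), the same formula scikit-learn uses for class_weight='balanced'. The function names and the toy data are assumptions for illustration.

```python
from collections import Counter

def drop_duplicates(texts, labels):
    """Keep the first occurrence of each text (exact match after stripping whitespace)."""
    seen, kept_texts, kept_labels = set(), [], []
    for text, label in zip(texts, labels):
        key = text.strip()
        if key not in seen:
            seen.add(key)
            kept_texts.append(text)
            kept_labels.append(label)
    return kept_texts, kept_labels

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# Toy data with one duplicate and a 3:1 imbalance after deduplication
texts  = ["chest pain", "earnings beat", "chest pain", "fever", "cough"]
labels = [0, 1, 0, 0, 0]

texts, labels = drop_duplicates(texts, labels)
weights = balanced_class_weights(labels)
print(weights)
```

Rare classes get weights above 1 and common classes below 1, so a weighted loss pays more attention to the examples you have fewer of.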

Full Fine-Tuning for Sequence Classification

from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
    DataCollatorWithPadding,
    EarlyStoppingCallback
)
import evaluate
import numpy as np
import torch

model_name  = 'distilbert-base-uncased'
num_labels  = 3
label_names = ['medical', 'finance', 'legal']

id2label = {i: l for i, l in enumerate(label_names)}
label2id = {l: i for i, l in enumerate(label_names)}

# Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_function(examples):
    return tokenizer(
        examples['text'],
        truncation=True,
        padding=False,       # DataCollator will pad dynamically
        max_length=256
    )

tokenized_train = train_dataset.map(tokenize_function, batched=True)
tokenized_val   = val_dataset.map(tokenize_function, batched=True)

# Model
model = AutoModelForSequenceClassification.from_pretrained(
    model_name,
    num_labels=num_labels,
    id2label=id2label,
    label2id=label2id
)

# Metrics
accuracy = evaluate.load('accuracy')
f1_metric = evaluate.load('f1')

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions    = np.argmax(logits, axis=-1)
    acc = accuracy.compute(predictions=predictions, references=labels)['accuracy']
    f1  = f1_metric.compute(
        predictions=predictions, references=labels, average='weighted'
    )['f1']
    return {'accuracy': acc, 'f1': f1}

# Training arguments
training_args = TrainingArguments(
    output_dir='./checkpoints/domain_classifier',

    # Training schedule
    num_train_epochs=5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=16,

    # Optimization
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.1,            # warmup for 10% of steps
    lr_scheduler_type='cosine',  # cosine decay after warmup

    # Evaluation
    evaluation_strategy='epoch',   # named eval_strategy in newer transformers releases
    save_strategy='epoch',
    load_best_model_at_end=True,
    metric_for_best_model='f1',
    greater_is_better=True,

    # Logging
    logging_steps=10,
    logging_dir='./logs',
    report_to='none',

    # Efficiency
    fp16=torch.cuda.is_available(),  # mixed precision on GPU
    dataloader_num_workers=0,
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train,
    eval_dataset=tokenized_val,
    tokenizer=tokenizer,           # newer transformers releases name this processing_class
    data_collator=DataCollatorWithPadding(tokenizer),
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
)

# Train
print("Starting fine-tuning...")
trainer.train()

# Evaluate
results = trainer.evaluate()
print(f"\nFinal Results:")
print(f"  Accuracy: {results['eval_accuracy']:.3f}")
print(f"  F1:       {results['eval_f1']:.3f}")
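The warmup_ratio and lr_scheduler_type='cosine' settings are easier to reason about when you see the numbers. This is a minimal sketch of linear warmup followed by cosine decay to zero, not the library's own code; the helper name lr_at_step is made up, and the exact rounding of the warmup step count inside the Trainer may differ slightly.

```python
import math

def lr_at_step(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Linear warmup to base_lr, then cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 100
for step in (0, 5, 10, 55, 100):
    print(f"step {step:>3}: lr = {lr_at_step(step, total):.2e}")
```

The learning rate climbs from zero to 2e-5 over the first 10% of steps, then glides back down. The slow start keeps the first noisy gradient updates from pushing the pretrained weights too far.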

Evaluating a Fine-Tuned Model Properly

Accuracy alone isn't enough. Look at per-class performance, the confusion matrix, and the individual error cases.

from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import torch

# Get predictions on validation set
model.eval()
all_preds  = []
all_labels = []

val_dataloader = trainer.get_eval_dataloader()

with torch.no_grad():
    for batch in val_dataloader:
        batch   = {k: v.to(model.device) for k, v in batch.items()}
        outputs = model(**batch)
        preds   = torch.argmax(outputs.logits, dim=-1)

        all_preds.extend(preds.cpu().numpy())
        all_labels.extend(batch['labels'].cpu().numpy())

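# Classification report
# Passing labels explicitly keeps the report aligned with label_names even if
# a class is missing from a small validation set.
label_ids = list(range(len(label_names)))
print("Classification Report:")
print(classification_report(all_labels, all_preds, labels=label_ids,
                            target_names=label_names, zero_division=0))

# Confusion matrix
cm = confusion_matrix(all_labels, all_preds, labels=label_ids)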
plt.figure(figsize=(7, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=label_names, yticklabels=label_names)
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.title('Confusion Matrix - Fine-tuned DistilBERT')
plt.tight_layout()
plt.savefig('fine_tune_confusion.png', dpi=100)
plt.show()
# Error analysis: look at what the model gets wrong
errors = []
texts  = val_df['text'].tolist()

for i, (pred, true) in enumerate(zip(all_preds, all_labels)):
    if pred != true:
        errors.append({
            'text':      texts[i],
            'true':      label_names[true],
            'predicted': label_names[pred]
        })

print(f"\nErrors ({len(errors)} out of {len(all_labels)}):")
for e in errors:
    print(f"\n  True: {e['true']}, Predicted: {e['predicted']}")
    print(f"  Text: '{e['text'][:80]}...'")

Error analysis is often the most valuable step. Understanding why the model gets specific examples wrong tells you what data to add next.
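Once the error list grows beyond a handful of rows, it helps to group mistakes by (true, predicted) pair to see which confusion dominates. A small sketch, assuming predictions and labels are plain integer lists; the helper name top_confusions and the toy values are made up for illustration.

```python
from collections import Counter

def top_confusions(preds, labels, names, k=3):
    """Count (true, predicted) pairs for misclassified examples, most frequent first."""
    pairs = Counter(
        (names[t], names[p]) for p, t in zip(preds, labels) if p != t
    )
    return pairs.most_common(k)

names  = ['medical', 'finance', 'legal']
labels = [0, 0, 1, 2, 2, 2]   # made-up ground truth
preds  = [0, 2, 1, 0, 0, 2]   # made-up predictions

for (true, pred), n in top_confusions(preds, labels, names):
    print(f"{true} -> {pred}: {n}")
```

If one pair dominates, that is usually where a few dozen extra labeled examples pay off most.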


Catastrophic Forgetting: The Real Risk

When you fine-tune on a small dataset, the model can forget what it learned during pretraining. Weights move too far from their pretrained values. General capabilities degrade.

# Signs of catastrophic forgetting:
# 1. Model performs well on your task but fails on general text
# 2. Perplexity on general text spikes
# 3. Model generates incoherent text outside your domain

# Prevent it with:

# 1. Low learning rate (2e-5 is usually safe for BERT-based models)
training_args_safe = TrainingArguments(
    learning_rate=2e-5,        # not 1e-3 or 1e-4
    weight_decay=0.01,         # weight decay (decoupled, as applied by AdamW)
    warmup_ratio=0.1,
    num_train_epochs=3,        # not 50
    output_dir='./safe_ft'
)

# 2. Freeze early layers (they contain general language knowledge)
def freeze_early_layers(model, n_frozen_layers=4):
    # Freeze embedding layers
    for param in model.distilbert.embeddings.parameters():
        param.requires_grad = False

    # Freeze first n transformer layers
    for layer in model.distilbert.transformer.layer[:n_frozen_layers]:
        for param in layer.parameters():
            param.requires_grad = False

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total     = sum(p.numel() for p in model.parameters())
    print(f"Trainable: {trainable:,} / {total:,} ({trainable/total:.1%})")

freeze_early_layers(model, n_frozen_layers=4)

# 3. Small dataset? Consider LoRA (Post 96) instead of full fine-tuning
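Sign 2 above (perplexity on general text spikes) applies to generative models and is cheap to check. Perplexity is the exponential of the mean negative log-likelihood per token, so for a causal LM it is exp of the evaluation loss on a held-out general-text sample. The sketch below uses made-up log-probabilities in place of real model outputs, and the 2x alert threshold is an arbitrary assumption.

```python
import math

def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) over tokens."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# Made-up per-token log-probabilities on the same general-text sample
before = [math.log(0.25)] * 8   # pretrained model
after  = [math.log(0.05)] * 8   # after an overly aggressive fine-tune

ppl_before, ppl_after = perplexity(before), perplexity(after)
print(f"before: {ppl_before:.1f}, after: {ppl_after:.1f}")

if ppl_after > 2 * ppl_before:  # arbitrary threshold, tune for your use case
    print("General-text perplexity more than doubled: possible forgetting")
```

Measure the baseline before you start training. You cannot spot a spike without a number to compare against.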

Instruction Fine-Tuning a Generative Model

For causal LLMs (GPT-style), you format the data as prompts and completions.

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer
from datasets import Dataset
import torch

# Load a small generative model
model_name = 'gpt2'
tokenizer  = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(model_name)
model.config.use_cache = False   # KV cache is not needed for training (and must be off if you enable gradient checkpointing)

# Instruction dataset
instructions = [
    {
        'prompt': "### Instruction:\nSummarize this in one sentence.\n\n### Input:\nMachine learning is a subset of artificial intelligence that enables computers to learn from data without being explicitly programmed. It uses algorithms to parse data, learn from it, and make informed decisions.\n\n### Response:\n",
        'completion': "Machine learning allows computers to learn from data and make decisions without explicit programming."
    },
    {
        'prompt': "### Instruction:\nSummarize this in one sentence.\n\n### Input:\nThe Eiffel Tower, located in Paris, France, was built between 1887 and 1889 as the entrance arch for the 1889 World's Fair and stands 330 meters tall.\n\n### Response:\n",
        'completion': "The Eiffel Tower is a 330-meter structure in Paris built in 1889 as the entrance arch for the World's Fair."
    },
]

# Tokenize: concatenate prompt + completion, mask prompt in loss
def tokenize_instruction(example, max_length=256):
    full_text = example['prompt'] + example['completion'] + tokenizer.eos_token

    tokenized = tokenizer(
        full_text,
        max_length=max_length,
        truncation=True,
        padding='max_length',
        return_tensors='pt'
    )

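    input_ids      = tokenized['input_ids'][0]
    attention_mask = tokenized['attention_mask'][0]
    labels         = input_ids.clone()

    # Mask the prompt tokens in loss (we only want to train on completions)
    prompt_ids = tokenizer(example['prompt'], return_tensors='pt')['input_ids'][0]
    prompt_len = len(prompt_ids)
    labels[:prompt_len] = -100   # -100 is ignored in CrossEntropyLoss

    # Also mask padding: pad_token is the EOS token here, so without this step
    # the loss would be computed on every padded position
    labels[attention_mask == 0] = -100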

    return {
        'input_ids':      input_ids,
        'attention_mask': tokenized['attention_mask'][0],
        'labels':         labels
    }

tokenized_data = [tokenize_instruction(ex) for ex in instructions]

# Convert to dataset

class InstructionDataset(torch.utils.data.Dataset):
    def __init__(self, data):
        self.data = data
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        return self.data[idx]

train_ds = InstructionDataset(tokenized_data)

# Fine-tune
training_args = TrainingArguments(
    output_dir='./instruct_model',
    num_train_epochs=3,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,   # effective batch size = 4
    learning_rate=2e-5,
    warmup_steps=10,
    logging_steps=5,
    save_steps=50,
    report_to='none',
    fp16=torch.cuda.is_available()
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
)

trainer.train()
print("Instruction fine-tuning complete")
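The label masking is the part of this recipe that is easiest to get wrong, so here is the same idea on plain Python lists, without a tokenizer. This is a toy sketch with made-up token ids; build_labels is a hypothetical helper. Only the completion tokens and the real EOS should contribute to the loss, so both the prompt and the padding positions are masked.

```python
IGNORE_INDEX = -100  # label value that PyTorch's CrossEntropyLoss skips by default

def build_labels(input_ids, attention_mask, prompt_len):
    """Copy input_ids, then hide prompt and padding positions from the loss."""
    labels = list(input_ids)
    for i in range(len(labels)):
        if i < prompt_len or attention_mask[i] == 0:
            labels[i] = IGNORE_INDEX
    return labels

# Toy ids: 3 prompt tokens, 2 completion tokens, EOS (id 9), then 2 padding slots
input_ids      = [11, 12, 13, 21, 22, 9, 9, 9]
attention_mask = [1, 1, 1, 1, 1, 1, 0, 0]

labels = build_labels(input_ids, attention_mask, prompt_len=3)
print(labels)  # [-100, -100, -100, 21, 22, 9, -100, -100]
```

The first EOS stays in the labels because the model has to learn when to stop. The padded copies of the same id are masked because they carry no information.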

Testing Your Fine-Tuned Model

# Test the fine-tuned generative model
model.eval()

def generate_response(prompt, max_new_tokens=100, temperature=0.7):
    inputs = tokenizer(prompt, return_tensors='pt').to(model.device)
    with torch.no_grad():
        output = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id
        )
    generated = output[0][inputs['input_ids'].shape[1]:]
    return tokenizer.decode(generated, skip_special_tokens=True)

# Test prompt
test_prompt = """### Instruction:
Summarize this in one sentence.

### Input:
Neural networks are computing systems inspired by biological neural networks. They consist of layers of interconnected nodes that process information using connectionist approaches to computation.

### Response:
"""

response = generate_response(test_prompt)
print(f"Generated response:\n{response}")
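The temperature argument in generate_response controls how sharply the model favors its top choice when sampling. A minimal sketch of the idea, assuming the standard approach of dividing logits by the temperature before the softmax; the logits here are made up and the helper name is illustrative.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]   # made-up scores for three candidate tokens
for t in (0.3, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.2f}" for p in probs))
```

Lower temperatures concentrate probability on the top token, which tends to suit tasks with one right answer such as classification or short summaries. Higher values spread it out and give more varied output.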

Fine-Tuning Best Practices

# Summary of what actually works

best_practices = {
    'learning_rate': {
        'BERT-based (classification)': '2e-5 to 5e-5',
        'GPT-based (generation)':      '1e-5 to 3e-5',
        'Frozen backbone':             '1e-4 to 1e-3 for head only'
    },
    'batch_size': {
        'recommendation': '16 or 32 if memory allows',
        'small GPU':      'batch=4 + gradient_accumulation=4'
    },
    'epochs': {
        'BERT classification': '2 to 4',
        'GPT generation':      '1 to 3',
        'note':                'More epochs = more overfitting risk'
    },
    'data_size': {
        'frozen backbone':  'Works with 100+ examples',
        'full fine-tuning': 'Need 1000+ for reliable results',
        'instruction FT':   '1000 to 10000 good examples'
    },
    'stopping': {
        'recommendation': 'Always use early stopping',
        'metric':         'Monitor validation loss, not training loss'
    }
}

for category, details in best_practices.items():
    print(f"\n{category.upper()}:")
    for k, v in details.items():
        print(f"  {k}: {v}")
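The early-stopping recommendation boils down to a small piece of bookkeeping. Below is a rough sketch of patience-based stopping on validation loss, written from scratch to show the logic rather than to mirror the internals of EarlyStoppingCallback. The class name, the min_delta parameter, and the loss values are all made up.

```python
class EarlyStopper:
    """Stop when validation loss has not improved for `patience` evaluations in a row."""

    def __init__(self, patience=2, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float('inf')
        self.bad_evals = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss      # improvement: remember it and reset the counter
            self.bad_evals = 0
        else:
            self.bad_evals += 1       # no improvement at this evaluation
        return self.bad_evals >= self.patience

stopper = EarlyStopper(patience=2)
val_losses = [0.90, 0.70, 0.65, 0.66, 0.68, 0.60]   # made-up per-epoch values

for epoch, loss in enumerate(val_losses, start=1):
    if stopper.should_stop(loss):
        print(f"Stopping after epoch {epoch}; best validation loss {stopper.best:.2f}")
        break
```

Note the trade-off in this toy run: training stops at epoch 5 and never sees the 0.60 at epoch 6. A larger patience would have caught it at the cost of extra compute.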

Quick Cheat Sheet

Decision                     Guidance
How much data do I have?     < 500: freeze backbone. 500-5k: full fine-tune. > 5k: great
Which model to start with?   DistilBERT for speed, RoBERTa for accuracy
Learning rate                2e-5 for BERT, 1e-5 for GPT, never > 5e-5 when the backbone is trainable
Epochs                       2-4, use early stopping
Catastrophic forgetting      Lower LR, freeze early layers, fewer epochs
Model not learning           Raise LR, check data quality, check label correctness
Model overfitting            Lower LR, add dropout, add more data, use LoRA

Task         Code
Load model   AutoModelForSequenceClassification.from_pretrained(name, num_labels=N)
Tokenize     tokenizer(texts, truncation=True, padding=False, max_length=256)
Train        Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=val_ds)
Early stop   EarlyStoppingCallback(early_stopping_patience=2)
Save         trainer.save_model('./my_model')
Predict      trainer.predict(test_dataset)

Practice Challenges

Level 1:
Download any small labeled text dataset from the HuggingFace hub. Fine-tune distilbert-base-uncased on it for 3 epochs. Print the classification report. Compare to a TF-IDF + LogisticRegression baseline.

Level 2:
Fine-tune with and without freezing the first 4 transformer layers. Compare final F1 scores and training time. Which approach is better for your dataset size?

Level 3:
Create your own instruction dataset of 50+ examples for a specific task (code explanation, medical text classification, legal summarization). Fine-tune GPT-2 on it. Test the model with 10 new prompts it hasn't seen. Rate the responses 1-5 and report average quality.


Next up, Post 96: LoRA: Fine-Tune a Billion-Parameter Model on a Laptop. Parameter-efficient fine-tuning using rank decomposition. Train 1% of parameters and get 95% of the performance of full fine-tuning.