Efficient parameter tuning for smarter, faster large language models.
Fine-tuning large models like Llama 3 no longer means retraining billions of parameters.
Thanks to PEFT (Parameter-Efficient Fine-Tuning), we can adapt models for new tasks with minimal compute — and keep the original weights frozen.
Let’s go through the setup, training, and evaluation for fine-tuning a Llama 3 model using Hugging Face’s PEFT library.
⚙️ 1. Environment Setup
First, make sure you’re using Python 3.10+ with GPU access (a quick sanity check follows the library list below).
pip install torch transformers datasets peft accelerate bitsandbytes
These are the key libraries:
torch — tensor backend and GPU kernels
transformers — base Llama 3 model + tokenizer
datasets — data loading utilities
peft — adapter training framework
accelerate — device placement and mixed-precision plumbing
bitsandbytes — quantization support for low-memory GPUs
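Before going further, it's worth confirming the environment is actually ready; this minimal check uses nothing Llama-specific:

import sys
import torch

print(sys.version.split()[0])              # expect 3.10 or newer
print(torch.cuda.is_available())           # expect True
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))   # prints your GPU model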
🧩 2. Load Model & Tokenizer
Here’s how to load the base model (Llama 3 8B, or a smaller variant).

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"  # gated repo: accept the license on the Hub and run huggingface-cli login first

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token; the padding in step 4 needs one
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    device_map="auto"
)
We’re loading in 8-bit precision to save VRAM and allow fine-tuning on a single GPU, whether a consumer RTX 4090 or a data-center A100.
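Note that recent transformers releases prefer routing quantization options through BitsAndBytesConfig instead of the load_in_8bit flag; the equivalent call looks like this (same behavior, newer API):

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(load_in_8bit=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)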
🧠 3. Add a PEFT Adapter (LoRA)
The LoRA (Low-Rank Adaptation) method freezes the pretrained weights and injects small trainable low-rank matrices into selected layers (here, the attention query and value projections).
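To see what "small matrices" means concretely, here is a toy sketch of the update LoRA learns for a single weight, in plain PyTorch rather than the actual peft internals:

import torch

d, r, alpha = 4096, 8, 32
W = torch.randn(d, d)   # frozen pretrained weight, never updated
A = torch.randn(r, d)   # trainable, initialized randomly
B = torch.zeros(d, r)   # trainable, initialized to zero so training starts exactly from W
# effective weight during fine-tuning: W + (alpha / r) * B @ A
W_adapted = W + (alpha / r) * (B @ A)
print(W_adapted.shape)  # torch.Size([4096, 4096])

In practice peft wires this up for you; the real configuration looks like this: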
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)  # stabilizes gradient flow when training on a quantized base

lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
This trains less than 1% of the model’s total parameters, keeping updates lightweight while maintaining accuracy.
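You can verify that fraction yourself by counting parameters directly on the wrapped model:

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")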
📚 4. Load a Sample Dataset
You can use any dataset from Hugging Face Datasets, or create your own (a local-file sketch follows the example below).
from datasets import load_dataset
dataset = load_dataset("Abirate/english_quotes") # simple example dataset
def format_data(example):
    # tokenize once; for causal LM training the labels are simply a copy of the inputs
    tokens = tokenizer(
        example["quote"],
        truncation=True,
        padding="max_length",
        max_length=128,
    )
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens

tokenized_dataset = dataset.map(format_data)
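If you want to train on your own data instead, the same formatting function works on a local file; the filename below is a hypothetical placeholder:

from datasets import load_dataset

# hypothetical JSONL file with one {"quote": "..."} object per line
my_dataset = load_dataset("json", data_files="my_quotes.jsonl")
my_tokenized = my_dataset.map(format_data)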
🚀 5. Train the Model
We’ll use the Trainer API with LoRA adapters.
from transformers import TrainingArguments, Trainer
training_args = TrainingArguments(
    output_dir="./llama3-peft-demo",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # effective batch size: 2 x 8 = 16 per device
    num_train_epochs=2,
    learning_rate=2e-4,
    logging_steps=10,
    save_steps=100,
    fp16=True
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
)
trainer.train()
Training modifies only the LoRA parameters, typically just a few tens of megabytes on disk for a config like this, which makes it fast and cheap.
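If a run is interrupted, Trainer can pick up from the most recent checkpoint saved in output_dir:

trainer.train(resume_from_checkpoint=True)  # finds the latest checkpoint automatically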
🔎 6. Evaluate and Generate
After training, the LoRA weights are still attached to the in-memory model, so you can test generation right away (reloading a saved adapter onto a fresh base model is covered in step 7).

model.eval()  # switch to inference mode before generating
prompt = "In the next decade, AI will"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
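By default generate() decodes greedily; for more varied completions you can sample instead. These are standard generate() parameters, nothing PEFT-specific:

outputs = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=True,      # sample instead of always taking the argmax token
    temperature=0.8,     # below 1.0 sharpens the distribution, above 1.0 flattens it
    top_p=0.9,           # nucleus sampling: keep the smallest token set covering 90% probability
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))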
💾 7. Save & Upload to Hugging Face Hub
model.save_pretrained("./llama3-finetuned")
tokenizer.save_pretrained("./llama3-finetuned")
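To use the adapter later in a fresh process, load the base model first and attach the saved adapter on top; a minimal sketch, assuming the same model_name as in step 2:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,
    device_map="auto"
)
model = PeftModel.from_pretrained(base, "./llama3-finetuned")
tokenizer = AutoTokenizer.from_pretrained("./llama3-finetuned")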
Optionally, share it with the community:
huggingface-cli login
huggingface-cli upload your-username/llama3-finetuned ./llama3-finetuned
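The same upload works from Python; the repo name below is a placeholder for your own namespace:

model.push_to_hub("your-username/llama3-finetuned")
tokenizer.push_to_hub("your-username/llama3-finetuned")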
🧩 Why PEFT Matters
Efficient: updates only a fraction of parameters
Modular: you can swap adapters for different domains (see the sketch below)
Scalable: fine-tune huge models on affordable hardware
In short: PEFT turns fine-tuning into plug-and-play intelligence.
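The "modular" point above is concrete in peft: one frozen base model can host several named adapters and switch between them at runtime. The adapter paths here are hypothetical:

# load a second, hypothetical domain adapter alongside the one trained above
model.load_adapter("./llama3-medical-adapter", adapter_name="medical")
model.set_adapter("medical")   # route forward passes through the medical adapter
model.set_adapter("default")   # switch back to the adapter trained in this post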
🌟 Wrap-Up
You just fine-tuned a Llama 3 model in under an hour with minimal compute.
This workflow scales easily to domain-specific tasks — chatbots, summarizers, or research assistants.
Next Up → “Fine-Tuning Failures and Fixes” — how to debug instability, manage catastrophic forgetting, and evaluate your adapters.