Richard Abishai
Fine-Tuning Llama 3 with PEFT

Efficient parameter tuning for smarter, faster large language models.

Fine-tuning large models like Llama 3 no longer means retraining billions of parameters.

Thanks to PEFT (Parameter-Efficient Fine-Tuning), we can adapt models for new tasks with minimal compute — and keep the original weights frozen.

Let’s go through the setup, training, and evaluation for fine-tuning a Llama 3 model using Hugging Face’s PEFT library.


⚙️ 1. Environment Setup

First, make sure you’re using Python 3.10+ with GPU access.

pip install torch transformers datasets peft accelerate bitsandbytes

These are the key libraries:

transformers — base Llama 3 model + tokenizer

datasets — data loading utilities

peft — adapter training framework

accelerate — device placement and mixed-precision helpers (used by device_map="auto")

bitsandbytes — quantization support for low-memory GPUs
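
Before going further, a quick sanity check that PyTorch can actually see your GPU:

import torch

# an 8B model won't load in 8-bit without a CUDA device
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())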


🧩 2. Load Model & Tokenizer

Here’s how to load a base model — the 8B variant of Llama 3, or a smaller one if VRAM is tight. Note that the Llama 3 weights are gated on Hugging Face, so accept Meta’s license on the model page and authenticate with huggingface-cli login first.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Meta-Llama-3-8B"  # gated repo: accept the license first
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),  # 8-bit via bitsandbytes
    device_map="auto"
)

We’re loading in 8-bit precision to save VRAM, which makes fine-tuning feasible on a single GPU — whether a consumer card like an RTX 4090 or a datacenter A100.
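
If you want to confirm the savings, transformers models expose a get_memory_footprint() helper:

# rough size of the quantized weights in memory, in GB
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")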


🧠 3. Add a PEFT Adapter (LoRA)

LoRA (Low-Rank Adaptation) freezes the original weights and injects small trainable low-rank matrices alongside them.

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# prepare the quantized model for training (casts layer norms, enables input grads)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=32,                        # scaling factor applied to the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

This trains well under 1% of the model’s total parameters, keeping updates lightweight while preserving the base model’s capabilities.
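
Conceptually, LoRA learns a low-rank update BA that is added to each frozen weight matrix at forward time. A toy sketch of the idea (illustrative shapes only, not the library's internals):

import torch

d, r, alpha = 4096, 8, 32      # hidden size, LoRA rank, scaling (toy values)
W = torch.randn(d, d)          # frozen pretrained weight
A = torch.randn(r, d) * 0.01   # trainable down-projection
B = torch.zeros(d, r)          # trainable up-projection, initialized to zero

# effective weight seen by the forward pass
W_eff = W + (alpha / r) * (B @ A)
print(W_eff.shape)             # same shape as W; only A and B are trained

Only A and B are trained: 2 × d × r values instead of d² per matrix, which is where the sub-1% figure comes from.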


📚 4. Load a Sample Dataset

You can use any dataset from Hugging Face Datasets, or create your own.

from datasets import load_dataset

dataset = load_dataset("Abirate/english_quotes")  # simple example dataset

# Llama tokenizers ship without a pad token, so reuse EOS for padding
tokenizer.pad_token = tokenizer.eos_token

def format_data(example):
    tokens = tokenizer(example["quote"], truncation=True,
                       padding="max_length", max_length=128)
    # for causal LM fine-tuning, the labels mirror the input ids
    return {"input_ids": tokens["input_ids"], "labels": tokens["input_ids"].copy()}

tokenized_dataset = dataset.map(format_data)
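This dataset ships with only a train split; if you want a held-out set for evaluation later, you can carve one out (optional):

# optional: hold out 10% of the data for evaluation
split = tokenized_dataset["train"].train_test_split(test_size=0.1, seed=42)
train_data, eval_data = split["train"], split["test"]
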

🚀 5. Train the Model

We’ll use the Trainer API with LoRA adapters.

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./llama3-peft-demo",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # effective batch size of 16
    num_train_epochs=2,
    learning_rate=2e-4,              # LoRA tolerates higher rates than full fine-tuning
    logging_steps=10,
    save_steps=100,
    fp16=True                        # mixed precision for speed and memory
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
)

trainer.train()

Training modifies only the LoRA parameters (typically a few megabytes to a few tens of megabytes on disk), which makes it fast and cheap.
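
Trainer writes periodic checkpoints under output_dir; to persist the final adapter explicitly:

# for a PeftModel this saves just the small adapter weights and config
trainer.save_model("./llama3-peft-demo")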


🔎 6. Evaluate and Generate

After training, the model in memory already carries the trained adapter, so you can generate with it directly.

# the model returned by training already wraps the base weights with the adapter
model.eval()

prompt = "In the next decade, AI will"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

💾 7. Save & Upload to Hugging Face Hub

# for a PeftModel this writes only the adapter weights and config,
# not the 8B base model
model.save_pretrained("./llama3-finetuned")
tokenizer.save_pretrained("./llama3-finetuned")
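To use the adapter in a fresh session, load the base model again and attach the saved adapter on top. A minimal sketch:

from transformers import AutoModelForCausalLM
from peft import PeftModel

# reload the frozen base model, then attach the saved LoRA adapter
base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B", device_map="auto")
model = PeftModel.from_pretrained(base, "./llama3-finetuned")
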

Optionally, share it with the community:

huggingface-cli login
huggingface-cli upload your-username/llama3-finetuned ./llama3-finetuned
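You can also push straight from Python (swap in your own repo id):

# pushes the adapter weights and tokenizer files to the Hub
model.push_to_hub("your-username/llama3-finetuned")
tokenizer.push_to_hub("your-username/llama3-finetuned")
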

🧩 Why PEFT Matters

Efficient: updates only a fraction of parameters

Modular: you can swap adapters for different domains (see the sketch below)

Scalable: fine-tune huge models on affordable hardware

In short: PEFT turns fine-tuning into plug-and-play intelligence.
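
Adapter swapping is a few lines in peft. A sketch, reusing the base model from the reload snippet above (adapter paths are hypothetical):

from peft import PeftModel

# attach one adapter, then hot-swap another at runtime
model = PeftModel.from_pretrained(base, "./adapters/medical", adapter_name="medical")
model.load_adapter("./adapters/legal", adapter_name="legal")
model.set_adapter("legal")  # generations now use the legal-domain adapter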


🌟 Wrap-Up

You just fine-tuned a Llama 3 model in under an hour with minimal compute.
This workflow scales easily to domain-specific tasks — chatbots, summarizers, or research assistants.


Next Up → “Fine-Tuning Failures and Fixes” — how to debug instability, manage catastrophic forgetting, and evaluate your adapters.
