TL;DR: In July 2024, Meta released Llama 3.1, a family of open-weight models with up to 405B parameters. In this tutorial, we'll fine-tune the 8B version on a custom dataset using LoRA on a single GPU (a T4 in Colab). You'll learn the complete pipeline, from setup to inference.
🚀 The News That Matters
On July 23, 2024, Meta dropped Llama 3.1 – a major update to its open-weight LLM family. The 8B, 70B, and 405B models ship under the Llama 3.1 Community License, with an extended context length (128K tokens) and improved performance across benchmarks. For developers, this means free, state-of-the-art foundation models that we can fine-tune for our own use cases without paying API fees.
But raw models aren't always enough. You need to adapt them to your data, tone, or domain. That's exactly what we'll do today.
🧠 What You'll Learn
- How to load Llama 3.1 8B with 4-bit quantization to fit on a 16GB GPU
- How to prepare a custom instruction dataset
- How to fine-tune using LoRA (Low-Rank Adaptation) via PEFT and TRL
- How to save, reload, and run inference on your fine-tuned model
Time estimate: ~30 minutes (training in Colab may take 10–15 minutes on a small dataset).
📦 Prerequisites
- A free Google Colab account (or a local GPU with ≥16GB VRAM)
- Basic Python and PyTorch knowledge
- A Hugging Face account – sign up here and accept the Llama 3.1 license
- A small custom dataset (I'll provide a sample JSON for you to try)
🛠 Step 1: Set Up the Environment
Fire up a Colab notebook and set the runtime to T4 GPU (Runtime → Change runtime type → T4). Then install the required libraries:
!pip install -qU \
    transformers \
    datasets \
    accelerate \
    bitsandbytes \
    peft \
    trl \
    huggingface_hub
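Before going further, it can help to confirm that the runtime actually exposes a GPU. Here is a minimal sketch that uses only the standard library and assumes `nvidia-smi` is on the PATH, as it normally is in Colab GPU runtimes. The helper name `gpu_summary` is made up for this example.

```python
import shutil
import subprocess


def gpu_summary() -> str:
    """Return a short description of the visible NVIDIA GPUs, or a fallback message."""
    if shutil.which("nvidia-smi") is None:
        return "nvidia-smi not found: no NVIDIA GPU runtime detected"
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            timeout=30,
            check=True,
        )
    except (subprocess.SubprocessError, OSError) as exc:
        # Covers non-zero exit codes, timeouts, and launch failures
        return f"nvidia-smi failed: {exc}"
    return result.stdout.strip() or "nvidia-smi returned no GPUs"


print(gpu_summary())
```

On a T4 runtime, the printed line should name the card and report roughly 15–16GB of total memory. If you see the fallback message instead, recheck the runtime type.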
Log in to Hugging Face to access the gated Llama 3.1 repository:
from huggingface_hub import notebook_login
notebook_login()
Enter your token when prompted. If you haven't accepted the license yet, go to this page and click "Agree and access repository".
📥 Step 2: Load the Model & Tokenizer (4-bit)
We'll load the 8B model in 4‑bit to save memory. This uses bitsandbytes quantization.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3.1-8B"

# 4-bit NF4 quantization, with double quantization to shrink the weights further
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,  # the T4 has no native bfloat16 support
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Llama doesn't have a pad token by default

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    use_cache=False,  # the KV cache isn't needed during training
)
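Why 4-bit? The weights of the full 8B model alone would need ~16GB in float16, which is the entire memory of a T4. With 4-bit quantization they drop to ~6GB, leaving room for the LoRA adapter weights, their gradients and optimizer state, and activations.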
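The memory figures above can be sanity-checked with some quick arithmetic. This is a rough, weights-only sketch that ignores activations, the CUDA context, and adapter state. It assumes about 8.03B parameters, a 128,256-token vocabulary, and a hidden size of 4096. It also assumes that the input embedding and output head stay in 16-bit while everything else is stored as NF4 with double quantization.

```python
# Back-of-the-envelope weight memory for an ~8B-parameter model (weights only).
TOTAL_PARAMS = 8.03e9                # assumed approximate parameter count
VOCAB_SIZE, HIDDEN = 128_256, 4096   # assumed vocabulary and hidden sizes

# Assumption: the embedding matrix and the output head are kept in 16-bit.
unquantized_params = 2 * VOCAB_SIZE * HIDDEN
quantized_params = TOTAL_PARAMS - unquantized_params

# NF4 stores 4 bits per weight plus quantization constants. With double
# quantization (block size 64, constants re-quantized to 8 bits in groups
# of 256), the constants cost about 0.127 extra bits per weight.
overhead_bits = 8 / 64 + 32 / (64 * 256)
bits_per_quantized_param = 4 + overhead_bits

fp16_gb = TOTAL_PARAMS * 2 / 1e9
nf4_gb = (quantized_params * bits_per_quantized_param / 8
          + unquantized_params * 2) / 1e9

print(f"float16 weights: ~{fp16_gb:.1f} GB")
print(f"4-bit weights:   ~{nf4_gb:.1f} GB")
```

The estimate lands a little under 6GB because the two large vocabulary matrices are not quantized under these assumptions. The exact footprint you see in Colab will differ somewhat.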
📝 Step 3: Prepare Your Dataset
Fine-tuning works best with instruction‑style data. For this tutorial, we'll create a tiny dataset in the style of databricks/databricks-dolly-15k: a handful of Q&A pairs about Python tips.
Create a file train_data.jsonl with this content (or use your own):
{"instruction": "Explain how list comprehensions work in Python.", "output": "A list comprehension is a concise way to create lists. Syntax: [expression for item in iterable if condition]. For example, [x**2 for x in range(5)] returns [0, 1, 4, 9, 16]."}
{"instruction": "What is the difference between == and is in Python?", "output": "== compares values, while 'is' compares object identity. For example, a = [1,2]; b = [1,2]; a == b is True, but a is b is False because they are different objects in memory."}
{"instruction": "How do you open a file in Python?", "output": "Use open() with a context manager: with open('file.txt', 'r') as f: content = f.read(). The modes are 'r' (read), 'w' (write), 'a' (append), 'rb' (binary read), etc."}
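A malformed line in a JSONL file tends to surface only once training starts, so it may be worth validating the file first. The sketch below uses only the standard library. The helper names `validate_jsonl` and `format_example` are made up for this example, and the prompt template is an assumption (an Alpaca-style layout), not necessarily the one this tutorial uses.

```python
import json
from pathlib import Path

REQUIRED_KEYS = {"instruction", "output"}


def validate_jsonl(path):
    """Parse a JSONL file and check that every row has non-empty string fields."""
    records = []
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)  # raises json.JSONDecodeError on malformed JSON
        if not isinstance(row, dict):
            raise ValueError(f"line {line_no}: expected a JSON object")
        missing = REQUIRED_KEYS - set(row)
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        for key in REQUIRED_KEYS:
            if not isinstance(row[key], str) or not row[key].strip():
                raise ValueError(f"line {line_no}: '{key}' must be a non-empty string")
        records.append(row)
    return records


def format_example(row):
    """Render one record as a single training string (hypothetical template)."""
    return f"### Instruction:\n{row['instruction']}\n\n### Response:\n{row['output']}"


if __name__ == "__main__" and Path("train_data.jsonl").exists():
    rows = validate_jsonl("train_data.jsonl")
    print(f"{len(rows)} valid records")
    print(format_example(rows[0]))
```

Run in the same directory as train_data.jsonl, this prints the record count and a preview of the first formatted example. A bad line raises an error that names the offending line number.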
Now load and format it for the trainer:
from datasets import load_dataset
dataset = load_dataset("json", data_files="train_data