How to Fine-Tune Llama 3.1 70B with Ollama 0.6 and PyTorch 2.5
Large Language Models (LLMs) like Meta’s Llama 3.1 70B deliver state-of-the-art performance for general tasks, but fine-tuning is required to adapt them to domain-specific use cases. This guide walks through fine-tuning Llama 3.1 70B using Ollama 0.6 for model management and PyTorch 2.5 for training, with a focus on resource-efficient Parameter-Efficient Fine-Tuning (PEFT) via LoRA.
Prerequisites
Before starting, ensure you have:
- Hardware: 4x NVIDIA A100 (80GB) GPUs or equivalent (with 4-bit quantization the 70B weights alone occupy roughly 35-40GB of VRAM; optimizer state, activations, and LoRA gradients add substantially to this, so plan for multiple high-memory GPUs)
- Software: Ubuntu 22.04, Python 3.10+, CUDA 12.1+
- Ollama 0.6 installed (follow official setup instructions)
- Hugging Face account with access to Llama 3.1 70B (request access via Meta’s Hugging Face page)
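Before moving on, a quick environment check helps catch a missing driver or an old Python version early. The snippet below is a minimal sketch that only shells out to nvidia-smi and nvcc (it assumes both are on your PATH); adjust it to your setup:
# Quick environment sanity check (sketch; assumes nvidia-smi and nvcc are on PATH)
import shutil, subprocess, sys
print("Python:", sys.version.split()[0])  # expect 3.10+
if shutil.which("nvidia-smi"):
    # Lists each GPU with its total memory, e.g. "NVIDIA A100-SXM4-80GB, 81920 MiB"
    subprocess.run(["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"], check=True)
else:
    print("nvidia-smi not found - is the NVIDIA driver installed?")
if shutil.which("nvcc"):
    subprocess.run(["nvcc", "--version"], check=True)  # expect CUDA 12.1+
else:
    print("nvcc not found - is the CUDA toolkit installed?")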
Step 1: Set Up Ollama 0.6 and Pull Base Model
Ollama 0.6 simplifies local LLM management. First, pull the base Llama 3.1 70B model:
ollama pull llama3.1:70b
Verify the model is available:
ollama list
You should see llama3.1:70b in the output. Ollama 0.6 stores model weights as quantized GGUF blobs under ~/.ollama/models; these are what Ollama serves, but they are not in a format Hugging Face Transformers can train on, so Step 5 loads the original Hugging Face checkpoint for fine-tuning and Ollama comes back into play at deployment time.
Step 2: Install Fine-Tuning Dependencies
Install required Python libraries for fine-tuning with PyTorch 2.5:
pip3 install torch==2.5.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
Install fine-tuning-specific libraries:
pip install transformers==4.44.0 peft==0.12.0 trl==0.9.6 datasets==2.20.0 accelerate==0.33.0 bitsandbytes==0.43.1
These libraries enable LoRA-based fine-tuning, 4-bit quantization (to reduce VRAM usage), and integration with PyTorch 2.5’s optimized kernels.
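A quick import check confirms the stack is installed and that PyTorch 2.5 can see your GPUs (a minimal sketch; run it before committing to a long training job):
# Verify the training stack (sketch)
import torch, transformers, peft, trl
print("torch:", torch.__version__)             # expect 2.5.0
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
print("transformers:", transformers.__version__)
print("peft:", peft.__version__)
print("trl:", trl.__version__)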
Step 3: Prepare Fine-Tuning Dataset
We use the Alpaca instruction-response format for compatibility. Below is a sample dataset entry:
{
  "instruction": "Explain quantum entanglement in simple terms",
  "input": "",
  "output": "Quantum entanglement is a phenomenon where two particles become linked, so that the state of one instantly influences the state of the other, no matter how far apart they are."
}
Save your dataset as train.jsonl and load it with Hugging Face Datasets:
from datasets import load_dataset
dataset = load_dataset("json", data_files="train.jsonl", split="train")
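SFTTrainer trains on a single text column, so it helps to fold each Alpaca-style record into one prompt/response string up front. The sketch below is one reasonable way to do this; the prompt template and the format_example helper are illustrative choices, not part of the original dataset format:
def format_example(example):
    # Fold instruction, optional input, and output into a single training string
    if example["input"]:
        prompt = f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n"
    else:
        prompt = f"### Instruction:\n{example['instruction']}\n\n### Response:\n"
    return {"text": prompt + example["output"]}

dataset = dataset.map(format_example)
The training script in Step 6 points SFTTrainer at this combined text column.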
Step 4: Configure LoRA and Training Parameters
Full fine-tuning of a 70B model is infeasible for most users. We use LoRA so that only a small fraction of the parameters (on the order of 1%) is trained. Create a configuration file lora_config.json:
{
  "r": 64,
  "lora_alpha": 128,
  "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
  "lora_dropout": 0.05,
  "bias": "none",
  "task_type": "CAUSAL_LM"
}
Key PyTorch 2.5 training parameters (set in your training script):
- Learning rate: 2e-4
- Batch size: 1 per device (with gradient accumulation steps=16)
- Epochs: 3
- Optimizer: AdamW with PyTorch 2.5’s fused kernel enabled
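With these settings, the effective batch size is the per-device batch multiplied by the gradient-accumulation steps and the number of GPUs. A quick worked check, assuming the 4-GPU setup from the prerequisites:
per_device_batch = 1
grad_accum_steps = 16
num_gpus = 4  # assumption based on the prerequisites; adjust to your hardware
effective_batch = per_device_batch * grad_accum_steps * num_gpus
print(effective_batch)  # 64 examples per optimizer step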
Step 5: Initialize Model and Tokenizer
Load the Llama 3.1 70B weights with Hugging Face Transformers, applying 4-bit quantization via bitsandbytes to keep the memory footprint manageable. Note that Ollama's GGUF blobs are not in a format Transformers can train on, so we load the original checkpoint from the Hugging Face Hub (this is why the prerequisites include gated-model access):
import json
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
# Hugging Face checkpoint used for training (assumes you have been granted access; see Prerequisites)
model_id = "meta-llama/Meta-Llama-3.1-70B-Instruct"
# 4-bit (NF4) quantization config via bitsandbytes
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16
)
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
# Load base model with quantization
base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16
)
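# Optional, but standard practice for QLoRA-style training (not part of the original
# guide): prepare the quantized model for k-bit training. This enables gradient
# checkpointing and upcasts layer norms, improving stability and reducing activation memory.
from peft import prepare_model_for_kbit_training
base_model = prepare_model_for_kbit_training(base_model)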
# Apply LoRA config
lora_config = LoraConfig(**json.load(open("lora_config.json")))
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters() # With r=64 on every projection module, expect roughly 1% of parameters to be trainable
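As a rough cross-check on that figure, the adapter size can be estimated from the LoRA rank and the projection shapes. The sketch below uses Llama 3.1 70B's published dimensions (hidden size 8192, intermediate size 28672, 80 layers, 64 query heads, 8 key/value heads) and gives an estimate, not an exact count:
r = 64
hidden, intermediate, layers = 8192, 28672, 80
kv_dim = 8 * (hidden // 64)  # 8 key/value heads with head_dim 128 -> 1024

def lora_params(d_in, d_out):
    # A LoRA adapter adds two low-rank matrices: (d_in x r) and (r x d_out)
    return r * (d_in + d_out)

per_layer = (
    lora_params(hidden, hidden)              # q_proj
    + 2 * lora_params(hidden, kv_dim)        # k_proj, v_proj
    + lora_params(hidden, hidden)            # o_proj
    + 2 * lora_params(hidden, intermediate)  # gate_proj, up_proj
    + lora_params(intermediate, hidden)      # down_proj
)
total = per_layer * layers
print(f"{total/1e6:.0f}M trainable params, ~{100 * total / 70e9:.1f}% of 70B")  # ~828M, ~1.2%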
Step 6: Run Fine-Tuning with PyTorch 2.5
Use the TRL library’s SFTTrainer to run training, leveraging PyTorch 2.5’s optimized performance:
from trl import SFTTrainer
from transformers import TrainingArguments
training_args = TrainingArguments(
output_dir="./llama3.1-70b-finetuned",
per_device_train_batch_size=1,
gradient_accumulation_steps=16,
learning_rate=2e-4,
lr_scheduler_type="cosine",
num_train_epochs=3,
save_steps=500,
logging_steps=10,
fp16=False,
bf16=True, # PyTorch 2.5 optimizes bf16 on Ampere+ GPUs
optim="adamw_torch_fused", # Use PyTorch 2.5 fused optimizer
report_to="none"
)
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    dataset_text_field="text",  # the combined prompt/response column built in Step 3
    max_seq_length=2048
)
trainer.train()
model.save_pretrained("./llama3.1-70b-lora")
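Before deploying, it is worth a quick in-process smoke test of the adapter, reusing the model and tokenizer objects from Step 5 (a minimal sketch; the prompt and generation settings are illustrative):
prompt = "### Instruction:\nExplain quantum entanglement in simple terms\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))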
Step 7: Deploy Fine-Tuned Model with Ollama 0.6
After training, reference the LoRA adapter from an Ollama Modelfile (recent Ollama releases accept safetensors LoRA adapters for Llama models directly; older releases require converting the adapter to GGUF first):
FROM llama3.1:70b
ADAPTER ./llama3.1-70b-lora
SYSTEM "You are a domain-specific assistant trained on custom data."
Build the Ollama model:
ollama create llama3.1-70b-custom -f Modelfile
Run the fine-tuned model:
ollama run llama3.1-70b-custom
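Ollama also exposes a local HTTP API (port 11434 by default), which is handy for scripting evaluations against the fine-tuned model. A minimal sketch using the /api/generate endpoint and only the Python standard library:
import json
import urllib.request

payload = {
    "model": "llama3.1-70b-custom",
    "prompt": "Explain quantum entanglement in simple terms",
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])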
Conclusion
Fine-tuning Llama 3.1 70B with Ollama 0.6 and PyTorch 2.5 is accessible even with limited hardware thanks to LoRA and 4-bit quantization. This workflow balances performance and resource efficiency, letting you adapt state-of-the-art LLMs to your specific use case.