How to Fine-Tune Llama 3.1 70B with Ollama 0.6 and PyTorch 2.5
Large Language Models (LLMs) like Meta’s Llama 3.1 70B deliver state-of-the-art performance for general tasks, but fine-tuning is required to adapt them to domain-specific use cases. This guide walks through fine-tuning Llama 3.1 70B using Ollama 0.6 for model management and PyTorch 2.5 for training, with a focus on resource-efficient Parameter-Efficient Fine-Tuning (PEFT) via LoRA.
Prerequisites
Before starting, ensure you have:
- Hardware: 4x NVIDIA A100 (80GB) GPUs or equivalent (with 4-bit quantization the 70B weights alone occupy roughly 35-40GB of VRAM; optimizer state, activations, and LoRA gradients add substantially to this, so plan for multiple high-memory GPUs)
- Software: Ubuntu 22.04, Python 3.10+, CUDA 12.1+
- Ollama 0.6 installed (follow official setup instructions)
- Hugging Face account with access to Llama 3.1 70B (request access via Meta’s Hugging Face page)
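Before moving on, a quick environment check helps catch a missing driver or an old Python version early. The snippet below is a minimal sketch that only shells out to nvidia-smi and nvcc (it assumes both are on your PATH); adjust it to your setup:
# Quick environment sanity check (sketch; assumes nvidia-smi and nvcc are on PATH)
import shutil, subprocess, sys
print("Python:", sys.version.split()[0])  # expect 3.10+
if shutil.which("nvidia-smi"):
    # Lists each GPU with its total memory, e.g. "NVIDIA A100-SXM4-80GB, 81920 MiB"
    subprocess.run(["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"], check=True)
else:
    print("nvidia-smi not found - is the NVIDIA driver installed?")
if shutil.which("nvcc"):
    subprocess.run(["nvcc", "--version"], check=True)  # expect CUDA 12.1+
else:
    print("nvcc not found - is the CUDA toolkit installed?")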
Step 1: Set Up Ollama 0.6 and Pull Base Model
Ollama 0.6 simplifies local LLM management. First, pull the base Llama 3.1 70B model:
ollama pull llama3.1:70b
Verify the model is available:
ollama list
You should see llama3.1:70b in the output. Ollama 0.6 stores model weights as quantized GGUF blobs under ~/.ollama/models; these are what Ollama serves, but they are not in a format Hugging Face Transformers can train on, so Step 5 loads the original Hugging Face checkpoint for fine-tuning and Ollama comes back into play at deployment time.
Step 2: Install Fine-Tuning Dependencies
Install required Python libraries for fine-tuning with PyTorch 2.5:
pip3 install torch==2.5.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
Install fine-tuning-specific libraries:
pip install transformers==4.44.0 peft==0.12.0 trl==0.9.6 datasets==2.20.0 accelerate==0.33.0 bitsandbytes==0.43.1
These libraries enable LoRA-based fine-tuning, 4-bit quantization (to reduce VRAM usage), and integration with PyTorch 2.5’s optimized kernels.
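A quick import check confirms the stack is installed and that PyTorch 2.5 can see your GPUs (a minimal sketch; run it before committing to a long training job):
# Verify the training stack (sketch)
import torch, transformers, peft, trl
print("torch:", torch.__version__)             # expect 2.5.0
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
print("transformers:", transformers.__version__)
print("peft:", peft.__version__)
print("trl:", trl.__version__)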
Step 3: Prepare Fine-Tuning Dataset
We use the Alpaca instruction-response format for compatibility. Below is a sample dataset entry:
{
  "instruction": "Explain quantum entanglement in simple terms",
  "input": "",
  "output": "Quantum entanglement is a phenomenon where two particles become linked, so that the state of one instantly influences the state of the other, no matter how far apart they are."
}
Save your dataset as train.jsonl and load it with Hugging Face Datasets:
from datasets import load_dataset
dataset = load_dataset("json", data_files="train.jsonl", split="train")
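SFTTrainer trains on a single text column, so it helps to fold each Alpaca-style record into one prompt/response string up front. The sketch below is one reasonable way to do this; the prompt template and the format_example helper are illustrative choices, not part of the original dataset format:
def format_example(example):
    # Fold instruction, optional input, and output into a single training string
    if example["input"]:
        prompt = f"### Instruction:\n{example['instruction']}\n\n### Input:\n{example['input']}\n\n### Response:\n"
    else:
        prompt = f"### Instruction:\n{example['instruction']}\n\n### Response:\n"
    return {"text": prompt + example["output"]}

dataset = dataset.map(format_example)
The training script in Step 6 points SFTTrainer at this combined text column.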
Step 4: Configure LoRA and Training Parameters
Full fine-tuning of a 70B model is infeasible for most users. We use LoRA so that only a small fraction of the parameters (on the order of 1%) is trained. Create a configuration file lora_config.json:
{
  "r": 64,
  "lora_alpha": 128,
  "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
  "lora_dropout": 0.05,
  "bias": "none",
  "task_type": "CAUSAL_LM"
}
Key PyTorch 2.5 training parameters (set in your training script):
- Learning rate: 2e-4
- Batch size: 1 per device (with gradient accumulation steps=16)
- Epochs: 3
- Optimizer: AdamW with PyTorch 2.5’s fused kernel enabled
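With these settings, the effective batch size is the per-device batch multiplied by the gradient-accumulation steps and the number of GPUs. A quick worked check, assuming the 4-GPU setup from the prerequisites:
per_device_batch = 1
grad_accum_steps = 16
num_gpus = 4  # assumption based on the prerequisites; adjust to your hardware
effective_batch = per_device_batch * grad_accum_steps * num_gpus
print(effective_batch)  # 64 examples per optimizer step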
Step 5: Initialize Model and Tokenizer
Load the Llama 3.1 70B weights with Hugging Face Transformers, applying 4-bit quantization via bitsandbytes to keep the memory footprint manageable. Note that Ollama's GGUF blobs are not in a format Transformers can train on, so we load the original checkpoint from the Hugging Face Hub (this is why the prerequisites include gated-model access):
import json
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model
# Hugging Face checkpoint used for training (assumes you have been granted access; see Prerequisites)
model_id = "meta-llama/Meta-Llama-3.1-70B-Instruct"
# 4-bit (NF4) quantization config via bitsandbytes
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16
)
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
# Load base model with quantization
base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16
)
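# Optional, but standard practice for QLoRA-style training (not part of the original
# guide): prepare the quantized model for k-bit training. This enables gradient
# checkpointing and upcasts layer norms, improving stability and reducing activation memory.
from peft import prepare_model_for_kbit_training
base_model = prepare_model_for_kbit_training(base_model)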
# Apply LoRA config
lora_config = LoraConfig(**json.load(open("lora_config.json")))
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters() # With r=64 on every projection module, expect roughly 1% of parameters to be trainable
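As a rough cross-check on that figure, the adapter size can be estimated from the LoRA rank and the projection shapes. The sketch below uses Llama 3.1 70B's published dimensions (hidden size 8192, intermediate size 28672, 80 layers, 64 query heads, 8 key/value heads) and gives an estimate, not an exact count:
r = 64
hidden, intermediate, layers = 8192, 28672, 80
kv_dim = 8 * (hidden // 64)  # 8 key/value heads with head_dim 128 -> 1024

def lora_params(d_in, d_out):
    # A LoRA adapter adds two low-rank matrices: (d_in x r) and (r x d_out)
    return r * (d_in + d_out)

per_layer = (
    lora_params(hidden, hidden)              # q_proj
    + 2 * lora_params(hidden, kv_dim)        # k_proj, v_proj
    + lora_params(hidden, hidden)            # o_proj
    + 2 * lora_params(hidden, intermediate)  # gate_proj, up_proj
    + lora_params(intermediate, hidden)      # down_proj
)
total = per_layer * layers
print(f"{total/1e6:.0f}M trainable params, ~{100 * total / 70e9:.1f}% of 70B")  # ~828M, ~1.2%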
Step 6: Run Fine-Tuning with PyTorch 2.5
Use the TRL library’s SFTTrainer to run training, leveraging PyTorch 2.5’s optimized performance:
from trl import SFTTrainer
from transformers import TrainingArguments
training_args = TrainingArguments(
output_dir="./llama3.1-70b-finetuned",
per_device_train_batch_size=1,
gradient_accumulation_steps=16,
learning_rate=2e-4,
lr_scheduler_type="cosine",
num_train_epochs=3,
save_steps=500,
logging_steps=10,
fp16=False,
bf16=True, # PyTorch 2.5 optimizes bf16 on Ampere+ GPUs
optim="adamw_torch_fused", # Use PyTorch 2.5 fused optimizer
report_to="none"
)
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    tokenizer=tokenizer,
    dataset_text_field="text",  # the combined prompt/response column built in Step 3
    max_seq_length=2048
)
trainer.train()
model.save_pretrained("./llama3.1-70b-lora")
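Before deploying, it is worth a quick in-process smoke test of the adapter, reusing the model and tokenizer objects from Step 5 (a minimal sketch; the prompt and generation settings are illustrative):
prompt = "### Instruction:\nExplain quantum entanglement in simple terms\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))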
Step 7: Deploy Fine-Tuned Model with Ollama 0.6
After training, reference the LoRA adapter from an Ollama Modelfile (recent Ollama releases accept safetensors LoRA adapters for Llama models directly; older releases require converting the adapter to GGUF first):
FROM llama3.1:70b
ADAPTER ./llama3.1-70b-lora
SYSTEM "You are a domain-specific assistant trained on custom data."
Build the Ollama model:
ollama create llama3.1-70b-custom -f Modelfile
Run the fine-tuned model:
ollama run llama3.1-70b-custom
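Ollama also exposes a local HTTP API (port 11434 by default), which is handy for scripting evaluations against the fine-tuned model. A minimal sketch using the /api/generate endpoint and only the Python standard library:
import json
import urllib.request

payload = {
    "model": "llama3.1-70b-custom",
    "prompt": "Explain quantum entanglement in simple terms",
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])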
Conclusion
Fine-tuning Llama 3.1 70B with Ollama 0.6 and PyTorch 2.5 is accessible even with limited hardware thanks to LoRA and 4-bit quantization. This workflow balances performance and resource efficiency, letting you adapt state-of-the-art LLMs to your specific use case.