Fine-Tuning Mistral-7B for Scientific Research: A Step-by-Step Guide


Fine-tuning large language models (LLMs) like Mistral-7B for domain-specific tasks is a powerful way to adapt their capabilities to specialized fields such as scientific research. In this guide, we'll walk through a well-structured Jupyter notebook for fine-tuning Mistral-7B with LoRA (Low-Rank Adaptation) and 4-bit quantization in a GPU-enabled environment. The notebook is optimized for platforms like Kaggle or Colab and is built for reproducibility and efficiency. Whether you're a machine learning practitioner or a researcher, this tutorial will help you understand the process and adapt it to your own projects.

Why Fine-Tune Mistral-7B?
Mistral-7B, developed by Mistral AI, is a 7-billion-parameter model known for its efficiency and performance in natural language processing tasks. Fine-tuning it for scientific research lets you tailor its responses to domain-specific jargon, hypotheses, and datasets, improving accuracy and relevance. Techniques like LoRA and 4-bit quantization make this process computationally feasible on a single modest GPU such as the NVIDIA Tesla T4 available on Kaggle and Colab.

Overview of the Notebook
The notebook is structured for clarity and efficiency, following a clear workflow:

  • Imports: All dependencies are listed upfront.
  • Functions: Modular functions handle specific tasks like model loading, dataset preparation, and training.
  • Main Execution: The main() function orchestrates the workflow.
  • CPU/GPU Division: Data preparation runs on the CPU, while model training leverages the GPU.
  • Token Batching: The notebook uses a batch size and sequence length to manage memory, with notes on implementing a custom 100M/30M token strategy for large datasets.

Let’s dive into the key components and how they work together.

Step 1: Setting Up the Environment
The notebook begins by installing and importing essential libraries, ensuring compatibility with GPU acceleration. Key dependencies include:

Transformers: For model and tokenizer handling.
BitsAndBytes: For 4-bit quantization to reduce memory usage.
PEFT: For LoRA implementation.
TRL: For supervised fine-tuning (SFT) with the SFTTrainer.
Datasets: For loading and processing datasets.
PyTorch: For GPU computations.

Here’s a snippet of the import section:

import os
import torch
import json
import gc
from huggingface_hub import login
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
    DataCollatorForLanguageModeling
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from datasets import load_dataset
from trl import SFTTrainer

The notebook also checks library versions and GPU availability, ensuring the environment is correctly configured:

print(f"CUDA available: {torch.cuda.is_available()}")
print(f"CUDA version: {torch.version.cuda}")
!nvidia-smi
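
For completeness, here is one minimal way to log the library versions mentioned above; the notebook's own cell may differ slightly:

import transformers, datasets, peft, trl, bitsandbytes, torch

# Print the version of each core dependency for reproducibility
for lib in (transformers, datasets, peft, trl, bitsandbytes, torch):
    print(f"{lib.__name__}: {lib.__version__}")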

To authenticate with Hugging Face for model and dataset access, the notebook uses a token stored in Kaggle Secrets:

# Kaggle-specific client for reading notebook secrets
from kaggle_secrets import UserSecretsClient

def hf_login():
    try:
        client = UserSecretsClient()
        token = client.get_secret("HF_TOKEN")
        login(token=token)
        print("Hugging Face login complete.")
    except Exception as e:
        print(f"Failed to access HF_TOKEN: {e}")
        raise

Step 2: Configuration
A Config class centralizes all hyperparameters and paths, making it easy to modify settings without digging through the code. Key parameters include:

Model Name: mistralai/Mistral-7B-v0.1
Dataset Name: Allanatrix/Scientific_Research_Tokenized
Sequence Length: 1024 tokens
Batch Size: 1 (with gradient accumulation to simulate larger batches)
Learning Rate: 2e-5
Epochs: 2
Output Directories: For saving results and artifacts

class Config:
    MODEL_NAME = "mistralai/Mistral-7B-v0.1"
    DATASET_NAME = "Allanatrix/Scientific_Research_Tokenized"
    NEW_MODEL_NAME = "nexa-mistral-sci7b"
    MAX_SEQ_LENGTH = 1024
    BATCH_SIZE = 1
    GRADIENT_ACCUMULATION_STEPS = 64
    LEARNING_RATE = 2e-5
    NUM_TRAIN_EPOCHS = 2
    OUTPUT_DIR = "/kaggle/working/results"
    ARTIFACTS_DIR = "/kaggle/working/artifacts"

This configuration is also exportable as JSON for reproducibility:

def to_dict(self):
    # Defined as a method on Config. vars(type(self)) collects the class-level
    # attributes above; vars(self) would be empty, since nothing is set on the instance.
    return {
        k: v for k, v in vars(type(self)).items()
        if not k.startswith("__") and not callable(v)
    }
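
For example, you can dump the resolved settings at the start of a run; the artifact-saving step later writes the same dictionary to training_config.json:

import json

config = Config()
print(json.dumps(config.to_dict(), indent=4))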

Step 3: Loading the Model and Tokenizer
The get_model_and_tokenizer function loads Mistral-7B with 4-bit quantization to reduce memory usage, enabling it to run on a single Tesla T4 GPU. The BitsAndBytesConfig specifies:

4-bit Quantization: Using the nf4 type.
Compute Data Type: bfloat16 for faster GPU computations.
Device Map: Loads the model onto GPU 0.

The tokenizer is configured with the end-of-sequence token as the padding token and right-side padding for causal language modeling.

def get_model_and_tokenizer(model_name: str):
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=False,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        trust_remote_code=True,
        device_map={"": 0}
    )
    model.config.use_cache = False
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"
    return model, tokenizer

Memory management is critical, so the function includes calls to torch.cuda.empty_cache() and gc.collect() to free up GPU and CPU memory.
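
Those cleanup calls are not shown in the excerpt above; a typical pattern looks like this:

import gc
import torch

def free_memory():
    """Release unreferenced Python objects and return cached GPU memory to the allocator."""
    gc.collect()
    torch.cuda.empty_cache()

free_memory()  # call after deleting large intermediate objects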

Step 4: Preparing the Dataset
The load_and_prepare_dataset function handles dataset loading and tokenization on the CPU to avoid GPU memory bottlenecks. It loads the Allanatrix/Scientific_Research_Tokenized dataset from Hugging Face and tokenizes the input_text column with a maximum sequence length of 1024 tokens. Empty sequences are filtered out to ensure data quality.

def load_and_prepare_dataset(dataset_name: str, tokenizer: AutoTokenizer, max_seq_length: int):
    dataset = load_dataset(dataset_name)
    def tokenize_function(examples):
        return tokenizer(
            examples["input_text"],
            truncation=True,
            max_length=max_seq_length
        )
    tokenized_dataset = dataset.map(
        tokenize_function,
        batched=True,
        remove_columns=[col for col in dataset["train"].column_names if col != "input_ids"],
        desc="Tokenizing dataset"
    )
    tokenized_dataset = tokenized_dataset.filter(lambda x: len(x["input_ids"]) > 0, desc="Filtering empty sequences")
    return tokenized_dataset

The notebook mentions a "100M token pool, feed 30M until 100M" strategy, which would require a custom IterableDataset for streaming large datasets. While not fully implemented here, the MAX_SEQ_LENGTH and BATCH_SIZE settings control token batching, and group_by_length in the training arguments optimizes padding efficiency.
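
While the notebook stops short of implementing that loader, here is a minimal sketch of what it could look like, assuming the same input_text column; the class name and the token_budget parameter are illustrative and not part of the notebook:

from torch.utils.data import IterableDataset
from datasets import load_dataset

class TokenBudgetDataset(IterableDataset):
    """Streams tokenized examples until a fixed token budget is consumed."""

    def __init__(self, dataset_name, tokenizer, max_seq_length, token_budget=30_000_000):
        self.dataset_name = dataset_name
        self.tokenizer = tokenizer
        self.max_seq_length = max_seq_length
        self.token_budget = token_budget  # e.g. feed 30M tokens per pass from a larger pool

    def __iter__(self):
        # Stream the dataset from the Hub instead of loading it all into memory
        stream = load_dataset(self.dataset_name, split="train", streaming=True)
        consumed = 0
        for example in stream:
            if consumed >= self.token_budget:
                break
            tokens = self.tokenizer(
                example["input_text"],
                truncation=True,
                max_length=self.max_seq_length,
            )
            consumed += len(tokens["input_ids"])
            yield tokens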

Step 5: Configuring LoRA
LoRA is used to fine-tune only a small subset of parameters, reducing memory and compute requirements. The get_lora_config function sets up LoRA with:

Rank (r): 64
Alpha: 16
Dropout: 0.1
Task Type: Causal language modeling

def get_lora_config():
    lora_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.1,
        r=64,
        bias="none",
        task_type="CAUSAL_LM",
    )
    return lora_config

The model is prepared for LoRA fine-tuning with gradient checkpointing and quantization-aware training:

model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

This reduces trainable parameters to approximately 0.375% of the total (27M out of 7.2B), significantly lowering memory usage.

Step 6: Training Arguments
The get_training_arguments function configures the TrainingArguments for the SFTTrainer. Key settings include:

Batch Size: 1 per device, with 64 gradient accumulation steps to simulate a larger batch size.
Learning Rate: 2e-5 with a cosine scheduler.
Optimizer: Paged AdamW in 8-bit precision.
Precision: bf16 for faster training.
Logging and Saving: Every 25 steps, with TensorBoard reporting.
Group by Length: To minimize padding and optimize GPU utilization.

def get_training_arguments(config: Config):
    training_args = TrainingArguments(
        output_dir=config.OUTPUT_DIR,
        num_train_epochs=config.NUM_TRAIN_EPOCHS,
        per_device_train_batch_size=config.BATCH_SIZE,
        gradient_accumulation_steps=config.GRADIENT_ACCUMULATION_STEPS,
        optim="paged_adamw_8bit",
        save_steps=25,
        logging_steps=25,
        learning_rate=config.LEARNING_RATE,
        weight_decay=0.001,
        bf16=True,
        max_grad_norm=0.3,
        warmup_ratio=0.03,
        group_by_length=True,
        lr_scheduler_type="cosine",
        report_to="tensorboard"
    )
    return training_args
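
To put those numbers in perspective: with a per-device batch size of 1 and 64 gradient-accumulation steps, each optimizer step effectively covers 1 × 64 = 64 sequences, i.e. up to 64 × 1024 ≈ 65K tokens at the configured sequence length.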

Step 7: Fine-Tuning the Model
The fine_tune_model function uses the SFTTrainer from the TRL library to perform supervised fine-tuning. It combines the model, dataset, tokenizer, LoRA configuration, and training arguments. The DataCollatorForLanguageModeling prepares causal-LM batches on the fly, and the trainer moves them to the GPU as training proceeds.

def fine_tune_model(model, dataset, tokenizer, lora_config, training_args, max_seq_length):
    # Collator for causal language modeling (mlm=False disables the masked-LM objective)
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset["train"],
        peft_config=lora_config,
        dataset_text_field="input_ids",
        max_seq_length=max_seq_length,
        tokenizer=tokenizer,
        data_collator=data_collator,
        args=training_args
    )
    trainer.train()
    return trainer

The training process runs for 2 epochs, with progress logged to TensorBoard.
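
To follow the loss curves while training (or afterwards), you can point TensorBoard at the results directory from another notebook cell; in Kaggle or Colab this typically looks like the following, with the log directory matching Config.OUTPUT_DIR:

%load_ext tensorboard
%tensorboard --logdir /kaggle/working/results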

Step 8: Saving Artifacts
After training, the save_model_artifacts function saves the fine-tuned model, tokenizer, training configuration, and arguments to the artifacts directory. These files ensure the model can be reloaded or shared later.

def save_model_artifacts(trainer: SFTTrainer, config: Config, training_args: TrainingArguments):
    final_model_path = os.path.join(config.ARTIFACTS_DIR, config.NEW_MODEL_NAME)
    trainer.save_model(final_model_path)
    trainer.tokenizer.save_pretrained(final_model_path)
    config_filename = os.path.join(config.ARTIFACTS_DIR, "training_config.json")
    with open(config_filename, 'w') as f:
        json.dump(config.to_dict(), f, indent=4)
    training_args_filename = os.path.join(config.ARTIFACTS_DIR, "training_arguments.json")
    with open(training_args_filename, 'w') as f:
        json.dump(training_args.to_dict(), f, indent=4)
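
To reuse the adapter later for inference or further training, you can attach the saved LoRA weights to a freshly loaded base model. Here is a minimal sketch, assuming the artifact path produced by the Config above (reuse the BitsAndBytesConfig from Step 3 if you need to stay within T4 memory):

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

adapter_path = "/kaggle/working/artifacts/nexa-mistral-sci7b"  # ARTIFACTS_DIR + NEW_MODEL_NAME

base_model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",
    device_map={"": 0},
)
model = PeftModel.from_pretrained(base_model, adapter_path)  # attaches the LoRA weights
tokenizer = AutoTokenizer.from_pretrained(adapter_path)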

Step 9: Running the Workflow
The main() function ties everything together, executing the workflow inside a try-except block for robust error handling (omitted from the excerpt below for brevity). It initializes the configuration, sets up directories, logs into Hugging Face, loads the model and dataset, configures LoRA and the training arguments, fine-tunes the model, and saves the artifacts.

def main():
    config = Config()
    os.makedirs(config.ARTIFACTS_DIR, exist_ok=True)
    hf_login()
    model, tokenizer = get_model_and_tokenizer(config.MODEL_NAME)
    dataset = load_and_prepare_dataset(config.DATASET_NAME, tokenizer, config.MAX_SEQ_LENGTH)
    lora_config = get_lora_config()
    model.gradient_checkpointing_enable()
    model = prepare_model_for_kbit_training(model)
    model = get_peft_model(model, lora_config)
    training_args = get_training_arguments(config)
    trainer = fine_tune_model(model, dataset, tokenizer, lora_config, training_args, config.MAX_SEQ_LENGTH)
    save_model_artifacts(trainer, config, training_args)

Key Features and Optimizations

Memory Efficiency:

  • 4-bit quantization reduces the model’s memory footprint.
  • Gradient checkpointing trades compute for memory.
  • Frequent calls to torch.cuda.empty_cache() and gc.collect() prevent memory leaks.

Scalability:

  • The notebook is designed for a single GPU but can be adapted for multi-GPU setups using accelerate.
  • The group_by_length option minimizes padding, improving training speed.

Reproducibility:

  • All configurations are saved as JSON files.
  • Library versions and GPU details are logged for debugging.

Custom Token Batching:

  • While not fully implemented, the notebook outlines a strategy for handling large datasets with a 100M/30M token approach, which could be extended with a custom IterableDataset.

Challenges and Future Improvements

  • Dataset Size: The Allanatrix/Scientific_Research_Tokenized dataset may be small, as evidenced by the quick training (2 steps in the output). For real-world applications, you’d need a larger dataset or a custom streaming loader.
  • Custom Batching: Implementing the 100M/30M token strategy requires a custom data loader, which could be added using PyTorch’s IterableDataset.
  • Warnings: The notebook includes deprecated arguments (dataset_text_field, max_seq_length) in SFTTrainer. Future versions should use SFTConfig to avoid warnings.
  • Evaluation: The notebook focuses on training but lacks an evaluation step. Adding a validation dataset and metrics like perplexity would improve model assessment; a minimal sketch follows below.
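
As a starting point for that last item, the trainer's built-in evaluation loop already reports a loss that converts directly to perplexity. This sketch assumes you have created a validation split and passed it to SFTTrainer as eval_dataset, which the notebook does not do yet:

import math

eval_metrics = trainer.evaluate()  # requires eval_dataset to have been set on the trainer
perplexity = math.exp(eval_metrics["eval_loss"])
print(f"Eval loss: {eval_metrics['eval_loss']:.4f} | Perplexity: {perplexity:.2f}")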

How to Run the Notebook

Environment Setup:

  • Use a Kaggle notebook with a Tesla T4 GPU or a Colab instance with a similar GPU.
  • Add your Hugging Face token to Kaggle Secrets as HF_TOKEN.

Dependencies:

  • The notebook installs all required libraries (see the example install cell after this list). Ensure you restart the kernel if prompted.

Execution:

  • Run the cells sequentially. The main() function handles the entire workflow.

Output:

  • Artifacts (model weights, tokenizer, configs) are saved to /kaggle/working/artifacts.
  • Training logs are available in TensorBoard.
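
The exact install cell changes between library releases; a typical (unpinned) version looks like this, run in a notebook cell:

!pip install -q -U transformers datasets peft trl bitsandbytes accelerate tensorboard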

Conclusion
This notebook provides a robust and reproducible framework for fine-tuning Mistral-7B on a scientific research dataset. By leveraging LoRA, 4-bit quantization, and a modular design, it makes fine-tuning accessible on modest hardware. Whether you’re adapting LLMs for scientific research or another domain, this guide offers a solid foundation to build upon. Future enhancements could include larger datasets, custom batching, and evaluation metrics to further refine the model.
Feel free to fork the notebook, experiment with your own datasets, and share your results! If you have questions or improvements, drop them in the comments below.

Resources:
Fine-tuned model (Nexa-Mistral-Sci7b) on Hugging Face: https://huggingface.co/Allanatrix/Nexa-Mistral-Sci7b
Scientific Research Dataset: https://huggingface.co/datasets/Allanatrix/Scientific_Research_Tokenized
GitHub repo with the notebook: https://github.com/DarkStarStrix/Nexa_Auto

Happy fine-tuning! 🚀
