Recently, Large Language Models (LLMs) have garnered attention for their impressive capabilities in the realm of Natural Language Understanding (NLU). Trained on diverse and extensive datasets, LLMs seem to generalize well across a wide range of tasks with little to no task-specific training data. In this article, we will examine one of the less frequently explored NLU tasks when it comes to LLMs — Emotion Prediction.
Despite notable advancements, Emotion Prediction remains an evolving area of research. It is a good test of a model's true NLU capability because it demands detecting subtle emotional cues in text. We will focus on assessing text-based Emotion Prediction using LLMs in two real-world scenarios -
- No Labelled Data Available
- Labelled Data Available But Noisy
For the purposes of our experiment, we will be using mistralai/Mistral-7B-Instruct-v0.2 as our LLM. We will also build corresponding competitive baseline approaches to benchmark the LLM's performance.
For the remainder of the article, we will be using 'LLM' as a shorthand reference to 'mistralai/Mistral-7B-Instruct-v0.2'.
TL;DR
This is going to be a long post. The intent is to share code snippets for each step in all experiments covered in the article. If you want to skip directly to the results, you can find them under the section Consolidating Results.
Preliminaries
Before getting started, we need to install and load the required dependencies.
!pip install setfit
!pip install bitsandbytes
!pip install accelerate
!pip install huggingface_hub
!pip install peft
!pip install dqc-toolkit
from datasets import load_dataset, Dataset
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from typing import List, Tuple, Union
import numpy as np
import os
import pandas as pd
import torch
import transformers
import wandb
import warnings
transformers.logging.set_verbosity_error()
wandb.init(mode="disabled")
warnings.filterwarnings('ignore')
Some of the installed libraries and imported modules will make more sense as we proceed. We also set the logging verbosity to 'error' and suppress warnings; this isn't strictly necessary, we do it only to keep the notebook outputs relatively clean.
Dataset
For the purpose of all experiments, we will be using emotion, a publicly available dataset hosted on Hugging Face. It consists of English-language tweets annotated with one of six emotions as shown below -
['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']
The dataset has 16,000 training samples and 2,000 validation samples.
dataset = 'dair-ai/emotion'
dset = load_dataset(dataset, trust_remote_code=True)
train_data = pd.DataFrame(dset['train'])
val_data = pd.DataFrame(dset['validation'])
text_col, label_col = 'text', 'label'
train_data.head()
| | text | label |
|---|---|---|
| 0 | i didnt feel humiliated | 0 |
| 1 | i can go from feeling so hopeless to so damned... | 0 |
| 2 | im grabbing a minute to post i feel greedy wrong | 3 |
| 3 | i am ever feeling nostalgic about the fireplac... | 2 |
| 4 | i am feeling grouchy | 3 |
Transform Integer Labels to Text Labels
The labels in the dataset are integers. Since these integer labels carry no semantic information that the LLM can use, we need a mapping from the integer labels to text descriptions of the emotions.
label_to_text = {0 : 'sadness',
1 : 'joy',
2 : 'love',
3 : 'anger',
4 : 'fear',
5 : 'surprise'}
We will consume this dictionary downstream when we run our LLM.
Evaluation Metric
For the purpose of benchmarking our experiments, we choose Weighted F1 score as the metric. We also display the classification report and confusion matrix for detailed interpretation.
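As a quick illustration of what this metric computes (a toy sketch; assumed equivalent to sklearn's average='weighted' behaviour), the weighted F1 score is the mean of the per-class F1 scores weighted by each class's support in the true labels:
import numpy as np
from sklearn.metrics import f1_score
# Toy labels for illustration only
y_true_toy = np.array([0, 0, 1, 1, 1, 2])
y_pred_toy = np.array([0, 1, 1, 1, 0, 2])
per_class_f1 = f1_score(y_true_toy, y_pred_toy, average=None)  # one F1 score per class
support = np.bincount(y_true_toy)                              # number of true samples per class
manual_weighted_f1 = np.sum(per_class_f1 * support / support.sum())
assert np.isclose(manual_weighted_f1, f1_score(y_true_toy, y_pred_toy, average='weighted'))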
from sklearn.metrics import (classification_report, confusion_matrix, ConfusionMatrixDisplay, f1_score)
from typing import List
import matplotlib.pyplot as plt
def fetch_performance_metrics(y_true: np.ndarray, y_pred: np.ndarray, exp_name: str,display_report: bool = True, display_confusion_matrix: bool = True,label_list: List[str] = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'],num_labels: int = 6) -> dict:
"""
Util function to compute F1 score and optionally display the classification report and confusion matrix for a given experiment.
Args:
y_true (np.ndarray): Array containing true labels.
y_pred (np.ndarray): Array containing predicted labels.
exp_name (str): Name of the experiment (used to save results).
display_report (bool, optional): Boolean flag indicating whether to display classification report (True) or not (False). Defaults to True.
display_confusion_matrix (bool, optional): Boolean flag indicating whether to display confusion matrix (True) or not (False). Defaults to True.
label_list (list, optional): List of labels. Defaults to ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'].
num_labels (int, optional): Number of unique labels. Defaults to 6.
Returns:
dict: A dictionary containing F1 score.
"""
if display_report:
print('\nClassification Report:')
print(classification_report(y_true=y_true, y_pred=y_pred, labels=list(range(num_labels)),target_names=label_list[:num_labels]))
if display_confusion_matrix:
print('\nConfusion Matrix:')
cm = confusion_matrix(y_true=y_true, y_pred=y_pred)
fig, ax = plt.subplots(figsize=(8, 8))
display = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=label_list)
display.plot(ax=ax)
plt.savefig(exp_name)
return {'F1-score' : f1_score(y_true, y_pred, average='weighted')}
Alright! Now we are ready to begin.
Scenario #1 - No Labelled Data Available
Labelled data unavailability is a common bottleneck in real-world machine learning. Constructing a sizeable set of labelled samples may not be possible for various reasons such as the cost of manual labelling, data privacy and regulations, etc. To benchmark the LLM in this scenario, we will explore In-Context Learning, which consumes minimal labelled samples to generate predictions.
Baseline
When no labelled samples are available, Unsupervised Learning and Few Shot Learning are commonly employed solutions. We will consider Few Shot Learning since it is closer to the LLM's In-Context Learning. Specifically, we will use SetFit, a popular Few Shot Learning framework from Hugging Face, as our baseline.
We setup the training and inference script.
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, Trainer, TrainingArguments, SetFitTrainer
def train_and_infer_setfit(model_name: str, train_data: pd.DataFrame, val_data: pd.DataFrame, seed: int,
text_col: str = 'text', label_column: str = 'label', batch_size: int = 64, num_epochs: int = 3) -> dict:
"""Function to train Huggingface's Setfit model on input training data and return the computed performance metrics on input validation data
Args:
model_name (str): Sentence Transformer model path to be used for training
train_data (pd.DataFrame): Train data with corresponding labels
val_data (pd.DataFrame): Validation data with corresponding labels
seed (int): Random seed for reproducibility
text_col (str, optional): Column to be used to extract input features. Defaults to 'text'.
label_column (str, optional): Label column in the data. Defaults to 'label'.
batch_size (int, optional): Batch size to use during training. Defaults to 64.
num_epochs (int, optional): Number of training epochs. Defaults to 3.
Returns:
dict: A dictionary containing F1 score.
"""
model = SetFitModel.from_pretrained(model_name)
train_dataset = Dataset.from_pandas(train_data)
args = TrainingArguments(
batch_size=batch_size,
num_epochs=num_epochs,
evaluation_strategy='no',
save_strategy="no",
loss=CosineSimilarityLoss,
load_best_model_at_end=False,
sampling_strategy="unique",
seed=seed
)
trainer = Trainer(
model=model,
args=args,
train_dataset=train_dataset,
eval_dataset=None,
metric=None
)
trainer.train()
y_pred = trainer.model.predict(val_data[text_col].values)
y_true = val_data[label_column].astype(int)
return fetch_performance_metrics(y_true, y_pred, 'setfit')
We draw 48 samples at random and label them for training. The batch size is set to 64 and the model is trained for 3 epochs.
Note - Here, the number of samples (48) is an approximation based on how SetFit has been shown to perform well with 8 samples per label. Since we have 6 labels, we'll need 8 x 6 = 48 samples. Rather than random sampling, you can employ sample selection strategies that offer better guarantees of the samples being representative of each label (a quick sketch follows). You can also explore generating (8 or more) new samples for each label.
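As a minimal sketch of one such strategy (not the sampling actually used in the benchmark runs below), stratified sampling with pandas guarantees exactly 8 samples per label:
# Stratified sampling sketch: draw exactly 8 samples per label instead of 48 samples at random
stratified_samples = train_data.groupby(label_col, group_keys=False).apply(lambda g: g.sample(n=8, random_state=43)).reset_index(drop=True)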
Now, let's run it.
model_name = "BAAI/bge-small-en-v1.5"
seed = 43
samples = train_data.sample(n=48, random_state=seed).reset_index(drop=True)
train_and_infer_setfit(model_name, samples, val_data, seed)
The F1 score is 0.448. We also observe that the predictions are biased towards 'sadness' or 'joy'. This isn't very surprising since the number of training samples is very small. Let's see how our LLM performs in a similar setting.
LLM : In-Context Learning
We start by mapping labels to text using the dictionary label_to_text which we had previously defined.
train_data['label_text'] = train_data['label'].map(label_to_text)
val_data['label_text'] = val_data['label'].map(label_to_text)
train_data.head()
| | text | label | label_text |
|---|---|---|---|
| 0 | i didnt feel humiliated | 0 | sadness |
| 1 | i can go from feeling so hopeless to so damned... | 0 | sadness |
| 2 | im grabbing a minute to post i feel greedy wrong | 3 | anger |
| 3 | i am ever feeling nostalgic about the fireplac... | 2 | love |
| 4 | i am feeling grouchy | 3 | anger |
We will need to log in to the Hugging Face Hub to be able to access the LLM. We do this via Hugging Face's notebook_login
from huggingface_hub import notebook_login
notebook_login()
Note - Additionally, you may need to navigate to the Mistral model card and "accept" the conditions to be able to access the model.
Defining the LLM helper functions
The LLM will need the input texts to be formatted as prompts. We define a function build_LLM_prompt that transforms the input text into the prompt format. We also define two helper functions, infer_LLM and _generate_predictions, to instantiate the LLM and run inference with the constructed input prompts.
from peft import AutoPeftModelForCausalLM
from tqdm import tqdm
from transformers import (AutoTokenizer, AutoModelForCausalLM,
BitsAndBytesConfig, pipeline)
from typing import Union
import datasets
def _generate_predictions(example: datasets.formatting.formatting.LazyBatch,
generator: pipeline, text_column: str,
max_new_tokens: int = 9, split_token: str ='[/EMOTION]') -> dict:
"""
Generates predictions using the text generation model for a given example.
Args:
example (datasets.formatting.formatting.LazyBatch): Batch of samples from a dataset.
generator (pipeline): Huggingface pipeline for text generation.
text_column (str): Name of the dataset column containing the prompts.
max_new_tokens (int, optional): Maximum number of tokens to generate. Defaults to 9.
split_token (str, optional): Token to demarcate the emotion prediction. Defaults to '[/EMOTION]'.
Returns:
dict: A dictionary containing the generated predictions.
"""
predictions = []
batch_results = generator(example[text_column], max_new_tokens=max_new_tokens, num_return_sequences=1)
predictions.extend([result[0]["generated_text"] for result in batch_results])
return {'prediction' : predictions}
def infer_LLM(model_name: str, input_ds: Dataset, batch_size: int = 4, max_new_tokens: int = 9,
text_column: str = 'emotion_prompt', finetuned_model_path: str = None) -> Dataset:
"""
Util function to run LLM inference
Args:
model_name (str): The name or path of the LLM model.
input_ds (Dataset): Input dataset containing text prompts.
batch_size (int, optional): Batch size for inference. Defaults to 4.
max_new_tokens (int, optional): Maximum number of tokens to generate. Defaults to 9.
text_column (str, optional): Name of the column containing text prompts. Defaults to 'emotion_prompt'.
finetuned_model_path (str, optional): Path to the fine-tuned model. Defaults to None.
Returns:
dataset: Dataset with generated predictions.
"""
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_quant_type="nf4",
bnb_4bit_use_double_quant=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
if finetuned_model_path is None:
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto",
quantization_config=quantization_config)
else:
model = AutoPeftModelForCausalLM.from_pretrained(finetuned_model_path,
device_map="auto",
quantization_config=quantization_config)
text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer,
batch_size=batch_size, truncation=False)
text_generator.tokenizer.pad_token_id = model.config.eos_token_id
input_ds = input_ds.map(_generate_predictions, fn_kwargs={'generator' : text_generator,
'text_column' : text_column,
'max_new_tokens' : max_new_tokens
},
batched=True, batch_size=batch_size)
return input_ds
def build_LLM_prompt(input_ds: Dataset, label_column: str = None, prompt_template: Union[str, None] = None,
with_label: bool = False) -> Dataset:
"""Util function to build the LLM prompt from input text data
Args:
input_ds (Dataset): Input dataset containing text
label_column (str, optional): Label column in the data. Applicable if constructing prompts for in-context samples / finetuning LLM. Defaults to None.
prompt_template (Union[str, None], optional): Text instruction to prepend to each transformed input text sample. Defaults to None.
with_label (bool, optional): `True` if the prompts should include labels from the `label_column`. Defaults to False.
Returns:
Dataset: Dataset with constructed prompts.
"""
if type(input_ds) == pd.DataFrame:
input_ds = Dataset.from_pandas(input_ds)
if with_label:
input_ds = input_ds.map(lambda x: {'emotion_prompt': '[UTTERANCE]' + x['text'] + '[/UTTERANCE]' + \
'[EMOTION]' + x[label_column] + '[/EMOTION]'})
else:
input_ds = input_ds.map(lambda x: {'emotion_prompt': prompt_template + '[UTTERANCE]' + x['text'] + '[/UTTERANCE]' + \
'[EMOTION]'})
return input_ds
Build the LLM prompt
First, we build the prompt for in-context learning using build_LLM_prompt. To ensure we have a reasonably fair comparison, the samples considered are the same ones used for SetFit in the previous experiment.
model_name = "mistralai/Mistral-7B-Instruct-v0.2"
sample_data = train_data.sample(n=48, random_state=seed).reset_index(drop=True)
emotion_list = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']
emotion_list_str = ', '.join(emotion_list)
transformed_sample_data = build_LLM_prompt(sample_data, with_label=True, label_column='label_text')
samples_str = '\n'.join(transformed_sample_data['emotion_prompt'])
prompt_template = "<s>[INST] You are a helpful, respectful and honest assistant. Choose one option that best describes the emotion behind the given utterance based on the following comma separated options: " + emotion_list_str + "[/INST] </s>"
Mapping LLM outputs to Emotion Predictions
LLMs generate free text that may not necessarily match the expected output format. We define an additional helper function _extract_label that parses the generated text and extracts the emotion. We also define run_llm, which acts as our entry point for LLM inference. It invokes build_LLM_prompt and infer_LLM to perform inference, and returns the computed performance metrics on the input validation data.
def _extract_label(sample: datasets.formatting.formatting.LazyRow, label_list: List[str]) -> dict:
"""Util function to extract the emotion from the generated LLM prediction
Args:
sample (datasets.formatting.formatting.LazyRow): Batch of samples from a dataset
label_list (List[str]): List of possible emotions
Returns:
dict: Dictionary of extracted predicted labels
"""
prompt_length = len(sample['emotion_prompt'])
generated_answer = sample['prediction'][prompt_length:].split('[/EMOTION]')[0].lower()
label_matched = False
predicted_label = None
for label in label_list:
if label in generated_answer:
predicted_label = label
label_matched = True
break
if not label_matched:
predicted_label = "no_match"
return {'predicted_label' : predicted_label}
def run_llm(val_data: pd.DataFrame, prompt_template: str, model_name: str, emotion_list: List[str], label_mapping: dict, label_column: str = 'label', batch_size: int = 4, finetuned_model_path: str = None,
num_labels: int = 6) -> dict:
"""Run end-to-end LLM inference (from pre-processing input data to post-processing the predictions) and return the computed performance metrics on input validation data
Args:
val_data (pd.DataFrame): Validation data with labels
prompt_template (str): Text instruction to prepend to each transformed input text sample.
model_name (str): The name or path of the pre-trained LLM.
emotion_list (List[str]): List of possible emotions
label_mapping (dict): Dictionary mapping to convert text labels to integers
label_column (str, optional): Label column in the data. Defaults to 'label'.
batch_size (int, optional): Batch size for inference. Defaults to 4.
finetuned_model_path (str, optional): Path to the fine-tuned model, if available. Defaults to None.
num_labels (int, optional): Number of unique labels. Defaults to 6.
Returns:
dict: A dictionary containing F1 score.
"""
predicted_label_list = []
val_ds = build_LLM_prompt(val_data, prompt_template=prompt_template)
val_ds_with_pred = infer_LLM(model_name, val_ds, batch_size, finetuned_model_path=finetuned_model_path)
predicted_label_list = val_ds_with_pred.map(_extract_label,
fn_kwargs={"label_list": emotion_list[:num_labels]})['predicted_label']
y_pred = [label_mapping[pred] if pred in label_mapping else num_labels for pred in predicted_label_list]
y_true = val_data[label_column].astype(int).values.tolist()
if num_labels not in y_pred:
# All LLM predictions match a valid emotion from `emotion_list`
emotion_list.remove('no_match')
return fetch_performance_metrics(y_true, y_pred, 'mistral_7b', label_list=emotion_list)
For cases where the LLM generates text with no matching emotion, _extract_label returns the string 'no_match'. If there are occurrences of 'no_match' in our final predictions, we treat it as an additional label so that our function fetch_performance_metrics can work seamlessly.
Putting it all to work
Ok. We are ready to run our LLM now.
text_to_label = {v: k for k, v in label_to_text.items()}
# Add 'no_match' to `emotion_list` to handle LLMs predicting labels outside of `emotion_list`
LLM_emotion_list = emotion_list + ['no_match']
run_llm(val_data, prompt_template, model_name, LLM_emotion_list, text_to_label, batch_size=64)
The F1 score is 0.442. Comparing results from the two experiments, the performance of LLM In-Context Learning is on par with our SetFit baseline. Notably, many samples end up in the 'no_match' bucket, which happens when the LLM's generated text doesn't match any label. We could explore better prompt construction to ensure that the LLM always predicts a valid label. Enhancements such as chain-of-thought and self-consistency based prompting can give better results but are relatively more expensive to run at scale. We can cover them in a future post if people are interested.
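As a rough, untested sketch of the chain-of-thought idea (the wording below is purely illustrative; it would also need a larger max_new_tokens and adjusted answer parsing), the instruction could ask the model to reason about the cues before committing to a label:
# Hypothetical chain-of-thought style instruction (illustrative only; not benchmarked in this article)
cot_prompt_template = ("<s>[INST] You are a helpful, respectful and honest assistant. "
"First, briefly describe the emotional cues in the given utterance. "
"Then choose exactly one option from the following comma separated options: " + emotion_list_str + ". "
"Finish your answer with the chosen option wrapped as [EMOTION]option[/EMOTION]. [/INST] </s>")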
On the baseline side, similarly, if we provide a few more labelled samples for the emotions where the SetFit model is underperforming, its results can potentially improve further. This could be much more feasible than the enhancements mentioned for the LLM.
Scenario #2 - Labelled Data Available But Noisy
Depending on the problem statement, a substantial number of labelled samples may be available for training Machine Learning models. As mentioned before, obtaining quality labelled data is expensive, so the available data is often weakly labelled via automated systems or some form of user interaction. Moreover, even if you invest in manually labelling the data, the labels can still be susceptible to noise and errors due to factors such as annotation ambiguity, human subjectivity, and fatigue. Such situations call for fine-tuning cost-effective models using the available resources.
Simulating Noisy Labels
To simulate our noisy scenario, we define a function add_asymmetric_noise
to construct a noisy version of the emotion detection data labels.
from pandas._typing import RandomState
def add_asymmetric_noise(
labels: pd.Series,
noise_prob: float,
random_state: Union[RandomState, None] = 42,
) -> Tuple[pd.Series, float]:
"""
Util function to add asymmetric noise to labels
for simulation of noisy label scenarios.
Args:
labels (pd.Series): Input pandas series with integer values
ranging from 0 to n - 1.
noise_prob (float): Probability of adding noise to each value.
random_state (Union[RandomState, None]): Random seed for reproducibility
Returns:
pd.Series: Series with asymmetric noise added to it.
float: Normalized quantification of pairwise disagreement between `labels` and `noisy_labels` for parity check
"""
# Set seed
np.random.seed(random_state)
# Avoid modifying the original data
noisy_labels = labels.copy()
# Build a replacement dictionary
unique_labels = list(set(noisy_labels))
replacement_dict = {
label: [candidate for candidate in unique_labels if candidate != label]
for label in unique_labels
}
# Determine the number of samples to modify based on the noise probability
num_samples = min(len(noisy_labels), int(len(noisy_labels) * noise_prob + 1))
# Sample random indices from the labels to introduce noise
target_indices = np.random.choice(len(noisy_labels), num_samples, replace=False)
for idx in target_indices:
# Introduce noise
noisy_labels[idx] = np.random.choice(replacement_dict[noisy_labels[idx]])
# Parity check
num_mismatches = sum(
[
label != noisy_label
for label, noisy_label in zip(labels.values, noisy_labels.values)
]
)
observed_noise_ratio = num_mismatches / len(noisy_labels)
return noisy_labels, observed_noise_ratio
We are going to assume 30% noise for the purpose of our experiments. In other words, 30% of the data will have incorrect labels. You can try running it with different noise levels by modifying noise_level in the following script as needed.
noise_level = 0.3
train_data['noisy_label'], observed_noise_ratio = add_asymmetric_noise(train_data['label'],
noise_prob=noise_level,
random_state=seed)
observed_noise_ratio
0.3000625
observed_noise_ratio is a parity check to ensure that the percentage of noise in the labelled samples is what we intended. You could also just check the number of samples where 'label' and 'noisy_label' mismatch.
len(train_data.loc[train_data['label'] != train_data['noisy_label']]) / len(train_data)
0.3000625
Looks good. Let's move on to defining our baseline.
Baseline
For text classification under the constraint that we have many labelled samples of which some labels are potentially incorrect, a commonly explored approach is to use state-of-the-art pre-trained embeddings to extract features from the text (such embeddings are pre-trained to produce robust representations), combined with a simple downstream classification model. For our experiments, we leverage BGE small ('BAAI/bge-small-en') embeddings for feature extraction. For classification, we keep it simple with Sklearn's Logistic Regression. First, we set up a training and inference script for the baseline.
def train_and_infer_LR_with_PTE(embeddings: str, train_data: pd.DataFrame, val_data: pd.DataFrame, seed: int, label_column: str = 'label') -> dict:
"""Function to train Logistic Regression model with pre-trained embeddings as features and return performance metrics on input validation data
Args:
embeddings (str): Path to embeddings to be used for training
train_data (pd.DataFrame): Train data with corresponding labels
val_data (pd.DataFrame): Validation data with corresponding labels
seed (int): Random seed for reproducibility
label_column (str): Label column in the data. Defaults to 'label'
Returns:
dict: A dictionary containing F1 score.
"""
embedding_model = SentenceTransformer(embeddings)
train_embeddings = embedding_model.encode(train_data['text'].values,
show_progress_bar=False)
y_train = train_data[label_column].astype(int)
clf = LogisticRegression(random_state=seed).fit(train_embeddings, y_train)
y_true = val_data['label'].astype(int)
val_embeddings = embedding_model.encode(val_data['text'].values, show_progress_bar=False)
y_pred = clf.predict(val_embeddings)
return fetch_performance_metrics(y_true, y_pred, 'LR_with_PTE')
We use the SentenceTransformer library to load the embeddings and encode the input data. Let's run our baseline and observe the results.
embedding_model = "BAAI/bge-small-en"
label_column = 'noisy_label'
train_and_infer_LR_with_PTE(embedding_model, train_data, val_data, seed, label_column)
The F1 score obtained is 0.641, which is better than what we observed with Few Shot Learning. However, performance on the emotions 'love' and 'surprise' appears poor. Will our LLM be able to beat this result? Let's find out.
LLM : Fine-tuning
With a relatively larger number of training samples available, we can explore fine-tuning our LLM. Traditional full fine-tuning of an LLM is infeasible on consumer hardware because of the large number of parameters. We will instead rely on PEFT (Parameter-Efficient Fine-Tuning) approaches, which freeze most of the pre-trained LLM's parameters and fine-tune a small number of additional parameters. Specifically, we will use LoRA (Low-Rank Adaptation) to finetune our LLM.
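As a minimal numerical sketch of the LoRA idea (a toy illustration, not the PEFT library internals): instead of updating a full weight matrix W, LoRA learns two small matrices B and A of rank r and uses W + (alpha / r) * B A, so only a tiny fraction of parameters is trainable.
import numpy as np
# Toy illustration of the LoRA update: W stays frozen, only the low-rank factors A and B are trained
d, k, r, lora_alpha = 4096, 4096, 64, 16
W = np.random.randn(d, k)          # frozen pre-trained weight matrix
A = np.random.randn(r, k) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))               # trainable low-rank factor, initialized to zero
W_effective = W + (lora_alpha / r) * (B @ A)
print(f"Trainable fraction: {(A.size + B.size) / W.size:.4f}")  # ~0.03 for these shapes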
from peft import get_peft_model, LoraConfig, PeftConfig, PeftModel, prepare_model_for_kbit_training
from tqdm import tqdm
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig, DataCollatorForLanguageModeling,
pipeline, Trainer, TrainingArguments
)
import bitsandbytes as bnb
import torch.nn as nn
As before, we will need to map the noisy labels to the corresponding text using the dictionary label_to_text.
model_name = "mistralai/Mistral-7B-Instruct-v0.2"
train_data['noisy_label_text'] = train_data['noisy_label'].map(label_to_text)
Transform Text into LLM Prompts
We transform the data into prompts that LLMs understand.
emotion_list = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']
emotion_list_str = ', '.join(emotion_list)
prompt_template = "<s>[INST] You are a helpful, respectful and honest assistant. Choose one option that best describes the emotion behind the given utterance based on the following comma separated options: " + emotion_list_str + "[/INST] </s>"
noisy_label_column = 'noisy_label_text'
train_ds = build_LLM_prompt(train_data, with_label=True, label_column=noisy_label_column)
train_ds = train_ds.map(lambda example, prompt_template=prompt_template : {'emotion_prompt' : prompt_template + example['emotion_prompt']})
Fine-tune LLM
Next, we define our LLM Fine-tuning function.
def tokenize(example: datasets.formatting.formatting.LazyRow, tokenizer: AutoTokenizer ) -> dict:
"""Util function to tokenize text data
Args:
example (datasets.formatting.formatting.LazyRow): Batch of samples containing text to tokenize.
tokenizer (AutoTokenizer): Tokenizer object used for tokenization.
Returns:
dict: Dictionary containing tokenized text.
"""
tokenized = tokenizer(
example['emotion_prompt'],
truncation=False
)
return {**tokenized}
def finetune_LLM(base_model_name: str, train_ds: Dataset, save_path: str, seed: int, batch_size: int = 64, num_epochs: int = 1):
"""Function to finetune an LLM on the given input training data
Args:
base_model_name (str): The name or path of the LLM model to be finetuned
train_ds (Dataset): Input dataset containing text prompts.
save_path (str): Path to save the trained model
seed (int): Random seed for reproducibility
batch_size (int, optional): Batch size to use during training. Defaults to 64.
num_epochs (int, optional): Number of training epochs. Defaults to 1.
"""
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=False,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.float16
)
model = AutoModelForCausalLM.from_pretrained(base_model_name,
quantization_config=bnb_config,
device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base_model_name, padding_side="left")
tokenizer.pad_token = tokenizer.eos_token
train_ds = train_ds.map(
tokenize,
batched=False,
fn_kwargs={"tokenizer": tokenizer},
)
model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(
lora_alpha=16,
lora_dropout=0.1,
r=64,
bias="none",
task_type="CAUSAL_LM",
)
args = TrainingArguments(
disable_tqdm=False,
output_dir=save_path,
warmup_steps=1,
per_device_train_batch_size=batch_size,
num_train_epochs=num_epochs,
learning_rate=2e-4,
fp16=True,
optim="paged_adamw_8bit",
logging_dir="./logs",
save_strategy="no",
evaluation_strategy="no",
report_to=None,
seed=seed
)
model = get_peft_model(model, peft_config)
model.config.use_cache = False
trainer = Trainer(
model=model,
train_dataset=train_ds.select_columns(['input_ids', 'attention_mask']),
eval_dataset=None,
args=args,
data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
trainer.model.save_pretrained(save_path)
return
To finetune our LLM, we use the following snippet
model_name = "mistralai/Mistral-7B-Instruct-v0.2"
finetuned_model_path = "finetuned-mistral-all_data"
finetune_LLM(model_name, train_ds, save_path=finetuned_model_path, seed=seed)
The newly finetuned model is stored in your working directory under the folder "finetuned-mistral-all_data".
Inference with Fine-tuned LLM
We run inference with this model using the function run_llm, as we did for the in-context learning experiment previously.
text_to_label = {v: k for k, v in label_to_text.items()}
LLM_emotion_list = emotion_list + ['no_match']
run_llm(val_data, prompt_template, model_name, LLM_emotion_list, text_to_label,
finetuned_model_path=finetuned_model_path, batch_size=64)
With an F1 score of 0.666, our finetuned LLM performs better than our baseline of BGE small + Logistic Regression (0.641). Perhaps the LLM is able to capture better representations than the pre-trained embeddings used in the baseline. More importantly, none of the predictions are 'no_match' anymore; in other words, our finetuned LLM predicts a valid emotion for every validation sample, unlike what happened during In-Context Learning. Also, open-source pre-trained embeddings more powerful than BGE small are available and can be explored to improve the baseline results.
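For instance, swapping in a larger open-source embedding model only requires changing the model path passed to the baseline function; 'BAAI/bge-large-en-v1.5' below is purely one possible example and is not benchmarked in this article.
# One possible stronger embedding for the baseline (example only; results not benchmarked here)
train_and_infer_LR_with_PTE("BAAI/bge-large-en-v1.5", train_data, val_data, seed, label_column='noisy_label')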
Better practical strategies to handle noisy data?
We finetuned models directly on all of the available noisy labelled samples. In practice, you'd additionally explore noise correction strategies applicable either at the model building stage (ranging from designing robust loss functions to exploring weaker forms of supervision) or at the data processing stage (label error detection / correction). Given that our goal is to benchmark LLM performance, let's consider a data level noise correction strategy and re-run the finetuning experiments to observe performance changes. We'll be using DQC-Toolkit, an open source library I'm currently building, to curate our noisy labelled samples.
from dqc import CrossValCurate
cvc = CrossValCurate(random_state=seed,
calibration_method='calibrate_using_baseline' )
train_data_modified = cvc.fit_transform(train_data, y_col_name='noisy_label')
DQC Toolkit offers CrossValCurate, which quantifies label correctness in the input labelled data using cross-validation techniques. The result (stored in train_data_modified) is a pandas dataframe similar to train_data with the following additional columns -
- 'label_correctness_score' represents a normalized score quantifying the correctness of 'noisy_label'.
- 'is_label_correct' is a boolean flag indicating whether the 'noisy_label' is to be considered correct (True) or incorrect (False).
- 'predicted_label' and 'prediction_probability' represent DQC Toolkit's predicted label for a given sample and the corresponding probability score.
For more details regarding the different hyperparameters available in CrossValCurate, please refer to the API documentation.
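Note that relabelling with 'predicted_label' (used in the experiments below) is only one way to consume these columns. A minimal alternative sketch is to drop the samples flagged as incorrect and keep the original labels of the rest:
# Alternative curation sketch (not the approach benchmarked below): keep only samples whose noisy label is flagged as correct
filtered_train_data = train_data_modified.loc[train_data_modified['is_label_correct']].reset_index(drop=True)
print(f"Retained {len(filtered_train_data)} of {len(train_data_modified)} samples")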
Baseline
Instead of 'noisy_label', we are going to leverage DQC Toolkit's 'predicted_label' as our target variable. Let's start by rerunning the BGE small + Logistic Regression baseline.
embeddings = "BAAI/bge-small-en"
label_column = 'predicted_label'
train_and_infer_LR_with_PTE(embeddings, train_data_modified, val_data, seed, label_column)
The F1 score without any noise correction was 0.641. With label correction, we observe a score of 0.664, an improvement! Let's check whether we see similar performance improvements with LLM finetuning.
LLM : Fine-tuning
We map the integer labels to text labels for LLM interpretability.
train_data_modified['predicted_label_text'] = train_data_modified['predicted_label'].map(label_to_text)
We transform our text to LLM prompts
emotion_list = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']
emotion_list_str = ', '.join(emotion_list)
prompt_template = "<s>[INST] You are a helpful, respectful and honest assistant. Choose one option that best describes the emotion behind the given utterance based on the following comma separated options: " + emotion_list_str + "[/INST] </s>"
label_column = 'predicted_label_text'
train_data_modified_ds = build_LLM_prompt(train_data_modified, with_label=True, label_column=label_column)
train_data_modified_ds = train_data_modified_ds.map(lambda example, prompt_template=prompt_template : {'emotion_prompt' : prompt_template + example['emotion_prompt']})
Now, we finetune our LLM on the data.
model_name = "mistralai/Mistral-7B-Instruct-v0.2"
finetuned_model_path = "finetuned-mistral-filtered_noisy_data"
finetune_LLM(model_name, train_data_modified_ds, save_path=finetuned_model_path, seed=seed)
Let's run the inference module and observe the performance changes.
text_to_label = {v: k for k, v in label_to_text.items()}
LLM_emotion_list = emotion_list + ['no_match']
run_llm(val_data, prompt_template, model_name, LLM_emotion_list, text_to_label,
finetuned_model_path=finetuned_model_path, batch_size=32)
From an F1 score of 0.666 to an F1 score of 0.726! This is a larger improvement than the one we observed with our baseline. This could be attributed to the fact that we are finetuning the LLM, as opposed to leveraging frozen pre-trained embeddings in the baseline.
[BONUS] The Ideal Scenario - (Clean) Labelled Data Available
We've run a bunch of experiments and compared the performance of different models for Emotion Prediction. Our best performing model so far is the LLM finetuned on noisy labelled data curated with DQC Toolkit. How well would this setup perform if it were trained on clean labelled data, without changing any other settings? Let's find out.
model_name = "mistralai/Mistral-7B-Instruct-v0.2"
emotion_list = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']
emotion_list_str = ', '.join(emotion_list)
prompt_template = "<s>[INST] You are a helpful, respectful and honest assistant. Choose one option that best describes the emotion behind the given utterance based on the following comma separated options: " + emotion_list_str + "[/INST] </s>"
label_column = 'label_text'
train_ds = build_LLM_prompt(train_data, with_label=True, label_column=label_column)
train_ds = train_ds.map(lambda example, prompt_template=prompt_template : {'emotion_prompt' : prompt_template + example['emotion_prompt']})
We invoke finetune_LLM to finetune the model.
model_name = "mistralai/Mistral-7B-Instruct-v0.2"
finetuned_model_path = "finetuned-mistral-ideal_data"
finetune_LLM(model_name, train_ds, save_path=finetuned_model_path, seed=seed)
And finally, we perform inference and compute metrics using run_llm
text_to_label = {v: k for k, v in label_to_text.items()}
LLM_emotion_list = emotion_list + ['no_match']
run_llm(val_data, prompt_template, model_name, LLM_emotion_list, text_to_label,
finetuned_model_path=finetuned_model_path, batch_size=32)
The F1 score is 0.761. This means there is still scope for improvement in our noisy-label strategies. There are further enhancements we could leverage to achieve better performance, but we leave those discussions to future posts, if people are interested.
Consolidating results
Phew! We've reached the end of the article. The following table summarizes the performance of the approaches for each scenario discussed -

| Scenario | Approach | Weighted F1 |
|---|---|---|
| #1 No Labelled Data Available | SetFit (Few Shot baseline) | 0.448 |
| #1 No Labelled Data Available | LLM In-Context Learning | 0.442 |
| #2 Noisy Labelled Data | BGE small + Logistic Regression | 0.641 |
| #2 Noisy Labelled Data | Fine-tuned LLM | 0.666 |
| #2 Noisy Labelled Data + DQC correction | BGE small + Logistic Regression | 0.664 |
| #2 Noisy Labelled Data + DQC correction | Fine-tuned LLM | 0.726 |
| [Bonus] Clean Labelled Data | Fine-tuned LLM | 0.761 |

In Scenario 1 (No Labelled Data Available), our baseline (SetFit) performed on par with the LLM (In-Context Learning). In Scenario 2 (Labelled Data Available But Noisy), our baseline (BGE small pre-trained embeddings combined with Logistic Regression) slightly underperformed compared to our fine-tuned LLM. When we applied label noise correction via DQC Toolkit, we observed a performance boost for both the baseline and the fine-tuned LLM.
Currently, DQC Toolkit supports text classification (binary / multi-class) problems with various parameter customization options. Check out the documentation for details. Following is the link to the repo. The plan is to enhance it further by adding more capabilities. Any form of feedback and support will be much appreciated!
sumanthprabhu / DQC-Toolkit
Data quality checks to curate noisy labels in the data
DQC Toolkit is a Python library and framework designed with the goal of facilitating improvement of Machine Learning models by identifying and mitigating label errors in the training dataset. Currently, DQC Toolkit offers CrossValCurate for curation of text classification datasets (binary / multi-class) using cross validation based selection.
Installation
pip install dqc-toolkit
Quick Start
Assuming your text classification data is stored as a pandas dataframe data, with each sample represented by the text column and its corresponding noisy label represented by the label column, here is how you use CrossValCurate -
from dqc import CrossValCurate
cvc = CrossValCurate()
data_curated = cvc.fit_transform(data[['text', 'label']])
The result stored in data_curated is a pandas dataframe similar to data with the following columns -
>>> data_curated.columns
['text', 'label', …
Thank you for reading
Passionate about Machine Learning? Please feel free to add me on Linkedin.