DEV Community: Sumanth Prabhu

Self-Training LLMs for Text Classification using DQC Toolkit

Sumanth Prabhu — Thu, 06 Jun 2024 06:37:00 +0000

Large language models (LLMs) have demonstrated exceptional language capabilities. In the context of Text Classification, if Labelled Data is unavailable, LLMs are commonly employed using In-Context Learning (ICL). With ICL, the LLM implicitly learns how to classify text by relying on a task instruction and (optionally) a few labelled examples relevant to the task. While this approach may appear flexible and powerful, it is often sensitive to the choice of prompt and of ICL examples, which can result in poor performance. In such scenarios, can we improve the performance of the LLM without manually labelling more data?

In this article, we will be talking about Self-Training LLMs for Text Classification. Self-Training is a semi-supervised learning approach which leverages a model’s own predictions on unlabelled data to build a labelled dataset for training of the model. Concretely, we will use the LLM to predict labels for unlabelled data to construct a training dataset and then fine-tune the LLM on the training data.

Intuitively, the main downside of Self-Training is its inability to correct its own mistakes. Typically, the most confident predictions of the model are the only samples considered to be included in the labelled dataset. However, “confidence” does not always imply “correctness”. Incorrectly labelled samples can end up amplifying the LLM errors.
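To make "keeping only the most confident predictions" concrete, here is a minimal sketch. The function name `select_confident`, the 0.9 threshold and the probability values are illustrative assumptions and are not part of the pipeline used later in this article.

```python
from typing import List, Tuple

def select_confident(probabilities: List[List[float]], threshold: float = 0.9) -> List[Tuple[int, int]]:
    """Return (sample_index, predicted_label) pairs whose top probability meets the threshold."""
    selected = []
    for index, probs in enumerate(probabilities):
        # The predicted label is the position of the largest probability
        label = max(range(len(probs)), key=lambda i: probs[i])
        if probs[label] >= threshold:
            selected.append((index, label))
    return selected

# Hypothetical class probabilities for three unlabelled samples
probabilities = [
    [0.95, 0.03, 0.02],  # confident: kept with label 0
    [0.40, 0.35, 0.25],  # uncertain: dropped
    [0.05, 0.92, 0.03],  # confident: kept with label 1
]
pseudo_labelled = select_confident(probabilities)
print(pseudo_labelled)  # [(0, 0), (2, 1)]
```

Note that a confidently wrong prediction passes this filter unchanged, which is exactly the weakness described above.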

To address this, we include a “Label Correction” step. We use DQC-Toolkit, a Python library that facilitates improvement of Machine Learning models by identifying and mitigating label errors in training datasets.

Preliminaries

Most of the code is based on our previous post. For the purposes of our experiment, we will be using Mistral-7B as our LLM. We will also extend the observations to Llama3–8B at the end of the article.

We begin by installing and loading the required dependencies. The code requires Python ≥ 3.9.



!pip install transformers
!pip install bitsandbytes
!pip install accelerate
!pip install huggingface_hub
!pip install peft
!pip install dqc-toolkit



from datasets import load_dataset, Dataset
from typing import List, Union

import numpy as np 
import pandas as pd
import torch
import transformers
import wandb
import warnings

transformers.logging.set_verbosity_error()
wandb.init(mode="disabled")
warnings.filterwarnings('ignore')

Dataset

We will be using emotion, a publicly available dataset hosted on Hugging Face. It consists of English-language tweets annotated with one of six emotions as shown below —

[‘sadness’, ‘joy’, ‘love’, ‘anger’, ‘fear’, ‘surprise’]

The dataset has 16,000 training samples and 2,000 validation samples.

We also extend the observations to the MTOP domain dataset towards the end of the article.



from datasets import load_dataset
import pandas as pd 

dataset = 'dair-ai/emotion'

dset = load_dataset(dataset, trust_remote_code=True)
train_data = pd.DataFrame(dset['train'])
val_data = pd.DataFrame(dset['validation'])
train_data.head()

Since LLMs cannot comprehend the emotion labels in integer format, we define a mapping of the integer labels to semantic text descriptions and create text labels for downstream consumption.



label_to_text = {0 : 'sadness', 
                 1 : 'joy', 
                 2 : 'love', 
                 3 : 'anger', 
                 4 : 'fear',
                 5 : 'surprise'}

train_data['label_text'] = train_data['label'].map(label_to_text)
val_data['label_text'] = val_data['label'].map(label_to_text)

Evaluation Metric

For the purpose of benchmarking our experiments, we choose Weighted F1 score as the metric. We also display the classification report and confusion matrix for detailed interpretation.



from sklearn.metrics import (classification_report, confusion_matrix,
                              ConfusionMatrixDisplay, f1_score)

import matplotlib.pyplot as plt

def fetch_performance_metrics(y_true: np.ndarray, y_pred: np.ndarray, exp_name: str,
                              display_report: bool = True, display_confusion_matrix: bool = True,
                             label_list: List[str] = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'],
                              num_labels: int = 6) -> dict:
    """
    Util function to compute F1 score and optionally display the classification report and confusion matrix for a given experiment.

    Args:
        y_true (np.ndarray): Array containing true labels.
        y_pred (np.ndarray): Array containing predicted labels.
        exp_name (str): Name of the experiment (used to save results).
        display_report (bool, optional): Boolean flag indicating whether to display classification report (True) or not (False). Defaults to True.
        display_confusion_matrix (bool, optional): Boolean flag indicating whether to display confusion matrix  (True) or not (False). Defaults to True.
        label_list (list, optional): List of labels. Defaults to ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'].
        num_labels (int, optional): Number of unique labels. Defaults to 6.

    Returns:
        dict: A dictionary containing F1 score.
    """ 
    if display_report:
        print('\nClassification Report:')

        print(classification_report(y_true=y_true, y_pred=y_pred, labels=list(range(num_labels)),
                                   target_names=label_list[:num_labels]))

    if display_confusion_matrix:
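        # Pass the label ids explicitly so the matrix shape always matches `label_list`
        cm = confusion_matrix(y_true=y_true, y_pred=y_pred, labels=list(range(len(label_list))))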
        fig, ax = plt.subplots(figsize=(8, 8))
        display = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=label_list)
        display.plot(ax=ax)
        plt.savefig(exp_name)

    return {'F1-score' : f1_score(y_true, y_pred, average='weighted')}
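As a worked example of what the weighted F1 score measures, the sketch below computes it by hand in plain Python: each class's F1 is weighted by its share of the true labels. The helper name `weighted_f1` and the toy labels are assumptions made for illustration; the id 6 stands in for a "no_match" prediction.

```python
from collections import Counter
from typing import List

def weighted_f1(y_true: List[int], y_pred: List[int]) -> float:
    """Average of per-class F1 scores, weighted by each class's support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denominator = 2 * tp + fp + fn
        f1 = 2 * tp / denominator if denominator else 0.0
        # Classes that never occur in y_true have zero support and add nothing
        score += support[label] / total * f1
    return score

y_true = [0, 0, 1, 1, 2]
y_pred = [0, 1, 1, 1, 6]  # the last prediction is a "no_match"
score = weighted_f1(y_true, y_pred)
print(round(score, 4))  # 0.5867
```

The "no_match" class itself carries zero weight, but every sample that lands there counts as a miss for its true class, which is how unparseable LLM outputs pull the score down.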

Alright! Let’s begin.

Baseline : LLM with ICL

We need to log in to the Hugging Face Hub to access the LLM. We do this via Hugging Face’s notebook_login.



from huggingface_hub import notebook_login

notebook_login()

Defining the LLM Preliminaries

We define a few LLM Utility functions as we did in the previous post.



from peft import AutoPeftModelForCausalLM
from tqdm import tqdm
from transformers import (AutoTokenizer, AutoModelForCausalLM,
                            BitsAndBytesConfig, pipeline)

import datasets

def _generate_predictions(example: datasets.formatting.formatting.LazyBatch, 
                          generator: pipeline, text_column: str, 
                          max_new_tokens: int = 9, split_token: str ='[/EMOTION]') -> dict:
    """
    Generates predictions using the text generation model for a given example.

    Args:
        example (datasets.formatting.formatting.LazyBatch): Batch of samples from a dataset.
        generator (pipeline): Huggingface pipeline for text generation.
        text_column (str): Name of the column containing the prompts for the text generation model.
        max_new_tokens (int, optional): Maximum number of tokens to generate. Defaults to 9.
        split_token (str, optional): Token to demarcate the emotion prediction. Defaults to '[/EMOTION]'.

    Returns:
        dict: A dictionary containing the generated predictions.
    """
    predictions = []
    batch_results = generator(example[text_column], max_new_tokens=max_new_tokens, num_return_sequences=1)
    predictions.extend([result[0]["generated_text"] for result in batch_results])
    return {'prediction' : predictions}


def infer_LLM(model_name: str, input_ds: Dataset, batch_size: int = 4, max_new_tokens: int = 9,
             text_column: str = 'emotion_prompt', finetuned_model_path: str = None) -> Dataset:
    """
    Util function to run LLM inference

    Args:
        model_name (str): The name or path of the LLM model.
        input_ds (Dataset): Input dataset containing text prompts.
        batch_size (int, optional): Batch size for inference. Defaults to 4.
        max_new_tokens (int, optional): Maximum number of tokens to generate. Defaults to 9.
        text_column (str, optional): Name of the column containing text prompts. Defaults to 'emotion_prompt'.
        finetuned_model_path (str, optional): Path to the fine-tuned model. Defaults to None.

    Returns:
        dataset: Dataset with generated predictions.
    """
    quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")

    if finetuned_model_path is None:
        model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto",
                                                quantization_config=quantization_config)
    else:
        model = AutoPeftModelForCausalLM.from_pretrained(finetuned_model_path,
                                                        device_map="auto",
                                                quantization_config=quantization_config)

    text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer,
                             batch_size=batch_size, truncation=False)
    text_generator.tokenizer.pad_token_id = model.config.eos_token_id

    input_ds = input_ds.map(_generate_predictions, fn_kwargs={'generator' : text_generator,
                                                              'text_column' : text_column,
                                                              'max_new_tokens' : max_new_tokens
                                                             },
                           batched=True, batch_size=batch_size)
    return input_ds

def build_LLM_prompt(input_ds: Dataset, label_column: str = None, prompt_template: Union[str, None] = None, 
                     with_label: bool = False) -> Dataset:
    """Util function to build the LLM prompt from input text data

    Args:
        input_ds (Dataset): Input dataset containing text
        label_column (str, optional): Label column in the data. Applicable if constructing prompts for in-context samples / finetuning LLM. Defaults to None.
        prompt_template (Union[str, None], optional): Text instruction to prepend to each transformed input text sample. Defaults to None.
        with_label (bool, optional): `True` if the prompts should include labels from the `label_column`. Defaults to False.

    Returns:
        Dataset: Dataset with an added 'emotion_prompt' column containing the constructed prompts.
    """
    if type(input_ds) == pd.DataFrame:
        input_ds = Dataset.from_pandas(input_ds)

    if with_label:

        input_ds = input_ds.map(lambda x: {'emotion_prompt': '[UTTERANCE]' + x['text'] + '[/UTTERANCE]' + \
                                          '[EMOTION]' + x[label_column] + '[/EMOTION]'})
    else:
        input_ds = input_ds.map(lambda x: {'emotion_prompt': prompt_template + '[UTTERANCE]' + x['text'] + '[/UTTERANCE]' + \
                                          '[EMOTION]'})

    return input_ds

def _extract_label(sample: datasets.formatting.formatting.LazyRow, label_list: List[str]) -> dict:
    """Util function to extract the emotion from the generated LLM prediction

    Args:
        sample (datasets.formatting.formatting.LazyRow): A single sample from a dataset
        label_list (List[str]): List of possible emotions

    Returns:
        dict: Dictionary of extracted predicted labels
    """
    prompt_length = len(sample['emotion_prompt'])
    generated_answer = sample['prediction'][prompt_length:].split('[/EMOTION]')[0].lower()

    label_matched = False
    predicted_label = None

    for label in label_list:
        if label in generated_answer:
            predicted_label = label
            label_matched = True
            break

    if not label_matched:
        predicted_label = "no_match"

    return {'predicted_label' : predicted_label}

def run_llm(val_data: pd.DataFrame, prompt_template: str, model_name: str, emotion_list: List[str], label_mapping: dict, 
            label_column: str = 'label', batch_size: int = 4, finetuned_model_path: str = None,
           num_labels: int = 6, compute_metrics: bool = True) -> Union[list, tuple]:
    """Run end-to-end LLM inference (from pre-processing input data to post-processing the predictions) and return the computed performance metrics on input validation data

    Args:
        val_data (pd.DataFrame): Validation data with labels
        prompt_template (str): Text instruction to prepend to each transformed input text sample.
        model_name (str): The name or path of the pre-trained LLM.
        emotion_list (List[str]): List of possible emotions 
        label_mapping (dict): Dictionary mapping to convert text labels to integers 
        label_column (str, optional): Label column in the data. Defaults to 'label'.
        batch_size (int, optional): Batch size for inference. Defaults to 4.
        finetuned_model_path (str, optional):  Path to the fine-tuned model, if available.. Defaults to None.
        num_labels (int, optional): Number of unique labels. Defaults to 6.
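        compute_metrics (bool, optional): Boolean flag indicating whether to compute the performance metrics (True) or not (False). Defaults to True.

    Returns:
        Union[list, tuple]: The predicted integer labels if `compute_metrics` is False; otherwise a tuple of the predicted labels and a dictionary containing the F1 score.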
    """
    predicted_label_list = []
    val_ds = build_LLM_prompt(val_data, prompt_template=prompt_template)
    val_ds_with_pred = infer_LLM(model_name, val_ds, batch_size, finetuned_model_path=finetuned_model_path)

    predicted_label_list = val_ds_with_pred.map(_extract_label, 
                                  fn_kwargs={"label_list": emotion_list[:num_labels]})['predicted_label'] 

    y_pred = [label_mapping[pred] if pred in label_mapping else num_labels for pred in predicted_label_list]
    y_true = val_data[label_column].astype(int).values.tolist()

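    # Work on a copy so the caller's list is not mutated across repeated calls
    label_list = list(emotion_list)
    if num_labels not in y_pred and 'no_match' in label_list:
        # All LLM predictions match a valid emotion from `emotion_list`
        label_list.remove('no_match')

    if compute_metrics:
        return y_pred, fetch_performance_metrics(y_true, y_pred, 'mistral_7b', label_list=label_list)

    return y_pred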

In summary -

build_LLM_prompt transforms the input text into an LLM prompt.
infer_LLM and _generate_predictions instantiate the LLM using 4-bit quantization and run inference with the constructed input prompts.
_extract_label maps the LLM free-text outputs to valid emotion predictions. If the generated text has no matching emotion, the predicted label is set to “no_match”.
run_llm invokes build_LLM_prompt and infer_LLM to perform inference and return the computed performance metrics on the input validation data.
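The post-processing step can be tried without loading a model. The sketch below mirrors the logic of _extract_label on hand-written strings; the standalone function `extract_label` and the two "generated" completions are made up for illustration.

```python
from typing import List

def extract_label(prompt: str, generated_text: str, label_list: List[str],
                  end_token: str = '[/EMOTION]') -> str:
    """Strip the prompt, cut at the end token and return the first label found."""
    answer = generated_text[len(prompt):].split(end_token)[0].lower()
    for label in label_list:
        if label in answer:
            return label
    return 'no_match'

labels = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']
prompt = '[UTTERANCE]i am feeling grouchy[/UTTERANCE][EMOTION]'

# Hypothetical completions: the pipeline returns the prompt followed by the new tokens
clean = extract_label(prompt, prompt + ' Anger[/EMOTION] more text', labels)
unparseable = extract_label(prompt, prompt + ' I cannot tell', labels)
print(clean, unparseable)  # anger no_match
```

Because the match is a substring search in list order, an answer that mentions several emotions resolves to whichever appears first in `label_list`, not first in the generated text.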

Build the LLM prompt

We select one sample at random for each label and build the prompt prefix to run ICL.



model_name = "mistralai/Mistral-7B-Instruct-v0.2"
seed = 43

sample_data = train_data.groupby('label_text').sample(n=1, random_state=seed).reset_index(drop=True)
emotion_list = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']
emotion_list_str = ', '.join(emotion_list)

transformed_sample_data = build_LLM_prompt(sample_data, with_label=True, label_column='label_text')
samples_str = '\n'.join(transformed_sample_data['emotion_prompt'])

# Append the in-context examples so that the prompt prefix actually contains them
prompt_template =  "<s>[INST] You are a helpful, respectful and honest assistant. Choose one option that best describes the emotion behind the given utterance based on the following comma separated options: " + emotion_list_str + "[/INST] </s>" + samples_str
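To see what the model actually receives, the sketch below reproduces the two string formats used by build_LLM_prompt without any dataset dependency. The helper names `render_labelled` and `render_query` and the shortened instruction are assumptions for illustration only.

```python
def render_labelled(text: str, label: str) -> str:
    """Format a labelled sample, as done when `with_label=True`."""
    return '[UTTERANCE]' + text + '[/UTTERANCE]' + '[EMOTION]' + label + '[/EMOTION]'

def render_query(text: str, prompt_prefix: str) -> str:
    """Format an unlabelled sample; the open [EMOTION] tag is left for the LLM to complete."""
    return prompt_prefix + '[UTTERANCE]' + text + '[/UTTERANCE]' + '[EMOTION]'

example = render_labelled('i am feeling grouchy', 'anger')
query = render_query('i didnt feel humiliated', 'Choose one emotion. ')  # hypothetical short prefix
print(example)
print(query)
```

Ending the query on an open [EMOTION] tag is what lets the post-processing step cut the completion at [/EMOTION].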

Putting it all to work

We are ready to run our LLM now.



text_to_label = {v: k for k, v in label_to_text.items()}
llm_emotion_list = emotion_list + ['no_match']

_, score = run_llm(val_data, prompt_template, model_name, llm_emotion_list, text_to_label,
        batch_size=64)
print(score)

The F1-score is 0.442, with a large proportion of the samples ending up in the “no_match” bucket. Can we do better than this? Let’s find out.

Our Approach : Self-Training using DQC Toolkit

Figure: Self-Training LLMs for Text Classification using DQC Toolkit

As shown in the figure, our proposed self-training approach consists of the following three steps -

Generate LLM Predictions for Unlabelled Data
Apply Label Correction using DQC Toolkit
Fine-tune LLM using Reliably Labelled Data

Step 1 — Generate LLM Predictions for Unlabelled Data

We leverage LLM with ICL to generate initial labels for training our model.



predictions = run_llm(train_data, prompt_template, model_name, llm_emotion_list, text_to_label,
                      batch_size=64, compute_metrics=False)

As mentioned before, many predictions can end up being mapped to “no_match” (when we are unable to extract the emotion prediction from the LLM’s generated answer). We remove such samples from the data.



train_data['llm_predicted_label'] = pd.Series(predictions)
## Only valid label predictions
train_data_with_llm_pred = train_data.loc[train_data['llm_predicted_label'] < len(emotion_list), ].reset_index(drop=True)
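Before moving on, it can be useful to check how much data survives this filter. The sketch below applies the same condition to a small made-up DataFrame (the texts and predictions are hypothetical) and reports the retained fraction.

```python
import pandas as pd

num_labels = 6  # ids 0..5 are valid emotions; id 6 stands for "no_match"

# Hypothetical LLM predictions for five unlabelled samples
toy = pd.DataFrame({
    'text': ['a', 'b', 'c', 'd', 'e'],
    'llm_predicted_label': [0, 6, 3, 6, 1],
})

is_valid = toy['llm_predicted_label'] < num_labels
kept = toy.loc[is_valid].reset_index(drop=True)
retention = is_valid.mean()  # fraction of samples with a usable pseudo-label

print(len(kept), retention)
print(kept['llm_predicted_label'].value_counts())  # class balance after filtering
```

If some emotion is rarely extracted successfully, it will be under-represented in the pseudo-labelled data, so the class balance is worth a look as well.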

Step 2 — Apply Label Correction using DQC Toolkit

Currently, DQC toolkit offers CrossValCurate for curation of text classification datasets (binary / multi-class) using cross validation based label prediction. We will leverage this module to acquire better quality labels for our data.



from dqc import CrossValCurate

cvc = CrossValCurate(random_state=seed, 
                     calibration_method='calibrate_using_baseline' )

train_data_curated = cvc.fit_transform(train_data_with_llm_pred, y_col_name='llm_predicted_label')

Here, CrossValCurate is given two parameters: random_state (random seed for reproducibility) and calibration_method (whether/how to calibrate the prediction probabilities of the model being trained for label correction). You can check out all the hyper-parameters available in the documentation here.

The returned object train_data_curated is a Pandas dataframe similar to the input dataframe train_data_with_llm_pred with the following additional columns -

‘label_correctness_score’ represents a normalized score quantifying the correctness of llm_predicted_label.
‘is_label_correct’ is a boolean flag indicating whether the llm_predicted_label is to be considered correct (True) or incorrect (False).
‘predicted_label’ and ‘prediction_probability’ represent DQC Toolkit’s predicted label for a given sample and the corresponding probability score.

We leverage is_label_correct to identify reliably labelled samples.



train_data_curated = train_data_curated.loc[train_data_curated['is_label_correct']].reset_index(drop=True)
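The general idea behind cross-validation-based label checking can be sketched with scikit-learn. This is not DQC Toolkit's implementation: it is a simplified illustration on made-up 2-D points, where a sample is flagged whenever a model trained on the other folds disagrees with its given label.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Two well-separated toy clusters (hypothetical data)
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.5], [0.2, 0.8],
              [5.0, 5.0], [5.0, 6.0], [6.0, 5.0], [6.0, 6.0], [5.5, 5.5], [5.2, 5.8]])
true_labels = np.array([0] * 6 + [1] * 6)

# Inject one label error: sample 0 belongs to class 0 but is labelled 1
noisy_labels = true_labels.copy()
noisy_labels[0] = 1

# Each sample is predicted by a model that never saw it during training
out_of_fold = cross_val_predict(LogisticRegression(), X, noisy_labels, cv=3)

is_label_correct = out_of_fold == noisy_labels
flagged = np.flatnonzero(~is_label_correct)
print(flagged)  # indices whose given label disagrees with the out-of-fold prediction
```

The key design point is that predictions are out-of-fold: a model cannot simply memorize a noisy label it was trained on.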

Step 3 — Fine-tune LLM using Reliably Labelled Data

We fine-tune the LLM using train_data_curated with llm_predicted_label as the target variable. First, we map the integer labels to text labels for LLM interpretability.



train_data_curated['llm_predicted_label_text'] = train_data_curated['llm_predicted_label'].map(label_to_text)

Next, we transform the data into instruction prompts for better performance.



prompt_template =  "<s>[INST] You are a helpful, respectful and honest assistant. Choose one option that best describes the emotion behind the given utterance based on the following comma separated options: " + emotion_list_str + "[/INST] </s>"
label_column = 'llm_predicted_label_text'

train_data_curated_ds = build_LLM_prompt(train_data_curated, with_label=True, label_column=label_column)
train_data_curated_ds = train_data_curated_ds.map(lambda example, prompt_template=prompt_template : {'emotion_prompt' : prompt_template + example['emotion_prompt']})

Then, we define the LLM fine-tuning function



from peft import get_peft_model, LoraConfig, PeftConfig, PeftModel, prepare_model_for_kbit_training
from tqdm import tqdm
from transformers import (AutoModelForCausalLM, AutoTokenizer, 
                          BitsAndBytesConfig, DataCollatorForLanguageModeling,
                          pipeline, Trainer, TrainingArguments 
                          )

import bitsandbytes as bnb
import torch.nn as nn

def tokenize(example: datasets.formatting.formatting.LazyRow, tokenizer: AutoTokenizer ) -> dict:
    """Util function to tokenize text data

    Args:
        example (datasets.formatting.formatting.LazyRow): Batch of samples containing text to tokenize.
        tokenizer (AutoTokenizer): Tokenizer object used for tokenization.

    Returns:
        dict: Dictionary containing tokenized text.
    """
    tokenized = tokenizer(
        example['emotion_prompt'],
        truncation=False
    )

    return {**tokenized}

def finetune_LLM(base_model_name: str, train_ds: Dataset,
              save_path: str, seed: int, batch_size: int = 64, num_epochs: int = 1):
    """Function to fine-tune an LLM on the given input training data

    Args:
        base_model_name (str): The name or path of the LLM model to be fine-tuned
        train_ds (Dataset): Input dataset containing text prompts.
        save_path (str): Path to save the trained model
        seed (int): Random seed for reproducibility
        batch_size (int, optional): Batch size to use during training. Defaults to 64.
        num_epochs (int, optional): Number of training epochs. Defaults to 1.
    """
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=False,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16
    )

    model = AutoModelForCausalLM.from_pretrained(base_model_name, 
                                                 quantization_config=bnb_config, 
                                                 device_map="auto")

    tokenizer = AutoTokenizer.from_pretrained(base_model_name, padding_side="left")
    tokenizer.pad_token = tokenizer.eos_token

    train_ds = train_ds.map(
        tokenize,
        batched=False,
        fn_kwargs={"tokenizer": tokenizer},
    )

    model = prepare_model_for_kbit_training(model)

    peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.1,
        r=64,
        bias="none",
        task_type="CAUSAL_LM",
    )

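    args = TrainingArguments(
            disable_tqdm=False,
            output_dir=save_path,
            warmup_steps=1,
            per_device_train_batch_size=batch_size,
            num_train_epochs=num_epochs,
            learning_rate=2e-4,
            fp16=True,
            optim="paged_adamw_8bit",
            logging_dir="./logs",
            save_strategy="no",
            # No evaluation is run: the evaluation strategy is left at its default ("no")
            seed=seed,            # use the seed passed to finetune_LLM
            report_to="none"      # the string "none" disables logging integrations
        )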
    model = get_peft_model(model, peft_config)
    model.config.use_cache = False

    trainer = Trainer(
        model=model,
        train_dataset=train_ds.select_columns(['input_ids', 'attention_mask']),
        eval_dataset=None,
        args=args,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )

    trainer.train()
    trainer.model.save_pretrained(save_path)

    return
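To get a feel for the LoRA settings above (r=64, lora_alpha=16), here is a small back-of-the-envelope calculation. It assumes a single 4096×4096 projection matrix, which matches the hidden size of Mistral-7B; which layers are actually adapted depends on the PEFT defaults for the model and is not shown here.

```python
def lora_trainable_params(d_out: int, d_in: int, r: int) -> int:
    """LoRA learns B (d_out x r) and A (r x d_in) in place of a full d_out x d_in update."""
    return r * (d_out + d_in)

d_out, d_in = 4096, 4096          # assumed shape of one adapted projection
r, lora_alpha = 64, 16            # values from the LoraConfig above

full_params = d_out * d_in
added_params = lora_trainable_params(d_out, d_in, r)
fraction = added_params / full_params
scaling = lora_alpha / r          # factor applied to the low-rank update

print(added_params, fraction, scaling)  # 524288 0.03125 0.25
```

Under these assumptions, each adapted matrix trains roughly 3% of the parameters a full update would need, while the 4-bit base weights stay frozen.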

Finally, we are ready to fine-tune the model. The number of training epochs is set to 1 and batch size is set to 64.



model_name = "mistralai/Mistral-7B-Instruct-v0.2"

finetuned_model_path = "selftrained-mistral-emotion"
finetune_LLM(model_name, train_data_curated_ds, save_path=finetuned_model_path, seed=seed)

The fine-tuned model is stored in your working directory under the folder ‘selftrained-mistral-emotion’.

Test the Self-Trained Model’s Performance

We run the inference with the fine-tuned model using the same function run_llm as we did for the ICL baseline.



text_to_label = {v: k for k, v in label_to_text.items()}
LLM_emotion_list = emotion_list + ['no_match']
_, score = run_llm(val_data, prompt_template, model_name, LLM_emotion_list, text_to_label,
       finetuned_model_path=finetuned_model_path, batch_size=64)
print(score)

There’s a 29.41% improvement in the F1-score (from 0.442 to 0.572). The number of “no_match” predictions has also reduced drastically. And we didn’t have to label any data manually!

The following plot summarizes our results visually —

Figure: Performance of Mistral 7B in Text Classification using Emotion dataset with Minimal Labelled Data

Further Experimental Validation

Additional LLM — To verify the reproducibility of our observations with Mistral-7B, we run experiments with Llama3–8B as well.

Additional Dataset — We also include the MTOP domain dataset where LLM ICL is known to perform well in general. This helps us understand if our approach is capable of achieving improvements when LLMs are already doing a reasonable job.

We re-run our experiments with the new LLM and dataset. The code for these experiments can be found here. Following are the results —

The first plot from the left shows Llama3–8B’s performance in Text Classification with the Emotion dataset using ICL. The observations are similar to those from the Mistral-7B experiment. The results with ICL are poor (F1-score of 0.365) and there is a 49.86% improvement in the F1-score after Self-Training using DQC Toolkit (F1-score of 0.547).

With MTOP Domain, both the LLMs perform well with ICL. As shown in the second and third plot, ICL with Mistral-7B and Llama3–8B achieve F1-scores of 0.9 and 0.88 respectively. Post Self-Training using DQC Toolkit, Mistral-7B scores 0.916 while Llama3–8B scores 0.938. Essentially, we observe a 1.78% improvement with Mistral-7B and a 6.59% improvement with Llama3–8B.
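The improvement percentages quoted in this article are relative changes in the F1-score. The short sketch below recomputes them from the reported before/after scores; the helper name `relative_improvement` is ours.

```python
def relative_improvement(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100

# (ICL F1-score, self-trained F1-score) as reported in the article
results = {
    'Mistral-7B / Emotion': (0.442, 0.572),
    'Llama3-8B / Emotion': (0.365, 0.547),
    'Mistral-7B / MTOP Domain': (0.9, 0.916),
    'Llama3-8B / MTOP Domain': (0.88, 0.938),
}

improvements = {name: round(relative_improvement(before, after), 2)
                for name, (before, after) in results.items()}
for name, value in improvements.items():
    print(f'{name}: {value}%')
```

Because the change is relative to the baseline, the same absolute gain looks much smaller on MTOP Domain, where ICL already starts from a high F1-score.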

In a Nutshell

We observe that Self-Training using DQC Toolkit improves the ICL performance of both Mistral-7B and Llama3–8B for both Emotion and MTOP Domain datasets in Text Classification.

Similarity to “Teacher-Student” Learning

Self Training can be considered a special case of “Teacher-Student” framework where the Teacher model is an LLM and the Student model is the same LLM. In practice, you would want to explore a Student model that is more cost effective when it comes to deployment. Similar to what we’ve seen in this article, we can bootstrap smaller models using LLM ICL predictions to achieve improved performance. We leave this discussion for future posts.
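As a rough illustration of the Teacher-Student idea, the sketch below trains a small scikit-learn classifier on pseudo-labels. Everything here is an assumption for illustration: the texts and "LLM" labels are made up, and TF-IDF with logistic regression is just one possible lightweight Student, not a setup evaluated in this article.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical unlabelled texts with labels predicted by the Teacher (the LLM)
texts = [
    'i feel so happy today',
    'what a wonderful and happy surprise',
    'i am delighted with the result',
    'i feel lonely and sad',
    'this is a sad and gloomy day',
    'i am heartbroken and miserable',
]
pseudo_labels = ['joy', 'joy', 'joy', 'sadness', 'sadness', 'sadness']

# The Student is a much cheaper model trained only on the Teacher's predictions
student = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
student.fit(texts, pseudo_labels)

prediction = student.predict(['i feel happy'])
print(prediction[0])
```

In practice the pseudo-labels would first go through the label correction step, since the Student inherits whatever mistakes the Teacher makes.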

Currently, DQC Toolkit supports text classification (binary/multi class) problems with various parameter customization options. The plan is to enhance it further by adding more capabilities. Any form of feedback / support will be much appreciated ! Following is the link to the repo.

sumanthprabhu / DQC-Toolkit

Quality Checks for Training Data in Machine Learning

DQC Toolkit is a Python library and framework designed with the goal to facilitate improvement of Machine Learning models by identifying and mitigating label errors in training dataset. Currently, DQC toolkit offers CrossValCurate and LLMCurate. CrossValCurate can be used for label error detection / correction in text classification (binary / multi-class) based on cross validation. LLMCurate extends PEDAL: Enhancing Greedy Decoding with Large Language Models using Diverse Exemplars to compute LLM-based confidence scores for free-text labels.

Installation

Installation of DQC-toolkit can be done as shown below

pip install dqc-toolkit

Quick Start

CrossValCurate

Assuming your text classification data is stored as a pandas dataframe data, with each sample represented by the text column and its corresponding noisy label represented by the label column, here is how you use CrossValCurate -

from dqc import CrossValCurate
cvc = CrossValCurate()
data_curated = cvc.fit_transform(data[['text'

…

View on GitHub

PS - If you found this helpful, it would be great if you could give the repo a shout out.

Thank you for reading

Passionate about Machine Learning? Please feel free to add me on LinkedIn.

Can LLMs Truly Understand Text-based Emotion ?

Sumanth Prabhu — Sun, 26 May 2024 15:00:54 +0000

Recently, Large Language Models (LLMs) have garnered attention for their impressive capabilities in the realm of Natural Language Understanding (NLU). Trained on diverse and extensive datasets, LLMs seem to generalize well across a wide range of tasks with little to no task-specific training data. In this article, we will examine one of the less frequently explored NLU tasks when it comes to LLMs — Emotion Prediction.

Despite notable advancements, Emotion Prediction continues to be an evolving area of research. It can help assess a model's true NLU capability as it demands detection of subtle emotional cues present in the text. We will mainly focus on assessing text-based Emotion Prediction using LLMs in two real world scenarios -

No Labelled Data Available
Labelled Data Available But Noisy

For the purposes of our experiment, we will be using mistralai/Mistral-7B-Instruct-v0.2 as our LLM. We will also build corresponding competitive baseline approaches to benchmark the LLM's performance.

For the remainder of the article, we will be using 'LLM' as a shorthand reference to 'mistralai/Mistral-7B-Instruct-v0.2'.

TL;DR

This is going to be a long post. The intent is to share code snippets for each step in all experiments covered in the article. If you want to skip directly to the results, you can find them under the section Consolidating Results.

Preliminaries

Before getting started, we need to install and load the required dependencies.

!pip install setfit
!pip install bitsandbytes
!pip install accelerate
!pip install huggingface_hub
!pip install peft
!pip install dqc-toolkit


from datasets import load_dataset, Dataset
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression
from typing import List, Tuple, Union

import numpy as np 
import os
import pandas as pd
import torch
import transformers
import wandb
import warnings

transformers.logging.set_verbosity_error()
wandb.init(mode="disabled")
warnings.filterwarnings('ignore')

Some of the installed libraries and imported modules will make sense as we proceed further. Also, we are setting the verbosity level to 'error'. This isn't necessary. We've done this to keep the notebook outputs relatively clean.

Dataset

For the purpose of all experiments, we will be using emotion, a publicly available dataset hosted on Hugging Face. It consists of English-language tweets annotated with one of six emotions as shown below -
['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']

The dataset has 16,000 training samples and 2,000 validation samples.

dataset = 'dair-ai/emotion'

dset = load_dataset(dataset, trust_remote_code=True)
train_data = pd.DataFrame(dset['train'])
val_data = pd.DataFrame(dset['validation'])
text_col, label_col = 'text', 'label'
train_data.head()

	text	label
0	i didnt feel humiliated	0
1	i can go from feeling so hopeless to so damned...	0
2	im grabbing a minute to post i feel greedy wrong	3
3	i am ever feeling nostalgic about the fireplac...	2
4	i am feeling grouchy	3

Transform Integer Labels to Text Labels

The labels in the dataset are integers. Since LLMs cannot comprehend the emotion labels in this format, we will need a mapping of the integer labels to semantic text descriptions.

label_to_text = {0 : 'sadness', 
                 1 : 'joy', 
                 2 : 'love', 
                 3 : 'anger', 
                 4 : 'fear',
                 5 : 'surprise'}

We will consume this dictionary downstream when we run our LLM.

Evaluation Metric

For the purpose of benchmarking our experiments, we choose Weighted F1 score as the metric. We also display the classification report and confusion matrix for detailed interpretation.

from sklearn.metrics import (classification_report, confusion_matrix, ConfusionMatrixDisplay, f1_score)
from typing import List

import matplotlib.pyplot as plt

def fetch_performance_metrics(y_true: np.ndarray, y_pred: np.ndarray, exp_name: str,
                              display_report: bool = True, display_confusion_matrix: bool = True,
                              label_list: List[str] = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'],
                              num_labels: int = 6) -> dict:
    """
    Util function to compute F1 score and optionally display the classification report and confusion matrix for a given experiment.

    Args:
        y_true (np.ndarray): Array containing true labels.
        y_pred (np.ndarray): Array containing predicted labels.
        exp_name (str): Name of the experiment (used to save results).
        display_report (bool, optional): Boolean flag indicating whether to display classification report (True) or not (False). Defaults to True.
        display_confusion_matrix (bool, optional): Boolean flag indicating whether to display confusion matrix  (True) or not (False). Defaults to True.
        label_list (list, optional): List of labels. Defaults to ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise'].
        num_labels (int, optional): Number of unique labels. Defaults to 6.

    Returns:
        dict: A dictionary containing F1 score.
    """ 
    if display_report:
        print('\nClassification Report:')

        print(classification_report(y_true=y_true, y_pred=y_pred, labels=list(range(num_labels)),target_names=label_list[:num_labels]))

    if display_confusion_matrix:
        print('\nConfusion Matrix:')
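        # Pin the label order so that the matrix always lines up with `label_list`
        cm = confusion_matrix(y_true=y_true, y_pred=y_pred, labels=list(range(len(label_list))))
        fig, ax = plt.subplots(figsize=(8, 8))
        display = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=label_list)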
        display.plot(ax=ax)
        plt.savefig(exp_name)

    return {'F1-score' : f1_score(y_true, y_pred, average='weighted')}

Alright! Now we are ready to begin.

Scenario #1 - No Labelled Data Available

Unavailability of labelled data is a common bottleneck in real-world machine learning. Constructing a sizeable set of labelled samples may not be possible for various reasons such as the cost of manual labelling, data privacy / regulations, etc. To benchmark the LLM in this scenario, we will explore In-Context Learning, which needs only a handful of labelled samples to generate predictions.

Baseline

When no labelled samples are available, Unsupervised Learning and Few Shot Learning are commonly employed solutions. We will consider Few Shot Learning since it is closer to the LLM's In-Context Learning. Specifically, we will use SetFit, a popular Few Shot Learning model by Hugging Face, as our baseline.

We set up the training and inference script.

from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, Trainer, TrainingArguments

def train_and_infer_setfit(model_name: str, train_data: pd.DataFrame, val_data: pd.DataFrame, seed: int,
                text_col: str = 'text', label_column: str = 'label', batch_size: int = 64, num_epochs: int = 3) -> dict:
    """Function to train Huggingface's Setfit model on input training data and return the computed performance metrics on input validation data 

    Args:
        model_name (str): Sentence Transformer model path to be used for training
        train_data (pd.DataFrame): Train data with corresponding labels 
        val_data (pd.DataFrame): Validation data with corresponding labels
        seed (int): Random seed for reproducibility
        text_col (str, optional): Column to be used to extract input features. Defaults to 'text'.
        label_column (str, optional): Label column in the data. Defaults to 'label'.
        batch_size (int, optional): Batch size to use during training. Defaults to 64.
        num_epochs (int, optional): Number of training epochs. Defaults to 3.

    Returns:
        dict: A dictionary containing F1 score.
    """
    model = SetFitModel.from_pretrained(model_name)
    train_dataset = Dataset.from_pandas(train_data)

    args = TrainingArguments(
    batch_size=batch_size,
    num_epochs=num_epochs,
    evaluation_strategy='no',
    save_strategy="no",
    loss=CosineSimilarityLoss,
    load_best_model_at_end=False,
    sampling_strategy="unique",
    seed=seed
    )

    trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=None,
    metric=None
    )

    trainer.train()

    y_pred = trainer.model.predict(val_data[text_col].values)
    y_true = val_data[label_column].astype(int)


    return fetch_performance_metrics(y_true, y_pred, 'setfit')

We draw 48 samples at random and label them for training. The batch size is set to 64 and the model is trained for 3 epochs.

Note - Here, the number of samples (48) is an approximation based on how SetFit has been shown to perform well with 8 samples per label. Since we have 6 labels, we'll need 8 × 6 = 48 samples. Rather than random sampling, you can employ sample selection strategies that better ensure the samples are representative of each label. You can also explore generating (8 or more) new samples for each label.
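One simple selection strategy is to draw a fixed number of samples per label instead of sampling uniformly at random. The sketch below is a dependency-free illustration that assumes a small pool of already labelled rows is available; the function name `sample_k_per_label` and the toy pool are hypothetical and not part of the original code.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List


def sample_k_per_label(rows: List[Dict], k: int, seed: int) -> List[Dict]:
    """Pick up to `k` rows for every label so that each label is represented."""
    rng = random.Random(seed)
    rows_by_label = defaultdict(list)
    for row in rows:
        rows_by_label[row["label"]].append(row)

    picked = []
    for label in sorted(rows_by_label):
        pool = rows_by_label[label]
        # If a label has fewer than `k` rows, take all of them
        picked.extend(rng.sample(pool, min(k, len(pool))))
    return picked


# Toy pool (made-up data): 120 rows spread evenly over 6 labels
toy_pool = [{"text": f"utterance {i}", "label": i % 6} for i in range(120)]
few_shot_rows = sample_k_per_label(toy_pool, k=8, seed=43)
print(len(few_shot_rows), Counter(row["label"] for row in few_shot_rows))
```

With purely random sampling, rare emotions such as 'surprise' can end up with very few examples, which is one plausible reason for the bias seen in few-shot results.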

Now, let's run it.

model_name = "BAAI/bge-small-en-v1.5"
seed = 43
samples = train_data.sample(n=48, random_state=seed).reset_index(drop=True)
train_and_infer_setfit(model_name, samples, val_data, seed)

The F1 score is 0.448. We also observe that the predictions are biased towards 'sadness' or 'joy'. This isn't very surprising since the number of training samples is very small. Let's see how our LLM performs in similar settings.

LLM : In-Context Learning

We start by mapping labels to text using the dictionary label_to_text which we had previously defined.

train_data['label_text'] = train_data['label'].map(label_to_text)
val_data['label_text'] = val_data['label'].map(label_to_text)
train_data.head()

	text	label	label_text
0	i didnt feel humiliated	0	sadness
1	i can go from feeling so hopeless to so damned...	0	sadness
2	im grabbing a minute to post i feel greedy wrong	3	anger
3	i am ever feeling nostalgic about the fireplac...	2	love
4	i am feeling grouchy	3	anger

We will need to log in to the Hugging Face Hub to be able to access the LLM. We do this via Hugging Face's notebook_login.

from huggingface_hub import notebook_login

notebook_login()

Note - Additionally, you may need to navigate to the Mistral model card and "accept" the conditions to be able to access the model.

Defining the LLM helper functions

The LLM needs the input texts to be formatted as prompts. We define a function build_LLM_prompt that transforms the input text into a prompt format. We also define two helper functions, infer_LLM and _generate_predictions, to instantiate the LLM and run inference with the constructed input prompts.

from peft import AutoPeftModelForCausalLM
from tqdm import tqdm
from transformers import (AutoTokenizer, AutoModelForCausalLM,
                            BitsAndBytesConfig, pipeline)
from typing import Union

import datasets

def _generate_predictions(example: datasets.formatting.formatting.LazyBatch, 
                          generator: pipeline, text_column: str, 
                          max_new_tokens: int = 9, split_token: str ='[/EMOTION]') -> dict:
    """
    Generates predictions using the text generation model for a given example.

    Args:
        example (datasets.formatting.formatting.LazyBatch): Batch of samples from a dataset.
        generator (pipeline): Huggingface pipeline for text generation.
        text_column (str): Name of the column containing the prompts for the text generation model.
        max_new_tokens (int, optional): Maximum number of tokens to generate. Defaults to 9.
        split_token (str, optional): Token to demarcate the emotion prediction. Defaults to '[/EMOTION]'.

    Returns:
        dict: A dictionary containing the generated predictions.
    """
    # The pipeline returns one list of generations per prompt; keep the first (and only) one
    batch_results = generator(example[text_column], max_new_tokens=max_new_tokens, num_return_sequences=1)
    predictions = [result[0]["generated_text"] for result in batch_results]
    return {'prediction' : predictions}


def infer_LLM(model_name: str, input_ds: Dataset, batch_size: int = 4, max_new_tokens: int = 9,
             text_column: str = 'emotion_prompt', finetuned_model_path: str = None) -> Dataset:
    """
    Util function to run LLM inference

    Args:
        model_name (str): The name or path of the LLM model.
        input_ds (Dataset): Input dataset containing text prompts.
        batch_size (int, optional): Batch size for inference. Defaults to 4.
        max_new_tokens (int, optional): Maximum number of tokens to generate. Defaults to 9.
        text_column (str, optional): Name of the column containing text prompts. Defaults to 'emotion_prompt'.
        finetuned_model_path (str, optional): Path to the fine-tuned model. Defaults to None.

    Returns:
        dataset: Dataset with generated predictions.
    """
    quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")

    if finetuned_model_path is None:
        model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto",
                                                quantization_config=quantization_config)
    else:
        model = AutoPeftModelForCausalLM.from_pretrained(finetuned_model_path,
                                                        device_map="auto",
                                                quantization_config=quantization_config)

    text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer,
                             batch_size=batch_size, truncation=False)
    text_generator.tokenizer.pad_token_id = model.config.eos_token_id

    input_ds = input_ds.map(_generate_predictions, fn_kwargs={'generator' : text_generator,
                                                              'text_column' : text_column,
                                                              'max_new_tokens' : max_new_tokens
                                                             },
                           batched=True, batch_size=batch_size)
    return input_ds

def build_LLM_prompt(input_ds: Dataset, label_column: str = None, prompt_template: Union[str, None] = None, 
                     with_label: bool = False) -> Dataset:
    """Util function to build the LLM prompt from input text data

    Args:
        input_ds (Dataset): Input dataset containing text
        label_column (str, optional): Label column in the data. Applicable if constructing prompts for in-context samples / finetuning LLM. Defaults to None.
        prompt_template (Union[str, None], optional): Text instruction to prepend to each transformed input text sample. Defaults to None.
        with_label (bool, optional): `True` if the prompts should include labels from the `label_column`. Defaults to False.

    Returns:
        Dataset: Dataset with an added 'emotion_prompt' column.
    """
    if isinstance(input_ds, pd.DataFrame):
        input_ds = Dataset.from_pandas(input_ds)

    if with_label:

        input_ds = input_ds.map(lambda x: {'emotion_prompt': '[UTTERANCE]' + x['text'] + '[/UTTERANCE]' + \
                                          '[EMOTION]' + x[label_column] + '[/EMOTION]'})
    else:
        input_ds = input_ds.map(lambda x: {'emotion_prompt': prompt_template + '[UTTERANCE]' + x['text'] + '[/UTTERANCE]' + \
                                          '[EMOTION]'})

    return input_ds

Build the LLM prompt

First, we build the prompt for in-context learning using build_LLM_prompt. To ensure we have a reasonably fair comparison, the samples considered are the same ones used for SetFit in the previous experiment.
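To make the prompt layout concrete, here is a small, dependency-free sketch of the tag format: labelled in-context examples carry a closed [EMOTION]...[/EMOTION] block, while the utterance to classify ends with an open [EMOTION] tag for the LLM to complete. The helper `format_example`, the shortened instruction and the final utterance are illustrative assumptions, not the exact strings used in the experiments.

```python
from typing import Optional


def format_example(text: str, label: Optional[str] = None) -> str:
    """Wrap an utterance in tags; close the emotion tag only when a label is given."""
    prompt = "[UTTERANCE]" + text + "[/UTTERANCE]" + "[EMOTION]"
    if label is not None:
        prompt += label + "[/EMOTION]"
    return prompt


# Shortened placeholder instruction (assumption); the article uses a longer one
instruction = "Choose one emotion from: sadness, joy, love, anger, fear, surprise\n"

# Labelled in-context examples, one per line
in_context_examples = "\n".join([
    format_example("i am feeling grouchy", "anger"),
    format_example("i didnt feel humiliated", "sadness"),
])

# The utterance to classify (made up) is left open so the LLM completes the emotion
query_prompt = instruction + in_context_examples + "\n" + format_example("i feel great today")
print(query_prompt)
```

Leaving the last [EMOTION] tag open is what lets the generated continuation be parsed by splitting on '[/EMOTION]'.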

model_name = "mistralai/Mistral-7B-Instruct-v0.2"

sample_data = train_data.sample(n=48, random_state=seed).reset_index(drop=True)
emotion_list = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']
emotion_list_str = ', '.join(emotion_list)

transformed_sample_data = build_LLM_prompt(sample_data, with_label=True, label_column='label_text')
samples_str = '\n'.join(transformed_sample_data['emotion_prompt'])

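# Instruction followed by the labelled in-context examples; the utterance to classify is appended later by build_LLM_prompt
prompt_template =  "<s>[INST] You are a helpful, respectful and honest assistant. Choose one option that best describes the emotion behind the given utterance based on the following comma separated options: " + emotion_list_str + "[/INST] </s>" + samples_str + "\n"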

Mapping LLM outputs to Emotion Predictions

LLMs generate free text that may not necessarily match the expected output format. We define an additional helper function _extract_label that parses the generated text and extracts the emotion.

We also define run_llm that will act as our entry point to run LLM inference. It will invoke build_LLM_prompt and infer_LLM to perform inference and return the computed performance metrics on input validation data.

def _extract_label(sample: datasets.formatting.formatting.LazyRow, label_list: List[str]) -> dict:
    """Util function to extract the emotion from the generated LLM prediction

    Args:
        sample (datasets.formatting.formatting.LazyRow): A single sample (row) from a dataset
        label_list (List[str]): List of possible emotions

    Returns:
        dict: Dictionary of extracted predicted labels
    """
    prompt_length = len(sample['emotion_prompt'])
    generated_answer = sample['prediction'][prompt_length:].split('[/EMOTION]')[0].lower()

    label_matched = False
    predicted_label = None

    for label in label_list:
        if label in generated_answer:
            predicted_label = label
            label_matched = True
            break

    if not label_matched:
        predicted_label = "no_match"

    return {'predicted_label' : predicted_label}

def run_llm(val_data: pd.DataFrame, prompt_template: str, model_name: str, emotion_list: List[str], label_mapping: dict, label_column: str = 'label', batch_size: int = 4, finetuned_model_path: str = None,
           num_labels: int = 6) -> dict:
    """Run end-to-end LLM inference (from pre-processing input data to post-processing the predictions) and return the computed performance metrics on input validation data

    Args:
        val_data (pd.DataFrame): Validation data with labels
        prompt_template (str): Text instruction to prepend to each transformed input text sample.
        model_name (str): The name or path of the pre-trained LLM.
        emotion_list (List[str]): List of possible emotions 
        label_mapping (dict): Dictionary mapping to convert text labels to integers 
        label_column (str, optional): Label column in the data. Defaults to 'label'.
        batch_size (int, optional): Batch size for inference. Defaults to 4.
        finetuned_model_path (str, optional): Path to the fine-tuned model, if available. Defaults to None.
        num_labels (int, optional): Number of unique labels. Defaults to 6.

    Returns:
        dict: A dictionary containing F1 score.
    """
    predicted_label_list = []
    val_ds = build_LLM_prompt(val_data, prompt_template=prompt_template)
    val_ds_with_pred = infer_LLM(model_name, val_ds, batch_size, finetuned_model_path=finetuned_model_path)

    predicted_label_list = val_ds_with_pred.map(_extract_label, 
                                  fn_kwargs={"label_list": emotion_list[:num_labels]})['predicted_label'] 

    y_pred = [label_mapping[pred] if pred in label_mapping else num_labels for pred in predicted_label_list]
    y_true = val_data[label_column].astype(int).values.tolist()

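    # Work on a copy so that the caller's list is not mutated
    label_list = list(emotion_list)
    if num_labels not in y_pred and 'no_match' in label_list:
        # All LLM predictions match a valid emotion, so the placeholder label is not needed
        label_list.remove('no_match')

    return fetch_performance_metrics(y_true, y_pred, 'mistral_7b', label_list=label_list)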

For cases where the LLM generates text that has no matching emotion, _extract_label returns the string 'no_match'. If there are occurrences of 'no_match' in our final predictions, we treat it as a new label so that our function fetch_performance_metrics can work seamlessly.
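The fallback behaviour can be illustrated in isolation. The following dependency-free sketch mirrors the parsing logic on plain strings: it strips the prompt, reads the text up to '[/EMOTION]', and maps anything that matches no known emotion to an extra integer id. The function names `extract_label` and `to_label_ids` and the two generated strings are assumptions made for illustration.

```python
from typing import Dict, List

NO_MATCH = "no_match"


def extract_label(generated_text: str, prompt: str, label_list: List[str]) -> str:
    """Return the first known emotion found in the generated continuation."""
    answer = generated_text[len(prompt):].split("[/EMOTION]")[0].lower()
    for label in label_list:
        if label in answer:
            return label
    return NO_MATCH


def to_label_ids(predicted: List[str], text_to_label: Dict[str, int], num_labels: int) -> List[int]:
    """Map text labels to integers; unknown labels get the extra id `num_labels`."""
    return [text_to_label.get(label, num_labels) for label in predicted]


emotions = ["sadness", "joy", "love", "anger", "fear", "surprise"]
text_to_label = {label: idx for idx, label in enumerate(emotions)}

prompt = "[UTTERANCE]i am feeling grouchy[/UTTERANCE][EMOTION]"
# Two made-up generations: one valid emotion, one word outside the label set
generations = [prompt + "Anger[/EMOTION]", prompt + "irritated[/EMOTION]"]

predicted = [extract_label(text, prompt, emotions) for text in generations]
label_ids = to_label_ids(predicted, text_to_label, num_labels=len(emotions))
print(predicted, label_ids)  # ['anger', 'no_match'] [3, 6]
```

Since the true labels never take the extra id, every 'no_match' prediction counts as an error for the true class of that sample.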

Putting it all to work

Ok. We are ready to run our LLM now.

text_to_label = {v: k for k, v in label_to_text.items()}

# Add 'no_match' to `emotion_list` to handle LLMs predicting labels outside of `emotion_list`   
LLM_emotion_list = emotion_list + ['no_match']
run_llm(val_data, prompt_template, model_name, LLM_emotion_list, text_to_label, batch_size=64)

The F1 score is 0.442. Comparing the results from the two experiments, the performance of LLM In-Context Learning appears similar to our SetFit baseline. Notably, many samples end up in the 'no_match' bucket, which happens when the LLM's predicted text doesn't match any label. We could explore better prompt construction to ensure that the LLM always predicts a valid label. Enhancements such as chain-of-thought and self-consistency based prompting can give better results but will be relatively more expensive to run at scale. We can cover them in a future post if people are interested.

Similarly, if we can provide a few more labelled samples for emotions where the setfit model is underperforming, then the results can potentially improve further. This could be much more feasible than the enhancements mentioned for LLMs.

Scenario #2 - Labelled Data Available But Noisy

Depending on the problem statement, a substantial number of labelled samples can occasionally be available for training Machine Learning models. As mentioned before, obtaining quality labelled data can be expensive. The labelled data that is available is potentially weakly labelled via automated systems or some form of user interaction. Besides, even if you invest in manually labelling the data, the labels obtained can still be susceptible to noise and errors due to multiple factors such as annotation ambiguity, human subjectivity, fatigue, etc. Such situations can demand fine-tuning cost effective models using the available resources.

Simulating Noisy Labels

To simulate our noisy scenario, we define a function add_asymmetric_noise to construct a noisy version of the emotion detection data labels.

from pandas._typing import RandomState
from typing import Tuple, Union

def add_asymmetric_noise(
    labels: pd.Series,
    noise_prob: float,
    random_state: Union[RandomState, None] = 42,
) -> Tuple[pd.Series, float]:
    """
    Util function to add asymmetric noise to labels
    for simulation of noisy label scenarios.

    Args:
        labels (pd.Series): Input pandas series with integer values
                        ranging from 0 to n - 1.
        noise_prob (float): Probability of adding noise to each value.
        random_state (Union[RandomState, None]): Random seed for reproducibility
    Returns:
        pd.Series: Series with asymmetric noise added to it.
        float: Normalized quantification of pairwise disagreement between `labels` and `noisy_labels` for parity check
    """
    # Set seed
    np.random.seed(random_state)

    # Avoid modifying the original data
    noisy_labels = labels.copy()

    # Build a replacement dictionary
    unique_labels = list(set(noisy_labels))
    replacement_dict = {
        label: [candidate for candidate in unique_labels if candidate != label]
        for label in unique_labels
    }

    # Determine the number of samples to modify based on the noise probability
    num_samples = min(len(noisy_labels), int(len(noisy_labels) * noise_prob + 1))

    # Sample random indices from the labels to introduce noise
    target_indices = np.random.choice(len(noisy_labels), num_samples, replace=False)

    for idx in target_indices:
        # Introduce noise
        # Positional indexing keeps this correct even if the Series index is not 0..n-1
        noisy_labels.iloc[idx] = np.random.choice(replacement_dict[noisy_labels.iloc[idx]])

    # Parity check
    num_mismatches = sum(
        [
            label != noisy_label
            for label, noisy_label in zip(labels.values, noisy_labels.values)
        ]
    )
    observed_noise_ratio = num_mismatches / len(noisy_labels)

    return noisy_labels, observed_noise_ratio

We are going to assume 30% noise for the purpose of our experiments. In other words, 30% of the data will have incorrect labels. You can try running it with different noise levels by modifying noise_level in the following script as needed.

noise_level = 0.3

train_data['noisy_label'], observed_noise_ratio = add_asymmetric_noise(train_data['label'], 
                                                  noise_prob=noise_level,
                                                  random_state=seed)
observed_noise_ratio

0.3000625

observed_noise_ratio is a parity check to ensure that the percentage of noise in the labelled samples is what we intended. You could also just check the fraction of samples where 'label' and 'noisy_label' mismatch.

len(train_data.loc[train_data['label'] != train_data['noisy_label']]) / len(train_data)

0.3000625
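The observed ratio is slightly above 0.3 because add_asymmetric_noise flips int(n * noise_prob + 1) labels, and every flipped label is guaranteed to differ from the original. A short sketch of that arithmetic follows; the helper name `expected_noise_ratio` is an assumption introduced for illustration.

```python
def expected_noise_ratio(num_samples_total: int, noise_prob: float) -> float:
    """Fraction of labels flipped when int(n * p + 1) samples are modified."""
    num_flipped = min(num_samples_total, int(num_samples_total * noise_prob + 1))
    return num_flipped / num_samples_total


# 16,000 training samples at 30% noise -> 4,801 flipped labels
print(expected_noise_ratio(16_000, 0.3))  # 0.3000625
```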

Looks good. Let's move on to defining our baseline.

Baseline

For text classification with many labelled samples, some of which are potentially mislabelled, a commonly explored approach is to use state-of-the-art pre-trained embeddings to extract features from the text (because such embeddings are trained to be robust). These features are then fed to a downstream classification model. For the purpose of the experiments, we leverage 'BGE-small-en-v1.5' embeddings for feature extraction. For classification, we keep it simple by using Sklearn's Logistic Regression. First, we set up a training and inference script for the baseline.

def train_and_infer_LR_with_PTE(embeddings: str, train_data: pd.DataFrame, val_data: pd.DataFrame, seed: int, label_column: str = 'label') -> dict:
    """Function to train Logistic Regression model with pre-trained embeddings as features and return performance metrics on input validation data 

    Args:
        embeddings (str): Path to embeddings to be used for training
        train_data (pd.DataFrame): Train data with corresponding labels
        val_data (pd.DataFrame): Validation data with corresponding labels
        seed (int): Random seed for reproducibility
        label_column (str): Label column in the data. Defaults to 'label'

    Returns:
        dict: A dictionary containing F1 score.
    """
    embedding_model = SentenceTransformer(embeddings)

    train_embeddings = embedding_model.encode(train_data['text'].values, 
                                                  show_progress_bar=False)

    y_train = train_data[label_column].astype(int)
    clf = LogisticRegression(random_state=seed).fit(train_embeddings, y_train)

    y_true = val_data['label'].astype(int)
    val_embeddings = embedding_model.encode(val_data['text'].values, show_progress_bar=False)
    y_pred = clf.predict(val_embeddings)

    return fetch_performance_metrics(y_true, y_pred, 'LR_with_PTE')

We use the SentenceTransformer library to load the embeddings and encode input data. Let's run our baseline and observe the results.

embedding_model = "BAAI/bge-small-en"
label_column = 'noisy_label'
train_and_infer_LR_with_PTE(embedding_model, train_data, val_data, seed, label_column)

The F1-score obtained is 0.641, which is better than what we observed for Few Shot Learning. Performance on the emotions 'love' and 'surprise' seems to be poor. Will our LLM be able to beat this result? Let's find out.

LLM : Fine-tuning

With a relatively larger number of training samples available, we can explore fine-tuning our LLM. Traditional full fine-tuning of an LLM is infeasible on consumer hardware because of the large number of parameters. We will rely on PEFT (parameter-efficient fine-tuning) approaches, which freeze most of the parameters of the pre-trained LLM and fine-tune a small number of additional model parameters. In particular, we will use LoRA to finetune our LLM.
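A rough sense of the savings comes from counting parameters. LoRA keeps a weight matrix of shape d_out × d_in frozen and learns two low-rank factors of shapes d_out × r and r × d_in, so the trainable parameters per adapted matrix number r × (d_in + d_out). The sketch below works this out for an illustrative 4096 × 4096 projection with r = 64; the matrix size is an assumption chosen for the example, and the function name is hypothetical.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters in the two low-rank factors added to one frozen weight matrix."""
    return rank * (d_in + d_out)


d_in = d_out = 4096  # illustrative layer size (assumption)
rank = 64            # LoRA rank

full_params = d_in * d_out
lora_params = lora_trainable_params(d_in, d_out, rank)

print(full_params)                # 16777216
print(lora_params)                # 524288
print(lora_params / full_params)  # 0.03125
```

For this example, only about 3% of the matrix's parameter count is trained, while the frozen base weights can stay quantized.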

from peft import get_peft_model, LoraConfig, PeftConfig, PeftModel, prepare_model_for_kbit_training
from tqdm import tqdm
from transformers import (AutoModelForCausalLM, AutoTokenizer, 
                          BitsAndBytesConfig, DataCollatorForLanguageModeling,
                          pipeline, Trainer, TrainingArguments 
                          )

import bitsandbytes as bnb
import torch.nn as nn

As before, we will need to map the noisy labels to the corresponding text using the dictionary label_to_text.

model_name = "mistralai/Mistral-7B-Instruct-v0.2"

train_data['noisy_label_text'] = train_data['noisy_label'].map(label_to_text)

Transform Text into LLM Prompts

We transform the data into prompts that LLMs understand.

emotion_list = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']
emotion_list_str = ', '.join(emotion_list)

prompt_template =  "<s>[INST] You are a helpful, respectful and honest assistant. Choose one option that best describes the emotion behind the given utterance based on the following comma separated options: " + emotion_list_str + "[/INST] </s>"
noisy_label_column = 'noisy_label_text'

train_ds = build_LLM_prompt(train_data, with_label=True, label_column=noisy_label_column)
train_ds = train_ds.map(lambda example, prompt_template=prompt_template : {'emotion_prompt' : prompt_template + example['emotion_prompt']})

Fine-tune LLM

Next, we define our LLM Fine-tuning function.

def tokenize(example: datasets.formatting.formatting.LazyRow, tokenizer: AutoTokenizer ) -> dict:
    """Util function to tokenize text data

    Args:
        example (datasets.formatting.formatting.LazyRow): Batch of samples containing text to tokenize.
        tokenizer (AutoTokenizer): Tokenizer object used for tokenization.

    Returns:
        dict: Dictionary containing tokenized text.
    """
    tokenized = tokenizer(
        example['emotion_prompt'],
        truncation=False
    )

    return {**tokenized}

def finetune_LLM(base_model_name: str, train_ds: Dataset, save_path: str, seed: int, batch_size: int = 64, num_epochs: int = 1):
    """Function to finetune an LLM on the given input training data

    Args:
        base_model_name (str): The name or path of the LLM model to be finetuned
        train_ds (Dataset): Input dataset containing text prompts.
        save_path (str): Path to save the trained model
        seed (int): Random seed for reproducibility
        batch_size (int, optional): Batch size to use during training. Defaults to 64.
        num_epochs (int, optional): Number of training epochs. Defaults to 1.
    """
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=False,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16
    )

    model = AutoModelForCausalLM.from_pretrained(base_model_name, 
                                                 quantization_config=bnb_config, 
                                                 device_map="auto")

    tokenizer = AutoTokenizer.from_pretrained(base_model_name, padding_side="left")
    tokenizer.pad_token = tokenizer.eos_token

    train_ds = train_ds.map(
        tokenize,
        batched=False,
        fn_kwargs={"tokenizer": tokenizer},
    )

    model = prepare_model_for_kbit_training(model)

    peft_config = LoraConfig(
        lora_alpha=16,
        lora_dropout=0.1,
        r=64,
        bias="none",
        task_type="CAUSAL_LM",
    )

    args = TrainingArguments(
            disable_tqdm=False,
            output_dir=save_path,
            warmup_steps=1,
            per_device_train_batch_size=batch_size,
            num_train_epochs=num_epochs,
            learning_rate=2e-4,
            fp16=True,
            optim="paged_adamw_8bit",
            logging_dir="./logs",
            save_strategy="no",
            evaluation_strategy="no",
            report_to="none",  # the string "none" disables all logging integrations
            seed=seed  # use the seed passed to this function for reproducibility
        )
    model = get_peft_model(model, peft_config)
    model.config.use_cache = False

    trainer = Trainer(
        model=model,
        train_dataset=train_ds.select_columns(['input_ids', 'attention_mask']),
        eval_dataset=None,
        args=args,
        data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    )

    trainer.train()
    trainer.model.save_pretrained(save_path)

    return

To finetune our LLM, we use the following snippet:

model_name = "mistralai/Mistral-7B-Instruct-v0.2"

finetuned_model_path = "finetuned-mistral-all_data"
finetune_LLM(model_name, train_ds, save_path=finetuned_model_path, seed=seed)

The newly finetuned model is stored in your working directory under the folder "finetuned-mistral-all_data".

Inference with Fine-tuned LLM

We run inference with this model using the function run_llm, as we did for the in-context learning experiment previously.

text_to_label = {v: k for k, v in label_to_text.items()}
LLM_emotion_list = emotion_list + ['no_match']
run_llm(val_data, prompt_template, model_name, LLM_emotion_list, text_to_label,
       finetuned_model_path=finetuned_model_path, batch_size=64)

Our finetuned LLM performs better than our baseline of BGE small + Logistic Regression. Perhaps the LLM is able to capture better representations than the pre-trained embeddings used in the baseline. More importantly, none of the predictions are 'no_match' anymore. In other words, our finetuned LLM predicts a valid emotion for every validation sample, unlike what happened during In-Context Learning. That said, open source pre-trained embeddings more powerful than BGE small are available and can be explored to improve the baseline results.

Better practical strategies to handle noisy data ?

We finetuned models directly on all of the available noisy labelled samples. In practice, you'd additionally explore noise correction strategies applicable either at the model building stage (ranging from designing robust loss functions to exploring weaker forms of supervision) or at the data processing stage (label error detection / correction). Given that our goal is to benchmark LLM performance, let's consider a data level noise correction strategy and re-run the finetuning experiments to observe performance changes. We'll be using DQC-Toolkit, an open source library I'm currently building, to curate our noisy labelled samples.

from dqc import CrossValCurate
cvc = CrossValCurate(random_state=seed, 
                     calibration_method='calibrate_using_baseline' )

train_data_modified = cvc.fit_transform(train_data, y_col_name='noisy_label')

DQC Toolkit offers CrossValCurate that quantifies label correctness in the input labelled data using cross validation techniques. The result (stored in train_data_modified) is a pandas dataframe similar to train_data with the following additional columns -

'label_correctness_score' represents a normalized score quantifying the correctness of 'noisy_label'.
'is_label_correct' is a boolean flag indicating whether the 'noisy_label' is to be considered correct (True) or incorrect (False).
'predicted_label' and 'prediction_probability' represent DQC Toolkit's predicted label for a given sample and the corresponding probability score.

For more details regarding different hyperparameters available in CrossValCurate, please refer to the API documentation.
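The general idea behind cross-validation based label checking can be sketched without any library: hold out a fold, fit a model on the remaining folds, and compare each held-out sample's given label with the model's out-of-fold prediction. The snippet below only illustrates that idea on toy 1-D data with a nearest-centroid classifier. It is not DQC Toolkit's implementation, and the names `nearest_centroid_predict` and `flag_suspect_labels` are hypothetical.

```python
from collections import defaultdict
from typing import List


def nearest_centroid_predict(train_x: List[float], train_y: List[int], x: float) -> int:
    """Predict the label whose mean feature value is closest to `x`."""
    sums, counts = defaultdict(float), defaultdict(int)
    for xi, yi in zip(train_x, train_y):
        sums[yi] += xi
        counts[yi] += 1
    centroids = {label: sums[label] / counts[label] for label in sums}
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def flag_suspect_labels(xs: List[float], ys: List[int], n_folds: int = 3) -> List[bool]:
    """Return True where the given label agrees with the out-of-fold prediction."""
    is_label_consistent = []
    for i, (x, y) in enumerate(zip(xs, ys)):
        fold = i % n_folds
        # Train only on samples from the other folds, so sample i is never seen
        train_x = [xj for j, xj in enumerate(xs) if j % n_folds != fold]
        train_y = [yj for j, yj in enumerate(ys) if j % n_folds != fold]
        is_label_consistent.append(nearest_centroid_predict(train_x, train_y, x) == y)
    return is_label_consistent


# Toy data (made up): two well separated clusters, with one label flipped on purpose
features = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 5.0, 5.1, 5.2, 5.3, 5.4, 5.5]
given_labels = [0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1]  # index 2 is mislabelled

flags = flag_suspect_labels(features, given_labels)
print([i for i, ok in enumerate(flags) if not ok])  # [2]
```

Because each sample is scored by a model that never saw it, a mislabelled sample cannot simply be memorised, which is what makes the disagreement informative.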

Baseline

Instead of 'noisy_label', we are going to leverage DQC Toolkit's 'predicted_label' as our target variable. Let's start by rerunning the BGE small + Logistic Regression baseline.

embeddings = "BAAI/bge-small-en"
label_column = 'predicted_label'
train_and_infer_LR_with_PTE(embeddings, train_data_modified, val_data, seed, label_column)

The F1 score without any noise correction was 0.641. With label correction, we observe a score of 0.664. There is an improvement in the F1 score ! Let's check if we observe similar performance improvements with LLM finetuning.

LLM : Finetuning

We map the integer labels to text labels for LLM interpretability.

train_data_modified['predicted_label_text'] = train_data_modified['predicted_label'].map(label_to_text)

We transform our text to LLM prompts

emotion_list = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']
emotion_list_str = ', '.join(emotion_list)

prompt_template =  "<s>[INST] You are a helpful, respectful and honest assistant. Choose one option that best describes the emotion behind the given utterance based on the following comma separated options: " + emotion_list_str + "[/INST] </s>"
label_column = 'predicted_label_text'

train_data_modified_ds = build_LLM_prompt(train_data_modified, with_label=True, label_column=label_column)
train_data_modified_ds = train_data_modified_ds.map(lambda example, prompt_template=prompt_template : {'emotion_prompt' : prompt_template + example['emotion_prompt']})
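To make the prompt construction concrete, the sketch below applies the same prefixing logic to a single plain dictionary instead of a dataset. The content of example['emotion_prompt'] is a hypothetical stand-in, since the actual text is produced by build_LLM_prompt (defined earlier in the article). The function name add_instruction is also an assumption.

```python
emotion_list = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']
emotion_list_str = ', '.join(emotion_list)

prompt_template = (
    "<s>[INST] You are a helpful, respectful and honest assistant. "
    "Choose one option that best describes the emotion behind the given "
    "utterance based on the following comma separated options: "
    + emotion_list_str + "[/INST] </s>"
)

# Hypothetical row; the real 'emotion_prompt' text comes from build_LLM_prompt
example = {'emotion_prompt': "Utterance: i feel great today. Emotion: joy"}


def add_instruction(example, prompt_template=prompt_template):
    """Prefix the task instruction to the per-sample prompt."""
    return {'emotion_prompt': prompt_template + example['emotion_prompt']}


updated = add_instruction(example)
print(updated['emotion_prompt'])
```

Binding prompt_template as a default argument, as the lambda in the article does, fixes the template at definition time, so later reassignments of the variable do not affect the mapping.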

Now, we finetune our LLM on the data.

model_name = "mistralai/Mistral-7B-Instruct-v0.2"

finetuned_model_path = "finetuned-mistral-filtered_noisy_data"
finetune_LLM(model_name, train_data_modified_ds, save_path=finetuned_model_path, seed=seed)

Let's run the inference module and observe the performance changes.

text_to_label = {v: k for k, v in label_to_text.items()}
LLM_emotion_list = emotion_list + ['no_match']
run_llm(val_data, prompt_template, model_name, LLM_emotion_list, text_to_label,
       finetuned_model_path=finetuned_model_path, batch_size=32)
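The extra 'no_match' entry presumably acts as a fallback for generations that do not correspond to any known emotion. The following is a rough sketch of how such a mapping could work. The helper parse_prediction, the prefix-matching rule, the -1 fallback id and the assumption that integer labels follow the order of emotion_list are all hypothetical, and run_llm may handle this differently.

```python
emotion_list = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']

# Assumption: integer labels follow the order of emotion_list (0 -> 'sadness', ...)
label_to_text = dict(enumerate(emotion_list))
text_to_label = {v: k for k, v in label_to_text.items()}


def parse_prediction(generated_text, allowed=emotion_list):
    """Hypothetical helper: map raw LLM output to a known emotion or 'no_match'."""
    cleaned = generated_text.strip().lower()
    for emotion in allowed:
        if cleaned.startswith(emotion):
            return emotion
    return 'no_match'


raw_outputs = [" Joy.", "anger", "I am not sure"]
parsed = [parse_prediction(text) for text in raw_outputs]

# Assumed convention: unmatched outputs get -1 so they always count as errors
predicted_ids = [text_to_label.get(emotion, -1) for emotion in parsed]
print(parsed, predicted_ids)
```

Counting unmatched generations as errors, rather than dropping them, keeps the evaluation honest: a model that rambles instead of answering is penalized for it.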

The F1 score improves from 0.666 to 0.726! This is a larger improvement than the one observed with our baseline, which could be attributed to the fact that we are finetuning the LLM as opposed to leveraging pre-trained embeddings as in the baseline.

[BONUS] The Ideal Scenario - (Clean) Labelled Data Available

We've run a bunch of experiments and were able to compare performances of different models for Emotion Prediction. Our best performing model is the finetuned LLM with noisy labelled data combined with DQC Toolkit for noise correction. How well would this finetuned LLM perform if it was trained with clean labelled data without changing any additional settings ? Let's find out.

model_name = "mistralai/Mistral-7B-Instruct-v0.2"

emotion_list = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']
emotion_list_str = ', '.join(emotion_list)

prompt_template =  "<s>[INST] You are a helpful, respectful and honest assistant. Choose one option that best describes the emotion behind the given utterance based on the following comma separated options: " + emotion_list_str + "[/INST] </s>"
label_column = 'label_text'

train_ds = build_LLM_prompt(train_data, with_label=True, label_column=label_column)
train_ds = train_ds.map(lambda example, prompt_template=prompt_template : {'emotion_prompt' : prompt_template + example['emotion_prompt']})

We invoke finetune_LLM to finetune the model

model_name = "mistralai/Mistral-7B-Instruct-v0.2"

finetuned_model_path = "finetuned-mistral-ideal_data"
finetune_LLM(model_name, train_ds, save_path=finetuned_model_path, seed=seed)

And finally, we perform inference and compute metrics using run_llm

text_to_label = {v: k for k, v in label_to_text.items()}
LLM_emotion_list = emotion_list + ['no_match']
run_llm(val_data, prompt_template, model_name, LLM_emotion_list, text_to_label,
       finetuned_model_path=finetuned_model_path, batch_size=32)

The F1 score is 0.761, compared with 0.726 for the LLM finetuned on the corrected noisy labels. This means there is still scope for improvement in our previous strategies. There are more enhancements that we can leverage to achieve better performance, but we leave those discussions to future posts, if people are interested.

Consolidating results

Phew ! We've reached the end of the article. The following plots summarize the performances of the approaches for each scenario discussed -

In Scenario 1 (No Labelled Data Available), our baseline (SetFit) performed on par with the LLM (In-Context Learning). In Scenario 2 (Labelled Data Available but Noisy), our baseline (BGE small pre-trained embeddings combined with Logistic Regression) slightly underperformed in comparison to our fine-tuned LLM. When we applied label noise correction via DQC Toolkit, we observed a performance boost in both the baseline and the fine-tuned LLM.
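For readers who prefer numbers, the short script below recomputes the gains from the F1 scores reported in the Scenario 2 and bonus sections above. The dictionary keys and the helper name gain are our own shorthand. Only the five scores themselves come from the experiments.

```python
# F1 scores as reported in the sections above
f1_scores = {
    "baseline_noisy": 0.641,      # BGE small + Logistic Regression, noisy labels
    "baseline_corrected": 0.664,  # same baseline, DQC Toolkit predicted labels
    "llm_noisy": 0.666,           # finetuned Mistral-7B, noisy labels
    "llm_corrected": 0.726,       # finetuned Mistral-7B, DQC Toolkit predicted labels
    "llm_clean": 0.761,           # finetuned Mistral-7B, clean labels
}


def gain(before, after):
    """Return (absolute gain, relative gain in percent)."""
    absolute = after - before
    return round(absolute, 3), round(100 * absolute / before, 1)


baseline_gain = gain(f1_scores["baseline_noisy"], f1_scores["baseline_corrected"])
llm_gain = gain(f1_scores["llm_noisy"], f1_scores["llm_corrected"])
remaining_gap = round(f1_scores["llm_clean"] - f1_scores["llm_corrected"], 3)

print("Baseline gain from label correction:", baseline_gain)
print("LLM gain from label correction:", llm_gain)
print("Gap between corrected and clean labels:", remaining_gap)
```

Label correction recovers a sizeable part of the distance between the noisy and clean settings for the LLM, but a gap to the clean-label score remains.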

Currently, DQC Toolkit supports text classification (binary/multi class) problems with various parameter customization options. Check out the documentation for details. Following is the link to the repo. The plan is to enhance it further by adding more capabilities. Any form of feedback and support will be much appreciated !
sumanthprabhu / DQC-Toolkit

Data quality checks to curate noisy labels in the data
DQC Toolkit is a Python library and framework designed with the goal to facilitate improvement of Machine Learning models by identifying and mitigating label errors in training dataset. Currently, DQC toolkit offers CrossValCurate for curation of text classification datasets (binary / multi-class) using cross validation based selection.

Installation

Installation of DQC-toolkit can be done as shown below
pip install dqc-toolkit
Quick Start

Assuming your text classification data is stored as a pandas dataframe data, with each sample represented by the text column and its corresponding noisy label represented by the label column, here is how you use CrossValCurate -
from dqc import CrossValCurate

cvc = CrossValCurate()
data_curated = cvc.fit_transform(data[['text', 'label']])
The result is stored in data_curated, a pandas dataframe similar to data with the following columns -
>>> data_curated.columns
['text', 'label', 
…

Thank you for reading

Passionate about Machine Learning? Please feel free to add me on LinkedIn.