Over the past few weeks, I've been diving deep into fine-tuning Large Language Models (LLMs) for various applications, with a particular focus on creating detailed summaries from multiple sources. Throughout this process, I've used tools like Weights & Biases to track learning rates and other training metrics. However, a crucial question remains: how can we objectively determine whether our fine-tuned model is producing high-quality outputs?
While human evaluation is always an option, it's time-consuming and often impractical for large-scale assessments. This is where ROUGE (Recall-Oriented Understudy for Gisting Evaluation) comes into play, offering a quantitative approach to evaluating our fine-tuned models.
ROUGE calculates the overlap between the generated text and the reference text(s) using various methods. The most common ROUGE metrics include:
ROUGE-N: Measures the overlap of n-grams (contiguous sequences of n words) between the generated and reference texts.
ROUGE-L: Computes the longest common subsequence (LCS) between the generated and reference texts.
ROUGE-W: A weighted version of ROUGE-L that gives more importance to consecutive matches.
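To make the n-gram overlap idea concrete, here is a minimal sketch of ROUGE-N recall in plain Python. It is only an illustration, not the official ROUGE implementation: it tokenizes on whitespace, does no stemming, and skips the precision/F-measure side that the real metric also reports. The example sentences are made up.

# Toy ROUGE-N recall: the fraction of reference n-grams that also appear in the candidate.
from collections import Counter

def ngrams(text, n):
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n):
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())        # clipped n-gram matches
    return overlap / max(sum(ref.values()), 1)  # divide by reference n-gram count

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge_n_recall(candidate, reference, 1))  # ROUGE-1 recall: 5/6 ≈ 0.83
print(rouge_n_recall(candidate, reference, 2))  # ROUGE-2 recall: 3/5 = 0.6

ROUGE-L works differently: instead of fixed-size n-grams it finds the longest common subsequence, so it rewards words that appear in the same order even when they are not contiguous.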
What do the different metrics mean?
ROUGE-1: Overlap of unigrams (single words)
ROUGE-2: Overlap of bigrams (two-word sequences)
ROUGE-L: Longest common subsequence
ROUGE-L sum: Variant of ROUGE-L computed over the entire summary
Each metric is typically reported as a score between 0 and 1, where higher values indicate better performance.
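In practice you rarely compute these by hand. The Hugging Face evaluate library returns all four of the metrics above in one call. A minimal sketch, assuming the evaluate and rouge_score packages are installed and using placeholder strings:

# pip install evaluate rouge_score
from evaluate import load

rouge = load("rouge")
scores = rouge.compute(
    predictions=["a generated summary of the document"],      # model output(s)
    references=["a human written summary of the document"],   # reference summary(ies)
)
# By default each value is an F-measure between 0 and 1, aggregated over the examples passed in
print(scores)  # {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}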
Here is some code you can use for evaluating the models. Note that in this example I load both models onto a single GPU and compare their outputs in the same run.
import multiprocessing
import time

from evaluate import load
from transformers import AutoModelForCausalLM, AutoTokenizer

# Fill in the Hugging Face repo IDs (or local paths) of the base and fine-tuned models
original_model_name = ""
finetuned_model = ""

def load_model(model_name: str, device: str):
    # Load the model in 8-bit (requires bitsandbytes) and pin it to the given device
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        return_dict=True,
        load_in_8bit=True,
        device_map={"": device},
        trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    return model, tokenizer

def inference(model, tokenizer, prompt: str):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # temperature only takes effect if do_sample=True is also passed; this uses greedy decoding
    outputs = model.generate(**inputs, max_new_tokens=200, temperature=1.0)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

def process_task(task_queue, result_queue):
    # Both models share cuda:0, so they are compared under identical conditions
    original_model, original_tokenizer = load_model(original_model_name, device="cuda:0")
    fine_tuned_model, fine_tuned_tokenizer = load_model(finetuned_model, device="cuda:0")
    rouge = load("rouge")
    while True:
        task = task_queue.get()
        if task is None:  # sentinel value: no more work
            break
        prompt, reference = task
        start = time.time()
        original_summary = inference(original_model, original_tokenizer, prompt)
        fine_tuned_summary = inference(fine_tuned_model, fine_tuned_tokenizer, prompt)
        print(f"Completed inference in {time.time() - start:.2f}s")
        original_scores = rouge.compute(predictions=[original_summary], references=[reference])
        fine_tuned_scores = rouge.compute(predictions=[fine_tuned_summary], references=[reference])
        result_queue.put((original_scores, fine_tuned_scores))

def main():
    task_queue = multiprocessing.Queue()
    result_queue = multiprocessing.Queue()
    prompt = "Your prompt here"
    reference = "Your reference summary here"

    # The worker process owns the GPU; the main process only feeds tasks and collects scores
    process = multiprocessing.Process(target=process_task, args=(task_queue, result_queue))
    process.start()

    start = time.time()
    # Run 3 times
    for _ in range(3):
        task_queue.put((prompt, reference))
    results = []
    for _ in range(3):
        result = result_queue.get()
        results.append(result)

    # Signal the worker process to terminate
    task_queue.put(None)
    process.join()
    end = time.time()
    print(f"Total time: {end - start:.2f}s")

    # Print ROUGE scores
    for i, (original_scores, fine_tuned_scores) in enumerate(results):
        print(f"Run {i + 1}:")
        print("Original model scores:")
        print(original_scores)
        print("Fine-tuned model scores:")
        print(fine_tuned_scores)
        print()

if __name__ == "__main__":
    multiprocessing.set_start_method("spawn")  # required for CUDA in subprocesses
    main()
To illustrate the effectiveness of ROUGE in evaluating fine-tuned models, here are the results from a recent run:
Fine-tuned model:
rouge1: 0.1775
rouge2: 0.0271
rougeL: 0.1148
rougeLsum: 0.1148
Original model:
rouge1: 0.0780
rouge2: 0.0228
rougeL: 0.0543
rougeLsum: 0.0598
As we can see, the fine-tuned model outperforms the original across all four ROUGE metrics, with the largest gains in ROUGE-1 and ROUGE-L. This quantitative assessment provides concrete evidence that the fine-tuning process has indeed enhanced the model's summarization capabilities.
While ROUGE scores shouldn't be the sole criterion for evaluating LLMs, they offer a valuable, objective measure of improvement. By incorporating ROUGE into our evaluation pipeline, we can more efficiently iterate on our fine-tuning process and confidently assess the quality of our models.
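One practical way to fold ROUGE into that pipeline is to average the per-run score dictionaries collected by the script above, so each model boils down to a single number per metric that you can track between fine-tuning iterations. A small sketch, assuming a results list with the same (original_scores, fine_tuned_scores) shape as in main():

# Average ROUGE scores across runs so each model is summarised by one number per metric
def average_scores(score_dicts):
    keys = score_dicts[0].keys()
    return {k: sum(d[k] for d in score_dicts) / len(score_dicts) for k in keys}

original_avg = average_scores([orig for orig, _ in results])
fine_tuned_avg = average_scores([ft for _, ft in results])
print("Original (mean):", original_avg)
print("Fine-tuned (mean):", fine_tuned_avg)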