Shannon Lal

Is your fine-tuned LLM any good?

Over the past few weeks, I've been diving deep into fine-tuning Large Language Models (LLMs) for various applications, with a particular focus on creating detailed summaries from multiple sources. Throughout this process, we've leveraged tools like Weights & Biases to track our learning rates and optimize our results. However, a crucial question remains: how can we objectively determine whether our fine-tuned model is producing high-quality outputs?
While human evaluation is always an option, it's time-consuming and often impractical for large-scale assessments. This is where ROUGE (Recall-Oriented Understudy for Gisting Evaluation) comes into play, offering a quantitative approach to evaluating our fine-tuned models.

ROUGE calculates the overlap between the generated text and the reference text(s) using various methods. The most common ROUGE metrics include:

ROUGE-N: Measures the overlap of n-grams (contiguous sequences of n words) between the generated and reference texts.
ROUGE-L: Computes the longest common subsequence (LCS) between the generated and reference texts.
ROUGE-W: A weighted version of ROUGE-L that gives more importance to consecutive matches.
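
To make the n-gram counting concrete, here is a minimal hand-rolled sketch of ROUGE-N recall (overlapping n-grams divided by the total n-grams in the reference). It is only an illustration of the idea; the real ROUGE implementations also report precision and F-measure and handle details like stemming:

from collections import Counter

def ngrams(tokens, n):
    # All contiguous n-word sequences in a token list
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n_recall(candidate: str, reference: str, n: int) -> float:
    cand_counts = Counter(ngrams(candidate.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    # Clipped overlap: an n-gram only counts as often as it appears in the reference
    overlap = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    return overlap / max(sum(ref_counts.values()), 1)

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge_n_recall(candidate, reference, 1))  # 5 of 6 unigrams match -> ~0.83
print(rouge_n_recall(candidate, reference, 2))  # 3 of 5 bigrams match -> 0.6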

What do the different metrics mean?

ROUGE-1: Overlap of unigrams (single words)
ROUGE-2: Overlap of bigrams (two-word sequences)
ROUGE-L: Longest common subsequence
ROUGE-L sum: Variant of ROUGE-L computed over the entire summary

Each metric is typically reported as a score between 0 and 1, where higher values indicate better performance.
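
In practice you don't compute these by hand. The Hugging Face evaluate library (which in turn needs the rouge_score package installed) returns all four metrics in a single call; a quick sketch of what that looks like:

from evaluate import load

rouge = load("rouge")
scores = rouge.compute(
    predictions=["the cat lay on the mat"],
    references=["the cat sat on the mat"],
)
# scores is a dict with rouge1, rouge2, rougeL and rougeLsum, each between 0 and 1
print(scores)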

Here is some code you can use for evaluating the models. Note that in this example I load both the original and the fine-tuned model onto a single GPU and compare their outputs on the same prompts.

import multiprocessing
import time
from transformers import AutoModelForCausalLM, AutoTokenizer
from evaluate import load

# Hugging Face Hub ids (or local paths) of the base and fine-tuned checkpoints
original_model_name = ""
finetuned_model_name = ""

def load_model(model_name: str, device: str):
    # load_in_8bit requires the bitsandbytes package and a CUDA-capable GPU
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        return_dict=True,
        load_in_8bit=True,
        device_map={"": device},
        trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    return model, tokenizer

def inference(model, tokenizer, prompt: str):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # do_sample=True is needed for temperature to have any effect
    outputs = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=1.0)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

def process_task(task_queue, result_queue):
    # Load both models onto the same GPU once and reuse them for every task
    original_model, original_tokenizer = load_model(original_model_name, device="cuda:0")
    fine_tuned_model, fine_tuned_tokenizer = load_model(finetuned_model_name, device="cuda:0")

    rouge = load('rouge')

    while True:
        task = task_queue.get()
        if task is None:  # None is the shutdown signal
            break
        prompt, reference = task

        # Generate a summary with each model, then score both against the reference
        start = time.time()
        original_summary = inference(original_model, original_tokenizer, prompt)
        fine_tuned_summary = inference(fine_tuned_model, fine_tuned_tokenizer, prompt)
        print(f"Completed inference in {time.time() - start:.2f}s")

        original_scores = rouge.compute(predictions=[original_summary], references=[reference])
        fine_tuned_scores = rouge.compute(predictions=[fine_tuned_summary], references=[reference])

        result_queue.put((original_scores, fine_tuned_scores))

def main():
    task_queue = multiprocessing.Queue()
    result_queue = multiprocessing.Queue()

    prompt = "Your prompt here"
    reference = "Your reference summary here"

    process = multiprocessing.Process(target=process_task, args=(task_queue, result_queue))
    process.start()

    start = time.time()

    # Run 3 times
    for _ in range(3):
        task_queue.put((prompt, reference))

    results = []
    for _ in range(3):
        result = result_queue.get()
        results.append(result)

    # Signal the process to terminate
    task_queue.put(None)
    process.join()

    end = time.time()

    print(f"Total time: {end - start}")

    # Print ROUGE scores
    for i, (original_scores, fine_tuned_scores) in enumerate(results):
        print(f"Run {i+1}:")
        print("Original model scores:")
        print(original_scores)
        print("Fine-tuned model scores:")
        print(fine_tuned_scores)
        print()

if __name__ == "__main__":
    multiprocessing.set_start_method("spawn")
    main()

To illustrate the effectiveness of ROUGE in evaluating fine-tuned models, here are the results from a recent run:
Fine-tuned model:

rouge1: 0.1775
rouge2: 0.0271
rougeL: 0.1148
rougeLsum: 0.1148

Original model:

rouge1: 0.0780
rouge2: 0.0228
rougeL: 0.0543
rougeLsum: 0.0598

As we can see, the fine-tuned model improves on the original across all ROUGE metrics, with the largest gains in ROUGE-1 and ROUGE-L. This quantitative assessment provides concrete evidence that the fine-tuning process has enhanced the model's summarization capabilities.
While ROUGE scores shouldn't be the sole criterion for evaluating LLMs, they offer a valuable, objective measure of improvement. By incorporating ROUGE into our evaluation pipeline, we can more efficiently iterate on our fine-tuning process and confidently assess the quality of our models.
