Shannon Lal

Is your fine-tuned LLM any good?

Over the past few weeks, I've been diving deep into fine-tuning Large Language Models (LLMs) for various applications, with a particular focus on creating detailed summaries from multiple sources. Throughout this process, we've leveraged tools like Weights & Biases to track training metrics such as the learning rate and optimize our results. However, a crucial question remains: how can we objectively determine whether our fine-tuned model is producing high-quality outputs?
While human evaluation is always an option, it's time-consuming and often impractical for large-scale assessments. This is where ROUGE (Recall-Oriented Understudy for Gisting Evaluation) comes into play, offering a quantitative approach to evaluating our fine-tuned models.

ROUGE calculates the overlap between the generated text and the reference text(s) using various methods; a small worked example follows the list below. The most common ROUGE metrics include:

ROUGE-N: Measures the overlap of n-grams (contiguous sequences of n words) between the generated and reference texts.
ROUGE-L: Computes the longest common subsequence (LCS) between the generated and reference texts.
ROUGE-W: A weighted version of ROUGE-L that gives more importance to consecutive matches.
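
To make ROUGE-N concrete, here is a minimal sketch of the idea in plain Python. This is only an illustration of n-gram overlap (the official ROUGE implementation adds stemming and other details), and the function names are my own:

from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(candidate: str, reference: str, n: int = 1):
    # Count the clipped n-gram overlap between candidate and reference
    cand_counts = Counter(ngrams(candidate.lower().split(), n))
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    overlap = sum((cand_counts & ref_counts).values())
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(cand_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

print(rouge_n("the cat sat on the mat", "the cat lay on the mat", n=1))
# 5 of the 6 unigrams overlap, so precision, recall, and F1 are all ~0.83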

What do the different metrics mean?

ROUGE-1: Overlap of unigrams (single words)
ROUGE-2: Overlap of bigrams (two-word sequences)
ROUGE-L: Longest common subsequence
ROUGE-Lsum: A variant of ROUGE-L computed over the entire summary

Each metric is typically reported as a score between 0 and 1, where higher values indicate better performance.
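
Before wiring ROUGE into a full evaluation loop, it helps to see the raw output. Here is a minimal sketch using the Hugging Face evaluate package (the same one the script below relies on; loading the rouge metric also requires the rouge_score package). The prediction and reference strings are just placeholders:

from evaluate import load

rouge = load("rouge")
scores = rouge.compute(
    predictions=["the model generated this short summary"],
    references=["a human wrote this reference summary"],
)
print(scores)  # dict with rouge1, rouge2, rougeL, and rougeLsum, each between 0 and 1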

Here is some code you can use for evaluating the models. Note that in this example I load both the original and the fine-tuned model onto a single GPU and compare their results at the same time.

import multiprocessing
import time
from transformers import AutoModelForCausalLM, AutoTokenizer
from evaluate import load

original_model_name =""
finetuned_model = ""

def load_model(model_name: str, device: str):
    # Load in 8-bit so both models can fit on a single GPU
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        return_dict=True,
        load_in_8bit=True,
        device_map={"": device},
        trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    return model, tokenizer

def inference(model, tokenizer, prompt: str):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Note: with the default do_sample=False, generation is greedy and the
    # temperature argument has no effect; pass do_sample=True to actually sample
    outputs = model.generate(**inputs, max_new_tokens=200, temperature=1.0)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

def process_task(task_queue, result_queue):
    # Load both models once in the worker process, then score prompts from the queue
    original_model, original_tokenizer = load_model(original_model_name, device="cuda:0")
    fine_tuned_model, fine_tuned_tokenizer = load_model(finetuned_model_name, device="cuda:0")

    rouge = load('rouge')

    while True:
        task = task_queue.get()
        if task is None:
            break
        prompt, reference = task

        start = time.time()
        original_summary = inference(original_model, original_tokenizer, prompt)
        fine_tuned_summary = inference(fine_tuned_model, fine_tuned_tokenizer, prompt)
        print(f"Completed inference in {time.time() - start}")

        original_scores = rouge.compute(predictions=[original_summary], references=[reference])
        fine_tuned_scores = rouge.compute(predictions=[fine_tuned_summary], references=[reference])

        result_queue.put((original_scores, fine_tuned_scores))

def main():
    task_queue = multiprocessing.Queue()
    result_queue = multiprocessing.Queue()

    prompt = "Your prompt here"
    reference = "Your reference summary here"

    process = multiprocessing.Process(target=process_task, args=(task_queue, result_queue))
    process.start()

    start = time.time()

    # Run 3 times
    for _ in range(3):
        task_queue.put((prompt, reference))

    results = []
    for _ in range(3):
        result = result_queue.get()
        results.append(result)

    # Signal the process to terminate
    task_queue.put(None)
    process.join()

    end = time.time()

    print(f"Total time: {end - start}")

    # Print ROUGE scores
    for i, (original_scores, fine_tuned_scores) in enumerate(results):
        print(f"Run {i+1}:")
        print("Original model scores:")
        print(original_scores)
        print("Fine-tuned model scores:")
        print(fine_tuned_scores)
        print()

if __name__ == "__main__":
    # "spawn" avoids CUDA initialization issues in the worker process
    multiprocessing.set_start_method("spawn")
    main()
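
Since the script runs the same prompt three times, you may also want a single averaged number per metric. Here is a small sketch, assuming the results list built in main() above, where each entry is an (original_scores, fine_tuned_scores) pair of score dictionaries:

def average_scores(results, index):
    # index 0 = original model scores, index 1 = fine-tuned model scores
    keys = results[0][index].keys()
    return {k: sum(run[index][k] for run in results) / len(results) for k in keys}

print("Average original model scores:", average_scores(results, 0))
print("Average fine-tuned model scores:", average_scores(results, 1))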

To illustrate the effectiveness of ROUGE in evaluating fine-tuned models, here are the results from a recent run:
Fine-tuned model:

rouge1: 0.1775
rouge2: 0.0271
rougeL: 0.1148
rougeLsum: 0.1148

Original model:

rouge1: 0.0780
rouge2: 0.0228
rougeL: 0.0543
rougeLsum: 0.0598

As we can see, the fine-tuned model shows significant improvements across all ROUGE metrics compared to the original model. This quantitative assessment provides concrete evidence that our fine-tuning process has indeed enhanced the model's summarization capabilities.
While ROUGE scores shouldn't be the sole criterion for evaluating LLMs, they offer a valuable, objective measure of improvement. By incorporating ROUGE into our evaluation pipeline, we can more efficiently iterate on our fine-tuning process and confidently assess the quality of our models.
