Over the past few weeks, I've been diving deep into fine-tuning Large Language Models (LLMs) for various applications, with a particular focus on creating detailed summaries from multiple sources. Throughout this process, I've used tools like Weights & Biases to track learning rates and other training metrics. However, a crucial question remains: how can we objectively determine whether our fine-tuned model is producing high-quality outputs?
While human evaluation is always an option, it's time-consuming and often impractical for large-scale assessments. This is where ROUGE (Recall-Oriented Understudy for Gisting Evaluation) comes into play, offering a quantitative approach to evaluating our fine-tuned models.
ROUGE calculates the overlap between the generated text and the reference text(s) using various methods. The most common ROUGE metrics include:
ROUGE-N: Measures the overlap of n-grams (contiguous sequences of n words) between the generated and reference texts.
ROUGE-L: Computes the longest common subsequence (LCS) between the generated and reference texts.
ROUGE-W: A weighted version of ROUGE-L that gives more importance to consecutive matches.
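To make the n-gram overlap idea concrete, here is a minimal sketch of ROUGE-N recall in plain Python. It is only an illustration, not the official ROUGE implementation: it tokenizes on whitespace, does no stemming, and skips the precision/F-measure side that the real metric also reports. The example sentences are made up.

# Toy ROUGE-N recall: the fraction of reference n-grams that also appear in the candidate.
from collections import Counter

def ngrams(text, n):
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, reference, n):
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())        # clipped n-gram matches
    return overlap / max(sum(ref.values()), 1)  # divide by reference n-gram count

reference = "the cat sat on the mat"
candidate = "the cat lay on the mat"
print(rouge_n_recall(candidate, reference, 1))  # ROUGE-1 recall: 5/6 ≈ 0.83
print(rouge_n_recall(candidate, reference, 2))  # ROUGE-2 recall: 3/5 = 0.6

ROUGE-L works differently: instead of fixed-size n-grams it finds the longest common subsequence, so it rewards words that appear in the same order even when they are not contiguous.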
What do the different metrics mean?
ROUGE-1: Overlap of unigrams (single words)
ROUGE-2: Overlap of bigrams (two-word sequences)
ROUGE-L: Longest common subsequence
ROUGE-L sum: Variant of ROUGE-L computed over the entire summary
Each metric is typically reported as a score between 0 and 1, where higher values indicate better performance.
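In practice you rarely compute these by hand. The Hugging Face evaluate library returns all four of the metrics above in one call. A minimal sketch, assuming the evaluate and rouge_score packages are installed and using placeholder strings:

# pip install evaluate rouge_score
from evaluate import load

rouge = load("rouge")
scores = rouge.compute(
    predictions=["a generated summary of the document"],      # model output(s)
    references=["a human written summary of the document"],   # reference summary(ies)
)
# By default each value is an F-measure between 0 and 1, aggregated over the examples passed in
print(scores)  # {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}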
Here is some code you can use for evaluating the models. Note that in this example I load both models onto a single GPU and compare their outputs in the same run.
import multiprocessing
import time

from evaluate import load
from transformers import AutoModelForCausalLM, AutoTokenizer

# Fill in the Hugging Face repo IDs (or local paths) of the base and fine-tuned models
original_model_name = ""
finetuned_model = ""

def load_model(model_name: str, device: str):
    # Load the model in 8-bit (requires bitsandbytes) and pin it to the given device
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        return_dict=True,
        load_in_8bit=True,
        device_map={"": device},
        trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    return model, tokenizer

def inference(model, tokenizer, prompt: str):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # temperature only takes effect if do_sample=True is also passed; this uses greedy decoding
    outputs = model.generate(**inputs, max_new_tokens=200, temperature=1.0)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

def process_task(task_queue, result_queue):
    # Both models share cuda:0, so they are compared under identical conditions
    original_model, original_tokenizer = load_model(original_model_name, device="cuda:0")
    fine_tuned_model, fine_tuned_tokenizer = load_model(finetuned_model, device="cuda:0")
    rouge = load("rouge")
    while True:
        task = task_queue.get()
        if task is None:  # sentinel value: no more work
            break
        prompt, reference = task
        start = time.time()
        original_summary = inference(original_model, original_tokenizer, prompt)
        fine_tuned_summary = inference(fine_tuned_model, fine_tuned_tokenizer, prompt)
        print(f"Completed inference in {time.time() - start:.2f}s")
        original_scores = rouge.compute(predictions=[original_summary], references=[reference])
        fine_tuned_scores = rouge.compute(predictions=[fine_tuned_summary], references=[reference])
        result_queue.put((original_scores, fine_tuned_scores))

def main():
    task_queue = multiprocessing.Queue()
    result_queue = multiprocessing.Queue()
    prompt = "Your prompt here"
    reference = "Your reference summary here"

    # The worker process owns the GPU; the main process only feeds tasks and collects scores
    process = multiprocessing.Process(target=process_task, args=(task_queue, result_queue))
    process.start()

    start = time.time()
    # Run 3 times
    for _ in range(3):
        task_queue.put((prompt, reference))
    results = []
    for _ in range(3):
        result = result_queue.get()
        results.append(result)

    # Signal the worker process to terminate
    task_queue.put(None)
    process.join()
    end = time.time()
    print(f"Total time: {end - start:.2f}s")

    # Print ROUGE scores
    for i, (original_scores, fine_tuned_scores) in enumerate(results):
        print(f"Run {i + 1}:")
        print("Original model scores:")
        print(original_scores)
        print("Fine-tuned model scores:")
        print(fine_tuned_scores)
        print()

if __name__ == "__main__":
    multiprocessing.set_start_method("spawn")  # required for CUDA in subprocesses
    main()
To illustrate the effectiveness of ROUGE in evaluating fine-tuned models, here are the results from a recent run:
Fine-tuned model:
rouge1: 0.1775
rouge2: 0.0271
rougeL: 0.1148
rougeLsum: 0.1148
Original model:
rouge1: 0.0780
rouge2: 0.0228
rougeL: 0.0543
rougeLsum: 0.0598
As we can see, the fine-tuned model outperforms the original across all four ROUGE metrics, with the largest gains in ROUGE-1 and ROUGE-L. This quantitative assessment provides concrete evidence that the fine-tuning process has indeed enhanced the model's summarization capabilities.
While ROUGE scores shouldn't be the sole criterion for evaluating LLMs, they offer a valuable, objective measure of improvement. By incorporating ROUGE into our evaluation pipeline, we can more efficiently iterate on our fine-tuning process and confidently assess the quality of our models.
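One practical way to fold ROUGE into that pipeline is to average the per-run score dictionaries collected by the script above, so each model boils down to a single number per metric that you can track between fine-tuning iterations. A small sketch, assuming a results list with the same (original_scores, fine_tuned_scores) shape as in main():

# Average ROUGE scores across runs so each model is summarised by one number per metric
def average_scores(score_dicts):
    keys = score_dicts[0].keys()
    return {k: sum(d[k] for d in score_dicts) / len(score_dicts) for k in keys}

original_avg = average_scores([orig for orig, _ in results])
fine_tuned_avg = average_scores([ft for _, ft in results])
print("Original (mean):", original_avg)
print("Fine-tuned (mean):", fine_tuned_avg)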