Update: Here are some other posts you may find interesting:
- DeepSeek R1: https://dev.to/gokulsg/deepseek-r1-33n0
- Spoken Language Models: https://dev.to/gokulsg/spoken-language-models-3afe
Language models have become central components in modern AI systems, yet their evaluation remains a complex challenge spanning multiple dimensions of performance. In this post, we explore how both text-based and spoken language models are evaluated across a range of tasks, with implementation details and code examples.
Introduction to Language Model Evaluation
Evaluating language models presents unique challenges compared to other machine learning systems. Unlike tasks with clear right or wrong answers, language processing exists in a complex space of semantics, fluency, context-awareness, and domain-specific knowledge. This complexity increases significantly when dealing with spoken language models, which must handle both acoustic properties and linguistic content. Evaluation for speech generation is difficult due to the continuous, variable, and multi-level nature of the speech waveform, and the necessity both to capture fine-grained acoustic details to generate intelligible audio and to abstract away from them to learn higher-level language concepts. This dual nature creates fundamental challenges in measuring model performance.
Language model evaluation typically falls into two main paradigms:
- Intrinsic Evaluation: Measures inherent qualities like fluency and coherence without reference to downstream applications
- Extrinsic Evaluation: Assesses performance on specific tasks like question answering or summarization
Fundamentals of Evaluation Metrics
Perplexity and Text-Based Metrics
Perplexity remains the standard intrinsic evaluation metric for text-based language models, measuring how well a model predicts a sample of text:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def calculate_perplexity(model, tokenizer, text, max_length=1024):
    """Calculate perplexity of a text using a causal language model."""
    encodings = tokenizer(text, return_tensors='pt', max_length=max_length, truncation=True)
    with torch.no_grad():
        # Passing the input ids as labels makes the model return the
        # mean negative log-likelihood per token as outputs.loss
        outputs = model(**encodings, labels=encodings.input_ids)
    # Perplexity is the exponential of the average negative log-likelihood
    perplexity = torch.exp(outputs.loss).item()
    return perplexity

# Example usage
model_name = "gpt2-medium"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

test_text = "Natural language processing has evolved significantly."
perplexity = calculate_perplexity(model, tokenizer, test_text)
print(f"Perplexity: {perplexity:.2f}")
```
However, perplexity has limitations when comparing models with different vocabularies or architectures. This has led to alternative metrics like Cloze-based predictability, which may better correlate with human judgments.
```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

def calculate_cloze_probability(model, tokenizer, sentence, target_word, position):
    """Calculate the probability of a target word at a given position in context."""
    # Replace the target position with the mask token
    tokens = sentence.split()
    tokens[position] = tokenizer.mask_token
    masked_sentence = ' '.join(tokens)

    inputs = tokenizer(masked_sentence, return_tensors='pt')
    mask_idx = torch.where(inputs["input_ids"][0] == tokenizer.mask_token_id)[0]

    with torch.no_grad():
        outputs = model(**inputs)
    predictions = outputs.logits

    # Use the first sub-token of the target word
    target_token_id = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(target_word))[0]
    probs = torch.softmax(predictions[0, mask_idx], dim=-1)
    target_prob = probs[0, target_token_id].item()
    return target_prob
```
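A brief usage sketch, assuming `bert-base-uncased` as the masked language model (any BERT-style checkpoint with a mask token would work):

```python
# Hypothetical usage: probability of "mat" at word position 5
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

sentence = "The cat sat on the mat"
prob = calculate_cloze_probability(model, tokenizer, sentence, target_word="mat", position=5)
print(f"P(mat | context) = {prob:.4f}")
```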
Standard Benchmarks for Textual Language Models
Several benchmarks have emerged as standards for evaluating text language models:
- GLUE/SuperGLUE: General Language Understanding Evaluation with tasks like sentiment analysis and natural language inference
- MMLU: Massive Multitask Language Understanding benchmark testing knowledge across 57 subjects
- GSM8K: Grade School Math problems for testing mathematical reasoning
- MATH: Advanced mathematics problems for testing higher-level reasoning
- HumanEval: Code generation benchmark testing programming abilities
These benchmarks provide standardized ways to assess different aspects of language model capabilities across multiple dimensions.
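As a minimal sketch of benchmark-style evaluation, the snippet below scores multiple-choice accuracy on a small slice of MMLU loaded through the Hugging Face `datasets` hub. The `cais/mmlu` dataset name and the `answer_question` callable are assumptions for illustration; in practice a harness such as lm-evaluation-harness handles prompting and scoring.

```python
from datasets import load_dataset

def benchmark_accuracy(answer_question, subject="college_computer_science", num_samples=50):
    """Score a model on a small slice of MMLU-style multiple-choice questions.

    `answer_question(question, choices)` is a user-supplied callable that returns
    the index of the predicted choice -- an assumed interface, not a fixed API.
    """
    # "cais/mmlu" is a commonly used Hugging Face mirror of the MMLU benchmark
    dataset = load_dataset("cais/mmlu", subject, split="test")
    dataset = dataset.select(range(min(num_samples, len(dataset))))

    correct = 0
    for example in dataset:
        prediction = answer_question(example["question"], example["choices"])
        if prediction == example["answer"]:
            correct += 1
    return correct / len(dataset)
```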
Evaluating Spoken Language Models
Spoken language models present unique evaluation challenges due to their dual nature of handling both acoustic properties and linguistic content.
Acoustic and Language Level Evaluation
Evaluation of spoken language models can occur at two distinct levels:
- Acoustic Level: Focuses on speech intelligibility and quality
- Language Level: Assesses the linguistic content and meaningfulness
Additionally, these evaluations can be performed in two operational modes:
- Encoding Mode: How well the model represents speech (see the probing sketch after this list)
- Generation Mode: How well the model produces new speech
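As a hedged sketch of encoding-mode evaluation, zero-shot lexical probes (in the spirit of the sWUGGY "spot-the-word" test) compare the probability a model assigns to a real word against a matched non-word. The example below uses a text LM (GPT-2) as a stand-in for a unit language model trained on discretized speech; span-length normalization and other details are omitted.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def sequence_log_prob(model, tokenizer, text):
    """Total log-probability of a word/sequence, conditioned on the BOS token."""
    # Prepend BOS so even single-token words get a scored position
    encodings = tokenizer(tokenizer.bos_token + text, return_tensors="pt")
    n_scored = encodings.input_ids.size(1) - 1  # tokens after BOS
    with torch.no_grad():
        outputs = model(**encodings, labels=encodings.input_ids)
    # outputs.loss is the mean negative log-likelihood over the scored tokens
    return -outputs.loss.item() * n_scored

def spot_the_word(model, tokenizer, word, nonword):
    """Return True if the model assigns higher probability to the real word."""
    return sequence_log_prob(model, tokenizer, word) > sequence_log_prob(model, tokenizer, nonword)

# Text LM standing in for a spoken unit LM
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
print(spot_the_word(model, tokenizer, "brick", "blick"))
```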
Human Evaluation Metrics
Human evaluation remains the gold standard for spoken language models. Key metrics include:
- Mean Opinion Scores (MOS): Subjective ratings of intelligibility on a 1-5 scale
- Character Error Rate (CER): Objective measure based on transcriptions
- Meaningfulness-MOS (MMOS): Ratings of naturalness considering grammar and meaning
```python
import numpy as np
from scipy import stats

def analyze_mos_scores(scores, confidence_level=0.95):
    """Analyze Mean Opinion Scores from human evaluators."""
    mean_score = np.mean(scores)

    # Calculate the confidence interval half-width using the t-distribution
    n = len(scores)
    std_err = stats.sem(scores)
    confidence_interval = std_err * stats.t.ppf((1 + confidence_level) / 2, n - 1)

    return {
        "mean": mean_score,
        "confidence_interval": confidence_interval,
        "range": (mean_score - confidence_interval, mean_score + confidence_interval)
    }
```
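For example, given ratings from a small listener panel (illustrative numbers only):

```python
# Hypothetical ratings from ten listeners for one system
scores = [4, 5, 3, 4, 4, 5, 3, 4, 4, 5]
summary = analyze_mos_scores(scores)
print(f"MOS: {summary['mean']:.2f} ± {summary['confidence_interval']:.2f}")
```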
ASR-Based Evaluation Metrics
A significant innovation in spoken language model evaluation is the use of automated speech recognition (ASR) systems to assess both intelligibility and meaningfulness.
```python
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import librosa
import jiwer

def asr_based_evaluation(audio_path, reference_text):
    """Evaluate speech using ASR and calculate error metrics."""
    processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

    # Load audio at the 16 kHz sampling rate the model expects
    audio, rate = librosa.load(audio_path, sr=16000)

    # Process audio
    input_values = processor(audio, sampling_rate=16000, return_tensors="pt").input_values

    # Get ASR prediction
    with torch.no_grad():
        logits = model(input_values).logits

    # Decode the prediction (this checkpoint outputs uppercase text)
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)[0]

    # Calculate error metrics; normalize case so WER/CER are not inflated
    wer = jiwer.wer(reference_text.lower(), transcription.lower())
    cer = jiwer.cer(reference_text.lower(), transcription.lower())

    return {
        "transcription": transcription,
        "wer": wer,
        "cer": cer
    }
```
Temperature Selection for Spoken Language Models
The sampling temperature is critical for balancing quality and diversity in generated speech. A practical way to normalize it is to sweep over candidate temperatures and keep the one whose samples best match a reference continuation, as in the sketch below.
```python
import numpy as np

def normalize_temperature(model, tokenizer, prompt, target_text, temp_range=(0.3, 3.0), steps=10):
    """Find a sampling temperature whose generations best match a reference text.

    Assumes a `calculate_similarity(text_a, text_b)` function is available,
    e.g. a BLEU or embedding-based similarity score.
    """
    temperatures = np.linspace(temp_range[0], temp_range[1], steps)
    best_temp = None
    best_score = -float('inf')

    # Tokenize prompt
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    for temp in temperatures:
        # Generate several samples with the current temperature
        outputs = model.generate(
            input_ids,
            max_length=100,
            do_sample=True,
            temperature=temp,
            num_return_sequences=10
        )

        # Score generations against the reference text
        generations = [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]
        scores = [calculate_similarity(gen, target_text) for gen in generations]
        avg_score = sum(scores) / len(scores)

        # Keep the best temperature
        if avg_score > best_score:
            best_score = avg_score
            best_temp = temp

    return best_temp, best_score
```
Evaluating Code Generation Models
Code generation has become a significant application of language models, requiring specialized evaluation approaches.
Test-Based Evaluation Methods
The most straightforward approach is to evaluate whether generated code passes predefined test cases.
```python
import subprocess
import tempfile
import os

def evaluate_code_execution(code, test_cases, language="python"):
    """Evaluate generated code by executing it against test cases."""
    if language != "python":
        raise ValueError("Only Python evaluation is implemented in this example")

    results = {}
    with tempfile.TemporaryDirectory() as tmpdir:
        # Write the candidate solution to a temporary file
        file_path = os.path.join(tmpdir, "solution.py")
        with open(file_path, "w") as f:
            f.write(code)

        # Execute each test case in its own file that imports the solution
        for i, test_case in enumerate(test_cases):
            test_file = os.path.join(tmpdir, f"test_{i}.py")
            with open(test_file, "w") as f:
                f.write(f"from solution import *\n{test_case}")
            try:
                result = subprocess.run(
                    ["python", test_file],
                    capture_output=True,
                    text=True,
                    timeout=5  # 5 second timeout
                )
                results[f"test_{i}"] = {
                    "passed": result.returncode == 0,
                    "stdout": result.stdout,
                    "stderr": result.stderr
                }
            except subprocess.TimeoutExpired:
                results[f"test_{i}"] = {
                    "passed": False,
                    "error": "Timeout"
                }

    # Calculate overall pass rate
    pass_count = sum(1 for result in results.values() if result.get("passed", False))
    results["overall"] = {
        "pass_rate": pass_count / len(test_cases) if test_cases else 0,
        "passed": pass_count,
        "total": len(test_cases)
    }
    return results
```
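A quick usage sketch with a toy solution and assert-style test cases:

```python
solution = "def add(a, b):\n    return a + b\n"
tests = [
    "assert add(2, 3) == 5",
    "assert add(-1, 1) == 0",
]
report = evaluate_code_execution(solution, tests)
print(report["overall"])  # e.g. {'pass_rate': 1.0, 'passed': 2, 'total': 2}
```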
Token-Based and Embedding-Based Methods
For evaluating code similarity to reference solutions, token-based metrics like BLEU, ROUGE-L, and CodeBLEU are commonly used.
```python
from nltk.translate.bleu_score import sentence_bleu
from nltk.tokenize import word_tokenize

def calculate_code_bleu(generated_code, reference_code):
    """Calculate a BLEU score between generated and reference code."""
    # Tokenize code (word_tokenize requires the NLTK 'punkt' tokenizer data)
    reference_tokens = word_tokenize(reference_code)
    generated_tokens = word_tokenize(generated_code)

    # sentence_bleu expects a list of reference token lists
    bleu = sentence_bleu([reference_tokens], generated_tokens)
    return bleu
```
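Embedding-based methods instead compare dense representations of the two snippets. A minimal sketch using the `sentence-transformers` library; the `all-MiniLM-L6-v2` checkpoint is a general-purpose encoder chosen for illustration, and purpose-built code encoders (e.g. CodeBERT-style models) are usually preferred in practice:

```python
from sentence_transformers import SentenceTransformer, util

def embedding_code_similarity(generated_code, reference_code, model_name="all-MiniLM-L6-v2"):
    """Cosine similarity between embeddings of generated and reference code."""
    encoder = SentenceTransformer(model_name)
    embeddings = encoder.encode([generated_code, reference_code], convert_to_tensor=True)
    return util.cos_sim(embeddings[0], embeddings[1]).item()
```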
LLM-Based Code Evaluation
Recent research like CODEJUDGE demonstrates how LLMs themselves can be used to evaluate code quality and correctness.
```python
import openai

def codejudge_evaluation(problem_description, generated_code, reference_code=None):
    """Use an LLM to evaluate code correctness, following the CODEJUDGE approach."""
    if reference_code:
        prompt = f"""
Problem Description:
{problem_description}

Generated Code:
{generated_code}

Reference Solution:
{reference_code}

Analyze the generated code and reference solution:
1. Trace through the execution for both implementations with example inputs.
2. Analyze if the generated code correctly implements the requirements.
3. Check edge cases and potential bugs.
4. Compare the generated code with the reference solution.

After thorough analysis, determine if the generated code is correct (Yes/No):
"""
    else:
        prompt = f"""
Problem Description:
{problem_description}

Generated Code:
{generated_code}

Analyze the generated code:
1. Trace through the execution with example inputs.
2. Analyze if the code correctly implements the requirements.
3. Check edge cases and potential bugs.

After thorough analysis, determine if the generated code is correct (Yes/No):
"""

    # Call the LLM (legacy openai<1.0 ChatCompletion interface)
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are an expert code evaluator."},
            {"role": "user", "content": prompt}
        ],
        temperature=0
    )

    evaluation_text = response.choices[0].message.content
    # Take the verdict from the final line of the model's analysis
    correct = "yes" in evaluation_text.lower().split("\n")[-1]

    return {
        "correct": correct,
        "explanation": evaluation_text
    }
```
Task-Specific: Question Answering Evaluation
Evaluating question answering systems typically involves measuring exact match and F1 score against reference answers.
```python
import re
import string
from collections import Counter

def normalize_answer(s):
    """Normalize an answer string for exact-match evaluation."""
    def remove_articles(text):
        return re.sub(r'\b(a|an|the)\b', ' ', text)

    def white_space_fix(text):
        return ' '.join(text.split())

    def remove_punc(text):
        exclude = set(string.punctuation)
        return ''.join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))

def exact_match_score(prediction, ground_truth):
    """Calculate exact match score."""
    return normalize_answer(prediction) == normalize_answer(ground_truth)

def f1_score(prediction, ground_truth):
    """Calculate token-level F1 score for QA tasks."""
    prediction_tokens = normalize_answer(prediction).split()
    ground_truth_tokens = normalize_answer(ground_truth).split()

    common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0
    precision = num_same / len(prediction_tokens)
    recall = num_same / len(ground_truth_tokens)
    f1 = (2 * precision * recall) / (precision + recall)
    return f1
```
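SQuAD-style evaluation usually compares the prediction against several acceptable answers and keeps the best score; a small helper for that:

```python
def metric_max_over_ground_truths(metric_fn, prediction, ground_truths):
    """Apply a metric against every reference answer and keep the maximum."""
    return max(metric_fn(prediction, gt) for gt in ground_truths)

# Example with multiple acceptable answers
references = ["Paris", "the city of Paris"]
print(metric_max_over_ground_truths(f1_score, "Paris, France", references))
```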
Task-Specific: Summarization Evaluation
ROUGE metrics remain the standard for evaluating text summarization.
```python
import numpy as np
from datasets import load_dataset
from rouge_score import rouge_scorer

def evaluate_summarization(model_function, dataset_name="cnn_dailymail", split="test", num_samples=100):
    """Evaluate a summarization model using ROUGE metrics."""
    # Load dataset (cnn_dailymail requires a config version)
    if dataset_name == "cnn_dailymail":
        dataset = load_dataset(dataset_name, "3.0.0", split=split)
    else:
        dataset = load_dataset(dataset_name, split=split)

    # Sample examples
    if num_samples and num_samples < len(dataset):
        indices = np.random.choice(len(dataset), num_samples, replace=False)
        dataset = dataset.select(indices)

    # Initialize ROUGE scorer
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    rouge_scores = {'rouge1': [], 'rouge2': [], 'rougeL': []}

    for example in dataset:
        # Get document and reference summary (field names differ across datasets)
        document = example["article"] if "article" in example else example["document"]
        reference = example["highlights"] if "highlights" in example else example["summary"]

        # Get model prediction
        prediction = model_function(document)

        # Calculate and store ROUGE F1 scores
        scores = scorer.score(reference, prediction)
        for key in rouge_scores:
            rouge_scores[key].append(scores[key].fmeasure)

    # Average scores across examples
    return {key: np.mean(values) for key, values in rouge_scores.items()}
```
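As an illustrative usage, a trivial lead-style baseline (first three sentences of the article) can be plugged in as `model_function`; note this will download the CNN/DailyMail dataset on first run:

```python
def lead3_baseline(document):
    """Naive baseline: return roughly the first three sentences of the article."""
    return ". ".join(document.split(". ")[:3])

results = evaluate_summarization(lead3_baseline, num_samples=20)
print(results)  # e.g. {'rouge1': ..., 'rouge2': ..., 'rougeL': ...}
```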
Psycholinguistic Modeling and Evaluation
Research has shown that language model performance correlates with human reading times and other psycholinguistic measures. Generalized additive mixed models (GAMMs) are often used to assess this relationship; the simplified example below instead uses Pearson correlation as a lightweight proxy.
```python
from scipy.stats import pearsonr

def psycholinguistic_evaluation(model_surprisals, reading_times):
    """Correlate model surprisal values with human reading times.

    Args:
        model_surprisals: Dictionary mapping model names to per-word surprisal values
        reading_times: Human reading time measurements for the same words

    Returns:
        Correlation between each model's surprisals and the reading times
    """
    results = {}
    for model_name, surprisals in model_surprisals.items():
        # Pearson correlation between surprisal and reading time
        correlation, p_value = pearsonr(surprisals, reading_times)
        results[model_name] = {
            "correlation": correlation,
            "p_value": p_value,
            "significant": p_value < 0.05
        }
    return results
```
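The function above assumes surprisal values are already available. A hedged sketch of how token-level surprisals can be obtained from a causal LM (GPT-2 here); aggregating subword surprisals up to words, as reading-time studies require, is left out for brevity:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def token_surprisals(text, model_name="gpt2"):
    """Per-token surprisal (negative log2-probability) under a causal LM."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    encodings = tokenizer(text, return_tensors="pt")
    input_ids = encodings.input_ids
    with torch.no_grad():
        logits = model(input_ids).logits

    # Surprisal of token t is -log2 P(token_t | tokens_<t); the first token gets no score
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    target_ids = input_ids[0, 1:]
    surprisals = -log_probs[torch.arange(target_ids.size(0)), target_ids] / torch.log(torch.tensor(2.0))
    tokens = tokenizer.convert_ids_to_tokens(target_ids.tolist())
    return list(zip(tokens, surprisals.tolist()))
```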
The psycholinguistic modeling perspective provides a unique window into how well language models capture human-like language processing. Studies suggest that factors like model architecture and training corpus size significantly impact psycholinguistic modeling performance, while the number of model parameters has less influence.
Factuality and Hallucination Evaluation
As language models advance, evaluating factuality becomes increasingly important.
```python
def evaluate_factuality(model_function, factual_statements, non_factual_statements):
    """
    Evaluate a model's ability to distinguish factual from non-factual statements.

    Args:
        model_function: Function that takes a statement and returns True/False
        factual_statements: List of known factual statements
        non_factual_statements: List of known non-factual statements

    Returns:
        Accuracy, precision, recall, and F1 score
    """
    true_positives = sum(1 for s in factual_statements if model_function(s))
    false_positives = sum(1 for s in non_factual_statements if model_function(s))
    true_negatives = sum(1 for s in non_factual_statements if not model_function(s))
    false_negatives = sum(1 for s in factual_statements if not model_function(s))

    accuracy = (true_positives + true_negatives) / (len(factual_statements) + len(non_factual_statements))
    precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
    recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1
    }
```
Conclusion
Evaluating language models, whether text-based or spoken, remains a complex and evolving field. Key takeaways include:
- Multi-faceted evaluation is essential: No single metric can capture the complex capabilities of modern language models.
- Human evaluation remains important: While automatic metrics are efficient, human assessment is crucial for aspects like coherence, factuality, and helpfulness.
- Task-specific evaluation provides actionable insights: Different applications require different evaluation approaches.
- Spoken language models require dual evaluation: Both acoustic properties and linguistic content must be assessed.
- LLM-based evaluation methods show promise: Using language models themselves to evaluate outputs is an exciting frontier in evaluation research.