Gokul S

Evaluating Textual and Spoken Language Models

#ai

Update: Here are some other related blog posts:

  1. DeepSeek R1: https://dev.to/gokulsg/deepseek-r1-33n0
  2. Spoken Language Models: https://dev.to/gokulsg/spoken-language-models-3afe

Language models have become central components in modern AI systems, yet their evaluation remains a complex challenge spanning multiple dimensions of performance. In this blog, we will explore the comprehensive landscape of evaluating both text-based and spoken language models across various tasks, providing implementation details and code examples.

Introduction to Language Model Evaluation

Evaluating language models presents unique challenges compared to other machine learning systems. Unlike tasks with clear right or wrong answers, language processing exists in a complex space of semantics, fluency, context-awareness, and domain-specific knowledge. This complexity increases significantly when dealing with spoken language models, which must handle both acoustic properties and linguistic content. Evaluation for speech generation is difficult due to the continuous, variable, and multi-level nature of the speech waveform, and the necessity both to capture fine-grained acoustic details to generate intelligible audio and to abstract away from them to learn higher-level language concepts. This dual nature creates fundamental challenges in measuring model performance.

Language model evaluation typically falls into two main paradigms:

  1. Intrinsic Evaluation: Measures inherent qualities like fluency and coherence without reference to downstream applications
  2. Extrinsic Evaluation: Assesses performance on specific tasks like question answering or summarization

Fundamentals of Evaluation Metrics

Perplexity and Text-Based Metrics

Perplexity remains the standard intrinsic evaluation metric for text-based language models, measuring how well a model predicts a sample of text:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def calculate_perplexity(model, tokenizer, text, max_length=1024):
    """Calculate perplexity of a text using a causal language model."""
    encodings = tokenizer(text, return_tensors='pt', max_length=max_length, truncation=True)

    with torch.no_grad():
        # Passing the input ids as labels makes the model return the mean
        # per-token negative log-likelihood as outputs.loss
        outputs = model(**encodings, labels=encodings.input_ids)

    # Perplexity is the exponential of the average negative log-likelihood
    perplexity = torch.exp(outputs.loss).item()
    return perplexity

# Example usage
model_name = "gpt2-medium"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

test_text = "Natural language processing has evolved significantly."
perplexity = calculate_perplexity(model, tokenizer, test_text)
print(f"Perplexity: {perplexity:.2f}")

However, perplexity has limitations when comparing models with different vocabularies or architectures. This has led to alternative metrics like Cloze-based predictability, which may better correlate with human judgments.

from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

def calculate_cloze_probability(model, tokenizer, sentence, target_word, position):
    """Calculate probability of a target word in context."""
    tokens = sentence.split()
    tokens[position] = tokenizer.mask_token
    masked_sentence = ' '.join(tokens)

    inputs = tokenizer(masked_sentence, return_tensors='pt')
    mask_idx = torch.where(inputs["input_ids"][0] == tokenizer.mask_token_id)[0]

    with torch.no_grad():
        outputs = model(**inputs)
        predictions = outputs.logits

    target_token_id = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(target_word))[0]
    probs = torch.softmax(predictions[0, mask_idx], dim=-1)
    target_prob = probs[0, target_token_id].item()

    return target_prob
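
For example, with a masked language model such as bert-base-uncased (an assumption; any masked LM with a [MASK] token works the same way), the function can be called like this:

# Example usage (assumes a masked LM such as bert-base-uncased)
mlm_name = "bert-base-uncased"
mlm_tokenizer = AutoTokenizer.from_pretrained(mlm_name)
mlm_model = AutoModelForMaskedLM.from_pretrained(mlm_name)

sentence = "The cat sat on the mat"
prob = calculate_cloze_probability(mlm_model, mlm_tokenizer, sentence, target_word="mat", position=5)
print(f"P(mat | context) = {prob:.4f}")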

Standard Benchmarks for Textual Language Models

Several benchmarks have emerged as standards for evaluating text language models:

  • GLUE/SuperGLUE: General Language Understanding Evaluation with tasks like sentiment analysis and natural language inference
  • MMLU: Massive Multitask Language Understanding benchmark testing knowledge across 57 subjects
  • GSM8K: Grade School Math problems for testing mathematical reasoning
  • MATH: Advanced mathematics problems for testing higher-level reasoning
  • HumanEval: Code generation benchmark testing programming abilities

These benchmarks provide standardized ways to assess different aspects of language model capabilities across multiple dimensions.
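
As a rough illustration of how multiple-choice benchmarks in the MMLU style are commonly scored with a causal language model, the sketch below picks the option to which the model assigns the highest log-likelihood given the question. The examples structure (question, options, answer index) is a simplified stand-in for the real benchmark format, not the official evaluation harness.

import torch
import torch.nn.functional as F

def score_option(model, tokenizer, question, option):
    """Sum of token log-probabilities the model assigns to an answer option, given the question."""
    prompt_ids = tokenizer(question + " Answer: ", return_tensors="pt").input_ids
    option_ids = tokenizer(option, return_tensors="pt", add_special_tokens=False).input_ids
    input_ids = torch.cat([prompt_ids, option_ids], dim=1)

    with torch.no_grad():
        logits = model(input_ids).logits

    # Log-probabilities of the option tokens, each conditioned on the preceding context
    option_len = option_ids.size(1)
    log_probs = F.log_softmax(logits[0, -option_len - 1:-1], dim=-1)
    token_log_probs = log_probs[torch.arange(option_len), option_ids[0]]
    return token_log_probs.sum().item()

def multiple_choice_accuracy(model, tokenizer, examples):
    """examples: list of dicts with 'question', 'options' (list of str), 'answer' (index)."""
    correct = 0
    for ex in examples:
        scores = [score_option(model, tokenizer, ex["question"], opt) for opt in ex["options"]]
        if max(range(len(scores)), key=scores.__getitem__) == ex["answer"]:
            correct += 1
    return correct / len(examples)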

Evaluating Spoken Language Models

Spoken language models present unique evaluation challenges due to their dual nature of handling both acoustic properties and linguistic content.

Acoustic and Language Level Evaluation

Evaluation of spoken language models can occur at two distinct levels:

  • Acoustic Level: Focuses on speech intelligibility and quality
  • Language Level: Assesses the linguistic content and meaningfulness

Additionally, these evaluations can be performed in two operational modes:

  • Encoding Mode: How well the model represents speech
  • Generation Mode: How well the model produces new speech

Human Evaluation Metrics

Human evaluation remains the gold standard for spoken language models. Key metrics include:

  • Mean Opinion Scores (MOS): Subjective ratings of intelligibility on a 1-5 scale
  • Character Error Rate (CER): Objective measure based on transcriptions
  • Meaningfulness-MOS (MMOS): Ratings of naturalness considering grammar and meaning

import numpy as np
from scipy import stats

def analyze_mos_scores(scores, confidence_level=0.95):
    """Analyze Mean Opinion Scores from human evaluators."""
    mean_score = np.mean(scores)

    # Calculate confidence interval
    n = len(scores)
    std_err = stats.sem(scores)
    confidence_interval = std_err * stats.t.ppf((1 + confidence_level) / 2, n - 1)

    return {
        "mean": mean_score,
        "confidence_interval": confidence_interval,
        "range": (mean_score - confidence_interval, mean_score + confidence_interval)
    }
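
For instance, given ratings from a small listening test (hypothetical numbers), the result can be reported as a mean with its confidence interval:

# Ratings from 10 hypothetical listeners for one system
mos_ratings = [4, 5, 4, 3, 4, 5, 4, 4, 3, 5]
summary = analyze_mos_scores(mos_ratings)
print(f"MOS = {summary['mean']:.2f} ± {summary['confidence_interval']:.2f} (95% CI)")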

ASR-Based Evaluation Metrics

A significant innovation in spoken language model evaluation is the use of automated speech recognition (ASR) systems to assess both intelligibility and meaningfulness.

import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import librosa
import jiwer

def asr_based_evaluation(audio_path, reference_text):
    """Evaluate speech using ASR and calculate error metrics."""
    processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
    model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

    # Load audio
    audio, rate = librosa.load(audio_path, sr=16000)

    # Process audio
    input_values = processor(audio, sampling_rate=16000, return_tensors="pt").input_values

    # Get ASR prediction
    with torch.no_grad():
        logits = model(input_values).logits

    # Decode the prediction
    predicted_ids = torch.argmax(logits, dim=-1)
    transcription = processor.batch_decode(predicted_ids)[0]

    # Calculate metrics (normalize case first: wav2vec2-base-960h outputs uppercase text)
    wer = jiwer.wer(reference_text.lower(), transcription.lower())
    cer = jiwer.cer(reference_text.lower(), transcription.lower())

    return {
        "transcription": transcription,
        "wer": wer,
        "cer": cer
    }

Temperature Selection for Spoken Language Models

The sampling temperature is critical for balancing quality and diversity in generated speech. A common recommendation is to calibrate it by sweeping a range of values and scoring the resulting generations against a reference text, as in the sketch below.

import numpy as np
from difflib import SequenceMatcher

def calculate_similarity(text_a, text_b):
    """Simple string-overlap similarity between a generation and the reference text."""
    return SequenceMatcher(None, text_a, text_b).ratio()

def normalize_temperature(model, tokenizer, prompt, target_text, temp_range=(0.3, 3.0), steps=10):
    """Find the sampling temperature whose generations best match a reference text."""
    temperatures = np.linspace(temp_range[0], temp_range[1], steps)
    best_temp = None
    best_score = -float('inf')

    # Tokenize prompt
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    for temp in temperatures:
        # Generate with current temperature
        outputs = model.generate(
            input_ids,
            max_length=100,
            do_sample=True,
            temperature=temp,
            num_return_sequences=10
        )

        # Calculate scores against target
        generations = [tokenizer.decode(output, skip_special_tokens=True) for output in outputs]
        scores = [calculate_similarity(gen, target_text) for gen in generations]
        avg_score = sum(scores) / len(scores)

        # Update best temperature
        if avg_score > best_score:
            best_score = avg_score
            best_temp = temp

    return best_temp, best_score

Evaluating Code Generation Models

Code generation has become a significant application of language models, requiring specialized evaluation approaches.

Test-Based Evaluation Methods

The most straightforward approach is to evaluate whether generated code passes predefined test cases.

import subprocess
import tempfile
import os

def evaluate_code_execution(code, test_cases, language="python"):
    """Evaluate generated code by executing it against test cases."""
    results = {}

    with tempfile.TemporaryDirectory() as tmpdir:
        # Write code to temporary file
        if language == "python":
            file_path = os.path.join(tmpdir, "solution.py")
            with open(file_path, "w") as f:
                f.write(code)

            # Execute each test case
            for i, test_case in enumerate(test_cases):
                test_file = os.path.join(tmpdir, f"test_{i}.py")
                with open(test_file, "w") as f:
                    f.write(f"from solution import *\n{test_case}")

                try:
                    result = subprocess.run(
                        ["python", test_file],
                        capture_output=True,
                        text=True,
                        timeout=5  # 5 second timeout
                    )
                    passed = result.returncode == 0
                    results[f"test_{i}"] = {
                        "passed": passed,
                        "stdout": result.stdout,
                        "stderr": result.stderr
                    }
                except subprocess.TimeoutExpired:
                    results[f"test_{i}"] = {
                        "passed": False,
                        "error": "Timeout"
                    }

    # Calculate overall pass rate
    pass_count = sum(1 for result in results.values() if result.get("passed", False))
    results["overall"] = {
        "pass_rate": pass_count / len(test_cases) if test_cases else 0,
        "passed": pass_count,
        "total": len(test_cases)
    }

    return results
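
Benchmarks such as HumanEval usually report pass@k rather than a raw pass rate: the probability that at least one of k sampled solutions passes all tests. A common way to compute it is the unbiased estimator from the HumanEval paper, sketched below for a single problem with n generated samples of which c pass.

import numpy as np

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: n samples generated, c of them correct, k drawn."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 3 of 20 generated solutions pass the tests
print(f"pass@1  = {pass_at_k(20, 3, 1):.3f}")
print(f"pass@10 = {pass_at_k(20, 3, 10):.3f}")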

Token-Based and Embedding-Based Methods

For evaluating code similarity to reference solutions, token-based metrics like BLEU, ROUGE-L, and CodeBLEU are commonly used.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.tokenize import word_tokenize  # requires the nltk 'punkt' tokenizer data

def calculate_code_bleu(generated_code, reference_code):
    """Calculate BLEU score for code generation."""
    # Tokenize code
    reference_tokens = word_tokenize(reference_code)
    generated_tokens = word_tokenize(generated_code)

    # Smoothing avoids zero scores for short snippets with missing n-gram orders
    bleu = sentence_bleu([reference_tokens], generated_tokens,
                         smoothing_function=SmoothingFunction().method1)

    return bleu

LLM-Based Code Evaluation

Recent research like CODEJUDGE demonstrates how LLMs themselves can be used to evaluate code quality and correctness.

import openai

def codejudge_evaluation(problem_description, generated_code, reference_code=None):
    """Use LLM to evaluate code correctness following CODEJUDGE approach."""
    if reference_code:
        prompt = f"""
Problem Description:
{problem_description}

Generated Code:
{generated_code}

Reference Solution:
{reference_code}

Analyze the generated code and reference solution:
1. Trace through the execution for both implementations with example inputs.
2. Analyze if the generated code correctly implements the requirements.
3. Check edge cases and potential bugs.
4. Compare the generated code with the reference solution.

After thorough analysis, determine if the generated code is correct (Yes/No):
"""
    else:
        prompt = f"""
Problem Description:
{problem_description}

Generated Code:
{generated_code}

Analyze the generated code:
1. Trace through the execution with example inputs.
2. Analyze if the code correctly implements the requirements.
3. Check edge cases and potential bugs.

After thorough analysis, determine if the generated code is correct (Yes/No):
"""

    # Call LLM API (uses the pre-1.0 openai SDK's ChatCompletion interface)
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are an expert code evaluator."},
            {"role": "user", "content": prompt}
        ],
        temperature=0
    )

    evaluation_text = response.choices[0].message.content
    correct = "yes" in evaluation_text.lower().split("\n")[-1]

    return {
        "correct": correct,
        "explanation": evaluation_text
    }

Task-Specific: Question Answering Evaluation

Evaluating question answering systems typically involves measuring exact match and F1 score against reference answers.

def normalize_answer(s):
    """Normalize answer for exact match evaluation."""
    import re

    def remove_articles(text):
        return re.sub(r'\b(a|an|the)\b', ' ', text)

    def white_space_fix(text):
        return ' '.join(text.split())

    def remove_punc(text):
        exclude = set('!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~')
        return ''.join(ch for ch in text if ch not in exclude)

    def lower(text):
        return text.lower()

    return white_space_fix(remove_articles(remove_punc(lower(s))))

def exact_match_score(prediction, ground_truth):
    """Calculate exact match score."""
    return normalize_answer(prediction) == normalize_answer(ground_truth)

def f1_score(prediction, ground_truth):
    """Calculate F1 score for QA tasks."""
    from collections import Counter

    prediction_tokens = normalize_answer(prediction).split()
    ground_truth_tokens = normalize_answer(ground_truth).split()

    common = Counter(prediction_tokens) & Counter(ground_truth_tokens)
    num_same = sum(common.values())

    if num_same == 0:
        return 0

    precision = num_same / len(prediction_tokens)
    recall = num_same / len(ground_truth_tokens)
    f1 = (2 * precision * recall) / (precision + recall)

    return f1
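
In SQuAD-style evaluation, a prediction is usually scored against several acceptable reference answers by taking the maximum. A small usage sketch with made-up answers:

prediction = "the Eiffel Tower"
references = ["Eiffel Tower", "The Eiffel tower in Paris"]

em = max(int(exact_match_score(prediction, ref)) for ref in references)
f1 = max(f1_score(prediction, ref) for ref in references)
print(f"EM: {em}, F1: {f1:.2f}")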

Task-Specific: Summarization Evaluation

ROUGE metrics remain the standard for evaluating text summarization.

import numpy as np
from rouge_score import rouge_scorer

def evaluate_summarization(model_function, dataset_name="cnn_dailymail", split="test", num_samples=100):
    """Evaluate a summarization model using ROUGE metrics."""
    # Load dataset
    from datasets import load_dataset
    if dataset_name == "cnn_dailymail":
        dataset = load_dataset(dataset_name, "3.0.0", split=split)
    else:
        dataset = load_dataset(dataset_name, split=split)

    # Sample examples
    if num_samples and num_samples < len(dataset):
        indices = np.random.choice(len(dataset), num_samples, replace=False)
        dataset = dataset.select(indices)

    # Initialize ROUGE scorer
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)

    rouge_scores = {
        'rouge1': [],
        'rouge2': [],
        'rougeL': []
    }

    for example in dataset:
        # Get document and reference summary
        document = example["article"] if "article" in example else example["document"]
        reference = example["highlights"] if "highlights" in example else example["summary"]

        # Get model prediction
        prediction = model_function(document)

        # Calculate ROUGE scores
        scores = scorer.score(reference, prediction)

        # Store scores
        for key in rouge_scores:
            rouge_scores[key].append(scores[key].fmeasure)

    # Calculate average scores
    results = {
        key: np.mean(values) for key, values in rouge_scores.items()
    }

    return results
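
As an illustration, model_function can be a thin wrapper around any summarizer. The example below assumes the transformers summarization pipeline with facebook/bart-large-cnn; any callable that maps a document string to a summary string works.

from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def summarize(document):
    # Crude character-level truncation to stay within the model's input limit
    return summarizer(document[:1024], max_length=128, min_length=30, do_sample=False)[0]["summary_text"]

results = evaluate_summarization(summarize, num_samples=10)
print(results)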

Psycholinguistic Modeling and Evaluation

Research has shown that language model surprisal correlates with human reading times and other psycholinguistic measures. Generalized additive mixed models (GAMMs) are often used to model this relationship; the simpler sketch below measures the Pearson correlation between model surprisals and reading times.

def psycholinguistic_evaluation(model_surprisals, reading_times):
    """
    Evaluate language models using psycholinguistic measures.

    Args:
        model_surprisals: Dictionary mapping model names to their surprisal values
        reading_times: Human reading time measurements

    Returns:
        Correlation between model surprisals and reading times
    """
    from scipy.stats import pearsonr

    results = {}
    for model_name, surprisals in model_surprisals.items():
        # Calculate correlation
        correlation, p_value = pearsonr(surprisals, reading_times)
        results[model_name] = {
            "correlation": correlation,
            "p_value": p_value,
            "significant": p_value < 0.05
        }

    return results
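
The function above assumes per-word surprisal values are already available. A minimal sketch of extracting token-level surprisal (negative log-probability, here in bits) from a causal language model; aligning subword tokens to words and reading-time regions is left out for brevity:

import torch
import torch.nn.functional as F

def token_surprisals(model, tokenizer, text):
    """Return (token, surprisal-in-bits) pairs for every token after the first."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits

    # Each position t predicts token t+1, so shift logits and targets by one
    log_probs = F.log_softmax(logits[0, :-1], dim=-1)
    target_ids = ids[0, 1:]
    nll = -log_probs[torch.arange(target_ids.size(0)), target_ids]

    surprisal_bits = (nll / torch.log(torch.tensor(2.0))).tolist()
    tokens = tokenizer.convert_ids_to_tokens(target_ids.tolist())
    return list(zip(tokens, surprisal_bits))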

The psycholinguistic modeling perspective provides a unique window into how well language models capture human-like language processing. Studies suggest that factors like model architecture and training corpus size significantly impact psycholinguistic modeling performance, while the number of model parameters has less influence.

Factuality and Hallucination Evaluation

As language models advance, evaluating factuality becomes increasingly important.

def evaluate_factuality(model_function, factual_statements, non_factual_statements):
    """
    Evaluate model's ability to distinguish factual from non-factual statements.

    Args:
        model_function: Function that takes a statement and returns True/False
        factual_statements: List of known factual statements
        non_factual_statements: List of known non-factual statements

    Returns:
        Accuracy, precision, recall, and F1 score
    """
    true_positives = sum(1 for s in factual_statements if model_function(s))
    false_positives = sum(1 for s in non_factual_statements if model_function(s))
    true_negatives = sum(1 for s in non_factual_statements if not model_function(s))
    false_negatives = sum(1 for s in factual_statements if not model_function(s))

    accuracy = (true_positives + true_negatives) / (len(factual_statements) + len(non_factual_statements))
    precision = true_positives / (true_positives + false_positives) if (true_positives + false_positives) > 0 else 0
    recall = true_positives / (true_positives + false_negatives) if (true_positives + false_negatives) > 0 else 0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1
    }

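A toy usage sketch, where the verifier below is a hypothetical stand-in for a real factuality classifier (for example, a prompted LLM that returns True for statements it judges factual):

def toy_verifier(statement):
    # Hypothetical verifier for illustration only
    return "Paris is the capital of France" in statement

facts = ["Paris is the capital of France.", "Water boils at 100 °C at sea level."]
non_facts = ["Paris is the capital of Germany.", "The sun orbits the Earth."]

print(evaluate_factuality(toy_verifier, facts, non_facts))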

Conclusion

Evaluating language models, whether text-based or spoken, remains a complex and evolving field. Key takeaways include:

  • Multi-faceted evaluation is essential: No single metric can capture the complex capabilities of modern language models.
  • Human evaluation remains important: While automatic metrics are efficient, human assessment is crucial for aspects like coherence, factuality, and helpfulness.
  • Task-specific evaluation provides actionable insights: Different applications require different evaluation approaches.
  • Spoken language models require dual evaluation: Both acoustic properties and linguistic content must be assessed.
  • LLM-based evaluation methods show promise: Using language models themselves to evaluate outputs is an exciting frontier in evaluation research.

