Nilofer πŸš€

ASR Evaluation Framework: Benchmarking Speech Recognition Models Across Accuracy, Speed, and Robustness

Picking an ASR model for production is not straightforward. Whisper might be the most accurate for general English but too slow for real-time use. Wav2Vec2 might be fast enough for edge devices but struggle with accented speech. Distil-Whisper might hit the sweet spot for your use case, or it might not. Without a systematic benchmark across your actual conditions, you are guessing.

ASR Evaluation Framework is an enterprise-grade benchmarking tool that answers the questions that matter before you commit to a model:

  • Which ASR model is most accurate for my use case?
  • How fast can each model process audio in real-time?
  • How robust is each model against background noise, accents, and degraded audio?
  • What are the tradeoffs between speed and accuracy?

Features

  • 5 ASR Models : IBM Granite, OpenAI Whisper, NVIDIA Canary, Distil-Whisper, Wav2Vec2
  • Comprehensive Metrics : WER, CER, Accuracy, RTF, and Inference Time
  • 15+ Test Scenarios : Clean speech, background noise, accents, fast/slow speech, technical terms, and more
  • Flexible Evaluation Modes : Speed, accuracy, or complete evaluation
  • JSON Output Schema : Standardized metrics schema for result storage

Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚         run_evaluation.py (CLI Entry)               β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ --accuracy β”‚ --speed      β”‚ --all        β”‚ Config   β”‚
β”‚ Evaluate   β”‚ Evaluate RTF β”‚ Complete     β”‚ Loading  β”‚
β”‚ WER/CER    β”‚ & Inference  β”‚ Evaluation   β”‚          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              β”‚
      β”Œβ”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”
      β”‚   Evaluator    β”‚
      β”‚  - Load models β”‚
      β”‚  - Test audio  β”‚
      β”‚  - Calc metricsβ”‚
      β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              β”‚
     β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”
     β”‚        β”‚        β”‚
β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”β”Œβ”€β”€β”€β”€β–Όβ”€β”€β”
β”‚Granite β”‚β”‚Whisperβ”‚β”‚ Wav2V β”‚  ... 5 models
β”‚ Model  β”‚β”‚ Model β”‚β”‚ Model β”‚
β””β”€β”€β”€β”€β”¬β”€β”€β”˜β””β”€β”€β”€β”€β”¬β”€β”€β”˜β””β”€β”€β”€β”€β”¬β”€β”€β”˜
     β””β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”˜
              β”‚
      β”Œβ”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
      β”‚  Metrics Engine   β”‚
      β”‚ - WER/CER calc    β”‚
      β”‚ - RTF calc        β”‚
      β”‚ - Accuracy calc   β”‚
      β”‚ - Aggregation     β”‚
      β””β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
              β”‚
      β”Œβ”€β”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
      β”‚ JSON Results     β”‚
      β”‚ with schema      β”‚
      β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Model Comparison Overview

Evaluation Dimensions

Accuracy Metrics

WER : Word Error Rate. Percentage of words transcribed incorrectly compared to the reference.
CER : Character Error Rate. Character-level error rate for more detailed analysis.
Accuracy : (1 - WER) expressed as a percentage, so a WER of 0.05 corresponds to 95% accuracy.

Speed Metrics

RTF : Real-Time Factor. Inference time divided by audio duration. Below 1.0 means the model is real-time capable; above 1.0 means transcription takes longer than the audio itself.
Inference Time : Absolute seconds to transcribe the audio.
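
For intuition, here is a minimal pure-Python sketch of what these metrics compute. The framework itself uses jiwer for WER and CER; this standalone version is only illustrative:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            # dp[j] is the row above, dp[j-1] is already updated, prev is the diagonal
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
    return dp[-1]

def wer(reference, hypothesis):
    """Word Error Rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def rtf(inference_seconds, audio_seconds):
    """Real-Time Factor: below 1.0 means faster than real time."""
    return inference_seconds / audio_seconds

# One substitution in a four-word reference gives WER 0.25, i.e. 75% accuracy.
print(wer("the quick brown fox", "the quick brown box"))  # 0.25
print(rtf(2.3, 2.0))  # 1.15, slower than real time
```

CER is the same edit-distance calculation applied to character lists instead of word lists.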

Robustness Testing

15 test scenarios covering:

  • Clean speech - baseline accuracy testing
  • Background noise - office and street environments
  • Accented English
  • Fast and slow speech rates
  • Technical vocabulary
  • Whispered speech
  • Phone quality audio
  • Numbers and acronyms
  • And more scenarios

Installation

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

Requires Python 3.10+. Core dependencies:

librosa - Audio processing
numpy, scipy - Numerical computing
transformers - HuggingFace model loading
jiwer - WER and CER calculation
soundfile - Audio file I/O
pytest - Testing framework

Usage

Run Complete Evaluation

Runs accuracy and speed evaluation across all five models against all 15 test scenarios:

python run_evaluation.py --all

Run Accuracy Evaluation Only

python run_evaluation.py --accuracy

Run Speed Evaluation Only

python run_evaluation.py --speed

Specify Custom Paths

python run_evaluation.py --all \
  --data-path ./my_data \
  --output-path ./my_results

Results and Output

Console Output

Here is what a complete evaluation run looks like in the terminal:

============================================================
ASR EVALUATION FRAMEWORK v1.0.0
============================================================

=== RUNNING COMPLETE EVALUATION (ACCURACY + SPEED) ===

Evaluating Whisper...
Evaluating Wav2Vec2...
Evaluating Distil-Whisper...
Evaluating Canary...
Evaluating Granite...

βœ“ Results saved to: results/asr_eval_results_all_20260513_123045.json

============================================================
EVALUATION SUMMARY
============================================================

Model: Whisper
  Status: βœ“ OK
  Mean Accuracy: 95.23%
  Mean WER: 0.0477

Model: Wav2Vec2
  Status: βœ“ OK
  Mean Accuracy: 91.45%
  Mean WER: 0.0855

Model: Distil-Whisper
  Status: βœ“ OK
  Mean Accuracy: 93.78%
  Mean WER: 0.0622

JSON Output Format

Results are saved as structured JSON to results/asr_eval_results_{type}_{timestamp}.json. The schema includes evaluation metadata, per-model aggregate metrics, and per-scenario test results:

{
  "evaluation_metadata": {
    "timestamp": "2026-05-13T12:30:45.123Z",
    "evaluator_version": "1.0.0",
    "models_tested": ["Whisper", "Wav2Vec2", "Distil-Whisper"],
    "test_scenarios": 15,
    "evaluation_type": "all"
  },
  "model_results": {
    "Whisper": {
      "model_name": "Whisper",
      "model_id": "openai/whisper-base",
      "initialized": true,
      "aggregate_metrics": {
        "mean_accuracy": 95.23,
        "mean_wer": 0.0477,
        "mean_cer": 0.0234,
        "mean_rtf": 1.15,
        "std_wer": 0.0145
      },
      "test_results": [
        {
          "test_id": 1,
          "test_name": "clean_english",
          "wer": 0.032,
          "cer": 0.015,
          "accuracy": 96.8,
          "inference_time": 2.34,
          "rtf": 1.17
        }
      ]
    }
  },
  "summary": {
    "total_tests": 15,
    "evaluation_type": "all",
    "status": "completed"
  }
}

The per-scenario test_results array shows exactly how each model performed under each specific condition, not just aggregated averages. That granularity is what makes the output useful for production decisions.
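
As a sketch of how that breakdown can be consumed, the snippet below walks a results dict shaped like the schema above and flags each model's weakest scenario. The values and scenario names here are made up for illustration, not real benchmark numbers:

```python
results = {
    "model_results": {
        "Whisper": {"test_results": [
            {"test_name": "clean_english", "wer": 0.032},
            {"test_name": "phone_quality_audio", "wer": 0.121},
        ]},
        "Wav2Vec2": {"test_results": [
            {"test_name": "clean_english", "wer": 0.055},
            {"test_name": "phone_quality_audio", "wer": 0.210},
        ]},
    }
}

# Worst-case scenario per model: aggregate averages hide exactly this information.
worst = {
    name: max(data["test_results"], key=lambda t: t["wer"])
    for name, data in results["model_results"].items()
}
for name, t in worst.items():
    print(f"{name}: weakest on {t['test_name']} (WER {t['wer']:.3f})")
```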

Configuration

Environment variables, documented in .env.example:
HUGGINGFACE_TOKEN : HuggingFace API token for model loading
OPENAI_API_KEY : OpenAI API key
ASR_EVAL_DATA_PATH : data directory path
ASR_EVAL_RESULTS_PATH : results output path
VERBOSE : enable verbose logging
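
A typical way to consume these variables is a sketch like the following; the variable names come from .env.example as documented above, but the fallback defaults and the framework's actual config.py logic are assumptions:

```python
import os

# Fallback defaults mirror the repo layout (data/ and results/ directories).
DATA_PATH = os.getenv("ASR_EVAL_DATA_PATH", "./data")
RESULTS_PATH = os.getenv("ASR_EVAL_RESULTS_PATH", "./results")
# Treat common truthy strings as enabling verbose logging.
VERBOSE = os.getenv("VERBOSE", "false").lower() in ("1", "true", "yes")
```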

Test Matrix

15 test scenarios covering four categories:

Clean Speech - Baseline accuracy testing
Robustness - Background noise, accents, variable speech rates
Challenging Conditions - Whispered speech, music, phone quality audio
Domain-Specific - Technical vocabulary, numbers, acronyms

Metrics

Accuracy Metrics

WER (Word Error Rate) - Percentage of words that differ from reference
CER (Character Error Rate) - Percentage of characters that differ
Accuracy - (1 - WER) expressed as a percentage

Speed Metrics

RTF (Real-Time Factor) - Inference time divided by audio duration. Below 1.0 is real-time capable.
Inference Time - Total time to transcribe audio in seconds

Model Details

When to Use This Framework

Benchmarking ASR models before production deployment : run a full evaluation before committing to a model, not after.

Comparing model tradeoffs : speed versus accuracy decisions are data-driven rather than based on published benchmarks that may not reflect your audio conditions.

Testing robustness against real-world audio : the 15 test scenarios cover conditions that synthetic benchmarks miss: phone quality audio, background noise, accents, and technical vocabulary.

Evaluating cost-performance of different models : RTF and inference time metrics let you calculate the compute cost of each model at your actual workload.

Quality assurance in voice-enabled applications : run evaluations to catch model regressions before they reach production.

Research and academic speech recognition studies : the standardized JSON output schema makes results comparable and reproducible across experiments.

Real-World Scenarios

Scenario 1 - Call Center AI

  • Evaluate which model handles phone quality audio best
  • Test robustness against background noise
  • Measure inference speed for cost calculation
  • Result: Select fastest model that maintains accuracy

Scenario 2 - Voice Assistant

  • Test against various accents and speech rates
  • Evaluate technical command recognition
  • Measure real-time performance on edge devices
  • Result: Pick model that runs on-device with good accuracy

Scenario 3 - Transcription Service

  • Benchmark accuracy across multiple languages
  • Evaluate cost versus accuracy tradeoffs
  • Test on domain-specific vocabulary
  • Result: Choose optimal model for service tier

Project Structure

.
β”œβ”€β”€ src/                          # Core modules
β”‚   β”œβ”€β”€ config.py                # Configuration
β”‚   β”œβ”€β”€ metrics.py               # Metric calculations
β”‚   β”œβ”€β”€ data_loader.py           # Data loading utilities
β”‚   β”œβ”€β”€ base_model.py            # ASR model base class
β”‚   └── evaluator.py             # Main evaluator class
β”œβ”€β”€ models/                       # ASR model implementations
β”‚   β”œβ”€β”€ wav2vec2.py
β”‚   β”œβ”€β”€ whisper.py
β”‚   β”œβ”€β”€ distil_whisper.py
β”‚   β”œβ”€β”€ canary.py
β”‚   └── granite.py
β”œβ”€β”€ tests/                        # Test suite (36 tests)
β”œβ”€β”€ data/                         # Audio files for evaluation
β”œβ”€β”€ results/                      # Output evaluation results
β”œβ”€β”€ notebooks/                    # Jupyter notebooks
β”œβ”€β”€ run_evaluation.py             # CLI entry point
β”œβ”€β”€ asr_eval_test_matrix.csv      # Test scenarios matrix
β”œβ”€β”€ asr_eval_metrics_schema.json  # Output schema
└── requirements.txt              # Python dependencies
Enter fullscreen mode Exit fullscreen mode

Testing

pytest tests/ -v
Enter fullscreen mode Exit fullscreen mode

36 tests covering all core modules.

How I Built This Using NEO

This project was built using NEO. NEO is a fully autonomous AI engineering agent that writes code and builds solutions for AI/ML tasks, including AI model evals, prompt optimization, and end-to-end AI pipeline development.

The requirement was a systematic benchmarking framework for ASR models, one that could evaluate accuracy, speed, and robustness across real-world audio conditions, support multiple models through a common interface, and produce structured output for production decisions. The framework needed to cover five distinct model architectures, a 15-scenario test matrix, and three evaluation modes selectable from the CLI.

NEO built the full implementation:

  • The base model class in base_model.py that all five model implementations extend
  • The five model wrappers for Whisper, Wav2Vec2, Distil-Whisper, Canary, and Granite
  • The metrics engine in metrics.py computing WER, CER, accuracy, RTF, and inference time
  • The main evaluator class in evaluator.py
  • The CLI entry point in run_evaluation.py with all three evaluation modes
  • The data loader in data_loader.py
  • The JSON output schema in asr_eval_metrics_schema.json and the test scenario matrix in asr_eval_test_matrix.csv
  • The 36-test test suite

How You Can Use and Extend This With NEO

Use it before committing to an ASR model in production.
Run the full evaluation against your own audio samples using --data-path. The per-scenario breakdown shows exactly how each model performs on the conditions your application will actually encounter, not on generic benchmarks that may not reflect your use case.

Use the JSON output to build model selection pipelines.
The structured output at results/asr_eval_results_{type}_{timestamp}.json contains all the metrics needed to make a data-driven model selection decision programmatically. A script that reads the output and selects the model with the best WER for a given RTF threshold builds directly on top of the existing schema.
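
A minimal version of that selection script might look like this. The dict keys follow the JSON schema shown earlier; the sample numbers and the RTF threshold are illustrative assumptions, not real benchmark results:

```python
def select_model(results, max_rtf=1.0):
    """Return the model with the lowest mean WER among those within the RTF budget."""
    candidates = [
        (name, m["aggregate_metrics"]["mean_wer"])
        for name, m in results["model_results"].items()
        if m["aggregate_metrics"]["mean_rtf"] <= max_rtf
    ]
    return min(candidates, key=lambda c: c[1])[0] if candidates else None

# Illustrative numbers only:
sample = {"model_results": {
    "Whisper": {"aggregate_metrics": {"mean_wer": 0.048, "mean_rtf": 1.15}},
    "Distil-Whisper": {"aggregate_metrics": {"mean_wer": 0.062, "mean_rtf": 0.41}},
    "Wav2Vec2": {"aggregate_metrics": {"mean_wer": 0.086, "mean_rtf": 0.22}},
}}
print(select_model(sample, max_rtf=1.0))  # Distil-Whisper: best WER within budget
```

In practice you would load the saved file with json.load before passing it in; the function itself only depends on the schema's key names.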

Use it to evaluate cost-performance before scaling.
RTF and inference time metrics per model let you calculate the compute cost of each option at your actual call volume. The per-scenario breakdown shows where each model spends the most compute, useful for optimising before scaling a voice-enabled product.
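
As a back-of-envelope sketch of that cost calculation (every number below is an assumption, not a measured result):

```python
audio_hours_per_day = 500    # assumed daily workload
mean_rtf = 0.45              # illustrative mean RTF for a candidate model
gpu_hour_cost_usd = 1.20     # assumed cloud GPU price per hour

# RTF directly converts audio hours into compute hours.
compute_hours = audio_hours_per_day * mean_rtf
daily_cost = compute_hours * gpu_hour_cost_usd
print(f"{compute_hours:.0f} GPU-hours/day -> ${daily_cost:.2f}/day")
```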

Extend it with additional ASR models.
All five models extend base_model.py through the same interface. Adding a new ASR model available through HuggingFace Transformers means adding a new file in models/ that implements the same base class; it is then available in all three evaluation modes without touching the evaluator, metrics engine, or CLI.
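
As a rough sketch of what such a wrapper could look like: the real base-class interface in src/base_model.py is not shown in this post, so the class shape, method names, and checkpoint id below are all assumptions:

```python
class HubertModel:
    """Hypothetical wrapper for an additional HuggingFace ASR checkpoint."""

    model_id = "facebook/hubert-large-ls960-ft"  # assumed checkpoint id
    name = "HuBERT"

    def load(self):
        # Lazy import keeps module import cheap; pipeline() is the standard
        # transformers entry point for the automatic-speech-recognition task.
        from transformers import pipeline
        self.pipe = pipeline("automatic-speech-recognition", model=self.model_id)

    def transcribe(self, audio_path: str) -> str:
        return self.pipe(audio_path)["text"]
```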

Final Notes

Choosing an ASR model without systematic evaluation is a production risk. ASR Evaluation Framework removes that risk by giving you per-model, per-scenario metrics across accuracy, speed, and robustness before you deploy, with structured JSON output that makes the decision data-driven rather than intuitive.

The code is at https://github.com/dakshjain-1616/Asr-Evaluation
You can also build with NEO in your IDE using the VS Code extension or Cursor.
You can use NEO MCP with Claude Code: https://heyneo.com/claude-code
