Picking an ASR model for production is not straightforward. Whisper might be the most accurate for general English but too slow for real-time use. Wav2Vec2 might be fast enough for edge devices but struggle with accented speech. Distil-Whisper might hit the sweet spot for your use case, or it might not. Without a systematic benchmark across your actual conditions, you are guessing.
ASR Evaluation Framework is an enterprise-grade benchmarking tool that answers the questions that matter before you commit to a model:
- Which ASR model is most accurate for my use case?
- How fast can each model process audio in real-time?
- How robust is each model against background noise, accents, and degraded audio?
- What are the tradeoffs between speed and accuracy?
Features
- 5 ASR Models : IBM Granite, OpenAI Whisper, NVIDIA Canary, Distil-Whisper, Wav2Vec2
- Comprehensive Metrics : WER, CER, Accuracy, RTF, and Inference Time
- 15+ Test Scenarios : Clean speech, background noise, accents, fast/slow speech, technical terms, and more
- Flexible Evaluation Modes : Speed, accuracy, or complete evaluation
- JSON Output Schema : Standardized metrics schema for result storage
Architecture
┌───────────────────────────────────────────────────────────┐
│               run_evaluation.py (CLI Entry)               │
├─────────────┬───────────────┬────────────────┬────────────┤
│ --accuracy  │ --speed       │ --all          │ Config     │
│ Evaluate    │ Evaluate RTF  │ Complete       │ Loading    │
│ WER/CER     │ & Inference   │ Evaluation     │            │
└─────────────┴───────────────┴────────────────┴────────────┘
                            │
                   ┌────────▼────────┐
                   │    Evaluator    │
                   │ - Load models   │
                   │ - Test audio    │
                   │ - Calc metrics  │
                   └────────┬────────┘
                            │
              ┌─────────────┼─────────────┐
              │             │             │
         ┌────▼────┐   ┌────▼────┐   ┌────▼────┐
         │ Granite │   │ Whisper │   │ Wav2Vec2│  ... 5 models
         │  Model  │   │  Model  │   │  Model  │
         └────┬────┘   └────┬────┘   └────┬────┘
              └─────────────┼─────────────┘
                            │
                  ┌─────────▼──────────┐
                  │   Metrics Engine   │
                  │ - WER/CER calc     │
                  │ - RTF calc         │
                  │ - Accuracy calc    │
                  │ - Aggregation      │
                  └─────────┬──────────┘
                            │
                  ┌─────────▼──────────┐
                  │    JSON Results    │
                  │    with schema     │
                  └────────────────────┘
Model Comparison Overview
Evaluation Dimensions
Accuracy Metrics
WER : Word Error Rate. Percentage of words transcribed incorrectly compared to the reference.
CER : Character Error Rate. Character-level error rate for more detailed analysis.
Accuracy : 1 minus WER, expressed as a percentage.
Speed Metrics
RTF : Real-Time Factor. Inference time divided by audio duration. Below 1.0 means the model is real-time capable. Above 1.0 means it requires more compute than the audio duration.
Inference Time : Absolute seconds to transcribe the audio.
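As a rough illustration (not the framework's exact metrics.py code), these values can be computed with jiwer plus a wall-clock timer; the audio duration and the commented-out model call below are placeholders:
import time
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumps over a lazy dog"

wer = jiwer.wer(reference, hypothesis)       # fraction of word-level errors
cer = jiwer.cer(reference, hypothesis)       # fraction of character-level errors
accuracy = max(0.0, (1.0 - wer) * 100.0)     # reported as a percentage

audio_duration = 2.0                         # seconds of audio (placeholder)
start = time.perf_counter()
# transcription = model.transcribe(audio)    # actual ASR call goes here
inference_time = time.perf_counter() - start
rtf = inference_time / audio_duration        # below 1.0 means real-time capable

print(f"WER={wer:.4f}  CER={cer:.4f}  accuracy={accuracy:.2f}%  RTF={rtf:.2f}")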
Robustness Testing
15 test scenarios covering:
- Clean speech - baseline accuracy testing
- Background noise - office and street environments
- Accented English
- Fast and slow speech rates
- Technical vocabulary
- Whispered speech
- Phone quality audio
- Numbers and acronyms
- And more scenarios
Installation
# Create virtual environment
python3 -m venv venv
source venv/bin/activate
# Install dependencies
pip install -r requirements.txt
Requires Python 3.10+. Core dependencies:
librosa - Audio processing
numpy, scipy - Numerical computing
transformers - HuggingFace model loading
jiwer - WER and CER calculation
soundfile - Audio file I/O
pytest - Testing framework
Usage
Run Complete Evaluation
Runs accuracy and speed evaluation across all five models against all 15 test scenarios:
python run_evaluation.py --all
Run Accuracy Evaluation Only
python run_evaluation.py --accuracy
Run Speed Evaluation Only
python run_evaluation.py --speed
Specify Custom Paths
python run_evaluation.py --all \
--data-path ./my_data \
--output-path ./my_results
Results and Output
Console Output
Here is what a complete evaluation run looks like in the terminal:
============================================================
ASR EVALUATION FRAMEWORK v1.0.0
============================================================
=== RUNNING COMPLETE EVALUATION (ACCURACY + SPEED) ===
Evaluating Whisper...
Evaluating Wav2Vec2...
Evaluating Distil-Whisper...
Evaluating Canary...
Evaluating Granite...
✓ Results saved to: results/asr_eval_results_all_20260513_123045.json
============================================================
EVALUATION SUMMARY
============================================================
Model: Whisper
Status: ✓ OK
Mean Accuracy: 95.23%
Mean WER: 0.0477
Model: Wav2Vec2
Status: ✓ OK
Mean Accuracy: 91.45%
Mean WER: 0.0855
Model: Distil-Whisper
Status: ✓ OK
Mean Accuracy: 93.78%
Mean WER: 0.0622
JSON Output Format
Results are saved as structured JSON to results/asr_eval_results_{type}_{timestamp}.json. The schema includes evaluation metadata, per-model aggregate metrics, and per-scenario test results:
{
"evaluation_metadata": {
"timestamp": "2026-05-13T12:30:45.123Z",
"evaluator_version": "1.0.0",
"models_tested": ["Whisper", "Wav2Vec2", "Distil-Whisper"],
"test_scenarios": 15,
"evaluation_type": "all"
},
"model_results": {
"Whisper": {
"model_name": "Whisper",
"model_id": "openai/whisper-base",
"initialized": true,
"aggregate_metrics": {
"mean_accuracy": 95.23,
"mean_wer": 0.0477,
"mean_cer": 0.0234,
"mean_rtf": 1.15,
"std_wer": 0.0145
},
"test_results": [
{
"test_id": 1,
"test_name": "clean_english",
"wer": 0.032,
"cer": 0.015,
"accuracy": 96.8,
"inference_time": 2.34,
"rtf": 1.17
}
]
}
},
"summary": {
"total_tests": 15,
"evaluation_type": "all",
"status": "completed"
}
}
The per-scenario test_results array shows exactly how each model performed under each specific condition, not just aggregated averages, which is what makes the output useful for production decisions.
Configuration
Environment variables, documented in .env.example:
HUGGINGFACE_TOKEN : HuggingFace API token for model loading
OPENAI_API_KEY : OpenAI API key
ASR_EVAL_DATA_PATH : data directory path
ASR_EVAL_RESULTS_PATH : results output path
VERBOSE : enable verbose logging
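A minimal .env might look like the following; the values are placeholders, and only the variable names come from the list above:
HUGGINGFACE_TOKEN=hf_your_token_here
OPENAI_API_KEY=sk-your-key-here
ASR_EVAL_DATA_PATH=./data
ASR_EVAL_RESULTS_PATH=./results
VERBOSE=true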
Test Matrix
15 test scenarios covering four categories:
Clean Speech - Baseline accuracy testing
Robustness - Background noise, accents, variable speech rates
Challenging Conditions - Whispered speech, music, phone quality audio
Domain-Specific - Technical vocabulary, numbers, acronyms
Metrics
Accuracy Metrics
WER (Word Error Rate) - Percentage of words that differ from reference
CER (Character Error Rate) - Percentage of characters that differ
Accuracy - 1 minus WER, expressed as a percentage
Speed Metrics
RTF (Real-Time Factor) - Inference time divided by audio duration. Below 1.0 is real-time capable.
Inference Time - Total time to transcribe audio in seconds
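For example, taking the sample numbers from the JSON output above: an inference time of 2.34 seconds at an RTF of 1.17 corresponds to roughly 2 seconds of audio, so that clip was transcribed slightly slower than real time.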
Model Details
When to Use This Framework
Benchmarking ASR models before production deployment : run a full evaluation before committing to a model, not after.
Comparing model tradeoffs : speed versus accuracy decisions are data-driven rather than based on published benchmarks that may not reflect your audio conditions.
Testing robustness against real-world audio : the 15 test scenarios cover conditions that synthetic benchmarks miss: phone quality audio, background noise, accents, and technical vocabulary.
Evaluating cost-performance of different models : RTF and inference time metrics let you calculate the compute cost of each model at your actual workload.
Quality assurance in voice-enabled applications : run evaluations to catch model regressions before they reach production.
Research and academic speech recognition studies : the standardized JSON output schema makes results comparable and reproducible across experiments.
Real-World Scenarios
Scenario 1 - Call Center AI
- Evaluate which model handles phone quality audio best
- Test robustness against background noise
- Measure inference speed for cost calculation
- Result: Select fastest model that maintains accuracy
Scenario 2 - Voice Assistant
- Test against various accents and speech rates
- Evaluate technical command recognition
- Measure real-time performance on edge devices
- Result: Pick model that runs on-device with good accuracy
Scenario 3 - Transcription Service
- Benchmark accuracy across multiple languages
- Evaluate cost versus accuracy tradeoffs
- Test on domain-specific vocabulary
- Result: Choose optimal model for service tier
Project Structure
.
├── src/                          # Core modules
│   ├── config.py                 # Configuration
│   ├── metrics.py                # Metric calculations
│   ├── data_loader.py            # Data loading utilities
│   ├── base_model.py             # ASR model base class
│   └── evaluator.py              # Main evaluator class
├── models/                       # ASR model implementations
│   ├── wav2vec2.py
│   ├── whisper.py
│   ├── distil_whisper.py
│   ├── canary.py
│   └── granite.py
├── tests/                        # Test suite (36 tests)
├── data/                         # Audio files for evaluation
├── results/                      # Output evaluation results
├── notebooks/                    # Jupyter notebooks
├── run_evaluation.py             # CLI entry point
├── asr_eval_test_matrix.csv      # Test scenarios matrix
├── asr_eval_metrics_schema.json  # Output schema
└── requirements.txt              # Python dependencies
Testing
pytest tests/ -v
36 tests covering all core modules.
How I Built This Using NEO
This project was built using NEO. NEO is a fully autonomous AI engineering agent that writes code and builds solutions for AI/ML tasks, including AI model evals, prompt optimization, and end-to-end AI pipeline development.
The requirement was a systematic benchmarking framework for ASR models, one that could evaluate accuracy, speed, and robustness across real-world audio conditions, support multiple models through a common interface, and produce structured output for production decisions. The framework needed to cover five distinct model architectures, a 15-scenario test matrix, and three evaluation modes selectable from the CLI.
NEO built the full implementation:
- the base model class in base_model.py that all five model implementations extend
- the five model wrappers for Whisper, Wav2Vec2, Distil-Whisper, Canary, and Granite
- the metrics engine in metrics.py, computing WER, CER, accuracy, RTF, and inference time
- the main evaluator class in evaluator.py
- the CLI entry point in run_evaluation.py with all three evaluation modes
- the data loader in data_loader.py
- the JSON output schema in asr_eval_metrics_schema.json and the test scenario matrix in asr_eval_test_matrix.csv
- the 36-test test suite
How You Can Use and Extend This With NEO
Use it before committing to an ASR model in production.
Run the full evaluation against your own audio samples using --data-path. The per-scenario breakdown shows exactly how each model performs on the conditions your application will actually encounter, not on generic benchmarks that may not reflect your use case.
Use the JSON output to build model selection pipelines.
The structured output at results/asr_eval_results_{type}_{timestamp}.json contains all the metrics needed to make a data-driven model selection decision programmatically. A script that reads the output and selects the model with the best WER for a given RTF threshold builds directly on top of the existing schema.
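Here is a sketch of what such a script could look like; the results file name and the RTF threshold are examples, and only the schema keys come from the output format shown above:
import json

RTF_THRESHOLD = 1.0  # only consider models that can keep up with real time

with open("results/asr_eval_results_all_20260513_123045.json") as f:
    results = json.load(f)

candidates = []
for name, model in results["model_results"].items():
    metrics = model["aggregate_metrics"]
    if model["initialized"] and metrics["mean_rtf"] <= RTF_THRESHOLD:
        candidates.append((metrics["mean_wer"], name))

if candidates:
    best_wer, best_model = min(candidates)  # lowest mean WER among fast-enough models
    print(f"Selected {best_model} (mean WER {best_wer:.4f})")
else:
    print("No model meets the RTF threshold")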
Use it to evaluate cost-performance before scaling.
RTF and inference time metrics per model let you calculate the compute cost of each option at your actual call volume. The per-scenario breakdown shows where each model spends the most compute, which is useful for optimizing before scaling a voice-enabled product.
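As a rough illustration: a model with a mean RTF of 1.15 needs about 1.15 compute-hours per hour of audio on the same hardware, so 1,000 hours of daily call volume translates to roughly 1,150 compute-hours per day before any batching or parallelism.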
Extend it with additional ASR models.
All five models extend base_model.py and follow the same interface. Adding a new ASR model available through HuggingFace Transformers means adding one file in models/ that implements the same base class; it is then available in all three evaluation modes without touching the evaluator, metrics engine, or CLI. A rough sketch of what a new wrapper might look like is below; the class name, method names, and import path are assumptions for illustration, and the real interface is defined in src/base_model.py.
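# models/my_new_model.py -- illustrative only; match the actual base class API.
from transformers import pipeline
from src.base_model import BaseASRModel  # assumed import path and class name

class MyNewModel(BaseASRModel):
    """Wrapper for a hypothetical HuggingFace ASR checkpoint."""

    def __init__(self):
        super().__init__()
        self.model_name = "MyNewModel"
        self.model_id = "some-org/some-asr-checkpoint"  # placeholder model id
        self.pipe = None

    def initialize(self):
        # Load the checkpoint via the HuggingFace ASR pipeline.
        self.pipe = pipeline("automatic-speech-recognition", model=self.model_id)
        return True

    def transcribe(self, audio_path: str) -> str:
        # Return plain text so the metrics engine can score it against the reference.
        return self.pipe(audio_path)["text"]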
Final Notes
Choosing an ASR model without systematic evaluation is a production risk. ASR Evaluation Framework removes that risk by giving you per-model, per-scenario metrics across accuracy, speed, and robustness before you deploy, with structured JSON output that makes the decision data-driven rather than intuitive.
The code is at https://github.com/dakshjain-1616/Asr-Evaluation
You can also build with NEO in your IDE using the VS Code extension or Cursor.
You can use NEO MCP with Claude Code: https://heyneo.com/claude-code