OMOTAYO OMOYEMI

Updating ASR examples in Hugging Face Transformers: Hub datasets, clearer args, smoother Windows setup

I recently merged a docs/examples update to 🤗 Transformers. Here's what it fixes and why.

The problem (why this change was needed)

Some users following the Automatic Speech Recognition (ASR) examples hit setup errors because the older instructions referenced local dataset scripts. The modern 🤗 Datasets library expects datasets to be loaded from the Hub (e.g., Common Voice, pinned by version and language), so commands like --dataset_name="common_voice" were fragile or ambiguous, especially on Windows.

What changed (at a glance)

  • Use explicit Hub dataset IDs in CTC commands
    Example: mozilla-foundation/common_voice_17_0 (versioned, reproducible)

  • Clarify arguments in the example script help
    dataset_name → the dataset ID on the Hub
    dataset_config_name → the subset/language (e.g., en, tr, clean)

  • Tiny Windows notes
    How to activate venv in PowerShell
    How to run formatters without make
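To make the two arguments concrete, here's a minimal sketch (not the actual example script, which defines many more options) of how `dataset_name` and `dataset_config_name` are typically parsed and then passed straight through to `datasets.load_dataset`:

```python
import argparse

# Minimal sketch of the two arguments discussed above; the real
# run_speech_recognition_ctc.py defines many more options.
parser = argparse.ArgumentParser()
parser.add_argument(
    "--dataset_name", type=str,
    help="Dataset ID on the Hugging Face Hub, e.g. mozilla-foundation/common_voice_17_0")
parser.add_argument(
    "--dataset_config_name", type=str,
    help="Subset/language of the dataset, e.g. en, tr, clean")

args = parser.parse_args([
    "--dataset_name=mozilla-foundation/common_voice_17_0",
    "--dataset_config_name=tr",
])

# These two values map directly onto datasets.load_dataset(name, config):
#   load_dataset(args.dataset_name, args.dataset_config_name, split="train")
print(args.dataset_name, args.dataset_config_name)
```

In other words: the first value picks the repository on the Hub, the second picks the configuration inside it.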

Before vs After (one line that mattered)

Before

--dataset_name="common_voice" \
--dataset_config_name="tr" \

After

--dataset_name="mozilla-foundation/common_voice_17_0" \
--dataset_config_name="tr" \

This small change prevents the “dataset scripts are no longer supported” error and makes runs reproducible.
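If you have older commands or scripts lying around, a tiny (hypothetical) helper can translate legacy short names to their versioned Hub IDs. The mapping below covers only the Common Voice case from this post; extend it for other datasets you use:

```python
# Hypothetical helper: translate legacy short dataset names to
# versioned Hub IDs. Only the mapping shown in this post is included.
LEGACY_TO_HUB_ID = {
    "common_voice": "mozilla-foundation/common_voice_17_0",
}

def to_hub_dataset_id(name: str) -> str:
    """Return a versioned Hub dataset ID for a legacy short name;
    names that are already Hub IDs pass through unchanged."""
    return LEGACY_TO_HUB_ID.get(name, name)

print(to_hub_dataset_id("common_voice"))
print(to_hub_dataset_id("mozilla-foundation/common_voice_17_0"))
```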

Quickstart: CTC finetuning (Common Voice, Turkish)

python run_speech_recognition_ctc.py \
  --dataset_name="mozilla-foundation/common_voice_17_0" \
  --dataset_config_name="tr" \
  --model_name_or_path="facebook/wav2vec2-large-xlsr-53" \
  --output_dir="./wav2vec2-common_voice-tr-demo" \
  --overwrite_output_dir \
  --num_train_epochs="15" \
  --per_device_train_batch_size="16" \
  --gradient_accumulation_steps="2" \
  --learning_rate="3e-4" \
  --warmup_steps="500" \
  --eval_strategy="steps" \
  --text_column_name="sentence" \
  --length_column_name="input_length" \
  --save_steps="400" \
  --eval_steps="100" \
  --layerdrop="0.0" \
  --save_total_limit="3" \
  --freeze_feature_encoder \
  --gradient_checkpointing \
  --fp16 \
  --group_by_length \
  --push_to_hub \
  --do_train --do_eval

Change dataset_config_name to your language (e.g., en, hi, …).
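If you want to sweep several languages, building the command as a Python list keeps the flags readable and makes the config a one-argument change. A sketch (launch it with `subprocess.run` when you're ready; only a few of the flags from the full command above are shown):

```python
# Sketch: build the CTC finetuning command for a given language so
# the Common Voice config is easy to swap.
def build_ctc_command(lang: str) -> list:
    return [
        "python", "run_speech_recognition_ctc.py",
        "--dataset_name=mozilla-foundation/common_voice_17_0",
        f"--dataset_config_name={lang}",
        "--model_name_or_path=facebook/wav2vec2-large-xlsr-53",
        f"--output_dir=./wav2vec2-common_voice-{lang}-demo",
        "--do_train", "--do_eval",
        # ...plus the remaining flags from the full command above
    ]

cmd = build_ctc_command("tr")
print(cmd)
# To launch: subprocess.run(cmd, check=True)
```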

Windows notes (PowerShell)

# activate venv
.\.venv\Scripts\Activate.ps1

# if 'make' isn't available, run formatters directly
python -m black <changed_paths>
python -m ruff check <changed_paths> --fix
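As a small cross-platform convenience, this sketch prints the venv activation command for the current OS (assuming the default .venv layout used above):

```python
import os

def venv_activate_hint(os_name: str = os.name) -> str:
    """Return the venv activation command for the given os.name
    ('nt' for Windows/PowerShell; anything else treated as POSIX)."""
    if os_name == "nt":
        return r".\.venv\Scripts\Activate.ps1"   # PowerShell
    return "source .venv/bin/activate"           # bash/zsh

print(venv_activate_hint())
```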

Impact

  • Fewer setup failures (no local script pitfalls)
  • Reproducible examples (pinned, versioned datasets)
  • Better cross-platform DX (Windows included)
  • Consistency with current 🤗 Datasets guidance

Proof (validation)

Thanks to the Transformers maintainers for the review and merge!

If you try the example or want to improve the docs further, drop a comment or open a PR. 🙌
