Daniel Guerrero

Nvidia NeMo Speech Recognition Starting Guide

After reading an article mentioning that Nvidia had several models at the top of the Hugging Face ASR Leaderboard, I wanted to test them on my local computer.
Even though the code from Hugging Face looks pretty simple, it turned out not to work for nvidia/canary-qwen-2.5b, so I started to dig a bit deeper and test several features.

Base Setup

To test these models you need a base setup. I'm using Docker, so the options are:

  1. Using the Nvidia PyTorch container
  2. Using a Python image with CUDA-enabled libraries

Nvidia PyTorch container

This is the simplest option, but of course it contains a lot of libraries you may not need; the size of the container is 12.78 GB.

docker run \
  --gpus all \
  -it \
  --rm \
  --shm-size=16g \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  nvcr.io/nvidia/pytorch:25.06-py3

Python image with CUDA libraries

docker run \
  --gpus all \
  -it \
  --rm \
  python:3.12-bookworm \
  /bin/bash

Set up the CUDA libraries:

apt update && \
  apt install -y wget && \
  wget https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb && \
  dpkg -i cuda-keyring_1.1-1_all.deb && \
  apt-get update && \
  apt-get -y install --no-install-recommends cuda-toolkit-12-9

Note: the resulting image is around 11.5 GB, so in the end it is probably not much different from the Nvidia container image.

Setup NeMo libraries

This is pretty simple:

pip install "nemo-toolkit[asr]"
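
To make sure the installation works before downloading any model, a quick sanity check like the one below is handy. This is just a small illustrative script, not part of the original setup steps:

# sanity_check.py: confirm PyTorch sees the GPU and the NeMo ASR collection imports cleanly
import torch
import nemo.collections.asr as nemo_asr

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
print("NeMo ASR collection loaded:", nemo_asr.__name__)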

Setup ffmpeg

For the examples below it should not strictly be needed, but the libraries will complain that it is missing, and it is good to have for creating proper input files.

cd /tmp
wget https://github.com/BtbN/FFmpeg-Builds/releases/download/latest/ffmpeg-master-latest-linux64-gpl.tar.xz
tar xvf ffmpeg-master-latest-linux64-gpl.tar.xz 
cp ffmpeg-master-latest-linux64-gpl/bin/* /usr/bin/
rm -rf ffmpeg-master-latest-linux64-gpl ffmpeg-master-latest-linux64-gpl.tar.xz
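
If you want to confirm the binaries are picked up, a tiny check along these lines works (purely illustrative, not part of the original steps):

# Check that ffmpeg is on PATH and print its version
import shutil
import subprocess

ffmpeg_path = shutil.which("ffmpeg")
print("ffmpeg found at:", ffmpeg_path)
if ffmpeg_path:
    subprocess.run([ffmpeg_path, "-version"], check=True)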

Get audio samples

You need audio samples that are:

  • single channel (mono)
  • 16 kHz sample rate
  • less than 20 seconds long

A sample used in several of the official examples is this one:
https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav

so you can download it:

wget https://dldata-public.s3.us-east-2.amazonaws.com/2086-149220-0033.wav

This sample already meets the requirements above, so no processing is needed; but if you want to create a proper sample from any other file, you can do it with ffmpeg:

ffmpeg -i INPUT_FILE -ac 1 -ar 16000 example.wav

A more complex case is extracting just a part of the audio. The following command grabs 10 seconds starting at second 29, so it extracts the audio from 29s to 39s:

ffmpeg -i INPUT_FILE -ss 29 -t 10 -ac 1 -ar 16000 example.wav
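
If you want to double-check that a file meets the requirements before feeding it to a model, the standard-library wave module is enough. A minimal sketch (the check_wav helper is made up for this example):

import wave

# Report channels, sample rate and duration of a WAV file and compare against the requirements
def check_wav(path):
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        duration = wav.getnframes() / sample_rate
    print(f"{path}: {channels} channel(s), {sample_rate} Hz, {duration:.1f}s")
    return channels == 1 and sample_rate == 16000 and duration < 20

print(check_wav("2086-149220-0033.wav"))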

Code example

The following file (asr_example.py) will help you test the different models:

import argparse

import nemo.collections.asr as nemo_asr

parser = argparse.ArgumentParser(prog="ASR NeMo Example")
parser.add_argument(
    "--enable-timestamps",
    help="Enable timestamps",
    action=argparse.BooleanOptionalAction,
)

parser.add_argument(
    "model_name",
    help="Name of the model like 'nvidia/canary-1b-flash'",
)

parser.add_argument(
    "input_file",
    help="Path of the wav file, must be 16000Hz and 1 channel",
)
args = parser.parse_args()

# Download the pretrained model (or load it from the local cache)
asr_model = nemo_asr.models.ASRModel.from_pretrained(args.model_name)

# Transcribe the input file; timestamps are only produced when requested
transcriptions = asr_model.transcribe(
    args.input_file,
    timestamps=args.enable_timestamps,
)

for idx, transcript in enumerate(transcriptions):
    print(f"[{idx}] {transcript.text}")
    if args.enable_timestamps:
        # Word-level timestamps: one entry per word with start/end in seconds
        for stamp in transcript.timestamp["word"]:
            word = stamp["word"]
            output_line = f"{stamp['start']:0>5.2f}"
            output_line += f"-{stamp['end']:0>5.2f}"
            output_line += f": {word}"

            print(output_line)

Here is a list of models: https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/asr/models.html
Note that the list is not fully up to date; the most recent model is nvidia/canary-qwen-2.5b, but that one won't work with the code above.
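
For reference, the Hugging Face model card for nvidia/canary-qwen-2.5b loads it through the speechlm2 collection instead of ASRModel. A starting point should look roughly like the sketch below, but I have not verified it in this setup, so treat the class name and the generate() arguments as assumptions taken from the model card:

from nemo.collections.speechlm2.models import SALM

# Load canary-qwen-2.5b via the speechlm2 collection (recipe adapted from the model card, unverified here)
model = SALM.from_pretrained("nvidia/canary-qwen-2.5b")

# Ask the model to transcribe an audio file referenced through its audio placeholder tag
answer_ids = model.generate(
    prompts=[[{
        "role": "user",
        "content": f"Transcribe the following: {model.audio_locator_tag}",
        "audio": ["2086-149220-0033.wav"],
    }]],
    max_new_tokens=128,
)
print(model.tokenizer.ids_to_text(answer_ids[0].cpu()))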

Testing the code

You need to provide the model name and the input file, so you can call it like this:

python3 asr_example.py \
  nvidia/canary-1b-flash \
  2086-149220-0033.wav

This will output:

[0] Well, I don't wish to see it any more, observed Phoebe, turning away her eyes. It is certainly very like the old portrait.

You can enable per-word timestamps (not all models support timestamps; if a model doesn't support them, it will raise an error):

python3 asr_example.py \
  --enable-timestamps \
  nvidia/canary-1b-flash \
  2086-149220-0033.wav

This will output:

[0] Well I don't wish to see it any more observed Phoebe turning away her eyes it is certainly very like the old portrait
00.32-00.40: Well
00.56-00.72: I
00.72-01.04: don't
01.04-01.28: wish
01.28-01.36: to
01.44-01.52: see
01.60-01.68: it
01.76-01.84: any
01.92-02.00: more
02.24-02.64: observed
02.64-03.12: Phoebe
03.36-03.68: turning
03.76-03.84: away
04.08-04.16: her
04.24-04.48: eyes
04.96-05.04: it
05.12-05.20: is
05.36-05.76: certainly
05.84-05.92: very
06.08-06.16: like
06.24-06.32: the
06.40-06.48: old
06.64-07.12: portrait

Multilanguage

One of the cool features of the Canary family is the support for multiple input languages (English, German, French, Spanish), and it can even translate the output.
I will use one file from this dataset: https://www.kaggle.com/datasets/carlfm01/120h-spanish-speech

To specify the language, instead of passing the wav file directly you need to create an input manifest JSON.
The format looks like this:

{
    "audio_filepath": "FILE.wav",
    "duration": 10, 
    "source_lang": "es",
    "target_lang": "en"
}

But the trick is that the input file is actually a text file where each line is a JSON entry, so input-spanish.json must be:

{"audio_filepath": "0000df16-47ea-428f-8367-df2ce365d5c4.wav","duration": 9, "source_lang": "es","target_lang": "es"}
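
Since it is one JSON object per line (JSON Lines), generating the manifest from Python avoids quoting mistakes. A minimal sketch using the same file name and fields as the entry above:

import json

# Write a JSON Lines manifest: one JSON object per line
entries = [
    {
        "audio_filepath": "0000df16-47ea-428f-8367-df2ce365d5c4.wav",
        "duration": 9,
        "source_lang": "es",
        "target_lang": "es",
    },
]
with open("input-spanish.json", "w") as manifest:
    for entry in entries:
        manifest.write(json.dumps(entry) + "\n")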

And run it with:

python3 asr_example.py \
  --enable-timestamps \
  nvidia/canary-1b-flash \
  input-spanish.json

The output will be:

[0] con efeto, su lenguaje y singulares maneras me divertían extraordinariamente, porque nuestro hombre era un verdadero andaluz,
00.00-00.08: con
00.48-01.04: efeto,
01.12-01.20: su
01.36-01.84: lenguaje
01.92-02.00: y
02.08-02.64: singulares
02.72-03.04: maneras
03.20-03.28: me
03.36-03.92: divertían
04.08-05.92: extraordinariamente,
05.92-06.00: porque
06.40-06.48: nuestro
06.88-06.96: hombre
07.28-07.36: era
07.52-07.60: un
07.68-08.16: verdadero
08.24-08.96: andaluz,

And if you want to translate into English, the input-spanish.json must be:

{"audio_filepath": "0000df16-47ea-428f-8367-df2ce365d5c4.wav","duration": 9, "source_lang": "es","target_lang": "en"}

In this case the output of the same command will be:

[0] with effect his language and singular manners amused me extraordinarily because our man was a true Andalusian
00.00-00.08: with
00.48-00.56: effect
01.12-01.20: his
01.36-01.76: language
01.84-01.92: and
02.00-02.56: singular
02.64-03.04: manners
03.20-03.84: amused
04.08-04.16: me
04.24-05.28: extraordinarily
05.92-06.00: because
06.40-06.48: our
06.80-06.88: man
07.20-07.28: was
07.44-07.52: a
07.60-07.68: true
08.16-08.80: Andalusian
