

Data Augmentation for Speech Recognition

Edward Ma ・6 min read

Automatic Speech Recognition (ASR)

Photo by Edward Ma on Unsplash

The objective of speech recognition is to convert audio into text. The technology is widely used in daily life: Google Assistant and Amazon Alexa, for example, take our voice as input and convert it to text in order to understand our intent.

As with other NLP problems, one critical challenge is the lack of an adequate volume of training data, which leads to overfitting and makes it hard to handle unseen data. The Google Brain team, working with an AI Resident, tackled this problem by introducing several data augmentation methods for speech recognition. This story discusses SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition (Park et al., 2019) and covers the following:

  • Data
  • Architecture
  • Experiment


Data

To process the data, the waveform audio is converted to a spectrogram, which is fed to a neural network to generate the output. Traditionally, data augmentation is applied to the waveform. Park et al. take a different approach and manipulate the spectrogram instead.

Waveform audio to spectrogram (Google Brain)

Given a spectrogram, you can view it as an image where the x-axis is time and the y-axis is frequency.

Spectrogram representation (librosa)

Intuitively, this speeds up training because the augmentation is applied directly to the spectrogram, so no extra transformation between waveform data and spectrogram data is needed.

Park et al. introduced SpecAugment for data augmentation in speech recognition. It consists of 3 basic augmentation methods: time warping, frequency masking and time masking. In their experiments, they combine these methods into 4 different policies: LibriSpeech basic (LB), LibriSpeech double (LD), Switchboard mild (SM) and Switchboard strong (SS).

Time Warping

A random point along the horizontal line passing through the center of the spectrogram image is selected and warped to either the left or the right by a distance w, which is chosen from a uniform distribution from 0 to the time warp parameter W.
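As a rough illustration of the idea, the sketch below warps each frequency row along the time axis with a piecewise-linear mapping. This is a simplification and not the paper's method, which uses an image warp; the function name `time_warp`, the list-of-rows layout and the linear interpolation are all assumptions made for this example.

```python
import random

def time_warp(spec, W, rng=random):
    """Warp every frequency row of `spec` along the time axis.

    `spec` is a list of rows (frequency channels), each with tau time steps.
    Simplified piecewise-linear stand-in for the image warp in the paper.
    """
    tau = len(spec[0])
    if tau < 2 * W + 3:
        return [row[:] for row in spec]      # too short to warp safely
    c = rng.randrange(W + 1, tau - W - 1)    # source point, strictly inside (W, tau - W)
    w = rng.randint(0, W)                    # warp distance in [0, W]
    c_new = c + rng.choice([-1, 1]) * w      # where the point lands after warping
    last = tau - 1
    warped = []
    for row in spec:
        out = []
        for j in range(tau):
            # Map output step j back to a (fractional) source position.
            if j <= c_new:
                src = j * c / c_new
            else:
                src = c + (j - c_new) * (last - c) / (last - c_new)
            src = min(max(src, 0.0), float(last))
            lo = int(src)
            hi = min(lo + 1, last)
            frac = src - lo
            # Linear interpolation between the two neighbouring time steps.
            out.append(row[lo] * (1 - frac) + row[hi] * frac)
        warped.append(out)
    return warped
```

The first and last time steps stay fixed, so only the content in between is stretched on one side of the chosen point and compressed on the other.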

Frequency Masking

Frequency channels [f0, f0 + f) are masked. f is chosen from a uniform distribution from 0 to the frequency mask parameter F, and f0 is chosen from [0, ν − f), where ν is the number of frequency channels.
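A minimal sketch of frequency masking, assuming the spectrogram is stored as a plain list of ν rows (one per frequency channel). The name `frequency_mask`, the integer sampling and the mask value of 0.0 are choices made for this example, not taken from the paper's code.

```python
import random

def frequency_mask(spec, F, rng=random, mask_value=0.0):
    """Mask f consecutive frequency channels [f0, f0 + f).

    Returns a masked copy of `spec` plus the sampled (f0, f).
    """
    nu = len(spec)                           # number of frequency channels
    f = rng.randint(0, min(F, nu))           # mask width, uniform over 0..F
    f0 = rng.randrange(0, max(1, nu - f))    # mask start in [0, nu - f)
    masked = [row[:] for row in spec]        # leave the input untouched
    for channel in range(f0, f0 + f):
        masked[channel] = [mask_value] * len(masked[channel])
    return masked, (f0, f)
```

In practice the same logic is usually written with array slicing (for example on a NumPy array); lists are used here only to keep the sketch dependency-free.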

Time Masking

t consecutive time steps [t0, t0 + t) are masked. t is chosen from a uniform distribution from 0 to the time mask parameter T, and t0 is chosen from [0, τ − t), where τ is the number of time steps.
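Time masking can be sketched the same way, this time blanking columns instead of rows. As before, the name `time_mask`, the list-of-rows layout and the mask value of 0.0 are assumptions for illustration.

```python
import random

def time_mask(spec, T, rng=random, mask_value=0.0):
    """Mask t consecutive time steps [t0, t0 + t) in every frequency row.

    Returns a masked copy of `spec` plus the sampled (t0, t).
    """
    tau = len(spec[0])                       # number of time steps
    t = rng.randint(0, min(T, tau))          # mask width, uniform over 0..T
    t0 = rng.randrange(0, max(1, tau - t))   # mask start in [0, tau - t)
    masked = [row[:] for row in spec]        # leave the input untouched
    for row in masked:
        for step in range(t0, t0 + t):
            row[step] = mask_value
    return masked, (t0, t)
```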

From top to bottom, the figures depict the log mel spectrogram of the base input with no augmentation, time warp, frequency masking and time masking applied. (Park et al., 2019)

Combination of basic augmentation policies

By combining the basic augmentation methods with different parameters, 4 augmentation policies are introduced. The symbols denote:

  • W: Time warping parameter
  • F: Frequency masking parameter
  • mF: Number of frequency masks applied
  • T: Time masking parameter
  • mT: Number of time masks applied

Configuration for LB, LD, SM and SS (Park et al., 2019)
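To show how the parameters F, mF, T and mT interact, here is a small sketch that applies mF frequency masks followed by mT time masks. Time warping is left out for brevity, and the function name `apply_policy` as well as the parameter values in the usage line are illustrative placeholders, not the LB, LD, SM or SS settings.

```python
import random

def apply_policy(spec, F, m_F, T, m_T, rng=random, mask_value=0.0):
    """Apply m_F frequency masks and m_T time masks to a copy of `spec`.

    `spec` is a list of nu rows (frequency channels) with tau time steps each.
    """
    nu, tau = len(spec), len(spec[0])
    out = [row[:] for row in spec]
    for _ in range(m_F):                         # frequency masks
        f = rng.randint(0, min(F, nu))
        f0 = rng.randrange(0, max(1, nu - f))
        for channel in range(f0, f0 + f):
            out[channel] = [mask_value] * tau
    for _ in range(m_T):                         # time masks
        t = rng.randint(0, min(T, tau))
        t0 = rng.randrange(0, max(1, tau - t))
        for row in out:
            row[t0:t0 + t] = [mask_value] * t
    return out

# Hypothetical parameter values, chosen only to exercise the function.
toy = [[1.0] * 20 for _ in range(10)]
augmented = apply_policy(toy, F=2, m_F=2, T=3, m_T=2, rng=random.Random(0))
```

Masks may overlap, so the total masked area is at most mF × F channels and mT × T time steps.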

From top to bottom, the figures depict the log mel spectrogram of the base input with policies None, LB and LD applied. (Park et al., 2019)


Architecture

Listen, Attend and Spell (LAS) Network Architecture

Park et al. use the LAS network architecture to compare performance with and without data augmentation. It consists of 2 Convolutional Neural Network (CNN) layers, attention and stacked bi-directional LSTMs. As the paper focuses on data augmentation and the model serves only to measure its impact, you can take a deeper dive into LAS from here.

Learning Rate Schedules

The learning rate schedule turns out to be a critical factor in determining model performance. Similar to slanted triangular learning rates (STLR), a non-static learning rate is applied. The learning rate is ramped up, held, and then decayed exponentially until it reaches 1/100 of its maximum value, and it is kept constant beyond this point. The parameters are denoted:

  • sr: Step at which the ramp-up (from zero learning rate) is complete
  • si: Step at which the exponential decay starts
  • sf: Step at which the exponential decay stops

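Two further factors are part of the training schedule. Variational weight noise is turned on at a given step and kept for the rest of training, and uniform label smoothing is applied: the correct class label is assigned the confidence 0.9, while the confidence of the other labels is increased accordingly. The additional parameter is denoted:

  • snoise: Step at which variational weight noise is turned on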
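Uniform label smoothing as described above can be sketched in a few lines. The helper name `smooth_labels` is hypothetical, and splitting the remaining 0.1 evenly across the other classes is an assumption about what "increased accordingly" means.

```python
def smooth_labels(num_classes, correct, uncertainty=0.1):
    """Return a smoothed target distribution for one label.

    The correct class gets 1 - uncertainty (0.9 by default); the rest of the
    probability mass is shared evenly by the other classes (assumption).
    """
    if num_classes < 2:
        raise ValueError("label smoothing needs at least two classes")
    off_value = uncertainty / (num_classes - 1)
    target = [off_value] * num_classes
    target[correct] = 1.0 - uncertainty
    return target
```

For example, with 5 classes the correct class gets 0.9 and each of the other four gets 0.025, so the target still sums to 1.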

For the later experiments, three standard learning rate schedules are defined:

  1. B(asic): (sr, snoise, si, sf) = (0.5k, 10k, 20k, 80k)
  2. D(ouble): (sr, snoise, si, sf) = (1k, 20k, 40k, 160k)
  3. L(ong): (sr, snoise, si, sf) = (1k, 20k, 140k, 320k)
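The ramp-up, hold and decay phases can be written as a small function of the training step. This is a sketch under assumptions: the linear shape of the ramp-up and the `peak` value are placeholders chosen for the example, and `learning_rate` is a hypothetical helper name. Only the step values in `SCHEDULES` come from the list above.

```python
# (s_r, s_noise, s_i, s_f) for the three schedules listed above.
SCHEDULES = {
    "B": (500, 10_000, 20_000, 80_000),
    "D": (1_000, 20_000, 40_000, 160_000),
    "L": (1_000, 20_000, 140_000, 320_000),
}

def learning_rate(step, s_r, s_i, s_f, peak=1e-3):
    """Ramp up to `peak`, hold, then decay exponentially to peak / 100."""
    if step < s_r:
        return peak * step / s_r             # ramp-up from zero (assumed linear)
    if step <= s_i:
        return peak                          # hold at the maximum value
    if step >= s_f:
        return peak / 100                    # constant after the decay stops
    progress = (step - s_i) / (s_f - s_i)    # 0 at s_i, 1 at s_f
    return peak * 0.01 ** progress           # exponential decay

s_r, s_noise, s_i, s_f = SCHEDULES["B"]
rates = [learning_rate(s, s_r, s_i, s_f) for s in (0, 250, 500, 20_000, 50_000, 80_000)]
```

s_noise does not affect the learning rate itself; it only marks when weight noise is switched on, which is why the function does not take it as an argument.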

Language Models (LM)

An LM is applied to further boost model performance. In general, an LM is designed to predict the next token given the sequence of previous tokens. Once a new token is predicted, it is treated as a "previous token" when predicting the following tokens. This autoregressive approach is used in many modern NLP models such as GPT-2.
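The feedback loop described above, where each predicted token becomes context for the next prediction, can be illustrated with a toy bigram model. This is not the LM used in the paper; the corpus, the function names and the greedy decoding are all made up for this example.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Count which word follows which, with <s> and </s> as boundary tokens."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def generate(counts, max_tokens=10):
    """Greedily predict the next token and feed it back in as context."""
    tokens = ["<s>"]
    while len(tokens) < max_tokens:
        candidates = counts[tokens[-1]]
        if not candidates:
            break
        next_token = candidates.most_common(1)[0][0]
        if next_token == "</s>":
            break
        tokens.append(next_token)            # prediction becomes a "previous token"
    return tokens[1:]

model = train_bigram(["turn on the light", "turn on the radio", "turn on the light"])
sentence = generate(model)
```

A real LM conditions on the whole prefix rather than only the last word, but the loop has the same shape.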


Experiment

Model performance is measured by Word Error Rate (WER).

In the figure below, "Sch" denotes the learning rate schedule and "Pol" denotes the augmentation policy. We can see that LAS with 6 LSTM layers and 1280-dimensional embedding vectors achieves the best result.
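WER is commonly computed as the word-level edit distance (substitutions, insertions and deletions) between the reference and the recognized text, divided by the number of reference words. The sketch below follows that common definition; `word_error_rate` is a hypothetical helper and not the evaluation script used in the paper.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)
```

For instance, recognizing "turn of the light" for the reference "turn on the light" gives one substitution over four words, a WER of 0.25.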

Evaluation of LibriSpeech (Park et al., 2019)

LAS-6-1280 with SpecAugment achieves the best result compared with other models and with LAS without data augmentation.

Comparing SpecAugment method in LibriSpeech 960h (Park et al., 2019)

For Switchboard 300h, LAS-4-1024 is used as the benchmark. We can see that SpecAugment further boosts model performance.

Comparing SpecAugment method in Switchboard 300h (Park et al., 2019)

Take Away

  • Time warping did not improve model performance much. If resources are limited, it is the first augmentation to drop.
  • Label smoothing introduces instability into training.
  • Data augmentation converts an over-fitting problem into an under-fitting problem. From the figures below, you can see that the model without augmentation (None) performs nearly perfectly on the training set, while it does not achieve similar results on the other datasets.

  • To facilitate data augmentation for speech recognition, nlpaug now supports the SpecAugment methods.

About Me

I am a Data Scientist in the Bay Area, focusing on the state of the art in Data Science and Artificial Intelligence, especially NLP and platform-related topics. Feel free to connect with me on LinkedIn or follow me on Medium or GitHub.

