<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Edward Ma</title>
    <description>The latest articles on DEV Community by Edward Ma (@makcedward).</description>
    <link>https://dev.to/makcedward</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F177788%2F69ccdee4-848e-48fc-a84c-1a27aac410c0.jpeg</url>
      <title>DEV Community: Edward Ma</title>
      <link>https://dev.to/makcedward</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/makcedward"/>
    <language>en</language>
    <item>
      <title>Unsupervised Data Augmentation</title>
      <dc:creator>Edward Ma</dc:creator>
      <pubDate>Mon, 05 Aug 2019 00:26:58 +0000</pubDate>
      <link>https://dev.to/makcedward/unsupervised-data-augmentation-3ico</link>
      <guid>https://dev.to/makcedward/unsupervised-data-augmentation-3ico</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Xr2DjAHb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2A1NvpsIcE3_oTbHQ_" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Xr2DjAHb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2A1NvpsIcE3_oTbHQ_" alt=""&gt;&lt;/a&gt;Photo by &lt;a href="https://unsplash.com/@makcedward?utm_source=medium&amp;amp;utm_medium=referral"&gt;Edward Ma&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  A Look at Data Augmentation | &lt;a href="https://towardsai.net"&gt;Towards AI&lt;/a&gt;
&lt;/h4&gt;

&lt;p&gt;The more data we have, the better the performance we can achieve. However, annotating a large amount of training data is expensive, so proper data augmentation is a useful way to boost model performance. The authors of &lt;a href="https://arxiv.org/pdf/1904.12848.pdf"&gt;Unsupervised Data Augmentation&lt;/a&gt; (Xie et al., 2019) proposed Unsupervised Data Augmentation (UDA), which helps us build a better model by leveraging several data augmentation methods.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In the natural language processing (NLP) field, it is hard to augment text due to the high complexity of language. Not every word can be replaced by another (such as a, an, the), and not every word has a synonym. Even changing a single word can make the context totally different. On the other hand, generating augmented images in computer vision is relatively easy: even after introducing noise or cropping out a portion of the image, a model can still classify it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Xie et al. conducted several data augmentation experiments on image classification (AutoAugment) and text classification (back translation and TF-IDF based word replacement). After generating a large enough data set for model training, the authors noticed that the model could easily over-fit, so they introduced Training Signal Annealing (TSA) to overcome it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Augmentation Strategies
&lt;/h3&gt;

&lt;p&gt;This section introduces three data augmentation methods from the computer vision (CV) and natural language processing (NLP) fields.&lt;/p&gt;

&lt;h4&gt;
  
  
  AutoAugment for Image Classification
&lt;/h4&gt;

&lt;p&gt;AutoAugment was introduced by Google in 2018 as a way to augment images automatically. Unlike a traditional image augmentation library, AutoAugment is designed to search for the best policy for manipulating data automatically.&lt;/p&gt;

&lt;p&gt;You may visit &lt;a href="https://github.com/tensorflow/models/tree/master/research/autoaugment"&gt;here&lt;/a&gt; for the model and implementation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oBnjiQHh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/630/1%2AnwnEJoSn5-sxNpG6DOlFVA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oBnjiQHh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/630/1%2AnwnEJoSn5-sxNpG6DOlFVA.png" alt=""&gt;&lt;/a&gt;Generated result by AutoAugment (Cubuk et al., 2018)&lt;/p&gt;

&lt;h4&gt;
  
  
  Back translation for Text Classification
&lt;/h4&gt;

&lt;p&gt;Back translation is a method that leverages a translation system to generate data. Suppose we have models for translating English to Cantonese and vice versa. Augmented data can be obtained by translating the original data from English to Cantonese and then translating it back to English.&lt;/p&gt;
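&lt;p&gt;As a minimal sketch of the idea (the toy dictionary "translators" below are hypothetical stand-ins for real English/Cantonese translation models):&lt;/p&gt;

```python
# Minimal back-translation sketch. The two toy dictionary "translators"
# stand in for real English-to-Cantonese and Cantonese-to-English models;
# in practice you would call trained translation models here.

EN_TO_YUE = {"good": "好", "morning": "早晨"}
YUE_TO_EN = {"好": "great", "早晨": "morning"}  # round trip need not be exact

def to_cantonese(text):
    return " ".join(EN_TO_YUE.get(w, w) for w in text.split())

def to_english(text):
    return " ".join(YUE_TO_EN.get(w, w) for w in text.split())

def back_translate(text):
    # English to Cantonese and back to English: the round trip paraphrases.
    return to_english(to_cantonese(text))

augmented = back_translate("good morning")  # "great morning"
```

&lt;p&gt;Because the round trip rarely reproduces the input verbatim, each pass yields a paraphrase that can be added to the training set.&lt;/p&gt;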

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://arxiv.org/pdf/1511.06709.pdf"&gt;Sennrich et al. (2015)&lt;/a&gt; used back-translation method to generate more training data to improve translation model performance.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tyNPDcdr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/926/1%2APkj0hnD43MJuMUUAgHpwKA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tyNPDcdr--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/926/1%2APkj0hnD43MJuMUUAgHpwKA.png" alt=""&gt;&lt;/a&gt;Examples of back translation (Xie et al., 2019)&lt;/p&gt;

&lt;h4&gt;
  
  
  TF-IDF based word replacing for Text Classification
&lt;/h4&gt;

&lt;p&gt;Although back translation helps to generate a lot of data, there is no guarantee that keywords will be kept after translation. Some keywords carry more information than others, and they may be lost in translation.&lt;/p&gt;

&lt;p&gt;Therefore, Xie et al. use &lt;a href="https://towardsdatascience.com/3-basic-approaches-in-bag-of-words-which-are-better-than-word-embeddings-c2cbc7398016"&gt;TF-IDF&lt;/a&gt; to tackle this limitation. The idea behind TF-IDF is that high-frequency words may not provide much information gain; in other words, rare words contribute more weight to the model. A word's importance increases with its number of occurrences within the same document (i.e. training record) and decreases with how often it occurs across the corpus (i.e. other training records).&lt;/p&gt;

&lt;p&gt;The IDF score is calculated from the DBPedia corpus. A TF-IDF score is computed for each token, and tokens are replaced according to that score: tokens with a low TF-IDF score have a high probability of being replaced.&lt;/p&gt;
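&lt;p&gt;A minimal sketch of this replacement scheme (not the authors' exact implementation; the toy corpus, scoring and probability mapping below are illustrative):&lt;/p&gt;

```python
import math
import random

# Toy corpus: each document is one training record.
corpus = [
    ["the", "movie", "was", "fantastic"],
    ["the", "plot", "was", "boring"],
    ["the", "fantastic", "soundtrack"],
]

def idf(word):
    # Words occurring in many records get a low IDF.
    df = sum(word in doc for doc in corpus)
    return math.log(len(corpus) / df)

def tfidf_scores(doc):
    return {w: doc.count(w) / len(doc) * idf(w) for w in doc}

def augment(doc, replacement_vocab, max_prob=0.9):
    scores = tfidf_scores(doc)
    top = max(scores.values()) or 1.0
    out = []
    for w in doc:
        # Low TF-IDF means high replacement probability; keywords are kept.
        p = max_prob * (1 - scores[w] / top)
        out.append(random.choice(replacement_vocab) if p > random.random() else w)
    return out
```

&lt;p&gt;Running augment on the first record always keeps "movie" (its highest-scoring keyword) while frequently replacing uninformative words such as "the".&lt;/p&gt;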

&lt;p&gt;If you are interested in using TF-IDF based word replacement for data augmentation, you may visit &lt;a href="https://github.com/makcedward/nlpaug"&gt;nlpaug&lt;/a&gt; for a Python implementation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Training Signal Annealing (TSA)
&lt;/h3&gt;

&lt;p&gt;After generating a large amount of data using the aforementioned techniques, Xie et al. noticed that the model would over-fit easily. Therefore, they introduced TSA: during model training, examples with high confidence are removed from the loss function to prevent over-training.&lt;/p&gt;

&lt;p&gt;The following figure shows the value range of ηt, where K is the number of categories. If an example's predicted probability is higher than ηt, it is removed from the loss function.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WU6aAdCd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/346/1%2AITyrjAHpn21ua7bDkSBaNQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WU6aAdCd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/346/1%2AITyrjAHpn21ua7bDkSBaNQ.png" alt=""&gt;&lt;/a&gt;The threshold of removing high probability examples (Xie et al., 2019)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PWEZxt-A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/546/1%2Awts7wcoL1hsED_5eaNGyuA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PWEZxt-A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/546/1%2Awts7wcoL1hsED_5eaNGyuA.png" alt=""&gt;&lt;/a&gt;TSA’s objective function (Xie et al., 2019)&lt;/p&gt;

&lt;p&gt;Three schedules for computing ηt are considered, for different scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Linear-schedule: grows at a constant rate.&lt;/li&gt;
&lt;li&gt;Log-schedule: grows faster in the early stage of training.&lt;/li&gt;
&lt;li&gt;Exp-schedule: grows faster at the end of training.&lt;/li&gt;
&lt;/ul&gt;
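&lt;p&gt;The three schedules can be sketched directly from the paper's formulation, ηt = αt(1 − 1/K) + 1/K, with αt depending on the schedule (the scaling constant 5 in the log and exp schedules follows the paper):&lt;/p&gt;

```python
import math

def tsa_threshold(t, T, K, schedule="linear"):
    """Training Signal Annealing threshold eta_t.

    t: current training step, T: total steps, K: number of categories.
    Examples whose predicted probability exceeds eta_t are masked
    out of the loss at step t.
    """
    if schedule == "linear":
        alpha = t / T
    elif schedule == "log":
        alpha = 1 - math.exp(-(t / T) * 5)
    elif schedule == "exp":
        alpha = math.exp((t / T - 1) * 5)
    else:
        raise ValueError(schedule)
    return alpha * (1 - 1 / K) + 1 / K
```

&lt;p&gt;At t = 0 the linear schedule starts at 1/K (the random-guess probability), and every schedule approaches 1 by the end of training, so the loss gradually admits more confident examples.&lt;/p&gt;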

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--U6h9atXe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/491/1%2AwJrF9YPqo5VmxKc1yk5Tmw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--U6h9atXe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/491/1%2AwJrF9YPqo5VmxKc1yk5Tmw.png" alt=""&gt;&lt;/a&gt;Training process among three schedules (Xie et al., 2019)&lt;/p&gt;

&lt;h3&gt;
  
  
  Recommendation
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The above approaches are designed to solve the problems the authors were facing. If you understand your data, you should tailor the augmentation approach to it. Remember the golden rule in data science: garbage in, garbage out.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Like to learn?
&lt;/h3&gt;

&lt;p&gt;I am a Data Scientist in the Bay Area, focusing on the state of the art in Data Science and Artificial Intelligence, especially NLP and platform-related work. Feel free to connect with &lt;a href="https://makcedward.github.io/"&gt;me&lt;/a&gt; on &lt;a href="https://www.linkedin.com/in/edwardma1026"&gt;LinkedIn&lt;/a&gt; or &lt;a href="https://github.com/makcedward"&gt;Github&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Extension Reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/data-augmentation-in-nlp-2801a34dfc28"&gt;Data Augmentation in NLP&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/data-augmentation-library-for-text-9661736b13ff"&gt;Data Augmentation for Text&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/makcedward/data-augmentation-for-audio-5fii"&gt;Data Augmentation for Audio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/makcedward/data-augmentation-for-speech-recognition-bfc"&gt;Data Augmentation for Spectrogram&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/makcedward/does-your-nlp-model-able-to-prevent-adversarial-attack-1me7-temp-slug-1330031"&gt;Does your NLP model able to prevent an adversarial attack?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/DeepVoltaire/AutoAugment"&gt;Unofficial AutoAugment implementation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Reference
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;R. Sennrich, B. Haddow and A. Birch. &lt;a href="https://arxiv.org/pdf/1511.06709.pdf"&gt;Improving Neural Machine Translation Models with Monolingual Data&lt;/a&gt;. 2015&lt;/li&gt;
&lt;li&gt;E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan and Q. V. Le. &lt;a href="https://arxiv.org/pdf/1805.09501.pdf"&gt;AutoAugment: Learning Augmentation Strategies from Data&lt;/a&gt;. 2018&lt;/li&gt;
&lt;li&gt;Q. Xie, Z. Dai, E Hovy, M. T. Luong and Q. V. Le. &lt;a href="https://arxiv.org/pdf/1904.12848.pdf"&gt;Unsupervised Data Augmentation&lt;/a&gt;. 2019&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>ai</category>
      <category>nlp</category>
    </item>
    <item>
      <title>How does your assistant device work based on Text-to-Speech technology?</title>
      <dc:creator>Edward Ma</dc:creator>
      <pubDate>Mon, 29 Jul 2019 13:54:16 +0000</pubDate>
      <link>https://dev.to/makcedward/how-does-your-assistant-device-work-based-on-text-to-speech-technology-2bdg</link>
      <guid>https://dev.to/makcedward/how-does-your-assistant-device-work-based-on-text-to-speech-technology-2bdg</guid>
      <description>&lt;h4&gt;
  
  
  Speech synthesis
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lp7VES9b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AUsIXmmxFHg5reGcE" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lp7VES9b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AUsIXmmxFHg5reGcE" alt=""&gt;&lt;/a&gt;Photo by &lt;a href="https://unsplash.com/@howardlawrenceb?utm_source=medium&amp;amp;utm_medium=referral"&gt;Howard Lawrence B&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Speech synthesis is the artificial production of human speech. Text-to-Speech (TTS) converts text to a human-like voice. The goal of TTS is to render natural-sounding speech signals for downstream applications such as assistant devices (Google Assistant, Amazon Echo, Apple Siri). This story covers how we can generate a human-like voice. Concatenative TTS and parametric TTS are the traditional ways to generate audio, but they have limitations. Google released a generative model, WaveNet, which was a breakthrough in TTS: it can generate very good audio and overcomes the traditional methods' limitations.&lt;/p&gt;

&lt;p&gt;This story discusses &lt;a href="https://arxiv.org/pdf/1609.03499.pdf"&gt;WaveNet: A Generative Model for Raw Audio&lt;/a&gt; (van den Oord et al., 2016) and covers the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Text-to-Speech&lt;/li&gt;
&lt;li&gt;Technique of Classical Speech Synthesis&lt;/li&gt;
&lt;li&gt;WaveNet&lt;/li&gt;
&lt;li&gt;Experiment&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
  
  
  Text-to-Speech (TTS)
&lt;/h3&gt;

&lt;p&gt;Technically, we can treat TTS as a sequence-to-sequence problem. It includes two major stages: text analysis and speech synthesis. Text analysis is quite similar to generic natural language processing (NLP) preprocessing (although we may not need heavy preprocessing when using a deep neural network), for example sentence segmentation, word segmentation and part-of-speech (POS) tagging. The output of the first stage is a grapheme-to-phoneme (G2P) conversion, which becomes the input of the second stage. In speech synthesis, the output of the first stage is used to generate a waveform.&lt;/p&gt;

&lt;h3&gt;
  
  
  Technique of Classical Speech Synthesis
&lt;/h3&gt;

&lt;p&gt;Concatenative TTS and parametric TTS are the traditional ways to generate audio from text. As the name suggests, concatenative TTS concatenates short clips to form speech. Because the short clips are recorded by humans, the quality is good and the voice is clear. However, the limitations are the huge human effort for recording, and the need to re-record whenever the transcript changes. Parametric TTS can generate voices easily, as it stores all the base information, such as fundamental frequency and magnitude spectrum. Because the voice is generated, it sounds more unnatural than concatenative TTS.&lt;/p&gt;

&lt;h3&gt;
  
  
  WaveNet
&lt;/h3&gt;

&lt;p&gt;WaveNet was introduced by van den Oord et al. It can generate audio from text and achieves very good results; you may not be able to distinguish the generated audio from a human voice. A dilated causal convolution architecture is leveraged to deal with long-range temporal dependencies, and a single model can generate multiple voices.&lt;/p&gt;

&lt;p&gt;It is based on the PixelCNN (van den Oord et al., 2016) architecture. Leveraging dilated causal convolutions increases the receptive field without greatly increasing the computational cost. A dilated convolution is similar to a normal convolution, but the filter is applied over an area larger than its length, so some input values are skipped. The effect is similar to a larger filter, but at a lower computational cost.&lt;/p&gt;

&lt;p&gt;From the following figure, you can see that the second layer (Hidden Layer, Dilation=2) combines the current input with the input two steps back, and the next layer (Hidden Layer, Dilation=4) reaches four steps back. During the experiment, van den Oord et al. doubled the dilation for every layer up to a limit and then repeated the pattern, so the dilation sequence is&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;1, 2, 4, …, 512, 1, 2, 4, …, 512, …&lt;/p&gt;
&lt;/blockquote&gt;
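&lt;p&gt;With kernel size 2 (an assumption matching the paper's figures), each causal layer widens the receptive field by its dilation, so the doubling pattern gives exponential receptive-field growth at a linear cost in layers:&lt;/p&gt;

```python
def receptive_field(dilations, kernel_size=2):
    # Each layer adds (kernel_size - 1) * dilation past samples.
    return 1 + sum((kernel_size - 1) * d for d in dilations)

one_block = [2 ** i for i in range(10)]  # dilations 1, 2, 4, ..., 512
print(receptive_field(one_block))        # 1024 samples per block
print(receptive_field(one_block * 3))    # 3070: repeated blocks stack linearly
```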

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--H65W-aVt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/675/1%2A8UnrsD-QFaKp7FO0fuKf_w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--H65W-aVt--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/675/1%2A8UnrsD-QFaKp7FO0fuKf_w.png" alt=""&gt;&lt;/a&gt;Dilated Causal Convolutions (from van den Oord et al., 2016)&lt;/p&gt;

&lt;p&gt;The following animation shows dilated causal convolutions in operation: each layer's outputs become the next layer's inputs and are combined with earlier inputs to generate new outputs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--KFSV95vR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/570/0%2ALgGxxpenJyzo4t4W.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--KFSV95vR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/570/0%2ALgGxxpenJyzo4t4W.gif" alt=""&gt;&lt;/a&gt;Dilated Causal Convolutions (from &lt;a href="https://deepmind.com/blog/wavenet-generative-model-raw-audio/"&gt;DeepMind&lt;/a&gt;)&lt;/p&gt;

&lt;h3&gt;
  
  
  Experiment
&lt;/h3&gt;

&lt;p&gt;van den Oord et al. conducted four experiments to validate the model. The first experiment is multi-speaker speech generation. Leveraging the CSTR Voice Cloning Toolkit dataset, the model can generate up to 109 speaker voices. More speaker training data leads to a better result, as WaveNet's internal representation is shared among speaker voices.&lt;/p&gt;

&lt;p&gt;The second experiment is TTS. van den Oord et al. used Google's North American English and Mandarin Chinese TTS systems as training data to compare different models. To make the comparison fair, the researchers used a hidden Markov model (HMM) and an LSTM-RNN-based statistical parametric model as baselines. The Mean Opinion Score (MOS), a five-point scale (1: Bad, 2: Poor, 3: Fair, 4: Good, 5: Excellent), is used to measure performance. As the following shows, although WaveNet's score is still lower than a natural human voice, it beats the baseline models by a large margin.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0qQTOdd---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/828/1%2Ahpc01ZLJPP1mzbeZSgAQng.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0qQTOdd---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/828/1%2Ahpc01ZLJPP1mzbeZSgAQng.png" alt=""&gt;&lt;/a&gt;Model performance comparison (van den Oord et al., 2016)&lt;/p&gt;

&lt;p&gt;The third and fourth experiments are music generation and speech recognition.&lt;/p&gt;

&lt;p&gt;The following video and figure show the latest Google DeepMind WaveNet performance.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/JjK8apEishQ"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SVyMGvDw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2APnJUIpTXlKV2hsk1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SVyMGvDw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2APnJUIpTXlKV2hsk1.png" alt=""&gt;&lt;/a&gt;Comparison result among different model in US English and Mandarin Chinese (from &lt;a href="https://deepmind.com/blog/wavenet-generative-model-raw-audio/"&gt;DeepMind&lt;/a&gt;)&lt;/p&gt;

&lt;h3&gt;
  
  
  Take Away
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Google applied WaveNet to Google Assistant so that it can respond to voice commands without storing all of the audio, generating it in real time instead.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Like to learn?
&lt;/h3&gt;

&lt;p&gt;I am a Data Scientist in the Bay Area, focusing on the state of the art in Data Science and &lt;a href="https://becominghuman.ai/"&gt;Artificial Intelligence&lt;/a&gt;, especially NLP and platform-related work. Feel free to connect with &lt;a href="https://makcedward.github.io/"&gt;me&lt;/a&gt; on &lt;a href="https://www.linkedin.com/in/edwardma1026"&gt;LinkedIn&lt;/a&gt; or follow me on &lt;a href="http://medium.com/@makcedward/"&gt;Medium&lt;/a&gt; or &lt;a href="https://github.com/makcedward"&gt;Github&lt;/a&gt;. I offer brief advice on &lt;a href="https://becominghuman.ai/"&gt;machine learning&lt;/a&gt; problems or data science platforms for a small fee.&lt;/p&gt;

&lt;h3&gt;
  
  
  Extension Reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://deepmind.com/blog/wavenet-generative-model-raw-audio/"&gt;DeepMind’s WaveNet&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Reference
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, Koray Kavukcuoglu. &lt;a href="https://arxiv.org/pdf/1609.03499.pdf"&gt;WaveNet: A Generative Model for Raw Audio&lt;/a&gt;. 2016&lt;/li&gt;
&lt;li&gt;Aaron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, Koray Kavukcuoglu. &lt;a href="https://arxiv.org/pdf/1606.05328.pdf"&gt;Conditional Image Generation with PixelCNN Decoders&lt;/a&gt;. 2016&lt;/li&gt;
&lt;/ul&gt;





</description>
      <category>ai</category>
      <category>datascience</category>
      <category>texttospeech</category>
    </item>
    <item>
      <title>Multi-Task Learning for Sentence Embeddings</title>
      <dc:creator>Edward Ma</dc:creator>
      <pubDate>Mon, 10 Jun 2019 03:54:40 +0000</pubDate>
      <link>https://dev.to/makcedward/multi-task-learning-for-sentence-embeddings-3f4p</link>
      <guid>https://dev.to/makcedward/multi-task-learning-for-sentence-embeddings-3f4p</guid>
      <description>&lt;p&gt;Universal Sentence Encoder&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WnUurJC---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AeaSQmfXUK5j4XWwN" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WnUurJC---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AeaSQmfXUK5j4XWwN" alt=""&gt;&lt;/a&gt;“ Mount Fuji“ by &lt;a href="https://unsplash.com/@makcedward?utm_source=medium&amp;amp;utm_medium=referral"&gt;Edward Ma&lt;/a&gt; on &lt;a href="https://unsplash.com/photos/xFmcfCR9h20?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cer et al. demonstrated that transfer learning with sentence embeddings outperforms &lt;a href="https://towardsdatascience.com/3-silver-bullets-of-word-embedding-in-nlp-10fa8f50cc5a"&gt;word embeddings&lt;/a&gt;. The traditional way of building sentence embeddings is to average, sum or concatenate a set of word vectors. This loses a lot of information but is easy to compute. Cer et al. evaluated two well-known network architectures: a transformer-based model and a deep averaging network (DAN) based model.&lt;/p&gt;
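&lt;p&gt;The traditional baselines can be sketched in a few lines of NumPy; the random vectors below are stand-ins for real pre-trained word embeddings:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in word embeddings (in practice: word2vec, GloVe, etc.)
vocab = {w: rng.standard_normal(4) for w in ["the", "cat", "sat"]}

tokens = ["the", "cat", "sat"]
vectors = np.stack([vocab[t] for t in tokens])

avg_embedding = vectors.mean(axis=0)    # fixed size, ignores word order
sum_embedding = vectors.sum(axis=0)     # fixed size, magnitude grows with length
concat_embedding = vectors.reshape(-1)  # keeps order, size varies with length
```

&lt;p&gt;Each variant discards information (order, length or both), which is why a learned encoder can do better.&lt;/p&gt;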

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PTCh1cm9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/507/1%2AwgGU92OyaWrz6TBwBKclpg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PTCh1cm9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/507/1%2AwgGU92OyaWrz6TBwBKclpg.png" alt=""&gt;&lt;/a&gt;Sentence similarity score (Cera et al., 2018)&lt;/p&gt;

&lt;p&gt;This story discusses &lt;a href="https://arxiv.org/pdf/1803.11175.pdf"&gt;Universal Sentence Encoder&lt;/a&gt; (Cer et al., 2018) and covers the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data&lt;/li&gt;
&lt;li&gt;Architecture&lt;/li&gt;
&lt;li&gt;Implementation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data
&lt;/h3&gt;

&lt;p&gt;As it is designed to support multiple downstream tasks, multi-task learning is adopted. Therefore, Cer et al. use multiple data sources to train the model, including movie reviews, customer reviews, sentiment classification, question classification, semantic textual similarity and Word Embedding Association Test (WEAT) data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;

&lt;p&gt;Text is tokenized using the Penn Treebank (PTB) method and passed to either the transformer architecture or the deep averaging network. As both models are designed to be general purpose, a multi-task learning approach is adopted. The training objectives include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;As in &lt;a href="https://towardsdatascience.com/transforming-text-to-sentence-embeddings-layer-via-some-thoughts-b77bed60822c"&gt;Skip-thought&lt;/a&gt;, predicting the previous and next sentences given the current sentence.&lt;/li&gt;
&lt;li&gt;Conversational response suggestion for the inclusion of parsed conversational data.&lt;/li&gt;
&lt;li&gt;Classification tasks on supervised data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lQDmyuBj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/proxy/1%2AjKZr6M_UOUizE3ej74ohwg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lQDmyuBj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/proxy/1%2AjKZr6M_UOUizE3ej74ohwg.png" alt=""&gt;&lt;/a&gt;Predicting previous sentence and next sentence (Kiros et al., 2015)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/pdf/1706.03762.pdf"&gt;Transformer&lt;/a&gt; architecture is developed by Google in 2017. It leverages self attention with multi blocks to learn the context aware word representation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RG-P7Lb_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/438/1%2A42QeWPLtG-kwhblDbtGP4Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RG-P7Lb_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/438/1%2A42QeWPLtG-kwhblDbtGP4Q.png" alt=""&gt;&lt;/a&gt;Transformer architecture (Vaswani et al,, 2017)&lt;/p&gt;

&lt;p&gt;The deep averaging network (DAN) averages embeddings (word and bi-gram) and feeds the result to a feedforward neural network.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Y8uv0Jg9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/539/1%2Av07lrQnceNCWXxyVx2yixg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Y8uv0Jg9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/539/1%2Av07lrQnceNCWXxyVx2yixg.png" alt=""&gt;&lt;/a&gt;DAN architecture (Ivver et al., 2015)&lt;/p&gt;

&lt;p&gt;The reason for introducing two models is that they address different concerns. The transformer architecture achieves better performance but needs more resources to train. Although the DAN does not perform as well as the transformer architecture, its advantage is that it is a simple model requiring fewer training resources.&lt;/p&gt;

&lt;h3&gt;
  
  
  Implementation
&lt;/h3&gt;

&lt;p&gt;To explore the Universal Sentence Encoder, you can simply follow the instructions on &lt;a href="https://tfhub.dev/google/universal-sentence-encoder/2"&gt;Tensorflow Hub&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Take Away
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Multi-task learning is important for learning text representations. Many modern NLP model architectures use multi-task learning rather than a single standalone data set.&lt;/li&gt;
&lt;li&gt;Rather than aggregating multiple word vectors to represent sentence embeddings, learning the sentence embedding from the word vectors achieves better results.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  About Me
&lt;/h3&gt;

&lt;p&gt;I am a Data Scientist in the Bay Area, focusing on the state of the art in Data Science and Artificial Intelligence, especially NLP and platform-related work. Feel free to connect with me on &lt;a href="https://www.linkedin.com/in/edwardma1026"&gt;LinkedIn&lt;/a&gt; or follow me on &lt;a href="http://medium.com/@makcedward/"&gt;Medium&lt;/a&gt; or &lt;a href="https://github.com/makcedward"&gt;Github&lt;/a&gt;. I offer brief advice on machine learning problems or data science platforms for a small fee.&lt;/p&gt;

&lt;h3&gt;
  
  
  Extension Reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://tfhub.dev/google/universal-sentence-encoder-large/3"&gt;Universal Sentence Encoder&lt;/a&gt; implementation&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/3-silver-bullets-of-word-embedding-in-nlp-10fa8f50cc5a"&gt;Word Embeddings&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/transforming-text-to-sentence-embeddings-layer-via-some-thoughts-b77bed60822c"&gt;Skip-though (Sentence Embeddings)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Reference
&lt;/h3&gt;

&lt;p&gt;D. Cer, Y. Yang, S. Y. Kong, N. Hua, N. Limtiaco, R. S. John, N. Constant, M. Guajardo-Cespedes, S. Yuan, C. Tar, Y. H. Sung, B. Strope and R. Kurzweil. &lt;a href="https://arxiv.org/pdf/1803.11175.pdf"&gt;Universal Sentence Encoder&lt;/a&gt;. 2018&lt;/p&gt;

&lt;p&gt;A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser and I. Polosukhin. &lt;a href="https://arxiv.org/pdf/1706.03762.pdf"&gt;Attention Is All You Need&lt;/a&gt;. 2017&lt;/p&gt;

&lt;p&gt;M. Iyyer, V. Manjunatha, J. Boyd-Graber and H. Daume III. &lt;a href="http://www.aclweb.org/anthology/P15-1162"&gt;Deep Unordered Composition Rivals Syntactic Methods for Text Classification&lt;/a&gt;. 2015&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Data Augmentation for Audio</title>
      <dc:creator>Edward Ma</dc:creator>
      <pubDate>Sat, 01 Jun 2019 12:58:39 +0000</pubDate>
      <link>https://dev.to/makcedward/data-augmentation-for-audio-5fii</link>
      <guid>https://dev.to/makcedward/data-augmentation-for-audio-5fii</guid>
      <description>&lt;h4&gt;
  
  
  Data Augmentation
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Y1Cawti---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AQsYCC8milwRRx3Tf" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Y1Cawti---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AQsYCC8milwRRx3Tf" alt=""&gt;&lt;/a&gt;Photo by &lt;a href="https://unsplash.com/@makcedward?utm_source=medium&amp;amp;utm_medium=referral"&gt;Edward Ma&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Although tuning the model architecture and hyperparameters are successful factors in building a wonderful model, data scientists should also focus on the data. No matter how amazing a model you build: &lt;strong&gt;garbage in, garbage out (GIGO)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Intuitively, lack of data is one of the most common issues in actual data science problems. Data augmentation helps to generate synthetic data from an existing data set so that the generalisation capability of the model can be improved.&lt;/p&gt;

&lt;p&gt;In the previous &lt;a href="https://dev.to/makcedward/data-augmentation-for-speech-recognition-3279-temp-slug-8408894"&gt;story&lt;/a&gt;, we explained how to play with spectrograms. In this story, we will talk about basic augmentation methods for audio. This story and implementation are inspired by &lt;a href="https://www.kaggle.com/CVxTz/audio-data-augmentation"&gt;Kaggle’s Audio Data Augmentation Notebook&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Augmentation for Audio
&lt;/h3&gt;

&lt;p&gt;To generate synthetic data for audio, we can apply noise injection, time shifting, and pitch and speed changes. &lt;a href="https://github.com/numpy/numpy"&gt;numpy&lt;/a&gt; provides an easy way to handle noise injection and time shifting, while &lt;a href="https://github.com/librosa/librosa"&gt;librosa&lt;/a&gt; (library for Recognition and Organization of Speech and Audio) helps to manipulate pitch and speed with just one line of code.&lt;/p&gt;

&lt;h4&gt;
  
  
  Noise Injection
&lt;/h4&gt;

&lt;p&gt;It simply adds some random values to the data by using numpy.&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

def manipulate(data, noise\_factor):
 noise = np.random.randn(len(data))
 augmented\_data = data + noise\_factor \* noise
 # Cast back to same data type
 augmented\_data = augmented\_data.astype(type(data[0]))
 return augmented\_data
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6sjTXR_o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/604/1%2A43FyaaFTlz6qqHcZBy8RAw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6sjTXR_o--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/604/1%2A43FyaaFTlz6qqHcZBy8RAw.png" alt=""&gt;&lt;/a&gt;Comparison between original and noise voice&lt;/p&gt;

&lt;h4&gt;
  
  
  Shifting Time
&lt;/h4&gt;

&lt;p&gt;The idea of time shifting is very simple. It just shifts the audio to the left or right by a random number of seconds. If the audio is shifted to the left (fast forward) by x seconds, the first x seconds are marked as 0 (i.e. silence). If the audio is shifted to the right (rewind) by x seconds, the last x seconds are marked as 0 (i.e. silence).&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

def manipulate(data, sampling\_rate, shift\_max, shift\_direction):
 shift = np.random.randint(sampling\_rate \* shift\_max)
 if shift\_direction == 'right':
 shift = -shift
 elif self.shift\_direction == 'both':
 direction = np.random.randint(0, 2)
 if direction == 1:
 shift = -shift

augmented\_data = np.roll(data, shift)
 # Set to silence for heading/ tailing
 if shift \&amp;gt; 0:
 augmented\_data[:shift] = 0
 else:
 augmented\_data[shift:] = 0
 return augmented\_data
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5P3-d6Dm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/604/1%2AqAwU-YJYltf3nXkeF0qO8A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5P3-d6Dm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/604/1%2AqAwU-YJYltf3nXkeF0qO8A.png" alt=""&gt;&lt;/a&gt;Comparison between original and shifted voice&lt;/p&gt;

&lt;h4&gt;
  
  
  Changing Pitch
&lt;/h4&gt;

&lt;p&gt;This augmentation is a wrapper of a &lt;a href="https://librosa.github.io/librosa/generated/librosa.effects.pitch_shift.html"&gt;librosa&lt;/a&gt; function. It changes the pitch randomly.&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import librosa

def manipulate(data, sampling\_rate, pitch\_factor):
 return librosa.effects.pitch\_shift(data, sampling\_rate, pitch\_factor)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--T66pGbc5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/604/1%2Aw5c8k9poONgevgYCvmQiuQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--T66pGbc5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/604/1%2Aw5c8k9poONgevgYCvmQiuQ.png" alt=""&gt;&lt;/a&gt;Comparison between original and changed pitch voice&lt;/p&gt;

&lt;h4&gt;
  
  
  Changing Speed
&lt;/h4&gt;

&lt;p&gt;Same as changing pitch, this augmentation is performed by a &lt;a href="https://librosa.github.io/librosa/generated/librosa.effects.time_stretch.html"&gt;librosa&lt;/a&gt; function. It stretches the time series by a fixed rate.&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import librosa

def manipulate(data, speed\_factor):
 return librosa.effects.time\_stretch(data, speed\_factor)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Qb3A0_VW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/604/1%2AbEEmdjVd_ddPuEKpD_dPeQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Qb3A0_VW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/604/1%2AbEEmdjVd_ddPuEKpD_dPeQ.png" alt=""&gt;&lt;/a&gt;Comparison between original and changed speed voice&lt;/p&gt;

&lt;h3&gt;
  
  
  Take Away
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;The above 4 methods are implemented in the &lt;a href="https://github.com/makcedward/nlpaug"&gt;nlpaug&lt;/a&gt; package (≥ 0.0.3). You can generate augmented data within a few lines of code.&lt;/li&gt;
&lt;li&gt;Data augmentation cannot replace real training data. It just helps to generate synthetic data to make the model better.&lt;/li&gt;
&lt;li&gt;Do not blindly generate synthetic data. You have to understand your data pattern and select an appropriate way to increase the training data volume.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  About Me
&lt;/h3&gt;

&lt;p&gt;I am a Data Scientist in the Bay Area, focusing on the state of the art in Data Science and Artificial Intelligence, especially NLP and platform-related topics. Feel free to connect with &lt;a href="https://makcedward.github.io/"&gt;me&lt;/a&gt; on &lt;a href="https://www.linkedin.com/in/edwardma1026"&gt;LinkedIn&lt;/a&gt; or follow me on &lt;a href="https://github.com/makcedward"&gt;Github&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Extension Reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.kaggle.com/CVxTz/audio-data-augmentation"&gt;Audio Data Augmentation in Kaggle competition&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://librosa.github.io/"&gt;librosa&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/data-augmentation-in-nlp-2801a34dfc28"&gt;Data Augmentation in NLP&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/data-augmentation-library-for-text-9661736b13ff"&gt;Data Augmentation for Text&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/makcedward/data-augmentation-for-speech-recognition-3279-temp-slug-8408894"&gt;Data Augmentation for Spectrogram&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>speechrecognition</category>
      <category>machinelearning</category>
      <category>ai</category>
      <category>datascience</category>
    </item>
    <item>
      <title>3 subword algorithms help to improve your NLP model performance</title>
      <dc:creator>Edward Ma</dc:creator>
      <pubDate>Sun, 19 May 2019 01:37:13 +0000</pubDate>
      <link>https://dev.to/makcedward/3-subword-algorithms-help-to-improve-your-nlp-model-performance-333n</link>
      <guid>https://dev.to/makcedward/3-subword-algorithms-help-to-improve-your-nlp-model-performance-333n</guid>
      <description>&lt;h4&gt;
  
  
  Introduction to subword
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yx3riG2Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AOzrGnF_f8bj3nqAf" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yx3riG2Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AOzrGnF_f8bj3nqAf" alt=""&gt;&lt;/a&gt;Photo by &lt;a href="https://unsplash.com/@makcedward?utm_source=medium&amp;amp;utm_medium=referral"&gt;Edward Ma&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://towardsdatascience.com/3-silver-bullets-of-word-embedding-in-nlp-10fa8f50cc5a"&gt;Classic word representation&lt;/a&gt; cannot handle unseen word or rare word well. &lt;a href="https://towardsdatascience.com/besides-word-embedding-why-you-need-to-know-character-embedding-6096a34a3b10"&gt;Character embeddings&lt;/a&gt; is one of the solution to overcome out-of-vocabulary (OOV). However, it may too fine-grained any missing some important information. Subword is in between word and character. It is not too fine-grained while able to handle unseen word and rare word.&lt;/p&gt;

&lt;p&gt;For example, we can split “subword” into “sub” and “word”. In other words, we use two vectors (i.e. “sub” and “word”) to represent “subword”. You may argue that this takes more resources to compute, but the reality is that it has a smaller footprint compared to a word-level representation.&lt;/p&gt;

&lt;p&gt;This story will discuss about &lt;a href="https://arxiv.org/pdf/1808.06226.pdf"&gt;SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing&lt;/a&gt; (Kudo et al., 2018) and further discussing about different subword algorithms. The following are will be covered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Byte Pair Encoding (BPE)&lt;/li&gt;
&lt;li&gt;WordPiece&lt;/li&gt;
&lt;li&gt;Unigram Language Model&lt;/li&gt;
&lt;li&gt;SentencePiece&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;&lt;em&gt;Byte Pair Encoding (BPE)&lt;/em&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Sennrich et al. (2016) proposed to use Byte Pair Encoding (BPE) to build a subword dictionary. Radford et al. adopted BPE to construct subword vectors to build &lt;a href="https://towardsdatascience.com/too-powerful-nlp-model-generative-pre-training-2-4cc6afb6655"&gt;GPT-2&lt;/a&gt; in 2019.&lt;/p&gt;

&lt;h4&gt;
  
  
  Algorithm
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Prepare large enough training data (i.e. a corpus)&lt;/li&gt;
&lt;li&gt;Define a desired subword vocabulary size&lt;/li&gt;
&lt;li&gt;Split each word into a sequence of characters and append the suffix “&amp;lt;/w&amp;gt;” to the end of the word, together with the word frequency, so the basic unit at this stage is the character. For example, if the frequency of “low” is 5, we rephrase it to “l o w &amp;lt;/w&amp;gt;”: 5&lt;/li&gt;
&lt;li&gt;Generate a new subword from the highest-frequency pair of units.&lt;/li&gt;
&lt;li&gt;Repeat step 4 until reaching the subword vocabulary size defined in step 2 or until the next highest-frequency pair has a count of 1.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---hwO-9A7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/485/1%2A_bpIUb6YZr6DOMLAeSU2WA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---hwO-9A7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/485/1%2A_bpIUb6YZr6DOMLAeSU2WA.png" alt=""&gt;&lt;/a&gt;Algorithm of BPE (Sennrich et al., 2015)&lt;/p&gt;

&lt;h4&gt;
  
  
  Example
&lt;/h4&gt;

&lt;p&gt;Taking “low: 5”, “lower: 2”, “newest: 6” and “widest: 3” as an example, the highest-frequency subword pair is e and s, because we get 6 counts from newest and 3 counts from widest. A new subword (es) is then formed, and it becomes a candidate in the next iteration.&lt;/p&gt;

&lt;p&gt;In the second iteration, the next highest-frequency subword pair is es (generated in the previous iteration) and t, because again we get 6 counts from newest and 3 counts from widest.&lt;/p&gt;

&lt;p&gt;Keep iterating until the desired vocabulary size is reached or the next highest-frequency pair has a count of 1.&lt;/p&gt;
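The toy example above can be reproduced with a short Python sketch of the BPE merge loop. The frequencies are the ones from the example; the underscore is used here as the end-of-word marker.

```python
import collections

def get_stats(vocab):
    # Count the frequency of each adjacent symbol pair across the corpus.
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_vocab(pair, vocab):
    # Replace every occurrence of the chosen pair with the merged symbol.
    merged = {}
    for word, freq in vocab.items():
        symbols = word.split()
        out, i = [], 0
        while i != len(symbols):
            if i + 1 != len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[' '.join(out)] = freq
    return merged

# "_" marks end-of-word; frequencies follow the example above.
vocab = {'l o w _': 5, 'l o w e r _': 2, 'n e w e s t _': 6, 'w i d e s t _': 3}
merges = []
for step in range(2):
    stats = get_stats(vocab)
    best = max(stats, key=stats.get)   # highest-frequency pair wins
    vocab = merge_vocab(best, vocab)
    merges.append(best)
# merges is now [('e', 's'), ('es', 't')], matching the two iterations above.
```

Running more iterations keeps growing the vocabulary until the stopping condition from step 5 of the algorithm is met.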

&lt;h3&gt;
  
  
  WordPiece
&lt;/h3&gt;

&lt;p&gt;WordPiece is another word segmentation algorithm, similar to BPE. Schuster and Nakajima introduced WordPiece while solving a Japanese and Korean voice search problem in 2012. The difference from BPE is that a new subword is formed by maximizing likelihood rather than by taking the next highest-frequency pair.&lt;/p&gt;

&lt;h4&gt;
  
  
  Algorithm
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Prepare large enough training data (i.e. a corpus)&lt;/li&gt;
&lt;li&gt;Define a desired subword vocabulary size&lt;/li&gt;
&lt;li&gt;Split each word into a sequence of characters&lt;/li&gt;
&lt;li&gt;Build a language model based on the data from step 3&lt;/li&gt;
&lt;li&gt;Choose the new word unit, out of all possible ones, that most increases the likelihood on the training data when added to the model.&lt;/li&gt;
&lt;li&gt;Repeat step 5 until reaching the subword vocabulary size defined in step 2 or until the likelihood increase falls below a certain threshold.&lt;/li&gt;
&lt;/ol&gt;
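A common way to approximate the WordPiece selection rule in step 5 is to score a candidate pair (a, b) by freq(ab) / (freq(a) * freq(b)), the likelihood gain of the merge under a unigram model. The sketch below applies this scoring to the same toy corpus as the BPE example; it is a simplified illustration, and real WordPiece implementations differ in details.

```python
import collections

def score_pairs(vocab):
    # Score adjacent pairs by freq(ab) / (freq(a) * freq(b)):
    # a rough likelihood-gain criterion, unlike BPE's raw pair frequency.
    pair_freq = collections.defaultdict(int)
    sym_freq = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for sym in symbols:
            sym_freq[sym] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_freq[(a, b)] += freq
    return {pair: freq / (sym_freq[pair[0]] * sym_freq[pair[1]])
            for pair, freq in pair_freq.items()}

vocab = {'l o w': 5, 'l o w e r': 2, 'n e w e s t': 6, 'w i d e s t': 3}
scores = score_pairs(vocab)
best = max(scores, key=scores.get)
# best is ('i', 'd'): rare symbols that almost always co-occur win,
# even though ('e', 's') has a higher raw count.
```

This is exactly the behavioural difference from BPE: frequency alone no longer decides the merge.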

&lt;h3&gt;
  
  
  Unigram Language Model
&lt;/h3&gt;

&lt;p&gt;Kudo introduced the unigram language model as another algorithm for subword segmentation. One of its assumptions is that all subword occurrences are independent, and a subword sequence is produced by the product of subword occurrence probabilities. Both WordPiece and the Unigram Language Model leverage a language model to build the subword vocabulary.&lt;/p&gt;

&lt;h4&gt;
  
  
  Algorithm
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Prepare large enough training data (i.e. a corpus)&lt;/li&gt;
&lt;li&gt;Define a desired subword vocabulary size&lt;/li&gt;
&lt;li&gt;Optimize the probability of word occurrence given the word sequence.&lt;/li&gt;
&lt;li&gt;Compute the loss of each subword&lt;/li&gt;
&lt;li&gt;Sort the symbols by loss and keep the top X% of words (e.g. X can be 80). To avoid out-of-vocabulary issues, character-level units are recommended to be included as a subset of the subwords.&lt;/li&gt;
&lt;li&gt;Repeat steps 3–5 until reaching the subword vocabulary size defined in step 2 or until there is no change in step 5.&lt;/li&gt;
&lt;/ol&gt;
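Once subword probabilities are available, segmenting a word means picking the subword sequence whose probability product is highest. Here is a minimal Viterbi-style sketch over an assumed toy vocabulary; the probabilities are made up for illustration, and single characters are kept in the vocabulary as the algorithm recommends.

```python
import math

# Assumed toy unigram probabilities; characters stay in the vocabulary
# so that every string remains segmentable.
logp = {'sub': math.log(0.1), 'word': math.log(0.1), 'subword': math.log(0.002)}
for ch in 'subword':
    logp.setdefault(ch, math.log(0.001))

def segment(text):
    # best[i] = best log-probability of text[:i]; back[i] = split point.
    n = len(text)
    best = [0.0] + [-math.inf] * n
    back = [0] * (n + 1)
    for end in range(1, n + 1):
        candidates = [(best[start] + logp[text[start:end]], start)
                      for start in range(end) if text[start:end] in logp]
        if candidates:
            best[end], back[end] = max(candidates)
    pieces, i = [], n
    while i:
        pieces.append(text[back[i]:i])
        i = back[i]
    return list(reversed(pieces))

# 'sub' + 'word' beats keeping 'subword' whole or splitting into characters.
pieces = segment('subword')
```

In the real algorithm, the loss of a subword in step 4 is how much the total corpus log-likelihood drops when that subword is removed and its occurrences are re-segmented this way.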

&lt;h3&gt;
  
  
  SentencePiece
&lt;/h3&gt;

&lt;p&gt;So, is there any existing library we can leverage for our text processing? Kudo and Richardson implemented the &lt;a href="https://github.com/google/sentencepiece"&gt;SentencePiece&lt;/a&gt; library. You have to train your tokenizer on your own data so that you can encode and decode your data for downstream tasks.&lt;/p&gt;

&lt;p&gt;First of all, prepare a plain text file containing your data and then trigger the following API to train the model:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import sentencepiece as spm
spm.SentencePieceTrainer.Train('--input=test/botchan.txt --model_prefix=m --vocab_size=1000')
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;It is super fast, and you can load the model with:&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sp = spm.SentencePieceProcessor()
sp.Load("m.model")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;To encode your text, you just need to&lt;/p&gt;



&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sp.EncodeAsIds("This is a test")
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;For more examples and usages, you can access this &lt;a href="https://github.com/google/sentencepiece/blob/master/python/README.md"&gt;repo&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Take Away
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Subword balances vocabulary size and footprint. The extreme case is using only 26 tokens (i.e. characters) to represent all English words. 16k or 32k subwords is a recommended vocabulary size for good results.&lt;/li&gt;
&lt;li&gt;Words in many Asian languages cannot be separated by spaces, so the initial vocabulary is much larger than for English. You may need to prepare over 10k initial words to kick-start the word segmentation. In their research, Schuster and Nakajima propose to use 22k words and 11k words for Japanese and Korean respectively.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Like to learn?
&lt;/h3&gt;

&lt;p&gt;I am a Data Scientist in the Bay Area, focusing on the state of the art in Data Science and Artificial Intelligence, especially NLP and platform-related topics. Feel free to connect with &lt;a href="https://makcedward.github.io/"&gt;me&lt;/a&gt; on &lt;a href="https://www.linkedin.com/in/edwardma1026"&gt;LinkedIn&lt;/a&gt; or follow me on &lt;a href="http://medium.com/@makcedward/"&gt;Medium&lt;/a&gt; or &lt;a href="https://github.com/makcedward"&gt;Github&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Extension Reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/3-silver-bullets-of-word-embedding-in-nlp-10fa8f50cc5a"&gt;Classic word representation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/besides-word-embedding-why-you-need-to-know-character-embedding-6096a34a3b10"&gt;Character embeddings&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/too-powerful-nlp-model-generative-pre-training-2-4cc6afb6655"&gt;Too powerful NLP model (GPT-2)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/google/sentencepiece"&gt;SentencePiece GIT repo&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Reference
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;T. Kudo and J. Richardson. &lt;a href="https://arxiv.org/pdf/1808.06226.pdf"&gt;SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing&lt;/a&gt;. 2018&lt;/li&gt;
&lt;li&gt;R. Sennrich, B. Haddow and A. Birch. &lt;a href="http://aclweb.org/anthology/P16-1162"&gt;Neural Machine Translation of Rare Words with Subword Units&lt;/a&gt;. 2015&lt;/li&gt;
&lt;li&gt;M. Schuster and K. Nakajima. &lt;a href="https://storage.googleapis.com/pub-tools-public-publication-data/pdf/37842.pdf"&gt;Japanese and Korean Voice Search&lt;/a&gt;. 2012&lt;/li&gt;
&lt;li&gt;Taku Kudo. &lt;a href="https://arxiv.org/pdf/1804.10959.pdf"&gt;Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates&lt;/a&gt;. 2018&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>datascience</category>
      <category>naturallanguageproce</category>
      <category>tokenization</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>How do they apply BERT in the clinical domain?</title>
      <dc:creator>Edward Ma</dc:creator>
      <pubDate>Mon, 06 May 2019 13:31:03 +0000</pubDate>
      <link>https://dev.to/makcedward/how-do-they-apply-bert-in-the-clinical-domain-1ipd</link>
      <guid>https://dev.to/makcedward/how-do-they-apply-bert-in-the-clinical-domain-1ipd</guid>
      <description>&lt;h4&gt;
  
  
  BERT in clinical domain
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BLtLvP2_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AYtLXZbfYDflzxBJw" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BLtLvP2_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AYtLXZbfYDflzxBJw" alt=""&gt;&lt;/a&gt;Photo by &lt;a href="https://unsplash.com/@makcedward?utm_source=medium&amp;amp;utm_medium=referral"&gt;Edward Ma&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Contextual word embeddings have been proven to dramatically improve NLP model performance via &lt;a href="https://towardsdatascience.com/elmo-helps-to-further-improve-your-word-embeddings-c6ed2c9df95f"&gt;ELMo&lt;/a&gt; (Peters et al., 2018), &lt;a href="https://towardsdatascience.com/how-bert-leverage-attention-mechanism-and-transformer-to-learn-word-contextual-relations-5bbee1b6dbdb"&gt;BERT&lt;/a&gt; (Devlin et al., 2018) and &lt;a href="https://towardsdatascience.com/too-powerful-nlp-model-generative-pre-training-2-4cc6afb6655"&gt;GPT-2&lt;/a&gt; (Radford et al., 2019). Much research aims to fine-tune the BERT model on domain-specific data. &lt;a href="https://towardsdatascience.com/how-to-apply-bert-in-scientific-domain-2d9db0480bd9"&gt;BioBERT and SciBERT&lt;/a&gt; were introduced last time. I would like to continue on this topic, as two more studies fine-tune the BERT model and apply it in the clinical domain.&lt;/p&gt;

&lt;p&gt;This story will discuss about &lt;a href="https://arxiv.org/pdf/1904.03323.pdf"&gt;Publicly Available Clinical BERT Embeddings&lt;/a&gt; (Alsentzer et al., 2019) and &lt;a href="https://arxiv.org/pdf/1904.05342.pdf"&gt;ClinicalBert: Modeling Clinical Notes and Predicting Hospital Readmission&lt;/a&gt; (Huang et al., 2019) while it will go through BERT detail but focusing how researchers applying it in clinical domain. In case, you want to understand more about BERT, you may visit this &lt;a href="https://towardsdatascience.com/how-bert-leverage-attention-mechanism-and-transformer-to-learn-word-contextual-relations-5bbee1b6dbdb"&gt;story&lt;/a&gt;.The following are will be covered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Building clinical specific BERT resource&lt;/li&gt;
&lt;li&gt;Application for ClinicalBERT&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Building clinical specific BERT resource
&lt;/h3&gt;

&lt;p&gt;Alsentzer et al. applied 2 million notes from the &lt;a href="https://mimic.physionet.org/gettingstarted/dbsetup/"&gt;MIMIC-III v1.4 database&lt;/a&gt; (Johnson et al., 2016). There are 15 note types in total, and Alsentzer et al. aggregate them into either a non-Discharge Summary type or a Discharge Summary type. The discharge summary data is designed for downstream task training/fine-tuning.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bJ1DWGiI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/439/1%2A9V7c5J8UR-N5sLTLQ7rTYA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bJ1DWGiI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/439/1%2A9V7c5J8UR-N5sLTLQ7rTYA.png" alt=""&gt;&lt;/a&gt;Distribution of note type MIMIC-III v1.4 (Alsentzer et al., 2019)&lt;/p&gt;

&lt;p&gt;Given those data, &lt;a href="https://arxiv.org/pdf/1902.07669.pdf"&gt;ScispaCy&lt;/a&gt; is leveraged to tokenize articles into sentences. Those sentences are passed to BERT-Base (the original BERT base model) and &lt;a href="https://github.com/naver/biobert-pretrained"&gt;BioBERT&lt;/a&gt; respectively for additional pre-training.&lt;/p&gt;

&lt;p&gt;Clinical BERT is built on BERT-Base, while Clinical BioBERT is based on BioBERT. Once the contextual word embeddings are trained, a single linear-layer classification model is trained for tackling named-entity recognition (NER), de-identification (de-ID) or sentiment classification tasks.&lt;/p&gt;

&lt;p&gt;These models achieve a better result on MedNLI compared to the original BERT model. Meanwhile, you may notice that there is no improvement for i2b2 2006 and i2b2 2014, which are de-ID tasks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9BE4SR7d--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/904/1%2AAKULrfnaGFRyqtBfEOCeOA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9BE4SR7d--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/904/1%2AAKULrfnaGFRyqtBfEOCeOA.png" alt=""&gt;&lt;/a&gt;Performance comparison among different models in MedNLI and i2b2 data set (Alsentzer et al., 2019))&lt;/p&gt;

&lt;h3&gt;
  
  
  Application for ClinicalBERT
&lt;/h3&gt;

&lt;p&gt;At the same time, Huang et al. also focus on clinical notes. However, the major objective of Huang et al.’s research is building a prediction model by leveraging a good clinical text representation. Huang et al. note that a lower readmission rate is good for patients, for example by saving money.&lt;/p&gt;

&lt;p&gt;Same as Alsentzer et al., the MIMIC-III dataset (Johnson et al., 2016) is used for evaluation. Following standard BERT practice, contextual word embeddings are trained by masked token prediction and next sentence prediction. In short, masked token prediction masks a token randomly and uses the surrounding words to predict the masked token. Next sentence prediction is a binary classifier; the output of this model classifies whether the second sentence is the next sentence of the first sentence or not.&lt;/p&gt;
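The masked-token idea can be sketched in a few lines. This simplified version just replaces a random subset of tokens with a [MASK] symbol and records the answers as training targets; real BERT additionally applies an 80/10/10 replacement scheme, which is omitted here.

```python
import random

def mask_tokens(tokens, mask_ratio=0.15, seed=42):
    # Pick roughly mask_ratio of the positions (at least one) and hide them.
    rng = random.Random(seed)
    k = max(1, round(len(tokens) * mask_ratio))
    positions = set(rng.sample(range(len(tokens)), k))
    masked = ['[MASK]' if i in positions else tok
              for i, tok in enumerate(tokens)]
    labels = {i: tokens[i] for i in positions}  # targets the model must predict
    return masked, labels

note = 'patient was discharged home in stable condition'.split()
masked, labels = mask_tokens(note)
```

The model is then trained to recover each entry of labels from the masked sequence, which is what forces it to learn clinical context.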

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Pcmej0ip--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/662/1%2Au2H0R8HmpuyXFBGHWJW4yQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Pcmej0ip--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/662/1%2Au2H0R8HmpuyXFBGHWJW4yQ.png" alt=""&gt;&lt;/a&gt;Training tasks of ClincialBERT (Huang et al., 2019)&lt;/p&gt;

&lt;p&gt;After having pre-trained contextual word embeddings, a fine-tuning process is applied to readmission prediction. It is a binary classification model that predicts whether a patient needs to be readmitted within the next 30 days.&lt;/p&gt;

&lt;p&gt;One of the BERT model’s limitations is that the maximum token length is 512. A long clinical note is split into multiple parts, and each part is predicted separately. Once all sub-parts are predicted, a final probability is aggregated. Due to concerns about using the maximum or the mean alone, Huang et al. combine both of them for a more accurate result.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iJJPeaQp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/429/1%2AfArd8urwJFW3V4wv9_LVMw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iJJPeaQp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/429/1%2AfArd8urwJFW3V4wv9_LVMw.png" alt=""&gt;&lt;/a&gt;Scalable radmission prediction formula (Huang et al., 2019)&lt;/p&gt;

&lt;p&gt;Finally, the experiment results demonstrate that a fine-tuned ClinicalBERT is better than classical models.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_ZieH_qz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/804/1%2Av2ZzgyAvrHHBUNVIeQ1GiQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_ZieH_qz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/804/1%2Av2ZzgyAvrHHBUNVIeQ1GiQ.png" alt=""&gt;&lt;/a&gt;Performance comparison among models (Huang et al., 2019)&lt;/p&gt;

&lt;h3&gt;
  
  
  Take Away
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Alsentzer et al. use a single-layer classification model to evaluate results. It may be a good start, with the expectation that the BERT model is able to learn the content, but evaluating other advanced model architectures may provide a more comprehensive experiment result.&lt;/li&gt;
&lt;li&gt;For long clinical notes, Huang et al. use a mathematical trick to work around the length limit. It may not be able to capture the content of very long clinical notes, so a better way to tackle long inputs may need further thought.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Like to learn?
&lt;/h3&gt;

&lt;p&gt;I am a Data Scientist in the Bay Area, focusing on the state of the art in Data Science and Artificial Intelligence, especially NLP and platform-related topics. Feel free to connect with &lt;a href="https://makcedward.github.io/"&gt;me&lt;/a&gt; on &lt;a href="https://www.linkedin.com/in/edwardma1026"&gt;LinkedIn&lt;/a&gt; or follow me on &lt;a href="http://medium.com/@makcedward/"&gt;Medium&lt;/a&gt; or &lt;a href="https://github.com/makcedward"&gt;Github&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Extension Reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/EmilyAlsentzer/clinicalBERT"&gt;Clinical BERT Embeddings GIT repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/kexinhuang12345/clinicalBERT"&gt;ClinicalBERT&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://mimic.physionet.org/gettingstarted/dbsetup/"&gt;MIMIC-III v1.4 database&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/how-to-apply-bert-in-scientific-domain-2d9db0480bd9"&gt;BioBERT and SciBERT&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Reference
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;E. Alsentzer, J. R. Murphy, W. Boag, W. H. Weng, D. Jin, T. Naumann and M. B. A. McDermott. &lt;a href="https://arxiv.org/pdf/1904.03323.pdf"&gt;Publicly Available Clinical BERT Embeddings&lt;/a&gt;. 2019&lt;/li&gt;
&lt;li&gt;K. Huang, J. Altosaar and R. Ranganath. &lt;a href="https://arxiv.org/pdf/1904.05342.pdf"&gt;ClinicalBert: Modeling Clinical Notes and Predicting Hospital Readmission&lt;/a&gt;. 2019&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>nlp</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Data Augmentation for Speech Recognition</title>
      <dc:creator>Edward Ma</dc:creator>
      <pubDate>Wed, 01 May 2019 13:33:53 +0000</pubDate>
      <link>https://dev.to/makcedward/data-augmentation-for-speech-recognition-bfc</link>
      <guid>https://dev.to/makcedward/data-augmentation-for-speech-recognition-bfc</guid>
      <description>&lt;h4&gt;
  
  
  Automatic Speech Recognition (ASR)
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AiD9ioERFLy_BIX4f" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AiD9ioERFLy_BIX4f"&gt;&lt;/a&gt;Photo by &lt;a href="https://unsplash.com/@makcedward?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Edward Ma&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The objective of speech recognition is to convert audio to text, and the technology is applied widely in our lives. &lt;a href="https://en.wikipedia.org/wiki/Google_Assistant" rel="noopener noreferrer"&gt;Google Assistant&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Amazon_Alexa" rel="noopener noreferrer"&gt;Amazon Alexa&lt;/a&gt; are examples that take our voice as input and convert it to text to understand our intention.&lt;/p&gt;

&lt;p&gt;As with other NLP problems, one critical challenge is the lack of an adequate volume of training data, which leads to overfitting and difficulty handling unseen data. The &lt;a href="https://ai.google/research/teams/brain/" rel="noopener noreferrer"&gt;Google Brain&lt;/a&gt; team, together with an AI Resident, tackled this problem by introducing several data augmentation methods for speech recognition. This story discusses &lt;a href="https://arxiv.org/pdf/1904.08779.pdf" rel="noopener noreferrer"&gt;SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition&lt;/a&gt; (Park et al., 2019), and the following will be covered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data&lt;/li&gt;
&lt;li&gt;Architecture&lt;/li&gt;
&lt;li&gt;Experiment&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Data
&lt;/h3&gt;

&lt;p&gt;To process the data, the waveform audio is converted to a spectrogram, which is fed to the neural network to generate output. Traditionally, data augmentation is applied to the waveform; Park et al. take another approach and manipulate the spectrogram instead.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F640%2F0%2ADMFTlxD5-3rhmcks.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F640%2F0%2ADMFTlxD5-3rhmcks.png"&gt;&lt;/a&gt;Waveform audio to spectrogram (&lt;a href="https://ai.googleblog.com/2019/04/specaugment-new-data-augmentation.html" rel="noopener noreferrer"&gt;Google Brain&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Given a spectrogram, you can view it as an image whose x-axis is time and whose y-axis is frequency.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1000%2F0%2Ad3wpu740AzHNtsof.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1000%2F0%2Ad3wpu740AzHNtsof.png"&gt;&lt;/a&gt;Spectrogram representation (&lt;a href="https://librosa.github.io/librosa/generated/librosa.feature.melspectrogram.html" rel="noopener noreferrer"&gt;librosa&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Intuitively, this improves training speed: the augmentation is applied directly to the spectrogram, so no extra waveform-to-spectrogram transformation is needed for each augmented sample.&lt;/p&gt;

&lt;p&gt;Park et al. introduced SpecAugment for data augmentation in speech recognition. There are three basic operations: time warping, frequency masking and time masking. In their experiments, they combine these operations into four policies: LibriSpeech basic (LB), LibriSpeech double (LD), Switchboard mild (SM) and Switchboard strong (SS).&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;&lt;em&gt;Time Warping&lt;/em&gt;&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;A random point along the time axis is selected and warped either left or right by a distance w, chosen from a uniform distribution from 0 to the time warp parameter W.&lt;/p&gt;

&lt;h4&gt;
  
  
  Frequency Masking
&lt;/h4&gt;

&lt;p&gt;f consecutive frequency channels [f0, f0 + f) are masked. f is chosen from a uniform distribution from 0 to the frequency mask parameter F, and f0 is chosen from [0, ν − f), where ν is the number of frequency channels.&lt;/p&gt;

&lt;h4&gt;
  
  
  Time Masking
&lt;/h4&gt;

&lt;p&gt;t consecutive time steps [t0, t0 + t) are masked. t is chosen from a uniform distribution from 0 to the time mask parameter T, and t0 is chosen from [0, τ − t), where τ is the number of time steps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F504%2F1%2Axq6oahbJzFI9HdwY1fdWJw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F504%2F1%2Axq6oahbJzFI9HdwY1fdWJw.png"&gt;&lt;/a&gt;From top to bottom, the figures depict the log mel spectrogram of the base input with no augmentation, time warp, frequency masking and time masking applied. (Park et al., 2019)&lt;/p&gt;
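&lt;p&gt;The two masking operations above can be sketched in a few lines of NumPy. The function names are assumptions of this sketch, as is masking with zero (masking with the mean value of the spectrogram would work the same way).&lt;/p&gt;

```python
import numpy as np

# Sketch of SpecAugment's masking operations on a spectrogram of shape
# (num_freq_channels, num_time_steps). Parameter names F and T follow
# the paper; the implementation itself is a plain NumPy illustration.

def frequency_mask(spec, F, rng):
    """Mask f consecutive frequency channels, f ~ U(0, F)."""
    v = spec.shape[0]                  # number of frequency channels
    f = rng.integers(0, F + 1)
    f0 = rng.integers(0, v - f + 1)
    out = spec.copy()
    out[f0:f0 + f, :] = 0.0
    return out

def time_mask(spec, T, rng):
    """Mask t consecutive time steps, t ~ U(0, T)."""
    tau = spec.shape[1]                # number of time steps
    t = rng.integers(0, T + 1)
    t0 = rng.integers(0, tau - t + 1)
    out = spec.copy()
    out[:, t0:t0 + t] = 0.0
    return out
```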

&lt;h4&gt;
  
  
  Combination of basic augmentation policy
&lt;/h4&gt;

&lt;p&gt;By combining the frequency masking and time masking policies, four new augmentation policies are introduced. The symbols denote:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;W: Time Warping Parameter&lt;/li&gt;
&lt;li&gt;F: Frequency Masking Parameter&lt;/li&gt;
&lt;li&gt;mF: Number of frequency masks applied&lt;/li&gt;
&lt;li&gt;T: Time Masking Parameter&lt;/li&gt;
&lt;li&gt;mT: Number of time masks applied&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F521%2F1%2AxfOBzbams8Z27HLVvTRvFA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F521%2F1%2AxfOBzbams8Z27HLVvTRvFA.png"&gt;&lt;/a&gt;Configuration for LB, LD, SM and SS (Park et al., 2019)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F540%2F1%2APFcBVKIa4zWtcnwBjvy2NA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F540%2F1%2APFcBVKIa4zWtcnwBjvy2NA.png"&gt;&lt;/a&gt;From top to bottom, the figures depict the log mel spectrogram of the base input with policies None, LB and LD applied. (Park et al., 2019)&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Listen, Attend and Spell (LAS) Network Architecture
&lt;/h4&gt;

&lt;p&gt;Park et al. use the LAS network architecture to compare performance with and without data augmentation. It includes 2 layers of Convolutional Neural Network (CNN), attention and stacked bi-directional LSTMs. Since the focus of the paper is data augmentation and the model only serves to measure its impact, you can deep dive into LAS from &lt;a href="https://arxiv.org/pdf/1508.01211.pdf" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Learning Rate Schedules
&lt;/h4&gt;

&lt;p&gt;The learning rate schedule turns out to be a critical factor for model performance. Similar to &lt;a href="https://towardsdatascience.com/multi-task-learning-in-language-model-for-text-classification-c3acc1fedd89" rel="noopener noreferrer"&gt;Slanted triangular learning rates (STLR)&lt;/a&gt;, a non-static learning rate is applied: it ramps up, decays exponentially until it reaches 1/100 of its maximum value, and is then kept constant beyond that point. The parameters are denoted:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;sr: Step at which the ramp-up (from zero learning rate) is complete&lt;/li&gt;
&lt;li&gt;si: Step at which the exponential decay starts&lt;/li&gt;
&lt;li&gt;sf: Step at which the exponential decay stops&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Uniform label smoothing is also applied during training: the correct class label is assigned confidence 0.9, while the remaining probability mass is distributed uniformly over the other labels. One more parameter is denoted:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;snoise: Step at which variational weight noise is introduced&lt;/li&gt;
&lt;/ul&gt;
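&lt;p&gt;Uniform label smoothing as described above is a one-liner; this sketch builds the smoothed target distribution (the function name is an assumption of this sketch).&lt;/p&gt;

```python
# Sketch of uniform label smoothing: the correct class receives
# confidence 0.9, and the remaining 0.1 probability mass is spread
# uniformly over the other classes.

def smooth_labels(correct_idx, num_classes, confidence=0.9):
    rest = (1.0 - confidence) / (num_classes - 1)
    return [confidence if i == correct_idx else rest
            for i in range(num_classes)]
```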

&lt;p&gt;In the later experiments, three standard learning rate schedules are defined:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;B(asic): (sr, snoise, si, sf ) = (0.5k, 10k, 20k, 80k)&lt;/li&gt;
&lt;li&gt;D(ouble): (sr, snoise, si, sf ) = (1k, 20k, 40k, 160k)&lt;/li&gt;
&lt;li&gt;L(ong): (sr, snoise, si, sf ) = (1k, 20k, 140k, 320k)&lt;/li&gt;
&lt;/ol&gt;
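&lt;p&gt;The ramp-up / hold / decay behaviour described above can be sketched as follows. The linear ramp-up and the exact interpolation of the exponential decay are assumptions of this sketch; only the end points (peak value, decay to 1/100 of the peak at sf, constant afterwards) come from the text.&lt;/p&gt;

```python
import math

# Sketch of the schedule: linear warm-up to the peak until step sr,
# constant until si, exponential decay that reaches peak/100 at step sf,
# then constant at peak/100.

def lr_at(step, peak, sr, si, sf):
    if step < sr:                      # linear ramp-up from zero
        return peak * step / sr
    if step < si:                      # hold at the peak
        return peak
    if step < sf:                      # exponential decay toward peak/100
        frac = (step - si) / (sf - si)
        return peak * math.exp(math.log(0.01) * frac)
    return peak * 0.01                 # constant beyond sf
```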

&lt;h4&gt;
  
  
  Language Models (LM)
&lt;/h4&gt;

&lt;p&gt;An LM is applied to further boost model performance. In general, an LM is designed to predict the next token given the sequence of previous tokens; once a new token is predicted, it is treated as a "previous token" when predicting the following ones. This approach is applied in many modern NLP models such as &lt;a href="https://towardsdatascience.com/too-powerful-nlp-model-generative-pre-training-2-4cc6afb6655" rel="noopener noreferrer"&gt;GPT-2&lt;/a&gt;, while &lt;a href="https://towardsdatascience.com/how-bert-leverage-attention-mechanism-and-transformer-to-learn-word-contextual-relations-5bbee1b6dbdb" rel="noopener noreferrer"&gt;BERT&lt;/a&gt; trains on a masked variant of the task.&lt;/p&gt;

&lt;h3&gt;
  
  
  Experiment
&lt;/h3&gt;

&lt;p&gt;Model performance is measured by &lt;a href="https://en.wikipedia.org/wiki/Word_error_rate" rel="noopener noreferrer"&gt;Word Error Rate&lt;/a&gt; (WER).&lt;/p&gt;
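&lt;p&gt;WER is the word-level edit distance (substitutions, insertions and deletions) between the hypothesis and the reference, divided by the number of reference words. A minimal sketch:&lt;/p&gt;

```python
# Sketch of Word Error Rate: Levenshtein distance over words divided
# by the number of reference words.

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                   # delete all of ref[:i]
    for j in range(len(hyp) + 1):
        dp[0][j] = j                   # insert all of hyp[:j]
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / len(ref)
```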

&lt;p&gt;In the figure below, "Sch" denotes the learning rate schedule while "Pol" denotes the augmentation policy. We can see that LAS with 6 LSTM layers and a 1280-dimensional embedding achieves the best result.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F490%2F1%2AgAVZttVQlypTtT6Cbncqzw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F490%2F1%2AgAVZttVQlypTtT6Cbncqzw.png"&gt;&lt;/a&gt;Evaluation of LibriSpeech (Park et al., 2019)&lt;/p&gt;

&lt;p&gt;LAS-6–1280 with SpecAugment achieves the best result compared both to other models and to LAS without data augmentation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F491%2F1%2A74sHvUtwGvuN04nCZ9tfRg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F491%2F1%2A74sHvUtwGvuN04nCZ9tfRg.png"&gt;&lt;/a&gt;Comparing SpecAugment method in LibriSpeech 960h (Park et al., 2019)&lt;/p&gt;

&lt;p&gt;On Switchboard 300h, LAS-4–1024 is used as the benchmark. We can see that SpecAugment further boosts model performance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F485%2F1%2ATeKA8tmEsyNxBu-E0m9FbA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F485%2F1%2ATeKA8tmEsyNxBu-E0m9FbA.png"&gt;&lt;/a&gt;Comparing SpecAugment method in Switchboard 300h (Park et al., 2019)&lt;/p&gt;

&lt;h3&gt;
  
  
  Take Away
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Time warping did not improve model performance much; if resources are limited, this operation can be discarded.&lt;/li&gt;
&lt;li&gt;Label smoothing can make training unstable.&lt;/li&gt;
&lt;li&gt;Data augmentation converts an over-fitting problem into an under-fitting problem. From the figures below, you can see that the model without augmentation (None) performs nearly perfectly on the training set but shows no comparable result on the other datasets.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F445%2F1%2A210d_HNZ72-BAA0VUmCxyA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F445%2F1%2A210d_HNZ72-BAA0VUmCxyA.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;To facilitate data augmentation for speech recognition, &lt;a href="https://github.com/makcedward/nlpaug" rel="noopener noreferrer"&gt;nlpaug&lt;/a&gt; supports SpecAugment methods now.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  About Me
&lt;/h3&gt;

&lt;p&gt;I am a Data Scientist in the Bay Area, focusing on the state of the art in Data Science and Artificial Intelligence, especially NLP and platforms. Feel free to connect with &lt;a href="https://makcedward.github.io/" rel="noopener noreferrer"&gt;me&lt;/a&gt; on &lt;a href="https://www.linkedin.com/in/edwardma1026" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; or follow me on &lt;a href="http://medium.com/@makcedward/" rel="noopener noreferrer"&gt;Medium&lt;/a&gt; or &lt;a href="https://github.com/makcedward" rel="noopener noreferrer"&gt;Github&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Extension Reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/data-augmentation-in-nlp-2801a34dfc28" rel="noopener noreferrer"&gt;Data Augmentation in NLP&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/data-augmentation-library-for-text-9661736b13ff" rel="noopener noreferrer"&gt;Data Augmentation for Text&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/data-augmentation-for-audio-76912b01fdf6" rel="noopener noreferrer"&gt;Data Augmentation for Audio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ai.googleblog.com/2019/04/specaugment-new-data-augmentation.html" rel="noopener noreferrer"&gt;Official release of SpecAugment from Google&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/multi-task-learning-in-language-model-for-text-classification-c3acc1fedd89" rel="noopener noreferrer"&gt;Slanted triangular learning rates (STLR)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/how-bert-leverage-attention-mechanism-and-transformer-to-learn-word-contextual-relations-5bbee1b6dbdb" rel="noopener noreferrer"&gt;Bidirectional Encoder Representations from Transformers&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/too-powerful-nlp-model-generative-pre-training-2-4cc6afb6655" rel="noopener noreferrer"&gt;Generative Pre-Training 2&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Reference
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;D. S. Park, W. Chan, Y. Zhang, C. C. Chiu, B. Zoph, E. D. Cubuk and Q. V. Le. &lt;a href="https://arxiv.org/pdf/1904.08779.pdf" rel="noopener noreferrer"&gt;SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition&lt;/a&gt;. 2019&lt;/li&gt;
&lt;li&gt;W. Chan, N. Jaitly, Q. V. Le and O. Vinyals. &lt;a href="https://arxiv.org/pdf/1508.01211.pdf" rel="noopener noreferrer"&gt;Listen, Attend and Spell&lt;/a&gt;. 2015&lt;/li&gt;
&lt;/ul&gt;




</description>
      <category>datascience</category>
      <category>speechrecognition</category>
      <category>ai</category>
      <category>nlp</category>
    </item>
  </channel>
</rss>
