When working on Natural Language Processing applications such as text classification, manually collecting enough labeled examples for each category can be difficult. In this article, I will go over back translation, an interesting technique for automatically augmenting your existing text data.
The key idea of back translation is very simple. We create an augmented version of a sentence using the following steps:
- Take the original text written in English
- Convert it into another language (say French) using Google Translate
- Convert the translated text back into English using Google Translate
- Keep the augmented text if the original text and the back-translated text are different
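The steps above can be sketched in Python. This is a minimal sketch rather than a definitive implementation: `back_translate` and the `translate` callable are hypothetical names, and the dictionary-based `toy_translate` is only a stand-in for a real machine translation service.

```python
from typing import Callable, Optional

# Hypothetical signature: translate(text, source_lang, target_lang) -> translated text.
Translator = Callable[[str, str, str], str]


def back_translate(text: str, translate: Translator, pivot: str = "fr") -> Optional[str]:
    """Return a back-translated variant of `text`, or None if nothing changed."""
    pivot_text = translate(text, "en", pivot)        # English -> pivot language
    round_trip = translate(pivot_text, pivot, "en")  # pivot language -> English
    # Keep the result only if it differs from the original text.
    if round_trip.strip().lower() == text.strip().lower():
        return None
    return round_trip


# Toy stand-in for a real translation service (assumption, for illustration only).
_TOY_TABLE = {
    ("en", "fr", "This movie was great"): "Ce film était génial",
    ("fr", "en", "Ce film était génial"): "This film was awesome",
    ("en", "fr", "Hello"): "Bonjour",
    ("fr", "en", "Bonjour"): "Hello",
}


def toy_translate(text: str, source: str, target: str) -> str:
    return _TOY_TABLE[(source, target, text)]


augmented = back_translate("This movie was great", toy_translate)
unchanged = back_translate("Hello", toy_translate)
print(augmented)  # a new variant of the sentence
print(unchanged)  # None, because the round trip returned the same text
```

Passing the translator in as a function keeps the sketch independent of any particular service, so the same logic works whichever translation backend you plug in.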
We need a machine translation service to perform the translation to a different language and back to English. Google Translate is the most popular service for this purpose, but using it programmatically requires an API key, and the API is a paid service.
Luckily, Google provides a handy feature in their Google Sheets web app, which we can leverage for our purpose.
Let's assume we are building a sentiment analysis model and our dataset has sentences and their associated labels. We can load it into Google Sheets by importing the Excel/CSV file directly.
Add a new column and use the GOOGLETRANSLATE() function to translate from English to French and back to English.
The command to place in the column is:
=GOOGLETRANSLATE(GOOGLETRANSLATE(A2, "en", "fr"), "fr", "en")
Once the command is placed, press Enter and you will see the translation.
Now, select the first cell of the "Backtranslated" column and drag the small square at its bottom-right corner down to apply this formula to the whole column.
This applies the formula to all your training texts and gives you an augmented version of each.
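If you want to experiment with pivot languages other than French, the nested formula can be generated rather than typed by hand. The sketch below only builds the formula text to paste into a sheet. `backtranslation_formula` is a hypothetical helper name, and the cell references assume the original text sits in column A starting at row 2, as in the formula above.

```python
def backtranslation_formula(cell: str, pivot: str = "fr", source: str = "en") -> str:
    """Build a nested GOOGLETRANSLATE formula: source -> pivot -> source."""
    inner = f'GOOGLETRANSLATE({cell}, "{source}", "{pivot}")'
    return f'=GOOGLETRANSLATE({inner}, "{pivot}", "{source}")'


# Formulas for rows 2 to 4 with French as the pivot language.
french_formulas = [backtranslation_formula(f"A{row}") for row in range(2, 5)]

# The same cell with German as the pivot language instead.
german_formula = backtranslation_formula("A2", pivot="de")

for formula in french_formulas:
    print(formula)
print(german_formula)
```

Using several pivot languages for the same sentence can produce several distinct paraphrases, which you can then filter in the same way as the French ones.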
For texts where the original text and what we get back from back translation are the same, we can filter them out programmatically by comparing the original text column with the augmented column to check whether they differ. Then, only keep the rows that have a True value in the comparison column.
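The same comparison can also be done outside the spreadsheet once the data is exported. The following is a minimal sketch using only the standard library. The column names `Text` and `Label` are assumptions about the exported file, and ignoring case and extra whitespace when comparing is a design choice of this sketch, not something the spreadsheet does by itself.

```python
import csv
import io


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def keep_changed(rows, original_col="Text", augmented_col="Backtranslated"):
    """Keep only the rows whose back-translated text differs from the original."""
    return [
        row for row in rows
        if normalize(row[original_col]) != normalize(row[augmented_col])
    ]


# Small inline sample standing in for the exported CSV file (made-up rows).
sample_csv = """Text,Label,Backtranslated
I loved this movie,positive,I adored this film
Terrible service,negative,Terrible service
It was fine,neutral,It  was fine
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
changed = keep_changed(rows)
for row in changed:
    print(row["Text"], "->", row["Backtranslated"])
```

In this sample only the first row survives: the second is identical, and the third differs only by a doubled space.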
You can download your data as a CSV file and augment your existing training data.
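Once the sheet is downloaded, the back-translated sentences can be appended to the original examples under the same label. Here is a rough sketch of that merge. `build_augmented_dataset` is a hypothetical helper, the `Text` and `Label` column names are assumptions about your file, and the rows are made-up sample data.

```python
import csv
import io


def build_augmented_dataset(rows, text_col="Text", label_col="Label",
                            augmented_col="Backtranslated"):
    """Combine original and back-translated texts, skipping exact duplicates."""
    combined = []
    seen = set()
    for row in rows:
        # Each row contributes its original text and its back-translated text.
        for text in (row[text_col], row[augmented_col]):
            key = (text.strip(), row[label_col])
            if key[0] and key not in seen:
                seen.add(key)
                combined.append({text_col: key[0], label_col: row[label_col]})
    return combined


# Made-up rows standing in for the downloaded CSV.
rows = [
    {"Text": "I loved this movie", "Label": "positive",
     "Backtranslated": "I adored this film"},
    {"Text": "Terrible service", "Label": "negative",
     "Backtranslated": "Terrible service"},
]

dataset = build_augmented_dataset(rows)

# Write the combined data back out as CSV (to a string here, for illustration).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Text", "Label"])
writer.writeheader()
writer.writerows(dataset)
csv_text = buffer.getvalue()
print(csv_text)
```

Because duplicates are skipped, a sentence that came back unchanged is not added twice, so the merge is safe to run even before filtering.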
Here is a Google Sheet demonstrating all four steps above. You can refer to it and make a copy to test things out.
Back translation offers an interesting approach when you have only a small amount of training data but want to improve the performance of your model.