How to augment your dataset of texts

I needed to augment textual data, and tutorials on this topic are scarce, so I'm writing this post to share how I augmented my data using NLTK and Python.

The project

Our data is a set of emails, mostly written in French and English. I'm building a model that predicts whether an email corresponds to a website the user is subscribed to.
Hence we have 2 classes, represented by a boolean named isAccount.
However, our dataset is very unbalanced.

Generating new data is time-consuming because our data is tagged by hand, so data augmentation seems to be a good solution.
Since our model is basically looking for specific keywords, synonym replacement is a good way to create new useful data.

What is synonym replacement?

Synonym replacement is a method of data augmentation which consists of replacing words of a sentence with synonyms.
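
To make the idea concrete, here is a minimal sketch that does not use NLTK at all. The synonym table TOY_SYNONYMS and the helper replace_with_synonyms are hypothetical names made up for this illustration; the rest of the post gets its synonyms from wordnet instead.

```python
# Hypothetical hand-written synonym table, for illustration only.
TOY_SYNONYMS = {
    "subscribe": ["sign up", "register"],
}

def replace_with_synonyms(sentence, synonyms):
    """Return one new sentence per (word, synonym) pair found in the sentence."""
    words = sentence.split()
    new_sentences = []
    for position, word in enumerate(words):
        for synonym in synonyms.get(word.lower(), []):
            # swap only the word at this position, keep the rest unchanged
            replaced = words[:position] + [synonym] + words[position + 1:]
            new_sentences.append(" ".join(replaced))
    return new_sentences

print(replace_with_synonyms("Please subscribe to our newsletter", TOY_SYNONYMS))
# ['Please sign up to our newsletter', 'Please register to our newsletter']
```

Each generated sentence keeps the label of the sentence it came from, which is what makes the technique usable for growing a small class.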

NLTK's wordnet

Let's have a look at how to find synonyms using NLTK's wordnet:

import nltk
nltk.download('wordnet')
nltk.download('punkt')
from nltk.corpus import wordnet

Calling wordnet.synsets() on a word gives us a list of synsets.

Afterwards, we can get the words in each synset with lemma_names().

Hence I made this basic function to get all synonyms for any English word:

from collections import OrderedDict
from nltk.tokenize import word_tokenize

def find_synonyms(word):
  synonyms = []
  for synset in wordnet.synsets(word):
    for syn in synset.lemma_names():
      synonyms.append(syn)

  # drop duplicates while maintaining word order (closest synonyms come first)
  synonyms_without_duplicates = list(OrderedDict.fromkeys(synonyms))
  return synonyms_without_duplicates

The result for the word "subscribe" is:

['subscribe', 'sign', 'support', 'pledge', 'subscribe_to', 'take']

Generating new sentences

Some words have a lot of synonyms (50 for "support"!), so I only take the first 6 synonyms given by wordnet.
I also noticed that short words tend to have inadequate synonyms in context, like "iodine" for "I". Hence I ignore words of 3 characters or fewer.
Some synonyms are composed of several words separated by an underscore ('_'), so I replace this character with a space.
Here is my function generating new sentences by doing one-word replacements:

def create_set_of_new_sentences(sentence, max_syn_per_word=6):
  new_sentences = []
  for word in word_tokenize(sentence):
    if len(word) <= 3:  # short words tend to have inadequate synonyms
      continue
    for synonym in find_synonyms(word)[:max_syn_per_word]:
      synonym = synonym.replace('_', ' ')  # restore space character
      new_sentence = sentence.replace(word, synonym)
      new_sentences.append(new_sentence)
  return new_sentences
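
One thing to keep in mind: str.replace substitutes every occurrence of the word, including occurrences inside longer words. If that matters for your data, a possible variant is to replace whole words only. The helper below is a sketch, not part of my original code; replace_whole_word is a hypothetical name, and the regex word boundary (\b) is an assumption that works for plain alphabetic tokens but may behave differently on tokens made of punctuation.

```python
import re

def replace_whole_word(sentence, word, synonym):
    """Replace `word` with `synonym` only where `word` stands alone as a token."""
    pattern = r"\b" + re.escape(word) + r"\b"
    # A function replacement avoids backslashes in `synonym` being read as escapes.
    return re.sub(pattern, lambda match: synonym, sentence)

sentence = "support our supporters"
print(sentence.replace("support", "back"))              # back our backers
print(replace_whole_word(sentence, "support", "back"))  # back our supporters
```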

Augmenting the dataset

For those interested in how to merge the original data with the generated data, here is the function I wrote for that.
The argument 'column' specifies which field of your dataframe you want to augment.

import pandas as pd

def data_augment_synonym_replacement(data, column='subject'):
  generated_entries = []
  for index in data.index:
    text_to_augment = data.loc[index, column]
    for generated_sentence in create_set_of_new_sentences(text_to_augment):
      new_entry = data.loc[[index]].copy()  # one-row dataframe, keeps the label and other fields
      new_entry[column] = generated_sentence
      generated_entries.append(new_entry)

  if not generated_entries:  # nothing could be augmented
    return data.reset_index(drop=True)
  generated_data_df = pd.concat(generated_entries).drop_duplicates()
  augmented_data = pd.concat([data, generated_data_df], ignore_index=True)
  return augmented_data


My original dataset lacked data points where isAccount is False (only 30 lines!). By applying this data augmentation method I now have 298 emails of this class, multiplying the number of data points by about 10.
I noticed that this also scales down the impact of emails incorrectly marked as written in English: wordnet doesn't give synonyms for non-English words, so these data points are not augmented.

Possible weaknesses of my method

My method doesn't ensure that the structure of the sentence is preserved. For example, a verb can be replaced by a noun.

I haven't implemented a maximum number of sentences generated for each datapoint, hence my method will generate more data for longer sentences. This may cause overfitting.
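
One possible way to add such a maximum is to sample a fixed number of generated sentences per data point. This is only a sketch of the idea, not something I have run on my dataset; cap_generated_sentences, max_per_datapoint and seed are hypothetical names, and the default of 10 is an arbitrary choice.

```python
import random

def cap_generated_sentences(generated_sentences, max_per_datapoint=10, seed=0):
    """Keep at most max_per_datapoint sentences, sampled uniformly without replacement."""
    unique_sentences = list(dict.fromkeys(generated_sentences))  # drop duplicates, keep order
    if len(unique_sentences) <= max_per_datapoint:
        return unique_sentences
    rng = random.Random(seed)  # fixed seed so the augmentation is reproducible
    return rng.sample(unique_sentences, max_per_datapoint)

candidates = ["sentence %d" % i for i in range(50)]
print(len(cap_generated_sentences(candidates)))  # 10
```

The cap would be applied to the output of create_set_of_new_sentences before the new rows are built, so that long emails no longer dominate the augmented class.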


While looking for tools to perform data augmentation, I found TextAttack, defined by its authors as a Python framework for adversarial attacks and data augmentation in NLP.
I had compatibility errors when trying to use it on Google Colab, but it looks promising and worth looking into.
Taken from their documentation, here is the basic code to get it running:

!pip install textattack -q
from textattack.augmentation import WordNetAugmenter
augmenter = WordNetAugmenter()
s = 'What I cannot create, I do not understand.'
augmenter.augment(s)

The results seem similar to what I did with wordnet: far from perfect, but usable.
augmenter.augment(s) returns a big list. The best result in this list is 'What I cannot create, I do not comprehend.', but some outputs lose meaning or grammar, for example: 'What I cannot creating, I do not understand.'

Here's their GitHub repo


I hope this post will help someone to better understand data augmentation for text data.
If you have any feedback to give, I'd be grateful if you took a few minutes to comment!
I'm especially interested in ways to find synonyms in languages other than English.
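
One lead I have not tested on my data: NLTK's wordnet interface accepts a language code when the Open Multilingual Wordnet corpus ('omw-1.4') is downloaded. The sketch below assumes that setup; find_synonyms_in_language is a hypothetical helper, and it takes the wordnet object as a parameter (wn) so the logic can be read and tried without the corpus installed.

```python
def find_synonyms_in_language(word, lang, wn):
    """Collect lemma names in `lang` for every synset that `word` belongs to.

    `wn` is expected to behave like nltk.corpus.wordnet.
    """
    synonyms = []
    for synset in wn.synsets(word, lang=lang):
        synonyms.extend(synset.lemma_names(lang))
    return list(dict.fromkeys(synonyms))  # drop duplicates, keep order

# Usage with NLTK (assumes the 'wordnet' and 'omw-1.4' corpora are available):
#   import nltk
#   nltk.download('wordnet')
#   nltk.download('omw-1.4')
#   from nltk.corpus import wordnet
#   find_synonyms_in_language('abonnement', 'fra', wordnet)
```

Coverage varies a lot between languages, so a given word may return few or no synonyms.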

Top comments (1)

Waylon Walker • Edited

Feedback as requested.

Good job

  • breaking into sections with headings
  • Using syntax highlighting for code examples
  • Documenting a real problem you solved


  • The intro was a bit unclear and did not pull me in right away
  • No need to talk about not finding a tutorial, tell us what you're doing
  • Pull code comment blocks into the article
  • Tighten up code samples, as they can be hard to read through on a phone.
  • I really enjoyed reading the solutions at the end, but it was cumbersome to read and I almost skipped right past it, as it was one line.
  • It seems linked to your last post; make it a series and use the liquid tags to link back to it inline
