<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Gabrielle</title>
    <description>The latest articles on DEV Community by Gabrielle (@veganaise).</description>
    <link>https://dev.to/veganaise</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F81987%2Fadcde22d-d229-435a-8b8b-ff4d891e9807.jpg</url>
      <title>DEV Community: Gabrielle</title>
      <link>https://dev.to/veganaise</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/veganaise"/>
    <language>en</language>
    <item>
      <title>How to augment your dataset of texts</title>
      <dc:creator>Gabrielle</dc:creator>
      <pubDate>Mon, 06 Jul 2020 14:10:34 +0000</pubDate>
      <link>https://dev.to/veganaise/text-data-augmentation-synonym-replacement-4h8l</link>
      <guid>https://dev.to/veganaise/text-data-augmentation-synonym-replacement-4h8l</guid>
<description>&lt;p&gt;I needed to augment textual data, and tutorials on this topic are scarce, so I'm writing this post to share how I augmented my data using &lt;a href="https://www.nltk.org/" rel="noopener noreferrer"&gt;NLTK&lt;/a&gt; and Python.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;This article is part of a series about machine learning for &lt;a href="https://kormos.fr/" rel="noopener noreferrer"&gt;Kormos&lt;/a&gt;
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://dev.to/ton_ami/ml-and-text-processing-on-emails-4f6p"&gt;ML and text processing on emails&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://dev.to/ton_ami/text-data-augmentation-synonym-replacement-4h8l"&gt;text data augmentation: synonym replacement (you are here)&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The project
&lt;/h2&gt;

&lt;p&gt;Our data is a set of emails mostly written in French and English. I'm building a model that predicts whether an email corresponds to a website the user is subscribed to. &lt;br&gt;
Hence we have two classes, represented by a boolean named isAccount.&lt;br&gt;
However, our dataset is very imbalanced:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fdb6q9m3oxw6jr2z0li4k.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fdb6q9m3oxw6jr2z0li4k.PNG" alt="Alt Text" width="421" height="273"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Generating new data is time-consuming because our data is tagged by hand, so data augmentation seems to be a good solution.&lt;br&gt;
Since our model is essentially looking for specific keywords, synonym replacement is a good way to create new, useful data.&lt;/p&gt;
&lt;h1&gt;
  
  
  What is synonym replacement?
&lt;/h1&gt;

&lt;p&gt;Synonym replacement is a method of data augmentation which consists of replacing words of a sentence with synonyms. &lt;/p&gt;
&lt;h2&gt;
  
  
  NLTK's WordNet
&lt;/h2&gt;

&lt;p&gt;Let's have a look at how to find synonyms using NLTK's WordNet.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;nltk&lt;/span&gt;
&lt;span class="n"&gt;nltk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;download&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wordnet&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;nltk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;download&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;punkt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;nltk.corpus&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;wordnet&lt;/span&gt;
&lt;span class="n"&gt;wordnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;synsets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subscribe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;gives us a list of synsets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Synset('subscribe.v.01'),
 Synset('sign.v.01'),
 Synset('subscribe.v.03'),
 Synset('pledge.v.02'),
 Synset('subscribe.v.05')]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Afterwards, we can get the words in each synset with lemma_names().&lt;/p&gt;
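
&lt;p&gt;For example (using the synsets from above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;for synset in wordnet.synsets("subscribe"):
    # lemma_names() returns the words that belong to this synset
    print(synset.name(), synset.lemma_names())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;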

&lt;p&gt;Hence I made this basic function to get all synonyms for any English word:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OrderedDict&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;nltk.tokenize&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;word_tokenize&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;find_synonyms&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;synonyms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;synset&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;wordnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;synsets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;syn&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;synset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lemma_names&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
      &lt;span class="n"&gt;synonyms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;syn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="c1"&gt;# using this to drop duplicates while maintaining word order (closest synonyms come first)
&lt;/span&gt;  &lt;span class="n"&gt;synonyms_without_duplicates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OrderedDict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fromkeys&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;synonyms&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;synonyms_without_duplicates&lt;/span&gt;


&lt;span class="nf"&gt;find_synonyms&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subscribe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result for the word "subscribe" is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;['subscribe', 'sign', 'support', 'pledge', 'subscribe_to', 'take']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Generating new sentences
&lt;/h2&gt;

&lt;p&gt;Some words have a lot of synonyms (50 for "support"!), hence I only take the first 6 synonyms given by WordNet.&lt;br&gt;
I also noticed that short words tend to have inadequate synonyms (in context), like "iodine" for "I". Hence I ignore words of 3 characters or fewer.&lt;br&gt;
Some synonyms are composed of several words separated by an underscore ('_'), so I replace that character with a space.&lt;br&gt;
Here is my function generating new sentences by doing one-word replacements:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_set_of_new_sentences&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_syn_per_word&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;new_sentences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;word_tokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;continue&lt;/span&gt; 
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;synonym&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;find_synonyms&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;max_syn_per_word&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
      &lt;span class="n"&gt;synonym&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;synonym&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;#restore space character
&lt;/span&gt;      &lt;span class="n"&gt;new_sentence&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;synonym&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="n"&gt;new_sentences&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_sentence&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;new_sentences&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
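
&lt;p&gt;A quick check (a sketch; the exact synonyms you get depend on your WordNet data):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;for new_sentence in create_set_of_new_sentences("Please confirm your subscription")[:3]:
    print(new_sentence)
# one-word variants such as 'Please corroborate your subscription'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;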



&lt;h2&gt;
  
  
  Augmenting the dataset
&lt;/h2&gt;

&lt;p&gt;For those interested in how to merge the original data with the generated data, here is the function I wrote for that.&lt;br&gt;
The argument 'column' specifies which field of your dataframe you want to augment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;data_augment_synonym_replacement&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;subject&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;generated_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;([],&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;text_to_augment&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;generated_sentence&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;create_set_of_new_sentences&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text_to_augment&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
      &lt;span class="n"&gt;new_entry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
      &lt;span class="n"&gt;new_entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;generated_sentence&lt;/span&gt;
      &lt;span class="n"&gt;generated_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;generated_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_entry&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="n"&gt;generated_data_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;generated_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop_duplicates&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="n"&gt;augmented_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[:],&lt;/span&gt;&lt;span class="n"&gt;generated_data_df&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;ignore_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;augmented_data&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
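
&lt;p&gt;For example, to augment the under-represented class (a sketch; 'emails' is a hypothetical DataFrame with 'subject' and 'isAccount' columns, not my actual variable name):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;minority = emails[emails["isAccount"] == False]  # hypothetical DataFrame
augmented = data_augment_synonym_replacement(minority, column="subject")
# note: DataFrame.append, used above, was removed in pandas 2.0;
# on recent pandas, collect the new entries in a list and pd.concat them instead
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;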



&lt;h1&gt;
  
  
  Results
&lt;/h1&gt;

&lt;p&gt;My original dataset lacked data points where isAccount is False (only 30 lines!). By applying this data augmentation method I now have 298 emails of this class, hence multiplying the number of data points by roughly 10.&lt;br&gt;
I noticed that this scales down the impact of emails incorrectly marked as written in English, because WordNet doesn't give synonyms for non-English words. Hence these data points are not augmented.&lt;/p&gt;
&lt;h1&gt;
  
  
  Possible weaknesses of my method
&lt;/h1&gt;

&lt;p&gt;My method doesn't ensure that the structure of the sentence is preserved. For example, a verb can be replaced by a noun.&lt;/p&gt;

&lt;p&gt;I haven't implemented a maximum number of generated sentences per data point, hence my method will generate more data for longer sentences. This may cause overfitting.&lt;/p&gt;
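
&lt;p&gt;A simple mitigation (a sketch, not something I have implemented) would be to cap the number of generated sentences and sample among the candidates:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import random

def capped_sentences(sentence, max_new=20):
    # hypothetical helper: keep at most max_new variants per data point
    candidates = create_set_of_new_sentences(sentence)
    if len(candidates) &amp;lt;= max_new:
        return candidates
    return random.sample(candidates, max_new)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;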
&lt;h1&gt;
  
  
  TextAttack
&lt;/h1&gt;

&lt;p&gt;While looking for tools to perform data augmentation, I found TextAttack, defined by its authors as a Python framework for adversarial attacks and data augmentation in NLP.&lt;br&gt;
I had compatibility errors when trying to use it on my Google Colab, but it is promising and worth looking into.&lt;br&gt;
Taken from their documentation, here is the basic code to get it running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;textattack&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;textattack.augmentation&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;WordNetAugmenter&lt;/span&gt;
&lt;span class="n"&gt;augmenter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;WordNetAugmenter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;What I cannot create, I do not understand.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;augmenter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;augment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The results seem similar to what I have done with WordNet: far from perfect but usable. &lt;br&gt;
augmenter.augment(s) returns a big list. The best result in this list is &lt;em&gt;'What I cannot create, I do not comprehend.'&lt;/em&gt;, but we can see that some meaning is lost elsewhere, for example: &lt;em&gt;'What I cannot creating, I do not understand.'&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/QData/TextAttack" rel="noopener noreferrer"&gt;Here's their Github repo&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Afterword
&lt;/h1&gt;

&lt;p&gt;I hope this post will help someone better understand data augmentation for text data. &lt;br&gt;
If you have any feedback to give, I'd be grateful if you took a few minutes to comment!&lt;br&gt;
I'm especially interested in ways to find synonyms in languages other than English. &lt;/p&gt;

</description>
      <category>beginners</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>dataaugmentation</category>
    </item>
    <item>
      <title>ML and text processing on emails</title>
      <dc:creator>Gabrielle</dc:creator>
      <pubDate>Mon, 22 Jun 2020 07:04:44 +0000</pubDate>
      <link>https://dev.to/veganaise/ml-and-text-processing-on-emails-4f6p</link>
      <guid>https://dev.to/veganaise/ml-and-text-processing-on-emails-4f6p</guid>
      <description>&lt;p&gt;I'm a software engineering student and this is my first blog post! I'm writing this to seek feedback, improve my technical writing skills and, hopefully, provide insights on text processing with Machine Learning.&lt;/p&gt;

&lt;p&gt;I'm currently tasked to do machine learning for Kormos, the startup I'm working with.&lt;/p&gt;

&lt;h1&gt;
  
  
  Our project
&lt;/h1&gt;

&lt;p&gt;We are trying to find all the websites a user is subscribed to by looking at their emails. For that we have a database of emails, four thousand of which are human-tagged. This tag is referred to as 'isAccount' and is true when the email was sent from a website the user is subscribed to.&lt;/p&gt;

&lt;p&gt;The tagged emails were selected based on keywords in their body field. Such keywords are related to "account creation" or "email verification".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fqy3o4q6oo1vkmhof2rt1.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fqy3o4q6oo1vkmhof2rt1.PNG" alt="Alt Text" width="461" height="273"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This results in an imbalanced data set.&lt;/p&gt;

&lt;p&gt;For this project we're focusing on these fields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Subject&lt;/li&gt;
&lt;li&gt;Body&lt;/li&gt;
&lt;li&gt;senderDomain: the domain of the sender (e.g. "kormos.com")&lt;/li&gt;
&lt;li&gt;langCode: the predicted language&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fhvz6i4r2bljf0gyowl2t.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fhvz6i4r2bljf0gyowl2t.PNG" alt="Alt Text" width="435" height="272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We mostly have French emails, so we're only considering French emails from now on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical decisions
&lt;/h2&gt;

&lt;p&gt;I'm using Python on Google Colab. &lt;/p&gt;

&lt;p&gt;I'm doing Machine Learning using scikit-learn.&lt;/p&gt;

&lt;p&gt;I experimented with spaCy and am considering using it to extract features from the body of emails. I'm thinking about extracting usernames or names of organizations. &lt;/p&gt;

&lt;h1&gt;
  
  
  Processing text
&lt;/h1&gt;

&lt;p&gt;I started training my model only on the subject field.&lt;/p&gt;

&lt;h3&gt;
  
  
  Vectorizing our text data
&lt;/h3&gt;

&lt;p&gt;I'm using scikit's TfidfVectorizer, which is equivalent to a CountVectorizer followed by a TfidfTransformer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;nltk&lt;/span&gt;
&lt;span class="n"&gt;nltk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;download&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;stopwords&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;nltk.corpus&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;stopwords&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.feature_extraction.text&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TfidfVectorizer&lt;/span&gt;

&lt;span class="n"&gt;tfidfV&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;TfidfVectorizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;stop_words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stopwords&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;words&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;french&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="n"&gt;max_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;corpus_bow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tfidfV&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_fr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subject&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What it does is build a vocabulary of the most common words, ignoring the stop words (frequent words of little value, like "I"). &lt;br&gt;
The size of the vocabulary is, at most, equal to max_features.&lt;br&gt;
Based on this vocabulary, each text input is transformed into a vector of dimension max_features.&lt;br&gt;
Basically, if "confirm" is the n-th word of the vocabulary, then the n-th dimension of the output vector counts the occurrences of the word "confirm". &lt;/p&gt;

&lt;p&gt;Hence we have a count matrix: numerical values instead of text.&lt;/p&gt;
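
&lt;p&gt;You can inspect the resulting vocabulary directly (get_feature_names_out on recent scikit-learn versions; older versions call it get_feature_names):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;print(tfidfV.get_feature_names_out())  # at most 15 words, one per vector dimension
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;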
&lt;h3&gt;
  
  
  Tf-idf transformer
&lt;/h3&gt;

&lt;p&gt;This step transforms the count matrix into a normalized term-frequency representation. &lt;br&gt;
It scales down the impact of words that appear very frequently.&lt;/p&gt;
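
&lt;p&gt;To see the scaling in isolation, here is a toy example (a sketch with made-up sentences, separate from the email data):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

toy = ["confirm your account", "confirm confirm your order"]
counts = CountVectorizer().fit_transform(toy)
weights = TfidfTransformer().fit_transform(counts)
print(weights.toarray())  # each row is an L2-normalized tf-idf vector
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;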
&lt;h3&gt;
  
  
  Splitting our dataset
&lt;/h3&gt;

&lt;p&gt;I use scikit to divide my data into two groups: one to train my model and the other to test it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data_fr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;isAccount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;train_X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val_X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val_y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;corpus_bow&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I set random_state to an arbitrary number to fix the seed of the random number generator, hence making my results stable across different executions of my code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Training our model
&lt;/h2&gt;

&lt;p&gt;The model I'm using is scikit's RandomForestClassifier, because I understand it. It trains a number of decision tree classifiers and aggregates their predictions.&lt;br&gt;
There are just so many models you can choose from.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val_X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Results
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;confusion matrix :
[[ 18  29]
 [ 10 716]]
accuracy = 94.955%
precision = 96.107%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
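
&lt;p&gt;For reference, these numbers can be computed with scikit-learn's metrics module (a sketch, not necessarily the exact code I ran):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.metrics import confusion_matrix, accuracy_score, precision_score

print(confusion_matrix(val_y, y_pred))
print(accuracy_score(val_y, y_pred))   # fraction of correct predictions
print(precision_score(val_y, y_pred))  # true positives / predicted positives
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;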



&lt;p&gt;We get good results; however, this is partly due to the imbalance in the distribution of the classes. &lt;/p&gt;

&lt;p&gt;With these predictions, we can easily create a list of unique sender domains the user is predicted to be subscribed to.&lt;/p&gt;

&lt;p&gt;I filter the list of domains by removing the ones not present in Alexa's top 1 million domains, hopefully filtering out any scams.&lt;/p&gt;
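
&lt;p&gt;The filtering itself is a simple set membership test (a sketch; the file path and the 'predicted_domains' set are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import csv

# the Alexa top-1M list is distributed as a CSV of (rank, domain) rows
with open("alexa-top-1m.csv") as f:
    alexa_domains = {row[1] for row in csv.reader(f)}

# keep only predicted subscription domains that appear in the list
trusted_domains = {d for d in predicted_domains if d in alexa_domains}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;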

&lt;h1&gt;
  
  
  Conclusion - How to make it better?
&lt;/h1&gt;

&lt;p&gt;Assuming that the tagged data is correct and representative of future users, I believe that the model is good enough to be used.&lt;/p&gt;

&lt;p&gt;However, I wonder whether removing data where isAccount is True would be an effective way to improve the model. The cost of that strategy would be training the model on a much smaller data set.&lt;/p&gt;

&lt;p&gt;I have also been informed that data augmentation could be useful in this situation.&lt;/p&gt;

&lt;p&gt;Please feel free to give feedback!&lt;br&gt;
I can give additional information about any step of the process.&lt;/p&gt;

&lt;p&gt;Thanks to scikit-learn and pandas for their documentation.&lt;br&gt;
Thanks to Tancrède Suard and Kormos for their work on the dataset. &lt;/p&gt;

</description>
      <category>python</category>
      <category>scikit</category>
      <category>discuss</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
