<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Gabrielle</title>
    <description>The latest articles on DEV Community by Gabrielle (@veganaise).</description>
    <link>https://dev.to/veganaise</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F81987%2Fadcde22d-d229-435a-8b8b-ff4d891e9807.jpg</url>
      <title>DEV Community: Gabrielle</title>
      <link>https://dev.to/veganaise</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/veganaise"/>
    <language>en</language>
    <item>
      <title>How to augment your dataset of texts</title>
      <dc:creator>Gabrielle</dc:creator>
      <pubDate>Mon, 06 Jul 2020 14:10:34 +0000</pubDate>
      <link>https://dev.to/veganaise/text-data-augmentation-synonym-replacement-4h8l</link>
      <guid>https://dev.to/veganaise/text-data-augmentation-synonym-replacement-4h8l</guid>
<description>&lt;p&gt;I needed to augment textual data, and tutorials on this topic are scarce, so I'm writing this post to share how I augmented my data using &lt;a href="https://www.nltk.org/" rel="noopener noreferrer"&gt;NLTK&lt;/a&gt; and Python.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;This article is part of a series about machine learning for &lt;a href="https://kormos.fr/" rel="noopener noreferrer"&gt;Kormos&lt;/a&gt;
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://dev.to/ton_ami/ml-and-text-processing-on-emails-4f6p"&gt;ML and text processing on emails&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://dev.to/ton_ami/text-data-augmentation-synonym-replacement-4h8l"&gt;text data augmentation: synonym replacement (you are here)&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  The project
&lt;/h2&gt;

&lt;p&gt;Our data is a set of emails mostly written in French and English. I'm building a model that predicts whether an email corresponds to a website the user is subscribed to. &lt;br&gt;
Hence we have two classes, represented by a boolean named isAccount.&lt;br&gt;
However, our dataset is very imbalanced:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fdb6q9m3oxw6jr2z0li4k.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fdb6q9m3oxw6jr2z0li4k.PNG" alt="Alt Text" width="421" height="273"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Generating new data is time-consuming because our data is tagged by hand, so data augmentation seems to be a good solution.&lt;br&gt;
Since our model is essentially looking for specific keywords, synonym replacement is a good way to create new, useful data.&lt;/p&gt;
&lt;h1&gt;
  
  
  What is synonym replacement?
&lt;/h1&gt;

&lt;p&gt;Synonym replacement is a method of data augmentation which consists of replacing words of a sentence with synonyms. &lt;/p&gt;
&lt;h2&gt;
  
  
  NLTK's WordNet
&lt;/h2&gt;

&lt;p&gt;Let's have a look at how to find synonyms using NLTK's WordNet.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;nltk&lt;/span&gt;
&lt;span class="n"&gt;nltk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;download&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;wordnet&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;nltk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;download&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;punkt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;nltk.corpus&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;wordnet&lt;/span&gt;
&lt;span class="n"&gt;wordnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;synsets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subscribe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;gives us a list of synsets:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Synset('subscribe.v.01'),
 Synset('sign.v.01'),
 Synset('subscribe.v.03'),
 Synset('pledge.v.02'),
 Synset('subscribe.v.05')]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Afterwards, we can get the words in each synset with lemma_names().&lt;/p&gt;
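
&lt;p&gt;For example (using the synsets from above):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;for synset in wordnet.synsets("subscribe"):
    # lemma_names() returns the words that belong to this synset
    print(synset.name(), synset.lemma_names())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;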

&lt;p&gt;Hence I made this basic function to get all synonyms for any English word:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OrderedDict&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;nltk.tokenize&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;word_tokenize&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;find_synonyms&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;synonyms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;synset&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;wordnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;synsets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;syn&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;synset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lemma_names&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
      &lt;span class="n"&gt;synonyms&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;syn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="c1"&gt;# using this to drop duplicates while maintaining word order (closest synonyms come first)
&lt;/span&gt;  &lt;span class="n"&gt;synonyms_without_duplicates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;OrderedDict&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fromkeys&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;synonyms&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;synonyms_without_duplicates&lt;/span&gt;


&lt;span class="nf"&gt;find_synonyms&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subscribe&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The result for the word "subscribe" is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;['subscribe', 'sign', 'support', 'pledge', 'subscribe_to', 'take']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Generating new sentences
&lt;/h2&gt;

&lt;p&gt;Some words have a lot of synonyms (50 for "support"!), hence I only take the first 6 synonyms given by WordNet.&lt;br&gt;
I also noticed that short words tend to have inadequate synonyms (in context), like "iodine" for "I". Hence I ignore words of 3 characters or fewer.&lt;br&gt;
Some synonyms are composed of several words separated by an underscore ('_'), so I replace that character with a space.&lt;br&gt;
Here is my function generating new sentences by doing one-word replacements:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_set_of_new_sentences&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_syn_per_word&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;6&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;new_sentences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;word_tokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;continue&lt;/span&gt; 
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;synonym&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;find_synonyms&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;max_syn_per_word&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
      &lt;span class="n"&gt;synonym&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;synonym&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;_&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;#restore space character
&lt;/span&gt;      &lt;span class="n"&gt;new_sentence&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;synonym&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="n"&gt;new_sentences&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_sentence&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;new_sentences&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
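
&lt;p&gt;A quick check (a sketch; the exact synonyms you get depend on your WordNet data):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;for new_sentence in create_set_of_new_sentences("Please confirm your subscription")[:3]:
    print(new_sentence)
# one-word variants such as 'Please corroborate your subscription'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;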



&lt;h2&gt;
  
  
  Augmenting the dataset
&lt;/h2&gt;

&lt;p&gt;For those interested in how to merge the original data with the generated data, here is the function I wrote for that.&lt;br&gt;
The argument 'column' specifies which field of your dataframe you want to augment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;data_augment_synonym_replacement&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;subject&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;generated_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;([],&lt;/span&gt; &lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;columns&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;text_to_augment&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;generated_sentence&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;create_set_of_new_sentences&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text_to_augment&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
      &lt;span class="n"&gt;new_entry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;  &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;
      &lt;span class="n"&gt;new_entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;column&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;generated_sentence&lt;/span&gt;
      &lt;span class="n"&gt;generated_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;generated_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_entry&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="n"&gt;generated_data_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;generated_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;drop_duplicates&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="n"&gt;augmented_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;[:],&lt;/span&gt;&lt;span class="n"&gt;generated_data_df&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;ignore_index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;augmented_data&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
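
&lt;p&gt;For example, to augment the under-represented class (a sketch; 'emails' is a hypothetical DataFrame with 'subject' and 'isAccount' columns, not my actual variable name):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;minority = emails[emails["isAccount"] == False]  # hypothetical DataFrame
augmented = data_augment_synonym_replacement(minority, column="subject")
# note: DataFrame.append, used above, was removed in pandas 2.0;
# on recent pandas, collect the new entries in a list and pd.concat them instead
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;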



&lt;h1&gt;
  
  
  Results
&lt;/h1&gt;

&lt;p&gt;My original dataset lacked data points where isAccount is False (only 30 lines!). By applying this data augmentation method I now have 298 emails of this class, hence multiplying the number of data points by roughly 10.&lt;br&gt;
I noticed that this scales down the impact of emails incorrectly marked as written in English, because WordNet doesn't give synonyms for non-English words. Hence these data points are not augmented.&lt;/p&gt;
&lt;h1&gt;
  
  
  Possible weaknesses of my method
&lt;/h1&gt;

&lt;p&gt;My method doesn't ensure that the structure of the sentence is preserved. For example, a verb can be replaced by a noun.&lt;/p&gt;

&lt;p&gt;I haven't implemented a maximum number of generated sentences per data point, hence my method will generate more data for longer sentences. This may cause overfitting.&lt;/p&gt;
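
&lt;p&gt;A simple mitigation (a sketch, not something I have implemented) would be to cap the number of generated sentences and sample among the candidates:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import random

def capped_sentences(sentence, max_new=20):
    # hypothetical helper: keep at most max_new variants per data point
    candidates = create_set_of_new_sentences(sentence)
    if len(candidates) &amp;lt;= max_new:
        return candidates
    return random.sample(candidates, max_new)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;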
&lt;h1&gt;
  
  
  TextAttack
&lt;/h1&gt;

&lt;p&gt;While looking for tools to perform data augmentation, I found TextAttack, defined by its authors as a Python framework for adversarial attacks and data augmentation in NLP.&lt;br&gt;
I had compatibility errors when trying to use it on my Google Colab, but it is promising and worth looking into.&lt;br&gt;
Taken from their documentation, here is the basic code to get it running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="err"&gt;!&lt;/span&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="n"&gt;textattack&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;q&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;textattack.augmentation&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;WordNetAugmenter&lt;/span&gt;
&lt;span class="n"&gt;augmenter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;WordNetAugmenter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;What I cannot create, I do not understand.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="n"&gt;augmenter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;augment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The results seem similar to what I have done with WordNet: far from perfect but usable. &lt;br&gt;
augmenter.augment(s) returns a big list. The best result in this list is &lt;em&gt;'What I cannot create, I do not comprehend.'&lt;/em&gt;, but we can see that some meaning is lost elsewhere, for example: &lt;em&gt;'What I cannot creating, I do not understand.'&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/QData/TextAttack" rel="noopener noreferrer"&gt;Here's their Github repo&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Afterword
&lt;/h1&gt;

&lt;p&gt;I hope this post will help someone better understand data augmentation for text data. &lt;br&gt;
If you have any feedback to give, I'd be grateful if you took a few minutes to comment!&lt;br&gt;
I'm especially interested in ways to find synonyms in languages other than English. &lt;/p&gt;

</description>
      <category>beginners</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>dataaugmentation</category>
    </item>
    <item>
      <title>ML and text processing on emails</title>
      <dc:creator>Gabrielle</dc:creator>
      <pubDate>Mon, 22 Jun 2020 07:04:44 +0000</pubDate>
      <link>https://dev.to/veganaise/ml-and-text-processing-on-emails-4f6p</link>
      <guid>https://dev.to/veganaise/ml-and-text-processing-on-emails-4f6p</guid>
      <description>&lt;p&gt;I'm a software engineering student and this is my first blog post! I'm writing this to seek feedback, improve my technical writing skills and, hopefully, provide insights on text processing with Machine Learning.&lt;/p&gt;

&lt;p&gt;I'm currently tasked to do machine learning for Kormos, the startup I'm working with.&lt;/p&gt;

&lt;h1&gt;
  
  
  Our project
&lt;/h1&gt;

&lt;p&gt;We are trying to find all the websites a user is subscribed to by looking at their emails. For that we have a database of emails, four thousand of which are human-tagged. This tag is referred to as 'isAccount' and is true when the email was sent from a website the user is subscribed to.&lt;/p&gt;

&lt;p&gt;The tagged emails were selected based on keywords in their body field. Such keywords are related to "account creation" or "email verification".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fqy3o4q6oo1vkmhof2rt1.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fqy3o4q6oo1vkmhof2rt1.PNG" alt="Alt Text" width="461" height="273"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This results in an imbalanced data set.&lt;/p&gt;

&lt;p&gt;For this project we're focusing on these fields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Subject&lt;/li&gt;
&lt;li&gt;Body&lt;/li&gt;
&lt;li&gt;senderDomain: the domain of the sender (e.g. "kormos.com")&lt;/li&gt;
&lt;li&gt;langCode: the predicted language&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fhvz6i4r2bljf0gyowl2t.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fhvz6i4r2bljf0gyowl2t.PNG" alt="Alt Text" width="435" height="272"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We mostly have French emails, so we're only considering French emails from now on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical decisions
&lt;/h2&gt;

&lt;p&gt;I'm using Python on Google Colab. &lt;/p&gt;

&lt;p&gt;I'm doing Machine Learning using scikit-learn.&lt;/p&gt;

&lt;p&gt;I experimented with spaCy and am considering using it to extract features from the body of emails. I'm thinking about extracting usernames or names of organizations. &lt;/p&gt;

&lt;h1&gt;
  
  
  Processing text
&lt;/h1&gt;

&lt;p&gt;I started training my model only on the subject field.&lt;/p&gt;

&lt;h3&gt;
  
  
  Vectorizing our text data
&lt;/h3&gt;

&lt;p&gt;I'm using scikit's TfidfVectorizer, which is equivalent to a CountVectorizer followed by a TfidfTransformer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;nltk&lt;/span&gt;
&lt;span class="n"&gt;nltk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;download&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;stopwords&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;nltk.corpus&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;stopwords&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.feature_extraction.text&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;TfidfVectorizer&lt;/span&gt;

&lt;span class="n"&gt;tfidfV&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;TfidfVectorizer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;stop_words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stopwords&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;words&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;french&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt;
    &lt;span class="n"&gt;max_features&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;corpus_bow&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tfidfV&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit_transform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_fr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subject&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What it does is build a vocabulary of the most common words, ignoring the stop words (frequent words of little value, like "I"). &lt;br&gt;
The size of the vocabulary is, at most, equal to max_features.&lt;br&gt;
Based on this vocabulary, each text input is transformed into a vector of dimension max_features.&lt;br&gt;
Basically, if "confirm" is the n-th word of the vocabulary, then the n-th dimension of the output vector counts the occurrences of the word "confirm". &lt;/p&gt;

&lt;p&gt;Hence we have a count matrix: numerical values instead of text.&lt;/p&gt;
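
&lt;p&gt;You can inspect the resulting vocabulary directly (get_feature_names_out on recent scikit-learn versions; older versions call it get_feature_names):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;print(tfidfV.get_feature_names_out())  # at most 15 words, one per vector dimension
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;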
&lt;h3&gt;
  
  
  Tf-idf transformer
&lt;/h3&gt;

&lt;p&gt;This step transforms the count matrix into a normalized term-frequency representation. &lt;br&gt;
It scales down the impact of words that appear very frequently.&lt;/p&gt;
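
&lt;p&gt;To see the scaling in isolation, here is a toy example (a sketch with made-up sentences, separate from the email data):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

toy = ["confirm your account", "confirm confirm your order"]
counts = CountVectorizer().fit_transform(toy)
weights = TfidfTransformer().fit_transform(counts)
print(weights.toarray())  # each row is an L2-normalized tf-idf vector
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;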
&lt;h3&gt;
  
  
  Splitting our dataset
&lt;/h3&gt;

&lt;p&gt;I use scikit to divide my data into two groups: one to train my model and the other to test it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.model_selection&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;train_test_split&lt;/span&gt;
&lt;span class="n"&gt;y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data_fr&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;isAccount&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;train_X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val_X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;val_y&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;train_test_split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;corpus_bow&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I set random_state to an arbitrary number to fix the seed of the random number generator, hence making my results stable across different executions of my code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Training our model
&lt;/h2&gt;

&lt;p&gt;The model I'm using is scikit's RandomForestClassifier, because I understand it. It trains a number of decision tree classifiers and aggregates their predictions.&lt;br&gt;
There are just so many models you can choose from.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.ensemble&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RandomForestClassifier&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RandomForestClassifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n_estimators&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_depth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;random_state&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_X&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_y&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;y_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;val_X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Results
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;confusion matrix :
[[ 18  29]
 [ 10 716]]
accuracy = 94.955%
precision = 96.107%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
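
&lt;p&gt;For reference, these numbers can be computed with scikit-learn's metrics module (a sketch, not necessarily the exact code I ran):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.metrics import confusion_matrix, accuracy_score, precision_score

print(confusion_matrix(val_y, y_pred))
print(accuracy_score(val_y, y_pred))   # fraction of correct predictions
print(precision_score(val_y, y_pred))  # true positives / predicted positives
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;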



&lt;p&gt;We get good results; however, this is partly due to the imbalance in the distribution of the classes. &lt;/p&gt;

&lt;p&gt;With these predictions, we can easily create a list of unique sender domains the user is predicted to be subscribed to.&lt;/p&gt;

&lt;p&gt;I filter the list of domains by removing the ones not present in Alexa's top 1 million domains, hopefully filtering out any scams.&lt;/p&gt;
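
&lt;p&gt;The filtering itself is a simple set membership test (a sketch; the file path and the 'predicted_domains' set are hypothetical):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import csv

# the Alexa top-1M list is distributed as a CSV of (rank, domain) rows
with open("alexa-top-1m.csv") as f:
    alexa_domains = {row[1] for row in csv.reader(f)}

# keep only predicted subscription domains that appear in the list
trusted_domains = {d for d in predicted_domains if d in alexa_domains}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;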

&lt;h1&gt;
  
  
  Conclusion - How to make it better?
&lt;/h1&gt;

&lt;p&gt;Assuming that the tagged data is correct and representative of future users, I believe that the model is good enough to be used.&lt;/p&gt;

&lt;p&gt;However, I wonder whether removing data where isAccount is True would be an effective way to improve the model. The cost of that strategy would be training the model on a much smaller data set.&lt;/p&gt;

&lt;p&gt;I have also been informed that data augmentation could be useful in this situation.&lt;/p&gt;

&lt;p&gt;Please feel free to give feedback!&lt;br&gt;
I can give additional information about any step of the process.&lt;/p&gt;

&lt;p&gt;Thanks to scikit-learn and pandas for their documentation.&lt;br&gt;
Thanks to Tancrède Suard and Kormos for their work on the dataset. &lt;/p&gt;

</description>
      <category>python</category>
      <category>scikit</category>
      <category>discuss</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
