Using pretrained word vectors is all well and good, and they are of huge benefit to your models. For example, the Google word2vec model contains 3 million unique words (its vocabulary) and was trained on billions of words of text. Those word vectors are amazingly fine-tuned and capture the semantic meaning of words with very high accuracy. I mean, they are state-of-the-art, so why should we ever think of building our own word vectors?
Visualize this scenario.
You are a hardcore fantasy fan, and there is this particular series of novels [Malazan Book of the Fallen] that you and your fellow hardcore fans are crazy about. You all can’t wait to tell the author [Steven Erikson] how you feel about the books. In a classic bid to impress the other fans and show that you are the books’ number one fan, you decide to build a sentiment model that can analyze their reviews, accurately recognize their emotional state, and translate it into one of five succinct labels (Very Good, Good, Ok, Bad, Very Bad).
In your enthusiasm you decide to use the Google word2vec (I mean, it has 3 million words), and then you discover a very obvious problem. You and your fellow hardcore fans use a lot of words in your reviews that are not found in the Google word2vec 3-million-word vocabulary (what!!! but the books were written in English), words like Bugg, Gothos, Seguleh, etc. (I guess the billions of words that Google word2vec was trained on didn’t contain them). Even though those cryptic words carry rich sentiment (e.g. “Hood’s breath” may mean wow, amazing, shock, etc. depending on the context), you still think you can build a fairly accurate model with the remaining words that can be found in the Google word2vec vocabulary.
Bolstered by that insight, you go ahead and build your model. With further analysis, you then discover an error of the most insidious kind, the kind that is extremely hard to detect because it does not trip any alarm (silent or loud). Such errors pass so quietly that most times you do not notice them until it is very late, and the damage they deal has a huge impact on the performance of your model.
Ok, enough trying to scare you, but what kind of errors are these? In this NLP context they are what I will call _mis-concepted errors_. They occur when the meaning of a word changes drastically once it is placed in a different world. An example makes this clearer. Take the word “burn”: in the real world it means to set something on fire or to be consumed by fire, but in the Malazan world it takes on a totally unrelated meaning; it’s the name of a god. Below are some examples of words with their different meanings.
|Words|Meaning in the Real World|Meaning in the Malazan World|
|---|---|---|
|Hood|The metal cover over a car engine, or a cloth covering for the head|The name of the god of death|
|High Fist|Taken together they mean nothing, but taken apart they mean different things|Mostly treated as a single word, which means a rank in a military structure|
|Kindly|In a kind manner|The name of a soldier (who isn’t even kind)|
|Divers|People who go deep under water|A shape shifter that changes from a single entity into multiple entities|
|Curdle|Milk going sour|The name of a long-dead Soletaken (a dragon)|
You can now see how misconceptions like these can introduce terrible errors into your model. Not only that, there are also words that on their own mean nothing, or mean something completely different, in the Malazan world, and must be combined with another word before their meaning is revealed, e.g. T’lan Imass, High Mage, High Fist, Ampelas Rooted, etc.
Ok, coming out of that visualization exercise, you can see why having your own custom word vectors for a specific case is imperative. That does not mean the Google word2vec suddenly becomes useless to you, and you wouldn’t want to build a word2vec model for every situation you encounter.
You get the best results when you combine both models, with your custom word2vec vectors acting as a supplement. You use the Google word2vec vectors for the common words found in everyday English, and your custom word2vec vectors for the rare words or phrases that are only found in the niche you are building your model for.
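One way to picture the combination is a lookup that decides, word by word, which set of vectors to use. The sketch below is only an illustration under stated assumptions: `lookup_vector`, `niche_words`, and the toy two-dimensional vectors are all made up for this example, and plain dictionaries stand in for the two models (gensim's `KeyedVectors` supports the same `in` test and `[]` indexing). The hypothetical `niche_words` set lets a word like "hood", which exists in both vocabularies, take its niche meaning.

```python
def lookup_vector(word, general_vectors, custom_vectors, niche_words=frozenset()):
    """Return (source, vector) for `word`, or (None, None) if it is unknown.

    All names here are hypothetical. The two models are trained separately,
    so their vector spaces are not aligned; the returned `source` tag lets a
    downstream model keep the two kinds of vectors apart.
    """
    # Words known to change meaning in the niche go to the custom vectors.
    if word in niche_words and word in custom_vectors:
        return "custom", custom_vectors[word]
    # Everyday words come from the large general-purpose vectors.
    if word in general_vectors:
        return "general", general_vectors[word]
    # Rare niche words fall back to the custom vectors.
    if word in custom_vectors:
        return "custom", custom_vectors[word]
    return None, None


# Toy stand-ins for the two models (made-up numbers).
general_vectors = {"good": [0.9, 0.1], "hood": [0.2, 0.7]}
custom_vectors = {"hood": [0.1, 0.1], "seguleh": [0.5, 0.4]}
niche_words = {"hood"}

sources = {
    word: lookup_vector(word, general_vectors, custom_vectors, niche_words)[0]
    for word in ["good", "hood", "seguleh", "xyzzy"]
}
print(sources)
```

Which words belong in `niche_words` is a judgment call; the table of Malazan meanings above would be a natural starting point.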
We are now done with the blah blah part of building a word2vec model, so let’s dive into the coding part. Let us import all the modules needed to build the model.
This project requires that you have gone through building a phrase detector model, as it is a prerequisite.

    import glob

    import numpy as np
    from gensim.models import callbacks
    from gensim.models.callbacks import CallbackAny2Vec
    from gensim.models.phrases import Phrases
    from gensim.models.word2vec import Word2Vec

    %run utils.py

There is nothing new in the above code except the little instruction on the last line, `%run utils.py`. This instruction executes the code found in the *utils.py* file in the current namespace. The file contains a very useful generic tool that feeds data to our model during training without having to load all the text files into memory. It was defined when we built the phrase detector model. Defining it here again would take up unnecessary space and distract us from our goal, so see the phrase detector project for how it was defined.
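To give a feel for what such a tool does, here is a minimal sketch of a streaming sentence iterator. It is not the `CustomPathLineSentences` class from *utils.py*; `StreamingLineSentences` is a hypothetical, simplified stand-in that assumes one sentence per line and splits on whitespace only.

```python
import os
import tempfile


class StreamingLineSentences:
    """Hypothetical stand-in for the iterator defined in utils.py.

    Yields one whitespace-tokenized sentence per line, reading the files in
    `dirname` lazily so the whole corpus never has to fit in memory.
    """

    def __init__(self, dirname):
        self.dirname = dirname

    def __iter__(self):
        # Each call starts a fresh pass over the files, so the object can be
        # iterated once per training epoch (a plain generator would be
        # exhausted after the first pass).
        for fname in sorted(os.listdir(self.dirname)):
            path = os.path.join(self.dirname, fname)
            if not os.path.isfile(path):
                continue
            with open(path, encoding="utf-8") as handle:
                for line in handle:
                    tokens = line.split()
                    if tokens:  # skip blank lines
                        yield tokens


# Tiny demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "book1.txt"), "w", encoding="utf-8") as f:
        f.write("Hood take us all\n\nThe High Fist frowned\n")
    sentences = StreamingLineSentences(tmp)
    first_pass = list(sentences)
    second_pass = list(sentences)

print(first_pass)
```

The point of using a class with `__iter__` rather than a generator function is that training makes several passes over the corpus, and every pass needs to start again from the first file.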

    def load_phrase_detector_model(fname, reduce_size=False):
        """Load a phrase detector model from disk.

        Parameters:
            fname: str
                Path to the pretrained phrase detector model.
            reduce_size: bool
                Set to True only if the full-sized model was saved when the
                phrase detector was trained; it is then reduced in size
                (frozen) when loaded. Otherwise leave it as False.

        Raises an AttributeError if reduce_size is True and the saved
        phrase detector model is not a full-sized model.
        """
        phrases = Phrases.load(fname)
        print("Loading complete")
        return phrases.freeze() if reduce_size else phrases


    # load a phrase detector model
    phrase_model_path = "malaz_phrase_detector"
    phrases = load_phrase_detector_model(phrase_model_path, reduce_size=True)

    # stream sentences from the 'Books' directory, merging detected phrases
    sentences_iterator = CustomPathLineSentences('Books', include_phrase=True, phrase_model=phrases)

If a phrase detector model is passed to the custom iterator, then as it iterates through the sentences it joins any combination of words that the model detects as a phrase into a single word, and returns the modified sentence (for example “High Fist” becomes “High_Fist”).
It didn’t do this when we were building the phrase detector model; there it simply returned the split words. That is why I said the iterator is generic and a very useful tool to have in your toolbelt.
The phrase detector model solves the problem of a combination of words that together constitute one word. This helps the word2vec model find the word vector that best represents the semantic meaning of the combined word.
The text files must have the same structure as explained in the phrase detector project.
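The merging step described above can be sketched in plain Python. This is an illustration, not how the phrase detector is implemented: `merge_phrases` and the hand-written `known_phrases` set are hypothetical, whereas a trained phrase detector decides which word pairs count as phrases from corpus statistics.

```python
def merge_phrases(tokens, known_phrases, delimiter="_"):
    """Join adjacent tokens that form a known two-word phrase.

    `known_phrases` is a hypothetical set of (first, second) word pairs.
    """
    merged = []
    i = 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in known_phrases:
            merged.append(delimiter.join(pair))
            i += 2  # skip both words of the phrase
        else:
            merged.append(tokens[i])
            i += 1
    return merged


known_phrases = {("High", "Fist"), ("T'lan", "Imass")}
sentence = "The High Fist feared the T'lan Imass".split()
merged_sentence = merge_phrases(sentence, known_phrases)
print(merged_sentence)
```

After merging, word2vec sees “High_Fist” as one vocabulary entry and learns a single vector for it, instead of separate vectors for “High” and “Fist”.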

    model_path = "malaz_word2vec.bin"


    class CustomCallback(CallbackAny2Vec):
        """Create a custom callback that saves the model at the end of each
        epoch and at the end of training, while also reporting the current
        epoch value."""

        def __init__(self):
            self.__epoch_trained = 0

        def on_epoch_end(self, model):
            # checkpoint the model, then report how many epochs are done
            model.save(model_path)
            self.__epoch_trained += 1
            print(self.__epoch_trained, end=' | ')

        def on_train_end(self, model):
            model.save(model_path)


    epochs = 1000
    vector_size = 300
    min_count = 3
    num_workers = 4
    window_size = 5
    subsampling = 1e-3

    model = Word2Vec(workers=num_workers, vector_size=vector_size, min_count=min_count,
                     window=window_size, sample=subsampling)

    # build word vector vocabulary
    model.build_vocab(sentences_iterator)

    # training word2vec
    model.train(sentences_iterator, total_examples=model.corpus_count, epochs=epochs,
                compute_loss=True, callbacks=[CustomCallback()])

    # just for precaution sake
    model.save(model_path)

It might take minutes or hours depending on your computer; I fell asleep after 4 hours and it was still at 300 epochs. That is why I save the model after every epoch, again when training finally completes [by which time I was probably asleep], and once more afterwards (to be doubly sure it was saved). With these simple word vectors alone, you can build a fairly complex sentiment analysis model centered around this author; expect a drastic drop in accuracy if another author is added to the mix.
God loves you!