Word2Vec and Word Similarity

When humans read text, we assign a great deal of meaning to individual words, and over many years we develop an understanding of how certain words relate to one another. It is difficult to imagine a computer being able to understand or infer the relationship between two words, but with Natural Language Processing we can begin to achieve just that.

The gensim package contains a model called Word2Vec. Word2Vec is trained on a corpus of ‘documents’, and one of its uses is to measure how closely related two words are. Let’s start by building an example model. A common demonstration uses a list of Jeopardy! questions, so we’ll build our model with that. First, we’ll need to get our list of questions into the proper format. Word2Vec requires a corpus of ‘sentences’, where each ‘sentence’ is a list of words.

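import string

# Translation table that strips all punctuation characters.
strip_punctuation = str.maketrans('', '', string.punctuation)

corpus = []

# `clues` is assumed to be a list of dicts, each with a 'question' string.
for clue in clues:
    # Remove punctuation, lowercase, and split on any run of whitespace
    # (split() with no argument avoids empty-string tokens).
    sentence = clue['question'].translate(strip_punctuation).lower().split()
    corpus.append(sentence)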

If we check an entry in our new list it should look something like this:

['in',
 'the',
 'title',
 'of',
 'an',
 'aesop',
 'fable',
 'this',
 'insect',
 'shared',
 'billing',
 'with',
 'a',
 'grasshopper']

So you can see what I mean by a ‘sentence’ being a list of words. Now that we have our corpus properly formatted, we can build a model:

import gensim

# sg=1 selects skip-gram. The constructor trains on `corpus` itself, so no extra model.train() call is needed.
model = gensim.models.Word2Vec(corpus, sg=1, seed=10)

Now that we have successfully created our model, let’s discuss some of the interesting things we can do with it. To make use of the model, we will be calling functions on model.wv. If you need to check the vector for a specific word you can use model.wv[word], but for the purposes of this tutorial it won’t be necessary.

To start with, let’s talk about similarity and distance. model.wv has functions called .similarity() and .distance(), and they are closely related, with one important difference. Given two similar words, .similarity() returns a higher number, whereas .distance() returns a lower number. That is because .similarity() measures how alike the two words are (it is the cosine similarity of their vectors), so a word compared with itself scores 1, or 100% similarity. .distance() measures how far apart the words are, so a lower number means the words are closer in meaning, and a word compared with itself returns 0. The two are complementary: .distance() is defined as 1 minus .similarity(), so for any pair of words the two values add up to 1.
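
To make the relationship between the two measures concrete, here is a minimal sketch in plain Python. It assumes that .similarity() is the cosine similarity of the two word vectors and that .distance() is 1 minus that value. The three-dimensional vectors and the names cosine_similarity, cosine_distance and toy_vectors are made up for illustration; they do not come from the trained model.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Distance defined as 1 minus the cosine similarity."""
    return 1.0 - cosine_similarity(u, v)

# Hypothetical toy vectors, NOT taken from the trained model.
toy_vectors = {
    "movie": [0.9, 0.1, 0.3],
    "film": [0.8, 0.2, 0.4],
    "insect": [-0.2, 0.9, -0.1],
}

for other in ("movie", "film", "insect"):
    sim = cosine_similarity(toy_vectors["movie"], toy_vectors[other])
    dist = cosine_distance(toy_vectors["movie"], toy_vectors[other])
    print(f"movie vs {other}: similarity={sim:.3f}, distance={dist:.3f}")
```

With these toy values, 'movie' scores higher against 'film' than against 'insect', and each similarity and distance pair sums to 1. Note that cosine similarity can be negative for vectors pointing in opposite directions, in which case the distance is greater than 1.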

You can also return a list of the words most similar to a given word, using .most_similar(). Let’s see how that looks:

model.wv.most_similar(['movie'])

Which returns:

[('film', 0.8861722350120544),
 ('movies', 0.7633383274078369),
 ('flick', 0.7305044531822205),
 ('starring', 0.7231286764144897),
 ('films', 0.7168289422988892),
 ('miniseries', 0.704983651638031),
 ('remake', 0.6890048980712891),
 ('tearjerker', 0.6820477247238159),
 ('liveaction', 0.6722157597541809),
 ('oscarwinning', 0.6645717620849609)]

You can also play around with this feature by adding and subtracting words to calculate new words. To understand what this is doing, let’s first set it up as an equation. A classic example is: King - Man + Woman = Queen. A King, typically a male monarch, minus the masculine aspect of the position, plus a feminine aspect, would be a Queen. So let’s see what the computer comes up with:

model.wv.most_similar(positive=['king', 'woman'],
                     negative=['man'], topn=10)

Produces:

[('queen', 0.7005906105041504),
 ('aquitaine', 0.6221221685409546),
 ('monarch', 0.6098789572715759),
 ('throne', 0.6048736572265625),
 ('wilhelmina', 0.5939158201217651),
 ('consort', 0.5926421284675598),
 ('granddaughter', 0.5812513828277588),
 ('margrethe', 0.5809222459793091),
 ('vi', 0.5733479261398315),
 ('princess', 0.5711938142776489)]

Astonishingly, the model determined that the best answer to our equation is ‘queen’! Word2Vec models have many other fun and useful features, and playing around with queries like these is a good way to familiarize yourself with them.
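
The arithmetic behind the King - Man + Woman query can be sketched in plain Python. This is a simplified illustration, not gensim’s implementation: it adds the ‘positive’ vectors, subtracts the ‘negative’ ones, and ranks every remaining word by cosine similarity to the result. The two-dimensional vectors (one axis loosely standing for royalty, the other for gender) and the function name analogy are hypothetical, chosen so the result is easy to check by hand.

```python
import math

# Hypothetical 2-D vectors: [royalty, gender]. Real Word2Vec vectors have
# many more dimensions, and their axes are not individually interpretable.
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-0.5, 0.1],
}

def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(positive, negative, vectors, topn=3):
    """Rank words by similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    target = [0.0] * dims
    for word in positive:
        target = [t + x for t, x in zip(target, vectors[word])]
    for word in negative:
        target = [t - x for t, x in zip(target, vectors[word])]

    # Leave the query words themselves out of the ranking.
    exclude = set(positive) | set(negative)
    scored = [
        (word, cosine_similarity(target, vec))
        for word, vec in vectors.items()
        if word not in exclude
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]

results = analogy(positive=["king", "woman"], negative=["man"], vectors=toy_vectors)
print(results)
```

Here king + woman - man works out to [1.0, -1.0], which is exactly the toy vector for ‘queen’, so ‘queen’ ranks first. gensim’s own .most_similar() differs in its details (for example, in how the vectors are normalized before being combined), so treat this only as a picture of the idea.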