doc2vec implementation with Python (& Gensim)
Note: This code is written in Python 3.6.1 (+Gensim 2.3.0)
Python implementation and application of doc2vec with Gensim
import re
import numpy as np
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from nltk.corpus import gutenberg
from multiprocessing import Pool
from scipy import spatial
Import training dataset
- Import Shakespeare's Hamlet corpus from nltk library
sentences = list(gutenberg.sents('shakespeare-hamlet.txt')) # import the corpus and convert into a list
print('Type of corpus: ', type(sentences))
print('Length of corpus: ', len(sentences))
Type of corpus:  <class 'list'>
Length of corpus: 3106
print(sentences[0]) # title, author, and year
print(sentences[1])
print(sentences[10])
['[', 'The', 'Tragedie', 'of', 'Hamlet', 'by', 'William', 'Shakespeare', '1599', ']']
['Actus', 'Primus', '.']
['Fran', '.']
Preprocess data
- Use re module to preprocess data
- Convert all letters into lowercase
- Remove punctuations, numbers, etc.
- For the doc2vec model, input data should be in the format of an iterable of TaggedDocuments
- Each TaggedDocument instance comprises words and tags
- Hence, each document (i.e., a sentence or paragraph) should have a unique, identifiable tag
for i in range(len(sentences)):
sentences[i] = [word.lower() for word in sentences[i] if re.match('^[a-zA-Z]+', word)]
print(sentences[0]) # title, author, and year
print(sentences[1])
print(sentences[10])
['the', 'tragedie', 'of', 'hamlet', 'by', 'william', 'shakespeare']
['actus', 'primus']
['fran']
for i in range(len(sentences)):
sentences[i] = TaggedDocument(words = sentences[i], tags = ['sent{}'.format(i)]) # converting each sentence into a TaggedDocument
sentences[0]
TaggedDocument(words=['the', 'tragedie', 'of', 'hamlet', 'by', 'william', 'shakespeare'], tags=['sent0'])
Create and train model
- Create a doc2vec model and train it with Hamlet corpus
- Key parameter description (https://radimrehurek.com/gensim/models/doc2vec.html)
- documents: training data (an iterable of TaggedDocument instances)
- size: dimension of embedding space
- dm: training algorithm; PV-DBOW (distributed bag of words) if 0, PV-DM (distributed memory) if 1
- window: number of context words considered on each side (if the window size is 3, the 3 words in the left neighborhood and the 3 words in the right neighborhood are considered)
- min_count: minimum count of words to be included in the vocabulary
- iter: number of training iterations
- workers: number of worker threads to train
model = Doc2Vec(documents = sentences, dm = 1, size = 100, min_count = 1, iter = 10, workers = Pool()._processes)
model.init_sims(replace = True) # L2-normalize the stored vectors (saves memory, but the model can no longer be trained further)
Save and load model
- doc2vec model can be saved and loaded locally
- Doing so can reduce time to train model again
model.save('doc2vec_model')
model = Doc2Vec.load('doc2vec_model')
Similarity calculation
- Similarity between embedded words (i.e., vectors) can be computed using metrics such as cosine similarity
model.most_similar('hamlet') # in newer Gensim versions, use model.wv.most_similar('hamlet')
[('horatio', 0.9978846311569214),
('queene', 0.9971947073936462),
('laertes', 0.9971820116043091),
('king', 0.9968599081039429),
('mother', 0.9966716170310974),
('where', 0.9966292381286621),
('deere', 0.9965540170669556),
('ophelia', 0.9964221715927124),
('very', 0.9963752627372742),
('oh', 0.9963476657867432)]
v1 = model['king'] # word-vector lookup; in newer Gensim versions, use model.wv['king']
v2 = model['queen']
# define a function that computes cosine similarity between two words
def cosine_similarity(v1, v2):
return 1 - spatial.distance.cosine(v1, v2)
cosine_similarity(v1, v2)
0.99437165260314941
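Because init_sims(replace = True) L2-normalizes the stored vectors, cosine similarity between them reduces to a plain dot product. A quick numpy check of this equivalence on arbitrary vectors:

```python
import numpy as np
from scipy import spatial

v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([2.0, 1.0, 0.5])

# normalize to unit length, as init_sims(replace=True) does
u1 = v1 / np.linalg.norm(v1)
u2 = v2 / np.linalg.norm(v2)

cos = 1 - spatial.distance.cosine(u1, u2)  # cosine similarity
dot = np.dot(u1, u2)                       # dot product of unit vectors

print(np.isclose(cos, dot))  # True
```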