DEV Community

Discussion on: Compare documents similarity using Python | NLP

JEMQ • Edited

Really great tutorial, thanks!

Two points and a question:

1.

Should the second line of this:

tf_idf = gensim.models.TfidfModel(corpus)
for doc in tfidf[corpus]:
    print([[dictionary[id], np.around(freq, decimals=2)] for id, freq in doc])

read:

for doc in tf_idf[corpus]:

since the model is assigned to tf_idf rather than tfidf?

2.
To avoid having percentages over 100 and also to calculate the correct average, I think you have to divide the second total by the number of documents in the query corpus.

i.e.:

total_avg = np.sum(avg_sims, dtype=float) / len(file2_docs)
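
To show why I think the division matters, here is a minimal sketch in plain Python. The similarity values and the helper name total_average_percent are made up for illustration, and I'm assuming avg_sims holds one average similarity (between 0 and 1) per query document:

```python
def total_average_percent(avg_sims, num_query_docs):
    """Mean of the per-document similarities, expressed as a percentage.

    avg_sims: one average similarity (0..1) per query document (assumed).
    num_query_docs: number of documents in the query corpus.
    """
    if num_query_docs == 0:
        return 0.0
    return 100.0 * sum(avg_sims) / num_query_docs


# Hypothetical scores for a three-document query corpus.
avg_sims = [0.6, 0.5, 0.4]

# Summing without dividing keeps growing as documents are added,
# so it can pass 100 even though no single score exceeds 1.
summed_only = 100.0 * sum(avg_sims)

# Dividing by the document count keeps the result within 0..100.
averaged = total_average_percent(avg_sims, len(avg_sims))

print(f"sum only: {summed_only:.1f}%  averaged: {averaged:.1f}%")
```

As long as every per-document score is between 0 and 1, the averaged figure cannot exceed 100.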

3.

Any thoughts on how you would compare a corpus to itself, i.e. to see how unique each document is within the corpus?
I've tried a variety of corpora using your code and they all end up with the same similarity score: 8% (using the updated percentage calculation above).
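
To make the question concrete, here is a rough sketch of the kind of self-comparison I have in mind. It is plain Python rather than gensim, uses raw term counts instead of TF-IDF weights, and the helper names cosine and uniqueness_scores are my own, so treat it as an illustration of the idea rather than a drop-in for the tutorial's code:

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def uniqueness_scores(docs):
    """For each document, 1 minus its highest similarity to any OTHER document.

    Each document is skipped when scoring itself; otherwise every
    document would match itself perfectly and look equally non-unique.
    """
    vectors = [Counter(doc.lower().split()) for doc in docs]
    scores = []
    for i, vec in enumerate(vectors):
        others = [cosine(vec, other) for j, other in enumerate(vectors) if j != i]
        scores.append(1.0 - max(others, default=0.0))
    return scores


# Hypothetical three-document corpus: two duplicates and one unrelated text.
docs = [
    "the cat sat on the mat",
    "the cat sat on the mat",
    "stock prices fell sharply today",
]
print(uniqueness_scores(docs))
```

The two duplicate documents should score close to 0 and the unrelated one close to 1. I suspect that not excluding each document's match with itself could be part of why my scores come out identical, but I haven't confirmed that.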

Cheers,

Jamie