tf_idf = gensim.models.TfidfModel(corpus)
for doc in tfidf[corpus]:
print([[dictionary[id], np.around(freq, decimals=2)] for id, freq in doc])
read -
for doc in tf_idf[corpus]:
2.
To avoid having percentages over 100 and also to calculate the correct average, I think you have to divide the second total by the number of documents in the query corpus.
Any thoughts on how you would compare a corpus to itself? I.e to see how unique each document is within the corpus?
I've tried a variety of corpora using your code and they all end up with the same similarity score...8% (Using the update % calc above)
Cheers,
Jamie
For further actions, you may consider blocking this person and/or reporting abuse
We're a place where coders share, stay up-to-date and grow their careers.
Really great tutorial, thanks!
Two points and a question-
1.
should the second line of this -
tf_idf = gensim.models.TfidfModel(corpus)
for doc in tfidf[corpus]:
print([[dictionary[id], np.around(freq, decimals=2)] for id, freq in doc])
read -
for doc in tf_idf[corpus]:
2.
To avoid having percentages over 100 and also to calculate the correct average, I think you have to divide the second total by the number of documents in the query corpus.
i.e -
total_avg = ((np.sum(avg_sims, dtype=np.float)) / len(file2_docs))
3.
Any thoughts on how you would compare a corpus to itself? I.e to see how unique each document is within the corpus?
I've tried a variety of corpora using your code and they all end up with the same similarity score...8% (Using the update % calc above)
Cheers,
Jamie