DEV Community


Compare documents similarity using Python | NLP

Rashid on September 16, 2019

This post was cross-published with OnePublish. In this post we are going to build a web application which will compare the similarity between two docum...
R13TechNewbie

Thanks for making the tutorial, coderasha. I'm currently following your tutorial, and I think I found a typo in this part:

tf_idf = gensim.models.TfidfModel(corpus)
for doc in tfidf[corpus]:
print([[mydict[id], np.around(freq, decimals=2)] for id, freq in doc])

The "mydict" variable is probably a typo, so I changed it to "dictionary" based on the declaration on the previous line, and the code works.

Please verify this, Thanks

Rashid

Oh, I see. Yes, there is a typo in that part. Thank you for your attention :)

JEMQ • Edited

Really great tutorial, thanks!

Two points and a question-

1.

should the second line of this -

tf_idf = gensim.models.TfidfModel(corpus)
for doc in tfidf[corpus]:
print([[dictionary[id], np.around(freq, decimals=2)] for id, freq in doc])

read -

for doc in tf_idf[corpus]:

2.
To avoid having percentages over 100 and also to calculate the correct average, I think you have to divide the second total by the number of documents in the query corpus.

i.e -

total_avg = ((np.sum(avg_sims, dtype=np.float)) / len(file2_docs))
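A quick sketch of that normalization with made-up numbers (note that `np.float` has been removed from recent NumPy releases, so `np.float64` is used here):

```python
import numpy as np

# Hypothetical per-sentence average similarity percentages from the query loop
avg_sims = [42.0, 55.5, 13.0]
file2_docs = ["sentence 1", "sentence 2", "sentence 3"]  # the query sentences

# Dividing by the number of query sentences keeps the result in [0, 100];
# np.float is gone from recent NumPy, so np.float64 is used instead
total_avg = np.sum(avg_sims, dtype=np.float64) / len(file2_docs)
print(round(total_avg, 2))
```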

3.

Any thoughts on how you would compare a corpus to itself, i.e. to see how unique each document is within the corpus?
I've tried a variety of corpora using your code and they all end up with the same similarity score... 8% (using the updated % calc above).

Cheers,

Jamie

bingyuyiyang

Thank you for making the tutorial. I have some questions about the code:
1) Do you know why, if I switch the query document (demofile2.txt) and demofile.txt, I don't get the same similarity between the two documents?
2) If the document demofile.txt contains just one sentence, "Mars is the fourth planet in our solar system.", then print(doc) prints nothing. Do you know why? In other words, TF-IDF does not work in your code when the corpus is a single sentence.

file_docs = []
with open('~/demofile.txt') as f:
    tokens = sent_tokenize(f.read())
    for line in tokens:
        file_docs.append(line)
print("Number of documents:", len(file_docs))

gen_docs = [[w.lower() for w in word_tokenize(text)]
            for text in file_docs]
dictionary = gensim.corpora.Dictionary(gen_docs)
print(dictionary.token2id)

corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]
tf_idf = gensim.models.TfidfModel(corpus)
for doc in tf_idf[corpus]:
    print(doc)
    print([[dictionary[id], np.around(freq, decimals=2)] for id, freq in doc])

Please let me know if you have any comments about it. Thank you.

clymbert8

1) This "similarity" is asymmetric. Look at the definition of TF-IDF: it is calculated for whatever you consider the corpus, not the query. So when you switch query and corpus, you are changing the weights (IDF) or the "normalization" (I prefer to think of sqrt(DF) as the denominator for both texts, the corpus as well as the query).

Geometrically: you have two pseudo-vectors V1 and V2. Naively we think of similarity as some equivalent of the cosine of the angle between them. But the "angle" is calculated after a projection of both V1 and V2. If this projection were determined symmetrically you would be fine, but in fact the projection of both vectors is based on a component of the first vector, so it is not symmetric under exchange.

Concretely, consider two vectors V1 = (3,4,5) and V2 = (3,1,2). Our rule is to calculate the angle after projecting perpendicular to the largest component of the first vector (to down-weight or eliminate the most common tokens in the corpus). If V1 is the corpus, you are calculating the angle between V1' = (3,4,0) and V2' = (3,1,0). If V2 is the corpus, you are calculating the angle between V1" = (0,4,5) and V2" = (0,1,2).
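This asymmetry is easy to reproduce with a bare-bones TF-IDF (log(N/df) weighting and cosine similarity, a simplification of what gensim does): the score changes depending on which document the IDF weights were learned from.

```python
import math

def tfidf_weights(corpus_tokens):
    """IDF weights log(N / df) learned from a corpus of token lists."""
    n = len(corpus_tokens)
    df = {}
    for doc in corpus_tokens:
        for tok in set(doc):
            df[tok] = df.get(tok, 0) + 1
    return {tok: math.log(n / d) for tok, d in df.items()}

def vectorize(doc, idf):
    """Sum IDF weights per token; tokens unseen in the corpus get no weight."""
    vec = {}
    for tok in doc:
        if tok in idf:
            vec[tok] = vec.get(tok, 0.0) + idf[tok]
    return vec

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return 0.0 if na == 0 or nb == 0 else dot / (na * nb)

doc_a = ["mars", "is", "the", "red", "planet"]
doc_b = ["mars", "is", "a", "planet"]
extra = ["venus", "is", "a", "planet", "too"]

# IDF learned with doc_a as part of the corpus...
idf1 = tfidf_weights([doc_a, extra])
sim_ab = cosine(vectorize(doc_a, idf1), vectorize(doc_b, idf1))

# ...versus IDF learned with doc_b as part of the corpus
idf2 = tfidf_weights([doc_b, extra])
sim_ba = cosine(vectorize(doc_a, idf2), vectorize(doc_b, idf2))

print(sim_ab, sim_ba)  # the two "similarities" generally differ
```

The documents and weighting scheme here are made up for illustration; the point is only that swapping which side plays the role of corpus changes the IDF weights, and with them the score.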

2) Again, think geometrically in terms of projections. If your corpus has only one document, every token in that document has the maximum DF. So when you project perpendicular to this, you get zero! Algebraically, I suspect that what people call IDF is actually log(IDF). A token that appears in every document in the corpus has a DF of 1, its inverse is 1, and the log of that is... 0. So if you only have one document, every token satisfies this and you are left with log(IDF) = 0.

Why log? Probably something based on Zipf's law. But remember, this log(IDF) is not mathematically derived; it is just a heuristic that has become common usage. If you prefer to do geometry with distributions, you should use something like the symmetrized Kullback-Leibler divergence, or even better, the Euclidean metric in logit space.
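The one-document case can be checked directly with the same log(N/df) heuristic: every token's document frequency equals the corpus size, so every weight comes out zero (the sentence below is the one from the question above).

```python
import math

# A one-document corpus, as in question 2) above
corpus_tokens = [["mars", "is", "the", "fourth", "planet"]]

n = len(corpus_tokens)
df = {}
for doc in corpus_tokens:
    for tok in set(doc):
        df[tok] = df.get(tok, 0) + 1

# Every token appears in the only document, so df == n and log(n/df) == 0
idf = {tok: math.log(n / d) for tok, d in df.items()}
weights = {tok: w for tok, w in idf.items() if w > 0}

print(idf)      # every value is 0.0
print(weights)  # empty, which is why print(doc) shows nothing
```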

Vee W

Hi, this is very helpful! I wonder whether there is any doc on how the number of documents is determined. I also need to read more on solutions for when there is only one document.

plysytsya

Thanks for making this tutorial. Exactly what I was looking for!

diyaralzuhairi

A very, very helpful blog. Thanks a lot, my friend.

I am wondering if it's possible to apply my idea with this approach: I'm thinking of creating two folders, (Student_answers) and (Teacher_reference_answers), each containing a number of txt files. For example, Student_answers has 30 txts and Teacher_reference_answers has 5. The idea is to compare the student answer documents against the 5 teacher answers to compute each score automatically (and choose the biggest score for each student)?
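That setup sounds workable. As a minimal sketch (plain bag-of-words cosine instead of the tutorial's full gensim pipeline; the folder layout, answers, and function names below are hypothetical), each student answer is scored against every reference answer and the best score is kept:

```python
import math

def bow(text):
    """Lowercased bag-of-words counts; a stand-in for the tutorial's tokenizing."""
    counts = {}
    for tok in text.lower().split():
        counts[tok] = counts.get(tok, 0) + 1
    return counts

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return 0.0 if na == 0 or nb == 0 else dot / (na * nb)

def best_score(student_text, teacher_texts):
    """Compare one student answer to every reference answer, keep the best."""
    s = bow(student_text)
    return max(cosine(s, bow(t)) for t in teacher_texts)

# Hypothetical in-memory answers; with real folders you would read each file,
# e.g. [p.read_text() for p in Path("Teacher_reference_answers").glob("*.txt")]
teachers = ["mars is the red planet", "venus is the second planet"]
student = "mars is a red planet"

print(round(best_score(student, teachers), 2))
```

Looping `best_score` over the 30 student files would give one best-match score per student.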

I would love to hear from you <3

Isa Levine

Oh wow, this is EXACTLY the kind of tutorial I've been looking for to dip my toes into NLP! Thank you so much for sharing this!!

Rashid

Great! :D Glad you liked it!

Nihar Sainee

This piece of code is giving an error when called:

# perform a similarity query against the corpus
query_doc_tf_idf = tf_idf[query_doc_bow]
print(document_number, document_similarity)
print('Comparing Result:', sims[query_doc_tf_idf])

The error is:
FileNotFoundError: [Errno 2] No such file or directory: 'workdir/.0'
Can anyone help?

Zhuangdie(Alan) Zhou

# building the index
sims = gensim.similarities.Similarity('workdir/', tf_idf[corpus],
                                      num_features=len(dictionary))

Change 'workdir/' to the directory where your Python script is located.

Casi Imin • Edited

Nice! Thank you for sharing, very useful. The NLTK’s power!

Rashid

I am glad you liked it! :)

Vee W

Absolutely fabulous article!