DEV Community


Compare documents similarity using Python | NLP

Rashid on September 16, 2019

This post was cross-published with OnePublish. In this post we are going to build a web application which will compare the similarity between two docum...
R13TechNewbie

Thanks for making the tutorial, coderasha. I'm currently following your tutorial, and I think I found a typo in this part:

tf_idf = gensim.models.TfidfModel(corpus)
for doc in tfidf[corpus]:
print([[mydict[id], np.around(freq, decimals=2)] for id, freq in doc])

The "mydict" variable is probably a typo, so I changed it to "dictionary" based on the declaration on the previous line, and the code works.

Please verify this, Thanks

Rashid

Oh, I see. Yes, there is a typo in that part. Thank you for your attention :)

JEMQ • Edited

Really great tutorial, thanks!

Two points and a question-

1.

should the second line of this -

tf_idf = gensim.models.TfidfModel(corpus)
for doc in tfidf[corpus]:
print([[dictionary[id], np.around(freq, decimals=2)] for id, freq in doc])

read -

for doc in tf_idf[corpus]:

2.
To avoid having percentages over 100 and also to calculate the correct average, I think you have to divide the second total by the number of documents in the query corpus.

i.e -

total_avg = ((np.sum(avg_sims, dtype=np.float)) / len(file2_docs))
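A quick sketch of that normalization with made-up numbers (note that `np.float` has been removed from recent NumPy releases, so `np.float64` is used here):

```python
import numpy as np

# Hypothetical per-sentence average similarity percentages from the query loop
avg_sims = [42.0, 55.5, 13.0]
file2_docs = ["sentence 1", "sentence 2", "sentence 3"]  # the query sentences

# Dividing by the number of query sentences keeps the result in [0, 100];
# np.float is gone from recent NumPy, so np.float64 is used instead
total_avg = np.sum(avg_sims, dtype=np.float64) / len(file2_docs)
print(round(total_avg, 2))
```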

3.

Any thoughts on how you would compare a corpus to itself, i.e. to see how unique each document is within the corpus?
I've tried a variety of corpora using your code and they all end up with the same similarity score... 8% (using the updated % calc above).

Cheers,

Jamie

bingyuyiyang

Thank you for making the tutorial. I have some questions about the code:
1) Do you know why, if I switch the query document (demofile2.txt) and demofile.txt, I don't get the same similarity between the two documents?
2) If the document demofile.txt contains just one sentence, "Mars is the fourth planet in our solar system.", then print(doc) prints nothing. Do you know why? In other words, TF-IDF does not work in your code when the corpus is a single sentence.

file_docs = []
with open('~/demofile.txt') as f:
    tokens = sent_tokenize(f.read())
    for line in tokens:
        file_docs.append(line)
print("Number of documents:", len(file_docs))

gen_docs = [[w.lower() for w in word_tokenize(text)]
            for text in file_docs]
dictionary = gensim.corpora.Dictionary(gen_docs)
print(dictionary.token2id)

corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]
tf_idf = gensim.models.TfidfModel(corpus)
for doc in tf_idf[corpus]:
    print(doc)
    print([[dictionary[id], np.around(freq, decimals=2)] for id, freq in doc])

Please let me know if you have any comments about it. Thank you.

clymbert8

1) This "similarity" is asymmetric. Look at the definition of TF-IDF: it is calculated for whatever you consider the corpus, not the query. So when you switch query and corpus, you are changing the weights (IDF) or the "normalization" (I prefer to think of sqrt(DF) as the denominator for both texts, the corpus as well as the query).

Geometrically: you have two pseudo-vectors V1 and V2. Naively we think of similarity as some equivalent of the cosine of the angle between them. But the "angle" is calculated after a projection of both V1 and V2. If this projection were determined symmetrically you would be fine, but in fact the projection of both vectors is based on a component of the first vector, so it is not symmetric under exchange.

Concretely, consider two vectors V1 = (3,4,5) and V2 = (3,1,2). Our rule is to calculate the angle after projecting perpendicular to the largest component of the first vector (to down-weight or eliminate the most common tokens in the corpus). If V1 is the corpus, you are calculating the angle between V1' = (3,4,0) and V2' = (3,1,0). If V2 is the corpus, you are calculating the angle between V1" = (0,4,5) and V2" = (0,1,2).
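This asymmetry is easy to reproduce with a bare-bones TF-IDF (log(N/df) weighting and cosine similarity, a simplification of what gensim does): the score changes depending on which document the IDF weights were learned from.

```python
import math

def tfidf_weights(corpus_tokens):
    """IDF weights log(N / df) learned from a corpus of token lists."""
    n = len(corpus_tokens)
    df = {}
    for doc in corpus_tokens:
        for tok in set(doc):
            df[tok] = df.get(tok, 0) + 1
    return {tok: math.log(n / d) for tok, d in df.items()}

def vectorize(doc, idf):
    """Sum IDF weights per token; tokens unseen in the corpus get no weight."""
    vec = {}
    for tok in doc:
        if tok in idf:
            vec[tok] = vec.get(tok, 0.0) + idf[tok]
    return vec

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return 0.0 if na == 0 or nb == 0 else dot / (na * nb)

doc_a = ["mars", "is", "the", "red", "planet"]
doc_b = ["mars", "is", "a", "planet"]
extra = ["venus", "is", "a", "planet", "too"]

# IDF learned with doc_a as part of the corpus...
idf1 = tfidf_weights([doc_a, extra])
sim_ab = cosine(vectorize(doc_a, idf1), vectorize(doc_b, idf1))

# ...versus IDF learned with doc_b as part of the corpus
idf2 = tfidf_weights([doc_b, extra])
sim_ba = cosine(vectorize(doc_a, idf2), vectorize(doc_b, idf2))

print(sim_ab, sim_ba)  # the two "similarities" generally differ
```

The documents and weighting scheme here are made up for illustration; the point is only that swapping which side plays the role of corpus changes the IDF weights, and with them the score.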

2) Again, think geometrically in terms of projections. If your corpus has only one document, every token in that document has the maximum DF. So when you project perpendicular to this, you get zero! Algebraically, I suspect that what people call IDF is actually log(IDF). A token that appears in every document in the corpus has a DF of 1, its inverse is 1, and the log of that is... 0. So if you only have one document, every token satisfies this and you are left with log(IDF) = 0.

Why log? Probably something based on Zipf's law. But remember, this log(IDF) is not mathematically derived; it is just a heuristic that has become common usage. If you prefer to do geometry with distributions, you should use something like the symmetrized Kullback-Leibler divergence, or even better, the Euclidean metric in logit space.
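The one-document case can be checked directly with the same log(N/df) heuristic: every token's document frequency equals the corpus size, so every weight comes out zero (the sentence below is the one from the question above).

```python
import math

# A one-document corpus, as in question 2) above
corpus_tokens = [["mars", "is", "the", "fourth", "planet"]]

n = len(corpus_tokens)
df = {}
for doc in corpus_tokens:
    for tok in set(doc):
        df[tok] = df.get(tok, 0) + 1

# Every token appears in the only document, so df == n and log(n/df) == 0
idf = {tok: math.log(n / d) for tok, d in df.items()}
weights = {tok: w for tok, w in idf.items() if w > 0}

print(idf)      # every value is 0.0
print(weights)  # empty, which is why print(doc) shows nothing
```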

Vee W

Hi, this is very helpful! I wonder whether there is any doc on how the number of documents is determined. I also need to read more on solutions for when there is only one document.

plysytsya

Thanks for making this tutorial. Exactly what I was looking for!

diyaralzuhairi

A very, very helpful blog. Thanks a lot, my friend.

I am wondering if it's possible to apply my idea with this approach: I'm thinking of creating two folders, (Student_answers) and (Teacher_reference_answers), each containing a number of txt files. For example, Student_answers has 30 txts and Teacher_reference_answers has 5. The idea is to compare the student answer documents against the 5 teacher answers to compute each score automatically (and choose the biggest score for each student)?
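That setup sounds workable. As a minimal sketch (plain bag-of-words cosine instead of the tutorial's full gensim pipeline; the folder layout, answers, and function names below are hypothetical), each student answer is scored against every reference answer and the best score is kept:

```python
import math

def bow(text):
    """Lowercased bag-of-words counts; a stand-in for the tutorial's tokenizing."""
    counts = {}
    for tok in text.lower().split():
        counts[tok] = counts.get(tok, 0) + 1
    return counts

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return 0.0 if na == 0 or nb == 0 else dot / (na * nb)

def best_score(student_text, teacher_texts):
    """Compare one student answer to every reference answer, keep the best."""
    s = bow(student_text)
    return max(cosine(s, bow(t)) for t in teacher_texts)

# Hypothetical in-memory answers; with real folders you would read each file,
# e.g. [p.read_text() for p in Path("Teacher_reference_answers").glob("*.txt")]
teachers = ["mars is the red planet", "venus is the second planet"]
student = "mars is a red planet"

print(round(best_score(student, teachers), 2))
```

Looping `best_score` over the 30 student files would give one best-match score per student.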

I would love to hear from you <3

Isa Levine

Oh wow, this is EXACTLY the kind of tutorial I've been looking for to dip my toes into NLP! Thank you so much for sharing this!!

Rashid

Great! :D Glad you liked it!

Nihar Sainee

This piece of code is giving an error when called:

# perform a similarity query against the corpus
query_doc_tf_idf = tf_idf[query_doc_bow]
print(document_number, document_similarity)
print('Comparing Result:', sims[query_doc_tf_idf])

The error is:
FileNotFoundError: [Errno 2] No such file or directory: 'workdir/.0'
Can anyone help?

Zhuangdie(Alan) Zhou

# building the index
sims = gensim.similarities.Similarity('workdir/', tf_idf[corpus],
                                      num_features=len(dictionary))

Change 'workdir/' to the directory where your Python script is located.

Casi Imin • Edited

Nice! Thank you for sharing, very useful. The NLTK’s power!

Rashid

I am glad you liked it! :)

Vee W

Absolutely fabulous article!