
Discussion on: Compare documents similarity using Python | NLP

bingyuyiyang

Thank you for making the tutorial. I have some questions about the code, as follows.
1) Do you know why, if I switch the query document (demofile2.txt) and demofile.txt, I do not get the same similarity for the two documents?
2) If the document demofile.txt contains just one sentence, "Mars is the fourth planet in our solar system.", the output of print(doc) is empty. Do you know why? In other words, TF-IDF does not work with your code when the corpus is a single sentence.

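import os
import gensim
import numpy as np
from nltk.tokenize import sent_tokenize, word_tokenize

file_docs = []
# open() does not expand "~", so resolve the home directory explicitly
with open(os.path.expanduser('~/demofile.txt')) as f:
    # Each sentence of the file is treated as one "document"
    tokens = sent_tokenize(f.read())
    for line in tokens:
        file_docs.append(line)
print("Number of documents:", len(file_docs))

# Tokenize each sentence into lowercase words
gen_docs = [[w.lower() for w in word_tokenize(text)]
            for text in file_docs]
dictionary = gensim.corpora.Dictionary(gen_docs)
print(dictionary.token2id)
corpus = [dictionary.doc2bow(gen_doc) for gen_doc in gen_docs]
tf_idf = gensim.models.TfidfModel(corpus)
for doc in tf_idf[corpus]:
    print(doc)
    print([[dictionary[id], np.around(freq, decimals=2)] for id, freq in doc])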

Please let me know if you have any comments about it. Thank you.

clymbert8

1) This "similarity" is asymmetric. Look at the definition of TF-IDF: the weights are calculated from whatever you consider the corpus, not the query. So when you switch query and corpus, you are changing the weights (IDF) or the "normalization" (I prefer to think of square-root(DF) as the denominator for both texts, the corpus as well as the query).

Geometrically: you have two pseudo-vectors V1 and V2. Naively we think of similarity as some equivalent of the cosine of the angle between them. But the "angle" is calculated after a projection of both V1 and V2. If this projection were determined symmetrically, you would be fine. But the projection of both vectors is actually based on a component of the first vector, so it is not symmetric under exchange.

Concretely, consider two vectors V1 = (3,4,5) and V2 = (3,1,2). Our rule is to calculate the angle after projecting perpendicular to the largest component of the first vector, i.e. the corpus (to down-weight or eliminate the most common tokens in the corpus). If V1 is the corpus, you are calculating the angle between V1' = (3,4,0) and V2' = (3,1,0). If V2 is the corpus, you are calculating the angle between V1" = (0,4,5) and V2" = (0,1,2).
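
To make the numbers above concrete, here is a minimal sketch of that toy rule in plain Python. The helper names (drop_largest_component, toy_similarity) are made up for illustration, and the rule is only the simplified geometric picture from this comment, not what gensim actually computes.

```python
import math

def drop_largest_component(reference, vector):
    """Zero out, in `vector`, the axis where `reference` has its largest entry."""
    axis = max(range(len(reference)), key=lambda i: reference[i])
    return [0 if i == axis else x for i, x in enumerate(vector)]

def cosine(a, b):
    """Cosine of the angle between two vectors given as lists."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def toy_similarity(corpus_vec, query_vec):
    """Project both vectors using the corpus vector only, then take the cosine."""
    a = drop_largest_component(corpus_vec, corpus_vec)
    b = drop_largest_component(corpus_vec, query_vec)
    return cosine(a, b)

v1 = [3, 4, 5]
v2 = [3, 1, 2]

sim_v1_corpus = toy_similarity(v1, v2)  # compares (3,4,0) with (3,1,0)
sim_v2_corpus = toy_similarity(v2, v1)  # compares (0,1,2) with (0,4,5)

print(round(sim_v1_corpus, 3), round(sim_v2_corpus, 3))  # 0.822 0.978
```

The two results differ only because the choice of which axis to discard depends on which vector plays the role of the corpus.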

2) Again, think geometrically in terms of projections. If your corpus has only one document, every token in that document has the maximum DF. So when you project perpendicular to this, you get zero! Algebraically, I suspect that what people call IDF is actually Log(IDF). A token that appears in every document in the corpus has a DF of 1 (as a fraction of the documents); its inverse is 1, and the log of that is ... 0. So if you only have one document, every token satisfies this and you are left with LIDF = 0.
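
A small sketch of that algebra, assuming the weight is log2(N / df), where N is the number of documents and df is the number of documents containing the token. The function name log_idf and the second example sentence are made up for illustration. As far as I know, gensim's TfidfModel uses a base-2 log by default and drops entries whose weight is effectively zero, which would explain the empty list, but treat that as an assumption to check against the gensim documentation.

```python
import math

def log_idf(tokenized_docs):
    """Return {token: log2(N / df)} for a list of tokenized documents."""
    n_docs = len(tokenized_docs)
    df = {}
    for doc in tokenized_docs:
        for token in set(doc):  # count each token once per document
            df[token] = df.get(token, 0) + 1
    return {token: math.log2(n_docs / count) for token, count in df.items()}

single = [["mars", "is", "the", "fourth", "planet"]]
two = single + [["venus", "is", "the", "second", "planet"]]

single_weights = log_idf(single)  # every token is in 1 of 1 documents
two_weights = log_idf(two)        # shared tokens are in 2 of 2 documents

print(single_weights)
print(two_weights)
```

With one document every weight is 0. With two documents only the tokens unique to one of them ("mars", "fourth", "venus", "second") keep a non-zero weight.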
Why log? Probably something based on Zipf's law. But remember, this (LIDF) is not mathematically derived; it is just a heuristic that has become common usage. If you prefer to do geometry with distributions, you should use something like the symmetrized Kullback-Leibler divergence or, even better, the Euclidean metric in logit space.
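
For what it is worth, here is a rough sketch of the symmetrized Kullback-Leibler idea on two token lists. The add-one smoothing (alpha=1.0) is my own assumption so that tokens missing from one document do not produce a division by zero; the function names and the second sentence are also just for illustration.

```python
import math
from collections import Counter

def smoothed_distribution(tokens, vocabulary, alpha=1.0):
    """Token probabilities over a shared vocabulary, with additive smoothing."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocabulary)
    return {t: (counts[t] + alpha) / total for t in vocabulary}

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for dicts over the same keys."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p)

def symmetrized_kl(tokens_a, tokens_b):
    """KL(p || q) + KL(q || p), which does not depend on the argument order."""
    vocabulary = set(tokens_a) | set(tokens_b)
    p = smoothed_distribution(tokens_a, vocabulary)
    q = smoothed_distribution(tokens_b, vocabulary)
    return kl(p, q) + kl(q, p)

doc_a = "mars is the fourth planet".split()
doc_b = "venus is the second planet".split()

d_ab = symmetrized_kl(doc_a, doc_b)
d_ba = symmetrized_kl(doc_b, doc_a)
print(d_ab, d_ba)
```

Unlike the TF-IDF similarity, swapping the two documents gives the same value here, up to floating-point rounding. Note that this is a divergence (0 means identical distributions), not a similarity score.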