
Discussion on: Compare documents similarity using Python | NLP

clymbert8

1) This "similarity" is asymmetric. Look at the definition of TF-IDF: it is computed with respect to whatever you consider the corpus, not the query. So when you switch query and corpus, you change the weights (IDF), or the "normalization" (I prefer to think of square-root(DF) as the denominator for both texts, the corpus as well as the query). Geometrically: you have two pseudo-vectors V1 and V2. Naively we think of similarity as something equivalent to the cosine of the angle between them. But the "angle" is calculated after a projection of both V1 and V2. If this projection were determined symmetrically, you would be fine; in fact, though, the projection of both vectors is based on a component of the first vector, so it is not symmetric under exchange. Concretely, consider two vectors V1 = (3, 4, 5) and V2 = (3, 1, 2), with the rule that we calculate the angle after projecting perpendicular to the largest component of the first vector (to down-weight or eliminate the most common tokens in the corpus). If V1 is the corpus, you are calculating the angle between V1' = (3, 4, 0) and V2' = (3, 1, 0). If V2 is the corpus, you are calculating the angle between V1" = (0, 4, 5) and V2" = (0, 1, 2).
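To make the asymmetry concrete, here is a small sketch of the toy projection rule described above (zero out the largest component of whichever vector plays the corpus role, then take the cosine). This is a stand-in for IDF down-weighting, not real TF-IDF; `projected_cosine` is a hypothetical helper, not from the original article.

```python
import numpy as np

def projected_cosine(corpus_vec, query_vec):
    """Cosine similarity after projecting both vectors perpendicular to
    the largest component of the *corpus* vector (a toy stand-in for
    IDF down-weighting the most common token)."""
    a = np.asarray(corpus_vec, dtype=float).copy()
    b = np.asarray(query_vec, dtype=float).copy()
    drop = int(np.argmax(a))      # index of the "most common" token in the corpus
    a[drop] = b[drop] = 0.0       # project both perpendicular to that axis
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

v1, v2 = [3, 4, 5], [3, 1, 2]
print(projected_cosine(v1, v2))   # V1 as corpus: cos between (3,4,0) and (3,1,0) ≈ 0.822
print(projected_cosine(v2, v1))   # V2 as corpus: cos between (0,1,2) and (0,4,5) ≈ 0.978
```

Swapping which vector is "the corpus" changes which axis gets projected out, and the two cosines disagree, which is exactly the asymmetry under exchange.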

2) Again, think geometrically in terms of projections. If your corpus has only one document, every token in that document has the maximum DF. So when you project perpendicular to this, you get zero! Algebraically, I suspect that what people call IDF is actually log(IDF). So a token that appears in every document in the corpus has a DF of 1, its inverse is 1, and the log of that is ... 0. If you only have one document, every token satisfies this and you are left with log-IDF = 0.
Why log? Probably something based on Zipf's law. But remember, this (log-IDF) is not mathematically derived; it is just a heuristic that has become common usage. If you prefer to do geometry with distributions, you should use something like the symmetrized Kullback-Leibler divergence, or even better, the Euclidean metric in logit space.
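The one-document degenerate case is easy to verify directly. A minimal sketch, assuming the plain log(N/df) form of IDF with no smoothing (libraries such as scikit-learn add smoothing terms precisely to dodge this); `log_idf` is an illustrative helper, not from the original article:

```python
import math
from collections import Counter

def log_idf(tokenized_docs):
    """log(N / df) for each token across a list of tokenized documents."""
    n_docs = len(tokenized_docs)
    # df counts, per token, how many documents contain it at least once
    df = Counter(tok for doc in tokenized_docs for tok in set(doc))
    return {tok: math.log(n_docs / count) for tok, count in df.items()}

# One-document corpus: every token appears in every (i.e. the only)
# document, so df = N = 1 and log(N / df) = log(1) = 0 for all of them.
single = [["the", "cat", "sat", "on", "the", "mat"]]
print(log_idf(single))   # every value is 0.0
```

With every weight zero, the TF-IDF vector of any query against this corpus is the zero vector, which is the algebraic version of "projecting perpendicular to everything".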