Discussion on: Compare documents similarity using Python | NLP


This was completely awesome, clear and useful.
I don't want to sound lazy, but I was wondering if you can point me in the right direction.
I have a large group of texts that are very similar to one another (questionnaires). I wanted to build a large corpus of the "typical" answers and write a function that tells me whether a given questionnaire differs from the "average" (I think I can tag where they differ by listing the lowest-ranking sentences). My initial attempt consisted of combining several files, putting them in the corpus, and then comparing a few other files against that corpus. It seems weird, but as my corpus grows, I tend to get LOWER similarity. Is this expected? If so, how can I improve the code so it points me to "unusual" answers? I was thinking about feeding several corpora into a machine learning algorithm until it learns which corpora are alike and which are different. But I was wondering if you have any insights on that. Thanks.
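
To make the question concrete, here is a rough sketch of the "list the lowest-ranking sentences" idea. It is not the code from the post: it assumes a plain-Python TF-IDF with cosine similarity instead of the gensim pipeline, and the names `tokenize`, `build_idf`, `tfidf`, `cosine`, and `rank_sentences` are hypothetical helpers made up for illustration. Each sentence of a new questionnaire is scored against its closest sentence in the corpus, and the results are sorted so the least similar sentences come first.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_idf(docs_tokens):
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(docs_tokens)
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf(tokens, idf, unseen_weight):
    """Sparse TF-IDF vector; terms absent from the corpus get unseen_weight."""
    counts = Counter(tokens)
    return {t: c * idf.get(t, unseen_weight) for t, c in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_sentences(corpus_sentences, new_sentences):
    """Return (score, sentence) pairs, least similar to the corpus first.

    The score is the similarity to the closest corpus sentence
    (an assumption of this sketch, not something the post prescribes).
    """
    corpus_tokens = [tokenize(s) for s in corpus_sentences]
    idf = build_idf(corpus_tokens)
    # Words never seen in the corpus get the largest possible IDF weight.
    unseen_weight = math.log(1 + len(corpus_tokens)) + 1.0
    corpus_vecs = [tfidf(t, idf, unseen_weight) for t in corpus_tokens]
    scored = []
    for sentence in new_sentences:
        vec = tfidf(tokenize(sentence), idf, unseen_weight)
        best = max((cosine(vec, cv) for cv in corpus_vecs), default=0.0)
        scored.append((best, sentence))
    return sorted(scored, key=lambda pair: pair[0])


if __name__ == "__main__":
    typical = [
        "I am satisfied with the service",
        "The service was fast and friendly",
        "I would recommend the service to a friend",
    ]
    questionnaire = [
        "I am satisfied with the service",
        "My cat knocked over the router",
    ]
    for score, sentence in rank_sentences(typical, questionnaire):
        print(f"{score:.3f}  {sentence}")
```

In this sketch the corpus sentences are kept separate instead of being merged into one big document, so adding more "typical" answers can only raise or keep a sentence's best-match score. Whether that is the right fix for the dropping similarity I am seeing is exactly what I am unsure about.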