Cool website you got yourself there!
I got a question I forgot to ask. Why do you turn the 'stopwords' list into a set()? At first I thought it was because you probably intended to remove duplicate items from the list, but then it struck me: why would there be duplicate items in a corpus list containing stop words? When I compared the length of the list before and after turning it into a set, there was no difference:
len(stopwords.words("english")) == len(set(stopwords.words("english")))
Tracing the variable throughout the script, I must admit, I cannot figure out why you turned it into a set. I assume it is a mistake?
Or do you have any specific reason for it?
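For reference, here is a minimal sketch of that duplicate check. It uses a small made-up list in place of the NLTK corpus, so `sample_words` and `find_duplicates` are hypothetical names of mine and not anything from the script:

```python
from collections import Counter


def find_duplicates(words):
    """Return the items that appear more than once, in first-seen order."""
    counts = Counter(words)
    return [word for word, n in counts.items() if n > 1]


# Hypothetical stand-in for stopwords.words("english"); "the" is repeated on purpose.
sample_words = ["the", "a", "and", "the", "of"]

duplicates = find_duplicates(sample_words)

# The same length comparison as above: True only when there are no duplicates.
same_length = len(sample_words) == len(set(sample_words))

print(duplicates)
print(same_length)
```

Even when the lengths match, a set may still be worth keeping if the script mostly does `word in stopwords` checks, since set lookups take constant time on average while list lookups scan the whole list.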
Hmm, I believe the first time I used the list of stop words from NLTK there were some duplicates. If not, I am curious too, lol. It may be time to change it to a list.
Thanks for the note!
If you ever try your implementation using TF-IDF, let me know how it goes.