Does anyone have recommended text classification tools (APIs, services, etc) for fighting spam here on DEV?
We have anti-spam measures in place, but we’d like to mature our toolbelt with text classification. We’re not interested in working with services that rely on providing the poster’s IP address (such as Akismet), and ideally we’d want to use something open source.
Look forward to hearing everyone’s thoughts!
Top comments (5)
To understand the scope: do you mean preventing spam from being posted on the site, or removing it quickly afterwards?
Is that a problem right now or do you see first signs of spam and want to stop it before it spreads?
"Understanding" text can be tricky. Maybe the SpamAssassin project can be adapted to help here. Otherwise, would it be an option for your vision of the site that when trying to post articles or comments with links, new users would be required to have their first few posts approved by a more senior member?
This would be the opposite of automation, of course, but Stack Overflow uses this concept quite well to grant experienced users more privileges than newcomers, and it would let a human give friendly feedback instead of an automated rejection.
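To make the approval idea above concrete, here is a minimal sketch of such a gate. Everything in it is an assumption for illustration: the `Author` type, the `needs_manual_review` helper, the threshold of three approved posts, and the simple link pattern are all hypothetical and not anything DEV or Stack Overflow actually uses.

```python
import re
from dataclasses import dataclass

# Hypothetical threshold: how many approved posts an author needs
# before link-containing posts skip manual review.
APPROVED_POSTS_REQUIRED = 3

# Deliberately simple link detector (assumption); a real site would
# likely inspect the rendered HTML or the parsed markdown instead.
LINK_PATTERN = re.compile(r"https?://|www\.", re.IGNORECASE)


@dataclass
class Author:
    username: str
    approved_posts: int = 0


def needs_manual_review(author: Author, body: str) -> bool:
    """Return True when a new author's post contains a link."""
    is_new = author.approved_posts < APPROVED_POSTS_REQUIRED
    has_link = LINK_PATTERN.search(body) is not None
    return is_new and has_link


newcomer = Author("fresh_account")
regular = Author("long_time_member", approved_posts=12)

print(needs_manual_review(newcomer, "Check out https://example.com"))
print(needs_manual_review(regular, "Check out https://example.com"))
```

Only posts that are both from a new author and contain a link are held back, so newcomers can still comment freely as long as they don't post links.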
We have a variety of other measures in place already including elevating possible spam and sending stuff around for review. We're looking for a detection lib which has the potential to flag stuff as a component of a broader strategy.
As it stands, spam isn't a big problem on the site, we just want to keep refining our solution so it scales better.
Spam isn't a big problem?
This is because we don't collect this info. We proactively anonymize IP addresses. Since we don't allow any unauthenticated passer-by to comment, we don't necessarily need it either, since authentication info from social login provides more clues.
We already do a bit of text classification for different purposes, but we still feel like we could do more proactively in the area of spam. Any input is appreciated.
Spam filtering is usually done with the Naive Bayes algorithm. It's quite simple, and there are a lot of implementations in different languages. As with any such classifier, the algorithm requires training. Unlike many others, it can keep being trained in production, because learning from a new example only means updating word counts.
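As a rough illustration of the Naive Bayes approach described above, here is a small sketch using only the Python standard library. The class name `NaiveBayesSpamFilter`, the tokenizer, the `"spam"`/`"ham"` labels, and the Laplace smoothing constant are assumptions chosen for the example; a production setup would more likely build on an existing open source implementation.

```python
import math
import re
from collections import Counter


class NaiveBayesSpamFilter:
    """Multinomial Naive Bayes with Laplace smoothing (illustrative sketch)."""

    LABELS = ("spam", "ham")

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing constant (assumption)
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()
        self.vocab = set()

    @staticmethod
    def tokenize(text):
        # Lowercase word tokens; real spam filters often add more features.
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, label):
        # Training only updates counts, so it can be called at any time,
        # for example whenever a moderator confirms or rejects a flag.
        tokens = self.tokenize(text)
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def spam_probability(self, text):
        tokens = self.tokenize(text)
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in self.LABELS:
            # Smoothed class prior, so an untrained filter returns 0.5.
            log_p = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            total_words = sum(self.word_counts[label].values())
            # The extra 1 reserves probability mass for unseen words.
            denom = total_words + self.alpha * (len(self.vocab) + 1)
            for token in tokens:
                count = self.word_counts[label][token]
                log_p += math.log((count + self.alpha) / denom)
            log_scores[label] = log_p
        # Convert log scores to a normalized probability without underflow.
        top = max(log_scores.values())
        weights = {k: math.exp(v - top) for k, v in log_scores.items()}
        return weights["spam"] / (weights["spam"] + weights["ham"])


spam_filter = NaiveBayesSpamFilter()
spam_filter.train("Buy cheap pills now", "spam")
spam_filter.train("cheap pills, click here to buy", "spam")
spam_filter.train("How do I deploy a Rails app?", "ham")
spam_filter.train("Great article about Rust lifetimes", "ham")

print(spam_filter.spam_probability("buy cheap pills"))
print(spam_filter.spam_probability("deploy rust app"))
```

The resulting score could feed into a broader strategy as one signal among several, for instance flagging a post for human review above some threshold rather than rejecting it outright.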