DEV Community

Michael Tharrington

Text classification tools for fighting spam

Does anyone have recommended text classification tools (APIs, services, etc) for fighting spam here on DEV?

We have spam measures in place, but we’d like to mature our toolbelt with text classification. We’re not interested in working with services that rely on providing the poster’s IP address (such as Akismet), and ideally we’d want to use something open source.

Look forward to hearing everyone’s thoughts!

Top comments (5)

Philip (pchinery)

To understand the scope: do you mean preventing spam from being posted on the site, or removing it quickly afterward?

Is that a problem right now, or are you seeing the first signs of spam and want to stop it before it spreads?

"Understanding" text can be tricky. Maybe the SpamAssassin project can be adapted to help here. Otherwise, would it fit your vision of the site to require that new users have their first few articles or comments with links approved by a more senior member?

This would be the opposite of automation, of course, but Stack Overflow uses this concept quite well to grant experienced users more privileges than newcomers, and it would allow for friendly feedback from a human instead of an automated rejection.

Ben Halpern (ben)

We have a variety of other measures in place already including elevating possible spam and sending stuff around for review. We're looking for a detection lib which has the potential to flag stuff as a component of a broader strategy.

As it stands, spam isn't a big problem on the site, we just want to keep refining our solution so it scales better.

marcellothearcane • Edited

Spam isn't a big problem?

spam spam spam

Ben Halpern (ben)

"We’re not interested in working with services that rely on providing the poster’s IP address"

This is because we don't collect this info. We proactively anonymize IP addresses. Since we don't allow any unauthenticated passer-by to comment, we don't necessarily need it either, since authentication info from social login provides more clues.

We already do a bit of text classification for different purposes, but we still feel like we could do more proactively in the area of spam. Any input is appreciated.

Sergiy Yevtushenko (siy)

Spam filtering is usually done with the Naive Bayes algorithm. It's quite simple, and there are a lot of implementations in different languages. As with any such classifier, this algorithm requires training. Unlike many others, it can be trained in production.
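
To make that concrete, here is a minimal sketch of a Naive Bayes spam filter in plain Python. It is only an illustration, not anything DEV actually runs: the class name `NaiveBayesSpamFilter`, the "spam"/"ham" labels, the tokenizer, and the sample posts are all made up for this example. Because the model is nothing more than word counts, "training in production" just means calling `train` again whenever a moderator confirms a post as spam or not spam.

```python
import math
import re
from collections import Counter


class NaiveBayesSpamFilter:
    """Multinomial Naive Bayes with add-one smoothing, trainable one post at a time."""

    LABELS = ("spam", "ham")

    def __init__(self):
        # Per-label word frequencies and per-label document counts.
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = {label: 0 for label in self.LABELS}
        self.vocabulary = set()

    @staticmethod
    def tokenize(text):
        # Deliberately naive tokenizer (an assumption): lowercase alphanumeric runs.
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, label):
        """Fold one labeled post into the counts; safe to call at any time."""
        tokens = self.tokenize(text)
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocabulary.update(tokens)

    def log_score(self, text, label):
        """Return log P(label) + sum of log P(token | label), both smoothed."""
        total_docs = sum(self.doc_counts.values())
        score = math.log((self.doc_counts[label] + 1) / (total_docs + len(self.LABELS)))
        total_words = sum(self.word_counts[label].values())
        # The extra +1 reserves probability mass for words never seen in training,
        # which also keeps the denominator non-zero for an untrained filter.
        denominator = total_words + len(self.vocabulary) + 1
        for token in self.tokenize(text):
            score += math.log((self.word_counts[label][token] + 1) / denominator)
        return score

    def classify(self, text):
        """Return the label with the higher log score."""
        return max(self.LABELS, key=lambda label: self.log_score(text, label))


# Hypothetical training data, purely for illustration.
spam_filter = NaiveBayesSpamFilter()
spam_filter.train("Buy cheap pills now, free money!!!", "spam")
spam_filter.train("Click here to win free money", "spam")
spam_filter.train("How do I write a unit test in Ruby?", "ham")
spam_filter.train("Great article on text classification, thanks", "ham")

print(spam_filter.classify("free money now"))
print(spam_filter.classify("how do I write a unit test"))

# Incremental update: a newly confirmed spam post is folded in without retraining.
spam_filter.train("Win a free prize today", "spam")
```

In a setup like the one Ben describes, the output of `classify` (or the gap between the two log scores) would be one signal feeding the review queue rather than an automatic verdict.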