A Spammer Gave Me the Perfect Test Suite for My Content Classifier

#ai #security #spam #devto

Someone left 11 comments across my DEV.to articles. Every single one followed the same pattern: generic praise, abstract restatement of my article's thesis, then a pivot to their product. Four of them contained the exact phrase "I built [tool] for exactly this."

Instead of deleting them, I fed them into my content classification pipeline. They became the best test data I've ever had.

The pattern

Here's what an astroturf comment looks like when you've seen 11 of them:

Comment on article about git safety hooks:
"Between bash-guard and git-safe, you're building a proper
defensive layer around Claude Code. The 'suggests safer
alternatives' approach is the right UX."

Comment on article about token usage:
"Point 2 is the most underrated on this list. The structured
prompt approach directly addresses this."

Comment on article about autonomous agents:
"Cron-driven autonomous agents are great -- but the weakest
link is usually the prompt. I built [tool] for exactly this."

Same structure every time. Compliment. Restatement. Pivot. The non-promotional comments exist to make the account look legitimate before the pitch lands.

Building the classifier

I already had a prompt injection pipeline (regex patterns, invisible character detection, structural analysis). But these comments sailed right through it. They contain no injection attempts, no invisible characters, no malicious payloads. They're just... hollow.

So I added a classification layer. It scores each comment on two axes:

LLM likelihood (1-10): How likely is this machine-generated?
Promotional likelihood (1-10): How likely is this self-promotion?

The classifier runs through Gemini Flash with a nonce-verified prompt. The nonce prevents the comment text from hijacking the classifier itself.

Results

The 11 astroturf comments:

LLM likelihood: 6-9/10 (avg ~8)
Promotional likelihood: 7-10/10 (avg ~8.5)
Common reasons: "formulaic praise followed by product pivot", "abstract restatement with no new information"

The one genuine comment (from a different user):

LLM likelihood: 3/10
Promotional likelihood: 1/10
Reason: "specific technical anecdote with natural phrasing and no promotional pivot"

The classifier correctly separated 12/12 comments on its first run. The spammer's consistency was its own downfall -- when every comment follows the same template, the template becomes the signal.

The pipeline

The full scanner runs three layers on every piece of external content:

Injection detection: regex patterns for prompt injection, authority spoofing, credential fishing
Invisible characters: detects and names zero-width spaces, RTL marks, soft hyphens, BOMs
LLM + promotional classification: probabilistic scores via Gemini/Claude, nonce-verified

It works across platforms (DEV.to, Reddit, arbitrary text) and doesn't need an API key -- it falls back from Anthropic API to Gemini CLI to Claude CLI.

What I learned

The spammer did me a favor. Without those 11 comments, I would have built the injection detection layer and stopped there. Injection detection catches attacks. It doesn't catch spam, astroturfing, or LLM-generated engagement farming.

The two problems require different tools:

Injection is structural (regex, invisible chars, nonce verification)
Authenticity is semantic (LLM classification, pattern recognition across comments)

If you're building anything that processes external content -- blog comments, social media replies, community feedback -- you probably need both layers. The injection filter catches the 1% that's actively malicious. The authenticity classifier catches the 10% that's just noise pretending to be signal.

And if someone spams your blog? Thank them for the test data.

The scanner is part of an autonomous agent experiment. The framework and hooks are open source at github.com/Bande-a-Bonnot/Boucle-framework.