## The Experiment
We gave the same prompt to six LLMs: "Write a technical blog article about [topic] in approximately 800 characters."
- Commercial: Claude Sonnet 4, GPT-4o
- Open-source: Qwen 3.5-4B, Qwen 3.5-9B, Swallow-20B (Japanese-specialized), Llama 3.2-1B
10 topics × 3 trials × 6 models = 180 samples. Then we measured 16 linguistic pattern indicators — AI-frequent vocabulary, boilerplate conclusions, hedging, structural formatting, sentence rhythm — and combined them into a composite "AI Text Slop Score."
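The composite score can be sketched as a weighted sum of per-indicator hit rates, normalized by text length. The indicator names, regexes, and weights below are illustrative placeholders, not the paper's actual configuration:

```python
import re

# Hypothetical subset of the 16 indicators; patterns and weights
# are made up for illustration, not taken from the paper.
INDICATORS = {
    "ai_vocab": (re.compile(r"さまざまな|効率的な"), 2.0),               # AI-frequent vocabulary
    "boilerplate_ending": (re.compile(r"いかがでしたでしょうか"), 3.0),   # stock conclusions
    "bullet_marker": (re.compile(r"^[-*] ", re.MULTILINE), 1.0),        # structural formatting
    "heading": (re.compile(r"^#+ ", re.MULTILINE), 1.0),                # structural formatting
}

def slop_score(text: str) -> float:
    """Weighted sum of indicator hits per 1,000 characters."""
    per_kchar = 1000 / max(len(text), 1)
    return sum(len(rx.findall(text)) * w * per_kchar
               for rx, w in INDICATORS.values())
```

A text with no flagged patterns scores 0; every hit raises the score in proportion to its weight.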
## The Rankings (No Surprises Here)
| Model | Slop Score | Type |
|---|---|---|
| Claude Sonnet 4 | 22.6 ± 4.2 | Commercial |
| GPT-4o | 20.1 ± 6.0 | Commercial |
| Qwen 3.5-4B | 16.6 ± 5.3 | OSS |
| Qwen 3.5-9B | 15.6 ± 3.9 | OSS |
| Swallow-20B | 15.2 ± 6.2 | OSS (JP) |
| Llama 3.2-1B | 11.3 ± 8.6 | OSS |
Commercial models produce more "AI-like" text than open-source (Cohen's d = 1.01, p < 10⁻⁹). This is consistent with RLHF training optimizing for "professional, helpful" responses — which means converging on the same patterns.
In other words, the more you train a model to sound "helpful," the more it sounds like every other helpful model. RLHF is basically a factory that mass-produces the same polite intern.
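The effect size quoted above is Cohen's d with a pooled standard deviation. A minimal implementation (the sample lists here are toy numbers, not the paper's data):

```python
from statistics import mean, stdev

def cohens_d(a: list[float], b: list[float]) -> float:
    """Cohen's d: difference of means over the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = (((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2)
              / (na + nb - 2)) ** 0.5
    return (mean(a) - mean(b)) / pooled

# Toy per-sample scores, not the actual 180-sample dataset:
commercial = [22.6, 20.1, 23.0, 19.5]
oss = [16.6, 15.6, 15.2, 11.3]
```

By convention, d around 0.8 or above counts as a large effect, so d = 1.01 is a substantial gap.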
Not surprising. But then we added human data.
## The Plot Twist
We scored 10 human-written Qiita articles using the same 16 indicators.
Human score: 28.5 ± 8.1
Wait — humans scored higher than all AI models? The most "AI-like" writer... is human?
## What's Actually Going On
It turns out our score was measuring two different things at once:
Structural indicators (headings, bullet lists, boilerplate conclusions) reflect platform culture. Qiita authors average 22.4 headings and 31.8 list markers per article, not because they're AI, but because that's what a "good Qiita article" looks like.
Vocabulary indicators (AI-frequent phrases, hedging, sycophantic language) actually do discriminate. Claude (3.43) and GPT-4o (3.33) use more AI-characteristic vocabulary than human writers (2.70).
The lesson: structure is cultural, vocabulary is computational. Any AI text detection system that mixes them will produce false positives on platforms with strong formatting conventions.
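One way to avoid mixing the two dimensions is to report them as separate sub-scores. The feature grouping below is a hedged sketch with hypothetical feature names, not the paper's exact partition:

```python
# Hypothetical split of features into the two dimensions discussed above.
STRUCTURAL = {"headings", "bullet_lists", "boilerplate_conclusions"}
VOCABULARY = {"ai_vocab", "hedging", "sycophancy"}

def split_scores(features: dict[str, float]) -> dict[str, float]:
    """Report structure and vocabulary separately, so platform-culture
    signals (structure) never inflate the model-fingerprint signal
    (vocabulary) in a single conflated composite."""
    return {
        "structure": sum(v for k, v in features.items() if k in STRUCTURAL),
        "vocabulary": sum(v for k, v in features.items() if k in VOCABULARY),
    }
```

A detector built this way can then apply per-platform thresholds to the structure score while keeping the vocabulary score platform-independent.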
## The Swallow Paradox
The most interesting model was Swallow-20B — a Japanese-specialized LLM from Tokyo Institute of Technology.
- Lowest AI-frequent vocabulary (0.80) — it learned natural Japanese from its curated corpus
- Highest boilerplate conclusions (1.17) — it also learned the structural clichés of Japanese tech blogging
Vocabulary and structure don't move together. They're independent dimensions. This means:
- You can't detect Japanese AI text by vocabulary alone (Swallow defeats it)
- You can't detect it by structure alone (humans defeat it)
- You need to analyze both dimensions separately
## Each Model Has a Fingerprint
| Model | Signature Pattern |
|---|---|
| Claude Sonnet 4 | Over-structures everything (most headings, most lists) |
| GPT-4o | Hedges constantly ("it is considered that...") |
| Swallow-20B | Natural vocabulary + formulaic endings |
| Llama 3.2-1B | Can't follow instructions (asked for 800 chars, wrote 3,900) |
Llama 3.2-1B is the intern who was asked to write a one-page memo and returned a novella. Ironically, this incompetence makes it the least detectable by style metrics. Sometimes the best way to not sound like AI is to be bad at being AI.
## Is This Robust?
We ran sensitivity analysis with four alternative weighting schemes:
- Equal weights: Same ranking ✓
- Vocabulary-only: GPT-4o edges past Claude (more hedging) ✓
- Structure-only: Claude dominates, Swallow jumps to #2 ✓
- Leave-one-feature-out: Top-2 unchanged across all 10 conditions ✓
The commercial > OSS gap holds under every scheme (Cohen's d = 0.67–1.15).
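A sensitivity check like this can be sketched by recomputing the ranking under each weighting scheme. The per-model sub-scores below are illustrative stand-ins, not the paper's measured values:

```python
# Per-model (vocabulary, structure) sub-scores — illustrative numbers only.
SUBSCORES = {
    "Claude Sonnet 4": (3.2, 19.2),
    "GPT-4o": (3.5, 16.8),
    "Qwen 3.5-9B": (2.1, 13.5),
    "Llama 3.2-1B": (1.5, 9.8),
}

def ranking(w_vocab: float, w_struct: float) -> list[str]:
    """Rank models by a reweighted composite score, highest first."""
    return sorted(SUBSCORES,
                  key=lambda m: -(w_vocab * SUBSCORES[m][0]
                                  + w_struct * SUBSCORES[m][1]))

# Compare schemes: equal, vocabulary-only, structure-only.
schemes = {"equal": (1, 1), "vocab_only": (1, 0), "struct_only": (0, 1)}
```

If the same models top the list under every scheme, the headline ranking isn't an artifact of one particular weight choice.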
## Practical Takeaways
If you write technical articles:
- Vary your sentence rhythm (AI text has suspiciously uniform sentence lengths)
- Drop the "いかがでしたでしょうか" (How was it?) — it's the #1 AI signal in Japanese. Every AI model ends articles like a waiter asking if you enjoyed your meal. You're not a waiter. Stop it.
- Use specific vocabulary instead of generic filler ("さまざまな", "効率的な")
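Sentence-rhythm uniformity can be quantified as the coefficient of variation of sentence lengths: a low value means suspiciously even rhythm. A minimal sketch, splitting on common Japanese and English sentence enders:

```python
import re
from statistics import mean, stdev

def rhythm_cv(text: str) -> float:
    """Coefficient of variation of sentence lengths.
    Lower values mean more uniform (more AI-like) rhythm."""
    sentences = [s for s in re.split(r"[。.!?！？]+\s*", text) if s.strip()]
    lengths = [len(s) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    return stdev(lengths) / mean(lengths)
```

Human prose typically mixes short punches with long sentences, pushing the CV up; text where every sentence is the same length scores near zero.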
If you build AI detection tools:
- Separate vocabulary sub-scores from structural sub-scores
- Calibrate per platform — Qiita norms ≠ Zenn norms ≠ note norms
- Multi-dimensional analysis beats single-feature classifiers
## The Paper
Full paper (14 pages), all 190 data samples, and analysis scripts:
- 📄 Paper: DOI 10.5281/zenodo.19173035
- 💻 Code & Data: github.com/kenimo49/ai-text-slop
- 🔵 Related: AI Blue — Color Recognition Bias in VLMs
This is the second paper in a series on AI Slop — the systematic convergence patterns in AI-generated content. The first (AI Blue) covered visual patterns; this one covers text.
