Ken Imoto

Originally published at doi.org

We Measured 180 AI-Generated Japanese Articles. The Results Were Not What We Expected.

(Cover chart: human-written articles scored higher than every AI model.)

The Experiment

We gave the same prompt to six LLMs: "Write a technical blog article about [topic] in approximately 800 characters."

  • Commercial: Claude Sonnet 4, GPT-4o
  • Open-source: Qwen 3.5-4B, Qwen 3.5-9B, Swallow-20B (Japanese-specialized), Llama 3.2-1B

6 models × 10 topics × 3 trials each = 180 samples. Then we measured 16 linguistic pattern indicators — AI-frequent vocabulary, boilerplate conclusions, hedging, structural formatting, sentence rhythm — and combined them into a composite "AI Text Slop Score."
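Conceptually, the composite is just a weighted sum over the per-article indicators. Here is a minimal sketch; the indicator names and weights are illustrative placeholders, not the paper's actual values.

```python
def slop_score(indicators: dict[str, float],
               weights: dict[str, float]) -> float:
    """Weighted sum of per-article linguistic indicators."""
    return sum(weights[name] * value for name, value in indicators.items())

# Hypothetical per-article measurements and weights:
article = {"ai_vocab": 3.4, "boilerplate_endings": 1.0, "headings": 22.0}
weights = {"ai_vocab": 2.0, "boilerplate_endings": 5.0, "headings": 0.5}
score = slop_score(article, weights)
```

The choice of weights is exactly what the sensitivity analysis later in the post stress-tests.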

The Rankings (No Surprises Here)

| Model | Slop Score | Type |
| --- | --- | --- |
| Claude Sonnet 4 | 22.6 ± 4.2 | Commercial |
| GPT-4o | 20.1 ± 6.0 | Commercial |
| Qwen 3.5-4B | 16.6 ± 5.3 | OSS |
| Qwen 3.5-9B | 15.6 ± 3.9 | OSS |
| Swallow-20B | 15.2 ± 6.2 | OSS (JP) |
| Llama 3.2-1B | 11.3 ± 8.6 | OSS |

Commercial models produce more "AI-like" text than open-source (Cohen's d = 1.01, p < 10⁻⁹). This is consistent with RLHF training optimizing for "professional, helpful" responses — which means converging on the same patterns.

In other words, the more you train a model to sound "helpful," the more it sounds like every other helpful model. RLHF is basically a factory that mass-produces the same polite intern.
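For readers unfamiliar with the effect size quoted above: Cohen's d is the difference in group means divided by the pooled standard deviation, so d = 1.01 means the commercial and OSS score distributions are about one standard deviation apart. A minimal implementation (my own sketch, not the paper's script):

```python
import statistics

def cohens_d(a: list[float], b: list[float]) -> float:
    """Cohen's d: mean difference over the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5
```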

Not surprising. But then we added human data.

The Plot Twist

We scored 10 human-written Qiita articles using the same 16 indicators.

Human score: 28.5 ± 8.1

Wait — humans scored higher than all AI models? The most "AI-like" writer... is human?

What's Actually Going On

It turns out our score was measuring two different things at once:

Structural indicators (headings, bullet lists, boilerplate conclusions) reflect platform culture. Qiita authors average 22.4 headings and 31.8 list markers per article — not because they're AI, but because that's what a "good Qiita article" looks like.

Vocabulary indicators (AI-frequent phrases, hedging, sycophantic language) actually do discriminate. Claude (3.43) and GPT-4o (3.33) use more AI-characteristic vocabulary than human writers (2.70).

The lesson: structure is cultural, vocabulary is computational. Any AI text detection system that mixes them will produce false positives on platforms with strong formatting conventions.
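In code, "don't mix them" just means reporting two sub-scores instead of one composite. The indicator groupings below mirror the two dimensions named above; the specific key names are made up for illustration.

```python
# Illustrative split into the two dimensions; names are hypothetical.
STRUCTURAL = {"headings", "list_markers", "boilerplate_endings"}
VOCABULARY = {"ai_vocab", "hedging", "sycophancy"}

def split_scores(indicators: dict[str, float]) -> tuple[float, float]:
    """Return (structural, vocabulary) sub-scores instead of one mixed score."""
    structural = sum(v for k, v in indicators.items() if k in STRUCTURAL)
    vocabulary = sum(v for k, v in indicators.items() if k in VOCABULARY)
    return structural, vocabulary
```

A detector can then calibrate the structural sub-score per platform while keeping the vocabulary sub-score platform-independent.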

The Swallow Paradox

The most interesting model was Swallow-20B — a Japanese-specialized LLM from Tokyo Institute of Technology.

  • Lowest AI-frequent vocabulary (0.80) — it learned natural Japanese from its curated corpus
  • Highest boilerplate conclusions (1.17) — it also learned the structural clichés of Japanese tech blogging

Vocabulary and structure don't move together. They're independent dimensions. This means:

  1. You can't detect Japanese AI text by vocabulary alone (Swallow defeats it)
  2. You can't detect it by structure alone (humans defeat it)
  3. You need to analyze both dimensions separately

Each Model Has a Fingerprint

| Model | Signature Pattern |
| --- | --- |
| Claude Sonnet 4 | Over-structures everything (most headings, most lists) |
| GPT-4o | Hedges constantly ("it is considered that...") |
| Swallow-20B | Natural vocabulary + formulaic endings |
| Llama 3.2-1B | Can't follow instructions (asked for 800 chars, wrote 3,900) |

Llama 3.2-1B is the intern who was asked to write a one-page memo and returned a novella. Ironically, this incompetence makes it the least detectable by style metrics. Sometimes the best way to not sound like AI is to be bad at being AI.
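Instruction-following itself is trivially measurable: check whether the output lands within a tolerance band around the requested length. A tiny sketch (the ±25% tolerance is my own arbitrary choice, not from the paper):

```python
def length_compliance(text: str, target_chars: int = 800,
                      tolerance: float = 0.25) -> bool:
    """True if the article is within ±tolerance of the requested length."""
    return abs(len(text) - target_chars) <= tolerance * target_chars
```

By this measure Llama 3.2-1B's 3,900-character "800-character article" fails by nearly 5×.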

Is This Robust?

We ran sensitivity analysis with four alternative weighting schemes:

  • Equal weights: Same ranking ✓
  • Vocabulary-only: GPT-4o edges past Claude (more hedging) ✓
  • Structure-only: Claude dominates, Swallow jumps to #2 ✓
  • Leave-one-feature-out: Top-2 unchanged across all 10 conditions ✓

The commercial > OSS gap holds under every scheme (Cohen's d = 0.67–1.15).

Practical Takeaways

If you write technical articles:

  • Vary your sentence rhythm (AI text has suspiciously uniform sentence lengths)
  • Drop the "いかがでしたでしょうか" (How was it?) — it's the #1 AI signal in Japanese. Every AI model ends articles like a waiter asking if you enjoyed your meal. You're not a waiter. Stop it.
  • Use specific vocabulary instead of generic filler ("さまざまな", "効率的な")
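Sentence-rhythm uniformity is easy to self-check: compute the coefficient of variation of your sentence lengths. Low values mean metronomically even sentences. This is my own quick sketch, not the paper's indicator; the regex splits on both Japanese and Western sentence-ending punctuation.

```python
import re
import statistics

def rhythm_cv(text: str) -> float:
    """Coefficient of variation of sentence lengths (low = suspiciously uniform)."""
    sentences = [s for s in re.split(r"[。．.!?！？]+", text) if s.strip()]
    lengths = [len(s) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    return statistics.stdev(lengths) / statistics.mean(lengths)
```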

If you build AI detection tools:

  • Separate vocabulary sub-scores from structural sub-scores
  • Calibrate per platform — Qiita norms ≠ Zenn norms ≠ note norms
  • Multi-dimensional analysis beats single-feature classifiers

The Paper

Full paper (14 pages), all 190 data samples, and analysis scripts:


This is the second paper in a series on AI Slop — the systematic convergence patterns in AI-generated content. The first (AI Blue) covered visual patterns; this one covers text.
