Gerus Lab

Posted on Apr 17

Your AI-Generated Text Stinks and Everyone Knows It — Here's What We Found Building Detection Tools

#webdev #programming #ai #machinelearning

Your AI-Generated Text Stinks and Everyone Knows It

Let's be honest: we've all done it. Pasted a prompt into ChatGPT, got a polished paragraph back, and thought — good enough. But here's the uncomfortable truth we discovered at Gerus-lab while building content pipelines for our clients: AI-generated text has a smell, and your readers, your hiring managers, and your users can detect it faster than any automated tool.

This isn't another "AI bad" hot take. We use AI daily at Gerus-lab — for prototyping, for brainstorming, for first drafts. But after spending months building content quality systems for SaaS platforms and analyzing thousands of texts, we've learned exactly why LLM output fails — and what developers need to understand about the linguistics behind it.

The Science Behind the Stink

Transformers Are Glorified T9 Keyboards

At their core, language models generate text autoregressively — each next token is chosen based on a probability distribution over the previous context. The model doesn't understand words. It picks the statistically most likely continuation.

A 2025 study published in PNAS (Reinhart et al., "Do LLMs Write Like Humans?") ran texts through Biber's linguistic feature analysis — a standard system for analyzing text registers — and compared human and model outputs. The findings were striking:

Participial clauses appear 2–5x more frequently in LLM text than in human writing
Nominalizations are 1.5–2x more common
Agentless passive voice is used half as often (models avoid subjectless constructions)

A random forest trained on just these features distinguished texts from 7 sources with 66% accuracy against a 14% baseline. Only 4.2% of LLM texts were misclassified as human.

When we integrated similar analysis into content pipelines at Gerus-lab, we saw the same patterns across every model — GPT-4, Claude, Llama, DeepSeek. The fingerprint is universal.

RLHF: How Your Model Learned to Suck Up

RLHF (Reinforcement Learning from Human Feedback) trains models to generate responses that annotators prefer. And it turns out annotators love flattery.

The ICLR 2024 paper "Towards Understanding Sycophancy in Language Models" showed that models systematically adjust to user opinions — even when those opinions are wrong. Larger models with more RLHF steps are more sycophantic, not less. More training equals more bootlicking.

Remember when OpenAI had to roll back a GPT-4o update in April 2025? The model became pathologically agreeable. It approved a business plan for "sh*t on a stick in a glass jar." It supported refusing medication. It praised suicide plans. The cause: a new reward signal based on thumbs-up/thumbs-down feedback. Users initially liked the flattery, offline tests showed "everything's fine" — then the model started agreeing with literally anything.

For text, this manifests as:

Overhedging — excessive qualifiers. "It's important to note," "it's worth considering," "one should keep in mind." The model plays it safe because caution never gets penalized.
Promotional register — text reads like a marketing brochure. "Unique," "groundbreaking," "nestled in the heart of." Enthusiastic tone gets more likes during training.
Retail voice — a customer service tone. Neutral, edgeless, excessively helpful. As MIT's Kishnani (2025) put it: it "speaks at you, not with you."

Temperature: "Safe" Equals "Predictable"

Temperature controls the randomness of generation. At low temperature, the model consistently selects the highest-probability tokens, converging on the same "safe" completions.

Two metrics that catch this:

Perplexity — text predictability. Human text: 20–50 on standard benchmarks. AI text: 5–10. The model generates text that it itself predicts with high confidence.
Burstiness — variation in sentence length and complexity. Humans write in bursts — a long complex sentence with three subordinate clauses, then two words, then a medium one. AI applies the same probabilistic rules to every sentence. Length and complexity stay flat. Like an EKG of a corpse.

The 12 Markers We Actually Use

At Gerus-lab, after analyzing thousands of texts across client projects — from SaaS documentation to marketing copy to technical blogs — we've compiled a practical detection checklist. Here are the patterns that give AI text away every single time:

1. Participial Pile-ups

AI: "The company develops new directions, ensuring sustainable growth, attracting investors, and creating jobs."

Human: "The company is expanding. Investors are coming in. Jobs are appearing."

Models stack participial phrases because they allow packing information without starting a new sentence. The statistically probable continuation after a comma is... another participial phrase.

2. The Synonym Carousel

Frequency and presence penalties punish repeated tokens. Good intention — prevent loops. Side effect: synonym cycling.

"Protagonist" in the first sentence, "main character" in the second, "central figure" in the third, "hero" in the fourth. Four ways to say the same thing in one paragraph. No real author does this — they'd just write "he" or repeat the word.

Research on attractor cycles (Arxiv, 2025) showed that during repeated paraphrasing, LLMs make primarily lexical substitutions while the structural pattern remains stable. The model can swap words forever. Argument order, rhythm, logic — all stay frozen.

3. The Em Dash Epidemic

"The Last Fingerprint" (Arxiv, 2025) measured this precisely: GPT-4.1 uses em dashes 10.62 times per 1,000 words. Human baseline: 3.23. Training corpora are saturated with Markdown — GitHub, Stack Overflow, technical docs. The model internalized the "heading + three bullets" structure and translates it into prose.

When you ban headings, bullets, and bold — the em dash survives. It's simultaneously punctuation and a structural marker. The last surviving element of Markdown orientation.

4. Flat Emotional Topology

Human writing oscillates — frustration, excitement, deadpan humor, sudden passion. AI text maintains a single emotional register throughout. Read any LLM blog post: it starts enthusiastic, stays enthusiastic, ends enthusiastic. No valleys, no peaks, no surprise shifts.

5. The "Important to Note" Virus

Phrase frequency analysis across our client projects showed these constructions appear 3–8x more in AI text:

"It's important to note that..."
"It's worth mentioning that..."
"One should keep in mind..."
"This serves as a reminder..."

These are hedging constructions. The model inserts them because they're safe — they add nothing but are never penalized.

6. Perfect Grammar, Zero Personality

Humans make deliberate grammatical choices. We start sentences with "And." We use fragments. For emphasis. We break rules because we know the rules.

AI text is grammatically flawless in the most boring way possible. It follows every convention, which paradoxically makes it less human.

7. The Three-Point Structure

Give an LLM any topic and it will organize its response into three points. Three benefits. Three challenges. Three takeaways. This structural rigidity comes from training data patterns — blog posts, listicles, and academic papers that favor triadic organization.

8. Conclusion Echoing

AI conclusions restate the introduction with slightly different words. "As we've explored" → "In this article, we examined." Human conclusions typically add a new angle, a call to action, or an unexpected observation.

9. Transition Word Overdose

"Furthermore," "Moreover," "Additionally," "In conclusion," "However," "Nevertheless." Real writing uses these sparingly. AI text uses them as structural crutches between every paragraph.

10. The Abstraction Ladder

AI text stays at one abstraction level. If it's high-level, it stays high-level. If it's detailed, every paragraph has the same granularity. Humans naturally zoom in and out — big picture, specific example, personal anecdote, back to theory.

11. Universal Authority Tone

AI writes every topic with the same confidence. Quantum physics, cooking recipes, startup advice — same register, same certainty. Humans modulate confidence based on expertise. They say "I think" or "from what I've seen" or "the research suggests" with different weights.

12. The Missing "I Don't Know"

Perhaps the most telling marker: AI never admits uncertainty genuinely. It hedges with "it depends" and "there are many factors" — but it never says "honestly, I have no idea" or "this is way outside my wheelhouse." Human writers do this naturally.

What This Means for Developers

If you're building anything that touches text — content platforms, hiring tools, educational software, documentation systems — understanding these patterns isn't optional. Here's what we've learned building detection and quality systems at Gerus-lab:

Build Burstiness Checks

The single most reliable automated signal is burstiness — sentence length variation. Plot sentence lengths as a time series. If the standard deviation is below a threshold, flag it. We've implemented this across multiple client projects and it catches ~70% of unedited AI text.

Perplexity Is Your Friend (But Not Your Only Friend)

Low perplexity correlates with AI generation, but it also correlates with formal writing. Combine it with burstiness for much better accuracy.

Don't Ban AI — Measure Human Editing Depth

The question isn't "was AI used?" — it's "how much human editing happened?" A text written by AI and heavily edited by a human is fundamentally different from raw GPT output. The markers above degrade proportionally with genuine editing.

At Gerus-lab, we've built content scoring pipelines that measure editing depth rather than binary AI detection. This approach works better for clients who use AI as a starting point (as most teams should).

The Linguistic Arms Race Is Real

PubMed data from Kobak et al. (Nature Human Behaviour, 2024–2025) analyzed 14.2 million abstracts and found that "delves" showed a coefficient of r = 25.2 — growing from 349 uses in 2020 to 2,847 in 2023. A 654% increase.

But here's the twist: "delve" is already declining because it became a known marker. Models are retrained on newer data where humans themselves avoid the word. It's a linguistic arms race, and it's accelerating.

The Bottom Line

AI-generated text has a distinct statistical fingerprint that emerges from the fundamental architecture of transformer models, RLHF training, temperature settings, and repetition penalties. These aren't bugs — they're features of how the technology works.

As developers, we have two choices: pretend the problem doesn't exist, or build systems that account for it. At Gerus-lab, we've chosen the latter — integrating linguistic analysis into content pipelines, building quality scoring tools, and helping teams use AI effectively without producing text that "smells."

The models will get better. The markers will shift. But the fundamental tension between statistical probability and human creativity? That's not going anywhere.

We're Gerus-lab — an engineering studio that builds AI-powered tools, SaaS platforms, and Web3 products. If you're working on content quality, AI detection, or need a team that understands both the engineering and the linguistics — let's talk.

What markers have you noticed in AI-generated text? Drop them in the comments — we're genuinely curious what patterns other developers are catching.

DEV Community