AI text detectors are mostly guessing: how they actually work

#ai #machinelearning #writing

If you write documentation, blog posts, or even commit messages, there's a decent chance an "AI detector" has flagged something you wrote by hand. These tools are now bolted onto plagiarism checkers, CMSs, and hiring pipelines — and a lot of people treat their output as a verdict. It isn't. Here's what they actually measure, and why the score is closer to a weather forecast than a fact.

What they actually measure

No detector "reads" your text for meaning. They use surface statistics to estimate the probability that a passage was machine-generated. Three approaches dominate:

Perplexity: how surprised a language model is by each next token. Human writing tends to take odd turns; LLM output is sampled from the model's own predictions, so it consists mostly of high-probability tokens. Low, flat perplexity therefore reads as "probably AI."

Burstiness: how much sentence length and complexity vary. Humans mix a 4-word sentence with a 40-word one. Models trend toward an even rhythm. Low variance reads as "probably AI."

Trained classifiers: a model trained on many labeled human and machine samples that learns to output a probability. It's only as good as its training distribution; feed it a domain it hasn't seen and it guesses.
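
To make the first two signals concrete, here is a minimal sketch in Python. It is not how any particular detector is implemented: real tools score text with a full language model, while this toy uses an add-one-smoothed unigram model built from a reference string. The function names (tokenize, unigram_perplexity, burstiness) and the use of the coefficient of variation as the burstiness measure are assumptions made for illustration.

```python
import math
import re
from collections import Counter
from statistics import mean, pstdev


def tokenize(text):
    """Lowercase word tokens. A real detector would use an LLM's own tokenizer."""
    return re.findall(r"[a-z']+", text.lower())


def unigram_perplexity(text, reference):
    """Perplexity of `text` under an add-one-smoothed unigram model of `reference`."""
    ref_counts = Counter(tokenize(reference))
    vocab_size = len(ref_counts) + 1  # one extra slot for unseen words
    total = sum(ref_counts.values())
    tokens = tokenize(text)
    if not tokens:
        raise ValueError("text has no tokens")
    log_prob = sum(
        math.log((ref_counts[t] + 1) / (total + vocab_size)) for t in tokens
    )
    # Perplexity is exp of the average negative log-probability per token.
    return math.exp(-log_prob / len(tokens))


def burstiness(text):
    """Coefficient of variation of sentence lengths, counted in words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(tokenize(s)) for s in sentences]
    if len(lengths) < 2 or mean(lengths) == 0:
        return 0.0
    return pstdev(lengths) / mean(lengths)


# Toy inputs, invented for this example.
reference = "the cat sat on the mat the dog sat on the rug"
print(unigram_perplexity("the cat sat on the mat", reference))  # familiar words
print(unigram_perplexity("quantum zebras juggle", reference))   # unseen words
print(burstiness("One two three four. Five six seven eight."))  # even rhythm
print(burstiness("Stop. This sentence wanders on for quite a while."))
```

The familiar sentence scores a lower perplexity than the one made of unseen words, and the evenly paced text scores a lower burstiness than the uneven one. Neither number says anything about who wrote the text, which is the whole problem.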

A fourth idea, watermarking, biases the model's word choices in a pattern a matching detector can later spot. It's the most principled approach in theory, but it only works if the provider actually watermarks output and the watermark survives copying, paraphrasing, or light editing. In practice, at least one of those conditions usually fails.
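
As a rough illustration of the watermarking idea, the sketch below follows the general shape of "green list" schemes from the research literature: the previous token seeds a pseudo-random split of the vocabulary, the generator prefers the "green" half, and the detector counts green tokens and computes a z-score. Everything here is an assumption for demonstration purposes: the hash-based split, GREEN_FRACTION, the toy vocabulary, and the function names. No provider's actual scheme is being described.

```python
import hashlib
import math
import random

GREEN_FRACTION = 0.5  # assumed share of the vocabulary marked "green" at each step


def is_green(prev_token, token):
    """Pseudo-randomly assign `token` to the green list, seeded by the previous token."""
    digest = hashlib.sha256(f"{prev_token}\x00{token}".encode("utf-8")).digest()
    return digest[0] / 256 < GREEN_FRACTION


def generate_watermarked(vocab, length, rng):
    """Toy generator that always picks a green token when one exists."""
    tokens = [rng.choice(vocab)]
    while len(tokens) < length:
        greens = [w for w in vocab if is_green(tokens[-1], w)]
        tokens.append(rng.choice(greens or vocab))
    return tokens


def watermark_z_score(tokens):
    """z-score of the green-token count against the no-watermark expectation."""
    pairs = list(zip(tokens, tokens[1:]))
    if not pairs:
        raise ValueError("need at least two tokens")
    green = sum(is_green(prev, tok) for prev, tok in pairs)
    n = len(pairs)
    expected = GREEN_FRACTION * n
    std = math.sqrt(n * GREEN_FRACTION * (1 - GREEN_FRACTION))
    return (green - expected) / std


rng = random.Random(0)
vocab = [f"w{i}" for i in range(200)]  # hypothetical vocabulary
marked = generate_watermarked(vocab, 200, rng)
# Simulate editing by swapping every other token for a random one.
edited = [rng.choice(vocab) if i % 2 else t for i, t in enumerate(marked)]

print("watermarked z-score:", watermark_z_score(marked))
print("after editing:      ", watermark_z_score(edited))
```

In this toy setup the untouched text produces a large z-score, while replacing every other token pulls the score back toward zero. Real schemes bias choices more gently and can be more robust than this, but the sketch shows why editing erodes the signal.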

Why the score is unreliable

Because all of the above measure predictability, not authorship, the failure modes are systematic:

Clean human writing scores as AI. The exact style we teach people to write — short, clear, well-structured — is low-perplexity. The better your prose, the more "robotic" it looks to a detector.
Documented bias against non-native English. A widely cited 2023 Stanford study (Liang et al., published in Patterns) found detectors disproportionately flag text by non-native English speakers, whose simpler phrasing reads as low perplexity. That's a fairness problem, not a rounding error.
Trivially defeated. A few paraphrases, a synonym pass, or moderate editing collapses the signal. So the tool punishes honest plain writers while waving through anyone who lightly edits machine output.
Even OpenAI gave up. It quietly discontinued its own AI Text Classifier in July 2023, citing low accuracy. When the lab that ships the generator can't reliably detect that generator's output, a third-party box promising "99% accurate" should set off alarms.
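
A quick base-rate calculation shows why a headline accuracy figure says little about any single flag. The sketch below applies Bayes' rule; the prevalence and error rates are hypothetical numbers picked for illustration, not measurements of any real detector, and the function name is made up.

```python
def prob_ai_given_flag(prevalence, true_positive_rate, false_positive_rate):
    """Bayes' rule: probability a flagged text is machine-written.

    prevalence          -- assumed share of submissions that are machine-written
    true_positive_rate  -- share of machine text the detector flags
    false_positive_rate -- share of human text the detector wrongly flags
    """
    flagged_ai = prevalence * true_positive_rate
    flagged_human = (1 - prevalence) * false_positive_rate
    return flagged_ai / (flagged_ai + flagged_human)


# Hypothetical scenario: 5% of submissions are machine-written, the detector
# catches 99% of them, and it wrongly flags 10% of human-written text.
print(round(prob_ai_given_flag(0.05, 0.99, 0.10), 2))
```

With these made-up inputs, only about a third of flagged texts are actually machine-written. The rest are false accusations, even though the detector "catches 99%" of machine text.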

What to do instead

If you ship anything that acts on a detector score — moderation, grading, hiring — treat the number as a faint hint, never proof:

Never auto-act on a single score. A false positive that accuses a student or rejects a candidate causes real harm.
Look at process and context — drafts, edit history, the ability to explain the work — not a probability.
If you build with detectors, log and surface the uncertainty. Show the confidence band, not a binary "AI / human."
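
One way to put that last point into code is sketched below. It wraps a raw score in a report that carries an uncertainty band and refuses to collapse it into a binary verdict. The DetectorReport type, the summarize function, the labels, and the default margin of 0.25 are all hypothetical; in a real system the margin should come from error rates measured on your own validation data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetectorReport:
    score: float  # raw detector output in [0, 1]
    low: float    # lower edge of the uncertainty band
    high: float   # upper edge of the uncertainty band
    label: str    # hedged, human-readable summary
    auto_action_allowed: bool = False  # a score alone never triggers an action


def summarize(score, margin=0.25):
    """Turn a raw score into a banded report instead of an AI/human verdict."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    low = max(0.0, score - margin)
    high = min(1.0, score + margin)
    if low > 0.5:
        label = "leans machine-generated (weak signal)"
    elif high < 0.5:
        label = "leans human-written (weak signal)"
    else:
        # The band straddles 0.5, so the score cannot separate the two cases.
        label = "inconclusive"
    return DetectorReport(score, low, high, label)


for raw in (0.1, 0.6, 0.9):
    print(summarize(raw))
```

Even the strongest label here is worded as a lean, and auto_action_allowed stays False, so any consequence still has to go through a human who looks at drafts and context.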

The honest summary: AI detectors are probabilistic pattern-matchers with high false-positive rates and a built-in bias against plain and non-native writing. Useful as a weak prior; dangerous as a verdict.

I wrote a longer, fully-sourced breakdown — including how watermarking holds up and what the research actually says — here: How do AI detectors work?