I built an AI text detector, so I've spent more time than I'd like staring at the mechanics of how these tools work and where they break down. The marketing promises confidence scores and definitive answers. The reality is messier.
Perplexity: The Core Signal
Perplexity measures how "surprised" a language model is by a piece of text. Given a sequence of words, how predictable is the next word at each step?
When a language model generates text, it picks tokens that have high probability given the preceding context. That's literally what it's optimized to do. The result is text with low perplexity: each word follows naturally and predictably from the ones before it.
Human writing tends to have higher perplexity. We make unexpected word choices. We use idioms that don't follow statistical patterns. We start sentences in ways that a probability distribution wouldn't favor. We throw in a technical term, then follow it with slang in the same paragraph.
A perplexity score is computed by running the text through a reference language model and calculating the geometric mean of the inverse probabilities at each token position. Lower scores suggest machine generation. Higher scores suggest human authorship.
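That computation is simpler than it sounds. Here's a minimal sketch, assuming you already have the per-token probabilities a reference model assigned to the text (the hard part in practice is running the model; the score itself is just an exponentiated average log-probability):

```python
import math

def perplexity(token_probs):
    """Perplexity as the geometric mean of inverse token probabilities.

    token_probs: the probability a reference model assigned to each
    observed token, in order. Equivalent to exp(mean negative log-prob).
    """
    n = len(token_probs)
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_sum / n)

# Predictable text: the model saw every token coming.
print(perplexity([0.9, 0.8, 0.85, 0.9]))   # low, close to 1
# Surprising text: a couple of low-probability word choices.
print(perplexity([0.9, 0.05, 0.6, 0.02]))  # much higher
```

The probability lists here are made up for illustration; a real detector gets them from a model like GPT-2 by scoring the document token by token.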
The problem is that the threshold between "human" and "AI" perplexity isn't a clean line. Academic writing by humans tends to have lower perplexity because it follows conventions strictly. A well-edited essay can look "AI-generated" to a perplexity-based detector simply because the author writes clearly and follows standard structure.
Burstiness: The Rhythm Signal
Burstiness captures variation in sentence structure. Humans are inconsistent writers. We'll write a 6-word sentence, then follow it with a 40-word run-on, then drop in a fragment. Our paragraph lengths vary. We repeat ourselves sometimes and other times jump topics without transition.
Language models produce more uniform output. Sentence lengths cluster around a narrower range. Paragraph structure is more regular. The rhythm is steady in a way that human writing rarely is.
Measuring burstiness is straightforward: calculate the standard deviation of sentence lengths across the document. Higher variance suggests human writing. Lower variance suggests generation.
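In code, that's a few lines. This sketch uses a naive regex to split sentences, which is good enough to show the idea (a real detector would use a proper sentence tokenizer):

```python
import re
import statistics

def burstiness(text):
    """Standard deviation of sentence lengths, measured in words.

    Sentence splitting here is a crude regex on terminal punctuation;
    it's a simplification, not production-grade segmentation.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    return statistics.stdev(lengths)

uniform = "The model writes this. The model writes that. The model writes more."
varied = ("Short. Then a much longer sentence that wanders through several "
          "clauses before it finally stops. Fragment again.")
print(burstiness(uniform) < burstiness(varied))  # True
```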
This is a surprisingly effective signal on unedited AI output. GPT-4's default output has noticeably lower sentence length variance than most human writing. But it's easy to defeat. Any post-processing that varies sentence structure (combining sentences, splitting long ones, adding fragments) disrupts the signal.
Token Prediction Probability
This is a more granular version of perplexity. Instead of a single score for the whole text, you look at the probability distribution at each token position. AI-generated text tends to have more tokens in the high-probability region. The model picks the "safe" option more often than humans do.
You can visualize this by plotting per-token probability across a document. Human text shows spikes of low-probability choices scattered throughout. AI text shows a flatter, higher-probability profile.
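A crude way to summarize that profile is the fraction of tokens that land in the high-probability region. The threshold below is an illustrative choice, not a standard value, and the probability lists are hypothetical, not real model output:

```python
def high_prob_fraction(token_probs, threshold=0.5):
    """Fraction of tokens the reference model rated as 'safe' picks.

    AI text tends to concentrate mass here; human text shows more
    low-probability spikes scattered throughout.
    """
    hits = sum(1 for p in token_probs if p >= threshold)
    return hits / len(token_probs)

# Hypothetical per-token profiles (not real model output):
ai_like = [0.8, 0.7, 0.9, 0.6, 0.75, 0.85]
human_like = [0.8, 0.1, 0.6, 0.05, 0.9, 0.2]
print(high_prob_fraction(ai_like))     # 1.0
print(high_prob_fraction(human_like))  # 0.5
```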
Why Detection Breaks With Editing
All of these signals measure statistical properties of the raw output. The moment a human edits AI-generated text, those properties shift.
Changing a few words per paragraph increases perplexity. Varying sentence lengths increases burstiness. Replacing high-probability words with synonyms disrupts the token probability profile.
In my testing, a single editing pass where someone rewrites roughly 20% of the sentences is enough to drop most detectors' confidence scores below their flagging threshold. Two passes make detection functionally unreliable.
This isn't a bug in the detectors. It's a fundamental limitation of the approach. If a human has meaningfully edited the text, the statistical signature of generation is diluted. At some point the text genuinely is a human-AI collaboration, and drawing a binary line becomes philosophically questionable, not just technically difficult.
False Positive Rates
This is the part that concerns me most. Published research on AI detectors shows false positive rates between 1% and 15% depending on the tool, the text domain, and the model used for generation.
Non-native English speakers get flagged at higher rates. Their writing tends to follow learned patterns more strictly, use simpler vocabulary, and produce more predictable sentence structures. This is essentially the same statistical profile that detectors associate with AI generation.
Technical writing, academic papers, and legal documents also trigger false positives more frequently than casual writing. Formal register naturally has lower perplexity.
A 5% false positive rate sounds small until you apply it to millions of submissions. At scale, you're wrongly accusing thousands of people.
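The arithmetic is worth doing explicitly. All the numbers here are hypothetical, chosen only to show the shape of the problem:

```python
# Back-of-envelope: a 5% false positive rate applied at scale.
submissions = 2_000_000    # hypothetical submission volume
human_share = 0.9          # assume most submissions are human-written
false_positive_rate = 0.05

wrongly_flagged = int(submissions * human_share * false_positive_rate)
print(wrongly_flagged)  # 90000 humans incorrectly accused
```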
The Watermarking Approach
Watermarking takes a different angle. Instead of detecting generation after the fact, you embed a signal during generation. The model subtly biases its token selection to follow a pattern that's invisible to humans but detectable by someone who knows the key.
The approach works by partitioning the vocabulary into "green" and "red" tokens at each position using a pseudorandom function seeded by the preceding tokens. The model is nudged to prefer green tokens. A detector checks whether the text has a statistically significant bias toward green tokens.
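Here's a toy sketch of the detection side. The hash-based partition below is a stand-in for the seeded pseudorandom function in real schemes, and the key and token handling are simplified assumptions, not any provider's actual implementation:

```python
import hashlib
import math

def is_green(prev_token, token, key="secret"):
    """Toy green/red partition: hash the key plus the preceding token
    and the candidate token, landing in 'green' with probability ~0.5."""
    h = hashlib.sha256(f"{key}|{prev_token}|{token}".encode()).digest()
    return h[0] % 2 == 0

def green_z_score(tokens, key="secret"):
    """z-score of the observed green-token count against the ~50%
    expected without a watermark. Large positive values suggest the
    text was generated with the green-token bias applied."""
    greens = sum(
        1 for prev, tok in zip(tokens, tokens[1:]) if is_green(prev, tok, key)
    )
    n = len(tokens) - 1
    expected, sd = 0.5 * n, math.sqrt(0.25 * n)
    return (greens - expected) / sd
```

A watermarked generator would consult the same `is_green` partition while sampling and nudge probability mass toward green tokens; the detector only needs the key, not the model.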
Watermarking is more robust because the signal is distributed across the entire text. But it requires the model provider to implement it, and it's defeated by paraphrasing through a different, non-watermarked model.
Where This Leaves Us
I built the detector at zovo.one because people want to check text, and seeing the analysis is genuinely useful, especially the perplexity and burstiness breakdowns that show you what the signals actually look like. But I try to be honest about what the tool can and can't do.
It can flag text that has strong statistical signatures of generation. It can provide a probability estimate. It cannot give you certainty. No detector can, and anyone selling certainty is overselling their product.
If you're making consequential decisions based on AI detection output, you should understand these limitations before you trust the score.
I'm Michael Lip. I build free developer tools at zovo.one. 350+ tools, all private, all free.