AI text detectors are not magic. They are statistical models measuring how predictable your text is. If you have ever wondered what GPTZero, Originality.ai, or Turnitin are actually computing when they flag text as "AI-generated," this post breaks down the math and the models.
The Core Intuition
Language models generate text by repeatedly predicting the next token. At each step, the model assigns a probability distribution over its entire vocabulary, then samples from it. The result is text where nearly every word is a high-probability choice given the preceding context.
Human writers do not work this way. We make unexpected word choices, write sentence fragments, insert tangents, and vary our rhythm. Our text is statistically messier.
AI detectors exploit this difference using two primary signals: perplexity and burstiness.
Perplexity: Measuring Surprise
Perplexity quantifies how "surprised" a language model is by a sequence of tokens. Formally, for a sequence of N tokens, it is exp(-(1/N) * Σ log P(token_i | token_1..token_i-1)), the exponentiated average negative log-probability:
```python
import math

def perplexity(token_log_probs):
    """
    token_log_probs: list of log P(token_i | token_1..token_i-1)
    from a reference language model (e.g., GPT-2, RoBERTa)
    """
    N = len(token_log_probs)
    avg_neg_log_prob = -sum(token_log_probs) / N
    return math.exp(avg_neg_log_prob)
```
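A quick sanity check with hand-picked log-probabilities: if the reference model assigned every token probability 0.1, the perplexity is exactly 10 (the inverse of the per-token probability), regardless of sequence length:

```python
import math

# If every token had probability 0.1, perplexity is 1/0.1 = 10
# no matter how long the sequence is.
log_probs = [math.log(0.1)] * 20
ppl = math.exp(-sum(log_probs) / len(log_probs))
print(round(ppl, 6))  # 10.0
```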
A low perplexity score means the model easily predicted every token. A high score means the text contained surprises.
In practice, you run the suspect text through a reference model (often GPT-2 or a similar openly available LM), compute the log-probability of each token conditioned on its prefix, and aggregate. Typical ranges:
| Text type | Perplexity range |
|---|---|
| Raw GPT-4 output | 5-15 |
| Human blog post | 30-80 |
| Creative fiction | 60-150+ |
| Non-native English | 15-40 |
The overlap between "non-native English" and "AI output" is immediately visible, and it foreshadows the false positive problem.
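To make the scoring pipeline concrete, here is a sketch of the token-scoring step, assuming the Hugging Face transformers library and the small open GPT-2 checkpoint. The function name is mine, not any detector's API:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

def token_log_probs(text, model_name="gpt2"):
    """Log-probability of each token under a reference LM, conditioned on its prefix."""
    tok = GPT2TokenizerFast.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(model_name)
    model.eval()
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Logits at position i predict token i+1, so shift to align them.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    # Note: the first token gets no score (nothing conditions it).
    return log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)[0].tolist()
```

Feeding the returned list into a perplexity function like the one above yields the document-level score.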
Burstiness: Measuring Rhythm Variation
Perplexity alone is not enough. Burstiness measures how much predictability varies across a text. A simple version: the standard deviation of per-sentence perplexity scores, normalized by their mean.
```python
import statistics

def burstiness(sentence_perplexities):
    """
    sentence_perplexities: list of perplexity scores,
    one per sentence in the document
    """
    if len(sentence_perplexities) < 2:
        return 0.0
    mean_ppl = statistics.mean(sentence_perplexities)
    std_ppl = statistics.stdev(sentence_perplexities)
    # Normalized burstiness coefficient (coefficient of variation)
    return std_ppl / mean_ppl
```
Human writing is bursty. You might write a straightforward factual sentence (low perplexity), followed by a creative metaphor (high perplexity), followed by a one-word interjection (wildcard). The per-sentence perplexity jumps around.
AI text has low burstiness. The model maintains a consistent "temperature" of word choice throughout. Every sentence sits in roughly the same predictability band.
Two-dimensional classification: Low perplexity + low burstiness = strong AI signal. High perplexity + high burstiness = strong human signal. Mixed signals land in the gray zone where detectors are unreliable.
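A toy version of that two-dimensional rule, with made-up thresholds (real detectors learn these boundaries from data rather than hard-coding them):

```python
import statistics

# Illustrative thresholds, invented for this sketch.
PPL_LOW, BURST_LOW = 25.0, 0.2

def classify(sentence_ppls):
    """Toy two-dimensional rule over per-sentence perplexities."""
    mean_ppl = statistics.mean(sentence_ppls)
    burst = (statistics.stdev(sentence_ppls) / mean_ppl
             if len(sentence_ppls) > 1 else 0.0)
    if mean_ppl < PPL_LOW and burst < BURST_LOW:
        return "likely AI"
    if mean_ppl >= PPL_LOW and burst >= BURST_LOW:
        return "likely human"
    return "gray zone"

print(classify([10, 11, 9, 12]))    # uniform and low -> "likely AI"
print(classify([40, 120, 35, 80]))  # varied and high -> "likely human"
print(classify([12, 13, 12, 50]))   # mixed signals -> "gray zone"
```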
Classifier Models: Learning the Difference
Statistical thresholds on perplexity and burstiness only get you so far. Modern commercial detectors (GPTZero, Originality.ai, Turnitin, Copyleaks) use trained classifier models, typically fine-tuned transformers.
The architecture usually looks like this:
- Base model: A pre-trained transformer, commonly RoBERTa-base (125M params) or DeBERTa-v3 (300M+ params). These models already encode deep understanding of language patterns.
- Classification head: A linear layer (or small MLP) on top of the `[CLS]` token representation that outputs a probability: P(AI-generated | text).
- Training data: Millions of paired samples. Human text from diverse sources (academic papers, Reddit posts, news articles, fiction). AI text generated by GPT-3.5, GPT-4, Claude, Llama, Gemini, and others across varied prompts and temperatures.
- Fine-tuning: Standard cross-entropy loss. The model learns subtle distributional features beyond perplexity and burstiness, including things like:
  - Ratio of content words to function words
  - Distribution of rare vs. common vocabulary
  - Paragraph-level structural patterns
  - Positional patterns (AI intros and conclusions follow recognizable templates)
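A couple of these surface features are cheap to compute directly. A minimal sketch, with a tiny stand-in stoplist (a real system would use a full function-word inventory and proper tokenization):

```python
import re

# Hypothetical mini stoplist standing in for a full function-word inventory.
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "and", "or", "is", "that"}

def surface_features(text):
    words = re.findall(r"[a-z']+", text.lower())
    function = sum(w in FUNCTION_WORDS for w in words)
    return {
        "function_word_ratio": function / len(words),
        "type_token_ratio": len(set(words)) / len(words),  # vocab diversity
    }

feats = surface_features("The model is a model of the data.")
print(feats)  # {'function_word_ratio': 0.625, 'type_token_ratio': 0.75}
```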
```python
# Simplified classifier architecture (PyTorch + Hugging Face transformers)
import torch
from torch import nn
from transformers import RobertaModel

class AIDetector(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained("roberta-base")
        self.classifier = nn.Linear(768, 1)  # hidden_size -> binary logit

    def forward(self, input_ids, attention_mask=None):
        # Encode the token ids; take the [CLS] position as the document vector
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        cls_repr = hidden[:, 0]            # [CLS] token representation
        logit = self.classifier(cls_repr)
        return torch.sigmoid(logit)        # P(AI-generated)
```
The advantage of classifiers over raw perplexity scoring: they capture patterns that are hard to express as a single metric. The disadvantage: they inherit every bias in their training data.
Statistical Watermarking
Some AI providers embed invisible statistical watermarks during generation. The approach works by partitioning the vocabulary into "green" and "red" lists at each token position (using a hash of the preceding token as a seed), then biasing generation toward green-list tokens.
A detector checks whether the proportion of green-list tokens is statistically improbable under random chance. If so, the text was likely generated by that specific model.
Watermarking is the most reliable detection method when present, but it only works for models that implement it, breaks under paraphrasing or editing, and requires provider cooperation.
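A sketch of the detection side of green-list watermarking. A toy hash of each (previous token, token) pair stands in for the scheme's keyed pseudorandom partition, and the green fraction is illustrative:

```python
import hashlib
import math

GREEN_FRACTION = 0.5  # fraction of the vocabulary marked "green" at each step

def is_green(prev_token, token):
    # Toy stand-in for the pseudorandom partition: hash the preceding token
    # together with the candidate, and mark it green if the hash falls in
    # the green fraction of the unit interval.
    h = hashlib.sha256(f"{prev_token}:{token}".encode()).digest()
    return int.from_bytes(h[:8], "big") / 2**64 < GREEN_FRACTION

def watermark_z_score(tokens):
    """z-score of the observed green count against the binomial null."""
    n = len(tokens) - 1
    greens = sum(is_green(a, b) for a, b in zip(tokens, tokens[1:]))
    expected = GREEN_FRACTION * n
    std = math.sqrt(n * GREEN_FRACTION * (1 - GREEN_FRACTION))
    return (greens - expected) / std
```

A z-score above roughly 4 is overwhelming evidence of the watermark; unwatermarked text hovers near 0.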
Where Detection Breaks Down
Every detection method has systematic failure modes:
- Short text (under 250 words): Not enough tokens to establish reliable statistical patterns. Detectors on short text are essentially guessing.
- Edited AI text: Even moderate human editing disrupts the statistical fingerprint. Change 15-20% of the words and most detectors lose confidence.
- Domain-specific writing: Technical documentation, legal writing, and medical text naturally use predictable vocabulary and structure. Detectors conflate "domain-constrained" with "AI-generated."
- Non-native English: Simpler vocabulary and more regular grammar produce lower perplexity, overlapping with AI output distributions. Studies have found false positive rates above 60% for non-native writing.
- Temperature and sampling: AI text generated with high temperature or nucleus sampling can have perplexity profiles that look human.
The Confidence Score Trap
When a detector reports "94% likely AI-generated," most people read that as "94% chance this is AI." That is not what it means. The score is the model's internal confidence, not the posterior probability of AI authorship given the base rate of AI text in the population being tested.
This matters enormously. We will cover the math behind this (Bayes' theorem and the base rate fallacy) in the next article in this series.
Practical Takeaways for Developers
- Do not trust a single score. Cross-reference multiple detectors. If they disagree, the text is in the gray zone.
- Understand the input constraints. Anything under 250 words is unreliable. Longer is better.
- Know what is being measured. Perplexity and burstiness are proxies, not ground truth. They measure statistical properties that correlate with AI authorship but do not define it.
- Build defensively. If you are building tools that incorporate AI detection, expose confidence intervals, not point estimates. Communicate uncertainty honestly.
If you want to test your own text, Metric37's free AI detector scores any text and breaks down the result. For programmatic access, the Metric37 API provides detection scores alongside humanization in a single endpoint.
This is Part 1 of a technical series on AI detection. Part 2 covers why false positives happen, with the probability math to prove it.