AI text detectors are not magic. They are statistical models measuring how predictable your text is. If you have ever wondered what GPTZero, Originality.ai, or Turnitin are actually computing when they flag text as "AI-generated," this post breaks down the math and the models.
The Core Intuition
Language models generate text by repeatedly predicting the next token. At each step, the model assigns a probability distribution over its entire vocabulary, then samples from it. The result is text where nearly every word is a high-probability choice given the preceding context.
Human writers do not work this way. We make unexpected word choices, write sentence fragments, insert tangents, and vary our rhythm. Our text is statistically messier.
AI detectors exploit this difference using two primary signals: perplexity and burstiness.
Perplexity: Measuring Surprise
Perplexity quantifies how "surprised" a language model is by a sequence of tokens. Formally, for a sequence of N tokens, it is exp(-(1/N) * Σ log P(token_i | token_1..token_i-1)), the exponentiated average negative log-probability:
```python
import math

def perplexity(token_log_probs):
    """
    token_log_probs: list of log P(token_i | token_1..token_i-1)
    from a reference language model (e.g., GPT-2, RoBERTa)
    """
    N = len(token_log_probs)
    avg_neg_log_prob = -sum(token_log_probs) / N
    return math.exp(avg_neg_log_prob)
```
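A quick sanity check with hand-picked log-probabilities: if the reference model assigned every token probability 0.1, the perplexity is exactly 10 (the inverse of the per-token probability), regardless of sequence length:

```python
import math

# If every token had probability 0.1, perplexity is 1/0.1 = 10
# no matter how long the sequence is.
log_probs = [math.log(0.1)] * 20
ppl = math.exp(-sum(log_probs) / len(log_probs))
print(round(ppl, 6))  # 10.0
```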
A low perplexity score means the model easily predicted every token. A high score means the text contained surprises.
In practice, you run the suspect text through a reference model (often GPT-2 or a similar openly available LM), compute the log-probability of each token conditioned on its prefix, and aggregate. Typical ranges:
| Text type | Perplexity range |
|---|---|
| Raw GPT-4 output | 5-15 |
| Human blog post | 30-80 |
| Creative fiction | 60-150+ |
| Non-native English | 15-40 |
The overlap between "non-native English" and "AI output" is immediately visible, and it foreshadows the false positive problem.
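To make the scoring pipeline concrete, here is a sketch of the token-scoring step, assuming the Hugging Face transformers library and the small open GPT-2 checkpoint. The function name is mine, not any detector's API:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

def token_log_probs(text, model_name="gpt2"):
    """Log-probability of each token under a reference LM, conditioned on its prefix."""
    tok = GPT2TokenizerFast.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(model_name)
    model.eval()
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits
    # Logits at position i predict token i+1, so shift to align them.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    # Note: the first token gets no score (nothing conditions it).
    return log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)[0].tolist()
```

Feeding the returned list into a perplexity function like the one above yields the document-level score.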
Burstiness: Measuring Rhythm Variation
Perplexity alone is not enough. Burstiness measures how much predictability varies across a text. A simple version: the standard deviation of per-sentence perplexity scores, normalized by their mean.
```python
import statistics

def burstiness(sentence_perplexities):
    """
    sentence_perplexities: list of perplexity scores,
    one per sentence in the document
    """
    if len(sentence_perplexities) < 2:
        return 0.0
    mean_ppl = statistics.mean(sentence_perplexities)
    std_ppl = statistics.stdev(sentence_perplexities)
    # Normalized burstiness coefficient (coefficient of variation)
    return std_ppl / mean_ppl
```
Human writing is bursty. You might write a straightforward factual sentence (low perplexity), followed by a creative metaphor (high perplexity), followed by a one-word interjection (wildcard). The per-sentence perplexity jumps around.
AI text has low burstiness. The model maintains a consistent "temperature" of word choice throughout. Every sentence sits in roughly the same predictability band.
Two-dimensional classification: Low perplexity + low burstiness = strong AI signal. High perplexity + high burstiness = strong human signal. Mixed signals land in the gray zone where detectors are unreliable.
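A toy version of that two-dimensional rule, with made-up thresholds (real detectors learn these boundaries from data rather than hard-coding them):

```python
import statistics

# Illustrative thresholds, invented for this sketch.
PPL_LOW, BURST_LOW = 25.0, 0.2

def classify(sentence_ppls):
    """Toy two-dimensional rule over per-sentence perplexities."""
    mean_ppl = statistics.mean(sentence_ppls)
    burst = (statistics.stdev(sentence_ppls) / mean_ppl
             if len(sentence_ppls) > 1 else 0.0)
    if mean_ppl < PPL_LOW and burst < BURST_LOW:
        return "likely AI"
    if mean_ppl >= PPL_LOW and burst >= BURST_LOW:
        return "likely human"
    return "gray zone"

print(classify([10, 11, 9, 12]))    # uniform and low -> "likely AI"
print(classify([40, 120, 35, 80]))  # varied and high -> "likely human"
print(classify([12, 13, 12, 50]))   # mixed signals -> "gray zone"
```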
Classifier Models: Learning the Difference
Statistical thresholds on perplexity and burstiness only get you so far. Modern commercial detectors (GPTZero, Originality.ai, Turnitin, Copyleaks) use trained classifier models, typically fine-tuned transformers.
The architecture usually looks like this:
- Base model: A pre-trained transformer, commonly RoBERTa-base (125M params) or DeBERTa-v3 (300M+ params). These models already encode deep understanding of language patterns.
- Classification head: A linear layer (or small MLP) on top of the `[CLS]` token representation that outputs a probability: P(AI-generated | text).
- Training data: Millions of paired samples. Human text from diverse sources (academic papers, Reddit posts, news articles, fiction). AI text generated by GPT-3.5, GPT-4, Claude, Llama, Gemini, and others across varied prompts and temperatures.
- Fine-tuning: Standard cross-entropy loss. The model learns subtle distributional features beyond perplexity and burstiness, including things like:
  - Ratio of content words to function words
  - Distribution of rare vs. common vocabulary
  - Paragraph-level structural patterns
  - Positional patterns (AI intros and conclusions follow recognizable templates)
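A couple of these surface features are cheap to compute directly. A minimal sketch, with a tiny stand-in stoplist (a real system would use a full function-word inventory and proper tokenization):

```python
import re

# Hypothetical mini stoplist standing in for a full function-word inventory.
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "and", "or", "is", "that"}

def surface_features(text):
    words = re.findall(r"[a-z']+", text.lower())
    function = sum(w in FUNCTION_WORDS for w in words)
    return {
        "function_word_ratio": function / len(words),
        "type_token_ratio": len(set(words)) / len(words),  # vocab diversity
    }

feats = surface_features("The model is a model of the data.")
print(feats)  # {'function_word_ratio': 0.625, 'type_token_ratio': 0.75}
```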
```python
# Simplified classifier architecture (PyTorch + Hugging Face transformers)
import torch
from torch import nn
from transformers import RobertaModel

class AIDetector(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = RobertaModel.from_pretrained("roberta-base")
        self.classifier = nn.Linear(768, 1)  # hidden_size -> binary logit

    def forward(self, input_ids, attention_mask=None):
        # Encode the token ids; take the [CLS] position as the document vector
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        cls_repr = hidden[:, 0]            # [CLS] token representation
        logit = self.classifier(cls_repr)
        return torch.sigmoid(logit)        # P(AI-generated)
```
The advantage of classifiers over raw perplexity scoring: they capture patterns that are hard to express as a single metric. The disadvantage: they inherit every bias in their training data.
Statistical Watermarking
Some AI providers embed invisible statistical watermarks during generation. The approach works by partitioning the vocabulary into "green" and "red" lists at each token position (using a hash of the preceding token as a seed), then biasing generation toward green-list tokens.
A detector checks whether the proportion of green-list tokens is statistically improbable under random chance. If so, the text was likely generated by that specific model.
Watermarking is the most reliable detection method when present, but it only works for models that implement it, breaks under paraphrasing or editing, and requires provider cooperation.
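A sketch of the detection side of green-list watermarking. A toy hash of each (previous token, token) pair stands in for the scheme's keyed pseudorandom partition, and the green fraction is illustrative:

```python
import hashlib
import math

GREEN_FRACTION = 0.5  # fraction of the vocabulary marked "green" at each step

def is_green(prev_token, token):
    # Toy stand-in for the pseudorandom partition: hash the preceding token
    # together with the candidate, and mark it green if the hash falls in
    # the green fraction of the unit interval.
    h = hashlib.sha256(f"{prev_token}:{token}".encode()).digest()
    return int.from_bytes(h[:8], "big") / 2**64 < GREEN_FRACTION

def watermark_z_score(tokens):
    """z-score of the observed green count against the binomial null."""
    n = len(tokens) - 1
    greens = sum(is_green(a, b) for a, b in zip(tokens, tokens[1:]))
    expected = GREEN_FRACTION * n
    std = math.sqrt(n * GREEN_FRACTION * (1 - GREEN_FRACTION))
    return (greens - expected) / std
```

A z-score above roughly 4 is overwhelming evidence of the watermark; unwatermarked text hovers near 0.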
Where Detection Breaks Down
Every detection method has systematic failure modes:
- Short text (under 250 words): Not enough tokens to establish reliable statistical patterns. Detectors on short text are essentially guessing.
- Edited AI text: Even moderate human editing disrupts the statistical fingerprint. Change 15-20% of the words and most detectors lose confidence.
- Domain-specific writing: Technical documentation, legal writing, and medical text naturally use predictable vocabulary and structure. Detectors conflate "domain-constrained" with "AI-generated."
- Non-native English: Simpler vocabulary and more regular grammar produce lower perplexity, overlapping with AI output distributions. Studies have found false positive rates above 60% for non-native writing.
- Temperature and sampling: AI text generated with high temperature or nucleus sampling can have perplexity profiles that look human.
The Confidence Score Trap
When a detector reports "94% likely AI-generated," most people read that as "94% chance this is AI." That is not what it means. The score is the model's internal confidence, not the posterior probability of AI authorship given the base rate of AI text in the population being tested.
This matters enormously. We will cover the math behind this (Bayes' theorem and the base rate fallacy) in the next article in this series.
Practical Takeaways for Developers
- Do not trust a single score. Cross-reference multiple detectors. If they disagree, the text is in the gray zone.
- Understand the input constraints. Anything under 250 words is unreliable. Longer is better.
- Know what is being measured. Perplexity and burstiness are proxies, not ground truth. They measure statistical properties that correlate with AI authorship but do not define it.
- Build defensively. If you are building tools that incorporate AI detection, expose confidence intervals, not point estimates. Communicate uncertainty honestly.
If you want to test your own text, Metric37's free AI detector scores any text and breaks down the result. For programmatic access, the Metric37 API provides detection scores alongside humanization in a single endpoint.
This is Part 1 of a technical series on AI detection. Part 2 covers why false positives happen, with the probability math to prove it.