So a college instructor made the news recently for dragging typewriters into the classroom to fight AI-written essays. Bold move. But if you're building a platform that accepts user-generated content — whether that's a hiring pipeline, a course management system, or a community forum — you can't exactly mail everyone a typewriter.
You need a programmatic solution. And let me tell you, after spending the last few months wrestling with this exact problem on a client project, it's trickier than it sounds.
Why This Problem Is Harder Than You Think
The core challenge is that AI-generated text is, by design, meant to look human. There's no hidden watermark in ChatGPT output (unless the provider explicitly adds one). You're essentially trying to distinguish between two distributions of text that are converging over time.
Most detection approaches rely on one key insight: LLMs are statistically predictable. They tend to choose high-probability tokens. Human writing is messier — we use unusual word choices, make stylistic detours, and occasionally write sentences that no language model would predict.
This manifests in a few measurable ways:
- Perplexity: How "surprised" a language model is by the text. AI text tends to have low perplexity because it's generating exactly what a model would predict.
- Burstiness: Humans vary their sentence length and complexity wildly. AI tends to be more uniform.
- Token probability distribution: AI-generated text clusters around high-probability tokens. Human text has a fatter tail of unlikely choices.
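The token-distribution signal can be sketched without any model at all. In this toy example the per-token probabilities are synthetic stand-ins for what a scoring model (like the GPT-2 setup in Step 1) would assign — the numbers are illustrative, not real model output:

```python
# Sketch of the "fat tail" signal using synthetic per-token probabilities.
# In practice these would come from a scoring model's token logprobs;
# the numbers below are made up for illustration.

def high_prob_fraction(token_probs: list[float], threshold: float = 0.5) -> float:
    """Fraction of tokens the scoring model considered highly predictable."""
    return sum(p > threshold for p in token_probs) / len(token_probs)

# AI-like text: the model finds nearly every token predictable
ai_like = [0.91, 0.84, 0.77, 0.62, 0.88, 0.95, 0.71, 0.83]
# Human-like text: a fatter tail of surprising, low-probability choices
human_like = [0.91, 0.12, 0.77, 0.05, 0.88, 0.33, 0.02, 0.83]

print(high_prob_fraction(ai_like))     # most tokens are high-probability
print(high_prob_fraction(human_like))  # fatter tail of unlikely choices
```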
Building a Detection Pipeline
Here's the approach I've landed on after testing several strategies. It's not bulletproof — nothing is — but it catches a solid majority of unedited AI output.
Step 1: Perplexity Scoring with a Local Model
You can compute perplexity using any causal language model. I've had good results with GPT-2 variants since they're small enough to run on modest hardware and different enough from modern generators to give useful signal.
```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
import numpy as np

def compute_perplexity(text: str, model_name: str = "gpt2-medium") -> float:
    tokenizer = GPT2TokenizerFast.from_pretrained(model_name)
    model = GPT2LMHeadModel.from_pretrained(model_name)
    model.eval()

    encodings = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    input_ids = encodings.input_ids

    with torch.no_grad():
        outputs = model(input_ids, labels=input_ids)
        # negative log-likelihood loss averaged over tokens
        loss = outputs.loss.item()

    return np.exp(loss)  # perplexity = e^(avg NLL)
```
Low perplexity (roughly under 30-40 for GPT-2 medium) is a flag, not a conviction. Academic papers and formulaic writing also score low. That's why you need multiple signals.
Step 2: Burstiness Analysis
This one's simpler and surprisingly effective. Measure the variance in sentence length and vocabulary complexity across the text.
```python
import re
from statistics import stdev, mean

def compute_burstiness(text: str) -> dict:
    sentences = re.split(r'[.!?]+', text)
    sentences = [s.strip() for s in sentences if s.strip()]
    if len(sentences) < 3:
        return {"score": None, "reason": "too few sentences"}

    lengths = [len(s.split()) for s in sentences]
    avg_len = mean(lengths)
    std_len = stdev(lengths)

    # coefficient of variation — normalizes variance by mean
    burstiness = std_len / avg_len if avg_len > 0 else 0

    return {
        "burstiness": round(burstiness, 3),
        "avg_sentence_length": round(avg_len, 1),
        "sentence_count": len(sentences),
        # AI text typically scores below 0.5 here
        "flag": burstiness < 0.45,
    }
```
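A quick sanity check of the coefficient-of-variation idea, computed inline on two synthetic sentence-length lists (it re-implements the same math as `compute_burstiness` so the snippet runs standalone — the lengths are invented, not measured from real text):

```python
# Coefficient of variation (stdev / mean) on two synthetic texts:
# one with uniform sentence lengths, one with human-style "bursts".
from statistics import mean, stdev

def sentence_cv(lengths: list[int]) -> float:
    """Coefficient of variation of sentence lengths."""
    return stdev(lengths) / mean(lengths)

uniform = [14, 15, 13, 14, 16, 15]   # AI-ish: every sentence about the same length
bursty  = [3, 28, 9, 41, 2, 17]      # human-ish: fragments next to sprawl

print(round(sentence_cv(uniform), 3))  # well below the 0.45 flag threshold
print(round(sentence_cv(bursty), 3))   # well above it
```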
Human writing usually has a burstiness coefficient above 0.5. We write one-word fragments. Then we write sprawling compound sentences that go on way too long because we're trying to cram in one more thought before the period. AI doesn't do that as much.
Step 3: Combine Signals with a Simple Scoring System
Don't rely on any single metric. I use a weighted scoring approach that combines multiple signals into a confidence score.
```python
def assess_text(text: str) -> dict:
    perplexity = compute_perplexity(text)
    burstiness = compute_burstiness(text)

    score = 0.0
    reasons = []

    # perplexity signal (weight: 40%)
    if perplexity < 25:
        score += 0.4
        reasons.append(f"very low perplexity ({perplexity:.1f})")
    elif perplexity < 40:
        score += 0.2
        reasons.append(f"low perplexity ({perplexity:.1f})")

    # burstiness signal (weight: 30%)
    if burstiness.get("flag"):
        score += 0.3
        reasons.append(f"uniform sentence structure ({burstiness['burstiness']})")

    # vocabulary diversity signal (weight: 15%, plus a small
    # penalty for very repetitive text)
    words = text.lower().split()
    unique_ratio = len(set(words)) / len(words) if words else 0
    if unique_ratio > 0.75:  # AI tends toward high vocab diversity
        score += 0.15
        reasons.append(f"high vocabulary diversity ({unique_ratio:.2f})")
    elif unique_ratio < 0.55:  # very repetitive — likely human or bad AI
        score -= 0.1

    return {
        "ai_probability": round(max(0.0, min(score, 1.0)), 2),
        "flags": reasons,
        "recommendation": "review" if score > 0.5 else "likely human",
    }
```
The Thresholds Problem
Here's where I have to be honest with you: the thresholds above are rough guidelines. They'll drift based on your domain, the writing style of your users, and which AI models are popular this month.
The right approach is to calibrate against your own data:
- Collect a labeled dataset of known-human and known-AI submissions from your platform
- Run both sets through your pipeline
- Plot the distributions and pick thresholds that give you acceptable false-positive and false-negative rates
- Re-calibrate quarterly — models get better, and so does the text they produce
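Here's a minimal sketch of that calibration loop. The score lists are synthetic stand-ins for pipeline output on your own labeled dataset, and the target false-positive rate is an arbitrary example — both are assumptions you'd replace with real numbers:

```python
# Threshold calibration sketch: scan candidates and pick the first one
# meeting a target false-positive rate. Scores here are synthetic.

def rates_at(threshold: float, human_scores: list[float], ai_scores: list[float]):
    """False-positive rate (humans flagged) and false-negative rate (AI missed)."""
    fpr = sum(s > threshold for s in human_scores) / len(human_scores)
    fnr = sum(s <= threshold for s in ai_scores) / len(ai_scores)
    return fpr, fnr

human_scores = [0.1, 0.2, 0.35, 0.4, 0.55, 0.3]   # known-human submissions
ai_scores    = [0.45, 0.6, 0.7, 0.8, 0.9, 0.65]   # known-AI submissions

target_fpr = 0.2
for t in [0.3, 0.4, 0.5, 0.6]:
    fpr, fnr = rates_at(t, human_scores, ai_scores)
    if fpr <= target_fpr:
        print(f"threshold={t}: FPR={fpr:.2f}, FNR={fnr:.2f}")
        break
```

In practice you'd plot the full distributions rather than spot-check a handful of thresholds, but the trade-off logic is the same.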
What Doesn't Work Well
A few things I've tried that underperformed:
- Zero-shot classifiers ("Is this AI-generated? Yes/No") — accuracy degrades fast on edited or hybrid text
- Watermark detection without knowing the source model — you're guessing at the watermarking scheme, which is basically random chance
- Stylometry alone — works great if you have a writing history for the user, falls apart for first-time submitters
- Regex for AI-isms ("as a large language model", "delve into") — trivially bypassed, though still funny to catch
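For completeness, the regex approach looks something like this — the phrase list is illustrative, not exhaustive, and note that single common words like "delve" are omitted on purpose because they false-positive on plenty of human writing:

```python
# Regex for AI-isms: trivially bypassed, but occasionally catches
# an unedited paste job. Phrase list is a small illustrative sample.
import re

AI_TELLS = re.compile(
    r"as an? (large language model|AI language model)"
    r"|i (cannot|can't) (browse|access) the internet",
    re.IGNORECASE,
)

def has_ai_tell(text: str) -> bool:
    return AI_TELLS.search(text) is not None

print(has_ai_tell("As an AI language model, I cannot provide legal advice."))  # True
print(has_ai_tell("The results cover three themes."))                          # False
```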
Prevention Is Better Than Detection
Detection is an arms race. If you have control over the submission process, consider these architectural approaches instead:
- Process-based verification: Require submissions through an editor that logs keystrokes and revision history. Real writing has deletions, reorderings, and pauses. AI-pasted text appears in one big chunk.
- Incremental submission: Break work into stages (outline → draft → revision) and check consistency between them.
- Challenge-response: After submission, ask the author to explain or modify a specific section in real time. This is the digital equivalent of that professor's typewriter trick.
- Structured inputs: Instead of open-ended text fields, use forms with specific constrained prompts that are harder to bulk-generate.
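The first of those ideas can be sketched in a few lines. The event format here is hypothetical — adapt it to whatever your editor actually records — but the heuristic is simply: flag submissions where one paste accounts for most of the final text:

```python
# Process-based sketch: from a (hypothetical) editor event log, flag
# submissions where nearly all the text arrived in a single paste.

def looks_pasted(events: list[dict], max_paste_share: float = 0.8) -> bool:
    """Flag if one paste event accounts for most of the entered characters."""
    total = sum(e["chars"] for e in events if e["type"] in ("type", "paste"))
    if total == 0:
        return False
    biggest_paste = max(
        (e["chars"] for e in events if e["type"] == "paste"), default=0
    )
    return biggest_paste / total > max_paste_share

organic = [{"type": "type", "chars": 40}, {"type": "delete", "chars": 12},
           {"type": "type", "chars": 300}, {"type": "paste", "chars": 25}]
dumped  = [{"type": "paste", "chars": 2400}, {"type": "type", "chars": 30}]

print(looks_pasted(organic))  # False — text typed incrementally, with edits
print(looks_pasted(dumped))   # True — one big paste, token cleanup after
```

The 0.8 share threshold is as much a guess as the ones earlier in the post — calibrate it the same way.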
The Uncomfortable Truth
I'll level with you — this is a losing battle long-term. Models are getting better at mimicking human writing, and the statistical signatures that detection relies on are fading. The perplexity gap between human and AI text narrows with every model release.
The real solution isn't better detection. It's rethinking what you're actually trying to verify. If you need to confirm that a human engaged with a problem, verify the process, not the output. That typewriter professor understood something fundamental: authenticity comes from constraints on the process, not analysis of the result.
But until you can redesign your submission flow, the pipeline above will catch the low-hanging fruit — and honestly, that's most of it. The people who are going to carefully edit AI output to bypass detection are a small minority, and they're putting in enough effort that the line between "AI-assisted" and "human-written" gets genuinely blurry anyway.
Ship the detector. Log the scores. Review the flagged ones. Iterate on your thresholds. And start thinking about process-level verification for v2.