DEV Community

Aral Roca

Posted on • Originally published at kitmul.com

How to Detect AI-Generated Content Using Perplexity and Burstiness

A friend of mine who runs a content agency told me over coffee last week: "We've tried every AI detector out there, and they're all snake oil." I told him I thought I could build a better one. He laughed. Fair enough.

The AI Content Detector I built runs entirely in the browser. No uploads, no subscriptions, no cloud API charging you per scan. It uses ten statistical metrics and eighteen sentence-level signals to figure out whether text was written by a human or generated by ChatGPT, Claude, Gemini, or whatever LLM people are using this week. I want to explain how it actually works, because most "AI detector" marketing pages are deliberately vague about their methodology.

*AI Content Detector analyzing a webpage with ten metrics*

Why perplexity and burstiness alone don't cut it

Every blog post about AI detection mentions perplexity and burstiness. They're real metrics, they do measure something useful, but here's the uncomfortable truth I discovered after weeks of testing: modern AI models like GPT-4 and Claude produce text with high perplexity and high burstiness. They've been trained to sound human. Relying on these two metrics alone is like trying to catch a burglar by checking if they used the front door.

Perplexity measures how predictable word sequences are (low = robotic, high = creative). Burstiness measures sentence length variation (low = uniform, high = varied). Old-school AI from 2022 failed both tests spectacularly. But 2025-2026 models? They pass with flying colors.
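The post doesn't publish the tool's exact formulas, but burstiness as defined here can be sketched as the coefficient of variation of per-sentence word counts. (Perplexity can't be shown this simply: it needs a language model to score word probabilities.)

```typescript
// Burstiness sketch: variation in sentence length. The tool's exact
// formula isn't published; coefficient of variation (std dev / mean of
// per-sentence word counts) is one common, illustrative choice.
function burstiness(text: string): number {
  const lengths = text
    .split(/[.!?]+/)
    .map((s) => s.trim())
    .filter((s) => s.length > 0)
    .map((s) => s.split(/\s+/).length);
  if (lengths.length < 2) return 0;
  const mean = lengths.reduce((a, b) => a + b, 0) / lengths.length;
  const variance =
    lengths.reduce((a, x) => a + (x - mean) ** 2, 0) / lengths.length;
  return Math.sqrt(variance) / mean; // low = uniform, high = varied
}
```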

So what actually works?

The ten metrics that matter

After benchmarking against articles I knew were AI-generated and articles I knew were human (I used a set of ten real URLs ranging from MIT Technology Review to generic SEO coffee blogs), I found that these signals, combined, produce results that are actually useful:

Zipf's Law conformity turned out to be the single most reliable metric. Every natural language follows Zipf's law: the second most common word appears half as often as the first, the third appears a third as often, and so on. Human text deviates from this curve because we get fixated on certain words, go on tangents, make weird word choices. AI text follows Zipf's law almost perfectly because it's sampling from probability distributions that inherently produce Zipfian outputs. I compute R-squared of log-rank vs log-frequency and anything above 0.96 is suspicious.
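A minimal sketch of that check, assuming a least-squares fit of log-frequency against log-rank (the tool's actual tokenizer and fitting details aren't published):

```typescript
// Zipf conformity: R² of a linear fit of log(frequency) vs log(rank).
// Values near 1 mean the text tracks the Zipf curve closely; 0.96 is
// the suspicion threshold mentioned in the post.
function zipfR2(text: string): number {
  const counts = new Map<string, number>();
  for (const w of text.toLowerCase().match(/[a-z']+/g) ?? []) {
    counts.set(w, (counts.get(w) ?? 0) + 1);
  }
  const freqs = [...counts.values()].sort((a, b) => b - a);
  if (freqs.length < 3) return 0;
  const xs = freqs.map((_, i) => Math.log(i + 1)); // log rank
  const ys = freqs.map((f) => Math.log(f)); // log frequency
  const n = xs.length;
  const mx = xs.reduce((a, b) => a + b, 0) / n;
  const my = ys.reduce((a, b) => a + b, 0) / n;
  let sxy = 0, sxx = 0, syy = 0;
  for (let i = 0; i < n; i++) {
    sxy += (xs[i] - mx) * (ys[i] - my);
    sxx += (xs[i] - mx) ** 2;
    syy += (ys[i] - my) ** 2;
  }
  if (sxx === 0 || syy === 0) return 0;
  return (sxy * sxy) / (sxx * syy); // R² = squared Pearson correlation
}
```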

Repeated sentence starters is embarrassingly simple but catches a ton of AI. Count what percentage of sentences start with the same word. AI loves starting sentences with "The", "This", "It", "In". I've seen AI blog posts where 70%+ of sentences start with one of four words. Humans are messier about it without even trying.
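The check is as simple as it sounds. In this sketch, the number of top starter words to pool (`topN`) is an illustrative parameter, not necessarily what the tool uses:

```typescript
// Share of sentences that open with one of the text's topN most common
// starter words. 0.7+ over four starters is the pattern described above.
function repeatedStarterRatio(text: string, topN = 4): number {
  const starters = (text.match(/[^.!?]+[.!?]*/g) ?? [])
    .map((s) => s.trim().split(/\s+/)[0]?.toLowerCase())
    .filter((w): w is string => !!w);
  if (starters.length === 0) return 0;
  const counts = new Map<string, number>();
  for (const w of starters) counts.set(w, (counts.get(w) ?? 0) + 1);
  const top = [...counts.values()].sort((a, b) => b - a).slice(0, topN);
  return top.reduce((a, b) => a + b, 0) / starters.length;
}
```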

Punctuation entropy measures the Shannon entropy of distances between punctuation marks. AI places commas and periods at remarkably regular intervals. Humans are chaotic; sometimes we write three short sentences in a row, then a long one with five commas, then a fragment.
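A sketch of that metric, assuming gaps are measured in characters and left unbinned (the tool's actual binning isn't stated):

```typescript
// Shannon entropy of the gaps (in characters) between punctuation marks.
// Perfectly regular spacing yields a single gap size and entropy 0;
// chaotic human spacing yields many gap sizes and higher entropy.
function punctuationEntropy(text: string): number {
  const gaps: number[] = [];
  let last = -1;
  for (let i = 0; i < text.length; i++) {
    if (/[.,;:!?]/.test(text[i])) {
      if (last >= 0) gaps.push(i - last);
      last = i;
    }
  }
  if (gaps.length === 0) return 0;
  const counts = new Map<number, number>();
  for (const g of gaps) counts.set(g, (counts.get(g) ?? 0) + 1);
  let h = 0;
  for (const c of counts.values()) {
    const p = c / gaps.length;
    h -= p * Math.log2(p);
  }
  return h; // 0 = perfectly regular spacing
}
```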

Sentence length skewness captures the shape of the sentence length distribution. AI produces near-symmetrical distributions (bell curve). Humans write with positive skew: many short sentences, some medium ones, and the occasional monster sentence that runs away from you.
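This can be computed with the standard moment-based skewness estimator over per-sentence word counts; a sketch:

```typescript
// Sample skewness of sentence lengths (in words). Positive skew = many
// short sentences with a long right tail, the human pattern described
// above; values near 0 suggest the symmetrical AI distribution.
function sentenceLengthSkewness(text: string): number {
  const lengths = (text.match(/[^.!?]+/g) ?? [])
    .map((s) => s.trim())
    .filter((s) => s.length > 0)
    .map((s) => s.split(/\s+/).length);
  const n = lengths.length;
  if (n < 3) return 0;
  const mean = lengths.reduce((a, b) => a + b, 0) / n;
  const m2 = lengths.reduce((a, x) => a + (x - mean) ** 2, 0) / n;
  const m3 = lengths.reduce((a, x) => a + (x - mean) ** 3, 0) / n;
  if (m2 === 0) return 0;
  return m3 / m2 ** 1.5; // > 0: long right tail
}
```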

Hapax legomena ratio counts what percentage of words appear only once in the text. Human text has more one-time words because we use specific, contextual vocabulary. AI reuses words more evenly across the text.
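A sketch, assuming the ratio is taken over word tokens (the post doesn't say whether it divides by total tokens or by distinct words):

```typescript
// Hapax legomena ratio: share of word tokens whose word occurs exactly
// once in the text. Higher = more one-off, contextual vocabulary.
function hapaxRatio(text: string): number {
  const words = text.toLowerCase().match(/[a-z']+/g) ?? [];
  if (words.length === 0) return 0;
  const counts = new Map<string, number>();
  for (const w of words) counts.set(w, (counts.get(w) ?? 0) + 1);
  let hapaxes = 0;
  for (const c of counts.values()) if (c === 1) hapaxes++;
  return hapaxes / words.length;
}
```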

Paragraph uniformity is the coefficient of variation of paragraph lengths. AI produces remarkably uniform paragraphs. Humans write a two-sentence paragraph followed by a twelve-sentence one without thinking about it.

The remaining four metrics (perplexity, burstiness, vocabulary richness, word length standard deviation) contribute smaller weights. They help break ties but they're not the heavy hitters anymore.

The real trick: multiplicative signal scoring

Here's where it gets interesting. Individual signals overlap between AI and human text all the time. A human might use dashes (AI signal) or have uniform paragraphs (AI signal). But humans almost never combine dashes AND uniform paragraphs AND transition words AND no contractions AND formulaic structure AND repeated starters in the same piece of writing.

AI text has clusters of co-occurring signals. When three or more AI signals appear in the same sentence, the score doesn't just add up; it multiplies. A sentence with two AI signals scores normally. Three signals? Score multiplied by 1.5x. Four or more? Multiplied by 2x. This multiplicative approach captures something that linear scoring misses: the difference between "occasionally looks AI-ish" and "this is obviously a pattern."
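The 1.5x and 2x multipliers are the ones given above; the base score per signal in this sketch is an illustrative assumption:

```typescript
// Multiplicative sentence scoring: co-occurring AI signals compound
// rather than add. 0-2 signals score linearly; 3 signals get a 1.5x
// multiplier; 4 or more get 2x.
function sentenceScore(signalsFired: number, basePerSignal = 1): number {
  const base = signalsFired * basePerSignal;
  if (signalsFired >= 4) return base * 2;
  if (signalsFired === 3) return base * 1.5;
  return base;
}
```

So a sentence firing four signals scores 8, four times the score of a sentence firing two, which is exactly the "obviously a pattern" gap that linear scoring flattens.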

The sentence-level classifier tracks eighteen separate signals per sentence: length uniformity, dash usage, transition words, filler phrases ("it is important to note", "plays a crucial role"), overused vocabulary ("leverage", "comprehensive", "facilitate"), bold-then-explain patterns, "Here's what/why/how" hooks, proper noun density, contractions, parenthetical asides, questions, informal language, passive voice, starter repetition, colon endings, semicolons, numbered lists, and conclusion patterns.

URL mode and content extraction

You can paste text directly or enter a URL. In URL mode, the tool fetches the HTML, strips out navigation, sidebars, footers, images, scripts, and all non-text elements, then converts the remaining content to Markdown using Turndown. You can expand the extracted content below the results to verify what the tool actually analyzed. Some sites load content via JavaScript (client-side rendering), which the fetcher can't capture; for those, the text tab works better.

The URL fetch tries your browser first (no server involved). If CORS blocks it, a lightweight Edge proxy kicks in with a rate limit of five requests per minute.
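A sketch of that browser-first flow; the proxy path and the injectable `doFetch` parameter are assumptions for illustration, and the per-minute rate limit lives on the proxy side, so it's omitted here:

```typescript
// Try a direct browser fetch first (no server involved); if CORS or
// any network error blocks it, retry through the Edge proxy.
// The "/api/proxy" endpoint name is hypothetical.
async function fetchPage(
  url: string,
  doFetch: typeof fetch = fetch,
): Promise<string> {
  try {
    const res = await doFetch(url); // direct, client-side fetch
    if (!res.ok) throw new Error(`HTTP ${res.status}`);
    return await res.text();
  } catch {
    // CORS failures surface as thrown TypeErrors, landing here
    const proxied = `/api/proxy?url=${encodeURIComponent(url)}`;
    const res = await doFetch(proxied);
    if (!res.ok) throw new Error(`proxy HTTP ${res.status}`);
    return await res.text();
  }
}
```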

Where it falls short

I'm not going to pretend this is perfect.

The biggest weakness: well-written AI text that has been lightly edited by a human. If someone generates a draft with ChatGPT and then rewrites a third of the sentences, adds a personal anecdote, and removes the transition words, our detector (and every other detector) will struggle. That's a fundamental limitation of statistical approaches.

The second weakness: some human writing is genuinely formulaic. Corporate press releases, legal documents, academic abstracts. These trigger AI signals because they lack the messiness that statistical detectors look for. This isn't a bug exactly, but it does produce false positives on a category of text that nobody would call creative writing.

The third weakness: very short text. Below about 200 words, there isn't enough statistical signal for any of the metrics to be reliable.

Compared to GPTZero, Originality.ai, Copyleaks

Those services use trained ML classifiers (neural networks trained on millions of labeled AI/human samples). In theory, they should be more accurate than statistical heuristics like mine. In practice, the gap is smaller than you'd think, especially on longer texts. Their models were trained on specific AI outputs and struggle when new models appear; statistical patterns are more model-agnostic.

The real advantage of the browser-based approach: your text never leaves your device, it's free, and it's instant. If you're scanning a hundred blog posts for a content audit, that matters more than a few percentage points of accuracy.

Try it

The AI Content Detector is on Kitmul. Free, no signup, runs in your browser. Test it on something you know is AI-generated, test it on something you wrote yourself, and see if the results match your intuition.

Next up: a sitemap scanner that crawls every URL in your sitemap.xml and produces a report of which pages look AI-generated. That one should be fun.


Related tools: Text Readability Scorer · Sentiment Analyzer · Keyword Extractor · Text Tone Analyzer · Syllable Counter
