An AI detector claims 95% accuracy. A student's essay gets flagged as "98% likely AI-generated." Open-and-shut case, right?
Not even close. The math tells a very different story. This article breaks down exactly why AI detector confidence scores are misleading, using probability theory that any developer can follow.
## The Base Rate Fallacy
The base rate fallacy is the single most important concept for understanding AI detection errors. It is the reason a "95% accurate" detector can still be wrong a third of the time.
Here is the setup. A university uses an AI detector with these published metrics:
- True positive rate (sensitivity): 95%. If text is AI-generated, the detector correctly flags it 95% of the time.
- False positive rate: 5%. If text is human-written, the detector incorrectly flags it 5% of the time.
Sounds great. Now apply it to a real population.
In a class of 200 students, suppose 20 actually used AI (10% base rate). What happens when every essay goes through the detector?
```python
# Base rate calculation
total_students = 200
ai_users = 20  # 10% base rate
human_writers = 180

# Detector results
true_positives = ai_users * 0.95        # 19 correctly flagged
false_positives = human_writers * 0.05  # 9 incorrectly flagged
total_flagged = true_positives + false_positives  # 28 total flags

# The critical number
precision = true_positives / total_flagged
print(f"P(actually AI | flagged) = {precision:.1%}")
# Output: P(actually AI | flagged) = 67.9%
```
Out of 28 flagged essays, 9 are false positives. Nearly one in three flagged students wrote their essay themselves. The detector's "95% accuracy" translates to a 32% error rate on flagged results in this scenario.
## Bayes' Theorem: The Formal Version
What we just computed informally is Bayes' theorem:
```
P(AI | flagged) = P(flagged | AI) * P(AI) / P(flagged)
```

Where:

- P(flagged | AI) = 0.95 (true positive rate)
- P(AI) = 0.10 (base rate of AI usage)
- P(flagged) = P(flagged | AI) * P(AI) + P(flagged | human) * P(human) = 0.95 * 0.10 + 0.05 * 0.90 = 0.14

Plugging in:

```
P(AI | flagged) = 0.95 * 0.10 / 0.14 = 0.679
```
The posterior probability (67.9%) is drastically lower than the detector's confidence output. This is not a flaw in the math. This is the math working correctly. The detector just is not reporting this number.
## How Base Rate Changes Everything
The same detector produces wildly different reliability depending on the population:
```python
def precision_at_base_rate(sensitivity, fpr, base_rate):
    """Calculate precision given base rate of AI text."""
    tp = sensitivity * base_rate
    fp = fpr * (1 - base_rate)
    return tp / (tp + fp)

base_rates = [0.01, 0.05, 0.10, 0.20, 0.50]
for br in base_rates:
    p = precision_at_base_rate(0.95, 0.05, br)
    print(f"Base rate {br:.0%}: P(AI | flagged) = {p:.1%}")
```
| Base rate (% actually using AI) | P(actually AI given flagged) |
|---|---|
| 1% | 16.1% |
| 5% | 50.0% |
| 10% | 67.9% |
| 20% | 82.6% |
| 50% | 95.0% |
At a 1% base rate, 84% of flags are false positives. The detector is wrong five out of six times it fires. At a 5% base rate, it is a coin flip.
The detector only matches its advertised accuracy when the base rate is 50%, meaning half the population used AI. In most real-world contexts (professional writing, journalism, established authors), the base rate is far lower.
## Why Confidence Scores Mislead
When a detector reports "98% confidence this is AI-generated," it is reporting the model's internal softmax output, not the posterior probability accounting for the base rate. These are fundamentally different numbers.
```python
# What the detector reports:
model_confidence = 0.98  # softmax output

# What you actually want to know:
# P(AI | text, base_rate) -- requires Bayesian adjustment

# Rough calibration: even with 0.98 model confidence,
# if the base rate in your context is 5%:
adjusted = (0.98 * 0.05) / (0.98 * 0.05 + 0.02 * 0.95)
print(f"Adjusted probability: {adjusted:.1%}")
# Output: Adjusted probability: 72.1%
```
A "98% confidence" flag, after base rate adjustment, might mean 72% actual likelihood. That is a meaningful difference when someone's grade, job, or reputation is on the line.
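Running the same rough calibration in reverse shows how steep the requirement really is. The sketch below solves the adjustment formula for the model confidence needed to reach a given posterior; it reuses the article's simplifying assumption that confidence stands in for sensitivity and (1 - confidence) for the false positive rate, and the `required_confidence` helper name is mine:

```python
def required_confidence(target_posterior, base_rate):
    """Model confidence needed for the base-rate-adjusted probability
    to reach target_posterior, assuming confidence ~ sensitivity and
    (1 - confidence) ~ false positive rate (the article's rough calibration).
    Derived by solving p = c*b / (c*b + (1-c)*(1-b)) for c."""
    p, b = target_posterior, base_rate
    return p * (1 - b) / (b + p - 2 * p * b)

for b in [0.01, 0.05, 0.10, 0.50]:
    c = required_confidence(0.95, b)
    print(f"Base rate {b:.0%}: need {c:.2%} model confidence")
```

At a 5% base rate, reaching a 95% posterior takes roughly 99.7% model confidence; only at a 50% base rate does the required confidence drop to the posterior you want.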
## The Overlapping Distributions Problem
Beyond the base rate issue, there is a fundamental signal problem. The statistical features detectors measure (perplexity, burstiness, vocabulary distribution) are not cleanly separated between human and AI text.
Visualize two bell curves on a number line:
```
        human text                 AI text
       distribution             distribution
  |       *****                    *****
  |     **     **                **     **
  |    **       **              **       **
  |   **         **  *****    **          **
  |  **            **     ****             **
  +-------------------|-------|----------------->
    high perplexity      ^        low perplexity
                    OVERLAP ZONE
                 (unreliable region)
```
Any threshold you draw through the overlap zone creates two types of errors simultaneously:
- False positives: Human text to the right of the threshold (lower perplexity than typical humans)
- False negatives: AI text to the left of the threshold (higher perplexity than typical AI)
You can tune the threshold to reduce one error type, but only at the cost of increasing the other. This is the ROC curve tradeoff. No threshold eliminates both errors.
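The tradeoff is easy to see numerically. The sketch below models the two curves as normal distributions of perplexity; the means and standard deviations are invented for illustration, not measured from any real detector:

```python
from statistics import NormalDist

# Hypothetical perplexity distributions -- illustrative numbers only
human = NormalDist(mu=45, sigma=12)  # human text: higher perplexity
ai = NormalDist(mu=20, sigma=8)      # AI text: lower perplexity

# Flag text as AI whenever its perplexity falls below the threshold
for threshold in (15, 25, 32, 40):
    fpr = human.cdf(threshold)   # humans wrongly flagged as AI
    fnr = 1 - ai.cdf(threshold)  # AI text that slips through
    print(f"threshold {threshold}: FPR = {fpr:.1%}, FNR = {fnr:.1%}")
```

Sliding the threshold right drives the false negative rate toward zero while the false positive rate climbs, and vice versa. Every point on that sweep is one point on the detector's ROC curve.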
## Who Gets Caught in the Overlap Zone?
The overlap zone is not random. Specific groups of human writers consistently fall into it:
Non-native English speakers. Simpler vocabulary, more regular grammar, fewer idiomatic expressions. A 2023 Stanford study found that detectors misclassified non-native English writing as AI over 60% of the time, while achieving near-zero false positives on native English text.
Formal and academic writers. Hedging language ("it could be argued that"), structured argumentation, and domain-specific terminology all reduce perplexity. The writing conventions that make academic text rigorous are the same patterns detectors associate with AI.
Technical writers. Programming tutorials, API documentation, medical summaries. When explaining well-documented concepts, human writers naturally converge on standard phrasing. The text reads as "predictable" not because AI wrote it, but because there are limited natural ways to explain how a hash map works.
Writers on well-covered topics. The more a topic has been written about, the more constrained the natural phrasing becomes. An article about "how to center a div in CSS" will read similarly whether a human or AI wrote it.
## Practical Implications for Developers
If you are building systems that consume AI detection output, here is what the math demands:
1. Never treat scores as binary. A detection score is a probability estimate with wide confidence intervals. Threshold-based decisions ("flagged if > 70%") create brittle systems.
2. Account for base rates. If your application context has a low base rate of AI text (e.g., screening submissions from established authors), most flags will be false positives regardless of detector accuracy.
3. Require corroborating evidence. A detector score should be one input among many, not a verdict. Combine with metadata (writing history, edit patterns, timing data) for more reliable decisions.
4. Communicate uncertainty. If you surface detection results to users, show ranges and caveats, not confident-sounding percentages. "This text has statistical properties common in AI-generated text" is more honest than "98% AI."
5. Test on your population. Published accuracy numbers are measured on benchmark datasets. Your actual population (domain, language proficiency, writing style distribution) will produce different error rates. Measure them.
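Points 2 and 4 can be combined in code: instead of surfacing a single confident-looking number, report the posterior across a plausible band of base rates. This is a minimal sketch; the base-rate band and the calibration assumption (confidence standing in for sensitivity, 1 - confidence for the false positive rate) are illustrative choices, and both function names are mine:

```python
def posterior(confidence, base_rate):
    """Bayesian adjustment, treating confidence as sensitivity
    and (1 - confidence) as the false positive rate."""
    return (confidence * base_rate) / (
        confidence * base_rate + (1 - confidence) * (1 - base_rate)
    )

def posterior_range(confidence, base_rate_low, base_rate_high):
    """Adjusted probability across a band of plausible base rates."""
    return (posterior(confidence, base_rate_low),
            posterior(confidence, base_rate_high))

lo, hi = posterior_range(0.98, 0.02, 0.15)
print(f"Statistical properties suggest AI; "
      f"adjusted likelihood roughly {lo:.0%}-{hi:.0%}")
```

A "98% confidence" flag surfaced as "roughly 50%-90%, depending on context" is a very different message to put in front of a user, and a far more honest one.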
## Try It Yourself
Want to see how different detectors score the same text? Metric37's free AI detector lets you paste any text and get a breakdown of the detection signals. For batch analysis or integration into your own tools, the Metric37 API provides programmatic access to detection and humanization scoring.
This is Part 2 of a technical series on AI detection. Part 1 covers how detection works under the hood, including perplexity math and classifier architectures.