Benchmarking 4 AI Detectors on 1,000 Texts: Why False Positives Matter More Than Accuracy

#ai #machinelearning #datascience #writing

I’ve been testing AI detectors recently because they are increasingly used in schools, publishing workflows, hiring, and content moderation.

Most AI detector comparisons focus on one headline number: accuracy.

But after running a benchmark across 1,000 English texts, I think that may be the wrong metric to obsess over.

Dataset

The benchmark included:

500 human-written texts
500 AI-generated texts
AI samples generated by 13 different models
Mixed text lengths: short, medium, and long passages

I tested four AI detectors:

GPTHumanizer
GPTZero
ZeroGPT
Sapling

The surprising part

The most important question was not:

Which detector catches the most AI?

It was:

Which detector is least likely to falsely accuse a human writer?

That distinction matters a lot.

In a real school, workplace, or publishing environment, a false positive is not just a bad prediction. It can become an accusation.

Results

Detector	Overall Accuracy	Human False Positive Rate
GPTHumanizer	98.0%	0.0%
GPTZero	98.7%	2.2%
ZeroGPT	88.2%	18.4%
Sapling	88.6%	19.4%

GPTZero achieved the highest overall accuracy in this benchmark.

But GPTHumanizer had the lowest human false positive rate, at 0.0% versus 2.2% for GPTZero.

ZeroGPT and Sapling were much more aggressive about flagging text as AI, which meant they mislabeled roughly one in five human-written texts (18.4% and 19.4%).
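To make the two columns concrete, here is a minimal sketch of how overall accuracy and human false positive rate can be computed from a confusion matrix. The `ConfusionCounts` class and the function names are my own illustration, not part of any detector's API. The example counts are not reported in the benchmark: they are the counts that would be consistent with the GPTZero row if every one of the 1,000 texts received a plain "AI" or "human" label, which is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    true_positives: int   # AI texts labeled as AI
    false_negatives: int  # AI texts labeled as human
    false_positives: int  # human texts labeled as AI
    true_negatives: int   # human texts labeled as human


def accuracy(c: ConfusionCounts) -> float:
    """Share of all texts (human and AI) that were labeled correctly."""
    total = (c.true_positives + c.false_negatives
             + c.false_positives + c.true_negatives)
    return (c.true_positives + c.true_negatives) / total


def human_false_positive_rate(c: ConfusionCounts) -> float:
    """Share of human-written texts that were wrongly labeled as AI."""
    humans = c.false_positives + c.true_negatives
    return c.false_positives / humans


# Illustrative counts (an assumption, not reported data): 500 AI texts
# and 500 human texts, chosen to be consistent with the GPTZero row.
example = ConfusionCounts(
    true_positives=498,
    false_negatives=2,
    false_positives=11,
    true_negatives=489,
)

print(f"accuracy:  {accuracy(example):.1%}")                   # 98.7%
print(f"human FPR: {human_false_positive_rate(example):.1%}")  # 2.2%
```

The point of separating the two functions is that accuracy averages over both classes, while the false positive rate looks only at human-written texts, so a detector can score well on one and poorly on the other.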

Why this matters

If you are using an AI detector for low-stakes filtering, raw accuracy might be useful.

But if you are using it to judge students, writers, job applicants, or employees, the false positive rate should probably be treated as the most important metric.

A detector that catches slightly more AI but wrongly flags real human writing may be more harmful than a conservative detector that avoids false accusations.
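As a rough way to see what those rates mean at scale, this sketch multiplies each detector's human false positive rate from the results table by a hypothetical workload of 1,000 human-written submissions. The workload size and the `expected_false_flags` helper are assumptions for illustration, and the calculation assumes the benchmark rates carry over to a different set of texts, which they may not.

```python
# Human false positive rates from the results table above.
FALSE_POSITIVE_RATES = {
    "GPTHumanizer": 0.000,
    "GPTZero": 0.022,
    "ZeroGPT": 0.184,
    "Sapling": 0.194,
}


def expected_false_flags(fpr: float, human_texts: int) -> float:
    """Expected number of human-written texts wrongly flagged as AI."""
    return fpr * human_texts


# Hypothetical workload (an assumption): 1,000 submissions,
# all of them written by humans.
HUMAN_SUBMISSIONS = 1_000

for name, fpr in FALSE_POSITIVE_RATES.items():
    flagged = expected_false_flags(fpr, HUMAN_SUBMISSIONS)
    print(f"{name}: about {flagged:.0f} wrongly flagged")
```

Under those assumptions, the gap between a 2.2% and an 18.4% false positive rate is the difference between about 22 and about 184 people who would have to defend writing that was their own.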