<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: akash</title>
    <description>The latest articles on DEV Community by akash (@laakash).</description>
    <link>https://dev.to/laakash</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F530666%2F8e7dcd40-0c1c-4923-8ca9-5979c9fc70d0.jpg</url>
      <title>DEV Community: akash</title>
      <link>https://dev.to/laakash</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/laakash"/>
    <language>en</language>
    <item>
      <title>Why AI Detectors Produce False Positives: A Technical Analysis</title>
      <dc:creator>akash</dc:creator>
      <pubDate>Thu, 09 Apr 2026 05:26:35 +0000</pubDate>
      <link>https://dev.to/laakash/why-ai-detectors-produce-false-positives-a-technical-analysis-2gpc</link>
      <guid>https://dev.to/laakash/why-ai-detectors-produce-false-positives-a-technical-analysis-2gpc</guid>
      <description>&lt;p&gt;An AI detector claims 95% accuracy. A student's essay gets flagged as "98% likely AI-generated." Open-and-shut case, right?&lt;/p&gt;

&lt;p&gt;Not even close. The math tells a very different story. This article breaks down exactly why AI detector confidence scores are misleading, using probability theory that any developer can follow.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Base Rate Fallacy
&lt;/h2&gt;

&lt;p&gt;The base rate fallacy is the single most important concept for understanding AI detection errors. It is the reason a "95% accurate" detector can still be wrong a third of the time.&lt;/p&gt;

&lt;p&gt;Here is the setup. A university uses an AI detector with these published metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;True positive rate (sensitivity):&lt;/strong&gt; 95%. If text is AI-generated, the detector correctly flags it 95% of the time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;False positive rate:&lt;/strong&gt; 5%. If text is human-written, the detector incorrectly flags it 5% of the time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sounds great. Now apply it to a real population.&lt;/p&gt;

&lt;p&gt;In a class of 200 students, suppose 20 actually used AI (10% base rate). What happens when every essay goes through the detector?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Base rate calculation
&lt;/span&gt;&lt;span class="n"&gt;total_students&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;
&lt;span class="n"&gt;ai_users&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;        &lt;span class="c1"&gt;# 10% base rate
&lt;/span&gt;&lt;span class="n"&gt;human_writers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;180&lt;/span&gt;

&lt;span class="c1"&gt;# Detector results
&lt;/span&gt;&lt;span class="n"&gt;true_positives&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ai_users&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.95&lt;/span&gt;           &lt;span class="c1"&gt;# 19 correctly flagged
&lt;/span&gt;&lt;span class="n"&gt;false_positives&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;human_writers&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;     &lt;span class="c1"&gt;# 9 incorrectly flagged
&lt;/span&gt;&lt;span class="n"&gt;total_flagged&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;true_positives&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;false_positives&lt;/span&gt;  &lt;span class="c1"&gt;# 28 total flags
&lt;/span&gt;
&lt;span class="c1"&gt;# The critical number
&lt;/span&gt;&lt;span class="n"&gt;precision&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;true_positives&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;total_flagged&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;P(actually AI | flagged) = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;precision&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Output: P(actually AI | flagged) = 67.9%
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Out of 28 flagged essays, 9 are false positives.&lt;/strong&gt; Nearly one in three flagged students wrote their essay themselves. The detector's "95% accuracy" translates to a 32% error rate on flagged results in this scenario.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bayes' Theorem: The Formal Version
&lt;/h2&gt;

&lt;p&gt;What we just computed informally is Bayes' theorem:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;P(AI | flagged) = P(flagged | AI) * P(AI) / P(flagged)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;P(flagged | AI) = 0.95&lt;/code&gt; (true positive rate)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;P(AI) = 0.10&lt;/code&gt; (base rate of AI usage)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;P(flagged) = P(flagged | AI) * P(AI) + P(flagged | human) * P(human)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;P(flagged) = 0.95 * 0.10 + 0.05 * 0.90 = 0.14&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;P(AI | flagged) = 0.95 * 0.10 / 0.14 = 0.679
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The posterior probability (67.9%) is drastically lower than the detector's confidence output. This is not a flaw in the math; this is the math working correctly. The detector simply does not report this number.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Base Rate Changes Everything
&lt;/h2&gt;

&lt;p&gt;The same detector produces wildly different reliability depending on the population:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;precision_at_base_rate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sensitivity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;fpr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;base_rate&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Calculate precision given base rate of AI text.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;tp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sensitivity&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;base_rate&lt;/span&gt;
    &lt;span class="n"&gt;fp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fpr&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;base_rate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tp&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tp&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;fp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;base_rates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mf"&gt;0.01&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.50&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;br&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;base_rates&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;precision_at_base_rate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;br&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Base rate &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;br&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: P(AI | flagged) = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Base rate (% actually using AI)&lt;/th&gt;
&lt;th&gt;P(actually AI given flagged)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1%&lt;/td&gt;
&lt;td&gt;16.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5%&lt;/td&gt;
&lt;td&gt;50.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10%&lt;/td&gt;
&lt;td&gt;67.9%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;20%&lt;/td&gt;
&lt;td&gt;82.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;50%&lt;/td&gt;
&lt;td&gt;95.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At a 1% base rate, &lt;strong&gt;84% of flags are false positives.&lt;/strong&gt; The detector is wrong five out of six times it fires. At a 5% base rate, it is a coin flip.&lt;/p&gt;

&lt;p&gt;The detector only matches its advertised accuracy when the base rate is 50%, meaning half the population used AI. In most real-world contexts (professional writing, journalism, established authors), the base rate is far lower.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Confidence Scores Mislead
&lt;/h2&gt;

&lt;p&gt;When a detector reports "98% confidence this is AI-generated," it is reporting the model's internal softmax output, not the posterior probability accounting for the base rate. These are fundamentally different numbers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# What the detector reports:
&lt;/span&gt;&lt;span class="n"&gt;model_confidence&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.98&lt;/span&gt;  &lt;span class="c1"&gt;# softmax output
&lt;/span&gt;
&lt;span class="c1"&gt;# What you actually want to know:
# P(AI | text, base_rate) -- requires Bayesian adjustment
&lt;/span&gt;
&lt;span class="c1"&gt;# Rough calibration: even with 0.98 model confidence,
# if the base rate in your context is 5%:
&lt;/span&gt;&lt;span class="n"&gt;adjusted&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.98&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.98&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.05&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.02&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Adjusted probability: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;adjusted&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Output: Adjusted probability: 72.1%
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A "98% confidence" flag, after base rate adjustment, might mean 72% actual likelihood. That is a meaningful difference when someone's grade, job, or reputation is on the line.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Overlapping Distributions Problem
&lt;/h2&gt;

&lt;p&gt;Beyond the base rate issue, there is a fundamental signal problem. The statistical features detectors measure (perplexity, burstiness, vocabulary distribution) are not cleanly separated between human and AI text.&lt;/p&gt;

&lt;p&gt;Visualize two bell curves on a number line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;          Human text              AI text
          distribution            distribution

    |         *****                 *****
    |       **     **             **     **
    |     **         **         **         **
    |   **             ** *** **             **
    | **                 *****                 **
    +---|--------|--------|----|--------|--------|---&amp;gt;
       High    Medium     ↑   Low    Very Low
       perplexity         |   perplexity
                     OVERLAP ZONE
                  (unreliable region)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Any threshold you draw through the overlap zone creates two types of errors simultaneously:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;False positives&lt;/strong&gt;: Human text to the right of the threshold (lower perplexity than typical humans)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;False negatives&lt;/strong&gt;: AI text to the left of the threshold (higher perplexity than typical AI)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can tune the threshold to reduce one error type, but only at the cost of increasing the other. This is the ROC curve tradeoff. No threshold eliminates both errors.&lt;/p&gt;
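&lt;p&gt;The tradeoff is easy to see in a toy model: treat human and AI per-document perplexity as two overlapping normal distributions and sweep the decision threshold. The distribution parameters below are illustrative assumptions, not measurements from any real detector:&lt;/p&gt;

```python
# Toy ROC tradeoff: human and AI per-document perplexity modeled as
# overlapping normal distributions. Parameters are illustrative only.
from statistics import NormalDist

human = NormalDist(mu=55, sigma=15)  # assumed human perplexity profile
ai = NormalDist(mu=12, sigma=4)      # assumed AI perplexity profile

# Flag a document as AI when its perplexity falls below the threshold.
for threshold in (15, 20, 25, 30, 35):
    fpr = human.cdf(threshold)       # humans landing below the cut
    fnr = 1 - ai.cdf(threshold)      # AI text landing above the cut
    print(f"threshold {threshold}: FPR {fpr:.1%}, FNR {fnr:.1%}")
```

&lt;p&gt;Raising the threshold catches more AI text (FNR falls) but flags more humans (FPR rises); no setting drives both error rates to zero.&lt;/p&gt;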

&lt;h2&gt;
  
  
  Who Gets Caught in the Overlap Zone?
&lt;/h2&gt;

&lt;p&gt;The overlap zone is not random. Specific groups of human writers consistently fall into it:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Non-native English speakers.&lt;/strong&gt; Simpler vocabulary, more regular grammar, fewer idiomatic expressions. A 2023 Stanford study found that detectors misclassified non-native English writing as AI over 60% of the time, while achieving near-zero false positives on native English text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Formal and academic writers.&lt;/strong&gt; Hedging language ("it could be argued that"), structured argumentation, and domain-specific terminology all reduce perplexity. The writing conventions that make academic text rigorous are the same patterns detectors associate with AI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Technical writers.&lt;/strong&gt; Programming tutorials, API documentation, medical summaries. When explaining well-documented concepts, human writers naturally converge on standard phrasing. The text reads as "predictable" not because AI wrote it, but because there are limited natural ways to explain how a hash map works.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Writers on well-covered topics.&lt;/strong&gt; The more a topic has been written about, the more constrained the natural phrasing becomes. An article about "how to center a div in CSS" will read similarly whether a human or AI wrote it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Implications for Developers
&lt;/h2&gt;

&lt;p&gt;If you are building systems that consume AI detection output, here is what the math demands:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Never treat scores as binary.&lt;/strong&gt; A detection score is a probability estimate with wide confidence intervals. Threshold-based decisions ("flagged if &amp;gt; 70%") create brittle systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Account for base rates.&lt;/strong&gt; If your application context has a low base rate of AI text (e.g., screening submissions from established authors), most flags will be false positives regardless of detector accuracy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Require corroborating evidence.&lt;/strong&gt; A detector score should be one input among many, not a verdict. Combine with metadata (writing history, edit patterns, timing data) for more reliable decisions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Communicate uncertainty.&lt;/strong&gt; If you surface detection results to users, show ranges and caveats, not confident-sounding percentages. "This text has statistical properties common in AI-generated text" is more honest than "98% AI."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Test on your population.&lt;/strong&gt; Published accuracy numbers are measured on benchmark datasets. Your actual population (domain, language proficiency, writing style distribution) will produce different error rates. Measure them.&lt;/p&gt;
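&lt;p&gt;Points 2 and 4 can be combined in a small helper: adjust a flag for the local base rate, then surface a hedged label instead of a raw percentage. The label names and band boundaries here are made-up illustrations, not calibrated values:&lt;/p&gt;

```python
# Sketch: base-rate-adjusted posterior plus a hedged label.
# TPR/FPR inputs, band boundaries, and labels are illustrative.
import bisect

def posterior_given_flag(tpr, fpr, base_rate):
    """P(actually AI | flagged), via Bayes' theorem."""
    p_flagged = tpr * base_rate + fpr * (1 - base_rate)
    return tpr * base_rate / p_flagged

LABELS = ["probably human", "inconclusive", "leaning AI", "likely AI"]
BANDS = [0.4, 0.6, 0.85]  # posterior boundaries between the labels

def hedged_label(tpr, fpr, base_rate):
    p = posterior_given_flag(tpr, fpr, base_rate)
    return LABELS[bisect.bisect(BANDS, p)], p

label, p = hedged_label(tpr=0.95, fpr=0.05, base_rate=0.05)
print(f"{label} (posterior {p:.1%})")  # inconclusive (posterior 50.0%)
```

&lt;p&gt;At a 5% base rate, even a detector with the advertised 95%/5% metrics yields an "inconclusive" flag, which is exactly what a user should be told.&lt;/p&gt;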

&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;Want to see how different detectors score the same text? &lt;a href="https://metric37.com/detect" rel="noopener noreferrer"&gt;Metric37's free AI detector&lt;/a&gt; lets you paste any text and get a breakdown of the detection signals. For batch analysis or integration into your own tools, the &lt;a href="https://metric37.com/api" rel="noopener noreferrer"&gt;Metric37 API&lt;/a&gt; provides programmatic access to detection and humanization scoring.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is Part 2 of a technical series on AI detection. Part 1 covers &lt;a href="https://dev.to/metric37/how-ai-text-detection-works-under-the-hood-perplexity-burstiness-and-classifiers"&gt;how detection works under the hood&lt;/a&gt;, including perplexity math and classifier architectures.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>statistics</category>
      <category>machinelearning</category>
      <category>webdev</category>
    </item>
    <item>
      <title>How AI Text Detection Works Under the Hood: Perplexity, Burstiness, and Classifiers</title>
      <dc:creator>akash</dc:creator>
      <pubDate>Thu, 09 Apr 2026 05:26:03 +0000</pubDate>
      <link>https://dev.to/laakash/how-ai-text-detection-works-under-the-hood-perplexity-burstiness-and-classifiers-2o6m</link>
      <guid>https://dev.to/laakash/how-ai-text-detection-works-under-the-hood-perplexity-burstiness-and-classifiers-2o6m</guid>
      <description>&lt;p&gt;AI text detectors are not magic. They are statistical models measuring how predictable your text is. If you have ever wondered what GPTZero, Originality.ai, or Turnitin are actually computing when they flag text as "AI-generated," this post breaks down the math and the models.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Core Intuition
&lt;/h2&gt;

&lt;p&gt;Language models generate text by repeatedly predicting the next token. At each step, the model assigns a probability distribution over its entire vocabulary, then samples from it. The result is text where nearly every word is a high-probability choice given the preceding context.&lt;/p&gt;

&lt;p&gt;Human writers do not work this way. We make unexpected word choices, write sentence fragments, insert tangents, and vary our rhythm. Our text is statistically messier.&lt;/p&gt;

&lt;p&gt;AI detectors exploit this difference using two primary signals: &lt;strong&gt;perplexity&lt;/strong&gt; and &lt;strong&gt;burstiness&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Perplexity: Measuring Surprise
&lt;/h2&gt;

&lt;p&gt;Perplexity quantifies how "surprised" a language model is by a sequence of tokens. Formally, for a sequence of N tokens:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;perplexity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token_log_probs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    token_log_probs: list of log P(token_i | token_1..token_i-1)
    from a reference language model (e.g., GPT-2, RoBERTa)
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;N&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token_log_probs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;avg_neg_log_prob&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token_log_probs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;avg_neg_log_prob&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A low perplexity score means the model easily predicted every token. A high score means the text contained surprises.&lt;/p&gt;
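&lt;p&gt;A quick sanity check builds intuition: perplexity behaves like an effective branching factor, the number of equally likely options the model is "hesitating" between at each step. The function from above is redefined here so the snippet runs standalone:&lt;/p&gt;

```python
# Sanity check: perplexity as an effective branching factor.
import math

def perplexity(token_log_probs):
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# Every token assigned probability 0.5: the model is effectively
# choosing between 2 equally likely options at each step.
print(f"{perplexity([math.log(0.5)] * 10):.3f}")  # 2.000

# Probability 0.1 per token: effectively 10-way uncertainty.
print(f"{perplexity([math.log(0.1)] * 10):.1f}")  # 10.0
```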

&lt;p&gt;In practice, you run the suspect text through a reference model (often GPT-2 or a similar openly available LM), compute the log-probability of each token conditioned on its prefix, and aggregate. Typical ranges:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Text type&lt;/th&gt;
&lt;th&gt;Perplexity range&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Raw GPT-4 output&lt;/td&gt;
&lt;td&gt;5-15&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Human blog post&lt;/td&gt;
&lt;td&gt;30-80&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Creative fiction&lt;/td&gt;
&lt;td&gt;60-150+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Non-native English&lt;/td&gt;
&lt;td&gt;15-40&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The overlap between "non-native English" and "AI output" is immediately visible, and it foreshadows the false positive problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  Burstiness: Measuring Rhythm Variation
&lt;/h2&gt;

&lt;p&gt;Perplexity alone is not enough. Burstiness measures how much the perplexity varies across a text. Think of it as the standard deviation of per-sentence perplexity scores, normalized by their mean (a coefficient of variation).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;statistics&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;burstiness&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentence_perplexities&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    sentence_perplexities: list of perplexity scores,
    one per sentence in the document
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentence_perplexities&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
    &lt;span class="n"&gt;mean_ppl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;statistics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentence_perplexities&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;std_ppl&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;statistics&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stdev&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentence_perplexities&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# Normalized burstiness coefficient
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;std_ppl&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;mean_ppl&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Human writing is bursty. You might write a straightforward factual sentence (low perplexity), followed by a creative metaphor (high perplexity), followed by a one-word interjection (wildcard). The per-sentence perplexity jumps around.&lt;/p&gt;

&lt;p&gt;AI text has low burstiness. The model maintains a consistent "temperature" of word choice throughout. Every sentence sits in roughly the same predictability band.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Two-dimensional classification&lt;/strong&gt;: Low perplexity + low burstiness = strong AI signal. High perplexity + high burstiness = strong human signal. Mixed signals land in the gray zone where detectors are unreliable.&lt;/p&gt;
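&lt;p&gt;The two-signal grid can be sketched as a lookup table. The cut points below are invented for illustration; real detectors learn these boundaries from training data rather than hard-coding them:&lt;/p&gt;

```python
# Toy two-signal grid: bucket a document by perplexity and burstiness.
# Cut points are illustrative assumptions, not values any real detector uses.
import bisect

PPL_CUTS = [25.0]    # at or below: "low" perplexity; above: "high"
BURST_CUTS = [0.3]   # at or below: "flat" rhythm;   above: "bursty"
GRID = {
    (0, 0): "strong AI signal",     # low perplexity, low burstiness
    (1, 1): "strong human signal",  # high perplexity, high burstiness
    (0, 1): "gray zone",
    (1, 0): "gray zone",
}

def signal(ppl, burst):
    cell = (bisect.bisect(PPL_CUTS, ppl), bisect.bisect(BURST_CUTS, burst))
    return GRID[cell]

print(signal(10.0, 0.1))  # strong AI signal
print(signal(70.0, 0.8))  # strong human signal
print(signal(12.0, 0.9))  # gray zone
```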

&lt;h2&gt;
  
  
  Classifier Models: Learning the Difference
&lt;/h2&gt;

&lt;p&gt;Statistical thresholds on perplexity and burstiness only get you so far. Modern commercial detectors (GPTZero, Originality.ai, Turnitin, Copyleaks) use trained classifier models, typically fine-tuned transformers.&lt;/p&gt;

&lt;p&gt;The architecture usually looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Base model&lt;/strong&gt;: A pre-trained transformer, commonly RoBERTa-base (125M params) or DeBERTa-v3 (300M+ params). These models already encode deep understanding of language patterns.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Classification head&lt;/strong&gt;: A linear layer (or small MLP) on top of the &lt;code&gt;[CLS]&lt;/code&gt; token representation that outputs a probability: &lt;code&gt;P(AI-generated | text)&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Training data&lt;/strong&gt;: Millions of paired samples. Human text from diverse sources (academic papers, Reddit posts, news articles, fiction). AI text generated by GPT-3.5, GPT-4, Claude, Llama, Gemini, and others across varied prompts and temperatures.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Fine-tuning&lt;/strong&gt;: Standard cross-entropy loss. The model learns subtle distributional features beyond perplexity and burstiness, including things like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ratio of content words to function words&lt;/li&gt;
&lt;li&gt;Distribution of rare vs. common vocabulary&lt;/li&gt;
&lt;li&gt;Paragraph-level structural patterns&lt;/li&gt;
&lt;li&gt;Positional patterns (AI intros and conclusions follow recognizable templates)
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simplified classifier architecture (PyTorch-style pseudo-code)
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AIDetector&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;encoder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RoBERTa&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;roberta-base&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;classifier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Linear&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;768&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# hidden_size -&amp;gt; binary
&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Tokenize and encode
&lt;/span&gt;        &lt;span class="n"&gt;hidden&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encoder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;tokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="n"&gt;cls_repr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hidden&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# [CLS] token
&lt;/span&gt;        &lt;span class="n"&gt;logit&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;classifier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cls_repr&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;sigmoid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logit&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# P(AI-generated)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The advantage of classifiers over raw perplexity scoring: they capture patterns that are hard to express as a single metric. The disadvantage: they inherit every bias in their training data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Statistical Watermarking
&lt;/h2&gt;

&lt;p&gt;Some AI providers embed invisible statistical watermarks during generation. The approach works by partitioning the vocabulary into "green" and "red" lists at each token position (using a hash of the preceding token as a seed), then biasing generation toward green-list tokens.&lt;/p&gt;

&lt;p&gt;A detector checks whether the proportion of green-list tokens is statistically improbable under random chance. If so, the text was likely generated by that specific model.&lt;/p&gt;

&lt;p&gt;Watermarking is the most reliable detection method when present, but it only works for models that implement it, breaks under paraphrasing or editing, and requires provider cooperation.&lt;/p&gt;
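&lt;p&gt;The statistical check itself is a one-sided z-test. The sketch below follows the green-list scheme described above; the green-list fraction (gamma) and token counts are made-up example values:&lt;/p&gt;

```python
# Sketch of green-list watermark detection. Under the null hypothesis
# (unwatermarked text), each token lands on the green list with
# probability gamma, so the green count is approximately binomial.
import math

def watermark_z_score(green_count, total_tokens, gamma=0.5):
    """z-score of the observed green-token count vs. random chance."""
    expected = gamma * total_tokens
    variance = total_tokens * gamma * (1 - gamma)
    return (green_count - expected) / math.sqrt(variance)

# Example: 135 green tokens out of 200 with a half-green vocabulary.
z = watermark_z_score(135, 200, gamma=0.5)
print(f"z = {z:.2f}")  # z = 4.95; a z-score this high is strong evidence
```

&lt;p&gt;A z-score around 2 could easily arise by chance; a score near 5 corresponds to a false positive rate well below one in a million, which is why watermark detection, when available, is far more trustworthy than perplexity-based scoring.&lt;/p&gt;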

&lt;h2&gt;
  
  
  Where Detection Breaks Down
&lt;/h2&gt;

&lt;p&gt;Every detection method has systematic failure modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Short text&lt;/strong&gt; (under 250 words): Not enough tokens to establish reliable statistical patterns. Detectors on short text are essentially guessing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Edited AI text&lt;/strong&gt;: Even moderate human editing disrupts the statistical fingerprint. Change 15-20% of the words and most detectors lose confidence.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Domain-specific writing&lt;/strong&gt;: Technical documentation, legal writing, and medical text naturally use predictable vocabulary and structure. Detectors conflate "domain-constrained" with "AI-generated."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Non-native English&lt;/strong&gt;: Simpler vocabulary and more regular grammar produce lower perplexity, overlapping with AI output distributions. Studies have found false positive rates above 60% for non-native writing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Temperature and sampling&lt;/strong&gt;: AI text generated with high temperature or nucleus sampling can have perplexity profiles that look human.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Confidence Score Trap
&lt;/h2&gt;

&lt;p&gt;When a detector reports "94% likely AI-generated," most people read that as "94% chance this is AI." That is not what it means. The score is the model's internal confidence, not the posterior probability of AI authorship given the base rate of AI text in the population being tested.&lt;/p&gt;

&lt;p&gt;This matters enormously. We will cover the math behind this (Bayes' theorem and the base rate fallacy) in the next article in this series.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical Takeaways for Developers
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Do not trust a single score.&lt;/strong&gt; Cross-reference multiple detectors. If they disagree, the text is in the gray zone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Understand the input constraints.&lt;/strong&gt; Anything under 250 words is unreliable. Longer is better.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Know what is being measured.&lt;/strong&gt; Perplexity and burstiness are proxies, not ground truth. They measure statistical properties that correlate with AI authorship but do not define it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build defensively.&lt;/strong&gt; If you are building tools that incorporate AI detection, expose confidence intervals, not point estimates. Communicate uncertainty honestly.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you want to test your own text, &lt;a href="https://metric37.com/detect" rel="noopener noreferrer"&gt;Metric37's free AI detector&lt;/a&gt; scores any text and breaks down the result. For programmatic access, the &lt;a href="https://metric37.com/api" rel="noopener noreferrer"&gt;Metric37 API&lt;/a&gt; provides detection scores alongside humanization in a single endpoint.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is Part 1 of a technical series on AI detection. Part 2 covers why false positives happen, with the probability math to prove it.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>nlp</category>
      <category>security</category>
    </item>
  </channel>
</rss>
