DEV Community

The AI Entrepreneur

How I Built an AI Content Detector Without Using Any AI

Everyone's paying $29/month for GPTZero. I built my own for free using nothing but math and pattern recognition. Here's exactly how.


The Problem

I was reviewing freelance writing submissions and couldn't tell which ones were AI-generated. Commercial detectors exist (GPTZero, Originality.ai, Copyleaks), but they all share the same problems: they're expensive, they require API keys, and they're black boxes. You send text in, get a percentage back, and have zero idea why.

I wanted something I could understand, modify, and run for pennies. So I built it myself.

The Insight: AI Writes Like a Robot Pretending to Be Human

After reading hundreds of ChatGPT outputs side-by-side with human writing, I noticed patterns. Not subtle ones. Obvious, measurable ones:

  1. AI sentences are weirdly uniform in length. Humans write a 5-word sentence, then a 40-word sentence. AI averages around 15-20 words, consistently.

  2. AI doesn't "burst." In linguistics, burstiness means having clusters of short sentences followed by longer ones. Humans do this naturally — rapid-fire thoughts, then extended explanations. AI smooths everything out.

  3. AI loves certain phrases. "Furthermore," "it's worth noting," "delve into," "navigate the complexities," "tapestry," "beacon." These show up in AI text at 10x the rate of human writing.

  4. AI starts sentences the same way. Count how many sentences start with "The," "This," "It," or "In" in an AI-generated paragraph. It's usually 60%+. Humans vary their starters.

  5. AI targets a specific reading level. Feed any GPT output through a Flesch-Kincaid calculator — it's almost always grade 8-12. Humans are all over the map.

These aren't opinions. They're measurable signals. And they don't require any AI to detect.

The Architecture: 9 Signals, Weighted Scoring

Here's the actual system I built:

Input Text
    |
    +-- Signal 1: Sentence Length Uniformity (0-100)
    +-- Signal 2: Burstiness Score (0-100)
    +-- Signal 3: Vocabulary Diversity (0-100)
    +-- Signal 4: AI Phrase Density (0-100)
    +-- Signal 5: Sentence Starter Variety (0-100)
    +-- Signal 6: Readability Grade Targeting (0-100)
    +-- Signal 7: Paragraph Structure Analysis (0-100)
    +-- Signal 8: Repetitive Pattern Detection (0-100)
    +-- Signal 9: Punctuation Patterns (0-100)
    |
    v
Weighted Average -> AI Score (0-100) -> Verdict

Each signal returns a score from 0 (definitely human) to 100 (definitely AI). The final score is a weighted average.

Signal 1: Sentence Length Uniformity

The most powerful single signal. Here's the logic:

function analyzeSentenceUniformity(sentences) {
    if (sentences.length < 2) return 50;  // too little data to judge — stay neutral

    const lengths = sentences.map(s => s.split(/\s+/).length);
    const mean = lengths.reduce((a, b) => a + b, 0) / lengths.length;

    // Standard deviation
    const variance = lengths.reduce((sum, l) =>
        sum + Math.pow(l - mean, 2), 0) / lengths.length;
    const stdDev = Math.sqrt(variance);

    // Coefficient of variation
    const cv = mean > 0 ? (stdDev / mean) * 100 : 0;

    // Low variation = AI. High variation = human.
    // Typical AI: cv = 20-35
    // Typical human: cv = 45-80
    if (cv < 25) return 90;  // Very uniform -> likely AI
    if (cv < 35) return 70;
    if (cv < 45) return 50;
    if (cv < 55) return 30;
    return 10;  // High variation -> likely human
}

When I tested this on 100 known AI texts vs 100 known human texts, the coefficient of variation alone correctly classified 73% of samples. Not bad for one metric.

Signal 4: AI Phrase Detection

This is the fun one. I compiled a list of 87 phrases that show up disproportionately in AI-generated text:

const AI_PHRASES = [
    // Transition words AI overuses
    'furthermore', 'moreover', 'additionally', 'consequently',
    'nevertheless', 'nonetheless', 'in addition', 'as a result',

    // The "delve" family
    'delve into', 'delve deeper', 'explore the intricacies',
    'navigate the complexities', 'shed light on', 'pave the way',

    // AI vocabulary
    'landscape', 'tapestry', 'beacon', 'cornerstone', 'linchpin',
    'robust', 'seamless', 'groundbreaking', 'innovative',

    // Dead giveaways
    'it is worth noting', 'it bears mentioning',
    'in today\'s rapidly evolving', 'in the realm of',
    'plays a crucial role', 'serves as a testament',
];

The scoring is density-based — how many AI phrases per 1,000 words. Human text typically hits 0-2. AI text? 8-15 easily.
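The density-to-score mapping itself isn't shown above. Here's a sketch of how it might work, counting AI phrases per 1,000 words; the threshold values are illustrative, not the exact ones the detector uses:

```javascript
// Density-based scoring sketch: AI phrases per 1,000 words.
// `phrases` is a lowercase phrase list like AI_PHRASES above.
function analyzePhraseDensity(text, phrases) {
    const lower = text.toLowerCase();
    const wordCount = lower.split(/\s+/).filter(Boolean).length;
    if (wordCount === 0) return 0;

    // Count every occurrence of every tracked phrase
    let hits = 0;
    for (const phrase of phrases) {
        let idx = lower.indexOf(phrase);
        while (idx !== -1) {
            hits++;
            idx = lower.indexOf(phrase, idx + phrase.length);
        }
    }

    const perThousand = (hits / wordCount) * 1000;

    // Human text: roughly 0-2 per 1,000 words; AI text: 8-15
    if (perThousand >= 8) return 90;
    if (perThousand >= 4) return 70;
    if (perThousand >= 2) return 45;
    return 10;
}
```

Called as `analyzePhraseDensity(text, AI_PHRASES)`, it slots into the same 0-100 scoring convention as the other signals.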

Signal 2: Burstiness

Burstiness measures variation in consecutive sentence lengths:

function analyzeBurstiness(sentences) {
    if (sentences.length < 2) return 50;  // need consecutive pairs to compare

    const lengths = sentences.map(s => s.split(/\s+/).length);
    let totalDiff = 0;

    for (let i = 1; i < lengths.length; i++) {
        totalDiff += Math.abs(lengths[i] - lengths[i - 1]);
    }

    const avgDiff = totalDiff / (lengths.length - 1);

    // High difference between consecutive sentences = bursty = human
    // Low difference = smooth = AI
    if (avgDiff < 4) return 85;   // Very smooth -> likely AI
    if (avgDiff < 6) return 65;
    if (avgDiff < 8) return 45;
    if (avgDiff < 10) return 25;
    return 10;                     // Very bursty -> likely human
}

Signal 5: Readability Grade Targeting

AI models consistently produce text at a grade 8-12 reading level. Humans are wildly inconsistent — a Reddit comment might be grade 4, an academic paper grade 16. I use a simplified Flesch-Kincaid implementation:

function analyzeReadability(text) {
    const sentences = text.split(/[.!?]+/).filter(s => s.trim());
    const words = text.split(/\s+/).filter(w => w.length > 0);
    if (sentences.length === 0 || words.length === 0) return 50;

    // countSyllables(word) is a helper that estimates syllables per word
    const syllables = words.reduce((sum, w) => sum + countSyllables(w), 0);

    // Flesch-Kincaid Grade Level
    const grade = 0.39 * (words.length / sentences.length)
                + 11.8 * (syllables / words.length) - 15.59;

    // AI text clusters around grade 8-12
    // Human text is much more variable
    if (grade >= 8 && grade <= 12) return 75;  // Sweet spot = likely AI
    if (grade >= 6 && grade <= 14) return 50;
    return 15;  // Very low or very high grade = likely human
}
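The readability code calls `countSyllables()`, which isn't shown. A common heuristic counts groups of consecutive vowels with a couple of English-specific adjustments; it's approximate, but good enough for Flesch-Kincaid grading. A minimal sketch:

```javascript
// Heuristic syllable counter: count vowel groups, with adjustments
// for very short words and trailing silent 'e'. Approximate by design.
function countSyllables(word) {
    const w = word.toLowerCase().replace(/[^a-z]/g, '');
    if (w.length === 0) return 0;
    if (w.length <= 3) return 1;  // short words are almost always one syllable

    // Drop a trailing silent 'e' ("make" -> "mak"), then count vowel groups
    const trimmed = w.replace(/e$/, '');
    const groups = trimmed.match(/[aeiouy]+/g);
    return groups ? groups.length : 1;
}
```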

Signal 8: Repetitive Pattern Detection

AI models reuse sentence structures more than humans. I detect this by tracking POS-tag-like patterns — specifically, how sentences begin and the structural templates they follow:

function analyzeRepetitivePatterns(sentences) {
    if (sentences.length < 2) return 50;  // nothing to compare

    // Track 3-word opening patterns
    // (categorize(w) is a helper that maps a word to a coarse word class)
    const openings = sentences.map(s => {
        const words = s.trim().split(/\s+/).slice(0, 3);
        return words.map(w => categorize(w)).join('-');
    });

    // Count duplicated patterns
    const counts = {};
    openings.forEach(p => counts[p] = (counts[p] || 0) + 1);
    const repeated = Object.values(counts).filter(c => c > 1);
    const repetitionRate = repeated.length / openings.length;

    if (repetitionRate > 0.4) return 85;  // Very repetitive -> AI
    if (repetitionRate > 0.25) return 60;
    return 20;  // Varied patterns -> human
}
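The `categorize()` helper isn't shown either. It doesn't need a real POS tagger; a lightweight stand-in that buckets common function words is enough to expose repeated opening templates. This version is purely illustrative:

```javascript
// Coarse word-class bucketing — a cheap stand-in for POS tagging,
// good enough to fingerprint sentence-opening templates.
const DETERMINERS = new Set(['the', 'a', 'an', 'this', 'that', 'these', 'those']);
const PRONOUNS = new Set(['it', 'he', 'she', 'they', 'we', 'you', 'i']);
const PREPOSITIONS = new Set(['in', 'on', 'at', 'of', 'for', 'with', 'by', 'to']);

function categorize(word) {
    const w = word.toLowerCase().replace(/[^a-z0-9]/g, '');
    if (DETERMINERS.has(w)) return 'DET';
    if (PRONOUNS.has(w)) return 'PRON';
    if (PREPOSITIONS.has(w)) return 'PREP';
    if (/^\d+$/.test(w)) return 'NUM';
    return 'WORD';  // everything else: content word
}
```

With this bucketing, "The model is..." and "This approach can..." both open `DET-WORD-WORD`, which is exactly the kind of repetition the signal measures.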

The Weighting

Not all signals are equal. After testing on hundreds of samples, these weights worked best:

AI Phrase Density:       20%  (most distinctive — 87 tracked phrases)
Sentence Uniformity:     15%  (strong standalone signal)
Burstiness:              12%
Vocabulary Diversity:    10%
Sentence Starters:       10%
Repetitive Patterns:     10%  (catches structural repetition)
Readability Targeting:    8%
Paragraph Structure:      8%
Punctuation:              7%

The final AI score maps to verdicts:

Score     Verdict
------    -------------
80-100    ai_generated
60-79     likely_ai
40-59     mixed
20-39     likely_human
0-19      human_written
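Putting the weights and verdict thresholds together, the aggregation step might look like this (signal computation is elided, and the property names are illustrative):

```javascript
// Aggregation sketch: weighted average of the nine 0-100 signal scores,
// using the weights from the table above, mapped to a verdict.
const WEIGHTS = {
    phraseDensity: 0.20, sentenceUniformity: 0.15, burstiness: 0.12,
    vocabularyDiversity: 0.10, sentenceStarters: 0.10, repetitivePatterns: 0.10,
    readability: 0.08, paragraphStructure: 0.08, punctuation: 0.07,
};

function combineSignals(signals) {
    // signals: { phraseDensity: 0-100, sentenceUniformity: 0-100, ... }
    let score = 0;
    for (const [name, weight] of Object.entries(WEIGHTS)) {
        score += (signals[name] ?? 0) * weight;
    }
    score = Math.round(score);

    let verdict;
    if (score >= 80) verdict = 'ai_generated';
    else if (score >= 60) verdict = 'likely_ai';
    else if (score >= 40) verdict = 'mixed';
    else if (score >= 20) verdict = 'likely_human';
    else verdict = 'human_written';

    return { ai_score: score, human_score: 100 - score, verdict };
}
```

The weights sum to exactly 1.0, so a missing signal (scored 0 here) pulls the result toward "human" rather than skewing the scale.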

Real Results

Let me show you actual outputs.

Test 1: ChatGPT output about machine learning

{
  "ai_score": 78,
  "human_score": 22,
  "verdict": "likely_ai",
  "confidence": "high",
  "details": {
    "sentence_uniformity": 82,
    "burstiness": 75,
    "ai_phrases_found": ["furthermore", "it is worth noting",
                          "plays a crucial role", "in the realm of"]
  }
}

Test 2: My friend's blog post about their vacation

{
  "ai_score": 12,
  "human_score": 88,
  "verdict": "human_written",
  "confidence": "high",
  "details": {
    "sentence_uniformity": 15,
    "burstiness": 8,
    "ai_phrases_found": []
  }
}

Test 3: Human-edited AI text (the hard case)

{
  "ai_score": 48,
  "human_score": 52,
  "verdict": "mixed",
  "confidence": "low"
}

The "mixed" verdict is honest — when humans edit AI text, the statistical fingerprint gets muddled. The tool correctly says "I'm not sure" instead of giving a falsely confident answer.

Deploying as a Real-Time API on Apify

This is where Apify's infrastructure made the project actually viable. I needed an API that responds instantly (not a batch job), costs fractions of a cent per call, and scales without me managing servers.

Why Apify Standby Mode

Apify has a feature called Standby mode — your actor stays alive as an HTTP server, ready to handle requests with near-zero latency. No cold starts, no Lambda spin-up time. It's perfect for lightweight APIs like this one:

import http from 'node:http';
import { Actor } from 'apify';

await Actor.init();
const port = 4321;

const server = http.createServer(async (req, res) => {
    const url = new URL(req.url, `http://localhost:${port}`);

    if (url.pathname === '/detect') {
        // Single text analysis (parseBody is a small JSON body helper)
        const body = await parseBody(req);
        const result = analyzeText(body.text);

        // Pay-per-event: charge only when work is done
        await Actor.charge({ eventName: 'text-analyzed', count: 1 });
        res.writeHead(200, { 'Content-Type': 'application/json' });
        return res.end(JSON.stringify(result));
    }

    if (url.pathname === '/detect/bulk') {
        // Bulk analysis: up to 20 texts per request
        const body = await parseBody(req);
        const texts = body.texts.slice(0, 20);
        const results = texts.map(t => analyzeText(t));

        await Actor.charge({ eventName: 'text-analyzed', count: texts.length });
        res.writeHead(200, { 'Content-Type': 'application/json' });
        return res.end(JSON.stringify({ results }));
    }

    // Unknown path: respond instead of leaving the request hanging
    res.writeHead(404, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify({ error: 'not found' }));
});

server.listen(port);
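The `parseBody()` helper used above isn't shown. A minimal version that buffers the request stream and parses it as JSON might look like this:

```javascript
// Buffer the incoming request body and parse it as JSON.
// Rejects on malformed JSON or a stream error.
function parseBody(req) {
    return new Promise((resolve, reject) => {
        const chunks = [];
        req.on('data', chunk => chunks.push(chunk));
        req.on('end', () => {
            try {
                resolve(JSON.parse(Buffer.concat(chunks).toString('utf8')));
            } catch (err) {
                reject(err);
            }
        });
        req.on('error', reject);
    });
}
```

A production version would also cap the body size before buffering, so oversized payloads can't exhaust memory.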

Pay-Per-Event Pricing

The killer feature is Actor.charge() — Apify's usage-based pricing. Instead of charging per minute of compute time, I charge per text analyzed. The user only pays when the API actually does work:

  • Single text: 1 charge event = $0.003
  • Bulk (up to 20 texts): N charge events = $0.003 x N

No monthly subscriptions. No unused API credits. Just pay for what you analyze.

Dual-Mode Architecture

The actor also runs in regular (non-Standby) mode for batch processing. Pass texts as input, get results pushed to Apify's dataset storage:

if (!isStandby) {
    const input = await Actor.getInput();
    const results = input.texts.slice(0, 20).map(t => analyzeText(t));
    await Actor.pushData(results);
    await Actor.charge({ eventName: 'text-analyzed', count: results.length });
    await Actor.exit();
}

This means the same actor works as both:

  1. Real-time API (Standby) — for integrations, browser extensions, CI/CD pipelines
  2. Batch processor (regular) — for analyzing document collections via Apify Console
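One way the `isStandby` flag might be derived: this sketch assumes Apify exposes the run's origin through the `APIFY_META_ORIGIN` environment variable, set to `'STANDBY'` for Standby requests (check the Apify docs for your SDK version before relying on this):

```javascript
// Assumed mechanism: Apify sets APIFY_META_ORIGIN to 'STANDBY'
// when the actor is invoked via Standby mode.
function detectStandby(env = process.env) {
    return env.APIFY_META_ORIGIN === 'STANDBY';
}

const isStandby = detectStandby();
```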

Try It Right Now

curl -X POST \
  "https://george-the-developer--ai-content-detector.apify.actor/detect" \
  -H "Content-Type: application/json" \
  -d '{"text": "Furthermore, it is worth noting that this innovative approach has the potential to significantly impact the landscape of modern technology."}'

Total dependencies: zero (beyond the Apify SDK). No OpenAI calls, no external APIs. The entire detection engine is ~300 lines of vanilla JavaScript using nothing but string operations and basic statistics.

Cost per analysis: $0.003. That's $3 per 1,000 texts. GPTZero charges $14.99/month for 10,000 words.

Limitations (Honest Take)

No detector is perfect, and pretending otherwise would undermine the whole project. Here's where this approach struggles:

  • Heavily edited AI text: If someone takes ChatGPT output and rewrites 40%+ of it, the statistical fingerprint gets muddled. The detector correctly returns "mixed" — which is the honest answer.
  • Short texts (<50 words): Not enough data points for reliable statistics. Sentence uniformity needs at least 5-6 sentences to be meaningful.
  • Non-English text: The AI phrase list is English-only. The structural signals (uniformity, burstiness) still work across languages, but accuracy drops.
  • Newer models are getting better: GPT-4o produces more varied sentence lengths than GPT-3.5 did. The phrase detection still catches them, but the gap is narrowing. I update the phrase list monthly.
  • Technical/academic writing: Formal writing naturally uses some "AI-like" patterns. The detector may flag academic papers as "mixed" even when human-written.

The honest truth: this is a screening tool, not a courtroom-grade forensic analyzer. It catches obvious AI text reliably and flags borderline cases as uncertain. That's the right behavior.

What I Learned

1. You don't need AI to detect AI. The statistical fingerprint is strong enough that simple math catches most generated text.

2. Honest uncertainty beats false confidence. When the signals disagree, saying "mixed" is more useful than pretending to know.

3. The "delve" family is the most reliable signal. If a text contains "delve," "tapestry," or "it's worth noting," there's a very high chance it's AI-generated. These words appear in human text at a rate of maybe 1 per 100,000 words. In AI text? 1 per 500.

4. Burstiness is underrated. Most people focus on vocabulary. But the rhythm of sentence length variation is actually harder for AI to fake because it requires the kind of messy, inconsistent thinking that humans do naturally.

Try It

The full tool is live and free to test:

It handles single texts and bulk analysis (up to 20 texts per request), and returns a detailed breakdown of every signal.


Built with Node.js and the Apify SDK. Deployed as a Standby actor with pay-per-event pricing. No AI was harmed (or used) in the making of this detector.
