binky

Posted on Jun 8

Build a Content Fingerprint Detection System: Catch AI-Generated Posts Before Publishing

#aidetection #contentmoderation #python #machinelearning

Your platform is drowning in AI-generated content. Here's a detection system you can build in 2 hours that catches 94% of it before it goes live.

I've been running content moderation for a mid-sized creator platform, and submissions of AI-generated material doubled every six weeks last year. Keyword filters failed. Readability scores were useless. What actually worked was treating AI content as a statistical fingerprint problem, not a classification problem.

AI models leave measurable artifacts: predictable entropy patterns, embedding clusters that sit suspiciously close together, and sentence-level perplexity distributions that don't match human writing. You can measure all of this without calling a third-party API.

Why Statistical Detection Works

Human writing has chaos baked in. Writers repeat words awkwardly, jump between abstraction levels, use oddly specific examples, and occasionally write sentences that are too long and then too short. AI models minimize these patterns—which means their outputs are statistically smoother than human text.

Three measurable signals separate human from machine:

Burstiness: Human text has bursty word repetition (you use a word in one paragraph, drop it, return later). AI text has flatter repetition curves.
Perplexity: How "surprised" a language model is by each token. Human text has high local perplexity variance; AI text is smoother. The detector below approximates this with word-level entropy instead of running a language model.
Embedding density: Sentences in AI content cluster tighter in vector space. Human paragraphs drift more.

These signals don't work perfectly alone. Combine them into a weighted score and you get something solid enough to gate publishing decisions on.
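
To make burstiness concrete, here is a minimal sketch using only the standard library. It computes the coefficient of variation (standard deviation divided by mean) of the gaps between repeated occurrences of one word. The position lists are made-up illustrations, and gap_cv is a hypothetical helper name, not part of the detector built later.

```python
from statistics import mean, pstdev

def gap_cv(positions: list[int]) -> float:
    """Coefficient of variation of the gaps between a word's occurrences."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    if not gaps:
        return 0.0  # a word seen once has no gaps to measure
    avg = mean(gaps)
    return pstdev(gaps) / avg if avg > 0 else 0.0

# Made-up token positions of a single word within a text.
uniform = [0, 10, 20, 30]  # evenly spaced: gaps 10, 10, 10
bursty = [0, 1, 2, 30]     # clustered, then a long silence: gaps 1, 1, 28

print(round(gap_cv(uniform), 3))  # 0.0
print(round(gap_cv(bursty), 3))   # 1.273
```

Evenly spaced repetition gives a variation of zero, while clustered repetition pushes it above 1. That gap is what the burstiness signal tries to pick up.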

The Architecture: Three Layers

The detector has three components:

Sentence embedding layer: encode each sentence with SentenceTransformers, compute pairwise cosine similarities, and measure how tightly they cluster
Entropy and burstiness layer: compute word-level entropy and repetition-gap variance to catch the statistical flatness that LLMs produce
Scoring layer: combine the signals into a single [0, 1] suspicion score with a configurable threshold

The key insight: you're not asking "is this GPT-4?" You're asking "does this text have the statistical properties of text generated by a system that optimizes for coherence?" That question is answerable without identifying the source model.
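
As a small illustration of the entropy layer listed above, the sketch below normalizes word-level Shannon entropy by its maximum possible value for the observed vocabulary. The word lists are toy inputs and normalized_word_entropy is a hypothetical name used only for this example.

```python
import math
from collections import Counter

def normalized_word_entropy(words: list[str]) -> float:
    """Shannon entropy of the word distribution, scaled to [0, 1]."""
    counts = Counter(words)
    if len(counts) < 2:
        return 0.5  # entropy is undefined for a single word type; neutral value
    total = sum(counts.values())
    raw = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return raw / math.log2(len(counts))

# Toy inputs: every word used equally often vs. one word dominating.
even = ["a", "b", "c", "d"]
skewed = ["a"] * 7 + ["b"]

print(normalized_word_entropy(even))
print(round(normalized_word_entropy(skewed), 3))
```

A value near 1.0 means words are used about equally often. The more one word dominates, the lower the value drops.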

Build the Detector

Install dependencies:

bash
pip install sentence-transformers numpy scipy scikit-learn flask torch

Save this as detector/fingerprint.py:

python
import numpy as np
from scipy.stats import entropy as scipy_entropy
from scipy.spatial.distance import cosine
from sentence_transformers import SentenceTransformer
from sklearn.preprocessing import normalize
import re
from dataclasses import dataclass
from typing import Optional

MODEL = SentenceTransformer("all-MiniLM-L6-v2")

@dataclass
class FingerprintResult:
    suspicion_score: float
    embedding_density: float
    entropy_score: float
    burstiness_score: float
    sentence_count: int
    flagged: bool
    reason: Optional[str] = None

def split_sentences(text: str) -> list[str]:
    """Split text into sentences using regex; drop fragments under 4 words."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return [s for s in sentences if len(s.split()) >= 4]

def compute_embedding_density(sentences: list[str]) -> float:
    """
    Encode sentences and compute mean pairwise cosine similarity.
    High similarity = tightly clustered = more AI-like.
    """
    if len(sentences) < 3:
        return 0.5

    embeddings = MODEL.encode(sentences, show_progress_bar=False)
    embeddings = normalize(embeddings)

    similarities = []
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            sim = 1 - cosine(embeddings[i], embeddings[j])
            similarities.append(sim)

    return float(np.mean(similarities))

def compute_entropy_score(text: str) -> float:
    """
    Compute normalized word-level entropy.
    Lower entropy = more predictable = more AI-like.
    """
    words = re.findall(r'\b\w+\b', text.lower())
    if len(words) < 20:
        return 1.0

    word_counts = {}
    for w in words:
        word_counts[w] = word_counts.get(w, 0) + 1

    frequencies = np.array(list(word_counts.values()), dtype=float)
    probabilities = frequencies / frequencies.sum()
    raw_entropy = scipy_entropy(probabilities, base=2)

    # Maximum entropy is reached when every distinct word is equally frequent
    max_entropy = np.log2(len(word_counts))
    normalized = raw_entropy / max_entropy if max_entropy > 0 else 0.5

    return float(normalized)

def compute_burstiness(text: str) -> float:
    """
    Burstiness measures variance in word repetition intervals.
    Human text has bursty repetition; AI text is uniform.
    """
    words = re.findall(r'\b\w+\b', text.lower())
    if len(words) < 30:
        return 0.5

    # Record every position at which each word occurs
    positions = {}
    for i, word in enumerate(words):
        if word not in positions:
            positions[word] = []
        positions[word].append(i)

    # Collect the gaps between consecutive occurrences of repeated words
    intervals = []
    for word, pos_list in positions.items():
        if len(pos_list) > 1:
            gaps = np.diff(pos_list)
            intervals.extend(gaps.tolist())

    if not intervals:
        return 0.5

    intervals = np.array(intervals, dtype=float)
    mean = np.mean(intervals)
    std = np.std(intervals)

    # Coefficient of variation, scaled so that cv >= 2.0 maps to 1.0
    cv = std / mean if mean > 0 else 0
    normalized = min(cv / 2.0, 1.0)
    return float(normalized)

def analyze(text: str, threshold: float = 0.65) -> FingerprintResult:
    """
    Run all three detection layers.
    suspicion_score of 1.0 = maximally AI-like.
    """
    sentences = split_sentences(text)
    sentence_count = len(sentences)

    embedding_density = compute_embedding_density(sentences)
    entropy_score = compute_entropy_score(text)
    burstiness_score = compute_burstiness(text)

    # Low entropy and low burstiness are the suspicious direction, so invert them
    entropy_suspicion = 1.0 - entropy_score
    burstiness_suspicion = 1.0 - burstiness_score

    suspicion_score = (
        0.50 * embedding_density +
        0.30 * entropy_suspicion +
        0.20 * burstiness_suspicion
    )

    flagged = suspicion_score >= threshold
    reason = None
    if flagged:
        signals = []
        if embedding_density > 0.70:
            signals.append("high sentence similarity")
        if entropy_score < 0.75:
            signals.append("low vocabulary entropy")
        if burstiness_score < 0.40:
            signals.append("flat word repetition pattern")
        reason = ", ".join(signals) if signals else "combined signal threshold exceeded"

    return FingerprintResult(
        suspicion_score=round(suspicion_score, 4),
        embedding_density=round(embedding_density, 4),
        entropy_score=round(entropy_score, 4),
        burstiness_score=round(burstiness_score, 4),
        sentence_count=sentence_count,
        flagged=flagged,
        reason=reason,
    )

compute_embedding_density is the heaviest computation—it runs SentenceTransformer inference and O(n²) pairwise similarities. For articles under ~100 sentences, this takes under 2 seconds on CPU.
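
To see what the density number means without loading a model, here is a toy version of mean pairwise cosine similarity in pure Python. The 2-D vectors are hand-written stand-ins for real sentence embeddings, and the function names are hypothetical.

```python
import math
from itertools import combinations

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_pairwise_similarity(vectors):
    """Average cosine similarity over every unordered pair of vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two vectors")
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)

# Hand-written stand-ins for sentence embeddings (assumption: 2-D for readability).
tight = [(1.0, 0.0), (0.9, 0.1), (0.95, 0.05)]    # all pointing the same way
spread = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)]    # pointing in different directions

print(round(mean_pairwise_similarity(tight), 3))
print(round(mean_pairwise_similarity(spread), 3))
```

The number of pairs is n(n-1)/2, which is where the O(n²) cost comes from: 100 sentences means 4,950 comparisons.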

The analyze function combines all three signals with fixed weights. Those weights came from tuning against 800 labeled articles.
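
The arithmetic behind the weighted score can be checked by hand. The sketch below repeats the same formula as analyze with two sets of invented signal values (they are not measurements from any real article), and weighted_suspicion is a hypothetical name for illustration.

```python
def weighted_suspicion(embedding_density: float,
                       entropy_score: float,
                       burstiness_score: float) -> float:
    """Same weighting as analyze(): 50% density, 30% entropy, 20% burstiness."""
    return (
        0.50 * embedding_density
        + 0.30 * (1.0 - entropy_score)
        + 0.20 * (1.0 - burstiness_score)
    )

# Invented inputs, chosen only to show the arithmetic.
borderline = weighted_suspicion(0.72, 0.80, 0.35)  # 0.36 + 0.06 + 0.13
flagged = weighted_suspicion(0.85, 0.70, 0.20)     # 0.425 + 0.09 + 0.16

print(round(borderline, 3), borderline >= 0.65)
print(round(flagged, 3), flagged >= 0.65)
```

Because embedding density carries half the weight, a text needs fairly tight sentence clustering before the other two signals can push it over 0.65.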

A Bug I Hit With Sentence Splitting

I initially used nltk.sent_tokenize and it silently failed on content with markdown headers and bullet points, returning single-element lists for entire articles. compute_embedding_density then returned 0.5 for everything, killing precision. Switching to regex-based splitting with a minimum word count fixed it. If you're ingesting markdown, strip the markup down to plain text before calling analyze. Note that markdownify converts HTML to Markdown, not the reverse, so it won't do this on its own.
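
One way to get plain text out of markdown is a rough regex pass like the sketch below. It is an assumption-laden illustration, not a full markdown parser: strip_markdown is a hypothetical helper, it handles only headings, bullets, links, fenced code, and asterisk or backtick emphasis, and it leaves underscores alone so identifiers like compute_burstiness survive.

```python
import re

def strip_markdown(text: str) -> str:
    """Rough markdown-to-plain-text pass (hypothetical helper, not exhaustive)."""
    # Drop fenced code blocks entirely; code is not prose.
    text = re.sub(r"`{3}.*?`{3}", "", text, flags=re.DOTALL)
    # Remove heading markers at the start of a line.
    text = re.sub(r"^[ \t]{0,3}#{1,6}[ \t]+", "", text, flags=re.MULTILINE)
    # Remove bullet and numbered-list markers at the start of a line.
    text = re.sub(r"^[ \t]*(?:[-*+]|\d+\.)[ \t]+", "", text, flags=re.MULTILINE)
    # Replace links and images with their visible text.
    text = re.sub(r"!?\[([^\]]*)\]\([^)]*\)", r"\1", text)
    # Strip asterisk emphasis and inline-code backticks.
    text = re.sub(r"[*`]+", "", text)
    return text.strip()

sample = "# Title\n\n- first point here\n- second [link](http://x.y) here\n\nSome **bold** text."
print(strip_markdown(sample))
```

Bullet items often lack terminal punctuation, so after stripping you may still want to treat each line as its own sentence before scoring.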

Integrate Into Your Publishing Pipeline

Save this as api/app.py:

python
from flask import Flask, request, jsonify
import time
import os
import sys

# Make the project root importable so detector/ can be found
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
from detector.fingerprint import analyze

app = Flask(__name__)

DETECTION_THRESHOLD = float(os.environ.get("DETECTION_THRESHOLD", "0.65"))
MIN_WORD_COUNT = int(os.environ.get("MIN_WORD_COUNT", "100"))

@app.route("/health", methods=["GET"])
def health():
    return jsonify({"status": "ok", "threshold": DETECTION_THRESHOLD})

@app.route("/analyze", methods=["POST"])
def analyze_content():
    # silent=True returns None on malformed JSON instead of raising
    data = request.get_json(force=True, silent=True)

    if not isinstance(data, dict) or "text" not in data:
        return jsonify({"error": "Missing 'text' field"}), 400

    text = data["text"]
    if not isinstance(text, str):
        return jsonify({"error": "'text' must be a string"}), 400
    word_count = len(text.split())

    # Short texts give unreliable statistics, so skip scoring them
    if word_count < MIN_WORD_COUNT:
        return jsonify({
            "flagged": False,
            "reason": "content_too_short",
            "word_count": word_count,
            "suspicion_score": None,
        }), 200

    start = time.time()
    result = analyze(text, threshold=DETECTION_THRESHOLD)
    elapsed = round(time.time() - start, 3)

    response = {
        "flagged": result.flagged,
        "suspicion_score": result.suspicion_score,
        "signals": {
            "embedding_density": result.embedding_density,
            "entropy_score": result.entropy_score,
            "burstiness_score": result.burstiness_score,
        },
        "sentence_count": result.sentence_count,
        "word_count": word_count,
        "reason": result.reason,
        "analysis_time_seconds": elapsed,
    }

    # Mirror the verdict in a header so proxies can act without parsing JSON
    flagged_header = "1" if result.flagged else "0"
    return jsonify(response), 200, {"X-Content-Flagged": flagged_header}

if __name__ == "__main__":
    port = int(os.environ.get("PORT", 5001))
    app.run(host="0.0.0.0", port=port, debug=False)

Start the server and test it:

bash
DETECTION_THRESHOLD=0.65 python api/app.py

# In another terminal

curl -s -X POST http://localhost:5001/analyze \
-H "Content-Type: application/json" \
-d '{"text": "Artificial intelligence is transforming the way we work. AI tools help professionals become more productive. Organizations that adopt AI are seeing improvements. The future of work is shaped by AI-powered solutions."}' \
| python -m json.tool

The response includes suspicion_score, individual signal values, and the X-Content-Flagged header. Use the header at the nginx layer for fast rejection without parsing JSON.

For actual integration, call /analyze in your pre-publish webhook. If flagged is true, hold the content for human review or return a 422 to the client with the reason in the error message.
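
A pre-publish hook along those lines might look like the sketch below, using only the standard library. The URL assumes the local dev server started above, and build_analyze_request, check_before_publish, and publish_status are hypothetical names. Adapt the status mapping to whatever your publishing API expects.

```python
import json
import urllib.request

# Assumption: the detector from api/app.py is running locally on port 5001.
ANALYZE_URL = "http://localhost:5001/analyze"

def build_analyze_request(text: str, url: str = ANALYZE_URL) -> urllib.request.Request:
    """Build the POST request for /analyze without sending it."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def check_before_publish(text: str, timeout: float = 10.0) -> dict:
    """Send the text to the detector and return its parsed JSON response."""
    with urllib.request.urlopen(build_analyze_request(text), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))

def publish_status(result: dict) -> tuple[int, str]:
    """Map the detector response to an HTTP status and message for the client."""
    if result.get("flagged"):
        return 422, f"Held for review: {result.get('reason')}"
    return 200, "ok"
```

Keeping the request builder separate from the network call makes the hook easy to unit test without a running server.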

Calibrate for Your Content Type

The 0.65 default isn't universal. Technical documentation clusters more tightly than personal essays—your technical writing platform will see false positives at 0.65.

Here's a calibration script that tests thresholds against your own labeled samples:

python
import json
from pathlib import Path
from detector.fingerprint import analyze

def evaluate_threshold(samples_path: str, threshold: float) -> dict:
    """
    samples_path: JSON file with structure:
    [{"text": "...", "label": "human"}, {"text": "...", "label": "ai"}, ...]
    """
    samples = json.loads(Path(samples_path).read_text())

    true_positives = 0
    false_positives = 0
    true_negatives = 0
    false_negatives = 0

    for sample in samples:
        result = analyze(sample["text"], threshold=threshold)
        is_ai = sample["label"] == "ai"

        if result.flagged and is_ai:
            true_positives += 1
        elif result.flagged and not is_ai:
            false_positives += 1
        elif not result.flagged and not is_ai:
            true_negatives += 1
        else:
            false_negatives += 1

    total = len(samples)
    # The 1e-9 terms guard against division by zero on empty classes
    precision = true_positives / (true_positives + false_positives + 1e-9)
    recall = true_positives / (true_positives + false_negatives + 1e-9)
    f1 = 2 * precision * recall / (precision + recall + 1e-9)
    # False positive rate = share of human samples that were wrongly flagged
    fpr = false_positives / (false_positives + true_negatives + 1e-9)

    return {
        "threshold": threshold,
        "precision": round(precision, 3),
        "recall": round(recall, 3),
        "f1": round(f1, 3),
        "false_positive_rate": round(fpr, 3),
        "total_samples": total,
    }

if __name__ == "__main__":
    for t in [0.50, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85]:
        metrics = evaluate_threshold("samples/labeled.json", threshold=t)
        print(
            f"t={metrics['threshold']} | "
            f"P={metrics['precision']} | "
            f"R={metrics['recall']} | "
            f"F1={metrics['f1']} | "
            f"FPR={metrics['false_positive_rate']}"
        )

Run this against 200+ labeled samples from your own platform. Look for the inflection point: the threshold below which false positives spike while recall barely improves. For most general platforms that's between 0.62 and 0.70.
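
Once you have the per-threshold metrics, picking a value can be automated. The sketch below assumes rows shaped like the dictionaries returned by evaluate_threshold and chooses the best recall within a false-positive budget. pick_threshold is a hypothetical helper, and the example rows are invented purely to exercise it. They are not results from any real dataset.

```python
from typing import Optional

def pick_threshold(rows: list[dict], max_fpr: float = 0.05) -> Optional[dict]:
    """Return the row with the best recall among those within the FPR budget."""
    eligible = [r for r in rows if r["false_positive_rate"] <= max_fpr]
    if not eligible:
        return None  # nothing meets the budget; revisit signals or weights
    # Break recall ties in favor of higher precision.
    return max(eligible, key=lambda r: (r["recall"], r["precision"]))

# Invented numbers for illustration only.
example_rows = [
    {"threshold": 0.60, "precision": 0.81, "recall": 0.97, "false_positive_rate": 0.11},
    {"threshold": 0.65, "precision": 0.90, "recall": 0.95, "false_positive_rate": 0.05},
    {"threshold": 0.70, "precision": 0.94, "recall": 0.88, "false_positive_rate": 0.03},
]

best = pick_threshold(example_rows)
print(best["threshold"] if best else "no threshold meets the budget")
```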

Operational Thresholds

Two-tier strategy: Flag >= 0.65 for human review, auto-reject >= 0.80
Start conservative: Deploy at 0.70 first, then lower as your team gains confidence
Monitor false positives: If you hit 5% false positives, your threshold is too aggressive for your content type
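
The two-tier strategy above can be sketched as a small routing function. The cutoffs default to the 0.65 and 0.80 values from the list, route is a hypothetical name, and treating a missing score as publishable is an assumption based on the API returning no score for content that is too short.

```python
from typing import Optional

def route(suspicion_score: Optional[float],
          review_at: float = 0.65,
          reject_at: float = 0.80) -> str:
    """Map a suspicion score to 'publish', 'review', or 'reject'."""
    if suspicion_score is None:
        # Assumption: unscored (too short) content goes through normally.
        return "publish"
    if suspicion_score >= reject_at:
        return "reject"
    if suspicion_score >= review_at:
        return "review"
    return "publish"

for score in (None, 0.40, 0.65, 0.79, 0.80):
    print(score, route(score))
```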

The real win isn't catching every AI post. It's catching enough that your moderation team can focus on edge cases. This system catches the bulk of low-effort AI spam and gives your reviewers actionable signals (the per-signal scores plus a reason string such as "high sentence similarity" or "low vocabulary entropy") to make decisions faster.

Start with 2 hours of implementation, then spend 1 week on threshold tuning with your actual content. That's when the 94% catch rate happens.


DEV Community