DEV Community

Mychel Garzon

I Built an AI Lie Detector Using Stylometric Forensics and a Three-Agent Debate System

Or: How I got Claude, Gemini, and GPT-4 to fight over whether text was human or AI-generated


The Problem: "Is This Human?" Is the Wrong Question

Everyone's building AI detectors. Most of them suck.

They rely on perplexity scores, burstiness metrics, or proprietary black-box classifiers that flag Shakespeare as AI and ChatGPT as human. The accuracy is a coin flip with extra steps.

I wanted to build something defensible, something that could explain why it flagged text, not just spit out a confidence score. So I combined two things that don't usually go together:

  1. Stylometric analysis (the forensic linguistics used to catch the Unabomber)
  2. Multi-agent LLM debate (forcing AI models to argue with each other until they reach consensus)

The result: AI Lie Detector, a workflow that doesn't just guess. It shows its work.


How It Works: Stylometry First, Debate Second

Step 1: Extract Forensic Signals

Before any LLM touches the text, I calculate:

  • Lexical diversity (type-token ratio): AI tends to recycle words
  • Sentence length variance: humans write messier, more erratic sentences
  • Punctuation density: AI loves commas a bit too much
  • Average word length: correlates with vocabulary sophistication
  • Paragraph structure: AI formats like a high school essay

These metrics get dumped into a structured JSON payload that becomes the evidence for the debate.
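The payload might look something like this (field names match the metrics code later in the post; the numbers are invented for illustration):

```json
{
  "lexicalDiversity": 0.62,
  "sentenceLengthVariance": 41.3,
  "avgSentenceLength": 17.8,
  "punctuationDensity": 0.11,
  "avgWordLength": 4.9,
  "totalWords": 412,
  "totalSentences": 23
}
```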

Step 2: Three Agents, One Mission

I run the text and metrics through three separate LLMs in sequence:

  1. Agent 1 (Gemini): Makes the opening argument (Human or AI?) based on the stylometric data
  2. Agent 2 (Claude): Reviews Agent 1's argument and the raw data, then challenges or supports it
  3. Agent 3 (GPT-4): Acts as the judge, reviews both arguments, breaks ties, issues the final verdict

Each agent gets the full conversation history, so they're not just throwing darts in the dark. They're responding to each other.
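The sequential hand-off can be sketched as a loop that threads the growing transcript through each agent. This is an illustrative shape, not the actual workflow code; `agent` here stands in for whatever wraps each model's API call.

```javascript
// Illustrative debate loop: each agent receives the full history so far,
// and its reply is appended before the next agent runs. The real workflow
// does this with chained HTTP Request nodes.
async function runDebate(text, metrics, agents) {
  const history = [
    { role: "evidence", content: JSON.stringify({ text, metrics }) },
  ];
  for (const agent of agents) {
    const reply = await agent(history); // agent sees everything before it
    history.push({ role: agent.name || "agent", content: reply });
  }
  return history;
}
```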

Step 3: Final Verdict and Confidence Score

The judge (GPT-4) outputs:

  • Classification (Human/AI/Uncertain)
  • Confidence (0 to 100%)
  • Reasoning (a plain-English explanation citing specific metrics)

If confidence falls below 70%, the system returns Uncertain instead of guessing. No false confidence.
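That gate is simple enough to sketch in a few lines (field names are illustrative, not the workflow's exact schema):

```javascript
// Below the 70% confidence threshold the verdict is downgraded to
// Uncertain rather than reported as a guess.
function finalVerdict(classification, confidence) {
  if (confidence < 70) {
    return { classification: "Uncertain", confidence };
  }
  return { classification, confidence };
}
```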


Why This Approach Works

1. It's Falsifiable

Traditional AI detectors are black boxes. This workflow shows:

  • Which metrics triggered the flag
  • What each agent argued
  • Why the final decision was made

You can audit every step. If it's wrong, you can see where it went wrong.

2. Multi-Agent Debate Reduces Hallucinations

Single-LLM classifiers are confident idiots. By forcing three models to debate, I get:

  • Error correction (Agent 2 catches Agent 1's overconfidence)
  • Consensus validation (if all three agree, it's probably right)
  • Uncertainty flagging (if they disagree strongly, the system admits it doesn't know)
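The consensus idea reduces to a simple agreement check (illustrative only; in the real workflow the judge weighs the arguments rather than counting votes):

```javascript
// Unanimous verdicts are a strong signal; anything less is treated as
// disagreement worth flagging.
function checkAgreement(verdicts) {
  const unanimous = verdicts.every((v) => v === verdicts[0]);
  return unanimous ? verdicts[0] : "Disagreement";
}
```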

3. Stylometry Grounds the AI in Actual Signals

LLMs are pattern-matching engines. Without hard data, they pattern-match vibes. By feeding them quantifiable linguistic features, they have something concrete to argue about.


The n8n Workflow Architecture

Here's the flow in plain terms:
Trigger (Manual/Webhook)
→ Parse Input Text
→ Calculate Stylometric Metrics (Code Node)
→ Agent 1: Gemini Analysis (HTTP Request)
→ Agent 2: Claude Review (HTTP Request)
→ Agent 3: GPT-4 Judgment (HTTP Request)
→ Format Final Report (JSON to Markdown)
→ Output (Webhook Response / Slack / Email)
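As one concrete example, the GPT-4 judgment step is a plain HTTPS call to OpenAI's chat completions endpoint. Here's a request-builder sketch; the system prompt is shortened and the exact node configuration differs:

```javascript
// Builds the request an HTTP Request node would send for the judge step.
// Real OpenAI endpoint and header shape; the payload here is simplified.
function buildJudgeRequest(apiKey, transcript) {
  return {
    url: "https://api.openai.com/v1/chat/completions",
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "gpt-4",
      messages: [
        { role: "system", content: "You are the final arbitrator. Review both analyses and issue a verdict." },
        { role: "user", content: transcript },
      ],
    }),
  };
}
```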

Key Design Choices:

Code Node for metrics: I didn't want to rely on external APIs for basic text stats. Pure JavaScript, runs locally.

HTTP Request nodes for LLMs: Direct API calls to OpenAI, Anthropic, and Google. No middleware.

Structured prompts: Each agent gets a system prompt defining its role ("You are a forensic linguist reviewing stylometric evidence...")

Error handling: If any agent fails, the workflow degrades gracefully (two-agent debate instead of three)
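The degradation logic can be sketched as a loop that tolerates individual agent failures (the agent functions are hypothetical stand-ins for the HTTP calls):

```javascript
// If an agent's API call throws, skip it and keep debating with whoever
// is left; a two-agent debate still beats no verdict at all.
async function runAgentsWithFallback(agents, evidence) {
  const replies = [];
  for (const agent of agents) {
    try {
      replies.push(await agent(evidence, replies));
    } catch (err) {
      console.warn(`Agent failed, continuing without it: ${err.message}`);
    }
  }
  return replies;
}
```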


Real-World Results

I tested it on:

  • Human writing (my own blog posts, emails, Slack messages)
  • AI-generated text (ChatGPT, Claude, Gemini outputs)
  • Edge cases (AI-written text I manually edited, human text I asked AI to improve)

Accuracy: approximately 85%

  • True positives (correctly flagged AI): 88%
  • True negatives (correctly flagged human): 82%
  • False positives (human flagged as AI): 12%, usually formal/technical writing
  • False negatives (AI flagged as human): 6%, usually heavily edited AI drafts

Much of the remaining 15% is honest uncertainty rather than outright misclassification: when confidence drops below 70%, the system says "I don't know" instead of guessing. That's the whole point.


What I Learned Building This

1. Stylometry Isn't Magic

Lexical diversity alone won't catch AI. Neither will sentence length variance. You need multiple signals cross-referenced.
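To make that concrete, here's a purely illustrative way to combine weak signals; the thresholds are invented for the sketch and are not the workflow's calibrated values:

```javascript
// No single metric is decisive; count how many independent signals lean AI.
function aiSignalCount(m) {
  let signals = 0;
  if (m.lexicalDiversity < 0.4) signals += 1;      // recycled vocabulary
  if (m.sentenceLengthVariance < 20) signals += 1; // uniform sentence lengths
  if (m.punctuationDensity > 0.08) signals += 1;   // comma-heavy prose
  return signals;
}
```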

2. Multi-Agent Debate Beats Ensemble Voting

I tried a simpler approach first: ask three LLMs separately, then take the majority vote. It was worse. Debate forces models to justify their reasoning, which filters out lazy pattern-matching.

3. AI Detectors Will Always Be Arms Races

This workflow works today. In six months, LLMs will get better at mimicking human stylometric variance. But the multi-agent forensic approach will still be more defensible than single-model classifiers.


The Code: Stylometric Analysis

Here's the core metrics calculation (simplified):

// Calculate lexical diversity (type-token ratio)
const words = text.toLowerCase().match(/\b\w+\b/g) || [];
if (words.length === 0) {
  // Guard: avoid division by zero on empty or non-text input
  throw new Error("No words found in input text");
}
const uniqueWords = new Set(words);
const lexicalDiversity = uniqueWords.size / words.length;

// Sentence length variance (naive splitter; abbreviations like "e.g." will over-split)
const sentences = text.split(/[.!?]+/).filter(s => s.trim().length > 0);
const sentenceLengths = sentences.map(s => s.trim().split(/\s+/).length);
const avgSentenceLength = sentenceLengths.reduce((a, b) => a + b, 0) / sentenceLengths.length;
const variance = sentenceLengths.reduce((sum, len) => sum + Math.pow(len - avgSentenceLength, 2), 0) / sentenceLengths.length;

// Punctuation density (commas, semicolons, colons, exclamation and question marks per word)
const punctuation = (text.match(/[,;:!?]/g) || []).length;
const punctuationDensity = punctuation / words.length;

// Average word length (rough proxy for vocabulary sophistication)
const avgWordLength = words.reduce((sum, word) => sum + word.length, 0) / words.length;

return {
  lexicalDiversity,
  sentenceLengthVariance: variance,
  avgSentenceLength,
  punctuationDensity,
  avgWordLength,
  totalWords: words.length,
  totalSentences: sentences.length
};

These metrics get passed to Agent 1 as structured evidence.


The Prompt Architecture

Each agent has a distinct role in the system prompt:

Agent 1 (Gemini):
You are a forensic linguist analyzing text authenticity.
Given stylometric metrics, determine if the text is human-written or AI-generated.
Provide your reasoning based on the quantitative evidence.

Agent 2 (Claude):
You are a critical reviewer examining another analyst's conclusion.
Review the previous analysis and the raw metrics.
Either support the conclusion with additional evidence, or challenge it if flawed.

Agent 3 (GPT-4):
You are the final arbitrator. Review both analyses and issue a verdict.
Classification: Human/AI/Uncertain
Confidence: 0 to 100%
Reasoning: Cite specific metrics that led to your decision.


Try It Yourself

The workflow is live on the n8n Creator Hub.

You'll need:

  • OpenAI API key
  • Anthropic API key
  • Google AI (Gemini) API key

Clone it, test it on your own writing, see if it catches you. Then test it on AI slop and see if it flags it.

If you're building content moderation systems, academic integrity tools, or just want to know if that email was written by a human or a bot, this is a starting point.


What's Next?

I'm working on:

  • Watermark detection as a supplementary signal (runs after the debate if confidence is low)
  • Domain-specific calibration (academic writing has different baselines than marketing copy)
  • Adversarial testing (feeding it text specifically designed to fool the metrics)

If you've built AI detectors or multi-agent debate systems, I'd love to hear what worked (or didn't) for you.


Mychel Garzon

n8n Verified Creator | Helsinki, Finland

Portfolio | mychel.garzon@gmail.com
