Your writing has a fingerprint. Not in the words you choose for any single email, but in the structural patterns that persist across everything you write -- sentence lengths, vocabulary density, punctuation habits, rhythmic variation.
At MyWritingTwin, we built a system that measures this fingerprint across six dimensions and compares it to an empirically derived AI baseline. The result is a radar chart that shows users exactly where their writing diverges from "Average AI."
This post covers the methodology: how we built the corpus, what the six axes actually measure, and why per-locale baselines matter more than you'd expect.
The Corpus: 320 Samples, 100K+ Words
We needed a baseline that represents what "AI writing" actually looks like in practice -- not what one model does, but what the field converges on.
The corpus:
- 5 models: Claude Opus 4.6, Claude Sonnet 4.5, Claude Haiku 4.5, GPT-5.2, Gemini 3 Pro
- 4 languages: English, French, Spanish, Japanese
- 8 prompt types: formal emails, casual emails, business reports, social posts, blog intros, Slack messages, presentations, meeting follow-ups
- 2 variants per combination: to account for response variability
That's 5 x 4 x 8 x 2 = 320 samples, totaling approximately 100,800 words.
Every sample was generated with identical instructions per prompt type, ensuring apples-to-apples comparison. The prompt types map to what professionals actually ask AI to write -- no creative fiction, no poetry, no edge cases.
The striking finding: all five models converge within a ~12-point band on most axes. Different companies, different architectures, same stylistic output. But that's a separate post.
The Six Axes: What We Measure and How
Each axis captures an independent dimension of writing style, grounded in computational stylometry research. The formulas are deterministic -- same text in, same scores out. No LLM in the measurement loop.
1. Sentence Complexity
Captures structural density using a sigmoid curve applied to mean sentence length and standard deviation.
function sentenceComplexity(sentences: string[]): number {
  const lengths = sentences.map(s => wordCount(s))
  const mean = avg(lengths)
  const stddev = standardDeviation(lengths)
  // Raw input combines mean length with variation
  const raw = mean * 0.7 + stddev * 0.3
  // Center and scale before the sigmoid: a raw logistic would
  // saturate near 100 for typical sentence lengths. The sigmoid
  // prevents a hard ceiling at extreme values.
  return sigmoid((raw - COMPLEXITY_MIDPOINT) / COMPLEXITY_SCALE) * 100
}
The sigmoid curve is key here. Raw sentence length doesn't map linearly to perceived complexity. The difference between 10 and 15 words per sentence feels significant. The difference between 35 and 40 feels negligible. The sigmoid captures this diminishing-returns relationship, producing scores that match human intuition.
AI baseline (English): 65 -- moderately complex. Neither simple nor dense.
2. Vocabulary Range
Type-Token Ratio (TTR) -- the number of unique words divided by total words.
function vocabularyRange(text: string, locale: string): number {
  const tokens = tokenize(text, locale)
  const uniqueTokens = new Set(tokens.map(t => t.toLowerCase()))
  const ttr = uniqueTokens.size / tokens.length
  return normalize(ttr, 0, 1) * 100
}
TTR has been used in linguistics since the 1940s to measure lexical diversity. While raw TTR is sensitive to text length (longer texts naturally have lower TTR), we mitigate this by comparing within standardized corpus sizes.
The Japanese tokenization problem: Japanese doesn't use spaces between words, so standard split(' ') tokenization fails entirely. We use Intl.Segmenter for morphological tokenization:
function tokenize(text: string, locale: string): string[] {
  if (locale === 'ja') {
    const segmenter = new Intl.Segmenter('ja', { granularity: 'word' })
    return [...segmenter.segment(text)]
      .filter(s => s.isWordLike)
      .map(s => s.segment)
  }
  // filter(Boolean) drops empty strings from leading/trailing whitespace
  return text.split(/\s+/).filter(Boolean)
}
This means a Japanese user's vocabulary score is calculated with the same conceptual approach but language-appropriate mechanics. Intl.Segmenter handles the morphological analysis that would otherwise require a dedicated NLP library.
AI baseline (English): 48 -- moderate diversity, safe word choices.
3. Expressiveness
Measures emotional and rhetorical energy through five signals:
function expressiveness(text: string, sentences: string[]): number {
  const totalSentences = sentences.length
  const totalWords = wordCount(text)
  const wordsK = totalWords / 1000
  const exclamationRatio = countMatches(sentences, /!$/) / totalSentences
  const questionRatio = countMatches(sentences, /\?$/) / totalSentences
  const attitudeMarkers = countAttitudeMarkers(text) / wordsK
  const emDashes = countMatches(text, /—/g) / wordsK
  const ellipses = countMatches(text, /\.\.\./g) / wordsK
  // Weighted combination, normalized to 0-100
  const raw = (exclamationRatio * 25)
    + (questionRatio * 20)
    + (attitudeMarkers * 30)
    + (emDashes * 15)
    + (ellipses * 10)
  return clamp(raw, 0, 100)
}
Attitude markers include words and phrases that signal personal opinion: "importantly," "crucially," "interestingly," "clearly," "obviously." These are high-signal features -- their presence density is a reliable discriminator between human and AI writing patterns.
AI baseline (English): 76 -- high energy. This is where RLHF shows its fingerprint most clearly. Raters prefer engaged text, so every model defaults to enthusiasm.
4. Formality
Weighted combination of three indicators, with one negative signal:
function formality(text: string, sentences: string[]): number {
  const totalWords = wordCount(text)
  const wordsK = totalWords / 1000
  // Positive signals
  const functionWordDensity = countFunctionWords(text) / totalWords
  const hedgeFrequency = countHedges(text) / wordsK // "might," "perhaps," "could"
  const semicolonUsage = countMatches(text, /;/g) / wordsK
  // Negative signal
  const exclamationRatio = countMatches(sentences, /!$/) / sentences.length
  const raw = (functionWordDensity * 40)
    + (hedgeFrequency * 25)
    + (semicolonUsage * 20)
    - (exclamationRatio * 15)
  return clamp(normalize(raw), 0, 100)
}
The use of function words as a formality signal comes from authorship attribution research. The most counterintuitive finding in stylometry: the words that reveal your identity aren't the impressive ones. They're the boring ones -- "the," "of," "and," "to." Your frequency of using these words is remarkably stable across contexts and time.
AI baseline (English): 58 -- slightly formal, "professional-but-approachable." The RLHF sweet spot.
5. Consistency
Coefficient of variation on sentence lengths, inverted:
function consistency(sentences: string[]): number {
  const lengths = sentences.map(s => wordCount(s))
  const mean = avg(lengths)
  const stddev = standardDeviation(lengths)
  // CV = stddev / mean (higher = more variable)
  const cv = stddev / mean
  // Invert: high consistency = low variation
  return clamp((1 - cv) * 100, 0, 100)
}
A writer who alternates between 5-word and 40-word sentences scores low -- their writing is "bursty." A writer whose sentences cluster between 12 and 18 words scores high. Literary prose tends to be bursty. Legal writing tends to be consistent.
AI baseline (English): 53 -- moderate variation. All five models land between 49 and 57 on this axis. An 8-point spread across the entire field.
6. Conciseness
The simplest formula: a decreasing function of mean sentence length.
function conciseness(sentences: string[]): number {
  const lengths = sentences.map(s => wordCount(s))
  const mean = avg(lengths)
  // Decreasing logistic (1 - sigmoid), not the logit, which would be
  // undefined for mean lengths above 1: shorter sentences = higher score
  return clamp((1 - sigmoid((mean - CONCISENESS_MIDPOINT) / CONCISENESS_SCALE)) * 100, 0, 100)
}
AI baseline (English): 42 -- below midpoint. AI writes long because "helpful" = "thorough" in RLHF terms. More words, more explanation, more qualifiers.
Per-Locale Baselines: Why Language Matters
This is where things get interesting from an engineering perspective.
AI writes differently in different languages. Not just word choice -- structural patterns shift. Comparing a Japanese user's writing to an English AI baseline would produce meaningless results. The differences would reflect language structure, not personal style.
So we compute per-locale baselines:
| Axis | EN | FR | ES | JA |
|---|---|---|---|---|
| Sentence Complexity | 65 | 75 | 71 | 62 |
| Vocabulary Range | 48 | 49 | 44 | 37 |
| Expressiveness | 76 | 74 | 59 | 100 |
| Formality | 58 | 42 | 46 | 59 |
| Consistency | 53 | 52 | 55 | 53 |
| Conciseness | 42 | 32 | 36 | 45 |
Two values jump out:
Japanese expressiveness = 100. This isn't a bug. Japanese business writing uses question forms (~でしょうか, ~ませんか) and polite markers that the expressiveness formula picks up as rhetorical signals. Both AI baselines and user scores use the same formula, so the comparison remains fair. If a Japanese user scores 85 on expressiveness, that genuinely means they use fewer of these markers than typical AI output.
French conciseness = 32. French naturally produces longer sentences than English. Articles, prepositions, and grammatical structures that English compresses or drops entirely inflate French sentence length. An English-only baseline would unfairly penalize French writers for something that's a feature of their language, not their style.
The implementation is straightforward -- we store baseline vectors per locale and select the appropriate one at comparison time:
const AI_BASELINES: Record<Locale, StyleVector> = {
  en: { complexity: 65, vocabulary: 48, expressiveness: 76, formality: 58, consistency: 53, conciseness: 42 },
  fr: { complexity: 75, vocabulary: 49, expressiveness: 74, formality: 42, consistency: 52, conciseness: 32 },
  es: { complexity: 71, vocabulary: 44, expressiveness: 59, formality: 46, consistency: 55, conciseness: 36 },
  ja: { complexity: 62, vocabulary: 37, expressiveness: 100, formality: 59, consistency: 53, conciseness: 45 },
}
function compareToBaseline(userScores: StyleVector, locale: Locale): StyleDelta[] {
  const baseline = AI_BASELINES[locale]
  // Cast keeps strict-mode TypeScript happy when indexing StyleVector
  return (Object.keys(baseline) as (keyof StyleVector)[]).map(axis => ({
    axis,
    user: userScores[axis],
    ai: baseline[axis],
    delta: userScores[axis] - baseline[axis],
  }))
}
Research Grounding
These measurements aren't invented from scratch. They're practical distillations of established NLP and stylometry research:
Biber's multidimensional analysis: Douglas Biber's framework for analyzing text along multiple independent dimensions -- originally 67 linguistic features -- established that writing style isn't one-dimensional. You can be formal yet expressive. Concise yet complex. Our six-axis radar chart is a practical distillation of this principle.
Function words as authorship markers: Stylometry research consistently shows that function word frequencies are among the most discriminative features in authorship attribution. Our formality axis uses this as a primary signal.
Type-Token Ratio: Used in linguistics since the 1940s for lexical diversity measurement. Well-understood limitations (length sensitivity), well-understood mitigations (standardized window sizes).
Sigmoid scaling: Maps raw measurements to human-intuitive scales. The perceptual difference between 10-word and 15-word average sentences is larger than between 35-word and 40-word averages. Sigmoid curves capture this.
Why Deterministic Measurement Matters
A deliberate design choice: no LLM in the measurement pipeline. Every score is computed with deterministic functions. Same text in, same scores out.
This matters for three reasons:
Reproducibility. Run the analysis on the same text tomorrow and you get the same result. No prompt sensitivity, no temperature variance, no model updates changing your scores.
Explainability. When a user asks "why is my expressiveness score 45?", the answer is specific: "Your text has X exclamation marks per Y sentences, Z attitude markers per 1,000 words, etc." Not "the model thought your text seemed less expressive."
Fairness across languages. Deterministic formulas with per-locale baselines ensure that a Japanese user and an English user are measured by the same rules, compared to appropriate benchmarks, and given equally actionable results.
What Happens Next
The delta between a user's measured style vector and the AI baseline feeds directly into style profile generation. Instead of telling an AI "be more concise," the profile provides a quantified target: "conciseness target 68, baseline 42, reduce average sentence length by ~40%."
This transforms vague stylistic instructions into measurable calibration parameters. The AI gets concrete targets. The user gets output that sounds like them.
Try It Yourself
Curious where your writing falls on the six axes? Grab a free Writing DNA Snapshot -- submit a few writing samples, see your radar chart, and find out which dimensions diverge most from the AI average.
If you're working on anything related to computational stylometry, NLP-based authorship analysis, or writing style measurement, we'd love to hear about your approach. What dimensions do you think matter most for distinguishing human writing from AI output? Are there axes we're missing?
Drop your thoughts in the comments.