How We Score AI Writing Quality: Building an Objective Comparison Framework
Rating writing quality is subjective. Ask 10 people to rank five pieces and you'll get 10 different lists.
Yet we built a framework for objectively comparing AI writing tools. Here's how, with all the uncomfortable compromises.
The Core Problem
How do you measure something as fuzzy as "good writing"?
Traditional approaches:
- Expert panels (expensive, biased)
- User votes (popularity ≠ quality)
- Readability scores (ignores nuance)
- Grammar checkers (catches errors but not excellence)
None work alone. You need a hybrid model.
Our Scoring Dimensions
We evaluate across six dimensions, each with measurable criteria:
interface WritingScore {
  clarity: number;       // 0-100
  relevance: number;     // 0-100
  originality: number;   // 0-100
  tone: number;          // 0-100
  structure: number;     // 0-100
  accuracy: number;      // 0-100
  weightedTotal: number; // Final score
}
Dimension 1: Clarity (Weight: 20%)
Clarity isn't just readability. It's "does a reader immediately understand the main point?"
function scoreClarity(text: string): number {
  const metrics = {
    avgSentenceLength: calculateAvgSentenceLength(text),
    passiveVoicePercentage: detectPassiveVoice(text),
    jargonDensity: detectUnexplainedTechnicalTerms(text),
    readabilityIndex: calculateFleschKincaid(text), // reported in diagnostics
  };
  // Optimal: avg 15-20 words/sentence, <20% passive voice
  const sentenceScore = Math.max(0, 100 - Math.abs(metrics.avgSentenceLength - 17.5) * 5);
  const passiveScore = Math.max(0, 100 - metrics.passiveVoicePercentage * 2);
  const jargonScore = Math.max(0, 100 - metrics.jargonDensity * 10);
  return sentenceScore * 0.4 + passiveScore * 0.35 + jargonScore * 0.25;
}
We penalize both "too simple" and "too complex." The sweet spot is an 8th-10th grade reading level for a general audience.
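The helpers scoreClarity calls aren't shown in the source; here's a minimal sketch of two of them — calculateAvgSentenceLength and calculateFleschKincaid — using a naive sentence splitter and a vowel-group syllable counter (both assumptions, not the production versions):

```typescript
// Naive sentence splitter: good enough for scoring, not for parsing.
function splitSentences(text: string): string[] {
  return text.split(/[.!?]+/).map(s => s.trim()).filter(s => s.length > 0);
}

function calculateAvgSentenceLength(text: string): number {
  const sentences = splitSentences(text);
  if (sentences.length === 0) return 0;
  const words = text.split(/\s+/).filter(w => w.length > 0);
  return words.length / sentences.length;
}

// Crude syllable estimate: count vowel groups per word.
function countSyllables(word: string): number {
  const groups = word.toLowerCase().match(/[aeiouy]+/g);
  return Math.max(1, groups ? groups.length : 1);
}

// Flesch-Kincaid grade = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59
function calculateFleschKincaid(text: string): number {
  const sentences = splitSentences(text);
  const words = text.split(/\s+/).filter(w => w.length > 0);
  if (sentences.length === 0 || words.length === 0) return 0;
  const syllables = words.reduce((sum, w) => sum + countSyllables(w), 0);
  return 0.39 * (words.length / sentences.length) + 11.8 * (syllables / words.length) - 15.59;
}
```

A short, simple sentence lands well below grade 5 under this formula; long sentences with polysyllabic words push the grade up.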
Dimension 2: Relevance (Weight: 25%)
Does the content actually address what was requested?
function scoreRelevance(prompt: string, output: string): number {
  // Extract key terms from the prompt and the output
  const promptKeywords = extractNounPhrases(prompt);
  const outputKeywords = extractNounPhrases(output);

  // Fuzzy keyword coverage: near-identical terms count as a match
  const keywordCoverage = promptKeywords.filter(keyword =>
    outputKeywords.some(out => levenshteinDistance(keyword, out) < 2)
  ).length / promptKeywords.length;

  // Check whether the output addresses the request's intent
  const intentMatch = analyzeIntent(prompt, output);

  // Penalize off-topic tangents
  const focusScore = calculateFocusMaintenance(output);

  return keywordCoverage * 40 + intentMatch * 40 + focusScore * 20;
}
This catches output that is well written but completely misses what was asked.
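scoreRelevance leans on levenshteinDistance for fuzzy keyword matching; the classic single-row dynamic-programming version is compact enough to show in full:

```typescript
// Classic edit distance; a distance < 2 treats near-identical
// keywords ("tool" vs "tools") as the same term.
function levenshteinDistance(a: string, b: string): number {
  const dp: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let prev = dp[0];
    dp[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const tmp = dp[j];
      dp[j] = Math.min(
        dp[j] + 1,      // deletion
        dp[j - 1] + 1,  // insertion
        prev + (a[i - 1] === b[j - 1] ? 0 : 1) // substitution
      );
      prev = tmp;
    }
  }
  return dp[b.length];
}
```

The single-array form keeps memory at O(min length) rather than the full O(n*m) table.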
Dimension 3: Originality (Weight: 15%)
Does it sound like generic AI output, or does it have a voice?
function scoreOriginality(text: string, topic: string): number {
  // Check against a database of common filler phrases
  const genericPhrases = [
    "In conclusion", "It is important to note",
    "Let me explain", "By and large", "It goes without saying"
  ];
  const clicheCount = genericPhrases.filter(phrase =>
    text.toLowerCase().includes(phrase.toLowerCase())
  ).length;
  const clicheScore = Math.max(0, 100 - clicheCount * 15);

  // Measure sentence variety
  const sentenceStructures = analyzeSentenceStructure(text);
  const varietyScore = (new Set(sentenceStructures).size / sentenceStructures.length) * 100;

  // Check for repeated phrases
  const phraseDiversity = calculateUniquePhraseRatio(text);

  return clicheScore * 0.4 + varietyScore * 0.3 + phraseDiversity * 0.3;
}
This rewards voice and punishes "sounds like AI."
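calculateUniquePhraseRatio isn't defined above; one plausible sketch (an assumption, not the production code) scores the ratio of distinct word trigrams to total trigrams:

```typescript
// 100 means no three-word phrase repeats anywhere in the text;
// lower values mean recycled phrasing.
function calculateUniquePhraseRatio(text: string, n = 3): number {
  const words = text.toLowerCase().split(/\s+/).filter(w => w.length > 0);
  if (words.length < n) return 100; // too short to repeat anything
  const grams: string[] = [];
  for (let i = 0; i <= words.length - n; i++) {
    grams.push(words.slice(i, i + n).join(" "));
  }
  return (new Set(grams).size / grams.length) * 100;
}
```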
Dimension 4: Tone (Weight: 15%)
Does tone match the request (professional vs casual, formal vs playful)?
function scoreTone(text: string, requestedTone: ToneType): number {
  const detectedTone = analyzeToneMarkers(text); // kept for diagnostics output
  const toneAlignment = {
    professional: {
      contractions: -20,     // Fewer contractions
      exclamationMarks: -15, // Fewer exclamation marks
      formalVocab: +30,      // More formal word choices
      directAddress: +20     // Addresses the reader directly
    },
    casual: {
      contractions: +25,
      exclamationMarks: +15,
      personalPronouns: +20,
      humor: +25
    },
    // ... etc for other tones
  };

  let score = 50; // Neutral starting point
  const rules = toneAlignment[requestedTone];
  for (const [marker, adjustment] of Object.entries(rules)) {
    const detected = getMarkerCount(text, marker);
    // optimalCount: module-level map of target frequency per marker
    score += adjustment * Math.min(1, detected / optimalCount[marker]);
  }
  return Math.max(0, Math.min(100, score));
}
Tone is contextual, so this is partially rule-based, partially learned from training data.
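getMarkerCount is left abstract above; a regex-based sketch for two of the markers might look like this (the pattern table is an assumption — markers like formalVocab or humor would need word lists or a classifier, which is where the learned part comes in):

```typescript
// Regex-based counters for two mechanical tone markers.
const markerPatterns: Record<string, RegExp> = {
  contractions: /\b\w+'(?:t|s|re|ve|ll|d|m)\b/gi,
  exclamationMarks: /!/g,
};

function getMarkerCount(text: string, marker: string): number {
  const pattern = markerPatterns[marker];
  if (!pattern) return 0; // unknown markers contribute nothing
  const matches = text.match(pattern);
  return matches ? matches.length : 0;
}
```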
Dimension 5: Structure (Weight: 10%)
Is the content organized logically?
function scoreStructure(text: string, contentType: string): number {
  const expectedStructure = structureTemplates[contentType]; // template for this content type

  // Analyze hierarchy (H1 > H2 > H3, etc.)
  const headingHierarchy = validateHeadingHierarchy(text);
  const hierarchyScore = headingHierarchy.isValid ? 100 : 60;

  // Check paragraph coherence
  const paragraphs = text.split('\n\n');
  const coherenceScore = paragraphs.reduce((sum, para) => {
    return sum + measureParagraphCoherence(para);
  }, 0) / paragraphs.length;

  // Verify logical flow
  const transitions = analyzeTransitionWords(text);
  const transitionScore = transitions.count > 0 ? 100 : 50;

  return hierarchyScore * 0.4 + coherenceScore * 0.4 + transitionScore * 0.2;
}
Well-structured content guides readers naturally through ideas.
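analyzeTransitionWords, used in scoreStructure above, can be sketched with a small lexicon (the word list here is illustrative, not exhaustive):

```typescript
// Count transition words and phrases that signal logical flow.
const TRANSITIONS = [
  "however", "therefore", "meanwhile", "consequently",
  "for example", "in contrast", "as a result",
];

function analyzeTransitionWords(text: string): { count: number } {
  const lower = text.toLowerCase();
  const count = TRANSITIONS.reduce(
    (sum, t) => sum + (lower.split(t).length - 1), 0
  );
  return { count };
}
```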
Dimension 6: Accuracy (Weight: 15%)
Does it contain factually correct information?
async function scoreAccuracy(text: string, topic: string): Promise<number> {
  // Extract claims
  const claims = extractFactualClaims(text);

  // Verify each claim against a knowledge base / real-time search
  const verifications = await Promise.all(
    claims.map(claim => verifyFactualAccuracy(claim))
  );
  const accuracyRate = claims.length === 0
    ? 1 // nothing to get wrong
    : verifications.filter(v => v.confirmed).length / claims.length;

  // Penalize unsourced statistics
  const statistics = extractStatistics(text);
  const sourcedStats = statistics.filter(stat => isCited(stat)).length;
  const citationScore = sourcedStats / Math.max(1, statistics.length);

  return accuracyRate * 70 + citationScore * 30;
}
This is the hardest dimension to automate. Real accuracy checking requires at least one of:
- Knowledge base (goes stale quickly)
- Real-time web search (expensive)
- Human review (doesn't scale)
We use all three, weighted by confidence.
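The source doesn't show how the three verification signals are combined; one hedged sketch weights each source's verdict for a claim by a per-source confidence value:

```typescript
interface Verification {
  confirmed: boolean;
  confidence: number; // 0-1: how much we trust this source's verdict
}

// Combine knowledge-base, web-search, and human verdicts for a single
// claim: a claim counts as confirmed when the confidence-weighted
// "confirmed" mass is at least half the total confidence.
function combineVerifications(results: Verification[]): boolean {
  const usable = results.filter(r => r.confidence > 0);
  if (usable.length === 0) return false; // no signal, assume unverified
  const totalWeight = usable.reduce((s, r) => s + r.confidence, 0);
  const confirmedWeight = usable
    .filter(r => r.confirmed)
    .reduce((s, r) => s + r.confidence, 0);
  return confirmedWeight / totalWeight >= 0.5;
}
```

This lets a high-confidence human review override a stale knowledge-base miss, and vice versa.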
The Weighted Total
function calculateFinalScore(scores: WritingScore): number {
  return (
    scores.clarity * 0.20 +
    scores.relevance * 0.25 +
    scores.originality * 0.15 +
    scores.tone * 0.15 +
    scores.structure * 0.10 +
    scores.accuracy * 0.15
  );
}
Weights prioritize relevance (did it answer what was asked?) and clarity (can readers understand it?) as most important.
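One sanity check worth adding (not in the source): the weights must sum to exactly 1.0, or the final score silently drifts off the 0-100 scale whenever someone re-tunes a single weight:

```typescript
// Guard: weights must sum to 1.0 to keep the final score on 0-100.
const WEIGHTS = {
  clarity: 0.20, relevance: 0.25, originality: 0.15,
  tone: 0.15, structure: 0.10, accuracy: 0.15,
};
const total = Object.values(WEIGHTS).reduce((a, b) => a + b, 0);
if (Math.abs(total - 1.0) > 1e-9) {
  throw new Error(`dimension weights sum to ${total}, not 1.0`);
}
```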
The Uncomfortable Truth
This framework is 80% objective, 20% guess.
The "accuracy" dimension relies partly on knowledge that goes stale. "Tone" detection has blind spots. "Originality" depends on what you compare against.
But here's the thing: it's consistent. The same piece of writing gets the same score every time. The weighting is explicit. You can disagree with the weights, but you can't accuse the framework of being opaque.
That consistency is what lets you compare AI tools fairly.
Validation
We validated against:
- Expert panel (5 writing professionals rating 50 pieces)
- User preferences (which tool output users actually chose)
- Outcome metrics (which content actually performed in real situations)
Correlation with expert ratings: 0.87
Correlation with user choices: 0.73
Correlation with performance: 0.65
Not perfect, but honestly good for an automated system.
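The correlation figures above are presumably Pearson coefficients; for reference, a minimal implementation:

```typescript
// Pearson correlation between automated scores and a human baseline.
// Returns a value in [-1, 1]; assumes equal-length, non-constant inputs.
function pearson(x: number[], y: number[]): number {
  const n = x.length;
  const mx = x.reduce((a, b) => a + b, 0) / n;
  const my = y.reduce((a, b) => a + b, 0) / n;
  let num = 0, dx = 0, dy = 0;
  for (let i = 0; i < n; i++) {
    num += (x[i] - mx) * (y[i] - my);
    dx += (x[i] - mx) ** 2;
    dy += (y[i] - my) ** 2;
  }
  return num / Math.sqrt(dx * dy);
}
```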
The Implementation
This runs as an API endpoint. One subtlety: Promise.all returns an array, so the results are destructured into the named fields calculateFinalScore expects:

// 1. Parse input
const prompt = req.body.prompt;
const output = req.body.aiOutput;

// 2. Score all dimensions in parallel
const [clarity, relevance, originality, tone, structure, accuracy] =
  await Promise.all([
    scoreClarity(output),
    scoreRelevance(prompt, output),
    scoreOriginality(output, extractTopic(prompt)),
    scoreTone(output, detectRequestedTone(prompt)),
    scoreStructure(output, detectContentType(prompt)),
    scoreAccuracy(output, extractTopic(prompt))
  ]);

// 3. Calculate the weighted total
const breakdown = { clarity, relevance, originality, tone, structure, accuracy };
const finalScore = calculateFinalScore({ ...breakdown, weightedTotal: 0 }); // placeholder; computed here

// 4. Return the breakdown
res.json({ finalScore, breakdown, ...diagnostics });
What This Enables
With this framework, you can:
- Compare AI tools objectively (not perfectly, but meaningfully)
- Identify which tools excel for your specific content needs
- Track tool improvements over time
- Make budget decisions based on actual output quality
The platform uses this to show not just "Tool A scores 78, Tool B scores 76" but why they differ and which matters for your use case.
The limitations? Sure. But in a world of marketing fluff, explicit methodology beats vague claims every time.