binky

Posted on Jun 2

Catch Mediocre AI Content Before It Ships: A Python Quality Scorer

#aicontent #qualityassurance #python #automation

Your AI generated 50 articles this week. Only 3 were publishable. Here's a Python script that stops the mediocre ones cold before they hit your CMS.

I've been there. You run a content pipeline, Claude or GPT spits out a dozen posts overnight, and then Tuesday morning arrives. You're reading through drafts that start with "In today's digital landscape" and end with "In conclusion, it's clear that..." The irony stings: AI was supposed to save time, not create a new job called "AI content babysitter."

So I built a CLI scorer. It reads your drafts, assigns them a quality score, and flags the weak ones before they touch your publishing system. Here's how.

Why AI Content Needs Gatekeeping

The problem isn't that AI writes badly. It's that AI writes predictably badly in specific, detectable ways. Once you know the patterns, you automate the detection.

Common failure modes in AI-generated content:

Filler openings: "In today's world," "It's no secret that," "As we all know"
Repetitive structure: Same subject-verb-object rhythm across paragraphs
Hollow hedging: "It's important to note that," "Needless to say," "Worth mentioning"
Transition bloat: "Furthermore," "Moreover," "Additionally" every two sentences
Fake specificity: Numbers and claims that sound precise but reference nothing

These patterns are measurable. Measurable means scriptable.
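As a small illustration of "measurable means scriptable", here is a minimal sketch that measures transition bloat. It assumes a crude sentence split on `.`, `!`, and `?`; the `TRANSITIONS` tuple and the `transition_density` helper are hypothetical names for this example and are not part of the scorer built later in this post.

```python
import re

# Transition words named in the failure-mode list above; extend as needed.
TRANSITIONS = ("furthermore", "moreover", "additionally")


def transition_density(text: str) -> float:
    """Share of sentences that open with one of the listed transition words."""
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    openers = sum(1 for s in sentences if s.lower().startswith(TRANSITIONS))
    return openers / len(sentences)


sample = (
    "AI writes fast. Furthermore, it writes a lot. "
    "Moreover, it repeats itself. Editors notice."
)
# 2 of the 4 sentences open with a transition word
print(transition_density(sample))
```

A density this high on a full article would be a reasonable signal to flag, though the cutoff is something to tune on your own drafts.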

Building the Scoring Engine

The scoring system is designed around four checks: pattern matching against known AI phrases, lexical diversity (type-token ratio), sentence length variance, and sycophancy density for hollow affirmations. The code below implements the first three.

Each check returns a penalty. The total subtracts from 100. Below 60? Rejected. 60–79? Flagged for review. 80+? Cleared to publish.
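The arithmetic is simple enough to sketch on its own. This is an illustrative version only: `total_score` and `verdict_for` are hypothetical helper names, and the penalty values in the example are made up to show the subtraction.

```python
def total_score(penalties: list[int]) -> int:
    """Subtract all penalties from 100, never going below 0."""
    return max(100 - sum(penalties), 0)


def verdict_for(score: int, reject_below: int = 60, publish_at: int = 80) -> str:
    """Map a score onto the three bands described above."""
    if score < reject_below:
        return "REJECT"
    if score < publish_at:
        return "REVIEW"
    return "PUBLISH"


# Made-up penalties: 15 for patterns, 8 for diversity, 2 for sentence variance
score = total_score([15, 8, 2])
print(score, verdict_for(score))  # 75 falls in the 60-79 band, so REVIEW
```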

Here's the core scoring logic:

```python
# scoring_engine.py (the CLI further down imports score_content from this module)
import re
import math
from typing import Tuple

# Known AI filler patterns: expand this list aggressively.
# Starter set drawn from the failure modes listed above.
AI_PATTERNS = [
    r"in today['’]s (?:world|digital landscape)",
    r"it['’]s no secret that",
    r"as we all know",
    r"it['’]s important to note that",
    r"needless to say",
    r"worth mentioning",
    r"\b(?:furthermore|moreover|additionally)\b",
    r"in conclusion",
]


def score_ai_patterns(text: str) -> Tuple[int, list]:
    """Returns penalty points and matched patterns."""
    text_lower = text.lower()
    hits = []
    for pattern in AI_PATTERNS:
        matches = re.findall(pattern, text_lower)
        if matches:
            hits.append((pattern, len(matches)))
    # 5 points per distinct pattern that matched, capped at 40 points
    penalty = min(len(hits) * 5, 40)
    return penalty, hits


def lexical_diversity(text: str) -> float:
    """Type-token ratio: unique words / total words."""
    words = re.findall(r'\b[a-z]+\b', text.lower())
    if not words:
        return 0.0
    return len(set(words)) / len(words)


def sentence_length_variance(text: str) -> float:
    """Standard deviation of sentence length in words.

    Higher spread = more natural writing rhythm.
    """
    sentences = re.split(r'[.!?]+', text)
    lengths = [len(s.split()) for s in sentences if s.strip()]
    if len(lengths) < 2:
        return 0.0
    mean = sum(lengths) / len(lengths)
    variance = sum((l - mean) ** 2 for l in lengths) / len(lengths)
    return math.sqrt(variance)  # Standard deviation


def score_content(text: str) -> dict:
    """Main scoring function. Returns score dict."""
    score = 100
    details = {}

    # Pattern penalty
    pattern_penalty, pattern_hits = score_ai_patterns(text)
    score -= pattern_penalty
    details["pattern_hits"] = pattern_hits
    details["pattern_penalty"] = pattern_penalty

    # Lexical diversity penalty: 1 point per 0.01 below 0.45
    diversity = lexical_diversity(text)
    if diversity < 0.45:
        diversity_penalty = int((0.45 - diversity) * 100)
        score -= diversity_penalty
        details["diversity_penalty"] = diversity_penalty
    else:
        details["diversity_penalty"] = 0
    details["lexical_diversity"] = round(diversity, 3)

    # Sentence variance penalty: 2 points per word of spread below 5.0
    variance = sentence_length_variance(text)
    if variance < 5.0:
        variance_penalty = int((5.0 - variance) * 2)
        score -= variance_penalty
        details["variance_penalty"] = variance_penalty
    else:
        details["variance_penalty"] = 0
    details["sentence_variance"] = round(variance, 2)

    details["final_score"] = max(score, 0)
    return details
```

The lexical_diversity function is the one I tune most. Human writing typically scores 0.55–0.75 on type-token ratio. AI output clusters around 0.40–0.50 because it reuses transition words constantly.
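To make the ratio concrete, here is a small worked example. It is a self-contained sketch that uses the same word regex as lexical_diversity; the function name `type_token_ratio` and both sample strings are made up for illustration.

```python
import re


def type_token_ratio(text: str) -> float:
    """Unique words divided by total words, using lowercase a-z tokens only."""
    words = re.findall(r"\b[a-z]+\b", text.lower())
    return len(set(words)) / len(words) if words else 0.0


repetitive = "It is important to note that it is important to test. It is important."
varied = "Short drafts fool nobody. Editors spot recycled phrasing within seconds."

print(type_token_ratio(repetitive))          # 7 unique words out of 14
print(type_token_ratio(varied))              # 10 unique words out of 10
print(type_token_ratio((varied + " ") * 2))  # same 10 unique words out of 20
```

The last line shows a property worth keeping in mind when tuning: type-token ratio falls as text gets longer, because common words repeat. A cutoff calibrated on 500-word posts will be too strict for 3,000-word pieces, so compare drafts of similar length.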

Adding Claude for Semantic Review

Regex catches structural problems. Claude catches the semantic ones — when a paragraph repeats itself three different ways, when claims lack support, when the writing feels hollow.

Install dependencies:

```bash
pip install anthropic click rich python-dotenv
```

Set your API key in .env:

```bash
echo "ANTHROPIC_API_KEY=your_key_here" > .env
```

Here's the full CLI — save as score_content.py:

```python
#!/usr/bin/env python3
import os
import sys
import json
import click
from pathlib import Path
from dotenv import load_dotenv
from rich.console import Console
from rich.table import Table
from rich.panel import Panel
import anthropic

from scoring_engine import score_content

load_dotenv()
console = Console()

CLAUDE_QUALITY_PROMPT = """You are a content quality reviewer. Analyze this draft for:

1. Repetitive sentence structures (score 1-10, 10 = very repetitive)
2. Vague or unsupported claims (count them)
3. Missing concrete examples or data points (yes/no)
4. Overall publishability (PUBLISH / REVIEW / REJECT)

Respond ONLY in this JSON format:
{{
  "repetition_score": <1-10>,
  "vague_claims": <count>,
  "missing_examples": <true|false>,
  "verdict": "<PUBLISH|REVIEW|REJECT>",
  "top_issue": "<one sentence>"
}}

CONTENT:
---
{content}
---"""


def get_claude_verdict(text: str) -> dict:
    """Send content to Claude for semantic quality review."""
    try:
        # Create the client here so --skip-claude runs never need an API key
        client = anthropic.Anthropic(api_key=os.getenv("ANTHROPIC_API_KEY"))
        message = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=300,
            messages=[
                {
                    "role": "user",
                    # Slice by characters to keep the request small
                    "content": CLAUDE_QUALITY_PROMPT.format(content=text[:4000]),
                }
            ],
        )
        raw = message.content[0].text.strip()
        return json.loads(raw)
    except json.JSONDecodeError:
        return {"verdict": "REVIEW", "top_issue": "Claude response unparseable", "error": True}
    except Exception as e:
        return {"verdict": "REVIEW", "top_issue": f"API error: {str(e)}", "error": True}


def combined_verdict(local_score: int, claude_verdict: str, threshold: int = 80) -> str:
    """Combine local score and Claude verdict into final decision."""
    if local_score < 60 or claude_verdict == "REJECT":
        return "REJECT"
    if local_score < threshold or claude_verdict == "REVIEW":
        return "REVIEW"
    return "PUBLISH"


@click.command()
@click.argument("filepath", type=click.Path(exists=True))
@click.option("--json-output", is_flag=True, help="Output raw JSON for pipeline use")
@click.option("--skip-claude", is_flag=True, help="Run local checks only (no API call)")
@click.option("--threshold", default=80, show_default=True,
              help="Minimum local score to auto-approve")
def main(filepath: str, json_output: bool, skip_claude: bool, threshold: int):
    """Score a content file for AI quality issues before publishing."""
    text = Path(filepath).read_text(encoding="utf-8")

    if len(text.split()) < 100:
        # Write to stderr so --json-output consumers never see non-JSON on stdout
        click.secho("File too short to score (< 100 words)", fg="red", err=True)
        sys.exit(2)

    # Run local scoring
    local_results = score_content(text)
    local_score = local_results["final_score"]

    # Run Claude check
    claude_results = {}
    if not skip_claude:
        with console.status("Asking Claude for semantic review..."):
            claude_results = get_claude_verdict(text)

    # Determine final verdict. With --skip-claude the local score decides alone;
    # a Claude response without a verdict is treated as REVIEW.
    if skip_claude:
        claude_verdict = "PUBLISH"
    else:
        claude_verdict = claude_results.get("verdict", "REVIEW")
    final = combined_verdict(local_score, claude_verdict, threshold)

    if json_output:
        output = {
            "file": filepath,
            "local_score": local_score,
            "claude": claude_results,
            "final_verdict": final,
        }
        print(json.dumps(output, indent=2))
        sys.exit(0 if final == "PUBLISH" else 1)

    # Rich terminal output
    color = {"PUBLISH": "green", "REVIEW": "yellow", "REJECT": "red"}[final]

    table = Table(title=f"Quality Report: {Path(filepath).name}")
    table.add_column("Check", style="cyan")
    table.add_column("Result", justify="right")

    table.add_row("Local Score", str(local_score))
    table.add_row("Pattern Hits", str(len(local_results.get("pattern_hits", []))))
    table.add_row("Lexical Diversity", str(local_results.get("lexical_diversity", "n/a")))
    table.add_row("Sentence Variance", str(local_results.get("sentence_variance", "n/a")))

    if claude_results and not claude_results.get("error"):
        table.add_row("Claude Verdict", claude_results.get("verdict", "n/a"))
        table.add_row("Repetition Score", str(claude_results.get("repetition_score", "n/a")))
        table.add_row("Vague Claims", str(claude_results.get("vague_claims", "n/a")))
        if claude_results.get("top_issue"):
            table.add_row("Top Issue", claude_results["top_issue"])

    console.print(table)
    console.print(Panel(f"[bold {color}]VERDICT: {final}[/bold {color}]"))

    sys.exit(0 if final == "PUBLISH" else 1)


if __name__ == "__main__":
    main()
```

Exit codes matter for automation: 0 for publish-ready, 1 for review or reject, and 2 when the file is too short to score. This is what makes the workflow integration in the next section work cleanly.
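If you drive the scorer from another Python script instead of a shell, the same exit codes can be read with `subprocess`. This is a rough sketch, not part of the CLI: `run_gate` and `EXIT_MEANINGS` are hypothetical names, and the example uses a stand-in child process that exits with 1 so that it runs without score_content.py present.

```python
import subprocess
import sys

# Exit codes as used by score_content.py in this post
EXIT_MEANINGS = {0: "PUBLISH", 1: "REVIEW_OR_REJECT", 2: "NOT_SCORED"}


def run_gate(command: list[str]) -> str:
    """Run a command and translate its exit code into a gate outcome."""
    completed = subprocess.run(command, capture_output=True, text=True)
    return EXIT_MEANINGS.get(completed.returncode, "ERROR")


# Stand-in for [sys.executable, "score_content.py", "draft.md"]
print(run_gate([sys.executable, "-c", "import sys; sys.exit(1)"]))
```

One caveat: Click also exits with 2 on usage errors such as a path that does not exist, so treat 2 as "not scored" and check stderr for the reason.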

The bug I hit: I initially passed full article text to Claude without truncating. For 3,000-word pieces, this occasionally hit token limits and caused silent failures: the response text came back empty, which broke json.loads inside get_claude_verdict. The fix: slice the draft to its first 4,000 characters (text[:4000]) before formatting it into CLAUDE_QUALITY_PROMPT. Note that the slice counts characters, not tokens. Not elegant, but reliable. For production, use a proper token counter before the API call.
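A blunt character slice can also cut a draft mid-sentence, which gives the reviewer a ragged final paragraph. The sketch below is one possible refinement, still character-based and not a token counter: `truncate_for_review` is a hypothetical helper that backs up to the last paragraph break or sentence end inside the budget. For real token counts, the Anthropic SDK has a token-counting call (I believe it is `client.messages.count_tokens`, but check the current API reference before relying on it).

```python
def truncate_for_review(text: str, max_chars: int = 4000) -> str:
    """Trim text to at most max_chars, preferring a paragraph or sentence boundary."""
    if len(text) <= max_chars:
        return text
    window = text[:max_chars]
    # Try a paragraph break first, then a sentence end
    for boundary in ("\n\n", ". "):
        cut = window.rfind(boundary)
        # Only accept a boundary in the second half, so we never drop most of the text
        if cut > max_chars // 2:
            return window[: cut + len(boundary)].rstrip()
    # No usable boundary: fall back to the plain slice
    return window.rstrip()


draft = ("This sentence is thirty chars. " * 200).strip()
trimmed = truncate_for_review(draft, max_chars=4000)
print(len(trimmed), trimmed[-12:])
```

In get_claude_verdict this would replace the text[:4000] expression.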

Integrating Into Your Publishing Workflow

Git Hook (local pre-commit)

Save as .git/hooks/pre-commit and run chmod +x:

```bash
#!/bin/bash
# Pre-commit hook: score any markdown files staged for commit

STAGED_MD=$(git diff --cached --name-only --diff-filter=ACM | grep '\.md$')

if [ -z "$STAGED_MD" ]; then
  exit 0
fi

echo "Running content quality check..."

# Read one path per line so filenames with spaces survive
while IFS= read -r FILE; do
  RESULT=$(python score_content.py "$FILE" --json-output 2>/dev/null)
  VERDICT=$(echo "$RESULT" | python -c "import sys,json; print(json.load(sys.stdin)['final_verdict'])" 2>/dev/null)
  SCORE=$(echo "$RESULT" | python -c "import sys,json; print(json.load(sys.stdin)['local_score'])" 2>/dev/null)

  if [ -z "$VERDICT" ]; then
    # No parseable JSON, e.g. the file is under 100 words
    echo "SKIPPED: $FILE (could not be scored)"
  elif [ "$VERDICT" = "REJECT" ]; then
    echo "❌ BLOCKED: $FILE (score: $SCORE), verdict: $VERDICT"
    echo "Fix the content issues before committing."
    exit 1
  elif [ "$VERDICT" = "REVIEW" ]; then
    echo "⚠️ FLAGGED: $FILE (score: $SCORE), needs review before publishing"
  else
    echo "✅ CLEARED: $FILE (score: $SCORE)"
  fi
done <<< "$STAGED_MD"

exit 0
```

GitHub Actions (CI gate)

Add .github/workflows/content-quality.yml:

```yaml
name: Content Quality Gate

on:
  pull_request:
    paths:
      - 'content/**/*.md'
      - 'posts/**/*.md'

jobs:
  quality-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # full history so the diff against origin/main works

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: pip install anthropic click rich python-dotenv

      - name: Get changed markdown files
        id: changed
        run: |
          # Skip deleted files and join paths on one line for GITHUB_OUTPUT
          FILES=$(git diff --name-only --diff-filter=ACM origin/main...HEAD | grep '\.md$' | tr '\n' ' ' || true)
          echo "files=$FILES" >> "$GITHUB_OUTPUT"

      - name: Score content files
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          for FILE in ${{ steps.changed.outputs.files }}; do
            python score_content.py "$FILE" --threshold 75 || exit 1
          done
```

Tuning for Your Workflow

Default thresholds are conservative: reject below 60, flag 60–79, approve at 80+. These won't fit every use case.

Lower your threshold for:

Technical tutorials (structured language scores lower on lexical diversity)
Listicles (short sentences = lower variance)
Non-native English writers (different style from training data)

Raise your threshold for:

Brand content
Opinion pieces
Company blog posts

The --threshold flag lets you adjust per-run. For batch jobs, run with --skip-claude to speed things up — just use pattern matching and structural analysis.
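For batch jobs that mix content types, one option is to look the threshold up from the file's location and pass it to --threshold. This is only a sketch under assumptions: the section folder names and the numbers in `THRESHOLDS_BY_SECTION` are hypothetical placeholders, not recommended values, and `threshold_for` is not part of the CLI above.

```python
from pathlib import PurePosixPath

DEFAULT_THRESHOLD = 80

# Hypothetical folder names and values; replace with your own after calibration
THRESHOLDS_BY_SECTION = {
    "tutorials": 70,  # structured language scores lower on lexical diversity
    "listicles": 70,  # short sentences mean lower variance
    "brand": 85,
    "opinion": 85,
}


def threshold_for(path: str) -> int:
    """Return the threshold of the first path segment that names a known section."""
    for part in PurePosixPath(path).parts:
        if part in THRESHOLDS_BY_SECTION:
            return THRESHOLDS_BY_SECTION[part]
    return DEFAULT_THRESHOLD


print(threshold_for("content/tutorials/intro-to-regex.md"))  # 70
print(threshold_for("posts/weekly-update.md"))               # 80 (default)
```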

Start conservative. Run 50 articles through the scorer, measure how many actually needed human fixes, then adjust. Within a week you'll have thresholds that catch real problems without flooding your review queue.
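That calibration pass can be as simple as a table of misses and false flags per candidate threshold. The sketch below assumes you have recorded, for each article, its local score and whether a human actually had to fix it; `calibrate` is a hypothetical helper and the six sample records are invented purely to show the output shape.

```python
def calibrate(samples: list[tuple[int, bool]], candidates=range(60, 95, 5)) -> list[dict]:
    """For each candidate threshold, count misses and false flags.

    samples: (local_score, needed_human_fix) pairs from a manual review pass.
    missed: scored at or above the threshold but still needed fixing.
    false_flags: scored below the threshold but was fine as written.
    """
    rows = []
    for threshold in candidates:
        missed = sum(1 for score, needed_fix in samples
                     if score >= threshold and needed_fix)
        false_flags = sum(1 for score, needed_fix in samples
                          if score < threshold and not needed_fix)
        rows.append({"threshold": threshold, "missed": missed, "false_flags": false_flags})
    return rows


# Invented records for illustration only
samples = [(92, False), (85, False), (78, True), (74, False), (66, True), (58, True)]
for row in calibrate(samples):
    print(row)
```

Pick the threshold where misses stay at or near zero without the false-flag count filling your review queue.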

The scorer can't replace editorial judgment. What it does is spare you from reading the obvious mediocrity, so you get that Tuesday morning back.

Follow for more practical AI and productivity content.

DEV Community