Aakash Gour

Posted on Jun 22

Building a Content Quality Scoring System: How PostAll Ensures Output Standards

#javascript #tutorial #python #ai

PostAll generates articles fast. Speed was never the hard part.

Three weeks after we shipped bulk generation, a beta customer flagged that 40 of their 200 generated articles followed the exact same structural skeleton — same opening hook, same three-section breakdown, same one-line conclusion, just different nouns swapped in. Every article was technically unique. None of them felt like 200 different pieces of content.

That's the day I started building a quality gate that sits between generation and delivery. Not a readability checker bolted on as an afterthought — an actual scoring pipeline that catches uniqueness drift, readability mismatches, and basic SEO failures before a single article reaches a customer's CMS.

This is that system, the version that's running in production right now, and the two earlier versions that didn't work.

Why this is harder than it looks

Readability checkers exist (Hemingway App has been around forever). Plagiarism checkers exist. SEO checklists exist. The hard part isn't building any one of these — it's that none of them, alone, catches "this sounds like every other article we generated this week."

Uniqueness, readability, and SEO compliance pull in different directions. Content can ace a Flesch reading score and still sound robotic. It can pass a plagiarism check (zero copied sentences) and still be structurally identical to your last fifty articles. Combining three signals into one number — and deciding what to do when that number is too low — turned out to be the actual problem.

What you'll build

By the end of this, you'll have a scoring pipeline that takes a piece of generated content and returns something like this:

{
  "composite_score": 80.0,
  "passed": true,
  "breakdown": {
    "readability": 92.0,
    "uniqueness": 71.0,
    "seo": 80.0
  }
}

Three independent checks, weighted into one decision: ship it, or send it back for regeneration.

Prerequisites

Node.js 18+ (for native fetch, no extra HTTP client needed)
Python 3.10+
Basic familiarity with HTTP requests between services
No paid API keys required — every check here runs locally

The setup

pip install fastapi uvicorn textstat sentence-transformers

That's it on the Python side. The Node side has zero dependencies for this — we're using built-in fetch.

Step 1: Readability scoring

I used textstat instead of writing my own Flesch Reading Ease implementation, because reinventing a 1948 readability formula has approximately zero upside.

The non-obvious part: "more readable" isn't the goal. A Flesch score of 95 means the content reads like a children's book. For B2B blog content, that's just as wrong as a score of 20. We score distance from a target band, not raw ease.

# scorer/readability.py
from fastapi import FastAPI
from pydantic import BaseModel
import textstat

app = FastAPI()

class ContentPayload(BaseModel):
    text: str

@app.post("/score/readability")
def score_readability(payload: ContentPayload):
    text = payload.text
    ease = textstat.flesch_reading_ease(text)

    # 50-70 spans "fairly difficult" (50-60) to "standard" (60-70) on the
    # Flesch scale. It's the band that matches our actual blog audience
    # (technical, not casual)
    target_band = (50, 70)

    if target_band[0] <= ease <= target_band[1]:
        readability_score = 100
    else:
        distance = min(abs(ease - target_band[0]), abs(ease - target_band[1]))
        readability_score = max(0, 100 - (distance * 2))

    return {
        "raw_flesch_score": ease,
        "readability_score": round(readability_score, 1),
        "grade_level": textstat.flesch_kincaid_grade(text),
    }

Run it with uvicorn scorer.readability:app --reload --port 8000.

Step 2: Uniqueness scoring (the version that shipped first)

Here's the version I actually deployed initially, because it was fast to build and I wanted to see if uniqueness scoring would even move the needle before investing more time.

# scorer/uniqueness_v1.py — shipped first, broke within two weeks
recent_content_hashes = []  # in-memory; fine until the process restarts

def get_shingles(text, size=5):
    words = text.lower().split()
    return set(" ".join(words[i:i+size]) for i in range(len(words) - size + 1))

def jaccard_similarity(set_a, set_b):
    if not set_a or not set_b:
        return 0
    intersection = len(set_a & set_b)
    union = len(set_a | set_b)
    return intersection / union

def check_uniqueness(text):
    shingles = get_shingles(text)
    max_similarity = 0

    for past_shingles in recent_content_hashes[-50:]:  # only the last 50 — see below
        max_similarity = max(max_similarity, jaccard_similarity(shingles, past_shingles))

    recent_content_hashes.append(shingles)
    return round((1 - max_similarity) * 100, 1)

This is 5-word shingling with Jaccard similarity — literal overlapping word sequences. It's cheap, it runs in milliseconds, and it worked great for catching exact copy-paste duplication.

It fell apart on paraphrasing. PostAll's generation model would reword a sentence completely — same idea, zero literal overlap — and the shingle check would score it as 100% unique. Meanwhile, two articles about closely related but genuinely different products (think "wireless earbuds" vs. "wireless headphones") would share enough common phrasing to get flagged as near-duplicates, even though a human would never confuse them.
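Both failure modes are easy to reproduce. Here's a minimal sketch that re-declares the two helpers from above so it runs on its own; the three sentences are made up purely for illustration and are far shorter than a real article.

```python
def get_shingles(text, size=5):
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard_similarity(set_a, set_b):
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical sentences, written only to illustrate the two failure modes
original = "our wireless earbuds deliver eight hours of battery life on a single charge"
paraphrase = "you get a full workday of listening from these earbuds before they need power"
related = "our wireless headphones deliver eight hours of battery life on a single charge"

# Same idea, reworded: no 5-word sequence survives, so it looks fully unique
paraphrase_overlap = jaccard_similarity(get_shingles(original), get_shingles(paraphrase))

# Different product, shared boilerplate: one swapped noun still leaves 6 of 12 shingles in common
related_overlap = jaccard_similarity(get_shingles(original), get_shingles(related))

print(paraphrase_overlap)  # 0.0
print(related_overlap)     # 0.5
```

The paraphrase gets a uniqueness score of 100 and the different-product sentence gets 50, which is the wrong way around.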

The fix was switching the comparison from literal text overlap to semantic similarity, using sentence embeddings instead of shingles:

# the part that actually mattered — swapping shingles for embeddings
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer('all-MiniLM-L6-v2')  # runs locally, no API key

def cosine_similarity(vec_a, vec_b):
    return np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))

Same overall structure as before — embed the new text, compare it against the last N embeddings instead of the last N shingle sets, take the max similarity. The check went from "do these share words" to "do these mean the same thing," which is what we actually cared about.
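To make "same overall structure" concrete, here is a rough sketch of what the embedding version of the check could look like. It is not the production code. The `UniquenessChecker` class, the `window` parameter, and the injectable `embed` callable are names I'm assuming for illustration, and the similarity math is plain Python so the example runs without numpy or a model download. With the real model you would pass `model.encode` as `embed`.

```python
import math
from collections import deque
from typing import Callable, Deque, Sequence

Vector = Sequence[float]


def cosine_similarity(vec_a: Vector, vec_b: Vector) -> float:
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm = math.sqrt(sum(a * a for a in vec_a)) * math.sqrt(sum(b * b for b in vec_b))
    return dot / norm if norm else 0.0  # zero vector: treat as no similarity


class UniquenessChecker:
    """Scores new text against the last `window` embeddings it has seen."""

    def __init__(self, embed: Callable[[str], Vector], window: int = 50):
        self.embed = embed
        self.recent: Deque[Vector] = deque(maxlen=window)  # oldest entries drop off

    def check(self, text: str) -> float:
        vector = self.embed(text)
        max_similarity = max(
            (cosine_similarity(vector, past) for past in self.recent), default=0.0
        )
        self.recent.append(vector)
        # Clamp to [0, 1]: cosine similarity can be negative, which would push the score past 100
        return round((1 - max(0.0, min(1.0, max_similarity))) * 100, 1)


# Stand-in embedder for the demo: made-up 2-d vectors, not real model output
fake_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "a again": [1.0, 0.0]}
checker = UniquenessChecker(embed=fake_vectors.__getitem__)

first = checker.check("a")        # nothing to compare against yet
second = checker.check("b")       # orthogonal to "a"
third = checker.check("a again")  # same direction as "a"
print(first, second, third)       # 100.0 100.0 0.0
```

Using a bounded `deque` also caps memory, which the in-memory list in the first version never did.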

Step 3: SEO scoring

This part lives in the Node layer, since that's where the rendered HTML already exists before it gets handed off.

// scorer/seo-check.js
function scoreSEO(html, targetKeyword) {
  const checks = {
    hasH1: /<h1[^>]*>/.test(html),
    hasH2: (html.match(/<h2[^>]*>/g) || []).length >= 2,
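    // Escape regex metacharacters so keywords like "node.js" or "c++" match literally
    keywordInH1: new RegExp(targetKeyword.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'), 'i').test(extractTag(html, 'h1')),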
    metaLength: getMetaDescription(html).length,
  };

  const keywordDensity = getKeywordDensity(html, targetKeyword);

  let score = 0;
  if (checks.hasH1) score += 20;
  if (checks.hasH2) score += 20;
  if (checks.keywordInH1) score += 20;
  // Google stopped caring about exact meta length years ago, but a
  // truncated description still looks bad in actual search results
  if (checks.metaLength >= 120 && checks.metaLength <= 158) score += 15;
  // Above ~3% density reads as keyword stuffing to engines and humans alike
  if (keywordDensity >= 0.5 && keywordDensity <= 3) score += 25;

  return { score, checks, keywordDensity };
}

function extractTag(html, tag) {
  // [\s\S] instead of . so a heading that wraps across lines still matches
  const match = html.match(new RegExp(`<${tag}[^>]*>([\\s\\S]*?)</${tag}>`, 'i'));
  return match ? match[1] : '';
}

function getMetaDescription(html) {
  const match = html.match(/<meta name="description" content="([^"]*)"/i);
  return match ? match[1] : '';
}

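function getKeywordDensity(html, keyword) {
  // Split on anything that isn't a letter or digit so "earbuds," still counts
  const tokenize = s => s.toLowerCase().split(/[^\p{L}\p{N}]+/u).filter(Boolean);
  const words = tokenize(html.replace(/<[^>]+>/g, ' '));
  const phrase = tokenize(keyword);
  if (words.length === 0 || phrase.length === 0) return 0; // avoid dividing by zero

  // Slide over the words and count whole-phrase matches. A per-word
  // includes() check can never match a keyword that contains a space.
  let keywordCount = 0;
  for (let i = 0; i + phrase.length <= words.length; i++) {
    if (phrase.every((w, j) => words[i + j] === w)) keywordCount++;
  }
  return (keywordCount / words.length) * 100;
}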

module.exports = { scoreSEO };

Nothing exotic here — regex against the rendered HTML. The only reason this lives in Node instead of Python is that it's the layer that already has the HTML in memory; round-tripping it to Python would just add latency for no benefit.

Step 4: The composite score

This is the part that actually decides whether content ships.

// scorer/orchestrator.js
const { scoreSEO } = require('./seo-check');

// Uniqueness gets the highest weight on purpose. Readability and SEO
// problems are fixable after the fact with light editing. "This sounds
// like everything else we generated" is the failure mode that loses customers.
const WEIGHTS = { readability: 0.3, uniqueness: 0.4, seo: 0.3 };
const PASS_THRESHOLD = 75;

// POST JSON to the scoring service and fail loudly on a non-2xx response,
// so a scorer outage can't quietly turn into NaN scores downstream
async function postJSON(url, body) {
  const res = await fetch(url, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(body),
  });
  if (!res.ok) throw new Error(`${url} responded with ${res.status}`);
  return res.json();
}

async function evaluateContent(html, plainText, targetKeyword) {
  const [readabilityRes, uniquenessRes] = await Promise.all([
    postJSON('http://localhost:8000/score/readability', { text: plainText }),
    postJSON('http://localhost:8000/score/uniqueness', { text: plainText }),
  ]);

  const seoRes = scoreSEO(html, targetKeyword);

  const composite =
    readabilityRes.readability_score * WEIGHTS.readability +
    uniquenessRes.uniqueness_score * WEIGHTS.uniqueness +
    seoRes.score * WEIGHTS.seo;

  return {
    composite_score: Math.round(composite * 10) / 10,
    passed: composite >= PASS_THRESHOLD,
    breakdown: {
      readability: readabilityRes.readability_score,
      uniqueness: uniquenessRes.uniqueness_score,
      seo: seoRes.score,
    },
  };
}

module.exports = { evaluateContent };

Two parallel HTTP calls to the Python service, one local SEO check, weighted average, threshold gate. That's the whole decision layer.
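If you want to sanity-check the arithmetic outside the Node layer, the weighted average is easy to mirror in a few lines of Python. This is only a worked example using the same weights and threshold; the `evaluate` helper and the two breakdowns are made up for illustration.

```python
WEIGHTS = {"readability": 0.3, "uniqueness": 0.4, "seo": 0.3}
PASS_THRESHOLD = 75


def evaluate(breakdown: dict) -> dict:
    # Weighted sum of the three sub-scores, each on a 0-100 scale
    composite = sum(breakdown[name] * weight for name, weight in WEIGHTS.items())
    return {"composite_score": round(composite, 1), "passed": composite >= PASS_THRESHOLD}


# Hypothetical breakdowns, not real PostAll output
solid = evaluate({"readability": 100, "uniqueness": 60, "seo": 80})
repetitive = evaluate({"readability": 100, "uniqueness": 30, "seo": 100})

print(solid)       # {'composite_score': 78.0, 'passed': True}   (30 + 24 + 24)
print(repetitive)  # {'composite_score': 72.0, 'passed': False}  (30 + 12 + 30)
```

The second case is the one the weighting exists for: perfect readability and SEO can't rescue an article that reads like its neighbors.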

What can go wrong

Flesch scores penalize technical vocabulary, not just bad writing. A correctly written article about Kubernetes will score "harder to read" than one about cooking, purely because of word length. If you're generating content across varied technical domains, use per-category target bands instead of one global band — what counts as "readable" for a beginner JavaScript tutorial isn't the same target as a database internals deep-dive.
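One way to sketch per-category bands is to keep the Step 1 distance logic and look the band up by category. The category names and band values below are placeholders I picked for illustration, not measured targets; only the 50-70 default comes from Step 1.

```python
from typing import Dict, Tuple

# Hypothetical categories and bands, for illustration only; tune them against real content
TARGET_BANDS: Dict[str, Tuple[float, float]] = {
    "beginner_tutorial": (60, 80),
    "general_blog": (50, 70),
    "deep_dive": (30, 55),
}
DEFAULT_BAND = (50, 70)  # the single global band from Step 1


def band_score(ease: float, category: str) -> float:
    """Score a raw Flesch Reading Ease value against its category's band."""
    low, high = TARGET_BANDS.get(category, DEFAULT_BAND)
    if low <= ease <= high:
        return 100.0
    # Same penalty as Step 1: two points per Flesch point outside the band
    distance = min(abs(ease - low), abs(ease - high))
    return max(0.0, 100.0 - distance * 2)


print(band_score(42, "deep_dive"))     # 100.0
print(band_score(42, "general_blog"))  # 84.0
```

The same Flesch score of 42 is on target for a deep-dive and eight points outside the band for a general blog post.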

Embedding-based uniqueness checks add real latency. Loading all-MiniLM-L6-v2 and running inference adds roughly 150-200ms per check on a CPU, on top of the network round-trip. At low volume that's invisible. At 500 articles/hour, it's a meaningful chunk of your pipeline's total time. Batch the embedding calls if you're checking multiple pieces of content at once instead of doing one request per article.

Automatic regeneration on failure can spiral. If failed content automatically triggers a regeneration attempt, the regenerated content fails again, and nothing caps the retries, you can burn API costs on a loop that never resolves. Cap it at two attempts in total (the first generation plus one regeneration), then route to human review instead of retrying indefinitely.
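A capped retry loop could look roughly like this. It's a sketch under assumptions: `generate` and `evaluate` are stand-ins for whatever calls your model and your scoring gate, and the temperature numbers are placeholders, not values from PostAll.

```python
from typing import Callable, Dict

MAX_ATTEMPTS = 2  # two attempts in total, then a human takes over


def generate_with_gate(
    generate: Callable[[float], str],
    evaluate: Callable[[str], Dict],
    base_temperature: float = 0.7,  # hypothetical value
    temperature_step: float = 0.2,  # hypothetical value
) -> Dict:
    content, result = "", {"passed": False}
    for attempt in range(MAX_ATTEMPTS):
        # Each retry samples at a higher temperature than the previous attempt
        temperature = base_temperature + attempt * temperature_step
        content = generate(temperature)
        result = evaluate(content)
        if result["passed"]:
            return {"status": "ship", "attempts": attempt + 1, "content": content, "result": result}
    # Retries exhausted: hand off to review instead of looping
    return {"status": "human_review", "attempts": MAX_ATTEMPTS, "content": content, "result": result}


# Stubs standing in for the real generator and scorer
temperatures_used = []


def stub_generate(temperature: float) -> str:
    temperatures_used.append(temperature)
    return "draft"


outcome = generate_with_gate(stub_generate, lambda text: {"passed": False})
print(outcome["status"], outcome["attempts"])  # human_review 2
```

Because the loop is bounded by a constant, the worst case is always two generation calls per article, which keeps the cost of a bad batch predictable.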

The pass threshold is a guess until you have data. I started at 75 because it felt reasonable, not because I'd measured anything. Track your false-positive rate (good content getting rejected) for at least a couple weeks before trusting the number you picked on day one.
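Tracking that rate doesn't need much machinery. Here's a minimal sketch, assuming you log each article's composite score next to a human verdict; the `false_positive_rate` helper and the sample log are invented for illustration.

```python
from typing import List, Tuple


def false_positive_rate(samples: List[Tuple[float, bool]], threshold: float) -> float:
    """Share of human-approved articles that the gate would have rejected."""
    good_scores = [score for score, human_approved in samples if human_approved]
    if not good_scores:
        return 0.0
    rejected = sum(1 for score in good_scores if score < threshold)
    return rejected / len(good_scores)


# Made-up review log: (composite_score, did a human reviewer approve it?)
review_log = [
    (82.0, True),
    (74.0, True),
    (71.5, True),
    (68.0, False),
    (90.1, True),
    (60.0, False),
]

# Sweep a few candidate thresholds to see what each would cost in good content
for candidate in (70, 75, 80):
    print(candidate, false_positive_rate(review_log, candidate))
# 70 0.0
# 75 0.5
# 80 0.5
```

In this toy log, a threshold of 75 rejects half of the articles a reviewer would have approved, while 70 rejects none of them. A few weeks of real verdicts gives you the same kind of table for your own threshold.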

Where it is now

This gate runs on every article PostAll generates before it reaches a customer. About 9% of content gets flagged on the first pass — and uniqueness, not readability, is the reason in the large majority of those cases. Failed content gets one automatic regeneration attempt at a higher temperature; a second failure routes to a human review queue instead of silently going out.

The embedding round-trip adds roughly 340ms of p95 latency per article. At our current volume that's an acceptable tradeoff. If we get to a point where it isn't, batching the embedding calls is the next obvious lever.

The part I didn't expect: readability has been almost a non-issue. Uniqueness drift is the failure mode that actually shows up at scale, which tells you something about what bulk generation degrades first.

If you're building something similar — have you found a better signal than embedding similarity for catching "this sounds the same" without flagging legitimately related content? I'm not fully convinced the threshold I'm using is right yet, and I'd genuinely like to know what's worked for you.
