MORINAGA

Posted on Jun 28

How I built a content quality gate that stops bad articles before they publish

#webdev #indiehackers #showdev #programming

I run three directory sites and a content pipeline that generates and cross-posts articles to Dev.to, Hashnode, and Bluesky automatically. The pipeline has been running for about six weeks. Early on I found a category of failure that no amount of CI infrastructure was catching: content quality problems. Wrong tags. Cliché phrases that slipped past self-review. Articles that implied specific traffic metrics I couldn't back up. Fabricated specificity disguised as honest reporting.

The solution was scripts/audit-articles.mjs, a lint-style quality gate that runs on every new article before the publish step. It works the way ESLint works for code: structured checks, a clear error/warning distinction, strict mode for pre-publish, lenient mode for a historical baseline.

Why a lint gate and not a manual review

The specific failure mode I was trying to prevent was this: automated generation leaves an article that reads fine on first pass but fails on systematic inspection. A cliché phrase at the start of a section. The tag "seo" slipping in when the pool explicitly forbids it. A word count of 580 when the spec requires 600-900 for a lightweight article. These aren't hard to catch — they're tedious to catch every single time, and tedium is where manual review degrades.

The pre-post Bluesky QC gate I built earlier applies the same principle to social posts: systematic gates reliably catch what self-review misses. For articles, the gate runs before the publish workflow can touch the file.

The JSON-LD audit script runs a similar check post-deploy against live pages. audit-articles.mjs runs pre-commit, against local markdown. Catching a problem before it ships is always cheaper than chasing it down after.

What the gate checks

The script runs about 12 distinct checks per file, split into errors (fail hard in strict mode) and warnings (report but continue by default).

Frontmatter structure. Four required keys must be present: title, description, tags, publish_to. Missing any of them is an error. publish_to must contain only known targets. tags must be an array of exactly 4 items, each drawn from an explicit pool of 18 allowed values:

const TAG_POOL = new Set([
  "ai", "astro", "webdev", "showdev", "typescript", "javascript",
  "indiehackers", "productivity", "opensource", "programming", "tutorial",
  "machinelearning", "claude", "anthropic", "vercel", "turso", "tailwindcss",
  "githubactions",
]);

Any tag outside the pool is an error, not a warning. The pool exists to prevent both "seo" (explicitly prohibited in the spec) and organic drift toward one-off tags that don't build topic coherence. The editorial layer over programmatic content has a similar constraint: structure imposed at ingestion time is easier to maintain than structure enforced by convention.
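A minimal sketch of how the frontmatter checks above could be expressed. The target identifiers in KNOWN_TARGETS and the function name checkFrontmatter are assumptions for illustration, and the tag pool is abbreviated to four entries to keep the example short:

```javascript
// Assumed identifiers: the article only says publish_to must contain known targets.
const KNOWN_TARGETS = new Set(["devto", "hashnode", "bluesky"]);
const REQUIRED_KEYS = ["title", "description", "tags", "publish_to"];

// Abbreviated stand-in for the full 18-entry pool.
const TAG_POOL_SAMPLE = new Set(["webdev", "indiehackers", "showdev", "programming"]);

function checkFrontmatter(meta, tagPool = TAG_POOL_SAMPLE) {
  const errors = [];

  // Every required key must be present.
  for (const key of REQUIRED_KEYS) {
    if (meta[key] === undefined) errors.push(`missing required key: ${key}`);
  }

  // publish_to may only name known targets.
  const targets = Array.isArray(meta.publish_to) ? meta.publish_to : [];
  for (const target of targets) {
    if (!KNOWN_TARGETS.has(target)) errors.push(`unknown publish target: ${target}`);
  }

  // tags must be exactly 4 items, each drawn from the pool.
  if (meta.tags !== undefined) {
    if (!Array.isArray(meta.tags) || meta.tags.length !== 4) {
      errors.push("tags must be an array of exactly 4 items");
    } else {
      for (const tag of meta.tags) {
        if (!tagPool.has(tag)) errors.push(`tag outside pool: ${tag}`);
      }
    }
  }
  return errors;
}
```

Returning a list of messages instead of throwing lets one run report every problem in a file at once.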

Title and description length. Titles over 90 characters are an error — Dev.to and Hashnode truncate feed titles beyond that. Descriptions over 200 characters are an error — the meta description budget for Google and Hashnode display.
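The length limits reduce to two comparisons. This sketch uses the 90 and 200 character limits stated above; the function name checkLengths and the message wording are illustrative assumptions:

```javascript
const TITLE_MAX = 90;
const DESCRIPTION_MAX = 200;

function checkLengths(meta) {
  const errors = [];
  const title = meta.title ?? "";
  const description = meta.description ?? "";

  // String length counts UTF-16 code units, which is close enough for mostly-ASCII titles.
  if (title.length > TITLE_MAX) {
    errors.push(`title is ${title.length} chars (max ${TITLE_MAX})`);
  }
  if (description.length > DESCRIPTION_MAX) {
    errors.push(`description is ${description.length} chars (max ${DESCRIPTION_MAX})`);
  }
  return errors;
}
```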

Word count. This is a warning rather than a hard error because the threshold varies by article archetype. A lightweight at 580 words isn't technically failing a hard constraint, but the flag forces a decision: add a section, or accept the shorter count and justify it. The word count always appears in the summary line regardless, so there's no hiding from it.
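A rough sketch of the word count check, assuming whitespace-separated tokens count as words and that the caller passes the minimum for the article's archetype (600 for a lightweight, per the spec mentioned earlier). The names countWords and checkWordCount are hypothetical:

```javascript
function countWords(body) {
  // Split on any run of whitespace; filter(Boolean) drops the empty token an empty body produces.
  return body.trim().split(/\s+/).filter(Boolean).length;
}

function checkWordCount(body, minWords = 600) {
  const wc = countWords(body);
  const warnings = [];
  if (wc < minWords) {
    warnings.push(`word count ${wc} is below the ${minWords}-word minimum`);
  }
  // wc is returned even when there is no warning so the summary line can always print it.
  return { wc, warnings };
}
```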

Cliché detection. 14 literal phrases are checked case-insensitively against the full body. The list covers diving metaphors, fast-paced-world openers, hyperbolic tech superlatives, and in-this-article throat-clearing — the standard set of AI-assisted writing tics. I won't reproduce all 14 here for an ironic reason I'll cover in "What I'd do differently."

The check is a plain String.prototype.includes call, not a regex, because the phrases are specific enough that false positives aren't a real concern. An article using the typical "world of AI" framing is probably using clichéd structure; the check forcing me to see it is the point.
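Here is a sketch of that lookup, assuming both the body and the phrases are lowercased before comparison. The two phrases are neutral placeholders, not entries from the real list, because real entries inside a code block would trip the gate on this article:

```javascript
// Placeholder entries only; the real list holds 14 phrases.
const CLICHE_PHRASES = ["placeholder opener", "placeholder superlative"];

function findCliches(body, phrases = CLICHE_PHRASES) {
  // Lowercase once, then do literal substring checks.
  const haystack = body.toLowerCase();
  return phrases.filter((phrase) => haystack.includes(phrase.toLowerCase()));
}
```

Returning the matched phrases rather than a boolean makes the error message point at the exact wording to remove.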

Fabricated metric detection. Three regex patterns catch the most common forms of fabricated social proof:

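const FABRICATED_METRIC_PATTERNS = [
  // A number of four or more digits (with or without thousands separators) plus a social-proof noun.
  /\b(\d{1,3}(?:,\d{3})+|\d{4,})\s+(visit(?:or)?s?|views?|readers?|users?|subscribers?)\b/gi,
  // First-place ranking claims.
  /\branked #1\b/gi,
  // Vague million-scale audience claims.
  /\bmillions?\s+of\s+(?:users?|readers?|developers?)/gi,
];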

The first pattern catches large numbers followed by social-proof nouns like "visitors", "users", or "subscribers": numbers presented as fact when the sites are young enough that no credible claim to that kind of traffic exists. It requires a number of four or more digits followed directly by a social-proof noun, so it catches claimed traffic figures but not most numeric references in code or data discussion. The other two patterns catch first-place ranking claims and vague million-scale audience claims.

This check produces an error, not a warning, because fabricated metrics are the one category I genuinely can't allow. The whole premise of the three sites experiment depends on readers trusting that I'm reporting honestly. One fabricated stat poisons that trust retroactively.
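One way to apply patterns like these and report where each hit sits, as a sketch. It scans line by line so the error can include a line number, which means a claim wrapped across a line break would be missed. The function name findFabricatedMetrics and the exact pattern spellings are assumptions, not a copy of the real script:

```javascript
// Same three ideas as above, written without the g flag so each match call is stateless.
const METRIC_PATTERNS = [
  /\b(?:\d{1,3}(?:,\d{3})+|\d{4,})\s+(?:visit(?:or)?s?|views?|readers?|users?|subscribers?)\b/i,
  /\branked #1\b/i,
  /\bmillions?\s+of\s+(?:users?|readers?|developers?)/i,
];

function findFabricatedMetrics(body) {
  const errors = [];
  body.split("\n").forEach((line, index) => {
    for (const pattern of METRIC_PATTERNS) {
      const match = line.match(pattern);
      // Record the 1-based line number and the matched text for the report.
      if (match) errors.push({ line: index + 1, text: match[0] });
    }
  });
  return errors;
}
```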

Internal and external link presence. For pillar-sized articles (over 1200 words), the check warns if internal links number fewer than 5 — the spec requires 5-10 for pillar archetype. For any article over 600 words, it warns if there are zero external links. The check uses a simple regex:

const internalLinks = (body.match(/\]\(\/[^)]+\)/g) || []).length;
if (wc > 1200 && internalLinks < 5) {
  warnings.push(
    `pillar-sized (${wc} words) has ${internalLinks} internal links — spec requires 5-10`
  );
}
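The external link check can follow the same shape as the internal one. This sketch assumes an external link is a markdown link whose target starts with http:// or https://; the function names and the warning text are illustrative:

```javascript
function countExternalLinks(body) {
  // Markdown links whose target is an absolute http(s) URL.
  return (body.match(/\]\(https?:\/\/[^)]+\)/g) || []).length;
}

function checkExternalLinks(body, wc) {
  const warnings = [];
  if (wc > 600 && countExternalLinks(body) === 0) {
    warnings.push(`${wc} words and no external links`);
  }
  return warnings;
}
```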

The noindex gate for programmatic pages uses a similar "catch before ship" logic at the infrastructure level. The article gate applies it at the content level.

Sentence repetition. Any sentence appearing three or more times in the same article generates a warning. This catches a specific AI-assisted writing failure: a paragraph occasionally gets reformulated twice, and the reformulation ends up identical to the original a few hundred words later. The check normalizes to lowercase and trims whitespace before comparing.
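A sketch of the repetition check under one simplifying assumption: a sentence ends at a period, exclamation mark, or question mark followed by whitespace. The function name findRepeatedSentences is hypothetical, and the real splitter may differ:

```javascript
function findRepeatedSentences(body, threshold = 3) {
  const counts = new Map();

  // Split after sentence-ending punctuation, keeping the punctuation with its sentence.
  for (const raw of body.split(/(?<=[.!?])\s+/)) {
    const sentence = raw.trim().toLowerCase();
    if (!sentence) continue;
    counts.set(sentence, (counts.get(sentence) ?? 0) + 1);
  }

  // Keep only sentences that appear at least `threshold` times.
  return [...counts]
    .filter(([, count]) => count >= threshold)
    .map(([sentence, count]) => ({ sentence, count }));
}
```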

Strict versus lenient mode

The gate has two distinct behaviors depending on how it's called. The relevant primitive is process.exitCode, described in the Node.js documentation: when main() returns 1 and that value becomes the exit code, any downstream step that checks exit codes is blocked.

When run against a single file (node scripts/audit-articles.mjs path/to/article.md) or with --strict, errors cause a non-zero exit code and the publish step stops.

When run against all articles without explicit paths (node scripts/audit-articles.mjs), errors are reported but don't fail the process. This baseline scan mode exists because older articles pre-date some current rules, and a retroactive constraint would block everything. The three-tier content quality ladder applies here: different content tiers have different quality expectations.

The article generation routine runs strict mode on each newly staged file before committing. The publish workflow runs against the specific file being published. The all-articles scan runs periodically to report on historical drift without failing anything.
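The mode decision can be sketched as two small functions. This assumes any explicit path argument implies strict mode (the article only confirms the single-file case) and that each report carries an errors array; isStrict and exitCodeFor are hypothetical names:

```javascript
function isStrict(argv) {
  const args = argv.slice(2); // drop the node binary and the script path
  const paths = args.filter((arg) => !arg.startsWith("--"));
  return args.includes("--strict") || paths.length > 0;
}

function exitCodeFor(reports, { strict }) {
  const hasErrors = reports.some((report) => report.errors.length > 0);
  // Lenient mode reports errors but never fails the process.
  return strict && hasErrors ? 1 : 0;
}
```

In a script's entry point the result would be assigned to process.exitCode rather than passed to process.exit, so pending output still flushes before the process ends.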

Title duplicate detection across the repo

When scanning all articles (not single-file mode), the gate runs a cross-article deduplication check. Titles are normalized to lowercase alphanumeric, whitespace collapsed, then compared:

function detectTitleDuplicates(reports) {
  const titles = new Map();
  for (const [path, r] of reports) {
    if (!r.meta?.title) continue;
    const key = r.meta.title.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
    const existing = titles.get(key) ?? [];
    existing.push(path);
    titles.set(key, existing);
  }
  return [...titles.entries()]
    .filter(([, paths]) => paths.length > 1)
    .map(([key, paths]) => ({ key, paths }));
}

This catches near-identical titles more than byte-for-byte duplicates (which would be obvious): titles that differ only in capitalization, punctuation, or spacing collapse to the same key. Because the Map groups on exact key equality, a genuine rephrase such as "Why I use Turso for my Astro monorepo" versus "Using Turso libSQL in an Astro monorepo" produces two different keys and is not flagged. The check doesn't do semantic similarity. An identical normalized title is enough of a signal to warrant a second look, given how the pipeline-aware content variants approach works.
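To see what the comparison key looks like, the normalization step can be pulled out on its own. normalizeTitle is a hypothetical helper name; the expression is the one used inside detectTitleDuplicates:

```javascript
function normalizeTitle(title) {
  // Lowercase, then collapse every run of non-alphanumeric characters into one space.
  return title.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

// Both variants collapse to the key "why i use turso", so they would be grouped together.
const keyA = normalizeTitle("Why I Use Turso!");
const keyB = normalizeTitle("why i use: turso");
console.log(keyA === keyB); // true
```

Titles are grouped only when their keys are identical, so the amount of fuzziness is exactly what this normalization removes.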

What actually got caught

Since adding the gate, these are the categories it has flagged and that I fixed before publishing:

Tag outside pool: twice. Both times were "seo" appearing in generated frontmatter. Reliable pattern: when generating frontmatter from a prompt, the model includes obviously relevant tags that are explicitly off-limits in the spec.

Word count too low: three lightweight articles came in at 540-580 words. Two got an additional section and reached spec. One I accepted shorter — the content was complete and adding words would have been padding.

Fabricated metric match: once. The generated draft said "early testers report" something in a way that parsed as a quantified claim. Fix was removing the number.

Missing external link: regularly. Generated articles sometimes make claims about tools or APIs without citing the primary source. About half the time I add a link on review; the other half I rephrase to not make a claim that requires one.

What I'd do differently

The cliché list is static. A smarter approach: add phrases from real failures after the fact. When a specific phrase slips through and I notice it post-publish, add it. The list has grown from 8 to 14 entries through manual additions, but there's no automated feedback loop. One possible improvement: run the gate on published articles periodically and flag new patterns.

The biggest gap: the checker doesn't strip code blocks before running cliché or metric checks. This article itself demonstrates the problem — I had to describe the cliché list in prose rather than including it as a JavaScript literal, because reproducing the exact phrases inside a code block triggers the gate. The fix is straightforward: split the body on triple-backtick fences, check only the prose regions. A two-pass preprocessing step before any pattern check would resolve this entirely. I haven't shipped it yet because the current behavior just means I write around it, which isn't painful enough to force the fix.
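A minimal sketch of that preprocessing step, assuming fences are always balanced and that no prose line contains a literal fence marker. The name stripCodeFences is hypothetical, and the marker is built with repeat so the example can sit inside a fenced block itself:

```javascript
const FENCE = "`".repeat(3);

function stripCodeFences(body) {
  // Splitting on the fence marker leaves prose at even indices and code at odd indices.
  return body
    .split(FENCE)
    .filter((_, index) => index % 2 === 0)
    .join("\n");
}
```

The cliché and metric checks would then run on the returned prose, while word count and link checks could keep using the full body. An unclosed fence would cause everything after it to be treated as code, which errs on the side of skipping checks rather than raising false positives.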

CI integration is partial. The gate runs as part of the article generation routine and as a verify step, but it doesn't run as a blocking step in the publish workflow itself. That means a file committed outside the routine could bypass the gate entirely. The fix is one line in publish-articles.yml — a step I haven't prioritized because the routine path is currently the only real publishing path.

The EEAT transparency pages I built approach content credibility from the site structure side. The lint gate approaches it from the individual article side. Both serve the same goal: content that doesn't embarrass the project in hindsight.

FAQ

Why not use an existing prose linter like alex or write-good?

alex focuses on inclusive language; write-good checks passive voice and weak qualifiers. Neither checks frontmatter structure, tag pools, or fabricated metrics — the pipeline-specific failures I actually need to catch. A domain-specific gate catches domain-specific failures better than a general-purpose tool.

How does the fabricated metric check handle code blocks?

Currently it doesn't exclude them, which is the known gap. Numeric references in code examples occasionally trigger false positives. I review the flagged line location before treating it as a real failure. Pre-processing the body to strip triple-backtick blocks would fix this.

Why warn on internal link count instead of hard-erroring?

Because article type isn't always deterministic from content alone. A 1200-word article might be a pillar or a borderline lightweight. Treating it as a warning preserves the ability to make a judgment call; treating it as a hard error would block edge cases that are genuinely fine.

What happens to historical articles that fail current checks?

Baseline scan mode reports them without failing. Historical content is in a different tier. New articles must pass strict mode; historical articles are known issues reported for visibility, not blocked.

Part of an ongoing 6-month experiment running three AI-curated directory sites. The technical claims here are real; this article was AI-assisted.