Stop Shipping AI Slop: Build an Anti-Slop Harness Around Your LLM

#ai #llm #architecture #engineering

"AI slop" is not a model problem. It's an engineering problem you decided not to solve.

The slop is the bland, off-voice, half-hallucinated, occasionally-just-an-error-message text that your LLM emits maybe 5% of the time — and that 5% is the part users screenshot. The instinct is to fix it in the prompt: add three more sentences of "be concise, be accurate, match my tone." That treats a stochastic system as if it were deterministic. It isn't. You cannot prompt your way to a guarantee.

What actually works is treating the model like any other unreliable upstream dependency: wrap it in a harness that validates, rejects, and retries before anything reaches a user. The model proposes; the harness disposes. Here's how to build one.

Slop is a systems problem, not a prompt problem

Every production LLM feature I've shipped converged on the same shape: the model is one stage in a pipeline, not the pipeline itself. You don't trust raw generation any more than you'd trust raw user input. You parse it, you validate it against constraints you can express in code, and you reject anything that fails — automatically, before a human ever sees it.

The key insight is that most slop is detectable. Empty output, a leaked stack trace, the wrong language, a 900-word answer when you asked for 200, a banned phrase like "in today's fast-paced world" — these are all checkable with deterministic code. You don't need a judge model to catch them (though a judge model has its place at the end). You need a gate that runs on every generation, costs microseconds, and never gets tired.

Think of it as five layers, each rejecting a different class of failure.

Layer 1: Structured output, not freeform text

The single biggest reduction in slop comes from refusing to accept prose where you can demand structure. If you ask for a JSON object with named fields and a schema, the failure modes collapse from "infinite" to "a handful you can enumerate."

Use the provider's native structured-output / tool-calling mode and attach a real schema — Pydantic, Zod, JSON Schema, whatever your stack speaks. This does two things. First, it forces the model to commit to a shape, which kills rambling preambles ("Sure! Here's a great answer for you..."). Second, it gives you a parse step that fails loudly. If the model returns something that doesn't validate, that's not a soft warning — it's a rejected generation that triggers a retry. A parse failure is a quality signal, not an exception to swallow.

The corollary: never wrap your parser in a try/except: pass. A swallowed parse error is slop with the lights turned off.
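As a rough illustration of a parse step that fails loudly, here is a minimal standard-library sketch. The Article shape, the RejectedGeneration exception, and the parse_article helper are assumptions made up for this example; in a real stack a schema library such as Pydantic or Zod would do this work.

```python
import json
from dataclasses import dataclass


class RejectedGeneration(Exception):
    """Hypothetical exception: model output did not match the expected shape."""


@dataclass(frozen=True)
class Article:
    title: str
    body: str


def parse_article(raw: str) -> Article:
    """Parse raw model output into an Article or raise RejectedGeneration."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Prose preambles ("Sure! Here's...") land here.
        raise RejectedGeneration(f"not valid JSON: {exc.msg}") from exc
    if not isinstance(data, dict):
        raise RejectedGeneration("top-level value is not an object")

    problems = []
    for field in ("title", "body"):
        value = data.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"{field}: missing or not a non-empty string")
    extra = sorted(set(data) - {"title", "body"})
    if extra:
        problems.append(f"unexpected fields: {extra}")

    if problems:
        # Named reasons can be fed back into the retry prompt.
        raise RejectedGeneration("; ".join(problems))
    return Article(title=data["title"], body=data["body"])
```

The caller catches RejectedGeneration in exactly one place, the retry loop, and treats it as a failed attempt rather than something to hide.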

Layer 2: Reject error strings the model smuggles through

This one surprises people. Models are trained on a huge amount of web text, which includes a lot of error messages, apology boilerplate, and refusal language. Under pressure (ambiguous input, a retrieval miss, a truncated context) the model will sometimes emit text that is syntactically valid but semantically garbage: "I'm sorry, I cannot access that file," "Error: undefined," "As an AI language model, I don't have the ability to...," or a half-rendered template with {{variable}} still in it.

Structured output won't catch these, because they fit the schema fine. You need an explicit denylist of error-shaped strings and patterns, checked against every field. It's crude and it works. Maintain it like you maintain a spam filter — every time a new flavor of garbage reaches production, it earns a line in the rejection list.
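A minimal sketch of "checked against every field", assuming the parsed output is available as a plain dict of field name to string. The patterns and the error_shaped_fields helper are illustrative, not a complete denylist.

```python
import re

# Illustrative patterns only; grow this list from real production incidents.
ERROR_SHAPES = [
    re.compile(p, re.IGNORECASE | re.MULTILINE)
    for p in (
        r"\bas an ai language model\b",
        r"\bi(?:'m| am) sorry, (?:but )?i (?:cannot|can't)\b",
        r"^\s*error:\s",       # a line that starts like an error message
        r"\{\{.*?\}\}",        # leaked template tokens
    )
]


def error_shaped_fields(fields: dict[str, str]) -> list[str]:
    """Return one failure message per (field, pattern) hit; empty means clean."""
    return [
        f"{name}: matches /{pat.pattern}/"
        for name, value in fields.items()
        for pat in ERROR_SHAPES
        if pat.search(value)
    ]
```

Anchoring the "error:" pattern to the start of a line is a deliberate trade-off in this sketch: it lets ordinary prose mention errors while still catching output that is itself an error message.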

Layer 3: Voice and constraint checks

This is where you encode the things that make output yours rather than generic. Most of it is deterministic and cheap:

Length bounds. A word or token range per field. Reject the 900-word answer and the one-liner.
Banned phrases. The motivational-closer clichés, the "delve," the emoji clusters, the corporate hedging. A regex pass.
Required language. If you build bilingual TR/EN tooling like I do, you check that a Turkish response is actually in Turkish — a quick script-ratio or language-ID check catches the model code-switching mid-paragraph.
Format invariants. Markdown headings present, no leaked system-prompt fragments, no placeholder tokens.
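For the required-language check, here is a deliberately crude sketch that counts stopword hits per paragraph. The stopword sets, the dominant_language and language_fails helpers, and the paragraph split on blank lines are all assumptions for illustration; a proper language-ID library would be more reliable than this heuristic.

```python
import re

# Tiny, hand-picked stopword sets; assumptions for this sketch only.
TR_STOPWORDS = {"ve", "bir", "bu", "için", "ile", "de", "da", "çok", "ama", "gibi", "daha", "değil"}
EN_STOPWORDS = {"the", "and", "of", "to", "is", "in", "that", "for", "with", "this", "it", "are"}


def dominant_language(paragraph: str) -> str:
    """Guess 'tr', 'en', or 'unknown' from stopword counts."""
    words = re.findall(r"[^\W\d_]+", paragraph.lower())
    tr = sum(w in TR_STOPWORDS for w in words)
    en = sum(w in EN_STOPWORDS for w in words)
    if tr == en:
        return "unknown"
    return "tr" if tr > en else "en"


def language_fails(text: str, expected: str) -> list[str]:
    """Flag every paragraph whose dominant language differs from `expected`."""
    fails = []
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    for i, para in enumerate(paragraphs, start=1):
        lang = dominant_language(para)
        # 'unknown' passes: short paragraphs often contain no stopwords at all.
        if lang not in (expected, "unknown"):
            fails.append(f"paragraph {i}: expected {expected}, looks like {lang}")
    return fails
```

Checking per paragraph rather than per document is what lets a check like this notice code-switching partway through a response.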

Here's the core of a harness that strings these layers together with a bounded retry loop.

import re
from pydantic import BaseModel, ValidationError

class Article(BaseModel):
    title: str
    body: str

ERROR_SHAPES = [
    r"as an ai language model",
    r"i (?:cannot|can't|am unable to) (?:access|comply)",
    r"\berror:\s",
    r"\b(?:undefined|null)\b",   # whole words only; group so \b applies to both
    r"\{\{.*?\}\}",          # leaked template tokens
]
BANNED_PHRASES = [r"in today's fast-paced", r"delve into", r"unleash the power"]

def gate(text: str) -> list[str]:
    """Deterministic checks. Returns a list of failures (empty == pass)."""
    fails = []
    if not text.strip():
        fails.append("empty output")
    if not (200 <= len(text.split()) <= 800):
        fails.append(f"length out of bounds: {len(text.split())} words")
    for pat in ERROR_SHAPES:
        if re.search(pat, text, re.I):
            fails.append(f"error-shaped string: /{pat}/")
    for pat in BANNED_PHRASES:
        if re.search(pat, text, re.I):
            fails.append(f"banned phrase: /{pat}/")
    return fails

def generate(client, prompt: str, max_attempts: int = 3) -> Article:
    last_fails: list[str] = []
    for attempt in range(max_attempts):
        feedback = "" if not last_fails else (
            "\n\nYour previous output was rejected for: "
            + "; ".join(last_fails) + ". Fix these and return only the schema."
        )
        # client.structured is a placeholder for your provider's native structured-output call
        raw = client.structured(prompt + feedback, schema=Article)
        try:
            article = Article.model_validate(raw)
        except ValidationError as e:
            # Name each schema failure so the retry prompt can point at it.
            last_fails = [f"schema: {err['loc']}: {err['msg']}" for err in e.errors()]
            continue
        # Only the body is gated here; the title needs its own bounds and denylist pass.
        last_fails = gate(article.body)
        if not last_fails:
            return article
    raise RuntimeError(f"slop after {max_attempts} attempts: {last_fails}")

Notice what the harness does on rejection: it feeds the specific failures back into the next attempt. The model is far better at fixing a named defect than at avoiding an abstract one. And notice the loop is bounded — after max_attempts it raises rather than shipping. Failing closed is the whole point.

Layer 4: Deterministic quality gates

Layers 1–3 catch format and surface defects. Layer 4 catches semantic invariants that are specific to your task and still checkable in code. If you generate a summary, assert every cited number appears in the source. If you generate SQL, run it through a parser and an EXPLAIN instead of trusting the model's confidence. If you generate code, compile it and run the linter. If you generate a translation, check that named entities survived.

These gates are where domain knowledge lives. They're unglamorous assert statements, and they're the difference between a demo and a product. The rule: anything you can verify mechanically, you must — because the model will eventually get it wrong, and you want the gate to catch it, not the user.
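The summary example might look something like this. It is a sketch, assuming numbers are compared as exact strings; the ungrounded_numbers helper and the NUMBER pattern are illustrative names, not part of any library.

```python
import re

# Digits with optional internal '.' or ',' separators, e.g. 12.5 or 4,200.
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def ungrounded_numbers(summary: str, source: str) -> list[str]:
    """Return numbers cited in the summary that never appear in the source."""
    source_numbers = set(NUMBER.findall(source))
    missing: list[str] = []
    for num in NUMBER.findall(summary):
        if num not in source_numbers and num not in missing:
            missing.append(num)
    return missing


source = "Revenue grew 12.5% to 4,200 units in 2023."
summary = "Revenue grew 12.5% in 2023, reaching 4,500 units."
print(ungrounded_numbers(summary, source))  # ['4,500']
```

Exact string matching is strict on purpose: a summary that writes 4200 where the source says 4,200 will be flagged. Normalize both sides first if that turns out to be too noisy for your data.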

Layer 5: Verify before ship

The last layer is the only one that may use another model, and only for the things code genuinely can't judge: faithfulness, relevance, tone-match. A cheap judge model scoring "does this answer the question, grounded in the provided context, in the requested voice?" on a 1–5 scale, with a hard threshold below which you reject, catches the subtle slop that passes every deterministic check.
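A hard threshold can be sketched as below, with the judge passed in as a plain callable that takes a prompt string and returns the model's reply. The prompt wording, the function names, and the default threshold of 4 are assumptions for illustration, not a recommended rubric.

```python
import re
from typing import Callable, Optional

JUDGE_PROMPT = (
    "Score from 1 to 5 how well the ANSWER addresses the QUESTION, "
    "grounded only in the CONTEXT and in the requested voice. "
    "Reply with a single digit.\n\n"
    "QUESTION:\n{question}\n\nCONTEXT:\n{context}\n\nANSWER:\n{answer}"
)


def judge_score(
    judge: Callable[[str], str], question: str, context: str, answer: str
) -> Optional[int]:
    """Return the judge's 1-5 score, or None if the reply is not a bare digit."""
    prompt = JUDGE_PROMPT.format(question=question, context=context, answer=answer)
    match = re.fullmatch(r"[1-5]", judge(prompt).strip())
    return int(match.group()) if match else None


def passes_judge(
    judge: Callable[[str], str],
    question: str,
    context: str,
    answer: str,
    threshold: int = 4,
) -> bool:
    score = judge_score(judge, question, context, answer)
    # Fail closed: an unparseable verdict counts as a rejection.
    return score is not None and score >= threshold
```

Treating a chatty or malformed verdict as a rejection keeps the judge under the same rule as every other layer: when in doubt, nothing ships.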

Keep this layer last and keep it skeptical. A judge model is itself an LLM and can be fooled, so it's a final filter on output that has already survived four deterministic gates — never a replacement for them. And log every rejection at every layer. Your rejection logs are the highest-signal dataset you own: they tell you exactly how your model fails in production, which feeds back into prompts, denylists, and gates.
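Rejection logging does not need to be elaborate. A minimal sketch, assuming one JSON line per rejection through the standard logging module; the logger name, the field names, and the 200-character preview are arbitrary choices for this example.

```python
import json
import logging
import time

logger = logging.getLogger("slop.rejections")


def log_rejection(layer: str, reasons: list[str], output: str, attempt: int) -> dict:
    """Emit one structured record per rejected generation and return it."""
    record = {
        "ts": time.time(),
        "layer": layer,                  # e.g. "schema", "denylist", "judge"
        "attempt": attempt,
        "reasons": reasons,
        "output_preview": output[:200],  # truncated to keep log lines bounded
    }
    logger.warning(json.dumps(record, ensure_ascii=False))
    return record
```

Because each record names the layer and the reasons, counting them by reason later is a one-liner, and that count is what tells you which denylist entry or gate is earning its keep.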

What this buys you

None of these layers is clever. That's the point. Cleverness is fragile; a denylist and a bounded retry loop are not. What the harness gives you is a guarantee about what reaches the user (not a probability, a guarantee) for every failure mode you've encoded as a deterministic check. The judge layer remains probabilistic, which is why it sits behind the others. Slop stops being a vibe you argue about and becomes a set of named, logged, falsifiable conditions.

The model is a brilliant, unreliable intern. You don't fix an unreliable intern by writing a longer brief. You fix it by reviewing the work before it goes out.

The open question I keep circling: which of these checks genuinely belong in deterministic code, and which are you quietly outsourcing to a judge model because writing the real assertion was too hard?