The faster your agents ship code, the faster slop accumulates. Here's why the solution isn't slowing down.
There's a version of the AI productivity story that goes like this: agents write the code, humans review the diff, everyone ships faster. Clean. Straightforward. Optimistic.
The problem is that the middle part, "humans review the diff", doesn't scale at the same rate as the first part. The time it takes an agent to produce code has compressed. The attention a human can give a review hasn't grown to match.
When a developer using Claude Code or Cursor can generate in an afternoon what used to take a week, the review queue doesn't empty faster. It fills up with bigger diffs, more files touched, more surface area to hold in your head at once. The execution layer has gotten faster. The judgment layer is the same size it was.
This is what we'd call the output layer problem: the gap between how fast code is produced and how thoroughly it can be evaluated. And it's where a surprising amount of slop quietly enters production.
What slop looks like at speed
AI-generated code isn't bad in the way junior developer code is bad. It doesn't usually have obvious logic errors or missing edge cases you'd catch in five minutes. It compiles. It passes your tests. It looks, at a glance, like perfectly reasonable code.
What it tends to have instead are the structural signatures of code that wasn't written to be maintained:
- Narrative comments that explain what the code does line-by-line rather than why it does it — a tell that the model was narrating rather than documenting
- Generic naming that makes sense in the context of a single function but degrades the readability of a whole file over time
- Swallowed exceptions where error handling was stubbed in and never completed — the catch block exists, but nothing happens inside it
- `as any` and `@ts-ignore` scattered through TypeScript where the model hit a type complexity it didn't want to resolve properly
- TODO stubs left in place where the agent ran out of context or confidence and deferred the hard part
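To make a few of these signatures concrete, here is a minimal sketch of how they could be spotted mechanically. It is a toy that runs regular expressions over raw source text, which is an assumption made for brevity: it is not how aislop works internally, and a real tool would more likely inspect the syntax tree. The names `SlopKind`, `SlopFinding`, and `findSlop` are hypothetical.

```typescript
// Toy detector for four of the signatures listed above.
// Assumption: regex heuristics over raw text, for illustration only.
type SlopKind = "empty-catch" | "as-any" | "ts-ignore" | "todo-stub";

interface SlopFinding {
  kind: SlopKind;
  line: number; // 1-based line on which the match starts
}

const PATTERNS: Array<{ kind: SlopKind; regex: RegExp }> = [
  // A catch block (with or without a binding) whose body is empty.
  { kind: "empty-catch", regex: /catch\s*(\([^)]*\))?\s*\{\s*\}/g },
  // An `as any` type assertion.
  { kind: "as-any", regex: /\bas\s+any\b/g },
  // A @ts-ignore directive.
  { kind: "ts-ignore", regex: /@ts-ignore\b/g },
  // A line comment that starts with TODO.
  { kind: "todo-stub", regex: /\/\/\s*TODO\b/g },
];

function findSlop(source: string): SlopFinding[] {
  const findings: SlopFinding[] = [];
  for (const { kind, regex } of PATTERNS) {
    regex.lastIndex = 0; // global regexes keep state between calls
    let match: RegExpExecArray | null;
    while ((match = regex.exec(source)) !== null) {
      // Count the newlines before the match to get its line number.
      const line = source.slice(0, match.index).split("\n").length;
      findings.push({ kind, line });
    }
  }
  return findings.sort((a, b) => a.line - b.line);
}

const sample = [
  "const user = JSON.parse(raw) as any;",
  "try {",
  "  save(user);",
  "} catch (err) {}",
  "// @ts-ignore",
  "// TODO: handle retries",
].join("\n");

console.log(findSlop(sample));
```

Every line of the sample compiles and would pass a quick skim, which is the point: none of these findings is a bug, and all of them are cheap for a machine to catch.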
None of these are catastrophic individually. Together, across a codebase where agents are writing code at volume, they compound into something that's genuinely difficult and expensive to work with — not because any single file is broken, but because the accumulated weight of small quality deficits makes the whole system harder to change.
This is what the vibe coding optimism tends to skip over. Building software is fast now. Running, maintaining, and evolving software is still the same job it always was — and that job is harder when the input material is structurally inconsistent.
The review bottleneck is real, and it's getting worse
If you're a tech lead on a team that's adopted AI-assisted coding in any serious way, you've probably already felt this. PRs are bigger. The diff is longer. Your engineers are shipping more, but you're spending more time in review just to keep up with what they're producing — let alone to actually evaluate it carefully.
The instinct is to review faster. Skim the obvious stuff, focus on the logic, trust that tests will catch the rest.
The problem is that the patterns flagged by aislop, an open-source CLI for scoring AI-generated code, are exactly the things that get skimmed past in a fast review: unused exports, oversized functions, trivial comments, unsafe type assertions, swallowed exceptions. They're not bugs. They're not even wrong, exactly. They're just the structural residue of code that was generated rather than crafted, and they accumulate quietly until the codebase is meaningfully harder to work in.
Speed-reviewing your way through AI-generated output doesn't solve the output layer problem. It just defers it.
What a quality gate actually changes
A deterministic quality gate doesn't replace engineering judgment. It protects it.
When aislop scans a PR before it reaches review, the mechanical issues are already surfaced and scored. The reviewer doesn't have to catch them — they're in the findings. The auto-fixable ones are already gone. What's left for the human reviewer is the stuff that actually requires a human: the architectural decisions, the business logic, the tradeoffs that only make sense if you know why the system works the way it does.
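The split described here can be sketched as a simple partition of findings. The `Finding` shape below is hypothetical and is not aislop's actual output format; the sketch only illustrates separating what a tool can fix mechanically from what gets surfaced in the report.

```typescript
// Hypothetical shape of a finding; not aislop's real output format.
interface Finding {
  rule: string; // illustrative rule identifier
  file: string;
  autoFixable: boolean; // true if a tool could fix it mechanically
}

interface ReviewPlan {
  fixedBeforeReview: Finding[]; // handled by tooling
  surfacedToReviewer: Finding[]; // listed in the report for the human
}

// Partition findings so the reviewer only sees what tooling could not fix.
function planReview(findings: Finding[]): ReviewPlan {
  const plan: ReviewPlan = { fixedBeforeReview: [], surfacedToReviewer: [] };
  for (const finding of findings) {
    if (finding.autoFixable) {
      plan.fixedBeforeReview.push(finding);
    } else {
      plan.surfacedToReviewer.push(finding);
    }
  }
  return plan;
}
```

The reviewer's list in this sketch still contains only mechanical findings. The architectural and business-logic questions never appear in it, because no context-free tool can raise them.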
That's the division of labour that makes AI-assisted coding actually work at scale. Agents handle the execution. Deterministic tooling handles the structural quality check. Humans handle the judgment calls that context-free tools can't make.
Without the middle layer, you get a situation where reviewers are either doing the machine's job (checking for `as any` and TODO stubs in a 400-line diff) or not doing it (approving PRs that look fine but quietly degrade the codebase). Neither option is sustainable as agent output volume grows.
The compounding problem nobody talks about
There's a second-order effect here that's worth naming directly.
Agents don't just write new code. They read the existing codebase to understand context for what they're generating. When the codebase has accumulated structural slop (inconsistent naming, commented-out dead code, generic variable names that don't signal intent), agents treat that as an in-context signal for what's acceptable. Nothing is being retrained; the model is imitating the examples in front of it.
Slop, in other words, prompts agents to produce more slop. A codebase that's been maintained with a quality gate tends to get better agent output than one that hasn't, because the agent has better patterns to imitate and fewer ambiguous examples to confuse it.
This is the compounding argument for catching slop early. It's not just about the current PR. It's about the quality floor of every PR that comes after it.
Codifying what "good" looks like
The article that inspired this one — from the Leaders of Code podcast, featuring Jon Hyman of Braze and Jody Bailey of Stack Overflow — makes the point that one of the most consequential engineering tasks of the next few years will be getting experienced engineers' knowledge out of their heads and into a form that agents can use.
That's true of architectural decisions and business context. It's equally true of code quality standards.
Right now, most teams' sense of what acceptable code looks like lives in the heads of senior engineers and in the informal feedback loop of code review. When an agent violates those standards, a reviewer catches it — if they're paying attention, if they're not in a hurry, if the diff isn't too large to parse carefully.
A quality gate externalises that standard. It makes the bar explicit, consistent, and enforceable without depending on reviewer attention in any given review cycle. That's not a replacement for the senior engineer's judgment. It's a way of making that judgment durable and scalable — which is exactly what needs to happen as agent output grows.
What to do with this
If you're running a team that uses Claude Code, Codex, Cursor, or any other AI coding assistant at any meaningful volume, three things are worth doing now rather than later.
Set a baseline score. Run `npx aislop scan` on your current codebase. Whatever comes back is your current baseline: the accumulated structural debt from however your team has been working. You can't improve what you haven't measured.
Gate new additions, not just the baseline. The value of a CI gate isn't in fixing everything at once — it's in preventing new slop from accumulating while you address the existing debt incrementally. Set your threshold, commit the config, block merges that fall below it. New agent output gets checked before it reaches review.
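As a rough illustration of the gating logic, the sketch below decides whether a merge should be blocked given a score and a committed threshold. Everything here is an assumption for illustration: the `GateConfig` shape, the numeric score scale, and the optional `baseline` no-regression check are hypothetical and do not describe aislop's real config or output.

```typescript
// Hypothetical gate config; not aislop's actual config format.
interface GateConfig {
  threshold: number; // minimum acceptable score
  baseline?: number; // optional: score recorded on the main branch
}

interface GateResult {
  passed: boolean;
  reason: string;
}

// Block the merge if the score is under the threshold, or (optionally)
// if it is lower than the recorded baseline.
function evaluateGate(score: number, config: GateConfig): GateResult {
  if (score < config.threshold) {
    return {
      passed: false,
      reason: `score ${score} is below threshold ${config.threshold}`,
    };
  }
  if (config.baseline !== undefined && score < config.baseline) {
    return {
      passed: false,
      reason: `score ${score} regresses from baseline ${config.baseline}`,
    };
  }
  return { passed: true, reason: `score ${score} meets the gate` };
}

console.log(evaluateGate(82, { threshold: 80, baseline: 90 }));
```

In CI, a wrapper script would exit non-zero when `passed` is false so the merge is blocked. The baseline check is one way to stop new slop from accumulating while existing debt is paid down incrementally; teams that find it too strict can leave `baseline` unset and rely on the threshold alone.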
Hand failing findings back to the agent. When aislop surfaces an issue that isn't auto-fixable, `npx aislop fix --claude` sends it directly to the agent that wrote the code for a second pass. You're not adding reviewer work; you're closing the loop before the human ever sees it.
The execution layer has gotten dramatically faster. The judgment layer is still human-sized. A quality gate is what keeps those two things from diverging in ways that are expensive to fix later.
aislop is a free, open-source CLI for scoring AI-generated code. It's sub-second, deterministic, and requires no configuration to start. Run it in your repo → GitHub →