Ira Glass made this observation about creative work that stuck with a lot of people: the reason your early work is bad isn't that your ability is low. It's that your taste is already high. You can hear the gap between what you made and what you were trying to make. You know it's not there yet. That's your taste working against you.
The observation was about writers and filmmakers developing their craft. But it maps onto AI output quality more precisely than most people realize.
AI output quality plateaus because AI eliminates the ability gap but cannot close the taste gap — the distance between executing competent work and judging what's worth executing. Process and guardrails raise the floor. They don't move the ceiling. This article explains what the taste gap is, why longer specs can't close it, and the one practice that does.
AI closes the ability gap. It does not close the taste gap.
What the Ability Gap Actually Was
Before AI, execution was expensive. Writing a draft took hours. Coding a feature took days. The bottleneck was the doing.
A lot of bad output existed because doing was costly. People shipped the second draft when they knew a fifth draft would be better. Teams built the expedient implementation because the elegant one would take three more days. That ability gap — between knowing what good looks like and being able to produce it — was the binding constraint.
AI collapses that gap dramatically. The draft takes minutes. The feature takes hours. A BCG and Harvard study of 758 consultants measured this directly: performers below the average skill threshold gained 43% on task quality when given AI access. The floor rose sharply.
This is real. The gains at the bottom are genuine and significant.
The New Binding Constraint
When AI makes execution cheap, the binding constraint shifts from ability to taste: knowing which problem to solve, which feature to cut, and when technically correct output is wrong for a specific user. AI cannot supply this judgment. It averages across acceptable options. The result is output that works but disappoints anyone with specific standards.
A 2025 arXiv paper on AI output variance found that generative AI systematically compresses the distribution of human output — the floor rises but the ceiling drops. Ted Chiang called ChatGPT "a blurry JPEG of the web": structure preserved, fine detail lost. Amanda Askell at Anthropic described the dynamic as LLMs providing "the average of what everyone wants."
The average of what everyone wants is not the best version of anything specific.
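The compression dynamic is easy to see in a toy simulation. Everything below is invented for illustration (the normal distribution, the 0.5 compression factor — none of it comes from the paper): model AI assistance as pulling every output toward the population mean, then check what happens at the tails.

```python
import random
import statistics

random.seed(0)

# Toy model: each worker's unassisted output quality (mean 50, sd 15).
human = [random.gauss(50, 15) for _ in range(10_000)]

# Assumed AI effect: pull every output halfway toward the population
# mean -- "the average of what everyone wants". The 0.5 is arbitrary.
mean = statistics.fmean(human)
assisted = [mean + 0.5 * (q - mean) for q in human]

def pct(xs, p):
    """Approximate p-th percentile via 100-quantile cut points."""
    return statistics.quantiles(xs, n=100)[p - 1]

print(f"10th percentile: {pct(human, 10):.1f} -> {pct(assisted, 10):.1f}")  # floor rises
print(f"90th percentile: {pct(human, 90):.1f} -> {pct(assisted, 90):.1f}")  # ceiling drops
```

The mean barely moves; the spread collapses. Whatever the real compression factor is, that is the shape the paper describes: better tenth percentile, worse ninetieth.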
This is where the taste gap becomes visible. The AI can execute. The AI cannot judge what's worth executing. It cannot tell you which angle on the problem is the interesting one, which feature to cut because the product is already doing too much, when technically correct is the wrong move for this user.
Those judgments are taste. Taste is why good AI output disappoints people with high standards — they can hear what it could have been.
The Spec Problem Is a Taste Problem
There's a concrete version of this that any developer who has worked with AI on a real project has encountered.
Write a thorough spec — two thousand words, every endpoint, every edge case. Feed it to the AI. The AI builds everything on the list. Every requirement is met. The product works.
It also feels like a toy.
Not broken. Not missing features. Just hollow — like a homework assignment that proves the concept without understanding what the concept is for. The AI followed the spec and had no understanding of what production-level software actually needs, or what good design feels like from the user's side. The spec described the parts. Building the parts is not building the product.
The missing information is not writable as spec text. It's taste: the accumulated pattern recognition that tells an experienced engineer when a loading state will feel broken even at 200ms, when an empty state communicates abandonment, when the technically correct dropdown is the wrong choice. That knowledge didn't make it into the spec because it's not articulable as requirements.
Andrej Karpathy walked back his famous "vibe coding" framing in 2025: "You still need taste, architecture thinking." The developer's role shifted from coder to orchestrator — but orchestrating well still requires knowing what good looks like.
Evaluative Taste vs. Generative Taste
Evaluative taste is the ability to judge existing work — scoring, ranking, filtering. Generative taste is the ability to decide what to create — which topic matters, which angle resonates, which details to include and which to cut. AI is improving at evaluative taste. Generative taste remains a human capacity.
There's a nuance worth naming here. AI can learn some forms of taste.
A 2024 study found AI achieved 59% accuracy evaluating research pitches — identifying which proposals were strong. That's evaluative taste: scoring what exists against criteria.
Generative taste is different. It's knowing what to create before it exists, which angle is worth pursuing, what the product needs that nobody asked for. The 59% accuracy on scoring does not transfer to 59% accuracy on generating what's worth scoring highly.
Paul Graham's "Taste for Makers" essay argued that taste can be evaluated but not manufactured. You can articulate what makes something good after seeing it. You cannot turn that articulation into a reliable procedure for generating good things. The articulation is always incomplete. Goodhart's Law applies here too: once you optimize against a proxy for quality, the proxy stops measuring quality.
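Goodhart's trap can be made concrete with a toy model. Everything here is hypothetical: an invented "true quality" function that peaks at some ideal length, and a proxy that just rewards thoroughness-as-length.

```python
# Hypothetical quality model, for illustration only.
IDEAL_WORDS = 800  # assume this piece actually reads best at ~800 words

def true_quality(words: int) -> float:
    """Invented ground truth: quality peaks at the ideal length."""
    return -abs(words - IDEAL_WORDS)

def proxy_score(words: int) -> int:
    """The proxy under optimization: 'more thorough = better'."""
    return words

candidates = range(200, 3001, 100)
best_by_proxy = max(candidates, key=proxy_score)   # proxy always wants more
best_by_truth = max(candidates, key=true_quality)  # the actual optimum

draft = 600
# Chasing the proxy landed further from good than the rough draft was.
print(true_quality(best_by_proxy) < true_quality(draft))  # True
```

Swap in spec length, test coverage, or a rubric score for the proxy and the failure has the same shape: the metric keeps climbing while the thing it was supposed to measure falls away.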
The Practice That Moves the Ceiling
The gap is not fixed. Taste develops. The question is how.
Before accepting AI output — code, copy, analysis — pause and name one thing you'd change if you had unlimited time. Not a bug. Not a missing requirement. The thing that's technically fine but wrong for this specific situation.
That practice does two things. First, it forces the articulation of the taste judgment, which is how taste becomes more precise over time. Second, when you can name the thing, you can respecify it. The next AI iteration has a real target instead of an implicit one.
The ability gap closed fast. The taste gap closes through repetition of this specific exercise: recognize, name, specify. Not through better prompts or longer specs.
Glass was right about creative work. Your taste precedes your ability. With AI, ability is nearly free. What remains is closing the taste gap — and that one you have to do yourself.
What did the last AI output get technically right that still felt wrong — and could you name exactly what was off?