DEV Community

John Wade


When a Model Goes Wide Instead of Deep: Installing Quality Gates That Actually Hold


I asked Cursor to review my project documentation against the design constraints I'd spent weeks building. What came back looked like analysis. Clean formatting, section headers, specific references to my architecture. And scattered throughout: "Perfect! 🚀", "Smart to research!", "Great call!" — affirmative language that violated every guardrail in my system's design philosophy.

The problem wasn't Cursor's enthusiasm. The problem was that nothing caught it. No checkpoint, no gate, no pre-delivery verification. Cursor produced output that looked like analysis but was actually compliance — pattern-matching to "helpful assistant" rather than reasoning against constraints. I didn't notice until I brought the output to ChatGPT for a tone check, and its response was immediate: "The assistant's language slides into affirmation mode rather than architectural reasoning. This violates the neutrality principle you've built into the design guidelines."

That was the moment I realized: models don't have internal quality gates. They'll drift from any constraint you set unless something external catches it.

Part 3 of Building at the Edges of LLM Tooling. If you're using LLMs for analytical work — framework reviews, research synthesis, anything requiring depth — comprehensive-looking output isn't the same as deep output. Models optimize for coverage, not rigor. Start here.


Why It Breaks

The drift isn't malicious. It's structural. LLMs optimize for patterns that satisfy users — comprehensive coverage, confident delivery, affirming language. These patterns work most of the time. But for work that requires analytical rigor, this optimization creates a specific failure mode: output that looks thorough without being deep.

I kept hitting three versions of this.

Tone drift without gates. Cursor saw my project documentation and generated responses that matched the form of analytical review — headers, bullet points, references to specific files — while the substance was cheerleading. "Perfect! 🚀" isn't analysis. But without a checkpoint asking "does this output maintain analytical neutrality?", it shipped unchallenged. ChatGPT caught it instantly: "Rather than questioning whether your Chronicle 360° cycle is right, the assistant assumes it's working and offers implementation details. This skips the 'Should we even do this?' layer."

Breadth without depth. I was developing an analytical framework with ChatGPT — asking it to review methods, compare to professional intelligence frameworks, assess rigor. Each iteration looked more comprehensive. It mapped my framework to CIA Structured Analytical Techniques, NATO STRATCOM protocols, FrameNet semantic analysis. Impressive coverage. But when I brought it to Cursor for adversarial review, Cursor caught what I'd missed: "It's partially getting the concept but missing key aspects. ChatGPT is treating boundary objects as technical interoperability tools rather than epistemic translation zones." ChatGPT was pattern-matching to familiar concepts — finding similarities to existing frameworks rather than grasping what was genuinely novel. Comprehensive isn't deep. It's just wide.

Scope expansion without depth gates. A conversation about neurochemical stability and epistemic framing started focused, then expanded. Each response from ChatGPT added another domain — cognitive evolution, Lacanian desire, libidinal economy, sociological parallels — but none got deeper. When I pushed back — "this doesn't check the minimum bar of needed curation for cognitive performance required to access higher ordered reasoning" — ChatGPT interpreted "not deep enough" as "add another layer" and brought in more concepts. Six domains, all surface-level. The model defaulted to horizontal growth because breadth is easier to generate than depth.


What I Tried

The first approach was reactive: catch drift after it happened. I'd generate output in one model, bring it to a second model for critique, bring the critique back to the first, synthesize. This worked — Cursor admitted ChatGPT's analysis was "brutally accurate" after being shown the tone drift diagnosis. But reactive review is expensive. By the time the checkpoint catches the problem, you've already invested in the wrong direction.

What emerged was a checkpoint architecture — gates at the analytical junctures where drift actually happens.

Pre-execution checkpoints. I started inserting a question before each generation step: does this match design constraints? Have we validated the problem first? Should this even be automated? Cursor had jumped to implementing Chronicle 360° assessments without anyone asking whether those assessments should exist. The intervention that would have caught it — "stop iterating clever architectures until you decide what kind of organism you're building" — never ran because there was no gate asking for it.

Depth gates before scope expansion. I started checking, before adding another domain, whether the current one had mechanistic depth, novel insights, and falsifiable claims. The brain chemistry conversation should have deepened the neurochemistry-epistemology connection before expanding to Lacan — instead it accumulated six shallow layers. The question that started catching this: could I specify the mechanism, not just name the category? If not, the domain wasn't ready to expand.

The minimum bar checkpoint. I started asking, before marking any analysis complete: was this accessing higher-order reasoning, or just rearranging familiar patterns? Was it specifying mechanisms or just taxonomies? Making risky predictions or just describing what exists? ChatGPT's analytical frameworks consistently passed completeness checks (all sections covered, domain vocabulary correct) while failing depth checks (pattern-matching to existing concepts, treating processes as things).
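The three checkpoints above can be written down as an explicit gate table rather than a habit you hope to remember. Here's a minimal Python sketch of that idea; the `Gate` structure and the juncture names are my own illustration, not part of any tool's API:

```python
from dataclasses import dataclass, field

# Illustrative sketch: the three checkpoints as explicit, inspectable gates
# that run at fixed junctures instead of living in the operator's head.

@dataclass
class Gate:
    name: str
    juncture: str                  # when the gate runs relative to the model
    questions: list[str] = field(default_factory=list)

GATES = [
    Gate("pre_execution", "before_generation", [
        "Does this match the design constraints?",
        "Have we validated the problem first?",
        "Should this even be automated?",
    ]),
    Gate("depth_gate", "before_expansion", [
        "Does the current domain have mechanistic depth?",
        "Can I specify the mechanism, not just name the category?",
        "Are there falsifiable claims yet?",
    ]),
    Gate("minimum_bar", "before_delivery", [
        "Is this higher-order reasoning, or rearranged familiar patterns?",
        "Mechanisms, or just taxonomies?",
        "Risky predictions, or just descriptions of what exists?",
    ]),
]

def checklist(juncture: str) -> list[str]:
    """Every question that must be answered 'yes' before passing this juncture."""
    return [q for g in GATES if g.juncture == juncture for q in g.questions]
```

The point of writing it down is that a checklist can be run mechanically before every step, while "analytical neutrality" as an intention cannot.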


What It Revealed

The deeper pattern was about what "comprehensive" actually means in LLM output. Models produce analysis that covers all sections, uses correct terminology, and looks thorough. The output passes because it pattern-matches to "good analysis" — the same heuristic the models used to generate it. The failure is invisible until you install a gate that asks a different question: not "did we cover everything?" but "did we go deep enough on anything?"

The meta-irony crystallized this. I was designing a system explicitly to prevent goal creep — naming it as collapse risk C5, documenting it, building countermeasures. And the conversation designing this system exhibited goal creep. Started with "create compact context pack," ended with proposals for automation layers, MkDocs site hierarchy, n8n workflows, agent architectures. The system designed to prevent drift was drifting during its own creation. When I pointed this out — "Harness is infrastructure, not the end goal" — ChatGPT acknowledged it and kept offering expanded options. Documentation of a constraint is not enforcement of a constraint. The same lesson from Post 2, at a different layer.

The distinction that matters: completeness checks verify that all boxes are ticked. Depth gates verify that what's in the boxes is real. Models are very good at ticking boxes. They need external pressure to fill them with substance.


The Reusable Rule

If you're using LLMs for analytical work — frameworks, reviews, research synthesis — the output will default to comprehensive-but-shallow without explicit depth gates.

The diagnostic signals: Output that looks thorough but feels surface-level. Familiar frameworks applied without novel insight. Scope expanding each iteration while no single domain gets deeper. Affirmative language where critical distance should be. And the meta-signal — when you catch yourself designing solutions to a problem and the conversation itself exhibits that problem.
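Some of these signals are cheap to screen for mechanically. A crude lexical check like the sketch below can flag output worth sending to a second model for a real tone review; the phrase list is mine and deliberately incomplete, so treat it as a tripwire, not the gate itself:

```python
import re

# Crude lexical screen for affirmation drift: the "Perfect! 🚀" failure mode.
# Matching is done on lowercased text, so patterns are written lowercase.

AFFIRMATION_PATTERNS = [
    r"\bperfect!", r"\bgreat call\b", r"\bsmart to\b",
    r"\bawesome\b", r"\bexcellent!", "🚀", "🎉",
]

def affirmation_hits(text: str) -> list[str]:
    """Return the patterns found, so the flag is explainable."""
    lowered = text.lower()
    return [p for p in AFFIRMATION_PATTERNS if re.search(p, lowered)]

def needs_tone_review(text: str, threshold: int = 1) -> bool:
    """True when the output has enough affirmation tells to warrant review."""
    return len(affirmation_hits(text)) >= threshold
```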

The checkpoint that matters most isn't "did we cover everything?" It's "can we specify the mechanism?" Taxonomy is easy. Mechanisms require actual understanding. When the model says "these frameworks interact," ask how. When it says "comprehensive analysis," ask what it chose not to include. When it adds another domain, ask whether the last one is deep enough to earn the expansion.

Models don't have internal quality gates. Install yours at the junctures where drift happens — before generation, before expansion, before delivery. The gate that asks "can you specify the mechanism?" will catch what no amount of coverage checking ever will.
