Maybe Chain-of-Thought Isn't the Trick. Maybe Specification Is.

#ai #architecture #promptengineering

I want to think through something out loud, and I'd genuinely love to know if this lands for anyone else or if I'm just talking myself into a corner.

I've been chasing a hunch lately, the kind that starts as a half-formed idea you can't quite let go of. It's about prompting — specifically, why we talk about prompting the way we do, and whether the techniques everyone swears by are actually doing what we think they're doing. I went looking for an answer and found something that, if it holds up, kind of upends how I've been thinking about this. So take this as me showing my work, not announcing a conclusion.

Here's where it started: prompt engineering has accumulated a pretty standard toolkit by now. Chain-of-thought. Few-shot examples. Persona framing — "you are an expert clinician," that sort of thing. And most of the advice floating around amounts to some version of "stack enough of these and your results get better." I wanted to understand why that's supposedly true. I expected to find a tidy explanation. What I found instead was a paper suggesting it mostly... isn't true. Which sent me somewhere I didn't expect to go.

Where my head was at first

My own way into this question started with pedagogy, of all things — specifically Bloom's Taxonomy, which is a framework educators use to classify what kind of thinking a question is actually asking for. Remember, understand, apply, analyze, evaluate, create. "List the causes of WWI" wants remembering. "Why did WWI happen instead of being avoided" wants analysis. Same topic, completely different cognitive demand.

And it mapped onto prompting in a way that felt almost too clean. "Summarize this doc" — remember/understand. "Critique this argument" — evaluation. I started wondering whether the actual skill in prompting wasn't "pick the right technique" but something more like "ask for the right kind of thinking," the way a good teacher designs a question around the exact cognitive move they want a student to make.

I liked this idea a lot. I still kind of do. But it turned out to be a detour, not the destination — because the real surprise was sitting in the data, not in my analogy.

The belief I didn't realize I was testing

Here's a belief I think most of us building with LLMs are carrying around, myself included until recently: that prompting technique — chain-of-thought, personas, the clever phrasing patterns — is where the real leverage lives. It's certainly where most of the content lives. Every "10 prompting tricks that actually work" post is selling some version of this.

Then I came across a study out of UConn that tested this head-on, and the results, as I read them, are pretty clear. The researchers were trying to get LLMs to classify psychological constructs in text. Some, like gratitude, sit close to the surface of the language and are relatively easy to spot. Others, like negative core beliefs, need real interpretive distance and a precise theoretical definition before you even know what you're looking for.

They threw the whole toolkit at it: personas, chain-of-thought, explanations, few-shot examples. But the interesting move was that they also systematically varied the baseline task description itself, rewording the definition and instructions dozens of different ways, and actually measured what had a significant impact.
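To make "vary only the task description" concrete for myself, I sketched what such a harness might look like. This is my own minimal version, not the study's code: the two definition wordings, the fixed instruction, and the `classify` callable (whatever sends a prompt to your model and maps the reply to yes/no) are all assumptions for illustration.

```python
from typing import Callable, Iterable

# Hypothetical wordings of one construct definition (mine, not the paper's).
DEFINITIONS = {
    "terse": "Gratitude: the author says thank you.",
    "detailed": (
        "Gratitude: the author expresses appreciation for a benefit "
        "they received from another person or from circumstances."
    ),
}

# Held constant so that any difference in results comes from the definition.
INSTRUCTION = "Answer with exactly one word: yes or no."


def build_prompt(definition: str, text: str) -> str:
    """Assemble a prompt in which only the definition wording varies."""
    return f"{definition}\n{INSTRUCTION}\n\nText: {text}"


def run_variants(
    texts: Iterable[str], classify: Callable[[str], bool]
) -> dict[str, list[bool]]:
    """Collect one list of predictions per definition wording.

    `classify` is a placeholder for a model call that returns True/False;
    it is not a real library function.
    """
    texts = list(texts)
    return {
        name: [classify(build_prompt(definition, t)) for t in texts]
        for name, definition in DEFINITIONS.items()
    }
```

The point of the design is that everything except the definition is frozen, so the per-wording predictions can be scored against the same labels and compared directly.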

And the popular techniques... mostly didn't. Persona framing, chain-of-thought — the paper describes their improvements as small and inconsistent, and across the board, not even statistically significant. What actually mattered, by a wide margin, was getting the definition of the thing you're asking for exactly right. The gap between their best-worded prompt and their worst, on the hardest construct, was a 28-point swing in F1 score. From wording alone. No clever tricks involved.
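For anyone who doesn't live in F1 scores, here's a small sketch of how a best-versus-worst wording gap would be computed. The confusion counts are invented so the arithmetic is easy to follow. They are not the paper's data, and the wording labels are hypothetical.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented (tp, fp, fn) counts for two wordings of the same task.
counts = {
    "vague wording": (30, 30, 30),
    "precise wording": (48, 12, 12),
}

# Keep only the F1 score for each wording.
scores = {name: precision_recall_f1(*c)[2] for name, c in counts.items()}

# Gap between the best and worst wording, in F1 points (0-100 scale).
spread_points = 100 * (max(scores.values()) - min(scores.values()))

for name, f1 in scores.items():
    print(f"{name}: F1 = {f1:.2f}")
print(f"best-to-worst spread: {spread_points:.0f} points")
```

Nothing about the model or the technique changes between the two rows; the only thing that would differ in a real run is the wording that produced the counts.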

So maybe the position worth arguing against isn't "people think prompting is teaching." I don't think anyone seriously believes that. I think the real target is closer to "people think technique is the lever," and this paper makes a pretty strong case that it mostly isn't. Specification might be doing most of the work instead.

The finding that actually reframed things for me

There's a second result in that paper I didn't see coming, and it's the one that ended up answering my original pedagogy question — just not in the direction I thought I was headed.

Across 71 different prompt components they tested, only three improved both precision and recall at the same time. Almost everything else was a trade-off — fix one kind of error, introduce another. Adding context wasn't free. It behaved like bias, in the literal statistical sense: it tilted the result, one way or another, almost every time.
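Here's roughly how I'd sort components into "improves both" versus "trade-off" if I were rerunning this kind of comparison myself. It's a sketch under my own assumptions: the component names, the (precision, recall) numbers, and the tolerance are all made up for illustration, and the paper may well have used a different procedure.

```python
def sign(delta: float, tol: float) -> int:
    """Return +1, -1, or 0 depending on whether delta clears the tolerance."""
    if delta > tol:
        return 1
    if delta < -tol:
        return -1
    return 0


def classify_effect(
    baseline: tuple[float, float],
    variant: tuple[float, float],
    tol: float = 0.005,
) -> str:
    """Label how a prompt component moved (precision, recall) versus the baseline."""
    sp = sign(variant[0] - baseline[0], tol)
    sr = sign(variant[1] - baseline[1], tol)
    if sp > 0 and sr > 0:
        return "improves both"
    if sp < 0 and sr < 0:
        return "hurts both"
    if sp * sr < 0:
        return "trade-off"  # one metric up, the other down
    if sp == 0 and sr == 0:
        return "no clear change"
    return "one-sided change"  # one metric moved, the other held steady


# Invented (precision, recall) pairs, purely illustrative.
baseline = (0.70, 0.60)
components = {
    "persona": (0.70, 0.60),
    "extra example": (0.62, 0.71),
    "tighter definition": (0.76, 0.68),
}

effects = {name: classify_effect(baseline, pr) for name, pr in components.items()}
for name, label in effects.items():
    print(f"{name}: {label}")
```

Tracking precision and recall separately, rather than only F1, is what makes the trade-offs visible: a component can leave F1 nearly flat while swapping one kind of error for another.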

That's roughly when my Bloom's thread resolved itself — by falling apart, which I wasn't expecting but probably should have. Bloom's was built for development. The whole structure assumes a learner can't reliably analyze something before they understand it, so a curriculum sequences cognitive demand over weeks or months. An LLM doesn't really develop across a conversation the way a student develops across a semester. It's not "becoming ready" for evaluation after demonstrating comprehension first. So the part of Bloom's I actually cared about — the sequencing — just doesn't transfer. What's left is more of a label set, a way to name what kind of operation you're asking for. Still useful, I think, just smaller than I originally hoped it would be.

Where it seems to point, at least to me

If I had to name the discipline that explains both of these findings — technique mattering less than expected, context behaving like measurable bias — I don't think it's teaching. It feels closer to something like survey methodology, or structured interviewing: the practice of trying to pull a true signal out of someone without your own question quietly shaping the answer.

Think about what a good survey question is actually trying to do. Not "lead the respondent toward what I want to hear." The opposite, really — precise enough to be answerable, but neutral enough that the response reflects what the person actually thinks rather than what they assume you're hoping for. A good structured interview works the same way. So does a well-built psychometric test item. The common thread across all of them is something like elicitation without leakage — pulling out a real signal without the act of asking corrupting it.

That feels like a better description of what's actually happening when a prompt works than "the right combination of tricks" ever did. And the UConn data backs it up reasonably directly — the lever that mattered was precise, theory-grounded specification of the thing being measured, which is more or less the exact discipline a psychometrician applies when trying to pin down something genuinely hard to define. The lever that mattered less was the technique-stacking most prompting advice is built around.

Where I've landed, for now anyway

Two things surprised me here: prompting technique might not be the leverage point most content treats it as, and the better explanation for why isn't really pedagogy. It's something closer to applied measurement. Bloom's still has a seat at the table, I think, as a way of naming what kind of cognitive operation a prompt is asking for. But the discipline underneath that seems to be more about closing the gap between what you're actually asking for and what you get back once added context has quietly tilted the answer.

I'm not totally sure I've got this right, to be clear — it's an interpretation, and I'd genuinely like to hear where it breaks. I'm working up a more formal version of this for a white paper, where I can lay the research out properly. But this was the messier version, including the part where my first idea turned out to be the wrong one. If you've spent time in the prompting weeds and this rings true — or doesn't — I'd like to hear about it.