Abhinav

Prompting Without the Menu

I read a list-format post on prompting techniques over the weekend. Few-shot, chain-of-thought, ReAct, RAG, self-consistency, meta-prompting... fifteen items, flat list, equal weight. Each one explained briefly, each one given the same visual real estate.

The format is the problem.

When you list techniques as parallel options, you train the reader to pick the one that sounds most appropriate to their task. That's how prompting becomes cargo-culting. Someone reads "use chain-of-thought for complex reasoning" and starts adding "Let's think step by step" to every prompt that feels hard. The technique gets used, but it's not solving anything specific, because nothing specific was diagnosed in the first place.

Prompting techniques aren't options. They're responses to failure modes. The list-format post collapses that distinction, and the collapse is what makes prompting feel like trial-and-error instead of engineering.

Here's how I've started thinking about it instead.

A workflow is a pipeline, not a single prompt

Most of the techniques in those lists assume you're optimizing one prompt: one input, one output, get the wording right. That framing is wrong for anything beyond a one-shot question. Real work with an LLM is a pipeline, and the stages aren't symmetric.

Early stages reduce entropy. Later stages spend tokens on reasoning. If you skip the first kind and jump straight to the second, every downstream technique compounds noise instead of reducing it.

Compression has to come first. Before generating anything, three questions do most of the work:

  • What exists? (the actual current state: files, schemas, code paths, existing logic)
  • What's broken? (the specific gap, not the vague feeling of wrongness)
  • What should the output look like? (shape, not content)

Skip this and the model fills the gap with probabilistic noise. You'll get plausible-looking output that's wrong in ways that take longer to debug than the original problem. Every prompting technique downstream of a vague problem statement is rearranging deck chairs.
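A minimal sketch of what that preamble can look like in practice (the function and field names are mine, not a standard):

```python
def compressed_prompt(task: str, exists: str, broken: str, output_shape: str) -> str:
    """Assemble the three compression answers into a prompt preamble.

    The answers are written by the human, not generated; the point is forcing
    them to exist before the model is asked to produce anything.
    """
    return (
        f"Current state:\n{exists}\n\n"
        f"Specific gap:\n{broken}\n\n"
        f"Expected output shape:\n{output_shape}\n\n"
        f"Task:\n{task}"
    )
```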

Techniques sort by what they're for, not how they sound

Once you've compressed, the techniques fall into four groups based on what kind of failure they address:

Pattern biasing (low-cost control knobs). Few-shot, role, format, instruction prompting. These are persistent constraints, not reactive fixes. Set them once at the top of the workflow (prefer minimum diff, no abstraction until justified, assume repo context) and inject them only when the model drifts. The mistake is treating them as prompts to write fresh each time. They're more like config than instructions.
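Treated as config, those knobs can live in one place and be injected only on drift. A sketch with example constraint strings, not a recommended set:

```python
# Persistent constraints: set once per workflow, injected only when the model drifts.
CONSTRAINTS = [
    "Prefer the minimum diff that solves the problem.",
    "Introduce no new abstraction unless the task requires it.",
    "Assume the existing repo context; do not restate it.",
]

def with_constraints(prompt: str, drifting: bool = False) -> str:
    """Prepend the constraint block only when the model has started drifting."""
    if not drifting:
        return prompt
    return "Constraints:\n- " + "\n- ".join(CONSTRAINTS) + "\n\n" + prompt
```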

Reasoning scaffolds (conditional, expensive). Chain-of-thought, tree-of-thought, reflection. Reflection is the cheapest and most local, so it should be the default. Generate, then immediately ask what failure modes and edge cases exist, then patch. Only escalate to CoT or ToT when the model is visibly guessing at structure rather than missing facts. These trade cost and latency for accuracy... that trade is worth it less often than the lists suggest.
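Reflection as the default can be two extra calls rather than a framework. A sketch, assuming a generic call_llm(prompt) -> str helper (hypothetical stand-in for whatever client you use):

```python
def call_llm(prompt: str) -> str: ...  # hypothetical stand-in for your model client

def generate_then_reflect(prompt: str) -> str:
    """Generate, surface failure modes, then patch: reflection stays local and cheap."""
    draft = call_llm(prompt)
    gaps = call_llm(
        f"{prompt}\n\nDraft:\n{draft}\n\n"
        "List the failure modes and edge cases this draft does not handle."
    )
    return call_llm(
        f"{prompt}\n\nDraft:\n{draft}\n\nKnown gaps:\n{gaps}\n\n"
        "Revise the draft to close these gaps. Change nothing else."
    )
```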

External state interaction (for missing data). Prompt chaining, ReAct, least-to-most. These are about getting information into the workflow that wasn't there before. Use when the task decomposes cleanly into stages, or when the model needs to act on the world before continuing.
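Prompt chaining in this sense is just feeding one stage's output into the next stage's input, so each prompt stays small. A minimal sketch, again with the hypothetical call_llm stand-in:

```python
def call_llm(prompt: str) -> str: ...  # hypothetical stand-in for your model client

def chained_answer(question: str, document: str) -> str:
    """Two-stage chain: pull the relevant facts out first, then answer from them alone."""
    facts = call_llm(
        "Extract only the facts from this document that bear on the question.\n\n"
        f"Question: {question}\n\nDocument:\n{document}"
    )
    return call_llm(
        "Answer the question using only the extracted facts below.\n\n"
        f"Question: {question}\n\nFacts:\n{facts}"
    )
```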

Systems we can borrow from (RAG-shaped). A repo, a logs directory, a diff history are all RAG layers in disguise. The lesson from RAG isn't "use a vector database." It's "don't describe; point." Specific files, specific versions, specific diffs. Description introduces drift; pointing grounds it.
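Pointing can be as literal as inlining the exact files and the latest diff into the prompt. A sketch; the paths and the git invocation are illustrative, not prescriptive:

```python
import subprocess
from pathlib import Path

def grounded_prompt(task: str, files: list[str]) -> str:
    """Ground the task in specific files and a specific diff instead of a description."""
    sections = [f"--- {path} ---\n{Path(path).read_text()}" for path in files]
    diff = subprocess.run(
        ["git", "diff", "HEAD~1"], capture_output=True, text=True
    ).stdout
    return "\n\n".join(sections) + f"\n\nLast commit's diff:\n{diff}\n\nTask:\n{task}"
```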

The reframe that changed the most

The biggest single shift in how I prompt: stop asking the model whether the output is correct.

"Is this right?" invites defense. The model has just produced the output. Asking it to grade itself produces a confident affirmative most of the time, because the same machinery that generated the output is now being asked to evaluate it.

The better question: "What assumptions did you make, and what breaks first?"

This exposes seams instead of inviting justification. It forces the model to surface the implicit decisions baked into the output: the input formats it assumed, the edge cases it didn't handle, the constraints it inferred but didn't verify. Once those are visible, you can decide whether each one is acceptable. "Is this correct?" gets you a yes. "What breaks first?" gets you a list.

This is also a better reflection prompt than the standard "review your answer for errors" pattern, because it doesn't depend on the model finding its own mistakes. It depends on the model articulating its own assumptions, which is a much more reliable thing to ask of it.
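This drops straight into the reflection loop sketched earlier; only the question changes. Again with the hypothetical call_llm stand-in:

```python
def call_llm(prompt: str) -> str: ...  # hypothetical stand-in for your model client

def surface_assumptions(prompt: str, output: str) -> str:
    """Ask for assumptions and failure points, not a verdict on correctness."""
    return call_llm(
        f"{prompt}\n\nOutput:\n{output}\n\n"
        "What assumptions did you make while producing this output, "
        "and what breaks first if each assumption is wrong?"
    )
```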

Rolling state, not history

For anything multi-turn, the default failure mode is context bloat. Every turn appends to the history, the history gets too long, you compact it, and compaction loses the load-bearing context: the early decisions, the constraints, the invariants that everything downstream depends on.

The fix is to maintain a compressed state explicitly. Not history. State. Key decisions, active constraints, things-that-must-not-change. Update it as you go. When context gets tight, you compact the conversational history but never the state.

Dense context. Saved tokens. Preserved direction.
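One way to make the state explicit, sketched with the same hypothetical call_llm stand-in: a small structured record that travels alongside the history and is never compacted.

```python
from dataclasses import dataclass, field

def call_llm(prompt: str) -> str: ...  # hypothetical stand-in for your model client

@dataclass
class RollingState:
    decisions: list[str] = field(default_factory=list)    # key decisions made so far
    constraints: list[str] = field(default_factory=list)  # active constraints
    invariants: list[str] = field(default_factory=list)   # must-not-change items

    def render(self) -> str:
        return (
            "Decisions:\n- " + "\n- ".join(self.decisions) + "\n"
            "Constraints:\n- " + "\n- ".join(self.constraints) + "\n"
            "Invariants:\n- " + "\n- ".join(self.invariants)
        )

def next_turn(state: RollingState, compacted_history: str, user_msg: str) -> str:
    """The history gets compacted when context runs tight; the state block never does."""
    return call_llm(f"{state.render()}\n\n{compacted_history}\n\nUser:\n{user_msg}")
```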

What to skip

Some techniques in the typical list increase entropy without giving you a way to reduce it back. Zero-shot prompting (when you have any examples available), self-consistency for tasks that aren't ambiguous, and meta-prompting for problems you haven't compressed yet: these add cost and uncertainty without a corresponding reduction in either. They have their place, but they're not the default tools, and listing them with equal weight to reflection or grounding is misleading.

Worse: they're often the techniques that feel sophisticated, which means people reach for them first. Self-consistency feels rigorous. Meta-prompting feels meta. Both are easy ways to spend tokens without spending thought.

The actual diagnostic

Here's the decision logic, condensed:

  • Output is wrong in a vague way? Reduce entropy. You haven't compressed the problem yet.
  • Output has wrong structure or format? Pattern bias: few-shot, format, instruction. Cheap and high-leverage.
  • Reasoning slipped somewhere? Add scaffolding. Reflection first; CoT or ToT only when the model is guessing at structure.
  • Output is factually off? Ground it. Point at files, versions, diffs. Don't describe.

Failure mode → fix type. That's the whole framework. The specific techniques are implementations of these four moves.
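As code, the framework is barely more than a lookup. A sketch; the failure labels are mine, not a formal taxonomy:

```python
# The four moves as a dispatch table.
FIXES = {
    "vaguely_wrong":   "Reduce entropy: compress the problem before prompting again.",
    "wrong_structure": "Pattern bias: few-shot, format, or instruction constraints.",
    "bad_reasoning":   "Scaffold: reflection first, CoT/ToT only if structure is being guessed.",
    "factually_off":   "Ground: point at specific files, versions, and diffs.",
}

def diagnose(failure: str) -> str:
    """Map an observed failure mode to a class of fix, not to a specific technique."""
    return FIXES.get(failure, "Unrecognized failure: compress and re-diagnose first.")
```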

[Image: Prompt Diagnostic decision tree]

This is also why list-format posts feel unsatisfying after a while. They give you the techniques without the diagnostic, which is like handing someone a toolbox without telling them how to identify what's broken. You end up applying tools by feel rather than by indication.

Where this is going

The reason I've been thinking about this isn't really about prompting techniques. It's about the layer above them... how developers, platforms, and users work with generative AI systems, and where the friction in that interaction comes from.

The friction isn't usually that the model is bad. It's that the interface between human intent and model output is underspecified at every level. List-format posts on prompting techniques are a symptom of that: they're trying to make the interface tractable by enumerating its surface, but the actual problem is structural.

That's a longer thread, and not for this post. But it's where the next few are going.


This post grew out of reading one too many "complete guide to prompting" lists. The decision tree above is my attempt to compress what those lists actually need into something diagnostic instead of encyclopedic.
