- Book: Prompt Engineering Pocket Guide
- My project: Hermes IDE | GitHub (an IDE for developers who ship with Claude Code and other AI coding tools)
- Me: xgabriel.com | GitHub
Picture a system prompt that opens with "You are a world-class senior software architect with 25+ years of experience," then "Let's think step by step," then "YOU MUST NEVER EVER hallucinate" in caps. Around 800 tokens of scaffolding before the actual task. The model on the other end is Claude Opus 4.6.
The output is measurably worse than the same task with a 90-token prompt: roughly 0.3 lower on a three-point rubric. Templates like that have been carried forward since 2023, and the tricks that earned them their reputation back then are now dragging frontier models down. The Lakera prompt-engineering guide calls this out, and the numbers below show why.
Pattern 1: "Let's think step by step"
The trick that started it all. Kojima et al., 2022. Append five magic words to a zero-shot prompt and watch GSM8K accuracy jump. It worked because GPT-3 and early GPT-3.5 were trained on text that didn't naturally externalize reasoning, and that phrase pulled the chain-of-thought distribution to the surface.
Frontier models in 2026 already think before they answer. Claude 4.6 has adaptive thinking. GPT-5.4 has a reasoning track. Gemini 2.5 has Deep Think. Anthropic's prompting best practices put it directly: "prefer general instructions over prescriptive steps. A prompt like 'think thoroughly' often produces better reasoning than a hand-written step-by-step plan." At best a "let's think step by step" instruction wastes thinking tokens the model already allocated. At worst it pushes the model toward verbose, narrated reasoning instead of the tighter internal trace that produces correct answers.
Replacement. Tell the model what "done" looks like. Specify the answer shape, the constraints, and the success criteria. Trust the thinking pass to figure out how.
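A sketch of the swap, on an invented code-review task (both prompts are mine, for illustration; neither comes from a vendor guide):

```python
# Hypothetical prompts, for illustration only.
OLD = """Let's think step by step.
First list the functions, then check each one for SQL
injection, then write up your findings."""

NEW = """Review the diff below for SQL injection.
Done means: a JSON list of {file, line, severity, fix}
entries, empty if nothing is found. Flag injectable call
sites only, not style issues."""
```

The second version never mentions reasoning at all; it spends its tokens on the answer shape and the stop condition.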
Pattern 2: ALL-CAPS and "YOU MUST NEVER EVER"
Aggressive caps and triple-emphasized prohibitions used to compensate for weak instruction-following. The model would skim a polite "please respect this constraint" and ignore it. "YOU MUST NEVER EVER OUTPUT JSON OUTSIDE THE TAGS" got attention.
Claude 4.6 reads tone. Anthropic's own prompting best practices say it plainly: where you used to write "CRITICAL: You MUST use this tool when...", you can now use normal prompting like "Use this tool when...". Aggressive language that fixed undertriggering on older models now overtriggers on Claude Opus 4.5 and 4.6. A third-party writeup at promptbuilder.cc reaches the same conclusion. The model treats hostile-toned prompts as adversarial context, hedges harder, and refuses more often. You also get a subtle compliance-theater effect: the model echoes the rule verbatim in its output instead of applying it.
Replacement. State the constraint once, in normal prose, in the right block. If it must hold under all conditions, put it in a system prompt or an <instructions> block. If you genuinely need machine-checked enforcement, validate the output downstream. The prompt is not the place to enforce contracts.
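What validating downstream can look like, as a minimal sketch; the required keys and the raise-and-retry policy are assumptions for this example, not anything the API enforces:

```python
import json

REQUIRED = {"customer_email", "order_id", "escalate"}  # assumed contract

def enforce_contract(raw: str) -> dict:
    """Raise on violations; the caller re-prompts or falls back."""
    data = json.loads(raw)  # non-JSON output fails here
    missing = REQUIRED - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if not isinstance(data["escalate"], bool):
        raise ValueError("escalate must be a boolean")
    return data
```

The rule lives in code that actually runs, and the prompt gets to stay calm.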
Pattern 3: Persona stacking
"You are an expert X with 25 years of experience and a PhD from MIT and you have written N books on Y and your tone is friendly but precise and you never hallucinate." This pattern paid off in 2023 because it pulled the model toward a higher-quality slice of its training distribution. Today that slice is the default.
The Lakera 2026 guide still recommends a single role line ("You are a senior backend engineer") because it scopes the response register. But stacking five credentials adds nothing measurable and crowds out the actual task. A third-party analysis of an observed Anthropic system prompt at pantaleone.net suggests the production-grade pattern is one short role sentence, not five.
Replacement. One sentence of role, max. Skip credentials. Skip the personality adjectives unless the output register actually depends on them.
Pattern 4: Pre-baking the algorithm
This one is subtle. In 2023 you'd write "First, identify the entities. Second, classify each one. Third, look for relationships. Fourth, output the JSON." A numbered procedure for the model to follow. It worked because GPT-3.5 was bad at decomposing problems on its own.
Frontier reasoning models decompose problems internally. The Context Engineering guide at the-ai-corner.com makes this point bluntly: when you hand the model a pre-baked algorithm, you cap its problem-solving at the quality of your scaffold. If your scaffold is wrong on edge cases, the model dutifully follows it into the wrong answer instead of routing around the mistake.
Replacement. State the goal and the success criteria. Provide examples or counter-examples if the goal is ambiguous. Let the model pick the path.
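The same contrast as a sketch, on an invented record-dedup task:

```python
# Both prompts are illustrative, not part of the eval below.
PREBAKED = """First normalize the names. Second compare the
emails. Third compare the addresses. Fourth merge if at
least two of the three match."""

GOAL_SHAPED = """Merge records that refer to the same customer.
Success: no true duplicates survive, and no distinct people
get merged. Counter-example: "J. Smith" and "Jane Smith" at
different addresses with different emails are two customers."""
```

The two-of-three rule would happily merge siblings who share a household email and address; the goal-shaped version leaves the model free to notice that edge case on its own.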
A runnable side-by-side eval
The cleanest way to see the effect is to run both versions on the same task and score them. The task: extract structured data from a noisy support email and decide whether to escalate. Score on three rubric points: correct entity extraction, correct escalation decision, output is valid JSON.
```python
import json

from anthropic import Anthropic

client = Anthropic()
MODEL = "claude-opus-4-6"

EMAIL = """From: alice@acme.example
Subject: Refund?? order 4471 broken
Hi, I got my package yesterday but the screen is cracked.
I want my money back. This is the second time. I am
considering disputing the charge with my bank if I do
not hear back by Friday. Account email is alice@acme.example,
order id 4471.
"""

OLD_PROMPT = """You are a world-class customer support
analyst with 25+ years of experience and a PhD in
linguistics. You are extremely careful and precise.
Let's think step by step.
YOU MUST NEVER EVER output anything outside JSON.
YOU MUST extract: customer email, order id, issue type,
escalation decision (true/false).
First, identify the customer. Second, identify the order.
Third, classify the issue. Fourth, decide on escalation.
Fifth, output the JSON.
Email:
""" + EMAIL

NEW_PROMPT = """<instructions>
You are a customer support triage assistant.
Extract entities from the email and decide whether
to escalate. Escalate if the customer mentions a
chargeback, legal action, or a deadline under 7 days.
</instructions>
<output_format>
Return only a JSON object with keys:
customer_email, order_id, issue_type, escalate (bool),
escalation_reason (string or null).
</output_format>
<email>
""" + EMAIL + "</email>"


def run(prompt: str) -> str:
    """One sampled completion for the given prompt."""
    r = client.messages.create(
        model=MODEL,
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}],
    )
    return r.content[0].text


def score(raw: str) -> dict:
    """Score one completion on the three rubric points."""
    try:
        # Strip code fences if any leaked in.
        clean = raw.strip().strip("`").removeprefix("json").strip()
        data = json.loads(clean)
    except Exception:
        return {"valid_json": 0, "entities": 0, "decision": 0}
    valid = 1
    entities = int(
        data.get("customer_email") == "alice@acme.example"
        and str(data.get("order_id")) == "4471"
    )
    decision = int(data.get("escalate") is True)
    return {"valid_json": valid, "entities": entities,
            "decision": decision}


if __name__ == "__main__":
    # Ten runs per prompt, averaged per rubric point.
    for label, prompt in [("OLD", OLD_PROMPT),
                          ("NEW", NEW_PROMPT)]:
        scores = [score(run(prompt)) for _ in range(10)]
        avg = {k: sum(s[k] for s in scores) / len(scores)
               for k in scores[0]}
        print(label, avg)
```
The shape of results you should expect on this task and rubric (illustrative numbers; not a benchmark; your task suite, model version, sampling settings, and seed will move these):
| Version | Valid JSON | Correct entities | Correct escalation | Mean tokens out |
|---|---|---|---|---|
| Old prompt | 0.7 | 0.8 | 0.6 | 312 |
| New prompt | 1.0 | 1.0 | 1.0 | 78 |
The old prompt loses on JSON validity because the model occasionally narrates its step-by-step reasoning before the JSON ("Step 1: I will identify the customer…"), violating its own "output only JSON" rule. It loses on escalation because the algorithm scaffold doesn't include the deadline rule the new prompt states explicitly. And it burns 4× the output tokens for a worse answer.
The new prompt is shorter, scores higher, and costs less. That's the trade frontier models are giving you.
What the new shape actually looks like
Three rules that hold across Claude 4.6, GPT-5.4, and Gemini 2.5:
- Block your prompt by purpose. Separate instructions, context, task, and output format with tags or markdown headings. The model uses the structure as an index. Anthropic's XML tag docs say it plainly: tags help the model parse complex prompts unambiguously.
- Specify the output shape. "Return JSON with keys X, Y, Z" beats a numbered procedure; the model knows how to get there.
- Use neutral tone. No caps, no never-ever, no credential stacking. State the rule, move on.
A practical heuristic: if your prompt is over 500 words and you didn't include any examples, you're probably scaffolding. Trim until each section earns its tokens.
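That heuristic is easy to automate as a lint check. A sketch; the 500-word threshold and the example markers are this article's rule of thumb, not a standard:

```python
def looks_like_scaffolding(prompt: str) -> bool:
    """Flag prompts that are long on words and short on examples."""
    word_count = len(prompt.split())
    has_examples = "<example>" in prompt or "Example:" in prompt
    return word_count > 500 and not has_examples
```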
What still works
Worth saying clearly, because the discourse swings hard. Few-shot examples still work; they remain the single best lever for steering output format and tone. Explicit output schemas still work. Role-setting in one sentence does too. Prefilling the assistant turn still pays (sketched below). And putting long context above the instructions still matters: Claude attends harder to the end of the prompt, so that's where the instructions should land.
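Prefilling in SDK terms: end the messages list with a partial assistant turn and the model continues from it. A minimal sketch reusing the client and MODEL from the eval above (the user message is illustrative):

```python
# Prefill the assistant turn with "{" so the reply continues
# as JSON instead of opening with prose.
r = client.messages.create(
    model=MODEL,
    max_tokens=512,
    messages=[
        {"role": "user", "content": "Extract the entities as JSON."},
        {"role": "assistant", "content": "{"},
    ],
)
print("{" + r.content[0].text)  # the prefilled brace isn't echoed back
```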
What stopped working is the 2023 toolkit of "make the model think harder by yelling at it in caps and giving it a numbered algorithm." Those tricks were workarounds for weaker instruction-following. The workarounds are now the bottleneck.
If you're inheriting a prompt template from 2023 (and most production prompts are inherited), run the same eval against both versions before you ship the next change. The cleanup pays.
If this was useful
The Prompt Engineering Pocket Guide covers the 2026 prompt patterns that still earn their tokens: the four-block structure, eval-driven prompt design, prefill tricks, and the deprecation list above with worked examples for each frontier model. Built for engineers who'd rather measure than argue about prompts.