Gabriel Anhaia
When 'Take a Deep Breath' Stopped Working: Prompt Tricks With an Expiry Date


You have seen the screenshot. A LinkedIn post, two thousand
likes, the headline reading something like "Google found the
magic phrase that makes ChatGPT smarter." The phrase is
Take a deep breath and work on this problem step-by-step.
Someone on your team pastes it into the system prompt the
next morning. It sits there for a year. Nobody re-tests it.
Two model upgrades later, you have a line in production that
nobody can defend and nobody wants to be the person who
deletes.

This is what every prompt-engineering "trick" turns into if
you do not own an eval harness. The trick is real — until it
isn't. The model under it changes. The post-training changes.
The trick's effect drifts. Small win, no effect, or quietly
hurting — you do not know which one you are in unless you
measure.

Where the phrase actually came from

The line is from Yang et al., 2023, Large Language Models
as Optimizers — the OPRO paper from Google DeepMind
(arXiv:2309.03409).
The setup matters because it constrains the claim. The
authors used one LLM to propose candidate instructions and
another to score them on a benchmark. They ran this loop on
GSM8K (grade-school math) and Big-Bench Hard. The
optimizer-generated instruction that scored highest for the
PaLM 2-L scorer on GSM8K was, in the paper's own table:

"Take a deep breath and work on this problem step-by-step."

The reported lift over a strong human-designed baseline
("Let's think step by step") was up to about 8 percentage
points on GSM8K for PaLM 2-L. Different scorer, different
benchmark, different winning phrase. The text-bison
optimum was not the PaLM 2-L optimum. The GPT-family scorers
landed somewhere else again. The paper is careful about
this; the meme that rode out of it on social media was not.

So the precise claim the paper supports is narrow: for one
specific model on one specific benchmark, this phrase did
better than other phrases the optimizer tried. It is not
a universal incantation. It is a local optimum found by
black-box search.
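For intuition, here is a toy sketch of that propose-and-score
loop. The `search_prompts` function and its string-append
proposer are illustrative stand-ins I am inventing for this
post; the paper uses an LLM as the proposer and a benchmark
model as the scorer.

```python
def search_prompts(candidates, score_fn, rounds=3):
    """OPRO-style black-box prompt search, heavily stubbed:
    score each candidate instruction, keep a running score
    table, and generate variants of the current winner.
    The real proposer is an LLM; here it just appends a
    nudge so the sketch stays runnable."""
    pool = list(candidates)
    scores = {}
    for _ in range(rounds):
        for prompt in pool:
            scores.setdefault(prompt, score_fn(prompt))
        best = max(scores, key=scores.get)
        # Stub proposer: mutate the winner instead of
        # asking a model to rewrite it.
        pool = [best, best + " Show your work."]
    return best, scores[best]
```

The shape is the whole point: nothing in the loop knows why a
phrase scores well. It only knows that it did, on that scorer,
that day.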

Why the local optimum drifts

Three things change between the day a prompt trick is found
and the day you read about it.

Model checkpoint. PaLM 2-L is not the model you are
calling. Newer instruction-tuned models (post-RLHF GPT-4 and
later, Claude 3 and later, Gemini 1.5 and later) were trained
on more chain-of-thought data and more reasoning traces. The
behaviour the deep-breath line was nudging is, in many cases,
already the default. There is less headroom for the trick to
fill.

Decoding settings. OPRO scored at temperature 1.0 with
specific sampling settings. A production call at temperature
0.2 with top_p=0.9 is a different distribution. The same
phrase can land differently.
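A toy softmax makes the point concrete. The logits below are
made up; the shape of the effect is not: dividing by a lower
temperature concentrates probability mass, so production at
T=0.2 samples from a sharper distribution than the T=1.0 one
the phrase was scored on.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Token distribution at a given sampling temperature.
    Lower temperature sharpens the distribution toward the
    argmax token; the same prompt therefore faces a
    different output distribution at T=0.2 than at T=1.0."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

At T=1.0 the top token of `[2.0, 1.0, 0.5]` gets roughly 63%
of the mass; at T=0.2 it gets over 99%. Whatever the
deep-breath phrase was doing to the sampled reasoning chain,
it is doing it to a different chain now.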

Benchmark vs. your task. GSM8K is a math word-problem
set with short numerical answers. Your task is probably not
that. Even if the phrase still helps on GSM8K today, the
generalisation to "extract this field from a support ticket"
is an assumption, not a finding.

The honest reading is: the phrase had a measurable effect
on one model on one benchmark in late 2023. Whether it has
an effect on your model on your task today is an empirical
question you have not answered yet.

A 60-line eval that answers it

Two prompts, same task, two model versions. Score and compare.
The harness below uses GSM8K-style problems because that is
what the original paper used — keep it close to the source
when you are testing whether a published finding still holds.

import re

from anthropic import Anthropic

# Anthropic() reads ANTHROPIC_API_KEY from the environment.
client = Anthropic()

WITH_BREATH = (
    "Take a deep breath and work on this problem "
    "step-by-step. Then give the final number on a "
    "line starting with 'Answer: '."
)

WITHOUT_BREATH = (
    "Solve this problem. Then give the final number "
    "on a line starting with 'Answer: '."
)

Two prompt variants. Same output contract on both — that is
how you keep the comparison fair. If one variant gets a
different parsing path, you are measuring the parser, not
the prompt.

PROBLEMS = [
    {
        "q": (
            "Janet's ducks lay 16 eggs per day. She "
            "eats 3 for breakfast and bakes muffins "
            "with 4. She sells the rest at $2 each. "
            "How much does she make per day?"
        ),
        "a": "18",
    },
    {
        "q": (
            "A robe takes 2 bolts of blue fiber and "
            "half that much white fiber. How many "
            "bolts in total?"
        ),
        "a": "3",
    },
    {
        "q": (
            "James writes a 3-page letter to 2 "
            "friends twice a week. How many pages "
            "does he write a year?"
        ),
        "a": "624",
    },
]

Three problems is a stub for the post. In your real harness,
pull 50 to 200 from the GSM8K test split. Better: pull from
your production traffic if your task is not math. Save them
as a CSV so the dataset travels with the harness.
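A loader for that CSV might look like this — `load_problems`
is a name I am assuming, and the file is expected to carry
the same `q`/`a` columns the stub list above uses.

```python
import csv

def load_problems(path):
    """Load eval problems from a CSV with 'q' and 'a'
    columns, so the dataset travels in the repo next to
    the harness instead of living in someone's head."""
    with open(path, newline="", encoding="utf-8") as f:
        return [
            {"q": row["q"], "a": row["a"]}
            for row in csv.DictReader(f)
        ]
```

Swap `PROBLEMS = [...]` for `PROBLEMS = load_problems("gsm8k_sample.csv")`
and the harness scales from three problems to two hundred
without touching the scoring code.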

def extract(text: str) -> str:
    m = re.search(
        r"Answer:\s*\$?(-?\d[\d,\.]*)",
        text,
    )
    if not m:
        return ""
    return m.group(1).replace(",", "").rstrip(".")

def call(model: str, system: str, problem: str) -> str:
    resp = client.messages.create(
        model=model,
        max_tokens=400,
        system=system,
        messages=[{"role": "user", "content": problem}],
    )
    return resp.content[0].text

The extractor is doing the boring work of pulling a number
out of a free-form response. Keep it strict. A loose
extractor flatters whichever variant rambles more.
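Worth pinning that strictness down with a few assertions kept
next to the harness. The extractor is reproduced here so the
snippet runs standalone:

```python
import re

def extract(text: str) -> str:
    m = re.search(r"Answer:\s*\$?(-?\d[\d,\.]*)", text)
    if not m:
        return ""
    return m.group(1).replace(",", "").rstrip(".")

# Accepts the contract: 'Answer: <number>', with optional
# dollar sign, thousands commas, and a trailing period.
assert extract("Answer: 18") == "18"
assert extract("Answer: $1,234.") == "1234"
# Rejects output that never follows the contract —
# including prose answers and a lowercase 'answer:'.
assert extract("The total is eighteen dollars.") == ""
assert extract("answer: 18") == ""
```

If a variant fails these contract checks often, that is a
finding in itself: the phrase is changing output format, not
just reasoning.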

def score(model: str, system: str) -> float:
    correct = 0
    for p in PROBLEMS:
        out = call(model, system, p["q"])
        if extract(out) == p["a"]:
            correct += 1
    return correct / len(PROBLEMS)

def main() -> None:
    models = [
        "claude-3-5-sonnet-20240620",
        "claude-sonnet-4-5",
    ]
    for m in models:
        with_b = score(m, WITH_BREATH)
        without_b = score(m, WITHOUT_BREATH)
        print(
            f"{m:35s} with={with_b:.2f}  "
            f"without={without_b:.2f}  "
            f"delta={with_b - without_b:+.2f}"
        )

if __name__ == "__main__":
    main()

Run that. Two model versions, two prompts, six API calls
per row, a per-model delta at the end. The numbers you get
are yours, not mine. Do not paste my numbers into a slide
deck — paste your numbers, on your task, on your model,
with the date the run happened.

What you are looking for in the output is the sign and
size of the delta column. On GSM8K-style problems against a
recent post-RLHF Claude or GPT-class model, do not be
surprised if the delta is small, zero, or even slightly
negative. That is the point. The phrase was discovered as a
local optimum on a specific model. Run the same search today
and the optimum will sit somewhere else.
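With 50 to 200 problems the delta is noisy, so put a rough
interval around it before anyone acts on the sign. A paired
bootstrap over per-problem correctness is one simple way to
get one — a sketch, not a substitute for a proper
significance test.

```python
import random

def bootstrap_delta(with_correct, without_correct,
                    n_boot=10_000, seed=0):
    """Paired bootstrap over per-problem 0/1 correctness
    lists (same problems, same order, both variants).
    Returns (low, high): a rough 95% interval for the
    accuracy delta. If it straddles zero, the trick has
    not earned a line in your system prompt."""
    rng = random.Random(seed)
    n = len(with_correct)
    deltas = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        d = sum(with_correct[i] - without_correct[i]
                for i in idx) / n
        deltas.append(d)
    deltas.sort()
    return (deltas[int(0.025 * n_boot)],
            deltas[int(0.975 * n_boot)])
```

Feed it the per-problem pass/fail lists the `score` loop
already walks over, and a +0.04 delta on 50 problems stops
looking like a result and starts looking like what it usually
is: noise.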

Every prompt tip is dated produce

Treat any prompt-engineering claim you read like a date stamp
on a carton of milk. Three questions before you adopt it:

  1. Which model and which version was the finding on? If the post does not say, treat the claim as anecdotal.
  2. What benchmark or task was it measured on? A trick that helps reasoning may hurt extraction. The same line that boosts extraction can break a tool-call schema.
  3. When was it measured? A 2023 finding on a 2026 model is a hypothesis, not a result.

The same applies to your own findings. The line you added
to the system prompt nine months ago because someone on the
team ran an evening of A/Bs is, today, a hypothesis again.
Whatever model your provider is routing your calls to has
been silently updated at least once since then. Re-run the
harness on every model upgrade. Pin the result with a date
and a model string in the prompt file's comment header so
the next person on the team can see when the claim was last
verified.

Three rules for the prompt file

The harness is the enforcement mechanism. The prompt file is
where the discipline lives. Three habits keep the rot out:

  • Date every claim. A comment above any "trick" line with the model string and the date it was last validated. # verified 2026-04-12 on claude-sonnet-4-5: +1.8%.
  • Delete on upgrade. When you bump the model, the dated comments expire. Re-run the harness or remove the line. Untested folklore is worse than a plain prompt.
  • Keep the variant. The non-trick version stays in the repo as a comment or a sibling file. The day the trick hurts on a new model, you can flip back without writing anything from scratch.
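None of this holds unless something checks the dates. A small
script in CI can scan the prompt file for the `# verified`
comments and fail when one has expired. The comment format
and the 90-day window here are assumptions — adjust both to
your own convention.

```python
import re
from datetime import date, timedelta

VERIFIED = re.compile(
    r"#\s*verified\s+(\d{4}-\d{2}-\d{2})\s+on\s+(\S+)"
)

def stale_claims(prompt_text, current_model,
                 max_age_days=90, today=None):
    """Scan a prompt file for '# verified YYYY-MM-DD on
    <model>' comments and return the expired ones: those
    validated on a different model string, or older than
    max_age_days. Each hit is (line_no, model, date_str)."""
    today = today or date.today()
    stale = []
    for i, line in enumerate(prompt_text.splitlines(), 1):
        m = VERIFIED.search(line)
        if not m:
            continue
        when = date.fromisoformat(m.group(1))
        model = m.group(2).rstrip(":")
        too_old = (today - when) > timedelta(days=max_age_days)
        if model != current_model or too_old:
            stale.append((i, model, m.group(1)))
    return stale
```

Wire it into the same pipeline that bumps the model string,
and "delete on upgrade" stops being a habit and becomes a
failing build.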

The deep-breath line is harmless on its own. The damage is
the precedent it sets: prompts treated as artisan craft
rather than versioned code. Once a team accepts that any
line in the system prompt has to justify itself with a
recent eval result, the rest of the prompt-engineering
folklore gets the same treatment: act as a senior engineer,
think carefully, you are an expert. A few will earn their
keep. The rest will get deleted. You only find out which is
which once the harness runs.


If this was useful

The Prompt Engineering Pocket Guide
covers the rest of the pattern: how to build a portable
prompt eval harness, how to date-stamp prompt claims so
they expire on schedule, and which prompt techniques have
held up across the last three model generations versus the
ones that quietly stopped working. Written for engineers
who maintain prompts in production and want a way to tell
folklore apart from findings.

Prompt Engineering Pocket Guide
