Chain-of-Thought When It Hurts: 3 Tasks Where Reasoning Backfires

#llm #ai #prompt #performance

Book: Prompt Engineering Pocket Guide: Techniques for Getting the Most from LLMs
Me: xgabriel.com | GitHub

You add "think step by step" to every prompt now. It became muscle memory the year chain-of-thought turned up in every benchmark table. The model got better at math word problems, so you sprinkled the phrase on everything: the sentiment classifier, the JSON formatter, the yes/no gate in front of an API call.

Then you looked at the p99 latency and the bill. A binary classifier that should answer in 40 tokens is emitting 600 tokens of deliberation before it says "spam." On a million calls a day, you're paying for a model to talk itself into an answer it already knew on token one.

Chain-of-thought is not free, and it is not always an improvement. The 2022 Wei et al. paper that put CoT on the map was specific: the gains showed up on multi-step arithmetic, commonsense, and symbolic reasoning. Nobody promised it would help you sort emails into three buckets. A 2024 study, "To CoT or not to CoT?", looked across task types and found the lift concentrated on math and symbolic problems, with little to no benefit elsewhere. On some tasks, the extra reasoning made things worse.

Here are three task shapes where reasoning backfires, and what to do instead.

1. Simple classification: reasoning is a distraction

You have a classifier. Input is a support ticket, output is one of billing, bug, feature_request, other. The model already knows the answer the moment it reads the ticket. The label is a pattern match, not a derivation.

Force it to reason first and two things happen. Latency multiplies, because the model emits a paragraph before the label. And accuracy can drift, because a long reasoning trace gives the model room to argue itself out of the obvious call. It reads "I was charged twice," starts reasoning about whether a double charge is technically a bug in the billing system, and lands on bug instead of billing. The reasoning manufactured a doubt that wasn't there.

A direct-answer prompt is shorter and steadier:

Classify the ticket into exactly one label:
billing, bug, feature_request, other.

Reply with the label only. No explanation.

Ticket: {ticket_text}

Constrain the output and the model has nowhere to wander:

labels = {"billing", "bug", "feature_request", "other"}

def classify(client, ticket: str) -> str:
    resp = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=4,
        messages=[{
            "role": "user",
            "content": (
                "Classify into one label: billing, bug, "
                "feature_request, other. Label only.\n\n"
                f"Ticket: {ticket}"
            ),
        }],
    )
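    # Normalize the reply, then fall back to "other" for anything
    # outside the known label set (including a truncated label).
    out = resp.content[0].text.strip().lower()
    return out if out in labels else "other"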

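max_tokens=4 is the real guardrail. It makes a reasoning trace physically impossible, so the model commits to the label on the first token. Check the cap against your tokenizer, though: if the longest label (here feature_request) needs more than 4 tokens, it gets truncated and silently falls through to "other", so raise the cap just enough to fit it. On a high-volume classifier, that cap is the difference between a 40-token call and a 600-token one.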

If accuracy on the hard tickets genuinely needs reasoning, that is a signal to route only the low-confidence cases to a reasoning pass, not to tax every call with it. Cheap direct call first; escalate only the ambiguous minority.
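One way to sketch that routing, assuming you have some confidence signal for the direct call. The classify function above returns only a label, so direct and reasoned here are hypothetical callables you would supply yourself (the confidence could come from agreement across repeated samples, or from a separate lightweight scorer), and the 0.8 threshold is an arbitrary placeholder to tune on your own data:

```python
from typing import Callable, Tuple

# Hypothetical signatures: neither callable comes from a real SDK.
DirectFn = Callable[[str], Tuple[str, float]]  # ticket -> (label, confidence in [0, 1])
ReasonedFn = Callable[[str], str]              # ticket -> label, via a slower reasoning prompt


def route(ticket: str, direct: DirectFn, reasoned: ReasonedFn,
          threshold: float = 0.8) -> Tuple[str, bool]:
    """Return (label, escalated). The cheap direct call always runs first."""
    label, confidence = direct(ticket)
    if confidence >= threshold:
        return label, False
    # Only the ambiguous minority pays for the reasoning pass.
    return reasoned(ticket), True
```

Returning the escalated flag alongside the label lets you log what fraction of traffic takes the expensive path, which is the number that tells you whether the threshold is set sensibly.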

2. Format and extraction tasks: thinking corrupts the shape

The second place CoT hurts is anything where the output has a strict shape. JSON extraction. Reformatting a date. Pulling three fields out of an invoice. The task isn't "reason about this," it's "transcribe this into a structure."

When you let a model reason before it emits JSON, you invite two failure modes. It narrates ("Looking at the invoice, I can see the total is...") and that prose ends up wrapped around the JSON, breaking your parser. Or it reasons itself into "improving" a field: it sees a date as 03/04/2026, deliberates about US versus EU format, and rewrites it instead of copying it verbatim.

The fix is to make the structure the only thing the model produces. Pre-fill the opening of the response so the model has already started the JSON before it can write a preamble:

import json

def extract(client, invoice: str) -> dict:
    resp = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=200,
        messages=[
            {
                "role": "user",
                "content": (
                    "Extract vendor, total, due_date from the "
                    "invoice. Copy values verbatim. Return JSON "
                    "with keys vendor, total, due_date.\n\n"
                    f"Invoice: {invoice}"
                ),
            },
            # Pre-fill: the reply has to continue from this opening brace.
            {"role": "assistant", "content": "{"},
        ],
    )
    # The response holds only the continuation, so restore the brace.
    return json.loads("{" + resp.content[0].text)

The assistant pre-fill of { is the move. The model's next token has to continue a JSON object, so there is no room for a reasoning paragraph. "Copy values verbatim" in the instruction blocks the second failure mode: it tells the model this is transcription, not interpretation.
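The instruction is a request, not a check. A minimal sketch of a post-hoc check, assuming every extracted value should appear character-for-character somewhere in the invoice text; non_verbatim_fields is a hypothetical helper written for this example, not part of any SDK:

```python
def non_verbatim_fields(invoice: str, extracted: dict) -> list:
    """Return the keys whose values do not appear verbatim in the invoice."""
    return [
        key for key, value in extracted.items()
        if str(value) not in invoice
    ]


invoice = "Vendor: Acme Ltd\nTotal: 1,250.00\nDue: 03/04/2026"
extracted = {"vendor": "Acme Ltd", "total": "1,250.00", "due_date": "2026-04-03"}
print(non_verbatim_fields(invoice, extracted))  # ['due_date']
```

The rewritten date is caught because the string 2026-04-03 never occurs in the source. A substring test is deliberately crude: it also flags a total returned as a number instead of the original text, which on a transcription task is usually what you want to know about.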

If your stack supports a structured-output or tool-call mode that constrains generation to a schema, use that instead. It enforces the shape at the decoder level, which is a stronger guarantee than a pre-fill that only steers the first token. Either way, the principle holds: on shape tasks, every reasoning token is a chance for the model to deviate from the shape.
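As one possible shape, here is a sketch using a forced tool call through the Anthropic Python SDK's Messages API. The tool name record_invoice and its schema are invented for this example. How strictly the arguments are held to the schema depends on the provider and the mode you use, so keep validating the result:

```python
# Hypothetical tool definition: the name and fields are illustrative.
INVOICE_TOOL = {
    "name": "record_invoice",
    "description": "Record fields copied verbatim from an invoice.",
    "input_schema": {
        "type": "object",
        "properties": {
            "vendor": {"type": "string"},
            "total": {"type": "string"},
            "due_date": {"type": "string"},
        },
        "required": ["vendor", "total", "due_date"],
    },
}


def extract_with_tool(client, invoice: str) -> dict:
    resp = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=200,
        tools=[INVOICE_TOOL],
        # Force the model to answer by calling this one tool.
        tool_choice={"type": "tool", "name": "record_invoice"},
        messages=[{
            "role": "user",
            "content": (
                "Extract the invoice fields. Copy values verbatim.\n\n"
                f"Invoice: {invoice}"
            ),
        }],
    )
    # The structured arguments arrive as the input of a tool_use block.
    for block in resp.content:
        if block.type == "tool_use":
            return block.input
    raise ValueError("no tool_use block in response")
```

Declaring every field as a string keeps the task a transcription: the model has no reason to convert the total to a number or normalize the date.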

3. Latency-bound interactive paths: the user is waiting

The third case isn't about accuracy at all. It's about a human staring at a spinner.

Autocomplete. Inline code suggestions. A voice agent that has to respond before the silence gets awkward. A search box that ranks results as you type. These paths have a latency budget measured in a few hundred milliseconds, and the user feels every token the model emits before the answer appears.

Chain-of-thought blows that budget. A reasoning trace might add 500 to 1,500 tokens before the first useful word. Even at a few hundred tokens per second that's whole seconds of dead air, not milliseconds, on a path where the whole point is to feel instant. The reasoning might even produce a marginally better suggestion. It does not matter, because the user has already kept typing and your suggestion arrived too late to be useful.
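The delay is plain division, so it's worth plugging in your own measured decode speed. A small sketch; the two rates below are illustrative assumptions, not benchmarks of any particular model:

```python
def reasoning_delay_seconds(reasoning_tokens: int, tokens_per_second: float) -> float:
    """Seconds spent emitting reasoning tokens before the first useful token."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return reasoning_tokens / tokens_per_second


# Illustrative rates only; measure your own.
for rate in (100, 500):
    low = reasoning_delay_seconds(500, rate)
    high = reasoning_delay_seconds(1500, rate)
    print(f"{rate} tok/s: {low:.1f}s to {high:.1f}s")
# 100 tok/s: 5.0s to 15.0s
# 500 tok/s: 1.0s to 3.0s
```

Against a budget of a few hundred milliseconds, both rows fail before the answer has even started.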

For these paths, suppress reasoning explicitly and cap the output hard:

def suggest(client, prefix: str) -> str:
    resp = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=30,
        messages=[{
            "role": "user",
            "content": (
                "Complete the next line of code. Output only "
                "the completion, no commentary, no reasoning.\n\n"
                f"{prefix}"
            ),
        }],
    )
    return resp.content[0].text

The small model and the tight max_tokens are both load-bearing. You want the fastest model that clears your quality bar, answering directly, with a token cap that matches the size of a real completion. If a path is interactive and latency-bound, reasoning is a tax you pay in user trust.

How to decide: a one-line test

Before you reach for "think step by step," ask whether the task has intermediate steps that the answer actually depends on.

Multi-step math, planning, multi-hop lookups, debugging a stack trace, anything where the model has to derive the answer through a chain of sub-conclusions: keep CoT. This is what it was built for.
A label, a field, a format, a yes/no gate, a fast completion: the answer is a lookup or a transcription, not a derivation. Suppress reasoning, cap the tokens, and pre-fill the output shape.

A blunter version of the same test: if you can imagine answering correctly in under a second yourself, the model can too, and the reasoning trace is just cost. The tasks where CoT earns its tokens are the ones where you would also need a scratchpad.

The mistake is generalizing a narrow result to tasks that never needed it. Measure latency and per-call tokens on your classifiers and extractors. You'll find calls reasoning their way to answers they had on token one.
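A minimal sketch of that measurement, assuming responses shaped like the Anthropic SDK's, with a usage.output_tokens field. The names measure and CallStats are hypothetical, made up for this example:

```python
import time
from dataclasses import dataclass
from typing import Any, Callable, Tuple


@dataclass
class CallStats:
    latency_s: float
    output_tokens: int


def measure(call: Callable[[], Any]) -> Tuple[Any, CallStats]:
    """Run one model call and record wall-clock latency and output tokens."""
    start = time.perf_counter()
    resp = call()
    elapsed = time.perf_counter() - start
    # Assumes the response exposes usage.output_tokens.
    return resp, CallStats(latency_s=elapsed, output_tokens=resp.usage.output_tokens)
```

Wrap an existing call as measure(lambda: client.messages.create(...)) and log the stats per endpoint. A classifier whose output_tokens sits in the hundreds is deliberating before it answers.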

If this was useful

Knowing when to suppress reasoning is the same skill as knowing when to add it, and most teams have only practiced the second half. The chapter on reasoning control in the Prompt Engineering Pocket Guide walks through the task taxonomy, the pre-fill and stop-token tricks, and how to build the confidence-based routing that sends only the hard cases through a reasoning pass. Same instinct, applied to a decision you're probably making on autopilot.