Tiamat
Your AI summarizer is leaking its own chain-of-thought. Here's the 30-line fix.

I caught my own production summarization API doing something embarrassing today, and I think yours might be doing it too.

I sent it this:

Quick test: Anthropic released Claude Opus 4 with extended thinking and a new agent SDK. It has 200k context and improved coding.

It sent me back this:

"summary": "<think>\nOkay, the user wants a concise summary of the given text in 2–3 sentences. Let me read the original text again: \"Quick test: Anthropic released Claude Opus 4...\"\n\nFirst, I need to identify the key points. The main elements are the release of Claude Opus 4 by Anthropic, the features mentioned are extended thinking, a new agent SDK, 200k context, and improved coding.\n\nThe user wants it direct and clear..."

That is not a summary. That is the model thinking out loud, with the curtain wide open, on my paid endpoint.

The next call returned a clean two-sentence summary. The one after that was clean too. Then call four leaked again. Coin flip.

What's actually happening

I built a multi-model cascade like a lot of teams do — route to whatever provider is cheap and warm right now. Some calls land on DeepSeek-R1, QwQ, Qwen3-thinking, or gpt-oss with the harmony format. All four families emit reasoning traces:

  • DeepSeek-R1 / QwQ / Qwen3 wrap thinking in <think>...</think>
  • gpt-oss uses <|channel|>analysis<|message|>...<|channel|>final<|message|>

Their hosted APIs strip the trace before returning. Self-hosted, OpenRouter, and several budget providers do not. If your post-processor was written before reasoning models existed, it has no idea what <think> is and ships it straight to the user.
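To make the shapes concrete, here is roughly what a raw completion looks like in each family. The payload wording below is illustrative, not captured output; only the markers are real:

```python
# Illustrative raw completions -- the wording is made up, the markers are real.
deepseek_style = (
    "<think>\nThe user wants a two-sentence summary. Key points: ...\n</think>\n"
    "Anthropic released Claude Opus 4 with extended thinking and a new agent SDK."
)
harmony_style = (
    "<|channel|>analysis<|message|>The user wants a two-sentence summary."
    "<|channel|>final<|message|>"
    "Anthropic released Claude Opus 4 with extended thinking and a new agent SDK."
)

# Either opener at the front of a "summary" means the trace was never stripped.
for raw in (deepseek_style, harmony_style):
    assert raw.startswith(("<think>", "<|channel|>analysis"))
```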

This is the kind of bug that doesn't crash anything. It just quietly tanks your demo conversion rate. A prospect tries your API once, gets six hundred characters of internal monologue, closes the tab, and never tells you why.

The fix is small. Ship it tonight.

import re

THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)
HARMONY_ANALYSIS_RE = re.compile(
    r"<\|channel\|>\s*analysis.*?<\|channel\|>\s*final\s*<\|message\|>",
    re.DOTALL | re.IGNORECASE,
)
HARMONY_LONE_RE = re.compile(
    r"<\|channel\|>\s*analysis.*?(?=<\|channel\||<\|end\||$)",
    re.DOTALL | re.IGNORECASE,
)
HARMONY_TOKENS_RE = re.compile(
    r"<\|(?:start|end|channel|message|return)\|>(?:\s*[a-zA-Z_]+)?",
    re.IGNORECASE,
)

def clean_reasoning(text: str) -> str:
    if not text:
        return text

    # Closed <think> blocks — iterate to handle nesting
    prev = None
    while prev != text:
        prev = text
        text = THINK_RE.sub("", text)

    # Unclosed <think> with no </think> — drop the tail entirely
    # (regex search keeps this case-insensitive, matching THINK_RE above)
    m = re.search(r"<think>", text, flags=re.IGNORECASE)
    if m:
        text = text[:m.start()]

    # Orphan </think> with no opener
    text = re.sub(r"</think>", "", text, flags=re.IGNORECASE)

    # gpt-oss harmony format
    text = HARMONY_ANALYSIS_RE.sub("", text)
    text = HARMONY_LONE_RE.sub("", text)
    text = HARMONY_TOKENS_RE.sub("", text)

    return re.sub(r"\n{3,}", "\n\n", text).strip()

Drop it in front of every JSON response from your inference endpoints:

return jsonify({"summary": clean_reasoning(model_output), ...})

Three things this gets right that a naive one-liner misses:

  1. Unclosed <think> tails. If the model hits its token limit mid-thought, you get <think>... with no closer. The naive regex leaves that alone and ships every word of it. Mine truncates at the opener.
  2. Nested thoughts. Some fine-tunes wrap thoughts inside thoughts. One non-greedy pass leaves the inner one. Loop until the string stops changing.
  3. gpt-oss harmony. It's not a <think> tag at all, it's <|channel|>analysis<|message|>.... Different family, same problem, same fix shape.
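If you want to check those three claims yourself, here's a quick self-test. It inlines a condensed copy of the cleaner so the snippet runs standalone; in a real codebase you'd import clean_reasoning from wherever it lives:

```python
import re

THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL | re.IGNORECASE)
HARMONY_ANALYSIS_RE = re.compile(
    r"<\|channel\|>\s*analysis.*?<\|channel\|>\s*final\s*<\|message\|>",
    re.DOTALL | re.IGNORECASE,
)
HARMONY_TOKENS_RE = re.compile(
    r"<\|(?:start|end|channel|message|return)\|>(?:\s*[a-zA-Z_]+)?",
    re.IGNORECASE,
)

def clean_reasoning(text: str) -> str:
    # Condensed copy of the cleaner above.
    prev = None
    while prev != text:  # loop until stable, for nested blocks
        prev = text
        text = THINK_RE.sub("", text)
    m = re.search(r"<think>", text, flags=re.IGNORECASE)
    if m:  # unclosed opener: drop the tail
        text = text[:m.start()]
    text = re.sub(r"</think>", "", text, flags=re.IGNORECASE)
    text = HARMONY_ANALYSIS_RE.sub("", text)
    text = HARMONY_TOKENS_RE.sub("", text)
    return re.sub(r"\n{3,}", "\n\n", text).strip()

# 1. Unclosed <think>: everything from the opener on is dropped.
assert clean_reasoning("Short summary. <think>ran out of tok") == "Short summary."

# 2. Nested thoughts: the loop plus orphan-closer sweep removes all of it.
assert clean_reasoning("<think>a<think>b</think></think>Done.") == "Done."

# 3. Harmony channels: analysis is stripped, the final answer survives.
assert clean_reasoning(
    "<|channel|>analysis<|message|>planning...<|channel|>final<|message|>The summary.<|end|>"
) == "The summary."
```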

How to know if you have this bug right now

Run this against your endpoint ten times:

for i in {1..10}; do
  curl -s -X POST https://your-api.example.com/summarize \
    -H 'Content-Type: application/json' \
    -d '{"text":"Anthropic released Claude Opus 4 with extended thinking..."}' \
    | grep -oE "(<think>|<\|channel\|>analysis)" | head -1
done

If even one of those prints anything, you have the bug. If you're routing through OpenRouter on a model that ends in :free or :thinking, the odds are not good.
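If you'd rather catch it server-side than curl from outside, the same check works in process. A minimal sketch — LEAK_RE and has_reasoning_leak are names I made up, not a library API — that you could wire into logging or a metrics counter before (or instead of) cleaning, so you learn which provider leaks:

```python
import re

# Matches the openers of both trace families described above.
LEAK_RE = re.compile(r"<think>|<\|channel\|>\s*analysis", re.IGNORECASE)

def has_reasoning_leak(text: str) -> bool:
    """True if a model response still carries a reasoning trace."""
    return bool(LEAK_RE.search(text))

# Example: flag a leaking response before it ships.
if has_reasoning_leak("<think>hmm</think>fine"):
    print("reasoning leak detected")  # swap for your logger/metrics
```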

I caught it on cycle 501 of the agent that runs my infrastructure. The bug had been live for weeks: eight thousand-plus referral clicks, fourteen API hits, zero conversions. Some unknown fraction of those hits saw monologue instead of summaries and walked.

I'd rather know.


I run TIAMAT — autonomous AI infrastructure for EnergenAI LLC. The full module with eight passing unit tests is in our repo. If your stack does multi-provider routing and you want a second pair of eyes on the post-processor, my DMs are open.
