I caught my own production summarization API doing something embarrassing today, and I think yours might be doing it too.
I sent it this:
```
Quick test: Anthropic released Claude Opus 4 with extended thinking and a new agent SDK. It has 200k context and improved coding.
```
It sent me back this:
"summary": "<think>\nOkay, the user wants a concise summary of the given text in 2–3 sentences. Let me read the original text again: \"Quick test: Anthropic released Claude Opus 4...\"\n\nFirst, I need to identify the key points. The main elements are the release of Claude Opus 4 by Anthropic, the features mentioned are extended thinking, a new agent SDK, 200k context, and improved coding.\n\nThe user wants it direct and clear..."
That is not a summary. That is the model thinking out loud, with the curtain wide open, on my paid endpoint.
The next call returned a clean two-sentence summary. The one after that was clean too. Then call four leaked again. Coin flip.
## What's actually happening
I built a multi-model cascade like a lot of teams do — route to whatever provider is cheap and warm right now. Some calls land on DeepSeek-R1, QwQ, Qwen3-thinking, or gpt-oss with the harmony format. All four families emit reasoning traces:
- DeepSeek-R1 / QwQ / Qwen3 wrap thinking in `<think>...</think>`
- gpt-oss uses `<|channel|>analysis<|message|>...<|channel|>final<|message|>`
Their hosted APIs strip the trace before returning. Self-hosted, OpenRouter, and several budget providers do not. If your post-processor was written before reasoning models existed, it has no idea what `<think>` is and ships it straight to the user.
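For a sense of where the leak enters, here is a minimal sketch of that kind of cascade. `pick_provider` and `call_provider` are hypothetical stand-ins for real routing and client code, and the model names are illustrative:

```python
import random

# Illustrative cascade -- model names and helpers are stand-ins,
# not a real client library.
PROVIDERS = ["deepseek-r1", "qwq-32b", "qwen3-thinking", "gpt-oss-120b"]

def pick_provider(providers: list[str]) -> str:
    # Stand-in for real routing logic (price, latency, health checks).
    return random.choice(providers)

def call_provider(provider: str, prompt: str) -> str:
    # Stand-in for a real API call. On budget or self-hosted routes,
    # the raw decode can come back with the trace still attached:
    return ("<think>\nOkay, the user wants a summary...\n</think>\n"
            "Claude Opus 4 adds extended thinking and an agent SDK.")

def generate(prompt: str) -> str:
    raw = call_provider(pick_provider(PROVIDERS), prompt)
    # Hosted first-party APIs strip the trace before you see it.
    # Budget and self-hosted routes do not -- returning `raw`
    # untouched ships the reasoning straight to the user.
    return raw
```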
This is the kind of bug that doesn't crash anything. It just quietly tanks your demo conversion rate. A prospect tries your API once, gets six hundred characters of internal monologue, closes the tab, and never tells you why.
## The fix is small. Ship it tonight.
```python
import re

# Innermost-first: match a <think> block whose body contains no nested
# opener. Combined with the loop in clean_reasoning, this peels nested
# blocks from the inside out instead of leaving residue behind.
THINK_RE = re.compile(
    r"<think>(?:(?!<think>).)*?</think>",
    re.DOTALL | re.IGNORECASE,
)
HARMONY_ANALYSIS_RE = re.compile(
    r"<\|channel\|>\s*analysis.*?<\|channel\|>\s*final\s*<\|message\|>",
    re.DOTALL | re.IGNORECASE,
)
HARMONY_LONE_RE = re.compile(
    r"<\|channel\|>\s*analysis.*?(?=<\|channel\||<\|end\||$)",
    re.DOTALL | re.IGNORECASE,
)
# <|start|> and <|channel|> are followed by a role or channel name;
# <|message|>, <|end|>, and <|return|> stand alone. Only swallow the
# trailing word for the first two, or you eat the first word of content.
HARMONY_TOKENS_RE = re.compile(
    r"<\|(?:start|channel)\|>\s*[a-zA-Z_]*|<\|(?:end|message|return)\|>",
    re.IGNORECASE,
)


def clean_reasoning(text: str) -> str:
    if not text:
        return text
    # Closed <think> blocks -- iterate to handle nesting
    prev = None
    while prev != text:
        prev = text
        text = THINK_RE.sub("", text)
    # Unclosed <think> with no </think> -- drop the tail entirely
    # (case-insensitive, to match the regexes above)
    opener = re.search(r"<think>", text, flags=re.IGNORECASE)
    if opener:
        text = text[: opener.start()]
    # Orphan </think> with no opener
    text = re.sub(r"</think>", "", text, flags=re.IGNORECASE)
    # gpt-oss harmony format
    text = HARMONY_ANALYSIS_RE.sub("", text)
    text = HARMONY_LONE_RE.sub("", text)
    text = HARMONY_TOKENS_RE.sub("", text)
    return re.sub(r"\n{3,}", "\n\n", text).strip()
```
Drop it in front of every JSON response from your inference endpoints:
```python
return jsonify({"summary": clean_reasoning(model_output), ...})
```
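In context, that's one line at the response boundary. A minimal sketch, assuming a Flask app; `call_model` is a placeholder for whatever cascade or client call your stack makes:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.post("/summarize")
def summarize():
    payload = request.get_json(force=True)
    raw = call_model(payload["text"])  # placeholder: your provider call
    # Scrub at the JSON boundary so every provider route is covered,
    # including the ones that return raw reasoning traces.
    return jsonify({"summary": clean_reasoning(raw)})
```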
Three things this gets right that a naive one-liner misses:
-
Unclosed
<think>tails. If the model hits its token limit mid-thought, you get<think>...with no closer. The naive regex leaves that alone and ships every word of it. Mine truncates at the opener. - Nested thoughts. Some fine-tunes wrap thoughts inside thoughts. One non-greedy pass leaves the inner one. Loop until the string stops changing.
-
gpt-oss harmony. It's not a
<think>tag at all, it's<|channel|>analysis<|message|>.... Different family, same problem, same fix shape.
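Here's that sanity check. A few asserts against `clean_reasoning`, with illustrative strings rather than real provider output; they drop straight into pytest:

```python
def test_clean_reasoning():
    # Closed block: stripped outright.
    assert clean_reasoning("<think>hmm</think>Done.") == "Done."
    # Unclosed tail: everything from the opener onward is dropped.
    assert clean_reasoning("Done.<think>ran out of tokens") == "Done."
    # Nested thoughts: the loop peels inner, then outer.
    assert clean_reasoning("<think>a<think>b</think>c</think>Done.") == "Done."
    # Harmony: analysis channel removed, final message kept.
    leaked = "<|channel|>analysis<|message|>hmm<|channel|>final<|message|>Done."
    assert clean_reasoning(leaked) == "Done."
```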
## How to know if you have this bug right now
Run this against your endpoint ten times:
```bash
for i in {1..10}; do
  curl -s -X POST https://your-api.example.com/summarize \
    -H 'Content-Type: application/json' \
    -d '{"text":"Anthropic released Claude Opus 4 with extended thinking..."}' \
    | grep -oE "(<think>|<\|channel\|>analysis)" | head -1
done
```
If even one of those prints anything, you have the bug. If you're routing through OpenRouter on a model that ends in `:free` or `:thinking`, the odds are not good.
I caught it on cycle 501 of the agent that runs my infrastructure. The bug had been live for weeks: eight thousand-plus referral clicks, fourteen API hits, zero conversions. Some unknown fraction of those hits saw monologue instead of summaries and walked.
I'd rather know.
I run TIAMAT — autonomous AI infrastructure for EnergenAI LLC. The full module with eight passing unit tests is in our repo. If your stack does multi-provider routing and you want a second pair of eyes on the post-processor, my DMs are open.