DEV Community

Mukunda Rao Katta
Mukunda Rao Katta

Posted on

Strip Thinking Tags From LLM Output Before Sending to Users

Some models emit extended thinking in <think> tags before the actual answer. The thinking is useful for your observability stack. It is not useful for users. You do not want to send a wall of internal reasoning to a customer who asked a simple question.

You are filtering it manually with re.sub(r'<think>.*?</think>', '', output, flags=re.DOTALL). This works until the model uses <thinking>, or <thought>, or forgets to close the tag, or nests tags, or the output spans multiple content blocks.

llm-think-tag-strip handles all of these cases.


The Shape of the Fix

from llm_think_tag_strip import ThinkTagStrip, ExtractResult

strip = ThinkTagStrip()

raw_output = """<think>
The user wants to know the refund policy. Let me look at what I know about this...
Actually, the standard policy is 30 days with receipt.
</think>

Our refund policy is 30 days with receipt. Bring your original purchase receipt to any store location."""

result = strip.extract(raw_output)

print(result.clean)   # "Our refund policy is 30 days with receipt..."
print(result.thinking)  # "The user wants to know the refund policy..."
print(result.had_thinking)  # True
Enter fullscreen mode Exit fullscreen mode

result.clean is what you send to the user. result.thinking is what you log for your observability stack. Both in one call.


What It Does NOT Do

llm-think-tag-strip does not interact with the Anthropic API's extended thinking feature directly. The Anthropic API returns thinking as a separate thinking content block when you enable it via "thinking": {"type": "enabled", "budget_tokens": N}. For that API path, the thinking is already separate and you do not need tag stripping.

Tag stripping is for models that embed thinking in their text output using <think> tags, such as Qwen, DeepSeek, and others that follow that convention. It is also useful as a fallback for models that sometimes emit thinking-style reasoning in their text output.

It does not preserve the position of the thinking in the output. If a model has thinking in the middle of its response (text <think> reasoning </think> more text), result.clean will have the thinking block removed, which may leave an awkward gap. Most models put thinking at the start.


Inside the Library

The tag patterns to detect:

import re

THINK_PATTERNS = [
    re.compile(r'<think>(.*?)</think>', re.DOTALL | re.IGNORECASE),
    re.compile(r'<thinking>(.*?)</thinking>', re.DOTALL | re.IGNORECASE),
    re.compile(r'<thought>(.*?)</thought>', re.DOTALL | re.IGNORECASE),
    re.compile(r'<reasoning>(.*?)</reasoning>', re.DOTALL | re.IGNORECASE),
]

# Unclosed tag pattern (model started thinking but didn't close it)
UNCLOSED_PATTERN = re.compile(
    r'^<think(?:ing|ht)?\s*>(.*?)$',
    re.DOTALL | re.IGNORECASE,
)
Enter fullscreen mode Exit fullscreen mode

The extraction:

class ThinkTagStrip:
    def extract(self, text: str) -> ExtractResult:
        thinking_blocks = []
        clean = text

        for pattern in THINK_PATTERNS:
            matches = pattern.findall(clean)
            thinking_blocks.extend(matches)
            clean = pattern.sub('', clean)

        # Handle unclosed tags
        unclosed = UNCLOSED_PATTERN.match(clean)
        if unclosed:
            thinking_blocks.append(unclosed.group(1))
            clean = ""

        # Normalize whitespace
        clean = clean.strip()
        thinking = "\n\n".join(t.strip() for t in thinking_blocks)

        return ExtractResult(
            clean=clean,
            thinking=thinking,
            had_thinking=bool(thinking_blocks),
            original=text,
        )
Enter fullscreen mode Exit fullscreen mode

For Anthropic's text content blocks specifically:

def strip_from_content_blocks(self, content: list[dict]) -> list[dict]:
    result = []
    for block in content:
        if block.get("type") == "text":
            extracted = self.extract(block["text"])
            if extracted.clean:
                result.append({"type": "text", "text": extracted.clean})
            # Thinking is discarded from the response; log it separately
        else:
            result.append(block)
    return result
Enter fullscreen mode Exit fullscreen mode

When to Use It

Use it when you are using models that embed thinking in text output. Qwen3, DeepSeek, and several open-weight models follow the <think> convention. If you are routing to these models, you need tag stripping before sending output to users.

Use it as a defensive layer even for models that normally do not emit thinking tags. Some models emit thinking-style content when prompted in certain ways. Strip as a no-op if no tags are present, so you never accidentally send a <think> block to a user.

Use it for logging. The result.thinking from each response is valuable observability data. Log it to your step log. It shows you what the model was reasoning about before it gave the final answer, which is useful for debugging bad responses.

Skip it if you are using Anthropic's native extended thinking API. When you set "thinking": {"type": "enabled"}, the API separates thinking into its own content blocks automatically. You access them via block.type == "thinking" directly, without text parsing.


Install

pip install git+https://github.com/MukundaKatta/llm-think-tag-strip

# Or from PyPI
pip install llm-think-tag-strip
Enter fullscreen mode Exit fullscreen mode
from llm_think_tag_strip import ThinkTagStrip
from agent_step_log import StepLog

strip = ThinkTagStrip()

def process_response(response_text: str, run_id: str, step: int) -> str:
    result = strip.extract(response_text)

    if result.had_thinking:
        # Log thinking for observability
        logger.debug(
            "model_thinking",
            run_id=run_id,
            step=step,
            thinking_length=len(result.thinking),
            thinking_preview=result.thinking[:200],
        )

    # Return clean text to the caller
    return result.clean

# In agent loop
for block in response.content:
    if block.type == "text":
        clean = process_response(block.text, run_id=run_id, step=step_counter)
        # Use clean text for: sending to user, routing decisions, further processing
Enter fullscreen mode Exit fullscreen mode

Sibling Libraries

Library What it solves
agent-step-log Log extracted thinking alongside step records
agent-guard-rails Composable output filters applied after think-tag stripping
llm-pretty-error Normalize provider errors including thinking-model errors
agenttap Wire-level capture including raw responses with thinking tags
agent-decision-log Log reasoning and decisions from the clean output

The output processing stack: llm-think-tag-strip first (remove thinking), then agent-guard-rails (validate and filter clean output), then agent-decision-log (record what the model decided to do).


What's Next

Custom tag support: ThinkTagStrip(extra_tags=["<cot>", "<plan>"]) for models that use non-standard thinking tags. The library would add those patterns to its search list.

Structured thinking extraction: some models emit structured thinking (numbered steps, explicit hypotheses) inside think tags. A parse_structured_thinking() method that extracts the structure into a list of reasoning steps rather than a single string.

Thinking budget monitoring: when the model emits thinking, count the approximate tokens in the thinking block and log it. Helps quantify how much thinking is happening per request and whether the budget is being used effectively.


Built as part of the agent-stack family: composable Python primitives for production LLM agents.

Top comments (0)