Tushar Jaju

Posted on Jun 21

I almost added an em-dash remover to my LLM library. Then I tested whether local models even produce em-dashes.

#ai #llm #opensource #python

llmclean is a tiny zero-dependency library I maintain for cleaning the noise out of raw LLM output. v0.2.0 was a "what production traffic taught me" release — every fix came from a real break in one of my own pipelines.

0.3.0 is a different kind of release. This time I had a list of features I was fairly sure I needed, sourced from what people keep complaining about and re-implementing by hand: strip the <think> reasoning blocks, kill the em-dashes and smart quotes, remove the zero-width characters, flatten the markdown for text-to-speech.

Before writing any of it, I did something I should have done the first time: I checked whether the models I care about actually produce that mess. I ran eight generative prompts across five local models — Llama 3.1, Gemma 4, Qwen 2.5, DeepSeek-R1, Mistral, all 7–8B instruct — and measured what came out. Forty generations, one diagnostic pass each.

Three of my assumptions were wrong.

1. Local models barely produce the typography mess at all

The em-dash thing is a real phenomenon. It got loud enough that OpenAI shipped a setting to suppress em-dashes in ChatGPT. There are standalone libraries that do nothing but replace fancy punctuation with ASCII. So I assumed I'd see plenty of it.

Across 40 generations from local models, I saw zero smart quotes, zero ellipsis characters, zero non-breaking spaces, zero ligatures, zero zero-width characters. I even wrote a prompt that explicitly asked the model to quote someone saying "hello", use a dash for emphasis, and trail off with an ellipsis. The models gave me straight quotes and three literal dots: ..., not ….

The typography mess that everyone writes cleanup code for is, as far as I can measure, a frontier cloud-model trait. ChatGPT, Claude, and Gemini emit it. A 7B instruct model running on your laptop mostly doesn't.

That doesn't mean the feature is useless — people paste cloud output into pipelines constantly, and that's exactly where this stuff lands. But it changed how I built and tested it. normalize_typography and strip_invisibles are scoped, in their docstrings and tests, as tools for pasted cloud output, and they're tested against synthetic fixtures shaped like ChatGPT output — not against my local models, because my local models can't produce the inputs.

2. The fullwidth-punctuation idea was backwards

I had a note to myself that Qwen, with its Chinese-heavy training, would emit fullwidth punctuation — ，：；（） — and that I'd need to normalize it inside JSON. A whole class of cleanup the library didn't cover.

When I actually prompted Qwen (and the others) with Chinese text, here's what happened: the Chinese content came through fine, sitting inside JSON string values with completely normal ASCII : and " structure. Fullwidth punctuation showed up only when I asked for Chinese prose — 北京是中国的首都，拥有丰富的历史文化遗产 — where the ， and 。 are correct, not noise.

So fullwidth normalization isn't a JSON-repair problem at all. It's a prose-normalization problem, and a niche one. It ended up as an opt-in, off-by-default category on normalize_typography, not a JSON strategy. The synthetic case where fullwidth punctuation breaks JSON parsing? enforce_json does have that gap — but no model I have actually emits it, so I didn't build for it.

3. On Ollama, the `<think>` tags never leak

This was the one that nearly sent me building against the wrong input. DeepSeek-R1 is a reasoning model; it thinks in a <think>...</think> block before answering. The obvious cleanup is to strip that block.

Except when I ran DeepSeek-R1 through Ollama and looked at the response, there were no tags. The reasoning was just gone from the text. Ollama (current versions) parses the <think> block server-side and hands it back in a separate thinking field — on both the native API and the OpenAI-compatible one. A consumer using Ollama never sees the tags inline, so strip_reasoning_trace would be a no-op for them.

The tags are real, though. They leak on llama.cpp directly, on vLLM unless you pass --reasoning-parser, on raw transformers, on LM Studio, and on most hosted aggregators. So I validated the stripper a different way: I captured a genuine DeepSeek-R1 reasoning trace out of Ollama's thinking field, re-wrapped it in the inline <think>...</think> format those other backends emit, and confirmed the function recovers the answer exactly — including the DeepSeek quirk where the opening tag lives in the chat template and only a trailing </think> comes back.

What shipped

Five new functions, all pure standard library, all scoped by what the experiment showed:

from llmclean import strip_reasoning_trace, strip_preamble
from llmclean import strip_invisibles, normalize_typography, strip_markdown

strip_reasoning_trace("<think>let me work it out</think>\nParis.")   # → 'Paris.'
strip_preamble("Sure! Here is the answer: 42")                       # → '42'
strip_invisibles("hello")                               # → 'hello'
normalize_typography("“It’s fine”—really…") # → '"It\'s fine"-really...'
strip_markdown("# Title\n\n- **bold** point")                        # → 'Title\n\nbold point'

strip_markdown and the fence handling are validated against real local captures, because markdown is the one thing every model emits constantly — it showed up on every "explain this with headers and bullets", every code answer, every table.

There's also a correctness fix in this release that has nothing to do with the sweep. The old Python-literal repair in enforce_json did a blind find-and-replace of True/False/None, which meant {"note": "set the flag to True"} came out as {"note": "set the flag to true"} — it corrupted the words inside string values, and inside string keys too. A regex can't tell a bare True token from the letters True inside a quote. The fix is a single pass that tracks whether it's inside a string and only rewrites literals outside one.

The actual lesson

I write cleanup code for a living, more or less, and I still almost built three features against an imagined version of model output instead of the real one. The sweep took an afternoon. It killed one feature's premise, demoted another to opt-in, and re-scoped a third — and it left me able to document when each function actually helps, instead of implying it helps everywhere.

If you're post-processing LLM output, it's worth running the cheap experiment: a handful of prompts across the models you actually deploy, and a look at what literally comes out. The mess you're cleaning may not be the mess you think it is.

llmclean 0.3.0 is on PyPI and GitHub. Eight functions, zero dependencies, still fits in your head.

DEV Community

I almost added an em-dash remover to my LLM library. Then I tested whether local models even produce em-dashes.

1. Local models barely produce the typography mess at all

2. The fullwidth-punctuation idea was backwards

3. On Ollama, the `<think>` tags never leak

What shipped

The actual lesson

Top comments (0)

1. Local models barely produce the typography mess at all

2. The fullwidth-punctuation idea was backwards

3. On Ollama, the <think> tags never leak

What shipped

The actual lesson

3. On Ollama, the `<think>` tags never leak