Tushar Jaju

I kept rewriting the same regex passes against LLM output. So I made a library.

I've been working on a few LLM-based projects over the last year. Sakhi, a Hindi voice-to-form pipeline for community health workers in India. A resume parser for engineering candidates. A couple of smaller things. Different domains, different models, different prompts.

But there's a pattern: at the bottom of every pipeline, right before the model's output became "data we trust," I'd find the same kind of code.

Strip markdown fences. Repair half-broken JSON. Trim runaway repetitions. Normalize Python True/False/None to JSON booleans. Cut off the trailing "I hope this helps!" the model added after the actual answer.

Every project had its own ad-hoc version of these. Slightly different regex, slightly different edge cases. The third time I copy-pasted a "strip the ```json ... ``` fences" cleaner across projects, I gave up and made it a library.

That's llmclean. Zero dependencies, pure standard library, three small utilities. v0.1.0 was on PyPI a couple of months ago. v0.2.0 just shipped, and it's the one I want to talk about — because what changed in this release is the part that makes the case for a separate library at all.

What v0.1.0 did

Three functions, total. That's the entire public API:

from llmclean import strip_fences, enforce_json, trim_repetition

strip_fences('```json\n{"name": "Alice"}\n```')
# → '{"name": "Alice"}'

enforce_json('Here you go: {"ok": True, "items": [1,2,3,]}')
# → '{\n  "ok": true,\n  "items": [1, 2, 3]\n}'

trim_repetition("The answer is 42. This is final. This is final.")
# → 'The answer is 42. This is final.'

Each function returns the original input on failure (never raises), so it composes safely:

data = enforce_json(trim_repetition(strip_fences(raw_output)))

Stuck it on PyPI in March, copy-pasted the usage into Sakhi and the resume parser, moved on. Standard "I wrote a thing, hope it doesn't bite me" energy.

What production traffic taught me

Then I went back to those two projects and kept building. Over the next two months the library quietly broke in three different ways, each time on real data I was feeding it. Every one of those breaks became a v0.2.0 fix.

1. CRLF on Windows silently inverted fence detection

Output from Ollama running on my Windows machine came back with \r\n line endings. The fence regex used [ \t]*$ as the trailing anchor. In Python's re.MULTILINE mode, $ matches the position immediately before \n — not before \r\n. So the \r sat between my whitespace class and the newline, and the regex silently failed to match the fence line.

The nasty part: it failed in an inverted way. The closing fence line (with no \r\n after it) still matched the regex, so the function read it as an unclosed opening fence and stripped it. Meanwhile the actual opening line survived as content. Output looked like garbled JSON wrapped in a leftover code fence.

Fix: [ \t]*\r?$. Three regexes, the same tiny change to each.
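
To make the failure concrete, here's a minimal sketch of a fence-line pattern in the same shape as the one described above (not llmclean's actual regex), showing how the optional carriage return changes the match on CRLF output:

import re

# Sketch only: the trailing anchor before and after the fix.
fence_old = re.compile(r"^```[\w-]*[ \t]*$", re.MULTILINE)
fence_new = re.compile(r"^```[\w-]*[ \t]*\r?$", re.MULTILINE)

crlf_output = '```json\r\n{"ok": true}\r\n```\r\n'

print(bool(fence_old.search(crlf_output)))  # False: the \r sits between [ \t]* and the line end
print(bool(fence_new.search(crlf_output)))  # True: \r? absorbs the carriage return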

2. BOM at position 0 broke json.loads

Some Windows file-IO round-trips and LLM client SDKs prepend a Byte Order Mark (U+FEFF). Sakhi started hitting this when Whisper transcripts went through Windows file IO and emerged with a BOM at position 0. json.loads sees an unexpected character at position 0 and bails immediately — before any of llmclean's strategy pipeline got a chance to fix anything.

Fix: lstrip("\ufeff") at the entry point of both strip_fences and enforce_json.
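
The fix is small enough to show in isolation (a sketch of the idea, not the library's code path):

import json

raw = '\ufeff{"name": "Alice"}'  # BOM-prefixed text, e.g. after a Windows file-IO round-trip

try:
    json.loads(raw)
except json.JSONDecodeError:
    pass  # json.loads rejects the BOM at position 0

print(json.loads(raw.lstrip("\ufeff")))  # {'name': 'Alice'}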

3. Doubled-quote overruns when escape sequences leak

Occasionally I'd see model output like {"key": ""value""}. Doubled quotes on both sides of a string, usually because an upstream stage involved Python triple-quoted f-strings, or an escape got applied twice somewhere.

Sakhi's own pipeline has three regexes for this kind of overrun, but two of them have an edge case: they can corrupt legitimate empty-string values ({"k": ""}) because the regex can't tell "overrun" from "intentional empty" without parser-level context. So in llmclean I only included the safe one — the form that requires non-empty content between the doubled quotes. That handles the common case (""text"" → "text") and never touches legitimate empties.
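
For reference, the safe form is roughly this shape (a sketch, not llmclean's exact regex):

import re

# Collapse ""text"" to "text" only when there is content between the doubled
# quotes, so a legitimate empty string ("") is never touched.
DOUBLED = re.compile(r'""([^"]+)""')

print(DOUBLED.sub(r'"\1"', '{"key": ""value""}'))  # {"key": "value"}
print(DOUBLED.sub(r'"\1"', '{"key": ""}'))         # unchanged: {"key": ""}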

This kind of careful subtraction is the part I'm most happy about. It's less code than Sakhi has, but more correct.

The shape of the thing

llmclean lives in a small gap between bigger tools:

  • For schema validation: use jsonschema or pydantic.
  • For re-prompting the model when output is bad: use instructor.
  • For constraining the model at generation time so it can't produce broken output: use outlines.

llmclean is the post-hoc cleanup pass. The thing you run after the model has emitted text and before you try to parse it. It composes with all of the above — it's not competing with them.
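
In practice that means llmclean sits in front of whichever validator you already use. A sketch, assuming pydantic v2 and a made-up Candidate schema (neither is part of llmclean):

from pydantic import BaseModel

from llmclean import strip_fences, enforce_json, trim_repetition

class Candidate(BaseModel):  # hypothetical schema, for illustration only
    name: str
    years_experience: int

raw_output = '```json\n{"name": "Alice", "years_experience": 4}\n```'

cleaned = enforce_json(trim_repetition(strip_fences(raw_output)))  # post-hoc text cleanup
candidate = Candidate.model_validate_json(cleaned)                 # schema validation
print(candidate)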

What I'm trying to keep true to while iterating:

Functions never raise. Every public function returns the original input on failure, so it composes safely in pipelines that can't afford an exception path.

Zero runtime dependencies. The standard library is enough for what this needs to do, and pulling in a dependency would force every downstream user to deal with version conflicts they didn't sign up for.

Predictable behaviour. Same input, same output. No external state, no model calls, no fuzzy heuristics that change semantics silently between versions.
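
On the first point, the contract is easy to check (illustrative, relying only on the behaviour described above):

from llmclean import enforce_json

broken = "Sorry, I can't produce JSON for that."

# No exception path: when nothing parseable can be recovered, the original
# string comes back unchanged, and a downstream json.loads or schema check
# decides what to do with it.
assert enforce_json(broken) == broken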

Try it, tell me where it breaks

pip install llmclean

What I'd find genuinely useful:

If you try it on output from a model I haven't tested against and it fails, file an issue with the raw input. Real failure cases are what improvements come from — every fix in v0.2.0 came from one.

If your project has its own LLM-output cleanup logic, I'd love to know what your edge cases are. The whole library exists because three of my projects had different ad-hoc versions of the same thing. There's probably a fourth and fifth class of failure I haven't seen.

If you've solved this with instructor or guardrails or some other tool and want to argue I should have just used that — also welcome. Comparative honesty is more useful than marketing.

GitHub: Tushar-9802/llmclean
PyPI: llmclean on PyPI
Changelog: CHANGELOG.md

Next version probably picks up a few more patterns I noted while inspecting MedScribe (a SOAP-note extraction project of mine): prompt-leakage stripping when the model echoes back parts of its own prompt, and section-level repetition truncation. Those are in the queue, currently driven by the same process — find them in real work first, port to the library second.

If you've got a use case where llmclean would help, or one where it's already broken on you, the issue tracker is open.
