<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tushar Jaju</title>
    <description>The latest articles on DEV Community by Tushar Jaju (@tushar9802).</description>
    <link>https://dev.to/tushar9802</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3924950%2F83310de1-c7c7-4ca5-8394-2f1ec6f00afc.jpeg</url>
      <title>DEV Community: Tushar Jaju</title>
      <link>https://dev.to/tushar9802</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tushar9802"/>
    <language>en</language>
    <item>
      <title>I kept rewriting the same regex passes against LLM output. So I made a library.</title>
      <dc:creator>Tushar Jaju</dc:creator>
      <pubDate>Mon, 11 May 2026 12:28:29 +0000</pubDate>
      <link>https://dev.to/tushar9802/i-kept-rewriting-the-same-regex-passes-against-llm-output-so-i-made-a-library-539</link>
      <guid>https://dev.to/tushar9802/i-kept-rewriting-the-same-regex-passes-against-llm-output-so-i-made-a-library-539</guid>
      <description>&lt;p&gt;I've been working on a few LLM-based projects over the last year. &lt;a href="https://github.com/Tushar-9802/Sakhi" rel="noopener noreferrer"&gt;Sakhi&lt;/a&gt;, a Hindi voice-to-form pipeline for community health workers in India. A &lt;a href="https://github.com/Tushar-9802/Resume-parser" rel="noopener noreferrer"&gt;resume parser&lt;/a&gt; for engineering candidates. A couple of smaller things. Different domains, different models, different prompts.&lt;/p&gt;

&lt;p&gt;But there's a pattern: at the bottom of every pipeline, right before the model's output became "data we trust," I'd find the same kind of code.&lt;/p&gt;

&lt;p&gt;Strip markdown fences. Repair half-broken JSON. Trim runaway repetitions. Normalize Python &lt;code&gt;True&lt;/code&gt;/&lt;code&gt;False&lt;/code&gt;/&lt;code&gt;None&lt;/code&gt; to JSON booleans. Cut off the trailing "I hope this helps!" the model added after the actual answer.&lt;/p&gt;

&lt;p&gt;Every project had its own ad-hoc version of these. Slightly different regex, slightly different edge cases. The third time I copy-pasted a "strip &lt;code&gt;```json ... ```&lt;/code&gt;" cleaner across projects, I gave up and made it a library.&lt;/p&gt;

&lt;p&gt;That's &lt;code&gt;llmclean&lt;/code&gt;. Zero dependencies, pure standard library, three small utilities. v0.1.0 was on PyPI a couple of months ago. v0.2.0 just shipped, and it's the one I want to talk about — because what changed in this release is the part that makes the case for a separate library at all.&lt;/p&gt;

&lt;h2&gt;What v0.1.0 did&lt;/h2&gt;

&lt;p&gt;Three functions, total. That's the entire public API:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llmclean&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;strip_fences&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;enforce_json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trim_repetition&lt;/span&gt;

&lt;span class="nf"&gt;strip_fences&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;```

json&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Alice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;

```&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# → '{"name": "Alice"}'
&lt;/span&gt;
&lt;span class="nf"&gt;enforce_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Here you go: {&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ok&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: True, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;items&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: [1,2,3,]}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# → '{\n  "ok": true,\n  "items": [1, 2, 3]\n}'
&lt;/span&gt;
&lt;span class="nf"&gt;trim_repetition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The answer is 42. This is final. This is final.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# → 'The answer is 42. This is final.'
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each function returns the original input on failure (never raises), so it composes safely:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;enforce_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;trim_repetition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;strip_fences&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_output&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Stuck it on PyPI in March, copy-pasted the usage into Sakhi and the resume parser, moved on. Standard "I wrote a thing, hope it doesn't bite me" energy.&lt;/p&gt;

&lt;h2&gt;What production traffic taught me&lt;/h2&gt;

&lt;p&gt;Then I went back to those two projects and kept building. Over the next two months the library quietly broke in three different ways, each one on real data I was feeding it. Every one of those breaks became a v0.2.0 fix.&lt;/p&gt;

&lt;h3&gt;1. CRLF on Windows silently inverted fence detection&lt;/h3&gt;

&lt;p&gt;Output from Ollama running on my Windows machine came back with &lt;code&gt;\r\n&lt;/code&gt; line endings. The fence regex used &lt;code&gt;[ \t]*$&lt;/code&gt; as the trailing anchor. In Python's &lt;code&gt;re.MULTILINE&lt;/code&gt; mode, &lt;code&gt;$&lt;/code&gt; matches the position immediately before &lt;code&gt;\n&lt;/code&gt; — not before &lt;code&gt;\r\n&lt;/code&gt;. So the &lt;code&gt;\r&lt;/code&gt; sat between my whitespace class and the newline, and the regex silently failed to match the fence line.&lt;/p&gt;

&lt;p&gt;The nasty part: it failed in an &lt;em&gt;inverted&lt;/em&gt; way. The closing fence line (with no &lt;code&gt;\r\n&lt;/code&gt; after it) still matched the regex, so the function read it as an &lt;em&gt;unclosed opening fence&lt;/em&gt; and stripped it. Meanwhile the actual opening line survived as content. Output looked like garbled JSON wrapped in a leftover code fence.&lt;/p&gt;

&lt;p&gt;Fix: &lt;code&gt;[ \t]*\r?$&lt;/code&gt;. Three regexes, the same &lt;code&gt;\r?&lt;/code&gt; added to each.&lt;/p&gt;
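
&lt;p&gt;A minimal repro of the inversion, if you want to see it (illustrative patterns, not llmclean's actual source):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

# Without CRLF handling: in MULTILINE mode, $ matches only before \n,
# so the \r on Windows lines sits in the way and the fence line never matches.
old_fence = re.compile(r"^```[a-zA-Z]*[ \t]*$", re.MULTILINE)
# The fix: allow an optional \r before the line end.
new_fence = re.compile(r"^```[a-zA-Z]*[ \t]*\r?$", re.MULTILINE)

crlf_output = '```json\r\n{"name": "Alice"}\r\n```'

print(old_fence.findall(crlf_output))  # ['```'] - only the closing fence matches
print(new_fence.findall(crlf_output))  # ['```json\r', '```'] - both fences match
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;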

&lt;h3&gt;2. BOM at position 0 broke &lt;code&gt;json.loads&lt;/code&gt;&lt;/h3&gt;

&lt;p&gt;Some Windows file-IO round-trips and LLM client SDKs prepend a Byte Order Mark (&lt;code&gt;U+FEFF&lt;/code&gt;). Sakhi started hitting this when Whisper transcripts went through Windows file IO and emerged with a BOM at position 0. &lt;code&gt;json.loads&lt;/code&gt; sees an unexpected character at position 0 and bails immediately, before llmclean's strategy pipeline gets a chance to fix anything.&lt;/p&gt;

&lt;p&gt;Fix: &lt;code&gt;lstrip("\ufeff")&lt;/code&gt; at the entry point of both &lt;code&gt;strip_fences&lt;/code&gt; and &lt;code&gt;enforce_json&lt;/code&gt;.&lt;/p&gt;
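
&lt;p&gt;A minimal repro, standard library only (nothing here is llmclean-specific):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

payload = "\ufeff" + '{"ok": true}'  # a BOM smuggled in by a file round-trip

try:
    json.loads(payload)
except json.JSONDecodeError as err:
    print(err)  # Expecting value: line 1 column 1 (char 0)

# Dropping a leading BOM before anything else runs is the whole fix.
print(json.loads(payload.lstrip("\ufeff")))  # {'ok': True}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;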

&lt;h3&gt;3. Doubled-quote overruns when escape sequences leak&lt;/h3&gt;

&lt;p&gt;Occasionally I'd see model output like &lt;code&gt;{"key": ""value""}&lt;/code&gt;. Doubled quotes on both sides of a string, usually because an upstream stage involved Python triple-quoted f-strings, or an escape got applied twice somewhere.&lt;/p&gt;

&lt;p&gt;Sakhi's own pipeline has three regexes for this kind of overrun, but two of them have an edge case: they can corrupt legitimate empty-string values (&lt;code&gt;{"k": ""}&lt;/code&gt;) because the regex can't tell "overrun" from "intentional empty" without parser-level context. So in llmclean I only included the safe one — the form that &lt;em&gt;requires&lt;/em&gt; non-empty content between the doubled quotes. That handles the common case (&lt;code&gt;""text""&lt;/code&gt; → &lt;code&gt;"text"&lt;/code&gt;) and never touches legitimate empties.&lt;/p&gt;
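
&lt;p&gt;Roughly, the safe form looks like this (a sketch of the idea, not llmclean's exact pattern):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import re

# Collapse ""text"" to "text" only when there is content between the quotes,
# so a legitimate empty string is never mistaken for an overrun.
OVERRUN = re.compile(r'""([^"]+)""')

print(OVERRUN.sub(r'"\1"', '{"key": ""value""}'))  # {"key": "value"}
print(OVERRUN.sub(r'"\1"', '{"k": ""}'))           # {"k": ""} - untouched
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;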

&lt;p&gt;This kind of careful subtraction is the part I'm most happy about. It's less code than Sakhi has, but more correct.&lt;/p&gt;

&lt;h2&gt;The shape of the thing&lt;/h2&gt;

&lt;p&gt;llmclean lives in a small gap between bigger tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For schema validation: use &lt;code&gt;jsonschema&lt;/code&gt; or &lt;code&gt;pydantic&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;For re-prompting the model when output is bad: use &lt;code&gt;instructor&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;For constraining the model at generation time so it can't produce broken output: use &lt;code&gt;outlines&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;llmclean is the post-hoc cleanup pass. The thing you run &lt;em&gt;after&lt;/em&gt; the model has emitted text and &lt;em&gt;before&lt;/em&gt; you try to parse it. It composes with all of the above — it's not competing with them.&lt;/p&gt;
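
&lt;p&gt;For example, repair first, then validate (a sketch; the &lt;code&gt;Person&lt;/code&gt; model is made up for illustration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import json

from pydantic import BaseModel

from llmclean import enforce_json, strip_fences


class Person(BaseModel):  # hypothetical schema, purely for illustration
    name: str
    age: int


raw = '```json\n{"name": "Alice", "age": 30,}\n```'  # fenced, trailing comma

cleaned = enforce_json(strip_fences(raw))            # llmclean repairs the text
person = Person.model_validate(json.loads(cleaned))  # pydantic validates the shape
print(person)  # name='Alice' age=30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;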

&lt;p&gt;What I'm trying to keep true to while iterating:&lt;/p&gt;

&lt;p&gt;Functions never raise. Every public function returns the original input on failure, so it composes safely in pipelines that can't afford an exception path.&lt;/p&gt;
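
&lt;p&gt;Concretely, that contract means pass-through rather than exceptions (assuming, as stated above, that unrepairable input comes back unchanged):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from llmclean import enforce_json

garbage = "no JSON anywhere in this string"
# On unrepairable input you get the input back, never an exception,
# so a pipeline stage can always run it unconditionally.
assert enforce_json(garbage) == garbage
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;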

&lt;p&gt;Zero runtime dependencies. The standard library is enough for what this needs to do, and pulling in a dependency would force every downstream user to deal with version conflicts they didn't sign up for.&lt;/p&gt;

&lt;p&gt;Predictable behaviour. Same input, same output. No external state, no model calls, no fuzzy heuristics that change semantics silently between versions.&lt;/p&gt;

&lt;h2&gt;Try it, tell me where it breaks&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;llmclean
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What I'd find genuinely useful:&lt;/p&gt;

&lt;p&gt;If you try it on output from a model I haven't tested against and it fails, file an issue with the raw input. Real failure cases are where improvements come from; every fix in v0.2.0 started as one.&lt;/p&gt;

&lt;p&gt;If your project has its own LLM-output cleanup logic, I'd love to know what your edge cases are. The whole library exists because three of my projects had different ad-hoc versions of the same thing. There's probably a fourth and fifth class of failure I haven't seen.&lt;/p&gt;

&lt;p&gt;If you've solved this with &lt;code&gt;instructor&lt;/code&gt; or &lt;code&gt;guardrails&lt;/code&gt; or some other tool and want to argue I should have just used that — also welcome. Comparative honesty is more useful than marketing.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/Tushar-9802/llmclean" rel="noopener noreferrer"&gt;Tushar-9802/llmclean&lt;/a&gt;&lt;br&gt;
PyPI: &lt;a href="https://pypi.org/project/llmclean/" rel="noopener noreferrer"&gt;llmclean on PyPI&lt;/a&gt;&lt;br&gt;
Changelog: &lt;a href="https://github.com/Tushar-9802/llmclean/blob/main/CHANGELOG.md" rel="noopener noreferrer"&gt;CHANGELOG.md&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next version probably picks up a few more patterns I noted while inspecting MedScribe (a SOAP-note extraction project of mine): prompt-leakage stripping when the model echoes back parts of its own prompt, and section-level repetition truncation. Those are in the queue, currently driven by the same process — find them in real work first, port to the library second.&lt;/p&gt;

&lt;p&gt;If you've got a use case where llmclean would help, or one where it's already broken on you, the issue tracker is open.&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
