<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Tushar Jaju</title>
    <description>The latest articles on DEV Community by Tushar Jaju (@tushar9802).</description>
    <link>https://dev.to/tushar9802</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3924950%2F83310de1-c7c7-4ca5-8394-2f1ec6f00afc.jpeg</url>
      <title>DEV Community: Tushar Jaju</title>
      <link>https://dev.to/tushar9802</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tushar9802"/>
    <language>en</language>
    <item>
      <title>Building Sakhi: Hindi Voice-to-Form for India's ASHA Workers, Solo in Six Weeks</title>
      <dc:creator>Tushar Jaju</dc:creator>
      <pubDate>Tue, 19 May 2026 14:27:01 +0000</pubDate>
      <link>https://dev.to/tushar9802/building-sakhi-hindi-voice-to-form-for-indias-asha-workers-solo-in-six-weeks-2685</link>
      <guid>https://dev.to/tushar9802/building-sakhi-hindi-voice-to-form-for-indias-asha-workers-solo-in-six-weeks-2685</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt; — Six-week solo build of a Hindi voice-to-form pipeline for India's ~1 million community health workers. Two deployment modes: a workstation path with Whisper + Gemma 4 E4B on Ollama, and a fully offline on-device path running Gemma 4 E2B INT4 on the Cactus SDK on Android. Submitted to Kaggle's Gemma 4 Good Hackathon. Source on GitHub, fine-tune on Ollama.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;India's roughly 1 million Accredited Social Health Activists (ASHAs) handle the last clinical mile for maternal and child health. They conduct 50+ million home visits a year, covering vitals, symptoms, counselling, and danger-sign assessment. Every visit still ends with a paper form filled in from memory and physically carried to the Primary Health Center on the next clinic day.&lt;/p&gt;

&lt;p&gt;Danger signs that &lt;em&gt;were&lt;/em&gt; observed — preeclampsia, postpartum hemorrhage, neonatal distress — sometimes never reach the clinical system in time for intervention.&lt;/p&gt;

&lt;p&gt;Two compounding constraints make this hard to fix with conventional tooling:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Hindi voice, often in regional dialects.&lt;/strong&gt; Cloud STT is unreliable on rural-clinical Hindi (published benchmarks: 27–70%+ WER, deletion-dominant — numbers and symptoms silently drop).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Connectivity is intermittent.&lt;/strong&gt; Airplane-mode operation cannot be a fallback. It must be the default.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;

&lt;p&gt;Two deployment modes for how ASHAs actually work — a workstation in the health center, and the phone in the field:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Workstation path (PHC, GPU):
[Hindi Audio] → Whisper-Large CT2 → Hindi Normalization → Gemma 4 E4B (function calling)
                                                            ├── extract_form()
                                                            ├── flag_danger_sign()
                                                            └── issue_referral()

On-device path (Android, no network):
[Hindi Text] → Hindi Normalization → Visit-type detect → Gemma 4 E2B INT4 on Cactus
                                                          ├── extract_form
                                                          └── detect_danger
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Workstation mode handles voice: a phone uploads audio to a shared PC at the sub-centre, Whisper-Large-V2 Hindi (via CTranslate2) transcribes it, and Gemma 4 E4B Q4_K_M on Ollama extracts the structured form with native function calling. End-to-end &lt;strong&gt;15–25 seconds&lt;/strong&gt; on an RTX 5070 Ti.&lt;/p&gt;

&lt;p&gt;Field mode runs the full pipeline (normalize → detect visit type → extract form → flag danger signs) entirely on-device. End-to-end &lt;strong&gt;320.7s&lt;/strong&gt; on a OnePlus 11R (Snapdragon 8+ Gen 1), zero network. The on-device LLM does Hindi text → form; voice routes to the workstation when WiFi returns (more on why below).&lt;/p&gt;

&lt;h2&gt;
  
  
  The hardest engineering call: leaving on-device voice OUT
&lt;/h2&gt;

&lt;p&gt;I wanted on-device voice-to-form. A phone, no laptop, no network — that's the cleanest pitch. I pulled it from the build instead.&lt;/p&gt;

&lt;p&gt;Cactus SDK ships multilingual Whisper INT4 for transcription — no Hindi-specific checkpoint. The published numbers are bad:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;27% WER best-case on rural Hindi&lt;/li&gt;
&lt;li&gt;70%+ on clinical content&lt;/li&gt;
&lt;li&gt;Error profile is &lt;strong&gt;deletion-dominant&lt;/strong&gt; — numbers and symptoms silently drop while filler words survive&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A missed BP reading is a missed referral. A demo where Sakhi says "BP normal" because the actual &lt;code&gt;155/100&lt;/code&gt; was deleted during transcription is exactly the failure mode an ASHA cannot catch in the field.&lt;/p&gt;

&lt;p&gt;So voice routes to the workstation where Whisper-Large-V2 Hindi runs. The on-device LLM handles Hindi text → form for the case where an ASHA types a quick note offline. Field mode also captures raw audio offline and syncs to the workstation when WiFi returns.&lt;/p&gt;

&lt;p&gt;This was the most uncomfortable call of the build. The submission video shows raw on-device JSON output from text input instead of faking voice.&lt;/p&gt;

&lt;h2&gt;
  
  
  Anti-hallucination: model extracts, Python decides
&lt;/h2&gt;

&lt;p&gt;The hardest problem isn't getting Gemma to talk about a transcript. It's getting it to stop &lt;em&gt;inventing&lt;/em&gt;. Early prototypes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hallucinated patient names from generic forms of address (&lt;code&gt;दीदी&lt;/code&gt; / &lt;code&gt;बहन&lt;/code&gt; — Hindi for "elder sister" / "sister", used informally for any woman regardless of relation).&lt;/li&gt;
&lt;li&gt;Invented BP readings on routine visits that never mentioned vitals.&lt;/li&gt;
&lt;li&gt;Turned counselling utterances ("eat iron-rich food, drink plenty of water") into "danger signs."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pattern that stuck: &lt;strong&gt;Gemma proposes evidence; Python decides what counts.&lt;/strong&gt; The LLM extracts only what was &lt;em&gt;said&lt;/em&gt; — verbatim utterances, structured under the schema. Validation, range-checks, deduplication, blocklist filtering: none of that runs inside the prompt. It runs in code, against the transcript, after extraction.&lt;/p&gt;

&lt;p&gt;Six layers of validation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Evidence length filter&lt;/strong&gt; — danger signs with under 10-character evidence are dropped.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generic ASHA phrase blocklist&lt;/strong&gt; — boilerplate (&lt;code&gt;कोई तकलीफ़ हो तो फ़ोन कर दीजिए&lt;/code&gt; / "call me if there's any problem") filtered.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Normal-value filter&lt;/strong&gt; — signs citing benign values (&lt;code&gt;110/70&lt;/code&gt;, &lt;code&gt;बिल्कुल ठीक&lt;/code&gt; / "totally fine", &lt;code&gt;सामान्य&lt;/code&gt; / "normal") stripped.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transcript grounding&lt;/strong&gt; — evidence must appear verbatim in the transcript.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deduplication&lt;/strong&gt; across overlapping danger signs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Form validation&lt;/strong&gt; — strips invented patient names (दीदी/बहन patterns), default ages, phantom lab results; range checks on BP (60–250 / 30–150), Hb (3–20), weight (1–200), gestational weeks (1–45).&lt;/li&gt;
&lt;/ol&gt;
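&lt;p&gt;A minimal sketch of how the six layers above could be chained in plain Python. This is an illustration, not Sakhi's actual code: the function names, field names, and the single-entry blocklist are hypothetical, deduplication is simplified to exact evidence matches, and only the thresholds and ranges are taken from the list.&lt;/p&gt;

```python
MIN_EVIDENCE_CHARS = 10
# Hypothetical blocklist and benign markers, copied from the examples in the list.
GENERIC_PHRASES = ("कोई तकलीफ़ हो तो फ़ोन कर दीजिए",)
BENIGN_MARKERS = ("110/70", "बिल्कुल ठीक", "सामान्य")


def filter_danger_signs(signs, transcript):
    """Keep only danger signs whose evidence survives layers 1 to 5."""
    kept, seen = [], set()
    for sign in signs:
        evidence = sign.get("evidence", "").strip()
        if len(evidence) in range(MIN_EVIDENCE_CHARS):  # 1. evidence too short
            continue
        if any(p in evidence for p in GENERIC_PHRASES):  # 2. generic boilerplate
            continue
        if any(m in evidence for m in BENIGN_MARKERS):  # 3. cites a benign value
            continue
        if evidence not in transcript:  # 4. not grounded verbatim
            continue
        if evidence in seen:  # 5. duplicate (exact match only)
            continue
        seen.add(evidence)
        kept.append(sign)
    return kept


# Layer 6, range checks only. Bounds come from the list; field names are assumed.
RANGES = {"bp_systolic": (60, 250), "bp_diastolic": (30, 150),
          "hb": (3, 20), "weight": (1, 200), "gestational_weeks": (1, 45)}


def in_range(field, value):
    low, high = RANGES[field]
    # A value lies inside [low, high] exactly when clamping leaves it unchanged.
    return max(low, min(value, high)) == value


def validate_form(form):
    """Drop numeric fields that fall outside their plausible range."""
    return {k: v for k, v in form.items()
            if k not in RANGES or in_range(k, v)}
```

&lt;p&gt;Every check here is deterministic and runs against the transcript after extraction, which is the point of the pattern: the prompt never has to carry these rules.&lt;/p&gt;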

&lt;p&gt;False-alarm rate on routine visits: &lt;strong&gt;0&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Demographics never go through the LLM
&lt;/h2&gt;

&lt;p&gt;Early prototypes asked Gemma to extract patient name, age, and household composition from the audio. It hallucinated names from &lt;code&gt;दीदी&lt;/code&gt; and &lt;code&gt;बहन&lt;/code&gt;, defaulted ages on under-specified utterances, invented household members.&lt;/p&gt;

&lt;p&gt;The fix wasn't prompt-tuning. It was structural: demographics enter as a typed header — the way every clinical EMR works. The LLM never sees the question. It only extracts what was &lt;em&gt;said&lt;/em&gt; during the visit.&lt;/p&gt;

&lt;p&gt;This pattern generalizes — any LLM-based structured extraction where the field is known-and-typed should not be in the prompt at all.&lt;/p&gt;
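&lt;p&gt;One way to picture the typed-header split, as a sketch under assumptions: &lt;code&gt;PatientHeader&lt;/code&gt;, &lt;code&gt;EXTRACTION_FIELDS&lt;/code&gt;, and &lt;code&gt;build_record&lt;/code&gt; are hypothetical names, not identifiers from the Sakhi repo. Demographics arrive as typed input, and anything demographic the model volunteers is discarded at merge time.&lt;/p&gt;

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class PatientHeader:
    """Typed in before the visit; never shown to the model."""
    name: str
    age: int
    household_size: int


# Hypothetical extraction schema: no demographic field appears here.
EXTRACTION_FIELDS = ("bp_systolic", "bp_diastolic", "symptoms")


def build_record(header, extracted):
    """Merge the typed header with model output, keeping only schema fields."""
    visit = {k: v for k, v in extracted.items() if k in EXTRACTION_FIELDS}
    return {"patient": asdict(header), "visit": visit}


header = PatientHeader(name="Sunita", age=24, household_size=5)
record = build_record(header, {"bp_systolic": 155, "name": "दीदी"})
# record["patient"]["name"] stays "Sunita"; the model's "name" key is dropped.
```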

&lt;h2&gt;
  
  
  The Blackwell + Windows + Unsloth dead end
&lt;/h2&gt;

&lt;p&gt;Unsloth's bundled &lt;code&gt;save_pretrained_gguf&lt;/code&gt; mmap-fails on Blackwell + Windows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RuntimeError: unable to mmap ... [WinError 8] Not enough memory resources
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;WSL was out (CUDA passthrough for Whisper was already finicky in this setup). Linux dual-boot would have eaten two days I didn't have.&lt;/p&gt;

&lt;p&gt;I wrote &lt;code&gt;scripts/export_merge.py&lt;/code&gt; — manual LoRA-into-base delta-merge in PyTorch — then handed the merged FP16 model to &lt;code&gt;llama.cpp/convert_hf_to_gguf.py&lt;/code&gt; + &lt;code&gt;llama-quantize Q4_K_M&lt;/code&gt;. The fine-tune ships on the Ollama registry through that workaround:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull tusharbrisingr9802/sakhi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
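&lt;p&gt;For readers unfamiliar with the delta-merge step: the standard LoRA merge for one layer is W + (alpha / r) * (B @ A). The sketch below shows that arithmetic on plain Python lists rather than PyTorch tensors, so it is not &lt;code&gt;scripts/export_merge.py&lt;/code&gt; itself; &lt;code&gt;matmul&lt;/code&gt; and &lt;code&gt;merge_lora&lt;/code&gt; are hypothetical helper names.&lt;/p&gt;

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A) for a single layer.

    weight: out x in, lora_b: out x rank, lora_a: rank x in.
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


base = [[1, 0], [0, 1]]
merged = merge_lora(base, lora_a=[[1, 2]], lora_b=[[1], [0]], alpha=2, rank=1)
```

&lt;p&gt;Once every adapted layer is merged this way, the result is an ordinary dense checkpoint, which is why the usual HF-to-GGUF conversion can take over from there.&lt;/p&gt;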



&lt;p&gt;A/B vs base on the eval rubric: &lt;strong&gt;14/15 fine-tune vs 15/15 base&lt;/strong&gt;. Base is the production path. The fine-tune is published for deployments that prefer English schema-label normalization (&lt;code&gt;दस्त&lt;/code&gt; → &lt;code&gt;Diarrhea&lt;/code&gt;, &lt;code&gt;चक्कर&lt;/code&gt; → &lt;code&gt;dizziness&lt;/code&gt;).&lt;/p&gt;

&lt;h2&gt;
  
  
  Reproduce it locally
&lt;/h2&gt;

&lt;p&gt;The workstation stack is the primary path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/Tushar-9802/Sakhi
&lt;span class="nb"&gt;cd &lt;/span&gt;Sakhi
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements-runtime.txt
ollama pull gemma4:e4b-it-q4_K_M
&lt;span class="nb"&gt;cd &lt;/span&gt;frontend &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; npm run build &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd&lt;/span&gt; ..
python api.py
&lt;span class="c"&gt;# Browser: http://localhost:8000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Requires ~10 GB VRAM (E4B Q4_K_M is roughly 9 GB resident). Running this stack exercises function calling, normalization, the six-layer validation, and schema correctness end-to-end. Voice-to-form, text-to-form, and queue-and-sync all run on it.&lt;/p&gt;

&lt;p&gt;For the on-device Android path see the GitHub Release — prebuilt APK plus in-app SAF zip-import of the Cactus model. Cactus's &lt;code&gt;gemma-4-E2B-it&lt;/code&gt; INT4 build is gated on HuggingFace, so it isn't redistributed; the import flow keeps the no-adb path open for reviewers.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's not in this submission
&lt;/h2&gt;

&lt;p&gt;Full root-cause walkthroughs live in &lt;code&gt;FAILURES.md&lt;/code&gt; in the repo:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No on-device voice&lt;/strong&gt; — covered above. On-device LLM does Hindi text → form; voice routes to the workstation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No real ASHA endorsement.&lt;/strong&gt; Outreach didn't land inside the deadline. Real-voice testing came from family help in Bareilly — Hindi-native readers on a real phone mic, three of four role-play scripts. Not a corpus.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synthetic training data.&lt;/strong&gt; 1,154 fine-tune examples and the 15-case automated eval are LLM-generated Hindi with gTTS audio.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regional dialect coverage.&lt;/strong&gt; Tested on standard Hindi from Bareilly + role-play scripts. Bhojpuri, Awadhi, Magahi, code-switched Marwari/Bhili are not validated.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Partner with an ASHA training institute to collect 100+ hours of real ASHA home-visit audio under field conditions.&lt;/li&gt;
&lt;li&gt;Fine-tune an IndicWhisper variant on that real audio for the on-device voice-in path that is not in this submission.&lt;/li&gt;
&lt;li&gt;Harden integration with the official MCTS API so forms post directly into the NHM system instead of being exported as JSON/CSV.&lt;/li&gt;
&lt;li&gt;Pilot with 10–20 ASHA workers in one rural block with before/after time-and-accuracy measurement.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;3-min demo video&lt;/strong&gt; — &lt;a href="https://youtu.be/n-u7J1lljUg" rel="noopener noreferrer"&gt;https://youtu.be/n-u7J1lljUg&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub repository&lt;/strong&gt; — &lt;a href="https://github.com/Tushar-9802/Sakhi" rel="noopener noreferrer"&gt;https://github.com/Tushar-9802/Sakhi&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ollama fine-tune&lt;/strong&gt; — &lt;code&gt;ollama pull tusharbrisingr9802/sakhi&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Kaggle writeup&lt;/strong&gt; — &lt;a href="https://www.kaggle.com/competitions/gemma-4-good-hackathon/writeups/sakhi-voice-to-form-for-asha-workers" rel="noopener noreferrer"&gt;https://www.kaggle.com/competitions/gemma-4-good-hackathon/writeups/sakhi-voice-to-form-for-asha-workers&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If any of the patterns above are useful in your own LLM extraction pipelines (the model-extracts/Python-decides separation, demographics-as-typed-header, or the WER-based case against shipping fake on-device voice), drop a note in the comments. I'm &lt;a href="https://github.com/Tushar-9802" rel="noopener noreferrer"&gt;@Tushar-9802&lt;/a&gt; on GitHub.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>healthtech</category>
      <category>hindi</category>
    </item>
    <item>
      <title>I kept rewriting the same regex passes against LLM output. So I made a library.</title>
      <dc:creator>Tushar Jaju</dc:creator>
      <pubDate>Mon, 11 May 2026 12:28:29 +0000</pubDate>
      <link>https://dev.to/tushar9802/i-kept-rewriting-the-same-regex-passes-against-llm-output-so-i-made-a-library-539</link>
      <guid>https://dev.to/tushar9802/i-kept-rewriting-the-same-regex-passes-against-llm-output-so-i-made-a-library-539</guid>
      <description>&lt;p&gt;I've been working on a few LLM-based projects over the last year. &lt;a href="https://github.com/Tushar-9802/Sakhi" rel="noopener noreferrer"&gt;Sakhi&lt;/a&gt;, a Hindi voice-to-form pipeline for community health workers in India. A &lt;a href="https://github.com/Tushar-9802/Resume-parser" rel="noopener noreferrer"&gt;resume parser&lt;/a&gt; for engineering candidates. A couple of smaller things. Different domains, different models, different prompts.&lt;/p&gt;

&lt;p&gt;But there's a pattern: at the bottom of every pipeline, right before the model's output became "data we trust," I'd find the same kind of code.&lt;/p&gt;

&lt;p&gt;Strip markdown fences. Repair half-broken JSON. Trim runaway repetitions. Normalize Python &lt;code&gt;True&lt;/code&gt;/&lt;code&gt;False&lt;/code&gt;/&lt;code&gt;None&lt;/code&gt; to JSON booleans. Cut off the trailing "I hope this helps!" the model added after the actual answer.&lt;/p&gt;

&lt;p&gt;Every project had its own ad-hoc version of these. Slightly different regex, slightly different edge cases. The third time I copy-pasted a "strip &lt;code&gt;```json ... ```&lt;/code&gt;" cleaner across projects, I gave up and made it a library.&lt;/p&gt;

&lt;p&gt;That's &lt;code&gt;llmclean&lt;/code&gt;. Zero dependencies, pure standard library, three small utilities. v0.1.0 went up on PyPI a couple of months ago. v0.2.0 just shipped, and it's the one I want to talk about, because what changed in this release is what makes the case for a separate library at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  What v0.1.0 did
&lt;/h2&gt;

&lt;p&gt;Three functions, total. That's the entire public API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llmclean&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;strip_fences&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;enforce_json&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trim_repetition&lt;/span&gt;

&lt;span class="nf"&gt;strip_fences&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;```json&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Alice&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;```&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# → '{"name": "Alice"}'
&lt;/span&gt;
&lt;span class="nf"&gt;enforce_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Here you go: {&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ok&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: True, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;items&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: [1,2,3,]}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# → '{\n  "ok": true,\n  "items": [1, 2, 3]\n}'
&lt;/span&gt;
&lt;span class="nf"&gt;trim_repetition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The answer is 42. This is final. This is final.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# → 'The answer is 42. This is final.'
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each function returns the original input on failure (never raises), so it composes safely:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;enforce_json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;trim_repetition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;strip_fences&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw_output&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Stuck it on PyPI in March, copy-pasted the usage into Sakhi and the resume parser, moved on. Standard "I wrote a thing, hope it doesn't bite me" energy.&lt;/p&gt;

&lt;h2&gt;
  
  
  What production traffic taught me
&lt;/h2&gt;

&lt;p&gt;Then I went back to those two projects and kept building. And the library quietly broke in three different ways across the next two months, each one from real data I was feeding into it. Every one of those breaks became a v0.2.0 fix.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. CRLF on Windows silently inverted fence detection
&lt;/h3&gt;

&lt;p&gt;Output from Ollama running on my Windows machine came back with &lt;code&gt;\r\n&lt;/code&gt; line endings. The fence regex used &lt;code&gt;[ \t]*$&lt;/code&gt; as the trailing anchor. In Python's &lt;code&gt;re.MULTILINE&lt;/code&gt; mode, &lt;code&gt;$&lt;/code&gt; matches at the end of the string or immediately before a &lt;code&gt;\n&lt;/code&gt;; it does not treat &lt;code&gt;\r\n&lt;/code&gt; as a unit. So the &lt;code&gt;\r&lt;/code&gt; sat between my whitespace class and the newline, and the regex silently failed to match the fence line.&lt;/p&gt;

&lt;p&gt;The nasty part: it failed in an &lt;em&gt;inverted&lt;/em&gt; way. The closing fence line (with no &lt;code&gt;\r\n&lt;/code&gt; after it) still matched the regex, so the function read it as an &lt;em&gt;unclosed opening fence&lt;/em&gt; and stripped it. Meanwhile the actual opening line survived as content. Output looked like garbled JSON wrapped in a leftover code fence.&lt;/p&gt;

&lt;p&gt;Fix: &lt;code&gt;[ \t]*\r?$&lt;/code&gt;. Three regexes, the same optional &lt;code&gt;\r&lt;/code&gt; added to each.&lt;/p&gt;
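&lt;p&gt;The failure is easy to reproduce with the standard library. The fence patterns below are simplified stand-ins, not llmclean's actual regexes, and the fence string is built at runtime only so that the example itself contains no literal code fence.&lt;/p&gt;

```python
import re

FENCE = "`" * 3  # three backticks, assembled at runtime

# Hypothetical fence-line patterns: before and after the CRLF fix.
OLD = re.compile("^" + FENCE + r"\w*[ \t]*$", re.MULTILINE)
NEW = re.compile("^" + FENCE + r"\w*[ \t]*\r?$", re.MULTILINE)

crlf_output = FENCE + 'json\r\n{"ok": true}\r\n' + FENCE

# The old pattern misses the opening fence because of the stray carriage
# return, and matches only the closing fence at the very end of the string.
old_hits = [m.group(0) for m in OLD.finditer(crlf_output)]

# The new pattern sees both fence lines.
new_hits = [m.group(0).rstrip("\r") for m in NEW.finditer(crlf_output)]

print(len(old_hits), len(new_hits))
```

&lt;p&gt;With only the closing fence detected, any logic that pairs fences by order will mistake it for an opener, which is the inversion described above.&lt;/p&gt;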

&lt;h3&gt;
  
  
  2. BOM at position 0 broke &lt;code&gt;json.loads&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Some Windows file-IO round-trips and LLM client SDKs prepend a Byte Order Mark (&lt;code&gt;U+FEFF&lt;/code&gt;). Sakhi started hitting this when Whisper transcripts went through Windows file IO and emerged with a BOM at position 0. &lt;code&gt;json.loads&lt;/code&gt; sees an unexpected character at position 0 and bails immediately — before any of llmclean's strategy pipeline got a chance to fix anything.&lt;/p&gt;

&lt;p&gt;Fix: &lt;code&gt;lstrip("\ufeff")&lt;/code&gt; at the entry point of both &lt;code&gt;strip_fences&lt;/code&gt; and &lt;code&gt;enforce_json&lt;/code&gt;.&lt;/p&gt;
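&lt;p&gt;A short reproduction using only the standard library. It shows the behaviour of &lt;code&gt;json.loads&lt;/code&gt; on a BOM-prefixed string and the effect of stripping it first; it does not call llmclean itself.&lt;/p&gt;

```python
import json

BOM = "\ufeff"
raw = BOM + '{"ok": true}'

# json.loads rejects a str that starts with a BOM.
try:
    json.loads(raw)
    parsed_with_bom = True
except json.JSONDecodeError:
    parsed_with_bom = False

# Stripping the BOM first lets the same payload parse normally.
cleaned = json.loads(raw.lstrip(BOM))

print(parsed_with_bom, cleaned)
```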

&lt;h3&gt;
  
  
  3. Doubled-quote overruns when escape sequences leak
&lt;/h3&gt;

&lt;p&gt;Occasionally I'd see model output like &lt;code&gt;{"key": ""value""}&lt;/code&gt;. Doubled quotes on both sides of a string, usually because an upstream stage involved Python triple-quoted f-strings, or an escape got applied twice somewhere.&lt;/p&gt;

&lt;p&gt;Sakhi's own pipeline has three regexes for this kind of overrun, but two of them have an edge case: they can corrupt legitimate empty-string values (&lt;code&gt;{"k": ""}&lt;/code&gt;) because the regex can't tell "overrun" from "intentional empty" without parser-level context. So in llmclean I only included the safe one — the form that &lt;em&gt;requires&lt;/em&gt; non-empty content between the doubled quotes. That handles the common case (&lt;code&gt;""text""&lt;/code&gt; → &lt;code&gt;"text"&lt;/code&gt;) and never touches legitimate empties.&lt;/p&gt;
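&lt;p&gt;A sketch of what a "requires non-empty content" pattern can look like. The regex and the name &lt;code&gt;fix_doubled_quotes&lt;/code&gt; are assumptions for illustration, not the expression llmclean ships, and a pattern this simple could still misfire on unusual input such as adjacent empty strings.&lt;/p&gt;

```python
import re

# Hypothetical pattern: doubled quotes around one or more non-quote characters.
DOUBLED = re.compile(r'""([^"]+)""')


def fix_doubled_quotes(text):
    """Collapse doubled quotes around non-empty content to single quotes."""
    return DOUBLED.sub(r'"\1"', text)


overrun = fix_doubled_quotes('{"key": ""value""}')
empty = fix_doubled_quotes('{"k": ""}')
print(overrun)
print(empty)
```

&lt;p&gt;Because the capture group needs at least one character, a lone &lt;code&gt;""&lt;/code&gt; followed by a closing brace has nothing to match against and is left alone.&lt;/p&gt;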

&lt;p&gt;This kind of careful subtraction is the part I'm most happy about. It's less code than Sakhi has, but more correct.&lt;/p&gt;

&lt;h2&gt;
  
  
  The shape of the thing
&lt;/h2&gt;

&lt;p&gt;llmclean lives in a small gap between bigger tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For schema validation: use &lt;code&gt;jsonschema&lt;/code&gt; or &lt;code&gt;pydantic&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;For re-prompting the model when output is bad: use &lt;code&gt;instructor&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;For constraining the model at generation time so it can't produce broken output: use &lt;code&gt;outlines&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;llmclean is the post-hoc cleanup pass. The thing you run &lt;em&gt;after&lt;/em&gt; the model has emitted text and &lt;em&gt;before&lt;/em&gt; you try to parse it. It composes with all of the above — it's not competing with them.&lt;/p&gt;

&lt;p&gt;What I'm trying to stay true to while iterating:&lt;/p&gt;

&lt;p&gt;Functions never raise. Every public function returns the original input on failure, so it composes safely in pipelines that can't afford an exception path.&lt;/p&gt;

&lt;p&gt;Zero runtime dependencies. The standard library is enough for what this needs to do, and pulling in a dependency would force every downstream user to deal with version conflicts they didn't sign up for.&lt;/p&gt;

&lt;p&gt;Predictable behaviour. Same input, same output. No external state, no model calls, no fuzzy heuristics that change semantics silently between versions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it, tell me where it breaks
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;llmclean
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What I'd find genuinely useful:&lt;/p&gt;

&lt;p&gt;If you try it on output from a model I haven't tested against and it fails, file an issue with the raw input. Real failure cases are what improvements come from — every fix in v0.2.0 came from one.&lt;/p&gt;

&lt;p&gt;If your project has its own LLM-output cleanup logic, I'd love to know what your edge cases are. The whole library exists because three of my projects had different ad-hoc versions of the same thing. There's probably a fourth and fifth class of failure I haven't seen.&lt;/p&gt;

&lt;p&gt;If you've solved this with &lt;code&gt;instructor&lt;/code&gt; or &lt;code&gt;guardrails&lt;/code&gt; or some other tool and want to argue I should have just used that — also welcome. Comparative honesty is more useful than marketing.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/Tushar-9802/llmclean" rel="noopener noreferrer"&gt;Tushar-9802/llmclean&lt;/a&gt;&lt;br&gt;
PyPI: &lt;a href="https://pypi.org/project/llmclean/" rel="noopener noreferrer"&gt;llmclean on PyPI&lt;/a&gt;&lt;br&gt;
Changelog: &lt;a href="https://github.com/Tushar-9802/llmclean/blob/main/CHANGELOG.md" rel="noopener noreferrer"&gt;CHANGELOG.md&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next version probably picks up a few more patterns I noted while inspecting MedScribe (a SOAP-note extraction project of mine): prompt-leakage stripping when the model echoes back parts of its own prompt, and section-level repetition truncation. Those are in the queue, currently driven by the same process — find them in real work first, port to the library second.&lt;/p&gt;

&lt;p&gt;If you've got a use case where llmclean would help, or one where it's already broken on you, the issue tracker is open.&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>llm</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
