DEV Community: Tushar Jaju

Building Sakhi: Hindi Voice-to-Form for India's ASHA Workers, Solo in Six Weeks

Tushar Jaju — Tue, 19 May 2026 14:27:01 +0000

TL;DR — Six-week solo build of a Hindi voice-to-form pipeline for India's ~1 million community health workers. Two deployment modes: a workstation path with Whisper + Gemma 4 E4B on Ollama, and a fully offline on-device path running Gemma 4 E2B INT4 on the Cactus SDK on Android. Submitted to Kaggle's Gemma 4 Good Hackathon. Source on GitHub, fine-tune on Ollama.

The problem

India's 1 million Accredited Social Health Activists (ASHAs) handle the last clinical mile for maternal and child health. They conduct 50+ million home visits a year — vitals, symptoms, counselling, danger-sign assessment. Every visit still ends with a paper form filled from memory and physically carried to the Primary Health Center on the next clinic day.

Danger signs that were observed — preeclampsia, postpartum hemorrhage, neonatal distress — sometimes never reach the clinical system in time for intervention.

Two compounding constraints make this hard to fix with conventional tooling:

Hindi voice, often in regional dialects. Cloud STT is unreliable on rural-clinical Hindi (published benchmarks: 27–70%+ WER, deletion-dominant — numbers and symptoms silently drop).
Connectivity is intermittent. Airplane-mode operation cannot be a fallback. It must be the default.

Architecture

Two deployment modes for how ASHAs actually work — a workstation in the health center, and the phone in the field:

Workstation path (PHC, GPU):
[Hindi Audio] → Whisper-Large CT2 → Hindi Normalization → Gemma 4 E4B (function calling)
                                                            ├── extract_form()
                                                            ├── flag_danger_sign()
                                                            └── issue_referral()

On-device path (Android, no network):
[Hindi Text] → Hindi Normalization → Visit-type detect → Gemma 4 E2B INT4 on Cactus
                                                          ├── extract_form
                                                          └── detect_danger

Workstation mode handles voice: a phone uploads audio to a shared PC at the sub-centre, Whisper-Large-V2 Hindi via CTranslate2 transcribes, Gemma 4 E4B Q4_K_M on Ollama extracts the structured form with native function calling. End-to-end 15–25 seconds on an RTX 5070 Ti.

Field mode runs the full pipeline (normalize → detect visit type → extract form → flag danger signs) entirely on-device. End-to-end 320.7s on a OnePlus 11R (Snapdragon 8+ Gen 1), zero network. The on-device LLM does Hindi text → form; voice routes to the workstation when WiFi returns (more on why below).

The hardest engineering call: leaving on-device voice OUT

I wanted on-device voice-to-form. A phone, no laptop, no network — that's the cleanest pitch. I pulled it from the build instead.

Cactus SDK ships multilingual Whisper INT4 for transcription — no Hindi-specific checkpoint. The published numbers are bad:

27% WER best-case on rural Hindi
70%+ on clinical content
Error profile is deletion-dominant — numbers and symptoms silently drop while filler words survive

A missed BP reading is a missed referral. A demo where Sakhi says "BP normal" because the actual 155/100 was deleted during transcription is exactly the failure mode an ASHA cannot catch in the field.
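To make "deletion-dominant" concrete, here is a minimal sketch that splits word error rate into substitutions, deletions, and insertions. The transcript pair is invented for illustration and is not taken from the published benchmarks, and wer_breakdown is a hypothetical helper, not part of Sakhi or Cactus.

```python
def wer_breakdown(reference: str, hypothesis: str) -> dict:
    """Word error rate plus substitution/deletion/insertion counts."""
    ref, hyp = reference.split(), hypothesis.split()
    # cost[i][j] = (edits, subs, dels, ins) needed to turn ref[:i] into hyp[:j]
    cost = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, i, 0)  # every reference word deleted
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, 0, j)  # every hypothesis word inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                cost[i][j] = cost[i - 1][j - 1]  # words match, no edit
                continue
            e, s, d, n = cost[i - 1][j - 1]
            substitution = (e + 1, s + 1, d, n)
            e, s, d, n = cost[i - 1][j]
            deletion = (e + 1, s, d + 1, n)
            e, s, d, n = cost[i][j - 1]
            insertion = (e + 1, s, d, n + 1)
            # Tuples compare on total edits first, so this keeps the cheapest path.
            cost[i][j] = min(substitution, deletion, insertion)
    edits, subs, dels, ins = cost[len(ref)][len(hyp)]
    return {"wer": edits / max(len(ref), 1), "sub": subs, "del": dels, "ins": ins}


# Invented example: the vitals and the symptom vanish, the filler survives.
result = wer_breakdown(
    "BP 155 over 100 and headache since morning",
    "BP and since morning",
)
print(result)  # {'wer': 0.5, 'sub': 0, 'del': 4, 'ins': 0}
```

On this invented pair every error is a deletion. The hypothesis still reads like a plausible sentence, which is why this error profile is hard to catch by eye.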

So voice routes to the workstation where Whisper-Large-V2 Hindi runs. The on-device LLM handles Hindi text → form for the case where an ASHA types a quick note offline. Field mode also captures raw audio offline and syncs to the workstation when WiFi returns.

This was the most uncomfortable call of the build. The submission video shows raw on-device JSON output from text input instead of faking voice.

Anti-hallucination: model extracts, Python decides

The hardest problem isn't getting Gemma to talk about a transcript. It's getting it to stop inventing. Early prototypes:

Hallucinated patient names from generic forms of address (दीदी / बहन — Hindi for "elder sister" / "sister", used informally for any woman regardless of relation).
Invented BP readings on routine visits that never mentioned vitals.
Turned counselling utterances ("eat iron-rich food, drink plenty of water") into "danger signs."

The pattern that stuck: Gemma proposes evidence; Python decides what counts. The LLM extracts only what was said — verbatim utterances, structured under the schema. Validation, range-checks, deduplication, blocklist filtering: none of that runs inside the prompt. It runs in code, against the transcript, after extraction.

Six layers of validation:

Evidence length filter — danger signs with under 10-character evidence are dropped.
Generic ASHA phrase blocklist — boilerplate (कोई तकलीफ़ हो तो फ़ोन कर दीजिए / "call me if there's any problem") filtered.
Normal-value filter — signs citing benign values (110/70, बिल्कुल ठीक / "totally fine", सामान्य / "normal") stripped.
Transcript grounding — evidence must appear verbatim in the transcript.
Deduplication across overlapping danger signs.
Form validation — strips invented patient names (दीदी/बहन patterns), default ages, phantom lab results; range checks on BP (60–250 / 30–150), Hb (3–20), weight (1–200), gestational weeks (1–45).

False-alarm rate on routine visits: 0.
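The first five layers can be sketched as a plain post-extraction filter. This is a simplified illustration, not the code in the repo: the function name, the dict shape, and the one-entry blocklist are assumptions, deduplication here is by identical evidence only, and the BP check assumes the ranges above mean systolic 60–250 and diastolic 30–150.

```python
MIN_EVIDENCE_CHARS = 10
# Assumed sample entries; the real lists live in the Sakhi repo.
GENERIC_PHRASES = ("कोई तकलीफ़ हो तो फ़ोन कर दीजिए",)
NORMAL_MARKERS = ("110/70", "बिल्कुल ठीक", "सामान्य")


def filter_danger_signs(signs, transcript):
    """Keep only danger signs whose evidence survives every check."""
    kept, seen = [], set()
    for sign in signs:
        evidence = sign.get("evidence", "").strip()
        if len(evidence) < MIN_EVIDENCE_CHARS:  # layer 1: too short to trust
            continue
        if any(p in evidence for p in GENERIC_PHRASES):  # layer 2: boilerplate
            continue
        if any(m in evidence for m in NORMAL_MARKERS):  # layer 3: benign value
            continue
        if evidence not in transcript:  # layer 4: must be verbatim
            continue
        if evidence in seen:  # layer 5: simplified dedup on identical evidence
            continue
        seen.add(evidence)
        kept.append(sign)
    return kept


def bp_in_range(systolic, diastolic):
    """Part of layer 6: reject physiologically implausible BP readings."""
    return 60 <= systolic <= 250 and 30 <= diastolic <= 150


transcript = (
    "BP is 155/100 today. She has had a severe headache since morning. "
    "BP was 110/70 last month."
)
proposed = [
    {"sign": "hypertension", "evidence": "BP is 155/100 today"},
    {"sign": "headache", "evidence": "headache"},  # under 10 characters
    {"sign": "hypertension", "evidence": "BP was 110/70 last month"},  # benign
    {"sign": "fever", "evidence": "high fever for three days"},  # never said
    {"sign": "preeclampsia", "evidence": "BP is 155/100 today"},  # duplicate
]
kept = filter_danger_signs(proposed, transcript)
```

Only the first proposal survives. None of the checks needs the model, so each one can be unit-tested against fixed transcripts.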

Demographics never go through the LLM

Early prototypes asked Gemma to extract patient name, age, and household composition from the audio. It hallucinated names from दीदी and बहन, defaulted ages on under-specified utterances, invented household members.

The fix wasn't prompt-tuning. It was structural: demographics enter as a typed header — the way every clinical EMR works. The LLM never sees the question. It only extracts what was said during the visit.

This pattern generalizes: in any LLM-based structured extraction, a field whose value is already known and typed should not be in the prompt at all.
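A minimal sketch of the typed-header idea, assuming a dataclass for the header and a plain dict for the model's output. PatientHeader, DEMOGRAPHIC_KEYS, build_record, and the sample values are all hypothetical names for illustration, not Sakhi's actual schema.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class PatientHeader:
    """Demographics typed in by the worker, never extracted by the model."""
    name: str
    age: int
    household_size: int


# Keys the model is not allowed to contribute, even if it emits them.
DEMOGRAPHIC_KEYS = {"name", "age", "household_size"}


def build_record(header: PatientHeader, llm_fields: dict) -> dict:
    """Combine the typed header with whatever the model extracted from the visit."""
    clinical = {k: v for k, v in llm_fields.items() if k not in DEMOGRAPHIC_KEYS}
    return {"patient": asdict(header), "visit": clinical}


header = PatientHeader(name="Sunita", age=24, household_size=5)
# The model wrongly promotes a form of address to a name; it is discarded.
record = build_record(header, {"name": "दीदी", "bp": "155/100"})
```

The header is the only source of demographics in the final record, so a hallucinated name has nowhere to land.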

The Blackwell + Windows + Unsloth dead end

Unsloth's bundled save_pretrained_gguf mmap-fails on Blackwell + Windows:

RuntimeError: unable to mmap ... [WinError 8] Not enough memory resources

WSL was out (CUDA passthrough for Whisper was already finicky in this setup). Linux dual-boot would have eaten two days I didn't have.

I wrote scripts/export_merge.py — manual LoRA-into-base delta-merge in PyTorch — then handed the merged FP16 model to llama.cpp/convert_hf_to_gguf.py + llama-quantize Q4_K_M. The fine-tune ships on the Ollama registry through that workaround:

ollama pull tusharbrisingr9802/sakhi
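For readers unfamiliar with what a manual merge involves, the arithmetic per adapted layer is small. The sketch below assumes the standard LoRA formulation, W_merged = W + (alpha / r) * B·A, and uses plain Python lists so it runs without PyTorch. It is not the contents of scripts/export_merge.py, and merge_lora is a hypothetical name.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Return W + (alpha / rank) * (B @ A) for a single layer."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # shape: out_features x in_features
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


# Toy 2x2 layer with a rank-1 adapter.
base = [[1.0, 0.0], [0.0, 1.0]]
lora_b = [[1.0], [2.0]]    # out_features x rank
lora_a = [[0.5, -0.5]]     # rank x in_features
merged = merge_lora(base, lora_a, lora_b, alpha=2, rank=1)
print(merged)  # [[2.0, -1.0], [2.0, -1.0]]
```

On real tensors the same expression applies layer by layer, after which the adapter weights are no longer needed and the merged model can be converted like any ordinary checkpoint.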

A/B vs base on the eval rubric: 14/15 fine-tune vs 15/15 base. Base is the production path. The fine-tune is published for deployments that prefer English schema-label normalization (दस्त → Diarrhea, चक्कर → dizziness).

Reproduce it locally

The workstation stack is the primary path:

git clone https://github.com/Tushar-9802/Sakhi
cd Sakhi
pip install -r requirements-runtime.txt
ollama pull gemma4:e4b-it-q4_K_M
cd frontend && npm install && npm run build && cd ..
python api.py
# Browser: http://localhost:8000

Requires ~10 GB VRAM (E4B Q4_K_M is roughly 9 GB resident). Running this stack exercises function calling, normalization, the 6-layer validation, and schema correctness end-to-end. Voice-to-form, text-to-form, and queue-and-sync all run on it.

For the on-device Android path see the GitHub Release — prebuilt APK plus in-app SAF zip-import of the Cactus model. Cactus's gemma-4-E2B-it INT4 build is gated on HuggingFace, so it isn't redistributed; the import flow keeps the no-adb path open for reviewers.

What's not in this submission

Full root-cause walkthroughs live in FAILURES.md in the repo:

No on-device voice — covered above. On-device LLM does Hindi text → form; voice routes to the workstation.
No real ASHA endorsement. Outreach didn't land inside the deadline. Real-voice testing came from family help in Bareilly — Hindi-native readers on a real phone mic, three of four role-play scripts. Not a corpus.
Synthetic training data. 1,154 fine-tune examples and the 15-case automated eval are LLM-generated Hindi with gTTS audio.
No regional dialect coverage. Tested on standard Hindi from Bareilly + role-play scripts. Bhojpuri, Awadhi, Magahi, and code-switched Marwari/Bhili are not validated.

What's next

Partner with an ASHA training institute to collect 100+ hours of real ASHA home-visit audio under field conditions.
Fine-tune an IndicWhisper variant on that real audio for the on-device voice-in path that is not in this submission.
Harden integration with the official MCTS API so forms post directly into the NHM system instead of being exported as JSON/CSV.
Pilot with 10–20 ASHA workers in one rural block with before/after time-and-accuracy measurement.

Links

3-min demo video — https://youtu.be/n-u7J1lljUg
GitHub repository — https://github.com/Tushar-9802/Sakhi
Ollama fine-tune — ollama pull tusharbrisingr9802/sakhi
Kaggle writeup — https://www.kaggle.com/competitions/gemma-4-good-hackathon/writeups/sakhi-voice-to-form-for-asha-workers

If any of the patterns above are useful in your own LLM extraction pipelines — the model-extracts/Python-decides separation, demographics-as-typed-header, or the Whisper-INT4-WER receipts argument for not shipping fake on-device voice — drop a note in the comments. I'm @Tushar-9802 on GitHub.

I kept rewriting the same regex passes against LLM output. So I made a library.

Tushar Jaju — Mon, 11 May 2026 12:28:29 +0000

I've been working on a few LLM-based projects over the last year. Sakhi, a Hindi voice-to-form pipeline for community health workers in India. A resume parser for engineering candidates. A couple of smaller things. Different domains, different models, different prompts.

But there's a pattern: at the bottom of every pipeline, right before the model's output became "data we trust," I'd find the same kind of code.

Strip markdown fences. Repair half-broken JSON. Trim runaway repetitions. Normalize Python True/False/None to JSON booleans. Cut off the trailing "I hope this helps!" the model added after the actual answer.

Every project had its own ad-hoc version of these. Slightly different regex, slightly different edge cases. The third time I copy-pasted a "strip ```json ... ```" cleaner across projects, I gave up and made it a library.

That's llmclean. Zero dependencies, pure standard library, three small utilities. v0.1.0 went up on PyPI a couple of months ago. v0.2.0 just shipped, and it's the one I want to talk about, because what changed in this release is the part that makes the case for a separate library at all.

What v0.1.0 did

Three functions, total. That's the entire public API:

from llmclean import strip_fences, enforce_json, trim_repetition

strip_fences('```json\n{"name": "Alice"}\n```')
# → '{"name": "Alice"}'

enforce_json('Here you go: {"ok": True, "items": [1,2,3,]}')
# → '{\n  "ok": true,\n  "items": [1, 2, 3]\n}'

trim_repetition("The answer is 42. This is final. This is final.")
# → 'The answer is 42. This is final.'

Each function returns the original input on failure (never raises), so it composes safely:

data = enforce_json(trim_repetition(strip_fences(raw_output)))

Stuck it on PyPI in March, copy-pasted the usage into Sakhi and the resume parser, moved on. Standard "I wrote a thing, hope it doesn't bite me" energy.

What production traffic taught me

Then I went back to those two projects and kept building. And the library quietly broke in three different ways across the next two months, each one from real data I was feeding into it. Every one of those breaks became a v0.2.0 fix.

1. CRLF on Windows silently inverted fence detection

Output from Ollama running on my Windows machine came back with \r\n line endings. The fence regex used [ \t]*$ as the trailing anchor. In Python's re.MULTILINE mode, $ matches the position immediately before \n — not before \r\n. So the \r sat between my whitespace class and the newline, and the regex silently failed to match the fence line.

The nasty part: it failed in an inverted way. The closing fence line (with no \r\n after it) still matched the regex, so the function read it as an unclosed opening fence and stripped it. Meanwhile the actual opening line survived as content. Output looked like garbled JSON wrapped in a leftover code fence.

Fix: [ \t]*\r?$. Three regexes, one character each.
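The behaviour is easy to reproduce with the standard library. The patterns below are simplified stand-ins for the library's fence regexes, not the actual ones in llmclean. The fence is built from single backticks only to keep this snippet easy to embed.

```python
import re

FENCE = "`" * 3  # three backticks

# Simplified fence-line patterns: before and after the fix.
FENCE_OLD = re.compile("^" + FENCE + r"\w*[ \t]*$", re.MULTILINE)
FENCE_NEW = re.compile("^" + FENCE + r"\w*[ \t]*\r?$", re.MULTILINE)

# Model output with Windows line endings and no newline after the closing fence.
crlf_output = FENCE + 'json\r\n{"ok": true}\r\n' + FENCE

old_hits = [m.group(0) for m in FENCE_OLD.finditer(crlf_output)]
new_hits = [m.group(0) for m in FENCE_NEW.finditer(crlf_output)]

print(len(old_hits))  # 1: only the closing fence matches
print(len(new_hits))  # 2: opening and closing fence both match
```

With the old pattern the only match is the closing fence, which is exactly the inverted reading described above. Note that the new pattern's match on the opening line includes the trailing \r, so downstream code should strip it.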

2. BOM at position 0 broke `json.loads`

Some Windows file-IO round-trips and LLM client SDKs prepend a Byte Order Mark (U+FEFF). Sakhi started hitting this when Whisper transcripts went through Windows file IO and emerged with a BOM at position 0. json.loads sees an unexpected character at position 0 and bails immediately — before any of llmclean's strategy pipeline got a chance to fix anything.

Fix: lstrip("\ufeff") at the entry point of both strip_fences and enforce_json.
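A short reproduction using only the standard library shows why the strip has to happen before parsing. The sample payload is made up.

```python
import json

raw = '\ufeff{"ok": true}'  # U+FEFF (BOM) at position 0

try:
    json.loads(raw)
    parsed_with_bom = True
except json.JSONDecodeError:
    # json.loads rejects a str that starts with a BOM before parsing anything.
    parsed_with_bom = False

cleaned = raw.lstrip("\ufeff")  # remove leading BOM characters only
data = json.loads(cleaned)

print(parsed_with_bom)  # False
print(data)             # {'ok': True}
```

When reading from a file instead of a string, opening it with encoding="utf-8-sig" removes the BOM at decode time.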

3. Doubled-quote overruns when escape sequences leak

Occasionally I'd see model output like {"key": ""value""}. Doubled quotes on both sides of a string, usually because an upstream stage involved Python triple-quoted f-strings, or an escape got applied twice somewhere.

Sakhi's own pipeline has three regexes for this kind of overrun, but two of them have an edge case: they can corrupt legitimate empty-string values ({"k": ""}) because the regex can't tell "overrun" from "intentional empty" without parser-level context. So in llmclean I only included the safe one — the form that requires non-empty content between the doubled quotes. That handles the common case (""text"" → "text") and never touches legitimate empties.

This kind of careful subtraction is the part I'm most happy about. It's less code than Sakhi has, but more correct.
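A sketch of what a "safe" form can look like, assuming a pattern that only fires when the doubled quotes wrap content beginning with a word character. The regex and the function name are hypothetical and may differ from what llmclean ships.

```python
import re

# Two quotes, content that starts with a word character, two quotes.
# Requiring the leading word character keeps '""' followed by ',' or '}' untouched.
DOUBLED = re.compile(r'""(\w[^"\n]*)""')


def collapse_doubled_quotes(text: str) -> str:
    """Rewrite ""text"" as "text" and leave empty strings alone."""
    return DOUBLED.sub(r'"\1"', text)


print(collapse_doubled_quotes('{"key": ""value""}'))  # {"key": "value"}
print(collapse_doubled_quotes('{"k": ""}'))           # {"k": ""}
print(collapse_doubled_quotes('["", ""]'))            # ["", ""]
```

The third case shows why "non-empty content" alone is not quite enough in this sketch: two adjacent empty strings have non-empty text between them (the comma and space), so the pattern also constrains what that content starts with.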

The shape of the thing

llmclean lives in a small gap between bigger tools:

For schema validation: use jsonschema or pydantic.
For re-prompting the model when output is bad: use instructor.
For constraining the model at generation time so it can't produce broken output: use outlines.

llmclean is the post-hoc cleanup pass. The thing you run after the model has emitted text and before you try to parse it. It composes with all of the above — it's not competing with them.

What I'm trying to keep true to while iterating:

Functions never raise. Every public function returns the original input on failure, so it composes safely in pipelines that can't afford an exception path.

Zero runtime dependencies. The standard library is enough for what this needs to do, and pulling in a dependency would force every downstream user to deal with version conflicts they didn't sign up for.

Predictable behaviour. Same input, same output. No external state, no model calls, no fuzzy heuristics that change semantics silently between versions.
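The never-raise contract can be expressed as a small decorator. This is one possible way to enforce it, shown as a sketch; llmclean's internals may handle it differently, and both names below are hypothetical.

```python
import functools


def returns_input_on_failure(func):
    """Wrap a text-cleaning function so any exception yields the original input."""
    @functools.wraps(func)
    def wrapper(text, *args, **kwargs):
        try:
            return func(text, *args, **kwargs)
        except Exception:
            return text  # fall back to the untouched input
    return wrapper


@returns_input_on_failure
def upper_first_line(text):
    """Toy cleaner: raises ValueError when the text has no newline."""
    first, rest = text.split("\n", 1)
    return first.upper() + "\n" + rest


print(repr(upper_first_line("a\nb")))        # 'A\nb'
print(repr(upper_first_line("no newline")))  # 'no newline'
```

The trade-off is that failures become silent, so a caller who needs to know whether cleaning succeeded has to compare output with input.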

Try it, tell me where it breaks

pip install llmclean

What I'd find genuinely useful:

If you try it on output from a model I haven't tested against and it fails, file an issue with the raw input. Real failure cases are what improvements come from — every fix in v0.2.0 came from one.

If your project has its own LLM-output cleanup logic, I'd love to know what your edge cases are. The whole library exists because three of my projects had different ad-hoc versions of the same thing. There's probably a fourth and fifth class of failure I haven't seen.

If you've solved this with instructor or guardrails or some other tool and want to argue I should have just used that — also welcome. Comparative honesty is more useful than marketing.

GitHub: Tushar-9802/llmclean
PyPI: llmclean on PyPI
Changelog: CHANGELOG.md

Next version probably picks up a few more patterns I noted while inspecting MedScribe (a SOAP-note extraction project of mine): prompt-leakage stripping when the model echoes back parts of its own prompt, and section-level repetition truncation. Those are in the queue, currently driven by the same process — find them in real work first, port to the library second.

If you've got a use case where llmclean would help, or one where it's already broken on you, the issue tracker is open.