Most AI demos assume clean text. Production inputs are messy: mixed scripts, mixed languages, spelling variation, and transliteration drift. I built open-vernacular-ai-kit to handle this cleanup layer before retrieval, routing, and LLM generation.
Open Vernacular AI Kit
open-vernacular-ai-kit is an open-source SDK and CLI for cleaning up code-mixed text that blends
Indian vernacular languages with English. This release is India-first, with Sarvam AI integrations,
and is designed to expand globally in future updates through community-contributed language and provider adapters.
It is designed for messy WhatsApp-style inputs where vernacular text might appear in:
- native script (example: ગુજરાતી)
- Romanized vernacular text (example: Gujlish)
- mixed script in the same sentence
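The three input shapes above can be told apart with a simple Unicode check. The following is only a minimal sketch and not part of the kit's API: `classify_script` and its labels are hypothetical names chosen for illustration, and it assumes Gujarati as the native script (Unicode block U+0A80 to U+0AFF). Note that a Latin-only result cannot distinguish Romanized Gujarati from plain English; that would need a lexicon or a language model on top.

```python
# Hypothetical helper for illustration only; not the open-vernacular-ai-kit API.
GUJARATI_START, GUJARATI_END = 0x0A80, 0x0AFF  # Unicode Gujarati block


def classify_script(text: str) -> str:
    """Label text as 'native', 'latin', 'mixed', or 'unknown' by script."""
    has_gujarati = False
    has_latin = False
    for ch in text:
        if GUJARATI_START <= ord(ch) <= GUJARATI_END:
            has_gujarati = True
        elif ch.isascii() and ch.isalpha():
            has_latin = True
    if has_gujarati and has_latin:
        return "mixed"
    if has_gujarati:
        return "native"
    if has_latin:
        # Could be Romanized Gujarati or English; script alone cannot tell.
        return "latin"
    return "unknown"  # digits, punctuation, or other scripts only


if __name__ == "__main__":
    for sample in ["ગુજરાતી", "maru business plan ready chhe!!!",
                   "મારું business plan ready છે!!"]:
        print(classify_script(sample), "<-", sample)
```

A check like this would typically run first, so that only Latin-script or mixed inputs are passed to the more expensive transliteration step.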
The goal is to normalize text before sending it to downstream models (Sarvam-M / Mayura / Sarvam-Translate), and to provide a reusable open-source foundation for vernacular AI workflows. Global language/provider expansion is planned and PR-friendly.
This repo is alpha-quality but SDK-first: the public API centers on CodeMixConfig + CodeMixPipeline.
Quick example:
gck codemix "maru business plan ready chhe!!!"
# -> મારું business plan ready છે!!
What We Solve
This project is a production-oriented normalization layer for India-focused AI…