Sudhir Gadhvi

Building a Vernacular AI Preprocessing Layer for Indian Code-Mixed Text

Most AI demos assume clean text. Production inputs are messy: mixed script, mixed language, spelling variation, and transliteration drift. I built open-vernacular-ai-kit to clean up that input before retrieval, routing, and LLM generation.

Open Vernacular AI Kit

open-vernacular-ai-kit is an open-source SDK + CLI for cleaning up Indian vernacular-English code-mixed text. This release is India-first with Sarvam AI integrations, and is designed to expand globally in future updates through community-contributed language and provider adapters. It targets messy WhatsApp-style inputs where vernacular text might appear in:

  • Native script (example: ગુજરાતી)
  • Romanized vernacular text (example: Gujlish)
  • Mixed script in the same sentence
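The three input shapes above can be told apart, at least roughly, by checking which Unicode ranges each token draws from. The following is a minimal sketch for Gujarati only, not the kit's own detection logic; the function names `token_script` and `sentence_scripts` are hypothetical and exist only for this illustration.

```python
import re

# Gujarati Unicode block: U+0A80 to U+0AFF.
GUJARATI = re.compile(r"[\u0A80-\u0AFF]")
# Basic Latin letters only; accented Latin is ignored in this sketch.
LATIN = re.compile(r"[A-Za-z]")


def token_script(token: str) -> str:
    """Label one token as 'gujarati', 'latin', 'mixed', or 'other'."""
    has_gujarati = bool(GUJARATI.search(token))
    has_latin = bool(LATIN.search(token))
    if has_gujarati and has_latin:
        return "mixed"
    if has_gujarati:
        return "gujarati"
    if has_latin:
        return "latin"
    return "other"  # digits, punctuation, emoji, other scripts


def sentence_scripts(text: str) -> set:
    """Return the script labels seen across whitespace-separated tokens."""
    return {token_script(tok) for tok in text.split()} - {"other"}
```

A sentence whose label set contains more than one entry is mixed-script. Note that a script check alone cannot separate English from Romanized Gujarati, since both are written in Latin letters; that distinction needs a lexicon or a language-identification model.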

The goal is to normalize text before sending it to downstream models (Sarvam-M / Mayura / Sarvam-Translate), and to provide a reusable open-source foundation for vernacular AI workflows. Global language/provider expansion is planned, and PRs are welcome.

This repo is alpha-quality but SDK-first: the public API centers on CodeMixConfig + CodeMixPipeline.

Quick example:

gck codemix "maru business plan ready chhe!!!"
# -> મારું business plan ready છે!!
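To make the idea behind the quick example concrete, here is a toy sketch of the core move: replace known Romanized tokens with their native-script form and leave English words alone. This is an assumption-laden illustration, not how open-vernacular-ai-kit works internally; `TOY_LEXICON` and `toy_normalize` are hypothetical names, and the two-entry lexicon is taken only from the sample sentence above.

```python
import re

# Hypothetical two-entry lexicon, for illustration only.
# A real system would need far broader coverage and spelling-variant handling.
TOY_LEXICON = {"maru": "મારું", "chhe": "છે"}


def toy_normalize(text: str) -> str:
    """Swap Latin-letter words found in TOY_LEXICON for their Gujarati form."""

    def swap(match):
        word = match.group(0)
        # Unknown words (here, the English ones) pass through unchanged.
        return TOY_LEXICON.get(word.lower(), word)

    return re.sub(r"[A-Za-z]+", swap, text)


print(toy_normalize("maru business plan ready chhe"))
```

The sketch leaves punctuation and whitespace untouched and does nothing about spelling variation or transliteration drift, which are the hard parts a real normalization layer has to address.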

What We Solve

This project is a production-oriented normalization layer for India-focused AI…