BARI ANKIT VINOD

Posted on Jun 5

GuWiki: Building a Gujarati AI Wikipedia from Scratch 🇮🇳

#devchallenge #githubchallenge #githubcopilot #opensource

GitHub “Finish-Up-A-Thon” Challenge Submission

What I Built

GuWiki is an AI-powered Wikipedia-style platform built specifically for the Gujarati language.

Gujarati is spoken by more than 60 million people worldwide, yet high-quality AI tools, language models, and speech technologies for Gujarati remain limited compared to English and other major languages.

I wanted to help close that gap.

Instead of relying on existing foundation models, I built the core AI components from scratch:

A Gujarati Large Language Model (LLM)
A Gujarati Automatic Speech Recognition (ASR) model
A complete data engineering pipeline
A Wikipedia-style knowledge platform powered by these models

Users can search, read, and interact with Gujarati knowledge using models trained directly on Gujarati text and speech.

AI Models

Gujarati NanoGPT (LLM)

https://huggingface.co/aijadugar/gujarati-nanogpt

A language model trained specifically on Gujarati text so that it learns the language's vocabulary, grammar, and patterns.

Gujarati ASR Model

https://huggingface.co/aijadugar/gujarati-asr

A speech-to-text model trained for Gujarati audio, enabling voice-based interaction and accessibility.

Demo

Live Application

👉 https://gu-wiki.vercel.app/

Video Walkthrough

🎥 https://www.loom.com/share/b5e278e624724d8f975e95b0fc3c6297

Key Features

AI-powered Gujarati knowledge search
Native Gujarati language understanding
Speech-to-text capabilities
Custom-trained Gujarati LLM
Custom-trained Gujarati ASR
End-to-end data engineering pipeline
Fast and responsive web interface
Open-source repository for community contribution

The Comeback Story

This project started as an ambitious idea:

"Can I build AI infrastructure for Gujarati instead of simply consuming models built for English?"

The answer turned out to be much harder than expected.

The biggest challenge wasn't building the website.

It was the data.

Gujarati lacks the abundance of high-quality datasets available for English. I spent significant time collecting, cleaning, validating, and preparing Gujarati text and speech data before any model training could even begin.
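To give a feel for what "cleaning and validating" can mean for Gujarati text, here is a minimal sketch of one possible filtering pass. It is an illustration, not the actual GuWiki pipeline: the function names, the 0.7 script-ratio threshold, and the 10-character minimum are all assumptions chosen for the example. The only hard fact it relies on is that the Unicode Gujarati block spans U+0A80 to U+0AFF.

```python
import re
import unicodedata

# Unicode Gujarati block (U+0A80..U+0AFF).
GUJARATI_START, GUJARATI_END = 0x0A80, 0x0AFF


def gujarati_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that fall in the Gujarati block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    in_block = sum(GUJARATI_START <= ord(c) <= GUJARATI_END for c in chars)
    return in_block / len(chars)


def clean_lines(lines, min_ratio=0.7, min_chars=10):
    """Normalize, filter, and de-duplicate raw text lines.

    min_ratio and min_chars are hypothetical thresholds for this sketch.
    """
    seen = set()
    cleaned = []
    for line in lines:
        # Canonical Unicode normalization, then collapse runs of whitespace.
        text = unicodedata.normalize("NFC", line)
        text = re.sub(r"\s+", " ", text).strip()
        if len(text) < min_chars:
            continue  # too short to be a useful training line
        if gujarati_ratio(text) < min_ratio:
            continue  # mostly non-Gujarati (English, markup, digits, ...)
        if text in seen:
            continue  # exact duplicate after normalization
        seen.add(text)
        cleaned.append(text)
    return cleaned
```

Calling `clean_lines(raw_lines)` on a list of scraped strings returns only the lines that are long enough, mostly Gujarati script, and not exact duplicates. A real pipeline would likely need more than this (near-duplicate detection, for example), but the shape of the work is similar.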

The project went through multiple iterations:

Rebuilding data pipelines
Cleaning noisy datasets
Experimenting with tokenization strategies
Training and retraining language models
Improving speech recognition quality
Optimizing inference performance
Connecting everything into a single user experience
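As an example of what a tokenization experiment can look like, the sketch below shows the simplest baseline: a character-level tokenizer that maps each Unicode code point to an integer ID. This is an assumption for illustration only; the post does not say which strategy GuWiki settled on, and the `CharTokenizer` class is a hypothetical name.

```python
class CharTokenizer:
    """Character-level tokenizer: one integer ID per Unicode code point."""

    def __init__(self, corpus: str):
        # The vocabulary is every distinct code point seen in the corpus.
        self.vocab = sorted(set(corpus))
        self.stoi = {ch: i for i, ch in enumerate(self.vocab)}
        self.itos = {i: ch for ch, i in self.stoi.items()}

    def encode(self, text: str) -> list[int]:
        # Raises KeyError for characters that were not in the corpus.
        return [self.stoi[ch] for ch in text]

    def decode(self, ids: list[int]) -> str:
        return "".join(self.itos[i] for i in ids)


tok = CharTokenizer("ગુજરાતી ભાષા")
ids = tok.encode("ભાષા")
print(len(tok.vocab), ids, tok.decode(ids))
```

One design trade-off worth noting: at the code-point level, Gujarati vowel signs (matras) become separate tokens from the consonants they attach to, so sequences get longer than the number of visible syllables. That is one reason to compare this baseline against subword approaches.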

For the Finish-Up-A-Thon, I focused on turning these individual research efforts into a complete, usable product.

I improved:

Model integration
Frontend experience
Deployment workflows
Documentation
Performance optimizations
Repository structure
End-to-end user experience

The result is GuWiki: a fully working AI-powered knowledge platform designed for Gujarati-language users.

My Experience with GitHub Copilot

GitHub Copilot played a significant role throughout development.

While building GuWiki, I worked across multiple domains:

Data engineering
Machine learning
Deep learning
Model training
API development
Frontend development
Deployment

Switching contexts constantly can slow development, and Copilot helped reduce that friction.

Some of the ways it helped include:

Generating boilerplate code
Accelerating API development
Creating data processing utilities
Writing training scripts faster
Refactoring repetitive code
Generating documentation
Suggesting fixes during debugging

What I appreciated most was that Copilot allowed me to stay focused on solving the actual AI and language challenges instead of spending time on repetitive implementation details.

It felt less like autocomplete and more like a development companion that helped maintain momentum throughout the project.

Why This Project Matters

Most AI innovation today happens in a small number of major languages.

Millions of people speak Gujarati every day, but the ecosystem of open-source AI tools for the language is still developing.

GuWiki is my contribution toward making AI more accessible for Gujarati speakers by building:

Open-source language models
Open-source speech models
Public datasets and pipelines
Real-world applications powered by those models

My goal is not only to build a product but to help strengthen the Gujarati AI ecosystem so future developers and researchers can build on top of it.

If even a small part of this work helps expand access to knowledge and AI for Gujarati speakers, then the project has already succeeded.

Repository & Resources

GitHub: https://github.com/aijadugar/GuWiki
Live App: https://gu-wiki.vercel.app/
Demo: https://www.loom.com/share/b5e278e624724d8f975e95b0fc3c6297
Gujarati LLM: https://huggingface.co/aijadugar/gujarati-nanogpt
Gujarati ASR: https://huggingface.co/aijadugar/gujarati-asr

⭐ If you find the project interesting, feel free to explore the repository and share feedback.
