What I Built
GuWiki is an AI-powered Wikipedia-style platform built specifically for the Gujarati language.
Gujarati is spoken by more than 60 million people worldwide, yet high-quality AI tools, language models, and speech technologies for Gujarati remain limited compared to English and other major languages.
I wanted to help close that gap.
Instead of relying on existing foundation models, I built the core AI components from scratch:
- A Gujarati Large Language Model (LLM)
- A Gujarati Automatic Speech Recognition (ASR) model
- A complete data engineering pipeline
- A Wikipedia-style knowledge platform powered by these models
Users can search, read, and interact with Gujarati knowledge using AI that understands the language natively.
AI Models
Gujarati NanoGPT (LLM)
https://huggingface.co/aijadugar/gujarati-nanogpt
A language model trained specifically on Gujarati text to understand vocabulary, grammar, and language patterns.
Gujarati ASR Model
https://huggingface.co/aijadugar/gujarati-asr
A speech-to-text model trained for Gujarati audio, enabling voice-based interaction and accessibility.
Demo
Live Application
https://gu-wiki.vercel.app/
Video Walkthrough
https://www.loom.com/share/b5e278e624724d8f975e95b0fc3c6297
Key Features
- AI-powered Gujarati knowledge search
- Native Gujarati language understanding
- Speech-to-text capabilities
- Custom-trained Gujarati LLM
- Custom-trained Gujarati ASR
- End-to-end data engineering pipeline
- Fast and responsive web interface
- Open-source repository for community contribution
The Comeback Story
This project started as an ambitious idea:
"Can I build AI infrastructure for Gujarati instead of simply consuming models built for English?"
The answer turned out to be much harder than expected.
The biggest challenge wasn't building the website.
It was the data.
Gujarati lacks the abundance of high-quality datasets available for English. I spent significant time collecting, cleaning, validating, and preparing Gujarati text and speech data before any model training could even begin.
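As a rough illustration of the kind of text-cleaning step described above, here is a minimal Python sketch. It is not the actual GuWiki pipeline: the function names (`gujarati_ratio`, `clean_lines`) and the thresholds (`min_ratio`, `min_chars`) are hypothetical choices made for this example. The only hard fact it relies on is that the Gujarati script occupies the Unicode block U+0A80 to U+0AFF.

```python
import unicodedata

# Gujarati Unicode block: U+0A80 to U+0AFF.
GUJARATI_START, GUJARATI_END = 0x0A80, 0x0AFF


def gujarati_ratio(text: str) -> float:
    """Fraction of letter/mark characters that fall in the Gujarati block."""
    # Unicode categories starting with "L" are letters, "M" are combining
    # marks (Gujarati vowel signs and the anusvara are marks).
    letters = [ch for ch in text if unicodedata.category(ch)[0] in ("L", "M")]
    if not letters:
        return 0.0
    gujarati = sum(GUJARATI_START <= ord(ch) <= GUJARATI_END for ch in letters)
    return gujarati / len(letters)


def clean_lines(lines, min_ratio=0.8, min_chars=10):
    """Normalize, filter, and deduplicate lines (thresholds are assumptions)."""
    seen = set()
    kept = []
    for line in lines:
        # NFC normalization plus whitespace collapsing, so that visually
        # identical lines compare equal.
        normalized = " ".join(unicodedata.normalize("NFC", line).split())
        if len(normalized) < min_chars:
            continue  # too short to be useful training text
        if gujarati_ratio(normalized) < min_ratio:
            continue  # mostly non-Gujarati script
        if normalized in seen:
            continue  # exact duplicate
        seen.add(normalized)
        kept.append(normalized)
    return kept
```

Passing a list of raw lines to `clean_lines` returns only the lines that are long enough, mostly Gujarati script, and not exact duplicates. A real pipeline would likely need more than this, for example near-duplicate detection and handling of mixed Gujarati and English text.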
The project went through multiple iterations:
- Rebuilding data pipelines
- Cleaning noisy datasets
- Experimenting with tokenization strategies
- Training and retraining language models
- Improving speech recognition quality
- Optimizing inference performance
- Connecting everything into a single user experience
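To make the tokenization item above concrete, here is a sketch of one of the simplest strategies: a character-level tokenizer that assigns one integer ID per Unicode code point. This is an assumption for illustration only. The post does not say which tokenization the Gujarati model finally uses, and the `CharTokenizer` class is a hypothetical name, not part of the GuWiki repository.

```python
class CharTokenizer:
    """Character-level tokenizer: one integer ID per Unicode code point."""

    def __init__(self, corpus: str):
        # The vocabulary is every distinct code point seen in the corpus.
        vocab = sorted(set(corpus))
        self.stoi = {ch: i for i, ch in enumerate(vocab)}
        self.itos = {i: ch for ch, i in self.stoi.items()}

    @property
    def vocab_size(self) -> int:
        return len(self.stoi)

    def encode(self, text: str) -> list[int]:
        # Raises KeyError for characters that never appeared in the corpus.
        return [self.stoi[ch] for ch in text]

    def decode(self, ids: list[int]) -> str:
        return "".join(self.itos[i] for i in ids)
```

A tokenizer like this keeps the vocabulary tiny, but it splits Gujarati consonants, vowel signs, and conjuncts into separate tokens, so sequences get long. Subword methods such as byte-pair encoding trade a larger vocabulary for shorter sequences, which is one reason comparing strategies is worthwhile.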
For the Finish-Up-A-Thon, I focused on turning these individual research efforts into a complete, usable product.
I improved:
- Model integration
- Frontend experience
- Deployment workflows
- Documentation
- Performance optimizations
- Repository structure
- End-to-end user experience
The result is GuWiki: a fully working AI-powered knowledge platform designed around Gujarati language users.
My Experience with GitHub Copilot
GitHub Copilot played a significant role throughout development.
While building GuWiki, I worked across multiple domains:
- Data engineering
- Machine learning
- Deep learning
- Model training
- API development
- Frontend development
- Deployment
Switching contexts constantly can slow development, and Copilot helped reduce that friction.
In particular, it helped with:
- Generating boilerplate code
- Accelerating API development
- Creating data processing utilities
- Writing training scripts faster
- Refactoring repetitive code
- Generating documentation
- Suggesting fixes during debugging
What I appreciated most was that Copilot allowed me to stay focused on solving the actual AI and language challenges instead of spending time on repetitive implementation details.
It felt less like autocomplete and more like a development companion that helped maintain momentum throughout the project.
Why This Project Matters
Most AI innovation today happens in a small number of major languages.
More than 60 million people speak Gujarati, but the ecosystem of open-source AI tools for the language is still developing.
GuWiki is my contribution toward making AI more accessible for Gujarati speakers by building:
- Open-source language models
- Open-source speech models
- Public datasets and pipelines
- Real-world applications powered by those models
My goal is not only to build a product but to help strengthen the Gujarati AI ecosystem so future developers and researchers can build on top of it.
If even a small part of this work helps expand access to knowledge and AI for Gujarati speakers, then the project has already succeeded.
Repository & Resources
- GitHub: https://github.com/aijadugar/GuWiki
- Live App: https://gu-wiki.vercel.app/
- Demo: https://www.loom.com/share/b5e278e624724d8f975e95b0fc3c6297
- Gujarati LLM: https://huggingface.co/aijadugar/gujarati-nanogpt
- Gujarati ASR: https://huggingface.co/aijadugar/gujarati-asr
If you find the project interesting, feel free to explore the repository and share feedback.