BARI ANKIT VINOD

GuWiki: Building a Gujarati AI Wikipedia from Scratch 🇮🇳

GitHub “Finish-Up-A-Thon” Challenge Submission

What I Built

GuWiki is an AI-powered Wikipedia-style platform built specifically for the Gujarati language.

Gujarati is spoken by more than 60 million people worldwide, yet high-quality AI tools, language models, and speech technologies for Gujarati remain limited compared to English and other major languages.

I wanted to help close that gap.

Instead of relying on existing foundation models, I built the core AI components from scratch:

  • A Gujarati Large Language Model (LLM)
  • A Gujarati Automatic Speech Recognition (ASR) model
  • A complete data engineering pipeline
  • A Wikipedia-style knowledge platform powered by these models

Users can search, read, and interact with Gujarati knowledge using AI that understands the language natively.

AI Models

Gujarati NanoGPT (LLM)

https://huggingface.co/aijadugar/gujarati-nanogpt

A language model trained specifically on Gujarati text to understand vocabulary, grammar, and language patterns.

Gujarati ASR Model

https://huggingface.co/aijadugar/gujarati-asr

A speech-to-text model trained for Gujarati audio, enabling voice-based interaction and accessibility.


Demo

Live Application

👉 https://gu-wiki.vercel.app/

Video Walkthrough

🎥 https://www.loom.com/share/b5e278e624724d8f975e95b0fc3c6297

Key Features

  • AI-powered Gujarati knowledge search
  • Native Gujarati language understanding
  • Speech-to-text capabilities
  • Custom-trained Gujarati LLM
  • Custom-trained Gujarati ASR
  • End-to-end data engineering pipeline
  • Fast and responsive web interface
  • Open-source repository for community contribution

The Comeback Story

This project started as an ambitious idea:

"Can I build AI infrastructure for Gujarati instead of simply consuming models built for English?"

The answer turned out to be much harder than expected.

The biggest challenge wasn't building the website.

It was the data.

Gujarati lacks the abundance of high-quality datasets available for English. I spent significant time collecting, cleaning, validating, and preparing Gujarati text and speech data before any model training could even begin.
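To give a feel for what the text-cleaning step can look like, here is a minimal sketch. It is not the actual GuWiki pipeline: the function names and the `min_ratio` and `min_chars` thresholds are hypothetical, chosen only for illustration. The one fixed fact it relies on is that Gujarati script occupies the Unicode block U+0A80 to U+0AFF.

```python
import re
import unicodedata

# Gujarati Unicode block: U+0A80 to U+0AFF.
GUJARATI_CHAR = re.compile(r"[\u0A80-\u0AFF]")


def gujarati_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that fall in the Gujarati block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if GUJARATI_CHAR.match(c)) / len(chars)


def clean_corpus(lines, min_ratio=0.7, min_chars=10):
    """Normalize, de-duplicate, and keep lines that are mostly Gujarati.

    min_ratio and min_chars are hypothetical thresholds, not values
    taken from the GuWiki project.
    """
    seen = set()
    kept = []
    for line in lines:
        # NFC normalization so equivalent character sequences compare equal.
        text = unicodedata.normalize("NFC", line)
        # Collapse runs of whitespace and trim both ends.
        text = re.sub(r"\s+", " ", text).strip()
        if len(text) < min_chars:
            continue  # too short to be useful training text
        if gujarati_ratio(text) < min_ratio:
            continue  # mostly non-Gujarati (e.g. English or markup)
        if text in seen:
            continue  # exact duplicate after normalization
        seen.add(text)
        kept.append(text)
    return kept
```

Calling `clean_corpus` on a list of raw lines returns only the unique, mostly-Gujarati ones. A real pipeline would likely add near-duplicate detection and source-specific rules on top of a filter like this.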

The project went through multiple iterations:

  • Rebuilding data pipelines
  • Cleaning noisy datasets
  • Experimenting with tokenization strategies
  • Training and retraining language models
  • Improving speech recognition quality
  • Optimizing inference performance
  • Connecting everything into a single user experience
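For the tokenization experiments, the simplest baseline to picture is a character-level tokenizer, where every Unicode code point in the corpus gets its own integer id. The sketch below illustrates that baseline only. It is an assumption for explanation, not a description of the tokenizer the published model uses, and the `CharTokenizer` class name is hypothetical.

```python
class CharTokenizer:
    """Character-level tokenizer: one integer id per Unicode code point."""

    def __init__(self, corpus: str):
        # Sorted so the id assignment is deterministic for a given corpus.
        self.vocab = sorted(set(corpus))
        self.stoi = {ch: i for i, ch in enumerate(self.vocab)}
        self.itos = {i: ch for ch, i in self.stoi.items()}

    def encode(self, text: str) -> list[int]:
        # Raises KeyError for characters that were not in the corpus.
        return [self.stoi[ch] for ch in text]

    def decode(self, ids: list[int]) -> str:
        return "".join(self.itos[i] for i in ids)


tok = CharTokenizer("ગુજરાતી ભાષા")
ids = tok.encode("ભાષા")           # 4 ids, one per code point
assert tok.decode(ids) == "ભાષા"   # encoding round-trips losslessly
```

One thing this makes visible: a single written Gujarati syllable is often a consonant plus a separate vowel-sign code point, so a character-level scheme splits it into several tokens. That trade-off is one reason to compare it against subword approaches.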

For the Finish-Up-A-Thon, I focused on turning these individual research efforts into a complete, usable product.

I improved:

  • Model integration
  • Frontend experience
  • Deployment workflows
  • Documentation
  • Performance optimizations
  • Repository structure
  • End-to-end user experience

The result is GuWiki: a fully working AI-powered knowledge platform designed around Gujarati language users.


My Experience with GitHub Copilot

GitHub Copilot played a significant role throughout development.

While building GuWiki, I worked across multiple domains:

  • Data engineering
  • Machine learning
  • Deep learning
  • Model training
  • API development
  • Frontend development
  • Deployment

Switching contexts constantly can slow development, and Copilot helped reduce that friction.

Some of the ways it helped include:

  • Generating boilerplate code
  • Accelerating API development
  • Creating data processing utilities
  • Writing training scripts faster
  • Refactoring repetitive code
  • Generating documentation
  • Suggesting fixes during debugging

What I appreciated most was that Copilot allowed me to stay focused on solving the actual AI and language challenges instead of spending time on repetitive implementation details.

It felt less like autocomplete and more like a development companion that helped maintain momentum throughout the project.


Why This Project Matters

Most AI innovation today happens in a small number of major languages.

Millions of people speak Gujarati every day, but the ecosystem of open-source AI tools for the language is still developing.

GuWiki is my contribution toward making AI more accessible for Gujarati speakers by building:

  • Open-source language models
  • Open-source speech models
  • Public datasets and pipelines
  • Real-world applications powered by those models

My goal is not only to build a product but to help strengthen the Gujarati AI ecosystem so future developers and researchers can build on top of it.

If even a small part of this work helps expand access to knowledge and AI for Gujarati speakers, then the project has already succeeded.


Repository & Resources

⭐ If you find the project interesting, feel free to explore the repository and share feedback.
