"It is (language) the most vivid and crucial key to identity: It reveals the private identity, and connects one with, or divorces one from, the larger, public, or communal identity."
— James Baldwin (If Black English Isn't a Language, Then Tell Me, What Is?)
Language goes far beyond being a medium for communication; it shapes a community's culture, attitudes, politics, and lived experiences. As LLMs grow increasingly powerful and central to our digital lives, having your language genuinely represented inside them is critical.
For a linguistically diverse country like India, this is indeed a challenge. With hundreds of languages and dialects, each with its own script, grammar, and cultural context, building an AI that can truly understand and generate text in these languages is no small feat.
When Sarvam AI launched their models targeting Indian languages, I was excited to see an Indian company building models that prioritize our languages. Unfortunately, the Sarvam AI team did not share details about the training data. How much English vs. Hindi vs. Bengali vs. ... was in the training data? Was any language more available than others? Where did they source their data from? These questions were left unanswered, which sparked my curiosity and led to this analysis.
What Is a Tokenizer? How Is the Vocabulary Built?
Before I condense this vast topic into a few sentences, I would like to direct readers to Andrej Karpathy's video tutorial on tokenization. I don't think anybody explains this better than he does, and his video is the best resource I've found for understanding how tokenization works in LLMs.
Modern LLMs don't process text character by character. Instead, they break text into tokens - chunks that can be a character, a syllable, a whole word, or even a common phrase. Before model training even begins, model creators train a tokenizer (typically with an algorithm like Byte Pair Encoding) on their data to build a fixed-size vocabulary of the most useful tokens. This vocabulary is the foundation of how the model understands and generates language.
A token might be the English word running, the Hindi word कार्यक्रम (program/event), or even a multi-word phrase like United Nations. The richer your vocabulary is in a given language, the more fluently the model can read and generate text in that language - because it can process meaningful linguistic units directly, rather than having to laboriously assemble them from smaller fragments. I strongly encourage readers to watch Karpathy's video to understand this process in depth.
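To make this concrete, here is a toy sketch of the core idea: a greedy longest-match tokenizer. This is a simplification of what real BPE tokenizers do (they operate on learned merges over bytes), and the vocabulary below is made up purely for illustration - but it shows why a richer vocabulary yields fewer, more meaningful tokens.

```python
# Toy greedy longest-match tokenizer: split text into the longest chunks
# found in the vocabulary, falling back to single characters.
# The vocabularies below are hypothetical, for illustration only.

def tokenize(text: str, vocab: set) -> list:
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible match first, shrinking down to one char.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab or j == i + 1:
                tokens.append(piece)
                i = j
                break
    return tokens

# A vocabulary rich in a language yields one meaningful token...
rich = {"running", "कार्यक्रम"}
print(tokenize("running", rich))   # ['running']
# ...a poor vocabulary shatters the same text into characters.
print(tokenize("running", set()))  # ['r', 'u', 'n', 'n', 'i', 'n', 'g']
```

Real tokenizers also have to handle bytes rather than characters, which is exactly what causes the fragmentation discussed later in this post.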
What deliberate language investment looks like
Consider the contrast between two open-weight models: GPT-OSS (OpenAI) and Qwen (Alibaba). Both are general-purpose models with large vocabularies, but they have very different language priorities.
The chart makes the story immediate. Among tokens of meaningful length (three bytes or more), GPT-OSS dedicates just 3.86% of its vocabulary to Chinese. Qwen - the Chinese model - devotes 22.64%. That is nearly a sixfold difference.
This is no accident. China's leading AI labs are building models that prioritize Chinese linguistic fluency from the ground up. The vocabulary is a deliberate policy decision: the model's ability to understand and generate Chinese text is baked into its very foundation. For Chinese users, this means a much more fluent experience - the model can read and write in their language with far greater ease and accuracy.
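Measuring this kind of vocabulary share is straightforward: decode every token and check whether it contains characters from the script's Unicode block. The tiny vocabulary below is made up for illustration; a real analysis would iterate over a model's actual vocabulary (e.g., the decoded entries of `tokenizer.get_vocab()` in Hugging Face transformers).

```python
# Sketch: estimate what share of a vocabulary belongs to a given script
# by checking each token's characters against a Unicode block.

DEVANAGARI = (0x0900, 0x097F)  # block covering Hindi/Marathi/Sanskrit
CJK = (0x4E00, 0x9FFF)         # main CJK Unified Ideographs block

def in_block(token, block):
    lo, hi = block
    return any(lo <= ord(ch) <= hi for ch in token)

def script_share(vocab, block):
    hits = sum(1 for tok in vocab if in_block(tok, block))
    return hits / len(vocab)

# Hypothetical five-token vocabulary, for illustration only.
toy_vocab = ["running", "कार्यक्रम", "नमस्ते", "中国", "the"]
print(f"{script_share(toy_vocab, DEVANAGARI):.0%}")  # 40%
print(f"{script_share(toy_vocab, CJK):.0%}")         # 20%
```

The charts in this post apply the same idea to the full vocabularies of each model, restricted to tokens of three bytes or more.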
Now ask the same question about India.
How Do Today's Best LLMs Handle Indic Languages?
In this analysis, I compared four widely used models - Phi-4-14B (Microsoft), Qwen3.5-397B-A17B (Alibaba), GPT-OSS-120B (OpenAI), and Sarvam-105B (Sarvam AI) - across four major Indic languages: Hindi, Bengali, Telugu, and Tamil.

The chart tells a clear story about priorities.
Phi-4 has such low representation of Indic languages (near-zero, e.g., Hindi at 0.03%, Telugu at 0.00%) that it would be invisible on this chart, so I left it out. Qwen, despite its massive vocabulary, allocates very little to Indic languages (Hindi at 0.39%, Bengali at 0.22%), prioritizing Chinese instead. GPT-OSS presents a surprisingly meaningful baseline for a general-purpose model, with Hindi at 2.02% and Bengali at 1.09%. Sarvam, however, leads the pack by a clear margin: Hindi at 5.30%, Bengali at 3.86%, Telugu at 1.67%, and Tamil at 1.49% of its 262,144-token vocabulary - a showcase of what intentional investment in Indic languages looks like in practice.
Fluency
Let's see how well these models can read and write Indic languages in practice. In the following example, I fed the same sentence - "Hot tea and snacks on a rainy day is a delight" - written in English, Bengali, Hindi, Telugu, and Tamil into each tokenizer, and visualized how each model breaks it apart.
To interpret this visualization:
- Alternating grey and white blocks represent individual, complete tokens. A fluent model will naturally group common words, syllables, and characters into single blocks.
- Pink-highlighted sections indicate token fragmentation. Characters in Indic scripts span multiple UTF-8 bytes, and if a model's vocabulary doesn't include a character as a single token, it breaks the character into its raw byte components. This is a sign of low fluency - the model does not recognize the character as a meaningful unit, but as a sequence of bytes to be assembled.
Tip: Scroll right to see all models in the visualization.
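You can see why fragmentation hits Indic scripts so hard with a few lines of pure-stdlib Python: every Devanagari character costs three UTF-8 bytes, so a tokenizer that lacks the character must spend three byte-tokens where a fluent one spends a fraction of a token.

```python
# Why Indic text fragments: each character is several UTF-8 bytes, and a
# byte-level tokenizer without that character in its vocabulary must emit
# the raw bytes separately.

word = "कार्यक्रम"  # Hindi for "program/event"
print(len(word))                # 9 Unicode code points
raw = word.encode("utf-8")
print(len(raw))                 # 27 bytes: every character costs 3 bytes
# The single letter क alone expands to three bytes:
print([f"0x{b:02X}" for b in "क".encode("utf-8")])  # ['0xE0', '0xA4', '0x95']
```

A model with no Devanagari tokens therefore needs up to 27 tokens for a word that Sarvam can emit as one.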
In Phi-4, Indic text is almost entirely fragmented. Individual characters - even common vowel markers or conjunct consonants that appear in thousands of everyday words - get split across multiple tokens. The model is not reading Bengali; it is decoding it byte by byte.
In Qwen, the situation improves modestly - consistent with its vocabulary numbers. In GPT-OSS, real words start to appear as single coherent tokens: you can see the model's tokenizer recognizing larger coherent character chunks.
In Sarvam, the fluency is markedly better across all four Indic languages. Common words, longer phrases, and even grammatically modified word-forms appear as single tokens. The model has, in vocabulary terms, genuinely learned to read these scripts.
The Deepest Words: What Complexity Can Each Model Handle?
Models trained on rich linguistic data develop tokens that mirror complex grammatical structures. Essentially, a longer, morphologically richer vocabulary indicates that the model has absorbed more of the language's actual framework.
To measure this, I identified the longest tokens in each model's vocabulary. These are the most complex strings the model treats as a single unit rather than breaking them down into smaller pieces.
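A sketch of this measurement, using a made-up toy vocabulary (a real analysis would pass in the decoded entries of `tokenizer.get_vocab()` from Hugging Face transformers):

```python
# Sketch: find the longest tokens a vocabulary stores for a given script,
# i.e., the most complex strings the model treats as single units.
# toy_vocab is hypothetical, for illustration only.

def longest_in_script(vocab, lo, hi, n=3):
    in_script = [t for t in vocab
                 if t and all(lo <= ord(c) <= hi for c in t)]
    return sorted(in_script, key=len, reverse=True)[:n]

toy_vocab = ["विश्वविद्यालय", "कार्यक्रम", "क", "running", "नमस्ते"]
print(longest_in_script(toy_vocab, 0x0900, 0x097F))
# ['विश्वविद्यालय', 'कार्यक्रम', 'नमस्ते']
```

Sorting by code-point length is a simplification; sorting by UTF-8 byte length gives similar rankings for tokens within one script.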
Tip: Scroll right to see all models in the table below.
In English, all four models handle massive tokens - from iOS identifiers like .translatesAutoresizingMaskIntoConstraints to the full alphabet. English is so over-represented in training data that even niche technical strings are encoded as single units.
For Indic languages, the contrast is noticeable:
- Phi-4: It barely knows any Telugu or Tamil characters. Hindi and Bengali knowledge is limited to some vowels and consonants.
- Qwen: Qwen recognizes some basic short words like "আমাদের" (ours) or "कार्यक्रम" (program). Definitely a meaningful improvement over Phi-4, but still limited to simple words.
- GPT-OSS: This model has genuinely learned complex words. It recognizes terms like विश्वविद्यालय (university) and தெரிவித்துள்ளது (has informed) - a grammatically inflected verb that only a fluent reader would recognize as a single unit.
- Sarvam: Sarvam pushes the boundaries further. Its vocabulary includes highly inflected constructions like विद्यार्थ्यांनी (by the students) and ప్రపంచవ్యాప్తంగా (worldwide).
Note the complex suffixes, case markers, and compound words that Sarvam encodes as single tokens. This strongly suggests that Sarvam's tokenizer was trained on enough Indic data for such complex words to earn their own vocabulary entries. That, in turn, builds genuine linguistic fluency: the model can recognize and generate these complex forms without piecing them together from smaller units.
Numbers: A Window Into Intent
Let's do a similar analysis, but instead of looking at the longest tokens, let's look at the highest-value numeric tokens in each language.
Now this is where it gets interesting!
For English, both Phi-4 and GPT-OSS contain numbers up to the thousands - 999, 998, 997, all the way down to single digits. These are not learned from frequency; they are clearly hand-crafted entries, deliberately added by the vocabulary designers. Qwen and Sarvam, on the other hand, hand-crafted only the ten digits (0–9) as vocabulary entries, and no larger numbers at all. This is a clear signal of intent: the model creators of Qwen and Sarvam wanted to reserve space in the vocabulary for more complex linguistic units, rather than filling it with a large number of numeric tokens.
However, Sarvam AI did not invest the same effort in other languages. Hindi and Bengali have good numeral coverage, presumably because of their larger presence in the training data, but Telugu and Tamil are missing some single-digit numerals. GPT-OSS has no Telugu or Tamil numerals at all; Sarvam has some, but not all of them: the digits 5, 6, and 7 are missing from Telugu, and the digits 1, 2, 4, 6, and 9 are missing from Tamil.
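The coverage check itself is simple: for each digit in the script's Unicode digit range, test whether the character exists as a single vocabulary entry. The toy vocabulary below is hypothetical - it just mimics the Telugu gaps described above.

```python
# Sketch of the digit-coverage check. Telugu digits occupy U+0C66..U+0C6F
# (zero through nine). toy_vocab is made up to mirror the gaps found in
# the real analysis; a real check would use the model's actual vocabulary.

telugu_digits = [chr(cp) for cp in range(0x0C66, 0x0C70)]  # ౦ .. ౯
toy_vocab = set(telugu_digits) - {"\u0c6b", "\u0c6c", "\u0c6d"}

missing = [d for d in telugu_digits if d not in toy_vocab]
print(missing)  # the digits for 5, 6, and 7: ['౫', '౬', '౭']
```

The same loop over U+0BE6..U+0BEF runs the equivalent check for Tamil digits.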
It is important to be precise about what this means. These models can still generate the missing digits, just not fluently. To do so, the model must assemble each numeral from its raw byte components instead of emitting it directly. For example, to produce the Telugu digit six (౬), the model must emit the UTF-8 byte tokens 0xE0, 0xB1, 0xAC in that exact order - a multi-step process compared to emitting a single pre-learned token. The model has the ability - but not the fluency.
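You can verify that byte sequence yourself with Python's UTF-8 codec:

```python
# The Telugu digit six (U+0C6C) really does encode to the three bytes a
# model lacking the character as a single token must emit one by one.

six = "\u0c6c"  # ౬
print([f"0x{b:02X}" for b in six.encode("utf-8")])  # ['0xE0', '0xB1', '0xAC']
# And decoding those three bytes in order recovers the digit:
print(b"\xe0\xb1\xac".decode("utf-8"))              # ౬
```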
This lack of fluency became clear when I asked Sarvam to generate the Telugu digits 5, 6, and 7 in different ways:
- write the number 567 in Telugu digits
- what the number 76 looks like in Telugu digits
- list all Telugu digits from zero to nine
- what comes after ౬ (6)?
The model either wrote the answer in words or generated the wrong digits; only very rarely, across repeated attempts, did it produce the correct ones. Try it yourself on the Sarvam website and see how it performs.
I also tried a similar exercise with Bengali numerals. It took 2-3 prompts to get there, but once it did, the model consistently generated correct results - and even performed simple addition between numbers.
Conclusion
Sarvam is the clear winner for Indic languages. This is a big leap forward - a genuine commitment to building fluency in Indian languages that other SOTA open-weight models have not matched. And I am excited to see what comes next.
GPT-OSS deserves a special mention. For a model not explicitly built for India, its Indic vocabulary coverage is commendable - arguably the second-best in this comparison for Hindi and Bengali. It shows how capable the OpenAI team is - they built a general-purpose model that still picked up a significant amount of Indic vocabulary, even without deliberate investment.
But some gaps in Telugu and Tamil - even in Sarvam - tell us there is still room for improvement. I am sure the Sarvam team is already working on it, building better vocabularies, collecting more datasets from underrepresented languages, and improving the model overall.
There's an interesting opportunity for us here. High-quality datasets in Indic languages could accelerate progress for any team building models for Indian languages. If communities were to publish such datasets, it could benefit not just Sarvam but also every organization looking to improve language coverage. The more openly available data we have, the better the models become - everyone benefits.
The world is changing very fast. The future of our next generation is being shaped by AI and LLMs, and it is happening now. Models built with better language data tend to work better for everyone. And that's something worth thinking about as we build the future.
If you are interested in exploring this topic further, I have published the code and data for this analysis on GitHub. You can run the code yourself, modify it, and see how different models perform with different languages. I hope this inspires more people to dive into the fascinating world of tokenization and language modeling!
Thanks to @digital-humanist for reviewing the blog and providing feedback.
