Daud Ibrahim
Building an LLM From Scratch for Indic Languages: What No One Tells You About the Hard Parts

Context: In 2023, I was part of the core team at Krutrim (OLA's AI subsidiary) working on pre-training India's first large language model with deep Indic language coverage. This series documents not just the technical pipeline, but the decision-making framework behind each stage: the tradeoffs we reasoned through, the dead ends we hit, and why many of the choices we made were genuinely novel, because the reference material simply did not exist yet.

There is a particular kind of engineering challenge that sits at the intersection of research ambiguity and production pressure. Pre-training a large language model from scratch is one of them. Pre-training one for Indian languages, in 2023, without precedent, without benchmarks, without existing tokenizers, and without a community of practitioners who had done it before: that is a different challenge altogether.

This article is the first in a series. I want to use it to do something most technical writing does not do well: explain not what we built, but how we thought about what to build, and why the sequencing of decisions mattered as much as the decisions themselves.

If you are reading this as a practitioner considering your own pre-training run, I hope this gives you a map. If you are reading this as someone evaluating the depth and originality of this work, I hope it demonstrates that what we did at Krutrim was genuinely first-principles engineering, not a recipe followed from a paper.


The question before the pipeline: why scratch?

The first and most consequential decision in any LLM project is not architectural. It is strategic: do you fine-tune an existing model, or do you build from scratch?

In 2023, the dominant English-centric models (LLaMA, Falcon, MPT) were already strong. The instinct from the outside is: adapt one of these. Fine-tune on Indic data, add some multilingual tokens, and call it done. We considered this seriously. We rejected it, and the reasoning behind that rejection shaped everything that followed.

The problem is not that these models lacked Indic training data. The problem is that their tokenizers were trained on English-dominated corpora. A tokenizer is not just a pre-processing step; it is the model's alphabet. When you take a Hindi or Tamil sentence and pass it through a BPE tokenizer trained overwhelmingly on English, you do something quietly catastrophic: you fragment the text into sub-word units that carry no semantic coherence in the target language. A single Devanagari word might tokenize into six, seven, eight individual tokens. This is what researchers call the fertility problem.

Fertility refers to the average number of tokens required to represent a single word. For English words in GPT-2's tokenizer, fertility is typically 1.0–1.3. For Hindi words in the same tokenizer, fertility can exceed 4.0. This means the model's context window — its effective working memory — is consumed four times faster on Hindi text. You lose most of the model's capacity before a single inference is complete. Fine-tuning cannot fix this, because the tokenizer's vocabulary is frozen at pre-training.
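The fertility metric itself is simple to compute. Below is an illustrative sketch: `char_chunk_tokenize` is a hypothetical stand-in for a real tokenizer's `encode` method, crudely mimicking how an English-centric BPE shatters an unfamiliar script into small fixed-size fragments.

```python
def fertility(texts, tokenize):
    """Average number of tokens per whitespace-delimited word."""
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words

# Hypothetical stand-in tokenizer: splits every word into fixed-size
# character chunks, the way an English-centric BPE fragments scripts
# it has few merges for.
def char_chunk_tokenize(text, chunk=2):
    return [w[i:i + chunk] for w in text.split() for i in range(0, len(w), chunk)]

texts = ["नमस्ते दुनिया"]  # "Hello world" in Hindi (Devanagari)
print(round(fertility(texts, char_chunk_tokenize), 2))  # 3.0: each word in three pieces
```

With a real tokenizer you would pass `lambda t: tokenizer.encode(t)` in place of the stand-in and compare fertility per language against your target.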

This single observation, that tokenizer fertility is a structural bottleneck rather than a tunable hyperparameter, made the decision for us. If we wanted a model that could reason in Hindi, Tamil, Telugu, Kannada, Bengali, and the other major Indic languages with genuine capability, we had to own the full stack. That meant building our own tokenizer, on our own data, from the ground up. And once you commit to building your own tokenizer, you have committed to pre-training from scratch. There is no shortcut from that point.

Leadership Insight
The decision to build from scratch was not driven by ambition. It was driven by a clear-eyed analysis of what fine-tuning structurally cannot fix. Knowing when adaptation is insufficient — and having the conviction to make the harder call — is the work of a technical leader, not an implementer.


The full pipeline, and why the order is not arbitrary

Before going deep on each stage, I want to lay out the complete sequence and explain why the ordering matters. This is not a waterfall process (it is a cycle), but the initial ordering reflects a set of hard dependencies that are not immediately obvious.

Data Collection → Deduplication → Quality Filtering → Tokenizer Training → Architecture Design → Infra & Parallelism → Small-Model Cycle → Base Model Eval → Instruction Fine-Tuning → Alignment → Benchmark Eval

Each arrow in that sequence represents a dependency. You cannot train your tokenizer before your data is filtered, because the tokenizer's vocabulary will be contaminated by noise. You cannot finalize your architecture before your tokenizer is trained, because context window efficiency depends on vocabulary size and fertility. You cannot run large-scale pre-training before you have validated your pipeline on a small model, because at that scale, a bug does not just waste a training run; it wastes weeks of GPU time and real money.

Let me walk through the thinking at each stage.


Stage 1. Data collection: the invisible constraint

Data collection sounds like a solved problem. It is not, especially for Indic languages. In 2023, the Common Crawl corpus, the backbone of most large-scale LLM training, contained approximately 70% English content by volume. High-resource Indic languages like Hindi and Bengali had a meaningful but small footprint. Low-resource Indic languages like Maithili, Odia, or Sindhi were barely represented.

This created an immediate strategic tension. We needed enough data to train a tokenizer with good Indic script coverage. We needed enough data to pre-train a model that would perform meaningfully across multiple languages. But we also had to be honest about the quality ceiling of web-scraped text: not all text is equal, and in the Indic web, the signal-to-noise ratio can be harsh. A significant portion of what gets scraped is transliterated text (Hindi written in Latin script rather than Devanagari), machine-translated content of poor quality, forum spam, and code-switched text that alternates between English and a regional language mid-sentence without semantic coherence.

Our collection strategy layered multiple sources with different trust levels. Crawled web data formed the high-volume, lower-trust base. Academic and journalistic corpora (digitized newspapers, government documents, educational texts) formed a higher-trust but lower-volume layer. We also seeded translation pairs from curated multilingual sources, not to train a translation model, but to give the model exposure to formally correct Indic text with semantic alignment to a known reference.

What Did Not Exist in 2023
Curated Indic-language datasets of a quality comparable to The Pile (English) or ROOTS (multilingual) did not exist at scale for our target languages. There was no IndicGLUE-equivalent for pre-training data. We were assembling our own corpus from scratch, making quality judgements that had no established benchmark to validate against. Every filtration decision was, in some sense, a hypothesis.


Stage 2. Deduplication: the step most teams underweight

There is a common misconception that filtration and deduplication are the same stage. They are not. Deduplication is a structural quality problem; filtration is a content quality problem. Conflating them leads to a subtle but serious error: you can filter your data perfectly for content quality, and still train on a corpus where 30% of your documents are near-duplicates of each other.

Why does this matter? Deduplication affects what the model memorizes versus what it generalizes. A model that sees the same news article 200 times across different scraped mirrors, archives, and re-publications will overfit to that document in a way that degrades its ability to generalize. More dangerously, it will assign high confidence to the specific phrasing and facts in that document, which is a form of hallucination amplification.

We applied deduplication at multiple granularities: exact-match document deduplication (trivial but necessary), near-duplicate detection using MinHash LSH at the document level, and paragraph-level fuzzy matching to catch templated or boilerplate content: the kind of repetitive legal disclaimers and website footers that scrapes tend to accumulate in large quantities. Each pass reduced our corpus, which meant tighter tradeoffs between data volume and data quality.
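To make the MinHash idea concrete, here is a minimal, self-contained sketch. In production you would use an optimized library with LSH banding for sub-quadratic candidate lookup; the hash scheme, shingle size, and permutation count below are illustrative, not what we ran.

```python
import hashlib

def shingles(text, n=3):
    """Represent a document as its set of character n-grams."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_signature(features, num_perm=64):
    """One minimum hash value per seeded hash function.
    P(min-hashes agree) equals the Jaccard similarity of the feature sets."""
    return [
        min(int(hashlib.md5(f"{seed}:{f}".encode()).hexdigest(), 16)
            for f in features)
        for seed in range(num_perm)
    ]

def estimated_jaccard(sig_a, sig_b):
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

a = minhash_signature(shingles("the quick brown fox jumps over the lazy dog"))
b = minhash_signature(shingles("the quick brown fox jumped over the lazy dog"))
print(estimated_jaccard(a, b))  # high: the two sentences are near-duplicates
```

Documents whose estimated similarity exceeds a threshold (commonly around 0.8) are clustered, and all but one representative per cluster is dropped.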

Stage 3. Data filtration: making quality decisions without a ground truth

Filtration is where the team's judgment matters most, because there is no universally correct filtration strategy. You are making a series of probabilistic bets about what "quality" means for the task ahead.

The filters we applied operated on several dimensions. Language identification was the first gate: we used a combination of fastText language identification and script-based heuristics to verify that a document was genuinely in its claimed language. This is surprisingly non-trivial for Indic text, where multiple scripts can represent the same language (Urdu and Hindi share a large vocabulary but use different scripts), and where transliteration is common.
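A script-based heuristic gate of the kind described above can be sketched with Unicode block ranges. The ranges below come from the Unicode standard; the threshold and function names are illustrative, not our production filter, which combined this signal with fastText.

```python
# Unicode block ranges for a few relevant scripts (per the Unicode standard).
SCRIPT_RANGES = {
    "devanagari": (0x0900, 0x097F),
    "bengali":    (0x0980, 0x09FF),
    "tamil":      (0x0B80, 0x0BFF),
    "telugu":     (0x0C00, 0x0C7F),
    "latin":      (0x0041, 0x024F),
}

def dominant_script(text, min_ratio=0.5):
    """Return the script covering at least min_ratio of alphabetic characters,
    or None for heavily mixed or unidentifiable text."""
    counts = {name: 0 for name in SCRIPT_RANGES}
    letters = 0
    for ch in text:
        if not ch.isalpha():   # skip digits, punctuation, combining marks
            continue
        letters += 1
        cp = ord(ch)
        for name, (lo, hi) in SCRIPT_RANGES.items():
            if lo <= cp <= hi:
                counts[name] += 1
                break
    if letters == 0:
        return None
    best = max(counts, key=counts.get)
    return best if counts[best] / letters >= min_ratio else None

print(dominant_script("नमस्ते दुनिया"))  # devanagari
```

A mismatch between the claimed language and the dominant script is a cheap, high-precision signal for flagging transliterated or mislabeled documents before the more expensive classifier runs.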

The second dimension was content quality scoring. We trained lightweight classifiers, essentially shallow models, on small samples of curated, high-quality text to score documents on fluency, coherence, and absence of spam markers. These classifiers were trained in each target language independently, because quality signals in Hindi look different from quality signals in Tamil.

The third dimension was toxicity and harm filtering. This is a different problem at the filtration stage than it is at the alignment stage. At filtration, we were removing documents that were overwhelmingly harmful: extremist content, explicit material, coordinated disinformation. We were not trying to make the model refuse harmful questions at this stage; that work happens later. But if you pre-train on a corpus saturated with hate speech, you create an alignment problem that is much harder to fix during fine-tuning.

The Iteration Principle
We learned quickly that filtration cannot be designed perfectly in the first pass. The right approach is to run a small model training on a filtered slice of data, evaluate its output qualitatively, and use the model's failure modes as a signal to revise the filtration criteria. Bad filtration shows up as model behaviour before it shows up in any pre-training metric. This is why the small-model experimentation cycle feeds back into filtration — not just forward into larger training runs.


Stage 4. Tokenizer training: the decision with the longest shadow

The tokenizer is the most consequential single artefact in the entire pipeline, because it is fixed at pre-training and cannot be meaningfully changed afterwards without retraining. Every downstream decision (context window size, model capacity, generation speed) is shaped by what the tokenizer does.

For Indic languages, the core challenge is script diversity. Hindi, Marathi, Maithili, and Sanskrit all use Devanagari. Tamil, Telugu, Kannada, Malayalam, and Odia each use their own distinct script. Bengali shares a script with Assamese. The tokenizer must achieve good coverage across all of these without ballooning the vocabulary size to a point where the embedding table becomes prohibitively large, or the output softmax too expensive to compute.

We chose a BPE (Byte Pair Encoding) approach with a vocabulary size calibrated against fertility targets: we wanted average fertility below 1.5 tokens per word across our target languages. Achieving this required careful corpus weighting during tokenizer training: if your tokenizer training corpus is dominated by English data, the tokenizer will allocate too many vocabulary slots to English sub-words at the expense of Indic sub-words.
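One common recipe for this kind of corpus weighting, shown here for illustration rather than as our exact method, is exponentially smoothed sampling (popularized by multilingual models such as XLM-R): sample language i with probability proportional to n_i raised to a power alpha below 1, which upweights low-resource languages relative to their raw share.

```python
def sampling_weights(corpus_sizes, alpha=0.3):
    """Exponentially smoothed sampling probabilities: p_i proportional to
    n_i ** alpha. With alpha < 1, low-resource languages get a larger share
    of the tokenizer training corpus than their raw document counts imply."""
    raw = {lang: n ** alpha for lang, n in corpus_sizes.items()}
    total = sum(raw.values())
    return {lang: w / total for lang, w in raw.items()}

# Hypothetical document counts, purely illustrative.
sizes = {"english": 1_000_000, "hindi": 100_000, "maithili": 1_000}
for lang, p in sampling_weights(sizes).items():
    print(f"{lang}: {p:.3f}")
```

English drops from roughly 91% of the raw corpus to about 61% of the sampled mix, while Maithili rises from under 0.1% to nearly 8%: enough exposure for the BPE merges to learn its sub-words.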

We also had to make decisions about how to handle code-switching, which is endemic in modern Indian text. A natural conversation in Hindi on social media will freely mix Devanagari, Romanized Hindi, and English. We tested different strategies (treating Romanized Hindi as its own vocabulary segment, folding it into the Latin character set, or normalizing it back to Devanagari) and evaluated the downstream impact on the small model before committing.

No Existing Indic Tokenizer as Reference
In 2023, there was no publicly available tokenizer trained specifically for the full diversity of Indic languages at the scale we needed. IndicBERT's tokenizer was built for a different model class and vocabulary size. We were navigating tokenizer design without a directly comparable reference — every fertility analysis and vocabulary decision was derived from first principles and internal experimentation.


Stage 5. Architecture decisions: choosing your tradeoffs before training

By the time you reach the architecture stage, you have made decisions that constrain your choices significantly. Your tokenizer vocabulary size determines your embedding dimension lower bound. Your target context length determines your memory requirements. Your available compute determines your viable parameter count.

The core architectural question in 2023 was which transformer variant to build on. The landscape included MPT (MosaicML's Pretrained Transformer), Mistral, and the LLaMA family, each representing a different set of design choices.

Attention mechanism: MHA, GQA, or MQA?

Multi-Head Attention (MHA) is the original formulation: every head has its own key and value projections. Multi-Query Attention (MQA), introduced by Shazeer in 2019 and used in the PaLM architecture, shares a single key-value head across all query heads, reducing KV cache size dramatically during inference. Grouped-Query Attention (GQA), used in LLaMA-2 and Mistral, is a compromise: groups of query heads share a key-value head.

The tradeoff is not just performance; it is deployment economics. MQA reduces inference memory cost but can hurt model quality on certain tasks. GQA largely recovers that quality while still offering meaningful memory savings. For a model intended to run efficiently on production infrastructure (not just in a research lab), this was a practical decision, not a theoretical one. We chose GQA because our evaluation on small models showed acceptable quality degradation relative to MHA, and the inference efficiency gains were substantial enough to matter at deployment.
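The deployment economics are easy to make concrete. The KV cache formula below is standard; the 7B-class configuration numbers are hypothetical, chosen only to show how the attention variant scales the cache.

```python
def kv_cache_bytes(num_layers, seq_len, num_kv_heads, head_dim, bytes_per_param=2):
    """Per-sequence KV cache size: one K and one V tensor per layer."""
    return 2 * num_layers * seq_len * num_kv_heads * head_dim * bytes_per_param

# Illustrative 7B-class config (hypothetical numbers): 32 layers,
# 32 query heads of dim 128, a 4096-token context, fp16 cache.
mha = kv_cache_bytes(32, 4096, num_kv_heads=32, head_dim=128)  # every head has K/V
gqa = kv_cache_bytes(32, 4096, num_kv_heads=8,  head_dim=128)  # 4 query heads per KV head
mqa = kv_cache_bytes(32, 4096, num_kv_heads=1,  head_dim=128)  # one shared KV head

# Prints: MHA 2.0000 GiB, GQA 0.5000 GiB, MQA 0.0625 GiB
print(f"MHA {mha / 2**30:.4f} GiB, GQA {gqa / 2**30:.4f} GiB, MQA {mqa / 2**30:.4f} GiB")
```

Per concurrent sequence, GQA with 8 KV heads cuts the cache by 4x relative to MHA; that translates directly into how many requests a single GPU can serve.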

Positional encoding: RoPE, ALiBi, or learned?

Learned positional embeddings, used in models like GPT-2 and BERT, do not generalize beyond the sequence length seen during training. This is a hard constraint. RoPE (Rotary Position Embedding), used in the LLaMA family and later Mistral, encodes position as a rotation in the complex plane and generalizes better to sequences longer than its training context. ALiBi (Attention with Linear Biases) applies a linear distance penalty directly to attention scores, which also generalizes to longer sequences at inference. We chose RoPE for its stronger empirical track record across the language modelling tasks closest to our use case.
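To make RoPE concrete, here is a minimal sketch of the rotation applied to a single head's query or key vector. Real implementations vectorize this and cache the angles, and the dimension-pairing convention varies by codebase; this is the idea, not any particular library's layout.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive (even, odd) dimension pairs of vec by a
    position-dependent angle; each pair uses a different frequency."""
    d = len(vec)
    out = list(vec)
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)   # lower dims rotate faster
        c, s = math.cos(theta), math.sin(theta)
        out[i]     = vec[i] * c - vec[i + 1] * s
        out[i + 1] = vec[i] * s + vec[i + 1] * c
    return out

q = [1.0, 0.0, 0.5, -0.5]
rotated = rope(q, pos=7)
# Rotation preserves the vector's norm; attention scores between rotated
# queries and keys end up depending only on their relative positions.
print(math.isclose(sum(x * x for x in q), sum(x * x for x in rotated)))  # True
```

That relative-position property is what lets RoPE degrade gracefully past the training context instead of falling off a cliff.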

Normalization and activation

Layer normalization placement (pre-norm vs post-norm) has a direct impact on training stability. Post-norm (as in the original transformer) is notoriously difficult to train at large scale without careful learning rate management. Pre-norm, used in the modern families, significantly improves stability. We used RMSNorm (a simplified variant of LayerNorm that omits the mean-centering step), which is computationally cheaper and has been shown to match LayerNorm quality. Our activation function choice was SwiGLU, the gated linear unit variant used in PaLM and LLaMA, which consistently outperforms ReLU and GeLU on language modelling benchmarks.
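Both components are small enough to write out directly. A minimal sketch in plain Python, with the weight layout chosen for readability rather than performance:

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: scale by the root-mean-square only; no mean-centering,
    no bias, one fewer reduction than LayerNorm."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

def swish(v, beta=1.0):
    return v / (1.0 + math.exp(-beta * v))

def swiglu(x, w_gate, w_up):
    """SwiGLU feed-forward gate: Swish(x @ W_gate) elementwise-times (x @ W_up).
    w_gate and w_up are lists of hidden-unit columns, each of length len(x)."""
    gate = [swish(sum(a * b for a, b in zip(x, col))) for col in w_gate]
    up = [sum(a * b for a, b in zip(x, col)) for col in w_up]
    return [g * u for g, u in zip(gate, up)]
```

The gating means half of the feed-forward projection learns what to let through rather than what to compute, which is the usual explanation for its edge over plain ReLU or GeLU MLPs.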

The Architecture Decision Framework
We did not pick architectural components based on what was newest or most cited. We evaluated each component against three criteria: training stability impact, inference cost, and quality on our small-model proxy tasks in Indic languages. An architectural choice that works well for English is not guaranteed to work equally well when your token vocabulary and linguistic structure are fundamentally different.


Stage 6. Infrastructure: parallelism is not optional at scale

Pre-training at scale is not a single GPU problem. Even a 7 billion parameter model does not fit in the memory of a single A100 GPU when you account for activations, gradients, and optimizer states. You need distributed training, and distributed training is one of the most operationally complex aspects of the entire pipeline.

The three fundamental dimensions of model parallelism are data parallelism (distributing batches across GPUs), tensor parallelism (splitting individual matrices across GPUs), and pipeline parallelism (splitting layers across GPUs). Each introduces different communication overhead and synchronization complexity. Getting this wrong does not just slow you down; it produces subtly incorrect gradients that corrupt training in ways that can take days to diagnose.

We evaluated PyTorch's FSDP (Fully Sharded Data Parallel) and the Megatron-LM framework from NVIDIA. FSDP is more flexible and integrates naturally with the HuggingFace ecosystem. Megatron-LM has more mature tensor and pipeline parallelism implementations and better GPU utilization at large scale. Our decision to use Megatron-LM for the large-scale runs came down to one number: GPU utilization. At the scale we were operating, the difference between 40% and 60% GPU MFU (Model FLOPs Utilization) is not a footnote; it is the difference between a training run that completes on budget and one that does not.

Gradient clipping deserves specific mention here because it is often treated as a minor implementation detail when it is, in fact, a critical training stability lever. We clipped gradients by global norm, with a threshold set conservatively at 1.0. In practice, we monitored the gradient norm distribution throughout training and adjusted this threshold during the early warmup phase. Loss spikes (sudden, large increases in training loss that can partially or fully destabilize a training run) are often preceded by gradient norm spikes. Building early-warning monitoring for this saved us from losing multiple training runs.
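Clipping by global norm is only a few lines. The sketch below mirrors what frameworks like PyTorch's `torch.nn.utils.clip_grad_norm_` do, flattened to plain Python for clarity; in real training the gradients live in sharded tensors and the norm is reduced across devices.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """If the global L2 norm of all gradients exceeds max_norm, scale every
    gradient down uniformly so the norm equals max_norm. Returns the clipped
    gradients and the pre-clip norm, which is worth logging every step."""
    global_norm = math.sqrt(sum(g * g for g in grads))
    if global_norm <= max_norm:
        return list(grads), global_norm
    scale = max_norm / global_norm
    return [g * scale for g in grads], global_norm

grads = [3.0, 4.0]  # global norm is sqrt(3**2 + 4**2) = 5
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
print(norm)  # 5.0
```

The returned pre-clip norm is the number to alert on: a sustained climb in that series is the early warning that precedes a loss spike.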


Stage 7. The small-model cycle: science before scale

This is the stage that distinguishes teams who have actually done pre-training from teams who have read about it. Before you commit GPU-weeks to training a large model, you train small models (in our case, 125M, 350M, and 1B parameter configurations) on representative data slices, with identical pipeline configurations. These runs are not a warmup. They are a rigorous diagnostic environment.

The small-model cycle lets you answer a specific set of questions that cannot be answered any other way. Is your data pipeline producing correctly shuffled, correctly formatted training batches? Is your tokenizer producing the expected fertility distribution across languages? Is your learning rate schedule appropriate for your batch size and model size? Are there silent bugs in your custom attention implementation that only manifest as subtly degraded perplexity curves?

We used scaling laws, specifically the Chinchilla scaling laws, as a compass during this phase. Scaling laws give you a principled way to extrapolate from small-model results to predict large-model performance at a given compute budget. More importantly, they tell you whether your actual loss curves are tracking the theoretical prediction. A systematic deviation from the predicted scaling curve in a small model is a red flag that something in your pipeline is wrong, even if the training appears to be running normally.
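For intuition, the Chinchilla parametric loss and the "roughly 20 tokens per parameter" compute-optimal rule of thumb can be written down directly. The fitted coefficients below are the commonly cited values from Hoffmann et al. (2022), used here purely for illustration; your own small-model runs would fit their own curve.

```python
def chinchilla_loss(params, tokens,
                    E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Parametric loss fit from Hoffmann et al. (2022):
    L(N, D) = E + A / N**alpha + B / D**beta,
    where N is parameter count and D is training tokens."""
    return E + A / params ** alpha + B / tokens ** beta

# Compute-optimal rule of thumb: about 20 tokens per parameter,
# so a 7B model wants on the order of 140B training tokens.
print(f"{20 * 7e9:.1e} tokens")  # 1.4e+11 tokens
```

During the small-model cycle, the check is the other direction: plot measured loss against this curve for your 125M/350M/1B runs, and treat a systematic gap as a pipeline bug until proven otherwise.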

The small-model cycle is not about building a worse version of the final model. It is about creating the conditions under which you can discover problems cheaply, before the cost of a mistake is measured in weeks of training time.

The loop at this stage was: train small model → evaluate on proxy tasks → identify failure modes → trace failure modes back through the pipeline (filtration? tokenization? data distribution?) → correct at source → retrain small model → repeat. We went through this loop multiple times before we were confident enough in the pipeline to scale up. The instinct to skip this step to move faster is understandable. It is also how teams lose their largest training runs to avoidable bugs.


Stage 8. Base model evaluation: before any fine-tuning

A base language model (the output of pre-training, before any instruction tuning) is not an assistant. It is a distribution over text continuations. Evaluating it requires a different frame than evaluating a chat model.

The primary signal for a base model is perplexity on held-out data: ideally data that is representative of each target language and domain, and was not seen during training. But perplexity alone is insufficient, because a model can achieve low perplexity on Indic text by essentially memorizing high-frequency document structures without genuinely understanding the language.
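The relationship between per-token loss and perplexity is worth writing down, since it is the number everything else gets compared against:

```python
import math

def perplexity(token_nlls):
    """Perplexity = exp(mean negative log-likelihood per token).
    Equivalently: the size of a uniform vocabulary over which the model
    would be equally uncertain."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# A model assigning probability 0.25 to every token has perplexity ~4:
# it is as uncertain as a uniform choice among four options.
nlls = [-math.log(0.25)] * 10
print(perplexity(nlls))
```

One caveat that matters for multilingual comparison: perplexity is per token, so two models with different tokenizer fertilities are not directly comparable on the same text without normalizing to a per-character or per-byte basis.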

We supplemented perplexity with few-shot evaluation on classification and generation tasks that require genuine linguistic understanding: natural language inference in Hindi, named entity recognition in multiple Indic scripts, and cross-lingual retrieval. These evaluations were largely hand-crafted by our team, because standardized Indic benchmarks of the quality of SuperGLUE or BIG-Bench simply did not exist in 2023.

The Benchmark Void
One of the most structurally difficult aspects of working on Indic language models in 2023 was the absence of evaluation benchmarks. IndicGLUE covered some classification tasks but not the generation capabilities central to LLM evaluation. There was no Indic equivalent of MMLU, HellaSwag, or TruthfulQA. We were evaluating against standards we had to partially design ourselves — which means our evaluation was only as good as our ability to define what success looked like.

We also ran language identification probes, a technique borrowed from interpretability research, to verify that different layers of the model had indeed developed language-specific representations. A model that genuinely understands Hindi will, in its intermediate representations, have clearly separable activation patterns for Hindi and English text. A model that has merely learned surface statistics of Indic characters will not. This kind of probing evaluation gave us confidence that the model was building genuine multilingual capability rather than pattern-matching on script identity.


Stage 9. Instruction fine-tuning: transforming a predictor into an assistant

A base language model predicts the next token. An assistant responds to instructions. These are fundamentally different behaviours, and the transition between them requires supervised fine-tuning on instruction-response pairs.

This stage is often under-theorized. People treat it as "just fine-tuning." But the quality and coverage of your instruction data determines the quality of your assistant more directly than almost any architectural decision. In the Indic context, this problem is acute: there was, in 2023, essentially no publicly available, high-quality Indic instruction dataset. We had to construct our own: a combination of translated and culturally adapted English instruction datasets, human-written Indic instruction-response pairs, and programmatically generated instruction data verified by human review.

The framing of fine-tuning data also matters in ways that are not obvious until you see the failure modes. An instruction dataset that only teaches the model to answer factual questions will produce a model that cannot handle open-ended creative tasks. A dataset that over-represents formal Hindi will produce a model that sounds stiff in casual conversation. Coverage across task types, registers, and languages is the goal, and achieving it with limited data resources requires careful curation rather than volume maximization.


Stage 10. Alignment: the values layer

Instruction fine-tuning makes the model helpful. Alignment makes it safe. These are distinct properties, and conflating them is a mistake that leads to either over-refusal (the model refuses legitimate requests because it confuses helpfulness with harmlessness) or under-refusal (the model answers genuinely dangerous questions because the safety training was superficial).

In 2023, the dominant alignment approach was RLHF: Reinforcement Learning from Human Feedback. A reward model is trained on human preference comparisons, and the language model is then optimized using RL to maximize this reward. RLHF works, but it is operationally complex, requires a stable RL training loop on top of an already complex pre-training infrastructure, and is sensitive to reward model quality in ways that are hard to diagnose.

Direct Preference Optimization (DPO) was a relatively new alternative in 2023, which reframes the preference learning problem as a direct classification objective, eliminating the need for an explicit reward model and a separate RL training loop. We evaluated both approaches on small-scale experiments and found DPO to be more training-stable and easier to iterate on, at the cost of some theoretical elegance. The practical choice was DPO.
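The DPO objective is compact enough to state in full, which is part of its operational appeal. Here is a sketch on scalar log-probabilities; a real implementation operates on batched sequence log-probs summed over response tokens, with the reference model frozen.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO objective (Rafailov et al., 2023): a logistic loss on the
    difference of policy-vs-reference log-ratios for the chosen and
    rejected responses. No reward model, no RL loop."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# As the policy prefers the chosen response more strongly than the
# reference does, the loss drops below -log sigmoid(0) = log 2.
print(dpo_loss(-10.0, -30.0, -12.0, -25.0) < math.log(2))  # True
```

The `beta` term controls how far the policy is allowed to drift from the reference; it plays the role the KL penalty plays in RLHF, which is why the training loop stays a plain supervised one.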

Alignment for Indic models also has a cultural dimension that is invisible in the English-centric alignment literature. The definition of harmful content is not universal. What constitutes offensive speech, what topics are sensitive, and what refusals are appropriate vary significantly across the linguistic and cultural communities that speak Indic languages. Alignment data had to be designed with this heterogeneity in mind, and again, there was no reference dataset to draw from. Every decision was made in-house.


Stage 11. Final evaluation: benchmarks and what they miss

After alignment, the model goes through final evaluation before any deployment decision. This is where you attempt to make objective, comparable claims about model capability.

For English LLMs, this is a solved infrastructure problem. There are standard benchmarks (MMLU, HellaSwag, HumanEval, GSM8K, TruthfulQA), standard evaluation harnesses, and community results to compare against. For Indic LLMs in 2023, we were in a different situation. We adapted existing benchmarks where Indic-language versions existed, constructed evaluation sets in-house where they did not, and were transparent about the limitations of each approach.

One of the most important things I learned through this process is that benchmark performance and real-world utility have a complex relationship. A model can score well on standardized benchmarks while still producing outputs that feel unnatural to native Indic language speakers, because the benchmarks do not capture pragmatics, tone, register, or cultural appropriateness. The final signal we trusted most was qualitative evaluation by fluent speakers across our target languages, which cannot be reduced to a number but which surfaces failure modes that no benchmark catches.


What this series covers next

This overview has been intentionally high-level. Each stage I have described here is a month's worth of work, a set of decisions that could fill its own technical article, and a collection of mistakes that were instructive precisely because we did not have a playbook to follow. The subsequent articles in this series will go deep on each one:

Part 2: Building a Corpus for Languages the Internet Forgot
Data collection strategy, trust-level taxonomy, cross-script normalization, and the specific decisions we made to curate a high-quality Indic pre-training corpus from a noisy web.

Part 3: Deduplication and Filtration at Scale
MinHash LSH, perplexity-based filtering, language identification challenges, and how we iterated filtration criteria using small-model feedback.

Part 4: Tokenizer Design for Indic Scripts
Fertility analysis, vocabulary size calibration, code-switching handling, and the specific engineering decisions behind our custom BPE tokenizer.

Part 5: Architecture From First Principles
GQA vs MHA tradeoffs, RoPE positional encoding, RMSNorm, SwiGLU, and the full reasoning behind every architectural decision — including what we got wrong the first time.

Part 6: Distributed Training and Infrastructure
Tensor, pipeline, and data parallelism, Megatron-LM configuration, gradient clipping strategy, loss spike diagnosis, and checkpoint management.

Part 7: The Small-Model Cycle: Debugging Before Scale
How we used 125M–1B parameter proxy runs to validate every pipeline component before committing to large-scale training.

Part 8: Evaluating a Base Model Without Benchmarks
Perplexity, few-shot evaluation, language identification probing, and constructing our own evaluation framework in the absence of standard Indic benchmarks.

Part 9: Instruction Tuning, Alignment, and Final Evaluation
Building Indic instruction datasets from scratch, DPO vs RLHF, cultural dimensions of alignment, and what benchmarks miss about real-world model quality.



A note on timing. Everything described in this series happened in 2023, at a moment when the Indic language AI ecosystem was sparse enough that many of the tools and reference materials practitioners now take for granted did not yet exist. There were no high-quality Indic instruction datasets. There were no standardized benchmarks. There were no tokenizers designed for the full breadth of Indic scripts at the vocabulary sizes required for large-scale pre-training. Working in that environment forced a level of first-principles thinking that I think is worth documenting: not because the specific solutions we found are the only solutions, but because the decision-making process behind them is a useful map for anyone operating at the frontier of a new domain, before the community has converged on a standard playbook.


If this resonates with challenges you are working through, whether in Indic AI, low-resource language modelling, or LLM infrastructure more broadly, I would welcome the conversation. The subsequent articles will go considerably deeper. The goal is not to be comprehensive, but to be honest about what the work actually required.


LargeLanguageModels LLM LLMPretraining GenerativeAI MachineLearning DeepLearning NLP NaturalLanguageProcessing
Transformers FoundationModels IndicAI IndicNLP MultilingualAI IndicLanguages HindiNLP LowResourceNLP IndianLanguageAI Krutrim BharatAI DevanagariNLP
