<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Daud Ibrahim</title>
    <description>The latest articles on DEV Community by Daud Ibrahim (@daud_ibrahim_9887).</description>
    <link>https://dev.to/daud_ibrahim_9887</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3823700%2F5bd79896-94a2-45f0-b45d-5da09767ac23.jpg</url>
      <title>DEV Community: Daud Ibrahim</title>
      <link>https://dev.to/daud_ibrahim_9887</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/daud_ibrahim_9887"/>
    <language>en</language>
    <item>
      <title>Building an LLM from Scratch for Indic Languages Part 2</title>
      <dc:creator>Daud Ibrahim</dc:creator>
      <pubDate>Thu, 19 Mar 2026 11:22:30 +0000</pubDate>
      <link>https://dev.to/daud_ibrahim_9887/building-an-llm-from-scratch-for-indic-languages-part-2-2e0k</link>
      <guid>https://dev.to/daud_ibrahim_9887/building-an-llm-from-scratch-for-indic-languages-part-2-2e0k</guid>
      <description>&lt;h2&gt;
  
  
  Stage 1: Data Collection, the Invisible Constraint
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Everyone wants to talk about model architecture. Nobody wants to talk about where the data comes from."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I'm going to be honest with you before we begin.&lt;/p&gt;

&lt;p&gt;When most people imagine training a large language model from scratch, the mental image is dramatic: massive GPU clusters, elegant transformer architectures, loss curves smoothly descending toward something brilliant. The hard part, in this fantasy, is the math.&lt;/p&gt;

&lt;p&gt;It isn't.&lt;/p&gt;

&lt;p&gt;The hard part, the part that will consume more of your time, more of your engineering effort, and more of your emotional energy than anything else in this entire journey, is data. Specifically: finding it, cleaning it, filtering it, deduplicating it, detecting its language reliably, organising it into something a model can actually learn from, and then discovering that what you thought was enough is nowhere close to enough.&lt;/p&gt;

&lt;p&gt;And if you're building for Indic languages specifically? The constraint becomes even more brutal. The web wasn't built for Hindi. It wasn't built for Tamil, Telugu, Kannada, Malayalam, Odia, or Assamese. It was built in English, by English-speaking companies, largely for an English-speaking audience. Every data source you reach for will have this asymmetry baked into it.&lt;/p&gt;

&lt;p&gt;This post is about what actually happens at Stage 1. Not the idealised version. The real one.&lt;/p&gt;

&lt;h2&gt;
  
  
  The First Thing You Reach For: CommonCrawl
&lt;/h2&gt;

&lt;p&gt;If you've done any research into LLM training data, you've encountered CommonCrawl. It's a non-profit that has been crawling the web since 2008 and making the data freely available. It is, by a massive margin, the most cited data source in LLM training papers. GPT-3 used it. LLaMA used it. Falcon used it. Mistral's datasets are derived from it. It is, in the minds of most people starting this journey, the obvious answer to the data question.&lt;/p&gt;

&lt;p&gt;So let's start there. Let's take it seriously and understand what it actually is, because the gap between what you imagine CommonCrawl to be and what it is in practice is where most data collection mistakes live.&lt;/p&gt;




&lt;h2&gt;
  
  
  What CommonCrawl Actually Is: Inside the Format
&lt;/h2&gt;

&lt;p&gt;CommonCrawl releases monthly crawls. Each crawl is somewhere in the range of &lt;strong&gt;200–400 terabytes&lt;/strong&gt; of compressed data. Across all crawls since 2008, the archive is measured in petabytes. This sounds extraordinary. It is extraordinary. It is also not what you think it is.&lt;/p&gt;

&lt;p&gt;Each crawl is distributed across three file formats:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WARC (Web ARChive) files&lt;/strong&gt; are the raw captures. They contain the full HTTP response: headers, HTML markup, JavaScript, CSS, inline images (as base64 or references), cookies, everything. A single WARC file can be several hundred megabytes. The crawl index lists hundreds of thousands of them. If you download WARC files, you are downloading the raw internet: HTML tags, JavaScript boilerplate, navigation menus, cookie consent popups, SEO filler, and somewhere buried in all of it, actual human-written text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WET (Web Extracted Text) files&lt;/strong&gt; are what most NLP practitioners use. CommonCrawl pre-processes the WARC files and strips out HTML tags, JavaScript, and CSS, leaving only the extracted plain text. This is your primary starting point for language model training data. WET files are significantly smaller than WARCs, but they still contain enormous amounts of noise — boilerplate navigation text, footer content, repetitive legal disclaimers, advertising copy, machine-translated content, and gibberish.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WAT (Web Archive Transformation) files&lt;/strong&gt; contain metadata and link graphs extracted from the WARC files. These are useful for deduplication and for understanding the web's link structure, but not directly for training text.&lt;/p&gt;
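&lt;p&gt;To make the WET layout concrete, here is a minimal sketch of how a decompressed WET file splits into records. This is a toy parser for illustration only; a real pipeline should use a battle-tested reader such as the &lt;code&gt;warcio&lt;/code&gt; library. The separators and header names below follow the WARC container format that WET files share.&lt;/p&gt;

```python
from typing import Iterator

def iter_wet_records(raw: str) -> Iterator[tuple[dict, str]]:
    """Yield (headers, extracted_text) for each conversion record in a
    decompressed WET file. WET files reuse the WARC container format:
    each record starts with a 'WARC/1.0' line, and the header block is
    separated from the body by a blank line."""
    for chunk in raw.split("WARC/1.0\r\n"):
        if not chunk.strip():
            continue
        head, _, body = chunk.partition("\r\n\r\n")
        headers = {}
        for line in head.split("\r\n"):
            key, _, value = line.partition(": ")
            if key:
                headers[key] = value
        # 'conversion' records hold the extracted page text;
        # 'warcinfo' records are file-level metadata we skip here
        if headers.get("WARC-Type") == "conversion":
            yield headers, body.strip()
```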

&lt;h3&gt;
  
  
  The Scale Problem: What You Actually Have to Download
&lt;/h3&gt;

&lt;p&gt;Here is where the practical reality starts diverging sharply from the theory.&lt;/p&gt;

&lt;p&gt;A single CommonCrawl crawl is organised into &lt;strong&gt;segments&lt;/strong&gt;: typically around 90,000 WARC files or around 72,000 WET files per crawl. Each WET file is around 150–300MB compressed. If you want to process a full crawl, just one, you are looking at roughly 15–20TB of WET data, compressed. Uncompressed, you're working with several times that.&lt;/p&gt;

&lt;p&gt;You cannot download this on a home connection. You cannot store it on a local machine. You cannot process it without distributed compute. The practical pipeline looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CommonCrawl S3 (s3://commoncrawl/) 
     ↓
Stream or download segments via AWS (CC is hosted in us-east-1)
     ↓
EC2 / Spark cluster for processing
     ↓
Your own S3 bucket (processed, filtered output)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CommonCrawl itself is hosted on Amazon S3, which is convenient if you're running AWS workloads: you can stream data from &lt;code&gt;s3://commoncrawl/&lt;/code&gt; to EC2 instances within the same region without paying egress costs. If you're not on AWS, you'll pay for egress, which will add up fast at this scale.&lt;/p&gt;
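&lt;p&gt;The crawl layout on S3 is predictable enough to script against: each crawl publishes a gzipped manifest (&lt;code&gt;wet.paths.gz&lt;/code&gt;) listing the relative keys of all its WET files. A small helper for building those paths might look like the sketch below; the bucket layout and mirror hostname are taken from CommonCrawl's published structure, but verify them against the current crawl listing before relying on them.&lt;/p&gt;

```python
def wet_paths_key(crawl_id: str) -> str:
    """Key of the manifest that lists every WET file in one crawl,
    e.g. crawl-data/CC-MAIN-2024-10/wet.paths.gz"""
    return f"crawl-data/{crawl_id}/wet.paths.gz"

def segment_url(relative_key: str, use_s3: bool = True) -> str:
    """Resolve a relative key from the manifest to a fetchable location.
    Inside us-east-1, stream from S3 to avoid egress charges; the HTTPS
    mirror works from anywhere but is slower for bulk processing."""
    if use_s3:
        return f"s3://commoncrawl/{relative_key}"
    return f"https://data.commoncrawl.org/{relative_key}"
```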

&lt;h3&gt;
  
  
  The Language Detection Problem
&lt;/h3&gt;

&lt;p&gt;Here's the first point where most tutorials gloss over the real difficulty. CommonCrawl does not tell you what language a document is in. You have to figure that out yourself.&lt;/p&gt;

&lt;p&gt;The naive approaches (looking at what URL a document came from, or checking the HTML &lt;code&gt;lang&lt;/code&gt; attribute) don't work reliably. HTML &lt;code&gt;lang&lt;/code&gt; tags are frequently wrong, missing, or set to a default rather than the actual content language. A website with &lt;code&gt;lang="en"&lt;/code&gt; in its header might serve content in five languages depending on which page you're looking at.&lt;/p&gt;

&lt;p&gt;So you need language identification (LID) at the document level. The widely-used options are:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;fastText's language identification model&lt;/strong&gt; (&lt;code&gt;lid.176.bin&lt;/code&gt;): trained on Wikipedia and Tatoeba in 176 languages. Fast, reasonably accurate, the industry workhorse. Runs at roughly 1 million sentences per second on a single CPU core. For most Indic scripts, it performs well because the scripts themselves are visually distinct. However, it struggles with romanised Indic text (Hinglish, Tanglish, etc.) and code-switched content.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Google's CLD3 (Compact Language Detector 3)&lt;/strong&gt;: a neural network-based detector. Slightly more accurate than fastText for short strings and mixed-script content. Available as a Python library (&lt;code&gt;pycld3&lt;/code&gt;). Slower than fastText.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;langdetect&lt;/strong&gt;: a Python port of Google's language detection library. Works, but has known reliability issues with short documents. Do not use this as your primary detector for a serious pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GlotLID&lt;/strong&gt;: a newer model specifically designed for low-resource and Indic languages, supporting over 1600 languages. For building an Indic LLM, this is worth investigating over fastText because it handles scripts and languages that fastText was never well-trained on.&lt;/p&gt;

&lt;p&gt;The practical approach for a serious Indic LLM pipeline is to run two detectors, fastText and GlotLID, and accept a document only when both agree on the language. This sacrifices some recall for significantly better precision. You will lose some data, but the data you keep will be reliably labelled.&lt;/p&gt;

&lt;p&gt;A minimal language detection step looks roughly like this in Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;fasttext&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;glotlid&lt;/span&gt;  &lt;span class="c1"&gt;# hypothetical unified interface
&lt;/span&gt;
&lt;span class="n"&gt;ft_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;fasttext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lid.176.bin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;detect_language&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Require minimum document length
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="c1"&gt;# fastText prediction
&lt;/span&gt;    &lt;span class="n"&gt;ft_pred&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ft_conf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ft_model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ft_lang&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ft_pred&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__label__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Only accept if confidence is high enough
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;ft_conf&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ft_lang&lt;/span&gt;

&lt;span class="n"&gt;INDIC_LANGS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hi&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;te&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ml&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bn&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mr&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pa&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;or&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;as&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Quality Filtering: The Web Is Mostly Garbage
&lt;/h3&gt;

&lt;p&gt;Even after language detection, the text you extract from CommonCrawl is noisy beyond what most people expect. Consider what the web actually contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Boilerplate&lt;/strong&gt;: Navigation menus repeated on every page. "Home | About | Contact | Privacy Policy" appearing thousands of times in your dataset.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Duplicate content&lt;/strong&gt;: The same news article syndicated to forty different websites, each one a nearly-identical copy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SEO-spam&lt;/strong&gt;: Pages that are grammatically coherent but semantically meaningless, designed to rank in search engines rather than communicate information.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Machine-translated content&lt;/strong&gt;: Often poor quality, sometimes hallucinatory, always training-contaminating.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adult content&lt;/strong&gt;: CommonCrawl is an unfiltered crawl of the web.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code&lt;/strong&gt;: Often useful, but a different distribution from natural language text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Template text&lt;/strong&gt;: E-commerce product pages that are 80% boilerplate and 20% description.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The standard quality filtering pipeline applies heuristics developed by papers like CCNet (Facebook) and later Gopher (DeepMind). The key signals are:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Document-level heuristics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Minimum word count (reject documents under 100-200 words)&lt;/li&gt;
&lt;li&gt;Maximum word count (reject documents that are pathologically long, often scraped indexes)&lt;/li&gt;
&lt;li&gt;Fraction of lines ending with punctuation (prose ends with periods; garbage doesn't)&lt;/li&gt;
&lt;li&gt;Average sentence length (too short = navigation menus; too long = templates)&lt;/li&gt;
&lt;li&gt;Symbol-to-word ratio (high symbol ratio = code or spam)&lt;/li&gt;
&lt;li&gt;Fraction of words in a stop-word list for the language (real prose uses many stop words; SEO spam doesn't)&lt;/li&gt;
&lt;li&gt;Fraction of words that appear in a reference vocabulary&lt;/li&gt;
&lt;/ul&gt;
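&lt;p&gt;A few of these heuristics are easy to sketch in plain Python. The thresholds below are illustrative defaults, not the exact CCNet or Gopher values, and a production pipeline would tune them per language; note the Devanagari danda (&lt;code&gt;।&lt;/code&gt;) counting as terminal punctuation alongside the Latin marks.&lt;/p&gt;

```python
import re

def passes_quality_filters(text: str,
                           min_words: int = 100,
                           max_words: int = 100_000,
                           min_punct_line_frac: float = 0.6,
                           max_symbol_ratio: float = 0.1) -> bool:
    """Gopher-style document heuristics; thresholds are illustrative."""
    words = text.split()
    # Reject too-short (menus, stubs) and pathologically long documents
    if not (min_words <= len(words) <= max_words):
        return False
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Fraction of lines ending in terminal punctuation;
    # prose ends with . ! ? or the Devanagari danda, garbage doesn't
    ended = sum(ln.rstrip().endswith((".", "!", "?", "\u0964")) for ln in lines)
    if lines and ended / len(lines) < min_punct_line_frac:
        return False
    # Symbol-to-word ratio: spam and code are symbol-heavy
    symbols = len(re.findall(r"[#@&%{}<>|\\]", text))
    if symbols / len(words) > max_symbol_ratio:
        return False
    return True
```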

&lt;p&gt;&lt;strong&gt;N-gram deduplication:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sentence-level: remove documents where more than 30% of sentences appear verbatim elsewhere in the corpus&lt;/li&gt;
&lt;li&gt;Paragraph-level: same logic at a coarser granularity&lt;/li&gt;
&lt;li&gt;MinHash LSH for fuzzy near-duplicate detection at document level&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Exact deduplication:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SHA256 hashing of normalised document content&lt;/li&gt;
&lt;li&gt;Remove any document that appears more than once&lt;/li&gt;
&lt;/ul&gt;
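&lt;p&gt;Exact deduplication is the simplest of these steps and worth showing end to end. The sketch below normalises before hashing; Unicode NFC normalisation matters particularly for Indic scripts, where the same syllable can be encoded as composed or decomposed codepoints and would otherwise hash differently.&lt;/p&gt;

```python
import hashlib
import unicodedata

def content_hash(text: str) -> str:
    """SHA-256 of normalised content: NFC form, lowercased,
    whitespace collapsed, so trivial variants collide."""
    norm = unicodedata.normalize("NFC", text).lower()
    norm = " ".join(norm.split())
    return hashlib.sha256(norm.encode("utf-8")).hexdigest()

def exact_dedup(docs: list[str]) -> list[str]:
    """Keep only the first occurrence of each distinct document."""
    seen: set[str] = set()
    kept = []
    for doc in docs:
        h = content_hash(doc)
        if h not in seen:
            seen.add(h)
            kept.append(doc)
    return kept
```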

&lt;p&gt;Each of these steps removes data. By the time you've applied language detection, quality filtering, and deduplication, you typically keep somewhere between 10% and 30% of the raw CommonCrawl text for a high-resource language like English. For Indic languages, the keep rate can be lower because the raw volume is smaller and the noise-to-signal ratio is often higher.&lt;/p&gt;

&lt;p&gt;Which brings us to the real problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Indic Language Gap: Why CommonCrawl Is Not Enough
&lt;/h2&gt;

&lt;p&gt;Let me give you a number that will recalibrate your entire thinking about this project.&lt;/p&gt;

&lt;p&gt;English makes up roughly &lt;strong&gt;45–50% of CommonCrawl&lt;/strong&gt; by document count, and an even higher fraction by word count, since English documents tend to be longer. The next-largest languages (German, French, Spanish, Russian) each account for a few percent. Hindi, the most represented Indic language and one of the most spoken languages on Earth, accounts for roughly &lt;strong&gt;1–2%&lt;/strong&gt; of CommonCrawl. Tamil, Telugu, Kannada, and Malayalam are each likely under 0.5%.&lt;/p&gt;

&lt;p&gt;Let's put this in concrete terms. If you process a single CommonCrawl crawl of, say, 200TB of WET data and apply standard quality filters, you might extract around 50-80TB of clean English text. For Hindi, after the same pipeline, you might get somewhere in the range of 1-3TB. For Tamil or Telugu, you might be looking at a few hundred gigabytes.&lt;/p&gt;

&lt;p&gt;For context: training a 7B parameter model to the current standard typically requires somewhere between &lt;strong&gt;1 and 2 trillion tokens&lt;/strong&gt; of training data. A token is roughly 0.75 words, so you're looking for approximately 750 billion to 1.5 trillion words. A 1TB text file contains roughly 150–200 billion words. So a single CommonCrawl crawl might give you 200–400 billion words of Hindi, if the quality filters don't discard too much. For a 7B model, that's still well short.&lt;/p&gt;
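&lt;p&gt;This arithmetic is worth scripting, because you'll redo it every time a source is added or a filter changes. A back-of-envelope helper, using the same illustrative constants as the text (roughly 175 billion words per TB, roughly 0.75 words per token):&lt;/p&gt;

```python
def tokens_from_clean_text(terabytes: float,
                           words_per_tb: float = 175e9,
                           words_per_token: float = 0.75) -> float:
    """Rough token yield from a volume of clean text. The constants are
    ballpark figures, not measured values; real token counts depend
    heavily on the tokenizer and on the script being tokenized."""
    return terabytes * words_per_tb / words_per_token

# 2 TB of clean Hindi under these assumptions is ~4.7e11 tokens,
# i.e. under half of a 1-trillion-token budget for a 7B model.
```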

&lt;p&gt;For smaller Indic languages, CommonCrawl alone is simply insufficient. You need to build your own data pipeline from additional sources.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the Big Labs Actually Use
&lt;/h2&gt;

&lt;p&gt;Before we get into how to build your own pipeline, it's worth understanding what the leading labs are doing. This context matters because it shapes your intuition for what a "good" data strategy looks like.&lt;/p&gt;

&lt;h3&gt;
  
  
  OpenAI: WebText and the Data Behind GPT
&lt;/h3&gt;

&lt;p&gt;OpenAI has been deliberately opaque about their training data, but what we know from published research and technical reports gives a useful picture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPT-3&lt;/strong&gt; (2020) was trained on a mix of filtered CommonCrawl (roughly 570GB of text after cleaning), WebText2 (an expanded version of WebText, a dataset of web pages built by crawling URLs that had been shared on Reddit and received at least three karma, a clever quality signal that filters for content humans found interesting), Books1 and Books2 (book corpora of unclear provenance), and English Wikipedia. CommonCrawl dominated by volume but was down-weighted in sampling to prevent it from overwhelming the curated sources.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPT-4's&lt;/strong&gt; training data has never been fully disclosed. Reporting suggests it includes significantly expanded web data, synthetic data generated by earlier GPT models, and curated datasets from paid partnerships with publishers and data providers. This is an important signal: at the frontier, purely scraped web data is not enough. You need curation, and curation at scale requires either human labour or earlier models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenAI's data partnerships&lt;/strong&gt;: OpenAI has signed deals with news publishers (AP, The Atlantic), book publishers, and other content owners to license their content for training. This is a quietly significant commercial strategy that open-source efforts genuinely cannot replicate.&lt;/p&gt;

&lt;h3&gt;
  
  
  Anthropic: Constitutional AI and Careful Data Curation
&lt;/h3&gt;

&lt;p&gt;Anthropic has published relatively little about their exact data sources for Claude's pretraining, but several signals emerge from their published research.&lt;/p&gt;

&lt;p&gt;Their focus on &lt;strong&gt;safety and harmlessness&lt;/strong&gt; means their data pipeline likely includes aggressive content filtering beyond standard quality heuristics. The Constitutional AI approach they use for alignment also implies that the model's pretraining data is cleaner than average: you cannot fine-tune away what is deeply embedded in pretraining.&lt;/p&gt;

&lt;p&gt;Anthropic has discussed using a mix of &lt;strong&gt;web data&lt;/strong&gt; (likely CommonCrawl-derived, heavily filtered), &lt;strong&gt;curated books and papers&lt;/strong&gt;, and &lt;strong&gt;synthetic data&lt;/strong&gt; for specific capabilities. Their technical reports mention that Claude's training includes "a large dataset of text from the internet and other sources," but specifics are proprietary.&lt;/p&gt;

&lt;p&gt;What Anthropic has been more public about is their emphasis on &lt;strong&gt;data quality over data quantity&lt;/strong&gt;. Repeated statements from their research team suggest that a smaller dataset of higher quality can outperform a larger noisy dataset for many capabilities, an insight that is directly relevant to the Indic language challenge, where your data is necessarily smaller.&lt;/p&gt;

&lt;h3&gt;
  
  
  Google: The C4 Dataset, Gemini, and Vertical Integration
&lt;/h3&gt;

&lt;p&gt;Google's position is structurally unique. They built the web index. They built the crawlers. They are, in a meaningful sense, upstream of CommonCrawl.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;T5's C4 dataset&lt;/strong&gt; (Colossal Clean Crawled Corpus) was an early public example of Google's data processing pipeline. C4 was derived from CommonCrawl using a set of heuristic filters: removing lines not ending with terminal punctuation, removing pages with fewer than five sentences, removing offensive content via word-list filtering. This was a relatively simple pipeline but established the basic shape of what large-scale web data processing looks like.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LaMDA, PaLM, and Gemini&lt;/strong&gt; use significantly more sophisticated pipelines. From what Google has disclosed, their training data includes: web documents (their own crawl, not CommonCrawl), books from Google Books, code from GitHub, scientific papers from Google Scholar's index, YouTube transcripts (they own YouTube), and Wikipedia. The YouTube transcript angle is particularly notable: Google has access to an enormous volume of spoken-word Indic language content through YouTube captions, which they can use in ways that external researchers simply cannot.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemini&lt;/strong&gt; specifically was described as using "multimodal" data from the start, incorporating text, images, audio, and video into a unified training corpus. This is architecturally distinct from earlier text-only models and reflects Google's vertical integration across content types.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For Indic languages specifically&lt;/strong&gt;, Google has a significant advantage through their work on AI4Bharat-derived datasets, their own investment in multilingual NLP research, and the sheer reach of Google Search and YouTube in India. Their data advantage for Indic languages is structural, not just a matter of engineering effort.&lt;/p&gt;

&lt;p&gt;The takeaway from looking at what the labs do is this: they all start with web data, but none of them stop there. They curate, they partner, they build proprietary pipelines, and they invest heavily in quality filtering. For an Indic LLM, the lesson is clear — CommonCrawl is the foundation, but you need to build on top of it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Building Your Own Data Moat, Source by Source
&lt;/h2&gt;

&lt;p&gt;Given that CommonCrawl alone is insufficient for Indic languages, you need to construct a multi-source data pipeline. Here is how to think through each source category.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. News Archives
&lt;/h3&gt;

&lt;p&gt;Indian news organisations produce an enormous volume of Indic language text, and much of it is accessible via scraping. The key targets:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hindi:&lt;/strong&gt; Dainik Jagran, Dainik Bhaskar, Navbharat Times, Amar Ujala, Rajasthan Patrika. Each of these has a web presence with years of archived articles. A systematic scraper targeting their article pages (not homepages or index pages) can yield hundreds of millions of words of clean Hindi text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tamil:&lt;/strong&gt; Dinamalar, Dinakaran, Daily Thanthi, The Hindu Tamil edition.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Telugu:&lt;/strong&gt; Eenadu, Sakshi, Andhra Jyothy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kannada:&lt;/strong&gt; Prajavani, Vijay Karnataka, Deccan Herald Kannada.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Malayalam:&lt;/strong&gt; Mathrubhumi, Malayala Manorama, Madhyamam.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bengali:&lt;/strong&gt; Anandabazar Patrika, Prothom Alo (Bangladesh), Dainik Statesman.&lt;/p&gt;

&lt;p&gt;News text is valuable because it is factual, reasonably well-written (sub-edited, not raw user content), topically diverse, and dated — meaning it tracks language evolution over time. The challenge is respecting &lt;code&gt;robots.txt&lt;/code&gt; and rate limits, and dealing with the fact that news websites often change their HTML structure, breaking scrapers.&lt;/p&gt;

&lt;p&gt;A news scraping pipeline typically looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Sitemap crawl (find article URLs) 
     → Respectful crawling (rate-limited, respecting robots.txt)
     → HTML extraction (trafilatura, newspaper3k, or BeautifulSoup)
     → Boilerplate removal (header, footer, nav, ads)
     → Language verification (fastText/GlotLID)
     → Quality scoring
     → Deduplication (article titles are frequently reused)
     → Upload to S3 with metadata (source, date, language, word count)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
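&lt;p&gt;The first step, sitemap crawling, is mostly XML plumbing. Given a fetched sitemap document, extracting the article URLs takes a few lines with the standard library; the namespace URI below is the standard sitemap schema.&lt;/p&gt;

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def article_urls(sitemap_xml: str) -> list[str]:
    """Pull every URL entry out of a sitemap document.
    News sites usually expose per-day or per-section sitemaps,
    which is how you enumerate article pages without crawling
    homepages or index pages."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text for loc in root.iterfind(".//sm:loc", SITEMAP_NS)]
```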



&lt;p&gt;&lt;strong&gt;&lt;code&gt;trafilatura&lt;/code&gt;&lt;/strong&gt; is the best tool I've found for extracting main content from news articles. It's designed for pulling the main text out of web pages and, for this use case, outperforms hand-rolled BeautifulSoup extraction by a significant margin.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Books and Literary Text
&lt;/h3&gt;

&lt;p&gt;Books are among the most valuable data sources for language model training because they contain long-form, coherent, carefully structured writing — the kind of writing that produces models which can reason and compose extended arguments.&lt;/p&gt;

&lt;p&gt;For Indic languages, digital book sources include:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Project Gutenberg&lt;/strong&gt; — limited Indic content, but has some classical texts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wikisource&lt;/strong&gt; — has a reasonable collection of public domain texts in Hindi, Tamil, Sanskrit, and other Indic languages. Directly downloadable via API.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Digital Library of India / IGNCA&lt;/strong&gt; — government-operated repositories with scanned books. OCR quality is variable but there's significant volume.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sahitya Akademi publications&lt;/strong&gt; — India's national academy of letters has digitised portions of their catalogue.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;State libraries&lt;/strong&gt; — Several state government libraries have digital catalogues with varying degrees of accessibility.&lt;/p&gt;

&lt;p&gt;The challenge with books is two-fold. First, copyright: books published more than 95 years ago are typically in the public domain in the US, though laws vary by country. Anything newer is protected, and scraping it without licensing agreements is legally fraught. Second, OCR quality: many Indic language books that exist digitally are scans, and OCR accuracy for Indic scripts, while improving, is still imperfect.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Wikipedia and Wikimedia Projects
&lt;/h3&gt;

&lt;p&gt;Wikipedia is a standard component of virtually every LLM training corpus, and Indic Wikis are worth extracting. The good news is that Wikipedia dumps are freely available and pre-segmented by language.&lt;/p&gt;

&lt;p&gt;The dump for Hindi Wikipedia (&lt;code&gt;hiwiki-latest-pages-articles.xml.bz2&lt;/code&gt;) can be processed with &lt;code&gt;WikiExtractor&lt;/code&gt; to produce clean text. However, be aware of scale — Hindi Wikipedia, while growing rapidly, is small compared to English Wikipedia. As of recent counts, Hindi Wikipedia has around 150,000+ articles; English has over 6.7 million. Tamil, Telugu, and Kannada Wikis are even smaller.&lt;/p&gt;

&lt;p&gt;Beyond Wikipedia, the &lt;strong&gt;Wikimedia Foundation&lt;/strong&gt; operates several other projects that contain Indic content: Wiktionary (dictionary entries, good for vocabulary coverage), Wikisource (public domain texts), and Wikibooks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IndicWiki corpus&lt;/strong&gt; and similar derived datasets exist and are worth using as a starting point before building custom scraping infrastructure.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. YouTube Transcripts
&lt;/h3&gt;

&lt;p&gt;This is one of the most underutilised sources for Indic language data, and I want to spend some time here because the opportunity is significant.&lt;/p&gt;

&lt;p&gt;YouTube is enormous in India. Hindi-language content is one of the fastest-growing segments on the platform. News channels, educational content creators, documentary producers, and politicians all produce significant volumes of Indic language content. And YouTube provides auto-generated captions for much of this content via their speech recognition system.&lt;/p&gt;

&lt;p&gt;These captions are imperfect: they're generated by an ASR model, not transcribed by humans. But they represent a source of spoken-language data that doesn't exist in written form anywhere else. Colloquial Hindi, regional dialects, code-switching between Hindi and English: all of it shows up in YouTube captions in a way it doesn't show up in news articles.&lt;/p&gt;

&lt;p&gt;The pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;youtube_transcript_api&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;YouTubeTranscriptApi&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;yt_dlp&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_channel_video_ids&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;channel_url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;ydl_opts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quiet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extract_flat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;force_generic_extractor&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;yt_dlp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;YoutubeDL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ydl_opts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;ydl&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;info&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ydl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extract_info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;channel_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;download&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;entry&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;entry&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;info&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;entries&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])]&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_transcript&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;video_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hi&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;transcript_list&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;YouTubeTranscriptApi&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;list_transcripts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;video_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Prefer manually created captions; fall back to auto-generated
&lt;/span&gt;        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;transcript&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;transcript_list&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_manually_created_transcript&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;transcript&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;transcript_list&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;find_generated_transcript&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

        &lt;span class="n"&gt;segments&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fetch&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;seg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;seg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;segments&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The ethical and legal considerations here are important. YouTube's ToS restricts automated scraping. Many content creators depend on their content for livelihood. The approach most research teams take is to use transcripts for research under fair use principles, but this is a legally nuanced area. Some teams reach out directly to content creators for explicit permission, which also gives you cleaner, human-curated channel lists.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Government and Legal Text
&lt;/h3&gt;

&lt;p&gt;Government text is valuable for several reasons: it covers technical and formal registers of the language, it's authoritative, it's high quality, and it's public.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Official government portals&lt;/strong&gt; — States like Maharashtra, Tamil Nadu, Karnataka, Kerala, and Andhra Pradesh publish official gazette notifications, government orders, and public documents in Indic languages. The Government of India's National Portal publishes content in 22 scheduled languages.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parliamentary proceedings&lt;/strong&gt; — The Lok Sabha and Rajya Sabha publish debates in Hindi and in other scheduled languages when speeches are delivered in those languages. Parliamentary debates are long-form, formal, topically diverse, and high quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Legal judgments&lt;/strong&gt; — The Supreme Court of India and many High Courts publish judgments that have been translated into Indic languages. Legal text is challenging (dense with technical terminology) but valuable for formal language understanding.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NCERT textbooks&lt;/strong&gt; — The National Council of Educational Research and Training publishes school textbooks in all major Indic languages, and many of these are freely downloadable as PDFs. These are edited, high-quality, age-appropriate text covering science, social studies, mathematics, and literature.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Academic and Research Text
&lt;/h3&gt;

&lt;p&gt;JSTOR, arXiv, and similar repositories are mostly English. For Indic language academic text, the main sources are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Shodhganga&lt;/strong&gt; — India's national repository of PhD theses. Many theses in social sciences, humanities, and education are written in Indic languages.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Indian journals&lt;/strong&gt; — Several academic journals publish in Hindi and other Indic languages. Access varies.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Building the Full Ingestion Pipeline
&lt;/h2&gt;

&lt;p&gt;Now let's talk about the actual engineering: how you stitch all of these sources into a coherent pipeline that produces clean, deduplicated, language-verified text in S3.&lt;/p&gt;

&lt;p&gt;The architecture I'd recommend for a team of one to five people with a moderate cloud budget:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────┐
│                    Source Layer                          │
│  CommonCrawl   News Sites   YouTube   Wikipedia   Books  │
│  (S3 stream)   (scrapers)   (API)     (dumps)    (PDFs) │
└────────────────────────┬────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────┐
│                  Extraction Layer                        │
│   trafilatura / pdfminer / youtube-transcript-api        │
│   Output: raw text + metadata JSON (source, date, lang) │
└────────────────────────┬────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────┐
│                Normalisation Layer                       │
│   Unicode normalisation (NFC/NFKC)                       │
│   Script detection (is this actually Devanagari?)        │
│   Whitespace normalisation                               │
│   Remove HTML artifacts, escape sequences               │
└────────────────────────┬────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────┐
│               Language Detection Layer                   │
│   fastText (lid.176.bin) + GlotLID                      │
│   Threshold: confidence &amp;gt; 0.85, both models agree       │
│   Route to language-specific buckets                    │
└────────────────────────┬────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────┐
│               Quality Filtering Layer                    │
│   Document-level heuristics (length, punct ratio, etc.) │
│   Perplexity filter (KenLM trained on reference text)   │
│   Content filtering (adult, spam, hate speech)          │
└────────────────────────┬────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────┐
│               Deduplication Layer                        │
│   Exact dedup: SHA256 hash of normalised content        │
│   Fuzzy dedup: MinHash LSH (datasketch library)         │
│   Threshold: Jaccard similarity &amp;gt; 0.8 = duplicate       │
└────────────────────────┬────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────┐
│                    S3 Output Layer                       │
│   s3://your-bucket/                                     │
│     ├── raw/{language}/{source}/{year}/{month}/          │
│     ├── processed/{language}/{source}/{year}/{month}/    │
│     └── final/{language}/train/ val/ test/              │
│   Format: JSONL — one document per line                 │
│   Each record: {text, source, language, date, quality}  │
└─────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
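&lt;p&gt;The normalisation layer in the diagram is mostly standard-library work. Here's a minimal sketch of those steps; NFC is the usual choice for Indic scripts, and the regexes are illustrative starting points rather than battle-tested rules:&lt;/p&gt;

```python
import html
import re
import unicodedata

def normalise(text: str) -> str:
    """Normalisation-layer steps: entities, Unicode NFC, HTML leftovers, whitespace."""
    text = html.unescape(text)                 # decode leftover entities like &amp;amp;
    text = unicodedata.normalize("NFC", text)  # canonical composition for Indic scripts
    text = re.sub(r"<[^>]+>", " ", text)       # stray HTML tags the extractor missed
    text = re.sub(r"[ \t]+", " ", text)        # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)     # allow at most one blank line
    return text.strip()
```

Script detection (the "is this actually Devanagari?" check) then runs on the normalised text, so that mixed-encoding junk has already been smoothed out.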



&lt;h3&gt;
  
  
  The JSONL Format: Keep It Simple
&lt;/h3&gt;

&lt;p&gt;Every document that passes your pipeline should be stored as a single JSON object on a single line (JSONL format). A minimal schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"जब भारत ने 1947 में स्वतंत्रता प्राप्त की..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"dainik_bhaskar"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"source_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://www.bhaskar.com/..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"language"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"hi"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"script"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Devanagari"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"date"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2023-08-15"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"word_count"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;847&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"quality_score"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.87&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"pipeline_version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"v1.2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"dedup_hash"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sha256:a3b9c..."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;quality_score&lt;/code&gt; field is critical: it lets you apply different quality thresholds at different stages of training (higher threshold for initial training runs, lower for later stages).&lt;/p&gt;
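&lt;p&gt;Reading the records back and applying a stage-specific threshold is then only a few lines. A sketch against the schema above (the helper names are my own, not a library API):&lt;/p&gt;

```python
import json
from typing import Iterable, Iterator

def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one record per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def filter_by_quality(records: Iterable[dict], min_score: float) -> Iterator[dict]:
    """Keep records at or above the threshold for the current training stage."""
    return (r for r in records if r.get("quality_score", 0.0) >= min_score)

def write_jsonl(path: str, records: Iterable[dict]) -> None:
    with open(path, "w", encoding="utf-8") as f:
        for r in records:
            # ensure_ascii=False keeps Devanagari readable instead of \uXXXX escapes
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
```

Because each document is one line, this composes trivially with shell tools and with streaming dataset loaders later on.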

&lt;h3&gt;
  
  
  Managing CommonCrawl at Scale
&lt;/h3&gt;

&lt;p&gt;Processing CommonCrawl without burning through your budget requires careful thinking about compute architecture.&lt;/p&gt;

&lt;p&gt;The recommended approach is &lt;strong&gt;streaming processing&lt;/strong&gt; rather than downloading everything locally. The flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;gzip&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;warcio&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_wet_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s3_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;s3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-east-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_object&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;commoncrawl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;s3_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;gzip&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;warcio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ArchiveIterator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rec_type&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;conversion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;

            &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;content_stream&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;errors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ignore&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rec_headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_header&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WARC-Target-URI&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

            &lt;span class="n"&gt;language&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;detect_language&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;INDIC_LANGS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;continue&lt;/span&gt;

            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;passes_quality_filters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;language&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;source&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;commoncrawl&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;crawl_date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;record&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rec_headers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_header&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WARC-Date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For scale, run this across a cluster of EC2 instances (even spot instances work fine for this workload: it's stateless and restartable). Apache Spark on EMR is an option for very large-scale processing, though the overhead of setting up Spark can outweigh the benefits if you're processing fewer than ~50 crawl segments.&lt;/p&gt;

&lt;p&gt;A cost-effective approach for an independent researcher: process one crawl segment at a time on a single large EC2 instance (&lt;code&gt;r6i.4xlarge&lt;/code&gt; or similar, 16 vCPUs, 128GB RAM), running the full pipeline in Python with multiprocessing. Each WET file takes roughly 2-5 minutes to process. With parallelism, you can get through a few thousand WET files per day. Budget for this at around $50-150/day in EC2 costs — not cheap, but manageable.&lt;/p&gt;
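&lt;p&gt;The arithmetic behind those throughput numbers, with every figure taken from the rough estimates above rather than measured:&lt;/p&gt;

```python
# Rough capacity planning for a single r6i.4xlarge-style box (assumed figures)
workers = 16                 # one worker process per vCPU
minutes_per_wet_file = 3.5   # midpoint of the 2-5 minute estimate
files_per_worker_per_day = 24 * 60 / minutes_per_wet_file
files_per_day = int(workers * files_per_worker_per_day)
print(files_per_day)         # ~6,500 WET files/day at these assumptions

# A CommonCrawl crawl is on the order of 80,000-90,000 WET files,
# so a full crawl is roughly two weeks of wall-clock time at this rate
days_per_crawl = 85_000 / files_per_day
print(round(days_per_crawl, 1))
```

Which is why most small teams process one or two crawls deeply rather than every crawl shallowly.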

&lt;h3&gt;
  
  
  The Deduplication Challenge
&lt;/h3&gt;

&lt;p&gt;Deduplication is where a lot of people underestimate the effort required. The same news article appearing on forty different websites is a trivial case. The harder cases are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Near-duplicates&lt;/strong&gt;: Same article with different headlines or different publication dates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Templated content&lt;/strong&gt;: Product descriptions that share 90% of their text across different products&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Aggregated content&lt;/strong&gt;: Websites that republish translated content from other sources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-crawl duplicates&lt;/strong&gt;: The same page appearing in the March 2023 and June 2023 CommonCrawl crawls&lt;/li&gt;
&lt;/ul&gt;
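&lt;p&gt;The trivial case is worth handling first with a plain content hash: it's far cheaper than MinHash and catches most multi-crawl repeats outright. A sketch (the whitespace canonicalisation here is a minimal assumption; feeding in your pipeline's fully normalised text is better):&lt;/p&gt;

```python
import hashlib

def content_hash(text: str) -> str:
    # Canonicalise whitespace and case so cosmetic differences don't defeat the hash
    canonical = " ".join(text.split()).lower()
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def dedup_exact(documents):
    """Keep the first occurrence of each distinct text; drop exact repeats."""
    seen: set[str] = set()
    for doc in documents:
        h = content_hash(doc["text"])
        if h not in seen:
            seen.add(h)
            doc["dedup_hash"] = "sha256:" + h
            yield doc
```

Everything the exact pass doesn't catch falls through to the fuzzy stage.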

&lt;p&gt;The standard approach is MinHash LSH deduplication. The &lt;code&gt;datasketch&lt;/code&gt; Python library implements this efficiently. For a corpus of hundreds of millions of documents, you'll want to run this on a cluster rather than a single machine.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datasketch&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MinHash&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MinHashLSH&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_minhash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_perm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;MinHash&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MinHash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_perm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_perm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;

&lt;span class="n"&gt;lsh&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MinHashLSH&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_perm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Insert documents
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;document&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;minhash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_minhash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;document&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;lsh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc_id&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;minhash&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Query for duplicates
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;find_duplicates&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;minhash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_minhash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;lsh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;minhash&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Mental Model: Thinking at Billion-Parameter Scale
&lt;/h2&gt;

&lt;p&gt;If you've made it this far, you have a picture of the individual components. But let me zoom out and talk about how to think about all of this when your goal is training a model with billions of parameters.&lt;/p&gt;

&lt;h3&gt;
  
  
  Token Count Is What Actually Matters
&lt;/h3&gt;

&lt;p&gt;Your intuition about data size probably comes from file sizes (GB, TB). Train yourself to think in tokens instead. At roughly 3-4 characters per token for Indic scripts (Devanagari, Tamil script, etc. are more compact in terms of characters but tokenise differently), a 1GB file of Hindi text might contain somewhere around 150-300 million tokens.&lt;/p&gt;

&lt;p&gt;Current research (the Chinchilla paper from DeepMind) suggests that a model should be trained on roughly &lt;strong&gt;20 tokens per parameter&lt;/strong&gt; for optimal compute efficiency. A 7B parameter model, therefore, wants around 140 billion tokens of training data at minimum. Frontier models are now often trained on significantly more than this: 2-4 trillion tokens is common.&lt;/p&gt;

&lt;p&gt;For a 1B parameter Indic language model (a reasonable starting point), you want at minimum 20-50 billion tokens. For Hindi, this is achievable with CommonCrawl + news + books + Wikipedia combined. For Tamil or Telugu at the same scale, it's tight but doable with aggressive multi-source collection. For languages like Odia or Assamese at 1B scale, you'll need creative sourcing and likely some degree of cross-lingual data from closely related languages.&lt;/p&gt;

&lt;h3&gt;
  
  
  Quality Is a Multiplier on Quantity
&lt;/h3&gt;

&lt;p&gt;Here is the mental model that will save you from chasing raw scale at the expense of everything else: &lt;strong&gt;quality multiplies quantity&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;If you train on 100 billion tokens of noisy, low-quality data, you may get worse results than training on 20 billion tokens of carefully curated, high-quality data. This has been shown empirically in papers like Phi-1 (Microsoft), which achieved strong benchmark performance with a small model trained on carefully curated synthetic and textbook-quality data.&lt;/p&gt;

&lt;p&gt;This is particularly relevant for Indic languages, where your data is going to be limited. You cannot win the quantity game against the labs. You can potentially win the quality game — by being more careful than a large lab's automated pipeline can afford to be, by sourcing data that scrapers don't reach, and by curating rather than accumulating.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Data Flywheel: Bootstrap Thinking
&lt;/h3&gt;

&lt;p&gt;Here's a pattern that's worth internalising for a long-running project like this. The data you collect now is not just for your first model. It is the foundation for a data flywheel:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Iteration 1:&lt;/strong&gt; Collect data manually → train a small model → use model to help identify more data sources, classify quality, detect duplicates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Iteration 2:&lt;/strong&gt; Use iteration 1 model outputs to generate synthetic training examples in domains where your real data is weak (e.g., scientific text in Telugu, legal text in Kannada).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Iteration 3:&lt;/strong&gt; Use the iteration 2 model to help human annotators work faster: propose labels, suggest edits, flag ambiguous cases.&lt;/p&gt;

&lt;p&gt;At each iteration, the model gets better, and the data quality improves. This is how serious Indic NLP efforts like AI4Bharat have built momentum: not by trying to collect everything perfectly upfront, but by building a feedback loop between models and data.&lt;/p&gt;

&lt;h3&gt;
  
  
  The S3 Bucket Structure: Design for the Whole Journey
&lt;/h3&gt;

&lt;p&gt;Before you start dumping files, think carefully about your S3 structure. You will want to be able to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reprocess specific source types without touching others&lt;/li&gt;
&lt;li&gt;Apply new quality filters to existing data without re-scraping&lt;/li&gt;
&lt;li&gt;Maintain separate train/val/test splits per language&lt;/li&gt;
&lt;li&gt;Track data lineage (which pipeline version produced this file)&lt;/li&gt;
&lt;li&gt;Add new languages without reorganising existing structure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A structure that has worked well in practice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;s3://your-indic-llm-bucket/
├── raw/
│   ├── commoncrawl/2024-10/hi/segment_001.jsonl.gz
│   ├── news/dainik_bhaskar/2023/08/articles.jsonl.gz
│   ├── youtube/transcripts/hi/batch_001.jsonl.gz
│   └── wikipedia/hiwiki/20240101/dump.jsonl.gz
├── processed/
│   ├── hi/v1.2/deduped_quality_filtered.jsonl.gz
│   ├── ta/v1.2/deduped_quality_filtered.jsonl.gz
│   └── te/v1.2/deduped_quality_filtered.jsonl.gz
├── final/
│   ├── hi/train.jsonl.gz
│   ├── hi/val.jsonl.gz
│   └── hi/test.jsonl.gz
└── metadata/
    ├── stats/token_counts_by_source_language.json
    └── pipeline_versions/v1.2_config.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Keep pipeline version numbers in your paths. When you update a filter or fix a bug, you want to know which files were produced by which version of your pipeline.&lt;/p&gt;
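&lt;p&gt;In practice this means never hand-assembling paths at call sites. A small helper (the bucket layout and filename here are the hypothetical ones from the sketch above) keeps the version segment from being forgotten:&lt;/p&gt;

```python
def processed_key(lang, pipeline_version,
                  filename="deduped_quality_filtered.jsonl.gz"):
    """Build a processed-data S3 key that embeds the pipeline version,
    following the hypothetical layout sketched above:
    processed/{lang}/v{version}/{filename}
    """
    return f"processed/{lang}/v{pipeline_version}/{filename}"

key = processed_key("hi", "1.2")
```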




&lt;h2&gt;
  
  
  What You've Built, and What Comes Next
&lt;/h2&gt;

&lt;p&gt;By the end of Stage 1, you should have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A processed, deduplicated, language-verified dataset in S3, stored as JSONL&lt;/li&gt;
&lt;li&gt;Coverage across at minimum: CommonCrawl (multi-crawl), news archives, Wikipedia, and one or more of: YouTube transcripts, books, government text&lt;/li&gt;
&lt;li&gt;Metadata tracking for each document: source, language, date, quality score, pipeline version&lt;/li&gt;
&lt;li&gt;A rough token count per language — probably something like 20-100B tokens of Hindi, 5-30B tokens of the major Dravidian languages, and smaller amounts for less-resourced Indic languages&lt;/li&gt;
&lt;li&gt;A deduplication report showing how much duplication existed in your raw sources&lt;/li&gt;
&lt;/ul&gt;
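&lt;p&gt;For the metadata point in particular, it helps to fix a record shape early. A hypothetical per-document JSONL record, with field names that are purely illustrative but mirror the checklist above:&lt;/p&gt;

```python
import json

# Hypothetical per-document metadata record; every field name here is
# illustrative, chosen to match the checklist in the text.
doc = {
    "text": "...",  # the document body goes here
    "source": "news/dainik_bhaskar",
    "language": "hi",
    "date": "2023-08-14",
    "quality_score": 0.87,
    "pipeline_version": "v1.2",
}
jsonl_line = json.dumps(doc, ensure_ascii=False)
```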

&lt;p&gt;This dataset is not clean enough to train on yet. It is a foundation.&lt;/p&gt;

&lt;p&gt;In Stage 2, we'll talk about what comes next: the data cleaning, normalisation, and preparation pipeline that takes this raw JSONL corpus and produces the tokenised, shuffled, packed training sequences that an actual model can consume. We'll cover tokenisation for Indic languages specifically — the choice between character-level, BPE, and sentencepiece models, and why this decision has large downstream effects on your model's ability to handle morphologically rich Indic languages.&lt;/p&gt;

&lt;p&gt;We'll also talk about data mixing: how you decide the ratio of Hindi to Tamil to Telugu to English in your training batch, and how to implement curriculum-style data ordering that trains on easier data first and harder data later.&lt;/p&gt;

&lt;p&gt;For now, though, you have the invisible constraint in hand. Data is the foundation. Everything else (architecture, training, alignment) is built on top of what you've built here.&lt;/p&gt;

&lt;p&gt;Build it carefully.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This is Part 2 of the "Building an LLM from Scratch for Indic Languages" series. Part 1 (the introduction) is &lt;a href="https://dev.to/daud_ibrahim_9887/building-an-llm-from-scratch-for-indic-languages-what-no-one-tells-you-about-the-hard-parts-1e85"&gt;here&lt;/a&gt;. Part 3, Data Cleaning and Tokenisation, is coming next.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you found this useful, the series is being written openly. Feedback, corrections, and suggestions are welcome in the responses.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; LLM, Large Language Models, Indic Languages, NLP, Data Engineering, CommonCrawl, Hindi, Tamil, Telugu, Machine Learning, AI, Data Pipeline, S3, Python, Natural Language Processing&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>ai</category>
      <category>nlp</category>
    </item>
    <item>
      <title>Beyond ReconVLA: Annotation-Free Visual Grounding via Language-Attention Masked Reconstruction</title>
      <dc:creator>Daud Ibrahim</dc:creator>
      <pubDate>Sat, 14 Mar 2026 12:48:16 +0000</pubDate>
      <link>https://dev.to/daud_ibrahim_9887/beyond-reconvla-annotation-free-visual-grounding-via-language-attention-masked-reconstruction-4klj</link>
      <guid>https://dev.to/daud_ibrahim_9887/beyond-reconvla-annotation-free-visual-grounding-via-language-attention-masked-reconstruction-4klj</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Replacing gaze annotations with language-driven attention masking makes robot perception annotation-free and up to 5x faster at inference. Here is how I got there.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;p&gt;Picture a robot arm sitting across a table from you. You say: "Put the black bowl in the drawer." The arm moves. But not toward the bowl. It hovers. It hesitates. Then it grabs the wrong thing. From the outside this looks like a minor coordination failure. From the inside, it is a fundamental problem with how the robot perceives the world.&lt;/p&gt;


&lt;p&gt;The robot was not confused about language. It understood the words perfectly. The failure was visual. Its perception system was distributing attention more or less equally across the entire scene: the table, the wall, the drawer handle, the bowl, the cup beside the bowl. It had no reliable mechanism to concentrate attention on the one object the instruction actually named. This scattered perception is the root cause of most manipulation failures in modern robotics.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8pv46quzewysgbswfx0p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8pv46quzewysgbswfx0p.png" alt="The Core Problem&amp;lt;br&amp;gt;
State-of-the-art Vision-Language-Action (VLA) models understand language well. What they do not do well is spatially align their visual attention with the object the instruction names. The result: imprecise action prediction on the very tasks they were built for." width="800" height="252"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;A recent paper called ReconVLA attempted to solve this. I spent a significant stretch of time reading it carefully, stress-testing its assumptions, and thinking about what it would mean to implement and extend it. What I found impressed me in some ways and genuinely troubled me in others. This post is the story of that investigation, and the architecture I designed in response.&lt;/p&gt;


&lt;h2&gt;What ReconVLA Got Right&lt;/h2&gt;


&lt;p&gt;The core insight behind ReconVLA is elegant. Instead of adding an external object detection module (which requires labelled bounding boxes) or generating bounding box tokens before action prediction (which changes the output format), ReconVLA uses visual reconstruction as a purely internal supervisory signal.&lt;/p&gt;


&lt;p&gt;Here is how it works. The model identifies a "gaze region" in the input image corresponding to the manipulation target. It then trains a diffusion transformer head to reconstruct that gaze region using only the backbone's internal visual tokens. The logic is clean: if the backbone does not encode the shape and precise position of the target object, it cannot reconstruct the gaze region. The reconstruction task creates a gradient pressure that forces the backbone to develop geometrically precise, spatially structured representations.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The reconstruction task forces the backbone to encode the shape and position of the target object. If it does not know where the bowl is, it cannot reconstruct the bowl region.&lt;/p&gt;
&lt;/blockquote&gt;


&lt;p&gt;At inference, no reconstruction happens. The improved backbone simply produces better action predictions. No external module, no extra output format, no visible seams. ReconVLA outperforms OpenVLA and RT-2 style baselines on LIBERO-Spatial, LIBERO-Long, and CALVIN benchmarks. The attention maps they visualise show genuinely more focused perception. This is real progress.&lt;/p&gt;


&lt;p&gt;So where is the problem?&lt;/p&gt;


&lt;h2&gt;Where I Found the Gaps&lt;/h2&gt;


&lt;p&gt;After reading the paper closely and thinking through what it would take to reproduce, extend, and trust these results, I identified three substantive issues.&lt;/p&gt;


&lt;h3&gt;Gap 1: The gaze region is doing hidden work&lt;/h3&gt;


&lt;p&gt;The gaze regions used as reconstruction targets come from robot eye-tracking or annotation in the training data. The paper does not fully specify how these are obtained across all three data sources: BridgeData V2, LIBERO, and CALVIN. If the gaze regions are derived heuristically (for example, a bounding box drawn around the object named in the instruction), then there is a circular dependency buried in the method.&lt;/p&gt;


&lt;p&gt;The reconstruction target is computed from the same language instruction that guides the action. The model could learn to shortcut: attend to language cues rather than developing genuine geometric understanding of the scene. You would get good benchmark numbers either way, and you would have no way to tell the difference.&lt;/p&gt;


&lt;p&gt;Critically, there is no ablation in the paper comparing reconstruction of gaze regions against reconstruction of random patch regions. This single missing experiment means we cannot attribute the performance improvement to gaze-specific grounding versus the simpler hypothesis that any auxiliary reconstruction task would help. Without it, we do not know what the method is actually learning.&lt;/p&gt;


&lt;h3&gt;Gap 2: The diffusion transformer adds overhead they never measured&lt;/h3&gt;


&lt;p&gt;Diffusion models require T iterative denoising steps per forward pass. In robot manipulation, inference latency directly determines control frequency. If your model runs at 1 Hz, it cannot close a control loop that needs 10 Hz. ReconVLA does not report any inference latency benchmarks. For a robotics paper, this is a significant omission. Diffusion Policy, for comparison, explicitly benchmarks latency and shows diffusion-based policies typically operating at 1 to 2 Hz due to iterative denoising. ReconVLA provides no comparable numbers.&lt;/p&gt;
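&lt;p&gt;The arithmetic here is unforgiving, and worth making explicit (the latency numbers below are made up for illustration; ReconVLA reports none):&lt;/p&gt;

```python
def control_frequency_hz(single_pass_ms, denoise_steps=1):
    """Upper bound on control-loop frequency implied by inference latency.

    A diffusion head pays `denoise_steps` forward passes per action;
    a single-pass decoder pays one. All numbers here are illustrative
    assumptions, not measured benchmarks.
    """
    return 1000.0 / (single_pass_ms * denoise_steps)

diffusion_hz = control_frequency_hz(50, denoise_steps=10)  # 10-step denoiser
single_pass_hz = control_frequency_hz(50)                  # one forward pass
```

At an assumed 50 ms per forward pass, ten denoising steps cap the controller at 2 Hz while a single pass allows 20 Hz, which is why the missing latency benchmark matters.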


&lt;h3&gt;Gap 3: Evaluation scope is narrower than the generalisation claims&lt;/h3&gt;


&lt;p&gt;LIBERO and CALVIN are simulation benchmarks. Real-world results are limited to qualitative demonstrations on a single robot arm. The pretraining dataset overlaps with evaluation environments, which raises data leakage concerns. CALVIN evaluates long-horizon tasks with a fixed language vocabulary, which does not test open-vocabulary instruction following: the core promise of VLA models. Taken together, the generalisation claims exceed what the evaluation design can actually support.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frydb6k4xwhkn29erq7id.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frydb6k4xwhkn29erq7id.png" alt="Why this matters&amp;lt;br&amp;gt;
These are not minor methodological quibbles. They go to the heart of whether the model is learning what we think it is learning, and whether it would hold up in deployment. Closing these gaps is the motivation for the architecture I designed." width="800" height="215"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;The Architecture I Designed: LA-ReconVLA&lt;/h2&gt;


&lt;p&gt;The research question I set myself: can we replace gaze-region supervision with language-driven attention masking, deriving reconstruction targets that are semantically grounded in the task instruction, while replacing the diffusion transformer with a computationally efficient MAE decoder?&lt;/p&gt;


&lt;p&gt;The two problems addressed simultaneously: annotation dependency and inference overhead.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9gngvu9ln1axsdaguegc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9gngvu9ln1axsdaguegc.png" alt="LA-ReconVLA uses the backbone's own cross-attention maps over the instruction text to identify which image patches to mask and reconstruct. No eye-tracking. No external annotation. No heuristic bounding boxes. The reconstruction target is derived endogenously from the model's own language understanding, then reconstructed in a single forward pass by a lightweight MAE decoder." width="800" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1odllhb0w9z62lm7733.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1odllhb0w9z62lm7733.png" alt="core difference of LA-reconVla and reconVla" width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;How It Works, Step by Step&lt;/h2&gt;


&lt;h4&gt;1. Extract cross-attention maps from the backbone&lt;/h4&gt;
&lt;p&gt; Using PaliGemma-3B as the backbone, I extract cross-attention scores between language tokens and image patch tokens from the last 3 transformer layers. These are aggregated across all language tokens and attention heads to produce a single saliency map A over the 196 patch positions (a 14x14 grid for a 224x224 image). The aggregation uses the last 3 layers specifically to reduce noise from the frozen earlier layers.&lt;/p&gt;



&lt;h4&gt;2. Apply attention-guided masking&lt;/h4&gt;

&lt;p&gt; Select the top 49 patches from the saliency map: the top 25% of the image by cross-attention score. These patches are semantically grounded in the instruction because they come directly from the backbone's own language understanding. The word "bowl" in the instruction produces high attention weights on patches containing bowl-like features. The binary mask M produced by this process is the reconstruction target.&lt;/p&gt;
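&lt;p&gt;A minimal sketch of steps 1 and 2 together (NumPy arrays stand in for the backbone's real attention tensors; the shapes follow the text, a 14x14 grid of 196 patches aggregated over 3 layers, while the head and token counts are purely illustrative):&lt;/p&gt;

```python
import numpy as np

def attention_guided_mask(cross_attn, top_frac=0.25):
    """Aggregate cross-attention into a saliency map, then mask the
    top-attended patches.

    cross_attn: array of shape (layers, heads, text_tokens, patches),
    e.g. scores taken from the backbone's last 3 layers. Returns a
    binary mask over the 14x14 patch grid.
    """
    saliency = cross_attn.mean(axis=(0, 1, 2))   # -> (patches,)
    k = int(top_frac * saliency.size)            # 49 of 196 patches
    top = np.argsort(saliency)[-k:]              # highest-attention indices
    mask = np.zeros(saliency.size, dtype=bool)
    mask[top] = True
    return mask.reshape(14, 14)

# Illustrative shapes: 3 layers, 8 heads, 12 instruction tokens, 196 patches
rng = np.random.default_rng(0)
mask = attention_guided_mask(rng.random((3, 8, 12, 196)))
```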



&lt;h4&gt;3. Single-pass MAE decoder reconstruction&lt;/h4&gt;

&lt;p&gt; A 4-layer transformer decoder (hidden dimension 256, 8 attention heads) receives unmasked patch tokens from the backbone and learnable mask tokens at masked positions. It reconstructs pixel values at masked positions in a single forward pass. Reconstruction loss is pixel MSE over the masked region. For spatial grounding, coarse reconstruction at correct locations suffices. The geometry matters more than photorealism.&lt;/p&gt;
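&lt;p&gt;The loss itself is simple enough to state in a few lines (a sketch, with NumPy standing in for the training framework's tensors):&lt;/p&gt;

```python
import numpy as np

def masked_reconstruction_loss(pred_pixels, true_pixels, mask):
    """Pixel MSE over the masked region only, as described above.

    Unmasked positions contribute nothing, so the decoder is graded
    purely on the patches it actually had to reconstruct.
    """
    diff = (pred_pixels - true_pixels)[mask]
    return float((diff ** 2).mean())
```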



&lt;h4&gt;4. Joint training with action prediction&lt;/h4&gt;

&lt;p&gt; The total loss combines action prediction and reconstruction with a weighting hyperparameter. Action prediction uses cross-entropy over discretised action bins (7 degrees of freedom x 256 bins per DoF). Lambda defaults to 0.5 with ablations planned at 0.1 and 1.0.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhn19ziy48vtu6u0zskx6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhn19ziy48vtu6u0zskx6.png" alt="Figure 1. Full forward pass through LA-ReconVLA. The attention masker replaces all external gaze annotation. The MAE decoder replaces the diffusion transformer with a single forward pass." width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;Why This Should Work: The Theoretical Reasoning&lt;/h2&gt;


&lt;p&gt;I want to be honest that this is a hypothesis until the experiments say otherwise. But the theoretical grounding is solid across four independent arguments.&lt;/p&gt;


&lt;h3&gt;Self-supervised learning tells us this will help&lt;/h3&gt;


&lt;p&gt;Masked Autoencoder (MAE) research established that masking semantically meaningful regions produces stronger visual representations than masking random patches or using contrastive objectives. By masking specifically the patches the language model attends to when processing the instruction, we create the hardest and most informative prediction problem we can construct without external labels. The backbone has to predict task-relevant content or fail at reconstruction.&lt;/p&gt;


&lt;h3&gt;Information bottleneck creates the right pressure&lt;/h3&gt;


&lt;p&gt;Masking high-attention patches and requiring their reconstruction creates an information bottleneck. The backbone must retain spatial information in its latent representations that it would otherwise be free to compress away. This regularisation pressure pushes the backbone toward encoding geometric structure as a side effect of minimising reconstruction loss.&lt;/p&gt;


&lt;h3&gt;Direct gradients are better than multi-step gradients&lt;/h3&gt;


&lt;p&gt;In diffusion models, gradients flow through T denoising timesteps before reaching the encoder. Each step introduces noise into the gradient signal. The MAE decoder provides direct, single-step gradients back to the backbone. Theoretically, this produces more stable and efficient training.&lt;/p&gt;


&lt;h3&gt;Attention-guided masking creates a self-reinforcing loop&lt;/h3&gt;


&lt;p&gt;Using attention maps as masking targets creates a productive feedback cycle. The attention map determines what is masked. The reconstruction loss improves backbone features. Better backbone features produce sharper, more semantically coherent attention maps in the next forward pass. The system's grounding quality should improve during training as a natural consequence of the architecture.&lt;/p&gt;
&lt;br&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Total training objective (note: `lambda` is a reserved word in
# Python, so the weighting hyperparameter is written `lam` here)
L_total = L_action + lam * L_recon

# Where:
#   L_action = cross-entropy over discretised action bins (7 DoF x 256 bins)
#   L_recon  = pixel MSE(decoder_output, original_pixels), masked patches only
#   lam      = 0.5 by default (ablations: 0.1, 0.5, 1.0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;The Experiments I am Running&lt;/h2&gt;


&lt;p&gt;I designed four experimental conditions on LIBERO-Spatial, training on 3 tasks x 50 demonstrations, running on a single T4 GPU.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8fssqx8w7o5pykumeg6s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8fssqx8w7o5pykumeg6s.png" alt="I designed four experimental conditions on LIBERO-Spatial, training on 3 tasks x 50 demonstrations, running on a single T4 GPU." width="800" height="544"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;The ablation in Condition 2 is the experiment I care about most. If random masking performs as well as attention-guided masking, it means the performance gain comes from the auxiliary task structure, not from language grounding. If attention-guided masking wins, it validates the core hypothesis. This is precisely the ablation that was missing from ReconVLA.&lt;/p&gt;


&lt;h2&gt;On Accessibility and Reproducibility&lt;/h2&gt;


&lt;p&gt;One thing that struck me about ReconVLA's experimental setup: it requires 8 A100 80GB GPUs and 2 million training samples. That is a real barrier. Most academic groups cannot reproduce it, let alone extend it. Scientific iteration requires accessibility.&lt;/p&gt;


&lt;p&gt;LA-ReconVLA is designed to run on a single T4 (Google Colab). The architectural choices that make this possible are not compromises: the MAE decoder is lighter than a diffusion transformer by design, PaliGemma-3B is smaller and partially frozen to reduce gradient computation, and the training pipeline avoids the large pretraining dataset requirement by relying on the backbone's pretrained language understanding instead.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9rco9m3dzdy7313fswog.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9rco9m3dzdy7313fswog.png" alt="Reproducibility commitment&amp;lt;br&amp;gt;
All code, configs, and random seeds will be published once the experimental phase is complete. The goal is that any researcher with access to a single GPU can reproduce and extend this work." width="800" height="188"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;What Comes Next&lt;/h2&gt;


&lt;p&gt;The experiments are running. Part 2 of this work will share full quantitative results across all four conditions, latency benchmarks against ReconVLA, attention visualisations comparing AOS scores, and an honest analysis of where the method falls short.&lt;/p&gt;


&lt;p&gt;There is a known limitation worth naming now: LA-ReconVLA assumes cross-attention maps are extractable from the backbone. Architectures without explicit cross-attention require adaptation, for example falling back to self-attention over image tokens. I have documented this in the design and will report on it during implementation. Real-robot validation is deferred to future work. For now, this is simulation-only.&lt;/p&gt;


&lt;p&gt;If you work on VLA models, robotic manipulation, or self-supervised visual representation learning, I would genuinely like to hear from you. The hypothesis space here is large and I do not think one architecture will be the final answer. But I do think eliminating the gaze annotation dependency and the diffusion overhead is the right direction, and I think the ablation design will tell us something we did not know before.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftk4oxz8965aowrdu0r9k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftk4oxz8965aowrdu0r9k.png" alt="Language models already know what to attend to in a scene. They produce cross-attention maps over image patches every time they process an instruction. The bet here is that using those maps to derive reconstruction targets produces better visual grounding than imposing targets from the outside, and that a single-pass decoder is all the geometry signal the backbone needs." width="800" height="324"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;This is an ongoing independent research experiment. Results, code, and full experimental logs will be published once the implementation phase is complete.&lt;/p&gt;



&lt;blockquote&gt;
&lt;p&gt;Vision-Language Models · Robot Manipulation · Self-Supervised Learning · MAE · LIBERO Benchmark · Open-Source AI&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>ai</category>
      <category>nlp</category>
    </item>
    <item>
      <title>Building an LLM From Scratch for Indic Languages: What No One Tells You About the Hard Parts</title>
      <dc:creator>Daud Ibrahim</dc:creator>
      <pubDate>Sat, 14 Mar 2026 11:39:16 +0000</pubDate>
      <link>https://dev.to/daud_ibrahim_9887/building-an-llm-from-scratch-for-indic-languages-what-no-one-tells-you-about-the-hard-parts-1e85</link>
      <guid>https://dev.to/daud_ibrahim_9887/building-an-llm-from-scratch-for-indic-languages-what-no-one-tells-you-about-the-hard-parts-1e85</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;Context: In 2023, I was part of the core team at Krutrim (OLA's AI subsidiary) working on pre-training India's first large language model with deep Indic language coverage. This series documents not just the technical pipeline, but the decision-making framework behind each stage: the tradeoffs we reasoned through, the dead ends we hit, and why many of the choices we made were genuinely novel, because the reference material simply did not exist yet.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There is a particular kind of engineering challenge that sits at the intersection of research ambiguity and production pressure. Pre-training a large language model from scratch is one of them. Pre-training one for Indian languages, in 2023, without precedent, without benchmarks, without existing tokenizers, and without a community of practitioners who had done it before: that is a different challenge altogether.&lt;/p&gt;


&lt;p&gt;This article is the first in a series. I want to use it to do something most technical writing does not do well: explain not what we built, but how we thought about what to build, and why the sequencing of decisions mattered as much as the decisions themselves.&lt;/p&gt;


&lt;p&gt;If you are reading this as a practitioner considering your own pre-training run, I hope this gives you a map. If you are reading this as someone evaluating the depth and originality of this work, I hope it demonstrates that what we did at Krutrim was genuinely first-principles engineering, not a recipe followed from a paper.&lt;/p&gt;





&lt;h2&gt;The question before the pipeline: why scratch?&lt;/h2&gt;


&lt;p&gt;The first and most consequential decision in any LLM project is not architectural. It is strategic: do you fine-tune an existing model, or do you build from scratch?&lt;/p&gt;


&lt;p&gt;In 2023, the dominant English-centric models (LLaMA, Falcon, MPT) were already strong. The instinct from the outside is: adapt one of these. Fine-tune on Indic data, add some multilingual tokens, and call it done. We considered this seriously. We rejected it, and the reasoning behind that rejection shaped everything that followed.&lt;/p&gt;


&lt;p&gt;The problem is not that these models lacked Indic training data. The problem is that their tokenizers were trained on English-dominated corpora. A tokenizer is not just a pre-processing step; it is the model's alphabet. When you take a Hindi or Tamil sentence and pass it through a BPE tokenizer trained overwhelmingly on English, you do something quietly catastrophic: you fragment the text into sub-word units that carry no semantic coherence in the target language. A single Devanagari word might tokenize into six, seven, or eight individual tokens. This is what researchers call the &lt;strong&gt;fertility problem&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwphrsdxa3emykbaw56ip.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwphrsdxa3emykbaw56ip.png" alt="Fertility refers to the average number of tokens required to represent a single word. For English words in GPT-2's tokenizer, fertility is typically 1.0–1.3. For Hindi words in the same tokenizer, fertility can exceed 4.0. This means the model's context window — its effective working memory — is consumed four times faster on Hindi text. You lose most of the model's capacity before a single inference is complete. Fine-tuning cannot fix this, because the tokenizer's vocabulary is frozen at pre-training." width="800" height="227"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;This single observation, that tokenizer fertility is a structural bottleneck and not a tunable hyperparameter, made the decision for us. If we wanted a model that could reason in Hindi, Tamil, Telugu, Kannada, Bengali, and the other major Indic languages with genuine capability, we had to own the full stack. That meant building our own tokenizer, on our own data, from the ground up. And once you commit to building your own tokenizer, you have committed to pre-training from scratch. There is no shortcut from that point.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fexsqqkbdjtqaalfjcp0t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fexsqqkbdjtqaalfjcp0t.png" alt="Leadership Insight&amp;lt;br&amp;gt;
The decision to build from scratch was not driven by ambition. It was driven by a clear-eyed analysis of what fine-tuning structurally cannot fix. Knowing when adaptation is insufficient — and having the conviction to make the harder call — is the work of a technical leader, not an implementer." width="800" height="170"&gt;&lt;/a&gt;&lt;/p&gt;





&lt;h2&gt;The full pipeline, and why the order is not arbitrary&lt;/h2&gt;


&lt;p&gt;Before going deep on each stage, I want to lay out the complete sequence and explain why the ordering matters. This is not a waterfall process but a cycle; even so, the initial ordering reflects a set of hard dependencies that are not immediately obvious.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc94bn74trx26lve2zvji.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc94bn74trx26lve2zvji.png" alt="Data Collection→Deduplication→Quality Filtering→Tokenizer Training→Architecture Design→Infra &amp;amp; Parallelism→&amp;lt;br&amp;gt;
Small-Model Cycle→Base Model Eval→Instruction Fine-Tuning→Alignment→&amp;lt;br&amp;gt;
Benchmark Eval" width="800" height="156"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;Each arrow in that sequence represents a dependency. You cannot train your tokenizer before your data is filtered, because the tokenizer's vocabulary will be contaminated by noise. You cannot finalize your architecture before your tokenizer is trained, because context window efficiency depends on vocabulary size and fertility. You cannot run large-scale pre-training before you have validated your pipeline on a small model, because at that scale a bug does not just waste a training run; it wastes weeks of GPU time and real money.&lt;/p&gt;


&lt;p&gt;Let me walk through the thinking at each stage.&lt;/p&gt;





&lt;h2&gt;Stage 1: Data collection – the invisible constraint&lt;/h2&gt;


&lt;p&gt;Data collection sounds like a solved problem. It is not, especially for Indic languages. In 2023, the Common Crawl corpus, the backbone of most large-scale LLM training, contained approximately 70% English content by volume. High-resource Indic languages like Hindi and Bengali had a meaningful but small footprint. Low-resource Indic languages like Maithili, Odia, or Sindhi were barely represented.&lt;/p&gt;


&lt;p&gt;This created an immediate strategic tension. We needed enough data to train a tokenizer with good Indic script coverage. We needed enough data to pre-train a model that would perform meaningfully across multiple languages. But we also had to be honest about the quality ceiling of web-scraped text: not all text is equal, and in the Indic web, the signal-to-noise ratio can be harsh. A significant portion of what gets scraped is transliterated text (Hindi written in Latin script rather than Devanagari), machine-translated content of poor quality, forum spam, and code-switched text that switches between English and a regional language mid-sentence without semantic coherence.&lt;/p&gt;


&lt;p&gt;Our collection strategy layered multiple sources with different trust levels. Crawled web data formed the high-volume, lower-trust base. Academic and journalistic corpora (digitized newspapers, government documents, educational texts) formed a higher-trust but lower-volume layer. We also seeded translation pairs from curated multilingual sources, not to train a translation model, but to give the model exposure to formally correct Indic text with semantic alignment to a known reference.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdrkpwbowy71bzmlte94x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdrkpwbowy71bzmlte94x.png" alt="What Did Not Exist in 2023&amp;lt;br&amp;gt;
Curated Indic language datasets of the quality comparable to The Pile (English) or ROOTS (multilingual) did not exist at scale for our target languages. There was no IndicGLUE-equivalent for pre-training data. We were assembling our own corpus from scratch, making quality judgements that had no established benchmark to validate against. Every filtration decision was, in some sense, a hypothesis." width="800" height="210"&gt;&lt;/a&gt;&lt;/p&gt;





&lt;h2&gt;Stage 2: Deduplication – the step most teams underweight&lt;/h2&gt;


&lt;p&gt;There is a common misconception that filtration and deduplication are the same stage. They are not. Deduplication is a structural quality problem; filtration is a content quality problem. Conflating them leads to a subtle but serious error: you can filter your data perfectly for content quality, and still train on a corpus where 30% of your documents are near-duplicates of each other.&lt;/p&gt;


&lt;p&gt;Why does this matter? Deduplication affects what the model memorizes versus what it generalizes. A model that sees the same news article 200 times, across different scraped mirrors, archives, and re-publications, will overfit to that document in a way that degrades its ability to generalize. More dangerously, it will assign high confidence to the specific phrasing and facts in that document, which is a form of hallucination amplification.&lt;/p&gt;


&lt;p&gt;We applied deduplication at multiple granularities: exact-match document deduplication (trivial but necessary), near-duplicate detection using MinHash LSH at the document level, and paragraph-level fuzzy matching to catch templated or boilerplate content, such as the repetitive legal disclaimers and website footers that scrapes accumulate in large quantities. Each pass reduced our corpus, which meant tighter tradeoffs between data volume and data quality.&lt;/p&gt;
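&lt;p&gt;For readers who have not implemented it, the MinHash core of near-duplicate detection fits in a few lines. This is a standard-library-only sketch with invented example documents; a production system adds LSH banding on top so that candidate pairs can be found without comparing every document against every other.&lt;/p&gt;

```python
import hashlib

def shingles(text, k=5):
    """Character k-grams of a document."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash_signature(doc, num_hashes=64, k=5):
    """One minimum value per seeded hash function over the shingle set."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles(doc, k)
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

# Invented documents: a near-duplicate pair and an unrelated one.
a = "the same news article scraped from one mirror site"
b = "the same news article scraped from another mirror site"
c = "completely unrelated text about tokenizer fertility"
print(estimated_jaccard(minhash_signature(a), minhash_signature(b)))  # high
print(estimated_jaccard(minhash_signature(a), minhash_signature(c)))  # near zero
```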


&lt;h2&gt;Stage 3: Data filtration – making quality decisions without a ground truth&lt;/h2&gt;


&lt;p&gt;Filtration is where the team's judgment matters most, because there is no universally correct filtration strategy. You are making a series of probabilistic bets about what "quality" means for the task ahead.&lt;/p&gt;


&lt;p&gt;The filters we applied operated on several dimensions. Language identification was the first gate: we used a combination of fastText language identification and script-based heuristics to verify that a document was genuinely in its claimed language. This is surprisingly non-trivial for Indic text, where multiple scripts can represent the same language (Urdu and Hindi share a large vocabulary but use different scripts), and where transliteration is common.&lt;/p&gt;
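&lt;p&gt;The script-based side of that gate can be sketched simply: Unicode character names give a workable proxy for script membership. This illustration covers only the script heuristic; the actual gate combined signals like this with fastText language identification predictions.&lt;/p&gt;

```python
import unicodedata

def script_profile(text):
    """Fraction of alphabetic characters per Unicode script, approximated
    by the first word of each character's Unicode name (e.g. DEVANAGARI,
    TAMIL, LATIN). Combining marks such as vowel signs are not counted."""
    letters = [ch for ch in text if ch.isalpha()]
    counts = {}
    for ch in letters:
        script = unicodedata.name(ch, "UNKNOWN").split()[0]
        counts[script] = counts.get(script, 0) + 1
    return {s: n / len(letters) for s, n in counts.items()} if letters else {}

# A code-switched sentence yields a mixed profile, flagging it for review.
print(script_profile("यह वाक्य hindi में है"))
```

&lt;p&gt;A document claiming to be Hindi but with a mostly-Latin profile is either Romanized or mislabeled, and either way needs different handling than native Devanagari text.&lt;/p&gt;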


&lt;p&gt;The second dimension was content quality scoring. We trained lightweight classifiers (essentially shallow models) on small samples of curated, high-quality text to score documents on fluency, coherence, and absence of spam markers. These classifiers were trained independently for each target language, because quality signals in Hindi look different from quality signals in Tamil.&lt;/p&gt;


&lt;p&gt;The third dimension was toxicity and harm filtering. This is a different problem at the filtration stage than it is at the alignment stage. At filtration, we were removing documents that were overwhelmingly harmful: extremist content, explicit material, and coordinated disinformation. We were not trying to make the model refuse harmful questions at this stage; that work happens later. But if you pre-train on a corpus saturated with hate speech, you create an alignment problem that is much harder to fix during fine-tuning.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgyqrbbscipb1fc1xfvih.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgyqrbbscipb1fc1xfvih.png" alt="The Iteration Principle&amp;lt;br&amp;gt;
We learned quickly that filtration cannot be designed perfectly in the first pass. The right approach is to run a small model training on a filtered slice of data, evaluate its output qualitatively, and use the model's failure modes as a signal to revise the filtration criteria. Bad filtration shows up as model behaviour before it shows up in any pre-training metric. This is why the small-model experimentation cycle feeds back into filtration — not just forward into larger training runs." width="800" height="226"&gt;&lt;/a&gt;&lt;/p&gt;





&lt;h2&gt;Stage 4: Tokenizer training – the decision with the longest shadow&lt;/h2&gt;


&lt;p&gt;The tokenizer is the most consequential single artefact in the entire pipeline, because it is fixed at pre-training and cannot be meaningfully changed afterwards without retraining. Every downstream decision, from context window efficiency to model capacity to generation speed, is shaped by what the tokenizer does.&lt;/p&gt;


&lt;p&gt;For Indic languages, the core challenge is script diversity. Hindi, Marathi, Maithili, and Sanskrit all use Devanagari. Tamil, Telugu, Kannada, Malayalam, and Odia each use their own distinct script. Bengali shares a script with Assamese. The tokenizer must achieve good coverage across all of these without ballooning the vocabulary size to a point where the embedding table becomes prohibitively large, or the output softmax too expensive to compute.&lt;/p&gt;


&lt;p&gt;We chose a BPE (Byte Pair Encoding) approach with a vocabulary size calibrated against fertility targets: we wanted average fertility below 1.5 tokens per word across our target languages. Achieving this required careful corpus weighting during tokenizer training; if your tokenizer training corpus is dominated by English data, the BPE algorithm will allocate too many vocabulary slots to English sub-words at the expense of Indic sub-words.&lt;/p&gt;
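&lt;p&gt;The mechanism behind that corpus-weighting sensitivity is easiest to see in the textbook BPE algorithm, sketched here on whole words (a real tokenizer would operate at the byte level via a library). The toy corpus is invented; the point is that merge selection is frequency-driven, so whichever language dominates the training corpus wins the vocabulary slots.&lt;/p&gt;

```python
from collections import Counter

def learn_bpe_merges(corpus, num_merges):
    """Learn BPE merges: repeatedly fuse the most frequent adjacent symbol pair."""
    # Each word is a tuple of symbols, weighted by its corpus frequency.
    vocab = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged); i += 2
                else:
                    out.append(symbols[i]); i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

# Up-weighting the Indic text shifts which merges win the early slots.
corpus = ["नमस्ते"] * 10 + ["hello"] * 2
print(learn_bpe_merges(corpus, 3))
```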


&lt;p&gt;We also had to make decisions about how to handle code-switching, which is endemic in modern Indian text. A natural conversation in Hindi on social media will freely mix Devanagari, Romanized Hindi, and English. We tested different strategies (treating Romanized Hindi as its own vocabulary segment, folding it into the Latin character set, or normalizing it back to Devanagari) and evaluated the downstream impact on the small model before committing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5l3ebvli1jof8zpu4nj9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5l3ebvli1jof8zpu4nj9.png" alt="No Existing Indic Tokenizer as Reference&amp;lt;br&amp;gt;
In 2023, there was no publicly available tokenizer trained specifically for the full diversity of Indic languages at the scale we needed. IndicBERT's tokenizer was built for a different model class and vocabulary size. We were navigating tokenizer design without a directly comparable reference — every fertility analysis and vocabulary decision was derived from first principles and internal experimentation." width="800" height="206"&gt;&lt;/a&gt;&lt;/p&gt;





&lt;h2&gt;Stage 5: Architecture decisions – choosing your tradeoffs before training&lt;/h2&gt;


&lt;p&gt;By the time you reach the architecture stage, you have made decisions that constrain your choices significantly. Your tokenizer vocabulary size determines your embedding dimension lower bound. Your target context length determines your memory requirements. Your available compute determines your viable parameter count.&lt;/p&gt;


&lt;p&gt;The core architectural question in 2023 was which transformer variant to build on. The landscape included MPT (MosaicML's Pretrained Transformer), Mistral, and the LLaMA family, each representing a different set of design choices.&lt;/p&gt;


&lt;h3&gt;Attention mechanism: MHA, GQA, or MQA?&lt;/h3&gt;


&lt;p&gt;Multi-Head Attention (MHA) is the original formulation: every head has its own key and value projections. Multi-Query Attention (MQA), introduced by Shazeer in 2019 and later used in PaLM, uses a single key-value head shared by all query heads, reducing KV cache size dramatically during inference. Grouped-Query Attention (GQA), used in LLaMA-2 and Mistral, is a compromise: groups of query heads share a key-value head.&lt;/p&gt;


&lt;p&gt;The tradeoff is not just performance; it is deployment economics. MQA reduces inference memory cost but can hurt model quality on certain tasks. GQA largely recovers that quality while still offering meaningful memory savings. For a model intended to run efficiently on production infrastructure (not just in a research lab), this was a practical decision, not a theoretical one. We chose GQA because our evaluation on small models showed acceptable quality degradation relative to MHA, and the inference efficiency gains were substantial enough to matter at deployment.&lt;/p&gt;
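&lt;p&gt;The deployment-economics side of this choice is simple arithmetic. A back-of-envelope sketch of per-request KV cache size, using an assumed illustrative 7B-class shape rather than our actual configuration:&lt;/p&gt;

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """KV cache size: 2 (K and V) * layers * kv_heads * head_dim * seq_len * batch,
    assuming fp16/bf16 storage (2 bytes per element)."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Assumed 7B-class shape for illustration: 32 layers, 32 query heads, dim 128.
layers, head_dim, seq, batch = 32, 128, 4096, 1
mha = kv_cache_bytes(layers, 32, head_dim, seq, batch)  # every query head has its own KV
gqa = kv_cache_bytes(layers, 8,  head_dim, seq, batch)  # 4 query heads share a KV head
mqa = kv_cache_bytes(layers, 1,  head_dim, seq, batch)  # all query heads share one KV head
print(f"MHA {mha / 2**30:.2f} GiB, GQA {gqa / 2**30:.2f} GiB, MQA {mqa / 2**30:.3f} GiB")
```

&lt;p&gt;At serving time this cache is paid per concurrent request, which is why a 4x reduction from GQA translates directly into throughput.&lt;/p&gt;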


&lt;h3&gt;Positional encoding: RoPE, ALiBi, or learned?&lt;/h3&gt;


&lt;p&gt;Learned positional embeddings, the default in the original transformer, do not generalize beyond the sequence length seen during training. This is a hard constraint. RoPE (Rotary Position Embedding), used in the LLaMA family and later in Mistral, encodes position as a rotation in the complex plane and tends to generalize better to sequences longer than its training context. ALiBi (Attention with Linear Biases) instead penalizes attention scores linearly with query-key distance, which also generalizes to longer sequences at inference. We chose RoPE for its stronger empirical track record across the language modelling tasks closest to our use case.&lt;/p&gt;
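&lt;p&gt;As a sketch of the mechanism: RoPE rotates consecutive feature pairs by position-dependent angles, and because queries and keys are rotated the same way, their dot product ends up depending only on relative position. A minimal single-head illustration:&lt;/p&gt;

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive feature pairs (x_2i, x_2i+1) by pos * base^(-2i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

q = [1.0, 0.0, 0.5, -0.5]
print(rope(q, pos=3))
# Rotation preserves the norm, and dot(rope(q, m), rope(k, n)) depends only
# on m - n: shifting both positions by the same offset leaves it unchanged.
```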


&lt;h3&gt;Normalization and activation&lt;/h3&gt;


&lt;p&gt;Layer normalization placement (pre-norm vs post-norm) has a direct impact on training stability. Post-norm (as in the original transformer) is notoriously difficult to train at large scale without careful learning rate management. Pre-norm, used in all the modern model families above, significantly improves stability. We used RMSNorm (a simplified variant of LayerNorm that omits the mean-centering step), which is computationally cheaper and has been shown to match LayerNorm quality. For the activation function we chose SwiGLU, the gated linear unit variant used in PaLM and LLaMA, which consistently outperforms ReLU and GELU on language modelling benchmarks.&lt;/p&gt;
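&lt;p&gt;For readers unfamiliar with the two components, here are deliberately toy one-dimensional versions showing just the operations. Real implementations are tensorized, with a learned gain vector for RMSNorm and learned weight matrices for SwiGLU; the numbers below are illustrative.&lt;/p&gt;

```python
import math

def rmsnorm(x, gain=None, eps=1e-6):
    """RMSNorm: rescale by the root mean square only. Unlike LayerNorm
    there is no mean-centering step, which is what makes it cheaper."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    g = gain if gain is not None else [1.0] * len(x)
    return [gi * v / rms for gi, v in zip(g, x)]

def swiglu(x, W, V):
    """SwiGLU gate for a scalar toy input: silu(x*w) * (x*v) per unit.
    In a real FFN, x*W and x*V are matrix products."""
    silu = lambda z: z / (1 + math.exp(-z))
    return [silu(x * w) * (x * v) for w, v in zip(W, V)]

print(rmsnorm([3.0, 4.0]))       # rescaled so the output RMS is ~1
print(swiglu(2.0, [0.5, -0.5], [1.0, 1.0]))
```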

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F871pq8o664blqc9edae4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F871pq8o664blqc9edae4.png" alt="The Architecture Decision Framework&amp;lt;br&amp;gt;
We did not pick architectural components based on what was newest or most cited. We evaluated each component against three criteria: training stability impact, inference cost, and quality on our small-model proxy tasks in Indic languages. An architectural choice that works well for English is not guaranteed to work equally well when your token vocabulary and linguistic structure are fundamentally different." width="800" height="198"&gt;&lt;/a&gt;&lt;/p&gt;





&lt;h2&gt;Stage 6: Infrastructure – parallelism is not optional at scale&lt;/h2&gt;


&lt;p&gt;Pre-training at scale is not a single-GPU problem. Even a 7 billion parameter model does not fit in the memory of a single A100 GPU once you account for activations, gradients, and optimizer states. You need distributed training, and distributed training is one of the most operationally complex aspects of the entire pipeline.&lt;/p&gt;
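&lt;p&gt;The arithmetic behind that claim is worth spelling out. Assuming standard mixed-precision Adam bookkeeping (exact byte counts vary by setup), weights, gradients, master weights, and optimizer moments alone exceed a single 80 GB A100 before a single activation is stored:&lt;/p&gt;

```python
def training_bytes_per_param(weight_bytes=2, grad_bytes=2,
                             master_weight_bytes=4, adam_state_bytes=8):
    """Mixed-precision Adam bookkeeping per parameter: fp16 weights + fp16
    gradients + fp32 master weights + fp32 first and second moments.
    Activation memory comes on top of this."""
    return weight_bytes + grad_bytes + master_weight_bytes + adam_state_bytes

params = 7e9
total = params * training_bytes_per_param()  # 16 bytes/param -> ~112 GB
print(total / 1e9, "GB of state vs 80 GB on a single A100")
```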


&lt;p&gt;The three fundamental dimensions of model parallelism are data parallelism (distributing batches across GPUs), tensor parallelism (splitting individual matrices across GPUs), and pipeline parallelism (splitting layers across GPUs). Each introduces different communication overhead and synchronization complexity. Getting this wrong does not just slow you down; it produces subtly incorrect gradients that corrupt training in ways that can take days to diagnose.&lt;/p&gt;


&lt;p&gt;We evaluated PyTorch's FSDP (Fully Sharded Data Parallel) and the Megatron-LM framework from NVIDIA. FSDP is more flexible and integrates naturally with the HuggingFace ecosystem. Megatron-LM has more mature tensor and pipeline parallelism implementations and better GPU utilization at large scale. Our decision to use Megatron-LM for the large-scale runs came down to one number: GPU utilization. At the scale we were operating, the difference between 40% and 60% MFU (Model FLOPs Utilization) is not a footnote; it is the difference between a training run that completes on budget and one that does not.&lt;/p&gt;


&lt;p&gt;Gradient clipping deserves specific mention here because it is often treated as a minor implementation detail when it is, in fact, a critical training stability lever. We clipped gradients by global norm, with a threshold set conservatively at 1.0. In practice, we monitored the gradient norm distribution throughout training and adjusted this threshold during the early warmup phase. Loss spikes (sudden, large increases in training loss that can partially or fully destabilize a run) are often preceded by gradient norm spikes. Building early-warning monitoring for this saved us from losing multiple training runs.&lt;/p&gt;
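&lt;p&gt;Global-norm clipping itself is a few lines; the part we leaned on operationally was surfacing the pre-clip norm as a monitoring signal. A minimal sketch on plain lists of gradients:&lt;/p&gt;

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale all gradient tensors jointly so their combined L2 norm is at
    most max_norm. Returns the pre-clip norm: the monitoring signal."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if global_norm <= max_norm:
        return grads, global_norm
    scale = max_norm / global_norm
    return [[g * scale for g in grad] for grad in grads], global_norm

grads = [[3.0, 4.0], [0.0, 12.0]]                 # global norm = 13
clipped, norm = clip_by_global_norm(grads, max_norm=1.0)
print(norm)  # 13.0 -- a spike like this is the early-warning signal to alert on
```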





&lt;h2&gt;Stage 7: The small-model cycle – science before scale&lt;/h2&gt;


&lt;p&gt;This is the stage that distinguishes teams who have actually done pre-training from teams who have read about it. Before you commit GPU-weeks to training a large model, you train small models (in our case, 125M, 350M, and 1B parameter configurations) on representative data slices, with identical pipeline configurations. These runs are not a warmup. They are a rigorous diagnostic environment.&lt;/p&gt;


&lt;p&gt;The small-model cycle lets you answer a specific set of questions that cannot be answered any other way. Is your data pipeline producing correctly shuffled, correctly formatted training batches? Is your tokenizer producing the expected fertility distribution across languages? Is your learning rate schedule appropriate for your batch size and model size? Are there silent bugs in your custom attention implementation that only manifest as subtly degraded perplexity curves?&lt;/p&gt;


&lt;p&gt;We used scaling laws, specifically the Chinchilla scaling laws, as a compass during this phase. Scaling laws give you a principled way to extrapolate from small-model results to predict large-model performance at a given compute budget. More importantly, they tell you whether your actual loss curves are tracking the theoretical prediction. A systematic deviation from the predicted scaling curve in a small model is a red flag that something in your pipeline is wrong, even if the training appears to be running normally.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The small-model cycle is not about building a worse version of the final model. It is about creating the conditions under which you can discover problems cheaply, before the cost of a mistake is measured in weeks of training time.&lt;/p&gt;
&lt;/blockquote&gt;
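&lt;p&gt;As a concrete reference point, the parametric Chinchilla fit can be written down directly. The coefficients below are the published values from Hoffmann et al. (2022); they were fit on English-dominated data, so for an Indic corpus we treated the resulting curve as a compass rather than ground truth.&lt;/p&gt;

```python
def chinchilla_loss(N, D, E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Predicted pre-training loss for N parameters trained on D tokens,
    using the parametric fit from Hoffmann et al. (2022): irreducible loss E
    plus terms that shrink with model size and data size."""
    return E + A / N**alpha + B / D**beta

# Compare a small proxy run against the curve; a systematic gap is the red flag.
# Chinchilla's compute-optimal ratio is roughly 20 tokens per parameter.
print(chinchilla_loss(N=1e9, D=20e9))    # ~compute-optimal 1B run
print(chinchilla_loss(N=7e9, D=140e9))   # ~compute-optimal 7B run
```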


&lt;p&gt;The loop at this stage was: train small model → evaluate on proxy tasks → identify failure modes → trace failure modes back through the pipeline (filtration? tokenization? data distribution?) → correct at source → retrain small model → repeat. We went through this loop multiple times before we were confident enough in the pipeline to scale up. The instinct to skip this step to move faster is understandable. It is also how teams lose their largest training runs to avoidable bugs.&lt;/p&gt;





&lt;h2&gt;Stage 8: Base model evaluation – before any fine-tuning&lt;/h2&gt;


&lt;p&gt;A base language model, the output of pre-training before any instruction tuning, is not an assistant. It is a distribution over text continuations. Evaluating it requires a different frame than evaluating a chat model.&lt;/p&gt;


&lt;p&gt;The primary signal for a base model is perplexity on held-out data, ideally data that is representative of each target language and domain and was not seen during training. But perplexity alone is insufficient, because a model can achieve low perplexity on Indic text by essentially memorizing high-frequency document structures without genuinely understanding the language.&lt;/p&gt;
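&lt;p&gt;For readers new to the metric: perplexity is the exponential of the average per-token negative log-likelihood, so a model that is uniformly uncertain scores exactly its vocabulary size. A minimal sketch:&lt;/p&gt;

```python
import math

def perplexity(token_logprobs):
    """exp of the average negative log-probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# A model uniformly uncertain over a 32k vocabulary scores ~32000;
# a strong base model scores orders of magnitude lower.
vocab = 32_000
uniform = [math.log(1 / vocab)] * 100
print(perplexity(uniform))  # ~32000
```

&lt;p&gt;This is also why per-language held-out sets matter: a single pooled perplexity number lets strength in one high-resource language mask weakness in the others.&lt;/p&gt;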


&lt;p&gt;We supplemented perplexity with few-shot evaluation on classification and generation tasks that require genuine linguistic understanding: natural language inference in Hindi, named entity recognition in multiple Indic scripts, and cross-lingual retrieval. These evaluations were largely hand-crafted by our team, because standardized Indic benchmarks of comparable quality to SuperGLUE or BIG-Bench simply did not exist in 2023.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxme8m5njmso7wuwqoyys.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxme8m5njmso7wuwqoyys.png" alt="The Benchmark Void&amp;lt;br&amp;gt;
One of the most structurally difficult aspects of working on Indic language models in 2023 was the absence of evaluation benchmarks. IndicGLUE covered some classification tasks but not the generation capabilities central to LLM evaluation. There was no Indic equivalent of MMLU, HellaSwag, or TruthfulQA. We were evaluating against standards we had to partially design ourselves — which means our evaluation was only as good as our ability to define what " width="800" height="236"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;We also ran language identification probes, a technique borrowed from interpretability research, to verify that different layers of the model had indeed developed language-specific representations. A model that genuinely understands Hindi will, in its intermediate representations, have clearly separable activation patterns for Hindi and English text. A model that has merely learned surface statistics of Indic characters will not. This kind of probing evaluation gave us confidence that the model was building genuine multilingual capability rather than pattern-matching on script identity.&lt;/p&gt;





&lt;h2&gt;Stage 9: Instruction fine-tuning – transforming a predictor into an assistant&lt;/h2&gt;


&lt;p&gt;A base language model predicts the next token. An assistant responds to instructions. These are fundamentally different behaviours, and the transition between them requires supervised fine-tuning on instruction-response pairs.&lt;/p&gt;


&lt;p&gt;This stage is often under-theorized. People treat it as "just fine-tuning." But the quality and coverage of your instruction data determine the quality of your assistant more directly than almost any architectural decision. In the Indic context, this problem is acute: there was, in 2023, essentially no publicly available, high-quality Indic instruction dataset. We had to construct our own: a combination of translated and culturally adapted English instruction datasets, human-written Indic instruction-response pairs, and programmatically generated instruction data verified by human review.&lt;/p&gt;


&lt;p&gt;The framing of fine-tuning data also matters in ways that are not obvious until you see the failure modes. An instruction dataset that only teaches the model to answer factual questions will produce a model that cannot handle open-ended creative tasks. A dataset that over-represents formal Hindi will produce a model that sounds stiff in casual conversation. Coverage across task types, registers, and languages is the goal, and achieving it with limited data resources requires careful curation rather than volume maximization.&lt;/p&gt;





&lt;h2&gt;Stage 10: Alignment – the values layer&lt;/h2&gt;


&lt;p&gt;Instruction fine-tuning makes the model helpful. Alignment makes it safe. These are distinct properties, and conflating them is a mistake that leads to either over-refusal (the model refuses legitimate requests because it confuses helpfulness with harmlessness) or under-refusal (the model answers genuinely dangerous questions because the safety training was superficial).&lt;/p&gt;


&lt;p&gt;In 2023, the dominant alignment approach was RLHF: Reinforcement Learning from Human Feedback. A reward model is trained on human preference comparisons, and the language model is then optimized using RL to maximize this reward. RLHF works, but it is operationally complex, requires a stable RL training loop on top of an already complex pre-training infrastructure, and is sensitive to reward model quality in ways that are hard to diagnose.&lt;/p&gt;


&lt;p&gt;Direct Preference Optimization (DPO), a relatively new alternative in 2023, reframes the preference learning problem as a direct classification objective, eliminating the need for an explicit reward model and a separate RL training loop. We evaluated both approaches in small-scale experiments and found DPO to be more training-stable and easier to iterate on, at the cost of some theoretical elegance. The practical choice was DPO.&lt;/p&gt;
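&lt;p&gt;The appeal of DPO is visible in how little machinery the objective needs. A sketch of the per-pair loss, with invented log-probabilities standing in for real policy and reference model scores:&lt;/p&gt;

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO per-pair loss: -log sigmoid(beta * (policy margin - reference margin)).
    No reward model, no RL loop: a classification-style objective that pushes the
    policy to prefer the chosen response more strongly than the frozen reference."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -math.log(1 / (1 + math.exp(-beta * margin)))

# Loss is low when the policy separates chosen from rejected more than
# the reference does, and high when it prefers the rejected response.
print(dpo_loss(-10.0, -14.0, -12.0, -12.0))  # policy prefers the chosen answer
print(dpo_loss(-14.0, -10.0, -12.0, -12.0))  # policy prefers the rejected one
```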


&lt;p&gt;Alignment for Indic models also has a cultural dimension that is invisible in the English-centric alignment literature. The definition of harmful content is not universal. What constitutes offensive speech, what topics are sensitive, and what refusals are appropriate vary significantly across the linguistic and cultural communities that speak Indic languages. Alignment data had to be designed with this heterogeneity in mind, and again there was no reference dataset to draw from. Every decision was made in-house.&lt;/p&gt;





&lt;h2&gt;Stage 11: Final evaluation – benchmarks and what they miss&lt;/h2&gt;


&lt;p&gt;After alignment, the model goes through final evaluation before any deployment decision. This is where you attempt to make objective, comparable claims about model capability.&lt;/p&gt;


&lt;p&gt;For English LLMs, this is a solved infrastructure problem. There are standard benchmarks (MMLU, HellaSwag, HumanEval, GSM8K, TruthfulQA), standard evaluation harnesses, and community results to compare against. For Indic LLMs in 2023, we were in a different situation. We adapted existing benchmarks where Indic-language versions existed, constructed evaluation sets in-house where they did not, and were transparent about the limitations of each approach.&lt;/p&gt;


&lt;p&gt;One of the most important things I learned through this process is that benchmark performance and real-world utility have a complex relationship. A model can score well on standardized benchmarks while still producing outputs that feel unnatural to native Indic language speakers, because the benchmarks do not capture pragmatics, tone, register, or cultural appropriateness. The final signal we trusted most was qualitative evaluation by fluent speakers across our target languages, which cannot be reduced to a number but which surfaces failure modes that no benchmark catches.&lt;/p&gt;





&lt;h2&gt;What this series covers next&lt;/h2&gt;


&lt;p&gt;This overview has been intentionally high-level. Each stage I have described here is a month's worth of work, a set of decisions that could fill its own technical article, and a collection of mistakes that were instructive precisely because we did not have a playbook to follow. The subsequent articles in this series will go deep on each one:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Part 2: Building a Corpus for Languages the Internet Forgot&lt;br&gt;
Data collection strategy, trust-level taxonomy, cross-script normalization, and the specific decisions we made to curate a high-quality Indic pre-training corpus from a noisy web.&lt;/p&gt;
&lt;/blockquote&gt;



&lt;blockquote&gt;
&lt;p&gt;Part 3: Deduplication and Filtration at Scale&lt;br&gt;
MinHash LSH, perplexity-based filtering, language identification challenges, and how we iterated filtration criteria using small-model feedback.&lt;/p&gt;
&lt;/blockquote&gt;



&lt;blockquote&gt;
&lt;p&gt;Part 4: Tokenizer Design for Indic Scripts&lt;br&gt;
Fertility analysis, vocabulary size calibration, code-switching handling, and the specific engineering decisions behind our custom BPE tokenizer.&lt;/p&gt;
&lt;/blockquote&gt;



&lt;blockquote&gt;
&lt;p&gt;Part 5: Architecture From First Principles&lt;br&gt;
GQA vs MHA tradeoffs, RoPE positional encoding, RMSNorm, SwiGLU, and the full reasoning behind every architectural decision — including what we got wrong the first time.&lt;/p&gt;
&lt;/blockquote&gt;



&lt;blockquote&gt;
&lt;p&gt;Part 6: Distributed Training and Infrastructure&lt;br&gt;
Tensor, pipeline, and data parallelism, Megatron-LM configuration, gradient clipping strategy, loss spike diagnosis, and checkpoint management.&lt;/p&gt;
&lt;/blockquote&gt;



&lt;blockquote&gt;
&lt;p&gt;Part 7: The Small-Model Cycle: Debugging Before Scale&lt;br&gt;
How we used 125M–1B parameter proxy runs to validate every pipeline component before committing to large-scale training.&lt;/p&gt;
&lt;/blockquote&gt;



&lt;blockquote&gt;
&lt;p&gt;Part 8: Evaluating a Base Model Without Benchmarks&lt;br&gt;
Perplexity, few-shot evaluation, language identification probing, and constructing our own evaluation framework in the absence of standard Indic benchmarks.&lt;/p&gt;
&lt;/blockquote&gt;



&lt;blockquote&gt;
&lt;p&gt;Part 9: Instruction Tuning, Alignment, and Final Evaluation&lt;br&gt;
Building Indic instruction datasets from scratch, DPO vs RLHF, cultural dimensions of alignment, and what benchmarks miss about real-world model quality.&lt;/p&gt;
&lt;/blockquote&gt;





&lt;br&gt;
    &lt;p&gt;&lt;strong&gt;A note on timing.&lt;/strong&gt; Everything described in this series happened in 2023, at a moment when the Indic language AI ecosystem was sparse enough that many of the tools and reference materials practitioners now take for granted did not yet exist. There were no high-quality Indic instruction datasets. There were no standardized benchmarks. There were no tokenizers designed for the full breadth of Indic scripts at the vocabulary sizes required for large-scale pre-training. Working in that environment forced a level of first-principles thinking that I think is worth documenting, not because the specific solutions we found are the only solutions, but because the decision-making process behind them is a useful map for anyone operating at the frontier of a new domain, before the community has converged on a standard playbook.&lt;/p&gt;
&lt;br&gt;
    &lt;p&gt;&lt;strong&gt;If this resonates with challenges you are working through&lt;/strong&gt;, whether in Indic AI, low-resource language modelling, or LLM infrastructure more broadly, I would welcome the conversation. The subsequent articles will go considerably deeper. The goal is not to be comprehensive, but to be honest about what the work actually required.&lt;/p&gt;
&lt;br&gt;
  


</description>
      <category>nlp</category>
      <category>ai</category>
      <category>deeplearning</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
