Stage 1: Data Collection, the Invisible Constraint
"Everyone wants to talk about model architecture. Nobody wants to talk about where the data comes from."
I'm going to be honest with you before we begin.
When most people imagine training a large language model from scratch, the mental image is dramatic: massive GPU clusters, elegant transformer architectures, loss curves smoothly descending toward something brilliant. The hard part, in this fantasy, is the math.
It isn't.
The hard part, the part that will consume more of your time, more of your engineering effort, and more of your emotional energy than anything else in this entire journey, is data. Specifically: finding it, cleaning it, filtering it, deduplicating it, detecting its language reliably, organising it into something a model can actually learn from, and then discovering that what you thought was enough is nowhere close to enough.
And if you're building for Indic languages specifically? The constraint becomes even more brutal. The web wasn't built for Hindi. It wasn't built for Tamil, Telugu, Kannada, Malayalam, Odia, or Assamese. It was built in English, by English-speaking companies, largely for an English-speaking audience. Every data source you reach for will have this asymmetry baked into it.
This post is about what actually happens at Stage 1. Not the idealised version. The real one.
The First Thing You Reach For: CommonCrawl
If you've done any research into LLM training data, you've encountered CommonCrawl. It's a non-profit that has been crawling the web since 2008 and making the data freely available. It is, by a massive margin, the most cited data source in LLM training papers. GPT-3 used it. LLaMA used it. Falcon used it. Mistral's datasets are derived from it. It is, in the minds of most people starting this journey, the obvious answer to the data question.
So let's start there. Let's take it seriously and understand what it actually is, because the gap between what you imagine CommonCrawl to be and what it actually is in practice is where most data collection mistakes live.
What CommonCrawl Actually Is: Inside the Format
CommonCrawl releases monthly crawls. Each crawl is somewhere in the range of 200–400 terabytes of compressed data. Across all crawls since 2008, the archive is measured in petabytes. This sounds extraordinary. It is extraordinary. It is also not what you think it is.
Each crawl is distributed across three file formats:
WARC (Web ARChive) files are the raw captures. They contain the full HTTP response headers, HTML markup, JavaScript, CSS, inline images (as base64 or references), cookies, everything. A single WARC file can be several hundred megabytes. The crawl index lists hundreds of thousands of them. If you download WARC files, you are downloading the raw internet: HTML tags, JavaScript boilerplate, navigation menus, cookie consent popups, SEO filler, and, somewhere buried in all of it, actual human-written text.
WET (Web Extracted Text) files are what most NLP practitioners use. CommonCrawl pre-processes the WARC files and strips out HTML tags, JavaScript, and CSS, leaving only the extracted plain text. This is your primary starting point for language model training data. WET files are significantly smaller than WARCs, but they still contain enormous amounts of noise — boilerplate navigation text, footer content, repetitive legal disclaimers, advertising copy, machine-translated content, and gibberish.
WAT (Web Archive Transformation) files contain metadata and link graphs extracted from the WARC files. These are useful for deduplication and for understanding the web's link structure, but not directly for training text.
The Scale Problem: What You Actually Have to Download
Here is where the practical reality starts diverging sharply from the theory.
A single CommonCrawl crawl is organised into segments: typically around 90,000 WARC files or around 72,000 WET files per crawl. Each WET file is around 150-300MB compressed. If you want to process a full crawl, just one, you are looking at roughly 15-20TB of WET data, compressed. Uncompressed, you're working with several times that.
You cannot download this on a home connection. You cannot store it on a local machine. You cannot process it without distributed compute. The practical pipeline looks like this:
CommonCrawl S3 (s3://commoncrawl/)
↓
Stream or download segments via AWS (CC is hosted in us-east-1)
↓
EC2 / Spark cluster for processing
↓
Your own S3 bucket (processed, filtered output)
CommonCrawl itself is hosted on Amazon S3, which is convenient if you're running AWS workloads: you can stream data from s3://commoncrawl/ to EC2 instances within the same region without paying egress costs. If you're not on AWS, you'll pay for egress, which will add up fast at this scale.
The Language Detection Problem
Here's the first point where most tutorials gloss over the real difficulty. CommonCrawl does not tell you what language a document is in. You have to figure that out yourself.
The naive approach (looking at the URL it came from, or checking the HTML lang attribute) doesn't work reliably. HTML lang tags are frequently wrong, missing, or set to a default rather than the actual content language. A website with lang="en" in its header might serve content in five languages depending on which page you're looking at.
So you need language identification (LID) at the document level. The widely-used options are:
fastText's language identification model (lid.176.bin): trained on Wikipedia and Tatoeba in 176 languages. Fast, reasonably accurate, the industry workhorse. Runs at roughly 1 million sentences per second on a single CPU core. For most Indic scripts, it performs well because the scripts themselves are visually distinct. However, it struggles with romanised Indic text (Hinglish, Tanglish, etc.) and code-switched content.
Google's CLD3 (Compact Language Detector 3): a neural network-based detector. Slightly more accurate than fastText for short strings and mixed-script content. Available as a Python library (pycld3). Slower than fastText.
langdetect : a Python port of Google's language detection library. Works but has known reliability issues with short documents. Do not use this as your primary detector for a serious pipeline.
GlotLID: a newer model specifically designed for low-resource and Indic languages, supporting over 1600 languages. For building an Indic LLM, this is worth investigating over fastText because it handles scripts and languages that fastText was never well-trained on.
The practical approach for a serious Indic LLM pipeline is to run two detectors, fastText and GlotLID, and accept a document only when both agree on the language. This sacrifices some recall for significantly better precision. You will lose some data, but the data you keep will be reliably labelled.
A minimal language detection step looks roughly like this in Python:
import fasttext

# lid.176.bin: download from the fastText language identification page
ft_model = fasttext.load_model("lid.176.bin")

def detect_language(text: str) -> str | None:
    # Require minimum document length
    if len(text.split()) < 50:
        return None
    # fastText prediction (fastText rejects embedded newlines)
    ft_pred, ft_conf = ft_model.predict(text.replace("\n", " "), k=1)
    ft_lang = ft_pred[0].replace("__label__", "")
    # Only accept if confidence is high enough
    if ft_conf[0] < 0.85:
        return None
    return ft_lang

INDIC_LANGS = {"hi", "ta", "te", "kn", "ml", "bn", "mr", "gu", "pa", "or", "as"}
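The two-detector agreement rule is worth separating from model loading so it can be tested on its own. The sketch below is pure logic; the wrapper functions around the two models are assumed, and note that lid.176.bin and GlotLID use different label schemes (hi versus hin_Deva), so a mapping step is needed before comparing them:

```python
from typing import Callable, Optional

# A detector takes text and returns a language code, or None to abstain
Detector = Callable[[str], Optional[str]]

def detect_with_agreement(text: str, primary: Detector, secondary: Detector) -> Optional[str]:
    """Accept a label only when both detectors agree.

    Sacrifices recall for precision: documents the detectors disagree
    on are dropped rather than risk being mislabelled.
    """
    lang_a = primary(text)
    if lang_a is None:
        return None
    if secondary(text) != lang_a:
        return None
    return lang_a
```

In the pipeline, primary would wrap the fastText call above and secondary a GlotLID prediction mapped into the same label scheme.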
Quality Filtering: The Web Is Mostly Garbage
Even after language detection, the text you extract from CommonCrawl is noisy beyond what most people expect. Consider what the web actually contains:
- Boilerplate: Navigation menus repeated on every page. "Home | About | Contact | Privacy Policy" appearing thousands of times in your dataset.
- Duplicate content: The same news article syndicated to forty different websites, each one a nearly-identical copy.
- SEO-spam: Pages that are grammatically coherent but semantically meaningless, designed to rank in search engines rather than communicate information.
- Machine-translated content: Often poor quality, sometimes hallucinatory, always training-contaminating.
- Adult content: CommonCrawl is an unfiltered crawl of the web.
- Code: Often useful, but a different distribution from natural language text.
- Template text: E-commerce product pages that are 80% boilerplate and 20% description.
The standard quality filtering pipeline applies heuristics developed by papers like CCNet (Facebook) and later Gopher (DeepMind). The key signals are:
Document-level heuristics:
- Minimum word count (reject documents under 100-200 words)
- Maximum word count (reject documents that are pathologically long, often scraped indexes)
- Fraction of lines ending with punctuation (prose ends with periods; garbage doesn't)
- Average sentence length (too short = navigation menus; too long = templates)
- Symbol-to-word ratio (high symbol ratio = code or spam)
- Fraction of words in a stop-word list for the language (real prose uses many stop words; SEO spam doesn't)
- Fraction of words that appear in a reference vocabulary
N-gram deduplication:
- Sentence-level: remove documents where more than 30% of sentences appear verbatim elsewhere in the corpus
- Paragraph-level: same logic at a coarser granularity
- MinHash LSH for fuzzy near-duplicate detection at document level
Exact deduplication:
- SHA256 hashing of normalised document content
- Remove any document that appears more than once
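The document-level heuristics can be sketched as a single predicate. The thresholds below are illustrative starting points, not the exact values from CCNet or Gopher, and the stop-word set is supplied per language:

```python
def passes_quality_filters(
    text: str,
    stop_words: set[str],
    min_words: int = 100,
    max_words: int = 100_000,
    min_punct_line_frac: float = 0.6,
    max_symbol_ratio: float = 0.1,
    min_stop_word_frac: float = 0.05,
) -> bool:
    words = text.split()
    if not (min_words <= len(words) <= max_words):
        return False
    # Fraction of non-empty lines ending in terminal punctuation
    # (prose does; navigation menus don't). "।" is the Devanagari danda.
    lines = [ln.rstrip() for ln in text.splitlines() if ln.strip()]
    if lines:
        punct_frac = sum(ln[-1] in ".!?।" for ln in lines) / len(lines)
        if punct_frac < min_punct_line_frac:
            return False
    # Symbol-to-word ratio: high values suggest code or markup spam
    symbols = sum(text.count(c) for c in "#{}<>|\\")
    if symbols / len(words) > max_symbol_ratio:
        return False
    # Real prose leans heavily on function words; SEO spam doesn't
    stop_frac = sum(w in stop_words for w in words) / len(words)
    return stop_frac >= min_stop_word_frac
```

Tune every threshold per language and per source; a cut-off calibrated on English news will behave differently on colloquial Hindi.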
Each of these steps removes data. By the time you've applied language detection, quality filtering, and deduplication, you typically keep somewhere between 10% and 30% of the raw CommonCrawl text for a high-resource language like English. For Indic languages, the keep rate can be lower because the raw volume is smaller and the noise-to-signal ratio is often higher.
Which brings us to the real problem.
The Indic Language Gap: Why CommonCrawl Is Not Enough
Let me give you a number that will recalibrate your entire thinking about this project.
English makes up roughly 45–50% of CommonCrawl by document count, and an even higher fraction by word count since English documents tend to be longer. The next-largest languages — German, French, Spanish, Russian — each account for a few percent. Hindi, the most represented Indic language and one of the most spoken languages on Earth, accounts for roughly 1–2% of CommonCrawl. Tamil, Telugu, Kannada, and Malayalam are each likely under 0.5%.
Let's put this in concrete terms. If you process a single CommonCrawl crawl (the 15-20TB of compressed WET data described earlier) and apply standard quality filters, you might extract a few terabytes of clean English text. For Hindi, after the same pipeline, you might get somewhere in the range of tens to a few hundred gigabytes. For Tamil or Telugu, you might be looking at single-digit gigabytes.

For context: training a 7B parameter model to the current standard typically requires somewhere between 1 and 2 trillion tokens of training data. A token is roughly 0.75 words, so you're looking for approximately 750 billion to 1.5 trillion words. Devanagari takes roughly three bytes per character in UTF-8, so a gigabyte of Hindi text holds very roughly 50-70 million words, meaning a single crawl might yield a few billion to perhaps twenty billion words of Hindi, if the quality filters don't discard too much. Successive crawls overlap heavily, so you cannot simply multiply by the number of crawls. For a 7B model, that's not just short; it's an order of magnitude short.
For smaller Indic languages, CommonCrawl alone is simply insufficient. You need to build your own data pipeline from additional sources.
What the Big Labs Actually Use
Before we get into how to build your own pipeline, it's worth understanding what the leading labs are doing. This context matters because it shapes your intuition for what a "good" data strategy looks like.
OpenAI: WebText, Books, and the Data Behind GPT
OpenAI has been deliberately opaque about their training data, but what we know from published research and technical reports gives a useful picture.
GPT-3 (2020) was trained on a mix of filtered CommonCrawl (~570GB of text after cleaning), WebText2 (an expanded version of WebText, the dataset of Reddit-linked web pages created by crawling URLs that had been shared on Reddit and received at least three karma, a clever quality signal that filters for content humans found interesting), Books1 and Books2 (book corpora of unclear provenance), and English Wikipedia. CommonCrawl dominated by volume but was down-weighted in sampling to prevent it from overwhelming the curated sources.
GPT-4's training data has never been fully disclosed. Reporting suggests it includes significantly expanded web data, synthetic data generated by earlier GPT models, and curated datasets from paid partnerships with publishers and data providers. This is an important signal: at the frontier, purely scraped web data is not enough. You need curation, and curation at scale requires either human labour or earlier models.
OpenAI's data partnerships : OpenAI has signed deals with news publishers (AP, The Atlantic), book publishers, and other content owners to license their content for training. This is the quietly significant commercial strategy that open-source efforts genuinely cannot replicate.
Anthropic: Constitutional AI and Careful Data Curation
Anthropic has published relatively little about their exact data sources for Claude's pretraining, but several signals emerge from their published research.
Their focus on safety and harmlessness means their data pipeline likely includes aggressive content filtering beyond standard quality heuristics. The Constitutional AI approach they use for alignment also implies that the model's pretraining data is cleaner than average: you cannot fine-tune away what is deeply embedded in pretraining.
Anthropic has discussed using a mix of web data (likely CommonCrawl-derived, heavily filtered), curated books and papers, and synthetic data for specific capabilities. Their technical reports mention that Claude's training includes "a large dataset of text from the internet and other sources," but specifics are proprietary.
What Anthropic has been more public about is their emphasis on data quality over data quantity. Repeated statements from their research team suggest that a smaller dataset of higher quality can outperform a larger noisy dataset for many capabilities, an insight that is directly relevant to the Indic language challenge, where your data is necessarily smaller.
Google: The C4 Dataset, Gemini, and Vertical Integration
Google's position is structurally unique. They built the web index. They built the crawlers. They are, in a meaningful sense, upstream of CommonCrawl.
T5's C4 dataset (Colossal Clean Crawled Corpus) was an early public example of Google's data processing pipeline. C4 was derived from CommonCrawl using a set of heuristic filters: removing lines not ending with terminal punctuation, removing pages with fewer than five sentences, and removing offensive content via word-list filtering. This was a relatively simple pipeline but established the basic shape of what large-scale web data processing looks like.
LaMDA, PaLM, and Gemini use significantly more sophisticated pipelines. From what Google has disclosed, their training data includes: web documents (their own crawl, not CommonCrawl), books from Google Books, code from GitHub, scientific papers from Google Scholar's index, YouTube transcripts (they own YouTube), and Wikipedia. The YouTube transcript angle is particularly notable: Google has access to an enormous volume of spoken-word Indic language content through YouTube captions, which they can use in ways that external researchers simply cannot.
Gemini specifically was described as using "multimodal" data from the start, incorporating text, images, audio, and video into a unified training corpus. This is architecturally distinct from earlier text-only models and reflects Google's vertical integration across content types.
For Indic languages specifically, Google has a significant advantage through their work on AI4Bharat-derived datasets, their own investment in multilingual NLP research, and the sheer reach of Google Search and YouTube in India. Their data advantage for Indic languages is structural, not just a matter of engineering effort.
The takeaway from looking at what the labs do is this: they all start with web data, but none of them stop there. They curate, they partner, they build proprietary pipelines, and they invest heavily in quality filtering. For an Indic LLM, the lesson is clear — CommonCrawl is the foundation, but you need to build on top of it.
Building Your Own Data Moat, Source by Source
Given that CommonCrawl alone is insufficient for Indic languages, you need to construct a multi-source data pipeline. Here is how to think through each source category.
1. News Archives
Indian news organisations produce an enormous volume of Indic language text, and much of it is accessible via scraping. The key targets:
Hindi: Dainik Jagran, Dainik Bhaskar, Navbharat Times, Amar Ujala, Rajasthan Patrika. Each of these has a web presence with years of archived articles. A systematic scraper targeting their article pages (not homepages or index pages) can yield hundreds of millions of words of clean Hindi text.
Tamil: Dinamalar, Dinakaran, Daily Thanthi, The Hindu Tamil edition.
Telugu: Eenadu, Sakshi, Andhra Jyothy.
Kannada: Prajavani, Vijay Karnataka, Deccan Herald Kannada.
Malayalam: Mathrubhumi, Malayala Manorama, Madhyamam.
Bengali: Anandabazar Patrika, Prothom Alo (Bangladesh), Dainik Statesman.
News text is valuable because it is factual, reasonably well-written (sub-edited, not raw user content), topically diverse, and dated — meaning it tracks language evolution over time. The challenge is respecting robots.txt and rate limits, and dealing with the fact that news websites often change their HTML structure, breaking scrapers.
A news scraping pipeline typically looks like this:
Sitemap crawl (find article URLs)
→ Respectful crawling (rate-limited, respecting robots.txt)
→ HTML extraction (trafilatura, newspaper3k, or BeautifulSoup)
→ Boilerplate removal (header, footer, nav, ads)
→ Language verification (fastText/GlotLID)
→ Quality scoring
→ Deduplication (article titles are frequently reused)
→ Upload to S3 with metadata (source, date, language, word count)
trafilatura is the best tool I've found for extracting main content from news articles. It's specifically designed for news text extraction and outperforms BeautifulSoup for this use case by a significant margin.
2. Books and Literary Text
Books are among the most valuable data sources for language model training because they contain long-form, coherent, carefully structured writing — the kind of writing that produces models which can reason and compose extended arguments.
For Indic languages, digital book sources include:
Project Gutenberg — limited Indic content, but has some classical texts.
Wikisource — has a reasonable collection of public domain texts in Hindi, Tamil, Sanskrit, and other Indic languages. Directly downloadable via API.
Digital Library of India / IGNCA — government-operated repositories with scanned books. OCR quality is variable but there's significant volume.
Sahitya Akademi publications — India's national academy of letters has digitised portions of their catalogue.
State libraries — Several state government libraries have digital catalogues with varying degrees of accessibility.
The challenge with books is two-fold. First, copyright: books published roughly a century ago are typically in the public domain in the US (the cutoff year advances annually), and laws vary by country. Books after that are protected, and scraping them without licensing agreements is legally fraught. Second, OCR quality: many Indic language books that exist digitally are scans, and OCR accuracy for Indic scripts, while improving, is still imperfect.
3. Wikipedia and Wikimedia Projects
Wikipedia is a standard component of virtually every LLM training corpus, and Indic Wikis are worth extracting. The good news is that Wikipedia dumps are freely available and pre-segmented by language.
The dump for Hindi Wikipedia (hiwiki-latest-pages-articles.xml.bz2) can be processed with WikiExtractor to produce clean text. However, be aware of scale — Hindi Wikipedia, while growing rapidly, is small compared to English Wikipedia. As of recent counts, Hindi Wikipedia has around 150,000+ articles; English has over 6.7 million. Tamil, Telugu, and Kannada Wikis are even smaller.
Beyond Wikipedia, the Wikimedia Foundation operates several other projects that contain Indic content: Wiktionary (dictionary entries, good for vocabulary coverage), Wikisource (public domain texts), and Wikibooks.
IndicWiki corpus and similar derived datasets exist and are worth using as a starting point before building custom scraping infrastructure.
4. YouTube Transcripts
This is one of the most underutilised sources for Indic language data, and I want to spend some time here because the opportunity is significant.
YouTube is enormous in India. Hindi-language content is one of the fastest-growing segments on the platform. News channels, educational content creators, documentary producers, and politicians all produce significant volumes of Indic language content. And YouTube provides auto-generated captions for much of this content via their speech recognition system.
These captions are imperfect (they're generated by an ASR model, not transcribed by humans), but they represent a source of spoken-language data that doesn't exist in written form anywhere else. Colloquial Hindi, regional dialects, code-switching between Hindi and English: all of this shows up in YouTube captions in a way it doesn't show up in news articles.
The pipeline:
from youtube_transcript_api import YouTubeTranscriptApi
import yt_dlp

def get_channel_video_ids(channel_url: str) -> list[str]:
    ydl_opts = {
        "quiet": True,
        "extract_flat": True,  # list entries without resolving each video
    }
    with yt_dlp.YoutubeDL(ydl_opts) as ydl:
        info = ydl.extract_info(channel_url, download=False)
    return [entry["id"] for entry in info.get("entries", [])]

def get_transcript(video_id: str, language: str = "hi") -> str | None:
    try:
        transcript_list = YouTubeTranscriptApi.list_transcripts(video_id)
        # Prefer manually created captions; fall back to auto-generated
        try:
            transcript = transcript_list.find_manually_created_transcript([language])
        except Exception:
            transcript = transcript_list.find_generated_transcript([language])
        segments = transcript.fetch()
        return " ".join(seg["text"] for seg in segments)
    except Exception:
        # No transcript in this language, or transcripts disabled
        return None
The ethical and legal considerations here are important. YouTube's ToS restricts automated scraping. Many content creators depend on their content for livelihood. The approach most research teams take is to use transcripts for research under fair use principles, but this is a legally nuanced area. Some teams reach out directly to content creators for explicit permission, which also gives you cleaner, human-curated channel lists.
5. Government and Legal Text
Government text is valuable for several reasons: it covers technical and formal registers of the language, it's authoritative, it's high quality, and it's public.
Official government portals — States like Maharashtra, Tamil Nadu, Karnataka, Kerala, and Andhra Pradesh publish official gazette notifications, government orders, and public documents in Indic languages. The Government of India's National Portal publishes content in 22 scheduled languages.
Parliamentary proceedings — The Lok Sabha and Rajya Sabha publish debates in Hindi and in other scheduled languages when speeches are delivered in those languages. Parliamentary debates are long-form, formal, topically diverse, and high quality.
Legal judgments — The Supreme Court of India and many High Courts publish judgments that have been translated into Indic languages. Legal text is challenging (dense with technical terminology) but valuable for formal language understanding.
NCERT textbooks — The National Council of Educational Research and Training publishes school textbooks in all major Indic languages, and many of these are freely downloadable as PDFs. These are edited, high-quality, age-appropriate text covering science, social studies, mathematics, and literature.
6. Academic and Research Text
JSTOR, arXiv, and similar repositories are mostly English. For Indic language academic text, the main sources are:
- Shodhganga — India's national repository of PhD theses. Many theses in social sciences, humanities, and education are written in Indic languages.
- Indian journals — Several academic journals publish in Hindi and other Indic languages. Access varies.
Building the Full Ingestion Pipeline
Now let's talk about the actual engineering: how you stitch all of these sources into a coherent pipeline that produces clean, deduplicated, language-verified text in S3.
The architecture I'd recommend for a team of one to five people with a moderate cloud budget:
┌─────────────────────────────────────────────────────────┐
│ Source Layer │
│ CommonCrawl News Sites YouTube Wikipedia Books │
│ (S3 stream) (scrapers) (API) (dumps) (PDFs) │
└────────────────────────┬────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Extraction Layer │
│ trafilatura / pdfminer / youtube-transcript-api │
│ Output: raw text + metadata JSON (source, date, lang) │
└────────────────────────┬────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Normalisation Layer │
│ Unicode normalisation (NFC/NFKC) │
│ Script detection (is this actually Devanagari?) │
│ Whitespace normalisation │
│ Remove HTML artifacts, escape sequences │
└────────────────────────┬────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Language Detection Layer │
│ fastText (lid.176.bin) + GlotLID │
│ Threshold: confidence > 0.85, both models agree │
│ Route to language-specific buckets │
└────────────────────────┬────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Quality Filtering Layer │
│ Document-level heuristics (length, punct ratio, etc.) │
│ Perplexity filter (KenLM trained on reference text) │
│ Content filtering (adult, spam, hate speech) │
└────────────────────────┬────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Deduplication Layer │
│ Exact dedup: SHA256 hash of normalised content │
│ Fuzzy dedup: MinHash LSH (datasketch library) │
│ Threshold: Jaccard similarity > 0.8 = duplicate │
└────────────────────────┬────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ S3 Output Layer │
│ s3://your-bucket/ │
│ ├── raw/{language}/{source}/{year}/{month}/ │
│ ├── processed/{language}/{source}/{year}/{month}/ │
│ └── final/{language}/train/ val/ test/ │
│ Format: JSONL — one document per line │
│ Each record: {text, source, language, date, quality} │
└─────────────────────────────────────────────────────────┘
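Of these layers, normalisation is the simplest to sketch with the standard library alone. The script check below uses the Devanagari Unicode block (U+0900-U+097F) purely as an example; each language bucket would get its own range:

```python
import re
import unicodedata

DEVANAGARI = re.compile(r"[\u0900-\u097F]")

def normalise(text: str) -> str:
    """NFC Unicode normalisation plus whitespace cleanup."""
    text = unicodedata.normalize("NFC", text)
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)   # allow at most one blank line
    return text.strip()

def devanagari_fraction(text: str) -> float:
    """Fraction of non-space characters that are Devanagari."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(bool(DEVANAGARI.match(c)) for c in chars) / len(chars)
```

A document routed to the hi bucket whose devanagari_fraction is low is a red flag: either the language detector or the extraction step has gone wrong.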
The JSONL Format: Keep It Simple
Every document that passes your pipeline should be stored as a single JSON object on a single line (JSONL format). A minimal schema:
{
"text": "जब भारत ने 1947 में स्वतंत्रता प्राप्त की...",
"source": "dainik_bhaskar",
"source_url": "https://www.bhaskar.com/...",
"language": "hi",
"script": "Devanagari",
"date": "2023-08-15",
"word_count": 847,
"quality_score": 0.87,
"pipeline_version": "v1.2",
"dedup_hash": "sha256:a3b9c..."
}
The quality_score field is critical: it lets you apply different quality thresholds at different stages of training (higher threshold for initial training runs, lower for later stages).
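A minimal reader and writer for this format; the field names follow the example record above, and the threshold is applied at read time so the same file can serve every training stage:

```python
import json
from typing import Iterator

def write_jsonl(path: str, records: list[dict]) -> None:
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            # ensure_ascii=False keeps Devanagari human-readable in the file
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")

def read_jsonl(path: str, min_quality: float = 0.0) -> Iterator[dict]:
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            rec = json.loads(line)
            if rec.get("quality_score", 0.0) >= min_quality:
                yield rec
```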
Managing CommonCrawl at Scale
Processing CommonCrawl without burning through your budget requires careful thinking about compute architecture.
The recommended approach is streaming processing rather than downloading everything locally. The flow:
import gzip

import boto3
from warcio.archiveiterator import ArchiveIterator

def process_wet_file(s3_key: str):
    # CommonCrawl's bucket lives in us-east-1
    s3 = boto3.client("s3", region_name="us-east-1")
    response = s3.get_object(Bucket="commoncrawl", Key=s3_key)
    with gzip.open(response["Body"], "rb") as f:
        for record in ArchiveIterator(f):
            # WET files store extracted text as "conversion" records
            if record.rec_type != "conversion":
                continue
            text = record.content_stream().read().decode("utf-8", errors="ignore")
            url = record.rec_headers.get_header("WARC-Target-URI")
            language = detect_language(text)
            if language not in INDIC_LANGS:
                continue
            if passes_quality_filters(text, language):
                yield {
                    "text": text,
                    "url": url,
                    "language": language,
                    "source": "commoncrawl",
                    "crawl_date": record.rec_headers.get_header("WARC-Date"),
                }
For scale, run this across a cluster of EC2 instances (even spot instances work fine for this workload: it's stateless and restartable). Apache Spark on EMR is an option for very large-scale processing, though the overhead of setting up Spark can outweigh the benefits if you're processing fewer than ~50 crawl segments.
A cost-effective approach for an independent researcher: process one crawl segment at a time on a single large EC2 instance (r6i.4xlarge or similar, 16 vCPUs, 128GB RAM), running the full pipeline in Python with multiprocessing. Each WET file takes roughly 2-5 minutes to process. With parallelism, you can get through a few thousand WET files per day. Budget for this at around $50-150/day in EC2 costs — not cheap, but manageable.
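The single-instance pattern can be sketched with multiprocessing. process_one here is a placeholder (in a real run it would drive the WET-streaming generator above and upload what it yields); the point is the shape of the driver, where each WET file is an independent, restartable unit of work:

```python
from multiprocessing import Pool

def process_one(s3_key: str) -> int:
    """Process one WET file and return the number of documents kept.

    Placeholder body: the real version would iterate the documents
    yielded for this key and upload them to your own bucket.
    """
    kept = 0
    return kept

def run_crawl_segment(wet_keys: list[str], workers: int = 16) -> int:
    """Fan WET files out across worker processes; return total docs kept."""
    with Pool(processes=workers) as pool:
        return sum(pool.map(process_one, wet_keys, chunksize=4))
```

Because failures are per-file, logging which keys completed is enough to make the whole run resumable after a spot interruption.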
The Deduplication Challenge
Deduplication is where a lot of people underestimate the effort required. The same news article appearing on forty different websites is a trivial case. The harder cases are:
- Near-duplicates: Same article with different headlines or different publication dates
- Templated content: Product descriptions that share 90% of their text across different products
- Aggregated content: Websites that republish translated content from other sources
- Multi-crawl duplicates: The same page appearing in the March 2023 and June 2023 CommonCrawl crawls
The standard approach is MinHash LSH deduplication. The datasketch Python library implements this efficiently. For a corpus of hundreds of millions of documents, you'll want to run this on a cluster rather than a single machine.
from datasketch import MinHash, MinHashLSH

def get_minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    # Single-word shingles for brevity; word n-grams (e.g. 5-grams)
    # give tighter matching in production pipelines
    for word in text.lower().split():
        m.update(word.encode("utf-8"))
    return m

lsh = MinHashLSH(threshold=0.8, num_perm=128)

# Insert documents
for doc_id, document in enumerate(documents):
    minhash = get_minhash(document["text"])
    lsh.insert(str(doc_id), minhash)

# Query for duplicates
def find_duplicates(text: str) -> list[str]:
    minhash = get_minhash(text)
    return lsh.query(minhash)
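Exact deduplication, the other half of the layer, needs nothing beyond hashlib; this sketch also produces the value stored in the dedup_hash field of the JSONL schema:

```python
import hashlib
import unicodedata

def dedup_hash(text: str) -> str:
    """SHA256 of normalised content (NFC, trimmed, lower-cased)."""
    normalised = unicodedata.normalize("NFC", text.strip().lower())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

def exact_dedup(documents: list[dict]) -> list[dict]:
    """Keep only the first occurrence of each distinct normalised text."""
    seen: set[str] = set()
    unique = []
    for doc in documents:
        h = dedup_hash(doc["text"])
        if h not in seen:
            seen.add(h)
            unique.append(doc)
    return unique
```

Run exact dedup before MinHash: it is far cheaper, and every exact duplicate it removes is one fewer entry in the LSH index.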
The Mental Model: Thinking at Billion-Parameter Scale
If you've made it this far, you have a picture of the individual components. But let me zoom out and talk about how to think about all of this when your goal is training a model with billions of parameters.
Token Count Is What Actually Matters
Your intuition about data size probably comes from file sizes (GB, TB). Train yourself to think in tokens instead. At roughly 3-4 characters per token, and with Devanagari, Tamil script, etc. taking around three bytes per character in UTF-8, a 1GB file of Hindi text might contain somewhere around 75-100 million tokens (more if your tokeniser covers the script poorly and falls back to byte-level pieces).
Current research (the Chinchilla paper from DeepMind) suggests that a model should be trained on roughly 20 tokens per parameter for optimal compute efficiency. A 7B parameter model, therefore, wants around 140 billion tokens of training data at minimum. Frontier models are now often trained on significantly more than this: 2-4 trillion tokens is common.
For a 1B parameter Indic language model (a reasonable starting point), you want at minimum 20-50 billion tokens. For Hindi, this is achievable with CommonCrawl + news + books + Wikipedia combined. For Tamil or Telugu at the same scale, it's tight but doable with aggressive multi-source collection. For languages like Odia or Assamese at 1B scale, you'll need creative sourcing and likely some degree of cross-lingual data from closely related languages.
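This budgeting arithmetic is worth keeping as a two-line helper, using the roughly 20-tokens-per-parameter rule and the roughly 0.75-words-per-token conversion from earlier:

```python
def chinchilla_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Compute-optimal token budget for a model with n_params parameters."""
    return n_params * tokens_per_param

def tokens_to_words(n_tokens: float, words_per_token: float = 0.75) -> float:
    """Convert a token budget into an approximate word count."""
    return n_tokens * words_per_token
```

A 7B model at the Chinchilla ratio wants about 140 billion tokens, roughly 105 billion words of clean text.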
Quality Is a Multiplier on Quantity
Here is the mental model that will save you from chasing raw scale at the expense of everything else: quality multiplies quantity.
If you train on 100 billion tokens of noisy, low-quality data, you may get worse results than training on 20 billion tokens of carefully curated, high-quality data. This has been shown empirically in papers like Phi-1 (Microsoft), which achieved strong benchmark performance with a small model trained on carefully curated synthetic and textbook-quality data.
This is particularly relevant for Indic languages, where your data is going to be limited. You cannot win the quantity game against the labs. You can potentially win the quality game — by being more careful than a large lab's automated pipeline can afford to be, by sourcing data that scrapers don't reach, and by curating rather than accumulating.
The Data Flywheel: Bootstrap Thinking
Here's a pattern that's worth internalising for a long-running project like this. The data you collect now is not just for your first model. It is the foundation for a data flywheel:
Iteration 1: Collect data manually → train a small model → use model to help identify more data sources, classify quality, detect duplicates.
Iteration 2: Use iteration 1 model outputs to generate synthetic training examples in domains where your real data is weak (e.g., scientific text in Telugu, legal text in Kannada).
Iteration 3: Use the iteration 2 model to help human annotators work faster: proposing labels, suggesting edits, flagging ambiguous cases.
At each iteration, the model gets better, and the data quality improves. This is how serious Indic NLP efforts like AI4Bharat have built momentum: not by trying to collect everything perfectly upfront, but by building a feedback loop between models and data.
The S3 Bucket Structure: Design for the Whole Journey
Before you start dumping files, think carefully about your S3 structure. You will want to be able to:
- Reprocess specific source types without touching others
- Apply new quality filters to existing data without re-scraping
- Maintain separate train/val/test splits per language
- Track data lineage (which pipeline version produced this file)
- Add new languages without reorganising existing structure
A structure that has worked well in practice:
```
s3://your-indic-llm-bucket/
├── raw/
│   ├── commoncrawl/2024-10/hi/segment_001.jsonl.gz
│   ├── news/dainik_bhaskar/2023/08/articles.jsonl.gz
│   ├── youtube/transcripts/hi/batch_001.jsonl.gz
│   └── wikipedia/hiwiki/20240101/dump.jsonl.gz
├── processed/
│   ├── hi/v1.2/deduped_quality_filtered.jsonl.gz
│   ├── ta/v1.2/deduped_quality_filtered.jsonl.gz
│   └── te/v1.2/deduped_quality_filtered.jsonl.gz
├── final/
│   ├── hi/train.jsonl.gz
│   ├── hi/val.jsonl.gz
│   └── hi/test.jsonl.gz
└── metadata/
    ├── stats/token_counts_by_source_language.json
    └── pipeline_versions/v1.2_config.json
```
Keep pipeline version numbers in your paths. When you update a filter or fix a bug, you want to know which files were produced by which version of your pipeline.
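One way to keep that version discipline enforceable in code rather than by convention is a small key-building helper. This is a sketch; the bucket name and filename are placeholders from the layout above, not a fixed API:

```python
def processed_key(lang: str, pipeline_version: str, filename: str,
                  bucket: str = "your-indic-llm-bucket") -> str:
    """Build a versioned S3 key under processed/, e.g. processed/hi/v1.2/..."""
    return f"s3://{bucket}/processed/{lang}/v{pipeline_version}/{filename}"

key = processed_key("hi", "1.2", "deduped_quality_filtered.jsonl.gz")
# -> s3://your-indic-llm-bucket/processed/hi/v1.2/deduped_quality_filtered.jsonl.gz
```

If every write goes through a helper like this, a bumped pipeline version automatically lands in a fresh prefix, and the old outputs stay untouched for comparison.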
What You've Built And What Comes Next
By the end of Stage 1, you should have:
- A processed, deduplicated, language-verified dataset in S3, stored as JSONL
- Coverage across at minimum: CommonCrawl (multi-crawl), news archives, Wikipedia, and one or more of: YouTube transcripts, books, government text
- Metadata tracking for each document: source, language, date, quality score, pipeline version
- A rough token count per language — probably something like 20-100B tokens of Hindi, 5-30B tokens of the major Dravidian languages, and smaller amounts for less-resourced Indic languages
- A deduplication report showing how much duplication existed in your raw sources
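Concretely, each JSONL line can carry the document text plus the metadata fields listed above. A sketch of one record (the field names are illustrative, not a fixed schema):

```python
import json

record = {
    "text": "यह एक उदाहरण दस्तावेज़ है।",   # the document body
    "source": "commoncrawl/2024-10",       # where this document came from
    "language": "hi",                      # verified language code
    "date": "2024-10-15",                  # crawl or publication date
    "quality_score": 0.87,                 # output of your quality filter
    "pipeline_version": "v1.2",            # which pipeline produced this record
}

# One document per line; ensure_ascii=False keeps Devanagari readable on disk
line = json.dumps(record, ensure_ascii=False)
```

Keeping the metadata inline with the text means any downstream stage can filter or re-weight by source, date, or quality without a join against a separate index.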
This dataset is not clean enough to train on yet. It is a foundation.
In Stage 2, we'll talk about what comes next: the data cleaning, normalisation, and preparation pipeline that takes this raw JSONL corpus and produces the tokenised, shuffled, packed training sequences that an actual model can consume. We'll cover tokenisation for Indic languages specifically — the choice between character-level, BPE, and sentencepiece models, and why this decision has large downstream effects on your model's ability to handle morphologically rich Indic languages.
We'll also talk about data mixing: how you decide the ratio of Hindi to Tamil to Telugu to English in your training batch, and how to implement curriculum-style data ordering that trains on easier data first and harder data later.
For now, though, you have the invisible constraint in hand. Data is the foundation. Everything else (architecture, training, alignment) is built on top of what you've built here.
Build it carefully.
This is Part 2 of the "Building an LLM from Scratch for Indic Languages" series. Part 1 (the introduction) is here. Part 3, Data Cleaning and Tokenisation, is coming next.
If you found this useful, the series is being written openly. Feedback, corrections, and suggestions are welcome in the responses.
Tags: LLM, Large Language Models, Indic Languages, NLP, Data Engineering, CommonCrawl, Hindi, Tamil, Telugu, Machine Learning, AI, Data Pipeline, S3, Python, Natural Language Processing