Project: Building "Mini-C4" Pre-training Corpus
This project demonstrates how to build a miniaturized version of the C4 (Colossal Clean Crawled Corpus) pipeline. Our mission: transform chaotic, raw web data (Common Crawl) into low-noise, deduplicated, high-quality text ready for LLM pre-training.
GitHub: datascale-ai/data_engineering_book
1. Project Brief
- Objective: Build a pipeline to process raw Common Crawl data into a clean text corpus.
- Input: Raw WARC files (`.warc.gz`) containing HTTP headers, HTML source, and binary noise.
- Output: Categorized JSONL files (`final_data.jsonl`) featuring clean text, language labels, and Perplexity (PPL) scores.
- Challenges:
- Extremely Low Signal-to-Noise Ratio: Over 90% of raw web data consists of navbars, ads, SEO spam, and JS code.
- Fuzzy Deduplication: Identifying semantically similar documents across millions of records is computationally expensive.
- Quality Quantification: How to distinguish "human-grade prose" from "machine-generated gibberish" without expensive LLM APIs.
2. Architecture Design
We designed a funnel-shaped pipeline that filters noise layer by layer.
Tech Stack Decisions
| Component | Choice | Rationale |
|---|---|---|
| Parsing | `warcio`, `trafilatura` | `trafilatura` excels at extracting main content (removing footers/ads), far better than BeautifulSoup. |
| Compute | Ray | Python's `multiprocessing` has high overhead for large shared state; Ray's task/actor model scales easily from multi-core to clusters. |
| Deduplication | MinHash LSH | Locality Sensitive Hashing avoids the O(n²) cost of comparing all document pairs. |
| Evaluation | KenLM | A lightweight N-gram model used by GPT-3/CCNet to measure text "naturalness" via Perplexity. |
3. Step-by-Step Implementation
Phase I: Heuristic Cleaning & Extraction
Raw WARC files are a mess. We use warcio for streaming and trafilatura to extract the "soul" of the webpage.
Code Insight: Streaming Processor
```python
from warcio.archiveiterator import ArchiveIterator
import trafilatura

# `stream` is an open binary file over a .warc.gz archive,
# e.g. stream = open("crawl.warc.gz", "rb")
for record in ArchiveIterator(stream):
    if record.rec_type == 'response':
        # Keep HTML responses only
        if 'text/html' not in record.http_headers.get_header('Content-Type', ''):
            continue
        # Extract the main body, ignoring comments and tables
        text = trafilatura.extract(record.content_stream().read(),
                                   include_comments=False,
                                   include_tables=False)
```
The Cleaning Rules (Gopher/C4 Standards):
- Symbol-to-Word Ratio: If symbols like `{ } [ ]` exceed 10% of characters, it's likely code.
- Average Word Length: High-quality English text usually averages 5-10 characters per word. Values > 15 suggest minified JS or URL lists.
- Keyword Blocklist: Drop pages containing "lorem ipsum", "enable cookies", or "403 forbidden".
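These rules are easy to sketch in plain Python. The function names, thresholds' packaging, and the exact symbol set below are illustrative choices of mine, not taken from the repo:

```python
# Illustrative heuristic filters; only the 10% symbol-ratio and
# 15-chars/word thresholds come from the rules above.
SYMBOLS = set("{}[]<>")
BLOCKLIST = ("lorem ipsum", "enable cookies", "403 forbidden")

def symbol_ratio(text: str) -> float:
    """Fraction of characters that are code-like symbols."""
    return sum(c in SYMBOLS for c in text) / max(len(text), 1)

def avg_word_length(text: str) -> float:
    """Mean word length in characters (0.0 for empty text)."""
    words = text.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0

def passes_heuristics(text: str,
                      max_symbol_ratio: float = 0.10,
                      max_avg_word_len: float = 15.0) -> bool:
    lower = text.lower()
    if any(kw in lower for kw in BLOCKLIST):
        return False
    if symbol_ratio(text) > max_symbol_ratio:
        return False
    if avg_word_length(text) > max_avg_word_len:
        return False
    return True
```

A filter like this is cheap enough to run inline during WARC extraction, before any expensive deduplication or scoring.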
Phase II: Distributed MinHash Deduplication
To handle "mirrored" content, we use Ray to parallelize the computation of MinHash signatures.
Code Insight: Ray Remote Tasks (Map-Reduce)
```python
import ray
from datasketch import MinHash

ray.init()

@ray.remote
def process_batch(lines):
    results = []
    for line in lines:
        m = MinHash(num_perm=128)
        for w in line['text'].split():
            m.update(w.encode('utf8'))
        results.append((line['url'], m))
    return results

# Map: dispatch batches to all CPU cores; Reduce: gather signatures
futures = [process_batch.remote(batch) for batch in batches]
results = ray.get(futures)
```
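For the reduce side, the banding trick behind MinHash LSH can be illustrated without any library. In the real pipeline `datasketch.MinHashLSH` would handle this; the sketch below is my simplification, using salted hashes in place of true permutations:

```python
import hashlib

def minhash_signature(text: str, num_perm: int = 128) -> list[int]:
    # Approximate num_perm permutations by salting each token
    # with the permutation index and keeping the minimum hash.
    tokens = set(text.split())
    return [
        min(int.from_bytes(hashlib.md5(f"{i}:{t}".encode()).digest()[:8], "big")
            for t in tokens)
        for i in range(num_perm)
    ]

def lsh_buckets(signatures: dict, bands: int = 16) -> dict:
    # Split each signature into bands; documents sharing any full band
    # become candidate duplicates. Returns only buckets with >1 doc.
    rows = len(next(iter(signatures.values()))) // bands
    buckets: dict = {}
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, []).append(doc_id)
    return {k: v for k, v in buckets.items() if len(v) > 1}
```

More bands with fewer rows each makes the filter more permissive (catches lower-similarity pairs); fewer, wider bands makes it stricter.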
Phase III: Quality Filtering (KenLM)
We use a pre-trained KenLM model to calculate Perplexity. Lower perplexity means more "natural" language.
Tuning the Threshold:
- Score > -5.0: Wikipedia-grade, highly fluent content.
- Score -5.0 to -6.0: Standard blog posts and forum discussions.
- Score < -6.5: Broken sentences, machine translation failures, or SEO keyword lists (Discard).
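KenLM reports a length-normalized log10 score per word, which is what the thresholds above are applied to. A toy unigram scorer (my own illustration, far cruder than KenLM's smoothed N-grams) shows the mechanics of thresholding on normalized log-probability:

```python
import math
from collections import Counter

def train_unigram(corpus_texts):
    """Add-one-smoothed unigram probabilities plus an <unk> probability."""
    counts = Counter(w for t in corpus_texts for w in t.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 slot for unseen words
    probs = {w: (c + 1) / (total + vocab) for w, c in counts.items()}
    return probs, 1 / (total + vocab)

def normalized_score(text, probs, unk_prob):
    """Length-normalized log10 probability, analogous to KenLM's score/word."""
    words = text.lower().split()
    if not words:
        return float("-inf")
    logp = sum(math.log10(probs.get(w, unk_prob)) for w in words)
    return logp / len(words)
```

Fluent, in-domain text scores higher (closer to zero) than gibberish, which is exactly the property the -5.0 / -6.5 cutoffs exploit.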
4. Performance & Showcase (Data Funnel)
Based on processing a sample 1 GB WARC file:
| Stage | In (Docs) | Out (Docs) | Retention | Main Loss Reason |
|---|---|---|---|---|
| Raw WARC | ~35,000 | ~10,000 | 28% | Non-HTML, Empty content. |
| Heuristics | 10,000 | ~6,500 | 65% | Code snippets, short text. |
| Deduplication | 6,500 | ~4,800 | 73% | Mirrored sites, templates. |
| Quality Filter | 4,800 | ~3,900 | 81% | Gibberish, non-English. |
| Final Yield | 35,000 | 3,900 | ~11% | Data Purity over Volume. |
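The funnel multiplies out as claimed; a quick sanity check of the approximate document counts from the table:

```python
# Per-stage (input, output) document counts from the funnel table.
stages = [
    ("Raw WARC",       35_000, 10_000),
    ("Heuristics",     10_000,  6_500),
    ("Deduplication",   6_500,  4_800),
    ("Quality Filter",  4_800,  3_900),
]

for name, n_in, n_out in stages:
    print(f"{name:15s} retention = {100 * n_out / n_in:.1f}%")

overall = stages[-1][2] / stages[0][1]  # final output / raw input
print(f"Overall yield = {100 * overall:.1f}%")
```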
5. Scaling to Terabytes (The Next Steps)
- State Management: Move `MinHashLSH` indices from RAM to Redis or Cassandra to handle billions of records.
- I/O Optimization: Transition from local files to S3/MinIO, using Apache Arrow for columnar streaming.
- Global Sharding: Follow the CCNet approach: shard data by hash buckets and deduplicate within shards to minimize cross-node communication.
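The sharding idea can be sketched in a few lines. The hash function and shard count below are arbitrary choices of mine, and CCNet's actual scheme normalizes paragraphs before hashing; this only shows the routing principle:

```python
import hashlib

def shard_for(text: str, num_shards: int = 64) -> int:
    """Content-based shard routing: identical documents always land in
    the same shard, so deduplication can run entirely shard-locally."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

For near-duplicate (MinHash) dedup, the same idea applies with LSH band keys as the routing key, so candidate pairs co-locate without cross-node communication.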
Conclusion
Building Mini-C4 is a masterclass in Data Funneling. It's not about how much data you have, but how effectively you can discard the garbage.
Full Source Code: datascale-ai/data_engineering_book
Have you ever tried processing Common Crawl? What's the weirdest thing you've found in a raw WARC file? Let's talk in the comments!
