Xin Xu
Project: Building "Mini-C4" β€” A Production-Grade LLM Pre-training Pipeline πŸ—οΈ

This project demonstrates how to build a miniaturized version of the C4 (Colossal Clean Crawled Corpus) pipeline. Our mission: transform chaotic, raw web data (Common Crawl) into low-noise, deduplicated, high-quality text ready for LLM pre-training.

πŸ‘‰ GitHub: datascale-ai/data_engineering_book


1. Project Brief

  • Objective: Build a pipeline to process raw Common Crawl data into a clean text corpus.
  • Input: Raw WARC files (.warc.gz) containing HTTP headers, HTML source, and binary noise.
  • Output: Categorized JSONL files (final_data.jsonl) featuring clean text, language labels, and Perplexity (PPL) scores.
  • Challenges:
  • Extremely Low Signal-to-Noise Ratio: Over 90% of raw web data consists of navbars, ads, SEO spam, and JS code.
  • Fuzzy Deduplication: Identifying semantically similar documents across millions of records is computationally expensive.
  • Quality Quantification: How to distinguish "human-grade prose" from "machine-generated gibberish" without expensive LLM APIs.

2. Architecture Design

We designed a funnel-shaped pipeline that filters noise layer by layer.

Tech Stack Decisions

| Component | Choice | Rationale |
| --- | --- | --- |
| Parsing | warcio, trafilatura | trafilatura excels at extracting main content (removing footers/ads), far better than BeautifulSoup. |
| Compute | Ray | Python's multiprocessing has high overhead for large shared state. Ray's task/actor model scales easily from multi-core to clusters. |
| Deduplication | MinHash LSH | Reduces pairwise-comparison complexity from O(n²) to near-linear via Locality-Sensitive Hashing. |
| Evaluation | KenLM | A lightweight n-gram model used by the GPT-3/CCNet pipelines to measure text "naturalness" via perplexity. |

3. Step-by-Step Implementation

Phase I: Heuristic Cleaning & Extraction

Raw WARC files are a mess. We use warcio for streaming and trafilatura to extract the "soul" of the webpage.

Code Insight: Streaming Processor

```python
from warcio.archiveiterator import ArchiveIterator
import trafilatura

# 'stream' is an open binary file object, e.g. open('crawl.warc.gz', 'rb')
for record in ArchiveIterator(stream):
    if record.rec_type == 'response':
        # Keep HTML responses only
        content_type = record.http_headers.get_header('Content-Type', '')
        if 'text/html' not in content_type:
            continue
        # Extract the main body, ignoring comments and boilerplate
        text = trafilatura.extract(record.content_stream().read(),
                                   include_comments=False)
```

πŸ” The Cleaning Rules (Gopher/C4 Standards):

  1. Symbol-to-Word Ratio: If symbols like { } [ ] exceed 10%, it's likely code.
  2. Average Word Length: High-quality English text usually averages 5-10 characters. Values > 15 suggest minified JS or URL lists.
  3. Keyword Blocklist: Drop pages containing "lorem ipsum", "enable cookies", or "403 forbidden".
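These three rules can be sketched as a single filter function; the helper name and exact symbol set are illustrative, with thresholds taken from the Gopher/C4 numbers above:

```python
SYMBOLS = set('{}[]<>|\\')
BLOCKLIST = ('lorem ipsum', 'enable cookies', '403 forbidden')

def passes_heuristics(text: str) -> bool:
    """Apply the three Gopher/C4-style rules described above."""
    words = text.split()
    if not words:
        return False
    # 1. Symbol-to-word ratio: too many braces/brackets suggests code
    symbol_count = sum(text.count(s) for s in SYMBOLS)
    if symbol_count / len(words) > 0.10:
        return False
    # 2. Average word length: > 15 chars suggests minified JS or URL lists
    avg_len = sum(len(w) for w in words) / len(words)
    if avg_len > 15:
        return False
    # 3. Keyword blocklist
    lowered = text.lower()
    if any(k in lowered for k in BLOCKLIST):
        return False
    return True
```

In the real pipeline each rule would also log its rejection reason, so you can audit which filter is responsible for most of the loss.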

Phase II: Distributed MinHash Deduplication

To handle "mirrored" content, we use Ray to parallelize the computation of MinHash signatures.

Code Insight: Ray Parallel Tasks (Map-Reduce)

```python
import ray
from datasketch import MinHash

@ray.remote
def process_batch(lines):
    """Compute a MinHash signature for each document in the batch."""
    results = []
    for line in lines:
        m = MinHash(num_perm=128)
        for w in line['text'].split():
            m.update(w.encode('utf8'))
        results.append((line['url'], m))
    return results

# Map-Reduce: dispatch batches to all CPU cores
futures = [process_batch.remote(batch) for batch in batches]
results = ray.get(futures)
```
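The intuition behind MinHash is that the fraction of matching signature slots approximates the Jaccard similarity of the two token sets. A toy pure-Python version (datasketch's implementation is far more efficient, but the principle is the same; function names here are illustrative):

```python
import hashlib

def minhash_signature(tokens, num_perm=32):
    """Toy MinHash: for each of num_perm salted hash functions,
    keep the minimum hash value over the token set."""
    sig = []
    for seed in range(num_perm):
        sig.append(min(
            int(hashlib.md5(f'{seed}:{t}'.encode()).hexdigest(), 16)
            for t in tokens
        ))
    return sig

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching slots approximates Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

a = minhash_signature("the cat sat on the mat".split())
b = minhash_signature("the cat sat on the hat".split())
```

With near-duplicate documents most slots agree, so the estimate is high; LSH then bands these signatures into buckets so only likely matches are ever compared directly.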

Phase III: Quality Filtering (KenLM)

We use a pre-trained KenLM model to score fluency. Lower perplexity means more "natural" language; the thresholds below are per-word log10 scores, where values closer to 0 indicate more fluent text.

πŸ“ˆ Tuning the Threshold:

  • Score > -5.0: Wikipedia-grade, highly fluent content.
  • Score -5.0 to -6.0: Standard blog posts and forum discussions.
  • Score < -6.5: Broken sentences, machine-translation failures, or SEO keyword lists (discard).
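A sketch of the normalization, assuming the kenlm Python bindings (where `model.score(text)` returns a total log10 probability; the model path is a placeholder):

```python
def per_word_score(total_log10_prob: float, num_words: int) -> float:
    """Normalize a KenLM total log10 probability by word count.

    With the kenlm bindings the total would come from:
        # import kenlm
        # model = kenlm.Model('en.arpa.bin')   # placeholder path
        # total = model.score(text)
    """
    return total_log10_prob / max(num_words, 1)

def perplexity(total_log10_prob: float, num_words: int) -> float:
    """Perplexity = 10 ** (-log10_prob / N); lower is more natural."""
    return 10 ** (-per_word_score(total_log10_prob, num_words))
```

For example, a 10-word sentence with total log10 probability -50.0 gets a per-word score of -5.0, right at the "highly fluent" threshold.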

4. Performance & Showcase (Data Funnel)

Based on a sample 1GB WARC file processing:

| Stage | In (Docs) | Out (Docs) | Retention | Main Loss Reason |
| --- | --- | --- | --- | --- |
| Raw WARC | ~35,000 | ~10,000 | 28% | Non-HTML, empty content |
| Heuristics | 10,000 | ~6,500 | 65% | Code snippets, short text |
| Deduplication | 6,500 | ~4,800 | 73% | Mirrored sites, templates |
| Quality Filter | 4,800 | ~3,900 | 81% | Gibberish, non-English |
| **Final Yield** | 35,000 | 3,900 | ~11% | Data purity over volume |
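The per-stage retentions multiply into the final yield, which is a quick sanity check on the table:

```python
# Retention at each stage, from the funnel table above
retentions = [10_000 / 35_000,  # Raw WARC -> extracted docs
              6_500 / 10_000,   # Heuristic cleaning
              4_800 / 6_500,    # Deduplication
              3_900 / 4_800]    # Quality filter

final_yield = 1.0
for r in retentions:
    final_yield *= r

print(f'{final_yield:.1%}')  # ~11% of raw documents survive
```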

5. Scaling to Terabytes (The Next Steps)

  1. State Management: Move MinHashLSH indices from RAM to Redis or Cassandra to handle billions of records.
  2. I/O Optimization: Transition from local files to S3/MinIO using Apache Arrow for columnar streaming.
  3. Global Sharding: Follow the CCNet approachβ€”shard data by hash buckets and deduplicate within shards to minimize cross-node communication.
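The sharding idea in step 3 can be sketched in a few lines; the function name and shard count are illustrative, and this uses a hash of the normalized text so that exact duplicates always collide in the same shard:

```python
import hashlib

NUM_SHARDS = 16

def shard_of(text: str) -> int:
    """Stable shard assignment: hashing the normalized text sends
    duplicates to the same shard, so each shard can deduplicate
    locally without cross-node communication."""
    normalized = ' '.join(text.lower().split())
    digest = hashlib.md5(normalized.encode('utf8')).hexdigest()
    return int(digest, 16) % NUM_SHARDS
```

Near-duplicates still need MinHash within each shard; the content hash only guarantees that byte-identical (after normalization) copies never land on different nodes.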

Conclusion

Building Mini-C4 is a masterclass in Data Funneling. It’s not about how much data you have, but how effectively you can discard the garbage.

πŸ‘‰ Full Source Code: datascale-ai/data_engineering_book

Have you ever tried processing Common Crawl? What’s the weirdest thing you’ve found in a raw WARC file? Let’s talk in the comments! πŸ‘‡

