Project: Building "Mini-C4" Pre-training Corpus
This project demonstrates how to build a miniaturized version of the C4 (Colossal Clean Crawled Corpus) pipeline. Our mission: transform chaotic, raw web data (Common Crawl) into low-noise, deduplicated, high-quality text ready for LLM pre-training.
GitHub: datascale-ai/data_engineering_book
1. Project Brief
- Objective: Build a pipeline to process raw Common Crawl data into a clean text corpus.
- Input: Raw WARC files (`.warc.gz`) containing HTTP headers, HTML source, and binary noise.
- Output: Categorized JSONL files (`final_data.jsonl`) featuring clean text, language labels, and Perplexity (PPL) scores.
- Challenges:
- Extremely Low Signal-to-Noise Ratio: Over 90% of raw web data consists of navbars, ads, SEO spam, and JS code.
- Fuzzy Deduplication: Identifying semantically similar documents across millions of records is computationally expensive.
- Quality Quantification: How to distinguish "human-grade prose" from "machine-generated gibberish" without expensive LLM APIs.
2. Architecture Design
We designed a funnel-shaped pipeline that filters noise layer by layer.
Tech Stack Decisions
| Component | Choice | Rationale |
|---|---|---|
| Parsing | `warcio`, `trafilatura` | `trafilatura` excels at extracting main content (removing footers/ads), far better than BeautifulSoup. |
| Compute | Ray | Python's `multiprocessing` has high overhead for large shared state; Ray's task/actor model scales easily from multi-core to clusters. |
| Deduplication | MinHash LSH | Locality Sensitive Hashing avoids the O(n²) cost of comparing all document pairs. |
| Evaluation | KenLM | A lightweight N-gram model used by GPT-3/CCNet to measure text "naturalness" via Perplexity. |
3. Step-by-Step Implementation
Phase I: Heuristic Cleaning & Extraction
Raw WARC files are a mess. We use warcio for streaming and trafilatura to extract the "soul" of the webpage.
Code Insight: Streaming Processor
```python
from warcio.archiveiterator import ArchiveIterator
import trafilatura

# `stream` is an open binary file over a .warc.gz archive,
# e.g. stream = open("crawl.warc.gz", "rb")
for record in ArchiveIterator(stream):
    if record.rec_type == 'response':
        # Keep HTML responses only
        if 'text/html' not in record.http_headers.get_header('Content-Type', ''):
            continue
        # Extract the main body, ignoring comments and tables
        text = trafilatura.extract(record.content_stream().read(),
                                   include_comments=False,
                                   include_tables=False)
```
The Cleaning Rules (Gopher/C4 Standards):
- Symbol-to-Word Ratio: If symbols like `{ } [ ]` exceed 10% of characters, it's likely code.
- Average Word Length: High-quality English text usually averages 5-10 characters per word. Values > 15 suggest minified JS or URL lists.
- Keyword Blocklist: Drop pages containing "lorem ipsum", "enable cookies", or "403 forbidden".
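These rules are easy to sketch in plain Python. The function names, thresholds' packaging, and the exact symbol set below are illustrative choices of mine, not taken from the repo:

```python
# Illustrative heuristic filters; only the 10% symbol-ratio and
# 15-chars/word thresholds come from the rules above.
SYMBOLS = set("{}[]<>")
BLOCKLIST = ("lorem ipsum", "enable cookies", "403 forbidden")

def symbol_ratio(text: str) -> float:
    """Fraction of characters that are code-like symbols."""
    return sum(c in SYMBOLS for c in text) / max(len(text), 1)

def avg_word_length(text: str) -> float:
    """Mean word length in characters (0.0 for empty text)."""
    words = text.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0

def passes_heuristics(text: str,
                      max_symbol_ratio: float = 0.10,
                      max_avg_word_len: float = 15.0) -> bool:
    lower = text.lower()
    if any(kw in lower for kw in BLOCKLIST):
        return False
    if symbol_ratio(text) > max_symbol_ratio:
        return False
    if avg_word_length(text) > max_avg_word_len:
        return False
    return True
```

A filter like this is cheap enough to run inline during WARC extraction, before any expensive deduplication or scoring.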
Phase II: Distributed MinHash Deduplication
To handle "mirrored" content, we use Ray to parallelize the computation of MinHash signatures.
Code Insight: Ray Remote Tasks (Map-Reduce)
```python
import ray
from datasketch import MinHash

ray.init()

@ray.remote
def process_batch(lines):
    results = []
    for line in lines:
        m = MinHash(num_perm=128)
        for w in line['text'].split():
            m.update(w.encode('utf8'))
        results.append((line['url'], m))
    return results

# Map: dispatch batches to all CPU cores; Reduce: gather signatures
futures = [process_batch.remote(batch) for batch in batches]
results = ray.get(futures)
```
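For the reduce side, the banding trick behind MinHash LSH can be illustrated without any library. In the real pipeline `datasketch.MinHashLSH` would handle this; the sketch below is my simplification, using salted hashes in place of true permutations:

```python
import hashlib

def minhash_signature(text: str, num_perm: int = 128) -> list[int]:
    # Approximate num_perm permutations by salting each token
    # with the permutation index and keeping the minimum hash.
    tokens = set(text.split())
    return [
        min(int.from_bytes(hashlib.md5(f"{i}:{t}".encode()).digest()[:8], "big")
            for t in tokens)
        for i in range(num_perm)
    ]

def lsh_buckets(signatures: dict, bands: int = 16) -> dict:
    # Split each signature into bands; documents sharing any full band
    # become candidate duplicates. Returns only buckets with >1 doc.
    rows = len(next(iter(signatures.values()))) // bands
    buckets: dict = {}
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, []).append(doc_id)
    return {k: v for k, v in buckets.items() if len(v) > 1}
```

More bands with fewer rows each makes the filter more permissive (catches lower-similarity pairs); fewer, wider bands makes it stricter.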
Phase III: Quality Filtering (KenLM)
We use a pre-trained KenLM model to calculate Perplexity. Lower perplexity means more "natural" language.
Tuning the Threshold:
- Score > -5.0: Wikipedia-grade, highly fluent content.
- Score -5.0 to -6.0: Standard blog posts and forum discussions.
- Score < -6.5: Broken sentences, machine translation failures, or SEO keyword lists (Discard).
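KenLM reports a length-normalized log10 score per word, which is what the thresholds above are applied to. A toy unigram scorer (my own illustration, far cruder than KenLM's smoothed N-grams) shows the mechanics of thresholding on normalized log-probability:

```python
import math
from collections import Counter

def train_unigram(corpus_texts):
    """Add-one-smoothed unigram probabilities plus an <unk> probability."""
    counts = Counter(w for t in corpus_texts for w in t.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 slot for unseen words
    probs = {w: (c + 1) / (total + vocab) for w, c in counts.items()}
    return probs, 1 / (total + vocab)

def normalized_score(text, probs, unk_prob):
    """Length-normalized log10 probability, analogous to KenLM's score/word."""
    words = text.lower().split()
    if not words:
        return float("-inf")
    logp = sum(math.log10(probs.get(w, unk_prob)) for w in words)
    return logp / len(words)
```

Fluent, in-domain text scores higher (closer to zero) than gibberish, which is exactly the property the -5.0 / -6.5 cutoffs exploit.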
4. Performance & Showcase (Data Funnel)
Based on processing a sample 1 GB WARC file:
| Stage | In (Docs) | Out (Docs) | Retention | Main Loss Reason |
|---|---|---|---|---|
| Raw WARC | ~35,000 | ~10,000 | 28% | Non-HTML, Empty content. |
| Heuristics | 10,000 | ~6,500 | 65% | Code snippets, short text. |
| Deduplication | 6,500 | ~4,800 | 73% | Mirrored sites, templates. |
| Quality Filter | 4,800 | ~3,900 | 81% | Gibberish, non-English. |
| Final Yield | 35,000 | 3,900 | ~11% | Data Purity over Volume. |
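The funnel multiplies out as claimed; a quick sanity check of the approximate document counts from the table:

```python
# Per-stage (input, output) document counts from the funnel table.
stages = [
    ("Raw WARC",       35_000, 10_000),
    ("Heuristics",     10_000,  6_500),
    ("Deduplication",   6_500,  4_800),
    ("Quality Filter",  4_800,  3_900),
]

for name, n_in, n_out in stages:
    print(f"{name:15s} retention = {100 * n_out / n_in:.1f}%")

overall = stages[-1][2] / stages[0][1]  # final output / raw input
print(f"Overall yield = {100 * overall:.1f}%")
```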
5. Scaling to Terabytes (The Next Steps)
- State Management: Move `MinHashLSH` indices from RAM to Redis or Cassandra to handle billions of records.
- I/O Optimization: Transition from local files to S3/MinIO, using Apache Arrow for columnar streaming.
- Global Sharding: Follow the CCNet approach: shard data by hash buckets and deduplicate within shards to minimize cross-node communication.
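The sharding idea can be sketched in a few lines. The hash function and shard count below are arbitrary choices of mine, and CCNet's actual scheme normalizes paragraphs before hashing; this only shows the routing principle:

```python
import hashlib

def shard_for(text: str, num_shards: int = 64) -> int:
    """Content-based shard routing: identical documents always land in
    the same shard, so deduplication can run entirely shard-locally."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

For near-duplicate (MinHash) dedup, the same idea applies with LSH band keys as the routing key, so candidate pairs co-locate without cross-node communication.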
Conclusion
Building Mini-C4 is a masterclass in Data Funneling. It's not about how much data you have, but how effectively you can discard the garbage.
Full Source Code: datascale-ai/data_engineering_book
Have you ever tried processing Common Crawl? What's the weirdest thing you've found in a raw WARC file? Let's talk in the comments!
