As large language models push into trillion-token training territory, I’ve observed a critical bottleneck emerge: data duplication. When scaling datasets to 15 trillion tokens—like Kimi K2 or GPT-4—even 0.1% duplication wastes $150K+ in compute. Here’s what works (and what backfires) at scale.
Why Deduplication Isn’t Optional
During a recent deduplication project for a billion-document corpus, I measured concrete impacts:
- Compute Waste: 20% duplicated shingles consumed 18% extra GPU-hours.
- Model Degradation: In fine-tuning tests, duplicated data reduced accuracy by 4% on reasoning tasks.
- Memorization Risks: Verbatim duplicates increased privacy leakage by 8× in model outputs.
Key insight: More data ≠ better data. At trillion-scale, filtering duplicates isn’t preprocessing—it’s infrastructure.
Beyond Basic Hashing: The MinHash LSH Workflow
Cryptographic hashing misses near-duplicates (e.g., reformatted code or translated articles). Semantic deduplication? Prohibitively expensive at scale. Instead, I use MinHash LSH—a probabilistic method balancing precision and cost.
How It Operates
- Shingling: Split documents into overlapping word triplets (n=3).

```python
def shingle(text: str, n: int = 3) -> set:
    """Return the set of overlapping n-word shingles for a document."""
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
```
- MinHash Signatures: Generate compressed document fingerprints (a minimal sketch follows below).
  - Problem: Storing signature values as float32 silently rounds integers above 16,777,216 (the float32 precision ceiling), so distinct hash values can collide.
  - Fix: Use uint32 vectors with binary packing.
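Here is a minimal sketch of the signature step, assuming only hashlib and NumPy rather than a dedicated MinHash library; the salted BLAKE2 hashes stand in for the production hash family, it reuses the `shingle()` helper above, and `num_perm=120` simply matches the 10-band × 12-row configuration discussed later.

```python
import hashlib

import numpy as np

def minhash_signature(shingles: set, num_perm: int = 120) -> np.ndarray:
    """Compute a MinHash signature: the minimum hash per permutation, kept as uint32."""
    sig = np.full(num_perm, np.iinfo(np.uint32).max, dtype=np.uint32)
    for s in shingles:
        for i in range(num_perm):
            # Salting with the permutation index stands in for independent hash functions.
            digest = hashlib.blake2b(f"{i}:{s}".encode(), digest_size=4).digest()
            value = int.from_bytes(digest, "little")  # always fits in uint32
            if value < sig[i]:
                sig[i] = value
    return sig

# Near-duplicate documents agree in most signature positions:
sig_a = minhash_signature(shingle("the quick brown fox jumps over the lazy dog"))
sig_b = minhash_signature(shingle("the quick brown fox jumped over the lazy dog"))
print((sig_a == sig_b).mean())  # ≈ Jaccard similarity of the two shingle sets
```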
- Locality-Sensitive Hashing (LSH): Cluster signatures into "bands" for collision-based similarity detection.
```python
# Banding example: a 9-value signature split into 3 bands of 3 rows each.
signature = [281, 812, 102, 993, 374, 555, 621, 901, 446]
bands = [
    hash(tuple(signature[0:3])),
    hash(tuple(signature[3:6])),
    hash(tuple(signature[6:9])),
]  # Two documents are duplicate candidates if any band hash matches.
```
Tradeoffs:
- More bands (each with fewer rows) increase recall (more duplicates found) but also raise false positives.
- For 99% recall on 1B+ docs, I use 10 bands of 12 rows each; the sketch below plots that curve.
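To make the tradeoff concrete, here is a small sketch of the standard LSH candidate-pair formula, 1 − (1 − s^r)^b, evaluated at the b=10, r=12 setting (illustrative probabilities, not measured recall):

```python
def candidate_probability(s: float, bands: int, rows: int) -> float:
    """Probability that two docs with Jaccard similarity s collide in at least one band."""
    return 1.0 - (1.0 - s ** rows) ** bands

for s in (0.5, 0.7, 0.8, 0.9, 0.95):
    print(f"similarity {s:.2f} -> candidate probability "
          f"{candidate_probability(s, bands=10, rows=12):.4f}")
# With b=10, r=12, near-duplicates (s >= 0.9) almost always become candidates,
# while weakly similar pairs (s <= 0.5) almost never do.
```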
Engineering Pitfalls at Scale
Testing on 10M Wikipedia documents exposed three critical hurdles:
1. The Float32 Trap
When storing MinHash signatures in a vector database, float32 formats corrupt values above 16,777,216.
- Solution: Binary vector support (e.g., Milvus’ `BINARY_VECTOR` type) preserves uint32 integrity; the packing sketch below shows the idea.
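A minimal sketch of the fix, assuming NumPy: it contrasts the lossy float32 cast with lossless byte packing suitable for a binary vector field (780 uint32 values become a 24,960-bit vector).

```python
import numpy as np

# One 780-dimensional MinHash signature stored as uint32.
signature = np.random.randint(0, 2**32, size=780, dtype=np.uint64).astype(np.uint32)

# Lossy path: float32 rounds every integer above 2**24 = 16,777,216.
as_float32 = signature.astype(np.float32)
print("values corrupted by float32:", int((as_float32 != signature).sum()))

# Lossless path: pack the raw uint32 bytes into a binary vector payload.
packed = signature.tobytes()                       # 780 * 4 bytes = 24,960 bits
restored = np.frombuffer(packed, dtype=np.uint32)
assert np.array_equal(restored, signature)
```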
2. Import Bottlenecks
Loading 30GB of signatures (780-dimensional uint32) took 45 minutes—unacceptable for iterative pipelines.
- Breakthrough: Parallel file processing cut this to 4 minutes (a minimal sketch follows this list). Key optimizations:
- Distributed shard ingestion
- Dynamic memory pooling
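A minimal sketch of the parallel-ingestion pattern, assuming signature shards saved as `.npy` files; `insert_batch` is a hypothetical stand-in for the vector DB client's bulk-insert call, and the shard layout and worker count are illustrative.

```python
import glob
from concurrent.futures import ProcessPoolExecutor

import numpy as np

def insert_batch(packed_rows: list) -> None:
    """Hypothetical stand-in for the vector DB client's bulk-insert call."""
    pass

def load_and_insert(shard_path: str) -> int:
    """Load one signature shard (uint32, shape [num_docs, 780]) and bulk-insert it."""
    signatures = np.load(shard_path)
    insert_batch([row.tobytes() for row in signatures])  # one binary vector per document
    return signatures.shape[0]

def ingest_all(shard_glob: str = "signatures/shard_*.npy", workers: int = 16) -> int:
    """Fan shard files out across processes so decode and insert overlap across shards."""
    shards = sorted(glob.glob(shard_glob))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(load_and_insert, shards))
```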
3. Query Concurrency Walls
At peak load (44K queries/sec), indexing collapsed. We redesigned the pipeline:
```
[Shingling] → [MinHash Gen] → [LSH Bucketing]
                                     ↓
[Distributed Vector DB] ← [Batch Dedup API]
```
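The Batch Dedup API in the diagram amortizes per-query overhead by grouping signatures before they hit the vector DB. Here is a minimal sketch of the batching side only, with `dedup_batch` as a hypothetical stand-in for the actual service call:

```python
from itertools import islice

def batched(iterable, size: int = 1024):
    """Yield fixed-size batches so the vector DB sees a few large requests, not 44K tiny ones."""
    it = iter(iterable)
    while batch := list(islice(it, size)):
        yield batch

def dedup_batch(signatures: list) -> list:
    """Hypothetical Batch Dedup API call: returns one is-duplicate flag per signature."""
    return [False] * len(signatures)

def filter_duplicates(doc_signatures):
    """Keep only documents whose signatures are not flagged as duplicates."""
    kept = []
    for batch in batched(doc_signatures):
        flags = dedup_batch(batch)
        kept.extend(sig for sig, is_dup in zip(batch, flags) if not is_dup)
    return kept
```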
Deployment Guide: Consistency Levels Matter
Not all deduplication requires strong consistency. For training data:
- Strong Consistency: Use when building canonical datasets. Guarantees no dupes—at 30% throughput cost.
- Eventual Consistency: Acceptable for augmenting live data. Achieves 97% dedup accuracy at 60% lower latency.
Misuse Example: Strong consistency in streaming data ingestion crashed our cluster at 100K docs/sec. Downgrading to eventual consistency solved it.
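As a minimal sketch of where this choice lives, assuming a Milvus deployment via pymilvus (the collection names, dimension, and host are illustrative, and the exact keyword arguments follow my reading of the pymilvus API), the consistency level is set when the collection is created:

```python
from pymilvus import (Collection, CollectionSchema, DataType, FieldSchema,
                      connections)

connections.connect(host="localhost", port="19530")

fields = [
    FieldSchema(name="doc_id", dtype=DataType.INT64, is_primary=True),
    FieldSchema(name="signature", dtype=DataType.BINARY_VECTOR, dim=780 * 32),
]
schema = CollectionSchema(fields)

# Canonical dataset build: pay the throughput cost for a guarantee of no dupes.
canonical = Collection("canonical_dedup", schema, consistency_level="Strong")

# Live augmentation: eventual consistency trades a small accuracy hit for latency.
streaming = Collection("streaming_dedup", schema, consistency_level="Eventually")
```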
Performance Benchmarks: 10M Document Test
| Method | Precision | Recall | Time (min) |
| --- | --- | --- | --- |
| Exact Hashing | 100% | 62% | 18 |
| Semantic (BERT) | 98% | 95% | 240 |
| MinHash LSH (Ours) | 92% | 99% | 27 |
Hardware: 8× AWS r6g.2xlarge instances (64 vCPUs, 512 GB RAM in total).
Reflections and Future Tests
The biggest surprise? Deduplication improved model generalization more than adding 5% more data. Next, I’m testing:
- Hybrid Semantic-MinHash Systems: Can BERT filters + LSH reduce false positives?
- Dynamic Band Adjustment: Automatically tune LSH bands based on dataset entropy.
- Pre-training Impact: Quantifying perplexity reduction from deduplicated vs. raw data.
Trillion-token training is a minefield of inefficiencies. Deduplication isn’t glamorous—but ignoring it wastes millions and cripples models.