Why Your Content Pipeline Needs Deduplication Before Anything Else
I built a knowledge base that ingests thousands of developer articles daily. The first thing I learned isn't about embeddings, retrieval, or vector search.
It's about deduplication. And most people get it completely wrong.
The Problem Nobody Talks About
Scrape 1,000 developer articles and you'll find that 30-40% of them are duplicates. Same article, different hosts:
- Original on a personal blog
- Cross-posted to dev.to, Medium, Hashnode
- Republished on 3 aggregator sites
- Mirrored on a company's engineering blog
If you ingest all of them, your knowledge base is bloated, your RAG retrieval returns 4 identical chunks, and your users see the same content over and over.
Worse: the duplicate from a low-quality aggregator might outrank the original because of metadata quirks. Garbage wins over gold.
Why Exact Matching Fails
Your first instinct might be URL normalization or content hashing (SHA-256). That catches byte-identical copies. But it misses the real-world cases:
- Cross-platform formatting differences — Medium wraps <figure> tags differently than dev.to
- Added boilerplate — "Originally published on..." footers, author bios, newsletter CTAs
- Minor edits — typos fixed between cross-posts
- Truncated versions — one platform has a "read more" link, the other has the full text
Exact hashes see these as completely different documents. They're not.
Enter SimHash: The Right Tool for the Job
SimHash converts a document into a 64-bit fingerprint where similar documents have similar fingerprints. The similarity metric is Hamming distance — count the bit positions that differ.
In practice:
- Hamming distance ≤ 3 → near-identical (same article, minor formatting changes)
- Hamming distance 4-10 → related (different versions, excerpts)
- Hamming distance > 10 → different content
For a knowledge base with ~100K articles, I use distance ≤ 3 as my dedup threshold. It catches cross-posts without false positives on genuinely different articles about the same topic.
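As a rough sketch of how those thresholds translate into code, the snippet below computes the Hamming distance between two fingerprints and maps it onto the three buckets listed above. The function names `hamming_distance` and `classify` are illustrative, not part of any library.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")


def classify(a: int, b: int) -> str:
    """Map a Hamming distance onto the article's three buckets."""
    distance = hamming_distance(a, b)
    if distance <= 3:
        return "near-identical"
    if distance <= 10:
        return "related"
    return "different"
```

The cut-offs are the ones quoted in this post; a different corpus may need them tuned against labeled examples.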
How It Works (The Simple Version)
- Tokenize the article into shingles (overlapping word sequences, typically 5-grams)
- Hash each shingle to a 64-bit integer
- Weight each hash by term frequency (or just use 1 for uniform weighting)
- Accumulate — for each bit position, add or subtract the weight based on whether the hash bit is 1 or 0
- Sign — if the accumulated value at a position is positive, set the output bit to 1; otherwise 0
The result: a 64-bit fingerprint where documents sharing many shingles end up with similar fingerprints.
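# Simplified example: uniform weights, one vote per token per bit
import hashlib

def simhash(tokens, bits=64):
    v = [0] * bits  # running vote per bit position
    for token in tokens:
        # Stable 64-bit hash; the built-in hash() is salted per process for strings
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(bits):
            bit = (h >> i) & 1
            v[i] += 1 if bit else -1
    fingerprint = 0
    for i in range(bits):
        if v[i] > 0:  # more 1-votes than 0-votes at this position
            fingerprint |= (1 << i)
    return fingerprint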
The beauty: comparing two fingerprints is a single XOR + bit count. O(1) comparison, regardless of document size.
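To make the full flow concrete, here is a minimal end-to-end sketch covering the shingling and term-frequency weighting steps, followed by the XOR + bit count comparison. It assumes a simple lowercase word tokenizer and BLAKE2b as the 64-bit hash; both are choices made for illustration, and `shingles`, `hash64`, and `weighted_simhash` are hypothetical names.

```python
import hashlib
import re
from collections import Counter


def shingles(text: str, k: int = 5) -> list[str]:
    """Overlapping k-word sequences over lowercased word tokens."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


def hash64(value: str) -> int:
    """Stable 64-bit hash (assumption: BLAKE2b truncated to 8 bytes)."""
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def weighted_simhash(text: str, k: int = 5, bits: int = 64) -> int:
    """SimHash where each shingle votes with its term frequency."""
    votes = [0] * bits
    for shingle, weight in Counter(shingles(text, k)).items():
        h = hash64(shingle)
        for i in range(bits):
            votes[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i in range(bits) if votes[i] > 0)


def hamming_distance(a: int, b: int) -> int:
    """XOR the fingerprints, then count the set bits."""
    return bin(a ^ b).count("1")
```

Because the tokenizer drops case and punctuation, two copies that differ only in those respects produce the same fingerprint. Stripping HTML before tokenizing is left out here and would be needed for real pages.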
Two-Level Dedup Architecture
In production, I run dedup at two levels:
Level 1: Local SQLite Index
Every ingested article's SimHash gets stored locally:
CREATE TABLE simhash_index (
    url TEXT PRIMARY KEY,
    simhash INTEGER NOT NULL,
    category TEXT,
    ingested_at TIMESTAMP
);
Before processing a new URL, I check: does any existing fingerprint have Hamming distance ≤ 3? If yes → skip.
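A minimal sketch of that check using Python's built-in sqlite3 module and the table above. One detail to watch: SQLite's INTEGER is a signed 64-bit value, so fingerprints with the top bit set have to be mapped into the signed range before storage. The helpers `to_signed`, `to_unsigned`, `find_near_duplicate`, and `ingest` are illustrative names, not the pipeline's actual code.

```python
import sqlite3

MASK64 = (1 << 64) - 1


def to_signed(fp: int) -> int:
    """Map an unsigned 64-bit value into SQLite's signed INTEGER range."""
    return fp - (1 << 64) if fp >= (1 << 63) else fp


def to_unsigned(value: int) -> int:
    """Recover the unsigned 64-bit fingerprint from the stored value."""
    return value & MASK64


def find_near_duplicate(conn, fingerprint, max_distance=3):
    """Linear scan: URL of the first stored near-duplicate, or None."""
    for url, stored in conn.execute("SELECT url, simhash FROM simhash_index"):
        if bin(to_unsigned(stored) ^ fingerprint).count("1") <= max_distance:
            return url
    return None


def ingest(conn, url, fingerprint, category=None):
    """Store the article unless a near-duplicate exists; True if stored."""
    if find_near_duplicate(conn, fingerprint) is not None:
        return False
    conn.execute(
        "INSERT INTO simhash_index (url, simhash, category, ingested_at) "
        "VALUES (?, ?, ?, CURRENT_TIMESTAMP)",
        (url, to_signed(fingerprint), category),
    )
    return True


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE simhash_index ("
    "url TEXT PRIMARY KEY, simhash INTEGER NOT NULL, "
    "category TEXT, ingested_at TIMESTAMP)"
)
```

The scan runs in Python rather than SQL because stock SQLite has no built-in popcount function.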
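For 100K articles, a linear scan takes ~50ms on an M2 chip. For millions, you'd use a multi-index strategy: split the 64 bits into 4 chunks of 16 bits each, index each chunk, and only compare candidates that match exactly on at least one chunk. This is safe for a threshold of 3 because at most 3 differing bits can touch at most 3 of the 4 chunks, so at least one chunk is always identical.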
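The chunked lookup can be sketched in memory as follows. This is an illustration under stated assumptions rather than the production index: `ChunkIndex` and `chunk_keys` are hypothetical names, and a real deployment would back the buckets with indexed database columns instead of a dict.

```python
from collections import defaultdict

CHUNKS = 4
CHUNK_BITS = 16
CHUNK_MASK = (1 << CHUNK_BITS) - 1


def chunk_keys(fp: int) -> list[tuple[int, int]]:
    """Split a 64-bit fingerprint into (chunk position, 16-bit value) keys."""
    return [(i, (fp >> (i * CHUNK_BITS)) & CHUNK_MASK) for i in range(CHUNKS)]


class ChunkIndex:
    def __init__(self):
        # (chunk position, chunk value) -> fingerprints sharing that chunk
        self.buckets = defaultdict(set)

    def add(self, fp: int) -> None:
        for key in chunk_keys(fp):
            self.buckets[key].add(fp)

    def query(self, fp: int, max_distance: int = 3) -> list[int]:
        """Gather candidates sharing a chunk, then verify the full distance."""
        candidates = set()
        for key in chunk_keys(fp):
            candidates |= self.buckets.get(key, set())
        return sorted(
            c for c in candidates if bin(c ^ fp).count("1") <= max_distance
        )
```

The 4-chunk split only guarantees full recall for thresholds up to 3. A larger threshold would need more chunks (threshold + 1 of them) to keep the same guarantee.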
Level 2: Remote API Check
When the local ingester syncs to the central knowledge base, the server runs its own dedup check. This catches cases where:
- Multiple ingestion nodes process the same URL simultaneously
- The local index is stale
- Cross-node dedup is needed
The server responds with CONFLICT if a near-duplicate already exists, and the article gets marked as IGNORED.
The Numbers
On my pipeline processing ~500 articles/day:
| Metric | Value |
|---|---|
| Duplicate rate | 30-40% |
| False positive rate | < 0.5% |
| False negative rate | ~2% (missed cross-posts with heavy reformatting) |
| Comparison time (100K index) | ~50ms |
| Storage per fingerprint | 8 bytes + metadata |
That 30-40% saving isn't just about storage. It's about processing cost (AI scoring runs ~$0.002/article), CDN bandwidth, and retrieval quality. Every duplicate you skip is money and compute saved.
What SimHash Can't Do
SimHash is great for near-duplicate detection. It's not great for:
- Topic dedup — Two different articles about "React hooks best practices" will have high Hamming distance. You need semantic similarity (embeddings) for that.
- Plagiarism detection — Paraphrased content won't match at the token level.
- Image/video dedup — Perceptual hashing (pHash, dHash) is the right tool for media.
For a complete pipeline, you want SimHash first (fast, cheap, catches the bulk of dupes), then semantic routing (catches topic overlap), then quality scoring (filters out low-value content).
The Takeaway
Before you spend a dollar on vector databases, embedding models, or retrieval strategies — fix your dedup. It's the highest-ROI optimization in any content pipeline.
SimHash gives you 64-bit fingerprints and O(1) comparisons, and it flags the 30-40% of crawled content that is duplicated. It runs in SQLite, costs nothing, and makes everything downstream better.
The unsexy infrastructure wins. Every time.
This is part of my series building an AI-powered developer knowledge base in public. Follow along for more real-world engineering breakdowns.