kol kol

Posted on May 16

Why Your Content Pipeline Needs Deduplication Before Anything Else

#codcompass #ai #knowledgebase #webdev

I built a knowledge base that ingests thousands of developer articles daily. The first thing I learned isn't about embeddings, retrieval, or vector search.

It's about deduplication. And most people get it completely wrong.

The Problem Nobody Talks About

Scrape 1,000 developer articles and you'll find that 30-40% of them are duplicates. Same article, different hosts:

Original on a personal blog
Cross-posted to dev.to, Medium, Hashnode
Republished on 3 aggregator sites
Mirrored on a company's engineering blog

If you ingest all of them, your knowledge base is bloated, your RAG retrieval returns 4 identical chunks, and your users see the same content over and over.

Worse: the duplicate from a low-quality aggregator might outrank the original because of metadata quirks. Garbage wins over gold.

Why Exact Matching Fails

Your first instinct might be URL normalization or content hashing (SHA-256). URL normalization catches the same page reached through different URLs, and hashing catches byte-identical copies. But both miss the real-world cases:

Cross-platform formatting differences — Medium wraps <figure> tags differently than dev.to
Added boilerplate — "Originally published on..." footers, author bios, newsletter CTAs
Minor edits — typos fixed between cross-posts
Truncated versions — one platform has a "read more" link, the other has the full text

Exact hashes see these as completely different documents. They're not.
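To make that concrete, here is a minimal sketch of an exact hash failing on a cross-post. The two strings are made-up sample text, not data from my pipeline, and `sha256_hex` is just a helper name for this example.

```python
import hashlib

def sha256_hex(text: str) -> str:
    """Exact content hash: any single-byte change produces a different digest."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Hypothetical sample text: the same article body, plus a typical cross-post footer.
original = "Use useEffect cleanup functions to avoid memory leaks."
cross_post = original + "\n\nOriginally published on my blog."

print(sha256_hex(original) == sha256_hex(original))    # True
print(sha256_hex(original) == sha256_hex(cross_post))  # False
```

One added footer is enough to make the two copies look unrelated to an exact hash.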

Enter SimHash: The Right Tool for the Job

SimHash converts a document into a 64-bit fingerprint where similar documents have similar fingerprints. The similarity metric is Hamming distance — count the bit positions that differ.

In practice:

Hamming distance ≤ 3 → near-identical (same article, minor formatting changes)
Hamming distance 4-10 → related (edited versions, excerpts; translations share almost no shingles and land far outside this range)
Hamming distance > 10 → different content

For a knowledge base with ~100K articles, I use distance ≤ 3 as my dedup threshold. It catches cross-posts with very few false positives on genuinely different articles about the same topic.
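The distance bands above can be sketched in a few lines. The function names and the sample fingerprint are illustrative, and the cutoffs (3 and 10) are simply the rule-of-thumb values from this post, so treat them as tunable.

```python
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions where two fingerprints differ."""
    return bin(a ^ b).count("1")

def classify(a: int, b: int, near: int = 3, related: int = 10) -> str:
    """Map a distance to the bands above (cutoffs are rules of thumb)."""
    d = hamming_distance(a, b)
    if d <= near:
        return "near-identical"
    if d <= related:
        return "related"
    return "different"

fp = 0xF0F0_F0F0_F0F0_F0F0               # arbitrary sample fingerprint
print(hamming_distance(fp, fp ^ 0b101))  # 2
print(classify(fp, fp ^ 0b101))          # near-identical
print(classify(fp, fp ^ 0xFF))           # related (8 bits flipped)
print(classify(fp, fp ^ 0xFFFF))         # different (16 bits flipped)
```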

How It Works (The Simple Version)

Tokenize the article into shingles (overlapping word sequences, typically 5-grams)
Hash each shingle to a 64-bit integer
Weight each hash by term frequency (or just use 1 for uniform weighting)
Accumulate — for each bit position, add or subtract the weight based on whether the hash bit is 1 or 0
Sign — if the accumulated value at a position is positive, set the output bit to 1; otherwise 0

The result: a 64-bit fingerprint where documents sharing many shingles end up with similar fingerprints.
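The first step is the one that is easiest to get subtly wrong, so here is a minimal sketch of it. It assumes whitespace and punctuation are ignored and text is lowercased, which is my choice for this example rather than a requirement of SimHash. The returned counts can double as the term-frequency weights from step 3.

```python
import re
from collections import Counter

def shingles(text: str, k: int = 5) -> Counter:
    """Count overlapping k-word shingles in a text.

    Normalization (lowercasing, keeping only word characters) is an
    assumption of this sketch; pick whatever suits your content.
    """
    words = re.findall(r"\w+", text.lower())
    if not words:
        return Counter()
    if len(words) < k:
        # Text shorter than one shingle: treat the whole text as a single shingle.
        return Counter([" ".join(words)])
    return Counter(" ".join(words[i:i + k]) for i in range(len(words) - k + 1))

counts = shingles("The quick brown fox jumps over the lazy dog")
print(len(counts))  # 5 (a 9-word text yields 9 - 5 + 1 shingles)
```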

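# Simplified example
import hashlib

def simhash(tokens, bits=64):
    v = [0] * bits  # one signed counter per bit position
    for token in tokens:
        # Stable 64-bit hash. The built-in hash() is salted per process for
        # strings, so its fingerprints would not survive a restart.
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(bits):
            bit = (h >> i) & 1
            v[i] += 1 if bit else -1  # uniform weight of 1 per token
    fingerprint = 0
    for i in range(bits):
        if v[i] > 0:  # more 1s than 0s were seen at this position
            fingerprint |= (1 << i)
    return fingerprint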

The beauty: comparing two fingerprints is a single XOR + bit count. O(1) comparison, regardless of document size.
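Here is a sketch of the weighted variant (step 3) together with the XOR comparison. To keep it short it weights single words by frequency instead of 5-gram shingles, and it uses BLAKE2b as the 64-bit hash. Both are choices made for this example, and the sample texts are invented.

```python
import hashlib
from collections import Counter

def hash64(token: str) -> int:
    """Stable 64-bit hash of a token (BLAKE2b truncated to 8 bytes)."""
    digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def weighted_simhash(weights, bits=64):
    """SimHash where each token votes with its weight instead of 1."""
    v = [0] * bits
    for token, weight in weights.items():
        h = hash64(token)
        for i in range(bits):
            v[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming_distance(a: int, b: int) -> int:
    """XOR the fingerprints, then count the set bits."""
    return bin(a ^ b).count("1")

def word_counts(text: str) -> Counter:
    """Term frequencies of single words (a simplification of shingling)."""
    return Counter(text.lower().split())

post = "simhash maps similar documents to similar fingerprints so near duplicates stay close"
cross_post = post + " originally published on my blog"

a = weighted_simhash(word_counts(post))
b = weighted_simhash(word_counts(cross_post))
print(hamming_distance(a, b))  # exact value depends on the hash function
```

On texts this short the distance is noisy. Real articles contribute hundreds of shingles, so a short footer moves far fewer bits.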

Two-Level Dedup Architecture

In production, I run dedup at two levels:

Level 1: Local SQLite Index

Every ingested article's SimHash gets stored locally:

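CREATE TABLE simhash_index (
    url TEXT PRIMARY KEY,
    -- SQLite INTEGER is a signed 64-bit value, so fingerprints with the top bit set
    -- must be stored as their two's-complement signed equivalent
    simhash INTEGER NOT NULL,
    category TEXT,
    ingested_at TIMESTAMP
);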

Before processing a new URL, I check: does any existing fingerprint have Hamming distance ≤ 3? If yes → skip.
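A minimal sketch of that check, using Python's built-in `sqlite3` module and the schema above. The helper names (`to_signed`, `find_near_duplicate`, `ingest`) are made up for this example. The signed conversion matters because SQLite integers are signed 64-bit, and `sqlite3` raises `OverflowError` for values above 2^63 - 1.

```python
import sqlite3

SCHEMA = """
CREATE TABLE simhash_index (
    url TEXT PRIMARY KEY,
    simhash INTEGER NOT NULL,
    category TEXT,
    ingested_at TIMESTAMP
)
"""
MASK64 = (1 << 64) - 1

def to_signed(fp: int) -> int:
    """Map an unsigned 64-bit fingerprint into SQLite's signed 64-bit range."""
    return fp - (1 << 64) if fp >= (1 << 63) else fp

def to_unsigned(value: int) -> int:
    """Inverse of to_signed."""
    return value & MASK64

def find_near_duplicate(conn, fp, max_distance=3):
    """Linear scan: return the URL of the first stored near-duplicate, or None."""
    for url, stored in conn.execute("SELECT url, simhash FROM simhash_index"):
        if bin(to_unsigned(stored) ^ fp).count("1") <= max_distance:
            return url
    return None

def ingest(conn, url, fp, category=None):
    """Store the fingerprint unless a near-duplicate exists; return that duplicate's URL."""
    duplicate = find_near_duplicate(conn, fp)
    if duplicate is not None:
        return duplicate  # skip: already have this article
    conn.execute(
        "INSERT INTO simhash_index (url, simhash, category, ingested_at) "
        "VALUES (?, ?, ?, CURRENT_TIMESTAMP)",
        (url, to_signed(fp), category),
    )
    return None

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
fp = 0xFFFF_0000_FFFF_0000  # sample fingerprint with the top bit set
print(ingest(conn, "https://example.com/original", fp))        # None (stored)
print(ingest(conn, "https://example.com/cross-post", fp ^ 3))  # https://example.com/original
```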

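For 100K articles, a linear scan takes ~50ms on an M2 chip. For millions, you'd use a multi-index strategy: split the 64 bits into 4 chunks of 16 bits each, index each chunk, and only compare candidates that match on at least one chunk. This loses nothing at distance ≤ 3, because 3 differing bits can touch at most 3 of the 4 chunks, so at least one chunk always matches exactly.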
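The multi-index idea can be sketched in memory with one dictionary per chunk. `BandedIndex` is a hypothetical class for illustration, not what runs in my pipeline; in SQLite the same idea would be four indexed chunk columns. The candidate filter only guarantees no misses while `max_distance` is smaller than the number of chunks.

```python
from collections import defaultdict

class BandedIndex:
    """Index 64-bit fingerprints by 16-bit chunks to avoid a full scan."""

    def __init__(self, bands: int = 4, bits: int = 64):
        self.bands = bands
        self.width = bits // bands
        self.mask = (1 << self.width) - 1
        # One lookup table per chunk position: chunk value -> [(url, fingerprint)]
        self.tables = [defaultdict(list) for _ in range(bands)]

    def _chunks(self, fp: int):
        return [(fp >> (i * self.width)) & self.mask for i in range(self.bands)]

    def add(self, url: str, fp: int) -> None:
        for table, chunk in zip(self.tables, self._chunks(fp)):
            table[chunk].append((url, fp))

    def query(self, fp: int, max_distance: int = 3):
        """Return URLs within max_distance, checking only chunk-matching candidates."""
        seen, hits = set(), []
        for table, chunk in zip(self.tables, self._chunks(fp)):
            for url, other in table.get(chunk, ()):
                if url in seen:
                    continue
                seen.add(url)
                if bin(fp ^ other).count("1") <= max_distance:
                    hits.append(url)
        return hits

index = BandedIndex()
fp = 0x1234_5678_9ABC_DEF0  # arbitrary sample fingerprint
index.add("original", fp)

three_bits = fp ^ (1 | 1 << 16 | 1 << 32)           # one flipped bit in 3 of 4 chunks
four_bits = fp ^ (1 | 1 << 16 | 1 << 32 | 1 << 48)  # one flipped bit in every chunk
print(index.query(three_bits))  # ['original']
print(index.query(four_bits))   # []
```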

Level 2: Remote API Check

When the local ingester syncs to the central knowledge base, the server runs its own dedup check. This catches cases where:

Multiple ingestion nodes process the same URL simultaneously
The local index is stale
Cross-node dedup is needed

The server responds with CONFLICT if a near-duplicate already exists, and the article gets marked as IGNORED.
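A rough sketch of that handshake, with an in-memory stand-in for the server so it runs without a network. Everything here is hypothetical (`CentralIndex`, `submit`, `sync`, and the `SYNCED` label); only the `CONFLICT` response and the `IGNORED` status come from the description above.

```python
from dataclasses import dataclass, field
from enum import Enum

class Status(Enum):
    ACCEPTED = "ACCEPTED"
    CONFLICT = "CONFLICT"

@dataclass
class CentralIndex:
    """In-memory stand-in for the central knowledge base (hypothetical)."""
    fingerprints: dict = field(default_factory=dict)  # url -> fingerprint

    def submit(self, url: str, fp: int, max_distance: int = 3) -> Status:
        # The server repeats the near-duplicate check against its own index.
        for other in self.fingerprints.values():
            if bin(fp ^ other).count("1") <= max_distance:
                return Status.CONFLICT
        self.fingerprints[url] = fp
        return Status.ACCEPTED

def sync(server: CentralIndex, url: str, fp: int) -> str:
    """Return the local status for an article after syncing it."""
    if server.submit(url, fp) is Status.CONFLICT:
        return "IGNORED"
    return "SYNCED"  # label invented for this sketch

server = CentralIndex()
fp = 0x0F0F_0F0F_0F0F_0F0F  # arbitrary sample fingerprint
# Two ingestion nodes push the same article under different URLs.
print(sync(server, "https://blog.example/post", fp))      # SYNCED
print(sync(server, "https://mirror.example/post", fp ^ 1))  # IGNORED
```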

The Numbers

On my pipeline processing ~500 articles/day:

Metric	Value
Duplicate rate	30-40%
False positive rate	< 0.5%
False negative rate	~2% (missed cross-posts with heavy reformatting)
Comparison time (100K index)	~50ms
Storage per fingerprint	8 bytes + metadata

That 30-40% saving isn't just about storage. It's about processing cost (AI scoring runs ~$0.002/article), CDN bandwidth, and retrieval quality. Every duplicate you skip is money and compute saved.
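For the scoring line item alone, the arithmetic looks like this. It assumes the ~500 articles/day figure is the volume before dedup, and `daily_scoring_savings` is just a helper name for this example.

```python
def daily_scoring_savings(articles_per_day: int, duplicate_rate: float, cost_per_article: float) -> float:
    """Scoring cost avoided per day by skipping duplicates before scoring."""
    return articles_per_day * duplicate_rate * cost_per_article

# Figures from this post: ~500 articles/day, 30-40% duplicates, ~$0.002 per scored article.
for rate in (0.30, 0.40):
    saved = daily_scoring_savings(500, rate, 0.002)
    print(f"{rate:.0%} duplicates: ${saved:.2f}/day saved on scoring")
# 30% duplicates: $0.30/day saved on scoring
# 40% duplicates: $0.40/day saved on scoring
```

The per-article cost scales linearly with volume, so the same calculation applies at any pipeline size.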

What SimHash Can't Do

SimHash is great for near-duplicate detection. It's not great for:

Topic dedup — Two different articles about "React hooks best practices" will have high Hamming distance. You need semantic similarity (embeddings) for that.
Plagiarism detection — Paraphrased content won't match at the token level.
Image/video dedup — Perceptual hashing (pHash, dHash) is the right tool for media.

For a complete pipeline, you want SimHash first (fast, cheap, catches the bulk of dupes), then semantic routing (catches topic overlap), then quality scoring (filters out low-value content).

The Takeaway

Before you spend a dollar on vector databases, embedding models, or retrieval strategies — fix your dedup. It's the highest-ROI optimization in any content pipeline.

SimHash gives you 64-bit fingerprints and O(1) pairwise comparisons, and in my pipeline it flags 30-40% of crawled content as duplicates. It runs in SQLite, costs nothing, and makes everything downstream better.

The unsexy infrastructure wins. Every time.

This is part of my series building an AI-powered developer knowledge base in public. Follow along for more real-world engineering breakdowns.
