Most tutorials show you a toy RAG pipeline with 3 PDFs and a LangChain snippet. Here's what it actually takes to run a content ingestion pipeline at production scale — processing thousands of articles from the wild into a curated, deduplicated, quality-scored knowledge base.
This isn't theoretical. This is the actual architecture I built and shipped.
The Problem
Scraping developer content is easy. Making it useful is the hard part. Raw HTML from tech blogs is full of nav bars, sidebars, cookie banners, and noise. Duplicate articles show up across aggregators. Quality ranges from "groundbreaking research" to "AI-generated SEO spam."
You need a pipeline that handles all of this automatically.
Architecture Overview
Fetch → Extract → Dedup → Score → Route → Store → CDN
Five core processing stages sit between fetch and final storage. Each one filters out noise and adds signal. Let me walk through each.
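At a glance, the whole loop looks roughly like this. Every function name below is illustrative shorthand for the stages that follow, not the actual code:

```js
// Rough shape of the ingestion loop — function names are illustrative
async function ingest(url) {
  const raw = await fetchHtml(url);               // Fetch
  const article = extract(raw);                   // Extract (Readability + Turndown)
  if (await isDuplicate(article)) return;         // Dedup (SimHash)
  const score = await scoreQuality(article);      // Score (AI)
  if (score < 80) return;                         // Quality threshold
  const category = await route(article);          // Route (taxonomy)
  const withMedia = await mirrorImages(article);  // Media → R2
  await store(withMedia, category);               // Store → served via CDN
}
```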
Stage 1: Content Extraction (Readability + Turndown)
Forget regex. The only reliable way to extract the article body from arbitrary HTML is Mozilla's Readability algorithm — the same engine Firefox uses for Reader View.
```js
// Extract clean article content from any HTML page
import { Readability } from '@mozilla/readability';
import { JSDOM } from 'jsdom';

const doc = new JSDOM(html, { url }).window.document;
const article = new Readability(doc).parse();
// article.content     → clean HTML, no chrome
// article.textContent → plain text for scoring
```
Then convert to Markdown with Turndown so the content is portable, version-controllable, and LLM-friendly.
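That conversion is a single call. A minimal sketch (the fenced code-block option is just a preference, not a requirement):

```js
// Convert the extracted HTML to portable Markdown
import TurndownService from 'turndown';

const turndown = new TurndownService({ codeBlockStyle: 'fenced' });
const markdown = turndown.turndown(article.content);
```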
Why this matters: Your RAG system's retrieval quality is capped by your extraction quality. Garbage in, garbage out — no matter how good your embeddings are.
Stage 2: Deduplication with SimHash
You'll find the same article on dev.to, Medium, personal blogs, and three aggregator sites. You don't want four copies.
SimHash is perfect for this: it converts a document into a 64-bit fingerprint where similar documents have similar fingerprints (small Hamming distance).
Two-level dedup strategy:
```
├── Local:  compare the new article against the local SQLite index
│           Hamming distance < 10 → skip
│
└── Remote: query the server's existing simhash index
            via /api/articles/ingest/simhashes (with auth header)
            conflict → mark as IGNORED, don't even POST
```
The 64-bit approach means you can store millions of fingerprints in a SQLite index and compare in milliseconds. Much faster than full embedding similarity for dedup.
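The comparison itself is only a few lines. A sketch, assuming the fingerprints are stored as 64-bit hex strings and using the same < 10 threshold as above:

```js
// Hamming distance between two 64-bit SimHash fingerprints (hex strings)
function hammingDistance(hexA, hexB) {
  let diff = BigInt('0x' + hexA) ^ BigInt('0x' + hexB);
  let bits = 0;
  while (diff) {
    bits += Number(diff & 1n); // count set bits one at a time
    diff >>= 1n;
  }
  return bits;
}

const isDuplicate = hammingDistance(newHash, existingHash) < 10;
```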
Stage 3: AI Quality Scoring
Not all developer content is worth storing. I built a scoring pipeline that evaluates each article on:
- Technical depth — Does it teach something real?
- Originality — New insights vs. rehashed documentation
- Code quality — Are examples practical and correct?
- Structure — Clear explanations with logical flow
Articles that score below the threshold (I use ≥80 to pass) get filtered out. The AI scorer runs against a ~1000-token excerpt — enough to judge quality without burning through API budgets.
Cost control tip: Run the scorer on the excerpt, not the full article. You're judging quality, not summarizing content. One call per article, ~$0.002 with GPT-4o-mini.
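For reference, the scoring call can be as small as this sketch. The prompt wording and JSON shape here are illustrative, not my exact prompt:

```js
// Sketch of the quality-scoring call against an excerpt
import OpenAI from 'openai';

const openai = new OpenAI();

async function scoreArticle(excerpt) {
  const res = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    response_format: { type: 'json_object' },
    messages: [
      {
        role: 'system',
        content:
          'Score this article 0-100 on technical depth, originality, code quality, and structure. Reply as JSON: {"score": <number>}',
      },
      { role: 'user', content: excerpt.slice(0, 4000) }, // ~1000 tokens
    ],
  });
  return JSON.parse(res.choices[0].message.content).score;
}
```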
Stage 4: Semantic Routing with Taxonomy
Once an article passes the quality gate, where does it go? Instead of keyword matching, I use AI-powered semantic routing against a taxonomy:
```json
{
  "categories": [
    { "slug": "ai-engineering", "keywords": ["LLM", "RAG", "agents"] },
    { "slug": "architecture", "keywords": ["microservices", "patterns", "tradeoffs"] },
    { "slug": "dev-ops", "keywords": ["CI/CD", "monitoring", "infrastructure"] }
  ]
}
```
The router evaluates an excerpt against the taxonomy and assigns the best-fit category. Invalid slugs trigger warnings — taxonomy drift is a real problem as your corpus grows.
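Validating what the router returns is cheap insurance. A sketch, assuming the taxonomy JSON above is loaded as `taxonomy`:

```js
// Reject slugs that aren't in the taxonomy instead of silently accepting them
const validSlugs = new Set(taxonomy.categories.map((c) => c.slug));

function resolveCategory(slug) {
  if (!validSlugs.has(slug)) {
    console.warn(`Router returned unknown slug "${slug}" — taxonomy drift?`);
    return null;
  }
  return slug;
}
```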
Stage 5: Media Pipeline (Cloudflare R2)
Raw articles contain image URLs pointing to their original hosts. If those sites go down, your knowledge base is full of broken images.
Solution: Download all images → upload to Cloudflare R2 → rewrite URLs in the Markdown.


This gives you full control over media lifecycle. Bonus: R2 has zero egress fees, so serving images is essentially free.
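A sketch of the mirror step using R2's S3-compatible API. The env var names, bucket, and public domain below are placeholders, not my actual config:

```js
// Mirror one image to R2 and return the URL to write back into the Markdown
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3';

const r2 = new S3Client({
  region: 'auto',
  endpoint: `https://${process.env.R2_ACCOUNT_ID}.r2.cloudflarestorage.com`,
  credentials: {
    accessKeyId: process.env.R2_ACCESS_KEY_ID,
    secretAccessKey: process.env.R2_SECRET_ACCESS_KEY,
  },
});

async function mirrorImage(srcUrl, key) {
  const res = await fetch(srcUrl);
  if (!res.ok) throw new Error(`Image fetch failed: ${res.status}`);
  await r2.send(
    new PutObjectCommand({
      Bucket: 'knowledge-base-media',
      Key: key,
      Body: Buffer.from(await res.arrayBuffer()),
      ContentType: res.headers.get('content-type') ?? 'application/octet-stream',
    })
  );
  return `https://media.example.com/${key}`; // rewritten URL in the Markdown
}
```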
Gotcha: Make sure your R2 bucket allows public reads, or your images will 403 in production. Set the bucket policy or use a custom domain with a CDN in front.
The Local State Layer
Between local processing and remote ingestion, I maintain a SQLite database that tracks:
- CrawlTasks — What URLs are queued, in progress, or completed
- RawMaterials — The raw HTML/JSON before any processing
- SyncLogs — What was pushed to the remote knowledge base and when
This enables resume-after-failure. If the process crashes at article 847 of 1000, it picks up at 848. No reprocessing, no wasted API calls.
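A sketch of the resume logic with better-sqlite3, assuming a `crawl_tasks` table with a `status` column (table and column names are illustrative):

```js
// Resume after failure: only pick up tasks that never completed
import Database from 'better-sqlite3';

const db = new Database('pipeline.db');

const pending = db
  .prepare("SELECT url FROM crawl_tasks WHERE status IN ('queued', 'in_progress') ORDER BY id")
  .all();

for (const { url } of pending) {
  // ...run the article through the pipeline...
  db.prepare("UPDATE crawl_tasks SET status = 'completed' WHERE url = ?").run(url);
}
```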
The Numbers
- Processing cost: ~$0.003 per article (scoring + routing)
- Dedup savings: 30-40% of crawled content is duplicate
- Quality filter: ~60% of non-duplicate articles pass the threshold
- End-to-end latency: 5-15 seconds per article (depends on image download)
- Storage: ~50KB per article in Markdown (vs ~500KB raw HTML)
What I'd Do Differently
- Add Sharp compression before R2 upload — some source images are 5MB+ and don't need to be
- Internal linking — auto-link related articles by title matching after ingestion
- Favicon extraction — store source site favicons for attribution
- Parallel processing — the current pipeline is sequential; async worker pools could 3x throughput
Takeaway
A production knowledge base isn't about the RAG part. It's about the pipeline feeding it. Extraction quality, dedup, scoring, and media management — these are the unsexy problems that make or break your system.
The embeddings and retrieval strategies are interchangeable. The pipeline is your moat.
Building in public: this is the architecture behind an AI-powered developer knowledge base. Follow the journey for more real-world engineering breakdowns.