Petteri Pucilowski

Posted on May 29

How I Built a Free Backlink Intelligence Tool on Common Crawl + DuckDB

#webdev #seo #duckdb #opensource

The problem

Backlink data is a $1.5B/year SaaS category. Ahrefs is $129/month, SEMrush is $140/month, Moz is $99/month. As an indie I needed competitor backlinks for outreach — the prospecting half of what these tools do — but I wasn't going to pay $1,548/year just for a quarterly list of domains.

Turns out the data is already public. Common Crawl publishes a hyperlink graph every ~3 months containing every public link they discover. The latest release I pulled has 4.4 billion edges across 120 million domains — comparable to the size of Ahrefs' index, just refreshed quarterly instead of continuously.

This is a walkthrough of the actual stack I used to turn that public dataset into a queryable backlink lookup. Total infra cost: about $40/month for a small VPS.

The data: Common Crawl's webgraph

Common Crawl publishes the webgraph as Parquet on S3:

s3://commoncrawl/cc-webgraph/
  cc-main-2026-jan-feb-mar/
    vertices/  # domain registry, with reverse-DNS string keys
    edges/     # (src_id, dst_id) tuples
    ranks/     # PageRank-equivalent + harmonic centrality per vertex

The full edges table is ~120GB compressed Parquet across a few hundred files. The vertices table is ~3GB. Both are accessible without authentication — Common Crawl publishes everything as open data.

The query engine: DuckDB over Parquet

The trick that makes the whole thing economical: DuckDB can query Parquet files directly from S3 via httpfs without downloading them. For a backlink lookup ("show me every domain linking to stripe.com"), you don't need the full graph in memory — you need columnar scans for the specific domain ID.

-- Install the httpfs extension once
INSTALL httpfs;
LOAD httpfs;

-- Resolve "stripe.com" to its vertex ID
SELECT vertex_id
FROM read_parquet('https://data.commoncrawl.org/cc-webgraph/.../vertices/*.parquet')
WHERE rev_domain = 'com.stripe';
-- (Common Crawl stores domains reverse-DNS: stripe.com → com.stripe)

-- Pull every incoming edge
SELECT src.rev_domain, ranks.cg_authority
FROM read_parquet('https://.../edges/*.parquet') edges
JOIN read_parquet('https://.../vertices/*.parquet') src ON edges.src_id = src.vertex_id
JOIN read_parquet('https://.../ranks/*.parquet') ranks ON edges.src_id = ranks.vertex_id
WHERE edges.dst_id = <stripe's vertex_id>
ORDER BY ranks.cg_authority DESC
LIMIT 1000;

DuckDB plans the scan, fetches only the relevant Parquet row groups via HTTP range requests, and returns in 2-30 seconds depending on how popular the target domain is.

The cache layer: SQLite

DuckDB-on-S3 is fast enough for ad-hoc queries but you don't want every user hitting Common Crawl's S3 bucket. I added a SQLite cache keyed on (domain, release_id):

def get_backlinks(domain: str) -> list[Backlink]:
    cache_key = (domain, current_release_id())
    cached = sqlite_db.fetch_one("SELECT result FROM cache WHERE key = ?", cache_key)
    if cached and not is_expired(cached):
        return json.loads(cached["result"])

    result = duckdb_query_s3(domain)
    sqlite_db.execute("INSERT INTO cache (key, result, expires_at) VALUES (?, ?, ?)",
                      cache_key, json.dumps(result), now() + timedelta(days=30))
    return result

About 90% of traffic hits the cache after the first month. The DuckDB-on-S3 path stays as the cold-start handler.

The frontend: Next.js with SSR

The user-facing tool is a Next.js 14 app with App Router. The domain pages (/backlinks/<domain>) are server-rendered so they're indexable by Google — which matters because the long-tail SEO traffic ("backlinks for [specific competitor]") is a real channel.

Per-domain pages include:

The referring-domain list with their authority scores
TLD breakdown (visual)
Anchor text distribution (when available from the WARC, separate pipeline)
CSV/JSON export

The frontend talks to a thin FastAPI backend that owns the SQLite cache and the DuckDB connection pool.

Honest limitations (what this doesn't replace)

Common Crawl is a quarterly snapshot. The current release I'm querying is from January-March 2026, so:

No live signal monitoring. When a new high-authority site links you on Thursday, Ahrefs surfaces it by Friday. I won't see it until the next Common Crawl release ingests (~3 months later).
No anchor-text-velocity tracking. Same reason.
No spam-filtering layer. Common Crawl publishes the graph as it found it. Ahrefs runs continuous re-validation and de-duplication; I don't replicate that.

So this isn't a replacement for Ahrefs if your job is live rank-impact attribution or campaign monitoring. It is a replacement for the outreach-prospecting use case — give me a 50-domain list to pitch today — which is what indies actually need most of the time.

The numbers

Common Crawl webgraph: free (open data)
DuckDB: free
SQLite: free
VPS hosting (small Hetzner box): ~$40/month
Next.js on Vercel: included in hobby tier for now
Domain + email: ~$15/month

Total: under $60/month to run a tool that does ~1/30th of what Ahrefs does, for free or $99 lifetime per user. The unit economics are sharp enough that I could make this a public project — which I did.

You can try it at crawlgraph.com (5 queries free, no signup). Source data and methodology are documented; if you want to build your own version on the same dataset, all the SQL above is roughly the right shape. Common Crawl's getting started guide is the place to begin.

Happy to answer questions about the architecture in the comments.

Top comments (1)

Harjot Singh • May 31

This is a great "use the free public dataset to undercut the $99/mo SaaS" build - Ahrefs/Semrush charge for backlink data partly because crawling the web is expensive, and Common Crawl quietly removes that moat for anyone willing to do the data engineering. DuckDB is the perfect pick here too: querying CC's Parquet/WARC at rest without standing up a cluster is exactly its sweet spot, and it keeps the whole thing runnable on a laptop instead of a data warehouse bill. Smart stack.

The honest caveat worth naming for readers: Common Crawl is a sample, not the whole web, and it's periodic, so coverage and freshness are the tradeoff against the paid tools - which is totally fine as long as users know they're getting "directional backlink intelligence" not "every link as of today." That framing (be clear about what the data can and can't support) is the same discipline I lean on in Moonshift, the thing I build - a multi-agent pipeline that takes a prompt to a deployed SaaS, where outputs are grounded in what's actually verifiable rather than asserted. Multi-model routing keeps a build ~$3 flat, first run's free no card. Genuinely useful tool. How are you handling CC's coverage gaps - just transparent about the sample, or are you supplementing with another source for freshness? That's the one thing that decides whether people trust it over the paid incumbents.