The problem
Backlink data is a $1.5B/year SaaS category. Ahrefs is $129/month, SEMrush is $140/month, Moz is $99/month. As an indie I needed competitor backlinks for outreach — the prospecting half of what these tools do — but I wasn't going to pay $1,548/year just for a quarterly list of domains.
Turns out the data is already public. Common Crawl publishes a hyperlink graph every ~3 months containing every public link they discover. The latest release I pulled has 4.4 billion edges across 120 million domains — comparable to the size of Ahrefs' index, just refreshed quarterly instead of continuously.
This is a walkthrough of the actual stack I used to turn that public dataset into a queryable backlink lookup. Total infra cost: about $40/month for a small VPS.
The data: Common Crawl's webgraph
Common Crawl publishes the webgraph as Parquet on S3:
s3://commoncrawl/cc-webgraph/
cc-main-2026-jan-feb-mar/
vertices/ # domain registry, with reverse-DNS string keys
edges/ # (src_id, dst_id) tuples
ranks/ # PageRank-equivalent + harmonic centrality per vertex
The full edges table is ~120GB compressed Parquet across a few hundred files. The vertices table is ~3GB. Both are accessible without authentication — Common Crawl publishes everything as open data.
The query engine: DuckDB over Parquet
The trick that makes the whole thing economical: DuckDB can query Parquet files directly from S3 via httpfs without downloading them. For a backlink lookup ("show me every domain linking to stripe.com"), you don't need the full graph in memory — you need columnar scans for the specific domain ID.
-- Install the httpfs extension once
INSTALL httpfs;
LOAD httpfs;
-- Resolve "stripe.com" to its vertex ID
SELECT vertex_id
FROM read_parquet('https://data.commoncrawl.org/cc-webgraph/.../vertices/*.parquet')
WHERE rev_domain = 'com.stripe';
-- (Common Crawl stores domains reverse-DNS: stripe.com → com.stripe)
-- Pull every incoming edge
SELECT src.rev_domain, ranks.cg_authority
FROM read_parquet('https://.../edges/*.parquet') edges
JOIN read_parquet('https://.../vertices/*.parquet') src ON edges.src_id = src.vertex_id
JOIN read_parquet('https://.../ranks/*.parquet') ranks ON edges.src_id = ranks.vertex_id
WHERE edges.dst_id = <stripe's vertex_id>
ORDER BY ranks.cg_authority DESC
LIMIT 1000;
DuckDB plans the scan, fetches only the relevant Parquet row groups via HTTP range requests, and returns in 2-30 seconds depending on how popular the target domain is.
The cache layer: SQLite
DuckDB-on-S3 is fast enough for ad-hoc queries but you don't want every user hitting Common Crawl's S3 bucket. I added a SQLite cache keyed on (domain, release_id):
def get_backlinks(domain: str) -> list[Backlink]:
cache_key = (domain, current_release_id())
cached = sqlite_db.fetch_one("SELECT result FROM cache WHERE key = ?", cache_key)
if cached and not is_expired(cached):
return json.loads(cached["result"])
result = duckdb_query_s3(domain)
sqlite_db.execute("INSERT INTO cache (key, result, expires_at) VALUES (?, ?, ?)",
cache_key, json.dumps(result), now() + timedelta(days=30))
return result
About 90% of traffic hits the cache after the first month. The DuckDB-on-S3 path stays as the cold-start handler.
The frontend: Next.js with SSR
The user-facing tool is a Next.js 14 app with App Router. The domain pages (/backlinks/<domain>) are server-rendered so they're indexable by Google — which matters because the long-tail SEO traffic ("backlinks for [specific competitor]") is a real channel.
Per-domain pages include:
- The referring-domain list with their authority scores
- TLD breakdown (visual)
- Anchor text distribution (when available from the WARC, separate pipeline)
- CSV/JSON export
The frontend talks to a thin FastAPI backend that owns the SQLite cache and the DuckDB connection pool.
Honest limitations (what this doesn't replace)
Common Crawl is a quarterly snapshot. The current release I'm querying is from January-March 2026, so:
- No live signal monitoring. When a new high-authority site links you on Thursday, Ahrefs surfaces it by Friday. I won't see it until the next Common Crawl release ingests (~3 months later).
- No anchor-text-velocity tracking. Same reason.
- No spam-filtering layer. Common Crawl publishes the graph as it found it. Ahrefs runs continuous re-validation and de-duplication; I don't replicate that.
So this isn't a replacement for Ahrefs if your job is live rank-impact attribution or campaign monitoring. It is a replacement for the outreach-prospecting use case — give me a 50-domain list to pitch today — which is what indies actually need most of the time.
The numbers
- Common Crawl webgraph: free (open data)
- DuckDB: free
- SQLite: free
- VPS hosting (small Hetzner box): ~$40/month
- Next.js on Vercel: included in hobby tier for now
- Domain + email: ~$15/month
Total: under $60/month to run a tool that does ~1/30th of what Ahrefs does, for free or $99 lifetime per user. The unit economics are sharp enough that I could make this a public project — which I did.
You can try it at crawlgraph.com (5 queries free, no signup). Source data and methodology are documented; if you want to build your own version on the same dataset, all the SQL above is roughly the right shape. Common Crawl's getting started guide is the place to begin.
Happy to answer questions about the architecture in the comments.
Top comments (0)