DEV Community: Petteri Pucilowski

8 domains link to every major fintech app (Common Crawl analysis)

Petteri Pucilowski — Mon, 06 Jul 2026 08:39:36 +0000

Fourth niche, same experiment. Take a category's biggest players, pull every domain that links to them from Common Crawl's open hyperlink graph, keep the ones that link to the whole set. SEO tools gave 37 universal linkers (media). CRMs gave 2 (integration hubs). AI agent frameworks gave 5 (dev platforms). Fintech just gave 8 - and they are the most consistent set yet.

The setup

Eight fintech apps across payments, banking, and spend: Stripe, Wise, Revolut, Plaid, Brex, Ramp, Mercury, Chime. For each, pull referring domains ranked by authority from the Common Crawl webgraph (120M domains, 4.4B edges), then intersect the eight lists. Stripe is its own planet - 100,000+ referring domains, more than the other seven combined.

The 8 that link to all 8

substack.com, entrepreneur.com, libsyn.com (podcast host), beehiiv.com, pymnts.com, thefinancialbrand.com, spocket.co, coda.io.

Two newsletter platforms, a podcast host, a business magazine, and two dedicated fintech trade publications; the remaining two, spocket.co and coda.io, are product sites. Widen to the 7-of-8 club and the pattern gets sharp: github, crunchbase, ycombinator, bvp (Bessemer), contrary, builtin, bankrate, bankingdive, finovate, 11fs. Finance trade press plus startup/VC infrastructure. Almost nothing else.
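The intersection step is small enough to sketch. Assuming each app's referring domains have already been pulled into a set, counting how many targets each domain links to gives both the universal linkers and the 7-of-8 club. The toy data and the name `linkers_to_at_least` are illustrative, not taken from the study's code.

```python
from collections import Counter

def linkers_to_at_least(referrers: dict[str, set[str]], k: int) -> list[str]:
    """Return domains that link to at least k of the targets, sorted by name."""
    counts = Counter()
    for domains in referrers.values():
        counts.update(domains)  # a set contributes at most 1 per domain
    return sorted(d for d, n in counts.items() if n >= k)

# Toy data (hypothetical): three targets instead of eight.
referrers = {
    "stripe.com": {"substack.com", "github.com", "pymnts.com"},
    "wise.com": {"substack.com", "pymnts.com"},
    "brex.com": {"substack.com", "github.com"},
}

universal = linkers_to_at_least(referrers, k=len(referrers))
near_universal = linkers_to_at_least(referrers, k=len(referrers) - 1)
print(universal)
print(near_universal)
```

With eight real lists, `k=8` gives the universal set and `k=7` the wider club.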

Four niches, four shapes

Category	Link to all 8	Who they are
SEO tools	37	trade media + roundups
CRMs	2	integration hubs
AI agent frameworks	5	dev platforms + directories
Fintech apps	8	finance press + VC databases

The takeaway for anyone building in fintech: you do not earn links with a payments integration. You earn them by being covered (get into PYMNTS, The Financial Brand, Banking Dive, Tearsheet) and by being tracked as a company (complete Crunchbase / Sacra / BuiltIn profiles, land on VC portfolio pages). Mercury has 1,689 referring domains vs Stripe's 100,000+, yet both are reached by the same ten publications and three databases. You don't out-link Stripe; you get on the same shelf.

Full study with the overlap pyramid and downloadable CSV/JSON: 8 domains link to every major fintech app. Method is category-agnostic - it's a backlink gap analysis on Common Crawl data, free.

Only 5 domains link to every major AI agent framework (Common Crawl analysis)

Petteri Pucilowski — Fri, 03 Jul 2026 15:50:16 +0000

Same experiment, third niche. Take a category's biggest players, pull every domain that links to them from Common Crawl's open hyperlink graph, and keep the ones that link to the whole set. SEO tools gave 37 universal linkers (nearly all media). CRMs gave 2 (both integration hubs). AI agent frameworks just gave 5 - and they are a different animal again.

The setup

Eight frameworks: LangChain, LlamaIndex, CrewAI, Dify, Flowise, Langflow, AutoGPT, SuperAGI. For each, we pulled referring domains ranked by authority from the Common Crawl webgraph (120M domains, 4.4B links), then intersected the eight lists.

The numbers

7,508 unique domains link to at least one framework
77% link to exactly one (usually LangChain: 5,949 referring domains vs ~450 for Langflow and AutoGPT)
5 link to all eight: github.com, substack.com, analyticsvidhya.com, and two product sites
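All three figures come off one histogram: how many domains link to exactly 1, 2, ... 8 of the frameworks. A minimal sketch with made-up data (the function name `overlap_pyramid` and the toy sets are assumptions, not the study's code):

```python
from collections import Counter

def overlap_pyramid(referrers: dict[str, set[str]]) -> dict[int, int]:
    """Map 'number of targets linked' to 'number of domains with that count'."""
    per_domain = Counter()
    for domains in referrers.values():
        per_domain.update(domains)
    return dict(sorted(Counter(per_domain.values()).items()))

# Toy data (hypothetical): three targets, five referring domains.
referrers = {
    "framework-a.example": {"github.com", "a.example", "b.example", "c.example"},
    "framework-b.example": {"github.com", "d.example"},
    "framework-c.example": {"github.com", "a.example"},
}

pyramid = overlap_pyramid(referrers)
total = sum(pyramid.values())        # unique domains linking to at least one target
share_single = pyramid[1] / total    # fraction linking to exactly one
print(pyramid, total, f"{share_single:.0%}")
```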

None of the universal linkers are tech press.

What the 7-and-6 club looks like

dev.to (yes, this site), qiita.com, zenn.dev, n8n.io, arize.com, zenml.io, langfuse.com... and then a whole crop of AI agent directories that did not exist two years ago: aiagentsdirectory.com, agenthunter.io, everydev.ai, findmyagentai.com, bestaiagents.ai. Most have single-digit-to-low authority scores. The category's shelf is still being built.

The takeaway for anyone shipping agent tooling

The niche's link graph is developer publishing platforms plus young directories, not journalists. So the play is: publish where developers already read (GitHub, Substack, dev.to, Qiita), and get listed in the agent directories while they are young and a listing takes only an email. A directory with authority 20 is not impressive today. The one that becomes the Geekflare of AI agents in three years will be, and you will have got in early.

Full study with methodology, the overlap pyramid, and the downloadable CSV/JSON dataset: only 5 domains link to every major AI agent framework. The DuckDB query behind it is a public gist.

We shipped a free backlink API tier (15 calls/month on Common Crawl data)

Petteri Pucilowski — Fri, 03 Jul 2026 15:49:40 +0000

Until this week, the CrawlGraph API required the $99 lifetime license. Now there is a free tier: 15 backlink calls a month on Common Crawl's open hyperlink graph (120M domains, 4.4B links, refreshed quarterly). No card, no signup form - the key lands in your inbox.

Two curl commands

# 1. get a key (delivered by email)
curl -X POST https://crawlgraph.com/api/v1/free-key \
  -H "Content-Type: application/json" \
  -d '{"email": "you@company.com"}'

# 2. look up any domain's backlinks
curl -X POST https://crawlgraph.com/api/v1/backlinks \
  -H "Authorization: Bearer cg_live_..." \
  -H "Content-Type: application/json" \
  -d '{"domain": "example.com", "limit": 10}'

Responses carry X-RateLimit-* headers so you always know where you stand. One key per email - delivery by mail is the ownership check, which is why there is no signup flow to speak of.
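In Python the second call looks roughly like this, using only the standard library. The post says only that the headers match `X-RateLimit-*`, so the sketch filters by that prefix instead of assuming specific header names. The helper names are illustrative and the key is a placeholder.

```python
import json
import urllib.request

API_URL = "https://crawlgraph.com/api/v1/backlinks"  # endpoint from the curl example

def build_request(api_key: str, domain: str, limit: int = 10) -> urllib.request.Request:
    """Build the same POST the curl command sends."""
    body = json.dumps({"domain": domain, "limit": limit}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )

def rate_limit_headers(headers) -> dict[str, str]:
    """Keep only X-RateLimit-* headers (header names are case-insensitive)."""
    return {k: v for k, v in headers.items() if k.lower().startswith("x-ratelimit-")}

def fetch_backlinks(api_key: str, domain: str, limit: int = 10):
    """Send the request; returns (parsed JSON body, rate-limit headers)."""
    with urllib.request.urlopen(build_request(api_key, domain, limit)) as resp:
        return json.load(resp), rate_limit_headers(resp.headers)
```

Calling `fetch_backlinks("cg_live_...", "example.com")` goes over the network and spends one call from the monthly quota, so check the returned headers before looping over many domains.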

It works inside your AI agent

The same key powers the open-source crawlgraph-mcp server in Claude, Cursor, Cline, Zed, or Windsurf:

{
  "mcpServers": {
    "crawlgraph": {
      "command": "npx",
      "args": ["-y", "crawlgraph-mcp"],
      "env": { "CRAWLGRAPH_API_KEY": "cg_live_..." }
    }
  }
}

Your agent can pull the backlink profile of any domain mid-conversation. The gap_analysis and gap_outreach_targets tools (the ones that turn "research my niche" into a ranked outreach list) stay part of the $99 lifetime tier, which also raises the quota to 1,000 calls + 50 gap analyses a month.

Why free

The underlying data is open - Common Crawl publishes the hyperlink graph quarterly. Gating basic lookups behind a purchase never felt right. Try the data, build something; if you need the gap tools or more volume, the upgrade is a single payment, not a subscription.

Docs: crawlgraph.com/docs/api

Find your competitor's backlinks from inside Claude Code (free, via MCP)

Petteri Pucilowski — Mon, 01 Jun 2026 19:06:11 +0000

Backlink prospecting is a tab-hopping chore: dashboard → export CSV → eyeball → copy domains → switch to email. It's a filter-and-rank problem, which is exactly the kind of thing an agent should do for you. So here's how to do the whole thing from inside Claude Code (or Cursor / Cline / Zed / Windsurf) with one prompt, using a free MCP server.

No affiliation required to follow along — the data is the public Common Crawl webgraph, and the MCP wrapper is open source.

What we're building toward

By the end you'll be able to type this into your agent and get back a ranked, de-noised list:

"Find the sites linking to competitor-a.com, competitor-b.com and competitor-c.com but not to my-site.com, and draft a short outreach email to each."

The agent runs a competitor gap analysis, filters to the highest-value targets, and writes the emails — without you opening a single SEO dashboard.

Step 1: install the MCP server

The server is crawlgraph-mcp on npm. It's a thin TypeScript stdio wrapper over a backlink API; nothing to clone or build.

Add it to your MCP config (claude_desktop_config.json, or .mcp.json for Claude Code):

{
  "mcpServers": {
    "crawlgraph": {
      "command": "npx",
      "args": ["-y", "crawlgraph-mcp"],
      "env": { "CRAWLGRAPH_API_KEY": "cg_live_..." }
    }
  }
}

Grab the cg_live_ key from your account page, restart the client, and you'll see four tools appear: backlinks, gap_analysis, gap_outreach_targets, and releases.

Step 2: sanity-check the data

Before trusting it for outreach, point it at a domain you know. In your agent:

"Use the backlinks tool on stripe.com, limit 5, sorted by authority."

You'll get the top referring domains plus the target's own authority score:

{
  "domain": "stripe.com",
  "total_linking_domains": 100000,
  "cg_authority": 78,
  "results": [
    { "linking_domain": "github.com", "num_hosts": 2120, "cg_authority": 96 },
    { "linking_domain": "news.ycombinator.com", "num_hosts": 880, "cg_authority": 94 }
  ]
}

That total_linking_domains count is the quick gut-check: compare it to whatever your current tool reports for the same domain. If it's in the same ballpark, the graph is complete enough to prospect from.

Step 3: the actual play — gap analysis

The primitive is gap_analysis: give it your domain plus 2-5 competitors, and it returns every domain linking to at least one competitor but not to you, each tagged with which competitors it links to (found_on).
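In set terms the primitive is "union of competitor referrers, minus your own". A sketch over toy data, assuming the referring domains are already in hand as sets. This illustrates the described behaviour and is not the API's implementation; only the `found_on` and `linking_domain` field names come from the post.

```python
def gap_analysis(mine: set[str], competitors: dict[str, set[str]]) -> list[dict]:
    """Domains linking to at least one competitor but not to you, tagged with found_on."""
    found_on: dict[str, list[str]] = {}
    for competitor, referrers in competitors.items():
        for domain in referrers - mine:  # skip anything that already links to you
            found_on.setdefault(domain, []).append(competitor)
    return [
        {"linking_domain": domain, "found_on": sorted(names)}
        for domain, names in sorted(found_on.items())
    ]

# Toy data (hypothetical domains).
mine = {"blog.example"}
competitors = {
    "competitor-a.com": {"blog.example", "roundup.example", "review.example"},
    "competitor-b.com": {"roundup.example"},
}
gaps = gap_analysis(mine, competitors)
print(gaps)
```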

The raw output is the material, not the answer — you don't want a 1,000-row dump, you want the few dozen worth emailing. That's what gap_outreach_targets does on top:

keeps only domains in the gap that link to all your competitors (a site linking to all three covers your whole niche and just hasn't heard of you — the warmest possible target);
strips platform/CDN noise (amazonaws, github.io, facebook, shorteners);
ranks the survivors by authority.
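Those three steps can be sketched as one pass over the gap rows. This is an illustration, not the server's code: the denylist is a tiny stand-in, rows are assumed to already carry a `cg_authority` score, and treating "all but one competitor" as the secondary bucket is my assumption based on the sample output in this post.

```python
NOISE = {"amazonaws.com", "github.io", "facebook.com"}  # stand-in denylist, not the real one

def outreach_targets(gaps: list[dict], competitors: list[str]) -> dict:
    """Drop platform noise, split rows by overlap, rank each bucket by authority."""
    wanted = set(competitors)
    kept = [g for g in gaps if g["linking_domain"] not in NOISE]

    def overlap(row: dict) -> int:
        return len(wanted & set(row["found_on"]))

    def score(row: dict) -> int:
        value = row.get("cg_authority")
        return -1 if value is None else value  # unscored rows sort last

    def rank(rows: list[dict]) -> list[dict]:
        return sorted(rows, key=score, reverse=True)

    return {
        "priority_targets": rank([g for g in kept if overlap(g) == len(wanted)]),
        "secondary_targets": rank([g for g in kept if overlap(g) == len(wanted) - 1]),
        "platforms_filtered": len(gaps) - len(kept),
    }

competitors = ["competitor-a.com", "competitor-b.com", "competitor-c.com"]
gaps = [  # toy rows (hypothetical)
    {"linking_domain": "industry-roundup.com", "found_on": competitors, "cg_authority": 61},
    {"linking_domain": "niche-review.io", "found_on": competitors, "cg_authority": 73},
    {"linking_domain": "facebook.com", "found_on": competitors, "cg_authority": 99},
    {"linking_domain": "small-blog.example", "found_on": competitors[:2], "cg_authority": 30},
    {"linking_domain": "one-off.example", "found_on": competitors[:1]},
]
result = outreach_targets(gaps, competitors)
print(result)
```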

So you skip straight to:

"Run gap_outreach_targets for my-site.com against competitor-a.com, competitor-b.com and competitor-c.com."

and get back something like:

{
  "priority_targets": [
    { "linking_domain": "industry-roundup.com", "found_on": ["competitor-a.com","competitor-b.com","competitor-c.com"], "cg_authority": 73 },
    { "linking_domain": "niche-review.io", "found_on": ["competitor-a.com","competitor-b.com","competitor-c.com"], "cg_authority": 61 }
  ],
  "secondary_targets": [ /* link to 2 of 3 */ ],
  "platforms_filtered": 26
}

Step 4: let the agent write the outreach

This is where doing it in an agent beats a dashboard. The list is already in context, so:

"For each priority target, draft a 3-sentence outreach email: reference the kind of content they link to in my niche, and pitch my-site.com as a fit. Keep it specific, no templates."

You get first-draft emails per target in the same session. You still edit them in your voice — cold outreach that goes out unedited gets ignored — but the gather → filter → rank → draft chain that used to be an afternoon is now one conversation.

Why 2-3 competitors, not one

A site linking to one competitor might be a fluke or a paid placement. A site linking to three of your competitors is a publisher who covers your category. That overlap is the qualifier — it's the difference between a prospect list and a noise list.

Honest limitations

Quarterly snapshot. Common Crawl publishes ~4×/year, so this is for one-off prospecting, not live link monitoring. For "what changed this week," use a continuous-crawl tool.
No anchor text in the gap output (the webgraph is source→destination edges).
Quotas: the free path covers light use; heavier use needs the lifetime tier (1,000 backlink calls + 50 gap jobs/month).

Why this pattern matters beyond backlinks

The interesting bit isn't the SEO — it's that a well-scoped MCP tool can encode a workflow, not just an API call. gap_outreach_targets doesn't mirror an endpoint; it does the filtering-and-ranking judgment you'd otherwise re-explain to the model every time. If you're building MCP servers, that's the lever: ship the composite tool that returns a decision-ready answer, not just the raw rows.

Server's MIT and on GitHub (npx -y crawlgraph-mcp). If you try the gap play, I'd genuinely like to hear whether the priority/secondary split matches what you'd have picked by hand.

I wrapped a backlink API in an MCP server so I could do SEO gap analysis from inside Claude

Petteri Pucilowski — Sun, 31 May 2026 21:44:12 +0000

I do a fair amount of competitor backlink research, and the workflow always annoyed me: open a dashboard, run a query, export a CSV, eyeball it, copy domains into a doc, switch to email. Lots of tab-hopping for what is fundamentally a data-filtering problem an agent should handle.

So I wrapped the backlink API I'd been using into an MCP server. Now I stay in Claude Code (or Cursor, Cline, Zed, Windsurf) and just describe the goal. This is the build: the architecture, the four tools, and the one design decision I'm still not sure about.

The data source

The server runs on the Common Crawl hyperlink webgraph — about 4.4 billion edges across 120 million domains, published quarterly as Parquet. That matters for an MCP tool specifically: the data is open, so there's no scraped-proprietary-index liability in handing it to an agent, and the same query is reproducible by anyone.

The HTTP API in front of it (CrawlGraph) does the heavy DuckDB work; the MCP server is a thin TypeScript stdio client over it. Keeping the server thin was deliberate — all the query cost, caching, and quota logic lives server-side, so the MCP package stays a ~300-line wrapper that's easy to audit before you hand it your API key.

The four tools

backlinks            → referring domains for a target, with authority scores
gap_analysis         → domains linking to your competitors but not to you
gap_outreach_targets → the composite play (below)
releases             → list the Common Crawl snapshots

backlinks and gap_analysis map 1:1 to API endpoints. gap_analysis is the interesting primitive: submit your domain plus 2-5 competitors, and it returns every domain that links to at least one competitor but not to you, each tagged with a found_on array listing which competitors it links to.

The composite tool, and the decision I'm unsure about

Most API-wrapper MCP servers are pure 1:1 mappings. I added one opinionated composite tool, gap_outreach_targets, because the raw gap output isn't the thing you actually want — it's the raw material for the thing you want.

What it does on top of gap_analysis:

Filters to total overlap. Keep only domains whose found_on covers every competitor you listed. A site linking to one competitor might be a fluke or a paid placement. A site linking to all three is a publisher who covers your whole niche and has simply never heard of you. That overlap is the qualifier.
Strips platform noise. amazonaws.com, github.io, facebook.com, CDNs, URL shorteners — they show up in every backlink profile and are never outreach targets. There's a denylist with suffix matching so subdomains get caught too.
Ranks by authority. For the top N survivors it makes a cheap per-domain authority lookup and sorts, so the highest-value warm targets surface first. This is opt-out (enrich_top: 0) because each lookup costs one API call against quota.

// the core filter, roughly
const priority = gaps
  .filter(g => !isPlatformNoise(g.linking_domain))
  .filter(g => g.found_on.length === competitors.length)
  .sort((a, b) => (b.cg_authority ?? -1) - (a.cg_authority ?? -1));
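The `isPlatformNoise` helper is not shown in the post. One way suffix matching could work, sketched here in Python with a stand-in denylist (the real list and its matching rules may differ): a domain counts as noise if it equals a denylisted entry or ends with a dot followed by that entry. Matching on the dot boundary catches `s3.amazonaws.com` without also catching an unrelated domain such as `notfacebook.com`.

```python
DENYLIST = ("amazonaws.com", "github.io", "facebook.com", "bit.ly")  # stand-in entries

def is_platform_noise(domain: str) -> bool:
    """True if the domain is a denylisted platform or any subdomain of one."""
    d = domain.lower().rstrip(".")
    return any(d == entry or d.endswith("." + entry) for entry in DENYLIST)

for candidate in ("s3.amazonaws.com", "user.github.io", "notfacebook.com", "example.com"):
    print(candidate, is_platform_noise(candidate))
```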

The decision I keep going back and forth on: is a composite, opinionated tool the right call for an MCP server, or should it stay a pure API mirror and let the agent do the filtering/ranking in its own reasoning?

Arguments for the composite tool:

It encodes a workflow the model would otherwise have to reconstruct each time, costing tokens and inviting mistakes (I watched an agent forget to filter platforms more than once).
It returns a small, ranked, decision-ready list instead of a 1,000-row dump the model has to chew through.

Arguments against:

It's a leaky abstraction. The moment someone wants a slightly different filter (2-of-3 overlap, a different noise list), they're fighting my opinion instead of composing primitives.
It hides the platform denylist, which is a judgment call that should arguably be visible.

I landed on "ship both" — the raw gap_analysis primitive and the composite — but I'm genuinely unsure that's not just indecision dressed up as flexibility. If you've built MCP servers, I'd like to hear where you draw the primitive-vs-composite line.

Using it

{
  "mcpServers": {
    "crawlgraph": {
      "command": "npx",
      "args": ["-y", "crawlgraph-mcp"],
      "env": { "CRAWLGRAPH_API_KEY": "cg_live_..." }
    }
  }
}

Then the whole workflow collapses to one sentence:

"Use gap_outreach_targets for mydomain.com against competitor-a.com and competitor-b.com, then draft a short outreach email to each priority target."

The agent submits the gap job, polls it, filters and ranks, and writes the emails — all in one turn.

Honest limitations

Quarterly snapshot. Common Crawl publishes ~4x/year, so this is for one-off prospecting, not live link monitoring. If you need "what changed this week," it's the wrong tool.
No anchor text in the gap result. The webgraph is (src, dst) edges; anchor text needs a separate WARC pass I didn't wire into the MCP.
Authority enrichment costs calls. Each scored domain is one API call, hence the cap.

Code is MIT, on GitHub and npm (npx -y crawlgraph-mcp). Feedback on the composite-tool question especially welcome — it's the part of the design I'm least settled on.

How I Built a Free Backlink Intelligence Tool on Common Crawl + DuckDB

Petteri Pucilowski — Fri, 29 May 2026 09:32:31 +0000

The problem

Backlink data is a $1.5B/year SaaS category. Ahrefs is $129/month, SEMrush is $140/month, Moz is $99/month. As an indie I needed competitor backlinks for outreach — the prospecting half of what these tools do — but I wasn't going to pay $1,548/year just for a quarterly list of domains.

Turns out the data is already public. Common Crawl publishes a hyperlink graph every ~3 months containing every public link they discover. The latest release I pulled has 4.4 billion edges across 120 million domains — comparable to the size of Ahrefs' index, just refreshed quarterly instead of continuously.

This is a walkthrough of the actual stack I used to turn that public dataset into a queryable backlink lookup. Total infra cost: about $40/month for a small VPS.

The data: Common Crawl's webgraph

Common Crawl publishes the webgraph as Parquet on S3:

s3://commoncrawl/cc-webgraph/
  cc-main-2026-jan-feb-mar/
    vertices/  # domain registry, with reverse-DNS string keys
    edges/     # (src_id, dst_id) tuples
    ranks/     # PageRank-equivalent + harmonic centrality per vertex

The full edges table is ~120GB compressed Parquet across a few hundred files. The vertices table is ~3GB. Both are accessible without authentication — Common Crawl publishes everything as open data.

The query engine: DuckDB over Parquet

The trick that makes the whole thing economical: DuckDB can query Parquet files directly from S3 via httpfs without downloading them. For a backlink lookup ("show me every domain linking to stripe.com"), you don't need the full graph in memory — you need columnar scans for the specific domain ID.

-- Install the httpfs extension once
INSTALL httpfs;
LOAD httpfs;

-- Resolve "stripe.com" to its vertex ID
SELECT vertex_id
FROM read_parquet('https://data.commoncrawl.org/cc-webgraph/.../vertices/*.parquet')
WHERE rev_domain = 'com.stripe';
-- (Common Crawl stores domains reverse-DNS: stripe.com → com.stripe)

-- Pull every incoming edge
SELECT src.rev_domain, ranks.cg_authority
FROM read_parquet('https://.../edges/*.parquet') edges
JOIN read_parquet('https://.../vertices/*.parquet') src ON edges.src_id = src.vertex_id
JOIN read_parquet('https://.../ranks/*.parquet') ranks ON edges.src_id = ranks.vertex_id
WHERE edges.dst_id = <stripe's vertex_id>
ORDER BY ranks.cg_authority DESC
LIMIT 1000;

DuckDB plans the scan, fetches only the relevant Parquet row groups via HTTP range requests, and returns in 2-30 seconds depending on how popular the target domain is.
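The reverse-DNS key used in the first query is simple to derive: split the domain on dots and reverse the labels. A small sketch (the function names are mine, and lowercasing plus stripping a trailing dot are assumptions about normalization that the post does not specify):

```python
def to_rev_domain(domain: str) -> str:
    """stripe.com -> com.stripe"""
    labels = domain.strip().lower().rstrip(".").split(".")
    return ".".join(reversed(labels))

def from_rev_domain(rev: str) -> str:
    """com.stripe -> stripe.com (for displaying query results)."""
    return ".".join(reversed(rev.split(".")))

print(to_rev_domain("stripe.com"))
print(from_rev_domain("com.ycombinator.news"))
```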

The cache layer: SQLite

DuckDB-on-S3 is fast enough for ad-hoc queries but you don't want every user hitting Common Crawl's S3 bucket. I added a SQLite cache keyed on (domain, release_id):

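import json
import sqlite3
import time

# current_release_id() and duckdb_query_s3() are the app's own helpers; "cache.db" is a placeholder path.
db = sqlite3.connect("cache.db")
db.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, result TEXT, expires_at REAL)")

def get_backlinks(domain: str) -> list[dict]:
    # sqlite3 cannot bind a tuple to a single "?", so flatten (domain, release_id) into one string key
    cache_key = f"{domain}|{current_release_id()}"
    row = db.execute("SELECT result, expires_at FROM cache WHERE key = ?", (cache_key,)).fetchone()
    if row and row[1] > time.time():  # fresh cache hit
        return json.loads(row[0])

    result = duckdb_query_s3(domain)  # cold path: DuckDB over the remote Parquet
    db.execute("INSERT OR REPLACE INTO cache (key, result, expires_at) VALUES (?, ?, ?)",
               (cache_key, json.dumps(result), time.time() + 30 * 86400))  # 30-day TTL
    db.commit()
    return result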

About 90% of traffic hits the cache after the first month. The DuckDB-on-S3 path stays as the cold-start handler.

The frontend: Next.js with SSR

The user-facing tool is a Next.js 14 app with App Router. The domain pages (/backlinks/<domain>) are server-rendered so they're indexable by Google — which matters because the long-tail SEO traffic ("backlinks for [specific competitor]") is a real channel.

Per-domain pages include:

The referring-domain list with their authority scores
TLD breakdown (visual)
Anchor text distribution (when available from the WARC, separate pipeline)
CSV/JSON export
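The TLD breakdown is a plain frequency count over the referring-domain list. A sketch that takes the TLD to be the last dot-separated label; that is a simplifying assumption, since multi-part suffixes such as co.uk would need a public-suffix list, which this does not handle.

```python
from collections import Counter

def tld_breakdown(domains: list[str]) -> list[tuple[str, int]]:
    """Count referring domains per top-level label, most common first."""
    counts = Counter(d.lower().rstrip(".").rsplit(".", 1)[-1] for d in domains)
    return counts.most_common()

sample = ["github.com", "news.ycombinator.com", "dev.to", "zenn.dev", "substack.com"]
print(tld_breakdown(sample))
```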

The frontend talks to a thin FastAPI backend that owns the SQLite cache and the DuckDB connection pool.

Honest limitations (what this doesn't replace)

Common Crawl is a quarterly snapshot. The current release I'm querying is from January-March 2026, so:

No live signal monitoring. When a new high-authority site links to you on Thursday, Ahrefs surfaces it by Friday. I won't see it until the next Common Crawl release is ingested (~3 months later).
No anchor-text-velocity tracking. Same reason.
No spam-filtering layer. Common Crawl publishes the graph as it found it. Ahrefs runs continuous re-validation and de-duplication; I don't replicate that.

So this isn't a replacement for Ahrefs if your job is live rank-impact attribution or campaign monitoring. It is a replacement for the outreach-prospecting use case — give me a 50-domain list to pitch today — which is what indies actually need most of the time.

The numbers

Common Crawl webgraph: free (open data)
DuckDB: free
SQLite: free
VPS hosting (small Hetzner box): ~$40/month
Next.js on Vercel: included in hobby tier for now
Domain + email: ~$15/month

Total: under $60/month to run a tool that does ~1/30th of what Ahrefs does, for free or $99 lifetime per user. The unit economics are sharp enough that I could make this a public project — which I did.

You can try it at crawlgraph.com (5 queries free, no signup). Source data and methodology are documented; if you want to build your own version on the same dataset, all the SQL above is roughly the right shape. Common Crawl's getting started guide is the place to begin.

Happy to answer questions about the architecture in the comments.