I do a fair amount of competitor backlink research, and the workflow always annoyed me: open a dashboard, run a query, export a CSV, eyeball it, copy domains into a doc, switch to email. Lots of tab-hopping for what is fundamentally a data-filtering problem an agent should handle.
So I wrapped the backlink API I'd been using into an MCP server. Now I stay in Claude Code (or Cursor, Cline, Zed, Windsurf) and just describe the goal. This is the build: the architecture, the four tools, and the one design decision I'm still not sure about.
The data source
The server runs on the Common Crawl hyperlink webgraph — about 4.4 billion edges across 120 million domains, published quarterly as Parquet. That matters for an MCP tool specifically: the data is open, so there's no scraped-proprietary-index liability in handing it to an agent, and the same query is reproducible by anyone.
The HTTP API in front of it (CrawlGraph) does the heavy DuckDB work; the MCP server is a thin TypeScript stdio client over it. Keeping the server thin was deliberate — all the query cost, caching, and quota logic lives server-side, so the MCP package stays a ~300-line wrapper that's easy to audit before you hand it your API key.
The four tools
backlinks → referring domains for a target, with authority scores
gap_analysis → domains linking to your competitors but not to you
gap_outreach_targets → the composite play (below)
releases → list the Common Crawl snapshots
backlinks and gap_analysis map 1:1 to API endpoints. gap_analysis is the interesting primitive: submit your domain plus 2-5 competitors, and it returns every domain that links to at least one competitor but not to you, each tagged with a found_on array listing which competitors it links to.
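To make the gap semantics concrete, here is a minimal in-memory sketch of the set difference described above. It illustrates the logic only: the real work happens server-side in DuckDB, and the `Edge` type and `gapAnalysis` function are hypothetical names, not the API's actual implementation. Only `linking_domain` and `found_on` mirror fields mentioned in this post.

```typescript
// Hypothetical shapes for illustration; the real API response may differ.
type Edge = { src: string; dst: string }; // "src links to dst"
type Gap = { linking_domain: string; found_on: string[] };

function gapAnalysis(edges: Edge[], target: string, competitors: string[]): Gap[] {
  const competitorSet = new Set(competitors);
  const linksToTarget = new Set<string>();
  const foundOn = new Map<string, Set<string>>();

  for (const { src, dst } of edges) {
    if (dst === target) {
      // Remember every domain that already links to you.
      linksToTarget.add(src);
    } else if (competitorSet.has(dst)) {
      // Record which competitors each referring domain links to.
      if (!foundOn.has(src)) foundOn.set(src, new Set<string>());
      foundOn.get(src)!.add(dst);
    }
  }

  // Keep domains that link to at least one competitor but not to the target.
  const gaps: Gap[] = [];
  foundOn.forEach((comps, src) => {
    if (!linksToTarget.has(src)) {
      gaps.push({ linking_domain: src, found_on: Array.from(comps).sort() });
    }
  });
  return gaps;
}
```

The `found_on` array is what makes the later overlap filter possible: its length tells you how many of your competitors a given domain already links to.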
The composite tool, and the decision I'm unsure about
Most API-wrapper MCP servers are pure 1:1 mappings. I added one opinionated composite tool, gap_outreach_targets, because the raw gap output isn't the thing you actually want — it's the raw material for the thing you want.
What it does on top of gap_analysis:
- Filters to total overlap. Keep only domains whose found_on covers every competitor you listed. A site linking to one competitor might be a fluke or a paid placement. A site linking to all three is a publisher who covers your whole niche and has simply never heard of you. That overlap is the qualifier.
- Strips platform noise. amazonaws.com, github.io, facebook.com, CDNs, URL shorteners: they show up in every backlink profile and are never outreach targets. There's a denylist with suffix matching so subdomains get caught too.
- Ranks by authority. For the top N survivors it makes a cheap per-domain authority lookup and sorts, so the highest-value warm targets surface first. This is opt-out (enrich_top: 0) because each lookup costs one API call against quota.
// the core filter, roughly
const priority = gaps
  .filter(g => !isPlatformNoise(g.linking_domain))
  .filter(g => g.found_on.length === competitors.length)
  .sort((a, b) => (b.cg_authority ?? -1) - (a.cg_authority ?? -1));
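The `isPlatformNoise` helper in that filter is where the suffix matching lives. A minimal sketch of one way to write it follows, assuming a flat list of registrable domains; the three entries are only the examples named above, not the package's real denylist.

```typescript
// Illustrative entries only; the package's actual denylist is not reproduced here.
const PLATFORM_DENYLIST: string[] = ["amazonaws.com", "github.io", "facebook.com"];

function isPlatformNoise(domain: string, denylist: string[] = PLATFORM_DENYLIST): boolean {
  // Normalize case and drop a trailing root dot ("Example.com." -> "example.com").
  const d = domain.toLowerCase().replace(/\.$/, "");
  // Match the entry itself or any subdomain of it. The leading "." in the
  // suffix check keeps "notfacebook.com" from matching "facebook.com".
  return denylist.some(entry => d === entry || d.endsWith("." + entry));
}
```

The dot boundary is the detail that matters: a bare `endsWith(entry)` would wrongly flag unrelated domains that merely share a suffix string.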
The decision I keep going back and forth on: is a composite, opinionated tool the right call for an MCP server, or should it stay a pure API mirror and let the agent do the filtering/ranking in its own reasoning?
Arguments for the composite tool:
- It encodes a workflow the model would otherwise have to reconstruct each time, costing tokens and inviting mistakes (I watched an agent forget to filter platforms more than once).
- It returns a small, ranked, decision-ready list instead of a 1,000-row dump the model has to chew through.
Arguments against:
- It's a leaky abstraction. The moment someone wants a slightly different filter (2-of-3 overlap, a different noise list), they're fighting my opinion instead of composing primitives.
- It hides the platform denylist, which is a judgment call that should arguably be visible.
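For concreteness, the "2-of-3 overlap" variation from the first objection would amount to a threshold on the overlap filter. The sketch below shows what that could look like; `filterByOverlap` and its `minOverlap` parameter are hypothetical and not something this post says the package exposes.

```typescript
type Gap = { linking_domain: string; found_on: string[] };

// Keep gaps that link to at least `minOverlap` of the listed competitors.
// Defaulting to competitors.length reproduces the strict "total overlap" rule.
function filterByOverlap(
  gaps: Gap[],
  competitors: string[],
  minOverlap: number = competitors.length
): Gap[] {
  const wanted = new Set(competitors);
  // Clamp so the threshold is always between 1 and the number of competitors.
  const threshold = Math.min(Math.max(1, minOverlap), competitors.length);
  return gaps.filter(g => {
    // Count distinct listed competitors, ignoring duplicates or unknown entries.
    const hits = new Set(g.found_on.filter(c => wanted.has(c)));
    return hits.size >= threshold;
  });
}
```

Every knob like this makes the composite a little less opinionated and a little more like the primitive it wraps, which is the tension in question.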
I landed on "ship both" (the raw gap_analysis primitive and the composite), but I genuinely can't tell whether that's just indecision dressed up as flexibility. If you've built MCP servers, I'd like to hear where you draw the primitive-vs-composite line.
Using it
{
  "mcpServers": {
    "crawlgraph": {
      "command": "npx",
      "args": ["-y", "crawlgraph-mcp"],
      "env": { "CRAWLGRAPH_API_KEY": "cg_live_..." }
    }
  }
}
Then the whole workflow collapses to one sentence:
"Use gap_outreach_targets for mydomain.com against competitor-a.com and competitor-b.com, then draft a short outreach email to each priority target."
The agent submits the gap job, polls it, filters and ranks, and writes the emails — all in one turn.
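The submit-then-poll step is a standard pattern for long-running jobs. A minimal sketch of the polling half is below, with the status check injected as a function so that no endpoint is assumed. The `JobStatus` states, `pollUntilDone`, and the default interval and attempt limit are all hypothetical, not taken from the CrawlGraph API.

```typescript
// Assumed job states for illustration; the real API may name these differently.
type JobStatus<T> =
  | { state: "pending" }
  | { state: "done"; result: T }
  | { state: "failed"; error: string };

async function pollUntilDone<T>(
  check: () => Promise<JobStatus<T>>,
  opts: { intervalMs?: number; maxAttempts?: number } = {}
): Promise<T> {
  const intervalMs = opts.intervalMs ?? 2000;
  const maxAttempts = opts.maxAttempts ?? 30;

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const status = await check();
    if (status.state === "done") return status.result;
    if (status.state === "failed") throw new Error(`gap job failed: ${status.error}`);
    // Still pending: wait before the next check, unless this was the last one.
    if (attempt < maxAttempts) {
      await new Promise<void>(resolve => setTimeout(resolve, intervalMs));
    }
  }
  throw new Error(`gap job still pending after ${maxAttempts} checks`);
}
```

Bounding the number of attempts matters inside an MCP tool call: the agent is blocked for the whole turn, so a job that never finishes should surface as an error rather than a hang.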
Honest limitations
- Quarterly snapshot. Common Crawl publishes ~4x/year, so this is for one-off prospecting, not live link monitoring. If you need "what changed this week," it's the wrong tool.
- No anchor text in the gap result. The webgraph is (src, dst) edges; anchor text needs a separate WARC pass I didn't wire into the MCP.
- Authority enrichment costs calls. Each scored domain is one API call, hence the cap.
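The enrichment cap can be sketched as follows, with the authority lookup injected so that no endpoint is assumed. `enrichAndRank` and `AuthorityLookup` are hypothetical names, and reading "top N" as the first N survivors in their existing order is an assumption; `cg_authority` and the `enrich_top: 0` opt-out follow the description earlier in this post.

```typescript
type Gap = { linking_domain: string; found_on: string[]; cg_authority?: number };
// Hypothetical lookup signature: one call per domain, each costing quota.
type AuthorityLookup = (domain: string) => Promise<number | undefined>;

async function enrichAndRank(
  gaps: Gap[],
  enrichTop: number,
  lookup: AuthorityLookup
): Promise<Gap[]> {
  // Only the first `enrichTop` survivors are scored; 0 disables lookups entirely.
  const head = gaps.slice(0, Math.max(0, enrichTop));
  const tail = gaps.slice(head.length);

  const scored = await Promise.all(
    head.map(async g => ({ ...g, cg_authority: await lookup(g.linking_domain) }))
  );

  // Unscored domains are treated as -1 so they sort after every scored one.
  return [...scored, ...tail].sort(
    (a, b) => (b.cg_authority ?? -1) - (a.cg_authority ?? -1)
  );
}
```

With this shape the quota cost of a call is at most `enrichTop` lookups, regardless of how many gap domains survive the filters.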
Code is MIT, on GitHub and npm (npx -y crawlgraph-mcp). Feedback on the composite-tool question especially welcome — it's the part of the design I'm least settled on.