DEV Community

Petteri Pucilowski
Petteri Pucilowski

Posted on

I wrapped a backlink API in an MCP server so I could do SEO gap analysis from inside Claude

I do a fair amount of competitor backlink research, and the workflow always annoyed me: open a dashboard, run a query, export a CSV, eyeball it, copy domains into a doc, switch to email. Lots of tab-hopping for what is fundamentally a data-filtering problem an agent should handle.

So I wrapped the backlink API I'd been using into an MCP server. Now I stay in Claude Code (or Cursor, Cline, Zed, Windsurf) and just describe the goal. This is the build: the architecture, the four tools, and the one design decision I'm still not sure about.

The data source

The server runs on the Common Crawl hyperlink webgraph — about 4.4 billion edges across 120 million domains, published quarterly as Parquet. That matters for an MCP tool specifically: the data is open, so there's no scraped-proprietary-index liability in handing it to an agent, and the same query is reproducible by anyone.

The HTTP API in front of it (CrawlGraph) does the heavy DuckDB work; the MCP server is a thin TypeScript stdio client over it. Keeping the server thin was deliberate — all the query cost, caching, and quota logic lives server-side, so the MCP package stays a ~300-line wrapper that's easy to audit before you hand it your API key.

The four tools

backlinks            → referring domains for a target, with authority scores
gap_analysis         → domains linking to your competitors but not to you
gap_outreach_targets → the composite play (below)
releases             → list the Common Crawl snapshots
Enter fullscreen mode Exit fullscreen mode

backlinks and gap_analysis map 1:1 to API endpoints. gap_analysis is the interesting primitive: submit your domain plus 2-5 competitors, and it returns every domain that links to at least one competitor but not to you, each tagged with a found_on array listing which competitors it links to.

The composite tool, and the decision I'm unsure about

Most API-wrapper MCP servers are pure 1:1 mappings. I added one opinionated composite tool, gap_outreach_targets, because the raw gap output isn't the thing you actually want — it's the raw material for the thing you want.

What it does on top of gap_analysis:

  1. Filters to total overlap. Keep only domains whose found_on covers every competitor you listed. A site linking to one competitor might be a fluke or a paid placement. A site linking to all three is a publisher who covers your whole niche and has simply never heard of you. That overlap is the qualifier.
  2. Strips platform noise. amazonaws.com, github.io, facebook.com, CDNs, URL shorteners — they show up in every backlink profile and are never outreach targets. There's a denylist with suffix matching so subdomains get caught too.
  3. Ranks by authority. For the top N survivors it makes a cheap per-domain authority lookup and sorts, so the highest-value warm targets surface first. This is opt-out (enrich_top: 0) because each lookup costs one API call against quota.
// the core filter, roughly
const priority = gaps
  .filter(g => !isPlatformNoise(g.linking_domain))
  .filter(g => g.found_on.length === competitors.length)
  .sort((a, b) => (b.cg_authority ?? -1) - (a.cg_authority ?? -1));
Enter fullscreen mode Exit fullscreen mode

The decision I keep going back and forth on: is a composite, opinionated tool the right call for an MCP server, or should it stay a pure API mirror and let the agent do the filtering/ranking in its own reasoning?

Arguments for the composite tool:

  • It encodes a workflow the model would otherwise have to reconstruct each time, costing tokens and inviting mistakes (I watched an agent forget to filter platforms more than once).
  • It returns a small, ranked, decision-ready list instead of a 1,000-row dump the model has to chew through.

Arguments against:

  • It's a leaky abstraction. The moment someone wants a slightly different filter (2-of-3 overlap, a different noise list), they're fighting my opinion instead of composing primitives.
  • It hides the platform denylist, which is a judgment call that should arguably be visible.

I landed on "ship both" — the raw gap_analysis primitive and the composite — but I'm genuinely unsure that's not just indecision dressed up as flexibility. If you've built MCP servers, I'd like to hear where you draw the primitive-vs-composite line.

Using it

{
  "mcpServers": {
    "crawlgraph": {
      "command": "npx",
      "args": ["-y", "crawlgraph-mcp"],
      "env": { "CRAWLGRAPH_API_KEY": "cg_live_..." }
    }
  }
}
Enter fullscreen mode Exit fullscreen mode

Then the whole workflow collapses to one sentence:

"Use gap_outreach_targets for mydomain.com against competitor-a.com and competitor-b.com, then draft a short outreach email to each priority target."

The agent submits the gap job, polls it, filters and ranks, and writes the emails — all in one turn.

Honest limitations

  • Quarterly snapshot. Common Crawl publishes ~4x/year, so this is for one-off prospecting, not live link monitoring. If you need "what changed this week," it's the wrong tool.
  • No anchor text in the gap result. The webgraph is (src, dst) edges; anchor text needs a separate WARC pass I didn't wire into the MCP.
  • Authority enrichment costs calls. Each scored domain is one API call, hence the cap.

Code is MIT, on GitHub and npm (npx -y crawlgraph-mcp). Feedback on the composite-tool question especially welcome — it's the part of the design I'm least settled on.

Top comments (0)