Simon

Posted on Mar 27 • Edited on Jun 29 • Originally published at crawlforge.dev

The Complete Guide to MCP Web Scraping: Everything Developers Need to Know

#mcp #webdev #ai #webscraping

The Model Context Protocol (MCP) has fundamentally changed how AI assistants interact with the web. If you've been building with Claude, Cursor, or any MCP-compatible client, you've probably wondered: what's the best way to give your AI agent access to live web data?

This guide covers everything -- from foundational MCP concepts to all 18 CrawlForge tools and how to pick the right one for each job.

TL;DR: MCP is the open standard for connecting AI to tools. CrawlForge is the most comprehensive MCP web scraping server with 18 tools -- 4x more than alternatives. Free tier includes 1,000 credits, no credit card required.

Part 1: Understanding MCP
Part 2: The MCP Web Scraping Ecosystem
Part 3: All 18 Tools Explained
Part 4: Integration Guide
Part 5: Best Practices
Part 6: Getting Started

Part 1: Understanding MCP

What is the Model Context Protocol?

MCP is an open standard developed by Anthropic that lets AI assistants connect to external tools and data sources. Think of it as a universal adapter between AI models and specialized tools.

+--------------+      +----------------+      +---------------+
|   Claude     | <--> |  MCP Server    | <--> |  External     |
|  (AI Model)  |      |  (CrawlForge)  |      |  Resources    |
+--------------+      +----------------+      +---------------+
                             ^
                      MCP Protocol
                   (JSON-RPC over stdio)

Why MCP Matters for Web Scraping

Before MCP, getting real-time web data into AI was painful:

Approach	Problems
Training data	Outdated, knowledge cutoff
RAG (Retrieval)	Limited to indexed documents
Function calling	Requires custom implementation
Browser plugins	Inconsistent, security concerns

MCP solves these with:

Standardized interface -- One protocol for all tools
Real-time data -- Fresh information from any source
Tool composability -- Combine multiple tools seamlessly
Security model -- Controlled access to external resources

How MCP Works

MCP uses a client-server architecture with JSON-RPC:

Step 1: Server Discovery

{
  "mcpServers": {
    "crawlforge": {
      "command": "npx",
      "args": ["crawlforge-mcp-server"]
    }
  }
}

Step 2: Tool Registration -- the server advertises its capabilities

{
  "tools": [
    {
      "name": "fetch_url",
      "description": "Fetch content from a URL",
      "inputSchema": {
        "type": "object",
        "properties": {
          "url": { "type": "string", "format": "uri" }
        },
        "required": ["url"]
      }
    }
  ]
}

Step 3: Tool Invocation -- the AI calls a tool

{
  "method": "tools/call",
  "params": {
    "name": "fetch_url",
    "arguments": { "url": "https://example.com" }
  }
}

Step 4: Response -- structured data comes back

{
  "result": {
    "content": "<html>...",
    "status": 200,
    "headers": {}
  }
}

Part 2: The MCP Web Scraping Ecosystem

Several MCP servers provide web scraping capabilities:

Server	Tools	Focus
CrawlForge	18	Comprehensive scraping, research, stealth
Firecrawl	~5	Basic scraping and crawling
Browser MCP	~3	Browser automation
Fetch MCP	1	Simple HTTP requests

CrawlForge: ████████████████████ 18 tools
Firecrawl:  █████                 5 tools
Browser:    ███                   3 tools
Fetch:      █                     1 tool

Part 3: CrawlForge's 18 Tools Explained

Basic Scraping (1-2 credits)

1. fetch_url (1 credit) -- Fetch raw HTML from any URL

The foundation of web scraping. Always try this first.

// Usage: "Fetch https://example.com"
// Returns:
{
  html: "<html>...",
  status: 200,
  headers: { /* ... */ },
  timing: { total: 523 }
}

When to use: Starting point for any scraping task.

2. extract_text (1 credit) -- Clean text extraction

Removes HTML tags, scripts, and styles. Returns readable text with word count and reading time.

When to use: Blog posts, articles, documentation where you need readable text.

3. extract_links (1 credit) -- Discover all links on a page

{
  url: "https://example.com",
  filter_external: true  // Internal links only
}
// Returns: { links: [{ href: "/about", text: "About Us" }, ...], total: 45 }

When to use: Site exploration, finding pages to scrape, building sitemaps.

4. extract_metadata (1 credit) -- SEO metadata extraction

Pulls title, description, Open Graph, JSON-LD structured data.

When to use: SEO analysis, content previews, structured data extraction.

Structured Extraction (2-3 credits)

5. scrape_structured (2 credits) -- CSS selector-based extraction

{
  url: "https://example.com/products",
  selectors: {
    title: "h1.product-title",
    price: "span.price",
    description: ".product-description"
  }
}
// Returns: { data: { title: "Product Name", price: "$99.99", description: "..." } }

When to use: E-commerce scraping, structured data, known page layouts.

6. extract_content (2 credits) -- Intelligent article extraction

Like Readability -- returns title, author, published date, clean content, images, reading time.

When to use: News articles, blog posts, editorial content.

7. map_site (2 credits) -- Site structure discovery

{
  url: "https://example.com",
  max_urls: 1000,
  include_sitemap: true
}
// Returns: pages[], structure tree

When to use: Site audits, crawl planning, content discovery.

8. analyze_content (3 credits) -- NLP analysis

Language detection, sentiment scoring, topic extraction, named entity recognition, readability analysis.

When to use: Content analysis, sentiment tracking, topic extraction.

Advanced Scraping (4-5 credits)

9. process_document (2 credits) -- PDF and document handling

{
  source: "https://example.com/report.pdf",
  sourceType: "pdf_url"
}
// Returns: { text: "...", pages: 15, metadata: { author, created } }

When to use: Research papers, reports, documentation PDFs.

10. summarize_content (4 credits) -- AI-powered summarization

Returns concise summary, key points, and word reduction percentage.

When to use: Long documents, research synthesis, content digests.

11. crawl_deep (4 credits) -- Multi-page crawling

{
  url: "https://example.com",
  max_depth: 3,
  max_pages: 100,
  include_patterns: ["/blog/*"],
  exclude_patterns: ["/admin/*"]
}

When to use: Full site scraping, content aggregation, archiving.

12. batch_scrape (5 credits) -- Parallel scraping of multiple URLs

Scrape up to 50 URLs in parallel with concurrency control.

When to use: Multiple known URLs, competitor monitoring, price tracking.

13. scrape_with_actions (5 credits) -- Browser automation

{
  url: "https://example.com/app",
  actions: [
    { type: "wait", selector: ".content" },
    { type: "click", selector: "#load-more" },
    { type: "wait", timeout: 2000 },
    { type: "scroll", selector: "body" },
    { type: "screenshot" }
  ]
}

When to use: SPAs, infinite scroll, dynamic content, login-required pages.

14. search_web (5 credits) -- Google search integration

{
  query: "web scraping best practices 2026",
  limit: 10,
  site: "github.com"  // Optional site filter
}

When to use: Discovery, finding sources, research starting point.

Specialized Tools (2-10 credits)

15. stealth_mode (5 credits) -- Anti-detection bypass

Hides WebDriver, randomizes fingerprints, simulates human behavior. Configurable stealth levels.

When to use: Protected sites, Cloudflare bypass, anti-bot evasion.

16. track_changes (3 credits) -- Content monitoring

Set a baseline, then detect changes with before/after diffs and significance scoring.

When to use: Price monitoring, competitor tracking, content updates.

17. localization (2 credits) -- Geo-targeted scraping

Set country code and language to get region-specific content and pricing.

When to use: Regional pricing, localized content, geo-restricted data.

18. deep_research (10 credits) -- Multi-source research

{
  topic: "quantum computing commercialization",
  maxUrls: 50,
  enableSourceVerification: true,
  enableConflictDetection: true
}
// Returns: { synthesis, sources[], conflicts[], citations[] }

When to use: Research projects, due diligence, market analysis.

Part 4: Integration Guide

Claude Code Setup

# 1. Install CrawlForge MCP server
npm install -g crawlforge-mcp-server

# 2. Run setup wizard
npx crawlforge-setup

# 3. Add to Claude Code
claude
> /mcp add crawlforge npx crawlforge-mcp-server

# 4. Verify
> /mcp list
# Should show: crawlforge (18 tools)

Claude Desktop Setup

Edit your config file:

macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Windows: %APPDATA%\Claude\claude_desktop_config.json

{
  "mcpServers": {
    "crawlforge": {
      "command": "npx",
      "args": ["crawlforge-mcp-server"],
      "env": {
        "CRAWLFORGE_API_KEY": "cf_live_your_key_here"
      }
    }
  }
}

Custom Application (MCP SDK)

import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

const transport = new StdioClientTransport({
  command: "npx",
  args: ["crawlforge-mcp-server"],
  env: { CRAWLFORGE_API_KEY: process.env.CRAWLFORGE_API_KEY }
});

const client = new Client({ name: "my-app", version: "1.0.0" });
await client.connect(transport);

// List available tools
const tools = await client.listTools();
console.log(`Available tools: ${tools.tools.length}`);

// Call a tool
const result = await client.callTool({
  name: "fetch_url",
  arguments: { url: "https://example.com" }
});

Part 5: Best Practices

Credit Optimization

Pick the cheapest tool that gets the job done:

Goal	Expensive	Efficient
Check if page exists	`deep_research` (10)	`fetch_url` (1)
Get article text	`scrape_with_actions` (5)	`extract_content` (2)
Find competitor URLs	`search_web` x 10 (50)	`extract_links` (1)
Scrape 20 product pages	`fetch_url` x 20 (20)	`batch_scrape` (5)

Error Handling

try {
  const result = await fetchUrl(url);
  if (result.status >= 400) {
    // Try with stealth mode for blocked requests
    return await stealthMode(url);
  }
  return result;
} catch (error) {
  await sleep(retryDelay * attempt);
  return retry(url, attempt + 1);
}

Rate Limiting

Respect target sites:

// Good: Reasonable delays
for (const url of urls) {
  await scrape(url);
  await sleep(1000 + Math.random() * 2000); // 1-3s delay
}

// Better: Use batch_scrape with built-in rate limiting
await batchScrape(urls, { delayBetweenRequests: 1500 });

Caching

Don't scrape the same URL twice:

const cache = new Map<string, ScrapedContent>();

async function smartScrape(url: string) {
  if (cache.has(url)) return cache.get(url);

  const result = await fetchUrl(url);
  cache.set(url, result);
  return result;
}

Part 6: Getting Started

Free Tier -- 1,000 one-time trial credits. All 18 tools available. No credit card required.

npm install -g crawlforge-mcp-server
npx crawlforge-setup

What 1,000 Credits Gets You

Use Case	Tool	Credits	Capacity
Basic scraping	`fetch_url`	1	1,000 pages
Article extraction	`extract_content`	2	500 articles
Site mapping	`map_site`	2	500 sites
Batch jobs	`batch_scrape`	5	200 batches
Research projects	`deep_research`	10	100 topics

Key Takeaways

MCP is the standard -- All major AI assistants support it
CrawlForge leads with 18 tools -- 4x more than alternatives
Start simple -- Use fetch_url (1 credit) before reaching for advanced tools
Combine tools -- Chain operations for powerful workflows
Be ethical -- Respect robots.txt and rate limits

More from CrawlForge:

Start Free -- 1,000 Credits Included

DEV Community