Simon

Posted on • Originally published at crawlforge.dev

The Complete Guide to MCP Web Scraping: Everything Developers Need to Know

The Model Context Protocol (MCP) has fundamentally changed how AI assistants interact with the web. If you've been building with Claude, Cursor, or any MCP-compatible client, you've probably wondered: what's the best way to give your AI agent access to live web data?

This guide covers everything -- from foundational MCP concepts to all 18 CrawlForge tools and how to pick the right one for each job.


TL;DR: MCP is the open standard for connecting AI to tools. CrawlForge is the most comprehensive MCP web scraping server compared in this guide, with 18 tools -- roughly 4x as many as the next-largest alternative. The free tier includes 1,000 one-time trial credits, no credit card required.



Part 1: Understanding MCP

What is the Model Context Protocol?

MCP is an open standard developed by Anthropic that lets AI assistants connect to external tools and data sources. Think of it as a universal adapter between AI models and specialized tools.

+--------------+      +----------------+      +---------------+
|   Claude     | <--> |  MCP Server    | <--> |  External     |
|  (AI Model)  |      |  (CrawlForge)  |      |  Resources    |
+--------------+      +----------------+      +---------------+
                             ^
                      MCP Protocol
                   (JSON-RPC over stdio)

Why MCP Matters for Web Scraping

Before MCP, getting real-time web data into AI was painful:

| Approach         | Problems                        |
|------------------|---------------------------------|
| Training data    | Outdated, knowledge cutoff      |
| RAG (Retrieval)  | Limited to indexed documents    |
| Function calling | Requires custom implementation  |
| Browser plugins  | Inconsistent, security concerns |

MCP solves these with:

  • Standardized interface -- One protocol for all tools
  • Real-time data -- Fresh information from any source
  • Tool composability -- Combine multiple tools seamlessly
  • Security model -- Controlled access to external resources

How MCP Works

MCP uses a client-server architecture and exchanges JSON-RPC 2.0 messages:

Step 1: Server Discovery -- the client's config tells it how to launch the server
{
  "mcpServers": {
    "crawlforge": {
      "command": "npx",
      "args": ["crawlforge-mcp-server"]
    }
  }
}

Step 2: Tool Registration -- the server advertises its capabilities in response to a tools/list request
{
  "tools": [
    {
      "name": "fetch_url",
      "description": "Fetch content from a URL",
      "inputSchema": {
        "type": "object",
        "properties": {
          "url": { "type": "string", "format": "uri" }
        },
        "required": ["url"]
      }
    }
  ]
}

Step 3: Tool Invocation -- the AI calls a tool
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "fetch_url",
    "arguments": { "url": "https://example.com" }
  }
}

Step 4: Response -- structured data comes back (simplified here; on the wire, MCP wraps the payload in a JSON-RPC result with a content array)
{
  "result": {
    "content": "<html>...",
    "status": 200,
    "headers": {}
  }
}
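Steps 2 and 3 can be sketched from the client side. This is a minimal illustration that assumes the JSON-RPC 2.0 envelope MCP uses; `buildToolCall` and `missingRequired` are hypothetical helper names, not part of any SDK, and a real client would normally let the SDK build these messages.

```typescript
// Sketch of Steps 2-3: check arguments against a tool's advertised
// inputSchema, then wrap the call in a JSON-RPC 2.0 envelope.
// buildToolCall and missingRequired are hypothetical helpers for illustration.

interface ToolDefinition {
  name: string;
  description: string;
  inputSchema: {
    type: "object";
    properties: Record<string, unknown>;
    required?: string[];
  };
}

interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Names of required arguments that the caller did not supply.
function missingRequired(tool: ToolDefinition, args: Record<string, unknown>): string[] {
  return (tool.inputSchema.required ?? []).filter((key) => !(key in args));
}

// Build the request a client would send in Step 3.
function buildToolCall(id: number, tool: ToolDefinition, args: Record<string, unknown>): ToolCallRequest {
  const missing = missingRequired(tool, args);
  if (missing.length > 0) {
    throw new Error(`Missing required arguments for ${tool.name}: ${missing.join(", ")}`);
  }
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name: tool.name, arguments: args } };
}

// The tool definition from Step 2.
const fetchUrlTool: ToolDefinition = {
  name: "fetch_url",
  description: "Fetch content from a URL",
  inputSchema: {
    type: "object",
    properties: { url: { type: "string", format: "uri" } },
    required: ["url"],
  },
};

const request = buildToolCall(1, fetchUrlTool, { url: "https://example.com" });
console.log(JSON.stringify(request, null, 2));
```

Checking the `required` list before sending keeps obviously malformed calls from ever reaching the server.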


Part 2: The MCP Web Scraping Ecosystem

Several MCP servers provide web scraping capabilities:

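| Server      | Tools | Focus                                     |
|-------------|-------|-------------------------------------------|
| CrawlForge  | 18    | Comprehensive scraping, research, stealth |
| Firecrawl   | ~5    | Basic scraping and crawling               |
| Browser MCP | ~3    | Browser automation                        |
| Fetch MCP   | 1     | Simple HTTP requests                      |
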
CrawlForge: ████████████████████ 18 tools
Firecrawl:  █████                 5 tools
Browser:    ███                   3 tools
Fetch:      █                     1 tool

Part 3: CrawlForge's 18 Tools Explained

Basic Scraping (1 credit)

1. fetch_url (1 credit) -- Fetch raw HTML from any URL

The foundation of web scraping. Always try this first.

// Usage: "Fetch https://example.com"
// Returns:
{
  html: "<html>...",
  status: 200,
  headers: { /* ... */ },
  timing: { total: 523 }
}

When to use: Starting point for any scraping task.

2. extract_text (1 credit) -- Clean text extraction

Removes HTML tags, scripts, and styles. Returns readable text with word count and reading time.

When to use: Blog posts, articles, documentation where you need readable text.

3. extract_links (1 credit) -- Discover all links on a page
{
  url: "https://example.com",
  filter_external: true  // Internal links only
}
// Returns: { links: [{ href: "/about", text: "About Us" }, ...], total: 45 }

When to use: Site exploration, finding pages to scrape, building sitemaps.
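To make the internal-versus-external distinction concrete, here is a small sketch of same-host filtering. It is an assumption about what a `filter_external`-style option could do, not CrawlForge's actual implementation; `internalLinks` and `PageLink` are hypothetical names.

```typescript
// Sketch: keep only links on the same host as the page they came from.
// This is an assumed behavior for a filter_external-style option,
// not CrawlForge's actual implementation.

interface PageLink {
  href: string;
  text: string;
}

function internalLinks(pageUrl: string, links: PageLink[]): PageLink[] {
  const base = new URL(pageUrl);
  return links.filter((link) => {
    try {
      // Resolve relative hrefs such as "/about" against the page URL.
      return new URL(link.href, base).host === base.host;
    } catch {
      return false; // Skip hrefs that cannot be parsed as URLs.
    }
  });
}

const sample: PageLink[] = [
  { href: "/about", text: "About Us" },
  { href: "https://example.com/blog", text: "Blog" },
  { href: "https://other.example.org/", text: "Partner" },
  { href: "mailto:hi@example.com", text: "Email" },
];

const kept = internalLinks("https://example.com", sample);
console.log(kept.map((link) => link.href));
```

Comparing hosts after resolving against the page URL handles relative links and drops non-HTTP schemes such as `mailto:`, which have an empty host.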

4. extract_metadata (1 credit) -- SEO metadata extraction

Pulls the title, description, Open Graph tags, and JSON-LD structured data.

When to use: SEO analysis, content previews, structured data extraction.

Structured Extraction (2-3 credits)

5. scrape_structured (2 credits) -- CSS selector-based extraction
{
  url: "https://example.com/products",
  selectors: {
    title: "h1.product-title",
    price: "span.price",
    description: ".product-description"
  }
}
// Returns: { data: { title: "Product Name", price: "$99.99", description: "..." } }

When to use: E-commerce scraping, structured data, known page layouts.

6. extract_content (2 credits) -- Intelligent article extraction

Like Readability -- returns title, author, published date, clean content, images, reading time.

When to use: News articles, blog posts, editorial content.

7. map_site (2 credits) -- Site structure discovery
{
  url: "https://example.com",
  max_urls: 1000,
  include_sitemap: true
}
// Returns: pages[], structure tree

When to use: Site audits, crawl planning, content discovery.

8. analyze_content (3 credits) -- NLP analysis

Language detection, sentiment scoring, topic extraction, named entity recognition, readability analysis.

When to use: Content analysis, sentiment tracking, topic extraction.

Advanced Scraping (2-5 credits)

9. process_document (2 credits) -- PDF and document handling
{
  source: "https://example.com/report.pdf",
  sourceType: "pdf_url"
}
// Returns: { text: "...", pages: 15, metadata: { author, created } }

When to use: Research papers, reports, documentation PDFs.

10. summarize_content (4 credits) -- AI-powered summarization

Returns concise summary, key points, and word reduction percentage.

When to use: Long documents, research synthesis, content digests.

11. crawl_deep (4 credits) -- Multi-page crawling
{
  url: "https://example.com",
  max_depth: 3,
  max_pages: 100,
  include_patterns: ["/blog/*"],
  exclude_patterns: ["/admin/*"]
}

When to use: Full site scraping, content aggregation, archiving.
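The include and exclude patterns are easier to reason about with a concrete matcher. The sketch below assumes that `*` matches any run of characters and that exclusions win over inclusions; CrawlForge's real matching rules may differ, and `globToRegExp` and `shouldCrawl` are hypothetical helpers.

```typescript
// Sketch: decide whether a URL path passes include/exclude glob patterns.
// Assumes "*" matches any run of characters and that exclusions take
// priority; the server's real matching rules may differ.

function globToRegExp(pattern: string): RegExp {
  // Escape regex metacharacters (except "*"), then turn each "*" into ".*".
  const escaped = pattern.replace(/[.+?^${}()|[\]\\]/g, "\\$&").replace(/\*/g, ".*");
  return new RegExp(`^${escaped}$`);
}

function shouldCrawl(path: string, include: string[], exclude: string[]): boolean {
  if (exclude.some((pattern) => globToRegExp(pattern).test(path))) return false;
  // An empty include list means "everything not excluded".
  return include.length === 0 || include.some((pattern) => globToRegExp(pattern).test(path));
}

const include = ["/blog/*"];
const exclude = ["/admin/*"];

for (const path of ["/blog/post-1", "/admin/users", "/pricing"]) {
  console.log(path, shouldCrawl(path, include, exclude));
}
```

Running a handful of known paths through a matcher like this before a crawl is a cheap way to confirm the patterns select what you expect.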

12. batch_scrape (5 credits) -- Parallel scraping of multiple URLs

Scrape up to 50 URLs in parallel with concurrency control.

When to use: Multiple known URLs, competitor monitoring, price tracking.

13. scrape_with_actions (5 credits) -- Browser automation
{
  url: "https://example.com/app",
  actions: [
    { type: "wait", selector: ".content" },
    { type: "click", selector: "#load-more" },
    { type: "wait", timeout: 2000 },
    { type: "scroll", selector: "body" },
    { type: "screenshot" }
  ]
}

When to use: SPAs, infinite scroll, dynamic content, login-required pages.
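Because browser automation is one of the more expensive tools, it can help to catch malformed action lists before sending them. The sketch below types the actions from the example above. The shapes are inferred from that example and are assumptions, and `validateActions` is a hypothetical helper.

```typescript
// Sketch: type the action list and catch obvious mistakes before spending credits.
// The Action shapes are inferred from the example above and are assumptions;
// validateActions is a hypothetical helper.

type Action =
  | { type: "wait"; selector?: string; timeout?: number }
  | { type: "click"; selector: string }
  | { type: "scroll"; selector: string }
  | { type: "screenshot" };

function validateActions(actions: Action[]): string[] {
  const problems: string[] = [];
  actions.forEach((action, index) => {
    switch (action.type) {
      case "wait":
        if (action.selector === undefined && action.timeout === undefined) {
          problems.push(`actions[${index}]: wait needs a selector or a timeout`);
        }
        break;
      case "click":
      case "scroll":
        if (action.selector.trim() === "") {
          problems.push(`actions[${index}]: ${action.type} needs a non-empty selector`);
        }
        break;
      case "screenshot":
        break; // No parameters to check.
    }
  });
  return problems;
}

const plannedActions: Action[] = [
  { type: "wait", selector: ".content" },
  { type: "click", selector: "#load-more" },
  { type: "wait" }, // Mistake: neither selector nor timeout.
  { type: "screenshot" },
];

const problems = validateActions(plannedActions);
console.log(problems);
```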

14. search_web (5 credits) -- Google search integration
{
  query: "web scraping best practices 2026",
  limit: 10,
  site: "github.com"  // Optional site filter
}

When to use: Discovery, finding sources, research starting point.

Specialized Tools (2-10 credits)

15. stealth_mode (5 credits) -- Anti-detection bypass

Hides WebDriver, randomizes fingerprints, simulates human behavior. Configurable stealth levels.

When to use: Protected sites, Cloudflare bypass, anti-bot evasion.

16. track_changes (3 credits) -- Content monitoring

Set a baseline, then detect changes with before/after diffs and significance scoring.

When to use: Price monitoring, competitor tracking, content updates.
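The baseline-then-diff idea can be illustrated with a line-level comparison. This is a deliberately simple sketch: the significance score used here (changed lines divided by total distinct lines) is an illustrative assumption, not CrawlForge's scoring, and `diffSnapshots` is a hypothetical helper.

```typescript
// Sketch: compare a stored baseline with a fresh snapshot, line by line.
// The significance score (changed lines / total distinct lines) is an
// illustrative assumption, not the tool's actual scoring.

interface ChangeReport {
  added: string[];
  removed: string[];
  significance: number; // 0 = identical, 1 = nothing in common
}

function nonEmptyLines(text: string): string[] {
  return text
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
}

function diffSnapshots(baseline: string, current: string): ChangeReport {
  const before = new Set(nonEmptyLines(baseline));
  const after = new Set(nonEmptyLines(current));
  const added = Array.from(after).filter((line) => !before.has(line));
  const removed = Array.from(before).filter((line) => !after.has(line));
  const distinct = new Set(Array.from(before).concat(Array.from(after))).size;
  const significance = distinct === 0 ? 0 : (added.length + removed.length) / distinct;
  return { added, removed, significance };
}

const report = diffSnapshots(
  "Widget\nPrice: $99.99\nIn stock",
  "Widget\nPrice: $89.99\nIn stock"
);
console.log(report);
```

For the price change above, one line is added and one removed out of four distinct lines, so the score comes out at 0.5.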

17. localization (2 credits) -- Geo-targeted scraping

Set country code and language to get region-specific content and pricing.

When to use: Regional pricing, localized content, geo-restricted data.

18. deep_research (10 credits) -- Multi-source research
{
  topic: "quantum computing commercialization",
  maxUrls: 50,
  enableSourceVerification: true,
  enableConflictDetection: true
}
// Returns: { synthesis, sources[], conflicts[], citations[] }

When to use: Research projects, due diligence, market analysis.


Part 4: Integration Guide

Claude Code Setup

# 1. Install CrawlForge MCP server
npm install -g crawlforge-mcp-server

# 2. Run setup wizard
npx crawlforge-setup

# 3. Add to Claude Code (run from your shell, not inside a session)
claude mcp add crawlforge -- npx crawlforge-mcp-server

# 4. Verify inside a Claude Code session
claude
> /mcp
# Should show: crawlforge (18 tools)

Claude Desktop Setup

Edit your config file:

macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Windows: %APPDATA%\Claude\claude_desktop_config.json

{
  "mcpServers": {
    "crawlforge": {
      "command": "npx",
      "args": ["crawlforge-mcp-server"],
      "env": {
        "CRAWLFORGE_API_KEY": "cf_live_your_key_here"
      }
    }
  }
}

Custom Application (MCP SDK)

import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

const transport = new StdioClientTransport({
  command: "npx",
  args: ["crawlforge-mcp-server"],
  // env values must be strings, so fall back to "" if the variable is unset
  env: { CRAWLFORGE_API_KEY: process.env.CRAWLFORGE_API_KEY ?? "" }
});

const client = new Client({ name: "my-app", version: "1.0.0" });
await client.connect(transport);

// List available tools
const tools = await client.listTools();
console.log(`Available tools: ${tools.tools.length}`);

// Call a tool
const result = await client.callTool({
  name: "fetch_url",
  arguments: { url: "https://example.com" }
});
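A tool result arrives as a `content` array of typed items rather than a bare string, so most applications need a small unwrapping step. The sketch below shows one way to do that; `textFromResult` is a hypothetical helper, and the `ToolResult` type is a simplified stand-in for the SDK's own result type.

```typescript
// Sketch: pull the text out of a tools/call result.
// MCP results carry a `content` array of typed items; this helper joins the
// text items and ignores the rest. ToolResult is a simplified stand-in type.

interface ContentItem {
  type: string;
  text?: string;
}

interface ToolResult {
  content: ContentItem[];
  isError?: boolean;
}

function textFromResult(result: ToolResult): string {
  if (result.isError) {
    throw new Error("Tool call reported an error");
  }
  return result.content
    .filter((item) => item.type === "text" && typeof item.text === "string")
    .map((item) => item.text as string)
    .join("\n");
}

// A hand-written result for illustration; real payloads depend on the tool.
const exampleResult: ToolResult = {
  content: [
    { type: "text", text: "<html>...</html>" },
    { type: "image" },
  ],
};

const extracted = textFromResult(exampleResult);
console.log(extracted);
```

Checking `isError` first matters because a failed tool call can still come back as a successful JSON-RPC response.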

Part 5: Best Practices

Credit Optimization

Pick the cheapest tool that gets the job done:

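| Goal                    | Expensive (credits)     | Efficient (credits) |
|-------------------------|-------------------------|---------------------|
| Check if page exists    | deep_research (10)      | fetch_url (1)       |
| Get article text        | scrape_with_actions (5) | extract_content (2) |
| Find competitor URLs    | search_web x 10 (50)    | extract_links (1)   |
| Scrape 20 product pages | fetch_url x 20 (20)     | batch_scrape (5)    |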
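
The comparison above is simple arithmetic, so it is easy to estimate a plan's cost before running it. The sketch below uses the per-call credit figures quoted in this guide; `estimateCredits` is a hypothetical helper, and it assumes `batch_scrape` is billed per call rather than per URL, as the table implies.

```typescript
// Sketch: estimate the credit cost of a scraping plan before running it.
// Costs are the per-call figures quoted in this guide; estimateCredits is a
// hypothetical helper. Assumes batch_scrape is billed per call, not per URL.

const CREDIT_COST: Record<string, number> = {
  fetch_url: 1, extract_text: 1, extract_links: 1, extract_metadata: 1,
  scrape_structured: 2, extract_content: 2, map_site: 2,
  process_document: 2, localization: 2,
  analyze_content: 3, track_changes: 3,
  summarize_content: 4, crawl_deep: 4,
  batch_scrape: 5, scrape_with_actions: 5, search_web: 5, stealth_mode: 5,
  deep_research: 10,
};

interface PlannedCall {
  tool: string;
  calls: number;
}

function estimateCredits(plan: PlannedCall[]): number {
  return plan.reduce((total, step) => {
    const cost = CREDIT_COST[step.tool];
    if (cost === undefined) throw new Error(`Unknown tool: ${step.tool}`);
    return total + cost * step.calls;
  }, 0);
}

// Last row of the table: 20 product pages, fetched one by one vs. batched.
const oneByOne = estimateCredits([{ tool: "fetch_url", calls: 20 }]);
const batched = estimateCredits([{ tool: "batch_scrape", calls: 1 }]);
console.log({ oneByOne, batched });
```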

Error Handling

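// fetchUrl and stealthMode stand in for the corresponding MCP tool calls
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

async function fetchWithFallback(url: string, attempt = 1, maxAttempts = 3, retryDelay = 1000) {
  try {
    const result = await fetchUrl(url);
    if (result.status === 403 || result.status === 429) {
      // Blocked or rate limited: retry through stealth mode
      return await stealthMode(url);
    }
    return result;
  } catch (error) {
    // Transient failure: give up after maxAttempts, otherwise back off and retry
    if (attempt >= maxAttempts) throw error;
    await sleep(retryDelay * attempt);
    return fetchWithFallback(url, attempt + 1, maxAttempts, retryDelay);
  }
}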

Rate Limiting

Respect target sites:

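// scrape and batchScrape are placeholders for your own tool-call wrappers
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

// Good: Reasonable delays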
for (const url of urls) {
  await scrape(url);
  await sleep(1000 + Math.random() * 2000); // 1-3s delay
}

// Better: Use batch_scrape with built-in rate limiting
await batchScrape(urls, { delayBetweenRequests: 1500 });

Caching

Don't scrape the same URL twice:

const cache = new Map<string, ScrapedContent>();

async function smartScrape(url: string) {
  if (cache.has(url)) return cache.get(url);

  const result = await fetchUrl(url);
  cache.set(url, result);
  return result;
}
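A plain `Map` never forgets, which is a problem for pages that change. One common refinement is to expire entries after a time-to-live. The sketch below is a minimal version of that idea; `TtlCache` is a hypothetical helper, and the clock is injectable so the expiry logic can be exercised without waiting.

```typescript
// Sketch: an in-memory cache whose entries expire after ttlMs.
// TtlCache is a hypothetical helper; the clock is injectable so expiry
// can be demonstrated without real delays.

class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private ttlMs: number;
  private now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // Stale: drop it so the caller re-scrapes.
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}

// Demonstration with a fake clock and a 60-second TTL.
let fakeTime = 0;
const pageCache = new TtlCache<string>(60_000, () => fakeTime);
pageCache.set("https://example.com", "<html>...</html>");

const fresh = pageCache.get("https://example.com"); // Still within the TTL.
fakeTime = 60_001;
const expired = pageCache.get("https://example.com"); // Past the TTL.
console.log({ fresh, expired });
```

A short TTL suits prices and listings, while a long one suits documentation and other content that rarely changes.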

Part 6: Getting Started


Free Tier -- 1,000 one-time trial credits. All 18 tools available. No credit card required.
npm install -g crawlforge-mcp-server
npx crawlforge-setup

What 1,000 Credits Gets You

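| Use Case           | Tool            | Credits per call | Capacity      |
|--------------------|-----------------|------------------|---------------|
| Basic scraping     | fetch_url       | 1                | 1,000 pages   |
| Article extraction | extract_content | 2                | 500 articles  |
| Site mapping       | map_site        | 2                | 500 sites     |
| Batch jobs         | batch_scrape    | 5                | 200 batches   |
| Research projects  | deep_research   | 10               | 100 topics    |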

Key Takeaways

  1. MCP is the standard -- Claude, Cursor, and other MCP-compatible clients all speak it
  2. CrawlForge leads with 18 tools -- roughly 4x as many as the next-largest server compared above
  3. Start simple -- Use fetch_url (1 credit) before reaching for advanced tools
  4. Combine tools -- Chain operations for powerful workflows
  5. Be ethical -- Respect robots.txt and rate limits


Start Free -- 1,000 Credits Included
