The Model Context Protocol (MCP) has fundamentally changed how AI assistants interact with the web. If you've been building with Claude, Cursor, or any MCP-compatible client, you've probably wondered: what's the best way to give your AI agent access to live web data?
This guide covers everything -- from foundational MCP concepts to all 18 CrawlForge tools and how to pick the right one for each job.
Table of Contents
- Part 1: Understanding MCP
- Part 2: The MCP Web Scraping Ecosystem
- Part 3: All 18 Tools Explained
- Part 4: Integration Guide
- Part 5: Best Practices
- Part 6: Getting Started
Part 1: Understanding MCP
What is the Model Context Protocol?
MCP is an open standard developed by Anthropic that lets AI assistants connect to external tools and data sources. Think of it as a universal adapter between AI models and specialized tools.
+--------------+      +----------------+      +---------------+
|   Claude     | <--> |  MCP Server    | <--> |   External    |
|  (AI Model)  |      |  (CrawlForge)  |      |   Resources   |
+--------------+      +----------------+      +---------------+
              ^
         MCP Protocol
     (JSON-RPC over stdio)
Why MCP Matters for Web Scraping
Before MCP, getting real-time web data into AI was painful:
| Approach | Problems |
|---|---|
| Training data | Outdated, knowledge cutoff |
| RAG (Retrieval) | Limited to indexed documents |
| Function calling | Requires custom implementation |
| Browser plugins | Inconsistent, security concerns |
MCP solves these with:
- Standardized interface -- One protocol for all tools
- Real-time data -- Fresh information from any source
- Tool composability -- Combine multiple tools seamlessly
- Security model -- Controlled access to external resources
How MCP Works
MCP uses a client-server architecture with JSON-RPC:
Step 1: Server Discovery
{
"mcpServers": {
"crawlforge": {
"command": "npx",
"args": ["crawlforge-mcp-server"]
}
}
}
Step 2: Tool Registration -- the server advertises its capabilities
{
"tools": [
{
"name": "fetch_url",
"description": "Fetch content from a URL",
"inputSchema": {
"type": "object",
"properties": {
"url": { "type": "string", "format": "uri" }
},
"required": ["url"]
}
}
]
}
Step 3: Tool Invocation -- the AI calls a tool
{
"method": "tools/call",
"params": {
"name": "fetch_url",
"arguments": { "url": "https://example.com" }
}
}
Step 4: Response -- structured data comes back
{
"result": {
"content": "<html>...",
"status": 200,
"headers": {}
}
}
Part 2: The MCP Web Scraping Ecosystem
Several MCP servers provide web scraping capabilities:
| Server | Tools | Focus |
|---|---|---|
| CrawlForge | 18 | Comprehensive scraping, research, stealth |
| Firecrawl | ~5 | Basic scraping and crawling |
| Browser MCP | ~3 | Browser automation |
| Fetch MCP | 1 | Simple HTTP requests |
CrawlForge: ████████████████████ 18 tools
Firecrawl: █████ 5 tools
Browser: ███ 3 tools
Fetch: █ 1 tool
Part 3: CrawlForge's 18 Tools Explained
Basic Scraping (1-2 credits)
1. fetch_url (1 credit) -- Fetch raw HTML from any URL
The foundation of web scraping. Always try this first.
When to use: Starting point for any scraping task.
// Usage: "Fetch https://example.com"
// Returns:
{
html: "<html>...",
status: 200,
headers: { /* ... */ },
timing: { total: 523 }
}
2. extract_text (1 credit) -- Clean text extraction
Removes HTML tags, scripts, and styles. Returns readable text with word count and reading time.
When to use: Blog posts, articles, documentation where you need readable text.
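A call mirrors fetch_url. The return shape below is illustrative, based on the word count and reading time mentioned above; exact field names may differ:

```javascript
// Usage: "Extract the text from https://example.com/blog/post"
// Illustrative return shape:
{
  text: "Readable article text...",
  wordCount: 1250,
  readingTime: "5 min"
}
```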
3. extract_links (1 credit) -- Discover all links on a page
When to use: Site exploration, finding pages to scrape, building sitemaps.
{
url: "https://example.com",
filter_external: true // Internal links only
}
// Returns: { links: [{ href: "/about", text: "About Us" }, ...], total: 45 }
4. extract_metadata (1 credit) -- SEO metadata extraction
Pulls title, description, Open Graph, and JSON-LD structured data.
When to use: SEO analysis, content previews, structured data extraction.
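An illustrative return, based on the fields listed above (exact names may differ):

```javascript
{
  title: "Example Page",
  description: "A short meta description",
  openGraph: { "og:title": "Example Page", "og:image": "https://example.com/img.png" },
  jsonLd: [ /* structured data blocks */ ]
}
```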
Structured Extraction (2-3 credits)
5. scrape_structured (2 credits) -- CSS selector-based extraction
When to use: E-commerce scraping, structured data, known page layouts.
{
url: "https://example.com/products",
selectors: {
title: "h1.product-title",
price: "span.price",
description: ".product-description"
}
}
// Returns: { data: { title: "Product Name", price: "$99.99", description: "..." } }
6. extract_content (2 credits) -- Intelligent article extraction
Like Readability -- returns title, author, published date, clean content, images, and reading time.
When to use: News articles, blog posts, editorial content.
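An illustrative return, based on the fields named above (exact names may differ):

```javascript
{
  title: "Article Title",
  author: "Jane Doe",
  published: "2026-01-15",
  content: "Clean article body...",
  images: ["https://example.com/hero.jpg"],
  readingTime: "4 min"
}
```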
7. map_site (2 credits) -- Site structure discovery
When to use: Site audits, crawl planning, content discovery.
{
url: "https://example.com",
max_urls: 1000,
include_sitemap: true
}
// Returns: pages[], structure tree
8. analyze_content (3 credits) -- NLP analysis
Language detection, sentiment scoring, topic extraction, named entity recognition, and readability analysis.
When to use: Content analysis, sentiment tracking, topic extraction.
Advanced Scraping (2-5 credits)
9. process_document (2 credits) -- PDF and document handling
When to use: Research papers, reports, documentation PDFs.
{
source: "https://example.com/report.pdf",
sourceType: "pdf_url"
}
// Returns: { text: "...", pages: 15, metadata: { author, created } }
10. summarize_content (4 credits) -- AI-powered summarization
Returns a concise summary, key points, and word reduction percentage.
When to use: Long documents, research synthesis, content digests.
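A call might look like this; the parameter and return names are illustrative, following the pattern of the other examples in this guide:

```javascript
{
  source: "https://example.com/whitepaper",
  maxLength: 500  // illustrative parameter
}
// Illustrative return: { summary: "...", keyPoints: [...], reduction: "92%" }
```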
11. crawl_deep (4 credits) -- Multi-page crawling
When to use: Full site scraping, content aggregation, archiving.
{
url: "https://example.com",
max_depth: 3,
max_pages: 100,
include_patterns: ["/blog/*"],
exclude_patterns: ["/admin/*"]
}
12. batch_scrape (5 credits) -- Parallel scraping of multiple URLs
Scrape up to 50 URLs in parallel with concurrency control.
When to use: Multiple known URLs, competitor monitoring, price tracking.
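An illustrative call; the concurrency parameter name is an assumption based on the description above:

```javascript
{
  urls: [
    "https://example.com/product/1",
    "https://example.com/product/2"
    // ... up to 50 URLs
  ],
  concurrency: 5
}
```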
13. scrape_with_actions (5 credits) -- Browser automation
When to use: SPAs, infinite scroll, dynamic content, login-required pages.
{
url: "https://example.com/app",
actions: [
{ type: "wait", selector: ".content" },
{ type: "click", selector: "#load-more" },
{ type: "wait", timeout: 2000 },
{ type: "scroll", selector: "body" },
{ type: "screenshot" }
]
}
14. search_web (5 credits) -- Google search integration
When to use: Discovery, finding sources, research starting point.
{
query: "web scraping best practices 2026",
limit: 10,
site: "github.com" // Optional site filter
}
Specialized Tools (2-10 credits)
15. stealth_mode (5 credits) -- Anti-detection bypass
Hides WebDriver, randomizes fingerprints, and simulates human behavior, with configurable stealth levels.
When to use: Protected sites, Cloudflare bypass, anti-bot evasion.
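An illustrative call; the stealth-level parameter name and values are assumptions based on the "configurable stealth levels" mentioned above:

```javascript
{
  url: "https://protected-site.example.com",
  stealthLevel: "high"
}
```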
16. track_changes (3 credits) -- Content monitoring
Set a baseline, then detect changes with before/after diffs and significance scoring.
When to use: Price monitoring, competitor tracking, content updates.
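An illustrative workflow; field names are assumptions based on the baseline/diff behavior described above:

```javascript
// First call sets the baseline; later calls return a diff against it
{
  url: "https://example.com/pricing"
}
// Illustrative return: { changed: true, diff: [...], significance: 0.8 }
```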
17. localization (2 credits) -- Geo-targeted scraping
Set a country code and language to get region-specific content and pricing.
When to use: Regional pricing, localized content, geo-restricted data.
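An illustrative call; the parameter names are assumptions based on the country-code and language settings described above:

```javascript
{
  url: "https://example.com/store",
  country: "DE",
  language: "de"
}
```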
18. deep_research (10 credits) -- Multi-source research
When to use: Research projects, due diligence, market analysis.
{
topic: "quantum computing commercialization",
maxUrls: 50,
enableSourceVerification: true,
enableConflictDetection: true
}
// Returns: { synthesis, sources[], conflicts[], citations[] }
Part 4: Integration Guide
Claude Code Setup
# 1. Install CrawlForge MCP server
npm install -g crawlforge-mcp-server
# 2. Run setup wizard
npx crawlforge-setup
# 3. Add to Claude Code
claude
> /mcp add crawlforge npx crawlforge-mcp-server
# 4. Verify
> /mcp list
# Should show: crawlforge (18 tools)
Claude Desktop Setup
Edit your config file:
macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
Windows: %APPDATA%\Claude\claude_desktop_config.json
{
"mcpServers": {
"crawlforge": {
"command": "npx",
"args": ["crawlforge-mcp-server"],
"env": {
"CRAWLFORGE_API_KEY": "cf_live_your_key_here"
}
}
}
}
Custom Application (MCP SDK)
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";
const transport = new StdioClientTransport({
command: "npx",
args: ["crawlforge-mcp-server"],
env: { CRAWLFORGE_API_KEY: process.env.CRAWLFORGE_API_KEY }
});
const client = new Client({ name: "my-app", version: "1.0.0" });
await client.connect(transport);
// List available tools
const tools = await client.listTools();
console.log(`Available tools: ${tools.tools.length}`);
// Call a tool
const result = await client.callTool({
name: "fetch_url",
arguments: { url: "https://example.com" }
});
Part 5: Best Practices
Credit Optimization
Pick the cheapest tool that gets the job done:
| Goal | Expensive | Efficient |
|---|---|---|
| Check if page exists | deep_research (10) | fetch_url (1) |
| Get article text | scrape_with_actions (5) | extract_content (2) |
| Find competitor URLs | search_web x 10 (50) | extract_links (1) |
| Scrape 20 product pages | fetch_url x 20 (20) | batch_scrape (5) |
Error Handling
// fetchUrl and stealthMode are thin wrappers around the corresponding
// MCP tool calls; sleep(ms) resolves after ms milliseconds.
async function robustFetch(url, attempt = 1) {
  try {
    const result = await fetchUrl(url);
    if (result.status >= 400) {
      // Try stealth mode for blocked requests (403, 429, etc.)
      return await stealthMode(url);
    }
    return result;
  } catch (error) {
    if (attempt >= MAX_RETRIES) throw error; // don't retry forever
    await sleep(RETRY_DELAY_MS * attempt);   // linear backoff
    return robustFetch(url, attempt + 1);
  }
}
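The retry half of this pattern can be factored into a small self-contained helper. A sketch: the function you pass in stands for whichever tool call you are wrapping, and the attempt and delay defaults are illustrative.

```typescript
// Generic retry with linear backoff for any async operation.
// Pass in the call you want to retry (e.g. a wrapper around fetch_url).
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt >= maxAttempts) throw error; // out of attempts: surface the error
      // Linear backoff: wait a little longer after each failure
      await new Promise((resolve) => setTimeout(resolve, baseDelayMs * attempt));
    }
  }
}
```

Wrap any tool call as `await withRetry(() => fetchUrl(url))`; switching to exponential backoff (`baseDelayMs * 2 ** (attempt - 1)`) is a one-line change.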
Rate Limiting
Respect target sites:
// Good: Reasonable delays
for (const url of urls) {
await scrape(url);
await sleep(1000 + Math.random() * 2000); // 1-3s delay
}
// Better: Use batch_scrape with built-in rate limiting
await batchScrape(urls, { delayBetweenRequests: 1500 });
Caching
Don't scrape the same URL twice:
const cache = new Map<string, ScrapedContent>();
async function smartScrape(url: string) {
if (cache.has(url)) return cache.get(url);
const result = await fetchUrl(url);
cache.set(url, result);
return result;
}
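If results can go stale (prices, news), a time-to-live keeps the cache honest. A minimal sketch; the 10-minute TTL is an arbitrary choice:

```typescript
// Cache entries expire after a TTL so stale scrapes get refetched.
interface CacheEntry<T> {
  value: T;
  expiresAt: number;
}

class TtlCache<T> {
  private entries = new Map<string, CacheEntry<T>>();
  constructor(private ttlMs: number) {}

  get(key: string): T | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (Date.now() > entry.expiresAt) {
      this.entries.delete(key); // expired: drop it and report a miss
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: T): void {
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
  }
}

const pageCache = new TtlCache<string>(10 * 60 * 1000); // 10-minute TTL
```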
Part 6: Getting Started
npm install -g crawlforge-mcp-server
npx crawlforge-setup
What 1,000 Credits Gets You
| Use Case | Tool | Credits | Capacity |
|---|---|---|---|
| Basic scraping | fetch_url | 1 | 1,000 pages |
| Article extraction | extract_content | 2 | 500 articles |
| Site mapping | map_site | 2 | 500 sites |
| Batch jobs | batch_scrape | 5 | 200 batches |
| Research projects | deep_research | 10 | 100 topics |
Key Takeaways
- MCP is the standard -- All major AI assistants support it
- CrawlForge leads with 18 tools -- 4x more than alternatives
- Start simple -- Use fetch_url (1 credit) before reaching for advanced tools
- Combine tools -- Chain operations for powerful workflows
- Be ethical -- Respect robots.txt and rate limits