Perufitlife

How I Built a 12-Tool MCP Server for AI Agents in 4 Hours (and What It Taught Me About 2026 Scraping)

The Model Context Protocol (MCP) standard is becoming the default way to expose tools to LLMs in 2026. Anthropic, OpenAI, and every major editor vendor — Cursor, Cline, Continue, Windsurf — now ship with MCP support. If you have any API or scraper, MCP-ifying it is a free distribution channel into every AI client on the market.

I had 12 web scrapers running on Apify Store. Last weekend I bundled them into a single MCP server. Total time: about 4 hours. Here is what I learned, including the bug that almost shipped silently broken.

Why MCP changes the economics of scrapers

A scraper sitting on Apify Store competes with thousands of other scrapers on a single SEO surface. A scraper exposed as an MCP tool is callable directly from inside Claude, ChatGPT, Cursor, or any agent loop — without the user ever leaving their editor.

The shift matters because the buyer changes. The Apify Store buyer is a developer evaluating scrapers manually. The MCP buyer is an LLM choosing the right tool from a registry of dozens. If your tool description, parameters, and output format are clean, you get picked.
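What "clean" looks like in practice: a description written for a tool-picking model states the verb, the units, and the limits up front. A hypothetical descriptor (tool name, fields, and defaults here are illustrative, not the actual server's):

```javascript
// Hypothetical descriptor for one scraper, written for an LLM tool-picker
// rather than a human: concrete verb, explicit units, stated defaults.
const ebaySearchTool = {
  name: 'ebay_search',
  description:
    'Search eBay listings by keyword. Returns up to `limit` items with ' +
    'title, price (USD), condition, and listing URL. Use for price ' +
    'research or availability checks; not for auction bidding.',
  inputSchema: {
    type: 'object',
    properties: {
      query: { type: 'string', description: 'Search keywords, e.g. "thinkpad x220"' },
      limit: { type: 'integer', description: 'Max results (default 20, max 100)' },
    },
    required: ['query'],
  },
};
```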

The transport bug that almost shipped

MCP has two transports: legacy SSE (GET /sse + POST /messages) and modern Streamable HTTP (POST /mcp). I implemented SSE first and tested with curl — works. Pushed it.

Then I ran an end-to-end test from Claude Desktop. Failure: Cannot POST /mcp.

Modern clients (Claude Desktop, Cursor, the MCP Inspector) all expect Streamable HTTP. SSE alone is not enough in 2026. The fix is to mount both transports on the same server, with session ID management for the modern one:

```javascript
import { randomUUID } from 'node:crypto';
import { StreamableHTTPServerTransport } from '@modelcontextprotocol/sdk/server/streamableHttp.js';
import { SSEServerTransport } from '@modelcontextprotocol/sdk/server/sse.js';

// One Streamable HTTP transport per session, keyed by the mcp-session-id header.
const httpSessions = new Map();

app.post('/mcp', async (req, res) => {
    const sessionId = req.headers['mcp-session-id'];
    let transport;

    if (sessionId && httpSessions.has(sessionId)) {
        // Existing session: reuse its transport.
        transport = httpSessions.get(sessionId);
    } else if (!sessionId && req.body?.method === 'initialize') {
        // New session: the first request must be an initialize call.
        transport = new StreamableHTTPServerTransport({
            sessionIdGenerator: () => randomUUID(),
            onsessioninitialized: (id) => httpSessions.set(id, transport),
        });
        const server = makeServer();
        await server.connect(transport);
    } else {
        return res.status(400).json({
            jsonrpc: '2.0',
            error: { code: -32000, message: 'Bad Request: invalid session' },
            id: null,
        });
    }

    await transport.handleRequest(req, res, req.body);
});
```

Lesson: never trust your MCP server until you have run an initialize → tools/list handshake against it from an actual MCP client. curl is not enough.
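That handshake boils down to two JSON-RPC messages, and a server that only speaks SSE rejects the very first POST. A sketch of the payloads a client sends (the protocolVersion value here is illustrative of the dated version strings MCP uses):

```javascript
// Message 1: initialize opens the session. The server's response carries
// the mcp-session-id header that must be echoed on every later request.
const initialize = {
  jsonrpc: '2.0',
  id: 1,
  method: 'initialize',
  params: {
    protocolVersion: '2025-03-26', // illustrative dated version string
    capabilities: {},
    clientInfo: { name: 'smoke-test', version: '0.0.1' },
  },
};

// Message 2: tools/list is what makes your scrapers visible to the model.
const listTools = { jsonrpc: '2.0', id: 2, method: 'tools/list', params: {} };
```

Both go to POST /mcp with Content-Type: application/json; if the second one comes back with your tool catalog, the transport actually works.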

The s-card change that broke every eBay scraper in 2026

eBay quietly migrated their search results from .s-item to .s-card selectors earlier this year. Every scraper on the market silently returned 0 items for weeks. I noticed because my own logs showed extracted 0 consistently, even though the page was rendering normally.

The fix is one line, but the lesson is bigger: scrapers need a self-test that catches "0 items extracted" as an error, not a success. Apify's Actor.fail() plus a smoke run on every build prevents the silent decay.

```javascript
// eBay's 2026 markup: .s-card replaced .s-item. Treat an empty result
// set as a failure so selector drift can't ship silently.
const items = $('.s-card').toArray().map(parseCard);
if (items.length === 0) {
    throw new Error('Selector returned 0 items — likely DOM change');
}
```

When Cloudflare Turnstile wins

I tried hard to scrape Yelp business listings. Cloudflare Turnstile blocks the request before any DOM loads. I tried Camoufox (a stealth Firefox fork), residential proxies, mouse jitter, even logged-in cookies. Everything resulted in the same "Just a moment..." page in a loop.

After three hours I pivoted to the Yelp Fusion API. Free tier: 5000 calls per day. The trick is the BYOK pattern (Bring Your Own Key): each user supplies their own free API key as a tool parameter. The actor doesn't store anyone's key, doesn't pay for API calls, and ships in 30 minutes.

If a target has a generous free API, use the API. If not, fight Cloudflare. Don't fight both.
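The BYOK wiring is small: the key arrives as a tool parameter and goes straight into the request, never into storage. A sketch against the real Fusion search endpoint (the function and parameter names are the actor's own, not Yelp's):

```javascript
// BYOK sketch: build a Yelp Fusion search request from a user-supplied key.
// https://api.yelp.com/v3/businesses/search is the real Fusion route.
function buildYelpSearch(apiKey, term, location, limit = 20) {
  const url = new URL('https://api.yelp.com/v3/businesses/search');
  url.searchParams.set('term', term);
  url.searchParams.set('location', location);
  url.searchParams.set('limit', String(limit));
  return {
    url: url.toString(),
    headers: { Authorization: `Bearer ${apiKey}` }, // user's own free-tier key
  };
}
```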

Camoufox memory: 1GB is not enough

When I did need stealth (Glassdoor, before I gave up), Camoufox in 1GB containers crashed mid-page. The browser silently OOM'd and Playwright reported the page as "still loading" forever.

Bumping the Apify actor's memory to 4GB fixed it. The browser process alone consumes 1.5–2GB at steady state with one tab; anything below 4GB is gambling. This is a thing nobody documents until you spend two hours debugging an "unkillable Cloudflare challenge" that turned out to be a dead browser.
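Since a dead browser never throws, the practical defense is a hard deadline around every browser call, so the hang surfaces as a diagnosable timeout instead of an infinite wait. A minimal sketch (the Playwright call in the usage comment is illustrative):

```javascript
// Wrap any browser operation in a deadline so a silently-OOMed browser
// shows up as a timeout error instead of a page that "loads" forever.
function withTimeout(promise, ms, label = 'operation') {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms}ms: browser may have OOMed`)),
      ms,
    );
  });
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}

// Usage (illustrative):
// await withTimeout(page.goto(url, { waitUntil: 'domcontentloaded' }), 60_000, 'goto');
```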

The BYOK model for distribution

For the MCP server itself, I didn't want users billing me for compute. The pattern that works:

  1. Apify hosts the actor in Standby mode (the actor wakes on HTTP request).
  2. The user authenticates via their own Apify token: https://renzomacar--multi-scraper-mcp.apify.actor/mcp?token={apifyToken}.
  3. Smithery.ai (an MCP registry) forwards the user's token through the gateway as a query parameter, so neither I nor Smithery sees it persisted.
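On the actor side, the token only ever lives in the request URL. A sketch of the extraction (the placeholder base URL exists only to satisfy the URL parser for path-relative request URLs):

```javascript
// Pull the per-user Apify token out of the request's query string
// without persisting it anywhere.
function extractToken(reqUrl) {
  const { searchParams } = new URL(reqUrl, 'http://placeholder');
  return searchParams.get('token');
}
```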

Result: the user pays Apify for their own compute, I pay nothing for distribution, and any LLM client gets 12 scrapers instantly.

What I would do differently

  • Validate both MCP transports from day one.
  • Add a "0-items" failure check on every scraper.
  • Default to 4GB on any actor running a real browser.
  • Pivot to APIs faster when Cloudflare Enterprise is in front.
  • Treat the tool description as a sales pitch to a model, not a developer.

The MCP server is open source: github.com/Perufitlife/multi-scraper-mcp

The 12 tools cover Reddit, Amazon, eBay, Google Maps (businesses + reviews), Yelp (Fusion API + events), YouTube, TikTok, Indeed, and a few more. If you have your own MCP server to compare notes on, drop a comment.
