Large Language Models (LLMs) operate in a vacuum. To build autonomous agents that perform market research, track public pricing across e-commerce sites, or analyze real estate listings, you must provide them with real-time access to the web. Static Retrieval-Augmented Generation (RAG) is insufficient for data that changes hourly. Agents need the ability to reach out, fetch current pages, and read the contents.
The Model Context Protocol (MCP) standardizes how AI models connect to external tools. Instead of writing custom tool-calling logic for every agent framework (LangChain, LlamaIndex, AutoGen), you write an MCP server once. Any MCP-compatible client—including Claude Desktop—can then discover and execute your tools automatically.
This tutorial demonstrates how to build an MCP server that gives your AI agents the ability to read the web. We will build a Python-based server that exposes a single tool for data extraction, utilizing an external infrastructure layer to handle headless browsers and proxy rotation.
## The Architecture of Agentic Scraping
When an agent needs real-time data, it enters a standard tool-calling loop. The MCP architecture cleanly separates the reasoning engine from the execution environment.
By isolating the extraction logic within an MCP server, your agent does not need to know about timeouts, HTTP headers, or network retries. It simply requests a URL and receives text.
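To make that separation concrete, here is a rough, framework-agnostic sketch of the loop from the agent's side. Every name in it (`llm.generate`, `reply.tool_calls`, `tool_client`) is a hypothetical stand-in for whatever framework and MCP client you use; the point is that the model only ever sees tool schemas, arguments, and clean text results.

```python title="agent_loop_sketch.py"
# Hypothetical sketch: `llm` is your model wrapper, `tool_client` wraps an MCP session.
def agent_loop(llm, tool_client, user_goal: str) -> str:
    messages = [{"role": "user", "content": user_goal}]
    tools = tool_client.list_tools()  # Tool schemas discovered from the MCP server
    while True:
        reply = llm.generate(messages, tools=tools)
        if not reply.tool_calls:
            return reply.text  # The model answered without needing the web
        for call in reply.tool_calls:
            # Execution happens inside the MCP server; the model never sees
            # proxies, headers, or retries, only the returned text.
            result = tool_client.call_tool(call.name, call.arguments)
            messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
```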
## Core Concept: Preparing Data for the Context Window
Before writing the server, we must address the most common failure point in agentic scraping: token limits.
Raw HTML from modern single-page applications is bloated with inline CSS, SVG paths, and minified JavaScript. Feeding an 800KB HTML file into an agent's context window will instantly exhaust token limits and degrade the model's reasoning capabilities.
The solution is converting HTML into clean Markdown before returning it to the agent. This strips the structural noise while preserving the semantic hierarchy (headings, links, tables) that the LLM needs to understand the page structure.
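The extraction API used later in this tutorial performs this conversion server-side when you request `format="markdown"`, but if you ever need to do it yourself, a library such as `markdownify` covers the basic case. This is a minimal sketch with placeholder HTML, not a complete cleaning pipeline.

```python title="html_to_markdown.py"
# Minimal illustration of the HTML -> Markdown cleanup step.
from markdownify import markdownify as md

raw_html = """
<html><body>
  <h1>Pricing</h1>
  <p>The Pro plan starts at <b>$29</b>/month. See <a href="/docs">the docs</a>.</p>
</body></html>
"""

clean_markdown = md(raw_html, heading_style="ATX")
print(clean_markdown)
# Roughly:
# # Pricing
# The Pro plan starts at **$29**/month. See [the docs](/docs).
```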
## Data Extraction: cURL vs. Python
To implement the extraction, we use AlterLab. When your agent requests a URL, the MCP server will fire an API request to fetch the cleaned data.
Here is the exact same extraction operation demonstrated in both cURL and Python. Notice the `format="markdown"` parameter, which is critical for LLM consumption.
```bash title="Terminal"
# cURL Implementation
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "format": "markdown",
    "render_js": true
  }'
```
```python title="scraper_test.py" {8-10}
# Python SDK Implementation
import alterlab

client = alterlab.Client("YOUR_API_KEY")

# The highlight below shows the critical LLM-optimization parameters
response = client.scrape(
    url="https://example.com",
    format="markdown",
    render_js=True
)
print(response.text)
```
If you are building complex data pipelines, the Python SDK documentation provides advanced configuration options for specific site architectures.
## Building the MCP Server
We will use the official `mcp` Python package provided by Anthropic. This package abstracts away the JSON-RPC messages and standard I/O handling, allowing you to define tools using standard Python decorators and type hints.
### Prerequisites
Initialize a new Python project and install the required dependencies:
```bash title="Terminal"
mkdir agent-scraper-mcp
cd agent-scraper-mcp
python -m venv venv
source venv/bin/activate
pip install mcp alterlab pydantic
```
### The Server Code
Create a file named `server.py`. This script initializes the MCP server and registers the web scraping tool. The descriptive docstrings inside the tool definition are critical—the MCP protocol passes these descriptions directly to the LLM so it knows *when* and *how* to use the tool.
```python title="server.py" {22-30,40-43}
import os

import alterlab
from mcp.server.fastmcp import FastMCP
from pydantic import BaseModel, Field

# Initialize the MCP Server
mcp = FastMCP("WebScraper")

# Initialize the extraction client
# Ensure ALTERLAB_API_KEY is set in your environment variables
api_key = os.environ.get("ALTERLAB_API_KEY")
if not api_key:
    raise ValueError("ALTERLAB_API_KEY environment variable is missing.")

client = alterlab.Client(api_key)

# The docstring and type hints below are sent to the LLM.
# Write them as instructions to the AI agent.
@mcp.tool()
def scrape_public_url(url: str, render_js: bool = True) -> str:
    """
    Extracts readable text from a publicly accessible URL.
    Use this tool when you need to read the current contents of a webpage.
    Returns the page content formatted as Markdown.

    Args:
        url: The full HTTP/HTTPS URL of the target page.
        render_js: Set to False only if you know the site is static HTML.
    """
    try:
        # Highlighting the actual extraction execution
        response = client.scrape(
            url=url,
            format="markdown",
            render_js=render_js
        )

        # Guardrail against overly massive pages
        content = response.text
        if len(content) > 100000:
            return content[:100000] + "\n\n...[Content truncated for length]..."
        return content
    except Exception as e:
        return f"Error extracting data from {url}: {str(e)}"

if __name__ == "__main__":
    # Run the server using Standard I/O transport
    mcp.run(transport='stdio')
```
## Handling Anti-Bot and Dynamic Content
You might wonder why we don't just use Python's `requests` library inside the MCP tool.
When agents operate autonomously, they frequently encounter Cloudflare challenges, DataDome blocks, and pages that require extensive JavaScript rendering to populate the DOM. If your agent's `requests.get()` call returns a 403 Forbidden or an empty HTML skeleton, the agent will hallucinate an answer based on the failure message or simply crash the workflow.
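As a point of comparison, here is roughly what the naive approach looks like. The URL is a placeholder; on a protected or JavaScript-heavy site, the two failure modes described above are exactly what comes back.

```python title="naive_fetch.py"
# The naive approach the MCP tool deliberately avoids.
import requests

resp = requests.get("https://example.com/pricing", timeout=10)
print(resp.status_code)  # Frequently 403 on bot-protected sites
print(len(resp.text))    # Or a 200 with an empty SPA shell: <div id="root"></div>
```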
By delegating the extraction to an infrastructure layer with robust anti-bot handling, the MCP server guarantees that the agent receives the actual page content. The agent focuses purely on semantic reasoning, while the API handles proxy rotation, headless browser management, and fingerprinting.
## Connecting the Server to an Agent
MCP servers typically communicate over Standard I/O (stdio). This means the agent framework spawns the server as a subprocess and communicates via standard input and output streams.
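If you want to drive the server from your own agent code rather than a desktop client, the `mcp` package ships a stdio client. The sketch below is a minimal smoke test; the paths and the scrape target are placeholders, and the exact result types may vary slightly with your installed SDK version.

```python title="client_test.py"
import asyncio

from mcp import ClientSession, StdioServerParameters
from mcp.client.stdio import stdio_client

async def main():
    # Spawn server.py as a subprocess and talk to it over stdin/stdout
    params = StdioServerParameters(
        command="python",
        args=["server.py"],
        env={"ALTERLAB_API_KEY": "your_api_key_here"},
    )
    async with stdio_client(params) as (read, write):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print([t.name for t in tools.tools])  # Expect ['scrape_public_url']
            result = await session.call_tool(
                "scrape_public_url",
                arguments={"url": "https://example.com"},
            )
            print(result.content[0].text[:500])

asyncio.run(main())
```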
### Testing with Claude Desktop
The easiest way to test your new server is by plugging it into Claude Desktop. You need to modify Claude's configuration file to point to your Python script.
Configuration file locations:

- macOS: `~/Library/Application Support/Claude/claude_desktop_config.json`
- Windows: `%APPDATA%\Claude\claude_desktop_config.json`

Add your server to the `mcpServers` object:
```json title="claude_desktop_config.json" {6-9}
{
  "mcpServers": {
    "web-scraper": {
      "command": "/path/to/your/agent-scraper-mcp/venv/bin/python",
      "args": [
        "/path/to/your/agent-scraper-mcp/server.py"
      ],
      "env": {
        "ALTERLAB_API_KEY": "your_api_key_here"
      }
    }
  }
}
```
Restart Claude Desktop. You will now see a small "plug" icon indicating available tools. You can issue prompts like:
> "Read the documentation at https://docs.python.org/3/library/asyncio.html and summarize the latest changes to the TaskGroup API."
Claude will recognize that it lacks real-time knowledge of that URL, invoke the `scrape_public_url` tool via MCP, wait for the Markdown response, and formulate a correct, grounded answer based on the live page content.
## Production Considerations for Agentic Pipelines
When moving from local testing to production agent deployments (e.g., deploying on AWS or running background workers with LangGraph), keep these architectural principles in mind:
1. **Timeout Management:** Web extraction can take anywhere from 1 to 15 seconds depending on the target's rendering complexity. Ensure your MCP client and the surrounding LLM API calls have appropriate timeout buffers configured (see the first sketch after this list).
2. **Context Window Protection:** The truncation logic in the `server.py` snippet (`content[:100000]`) is critical. Unbounded scraping returns will trigger `context_length_exceeded` errors from your LLM provider; a token-aware variant is sketched after this list.
3. **Structured Data:** If your agent specifically needs JSON output instead of Markdown, you can define a secondary tool in your MCP server (`extract_structured_data`) and utilize Cortex AI to map the DOM into a predefined JSON schema. Read the [API docs](https://alterlab.io/docs) for implementation details on schema enforcement.
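For the first point, one option is to run the blocking scrape call in a worker thread and enforce a hard deadline on the server side. This is a sketch layered on the `client` object from `server.py`; if your extraction SDK exposes its own timeout parameter, prefer that instead.

```python title="timeout_guard.py"
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# A small shared pool, so a timed-out job does not block this function from returning.
_pool = ThreadPoolExecutor(max_workers=4)

def scrape_with_deadline(url: str, timeout_s: float = 20.0) -> str:
    """Enforce a hard upper bound so a slow render never stalls the agent loop."""
    future = _pool.submit(client.scrape, url=url, format="markdown", render_js=True)
    try:
        return future.result(timeout=timeout_s).text
    except FutureTimeout:
        # Note: the abandoned worker keeps running in the background until it finishes.
        return f"Error: extraction of {url} exceeded {timeout_s} seconds."
```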
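For the second point, a raw character cap is a blunt instrument; counting tokens gets closer to the real budget. The sketch below uses `tiktoken`, which approximates rather than matches Claude's tokenizer, so treat the limit as a conservative estimate.

```python title="token_budget.py"
import tiktoken

def truncate_to_token_budget(markdown: str, max_tokens: int = 24000) -> str:
    """Trim scraped Markdown to an approximate token budget instead of raw characters."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(markdown)
    if len(tokens) <= max_tokens:
        return markdown
    return enc.decode(tokens[:max_tokens]) + "\n\n...[Content truncated for length]..."
```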
| Quick reference | Value |
|---|---|
| Optimal LLM format | Markdown |
| Local MCP transport | stdio |
| Suggested char limit | 100k |
## Takeaways
Building an MCP server bridges the gap between static LLM reasoning and real-time internet data.
- Use the Model Context Protocol to write tool definitions once, allowing any compliant agent framework to discover and use your extraction capabilities.
- Never feed raw HTML into an agent. Always convert to Markdown to preserve context windows and reduce token costs.
- Offload browser management and proxy rotation to dedicated infrastructure so your AI agents can focus strictly on reasoning and analysis.
By implementing this architecture, you transform isolated language models into capable, internet-aware research assistants.