Build an MCP Server for Real-Time LLM Web Scraping

#python #ai #aiagents #llm

## TL;DR

Large Language Models need live data to avoid hallucinations, but feeding them raw HTML exhausts context windows quickly. The Model Context Protocol (MCP) standardizes how AI agents access external tools. By pairing an MCP server with a headless scraping API, you can ground your agents with real-time, token-efficient Markdown representations of any public web page.

## The Context Window Problem

When you ask an AI agent to analyze a live public URL, it needs to fetch that page. The naive approach is to issue an HTTP GET request and paste the response body directly into the prompt context.

This fails immediately on modern websites.

A typical e-commerce product page or news article can exceed 2 MB of raw HTML, inline CSS, and base64-encoded images. At roughly four characters per token, that is about 500,000 tokens. Pushing that into an LLM context window is slow and expensive, and it degrades the model's ability to reason about the actual content: models tend to miss details buried in the middle of long inputs (the "lost in the middle" effect), and here the key data points are buried in navigation boilerplate.
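The arithmetic behind that figure can be sketched in a few lines. This assumes the common rule of thumb of about four characters per token for mostly-ASCII text; the real ratio depends on the tokenizer and the content, and the 20 KB Markdown size is a hypothetical number for comparison only.

```python
def estimate_tokens(num_bytes: int, chars_per_token: float = 4.0) -> int:
    """Rough token estimate for mostly-ASCII text (one byte is about one character)."""
    return round(num_bytes / chars_per_token)


raw_html_bytes = 2_000_000  # the ~2 MB page described above
markdown_bytes = 20_000     # hypothetical size of the same page as clean Markdown

print(estimate_tokens(raw_html_bytes))  # 500000
print(estimate_tokens(markdown_bytes))  # 5000
```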

You need a middleware layer. It must fetch the page, execute the JavaScript required to render the DOM, strip the boilerplate, convert the core content into clean Markdown, and hand only that string back to the model.
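The strip-and-convert half of that pipeline can be illustrated with the standard library alone. This is a minimal sketch, not how any particular scraping API works: it assumes the HTML has already been rendered, and the tag sets (`SKIP_TAGS`, `BLOCK_TAGS`) and the `html_to_markdown` helper are illustrative names chosen here.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}
HEADING_PREFIX = {"h1": "# ", "h2": "## ", "h3": "### "}
BLOCK_TAGS = {"p", "li", "h1", "h2", "h3"}


class MarkdownExtractor(HTMLParser):
    """Collect visible text from content tags and skip boilerplate containers."""

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0            # > 0 while inside a skipped container
        self.blocks: list[str] = []    # finished Markdown blocks
        self.current: list[str] = []   # text fragments of the block being built
        self.prefix = ""               # Markdown prefix for the current block

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS and self.skip_depth == 0:
            self._flush()
            self.prefix = HEADING_PREFIX.get(tag, "- " if tag == "li" else "")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS and self.skip_depth == 0:
            self._flush()

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.current.append(data)

    def _flush(self) -> None:
        # Join fragments and collapse runs of whitespace into single spaces.
        text = " ".join("".join(self.current).split())
        if text:
            self.blocks.append(self.prefix + text)
        self.current = []
        self.prefix = ""


def html_to_markdown(html: str) -> str:
    parser = MarkdownExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()
    return "\n\n".join(parser.blocks)


sample = (
    "<html><head><style>body{margin:0}</style></head><body>"
    "<nav><a href='/'>Home</a><a href='/cart'>Cart</a></nav>"
    "<h1>Blue Widget</h1>"
    "<p>A <b>durable</b> widget.</p>"
    "<script>track()</script>"
    "<footer>Terms of service</footer>"
    "</body></html>"
)
print(html_to_markdown(sample))
```

Only the heading and the paragraph survive; the navigation, script, style, and footer text never reach the model. Real pages need a far more careful content-extraction step, which is the part the article delegates to an API.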

## Enter the Model Context Protocol (MCP)

MCP is an open standard that defines how LLM applications communicate with external tools. Instead of writing custom tool-calling glue for every provider (OpenAI, Anthropic, Google), you write one MCP server. Any compatible client, such as Claude Desktop, can connect to your server and discover its capabilities.

We will build a Python-based MCP server that exposes a single tool: `read_webpage`. When the LLM decides it needs information from a URL, it calls this tool.

## Setting Up the Project

Create a new Python virtual environment and install the dependencies. We need the official MCP SDK (`mcp`) and the `alterlab` Python SDK, which handles the browser rendering and Markdown formatting.

```bash title="Terminal"
python -m venv venv
source venv/bin/activate
pip install mcp alterlab
```




## Building the MCP Server

This server uses MCP's stdio transport: the AI client launches the server as a subprocess and exchanges JSON-RPC messages with it over stdin and stdout. Because stdout carries the protocol stream, send any debug logging to stderr.
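To make the message flow concrete, here is roughly what one tool call looks like on the wire. The method name and result fields follow the MCP specification; the URL and the Markdown text are placeholder values.

```python
import json

# A tools/call request as the client would write it to the server's stdin.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "read_webpage",
        "arguments": {"url": "https://example.com"},
    },
}

# The stdio transport sends each message as a single line of JSON.
wire_line = json.dumps(request) + "\n"

# A successful response as the server would write it to stdout.
# The "id" matches the request so the client can pair them up.
response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "content": [{"type": "text", "text": "# Example Domain (placeholder Markdown)"}],
        "isError": False,
    },
}

print(wire_line, end="")
```

The SDK builds and parses these messages for you; the sketch only shows what the decorators in the server code map onto.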

Create a file for your server implementation.



```python title="mcp_server.py" {16-24}
import asyncio
import os

import alterlab
from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp.types import Tool, TextContent, CallToolResult

# Initialize the MCP server
app = Server("web-reader")

# Initialize the extraction client
api_key = os.environ.get("ALTERLAB_API_KEY")
client = alterlab.Client(api_key)

@app.list_tools()
async def list_tools() -> list[Tool]:
    """Define the tools available to the LLM."""
    return [
        Tool(
            name="read_webpage",
            description="Extracts the main content of a public web page and returns clean Markdown. Use this to read articles, documentation, or public data.",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {
                        "type": "string",
                        "description": "The absolute URL to read"
                    }
                },
                "required": ["url"]
            }
        )
    ]
```

The `@app.list_tools()` decorator registers the handler that advertises the tool. The `inputSchema` is standard JSON Schema. The LLM reads the `description` and the schema to decide when and how to call the tool.
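The schema tells the model what to send, but the server should not assume every call complies. As a hedged sketch, a small guard can enforce the "absolute URL" requirement before any network request is made; `validate_url_argument` is a hypothetical helper, not part of the MCP SDK.

```python
from urllib.parse import urlparse


def validate_url_argument(arguments: dict) -> str:
    """Return the URL if it is an absolute http(s) URL, otherwise raise ValueError."""
    url = arguments.get("url")
    if not isinstance(url, str) or not url:
        raise ValueError("'url' is required and must be a non-empty string")
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"'url' must be an absolute http(s) URL, got: {url!r}")
    return url


print(validate_url_argument({"url": "https://example.com/docs"}))
```

Restricting the scheme to `http` and `https` also keeps the tool from being pointed at `file://` paths or other local resources.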

Next, implement the tool execution logic.

```python title="mcp_server.py" {8-12}
@app.call_tool()
async def call_tool(name: str, arguments: dict) -> CallToolResult:
    """Handle incoming tool executions from the LLM."""
    if name != "read_webpage":
        raise ValueError(f"Unknown tool: {name}")

    url = arguments.get("url")
    if not url:
        return CallToolResult(
            content=[TextContent(type="text", text="Error: URL is required")],
            isError=True,
        )

    try:
        # Request Markdown format directly to save tokens
        response = client.scrape(
            url,
            formats=["markdown"],
            wait_for_network_idle=True
        )

        markdown_content = response.markdown

        # Failsafe for empty extraction
        if not markdown_content:
            markdown_content = "The page was fetched, but no readable content was found. It may be an image or require authentication."

        return CallToolResult(
            content=[TextContent(type="text", text=markdown_content)]
        )

    except Exception as e:
        # Always return errors as text to the LLM so it can adjust and retry;
        # isError marks the result as a failed tool call
        return CallToolResult(
            content=[TextContent(type="text", text=f"Failed to extract page: {e}")],
            isError=True,
        )

async def main():
    async with stdio_server() as (read_stream, write_stream):
        await app.run(read_stream, write_stream, app.create_initialization_options())

if __name__ == "__main__":
    asyncio.run(main())
```




Notice the error handling block. When building MCP servers, you should rarely raise exceptions that crash the server process. If an extraction fails due to a 404 or a timeout, return that error as a string in the `CallToolResult`. The LLM reads this error, understands the request failed, and can attempt a different URL or strategy.
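If the server grows beyond one tool, that pattern can be factored into a wrapper so each handler does not repeat the `try`/`except`. The sketch below uses plain dictionaries shaped like a tool result so it runs without the MCP SDK; `errors_as_text` and `flaky_tool` are hypothetical names.

```python
import asyncio
import functools
from typing import Awaitable, Callable


def errors_as_text(handler: Callable[..., Awaitable[dict]]) -> Callable[..., Awaitable[dict]]:
    """Wrap a tool handler so any exception becomes an error result instead of a crash."""

    @functools.wraps(handler)
    async def wrapper(*args, **kwargs) -> dict:
        try:
            return await handler(*args, **kwargs)
        except Exception as exc:
            # Include the exception type so the model can tell a timeout from a 404.
            message = f"Failed: {type(exc).__name__}: {exc}"
            return {"content": [{"type": "text", "text": message}], "isError": True}

    return wrapper


@errors_as_text
async def flaky_tool(url: str) -> dict:
    # Stand-in for a scrape call that times out.
    raise TimeoutError(f"timed out fetching {url}")


result = asyncio.run(flaky_tool("https://example.com"))
print(result["content"][0]["text"])
```

The process stays alive and the model receives a readable failure message it can react to.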

### The cURL Alternative

If you prefer building your MCP server in another language like Node.js or Go, you do not need an SDK. You can make standard HTTP requests to the REST endpoint. Here is the exact equivalent using cURL to show the underlying request structure:



```bash title="Terminal" {3-5}
curl -X POST https://api.alterlab.io/v1/scrape \
  -H "X-API-Key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/public-dataset",
    "formats": ["markdown"],
    "wait_for_network_idle": true
  }'
```

The `"formats": ["markdown"]` field (`formats=["markdown"]` in the Python SDK) is the crucial piece for token efficiency. It instructs the API to strip the navigation, footers, scripts, and CSS, returning only the semantic core of the page.
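The same request can be assembled with any HTTP client. As a dependency-free sketch, this builds it with Python's standard library, using the endpoint, headers, and body from the cURL example. It deliberately stops short of sending the request, and it makes no assumption about the response's field names, which the article does not show; `build_scrape_request` is an illustrative helper name.

```python
import json
import urllib.request

API_URL = "https://api.alterlab.io/v1/scrape"  # endpoint from the cURL example


def build_scrape_request(url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) the POST request shown in the cURL example."""
    payload = {
        "url": url,
        "formats": ["markdown"],
        "wait_for_network_idle": True,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"X-API-Key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


req = build_scrape_request("https://example.com/public-dataset", "YOUR_API_KEY")
print(req.get_method(), req.full_url)
# To send it: urllib.request.urlopen(req, timeout=30)
```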

## Handling JavaScript and Anti-Bot Measures

A common mistake when building web reading tools for LLMs is using a plain HTTP library like `requests` or `axios`.

Many modern websites ship an empty HTML shell and render the actual content via React, Vue, or Angular. A simple GET request returns a blank page to your LLM. Furthermore, public databases and directories often employ aggressive rate limiting or basic bot protection challenges.

To ensure your LLM actually receives the data, your server must execute JavaScript and handle browser challenges. In the Python code above, setting `wait_for_network_idle=True` tells the headless browser to wait until in-flight network requests (XHR and fetch calls) have settled before the DOM is extracted. Proxy rotation and bot-challenge handling are taken care of by the anti-bot layer built into the API.
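How the API decides that the network is idle is not described here. A common definition in headless-browser tooling is "no in-flight requests for a short quiet window", and the sketch below models only that idea. `NetworkIdleTracker` and its half-second default are illustrative assumptions, with timestamps passed in explicitly so the logic is easy to follow.

```python
class NetworkIdleTracker:
    """Idle means: no in-flight requests for at least `quiet_seconds`."""

    def __init__(self, quiet_seconds: float = 0.5) -> None:
        self.quiet_seconds = quiet_seconds
        self.in_flight = 0
        self.last_activity = 0.0

    def request_started(self, now: float) -> None:
        self.in_flight += 1
        self.last_activity = now

    def request_finished(self, now: float) -> None:
        self.in_flight = max(0, self.in_flight - 1)
        self.last_activity = now

    def is_idle(self, now: float) -> bool:
        quiet_for = now - self.last_activity
        return self.in_flight == 0 and quiet_for >= self.quiet_seconds


tracker = NetworkIdleTracker(quiet_seconds=0.5)
tracker.request_started(now=0.0)    # the page fires an XHR
print(tracker.is_idle(now=0.75))    # False: the request is still in flight
tracker.request_finished(now=1.0)
print(tracker.is_idle(now=1.25))    # False: quiet for only 0.25 s
print(tracker.is_idle(now=2.0))     # True: quiet for 1.0 s
```

The quiet window matters because single-page apps often chain requests: one response triggers the next, so a momentary count of zero does not mean rendering has finished.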

## Testing with Claude Desktop

To test this server locally, you can connect it directly to Claude Desktop. Claude Desktop supports running local MCP servers as subprocesses.

Locate your Claude Desktop configuration file. On macOS, it is found at `~/Library/Application Support/Claude/claude_desktop_config.json`. On Linux, check `~/.config/Claude/claude_desktop_config.json`.

Update the configuration to point to your Python script.

```json title="claude_desktop_config.json" {4-8}
{
  "mcpServers": {
    "web-reader": {
      "command": "/path/to/your/venv/bin/python",
      "args": [
        "/path/to/your/mcp_server.py"
      ],
      "env": {
        "ALTERLAB_API_KEY": "your_api_key_here"
      }
    }
  }
}
```




Restart Claude Desktop. You will now see a tool icon indicating the server is connected. You can prompt Claude naturally: "Summarize the latest release notes at https://github.com/mcp/server."

Claude will recognize the URL, invoke your `read_webpage` tool, wait for the Markdown response, and generate an answer based on the real-time data.
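If the configuration file already lists other servers, the new entry has to be merged in rather than pasted over them. A minimal sketch of that merge, operating on an in-memory dictionary; `add_mcp_server` is a hypothetical helper and `other-server` stands in for whatever is already configured.

```python
import json


def add_mcp_server(config: dict, name: str, command: str,
                   args: list[str], env: dict[str, str]) -> dict:
    """Return a copy of the config with one server entry added or replaced."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))
    servers[name] = {"command": command, "args": args, "env": env}
    merged["mcpServers"] = servers
    return merged


existing = {"mcpServers": {"other-server": {"command": "node", "args": ["other.js"]}}}
updated = add_mcp_server(
    existing,
    name="web-reader",
    command="/path/to/your/venv/bin/python",
    args=["/path/to/your/mcp_server.py"],
    env={"ALTERLAB_API_KEY": "your_api_key_here"},
)
print(json.dumps(updated, indent=2))
```

Writing the result back is a matter of `json.dump` to the path for your platform; the original dictionary is left untouched so a failed write cannot corrupt it.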

## Structured Data Extraction

Markdown is excellent for general reading and summarization. However, if your agent needs to act programmatically on specific data points, you want structured JSON.

You can add a second tool to your MCP server specifically for extracting JSON payloads using LLM-powered parsing. Modify the API request to utilize Cortex AI extraction. Check the [documentation](https://alterlab.io/docs) for the specific schema definitions.



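```python title="mcp_server.py" {12-16}
import json  # add to the imports at the top of mcp_server.py

async def extract_structured_data(url: str) -> CallToolResult:
    """Return schema-constrained JSON instead of Markdown."""
    response = client.scrape(
        url,
        extract={
            "schema": {
                "type": "object",
                "properties": {
                    "pricing_tiers": {
                        "type": "array",
                        "items": {"type": "string"}
                    }
                }
            }
        }
    )
    # Serialize the extracted object so it can travel as text content
    return CallToolResult(
        content=[TextContent(type="text", text=json.dumps(response.extracted_data))]
    )

# Register the tool in list_tools() and call this helper from call_tool()
# when name == "extract_structured_data"
```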

This configuration offloads the heavy reasoning to the scraping API. The MCP server receives a clean JSON object and passes it directly to your agent.
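With two tools, the single `if name != ...` check in the call handler no longer scales. One way to route calls, sketched here with placeholder handlers that return strings instead of calling the scraping API, is a dispatch table keyed by tool name. `TOOL_HANDLERS` and `dispatch` are illustrative names.

```python
import asyncio
from typing import Awaitable, Callable

ToolHandler = Callable[[dict], Awaitable[str]]


async def read_webpage(arguments: dict) -> str:
    # Placeholder: the real handler would request Markdown from the scraping API.
    return f"markdown for {arguments['url']}"


async def extract_structured_data(arguments: dict) -> str:
    # Placeholder: the real handler would request schema-based extraction.
    return f"json for {arguments['url']}"


TOOL_HANDLERS: dict[str, ToolHandler] = {
    "read_webpage": read_webpage,
    "extract_structured_data": extract_structured_data,
}


async def dispatch(name: str, arguments: dict) -> str:
    """Look up the handler for a tool name and run it."""
    handler = TOOL_HANDLERS.get(name)
    if handler is None:
        raise ValueError(f"Unknown tool: {name}")
    return await handler(arguments)


print(asyncio.run(dispatch("extract_structured_data", {"url": "https://example.com"})))
```

Adding a third tool then means writing one function and one dictionary entry, and the names in the table can be checked against what `list_tools` advertises.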

## Takeaways

Building an MCP server gives your AI agents reliable access to live public data. By outsourcing the rendering and Markdown conversion to a specialized API, you keep your server logic minimal. You prevent context window bloat, reduce token costs, and cut down on hallucinated answers by grounding your models in the page's actual content.

Keep your server robust by returning errors as text within the tool response, allowing the LLM to handle failures gracefully. Build specialized tools for Markdown reading versus structured JSON extraction to give your agents the right format for the task.