John Rooney for Zyte

Posted on Sep 10

Build a Web Scraping Tool Server with FastMCP & Zyte API

#zyte #webscraping #data #mcpserver

Large Language Models (LLMs) are incredibly powerful, but they have a fundamental limitation: they're stuck in the past. Their knowledge is frozen at the time of their last training run, and they often can't reach live, dynamic information on the internet, either because sites block automated requests or because some data only appears once a page is rendered in a browser.

So, how do you connect your AI to the real world? How do you empower it to fetch real-time product prices, download the latest articles, or analyze a website's current HTML? You give it tools.

In this guide, we'll walk through building a robust MCP Server using FastMCP. This server will expose powerful web scraping capabilities from the Zyte API, effectively giving your AI the ability to browse and understand the live web on your behalf.

You'll need a Zyte API key for this. Signing up for a Zyte account gets you free credit to start with.

What is an MCP Server and Why Does Your AI Need One?

Think of an LLM as a brilliant brain in a jar. It can reason, write, and access a vast internal library of knowledge, but it can't easily interact with the outside world. An MCP (Model Context Protocol) Server is the robotic body you build for that brain. Each function you add to the server is like a hand or a sensor—a tool the brain can use to perform actions.

This architecture, often part of a system for Retrieval-Augmented Generation (RAG) or AI Agents, works like this:

1. You ask the AI a question that requires live data, like, "What's the price of the new iPhone on Apple's website?"
2. The AI, knowing it can't answer from memory, looks at its available tools.
3. It sees a tool named extract_product with the description "extract product data from a given URL." It understands this is the right tool for the job.
4. The AI calls your MCP server, asking it to run the extract_product tool with the URL for the iPhone.
5. Your server executes the code, scrapes the data using the Zyte API, and returns clean, structured information (price, availability, etc.).
6. The AI receives this fresh data and uses it to give you a precise, up-to-the-minute answer.
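
To make the tool call concrete: MCP messages are JSON-RPC 2.0, and a client invokes a tool with the tools/call method. The sketch below builds such a request by hand purely for illustration; the id value and the URL are placeholders, and in practice the MCP client library constructs and sends this message for you.

```python
import json

# Illustrative only: the request an MCP client sends to invoke a tool.
# The id and the URL are placeholder values.
tool_call_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "extract_product",
        "arguments": {"url": "https://example.com/product/123"},
    },
}

# Serialize to the JSON text that travels between client and server.
wire_message = json.dumps(tool_call_request)
print(wire_message)
```

The tool name and its arguments are all the server needs in order to pick the right Python function and run it.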

By building an MCP server, you're not just asking an AI questions; you're directing a powerful agent that can actively gather and process information from the world for you.

The Architecture: Building a Professional-Grade Toolbelt

Before we create the tools, we need to build a solid foundation. A simple script won't cut it; we're building a reliable service. This means structuring our code for clarity, reusability, and safety.

The full code is at the bottom of this article.

The Foundation: A Dedicated API Client

Instead of scattering requests.post calls throughout our code, we'll create a dedicated ZyteAPIClient class. This is a software engineering best practice that encapsulates all the logic for interacting with the Zyte API in one place.

Clean & Reusable: All the details—the base URL, authentication, and timeout settings—are handled within the class. If the API ever changes, you only need to update it in one spot.
Specialised Methods: We create specific methods like extract_product and extract_html. Each method is responsible for building the correct JSON payload for its specific task, making our tool code incredibly simple later on.

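class ZyteAPIClient:
    def __init__(self, api_key: str, timeout: int = 30):
        self.api_key = api_key
        self.base_url = "https://api.zyte.com/v1/extract"
        self.timeout = timeout
        self.session = requests.Session()  # reuse one connection pool across calls

    def _make_request(self, payload: Dict[str, Any]) -> ZyteAPIResponse:
        # Abridged here; logging is omitted (see the complete code below).
        response = self.session.post(
            self.base_url,
            auth=(self.api_key, ""),  # API key as the Basic auth username, empty password
            json=payload,
            timeout=self.timeout,
        )
        response.raise_for_status()  # raise on 4xx/5xx responses
        return ZyteAPIResponse(**response.json())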
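
For a sense of what the client's authentication amounts to: the Zyte API uses HTTP Basic auth with the API key as the username and an empty password, which is what passing auth=(api_key, "") to requests produces. The helper below, basic_auth_header, is a hypothetical name used only to show how that header is formed; the real client never needs to build it by hand.

```python
from base64 import b64encode

def basic_auth_header(api_key: str) -> str:
    """Build the Authorization header value for key-as-username Basic auth."""
    # Basic auth encodes "username:password"; the password is empty here.
    credentials = f"{api_key}:".encode("utf-8")
    return "Basic " + b64encode(credentials).decode("ascii")

header = basic_auth_header("YOUR_API_KEY")  # placeholder key
print(header)
```

Keeping this detail inside the client class means none of the tool functions ever has to touch the key.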

The Blueprint: Pydantic for Data Integrity

What happens if a website doesn't have a price, or the API response changes slightly? Our program could crash. To prevent this, we use Pydantic to define data models.

Think of Pydantic models as a quality control inspector for your data. They define the "shape" (or schema) of the data we expect to receive.

Validation: If the API returns a field whose type doesn't match our ProductResponse schema, Pydantic will raise a ValidationError, preventing bad data from corrupting our system. Because every field is Optional, a field that is simply missing (such as a price) becomes None instead of causing a crash.
Predictability: By using these models, the data flowing through our application always has a known structure. This makes our tools more reliable and safer for the AI to use.

class ProductResponse(BaseModel):
    name: Optional[str] = None
    price: Optional[str] = None
    currency: Optional[str] = None
    # ... (other product fields)

class ZyteAPIResponse(BaseModel):
    product: Optional[ProductResponse] = None
    httpResponseBody: Optional[str] = None
    browserHtml: Optional[str] = None
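
A quick way to see what these models do, as a minimal sketch with a trimmed-down ProductResponse and made-up input dictionaries: because every field is Optional, a missing price simply becomes None, while a value of the wrong type (a nested object where a string is expected) raises a ValidationError.

```python
from typing import Optional
from pydantic import BaseModel, ValidationError

class ProductResponse(BaseModel):
    name: Optional[str] = None
    price: Optional[str] = None
    currency: Optional[str] = None

# A missing field falls back to its default of None.
partial = ProductResponse(**{"name": "Example Widget"})
print(partial.price)

# A value of the wrong type is rejected.
try:
    ProductResponse(**{"name": "Example Widget", "price": {"amount": 10}})
    rejected = False
except ValidationError:
    rejected = True
print(rejected)
```

Code that consumes these models therefore has to handle None, but it does not have to guard against a field arriving in an unexpected shape.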

Forging the Tools with FastMCP

Now that our foundation is solid, we can create the tools for our AI. This is where FastMCP makes things incredibly simple.

The @mcp.tool decorator does the heavy lifting: it registers an ordinary Python function as a tool that an MCP client can discover and call, using the function's name, type hints, and docstring to describe it.

mcp = FastMCP("zyte-mcp-server")
zyte_client = ZyteAPIClient(API_KEY)

@mcp.tool
def extract_product(url: str) -> dict:
    """
    extract product data from a given URL using Zyte API
    """
    response = zyte_client.extract_product(url)
    return response.model_dump()

The most important part of this code is the docstring: """extract product data from a given URL...""".

This is not just a comment for other developers. The docstring is the instruction manual for the AI. The LLM reads this description to understand what the tool does and when it should be used. A clear, concise, and descriptive docstring is the key to creating an effective AI agent. We create two more tools for extracting raw and browser-rendered HTML, each with a clear docstring.
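
To illustrate why the docstring matters, here is a toy stand-in for a tool decorator. It is not FastMCP's implementation; the names tool and TOOL_REGISTRY are hypothetical. It only shows the general idea that a decorator can record a function's name and docstring, which is the kind of description a model then sees when choosing a tool.

```python
import inspect
from typing import Callable, Dict

# Hypothetical registry: tool name -> description shown to the model.
TOOL_REGISTRY: Dict[str, str] = {}

def tool(func: Callable) -> Callable:
    """Record the function's name and cleaned-up docstring, then return it unchanged."""
    TOOL_REGISTRY[func.__name__] = inspect.getdoc(func) or ""
    return func

@tool
def extract_product(url: str) -> dict:
    """
    Extracts structured product data from a given product page URL.
    """
    return {"url": url}  # placeholder body, no real scraping

print(TOOL_REGISTRY["extract_product"])
```

A function with no docstring would end up with an empty description here, leaving the model only the function name to go on.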

Firing Up the Server

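With our client, data models, and tools defined, the last step is to bring our server to life. Calling mcp.run() starts the server and has it wait for requests from an MCP client, executing the appropriate tool for each one. With no arguments, FastMCP uses the stdio transport, so the client launches this script and talks to it over standard input and output rather than over the network; FastMCP also supports HTTP-based transports if you need a network-accessible server.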

if __name__ == "__main__":
    mcp.run()
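
Assuming the server runs over stdio, an MCP client needs to know how to launch the script. Several clients, Claude Desktop among them, read a JSON file with an mcpServers section for this. The sketch below generates such an entry; the server label zyte, the script path, and the key value are placeholders, and the exact file name and location depend on the client you use.

```python
import json

# Placeholders throughout: adjust the label, path, and key for your setup.
client_config = {
    "mcpServers": {
        "zyte": {
            "command": "python",
            "args": ["/path/to/zyte_mcp_server.py"],
            "env": {"ZYTE_API_KEY": "YOUR_API_KEY"},
        }
    }
}

# Render the entry as JSON text to paste into the client's config file.
config_text = json.dumps(client_config, indent=2)
print(config_text)
```

Passing the key through env keeps it out of the script itself, which matches how the full code reads ZYTE_API_KEY from the environment.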

Your MCP server is now running, and your AI has a powerful new set of web-scraping tools, ready to be called upon.

The Complete Code

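Here is the full script, ready to run once fastmcp, requests, and pydantic are installed and the ZYTE_API_KEY environment variable is set.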

from fastmcp import FastMCP
import requests
from base64 import b64decode
from typing import Dict, Any, Optional, List
from pydantic import BaseModel
import logging
import os

# Show INFO-level messages; without this the logging.info calls below are silent.
# The default handler writes to stderr.
logging.basicConfig(level=logging.INFO)

# --- 1. Pydantic Models for Data Validation ---
class ProductResponse(BaseModel):
    name: Optional[str] = None
    price: Optional[str] = None
    currency: Optional[str] = None
    availability: Optional[str] = None
    sku: Optional[str] = None
    brand: Optional[Dict[str, Any]] = None
    images: Optional[List[Dict[str, Any]]] = None
    description: Optional[str] = None
    url: Optional[str] = None

class HTMLResponse(BaseModel):
    html: str

class ZyteAPIResponse(BaseModel):
    product: Optional[ProductResponse] = None
    httpResponseBody: Optional[str] = None
    browserHtml: Optional[str] = None

# --- 2. Dedicated API Client for Clean, Reusable Logic ---
class ZyteAPIClient:
    def __init__(self, api_key: str, timeout: int = 30):
        self.api_key = api_key
        self.base_url = "https://api.zyte.com/v1/extract"
        self.timeout = timeout
        self.session = requests.Session()

    def _make_request(self, payload: Dict[str, Any]) -> ZyteAPIResponse:
        logging.info(f"Making request to {self.base_url} for URL: {payload.get('url', 'unknown')}")
        response = self.session.post(
            self.base_url,
            auth=(self.api_key, ""),
            json=payload,
            timeout=self.timeout
        )
        logging.info(f"Response status: {response.status_code}")
        response.raise_for_status()
        return ZyteAPIResponse(**response.json())

    def extract_product(self, url: str) -> ProductResponse:
        payload = {
            "url": url,
            "product": True,
        }
        response = self._make_request(payload)
        return response.product or ProductResponse()

    def extract_html(self, url: str) -> HTMLResponse:
        payload = {
            "url": url,
            "httpResponseBody": True,
        }
        response = self._make_request(payload)
        if response.httpResponseBody:
            http_response_body = b64decode(response.httpResponseBody)
            # Not every page is UTF-8; replace undecodable bytes instead of raising.
            return HTMLResponse(html=http_response_body.decode("utf-8", errors="replace"))
        return HTMLResponse(html="")

    def extract_browser_html(self, url: str) -> HTMLResponse:
        payload = {
            "url": url,
            "browserHtml": True,
        }
        response = self._make_request(payload)
        return HTMLResponse(html=response.browserHtml or "")

# --- 3. FastMCP Server and Tool Definitions ---
mcp = FastMCP("zyte-mcp-server")
API_KEY = os.getenv("ZYTE_API_KEY")
if not API_KEY:  # also catches an empty string
    raise RuntimeError("ZYTE_API_KEY environment variable is not set")

zyte_client = ZyteAPIClient(API_KEY)

@mcp.tool
def extract_product(url: str) -> dict:
    """
    Extracts structured product data (name, price, SKU, etc.) from a given product page URL.
    Use this when a user asks for specific details about a product.
    """
    response = zyte_client.extract_product(url)
    return response.model_dump()

@mcp.tool
def extract_html(url: str) -> dict:
    """
    Extracts the raw static HTML content of a page from a given URL.
    Use this for simple websites or when you need the source code before JavaScript runs.
    """
    response = zyte_client.extract_html(url)
    return response.model_dump()

@mcp.tool
def extract_html_with_browser(url: str) -> dict:
    """
    Extracts the HTML content of a page after it has been fully rendered in a web browser,
    including content generated by JavaScript. Use this for dynamic, complex websites.
    """
    response = zyte_client.extract_browser_html(url)
    return response.model_dump()

# --- 4. Run the Server ---
if __name__ == "__main__":
    mcp.run()
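
One detail in the script above worth seeing in isolation: the Zyte API returns httpResponseBody as a Base64 string, which is why extract_html decodes it before use. The following sketch uses a made-up HTML snippet in place of a real API response, and the helper name decode_body is hypothetical. It also replaces undecodable bytes rather than raising, which is an assumption about how you may want to treat pages that are not UTF-8.

```python
from base64 import b64decode, b64encode

def decode_body(encoded_body: str) -> str:
    """Turn a Base64-encoded response body into text."""
    raw_bytes = b64decode(encoded_body)
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising.
    return raw_bytes.decode("utf-8", errors="replace")

# Stand-in for the httpResponseBody field of an API response.
fake_body = b64encode(b"<html><body>Hello</body></html>").decode("ascii")
html = decode_body(fake_body)
print(html)
```

browserHtml, by contrast, is used as a plain string in the script, so no decoding step is needed for the browser-rendered tool.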

Top comments (1)

OnlineProxy • Sep 10

FastMCP’s async features and Zyte’s smart proxy setup have been lifesavers, but getting them to play nice together wasn’t exactly smooth sailing. There were some pain points, like dealing with auth handling, long timeouts, and those annoying cost spikes. I lean on Pydantic for solid data validation and use Playwright when scraping JS-heavy sites. When things need to be faster, I switch to basic HTML scrapers to keep it efficient. Rate limiting’s handled with retries, proxies, and Zyte’s auto-throttling to keep everything smooth.