John Rooney for Extract Data

How I use AI and MCP to Scrape Data

Web Scraping MCP Tool

How to Build Custom Tools for Copilot with Python and Zyte API

AI coding assistants like GitHub Copilot are incredibly powerful, but their capabilities are often sandboxed. They can write code, but they can't always interact with specific external APIs or access proprietary tools relevant to your project. What if you could give your AI assistant direct access to specialized tools, like a high-performance web scraping API?

That's exactly what this tutorial covers. We will build a simple MCP server using Python that exposes custom functions as tools for your AI assistant. To demonstrate a powerful real-world use case, we'll integrate the Zyte API to give our AI sophisticated web scraping abilities, allowing it to access and process content from virtually any website.


Why Extend Your AI's Capabilities?

By default, an AI assistant can't access live web pages if they are blocked by bot detection, nor can it interact with your company's internal staging environment or private APIs. By building a custom tool bridge, you can grant specific permissions and capabilities:

  • Access Restricted Data: Allow the AI to query internal databases or APIs.
  • Bypass Web Blocks: Use a robust web scraping service like Zyte API to fetch web content reliably, overcoming anti-scraping measures that would normally stop the AI.
  • Perform Specialized Tasks: Create tools for complex calculations, image processing, or interacting with specific hardware.

In this guide, we'll use a Python framework to create these tools and demonstrate how to perform advanced web scraping tasks by simply chatting with Copilot.


Step 1: Setting Up Your Basic Tool Server

First, we need to create a basic server to host our tools. We'll use FastMCP, a Python framework for building servers that implement the Model Context Protocol (MCP), the open standard that lets AI assistants discover and call external tools.

Let's start with a minimal example. Create a Python file (e.g., main.py) and add the following code. This basic server defines a single tool called add_two_numbers.

# main.py
from fastmcp import FastMCP

mcp = FastMCP("Demo MCP Server")

# @mcp.tool decorator registers the function as an available tool for the AI.
@mcp.tool
def add_two_numbers(a: int, b: int) -> int:
    """Adds two integers together."""
    return a + b

if __name__ == "__main__":
    mcp.run()

The key element here is the @mcp.tool decorator. This tells the server framework to expose the add_two_numbers function so that AI assistants like Copilot can discover and execute it.
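Conceptually, a tool decorator like this maintains a registry that maps tool names to callables, so the server can dispatch an incoming call by name. Here is a minimal sketch of that idea (an illustration only, not FastMCP's actual internals):

```python
# Minimal illustration of a tool registry; FastMCP's real implementation
# also handles schemas, validation, and the MCP wire protocol.
TOOLS = {}

def tool(func):
    """Register a function under its own name and return it unchanged."""
    TOOLS[func.__name__] = func
    return func

@tool
def add_two_numbers(a: int, b: int) -> int:
    """Adds two integers together."""
    return a + b

def dispatch(name: str, arguments: dict):
    """Look up a registered tool by name and call it with keyword arguments."""
    return TOOLS[name](**arguments)

print(dispatch("add_two_numbers", {"a": 5, "b": 6}))  # prints 11
```

The function's name, type hints, and docstring double as the tool's public interface: that metadata is what the AI assistant reads to decide when and how to call it.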


Step 2: Configuration and Verification

To make these tools discoverable by your IDE and AI assistant on a per-project basis, you need to create a configuration file.

  1. Create a Configuration File: In your project's root directory, create a .mcp/mcp.json file (VS Code users may instead place this at .vscode/mcp.json). This JSON file tells the local environment how to launch your tool server. Recent FastMCP versions can generate it for you (e.g., fastmcp install mcp-json main.py).

  2. Configuration Example (.mcp/mcp.json):

    {
        "servers": {
            "Demo MCP Server": {
                "command": "uv",
                "args": [
                    "run",
                    "--with",
                    "fastmcp",
                    "fastmcp",
                    "run",
                    "/path/to/main.py"
                ],
                "env": {}
            }
        }
    }
    
  3. Run and Verify:

  * Start your Python server. You should see output indicating that the server is running and has discovered your tools.
  * In your IDE (like VS Code), open the Command Palette (`Cmd+Shift+P` or `Ctrl+Shift+P`) and search for the commands that manage MCP servers. You should see your "Demo MCP Server" listed as active.
  * You can also open the Copilot chat interface, find the "Configure Tools" option, and verify that the `add_two_numbers` tool is listed.

To test it, ask Copilot to perform the task, explicitly telling it to use your tool:

Prompt: "Using the MCP server, add 5 and 6."

Copilot will ask for permission to run the tool from your local server. After you allow it, it will execute the function on your server and return the result (11). This confirms the connection between your AI and your custom code is working.
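Under the hood, the assistant and your server exchange JSON-RPC 2.0 messages defined by the Model Context Protocol. A simplified sketch of what the call above looks like on the wire (message shapes abridged from the MCP specification):

```python
import json

# Abridged MCP "tools/call" request the client sends to your server.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "add_two_numbers",
        "arguments": {"a": 5, "b": 6},
    },
}

# Abridged response the server returns: results come back as content blocks.
response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {"content": [{"type": "text", "text": "11"}]},
}

print(json.dumps(request, indent=2))
print(response["result"]["content"][0]["text"])  # prints 11
```

FastMCP generates and parses these messages for you; the sketch is only to show that "Copilot calls your tool" is ordinary request/response plumbing, not magic.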


Step 3: Supercharging Your AI with Web Scraping Capabilities

Now for the powerful part. Let's create a tool that fetches website content using the Zyte API. This allows us to scrape pages that are normally protected by anti-bot measures.

1. Define the New Tool Function

First, let's define the function signature in main.py. We'll create an extract_html function that takes a URL and returns the page content. It's crucial to add a descriptive docstring so the AI understands what the tool does and when to use it.

import requests
import base64
import os

# ... (previous add_two_numbers function) ...

@mcp.tool
def extract_html(url: str) -> dict:
    """
    Extracts HTML content from a given URL using the Zyte API.
    Returns the HTML content.
    """
    # API call logic will go here
    pass

2. Securely Handle API Keys

Never hardcode your API keys. We'll use environment variables to keep them secure. Add a check at the start of your script to ensure the API key is available before starting the server.

# At the top of main.py
import os

API_KEY = os.getenv("ZYTE_API_KEY")
if API_KEY is None:
    raise ValueError("ZYTE_API_KEY environment variable not set.")

3. Implement the Zyte API Call

Now, we'll implement the logic inside the extract_html function. We will make a request to the Zyte API endpoint, passing our target URL and API key. The Zyte API handles proxies, retries, and browser rendering, returning clean HTML.

# main.py - updated extract_html function

@mcp.tool
def extract_html(url: str) -> dict:
    """
    Extracts HTML content from a given URL using the Zyte API.
    Returns the HTML content.
    """
    response = requests.post(
        "https://api.zyte.com/v1/extract",
        auth=(API_KEY, ""),  # Zyte uses the API key as the basic-auth username
        json={
            "url": url,
            "httpResponseBody": True  # Requesting the raw HTML body
        },
        timeout=60,  # Zyte may render the page, so allow a generous timeout
    )
    response.raise_for_status()  # Fail loudly on HTTP errors
    response_data = response.json()

    # Decode the Base64 encoded HTML response from Zyte API
    html_content = base64.b64decode(response_data["httpResponseBody"]).decode("utf-8")

    return {"html": html_content}

After adding this code, restart your server. The new extract_html tool will now be available to your AI assistant.
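The Base64 step is worth understanding in isolation: Zyte's httpResponseBody field is a Base64 string encoding the raw bytes of the page, which you then decode back into text. A standard-library-only sketch of that round trip (the sample payload below is made up for illustration):

```python
import base64

def decode_http_response_body(encoded: str, charset: str = "utf-8") -> str:
    """Decode a Base64-encoded response body (as returned by Zyte API) into text."""
    return base64.b64decode(encoded).decode(charset)

# Simulate what Zyte sends back: the page bytes, Base64-encoded.
page = "<html><body><h1>Hello</h1></body></html>"
encoded = base64.b64encode(page.encode("utf-8")).decode("ascii")

assert decode_http_response_body(encoded) == page
print(decode_http_response_body(encoded))
```

If you ever see mojibake in the decoded HTML, the page likely isn't UTF-8; the charset parameter above is where you'd plug in the correct encoding.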


Step 4: Real-World Application: Scraping and Parsing a Product Page

Let's see this in action. The real value isn't just fetching the HTML; it's enabling the AI to reason about data it couldn't access before.

Scenario: We want to extract product information from an e-commerce page that often blocks scrapers.

Workflow:

  1. Fetch the HTML via the tool:

    Prompt: "Extract the HTML from [product page URL] using Zyte."

    Copilot will invoke your extract_html tool. The Zyte API will process the request, bypass any blocks, and send the full HTML back to the AI's context.

  2. Ask the AI to parse the data:

    Prompt: "This is a product page. Give me CSS selectors for a standard product schema (name, price, description) that work on this page. Then, return a Python code block using BeautifulSoup to parse it."

Now, because the AI has access to the actual HTML via your tool, it can accurately generate working CSS selectors and parsing code. Without the custom tool, Copilot would likely respond with "I cannot access live websites" or provide generic, non-functional code.

Here's an example of what the AI might generate:

# AI-generated code based on the fetched HTML context

from bs4 import BeautifulSoup

# The HTML content would be loaded here from the previous step's output
html_doc = """... full HTML content from Zyte API ..."""
soup = BeautifulSoup(html_doc, 'html.parser')

product_schema = {
    "name": soup.select_one("h1.product-title").get_text(strip=True) if soup.select_one("h1.product-title") else None,
    "price": soup.select_one("span.price-amount").get_text(strip=True) if soup.select_one("span.price-amount") else None,
    "description": soup.select_one("div.product-description").get_text(strip=True) if soup.select_one("div.product-description") else None,
}

print(product_schema)

Conclusion: Unleash Your AI's Potential

Creating custom tools for your AI assistant bridges the gap between general-purpose code generation and specialized, project-specific automation. As demonstrated, building a simple Python server to expose functions is straightforward. By connecting powerful external services like the Zyte API, you effectively give your AI superpowers, enabling complex workflows like reliable web scraping and data extraction directly from your development environment.

This approach allows you to build highly specialized, powerful integrations limited only by the tools you decide to create.

Full Code

from fastmcp import FastMCP
from base64 import b64decode
import requests
import os

mcp = FastMCP("Demo MCP Server")


@mcp.tool
def add(a: int, b: int) -> int:
    """Add two numbers"""
    return a + b


@mcp.tool
def subtract(a: int, b: int) -> int:
    """Subtract two numbers"""
    return a - b


@mcp.tool
def request_html(url: str) -> str:
    """Fetch HTML content from a URL"""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.text


@mcp.tool
def zyte_product(url: str) -> dict:
    """
    Fetch product data from Zyte API and return the product JSON.
    """
    api_key = os.environ.get("ZYTE_API_KEY")
    if not api_key:
        raise RuntimeError("ZYTE_API_KEY environment variable not set")
    api_response = requests.post(
        "https://api.zyte.com/v1/extract",
        auth=(api_key, ""),
        json={
            "url": url,
            "httpResponseBody": True,
            "product": True,
            "productOptions": {"extractFrom": "httpResponseBody", "ai": True},
            "followRedirect": True,
        },
    )
    data = api_response.json()
    return data["product"]


@mcp.tool
def get_html_from_zyte(url: str) -> dict:
    """
    Fetches and returns only the HTML content from Zyte API for a given URL.
    """
    api_key = os.environ.get("ZYTE_API_KEY")
    if not api_key:
        raise RuntimeError("ZYTE_API_KEY environment variable not set")
    api_response = requests.post(
        "https://api.zyte.com/v1/extract",
        auth=(api_key, ""),
        json={
            "url": url,
            "httpResponseBody": True,
        },
    )
    http_response_body: bytes = b64decode(api_response.json()["httpResponseBody"])
    return {"html": http_response_body.decode("utf-8")}


if __name__ == "__main__":
    mcp.run()
