Rost

Posted on • Originally published at glukhov.org

Building MCP Servers in Python: WebSearch & Scrape Guide

The Model Context Protocol (MCP) is revolutionizing how AI assistants interact with external data sources and tools. In this comprehensive guide, we'll explore how to build MCP servers in Python, with practical examples focused on web search and scraping capabilities.

What is the Model Context Protocol?

Model Context Protocol (MCP) is an open protocol introduced by Anthropic to standardize how AI assistants connect to external systems. Instead of building custom integrations for each data source, MCP provides a unified interface that allows:

  • AI assistants (like Claude, ChatGPT, or custom LLM applications) to discover and use tools
  • Developers to expose data sources, tools, and prompts through a standardized protocol
  • Seamless integration without reinventing the wheel for each use case

The protocol operates on a client-server architecture where:

  • MCP Clients (AI assistants) discover and use capabilities
  • MCP Servers expose resources, tools, and prompts
  • Communication happens via JSON-RPC over stdio or HTTP/SSE (a sample exchange is sketched below)
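
To make this concrete, here is roughly what a client's tools/list request and the server's response look like on the wire. This is an illustrative sketch written as Python dicts — the shapes follow the MCP specification, but the field values are made up:

# Illustrative tools/list exchange (values are made up; shapes follow the spec)
tools_list_request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/list",
}

tools_list_response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "tools": [
            {
                "name": "search_web",
                "description": "Search the web",
                "inputSchema": {"type": "object"},
            }
        ]
    },
}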

Why Build MCP Servers in Python?

Python is an excellent choice for MCP server development because:

  1. Rich Ecosystem: Libraries like requests, beautifulsoup4, selenium, and playwright make web scraping straightforward
  2. MCP SDK: Official Python SDK (mcp) provides robust server implementation support
  3. Rapid Development: Python's simplicity allows quick prototyping and iteration
  4. AI/ML Integration: Easy integration with AI libraries like langchain, openai, and data processing tools
  5. Community Support: Large community with extensive documentation and examples

Setting Up Your Development Environment

First, create a virtual environment and install the required dependencies:

# Create and activate virtual environment
python -m venv mcp-env
source mcp-env/bin/activate  # On Windows: mcp-env\Scripts\activate

# Install MCP SDK and web scraping libraries
pip install mcp requests beautifulsoup4 playwright lxml
playwright install  # Install browser drivers for Playwright
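
If you prefer pinning dependencies, a minimal requirements.txt for this project might look like the following (add version pins once you've tested a combination that works for you):

mcp
requests
beautifulsoup4
playwright
lxml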

Building a Basic MCP Server

Let's start with a minimal MCP server structure:

import asyncio
from mcp.server import Server
from mcp.types import Tool, TextContent
import mcp.server.stdio

# Create server instance
app = Server("websearch-scraper")

@app.list_tools()
async def list_tools() -> list[Tool]:
    """Define available tools"""
    return [
        Tool(
            name="search_web",
            description="Search the web for information",
            inputSchema={
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "Search query"
                    },
                    "max_results": {
                        "type": "number",
                        "description": "Maximum number of results",
                        "default": 5
                    }
                },
                "required": ["query"]
            }
        )
    ]

@app.call_tool()
async def call_tool(name: str, arguments: dict) -> list[TextContent]:
    """Handle tool execution"""
    if name == "search_web":
        query = arguments["query"]
        max_results = arguments.get("max_results", 5)

        # Implement search logic here
        results = await perform_web_search(query, max_results)

        return [TextContent(
            type="text",
            text=f"Search results for '{query}':\n\n{results}"
        )]

    raise ValueError(f"Unknown tool: {name}")

async def perform_web_search(query: str, max_results: int) -> str:
    """Placeholder for actual search implementation"""
    return f"Found {max_results} results for: {query}"

async def main():
    """Run the server"""
    async with mcp.server.stdio.stdio_server() as (read_stream, write_stream):
        await app.run(
            read_stream,
            write_stream,
            app.create_initialization_options()
        )

if __name__ == "__main__":
    asyncio.run(main())
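
As an aside, the official Python SDK also ships a higher-level FastMCP API that derives the tool schema from type hints and docstrings. The minimal server above could be sketched like this (same placeholder tool, less boilerplate):

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("websearch-scraper")

@mcp.tool()
async def search_web(query: str, max_results: int = 5) -> str:
    """Search the web for information"""
    return f"Found {max_results} results for: {query}"

if __name__ == "__main__":
    mcp.run()  # stdio transport by default

We'll stick with the lower-level Server API in this guide, since it makes the protocol mechanics visible.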

Implementing Web Search Functionality

Now let's implement a real web search tool using DuckDuckGo (which doesn't require API keys):

import asyncio
import requests
from bs4 import BeautifulSoup
from urllib.parse import quote_plus

async def search_duckduckgo(query: str, max_results: int = 5) -> list[dict]:
    """Search DuckDuckGo's HTML endpoint and parse results.

    Note: requests is blocking; that's fine for a simple stdio server, but
    see the connection-pooling section later for a fully async alternative.
    """

    url = f"https://html.duckduckgo.com/html/?q={quote_plus(query)}"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
    }

    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()

        soup = BeautifulSoup(response.text, 'html.parser')
        results = []

        for result in soup.select('.result')[:max_results]:
            title_elem = result.select_one('.result__title')
            snippet_elem = result.select_one('.result__snippet')
            url_elem = result.select_one('.result__url')

            if title_elem and snippet_elem:
                results.append({
                    "title": title_elem.get_text(strip=True),
                    "snippet": snippet_elem.get_text(strip=True),
                    "url": url_elem.get_text(strip=True) if url_elem else "N/A"
                })

        return results

    except Exception as e:
        raise RuntimeError(f"Search failed: {e}") from e

def format_search_results(results: list[dict]) -> str:
    """Format search results for display"""
    if not results:
        return "No results found."

    formatted = []
    for i, result in enumerate(results, 1):
        formatted.append(
            f"{i}. {result['title']}\n"
            f"   {result['snippet']}\n"
            f"   URL: {result['url']}\n"
        )

    return "\n".join(formatted)
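
A quick way to sanity-check the search helper outside the server:

# Run the helper directly and print the formatted results
if __name__ == "__main__":
    hits = asyncio.run(search_duckduckgo("model context protocol", 3))
    print(format_search_results(hits))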

Adding Web Scraping Capabilities

Let's add a tool to scrape and extract content from web pages, then update list_tools so it advertises both tools (the new definition replaces the earlier single-tool one):

from playwright.async_api import async_playwright
import asyncio

async def scrape_webpage(url: str, selector: str | None = None) -> dict:
    """Scrape content from a webpage using Playwright"""

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()

        try:
            await page.goto(url, timeout=30000)

            # Wait for content to load
            await page.wait_for_load_state('networkidle')

            if selector:
                # Extract specific element
                element = await page.query_selector(selector)
                content = await element.inner_text() if element else "Selector not found"
            else:
                # Extract main content
                content = await page.inner_text('body')

            title = await page.title()

            return {
                "title": title,
                "content": content[:5000],  # Limit content length
                "url": url,
                "success": True
            }

        except Exception as e:
            return {
                "error": str(e),
                "url": url,
                "success": False
            }
        finally:
            await browser.close()

# Updated tool list (re-registering list_tools replaces the earlier handler)
@app.list_tools()
async def list_tools() -> list[Tool]:
    return [
        Tool(
            name="search_web",
            description="Search the web using DuckDuckGo",
            inputSchema={
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"},
                    "max_results": {"type": "number", "default": 5}
                },
                "required": ["query"]
            }
        ),
        Tool(
            name="scrape_webpage",
            description="Scrape content from a webpage",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {"type": "string", "description": "URL to scrape"},
                    "selector": {
                        "type": "string",
                        "description": "Optional CSS selector for specific content"
                    }
                },
                "required": ["url"]
            }
        )
    ]

Complete MCP Server Implementation

Here's a complete MCP server combining the search and scrape tools, with logging and error handling:

#!/usr/bin/env python3
"""
MCP Server for Web Search and Scraping
Provides tools for searching the web and extracting content from pages
"""

import asyncio
import logging
from typing import Any
import requests
from bs4 import BeautifulSoup
from urllib.parse import quote_plus
from playwright.async_api import async_playwright

from mcp.server import Server
from mcp.types import Tool, TextContent
import mcp.server.stdio

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("websearch-scraper")

# Create server
app = Server("websearch-scraper")

# Search implementation
async def search_web(query: str, max_results: int = 5) -> str:
    """Search DuckDuckGo and return formatted results"""
    url = f"https://html.duckduckgo.com/html/?q={quote_plus(query)}"
    headers = {"User-Agent": "Mozilla/5.0"}

    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        results = []

        for result in soup.select('.result')[:max_results]:
            title = result.select_one('.result__title')
            snippet = result.select_one('.result__snippet')
            link = result.select_one('.result__url')

            if title and snippet:
                results.append({
                    "title": title.get_text(strip=True),
                    "snippet": snippet.get_text(strip=True),
                    "url": link.get_text(strip=True) if link else ""
                })

        # Format results
        if not results:
            return "No results found."

        formatted = [f"Found {len(results)} results for '{query}':\n"]
        for i, r in enumerate(results, 1):
            formatted.append(f"\n{i}. **{r['title']}**")
            formatted.append(f"   {r['snippet']}")
            formatted.append(f"   {r['url']}")

        return "\n".join(formatted)

    except Exception as e:
        logger.error(f"Search failed: {e}")
        return f"Search error: {str(e)}"

# Scraper implementation
async def scrape_page(url: str, selector: str | None = None) -> str:
    """Scrape webpage content using Playwright"""
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()

        try:
            await page.goto(url, timeout=30000)
            await page.wait_for_load_state('networkidle')

            title = await page.title()

            if selector:
                element = await page.query_selector(selector)
                content = await element.inner_text() if element else "Selector not found"
            else:
                content = await page.inner_text('body')

            # Limit content length
            content = content[:8000] + "..." if len(content) > 8000 else content

            result = f"**{title}**\n\nURL: {url}\n\n{content}"
            return result

        except Exception as e:
            logger.error(f"Scraping failed: {e}")
            return f"Scraping error: {str(e)}"
        finally:
            await browser.close()

# MCP Tool definitions
@app.list_tools()
async def list_tools() -> list[Tool]:
    """List available MCP tools"""
    return [
        Tool(
            name="search_web",
            description="Search the web using DuckDuckGo. Returns titles, snippets, and URLs.",
            inputSchema={
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "The search query"
                    },
                    "max_results": {
                        "type": "number",
                        "description": "Maximum number of results (default: 5)",
                        "default": 5
                    }
                },
                "required": ["query"]
            }
        ),
        Tool(
            name="scrape_webpage",
            description="Extract content from a webpage. Can target specific elements with CSS selectors.",
            inputSchema={
                "type": "object",
                "properties": {
                    "url": {
                        "type": "string",
                        "description": "The URL to scrape"
                    },
                    "selector": {
                        "type": "string",
                        "description": "Optional CSS selector to extract specific content"
                    }
                },
                "required": ["url"]
            }
        )
    ]

@app.call_tool()
async def call_tool(name: str, arguments: Any) -> list[TextContent]:
    """Handle tool execution"""
    try:
        if name == "search_web":
            query = arguments["query"]
            max_results = arguments.get("max_results", 5)
            result = await search_web(query, max_results)
            return [TextContent(type="text", text=result)]

        elif name == "scrape_webpage":
            url = arguments["url"]
            selector = arguments.get("selector")
            result = await scrape_page(url, selector)
            return [TextContent(type="text", text=result)]

        else:
            raise ValueError(f"Unknown tool: {name}")

    except Exception as e:
        logger.error(f"Tool execution failed: {e}")
        return [TextContent(
            type="text",
            text=f"Error executing {name}: {str(e)}"
        )]

async def main():
    """Run the MCP server"""
    logger.info("Starting WebSearch-Scraper MCP Server")

    async with mcp.server.stdio.stdio_server() as (read_stream, write_stream):
        await app.run(
            read_stream,
            write_stream,
            app.create_initialization_options()
        )

if __name__ == "__main__":
    asyncio.run(main())

Configuring Your MCP Server

To use your MCP server with Claude Desktop or other MCP clients, create a configuration file:

For Claude Desktop (claude_desktop_config.json):

{
  "mcpServers": {
    "websearch-scraper": {
      "command": "python",
      "args": [
        "/path/to/your/mcp_server.py"
      ],
      "env": {}
    }
  }
}

Location:

  • macOS: ~/Library/Application Support/Claude/claude_desktop_config.json
  • Windows: %APPDATA%\Claude\claude_desktop_config.json
  • Linux: ~/.config/Claude/claude_desktop_config.json
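
One practical note: "command": "python" resolves against the client's PATH, which is usually not your virtual environment. Pointing at the venv interpreter directly avoids missing-dependency surprises (paths below are illustrative):

{
  "mcpServers": {
    "websearch-scraper": {
      "command": "/path/to/mcp-env/bin/python",
      "args": ["/path/to/your/mcp_server.py"]
    }
  }
}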

Testing Your MCP Server

Create a test script to verify functionality:

import asyncio

# Import the helpers from the server module (assumes the complete server
# above was saved as mcp_server.py)
from mcp_server import search_web, scrape_page

async def test_mcp_server():
    """Test MCP server locally"""

    # Test search
    print("Testing web search...")
    results = await search_web("Python MCP tutorial", 3)
    print(results)

    # Test scraper
    print("\n\nTesting webpage scraper...")
    content = await scrape_page("https://example.com")
    print(content[:500])

if __name__ == "__main__":
    asyncio.run(test_mcp_server())

Advanced Features and Best Practices

1. Rate Limiting

Implement rate limiting to avoid overwhelming target servers:

import time
from collections import defaultdict

class RateLimiter:
    def __init__(self, max_requests: int = 10, time_window: int = 60):
        self.max_requests = max_requests
        self.time_window = time_window
        self.requests = defaultdict(list)

    async def acquire(self, key: str):
        now = time.time()
        self.requests[key] = [
            t for t in self.requests[key] 
            if now - t < self.time_window
        ]

        if len(self.requests[key]) >= self.max_requests:
            raise Exception("Rate limit exceeded")

        self.requests[key].append(now)

limiter = RateLimiter(max_requests=10, time_window=60)
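
Wiring the limiter into a tool is then a one-line guard. A hypothetical wrapper around search_web:

async def rate_limited_search(query: str, max_results: int = 5) -> str:
    # Raises if more than 10 searches were made in the last 60 seconds
    await limiter.acquire("search_web")
    return await search_web(query, max_results)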

2. Caching

Add caching to improve performance:

# Note: functools.lru_cache does not work with async functions -- it would
# cache the coroutine object itself, which can only be awaited once.
# A small manual cache avoids that pitfall:
_search_cache: dict[tuple[str, int], str] = {}

async def cached_search(query: str, max_results: int) -> str:
    key = (query, max_results)
    if key not in _search_cache:
        _search_cache[key] = await search_web(query, max_results)
    return _search_cache[key]

3. Error Handling

Implement robust error handling:

from enum import Enum

class ErrorType(Enum):
    NETWORK_ERROR = "network_error"
    PARSE_ERROR = "parse_error"
    RATE_LIMIT = "rate_limit_exceeded"
    INVALID_INPUT = "invalid_input"

def handle_error(error: Exception, error_type: ErrorType) -> str:
    logger.error(f"{error_type.value}: {str(error)}")
    return f"Error ({error_type.value}): {str(error)}"
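
For example, the search implementation could map network failures onto these categories. A sketch (fetch_or_error is a hypothetical helper; adapt it to your own except clauses):

def fetch_or_error(url: str, headers: dict):
    """Return the response on success, or a categorized error string"""
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        return response
    except requests.RequestException as e:
        return handle_error(e, ErrorType.NETWORK_ERROR)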

4. Input Validation

Validate user inputs before processing:

from urllib.parse import urlparse

def validate_url(url: str) -> bool:
    try:
        result = urlparse(url)
        return all([result.scheme, result.netloc])
    except ValueError:
        return False

def validate_query(query: str) -> bool:
    return len(query.strip()) > 0 and len(query) < 500
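
These checks slot in at the top of call_tool, before any network work happens. A hypothetical helper:

def check_arguments(name: str, arguments: dict) -> str | None:
    """Return an error message if the arguments are invalid, else None"""
    if name == "search_web" and not validate_query(arguments.get("query", "")):
        return "Error (invalid_input): query must be 1-500 characters"
    if name == "scrape_webpage" and not validate_url(arguments.get("url", "")):
        return "Error (invalid_input): malformed URL"
    return None

If it returns a message, wrap it in TextContent and return early instead of calling the tool implementation.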

Deployment Considerations

Using SSE Transport for Web Deployment

For web-based deployments, use SSE (Server-Sent Events) transport:

from starlette.applications import Starlette
from starlette.routing import Mount, Route
import uvicorn

from mcp.server.sse import SseServerTransport

# The client opens a long-lived GET stream on /sse and POSTs messages
# to /messages/ (this follows the pattern used by the MCP Python SDK)
sse = SseServerTransport("/messages/")

async def handle_sse(request):
    async with sse.connect_sse(
        request.scope, request.receive, request._send
    ) as (read_stream, write_stream):
        await app.run(
            read_stream,
            write_stream,
            app.create_initialization_options()
        )

starlette_app = Starlette(
    routes=[
        Route("/sse", endpoint=handle_sse),
        Mount("/messages/", app=sse.handle_post_message),
    ]
)

async def main_sse():
    """Run the server with SSE transport"""
    config = uvicorn.Config(starlette_app, host="0.0.0.0", port=8000)
    await uvicorn.Server(config).serve()

Docker Deployment

Create a Dockerfile for containerized deployment:

FROM python:3.11-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    wget \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
RUN playwright install --with-deps chromium

# Copy application
COPY mcp_server.py .

CMD ["python", "mcp_server.py"]
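
Building and running the image (the -i flag matters: the stdio transport needs stdin kept open):

docker build -t websearch-scraper .
docker run -i --rm websearch-scraper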

Performance Optimization

Async Operations

Use asyncio for concurrent operations:

async def search_multiple_queries(queries: list[str]) -> list[str]:
    """Search multiple queries concurrently"""
    tasks = [search_web(query) for query in queries]
    results = await asyncio.gather(*tasks)
    return results
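
Unbounded gather can hammer the search endpoint, so it's worth capping how many requests run at once. A sketch using a semaphore (the limit of 3 is arbitrary):

async def search_bounded(queries: list[str], limit: int = 3) -> list[str]:
    """Search many queries concurrently, at most `limit` in flight at once"""
    sem = asyncio.Semaphore(limit)

    async def bounded(query: str) -> str:
        async with sem:
            return await search_web(query)

    return await asyncio.gather(*(bounded(q) for q in queries))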

Connection Pooling

Reuse connections for better performance:

import aiohttp

_session: aiohttp.ClientSession | None = None

async def get_session() -> aiohttp.ClientSession:
    """Lazily create a shared session and reuse it across requests"""
    global _session
    if _session is None:
        _session = aiohttp.ClientSession()
    return _session

async def fetch_url(url: str) -> str:
    session = await get_session()
    async with session.get(url) as response:
        return await response.text()

async def close_session():
    """Close the shared session on shutdown"""
    global _session
    if _session is not None:
        await _session.close()
        _session = None

Security Best Practices

  1. Input Sanitization: Always validate and sanitize user inputs
  2. URL Whitelisting: Consider implementing URL whitelisting for scraping (see the sketch after this list)
  3. Timeout Controls: Set appropriate timeouts to prevent resource exhaustion
  4. Content Limits: Limit the size of scraped content
  5. Authentication: Implement authentication for production deployments
  6. HTTPS: Use HTTPS for SSE transport in production
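
For point 2, a minimal allowlist check might look like this (the domain set is a placeholder; populate it with hosts you actually trust):

from urllib.parse import urlparse

ALLOWED_DOMAINS = {"example.com", "docs.python.org"}  # placeholder domains

def is_allowed(url: str) -> bool:
    """Accept only http(s) URLs whose host is on the allowlist"""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    return parsed.scheme in ("http", "https") and (
        host in ALLOWED_DOMAINS
        or any(host.endswith("." + d) for d in ALLOWED_DOMAINS)
    )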

Useful Links and Resources

  • Model Context Protocol documentation: https://modelcontextprotocol.io
  • MCP Python SDK: https://github.com/modelcontextprotocol/python-sdk
  • Playwright for Python: https://playwright.dev/python
  • Beautiful Soup documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Conclusion

Building MCP servers in Python opens up powerful possibilities for extending AI assistants with custom tools and data sources. The web search and scraping capabilities demonstrated here are just the beginning—you can extend this foundation to integrate databases, APIs, file systems, and virtually any external system.

The Model Context Protocol is still evolving, but its standardized approach to AI tool integration makes it an exciting technology for developers building the next generation of AI-powered applications. Whether you're creating internal tools for your organization or building public MCP servers for the community, Python provides an excellent foundation for rapid development and deployment.
