Endogen

Web2API — Turning Websites into REST APIs (and MCP Tools)

I needed data from websites that don't have APIs. Not once, not as a quick scrape, but as persistent, queryable endpoints I could hit programmatically. So I built Web2API.

The Problem

Most useful data on the internet lives behind HTML. Some sites offer APIs, many don't. The typical approach is writing one-off scrapers — fragile scripts that break whenever the site changes a CSS class. I wanted something different:

  • Declarative — define what to extract, not how to click through pages
  • Persistent — a running service with stable endpoints, not a script I run manually
  • Modular — add new sites without touching the core codebase
  • AI-ready — expose scraped data as tools that language models can call

The Solution

Web2API is a FastAPI service backed by Playwright (headless Chromium). You define recipes in YAML — each recipe describes a website, its endpoints, and what data to extract. The service runs continuously and serves the scraped data as clean JSON REST endpoints.

A Recipe Looks Like This

name: "Hacker News"
slug: "hackernews"
base_url: "https://news.ycombinator.com"
endpoints:
  read:
    url: "https://news.ycombinator.com/news?p={page}"
    items:
      container: "tr.athing"
      fields:
        title:
          selector: ".titleline > a"
          attribute: "text"
        url:
          selector: ".titleline > a"
          attribute: "href"
          transform: "absolute_url"
        score:
          selector: ".score"
          context: "next_sibling"
          attribute: "text"
          transform: "regex_int"
          optional: true
    pagination:
      type: "page_param"
      param: "p"
      start: 1

That's it. No Python code. Install the recipe, and you get:

curl "http://localhost:8010/hackernews/read?page=1"
{
  "items": [
    {
      "title": "Show HN: I built a thing",
      "url": "https://example.com",
      "fields": { "score": 153 }
    }
  ],
  "pagination": { "current_page": 1, "has_next": true }
}
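On the consumer side, the `pagination` block is enough to walk every page. A minimal sketch, where `fetch(page)` stands in for whatever HTTP client you point at the endpoint above:

```python
from typing import Callable, Iterator

def iter_items(fetch: Callable[[int], dict]) -> Iterator[dict]:
    """Walk a paginated Web2API endpoint until has_next is false.

    fetch(page) returns the JSON body shown above; it's a parameter
    here so the sketch works with any HTTP client.
    """
    page = 1
    while True:
        body = fetch(page)
        yield from body.get("items", [])
        if not body.get("pagination", {}).get("has_next"):
            break
        page += 1

# Fake two-page response standing in for the live endpoint:
pages = {
    1: {"items": [{"title": "a"}], "pagination": {"current_page": 1, "has_next": True}},
    2: {"items": [{"title": "b"}], "pagination": {"current_page": 2, "has_next": False}},
}
titles = [item["title"] for item in iter_items(pages.__getitem__)]
print(titles)  # ['a', 'b']
```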

When YAML Isn't Enough

Some sites require actual interaction — typing into fields, waiting for dynamic content, handling streaming responses. For those, recipes can include a custom Python scraper alongside the YAML:

recipes/
  allenai/
    recipe.yaml     # endpoint definitions
    scraper.py      # custom interaction logic

The scraper gets a blank Playwright page and full control:

class Scraper(BaseScraper):
    def supports(self, endpoint: str) -> bool:
        return endpoint in {"chat", "olmo-32b"}

    async def scrape(self, endpoint, page, params):
        # Navigate, interact, parse streaming responses...
        return ScrapeResult(items=[...])

Endpoints not handled by the scraper fall back to declarative YAML extraction. This hybrid approach means simple sites stay simple, and complex ones get the flexibility they need.

Recipe Management

Recipes live in a catalog — a git repository with available integrations. The service has a CLI and web UI for managing them:

# See what's available
web2api recipes catalog list

# Install one
web2api recipes catalog add hackernews --yes

# Check dependencies
web2api recipes doctor hackernews

You can also install recipes from local paths, custom git repos, or just drop a folder into the recipes directory. The web UI shows both the catalog and installed recipes with one-click install/uninstall.

The MCP Server

This is where it gets interesting. Web2API includes a built-in MCP (Model Context Protocol) server that automatically exposes every recipe endpoint as a native tool for AI assistants.

Install a recipe → it's immediately available as an MCP tool. Uninstall it → the tool disappears. No configuration, no restart needed.

{
  "mcpServers": {
    "web2api": {
      "command": "npx",
      "args": ["-y", "mcp-remote", "https://your-host/mcp/"]
    }
  }
}

Add that to your Claude Desktop config, and suddenly Claude can search the web, translate text, query Hacker News — whatever recipes you have installed.

How Tools Are Built

Each recipe endpoint becomes its own MCP tool with a proper name, description, and typed parameters. The tool registration happens dynamically — when recipes change, tools rebuild automatically:

# Inside _ToolRegistry
async def _fn(**kwargs: str) -> str:
    response = await execute_recipe_endpoint(
        app=self.app,
        recipe=recipe,
        endpoint_name=endpoint,
        page=1,
        q=kwargs.get("q", ""),
        query_params=params,
    )
    return format_tool_result(response.model_dump(mode="json"))

Tools execute recipes in-process — no HTTP self-calls, no overhead. The function signatures are built dynamically with inspect.Signature so MCP clients get proper parameter schemas.
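The signature trick looks roughly like this. A sketch of the technique with made-up names; Web2API's actual registry code differs:

```python
import inspect

def make_tool(name: str, param_names: list[str]):
    """Give a **kwargs function an explicit signature so that
    introspection-based schema generators (like MCP tool registration)
    see real, typed parameters. Hypothetical helper, not Web2API's code."""
    async def _fn(**kwargs: str) -> str:
        return f"{name}({kwargs})"

    params = [
        inspect.Parameter(p, inspect.Parameter.KEYWORD_ONLY,
                          annotation=str, default="")
        for p in param_names
    ]
    # inspect.signature() honors __signature__ over the real **kwargs.
    _fn.__signature__ = inspect.Signature(params, return_annotation=str)
    _fn.__name__ = name
    return _fn

tool = make_tool("web_search", ["q", "page"])
print(inspect.signature(tool))  # (*, q: str = '', page: str = '') -> str
```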

Recipes can also define custom tool_name values for AI-friendly naming:

endpoints:
  search:
    tool_name: "web_search"  # instead of the default "brave-search__search"

This matters more than you'd think — some models struggle with names containing dashes or double underscores.

HTTP Bridge

For non-MCP clients, there's also a simpler HTTP bridge:

# List available tools
curl https://your-host/mcp/tools

# Call a tool
curl -X POST https://your-host/mcp/tools/web_search \
  -H "Content-Type: application/json" \
  -d '{"q": "latest news"}'

The bridge supports filtering by recipe slug — useful when you want to expose only specific tools to a particular consumer:

GET /mcp/only/brave-search/tools     # only brave-search tools
GET /mcp/exclude/allenai/tools       # everything except allenai
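The filtering itself is simple set logic over the tool-to-recipe mapping. An illustrative sketch of what the `only`/`exclude` routes do, not Web2API's implementation:

```python
def filter_tools(tools: dict[str, str], mode: str, slug: str) -> dict[str, str]:
    """tools maps tool name -> recipe slug; keep or drop by slug."""
    if mode == "only":
        return {name: s for name, s in tools.items() if s == slug}
    if mode == "exclude":
        return {name: s for name, s in tools.items() if s != slug}
    return tools

tools = {"web_search": "brave-search", "chat": "allenai", "read": "hackernews"}
print(filter_tools(tools, "only", "brave-search"))  # {'web_search': 'brave-search'}
print(filter_tools(tools, "exclude", "allenai"))    # web_search and read remain
```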

File Uploads

Some recipes need files — vision models that analyze images, document processors, etc. Web2API handles multipart uploads:

curl -X POST "http://localhost:8010/allenai/molmo2?q=Describe+this+image" \
  -F "files=@photo.jpg"

Files are saved to a temp directory, passed to the scraper, and cleaned up after the response. Upload filenames are sanitized against path traversal.
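The path-traversal guard boils down to discarding any directory components the client supplies. An illustrative sketch; the real sanitization in Web2API may be stricter:

```python
from pathlib import Path

def safe_upload_path(upload_dir: Path, filename: str) -> Path:
    """Keep only the final path component of a client-supplied filename
    so it cannot escape the upload directory via ../ or absolute paths."""
    name = Path(filename).name  # drops '../..' and leading directories
    if not name:
        raise ValueError("empty or unsafe filename")
    return upload_dir / name

print(safe_upload_path(Path("/tmp/uploads"), "photo.jpg"))         # /tmp/uploads/photo.jpg
print(safe_upload_path(Path("/tmp/uploads"), "../../etc/passwd"))  # /tmp/uploads/passwd
```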

Architecture

The stack is deliberately simple:

  • FastAPI for the HTTP layer
  • Playwright (Chromium) for browser automation
  • Pydantic for config validation
  • Docker for deployment

A shared browser pool manages Playwright contexts with configurable concurrency and TTL. An in-memory response cache with stale-while-revalidate keeps things fast for repeated queries.

Request → Cache check → Browser pool → Playwright page → Extract → Cache store → Response
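The stale-while-revalidate idea can be sketched in a few lines. This is the general pattern, assuming a simple freshness TTL, not Web2API's actual cache:

```python
import time

class SWRCache:
    """Tiny stale-while-revalidate cache: within fresh_ttl return the
    cached value; past it, still return the (stale) value immediately
    and flag the key so a background task can refresh it."""

    def __init__(self, fresh_ttl: float):
        self.fresh_ttl = fresh_ttl
        self._store: dict[str, tuple[float, object]] = {}
        self.needs_refresh: set[str] = set()

    def put(self, key: str, value: object) -> None:
        self._store[key] = (time.monotonic(), value)
        self.needs_refresh.discard(key)

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if time.monotonic() - stored_at > self.fresh_ttl:
            self.needs_refresh.add(key)  # serve stale now, refresh later
        return value

cache = SWRCache(fresh_ttl=0.01)
cache.put("hackernews/read?page=1", {"items": []})
print(cache.get("hackernews/read?page=1"))  # fresh hit: {'items': []}
time.sleep(0.02)
print(cache.get("hackernews/read?page=1"))  # stale hit, flagged for refresh
```

The payoff is that repeated queries never wait on a browser round-trip: the caller always gets an answer instantly, and only the refresh happens against Playwright.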

What I Use It For

I run Web2API on a VPS behind nginx with a handful of recipes:

  • Allen AI — chat with OLMo and Tülu models, analyze images with Molmo 2
  • Brave Search — web search that my AI tools can call
  • DeepL — translation between German and English
  • Hacker News — front page and search
  • Wikipedia — article search and full content extraction

The MCP server feeds into Claude Desktop for direct tool use, and the HTTP bridge provides web search capabilities to a Telegram bot I built on top of it.

Try It

git clone https://github.com/Endogen/web2api.git
cd web2api
docker compose up --build -d

# Install a recipe
docker compose exec web2api web2api recipes catalog add hackernews --yes

# Query it
curl -s "http://localhost:8010/hackernews/read?page=1" | jq '.items[:3]'

The recipe catalog is open — contributions welcome.

