DEV Community

agenthustler

Building a Pay-Per-Request Web Scraping API with Apify Actors, FastAPI, and the x402 Protocol

What you'll build: A production Python service that wraps Apify community actors behind a unified async API and charges callers $0.05 USDC per request — no account, no OAuth, no API keys required from the consumer side.


The Problem: Social Data Has No Unified Interface

You'd think in 2026 getting posts from Bluesky, articles from Substack, or stories from Hacker News would be trivial. It isn't.

Each platform has its own quirks:

  • Bluesky (AT Protocol) — has an API, but pagination is cursor-based and the auth flow expects a DID (Decentralized Identifier). Extracting engagement metrics requires chasing multiple nested objects per post.
  • Substack — no public API. Newsletters are behind a mix of public RSS feeds, paywalled content, and inconsistent rendering. The "just scrape it" approach breaks constantly as their markup changes.
  • Hacker News — the Algolia search API is decent, but getting it to behave consistently in a structured pipeline requires managing rate limits, result pagination, and normalizing the response shape.

None of this is insurmountable individually. But if you're building something that needs data from multiple sources — a trend-detection tool, a research assistant, or an AI agent that monitors developer sentiment — you're looking at maintaining three different scrapers, three different error-handling paths, and three different data schemas.

The cleaner alternative: wrap Apify's well-maintained community actors and expose a dead-simple unified API on top. That's what this guide builds.


Prerequisites

  • Python 3.11+
  • An Apify account (free tier works — actors in this guide use minimal compute units)
  • A Base network USDC wallet if you want to test x402 payments (optional; the service also supports API key auth)
  • A Linux server with nginx (this guide uses Alpine Linux; adapt as needed)

Install the Python dependencies:

pip install fastapi uvicorn httpx python-dotenv

Set up your environment variables:

export APIFY_TOKEN="your_apify_api_token"
export API_KEY="your_fallback_api_key"       # optional, for API key auth
export PAY_TO="0xYourUSDCWalletAddress"      # for x402 payments
export BASE_URL="https://your-service.example.com"
export FACILITATOR_URL="https://facilitator.xpay.sh"

Enter Apify Actors: Someone Already Solved the Hard Part

Apify's actor ecosystem is genuinely underrated for this use case. Before writing a single line of custom scraping code, I found community actors that already handle the gnarly bits:

| Platform | Actor | What it does |
| --- | --- | --- |
| Bluesky | Bluesky Scraper | Search posts by keyword, returns engagement metrics |
| Substack | Substack Scraper | Scrape articles from any publication slug |
| Hacker News | HN Search Scraper | Full-text search with score/comment counts |

The Apify API contract is beautifully uniform: POST /v2/acts/{actorId}/run-sync-get-dataset-items. You send your input JSON, wait for the run to complete, and get back a dataset. One pattern, three scrapers. This is exactly the abstraction layer I needed.
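One detail worth knowing before wiring this up: in API URL paths, Apify identifies actors as `username~actor-name` (tilde-separated). A small helper that normalizes IDs and builds the sync-run URL keeps this out of the way — a minimal sketch, where the token value is a placeholder:

```python
def apify_sync_url(actor_id: str, token: str, timeout: int = 300) -> str:
    """Build the run-sync-get-dataset-items URL for an actor.

    Apify URL paths identify actors as "username~actor-name";
    normalizing a slash-separated ID avoids path-matching surprises.
    """
    safe_id = actor_id.replace("/", "~")
    return (
        f"https://api.apify.com/v2/acts/{safe_id}"
        f"/run-sync-get-dataset-items?token={token}&timeout={timeout}"
    )
```

The same URL shape works for every actor in the registry, which is what makes the single-wrapper design below possible.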


Architecture: Three Layers, One File

The entire service is ~300 lines of Python (excluding the HTML landing page). Here's the stack:

Client → nginx (TLS termination) → FastAPI (port 8001) → Apify API
                                          ↑
                               x402 payment verification
                               via facilitator.xpay.sh

Infrastructure: Alpine Linux VPS, 256MB RAM, 3GB disk. FastAPI with async httpx fits comfortably in this envelope.

Layer 1: The Apify Wrapper

Define the actor IDs and a single async function that handles every Apify call:

# app.py
import os
import base64
import json
import httpx
from fastapi import FastAPI, Request, Header, HTTPException
from fastapi.responses import JSONResponse

APIFY_BASE = "https://api.apify.com/v2/acts"
APIFY_TOKEN = os.environ["APIFY_TOKEN"]

ACTORS = {
    # Apify URL paths identify actors as "username~actor-name" (tilde, not slash)
    "bluesky":  "apify~bluesky-scraper",
    "substack": "apify~substack-scraper",
    "hn":       "apify~hacker-news-scraper",
}

app = FastAPI()

async def call_apify(actor_id: str, body: dict, settle_data: dict | None = None):
    url = (
        f"{APIFY_BASE}/{actor_id}/run-sync-get-dataset-items"
        f"?token={APIFY_TOKEN}&timeout=300"
    )
    async with httpx.AsyncClient(timeout=320.0) as client:
        try:
            resp = await client.post(url, json=body)
            resp.raise_for_status()
            headers = {}
            if settle_data:
                encoded = base64.b64encode(json.dumps(settle_data).encode()).decode()
                headers["X-PAYMENT-RESPONSE"] = encoded
            return JSONResponse(
                content=resp.json(),
                status_code=resp.status_code,
                headers=headers,
            )
        except httpx.HTTPStatusError as e:
            return JSONResponse(
                content={"error": str(e), "detail": e.response.text},
                status_code=e.response.status_code,
            )
        except httpx.TimeoutException:
            # Actor overran even the padded client timeout; surface a gateway timeout
            return JSONResponse(
                content={"error": "Apify actor run timed out"},
                status_code=504,
            )

Three things worth noting here:

  • run-sync-get-dataset-items blocks until the run finishes and returns the dataset directly. No polling, no run ID tracking, no separate dataset fetch. For synchronous API semantics this is ideal.
  • timeout=300 on the query string is Apify's actor timeout. httpx.AsyncClient(timeout=320.0) gives an extra 20 seconds for network overhead.
  • settle_data is the x402 payment receipt — included in the response headers so the caller's wallet can confirm settlement. More on this below.
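On the caller's side, that receipt decodes back to plain JSON. A quick sketch of the reverse operation (the fields inside the receipt depend on the facilitator, so the example payload is illustrative):

```python
import base64
import json

def decode_settlement_receipt(header_value: str) -> dict:
    """Reverse the server-side encoding of X-PAYMENT-RESPONSE: base64 → JSON → dict."""
    return json.loads(base64.b64decode(header_value))
```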

The endpoint handlers are deliberately thin:

@app.post("/api/bluesky/search")
async def bluesky_search(request: Request, x_api_key: str | None = Header(None)):
    settle_data = await authenticate_request(request, x_api_key)
    body = await request.json()
    return await call_apify(ACTORS["bluesky"], body, settle_data)

@app.post("/api/hn/search")
async def hn_search(request: Request, x_api_key: str | None = Header(None)):
    settle_data = await authenticate_request(request, x_api_key)
    body = await request.json()
    return await call_apify(ACTORS["hn"], body, settle_data)

@app.post("/api/substack/search")
async def substack_search(request: Request, x_api_key: str | None = Header(None)):
    settle_data = await authenticate_request(request, x_api_key)
    body = await request.json()
    return await call_apify(ACTORS["substack"], body, settle_data)

No business logic, no schema validation at this layer. The actor handles input validation on the Apify side; we forward and return.

Layer 2: Input/Output Schemas

Each endpoint has explicit schemas that serve double duty — used for both x402 payment metadata and for MCP/agent-card discovery:

INPUT_SCHEMAS = {
    "/api/bluesky/search": {
        "type": "object",
        "properties": {
            "searchQuery": {
                "type": "string",
                "description": "Search query for Bluesky posts",
            },
            "maxItems":   {"type": "integer", "default": 10},
            "scrapeType": {"type": "string",  "default": "posts"},
        },
        "required": ["searchQuery"],
    },
    "/api/hn/search": {
        "type": "object",
        "properties": {
            "searchTerms": {
                "type": "array",
                "items": {"type": "string"},
                "description": "List of search terms to query HN",
            },
            "maxResults": {"type": "integer", "default": 10},
        },
        "required": ["searchTerms"],
    },
    "/api/substack/search": {
        "type": "object",
        "properties": {
            "publicationSlug": {
                "type": "string",
                "description": "Substack publication slug (e.g. 'stratechery')",
            },
            "maxArticles": {"type": "integer", "default": 10},
        },
        "required": ["publicationSlug"],
    },
}

Defining these schemas up front pays dividends: they're reused across the x402 payment requirements, the OpenAPI spec, the MCP manifest, and the A2A agent card. When an actor changes a field name, you update it in one place and all four discovery surfaces stay consistent.

Layer 3: x402 — Pay-Per-Request Without an Account

x402 is an emerging HTTP payment protocol built on ERC-20 tokens (specifically USDC on Base in this implementation). The full flow:

  1. Client sends a request with no credentials
  2. Server returns HTTP 402 with a PAYMENT-REQUIRED header containing payment details
  3. Client sends a signed payment in an X-PAYMENT header
  4. Server verifies + settles with a facilitator service, then forwards to Apify
  5. Server returns data with X-PAYMENT-RESPONSE containing the settlement receipt
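From the client's side, steps 2 and 3 are mostly header plumbing: parse the requirements out of the 402 response, then send back a base64-encoded signed payload. A framework-free sketch of just that plumbing (the actual payment signing, omitted here, is what libraries like x402-fetch handle):

```python
import base64
import json

def parse_payment_required(header_value: str) -> dict:
    """Step 2: pull the first accepted payment option out of the 402 header."""
    return json.loads(header_value)["accepts"][0]

def encode_x_payment(signed_payload: dict) -> str:
    """Step 3: x402 payments travel as base64-encoded JSON in the X-PAYMENT header."""
    return base64.b64encode(json.dumps(signed_payload).encode()).decode()
```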

Building the payment requirements response:

BASE_URL = os.environ["BASE_URL"]
PAY_TO   = os.environ["PAY_TO"]   # Your USDC address on Base

def make_payment_requirements(resource_path: str) -> dict:
    accept = {
        "scheme":           "exact",
        "mimeType":         "application/json",
        "network":          "base",
        "maxAmountRequired": "50000",   # $0.05 USDC (6 decimal places)
        "resource":         f"{BASE_URL}{resource_path}",
        "description":      f"API access: {resource_path}",
        "payTo":            PAY_TO,
        "maxTimeoutSeconds": 60,
        "asset":            "0x833589fCD6eDb6E08f4c7C32D4f71b54bdA02913",  # USDC on Base
        "extra":            {"name": "USDC", "version": "2"},
    }
    if resource_path in INPUT_SCHEMAS:
        accept["inputSchema"] = INPUT_SCHEMAS[resource_path]
    return {"x402Version": 1, "error": "Payment required", "accepts": [accept]}

The verify-and-settle flow delegates all blockchain complexity to a facilitator:

FACILITATOR_URL = os.environ.get("FACILITATOR_URL", "https://facilitator.xpay.sh")

async def verify_and_settle_payment(
    payment_b64: str, resource_path: str
) -> tuple[bool, str, dict | None]:
    payment_payload     = json.loads(base64.b64decode(payment_b64))
    payment_requirements = make_payment_requirements(resource_path)["accepts"][0]

    facilitator_body = {
        "x402Version":        1,
        "paymentPayload":     payment_payload,
        "paymentRequirements": payment_requirements,
    }

    async with httpx.AsyncClient(timeout=30.0) as client:
        # Step 1: verify the payment is valid BEFORE calling Apify
        verify_resp = await client.post(
            f"{FACILITATOR_URL}/verify", json=facilitator_body
        )
        result = verify_resp.json()
        if not result.get("isValid", False):
            return False, f"Payment invalid: {result.get('invalidReason')}", None

        # Step 2: settle — this actually moves the USDC on-chain
        settle_resp = await client.post(
            f"{FACILITATOR_URL}/settle", json=facilitator_body
        )
        result = settle_resp.json()
        if not result.get("success", False):
            return False, "Settlement failed", None

        return True, "", result

The two-step verify-then-settle pattern is important: you don't want to call an expensive actor run and then discover the payment was invalid.

Authentication Fallback

The service supports both x402 and a traditional API key for development and backward compatibility:

async def authenticate_request(
    request: Request, x_api_key: str | None
) -> dict | None:
    # Path 1: API key auth — returns None (no settlement receipt)
    if x_api_key and x_api_key == os.environ.get("API_KEY"):
        return None

    # Path 2: x402 payment — returns settlement receipt
    x_payment = request.headers.get("x-payment")
    if x_payment:
        success, error_msg, settle_data = await verify_and_settle_payment(
            x_payment, request.url.path
        )
        if success:
            return settle_data
        raise HTTPException(status_code=400, detail=error_msg)

    # Path 3: neither — trigger the 402 flow
    requirements = make_payment_requirements(request.url.path)
    raise HTTPException(
        status_code=402,
        headers={"PAYMENT-REQUIRED": json.dumps(requirements)},
    )

Agent Discovery: Making the API AI-Native

An unexpected benefit of x402 is that it makes the API discoverable by AI agents without any human account creation. The service exposes three well-known endpoints:

GET /.well-known/x402           → list of paid resources with schemas
GET /.well-known/agent-card.json → Google A2A protocol agent card
GET /.well-known/mcp.json       → MCP tool manifest for Claude/GPT agents

The MCP endpoint exposes each scraping endpoint as a tool with full input/output schemas and payment info. An AI agent that understands x402 can discover this service, verify the cost ($0.05/call), autonomously pay with its own wallet, and get structured data — no human in the loop.

Example MCP tool entry (generated dynamically from INPUT_SCHEMAS):

@app.get("/.well-known/mcp.json")
async def mcp_manifest():
    tools = []
    for path, schema in INPUT_SCHEMAS.items():
        tools.append({
            "name":        path.lstrip("/").replace("/", "_"),
            "description": f"Scrape data via {path}. Costs $0.05 USDC per call.",
            "inputSchema": schema,
            "payment": {
                "required": True,
                "amount":   "$0.05 USDC",
                "protocol": "x402",
            },
        })
    return {"tools": tools}

Deployment: Alpine Linux, 256MB RAM

The server is a small Alpine Linux container. Getting FastAPI running reliably on 256MB required a few deliberate choices:

  • Uvicorn, single worker. Multiple workers would exceed the RAM budget. One is fine for the traffic level of a side-project API.
  • Async throughout. Apify calls can take 30–60 seconds. Synchronous handling would starve the event loop. Every I/O operation uses async/await with httpx.
  • No database. Payment verification and settlement are stateless — the facilitator handles all that. No SQLite, no Redis, no persistence layer.
  • nginx as TLS proxy. FastAPI binds to localhost:8001; nginx terminates TLS on the VPS provider's subdomain and proxies requests through to it.
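The nginx side is a plain reverse-proxy block. A minimal sketch, assuming certificates are already provisioned — the server name and certificate paths are placeholders, and the long `proxy_read_timeout` matters because actor runs can block for up to five minutes:

```nginx
server {
    listen 443 ssl;
    server_name your-service.example.com;

    ssl_certificate     /etc/ssl/your-service/fullchain.pem;
    ssl_certificate_key /etc/ssl/your-service/privkey.pem;

    location / {
        proxy_pass http://127.0.0.1:8001;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-Proto https;
        # run-sync-get-dataset-items can block for up to 300s + overhead;
        # don't let nginx cut the connection before the actor finishes
        proxy_read_timeout 330s;
    }
}
```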

The service startup script:

#!/bin/sh
pkill -f "uvicorn app:app" || true
sleep 1
cd /home/frog/hustler-proxy
nohup uvicorn app:app --host 127.0.0.1 --port 8001 > uvicorn.log 2>&1 &
echo "Started PID $!"

Testing the Full Flow

Check the 402 response before wiring up a wallet:

curl -si -X POST https://your-service.example.com/api/hn/search \
  -H "Content-Type: application/json" \
  -d '{"searchTerms": ["apify", "web scraping"], "maxResults": 3}'

Expected response:

HTTP/2 402
PAYMENT-REQUIRED: {"x402Version":1,"error":"Payment required","accepts":[...]}

Test with API key auth first to confirm the Apify integration works:

curl -s -X POST https://your-service.example.com/api/hn/search \
  -H "Content-Type: application/json" \
  -H "X-API-Key: your_fallback_api_key" \
  -d '{"searchTerms": ["apify", "web scraping"], "maxResults": 3}' \
  | python3 -m json.tool

Once confirmed, test x402 using the x402-fetch JavaScript library:

import { withPaymentInterceptor } from "x402-fetch";
import { privateKeyToAccount } from "viem/accounts";

const account = privateKeyToAccount("0x…your_private_key…");
const fetch402 = withPaymentInterceptor(fetch, account);

const res = await fetch402("https://your-service.example.com/api/hn/search", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ searchTerms: ["apify", "web scraping"], maxResults: 5 }),
});

const stories = await res.json();
console.log(stories);

Lessons Learned

1. Apify actors remove the maintenance burden, not just the initial work.
The Bluesky AT Protocol changed its auth requirements twice in six weeks. Both times, the community actor was updated within a day. Rolling your own scraper means those breaking changes are your problem to chase.

2. x402 is spec-fragile in ways the documentation doesn't warn you about.
The PAYMENT-REQUIRED header must contain the full JSON body, not a pointer to it. The mimeType field in the accepts array must be present. The inputSchema must be inside the accepts array, not in the outer body. Each of these I discovered by running the x402scan tool against my own service and reading the failure messages carefully.

3. The facilitator abstraction is the right call for small services.
Doing on-chain verification yourself means running a node or paying for an RPC provider. The facilitator pattern (verify + settle as a service) adds a trusted third party but removes 200+ lines of web3 code from your application. For a $0.05/call service the trade-off is obvious.

4. run-sync-get-dataset-items is the correct Apify endpoint for synchronous APIs.
The two-step run-then-fetch pattern (POST /runs → poll → GET /datasets/{id}/items) is fine for async jobs but adds 2–4 extra HTTP round-trips for every request. The sync endpoint blocks and returns the dataset in one call. Actors that take 30+ seconds to complete are fine — httpx handles the timeout gracefully.

5. Schema definitions are the API's contract, not an afterthought.
Defining INPUT_SCHEMAS up front meant they could be reused across the x402 payment requirements, the OpenAPI spec, the MCP manifest, and the A2A agent card. When an actor changed a field name, updating one dict kept all four discovery surfaces consistent.


What to Build Next

This pattern generalizes well beyond social data:

  • E-commerce price monitor: wrap an Amazon product scraper actor with x402 pricing per ASIN lookup
  • Job listing aggregator: combine LinkedIn, Indeed, and Glassdoor actors behind a single /api/jobs/search endpoint
  • AI research assistant: expose a set of actors as MCP tools so a Claude agent can discover and pay for web data autonomously
  • Actor usage analytics: log every call_apify() invocation to SQLite and expose a /api/stats endpoint — Apify's Actor run logs give you timing and compute unit cost per call, which lets you set prices based on actual cost

The core insight is that Apify's uniform API contract (run-sync-get-dataset-items + dataset response) makes it trivial to add new data sources. Adding a new actor is a one-line dict entry in ACTORS plus a new endpoint handler.


All code in this article is production code running on a live service. The approach works on the free Apify tier for low-volume use cases; higher traffic will require a paid Apify plan to cover compute unit costs.
