agenthustler
How to Build a Web Scraping API with FastAPI in 2026

Web scraping has evolved from a niche developer skill into a core data infrastructure component for modern teams. In 2026, if you're running multiple scrapers across different projects, the smartest move is to centralize them behind an API. This tutorial walks you through building a production-grade web scraping API using FastAPI, httpx, and BeautifulSoup4.

Why Build a Scraping API?

Before diving into code, let's talk about why an API layer makes sense:

  • Centralized logic: Fix a scraping issue once, every consumer benefits
  • Team sharing: Data, backend, and ML teams all hit the same endpoint
  • Rate limiting and caching: Protect your IP pool and reduce redundant requests
  • Monetization: Wrap it in an auth layer and charge for access
  • Observability: One place to log errors, measure latency, and track usage

If you're still running one-off scripts per project, you're doing it the hard way.

Project Setup

We'll use three core libraries:

pip install fastapi uvicorn httpx beautifulsoup4 lxml

Here's the project structure we're targeting:

scraping-api/
├── main.py
├── models.py
├── scraper.py
├── cache.py
├── limiter.py
└── Dockerfile

Start by creating your main.py:

from fastapi import FastAPI

app = FastAPI(title="Scraping API", version="1.0.0")

@app.get("/health")
async def health():
    return {"status": "ok"}

Run it with (pinning the port to 9000 so the curl examples and Dockerfile below line up):

uvicorn main:app --reload --port 9000

Basic Scraping Endpoint

Define your Pydantic models in models.py:

from pydantic import BaseModel, HttpUrl
from typing import Optional, List

class ScrapeRequest(BaseModel):
    url: HttpUrl
    selector: Optional[str] = None  # CSS selector to target specific elements

class ScrapeResponse(BaseModel):
    url: str
    title: Optional[str]
    text: Optional[str]
    links: List[str]
    status_code: int
    cached: bool = False

Now build the scraping logic in scraper.py:

import httpx
from bs4 import BeautifulSoup
from typing import Optional, List

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    )
}

async def scrape_url(url: str, selector: Optional[str] = None):
    async with httpx.AsyncClient(headers=HEADERS, timeout=15.0, follow_redirects=True) as client:
        response = await client.get(url)
        response.raise_for_status()

    soup = BeautifulSoup(response.text, "lxml")

    # soup.title.string can be None even when a <title> tag exists
    title = soup.title.string.strip() if soup.title and soup.title.string else None

    if selector:
        target = soup.select(selector)
        text = " ".join(el.get_text(strip=True) for el in target)
    else:
        text = soup.get_text(separator=" ", strip=True)[:5000]

    links = [
        a["href"] for a in soup.find_all("a", href=True)
        if a["href"].startswith("http")
    ][:50]

    return {
        "url": str(url),
        "title": title,
        "text": text,
        "links": links,
        "status_code": response.status_code,
        "cached": False,
    }
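One caveat: the `startswith("http")` filter above silently drops relative links like `/about`. If you want those too, the standard library's `urljoin` can resolve them against the page URL first. A small sketch of that variant (pure string handling, no network; the `absolutize_links` helper name is my own, not part of the code above):

```python
from urllib.parse import urljoin

def absolutize_links(page_url: str, hrefs: list[str], limit: int = 50) -> list[str]:
    """Resolve relative hrefs against the page URL, keeping only http(s) results."""
    resolved = []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # no-op for already-absolute URLs
        if absolute.startswith(("http://", "https://")):  # drops mailto:, javascript:, etc.
            resolved.append(absolute)
    return resolved[:limit]

links = absolutize_links(
    "https://example.com/docs/",
    ["page.html", "/about", "https://other.com/x", "mailto:hi@example.com"],
)
```

You could drop this into `scrape_url` in place of the list comprehension, feeding it `[a["href"] for a in soup.find_all("a", href=True)]`.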

Wire it into main.py:

import httpx
from fastapi import FastAPI, HTTPException
from models import ScrapeRequest, ScrapeResponse
from scraper import scrape_url

app = FastAPI(title="Scraping API", version="1.0.0")

@app.post("/scrape", response_model=ScrapeResponse)
async def scrape(request: ScrapeRequest):
    try:
        result = await scrape_url(str(request.url), request.selector)
        return result
    except httpx.TimeoutException:
        raise HTTPException(status_code=504, detail="Target URL timed out")
    except httpx.HTTPStatusError as e:
        raise HTTPException(status_code=e.response.status_code, detail=str(e))
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

Test it against your running server:

curl -X POST http://localhost:9000/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

Adding Async Batch Scraping

One of FastAPI's biggest strengths is native async support. Use asyncio.gather to scrape multiple URLs concurrently:

import asyncio
from typing import List, Optional
from pydantic import BaseModel, HttpUrl

class BatchScrapeRequest(BaseModel):
    urls: List[HttpUrl]
    selector: Optional[str] = None

class BatchScrapeResponse(BaseModel):
    results: List[dict]
    total: int
    errors: int

@app.post("/scrape/batch", response_model=BatchScrapeResponse)
async def scrape_batch(request: BatchScrapeRequest):
    if len(request.urls) > 20:
        raise HTTPException(status_code=400, detail="Max 20 URLs per batch")

    tasks = [scrape_url(str(url), request.selector) for url in request.urls]
    raw_results = await asyncio.gather(*tasks, return_exceptions=True)

    results = []
    errors = 0

    for url, result in zip(request.urls, raw_results):
        if isinstance(result, Exception):
            errors += 1
            results.append({"url": str(url), "error": str(result)})
        else:
            results.append(result)

    return {"results": results, "total": len(request.urls), "errors": errors}

Scraping 10 URLs that take about a second each would cost roughly 10 seconds sequentially; run concurrently through asyncio.gather, the whole batch completes in roughly 1 second. That's the power of async.
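You can verify that speedup without touching the network by simulating each scrape with `asyncio.sleep` (the `fake_scrape` stand-in below is illustrative, not part of the API):

```python
import asyncio
import time

async def fake_scrape(url: str) -> dict:
    """Stand-in for scrape_url: pretends each page takes 0.3s to fetch."""
    await asyncio.sleep(0.3)
    return {"url": url, "status_code": 200}

async def batch(urls: list[str]) -> list[dict]:
    # Same pattern as the /scrape/batch handler: fire all tasks, await together
    return await asyncio.gather(*(fake_scrape(u) for u in urls))

start = time.perf_counter()
results = asyncio.run(batch([f"https://example.com/{i}" for i in range(10)]))
elapsed = time.perf_counter() - start
print(f"{len(results)} pages in {elapsed:.2f}s")
```

Sequentially this would take about 3 seconds; gathered, it finishes in roughly the time of the single slowest task.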

Rate Limiting

Without rate limiting, your scraping API is one aggressive client away from getting your IPs banned or your server overloaded. Here's a simple in-memory rate limiter:

# limiter.py
import time
from collections import defaultdict
from fastapi import Request, HTTPException

RATE_LIMIT = 10      # requests
WINDOW = 60          # seconds

request_log: dict = defaultdict(list)

def check_rate_limit(client_ip: str):
    now = time.time()
    window_start = now - WINDOW

    # Clean old entries
    request_log[client_ip] = [
        ts for ts in request_log[client_ip] if ts > window_start
    ]

    if len(request_log[client_ip]) >= RATE_LIMIT:
        raise HTTPException(
            status_code=429,
            detail=f"Rate limit exceeded: {RATE_LIMIT} requests per {WINDOW}s"
        )

    request_log[client_ip].append(now)
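The sliding-window logic is easy to sanity-check outside FastAPI. Here's a minimal standalone copy of the same algorithm, swapping HTTPException for a plain RuntimeError so it runs without the framework:

```python
import time
from collections import defaultdict

RATE_LIMIT = 3   # small limit so the demo trips quickly
WINDOW = 60.0    # seconds

request_log: dict = defaultdict(list)

def check_rate_limit(client_ip: str) -> None:
    now = time.time()
    # Drop timestamps that have aged out of the window
    request_log[client_ip] = [ts for ts in request_log[client_ip] if ts > now - WINDOW]
    if len(request_log[client_ip]) >= RATE_LIMIT:
        raise RuntimeError("rate limit exceeded")
    request_log[client_ip].append(now)

# Three requests pass, the fourth is rejected; a different IP is unaffected
for _ in range(3):
    check_rate_limit("1.2.3.4")
try:
    check_rate_limit("1.2.3.4")
    blocked = False
except RuntimeError:
    blocked = True
check_rate_limit("5.6.7.8")  # independent counter per IP
```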

Add it as a dependency in your routes:

from fastapi import Depends, Request
from limiter import check_rate_limit

@app.post("/scrape", response_model=ScrapeResponse)
async def scrape(request: ScrapeRequest, req: Request):
    check_rate_limit(req.client.host)
    # ... rest of handler

For production, consider slowapi which integrates cleanly with FastAPI and supports Redis-backed limits for multi-instance deployments.

Caching Results

Re-scraping the same URL every few seconds wastes resources and increases ban risk. A simple TTL cache keeps things efficient:

# cache.py
import time
from typing import Any, Optional

class TTLCache:
    def __init__(self, ttl_seconds: int = 300):
        self.ttl = ttl_seconds
        self._store: dict = {}

    def get(self, key: str) -> Optional[Any]:
        if key in self._store:
            value, timestamp = self._store[key]
            if time.time() - timestamp < self.ttl:
                return value
            del self._store[key]
        return None

    def set(self, key: str, value: Any):
        self._store[key] = (value, time.time())

    def clear_expired(self):
        now = time.time()
        self._store = {
            k: v for k, v in self._store.items()
            if now - v[1] < self.ttl
        }

scrape_cache = TTLCache(ttl_seconds=300)  # 5 minute TTL
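A quick expiry check of the cache semantics, re-declaring a trimmed copy of the class so it runs standalone, with a sub-second TTL for speed:

```python
import time
from typing import Any, Optional

class TTLCache:
    def __init__(self, ttl_seconds: float = 300):
        self.ttl = ttl_seconds
        self._store: dict = {}

    def get(self, key: str) -> Optional[Any]:
        if key in self._store:
            value, timestamp = self._store[key]
            if time.time() - timestamp < self.ttl:
                return value
            del self._store[key]  # lazily evict on expired read
        return None

    def set(self, key: str, value: Any):
        self._store[key] = (value, time.time())

cache = TTLCache(ttl_seconds=0.2)
cache.set("https://example.com", {"title": "Example"})
fresh = cache.get("https://example.com")   # hit: entry is still live
time.sleep(0.3)
stale = cache.get("https://example.com")   # miss: TTL expired, entry evicted
```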

Update your scrape endpoint to use it:

from cache import scrape_cache

@app.post("/scrape", response_model=ScrapeResponse)
async def scrape(request: ScrapeRequest, req: Request):
    check_rate_limit(req.client.host)
    cache_key = f"{request.url}::{request.selector}"

    cached = scrape_cache.get(cache_key)
    if cached:
        # Return a copy so we never mutate the entry stored in the cache
        return {**cached, "cached": True}

    result = await scrape_url(str(request.url), request.selector)
    scrape_cache.set(cache_key, result)
    return result

Error Handling

Robust error handling separates toy scripts from production APIs. Common failure modes to handle:

from fastapi import HTTPException
import httpx

from scraper import scrape_url

async def scrape_url_safe(url: str, selector=None):
    try:
        return await scrape_url(url, selector)

    except httpx.TimeoutException:
        raise HTTPException(
            status_code=504,
            detail={"error": "timeout", "url": url, "message": "Target took too long to respond"}
        )

    except httpx.HTTPStatusError as e:
        raise HTTPException(
            status_code=e.response.status_code,
            detail={"error": "http_error", "url": url, "status": e.response.status_code}
        )

    except httpx.ConnectError:
        raise HTTPException(
            status_code=502,
            detail={"error": "connect_error", "url": url, "message": "Could not reach target host"}
        )

    except ValueError as e:
        raise HTTPException(
            status_code=400,
            detail={"error": "invalid_input", "message": str(e)}
        )

Return a consistent error envelope across all endpoints — your API consumers will thank you.
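One way to enforce that consistency is a tiny helper that every handler funnels its `detail` payloads through. The `error_envelope` name below is my own invention, not part of the code above:

```python
from typing import Optional

def error_envelope(error: str, url: Optional[str] = None,
                   message: Optional[str] = None,
                   status: Optional[int] = None) -> dict:
    """Build a consistent error `detail` payload, dropping unset fields."""
    payload = {"error": error, "url": url, "message": message, "status": status}
    return {k: v for k, v in payload.items() if v is not None}

timeout = error_envelope("timeout", url="https://example.com",
                         message="Target took too long to respond")
http_err = error_envelope("http_error", url="https://example.com", status=403)
```

Each `except` branch then becomes `raise HTTPException(status_code=..., detail=error_envelope(...))`, and consumers can always key on the same `error` field.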

Deploying to Production

Docker

FROM python:3.12-slim

WORKDIR /app
COPY . .

RUN pip install --no-cache-dir fastapi uvicorn httpx beautifulsoup4 lxml

EXPOSE 9000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "9000"]

Build and run:

docker build -t scraping-api .
docker run -p 9000:9000 scraping-api

Cloud Platforms

  • Railway: Push your repo, add a railway.json with your start command, done
  • Fly.io: fly launch auto-detects the Dockerfile and deploys in minutes
  • VPS (DigitalOcean, Hetzner, Vultr): Docker Compose + Nginx reverse proxy for more control

For a lightweight API that doesn't need persistent storage, Railway or Fly.io are the fastest paths to production.

Using Proxy Services for Scale

Here's where most tutorials stop, but where real production scraping begins. Scraping at scale means dealing with:

  • IP bans and CAPTCHAs
  • Dynamic JavaScript rendering
  • Geographic restrictions
  • Rate limiting by target sites

If your scraping API needs to handle serious volume, you'll want managed proxy infrastructure rather than maintaining your own.

ScraperAPI is one of the most battle-tested options. It handles proxy rotation, CAPTCHA solving, and header management automatically. Instead of a raw httpx.get(url), you pass the URL through their proxy endpoint and get clean HTML back — no more blocked requests, no IP management overhead.

ThorData offers residential proxy pools — IPs that look like real home users rather than datacenter blocks. Residential proxies are significantly harder for sites to detect and block, making them worth the premium for high-value targets.

Apify takes a different approach entirely: rather than managing proxies yourself, you use pre-built actors for common scraping targets (Amazon, Google, LinkedIn, etc.) or deploy your own scrapers to their managed cloud. If you'd rather not operate infrastructure at all, Apify handles everything for you.

The right choice depends on your volume and technical appetite. For a self-hosted FastAPI setup, ScraperAPI or ThorData complement your architecture cleanly — just update your httpx client to route through their endpoints.
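As a sketch of what "routing through their endpoints" can look like: many proxy APIs, ScraperAPI among them, expose an HTTP endpoint that takes your key and target URL as query parameters, so the only change on your side is the URL you fetch. The endpoint and parameter names below follow ScraperAPI's commonly documented pattern, but verify them against the provider's docs before relying on this:

```python
from urllib.parse import urlencode

def via_proxy_api(target_url: str, api_key: str,
                  endpoint: str = "https://api.scraperapi.com/") -> str:
    """Wrap a target URL so the request is fetched through the proxy service."""
    # urlencode percent-escapes the target URL so its own query string survives
    return f"{endpoint}?{urlencode({'api_key': api_key, 'url': target_url})}"

# In scrape_url, you would fetch via_proxy_api(url, API_KEY) instead of url
proxied = via_proxy_api("https://example.com/page?id=1", "YOUR_KEY")
```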

Conclusion

You now have the foundation for a production-grade web scraping API:

  • Async scraping with httpx and BeautifulSoup4
  • Batch endpoint for concurrent multi-URL scraping
  • Rate limiting to protect your infrastructure
  • TTL caching to reduce redundant requests
  • Clean error handling with consistent response shapes
  • Docker-ready for deployment

The code shown here is deliberately minimal — production deployments would add authentication (API keys or JWT), persistent caching with Redis, metrics and logging with Prometheus or structured logs, and a queue for large batch jobs.

Start simple, measure what matters, and layer in complexity only as your use case demands it. Happy scraping.
