agenthustler
How to Build a Web Scraping API with FastAPI in 2026

Web scraping has evolved from a niche developer skill into a core data infrastructure component for modern teams. In 2026, if you're running multiple scrapers across different projects, the smartest move is to centralize them behind an API. This tutorial walks you through building a production-grade web scraping API using FastAPI, httpx, and BeautifulSoup4.

Why Build a Scraping API?

Before diving into code, let's talk about why an API layer makes sense:

  • Centralized logic: Fix a scraping issue once, every consumer benefits
  • Team sharing: Data, backend, and ML teams all hit the same endpoint
  • Rate limiting and caching: Protect your IP pool and reduce redundant requests
  • Monetization: Wrap it in an auth layer and charge for access
  • Observability: One place to log errors, measure latency, and track usage

If you're still running one-off scripts per project, you're doing it the hard way.

Project Setup

We'll use three core libraries:

pip install fastapi uvicorn httpx beautifulsoup4 lxml

Here's the project structure we're targeting:

scraping-api/
├── main.py
├── models.py
├── scraper.py
├── cache.py
├── limiter.py
└── Dockerfile

Start by creating your main.py:

from fastapi import FastAPI

app = FastAPI(title="Scraping API", version="1.0.0")

@app.get("/health")
async def health():
    return {"status": "ok"}

Run it with (pinning the port to 9000 so the curl examples and Dockerfile below line up):

uvicorn main:app --reload --port 9000

Basic Scraping Endpoint

Define your Pydantic models in models.py:

from pydantic import BaseModel, HttpUrl
from typing import Optional, List

class ScrapeRequest(BaseModel):
    url: HttpUrl
    selector: Optional[str] = None  # CSS selector to target specific elements

class ScrapeResponse(BaseModel):
    url: str
    title: Optional[str]
    text: Optional[str]
    links: List[str]
    status_code: int
    cached: bool = False

Now build the scraping logic in scraper.py:

import httpx
from bs4 import BeautifulSoup
from typing import Optional, List

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    )
}

async def scrape_url(url: str, selector: Optional[str] = None):
    async with httpx.AsyncClient(headers=HEADERS, timeout=15.0, follow_redirects=True) as client:
        response = await client.get(url)
        response.raise_for_status()

    soup = BeautifulSoup(response.text, "lxml")

    # soup.title.string can be None even when a <title> tag exists
    title = soup.title.string.strip() if soup.title and soup.title.string else None

    if selector:
        target = soup.select(selector)
        text = " ".join(el.get_text(strip=True) for el in target)
    else:
        text = soup.get_text(separator=" ", strip=True)[:5000]

    links = [
        a["href"] for a in soup.find_all("a", href=True)
        if a["href"].startswith("http")
    ][:50]

    return {
        "url": str(url),
        "title": title,
        "text": text,
        "links": links,
        "status_code": response.status_code,
        "cached": False,
    }
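One caveat: the `startswith("http")` filter above silently drops relative links like `/about`. If you want those too, the standard library's `urljoin` can resolve them against the page URL first. A small sketch of that variant (pure string handling, no network; the `absolutize_links` helper name is my own, not part of the code above):

```python
from urllib.parse import urljoin

def absolutize_links(page_url: str, hrefs: list[str], limit: int = 50) -> list[str]:
    """Resolve relative hrefs against the page URL, keeping only http(s) results."""
    resolved = []
    for href in hrefs:
        absolute = urljoin(page_url, href)  # no-op for already-absolute URLs
        if absolute.startswith(("http://", "https://")):  # drops mailto:, javascript:, etc.
            resolved.append(absolute)
    return resolved[:limit]

links = absolutize_links(
    "https://example.com/docs/",
    ["page.html", "/about", "https://other.com/x", "mailto:hi@example.com"],
)
```

You could drop this into `scrape_url` in place of the list comprehension, feeding it `[a["href"] for a in soup.find_all("a", href=True)]`.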

Wire it into main.py:

import httpx
from fastapi import FastAPI, HTTPException
from models import ScrapeRequest, ScrapeResponse
from scraper import scrape_url

app = FastAPI(title="Scraping API", version="1.0.0")

@app.post("/scrape", response_model=ScrapeResponse)
async def scrape(request: ScrapeRequest):
    try:
        result = await scrape_url(str(request.url), request.selector)
        return result
    except httpx.TimeoutException:
        raise HTTPException(status_code=504, detail="Target URL timed out")
    except httpx.HTTPStatusError as e:
        raise HTTPException(status_code=e.response.status_code, detail=str(e))
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

Test it against your running server:

curl -X POST http://localhost:9000/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'

Adding Async Batch Scraping

One of FastAPI's biggest strengths is native async support. Use asyncio.gather to scrape multiple URLs concurrently:

import asyncio
from typing import List, Optional
from pydantic import BaseModel, HttpUrl

class BatchScrapeRequest(BaseModel):
    urls: List[HttpUrl]
    selector: Optional[str] = None

class BatchScrapeResponse(BaseModel):
    results: List[dict]
    total: int
    errors: int

@app.post("/scrape/batch", response_model=BatchScrapeResponse)
async def scrape_batch(request: BatchScrapeRequest):
    if len(request.urls) > 20:
        raise HTTPException(status_code=400, detail="Max 20 URLs per batch")

    tasks = [scrape_url(str(url), request.selector) for url in request.urls]
    raw_results = await asyncio.gather(*tasks, return_exceptions=True)

    results = []
    errors = 0

    for url, result in zip(request.urls, raw_results):
        if isinstance(result, Exception):
            errors += 1
            results.append({"url": str(url), "error": str(result)})
        else:
            results.append(result)

    return {"results": results, "total": len(request.urls), "errors": errors}

Scraping 10 URLs that take about a second each would cost roughly 10 seconds sequentially; run concurrently through asyncio.gather, the whole batch completes in roughly 1 second. That's the power of async.
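You can verify that speedup without touching the network by simulating each scrape with `asyncio.sleep` (the `fake_scrape` stand-in below is illustrative, not part of the API):

```python
import asyncio
import time

async def fake_scrape(url: str) -> dict:
    """Stand-in for scrape_url: pretends each page takes 0.3s to fetch."""
    await asyncio.sleep(0.3)
    return {"url": url, "status_code": 200}

async def batch(urls: list[str]) -> list[dict]:
    # Same pattern as the /scrape/batch handler: fire all tasks, await together
    return await asyncio.gather(*(fake_scrape(u) for u in urls))

start = time.perf_counter()
results = asyncio.run(batch([f"https://example.com/{i}" for i in range(10)]))
elapsed = time.perf_counter() - start
print(f"{len(results)} pages in {elapsed:.2f}s")
```

Sequentially this would take about 3 seconds; gathered, it finishes in roughly the time of the single slowest task.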

Rate Limiting

Without rate limiting, your scraping API is one aggressive client away from getting your IPs banned or your server overloaded. Here's a simple in-memory rate limiter:

# limiter.py
import time
from collections import defaultdict
from fastapi import Request, HTTPException

RATE_LIMIT = 10      # requests
WINDOW = 60          # seconds

request_log: dict = defaultdict(list)

def check_rate_limit(client_ip: str):
    now = time.time()
    window_start = now - WINDOW

    # Clean old entries
    request_log[client_ip] = [
        ts for ts in request_log[client_ip] if ts > window_start
    ]

    if len(request_log[client_ip]) >= RATE_LIMIT:
        raise HTTPException(
            status_code=429,
            detail=f"Rate limit exceeded: {RATE_LIMIT} requests per {WINDOW}s"
        )

    request_log[client_ip].append(now)
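The sliding-window logic is easy to sanity-check outside FastAPI. Here's a minimal standalone copy of the same algorithm, swapping HTTPException for a plain RuntimeError so it runs without the framework:

```python
import time
from collections import defaultdict

RATE_LIMIT = 3   # small limit so the demo trips quickly
WINDOW = 60.0    # seconds

request_log: dict = defaultdict(list)

def check_rate_limit(client_ip: str) -> None:
    now = time.time()
    # Drop timestamps that have aged out of the window
    request_log[client_ip] = [ts for ts in request_log[client_ip] if ts > now - WINDOW]
    if len(request_log[client_ip]) >= RATE_LIMIT:
        raise RuntimeError("rate limit exceeded")
    request_log[client_ip].append(now)

# Three requests pass, the fourth is rejected; a different IP is unaffected
for _ in range(3):
    check_rate_limit("1.2.3.4")
try:
    check_rate_limit("1.2.3.4")
    blocked = False
except RuntimeError:
    blocked = True
check_rate_limit("5.6.7.8")  # independent counter per IP
```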

Add it as a dependency in your routes:

from fastapi import Depends, Request
from limiter import check_rate_limit

@app.post("/scrape", response_model=ScrapeResponse)
async def scrape(request: ScrapeRequest, req: Request):
    check_rate_limit(req.client.host)
    # ... rest of handler

For production, consider slowapi which integrates cleanly with FastAPI and supports Redis-backed limits for multi-instance deployments.

Caching Results

Re-scraping the same URL every few seconds wastes resources and increases ban risk. A simple TTL cache keeps things efficient:

# cache.py
import time
from typing import Any, Optional

class TTLCache:
    def __init__(self, ttl_seconds: int = 300):
        self.ttl = ttl_seconds
        self._store: dict = {}

    def get(self, key: str) -> Optional[Any]:
        if key in self._store:
            value, timestamp = self._store[key]
            if time.time() - timestamp < self.ttl:
                return value
            del self._store[key]
        return None

    def set(self, key: str, value: Any):
        self._store[key] = (value, time.time())

    def clear_expired(self):
        now = time.time()
        self._store = {
            k: v for k, v in self._store.items()
            if now - v[1] < self.ttl
        }

scrape_cache = TTLCache(ttl_seconds=300)  # 5 minute TTL
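A quick expiry check of the cache semantics, re-declaring a trimmed copy of the class so it runs standalone, with a sub-second TTL for speed:

```python
import time
from typing import Any, Optional

class TTLCache:
    def __init__(self, ttl_seconds: float = 300):
        self.ttl = ttl_seconds
        self._store: dict = {}

    def get(self, key: str) -> Optional[Any]:
        if key in self._store:
            value, timestamp = self._store[key]
            if time.time() - timestamp < self.ttl:
                return value
            del self._store[key]  # lazily evict on expired read
        return None

    def set(self, key: str, value: Any):
        self._store[key] = (value, time.time())

cache = TTLCache(ttl_seconds=0.2)
cache.set("https://example.com", {"title": "Example"})
fresh = cache.get("https://example.com")   # hit: entry is still live
time.sleep(0.3)
stale = cache.get("https://example.com")   # miss: TTL expired, entry evicted
```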

Update your scrape endpoint to use it:

from cache import scrape_cache

@app.post("/scrape", response_model=ScrapeResponse)
async def scrape(request: ScrapeRequest, req: Request):
    check_rate_limit(req.client.host)
    cache_key = f"{request.url}::{request.selector}"

    cached = scrape_cache.get(cache_key)
    if cached:
        # Return a copy so we never mutate the entry stored in the cache
        return {**cached, "cached": True}

    result = await scrape_url(str(request.url), request.selector)
    scrape_cache.set(cache_key, result)
    return result

Error Handling

Robust error handling separates toy scripts from production APIs. Common failure modes to handle:

from fastapi import HTTPException
import httpx

from scraper import scrape_url

async def scrape_url_safe(url: str, selector=None):
    try:
        return await scrape_url(url, selector)

    except httpx.TimeoutException:
        raise HTTPException(
            status_code=504,
            detail={"error": "timeout", "url": url, "message": "Target took too long to respond"}
        )

    except httpx.HTTPStatusError as e:
        raise HTTPException(
            status_code=e.response.status_code,
            detail={"error": "http_error", "url": url, "status": e.response.status_code}
        )

    except httpx.ConnectError:
        raise HTTPException(
            status_code=502,
            detail={"error": "connect_error", "url": url, "message": "Could not reach target host"}
        )

    except ValueError as e:
        raise HTTPException(
            status_code=400,
            detail={"error": "invalid_input", "message": str(e)}
        )

Return a consistent error envelope across all endpoints — your API consumers will thank you.
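One way to enforce that consistency is a tiny helper that every handler funnels its `detail` payloads through. The `error_envelope` name below is my own invention, not part of the code above:

```python
from typing import Optional

def error_envelope(error: str, url: Optional[str] = None,
                   message: Optional[str] = None,
                   status: Optional[int] = None) -> dict:
    """Build a consistent error `detail` payload, dropping unset fields."""
    payload = {"error": error, "url": url, "message": message, "status": status}
    return {k: v for k, v in payload.items() if v is not None}

timeout = error_envelope("timeout", url="https://example.com",
                         message="Target took too long to respond")
http_err = error_envelope("http_error", url="https://example.com", status=403)
```

Each `except` branch then becomes `raise HTTPException(status_code=..., detail=error_envelope(...))`, and consumers can always key on the same `error` field.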

Deploying to Production

Docker

FROM python:3.12-slim

WORKDIR /app
COPY . .

RUN pip install --no-cache-dir fastapi uvicorn httpx beautifulsoup4 lxml

EXPOSE 9000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "9000"]

Build and run:

docker build -t scraping-api .
docker run -p 9000:9000 scraping-api

Cloud Platforms

  • Railway: Push your repo, add a railway.json with your start command, done
  • Fly.io: fly launch auto-detects the Dockerfile and deploys in minutes
  • VPS (DigitalOcean, Hetzner, Vultr): Docker Compose + Nginx reverse proxy for more control

For a lightweight API that doesn't need persistent storage, Railway or Fly.io are the fastest paths to production.

Using Proxy Services for Scale

Here's where most tutorials stop, but where real production scraping begins. Scraping at scale means dealing with:

  • IP bans and CAPTCHAs
  • Dynamic JavaScript rendering
  • Geographic restrictions
  • Rate limiting by target sites

If your scraping API needs to handle serious volume, you'll want managed proxy infrastructure rather than maintaining your own.

ScraperAPI is one of the most battle-tested options. It handles proxy rotation, CAPTCHA solving, and header management automatically. Instead of a raw httpx.get(url), you pass the URL through their proxy endpoint and get clean HTML back — no more blocked requests, no IP management overhead.

ThorData offers residential proxy pools — IPs that look like real home users rather than datacenter blocks. Residential proxies are significantly harder for sites to detect and block, making them worth the premium for high-value targets.

Apify takes a different approach entirely: rather than managing proxies yourself, you use pre-built actors for common scraping targets (Amazon, Google, LinkedIn, etc.) or deploy your own scrapers to their managed cloud. If you'd rather not operate infrastructure at all, Apify handles everything for you.

The right choice depends on your volume and technical appetite. For a self-hosted FastAPI setup, ScraperAPI or ThorData complement your architecture cleanly — just update your httpx client to route through their endpoints.
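As a sketch of what "routing through their endpoints" can look like: many proxy APIs, ScraperAPI among them, expose an HTTP endpoint that takes your key and target URL as query parameters, so the only change on your side is the URL you fetch. The endpoint and parameter names below follow ScraperAPI's commonly documented pattern, but verify them against the provider's docs before relying on this:

```python
from urllib.parse import urlencode

def via_proxy_api(target_url: str, api_key: str,
                  endpoint: str = "https://api.scraperapi.com/") -> str:
    """Wrap a target URL so the request is fetched through the proxy service."""
    # urlencode percent-escapes the target URL so its own query string survives
    return f"{endpoint}?{urlencode({'api_key': api_key, 'url': target_url})}"

# In scrape_url, you would fetch via_proxy_api(url, API_KEY) instead of url
proxied = via_proxy_api("https://example.com/page?id=1", "YOUR_KEY")
```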

Conclusion

You now have the foundation for a production-grade web scraping API:

  • Async scraping with httpx and BeautifulSoup4
  • Batch endpoint for concurrent multi-URL scraping
  • Rate limiting to protect your infrastructure
  • TTL caching to reduce redundant requests
  • Clean error handling with consistent response shapes
  • Docker-ready for deployment

The code shown here is deliberately minimal — production deployments would add authentication (API keys or JWT), persistent caching with Redis, metrics and logging with Prometheus or structured logs, and a queue for large batch jobs.

Start simple, measure what matters, and layer in complexity only as your use case demands it. Happy scraping.
