Web scraping has evolved from a niche developer skill into a core data infrastructure component for modern teams. In 2026, if you're running multiple scrapers across different projects, the smartest move is to centralize them behind an API. This tutorial walks you through building a production-grade web scraping API using FastAPI, httpx, and BeautifulSoup4.
Why Build a Scraping API?
Before diving into code, let's talk about why an API layer makes sense:
- Centralized logic: Fix a scraping issue once, every consumer benefits
- Team sharing: Data, backend, and ML teams all hit the same endpoint
- Rate limiting and caching: Protect your IP pool and reduce redundant requests
- Monetization: Wrap it in an auth layer and charge for access
- Observability: One place to log errors, measure latency, and track usage
If you're still running one-off scripts per project, you're doing it the hard way.
Project Setup
We'll use three core libraries:
```bash
pip install fastapi uvicorn httpx beautifulsoup4 lxml
```
Here's the project structure we're targeting:
```
scraping-api/
├── main.py
├── models.py
├── scraper.py
├── cache.py
├── limiter.py
└── Dockerfile
```
Start by creating your main.py:
```python
from fastapi import FastAPI

app = FastAPI(title="Scraping API", version="1.0.0")

@app.get("/health")
async def health():
    return {"status": "ok"}
```
Run it with:
```bash
uvicorn main:app --reload --port 9000
```
Basic Scraping Endpoint
Define your Pydantic models in models.py:
```python
from pydantic import BaseModel, HttpUrl
from typing import Optional, List

class ScrapeRequest(BaseModel):
    url: HttpUrl
    selector: Optional[str] = None  # CSS selector to target specific elements

class ScrapeResponse(BaseModel):
    url: str
    title: Optional[str]
    text: Optional[str]
    links: List[str]
    status_code: int
    cached: bool = False
```
Now build the scraping logic in scraper.py:
```python
import httpx
from bs4 import BeautifulSoup
from typing import Optional

HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    )
}

async def scrape_url(url: str, selector: Optional[str] = None) -> dict:
    async with httpx.AsyncClient(headers=HEADERS, timeout=15.0, follow_redirects=True) as client:
        response = await client.get(url)
        response.raise_for_status()

    soup = BeautifulSoup(response.text, "lxml")
    # get_text avoids an AttributeError when <title> exists but has no string child
    title = soup.title.get_text(strip=True) if soup.title else None

    if selector:
        target = soup.select(selector)
        text = " ".join(el.get_text(strip=True) for el in target)
    else:
        text = soup.get_text(separator=" ", strip=True)[:5000]

    links = [
        a["href"] for a in soup.find_all("a", href=True)
        if a["href"].startswith("http")
    ][:50]

    return {
        "url": str(url),
        "title": title,
        "text": text,
        "links": links,
        "status_code": response.status_code,
        "cached": False,
    }
```
Wire it into main.py:
```python
import httpx
from fastapi import FastAPI, HTTPException

from models import ScrapeRequest, ScrapeResponse
from scraper import scrape_url

app = FastAPI(title="Scraping API", version="1.0.0")

@app.post("/scrape", response_model=ScrapeResponse)
async def scrape(request: ScrapeRequest):
    try:
        return await scrape_url(str(request.url), request.selector)
    except httpx.TimeoutException:
        raise HTTPException(status_code=504, detail="Target URL timed out")
    except httpx.HTTPStatusError as e:
        raise HTTPException(status_code=e.response.status_code, detail=str(e))
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
```
Test it against your running server:
```bash
curl -X POST http://localhost:9000/scrape \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'
```
Adding Async Batch Scraping
One of FastAPI's biggest strengths is native async support. Use asyncio.gather to scrape multiple URLs concurrently:
```python
import asyncio
from typing import List, Optional

from pydantic import BaseModel, HttpUrl

class BatchScrapeRequest(BaseModel):
    urls: List[HttpUrl]
    selector: Optional[str] = None

class BatchScrapeResponse(BaseModel):
    results: List[dict]
    total: int
    errors: int

@app.post("/scrape/batch", response_model=BatchScrapeResponse)
async def scrape_batch(request: BatchScrapeRequest):
    if len(request.urls) > 20:
        raise HTTPException(status_code=400, detail="Max 20 URLs per batch")

    tasks = [scrape_url(str(url), request.selector) for url in request.urls]
    raw_results = await asyncio.gather(*tasks, return_exceptions=True)

    results = []
    errors = 0
    for url, result in zip(request.urls, raw_results):
        if isinstance(result, Exception):
            errors += 1
            results.append({"url": str(url), "error": str(result)})
        else:
            results.append(result)

    return {"results": results, "total": len(request.urls), "errors": errors}
```
Batch scraping 10 URLs that would each take 1 second sequentially now completes in roughly 1 second total. That's the power of async.
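One caveat: `asyncio.gather` fires every request at once, so a full 20-URL batch opens 20 simultaneous connections to potentially the same host. A semaphore caps in-flight requests without giving up concurrency. This is a minimal sketch; `gather_bounded` and `fake_scrape` are illustrative names, not part of the API above:

```python
import asyncio

async def gather_bounded(coros, limit: int = 5):
    # Cap the number of coroutines running at once, so a large batch
    # never has more than `limit` requests in flight simultaneously.
    semaphore = asyncio.Semaphore(limit)

    async def run(coro):
        async with semaphore:
            return await coro

    return await asyncio.gather(*(run(c) for c in coros), return_exceptions=True)

# Stand-in coroutines in place of real scrape_url calls
async def fake_scrape(i: int):
    await asyncio.sleep(0.01)
    return {"url": f"https://example.com/{i}", "status_code": 200}

results = asyncio.run(gather_bounded([fake_scrape(i) for i in range(10)], limit=3))
print(len(results))  # 10
```

Dropping this into the batch endpoint is a one-line change: replace the `asyncio.gather(*tasks, ...)` call with `gather_bounded(tasks, limit=5)`.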
Rate Limiting
Without rate limiting, your scraping API is one aggressive client away from getting your IPs banned or your server overloaded. Here's a simple in-memory rate limiter:
```python
# limiter.py
import time
from collections import defaultdict

from fastapi import HTTPException

RATE_LIMIT = 10  # requests
WINDOW = 60  # seconds

request_log: dict = defaultdict(list)

def check_rate_limit(client_ip: str):
    now = time.time()
    window_start = now - WINDOW

    # Drop timestamps that have aged out of the sliding window
    request_log[client_ip] = [
        ts for ts in request_log[client_ip] if ts > window_start
    ]

    if len(request_log[client_ip]) >= RATE_LIMIT:
        raise HTTPException(
            status_code=429,
            detail=f"Rate limit exceeded: {RATE_LIMIT} requests per {WINDOW}s",
        )

    request_log[client_ip].append(now)
```
Call it at the top of each rate-limited route:

```python
from fastapi import Request

from limiter import check_rate_limit

@app.post("/scrape", response_model=ScrapeResponse)
async def scrape(request: ScrapeRequest, req: Request):
    check_rate_limit(req.client.host)
    # ... rest of handler
```
For production, consider slowapi, which integrates cleanly with FastAPI and supports Redis-backed limits for multi-instance deployments.
Caching Results
Re-scraping the same URL every few seconds wastes resources and increases ban risk. A simple TTL cache keeps things efficient:
```python
# cache.py
import time
from typing import Any, Optional

class TTLCache:
    def __init__(self, ttl_seconds: int = 300):
        self.ttl = ttl_seconds
        self._store: dict = {}

    def get(self, key: str) -> Optional[Any]:
        if key in self._store:
            value, timestamp = self._store[key]
            if time.time() - timestamp < self.ttl:
                return value
            del self._store[key]
        return None

    def set(self, key: str, value: Any):
        self._store[key] = (value, time.time())

    def clear_expired(self):
        now = time.time()
        self._store = {
            k: v for k, v in self._store.items()
            if now - v[1] < self.ttl
        }

scrape_cache = TTLCache(ttl_seconds=300)  # 5 minute TTL
```
Update your scrape endpoint to use it:
```python
from cache import scrape_cache

@app.post("/scrape", response_model=ScrapeResponse)
async def scrape(request: ScrapeRequest, req: Request):
    check_rate_limit(req.client.host)

    cache_key = f"{request.url}::{request.selector}"
    cached = scrape_cache.get(cache_key)
    if cached:
        # Return a copy so the stored entry itself is never mutated
        return {**cached, "cached": True}

    result = await scrape_url(str(request.url), request.selector)
    scrape_cache.set(cache_key, result)
    return result
```
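A string cache key like the one above treats `https://example.com/docs` and `https://Example.com/docs/#intro` as distinct entries, even though they fetch the same page. Normalizing the URL before building the key improves hit rates. A sketch using the standard library's `urllib.parse` (the `normalize_url` helper is hypothetical, not part of the code above):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    # Lowercase the host, drop the fragment, and strip a trailing slash
    # on the path so trivially different URLs share one cache entry.
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))

print(normalize_url("HTTPS://Example.COM/docs/#intro"))  # https://example.com/docs
```

Whether stripping trailing slashes is safe depends on the target site; some servers serve different content for `/docs` and `/docs/`, so treat this as a starting point rather than a universal rule.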
Error Handling
Robust error handling separates toy scripts from production APIs. Common failure modes to handle:
```python
import httpx
from fastapi import HTTPException

async def scrape_url_safe(url: str, selector=None):
    try:
        return await scrape_url(url, selector)
    except httpx.TimeoutException:
        raise HTTPException(
            status_code=504,
            detail={"error": "timeout", "url": url, "message": "Target took too long to respond"},
        )
    except httpx.HTTPStatusError as e:
        raise HTTPException(
            status_code=e.response.status_code,
            detail={"error": "http_error", "url": url, "status": e.response.status_code},
        )
    except httpx.ConnectError:
        raise HTTPException(
            status_code=502,
            detail={"error": "connect_error", "url": url, "message": "Could not reach target host"},
        )
    except ValueError as e:
        raise HTTPException(
            status_code=400,
            detail={"error": "invalid_input", "message": str(e)},
        )
```
Return a consistent error envelope across all endpoints — your API consumers will thank you.
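One way to enforce that consistency is a single helper every handler routes its errors through. A minimal sketch; `error_envelope` and its keys are an illustrative convention, not something the code above already defines:

```python
from typing import Optional

def error_envelope(error: str, url: Optional[str] = None, **extra) -> dict:
    # Every error response carries the same top-level keys ("error", "url"),
    # so API consumers can branch on "error" without guessing at
    # per-endpoint shapes; endpoint-specific fields ride along in **extra.
    body = {"error": error, "url": url}
    body.update(extra)
    return body

print(error_envelope("timeout", url="https://example.com", message="Target took too long"))
```

Each `HTTPException(detail=...)` in the handler above would then pass `detail=error_envelope(...)` instead of building the dict inline.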
Deploying to Production
Docker
```dockerfile
FROM python:3.12-slim
WORKDIR /app
COPY . .
RUN pip install --no-cache-dir fastapi uvicorn httpx beautifulsoup4 lxml
EXPOSE 9000
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "9000"]
```
Build and run:
```bash
docker build -t scraping-api .
docker run -p 9000:9000 scraping-api
```
Cloud Platforms
- Railway: push your repo, add a railway.json with your start command, done
- Fly.io: fly launch auto-detects the Dockerfile and deploys in minutes
- VPS (DigitalOcean, Hetzner, Vultr): Docker Compose + an Nginx reverse proxy for more control
For a lightweight API that doesn't need persistent storage, Railway or Fly.io are the fastest paths to production.
Using Proxy Services for Scale
Here's where most tutorials stop, but where real production scraping begins. Scraping at scale means dealing with:
- IP bans and CAPTCHAs
- Dynamic JavaScript rendering
- Geographic restrictions
- Rate limiting by target sites
If your scraping API needs to handle serious volume, you'll want managed proxy infrastructure rather than maintaining your own.
ScraperAPI is one of the most battle-tested options. It handles proxy rotation, CAPTCHA solving, and header management automatically. Instead of a raw httpx.get(url), you pass the URL through their proxy endpoint and get clean HTML back — no more blocked requests, no IP management overhead.
ThorData offers residential proxy pools — IPs that look like real home users rather than datacenter blocks. Residential proxies are significantly harder for sites to detect and block, making them worth the premium for high-value targets.
Apify takes a different approach entirely: rather than managing proxies yourself, you use pre-built actors for common scraping targets (Amazon, Google, LinkedIn, etc.) or deploy your own scrapers to their managed cloud. If you'd rather not operate infrastructure at all, Apify handles everything for you.
The right choice depends on your volume and technical appetite. For a self-hosted FastAPI setup, ScraperAPI or ThorData complement your architecture cleanly — just update your httpx client to route through their endpoints.
Conclusion
You now have the foundation for a production-grade web scraping API:
- Async scraping with httpx and BeautifulSoup4
- Batch endpoint for concurrent multi-URL scraping
- Rate limiting to protect your infrastructure
- TTL caching to reduce redundant requests
- Clean error handling with consistent response shapes
- Docker-ready for deployment
The code shown here is deliberately minimal — production deployments would add authentication (API keys or JWT), persistent caching with Redis, metrics and logging with Prometheus or structured logs, and a queue for large batch jobs.
Start simple, measure what matters, and layer in complexity only as your use case demands it. Happy scraping.