# From Script to Business
You have built web scrapers for yourself. Now it is time to turn that skill into a product. A web scraping SaaS lets customers submit URLs and receive structured data — no coding required on their end.
This guide covers the architecture, billing, and scaling decisions you will face.
## Architecture Overview
A scraping SaaS has three layers:
- API Layer — receives scraping requests, authenticates users
- Worker Layer — executes scrapes, manages browser pools
- Storage Layer — caches results, stores user data
```
Client -> API Gateway -> Task Queue -> Worker Pool -> Proxy Layer -> Target Site
               |                            |
          Auth/Billing                 Results DB
```
## The API Layer (FastAPI)
```python
from fastapi import FastAPI, HTTPException, Depends, Header
from pydantic import BaseModel
from typing import Optional
import uuid
import redis
import json

app = FastAPI(title="ScrapeService API")
redis_client = redis.Redis(host="localhost", port=6379, decode_responses=True)


class ScrapeRequest(BaseModel):
    url: str
    render_js: bool = False
    wait_for: Optional[str] = None
    extract: Optional[dict] = None


class ScrapeResponse(BaseModel):
    task_id: str
    status: str
    credits_used: int


async def verify_api_key(x_api_key: str = Header()):
    # The apikey hash maps a key to its owner; credits live on the user
    # hash so the Stripe webhook below can top them up by user_id.
    key_info = redis_client.hgetall(f"apikey:{x_api_key}")
    if not key_info:
        raise HTTPException(401, "Invalid API key")
    user_id = key_info["user_id"]
    credits = int(redis_client.hget(f"user:{user_id}", "credits") or 0)
    if credits <= 0:
        raise HTTPException(402, "Insufficient credits")
    return {"api_key": x_api_key, "user_id": user_id, "credits": credits}


@app.post("/v1/scrape", response_model=ScrapeResponse)
async def scrape(request: ScrapeRequest, user=Depends(verify_api_key)):
    task_id = str(uuid.uuid4())
    credits_cost = 5 if request.render_js else 1  # JS rendering costs more

    # Deduct credits up front; refund on failure if you want to be generous
    redis_client.hincrby(f"user:{user['user_id']}", "credits", -credits_cost)

    # Queue the task for the worker pool
    task = {
        "task_id": task_id,
        "url": request.url,
        "render_js": request.render_js,
        "wait_for": request.wait_for,
        "extract": request.extract,
        "user_id": user["user_id"],
    }
    redis_client.lpush("scrape_queue", json.dumps(task))

    return ScrapeResponse(task_id=task_id, status="queued", credits_used=credits_cost)


@app.get("/v1/result/{task_id}")
async def get_result(task_id: str, user=Depends(verify_api_key)):
    result = redis_client.get(f"result:{task_id}")
    if not result:
        return {"status": "processing"}
    return json.loads(result)
```
## The Worker Layer
```python
import asyncio
import json

import aiohttp
import redis
from playwright.async_api import async_playwright


class ScrapeWorker:
    def __init__(self, worker_id, proxy_pool):
        self.worker_id = worker_id
        self.proxy_pool = proxy_pool
        self.redis = redis.Redis(decode_responses=True)

    async def run(self):
        print(f"Worker {self.worker_id} started")
        while True:
            # redis-py is blocking; run brpop off the event loop so other
            # coroutines in this process keep making progress
            task_json = await asyncio.to_thread(
                self.redis.brpop, "scrape_queue", 5
            )
            if not task_json:
                continue
            task = json.loads(task_json[1])
            try:
                result = await self.execute(task)
                payload = {"status": "completed", "data": result}
            except Exception as e:
                payload = {"status": "failed", "error": str(e)}
            self.redis.setex(
                f"result:{task['task_id']}",
                3600,  # Cache for 1 hour
                json.dumps(payload),
            )

    async def execute(self, task):
        if task.get("render_js"):
            return await self._browser_scrape(task)
        return await self._http_scrape(task)

    async def _http_scrape(self, task):
        proxy = self.proxy_pool.get_proxy()
        try:
            async with aiohttp.ClientSession() as session:
                async with session.get(task["url"], proxy=proxy) as resp:
                    html = await resp.text()
                    return {"html": html, "status_code": resp.status}
        except aiohttp.ClientError:
            self.proxy_pool.mark_failed(proxy)  # Rotate out bad proxies
            raise

    async def _browser_scrape(self, task):
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=True)
            page = await browser.new_page()
            try:
                await page.goto(task["url"], wait_until="networkidle")
                if task.get("wait_for"):
                    await page.wait_for_selector(task["wait_for"], timeout=15000)
                html = await page.content()
            finally:
                await browser.close()
            return {"html": html, "status_code": 200}
```
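Because each worker's `run` loop is a coroutine, one process can host several of them with `asyncio.gather`. The fan-out pattern, sketched here with an in-memory `asyncio.Queue` standing in for the Redis list (the worker body is a stub, not the real scraper):

```python
import asyncio


async def worker_loop(worker_id, queue, results):
    # Stand-in for ScrapeWorker.run: pull tasks until a shutdown sentinel.
    while True:
        task = await queue.get()
        if task is None:  # sentinel -> shut down
            queue.task_done()
            return
        results[task["task_id"]] = {"status": "completed", "url": task["url"]}
        queue.task_done()


async def main(n_workers=4):
    queue = asyncio.Queue()
    results = {}
    for i in range(10):
        queue.put_nowait({"task_id": f"t{i}", "url": f"https://example.com/{i}"})
    for _ in range(n_workers):
        queue.put_nowait(None)  # one sentinel per worker
    await asyncio.gather(
        *(worker_loop(i, queue, results) for i in range(n_workers))
    )
    return results


results = asyncio.run(main())
print(len(results))  # 10
```

In production you would size `n_workers` per container to match CPU and memory, and let Docker replicas (see the compose file below) handle scaling across machines.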
## Proxy Management
Your proxy layer is the backbone of reliability:
```python
from collections import deque


class ProxyPool:
    def __init__(self):
        self.proxies = deque()
        self.failed = {}  # proxy -> failure count

    def add_proxies(self, proxy_list):
        for p in proxy_list:
            self.proxies.append(p)

    def get_proxy(self):
        if not self.proxies:
            raise RuntimeError("Proxy pool is empty")
        proxy = self.proxies[0]
        self.proxies.rotate(-1)  # Round robin
        return proxy

    def mark_failed(self, proxy):
        self.failed[proxy] = self.failed.get(proxy, 0) + 1
        if self.failed[proxy] > 3 and proxy in self.proxies:
            self.proxies.remove(proxy)  # Retire after repeated failures
```
For production, use ThorData residential proxies. They handle rotation and provide clean IPs across 195 countries.
## Billing with Stripe
```python
import stripe
from fastapi import Request

stripe.api_key = "sk_live_..."
STRIPE_WEBHOOK_SECRET = "whsec_..."  # from your Stripe dashboard

CREDIT_PACKS = {
    "starter": {"credits": 10000, "price_cents": 2900},
    "growth": {"credits": 50000, "price_cents": 9900},
    "scale": {"credits": 250000, "price_cents": 29900},
}


@app.post("/v1/billing/purchase")
async def purchase_credits(pack: str, user=Depends(verify_api_key)):
    if pack not in CREDIT_PACKS:
        raise HTTPException(400, "Invalid pack")
    pack_info = CREDIT_PACKS[pack]
    session = stripe.checkout.Session.create(
        payment_method_types=["card"],
        line_items=[{
            "price_data": {
                "currency": "usd",
                "unit_amount": pack_info["price_cents"],
                "product_data": {"name": f"{pack.title()} Credit Pack"},
            },
            "quantity": 1,
        }],
        mode="payment",
        success_url="https://yourdomain.com/success",
        # Stripe metadata values are strings
        metadata={"user_id": user["user_id"], "credits": str(pack_info["credits"])},
    )
    return {"checkout_url": session.url}


@app.post("/webhooks/stripe")
async def stripe_webhook(request: Request):
    # Verify the signature so forged requests cannot mint free credits
    payload = await request.body()
    sig_header = request.headers.get("stripe-signature", "")
    event = stripe.Webhook.construct_event(payload, sig_header, STRIPE_WEBHOOK_SECRET)
    if event.type == "checkout.session.completed":
        session = event.data.object
        credits = int(session.metadata["credits"])
        user_id = session.metadata["user_id"]
        # Add credits to the user account
        redis_client.hincrby(f"user:{user_id}", "credits", credits)
    return {"received": True}
```
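One billing detail worth handling early: Stripe can deliver the same webhook more than once, so the credit top-up should be idempotent per event id. In Redis you would typically gate on `SET NX` with the event id; an in-memory sketch of the same guard (`make_credit_handler` is a name invented here for illustration):

```python
def make_credit_handler(store, processed):
    """Credit a user's balance at most once per Stripe event id.

    `store` maps user_id -> credits; `processed` is the set of event ids
    already handled. Production would use Redis SET NX instead of a set.
    """
    def handle(event_id, user_id, credits):
        if event_id in processed:
            return False  # duplicate delivery, ignore
        processed.add(event_id)
        store[user_id] = store.get(user_id, 0) + credits
        return True
    return handle


store, processed = {}, set()
handle = make_credit_handler(store, processed)
handle("evt_1", "u1", 10000)
handle("evt_1", "u1", 10000)  # retried delivery is a no-op
print(store["u1"])  # 10000
```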
## Scaling Considerations
### Horizontal Scaling
```yaml
# docker-compose.yml
services:
  api:
    build: .
    command: uvicorn api:app --host 0.0.0.0 --port 9090
    deploy:
      replicas: 3
  worker:
    build: .
    command: python worker.py
    deploy:
      replicas: 10
    shm_size: '2gb'  # Required for Chromium
  redis:
    image: redis:7-alpine
    volumes:
      - redis_data:/data

volumes:
  redis_data:
```
### Cost Structure
| Component | Cost per 100K requests |
|---|---|
| Proxy (residential) | $50-100 |
| Compute (workers) | $20-40 |
| Browser instances | $10-20 |
| Storage/bandwidth | $5-10 |
| Total | $85-170 |
Pricing at $29 per 10K requests (the starter pack above) leaves roughly 40-70% gross margin even at the high end of those costs.
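Sanity-checking that margin against the table:

```python
cost_low, cost_high = 85, 170  # $ per 100K requests, from the table
price_per_10k = 29             # $ (matches the starter pack)

revenue_per_100k = price_per_10k * 10
margin_low = (revenue_per_100k - cost_high) / revenue_per_100k
margin_high = (revenue_per_100k - cost_low) / revenue_per_100k
print(f"Gross margin: {margin_low:.0%} to {margin_high:.0%}")
# Gross margin: 41% to 71%
```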
## Monitoring
Use ScrapeOps to monitor success rates per target site. When a site changes its anti-bot measures, you need to know immediately.
Alternatively, integrate ScraperAPI as your proxy layer — they handle all anti-bot bypass, so you can focus on your product instead of proxy infrastructure.
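If you roll your own monitoring instead, the core of it is tracking success rate per target domain over a rolling window and alerting when it drops. A minimal sketch (`SuccessTracker` and its thresholds are inventions for illustration, not part of any of the services above):

```python
from collections import defaultdict


class SuccessTracker:
    """Track per-domain success rates over a rolling window of attempts."""

    def __init__(self, window=100, alert_below=0.8):
        self.window = window
        self.alert_below = alert_below
        self.history = defaultdict(list)  # domain -> recent outcomes (1/0)

    def record(self, domain, success):
        h = self.history[domain]
        h.append(1 if success else 0)
        if len(h) > self.window:
            h.pop(0)  # drop the oldest attempt

    def rate(self, domain):
        h = self.history[domain]
        return sum(h) / len(h) if h else 1.0

    def alerts(self):
        # Only alert once there is enough data to be meaningful
        return [d for d, h in self.history.items()
                if len(h) >= 20 and sum(h) / len(h) < self.alert_below]


tracker = SuccessTracker()
for _ in range(30):
    tracker.record("example.com", success=False)  # site changed its defenses
print(tracker.alerts())  # ['example.com']
```

Workers would call `record` after every task; a periodic job checks `alerts()` and pages you.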
## MVP Checklist
- [ ] API with key authentication
- [ ] Credit-based billing (Stripe)
- [ ] HTTP and browser-based scraping
- [ ] Proxy rotation
- [ ] Result caching
- [ ] Rate limiting per user
- [ ] Dashboard for API key management
- [ ] Documentation with code examples
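The rate-limiting item on that checklist is quick to sketch: a fixed-window counter per user. Shown in-memory here; in production you would use Redis `INCR` plus `EXPIRE` so limits are shared across API replicas (the class name and defaults are assumptions for illustration):

```python
import time


class RateLimiter:
    """Fixed-window limiter: at most `limit` requests per `window` seconds."""

    def __init__(self, limit=60, window=60.0):
        self.limit = limit
        self.window = window
        self.counters = {}  # user_id -> (window_start, count)

    def allow(self, user_id, now=None):
        now = time.monotonic() if now is None else now
        start, count = self.counters.get(user_id, (now, 0))
        if now - start >= self.window:
            start, count = now, 0  # window expired, start a fresh one
        if count >= self.limit:
            return False
        self.counters[user_id] = (start, count + 1)
        return True


limiter = RateLimiter(limit=3, window=60.0)
results = [limiter.allow("u1", now=0.0) for _ in range(4)]
print(results)  # [True, True, True, False]
```

In the FastAPI app this would sit inside `verify_api_key`, returning HTTP 429 when `allow` is False.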
## Conclusion
A scraping SaaS is a proven business model — ScraperAPI, Bright Data, and Apify all generate millions in revenue. The technical moat is in proxy management, browser infrastructure, and anti-bot bypass. Start with a credit-based model, focus on reliability, and let your customers tell you which features to build next.