DEV Community

agenthustler

Posted on

Building a Web Scraping SaaS: Architecture, Billing, and Scaling

From Script to Business

You have built web scrapers for yourself. Now it is time to turn that skill into a product. A web scraping SaaS lets customers submit URLs and receive structured data — no coding required on their end.

This guide covers the architecture, billing, and scaling decisions you will face.

Architecture Overview

A scraping SaaS has three layers:

  1. API Layer — receives scraping requests, authenticates users
  2. Worker Layer — executes scrapes, manages browser pools
  3. Storage Layer — caches results, stores user data
Client -> API Gateway -> Task Queue -> Worker Pool -> Proxy Layer -> Target Site
                |                          |
              Auth/Billing              Results DB

The API Layer (FastAPI)

from fastapi import FastAPI, HTTPException, Depends, Header
from pydantic import BaseModel
from typing import Optional
import uuid
import redis
import json

app = FastAPI(title="ScrapeService API")
redis_client = redis.Redis(host="localhost", port=6379, decode_responses=True)

class ScrapeRequest(BaseModel):
    url: str
    render_js: bool = False
    wait_for: Optional[str] = None
    extract: Optional[dict] = None

class ScrapeResponse(BaseModel):
    task_id: str
    status: str
    credits_used: int

async def verify_api_key(x_api_key: str = Header()):
    user = redis_client.hgetall(f"apikey:{x_api_key}")
    if not user:
        raise HTTPException(401, "Invalid API key")
    credits = int(user.get("credits", 0))
    if credits <= 0:
        raise HTTPException(402, "Insufficient credits")
    return {"api_key": x_api_key, "user_id": user["user_id"], "credits": credits}

@app.post("/v1/scrape", response_model=ScrapeResponse)
async def scrape(request: ScrapeRequest, user=Depends(verify_api_key)):
    task_id = str(uuid.uuid4())
    credits_cost = 5 if request.render_js else 1

    # Deduct credits
    redis_client.hincrby(f"apikey:{user['api_key']}", "credits", -credits_cost)

    # Queue the task
    task = {
        "task_id": task_id,
        "url": request.url,
        "render_js": request.render_js,
        "wait_for": request.wait_for,
        "extract": request.extract,
        "user_id": user["user_id"],
    }
    redis_client.lpush("scrape_queue", json.dumps(task))

    return ScrapeResponse(
        task_id=task_id,
        status="queued",
        credits_used=credits_cost
    )

@app.get("/v1/result/{task_id}")
async def get_result(task_id: str, user=Depends(verify_api_key)):
    result = redis_client.get(f"result:{task_id}")
    if not result:
        return {"status": "processing"}
    # In production, also store the task owner with the result and check it
    # here, so one user cannot read another user's results.
    return json.loads(result)
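From the client's side this is a submit-then-poll flow: POST the URL, then poll the result endpoint until a worker writes something. Here is a sketch of the polling half, written against an injected `fetch` callable so it works with requests, httpx, or a test stub (the helper name and defaults are illustrative, not part of the API above):

```python
import time

def poll_result(fetch, task_id, attempts=30, delay=1.0, sleep=time.sleep):
    """Poll GET /v1/result/{task_id} until the worker writes a result.

    `fetch` is any callable that takes a path and returns a parsed JSON
    dict, so the HTTP layer is swappable and the function is testable.
    """
    for _ in range(attempts):
        body = fetch(f"/v1/result/{task_id}")
        if body.get("status") in ("completed", "failed"):
            return body
        sleep(delay)
    raise TimeoutError(f"task {task_id} still processing after {attempts} polls")
```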

The Worker Layer

import asyncio
import aiohttp
from playwright.async_api import async_playwright
import redis
import json

class ScrapeWorker:
    def __init__(self, worker_id, proxy_pool):
        self.worker_id = worker_id
        self.proxy_pool = proxy_pool
        self.redis = redis.Redis(decode_responses=True)

    async def run(self):
        print(f"Worker {self.worker_id} started")
        while True:
            # NOTE: redis-py's brpop is a blocking, synchronous call; run one
            # worker per process, or switch to redis.asyncio in production.
            task_json = self.redis.brpop("scrape_queue", timeout=5)
            if not task_json:
                continue

            task = json.loads(task_json[1])
            try:
                result = await self.execute(task)
                self.redis.setex(
                    f"result:{task['task_id']}",
                    3600,  # Cache for 1 hour
                    json.dumps({"status": "completed", "data": result})
                )
            except Exception as e:
                self.redis.setex(
                    f"result:{task['task_id']}",
                    3600,
                    json.dumps({"status": "failed", "error": str(e)})
                )

    async def execute(self, task):
        if task.get("render_js"):
            return await self._browser_scrape(task)
        return await self._http_scrape(task)

    async def _http_scrape(self, task):
        proxy = self.proxy_pool.get_proxy()
        async with aiohttp.ClientSession() as session:
            async with session.get(task["url"], proxy=proxy) as resp:
                html = await resp.text()
                return {"html": html, "status_code": resp.status}

    async def _browser_scrape(self, task):
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=True)
            try:
                page = await browser.new_page()
                await page.goto(task["url"], wait_until="networkidle")

                if task.get("wait_for"):
                    await page.wait_for_selector(task["wait_for"], timeout=15000)

                return {"html": await page.content(), "status_code": 200}
            finally:
                # Always release the browser, even when the scrape fails,
                # or crashed tasks will leak Chromium processes
                await browser.close()
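The worker above caches per `task_id`, so two users scraping the same page trigger two scrapes. Keying the cache on a normalized URL plus the options that change the output lets repeat requests be served from cache. A sketch — the normalization rules here are assumptions for illustration, not a standard:

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

def cache_key(url, render_js=False, wait_for=None):
    """Build a deterministic cache key for a scrape request."""
    parts = urlsplit(url)
    # Lowercase scheme/host and drop the fragment, which servers never see
    normalized = urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",
        parts.query,
        "",
    ))
    raw = f"{normalized}|render_js={render_js}|wait_for={wait_for}"
    return "scrape:" + hashlib.sha256(raw.encode()).hexdigest()
```

Before queuing a task, the API can check this key and return the cached result at a reduced (or zero) credit cost.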

Proxy Management

Your proxy layer is the backbone of reliability:

import random
from collections import deque

class ProxyPool:
    def __init__(self):
        self.proxies = deque()
        self.failed = {}  # proxy -> failure count

    def add_proxies(self, proxy_list):
        for p in proxy_list:
            self.proxies.append(p)

    def get_proxy(self):
        if not self.proxies:
            raise RuntimeError("Proxy pool is empty")
        proxy = self.proxies[0]
        self.proxies.rotate(-1)  # Round robin
        return proxy

    def mark_failed(self, proxy):
        self.failed[proxy] = self.failed.get(proxy, 0) + 1
        if self.failed[proxy] > 3 and proxy in self.proxies:
            self.proxies.remove(proxy)
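Permanently dropping a proxy after a few failures slowly drains the pool, even when the failures were a temporary block. A variant that benches failed proxies for a cooldown instead of removing them (class name, cooldown value, and the injectable clock are illustrative choices, not the article's implementation):

```python
import time
from collections import deque

class CooldownProxyPool:
    """Round-robin pool that benches failed proxies instead of dropping them."""

    def __init__(self, cooldown=300.0, clock=time.monotonic):
        self.active = deque()
        self.benched = []  # list of (ready_at, proxy)
        self.cooldown = cooldown
        self.clock = clock  # injectable for deterministic tests

    def add(self, proxy):
        self.active.append(proxy)

    def get_proxy(self):
        now = self.clock()
        # Reactivate benched proxies whose cooldown has expired
        still_benched = []
        for ready_at, proxy in self.benched:
            if ready_at <= now:
                self.active.append(proxy)
            else:
                still_benched.append((ready_at, proxy))
        self.benched = still_benched
        if not self.active:
            raise RuntimeError("no proxies available")
        proxy = self.active[0]
        self.active.rotate(-1)  # Round robin
        return proxy

    def mark_failed(self, proxy):
        if proxy in self.active:
            self.active.remove(proxy)
            self.benched.append((self.clock() + self.cooldown, proxy))
```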

For production, use ThorData residential proxies. They handle rotation and provide clean IPs across 195 countries.

Billing with Stripe

import stripe
from fastapi import Request

stripe.api_key = "sk_live_..."
STRIPE_WEBHOOK_SECRET = "whsec_..."

CREDIT_PACKS = {
    "starter": {"credits": 10000, "price_cents": 2900},
    "growth": {"credits": 50000, "price_cents": 9900},
    "scale": {"credits": 250000, "price_cents": 29900},
}

@app.post("/v1/billing/purchase")
async def purchase_credits(pack: str, user=Depends(verify_api_key)):
    if pack not in CREDIT_PACKS:
        raise HTTPException(400, "Invalid pack")

    pack_info = CREDIT_PACKS[pack]

    session = stripe.checkout.Session.create(
        payment_method_types=["card"],
        line_items=[{
            "price_data": {
                "currency": "usd",
                "unit_amount": pack_info["price_cents"],
                "product_data": {"name": f"{pack.title()} Credit Pack"},
            },
            "quantity": 1,
        }],
        mode="payment",
        success_url="https://yourdomain.com/success",
        # Store the API key so the webhook credits the same hash
        # that verify_api_key reads from
        metadata={
            "user_id": user["user_id"],
            "api_key": user["api_key"],
            "credits": pack_info["credits"],
        },
    )
    return {"checkout_url": session.url}

@app.post("/webhooks/stripe")
async def stripe_webhook(request: Request):
    payload = await request.body()
    sig_header = request.headers.get("stripe-signature")
    try:
        event = stripe.Webhook.construct_event(
            payload, sig_header, STRIPE_WEBHOOK_SECRET
        )
    except (ValueError, stripe.error.SignatureVerificationError):
        raise HTTPException(400, "Invalid webhook signature")

    if event.type == "checkout.session.completed":
        session = event.data.object
        credits = int(session.metadata["credits"])
        api_key = session.metadata["api_key"]
        # Credit the same hash the API deducts from
        redis_client.hincrby(f"apikey:{api_key}", "credits", credits)
    return {"received": True}
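Stripe retries webhooks it thinks failed, so a grant can arrive more than once; crediting must be idempotent or a lucky customer gets double credits. A sketch of the exactly-once logic, with plain dicts standing in for Redis (the helper name and storage shape are assumptions for illustration):

```python
def apply_credit_grant(balances, seen_events, event_id, user_key, credits):
    """Apply a credit grant exactly once per Stripe event ID."""
    if event_id in seen_events:
        # Redelivered webhook: skip the grant, return the current balance
        return balances.get(user_key, 0)
    seen_events.add(event_id)
    balances[user_key] = balances.get(user_key, 0) + credits
    return balances[user_key]
```

In Redis this maps to a `SET event_id NX` check before the `HINCRBY`.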

Scaling Considerations

Horizontal Scaling

# docker-compose.yml
services:
  api:
    build: .
    command: uvicorn api:app --host 0.0.0.0 --port 9090
    deploy:
      replicas: 3

  worker:
    build: .
    command: python worker.py
    deploy:
      replicas: 10
    shm_size: '2gb'  # Required for Chromium

  redis:
    image: redis:7-alpine
    volumes:
      - redis_data:/data
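Because the workers are stateless and queue-fed, scaling out is just a replica count. A simple heuristic picks that count from queue depth (the 60-requests-per-minute per-worker throughput and the bounds are assumptions; measure your own workers):

```python
def desired_workers(queue_depth, per_worker_throughput=60,
                    min_workers=2, max_workers=50):
    """Pick a worker replica count so the backlog clears in about a minute."""
    needed = -(-queue_depth // per_worker_throughput)  # ceiling division
    return max(min_workers, min(max_workers, needed))
```

A cron job can read `LLEN scrape_queue` and feed the result to `docker service scale` or a Kubernetes HPA.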

Cost Structure

| Component           | Cost per 100K requests |
|---------------------|------------------------|
| Proxy (residential) | $50-100                |
| Compute (workers)   | $20-40                 |
| Browser instances   | $10-20                 |
| Storage/bandwidth   | $5-10                  |
| **Total**           | **$85-170**            |

Pricing at $29 per 10K requests brings in roughly $290 per 100K against $85-170 in costs, which leaves healthy margins.
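A quick sanity check on that claim, using the cost table's bounds (all figures in cents):

```python
def gross_margin(price_cents_per_10k, cost_cents_per_100k):
    """Gross margin fraction on 100K requests at a given credit-pack price."""
    revenue = price_cents_per_10k * 10  # ten packs of 10K requests
    return round((revenue - cost_cents_per_100k) / revenue, 3)
```

At the $85 cost floor the margin is about 71%; at the $170 ceiling, about 41%.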

Monitoring

Use ScrapeOps to monitor success rates per target site. When a site changes its anti-bot measures, you need to know immediately.

Alternatively, integrate ScraperAPI as your proxy layer — they handle all anti-bot bypass, so you can focus on your product instead of proxy infrastructure.
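Whether you use a hosted monitor or roll your own, the core signal is the same: success rate per target domain, with an alert when it drops. A minimal in-process tracker (class name and thresholds are illustrative):

```python
from collections import defaultdict
from urllib.parse import urlsplit

class SuccessTracker:
    """Track scrape success rate per target domain."""

    def __init__(self):
        self.counts = defaultdict(lambda: [0, 0])  # domain -> [ok, total]

    def record(self, url, ok):
        domain = urlsplit(url).netloc.lower()
        stats = self.counts[domain]
        stats[1] += 1
        if ok:
            stats[0] += 1

    def success_rate(self, domain):
        ok, total = self.counts[domain.lower()]
        return ok / total if total else None

    def failing_domains(self, threshold=0.8, min_samples=10):
        """Domains with enough traffic whose success rate fell below threshold."""
        return [d for d, (ok, total) in self.counts.items()
                if total >= min_samples and ok / total < threshold]
```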

MVP Checklist

  • [ ] API with key authentication
  • [ ] Credit-based billing (Stripe)
  • [ ] HTTP and browser-based scraping
  • [ ] Proxy rotation
  • [ ] Result caching
  • [ ] Rate limiting per user
  • [ ] Dashboard for API key management
  • [ ] Documentation with code examples
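The rate-limiting item on that list can start as a sliding-window counter. A sketch with an explicit `now` parameter so it is deterministic to test; a production version would keep the window in Redis (e.g. a sorted set per user) so all API replicas share it:

```python
from collections import deque

class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds per user."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.hits = {}  # user_id -> deque of request timestamps

    def allow(self, user_id, now):
        q = self.hits.setdefault(user_id, deque())
        # Drop timestamps that have aged out of the window
        while q and q[0] <= now - self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True
```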

Conclusion

A scraping SaaS is a proven business model — ScraperAPI, Bright Data, and Apify all generate millions in revenue. The technical moat is in proxy management, browser infrastructure, and anti-bot bypass. Start with a credit-based model, focus on reliability, and let your customers tell you which features to build next.
