DEV Community

agenthustler

Posted on

Building a Web Scraping SaaS: Architecture, Billing, and Scaling

From Script to Business

You have built web scrapers for yourself. Now it is time to turn that skill into a product. A web scraping SaaS lets customers submit URLs and receive structured data — no coding required on their end.

This guide covers the architecture, billing, and scaling decisions you will face.

Architecture Overview

A scraping SaaS has three layers:

  1. API Layer — receives scraping requests, authenticates users
  2. Worker Layer — executes scrapes, manages browser pools
  3. Storage Layer — caches results, stores user data
Client -> API Gateway -> Task Queue -> Worker Pool -> Proxy Layer -> Target Site
                |                          |
              Auth/Billing              Results DB

The API Layer (FastAPI)

from fastapi import FastAPI, HTTPException, Depends, Header
from pydantic import BaseModel
from typing import Optional
import uuid
import redis
import json

app = FastAPI(title="ScrapeService API")
redis_client = redis.Redis(host="localhost", port=6379, decode_responses=True)

class ScrapeRequest(BaseModel):
    url: str
    render_js: bool = False
    wait_for: Optional[str] = None
    extract: Optional[dict] = None

class ScrapeResponse(BaseModel):
    task_id: str
    status: str
    credits_used: int

async def verify_api_key(x_api_key: str = Header()):
    user = redis_client.hgetall(f"apikey:{x_api_key}")
    if not user:
        raise HTTPException(401, "Invalid API key")
    credits = int(user.get("credits", 0))
    if credits <= 0:
        raise HTTPException(402, "Insufficient credits")
    return {"api_key": x_api_key, "user_id": user["user_id"], "credits": credits}

@app.post("/v1/scrape", response_model=ScrapeResponse)
async def scrape(request: ScrapeRequest, user=Depends(verify_api_key)):
    task_id = str(uuid.uuid4())
    credits_cost = 5 if request.render_js else 1

    # Deduct credits
    redis_client.hincrby(f"apikey:{user['api_key']}", "credits", -credits_cost)

    # Queue the task
    task = {
        "task_id": task_id,
        "url": request.url,
        "render_js": request.render_js,
        "wait_for": request.wait_for,
        "extract": request.extract,
        "user_id": user["user_id"],
    }
    redis_client.lpush("scrape_queue", json.dumps(task))

    return ScrapeResponse(
        task_id=task_id,
        status="queued",
        credits_used=credits_cost
    )

@app.get("/v1/result/{task_id}")
async def get_result(task_id: str, user=Depends(verify_api_key)):
    result = redis_client.get(f"result:{task_id}")
    if not result:
        return {"status": "processing"}
    # In production, also store the task owner with the result and check it
    # here, so one user cannot read another user's results.
    return json.loads(result)
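From the client's side this is a submit-then-poll flow: POST the URL, then poll the result endpoint until a worker writes something. Here is a sketch of the polling half, written against an injected `fetch` callable so it works with requests, httpx, or a test stub (the helper name and defaults are illustrative, not part of the API above):

```python
import time

def poll_result(fetch, task_id, attempts=30, delay=1.0, sleep=time.sleep):
    """Poll GET /v1/result/{task_id} until the worker writes a result.

    `fetch` is any callable that takes a path and returns a parsed JSON
    dict, so the HTTP layer is swappable and the function is testable.
    """
    for _ in range(attempts):
        body = fetch(f"/v1/result/{task_id}")
        if body.get("status") in ("completed", "failed"):
            return body
        sleep(delay)
    raise TimeoutError(f"task {task_id} still processing after {attempts} polls")
```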

The Worker Layer

import asyncio
import aiohttp
from playwright.async_api import async_playwright
import redis
import json

class ScrapeWorker:
    def __init__(self, worker_id, proxy_pool):
        self.worker_id = worker_id
        self.proxy_pool = proxy_pool
        self.redis = redis.Redis(decode_responses=True)

    async def run(self):
        print(f"Worker {self.worker_id} started")
        while True:
            # NOTE: redis-py's brpop is a blocking, synchronous call; run one
            # worker per process, or switch to redis.asyncio in production.
            task_json = self.redis.brpop("scrape_queue", timeout=5)
            if not task_json:
                continue

            task = json.loads(task_json[1])
            try:
                result = await self.execute(task)
                self.redis.setex(
                    f"result:{task['task_id']}",
                    3600,  # Cache for 1 hour
                    json.dumps({"status": "completed", "data": result})
                )
            except Exception as e:
                self.redis.setex(
                    f"result:{task['task_id']}",
                    3600,
                    json.dumps({"status": "failed", "error": str(e)})
                )

    async def execute(self, task):
        if task.get("render_js"):
            return await self._browser_scrape(task)
        return await self._http_scrape(task)

    async def _http_scrape(self, task):
        proxy = self.proxy_pool.get_proxy()
        async with aiohttp.ClientSession() as session:
            async with session.get(task["url"], proxy=proxy) as resp:
                html = await resp.text()
                return {"html": html, "status_code": resp.status}

    async def _browser_scrape(self, task):
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=True)
            try:
                page = await browser.new_page()
                await page.goto(task["url"], wait_until="networkidle")

                if task.get("wait_for"):
                    await page.wait_for_selector(task["wait_for"], timeout=15000)

                return {"html": await page.content(), "status_code": 200}
            finally:
                # Always release the browser, even when the scrape fails,
                # or crashed tasks will leak Chromium processes
                await browser.close()
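The worker above caches per `task_id`, so two users scraping the same page trigger two scrapes. Keying the cache on a normalized URL plus the options that change the output lets repeat requests be served from cache. A sketch — the normalization rules here are assumptions for illustration, not a standard:

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

def cache_key(url, render_js=False, wait_for=None):
    """Build a deterministic cache key for a scrape request."""
    parts = urlsplit(url)
    # Lowercase scheme/host and drop the fragment, which servers never see
    normalized = urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or "/",
        parts.query,
        "",
    ))
    raw = f"{normalized}|render_js={render_js}|wait_for={wait_for}"
    return "scrape:" + hashlib.sha256(raw.encode()).hexdigest()
```

Before queuing a task, the API can check this key and return the cached result at a reduced (or zero) credit cost.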

Proxy Management

Your proxy layer is the backbone of reliability:

import random
from collections import deque

class ProxyPool:
    def __init__(self):
        self.proxies = deque()
        self.failed = {}  # proxy -> failure count

    def add_proxies(self, proxy_list):
        for p in proxy_list:
            self.proxies.append(p)

    def get_proxy(self):
        if not self.proxies:
            raise RuntimeError("Proxy pool is empty")
        proxy = self.proxies[0]
        self.proxies.rotate(-1)  # Round robin
        return proxy

    def mark_failed(self, proxy):
        self.failed[proxy] = self.failed.get(proxy, 0) + 1
        if self.failed[proxy] > 3 and proxy in self.proxies:
            self.proxies.remove(proxy)
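Permanently dropping a proxy after a few failures slowly drains the pool, even when the failures were a temporary block. A variant that benches failed proxies for a cooldown instead of removing them (class name, cooldown value, and the injectable clock are illustrative choices, not the article's implementation):

```python
import time
from collections import deque

class CooldownProxyPool:
    """Round-robin pool that benches failed proxies instead of dropping them."""

    def __init__(self, cooldown=300.0, clock=time.monotonic):
        self.active = deque()
        self.benched = []  # list of (ready_at, proxy)
        self.cooldown = cooldown
        self.clock = clock  # injectable for deterministic tests

    def add(self, proxy):
        self.active.append(proxy)

    def get_proxy(self):
        now = self.clock()
        # Reactivate benched proxies whose cooldown has expired
        still_benched = []
        for ready_at, proxy in self.benched:
            if ready_at <= now:
                self.active.append(proxy)
            else:
                still_benched.append((ready_at, proxy))
        self.benched = still_benched
        if not self.active:
            raise RuntimeError("no proxies available")
        proxy = self.active[0]
        self.active.rotate(-1)  # Round robin
        return proxy

    def mark_failed(self, proxy):
        if proxy in self.active:
            self.active.remove(proxy)
            self.benched.append((self.clock() + self.cooldown, proxy))
```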

For production, use ThorData residential proxies. They handle rotation and provide clean IPs across 195 countries.

Billing with Stripe

import stripe
from fastapi import Request

stripe.api_key = "sk_live_..."
STRIPE_WEBHOOK_SECRET = "whsec_..."

CREDIT_PACKS = {
    "starter": {"credits": 10000, "price_cents": 2900},
    "growth": {"credits": 50000, "price_cents": 9900},
    "scale": {"credits": 250000, "price_cents": 29900},
}

@app.post("/v1/billing/purchase")
async def purchase_credits(pack: str, user=Depends(verify_api_key)):
    if pack not in CREDIT_PACKS:
        raise HTTPException(400, "Invalid pack")

    pack_info = CREDIT_PACKS[pack]

    session = stripe.checkout.Session.create(
        payment_method_types=["card"],
        line_items=[{
            "price_data": {
                "currency": "usd",
                "unit_amount": pack_info["price_cents"],
                "product_data": {"name": f"{pack.title()} Credit Pack"},
            },
            "quantity": 1,
        }],
        mode="payment",
        success_url="https://yourdomain.com/success",
        # Store the API key so the webhook credits the same hash
        # that verify_api_key reads from
        metadata={
            "user_id": user["user_id"],
            "api_key": user["api_key"],
            "credits": pack_info["credits"],
        },
    )
    return {"checkout_url": session.url}

@app.post("/webhooks/stripe")
async def stripe_webhook(request: Request):
    payload = await request.body()
    sig_header = request.headers.get("stripe-signature")
    try:
        event = stripe.Webhook.construct_event(
            payload, sig_header, STRIPE_WEBHOOK_SECRET
        )
    except (ValueError, stripe.error.SignatureVerificationError):
        raise HTTPException(400, "Invalid webhook signature")

    if event.type == "checkout.session.completed":
        session = event.data.object
        credits = int(session.metadata["credits"])
        api_key = session.metadata["api_key"]
        # Credit the same hash the API deducts from
        redis_client.hincrby(f"apikey:{api_key}", "credits", credits)
    return {"received": True}
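Stripe retries webhooks it thinks failed, so a grant can arrive more than once; crediting must be idempotent or a lucky customer gets double credits. A sketch of the exactly-once logic, with plain dicts standing in for Redis (the helper name and storage shape are assumptions for illustration):

```python
def apply_credit_grant(balances, seen_events, event_id, user_key, credits):
    """Apply a credit grant exactly once per Stripe event ID."""
    if event_id in seen_events:
        # Redelivered webhook: skip the grant, return the current balance
        return balances.get(user_key, 0)
    seen_events.add(event_id)
    balances[user_key] = balances.get(user_key, 0) + credits
    return balances[user_key]
```

In Redis this maps to a `SET event_id NX` check before the `HINCRBY`.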

Scaling Considerations

Horizontal Scaling

# docker-compose.yml
services:
  api:
    build: .
    command: uvicorn api:app --host 0.0.0.0 --port 9090
    deploy:
      replicas: 3

  worker:
    build: .
    command: python worker.py
    deploy:
      replicas: 10
    shm_size: '2gb'  # Required for Chromium

  redis:
    image: redis:7-alpine
    volumes:
      - redis_data:/data
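Because the workers are stateless and queue-fed, scaling out is just a replica count. A simple heuristic picks that count from queue depth (the 60-requests-per-minute per-worker throughput and the bounds are assumptions; measure your own workers):

```python
def desired_workers(queue_depth, per_worker_throughput=60,
                    min_workers=2, max_workers=50):
    """Pick a worker replica count so the backlog clears in about a minute."""
    needed = -(-queue_depth // per_worker_throughput)  # ceiling division
    return max(min_workers, min(max_workers, needed))
```

A cron job can read `LLEN scrape_queue` and feed the result to `docker service scale` or a Kubernetes HPA.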

Cost Structure

| Component           | Cost per 100K requests |
|---------------------|------------------------|
| Proxy (residential) | $50-100                |
| Compute (workers)   | $20-40                 |
| Browser instances   | $10-20                 |
| Storage/bandwidth   | $5-10                  |
| **Total**           | **$85-170**            |

Pricing at $29 per 10K requests brings in roughly $290 per 100K against $85-170 in costs, which leaves healthy margins.
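A quick sanity check on that claim, using the cost table's bounds (all figures in cents):

```python
def gross_margin(price_cents_per_10k, cost_cents_per_100k):
    """Gross margin fraction on 100K requests at a given credit-pack price."""
    revenue = price_cents_per_10k * 10  # ten packs of 10K requests
    return round((revenue - cost_cents_per_100k) / revenue, 3)
```

At the $85 cost floor the margin is about 71%; at the $170 ceiling, about 41%.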

Monitoring

Use ScrapeOps to monitor success rates per target site. When a site changes its anti-bot measures, you need to know immediately.

Alternatively, integrate ScraperAPI as your proxy layer — they handle all anti-bot bypass, so you can focus on your product instead of proxy infrastructure.
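Whether you use a hosted monitor or roll your own, the core signal is the same: success rate per target domain, with an alert when it drops. A minimal in-process tracker (class name and thresholds are illustrative):

```python
from collections import defaultdict
from urllib.parse import urlsplit

class SuccessTracker:
    """Track scrape success rate per target domain."""

    def __init__(self):
        self.counts = defaultdict(lambda: [0, 0])  # domain -> [ok, total]

    def record(self, url, ok):
        domain = urlsplit(url).netloc.lower()
        stats = self.counts[domain]
        stats[1] += 1
        if ok:
            stats[0] += 1

    def success_rate(self, domain):
        ok, total = self.counts[domain.lower()]
        return ok / total if total else None

    def failing_domains(self, threshold=0.8, min_samples=10):
        """Domains with enough traffic whose success rate fell below threshold."""
        return [d for d, (ok, total) in self.counts.items()
                if total >= min_samples and ok / total < threshold]
```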

MVP Checklist

  • [ ] API with key authentication
  • [ ] Credit-based billing (Stripe)
  • [ ] HTTP and browser-based scraping
  • [ ] Proxy rotation
  • [ ] Result caching
  • [ ] Rate limiting per user
  • [ ] Dashboard for API key management
  • [ ] Documentation with code examples
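The rate-limiting item on that list can start as a sliding-window counter. A sketch with an explicit `now` parameter so it is deterministic to test; a production version would keep the window in Redis (e.g. a sorted set per user) so all API replicas share it:

```python
from collections import deque

class SlidingWindowLimiter:
    """Allow at most `limit` requests per `window` seconds per user."""

    def __init__(self, limit, window):
        self.limit = limit
        self.window = window
        self.hits = {}  # user_id -> deque of request timestamps

    def allow(self, user_id, now):
        q = self.hits.setdefault(user_id, deque())
        # Drop timestamps that have aged out of the window
        while q and q[0] <= now - self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True
```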

Conclusion

A scraping SaaS is a proven business model — ScraperAPI, Bright Data, and Apify all generate millions in revenue. The technical moat is in proxy management, browser infrastructure, and anti-bot bypass. Start with a credit-based model, focus on reliability, and let your customers tell you which features to build next.
