DEV Community

agenthustler


Building a Web Scraping SaaS: Architecture, Billing, and Scaling

From Script to Business

You have built web scrapers for yourself. Now it is time to turn that skill into a product. A web scraping SaaS lets customers submit URLs and receive structured data — no coding required on their end.

This guide covers the architecture, billing, and scaling decisions you will face.

Architecture Overview

A scraping SaaS has three layers:

  1. API Layer — receives scraping requests, authenticates users
  2. Worker Layer — executes scrapes, manages browser pools
  3. Storage Layer — caches results, stores user data
Client -> API Gateway -> Task Queue -> Worker Pool -> Proxy Layer -> Target Site
                |                          |
              Auth/Billing              Results DB

The API Layer (FastAPI)

from fastapi import FastAPI, HTTPException, Depends, Header
from pydantic import BaseModel
from typing import Optional
import uuid
import redis
import json

app = FastAPI(title="ScrapeService API")
redis_client = redis.Redis(host="localhost", port=6379, decode_responses=True)

class ScrapeRequest(BaseModel):
    url: str
    render_js: bool = False
    wait_for: Optional[str] = None
    extract: Optional[dict] = None

class ScrapeResponse(BaseModel):
    task_id: str
    status: str
    credits_used: int

async def verify_api_key(x_api_key: str = Header()):
    key_info = redis_client.hgetall(f"apikey:{x_api_key}")
    if not key_info:
        raise HTTPException(401, "Invalid API key")
    user_id = key_info["user_id"]
    # Keep the credit counter on the user hash, so the billing webhook
    # (which only knows the user_id) credits the same counter we debit here
    credits = int(redis_client.hget(f"user:{user_id}", "credits") or 0)
    if credits <= 0:
        raise HTTPException(402, "Insufficient credits")
    return {"api_key": x_api_key, "user_id": user_id, "credits": credits}

@app.post("/v1/scrape", response_model=ScrapeResponse)
async def scrape(request: ScrapeRequest, user=Depends(verify_api_key)):
    task_id = str(uuid.uuid4())
    credits_cost = 5 if request.render_js else 1

    # Deduct credits (check-then-decrement is racy under concurrency;
    # use a Lua script or WATCH/MULTI in production)
    redis_client.hincrby(f"user:{user['user_id']}", "credits", -credits_cost)

    # Queue the task
    task = {
        "task_id": task_id,
        "url": request.url,
        "render_js": request.render_js,
        "wait_for": request.wait_for,
        "extract": request.extract,
        "user_id": user["user_id"],
    }
    redis_client.lpush("scrape_queue", json.dumps(task))

    return ScrapeResponse(
        task_id=task_id,
        status="queued",
        credits_used=credits_cost
    )

@app.get("/v1/result/{task_id}")
async def get_result(task_id: str, user=Depends(verify_api_key)):
    # In production, also verify the task belongs to this user before returning it
    result = redis_client.get(f"result:{task_id}")
    if not result:
        return {"status": "processing"}
    return json.loads(result)
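From a customer's point of view, using this API means submitting a job and then polling GET /v1/result/{task_id} until it leaves the "processing" state. A small client-side helper for that loop might look like the sketch below; the HTTP call is injected as get_result (a function name I'm introducing here, not part of the API above) so the retry logic stays testable:

```python
import time

def poll_result(get_result, task_id, timeout=30.0, interval=1.0):
    """Poll get_result(task_id) until status != 'processing' or timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        data = get_result(task_id)
        if data.get("status") != "processing":
            return data
        time.sleep(interval)
    raise TimeoutError(f"task {task_id} still processing after {timeout}s")
```

In practice get_result would be a thin wrapper around `requests.get(f"{base_url}/v1/result/{task_id}", headers={"x-api-key": ...}).json()`.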

The Worker Layer

# Implementation is proprietary (that IS the moat).
# Skip the build — use our ready-made Apify actor:
# see the CTA below for the link (fpr=yw6md3).
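Since the real worker is kept private, here is a minimal, illustrative sketch of a worker loop that matches the queue and result keys used by the API layer above. It handles plain HTTP only — no JS rendering, retries, or proxies — and the fetch function is injected so the task-handling logic is testable:

```python
import json

def handle_task(task, http_get):
    """Fetch task['url'] via http_get(url) -> (status_code, text) and build a result record."""
    try:
        status_code, html = http_get(task["url"])
        return {"task_id": task["task_id"], "user_id": task["user_id"],
                "status": "completed", "status_code": status_code, "html": html}
    except Exception as exc:
        return {"task_id": task["task_id"], "user_id": task["user_id"],
                "status": "failed", "error": str(exc)}

def run_worker(queue_client, http_get, queue="scrape_queue"):
    """queue_client: a redis.Redis(decode_responses=True) instance."""
    while True:
        # BRPOP pairs with the API's LPUSH to form a FIFO queue
        item = queue_client.brpop(queue, timeout=5)
        if item is None:
            continue  # queue empty; poll again
        result = handle_task(json.loads(item[1]), http_get)
        # Cache for 24h so GET /v1/result/{task_id} can serve it
        queue_client.set(f"result:{result['task_id']}", json.dumps(result), ex=86400)
```

In production you would wire run_worker up with a real Redis client and a requests- or Playwright-based http_get, plus the proxy rotation discussed next.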

Proxy Management

Your proxy layer is the backbone of reliability:

# Implementation is proprietary (that IS the moat).
# Skip the build — use our ready-made Apify actor:
# see the CTA below for the link (fpr=yw6md3).
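Again the post keeps its implementation private, but the core idea — rotate through a pool and bench proxies that keep failing — can be sketched in a few lines (class and method names here are my own, not the post's):

```python
import itertools

class ProxyPool:
    """Round-robin proxy rotation that skips repeatedly failing proxies."""

    def __init__(self, proxies, max_failures=3):
        self.proxies = list(proxies)
        self.failures = {p: 0 for p in self.proxies}
        self.max_failures = max_failures
        self._cycle = itertools.cycle(self.proxies)

    def get(self):
        # Try each proxy at most once per call, skipping benched ones
        for _ in range(len(self.proxies)):
            proxy = next(self._cycle)
            if self.failures[proxy] < self.max_failures:
                return proxy
        raise RuntimeError("all proxies exhausted")

    def mark_failed(self, proxy):
        self.failures[proxy] += 1

    def mark_ok(self, proxy):
        self.failures[proxy] = 0  # a success resets the failure count
```

A real pool would also re-test benched proxies after a cooldown instead of benching them forever.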

For production, use ThorData residential proxies. They handle rotation and provide clean IPs across 195 countries.

Billing with Stripe

import stripe

stripe.api_key = "sk_live_..."

CREDIT_PACKS = {
    "starter": {"credits": 10000, "price_cents": 2900},
    "growth": {"credits": 50000, "price_cents": 9900},
    "scale": {"credits": 250000, "price_cents": 29900},
}

@app.post("/v1/billing/purchase")
async def purchase_credits(pack: str, user=Depends(verify_api_key)):
    if pack not in CREDIT_PACKS:
        raise HTTPException(400, "Invalid pack")

    pack_info = CREDIT_PACKS[pack]

    session = stripe.checkout.Session.create(
        payment_method_types=["card"],
        line_items=[{
            "price_data": {
                "currency": "usd",
                "unit_amount": pack_info["price_cents"],
                "product_data": {"name": f"{pack.title()} Credit Pack"},
            },
            "quantity": 1,
        }],
        mode="payment",
        success_url="https://yourdomain.com/success",
        metadata={"user_id": user["user_id"], "credits": pack_info["credits"]},
    )
    return {"checkout_url": session.url}

from fastapi import Request  # needed so FastAPI injects the raw request

@app.post("/webhooks/stripe")
async def stripe_webhook(request: Request):
    payload = await request.body()
    sig_header = request.headers.get("stripe-signature")
    # Verify the signature so forged requests cannot mint free credits
    event = stripe.Webhook.construct_event(payload, sig_header, "whsec_...")
    if event.type == "checkout.session.completed":
        session = event.data.object
        credits = int(session.metadata["credits"])
        user_id = session.metadata["user_id"]
        # Add credits to the user account
        redis_client.hincrby(f"user:{user_id}", "credits", credits)
    return {"status": "ok"}
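One design point worth making explicit: the packs bake in a volume discount, which is what nudges heavy users toward larger prepayments. Recomputing the effective price per 1,000 credits from the CREDIT_PACKS table above:

```python
# Same pack definitions as in the billing code above
CREDIT_PACKS = {
    "starter": {"credits": 10_000, "price_cents": 2900},
    "growth": {"credits": 50_000, "price_cents": 9900},
    "scale": {"credits": 250_000, "price_cents": 29900},
}

def price_per_1k_credits(pack):
    """Dollars charged per 1,000 credits in a pack."""
    dollars = pack["price_cents"] / 100
    return dollars / (pack["credits"] / 1000)

for name, pack in CREDIT_PACKS.items():
    print(f"{name}: ${price_per_1k_credits(pack):.2f} per 1K credits")
# starter: $2.90, growth: $1.98, scale: $1.20
```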

Scaling Considerations

Horizontal Scaling

# docker-compose.yml
services:
  api:
    build: .
    command: uvicorn api:app --host 0.0.0.0 --port 9090
    deploy:
      replicas: 3

  worker:
    build: .
    command: python worker.py
    deploy:
      replicas: 10
    shm_size: '2gb'  # Required for Chromium

  redis:
    image: redis:7-alpine
    volumes:
      - redis_data:/data

Cost Structure

Component             Cost per 100K requests
Proxy (residential)   $50-100
Compute (workers)     $20-40
Browser instances     $10-20
Storage/bandwidth     $5-10
Total                 $85-170

Pricing at $29 per 10K requests ($290 per 100K) against $85-170 in costs leaves a gross margin of roughly 40-70%.
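To sanity-check that, using the figures from the cost table:

```python
# Unit economics per 100K requests
revenue = 29 * 10               # $29 per 10K requests -> $290 per 100K
cost_low, cost_high = 85, 170   # best and worst case from the table

worst_margin = (revenue - cost_high) / revenue
best_margin = (revenue - cost_low) / revenue
print(f"Gross margin: {worst_margin:.0%} to {best_margin:.0%}")
# Gross margin: 41% to 71%
```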

Monitoring

Use ScrapeOps to monitor success rates per target site. When a site changes its anti-bot measures, you need to know immediately.

Alternatively, integrate ScraperAPI as your proxy layer — they handle all anti-bot bypass, so you can focus on your product instead of proxy infrastructure.

MVP Checklist

  • [ ] API with key authentication
  • [ ] Credit-based billing (Stripe)
  • [ ] HTTP and browser-based scraping
  • [ ] Proxy rotation
  • [ ] Result caching
  • [ ] Rate limiting per user
  • [ ] Dashboard for API key management
  • [ ] Documentation with code examples
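Most of these items map onto code shown earlier; per-user rate limiting is the one piece not yet sketched. A minimal fixed-window limiter on top of the Redis client already in the stack could look like this (key names and limits are illustrative; note that a fixed window allows short bursts at window boundaries — token buckets or sliding windows are smoother):

```python
import time

def allow_request(redis_client, user_id, limit=60, window_seconds=60):
    """Fixed-window limiter: allow at most `limit` requests per window."""
    window = int(time.time() // window_seconds)
    key = f"ratelimit:{user_id}:{window}"
    count = redis_client.incr(key)
    if count == 1:
        # First hit in this window: set a TTL so stale windows clean themselves up
        redis_client.expire(key, window_seconds)
    return count <= limit
```

In the FastAPI layer you would call this inside verify_api_key and raise HTTPException(429) when it returns False.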

Conclusion

A scraping SaaS is a proven business model — ScraperAPI, Bright Data, and Apify all generate millions in revenue. The technical moat is in proxy management, browser infrastructure, and anti-bot bypass. Start with a credit-based model, focus on reliability, and let your customers tell you which features to build next.
