If you're running a SvelteKit app on Cloudflare Pages and your content is publicly accessible, commodity scrapers will find it eventually. Here's the protection setup we use at Tested.gg - two layers, mostly free, minimal code.
The architecture problem first
If your API is behind Cloudflare Service Bindings (not publicly exposed over HTTP), scrapers can only hit your SvelteKit Pages app. That's your entire attack surface. All protection goes there, not on the API Worker.
This matters because most bot protection tutorials target REST APIs. In a service-binding architecture, the SvelteKit SSR layer is the only thing that faces the public internet.
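For reference, a service binding in the Pages project's wrangler configuration looks roughly like this (a sketch; the binding and service names here are hypothetical, not from our setup):

```toml
# Hypothetical binding: the SvelteKit Pages app reaches the API Worker
# through this binding only - the API Worker has no public route.
[[services]]
binding = "API"
service = "api-worker"
```

With no route attached to the API Worker, there is nothing for a scraper to hit except the Pages app itself.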
Layer 1: WAF Rate Limiting (Cloudflare Dashboard, free)
Cloudflare's free tier includes WAF rate limiting rules. These execute at the edge, before your Pages Worker runs, which makes them strictly faster and cheaper than application-level rate limiting.
Dashboard > Security > WAF > Rate limiting rules:
General page limit:

- Match: URI Path starts with /
- Rate: 60 requests per minute per IP
- Action: Block (10 minutes)

Write endpoint limit:

- Match: URI Path starts with /site AND Method = POST
- Rate: 10 requests per minute per IP
- Action: Block (1 hour)
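If you use the custom expression editor instead of the field dropdowns, the two matches above can be written in Cloudflare's Rules language roughly as follows (first expression is the general page limit, second the write endpoint limit; verify the syntax against your dashboard):

```
starts_with(http.request.uri.path, "/")

starts_with(http.request.uri.path, "/site") and http.request.method eq "POST"
```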
One important property: CDN cache HITs don't count against rate limiting rules because they never reach the Worker. Your thresholds only apply to cache misses - which is exactly the right behavior. A human browsing cached pages won't trip the limit; a scraper busting the cache will.
Layer 2: Threat Score in SvelteKit hooks
Cloudflare populates cf.threatScore on every request - a 0-100 score derived from their IP reputation database. 0 is clean, 100 is worst. It's available on every plan, for free.
First, add cf to your Platform interface in app.d.ts:
```typescript
// src/app.d.ts
declare global {
  namespace App {
    interface Platform {
      env: Cloudflare.Env;
      context: {
        waitUntil(promise: Promise<unknown>): void;
      };
      caches: CacheStorage;
      cf?: IncomingRequestCfProperties;
    }
  }
}

export {};
```
Then create a bot-guard.ts hook:
```typescript
import type { Handle } from "@sveltejs/kit";

const THREAT_SCORE_THRESHOLD = 30;

export const handleBotGuard: Handle = ({ event, resolve }) => {
  const cf = event.platform?.cf;
  if (!cf) return resolve(event);

  const threatScore =
    "threatScore" in cf && typeof cf.threatScore === "number"
      ? cf.threatScore
      : 0;

  if (threatScore > THREAT_SCORE_THRESHOLD) {
    console.warn(
      `[bot-guard] Blocked: id=${event.locals.requestId} ip=${event.getClientAddress()} score=${threatScore} path=${event.url.pathname}`
    );
    return new Response("Access denied", {
      status: 403,
      headers: {
        "Content-Type": "text/plain",
        "Retry-After": "3600"
      }
    });
  }

  return resolve(event);
};
```
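The decision logic is easy to unit test if you isolate it into a pure function. A minimal sketch mirroring the hook's logic (`shouldBlock` and the `CfProps` shape are illustrative names, not SvelteKit APIs):

```typescript
// Illustrative extraction of the guard's decision logic.
// CfProps mimics the one field we read from IncomingRequestCfProperties.
type CfProps = { threatScore?: number };

function shouldBlock(cf: CfProps | undefined, threshold = 30): boolean {
  if (!cf) return false; // local dev: platform.cf is undefined, let it through
  const score = typeof cf.threatScore === "number" ? cf.threatScore : 0;
  return score > threshold;
}

console.log(shouldBlock({ threatScore: 85 })); // true  - blocked
console.log(shouldBlock({ threatScore: 10 })); // false - allowed
console.log(shouldBlock(undefined));           // false - no cf, allowed
```

Note the strict `>`: a score exactly at the threshold passes, matching the hook above.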
Wire it into hooks.server.ts after your platform setup, before auth or SSR:
```typescript
// hooks.server.ts (imports of the individual hooks omitted)
import { sequence } from "@sveltejs/kit/hooks";
import type { Handle } from "@sveltejs/kit";

export const handle: Handle = sequence(
  handleRequestId,
  handlePlatform,
  handleBotGuard,
  handleAuth
);
```
Position matters. Block high-threat requests before spending CPU on auth token validation, database calls, or SSR rendering.
Why not KV-based rate limiting?
The common suggestion is to implement per-IP rate limiting inside the Worker using KV. Skip it:
- Dashboard WAF rules run at the edge before the Worker - strictly faster
- KV's `get`+`put` pattern is not atomic - two concurrent requests from the same IP can both read `count = 59`, both write `count = 60`, and neither gets blocked
- KV reads add latency and cost on every request
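The lost-update race is easy to reproduce with an in-memory stand-in for KV (a sketch: the Map plays the KV namespace, the setTimeout plays read/write latency):

```typescript
// Both "requests" read count = 59 before either writes,
// so one increment is lost and the limit of 60 never triggers.
const store = new Map<string, number>();

async function recordHit(ip: string): Promise<number> {
  const count = store.get(ip) ?? 0;            // non-atomic read
  await new Promise((r) => setTimeout(r, 10)); // latency window
  const next = count + 1;
  store.set(ip, next);                         // last writer wins
  return next;
}

async function race(): Promise<number[]> {
  store.set("203.0.113.7", 59); // one request away from the limit of 60
  return Promise.all([recordHit("203.0.113.7"), recordHit("203.0.113.7")]);
}

race().then(([a, b]) => console.log(a, b, store.get("203.0.113.7"))); // 60 60 60
```

Two hits landed, but the counter only moved from 59 to 60. Fixing this properly means Durable Objects or Cloudflare's rate limiting bindings - more moving parts than a dashboard rule.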
Dashboard rules are always preferable when they're expressive enough for your use case.
Tuning the threshold
Start at THREAT_SCORE_THRESHOLD = 30. Monitor Workers Logs for [bot-guard] entries after deploying. If you see false positives (real users blocked), raise it to 40. If you see continued bot traffic, lower it to 20.
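If you'd rather tune without redeploying, the threshold can be read from an environment variable with a safe fallback (a sketch; `BOT_THREAT_THRESHOLD` is an assumed variable name, not part of our setup):

```typescript
// Parse a tunable threshold from an env string, clamping to the valid
// 0-100 score range and falling back to 30 on anything malformed.
function parseThreshold(raw: string | undefined, fallback = 30): number {
  const n = Number(raw);
  return Number.isFinite(n) && n >= 0 && n <= 100 ? n : fallback;
}

console.log(parseThreshold("40"));      // 40
console.log(parseThreshold(undefined)); // 30
console.log(parseThreshold("oops"));    // 30
```

In the hook, a call like `parseThreshold(event.platform?.env?.BOT_THREAT_THRESHOLD)` would replace the hardcoded constant.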
This setup handles commodity scrapers and known-bad IPs. Sophisticated scrapers that rotate clean IPs and mimic browser behavior are a different problem - that's what Cloudflare's paid Bot Management tier is for. But for most content sites, these two layers cover the vast majority of automated traffic.