- Run a daily scraper on Vercel. Without melting sources.
- Enforce per-host rate limits in Node. Not “sleep(1000)”.
- Make retries idempotent with Postgres locks.
- Store failures so I can re-run only the broken ones.
## Context
I’m building a job board for Psychiatric Mental Health Nurse Practitioners.
8,000+ active listings. 2,000+ companies.
The pipeline scrapes 200+ jobs daily from multiple sources.
Some are nice JSON feeds. Most aren’t.
My first version was dumb.
One cron. One loop. Fetch everything.
It worked. Until it didn’t.
429s. Random 403s. Timeouts.
Worse — half a run would succeed, then retries would duplicate work and waste time.
This post is how I stabilized it.
Rate limiting by host. Jitter. Backoff.
And a Postgres lock so reruns don’t stomp each other.
## 1) I stopped using “one cron to rule them all”
I used to do this:
“Cron hits /api/scrape and that endpoint scrapes everything.”
Brutal.
One slow host makes the whole run slow.
And Vercel timeouts become your scheduler.
Now I split it.
One cron endpoint schedules work items.
Then another endpoint processes one source at a time.
That gives me:
- better observability
- cheap retries
- isolation when a source is flaky
Here’s the scheduler route.
Next.js 14 App Router. Runs on Vercel.
```typescript
// app/api/cron/schedule/route.ts
import { NextResponse } from "next/server";
import { createClient } from "@supabase/supabase-js";

export const runtime = "nodejs";

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);

export async function GET(req: Request) {
  // Simple auth. Don't leave cron endpoints open.
  const token = new URL(req.url).searchParams.get("token");
  if (token !== process.env.CRON_TOKEN) {
    return NextResponse.json({ error: "unauthorized" }, { status: 401 });
  }

  // Keep it explicit. I only schedule enabled sources.
  const { data: sources, error } = await supabase
    .from("scrape_sources")
    .select("id")
    .eq("enabled", true);

  if (error) {
    return NextResponse.json({ error: error.message }, { status: 500 });
  }

  // One job per source.
  const jobs = sources!.map((s) => ({
    source_id: s.id,
    status: "queued" as const,
  }));

  const { error: insertErr } = await supabase.from("scrape_runs").insert(jobs);
  if (insertErr) {
    return NextResponse.json({ error: insertErr.message }, { status: 500 });
  }

  return NextResponse.json({ scheduled: jobs.length });
}
```
That table is small.
One row per source per run.
I don’t schedule “200 jobs”. I schedule “N sources”.
N is stable.
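For reference, the daily trigger is a cron entry in `vercel.json`. This is a minimal sketch (the schedule here, 06:00 UTC, is my assumption, not the real one). How you pass the auth token depends on your setup; Vercel can also send a `CRON_SECRET` bearer header to cron routes, which some people prefer over a query token.

```json
{
  "crons": [
    {
      "path": "/api/cron/schedule",
      "schedule": "0 6 * * *"
    }
  ]
}
```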
## 2) I rate-limit per host, not per run
I wasted 4 hours here.
Most of what I tried was wrong.
I first added `await new Promise((r) => setTimeout(r, 500))` between requests.
It felt responsible.
It wasn’t.
Different hosts have different thresholds.
Also, parallelism matters.
Two concurrent requests to the same host can trip a limit even if each request is “slow enough”.
So I built a tiny per-host limiter.
In-memory.
Works fine because this is one worker invocation doing one source.
```typescript
// lib/rateLimit.ts
// In-memory, so it only limits within a single invocation.
// That's fine here: each worker invocation scrapes one source, sequentially.
const nextAllowedAt = new Map<string, number>();

function sleep(ms: number) {
  return new Promise((r) => setTimeout(r, ms));
}

export async function perHostLimit(host: string, minIntervalMs: number) {
  const now = Date.now();
  const allowed = nextAllowedAt.get(host) ?? 0;
  const waitMs = Math.max(0, allowed - now);
  if (waitMs > 0) await sleep(waitMs);

  // Add jitter so multiple invocations don't sync up perfectly.
  const jitter = Math.floor(Math.random() * 250);
  nextAllowedAt.set(host, Date.now() + minIntervalMs + jitter);
}
```
Then I wrap fetches.
Host comes from the URL.
```typescript
// lib/http.ts
import { perHostLimit } from "./rateLimit";

export async function limitedFetch(url: string, init?: RequestInit) {
  const host = new URL(url).host;

  // Default: 1 request / 1200ms per host.
  // I override per source when needed.
  await perHostLimit(host, 1200);

  const res = await fetch(url, {
    ...init,
    headers: {
      "user-agent": "pmhnp-job-scraper/1.0",
      ...(init?.headers ?? {}),
    },
  });
  return res;
}
```
This alone cut down 429s a lot.
Not to zero.
But enough that retries became rare.
## 3) Retries: I only retry the stuff that’s retryable
My earlier retry logic was “retry everything 3 times.”
That’s how you get banned.
Now I classify.
429 and 503? Retry.
403? Stop. Probably blocked.
404? Stop. Probably removed.
Also.
I log every failure in Postgres.
Because “I’ll check logs later” is a lie.
```typescript
// lib/retry.ts
function sleep(ms: number) {
  return new Promise((r) => setTimeout(r, ms));
}

export async function fetchWithBackoff(
  fn: () => Promise<Response>,
  opts: { maxAttempts: number }
) {
  let attempt = 0;
  while (true) {
    attempt++;
    const res = await fn();
    if (res.ok) return res;

    const retryable =
      res.status === 429 || res.status === 502 || res.status === 503;
    if (!retryable || attempt >= opts.maxAttempts) return res;

    // Exponential backoff + jitter.
    const base = 500 * Math.pow(2, attempt - 1);
    const jitter = Math.floor(Math.random() * 300);
    await sleep(base + jitter);
  }
}
```
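The delay schedule that loop produces is easy to eyeball. Ignoring jitter, attempt n sleeps 500 · 2^(n-1) ms before the next try:

```typescript
// Base backoff delay (ms) after a failed attempt, jitter ignored.
// This mirrors the `base` computation in fetchWithBackoff above.
function backoffBaseMs(attempt: number): number {
  return 500 * Math.pow(2, attempt - 1);
}

// attempt 1 -> 500ms, attempt 2 -> 1000ms, attempt 3 -> 2000ms, attempt 4 -> 4000ms
```

So with `maxAttempts: 4`, a fetch that fails every time spends about 3.5s sleeping (500 + 1000 + 2000 ms, plus jitter) before the last response is returned.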
Then in the scraper.
I record failures with the exact status.
That’s what I filter on when re-running.
I learned the hard way that retrying 403s is pointless.
It just looks like a bot hammering a wall.
Because it is.
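The stop/retry rule above can be written down as a tiny classifier. This is a sketch with hypothetical names (`classifyStatus`, `FetchOutcome` are mine, not from the repo):

```typescript
type FetchOutcome = "ok" | "retry" | "blocked" | "removed" | "failed";

// Map an HTTP status to what the scraper should do with it.
function classifyStatus(status: number): FetchOutcome {
  if (status >= 200 && status < 300) return "ok";
  if (status === 429 || status === 502 || status === 503) return "retry"; // transient
  if (status === 403) return "blocked"; // likely bot-blocked; retrying just hammers the wall
  if (status === 404) return "removed"; // listing is gone; nothing to retry
  return "failed"; // everything else: log it, don't retry
}
```

The `blocked` and `removed` outcomes go straight into the failure log with their status, which is exactly what the re-run filter keys on later.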
## 4) Idempotency: I use a Postgres advisory lock
This one saved me.
Vercel cron can overlap.
Deploys happen.
Manual re-runs happen.
And I don’t want two invocations scraping the same source.
I use `pg_try_advisory_lock`.
It’s a single SQL statement.
No extra tables.
The lock auto-releases when the DB connection closes.
Since I’m on Supabase Postgres, I call it through `rpc()`.
So I created two functions.
One to acquire. One to release.
```sql
-- supabase/migrations/20260317_advisory_locks.sql
create or replace function public.try_lock_source(p_source_id bigint)
returns boolean
language sql
as $$
  select pg_try_advisory_lock(p_source_id);
$$;

create or replace function public.unlock_source(p_source_id bigint)
returns boolean
language sql
as $$
  select pg_advisory_unlock(p_source_id);
$$;
```
Then the worker route grabs the lock.
If it can’t, it exits cleanly.
No drama.
```typescript
// app/api/cron/run-source/route.ts
import { NextResponse } from "next/server";
import { createClient } from "@supabase/supabase-js";

export const runtime = "nodejs";

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);

export async function POST(req: Request) {
  const { sourceId } = (await req.json()) as { sourceId: number };

  const { data: locked, error: lockErr } = await supabase.rpc(
    "try_lock_source",
    { p_source_id: sourceId }
  );

  if (lockErr) {
    return NextResponse.json({ error: lockErr.message }, { status: 500 });
  }
  if (!locked) {
    // Another run is already processing this source.
    return NextResponse.json({ skipped: true, reason: "locked" });
  }

  try {
    // Do the scrape. Insert jobs. Update scrape_runs.
    // (I keep that code in a separate module.)
    return NextResponse.json({ ok: true });
  } finally {
    await supabase.rpc("unlock_source", { p_source_id: sourceId });
  }
}
```
This is boring.
That’s why it’s good.
One thing that bit me — advisory locks are per DB session.
If you’re using a pooler that multiplexes sessions, behavior can get weird.
On Supabase, I run this through the normal Postgres connection behind the API.
It’s been consistent for me.
## 5) I store a “run log” so I can replay failures
When scraping breaks, it breaks in patterns.
One host changes HTML.
One API starts returning 503 for 20 minutes.
If I don’t store run outcomes, I end up re-scraping everything.
That’s noisy.
And it increases dedupe pressure downstream.
So I store:
- `source_id`
- `status` (`queued | running | success | failed`)
- `http_status`
- `error_message`
- timestamps
Then my “re-run” cron just selects failures.
No guessing.
I’m not showing the whole schema here.
But the behavior is simple:
query `scrape_runs` where `status = 'failed' and created_at > now() - interval '24 hours'`.
Requeue those.
It’s not glamorous.
It’s the difference between a stable system and a nightly gamble.
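As a sketch, that re-run selection boils down to a filter like this. The row shape is my guess at the `scrape_runs` columns listed above, not the actual schema:

```typescript
interface ScrapeRun {
  source_id: number;
  status: "queued" | "running" | "success" | "failed";
  created_at: string; // ISO timestamp
}

// Return the source ids worth requeueing: failed within the last 24 hours.
function failuresToRequeue(runs: ScrapeRun[], now: Date = new Date()): number[] {
  const cutoff = now.getTime() - 24 * 60 * 60 * 1000;
  return runs
    .filter(
      (r) => r.status === "failed" && new Date(r.created_at).getTime() > cutoff
    )
    .map((r) => r.source_id);
}
```

In production this is one SQL query, not an in-memory filter, but the predicate is the same.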
## Results
Before these changes, my daily scrape had frequent partial failures.
On an average day I’d see 18–35 failed fetches across sources, mostly 429 and 503.
Rerunning meant scraping everything again, which made rate limiting worse.
After splitting scheduling + per-source runs, adding per-host throttling, and using advisory locks, failures dropped to 2–6 per day.
Those are usually real issues now (HTML changed, endpoint removed).
The job board stays at 8,000+ active listings, and I keep adding ~200 new jobs daily without the pipeline flaking out.
## Key takeaways
- Don’t run “one cron endpoint” for everything. Schedule units of work.
- Rate-limit by host. Not by loop iterations.
- Retry only 429/502/503. Stop on 403/404.
- Use Postgres advisory locks to prevent overlapping work.
- Persist failures in a table so reruns are targeted, not noisy.
## Closing
If you’re scraping on Vercel, overlap happens.
Even when you swear it won’t.
Do you use Postgres advisory locks for cron idempotency, or do you prefer a queue (like pg-boss / BullMQ) even for small scrape workloads?