- I stopped double-scraping jobs with a Postgres lease table.
- I made cron runs idempotent with a single SQL function.
- I got rid of “why is this job missing?” by logging per-source runs.
- Works on Next.js 14 + Supabase + Vercel cron.
Context
I’m building a job board for Psychiatric Mental Health Nurse Practitioners.
It scrapes 200+ jobs daily. Multiple sources. Different HTML. Different APIs. The usual pain.
At ~8,000 active listings and ~2,000 companies, one bad cron day hurts. Duplicates show up. Or worse — jobs silently don’t.
My first version was “just run scraping in a cron route.” Brutal.
Vercel cron can overlap. Deploys can overlap. And if a source gets slow, the next run starts anyway. I spent 4 hours “fixing dedupe” and most of it was wrong because the real issue was concurrency.
So I moved coordination into Postgres. Not Redis. Not queues. Just SQL.
1) I stopped trusting cron timing. I added a lease.
Two cron invocations can run at the same minute.
And if you redeploy while a scrape is running? You can get a second run. Same source. Same time window.
I needed a lock. But I also needed it to expire, because scrapers crash.
So: a lease table in Postgres.
```sql
-- supabase/migrations/202604280001_scrape_leases.sql
create table if not exists public.scrape_leases (
  source text primary key,
  lease_owner uuid not null,
  lease_until timestamptz not null,
  updated_at timestamptz not null default now()
);

-- Helps cleanup queries and debugging
create index if not exists scrape_leases_until_idx
  on public.scrape_leases (lease_until);
```
A lease is just "I own scraping `source` until `lease_until`".
No background workers. No magic.
And yeah, I tried `pg_advisory_lock` first. It works. Then you hit serverless edges and connection-pooling weirdness. I wanted something I can inspect with a normal `select *`.
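That inspectability is the whole point. When something looks stuck, a query like this (a sketch against the lease table above) shows who holds what and whether it has expired:

```sql
-- Who holds each lease right now, and is it still live?
select
  source,
  lease_owner,
  lease_until,
  lease_until > now() as still_held,
  now() - updated_at  as age
from public.scrape_leases
order by source;
```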
2) I made the lock acquisition a single SQL function
Doing “select then insert/update” from Node is where race conditions breed.
So I pushed it into a Postgres function.
One round-trip. Atomic.
```sql
-- supabase/migrations/202604280002_try_acquire_lease.sql
create or replace function public.try_acquire_scrape_lease(
  p_source text,
  p_owner uuid,
  p_ttl_seconds int
)
returns boolean
language plpgsql
as $$
declare
  v_now timestamptz := now();
  v_until timestamptz := v_now + make_interval(secs => p_ttl_seconds);
begin
  -- Insert if missing; on conflict, only take over an expired lease or renew our own
  insert into public.scrape_leases (source, lease_owner, lease_until, updated_at)
  values (p_source, p_owner, v_until, v_now)
  on conflict (source) do update
    set lease_owner = excluded.lease_owner,
        lease_until = excluded.lease_until,
        updated_at = excluded.updated_at
    where scrape_leases.lease_until < v_now
       or scrape_leases.lease_owner = p_owner;

  -- If we own it now, we acquired it.
  return exists (
    select 1
    from public.scrape_leases l
    where l.source = p_source
      and l.lease_owner = p_owner
      and l.lease_until >= v_now
  );
end;
$$;
```
That `where lease_until < now() or lease_owner = p_owner` line matters.
It lets the same run renew its own lease. But it blocks other runs.
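To make the win/renew/steal behavior concrete, here's an in-memory TypeScript model of the same rules. This is an illustration only, not the real SQL — just the decision logic, so you can see who acquires when:

```typescript
// In-memory model of the lease rules above (illustration only, not the SQL).
type Lease = { owner: string; until: number };

const leases = new Map<string, Lease>();

// Mirrors try_acquire_scrape_lease: take the lease if it's missing,
// expired, or already ours (renewal). Otherwise fail.
function tryAcquire(source: string, owner: string, ttlMs: number, now: number): boolean {
  const lease = leases.get(source);
  const free = !lease || lease.until < now || lease.owner === owner;
  if (free) leases.set(source, { owner, until: now + ttlMs });
  return free;
}

// Owner A wins the first acquire; B is blocked until A's TTL lapses.
console.log(tryAcquire("indeed", "A", 1000, 0));    // true  (A wins)
console.log(tryAcquire("indeed", "B", 1000, 100));  // false (A still holds)
console.log(tryAcquire("indeed", "A", 1000, 500));  // true  (A renews)
console.log(tryAcquire("indeed", "B", 1000, 2000)); // true  (A's lease expired)
```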
I learned the hard way that “lock once at the start” isn’t enough. A slow source can run longer than your TTL. Then the next cron steals the lease and now two runs are writing.
So I renew.
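One way to structure that renewal — a sketch, where `withLeaseRenewal` is a name I'm making up, not from the real codebase — is to run the scrape under a timer that re-acquires on an interval and stops when the work finishes or throws:

```typescript
// Hypothetical helper: renew a lease on an interval while long work runs.
// In real code, `acquire` would call the try_acquire_scrape_lease RPC.
async function withLeaseRenewal<T>(
  acquire: () => Promise<boolean>,
  work: () => Promise<T>,
  renewEveryMs: number
): Promise<T> {
  const timer = setInterval(() => {
    // A failed renewal means another run may have stolen the lease;
    // in real code you'd log it and consider aborting the scrape.
    acquire().catch(() => {});
  }, renewEveryMs);
  try {
    return await work();
  } finally {
    clearInterval(timer); // always stop renewing, even on error
  }
}

// Fake acquire + slow work, just to show the shape:
let renewals = 0;
const result = await withLeaseRenewal(
  async () => { renewals++; return true; },
  async () => { await new Promise((r) => setTimeout(r, 150)); return "done"; },
  40 // in real code this would be ~60_000 (once a minute)
);
console.log(result, renewals); // the lease was renewed while work ran
```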
3) I wired it into a Next.js 14 cron route
This is App Router.
I use a single cron route, and loop sources.
Each source gets a lease owner UUID for that invocation.
```typescript
// app/api/cron/scrape/route.ts
import { NextResponse } from "next/server";
import { createClient } from "@supabase/supabase-js";
import { randomUUID } from "crypto";

export const runtime = "nodejs";

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!,
  { auth: { persistSession: false } }
);

const SOURCES = ["indeed", "ziprecruiter", "hospital_site"] as const;

async function acquire(source: string, owner: string, ttlSeconds: number) {
  const { data, error } = await supabase.rpc("try_acquire_scrape_lease", {
    p_source: source,
    p_owner: owner,
    p_ttl_seconds: ttlSeconds,
  });
  if (error) throw error;
  return Boolean(data);
}

export async function GET(req: Request) {
  const auth = req.headers.get("authorization");
  if (auth !== `Bearer ${process.env.CRON_SECRET}`) {
    return NextResponse.json({ ok: false }, { status: 401 });
  }

  const owner = randomUUID();
  const results: Record<string, string> = {};

  for (const source of SOURCES) {
    const gotLease = await acquire(source, owner, 8 * 60); // 8 minutes
    if (!gotLease) {
      results[source] = "skipped (lease held)";
      continue;
    }

    // Do the real work.
    // Important: renew inside the scraper for long runs.
    await scrapeSource(source, owner);
    results[source] = "ok";
  }

  return NextResponse.json({ ok: true, owner, results });
}

async function scrapeSource(source: string, owner: string) {
  // placeholder for your per-source scraper.
  // I keep these in /lib/scrapers/*
  await new Promise((r) => setTimeout(r, 50));
  // Example renewal for long jobs. In real code I do this every ~60s.
  await acquire(source, owner, 8 * 60);
}
```
This isn’t fancy.
But it stopped the overlap bugs immediately.
Also: don't forget `export const runtime = "nodejs"` if you import from Node's `crypto` module.
I hit `ReferenceError: crypto is not defined` once when I forgot and the route ran on the Edge runtime. Wasted 30 minutes.
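For completeness, the schedule itself lives in `vercel.json`. A minimal config, assuming this route path and the every-30-minutes cadence used here:

```json
{
  "crons": [
    { "path": "/api/cron/scrape", "schedule": "*/30 * * * *" }
  ]
}
```

When the `CRON_SECRET` environment variable is set, Vercel sends it as an `Authorization: Bearer` header on cron invocations, which is what the route's auth check expects.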
4) I logged every run. Because “skipped” isn’t enough.
After I added leases, I still had the “where did the jobs go?” problem.
A source might be skipped for 6 hours because another run grabbed the lease and then crashed before logging anything useful.
So I added a run log table.
One row per source run. Status + counts + error.
```sql
-- supabase/migrations/202604280003_scrape_runs.sql
create table if not exists public.scrape_runs (
  id bigserial primary key,
  source text not null,
  owner uuid not null,
  started_at timestamptz not null default now(),
  finished_at timestamptz,
  status text not null default 'running', -- running|ok|error|skipped
  fetched_count int not null default 0,
  upserted_count int not null default 0,
  error text
);

create index if not exists scrape_runs_source_started_idx
  on public.scrape_runs (source, started_at desc);
```
And in the route, I insert/update around the scrape.
```typescript
// lib/scrapeRunLog.ts
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!,
  { auth: { persistSession: false } }
);

export async function startRun(source: string, owner: string) {
  const { data, error } = await supabase
    .from("scrape_runs")
    .insert({ source, owner, status: "running" })
    .select("id")
    .single();
  if (error) throw error;
  return data.id as number;
}

export async function finishRun(params: {
  id: number;
  status: "ok" | "error" | "skipped";
  fetchedCount?: number;
  upsertedCount?: number;
  errorText?: string;
}) {
  const { error } = await supabase
    .from("scrape_runs")
    .update({
      status: params.status,
      fetched_count: params.fetchedCount ?? 0,
      upserted_count: params.upsertedCount ?? 0,
      error: params.errorText ?? null,
      finished_at: new Date().toISOString(),
    })
    .eq("id", params.id);
  if (error) throw error;
}
```
Now when I wake up and something’s off, I query:
- last run per source
- last error text
- “skipped” streaks
It’s boring. It’s also the difference between guessing and knowing.
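Those checks are plain SQL against `scrape_runs`. Sketches, using Postgres `distinct on` for the latest row per source:

```sql
-- Latest run per source, with status and error text
select distinct on (source)
  source, status, started_at, finished_at,
  fetched_count, upserted_count, error
from public.scrape_runs
order by source, started_at desc;

-- Recent errors for one source
select started_at, error
from public.scrape_runs
where source = 'indeed' and status = 'error'
order by started_at desc
limit 10;
```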
Results
Before leases, I saw overlapping runs a few times a week. Real duplicates. Not just “same job from two sources” duplicates — literal same source, same job, inserted twice because two scrapers ran concurrently.
After shipping the lease + run log, the cron route ran 96 times over 48 hours (every 30 minutes). 0 overlaps per source. 0 double inserts from concurrency.
I also got faster debugging. When a source started failing, I had the last 10 error strings in `scrape_runs` instead of scanning Vercel logs for `Error: socket hang up`.
The board stayed steady at 8,000+ active listings without surprise spikes.
Key takeaways
- Don’t coordinate scrapes in Node. Coordinate in Postgres.
- Use a lease with TTL. Scrapers crash.
- Make lease acquisition one SQL function. One round-trip. Atomic.
- Renew leases during long scrapes, not just at the start.
- Log every source run with counts + error text, or you’ll debug from vibes.
Closing
If you’re scraping multiple sources, concurrency bugs show up late — right when you’ve got enough data that cleaning it is miserable.
Do you prefer Postgres leases like this, or do you use advisory locks (pg_advisory_lock) for cron coordination in serverless environments?