- I stopped duplicate cron runs with Postgres advisory locks.
- I moved “job already exists” into one SQL upsert.
- I rate-limit per source, not globally.
- I keep Next.js out of the scraping path entirely.
Context
I’m building a job board for Psychiatric Mental Health Nurse Practitioners. Next.js 14 on Vercel. Supabase (Postgres) behind it.
The board has 8,000+ active listings across 2,000+ companies. I scrape 200+ jobs daily from multiple sources.
My first version was naive. A Vercel Cron hit an API route. The route scraped. Then wrote rows. It worked… until it didn’t.
One day I had 3 copies of the same job. Another day I missed an entire source. Brutal.
The core issue wasn’t “scraping is hard”. It was “cron is not a single-threaded program”. Retries happen. Overlaps happen. Two regions happen. And “check-then-insert” is a race.
So I rebuilt the pipeline around the database. Postgres decides what runs. Postgres decides what’s new.
1) I don’t trust cron. I make Postgres gate it.
Vercel Cron is solid. But it can overlap.
I saw it in logs. Same minute. Same endpoint. Two executions.
If both runs scrape the same source, they both insert. Even with dedupe later, I still waste requests and hit rate limits.
I fixed it with a Postgres advisory lock. It’s dumb. It’s perfect.
- One lock per source.
- If I can’t get the lock, I exit.
- No retries. No waiting.
```typescript
// lib/locks.ts
// Requires: npm i pg
import { Client } from "pg";

// Stable 32-bit FNV-1a hash so each source maps to a fixed lock id
function hash32(input: string) {
  let h = 2166136261;
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 16777619);
  }
  return h | 0;
}

export async function withAdvisoryLock<T>(
  client: Client,
  key: string,
  fn: () => Promise<T>
) {
  const lockId = hash32(key);
  const { rows } = await client.query(
    "select pg_try_advisory_lock($1) as locked",
    [lockId]
  );
  if (!rows[0]?.locked) return { ran: false as const };
  try {
    const result = await fn();
    return { ran: true as const, result };
  } finally {
    await client.query("select pg_advisory_unlock($1)", [lockId]);
  }
}
```
And yeah, I’m using pg directly for this job. Not the Supabase client.
I want one connection. One transaction if needed. Fewer moving parts.
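The contract that matters here: the first caller gets the lock and runs, an overlapping caller gets `{ ran: false }` and exits. A minimal demo of that behavior, using an in-memory stand-in for the pg client so no database is needed — `fakeLockClient` and the hardcoded lock id are made up for the demo; production passes a real `pg` Client and hashes the key:

```typescript
// In-memory stand-in for the two Postgres calls withAdvisoryLock makes.
// Demo only; production uses a real pg Client.
function fakeLockClient() {
  const held = new Set<number>();
  return {
    async query(
      sql: string,
      params: number[]
    ): Promise<{ rows: { locked: boolean }[] }> {
      const [id] = params;
      if (sql.includes("pg_try_advisory_lock")) {
        const locked = !held.has(id);
        if (locked) held.add(id);
        return { rows: [{ locked }] };
      }
      held.delete(id); // pg_advisory_unlock
      return { rows: [] };
    },
  };
}

// Same shape as lib/locks.ts, minus the pg import
async function withAdvisoryLock<T>(
  client: ReturnType<typeof fakeLockClient>,
  key: string,
  fn: () => Promise<T>
) {
  const lockId = 1234; // hash32(key) in the real module
  const { rows } = await client.query(
    "select pg_try_advisory_lock($1) as locked",
    [lockId]
  );
  if (!rows[0]?.locked) return { ran: false as const };
  try {
    return { ran: true as const, result: await fn() };
  } finally {
    await client.query("select pg_advisory_unlock($1)", [lockId]);
  }
}

const client = fakeLockClient();
const outcomes: boolean[] = [];
// First caller holds the lock; an overlapping second caller just exits.
await withAdvisoryLock(client, "scrape:indeed", async () => {
  const second = await withAdvisoryLock(client, "scrape:indeed", async () => "never");
  outcomes.push(second.ran); // false: lock already held
});
// After the first run finishes, the lock is free again.
const after = await withAdvisoryLock(client, "scrape:indeed", async () => "ok");
outcomes.push(after.ran); // true
```

That `{ ran: false }` return is the whole point: the losing run doesn't wait, doesn't retry, and doesn't touch the source at all.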
2) I stopped “check-then-insert”. Upsert or regret.
My original flow:
- query by URL
- if not found, insert
Two concurrent runs = both don’t find it = both insert. Classic.
I pushed dedupe into the schema.
I store a canonical_url_hash and make it unique. Hashing matters because URLs get long. Postgres can index long text, but I prefer a fixed-width key.
Also: sources lie. UTM params. Tracking junk. Same job, different URL.
So I normalize first. Then hash.
```sql
-- migrations/20260224_jobs_dedupe.sql
-- (hashing happens in Node, so no extensions needed here)
alter table if exists public.jobs
  add column if not exists canonical_url text,
  add column if not exists canonical_url_hash bytea;

-- Unique per canonical URL
create unique index if not exists jobs_canonical_url_hash_uidx
  on public.jobs (canonical_url_hash);
```
Then the upsert. One statement. No race.
```typescript
// lib/upsertJob.ts
import crypto from "node:crypto";
import { Client } from "pg";

function canonicalizeUrl(raw: string) {
  const u = new URL(raw);
  // Drop common tracking params. Add your own.
  [
    "utm_source",
    "utm_medium",
    "utm_campaign",
    "utm_term",
    "utm_content",
    "gh_src",
  ].forEach((p) => u.searchParams.delete(p));
  u.hash = "";
  // Normalize trailing slash
  u.pathname = u.pathname.replace(/\/$/, "");
  return u.toString();
}

export async function upsertJob(client: Client, job: {
  title: string;
  company: string;
  location: string;
  source: string;
  url: string;
  posted_at: string; // ISO
}) {
  const canonical = canonicalizeUrl(job.url);
  const hash = crypto.createHash("sha256").update(canonical).digest();
  const sql = `
    insert into public.jobs
      (title, company, location, source, url, canonical_url, canonical_url_hash, posted_at)
    values
      ($1,$2,$3,$4,$5,$6,$7,$8)
    on conflict (canonical_url_hash)
    do update set
      title = excluded.title,
      company = excluded.company,
      location = excluded.location,
      source = excluded.source,
      url = excluded.url,
      posted_at = greatest(public.jobs.posted_at, excluded.posted_at)
    returning id;
  `;
  const { rows } = await client.query(sql, [
    job.title,
    job.company,
    job.location,
    job.source,
    job.url,
    canonical,
    hash,
    job.posted_at,
  ]);
  return rows[0].id as string;
}
```
That greatest() line saved me later.
Some sources “update” the same job daily. If I blindly overwrite posted_at, sorting turns into nonsense.
3) I rate-limit per source. One bad source shouldn’t slow the rest.
I made this mistake.
I had a global limiter. 1 request / second.
It kept me safe. It also meant one slow source made the entire run take forever. Then the cron window overlapped. Then I got double runs. Again.
So now each source gets its own limiter.
I’m not doing anything fancy. Just Bottleneck.
```typescript
// lib/limiters.ts
// Requires: npm i bottleneck
import Bottleneck from "bottleneck";

export const limiters = {
  // Aggressive sources
  indeed: new Bottleneck({ minTime: 1200, maxConcurrent: 1 }),
  linkedin: new Bottleneck({ minTime: 1500, maxConcurrent: 1 }),
  // Nicer sources
  greenhouse: new Bottleneck({ minTime: 400, maxConcurrent: 2 }),
  lever: new Bottleneck({ minTime: 400, maxConcurrent: 2 }),
};

export async function limited<T>(
  source: keyof typeof limiters,
  fn: () => Promise<T>
) {
  const limiter = limiters[source];
  if (!limiter) return fn();
  return limiter.schedule(fn);
}
```
Then in the scraper:
- parse listing pages
- for each job detail, call limited(source, ...)
The win wasn’t speed. It was predictability.
My cron run time stopped oscillating between 2 minutes and 17 minutes.
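The property I care about can be shown without Bottleneck. This is a toy per-source queue — not what I run; Bottleneck does the real throttling with minTime and maxConcurrent — that demonstrates a slow source delaying only its own jobs:

```typescript
// Toy per-source serial queue: each source gets its own promise chain,
// so a slow source delays only its own jobs. Demo of the isolation
// property only; Bottleneck adds real rate limiting on top of this idea.
const chains = new Map<string, Promise<unknown>>();

function limitedPerSource<T>(source: string, fn: () => Promise<T>): Promise<T> {
  const prev = chains.get(source) ?? Promise.resolve();
  const next = prev.then(fn, fn); // queue behind the previous job for this source
  chains.set(source, next.catch(() => undefined)); // errors don't wedge the chain
  return next;
}

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));
const finished: string[] = [];

// "slow" takes 50ms; the "fast" source's jobs are not blocked behind it.
await Promise.all([
  limitedPerSource("slow", async () => { await sleep(50); finished.push("slow-1"); }),
  limitedPerSource("fast", async () => { finished.push("fast-1"); }),
  limitedPerSource("fast", async () => { finished.push("fast-2"); }),
]);
// finished: ["fast-1", "fast-2", "slow-1"]
```

With one global limiter, both fast jobs would have queued behind the 50ms job. That queueing is exactly what pushed my runs past the cron window.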
4) I keep scraping out of Next.js routes. It runs as a worker.
I started with app/api/scrape/route.ts.
Bad idea.
A Next.js route is a request/response handler. It’s not a job runner.
- Timeouts are real.
- Memory limits are real.
- Retries happen.
- Logs get noisy fast.
So now the cron calls a tiny endpoint that enqueues work. The worker does the scraping.
I’m using Supabase + Postgres, so I store a queue table.
```sql
-- migrations/20260224_queue.sql
create table if not exists public.scrape_runs (
  id bigserial primary key,
  source text not null,
  status text not null default 'queued',
  enqueued_at timestamptz not null default now(),
  started_at timestamptz,
  finished_at timestamptz,
  error text
);

create index if not exists scrape_runs_status_idx
  on public.scrape_runs (status, enqueued_at);
```
The Next.js route only inserts rows.
```typescript
// app/api/cron/enqueue/route.ts
import { NextResponse } from "next/server";
import { Client } from "pg";

export async function GET(req: Request) {
  // Minimal auth. Don't ship this publicly.
  const token = new URL(req.url).searchParams.get("token");
  if (token !== process.env.CRON_TOKEN) {
    return NextResponse.json({ ok: false }, { status: 401 });
  }
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();
  try {
    const sources = ["greenhouse", "lever", "indeed", "linkedin"];
    await client.query(
      `insert into public.scrape_runs (source) select unnest($1::text[])`,
      [sources]
    );
    return NextResponse.json({ ok: true, queued: sources.length });
  } finally {
    await client.end(); // close the connection even if the insert throws
  }
}
```
And the worker:
- pulls one queued run
- takes advisory lock for that source
- marks started
- scrapes + upserts
- marks finished
I run it on a separate process. Locally: node worker.js. In prod: a scheduled job runner.
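The “pulls one queued run” step is the only subtle part of that loop. A sketch of the claim query I’d use, assuming the scrape_runs table from the migration above — `for update skip locked` keeps two workers from grabbing the same row:

```sql
-- Claim the oldest queued run atomically; rows another worker has
-- already locked are skipped instead of blocking.
update public.scrape_runs
set status = 'running', started_at = now()
where id = (
  select id
  from public.scrape_runs
  where status = 'queued'
  order by enqueued_at
  limit 1
  for update skip locked
)
returning id, source;
```

If this returns zero rows, the worker sleeps and polls again. The advisory lock per source still applies on top, since a queued row and a running scrape are different things.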
Spent 4 hours trying to do this “inside Next”. Most of it was wrong.
Results
Before the lock + upsert changes, I’d see 20–60 duplicate rows per week depending on how many overlaps happened. I measured it by counting identical canonical_url_hash values.
After pushing uniqueness into Postgres and wrapping each source scrape in an advisory lock, duplicates dropped to 0 in the last 14 days.
Cron runs also stabilized. My daily scrape adds 200+ jobs, and the run time is now consistently 6–9 minutes instead of randomly spiking past 15.
The best part: fewer wasted requests. Rate limiting stopped being a firefight.
Key takeaways
- Cron endpoints must be idempotent. If they can’t be, they’ll break.
- Advisory locks are the simplest “only one runner” switch I’ve used.
- Deduplication belongs in a unique index, not in app code.
- Normalize URLs before hashing. UTM params will wreck you.
- Rate limit per source. One hostile site shouldn’t slow everything.
Closing
If you’re scraping jobs (or any listings), how are you preventing overlapping runs in production: advisory locks, a queue, or a dedicated job system like Temporal/BullMQ?