- I stopped duplicate cron runs with Postgres advisory locks.
- I moved “job already exists” into one SQL upsert.
- I rate-limit per source, not globally.
- I keep Next.js out of the scraping path entirely.
Context
I’m building a job board for Psychiatric Mental Health Nurse Practitioners. Next.js 14 on Vercel. Supabase (Postgres) behind it.
The board has 8,000+ active listings across 2,000+ companies. I scrape 200+ jobs daily from multiple sources.
My first version was naive. A Vercel Cron hit an API route. The route scraped. Then wrote rows. It worked… until it didn’t.
One day I had 3 copies of the same job. Another day I missed an entire source. Brutal.
The core issue wasn’t “scraping is hard”. It was “cron is not a single-threaded program”. Retries happen. Overlaps happen. Two regions happen. And “check-then-insert” is a race.
So I rebuilt the pipeline around the database. Postgres decides what runs. Postgres decides what’s new.
1) I don’t trust cron. I make Postgres gate it.
Vercel Cron is solid. But it can overlap.
I saw it in logs. Same minute. Same endpoint. Two executions.
If both runs scrape the same source, they both insert. Even with dedupe later, I still waste requests and hit rate limits.
I fixed it with a Postgres advisory lock. It’s dumb. It’s perfect.
- One lock per source.
- If I can’t get the lock, I exit.
- No retries. No waiting.
```typescript
// lib/locks.ts
// Requires: npm i pg
import { Client } from "pg";

// Stable 32-bit FNV-1a hash so each source maps to a fixed lock id
function hash32(input: string) {
  let h = 2166136261;
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 16777619);
  }
  return h | 0;
}

export async function withAdvisoryLock<T>(
  client: Client,
  key: string,
  fn: () => Promise<T>
) {
  const lockId = hash32(key);
  const { rows } = await client.query(
    "select pg_try_advisory_lock($1) as locked",
    [lockId]
  );
  if (!rows[0]?.locked) return { ran: false as const };
  try {
    const result = await fn();
    return { ran: true as const, result };
  } finally {
    await client.query("select pg_advisory_unlock($1)", [lockId]);
  }
}
```
And yeah, I’m using pg directly for this job. Not the Supabase client.
I want one connection. One transaction if needed. Fewer moving parts.
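The contract that matters here: the first caller gets the lock and runs, an overlapping caller gets `{ ran: false }` and exits. A minimal demo of that behavior, using an in-memory stand-in for the pg client so no database is needed — `fakeLockClient` and the hardcoded lock id are made up for the demo; production passes a real `pg` Client and hashes the key:

```typescript
// In-memory stand-in for the two Postgres calls withAdvisoryLock makes.
// Demo only; production uses a real pg Client.
function fakeLockClient() {
  const held = new Set<number>();
  return {
    async query(
      sql: string,
      params: number[]
    ): Promise<{ rows: { locked: boolean }[] }> {
      const [id] = params;
      if (sql.includes("pg_try_advisory_lock")) {
        const locked = !held.has(id);
        if (locked) held.add(id);
        return { rows: [{ locked }] };
      }
      held.delete(id); // pg_advisory_unlock
      return { rows: [] };
    },
  };
}

// Same shape as lib/locks.ts, minus the pg import
async function withAdvisoryLock<T>(
  client: ReturnType<typeof fakeLockClient>,
  key: string,
  fn: () => Promise<T>
) {
  const lockId = 1234; // hash32(key) in the real module
  const { rows } = await client.query(
    "select pg_try_advisory_lock($1) as locked",
    [lockId]
  );
  if (!rows[0]?.locked) return { ran: false as const };
  try {
    return { ran: true as const, result: await fn() };
  } finally {
    await client.query("select pg_advisory_unlock($1)", [lockId]);
  }
}

const client = fakeLockClient();
const outcomes: boolean[] = [];
// First caller holds the lock; an overlapping second caller just exits.
await withAdvisoryLock(client, "scrape:indeed", async () => {
  const second = await withAdvisoryLock(client, "scrape:indeed", async () => "never");
  outcomes.push(second.ran); // false: lock already held
});
// After the first run finishes, the lock is free again.
const after = await withAdvisoryLock(client, "scrape:indeed", async () => "ok");
outcomes.push(after.ran); // true
```

That `{ ran: false }` return is the whole point: the losing run doesn't wait, doesn't retry, and doesn't touch the source at all.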
2) I stopped “check-then-insert”. Upsert or regret.
My original flow:
- query by URL
- if not found, insert
Two concurrent runs = both don’t find it = both insert. Classic.
I pushed dedupe into the schema.
I store a canonical_url_hash and make it unique. Hashing matters because URLs get long. Postgres can index long text, but I prefer a fixed-width key.
Also: sources lie. UTM params. Tracking junk. Same job, different URL.
So I normalize first. Then hash.
```sql
-- migrations/20260224_jobs_dedupe.sql
-- (hashing happens in Node, so no extensions needed here)
alter table if exists public.jobs
  add column if not exists canonical_url text,
  add column if not exists canonical_url_hash bytea;

-- Unique per canonical URL
create unique index if not exists jobs_canonical_url_hash_uidx
  on public.jobs (canonical_url_hash);
```
Then the upsert. One statement. No race.
```typescript
// lib/upsertJob.ts
import crypto from "node:crypto";
import { Client } from "pg";

function canonicalizeUrl(raw: string) {
  const u = new URL(raw);
  // Drop common tracking params. Add your own.
  [
    "utm_source",
    "utm_medium",
    "utm_campaign",
    "utm_term",
    "utm_content",
    "gh_src",
  ].forEach((p) => u.searchParams.delete(p));
  u.hash = "";
  // Normalize trailing slash
  u.pathname = u.pathname.replace(/\/$/, "");
  return u.toString();
}

export async function upsertJob(client: Client, job: {
  title: string;
  company: string;
  location: string;
  source: string;
  url: string;
  posted_at: string; // ISO
}) {
  const canonical = canonicalizeUrl(job.url);
  const hash = crypto.createHash("sha256").update(canonical).digest();
  const sql = `
    insert into public.jobs
      (title, company, location, source, url, canonical_url, canonical_url_hash, posted_at)
    values
      ($1,$2,$3,$4,$5,$6,$7,$8)
    on conflict (canonical_url_hash)
    do update set
      title = excluded.title,
      company = excluded.company,
      location = excluded.location,
      source = excluded.source,
      url = excluded.url,
      posted_at = greatest(public.jobs.posted_at, excluded.posted_at)
    returning id;
  `;
  const { rows } = await client.query(sql, [
    job.title,
    job.company,
    job.location,
    job.source,
    job.url,
    canonical,
    hash,
    job.posted_at,
  ]);
  return rows[0].id as string;
}
```
That greatest() line saved me later.
Some sources “update” the same job daily. If I blindly overwrite posted_at, sorting turns into nonsense.
3) I rate-limit per source. One bad source shouldn’t slow the rest.
I made this mistake.
I had a global limiter. 1 request / second.
It kept me safe. It also meant one slow source made the entire run take forever. Then the cron window overlapped. Then I got double runs. Again.
So now each source gets its own limiter.
I’m not doing anything fancy. Just Bottleneck.
```typescript
// lib/limiters.ts
// Requires: npm i bottleneck
import Bottleneck from "bottleneck";

export const limiters = {
  // Aggressive sources
  indeed: new Bottleneck({ minTime: 1200, maxConcurrent: 1 }),
  linkedin: new Bottleneck({ minTime: 1500, maxConcurrent: 1 }),
  // Nicer sources
  greenhouse: new Bottleneck({ minTime: 400, maxConcurrent: 2 }),
  lever: new Bottleneck({ minTime: 400, maxConcurrent: 2 }),
};

export async function limited<T>(
  source: keyof typeof limiters,
  fn: () => Promise<T>
) {
  const limiter = limiters[source];
  if (!limiter) return fn();
  return limiter.schedule(fn);
}
```
Then in the scraper:
- parse listing pages
- for each job detail, call limited(source, ...)
The win wasn’t speed. It was predictability.
My cron run time stopped oscillating between 2 minutes and 17 minutes.
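The property I care about can be shown without Bottleneck. This is a toy per-source queue — not what I run; Bottleneck does the real throttling with minTime and maxConcurrent — that demonstrates a slow source delaying only its own jobs:

```typescript
// Toy per-source serial queue: each source gets its own promise chain,
// so a slow source delays only its own jobs. Demo of the isolation
// property only; Bottleneck adds real rate limiting on top of this idea.
const chains = new Map<string, Promise<unknown>>();

function limitedPerSource<T>(source: string, fn: () => Promise<T>): Promise<T> {
  const prev = chains.get(source) ?? Promise.resolve();
  const next = prev.then(fn, fn); // queue behind the previous job for this source
  chains.set(source, next.catch(() => undefined)); // errors don't wedge the chain
  return next;
}

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));
const finished: string[] = [];

// "slow" takes 50ms; the "fast" source's jobs are not blocked behind it.
await Promise.all([
  limitedPerSource("slow", async () => { await sleep(50); finished.push("slow-1"); }),
  limitedPerSource("fast", async () => { finished.push("fast-1"); }),
  limitedPerSource("fast", async () => { finished.push("fast-2"); }),
]);
// finished: ["fast-1", "fast-2", "slow-1"]
```

With one global limiter, both fast jobs would have queued behind the 50ms job. That queueing is exactly what pushed my runs past the cron window.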
4) I keep scraping out of Next.js routes. It runs as a worker.
I started with app/api/scrape/route.ts.
Bad idea.
A Next.js route is a request/response handler. It’s not a job runner.
- Timeouts are real.
- Memory limits are real.
- Retries happen.
- Logs get noisy fast.
So now the cron calls a tiny endpoint that enqueues work. The worker does the scraping.
I’m using Supabase + Postgres, so I store a queue table.
```sql
-- migrations/20260224_queue.sql
create table if not exists public.scrape_runs (
  id bigserial primary key,
  source text not null,
  status text not null default 'queued',
  enqueued_at timestamptz not null default now(),
  started_at timestamptz,
  finished_at timestamptz,
  error text
);

create index if not exists scrape_runs_status_idx
  on public.scrape_runs (status, enqueued_at);
```
The Next.js route only inserts rows.
```typescript
// app/api/cron/enqueue/route.ts
import { NextResponse } from "next/server";
import { Client } from "pg";

export async function GET(req: Request) {
  // Minimal auth. Don't ship this publicly.
  const token = new URL(req.url).searchParams.get("token");
  if (token !== process.env.CRON_TOKEN) {
    return NextResponse.json({ ok: false }, { status: 401 });
  }
  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();
  try {
    const sources = ["greenhouse", "lever", "indeed", "linkedin"];
    await client.query(
      `insert into public.scrape_runs (source) select unnest($1::text[])`,
      [sources]
    );
    return NextResponse.json({ ok: true, queued: sources.length });
  } finally {
    await client.end(); // close the connection even if the insert throws
  }
}
```
And the worker:
- pulls one queued run
- takes advisory lock for that source
- marks started
- scrapes + upserts
- marks finished
I run it on a separate process. Locally: node worker.js. In prod: a scheduled job runner.
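The “pulls one queued run” step is the only subtle part of that loop. A sketch of the claim query I’d use, assuming the scrape_runs table from the migration above — `for update skip locked` keeps two workers from grabbing the same row:

```sql
-- Claim the oldest queued run atomically; rows another worker has
-- already locked are skipped instead of blocking.
update public.scrape_runs
set status = 'running', started_at = now()
where id = (
  select id
  from public.scrape_runs
  where status = 'queued'
  order by enqueued_at
  limit 1
  for update skip locked
)
returning id, source;
```

If this returns zero rows, the worker sleeps and polls again. The advisory lock per source still applies on top, since a queued row and a running scrape are different things.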
Spent 4 hours trying to do this “inside Next”. Most of it was wrong.
Results
Before the lock + upsert changes, I’d see 20–60 duplicate rows per week depending on how many overlaps happened. I measured it by counting identical canonical_url_hash values.
After pushing uniqueness into Postgres and wrapping each source scrape in an advisory lock, duplicates dropped to 0 in the last 14 days.
Cron runs also stabilized. My daily scrape adds 200+ jobs, and the run time is now consistently 6–9 minutes instead of randomly spiking past 15.
The best part: fewer wasted requests. Rate limiting stopped being a firefight.
Key takeaways
- Cron endpoints must be idempotent. If they can’t be, they’ll break.
- Advisory locks are the simplest “only one runner” switch I’ve used.
- Deduplication belongs in a unique index, not in app code.
- Normalize URLs before hashing. UTM params will wreck you.
- Rate limit per source. One hostile site shouldn’t slow everything.
Closing
If you’re scraping jobs (or any listings), how are you preventing overlapping runs in production: advisory locks, a queue, or a dedicated job system like Temporal/BullMQ?