Sathish

Next.js job board: reliable scrapes with pg locks

  • I stopped duplicate cron runs with Postgres advisory locks.
  • I moved “job already exists” into one SQL upsert.
  • I rate-limit per source, not globally.
  • I keep Next.js out of the scraping path entirely.

Context

I’m building a job board for Psychiatric Mental Health Nurse Practitioners. Next.js 14 on Vercel. Supabase (Postgres) behind it.

The board has 8,000+ active listings across 2,000+ companies. I scrape 200+ jobs daily from multiple sources.

My first version was naive. A Vercel Cron hit an API route. The route scraped. Then wrote rows. It worked… until it didn’t.

One day I had 3 copies of the same job. Another day I missed an entire source. Brutal.

The core issue wasn’t “scraping is hard”. It was “cron isn’t a single-threaded program”. Retries happen. Overlaps happen. Two regions happen. And “check-then-insert” is a race.

So I rebuilt the pipeline around the database. Postgres decides what runs. Postgres decides what’s new.

1) I don’t trust cron. I make Postgres gate it.

Vercel Cron is solid. But it can overlap.

I saw it in logs. Same minute. Same endpoint. Two executions.

If both runs scrape the same source, they both insert. Even with dedupe later, I still waste requests and hit rate limits.

I fixed it with a Postgres advisory lock. It’s dumb. It’s perfect.

  • One lock per source.
  • If I can’t get the lock, I exit.
  • No retries. No waiting.
// lib/locks.ts
// Requires: npm i pg
import { Client } from "pg";

// Stable 32-bit hash for advisory locks
function hash32(input: string) {
  let h = 2166136261;
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i);
    h = Math.imul(h, 16777619);
  }
  return h | 0;
}

export async function withAdvisoryLock<T>(
  client: Client,
  key: string,
  fn: () => Promise<T>
) {
  const lockId = hash32(key);
  const { rows } = await client.query(
    "select pg_try_advisory_lock($1) as locked",
    [lockId]
  );

  if (!rows[0]?.locked) return { ran: false as const };

  try {
    const result = await fn();
    return { ran: true as const, result };
  } finally {
    await client.query("select pg_advisory_unlock($1)", [lockId]);
  }
}

And yeah, I’m using pg directly for this job. Not the Supabase client.

I want one connection. One transaction if needed. Fewer moving parts.
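The skip-don’t-wait behavior is easy to sanity-check without a database. A minimal sketch (assumption: an in-memory `Set` stands in for `pg_try_advisory_lock`/`pg_advisory_unlock` purely to show the control flow; the real gate is Postgres):

```typescript
// In-memory stand-in for pg_try_advisory_lock / pg_advisory_unlock.
// This only demonstrates the control flow; the real gate is Postgres.
const held = new Set<number>();

async function withFakeLock<T>(
  lockId: number,
  fn: () => Promise<T>
): Promise<{ ran: false } | { ran: true; result: T }> {
  if (held.has(lockId)) return { ran: false }; // someone else holds it: bail
  held.add(lockId);
  try {
    return { ran: true, result: await fn() };
  } finally {
    held.delete(lockId); // always release, even if fn throws
  }
}

// Two "cron runs" racing on the same source id: only one does work.
export async function demo() {
  const first = withFakeLock(42, async () => {
    await new Promise((r) => setTimeout(r, 25)); // pretend to scrape
    return "scraped";
  });
  const second = await withFakeLock(42, async () => "scraped");
  return { first: await first, second };
}
```

The second caller gets `{ ran: false }` immediately, which is exactly the “no retries, no waiting” behavior above.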

2) I stopped “check-then-insert”. Upsert or regret.

My original flow:

  • query by URL
  • if not found, insert

Two concurrent runs = both don’t find it = both insert. Classic.
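The interleaving is easy to reproduce in miniature (assumption: an array stands in for the jobs table, and the `await` marks where a real query round-trip lets the other run slip in; the URL is made up):

```typescript
// Toy reproduction of the check-then-insert race.
// `rows` stands in for the jobs table; the await is where a real
// network round-trip lets the other run interleave.
const rows: string[] = [];

async function checkThenInsert(url: string) {
  const exists = rows.includes(url); // SELECT ... WHERE url = $1
  await Promise.resolve();           // both runs pass the check before either writes
  if (!exists) rows.push(url);       // INSERT
}

export async function race() {
  const url = "https://example.com/jobs/123"; // hypothetical URL
  await Promise.all([checkThenInsert(url), checkThenInsert(url)]);
  return rows.filter((r) => r === url).length; // duplicate: inserted twice
}
```

Both runs see “not found” before either writes, so the app-level check can never be airtight.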

I pushed dedupe into the schema.

I store a canonical_url_hash and make it unique. Hashing matters because URLs get long. Postgres indexes long text fine, but I prefer a fixed-width key.

Also: sources lie. UTM params. Tracking junk. Same job, different URL.

So I normalize first. Then hash.

-- migrations/20260224_jobs_dedupe.sql
alter table if exists public.jobs
  add column if not exists canonical_url text,
  add column if not exists canonical_url_hash bytea;

-- Unique per canonical URL
create unique index if not exists jobs_canonical_url_hash_uidx
  on public.jobs (canonical_url_hash);

Then the upsert. One statement. No race.

// lib/upsertJob.ts
import crypto from "node:crypto";
import { Client } from "pg";

function canonicalizeUrl(raw: string) {
  const u = new URL(raw);
  // Drop common tracking params. Add your own.
  [
    "utm_source",
    "utm_medium",
    "utm_campaign",
    "utm_term",
    "utm_content",
    "gh_src",
  ].forEach((p) => u.searchParams.delete(p));
  u.hash = "";
  // Normalize trailing slash
  u.pathname = u.pathname.replace(/\/$/, "");
  return u.toString();
}

export async function upsertJob(client: Client, job: {
  title: string;
  company: string;
  location: string;
  source: string;
  url: string;
  posted_at: string; // ISO
}) {
  const canonical = canonicalizeUrl(job.url);
  const hash = crypto.createHash("sha256").update(canonical).digest();

  const sql = `
    insert into public.jobs
      (title, company, location, source, url, canonical_url, canonical_url_hash, posted_at)
    values
      ($1,$2,$3,$4,$5,$6,$7,$8)
    on conflict (canonical_url_hash)
    do update set
      title = excluded.title,
      company = excluded.company,
      location = excluded.location,
      source = excluded.source,
      url = excluded.url,
      posted_at = greatest(public.jobs.posted_at, excluded.posted_at)
    returning id;
  `;

  const { rows } = await client.query(sql, [
    job.title,
    job.company,
    job.location,
    job.source,
    job.url,
    canonical,
    hash,
    job.posted_at,
  ]);

  return rows[0].id as string;
}

That greatest() line saved me later.

Some sources “update” the same job daily. If I blindly overwrite posted_at, sorting turns into nonsense.

3) I rate-limit per source. One bad source shouldn’t slow the rest.

I made this mistake.

I had a global limiter. 1 request / second.

It kept me safe. It also meant one slow source made the entire run take forever. Then the cron window overlapped. Then I got double runs. Again.

So now each source gets its own limiter.

I’m not doing anything fancy. Just Bottleneck.

// lib/limiters.ts
// Requires: npm i bottleneck
import Bottleneck from "bottleneck";

export const limiters = {
  // Aggressive sources
  indeed: new Bottleneck({ minTime: 1200, maxConcurrent: 1 }),
  linkedin: new Bottleneck({ minTime: 1500, maxConcurrent: 1 }),

  // Nicer sources
  greenhouse: new Bottleneck({ minTime: 400, maxConcurrent: 2 }),
  lever: new Bottleneck({ minTime: 400, maxConcurrent: 2 }),
};

export async function limited<T>(
  source: keyof typeof limiters,
  fn: () => Promise<T>
) {
  const limiter = limiters[source];
  if (!limiter) return fn();
  return limiter.schedule(fn);
}

Then in the scraper:

  • parse listing pages
  • for each job detail, call limited(source, ...)
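That loop looks roughly like this (assumptions: `scrapeDetails` and the `fetched:` results are placeholders, and the limiter is a tiny stand-in for Bottleneck’s `schedule()` so the sketch runs without the dependency):

```typescript
// Minimal stand-in for Bottleneck's schedule(): space calls out by
// minTime ms, one at a time. Bottleneck does this (and more) for real.
function makeLimiter(minTime: number) {
  let next = 0;
  return async <T>(fn: () => Promise<T>): Promise<T> => {
    const wait = Math.max(0, next - Date.now());
    next = Date.now() + wait + minTime;
    await new Promise((r) => setTimeout(r, wait));
    return fn();
  };
}

// Hypothetical detail-fetch loop for one source.
export async function scrapeDetails(urls: string[]) {
  const limit = makeLimiter(20); // 20ms for the demo; ~1200ms for a real source
  const results: string[] = [];
  for (const url of urls) {
    // In the real pipeline this is limited(source, () => fetchDetail(url))
    results.push(await limit(async () => `fetched:${url}`));
  }
  return results;
}
```

One limiter per source means a slow host only stretches its own queue.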

The win wasn’t speed. It was predictability.

My cron run time stopped oscillating between 2 minutes and 17 minutes.

4) I keep scraping out of Next.js routes. It runs as a worker.

I started with app/api/scrape/route.ts.

Bad idea.

A Next.js route is a request/response handler. It’s not a job runner.

  • Timeouts are real.
  • Memory limits are real.
  • Retries happen.
  • Logs get noisy fast.

So now the cron calls a tiny endpoint that enqueues work. The worker does the scraping.

I’m using Supabase + Postgres, so I store a queue table.

-- migrations/20260224_queue.sql
create table if not exists public.scrape_runs (
  id bigserial primary key,
  source text not null,
  status text not null default 'queued',
  enqueued_at timestamptz not null default now(),
  started_at timestamptz,
  finished_at timestamptz,
  error text
);

create index if not exists scrape_runs_status_idx
  on public.scrape_runs (status, enqueued_at);

The Next.js route only inserts rows.

// app/api/cron/enqueue/route.ts
import { NextResponse } from "next/server";
import { Client } from "pg";

export async function GET(req: Request) {
  // Minimal auth. Don't ship this publicly.
  const token = new URL(req.url).searchParams.get("token");
  if (token !== process.env.CRON_TOKEN) {
    return NextResponse.json({ ok: false }, { status: 401 });
  }

  const client = new Client({ connectionString: process.env.DATABASE_URL });
  await client.connect();

  const sources = ["greenhouse", "lever", "indeed", "linkedin"];
  try {
    await client.query(
      `insert into public.scrape_runs (source) select unnest($1::text[])`,
      [sources]
    );
  } finally {
    await client.end();
  }

  return NextResponse.json({ ok: true, queued: sources.length });
}

And the worker:

  • pulls one queued run
  • takes advisory lock for that source
  • marks started
  • scrapes + upserts
  • marks finished

I run it on a separate process. Locally: node worker.js. In prod: a scheduled job runner.

Spent 4 hours trying to do this “inside Next”. Most of it was wrong.
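The “pulls one queued run” step is the part worth getting right: two workers must never claim the same row. The standard Postgres move is FOR UPDATE SKIP LOCKED. A sketch of the claim query (assumption: this is my reconstruction of that step, not the worker’s exact SQL):

```sql
-- Claim exactly one queued run; concurrent workers skip rows
-- that another transaction has already locked.
update public.scrape_runs
set status = 'running', started_at = now()
where id = (
  select id
  from public.scrape_runs
  where status = 'queued'
  order by enqueued_at
  for update skip locked
  limit 1
)
returning id, source;
```

If the subquery finds nothing unlocked, the update touches zero rows and the worker just sleeps.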

Results

Before the lock + upsert changes, I’d see 20–60 duplicate rows per week depending on how many overlaps happened. I measured it by counting identical canonical_url_hash values.

After pushing uniqueness into Postgres and wrapping each source scrape in an advisory lock, duplicates dropped to 0 in the last 14 days.

Cron runs also stabilized. My daily scrape adds 200+ jobs, and the run time is now consistently 6–9 minutes instead of randomly spiking past 15.

The best part: fewer wasted requests. Rate limiting stopped being a firefight.

Key takeaways

  • Cron endpoints must be idempotent. If they can’t be, they’ll break.
  • Advisory locks are the simplest “only one runner” switch I’ve used.
  • Deduplication belongs in a unique index, not in app code.
  • Normalize URLs before hashing. UTM params will wreck you.
  • Rate limit per source. One hostile site shouldn’t slow everything.

Closing

If you’re scraping jobs (or any listings), how are you preventing overlapping runs in production: advisory locks, a queue, or a dedicated job system like Temporal/BullMQ?
