DEV Community

Sathish

Next.js job board: reliable scrapes with leases

  • I stopped double-scraping jobs with a Postgres lease table.
  • I made cron runs idempotent with a single SQL function.
  • I got rid of “why is this job missing?” by logging per-source runs.
  • Works on Next.js 14 + Supabase + Vercel cron.

Context

I’m building a job board for Psychiatric Mental Health Nurse Practitioners.

It scrapes 200+ jobs daily. Multiple sources. Different HTML. Different APIs. The usual pain.

At ~8,000 active listings and ~2,000 companies, one bad cron day hurts. Duplicates show up. Or worse — jobs silently don’t.

My first version was “just run scraping in a cron route.” Brutal.

Vercel cron can overlap. Deploys can overlap. And if a source gets slow, the next run starts anyway. I spent 4 hours “fixing dedupe” and most of it was wrong because the real issue was concurrency.

So I moved coordination into Postgres. Not Redis. Not queues. Just SQL.

1) I stopped trusting cron timing. I added a lease.

Two cron invocations can run at the same minute.

And if you redeploy while a scrape is running? You can get a second run. Same source. Same time window.

I needed a lock. But I also needed it to expire, because scrapers crash.

So: a lease table in Postgres.

-- supabase/migrations/202604280001_scrape_leases.sql

create table if not exists public.scrape_leases (
  source text primary key,
  lease_owner uuid not null,
  lease_until timestamptz not null,
  updated_at timestamptz not null default now()
);

-- Helps cleanup queries and debugging
create index if not exists scrape_leases_until_idx
  on public.scrape_leases (lease_until);

A lease is just “I own scraping source until lease_until”.

No background workers. No magic.

And yeah, I tried pg_advisory_lock first. It works. But advisory locks are tied to the database session, and once a serverless function goes through a connection pooler you can't be sure which session you're actually holding. I wanted something I can inspect with a normal select *.
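That inspectability is the main reason I went with a table. When something looks stuck, I can check lease state directly — a sketch of the kind of query I mean, using the columns from the migration above:

```sql
select source, lease_owner, lease_until,
       lease_until > now() as currently_held
from public.scrape_leases
order by lease_until desc;
```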

2) I made the lock acquisition a single SQL function

Doing “select then insert/update” from Node is where race conditions breed.

So I pushed it into a Postgres function.

One round-trip. Atomic.

-- supabase/migrations/202604280002_try_acquire_lease.sql

create or replace function public.try_acquire_scrape_lease(
  p_source text,
  p_owner uuid,
  p_ttl_seconds int
)
returns boolean
language plpgsql
as $$
declare
  v_now timestamptz := now();
  v_until timestamptz := v_now + make_interval(secs => p_ttl_seconds);
begin
  -- Insert if missing
  insert into public.scrape_leases (source, lease_owner, lease_until, updated_at)
  values (p_source, p_owner, v_until, v_now)
  on conflict (source) do update
    set lease_owner = excluded.lease_owner,
        lease_until = excluded.lease_until,
        updated_at = excluded.updated_at
  where public.scrape_leases.lease_until < v_now
     or public.scrape_leases.lease_owner = p_owner;

  -- If we own it now, we acquired it.
  return exists (
    select 1
    from public.scrape_leases l
    where l.source = p_source
      and l.lease_owner = p_owner
      and l.lease_until >= v_now
  );
end;
$$;

That where clause — lease_until < now() or lease_owner = p_owner — is the part that matters.

It lets the same owner renew its own lease. But it blocks every other run until the lease expires.

I learned the hard way that “lock once at the start” isn’t enough. A slow source can run longer than your TTL. Then the next cron steals the lease and now two runs are writing.

So I renew.
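A sketch of what that renewal looks like in practice — this helper and its names are mine, not from the repo, and it assumes an acquire function like the one in the next section:

```typescript
// Hypothetical helper: run a scrape while re-acquiring the lease on an
// interval, so a slow source can safely outlive a single TTL window.
async function withLeaseRenewal<T>(
  renew: () => Promise<boolean>, // e.g. () => acquire(source, owner, 8 * 60)
  intervalMs: number,
  work: () => Promise<T>
): Promise<T> {
  const timer = setInterval(() => {
    // A failed renewal shouldn't crash the scrape mid-flight;
    // log it and let the next tick retry.
    renew().catch((err) => console.error("lease renewal failed", err));
  }, intervalMs);
  try {
    return await work();
  } finally {
    clearInterval(timer); // always stop renewing, even on error
  }
}
```

In real code I renew roughly every minute, well inside the 8-minute TTL, so one missed renewal doesn't lose the lease.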

3) I wired it into a Next.js 14 cron route

This is App Router.

I use a single cron route and loop over the sources.

Each source gets a lease owner UUID for that invocation.

// app/api/cron/scrape/route.ts
import { NextResponse } from "next/server";
import { createClient } from "@supabase/supabase-js";
import { randomUUID } from "crypto";

export const runtime = "nodejs";

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!,
  { auth: { persistSession: false } }
);

const SOURCES = ["indeed", "ziprecruiter", "hospital_site"] as const;

async function acquire(source: string, owner: string, ttlSeconds: number) {
  const { data, error } = await supabase.rpc("try_acquire_scrape_lease", {
    p_source: source,
    p_owner: owner,
    p_ttl_seconds: ttlSeconds,
  });
  if (error) throw error;
  return Boolean(data);
}

export async function GET(req: Request) {
  const auth = req.headers.get("authorization");
  if (auth !== `Bearer ${process.env.CRON_SECRET}`) {
    return NextResponse.json({ ok: false }, { status: 401 });
  }

  const owner = randomUUID();
  const results: Record<string, string> = {};

  for (const source of SOURCES) {
    const gotLease = await acquire(source, owner, 8 * 60); // 8 minutes
    if (!gotLease) {
      results[source] = "skipped (lease held)";
      continue;
    }

    // Do the real work.
    // Important: renew inside the scraper for long runs.
    await scrapeSource(source, owner);
    results[source] = "ok";
  }

  return NextResponse.json({ ok: true, owner, results });
}

async function scrapeSource(source: string, owner: string) {
  // placeholder for your per-source scraper.
  // I keep these in /lib/scrapers/*
  await new Promise((r) => setTimeout(r, 50));

  // Example renewal for long jobs. In real code I do this every ~60s.
  await acquire(source, owner, 8 * 60);
}

This isn’t fancy.

But it stopped the overlap bugs immediately.
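For completeness, the schedule itself lives in vercel.json. This is the standard Vercel cron shape — the path matches the route above, and the every-30-minutes cadence is my setup, not a requirement. Vercel sends Authorization: Bearer <CRON_SECRET> automatically when the CRON_SECRET env var is set, which is what the auth check in the route expects:

```json
{
  "crons": [
    {
      "path": "/api/cron/scrape",
      "schedule": "*/30 * * * *"
    }
  ]
}
```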

Also: don’t forget runtime = "nodejs" if you import randomUUID from "crypto" like I do above.

I hit ReferenceError: crypto is not defined once when I forgot and it ran on edge. Wasted 30 minutes.

4) I logged every run. Because “skipped” isn’t enough.

After I added leases, I still had the “where did the jobs go?” problem.

A source might be skipped for 6 hours because another run grabbed the lease and then crashed before logging anything useful.

So I added a run log table.

One row per source run. Status + counts + error.

-- supabase/migrations/202604280003_scrape_runs.sql

create table if not exists public.scrape_runs (
  id bigserial primary key,
  source text not null,
  owner uuid not null,
  started_at timestamptz not null default now(),
  finished_at timestamptz,
  status text not null default 'running', -- running|ok|error|skipped
  fetched_count int not null default 0,
  upserted_count int not null default 0,
  error text
);

create index if not exists scrape_runs_source_started_idx
  on public.scrape_runs (source, started_at desc);

And in the route, I insert/update around the scrape.

// lib/scrapeRunLog.ts
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!,
  { auth: { persistSession: false } }
);

export async function startRun(source: string, owner: string) {
  const { data, error } = await supabase
    .from("scrape_runs")
    .insert({ source, owner, status: "running" })
    .select("id")
    .single();
  if (error) throw error;
  return data.id as number;
}

export async function finishRun(params: {
  id: number;
  status: "ok" | "error" | "skipped";
  fetchedCount?: number;
  upsertedCount?: number;
  errorText?: string;
}) {
  const { error } = await supabase
    .from("scrape_runs")
    .update({
      status: params.status,
      fetched_count: params.fetchedCount ?? 0,
      upserted_count: params.upsertedCount ?? 0,
      error: params.errorText ?? null,
      finished_at: new Date().toISOString(),
    })
    .eq("id", params.id);
  if (error) throw error;
}
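The wiring shape matters as much as the helpers: start a run, scrape, and always finish it — including on error. A sketch of how I'd wrap that (the wrapper and its names are mine; the logger interface mirrors lib/scrapeRunLog.ts above):

```typescript
// Hypothetical wrapper around startRun/finishRun-style logging.
type RunLogger = {
  start: (source: string, owner: string) => Promise<number>;
  finish: (p: {
    id: number;
    status: "ok" | "error";
    fetchedCount?: number;
    upsertedCount?: number;
    errorText?: string;
  }) => Promise<void>;
};

async function scrapeWithLog(
  source: string,
  owner: string,
  log: RunLogger,
  scrape: () => Promise<{ fetched: number; upserted: number }>
): Promise<{ fetched: number; upserted: number }> {
  const id = await log.start(source, owner);
  try {
    const counts = await scrape();
    await log.finish({
      id,
      status: "ok",
      fetchedCount: counts.fetched,
      upsertedCount: counts.upserted,
    });
    return counts;
  } catch (err) {
    // Record the failure before rethrowing, so a failed run ends with
    // status "error" and a message instead of sitting in "running".
    await log.finish({
      id,
      status: "error",
      errorText: err instanceof Error ? err.message : String(err),
    });
    throw err;
  }
}
```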

Now when I wake up and something’s off, I query:

  • last run per source
  • last error text
  • “skipped” streaks

It’s boring. It’s also the difference between guessing and knowing.
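Those checks are one-liners against scrape_runs. Roughly what I run (the source value is just an example):

```sql
-- Last run per source
select distinct on (source)
       source, status, started_at, finished_at, error
from public.scrape_runs
order by source, started_at desc;

-- Last 10 errors for a flaky source
select started_at, error
from public.scrape_runs
where source = 'indeed'
  and status = 'error'
order by started_at desc
limit 10;
```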

Results

Before leases, I saw overlapping runs a few times a week. Real duplicates. Not just “same job from two sources” duplicates — literal same source, same job, inserted twice because two scrapers ran concurrently.

After shipping the lease + run log, the cron route ran 96 times over 48 hours (every 30 minutes). 0 overlaps per source. 0 double inserts from concurrency.

I also got faster debugging. When a source started failing, I had the last 10 error strings in scrape_runs instead of scanning Vercel logs for “Error: socket hang up”.

The board stayed steady at 8,000+ active listings without surprise spikes.

Key takeaways

  • Don’t coordinate scrapes in Node. Coordinate in Postgres.
  • Use a lease with TTL. Scrapers crash.
  • Make lease acquisition one SQL function. One round-trip. Atomic.
  • Renew leases during long scrapes, not just at the start.
  • Log every source run with counts + error text, or you’ll debug from vibes.

Closing

If you’re scraping multiple sources, concurrency bugs show up late — right when you’ve got enough data that cleaning it is miserable.

Do you prefer Postgres leases like this, or do you use advisory locks (pg_advisory_lock) for cron coordination in serverless environments?
