- I dedupe job posts in Postgres, not in JS.
- I use a fingerprint table + unique indexes.
- I keep one “winner” row per job.
- I wired it into Supabase + a cron scraper.
Context
I’m building a job board for PMHNP roles. Next.js 14. Supabase. Postgres. Vercel.
The data is messy. Always.
I scrape 200+ jobs daily from multiple sources. I’m sitting on 8,000+ active listings across 2,000+ companies. Duplicates show up in dumb ways: RN PMHNP vs PMHNP (Remote), same job with two tracking URLs, same description but different whitespace, and my favorite — “apply on company site” pages that redirect twice.
I tried deduping in application code first. Brutal. I spent four hours on it, and most of the matches it flagged were wrong.
The fix was moving dedupe into Postgres. With real constraints. And a small “fingerprint” layer so my scraper can stay simple.
1) I store a fingerprint. Not a “maybe duplicate” flag.
I need a stable key. Something I can compute every run.
But I can’t trust job IDs from sources. They collide. Or disappear.
So I compute a fingerprint from the fields that usually identify a job: company + title + location + canonical apply URL. Then I normalize hard.
This lives in the DB as a generated column. That way every writer gets the same logic.
```sql
-- 001_jobs_fingerprint.sql
-- Run in Supabase SQL editor
create extension if not exists pgcrypto;

-- Normalize helper: lower + trim + collapse whitespace
create or replace function public.norm_text(input text)
returns text
language sql
immutable
as $$
  select regexp_replace(lower(trim(coalesce(input, ''))), '\s+', ' ', 'g');
$$;

alter table public.jobs
  add column if not exists apply_url_canonical text;

alter table public.jobs
  add column if not exists job_fingerprint text
  generated always as (
    encode(
      digest(
        public.norm_text(company_name) || '|' ||
        public.norm_text(title) || '|' ||
        public.norm_text(location) || '|' ||
        public.norm_text(apply_url_canonical),
        'sha256'
      ),
      'hex'
    )
  ) stored;

create index if not exists jobs_fingerprint_idx
  on public.jobs (job_fingerprint);
```
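If you want to sanity-check fingerprints on the scraper side before hitting the DB, the same logic is easy to mirror in TypeScript. This is a hypothetical helper, not part of the schema — it approximately reproduces `norm_text` and the generated column (exotic whitespace edge cases may differ slightly from Postgres's `trim`):

```typescript
import { createHash } from "node:crypto";

// Approximate mirror of public.norm_text: trim + lower + collapse whitespace
function normText(input: string | null): string {
  return (input ?? "").trim().toLowerCase().replace(/\s+/g, " ");
}

// Approximate mirror of the generated job_fingerprint column
function jobFingerprint(
  company: string | null,
  title: string | null,
  location: string | null,
  applyUrlCanonical: string | null
): string {
  const key = [company, title, location, applyUrlCanonical]
    .map(normText)
    .join("|");
  return createHash("sha256").update(key).digest("hex");
}
```

Handy for logging "this run produced N distinct fingerprints" before the batch ever reaches Postgres.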
That apply_url_canonical part matters.
If you hash raw URLs, you’re cooked. Tracking params will fork the fingerprint forever.
2) I canonicalize URLs before insert. Or I pay later.
I canonicalize in the scraper. Not in the UI. Not on read.
I made this mistake once. I was normalizing on the client. Then my DB had 14 versions of the same job.
This is the exact helper I use in my Next.js ingestion route.
```typescript
// lib/canonicalize-url.ts
export function canonicalizeUrl(raw: string | null): string | null {
  if (!raw) return null;
  try {
    const u = new URL(raw);

    // Drop common tracking params
    const drop = new Set([
      "utm_source",
      "utm_medium",
      "utm_campaign",
      "utm_term",
      "utm_content",
      "gclid",
      "fbclid",
      "mc_cid",
      "mc_eid",
    ]);

    // Delete known tracking keys
    for (const key of [...u.searchParams.keys()]) {
      if (drop.has(key)) u.searchParams.delete(key);
    }

    // Normalize hostname, drop the fragment, remove any trailing slash
    u.hostname = u.hostname.toLowerCase();
    u.hash = "";
    const s = u.toString();
    return s.endsWith("/") ? s.slice(0, -1) : s;
  } catch {
    return raw.trim() || null;
  }
}
```
Not perfect. Still worth it.
One thing that bit me — some sources wrap apply links in redirectors. If you can resolve redirects during scraping, do it. If you can’t, at least canonicalize what you have.
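Resolving redirects can be a one-liner if your runtime has `fetch` (Node 18+). This is a hypothetical helper, not from my codebase — `resolveFinalUrl` follows the chain and falls back to the input on any failure. Note some hosts reject HEAD requests, in which case a GET is the safer (heavier) option:

```typescript
// Hypothetical helper: follow redirect chains at scrape time so the
// canonical URL is the real destination, not a tracking hop.
// Assumes Node 18+ global fetch. Falls back to the input on any failure.
async function resolveFinalUrl(raw: string): Promise<string> {
  try {
    // HEAD keeps it cheap; swap to GET if a source rejects HEAD.
    const res = await fetch(raw, { method: "HEAD", redirect: "follow" });
    return res.url || raw; // res.url reflects the final post-redirect URL
  } catch {
    return raw; // invalid URL, network error, blocked HEAD, etc.
  }
}
```

Run it before `canonicalizeUrl`, not after, so tracking params on the final destination still get stripped.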
3) I enforce uniqueness with a separate table
I don’t want a unique index on the whole jobs row. Jobs change.
Descriptions update. Salary ranges appear later. “Remote” flips to “Hybrid”.
So I keep a small job_keys table that’s basically: fingerprint → winning job id.
That gives me a stable “dedupe gate” without freezing the job content.
```sql
-- 002_job_keys.sql
create table if not exists public.job_keys (
  job_fingerprint text primary key,
  job_id uuid not null references public.jobs(id) on delete cascade,
  first_seen_at timestamptz not null default now(),
  last_seen_at timestamptz not null default now()
);

create index if not exists job_keys_job_id_idx on public.job_keys(job_id);
```
Now the ingest logic becomes: insert a job row, then attempt to claim the fingerprint.
If I can’t claim it, I delete the loser row. Simple. Fast.
4) My ingest endpoint is “insert, claim, cleanup”
This runs from my scraper cron. The endpoint receives normalized fields.
And yeah — I rate limit hard. Otherwise I’ll DDoS myself with retries.
Here’s the core logic. This is real code. It works with Supabase JS v2.
```typescript
// app/api/ingest/job/route.ts
import { NextResponse } from "next/server";
import { createClient } from "@supabase/supabase-js";
import { canonicalizeUrl } from "@/lib/canonicalize-url";

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY! // server-only
);

type IngestJob = {
  company_name: string;
  title: string;
  location: string;
  apply_url: string | null;
  description: string | null;
  source: string;
  scraped_at: string; // ISO
};

export async function POST(req: Request) {
  const body = (await req.json()) as IngestJob;
  const apply_url_canonical = canonicalizeUrl(body.apply_url);

  // 1) Insert candidate job row
  const { data: job, error: insErr } = await supabase
    .from("jobs")
    .insert({
      company_name: body.company_name,
      title: body.title,
      location: body.location,
      apply_url: body.apply_url,
      apply_url_canonical,
      description: body.description,
      source: body.source,
      scraped_at: body.scraped_at,
      last_seen_at: body.scraped_at,
    })
    .select("id, job_fingerprint")
    .single();

  if (insErr) {
    return NextResponse.json({ error: insErr.message }, { status: 500 });
  }

  // 2) Try to claim fingerprint in job_keys
  const { error: keyErr } = await supabase.from("job_keys").insert({
    job_fingerprint: job.job_fingerprint,
    job_id: job.id,
  });

  if (!keyErr) {
    // Winner
    return NextResponse.json({ status: "inserted", job_id: job.id });
  }

  // Duplicate fingerprint. Keep existing winner.
  // 3) Update last_seen_at on the key row, then delete loser job row.
  await supabase
    .from("job_keys")
    .update({ last_seen_at: body.scraped_at })
    .eq("job_fingerprint", job.job_fingerprint);

  await supabase.from("jobs").delete().eq("id", job.id);

  return NextResponse.json({ status: "deduped" });
}
```
Two details.
First: I’m using the service role key. Server-only. Never in the browser.
Second: the “claim” insert relies on the primary key constraint in job_keys. That’s the whole trick.
If you’re wondering about error handling: on a duplicate, Supabase surfaces the Postgres unique-violation error (SQLSTATE 23505). I don’t parse the message. I just treat “insert failed” as “duplicate”.
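If you do want to be stricter than “insert failed means duplicate”, supabase-js exposes the Postgres error code on the error object, so you can separate real failures from expected ones. A minimal sketch — `isUniqueViolation` is a hypothetical helper, not from my route:

```typescript
// Postgres tags duplicate-key errors with SQLSTATE 23505 ("unique_violation"),
// and supabase-js surfaces it on the error's `code` field.
type PgError = { code?: string; message?: string };

function isUniqueViolation(err: PgError | null): boolean {
  return err?.code === "23505";
}
```

In the route that would look like: `if (keyErr && !isUniqueViolation(keyErr)) return NextResponse.json({ error: keyErr.message }, { status: 500 });` — everything else falls through to the dedupe branch.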
5) I still merge fields. But only into the winner.
Deleting the loser is fine. Until the loser has better data.
This happened with salary.
Source A had no salary. Source B did. My dedupe deleted the salary row. Nice.
So I added a merge step: if the fingerprint already exists, I update the winner with any missing fields.
I do it in SQL so it’s one round-trip.
```sql
-- 003_merge_into_winner.sql
-- Merge non-null fields from an incoming row into the winner row.
create or replace function public.merge_job_into_winner(
  p_fingerprint text,
  p_title text,
  p_location text,
  p_apply_url text,
  p_apply_url_canonical text,
  p_description text,
  p_source text,
  p_seen_at timestamptz
)
returns uuid
language plpgsql
as $$
declare
  v_winner uuid;
begin
  select job_id into v_winner
  from public.job_keys
  where job_fingerprint = p_fingerprint;

  if v_winner is null then
    raise exception 'winner_not_found';
  end if;

  update public.jobs j
  set
    title = coalesce(j.title, p_title),
    location = coalesce(j.location, p_location),
    apply_url = coalesce(j.apply_url, p_apply_url),
    apply_url_canonical = coalesce(j.apply_url_canonical, p_apply_url_canonical),
    description = coalesce(j.description, p_description),
    last_seen_at = greatest(coalesce(j.last_seen_at, p_seen_at), p_seen_at),
    source = j.source -- keep original source
  where j.id = v_winner;

  update public.job_keys
  set last_seen_at = greatest(last_seen_at, p_seen_at)
  where job_fingerprint = p_fingerprint;

  return v_winner;
end;
$$;
```
Then in the API route, on duplicate, I call this function instead of just deleting and moving on.
I keep the fingerprinter as the “router”. The merge function is the “data repair”.
Not fancy. Just reliable.
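On the duplicate branch, the route reaches this function through supabase-js’s standard `rpc` call. A sketch of the argument mapping — `mergeArgs` is a hypothetical helper, but the parameter names mirror `merge_job_into_winner` above and the `IngestJob` shape comes from the route:

```typescript
type IngestJob = {
  company_name: string;
  title: string;
  location: string;
  apply_url: string | null;
  description: string | null;
  source: string;
  scraped_at: string; // ISO
};

// Map an ingest payload onto merge_job_into_winner's parameters.
function mergeArgs(body: IngestJob, fingerprint: string, canonical: string | null) {
  return {
    p_fingerprint: fingerprint,
    p_title: body.title,
    p_location: body.location,
    p_apply_url: body.apply_url,
    p_apply_url_canonical: canonical,
    p_description: body.description,
    p_source: body.source,
    p_seen_at: body.scraped_at,
  };
}

// In the route, on the duplicate branch:
// await supabase.rpc("merge_job_into_winner",
//   mergeArgs(body, job.job_fingerprint, apply_url_canonical));
```

Then the loser row delete stays exactly where it was — the merge just runs first.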
Results
Before this, my “active listings” number was lying. Badly.
On a sample day with 237 scraped jobs, I inserted 237 rows and later found 61 were duplicates. That’s 61 junk rows in one run.
After the fingerprint + job_keys gate, that same day ended with 176 new unique jobs and 61 deduped. DB row count stayed stable. Search results stopped showing the same company/title combo twice.
I’m now holding 8,214 active listings from 2,043 companies, and the duplicates are mostly true near-duplicates (same job cross-posted with different locations). Those need a different rule.
Key takeaways
- Do dedupe with constraints. Not “best effort” JS.
- Hash normalized fields. Raw URLs will wreck you.
- Split “content” from “identity” using a tiny key table.
- On duplicates, merge into the winner. Don’t throw away better fields.
- Keep the ingest flow boring: insert → claim → merge/cleanup.
Closing
If you’re scraping jobs too: what do you fingerprint on?
Company + title + location + apply URL worked for me, but it still misses “same job, different city” cases.
Do you treat those as duplicates, or separate listings with shared detail pages?