Sathish
Next.js job board dedupe with Postgres upsert

  • I dedupe scraped jobs with a single Postgres unique key.
  • I normalize URLs + titles before hashing. Otherwise duplicates slip through.
  • I upsert in Supabase, then update “last_seen_at” safely.
  • I keep search fast with pg_trgm on title + company.

Context

I’m building a job board for Psychiatric Mental Health Nurse Practitioners.

It’s not tiny.
8,000+ active listings. 2,000+ companies.
And I scrape 200+ jobs daily from multiple sources.

The boring part hurts the most.
Deduplication.

Same job shows up with 6 URLs.
Or the URL’s the same, but tracking params differ.
Or the title changes from “PMHNP” to “Psych NP”.

I tried “just dedupe in code”.
Spent 4 hours on this. Most of it was wrong.
Race conditions. Partial failures. “Why are there 37 duplicates again?”

So I moved the dedupe boundary into Postgres.
One unique key. One upsert path. No drama.

1) I define “same job” as a stable fingerprint

If you don’t define this, you’re done.
You’ll keep arguing with yourself.

I started with source_url as the unique key.
Brutal.
A bunch of sources rotate URLs, add utm_* params, or redirect through tracking domains.

Now I compute a fingerprint from normalized fields.
Source + canonicalized URL + title + company + location.
Not perfect.
But stable.

Here’s the exact normalization I ship.

// lib/dedupe.ts
import crypto from "crypto";

export function canonicalizeUrl(input: string): string {
  const u = new URL(input);

  // Drop noisy params. Add more if your sources use them.
  const drop = new Set([
    "utm_source",
    "utm_medium",
    "utm_campaign",
    "utm_term",
    "utm_content",
    "gclid",
    "fbclid",
  ]);

  // Remove fragments. They cause fake duplicates.
  u.hash = "";

  // Remove known tracking params.
  for (const key of [...u.searchParams.keys()]) {
    if (drop.has(key)) u.searchParams.delete(key);
  }

  // Sort remaining params for stability.
  const sorted = new URLSearchParams(
    [...u.searchParams.entries()].sort(([a], [b]) => a.localeCompare(b))
  );
  u.search = sorted.toString() ? `?${sorted.toString()}` : "";

  // Normalize hostname + pathname.
  u.hostname = u.hostname.toLowerCase();
  u.pathname = u.pathname.replace(/\/+$/, "");

  return u.toString();
}

export function normalizeText(s: string): string {
  return s
    .trim()
    .toLowerCase()
    .replace(/\s+/g, " ")
    .replace(/[’']/g, "'");
}

export function jobFingerprint(input: {
  source: string;
  url: string;
  title: string;
  company: string;
  location: string;
}): string {
  const stable = [
    normalizeText(input.source),
    canonicalizeUrl(input.url),
    normalizeText(input.title),
    normalizeText(input.company),
    normalizeText(input.location),
  ].join("|");

  return crypto.createHash("sha256").update(stable).digest("hex");
}
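A quick standalone sanity check of the URL canonicalization (same logic inlined so it runs on its own; the URLs are made up):

```typescript
// Inlined copy of canonicalizeUrl from lib/dedupe.ts, so this runs standalone.
function canonicalizeUrl(input: string): string {
  const u = new URL(input);
  const drop = new Set([
    "utm_source", "utm_medium", "utm_campaign",
    "utm_term", "utm_content", "gclid", "fbclid",
  ]);
  u.hash = "";
  for (const key of [...u.searchParams.keys()]) {
    if (drop.has(key)) u.searchParams.delete(key);
  }
  const sorted = new URLSearchParams(
    [...u.searchParams.entries()].sort(([a], [b]) => a.localeCompare(b))
  );
  u.search = sorted.toString() ? `?${sorted.toString()}` : "";
  u.hostname = u.hostname.toLowerCase();
  u.pathname = u.pathname.replace(/\/+$/, "");
  return u.toString();
}

// Two scraped variants of the same listing collapse to one canonical URL.
const a = canonicalizeUrl("https://Example.com/jobs/123/?utm_source=x&gclid=abc");
const b = canonicalizeUrl("https://example.com/jobs/123?utm_campaign=y");
console.log(a === b); // true — both become https://example.com/jobs/123
```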

One thing that bit me — location.
“Remote - US” vs “Remote (United States)”.
I normalize whitespace and case, but I don’t try to geocode it.
If I over-normalize, I merge real distinct jobs.
That’s worse.

2) I let Postgres enforce uniqueness (not my code)

This is the real fix.
Everything else is a band-aid.

I store the fingerprint in a jobs table.
Then I enforce uniqueness with a constraint.

Supabase makes this easy, but you still need SQL.
I ran this in a migration.

-- db/migrations/20260310_jobs_dedupe.sql
create extension if not exists pgcrypto;
create extension if not exists pg_trgm;

create table if not exists public.jobs (
  id uuid primary key default gen_random_uuid(),
  source text not null,
  source_url text not null,
  canonical_url text not null,
  fingerprint text not null,
  title text not null,
  company text not null,
  location text not null,
  posted_at timestamptz null,
  first_seen_at timestamptz not null default now(),
  last_seen_at timestamptz not null default now(),
  is_active boolean not null default true
);

-- The whole point.
create unique index if not exists jobs_fingerprint_uniq
on public.jobs (fingerprint);

-- Helps lookup + listing pages.
create index if not exists jobs_last_seen_idx
on public.jobs (last_seen_at desc);

-- Search indexes. (Used later.)
create index if not exists jobs_title_trgm_idx
on public.jobs using gin (title gin_trgm_ops);
create index if not exists jobs_company_trgm_idx
on public.jobs using gin (company gin_trgm_ops);

I used to do “check then insert”.
Classic mistake.
Two scrapers run at once.
Both pass the check.
Then both insert.

Unique index fixes that.
Database says no.
Every time.
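Concretely, with the unique index in place, replaying the same insert is rejected at the database level. A sketch with hypothetical values:

```sql
-- First insert succeeds.
insert into public.jobs
  (source, source_url, canonical_url, fingerprint, title, company, location)
values
  ('boardA', 'https://boarda.com/jobs/1', 'https://boarda.com/jobs/1',
   'abc123', 'PMHNP - Outpatient', 'Acme Health', 'Remote - US');

-- Running the exact same insert again:
-- ERROR:  duplicate key value violates unique constraint "jobs_fingerprint_uniq"
```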

3) I upsert, then bump last_seen_at

Supabase’s .upsert() works fine.
But I still want deterministic behavior.

On conflict:

  • keep first_seen_at
  • update last_seen_at
  • update fields that drift (title, location, canonical_url)
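In raw SQL, those conflict rules look roughly like this (a sketch of the statement Supabase's .upsert() builds under the hood; $1..$8 are parameter placeholders):

```sql
insert into public.jobs
  (source, source_url, canonical_url, fingerprint,
   title, company, location, last_seen_at)
values ($1, $2, $3, $4, $5, $6, $7, $8)
on conflict (fingerprint) do update set
  title         = excluded.title,
  location      = excluded.location,
  canonical_url = excluded.canonical_url,
  last_seen_at  = excluded.last_seen_at;
-- first_seen_at isn't in the update list, so the original value survives.
```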

Also.
I don’t want to mark a job inactive just because one scrape failed.
So I only flip is_active via a separate “stale job” job.

This is the ingestion code I run from a server route.

// app/api/ingest/route.ts
import { NextResponse } from "next/server";
import { createClient } from "@supabase/supabase-js";
import { canonicalizeUrl, jobFingerprint } from "@/lib/dedupe";

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY! // server-only
);

type ScrapedJob = {
  source: string;
  sourceUrl: string;
  title: string;
  company: string;
  location: string;
  postedAt?: string;
};

export async function POST(req: Request) {
  const body = (await req.json()) as { jobs: ScrapedJob[] };

  const rows = body.jobs.map((j) => {
    const canonicalUrl = canonicalizeUrl(j.sourceUrl);
    const fingerprint = jobFingerprint({
      source: j.source,
      url: canonicalUrl,
      title: j.title,
      company: j.company,
      location: j.location,
    });

    return {
      source: j.source,
      source_url: j.sourceUrl,
      canonical_url: canonicalUrl,
      fingerprint,
      title: j.title,
      company: j.company,
      location: j.location,
      posted_at: j.postedAt ? new Date(j.postedAt).toISOString() : null,
      last_seen_at: new Date().toISOString(),
      is_active: true,
    };
  });

  // Upsert on fingerprint. Unique index enforces it.
  const { error } = await supabase
    .from("jobs")
    .upsert(rows, { onConflict: "fingerprint" });

  if (error) {
    // I log the first failing fingerprint when debugging.
    return NextResponse.json({ ok: false, error: error.message }, { status: 500 });
  }

  return NextResponse.json({ ok: true, ingested: rows.length });
}

I learned this the hard way.
If you upsert on source_url, you’ll keep duplicates forever.
Because source_url isn’t stable.

Also.
Don’t run this with the anon key.
You’ll end up with RLS errors like:
new row violates row-level security policy for table "jobs"
Use the service role key on the server.

4) I mark stale jobs inactive with one SQL update

Scraping isn’t reliable.
Sources timeout.
Rate limiting happens.
Or I deploy and my cron didn’t run.

So “not scraped today” doesn’t mean “job is gone”.
But “not seen in 14 days” usually does.

I run a daily job.
It flips is_active based on last_seen_at.
No loops.
No pagination.
One statement.

-- Run daily (Supabase scheduled function or external cron)
update public.jobs
set is_active = false
where is_active = true
  and last_seen_at < now() - interval '14 days';
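If you'd rather keep the schedule inside the database, pg_cron can run the same statement (a sketch, assuming the pg_cron extension is available and enabled in your project):

```sql
-- Assumes pg_cron is enabled; the job name is arbitrary.
select cron.schedule(
  'deactivate-stale-jobs',
  '0 6 * * *', -- daily at 06:00 UTC
  $$
    update public.jobs
    set is_active = false
    where is_active = true
      and last_seen_at < now() - interval '14 days';
  $$
);
```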

This simple rule cleaned up my listings.
And it made SEO cleaner too.
Expired jobs stop surfacing in the main pages.

One thing I won’t do again.
I initially used 3 days.
Bad call.
Weekend gaps + source outages made real jobs disappear.
Support emails were… not fun.

5) I keep search fast with pg_trgm (and a boring query)

Job boards live or die on search.
And I’m not shipping Elasticsearch for this.
Not for 8,000 rows.

pg_trgm gets me “good enough” fuzzy matching.
Fast.
Simple.

I query title + company.
I use similarity() for ranking.
Then I cap results.

// lib/search.ts
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_ANON_KEY! // safe for read-only if RLS allows
);

export async function searchJobs(q: string) {
  // Use an RPC so the query plan stays stable.
  const { data, error } = await supabase.rpc("search_jobs", {
    q,
    limit_count: 25,
  });

  if (error) throw new Error(error.message);
  return data as Array<{
    id: string;
    title: string;
    company: string;
    location: string;
    canonical_url: string;
  }>;
}

And the SQL RPC:

create or replace function public.search_jobs(q text, limit_count int)
returns table (
  id uuid,
  title text,
  company text,
  location text,
  canonical_url text
)
language sql
stable
as $$
  select
    j.id,
    j.title,
    j.company,
    j.location,
    j.canonical_url
  from public.jobs j
  where j.is_active = true
    and (
      j.title % q
      or j.company % q
    )
  order by
    greatest(similarity(j.title, q), similarity(j.company, q)) desc,
    j.last_seen_at desc
  limit limit_count;
$$;

This stayed snappy for me.
Even with 8,000+ active rows.

And yes.
I tried full-text search (tsvector) too.
Spent another 2 hours.
Relevance was worse for abbreviations like “PMHNP”.
Trigram matched what users typed.
So I kept it.
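One knob worth knowing if you take the trigram route: the % operator only matches when similarity() clears pg_trgm's threshold (default 0.3), and short queries like "PMHNP" can sit near that line. The threshold is adjustable:

```sql
-- Default is 0.3; % matches when similarity(a, b) >= this value.
show pg_trgm.similarity_threshold;

-- Loosen matching for short, abbreviation-style queries.
set pg_trgm.similarity_threshold = 0.25;

-- Inspect the raw score to pick a sensible cutoff for your data.
select similarity('pmhnp', 'psych np');
```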

Results

My database stopped accumulating obvious duplicates.
Before the fingerprint key, I had days where 200 scraped jobs produced 260 rows.
Gross.

Now a typical day is 200–230 scraped inputs producing 190–215 inserts/updates.
The rest are clean conflicts.

Search latency dropped too.
My title/company search went from ~900ms p95 (naive ilike '%q%') to ~70ms p95 with pg_trgm on the same dataset.

Most importantly.
I don’t babysit dedupe anymore.
Postgres enforces it.

Key takeaways

  • Define “same job” as a fingerprint, not a URL.
  • Canonicalize URLs before hashing. Strip utm_* and friends.
  • Put the unique constraint in Postgres. Don’t race your own scrapers.
  • Upsert on the fingerprint, then update last_seen_at every scrape.
  • Use pg_trgm for fuzzy search before adding heavier infra.

Closing

If you’re scraping jobs too: what do you use for the unique key?
Raw URL, canonical URL, or a fingerprint like this — and what field caused your worst duplicate bug?
