Sathish

Next.js job board: stop duplicate jobs at ingest

  • I prevent duplicate job rows with a stable dedupe_key.
  • I normalize titles + companies before hashing. Brutal but works.
  • I enforce uniqueness in Postgres, not in app code.
  • I upsert in one SQL call. No race conditions.

Context

I’m building a job board for Psychiatric Mental Health Nurse Practitioners. Next.js 14. Supabase. Postgres. Vercel.

The dataset’s not small. 8,000+ active listings. 2,000+ companies. I scrape 200+ jobs daily from multiple sources.

Duplicates were killing me.

Not obvious ones. The annoying kind. Same job. Slightly different title casing. “Psych NP” vs “PMHNP”. Or the source adds “(Remote)” one day and removes it the next. My UI looked spammy. Search got worse. And my “new jobs” email was embarrassing.

I tried deduping in TypeScript after insert. Spent 4 hours on it. Most of it was wrong. Race conditions. Partial failures. And I still had duplicates.

So I moved dedupe to the ingest boundary. And I made Postgres enforce it.

1) I pick a dedupe key. Then I never change it.

The URL looks tempting. Don't use it.

Some sources rotate tracking params. Others change path slugs. Some don’t even have stable URLs.

I needed a key that survives small text edits.

What I ended up using:

  • normalized company name
  • normalized job title
  • location bucket (remote vs city/state)
  • optional external_id if the source has one

Then I hash it. SHA-256. Store it.

```typescript
// lib/dedupe.ts
import { createHash } from "node:crypto";

export function normalizeText(input: string): string {
  return input
    .toLowerCase()
    .replace(/&/g, " and ")
    .replace(/\b(pmhnp|psych\s*np|psychiatric\s*nurse\s*practitioner)\b/g, "pmhnp")
    .replace(/\bft\b/g, "full time")
    .replace(/\bpt\b/g, "part time")
    .replace(/[^a-z0-9]+/g, " ")
    .trim()
    .replace(/\s+/g, " ");
}

export function locationBucket(locationRaw: string): string {
  const s = normalizeText(locationRaw);
  if (/(remote|telehealth|work from home)/.test(s)) return "remote";
  // Keep it coarse. City/state changes too often.
  return s.slice(0, 40); // stable-ish prefix
}

export function buildDedupeKey(params: {
  company: string;
  title: string;
  location: string;
  externalId?: string | null;
}): string {
  const company = normalizeText(params.company);
  const title = normalizeText(params.title);
  const loc = locationBucket(params.location);

  const raw = [company, title, loc, params.externalId ?? ""].join("|");
  return createHash("sha256").update(raw).digest("hex");
}
```

Yes, the normalization is opinionated.
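
To make the opinions concrete, here is a standalone sanity check (the function body is copied from lib/dedupe.ts above) showing that the title variants from earlier collapse to one string:

```typescript
// Standalone copy of normalizeText from lib/dedupe.ts, for a quick check.
function normalizeText(input: string): string {
  return input
    .toLowerCase()
    .replace(/&/g, " and ")
    .replace(/\b(pmhnp|psych\s*np|psychiatric\s*nurse\s*practitioner)\b/g, "pmhnp")
    .replace(/\bft\b/g, "full time")
    .replace(/\bpt\b/g, "part time")
    .replace(/[^a-z0-9]+/g, " ")
    .trim()
    .replace(/\s+/g, " ");
}

// Three spellings of the same role normalize identically:
console.log(normalizeText("Psych NP (Remote)"));                      // "pmhnp remote"
console.log(normalizeText("PMHNP - Remote"));                         // "pmhnp remote"
console.log(normalizeText("Psychiatric Nurse Practitioner, Remote")); // "pmhnp remote"
```

Note the ordering matters: the synonym regexes run before punctuation is stripped, which is why they all use \s* between words.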

And yes, I broke it once. I changed the normalization rules mid-week and suddenly the same job got a new hash. Duplicate flood.

Now I treat buildDedupeKey() like a schema migration. If I change it, I version it. New column. Backfill job. Slow and boring. That’s the point.
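
A cheap way to make that discipline mechanical is to bake a version number into the hash input. This is a sketch, not my actual code; the names are illustrative:

```typescript
import { createHash } from "node:crypto";

// Bump this ONLY alongside a new column (e.g. a hypothetical dedupe_key_v2)
// and a backfill job. Never change the rules under an existing version.
const DEDUPE_VERSION = 1;

function hashKey(parts: string[]): string {
  // The version prefix means a rule change produces visibly different
  // hashes instead of silently colliding with (or missing) old rows.
  const raw = [`v${DEDUPE_VERSION}`, ...parts].join("|");
  return createHash("sha256").update(raw).digest("hex");
}

console.log(hashKey(["acme health", "pmhnp", "remote"]).length); // 64 (sha-256 hex length)
```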

2) I let Postgres say “nope” with a unique index

App-side checks are lies under concurrency.

My scraper runs multiple fetchers. Sometimes the same job arrives from two sources within seconds. If I do "select then insert", I lose.
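
The failure mode is easy to reproduce in miniature. This toy sketch uses an in-memory array as the "table" and a forced yield to stand in for network latency between the check and the write:

```typescript
// Toy demonstration of the select-then-insert race.
const table: string[] = [];

async function naiveInsert(key: string): Promise<void> {
  const exists = table.includes(key); // the "SELECT"
  await Promise.resolve();            // another fetcher runs here
  if (!exists) table.push(key);       // the "INSERT"
}

async function main() {
  // Two fetchers see the same job at (nearly) the same moment.
  await Promise.all([naiveInsert("job-1"), naiveInsert("job-1")]);
  console.log(table); // [ 'job-1', 'job-1' ]  <- duplicate despite the check
}
main();
```

Both fetchers read "not there yet" before either writes, so both insert. The same interleaving happens with real Postgres queries; only a database-level constraint closes the window.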

So I added a dedupe_key column, filled in by my ingest code, and a unique index.

In Supabase, I ran this SQL.

```sql
-- dedupe_key is written by my ingest code.
-- Keep it NOT NULL so "missing" doesn't bypass uniqueness.
alter table public.jobs
  add column if not exists dedupe_key text;

update public.jobs
set dedupe_key = coalesce(dedupe_key, '')
where dedupe_key is null;

alter table public.jobs
  alter column dedupe_key set not null;

-- One row per unique job identity.
create unique index if not exists jobs_dedupe_key_uniq
  on public.jobs (dedupe_key);

-- Helps the common query too.
create index if not exists jobs_posted_at_idx
  on public.jobs (posted_at desc);
```

That update looks dumb. It is.

I had existing rows with nulls. Postgres treats nulls as distinct in unique indexes by default (Postgres 15 added NULLS NOT DISTINCT, but NOT NULL plus a backfill is simpler). So I could've had infinite null duplicates.

One caveat: if more than one row had a null key, the backfill turns them all into '', and creating the unique index will fail until you delete the extras by hand.

I learned that one the loud way.

Also: I didn’t make dedupe_key a generated column in Postgres.

I wanted the dedupe rules in code, close to the scraper. If I move it into SQL, I’ll forget why a regex exists and delete it.
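
For comparison, the in-database version I decided against would look roughly like this. It's a sketch: it assumes the pgcrypto extension for digest(), and the column name dedupe_key_db is illustrative.

```sql
-- The road not taken: dedupe rules living in the database.
create extension if not exists pgcrypto;

alter table public.jobs
  add column dedupe_key_db text generated always as (
    encode(
      digest(lower(company) || '|' || lower(title) || '|' || lower(location), 'sha256'),
      'hex'
    )
  ) stored;
```

Even this toy version can't express the PMHNP synonym regexes without custom SQL functions, which is exactly the logic I want living next to the scraper.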

3) I upsert in one call. No “check then insert”

Once the unique index exists, the ingest path becomes simple.

Compute dedupe_key in the scraper.

Then insert ... on conflict (dedupe_key) do update.

One statement. Atomic.
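
In raw SQL, the pattern looks roughly like this (column list abbreviated; your update set will vary):

```sql
insert into public.jobs (dedupe_key, title, company, location, url, source, last_seen_at)
values ($1, $2, $3, $4, $5, $6, now())
on conflict (dedupe_key) do update
set
  title        = excluded.title,
  url          = excluded.url,
  source       = excluded.source,
  last_seen_at = excluded.last_seen_at;
```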

Here’s the exact pattern I use with the Supabase JS client. It’s not magic. It just maps to PostgREST upsert.

```typescript
// app/api/ingest/route.ts (Next.js 14 App Router)
import { NextResponse } from "next/server";
import { createClient } from "@supabase/supabase-js";
import { buildDedupeKey } from "@/lib/dedupe";

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY! // server-only
);

type IngestJob = {
  title: string;
  company: string;
  location: string;
  url: string;
  source: string;
  external_id?: string | null;
  posted_at: string; // ISO
};

export async function POST(req: Request) {
  const payload = (await req.json()) as { jobs: IngestJob[] };

  const rows = payload.jobs.map((j) => ({
    title: j.title,
    company: j.company,
    location: j.location,
    url: j.url,
    source: j.source,
    external_id: j.external_id ?? null,
    posted_at: j.posted_at,
    dedupe_key: buildDedupeKey({
      company: j.company,
      title: j.title,
      location: j.location,
      externalId: j.external_id ?? null,
    }),
    last_seen_at: new Date().toISOString(),
  }));

  const { error } = await supabase
    .from("jobs")
    .upsert(rows, {
      onConflict: "dedupe_key",
      ignoreDuplicates: false,
    });

  if (error) {
    // I log the full error. Supabase gives useful messages.
    // Example I hit: "duplicate key value violates unique constraint jobs_dedupe_key_uniq"
    return NextResponse.json({ ok: false, error: error.message }, { status: 500 });
  }

  return NextResponse.json({ ok: true, count: rows.length });
}
```

One thing that bit me — upsert() will happily overwrite fields with null if you send null.

So I don’t send nullable fields unless I’m sure.

If a source doesn’t provide salary, I omit it from the payload. I don’t send salary_min: null.

Otherwise a “worse” source can erase a “better” source.
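
My guard for that is a tiny helper that strips nullish fields before a row ever reaches upsert(). A sketch; dropNullish is my name for it, not a library function:

```typescript
// Remove null/undefined fields so an upsert can't overwrite good data
// with missing data from a sparser source.
function dropNullish<T extends Record<string, unknown>>(row: T): Partial<T> {
  return Object.fromEntries(
    Object.entries(row).filter(([, v]) => v !== null && v !== undefined)
  ) as Partial<T>;
}

const row = { title: "PMHNP", salary_min: null, salary_max: undefined, source: "feed-a" };
console.log(dropNullish(row)); // only title and source remain
```

Required fields like dedupe_key are never nullish, so they always survive the filter.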

4) I track “last seen” and expire jobs with a single query

Deduping isn’t just “avoid duplicates”.

It also changes how you delete.

If I dedupe properly, I can treat a job as “alive” if I’ve seen it recently. Different sources can refresh it. Same row.

So every ingest updates last_seen_at.

Then I expire anything not seen in 21 days.

I originally did this with a Node script looping rows. It was slow. And I hit statement timeouts.

Now it’s one SQL statement.

```sql
-- Expire jobs not seen recently.
-- I run this daily.
update public.jobs
set
  is_active = false,
  expired_at = now()
where
  is_active = true
  and last_seen_at < (now() - interval '21 days');
```
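
If you'd rather not own a scheduler at all, the daily run can live inside Postgres. A sketch assuming the pg_cron extension (which Supabase exposes); the job name is arbitrary:

```sql
-- Runs at 03:00 UTC every day, entirely inside the database.
select cron.schedule(
  'expire-stale-jobs',
  '0 3 * * *',
  $$
  update public.jobs
  set is_active = false, expired_at = now()
  where is_active = true
    and last_seen_at < (now() - interval '21 days');
  $$
);
```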

This also fixed a UI bug.

I had jobs that were “gone” from sources but still in my DB forever. Search results looked stale. Users assumed the whole site was stale.

Not great.

Results

Before this change, I’d routinely see the same job 2–4 times in search. I spot-checked one company page and found 17 obvious duplicates out of 63 rows.

After adding dedupe_key + unique index + upsert, duplicates stopped at the database boundary. My ingest still processes 200+ scraped jobs daily, but the number of new rows now matches reality. On a typical day I insert ~110–160 new rows and the rest become updates (mostly last_seen_at).

My listings count stayed stable around 8,000+ active jobs instead of inflating every week.

Key takeaways

  • Don’t dedupe after insert. Enforce it at insert.
  • Pick a stable dedupe key. Then treat it like schema.
  • A unique index beats any TypeScript “if exists” check.
  • Upsert in one call. Anything else races.
  • Track last_seen_at so deletions become a simple SQL update.

Closing

If you’re scraping jobs too: what do you use as your dedupe identity — URL, external ID, or a normalized hash like this? And how do you handle the moment you need to change the rules without backfilling your entire table?
