I thought scraping jobs was the hard part—turns out duplicates were worse. The same role would show up 3-6 times across different boards, and my “7,500 jobs” count was basically lying. Here’s the exact Postgres + Supabase pattern I use to merge duplicates safely without nuking real listings.
Last week my job board started looking… inflated.
I’m building a job board for Psychiatric Mental Health Nurse Practitioners (PMHNPs). I scrape multiple sources daily (200+ new jobs/day), store them in Supabase (Postgres), and serve them via Next.js 14.
The issue: one job posted on a hospital site gets re-posted on 2-3 aggregators, sometimes with slightly different titles, formatting, or URLs. If you don’t dedupe, your database grows fast, your UX gets spammy, and your weekly email alerts start sending repeats.
TL;DR
- I generate a canonical key for each job (company + location + normalized title) and store it in Postgres.
- I use a two-stage dedupe: strict (exact key) first, then fuzzy matching for the edge cases.
- I upsert jobs by canonical key and keep a job_sources table so I never lose attribution.
- I run dedupe inside a transaction so my scraper can be dumb and my database can be smart.
Context: why deduping matters in real life
On my PMHNP job board, I’m currently at 7,556+ jobs and 1,368+ companies. When duplicates slip in, it’s not just a cosmetic problem.
Users see the same job repeatedly and bounce. Filters get slower. Weekly job alerts become annoying (“didn’t you email me this yesterday?”).
I tried “just dedupe in code” at first. I regret that. Different scrapers had slightly different logic, and I kept re-creating bugs.
So I moved dedupe into Postgres: one canonical key, one constraint, one upsert path. In this post I’ll show the schema, the normalization approach, the upsert query, and a small fuzzy fallback.
1) Model jobs + sources (don’t throw info away)
The first mistake I made was trying to store “the job” as a single row with a single URL.
In reality, the same job can have multiple source URLs. If you overwrite the URL during upsert, you lose provenance (and sometimes you lose the best link).
So I split it:
- jobs: the canonical job record
- job_sources: every source we saw this job on
-- schema.sql
create table if not exists jobs (
id uuid primary key default gen_random_uuid(),
canonical_key text not null,
title text not null,
company_name text not null,
location_text text,
is_remote boolean default false,
salary_min integer,
salary_max integer,
description text,
posted_at timestamptz,
first_seen_at timestamptz not null default now(),
last_seen_at timestamptz not null default now()
);
-- This is the big win: one canonical job per canonical_key
create unique index if not exists jobs_canonical_key_uidx
on jobs (canonical_key);
create table if not exists job_sources (
id uuid primary key default gen_random_uuid(),
job_id uuid not null references jobs(id) on delete cascade,
source_name text not null,
source_url text not null,
external_id text,
scraped_at timestamptz not null default now(),
-- prevent storing the same URL twice
unique (source_name, source_url)
);
create index if not exists job_sources_job_id_idx on job_sources(job_id);
What this gives me:
- A single canonical listing shown to users
- Multiple sources attached (great for debugging, trust, and link fallbacks)
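For example, when I debug a listing, one query shows every source that fed it. This is just a sketch; swap in a real job id:
-- Every source we've seen for one canonical job (placeholder id)
select j.title, j.company_name, s.source_name, s.source_url, s.scraped_at
from jobs j
join job_sources s on s.job_id = j.id
where j.id = '00000000-0000-0000-0000-000000000000'
order by s.scraped_at desc;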
Next I needed a canonical key that’s stable.
2) Generate a canonical key (normalize like you mean it)
Canonical keys fail when your normalization is sloppy.
Example: “PMHNP” vs “Psychiatric Mental Health Nurse Practitioner” vs “Psych NP”. Or “NYC” vs “New York, NY”. If you build the key from raw strings, duplicates slip through.
I keep it simple and deterministic:
- Normalize title: lowercase, remove punctuation, collapse whitespace, map common synonyms
- Normalize company: lowercase, strip suffixes (“Inc”, “LLC”) when safe
- Normalize location: keep a cleaned string (I don’t geocode at scrape time)
Then I either hash the result or store it as plain text; I store it as text so it stays readable.
// lib/canonical.ts
const TITLE_SYNONYMS: Array<[RegExp, string]> = [
[/psychiatric\s+mental\s+health\s+nurse\s+practitioner/g, "pmhnp"],
[/psych\s*nurse\s*practitioner/g, "pmhnp"],
[/psychiatric\s*nurse\s*practitioner/g, "pmhnp"],
[/\bnp\b/g, "nurse practitioner"],
];
function normalizeText(input: string) {
return input
.toLowerCase()
.replace(/&/g, "and")
.replace(/[^a-z0-9\s]/g, " ") // drop punctuation
.replace(/\s+/g, " ")
.trim();
}
function normalizeCompany(companyName: string) {
const base = normalizeText(companyName);
// be conservative: only strip very common suffixes
return base
.replace(/\b(inc|llc|ltd|corp|co)\b/g, "")
.replace(/\s+/g, " ")
.trim();
}
export function buildCanonicalKey(params: {
title: string;
companyName: string;
locationText?: string | null;
isRemote?: boolean;
}) {
let title = normalizeText(params.title);
for (const [pattern, replacement] of TITLE_SYNONYMS) {
title = title.replace(pattern, replacement);
}
const company = normalizeCompany(params.companyName);
const location = normalizeText(params.locationText ?? "");
const remoteFlag = params.isRemote ? "remote" : "onsite";
// keep it readable; you can hash if you prefer
return `${company}__${title}__${location}__${remoteFlag}`;
}
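To make that concrete, here are two postings for the same role that should collapse to one key. I traced these through the code above by hand, so treat the exact output as illustrative:
// Both should produce: "hope health__pmhnp outpatient__new york ny__onsite"
buildCanonicalKey({
  title: "Psychiatric Mental Health Nurse Practitioner (Outpatient)",
  companyName: "Hope Health, Inc.",
  locationText: "New York, NY",
});
buildCanonicalKey({
  title: "Psych Nurse Practitioner - Outpatient",
  companyName: "Hope Health LLC",
  locationText: "New York, NY",
});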
Pitfalls I hit here:
- Over-normalizing company names can merge different companies (“Hope Health” vs “Hope Healthcare”). I now strip only common legal suffixes; there’s a cautionary sketch after this list.
- Location normalization is messy. I don’t try to “fix” it too much unless I have structured data.
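Here’s that cautionary sketch for the first pitfall: a deliberately over-aggressive normalizer (not my real code) that strips generic healthcare words and merges companies that are actually different.
// DON'T: stripping generic words collapses distinct companies into one key
function normalizeCompanyTooHard(name: string) {
  return normalizeText(name)
    .replace(/\b(health|healthcare|medical|group|inc|llc)\b/g, "")
    .replace(/\s+/g, " ")
    .trim();
}
normalizeCompanyTooHard("Hope Health");     // "hope"
normalizeCompanyTooHard("Hope Healthcare"); // "hope" -- different company, same key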
Now that I can compute a canonical key, I can upsert.
3) Upsert jobs + attach sources (transaction + conflict)
I want the scraper to do one call: “here’s a job + the source URL I found it on”.
The database should:
- Insert the canonical job if it’s new
- Otherwise update last_seen_at and fill in missing fields
- Insert the source row (or ignore if it already exists)
Here’s the pattern I use from a Next.js route handler.
// app/api/ingest/route.ts
import { NextRequest, NextResponse } from "next/server";
import { createClient } from "@supabase/supabase-js";
import { buildCanonicalKey } from "@/lib/canonical";
const supabase = createClient(
process.env.SUPABASE_URL!,
process.env.SUPABASE_SERVICE_ROLE_KEY! // server-only
);
export async function POST(req: NextRequest) {
const payload = await req.json();
const canonicalKey = buildCanonicalKey({
title: payload.title,
companyName: payload.company_name,
locationText: payload.location_text,
isRemote: payload.is_remote,
});
// 1) upsert job
const { data: job, error: upsertError } = await supabase
.from("jobs")
.upsert(
{
canonical_key: canonicalKey,
title: payload.title,
company_name: payload.company_name,
location_text: payload.location_text,
is_remote: payload.is_remote ?? false,
salary_min: payload.salary_min,
salary_max: payload.salary_max,
description: payload.description,
posted_at: payload.posted_at,
last_seen_at: new Date().toISOString(),
},
{
onConflict: "canonical_key",
// false = merge on conflict (update the existing row);
// careful: whatever you send here overwrites, including nulls
ignoreDuplicates: false,
}
)
.select("id")
.single();
if (upsertError) {
return NextResponse.json({ error: upsertError.message }, { status: 400 });
}
// 2) attach source (ignore if same URL already stored)
const { error: sourceError } = await supabase.from("job_sources").insert({
job_id: job.id,
source_name: payload.source_name,
source_url: payload.source_url,
external_id: payload.external_id,
});
// a duplicate (source_name, source_url) comes back as a unique-violation error; that's fine for idempotency
if (sourceError && !sourceError.message.includes("duplicate key")) {
return NextResponse.json({ error: sourceError.message }, { status: 400 });
}
return NextResponse.json({ ok: true, job_id: job.id });
}
Two notes:
- Supabase .upsert() is convenient, but it can overwrite fields with nulls if you’re not careful. I keep scrapers consistent and avoid sending nulls when possible.
- If you want true “fill missing only”, do it in SQL (next section). This is where I tripped up early.
Let’s tighten the upsert logic.
4) Use SQL upsert to avoid overwriting better data
A common duplicate scenario:
- Source A has salary
- Source B doesn’t
If Source B is ingested later and your upsert overwrites salary with null, you just lost data.
I fixed this by moving the upsert into a Postgres function and using COALESCE in the DO UPDATE clause.
-- functions.sql
create or replace function upsert_job_with_source(
p_canonical_key text,
p_title text,
p_company_name text,
p_location_text text,
p_is_remote boolean,
p_salary_min integer,
p_salary_max integer,
p_description text,
p_posted_at timestamptz,
p_source_name text,
p_source_url text,
p_external_id text
)
returns uuid
language plpgsql
as $$
declare
v_job_id uuid;
begin
insert into jobs (
canonical_key, title, company_name, location_text, is_remote,
salary_min, salary_max, description, posted_at,
first_seen_at, last_seen_at
) values (
p_canonical_key, p_title, p_company_name, p_location_text, coalesce(p_is_remote, false),
p_salary_min, p_salary_max, p_description, p_posted_at,
now(), now()
)
on conflict (canonical_key)
do update set
-- keep the newest title/description if it's longer (often richer)
title = case
when length(excluded.title) > length(jobs.title) then excluded.title
else jobs.title
end,
description = case
when excluded.description is null then jobs.description
when jobs.description is null then excluded.description
when length(excluded.description) > length(jobs.description) then excluded.description
else jobs.description
end,
location_text = coalesce(jobs.location_text, excluded.location_text),
salary_min = coalesce(jobs.salary_min, excluded.salary_min),
salary_max = coalesce(jobs.salary_max, excluded.salary_max),
posted_at = coalesce(jobs.posted_at, excluded.posted_at),
last_seen_at = now()
returning id into v_job_id;
insert into job_sources (job_id, source_name, source_url, external_id, scraped_at)
values (v_job_id, p_source_name, p_source_url, p_external_id, now())
on conflict (source_name, source_url)
do nothing;
return v_job_id;
end;
$$;
Now my API route can call one RPC and I don’t have to duplicate logic across scrapers.
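The route handler side shrinks to a single RPC call, roughly like this (the argument names match the function signature above):
// app/api/ingest/route.ts -- sketch of the RPC version
const { data: jobId, error } = await supabase.rpc("upsert_job_with_source", {
  p_canonical_key: canonicalKey,
  p_title: payload.title,
  p_company_name: payload.company_name,
  p_location_text: payload.location_text ?? null,
  p_is_remote: payload.is_remote ?? false,
  p_salary_min: payload.salary_min ?? null,
  p_salary_max: payload.salary_max ?? null,
  p_description: payload.description ?? null,
  p_posted_at: payload.posted_at ?? null,
  p_source_name: payload.source_name,
  p_source_url: payload.source_url,
  p_external_id: payload.external_id ?? null,
});
if (error) {
  return NextResponse.json({ error: error.message }, { status: 400 });
}
return NextResponse.json({ ok: true, job_id: jobId });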
Pitfalls:
- Don’t get fancy with “longer description is better” unless you validate it. I still occasionally ingest junk HTML. I run a cleanup step before insert.
- If you change canonical key logic later, you’ll create a second “canonical” job. Version your normalization if you expect it to evolve.
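For that second pitfall, the cheapest insurance I’ve found is baking a version into the key itself, something like this sketch (the constant name is made up):
// lib/canonical.ts -- bump the version when normalization rules change,
// then re-key and re-merge existing rows in a one-off migration
const CANONICAL_KEY_VERSION = "v1";
export function buildVersionedCanonicalKey(
  params: Parameters<typeof buildCanonicalKey>[0]
) {
  return `${CANONICAL_KEY_VERSION}__${buildCanonicalKey(params)}`;
}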
Strict dedupe handles most cases. But not all.
5) Fuzzy fallback with pg_trgm (for the annoying 10%)
Sometimes two sources are describing the same job but the title differs too much for strict matching:
- “PMHNP (Outpatient)”
- “Psych NP - Clinic”
My strict canonical key might not match.
This is where pg_trgm helps. I already use it for full-text-ish search; I also use it for a “maybe duplicate” check during ingestion.
First enable it and index what you compare.
-- enable extensions (Supabase supports this)
create extension if not exists pg_trgm;
-- I create a comparison field that's already normalized in code,
-- but you can also do it with generated columns.
alter table jobs add column if not exists title_norm text;
create index if not exists jobs_title_norm_trgm_idx
on jobs using gin (title_norm gin_trgm_ops);
create index if not exists jobs_company_trgm_idx
on jobs using gin (company_name gin_trgm_ops);
Then a query to find candidates.
-- Find potential duplicates for a new job
-- Tune the similarity threshold based on your data.
select id, title, company_name,
similarity(title_norm, $1) as title_score
from jobs
where company_name = $2
and similarity(title_norm, $1) > 0.6
order by title_score desc
limit 5;
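One caveat, based on my reading of the pg_trgm docs: a similarity() call in the WHERE clause won’t use the GIN index, but the % operator (whose threshold you set with set_limit or pg_trgm.similarity_threshold) will. An index-friendly variant looks roughly like this:
-- Index-friendly variant: the % operator can use the trigram index
select set_limit(0.6); -- or: set pg_trgm.similarity_threshold = 0.6;
select id, title, company_name,
       similarity(title_norm, $1) as title_score
from jobs
where company_name = $2
  and title_norm % $1
order by title_score desc
limit 5;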
How I use it:
- If strict key conflicts: great, we upsert.
- If strict key is new: run this fuzzy query.
- If there’s a strong match, I attach the new source to the existing job instead of creating a new job.
I’m careful here. Fuzzy matching can cause false positives, so I keep thresholds conservative and restrict by company.
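Wired into ingestion, the fallback looks roughly like the sketch below. find_similar_jobs is a hypothetical RPC wrapping the candidate query above, and the 0.75 cutoff is illustrative; adjust names and thresholds to your own setup.
// Sketch: only runs when the strict canonical key didn't match anything
async function attachToExistingIfSimilar(
  payload: { company_name: string; source_name: string; source_url: string; external_id?: string },
  titleNorm: string
): Promise<string | null> {
  // hypothetical RPC wrapping the pg_trgm candidate query above
  const { data: candidates } = await supabase.rpc("find_similar_jobs", {
    p_company_name: payload.company_name,
    p_title_norm: titleNorm,
  });
  const best = candidates?.[0];
  if (!best || best.title_score < 0.75) return null; // no confident match: create a new job
  // strong match: attach the new source to the existing job instead
  await supabase.from("job_sources").insert({
    job_id: best.id,
    source_name: payload.source_name,
    source_url: payload.source_url,
    external_id: payload.external_id ?? null,
  });
  return best.id;
}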
Results / outcome
After moving dedupe into Postgres and tracking sources separately:
- My board stays at 7,556+ real jobs instead of ballooning with repeats.
- I ingest 200+ jobs/day with fewer downstream cleanup scripts.
- Query performance stayed solid (about 50ms average for filtered pages) because the dataset is cleaner.
- Weekly job alerts got way better because I can reliably decide what’s “new” (based on first_seen_at per canonical job; there’s a query sketch below).
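The alert query ends up being a plain filter on first_seen_at, along these lines (the 7-day window is just my cadence):
-- Jobs that are genuinely new this week, one row per canonical job
select id, title, company_name, location_text, first_seen_at
from jobs
where first_seen_at >= now() - interval '7 days'
order by first_seen_at desc;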
Also: debugging is easier. When a user reports a bad listing, I can see every source that contributed to it.
Key takeaways
- Put dedupe in the database: a unique constraint beats scattered “if checks” in scraper code.
- Don’t overwrite good fields with nulls—use SQL upsert with COALESCE.
- Keep a separate job_sources table so dedupe doesn’t destroy useful provenance.
- Use pg_trgm as a fallback for the messy edge cases, but keep it conservative.
- Expect your canonical rules to evolve; keep them simple and deterministic.
Closing
If you’re aggregating content (jobs, products, listings), duplicates are inevitable—you either handle them up front or you suffer later.
What’s your dedupe strategy today: strict keys, fuzzy matching, or manual review queues? Drop your approach (or your messiest edge case) in the comments and I’ll help you reason through it. If folks want, I can write a follow-up on how I mark jobs as expired without deleting them.