I thought scraping jobs was the hard part—turns out duplicates were worse. The same role would show up 3-6 times across different boards, and my “7,500 jobs” count was basically lying. Here’s the exact Postgres + Supabase pattern I use to merge duplicates safely without nuking real listings.
Last week my job board started looking… inflated.
I’m building a job board for Psychiatric Mental Health Nurse Practitioners (PMHNPs). I scrape multiple sources daily (200+ new jobs/day), store them in Supabase (Postgres), and serve them via Next.js 14.
The issue: one job posted on a hospital site gets re-posted on 2-3 aggregators, sometimes with slightly different titles, formatting, or URLs. If you don’t dedupe, your database grows fast, your UX gets spammy, and your weekly email alerts start sending repeats.
TL;DR
- I generate a canonical key for each job (company + location + normalized title) and store it in Postgres.
- I use a two-stage dedupe: strict (exact key) first, then fuzzy matching for the edge cases.
- I upsert jobs by canonical key and keep a job_sources table so I never lose attribution.
- I run dedupe inside a transaction so my scraper can be dumb and my database can be smart.
Context: why deduping matters in real life
On my PMHNP job board, I’m currently at 7,556+ jobs and 1,368+ companies. When duplicates slip in, it’s not just a cosmetic problem.
Users see the same job repeatedly and bounce. Filters get slower. Weekly job alerts become annoying (“didn’t you email me this yesterday?”).
I tried “just dedupe in code” at first. I regret that. Different scrapers had slightly different logic, and I kept re-creating bugs.
So I moved dedupe into Postgres: one canonical key, one constraint, one upsert path. In this post I’ll show the schema, the normalization approach, the upsert query, and a small fuzzy fallback.
1) Model jobs + sources (don’t throw info away)
The first mistake I made was trying to store “the job” as a single row with a single URL.
In reality, the same job can have multiple source URLs. If you overwrite the URL during upsert, you lose provenance (and sometimes you lose the best link).
So I split it:
- jobs: the canonical job record
- job_sources: every source we saw this job on
-- schema.sql
create table if not exists jobs (
id uuid primary key default gen_random_uuid(),
canonical_key text not null,
title text not null,
company_name text not null,
location_text text,
is_remote boolean default false,
salary_min integer,
salary_max integer,
description text,
posted_at timestamptz,
first_seen_at timestamptz not null default now(),
last_seen_at timestamptz not null default now()
);
-- This is the big win: one canonical job per canonical_key
create unique index if not exists jobs_canonical_key_uidx
on jobs (canonical_key);
create table if not exists job_sources (
id uuid primary key default gen_random_uuid(),
job_id uuid not null references jobs(id) on delete cascade,
source_name text not null,
source_url text not null,
external_id text,
scraped_at timestamptz not null default now(),
-- prevent storing the same URL twice
unique (source_name, source_url)
);
create index if not exists job_sources_job_id_idx on job_sources(job_id);
What this gives me:
- A single canonical listing shown to users
- Multiple sources attached (great for debugging, trust, and link fallbacks)
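For example, when I debug a listing, one query shows every source that fed it. This is just a sketch; swap in a real job id:
-- Every source we've seen for one canonical job (placeholder id)
select j.title, j.company_name, s.source_name, s.source_url, s.scraped_at
from jobs j
join job_sources s on s.job_id = j.id
where j.id = '00000000-0000-0000-0000-000000000000'
order by s.scraped_at desc;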
Next I needed a canonical key that’s stable.
2) Generate a canonical key (normalize like you mean it)
Canonical keys fail when your normalization is sloppy.
Example: “PMHNP” vs “Psychiatric Mental Health Nurse Practitioner” vs “Psych NP”. Or “NYC” vs “New York, NY”. If you build the key from raw strings, duplicates slip through.
I keep it simple and deterministic:
- Normalize title: lowercase, remove punctuation, collapse whitespace, map common synonyms
- Normalize company: lowercase, strip suffixes (“Inc”, “LLC”) when safe
- Normalize location: keep a cleaned string (I don’t geocode at scrape time)
Then I either hash the result or store it as plain text; I store it as text so it stays readable.
// lib/canonical.ts
const TITLE_SYNONYMS: Array<[RegExp, string]> = [
[/psychiatric\s+mental\s+health\s+nurse\s+practitioner/g, "pmhnp"],
[/psych\s*nurse\s*practitioner/g, "pmhnp"],
[/psychiatric\s*nurse\s*practitioner/g, "pmhnp"],
[/\bnp\b/g, "nurse practitioner"],
];
function normalizeText(input: string) {
return input
.toLowerCase()
.replace(/&/g, "and")
.replace(/[^a-z0-9\s]/g, " ") // drop punctuation
.replace(/\s+/g, " ")
.trim();
}
function normalizeCompany(companyName: string) {
const base = normalizeText(companyName);
// be conservative: only strip very common suffixes
return base
.replace(/\b(inc|llc|ltd|corp|co)\b/g, "")
.replace(/\s+/g, " ")
.trim();
}
export function buildCanonicalKey(params: {
title: string;
companyName: string;
locationText?: string | null;
isRemote?: boolean;
}) {
let title = normalizeText(params.title);
for (const [pattern, replacement] of TITLE_SYNONYMS) {
title = title.replace(pattern, replacement);
}
const company = normalizeCompany(params.companyName);
const location = normalizeText(params.locationText ?? "");
const remoteFlag = params.isRemote ? "remote" : "onsite";
// keep it readable; you can hash if you prefer
return `${company}__${title}__${location}__${remoteFlag}`;
}
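To make that concrete, here are two postings for the same role that should collapse to one key. I traced these through the code above by hand, so treat the exact output as illustrative:
// Both should produce: "hope health__pmhnp outpatient__new york ny__onsite"
buildCanonicalKey({
  title: "Psychiatric Mental Health Nurse Practitioner (Outpatient)",
  companyName: "Hope Health, Inc.",
  locationText: "New York, NY",
});
buildCanonicalKey({
  title: "Psych Nurse Practitioner - Outpatient",
  companyName: "Hope Health LLC",
  locationText: "New York, NY",
});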
Pitfalls I hit here:
- Over-normalizing company names can merge different companies (“Hope Health” vs “Hope Healthcare”). I now strip only common legal suffixes; there’s a cautionary sketch after this list.
- Location normalization is messy. I don’t try to “fix” it too much unless I have structured data.
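Here’s that cautionary sketch for the first pitfall: a deliberately over-aggressive normalizer (not my real code) that strips generic healthcare words and merges companies that are actually different.
// DON'T: stripping generic words collapses distinct companies into one key
function normalizeCompanyTooHard(name: string) {
  return normalizeText(name)
    .replace(/\b(health|healthcare|medical|group|inc|llc)\b/g, "")
    .replace(/\s+/g, " ")
    .trim();
}
normalizeCompanyTooHard("Hope Health");     // "hope"
normalizeCompanyTooHard("Hope Healthcare"); // "hope" -- different company, same key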
Now that I can compute a canonical key, I can upsert.
3) Upsert jobs + attach sources (transaction + conflict)
I want the scraper to do one call: “here’s a job + the source URL I found it on”.
The database should:
- Insert the canonical job if it’s new
- Otherwise update last_seen_at and fill in missing fields
- Insert the source row (or ignore if it already exists)
Here’s the pattern I use from a Next.js route handler.
// app/api/ingest/route.ts
import { NextRequest, NextResponse } from "next/server";
import { createClient } from "@supabase/supabase-js";
import { buildCanonicalKey } from "@/lib/canonical";
const supabase = createClient(
process.env.SUPABASE_URL!,
process.env.SUPABASE_SERVICE_ROLE_KEY! // server-only
);
export async function POST(req: NextRequest) {
const payload = await req.json();
const canonicalKey = buildCanonicalKey({
title: payload.title,
companyName: payload.company_name,
locationText: payload.location_text,
isRemote: payload.is_remote,
});
// 1) upsert job
const { data: job, error: upsertError } = await supabase
.from("jobs")
.upsert(
{
canonical_key: canonicalKey,
title: payload.title,
company_name: payload.company_name,
location_text: payload.location_text,
is_remote: payload.is_remote ?? false,
salary_min: payload.salary_min,
salary_max: payload.salary_max,
description: payload.description,
posted_at: payload.posted_at,
last_seen_at: new Date().toISOString(),
},
{
onConflict: "canonical_key",
// false = merge on conflict (update the existing row);
// careful: whatever you send here overwrites, including nulls
ignoreDuplicates: false,
}
)
.select("id")
.single();
if (upsertError) {
return NextResponse.json({ error: upsertError.message }, { status: 400 });
}
// 2) attach source (ignore if same URL already stored)
const { error: sourceError } = await supabase.from("job_sources").insert({
job_id: job.id,
source_name: payload.source_name,
source_url: payload.source_url,
external_id: payload.external_id,
});
// a duplicate (source_name, source_url) comes back as a unique-violation error; that's fine for idempotency
if (sourceError && !sourceError.message.includes("duplicate key")) {
return NextResponse.json({ error: sourceError.message }, { status: 400 });
}
return NextResponse.json({ ok: true, job_id: job.id });
}
Two notes:
- Supabase .upsert() is convenient, but it can overwrite fields with nulls if you’re not careful. I keep scrapers consistent and avoid sending nulls when possible.
- If you want true “fill missing only”, do it in SQL (next section). This is where I tripped up early.
Let’s tighten the upsert logic.
4) Use SQL upsert to avoid overwriting better data
A common duplicate scenario:
- Source A has salary
- Source B doesn’t
If Source B is ingested later and your upsert overwrites salary with null, you just lost data.
I fixed this by moving the upsert into a Postgres function and using COALESCE in the DO UPDATE clause.
-- functions.sql
create or replace function upsert_job_with_source(
p_canonical_key text,
p_title text,
p_company_name text,
p_location_text text,
p_is_remote boolean,
p_salary_min integer,
p_salary_max integer,
p_description text,
p_posted_at timestamptz,
p_source_name text,
p_source_url text,
p_external_id text
)
returns uuid
language plpgsql
as $$
declare
v_job_id uuid;
begin
insert into jobs (
canonical_key, title, company_name, location_text, is_remote,
salary_min, salary_max, description, posted_at,
first_seen_at, last_seen_at
) values (
p_canonical_key, p_title, p_company_name, p_location_text, coalesce(p_is_remote, false),
p_salary_min, p_salary_max, p_description, p_posted_at,
now(), now()
)
on conflict (canonical_key)
do update set
-- keep the newest title/description if it's longer (often richer)
title = case
when length(excluded.title) > length(jobs.title) then excluded.title
else jobs.title
end,
description = case
when excluded.description is null then jobs.description
when jobs.description is null then excluded.description
when length(excluded.description) > length(jobs.description) then excluded.description
else jobs.description
end,
location_text = coalesce(jobs.location_text, excluded.location_text),
salary_min = coalesce(jobs.salary_min, excluded.salary_min),
salary_max = coalesce(jobs.salary_max, excluded.salary_max),
posted_at = coalesce(jobs.posted_at, excluded.posted_at),
last_seen_at = now()
returning id into v_job_id;
insert into job_sources (job_id, source_name, source_url, external_id, scraped_at)
values (v_job_id, p_source_name, p_source_url, p_external_id, now())
on conflict (source_name, source_url)
do nothing;
return v_job_id;
end;
$$;
Now my API route can call one RPC and I don’t have to duplicate logic across scrapers.
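The route handler side shrinks to a single RPC call, roughly like this (the argument names match the function signature above):
// app/api/ingest/route.ts -- sketch of the RPC version
const { data: jobId, error } = await supabase.rpc("upsert_job_with_source", {
  p_canonical_key: canonicalKey,
  p_title: payload.title,
  p_company_name: payload.company_name,
  p_location_text: payload.location_text ?? null,
  p_is_remote: payload.is_remote ?? false,
  p_salary_min: payload.salary_min ?? null,
  p_salary_max: payload.salary_max ?? null,
  p_description: payload.description ?? null,
  p_posted_at: payload.posted_at ?? null,
  p_source_name: payload.source_name,
  p_source_url: payload.source_url,
  p_external_id: payload.external_id ?? null,
});
if (error) {
  return NextResponse.json({ error: error.message }, { status: 400 });
}
return NextResponse.json({ ok: true, job_id: jobId });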
Pitfalls:
- Don’t get fancy with “longer description is better” unless you validate it. I still occasionally ingest junk HTML. I run a cleanup step before insert.
- If you change canonical key logic later, you’ll create a second “canonical” job. Version your normalization if you expect it to evolve.
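For that second pitfall, the cheapest insurance I’ve found is baking a version into the key itself, something like this sketch (the constant name is made up):
// lib/canonical.ts -- bump the version when normalization rules change,
// then re-key and re-merge existing rows in a one-off migration
const CANONICAL_KEY_VERSION = "v1";
export function buildVersionedCanonicalKey(
  params: Parameters<typeof buildCanonicalKey>[0]
) {
  return `${CANONICAL_KEY_VERSION}__${buildCanonicalKey(params)}`;
}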
Strict dedupe handles most cases. But not all.
5) Fuzzy fallback with pg_trgm (for the annoying 10%)
Sometimes two sources are describing the same job but the title differs too much for strict matching:
- “PMHNP (Outpatient)”
- “Psych NP - Clinic”
My strict canonical key might not match.
This is where pg_trgm helps. I already use it for full-text-ish search; I also use it for a “maybe duplicate” check during ingestion.
First enable it and index what you compare.
-- enable extensions (Supabase supports this)
create extension if not exists pg_trgm;
-- I create a comparison field that's already normalized in code,
-- but you can also do it with generated columns.
alter table jobs add column if not exists title_norm text;
create index if not exists jobs_title_norm_trgm_idx
on jobs using gin (title_norm gin_trgm_ops);
create index if not exists jobs_company_trgm_idx
on jobs using gin (company_name gin_trgm_ops);
Then a query to find candidates.
-- Find potential duplicates for a new job
-- Tune the similarity threshold based on your data.
select id, title, company_name,
similarity(title_norm, $1) as title_score
from jobs
where company_name = $2
and similarity(title_norm, $1) > 0.6
order by title_score desc
limit 5;
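One caveat, based on my reading of the pg_trgm docs: a similarity() call in the WHERE clause won’t use the GIN index, but the % operator (whose threshold you set with set_limit or pg_trgm.similarity_threshold) will. An index-friendly variant looks roughly like this:
-- Index-friendly variant: the % operator can use the trigram index
select set_limit(0.6); -- or: set pg_trgm.similarity_threshold = 0.6;
select id, title, company_name,
       similarity(title_norm, $1) as title_score
from jobs
where company_name = $2
  and title_norm % $1
order by title_score desc
limit 5;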
How I use it:
- If strict key conflicts: great, we upsert.
- If strict key is new: run this fuzzy query.
- If there’s a strong match, I attach the new source to the existing job instead of creating a new job.
I’m careful here. Fuzzy matching can cause false positives, so I keep thresholds conservative and restrict by company.
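Wired into ingestion, the fallback looks roughly like the sketch below. find_similar_jobs is a hypothetical RPC wrapping the candidate query above, and the 0.75 cutoff is illustrative; adjust names and thresholds to your own setup.
// Sketch: only runs when the strict canonical key didn't match anything
async function attachToExistingIfSimilar(
  payload: { company_name: string; source_name: string; source_url: string; external_id?: string },
  titleNorm: string
): Promise<string | null> {
  // hypothetical RPC wrapping the pg_trgm candidate query above
  const { data: candidates } = await supabase.rpc("find_similar_jobs", {
    p_company_name: payload.company_name,
    p_title_norm: titleNorm,
  });
  const best = candidates?.[0];
  if (!best || best.title_score < 0.75) return null; // no confident match: create a new job
  // strong match: attach the new source to the existing job instead
  await supabase.from("job_sources").insert({
    job_id: best.id,
    source_name: payload.source_name,
    source_url: payload.source_url,
    external_id: payload.external_id ?? null,
  });
  return best.id;
}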
Results / outcome
After moving dedupe into Postgres and tracking sources separately:
- My board stays at 7,556+ real jobs instead of ballooning with repeats.
- I ingest 200+ jobs/day with fewer downstream cleanup scripts.
- Query performance stayed solid (about 50ms average for filtered pages) because the dataset is cleaner.
- Weekly job alerts got way better because I can reliably decide what’s “new” (based on first_seen_at per canonical job; there’s a query sketch below).
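The alert query ends up being a plain filter on first_seen_at, along these lines (the 7-day window is just my cadence):
-- Jobs that are genuinely new this week, one row per canonical job
select id, title, company_name, location_text, first_seen_at
from jobs
where first_seen_at >= now() - interval '7 days'
order by first_seen_at desc;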
Also: debugging is easier. When a user reports a bad listing, I can see every source that contributed to it.
Key takeaways
- Put dedupe in the database: a unique constraint beats scattered “if checks” in scraper code.
- Don’t overwrite good fields with nulls—use SQL upsert with COALESCE.
- Keep a separate job_sources table so dedupe doesn’t destroy useful provenance.
- Use pg_trgm as a fallback for the messy edge cases, but keep it conservative.
- Expect your canonical rules to evolve; keep them simple and deterministic.
Closing
If you’re aggregating content (jobs, products, listings), duplicates are inevitable—you either handle them up front or you suffer later.
What’s your dedupe strategy today: strict keys, fuzzy matching, or manual review queues? Drop your approach (or your messiest edge case) in the comments and I’ll help you reason through it. If folks want, I can write a follow-up on how I mark jobs as expired without deleting them.