- I dedupe scraped jobs with a single Postgres unique key.
- I normalize URLs + titles before hashing. Otherwise duplicates slip.
- I upsert in Supabase, then update `last_seen_at` safely.
- I keep search fast with `pg_trgm` on title + company.
## Context
I’m building a job board for Psychiatric Mental Health Nurse Practitioners.
It’s not tiny.
8,000+ active listings. 2,000+ companies.
And I scrape 200+ jobs daily from multiple sources.
The boring part hurts the most.
Deduplication.
Same job shows up with 6 URLs.
Or the URL’s the same, but tracking params differ.
Or the title changes from “PMHNP” to “Psych NP”.
I tried “just dedupe in code”.
Spent 4 hours on this. Most of it was wrong.
Race conditions. Partial failures. “Why are there 37 duplicates again?”
So I moved the dedupe boundary into Postgres.
One unique key. One upsert path. No drama.
## 1) I define “same job” as a stable fingerprint
If you don’t define this, you’re done.
You’ll keep arguing with yourself.
I started with source_url as the unique key.
Brutal.
A bunch of sources rotate URLs, add utm_*, or redirect through tracking.
Now I compute a fingerprint from normalized fields.
Source + canonicalized URL + title + company + location.
Not perfect.
But stable.
Here’s the exact normalization I ship.
```ts
// lib/dedupe.ts
import crypto from "crypto";

export function canonicalizeUrl(input: string): string {
  const u = new URL(input);

  // Drop noisy params. Add more if your sources use them.
  const drop = new Set([
    "utm_source",
    "utm_medium",
    "utm_campaign",
    "utm_term",
    "utm_content",
    "gclid",
    "fbclid",
  ]);

  // Remove fragments. They cause fake duplicates.
  u.hash = "";

  // Remove known tracking params.
  for (const key of [...u.searchParams.keys()]) {
    if (drop.has(key)) u.searchParams.delete(key);
  }

  // Sort remaining params for stability.
  const sorted = new URLSearchParams(
    [...u.searchParams.entries()].sort(([a], [b]) => a.localeCompare(b))
  );
  u.search = sorted.toString() ? `?${sorted.toString()}` : "";

  // Normalize hostname + pathname.
  u.hostname = u.hostname.toLowerCase();
  u.pathname = u.pathname.replace(/\/+$/, "");

  return u.toString();
}

export function normalizeText(s: string): string {
  return s
    .trim()
    .toLowerCase()
    .replace(/\s+/g, " ")
    .replace(/[’']/g, "'");
}

export function jobFingerprint(input: {
  source: string;
  url: string;
  title: string;
  company: string;
  location: string;
}): string {
  const stable = [
    normalizeText(input.source),
    canonicalizeUrl(input.url),
    normalizeText(input.title),
    normalizeText(input.company),
    normalizeText(input.location),
  ].join("|");

  return crypto.createHash("sha256").update(stable).digest("hex");
}
```
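A quick sanity check of the idea (inlining a condensed copy of the canonicalization so it runs standalone): two tracking-param variants of the same posting collapse to one canonical URL.

```typescript
// Condensed copy of canonicalizeUrl for a standalone check:
// strip fragment, drop tracking params, sort the rest, trim trailing slash.
function canon(input: string): string {
  const u = new URL(input);
  u.hash = "";
  const drop = new Set(["utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"]);
  for (const key of [...u.searchParams.keys()]) {
    if (drop.has(key)) u.searchParams.delete(key);
  }
  const sorted = new URLSearchParams(
    [...u.searchParams.entries()].sort(([a], [b]) => a.localeCompare(b))
  );
  u.search = sorted.toString() ? `?${sorted.toString()}` : "";
  u.hostname = u.hostname.toLowerCase();
  u.pathname = u.pathname.replace(/\/+$/, "");
  return u.toString();
}

const a = canon("https://Example.com/jobs/123/?utm_source=x&ref=feed");
const b = canon("https://example.com/jobs/123?ref=feed&utm_campaign=y#apply");
// Both collapse to "https://example.com/jobs/123?ref=feed".
```

Same fingerprint input, same hash, one row.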
One thing that bit me — location.
“Remote - US” vs “Remote (United States)”.
I normalize whitespace and case, but I don’t try to geocode it.
If I over-normalize, I merge real distinct jobs.
That’s worse.
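If variant location strings ever start splitting real duplicates, a small hand-maintained alias map is a middle ground short of geocoding. This is a sketch, not what I run; the alias entries and `normalizeLocation` name are made up:

```typescript
// Hypothetical alias map: collapse a handful of known-equivalent spellings.
// Anything unknown passes through with only whitespace/case normalization,
// so distinct locations are never merged by guesswork.
const LOCATION_ALIASES: Record<string, string> = {
  "remote - us": "remote (united states)",
  "remote, us": "remote (united states)",
  "usa (remote)": "remote (united states)",
};

function normalizeLocation(raw: string): string {
  const key = raw.trim().toLowerCase().replace(/\s+/g, " ");
  return LOCATION_ALIASES[key] ?? key;
}
```

The map only ever grows from observed duplicates, so it can't over-merge the way fuzzy matching would.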
## 2) I let Postgres enforce uniqueness (not my code)
This is the real fix.
Everything else is a band-aid.
I store the fingerprint in a jobs table.
Then I enforce uniqueness with a constraint.
Supabase makes this easy, but you still need SQL.
I ran this in a migration.
```sql
-- db/migrations/20260310_jobs_dedupe.sql
create extension if not exists pgcrypto;
create extension if not exists pg_trgm;

create table if not exists public.jobs (
  id uuid primary key default gen_random_uuid(),
  source text not null,
  source_url text not null,
  canonical_url text not null,
  fingerprint text not null,
  title text not null,
  company text not null,
  location text not null,
  posted_at timestamptz null,
  first_seen_at timestamptz not null default now(),
  last_seen_at timestamptz not null default now(),
  is_active boolean not null default true
);

-- The whole point.
create unique index if not exists jobs_fingerprint_uniq
  on public.jobs (fingerprint);

-- Helps lookup + listing pages.
create index if not exists jobs_last_seen_idx
  on public.jobs (last_seen_at desc);

-- Search indexes. (Used later.)
create index if not exists jobs_title_trgm_idx
  on public.jobs using gin (title gin_trgm_ops);

create index if not exists jobs_company_trgm_idx
  on public.jobs using gin (company gin_trgm_ops);
```
I used to do “check then insert”.
Classic mistake.
Two scrapers run at once.
Both pass the check.
Then both insert.
Unique index fixes that.
Database says no.
Every time.
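In SQL terms, the fix is collapsing check and insert into one atomic statement that the unique index arbitrates. Roughly (the values here are placeholders):

```sql
-- One atomic statement: two concurrent scrapers can both run this,
-- and the unique index on fingerprint decides. No app-side check.
insert into public.jobs
  (source, source_url, canonical_url, fingerprint, title, company, location)
values
  ('example-board', 'https://example.com/j/1?utm_source=x',
   'https://example.com/j/1', '3f2a…', 'PMHNP', 'Acme Health', 'Remote - US')
on conflict (fingerprint)
  do update set last_seen_at = now();
```

The `check then insert` window between SELECT and INSERT simply doesn't exist here.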
## 3) I upsert, then bump last_seen_at
Supabase’s .upsert() works fine.
But I still want deterministic behavior.
On conflict:
- keep `first_seen_at`
- update `last_seen_at`
- update fields that drift (title, location, canonical_url)
Also.
I don’t want to mark a job inactive just because one scrape failed.
So I only flip is_active via a separate “stale job” job.
This is the ingestion code I run from a server route.
```ts
// app/api/ingest/route.ts
import { NextResponse } from "next/server";
import { createClient } from "@supabase/supabase-js";
import { canonicalizeUrl, jobFingerprint } from "@/lib/dedupe";

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY! // server-only
);

type ScrapedJob = {
  source: string;
  sourceUrl: string;
  title: string;
  company: string;
  location: string;
  postedAt?: string;
};

export async function POST(req: Request) {
  const body = (await req.json()) as { jobs: ScrapedJob[] };

  const rows = body.jobs.map((j) => {
    const canonicalUrl = canonicalizeUrl(j.sourceUrl);
    const fingerprint = jobFingerprint({
      source: j.source,
      url: canonicalUrl,
      title: j.title,
      company: j.company,
      location: j.location,
    });

    return {
      source: j.source,
      source_url: j.sourceUrl,
      canonical_url: canonicalUrl,
      fingerprint,
      title: j.title,
      company: j.company,
      location: j.location,
      posted_at: j.postedAt ? new Date(j.postedAt).toISOString() : null,
      last_seen_at: new Date().toISOString(),
      is_active: true,
    };
  });

  // Dedupe within the batch first: Postgres rejects an upsert that hits
  // the same conflict target twice in a single statement.
  const unique = [...new Map(rows.map((r) => [r.fingerprint, r])).values()];

  // Upsert on fingerprint. The unique index enforces it.
  const { error } = await supabase
    .from("jobs")
    .upsert(unique, { onConflict: "fingerprint" });

  if (error) {
    // I log the first failing fingerprint when debugging.
    return NextResponse.json({ ok: false, error: error.message }, { status: 500 });
  }

  return NextResponse.json({ ok: true, ingested: unique.length });
}
```
I learned this the hard way.
If you upsert on source_url, you’ll keep duplicates forever.
Because source_url isn’t stable.
Also.
Don’t run this with the anon key.
You’ll end up with RLS errors like `new row violates row-level security policy for table "jobs"`.
Use the service role key on the server.
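For completeness, a minimal RLS setup that fits this split — public reads, no anon writes — might look like the sketch below. The policy name is mine; adapt it to your schema:

```sql
alter table public.jobs enable row level security;

-- Anyone can read active listings. With no insert/update policy defined,
-- the anon key cannot write. The service role bypasses RLS entirely.
create policy "jobs_public_read" on public.jobs
  for select
  using (is_active = true);
```

That way the ingest route is the only write path, and it runs server-side.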
## 4) I mark stale jobs inactive with one SQL update
Scraping isn’t reliable.
Sources timeout.
Rate limiting happens.
Or I deploy and my cron didn’t run.
So “not scraped today” doesn’t mean “job is gone”.
But “not seen in 14 days” usually does.
I run a daily job.
It flips is_active based on last_seen_at.
No loops.
No pagination.
One statement.
```sql
-- Run daily (Supabase scheduled function or external cron)
update public.jobs
set is_active = false
where is_active = true
  and last_seen_at < now() - interval '14 days';
```
This simple rule cleaned up my listings.
And it made SEO cleaner too.
Expired jobs stop surfacing in the main pages.
One thing I won’t do again.
I initially used 3 days.
Bad call.
Weekend gaps + source outages made real jobs disappear.
Support emails were… not fun.
## 5) I keep search fast with pg_trgm (and a boring query)
Job boards live or die on search.
And I’m not shipping Elasticsearch for this.
Not for 8,000 rows.
pg_trgm gets me “good enough” fuzzy matching.
Fast.
Simple.
I query title + company.
I use similarity() for ranking.
Then I cap results.
```ts
// lib/search.ts
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_ANON_KEY! // safe for read-only if RLS allows
);

export async function searchJobs(q: string) {
  // Use an RPC so the query plan stays stable.
  const { data, error } = await supabase.rpc("search_jobs", {
    q,
    limit_count: 25,
  });

  if (error) throw new Error(error.message);

  return data as Array<{
    id: string;
    title: string;
    company: string;
    location: string;
    canonical_url: string;
  }>;
}
```
And the SQL RPC:
```sql
create or replace function public.search_jobs(q text, limit_count int)
returns table (
  id uuid,
  title text,
  company text,
  location text,
  canonical_url text
)
language sql
stable
as $$
  select
    j.id,
    j.title,
    j.company,
    j.location,
    j.canonical_url
  from public.jobs j
  where j.is_active = true
    and (
      j.title % q
      or j.company % q
    )
  order by
    greatest(similarity(j.title, q), similarity(j.company, q)) desc,
    j.last_seen_at desc
  limit limit_count;
$$;
```
This stayed snappy for me.
Even with 8,000+ active rows.
And yes.
I tried full tsvector too.
Spent another 2 hours.
Relevance was worse for abbreviations like “PMHNP”.
Trigram matched what users typed.
So I kept it.
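One knob worth knowing if you go the trigram route: the `%` operator uses a session-level similarity threshold (0.3 by default), and that cutoff matters for short queries like “PMHNP”. A quick way to inspect and tune it:

```sql
select show_limit();    -- current % cutoff (default 0.3)
select set_limit(0.2);  -- loosen it for this session; pg_trgm.similarity_threshold
                        -- is the GUC equivalent on newer Postgres

-- Eyeball what the threshold lets through for a short query.
select title, similarity(title, 'PMHNP') as score
from public.jobs
where title % 'PMHNP'
order by score desc
limit 5;
```

If short abbreviations match too little, lowering the threshold is usually cheaper than switching search engines.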
## Results
My database stopped accumulating obvious duplicates.
Before the fingerprint key, I had days where 200 scraped jobs produced 260 rows.
Gross.
Now a typical day is 200–230 scraped inputs producing 190–215 inserts/updates.
The rest are clean conflicts.
Search latency dropped too.
My title/company search went from ~900ms p95 (a naive `ilike '%q%'`) to ~70ms p95 with pg_trgm on the same dataset.
Most importantly.
I don’t babysit dedupe anymore.
Postgres enforces it.
## Key takeaways
- Define “same job” as a fingerprint, not a URL.
- Canonicalize URLs before hashing. Strip `utm_*` and friends.
- Put the unique constraint in Postgres. Don’t race your own scrapers.
- Upsert on the fingerprint, then update `last_seen_at` every scrape.
- Use `pg_trgm` for fuzzy search before adding heavier infra.
## Closing
If you’re scraping jobs too: what do you use for the unique key?
Raw URL, canonical URL, or a fingerprint like this — and what field caused your worst duplicate bug?