- I dedupe job posts in Postgres, not in JS.
- I use a fingerprint table + unique indexes.
- I keep one “winner” row per job.
- I wired it into Supabase + a cron scraper.
Context
I’m building a job board for PMHNP roles. Next.js 14. Supabase. Postgres. Vercel.
The data is messy. Always.
I scrape 200+ jobs daily from multiple sources. I’m sitting on 8,000+ active listings across 2,000+ companies. Duplicates show up in dumb ways: RN PMHNP vs PMHNP (Remote), same job with two tracking URLs, same description but different whitespace, and my favorite — “apply on company site” pages that redirect twice.
I tried deduping in application code first. Brutal. I spent four hours on it, and most of the matches it flagged were wrong.
The fix was moving dedupe into Postgres. With real constraints. And a small “fingerprint” layer so my scraper can stay simple.
1) I store a fingerprint. Not a “maybe duplicate” flag.
I need a stable key. Something I can compute every run.
But I can’t trust job IDs from sources. They collide. Or disappear.
So I compute a fingerprint from the fields that usually identify a job: company + title + location + canonical apply URL. Then I normalize hard.
This lives in the DB as a generated column. That way every writer gets the same logic.
```sql
-- 001_jobs_fingerprint.sql
-- Run in Supabase SQL editor
create extension if not exists pgcrypto;

-- Normalize helper: lower + trim + collapse whitespace
create or replace function public.norm_text(input text)
returns text
language sql
immutable
as $$
  select regexp_replace(lower(trim(coalesce(input, ''))), '\s+', ' ', 'g');
$$;

alter table public.jobs
  add column if not exists apply_url_canonical text;

alter table public.jobs
  add column if not exists job_fingerprint text
  generated always as (
    encode(
      digest(
        public.norm_text(company_name) || '|' ||
        public.norm_text(title) || '|' ||
        public.norm_text(location) || '|' ||
        public.norm_text(apply_url_canonical),
        'sha256'
      ),
      'hex'
    )
  ) stored;

create index if not exists jobs_fingerprint_idx
  on public.jobs (job_fingerprint);
```
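If you want to sanity-check fingerprints on the scraper side before hitting the DB, the same logic is easy to mirror in TypeScript. This is a hypothetical helper, not part of the schema — it approximately reproduces `norm_text` and the generated column (exotic whitespace edge cases may differ slightly from Postgres's `trim`):

```typescript
import { createHash } from "node:crypto";

// Approximate mirror of public.norm_text: trim + lower + collapse whitespace
function normText(input: string | null): string {
  return (input ?? "").trim().toLowerCase().replace(/\s+/g, " ");
}

// Approximate mirror of the generated job_fingerprint column
function jobFingerprint(
  company: string | null,
  title: string | null,
  location: string | null,
  applyUrlCanonical: string | null
): string {
  const key = [company, title, location, applyUrlCanonical]
    .map(normText)
    .join("|");
  return createHash("sha256").update(key).digest("hex");
}
```

Handy for logging "this run produced N distinct fingerprints" before the batch ever reaches Postgres.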
That apply_url_canonical part matters.
If you hash raw URLs, you’re cooked. Tracking params will fork the fingerprint forever.
2) I canonicalize URLs before insert. Or I pay later.
I canonicalize in the scraper. Not in the UI. Not on read.
I made this mistake once. I was normalizing on the client. Then my DB had 14 versions of the same job.
This is the exact helper I use in my Next.js ingestion route.
```typescript
// lib/canonicalize-url.ts
export function canonicalizeUrl(raw: string | null): string | null {
  if (!raw) return null;
  try {
    const u = new URL(raw);

    // Drop common tracking params
    const drop = new Set([
      "utm_source",
      "utm_medium",
      "utm_campaign",
      "utm_term",
      "utm_content",
      "gclid",
      "fbclid",
      "mc_cid",
      "mc_eid",
    ]);

    // Delete known tracking keys
    for (const key of [...u.searchParams.keys()]) {
      if (drop.has(key)) u.searchParams.delete(key);
    }

    // Normalize hostname, drop the fragment, remove any trailing slash
    u.hostname = u.hostname.toLowerCase();
    u.hash = "";
    const s = u.toString();
    return s.endsWith("/") ? s.slice(0, -1) : s;
  } catch {
    return raw.trim() || null;
  }
}
```
Not perfect. Still worth it.
One thing that bit me — some sources wrap apply links in redirectors. If you can resolve redirects during scraping, do it. If you can’t, at least canonicalize what you have.
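Resolving redirects can be a one-liner if your runtime has `fetch` (Node 18+). This is a hypothetical helper, not from my codebase — `resolveFinalUrl` follows the chain and falls back to the input on any failure. Note some hosts reject HEAD requests, in which case a GET is the safer (heavier) option:

```typescript
// Hypothetical helper: follow redirect chains at scrape time so the
// canonical URL is the real destination, not a tracking hop.
// Assumes Node 18+ global fetch. Falls back to the input on any failure.
async function resolveFinalUrl(raw: string): Promise<string> {
  try {
    // HEAD keeps it cheap; swap to GET if a source rejects HEAD.
    const res = await fetch(raw, { method: "HEAD", redirect: "follow" });
    return res.url || raw; // res.url reflects the final post-redirect URL
  } catch {
    return raw; // invalid URL, network error, blocked HEAD, etc.
  }
}
```

Run it before `canonicalizeUrl`, not after, so tracking params on the final destination still get stripped.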
3) I enforce uniqueness with a separate table
I don’t want a unique index on the whole jobs row. Jobs change.
Descriptions update. Salary ranges appear later. “Remote” flips to “Hybrid”.
So I keep a small job_keys table that’s basically: fingerprint → winning job id.
That gives me a stable “dedupe gate” without freezing the job content.
```sql
-- 002_job_keys.sql
create table if not exists public.job_keys (
  job_fingerprint text primary key,
  job_id uuid not null references public.jobs(id) on delete cascade,
  first_seen_at timestamptz not null default now(),
  last_seen_at timestamptz not null default now()
);

create index if not exists job_keys_job_id_idx on public.job_keys(job_id);
```
Now the ingest logic becomes: insert a job row, then attempt to claim the fingerprint.
If I can’t claim it, I delete the loser row. Simple. Fast.
4) My ingest endpoint is “insert, claim, cleanup”
This runs from my scraper cron. The endpoint receives normalized fields.
And yeah — I rate limit hard. Otherwise I’ll DDoS myself with retries.
Here’s the core logic. This is real code. It works with Supabase JS v2.
```typescript
// app/api/ingest/job/route.ts
import { NextResponse } from "next/server";
import { createClient } from "@supabase/supabase-js";
import { canonicalizeUrl } from "@/lib/canonicalize-url";

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY! // server-only
);

type IngestJob = {
  company_name: string;
  title: string;
  location: string;
  apply_url: string | null;
  description: string | null;
  source: string;
  scraped_at: string; // ISO
};

export async function POST(req: Request) {
  const body = (await req.json()) as IngestJob;
  const apply_url_canonical = canonicalizeUrl(body.apply_url);

  // 1) Insert candidate job row
  const { data: job, error: insErr } = await supabase
    .from("jobs")
    .insert({
      company_name: body.company_name,
      title: body.title,
      location: body.location,
      apply_url: body.apply_url,
      apply_url_canonical,
      description: body.description,
      source: body.source,
      scraped_at: body.scraped_at,
      last_seen_at: body.scraped_at,
    })
    .select("id, job_fingerprint")
    .single();

  if (insErr) {
    return NextResponse.json({ error: insErr.message }, { status: 500 });
  }

  // 2) Try to claim fingerprint in job_keys
  const { error: keyErr } = await supabase.from("job_keys").insert({
    job_fingerprint: job.job_fingerprint,
    job_id: job.id,
  });

  if (!keyErr) {
    // Winner
    return NextResponse.json({ status: "inserted", job_id: job.id });
  }

  // Duplicate fingerprint. Keep existing winner.
  // 3) Update last_seen_at on the key row, then delete loser job row.
  await supabase
    .from("job_keys")
    .update({ last_seen_at: body.scraped_at })
    .eq("job_fingerprint", job.job_fingerprint);

  await supabase.from("jobs").delete().eq("id", job.id);

  return NextResponse.json({ status: "deduped" });
}
```
Two details.
First: I’m using the service role key. Server-only. Never in the browser.
Second: the “claim” insert relies on the primary key constraint in job_keys. That’s the whole trick.
If you’re wondering about error handling: on a duplicate, Supabase surfaces the Postgres unique-violation error (SQLSTATE 23505). I don’t parse the message. I just treat “insert failed” as “duplicate”.
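If you do want to be stricter than “insert failed means duplicate”, supabase-js exposes the Postgres error code on the error object, so you can separate real failures from expected ones. A minimal sketch — `isUniqueViolation` is a hypothetical helper, not from my route:

```typescript
// Postgres tags duplicate-key errors with SQLSTATE 23505 ("unique_violation"),
// and supabase-js surfaces it on the error's `code` field.
type PgError = { code?: string; message?: string };

function isUniqueViolation(err: PgError | null): boolean {
  return err?.code === "23505";
}
```

In the route that would look like: `if (keyErr && !isUniqueViolation(keyErr)) return NextResponse.json({ error: keyErr.message }, { status: 500 });` — everything else falls through to the dedupe branch.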
5) I still merge fields. But only into the winner.
Deleting the loser is fine. Until the loser has better data.
This happened with salary.
Source A had no salary. Source B did. My dedupe deleted the salary row. Nice.
So I added a merge step: if the fingerprint already exists, I update the winner with any missing fields.
I do it in SQL so it’s one round-trip.
```sql
-- 003_merge_into_winner.sql
-- Merge non-null fields from an incoming row into the winner row.
create or replace function public.merge_job_into_winner(
  p_fingerprint text,
  p_title text,
  p_location text,
  p_apply_url text,
  p_apply_url_canonical text,
  p_description text,
  p_source text,
  p_seen_at timestamptz
)
returns uuid
language plpgsql
as $$
declare
  v_winner uuid;
begin
  select job_id into v_winner
  from public.job_keys
  where job_fingerprint = p_fingerprint;

  if v_winner is null then
    raise exception 'winner_not_found';
  end if;

  update public.jobs j
  set
    title = coalesce(j.title, p_title),
    location = coalesce(j.location, p_location),
    apply_url = coalesce(j.apply_url, p_apply_url),
    apply_url_canonical = coalesce(j.apply_url_canonical, p_apply_url_canonical),
    description = coalesce(j.description, p_description),
    last_seen_at = greatest(coalesce(j.last_seen_at, p_seen_at), p_seen_at),
    source = j.source -- keep original source
  where j.id = v_winner;

  update public.job_keys
  set last_seen_at = greatest(last_seen_at, p_seen_at)
  where job_fingerprint = p_fingerprint;

  return v_winner;
end;
$$;
```
Then in the API route, on duplicate, I call this function instead of just deleting and moving on.
I keep the fingerprinter as the “router”. The merge function is the “data repair”.
Not fancy. Just reliable.
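On the duplicate branch, the route reaches this function through supabase-js’s standard `rpc` call. A sketch of the argument mapping — `mergeArgs` is a hypothetical helper, but the parameter names mirror `merge_job_into_winner` above and the `IngestJob` shape comes from the route:

```typescript
type IngestJob = {
  company_name: string;
  title: string;
  location: string;
  apply_url: string | null;
  description: string | null;
  source: string;
  scraped_at: string; // ISO
};

// Map an ingest payload onto merge_job_into_winner's parameters.
function mergeArgs(body: IngestJob, fingerprint: string, canonical: string | null) {
  return {
    p_fingerprint: fingerprint,
    p_title: body.title,
    p_location: body.location,
    p_apply_url: body.apply_url,
    p_apply_url_canonical: canonical,
    p_description: body.description,
    p_source: body.source,
    p_seen_at: body.scraped_at,
  };
}

// In the route, on the duplicate branch:
// await supabase.rpc("merge_job_into_winner",
//   mergeArgs(body, job.job_fingerprint, apply_url_canonical));
```

Then the loser row delete stays exactly where it was — the merge just runs first.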
Results
Before this, my “active listings” number was lying. Badly.
On a sample day with 237 scraped jobs, I inserted 237 rows and later found 61 were duplicates. That’s 61 junk rows in one run.
After the fingerprint + job_keys gate, that same day ended with 176 new unique jobs and 61 deduped. DB row count stayed stable. Search results stopped showing the same company/title combo twice.
I’m now holding 8,214 active listings from 2,043 companies, and the duplicates are mostly true near-duplicates (same job cross-posted with different locations). Those need a different rule.
Key takeaways
- Do dedupe with constraints. Not “best effort” JS.
- Hash normalized fields. Raw URLs will wreck you.
- Split “content” from “identity” using a tiny key table.
- On duplicates, merge into the winner. Don’t throw away better fields.
- Keep the ingest flow boring: insert → claim → merge/cleanup.
Closing
If you’re scraping jobs too: what do you fingerprint on?
Company + title + location + apply URL worked for me, but it still misses “same job, different city” cases.
Do you treat those as duplicates, or separate listings with shared detail pages?